code/data_analysis.Rmd

---
title: "Linked fate lit. review"
author: "Jae Yeon Kim"
output:
  html_document:
    number_sections: true
    toc: yes
  pdf_document:
    toc: yes
---

# Setup

```{r include=FALSE}

# Import packages 

if (!require("pacman")) install.packages("pacman")

pacman::p_load(
        tidyverse, # the tidyverse framework
        patchwork, # arranging ggplots 
        desc, # descriptive stat analysis
        ggpubr, # arranging ggplots 
        ggthemes, # fancy ggplot themes
        broom, # modeling
        ggfortify, # extended version of ggplot 
        ggsci, # color palette
        Hmisc, # capitalization
        stargazer, # model reports  
        conflicted, # resovling conflicts 
        here # reproducibility
        )

# Resolve conflicts

conflicted::conflict_prefer("select", "dplyr")
conflicted::conflict_prefer("filter", "dplyr")
conflicted::conflict_prefer("count", "dplyr")
conflicted::conflict_prefer("mutate", "dplyr")

devtools::install_github("jaeyk/makereproducible",
        dependencies = TRUE)

library(makereproducible)
```

```{r}

# load data 

lit_review <- read_csv(make_here("/home/jae/linked_fate_review/raw_data/linked_fate_review.csv"))

```


# Data Collection 

I used two search queries: (1) “linked fate” + “dawson” and (2) “linked fate” + “group consciousness” + “solidarity.” I culled the initial 103 results for each query, sorted by relevance, removed duplicates, and excluded non-empirical research in the data.

## The total N of articles

First, let's look at the total N of articles.

```{r echo=FALSE}

cat(("The total number of articles is"), nrow(lit_review))

```

Some articles are lit reviews, or law review articles or theoretical pieces that don't discuss data. Let's exclude them.

```{r echo=FALSE}

cat("The total number of empirical articles is", 
    length(subset(lit_review, is.na(Survey) == FALSE)$Author))

```

```{r echo=FALSE}
cat("The percentage of empirical articles in the data is", 
    length(subset(lit_review, is.na(Survey) == FALSE)$Author)/length(lit_review$Author))
```

```{r echo=FALSE}

cat("The total number of journals in the data is",
    length(unique(lit_review$Journal)))

```


## Overall patterns

The resulting sample (N = 89) is not an exhaustive list of published articles on linked fate. For instance, the search engine tends to miss recent publications because the results are sorted by relevance. This selection bias partly explains why the number of articles on linked fate appears to dip slightly in the last few years.

### Publication trend on linked fate

```{r echo=FALSE}

lit_review %>%
  group_by(Pub.year) %>%
  count() %>%
  ggplot(aes(x = Pub.year, y = cumsum(n))) +
  geom_point() +
  geom_line() +
  theme_pubr() +
  labs(title = "Publication trend on linked fate", x = "Publication year", y = "Cumulative count") + 
  scale_x_continuous(breaks = c(2000, 2005, 2010, 2015)) +
  scale_y_continuous(breaks = scales::pretty_breaks()) 

ggsave(here("output/pub_trend.png"), width = 7) 

ggsave(here("output/figure1.png"), width = 7) 

```

### Publication on linked fate by subject group

```{r echo=FALSE}

lit_review %>%
  gather(group, value, Black, Black_immigrants, Asian, Latinx, Whites, Others) %>%
  mutate(group = recode(group, 
                        "Latinx" = "Latino")) %>%
  filter(value == 1) %>%
  group_by(Pub.year, group) %>%
  dplyr::summarize(n = n()) %>%
  mutate(prop = n / sum(n),
         prop = round(prop,2)) %>%
  mutate(group = recode(group, 
                        'Black_immigrants' = "Black immigrants",
                        'Asian' = 'Asian Americans')) %>%
  filter(Pub.year >= 2002) %>%
  ggplot(aes(x = Pub.year, y = prop, fill = group)) +
    geom_col(position = "fill") +
    labs(title = "Publication trend on linked fate sorted by research subjects",
      x = "Publication year", y = "Percentage",
      fill = "Research subjects") +
    scale_x_continuous(breaks = c(2000, 2005, 2010, 2015)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_pubr() +
    scale_fill_npg()

ggsave(here("output/pub_trend_sub_groups.png"), width = 8)

```

```{r}

lit_review %>%
  mutate(total = Black + Black_immigrants + Asian + Latinx + Whites + Others) %>%
  gather(group, value, Black, Black_immigrants, Asian, Latinx, Whites, Others, total) %>%
  filter(value >= 1) %>%
  group_by(group) %>%
  count()

lit_review %>%
  mutate(total = Black + Black_immigrants + Asian + Latinx + Whites + Others) %>%
  gather(group, value, Black, Black_immigrants, Asian, Latinx, Whites, Others, total) %>%
  filter(value >= 2) %>%
  group_by(group) %>%
  count()
```

## Journal patterns 

General social science categories include journals like APSR, AJSP, JOP, and Political Research Quarterly (PQR) and American Politics Research (APR).

```{r echo=FALSE}

lit_review %>%
  group_by(Journal_type) %>%
  count() %>%
  ggplot(aes(x = reorder(Journal_type, n), y = n)) +
  geom_col() +
  coord_flip() +
  theme_pubr() +
  labs(title = "Articles by journal type",
    x = "Journal type", y = "Count")

ggsave("/home/jae/linked_fate_review/output/pub_trend_journals.png")
```

## Data patterns 

Note that these counts are not mutually exclusive. For instance, some studies use surveys plus experiments or qualitative evidence (interviews). Since methods follow trends, I did line rather than bar plotting.

1. Survey: relying on survey data 
    
2. Experiment: relying on experimental data 

3. Qualitative: relying on qualitative evidence (e.g., interviews)

```{r echo=FALSE}

lit_review %>%
  gather(data_types, values, Survey, Experiment, Qualitative) %>%
  filter(values == 1) %>%
  group_by(Pub.year, data_types) %>%
  dplyr::summarize(n = n()) %>%
  mutate(prop = n / sum(n),
         prop = round(prop,2)) %>%
  filter(Pub.year >= 2002) %>%
  ggplot(aes(x = Pub.year, y = prop, fill = data_types)) +
    geom_col(position = "fill") +
    labs(title = "Publication trend by data type",
      x = "Publication year", y = "Percentage",
      fill = "Data types") +
    scale_x_continuous(breaks = c(2000, 2005, 2010, 2015)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_pubr() +
    scale_fill_npg()

ggsave(here("output/pub_trend_data_types.png"))
```