# Setup
Load Packages

In [None]:
# You *must* run this cell first. Do not change the contents of this cell.
library(testthat)
library(ottr)
library(tidyverse)

Load data

In [None]:
# read in data
df_gsi <- read.csv("gsi-scores-2023.csv")

Look at the head of the dataframe. 

In [None]:
# look at first 3 rows of the data
head(df_gsi, 3)

---
## What's our stance on the Global Slavery Index?
* Group A: Advocates
* Group B: Skeptics

Optional: if you would like some inspiration, here are some sources you can skim.

**Both Groups**
* [Walk Free Methodology](https://www.walkfree.org/global-slavery-index/methodology/methodology-content/#prevalence)
* [Excel file](2023-GSI-Data-Full.xlsx): Shows the breakdown of the vulnerability scores and where the underlying data come from.


**Group A**
* [The Politics of Global Data Reporting: A Triangulated Solution for Estimating Modern Slavery](https://www.tandfonline.com/doi/full/10.1080/09332480.2017.1383112#d1e93)
* [Media career of the GSI](https://www.emerald.com/insight/content/doi/10.1108/S0733-558X20210000074030/full/html)
* [Review of the 2018 Launch](https://blogs.nottingham.ac.uk/newsroom/2018/07/19/2018-global-slavery-index-launches-at-the-united-nations-headquarters/)


**Group B**
* [Gallagher](https://www.antitraffickingreview.org/index.php/atrjournal/article/view/228): Critiques the 2014 version, which has slightly different methods than the most current one
* [McGrath and Watson](https://www.sciencedirect.com/science/article/pii/S0016718518301222#b0115): Start at section 5

---
With your group, discuss...

**For Group A: How can this be useful?**
1. Make a case for Walk Free's choice to use nationally representative surveys to estimate the prevalence of forced labor. Note the limitations of these surveys, but focus on what we can still learn from them.
2. What are the sources of the data on the vulnerability measures? Do you trust them? What is useful about having these vulnerability measures?
3. Is there a political/more practical purpose of the GSI that is helpful?

**For Group B: What should we be cautious about?**
1. What is the definition of trafficking being used in the surveys? Is this consistent with the definitions we have learned?
2. Why can't we trust the regional and global estimates? What is the problem with using data from 68 countries to estimate prevalence in other countries?
3. What are the cons of the way that vulnerability is being measured?

   


---
## Tutorial of Data
## Explore Data

### Country level data
<img src="surveyed-countries.png" width="100%"/>

**Comment 1:** 
The dataset includes both countries that were surveyed and countries that were not.

In [None]:
# sort the table by prevalence per 1000
df_gsi %>% 
    # select some columns
    select(country, prev_per_1000, prev_total, surveyed) %>%
    # sort by the prevalence per 1000
    arrange(desc(prev_per_1000))

#### We can use these sorted tables to identify the countries with high degrees of prevalence. 

**Note:**
If we wanted to figure out which country has the highest prevalence, what can and can't we learn from this dataset?

In [None]:
# look up a single country and check whether its estimate comes from direct survey data
df_gsi %>% filter(country == "North Korea") 
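
One way to approach the note above is to check how many of the highest-prevalence rows come from direct surveys. The following is a minimal sketch, assuming `surveyed` is stored as a logical (`TRUE`/`FALSE`) column. `toy_gsi` and all of its values are made up for illustration and are not GSI figures.

```r
library(dplyr)

# Hypothetical toy data with the same column names as df_gsi.
# The values are invented for illustration; they are NOT real GSI estimates.
toy_gsi <- tibble(
  country       = c("A", "B", "C", "D", "E", "F"),
  prev_per_1000 = c(50, 20, 15, 12, 8, 3),
  surveyed      = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Keep the 3 rows with the highest prevalence per 1000.
top_toy <- toy_gsi %>%
  slice_max(prev_per_1000, n = 3)

# Count how many of those top rows were surveyed versus inferred.
top_toy %>% count(surveyed)
```

Swapping `toy_gsi` for `df_gsi` (and choosing a larger `n`) would show whether the top of the real ranking is dominated by surveyed or non-surveyed countries.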

**Discuss:** If you were to report on the countries with the highest prevalence of human trafficking, given this data, what would you say?

_Your answer here_

---
### Regional Level Data
Regional estimates also combine data from surveyed and non-surveyed countries.

Let's look at regional estimates of prevalence. Group by `region` and find the average and median prevalence per 1000, and the sum of the total prevalence, for each region. Can we trust these estimates?

In [None]:
df_gsi %>% 
    # telling R that categories are stored in the column region
    group_by(region) %>% 
    summarise(avg_prev_1000 = mean(prev_per_1000, na.rm = T), # average
              median_prev_1000 = median(prev_per_1000, na.rm = T), # median
              sum_total_prev = sum(prev_total, na.rm = T)) %>% # sum

    # sorting in descending order
    arrange(desc(avg_prev_1000))

**Note: How much of this data is real versus statistically inferred?** 
1. Group by region
2. Find the proportion of countries that were surveyed in each region
   
Discuss: Which regions would you be more or less comfortable taking estimates from?

In [None]:
df_gsi %>% group_by(region) %>%
    summarise(proportion_surveyed = mean(surveyed))
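
A proportion on its own can hide how few countries sit behind it, so it may help to show the counts next to it. This is a minimal sketch, assuming `surveyed` is a logical column. The `toy_gsi` data frame and its region names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; region names and values are invented for illustration.
toy_gsi <- tibble(
  region   = c("North", "North", "North", "South", "South"),
  surveyed = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

region_coverage <- toy_gsi %>%
  group_by(region) %>%
  summarise(
    n_countries         = n(),            # number of countries in the region
    n_surveyed          = sum(surveyed),  # TRUE counts as 1, FALSE as 0
    proportion_surveyed = mean(surveyed)  # share of countries surveyed
  )

region_coverage
```

A region where `n_surveyed` is zero or very small rests almost entirely on inferred values, whatever its average prevalence looks like.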

**Discuss:** What claims would you make about regional prevalence? Could you still use this data to identify regional patterns? How would you go about doing so?

_Your answer here_

---
## Global Prevalence
**What is the global prevalence of human trafficking? This is the number we see everywhere!** 

        df %>% summarise(total_prev = ...(..., na.rm = T))

In [None]:
df_gsi %>% summarise(total_prev = sum(prev_total, na.rm = T))
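
To see how much of a global total rests on inferred values, the same sum can be split by survey status. This is a sketch under the assumption that `surveyed` is a logical column. `toy_gsi` and its numbers are invented and do not reflect the real GSI totals.

```r
library(dplyr)

# Hypothetical toy data; the totals are invented for illustration only.
toy_gsi <- tibble(
  country    = c("A", "B", "C", "D"),
  prev_total = c(100, 300, 400, 200),
  surveyed   = c(TRUE, FALSE, FALSE, TRUE)
)

global_split <- toy_gsi %>%
  group_by(surveyed) %>%
  # total prevalence within surveyed and non-surveyed countries
  summarise(total_prev = sum(prev_total, na.rm = TRUE)) %>%
  # each group's share of the overall total
  mutate(share_of_global = total_prev / sum(total_prev))

global_split
```

Running the same pipeline on `df_gsi` would show what share of the headline number comes from countries that were never surveyed.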

**Discuss:** This estimate is based on inferences from 68 surveyed countries to 160 countries. What are your reactions? How can this estimate be helpful and/or harmful? Why do people keep citing it?

_Your answer here_

---
## Exploring Vulnerability
Let's look at the vulnerability measures. The main score is stored in the column `vulnerability`. The five dimension scores are stored in `governance_issues`, `basic_needs`, `inequality`, `disenfranchised_groups`, and `effect_conflict`. The higher the score, the more vulnerable the country.

**Comment:** We could sort by vulnerability to see which countries we might want to target.

These scores do not come from the surveys, but from various other datasets. We could sort the data by the vulnerability score to see which countries have the most room for improvement (higher scores mean more vulnerable).

In [None]:
df_gsi %>%  arrange(desc(vulnerability)) %>% head()

**Comment:** We could also look at vulnerability across regions. 

In [None]:
df_gsi %>%
    group_by(region) %>%
    # one row per region
    summarise(
        # find the average vulnerability score
        avg_vulnerability = mean(vulnerability, na.rm = T),
        # find the average prevalence score
        avg_prev_1000 = mean(prev_per_1000, na.rm = T)
    ) %>%
    # sort dataframe in descending order
    arrange(desc(avg_vulnerability))

---
### Exploring Relationships
**Comment:** Use `lm()` to explore relationships between prevalence per 1000 and the vulnerability dimensions. 

        lm(dv ~ iv, data = df_gsi)

In [None]:
# looks at the relationship between vulnerability dimensions and prevalence
lm(prev_per_1000 ~ basic_needs + 
                   inequality + 
                   disenfranchised_groups + 
                   effect_conflict +
                   governance_issues,
   data = df_gsi) %>% summary()

_Do you trust these relationships? Why or why not? If we looked only at the countries that were actually surveyed, would you trust the estimates then?_

_Your answer here_
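
As a follow-up to the last question, one possible check is to refit the model on surveyed countries only and compare the coefficients. This is a rough sketch, not a definitive analysis: it assumes `surveyed` is a logical column, uses a single predictor for brevity, and runs on simulated data (`toy_gsi`) rather than the real GSI file.

```r
library(dplyr)

# Simulated toy data, invented for illustration; NOT real GSI values.
set.seed(1)
toy_gsi <- tibble(
  vulnerability = runif(40, 0, 100),
  prev_per_1000 = 2 + 0.1 * vulnerability + rnorm(40),
  surveyed      = rep(c(TRUE, FALSE), each = 20)
)

# Model fitted on every row, surveyed or not.
fit_all <- lm(prev_per_1000 ~ vulnerability, data = toy_gsi)

# Same model, restricted to rows flagged as surveyed.
fit_surveyed <- lm(prev_per_1000 ~ vulnerability,
                   data = filter(toy_gsi, surveyed))

# Put the two sets of coefficients side by side.
rbind(all = coef(fit_all), surveyed_only = coef(fit_surveyed))
```

With `df_gsi`, the same comparison could use the five dimension scores as predictors. If prevalence in non-surveyed countries was itself modelled from vulnerability-type inputs, a strong relationship in the full data may partly reflect that modelling rather than an independent finding.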