# Assignment 1: Introduction to Human Trafficking

### Substantive Objectives
What does human trafficking look like? In class, we've gone through readings that describe and analyze the practice. In this assignment, we will supplement that knowledge with data. The growing emphasis on data in society and the social sciences demands that we think critically about how others interpret data, double-check their work, and draw conclusions from data ourselves.

This assignment uses a dataset that aggregates reported cases of trafficking around the world, highlighting the forms of trafficking, means of control, and recruiters involved. The primary goal is for you to think critically about different measurements and interpretations of the same data; a secondary goal is to introduce you to coding and working with large datasets.

### Coding Objectives

Learn how to use the following functions. 
1) `read.csv()`
2) `nrow()`
3) `filter()` and `select()`
4) `summarise()`
5) `mean()`

## Setup
The cell below loads the [packages](https://www.geeksforgeeks.org/packages-in-r-programming/) needed for this assignment. You must run the cell for the rest of the assignment to work. 

In [None]:
# You *must* run this cell first. Do not change the contents of this cell.
library(testthat)
library(ottr)
library(tidyverse) %>% suppressMessages()

<!-- BEGIN QUESTION -->

-----
## Question 1: ChatGPT
**Ask ChatGPT the following: *"Define human trafficking. Expand on the history of human trafficking, how the modern concept of human trafficking emerged."***
    
**a) (1 point) Copy and paste the output below.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**b) (3 points) Answer the following questions.**

>**(i) (1 point) What would you change or add based on the history and definition of human trafficking presented in class and in your readings? Name at least two changes or additions. \
(ii) (1 point) What are the strengths and weaknesses of its answer? \
(iii) (1 point) What steps should you take to verify its response?**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

-----
## Question 2: Types of Data and Human Trafficking
<span style="color:#3268a4">

<!-- BEGIN QUESTION -->

**a) (4 points) What can we learn about victims from in-depth interviews and individual cases? What can we learn about trafficking from large datasets of information? Name two things for each.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Question 3: Understanding the Data
<span style="color:#3268a4">

The global counter-trafficking community has recognized the importance of inter-organizational coordination in standardizing and consolidating human trafficking data worldwide. While we are far from an ideal standard, organizations such as the International Organization for Migration (IOM) have started taking steps toward one. The code chunk below loads a synthetic dataset provided by the [Counter Trafficking Data Collaborative](https://www.ctdatacollaborative.org/page/about) (CTDC), which describes itself as "the first global data hub on human trafficking, publishing harmonized data from counter-trafficking organizations around the world." The dataset aggregates individual-level data from various organizations.

Note: A synthetic dataset is computer-generated data that augments or replaces real data. Here, synthetic data is used to protect sensitive information about human trafficking victims.

In [None]:
# RUN CODE. DO NOT CHANGE
ctdc <- read.csv("2024_CTDC_synthetic.csv") %>% select(-X)
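
The `select(-X)` step above is easier to read once you know where a column named `X` can come from. The sketch below is an assumption about how the CSV was saved, not a statement about the CTDC file itself: it writes a small, made-up data frame with `write.csv()`, which stores row names as an unnamed first column by default, and reads it back with `read.csv()`, which names that column `X`. The names `toy`, `toy_path`, `toy_raw`, and `toy_clean` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not part of the CTDC file
toy <- data.frame(id = c("a", "b", "c"), flag = c(1, 0, NA))

# write.csv() saves row names by default as an unnamed first column
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path)

# read.csv() gives that unnamed column the name "X"
toy_raw <- read.csv(toy_path)
colnames(toy_raw)

# select(-X) drops the leftover index column, mirroring the cell above
toy_clean <- toy_raw %>% select(-X)
colnames(toy_clean)
```

Dropping the index column keeps only the variables that carry information about each observation.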

### Dataset Structure

**a) (1 point) How many observations are included in this dataset? Use the `nrow()` function. The solution should be an integer.**

*There are much larger datasets out there as well. This is a shameless plug for learning R or another coding language :). How long would that take to punch into a calculator?*
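
As a quick illustration of how `nrow()` behaves, here is a minimal sketch on `mtcars`, a small example dataset that ships with R (it is unrelated to the CTDC data). The variable names `n_cars` and `n_vars` are made up for this example.

```r
# mtcars is a built-in example dataset, used here only for illustration
n_cars <- nrow(mtcars)  # number of rows (observations)
n_vars <- ncol(mtcars)  # number of columns (variables)

n_cars
n_vars
```

`nrow()` returns an integer, which is the form the question above asks for.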


In [None]:
# TO DO
n_cases <- NULL # YOUR CODE HERE
n_cases

In [None]:
. = ottr::check("tests/q3a.R")

<!-- BEGIN QUESTION -->

**b) (2 points) Looking at the dataset, we see that each row/observation represents a person, but not every single person in this dataset has been identified as a victim of trafficking. Answer the following questions.**

> (i) Which columns would you use to determine if an observation is a victim of trafficking? Look through the column names to see what makes sense. You can also reference the [codebook.](https://www.ctdatacollaborative.org/sites/g/files/tmzbdl2011/files/2024-02/Codebook_CTDC_global_synthetic_data_v2024.pdf)\
(ii) What is the difference between estimated prevalence of trafficking and the number of detected trafficking cases?

Run the code chunk below to see the first five rows of the dataset, and the list of column names. 

In [None]:
# NO ACTION NEEDED. RUN CELL.
# prints first 5 observations of the dataset
head(ctdc, 5)

# prints full list of column names
print("The column names are abridged in the dataframe. The full list of column names is printed below: ")
colnames(ctdc)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Dataset Variables
The questions above concern the structure and interpretation of the data. The questions below concern the content of the data itself: what information is collected on victims, and why it was chosen as key information to collect. Use the [codebook](https://www.ctdatacollaborative.org/sites/g/files/tmzbdl2011/files/2024-02/Codebook_CTDC_global_synthetic_data_v2024.pdf) provided by the CTDC and what you have learned in lecture to help answer the questions below.

**c) (2 points) What do the variables that start with "means" represent, and why is this concept critical to trafficking? Pick one of the "means" and describe how it is used in the context of trafficking.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**d) (2 points) What types of trafficking are included in the dataset, and what forms of trafficking fall in the "other" category? Name one reporting trend that differs among these forms of trafficking that the readings have informed you of (e.g., X form of trafficking is under-reported in men due to stigma).**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**e) (2 points) What is a recruiter in human trafficking? How might friends, family, and intimate partners become recruiters?**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Question 4: Replicating estimates
The CTDC synthetic data [dashboard](https://www.ctdatacollaborative.org/global-synthetic-data-dashboard) has generated summary statistics of the human trafficking reports in their dataset. Let's try to replicate these numbers. 


### Forms of Trafficking
**We are interested in estimating the proportion of (1) each form of trafficking, (2) means of control, and (3) recruiters among those who are trafficking victims. To do so...**

First, let's **subset** to the relevant data to analyze the different forms of trafficking. In the code chunk below...
* The `filter()` function selects from the `ctdc` dataset the **rows** where the data is affirmative for forced labor, sexual exploitation, or another form of exploitation.
* The `select()` function selects all **columns** whose names start with "is".
* The subsetted data is stored in `forms_ht` using the assignment operator `<-`.


In [None]:
# EXAMPLE CODE #1 - RUN CELL AND UNDERSTAND EACH LINE

# Goal: Subset to affirmative trafficking cases. Store in dataframe called forms_ht
 
forms_ht <- ctdc %>% 
                # filter to the rows that are affirmative for trafficking
                filter(isForcedLabour > 0 | isSexualExploit > 0 | isOtherExploit > 0) %>%
                # select the columns that start with "is". 
                select(starts_with("is"))

# display first 6 rows
head(forms_ht)
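
To see what the `filter()` and `select()` steps do row by row, here is a minimal sketch on a made-up table. The data and the column names `isA`, `isB`, and `other` are hypothetical and are not CTDC variables. It also shows how missing values behave: a row is kept only when the condition evaluates to `TRUE`, so a row that is `NA` in one indicator and not affirmative in the other is dropped.

```r
library(dplyr)

# Hypothetical toy data; column names are made up for illustration
toy <- tibble(
  id    = 1:4,
  isA   = c(1, 0, NA, 0),
  isB   = c(0, 0, 1, NA),
  other = c("w", "x", "y", "z")
)

toy_sub <- toy %>%
  # keep rows where at least one indicator is greater than 0;
  # rows where the condition is FALSE or NA are dropped
  filter(isA > 0 | isB > 0) %>%
  # keep only the columns whose names start with "is"
  select(starts_with("is"))

toy_sub
```

Rows 1 and 3 survive the filter, and only `isA` and `isB` survive the select (`id` starts with "i" but not with "is").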

Now we use the `summarise()` function to find the proportion of each form of trafficking within our data. Notice that we include `na.rm = T`. What would happen if we didn't (rhetorical question)?

In [None]:
# EXAMPLE CODE #2  - RUN CELL AND UNDERSTAND EACH LINE
forms_summary <- forms_ht %>% 
    # find the means of each type of trafficking
    summarise( 
        isForcedLabour = mean(isForcedLabour, na.rm = T),
        isSexualExploit = mean(isSexualExploit, na.rm = T), 
        isOtherExploit = mean(isOtherExploit, na.rm = T)
    ) 

# display
forms_summary
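
Two ideas sit behind the summary above: the mean of a 0/1 indicator is the proportion of 1s, and a single missing value makes the mean missing unless it is removed. The sketch below uses a made-up vector, `flag`, and assumes the indicator columns are coded as 0/1 with `NA` for missing.

```r
# Hypothetical 0/1 indicator with one missing value
flag <- c(1, 0, 1, NA, 1)

# Without na.rm, the missing value propagates and the result is NA
mean_with_na <- mean(flag)

# With na.rm = TRUE, the NA is dropped first: 3 of the 4 remaining values are 1
mean_without_na <- mean(flag, na.rm = TRUE)

mean_with_na
mean_without_na
```

Note that with `na.rm = TRUE` the denominator is the number of non-missing values, not the number of rows, which matters when interpreting the proportions.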

### Means of Control

**a) (2 points) Your task is to do the same thing, but with the means of control. `filter()` to the subset of data that is affirmative for human trafficking (as seen above), then `select()` all the variables that start with "means". Make sure to comment what each line of code is doing.**

Hint: Take the code from the first example chunk above. You can select the relevant columns using `%>% select(starts_with("means"))`

In [None]:
# YOUR ANSWER HERE
# Modify the code from the first example code chunk above
means_ht <- NULL # YOUR CODE HERE
head(means_ht)

In [None]:
. = ottr::check("tests/q4a.R")

**b) (1 point) Now we want to find the prevalence of each means. Use the `summarise()` function on `means_ht` to return the mean (i.e., average) of each relevant column, including the sum column.**

Hint: Take the code from the second example code chunk above, and modify it. \
Preserve the order of the columns. As you copy-paste each column name, think about what it entails for the victims. 

In [None]:
# YOUR SOLUTION HERE
# Modify the code from the second example code chunk above
# You should have 9 lines of code:
#       - The averages for the 8 different means of control
#       - The average total means of control for a reported victim

means_summary <- NULL # YOUR CODE HERE

# Display
means_summary

In [None]:
. = ottr::check("tests/q4b.R")

### Recruiters

**c) (2 points) Now we want to do the same thing, but with the recruiter. Filter to the subset of data that is affirmative for human trafficking, then select all the variables that start with "recruiter". Then, use the `summarise()` and `mean()` functions on `recruiter_ht` to return the mean (i.e., average) of each column.**

*Note:* This repeats the process above. As a challenge, try to write the code out yourself instead of copying and pasting.

*Select the relevant columns using `%>% select(starts_with("recruiter"))`*

In [None]:
# YOUR SOLUTION HERE
# PART 1: SUBSET
# Create subset of affirmative cases and relevant recruiter variables
recruiter_ht <- NULL # YOUR CODE HERE
# display
head(recruiter_ht)

# PART 2: SUMMARIZE
# Summarize: Make sure to retain the column order!
recruiter_summary <- NULL # YOUR CODE HERE

# display
recruiter_summary

In [None]:
. = ottr::check("tests/q4c.R")

<!-- BEGIN QUESTION -->

**Bonus (no points): Here's a shortcut! You actually didn't need to type out the individual columns, but it's just good practice to do so when you are just starting out. Repetition is the key to ingraining!**

**Try using the `summarise_all()` function to find the proportion of the sample with each means of control using the `means_ht` dataframe, and with each type of recruiter using the `recruiter_ht` dataframe, both of which you created in the questions above.**

Example code using the `forms_ht` dataframe:

In [None]:
# forms of trafficking
forms_ht %>% summarise_all(mean, na.rm = T)
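
In recent versions of dplyr, `summarise_all()` is marked as superseded in favor of `across()`. The sketch below shows the two spellings side by side on a made-up table so you can recognize both; `toy_flags`, `old_style`, and `new_style` are hypothetical names, and either form should give the same numbers.

```r
library(dplyr)

# Hypothetical 0/1 indicators with missing values
toy_flags <- tibble(isA = c(1, 0, NA, 1), isB = c(0, 0, 1, NA))

# Scoped verb, as used in this assignment
old_style <- toy_flags %>% summarise_all(mean, na.rm = TRUE)

# Equivalent spelling with across()
new_style <- toy_flags %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

old_style
new_style
```

The bonus question still asks for `summarise_all()`, so treat the `across()` version as background only.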

In [None]:
# YOUR SOLUTION HERE
# Part 1: means of control (use means_ht)
means_summary_shortcut <- NULL # YOUR CODE HERE

# Part 2: recruiters (use recruiter_ht)
recruiter_summary_shortcut <- NULL # YOUR CODE HERE

In [None]:
. = ottr::check("tests/4bonus.R")

<!-- END QUESTION -->

## Question 5: Interpretations

<!-- BEGIN QUESTION -->

I asked ChatGPT to tell me the most commonly reported form of human trafficking, means of control, and type of recruiter. It responded...
    
> **Most Commonly Reported Form of Human Trafficking** \
    > Sexual Exploitation: This is the most prevalent form of human trafficking, where victims are forced into prostitution, pornography, and other forms of sexual exploitation.\
> **Most Commonly Reported Means of Control** \
    > Threats and Violence: Physical violence and threats are the most commonly reported methods used by traffickers to control their victims.\
> **Most Commonly Reported Type of Recruiter** \
    > Acquaintances or Family Members: The majority of traffickers are known to the victims, including friends, romantic partners, or even family members. This familiarity helps traffickers gain the trust of victims before exploiting them.

**a) Based on the data from CTDC, how would you assess ChatGPT's answer?**

*Hint: Do the numbers match your numbers? Does ChatGPT accurately use "prevalence" and "commonly reported"?*

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**b) (3 points) The first chart below is taken from the CTDC dashboard and should match your numbers. The second chart is taken from UNODC's 2022 report from your Week 2 readings. Please...**

> (1) Provide two possible explanations for why we see differences in their numbers.\
> (2) Provide at least one possible real-world implication of the differences in results from two well-funded and credible organizations.

<img src="dashboard1.png" width="30%"/>
<img src="UNODC_Trafficking_Prevalence.png" width="70%"/>



_Type your answer here, replacing this text._

<!-- END QUESTION -->


# Submitting Your Notebook (please read carefully!)

Congrats, you're done! Hopefully, through this pset you have begun to...
1. Understand how to work with ChatGPT while being aware of its limitations (and realize that through this course, you will be able to improve its answers).
2. Think critically about numbers and their interpretation. This will allow you to see through imprecise language and citations of statistics in general.
3. Work with datasets in R! This is an incredibly useful skill, and we will keep building your proficiency.
4. Understand the transition from slavery to human trafficking and how the practice and concept of human trafficking emerged.
5. Understand the forms, means of control, and recruiters in human trafficking on a global scale.

To submit your notebook...

### 1. Click `File` $\rightarrow$ `Save Notebook`.

### 2. Wait 5 seconds.

### 3. Select the cell below and hit run.

In [None]:
ottr::export("pset1.ipynb")

After you hit "Run" on the cell above, click the download link. A .zip file should download to your computer.

(If you make changes to your notebook, you'll need to hit save and then run the cell above again before you submit to get a new version of it.)

### 4. Submit the .zip file you just downloaded <a href="https://www.gradescope.com/" target="_blank">on Gradescope here</a>.

Notes:

- **This does not seem to work on Chrome for iPad or iPhone.** If you're using an iPad or iPhone, you need to download the file using **Safari**.
- If your web browser automatically unzips the .zip file (so you see a folder instead of a .zip file), you can just upload the .ipynb file that is inside the folder.
- If this method is not working for you, try this: hit `File`, then `Download as`, then `Notebook (.ipynb)` and submit that.