# Analytical Assignment 1: Introduction to Human Trafficking

### Substantive Objectives
What does human trafficking look like? In class, we have gone through readings that describe and analyze the practice. In this assignment, we will supplement our existing knowledge using data. The increasing emphasis on data in society and the social sciences has created a new demand for us to think critically about how others interpret data, double-check their work, and draw conclusions from data ourselves.

This assignment uses a dataset that aggregates reported cases of trafficking around the world, highlighting the forms of trafficking, means of control, and recruiters involved. The goal is primarily for you to think critically about different measurements and interpretations of the same data, and additionally to introduce you to coding and working with large datasets.

### Coding Objectives

Learn how to use the following functions. 
1) `read.csv()`
2) `nrow()`
3) `filter()` and `select()`
4) `summarise()`
5) `mean()`
6) `prop.table()`

## Setup
The cell below loads the [packages](https://www.geeksforgeeks.org/packages-in-r-programming/) needed for this assignment. You must run the cell for the rest of the assignment to work. 

In [None]:
# You *must* run this cell first. Do not change the contents of this cell.
library(testthat)
library(ottr)
library(tidyverse) 

<!-- BEGIN QUESTION -->

-----
## Question 1: Gemini
**Ask Google Gemini the following: *"Define human trafficking. Expand on the history of human trafficking, how the modern concept of human trafficking emerged."*** 
    
**1a) (1 point) Read the output. Summarize what it said in 2-3 sentences.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**1b) (3 points) Answer the following questions in 2-3 sentences maximum for each.**

>**(i) (1 point) What would you change or add to Google Gemini's answer based on the history and definition of human trafficking presented in class and in your readings? Name at least two changes or additions. \
(ii) (1 point) What are the strengths and weaknesses of its answer? \
(iii) (1 point) What steps should you take to verify its response?**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

-----
## Question 2: Types of Data and Human Trafficking
<span style="color:#3268a4">

<!-- BEGIN QUESTION -->

**2) (4 points) In your readings thus far, you have been exposed to findings based upon journalistic accounts and in-depth interviews as well as findings based upon large datasets. What value do you think large datasets of information have? What is the value of journalistic accounts and in-depth interviews of a few individuals? In 5-6 sentences, describe the benefits of both types of information and the types of questions we can answer with each.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Question 3: Understanding the Data
<span style="color:#3268a4">

The global counter-trafficking community has recognized the importance of inter-organizational coordination in the standardization and consolidation of human trafficking data worldwide. While we are far from reaching an ideal standard, organizations such as the IOM have started taking steps towards this. The code chunk below loads a synthetic dataset provided by the [Counter Trafficking Data Collaborative](https://www.ctdatacollaborative.org/page/about), which describes itself as "the first global data hub on human trafficking, publishing harmonized data from counter-trafficking organizations around the world." The dataset aggregates individual-level data from various organizations.

Note: A synthetic dataset is information that's been generated on a computer to augment or replace real data. In this case, the use of synthetic data is to protect sensitive data on human trafficking victims. 

In [None]:
# RUN CODE. DO NOT CHANGE
ctdc <- read.csv("2024_CTDC_synthetic_clean.csv")

### Dataset Structure

**a) (1 point) How many observations are included in this dataset? Use the `nrow()` function.**
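
As a hedged illustration of what `nrow()` returns, here is a minimal sketch on a small made-up data frame. The object `toy` and its columns are hypothetical and are not part of the CTDC data.

```r
# Hypothetical toy data frame (not the CTDC data), used only to illustrate nrow().
toy <- data.frame(
  id    = 1:3,
  group = c("a", "b", "a")
)

# nrow() counts rows (observations); ncol() counts columns (variables).
n_toy <- nrow(toy)
n_toy
```

The same call pattern applies to any data frame loaded with `read.csv()`.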

In [None]:
# TO DO
n_cases <- NULL # YOUR CODE HERE
n_cases

<!-- BEGIN QUESTION -->

**b) (3 points) Looking at the dataset, we see that each row/observation represents a person, but not every single person in this dataset has been identified as a victim of trafficking. The three columns you would use to examine who might be a victim of trafficking are the `isForcedLabour`, `isSexualExploit`, and `isOtherExploit` columns. Look through the column names to see what makes sense. You can also reference the [codebook](https://www.ctdatacollaborative.org/sites/g/files/tmzbdl2011/files/2024-02/Codebook_CTDC_global_synthetic_data_v2024.pdf).**

- i. Look at the codebook for each of these variables. Identify a definition for each of these three variable types in no more than one sentence each. 
- ii. Why are these the right variables to look at when determining if a case is human trafficking?
- iii. What is the difference between the estimated prevalence of trafficking and the number of detected trafficking cases?

In [None]:
# NO ACTION NEEDED. RUN CELL.
# prints first 5 observations of the dataset
head(ctdc, 5)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Question 4: Replicating estimates
The CTDC synthetic data [dashboard](https://www.ctdatacollaborative.org/global-synthetic-data-dashboard) has generated summary statistics of the human trafficking reports in their dataset. Let's try to replicate these numbers. 

**We are interested in estimating the proportion of (1) each form of trafficking, (2) means of control, and (3) recruiters among those who are trafficking victims.**



**a) (1 point) Filtering Rows**: Not all individuals in this dataset are trafficking victims. Let's first filter our rows to only victims of trafficking. Create a new dataset that keeps only individuals who have been coded as 1 in at least one of `isForcedLabour`, `isSexualExploit`, or `isOtherExploit`. Use both BaseR and Dplyr.

Example code (BaseR): `dat[criteria1 | criteria2 | criteria3,]`


Example code (Dplyr): `dat %>% filter(criteria1 | criteria2 | criteria3)` 
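
The two approaches can disagree when the criteria contain missing values, so a small sketch may help. The data frame `toy` and the columns `flagA` and `flagB` below are hypothetical stand-ins, not CTDC variables. The sketch assumes the `dplyr` package is installed (it is loaded as part of `tidyverse` above).

```r
library(dplyr)

# Hypothetical toy data with some missing indicator values.
toy <- data.frame(
  id    = 1:5,
  flagA = c(1, 0, NA, 0, 1),
  flagB = c(0, 0, 1, NA, NA)
)

# Logical condition: TRUE | NA is TRUE, but FALSE | NA is NA.
cond <- toy$flagA == 1 | toy$flagB == 1

# Base R: an NA in the condition produces an all-NA row in the result.
base_naive <- toy[cond, ]

# Base R: wrapping the condition in which() keeps only rows where it is TRUE.
base_clean <- toy[which(cond), ]

# dplyr: filter() drops rows where the condition is NA.
dplyr_rows <- toy %>% filter(flagA == 1 | flagB == 1)

nrow(base_naive)
nrow(base_clean)
nrow(dplyr_rows)
```

If a row-count check between the two methods fails, missing values in the filtering columns are one possible cause worth ruling out.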

In [None]:
# Using BaseR
victims.df1 <- NULL # YOUR CODE HERE
# Using Dplyr
victims.df2 <- NULL # YOUR CODE HERE

# Check if they have the same number of rows. 
nrow(victims.df1) == nrow(victims.df2)

**b) Selecting Columns:** We only need a few of the dataset's columns for the next steps. Let's practice how to select columns in both BaseR and Dplyr. From the victims dataset you created in part a, please select the following three columns using both methods: `isForcedLabour`, `isSexualExploit`, `isOtherExploit`.

Example code (BaseR): `dat[,c("column1", "column2", "column3")]` 

Example code (Dplyr): `dat %>% select(column1, column2, column3)` 
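
A minimal sketch of both selection styles on made-up data follows. The data frame `toy` and its column names are hypothetical, and the sketch assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy data; the column names are illustrative only.
toy <- data.frame(
  id    = 1:3,
  flagA = c(1, 0, 1),
  flagB = c(0, 0, 1),
  note  = c("x", "y", "z")
)

# Base R: column names are passed as quoted strings.
base_cols <- toy[, c("flagA", "flagB")]

# dplyr: column names are passed unquoted.
dplyr_cols <- toy %>% select(flagA, flagB)

names(base_cols)
names(dplyr_cols)
```

One difference to keep in mind: selecting a single column with `dat[, "column1"]` returns a plain vector unless you add `drop = FALSE`, whereas `select()` always returns a data frame.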

In [None]:
# EXAMPLE CODE: SELECTING COLUMNS
# Using BaseR
forms.df1 <- NULL # YOUR CODE HERE
# Using Dplyr
forms.df2 <- NULL # YOUR CODE HERE

# Check if they have the same number of columns.
ncol(forms.df1) == ncol(forms.df2)
head(forms.df1) # display
head(forms.df2) # display

**c) Calculating Summary Statistics:** Now we want to find the proportion of each form of trafficking within our data. Use both BaseR and Dplyr to find the proportion of victims that were classified under forced labour. 

Example code (BaseR): `prop.table(table(dat$column1))`

Example code (Dplyr): `dat %>% summarise(your_col_name1 = mean(column1, na.rm = TRUE))` 
Notice that we include `na.rm = TRUE` to drop all missing values before taking the mean. 
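
To see why the two approaches give matching numbers for a 0/1 column, here is a hedged sketch on a made-up vector. The names `toy` and `flagA` are hypothetical.

```r
# Hypothetical 0/1 indicator with one missing value.
toy <- data.frame(flagA = c(1, 0, 1, NA, 1))

# table() leaves out NA by default, so the proportions are
# computed among the non-missing values only.
prop_tab <- prop.table(table(toy$flagA))

# The mean of a 0/1 column is the share of 1s; na.rm = TRUE
# drops the missing value first.
share_one <- mean(toy$flagA, na.rm = TRUE)

prop_tab
share_one
```

The entry labelled `1` in the proportion table and the mean describe the same quantity: the share of non-missing observations coded 1. Without `na.rm = TRUE`, `mean()` would return `NA` here.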


In [None]:
# use BaseR
ht.summary1 <- prop.table(table(victims.df1$isForcedLabour))

# use Dplyr
ht.summary2 <- forms.df1 %>% 
    # the mean of a 0/1 indicator is the share of victims coded 1
    summarise(isForcedLabour = mean(isForcedLabour, na.rm = TRUE)) 

# display
ht.summary1
ht.summary2

**d) Calculating Cross Tabulation** Let's find the proportion of victims classified under forced labour by gender. 

Example code (BaseR):
 (i) Make a new dataset for each gender category (i. man, ii. woman, iii. trans and gender non-conforming individuals). Then find the proportion of victims in each dataset that are classified under forced labour by taking the mean of `isForcedLabour`. 
        
         victims.man <- dat[dat$gender == "Man",]
        
         mean(victims.man$isForcedLabour)

Example code (Dplyr): `dat %>% summarise(your_col_name1 = mean(column1, na.rm = TRUE))` 
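
As a hedged sketch of how a by-group proportion can be computed in one step, the example below adds `group_by()` before `summarise()`. This is one possible approach, not necessarily the one the assignment expects. The data frame `toy` and the column `flagA` are hypothetical, and the sketch assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy data: a grouping column and a 0/1 indicator.
toy <- data.frame(
  gender = c("Man", "Man", "Woman", "Woman", "Woman"),
  flagA  = c(1, 0, 1, 1, NA)
)

# Base R: subset one group, then take the mean of the indicator.
toy_man   <- toy[which(toy$gender == "Man"), ]
share_man <- mean(toy_man$flagA, na.rm = TRUE)

# dplyr: group_by() makes summarise() return one row per group.
by_gender <- toy %>%
  group_by(gender) %>%
  summarise(share_flagA = mean(flagA, na.rm = TRUE))

share_man
by_gender
```

The grouped version avoids creating a separate data frame per category, which becomes convenient as the number of categories grows.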

In [None]:
# Data frame Man
victims.man <- victims.df1[victims.df1$gender == "Man",]
# na.rm = TRUE drops missing values so the mean is not returned as NA
mean(victims.man$isForcedLabour, na.rm = TRUE)

# Data frame Woman
victims.woman <- victims.df1[victims.df1$gender == "Woman",]
mean(victims.woman$isForcedLabour, na.rm = TRUE)

# Data frame Trans and Nonconforming Individuals
victims.trans <- victims.df1[victims.df1$gender == "Trans/Transgender/NonConforming",]
mean(victims.trans$isForcedLabour, na.rm = TRUE)

**e) Calculating Cross Tabulations Continued** Now we will find the proportion of victims by age. Use the three gender-specific datasets you created in part d to break the results down by both gender and age group, applying the `prop.table` function to the `ageGroup` column of each. 

Example code: `prop.table(table(dataframe$column1))`
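
`prop.table()` works the same way on a column with more than two categories. The sketch below uses a made-up data frame; the object `toy` and the category labels in `band` are hypothetical and do not reflect the labels used in the CTDC data.

```r
# Hypothetical categorical column with three made-up categories.
toy <- data.frame(
  band = c("young", "middle", "middle", "older")
)

# table() counts each category; prop.table() divides by the total,
# so the resulting shares sum to 1.
band_share <- prop.table(table(toy$band))

# round() makes the output easier to read.
round(band_share, 2)
```

Because each table is built from a single subset, its shares describe the distribution within that subset only.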

In [None]:
# Men by Age
table.menage <- NULL # YOUR CODE HERE

# Women by Age
table.womanage <- NULL # YOUR CODE HERE

# Trans and Non-Conforming People by Age
table.transage <- NULL # YOUR CODE HERE

In [None]:
# View Your Tables
table.menage
table.womanage
table.transage

<!-- BEGIN QUESTION -->

**f)** What is the difference between the tables above and the table below? How do they represent the same information differently? Answer in no more than 3 sentences. 

In [None]:
prop.table(table(victims.df1$gender, victims.df1$ageGroup)) |> round(2)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**g) Comparison and Checking** In 2-3 sentences, compare the results of your cross tabulations by age and gender with the conclusions of the UNODC 2022 report, pictured below. Are your findings consistent?

<img src="UNODC_Demographics.png" width="70%"/>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Question 5: Interpretations

<!-- BEGIN QUESTION -->

I asked ChatGPT to tell me the most commonly reported form of human trafficking, means of control, and typical recruiter. It responded:
1. Most commonly reported form of trafficking is sexual exploitation. It is the most prevalent form of human trafficking, where victims are forced into prostitution, pornography, and other forms of sexual exploitation.
2. Most commonly reported means of control is threats. Threats are the most commonly reported method used by traffickers to control their victims.
3. Most commonly reported recruiters are acquaintances or family members. The majority of traffickers are known to the victims, including friends, romantic partners, or even family members. This familiarity helps traffickers gain the trust of victims before exploiting them.

**5a) (3 points) Based on the data from the CTDC [dashboard](https://www.ctdatacollaborative.org/global-synthetic-data-dashboard) (which you have replicated), how would you assess ChatGPT's answer? For each answer, indicate true or false and, if false, explain why in no more than 1 sentence. Please note in your answers whether ChatGPT accurately uses "prevalence" and "commonly reported".**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**5b) (3 points) The first chart below is taken from the CTDC dashboard and should match your numbers. The second chart is taken from UNODC's 2022 report from your Week 2 readings. Please...**

> (1) Provide two possible explanations for why we see differences in their numbers.\
> (2) Provide at least one possible real-world implication of the differences in results from two well-funded and credible organizations.

<img src="dashboard1.png" width="30%"/>
<img src="UNODC_Trafficking_Prevalence.png" width="70%"/>



_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Question 6: Please share how you used AI in completing this assignment.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



In [None]:
ottr::export("pset1.ipynb")

After you hit "Run" on the cell above, click the download link. A .zip file should download to your computer.

(If you make changes to your notebook, you'll need to hit save and then run the cell above again before you submit to get a new version of it.)

### Question 7. Submit the .zip file you just downloaded <a href="https://www.gradescope.com/" target="_blank">on Gradescope here</a>.

Notes:

- **This does not seem to work on Chrome for iPad or iPhone.** If you're using an iPad or iPhone, you need to download the file using **Safari**.
- If your web browser automatically unzips the .zip file (so you see a folder instead of a .zip file), you can just upload the .ipynb file that is inside the folder.
- If this method is not working for you, try this: hit `File`, then `Download as`, then `Notebook (.ipynb)` and submit that.