# Assignment 1: Describing Human Trafficking

## Read this before you begin

Hello! Welcome to PS 138M. To keep up with the growing role of data in addressing social issues such as human trafficking, we are introducing a dual-track assignment system: you can choose to complete either the data-intensive assignment or the writing-intensive assignment, based on your own goals for this course. Both tracks have a basic data component, because all students in this class should reach a basic level of data proficiency. The assignments diverge on the more complex tasks. To be clear, you only need to choose **ONE** track.

### Notes
* Most of the **data** questions in this problem set will be automatically graded. Therefore, you need to answer the questions exactly as we ask them: if we ask you to compute something, write the code to compute that exact value.
    * Some of the coding questions below will tell you whether your answer is right or wrong when you run your code and then run the `ottr::check` cell just below it. (The multiple choice questions do not do this.)
    * You'll know you got a question right if it says "All tests passed!" afterwards.
    * If you see an error when you run the check cell, it means you got the question wrong and should try again. The question should show a hint.
    * If you don't see anything when you run the check cell, this means the question doesn't tell you if you got the right or wrong answer.

* **Short answer questions** will be graded by a GSI. For these questions:
    * We expect around 2-3 sentences on average. You just need to write enough to answer the question; don't stress about how short your answer is.
    * To encourage you not to write too much, your GSI will only mark the first 60 words of your answer. For many of these questions, you do not need to write all 60 words.
    
* To make sure you have completed all the required parts of this assignment, search the notebook for "to do" (`Command` + `F` on a Mac, `Ctrl` + `F` on Windows). Every part that requires action from you is labeled "to do".

# Setup
The cell below loads the [packages](https://www.geeksforgeeks.org/packages-in-r-programming/) needed for this assignment. 

In [None]:
# You *must* run this cell first. Do not change the contents of this cell.
library(testthat)
library(ottr)
library(tidyverse)

<!-- BEGIN QUESTION -->

## Question 1
**(2 points) Ask ChatGPT the following: *"How would you describe human trafficking? Discuss how you would define human trafficking, what the different forms of trafficking are, its impacts, and its prevalence."***
    
**Summarize the output in the section below. What are the strengths and weaknesses of its answer? How would you improve it? Explain how you would approach verifying the information it presents.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Question 2: Trafficking Victims

The code chunk below loads a synthetic dataset provided by the [Counter Trafficking Data Collaborative](https://www.ctdatacollaborative.org/page/about) (CTDC), which describes itself as "the first global data hub on human trafficking, publishing harmonized data from counter-trafficking organizations around the world." The CTDC aggregates individual-level data from various organizations, and it is where IOM publishes its data.

In [None]:
# RUN CODE. DO NOT CHANGE
ctdc <- read.csv("2024_CTDC_synthetic.csv")

# The line of code below replaces all NA values with 0 if the column is numeric
ctdc <- ctdc %>%  mutate(across(where(is.numeric), ~replace_na(., 0)))

head(ctdc, 5)
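
To see what the `mutate(across(...))` line in the cell above does, here is a minimal sketch on a made-up data frame. The `toy` data and its values are hypothetical and are not taken from the CTDC file; the column names are only borrowed for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the CTDC dataset): one numeric and one text column
toy <- data.frame(
  isForcedLabour = c(1, NA, 1),
  gender = c("Woman", NA, "Man")
)

# Same pattern as the setup cell: replace NA with 0, but only in numeric columns
toy_filled <- toy %>% mutate(across(where(is.numeric), ~ replace_na(., 0)))
toy_filled
```

The numeric column has its `NA` turned into `0`, while the `NA` in the text column is left alone because `where(is.numeric)` does not select it.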

#### a) (1 point) How many individuals are included in this dataset? You can use the `nrow()` function or any alternative that returns the same number. The solution should be an integer.
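
If `nrow()` is new to you, this short sketch shows it on `mtcars`, a small practice dataset that ships with R. The variable name `n_rows_example` is made up for this illustration and is not part of the assignment.

```r
# mtcars is a built-in R dataset, used here only for practice
n_rows_example <- nrow(mtcars)  # number of rows, returned as an integer
n_rows_example

# dim() returns c(rows, columns), so its first element is the same count
dim(mtcars)[1]
```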


In [None]:
# TO DO
n_cases <- NULL # YOUR CODE HERE
n_cases

In [None]:
. = ottr::check("tests/q2a.R")

#### b) (1 point) Which variables are included in the dataset? Use the `colnames()` function to store a vector of the column names in the `column_names` variable.
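
As a quick illustration of `colnames()`, the sketch below applies it to `iris`, another dataset built into R. The variable name `iris_columns` is hypothetical and used only here.

```r
# iris is a built-in R dataset, used here only for practice
iris_columns <- colnames(iris)  # character vector with one entry per column
iris_columns

# length() of that vector gives the number of columns
length(iris_columns)
```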

In [None]:
# TO DO
# Hint: use the `colnames()` function
column_names <- NULL # YOUR CODE HERE
column_names

In [None]:
. = ottr::check("tests/q2b.R")

<!-- BEGIN QUESTION -->

#### c) (5 points) What do the variables that start with "means" represent? Choose one of these variables and explain what it captures in the context of human trafficking. Which types of trafficking are included in the dataset? What do the variables that start with "recruiter" represent? In what situations might family or friends act as recruiters?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Replicating estimates
The CTDC synthetic data [dashboard](https://www.ctdatacollaborative.org/global-synthetic-data-dashboard) has generated these estimates for the prevalence of different types of forced labor. Let's see if we can replicate these numbers. 
<img src="synthetic_estimates.png">


<!-- BEGIN QUESTION -->

#### d) (3 points) The first step in replicating the estimates is making sure we are looking at the right subset of data. Explain what the code chunk below is doing. Hint: To look up what a function does, run `?function_name` in a code chunk to open its documentation.

In [None]:
# DO NOT CHANGE.  
ctdc_subset <- ctdc %>% filter((isForcedLabour >0 | isSexualExploit > 0 | isOtherExploit >0)) %>%
                select(isForcedLabour, isSexualExploit,isOtherExploit)
head(ctdc_subset, 10)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### e) Now we want to find the prevalence of each form of exploitation. Use the `summary()` function on `ctdc_subset` to report the mean of each column. (Think about why taking the mean of a column returns the prevalence of the variable.)

*Optional Challenge*: If you are already comfortable with coding, try using the `group_by` and `summarise` functions in the dplyr package to tabulate the means. 
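
The sketch below shows, on made-up data, how `summary()` and column means behave for indicator variables. It assumes the indicators are coded 0/1; the `toy_flags` data frame and its column names are hypothetical and do not come from the CTDC file.

```r
library(dplyr)

# Hypothetical indicator data (made up for illustration; not CTDC values)
toy_flags <- data.frame(
  flag_a = c(1, 0, 0, 1),
  flag_b = c(0, 0, 0, 1)
)

# summary() reports Min, quartiles, Median, Mean, and Max for each numeric column
summary(toy_flags)

# The mean of a 0/1 column is the share of rows equal to 1
toy_means <- colMeans(toy_flags)
toy_means

# Same means with dplyr, one possible route for the optional challenge
toy_means_dplyr <- toy_flags %>% summarise(across(everything(), mean))
toy_means_dplyr
```

Here `flag_a` equals 1 in two of four rows, so its mean is 0.5, and `flag_b` equals 1 in one of four rows, so its mean is 0.25.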

In [None]:
# TO DO
exploit_summary <- NULL # YOUR CODE HERE
exploit_summary

In [None]:
. = ottr::check("tests/q2e.R")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### f) What are the flaws of using the CTDC dataset to estimate the prevalence of trafficking in a given country? How should we interpret the statistics generated from this dataset?

_Type your answer here, replacing this text._

<!-- END QUESTION -->


# Submitting Your Notebook (please read carefully!)

To submit your notebook...

### 1. Click `File` $\rightarrow$ `Save Notebook`.

### 2. Wait 5 seconds.

### 3. Select the cell below and hit run.

In [None]:
ottr::export("pset1.ipynb")

After you hit "Run" on the cell above, click the download link. A .zip file should download to your computer.

(If you make changes to your notebook, you'll need to hit save and then run the cell above again before you submit to get a new version of it.)

### 4. Submit the .zip file you just downloaded <a href="https://www.gradescope.com/" target="_blank">on Gradescope here</a>.

Notes:

- **This does not seem to work on Chrome for iPad or iPhone.** If you're using an iPad or iPhone, you need to download the file using **Safari**.
- If your web browser automatically unzips the .zip file (so you see a folder instead of a .zip file), you can just upload the .ipynb file that is inside the folder.
- If this method is not working for you, try this: hit `File`, then `Download as`, then `Notebook (.ipynb)` and submit that.