# The CTDC Dataset and Basic Dataframe Manipulations

In [None]:
# RUN THIS CELL
# Load packages
library(testthat)
suppressMessages(library(tidyverse))

Functions we will be learning
#### Dataframe basics
- `read.csv(filename)`: reads in a dataframe
- `head()`: displays first 6 rows
- `ncol()`, `nrow()`: return the number of columns and rows of a dataframe, respectively
- `%>%`: pipe operator, passes on the previous output into the next function

#### How do you subset?
1. Use `filter()` to subset to the rows you want
2. Use `select()` to subset to the columns you want

#### How do you find the summary statistics of a column?
1. Use `summarise()` to find summary statistics like `mean()` of a column

---
## Dataframe basics

### Reading in data
The function for reading in a dataset is `read.csv(filename)`.
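
As a minimal sketch of what `read.csv()` does, the block below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the names `toy_path` and `toy` are invented for illustration and are not part of the CTDC file.

```r
# Write a tiny made-up CSV to a temporary file (illustration only)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("gender,meansThreats",
             "Woman,1",
             "Man,0",
             "Woman,NA"),
           toy_path)

# read.csv() returns a dataframe; the first line of the file becomes the column names
toy <- read.csv(toy_path)

class(toy)      # "data.frame"
colnames(toy)   # "gender" "meansThreats"
```

Note that the text `NA` in the file is read as a missing value, which matters later when we take averages.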

In [None]:
# read in the dataframe
ctdc <- read.csv("2024_CTDC_synthetic.csv")

#### Previewing Datasets with the `head()` function
You can preview the first few rows of the dataframe by using the function `head(df_name)`. You can also specify the number of rows you want to look at using `head(df_name, N)`.
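
A small sketch of both forms, using a made-up ten-row dataframe called `toy` (an assumption for illustration, not the CTDC data):

```r
# A made-up dataframe with 10 rows (illustration only)
toy <- data.frame(id = 1:10, value = (1:10) * 2)

first_six   <- head(toy)      # default: first 6 rows
first_three <- head(toy, 3)   # first 3 rows only

nrow(first_six)     # 6
nrow(first_three)   # 3
```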

In [None]:
# preview first 6 rows
head(ctdc)

#### Get dimensions of a dataset

Use `ncol(df_name)` to get the number of columns
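
For example, on a made-up dataframe (the name `toy` and its contents are invented for illustration):

```r
# Made-up dataframe with 4 rows and 3 columns (illustration only)
toy <- data.frame(
  id           = 1:4,
  gender       = c("Woman", "Man", "Woman", "Man"),
  meansThreats = c(1, 0, 1, NA)
)

ncol(toy)   # 3
dim(toy)    # 4 3  (rows first, then columns)
```

`dim()` is a base R function that reports both dimensions at once, which is a handy way to double-check your answers.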

In [None]:
# YOUR ANSWER HERE
number_columns <- NULL # YOUR CODE HERE

# display
number_columns

In [None]:
. = ottr::check("tests/Q1a.R")

Use `nrow(df_name)` to get the number of rows

In [None]:
# YOUR ANSWER HERE
number_rows <- NULL # YOUR CODE HERE

# display
number_rows

In [None]:
. = ottr::check("tests/Q1b.R")

<img src="tidyverse.png" width=40% />

## **What is the CTDC data?**

The [CTDC dataset](https://www.ctdatacollaborative.org/global-synthetic-data-dashboard) is a synthetic dataset of harmonized data from counter-trafficking organizations around the world. It is referenced on IOM and UN websites and has been used to inform policies on trafficking. It is also the dataset you are working with in your problem set.

>"The CTDC initiative is supported by donors such as the US Department of State Bureau of Population, Refugees, and Migration (PRM); the US Department of Labour (DOL) through the International Labour Organization (ILO)'s Research to Action (RTA) project; the Global Fund to End Modern Slavery (GFEMS) under a cooperative agreement with the US Department of State (DOS); the IOM Development Fund; and the Ministry of Foreign Affairs of the Netherlands (BZ). The Counter-Trafficking Data Collaborative (CTDC) has a dataset, known as the global victim of trafficking dataset, which is the largest global repository on victims of human trafficking" (Kooffreh 2023).

The data are *synthetic* in order to protect the privacy of the victims.

Structure:
* Each row represents one reported case/individual (this includes cases that fit the definition of trafficking, and cases that do not). 
* The columns can be seen below, covering demographics, country of origin, exploitation, the form of trafficking, the means of control, and the recruiter relationship. 

In [None]:
# without %>%
colnames(ctdc)

In [None]:
# with %>%
ctdc %>% colnames()

In [None]:
# what are the column names?
ctdc %>% colnames() %>% as.data.frame()

Preview the first 6 rows using `head(df_name)`

In [None]:
# TO DO: preview the first 6 rows of ctdc
 

### **How should we understand IOM's data?**

Read pages 2-4 of the [codebook](https://www.ctdatacollaborative.org/sites/g/files/tmzbdl2011/files/2024-02/Codebook_CTDC_global_synthetic_data_v2024.pdf).

**Discussion Q: Take 5 minutes to answer with 2-3 people around you.**

1. Where does the CTDC's data on victims come from?
2. How might this differ from nationally collected data cited by the UNODC (i.e. collected by government initiatives)?
3. A study citing CTDC as a primary data source in an analysis of trends says it is "a comprehensive and reliable repository of human trafficking incidents from various regions and time periods" (Olisah et al.). Do you agree or disagree?

_Notetaking box: replace this text._

---
### **Our Task: How do traffickers control their victims?**
**Question: What are the means of control that are used by traffickers?**

**What is our population of interest?**

a. All reported cases\
b. Confirmed cases of trafficking

### 1) When you want to subset to certain **rows**, use `filter()`

Our dataset includes all reported cases (not just the confirmed cases of trafficking). Let's keep only the individuals (each person is represented by a row) who are confirmed victims of trafficking.
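
Here is a minimal sketch of how `filter()` behaves with an "or" condition, on a made-up dataframe. The 0/1 coding and the values in `toy` are assumptions for illustration and are not taken from the CTDC data.

```r
library(dplyr)

# Made-up data: one row per case, with 0/1 flags (illustration only)
toy <- data.frame(
  id              = 1:5,
  isSexualExploit = c(1, 0, 0, NA, 0),
  isForcedLabour  = c(0, 1, 0, 0,  0)
)

# Keep rows where at least one flag is greater than 0.
# `|` means "or". Rows where the condition evaluates to NA are dropped by filter().
toy_flagged <- toy %>% filter(isSexualExploit > 0 | isForcedLabour > 0)

toy_flagged$id   # 1 2
```

Row 4 is dropped because `NA > 0 | 0 > 0` evaluates to `NA`, and `filter()` only keeps rows where the condition is `TRUE`.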

In [None]:
# EXAMPLE CODE CHUNK #1
# subset to cases of confirmed trafficking
ctdc_confirmed <- ctdc %>% filter(isSexualExploit > 0 | isForcedLabour > 0 | isOtherExploit > 0) 

# display
head(ctdc_confirmed)

### 2) When we only want certain **columns**, we use `select()` to select the relevant columns

We are only interested in the means of control, so let's select only the columns related to the means of control.

In [None]:
# EXAMPLE CODE CHUNK #2

# METHOD #1
ctdc_means <- ctdc_confirmed %>% select(meansDebtBondageEarnings,
                                       meansThreats,
                                       meansAbusePsyPhySex,
                                       meansFalsePromises,
                                       meansDrugsAlcohol,
                                       meansDenyBasicNeeds,
                                       meansExcessiveWorkHours,
                                       meansWithholdDocs,
                                       meansSum)

# METHOD #2
ctdc_means2 <- ctdc_confirmed %>% select(starts_with("means"))

# Display
head(ctdc_means)

In [None]:
all.equal(ctdc_means2,ctdc_means)

**You try!** Replace `df` with the appropriate dataframe. (Hint: this is exactly the same as METHOD #2 in example code chunk #2; you are just typing it out to commit it to memory!)

`df %>% select(starts_with("means"))`

In [None]:
# YOUR ANSWER HERE
ctdc_means3 <- NULL # YOUR CODE HERE

# Display
head(ctdc_means3)

In [None]:
. = ottr::check("tests/Q2.R")

### 3) Finding averages of columns using `summarise()` and `mean()`

Look at the dataframe `ctdc_means` above, and look at the column names. They are various means of control. I want to find the average of `meansDebtBondageEarnings`.

What would this average represent? In this case, the average is equivalent to the proportion of cases that experienced that means of control (MOC). We compute it with `summarise()` and `mean()`, in the format:

`df %>% summarise(new_column_name = mean(existing_column_name, na.rm = T))`
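
To see why the average of an indicator column is a proportion, here is a small worked sketch. The dataframe `toy` and its 0/1 values are made up for illustration; they are not CTDC values.

```r
library(dplyr)

# Made-up 0/1 indicator with one missing value (illustration only)
toy <- data.frame(meansThreats = c(1, 0, 1, 1, NA))

toy_summary <- toy %>% summarise(
  prop_threats = mean(meansThreats, na.rm = TRUE),  # 3 ones out of 4 known values = 0.75
  with_na      = mean(meansThreats),                # NA, because one value is missing
  n_known      = sum(!is.na(meansThreats))          # 4 non-missing values
)

toy_summary
```

With `na.rm = TRUE` the missing value is dropped before averaging, so the proportion is out of the 4 cases with a known value, not all 5 rows.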

In [None]:
# EXAMPLE CODE CHUNK #3
ex_ctdc_means_summary <- ctdc_means %>% summarise(
    # mean of the debt bondage column
    meansDebtBondageEarnings = mean(meansDebtBondageEarnings, na.rm = T))

# display
ex_ctdc_means_summary

Your turn! You can summarise multiple columns simultaneously by adding another line.

    df %>% summarise(
        column1 = mean(column1, na.rm = T),
        column2 = mean(column2, na.rm = T))

Find the means of `meansAbusePsyPhySex` and `meansFalsePromises`. 

How would you interpret these values?

In [None]:
# YOUR ANSWER HERE
# Taking the mean of multiple columns

ex_ctdc_means_summary <- NULL # YOUR CODE HERE

# display output
ex_ctdc_means_summary

In [None]:
. = ottr::check("tests/Q3.R")

## No Action Needed: Plotting Data
Looking at the plot, what were the most commonly reported means of control in this dataset for victims in IOM's database?

In [None]:
# average every column, drop the count column, then reshape to one row per means of control
ctdc_means %>%
    summarise_all(mean, na.rm = T) %>%
    select(-meansSum) %>%
    pivot_longer(cols = starts_with("means"),
                 names_to = "means_of_control",
                 values_to = "proportion") %>%
    # horizontal bar plot, ordered by proportion
    ggplot(aes(x = reorder(means_of_control, proportion), y = proportion)) +
    labs(x = "Means of Control") +
    geom_bar(stat = "identity", width = 0.5) +
    theme_bw(base_size = 30) +
    coord_flip()

## Interpretation

**Discuss with your group:**
1. How would you interpret these proportions?
2. Can you defend using this data to describe general trafficking trends?
3. Can you defend using these proportions to inform policy and program interventions?
4. What would the long-term consequences be if we continue to rely solely on this specific dataset to inform interventions?


---
### Practice Time!

A study using the same data found women to be more susceptible to violent means of control, categorized as physical and sexual abuse (Stöckl 2021). We are actually not able to replicate this finding. Why?

The picture below draws on the same dataset from the CTDC website, except that the means of control are more disaggregated. Which means of control are gendered based on this graph?

_*Replace this text*_

<img src="gendered.png" width=80% />

Let's see if the data we have will reproduce these results.

**Steps**
1. Create a new dataframe for women, and one dataframe for men, using `filter()`.
2. Select the columns for the means of control, using `select()`.
3. Find the mean of each column using `summarise()`.
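
The three steps above can be sketched on a made-up dataframe as follows. The dataframe `toy`, its values, and the name `toy_woman_summary` are invented for illustration; your answers below should use `ctdc_confirmed`.

```r
library(dplyr)

# Made-up data (illustration only)
toy <- data.frame(
  gender             = c("Woman", "Woman", "Man", "Man", "Man"),
  meansThreats       = c(1, 0, 1, 1, 0),
  meansFalsePromises = c(1, 1, 0, NA, 1),
  isForcedLabour     = c(0, 1, 1, 1, 1)
)

toy_woman_summary <- toy %>%
  filter(gender == "Woman") %>%        # step 1: keep the rows for one gender
  select(starts_with("means")) %>%     # step 2: keep the means-of-control columns
  summarise(                           # step 3: average each column
    meansThreats       = mean(meansThreats, na.rm = TRUE),
    meansFalsePromises = mean(meansFalsePromises, na.rm = TRUE)
  )

toy_woman_summary   # meansThreats = 0.5, meansFalsePromises = 1
```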

We already have a dataset of confirmed trafficking cases called `ctdc_confirmed`. See below. 

In [None]:
head(ctdc_confirmed)

#### 1) Practice using the filter() function. 
**Use filter() to create two dataframes. One for each gender category.**

**Step 1:** Look at the second column, named `gender`. Filter the confirmed cases dataframe to the rows where `gender == "Woman"`.

    ctdc_woman <- ctdc_confirmed %>% filter(gender == "Woman")

In [None]:
# YOUR ANSWER HERE
ctdc_woman <- NULL # YOUR CODE HERE

# display
head(ctdc_woman)

In [None]:
. = ottr::check("tests/Q4.R")

**Step 2:** Now create the same thing but for `"Man"`. In other words, from `ctdc_confirmed`, filter to where `gender == "Man"`

In [None]:
# YOUR ANSWER HERE
ctdc_man <- NULL # YOUR CODE HERE

# display
head(ctdc_man)

In [None]:
. = ottr::check("tests/Q5.R")

#### 2) For both dataframes, use `select()` to select the columns that start with "means"

In [None]:
# YOUR ANSWER HERE
# select the columns that start with "means" for the woman dataframe
ctdc_woman_means <- NULL # YOUR CODE HERE

# select the columns that start with "means" for the man dataframe
ctdc_man_means <- NULL # YOUR CODE HERE

In [None]:
. = ottr::check("tests/Q6.R")

Below are the first six rows of the dataframe `ctdc_woman_means` that you created.

In [None]:
head(ctdc_woman_means)

#### 3) Comparing the means from the two subsets using `summarise()`

In the example code chunk below, I find the mean of every MOC for the women's dataframe, `ctdc_woman_means`.

In [None]:
# EXAMPLE CODE
# find the averages of all eight means of control (MOC) included in the dataset for WOMEN
ctdc_woman_means_summary <- ctdc_woman_means %>% summarise(
        meansDebtBondageEarnings = mean(meansDebtBondageEarnings, na.rm = T),
        meansThreats = mean(meansThreats, na.rm = T),
        meansAbusePsyPhySex = mean(meansAbusePsyPhySex, na.rm = T),
        meansFalsePromises = mean(meansFalsePromises, na.rm = T),
        meansDrugsAlcohol = mean(meansDrugsAlcohol, na.rm = T),
        meansDenyBasicNeeds = mean(meansDenyBasicNeeds, na.rm = T),
        meansExcessiveWorkHours = mean(meansExcessiveWorkHours, na.rm = T),
        meansWithholdDocs = mean(meansWithholdDocs, na.rm = T))

ctdc_woman_means_summary

**Now you do the same exact thing, but for the male dataframe.**

Hint: If you don't want to type it all out, just copy and paste the example code above. There is one variable you would need to change, though. What is it?

In [None]:
# YOUR ANSWER HERE
# find the averages of all eight means of control (MOC) included in the dataset for MEN
ctdc_man_means_summary <- NULL # YOUR CODE HERE

# DISPLAY
ctdc_man_means_summary

In [None]:
. = ottr::check("tests/Q7.R")

---
#### No action needed, the cell below makes a plot

In [None]:
# Plotting results
ctdc_woman_means_summary %>% mutate(gender = "Woman") %>%
    # manipulating the data
    rbind(ctdc_man_means_summary %>% mutate(gender = "Man")) %>%
    pivot_longer(starts_with("means"), names_to="MeanOfControl", values_to = "Proportion") %>%
    # plotting starts here
    ggplot(aes(x = MeanOfControl, y = Proportion, fill = gender)) +
    # barplot
    geom_bar(stat = "identity", position = position_dodge(), width = 0.3)+ 
    # changes to formatting
    theme_bw(base_size = 20) + coord_flip()


<!-- BEGIN QUESTION -->

**What means of control appear gendered in this dataset? Where is it consistent and inconsistent with the disaggregated comparison?**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

# Summary
## Data Interpretation
1. Think through how the data was collected, and in what direction that would bias the results.
2. Pay attention to how the variable of interest is defined, and whether findings could change with aggregation/disaggregation or other tweaks to the variable.
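
As a rough sketch of the second point, the made-up example below shows how a gender difference that is visible in two disaggregated indicators can disappear once they are combined into one. The column names `abusePhysical`, `abuseSexual`, and `abuseAny`, and all the values, are hypothetical and chosen only to make the effect visible. The example also uses `group_by()`, which we did not cover in this section; it computes the summary separately for each gender.

```r
library(dplyr)

# Made-up data (illustration only): two disaggregated 0/1 indicators
toy <- data.frame(
  gender        = c("Woman", "Woman", "Woman", "Woman", "Man", "Man", "Man", "Man"),
  abusePhysical = c(0, 0, 1, 0,  1, 1, 0, 0),
  abuseSexual   = c(1, 0, 0, 0,  0, 0, 0, 0)
)

toy_by_gender <- toy %>%
  # aggregated indicator: 1 if either kind of abuse was reported
  mutate(abuseAny = pmax(abusePhysical, abuseSexual)) %>%
  group_by(gender) %>%
  summarise(
    abusePhysical = mean(abusePhysical),
    abuseSexual   = mean(abuseSexual),
    abuseAny      = mean(abuseAny)
  )

toy_by_gender
```

In this toy data the aggregated `abuseAny` proportion is 0.5 for both groups, even though the disaggregated columns differ (0.5 vs 0.25 for physical, 0 vs 0.25 for sexual).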

## Coding
#### Dataframe basics
- `read.csv(filename)`: reads in a dataframe
- `head(df_name)`: displays first 6 rows
- `ncol(df_name)`, `nrow(df_name)`: return the number of columns/rows of a dataframe, respectively
- `%>%`: pipe operator, passes on the previous output into the next function

#### Subsetting and summarising
1. Use `filter(insert_condition)` to subset to the rows you want
2. Use `select(column_names)` to subset to the columns you want
3. Use `summarise()` to find summary statistics like `mean()` of a column

#### Example

    df %>% filter(condition) %>%
            select(columns) %>%
            summarise(column_name = mean(existing_column_name, na.rm = T))

**Example: What proportion of reported sexual exploitation cases are women? In this dataset, it's 84%.**

In [None]:
ctdc %>% filter(isSexualExploit == 1) %>%
    select(gender) %>% 
    summarise(prop_female = mean(gender == "Woman", na.rm = T))

# Next Time
   - How can we evaluate claims of association and causation? (E.g. Poverty causes trafficking?)

---
# Practice Submitting on Gradescope

To submit your notebook...

### 1. Click `File` $\rightarrow$ `Save Notebook`.

### 2. Wait 5 seconds.

### 3. Select the cell below and hit run.

In [None]:
ottr::export("discussion2.ipynb")

After you hit "Run" on the cell above, click the download link. A .zip file should download to your computer.

(If you make changes to your notebook, you'll need to hit save and then run the cell above again before you submit to get a new version of it.)

### 4. Submit the .zip file you just downloaded <a href="https://www.gradescope.com/" target="_blank">on Gradescope here</a> under practice submission.

Notes:

- **This does not seem to work on Chrome for iPad or iPhone.** If you're using an iPad or iPhone, you need to download the file using **Safari**.
- If your web browser automatically unzips the .zip file (so you see a folder instead of a .zip file), you can just upload the .ipynb file that is inside the folder.
- If this method is not working for you, try this: hit `File`, then `Download as`, then `Notebook (.ipynb)` and submit that.

---
## Something positive for the day: Hiking Trails near SF
One of my favorites, although it's quite challenging: Dipsea, Steep Ravine, Matt Davis Trail. Recommended for those who like to hike!

(I do not own these photos; they are from Google.)

P.S. If you think these snippets are excessive, let me know! If you have one you want to share, I also take submissions. Email me if you have a piece of good news/something fun to share that you would like to be featured next section!

<img src="dipsea2.jpg" width=40% /><img src="dipsea.jpg" width=47.5% />