# Activity 3: Practicing with Pandas

For this activity, we will assume the role of a data scientist searching for and evaluating public datasets. As you work, pay close attention to the dataset you choose, looking for counter-narratives, non-apparent political values, gaps, or underrepresented perspectives.

Please open this file in Jupyter Notebook or Google Colab. You will need to either launch Jupyter Notebook first or upload the file to Google Colab before you can edit it.

Please make sure you **print your output.** You may need to use ```print``` for your output to be displayed.

-----
## Submission Instructions
1. Upload the completed activity to the "Activities" folder of your GitHub Portfolio;
2. Ensure your submission consists of two files: <br/>(a) one single-page reflection in markdown (e.g. `a3_reflection.md`);  <br/>(b) one Jupyter notebook with completed responses to each activity step (e.g. `a3_pandas.ipynb`).

## Data biography: Trans-Atlantic Slave Trade dataset

To complete the assignment, please make sure you download the dataset from where it is hosted on our GitHub: [Trans-Atlantic-Slave-Trade_Americas.csv](https://github.com/zmuhls/ccny-data-science/blob/main/assets/datasets/Trans-Atlantic-Slave-Trade_Americas.csv)

## Data Biography (1-2 Paragraphs)

Before you begin your exploration, you will provide a short data biography (1 - 2 paragraphs) on the dataset that addresses **where the data came from, who collected it, and the original intention(s) for its collection.**

You may find the following sources useful for your research:
* [Slave Voyages' Trans-Atlantic Slave Trade methodology](https://www.slavevoyages.org/voyage/about#methodology/introduction/0/en/)
    * You may find the sections Introduction, Coverage of the Slave Trade, and Nature of Sources particularly useful.
* [Slave Voyages' About](https://www.slavevoyages.org/about/about#)
* [Jamelle Bouie's We Still Can’t See American Slavery for What It Was](https://csc10800.github.io/assets/pdf/Bouie_we_still_cant_see_american_slavery_for_what_it_is.pdf)

**Data Biography:**
The data was collected from both published works and archives around the world, from Europe to Africa to the Americas. The information on the sites was compiled by a "multi-disciplinary team of historians, librarians, curriculum specialists, cartographers, computer programmers, and web designers, in consultation with scholars of the slave trade from universities in Europe, Africa, South America, and North America." The information was originally compiled in the hope of understanding how complex the Trans-Atlantic Slave Trade was and of humanizing those who were enslaved. It has since come to be used by other institutions and individuals.





-----

## Download and explore dataset

As we have been practicing in class, the questions that follow will require you to consider what output you need before running the appropriate Python code.

To ensure your code runs, please remember to load ```pandas``` and the ```Trans-Atlantic-Slave-Trade_Americas.csv``` file into the Jupyter Notebook environment.

If you are running this on Google Colab, please connect your Google Drive to your Colab notebook before attempting the exercise:
```
from google.colab import drive
drive.mount('/content/drive')
```

In [20]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 2. How many rows of data does the dataset contain?
In this first step, you’ll load the dataset and examine its structure. Loading data is foundational in data science and digital humanities workflows. You’ll use the `pandas` library to read a CSV file and display its first few rows.<br/>
* **Hint**: See Melanie Walsh's chapter on Pandas Basics if you're stuck: [Pandas Basics – Part 1](https://melaniewalsh.github.io/Intro-Cultural-Analytics/03-Data-Analysis/01-Pandas-Basics-Part1.html).



In [22]:
tast_df = pd.read_csv('/content/drive/MyDrive/Trans-Atlantic-Slave-Trade_Americas.csv')
tast_df

|  | year_of_arrival | flag | place_of_purchase | place_of_landing | percent_women | percent_children | percent_men | total_embarked | total_disembarked | resistance_label | vessel_name | captain's_name | voyage_id | sources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1520 |  | Portuguese Guinea | San Juan |  |  |  | 324.0 | 259.0 |  |  |  | 42987 | `[u'AGI,Patronato 175, r.9<><p><em>AG!</em> (Se...` |
| 1 | 1525 | Portugal / Brazil | Sao Tome | Hispaniola, unspecified |  |  |  | 359.0 | 287.0 |  | S Maria de Bogoña | Monteiro, Pero | 46473 | `[u'ANTT,CC,Parte II, maco 131, doc 54<><i>Inst...` |
| 2 | 1526 | Spain / Uruguay | Cape Verde Islands | Cuba, port unspecified |  |  |  | 359.0 | 287.0 |  |  | Carega, Esteban (?) | 11297 | `[u'Pike,60-1,172<>Pike, Ruth, <i>Enterprise</i...` |
| 3 | 1526 | Spain / Uruguay | Cape Verde Islands | Cuba, port unspecified |  |  |  | 359.0 | 287.0 |  |  | Carega, Esteban (?) | 11298 | `[u'Pike,60-1,172<>Pike, Ruth, <i>Enterprise</i...` |
| 4 | 1526 |  | Cape Verde Islands | Caribbean (colony unspecified) |  |  |  | 359.0 | 287.0 |  | S Anton | Leon, Juan de | 42631 | `[u'Chaunus, 3: 162-63<><p>Chaunus, <em>xxxxxx<...` |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20774 | 1864 | Spain / Uruguay | Africa., port unspecified | Cuba, port unspecified |  |  |  | 488.0 | 465.0 |  | Polaca |  | 46554 | `[u'AHNM, Ultramar, Leg. 3551, 6<><i>Archivo Hi...` |
| 20775 | 1865 | Spain / Uruguay | Africa., port unspecified | Isla de Pinas |  |  |  | 152.0 | 145.0 | Slave insurrection | Gato |  | 4394 | `[u'IUP,ST,50/B/137<>Great Britain, <i>Irish Un...` |
| 20776 | 1865 |  | Africa., port unspecified | Mariel |  |  |  | 780.0 | 650.0 |  |  |  | 4395 | `[u'IUP,ST,50/B/144<>Great Britain, <i>Irish Un...` |
| 20777 | 1865 |  | Congo River | Cuba, port unspecified |  |  |  | 1265.0 | 1004.0 |  | Cicerón | Mesquita | 5052 | `[u'IUP,ST,50/A/23-4<>Great Britain, <i>Irish U...` |


It has 20,778 rows (the row index runs from 0 to 20777) 😢
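
The row count can also be read directly instead of being inferred from the displayed table. The following is a minimal sketch, assuming a small stand-in frame built from the first five preview rows with only a few columns kept (the names `sample_csv` and `sample_df` are hypothetical); on the full data the same calls would be applied to `tast_df`.

```python
import io

import pandas as pd

# Five rows copied from the preview above; only a few columns are kept.
sample_csv = io.StringIO(
    "year_of_arrival,place_of_purchase,place_of_landing,total_embarked,total_disembarked,voyage_id\n"
    "1520,Portuguese Guinea,San Juan,324.0,259.0,42987\n"
    '1525,Sao Tome,"Hispaniola, unspecified",359.0,287.0,46473\n'
    '1526,Cape Verde Islands,"Cuba, port unspecified",359.0,287.0,11297\n'
    '1526,Cape Verde Islands,"Cuba, port unspecified",359.0,287.0,11298\n'
    "1526,Cape Verde Islands,Caribbean (colony unspecified),359.0,287.0,42631\n"
)
sample_df = pd.read_csv(sample_csv)

n_rows, n_cols = sample_df.shape  # (number of rows, number of columns)
print(f"{n_rows} rows, {n_cols} columns")  # prints "5 rows, 6 columns"
print(len(sample_df))  # len() also returns the number of rows
```

Because the default index starts at 0, the last index label shown in a preview is one less than the number of rows.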

### 3. Are the data types for each column appropriate? Please explain how they are or are not appropriate for your analysis.

I wouldn't say so... These logs are definitely dehumanizing. You can't tell how many women, men, and children were on the ship. Makes you wonder *why.*
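
On the technical side of this question, pandas records one data type (dtype) per column, and inspecting those is a common starting point. The sketch below assumes a hypothetical stand-in frame (`sample_df`) that reuses a few column names from the dataset; the `percent_women` values are made up for illustration. On the full data, `tast_df.dtypes` would be the equivalent call.

```python
import pandas as pd

# Hypothetical stand-in that reuses a few column names from the dataset;
# the percent_women values are made up for illustration.
sample_df = pd.DataFrame({
    "year_of_arrival": [1520, 1525, 1526],
    "place_of_landing": ["San Juan", "Hispaniola, unspecified", "Cuba, port unspecified"],
    "percent_women": [float("nan"), 0.25, float("nan")],
    "total_embarked": [324.0, 359.0, 359.0],
})

# One dtype per column: integers, text, and floats in this sample.
print(sample_df.dtypes)

# The head counts are whole numbers stored as floats. The nullable "Int64"
# dtype stores them as integers and still allows missing values.
embarked_int = sample_df["total_embarked"].astype("Int64")
print(embarked_int.dtype)  # Int64
```

Whether a conversion like this is worthwhile depends on the analysis; averages and sums work the same on either numeric type.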


### 4. What is the overall average proportion of ```percent_women```, ```percent_children```, ```percent_men```?

In [24]:
tast_df[['percent_women', 'percent_children', 'percent_men']].mean()

|  | 0 |
|---|---|
| percent_women | 0.274098 |
| percent_children | 0.231531 |
| percent_men | 0.49705 |


**Double click to edit cell**     
`Please type your answer here`
<br/>
<br/>
<br/>
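
When interpreting these averages, it helps to know that `.mean()` skips missing values by default, so each column is averaged only over the voyages where that value was recorded. The sketch below illustrates this on a hypothetical three-row frame (`sample_df`) with made-up proportions; `.count()` reports how many non-missing values each average rests on.

```python
import pandas as pd

percent_cols = ["percent_women", "percent_children", "percent_men"]

# Made-up proportions for three hypothetical voyages; NaN marks a blank cell.
sample_df = pd.DataFrame({
    "percent_women": [0.25, float("nan"), 0.75],
    "percent_children": [0.25, float("nan"), float("nan")],
    "percent_men": [0.5, float("nan"), 0.25],
})

means = sample_df[percent_cols].mean()      # NaN values are skipped by default
recorded = sample_df[percent_cols].count()  # non-missing values per column

print(means)
print(recorded)
```

Since each column may be averaged over a different set of rows, the three averages do not have to sum to exactly 1.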


---


### 5. How many of the column values for ```percent_women```, ```percent_children```, ```percent_men``` are left blank? Suggest one reason why the majority of the values in these columns are blank.

In [None]:
#Your code here

**Double click to edit cell**     
`Please type your answer here`
<br/>
<br/>
<br/>
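
One possible way to count blank cells is to combine `.isna()` with `.sum()`. This is a sketch on a hypothetical four-row frame (`sample_df`) with made-up values, not a result from the real dataset; on the full data the same expression would be applied to `tast_df`.

```python
import pandas as pd

percent_cols = ["percent_women", "percent_children", "percent_men"]
nan = float("nan")

# Made-up values for four hypothetical voyages; NaN marks a blank cell.
sample_df = pd.DataFrame({
    "percent_women": [nan, nan, 0.25, nan],
    "percent_children": [nan, nan, 0.25, nan],
    "percent_men": [nan, nan, 0.5, 0.75],
})

# isna() gives True for each blank cell; summing counts the True values.
missing_counts = sample_df[percent_cols].isna().sum()
# The mean of the same True/False values is the share of blank cells.
missing_share = sample_df[percent_cols].isna().mean()

print(missing_counts)
print(missing_share)
```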

---


### 6. Display all duplicated rows and remove the duplicates. Check that the duplicates were successfully removed. Please also suggest a reason why we would remove duplicates for our analysis.

**Level-up**: How many duplicated rows do we have? Recall the methods we have used to help us count things.

In [None]:
#Your code here

**Double click to edit cell**     
`Please type your answer here`
<br/>
<br/>
<br/>
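
The steps in this question (display, count, remove, verify) can be sketched with `.duplicated()` and `.drop_duplicates()`. The example assumes a hypothetical four-row frame (`sample_df`) in which the third row repeats the second exactly; it does not show results from the real dataset.

```python
import pandas as pd

# Hypothetical rows: the third row repeats the second exactly.
sample_df = pd.DataFrame({
    "year_of_arrival": [1526, 1526, 1526, 1865],
    "place_of_landing": [
        "Cuba, port unspecified",
        "Cuba, port unspecified",
        "Cuba, port unspecified",
        "Mariel",
    ],
    "voyage_id": [11297, 11298, 11298, 4395],
})

# keep=False marks every copy of a repeated row, not only the later ones.
all_copies = sample_df[sample_df.duplicated(keep=False)]
print(all_copies)

# The default keep="first" marks only the extra copies, so the sum counts them.
n_extra = int(sample_df.duplicated().sum())
print(n_extra)  # 1

# Remove the extra copies, then check that none remain.
deduped = sample_df.drop_duplicates()
print(deduped.duplicated().any())    # False
print(len(sample_df), len(deduped))  # 4 3
```

By default every column is compared, so two rows that differ only in `voyage_id` (as the first two rows of this sample do) are not treated as duplicates; passing `subset=` restricts the comparison to chosen columns.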


---


### 7. Please identify the **top 5 most common ports of arrival**.

**Hint:** Check the column ```place_of_landing```.

In [None]:
#Your code here

**Double click to edit cell**     
`Please type your answer here`
<br/>
<br/>
<br/>
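
A frequency ranking like this is commonly obtained with `.value_counts()` followed by `.head(5)`. The sketch below uses a hypothetical frame (`sample_df`) whose place names are taken from the preview rows but whose frequencies are made up, so the ranking it prints says nothing about the real data.

```python
import pandas as pd

# Hypothetical landing records: the place names appear in the preview,
# but the frequencies here are made up.
sample_df = pd.DataFrame({
    "place_of_landing": [
        "Cuba, port unspecified", "Cuba, port unspecified", "Cuba, port unspecified",
        "Mariel", "Mariel",
        "San Juan",
        "Hispaniola, unspecified",
        "Caribbean (colony unspecified)",
        "Isla de Pinas",
        None,  # a blank cell; value_counts() leaves it out by default
    ]
})

# Count each distinct value (most frequent first) and keep the first five.
top5 = sample_df["place_of_landing"].value_counts().head(5)
print(top5)
```

Values that are tied in frequency may appear in either order, so a tie at fifth place is worth checking by looking one or two rows past the cutoff.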

---


### 8. Please plot the **top 5 enslavers/captors** from this dataset. Please choose the appropriate visualization (e.g. pie chart, bar chart) to display your finding.
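
As a rough sketch of one approach, the counts can be computed with `.value_counts()` and then passed to the pandas plotting interface. This assumes that the "enslavers/captors" in the prompt correspond to the `captain's_name` column, which is an interpretation rather than something the assignment states. The frame (`sample_df`), the placeholder names, and the helper `plot_top_captains` are all hypothetical.

```python
import pandas as pd

# Placeholder names and made-up frequencies, for illustration only.
sample_df = pd.DataFrame({
    "captain's_name": (
        ["Captain A"] * 4
        + ["Captain B"] * 3
        + ["Captain C"] * 2
        + ["Captain D", "Captain E", "Captain F"]
    )
})

# Number of recorded voyages per captain, most frequent first, top five only.
top5 = sample_df["captain's_name"].value_counts().head(5)
print(top5)


def plot_top_captains(counts):
    """Draw a bar chart of voyage counts per captain (needs matplotlib)."""
    ax = counts.plot(kind="bar")
    ax.set_xlabel("Captain")
    ax.set_ylabel("Number of voyages")
    ax.set_title("Captains with the most recorded voyages")
    return ax
```

Calling `plot_top_captains(top5)` in a notebook cell should render the chart. A bar chart is used here because the values are counts to be compared against each other; a pie chart would suggest that the five captains make up a whole.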

In [None]:
#Your code here

-----

### 9. Having briefly explored the dataset, what further questions have emerged?

Please share your thoughts in a few sentences.

**Double click to edit cell**     
`Please type your answer here`
<br/>
<br/>
<br/>



-----
-----


## Side quest challenge

The side quest challenge is for extra credit. You can still get full credit for this activity even if you do not complete this challenge.

### 1. Which enslaver/captor had the highest difference between the total number of people who embarked and disembarked?

**Hint:** You may need to [add a column](https://github.com/GCDigitalFellows/intro-pandas-dri-2022/blob/main/README.md#8-rename-select-drop-and-add-new-columns) to calculate the difference, and then use the [```.groupby```, ```.count()``` and ```.sort_values``` methods](https://github.com/GCDigitalFellows/intro-pandas-dri-2022/blob/main/README.md#9-sort-columns-groupby-columns--count-values) for this challenge.

**Double click to edit cell**     
`Please type your answer here`
<br/>
<br/>
<br/>
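
The hinted steps (add a column, group, sort) can be sketched as follows. The example assumes that the `captain's_name` column identifies the enslaver/captor and that "highest difference" means the largest total across a captain's voyages; both are interpretations. The frame (`sample_df`) is hypothetical: the embarked and disembarked figures are copied from preview rows, but their pairing with placeholder captains is made up.

```python
import pandas as pd

# Placeholder captains; the figures are copied from preview rows, but
# their pairing with these captains is made up for illustration.
sample_df = pd.DataFrame({
    "captain's_name": ["Captain A", "Captain A", "Captain B", "Captain B", "Captain C", None],
    "total_embarked": [324.0, 359.0, 488.0, 152.0, 780.0, 1265.0],
    "total_disembarked": [259.0, 287.0, 465.0, 145.0, 650.0, 1004.0],
})

# New column: people who embarked but did not disembark on each voyage.
sample_df["difference"] = sample_df["total_embarked"] - sample_df["total_disembarked"]

# Total difference per captain, largest first.
# groupby() drops rows whose captain is missing by default.
by_captain = (
    sample_df.groupby("captain's_name")["difference"]
    .sum()
    .sort_values(ascending=False)
)
print(by_captain)
print(by_captain.index[0])  # Captain A
```

This sketch uses `.sum()` rather than the `.count()` mentioned in the hint, because `.count()` would tally voyages instead of people. Replacing `.sum()` with `.max()` would instead rank captains by their single voyage with the largest difference.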

---
