<img src="img/dsci511_header.png" width="600">

# Lab 2: Advanced data wrangling with Pandas

## Instructions
rubric={mechanics:5}

Check off that you have read and followed each of these instructions:

- [ ] All files necessary to run your work must be pushed to your GitHub.ubc.ca repository for this lab.
- [ ] You need to have a minimum of 3 commit messages associated with your GitHub.ubc.ca repository for this lab.
- [ ] You must also submit the `.ipynb` file and the rendered PDF for this lab to Gradescope. The entire notebook must be executed so the TAs can see the results of your work.
- [ ] **There is autograding in this lab, so please do not move or rename this file. Also, do not copy and paste cells, if you need to add new cells, create new cells via the "Insert a cell below" button instead.**
- [ ] To ensure you do not break the autograder, remove all code for installing packages (i.e., DO NOT have `! conda install ...` or `! pip install ...` in your homework)!
- [ ] Follow the [MDS general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- [ ] <mark>This lab has hidden tests. In this lab, the visible tests are just there to ensure you create an object with the correct name. The remaining tests are hidden intentionally. This is so you get practice deciding when you have written the correct code and created the correct data object. This is a necessary skill for data scientists, and if we were to provide robust visible tests for all questions you would not develop this skill, or at least not to its full potential.</mark>


## Code Quality
rubric={quality:5}

The code that you write for this assignment will be given one overall grade for code quality; see our code quality rubric as a guide to what we are looking for. Also, for this course (and other MDS courses that use Python), we are trying to follow the PEP 8 code style. There is a guide you can refer to: https://peps.python.org/pep-0008/

Each code question will also be assessed for code accuracy (i.e., does it do what it is supposed to do?).

## Writing 
rubric={writing:5}

To get the marks for this writing component, you should:

- Use proper English, spelling, and grammar throughout your submission (the non-coding parts).
- Be succinct. This means being specific about what you want to communicate, without being superfluous.


## Let's get started!

Run the cell below to load the packages needed for this lab.

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import re
from palmerpenguins import load_penguins # penguins data set for exercise 3
from nycflights13 import flights # flights data set for exercise 5

## Exercise 1: Working with dates

rubric={autograde:16}

In our recent past, we experienced the COVID-19 global pandemic. 
With your new data science skills, you can start to look at 
and visualize the data about this impactful pandemic yourself. 
Let's look at cumulative confirmed cases in British Columbia 
over a 3 month period from 2021/06/10 to 2021/09/10 
(it was a peak COVID season then). 
We can obtain such data from the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19). In particular, use Python to load the `time_series_covid19_confirmed_global.csv` file located at the URL: [https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv](https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv). Then you will need to filter these global data for records from the province of British Columbia 
during the time interval 2021/06/10 to 2021/09/10. Note that you might need to tidy the data before you filter the time interval.  
We provide you data visualization code, 
however your task is to ensure the dataframe is suitable for visualization. 

The final data set for data visualization should be named `bc_covid19_confirmed_3_months` 
and have only the following two columns:

- one named `date`, whose `dtype` should be `datetime64[ns]`
- one named `confirmed_cases`, whose `dtype` should be `int64`
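The tidying step hinted at above (wide date columns reshaped to long form, then parsed and filtered) can be sketched on a small made-up table. The data, the region names, and the variable names `wide`, `tidy`, and `region_a` are assumptions for illustration only; the date format string assumes month/day/two-digit-year column labels.

```python
import pandas as pd

# Hypothetical wide table: one row per region, one column per date.
wide = pd.DataFrame({
    "Province/State": ["Region A", "Region B"],
    "1/1/21": [10, 5],
    "1/2/21": [12, 7],
    "1/3/21": [15, 7],
})

# Reshape to long form: one row per (region, date) pair.
tidy = wide.melt(id_vars="Province/State", var_name="date", value_name="cases")

# Parse the month/day/two-digit-year strings into datetime values.
tidy["date"] = pd.to_datetime(tidy["date"], format="%m/%d/%y")

# Keep one region and an inclusive date interval.
in_range = tidy["date"].between("2021-01-02", "2021-01-03")
region_a = tidy[(tidy["Province/State"] == "Region A") & in_range].reset_index(drop=True)
print(region_a)
```

Once the dates are real datetime values rather than strings, interval filters compare chronologically instead of alphabetically, which is why parsing comes before filtering in this sketch.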

In [2]:
url = "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
time_series_covid19_confirmed_global = None
three_months = None
bc_covid19_confirmed_3_months = None
# BEGIN SOLUTION
# Load the COVID-19 confirmed global time series data (`url` is defined above)
time_series_covid19_confirmed_global = pd.read_csv(url)

# Filter data for British Columbia and reshape it for a 3-month period
bc_covid19_confirmed = (
    time_series_covid19_confirmed_global[
        time_series_covid19_confirmed_global['Province/State'] == 'British Columbia'
    ].iloc[:, 4:]
    .melt(var_name='date', value_name='confirmed_cases')
)

# Convert dates to datetime objects and filter for the specified 3-month period
bc_covid19_confirmed['date'] = pd.to_datetime(bc_covid19_confirmed['date'], format='mixed')
bc_covid19_confirmed_3_months = (
    bc_covid19_confirmed
    .set_index('date')
    .sort_index()
    .loc['2021-06-10':'2021-09-10']
    .reset_index()
)
# END SOLUTION

In [3]:
bc_covid19_confirmed_3_months

| | date | confirmed_cases |
|---|---|---|
| 0 | 2021-06-10 | 145996 |
| 1 | 2021-06-11 | 146176 |
| 2 | 2021-06-12 | 146176 |
| 3 | 2021-06-13 | 146176 |
| 4 | 2021-06-14 | 146453 |
| ... | ... | ... |
| 88 | 2021-09-06 | 168325 |
| 89 | 2021-09-07 | 170750 |
| 90 | 2021-09-08 | 171564 |
| 91 | 2021-09-09 | 172338 |


In [4]:
# TEST
# Checking if the DataFrame is defined
assert bc_covid19_confirmed_3_months is not None, "The DataFrame 'bc_covid19_confirmed_3_months' is not defined."

# Checking column names
expected_columns = ['date', 'confirmed_cases']
assert list(bc_covid19_confirmed_3_months.columns) == expected_columns, f"Column names are incorrect. Expected: {expected_columns}, but got: {list(bc_covid19_confirmed_3_months.columns)}."

# Checking if the DataFrame is not empty (optional, but useful for debugging)
assert not bc_covid19_confirmed_3_months.empty, "The DataFrame 'bc_covid19_confirmed_3_months' is empty."

In [5]:
# HIDDEN TEST
# Asserting that the object is a DataFrame
assert isinstance(bc_covid19_confirmed_3_months, pd.DataFrame), "Object is not a DataFrame"

# Asserting the column data types
assert bc_covid19_confirmed_3_months['date'].dtype == np.dtype('datetime64[ns]'), "Date column is not datetime64[ns]"
assert bc_covid19_confirmed_3_months['confirmed_cases'].dtype == np.dtype('int64'), "Confirmed cases column is not int64"

# Asserting the DataFrame shape
expected_shape = (93, 2)  # Assuming 93 days for the period from June 10, 2021, to September 10, 2021
assert bc_covid19_confirmed_3_months.shape == expected_shape, f"DataFrame shape is not {expected_shape}"

# Asserting the correct date range in the date column
expected_dates = pd.date_range(start='2021-06-10', end='2021-09-10')
actual_dates = bc_covid19_confirmed_3_months['date']
assert all(actual_dates == expected_dates), "Date column does not contain the correct date range"

# Asserting the sum of the numerical column
expected_sum = 14246272
actual_sum = bc_covid19_confirmed_3_months['confirmed_cases'].sum()
tolerance = 2
assert abs(actual_sum - expected_sum) <= tolerance, f"Sum of confirmed cases {actual_sum} is not within tolerance of {expected_sum}"

Let’s now visualize how cases have changed over the last three months in British Columbia by viewing cumulative cases per day:

In [6]:
three_months_BC = alt.Chart(bc_covid19_confirmed_3_months).mark_bar(color='lightblue').encode(
    x=alt.X('date:T', title='2021'),  # Specify type as Temporal for dates
    y=alt.Y('confirmed_cases:Q', title='Cumulative COVID-19 cases')
).properties(
    title='BC COVID-19 cases started to increase more rapidly in August and September of 2021'
).configure_axis(
    grid=False
).configure_view(
    strokeWidth=0
)

three_months_BC

## Exercise 2: Working with strings

rubric={accuracy:16}You have learned about string operations combined with regular expressions.    Now you are ready to apply them to the real world of data cleaning!    Your goal is to load in "dirty Gapminder" as a dataframe called `dirty` and "clean Gapminder" as a dataframe called `clean`, and wrangle `dirty` until it is the same as `clean`:
- Dirty Gapminder: <https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear_dirty.txt>
- Clean Gapminder: <https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear.txt>

A test has been provided to check that `dirty` is the same as `clean`. Things you might want to do to clean up `dirty`:

- Check that `dirty` and `clean` have the same columns;
- Check if there is any missing data. If there is missing data (NaNs or empty strings), fill it with sensible values;
- Check for things like capitalization, spelling, etc;
- There may be entries that appear to have the exact same spelling and capitalization in both `dirty` and `clean`, but still don't match... Extra whitespace is often a frustrating (and invisible) problem when wrangling text data. You can use `Series.str.strip()` to trim any additional unwanted whitespace around a string.
- At any time, you can check which rows in `dirty` are not equal to `clean` using something like: `dirty[dirty.ne(clean).any(axis=1)]`.
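The whitespace and comparison tips above can be tried out on a tiny made-up pair of tables before applying them to Gapminder. The frames `reference` and `messy` and their contents are hypothetical; only `Series.str.strip()`, `Series.replace()`, `DataFrame.ne()`, and `DataFrame.equals()` come from `pandas`.

```python
import pandas as pd

# Hypothetical reference table and a messy copy with whitespace and case issues.
reference = pd.DataFrame({"country": ["Canada", "China"], "year": [2007, 2007]})
messy = pd.DataFrame({"country": ["Canada ", "china"], "year": [2007, 2007]})

# Rows where at least one cell differs from the reference.
mismatched = messy[messy.ne(reference).any(axis=1)]
print(mismatched)  # both rows differ at this point

# Trim surrounding whitespace, then repair the known miscapitalization.
messy["country"] = messy["country"].str.strip().replace({"china": "China"})

# After cleaning, the two frames should compare as identical.
print(messy.equals(reference))
```

Note that `"Canada "` and `"Canada"` look the same when printed, which is why the element-wise comparison is a more reliable check than eyeballing the output.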

In [7]:
url_dirty = 'https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear_dirty.txt'
url_clean = 'https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear.txt'
dirty = None
clean = None
# BEGIN SOLUTION
dirty = pd.read_csv(url_dirty, delimiter='\t')
clean = pd.read_csv(url_clean, delimiter='\t')

dirty[['continent', 'country']] = dirty['region'].str.split('_', expand=True)  # split "region" into "country" and "continent"
dirty = dirty[['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap']]  # rearrange columns to match "clean"

dirty[(dirty == "").any(axis=1)]  # shows continent has 3 empty strings, all for country "Canada"
dirty.loc[(dirty == "").any(axis=1), 'continent'] = 'Americas'  # fill empty strings with the right values
# dirty.loc[dirty['continent'] == "", 'continent'] = 'Americas'  # Another way to fill the empty strings

dirty[(dirty.ne(clean)).any(axis=1)]  # shows which rows still don't match, there are a bunch of mislabelled countries
dirty['country'] = dirty['country'].replace({'china': 'China',
                                             'Central african republic': 'Central African Republic',
                                             'Congo, Democratic Republic': 'Congo, Dem. Rep.',
                                             'Democratic Republic of the Congo': 'Congo, Dem. Rep.',
                                             "Cote d'Ivore": "Cote d'Ivoire" })

dirty['country'] = dirty['country'].str.strip()  # strip whitespace
dirty['continent'] = dirty['continent'].str.strip()  # strip whitespace
# END SOLUTION

In [8]:
# TEST
clean = pd.read_csv(url_clean, delimiter='\t')
assert dirty.equals(clean)

## Exercise 3: Taking control of your categoricals
rubric={accuracy:8,reasoning:8}Explore the effects of the `pandas` `sort_values` method on the `pandas` `categorical` `dtype`. Does sorting the values of a pandas categorical column in a data frame have any effect on, say, an `altair` figure? What about the `cat.reorder_categories` method? This exploration must involve the data, the categorical's order, and some figures, as well as a written explanation of what you are doing and what you find. Choose any data set you wish to demonstrate this. We provide code for a scatter plot figure you could modify for your data set to help you out. 

In [9]:
# Load the penguins dataset
penguins = load_penguins()

# Convert the 'species' column to a categorical type
penguins['species'] = pd.Categorical(penguins['species'])

# Create the Altair chart
penguins_chart = alt.Chart(penguins).mark_point(size=60).encode(
    x=alt.X('bill_length_mm', title='Bill length (mm)'),
    y=alt.Y('body_mass_g', title='Body mass (g)'),
    color=alt.Color('species', scale=alt.Scale(scheme='category10'))
).properties(
    width=450,
    height=350
).configure_axis(
    labelFontSize=14,
    titleFontSize=18
).configure_legend(
    labelFontSize=14,
    titleFontSize=18
)

penguins_chart

In [10]:
# BEGIN SOLUTION
# Load the penguins dataset
penguins = load_penguins()

# Sort the data based on the species name into descending alphabetic order
penguins_value_sorted = penguins.sort_values('species', ascending=False)
# Create the Altair chart
penguins_chart_value_sorted = alt.Chart(penguins_value_sorted).mark_point(size=60).encode(
    x=alt.X('bill_length_mm', title='Bill length (mm)'),
    y=alt.Y('body_mass_g', title='Body mass (g)'),
    color=alt.Color('species', scale=alt.Scale(scheme='category10'))
).properties(
    width=400,
    height=300
).configure_axis(
    labelFontSize=14,
    titleFontSize=18
).configure_legend(
    labelFontSize=14,
    titleFontSize=18
)
penguins_chart_value_sorted

# Work on a copy so the original `penguins` data frame is left unchanged
penguins_categorical_ordered = penguins.copy()

# Define the species order in descending alphabetical order
species_order = sorted(penguins_categorical_ordered['species'].unique(), reverse=True)

# Convert 'species' column to a categorical type with the specified order
penguins_categorical_ordered['species'] = pd.Categorical(penguins_categorical_ordered['species'], categories=species_order, ordered=True)

# Create the Altair chart
penguins_chart_categorical_ordered = alt.Chart(penguins_categorical_ordered).mark_point(size=60).encode(
    x=alt.X('bill_length_mm', title='Bill length (mm)'),
    y=alt.Y('body_mass_g', title='Body mass (g)'),
    color=alt.Color('species', scale=alt.Scale(scheme='category10'))
).properties(
    width=400,
    height=300
).configure_axis(
    labelFontSize=14,
    titleFontSize=18
).configure_legend(
    labelFontSize=14,
    titleFontSize=18
)
# END SOLUTION

In [11]:
# BEGIN SOLUTION
penguins_chart_value_sorted
# END SOLUTION

In [12]:
# BEGIN SOLUTION
penguins_chart_categorical_ordered
# END SOLUTION

# BEGIN SOLUTION
The code above gives a simple example in which `sort_values` has no effect on the figure, 
whereas changing the order of the categories does. 
In this example, we plot body mass vs. bill length from the `penguins` dataset, 
colored by a `categorical` `species` variable. 
The first plot is the original data. 
The second plot uses the data sorted by species name in descending alphabetical order; 
it shows that `sort_values` has no effect on plot aesthetics.
The third plot reorders the categories of the categorical 
by species name in descending alphabetical order 
(the code does this with `pd.Categorical(..., categories=species_order, ordered=True)`, which yields the same category order as `cat.reorder_categories` would). 
This shows that reordering has an effect on the aesthetics of the plot, namely, 
through the legend and color ordering. 
Reordering `categoricals` can be a very useful method of controlling plots.
# END SOLUTION

## Exercise 4: Two table joins cheatsheet
rubric={accuracy:8,reasoning:8}

This exercise is to help you familiarize yourself with the different _joins_ available in `pandas` using the `merge` method. First, take a look at [Jenny Bryan's cheatsheet](http://stat545.com/bit001_dplyr-cheatsheet.html). Your task is to create your own cheatsheet, covering all the joins that Jenny covers (and in both orders for joins where order has an effect) but focused on something you care about. Examples:

- Pets I have owned + breed + friendly vs. unfriendly + ??. Join to a table of pet breed, including variables for furry vs not furry, mammal true or false, etc.
- Movies and studios....
- Athletes and teams....

The data set should be tractable (think 5-7 items in each table). **You are expected to create your own data set for this question; do not use an existing data set.**

While demonstrating the joins with your data and code, also provide a narrative in written English explaining what you are doing and what is revealed through the joins. The narrative should be ~ 2-4 sentences per join scenario. **This narrative must be in your own words.**

You will likely need to iterate between your data prep and your joining to make your explorations comprehensive and interesting. For example, you will want a specific amount (or lack) of overlap between the two data frames, in order to demonstrate all the different joins. You will want both the data frames to be as small as possible, while still retaining the expository value.

You should create this cheatsheet as a separate `.ipynb` file that you render to `.md` (you can do that by clicking **File** > **Save and Export Notebook As ...** > **Markdown**). Both of these files should live in this lab2 repo, and you should paste the links (URL) to them below:

# BEGIN SOLUTION

- URL to source `.ipynb` file:
- URL to rendered `.md` file:

# END SOLUTION
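As a rough illustration of the mechanics this exercise asks about (not a model cheatsheet), the sketch below runs `merge` with several `how` values on two made-up tables that only partly overlap. The tables `movies` and `studios`, their contents, and the `row_counts` dictionary are hypothetical; writing the filtering joins with `isin` is one possible approach, chosen here as an assumption.

```python
import pandas as pd

# Hypothetical tables with partial overlap on the key column "studio".
movies = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie C"],
    "studio": ["Studio X", "Studio Y", "Studio Z"],
})
studios = pd.DataFrame({
    "studio": ["Studio X", "Studio Y", "Studio W"],
    "country": ["Canada", "Japan", "France"],
})

# Mutating joins: indicator=True adds a `_merge` column showing where each row came from.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = movies.merge(studios, on="studio", how=how, indicator=True)
    row_counts[how] = len(joined)
    print(f"--- how={how!r}: {len(joined)} rows")
    print(joined)

# Filtering joins: keep rows of `movies` with (semi) or without (anti) a match in `studios`.
has_match = movies["studio"].isin(studios["studio"])
semi = movies[has_match]
anti = movies[~has_match]
print(semi)
print(anti)
```

Having one unmatched key on each side (`Studio Z` and `Studio W` here) is what makes the four `how` values give visibly different results.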

## Exercise 5: Grouping and aggregating

rubric={autograde:16}

Use `pandas` to take the `flights` data set, from the `nycflights13` Python package, 
and obtain the average speed (in km/hr) and average distance (in km) for all flights, 
for each of the carriers AA, AS, UA and US.
Name these new columns `carrier_avg_speed` and `carrier_avg_distance_km`, 
and round the values so that the answer is a whole number (i.e., no decimal points). 
Convert the carrier acronyms to their full names 
(American Airlines, Alaska Airlines, United Airlines and US Airways). 
Sort the results in ascending order according to `carrier_avg_speed`. 
Name the data frame `carrier_avg_flights`.

Some hints:
- The distance is in miles and air time is in minutes in the `flights` data. 
- You will have to create a column that holds the average speed for each flight before you can do this for each carrier.
- You may also need to handle `NA` entries in the data.
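The hints above can be sketched on a small made-up table before working with the real `flights` data. The data frame `toy`, its values, the constant name `MILES_TO_KM`, and the output column names are assumptions for illustration; the conversion factor is the international mile in kilometres.

```python
import numpy as np
import pandas as pd

MILES_TO_KM = 1.609344  # one international mile in kilometres

# Hypothetical flights: distance in miles, air_time in minutes (one value missing).
toy = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "UA"],
    "distance": [100.0, 200.0, 300.0, 300.0],
    "air_time": [30.0, 60.0, 60.0, np.nan],
})

# Per-flight quantities first: distance in km, then speed in km/h.
toy["distance_km"] = toy["distance"] * MILES_TO_KM
toy["speed_kmh"] = toy["distance_km"] / (toy["air_time"] / 60)

# Named aggregation per carrier; "mean" skips missing values by default.
summary = (
    toy.groupby("carrier")
    .agg(avg_speed=("speed_kmh", "mean"), avg_distance_km=("distance_km", "mean"))
    .round()
    .sort_values("avg_speed")
    .reset_index()
)
print(summary)
```

Averaging per-flight speeds is not the same as dividing total distance by total air time, so the order of operations here follows the second hint.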

In [13]:
carrier_avg_flights = None

# BEGIN SOLUTION
# Filtering and mutating
flights_filtered = flights[flights['carrier'].isin(['AA', 'AS', 'UA', 'US'])].copy()

flights_filtered['carrier'] = flights_filtered['carrier'].map({
    'AA': 'American Airlines',
    'AS': 'Alaska Airlines',
    'UA': 'United Airlines',
    'US': 'US Airways'
})

flights_filtered['distance_km'] = flights_filtered['distance'] * 1.6093
flights_filtered['avg_speed'] = flights_filtered['distance_km'] / (flights_filtered['air_time'] / 60)

flights_filtered
# Grouping and aggregating
carrier_avg_flights = flights_filtered.groupby('carrier').agg(
    carrier_avg_speed=('avg_speed', 'mean'), carrier_avg_distance_km=('distance_km', 'mean')
).reset_index()

# Rounding and sorting
carrier_avg_flights = carrier_avg_flights.round().sort_values(by='carrier_avg_speed')

# END SOLUTION
carrier_avg_flights

| | carrier | carrier_avg_speed | carrier_avg_distance_km |
|---|---|---|---|
| 2 | US Airways | 550.0 | 891.0 |
| 1 | American Airlines | 672.0 | 2157.0 |
| 3 | United Airlines | 677.0 | 2461.0 |
| 0 | Alaska Airlines | 714.0 | 3866.0 |


The tests below only check that the object has the correct names. The other tests are intentionally hidden.

In [14]:
# TEST
# Checking if the DataFrame is defined
assert carrier_avg_flights is not None, "The DataFrame 'carrier_avg_flights' is not defined."

# Checking column names
expected_columns = ['carrier', 'carrier_avg_speed', 'carrier_avg_distance_km']
assert list(carrier_avg_flights.columns) == expected_columns, f"Column names are incorrect. Expected: {expected_columns}, but got: {list(carrier_avg_flights.columns)}."

In [15]:
# HIDDEN TEST
# Checking object type
assert isinstance(carrier_avg_flights, pd.DataFrame), "The object 'carrier_avg_flights' is not a DataFrame."

# Checking column names
expected_columns = ['carrier', 'carrier_avg_speed', 'carrier_avg_distance_km']
assert list(carrier_avg_flights.columns) == expected_columns, f"Column names are incorrect. Expected: {expected_columns}, but got: {list(carrier_avg_flights.columns)}."

# Checking column data types

expected_dtypes = {
    'carrier': np.dtype('O'),
    'carrier_avg_speed': (np.dtype('float64'), np.dtype('int64')),
    'carrier_avg_distance_km': (np.dtype('float64'), np.dtype('int64'))
}

actual_dtypes = carrier_avg_flights.dtypes.to_dict()

# Checking if the actual dtypes are among the expected types
assert all(actual_dtypes[col] in expected_dtypes[col] if isinstance(expected_dtypes[col], tuple) else actual_dtypes[col] == expected_dtypes[col]
           for col in expected_columns), f"Column types are incorrect. Expected: {expected_dtypes}, but got: {actual_dtypes}."

# Checking DataFrame shape
expected_shape = (4, 3) 
assert carrier_avg_flights.shape == expected_shape, f"DataFrame shape is incorrect. Expected: {expected_shape}, but got: {carrier_avg_flights.shape}."

# Checking if the 'carrier' column is in the correct order
expected_order = carrier_avg_flights.sort_values(by='carrier_avg_speed', ascending=True)['carrier'].tolist()
actual_order = carrier_avg_flights['carrier'].tolist()
assert actual_order == expected_order, f"The 'carrier' column is not in the correct order. Expected: {expected_order}, but got: {actual_order}."

# Checking the sum of numerical columns with a tolerance
tolerance = 5
expected_sums = {
    'carrier_avg_speed': 2613,  # 550 + 672 + 677 + 714
    'carrier_avg_distance_km': 9375  # 891 + 2157 + 2461 + 3866
}
actual_sums = carrier_avg_flights[expected_columns[1:]].sum().to_dict()
for col in expected_sums:
    assert abs(actual_sums[col] - expected_sums[col]) <= tolerance, (
        f"Sum of column '{col}' is incorrect. "
        f"Expected: {expected_sums[col]} ± {tolerance}, but got: {actual_sums[col]}."
    )

## Exercise 6: CHALLENGING
rubric={accuracy:5}

Warning: This exercise is challenging and could be time-consuming. Please only attempt if you find yourself finishing the assignment early and you want a bit more of a challenge.

In this exercise, you will need to use regular expressions and the Python `re` library to parse a text file containing a collection of 10 "Nigerian" Fraud Letters, dating from 1998 to 2007. This is a subset of a data set [retrieved from Kaggle](https://www.kaggle.com/datasets/rtatman/fraudulent-email-corpus?resource=download).

We would like you to create a dataframe with the following columns:
- `senders` (containing the email addresses the emails were sent from)
- `recipients` (containing the email addresses the emails were sent to)

In the code below, we have read in the text file for you as a single string. 

Some useful resources:
- You can test out your regex expression in real-time using https://regex101.com 
- Documentation on regex flags https://docs.python.org/3/library/re.html#flags
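To get a feel for capture groups and flags before tackling the real file, here is a minimal sketch on a made-up two-message snippet. The `sample` text, its header layout, and the `address` pattern are assumptions for illustration; the actual corpus may be formatted differently and need a different pattern.

```python
import re

# Hypothetical two-message snippet (not taken from the lab's data file).
sample = (
    'From: "A. Sender" <a.sender@example.com>\n'
    "Reply-To: other@example.org\n"
    "To: first.reader@example.net\n"
    "\n"
    "From: <b.sender@example.com>\n"
    "To: second.reader@example.net\n"
)

# A deliberately simple address pattern: local part, "@", domain, dot, top-level domain.
address = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# re.MULTILINE lets ^ match at the start of every line, so "Reply-To:" is not read as "To:".
# With one capture group, re.findall returns only the captured text.
senders = re.findall(rf"^From:.*?<({address})>", sample, flags=re.MULTILINE)
recipients = re.findall(rf"^To:\s*({address})", sample, flags=re.MULTILINE)

print(senders)
print(recipients)
```

Anchoring on the start of the line is one alternative to a negative lookbehind for skipping `Reply-To:` headers; which is more suitable depends on how the real headers are laid out.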

In [16]:
# Open the test_emails.txt file in read mode and read its contents into a string
with open('data/test_emails.txt', 'r') as file:
    email_text = file.read()

In [19]:
# BEGIN SOLUTION
email_senders_pattern = r'From.*?<([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})>'
email_senders = re.findall(email_senders_pattern, email_text)

email_recipients_pattern = r'(?<!Reply-)To:\s*([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+)'
email_recipients = re.findall(email_recipients_pattern, email_text)

email_addresses = pd.DataFrame({
    'senders': email_senders, 
    'recipients': email_recipients
})

email_addresses
# END SOLUTION

| | senders | recipients |
|---|---|---|
| 0 | james_ngola2002@maktoob.com | webmaster@aclweb.org |
| 1 | bensul2004nng@spinfinder.com | R@M |
| 2 | obong_715@epatra.com | webmaster@aclweb.org |
| 3 | obong_715@epatra.com | webmaster@aclweb.org |
| 4 | m_abacha03@www.com | R@M |
| 5 | davidkuta@postmark.net | davidkuta@yahoo.com |
| 6 | tunde_dosumu@lycos.com | albert.acme@yahoo.ca |
| 7 | william2244drallo@maktoob.com | webmaster@aclweb.org |
| 8 | abdul_817@rediffmail.com | R@M |
| 9 | barrister_td@lycos.com | jenn235232@hotmail.com |


**Congratulations!!!** You are done the lab!!! Pat yourself on the back, and submit your lab to **GitHub** and Gradescope! Make sure you have 3 Git commits!