In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab2.ipynb")

<img src="img/dsci511_header.png" width="600">

# Lab 2: Advanced data wrangling with Pandas

## Instructions
rubric={mechanics:5}

Check off that you have read and followed each of these instructions:

- [ ] All files necessary to run your work must be pushed to your GitHub.ubc.ca repository for this lab.
- [ ] You need to have a minimum of 3 commit messages associated with your GitHub.ubc.ca repository for this lab.
- [ ] You must also submit the `.ipynb` file and the rendered PDF for this lab to Gradescope. The entire notebook must be executed so the TAs can see the results of your work.
- [ ] **There is autograding in this lab, so please do not move or rename this file. Also, do not copy and paste cells; if you need to add new cells, create them via the "Insert a cell below" button instead.**
- [ ] To ensure you do not break the autograder, remove all code for installing packages (i.e., DO NOT have `! conda install ...` or `! pip install ...` in your homework!).
- [ ] Follow the [MDS general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- [ ] <mark>This lab has hidden tests. In this lab, the visible tests are just there to ensure you create an object with the correct name. The remaining tests are hidden intentionally. This is so you get practice deciding when you have written the correct code and created the correct data object. This is a necessary skill for data scientists, and if we were to provide robust visible tests for all questions you would not develop this skill, or at least not to its full potential.</mark>


## Code Quality
rubric={quality:5}

The code that you write for this assignment will be given one overall grade for code quality; see our code quality rubric as a guide to what we are looking for. Also, for this course (and other MDS courses that use Python), we are trying to follow the PEP 8 code style. There is a guide you can refer to: https://peps.python.org/pep-0008/

Each code question will also be assessed for code accuracy (i.e., does it do what it is supposed to do?).

## Writing 
rubric={writing:5}

To get the marks for this writing component, you should:

- Use proper English, spelling, and grammar throughout your submission (the non-coding parts).
- Be succinct. This means being specific about what you want to communicate, without being superfluous.


## Let's get started!

Run the cell below to load the packages needed for this lab.

In [None]:
import pandas as pd
import numpy as np
import altair as alt
import re
from palmerpenguins import load_penguins # penguins data set for exercise 3
from nycflights13 import flights # flights data set for exercise 5

## Exercise 1: Working with dates

rubric={autograde:16}

In our recent past, we experienced the COVID-19 global pandemic.
With your new data science skills, you can start to look at
and visualize the data about this impactful pandemic yourself.
Let's look at cumulative confirmed cases in British Columbia
over a 3-month period from 2021/06/10 to 2021/09/10
(it was a peak COVID season then).
We can obtain such data from the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19). In particular, use Python to load the `time_series_covid19_confirmed_global.csv` file located at the URL: [https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv](https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv). Then you will need to filter these global data for records from the province of British Columbia
during the time interval 2021/06/10 to 2021/09/10. Note that you might need to tidy the data before you filter the time interval.
We provide the data visualization code;
your task is to ensure the dataframe is suitable for visualization.

The final data set for data visualization should be named `bc_covid19_confirmed_3_months`
and have only the following two columns:

- one named `date`, whose `dtype` should be a datetime type (e.g., `datetime64[ns]`)
- one named `confirmed_cases`, whose `dtype` should be `int64`
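
As a rough illustration of the tidy-then-filter idea described above, the sketch below reshapes a small made-up wide table into long form and parses the dates. The table, the column names (`region`, `count`), and the `m/d/yy` header format are assumptions for this toy case; check how the real file is laid out before adapting anything.

```python
import pandas as pd

# Hypothetical wide table: one column per date (not the real COVID-19 file)
wide = pd.DataFrame(
    {
        "region": ["A", "B"],
        "6/10/21": [10, 3],
        "6/11/21": [12, 4],
        "6/12/21": [15, 4],
    }
)

# Melt the date columns into rows, then parse the strings as datetimes
long = wide.melt(id_vars="region", var_name="date", value_name="count")
long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")

# Keep one region and an inclusive date interval
in_range = long["date"].between("2021-06-11", "2021-06-12")
subset = long.loc[(long["region"] == "A") & in_range, ["date", "count"]]
subset = subset.reset_index(drop=True)
print(subset.dtypes)
```

Parsing the dates before filtering matters here: comparing the raw header strings would order them alphabetically rather than chronologically.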

In [None]:
url = "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
time_series_covid19_confirmed_global = None
three_months = None
bc_covid19_confirmed_3_months = None
...

In [None]:
bc_covid19_confirmed_3_months

In [None]:
grader.check("ex1")

Let’s now visualize how cases changed over those three months in British Columbia by viewing cumulative cases per day:

In [None]:
three_months_BC = alt.Chart(bc_covid19_confirmed_3_months).mark_bar(color='lightblue').encode(
    x=alt.X('date:T', title='2021'),  # Specify type as Temporal for dates
    y=alt.Y('confirmed_cases:Q', title='Cumulative COVID-19 cases')
).properties(
    title='BC COVID-19 cases started to increase more rapidly in August and September of 2021'
).configure_axis(
    grid=False
).configure_view(
    strokeWidth=0
)

three_months_BC

<!-- BEGIN QUESTION -->

## Exercise 2: Working with strings

rubric={accuracy:16}You have learned about string operations combined with regular expressions.    Now you are ready to apply them to the real world of data cleaning!    Your goal is to load in "dirty Gapminder" as a dataframe called `dirty` and "clean Gapminder" as a dataframe called `clean`, and wrangle `dirty` until it is the same as `clean`:
- Dirty Gapminder: <https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear_dirty.txt>
- Clean Gapminder: <https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear.txt>

A test has been provided to check that `dirty` is the same as `clean`. Things you might want to do to clean up `dirty`:

- Check that `dirty` and `clean` have the same columns;
- Check if there is any missing data; if there is (NaNs or empty strings), fill it with sensible values;
- Check for things like capitalization, spelling, etc.;
- There may be entries that appear to have the exact same spelling and capitalization in both `dirty` and `clean`, but still don't match... Extra whitespace is often a frustrating (and invisible) problem when wrangling text data. You can use `Series.str.strip()` to trim any additional unwanted whitespace around a string.
- At any time, you can check which rows in `dirty` are not equal to `clean` using something like: `dirty[dirty.ne(clean).any(axis=1)]`.
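
The kinds of checks listed above can be sketched on a tiny made-up column. Everything here (the `messy` and `target` frames, the fill value, the trailing-underscore pattern) is a hypothetical stand-in, not the Gapminder data or the specific problems it contains.

```python
import pandas as pd

# Hypothetical messy and reference columns (not the Gapminder data)
messy = pd.DataFrame({"region": [" asia", "Europe ", "africa_", None]})
target = pd.DataFrame({"region": ["Asia", "Europe", "Africa", "Asia"]})

cleaned = messy.copy()
cleaned["region"] = (
    cleaned["region"]
    .fillna("Asia")                       # fill value is an assumption for this toy case
    .str.strip()                          # drop leading/trailing whitespace
    .str.replace(r"_+$", "", regex=True)  # remove trailing underscores
    .str.title()                          # normalize capitalization
)

# Rows that still differ from the reference (empty once cleaning is complete)
mismatches = cleaned[cleaned.ne(target).any(axis=1)]
print(len(mismatches))
```

Re-running the mismatch check after each cleaning step is a convenient way to see which problems remain.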

In [None]:
url_dirty = 'https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear_dirty.txt'
url_clean = 'https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear.txt'
dirty = None
clean = None
...

In [None]:
grader.check("ex2")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Exercise 3: Taking control of your categoricals
rubric={accuracy:8,reasoning:8}Explore the effects of the `pandas` `sort_values` method on the `pandas` `categorical` `dtype`. Does sorting the values of a pandas categorical column in a data frame have any effect on, say, an `altair` figure? What about the `cat.reorder_categories` method? This exploration must involve the data, the categorical's order, and some figures, as well as a written explanation of what you are doing and what you find. Choose any data set you wish to demonstrate this. We provide code for a scatter plot figure you could modify for your data set to help you out. 

_Type your answer here, replacing this text._

In [None]:
# Load the penguins dataset
penguins = load_penguins()

# Convert the 'species' column to a categorical type
penguins['species'] = pd.Categorical(penguins['species'])

# Create the Altair chart
penguins_chart = alt.Chart(penguins).mark_point(size=60).encode(
    x=alt.X('bill_length_mm', title='Bill length (mm)'),
    y=alt.Y('body_mass_g', title='Body mass (g)'),
    color=alt.Color('species', scale=alt.Scale(scheme='category10'))
).properties(
    width=450,
    height=350
).configure_axis(
    labelFontSize=14,
    titleFontSize=18
).configure_legend(
    labelFontSize=14,
    titleFontSize=18
)

penguins_chart

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Exercise 4: Two table joins cheatsheet
rubric={accuracy:8,reasoning:8}

This exercise is to help you familiarize yourself with the different _joins_ available in `pandas` using the `merge` method. First, take a look at [Jenny Bryan's cheatsheet](http://stat545.com/bit001_dplyr-cheatsheet.html). Your task is to create your own cheatsheet, covering all the joins that Jenny covers (and in both orders for joins where order has an effect) but focused on something you care about. Examples:

- Pets I have owned + breed + friendly vs. unfriendly + ??. Join to a table of pet breed, including variables for furry vs not furry, mammal true or false, etc.
- Movies and studios....
- Athletes and teams....

The data set should be tractable (think 5-7 items in each table). **You are expected to create your own data set for this question; do not use an existing data set.**

While demonstrating the joins with your data and code, also provide a narrative in written English explaining what you are doing and what is revealed through the joins. The narrative should be ~ 2-4 sentences per join scenario. **This narrative must be in your own words.**

You will likely need to iterate between your data prep and your joining to make your explorations comprehensive and interesting. For example, you will want a specific amount (or lack) of overlap between the two data frames, in order to demonstrate all the different joins. You will want both the data frames to be as small as possible, while still retaining the expository value.

You should create this cheatsheet as a separate `.ipynb` file that you render to `.md` (you can do that by clicking **File** > **Save and Export Notebook As ...** > **Markdown**). Both of these files should live in this lab2 repo, and you should paste the links (URL) to them below:
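
For orientation only, the sketch below shows how the `how` argument of `merge` changes which rows survive, using two made-up tables (`books` and `authors` are hypothetical; your cheatsheet needs its own data and narrative). The last two lines show one possible way to express an anti-join with the `indicator` flag, which is an assumption about approach rather than the only option.

```python
import pandas as pd

# Hypothetical toy tables (make your own for the cheatsheet)
books = pd.DataFrame(
    {"title": ["Dune", "Emma", "Ulysses"], "author": ["Herbert", "Austen", "Joyce"]}
)
authors = pd.DataFrame(
    {"author": ["Herbert", "Austen", "Woolf"], "born": [1920, 1775, 1882]}
)

inner = books.merge(authors, on="author", how="inner")  # rows with a match in both tables
left = books.merge(authors, on="author", how="left")    # every book, NaN where no author match
outer = books.merge(authors, on="author", how="outer")  # every row from either table

# One way to get an anti-join: flag each row's origin, then keep the unmatched ones
flagged = books.merge(authors, on="author", how="left", indicator=True)
anti = flagged.loc[flagged["_merge"] == "left_only", ["title", "author"]]
```

Note that the partial overlap between the two tables (one unmatched row on each side) is what makes the join types produce visibly different results.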

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Exercise 5: Grouping and aggregating

rubric={autograde:16}

Use `pandas` to take the `flights` data set, from the `nycflights13` Python package,
and obtain the average speed (in km/hr) and average distance (in km) for all flights, 
for each of the carriers AA, AS, UA and US.
Name these new columns `carrier_avg_speed` and `carrier_avg_distance_km`, 
and round the values so that the answer is a whole number (i.e., no decimal points). 
Convert the carrier acronyms to their full names 
(American Airlines, Alaska Airlines, United Airlines and US Airways). 
Sort the results in ascending order according to `carrier_avg_speed`. 
Name the data frame `carrier_avg_flights`.

Some hints:
- The distance is in miles and air time is in minutes in the `flights` data. 
- You will have to create a column that holds the average speed for each flight before you can do this for each carrier.
- You may also need to handle `NA` entries in the data.
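
The hints above follow a common pattern: compute a per-row quantity, then aggregate it per group. The sketch below shows that pattern on a made-up `trips` table; the column names, the labels, and the choice to drop rows with a missing duration are all assumptions for this toy case, not a statement of how the `flights` data should be handled.

```python
import pandas as pd

# Hypothetical toy trips; distance in miles and duration in minutes
trips = pd.DataFrame(
    {
        "operator": ["X", "X", "Y", "Y"],
        "distance": [100.0, 200.0, 300.0, 50.0],
        "duration": [60.0, 120.0, None, 20.0],
    }
)

KM_PER_MILE = 1.609344

summary = (
    trips.dropna(subset=["duration"])  # assumption: drop rows that cannot yield a speed
    .assign(
        distance_km=lambda d: d["distance"] * KM_PER_MILE,
        speed_kmh=lambda d: d["distance_km"] / (d["duration"] / 60),
    )
    .groupby("operator", as_index=False)
    .agg(avg_speed=("speed_kmh", "mean"), avg_distance_km=("distance_km", "mean"))
    .round(0)
    .sort_values("avg_speed")
    .replace({"operator": {"X": "Operator X", "Y": "Operator Y"}})
)
print(summary)
```

Dropping incomplete rows before aggregating also removes their distances from the distance average, so it is worth deciding deliberately where missing values are handled.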

In [None]:
carrier_avg_flights = None

...
carrier_avg_flights

The tests below only check that the object has the correct names. The other tests are intentionally hidden.

In [None]:
grader.check("ex5")

<!-- BEGIN QUESTION -->

## Exercise 6: CHALLENGING
rubric={accuracy:5}

Warning: This exercise is challenging and could be time-consuming. Please only attempt it if you find yourself finishing the assignment early and you want a bit more of a challenge.

In this exercise, you will need to use regular expressions and the Python `re` library to parse a text file containing a collection of 10 "Nigerian" Fraud Letters, dating from 1998 to 2007. This is a subset of a data set [retrieved from Kaggle](https://www.kaggle.com/datasets/rtatman/fraudulent-email-corpus?resource=download).

We would like you to create a dataframe with the following columns:
- `senders` (containing the email addresses the emails were sent from)
- `recipients` (containing the email addresses the emails were sent to)

In the code below, we have read in the text file for you as a single string. 

Some useful resources:
- You can test your regular expressions in real time using https://regex101.com
- Documentation on regex flags: https://docs.python.org/3/library/re.html#flags
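
As a small illustration of why flags matter when a whole file is held in one string, the sketch below pulls addresses from made-up header lines. The `sample` text and the `From:` layout are assumptions; the real file's headers may look different and will need their own pattern.

```python
import re

# Hypothetical message headers; the real file's layout may differ
sample = (
    "From: alice@example.com\n"
    "To: bob@example.org\n"
    "Subject: hello\n"
    "\n"
    "From: carol@example.net\n"
    "To: dave@example.com\n"
)

# With re.MULTILINE, ^ matches at the start of every line, not only the start of the string
pattern = re.compile(r"^From:\s*(\S+@\S+)", flags=re.MULTILINE)
from_addresses = pattern.findall(sample)
print(from_addresses)
```

Because the pattern has a single capture group, `findall` returns just the captured addresses rather than the full matched lines.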

In [None]:
# Open the test_emails.txt file in read mode and read its contents into a string
with open('data/test_emails.txt', 'r') as file:
    email_text = file.read()

In [None]:
...

<!-- END QUESTION -->

**Congratulations!!!** You are done the lab!!! Pat yourself on the back, and submit your lab to **GitHub** and Gradescope! Make sure you have at least 3 Git commits!