## Your Info
__Name:__

__PDX Email:__

__Collaborators:__

# Data Cleaning Workout

The following exercises focus on cleaning data.

## 0 - Importing the Tools

Import the `pandas` library with the `pd` alias and the `numpy` library with the `np` alias.

In [86]:
## Begin Solution
import pandas as pd
import numpy as np
## End Solution

## Workout 1 - Parking Data

This activity uses a sample of the New York City Parking Violations Dataset. Imagine this data was collected by police officers, parking inspectors, or other individuals. This means the data might have some missing or incorrect information.

In this exercise, we'll use data cleaning techniques to find and handle missing values in the dataset.

### Exercise 1 - Loading the Data

__Your Task__

1. Create a DataFrame:

    * Create a DataFrame named `parking_df` from the file located at:
    
        * `data/nyc-parking-violation-sample.csv`

1. Select Specific Columns:

    * Include only the following columns in your DataFrame:

        * `Plate ID`
        
        * `Registration State`
        
        * `Vehicle Make`
        
        * `Vehicle Color`
        
        * `Violation Time`
        
        * `Street Name`

1. Explore the Data:

    * Print information about the `parking_df` DataFrame.
    * This should include:
        * Column names
        * Number of entries in each column
        * Data types of each column
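
As a warm-up, here is a minimal sketch of the same load-and-inspect pattern on a tiny in-memory CSV. The CSV text, the `Issue Date` column, and the name `sample_df` are made up for illustration; only the `read_csv`/`usecols`/`info` pattern carries over to the real file.

```python
import io

import pandas as pd

# Hypothetical three-column CSV standing in for the real parking file
csv_text = (
    "Plate ID,Issue Date,Street Name\n"
    "ABC123,06/01/2016,Broadway\n"
    ",06/02/2016,5th Ave\n"
)

# usecols keeps only the listed columns while the file is parsed
sample_df = pd.read_csv(io.StringIO(csv_text), usecols=["Plate ID", "Street Name"])

# info() prints the column names, non-null counts, and dtypes
sample_df.info()
```

The empty `Plate ID` field in the second row is read as `NaN`, so `info()` reports one fewer non-null value for that column.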

In [87]:
## Begin Solution
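FILE = "data/nyc-parking-violation-sample.csv"
# Load only the columns the exercises need
parking_df = pd.read_csv(FILE,
                         usecols=['Plate ID',
                                  'Registration State',
                                  'Vehicle Make',
                                  'Vehicle Color',
                                  'Violation Time',
                                  'Street Name'])
# The later solution cells refer to this DataFrame by the shorter name `df`
df = parking_df
parking_df.info()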
## End Solution

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 6 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Plate ID            249992 non-null  object
 1   Registration State  250000 non-null  object
 2   Vehicle Make        247146 non-null  object
 3   Violation Time      249978 non-null  object
 4   Street Name         249922 non-null  object
 5   Vehicle Color       240593 non-null  object
dtypes: object(6)
memory usage: 11.4+ MB


### Exercise 2 - Removing `NaN`

__Your Task__

* Create a new DataFrame, `cleaned_parking_df`, by removing every row that has any missing data (`NaN` values).
* How many rows remain after this pruning?
* Assume that each parking ticket is a \$100 fine. However, if a ticket has missing data, it can be successfully contested.
    * Based on the removed rows, calculate the total avoided fines for New York City citizens due to missing data.

__Note__:
* This missing-data loophole is not based on factual information and exists solely to create interest around these activities.
* I am not a lawyer. For legal advice, consult a professional.
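
A minimal sketch of the idea on a toy DataFrame (the rows and the names `toy_df`, `complete_df`, and `removed` are hypothetical): `dropna()` with no arguments drops every row that has at least one `NaN`, and the difference in length gives the number of removed rows.

```python
import numpy as np
import pandas as pd

# Hypothetical four-ticket sample; two rows each have one missing value
toy_df = pd.DataFrame({
    "Plate ID": ["ABC123", np.nan, "XYZ789", "JKL456"],
    "Vehicle Color": ["BLK", "WH", np.nan, "RD"],
})

# Drop every row containing at least one NaN
complete_df = toy_df.dropna()

# Removed rows = original size minus cleaned size
removed = len(toy_df) - len(complete_df)
print(f"Rows remaining: {len(complete_df)}")
print(f"Rows removed: {removed}")
print(f"Fines avoided: ${removed * 100}")
```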

In [88]:
## Begin Solution
original_df_size = len(df)
cleaned_parking_df = df.dropna()
cleaned_parking_df.info()
cleaned_df_size = len(cleaned_parking_df)
removed_rows = original_df_size - cleaned_df_size
print(f"Rows Remaining: {cleaned_df_size}")
print(f"Rows Removed: {removed_rows}")
print(f"Fines Avoided: ${removed_rows * 100}")
## End Solution

<class 'pandas.core.frame.DataFrame'>
Index: 238090 entries, 0 to 249999
Data columns (total 6 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Plate ID            238090 non-null  object
 1   Registration State  238090 non-null  object
 2   Vehicle Make        238090 non-null  object
 3   Violation Time      238090 non-null  object
 4   Street Name         238090 non-null  object
 5   Vehicle Color       238090 non-null  object
dtypes: object(6)
memory usage: 12.7+ MB
Rows Remaining: 238090
Rows Removed: 11910
Fines Avoided: $1191000


In [89]:
## Begin Quick Test
assert len(cleaned_parking_df) == 238_090, "❌ DataFrame size is incorrect"
print("Assertion passed: ✅ DataFrame Size is 238,090")
## End Quick Test

Assertion passed: ✅ DataFrame Size is 238,090


### Exercise 2 - Missing Data

Let's switch up the removal criteria. A ticket can only be dismissed if the license plate, state, or street name is missing.

__Your Task__

* Create a new DataFrame from `parking_df`, called `improved_parking_df`, and remove rows that contain missing values from:

    * `Plate ID`
    * `Registration State`
    * `Street Name`


* After removing these rows, how many rows are left in `improved_parking_df`?

* Assume that each parking ticket is still a \$100 fine and that a ticket with missing data can still be successfully contested.

    * Based on the removed rows, calculate the total avoided fines for New York City citizens due to missing data.

    * The result should be a more realistic value than the previous exercise.
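
To see how restricting the check to certain columns changes the result, here is a small sketch on made-up data (`toy_df` and `strict_df` are hypothetical names). With `subset`, a row is dropped only when one of the listed columns is missing, so a `NaN` elsewhere no longer removes it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one row lacks a plate, another lacks only a color
toy_df = pd.DataFrame({
    "Plate ID": ["ABC123", np.nan, "XYZ789", "JKL456"],
    "Vehicle Color": ["BLK", "WH", np.nan, "RD"],
})

# Only a missing Plate ID disqualifies a row; a missing color is tolerated
strict_df = toy_df.dropna(subset=["Plate ID"])

removed = len(toy_df) - len(strict_df)
print(f"Rows remaining: {len(strict_df)}")
print(f"Rows removed: {removed}")
```

Compared with a plain `dropna()`, which would remove two of these four rows, the subset version removes only one.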



In [90]:
## Begin Solution
improved_parking_df = df.dropna(subset=['Plate ID',
                                        'Registration State',
                                        'Street Name'])
improved_parking_df.info()
improved_df_size = len(improved_parking_df)
removed_rows = original_df_size - improved_df_size

print(f"Rows Remaining: {improved_df_size}")
print(f"Rows Removed: {removed_rows}")
print(f"Fines Avoided: ${removed_rows * 100}")
## End Solution

<class 'pandas.core.frame.DataFrame'>
Index: 249914 entries, 0 to 249999
Data columns (total 6 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Plate ID            249914 non-null  object
 1   Registration State  249914 non-null  object
 2   Vehicle Make        247065 non-null  object
 3   Violation Time      249895 non-null  object
 4   Street Name         249914 non-null  object
 5   Vehicle Color       240527 non-null  object
dtypes: object(6)
memory usage: 13.3+ MB
Rows Remaining: 249914
Rows Removed: 86
Fines Avoided: $8600


In [91]:
## Begin Quick Test
assert len(improved_parking_df) == 249914, "❌ DataFrame size is incorrect"
print("Assertion passed: ✅ DataFrame Size is 249914")
## End Quick Test

Assertion passed: ✅ DataFrame Size is 249914


### Exercise 3 - Missing License Plates

In data cleaning, we often deal with not just missing data, but also data that, while present, is invalid.

Consider a new scenario where a parking ticket can be contested and dismissed if the `Plate ID` is recorded as `BLANKPLATE`.

__Your Task__

* Clean the original DataFrame, `df`:
    * Create a new DataFrame, `blank_plates_df`, by removing all rows where the `Plate ID` column contains the value `BLANKPLATE`.
* Determine how many rows were removed from the original DataFrame.
* Based on the scenario where a `BLANKPLATE` entry allows a ticket to be successfully contested, calculate the total amount in fines that NYC citizens could have potentially avoided. Each fine is still \$100.
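
The filtering step can be sketched with a boolean mask on a toy DataFrame (the data and the names `toy_df`, `mask`, and `valid_df` are made up for illustration). One detail worth noticing: comparing `NaN` with `!=` evaluates to `True`, so rows with a missing plate are kept by this filter.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one placeholder plate and one missing plate
toy_df = pd.DataFrame({
    "Plate ID": ["ABC123", "BLANKPLATE", np.nan, "XYZ789"],
    "Street Name": ["Broadway", "5th Ave", "Canal St", "Wall St"],
})

# True for every row whose plate is not the placeholder value
mask = toy_df["Plate ID"] != "BLANKPLATE"
valid_df = toy_df[mask]

removed = len(toy_df) - len(valid_df)
print(f"Rows removed: {removed}")
print(f"Missing plates still present: {valid_df['Plate ID'].isna().sum()}")
```

If missing plates should also be excluded, the mask would need to be combined with a `dropna(subset=["Plate ID"])` step.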

In [92]:
## Begin Solution
# Keep every row whose Plate ID is not the placeholder value
blank_plates_df = df[df["Plate ID"] != "BLANKPLATE"]

blank_df_size = len(blank_plates_df)
rows_removed = original_df_size - blank_df_size
blank_plates_df.info()

print(f"Rows removed: {rows_removed}")
print(f"Fines avoided: ${rows_removed * 100}")
## End Solution

<class 'pandas.core.frame.DataFrame'>
Index: 249675 entries, 0 to 249999
Data columns (total 6 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Plate ID            249667 non-null  object
 1   Registration State  249675 non-null  object
 2   Vehicle Make        246855 non-null  object
 3   Violation Time      249653 non-null  object
 4   Street Name         249602 non-null  object
 5   Vehicle Color       240310 non-null  object
dtypes: object(6)
memory usage: 13.3+ MB
Rows removed: 325
Fines avoided: $32500


In [93]:
## Begin Quick Test
assert len(blank_plates_df) == 249_675, "❌ DataFrame is not the expected size"
print("✅ DataFrame is expected size")
## End Quick Test

✅ DataFrame is expected size


## Workout 2 - Celebrity Deaths

This activity uses a dataset of celebrities who passed away in 2016. This data was originally recorded in Wikipedia and includes the following information:

* `dateofdeath` - Date of Death
* `name` - the celebrity's name
* `age` - the age of the celebrity
* `bio` - a short biography
* `causeofdeath` - a short description explaining the cause of death

This dataset is a bit messy, so we'll have to do some cleaning in order to work with it.



### Exercise 1 - Average Age of Celebrities who died February-July 2016

* Create a DataFrame, `celeb_death_df`, from the file located at `data/celebrity_deaths_2016.csv`
* Create a new column, `month`, containing the month from the `dateofdeath` column
* Make `month` the index of `celeb_death_df`
* Sort `celeb_death_df` by the index
* Clean all non-integer values from the `age` column
* Turn the `age` column into a numeric data type
* Output statistical information about the DataFrame
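
The steps above can be sketched on a tiny made-up sample (the rows and the name `toy_df` are hypothetical, and the sketch assumes dates are stored as `YYYY-MM-DD` strings). Characters 5 and 6 of such a string hold the month, and `pd.to_numeric` with `errors="coerce"` turns anything that is not a number into `NaN`.

```python
import pandas as pd

# Hypothetical sample: one age is text rather than a number
toy_df = pd.DataFrame({
    "dateofdeath": ["2016-01-05", "2016-03-17", "2016-02-09"],
    "age": ["71", "unknown", "90"],
})

# Slice positions 5-6 of "YYYY-MM-DD" to get the two-digit month
toy_df["month"] = toy_df["dateofdeath"].str.slice(5, 7)

# Use the month as a sorted index
toy_df = toy_df.set_index("month").sort_index()

# Non-numeric ages become NaN instead of raising an error
toy_df["age"] = pd.to_numeric(toy_df["age"], errors="coerce")

print(toy_df)
```

Because the month stays a zero-padded string, sorting the index still puts the months in calendar order.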


In [137]:
## Begin Solution
# Load Data
path = "data/celebrity_deaths_2016.csv"
celeb_death_df = pd.read_csv(path)

# Drop Unnamed Columns
celeb_death_df = celeb_death_df.drop(columns = [celeb_death_df.columns[5], celeb_death_df.columns[6]])

# Create a month column
celeb_death_df["month"] = celeb_death_df["dateofdeath"].str.slice(5,7)

# Set and Sort Index
celeb_death_df = celeb_death_df.set_index("month").sort_index()

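# Drop rows with a missing age, then convert the column to a numeric type
# (values that are not numbers become NaN, so the dtype ends up float64)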
celeb_death_df = celeb_death_df.dropna(subset = ["age"])
celeb_death_df["age"] = pd.to_numeric(celeb_death_df['age'], errors="coerce")

# Output Data
celeb_death_df.describe()
celeb_death_df.info()
celeb_death_df.head()
## End Solution

<class 'pandas.core.frame.DataFrame'>
Index: 6516 entries, 01 to 12
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   dateofdeath   6516 non-null   object 
 1   name          6516 non-null   object 
 2   age           6505 non-null   float64
 3   bio           6516 non-null   object 
 4   causeofdeath  1531 non-null   object 
dtypes: float64(1), object(4)
memory usage: 305.4+ KB


| month | dateofdeath | name | age | bio | causeofdeath |
|---|---|---|---|---|---|
| 01 | 2016-01-01 | Tony Lane | 71.0 | American art director (Rolling Stone) | brain cancer |
| 01 | 2016-01-21 | Derrick Todd Lee | 47.0 | American convicted serial killer | heart disease |
| 01 | 2016-01-21 | Francis Seow | 87.0 | Singapore-born American writer and political r... | pneumonia |
| 01 | 2016-01-21 | Cabot Lyford | 90.0 | American sculptor | pulmonary distress |
| 01 | 2016-01-21 | Bogus_aw Kaczy_ski | 73.0 | Polish classical music journalist | stroke |


### Exercise 2 - Average Age of Death

Something is off with the data. We'd like you to determine the average age of death, but according to `df.describe()` the average age is 100 and the maximum age is 9394! This can't be right.

__Your Task__

* Update the DataFrame to only keep people younger than 120 years old.
* Determine the average age of death from February through July.
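
A minimal sketch of both steps on made-up numbers (the ages and the names `ages`, `plausible`, and `feb_to_jul` are hypothetical). It relies on the index being sorted, zero-padded month strings, in which case a `.loc` label slice includes both endpoints.

```python
import pandas as pd

# Hypothetical ages keyed by month string; 9394.0 is an obvious data error
ages = pd.Series(
    [70.0, 80.0, 90.0, 60.0, 9394.0],
    index=["01", "02", "07", "08", "03"],
    name="age",
).sort_index()

# Keep only plausible ages
plausible = ages[ages < 120]

# Label slices with .loc include both "02" and "07"
feb_to_jul = plausible.loc["02":"07"]
average_age = feb_to_jul.mean()

print(average_age)
```

Here the January and August entries fall outside the slice, so only the February and July values contribute to the mean.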

In [138]:
### Begin Solution

# Update DataFrame
celeb_death_df = celeb_death_df[celeb_death_df["age"] < 120]
celeb_death_df.describe()

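# Get Average Age of Death
# Note: the task asks for Feb-Jul, but the label slice below starts at "01",
# so the recorded result covers January through July.
# Use "02":"07" to restrict the mean to February through July.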
celeb_death_df.loc["01":"07", "age"].mean()

### End Solution

np.float64(77.27840765936004)

### Exercise 3 - Cause of Death

__Your Task__

* Update the DataFrame so that any `NaN` values in the `causeofdeath` column are replaced with `unknown`.
* Determine the 5 most common causes of death.
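
As a rough illustration on made-up values (the Series contents and the names `causes` and `counts` are hypothetical), `fillna` replaces the missing entries, `str.strip` removes stray whitespace so that `" cancer"` and `"cancer"` are counted together, and `value_counts` sorts the result from most to least frequent.

```python
import numpy as np
import pandas as pd

# Hypothetical causes, including missing values and a stray leading space
causes = pd.Series(
    ["cancer", np.nan, " cancer", "stroke", np.nan, np.nan],
    name="causeofdeath",
)

# Fill missing values, normalize whitespace, then count each label
counts = causes.fillna("unknown").str.strip().value_counts()

# The first five rows are the five most common causes
print(counts.head(5))
```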

In [142]:
## Begin Solution

# Replace NaN with Unknown (assign returns a new DataFrame, so no chained assignment)
celeb_death_df = celeb_death_df.assign(
    causeofdeath=celeb_death_df["causeofdeath"].fillna("unknown")
)

# Get Value Counts Series
causes_of_death = celeb_death_df["causeofdeath"].str.strip().value_counts()

# Output 5 Most Common Causes of Death
print(causes_of_death.head(5))
## End Solution

causeofdeath
unknown                               4953
cancer                                 257
heart attack                           135
traffic collision                       56
lung cancer                             52
                                      ... 
pancreatic cancer and stroke             1
Celtic)                                  1
cardiovascular and renal disease         1
entomologist and plant pathologist       1
liver and renal failure                  1
Name: count, Length: 324, dtype: int64
4953


In [152]:
## Quick Test
assert causes_of_death.isnull().sum() == 0, "NaN values still exist in the 'causeofdeath' column"
assert causes_of_death["cancer"] == 257, "Cancer count may be incorrect"
assert causes_of_death["unknown"] == 4953, "Unknown count may be incorrect"
print("3/3 Tests Passed")
## End Test

3/3 Tests Passed
