# Data Cleaning

In [1]:
# As always
import pandas as pd
import numpy as np

### Building a dataframe by hand

In [2]:
some_data = [1, 2, 3, 4, 5] # list of numbers
some_more_data = ['a', 'b', 'c', 'd', 'e'] # list of letters
some_booleans = [True, False, True, True, True] # list of booleans
df = pd.DataFrame({'numbers':some_data, 
                   'letters':some_more_data, 
                   'bools':some_booleans})
df

Unnamed: 0,numbers,letters,bools
0,1,a,True
1,2,b,False
2,3,c,True
3,4,d,True
4,5,e,True


We can check out the datatype of each column using `df.info()`
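A minimal sketch of that check, rebuilding the same dataframe so the snippet runs on its own. `df.dtypes` is included as a shorter alternative that only lists the column types.

```python
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5],
                   'letters': ['a', 'b', 'c', 'd', 'e'],
                   'bools': [True, False, True, True, True]})

df.info()          # prints column names, non-null counts, and dtypes
print(df.dtypes)   # only the dtype of each column
```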

### Changing data types

In [4]:
classes  = ['CS1111', 'PSYC1010', 'CS2150', 'ECON2010', 'SOC2010']
courseforum_ratings = ['4', 3.8, '1.2', 2, '4'] # Note: Some of these are strings, floats, and ints
lou = pd.DataFrame({'courses':classes, 'ratings':courseforum_ratings})
lou

Unnamed: 0,courses,ratings
0,CS1111,4.0
1,PSYC1010,3.8
2,CS2150,1.2
3,ECON2010,2.0
4,SOC2010,4.0


Say we want to double the rating values
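The cell for this step is missing here, so the following is a reconstruction rather than the original code. It assumes the doubled values were stored in a new column named `ratings_2x`, the name used in the text below.

```python
import pandas as pd

lou = pd.DataFrame({'courses': ['CS1111', 'PSYC1010', 'CS2150', 'ECON2010', 'SOC2010'],
                    'ratings': ['4', 3.8, '1.2', 2, '4']})

# Multiply every entry by 2; strings and numbers behave differently here
lou['ratings_2x'] = lou['ratings'] * 2
print(lou)
```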

That's not what we expected. Let's drop that column

Two important things to notice here:
1. We've specified the axis on which we'd like to drop `ratings_2x`. Since it's a column, we specify `axis=1`
    - In general for axes: Rows = 0, Columns = 1
2. We assigned the dataframe that `.drop()` returns back to the original variable, because `.drop()` returns a new dataframe and leaves the original unchanged by default
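The two points above could look like this in code. This is a sketch that rebuilds `lou` with the unwanted `ratings_2x` column first, so it runs on its own.

```python
import pandas as pd

lou = pd.DataFrame({'courses': ['CS1111', 'PSYC1010', 'CS2150', 'ECON2010', 'SOC2010'],
                    'ratings': ['4', 3.8, '1.2', 2, '4']})
lou['ratings_2x'] = lou['ratings'] * 2

# axis=1 means "drop a column"; assign the result back to keep the change
lou = lou.drop('ratings_2x', axis=1)
print(lou.columns.tolist())
```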

Let's try scaling ratings again

Before, our `ratings` column had mixed datatypes: strings, floats, and ints. 
<br>In Python, when we multiply a string, we repeat it, so we got `"44"` instead of `8` for the first entry

To change the datatype, we can convert each element in the `ratings` series
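One way to do an element-by-element conversion is to apply Python's built-in `float` to every entry. The original cell is not shown, so treat this as one possible approach, not necessarily the one used here.

```python
import pandas as pd

ratings = pd.Series(['4', 3.8, '1.2', 2, '4'], name='ratings')

# Call float() on each element, whatever type it started as
ratings_float = ratings.apply(float)
print(ratings_float)
```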

Pandas offers us some easier functions to convert data types, such as the `.astype()` function

<br> Check out what happens when we pass `'str'`, `'float'`, and `'int'` to `.astype()`
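A rough sketch of the three conversions on the same mixed ratings. The `'int'` case is wrapped in `try`/`except` because a string like `'1.2'` cannot be read directly as an integer.

```python
import pandas as pd

ratings = pd.Series(['4', 3.8, '1.2', 2, '4'], name='ratings')

as_str = ratings.astype('str')      # every entry becomes text
as_float = ratings.astype('float')  # every entry becomes a float

try:
    ratings.astype('int')
except ValueError as err:
    print('int conversion failed:', err)

# With a uniform float column, doubling behaves numerically
doubled = as_float * 2
print(doubled)
```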

## Handling missing values

Missing data is a common issue, and an incredible headache in machine learning.

<br> Take, for example, this dataframe with several `NaN` and `None` values, both of which can be used to represent missing data

In [12]:
students = ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G']
years = [1, np.nan, 3, None, 4, np.nan, 1]
majors = ['Econ', 'PolySci', 'CS', 'Phil', np.nan, 'Chemistry', np.nan]
df = pd.DataFrame({'student':students,'year' :years, 'major':majors})
df

Unnamed: 0,student,year,major
0,Student_A,1.0,Econ
1,Student_B,,PolySci
2,Student_C,3.0,CS
3,Student_D,,Phil
4,,4.0,
5,Student_F,,Chemistry
6,Student_G,1.0,


Note how pandas automatically converted `None` to `np.nan`. The latter is what we'll usually work with

### Identifying missing values

We can use `.isna()` on a dataframe or series
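For instance, on the dataframe above (rebuilt here so the snippet is self-contained), `.isna()` returns a same-shaped table of `True`/`False` flags:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G'],
    'year': [1, np.nan, 3, None, 4, np.nan, 1],
    'major': ['Econ', 'PolySci', 'CS', 'Phil', np.nan, 'Chemistry', np.nan],
})

null_mask = df.isna()             # True wherever a value is missing
print(null_mask)
print(df['year'].isna())          # also works on a single column (a series)
```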

If we wanted to count the number of nulls for each feature (aka column) in our dataset,
<br> we'd take the sum **across** rows 0 through 6. Which axis number should we specify to sum across, then?

<br> (Hint): Peep the general rule of thumb from above
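If you want to check your answer: summing the boolean mask with `axis=0` collapses the rows and leaves one count per column. A small sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G'],
    'year': [1, np.nan, 3, None, 4, np.nan, 1],
    'major': ['Econ', 'PolySci', 'CS', 'Phil', np.nan, 'Chemistry', np.nan],
})

# True counts as 1, so the sum is the number of nulls in each column
nulls_per_column = df.isna().sum(axis=0)
print(nulls_per_column)
```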

If we wanted instead to see how many null values we have for every student, we'd swap the axis:

Remember, we want a breakdown *by row*, so for each row we're counting the number of nulls *across* the columns
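A sketch of the swapped version: with `axis=1` the sum runs across the columns, giving one count per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G'],
    'year': [1, np.nan, 3, None, 4, np.nan, 1],
    'major': ['Econ', 'PolySci', 'CS', 'Phil', np.nan, 'Chemistry', np.nan],
})

nulls_per_row = df.isna().sum(axis=1)   # one entry per student row
print(nulls_per_row)
```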

### Strategy 1: Dropping rows with missing values

By far the easiest way of dealing with missing data is just dropping rows that have missing values.

By default, `dropna()` will drop rows that have a null value for *any* column
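On our example data, that default keeps only the rows that are complete in every column. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G'],
    'year': [1, np.nan, 3, None, 4, np.nan, 1],
    'major': ['Econ', 'PolySci', 'CS', 'Phil', np.nan, 'Chemistry', np.nan],
})

complete_rows = df.dropna()   # returns a new dataframe; df itself is unchanged
print(complete_rows)
```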

You might also need to drop rows only when a value is missing in one particular column. 
<br>In this case, we can pass the `subset=` argument to `.dropna()`
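For example, to drop only the rows where `major` is missing (the choice of column is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G'],
    'year': [1, np.nan, 3, None, 4, np.nan, 1],
    'major': ['Econ', 'PolySci', 'CS', 'Phil', np.nan, 'Chemistry', np.nan],
})

# Nulls in other columns (e.g. year) are left alone
has_major = df.dropna(subset=['major'])
print(has_major)
```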

### Strategy 2: Filling in missing values

We can fill in missing values in a variety of different ways. 
<br> We can use a specific value (like the mean), forward-fill, back-fill, or use a variety of more advanced imputation methods such as K-nearest neighbors (KNN)

Replacing missing values with a specific value:
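A sketch using `.fillna()`. The placeholder values `'Undeclared'` and `'Unknown'` are made up for this example; pick whatever makes sense for your data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G'],
    'year': [1, np.nan, 3, None, 4, np.nan, 1],
    'major': ['Econ', 'PolySci', 'CS', 'Phil', np.nan, 'Chemistry', np.nan],
})

# A dict lets us choose a different fill value per column
filled = df.fillna({'major': 'Undeclared', 'student': 'Unknown'})
print(filled)
```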

You can also use this to replace missing values with the mean of the values:
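For the `year` column, that could look like the sketch below. `.mean()` skips missing values by default, so the mean is taken over the four known years.

```python
import numpy as np
import pandas as pd

years = pd.Series([1, np.nan, 3, None, 4, np.nan, 1], name='year')

year_mean = years.mean()              # (1 + 3 + 4 + 1) / 4
years_filled = years.fillna(year_mean)
print(years_filled)
```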

## String processing

Often, a dataset will contain string representations of data that could be really useful if you could find some way to extract it. 

<br> Let's start off with a dataframe

In [21]:
people = ['Christian Jung', 'Ishaan Dey', 'Carter Bristow', 'Chris Santamaria', 
          'Shawn Weigand', 'Jasmine Dogu', 'Nithin Vijayakumar', 'John Doe', 'Ben H.'] 
classes = ['Node Pro', 'node', 'Node lite', 'deploy', 'Source', 'Node', 'node lite', 'Source Lite', 'node lite']

courses_df = pd.DataFrame({'person':people, 'section':classes})
courses_df

Unnamed: 0,person,section
0,Christian Jung,Node Pro
1,Ishaan Dey,node
2,Carter Bristow,Node lite
3,Chris Santamaria,deploy
4,Shawn Weigand,Source
5,Jasmine Dogu,Node
6,Nithin Vijayakumar,node lite
7,John Doe,Source Lite
8,Ben H.,node lite


It'd be great if we could work with just the first names of everyone. 
With normal Python strings, this is pretty easy to do using the `.split()` function:
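A quick illustration on a single name from the table:

```python
full_name = 'Christian Jung'

parts = full_name.split()   # splits on whitespace by default
first_name = parts[0]
print(parts, first_name)
```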

Let's try using that to extract the first names from the column `person`

Looks like we got an error: We can't use split() on the series object directly.
<br><br> Instead, we have to "vectorize it" using `.str`  first
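A sketch of both the failing call and the `.str` version, on a shortened copy of the `person` column. Chaining `.str[0]` to pick the first piece is my addition and may differ from the original cell.

```python
import pandas as pd

people = pd.Series(['Christian Jung', 'Ishaan Dey', 'Carter Bristow'], name='person')

try:
    people.split()                    # a Series has no .split() method
except AttributeError as err:
    print('Error:', err)

split_names = people.str.split()      # each entry becomes a list of words
first_names = split_names.str[0]      # first item of each list
print(first_names)
```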

Before we move on, check out the object type of the output using `type()`

### Lambda apply functions

Lambda apply functions are a pretty helpful tool for cleaning. Here's one quick example:

Quite literally, this reads: <br>

For every element `x` in the series `courses_df.person`, take the first element of `x` and save it to a new column `first_name`
<br> In other words, we are **apply**ing the *anonymous* function `x[0]` for every value in the series
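The example cell itself is missing, so here is a guess at what it may have looked like. It assumes the lambda is applied to the already-split names, so that `x[0]` is the first word rather than the first character.

```python
import pandas as pd

courses_df = pd.DataFrame({'person': ['Christian Jung', 'Ishaan Dey', 'Carter Bristow']})

# x is the list of words for one person; x[0] is the first name
courses_df['first_name'] = courses_df['person'].str.split().apply(lambda x: x[0])
print(courses_df)
```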

### Changing capitalization to better process text

Let's look at how many people of the 9 here are teaching each section! 
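One way to get those counts is `.value_counts()`. This is a sketch on the `section` values from the table above, and not necessarily the call used in the original cell.

```python
import pandas as pd

sections = pd.Series(['Node Pro', 'node', 'Node lite', 'deploy', 'Source',
                      'Node', 'node lite', 'Source Lite', 'node lite'], name='section')

counts = sections.value_counts()   # one row per distinct spelling
print(counts)
```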

We should be putting `node lite` and `Node lite` together, but because of a mismatch in cases, they're being counted as separate values.

<br>An easy way to solve this is by converting all the text to a uniform case
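For example, lowercasing with `.str.lower()` before counting (a sketch on the same `section` values):

```python
import pandas as pd

sections = pd.Series(['Node Pro', 'node', 'Node lite', 'deploy', 'Source',
                      'Node', 'node lite', 'Source Lite', 'node lite'], name='section')

lower_counts = sections.str.lower().value_counts()
print(lower_counts)
```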

**Try it out**: How many folks are staffing the overall node course offering (i.e. Node, Node Pro, Node Lite)?

<br>Hint: Google the documentation for `pd.Series.str`
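If you want to compare against your own attempt, here is one possible approach (there are others): lowercase the text, then flag every section whose name contains `node` using `.str.contains()`.

```python
import pandas as pd

sections = pd.Series(['Node Pro', 'node', 'Node lite', 'deploy', 'Source',
                      'Node', 'node lite', 'Source Lite', 'node lite'], name='section')

is_node = sections.str.lower().str.contains('node')   # boolean flag per row
node_staff = int(is_node.sum())
print(node_staff)
```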

## Date & Time processing

In [31]:
presidents = ['Washington' ,'Lincoln', 'Kennedy', 'Obama', 'Trump']
birthdays = ['Feb 27 1732', '2-12-1809', 'May 29th, 1917', '8 4 1961','06//14// //1946' ]

bdays = pd.DataFrame({'president': presidents, 'birthday': birthdays})
bdays

Unnamed: 0,president,birthday
0,Washington,Feb 27 1732
1,Lincoln,2-12-1809
2,Kennedy,"May 29th, 1917"
3,Obama,8 4 1961
4,Trump,06//14// //1946


Yikes! Let's see if we can clean up the time series data using `pd.to_datetime`
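A minimal sketch of the call, using two of the birthday strings from above. It parses each string separately with `.apply()`, which is an assumption on my part: depending on your pandas version, passing a whole column of mixed formats in a single call may raise an error or need extra arguments.

```python
import pandas as pd

birthdays = pd.Series(['Feb 27 1732', '2-12-1809'], name='birthday')

# Parse one string at a time so each format is inferred independently
parsed = birthdays.apply(pd.to_datetime)
print(parsed)
```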

As you can see, `pd.to_datetime` is pretty powerful. It can read in quite a few time formats as strings, then convert them into a series of `Timestamp` values

Unfortunately, there are some formats `pd.to_datetime()` won't recognize on its own

In [34]:
presidents = ['Washington' ,'Lincoln', 'Kennedy']
birthdays = ['2###27adjf1732', '2###12adjf1809', '5###05adjf1917']

bdays_messy = pd.DataFrame({'president': presidents, 'birthday': birthdays})
bdays_messy

Unnamed: 0,president,birthday
0,Washington,2###27adjf1732
1,Lincoln,2###12adjf1809
2,Kennedy,5###05adjf1917


However, we CAN specify the format ourselves! 
<br><br>Look up the documentation (google!) to see if we can pass any parameters to help it along
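One option is the `format=` parameter, which takes `strftime`-style codes. The sketch below assumes the messy strings follow the pattern month, `###`, day, `adjf`, year.

```python
import pandas as pd

bdays_messy = pd.DataFrame({'president': ['Washington', 'Lincoln', 'Kennedy'],
                            'birthday': ['2###27adjf1732', '2###12adjf1809', '5###05adjf1917']})

# %m = month, %d = day, %Y = four-digit year; everything else is matched literally
bdays_messy['birthday'] = pd.to_datetime(bdays_messy['birthday'], format='%m###%dadjf%Y')
print(bdays_messy)
```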

### Using pandas datetime objects

We can pull quite a lot just from a datetime timestamp using attributes

In [37]:
washington = bdays.at[0, 'birthday'] # Taking the value for washington's bday
print(washington) # the raw timestamp

1732-02-27 00:00:00
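A few of the attributes available on a single `Timestamp`, sketched here with the same date built directly so the snippet runs on its own:

```python
import pandas as pd

washington = pd.Timestamp('1732-02-27')

print(washington.year)        # calendar year
print(washington.month)       # month as a number
print(washington.day)         # day of the month
print(washington.day_name())  # weekday name as a string
```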


### Make new columns from these datetime attributes 

Let's use this to make new columns that reflect these attributes:
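On a whole column, the same attributes are reached through the `.dt` accessor. A sketch for the year (the column name `year` is my choice; the dates are the cleaned birthdays written in ISO form):

```python
import pandas as pd

bdays = pd.DataFrame({
    'president': ['Washington', 'Lincoln', 'Kennedy', 'Obama', 'Trump'],
    'birthday': pd.to_datetime(['1732-02-27', '1809-02-12', '1917-05-29',
                                '1961-08-04', '1946-06-14']),
})

bdays['year'] = bdays['birthday'].dt.year   # one year per row
print(bdays)
```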

Try it out with the others:

### Filtering with datetimes

Let's say we want to subset the dataframe for presidents born *before* WWI started (July 28th, 1914)
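Datetime columns can be compared against a `Timestamp` like numbers, so the filter might be sketched as:

```python
import pandas as pd

bdays = pd.DataFrame({
    'president': ['Washington', 'Lincoln', 'Kennedy', 'Obama', 'Trump'],
    'birthday': pd.to_datetime(['1732-02-27', '1809-02-12', '1917-05-29',
                                '1961-08-04', '1946-06-14']),
})

wwi_start = pd.Timestamp('1914-07-28')
born_before_wwi = bdays[bdays['birthday'] < wwi_start]   # boolean mask filter
print(born_before_wwi)
```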

We can also do some quick calculations quite easily:

<br> For example, how much older was Washington than Lincoln?

Note: this returns a `Timedelta` object, not `Timestamp`. We can get similar attributes
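A sketch of the subtraction and a couple of `Timedelta` attributes, using the two birthdays from the table:

```python
import pandas as pd

washington = pd.Timestamp('1732-02-27')
lincoln = pd.Timestamp('1809-02-12')

age_gap = lincoln - washington     # Timestamp minus Timestamp gives a Timedelta
print(type(age_gap))
print(age_gap.days)                # whole days between the two dates
print(age_gap.total_seconds())     # the same gap expressed in seconds
```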

### How can we apply this?

When you get data that includes time as a variable, it'll be in one of many possible formats, and not always consistent throughout the whole dataset. 


`pd.to_datetime` makes the process of cleaning these incredibly easy!

Once cleaned, we can look at specific attributes such as month, day, and year **to gain insight we wouldn't otherwise have been able to access.**

There's a lot, lot more you can do with pandas datetimes - use business days, adjust for time zones - just about anything you'd imagine.

The docs for all of that are linked here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#overview

## Merging DataFrames

Merging sources of data is super important:
    
<br> Sometimes you have data from two different sources that you'd like to have in one data frame to analyze. We can do that with `.concat()` and `.merge()`

In [49]:
ratings_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'], 
                        'Rating': [4, 3, 5, 4, 4, 3]})
ratings_df = ratings_df.sort_values(by='Rating')
ratings_df

Unnamed: 0,Dining,Rating
1,Newcomb,3
5,Subway,3
0,Castle,4
3,Five Guys,4
4,Runk,4
2,Chick-fil-A,5


In [50]:
prices_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'], 
                        'Price': [8.5, 9.5, 6.5, 7.75, 9.5, 6]})
prices_df = prices_df.sort_values(by='Price')
prices_df

Unnamed: 0,Dining,Price
5,Subway,6.0
2,Chick-fil-A,6.5
3,Five Guys,7.75
0,Castle,8.5
1,Newcomb,9.5
4,Runk,9.5


In [51]:
locations_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'],
                            'Location': ['Old dorms', 'Central Grounds', 'Central Grounds', 'Central Grounds', 'Gooch-Dillard', 'Central Grounds']})
locations_df = locations_df.sort_values(by='Location')
locations_df

Unnamed: 0,Dining,Location
1,Newcomb,Central Grounds
2,Chick-fil-A,Central Grounds
3,Five Guys,Central Grounds
5,Subway,Central Grounds
4,Runk,Gooch-Dillard
0,Castle,Old dorms


Note that each one of these dataframes has a column in common, `Dining`.

The order of the values may not be the same, but we're still good to go

### pd.merge()

Instead of working with three distinct dataframes, let's combine them into one df

To do so, we can call `.merge()` on two data tables and specify the column on which to merge as `on=`
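A minimal sketch with the ratings and prices tables (rebuilt and sorted as above, so the row orders differ). The name `merged` is just for this example.

```python
import pandas as pd

dining = ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway']
ratings_df = pd.DataFrame({'Dining': dining, 'Rating': [4, 3, 5, 4, 4, 3]}).sort_values(by='Rating')
prices_df = pd.DataFrame({'Dining': dining, 'Price': [8.5, 9.5, 6.5, 7.75, 9.5, 6]}).sort_values(by='Price')

# Rows are matched on the value in 'Dining', not on their position
merged = ratings_df.merge(prices_df, on='Dining')
print(merged)
```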

By chaining `.merge()` calls, we can combine more than two dataframes
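For the three dining tables, a chained version might look like this sketch (`dining_df` is a name picked for the example):

```python
import pandas as pd

dining = ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway']
ratings_df = pd.DataFrame({'Dining': dining, 'Rating': [4, 3, 5, 4, 4, 3]})
prices_df = pd.DataFrame({'Dining': dining, 'Price': [8.5, 9.5, 6.5, 7.75, 9.5, 6]})
locations_df = pd.DataFrame({'Dining': dining,
                             'Location': ['Old dorms', 'Central Grounds', 'Central Grounds',
                                          'Central Grounds', 'Gooch-Dillard', 'Central Grounds']})

# Each .merge() returns a dataframe, so the next .merge() can be called on it
dining_df = ratings_df.merge(prices_df, on='Dining').merge(locations_df, on='Dining')
print(dining_df)
```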

### Join logic

In [54]:
more_ratings_df = pd.DataFrame({'Dining': ["Newcomb", "Subway", "Starbucks", "Burrito Theory", "O'Hill"], 
                                'Rating': [3, 3, 4, 4, 3]})
more_ratings_df

Unnamed: 0,Dining,Rating
0,Newcomb,3
1,Subway,3
2,Starbucks,4
3,Burrito Theory,4
4,O'Hill,3


With the previous merges, we had the same number of observations in every dataframe.

<br>With some merges, not every row may align. Let's try to merge `more_ratings_df` with `prices_df`. Note how there are some dining halls in common, and some unique to each

In [55]:
print(more_ratings_df.Dining.unique())
print(prices_df.Dining.unique())

# Only Newcomb and Subway are common to both

['Newcomb' 'Subway' 'Starbucks' 'Burrito Theory' "O'Hill"]
['Subway' 'Chick-fil-A' 'Five Guys' 'Castle' 'Newcomb' 'Runk']


We can do two different merges now: 
1. If we want to retain **only** those in common, we use an `inner` join
2. If we want to keep **everything**, and keep placeholders for missing data, we use an `outer` join
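Both joins are selected with the `how=` argument of `.merge()`. A sketch on `more_ratings_df` and `prices_df`:

```python
import pandas as pd

more_ratings_df = pd.DataFrame({'Dining': ['Newcomb', 'Subway', 'Starbucks', 'Burrito Theory', "O'Hill"],
                                'Rating': [3, 3, 4, 4, 3]})
prices_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'],
                          'Price': [8.5, 9.5, 6.5, 7.75, 9.5, 6]})

inner = more_ratings_df.merge(prices_df, on='Dining', how='inner')  # only names in both
outer = more_ratings_df.merge(prices_df, on='Dining', how='outer')  # every name, NaN where missing
print(inner)
print(outer)
```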

### pd.concat()

Another *similar* function is `pd.concat()` 

It's a little different from `.merge()`, since we'll have to pass in a `list` of dataframes instead
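A sketch of that call on the ratings and prices tables. By default `pd.concat()` stacks along the rows, so the result has twelve rows and gaps where a column is absent from one of the inputs.

```python
import pandas as pd

dining = ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway']
ratings_df = pd.DataFrame({'Dining': dining, 'Rating': [4, 3, 5, 4, 4, 3]})
prices_df = pd.DataFrame({'Dining': dining, 'Price': [8.5, 9.5, 6.5, 7.75, 9.5, 6]})

stacked = pd.concat([ratings_df, prices_df])   # note the list of dataframes
print(stacked)
```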

That didn't quite work as expected, because `concat()` stacked the dataframes on top of each other, instead of combining information for common rows.

Note that it **didn't combine rows** when `merge()` easily could have.

One example of when `concat()` is appropriate is when we want to add more information to a dataframe, but the **rows are different** between the two

In [75]:
# Quick helper
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style=\"display:inline\"'),raw=True)

In [76]:
display_side_by_side(ratings_df,more_ratings_df)

Unnamed: 0,Dining,Rating
1,Newcomb,3
5,Subway,3
0,Castle,4
3,Five Guys,4
4,Runk,4
2,Chick-fil-A,5

Unnamed: 0,Dining,Rating
0,Newcomb,3
1,Subway,3
2,Starbucks,4
3,Burrito Theory,4
4,O'Hill,3
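Stacking the two tables shown above could be sketched as follows. `ignore_index=True` renumbers the rows from 0, and the `.drop_duplicates()` step is my addition, since Newcomb and Subway appear in both tables with the same rating.

```python
import pandas as pd

ratings_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'],
                           'Rating': [4, 3, 5, 4, 4, 3]})
more_ratings_df = pd.DataFrame({'Dining': ['Newcomb', 'Subway', 'Starbucks', 'Burrito Theory', "O'Hill"],
                                'Rating': [3, 3, 4, 4, 3]})

all_ratings = pd.concat([ratings_df, more_ratings_df], ignore_index=True)
unique_ratings = all_ratings.drop_duplicates()   # removes the repeated rows
print(unique_ratings)
```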


`concat()` can also horizontally stack dataframes, using the `axis=1` argument. 

Here's a case where it might be useful:

In [80]:
more_info_df = pd.DataFrame({'Popularity': [8, 5, 10, 7, 8, 7], 
                             'Hours': ["7:00-9:00", "7:00-8:00", "11:00-8:00", "11:00-8:00", "7:00-8:00", "11:00-8:00"]})
more_info_df

Unnamed: 0,Popularity,Hours
0,8,7:00-9:00
1,5,7:00-8:00
2,10,11:00-8:00
3,7,11:00-8:00
4,8,7:00-8:00
5,7,11:00-8:00


Note the difference between `pd.concat(axis=1)` and `.merge()`. We would use `concat()` when the dataframes share no column to match on and their rows already line up by index, and `.merge()` when they share a key column.
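A sketch of the horizontal case. It uses an unsorted copy of the ratings table, because `pd.concat(axis=1)` lines rows up by index label, so both inputs here need the same 0 to 5 index for the rows to pair up as intended.

```python
import pandas as pd

ratings_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'],
                           'Rating': [4, 3, 5, 4, 4, 3]})
more_info_df = pd.DataFrame({'Popularity': [8, 5, 10, 7, 8, 7],
                             'Hours': ['7:00-9:00', '7:00-8:00', '11:00-8:00',
                                       '11:00-8:00', '7:00-8:00', '11:00-8:00']})

# No shared key column here, so the tables are glued side by side on the index
combined = pd.concat([ratings_df, more_info_df], axis=1)
print(combined)
```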