# Handling Missing Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
val1 = None

val1 is None

True

In [None]:
val1*5

In [3]:
vals1 = np.array([1,None, 3, 4])
vals1*5

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [4]:
vals1

array([1, None, 3, 4], dtype=object)

In [5]:
np.sum(vals1)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

### NaN: Missing numerical data

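`NaN` stands for Not-a-Number. It is a special floating-point value, so NumPy can store an array that contains it with a native float dtype instead of `object`.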

In [6]:
vals1 = np.array([1,np.nan, 3, 4])
vals1*5

array([  5.,  nan,  15.,  20.])

In [7]:
vals1.dtype

dtype('float64')

In [8]:
np.sum(vals1)

nan

**The sum of any number and `NaN` is `NaN`.**

### np.nansum

`np.nansum` ignores `NaN` values (in effect treating them as zero) when adding the elements of an array.

In [9]:
np.nansum(vals1)

8.0
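
NumPy offers NaN-aware versions of several other aggregations as well. The following is a minimal sketch on a made-up array; note that `np.nanmean` excludes the `NaN` entry from the count instead of treating it as zero.

```python
import numpy as np

# Hypothetical sample array with one missing entry
vals = np.array([1, np.nan, 3, 4])

nan_total = np.nansum(vals)  # NaN contributes nothing to the sum
nan_mean = np.nanmean(vals)  # mean over the 3 non-NaN values only
nan_max = np.nanmax(vals)    # NaN is ignored when looking for the maximum

print(nan_total, nan_mean, nan_max)
```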

### NaN and None in pandas

Pandas treats `None` and `NaN` interchangeably as missing values. In a numeric `Series`, `None` is converted to `NaN`:

In [10]:
pd.Series([1,np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [13]:
simple_series=pd.Series([1,np.nan, 2, None])

In [14]:
simple_series.mean()

1.5
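
The mean above is 1.5 because pandas aggregations skip missing values by default. As a small sketch of that behavior, the `skipna` argument can be switched off to get NumPy-style propagation, and `count()` reports how many non-missing values there are.

```python
import numpy as np
import pandas as pd

simple_series = pd.Series([1, np.nan, 2, None])

mean_skipping = simple_series.mean()            # NaN entries are skipped
mean_strict = simple_series.mean(skipna=False)  # NaN propagates to the result
non_missing = simple_series.count()             # number of non-missing values

print(mean_skipping, mean_strict, non_missing)
```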

## Operating on Null Values

The following methods help detect and handle null values in pandas:

| Method        | Description                                           |
|---------------|-------------------------------------------------------|
| ``isnull()``  | Generate a Boolean mask indicating missing values     |
| ``notnull()`` | Opposite of ``isnull()``                              |
| ``dropna()``  | Return a filtered version of the data                 |
| ``fillna()``  | Return a copy of the data with missing values filled  |


In [15]:
simple_data = pd.Series([1,np.nan, 'Hello', None])
simple_data

0        1
1      NaN
2    Hello
3     None
dtype: object

In [16]:
simple_data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [17]:
simple_data[~simple_data.isnull()]

0        1
2    Hello
dtype: object

In [18]:
simple_data[simple_data.notnull()]

0        1
2    Hello
dtype: object

In [19]:
simple_data.dropna()

0        1
2    Hello
dtype: object

In [20]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,    6]])

df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [21]:
df.dropna()
# Drops every row that contains at least one NaN

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [25]:
df.dropna(axis='columns')
# Drops every column that contains at least one NaN
# TODO: look deeper into how to use the thresh parameter

Unnamed: 0,2
0,2
1,5
2,6


<div class="alert alert-block alert-info">
<p>
The ``dropna()`` function on a dataframe offers other optional parameters, such as ``how`` and ``thresh``. **Look at Page 126 of the textbook for more details.** </p>
</div> 
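
As a rough illustration of those two parameters, here is a sketch that reuses the small `df` from above. `how='all'` drops a row only when every value in it is missing, while `thresh` sets the minimum number of non-missing values a row (or column) needs in order to be kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# No row is entirely missing, so nothing is dropped
all_missing_dropped = df.dropna(how='all')

# Keep only rows with at least 3 non-missing values
rows_kept = df.dropna(thresh=3)

# Keep only columns with at least 3 non-missing values
columns_kept = df.dropna(axis='columns', thresh=3)

print(rows_kept)
print(columns_kept)
```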

In [26]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [27]:
df.fillna(0)

Unnamed: 0,0,1,2
0,1.0,0.0,2
1,2.0,3.0,5
2,0.0,4.0,6


In [28]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


<div class="alert alert-block alert-info">
<p>
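The ``fillna()`` function on a dataframe also offers an optional ``method`` parameter, for example ``method='ffill'`` (forward fill) and ``method='bfill'`` (backward fill). Recent pandas versions deprecate this parameter in favor of the ``ffill()`` and ``bfill()`` methods. **Look at Page 127 of the textbook for more details.** </p>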
</div> 
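
To make forward and backward filling concrete, here is a minimal sketch on a made-up series. It uses the `ffill()` and `bfill()` methods, which do the same job as the `method` values described in the textbook.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps in the middle and at the end
gappy = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last seen value forward
forward = gappy.ffill()

# Backward fill: pull the next seen value backward;
# the trailing NaN stays because nothing follows it
backward = gappy.bfill()

print(forward.tolist())
print(backward.tolist())
```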

## Working with dataset with missing values

Marketing dataset: This dataset contains questions from questionnaires that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Annual Income of Household from the other 13 demographic attributes. [Source](http://sci2s.ugr.es/keel/dataset.php?cod=163)

[Data Dictionary](http://sci2s.ugr.es/keel/dataset/data/classification/marketing-names.txt)

In [29]:
mark_data = pd.read_csv('./data/marketing.csv')

In [35]:
mark_data.sample(5)

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
1925,2,1.0,7,3.0,8.0,5.0,3,3.0,0,1.0,1.0,7.0,1.0,4
1479,1,1.0,2,3.0,7.0,2.0,2,2.0,0,2.0,3.0,3.0,1.0,4
7516,2,1.0,7,4.0,8.0,3.0,3,3.0,0,,3.0,7.0,1.0,1
882,1,1.0,6,2.0,1.0,,3,2.0,0,1.0,1.0,7.0,1.0,5
1636,2,1.0,5,5.0,8.0,,1,4.0,0,1.0,1.0,3.0,1.0,8


### Activity:

* How many total respondents are in the dataset?


In [43]:
len(mark_data)

8993

In [49]:
mark_data.shape[0]

8993


* How many missing values are there for each attribute (column) in the dataset?


In [77]:
mark_count_null = mark_data.isnull().sum()
print(mark_count_null)

Sex                   0
MaritalStatus       160
Age                   0
Education            86
Occupation          136
YearsInSf           913
DualIncome            0
HouseholdMembers    375
Under18               0
HouseholdStatus     240
TypeOfHome          357
EthnicClass          68
Language            359
Income                0
dtype: int64



* What percentage of values is missing for each attribute in the dataset?


In [78]:
mark_count_null/mark_data.shape[0]*100

Sex                  0.000000
MaritalStatus        1.779162
Age                  0.000000
Education            0.956299
Occupation           1.512287
YearsInSf           10.152341
DualIncome           0.000000
HouseholdMembers     4.169910
Under18              0.000000
HouseholdStatus      2.668742
TypeOfHome           3.969754
EthnicClass          0.756144
Language             3.991994
Income               0.000000
dtype: float64

In [95]:
mark_data['MaritalStatus'].isnull().sum()

160



* Which attribute has the most missing values in the dataset? (**Hint**: To get the index of the maximum element you can use [`idxmax()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.idxmax.html) function)



In [110]:
mark_count_null.idxmax()

'YearsInSf'
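
The percentage and `idxmax()` steps can also be combined. The sketch below uses a made-up toy frame (its values are not taken from the marketing data) and relies on the fact that the mean of a Boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for mark_data
toy = pd.DataFrame({'Age': [1, 2, 3, 4],
                    'YearsInSf': [np.nan, 2.0, np.nan, np.nan],
                    'Language': [1.0, np.nan, 1.0, 1.0]})

# Fraction of missing values per column, expressed as a percentage
percent_missing = toy.isnull().mean() * 100

# Label of the column with the largest share of missing values
most_missing = percent_missing.idxmax()

print(percent_missing)
print(most_missing)
```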


* How do you fill the missing values with a `0`? 


In [81]:
mark_data.fillna(0)

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
0,2,1.0,5,4.0,5.0,5.0,3,3.0,0,1.0,1.0,7.0,0.0,9
1,1,1.0,5,5.0,5.0,5.0,3,5.0,2,1.0,1.0,7.0,1.0,9
2,2,1.0,3,5.0,1.0,5.0,2,3.0,1,2.0,3.0,7.0,1.0,9
3,2,5.0,1,2.0,6.0,5.0,1,4.0,2,3.0,1.0,7.0,1.0,1
4,2,5.0,1,2.0,6.0,3.0,1,4.0,2,3.0,1.0,7.0,1.0,1
5,1,1.0,6,4.0,8.0,5.0,3,2.0,0,1.0,1.0,7.0,1.0,8
6,1,5.0,2,3.0,9.0,4.0,1,3.0,1,2.0,3.0,7.0,1.0,1
7,1,3.0,3,4.0,3.0,5.0,1,1.0,0,2.0,3.0,7.0,1.0,6
8,1,1.0,6,3.0,8.0,5.0,3,3.0,0,2.0,3.0,7.0,1.0,2
9,1,1.0,7,4.0,8.0,4.0,3,2.0,0,2.0,3.0,7.0,1.0,4



* **Most Common Use**: Can you fill each missing value with the corresponding average for that attribute? 
    * For example, if the 'Age' attribute is missing for a person, can you find the average 'Age' of all people and fill that missing 'Age' with that average?

In [98]:
mark_data.fillna(mark_data.mean())

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
0,2,1.00000,5,4.0,5.000000,5.000000,3,3.000000,0,1.0,1.00000,7.0,1.127519,9
1,1,1.00000,5,5.0,5.000000,5.000000,3,5.000000,2,1.0,1.00000,7.0,1.000000,9
2,2,1.00000,3,5.0,1.000000,5.000000,2,3.000000,1,2.0,3.00000,7.0,1.000000,9
3,2,5.00000,1,2.0,6.000000,5.000000,1,4.000000,2,3.0,1.00000,7.0,1.000000,1
4,2,5.00000,1,2.0,6.000000,3.000000,1,4.000000,2,3.0,1.00000,7.0,1.000000,1
5,1,1.00000,6,4.0,8.000000,5.000000,3,2.000000,0,1.0,1.00000,7.0,1.000000,8
6,1,5.00000,2,3.0,9.000000,4.000000,1,3.000000,1,2.0,3.00000,7.0,1.000000,1
7,1,3.00000,3,4.0,3.000000,5.000000,1,1.000000,0,2.0,3.00000,7.0,1.000000,6
8,1,1.00000,6,3.0,8.000000,5.000000,3,3.000000,0,2.0,3.00000,7.0,1.000000,2
9,1,1.00000,7,4.0,8.000000,4.000000,3,2.000000,0,2.0,3.00000,7.0,1.000000,4
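
To see why passing `mark_data.mean()` works, note that `mean()` returns one value per column, and `fillna()` matches those values to columns by label. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({'Age': [2.0, np.nan, 4.0],
                    'Income': [1.0, 3.0, np.nan]})

# One mean per column, computed from the non-missing values
column_means = toy.mean()

# Each NaN is replaced by the mean of its own column
filled = toy.fillna(column_means)

print(filled)
```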


# Combining Pandas Datasets with Concatenation [MORE INFO](https://pandas.pydata.org/pandas-docs/stable/merging.html)

## Introduction

In [None]:
# For this tutorial, we will only need our college_loan_defaults dataset.
college_loan_defaults = pd.read_csv(
    './data/college-loan-default-rates.csv', index_col='opeid')

# Keep in mind that the original dataset has this many rows
len(college_loan_defaults)

The Office of Postsecondary Education Identification (OPEID) code for each college is used as the index.

In [None]:
college_loan_defaults.head()

## `pd.concat`
You can think of the `pd.concat` function as the equivalent of the NumPy `concatenate` function for `Series` and `DataFrame` objects.

We will spend most of our time on how this function works with `DataFrame` objects as opposed to `Series` objects, since in practice that is how it is used most frequently.

When it comes to using the `pd.concat` function, the most basic question is whether you are adding *additional rows* or *additional columns*. We'll run through the function arguments based on concatenating rows and then come back for a look at how we perform column concatenations.

### Concatenating `DataFrame` Rows

In [None]:
# Here, I'll split college_loan_defaults into multiple
# sections of rows that we will then stitch back together.
part_1 = college_loan_defaults.iloc[:1000]
part_2 = college_loan_defaults.iloc[1000:2000]
part_3 = college_loan_defaults.iloc[1999:]

# This creates three parts:
# rows 0-999
# rows 1000-1999
# rows 1999-end -> notice that row 1999 appears in both part_2 and part_3
# Show the index labels shared by part_2 and part_3.
# (Index.intersection replaces the older `&` set operation on Index objects.)
part_3.index.intersection(part_2.index)

#### Basic Usage

In [None]:
# Join all three parts together with pd.concat
concatenated_dataframe = pd.concat([part_3, part_1, part_2])
concatenated_dataframe.head()

**Pretty easy.**

Notice that `pd.concat` does not sort the elements of the DataFrame that it returns.

#### Handling Duplicate Index Values with `verify_integrity` & `ignore_index` Parameters
You probably didn't notice, but one school now appears twice in our dataframe.

In [None]:
# The `DataFrame.index.duplicated` function returns a boolean array
# we can use as a mask to extract duplicate records.
concatenated_dataframe[concatenated_dataframe.index.duplicated()]

Now, I purposefully caused this problem for us (by including the 1999 indexed element in both `part_2` and `part_3`), but in the real world this is pretty common!

Sometimes you might want to keep both entries (often the case if the index value is the same but the rest of the data is different). If so, you can pass the **`ignore_index`** parameter with a value of **`True`** to the function. All existing index values will be dumped and a new integer-based index will be created for you.

In [None]:
concatenated_dataframe = pd.concat([part_2, part_3, part_1], ignore_index=True)
concatenated_dataframe.head()

If, on the other hand, a duplicate index would mean there is a data problem that you don't want to allow, you can specify the `verify_integrity` parameter as `True`.

When this is passed, the existence of duplicate indices will generate a `ValueError` exception.

In [None]:
concatenated_dataframe = pd.concat([part_2, part_3, part_1], verify_integrity=True)
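
The two options can be compared side by side on a tiny example. This is only a sketch: the frames `first` and `second` are made up and stand in for `part_2` and `part_3`, sharing the index label `'b'`.

```python
import pandas as pd

# Hypothetical frames that share the index label 'b'
first = pd.DataFrame({'value': [1, 2]}, index=['a', 'b'])
second = pd.DataFrame({'value': [3, 4]}, index=['b', 'c'])

# ignore_index=True discards the labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)

# verify_integrity=True refuses to build a result with duplicate labels
try:
    pd.concat([first, second], verify_integrity=True)
    duplicate_detected = False
except ValueError as error:
    duplicate_detected = True
    print('concat failed:', error)

print(renumbered)
```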

#### Handling Column Mismatches with the `join` Parameter
Sometimes you will have two sets of rows that you want to join together, but the sets don't have all of the same columns.

I'll create a couple of additional small `DataFrame` objects from our college loan dataset to demonstrate our options here.

In [None]:
# DataFrame 1
# Contains the first 5 rows of the original dataset
# But only the name, city, and state columns
name_city_state_columns_only = college_loan_defaults[['name', 'city', 'state']][:5]

# DataFrame 2
# Contains the second 5 rows of the original dataset
# But only the name, state, and zipcode columns
name_state_zipcode_columns_only = college_loan_defaults[['name', 'state', 'zipcode']][5:10]

In [None]:
name_city_state_columns_only

In [None]:
name_state_zipcode_columns_only

We have 2 sets of 5 rows that we want to concatenate together, but they have different columns. Let's see what happens if you don't specify anything with the **`join`** parameter.

In [None]:
pd.concat([name_city_state_columns_only, name_state_zipcode_columns_only])

See how Pandas adds the special `NaN` value for any column that didn't have a value in the original dataframes? 

The other option is to drop any columns where there is not data in both sets of rows. You can do this by passing **`join='inner'`** to the function.

Let's demonstrate how doing so will result in only the shared columns (name, state) appearing in the final dataframe.

In [None]:
pd.concat([name_city_state_columns_only, name_state_zipcode_columns_only], join='inner')
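
For readers without the college loan file at hand, the same contrast can be reproduced on made-up data. In this sketch the frames `left` and `right` and their values are hypothetical; only the column names mirror the example above.

```python
import pandas as pd

# Hypothetical rows with name, city, and state columns
left = pd.DataFrame({'name': ['College A', 'College B'],
                     'city': ['Austin', 'Boston'],
                     'state': ['TX', 'MA']})

# Hypothetical row with name, state, and zipcode columns
right = pd.DataFrame({'name': ['College C'],
                      'state': ['CA'],
                      'zipcode': ['90001']})

# Default (outer) join: union of the columns, NaN where data is absent
outer = pd.concat([left, right])

# Inner join: only the columns present in both frames survive
inner = pd.concat([left, right], join='inner')

print(outer)
print(inner)
```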

### Concatenating `DataFrame` Columns
Now let's go back and see how we can use the `pd.concat` function to merge two sets of columns with the same index (row) values.

The data will start out a little dirty but we will clean it up with our parameters.

In [None]:
# DataFrame 1
# Contains the first 5 rows of the original dataset
# But only the name, city, and state columns
name_city_state_columns = college_loan_defaults[['name', 'city', 'state']][:5]

# DataFrame 2
# Contains the first 7 rows of the original dataset - two more rows than DataFrame 1
# But only default rates columns
default_rates = college_loan_defaults[
    ['year_1_default_rate',
     'year_2_default_rate', 
     'year_3_default_rate']][:7]

In [None]:
name_city_state_columns

In [None]:
default_rates

Now let's do a simple concatenation. To add columns we have to specify the `axis` parameter with a value of **`1`** or **`'columns'`** to indicate we are adding columns, not rows.

In [None]:
pd.concat(
    [name_city_state_columns, default_rates], 
    axis=1)

There are a couple of important things to notice here:
* Unlike when concatenating rows, this time Pandas did sort the rows based on the index. Just something to be aware of.
* See how there are a couple of rows with `NaN` values in their first three columns? That's because our `default_rates` dataframe had two additional rows for which there were no corresponding values in `name_city_state_columns`.

Let's drop the rows with `NaN` values by specifying an inner join.

In [None]:
pd.concat(
    [name_city_state_columns, default_rates], 
    axis=1, join='inner')

Finally, let's talk about how the **`verify_integrity`** and **`ignore_index`** parameters would work when concatenating columns.

Let's say that we had included the city column in both dataframes:
* The default behavior of `pd.concat` would have been to create a new dataframe with 2 "city" columns.
* You could make Pandas raise a `ValueError` exception by passing `verify_integrity=True` to the function.
* You could also pass `ignore_index=True` to throw out all the column names and replace them with a 0-based sequence of integers. This would result in the values of "city" being duplicated in two columns, but the columns would have different integer "names".
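
These three outcomes can be checked with a short sketch. The frames `names` and `rates` below are made up for illustration; both contain a "city" column.

```python
import pandas as pd

# Hypothetical frames that both contain a 'city' column
names = pd.DataFrame({'name': ['College A', 'College B'],
                      'city': ['Austin', 'Boston']})
rates = pd.DataFrame({'city': ['Austin', 'Boston'],
                      'rate': [0.1, 0.2]})

# Default behavior: the result carries two 'city' columns
duplicated = pd.concat([names, rates], axis=1)

# verify_integrity=True rejects the duplicate column label
try:
    pd.concat([names, rates], axis=1, verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True

# ignore_index=True relabels the columns 0..n-1
renumbered = pd.concat([names, rates], axis=1, ignore_index=True)

print(list(duplicated.columns))
print(list(renumbered.columns))
```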

### Concatenating Two or More `Series` into a `DataFrame` with `axis=1`

In [None]:
s1 = pd.Series([1, 2], index=['A', 'B'], name='s1')
s2 = pd.Series([3, 4,5], index=['A', 'B','C'], name='s2')

In [None]:
s1

In [None]:
s2

In [None]:
pd.concat([s1, s2], axis=1)

<div class="alert alert-block alert-info">
<p>
Note that the ``pd.concat()`` function labels the columns with each series' ``name`` attribute (here ``'s1'`` and ``'s2'``), not with the Python variable names.
</p>
</div> 
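
A short sketch of where the column labels come from: the `name` attribute is used when present, the `keys` argument overrides it, and unnamed series fall back to integer labels. The labels `'first'` and `'second'` are arbitrary choices for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['A', 'B'], name='s1')
s2 = pd.Series([3, 4, 5], index=['A', 'B', 'C'], name='s2')

# Column labels come from each Series' name attribute
by_name = pd.concat([s1, s2], axis=1)

# The keys argument supplies the column labels explicitly
by_keys = pd.concat([s1, s2], axis=1, keys=['first', 'second'])

# Series without a name get integer column labels
unnamed = pd.concat([pd.Series([1, 2]), pd.Series([3, 4])], axis=1)

print(by_name)
print(list(by_keys.columns), list(unnamed.columns))
```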

# Combining Datasets with Merge [MORE INFO](https://pandas.pydata.org/pandas-docs/stable/merging.html)

We will be exploring another way to combine datasets through the **`pd.merge`** function.

If you have a background in databases, you will find a significant amount of overlap between your SQL work and the merge function.

## The Difference between `pd.concat` & `pd.merge`
The essential difference between concatenation and merging/joining is that the latter requires the existence of one or more shared columns (or indices) between the two dataframes.

Concatenation has no such requirement. It will simply slap together whatever you give it and fill `NaN` values into spots where there are column mismatches.

Let's demonstrate this:

In [None]:
# Team Members Favorite Restaurants
team_restaurants = pd.DataFrame(
    {'restaurant': ['In-N-Out', 'Chipotle', 'Chick-Fil-A'], 
    'name': ['Mike', 'Kim', 'Roger']})
team_restaurants

In [None]:
# Menu Items and Locations
items_locations = pd.DataFrame(
    {'items': ['Fries', 'Pizza', 'Burritos','Pasta', 'Shakes'], 
    'locations': ['Chicago', 'New York', 'San Diego', 'Pittsburgh', 'Seattle']})
items_locations

As you can see, these dataframes share neither columns nor values. That does not prevent us from concatenating them:

In [None]:
pd.concat([team_restaurants, items_locations])

In [None]:
# Concatenating Columns
pd.concat(
    [team_restaurants,items_locations],
    axis=1)

A merge operation could not be performed between these two datasets because they don't share any column (or index) to cross-reference. Calling `pd.merge` on them raises an error.
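
A minimal sketch of that failure, using one-row stand-ins for the two dataframes. When no join columns are given and the frames share none, pandas raises `pd.errors.MergeError`.

```python
import pandas as pd

# One-row stand-ins for the two dataframes above
team = pd.DataFrame({'restaurant': ['In-N-Out'], 'name': ['Mike']})
items = pd.DataFrame({'items': ['Fries'], 'locations': ['Chicago']})

# With no shared column names, merge has nothing to join on
try:
    pd.merge(team, items)
    error_message = None
except pd.errors.MergeError as error:
    error_message = str(error)

print(error_message)
```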

## The 3 Categories of Joins
There are 3 different categories of merges/joins which are defined by the characteristics of the shared columns/indices:
* One-to-One: Each shared value exists only once in both dataframes.
* One-to-Many: A given shared value exists once in the first dataframe, but 1 or more times in the second dataframe.
* Many-to-Many: A given shared value exists 1 or more times in both dataframes.
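
If you want pandas to check which category a merge falls into, `pd.merge` accepts a `validate` argument. The sketch below uses made-up frames `people` and `menu`; a failed check raises `pd.errors.MergeError`.

```python
import pandas as pd

# Hypothetical data: each restaurant appears once in people,
# but In-N-Out appears twice in menu
people = pd.DataFrame({'restaurant': ['In-N-Out', 'Chipotle'],
                       'name': ['Mike', 'Kim']})
menu = pd.DataFrame({'restaurant': ['In-N-Out', 'In-N-Out', 'Chipotle'],
                     'item': ['Burgers', 'Fries', 'Tacos']})

# Passes: keys are unique on the left side only
one_to_many = pd.merge(people, menu, validate='one_to_many')

# Fails: keys are not unique on the right side
try:
    pd.merge(people, menu, validate='one_to_one')
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False

print(one_to_many)
print(is_one_to_one)
```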

Let's provide an example of each type of join from our datasets.

### One-to-One Join

<div class="alert alert-block alert-info">
<p>
This will feel pretty similar to concatenating columns.
</p>
</div>

In [None]:
# Restaurant Items
restaurant_items = pd.DataFrame(
    {
        'item': [
        'Shakes', 
        'Burritos', 
        'Burger'
        ]
    ,
        'restaurant':[
        'In-N-Out',
        'Chipotle',
        'Five Guys',
    ]
    }
)
restaurant_items

In [None]:
team_restaurants

The **`restaurant`** field is unique in both the `restaurant_items` and `team_restaurants` datasets; that is, each restaurant name appears only once in each dataset.

Because of this, if we merge the two dataframes it will be a **1-1 join.**

In [None]:
pd.merge(team_restaurants, restaurant_items)

Great. Here's what Pandas did:
1. Identified the matching column(s) between the two dataframes: **`restaurant`**.
1. Found matching **`restaurant`** values between the two dataframes.
1. Merged the columns of matching **`restaurant`** values together.
1. **Important**: Notice that a new index was generated.

<div class="alert alert-block alert-info">
<p>
In our discussion, we will refer to the columns that pandas is using to find matches between dataframes as the "join column(s)".
</p>
</div>

#### Controlling the Join Type with the `how` Parameter
Did you notice that some of the records from each of the original dataframes didn't make it into the merge product?

This is because the type of join that was applied to the dataframes was called an **inner join**.

There are actually 4 types of joins that you can use:
* **Inner Join**: To be included in the output dataframe, the join column(s) value must exist in both original dataframes. 
    * This is why some of the records didn't get included in the output, because they didn't have a corresponding join column(s) values in the other dataframe.
* **Outer Join**: All records from both dataframes are included in the output. Pandas simply fills in `NaN` where there is no corresponding join column(s) value.
* **Left Join**: All rows from the first (left) dataframe will be included in the output dataframe, regardless of whether there is a matching join column(s) value in the second (right) dataframe.
* **Right Join**: All rows from the second (right) dataframe will be included in the output dataframe, regardless of whether there is a matching join column(s) value in the first (left) dataframe.

Let's go ahead and try all these different types of joins to see how our output changes.

In [None]:
# Outer Join
# All records from both dataframes are included.
# NaN is inserted wherever one side has no matching record.
pd.merge(team_restaurants, restaurant_items, how="outer")

In [None]:
# Left Join
# All rows from team_restaurants are kept.
pd.merge(team_restaurants, restaurant_items, how="left")

In [None]:
# Right Join
# All rows from restaurant_items are kept.
pd.merge(team_restaurants, restaurant_items, how="right")
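
To compare the four join types at a glance, the sketch below rebuilds the two one-to-one dataframes (under the shorter, hypothetical names `team` and `items`) and counts the rows each join returns. It also passes `indicator=True`, which adds a `_merge` column recording which side each row came from.

```python
import pandas as pd

team = pd.DataFrame({'restaurant': ['In-N-Out', 'Chipotle', 'Chick-Fil-A'],
                     'name': ['Mike', 'Kim', 'Roger']})
items = pd.DataFrame({'item': ['Shakes', 'Burritos', 'Burger'],
                      'restaurant': ['In-N-Out', 'Chipotle', 'Five Guys']})

# Number of rows produced by each join type
row_counts = {how: len(pd.merge(team, items, how=how))
              for how in ['inner', 'outer', 'left', 'right']}

# indicator=True marks each row as 'both', 'left_only', or 'right_only'
tagged = pd.merge(team, items, how='outer', indicator=True)

print(row_counts)
print(tagged)
```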

### One-to-Many Join

In [None]:
# Restaurant Items
restaurant_items = pd.DataFrame(
    {
        'item': [
        'Burgers', 'Fries', 'Shakes', 
        'Tacos', 'Burritos', 'Chips',
        'Chicken Sandwich', 'Fries', 'Salads'
        ]
    ,
        'rest':[
        'In-N-Out', 'In-N-Out', 'In-N-Out', 
        'Chipotle', 'Chipotle', 'Chipotle',
        'Five Guys', 'Five Guys', 'Five Guys'
    ]
    }
)
restaurant_items

In [None]:
pd.merge(team_restaurants, restaurant_items)

#### Specifying the Join Columns
Well... that isn't what we wanted.

Thankfully though, the error message is pretty self-explanatory. Pandas thinks there are no common columns to merge on.

The reason for this is that the common values are held in columns with slightly different names. We have to explain to Pandas what to do when this happens by specifying the names of the columns to join on.

In [None]:
# Use the left_on and right_on parameters to specify the
# name(s) of the join column(s) in the first(left)
# and second(right) dataframes.
pd.merge(
    team_restaurants, 
    restaurant_items,
    left_on='restaurant',
    right_on='rest')

<div class="alert alert-block alert-info">
<h5>There can be more than 1 join column</h5>
<p>
In this example, we have specified only one join column. But you can specify multiple columns if you so desire. Just pass them as a list to the `left_on` and `right_on` parameters.
</p>
</div>
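
As a sketch of joining on more than one column, the made-up frames below must match on both the restaurant and the city. All names and values here (`orders`, `prices`, `town`, `fries_price`) are hypothetical.

```python
import pandas as pd

# Hypothetical data: the join key is the (restaurant, city) pair
orders = pd.DataFrame({'restaurant': ['In-N-Out', 'In-N-Out', 'Five Guys'],
                       'city': ['San Diego', 'Seattle', 'Chicago'],
                       'name': ['Mike', 'Sonia', 'Sam']})
prices = pd.DataFrame({'rest': ['In-N-Out', 'In-N-Out', 'Five Guys'],
                       'town': ['San Diego', 'Chicago', 'Chicago'],
                       'fries_price': [2.0, 2.5, 4.0]})

# Lists are paired up position by position:
# restaurant <-> rest, city <-> town
merged = pd.merge(orders, prices,
                  left_on=['restaurant', 'city'],
                  right_on=['rest', 'town'])

print(merged)
```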

### Many-to-Many Join

In [None]:
# Team Members Favorite Restaurants
team_restaurants = pd.DataFrame(
    {'restaurant': ['In-N-Out', 'Chipotle', 'Chick-Fil-A', 'Chick-Fil-A', 'In-N-Out'], 
    'name': ['Mike', 'Kim', 'Roger', 'Sam', 'Sonia']})
team_restaurants


In [None]:
# Restaurant Items
restaurant_items = pd.DataFrame(
    {
        'item': [
        'Burgers', 'Fries', 'Shakes', 
        'Tacos', 'Burritos', 'Chips',
        'Chicken Sandwich', 'Fries', 'Salads'
        ]
    ,
        'rest':[
        'In-N-Out', 'In-N-Out', 'In-N-Out', 
        'Chipotle', 'Chipotle', 'Chipotle',
        'Five Guys', 'Five Guys', 'Five Guys'
    ]
    }
)
restaurant_items

In [None]:
pd.merge(team_restaurants, restaurant_items, left_on = 'restaurant', right_on = 'rest', how = "outer")

<div class="alert alert-block alert-info">
<p> You could merge two dataframes based on index as well. </p>

<p>
You could even match the index of one dataframe against a column of the other, using the ``left_index``/``right_index`` parameters together with ``left_on``/``right_on``. Pandas gives you great flexibility here.
</p>
</div>
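
A minimal sketch of both variants, assuming a made-up `ratings` dataframe that is indexed by restaurant name:

```python
import pandas as pd

team = pd.DataFrame({'restaurant': ['In-N-Out', 'Chipotle'],
                     'name': ['Mike', 'Kim']})

# Hypothetical ratings, indexed by restaurant name
ratings = pd.DataFrame({'stars': [4.5, 4.0]},
                       index=['Chipotle', 'In-N-Out'])

# Column on the left, index on the right
column_to_index = pd.merge(team, ratings,
                           left_on='restaurant', right_index=True)

# Index on both sides
team_by_restaurant = team.set_index('restaurant')
index_to_index = pd.merge(team_by_restaurant, ratings,
                          left_index=True, right_index=True)

print(column_to_index)
print(index_to_index)
```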