
# Handling Missing Data Lab

## Introduction
After determining what data we are missing, the next step is to determine the best way to handle those missing values. In the following exercises, we will practice a few of those methods. For this lab, we will use the King County house sales data that we analyzed for missing values in the previous lab.

Below, load the pandas and numpy libraries and load the data frame using the specified csv file.

In [0]:
import pandas as pd
import numpy as np
url = "https://raw.githubusercontent.com/jigsawlabs-student/feature-engineering/master/4-handling-missing-data-lab/kc_house_data_missing_values.csv"
housing_df = pd.read_csv(url)

In [0]:
housing_df[:3]

# 	id	date	price	bedrooms	bathrooms	sqft_living	floors
# 0	6414100192	December 09, 2014	538000.0	3	2.25	2570	2.0
# 1	2487200875	December 09, 2014	604000.0	4	3.00	1960	1.0
# 2	1954400510	February 18, 2015	510000.0	3	2.00	1680	1.0

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,floors
0,6414100192,"December 09, 2014",538000.0,3,2.25,2570,2.0
1,2487200875,"December 09, 2014",604000.0,4,3.0,1960,1.0
2,1954400510,"February 18, 2015",510000.0,3,2.0,1680,1.0


Then, use `info` to return a description of the dataframe and get a sense of the null values in each column.

In [0]:
housing_df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 7740 entries, 0 to 7739
# Data columns (total 7 columns):
# id             7740 non-null int64
# date           7740 non-null object
# price          7740 non-null float64
# bedrooms       7740 non-null int64
# bathrooms      7740 non-null float64
# sqft_living    7740 non-null int64
# floors         7662 non-null float64
# dtypes: float64(3), int64(3), object(1)
# memory usage: 423.4+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7740 entries, 0 to 7739
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           7740 non-null   int64  
 1   date         7740 non-null   object 
 2   price        7740 non-null   float64
 3   bedrooms     7740 non-null   int64  
 4   bathrooms    7740 non-null   float64
 5   sqft_living  7740 non-null   int64  
 6   floors       7662 non-null   float64
dtypes: float64(3), int64(3), object(1)
memory usage: 423.4+ KB


We can see that there are 7740 rows of data. Scanning the non-null counts, every column is fully populated except `floors`, which has only 7662 non-null values, so 78 values are missing.
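As a quick cross-check on the `info` output, the missing values can also be counted per column. The sketch below uses a small made-up frame (`toy_df` and its values are assumptions for illustration, not the lab data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "price": [538000.0, 604000.0, 510000.0, 257500.0],
    "floors": [2.0, np.nan, 1.0, np.nan],
})

# isna() marks each missing cell; summing the booleans counts them per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

Running the same expression on `housing_df` should report missing values only for `floors`.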

### Independent Variables

In the last lab, we saw that the `floors` column contains null values, while the `date` column contains blank strings instead.

Let's first go back to the floors column. Create a new column called `floors_is_null`, which lists a 1 when the data is not available, and a 0 otherwise. 

In [0]:
floors_is_null = np.where(housing_df.floors.isna(), 1, 0)
floors_is_null[:10]
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [0]:
np.sum(floors_is_null)
# 78

78

Next assign the column to the `housing_df` dataframe.

In [0]:
housing_df = housing_df.assign(floors_is_null=floors_is_null)

In [0]:
housing_df[:3]
# 	id	date	price	bedrooms	bathrooms	sqft_living	floors	floors_is_null
# 0	6414100192	December 09, 2014	538000.0	3	2.25	2570	2.0	0
# 1	2487200875	December 09, 2014	604000.0	4	3.00	1960	1.0	0
# 2	1954400510	February 18, 2015	510000.0	3	2.00	1680	1.0	0

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,floors,floors_is_null
0,6414100192,"December 09, 2014",538000.0,3,2.25,2570,2.0,0
1,2487200875,"December 09, 2014",604000.0,4,3.0,1960,1.0,0
2,1954400510,"February 18, 2015",510000.0,3,2.0,1680,1.0,0


In [0]:
housing_df['floors_is_null'].sum()
# 78

78

The next step is to remove the NaN values from the `floors` column. Replace each null value in `floors` with the mean number of floors.

In [0]:
mean_floors = housing_df.floors.mean()
mean_floors
# 1.612438005742626

1.612438005742626

In [0]:
np.where(housing_df.floors.isna(), mean_floors, housing_df.floors)

array([2., 1., 1., ..., 2., 2., 2.])

In [0]:
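# Use the mean wherever floors is missing, and the original value otherwise
housing_fl = np.where(housing_df.floors.isna(), mean_floors, housing_df.floors)
# assign returns a new dataframe, so rebind housing_df to keep the change
housing_df = housing_df.assign(floors = housing_fl)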

In [0]:
housing_df[housing_df['floors'] == mean_floors].shape

# (78, 8)

(78, 8)

Now we can confirm that there are no longer any NaN values in the `floors` column.

In [0]:
housing_df['floors'].isna().sum()
# 0

0
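The two floors steps above (flag, then impute) can also be written with `isna` and `fillna`. This is a minimal sketch on a made-up frame; `toy_df` and `toy_mean` are hypothetical names, and `fillna` is offered as an alternative to the `np.where` approach used in the lab, not as the lab's own solution.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy_df = pd.DataFrame({"floors": [2.0, np.nan, 1.0, np.nan, 3.0]})

# Record where values were missing before overwriting them.
toy_df["floors_is_null"] = toy_df["floors"].isna().astype(int)

# mean() skips NaN by default, so it is computed over [2.0, 1.0, 3.0].
toy_mean = toy_df["floors"].mean()
toy_df["floors"] = toy_df["floors"].fillna(toy_mean)
print(toy_df)
```

The order matters: the indicator has to be built before the fill, otherwise the information about which rows were missing is lost.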

### Imputing the date

Ok, now it's time to remove the missing values in the date column.  Just like with the floors column, we'll do this in two steps.  First we'll add a new column called `date_is_null`, and then we'll impute the mean in our `date` column.

1. Add a `date_is_null` column

Ok, create the `date_is_null` column, which lists a 1 when the date is not available and a 0 otherwise. Assign the resulting array to the variable `date_is_null`.

In [0]:
housing_df.date.eq(' ').sum()

9

In [0]:
date_is_null = np.where(housing_df.date.eq(' '), 1, 0)

In [0]:
date_is_null[:3]
# array([0, 0, 0])

array([0, 0, 0])

In [0]:
date_is_null.sum()
# 9

9

Then add the column to the dataframe.

In [0]:
housing_df = housing_df.assign(date_is_null = date_is_null)

In [0]:
housing_df[:3]
# 
# id	date	price	bedrooms	bathrooms	sqft_living	floors	floors_is_null	date_is_null
# 0	6414100192	December 09, 2014	538000.0	3	2.25	2570	2.0	0	0
# 1	2487200875	December 09, 2014	604000.0	4	3.00	1960	1.0	0	0
# 2	1954400510	February 18, 2015	510000.0	3	2.00	1680	1.0	0	0

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,floors,floors_is_null,date_is_null
0,6414100192,"December 09, 2014",538000.0,3,2.25,2570,2.0,0,0
1,2487200875,"December 09, 2014",604000.0,4,3.0,1960,1.0,0,0
2,1954400510,"February 18, 2015",510000.0,3,2.0,1680,1.0,0,0


In [0]:
housing_df['date_is_null'].sum()
# 9

9
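The check above matches a single space exactly. If blanks might also appear as empty strings or runs of spaces (an assumption, since this dataset only shows `' '`), stripping whitespace first gives a more tolerant mask. The series `toy_dates` below is made up for illustration.

```python
import pandas as pd

# Made-up values: three different kinds of blank among two real dates.
toy_dates = pd.Series(["December 09, 2014", " ", "", "   ", "February 18, 2015"])

# Strip surrounding whitespace first so "", " " and "   " all count as blank.
is_blank = toy_dates.str.strip().eq("")
date_flag = is_blank.astype(int)
print(date_flag.tolist())
```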

2. Replacing missing values with the mean

The next step is to replace the blank values in the date column with the mean date.

> Check out [this post](https://stackoverflow.com/questions/52007139/pandas-datetime-average) to see how to  compute the average of a date in Pandas.

We'll help you out along the way. Begin by converting the column to datetimes and keeping only the values that parse successfully, which excludes the blank strings.

In [0]:
# df.date = pd.to_datetime(df.date).values.astype(np.int64)

# df = pd.DataFrame(pd.to_datetime(df.groupby('column').mean().date))

In [0]:
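# Casting the raw date strings straight to int64 raises a ValueError,
# so the strings must be parsed with pd.to_datetime first:
# np.where(housing_df.date.eq(' '), 0, housing_df.date.values.astype(np.int64))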

In [0]:
dates_to_datetime = pd.to_datetime(housing_df.date, errors= 'coerce')
dates_to_datetime[:3]

# 0   2014-12-09
# 1   2014-12-09
# 2   2015-02-18

0   2014-12-09
1   2014-12-09
2   2015-02-18
Name: date, dtype: datetime64[ns]

In [0]:
dates_to_datetime= dates_to_datetime.dropna()

In [0]:
dates_to_datetime.shape
# (7731,)

(7731,)

Then convert the datetimes to an `int64` and calculate the average.

In [0]:
mean_date = dates_to_datetime.dropna().values.astype(np.int64).mean()
mean_date
# 1.4144681350407421e+18

1.414468135040745e+18

Convert the mean number back to a date time, and then you can replace the empty strings with the mean.

In [0]:
mean_datetime = pd.to_datetime(mean_date)
mean_datetime
# Timestamp('2014-10-28 03:48:55.040742144')

Timestamp('2014-10-28 03:48:55.040744960')
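As a cross-check on the integer round trip, pandas can also average a datetime series directly with `Series.mean()`. The sketch below uses three made-up dates; `toy_dates` and `mean_direct` are hypothetical names. Because the average is computed in floating point, the result may differ from the exact midpoint by a tiny amount, so it is compared with a tolerance rather than for equality.

```python
import pandas as pd

# Made-up dates whose average is 2014-10-04.
toy_dates = pd.Series(pd.to_datetime(["2014-10-01", "2014-10-03", "2014-10-08"]))

# Let pandas average the datetimes directly.
mean_direct = toy_dates.mean()

# Compare with a tolerance, since the mean is computed in floating point.
is_close = abs(mean_direct - pd.Timestamp("2014-10-04")) < pd.Timedelta(seconds=1)
print(mean_direct, is_close)
```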

As an alternative, we can find the median value, by sorting the values of our dates and then selecting the middle index.

Let's start by checking the length of our date column.

In [0]:
dates_to_datetime.shape
# (7731,)

(7731,)

With 7731 values, the middle value sits at position 3865 (7731 // 2, counting from 0). Next, sort the values and select the datetime at position 3865.

In [0]:
dates_to_datetime.sort_values().dropna().iloc[3865]

Timestamp('2014-10-13 00:00:00')

In [0]:
median_date = dates_to_datetime.sort_values().dropna().iloc[3865]
median_date
# Timestamp('2014-10-13 00:00:00')

Timestamp('2014-10-13 00:00:00')
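The sort-and-pick-the-middle idea can be checked on a handful of made-up dates. In this sketch `toy_dates`, `middle_index` and `toy_median` are hypothetical names used only for illustration.

```python
import pandas as pd

# Five made-up dates in no particular order.
toy_dates = pd.Series(pd.to_datetime(
    ["2014-12-09", "2014-08-13", "2015-02-18", "2014-10-13", "2014-09-15"]
))

# With an odd number of values, the median sits at position len // 2
# of the sorted series (here 5 // 2 == 2).
middle_index = len(toy_dates) // 2
toy_median = toy_dates.sort_values().iloc[middle_index]
print(toy_median)
```

With an even number of values this picks the later of the two middle dates rather than averaging them, which is usually acceptable for imputation.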

Ok, now let's replace the empty strings with the mean datetime.

In [0]:
dates_to_datetime

0      2014-12-09
1      2014-12-09
2      2015-02-18
3      2015-04-03
4      2015-03-12
          ...    
7735   2014-08-13
7736   2014-10-13
7737   2014-09-15
7738   2014-08-25
7739   2014-10-14
Name: date, Length: 7731, dtype: datetime64[ns]

In [0]:
mean_to_date_stamp = pd.to_datetime(mean_date)

In [0]:
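# Re-parse the full column so it lines up with housing_df: blank strings become NaT
parsed_dates = pd.to_datetime(housing_df.date, errors='coerce')
# Fill only the NaT entries with the mean timestamp
replaced_housing_date = parsed_dates.fillna(mean_to_date_stamp)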

In [0]:
replaced_housing_date

In [0]:
type(mean_to_date_stamp)

In [0]:
(replaced_housing_date == mean_datetime).sum()
# 9

In [0]:
housing_df = housing_df.assign(date = pd.to_datetime(replaced_housing_date))

In [0]:
housing_df[:3]

In [0]:
(housing_df['date'] == mean_datetime).sum()
# 9
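The date imputation comes down to two moves: coerce unparseable entries to `NaT`, then fill them. The sketch below shows this on three made-up values; `toy_raw`, `toy_parsed`, `toy_fill_value` and `toy_filled` are hypothetical names, and the fill value is a stand-in for the mean date rather than one computed from the lab data.

```python
import pandas as pd

# Made-up column with one blank entry.
toy_raw = pd.Series(["2014-10-01", " ", "2014-10-03"])

# errors="coerce" turns unparseable entries (the blank string) into NaT.
toy_parsed = pd.to_datetime(toy_raw, errors="coerce")

# fillna replaces only the NaT slots, leaving real dates untouched.
toy_fill_value = pd.Timestamp("2014-10-02")  # stand-in for the mean date
toy_filled = toy_parsed.fillna(toy_fill_value)
print(toy_filled)
```

Filling the coerced series keeps the result the same length as the original column, so it can be assigned back to the dataframe without any index mismatch.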

### Target Variable
In the last lab, we saw that the `price` column (our target variable) had missing values in the form of 0s.  Imputing target data can be difficult and is best suited for situations where we have very few observations to work with.

First, determine the number of observations that are 0.

In [0]:
(housing_df.price == 0).sum()
# 46

Here, the missing values represent only a very small fraction of our total observations (46 of 7740, or about 0.6%), so it is unlikely we would impute values for the target variable. In the cell below, drop all rows with 0 in the price column.

In [0]:
# Keep only the rows whose price is not 0
housing_df = housing_df[housing_df.price != 0]

In [0]:
housing_df.shape
# (7694, 9)

In [0]:
housing_df[housing_df.price == 0]
# 	id	date	price	bedrooms	bathrooms	sqft_living	floors	floors_is_null	date_is_null
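For reference, the count-then-drop pattern looks like this on a made-up frame. `toy_df`, `n_zero` and `toy_kept` are hypothetical names, and the `reset_index` call is an optional extra not required by the lab.

```python
import pandas as pd

# Made-up data: two of the four prices are recorded as 0.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [538000.0, 0.0, 510000.0, 0.0],
})

# Count the zero prices, then keep only the rows with a real price.
n_zero = (toy_df["price"] == 0).sum()
toy_kept = toy_df[toy_df["price"] != 0].reset_index(drop=True)
print(n_zero, toy_kept.shape)
```

Resetting the index avoids gaps in the row labels after filtering, which can matter if later code relies on positional labels.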

## Conclusion
There are different ways to handle missing data depending on the types of data you are dealing with and the problem you are trying to solve. In this lab, we treated our independent variables and target variable differently. For each independent variable, we created a new column that records whether or not the value was missing. For both of these variables, we then used the mean value to impute new values where data was missing.
For the target variable, we determined that we only had a small number of missing values, so we dropped all of those observations from the dataframe.