# DateTimes and Handling Null Values

### Introduction

At this point, our data is properly coerced.  We can almost train our model, but first we need to handle both our datetime columns and our missing (na) values.  We'll do so in this lesson.

### Loading Our Data

Let's get started by loading our data.

In [52]:
import json
file = "./dtypes_coerced_cats.json"
with open(file, 'r') as f:
    dtypes = json.load(f)

In [53]:
import pandas as pd
coerced_df = pd.read_csv('./coerced_cats.csv', index_col = 0).iloc[:, 1:]

In [54]:
coerced_df[:2]

Unnamed: 0,host_response_rate,security_deposit,cleaning_fee,extra_people,id,listing_url,last_scraped,name,summary,space,...,zipcode_x0_12047,zipcode_x0_12049,zipcode_x0_12051,zipcode_x0_12053,zipcode_x0_12055,zipcode_x0_12059,zipcode_x0_12435,zipcode_x0_13353,zipcode_x0_13357,zipcode_x0_other
0,96.0,200.0,30.0,28.0,2015,https://www.airbnb.com/rooms/2015,2018-11-07,Berlin-Mitte Value! Quiet courtyard/very central,Great location! 30 of 75 sq meters. This wood...,A+++ location! This „Einliegerwohnung“ is an e...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,,0.0,0.0,0.0,2695,https://www.airbnb.com/rooms/2695,2018-11-07,Prenzlauer Berg close to Mauerpark,,In the summertime we are spending most of our ...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
updated_df = coerced_df.astype(dtypes)

In [56]:
updated_df.select_dtypes('object').shape

(8000, 19)

In [57]:
coerced_df = updated_df.select_dtypes(exclude = 'object')

In [58]:
coerced_df.shape

(8000, 228)

### Changing DateTimes

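We can start by converting our datetime columns.  Most models cannot use raw datetime values directly, so we'll use our `add_datepart` function to replace each datetime column with numeric and boolean features such as the year, month, and day of week.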

In [59]:
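import numpy as np
import re
def add_datepart(df, fldname, drop=True, time=False, errors="raise"):
    """Expand the datetime column `fldname` of `df` into date-part columns, in place."""
    fld = df[fldname]
    fld_dtype = fld.dtype
    # Timezone-aware columns have their own dtype; treat them as datetime64.
    if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        fld_dtype = np.datetime64

    # Parse the column if it is not already a datetime.
    # Note: `infer_datetime_format` is deprecated in pandas 2.0 and later.
    if not np.issubdtype(fld_dtype, np.datetime64):
        df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True, errors=errors)
    # Prefix for the new columns: the field name without a trailing "date"/"Date".
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    # Note: `.dt.week` was removed in pandas 2.0; there, use `.dt.isocalendar().week`.
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
            'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time:
        attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr:
        df[targ_pre + n] = getattr(fld.dt, n.lower())
    # Seconds since the Unix epoch (assumes nanosecond-resolution datetimes).
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop:
        df.drop(fldname, axis=1, inplace=True)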
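
To make the effect of `add_datepart` concrete, here is a minimal sketch on a toy column. It does not call the function above; it rebuilds a few of the same derived columns by hand. The `toy` frame and the `sale_date` column name are made up for illustration.

```python
import pandas as pd

# Toy data: the frame and the `sale_date` column are hypothetical.
toy = pd.DataFrame({"sale_date": pd.to_datetime(["2018-11-07", "2019-01-31"])})

dates = toy["sale_date"]
prefix = "sale_"  # add_datepart strips a trailing "date"/"Date" from the name

toy[prefix + "Year"] = dates.dt.year
toy[prefix + "Month"] = dates.dt.month
toy[prefix + "Dayofweek"] = dates.dt.dayofweek  # Monday == 0
toy[prefix + "Is_month_end"] = dates.dt.is_month_end
# Seconds since the Unix epoch, the same quantity as the `Elapsed` column.
toy[prefix + "Elapsed"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Drop the original datetime column, as add_datepart does when drop=True.
toy = toy.drop(columns="sale_date")
print(toy)
```

Every resulting column is numeric or boolean, which is the form we need before training.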

Now, let's select all of the datetime columns.

In [60]:
datetime_df = coerced_df.select_dtypes('datetime')

In [61]:
datetime_df.columns

Index(['last_scraped', 'host_since', 'calendar_last_scraped', 'first_review',
       'last_review'],
      dtype='object')

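And we can use a list comprehension to call `add_datepart` on each of the columns.

> Ignore the warnings.  Pandas raises a `SettingWithCopyWarning` because `datetime_df` was selected from `coerced_df`; the new columns are still added to `datetime_df`.  Calling `.copy()` when selecting the columns would avoid the warning.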

In [62]:
[add_datepart(datetime_df, col) for col in datetime_df.columns]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


[None, None, None, None, None]
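
As a small sketch of working on an explicit copy, the example below selects the datetime columns from a toy frame, copies them, and then adds a derived column. The `source` and `dates_only` names and their data are hypothetical stand-ins, not the lesson's frames.

```python
import pandas as pd

# Hypothetical toy frame standing in for `coerced_df`.
source = pd.DataFrame({
    "price": [30.0, 45.0],
    "last_review": pd.to_datetime(["2018-10-28", "2018-11-04"]),
})

# .copy() gives an independent frame, so adding columns to it is unambiguous.
dates_only = source.select_dtypes("datetime").copy()
dates_only["last_reviewYear"] = dates_only["last_review"].dt.year

print(dates_only.columns.tolist())
print(source.columns.tolist())  # the source frame is left untouched
```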

After doing so, we should have 65 columns in our `datetime_df`: each of the 5 datetime columns is replaced by 13 new columns (12 date parts plus `Elapsed`).

In [63]:
datetime_df.shape

(8000, 65)

Now we can drop our datetime columns from the `coerced_df`. 

In [64]:
replaced_dt_df = coerced_df.drop(columns = ['last_scraped', 'host_since', 'calendar_last_scraped', 'first_review',
       'last_review'])

And add the new datetime columns to the dataframe.

In [65]:
replaced_dt_df[datetime_df.columns] = datetime_df

In [66]:
replaced_dt_df.shape
# (8000, 288)

(8000, 288)
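
The drop-then-add step can also be written as a single `pd.concat`. This is only a sketch of an alternative, on made-up data; the `base` and `derived` frames are hypothetical, and both approaches assume the two frames share the same row index.

```python
import pandas as pd

# Hypothetical frames: `base` holds the raw columns, `derived` the date parts.
base = pd.DataFrame({
    "beds": [1, 2],
    "host_since": pd.to_datetime(["2010-05-01", "2012-08-15"]),
})
derived = pd.DataFrame(
    {"host_sinceYear": [2010, 2012], "host_sinceMonth": [5, 8]},
    index=base.index,
)

# Drop the raw datetime column and append the derived columns side by side.
combined = pd.concat([base.drop(columns=["host_since"]), derived], axis=1)
print(combined.shape)
```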

### Is Null Columns

Next, let's use the `DataFrameMapper` to impute the missing values and add an `is_na` indicator column for each column with missing values.

First, find the names of the columns with any `na` values.

In [67]:
has_na_cols = replaced_dt_df.isna().any(axis = 0)

In [68]:
cols_with_na = has_na_cols[has_na_cols].index

In [69]:
cols_with_na.shape
# (35,)

(35,)

In [70]:
cols_with_na
# ['host_response_rate', 'security_deposit', 'cleaning_fee',
#        'host_listings_count', 'host_total_listings_count', 'bathrooms',
#        'bedrooms', 'beds', 'square_feet', 'review_scores_rating',
#        'review_scores_accuracy', 'review_scores_cleanliness',
#        'review_scores_checkin', 'review_scores_communication',
#        'review_scores_location', 'review_scores_value', 'reviews_per_month',
#        'host_sinceYear', 'host_sinceMonth', 'host_sinceWeek', 'host_sinceDay',
#        'host_sinceDayofweek', 'host_sinceDayofyear', 'first_reviewYear',
#        'first_reviewMonth', 'first_reviewWeek', 'first_reviewDay',
#        'first_reviewDayofweek', 'first_reviewDayofyear', 'last_reviewYear',
#        'last_reviewMonth', 'last_reviewWeek', 'last_reviewDay',
#        'last_reviewDayofweek', 'last_reviewDayofyear']

Index(['host_response_rate', 'security_deposit', 'cleaning_fee',
       'host_listings_count', 'host_total_listings_count', 'bathrooms',
       'bedrooms', 'beds', 'square_feet', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'reviews_per_month',
       'host_sinceYear', 'host_sinceMonth', 'host_sinceWeek', 'host_sinceDay',
       'host_sinceDayofweek', 'host_sinceDayofyear', 'first_reviewYear',
       'first_reviewMonth', 'first_reviewWeek', 'first_reviewDay',
       'first_reviewDayofweek', 'first_reviewDayofyear', 'last_reviewYear',
       'last_reviewMonth', 'last_reviewWeek', 'last_reviewDay',
       'last_reviewDayofweek', 'last_reviewDayofyear'],
      dtype='object')
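
The same selection is easier to follow on a small frame. The sketch below uses made-up data (the `toy` frame is hypothetical) to show how `isna().any(axis=0)` produces one boolean per column, and how that boolean Series can filter its own index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two of its three columns.
toy = pd.DataFrame({
    "beds": [1.0, np.nan, 2.0],
    "price": [30.0, 45.0, 60.0],
    "square_feet": [np.nan, np.nan, 500.0],
})

has_na = toy.isna().any(axis=0)        # one True/False per column
cols_with_na = has_na[has_na].index    # keep only the True entries
print(cols_with_na.tolist())
print(toy[cols_with_na].isna().sum())  # number of missing values per column
```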

Then let's use list comprehensions to create steps that perform a simple impute on each of the above columns, and steps that add a column indicating whether the value was missing.  By default, `SimpleImputer` replaces missing values with the column mean.

Each indicator column should have a name of the form `col_name_is_na`, for example `review_scores_rating_is_na`.

In [71]:
from sklearn.impute import SimpleImputer, MissingIndicator
imputer_steps = [([col], SimpleImputer()) for col in cols_with_na]

In [72]:
is_missing_steps = [([col], MissingIndicator(), {'alias': f'{col}_is_na'})
                    for col in cols_with_na]
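
To see what the two transformers compute for a single column, here is a minimal sketch that calls them directly on a small array, without the mapper. The values in `X` are made up for illustration.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# One feature with a missing value; scikit-learn expects a 2-D array.
X = np.array([[96.0], [np.nan], [100.0]])

# Default strategy is "mean": the NaN becomes the mean of 96 and 100.
imputed = SimpleImputer().fit_transform(X)
# Boolean mask marking which rows were missing in the original data.
was_missing = MissingIndicator().fit_transform(X)

print(imputed.ravel())
print(was_missing.ravel())
```

Both transformers are fit on the original values, so the indicator still records where the data was missing even though the imputed column no longer contains any `NaN`.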


Afterwards, we can combine the steps together.

In [73]:
combined_steps = imputer_steps + is_missing_steps

In [74]:
from sklearn_pandas import DataFrameMapper

In [83]:
is_null_mapper = DataFrameMapper(combined_steps, df_out = True)

In [84]:
dt_transformed_df = is_null_mapper.fit_transform(replaced_dt_df)

In [85]:
dt_transformed_df[:2]

Unnamed: 0,host_response_rate,security_deposit,cleaning_fee,host_listings_count,host_total_listings_count,bathrooms,bedrooms,beds,square_feet,review_scores_rating,...,first_reviewWeek_is_na,first_reviewDay_is_na,first_reviewDayofweek_is_na,first_reviewDayofyear_is_na,last_reviewYear_is_na,last_reviewMonth_is_na,last_reviewWeek_is_na,last_reviewDay_is_na,last_reviewDayofweek_is_na,last_reviewDayofyear_is_na
0,96.0,200.0,30.0,4.0,4.0,1.0,1.0,2.0,465.872727,93.0,...,False,False,False,False,False,False,False,False,False,False
1,91.418246,0.0,0.0,1.0,1.0,1.0,1.0,1.0,465.872727,100.0,...,False,False,False,False,False,False,False,False,False,False


After calling our mapper, we should not have any columns with missing values.

In [77]:
dt_transformed_df.isna().any(axis = 0).any()

False

Then we can drop the original columns with na values from `replaced_dt_df`, and add the remaining columns to the output of our mapper.

In [78]:
df_with_is_na = replaced_dt_df.drop(columns = cols_with_na)

In [79]:
dt_transformed_df[df_with_is_na.columns] = df_with_is_na

In [80]:
dt_transformed_df.isna().any().any()

False

In [81]:
dt_transformed_df.to_csv('cleaned_listings.csv')

### Summary

In this lesson we finished coercing our data.  We did so by both converting our datetime columns and handling our null values.  For the datetime columns, we selected them and then looped through them, calling our `add_datepart` function on each.  For the columns with na values, we used a `DataFrameMapper` to both impute the missing values and add a corresponding `is_na` column.