# Assess the data

## Introduction

In this notebook, I take my first look at, and do my initial cleaning of, the data.  This is my "wrangle" notebook.

The data includes three files:  

* a training set with the forecast variable, 
* a training set with additional predictor variables, and 
* a holdout set of the additional predictor variables to be used for scoring in the data-science contest.

In this notebook, I do the following:

* Inspect the files
* Clean each file, and 
* Collect my commentary at the end.

I expect to do additional cleaning after my exploratory analysis and in anticipation of my modeling.

## Set up

In [1]:
import numpy as np
import pandas as pd

## Load data

In [2]:
train_for_org = pd.read_csv('../sb_cap2_nb-99_data/original_train_forecast-variables.csv')
train_pred_org = pd.read_csv('../sb_cap2_nb-99_data/original_train_predictor-variables.csv')
test_pred_org = pd.read_csv('../sb_cap2_nb-99_data/original_comp_predictor-variables.csv')

## Inspect data:  forecast variable in training set

In [3]:
# How many rows and columns?

train_for_org.shape

(1456, 4)

In [4]:
# What does data look like?

train_for_org.head()

Unnamed: 0,city,year,weekofyear,total_cases
0,sj,1990,18,4
1,sj,1990,19,5
2,sj,1990,20,4
3,sj,1990,21,3
4,sj,1990,22,6


In [5]:
#  What type of columns?  Any nulls?

train_for_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1456 entries, 0 to 1455
Data columns (total 4 columns):
city           1456 non-null object
year           1456 non-null int64
weekofyear     1456 non-null int64
total_cases    1456 non-null int64
dtypes: int64(3), object(1)
memory usage: 45.6+ KB


Takeaways:

* There are no nulls in the forecast variable--the key variable for time series

## Inspect data:  predictor variables in training set

In [6]:
# How many rows and columns?

train_pred_org.shape

(1456, 24)

In [7]:
# What does data look like?

train_pred_org.head()

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,...,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm
0,sj,1990,18,1990-04-30,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,...,32.0,73.365714,12.42,14.012857,2.628571,25.442857,6.9,29.4,20.0,16.0
1,sj,1990,19,1990-05-07,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,...,17.94,77.368571,22.82,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6
2,sj,1990,20,1990-05-14,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,...,26.1,82.052857,34.54,16.848571,2.3,26.714286,6.485714,32.2,22.8,41.4
3,sj,1990,21,1990-05-21,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,...,13.9,80.337143,15.36,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0
4,sj,1990,22,1990-05-28,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,...,12.2,80.46,7.52,17.21,3.014286,28.942857,9.371429,35.0,23.9,5.8


In [8]:
#  What type of columns?  Any nulls?

train_pred_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1456 entries, 0 to 1455
Data columns (total 24 columns):
city                                     1456 non-null object
year                                     1456 non-null int64
weekofyear                               1456 non-null int64
week_start_date                          1456 non-null object
ndvi_ne                                  1262 non-null float64
ndvi_nw                                  1404 non-null float64
ndvi_se                                  1434 non-null float64
ndvi_sw                                  1434 non-null float64
precipitation_amt_mm                     1443 non-null float64
reanalysis_air_temp_k                    1446 non-null float64
reanalysis_avg_temp_k                    1446 non-null float64
reanalysis_dew_point_temp_k              1446 non-null float64
reanalysis_max_air_temp_k                1446 non-null float64
reanalysis_min_air_temp_k                1446 non-null float64
reanalysis_precip

In [9]:
# Get a count of nulls

train_pred_org_nulls_raw = train_pred_org.isnull().sum()
train_pred_org_nulls = pd.DataFrame(train_pred_org_nulls_raw)
train_pred_org_nulls.columns = ['null_count']
train_pred_org_nulls['total_count'] = len(train_pred_org)
train_pred_org_nulls = train_pred_org_nulls[['total_count', 'null_count']]
train_pred_org_nulls['null_pct'] = np.round((train_pred_org_nulls.null_count / len(train_pred_org)) * 100, 2)
train_pred_org_nulls.sort_values(by='null_pct', ascending=False, inplace=True)
train_pred_org_nulls

Unnamed: 0,total_count,null_count,null_pct
ndvi_ne,1456,194,13.32
ndvi_nw,1456,52,3.57
station_diur_temp_rng_c,1456,43,2.95
station_avg_temp_c,1456,43,2.95
station_precip_mm,1456,22,1.51
ndvi_se,1456,22,1.51
ndvi_sw,1456,22,1.51
station_max_temp_c,1456,20,1.37
station_min_temp_c,1456,14,0.96
precipitation_amt_mm,1456,13,0.89
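
The null-count cell above builds its table step by step, and the same steps recur for the holdout set. As a minimal sketch, the logic could live in one helper. The name `summarize_nulls` and the toy dataframe are assumptions for illustration only; they are not part of the notebook.

```python
import numpy as np
import pandas as pd


def summarize_nulls(df):
    """Return per-column null counts and percentages, sorted by percentage."""
    summary = pd.DataFrame({
        'total_count': len(df),          # scalar, broadcast to every row
        'null_count': df.isnull().sum(),
    })
    summary['null_pct'] = (summary['null_count'] / len(df) * 100).round(2)
    return summary.sort_values(by='null_pct', ascending=False)


# Toy frame standing in for train_pred_org (values are made up).
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, 3.0, 4.0],
    'c': [np.nan, 2.0, 3.0, 4.0],
})
toy_nulls = summarize_nulls(toy)
print(toy_nulls)
```

Calling the helper once per dataframe would avoid pasting the block a second time and swapping variable names by hand.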


In [10]:
# What's the range of percent of nulls?

train_pred_org_nulls[train_pred_org_nulls['null_count']>0].describe()

Unnamed: 0,total_count,null_count,null_pct
count,20.0,20.0,20.0
mean,1456.0,27.4,1.882
std,0.0,41.236672,2.830675
min,1456.0,10.0,0.69
25%,1456.0,10.0,0.69
50%,1456.0,13.0,0.89
75%,1456.0,22.0,1.51
max,1456.0,194.0,13.32


## Inspect data:  predictor variables in test set

In [11]:
# How many rows and columns?

test_pred_org.shape

(416, 24)

In [12]:
# What does data look like?

test_pred_org.head()

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,...,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm
0,sj,2008,18,2008-04-29,-0.0189,-0.0189,0.102729,0.0912,78.6,298.492857,...,25.37,78.781429,78.6,15.918571,3.128571,26.528571,7.057143,33.3,21.7,75.2
1,sj,2008,19,2008-05-06,-0.018,-0.0124,0.082043,0.072314,12.56,298.475714,...,21.83,78.23,12.56,15.791429,2.571429,26.071429,5.557143,30.0,22.2,34.3
2,sj,2008,20,2008-05-13,-0.0015,,0.151083,0.091529,3.66,299.455714,...,4.12,78.27,3.66,16.674286,4.428571,27.928571,7.785714,32.8,22.8,3.0
3,sj,2008,21,2008-05-20,,-0.019867,0.124329,0.125686,0.0,299.69,...,2.2,73.015714,0.0,15.775714,4.342857,28.057143,6.271429,33.3,24.4,0.3
4,sj,2008,22,2008-05-27,0.0568,0.039833,0.062267,0.075914,0.76,299.78,...,4.36,74.084286,0.76,16.137143,3.542857,27.614286,7.085714,33.3,23.3,84.1


In [13]:
#  What type of columns?  Any nulls?

test_pred_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416 entries, 0 to 415
Data columns (total 24 columns):
city                                     416 non-null object
year                                     416 non-null int64
weekofyear                               416 non-null int64
week_start_date                          416 non-null object
ndvi_ne                                  373 non-null float64
ndvi_nw                                  405 non-null float64
ndvi_se                                  415 non-null float64
ndvi_sw                                  415 non-null float64
precipitation_amt_mm                     414 non-null float64
reanalysis_air_temp_k                    414 non-null float64
reanalysis_avg_temp_k                    414 non-null float64
reanalysis_dew_point_temp_k              414 non-null float64
reanalysis_max_air_temp_k                414 non-null float64
reanalysis_min_air_temp_k                414 non-null float64
reanalysis_precip_amt_kg_per_m2  

In [14]:
# Get a count of nulls

test_pred_org_nulls_raw = test_pred_org.isnull().sum()
test_pred_org_nulls = pd.DataFrame(test_pred_org_nulls_raw)
test_pred_org_nulls.columns = ['null_count']
test_pred_org_nulls['total_count'] = len(test_pred_org)
test_pred_org_nulls = test_pred_org_nulls[['total_count', 'null_count']]
test_pred_org_nulls['null_pct'] = np.round((test_pred_org_nulls.null_count / len(test_pred_org)) * 100, 2)
test_pred_org_nulls.sort_values(by='null_pct', ascending=False, inplace=True)
test_pred_org_nulls

Unnamed: 0,total_count,null_count,null_pct
ndvi_ne,416,43,10.34
station_diur_temp_rng_c,416,12,2.88
station_avg_temp_c,416,12,2.88
ndvi_nw,416,11,2.64
station_min_temp_c,416,9,2.16
station_precip_mm,416,5,1.2
station_max_temp_c,416,3,0.72
reanalysis_min_air_temp_k,416,2,0.48
reanalysis_tdtr_k,416,2,0.48
reanalysis_specific_humidity_g_per_kg,416,2,0.48


In [15]:
# What's the range of percent of nulls?

test_pred_org_nulls[test_pred_org_nulls['null_count']>0].describe()


### Takeaways after inspection

Re indices:

* There are 3 columns identifying each observation:  city, year and week of year
* There is also a week start date column in the predictor variables

Re forecast variable--total cases--in training set:

* It is numeric--an int.
* It is never null

Re the predictor variables in training set:

* All are numeric
* All have missing values--anywhere from about 13% down to under 1% of rows


Re the predictor variables in holdout set:

* They look like the predictors in the training set
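
The comparison between the two predictor sets is made by eye from `head()` and `info()`. A rough sketch of a programmatic check follows. `compare_schemas` is a hypothetical helper, and the toy frames only stand in for `train_pred_org` and `test_pred_org`.

```python
import pandas as pd


def compare_schemas(left, right):
    """Return the columns whose presence or dtype differs between two frames."""
    dtypes = pd.DataFrame({
        'left': left.dtypes.astype(str),
        'right': right.dtypes.astype(str),
    })
    # A column missing on one side shows up as NaN and counts as a difference.
    return dtypes[dtypes['left'].ne(dtypes['right'])]


# Toy frames (values are made up).
train_toy = pd.DataFrame({'city': ['sj'], 'year': [1990], 'ndvi_ne': [0.12]})
test_toy = pd.DataFrame({'city': ['sj'], 'year': [2008], 'ndvi_ne': [0.05]})

same = compare_schemas(train_toy, test_toy)
different = compare_schemas(train_toy, test_toy.assign(year='2008'))
print(same.empty, list(different.index))
```

An empty result means both frames carry the same columns with the same dtypes. It says nothing about whether the value distributions match.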


## Clean data

#### Define some functions to clean 

In [16]:
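def set_index_train_for():
    """Sets an index of week_start_date on the forecast variable training set.

    Works on the global train_for_cln.  The dates come from train_pred_org,
    so both dataframes must hold the same observations in the same row order.
    """
    train_for_cln['week_start_date'] = pd.to_datetime(train_pred_org['week_start_date'])
    train_for_cln.set_index('week_start_date', inplace=True)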
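
Because the forecast file has no `week_start_date` column, the function above borrows it from the predictor file, which is only safe when the two files list the same observations in the same order. A minimal sketch of a check that could run first is below. `keys_aligned` and the toy frames are assumptions for illustration.

```python
import pandas as pd

KEY_COLS = ['city', 'year', 'weekofyear']


def keys_aligned(forecast_df, predictor_df, key_cols=KEY_COLS):
    """True when both frames have identical key columns in the same row order."""
    left = forecast_df[key_cols].reset_index(drop=True)
    right = predictor_df[key_cols].reset_index(drop=True)
    return left.equals(right)


# Toy frames standing in for train_for_org and train_pred_org (made-up values).
forecast = pd.DataFrame({
    'city': ['sj', 'sj', 'iq'],
    'year': [1990, 1990, 2000],
    'weekofyear': [18, 19, 26],
    'total_cases': [4, 5, 0],
})
predictors = pd.DataFrame({
    'city': ['sj', 'sj', 'iq'],
    'year': [1990, 1990, 2000],
    'weekofyear': [18, 19, 26],
    'week_start_date': ['1990-04-30', '1990-05-07', '2000-07-01'],
})

aligned = keys_aligned(forecast, predictors)
reversed_order = keys_aligned(forecast, predictors.iloc[::-1])
print(aligned, reversed_order)
```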

In [17]:
def set_index_pred(df):
    """Sets an index of week_start_date on a dataframe with a week_start_date column
    """
    df['week_start_date'] = pd.to_datetime(df['week_start_date'])
    df.set_index('week_start_date', inplace=True)

In [18]:
def fill_na_pred(df):
    """Forward-fills NaN values in the passed-in dataframe, in place
    """
    # fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent
    df.ffill(inplace=True)
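
One thing to watch: the cleaning cells call this function on the combined dataframe before splitting by city, so a gap in the first rows of one city could be filled with the last value from the other city. A hedged sketch of a per-city variant follows. `ffill_within_city` is a hypothetical name, and the toy data is made up.

```python
import numpy as np
import pandas as pd


def ffill_within_city(df, city_col='city'):
    """Forward-fill NaNs separately for each city; returns a new dataframe."""
    filled = df.copy()
    value_cols = [c for c in df.columns if c != city_col]
    # groupby(...).ffill() never carries a value across a group boundary.
    filled[value_cols] = df.groupby(city_col)[value_cols].ffill()
    return filled


# Toy frame: the first 'iq' row has a gap right after the last 'sj' row.
toy = pd.DataFrame({
    'city': ['sj', 'sj', 'iq', 'iq'],
    'ndvi_ne': [0.1, 0.2, np.nan, 0.4],
})

plain = toy.ffill()                 # the gap takes the preceding 'sj' value
per_city = ffill_within_city(toy)   # the gap stays NaN: no earlier 'iq' value
print(plain['ndvi_ne'].tolist())
print(per_city['ndvi_ne'].tolist())
```

A leading gap stays missing in the per-city version, so it would still need another imputation method or to be dropped.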

In [19]:
def add_month_column(df):
    """Adds a month column taken from the dataframe's datetime index
    """
    df['month'] = df.index.month

#### Clean training forecast variable

In [20]:
# Make two clean training forecast dataframes

train_for_cln = train_for_org.copy()
set_index_train_for()
add_month_column(train_for_cln)
train_for_cln = train_for_cln[['city', 'year', 'month', 'weekofyear', 'total_cases']]

# Split dataframe by city

train_for_cln_sj = train_for_cln[train_for_cln['city'] == 'sj'].copy()
train_for_cln_sj.drop(['city'], axis=1, inplace=True)
train_for_cln_iq = train_for_cln[train_for_cln['city'] == 'iq'].copy()
train_for_cln_iq.drop(['city'], axis=1, inplace=True)

# # Check
# train_for_cln
# train_for_cln.isnull().any()
# train_for_cln_sj
# train_for_cln_sj.info()
# train_for_cln_iq

#### Clean training predictor variables

In [21]:
# Make a clean training predictor dataframe

train_pred_cln = train_pred_org.copy()
set_index_pred(train_pred_cln)
fill_na_pred(train_pred_cln)
add_month_column(train_pred_cln)

# Reorder the columns

ls_cols = train_pred_cln.columns.to_list()
ls_cols.insert(2, ls_cols.pop())
train_pred_cln = train_pred_cln[ls_cols]

# Split dataframe by city

train_pred_cln_sj = train_pred_cln[train_pred_cln['city'] == 'sj'].copy()
train_pred_cln_sj.drop(['city'], axis=1, inplace=True)
train_pred_cln_iq = train_pred_cln[train_pred_cln['city'] == 'iq'].copy()
train_pred_cln_iq.drop(['city'], axis=1, inplace=True)


# # Check
# train_pred_cln
# train_pred_cln.isnull().any()
# train_pred_cln_sj
# train_pred_cln_sj.info()
# train_pred_cln_iq

#### Clean test predictor variables

In [22]:
# Make a clean test predictor dataframe

test_pred_cln = test_pred_org.copy()
set_index_pred(test_pred_cln)
fill_na_pred(test_pred_cln)
add_month_column(test_pred_cln)

# Reorder the columns

ls_cols = test_pred_cln.columns.to_list()
ls_cols.insert(2, ls_cols.pop())
test_pred_cln = test_pred_cln[ls_cols]

# Split dataframe by city

test_pred_cln_sj = test_pred_cln[test_pred_cln['city'] == 'sj'].copy()
test_pred_cln_sj.drop(['city'], axis=1, inplace=True)
test_pred_cln_iq = test_pred_cln[test_pred_cln['city'] == 'iq'].copy()
test_pred_cln_iq.drop(['city'], axis=1, inplace=True)

# # Check
# test_pred_cln
# test_pred_cln.isnull().any()
# test_pred_cln_sj
# test_pred_cln_iq
# test_pred_cln_iq.info()

## Save dataframes

In [23]:
train_for_cln_sj.to_pickle('../sb_cap2_nb-99_data/clean_train_forecast-variable_sj.pickle')
train_for_cln_iq.to_pickle('../sb_cap2_nb-99_data/clean_train_forecast-variable_iq.pickle')
train_pred_cln_sj.to_pickle('../sb_cap2_nb-99_data/clean_train_predictor-variables_sj.pickle')
train_pred_cln_iq.to_pickle('../sb_cap2_nb-99_data/clean_train_predictor-variables_iq.pickle')
test_pred_cln_sj.to_pickle('../sb_cap2_nb-99_data/clean_comp_predictor-variables_sj.pickle')
test_pred_cln_iq.to_pickle('../sb_cap2_nb-99_data/clean_comp_predictor-variables_iq.pickle')
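
Pickle keeps the datetime index and column dtypes, which a CSV round trip would not do without extra parsing. A small sketch of a round-trip check is below. It uses a temporary directory and a made-up toy frame instead of the notebook's real paths.

```python
import os
import tempfile

import pandas as pd

# Toy frame shaped like a cleaned per-city dataframe (values are made up).
toy = pd.DataFrame(
    {'year': [1990, 1990], 'month': [4, 5], 'total_cases': [4, 5]},
    index=pd.to_datetime(['1990-04-30', '1990-05-07']).rename('week_start_date'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pickle')  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The values, dtypes and datetime index should come back unchanged.
round_trip_ok = restored.equals(toy) and restored.index.name == 'week_start_date'
print(round_trip_ok)
```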

## Commentary

This is my first look at, and first cleaning of, the data.  Generally, the data is quite clean as it has been prepared for a data science competition.  Still, there are some transformations I can do to get the data ready for my initial exploration.

I clean the data by:

* Creating a common meaningful datetime index across all data frames using week start date, 
* Adding a month column and reordering date-related columns, and
* Splitting each dataframe in two--one file for each city (as each city is an independent sample).
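
The per-city split is written out by hand for each dataframe in the cleaning cells. As a sketch only, the same step could be one helper. `split_by_city` and the toy frame are illustrative assumptions, not code from the notebook.

```python
import pandas as pd


def split_by_city(df, cities=('sj', 'iq'), city_col='city'):
    """Return {city: rows for that city, with the city column dropped}."""
    return {
        city: df[df[city_col] == city].drop(columns=[city_col])
        for city in cities
    }


# Toy frame standing in for a cleaned dataframe (values are made up).
toy = pd.DataFrame({
    'city': ['sj', 'sj', 'iq'],
    'year': [1990, 1990, 2000],
    'total_cases': [4, 5, 0],
})
by_city = split_by_city(toy)
print(by_city['sj'])
```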

I look at missing values:

* There are no missing values in the forecast variables--a good thing for time series modeling.
* There are missing values in all 20 of the non-index-related predictor variables.
* However, there aren't many missing values.  Typically, a predictor variable is missing about 1% of its values (the training-set median is 0.89%), with a range of about 13% to .5%.  See table X.

To address the missing values, I do the following:

* Impute missing values with forward fills.  Given the time-ordered sorting of the data frames, this is similar to last-observation carried forward (LOCF), a common time-series imputation technique.
* Observe that I've got other options if needed after further analysis.  For example, I might impute with linear interpolations, quadratic interpolations, means of nearest neighbors or means of seasonal counterparts.
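
To make the options above concrete, here is a rough sketch comparing forward fill, linear interpolation and a seasonal mean on toy data. All values are made up, and the seasonal variant assumes a `weekofyear` column like the one in the dataset. Quadratic interpolation is left out because pandas delegates it to SciPy.

```python
import numpy as np
import pandas as pd

# Toy weekly series with gaps (values are made up, not from the dataset).
idx = pd.date_range('1990-04-30', periods=6, freq='7D')
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0], index=idx)

locf = s.ffill()                         # last observation carried forward
linear = s.interpolate(method='linear')  # straight line between known points

# Seasonal counterpart: fill a gap with the mean of the same week in other years.
df = pd.DataFrame({
    'weekofyear': [1, 2, 1, 2, 1, 2],
    'temp': [10.0, 20.0, np.nan, 22.0, 14.0, np.nan],
})
week_means = df.groupby('weekofyear')['temp'].transform('mean')
seasonal = df['temp'].fillna(week_means)

print(locf.tolist())
print(linear.tolist())
print(seasonal.tolist())
```

Forward fill repeats the last seen value across a gap, while linear interpolation spreads the change evenly between the known points on either side. Which one suits a given predictor is a question for the exploratory analysis.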