# Cleaning the Spray Dataset

## Importing modules and data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

%matplotlib inline
sns.set_style('darkgrid')
sns.set_palette('viridis')

spray = pd.read_csv('../data/spray.csv')

In [2]:
spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


## Checking for null values

In [3]:
spray.isnull().sum()

Date           0
Time         584
Latitude       0
Longitude      0
dtype: int64

The Time column has 584 missing values and we have no way to recover the true values, so we will drop the column.
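As a rough illustration of how the missing values could be put in proportion before deciding to drop a column, the sketch below reports both the count and the share of nulls per column. The `toy` frame and the name `null_summary` are made up for this example and are not part of the spray data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the spray data (values are invented for illustration).
toy = pd.DataFrame({
    'Date': ['2011-08-29'] * 4,
    'Time': ['6:56:58 PM', np.nan, np.nan, '6:57:28 PM'],
    'Latitude': [42.39, 42.38, 42.37, 42.36],
})

# Count of missing values per column, plus the share of rows they represent.
null_summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'share_missing': toy.isnull().mean(),
})
print(null_summary)
```

On the real data, the same two expressions applied to `spray` would show what fraction of rows lack a time.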

## Checking for other errors

### Duplicated locations

One latitude/longitude pair appears 541 times and another appears twice. We address both issues below:

In [4]:
order_groups = spray.groupby(['Latitude', 'Longitude'], as_index=False).count().sort_values('Date', ascending=False)

In [5]:
order_groups.head()

Unnamed: 0,Latitude,Longitude,Date,Time
11853,41.98646,-87.794225,541,541
11499,41.983917,-87.793088,2,2
0,41.713925,-87.615892,1,1
9533,41.959113,-87.719752,1,1
9522,41.959028,-87.72889,1,1


In [6]:
order_groups.loc[11499, 'Latitude']

41.9839166666667

In [7]:
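# Count rows per latitude/longitude pair (with Time dropped, only Date is counted).
lat_long_count = spray.drop('Time', axis=1).groupby(['Latitude', 'Longitude']).agg(['count'])

# Flatten the single-column result into a Series indexed by (Latitude, Longitude).
lat_long_count = pd.Series(lat_long_count.values.reshape(1,-1)[0], lat_long_count.index)

lat_long_count.sort_values(ascending=False).head()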

Latitude   Longitude 
41.986460  -87.794225    541
41.983917  -87.793088      2
41.894413  -87.710262      1
41.894380  -87.772148      1
41.894343  -87.760688      1
dtype: int64
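The per-pair counts could probably also be obtained more directly with `groupby(...).size()`, which returns a Series without the reshape step. This is a minimal sketch on an invented `toy` frame, not the notebook's own method; `pair_counts` is a hypothetical name.

```python
import pandas as pd

# Toy frame with one coordinate pair repeated three times (invented values).
toy = pd.DataFrame({
    'Date': ['2011-09-07'] * 4,
    'Latitude': [41.98, 41.98, 41.98, 41.97],
    'Longitude': [-87.79, -87.79, -87.79, -87.78],
})

# size() counts rows per group and returns a Series indexed by (Latitude, Longitude).
pair_counts = (
    toy.groupby(['Latitude', 'Longitude'])
       .size()
       .sort_values(ascending=False)
)
print(pair_counts.head())
```

Unlike `count`, `size()` includes rows with missing values, so it does not depend on which columns are dropped first.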

Manual inspection shows that one spray entry appears 541 times. We keep the first occurrence and drop the remaining 540.

In [8]:
mask = (spray['Latitude'] == 41.986460) & (spray['Longitude'] == -87.794225)
spray[mask].shape

(541, 4)

In [9]:
spray[mask].index[1:]

Int64Index([ 490,  491,  492,  493,  494,  495,  496,  497,  498,  499,
            ...
            1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029],
           dtype='int64', length=540)

In [10]:
spray.drop(spray[mask].index[1:], inplace=True)

In [11]:
lat_long_count = spray.drop('Time', axis=1).groupby(['Latitude', 'Longitude']).agg(['count'])

lat_long_count = pd.Series(lat_long_count.values.reshape(1,-1)[0], lat_long_count.index)

lat_long_count.sort_values(ascending=False).head()

Latitude   Longitude 
41.983917  -87.793088    2
42.395983  -88.095757    1
41.894157  -87.754473    1
41.894380  -87.772148    1
41.894343  -87.760688    1
dtype: int64

In [12]:
mask = (spray['Longitude'] == -87.7930883333333) & (spray['Latitude'] ==  41.9839166666667)
sum(mask)

2
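The masks in this section compare floats with `==`, which only matches when the literal equals the stored value exactly; that is presumably why the full-precision coordinates are typed out here. A tolerance-based comparison with `np.isclose` is one possible alternative when only the rounded, printed values are at hand. The sketch below uses an invented `toy` frame, and the tolerance of `1e-6` is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame holding full-precision coordinates like those in the notebook.
toy = pd.DataFrame({
    'Latitude': [41.9839166666667, 41.9839166666667, 41.98646],
    'Longitude': [-87.7930883333333, -87.7930883333333, -87.794225],
})

# Rounded literals, as they appear in the printed output.
target_lat, target_lon = 41.983917, -87.793088

# Exact equality misses the rows because the stored values carry more digits.
exact = (toy['Latitude'] == target_lat) & (toy['Longitude'] == target_lon)

# Absolute tolerance of 1e-6 degrees (an assumed value) catches them.
close = (
    np.isclose(toy['Latitude'], target_lat, rtol=0, atol=1e-6)
    & np.isclose(toy['Longitude'], target_lon, rtol=0, atol=1e-6)
)

print(exact.sum(), close.sum())
```

A tolerance that is too loose would start matching neighbouring spray points, so the value should be well below the spacing between distinct locations.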

In [13]:
spray[mask].index[1:]

Int64Index([485], dtype='int64')

In [14]:
spray.drop(spray[mask].index[1:], inplace=True)

In [15]:
lat_long_count = spray.drop('Time', axis=1).groupby(['Latitude', 'Longitude']).agg(['count'])

lat_long_count = pd.Series(lat_long_count.values.reshape(1,-1)[0], lat_long_count.index)

lat_long_count.sort_values(ascending=False).head()

Latitude   Longitude 
42.395983  -88.095757    1
41.894160  -87.767937    1
41.894402  -87.704128    1
41.894380  -87.772148    1
41.894343  -87.760688    1
dtype: int64

In [16]:
spray.drop('Time', axis=1, inplace=True)

**Now there are no duplicated latitude/longitude pairs and we have dropped the time column.**
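For reference, the mask-and-drop steps above could likely be collapsed into a single `drop_duplicates` call on the coordinate columns. This is a sketch on an invented `toy` frame rather than the approach used in this notebook, and it assumes that every repeated latitude/longitude pair is an error, which may not hold if the same location is legitimately sprayed on different dates.

```python
import pandas as pd

# Toy frame with one coordinate pair repeated three times (invented values).
toy = pd.DataFrame({
    'Date': ['2011-09-07'] * 5,
    'Latitude': [41.98, 41.98, 41.98, 41.97, 41.96],
    'Longitude': [-87.79, -87.79, -87.79, -87.78, -87.77],
})

# Keep the first row for each latitude/longitude pair and drop the rest.
deduped = toy.drop_duplicates(subset=['Latitude', 'Longitude'], keep='first')

# Sanity check: no pair should now appear more than once.
assert not deduped.duplicated(subset=['Latitude', 'Longitude']).any()
print(deduped)
```

Like the manual approach, `keep='first'` retains the lowest-index row of each group, so the surviving rows keep their original index labels.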

## Exporting file

In [17]:
spray.to_csv('../data/spray_cleaned.csv', index=False)