# Clean Austin-3-1-1 Data 

<img src="imgs/cleaning.jpg" width = "500" align="center"/>

In [1]:
import numpy as np # Linear algebra lib
import pandas as pd # Data analysis lib

# Raise display limits so long or wide frames are not truncated with '...'
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Load data

In [None]:
# Toggle Comments to run
!mkdir -p 'raw_data'
!rm -f raw_data/'311_Unified_Data.csv'
!wget 'https://austin-311-data.s3.us-east-2.amazonaws.com/311_Unified_Data.csv' -P raw_data
!ls -lh raw_data
!head raw_data/'311_Unified_Data.csv'
!tail raw_data/'311_Unified_Data.csv'
!wc -l raw_data/'311_Unified_Data.csv'

--2019-07-31 01:13:02--  https://austin-311-data.s3.us-east-2.amazonaws.com/311_Unified_Data.csv
Resolving austin-311-data.s3.us-east-2.amazonaws.com (austin-311-data.s3.us-east-2.amazonaws.com)... 52.219.96.24
Connecting to austin-311-data.s3.us-east-2.amazonaws.com (austin-311-data.s3.us-east-2.amazonaws.com)|52.219.96.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 257255577 (245M) [text/csv]
Saving to: ‘raw_data/311_Unified_Data.csv’


2019-07-31 01:13:32 (8.31 MB/s) - ‘raw_data/311_Unified_Data.csv’ saved [257255577/257255577]



## Clean Data

Start by loading the dataset

In [None]:
df = pd.read_csv('raw_data/311_Unified_Data.csv', low_memory=False)
df.head()

In [None]:
print('This dataset has {} rows and {} columns'.format(df.shape[0], df.shape[1]))

In [None]:
df.info()

For our analysis we don't need the following columns:
 
  - `Service Request (SR) Number`
  - `Status Change Date`
  - `SR Status`
  - `Last Update Date`
  - `Close Date`
  - `Map Page`
  - `Map Tile`
  - `State Plane X Coordinate`
  - `State Plane Y Coordinate`
  - `Street Number`
  - `Street Name`
  - `SR Location`
  - `Latitude Coordinate`
  - `Longitude Coordinate`
  - `Council District`

### Drop unnecessary columns and empty rows

In [None]:
columns = ['Council District', 'Map Page', 'Map Tile', 'Service Request (SR) Number', 'Status Change Date', 'Last Update Date', 'Close Date', 'SR Status', 'SR Location', 'Street Number', 'Street Name', 'State Plane X Coordinate', 'State Plane Y Coordinate', 'Latitude Coordinate', 'Longitude Coordinate']
df = df.drop(columns, axis=1)
df.head()

### Check for missing values and drop rows that are missing important info.

In [None]:
df.isnull().sum()

So, there are a lot of missing values, mostly due to empty rows. Let's drop those rows now.

In [None]:
# how='all' drops rows in which every value is NaN; how='any' would drop rows containing any NaN
df = df.dropna(how='all')
df.isnull().sum()

After dropping rows in which every value is `NaN`, we can now drop those that are missing the location information needed to stratify complaints by where they occurred.

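To make the difference between `how='all'` and `how='any'` concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the 311 data):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): row 1 is entirely empty, row 2 is only partly empty.
toy = pd.DataFrame({
    'City': ['Austin', np.nan, 'Austin'],
    'Zip Code': [78701.0, np.nan, np.nan],
})

dropped_all = toy.dropna(how='all')  # removes only the fully empty row
dropped_any = toy.dropna(how='any')  # removes every row with at least one NaN

print(len(dropped_all), len(dropped_any))
```

With `how='all'` the partly filled row survives, which is why the remaining gaps are handled column by column in the steps that follow.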


#### Missing all location data (i.e. Zip Code and County and City)

Let's drop rows where all three location fields (`City`, `Zip Code`, and `County`) are missing.

In [None]:
print('Before dimensions: ', df.shape)
df = df.loc[df[['City', 'Zip Code', 'County']].notnull().values.any(axis=1)]
print('After dimensions: ', df.shape)
df.head()

In [None]:
df.isnull().sum()

After eliminating rows with no location information, we can now look at missing values in the `(Latitude.Longitude)` column, since coordinates are important to our analysis.

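The location filter used earlier keeps a row when at least one of the three location fields is present. A small sketch on made-up rows shows the mask it builds, and that `dropna` with `subset` and `how='all'` should give the same result:

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 has no location info at all, row 2 has only a zip code.
toy = pd.DataFrame({
    'City': ['Austin', np.nan, np.nan],
    'Zip Code': [78701.0, np.nan, 78702.0],
    'County': ['Travis', np.nan, np.nan],
    'SR Description': ['Loose Dog', 'Pothole', 'Graffiti'],
})
location_cols = ['City', 'Zip Code', 'County']

# True where at least one location column is populated
has_location = toy[location_cols].notnull().any(axis=1)
kept = toy.loc[has_location]

# Equivalent formulation: drop rows where all location columns are NaN
kept_alt = toy.dropna(subset=location_cols, how='all')
print(kept.equals(kept_alt))
```

Either form works; the `dropna(subset=...)` version states the intent a little more directly.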
#### Drop Missing Lat., Long. Coordinates

In [None]:
print('Before dimensions: ', df.shape)
df = df.loc[df[['(Latitude.Longitude)']].notnull().values.any(axis=1)]
print('After dimensions: ', df.shape)
df.head()

#### Drop Missing Zipcodes

> NOTE: Instead of dropping these rows, we could reverse geocode the coordinates to recover location info such as Street Address, City, County, Zip Code, and State, using something like the [Google Maps Reverse Geocode API](https://developers.google.com/maps/documentation/geocoding/start#reverse), and use that to fill in the missing information.

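As a rough sketch of what that could look like, the helpers below build a reverse geocoding request URL and pull a postal code out of a decoded JSON response. The function names, the placeholder API key, and the hand-written sample response are assumptions for illustration; no request is actually sent, and a real lookup would need a valid key and an HTTP client.

```python
from urllib.parse import urlencode

GEOCODE_URL = 'https://maps.googleapis.com/maps/api/geocode/json'


def build_reverse_geocode_url(lat, lng, api_key):
    """Return the request URL for a reverse geocoding lookup (hypothetical helper)."""
    return GEOCODE_URL + '?' + urlencode({'latlng': f'{lat},{lng}', 'key': api_key})


def extract_zip_code(response_json):
    """Return the first postal code found in a decoded response, or None."""
    for result in response_json.get('results', []):
        for component in result.get('address_components', []):
            if 'postal_code' in component.get('types', []):
                return component.get('long_name')
    return None


# Hand-written stand-in for a response, so the sketch runs without network access.
sample_response = {'status': 'OK', 'results': [{'address_components': [
    {'long_name': 'Austin', 'short_name': 'Austin', 'types': ['locality', 'political']},
    {'long_name': '78701', 'short_name': '78701', 'types': ['postal_code']},
]}]}

url = build_reverse_geocode_url(30.2672, -97.7431, 'YOUR_API_KEY')
zip_code = extract_zip_code(sample_response)
print(url)
print(zip_code)
```

Since the dataset is large, a real implementation would also need to respect the service's quotas and cache results for repeated coordinates.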

In [None]:
print('Before dimensions: ', df.shape)
df = df.loc[df['Zip Code'].notnull().values]
print('After dimensions: ', df.shape)
df.head()

### Keep only rows with Zip Codes from Austin and Travis County

Let's use the City of Austin's official map to figure out which Zip Codes fall within city territory.

![](https://i.imgur.com/kdG5bxm.png)


All of the zip code polygons that intersect the blue-shaded region are under the `City of Austin's` jurisdiction. The boundary is hazy at best, so we will keep all complaints under those zip codes.

In [None]:
austin_zipcodes = """
78701,78702,78703,78704,78705,
78721,78722,78723,78724,78725,
78726,78727,78728,78729,78730,
78731,78732,78733,78734,78735,
78736,78737,78738,78739,78741,
78742,78744,78745,78746,78747,
78748,78749,78750,78751,78752,
78753,78754,78756,78757,78758,
78759,78610,78617,78653,78660
""".split(',')
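# Parse to integers so the match does not rely on comparing strings with numbers
austin_zipcodes = [int(x.strip()) for x in austin_zipcodes]
print('Before dimensions: ', df.shape)
# Coerce `Zip Code` to numeric so float and string values are compared consistently
df = df.loc[pd.to_numeric(df['Zip Code'], errors='coerce').isin(austin_zipcodes)]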
print('After dimensions: ', df.shape)
df.head()

In [None]:
df['County'].value_counts(dropna=False)

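One thing worth checking in the zip code filter is the dtype of `Zip Code`: the list is parsed from text, while the column may hold floats, and matching strings against numbers is not something to rely on. A minimal sketch with made-up values, comparing numbers with numbers:

```python
import numpy as np
import pandas as pd

# Shortened, illustrative zip list in the same comma-separated layout
zip_text = """
78701,78702,
78703
"""
zip_ints = [int(z.strip()) for z in zip_text.split(',')]

# Made-up column stored as floats, as happens when a numeric column contains NaN
toy = pd.DataFrame({'Zip Code': [78701.0, 78610.0, 78703.0, np.nan]})

# Coerce to numeric first so string or float inputs behave the same way
numeric_zip = pd.to_numeric(toy['Zip Code'], errors='coerce')
kept = toy.loc[numeric_zip.isin(zip_ints)]
print(kept)
```

Rows with a missing or unlisted zip code are excluded, which is the intended behavior of the filter.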
### Clean `City` column

#### Strip leading and trailing whitespace from the entire dataframe first

##### Before stripping whitespace

In [None]:
df.isnull().sum()

In [None]:
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
# Treat strings that are empty after stripping as missing (np.NaN was removed in NumPy 2.0)
df = df.replace('', np.nan)

##### After stripping whitespace

In [None]:
df.isnull().sum()

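The null counts can go up after this step because cells holding only spaces become empty strings and are then converted to `NaN`. A small sketch on made-up values (the column is created with `dtype=object` so the dtype check matches):

```python
import numpy as np
import pandas as pd

# Made-up frame: one padded value and one whitespace-only value
toy = pd.DataFrame({
    'City': pd.Series(['  Austin ', '   ', 'Pflugerville'], dtype=object),
    'Zip Code': [78701.0, 78702.0, 78660.0],
})

# Strip only the object (string) columns; numeric columns pass through unchanged
stripped = toy.apply(lambda col: col.str.strip() if col.dtype == 'object' else col)

# Whitespace-only cells are now '', so mark them as missing
cleaned = stripped.replace('', np.nan)
print(cleaned['City'].isnull().sum())
```

Comparing `isnull().sum()` before and after, as the notebook does, is a quick way to see how many such cells were hiding in the data.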
#### Convert the `City` and `County` columns to title case, then fix any misspellings

In [None]:
df['County'] = df['County'].str.title()
df['County'].value_counts(dropna=False).sort_values(ascending=False)

In [None]:
df['City'] = df['City'].str.title()
df['City'].value_counts(dropna=False).sort_values(ascending=False)

Not surprisingly, there are many typos and misspellings, which we can fix. More surprisingly, there are calls from cities that are far away, such as `Houston` and `Dallas`, so let's keep only those from the `City of Austin` in Travis County, which is our main focus.

##### Austin and its Extraterritorial Jurisdiction

Another thing to note is the `Austin 5 Etj` value in the `City` column. ETJ stands for Extraterritorial Jurisdiction: the legal capability of a municipality to exercise authority beyond the boundaries of its incorporated area. In the US, Texas is one of the states that by law allow cities to claim ETJ over contiguous land beyond their city limits. Austin's ETJ currently extends into 4 counties: Williamson, Travis, Hays, and Bastrop.

In [None]:
df.loc[df['City'].str.contains('Austin 5 Etj', na=False), 'County'].value_counts(dropna=False)

In [None]:
df.loc[df['City'].str.contains('Austin 5 Etj', na=False), 'Zip Code'].value_counts(dropna=False).sort_index()

#### Fix `Austin` related typos and misspellings

In [None]:
df.loc[(df['City'] != 'Austin 5 Etj') & df['City'].str.startswith('Aus', na=False).values, 'City'] = 'Austin'
df['City'].value_counts(dropna=False).sort_values(ascending=False)

It looks like a considerable number of complaints come from surrounding territories that are under the City of Austin's jurisdiction.

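The prefix-based fix for `Austin` misspellings can be tried on a toy series first. The misspelled values below are made up; note that this approach would also rewrite any unrelated city name that happens to start with `Aus`, so it is worth reviewing the value counts before applying it.

```python
import numpy as np
import pandas as pd

# Made-up city values, including typos, the ETJ label, and a missing entry
city = pd.Series(
    ['Austin', 'Austn', 'Ausitn', 'Austin 5 Etj', 'Pflugerville', np.nan],
    dtype=object,
)

# Anything starting with 'Aus' is treated as Austin, except the ETJ label
is_austin_variant = (city != 'Austin 5 Etj') & city.str.startswith('Aus', na=False)
city.loc[is_austin_variant] = 'Austin'
print(city.tolist())
```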
In [None]:
df['County'].value_counts(dropna=False).sort_index()

In [None]:
df[['County', 'City']].groupby(['County', 'City']).size()

It looks like there are discrepancies in the data: `Austin` appears under `Williamson`, `Bastrop`, and `Hays` counties, even though the zip codes we filtered on pertain only to `City of Austin` territories in `Travis` County.

As noted earlier, validating `Zip Code`, `City`, and `County` would require reverse geocoding the coordinates with something like the Google Maps API.

#### Drop duplicate rows

Let's first check whether we have any duplicate rows.

In [None]:
df[df.duplicated()].sample(10)

Indeed, there are duplicate rows, so let's drop them.

In [None]:
print('Before dimensions: ', df.shape)
df = df.drop_duplicates()
print('After dimensions: ', df.shape)
df[df.duplicated()].head() # Check again to see if duplicate rows are dropped

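For reference, `duplicated()` flags every repeat of a row after its first occurrence, and `drop_duplicates()` keeps that first occurrence. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: row 1 is an exact copy of row 0
toy = pd.DataFrame({
    'SR Description': ['Loose Dog', 'Loose Dog', 'Pothole'],
    'Zip Code': [78701, 78701, 78702],
})

dup_mask = toy.duplicated()       # True only for the repeated copy
deduped = toy.drop_duplicates()   # keeps the first occurrence of each row
print(dup_mask.tolist(), len(deduped))
```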
Let's now drop the `City` and `County` columns, since we are going to use only `Zip Code` and `Latitude Longitude` columns to map our plots.

#### Drop `City`, and `County` columns

In [None]:
df = df.drop(['City', 'County'], axis=1)
df = df.reset_index(drop=True)
df.head()

#### Drop rows that have no `Created Date`

To analyze when each complaint was made, `Created Date` must be non-null, so let's drop the rows where it is missing.

In [None]:
df[df['Created Date'].isnull().values].head()

In [None]:
df = df.loc[df['Created Date'].notnull().values]
df.isnull().sum()

##### Reset the index

In [None]:
df = df.reset_index(drop=True)
df.info()

### Some Feature Engineering

In [None]:
# Change Zip Code from float to integer data type
df['Zip Code'] = df['Zip Code'].astype(int)

# Change Created Date to datetime object, so we can extract date, month, year, and hour
df['Created Date'] = pd.to_datetime(df['Created Date'], format='%m/%d/%Y %I:%M:%S %p')

# Add year, month, hour, and weekday columns
df['Incident Year'] = df['Created Date'].dt.year
df['Incident Month'] = df['Created Date'].dt.month
df['Incident Hour'] = df['Created Date'].dt.hour
# dt.weekday_name was removed in pandas 1.0; dt.day_name() is the replacement
df['Incident Weekday'] = df['Created Date'].dt.day_name()
df.head()

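To sanity-check the date format string and the derived columns, here is a small sketch on two made-up timestamps written in the same `MM/DD/YYYY hh:mm:ss AM/PM` layout:

```python
import pandas as pd

# Made-up timestamps in the same 12-hour layout as `Created Date`
created = pd.Series(['07/04/2019 01:30:00 PM', '12/31/2018 11:59:59 PM'])
parsed = pd.to_datetime(created, format='%m/%d/%Y %I:%M:%S %p')

features = pd.DataFrame({
    'Incident Year': parsed.dt.year,
    'Incident Month': parsed.dt.month,
    'Incident Hour': parsed.dt.hour,          # 24-hour clock after parsing
    'Incident Weekday': parsed.dt.day_name(),
})
print(features)
```

Passing an explicit `format` avoids per-row format inference, and `%I` with `%p` ensures that PM times map to hours 12 through 23.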
### Rename Columns

In [None]:
df = df.rename(columns={
    'SR Type Code':'Incident Type Code',
    'SR Description':'Incident Description',
    'Created Date':'Incident Date'
})
df.head()

In [None]:
# Check how many incidents we have for each year
df['Incident Year'].value_counts()

### Save the clean data frame

In [None]:
# Make sure the output directory exists, then remove any previous output file
!mkdir -p clean_data
!rm -f 'clean_data/clean_austin_311.csv'
df.to_csv('clean_data/clean_austin_311.csv', index=False)