# Clean Data
In this notebook, we'll take the raw data and reorganize them so that we can use them for visualization and a predictive model. These are the files we will be working with:
```
incidents.csv                 List of gun violence incidents from January 2014 - March 2018
population.csv                Population from 2000 - 2017 intercensal estimates for July
other_crime_annual.csv        Annual sums of crimes by state from 2010 - 2016
income.csv                    Annual average personal income by state from 2009-2017         
annual_gun_deaths.csv         Annual deaths by guns (homicides only) by state from 1999-2013
alcohol.csv                   Annual alcohol consumption by state from 1977-2016
provisions.csv                List of provisions in place by each state for the years 1991-2017
```
For more information on where these datasets came from, see the writeup. Our goal is to make several DataFrames which will be used later.

One is an overall DataFrame with statistics aggregated annually. This DataFrame will have entries from 2010-2016. This will be used for visualization.

Another is a feature DataFrame with monthly gun homicides from 2014-2017, each entry coupled with annual statistics from the previous year. This will be used later to make a model to predict gun violence trends.



In [1]:
# Numpy and pandas for manipulating the data
import numpy as np
import pandas as pd

In [2]:
daily_incidents_file = './data/raw/incidents.csv.gz' # zipped because file is too big
population_file = './data/raw/population.csv'
crime_file = './data/raw/other_crime_annual.csv'
income_file = './data/raw/income.csv'
annual_file = './data/raw/annual_gun_deaths.csv'
alcohol_file = './data/raw/alcohol.csv'
provisions_file = './data/raw/provisions.csv'

daily_incidents_df = pd.read_csv(daily_incidents_file, parse_dates=True, compression='gzip')
population_df = pd.read_csv(population_file, parse_dates=True, index_col=0)
annual_gun_deaths_df = pd.read_csv(annual_file, parse_dates=True)
crime_df = pd.read_csv(crime_file, parse_dates=True)
income_df = pd.read_csv(income_file, parse_dates=True)
alcohol_df = pd.read_csv(alcohol_file, parse_dates=True)
provisions_df = pd.read_csv(provisions_file, parse_dates=True)

## Incidents File
### daily_incidents_df and population_df
Goals: 

    lat_long_df         Incidents with latitude and longitude coordinates (2014-2017)
    feature_df          Monthly gun homicides per state, each with features from the previous year (2014-2017)
    by_date_total_df    Daily gun homicides per state indexed by date with states in columns
    by_date_norm_df     Same as by_date_total_df but normalized by population (gun homicides / 100000 people)

In [3]:
null_count = daily_incidents_df.isnull().sum()
null_count.sort_values()

incident_id                         0
date                                0
state                               0
city_or_county                      0
n_killed                            0
n_injured                           0
incident_url                        0
incident_url_fields_missing         0
incident_characteristics          326
source_url                        468
sources                           609
longitude                        7923
latitude                         7923
congressional_district          11944
address                         16497
participant_type                24863
participant_status              27626
state_senate_district           32335
participant_gender              36362
state_house_district            38772
participant_age_group           42119
notes                           81017
participant_age                 92298
n_guns_involved                 99451
gun_type                        99451
gun_stolen                      99498
participant_

Many of the columns describing incident details have missing values. However, the 'state' and 'date' columns have none, which is what matters for our objective (examining gun violence trends for each state).
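
Raw null counts are easier to compare when expressed as a fraction of all rows. The sketch below shows one way to do that on a small made-up table; `toy_df` and its values are illustrative assumptions, not rows from incidents.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for daily_incidents_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona', 'Alaska'],
    'date': ['2014-01-01', '2014-01-01', '2014-01-02', '2014-01-03'],
    'latitude': [32.4, np.nan, 33.4, np.nan],
    'gun_type': [np.nan, np.nan, np.nan, 'Handgun'],
})

# isnull() gives booleans, so the column mean is the fraction of missing values
null_fraction = toy_df.isnull().mean().sort_values()
print(null_fraction)
```

Columns with a fraction of 0 (here 'state' and 'date') can be used without any imputation.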

In [4]:
# First let's make some columns that we'll need
daily_incidents_df['date'] = pd.to_datetime(daily_incidents_df['date']) # Turn the date into a datetime
daily_incidents_df['year'] = daily_incidents_df['date'].dt.year # Create a column for just the year

# Upon examining the data, it seems that records before 2014 have missing data, so we'll exclude them
daily_incidents_df = daily_incidents_df[daily_incidents_df['year'] >= 2014]

# Get the number of deaths (n_killed) for each state, indexed by date
by_date_total_df = daily_incidents_df.groupby(['date', 'state'])['n_killed'].sum().unstack()
# A state with no incidents on a given date is NaN after unstack. Fill with 0s
by_date_total_df = by_date_total_df.fillna(0)
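
The groupby / unstack / fillna pattern above can be seen on a tiny made-up table. Note that `fillna(0)` only fills state/date cells for dates that already appear in the index; if the raw data had a date with no incidents in any state, that date would have no row at all. The sketch below assumes that case can occur and uses `asfreq('D', fill_value=0)` to add such rows. The names `toy_incidents` and `toy_by_date` are hypothetical.

```python
import pandas as pd

# Toy incidents (made-up values); note that 2014-01-02 has no incidents at all
toy_incidents = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-03']),
    'state': ['Alabama', 'Alabama', 'Alaska', 'Alabama'],
    'n_killed': [1, 2, 0, 1],
})

toy_by_date = (toy_incidents
               .groupby(['date', 'state'])['n_killed']
               .sum()
               .unstack()    # states become columns, dates stay in the index
               .fillna(0))   # Alaska has no entry on 2014-01-03

# Insert rows for calendar days that are missing from the index entirely
toy_by_date = toy_by_date.asfreq('D', fill_value=0)
print(toy_by_date)
```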

In [5]:
lat_long_df = daily_incidents_df[['state', 'latitude', 'longitude', 'n_killed']].dropna()
lat_long_df.to_csv('./data/cleaned/lat_long.csv')

In [6]:
# Now we should normalize the number of casualties by state population. 
# We use population_df: data from US Census Bureau

# Transform on a resampled column: get population in that year from population_df
# multiply by 100,000 to get number of casualties per 100,000 people
def normalize_population(x):
    state_name = x.name
    year = str(x.index.year[0])
    population = population_df.loc[state_name, year]
    return x * 100000 / population

by_date_norm_df = by_date_total_df['2014':'2017'].resample('A').transform(normalize_population)
by_date_norm_df.head()

by_date_total_df.to_csv('./data/cleaned/by_date_total.csv')
by_date_norm_df.to_csv('./data/cleaned/by_date_norm.csv')
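
As a cross-check on the per-100,000 normalization, here is a minimal sketch that does the same arithmetic without `resample`, by building a date-by-state table of populations and dividing. It is an alternative formulation, not the notebook's method; the toy tables and their values are assumptions, with `toy_population` shaped like population_df (states in rows, years as string column labels).

```python
import pandas as pd

# Toy daily deaths with states in columns (made-up values)
toy_total = pd.DataFrame(
    {'Alabama': [2.0, 0.0, 1.0], 'Alaska': [0.0, 1.0, 0.0]},
    index=pd.to_datetime(['2014-12-31', '2015-01-01', '2015-01-02']),
)

# Toy population table: states in rows, years as string columns
toy_population = pd.DataFrame(
    {'2014': [4_000_000, 500_000], '2015': [5_000_000, 1_000_000]},
    index=['Alabama', 'Alaska'],
)

# For every date, look up the population row of that date's year
years = toy_total.index.year.astype(str)
pop_by_row = toy_population.T.loc[years, toy_total.columns]
pop_by_row.index = toy_total.index

# Deaths per 100,000 people
toy_norm = toy_total * 100_000 / pop_by_row
print(toy_norm)
```

Each cell is divided by the population of its own state and year, so the 2014 and 2015 rows use different denominators.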

In [7]:
# Create a feature DataFrame to store all of our features (to be used later in our model)
feature_df = by_date_total_df.resample('M').sum().stack().reset_index()
feature_df.columns = ['next_date', 'state', 'next_deaths']
# next_year is the year whose deaths we want to predict; this_year is the year the features come from
# Look up each state's population in this_year
feature_df['next_year'] = feature_df['next_date'].apply(lambda x: x.year)
feature_df['this_year'] = feature_df['next_year'] - 1
feature_df['population'] = feature_df[['state', 'this_year']]\
                            .apply(lambda x: population_df.loc[x['state'], str(x['this_year'])], axis=1)
    
feature_df.head()

Unnamed: 0,next_date,state,next_deaths,next_year,this_year,population
0,2014-01-31,Alabama,38.0,2014,2013,4827660
1,2014-01-31,Alaska,4.0,2014,2013,736760
2,2014-01-31,Arizona,13.0,2014,2013,6616124
3,2014-01-31,Arkansas,12.0,2014,2013,2956780
4,2014-01-31,California,113.0,2014,2013,38347383


## Other Crime
We'll be working with this data:
    
    crime_df                 Annual crime data (2010-2016) from disastercenter.com/crime/  
    annual_gun_deaths_df     Annual gun homicides by state (1999-2013) from gunpolicy.org
And creating/updating these DataFrames:

    overall_2010_2016_df     DataFrame to hold annual statistics from 2010-2016 to be plotted later
    annual_2000_2017_df      DataFrame that holds annual gun death info from 2000-2017
    feature_df               Update with crime features

In [8]:
crime_df['other_crime'] = crime_df[['rape_crime', 'robbery_crime', 'assault_crime', 
                                'burglary_crime', 'larceny_theft_crime', 'vehicle_theft_crime']].sum(axis=1)
crime_df = crime_df.drop(['population', 'index'], axis=1)
crime_df['year'] = crime_df['year'].astype(int)

# Clean annual_gun_deaths_df so we have observations in each row; note this has data from 1999-2013
annual_1999_2013_df = annual_gun_deaths_df.set_index('state').stack().reset_index()
annual_1999_2013_df.columns = ['state', 'year', 'gun_deaths']

# Add 2014-2017 to our annual_gun_deaths
# Resample incidents annually and make each row an observation w/ state, year, and number of incidents 
annual_2014_2017 = by_date_total_df[:'2017'].resample('A').sum().stack().reset_index()
annual_2014_2017['date'] = annual_2014_2017['date'].apply(lambda x: x.year)

# Rearrange columns and concat the dataframes
annual_2014_2017.columns = ['year', 'state', 'gun_deaths']
annual_2014_2017 = annual_2014_2017[['state', 'year', 'gun_deaths']]

# Note we will lose year 1999 when we merge with population later; since population goes from 2000-2017 only
annual_2000_2017_df = pd.concat([annual_1999_2013_df, annual_2014_2017])
annual_2000_2017_df['year'] = annual_2000_2017_df['year'].astype(int)

# Create a DataFrame that will hold all annual info from 2010-2016. We will keep updating this
overall_2010_2016_df = pd.merge(annual_2000_2017_df, crime_df).sort_values(['state','year'])

# Add population data
population_df = population_df.stack().reset_index()
population_df.columns = ['state', 'year', 'population']
population_df['year'] = population_df['year'].astype(int)

annual_2000_2017_df = pd.merge(annual_2000_2017_df, population_df)
annual_2000_2017_df = annual_2000_2017_df.sort_values(['state', 'year'])

overall_2010_2016_df = pd.merge(overall_2010_2016_df, population_df)
overall_2010_2016_df.head(10) # Show what overall_2010_2016 looks like

Unnamed: 0,state,year,gun_deaths,violent_crime,property_crime,murder_crime,rape_crime,robbery_crime,assault_crime,burglary_crime,larceny_theft_crime,vehicle_theft_crime,other_crime,population
0,Alabama,2010,283.0,25886,143362,407,1385,4686,18877,34065,97574,11723,168310,4779736
1,Alabama,2011,292.0,22957,144785,348,1449,4612,15960,35265,99182,10338,166806,4798649
2,Alabama,2012,305.0,20727,154087,276,1425,4702,13744,39723,104223,10141,173958,4813946
3,Alabama,2013,317.0,20834,161835,346,1449,4645,13788,42410,108862,10563,181717,4827660
4,Alabama,2014,325.0,21693,168878,342,1296,5020,15035,47481,111523,9874,190229,4840037
5,Alabama,2015,385.0,20166,173192,299,1370,4906,13591,51119,111411,10662,193059,4850858
6,Alabama,2016,488.0,18363,168828,275,1355,4864,11869,42484,115564,10780,186916,4860545
7,Alaska,2010,30.0,5966,24876,52,757,850,4011,4053,17766,3057,30494,710231
8,Alaska,2011,19.0,5391,20806,59,648,761,3671,3511,15249,2046,25886,722259
9,Alaska,2012,17.0,4684,20334,41,553,629,3243,3150,15445,1739,24759,730825
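
The wide-to-long reshaping used in this cell (one column per year turned into one row per state and year) can be sketched on a toy table. `toy_wide` and its values are made up; the `melt` variant is shown only as an equivalent alternative, not as what the notebook does.

```python
import pandas as pd

# Toy wide table shaped like annual_gun_deaths_df (values are made up)
toy_wide = pd.DataFrame({
    'state': ['Alabama', 'Alaska'],
    '2012': [10, 1],
    '2013': [12, 2],
})

# Pattern used in the notebook: set_index + stack + reset_index
toy_long = toy_wide.set_index('state').stack().reset_index()
toy_long.columns = ['state', 'year', 'gun_deaths']
toy_long['year'] = toy_long['year'].astype(int)

# Alternative with melt; sorting makes the row order match the stacked version
toy_melt = (toy_wide
            .melt(id_vars='state', var_name='year', value_name='gun_deaths')
            .astype({'year': int})
            .sort_values(['state', 'year'])
            .reset_index(drop=True))

print(toy_long)
```

Casting 'year' to int matters because the column labels of the wide table are strings, and later merges compare them with integer years.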


In [9]:
# Let's update our feature dataframe with the crime rate
feature_df = pd.merge(feature_df, crime_df, left_on=['state','this_year'], right_on=['state', 'year'])
feature_df = feature_df.drop('year', axis=1)
feature_df.head()

Unnamed: 0,next_date,state,next_deaths,next_year,this_year,population,violent_crime,property_crime,murder_crime,rape_crime,robbery_crime,assault_crime,burglary_crime,larceny_theft_crime,vehicle_theft_crime,other_crime
0,2014-01-31,Alabama,38.0,2014,2013,4827660,20834,161835,346,1449,4645,13788,42410,108862,10563,181717
1,2014-02-28,Alabama,16.0,2014,2013,4827660,20834,161835,346,1449,4645,13788,42410,108862,10563,181717
2,2014-03-31,Alabama,28.0,2014,2013,4827660,20834,161835,346,1449,4645,13788,42410,108862,10563,181717
3,2014-04-30,Alabama,25.0,2014,2013,4827660,20834,161835,346,1449,4645,13788,42410,108862,10563,181717
4,2014-05-31,Alabama,32.0,2014,2013,4827660,20834,161835,346,1449,4645,13788,42410,108862,10563,181717
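
The output above shows the lagging at work: every 2014 month carries the same 2013 crime figures. A minimal sketch of that pairing, on made-up tables named `toy_monthly` and `toy_annual` (both hypothetical), looks like this:

```python
import pandas as pd

# Toy monthly targets (made-up values)
toy_monthly = pd.DataFrame({
    'next_date': pd.to_datetime(['2014-01-31', '2014-02-28', '2015-01-31']),
    'state': ['Alabama', 'Alabama', 'Alabama'],
    'next_deaths': [38.0, 16.0, 20.0],
})
toy_monthly['next_year'] = toy_monthly['next_date'].dt.year
toy_monthly['this_year'] = toy_monthly['next_year'] - 1  # year the features come from

# Toy annual features (made-up values)
toy_annual = pd.DataFrame({
    'state': ['Alabama', 'Alabama'],
    'year': [2013, 2014],
    'violent_crime': [100, 110],
})

# Join each month to the annual row of the previous year, then drop the duplicate key
toy_features = (pd.merge(toy_monthly, toy_annual,
                         left_on=['state', 'this_year'],
                         right_on=['state', 'year'])
                .drop('year', axis=1))
print(toy_features)
```

Because the merge is an inner join by default, months whose previous year is missing from the annual table are dropped silently.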


## Personal Income
We'll be working with this data:
    
    income_df               Personal income per capita (2009-2017) from US Bureau of Economic Analysis
Goals:
    
    overall_2010_2016_df    Update with personal income per capita
    feature_df              Update with personal income per capita

In [10]:
# Add income data
income_by_state_df = income_df.set_index('state').stack().reset_index()
income_by_state_df.columns = ['state', 'year', 'income']
income_by_state_df['year'] = income_by_state_df['year'].astype(int)

overall_2010_2016_df = pd.merge(overall_2010_2016_df, income_by_state_df)
feature_df = pd.merge(feature_df, income_by_state_df, left_on=['state', 'this_year'], right_on=['state', 'year'])
feature_df = feature_df.drop('year', axis=1)

overall_2010_2016_df.head(10)

Unnamed: 0,state,year,gun_deaths,violent_crime,property_crime,murder_crime,rape_crime,robbery_crime,assault_crime,burglary_crime,larceny_theft_crime,vehicle_theft_crime,other_crime,population,income
0,Alabama,2010,283.0,25886,143362,407,1385,4686,18877,34065,97574,11723,168310,4779736,33696
1,Alabama,2011,292.0,22957,144785,348,1449,4612,15960,35265,99182,10338,166806,4798649,34717
2,Alabama,2012,305.0,20727,154087,276,1425,4702,13744,39723,104223,10141,173958,4813946,35497
3,Alabama,2013,317.0,20834,161835,346,1449,4645,13788,42410,108862,10563,181717,4827660,35792
4,Alabama,2014,325.0,21693,168878,342,1296,5020,15035,47481,111523,9874,190229,4840037,36903
5,Alabama,2015,385.0,20166,173192,299,1370,4906,13591,51119,111411,10662,193059,4850858,38238
6,Alabama,2016,488.0,18363,168828,275,1355,4864,11869,42484,115564,10780,186916,4860545,38918
7,Alaska,2010,30.0,5966,24876,52,757,850,4011,4053,17766,3057,30494,710231,48614
8,Alaska,2011,19.0,5391,20806,59,648,761,3671,3511,15249,2046,25886,722259,51438
9,Alaska,2012,17.0,4684,20334,41,553,629,3243,3150,15445,1739,24759,730825,52667


## Alcohol
### alcohol_df
Goals:

    overall_2010_2016_df    update with avg total alcohol consumption 
    feature_df              update with all alcohol features 

In [11]:
# Merge the alcohol features into the overall and feature data frames
overall_2010_2016_df = pd.merge(overall_2010_2016_df, alcohol_df)
feature_df = pd.merge(feature_df, alcohol_df, left_on=['state', 'this_year'], right_on=['state', 'year'])
feature_df = feature_df.drop('year', axis=1)
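
`pd.merge` without an `on` argument joins on every column name the two frames share, and the default inner join drops unmatched rows without warning. The sketch below shows one way to make such a merge auditable with the `validate` and `indicator` parameters. The alcohol column name `total_alcohol` and the alcohol values are assumptions for illustration.

```python
import pandas as pd

toy_overall = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alaska'],
    'year': [2010, 2011, 2010],
    'gun_deaths': [283.0, 292.0, 30.0],
})
toy_alcohol = pd.DataFrame({
    'state': ['Alabama', 'Alabama'],
    'year': [2010, 2011],
    'total_alcohol': [1.9, 2.0],  # hypothetical column name and made-up values
})

# how='left' keeps every row of toy_overall; validate raises if keys are duplicated;
# indicator adds a '_merge' column recording where each row was found
checked = pd.merge(toy_overall, toy_alcohol, on=['state', 'year'],
                   how='left', validate='one_to_one', indicator=True)

# Rows that an inner join would have dropped
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['state', 'year']])
```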

## Provisions
### provisions_df
Goals:
    
    overall_2010_2016_df     update with total provision count
    feature_df               update with all provisions

In [12]:
# Update the overall_2010_2016_df with the provisions (this merge brings in every provision column, including lawtotal)
overall_2010_2016_df = pd.merge(overall_2010_2016_df, provisions_df)

# Update the feature_df with all of the provisions
feature_df = pd.merge(feature_df, provisions_df, left_on=['state', 'this_year'], right_on=['state', 'year'])
feature_df = feature_df.drop('year', axis=1)

overall_2010_2016_df.head(10)

Unnamed: 0,state,year,gun_deaths,violent_crime,property_crime,murder_crime,rape_crime,robbery_crime,assault_crime,burglary_crime,...,universal,universalh,universalpermit,universalpermith,violent,violenth,violentpartial,waiting,waitingh,lawtotal
0,Alabama,2010,283.0,25886,143362,407,1385,4686,18877,34065,...,0,0,0,0,0,0,0,0,0,11
1,Alabama,2011,292.0,22957,144785,348,1449,4612,15960,35265,...,0,0,0,0,0,0,0,0,0,11
2,Alabama,2012,305.0,20727,154087,276,1425,4702,13744,39723,...,0,0,0,0,0,0,0,0,0,10
3,Alabama,2013,317.0,20834,161835,346,1449,4645,13788,42410,...,0,0,0,0,0,0,0,0,0,10
4,Alabama,2014,325.0,21693,168878,342,1296,5020,15035,47481,...,0,0,0,0,0,0,0,0,0,10
5,Alabama,2015,385.0,20166,173192,299,1370,4906,13591,51119,...,0,0,0,0,0,0,0,0,0,10
6,Alabama,2016,488.0,18363,168828,275,1355,4864,11869,42484,...,0,0,0,0,0,0,0,0,0,10
7,Alaska,2010,30.0,5966,24876,52,757,850,4011,4053,...,0,0,0,0,0,0,0,0,0,5
8,Alaska,2011,19.0,5391,20806,59,648,761,3671,3511,...,0,0,0,0,0,0,0,0,0,5
9,Alaska,2012,17.0,4684,20334,41,553,629,3243,3150,...,0,0,0,0,0,0,0,0,0,5


In [13]:
# Add features normalized by population to the annual and overall dataframes
overall_population = overall_2010_2016_df['population']
overall_2010_2016_df['gun_deaths_norm'] = overall_2010_2016_df['gun_deaths'] / overall_population * 100000
overall_2010_2016_df['other_crime_norm'] = overall_2010_2016_df['other_crime'] / overall_population * 100000

annual_population = annual_2000_2017_df['population']
annual_2000_2017_df['gun_deaths_norm'] = annual_2000_2017_df['gun_deaths'] / annual_population * 100000

In [14]:
# Save annual, overall and feature DataFrames
annual_2000_2017_df.to_csv('./data/cleaned/annual_gun.csv')
overall_2010_2016_df.to_csv('./data/cleaned/overall.csv')
feature_df.to_csv('./data/cleaned/features.csv')
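
One thing to keep in mind when loading these files later: `to_csv` writes the index as an unnamed first column by default, so a plain `read_csv` returns an extra 'Unnamed: 0' column. A small round-trip sketch, using an in-memory buffer instead of a file and a made-up `toy_df`:

```python
import io
import pandas as pd

toy_df = pd.DataFrame({'state': ['Alabama', 'Alaska'], 'gun_deaths': [283.0, 30.0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer)

# Reading back without index_col keeps the saved index as a data column
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores the saved index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is another option when the index carries no information.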

To recap, here are the DataFrames that were saved:

    lat_long_df                Incidents with latitude and longitude coordinates (2014-2017)
    by_date_total_df           Gun homicides aggregated daily and by state (2014-2017)
    by_date_norm_df            Same as above, but normalized by state population
    annual_2000_2017_df        Annual gun homicides with population info (2000-2017)
    overall_2010_2016_df       Annual features (2010-2016)
    feature_df                 Gun homicides aggregated monthly and by state, paired with annual features from
                               the previous year (2014-2017)
