# Dataset 1: FBI Crime Data

We read in the CSV file obtained from https://ucr.fbi.gov/crime-in-the-u.s/2017/preliminary-report and use `head()` to preview the first 5 rows of data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

crime_data = pd.read_csv('January_to_June_2017_Offenses.csv')
crime_data.head()

Unnamed: 0,Table 4,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,January to June 2016–2017,,,,,,,,,,,,,,,,,
1,Offenses Reported to Law Enforcement,,,,,,,,,,,,,,,,,
2,"by State by City 100,000 and over in population",,,,,,,,,,,,,,,,,
3,State,City,,Population1,Violent\rcrime,Murder,Rape2,Robbery,Aggravated\rassault,Property\rcrime,Burglary,Larceny-\rtheft,Motor\rvehicle\rtheft,Arson3,,,,
4,ALABAMA,BIRMINGHAM,2016.0,212549,1732,44,75,460,1153,5875,1318,3807,750,76,,,,


Rows 0-2 of the file contain background information, which we can remove.  Row 3 contains the data column labels, which we can discard after using it to rename our columns.  This is a wide dataset, and the 4 columns at the end appear to be unnecessary, since they contain only NaN elements.  (Also, a look at the original CSV file shows 11 rows at the very end that contain background information, so we can remove these as well.)

We drop the top 4 rows, the final 11 rows, and the final 4 columns, then we rename the columns.

In [2]:
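# Drop the 4 leading rows (title lines plus the label row) and the 11 trailing background rows.
crime_data.drop(crime_data.index[0:4], inplace = True)
crime_data.drop(crime_data.index[-11:], inplace = True)
# Drop the 4 trailing columns, which contain only NaN.
crime_data.drop(crime_data.columns[14:18], axis = 1, inplace = True)
# Replace the placeholder headers with readable column names.
crime_data.columns = ['State', 'City', 'Year', 'Population', 'Violent Crime', 'Murder',
                      'Rape', 'Robbery', 'Aggravated Assault', 'Property Crime', 'Burglary',
                      'Larceny-Theft', 'Motor Vehicle Theft', 'Arson']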
crime_data.head()

Unnamed: 0,State,City,Year,Population,Violent Crime,Murder,Rape,Robbery,Aggravated Assault,Property Crime,Burglary,Larceny-Theft,Motor Vehicle Theft,Arson
4,ALABAMA,BIRMINGHAM,2016.0,212549.0,1732,44,75,460,1153,5875,1318,3807,750,76.0
5,,,2017.0,,1829,42,92,472,1223,6458,1292,4350,816,
6,,MOBILE4,2016.0,249921.0,793,18,47,181,547,5169,1100,3724,345,
7,,,2017.0,,925,20,53,235,617,6482,1507,4344,631,
8,,MONTGOMERY,2016.0,199565.0,563,18,42,187,316,4229,1043,2790,396,
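As a rough illustration of the trimming step, the sketch below applies the same idea to a tiny made-up CSV rather than the FBI file. The names `raw`, `toy`, `label_row` and `n_footer` are hypothetical, and `dropna(axis=1, how="all")` is swapped in for the hard-coded column positions used above, on the assumption that every unwanted column is entirely empty.

```python
import io
import pandas as pd

# Made-up miniature of the file layout: a title line that pandas takes as the
# header, one more title row, a label row, two data rows, and one footnote row.
raw = io.StringIO(
    "Table 4,,,,\n"
    "January to June,,,,\n"
    "State,City,,Murder,\n"
    "ALABAMA,BIRMINGHAM,2016,44,\n"
    ",,2017,42,\n"
    "1 Footnote text,,,,\n"
)
toy = pd.read_csv(raw)

label_row = 1   # index of the row holding the column labels (assumed known)
n_footer = 1    # number of trailing background rows (assumed known)

# Keep only the rows between the label row and the footer.
body = toy.iloc[label_row + 1 : len(toy) - n_footer]

# Drop every column that is NaN in all remaining rows.
body = body.dropna(axis=1, how="all")

# Assign readable names to the surviving columns.
body.columns = ["State", "City", "Year", "Murder"]
print(body)
```

Dropping by emptiness avoids counting columns by hand, but it would also remove a real column that happens to be blank in every row, so the positional approach is the safer choice when the layout is known.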


Now we forward-fill (`ffill`) the missing State, City and Population values, so that each blank field takes the value from the nearest filled row above it and every row is fully labeled.  From this brief peek we see that the Arson column contains NaN values; we assume that other crime columns might also contain NaN values, so we replace them all with 0.  A blanket `fillna(0)` is safe at this point because the State, City and Population values have already been filled.  We also convert the Year datatype from float to int.  Since we will focus on the Murder category, we ensure it is also stored as int.

In [3]:
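# Forward-fill the identifying columns, which are blank on continuation rows.
crime_data[['State', 'City','Population']] = crime_data[['State', 'City','Population']].ffill()
# Treat the remaining missing counts (e.g. Arson) as 0.
crime_data.fillna(0, inplace = True)
# Store Year and Murder as integers.
crime_data[['Year', 'Murder']] = crime_data[['Year', 'Murder']].astype('int')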
crime_data.head()

Unnamed: 0,State,City,Year,Population,Violent Crime,Murder,Rape,Robbery,Aggravated Assault,Property Crime,Burglary,Larceny-Theft,Motor Vehicle Theft,Arson
4,ALABAMA,BIRMINGHAM,2016,212549,1732,44,75,460,1153,5875,1318,3807,750,76
5,ALABAMA,BIRMINGHAM,2017,212549,1829,42,92,472,1223,6458,1292,4350,816,0
6,ALABAMA,MOBILE4,2016,249921,793,18,47,181,547,5169,1100,3724,345,0
7,ALABAMA,MOBILE4,2017,249921,925,20,53,235,617,6482,1507,4344,631,0
8,ALABAMA,MONTGOMERY,2016,199565,563,18,42,187,316,4229,1043,2790,396,0
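To make the order of operations concrete, here is a minimal sketch on a made-up four-row frame (the values and the names `toy` and `id_cols` are illustrative, not taken from the FBI file). It shows why the identifying columns are forward-filled before the remaining gaps are set to 0: done the other way round, the blank State and City fields would become 0 instead of inheriting the value above.

```python
import numpy as np
import pandas as pd

# Illustrative data: identifying fields appear only on the first row of a block.
toy = pd.DataFrame({
    "State": ["ALABAMA", np.nan, np.nan, np.nan],
    "City": ["BIRMINGHAM", np.nan, "MOBILE", np.nan],
    "Year": [2016.0, 2017.0, 2016.0, 2017.0],
    "Murder": ["44", "42", "18", "20"],
    "Arson": [76.0, np.nan, np.nan, np.nan],
})

# Step 1: copy the last seen State/City down into the blank rows.
id_cols = ["State", "City"]
toy[id_cols] = toy[id_cols].ffill()

# Step 2: only now replace the remaining gaps (here, Arson) with 0.
toy["Arson"] = toy["Arson"].fillna(0)

# Step 3: store Year and Murder as integers.
toy = toy.astype({"Year": int, "Murder": int})
print(toy)
```

Note that after step 2 a missing Arson count can no longer be told apart from a reported count of zero, which is acceptable here only because the analysis focuses on Murder.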


Now we group the data by State, sum the Murder counts, and display the top 5 states.

In [4]:
murder = crime_data.groupby('State')['Murder'].sum()
murder.sort_values(ascending = False).head()

State
CALIFORNIA    1082
ILLINOIS       705
FLORIDA        413
NEW YORK       389
OHIO           353
Name: Murder, dtype: int64

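Based on this analysis, we conclude that cities in **California** lead the country, with **1082 murders** (summed over the January to June periods of 2016 and 2017, for cities of 100,000 or more residents).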
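Because each city has one row per year, a state total from `groupby('State')` adds the 2016 and 2017 periods together. The sketch below, on made-up numbers with hypothetical state and city names, shows one possible way to keep the two years apart by grouping on both State and Year; it is an illustration of the technique, not a result from the FBI data.

```python
import pandas as pd

# Made-up counts: two cities in state A, one city in state B, two years each.
toy = pd.DataFrame({
    "State": ["A", "A", "A", "A", "B", "B"],
    "City": ["X", "X", "Y", "Y", "Z", "Z"],
    "Year": [2016, 2017, 2016, 2017, 2016, 2017],
    "Murder": [10, 12, 5, 4, 20, 15],
})

# Grouping by State alone pools both years into one total per state.
by_state = toy.groupby("State")["Murder"].sum()

# Grouping by State and Year, then unstacking, gives one column per year.
by_state_year = toy.groupby(["State", "Year"])["Murder"].sum().unstack("Year")

print(by_state.sort_values(ascending=False))
print(by_state_year)
```

In this toy example state B has the larger pooled total, while state A has the larger 2017 count, which is the kind of difference a per-year breakdown makes visible.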