# Analyzing Gun Deaths in the US: 2012-2014


### Data Schema
- **year**: the year in which the fatality occurred.
- **month**: the month in which the fatality occurred.
- **intent**: the intent of the perpetrator of the crime. This can be Suicide, Accidental, NA, Homicide, or Undetermined.
- **police**: whether a police officer was involved with the shooting. Either 0 (false) or 1 (true).
- **sex**: the gender of the victim. Either M or F.
- **age**: the age of the victim.
- **race**: the race of the victim. Either Asian/Pacific Islander, Native American/Native Alaskan, Black, Hispanic, or White.
- **hispanic**: a code indicating the Hispanic origin of the victim.
- **place**: where the shooting occurred. Has several categories, which you're encouraged to explore on your own.
- **education**: educational status of the victim. Can be one of the following:
    * 1: Less than High School
    * 2: Graduated from High School or equivalent
    * 3: Some College
    * 4: At least graduated from College
    * 5: Not available
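
The education codes above are easier to read in plots and tables once they are translated to their labels. The following is a minimal sketch on toy data, not part of the original analysis; the dictionary name `EDUCATION_LABELS` and the toy series are assumptions made for illustration.

```python
import pandas as pd

# Labels copied from the schema above; the dict name is hypothetical.
EDUCATION_LABELS = {
    1: "Less than High School",
    2: "Graduated from High School or equivalent",
    3: "Some College",
    4: "At least graduated from College",
    5: "Not available",
}

# Toy codes standing in for the real 'education' column.
codes = pd.Series([1, 4, 5, 2], name="education")

# Series.map looks up each code in the dictionary.
labels = codes.map(EDUCATION_LABELS)
print(labels.tolist())
```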

### First, let's import libraries and load the dataset.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 100)

import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

import seaborn as sns

In [None]:
df = pd.read_csv('../Data/guns.csv')

# 1. Basic information

First, always look at basic information about the dataset. 

<br>
Display the dimensions of the dataset.

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
# Filter and display only df.dtypes that are 'object'
df.dtypes[df.dtypes == 'object']

In [None]:
# Sort the data first by year, then by month
df.sort_values(['year', 'month'], inplace=True)
df.head(10)

# 2. Distributions of numeric features

One of the most enlightening data exploration tasks is plotting the distributions of your features.

In [None]:
# Plot histogram grid
df.hist(xrot=-45, figsize=(20, 20))
# Clear the text "residue"
plt.show()

In [None]:
# Summarize numerical features
df.describe()

# 3. Distributions of categorical features

Next, let's take a look at the distributions of our categorical features.
<br>

Display summary statistics for categorical features.

In [None]:
# Summarize categorical features
df.describe(include=["object"])

In [None]:
# Plot bar plot for each categorical feature
for feature in df.select_dtypes(include=['object']): 
    sns.countplot(y=feature, data=df)
    plt.show()

In [None]:
# Calculate correlations between numeric features
# (numeric_only=True restricts the calculation to numeric columns)
correlations = df.corr(numeric_only=True)

# Make the figsize 10 x 8
plt.figure(figsize=(10, 8))

# Generate a mask for the upper triangle
mask = np.zeros_like(correlations, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Scale correlations to percentages so the annotations are easier to read
correlations = correlations * 100

# Plot heatmap of annotated correlations
sns.heatmap(correlations, annot=True, fmt='.0f', mask=mask)
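
A correlation matrix is symmetric, so the heatmap above hides the diagonal and the upper triangle to avoid showing each pair twice. As a rough illustration of what the mask contains, here is a sketch on a toy 3x3 array (the array `toy` is an assumption, not the real correlation matrix).

```python
import numpy as np

# Toy 3x3 matrix standing in for a correlation matrix.
toy = np.ones((3, 3))

# Start with an all-False mask of the same shape.
mask = np.zeros_like(toy, dtype=bool)

# Set the diagonal and everything above it to True (True means "hide this cell").
mask[np.triu_indices_from(mask)] = True
print(mask)
```

Only the cells below the diagonal stay `False`, so those are the ones the heatmap draws.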

# 4. Data Cleaning
### Drop unwanted observations

In [None]:
# Drop duplicates
print(df.shape)
df.drop_duplicates(inplace=True)
print(df.shape)
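
Comparing the shape before and after shows how many rows were removed. The same count can be obtained up front with `duplicated()`. A minimal sketch on toy data (the frame `toy` and its values are made up for illustration):

```python
import pandas as pd

# Toy frame in which the second row repeats the first.
toy = pd.DataFrame({
    "year": [2012, 2012, 2013],
    "intent": ["Suicide", "Suicide", "Homicide"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
print(n_duplicates, deduped.shape)
```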

### Fix structural errors

The next bucket under data cleaning involves fixing structural errors. 

<br>

In [None]:
# Display unique values of 'police'
df.police.unique()

Next, to check for typos, mislabeled classes or inconsistent capitalization, display all the class distributions for the <code style="color:steelblue">'intent'</code> feature.

In [None]:
# Class distributions for 'intent'
sns.countplot(y='intent', data=df)

In [None]:
# Murder should be Homicide
df['intent'] = df['intent'].replace('Murder', 'Homicide')
# accident should be Accidental
df['intent'] = df['intent'].replace('accident', 'Accidental')
# suicide should be Suicide
df['intent'] = df['intent'].replace('suicide', 'Suicide')
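
The three corrections can also be expressed as a single mapping. This is a sketch on toy data, assuming the same three mislabeled values; the names `toy` and `intent_fixes` are hypothetical.

```python
import pandas as pd

# Toy column containing the three mislabeled classes plus one correct value.
toy = pd.DataFrame({"intent": ["Murder", "suicide", "accident", "Homicide"]})

# One mapping covers all three corrections; the dict name is hypothetical.
intent_fixes = {
    "Murder": "Homicide",
    "accident": "Accidental",
    "suicide": "Suicide",
}

# Assigning the result back to the column makes the update explicit.
toy["intent"] = toy["intent"].replace(intent_fixes)
print(toy["intent"].tolist())
```

Values that are not keys in the mapping, such as the already correct `Homicide`, pass through unchanged.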

### Plot the class distributions for 'intent' for comparison

In [None]:
# Class distributions for 'intent'
sns.countplot(y='intent', data=df)

Looks much better!

Now do the same for 'race'.

In [None]:
# Class distributions for 'race'
sns.countplot(y='race', data=df)

In [None]:
# Caucasian should be White
df['race'] = df['race'].replace('Caucasian', 'White')

In [None]:
# Class distributions for 'race'
sns.countplot(y='race', data=df)


### Label missing categorical data

It's finally time to address missing data.

<br>
First, find and count the missing categorical data.

In [None]:
# Display number of missing values by feature (categorical)
df.select_dtypes(include=['object']).isnull().sum()

In [None]:
# Fill missing categorical values
for column in df.select_dtypes(include=['object']):
    df[column] = df[column].fillna('Missing')

In [None]:
# Display number of missing values by feature (categorical)
df.select_dtypes(include=['object']).isnull().sum()
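
To see what the labeling step does in isolation, here is a small sketch on toy data. The frame `toy` is an assumption; its text column is created with `dtype=object` explicitly so that `select_dtypes(include=['object'])` picks it up.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and one numeric column, each with a gap.
toy = pd.DataFrame({
    "place": pd.Series(["Home", None, "Street"], dtype=object),
    "age": [34.0, np.nan, 21.0],
})

# Only object (text) columns are labeled; numeric columns are left alone.
for column in toy.select_dtypes(include=["object"]):
    toy[column] = toy[column].fillna("Missing")

print(toy)
```

Labeling missing categories as their own class keeps those rows in the data while still recording that the value was absent.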

### Flag and fill missing numeric data

Finally, let's flag and fill missing numeric data.

<br>
First, let's find and count the missing values in the numeric features.

In [None]:
# Display number of missing values by feature (numeric)
df.select_dtypes(exclude=['object']).isnull().sum()

Let's take a look at the unique values for education to see if we should replace null values or drop the observations. Reference the schema above for education definitions.

In [None]:
# View unique values for education
df.education.unique()

In [None]:
# Fill missing education values with 5 (Not available)
df['education'] = df['education'].fillna(5)
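
The cell above fills the missing values but does not flag them. If a separate flag is wanted, one possible approach is to add an indicator column before filling. This is a sketch on toy data, not the notebook's own method; the column name `education_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy education codes with two gaps.
toy = pd.DataFrame({"education": [1.0, np.nan, 3.0, np.nan]})

# Flag first, so the fact that a value was missing survives the fill.
# 'education_missing' is a hypothetical column name.
toy["education_missing"] = toy["education"].isnull().astype(int)

# Then fill with code 5, which the schema defines as "Not available".
toy["education"] = toy["education"].fillna(5)
print(toy)
```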

In [None]:
# Display number of missing values by feature (numeric)
df.select_dtypes(exclude=['object']).isnull().sum()

Great, education is taken care of. Now handle the missing age values.

In [None]:
# Drop observations with a missing age
df = df.dropna(axis=0, subset=['age'])
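
The `subset` argument limits the check to the listed columns, so rows with gaps elsewhere are kept. A minimal sketch on toy data (the frame `toy` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 has no age, row 2 has no sex.
toy = pd.DataFrame({
    "age": [34.0, np.nan, 21.0],
    "sex": ["M", "F", None],
})

# Only a missing 'age' causes a row to be dropped.
kept = toy.dropna(axis=0, subset=["age"])
print(kept)
```

Row 1 is removed because its age is missing, while row 2 stays even though its `sex` value is missing.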

In [None]:
# Display number of missing values by feature (numeric)
df.select_dtypes(exclude=['object']).isnull().sum()

In [None]:
df.shape

### For readability and consistency, capitalize the column names and name the index

In [None]:
df.index.name = 'Index'
# Capitalize the first letter of every column name
df.columns = df.columns.str.capitalize()
df.head()

In [None]:
# Save our cleaned data for later use (the index is not written to the file)
df.to_csv('project_files/cleaned_guns.csv', index=False)
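
A quick way to check what ends up in the saved file is to write to an in-memory buffer and read it back. This is a sketch on toy data; the frame `toy` and the use of `io.StringIO` in place of a real file are assumptions made for illustration.

```python
import io

import pandas as pd

# Toy frame with capitalized columns and a named index.
toy = pd.DataFrame({"Year": [2012, 2013], "Intent": ["Suicide", "Homicide"]})
toy.index.name = "Index"

# Write without the index, then read the text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```

Because the index is not written, the index name set earlier does not appear in the saved file; only the data columns do.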