# Data Visualization and Processing
---
By: Kris Ghimire, Thad Schwebke, Walter Lai, and Jamie Vo
<img src="Images/broken-1391025_1280.JPG" alt="Crime" style="width: 80%;"/>
Photo Cred.: Photo by kat wilcox from Pexels

In [59]:
# Load in libraries

# general libraries
import pandas as pd
import numpy as np
import os

# hide warnings
import warnings
warnings.filterwarnings('ignore')

# visualizations libraries
import seaborn as sns
import plotly 
import matplotlib.pyplot as plt
%matplotlib inline

### Business Understanding (10 pts)
---

#### Purpose of the dataset.
[Homicide Data](https://www.kaggle.com/murderaccountability/homicide-reports)

(i.e., why was this data collected in
the first place?). 

The Murder Accountability Project is a nonprofit organization that tracks discrepancies between the homicides reported by medical examiners and those recorded in the FBI's voluntary crime reports. Its database is considered one of the most exhaustive collections of homicide records currently available for the US. Additional information about the organization can be found at [Murder Accountability Project](http://www.murderdata.org/).

The records in this file span 1980 through 2014 (the minimum and maximum of the `Year` column) and include demographic information such as sex, age, race, and ethnicity. A more in-depth description of the attributes may be found in the [Data Description](#Data_Description) section.

In [2]:
# read in the data
df = pd.read_csv('../Data/database.csv')

In [3]:
# print the number of records and columns
records = len(df)
attributes = df.columns

print(f'No. of Records: {records} \nNo. of Attributes: {len(attributes)}')

No. of Records: 638454 
No. of Attributes: 24


In [9]:
# create a data frame to hold the attributes and their descriptions
df_description = pd.DataFrame()
df_description['Attributes'] = attributes
df_description['Description'] = ''
df_description.to_excel('../Data/data_description.xlsx')

#### Define and measure the dataset outcomes.
Describe how you would define and measure the outcomes from the
dataset.
That is, why is this data important and how do you know if you have mined
useful knowledge from the dataset? 

#### Model Statistics
How would you measure the effectiveness of a
good prediction algorithm? Be specific.
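
One possible way to make this concrete, assuming the target were a binary attribute such as `Crime Solved` (the notebook has not fixed a target yet), is to report accuracy alongside precision, recall, and F1, since `Yes` is the majority class and accuracy alone can look better than the model is. The helper `binary_metrics` and the label lists below are hypothetical and only illustrate the arithmetic.

```python
def binary_metrics(y_true, y_pred, positive="Yes"):
    """Accuracy, precision, recall and F1 for one positive label (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# invented labels, not model output
y_true = ["Yes", "Yes", "No", "Yes", "No", "No"]
y_pred = ["Yes", "No", "No", "Yes", "Yes", "No"]

scores = binary_metrics(y_true, y_pred)
print(scores)
```

Which of these numbers matters most depends on the cost of each error type, so the choice of headline metric is a judgment call rather than something the data decides.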

### Data Understanding (80 pts total)
---
<a id="Data_Description"></a>
#### [10 points]  Data Description:
Describe the meaning and type of data (scale, values, etc.) for each
attribute in the data file.

#### [15 points] Verify data quality: 
Explain any missing values, duplicate data, and outliers.
Are those mistakes? How do you deal with these problems? Be specific.
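
A minimal sketch of how null and duplicate checks might be run is shown here on a toy frame. Only the column names come from the dataset; the rows are invented. Because `Record ID` is unique per row, duplicates are compared on the remaining columns.

```python
import pandas as pd

# toy rows, invented for illustration; only the column names come from the dataset
toy = pd.DataFrame({
    "Record ID": [1, 2, 3, 4],
    "City": ["Anchorage", "Juneau", "Anchorage", "Nome"],
    "Year": [1980, 1980, 1980, 1981],
    "Victim Age": [30, 25, 30, None],
})

# number of null entries per column
null_counts = toy.isna().sum()

# a unique ID makes every row distinct, so compare rows on the other columns
content_cols = [c for c in toy.columns if c != "Record ID"]
duplicate_mask = toy.duplicated(subset=content_cols, keep="first")

print(null_counts)
print("repeated rows (ignoring Record ID):", int(duplicate_mask.sum()))
```

Rows that repeat on every attribute are not necessarily mistakes; two separate incidents can share the same recorded values, so flagged rows would still need a manual look before being dropped.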

#### [10 points] Statistics:
Give simple, appropriate statistics (range, mode, mean, median, variance,
counts, etc.) for the most important attributes and describe what they mean or if you
found something interesting. Note: You can also use data from other sources for
comparison. Explain the significance of the statistics run and why they are meaningful.

In [23]:
# basic statistics of categorical data
df_categorical = df.select_dtypes(include='object')
df_categorical.describe()

| | Agency Code | Agency Name | Agency Type | City | State | Month | Crime Type | Crime Solved | Victim Sex | Victim Race | Victim Ethnicity | Perpetrator Sex | Perpetrator Age | Perpetrator Race | Perpetrator Ethnicity | Relationship | Weapon | Record Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 | 638454 |
| unique | 12003 | 9216 | 7 | 1782 | 51 | 12 | 2 | 2 | 3 | 5 | 3 | 3 | 191 | 5 | 3 | 28 | 16 | 2 |
| top | NY03030 | New York | Municipal Police | Los Angeles | California | July | Murder or Manslaughter | Yes | Male | White | Unknown | Male | 0 | White | Unknown | Unknown | Handgun | FBI |
| freq | 38416 | 38416 | 493026 | 44511 | 99783 | 58696 | 629338 | 448172 | 494125 | 317422 | 368303 | 399541 | 211079 | 218243 | 446410 | 273013 | 317484 | 616647 |


In [None]:
# get all levels per categorical attribute
df_categorical_levels = pd.DataFrame()
df_categorical_levels['Attribute'] = df_categorical.columns
df_categorical_levels['Levels'] = ''
df_categorical_levels['Levels_Count'] = ''
df_categorical_levels['Unknown_Count'] = ''

# populate the dataframe with the levels of each attribute and its number of 'Unknown' entries
for i, row in df_categorical_levels.iterrows():
    attribute = row['Attribute']
    levels = df[attribute].unique()
    df_categorical_levels.at[i, 'Levels'] = levels
    df_categorical_levels.at[i, 'Levels_Count'] = len(levels)
    # count rows labelled 'Unknown'; this is 0 when the attribute has no such level
    df_categorical_levels.at[i, 'Unknown_Count'] = int((df[attribute] == 'Unknown').sum())

In [58]:
# show the dataframe
df_categorical_levels.sort_values(by='Unknown_Count', ascending = False)

| | Attribute | Levels | Levels_Count | Unknown_Count |
|---|---|---|---|---|
| 14 | Perpetrator Ethnicity | [Unknown, Not Hispanic, Hispanic] | 3 | 446410 |
| 10 | Victim Ethnicity | [Unknown, Not Hispanic, Hispanic] | 3 | 368303 |
| 15 | Relationship | [Acquaintance, Unknown, Wife, Stranger, Girlfr... | 28 | 273013 |
| 13 | Perpetrator Race | [Native American/Alaska Native, White, Unknown... | 5 | 196047 |
| 11 | Perpetrator Sex | [Male, Unknown, Female] | 3 | 190365 |
| 16 | Weapon | [Blunt Object, Strangulation, Unknown, Rifle, ... | 16 | 33192 |
| 9 | Victim Race | [Native American/Alaska Native, White, Black, ... | 5 | 6676 |
| 8 | Victim Sex | [Male, Female, Unknown] | 3 | 984 |
| 1 | Agency Name | [Anchorage, Juneau, Nome, Bethel, North Slope ... | 9216 | 47 |
| 12 | Perpetrator Age | [15, 42, 0, 36, 27, 35, 40, 49, 39, 29, 19, 23... | 191 | 0 |


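Missing data in this dataset is coded as `Unknown`. The attributes with the most `Unknown` entries are Perpetrator Ethnicity (446,410 of 638,454 records, about 70%), Victim Ethnicity (368,303, about 58%), Relationship (273,013, about 43%), Perpetrator Race (196,047, about 31%), and Perpetrator Sex (190,365, about 30%).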
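
Comparing attributes is often easier with shares than with raw counts. The sketch below shows one way to compute the fraction of `Unknown` entries per column; the helper name `unknown_share` and the toy rows are hypothetical.

```python
import pandas as pd


def unknown_share(frame, label="Unknown"):
    """Fraction of rows per column equal to the placeholder label, largest first."""
    return frame.eq(label).mean().sort_values(ascending=False)


# toy rows, invented for illustration; only the column names come from the dataset
toy = pd.DataFrame({
    "Perpetrator Ethnicity": ["Unknown", "Unknown", "Hispanic", "Unknown"],
    "Relationship": ["Wife", "Unknown", "Stranger", "Unknown"],
    "Victim Sex": ["Male", "Female", "Male", "Male"],
})

shares = unknown_share(toy)
print(shares)
```

On the real data the same call would be `unknown_share(df_categorical)`, assuming `Unknown` is the only placeholder label in use.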

In [49]:
# number of records whose relationship is recorded as 'Unknown'
(df['Relationship'] == 'Unknown').sum()

273013

In [62]:
# basic statistics for continuous variables
df.describe()

| | Record ID | Year | Incident | Victim Age | Victim Count | Perpetrator Count |
|---|---|---|---|---|---|---|
| count | 638454.0 | 638454.0 | 638454.0 | 638454.0 | 638454.0 | 638454.0 |
| mean | 319227.5 | 1995.801102 | 22.967924 | 35.033512 | 0.123334 | 0.185224 |
| std | 184305.93872 | 9.927693 | 92.149821 | 41.628306 | 0.537733 | 0.585496 |
| min | 1.0 | 1980.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 159614.25 | 1987.0 | 1.0 | 22.0 | 0.0 | 0.0 |
| 50% | 319227.5 | 1995.0 | 2.0 | 30.0 | 0.0 | 0.0 |
| 75% | 478840.75 | 2004.0 | 10.0 | 42.0 | 0.0 | 0.0 |
| max | 638454.0 | 2014.0 | 999.0 | 998.0 | 10.0 | 10.0 |
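
The `Victim Age` maximum of 998 sits far above the 75th percentile of 42, which suggests a placeholder code rather than a real age. A minimal sketch of flagging such values follows; the cut-off of 120 is an assumption, not something taken from the dataset documentation, and the ages are toy values.

```python
import pandas as pd

MAX_PLAUSIBLE_AGE = 120  # assumed cut-off, not taken from the dataset documentation

# toy ages, invented for illustration
ages = pd.Series([22, 30, 42, 998, 0, 35], name="Victim Age")

# flag implausible ages and blank them out so they do not distort the statistics
implausible = ages > MAX_PLAUSIBLE_AGE
cleaned = ages.where(~implausible)  # flagged entries become NaN

print("flagged:", int(implausible.sum()))
print("mean before:", ages.mean())
print("mean after:", cleaned.mean())
```

Whether an age of 0 is a real value (an infant) or another placeholder would need to be checked against the data documentation before treating it the same way.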


#### [15 points] Visualization
Visualize the most important attributes appropriately (at least 5 attributes).
Important: Provide an interpretation for each chart. Explain for each attribute why the
chosen visualization is appropriate.

#### [15 points] EDA
Explore relationships between attributes: Look at the attributes via scatter
plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain
any interesting relationships.
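
As a starting point, a cross-tabulation and a group-wise average might look like the sketch below. The rows are invented and the pairing of `Weapon` with `Crime Solved` is only an example of the pattern, not a finding.

```python
import pandas as pd

# toy rows, invented for illustration; only the column names come from the dataset
toy = pd.DataFrame({
    "Weapon": ["Handgun", "Knife", "Handgun", "Handgun", "Knife", "Rifle"],
    "Crime Solved": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Victim Age": [30, 25, 41, 19, 52, 33],
})

# row-normalised cross-tabulation: share of solved and unsolved cases per weapon
solved_by_weapon = pd.crosstab(toy["Weapon"], toy["Crime Solved"], normalize="index")

# group-wise average of a numeric attribute
mean_age_by_weapon = toy.groupby("Weapon")["Victim Age"].mean()

print(solved_by_weapon)
print(mean_age_by_weapon)
```

Normalising by row makes weapons with very different record counts comparable, at the cost of hiding how many records each share is based on.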

In [85]:
states = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhodes Island': 'RI',  # sic: the key must match the spelling used in the dataset's State column
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}
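# count the records per state; after count(), every remaining column holds the per-state count
df_state = df.groupby('State').count().reset_index()

# translate full state names into two-letter codes (raises KeyError on an unmapped name)
df_state['State_Abb'] = [states[full_state] for full_state in df_state['State']]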

In [88]:
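import plotly.express as px

# choropleth of record counts per state; after count(), 'Record ID' holds the number of records in each state
fig = px.choropleth(
    locations=df_state['State_Abb'],
    locationmode="USA-states",
    color=df_state['Record ID'],
    scope="usa",
)
fig.show()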

#### [10 points] Discoveries
Identify and explain interesting relationships between features and the class
you are trying to predict (i.e., relationships with variables and the target classification).

#### [5 points] New Feature Creation
Are there other features that could be added to the data or created from
existing features? Which ones?
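
Two candidates that could be derived from existing columns are a decade bucket from `Year` and an age group from `Victim Age`. The sketch below is illustrative only: the bin edges and labels are assumptions, and the rows are invented.

```python
import pandas as pd

# toy rows, invented for illustration; only the column names come from the dataset
toy = pd.DataFrame({"Year": [1980, 1995, 2014], "Victim Age": [5, 30, 70]})

# decade of the incident, e.g. 1995 -> 1990
toy["Decade"] = (toy["Year"] // 10) * 10

# age groups with left-closed bins: [0, 18), [18, 40), [40, 65), [65, 120)
toy["Victim Age Group"] = pd.cut(
    toy["Victim Age"],
    bins=[0, 18, 40, 65, 120],
    right=False,
    labels=["0-17", "18-39", "40-64", "65+"],
)

print(toy)
```

Ages outside the bin range, such as placeholder codes, fall into no bin and come out as missing, which keeps them from being silently assigned to the oldest group.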

#### Exceptional Work (10 points total)
- You have free rein to provide additional analyses.
- One idea: implement dimensionality reduction, then visualize and interpret the results.