# Data Visualization and Processing
---
By: Kris Ghimire, Thad Schwebke, Walter Lai, and Jamie Vo
<img src="Images/broken-1391025_1280.JPG" alt="Crime" style="width: 80%;"/>
Photo Cred.: Photo by kat wilcox from Pexels

In [None]:
# Load in libraries

# general libraries
import pandas as pd
import numpy as np
import os

# hide warnings
import warnings
warnings.filterwarnings('ignore')

# visualizations libraries
import seaborn as sns
import plotly 
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

# Machine Learning 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import random



### Business Understanding (10 pts)
---

#### Purpose of the dataset.
[Homicide Data](https://www.kaggle.com/murderaccountability/homicide-reports)

(i.e., why was this data collected in
the first place?). 

The Murder Accountability Project is a nonprofit organization that identifies discrepancies between the homicides reported by medical examiners and those in the FBI's voluntary crime reports. Its database is considered one of the most exhaustive collections of homicide records currently available for the US. Additional information about the organization can be found at [Murder Accountability Project](http://www.murderdata.org/).

The dataset dates back to 1967 and includes demographic information such as gender, age, and ethnicity. A more in-depth description of the attributes can be found in the [Data Description](#Data_Description) section.

In [None]:
# read in the data
df = pd.read_csv('../Data/database.csv')

In [None]:
# print the number of records and columns
records = len(df)
attributes = df.columns

print(f'No. of Records: {records} \nNo. of Attributes: {len(attributes)}')

#### Define and measure the dataset outcomes.
Describe how you would define and measure the outcomes from the
dataset.
That is, why is this data important and how do you know if you have mined
useful knowledge from the dataset? 

#### Model Statistics
How would you measure the effectiveness of a
good prediction algorithm? Be specific.

### Data Understanding (80 pts total)
---
<a id="Data_Description"></a>
#### [10 points]  Data Description:
Describe the meaning and type of data (scale, values, etc.) for each
attribute in the data file.

#### [15 points] Verify data quality: 
Explain any missing values, duplicate data, and outliers.
Are those mistakes? How do you deal with these problems? Be specific.

---

AGE
    - An age of 0 is treated as unknown data
    - An age of 99+ or 998 indicates an age greater than 99
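
A minimal sketch of how these age conventions could be applied before computing statistics, assuming the column is named `Victim Age` as elsewhere in this notebook. The toy values and the `Victim Age Clean` and `Victim Age 99 Plus` column names are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# toy data for illustration only (not taken from the real file)
ages = pd.DataFrame({'Victim Age': [0, 25, 42, 99, 998]})

# treat 0 as unknown so it does not pull the mean and median down
ages['Victim Age Clean'] = ages['Victim Age'].replace(0, np.nan)

# flag top-coded records (99 and the 998 code), then cap them at 99
ages['Victim Age 99 Plus'] = ages['Victim Age'] >= 99
ages['Victim Age Clean'] = ages['Victim Age Clean'].clip(upper=99)

print(ages)
```

Keeping the flag in a separate column preserves the information that a capped value was originally top-coded.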

#### [10 points] Statistics:
Give simple, appropriate statistics (range, mode, mean, median, variance,
counts, etc.) for the most important attributes and describe what they mean or if you
found something interesting. Note: You can also use data from other sources for
comparison. Explain the significance of the statistics run and why they are meaningful.

In [None]:
# basic statistics of categorical data
df_categorical = df.select_dtypes(include='object')
df_categorical.describe()

In [None]:
# get all levels per categorical attribute
df_categorical_levels = pd.DataFrame()
df_categorical_levels['Attribute'] = df_categorical.columns
df_categorical_levels['Levels'] = ''
df_categorical_levels['Levels_Count'] = ''
df_categorical_levels['Unknown_Count'] = ''

# populate the dataframe with categorical levels and count of each category
for i, row in df_categorical_levels.iterrows():
    attribute = row['Attribute']
    levels = df[attribute].unique()
    df_categorical_levels.at[i,'Levels'] = levels
    df_categorical_levels.at[i,'Levels_Count'] = len(levels)
    # count rows whose value is literally 'Unknown' (0 when the level is absent)
    df_categorical_levels.at[i,'Unknown_Count'] = int((df[attribute] == 'Unknown').sum())

In [None]:
# show the dataframe
df_categorical_levels.sort_values(by='Unknown_Count', ascending = False)

The attributes with the most `Unknown` values are ethnicity, relationship, and perpetrator race and sex.

In [None]:
# basic statistics for continuous variables
df.describe()

In [None]:
df.groupby('Victim Age').count()

#### [15 points] Visualization
Visualize the most important attributes appropriately (at least 5 attributes).
Important: Provide an interpretation for each chart. Explain for each attribute why the
chosen visualization is appropriate.


In [None]:
fig = px.scatter_matrix(df[['Year', 'Incident', 'Victim Age', 'Victim Count','Perpetrator Count']])
fig.show()

#### [15 points] EDA
Explore relationships between attributes: Look at the attributes via scatter
plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain
any interesting relationships.

In [None]:
# https://gist.github.com/rogerallen/1583593
states = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhodes Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}
df_state = df.groupby('State').count().reset_index()

df_state['State_Abb'] = [states[full_state] for full_state in df_state['State']]

In [None]:
# choropleth map of homicide record counts per state

fig = px.choropleth(locations=df_state['State_Abb'],
                    locationmode="USA-states",
                    color=df_state['Record ID'],
                    color_continuous_scale='portland',
                    scope="usa")
fig.update_layout(
    title_text = 'Homicide Counts per State',
    geo_scope='usa', # limit map scope to USA
)
fig.show()

#### [10 points] Discoveries
Identify and explain interesting relationships between features and the class
you are trying to predict (i.e., relationships with variables and the target classification).

#### [5 points] New Feature Creation
Are there other features that could be added to the data or created from
existing features? Which ones?

##### Dummy Code
- Dummy code the categorical data
- export to csv due to time required for loop to run

In [None]:
# Function to create dummy variables
def dummy_code(col, df): # input the column names and dataframe
    """Return the dummy (one-hot) columns for every column name in `col`."""
    # encode each column separately, then concatenate once at the end
    dummies = [pd.get_dummies(df[val], prefix=val) for val in col]
    return pd.concat(dummies, axis=1, sort=False) if dummies else pd.DataFrame()

In [None]:
# select columns for dummy coding
cat_col = df_categorical.columns.values
# drop the first two categorical columns so they are not dummy coded
categorical = np.delete(cat_col, [0,1])

In [None]:
# call function for dummy coding variables
df_dummy = dummy_code(categorical, df)

The cell below is commented out because the export is computationally expensive; rerun it only when necessary.

In [None]:
# export to csv
#df_full = pd.concat([df_dummy, df[df.describe().columns]], axis=1, sort=False)
#df_full = pd.concat([df_full, df[['Agency Name', 'Agency Code']]], axis=1, sort=False)
#df_full.to_csv('../Data/Dummy_coded_database.csv')

#### Exceptional Work (10 points total)
• You have free rein to provide additional analyses.
• One idea: implement dimensionality reduction, then visualize and interpret the results.

##### PCA 
In this example, the data set will be used to determine the probability that a crime will be solved.

Response Variable: Crime Solved

In [None]:
# read in the dummy-coded data
df_full = pd.read_csv('../Data/Dummy_coded_database.csv')

In [None]:
df_full = df_full.drop('Unnamed: 0', axis=1)

##### Train/Test Split
- Train/Test split due to the large data size and for data validation [Resource](https://data-flair.training/blogs/train-test-set-in-python-ml/#:~:text=%20How%20to%20Split%20Train%20and%20Test%20Set,our%20model%20on%20the%20train%20data...%20More%20)
---

In [None]:
# set seed (scikit-learn falls back on NumPy's global random state)
np.random.seed(1234)
df_pca = df_full.drop(['Agency Name', 'Agency Code'], axis=1)
# split into train/test
y = df_pca['Crime Solved_Yes']
x = df_pca.drop(['Crime Solved_Yes', 'Crime Solved_No'], axis = 1)

In [None]:
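# note: test_size=0.8 holds out 80% of the rows, so only 20% is used for training
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.8,random_state=1234)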

##### PCA Code
---
1. Since PCA is sensitive to scales, the first step is to scale the data [Resource](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)
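
To illustrate why scaling matters, here is a small sketch on synthetic data (not the homicide data). The two made-up features are independent but differ in scale by a factor of 1000. The variable names `toy`, `raw_ratio`, and `scaled_ratio` are assumptions for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic data for illustration: two independent features on very different scales
rng = np.random.default_rng(0)
toy = np.column_stack([
    rng.normal(0, 1, 500),      # small-scale feature
    rng.normal(0, 1000, 500),   # large-scale feature
])

# without scaling, the large-scale feature dominates the first component
raw_ratio = PCA(n_components=2).fit(toy).explained_variance_ratio_

# after standardizing, both features contribute comparably
toy_scaled = StandardScaler().fit_transform(toy)
scaled_ratio = PCA(n_components=2).fit(toy_scaled).explained_variance_ratio_

print('unscaled:', raw_ratio)
print('scaled:  ', scaled_ratio)
```

On the unscaled data, nearly all of the explained variance lands on the first component simply because one feature has larger units. After standardizing, the variance is shared roughly evenly.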

In [None]:
# Standardizing the features
x = StandardScaler().fit_transform(x_train)

In [None]:
from sklearn.decomposition import PCA
# project the standardized features onto the first 10 principal components
pca = PCA(n_components=10)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents, columns=['PCA_'+ str(i) for i in range(10)])

In [None]:
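# reset the index so the training labels line up row-for-row with the components
df_PCA = pd.concat([principalDf, y_train.reset_index(drop=True)], axis=1)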

In [None]:
fig = px.scatter(df_PCA, x='PCA_0', y='PCA_1', color='Crime Solved_Yes')
fig.update_layout(title='PCA 1 vs. PCA 2',
                  yaxis_zeroline=False, xaxis_zeroline=False)
fig.update_xaxes(title_text='PCA 1')
fig.update_yaxes(title_text='PCA 2')
fig.show()

##### Logistic Regression
---

###### Balancing the Dataset
The data set is heavily skewed toward solved cases ("Yes"), as shown in the table below.

In [None]:
# check for a balanced dataset
df_crime = df_full[['Crime Solved_Yes', 'Crime Solved_No']].groupby('Crime Solved_Yes').count().reset_index().rename(columns={'Crime Solved_No':'Count'})
df_crime['Solved'] = ['No', 'Yes']
df_crime = df_crime.drop('Crime Solved_Yes', axis=1)
total = df_crime['Count'].sum()
# express each class count as a percentage of all records
df_crime['Percentage'] = 100 * df_crime['Count'] / total
df_crime

Down-sampling will be used to balance the data.
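
A minimal sketch of one way to down-sample, shown on toy data rather than the real file. It assumes the response column is named `Crime Solved_Yes` as above. The `toy`, `feature`, `n_minority`, and `balanced` names are hypothetical, and this is not necessarily the approach used later in the analysis.

```python
import pandas as pd

# toy data for illustration only; 1 = solved, 0 = unsolved
toy = pd.DataFrame({
    'Crime Solved_Yes': [1] * 8 + [0] * 2,
    'feature': range(10),
})

# size of the smallest class
n_minority = toy['Crime Solved_Yes'].value_counts().min()

# randomly sample every class down to the minority-class size
balanced = toy.groupby('Crime Solved_Yes').sample(n=n_minority, random_state=1234)

print(balanced['Crime Solved_Yes'].value_counts())
```

Down-sampling is usually applied to the training split only, so that the test set keeps the original class proportions.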