# Data Visualization and Processing

By: Kris Ghimire, Thad Schwebke, Walter Lai, and Jamie Vo

## CONTENTS 

<a href="#Business-Understanding"><b>Business Understanding</b></a>

<a href = "#Data-Understanding"><b>Data Understanding</b></a>  
    <ul>
    <li><a href="#Data-Description">Data Description</a></li>
    <li><a href="#Data-Quality">Data Quality</a></li>
    <li><a href="#Statistics">Statistics</a></li>
    <li><a href="#EDA">EDA</a></li>
    <li><a href="#Visualization">Visualization</a></li>
    <li><a href="#Discoveries">Discoveries</a></li>
    <li><a href="#New-Feature-Creation">New Feature Creation</a></li>
    <li><a href="#Exceptional-Work">Exceptional Work</a></li>
    </ul> 


In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

# Business Understanding

_Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific_.


![crime.jpg](attachment:crime.jpg)
Homicide has been in the news frequently of late, in the midst of the ongoing Covid-19 pandemic. Violent crime and homicide have been rising markedly across the US. The number of homicides has increased by almost double digits in many larger cities, such as Chicago, New York, and Philadelphia, to name a few. To better understand criminal and victim profiles, to visualize interesting relationships, and to predict whether a homicide has been solved or remains unsolved, we chose homicide as the topic for our project.

Put simply, homicide is the killing of one person by another. Homicide may or may not be illegal. Lawful homicide includes cases such as a person killing an intruder without committing a crime or soldiers killing enemies in battle, while unlawful homicide is the intentional murder of one individual by another, or involvement in such an act. Our data set has two types of crime: Murder or Manslaughter, and Manslaughter by Negligence. Murder occurs when one human being unlawfully kills another. Murder is divided into degrees: first-degree murder, which is willful, deliberate, and premeditated, and second-degree murder, which does not carry the death penalty. Manslaughter is the act of killing another human being in a way that is less culpable than murder; in other words, manslaughter is not as severe a crime as murder. Manslaughter is categorized as voluntary or involuntary. Voluntary manslaughter is the killing of another human being under extreme provocation, which typically involves no prior intent to kill. For example, an individual who kills another individual in self-defense may be charged with voluntary manslaughter if he was the original attacker in the situation.

Involuntary manslaughter is the death of another human being caused by the negligence or recklessness of the defendant. For example, a person who drives under the influence of alcohol may hit and kill a pedestrian, although killing the pedestrian was not the driver's intention.

The United States does a poor job of tracking and accounting for its unsolved homicides. According to a Scripps Howard News Service study of the FBI's Uniform Crime Report, nearly 185,000 cases of homicide and non-negligent manslaughter across the U.S. went unsolved from 1980 to 2008. The rate at which police clear homicides through arrest has declined year over year. According to the FBI's Uniform Crime Report, about 4 of every 10 homicides (40 percent) currently go unsolved each year. This rising number of unsolved homicides, also known as cold cases, is a major problem for our society as well as for law enforcement: it leaves a growing number of killers on the streets, undermines safety in urban neighborhoods, and erodes confidence in the criminal justice system.

No one knows all the names of the murder victims because no law enforcement agency in America is assigned to monitor failed homicide investigations by local police departments. Even the official national statistics on murder are actually estimates and projections based upon incomplete reports by police departments that voluntarily choose (or refuse) to participate in federal crime reporting programs.

With all of this in mind, our primary goal in this project is to classify cases as solved or unsolved based on the data we have.

Our dataset comes from the Murder Accountability Project, a nonprofit group organized in 2015 to educate Americans on the importance of accurately accounting for unsolved homicides. The project's board of directors is composed of retired law enforcement investigators, investigative journalists, criminologists, and other experts on various aspects of homicide.

This dataset is important because unsolved cases pile up every day in detectives' offices, and police departments lack trained staff. According to one report, more than 250,000 cold cases have accumulated since 1980. A model that classifies or predicts which crimes have been solved and which have not would therefore significantly help law enforcement authorities. Much other useful information can also be mined from the data, such as victim age, crime type, type of weapon used, perpetrator age, and location, to name a few. In addition, many families are still waiting for justice for their loved ones; knowing that their case has been solved would bring them relief as well.








The major analyses we are interested in for this dataset are:

1. Predict whether a case will be solved, based on the attributes of the case (i.e. Year, Month, City, State, Agency, Weapon, Victim attributes, Perpetrator attributes, and Relationship).
2. Profile the Perpetrator (Age, Race, Ethnicity, and Sex), based on the other attributes (i.e. Year, Month, City, State, Weapon, and Victim attributes).

<a href="#CONTENTS"> Top of the page</a>

In [None]:
# all imported libraries used for analysis
import numpy as np
import pandas as pd 
import os 
import urllib
import copy
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns 
import plotly.graph_objects as go
import plotly.express as px

# set color scheme and style for seaborn
sns.set(color_codes=True)
sns.set_style('whitegrid')

Imports various libraries used in the analysis of the homicide data

In [None]:
states = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhodes Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}
df_state = df.groupby('State').count().reset_index()

df_state['State_Abb'] = [states[full_state] for full_state in df_state['State']]

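Maps each full state name to its two-letter abbreviation and counts the records per state, for use in the state map plot in the Visualization section.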

# Data Understanding


## Data Description

_Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file_.

<a href="#CONTENTS"> Top of the page</a>

--Data Meaning Type Full Write-up--

--Insert a table with all the features--

--Jamie has a really nice table view based on an Excel file--

--Align data description and attribute types information from Kris--


In [None]:
df_description = pd.read_excel('../Data/data_description.xlsx')
pd.set_option('display.max_colwidth', 0)
df_description

In [None]:
# read the data file

The [Homicide Report dataset](https://www.kaggle.com/murderaccountability/homicide-reports) from Kaggle is read into a dataframe
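
The read cell above is still empty. A minimal sketch of the load step follows, assuming the Kaggle CSV has been downloaded locally. The helper name `load_homicides` and the two in-memory sample rows are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

def load_homicides(source):
    """Read the homicide report CSV from a path or a file-like object."""
    # low_memory=False parses the file in one pass, so a column that mixes
    # numbers and blanks (e.g. Perpetrator Age) gets a single dtype
    return pd.read_csv(source, low_memory=False)

# tiny in-memory stand-in for the real file, for illustration only
sample_csv = io.StringIO(
    "Record ID,State,Year,Crime Solved,Perpetrator Age\n"
    "1,Alaska,1980,Yes,15\n"
    "2,Alaska,1980,No, \n"
)
sample_df = load_homicides(sample_csv)
print(sample_df.shape)
print(sample_df.dtypes)
```

With the real file, the call would look like `homicide_df = load_homicides('../Data/database.csv')`, where the file name is an assumption and the folder follows the `../Data/` path used for the description table.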

In [None]:
# data wrangling, clean-up, rename headers, drop columns, change data types, and transforms
# change crime solved values - Yes = 1 and No = 0 
homicide_df['Crime Solved']=homicide_df['Crime Solved'].replace(to_replace='No',value=0)
homicide_df['Crime Solved']=homicide_df['Crime Solved'].replace(to_replace='Yes',value=1)

# cleanse the Perpetrator Age: blanks become 0, then the column is cast to int
homicide_df['Perpetrator Age']=homicide_df['Perpetrator Age'].replace(to_replace=" ",value=0)
homicide_df['Perpetrator Age'] = homicide_df['Perpetrator Age'].astype(int)
#homicide_df['Perpetrator Age']=pd.to_numeric(homicide_df['Perpetrator Age'])
print('Max age of Perpetrator before:', homicide_df['Perpetrator Age'].max())
print('Min age of Perpetrator before:', homicide_df['Perpetrator Age'].min())
# drop records with an implausible age (above 98)
index=homicide_df[homicide_df['Perpetrator Age'] > 98].index
homicide_df.drop(index, inplace=True)
# treat 0 as missing and replace it with the median
# (note: the median is computed over all rows, including the zeros being replaced)
homicide_df['Perpetrator Age']=homicide_df['Perpetrator Age'].replace(to_replace=0,value=homicide_df['Perpetrator Age'].median())
print('Max age of Perpetrator after:', homicide_df['Perpetrator Age'].max())
print('Min age of Perpetrator after:', homicide_df['Perpetrator Age'].min())

# cleanse the Victim Age: 998 is a placeholder value, replaced with the median
print('Max age of Victim before:', homicide_df['Victim Age'].max())
print('Min age of Victim before:', homicide_df['Victim Age'].min())
homicide_df['Victim Age']=homicide_df['Victim Age'].replace(to_replace=998,value=homicide_df['Victim Age'].median())
print('Max age of Victim after:', homicide_df['Victim Age'].max())
print('Min age of Victim after:', homicide_df['Victim Age'].min())

# remove records where Relationship, Weapon, Victim & Perpetrator Sex, Race,
# and Ethnicity are Unknown and Victim & Perpetrator Age is 0


# combine Victim and Perpetrator Race & Ethnicity into new features - Victim_Race_Ethnicity and Perpetrator_Race_Ethnicity


# create bins for Victim and Perpetrator Age in a new feature - Victim_Age_Group and Perpetrator_Age_Group


# group Relationship into logical bins - Relationship_Group
df.loc[(df['Relationship'] == 'Wife') | (df['Relationship'] == 'Ex-Wife') |
             (df['Relationship'] == 'Girlfriend') |
             (df['Relationship'] == 'Common-Law Wife'), 'Relationship_Group'] = 'Female Partner'

df.loc[(df['Relationship'] == 'Husband') | (df['Relationship'] == 'Ex-Husband') |
             (df['Relationship'] == 'Boyfriend') | 
             (df['Relationship'] == 'Common-Law Husband'), 'Relationship_Group'] = 'Male Partner'

# 'Boyfriend/Girlfriend' does not say which partner was killed, so use the victim's sex
df.loc[(df['Relationship'] == 'Boyfriend/Girlfriend') & (df['Victim Sex'] == 'Female'),
             'Relationship_Group'] = 'Female Partner'

df.loc[(df['Relationship'] == 'Boyfriend/Girlfriend') & ((df['Victim Sex'] == 'Male') |
            (df['Victim Sex'] == 'Unknown')) , 'Relationship_Group'] = 'Male Partner'

df.loc[(df['Relationship'] == 'Father') | (df['Relationship'] == 'In-Law') |
             (df['Relationship'] == 'Mother') | (df['Relationship'] == 'Stepfather') |
             (df['Relationship'] == 'Stepmother'), 'Relationship_Group'] = 'Parent'

df.loc[(df['Relationship'] == 'Daughter') | (df['Relationship'] == 'Son') |
             (df['Relationship'] == 'Stepdaughter') | 
             (df['Relationship'] == 'Stepson'), 'Relationship_Group'] = 'Children'

df.loc[(df['Relationship'] == 'Brother') | (df['Relationship'] == 'Sister'),
             'Relationship_Group'] = 'Sibling'

df.loc[(df['Relationship'] == 'Employee') | (df['Relationship'] == 'Employer') ,
             'Relationship_Group'] = 'Work'

df.loc[(df['Relationship'] == 'Family') , 'Relationship_Group'] = 'Other Family'

df.loc[(df['Relationship'] == 'Friend') , 'Relationship_Group'] = 'Friend'

df.loc[(df['Relationship'] == 'Neighbor') , 'Relationship_Group'] = 'Neighbor'

df.loc[(df['Relationship'] == 'Stranger') , 'Relationship_Group'] = 'Stranger'

df.loc[(df['Relationship'] == 'Acquaintance') , 'Relationship_Group'] = 'Acquaintance'

# combine City and State into a new feature - City_State


# combine Month and Year into a new feature - Month_Year


# drop Incident feature
if 'Incident' in df:
    del df['Incident']

Summarized what was completed in this step.
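
Several of the feature steps in the cell above are comments without code yet (the combined race and ethnicity, the age bins, `City_State`, and `Month_Year`). One possible way to fill them in is sketched here. The helper name `add_derived_features`, the bin edges in `AGE_BINS`, the labels, and the separators are all assumptions chosen for illustration; the team may prefer different cut points.

```python
import pandas as pd

# hypothetical age bands; each bin includes its lower edge and excludes its upper edge
AGE_BINS = [0, 13, 18, 25, 35, 50, 65, 200]
AGE_LABELS = ['0-12', '13-17', '18-24', '25-34', '35-49', '50-64', '65+']

def add_derived_features(frame):
    """Return a copy of frame with the combined and binned features added."""
    out = frame.copy()
    for role in ('Victim', 'Perpetrator'):
        # e.g. 'White' + 'Hispanic' -> 'White_Hispanic'
        out[f'{role}_Race_Ethnicity'] = out[f'{role} Race'].str.cat(
            out[f'{role} Ethnicity'], sep='_')
        # ages outside [0, 200), such as a 998 placeholder, become NaN
        out[f'{role}_Age_Group'] = pd.cut(
            out[f'{role} Age'], bins=AGE_BINS, labels=AGE_LABELS, right=False)
    out['City_State'] = out['City'].str.cat(out['State'], sep=', ')
    out['Month_Year'] = out['Month'].str.cat(out['Year'].astype(str), sep=' ')
    return out

# two made-up rows that use the dataset's column names
sample = pd.DataFrame({
    'City': ['Anchorage', 'Juneau'],
    'State': ['Alaska', 'Alaska'],
    'Month': ['January', 'March'],
    'Year': [1980, 1981],
    'Victim Age': [30, 17],
    'Victim Race': ['White', 'Black'],
    'Victim Ethnicity': ['Hispanic', 'Not Hispanic'],
    'Perpetrator Age': [15, 66],
    'Perpetrator Race': ['White', 'Black'],
    'Perpetrator Ethnicity': ['Unknown', 'Not Hispanic'],
})
featured = add_derived_features(sample)
print(featured[['City_State', 'Month_Year', 'Victim_Age_Group', 'Victim_Race_Ethnicity']])
```

Returning a copy keeps the raw frame untouched, which makes it easier to rerun the cell while the bin edges are still being tuned.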

In [None]:
# display head of the cleansed data

Summarized what was completed in this step.

In [None]:
# display info which includes shape, columns, non-null count, and datatype
# print the number of records and columns
records = len(df)
attributes = df.columns

print(f'No. of Records: {records} \nNo. of Attributes: {len(attributes)}')

# use df.info()

Summarized what was completed in this step.

In [None]:
# displays unique values and their counts using .value_counts() and nunique() for each feature
# use df.nunique()

# use value_counts() for each feature; print each result, because a notebook
# cell only displays the value of its last expression
value_count_cols = [
    'Agency Code', 'Agency Name', 'Agency Type', 'City', 'State', 'City_State',
    'Month', 'Year', 'Month_Year',
    'Crime Type', 'Crime Solved',
    'Victim Sex', 'Victim Age', 'Victim_Age_Group', 'Victim Race',
    'Victim Ethnicity', 'Victim_Race_Ethnicity',
    'Perpetrator Sex', 'Perpetrator Age', 'Perpetrator_Age_Group', 'Perpetrator Race',
    'Perpetrator Ethnicity', 'Perpetrator_Race_Ethnicity',
    'Relationship', 'Relationship_Group',
    'Weapon', 'Record Source', 'Victim Count', 'Perpetrator Count',
]
for col in value_count_cols:
    print(homicide_df[col].value_counts(), '\n')

In [None]:
homicide_df.nunique()

Summarized what was completed in this step.

## Data Quality

_Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Be specific_.

<a href="#CONTENTS"> Top of the page</a>


--Data Quality Full Write-up--

In [None]:
# check for null values using isnull().sum() and clean-up null values
# check for null values
homicide_df.isnull().sum()

Summarized what was completed in this step.

In [None]:
# check for rows containing duplicate data and clean-up duplicate rows
df_duplicates = df.groupby(df.columns.tolist(),as_index=False).size()
df_duplicates.loc[df_duplicates['size'] > 1]

Summarized what was completed in this step.
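
If `Record ID` is unique per row, a comparison across all columns can never report a duplicate. A hedged alternative is to compare rows on everything except the identifier. The helper name `find_duplicates` and the sample rows are assumptions for illustration, and whether matching rows are true duplicates or distinct incidents with identical attributes remains a judgment call.

```python
import pandas as pd

def find_duplicates(frame, ignore=('Record ID',)):
    """Return every row that matches another row on all columns except `ignore`."""
    compare_cols = [c for c in frame.columns if c not in ignore]
    # keep=False flags all members of a duplicate group, not only the later copies
    return frame[frame.duplicated(subset=compare_cols, keep=False)]

# made-up rows: records 1 and 2 differ only in their identifier
sample = pd.DataFrame({
    'Record ID': [1, 2, 3],
    'City': ['Anchorage', 'Anchorage', 'Juneau'],
    'Year': [1980, 1980, 1980],
    'Weapon': ['Knife', 'Knife', 'Knife'],
})
dupes = find_duplicates(sample)
print(dupes)

# clean-up step: keep the first row of each duplicate group
deduped = sample.drop_duplicates(subset=['City', 'Year', 'Weapon'], keep='first')
print(len(deduped))
```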

In [None]:
# check for outliers using box plots
fig, axes = plt.subplots(nrows=5,ncols=1)
fig.set_size_inches(10, 30)
sns.boxplot(data=homicide_df,x="Incident",orient="h",ax=axes[0])
sns.boxplot(data=homicide_df,x="Victim Age",orient="h",ax=axes[1])
sns.boxplot(data=homicide_df,x="Victim Count",orient="h",ax=axes[2])
sns.boxplot(data=homicide_df,x="Perpetrator Count",orient="h",ax=axes[3])
sns.boxplot(data=homicide_df,x="Perpetrator Age",orient="h",ax=axes[4])

Summarized what was completed in this step.

In [None]:
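# histograms to look at the distributions
# (histplot replaces distplot, which is deprecated in recent seaborn releases;
#  stat='density' with kde=True keeps the same density-plus-curve view)
fig, axes = plt.subplots(nrows=5,ncols=1)
fig.set_size_inches(10, 30)
sns.histplot(homicide_df['Incident'], bins=15, kde=True, stat='density', ax=axes[0])
sns.histplot(homicide_df['Victim Age'], bins=10, kde=True, stat='density', ax=axes[1])
sns.histplot(homicide_df['Victim Count'], kde=True, stat='density', ax=axes[2])
sns.histplot(homicide_df['Perpetrator Count'], kde=True, stat='density', ax=axes[3])
sns.histplot(homicide_df['Perpetrator Age'], bins=10, kde=True, stat='density', ax=axes[4])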

Summarized what was completed in this step.

In [None]:
# check for outliers using quantiles and IQR

Summarized what was completed in this step.
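
The quantile and IQR check is not implemented yet. A minimal sketch of the usual rule (flag values more than 1.5 IQR beyond the quartiles) is given here. The helper name `iqr_outliers` and the sample ages are assumptions; the multiplier `k=1.5` is the conventional default, not something the notebook specifies.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)], (lower, upper)

# made-up ages, including a 998 placeholder like the one cleaned from Victim Age
ages = pd.Series([20, 22, 25, 27, 30, 31, 35, 998], name='Victim Age')
outliers, (lower, upper) = iqr_outliers(ages)
print(f'fences: [{lower}, {upper}]')
print(outliers)
```

On the real data the same call could be applied to each numeric column, for example `iqr_outliers(homicide_df['Victim Age'])`.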

In [None]:
# create a pairplot for continuous variables looking for outliers 

Summarized what was completed in this step.

In [None]:
# Violin plots to compare distributions between groups

Summarized what was completed in this step.

In [None]:
# Pair plots and matrix are also viable options
fig = px.scatter_matrix(df[['Year', 'Incident', 'Victim Age', 'Victim Count','Perpetrator Count']])
fig.show()

In [None]:
sns.set()
cols = ['Perpetrator Age', 'Perpetrator Count', 'Victim Age', 'Victim Count', 'Year']

# Create a pairplot for the selected columns 
sns.pairplot(homicide_df[cols], height = 2.5)

# show the plot
plt.show();

Summarized what was completed in this step.

## Statistics

_Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean or if you found something interesting. Note: You can also use data from other sources for comparison. Explain the significance of the statistics run and why they are meaningful_.


<a href="#CONTENTS"> Top of the page</a>

--Simple Statistics Full Write-up--

In [None]:
# count, mean, standard deviation, minimum and maximum values and the quartiles for continuous variables
homicide_df.describe().T

Summarized what was completed in this step.

In [None]:
# total number of victim count
print('Total number victims=',df['Victim Count'].sum())

Summarized what was completed in this step.

In [None]:
# basic statistics for categorical features
df_categorical = df.select_dtypes(include='object')
df_categorical.describe()

Summarized what was completed in this step.

In [None]:
# get all levels per categorical attribute
df_categorical_levels = pd.DataFrame()
df_categorical_levels['Attribute'] = df_categorical.columns
df_categorical_levels['Levels'] = ''
df_categorical_levels['Levels_Count'] = ''
df_categorical_levels['Unknown_Count'] = ''

# populate the dataframe with categorical levels and count of each category
for i, row in df_categorical_levels.iterrows():
    attribute = row['Attribute']
    levels = df[attribute].unique()
    df_categorical_levels.at[i,'Levels'] = levels
    df_categorical_levels.at[i,'Levels_Count'] = len(levels)
    # number of rows whose value is the literal string 'Unknown' (0 if that level is absent)
    df_categorical_levels.at[i,'Unknown_Count'] = int((df[attribute] == 'Unknown').sum())

In [None]:
# show the dataframe
df_categorical_levels.sort_values(by='Unknown_Count', ascending = False)

Summarized what was completed in this step.

In [None]:
# include any pertinent crosstabs with percentages
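
A possible starting point for this cell is `pd.crosstab`, which can return both raw counts and row percentages. The sample rows here are made up for illustration, and the choice of `Victim Sex` against `Perpetrator Sex` is an assumption; any pair of categorical columns works the same way.

```python
import pandas as pd

# made-up rows that use the dataset's column names
sample = pd.DataFrame({
    'Victim Sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'Perpetrator Sex': ['Male', 'Male', 'Female', 'Male', 'Male', 'Unknown'],
})

# raw counts, with row and column totals in the 'All' margins
counts = pd.crosstab(sample['Victim Sex'], sample['Perpetrator Sex'], margins=True)

# normalize='index' divides each row by its own total, giving row percentages
row_pct = pd.crosstab(sample['Victim Sex'], sample['Perpetrator Sex'],
                      normalize='index').mul(100).round(1)
print(counts)
print(row_pct)
```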

In [None]:
pd.pivot_table(known,index=["Victim Race","Perpetrator Race"],values=["Victim Count"],aggfunc=[np.sum])

In [None]:
pv_weapons = df.pivot_table(columns='Crime Solved', index='Weapon',values='Record ID',aggfunc='count')
pv_weapons.sort_values(by='No', ascending=0)

Summarized what was completed in this step.

## Visualization

_Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why the chosen visualization is appropriate._

_Visualize individual attributes only_

<a href="#CONTENTS"> Top of the page</a>

--Visualize Attributes Full Write-up--

Checklist of important attributes to visualize (use count of Record ID)
* Victim (Sex, Age, Race_Ethnicity) - bar plot
* Crime Type - pie chart
* Weapons - horizontal bar plot
* Perpetrator (Sex, Age, Race_Ethnicity) - bar plot
* State - map plot
* City_State - map plot
* Crime Solved - pie chart
* Month_Year - line plot

Perpetrator and Victim Count
- Are there more perpetrators than victims?

Weapon
- What weapons are most and least commonly used for murder over the time period?

Time series for homicide incidents
- Have homicide rates decreased from 1980-2014?
- What states have the highest and lowest number of incidents from 1980-2014?
- What states had the highest and lowest murder rate in 2014?

Age distribution
- What age group is the most predominant for perpetrators and victims?
- Has the age distribution changed over the years?
- What ages encompass the most perpetrators?

Race and sex
- What is the perpetrator and victim sex distribution for each race?
- What is the age group distribution for each race?
- Which perpetrator and victim races have the highest homicide incidence?

Relationship
- What relationships result in the most murders from 1980-2014?
- What victim relationships result in the most murders from 1980-2014?
- What relationship group has the most homicides?

In [None]:
# Victim (Sex, Age, Race_Ethnicity) - bar plot
# Victims sex
v_sex = df['Victim Sex'].value_counts()
v_sex.plot.pie(autopct='%1.0f%%',figsize=(6, 6), title = 'Victims sex')

Summarized what was completed in this step.

In [None]:
# Crime Type - pie chart
ct = df['Crime Type'].value_counts()
ct.plot.pie(autopct='%1.0f%%', figsize=(6, 6), title = 'Crime Types')

Summarized what was completed in this step.

In [None]:
# Weapons - horizontal bar plot

Summarized what was completed in this step.

In [None]:
# Perpetrator (Sex, Age, Race_Ethnicity) - bar plot
# Perpetrator sex
p_sex = df['Perpetrator Sex'].value_counts()
p_sex.plot.pie(autopct='%1.0f%%', figsize=(6, 6), title = 'Perpetrators sex')

Summarized what was completed in this step.

In [None]:
# State - map plot
# heat map of states 
fig = px.choropleth(locations=df_state['State_Abb'], 
                    locationmode="USA-states", 
                    color=df_state['Record ID'], 
                    color_continuous_scale='portland',
                    scope="usa")
fig.update_layout(
    title_text = 'Homicide Counts per State',
    geo_scope='usa', # limit map scope to USA
)
fig.show()

California, Texas, New York, and Florida have the highest homicide counts.

Summarized what was completed in this step.

In [None]:
# City_State - map plot

Summarized what was completed in this step.

In [None]:
# Crime Solved - pie chart

Summarized what was completed in this step.

In [None]:
df_homicides_per_year = df.groupby('Year').count().reset_index()
df_homicides_per_year_solved = df.groupby(['Year', 'Crime Solved']).count().reset_index()

In [None]:
# Create traces
solved_y = df_homicides_per_year_solved.loc[df_homicides_per_year_solved['Crime Solved'] == 'Yes']
unsolved_y = df_homicides_per_year_solved.loc[df_homicides_per_year_solved['Crime Solved'] == 'No']
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_homicides_per_year['Year'], y=df_homicides_per_year['Record ID'],
                    mode='lines+markers',
                    name='Total Homicides'))
fig.add_trace(go.Scatter(x=solved_y['Year'], y=solved_y['Record ID'],
                    mode='lines+markers',
                    name='Solved Homicides'))
fig.add_trace(go.Scatter(x=unsolved_y['Year'], y=unsolved_y['Record ID'],
                    mode='lines+markers',
                    name='Unsolved Homicides'))
fig.update_layout(
    title={
        'text': "Homicides Per Year",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Year",
    yaxis_title=" Number of Homicides")

fig.show()

Summarized what was completed in this step.

In [None]:
# Month_Year - line plot

Summarized what was completed in this step.
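
The `Month_Year` line plot needs a series that sorts chronologically, which a plain text label such as "January 1980" does not. A hedged sketch of the data preparation follows: it parses month name and year into a real date and counts records per month. The helper name `monthly_counts` and the sample rows are assumptions, and the parsing assumes English month names as in the dataset.

```python
import pandas as pd

def monthly_counts(frame):
    """Count records per calendar month, indexed by the first day of that month."""
    # '%B %Y' parses full month names followed by a four-digit year, e.g. 'January 1980'
    dates = pd.to_datetime(frame['Month'] + ' ' + frame['Year'].astype(str),
                           format='%B %Y')
    return frame.groupby(dates)['Record ID'].count().sort_index()

# made-up rows that use the dataset's column names
sample = pd.DataFrame({
    'Record ID': [1, 2, 3, 4],
    'Month': ['January', 'January', 'March', 'January'],
    'Year': [1980, 1980, 1980, 1981],
})
counts = monthly_counts(sample)
print(counts)
```

The result can be handed to a line plot directly, for instance `px.line(x=counts.index, y=counts.values)`. Months with no records are simply absent from the index rather than shown as zero.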

## EDA

_Explore relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships_.

_Exploring joint attributes means comparing multiple attributes (pairwise plots and correlation are good options)_


<a href="#CONTENTS"> Top of the page</a>

--Explore Joint Attributes Full Write-up--

* Use scatter plots for 2 variable comparison
* Use bubble plots for 3 variable comparison
* Use line plots for time lines
* Use geo map plots for locations

Correlation
- How is the homicide data correlated?

In [None]:
# Correlation plot (numeric columns only; correlation is undefined for text columns)
plt.figure(figsize=(8,4))
sns.heatmap(homicide_df.select_dtypes(include='number').corr())

In [None]:
corr_pair(homicide_df)

Summarized what was completed in this step.
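
`corr_pair` is called in this section but is not defined anywhere in the notebook. One plausible definition, offered purely as an assumption about what the helper was meant to do, ranks every pair of numeric columns by the absolute value of their Pearson correlation. It would need to run before the cell that calls it.

```python
import numpy as np
import pandas as pd

def corr_pair(frame):
    """Rank numeric column pairs by absolute Pearson correlation (hypothetical helper)."""
    corr = frame.select_dtypes(include='number').corr()
    # keep the strict upper triangle so each pair appears once and self-pairs are dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().reset_index()
    pairs.columns = ['Feature_1', 'Feature_2', 'Correlation']
    order = pairs['Correlation'].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)

# made-up numeric columns plus one text column, which is ignored
sample = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
    'label': ['x', 'y', 'x', 'y', 'x'],
})
ranked = corr_pair(sample)
print(ranked)
```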

In [None]:
# Pairwise plot

Summarized what was completed in this step.

In [None]:
# Create a plot to compare actual crime rate numbers to the homicide numbers to see if they follow the same pattern
# Need crime data from https://www.macrotrends.net/states/louisiana/murder-homicide-rate-statistics

Summarized what was completed in this step.

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data=df,
              x='Victim Race',
              #y='Crime Type',
              hue='Crime Type',
              palette=['#432371',"#FAAE7B"] 
             )
plt.xlabel('Victim Race',size=14)
plt.ylabel('Total Count',size=14)
plt.title('Most victims of Murder or Manslaughter were White, \n followed by Black',size=18)
plt.show()

Summarized what was completed in this step.

In [None]:
sns.catplot(x='Perpetrator Race',
           y='Perpetrator Count',
           kind='bar', 
           height=6,
            aspect=2,
            hue='Perpetrator Sex', 
           data=df.sort_values('Perpetrator Race'))

Summarized what was completed in this step.

## Discoveries

_Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification)._

_Prediction can include 1 or more attributes (i.e. Perpetrator age and race to predict the profile of a murderer based on other attributes)_

<a href="#CONTENTS"> Top of the page</a>

--Explore Attributes and Class Full Write-up--

Checklist of important attributes to visualize (use count of Record ID)
* Victim (Sex, Age, Race_Ethnicity) vs Crime Solved
* Weapons vs Crime Solved
* Perpetrator (Sex, Age, Race_Ethnicity) vs Crime Solved
* Perpetrator (Sex, Age, Race_Ethnicity) vs Crime Type
* State vs Crime Solved
* State vs Crime Type
* City_State vs Crime Type
* City_State vs Crime Solved
* Agency Type vs Crime Solved

In [None]:
# Victim (Sex, Age, Race_Ethnicity) vs Crime Solved
df_victime_gender = df.groupby(['Victim Age', 'Victim Sex', 'Year']).count().reset_index()
px.scatter(df_victime_gender, x="Victim Age", y="Record ID", animation_frame="Year", animation_group="Victim Age",
           size="Record ID", color="Victim Sex", hover_name="Record ID",
           log_x=False, size_max=20, range_x=[0,100], range_y=[0,1200])

Summarized what was completed in this step.

In [None]:
# Weapons vs Crime Solved

Summarized what was completed in this step.
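
For the weapon comparison, one option is a solve rate per weapon rather than raw counts, since raw counts are dominated by the most common weapons. A sketch under stated assumptions: the helper name `solve_rate_by` is hypothetical, the sample rows are made up, and `Crime Solved` is treated as solved when it is either `'Yes'` or `1`, because the notebook uses both encodings.

```python
import pandas as pd

def solve_rate_by(frame, column):
    """Cases, solved cases, and solve rate for each level of `column`."""
    # accept both encodings of the target used in this notebook: 'Yes'/'No' and 1/0
    solved = frame['Crime Solved'].isin(['Yes', 1])
    grouped = solved.groupby(frame[column])
    summary = pd.DataFrame({'Cases': grouped.size(), 'Solved': grouped.sum()})
    summary['Solve_Rate'] = (summary['Solved'] / summary['Cases']).round(3)
    # lowest solve rate first, so the hardest categories are at the top
    return summary.sort_values('Solve_Rate')

# made-up rows that use the dataset's column names
sample = pd.DataFrame({
    'Weapon': ['Handgun', 'Handgun', 'Handgun', 'Knife', 'Knife'],
    'Crime Solved': ['Yes', 'No', 'No', 'Yes', 'Yes'],
})
weapon_rates = solve_rate_by(sample, 'Weapon')
print(weapon_rates)
```

The same helper applies to the other items on the checklist, for example `solve_rate_by(df, 'State')`.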

In [None]:
# Perpetrator (Sex, Age, Race_Ethnicity) vs Crime Solved
df_gender = df.groupby(['Perpetrator Sex', 'Victim Sex', 'Year']).count().reset_index()
# build the label from the grouped frame's own columns so it lines up with each row
df_gender['Perp_Vict'] = df_gender['Perpetrator Sex'].str.cat(df_gender['Victim Sex'],sep=" ")
df_gender['Perp_Vict'].unique()

# every (Perp_Vict, Year) combination, so years with no cases still get a row after the merge
# (built in one step because DataFrame.append was removed in pandas 2.0)
years = df['Year'].unique()
combo = df_gender['Perp_Vict'].unique()
df_hom = pd.DataFrame([{'Year': i, 'Perp_Vict': val} for val in combo for i in years])

In [None]:
df_hom_perp = df_hom.merge(df_gender, on=['Year', 'Perp_Vict'], how='outer')

In [None]:
fig_gender = px.bar(df_hom_perp, x="Perp_Vict", y="Record ID", color="Perp_Vict",
             animation_frame="Year", animation_group="Perp_Vict", range_y=[0,10000],
             opacity = 0.5)

In [None]:
fig_gender.show()

Summarized what was completed in this step.

In [None]:
# Perpetrator (Sex, Age, Race_Ethnicity) vs Crime Type

Summarized what was completed in this step.

In [None]:
# State vs Crime Solved
# One example of what we can do here
fig = px.scatter(df, x="State",y='Weapon', color="Crime Solved",
                 hover_name="Weapon",template="plotly_dark",
                 animation_frame='Year',animation_group='State')
fig.show()

Summarized what was completed in this step.

In [None]:
# State vs Crime Type

Summarized what was completed in this step.

In [None]:
# City_State vs Crime Type

Summarized what was completed in this step.

In [None]:
# City_State vs Crime Solved

Summarized what was completed in this step.

In [None]:
# Agency Type vs Crime Solved

Summarized what was completed in this step.
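
The bar chart in this section plots a table named `ct5` that is not created anywhere in the notebook. Judging by the chart title and the two bar colors, it appears to be a cross-tabulation of `Agency Type` against `Crime Type`; that reading is an assumption, as are the sample rows used here.

```python
import pandas as pd

# made-up rows that use the dataset's column names
sample = pd.DataFrame({
    'Agency Type': ['Municipal Police', 'Municipal Police', 'Sheriff',
                    'Municipal Police', 'State Police'],
    'Crime Type': ['Murder or Manslaughter', 'Murder or Manslaughter',
                   'Murder or Manslaughter', 'Manslaughter by Negligence',
                   'Murder or Manslaughter'],
})

# one row per agency type, one column per crime type
ct5 = pd.crosstab(sample['Agency Type'], sample['Crime Type'])
# sort by row total; a horizontal bar chart draws the last row at the top
ct5 = ct5.loc[ct5.sum(axis=1).sort_values().index]
print(ct5)
```

Swapping `Crime Type` for `Crime Solved` in the same call would give the Agency Type vs Crime Solved table named in the checklist.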

In [None]:
ct5.plot(kind='barh', 
         #stacked=True, 
         color=['#432371','red'],
         width=0.8,  
         figsize=(8,6) # (x-axis,y-axis)
         )
plt.xlabel('Total Count',size=14)
plt.ylabel('Agency Type',size=14)
plt.title('Murder or Manslaughter accounts for the majority of crime, \n mostly handled by Municipal Police',size=18)
plt.show()

Summarized what was completed in this step.

In [None]:
df_homicides_solved = pd.DataFrame()
df_homicides_solved['Year'] = df_homicides_per_year['Year']
df_homicides_solved['Unsolved'] = unsolved_y['Record ID'].values
df_homicides_solved['Solved'] = solved_y['Record ID'].values
df_homicides_solved['Total Homicides'] = df_homicides_per_year['Record ID']

In [None]:
# new variable creations
df_homicides_solved['Unsolved_Solved_Diff'] = df_homicides_solved['Solved'] - df_homicides_solved['Unsolved']
df_homicides_solved['Diff_Percentage'] = round((df_homicides_solved['Unsolved_Solved_Diff']/df_homicides_solved['Total Homicides'])*100,2)

In [None]:
fig = px.line(df_homicides_solved, x="Year", y="Diff_Percentage", title='Percentage of Difference of Solved vs. Unsolved Homicides')
fig.show()

Summarized what was completed in this step.

Decision Tree
- Can we predict whether the crime will be solved or unsolved for victims?

## New Feature Creation

_Are there other features that could be added to the data or created from existing features? Which ones?_


<a href="#CONTENTS"> Top of the page</a>

New features - can we do something to clean up city, state, and agency? (qcut in pandas)

--New Features Full Write-up--

--New features were created in the Data Meaning Type section; we just need to do the write-up for them--

* Victim_Race_Ethnicity
* Perpetrator_Race_Ethnicity
* Victim_Age_Group
* Perpetrator_Age_Group
* Relationship_Group
* City_State
* Month_Year
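
For the `qcut` idea noted in this section, one way to tame a high-cardinality column such as `Agency Name` is to replace each agency with a volume tier based on how many cases it reported. This is a sketch under assumptions: the helper name `agency_volume_tier`, the three tier labels, and the sample data are all illustrative.

```python
import pandas as pd

def agency_volume_tier(frame, labels=('Low', 'Medium', 'High')):
    """Label each row with the case-volume tier of its agency."""
    cases_per_agency = frame['Agency Name'].value_counts()
    # rank first so tied counts cannot produce duplicate bin edges, which qcut rejects
    ranks = cases_per_agency.rank(method='first')
    tiers = pd.qcut(ranks, q=len(labels), labels=list(labels)).astype(str)
    # look up each row's agency in the per-agency tier table
    return frame['Agency Name'].map(tiers)

# made-up agencies A..F reporting 1..6 cases respectively
names = list('A' + 'BB' + 'CCC' + 'DDDD' + 'EEEEE' + 'FFFFFF')
sample = pd.DataFrame({'Agency Name': names})
sample['Agency_Volume_Tier'] = agency_volume_tier(sample)
tier_by_agency = sample.drop_duplicates().set_index('Agency Name')['Agency_Volume_Tier']
print(tier_by_agency)
```

Equal-sized tiers by agency count are only one choice; tiers could also be cut so that each holds a similar number of cases.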

## Exceptional Work
_You have free rein to provide additional analyses. • One idea: implement dimensionality reduction, then visualize and interpret the results._


<a href="#CONTENTS"> Top of the page</a> 


Include PCA and fit a model for exceptional points

An initial view of the dataset shows that the headers are descriptive enough and won't require any changes. However, we need to look at the equivalent of an N/A in the Perpetrator and Victim Age columns. The N/A equivalent is 0.

In [None]:
# Function to create dummy variables
def dummy_code(col, df): # input the column names and dataframe
    df_dummy = pd.DataFrame()
    for val in col:
        df_dummy_temp = pd.get_dummies(df[val], prefix=val)
        df_dummy = pd.concat([df_dummy, df_dummy_temp], axis=1, sort=False)
    return df_dummy

In [None]:
# select columns for dummy coding
cat_col = df_categorical.columns.values
categorical = np.delete(cat_col, [0,1])

In [None]:
# call function for dummy coding variables
df_dummy = dummy_code(categorical, df)

Summarized what was completed in this step.

In [None]:
# Train/Test split due to the large data size and for data validation
# set seed (scikit-learn draws from NumPy's global random state when random_state is not given)
np.random.seed(1234)
df_pca = df_full.drop(['Agency Name', 'Agency Code'], axis=1)


In [None]:
# split into train/test
y = df_pca['Crime Solved_Yes']
x = df_pca.drop(['Crime Solved_Yes', 'Crime Solved_No'], axis = 1)

In [None]:
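from sklearn.model_selection import train_test_split

# test_size=0.8 holds out 80% of the rows; random_state makes the split reproducible
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.8,random_state=1234)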

Summarized what was completed in this step.

In [None]:
# PCA
# Standardizing the features
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(x_train)

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents, columns=['PCA_'+ str(i) for i in range(10)])

In [None]:
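# the components come from x_train, so pair them with y_train;
# reset the index so the labels line up with principalDf's 0..n-1 index
df_PCA = pd.concat([principalDf, y_train.reset_index(drop=True)], axis=1)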

In [None]:
fig = px.scatter(df_PCA, x='PCA_0', y='PCA_1', color='Crime Solved_Yes',
                width=600, height=300)
fig.update_layout(title='PCA 1 vs. PCA 2',
                  yaxis_zeroline=False, xaxis_zeroline=False)
fig.update_xaxes(title_text='PCA 1')
fig.update_yaxes(title_text='PCA 2')
fig.show()

Summarized what was completed in this step.
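
To interpret the PCA and judge whether 10 components are enough, it helps to look at the share of variance each component explains. The sketch here computes that share with NumPy's SVD on synthetic data; the helper name `explained_variance_ratio` and the demo matrix are assumptions. With the fitted scikit-learn object from this section, the corresponding values are available as `pca.explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(x):
    """Share of total variance carried by each principal component, largest first."""
    centered = x - x.mean(axis=0)
    # squared singular values of the centered data are proportional to component variances
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# synthetic demo: the second column is almost a multiple of the first, the third is noise
rng = np.random.default_rng(1234)
base = rng.normal(size=(200, 1))
x_demo = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])
ratio = explained_variance_ratio(x_demo)
cumulative = np.cumsum(ratio)
print(ratio.round(3))
print(cumulative.round(3))
```

A common rule of thumb is to keep the smallest number of components whose cumulative share passes a chosen threshold; the threshold itself is a modeling decision.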

In [None]:
# Linear Regression
# check for a balanced dataset
df_crime = df_full[['Crime Solved_Yes', 'Crime Solved_No']].groupby('Crime Solved_Yes').count().reset_index().rename(columns={'Crime Solved_No':'Count'})
df_crime['Solved'] = ['No', 'Yes']
df_crime = df_crime.drop('Crime Solved_Yes', axis=1)
total = df_crime['Count'].sum()
df_crime['Percentage'] = [x/total for x in df_crime['Count']]
df_crime

Summarized what was completed in this step.

# Archive

Delete before turning in

In [None]:
from pandas_profiling import ProfileReport

profile = ProfileReport(homicide_df, title="Pandas Profiling Report")
profile.to_file("pandas_report.html")

In [None]:
#https://www.analyticsvidhya.com/blog/2020/08/exploratory-data-analysiseda-from-scratch-in-python/
#https://analyticsindiamag.com/beginners-guide-to-pyjanitor-a-python-tool-for-data-cleaning/
# for normalizing, scaling, and encoding categorical values

In [None]:
# sample EDA
#https://github.com/Dongee-W/EDA-python-spark/blob/master/seaborn.ipynb
#https://www.geeksforgeeks.org/exploratory-data-analysis-in-python/