## Analysis of Hate Crime Data in the United States
The data covers the years 2001 to 2020.
I chose this dataset for the following reasons:
1. It contains a variety of attributes, which lets us look at the data from several perspectives.
2. A good analysis has the potential to provide impactful insights for hate crime reduction efforts.
3. A larger dataset generally supports a more reliable analysis, and this dataset has more than 100,000 records.

In [None]:
import pandas as pd

hcrime_df = pd.read_csv("C:/Users/Nikitha/OneDrive/Documents/School/Assignments/Math_Methods_in_DA/hate_crime_dataset_3.csv")
#Keeping only the columns used in this analysis
hcrime_df = hcrime_df[[
    'INCIDENT_ID','DATA_YEAR','PUB_AGENCY_NAME','PUB_AGENCY_UNIT','AGENCY_TYPE_NAME',
    'STATE_NAME','POPULATION_GROUP_DESC','INCIDENT_DATE','TOTAL_OFFENDER_COUNT',
    'ADULT_OFFENDER_COUNT','JUVENILE_OFFENDER_COUNT','OFFENDER_RACE','OFFENDER_ETHNICITY',
    'OFFENSE_NAME','VICTIM_COUNT','LOCATION_NAME','BIAS_DESC'
]]

In [None]:
hcrime_df.head()

### Data Wrangling
Data wrangling is performed at several places in this notebook.
The following operations are applied:
1. POPULATION_GROUP_DESC: take the value after "thru" as the maximum population of the group. It is retrieved by extracting all the numbers in the text and taking the largest one.
2. INCIDENT_DATE: parse it into a proper date format in order to retrieve the month using an existing function.
3. LOCATION_NAME: split the column when multiple locations are present and replace the value with a collection of locations, then explode this collection into separate rows.

In [None]:
import math

total_incident_count = len(hcrime_df.INCIDENT_ID)
latestyear_df = hcrime_df[hcrime_df.DATA_YEAR == 2020]
total_incident_count_recent = len(latestyear_df)

count_years = len(hcrime_df['DATA_YEAR'].unique())

mean_incident_per_year = total_incident_count/count_years

print(f'Total incidents reported from 2001 to 2020 = {total_incident_count}')
print(f'Total incidents reported in 2020 = {total_incident_count_recent}')
print(f'Mean incident count per year from 2001 to 2020 = {mean_incident_per_year:.4f}')

In [None]:
total_offender_count = sum(hcrime_df['TOTAL_OFFENDER_COUNT'])
total_victim_count = sum(hcrime_df['VICTIM_COUNT'])

print(f'Total offender count from 2001 to 2020 = {total_offender_count}')
print(f'Total victim count from 2001 to 2020 = {total_victim_count}')

#### Measures of Central Tendency
Calculation of the mean, median, and mode.
The data used here is the count of incidents in each month over the years 2001 to 2020.

Data Wrangling point 2 is implemented below:

In [None]:
#Parse the string date to an actual date datatype
hcrime_df['INCIDENT_DATE'] = pd.to_datetime(hcrime_df['INCIDENT_DATE'], format='%d-%b-%y')
#Creating a new column by extracting month from a INCIDENT_DATE column
hcrime_df['MONTH_OF_INCIDENT_DATE'] = pd.DatetimeIndex(hcrime_df['INCIDENT_DATE']).month
#Grouping by Year and Month to calculate the count of incidents each month
monthly_incident_counts = hcrime_df.groupby(['MONTH_OF_INCIDENT_DATE','DATA_YEAR']).size().to_list()
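
The parsing and grouping steps can be checked on a few rows before running them on the full dataset. The sketch below is only an illustration: the `sample` frame and its dates are made up, and it assumes the dates follow the same `'%d-%b-%y'` pattern (day, abbreviated English month name, two-digit year) used above.

```python
import pandas as pd

# Hypothetical sample rows; the notebook itself reads these columns from the CSV.
sample = pd.DataFrame({
    "DATA_YEAR": [2019, 2019, 2019, 2020],
    "INCIDENT_DATE": ["05-Jan-19", "17-Jan-19", "02-Feb-19", "11-Jan-20"],
})

# '%d-%b-%y' = day, abbreviated month name, two-digit year.
sample["INCIDENT_DATE"] = pd.to_datetime(sample["INCIDENT_DATE"], format="%d-%b-%y")
sample["MONTH_OF_INCIDENT_DATE"] = sample["INCIDENT_DATE"].dt.month

# One count per (year, month) pair that actually appears in the data.
sample_counts = sample.groupby(["DATA_YEAR", "MONTH_OF_INCIDENT_DATE"]).size()
print(sample_counts.to_list())
```

Note that `groupby(...).size()` only returns the year and month pairs present in the data, so a month without any reported incident would be missing from the list rather than counted as zero.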

In [None]:
import statistics as stat
import numpy as np

print(f'stat.mean      = {stat.mean(monthly_incident_counts):.2f}')
print(f'stat.median    = {stat.median(monthly_incident_counts):.2f}')
print(f'stat.mode    = {stat.mode(monthly_incident_counts):.2f}')
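
To make the three measures concrete, here is a small sketch on a made-up list of counts (`toy_counts` is hypothetical and not taken from the dataset). It uses the same `statistics` functions as the cell above.

```python
import statistics as stat

# Hypothetical monthly counts, chosen so the results are easy to verify by hand.
toy_counts = [4, 7, 7, 9, 13]

toy_mean = stat.mean(toy_counts)      # (4 + 7 + 7 + 9 + 13) / 5
toy_median = stat.median(toy_counts)  # middle value of the sorted list
toy_mode = stat.mode(toy_counts)      # most frequent value

print(toy_mean, toy_median, toy_mode)
```

When several values are tied for most frequent, `stat.mode` in Python 3.8 and later returns the first one it encounters, so the mode of the real monthly counts may be only one of several equally common values.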

#### Measure of Variability
Using the same monthly incident counts as above.

In [None]:
#In order to calculate the range
#Sorting the list of incident counts in ascending order
sort_incident_counts = sorted(monthly_incident_counts)

#Minimum value of the range
min_incident_count_per_month = sort_incident_counts[0]
print(f'minimum monthly incident count = {min_incident_count_per_month}')

#Maximum value of the range
max_incident_count_per_month = sort_incident_counts[-1]
print(f'maximum monthly incident count = {max_incident_count_per_month}')

#Calculation of variance
print(f'stat.pvariance = {stat.pvariance(monthly_incident_counts):.2f}')

#Calculate the standard deviation
print(f'stat.pstdev    = {stat.pstdev(monthly_incident_counts):.2f}')

#Interquartile range
to_cal_qua = np.array(sort_incident_counts)
q3, q1 = np.percentile(to_cal_qua,[75,25])
inter_quartile_iqr = q3-q1

print(f'Q1 is {q1:.2f}')
print(f'Q3 is {q3:.2f}')
print(f'Interquartile range = {inter_quartile_iqr:.2f}')
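
The same measures can be verified by hand on a small list. The sketch below uses made-up numbers (`toy_spread` is hypothetical) and assumes NumPy's default linear interpolation for percentiles. `pvariance` and `pstdev` treat the data as the whole population, which matches the calls used above.

```python
import statistics as stat
import numpy as np

# Hypothetical counts with mean 5, so the squared deviations sum to 32.
toy_spread = [2, 4, 4, 4, 5, 5, 7, 9]

toy_range = max(toy_spread) - min(toy_spread)  # spread between the extremes
toy_pvar = stat.pvariance(toy_spread)          # 32 / 8, dividing by n
toy_pstd = stat.pstdev(toy_spread)             # square root of the variance

# Quartiles with NumPy's default (linear interpolation) method.
toy_q1, toy_q3 = np.percentile(toy_spread, [25, 75])
toy_iqr = toy_q3 - toy_q1

print(toy_range, toy_pvar, toy_pstd, toy_q1, toy_q3, toy_iqr)
```

If the monthly counts were treated as a sample rather than the full population, `stat.variance` and `stat.stdev` (which divide by n - 1) would be the corresponding choices.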

### Number of incidents over the years, by bias
Based on the categorization at https://www.fbi.gov/services/cjis/ucr/hate-crime, I have grouped the column BIAS_DESC into 6 categories:
1. Race/Ethnicity/Ancestry
2. Religion (not included below)
3. Sexual Orientation
4. Disability
5. Gender
6. Gender Identity

Result: Race/Ethnicity/Ancestry is the most common bias in the US.

This result points to the most prevalent form of hate among people living in the United States. More programs aimed at reducing these divisions could therefore be conducted.

In [None]:
#function to map the values in BIAS_DESC into the above list except Religion
def categorize_bias(row):
    if ('Anti-Female' in row['BIAS_DESC']) or ('Anti-Male' in row['BIAS_DESC']) :
        return 'Gender'
    elif ('Anti-Mental Disability' in row['BIAS_DESC']) or ('Anti-Physical Disability' in row['BIAS_DESC']) :
        return 'Disability'
    elif ('Anti-Bisexual' in row['BIAS_DESC']) or ('Anti-Gay (Male)' in row['BIAS_DESC']) or ('Anti-Heterosexual' in row['BIAS_DESC']) or ('Anti-Lesbian' in row['BIAS_DESC']) or ('Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group)' in row['BIAS_DESC']) :
        return 'Sexual Orientation'
    elif ('Anti-Transgender' in row['BIAS_DESC']) or ('Anti-Gender Non-Conforming' in row['BIAS_DESC']) :
        return 'Gender Identity'
    elif ('Anti-American Indian or Alaska Native' in row['BIAS_DESC']) or ('Anti-Arab' in row['BIAS_DESC']) or ('Anti-Asian' in row['BIAS_DESC']) or ('Anti-Black or African American' in row['BIAS_DESC']) or ('Anti-Hispanic or Latino' in row['BIAS_DESC']) or ('Anti-Multiple Races, Group' in row['BIAS_DESC']) or ('Anti-Native Hawaiian or Other Pacific Islander' in row['BIAS_DESC']) or ('Anti-Other Race/Ethnicity/Ancestry' in row['BIAS_DESC']) or ('Anti-White' in row['BIAS_DESC']) :
        return 'Race/Ethnicity/Ancestry'
    else :
        return 'others or multiple bias'
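
The same bucketing can also be written as a lookup table, which may be easier to extend. This is only a sketch: `BIAS_CATEGORIES` and `categorize_bias_text` are hypothetical names, the keywords are copied from the function above, and the function takes the BIAS_DESC text directly instead of a whole row.

```python
# Hypothetical lookup table; categories are checked in the same order as above.
BIAS_CATEGORIES = {
    "Gender": ["Anti-Female", "Anti-Male"],
    "Disability": ["Anti-Mental Disability", "Anti-Physical Disability"],
    "Sexual Orientation": [
        "Anti-Bisexual", "Anti-Gay (Male)", "Anti-Heterosexual", "Anti-Lesbian",
    ],
    "Gender Identity": ["Anti-Transgender", "Anti-Gender Non-Conforming"],
    "Race/Ethnicity/Ancestry": [
        "Anti-American Indian or Alaska Native", "Anti-Arab", "Anti-Asian",
        "Anti-Black or African American", "Anti-Hispanic or Latino",
        "Anti-Multiple Races, Group",
        "Anti-Native Hawaiian or Other Pacific Islander",
        "Anti-Other Race/Ethnicity/Ancestry", "Anti-White",
    ],
}

def categorize_bias_text(bias_desc):
    """Return the first category whose keyword occurs in the description."""
    for category, keywords in BIAS_CATEGORIES.items():
        if any(keyword in bias_desc for keyword in keywords):
            return category
    return "others or multiple bias"

print(categorize_bias_text("Anti-Gay (Male)"))
print(categorize_bias_text("Anti-Jewish"))
```

It could be applied with `bias_df['BIAS_DESC'].map(categorize_bias_text)`. As with the original function, a description that lists several biases is assigned to the first category that matches, and anything unmatched (including the Religion biases) falls into 'others or multiple bias'.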

In [None]:
import matplotlib.pyplot as plt

#To show the number of incidents over the years based on the bias
#.copy() avoids a SettingWithCopyWarning when the new column is added
bias_df = hcrime_df[['INCIDENT_ID','BIAS_DESC']].copy()
#Creating a new column GROUPED_BIAS which buckets the biases into categories
bias_df['GROUPED_BIAS'] = bias_df.apply(categorize_bias, axis=1)

#Calculating the count of incidents based on the GROUPED_BIAS
df = bias_df.groupby(bias_df['GROUPED_BIAS']).size()

#Plotting the pie chart
print('NUMBER OF INCIDENTS FROM 2001 to 2020 FOR VARIOUS BIASES')
grp_bias = pd.DataFrame({'GROUPED_BIAS':df.index, 'incident_count':df.values})
grp_bias.set_index('GROUPED_BIAS', inplace=True)
grp_bias.plot.pie(y='incident_count',figsize=(8,8))

### Crime incidents based on the maximum population size of a city or county

Data Wrangling point 1 is implemented here: extracting the maximum number from text such as 'Cities from 250,000 thru 499,999' in the column POPULATION_GROUP_DESC.

In this example, 250,000 is the minimum of the population range and 499,999 is the maximum.
I retrieve the maximum number from the text and treat it as the population of that city or county.

Result: Cities or counties with a maximum population of 99,999 have faced the most incidents.

With this result we can target cities and counties in the population group with a maximum of 99,999 and investigate the cause further.

In [None]:
import re

#Function extracting the numerical values from the text
def getNumbers(text):
    array = re.findall(r'[0-9]+', text)
    return array
#Removing the commas from the text
hcrime_df['POPULATION_GROUP_DESC']=hcrime_df['POPULATION_GROUP_DESC'].replace(',','',regex=True)
#Extracting the numbers as the string form of a list, e.g. "['250000', '499999']"
hcrime_df['POPULATION_NUMBERS']=hcrime_df['POPULATION_GROUP_DESC'].map(lambda x: str(getNumbers(x)))
#Keeping only the last number in that string, e.g. "['499999']"
hcrime_df['POPULATION_NUMBERS']=hcrime_df['POPULATION_NUMBERS'].map(lambda x: str(re.findall(r"(\d+)']",x)))
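
A more direct way to get the maximum number is sketched below. `max_population` is a hypothetical helper, not part of the notebook. It returns an integer instead of the string form of a list, and `None` when the text has no digits.

```python
import re

def max_population(description):
    """Return the largest integer found in the text, or None if there is none."""
    # Drop thousands separators first so '499,999' is read as one number.
    numbers = [int(n) for n in re.findall(r"\d+", description.replace(",", ""))]
    return max(numbers) if numbers else None

print(max_population("Cities from 250,000 thru 499,999"))
```

Because the values are integers, sorting or plotting them orders the groups numerically rather than alphabetically. It could be applied with `hcrime_df['POPULATION_GROUP_DESC'].map(max_population)`, assuming the column has no missing values.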

In [None]:
#Grouping by population size and summing TOTAL_OFFENDER_COUNT for each group
#Note: this sums offenders per group; .size() would count incident rows instead
x = hcrime_df.groupby(['POPULATION_NUMBERS'])['TOTAL_OFFENDER_COUNT'].sum()
x_df = pd.DataFrame({'population_numbers':x.index, 'count_of_incidents':x.values})
#Creating a Bar plot
print("CRIME INCIDENTS BASED ON THE MAX SIZE OF THE POPULATION OF A CITY OR A COUNTY")
x_df.sort_values(['count_of_incidents'], ascending=False).plot.bar(x = 'population_numbers', y = 'count_of_incidents')

### Group incidents and location correlation

Treating a group as 3 or more people, I selected the cases where the offender count is 3 or greater and looked for the locations where group offending is most often observed.

Data Wrangling point 3 (exploding the column) is applied here.

RESULT:
Highway/Road/Alley/Street/Sidewalk is the location where group offending occurs most often.

Based on this result, more security could be provided at these locations where group crimes are prevalent.

In [None]:
#A few values in the LOCATION_NAME column hold multiple locations separated by ';'. This function splits those values into a list
def split_multi_location(text):
    if ';' in text:
        return text.split(';')
    else :
        return text

In [None]:
#Count of non-group incidents
non_group_incident_cnt = len(hcrime_df[hcrime_df.TOTAL_OFFENDER_COUNT < 3])
print(f'Total non-group offending incident count from 2001 to 2020 = {non_group_incident_cnt}')

#Count of group incidents
group_incident_cnt = len(hcrime_df[hcrime_df.TOTAL_OFFENDER_COUNT >= 3])
print(f'Total group offending incident count from 2001 to 2020 = {group_incident_cnt}')

#filtering where offending happened in group
grp_series = hcrime_df.loc[hcrime_df["TOTAL_OFFENDER_COUNT"] >= 3,"LOCATION_NAME"]
#Creating a dataframe
grp_df = pd.DataFrame({'group_count':grp_series.index, 'location_name':grp_series.values})
#In case of multiple locations
grp_df['location_name'] = grp_df['location_name'].map(lambda x: split_multi_location(x))
#Exploding the column
m = grp_df.explode('location_name')

location_series = m.groupby(m['location_name']).size()

location_df = pd.DataFrame({'crime_location':location_series.index, 'group_incident_count':location_series.values})

#Sorting in descending order of incident count and keeping the top 10
location_df = location_df.sort_values(['group_incident_count'], ascending=False).head(10)
location_df.head()
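
The split-and-explode step can be seen on a tiny frame. The rows and location names below are made up for illustration, and the sketch uses pandas' `str.split` in place of the helper function above.

```python
import pandas as pd

# Hypothetical rows: the first incident lists two locations separated by ';'.
toy = pd.DataFrame({
    "TOTAL_OFFENDER_COUNT": [3, 1, 4],
    "LOCATION_NAME": ["Park/Playground;School/College", "Residence/Home", "Park/Playground"],
})

# Keep group incidents (3 or more offenders), split on ';', then explode
# so that every location gets its own row.
group_rows = toy[toy["TOTAL_OFFENDER_COUNT"] >= 3]
exploded = group_rows.assign(
    LOCATION_NAME=group_rows["LOCATION_NAME"].str.split(";")
).explode("LOCATION_NAME")

location_counts = exploded["LOCATION_NAME"].value_counts()
print(location_counts)
```

One consequence of exploding is that an incident with two locations contributes one row to each of them, so the location counts can add up to more than the number of group incidents.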

In [None]:
#Creating the pie chart to represent the same shown above
print("SHOWCASE OF LOCATION CORRELATION TO GROUP HATE CRIME")
location_df.set_index('crime_location', inplace=True)
location_df.plot.pie(y='group_incident_count',subplots=True ,figsize=(10,10))

### Juveniles in crime over the years, in percentage measures

Juvenile hate crime decreased from 2019 to 2020.

This outcome shows that the juvenile offender count improved from 2019 to 2020.

In [None]:
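#Keeping only the incidents from 2019 and 2020
last_two_year = hcrime_df.loc[(hcrime_df['DATA_YEAR'] == 2020) | (hcrime_df['DATA_YEAR'] == 2019)]
#Dropping rows that have a missing value in any column
last_two_year = last_two_year.dropna()
#Keeping only incidents with at least one juvenile offender
last_two_year = last_two_year[last_two_year['JUVENILE_OFFENDER_COUNT']!=0]
juv_df = last_two_year[['DATA_YEAR','JUVENILE_OFFENDER_COUNT']]
#Summing the juvenile offender counts per year and plotting them
juv_df = juv_df.groupby(['DATA_YEAR'])['JUVENILE_OFFENDER_COUNT'].sum()
juv_df.plot.bar(ylabel='Juvenile crime count')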
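
To express the change between the two years as a percentage, a calculation along these lines could be used. The totals below are made-up numbers, not results from the dataset, and `percent_change` is a hypothetical helper.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent (negative means a decrease)."""
    return (new - old) / old * 100

# Hypothetical yearly totals of juvenile offenders.
toy_totals = {2019: 200, 2020: 150}

change = percent_change(toy_totals[2019], toy_totals[2020])
print(f'Change from 2019 to 2020 = {change:.1f}%')
```

With the yearly sums held in a pandas Series indexed by year, `Series.pct_change()` multiplied by 100 should give the same figure.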

#### Data Cleaning usage:
With respect to data cleaning, this data required more data wrangling than data cleaning. The data I chose was mostly neat and usable.
#### My Approach to the analysis:
While searching for data, I looked for a dataset that is easy for a human to read and has enough attributes to experiment with in the analysis.
Crime data is fairly common across the internet. My approach here was to retrieve something new that was not already available.

The outcomes are provided in the markdown of each analysis above.