# Analysis of Hate Crimes in India
# Urisha Kreem, Nidhi Allani, Ernev Sharma

# INTRODUCTION

The overall objective of this project is to analyze hate crimes in India. The data comes from the National Crime Records Bureau of India and covers state-wise and district-wise crimes against Scheduled Castes from 2001 to 2012. The primary goal is to inform the general public about where hate crimes occur most frequently, how the number of hate crimes has changed over the years, and what the current trends may suggest about future hate crime patterns.

Each record gives, for one district and year, the number of crimes of each type in which the victim was registered as a member of a Scheduled Caste. Crimes are recorded regardless of the caste of the offender, which we consider beneficial: it avoids one source of bias in data collection and makes our findings more significant.

This topic is motivated by several factors. One of the main ones is supported by a statistic: **INSERT STATISTIC HERE** {**INSERT LINK HERE**}. The issue of hate crimes in this region is current and is not going away anytime soon.

From a data science perspective, analyzing the available data can give us insight into crucial information: patterns that could be used to help reduce hate crimes, and findings that can raise awareness. Crimes against the historically marginalized castes in India represent an extreme form of prejudice and discrimination. We therefore aim to present the data in a way that exposes such occurrences, backed by numbers, to bring light to the issue and encourage change in how the system works.

Throughout this tutorial we attempt to uncover these trends in hate crimes in India in a clear, concise manner, making the available data easy for the public to interpret and understand.

# DATA COLLECTION 

# Imports: 

In [23]:
import requests  # HTTP GET requests
import pandas as pd  # DataFrames for loading and cleaning the data
import numpy as np  # numerical arrays
from datetime import datetime  # datetime objects
import json  # needed for the Google API
import os  # file handling (os.path) and os.system below
import matplotlib.pyplot as plt  # plotting
from sklearn import linear_model  # linear regression
from sklearn.preprocessing import PolynomialFeatures  # polynomial regression

# Export this notebook to HTML; os.system returns the command's exit status
os.system('jupyter nbconvert --to html FinalProject.ipynb')

0

# Data Extraction: 

In [24]:
data = pd.read_csv("crime_by_district.csv")
data.head()

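| Unnamed: 0 | STATE/UT | DISTRICT | Year | Murder | Assault on women | Kidnapping and Abduction | Dacoity | Robbery | Arson | Hurt | Prevention of atrocities (POA) Act | Protection of Civil Rights (PCR) Act | Other Crimes Against SCs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ANDHRA PRADESH | ADILABAD | 2001.0 | 0.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 15.0 | 32.0 |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | ANDHRA PRADESH | ANANTAPUR | 2001.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49.0 | 21.0 | 0.0 | 53.0 |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |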


This extraction shows that many entries in the table are 'NaN' or *, meaning that the data could not be collected or is simply missing. To do further analysis we need to deal with these values strategically. This leads us to Data Processing, where we clean the data so that it is easy to read and efficient to analyze later on.
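Before choosing how to handle the gaps, it can help to count them. The following is a minimal sketch, not part of the original notebook: the `sample` frame is a stand-in built from the first five rows of the `data.head()` output and only a subset of its columns, and the names `sample`, `missing_per_column`, and `star_cells` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for `data`: first five rows of the head() output,
# restricted to a few columns to keep the example short.
columns = ["STATE/UT", "DISTRICT", "Year", "Murder", "Hurt"]
blank = [np.nan] * len(columns)
rows = [
    blank,
    ["ANDHRA PRADESH", "ADILABAD", 2001.0, 0.0, 3.0],
    blank,
    ["ANDHRA PRADESH", "ANANTAPUR", 2001.0, 0.0, 49.0],
    blank,
]
sample = pd.DataFrame(rows, columns=columns)

# Number of missing (NaN) entries in each column.
missing_per_column = sample.isna().sum()

# Number of cells holding a literal "*" placeholder; these are strings,
# not NaN, so isna() and dropna() would not catch them.
star_cells = int(sample.astype(str).eq("*").to_numpy().sum())

print(missing_per_column)
print("cells equal to '*':", star_cells)
```

If the full file does contain literal `*` entries, one option would be to pass `na_values=["*"]` to `pd.read_csv` so that they are read as NaN as well.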

# Data Processing:

In [25]:
# Drop every row that contains a "NaN" entry, then re-index the dataframe from 0
data = data.dropna().reset_index(drop=True)
data.head()

| Unnamed: 0 | STATE/UT | DISTRICT | Year | Murder | Assault on women | Kidnapping and Abduction | Dacoity | Robbery | Arson | Hurt | Prevention of atrocities (POA) Act | Protection of Civil Rights (PCR) Act | Other Crimes Against SCs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ANDHRA PRADESH | ADILABAD | 2001.0 | 0.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 15.0 | 32.0 |
| 1 | ANDHRA PRADESH | ANANTAPUR | 2001.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49.0 | 21.0 | 0.0 | 53.0 |
| 2 | ANDHRA PRADESH | CHITTOOR | 2001.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38.0 | 36.0 | 0.0 | 34.0 |
| 3 | ANDHRA PRADESH | CUDDAPAH | 2001.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 52.0 | 0.0 | 25.0 |
| 4 | ANDHRA PRADESH | EAST GODAVARI | 2001.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 12.0 | 63.0 | 7.0 |


Here we went through the entire dataframe and removed every row containing "NaN". We consider this a valid way to handle the missing data because a row with "NaN" in one column usually had "NaN" in almost all of its columns, owing to the way the data was initially presented to us. Such rows carried no usable information, so it was safe to remove them rather than use other methods such as single or multiple imputation.
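The claim that the dropped rows were essentially empty can be checked by comparing two drop policies. This is a hedged sketch rather than the notebook's own code: `sample` again stands in for the raw `data` frame (first five rows of the original output, a subset of columns), and `partial_rows` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Stand-in for the raw frame: blank rows interleaved with real records.
columns = ["STATE/UT", "DISTRICT", "Year", "Murder", "Hurt"]
blank = [np.nan] * len(columns)
rows = [
    blank,
    ["ANDHRA PRADESH", "ADILABAD", 2001.0, 0.0, 3.0],
    blank,
    ["ANDHRA PRADESH", "ANANTAPUR", 2001.0, 0.0, 49.0],
    blank,
]
sample = pd.DataFrame(rows, columns=columns)

# how="any" drops a row if at least one value is missing (the default);
# how="all" drops a row only if every value is missing.
any_dropped = sample.dropna(how="any")
all_dropped = sample.dropna(how="all")

# Rows that are only partially filled: kept by how="all" but lost by
# how="any". Zero means no partially filled record was discarded.
partial_rows = len(all_dropped) - len(any_dropped)

cleaned = any_dropped.reset_index(drop=True)
print("partially filled rows lost:", partial_rows)
print(cleaned)
```

On the full dataset, a nonzero `partial_rows` would indicate records with some real values that the default drop discards, which is where imputation might be worth reconsidering.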

Now that we have cleaned our data, we move to the exploratory analysis and data visualization portion of the project, where we plot the data and observe and analyze the trends it presents. In this step we will also perform statistical analyses in order to obtain better supporting evidence for the trends we may discover.

# EXPLORATORY ANALYSIS AND DATA VISUALIZATION:

In [None]:
# Heat maps to show where hate crimes are most concentrated {red areas on the map}
# Map to show the number of hate crimes that occurred in each state
# Bar graph to show counts of hate crimes per state over time
# Histogram to show which type of hate crime is most common {country-wide, then broken down per state}
# Linear regression / t-test to see whether the predictors explain the response well
# ...

In this section we will be conducting the exploratory data analysis and data visualization. 
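Most of the planned views reduce to a few group-by aggregations over the cleaned frame. The sketch below is illustrative only: it rebuilds the five cleaned 2001 rows shown earlier as `sample` (with years and counts written as integers), and `crime_cols`, `Total`, `by_state_year`, and `by_type` are names assumed here rather than taken from the notebook.

```python
import pandas as pd

crime_cols = [
    "Murder", "Assault on women", "Kidnapping and Abduction", "Dacoity",
    "Robbery", "Arson", "Hurt", "Prevention of atrocities (POA) Act",
    "Protection of Civil Rights (PCR) Act", "Other Crimes Against SCs",
]

# The five cleaned rows from the head() output above.
rows = [
    ["ANDHRA PRADESH", "ADILABAD", 2001, 0, 1, 4, 0, 0, 0, 3, 0, 15, 32],
    ["ANDHRA PRADESH", "ANANTAPUR", 2001, 0, 4, 0, 0, 0, 0, 49, 21, 0, 53],
    ["ANDHRA PRADESH", "CHITTOOR", 2001, 3, 3, 0, 0, 0, 0, 38, 36, 0, 34],
    ["ANDHRA PRADESH", "CUDDAPAH", 2001, 0, 3, 0, 0, 0, 0, 20, 52, 0, 25],
    ["ANDHRA PRADESH", "EAST GODAVARI", 2001, 1, 3, 0, 0, 0, 0, 3, 12, 63, 7],
]
sample = pd.DataFrame(rows, columns=["STATE/UT", "DISTRICT", "Year"] + crime_cols)

# Total recorded crimes per district-year record.
sample["Total"] = sample[crime_cols].sum(axis=1)

# Counts per state and year: the input for a per-state bar graph over time.
by_state_year = sample.groupby(["STATE/UT", "Year"])["Total"].sum()

# Counts per crime type, most frequent first.
by_type = sample[crime_cols].sum().sort_values(ascending=False)

print(by_state_year)
print(by_type)
```

With the full dataset in place of `sample`, calling `by_type.plot(kind="bar")` or `by_state_year.unstack("Year").plot(kind="bar")` would be one way to draw the corresponding charts.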