# Analysis of Hate Crimes in India
# Urisha Kreem, Nidhi Allani, Ernev Sharma

# INTRODUCTION

The overall objective of this project is to analyze hate crimes that have taken place in India. The data comes from the National Crime Records Bureau of India and covers state-wise and district-wise crimes against Scheduled Castes from 2001 to 2012. The primary goal is to inform the general public about the areas where hate crimes occur most frequently, how the number of hate crimes has changed over the years, and how current trends could help predict future hate crime patterns.

Each record gives the number of crimes of each type in which the victim was registered as a member of a Scheduled Caste. Crimes are recorded regardless of the caste of the offender, which we consider beneficial: it avoids one source of bias in data collection and makes our findings more significant.

There are many reasons this topic is relevant. One of the main reasons is supported by a statistic: **INSERT STATISTIC HERE** {**INSERT LINK HERE**}. This information is current, and the issue of hate crimes in this region is not going away anytime soon.

From a data science perspective, analyzing the information we have access to can give us insight into patterns that could be used to help minimize hate crimes, and lets us raise awareness of what we uncover. Crimes against the historically marginalized castes in India represent an extreme form of prejudice and discrimination. We therefore aim to present the data in a way that exposes such occurrences and backs them up with numbers, to bring the issue to light and encourage change in how the system works.

Throughout this tutorial we attempt to uncover these trends in hate crimes in India in a clear, concise manner, making the available data easy for the public to interpret and understand.

# DATA COLLECTION 

# Imports: 

In [164]:
import requests #for get request
import pandas as pd #dataframes
import numpy as np #numerical arrays
from datetime import datetime #datetime objects
import json #needed for google API
import os.path #needed for file reading (also makes os.system available)
import matplotlib.pyplot as plt #for plotting
from sklearn import linear_model #for linear regression
from sklearn.preprocessing import PolynomialFeatures #polynomial regression
from operator import itemgetter #for selecting the max by tuple position
os.system('jupyter nbconvert --to html FinalProject.ipynb') #export this notebook to HTML

# Data Extraction: 

In [165]:
data = pd.read_csv("crime_by_state_rt.csv")
data = data.rename(columns = {data.columns[0]:'State', data.columns[9]:'Prevention of atrocities Act', data.columns[10]:'Protection of Civil Rights Act', data.columns[11]:'Other Crimes Against SCs'})
data.head()

Unnamed: 0,State,Year,Murder,Assault on women,Kidnapping and Abduction,Dacoity,Robbery,Arson,Hurt,Prevention of atrocities Act,Protection of Civil Rights Act,Other Crimes Against SCs
0,ANDHRA PRADESH,2001,45,69,22,3,2,6,518,950,312,1006
1,ANDHRA PRADESH,2002,60,98,18,0,4,12,568,830,459,1336
2,ANDHRA PRADESH,2003,33,79,27,1,15,4,615,1234,165,1386
3,ANDHRA PRADESH,2004,39,66,28,0,7,20,474,1319,68,1234
4,ANDHRA PRADESH,2005,37,74,21,0,0,9,459,1244,61,1212


From this extraction, we can see that quite a few entries across the table are 'NaN' or *. This means the data could not be collected or is simply missing. To do further analysis we need to deal with these values strategically. This leads us to Data Processing, where we clean the data so that it is easy to read and efficient to analyze later on.
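One way to make sure the * markers are treated the same as 'NaN' is to declare them as missing values when the file is read. The sketch below is only an illustration: the two-row sample and its column names are hypothetical stand-ins for the real CSV, and it assumes that * is the only non-standard missing-value marker.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same spirit as the real CSV;
# the '*' marker and the blank cell both stand for missing data.
sample_csv = io.StringIO(
    "STATE/UT,Year,Murder,Arson\n"
    "ANDHRA PRADESH,2001,45,6\n"
    "EXAMPLE STATE,2001,*,\n"
)

# na_values tells read_csv to convert '*' to NaN while parsing
sample = pd.read_csv(sample_csv, na_values=["*"])

# Number of missing entries in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

With the markers converted on load, a later `dropna` call would catch both kinds of missing entry.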

# Data Processing:

In [166]:
data.dropna(inplace=True) # Drop all rows that have "NaN" as an entry 

data.index = range(len(data)) #Re-Index the dataframe 
data.head(10)

Unnamed: 0,State,Year,Murder,Assault on women,Kidnapping and Abduction,Dacoity,Robbery,Arson,Hurt,Prevention of atrocities Act,Protection of Civil Rights Act,Other Crimes Against SCs
0,ANDHRA PRADESH,2001,45,69,22,3,2,6,518,950,312,1006
1,ANDHRA PRADESH,2002,60,98,18,0,4,12,568,830,459,1336
2,ANDHRA PRADESH,2003,33,79,27,1,15,4,615,1234,165,1386
3,ANDHRA PRADESH,2004,39,66,28,0,7,20,474,1319,68,1234
4,ANDHRA PRADESH,2005,37,74,21,0,0,9,459,1244,61,1212
5,ANDHRA PRADESH,2006,52,97,12,3,5,13,657,1514,93,1445
6,ANDHRA PRADESH,2007,46,105,25,0,0,17,541,1200,122,1327
7,ANDHRA PRADESH,2008,48,88,18,0,0,5,651,1383,123,1682
8,ANDHRA PRADESH,2009,35,99,19,1,4,12,722,1737,39,1836
9,ANDHRA PRADESH,2010,43,100,18,0,1,17,709,1509,50,1874


Here we went through the entire dataframe and removed every row containing "NaN". We considered this a valid solution to our missing-data problem because rows with "NaN" in one column usually had "NaN" in almost all of the other columns, due to the way the data was initially presented to us. Such entries could not have provided any solid information, so it was safe to remove those rows instead of using other methods such as single imputation or multiple imputation.

Now that we have cleaned up our data, we will move to the exploratory analysis and data visualization portion of the project, where we plot our data and observe and analyze the trends it presents. In this step we will also perform statistical analyses in order to obtain better supporting evidence for any trends we discover.
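The claim that incomplete rows are almost entirely empty can be checked before anything is dropped. The following is a minimal sketch on a hypothetical three-column sample (the state name `EXAMPLE STATE` and the values are made up for illustration); it measures what fraction of the crime columns is missing in each row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one complete row and one row that is entirely missing
sample = pd.DataFrame({
    "State": ["ANDHRA PRADESH", "EXAMPLE STATE"],
    "Year": [2001, 2001],
    "Murder": [45, np.nan],
    "Arson": [6, np.nan],
    "Hurt": [518, np.nan],
})

crime_cols = list(sample.columns[2:])

# Fraction of crime columns that are missing in each row
missing_fraction = sample[crime_cols].isna().mean(axis=1)

# Rows that dropna() would remove, with how empty each one is
incomplete = sample[missing_fraction > 0]
print(incomplete[["State", "Year"]].assign(missing_fraction=missing_fraction))

# Drop the incomplete rows and renumber the index from 0
cleaned = sample.dropna().reset_index(drop=True)
```

If most incomplete rows show a fraction close to 1, dropping them loses little information; fractions close to 0 would argue for imputation instead.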

# EXPLORATORY ANALYSIS AND DATA VISUALIZATION:

In [167]:
# Heat maps to show where the trend is high for hate crimes to occur {red area on map}
# Map to show numerical values of how many counts of hate crime occurred in each state
# bar graph to show counts of hate crimes per state over time
# histogram to show which type of hate crime is most "popular" {country and then break down and do per state}
# Linear Regression / T-Test to see if all predictors explain response well 
# ... 

In this section we conduct exploratory data analysis and data visualization. We use visualizations to observe trends and to better understand the nuances of crime occurrence across India. We can also use statistical analysis to obtain better supporting evidence for those trends.

For our analysis we are interested in the crime trends for every state in India. Below we find, for each individual state, the crime category with the highest total over all years, and print that total.

In [168]:
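states = sorted(data['State'].unique())
crimes = data.columns[2:] # every column after State and Year is a crime category
years = [sorted(data['Year'].unique())]
result = []

for s in states:
    # Total of each crime category for this state across all years
    state_arr = [(c, data.loc[(data['State'] == s), c].sum()) for c in crimes]
    # Keep the category with the largest total; ties go to the first column,
    # so a state whose totals are all 0 is reported as ('Murder', 0)
    top_crime, top_total = max(state_arr, key = itemgetter(1))
    result.append((s, top_crime, top_total))
result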

[('A & N ISLANDS', 'Murder', 0),
 ('ANDHRA PRADESH', 'Other Crimes Against SCs', 17412),
 ('ARUNACHAL PRADESH', 'Murder', 1),
 ('ASSAM', 'Hurt', 341),
 ('BIHAR', 'Prevention of atrocities Act', 23425),
 ('CHANDIGARH', 'Prevention of atrocities Act', 6),
 ('CHHATTISGARH', 'Other Crimes Against SCs', 2672),
 ('D & N HAVELI', 'Other Crimes Against SCs', 5),
 ('DAMAN & DIU', 'Other Crimes Against SCs', 5),
 ('DELHI', 'Prevention of atrocities Act', 256),
 ('GOA', 'Prevention of atrocities Act', 17),
 ('GUJARAT', 'Other Crimes Against SCs', 5399),
 ('HARYANA', 'Other Crimes Against SCs', 1148),
 ('HIMACHAL PRADESH', 'Prevention of atrocities Act', 629),
 ('JAMMU & KASHMIR', 'Other Crimes Against SCs', 12),
 ('JHARKHAND', 'Prevention of atrocities Act', 2055),
 ('KARNATAKA', 'Prevention of atrocities Act', 13773),
 ('KERALA', 'Other Crimes Against SCs', 2231),
 ('LAKSHADWEEP', 'Murder', 0),
 ('MADHYA PRADESH', 'Other Crimes Against SCs', 30721),
 ('MAHARASHTRA', 'Other Crimes Against SCs', 5

The list above includes States/UTs that have 0 recorded crimes in the table we are using. We recognize that this is very likely an inaccurate representation of actual crime in India. For example, A & N ISLANDS is listed as having no crimes, but we were able to find a source that cites crime activity in this area during the time range our data table represents: http://www.neighbourhoodinfo.co.in/crime/Andaman-and-Nicobar-Islandsis . We will keep this in mind for the rest of our analysis, as it could make other observed trends misleading.
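States/UTs with no recorded crimes can also be flagged explicitly, so that they are not mistaken for states where a particular crime dominates. The sketch below assumes a hypothetical miniature of the cleaned table and uses a pandas `groupby` instead of the loop above; it is an alternative illustration, not the method used in the rest of this tutorial.

```python
import pandas as pd

# Hypothetical miniature of the cleaned table
sample = pd.DataFrame({
    "State": ["A & N ISLANDS", "A & N ISLANDS", "ASSAM", "ASSAM"],
    "Year": [2001, 2002, 2001, 2002],
    "Murder": [0, 0, 2, 1],
    "Hurt": [0, 0, 30, 25],
})

crime_cols = list(sample.columns[2:])

# Sum every crime column per state across all years
totals = sample.groupby("State")[crime_cols].sum()

# States whose totals are zero in every category
no_records = totals.index[(totals == 0).all(axis=1)].tolist()
print(no_records)

# Most frequent category, only for states with at least one recorded crime
reported = totals.drop(index=no_records)
top_crime = reported.idxmax(axis=1)
print(top_crime)
```

Keeping the zero-record states in a separate list makes it easy to report them as "no data" rather than as a meaningful result.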

In [169]:
# Count how many States/UTs have each crime category as their most frequent one
crime_occur = [0] * len(crimes)

for i, c in enumerate(crimes):
    for _state, top_crime, _total in result:
        if c == top_crime:
            crime_occur[i] += 1
print(crimes)
print(crime_occur)

Index(['Murder', 'Assault on women', 'Kidnapping and Abduction', 'Dacoity',
       'Robbery', 'Arson', 'Hurt', 'Prevention of atrocities Act',
       'Protection of Civil Rights Act', 'Other Crimes Against SCs'],
      dtype='object')
[5, 0, 1, 0, 0, 0, 2, 11, 1, 15]


From the above, ranking the crime categories by the number of States/UTs in which each one is the most frequent gives:
    1. Other Crimes Against SCs (15)
    2. Prevention of atrocities Act (11)
    3. Murder (5)
    4. Hurt (2)
    5, 6. Protection of Civil Rights Act, Kidnapping and Abduction (1 each)
    7, 8, 9, 10. Assault on women, Dacoity, Robbery, Arson (0 each)
    
Though we see that Other Crimes Against SCs is the most common category across India, this category is particularly vague: it could encompass a wide range of crimes and tells us little about the kinds of crimes occurring around the country. For this reason we will not focus on this category for the rest of our analysis.
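Setting the catch-all category aside can be done by dropping its column before ranking. The following is a rough sketch under stated assumptions: the `totals` table, the state names and the numbers are hypothetical, and only the column name `Other Crimes Against SCs` comes from our data.

```python
from collections import Counter
import pandas as pd

# Hypothetical per-state totals (one row per state, one column per category)
totals = pd.DataFrame(
    {
        "Murder": [10, 6, 0],
        "Hurt": [40, 5, 7],
        "Other Crimes Against SCs": [90, 2, 8],
    },
    index=["STATE A", "STATE B", "STATE C"],
)

# Leave the catch-all category out before ranking
specific = totals.drop(columns=["Other Crimes Against SCs"])

# Most frequent remaining category per state, then a tally across states
top_specific = specific.idxmax(axis=1)
tally = Counter(top_specific)
print(tally.most_common())
```

The tally then ranks only the specific crime categories, which is closer to what the rest of the analysis focuses on.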