### Import the data 

In [26]:
import pandas as pd
data1 = pd.read_csv('data/data1.csv')


## What is correlation between two variables?

Correlation is a bivariate analysis that measures both the strength of the association between two variables and the direction of the relationship. The correlation coefficient ranges from -1 to +1. A value of ±1 indicates a perfect degree of association between the two variables; the closer the coefficient is to 0, the weaker the relationship. The sign of the coefficient gives the direction: a + sign indicates a positive relationship and a - sign indicates a negative relationship.

#### Data preprocessing:
Columns: State, District, Year, Rape, Kidnapping and Abduction, Dowry Deaths, Assault on women with intent to outrage her modesty, Insult to modesty of women, Cruelty by Husband or his Relatives, Importation of Girls


- I aggregate the district-level values of each crime against women into the corresponding state.
- I then aggregate the state-level values of each crime to get the total reported cases for each year.
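The two aggregation steps can be illustrated with a small dependency-free sketch. The rows and counts below are invented for illustration (they are not taken from `data1.csv`), and `yearly_totals` is a hypothetical helper. The `TOTAL` rows stand in for the per-state summary rows that the notebook filters on.

```python
from collections import defaultdict

# Hypothetical rows: (state, district, year, cases). All values are invented.
rows = [
    ("STATE A", "DISTRICT 1", 2001, 10),
    ("STATE A", "DISTRICT 2", 2001, 15),
    ("STATE A", "TOTAL", 2001, 25),  # state-level summary row
    ("STATE B", "DISTRICT 1", 2001, 7),
    ("STATE B", "TOTAL", 2001, 7),
    ("STATE A", "DISTRICT 1", 2002, 12),
    ("STATE A", "TOTAL", 2002, 12),
]


def yearly_totals(rows, only_summary_rows):
    """Sum cases per year, optionally keeping only the per-state TOTAL rows."""
    totals = defaultdict(int)
    for state, district, year, cases in rows:
        if only_summary_rows and district != "TOTAL":
            continue
        totals[year] += cases
    return dict(totals)


correct = yearly_totals(rows, only_summary_rows=True)
double_counted = yearly_totals(rows, only_summary_rows=False)
print(correct)         # yearly totals built from the TOTAL rows only
print(double_counted)  # summing every row counts each district twice
```

Summing only the summary rows gives the yearly totals; summing every row counts each district once directly and once more through its state total.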

In [2]:
# Select the rows for all years, keeping only the total row of each state/UT (all of India)

yrs = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]
yrsdata = data1[data1.Year.isin(yrs)]
yrs_total = yrsdata[yrsdata.DISTRICT == "TOTAL"]
yrs_total_delhi = yrsdata[yrsdata.DISTRICT == "DELHI UT TOTAL"]
yrs_total = pd.concat([yrs_total, yrs_total_delhi])

# Group yrs_total (not data1) by year: data1 also holds the district rows,
# so its values would be counted twice.
total = yrs_total.groupby(['Year']).sum()
total['Year'] = yrs


The population data covers 2001 to 2011 only, whereas each crime series (for example r_total, the total rape cases) has 12 elements, one per year from 2001 to 2012. We therefore drop the last element of every series so that both datasets cover the same years.

In [3]:
dowryD = total["Dowry Deaths"]
dowryD = dowryD.drop(dowryD.index[-1])

cruelty = total["Cruelty by Husband or his Relatives"]
cruelty = cruelty.drop(cruelty.index[-1])

importation = total["Importation of Girls"]
importation = importation.drop(importation.index[-1])

AOW = total["Assault on women with intent to outrage her modesty"]
AOW = AOW.drop(AOW.index[-1])

insult = total["Insult to modesty of Women"]
insult = insult.drop(insult.index[-1])

r_total = total["Rape"]
r_total = r_total.drop(r_total.index[-1])

kidnap = total["Kidnapping and Abduction"]
kidnap = kidnap.drop(kidnap.index[-1])



# type(r_total)

Reading the population data

In [4]:
pop = pd.read_csv('data/pop.csv')
pop_total = pop.sum()
# Total population for each of the 11 years 2001-2011
yearWisePop = [pop_total[str(year)] for year in yrs[:11]]

### Scipy's Pearson R
We use the pearsonr function from the SciPy library.
##### Pearson's correlation coefficient measures the linear relationship between two datasets.
- The Pearson correlation coefficient varies between -1 and +1, with 0 implying no linear correlation.
- Correlations of -1 or +1 imply an exact linear relationship.
- A positive correlation implies that as __x increases, so does y__.
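To make the definition concrete, here is a minimal sketch that computes Pearson's r by hand as the covariance of the two series divided by the product of their standard deviations. `pearson_r` is a hypothetical helper written for illustration, not SciPy's implementation, and the input lists are invented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
perfect_pos = pearson_r(x, [2, 4, 6, 8, 10])  # y rises exactly with x
perfect_neg = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly as x rises
no_linear = pearson_r(x, [1, 0, 1, 0, 1])     # no linear trend
print(perfect_pos, perfect_neg, no_linear)
```

The three toy inputs correspond to the three cases in the list above: an exact increasing line, an exact decreasing line, and a series with no linear trend.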


In [6]:
import scipy.stats as sci
# Rape cases
r_coefficient = sci.pearsonr(yearWisePop, r_total)
r_coefficient[0]

0.9768003888946082

The value is close to 1, which indicates that the number of reported rape cases is strongly correlated with population.

### Dowry Deaths

In [7]:
dowry_coefficient = sci.pearsonr(yearWisePop, dowryD)
dowry_coefficient[0]

0.9156808172941565

### Cruelty by Husband or his relatives

In [8]:
cruelty_coefficient = sci.pearsonr(yearWisePop, cruelty)
cruelty_coefficient[0]

0.9776682035355758

### Assault on women with intent to outrage her modesty

In [9]:
AOW_coefficient = sci.pearsonr(yearWisePop, AOW)
AOW_coefficient[0]

0.9314610517843717

### Importation of girls


In [10]:
importation_coefficient = sci.pearsonr(yearWisePop, importation)
importation_coefficient[0]

-0.42173415607411013

### Insult to modesty of women

In [11]:
insult_coefficient = sci.pearsonr(yearWisePop, insult)
insult_coefficient[0]

-0.142283787806352

Since the value is negative but close to 0, we can conclude that insult to modesty of women has little linear correlation with population.

In [24]:
from scipy import stats
slope, intercept, r_value, p_value, std_err= stats.linregress(yearWisePop, r_total)
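
For a single predictor, the slope and intercept of a least-squares line have a closed form: the slope is the sum of products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below works this out on invented numbers; `fit_line` is a hypothetical helper, not part of SciPy.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data lying exactly on y = 3x + 2
x = [0, 1, 2, 3]
y = [2, 5, 8, 11]
slope, intercept = fit_line(x, y)
prediction = intercept + slope * 10  # predicted y at x = 10
print(slope, intercept, prediction)
```

A prediction for a new x is then simply `intercept + slope * x`.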

### A linear regression model that predicts the number of rape cases given the population of the country.

__Input data:__ population from 2001-2011 and its corresponding rape cases reported in each year.

__Libraries used:__ scikit-learn, NumPy.

In [25]:
from sklearn.linear_model import LinearRegression
import numpy as np
r_total = np.asarray(r_total)

# Convert both series to 1-D float arrays (11 values each)
popList = np.asarray(yearWisePop, dtype=float)
r_numpy = np.asarray(r_total, dtype=float)


reg = LinearRegression().fit(popList.reshape(-1,1), r_numpy)


#### We validate the model with the built-in __score__ method, which computes Y' = reg.predict(X) and compares Y' with the actual Y in the training data. It returns the coefficient of determination (R²) of the fit.

##### The closer the value is to 1, the better the model fits the training data.

In [21]:
reg.score(popList.reshape(-1,1), r_numpy)

0.9541389997446578
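
The value returned by `score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. For a least-squares line with a single predictor, R² on the training data equals the square of Pearson's r, which is consistent with the numbers in this notebook (0.9768² ≈ 0.9541). A minimal sketch of the definition on invented values; `r_squared` is a hypothetical helper, not scikit-learn's implementation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect, mean_only)
```

A perfect fit scores 1, while a model that always predicts the mean of Y scores 0, so values near 1 mean the line explains most of the variation in the training data.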

### We will predict the number of rape cases if the population of the whole country is about 1.236 billion.

In [22]:
reg.predict(np.array([[1236005000]]))

array([24656.9264845])

#### So the model predicts roughly 24.7k reported rape cases