## Import All Necessary Modules And Set Up Project
If any of these imports fail, run the following command:

  $ python -m pip install -r requirements.txt

It installs every module this project needs and must be run from inside the project directory.

A virtual environment is recommended for this project so that its packages do not conflict with versions already installed on your system.

To get the project running, activate the virtual environment (if you use one), run the pip install command, and then launch Jupyter Lab from inside the project directory.

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np
from scipy.stats import t
from numpy.polynomial.polynomial import polyfit

In [None]:
# Load the crime/housing records and the ZIP code reference table
masterData=pd.read_csv("data/crime-housing-austin-2015.csv")
ZipcodeData=pd.read_csv("data/AustinZipCodes.csv")
# Show every column when displaying a DataFrame (accepts an int, or None for no limit)
pd.set_option('display.max_columns', None)
# Explore the dataset
masterData.head()

In [None]:
masterData.Zip_Code_Crime.unique()

In [None]:
ZipcodeData.head()

In [None]:
# List all the columns in the dataset
list(masterData)
# List the unique values in the "Council_District" column
masterData.Council_District.unique()
# Drop rows that contain NaN values
masterData=masterData.dropna()
ZipcodeData=ZipcodeData.dropna()

In [None]:
# Inspect the column dtypes
masterData.dtypes
# "Council_District" and "Zip_Code_Crime" are floats; convert them to int first
masterData=masterData.astype({"Council_District":'int',"Zip_Code_Crime":'int'})
# Several columns carry a dollar or percent sign; strip it before casting.
# regex=False treats '$' as a literal character rather than a regex end-of-string anchor.
#dataset['colname'] = dataset['colname'].str.replace('$', '', regex=False).astype('float')
masterData['Medianhouseholdincome'] = masterData['Medianhouseholdincome'].str.replace('$', '', regex=False).astype('float')
masterData['Medianrent'] = masterData['Medianrent'].str.replace('$', '', regex=False).astype('float')
masterData['Medianhomevalue'] = masterData['Medianhomevalue'].str.replace('$', '', regex=False).astype('float')
masterData['Unemployment'] = masterData['Unemployment'].str.replace('%', '', regex=False).astype('int')
masterData['Populationbelowpovertylevel'] = masterData['Populationbelowpovertylevel'].str.replace('%', '', regex=False).astype('int')
masterData['Non-WhiteNon-HispanicorLatino'] = masterData['Non-WhiteNon-HispanicorLatino'].str.replace('%', '', regex=False).astype('int')
masterData['HispanicorLatinoofanyrace'] = masterData['HispanicorLatinoofanyrace'].str.replace('%', '', regex=False).astype('int')
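
The symbol-stripping step above repeats the same pattern for every column, so it could be wrapped in a small helper. The sketch below is only an illustration: `strip_symbol` is a hypothetical name, and the toy values stand in for the real columns. Passing `regex=False` makes pandas treat `'$'` as a literal character instead of a regular-expression anchor.

```python
import pandas as pd


def strip_symbol(series, symbol, dtype="float"):
    """Remove a literal symbol (such as '$' or '%') and cast the result.

    Hypothetical helper, not part of the original notebook.
    """
    # regex=False: the symbol is matched as plain text, not as a regex pattern
    return series.str.replace(symbol, "", regex=False).astype(dtype)


# Toy values standing in for the real columns (illustrative only)
toy = pd.DataFrame({"rent": ["$950", "$1200"], "poverty": ["12%", "30%"]})
toy["rent"] = strip_symbol(toy["rent"], "$")
toy["poverty"] = strip_symbol(toy["poverty"], "%", dtype="int")
print(toy.dtypes)
```

The same helper could then be applied to each currency or percentage column in turn, assuming the values contain no other non-numeric characters.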

In [None]:
sub_masterData=masterData[['Key','Zip_Code_Crime','District','Clearance_Status','Highest_NIBRS_UCR_Offense_Description',
                              'Medianhouseholdincome', 'Unemployment','Populationbelowpovertylevel','Medianrent',
                              'Medianhomevalue','Non-WhiteNon-HispanicorLatino','HispanicorLatinoofanyrace']]
sub_masterData=sub_masterData.rename(columns={'Zip_Code_Crime':'Zip_Code','Highest_NIBRS_UCR_Offense_Description':'crime_type',
                                             'Clearance_Status':'CS','Medianhouseholdincome':'MhhI','Unemployment':'UE(%)',
                                             'Populationbelowpovertylevel':'Pov_lvl','Medianrent':'rent',
                                             'Medianhomevalue':'home_price','HispanicorLatinoofanyrace':'hispanic',
                                             'Non-WhiteNon-HispanicorLatino':'non_wh_non_lat'})
display(sub_masterData)
ZipcodeData=ZipcodeData.rename(columns={'Zip Code':'Zip_Code'})
display(ZipcodeData)

In [None]:
ZipcodeData.dtypes

In [None]:
#groupby zip code for sub_masterdata
grp_sub_masterData = sub_masterData.groupby('Zip_Code').agg(
    {'crime_type': 'count','MhhI':'mean','UE(%)':'mean','Pov_lvl':'mean','rent':'mean','home_price':'mean',
    'non_wh_non_lat':'mean','hispanic':'mean'})
grp_sub_masterData=grp_sub_masterData.rename(columns={'crime_type':'crime_count'})
grp_sub_masterData=grp_sub_masterData.reset_index()
merged_dataset=pd.merge(grp_sub_masterData,ZipcodeData,on='Zip_Code')

# Convert strings that use a comma as the thousands separator to float
merged_dataset['Population'] = merged_dataset['Population'].str.replace(',', '').astype(float)
merged_dataset.head()

In [None]:
# Count the total number of crimes of each type
crime_types = sub_masterData.groupby('crime_type').agg(
    {'crime_type': 'count'})
crime_types=crime_types.rename(columns={'crime_type':'count'})
crime_types=crime_types.reset_index()

#plot figure
plt.figure(figsize=(8,8))
plt.pie(data=crime_types, x='count',shadow=True)
plt.title('Austin Crime Results in 2015')
plt.legend(labels=crime_types['crime_type'])
plt.savefig('pi_plot_categorical_crime.jpeg')

## main instructions
Your analysis must include a **number of statistical methods**. You must include `Pearson correlations` (be sure to report the **`p-value`**), _scatterplots_, _averages_, _standard deviations_, and a **t-test** (or **Mann-Whitney-U test**). The specific number of analyses is left up to you, but the contribution must be significant and your project report must give detailed justification and results for each analysis.
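
The instructions allow a Mann-Whitney U test in place of a t-test, which can be a reasonable choice when the data are skewed. As a minimal sketch, assuming two independent samples, the call to `scipy.stats.mannwhitneyu` looks like this. The samples are synthetic and are not drawn from the Austin dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed samples (illustrative only, not the Austin data)
group_a = rng.lognormal(mean=7.0, sigma=0.3, size=200)
group_b = rng.lognormal(mean=7.2, sigma=0.3, size=200)

# Two-sided test of whether one group tends to have larger values than the other
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4g}")
```

With the notebook's data, the two samples could be, for example, the rent values for two clearance statuses, but that pairing is a suggestion rather than something the original analysis performs.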

## Look at summary statistics: mean, median, standard deviation, and mode

In [None]:
rent=sub_masterData.rent
homeprice=sub_masterData.home_price
print('mean: {}\nmedian: {}\nstddev: {}\nmode: {}'.format(rent.mean(), rent.median(), rent.std(), rent.mode()))
print('mean: {}\nmedian: {}\nstddev: {}\nmode: {}'.format(homeprice.mean(), homeprice.median(), homeprice.std(),
                                                          homeprice.mode()))
sub_masterData.describe()

## Show a scatterplot and regression line & Pearson R
Correlation between the percentage of the population below the poverty line and the number of crimes committed.

In [None]:
#aggregate the total number of crimes with respect to population below poverty level
crime_POV_lvl = sub_masterData.groupby('Pov_lvl').agg(
    {'crime_type': 'count'})

In [None]:
crime_POV_lvl =crime_POV_lvl.rename(columns={'crime_type':'crime_count_by_PoVlvl'})
crime_POV_lvl =crime_POV_lvl.reset_index()

In [None]:
plt.figure(figsize=(8,8))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.regplot(x='Pov_lvl', y='crime_count_by_PoVlvl', data=crime_POV_lvl)

plt.title('Correlation between population % below poverty line and crimes committed')
plt.savefig('correlation for population % below poverty line to crime commit.jpeg')
# Pearson r coefficient and its p-value
display(stats.pearsonr(crime_POV_lvl.Pov_lvl, crime_POV_lvl.crime_count_by_PoVlvl))
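
Because the assignment asks for the p-value to be reported alongside the Pearson coefficient, it can help to unpack the two values returned by `stats.pearsonr` and print them with labels. The sketch below uses synthetic data with an assumed linear trend, purely to show the pattern. It also cross-checks the coefficient against `np.corrcoef`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data: a percentage-like x and a noisy linear response (illustrative only)
x = rng.uniform(0, 40, size=50)
y = 3.0 * x + rng.normal(0, 10, size=50)

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")

# Cross-check: the same coefficient from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
```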

In [None]:
plt.figure(figsize=(8,8))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.regplot(x='rent', y='home_price', data=sub_masterData,marker='o',scatter_kws={'s':15},color='blue',line_kws={'color':'red'})
plt.title('Correlation between median rent and median home price')
plt.xlabel('rent paid ($)')
plt.ylabel('median price of house in Austin ($)')
plt.savefig('correlation for rent paid by people to price of house.jpeg')
# Pearson r coefficient and its p-value
display(stats.pearsonr(sub_masterData.rent, sub_masterData.home_price))

In [None]:
plt.figure(figsize=(8,8))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.regplot(x='Population', y='home_price', data=merged_dataset,marker='o',scatter_kws={'s':15},color='blue',line_kws={'color':'red'})
plt.title('Correlation between population and median home price')
plt.xlabel('population')
plt.ylabel('median price of house in Austin ($)')
plt.savefig('correlation for population to price of house.jpeg')
# Pearson r coefficient and its p-value
display(stats.pearsonr(merged_dataset.Population, merged_dataset.home_price))

## distribution plots

In [None]:
plt.figure()
# Histogram or KDE plot
#sns.kdeplot(data = sub_masterData['MhhI'])
#sns.histplot(data = sub_masterData['MhhI'],bins=11,kde=True,legend=True)
# kdeplot replaces the deprecated distplot(..., hist=False)
sns.kdeplot(sub_masterData[sub_masterData.CS == 'N'].rent, label='not cleared')
sns.kdeplot(sub_masterData[sub_masterData.CS == 'C'].rent, label='cleared by arrest')
sns.kdeplot(sub_masterData[sub_masterData.CS == 'O'].rent, label='cleared by exception')
plt.xlabel('median house rent ($)')
plt.legend()

## T distribution
The t-test computes a t-statistic: the difference between two sample means divided by the standard error of that difference. The p-value answers the question: if both samples came from populations with the same mean, how likely is it that we would see a t-statistic at least this extreme?
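
To make the link between the t-statistic, the t distribution, and the p-value concrete, the sketch below computes a pooled-variance (Student's) two-sample t-test by hand and compares it with `stats.ttest_ind`, whose default also assumes equal variances. The samples are synthetic and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two synthetic samples (illustrative only, not the Austin data)
a = rng.normal(loc=1000, scale=150, size=80)
b = rng.normal(loc=1100, scale=150, size=120)

# Pooled variance and degrees of freedom for Student's two-sample t-test
df = a.size + b.size - 2
pooled_var = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / df

# t-statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / a.size + 1 / b.size))

# Two-sided p-value: probability of a |t| at least this large under the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# The same test via SciPy (equal_var=True is the default)
t_scipy, p_scipy = stats.ttest_ind(a, b)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4g}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4g}")
```

If the two groups may have unequal variances, passing `equal_var=False` to `stats.ttest_ind` runs Welch's t-test instead.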

In [None]:
#sns.distplot(sub_masterData.rent, hist=False)
#sns.distplot(sub_masterData.home_price, hist=False)

# t-test comparing median rent with median home price (Austin, 2015)
stats.ttest_ind(sub_masterData.rent, sub_masterData.home_price)

In [None]:
# t-test comparing population % below poverty level with median rent
stats.ttest_ind(sub_masterData.Pov_lvl, sub_masterData.rent)

In [None]:
# t-test comparing population with median household income (MhhI) by ZIP code
stats.ttest_ind(merged_dataset.Population, merged_dataset.MhhI)

## Simpson's Paradox
Simpson's paradox occurs when trends that are present when data is separated into groups reverse when the data is aggregated. In this notebook, we take a look at four simple examples of Simpson's Paradox both quantitatively and visually.
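
A minimal synthetic illustration of this reversal, under assumed parameters: within each of two groups, `y` falls as `x` rises, but the group with larger `x` also has larger `y` overall, so the pooled correlation comes out positive. The group names, centers, and noise level are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

frames = []
# Each group has a negative within-group slope, but group B sits higher on both axes
for name, x_center, y_center in [("A", 2.0, 2.0), ("B", 8.0, 10.0)]:
    x = x_center + rng.uniform(-1.5, 1.5, size=100)
    y = y_center - 1.0 * (x - x_center) + rng.normal(0, 0.3, size=100)
    frames.append(pd.DataFrame({"group": name, "x": x, "y": y}))
data = pd.concat(frames, ignore_index=True)

# Correlation over the pooled data versus correlation inside each group
overall_r = data["x"].corr(data["y"])
within_r = {name: g["x"].corr(g["y"]) for name, g in data.groupby("group")}

print(f"overall r = {overall_r:.2f}")
for name, r in within_r.items():
    print(f"group {name}: r = {r:.2f}")
```

The pooled trend and the per-group trends point in opposite directions, which is the pattern to look for when splitting the crime data by a grouping variable.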