## Import All Necessary Modules And Setup Project
If any of the imports below fail, run

  $ python -m pip install -r requirements.txt

from inside this project directory to install all the modules the project needs.

Using a virtual environment is recommended for this project so that there are no conflicting package versions on your system.

Activate the virtual environment (if you use one), run the pip install command, and then launch Jupyter Lab from the project directory.

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np

In [None]:
# Load the crime/housing dataset and the Austin ZIP code table.
masterData = pd.read_csv("data/crime-housing-austin-2015.csv")
ZipcodeData = pd.read_csv("data/AustinZipCodes.csv")
# Show every column when displaying a DataFrame
# (display.max_columns takes an int, or None for no limit).
pd.set_option('display.max_columns', None)
# Explore the dataset.
masterData.head()

In [None]:
masterData.Populationbelowpovertylevel.unique()

In [None]:
ZipcodeData.head()

In [None]:
# List all the columns in the dataset.
print(list(masterData))
# List the unique values in the "Council_District" column.
print(masterData.Council_District.unique())
# Drop rows that contain NaN values.
masterData = masterData.dropna()
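
Calling `dropna()` with no arguments removes every row that has a NaN in any column, which can discard more rows than intended. The sketch below uses a made-up three-row frame (the `Notes` column is hypothetical and not part of the Austin dataset) to contrast that with restricting the check to the columns that actually get converted later.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; values and the "Notes" column are invented.
toy = pd.DataFrame({
    "Council_District": [1.0, np.nan, 3.0],
    "Zip_Code_Crime": [78701.0, 78702.0, 78703.0],
    "Notes": [np.nan, "a", np.nan],
})

# Drops a row if ANY column is NaN.
drop_any = toy.dropna()

# Drops a row only if one of the listed columns is NaN.
drop_subset = toy.dropna(subset=["Council_District", "Zip_Code_Crime"])

print(len(toy), len(drop_any), len(drop_subset))
```

Whether the subset form is appropriate here depends on which columns the later analysis needs; comparing the row counts before and after is a quick check.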

In [None]:
# Column types of the dataset.
print(masterData.dtypes)
# "Council_District" and "Zip_Code_Crime" are floats; convert them to int first.
masterData = masterData.astype({"Council_District": 'int', "Zip_Code_Crime": 'int'})
# Some columns carry a dollar or percent sign; strip it before converting the type.
# regex=False treats '$' as a literal character, not the regex end-of-string anchor.
masterData['Medianhouseholdincome'] = masterData['Medianhouseholdincome'].str.replace('$', '', regex=False).astype('float')
masterData['Unemployment'] = masterData['Unemployment'].str.replace('%', '', regex=False).astype('int')
masterData['Populationbelowpovertylevel'] = masterData['Populationbelowpovertylevel'].str.replace('%', '', regex=False).astype('int')
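
The symbol-stripping step in the cell above can be sketched on a tiny example. The values are invented for illustration and are not taken from the dataset. In a regular expression `$` means "end of string", so on pandas versions where `str.replace` treats the pattern as a regex by default, the dollar sign is not removed unless `regex=False` is passed.

```python
import pandas as pd

# Made-up values, only for illustration; not taken from the Austin dataset.
income = pd.Series(["$45000", "$72500"])
poverty = pd.Series(["12%", "7%"])

# regex=False treats "$" and "%" as literal characters.
income_clean = income.str.replace("$", "", regex=False).astype(float)
poverty_clean = poverty.str.replace("%", "", regex=False).astype(int)

print(income_clean.tolist())
print(poverty_clean.tolist())
```

If the real income column also contains thousands separators, those would need to be stripped the same way before the cast; that is an assumption to check against the data.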

In [None]:
sub_masterData=masterData[['Key','Zip_Code_Crime','District','Clearance_Status','Highest_NIBRS_UCR_Offense_Description',
                              'Medianhouseholdincome', 'Unemployment','Populationbelowpovertylevel']]
sub_masterData=sub_masterData.rename(columns={'Zip_Code_Crime':'Zip Code','Highest_NIBRS_UCR_Offense_Description':'crime_type',
                                             'Clearance_Status':'CS','Medianhouseholdincome':'MhhI($)','Unemployment':'UE(%)',
                                             'Populationbelowpovertylevel':'Pov_lvl'})
display(sub_masterData)

In [None]:
# Count the number of crimes of each type.
crime_types = sub_masterData.groupby('crime_type').agg(
    {'crime_type': 'count'})
crime_types=crime_types.rename(columns={'crime_type':'count'})
crime_types=crime_types.reset_index()

#plot figure
plt.figure(figsize=(8,8))
plt.pie(data=crime_types, x='count',shadow=True)
plt.title('Austin Crime Results in 2015')
plt.legend(labels=crime_types['crime_type'])
plt.savefig('pi_plot_categorical_crime.jpeg')

## Main Instructions
Your analysis must include a **number of statistical methods**. You must include `Pearson correlations` (be sure to report the **`p-value`**), _scatterplots_, _averages_, _standard deviations_, and a **t-test** (or **Mann-Whitney-U test**). The specific number of analyses is left up to you, but the contribution must be significant and your project report must give detailed justification and results for each analysis.
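
As a rough sketch of the methods listed above, the following computes averages, standard deviations, a t-test, and a Mann-Whitney U test with `scipy.stats`. The two groups are synthetic random samples, not project data, and the choice of Welch's t-test (`equal_var=False`) is an assumption rather than a requirement of the assignment.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only (not the Austin data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=10, size=200)
group_b = rng.normal(loc=55, scale=10, size=200)

# Averages and sample standard deviations (ddof=1).
mean_a, mean_b = group_a.mean(), group_b.mean()
std_a, std_b = group_a.std(ddof=1), group_b.std(ddof=1)

# Welch's t-test: compares means without assuming equal variances.
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based alternative when normality is doubtful.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"mean A={mean_a:.2f}, mean B={mean_b:.2f}")
print(f"std A={std_a:.2f}, std B={std_b:.2f}")
print(f"t-test: t={t_stat:.3f}, p={t_p:.4g}")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.4g}")
```

The t-test is the usual choice when each group is roughly normally distributed; the Mann-Whitney U test makes no such assumption and is the safer option for skewed counts.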

In [None]:
#aggregate the total number of crimes with respect to population below poverty level
crime_POV_lvl = sub_masterData.groupby('Pov_lvl').agg(
    {'crime_type': 'count'})

In [None]:
crime_POV_lvl =crime_POV_lvl.rename(columns={'crime_type':'crime_count_by_PoVlvl'})
crime_POV_lvl =crime_POV_lvl.reset_index()

## Show a Scatterplot with Regression Line and Pearson R
Correlation between the percentage of the population below the poverty line and the number of crimes committed.

In [None]:
plt.figure(figsize=(8,8))
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.regplot(x='Pov_lvl', y='crime_count_by_PoVlvl', data=crime_POV_lvl)

plt.title('Population % below poverty line vs. number of crimes committed')
plt.savefig('correlation for population % below poverty line to crime commit.jpeg')
# Pearson r coefficient and its p-value.
display(stats.pearsonr(crime_POV_lvl.Pov_lvl, crime_POV_lvl.crime_count_by_PoVlvl))
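
The result of `stats.pearsonr` can be unpacked into the coefficient and the p-value so that both can be reported separately. The sketch below uses a small invented data set, and the 0.05 significance threshold is an assumed convention rather than something fixed by the project.

```python
import numpy as np
from scipy import stats

# Invented, nearly linear data for illustration only.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p_value = stats.pearsonr(x, y)

alpha = 0.05  # assumed significance level
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
print("significant" if p_value < alpha else "not significant")
```

A small p-value only indicates that a linear association is unlikely to be due to chance; it does not show that poverty causes crime.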