# Data Exploration

The dataset we will be using in this activity is from the ProPublica analysis of the COMPAS system.

Data & original analysis gathered by ProPublica.
Original Data methodology article:
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

Original Article:
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Original data from ProPublica:
https://github.com/propublica/compas-analysis

In [None]:
# Render our plots inline
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)

In [None]:
# read the datafile
crime_df = pd.read_csv('pb_compass.csv')
crime_df

In [None]:
crime_df.info()

## Data cleanup

From this information we can already see that some features won't be relevant to our exploratory analysis because they have too many missing values (such as violent_recid and r_jail_in). There are also so many features that it is better to concentrate on the ones that can give us real insight. Let's remove id and every feature with fewer than 30% non-missing values (that is, more than 70% missing).

In [None]:
crime_df2 = crime_df[[column for column in crime_df if crime_df[column].count() / len(crime_df) >= 0.3]]
# drop the id column
del crime_df2['id'] # another way to delete a column -> crime_df2 = crime_df2.drop(columns='id')

# find which columns got dropped by comparing the dataframe to the original
print("List of dropped columns:", end=" ")

for c in crime_df.columns:
    if c not in crime_df2.columns:
        print(c, end=", ")
print('\n')

# use the cleaned up dataframe moving forward
crime_df = crime_df2
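
The threshold filter in the cell above can be tried out on a small made-up table first. This is only a sketch: `toy_df` and its column names are hypothetical and are not part of the ProPublica data. It keeps columns with at least 30% non-missing values, the same condition the cell above uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy_df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, 2.0, np.nan, 4.0],
})

# Fraction of non-missing values per column (count() ignores NaN)
non_missing = toy_df.count() / len(toy_df)

# Keep the columns with at least 30% non-missing values
kept = toy_df.loc[:, non_missing >= 0.3]
dropped = sorted(set(toy_df.columns) - set(kept.columns))

print(non_missing)
print("Dropped:", dropped)
```

Printing `non_missing` before filtering makes it easy to see how close each column is to the threshold.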

In [None]:
# preview the data
crime_df.describe()

From the data preview we can also see that the first two entries are exactly the same, which suggests there may be more duplicates. If you explore the data further you will notice that there are many duplicate rows, so let's remove them.

In [None]:
# drop duplicate rows, keeping the first occurrence
# drop_duplicates returns a new DataFrame, so assign the result back
crime_df = crime_df.drop_duplicates(keep = 'first')
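
Before dropping anything, it can help to count how many rows are duplicates. The following is a minimal sketch on a hypothetical `toy_df` (not the COMPAS data), using `DataFrame.duplicated` to flag repeated rows.

```python
import pandas as pd

# Hypothetical toy data with two repeated rows
toy_df = pd.DataFrame({
    "name": ["a", "a", "b", "c", "c"],
    "score": [1, 1, 2, 3, 3],
})

# duplicated() marks every repeat of an earlier row as True
n_dupes = int(toy_df.duplicated(keep="first").sum())

# drop_duplicates() returns a new DataFrame; toy_df itself is unchanged
deduped = toy_df.drop_duplicates(keep="first")

print("duplicate rows:", n_dupes)
print("rows before:", len(toy_df), "rows after:", len(deduped))
```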

### To Do #1: Explore the dataset more and remove any columns that are duplicated.

Refer to the pandas DataFrame documentation for a list of functions that will help with this task.

https://pandas.pydata.org/docs/reference/frame.html

Hint: Look at duplicated and transpose.

## Histograms and Overlays

Now let us create some overlays to help us better understand our data.

What if we wanted to see how race and the number of prior convictions relate to the risk score the COMPAS system assigned to male defendants?

In [None]:
crime_df[crime_df.sex == "Male"].groupby(['score_text','race'])['priors_count'].mean().unstack().plot(kind='bar')


### To Do #2: Was there anything that surprised you from this graph?

Now let us explore what the age and sex attributes look like in our dataset.

In [None]:
age_m = crime_df[crime_df.sex == "Male"]['age']
age_w = crime_df[crime_df.sex == "Female"]['age']

plt.hist([age_m, age_w], stacked = False) # the attribute stacked = False will display the two variables (age_m, age_w) side by side
plt.legend(['Sex = Male', 'Sex = Female']) # specify the values of the legend
plt.title('Histogram of Age with Sex Overlay') # specify the values of the title
plt.xlabel('Age'); plt.ylabel('Frequency'); plt.show()  # specify the values of the x‐axis label, and y‐axis label and display the figure.


Next we want to create a normalized histogram of age with sex overlay.
First, draw the side-by-side histogram again and save the information it returns. We keep stacked = False so that the saved bar heights are the separate counts for each sex.

In [None]:
(n, bins, patches) = plt.hist([age_m, age_w], bins = 10, stacked = False)
# n holds the heights of the histogram bars (one array per variable) and bins holds the boundaries of the bins
# patches holds the drawn rectangles, one container of rectangles per variable
plt.setp(patches[0], 'facecolor', 'green') # set the color of the first variable's rectangles (age_m)

In [None]:
n_table = np.column_stack((n[0], n[1])) # combine the bar heights of the two variables into one array with column_stack()
n_norm = n_table / n_table.sum(axis=1)[:, None] # divide each row by its row sum to get the proportion of each bar accounted for by each variable
ourbins = np.column_stack((bins[:-1], bins[1:])) # create an array whose rows are the exact cuts of each bin
# Each row in ourbins gives the lower and upper bounds of one bin
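
The row-wise normalization is easier to follow on a tiny example. This sketch uses made-up bar heights (`n_male` and `n_female` are hypothetical values, not counts from the dataset) and shows that each row of the result sums to 1.

```python
import numpy as np

# Hypothetical bar heights for two groups across three bins
n_male = np.array([30.0, 20.0, 5.0])
n_female = np.array([10.0, 20.0, 15.0])

# One row per bin, one column per group: shape (3, 2)
n_table = np.column_stack((n_male, n_female))

# Row totals reshaped to (3, 1) so the division broadcasts across columns
row_totals = n_table.sum(axis=1)[:, None]
n_norm = n_table / row_totals

print(n_norm)
print(n_norm.sum(axis=1))  # every row sums to 1
```

The `[:, None]` indexing turns the one-dimensional totals into a column, which is what lets NumPy divide each row by its own total.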

In [None]:
# x gives the left edge of each bin; align = 'edge' makes each bar start at that edge
# (the default, align = 'center', would center the bar on the edge and shift it half a bin to the left)
# height uses the normalized counts computed previously for each of the two sections of each bar
# width is the bin width, reused from the original histogram
p1 = plt.bar(x = ourbins[:,0], height = n_norm[:,0],
             width = ourbins[:, 1] - ourbins[:, 0], align = 'edge')

# setting bottom to n_norm[:,0] starts the second bar section on top of the first
p2 = plt.bar(x = ourbins[:,0], height = n_norm[:,1],
             width = ourbins[:, 1] - ourbins[:, 0], align = 'edge',
             bottom = n_norm[:,0])
# set the legend, title, xlabel, ylabel and show the figure
plt.legend(['Sex = Male', 'Sex = Female'])
plt.title('Normalized Histogram of Age with Sex Overlay')
plt.xlabel('Age'); plt.ylabel('Proportion'); plt.show()

### To Do #3: Create a normalized histogram with the sex and number of prior convictions (priors_count).

## Correlations 

Correlation measures how strongly two variables in a dataset are related. Correlations have many real-world applications: we can check whether the use of certain search terms is correlated with views on YouTube, or whether ads are correlated with sales. When building machine learning models, correlations are an important factor in choosing features. They show us which features are linearly related, and when two features are strongly correlated with each other we can remove one of them to avoid duplicating information.
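
As a minimal sketch of what a correlation value means, the example below computes the Pearson correlation coefficient by hand and compares it with `DataFrame.corr()`, which uses the Pearson method by default. The data and the `pearson` helper are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one perfectly increasing and one perfectly decreasing relation
toy_df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y_linear": [2.0, 4.0, 6.0, 8.0, 10.0],   # y = 2x
    "y_inverse": [5.0, 4.0, 3.0, 2.0, 1.0],   # y = 6 - x
})

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson(toy_df["x"], toy_df["y_linear"])
corr_matrix = toy_df.corr()

print(r_manual)
print(corr_matrix)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.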

Before we try to calculate correlations between attributes in our dataset, we should get a sense of the data types of the attributes we have.

Note: Transposing a DataFrame with mixed dtypes results in a homogeneous DataFrame with the object dtype. If transpose was used in an earlier step, we can use the infer_objects function, with which pandas attempts to infer better dtypes for object columns.
See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.infer_objects.html for more details.
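
The effect described in the note can be reproduced on a small hypothetical frame (`toy_df` and its columns are made up for illustration): transposing twice returns the original layout but leaves every column with the object dtype, and `infer_objects` recovers the numeric dtypes.

```python
import pandas as pd

# Hypothetical toy data with integer, string and float columns
toy_df = pd.DataFrame({
    "n_items": [1, 2],
    "label": ["x", "y"],
    "ratio": [0.5, 1.5],
})

# Transposing a mixed-dtype frame forces a common dtype (object);
# transposing back does not undo that
round_trip = toy_df.T.T
print(round_trip.dtypes)

# infer_objects() attempts to find better dtypes for object columns
restored = round_trip.infer_objects()
print(restored.dtypes)
```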

In [None]:
# crime_df = crime_df.infer_objects() # uncomment if the next statement results in [dtype('O')]
list(set(crime_df.dtypes.tolist())) # should see [dtype('O'), dtype('int64'), dtype('float64')]

We can see here that we have integers, floats and objects.
With this information we can perform further analysis by selecting only the columns that hold numeric data.

In [None]:
df_num = crime_df.select_dtypes(include = ['float64', 'int64'])
df_num.head()

Plotting a histogram of these columns can give us a sense of their distributions 

In [None]:
df_num.hist(figsize=(16, 20), xlabelsize=8, ylabelsize=8); 


In [None]:
df_num_corr = df_num.corr()['priors_count']
golden_features_list = df_num_corr.sort_values(ascending=False)
print("We calculated {} correlation values with the count of prior convictions:\n{}".format(len(golden_features_list), golden_features_list))

### To Do #4: Calculate the correlations between the other numerical attributes and determine if any attributes can be considered redundant.