# <center> DATA SCIENCE WITH SOCIAL SCIENCE DATA: <br/> BUILDING PREDICTIVE MODELS USING PYTHON'S SCIKIT-LEARN <br/><br/> CSCAR WORKSHOP <br/><br/> 02/23/2018
## <center> Marcio Mourao and Jeff Lockhart

# <center> Setup for Anaconda / Jupyter Notebook

<ul>
    <li>Go to the page https://marcio-mourao.github.io/</li>
    <li>Download the materials (first two docs) under "Social Data Science" to your "username/Documents"</li><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3"</li>
    <li>Click "Anaconda Prompt" </li>
    <ul>
        <li>Enter "conda update pandas"</li>
        <li>Enter "conda update scikit-learn"</li>
    </ul><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3"</li>
    <li>Click "Jupyter Notebook" </li><br/>
    <li>Upload 'adult.csv' (may not be necessary)</li>
    <li>Click "Workshop6.ipynb" (this should open a new tab in the browser)</li>
</ul>

# <center> Introduction

<ul>
  <li>Please sign the sign-up sheet!</li>
  <li>Visit http://cscar.research.umich.edu/ to see what else CSCAR offers.</li>
  <li>Send any questions or feedback by email to <a href="mailto:mdam@umich.edu" target="_top">Marcio.</a></li>
</ul>

# <center> Summary of this workshop

<ul>
  <li>Summary of Python Data Types</li>
  <li>Pandas Dataframes</li>
  <li>Scikit-learn</li>
</ul>



# <center> References

<ul>
  <li>https://www.continuum.io/anaconda-overview</li>
  <li>http://www.numpy.org/</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/10min.html</li>
  <li>http://matplotlib.org/</li>
  <li>http://scikit-learn.org/stable/documentation.html</li>
  <li>https://pypi.python.org/pypi/patsy</li>
</ul>

## <center> Import relevant general modules

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
import sys
print(sys.version)
print(np.__version__)
print(pd.__version__)

## <center> Summary of Python Data Types

## Python Simple Data Types
##### Integers
##### Floats
##### Booleans

## Python Data Structures

### Lists

In [None]:
#An example of a list
example_list = [2,4,'fg',8,[3,4]]
print(type(example_list))
print(example_list)
print(example_list[0])
print(example_list[2:4])
print(example_list[-2])
print(example_list[4][0])
example_list[1]=100; print(example_list) # Modifies one element of the list

### Numpy arrays

In [None]:
#An example of a numpy array (the string '4' makes numpy store every element as a string)
example_array = np.array([2,4,'4',8,10])
print(example_array)
print(example_array[0])
print(example_array[2:4])
print(example_array[-2])
example_array[2]=20; print(example_array) # Modifies one element of the numpy array
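
Unlike a list, a numpy array holds a single element type. The sketch below uses toy values (not workshop data) to show how one string changes the type of the whole array, and how the values can be converted back to numbers.

```python
import numpy as np

mixed = np.array([2, 4, '4', 8, 10])    # one string forces every element to a string
numeric = np.array([2, 4, 4, 8, 10])    # all integers, so an integer dtype is kept

print(mixed.dtype.kind)     # 'U' stands for a unicode string dtype
print(numeric.dtype.kind)   # 'i' stands for a signed integer dtype

mixed[2] = 20               # the integer 20 is stored as the string '20'
print(repr(mixed[2]))

as_numbers = mixed.astype(int)   # convert the strings back to integers
print(as_numbers.sum())
```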

### Dictionary

In [None]:
#An example of a dictionary
example_dictionary = {'A':20,'B':40,'C':60}
print(example_dictionary)
print(example_dictionary['B'])
example_dictionary['C']=100
print(example_dictionary)
#print(example_dictionary[0]) # This should produce an error
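
Dictionaries are accessed by key rather than by position, which is why the commented line above fails. A minimal sketch of that behavior, with `missing_key` and `fallback` as illustrative names:

```python
example_dictionary = {'A': 20, 'B': 40, 'C': 60}

# Looking up a key that does not exist raises a KeyError
try:
    example_dictionary[0]
except KeyError as err:
    missing_key = err.args[0]
    print('No such key:', missing_key)

# .get() returns a default value instead of raising an error
fallback = example_dictionary.get('Z', -1)
print(fallback)

# Keys keep their insertion order (Python 3.7 and later)
keys = list(example_dictionary)
print(keys)
```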

### Pandas Series
#### A one dimensional labeled array

In [None]:
#An example of a pandas series
example_dictionary = {'A':20,'B':40,'C':60,'D':55}
example_series = pd.Series(example_dictionary)
print(example_series)
print(example_series.iloc[0]) # Access by position; plain [0] on a labeled series is deprecated in recent pandas
print(example_series['B'])
print(example_series['B':])
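
A series can be accessed both by label and by position. The sketch below contrasts `.loc` (labels) with `.iloc` (positions) on the same toy series; note that label slices include the end point while positional slices do not.

```python
import pandas as pd

example_series = pd.Series({'A': 20, 'B': 40, 'C': 60, 'D': 55})

first_by_position = example_series.iloc[0]   # first element, whatever its label
by_label = example_series.loc['B']           # element whose label is 'B'

label_slice = example_series.loc['B':'C']    # label slice: 'C' is included
position_slice = example_series.iloc[1:2]    # positional slice: position 2 is excluded

print(first_by_position, by_label)
print(label_slice)
print(position_slice)
```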

# <center> Pandas dataframes
### <center> A two-dimensional labeled data structure with columns of potentially different types

In [None]:
#Creation with a list
aux_list=[['ds',1.0],
          ['as',3],
          ['bq',5]]

example_DF = pd.DataFrame(aux_list,index=['Row1','Row2','Row3'],columns=['Col1','Col2'])
example_DF

In [None]:
#Creation with a numpy array
example_DF=pd.DataFrame(np.random.randint(0,10,(3,2)),index=['Row1','Row2','Row3'],columns=['Col1','Col2'])
example_DF

In [None]:
#Creation with a dictionary
example_DF=pd.DataFrame({'Col1':range(3),'Col2':pd.Series([4,5,6],index=[1,2,3])})
example_DF
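
When a dataframe is built from several series, pandas aligns them on their index labels and fills the gaps with NaN. A small sketch with toy values (the name `aligned_DF` is only for illustration):

```python
import pandas as pd

# Two series whose indexes only partly overlap
col1 = pd.Series([0, 1, 2])                    # index 0, 1, 2
col2 = pd.Series([4, 5, 6], index=[1, 2, 3])   # index 1, 2, 3

aligned_DF = pd.DataFrame({'Col1': col1, 'Col2': col2})
print(aligned_DF)

# The result uses the union of both indexes; unmatched cells are NaN
print(aligned_DF.isnull().sum())
```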

# <center> Some info about the dataset

This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).
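
For illustration only, the same kind of filter can be written in pandas with a boolean mask. The records below are made up; only the column names and thresholds come from the conditions above.

```python
import pandas as pd

# Hypothetical records, not taken from the census database
records = pd.DataFrame({'AAGE':    [16,  25,  40,    52,    30],
                        'AGI':     [500, 50,  30000, 45000, 20000],
                        'AFNLWGT': [100, 200, 0,     300,   150],
                        'HRSWK':   [20,  40,  40,    0,     38]})

# Each condition yields a boolean series; & combines them element by element
mask = ((records['AAGE'] > 16) & (records['AGI'] > 100) &
        (records['AFNLWGT'] > 1) & (records['HRSWK'] > 0))

kept = records[mask]
print(kept)
```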

<b>The prediction task is to determine whether a person makes over $50K a year!</b>

<b>Attributes:</b>

income: >50K, <=50K

age: continuous

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked

fnlwgt: continuous

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool

education-num: continuous

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black

sex: Female, Male

capital-gain: continuous

capital-loss: continuous

hours-per-week: continuous

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands

## Load and describe the data

In [None]:
#Displays signature of the function
?pd.read_csv

In [None]:
#Creates a dataframe named "adults" from reading the file "adult.csv"
adults = pd.read_csv('adult.csv',na_values=['?'])
adults.head()
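
The `na_values` argument tells `read_csv` which strings to treat as missing. A minimal sketch on a two-row toy file held in memory (the contents are hypothetical, not taken from 'adult.csv'):

```python
import io
import pandas as pd

csv_text = "age,workclass\n39,Private\n50,?\n"

# Without na_values, '?' is read as an ordinary string
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, '?' becomes NaN
clean = pd.read_csv(io.StringIO(csv_text), na_values=['?'])

print(raw['workclass'].isnull().sum())
print(clean['workclass'].isnull().sum())
```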

In [None]:
#Displays the type of the object we are working with
type(adults)

In [None]:
#Obtains the number of rows and columns of the dataframe
adults.shape

In [None]:
#Obtains the data type of each column of the dataframe
adults.dtypes

In [None]:
#Displays first lines of the dataframe
adults.head(5)

In [None]:
#Displays last lines of the dataframe
adults.tail(3)

In [None]:
#Returns the index of the dataframe
adults.index

In [None]:
#Returns the columns of the dataframe
adults.columns

In [None]:
#Provides a statistical summary of the adults data
adults.describe()

In [None]:
#Provides a statistical summary of the adults data, including the non-numeric columns
adults.describe(include='all')

In [None]:
#Summarizes just the column 'age'
adults['age'].describe()

In [None]:
#Displays whether columns contain any null values
adults.isnull().any(axis=0)

In [None]:
#Count the number of missing values in each column of the dataframe
adults.apply(lambda x: sum(x.isnull()),axis=0)

In [None]:
#Count the number of missing values in each column of the dataframe and sums them up
adults.apply(lambda x: sum(x.isnull()),axis=0).sum()

In [None]:
#Count number of lines with NaNs
adults.apply(lambda x: x.isnull().any(),axis=1).sum()
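
The same counts can also be obtained without `apply`, which is usually faster on large dataframes. A sketch on a small toy dataframe (`toy` is a made-up example, not the adults data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, np.nan, 3],
                    'b': ['x', None, None]})

per_column = toy.isnull().sum()                 # missing values in each column
total = toy.isnull().sum().sum()                # missing values in the whole dataframe
rows_with_nan = toy.isnull().any(axis=1).sum()  # rows with at least one missing value

print(per_column)
print(total, rows_with_nan)
```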

## Machine Learning

In [None]:
#Import modules
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn import metrics

In [None]:
#A reminder of the data we have
adults.describe(include='all')

In [None]:
#Renames columns on the dataframe for compatibility with the patsy library
adults.rename(columns={'education.num':'educationnum', 
                       'hours.per.week':'hoursperweek',
                       'marital.status':'maritalstatus',
                       'capital.gain':'capitalgain',
                       'capital.loss':'capitalloss',
                       'native.country':'nativecountry'},
              inplace=True)

adults.columns

In [None]:
#Converts column income into integer codes (0 for the first label that appears in the data, 1 for the other)
adults.income = pd.factorize(adults.income)[0]
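
`pd.factorize` numbers the labels in order of first appearance, so which class becomes 0 depends on the data. The sketch below uses a toy series (the labels are assumed to look like those in the attribute list) and compares it with an explicit encoding.

```python
import pandas as pd

income = pd.Series(['>50K', '<=50K', '<=50K'])

# factorize: the first label seen gets code 0
codes, uniques = pd.factorize(income)
print(list(codes), list(uniques))

# Explicit alternative: 1 always means '>50K', whatever the row order
explicit = (income == '>50K').astype(int)
print(list(explicit))
```

Checking `uniques` after factorizing is a simple way to confirm which class each code stands for.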

In [None]:
#Import patsy
from patsy import dmatrices

#Check dmatrices signature
?dmatrices

In [None]:
#Set formula to use in dmatrices
formula = 'income ~ -1 + age + workclass + educationnum + maritalstatus + ' + \
                   'occupation + relationship + race + sex + ' + \
                   'capitalgain + capitalloss + hoursperweek + nativecountry'

#Obtain the design matrix for use in the logistic regression and random forests modeling approaches
#(by default, patsy drops the rows that contain missing values)
y, X = dmatrices(formula, adults, return_type = 'dataframe')
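
To get a feel for what a design matrix looks like, the sketch below builds one by hand on toy data. It uses `pd.get_dummies` as a stand-in, not patsy itself, so column names and the exact coding differ from what `dmatrices` returns.

```python
import pandas as pd

# Hypothetical records with one missing value
toy = pd.DataFrame({'age': [39, 50, 28, 41],
                    'sex': ['Male', 'Female', None, 'Female']})

# Drop incomplete rows first, similar to patsy's default behavior
complete = toy.dropna()

# Each category becomes its own 0/1 indicator column
design = pd.get_dummies(complete, columns=['sex'], dtype=float)
print(design)
```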

In [None]:
#Check first elements
y.head()

In [None]:
#Check y types
y.dtypes

In [None]:
#Describe the dataframe y
y.describe()

In [None]:
#Look at the first observations of the design matrix
X.head()

In [None]:
#Check X types
X.dtypes

In [None]:
#Describe the design matrix
X.describe()

In [None]:
#The dependent variable needs to be a unidimensional vector rather than a dataframe
y = y.income.values

In [None]:
#Check shapes of both y (dependent variable) and X (predictors)
print(y.shape)
print(X.shape)

In [None]:
?train_test_split

In [None]:
#Obtain the data for the fitting (by default, 25% of the records are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13, stratify = y)

#X can have fewer rows than adults, because patsy drops rows with missing values
print('Total number of records in adults: ', adults.shape[0])
print('Number of records in X: ', len(X))
print('Type of X_train: ', type(X_train))
print('Number of records in X_train: ', len(X_train))
print('Fraction on X_train: ', len(X_train)/len(X))
print('Number of records in y_train: ', len(y_train))
print('Type of y_train: ', type(y_train), '\n')

print('Type of X_test: ', type(X_test))
print('Number of records in X_test: ', len(X_test))
print('Fraction on X_test: ', len(X_test)/len(X))
print('Number of records in y_test: ', len(y_test))
print('Type of y_test: ', type(y_test))
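
The `stratify` argument keeps the class proportions the same in the training and test sets. A minimal sketch on made-up data with 80 records of class 0 and 20 of class 1:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)     # 100 records, 2 predictors
y_toy = np.array([0] * 80 + [1] * 20)      # 20% of the records are class 1

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, random_state=13, stratify=y_toy)

# Default split: 75% train, 25% test
print(len(Xtr), len(Xte))

# Both parts keep the 20% share of class 1
print(ytr.mean(), yte.mean())
```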

### Logistic Regression

In [None]:
#Creates a Logistic Regression classification model
LR_model = LogisticRegression()

#Fit to the data
LR_model.fit(X_train, y_train)

In [None]:
#Obtain probability predictions (one column per class)
y_pred_LR_prob = LR_model.predict_proba(X_test)
print('Predicted probabilities: \n', y_pred_LR_prob)

#Obtain class predictions
y_pred_LR_class = LR_model.predict(X_test)
print('Predicted classes: \n', y_pred_LR_class)
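
The two kinds of prediction are linked: for a binary logistic regression, the predicted class is the one with the larger predicted probability. A sketch on toy data (all values are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_toy = np.array([[0.], [1.], [2.], [3.], [4.], [5.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)

proba = model.predict_proba(X_toy)    # one row per record, one column per class
classes = model.predict(X_toy)        # predicted class labels

# Picking the most probable class reproduces predict()
from_proba = model.classes_[proba.argmax(axis=1)]

print(proba.round(2))
print(classes, from_proba)
```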

In [None]:
#Obtains accuracy score
print('LR Score: ', metrics.accuracy_score(y_test, y_pred_LR_class))

In [None]:
#Obtains confusion matrix
LR_cm=metrics.confusion_matrix(y_test,y_pred_LR_class)
LR_cm
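
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. The sketch below uses made-up labels to show how the four cells relate to the accuracy score.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_hat  = np.array([0, 0, 0, 1, 1, 1, 0, 0])

cm = metrics.confusion_matrix(y_true, y_hat)
print(cm)

# For two classes: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of records on the diagonal
accuracy = (tn + tp) / cm.sum()
print(accuracy, metrics.accuracy_score(y_true, y_hat))
```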

In [None]:
#10-fold cross-validation on the full data; prints the mean accuracy across the folds
kf = KFold(n_splits=10, shuffle=True)
print('Cross validation score: ', cross_val_score(LR_model, X, y, cv=kf).mean())
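
`cross_val_score` returns one score per fold; the cell above only reports their mean. A sketch on simulated data, with `random_state` set (an addition, not used in the cell above) so that the folds are the same on every run:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data: 100 records, 2 predictors, class decided by their sum
rng = np.random.RandomState(0)
X_sim = rng.normal(size=(100, 2))
y_sim = (X_sim[:, 0] + X_sim[:, 1] > 0).astype(int)

kf_sim = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold is used once as the test set
fold_sizes = [len(test_idx) for _, test_idx in kf_sim.split(X_sim)]
print(fold_sizes)

scores = cross_val_score(LogisticRegression(), X_sim, y_sim, cv=kf_sim)
print(scores)          # one accuracy per fold
print(scores.mean())   # the summary reported in the workshop cells
```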

### Random Forests

In [None]:
#Creates a RF classification model
RF_model = RandomForestClassifier(n_estimators=10, criterion='gini')

#Fit to the data
RF_model.fit(X_train, y_train)

In [None]:
#Obtain probability predictions (one column per class)
y_pred_RF_prob = RF_model.predict_proba(X_test)
print('Predicted probabilities: \n', y_pred_RF_prob)

#Obtain class predictions
y_pred_RF_class = RF_model.predict(X_test)
print('Predicted classes: \n', y_pred_RF_class)

In [None]:
#Obtains accuracy score
print('RF Score: ', metrics.accuracy_score(y_test, y_pred_RF_class))

In [None]:
#Obtains confusion matrix
RF_cm=metrics.confusion_matrix(y_test,y_pred_RF_class)
RF_cm

In [None]:
#10-fold cross-validation on the full data; prints the mean accuracy across the folds
kf = KFold(n_splits=10, shuffle=True)
print('Cross validation score: ', cross_val_score(RF_model, X, y, cv=kf).mean())