# Decision Tree for Identifying Alcohol Abuse #

**Submitted as part of Machine Learning for Data Analysis by Wesleyan University**

**Author: Oliver Morris**

**Date: 12 March 2016**

The analysis below shows the code for, and describes the context of, a decision tree built in Python that identifies whether a person has declared the onset of alcohol abuse.

This uses the NESARC data, which is a standard data set for the course. 

This blog entry was published using Jupyter, which is designed for sharing reproducible data science in Python. The code was prepared in Visual Studio 2015 Community Edition with Anaconda and an IPython interactive window installed.


## Dependencies ##

The packages used are the same as for the example code, with one addition: pandasql, which allows pandas data frames to be queried using SQLite syntax. It is used to clean the NESARC data into a format usable by the decision tree code.

In [1]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
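#train_test_split moved from sklearn.cross_validation to sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import train_test_split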
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

#we also need pandasql so we can efficiently clean the data
from pandasql import sqldf

## The NESARC Data ##

The NESARC data set is large, over 

In [None]:
#set working directory to course folder
os.chdir("C:/Users/Oliver/Documents/0_OM/Training/WesleyanPython")

#Load the Alcohol Abuse Dataset, NESARC, remove 'na' records.
NESARC_data = pd.read_csv("nesarc_pds.csv")
data_clean = NESARC_data.dropna()

#This dataset does not suffer from 'na's, but it does have lots of blanks.
#Those will be removed in a later code chunk
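
The distinction between 'na' values and blanks can be illustrated on a toy data frame. This is a minimal sketch, assuming blanks are stored as whitespace strings; the three rows are invented for illustration and are not NESARC records.

```python
import pandas as pd

# Toy frame (hypothetical values): blanks are whitespace strings, not NaN
df = pd.DataFrame({
    "S1Q4A":  ["21", " ", "30"],   # age at first marriage
    "S2BQ3A": [" ", "17", " "],    # age at onset of alcohol abuse
})

# dropna() only removes NaN/None, so every row with a blank survives
rows_after_dropna = len(df.dropna())

# Count the blank entries per column by stripping whitespace first
blank_counts = df.apply(lambda col: col.astype(str).str.strip().eq("").sum())

print(rows_after_dropna)
print(blank_counts)
```

A count like this makes it easier to decide, column by column, what a blank should be recoded to.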

## Get Predictors & Target from Dataset ##

There are hundreds of potential predictors in the NESARC data. The following were selected for the decision tree simply because they appeared reasonable. Note that the target is extracted at the same time, which assists with data cleaning in a later step.

In [None]:
#Get predictors from dataset
target_and_predictors_fullset = data_clean[[
# TARGET
 'S2BQ3A', #AGE AT ONSET OF ALCOHOL ABUSE

# PREDICTORS
 'S1Q1E',  #ORIGIN OR DESCENT
 'S1Q2C1', #RAISED BY ADOPTIVE PARENTS BEFORE AGE 18
 'S1Q2B',  #BIOLOGICAL FATHER EVER LIVE IN HOUSEHOLD BEFORE RESPONDENT WAS 18
 'S1Q2C4', #RAISED IN AN INSTITUTION BEFORE AGE 18
 'S1Q2D',  #DID BIOLOGICAL OR ADOPTIVE PARENTS GET DIVORCED OR PERMANENTLY STOP LIVING TOGETHER BEFORE RESPONDENT WAS 18
 'S1Q2K',  #DID BIOLOGICAL OR ADOPTIVE PARENT DIE BEFORE RESPONDENT WAS 18
 'S1Q2L',  #AGE AT DEATH OF BIOLOGICAL OR ADOPTIVE PARENT
 'S1Q4A',  #AGE AT FIRST MARRIAGE
 'S1Q5A',  #CHILDREN EVER HAD, INCLUDING ADOPTIVE, STEP AND FOSTER CHILDREN
 'S1Q6A',  #HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
 'MARITAL',#CURRENT MARITAL STATUS
 'AGE',    #AGE
 'SEX'     #SEX
 ]]

## Data Cleaning ##

The NESARC data has lots of blanks, many of which are valid and do not mean 'unknown'. The data can be made useful by replacing each blank with an integer, chosen after reviewing the NESARC codebook.

Also, the number of predictors was reduced to just three in order to simplify the decision tree for presentation. With 13 predictors the tree has hundreds of branches. The predictors used are as follows (note the zero-based index):

    [0]. DID BIOLOGICAL OR ADOPTIVE PARENTS GET DIVORCED OR PERMANENTLY STOP LIVING TOGETHER BEFORE RESPONDENT WAS 18
    [1]. DID BIOLOGICAL OR ADOPTIVE PARENT DIE BEFORE RESPONDENT WAS 18
    [2]. AGE AT FIRST MARRIAGE

The target needs to be simplified into a yes/no outcome, yet the selected field is "Age at onset of alcohol abuse". This can be any integer from 1-99, or blank if there is no abuse. So, the data is simplified to 1 if there is any age at which alcohol abuse developed, else 0.

Finally, the records where these predictors are 'unknown', usually represented by a '9', are removed, as 'unknowns' simply confuse the decision tree.
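
The cleaning rules above can be tried out on a handful of rows before running them against the full data. The following is a minimal sketch using Python's built-in `sqlite3` module rather than pandasql, so it runs without the NESARC file. The table name `survey` and the four rows are invented for illustration, and the columns are assumed to be stored as text.

```python
import sqlite3

# Hypothetical rows: (S1Q2D, S1Q2K, S1Q4A, S2BQ3A), all stored as text
rows = [
    ("1", "2", "21", "17"),  # parents divorced, married at 21, abuse onset at 17
    (" ", " ", " ", " "),    # every field blank
    ("9", "2", "30", " "),   # divorce status unknown, so the row is dropped
    ("2", "1", " ", "25"),   # parent died, never married, abuse onset at 25
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE survey (S1Q2D TEXT, S1Q2K TEXT, S1Q4A TEXT, S2BQ3A TEXT)"
)
conn.executemany("INSERT INTO survey VALUES (?, ?, ?, ?)", rows)

query = """
SELECT
    CASE WHEN LTRIM(S1Q2D) = '' THEN 2  ELSE CAST(LTRIM(S1Q2D) AS INTEGER) END AS S1Q2D,
    CASE WHEN LTRIM(S1Q2K) = '' THEN 2  ELSE CAST(LTRIM(S1Q2K) AS INTEGER) END AS S1Q2K,
    CASE WHEN LTRIM(S1Q4A) = '' THEN 99 ELSE CAST(LTRIM(S1Q4A) AS INTEGER) END AS S1Q4A,
    CASE WHEN LTRIM(S2BQ3A) = '' THEN 0 ELSE 1 END AS S2BQ3A
FROM survey
WHERE S1Q2D <> 9 AND S1Q2K <> 9
"""
cleaned = conn.execute(query).fetchall()
conn.close()

for row in cleaned:
    print(row)
```

Three of the four rows survive: blanks become 2 (no), 99 (not married) or 0 (no abuse), and the row with an unknown '9' is filtered out.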

In [None]:
#Clean predictors, which are mostly blanks.
query = """
SELECT
     CASE WHEN LTRIM(S1Q2D) = '' THEN  2 ELSE CAST(LTRIM(S1Q2D) AS INTEGER) END AS S1Q2D 
     --i.e. parental divorce before 18, 1=Yes, 2=No, 9 = Unknown
     --Default for blank is No (2), because blanks are where child did not live with parents, hence no divorce
     
    ,CASE WHEN LTRIM(S1Q2K) = '' THEN  2 ELSE CAST(LTRIM(S1Q2K) AS INTEGER) END AS S1Q2K
     --i.e. parent death before 18, 1=Yes, 2=No, 9 = Unknown
     --Default for blank is No (2), because blanks are where child did not live with parents, hence no death
    
    ,CASE WHEN LTRIM(S1Q4A) = '' THEN 99 ELSE CAST(LTRIM(S1Q4A) AS INTEGER) END AS S1Q4A
     --i.e. age at first marriage, integer value unless not married, which is blank.
     --Default for blank is 99, because this indicates not married whilst young.
FROM
    target_and_predictors_fullset
WHERE
    --Remove the 'unknown' values, as these confuse the decision tree
    S1Q2D <> 9
    AND
    S1Q2K <> 9
"""
predictors = sqldf(query, locals())

#Clean Targets
query = """
SELECT
    CASE WHEN LTRIM(S2BQ3A) = '' THEN 0 ELSE 1 END AS S2BQ3A
    --S2BQ3A = AGE AT ONSET OF ALCOHOL ABUSE
    --i.e. integer value up to 99, but blank = no alcohol abuse
    --We convert this to a classification outcome by saying any age = 1, but blank = 0
FROM
    target_and_predictors_fullset
WHERE
    --Remove the 'unknown' values, as these confuse the decision tree
    --This is the reason we extracted targets and predictors in the same dataframe, i.e. to apply the same 'where' clause. 
    S1Q2D <> 9
    AND
    S1Q2K <> 9
"""
targets = sqldf(query, locals())

#now that we have predictors and target we can save memory by clearing the unused data frames
del NESARC_data
del target_and_predictors_fullset
del data_clean


## Model Training ##

The only major change compared with the example code is the test_size, which has been changed from 0.4 to 0.997. This is an unreasonably large proportion of the data to apply to testing. It was chosen because larger training sets led to enormous and unwieldy decision trees that added no accuracy for their complexity.

Using three predictors gave an accuracy of 69%.
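
This write-up keeps the tree small by shrinking the training set. One alternative, which is not what this analysis did, is to keep a conventional split and cap the tree depth with the `max_depth` parameter of `DecisionTreeClassifier`. The sketch below uses synthetic data with random labels, so its accuracy is meaningless. It only illustrates the effect on tree size, and the value `max_depth=3` is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three cleaned predictors (not NESARC data)
rng = np.random.RandomState(0)
n = 2000
X = np.column_stack([
    rng.choice([1, 2], size=n),                      # divorce: 1=Yes, 2=No
    rng.choice([1, 2], size=n),                      # parent death: 1=Yes, 2=No
    rng.choice(list(range(16, 50)) + [99], size=n),  # age at first marriage
])
y = rng.randint(0, 2, size=n)  # random 0/1 labels, for illustration only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Same training data, with and without a depth cap
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
capped = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("nodes without cap:", unrestricted.tree_.node_count)
print("nodes with max_depth=3:", capped.tree_.node_count)
```

A tree of depth 3 has at most 15 nodes, which is small enough to present however many training records are used.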

In [None]:
#Separate data into training and test set
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.997)

#Print each shape explicitly, as a notebook cell only echoes its final expression
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)

#Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

#Get predictions
predictions = classifier.predict(pred_test)

#Publish confusion matrix and accuracy
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
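
To read the two outputs together, it helps to see how the accuracy follows from the confusion matrix, and how it compares with always predicting the most common class. The sketch below is plain Python. The counts are hypothetical, chosen only to make the arithmetic easy, and are not the results of this analysis. Rows are true classes and columns are predicted classes, which is the layout `sklearn.metrics.confusion_matrix` returns.

```python
def accuracy_from_confusion(matrix):
    """Share of records on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def majority_baseline(matrix):
    """Accuracy of always predicting the most common true class."""
    row_totals = [sum(row) for row in matrix]
    return max(row_totals) / sum(row_totals)


# Hypothetical counts: rows = true class (0, 1), columns = predicted class (0, 1)
example = [
    [60, 10],
    [20, 10],
]

print(accuracy_from_confusion(example))
print(majority_baseline(example))
```

If the two figures are close, as they are for these invented counts, the tree is adding little over a constant guess, so the baseline is worth checking alongside any reported accuracy.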

## Display the Results ##

The resulting decision tree relies mostly on [2] age at first marriage, possibly because this predictor takes many values rather than a simple yes/no. However, the tree is very complex, even with so few training examples and predictors.

The main branches split on age at first marriage: a) <= 18, b) 23-26, c) > 44 (these do not cover every case). Broadly, those who marry young (<= 26) have complex subtrees, whereas those who marry later (>= 44) have simple subtrees, affected mostly by parental death and divorce. Note that the latter group also contains respondents with no recorded marriage, because blanks were coded as 99 during cleaning.

In [None]:
from sklearn import tree
from io import StringIO
from IPython.display import Image

out = StringIO()
tree.export_graphviz(classifier, out_file=out)

import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())

## Click link to view decision tree ##

[Decision Tree Image](https://drive.google.com/file/d/0B9E_gt6FCYe-TGpkUUJPX19TeFU/view?usp=sharing)