""Now that we've seen the story that the data tells, it's time to build the model. First we import the libraries and the data file.""

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import scale

print("Loaded")

Loaded


In [2]:
df = pd.read_csv(r"C:\Users\Laura-Black\Documents\PhD\Data Scientist\Springboard\Capstone Projects\Drug Abuse ED Visits\Output files\Capstone1DataWrangling.csv",
                 index_col=0, low_memory=False)

In [3]:
print('The number of data points =\n', df.count())
print('The number of features = ',len(df.columns))
print('The target is Case Type.')
print('The number of distinct values in the target = ', df['CASETYPE'].nunique())
print('They are: ',df.CASETYPE.unique())
print('The relative size of each class = ', df['CASETYPE'].value_counts(normalize=True))

The number of data points =
 METRO           229211
STRATA          229211
PSU             229211
REPLICATE       229211
CASEWGT         229211
                 ...  
ALCOHOL         229211
NONALCILL       229211
PHARMA          229211
NONMEDPHARMA    229211
ALLABUSE        229211
Length: 86, dtype: int64
The number of features =  86
The target is Case Type.
The number of distinct values in the target =  8
They are:  ['Other' 'Adverse Reaction' 'Seeking Detox' 'Suicide Attempt'
 'Overmedication' 'Accidential Ingestion' 'Alcohol (Age<21)'
 'Malicious Poisoning']
The relative size of each class =  Adverse Reaction         0.384345
Other                    0.382303
Overmedication           0.079167
Seeking Detox            0.064748
Suicide Attempt          0.039409
Alcohol (Age<21)         0.032376
Accidential Ingestion    0.014192
Malicious Poisoning      0.003460
Name: CASETYPE, dtype: float64


In [4]:
print(df.head())
print(df.info())

              METRO  STRATA  PSU  REPLICATE   CASEWGT  PSUFRAME   AGECAT  \
CASEID                                                                     
1          New York      25  108          2  0.942635         3    18-20   
2          New York      29  129          2  5.992011         9     > 65   
3       Date County       7   25          1  4.723172         6     > 65   
4           Phoenix       8   29          2  4.080147         6  06-11\t   
5            Boston      22   94          2  5.177709        10    25-29   

           SEX      RACE  YEAR  ...       DRUGID_22        ROUTE_22  \
CASEID                          ...                                   
1         Male     Black  2011  ...  Not Applicable  Not Applicable   
2         Male  Hispanic  2011  ...  Not Applicable  Not Applicable   
3       Female     Black  2011  ...  Not Applicable  Not Applicable   
4         Male  Hispanic  2011  ...  Not Applicable  Not Applicable   
5         Male  Hispanic  2011  ...  Not 

""Now, we will create the design matrix (X) and the target vector (y) for
the associated classification problem.""

In [5]:
x_split = df.drop('CASETYPE', axis=1)

X = pd.get_dummies(x_split,drop_first=True)

In [6]:
y = df['CASETYPE'].values
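To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame. The column names `SEX` and `CASETYPE` mirror the real data, but `AGE` and all of the values are hypothetical and only serve the illustration. It shows that `pd.get_dummies` leaves numeric columns alone, expands each categorical column into indicator columns, and, with `drop_first=True`, drops the first level of each categorical to avoid a redundant column.

```python
import pandas as pd

# Hypothetical toy data; not taken from the capstone file.
toy = pd.DataFrame({
    "SEX": ["Male", "Female", "Male"],
    "AGE": [19, 70, 8],
    "CASETYPE": ["Other", "Adverse Reaction", "Other"],
})

# Same two steps as above: drop the target, then one-hot encode the rest.
features = toy.drop("CASETYPE", axis=1)
X_toy = pd.get_dummies(features, drop_first=True)
y_toy = toy["CASETYPE"].values

# Numeric columns come first, followed by the indicator columns.
print(X_toy.columns.tolist())
print(y_toy)
```

Because "Female" sorts before "Male", it becomes the dropped reference level, and a single `SEX_Male` indicator carries the same information.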

""Next, we will split the data into test and training sets.""

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42, stratify=y)

In [8]:
from collections import Counter
print('The size of each class in the training set = ', Counter(y_train))

print('\n The size of each class in the test set = ', Counter(y_test))


The size of each class in the training set =  Counter({'Adverse Reaction': 52857, 'Other': 52577, 'Overmedication': 10887, 'Seeking Detox': 8904, 'Suicide Attempt': 5420, 'Alcohol (Age<21)': 4453, 'Accidential Ingestion': 1952, 'Malicious Poisoning': 476})

 The size of each class in the test set =  Counter({'Adverse Reaction': 35239, 'Other': 35051, 'Overmedication': 7259, 'Seeking Detox': 5937, 'Suicide Attempt': 3613, 'Alcohol (Age<21)': 2968, 'Accidential Ingestion': 1301, 'Malicious Poisoning': 317})
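As a quick check that `stratify=y` did its job, the class counts printed above can be turned into shares and compared across the two sets. The sketch below only re-uses the numbers from that output; the variable names are illustrative.

```python
# Class counts copied from the Counter output above.
train_counts = {
    "Adverse Reaction": 52857, "Other": 52577, "Overmedication": 10887,
    "Seeking Detox": 8904, "Suicide Attempt": 5420, "Alcohol (Age<21)": 4453,
    "Accidential Ingestion": 1952, "Malicious Poisoning": 476,
}
test_counts = {
    "Adverse Reaction": 35239, "Other": 35051, "Overmedication": 7259,
    "Seeking Detox": 5937, "Suicide Attempt": 3613, "Alcohol (Age<21)": 2968,
    "Accidential Ingestion": 1301, "Malicious Poisoning": 317,
}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())
test_fraction = n_test / (n_train + n_test)

# Largest gap between a class's share of the training set and of the test set.
max_gap = max(
    abs(train_counts[label] / n_train - test_counts[label] / n_test)
    for label in train_counts
)

print(f"train rows: {n_train}, test rows: {n_test}, test fraction: {test_fraction:.3f}")
print(f"largest difference in class share: {max_gap:.6f}")
```

The two sets add up to the 229,211 rows in the data, and each class makes up nearly the same share of both sets, which is what a stratified split is meant to guarantee.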


In [9]:
# Create the classifier: logreg
logreg = LogisticRegression(multi_class='multinomial',solver='lbfgs', max_iter=1000000000)

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict the labels of the training and test sets
y_pred_test = logreg.predict(X_test)
y_pred_train = logreg.predict(X_train)

# Compute and print the confusion matrices and classification reports for the Training set
print('Train Set')
print(confusion_matrix(y_train, y_pred_train))
print(classification_report(y_train, y_pred_train))
print(logreg.score(X_train,y_train))

# Compute and print the confusion matrices and classification reports for the Test set
print('Test Set')
print(confusion_matrix(y_test, y_pred_test))
print(classification_report(y_test, y_pred_test))
print(logreg.score(X_test,y_test))




Train Set
[[   23  1817     0     0    51     0     1    60]
 [    1 52298     0     0    10     0    94   454]
 [    0     0  4453     0     0     0     0     0]
 [    0     0     0    11   449    16     0     0]
 [    0     0     0     1 49022  2735   819     0]
 [    0     0     0     0  3081  7806     0     0]
 [    0   899    17     0  3141     0  4518   329]
 [    0  1975    27     0   384     0   460  2574]]
                       precision    recall  f1-score   support

Accidential Ingestion       0.96      0.01      0.02      1952
     Adverse Reaction       0.92      0.99      0.95     52857
     Alcohol (Age<21)       0.99      1.00      1.00      4453
  Malicious Poisoning       0.92      0.02      0.05       476
                Other       0.87      0.93      0.90     52577
       Overmedication       0.74      0.72      0.73     10887
        Seeking Detox       0.77      0.51      0.61      8904
      Suicide Attempt       0.75      0.47      0.58      5420

            


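""Both the training set and the test set show an overall accuracy of 0.877, so the model does not appear to overfit. For comparison, always predicting the largest class (Adverse Reaction, about 38% of cases) would give an accuracy of about 0.38. The per-class precision on the test set, that is, the share of predictions for each class that were correct, is:
1. Accidental Ingestion: 88% of the predictions were correct.
2. Adverse Reaction: 92% of the predictions were correct.
3. Alcohol (Age<21): 99% of the predictions were correct.
4. Malicious Poisoning: 50% of the predictions were correct. This class has very few cases (317 in the test set), which likely limits how well the model can learn it; more cases in this class might have improved the result.
5. Other: 87% of the predictions were correct.
6. Overmedication: 74% of the predictions were correct.
7. Seeking Detox: 77% of the predictions were correct.
8. Suicide Attempt: 74% of the predictions were correct.
Precision alone overstates performance on the rare classes. In the training-set report above, recall is only 0.01 for Accidental Ingestion and 0.02 for Malicious Poisoning, so the model almost never predicts these classes even though the few predictions it does make are usually right. With that caveat, hospitals can use this model to estimate, with reasonable accuracy, the distribution of the more common types of drug abuse cases for which they will need to prepare.""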
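The precision and recall figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading the numbers above. The sketch below uses the training-set matrix printed earlier (rows are true classes, columns are predicted classes, both in alphabetical order); the variable names are illustrative.

```python
labels = [
    "Accidential Ingestion", "Adverse Reaction", "Alcohol (Age<21)",
    "Malicious Poisoning", "Other", "Overmedication",
    "Seeking Detox", "Suicide Attempt",
]

# Training-set confusion matrix copied from the output above.
cm = [
    [23, 1817, 0, 0, 51, 0, 1, 60],
    [1, 52298, 0, 0, 10, 0, 94, 454],
    [0, 0, 4453, 0, 0, 0, 0, 0],
    [0, 0, 0, 11, 449, 16, 0, 0],
    [0, 0, 0, 1, 49022, 2735, 819, 0],
    [0, 0, 0, 0, 3081, 7806, 0, 0],
    [0, 899, 17, 0, 3141, 0, 4518, 329],
    [0, 1975, 27, 0, 384, 0, 460, 2574],
]

n = len(labels)
row_sums = [sum(cm[i]) for i in range(n)]                       # true cases per class
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predictions per class

# Precision: correct predictions / all predictions of that class.
precision = [cm[k][k] / col_sums[k] for k in range(n)]
# Recall: correct predictions / all true cases of that class.
recall = [cm[k][k] / row_sums[k] for k in range(n)]
# Accuracy: correct predictions / all cases.
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_sums)

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<22} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

For Accidential Ingestion, only 24 training cases were predicted as that class and 23 of them were right, which gives the high precision, while 1,929 of the 1,952 true cases were assigned to other classes, which gives the very low recall.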

In the Extended Analysis notebook, we will explore approaches to improve on these baseline results.