# Python for Data Analysis : Modeling Part

Problem which can be solved:
* 7-class classifications for each drug separately.
* Problem can be transformed to binary classification by union of part of classes into one new class. 
    * For example, "Never Used", "Used over a Decade Ago" form class "Non-user" and all other classes form class "User".
* The best binarization of classes for each attribute.
* Evaluation of risk to be drug consumer for each drug.

For this dataset, I have decided to solve the first problem : *7-class Classification for each drug*

## Import the dataset

In [1]:
import pandas as pd
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data", header = None)
columns = ['ID', 'AGE', 'GENDER', 'EDUCATION_LEVEL', 'COUNTRY', 'ETHNICITY', 'NSCORE_VALUE', 'ESCORE_VALUE', 'OSCORE_VALUE', 'ASCORE_VALUE', 
        'CSCORE_VALUE', 'IMPULSIVENESS', 'SENSATION_SEEING', 'ALCOHOL_CONSUMPTION', 'AMPHET_CONSUMPTION', 'AMYL_CONSUMPTION', 'BENZOS_CONSUMPTION', 
        'CAFFEINE_CONSUMPTION', 'CANNABIS_CONSUMPTION', 'CHOCOLATE_CONSUMPTION', 'COKE_CONSUMPTION', 'CRACK_CONSUMPTION', 'ECSTASY_CONSUMPTION', 
        'HEROIN_CONSUMPTION', 'KETAMINE_CONSUMPTION', 'LEGAL_HIGHS_CONSUMPTION', 'LSD_CONSUMPTION', 'METH_CONSUMPTION', 'MAGIC_MUSHROOMS_CONSUMPTION', 
        'NICOTINE_CONSUMPTION', 'SEMER_CONSUMPTION', 'VSA_CONSUMPTION']
df.columns = columns

## Spliting into training and testing set

In [2]:
import sklearn
from sklearn.model_selection import train_test_split

In [3]:
features = ['AGE', 'GENDER', 'EDUCATION_LEVEL', 'COUNTRY', 'ETHNICITY', 
    'NSCORE_VALUE', 'ESCORE_VALUE', 'OSCORE_VALUE', 'ASCORE_VALUE', 'CSCORE_VALUE', 'IMPULSIVENESS', 'SENSATION_SEEING']

# As we have seen in the analysis, the caffeine is the most consumed drug. So I chose this one for now. 
predict = ['CAFFEINE_CONSUMPTION']

X = df[features].values
y = df[predict].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

##  Support Vector Machine Algorithm for Classification (Support Vector Clustering)

In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
SVMs are one of the most robust prediction methods.

In [4]:
from sklearn import metrics
from sklearn.svm import SVC

In [5]:
def accuracy(y_test, pred): 
    accuracy = metrics.accuracy_score(y_test,pred)
    print(f"The accuracy score is : {round(accuracy*100, 2)}%.")

### By-default parameters

In [30]:
svm = SVC()
svm.fit(X_train, y_train.ravel())
pred = svm.predict(X_test)
accuracy(y_test,pred)

The accuracy score is : 72.26%.


We have an accuracy score of 72,26%, which is a quite good one, for a by-default usage of this algorithm. 

### With Grid-search

In [31]:
from sklearn.model_selection import GridSearchCV 

# defining parameter range 
param_grid = {'C': [0.01, 0.1, 1], # The strength of the regularization is inversely proportional to C
			'gamma': [1, 0.1, 0.01], # Kernel coefficient for rbf, poly and sigmoid
			'kernel': ['poly', 'rbf']} # kernel type to be used in the algorithm

grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3) 

# fitting the model for grid search 
grid.fit(X_train, y_train.ravel()) 


Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] C=0.01, gamma=1, kernel=poly ....................................
[CV] ........ C=0.01, gamma=1, kernel=poly, score=0.633, total=   0.1s
[CV] C=0.01, gamma=1, kernel=poly ....................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[CV] ........ C=0.01, gamma=1, kernel=poly, score=0.655, total=   0.1s
[CV] C=0.01, gamma=1, kernel=poly ....................................
[CV] ........ C=0.01, gamma=1, kernel=poly, score=0.621, total=   0.1s
[CV] C=0.01, gamma=1, kernel=poly ....................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
[CV] ........ C=0.01, gamma=1, kernel=poly, score=0.636, total=   0.1s
[CV] C=0.01, gamma=1, kernel=poly ....................................
[CV] ........ C=0.01, gamma=1, kernel=poly, score=0.643, to

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.01, 0.1, 1], 'gamma': [1, 0.1, 0.01],
                         'kernel': ['poly', 'rbf']},
             verbose=3)

In [32]:
# print best parameter after tuning 
print(grid.best_params_) 

# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_) 

{'C': 0.01, 'gamma': 1, 'kernel': 'rbf'}
SVC(C=0.01, gamma=1)


In [33]:
grid_predictions = grid.predict(X_test) 
accuracy(y_test,grid_predictions)

The accuracy score is : 72.26%.


Accuracy score before grid search = 72,26 % <br/>
Accuracy score after grid search = 72.26%

The accuracy score hasn't changed. 

As we have said in the conclusion, this is the best algorithm we have for now. 

In [34]:
import pickle
pickle.dump(grid, open('models/grid_SVM_prediction.pickle', 'wb'))

## Decision Tree 

Decision tree builds classification or regression models in the form of a tree structure. <br/>
It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. <br/>
The final result is a tree with decision nodes and leaf nodes.

**DecisionTreeClassifier** is a class capable of performing multi-class classification on a dataset.

In [11]:
from sklearn.tree import DecisionTreeClassifier

clf_dtc = DecisionTreeClassifier()
clf_dtc.fit(X_train, y_train.ravel())
pred = clf_dtc.predict(X_test)
accuracy(y_test,pred)


The accuracy score is : 57.07%.


In [12]:
param_grid = {'criterion': ['gini', 'entropy'],
			'splitter': ['random', 'best'], 
			'max_depth': [5,10,25], 
            'random_state': [None, 1, 5, 10]} 

grid = GridSearchCV(DecisionTreeClassifier(), param_grid, refit = True, verbose = 3) 

# fitting the model for grid search 
grid.fit(X_train, y_train.ravel()) 

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] criterion=gini, max_depth=5, random_state=None, splitter=random .
[CV]  criterion=gini, max_depth=5, random_state=None, splitter=random, score=0.739, total=   0.0s
[CV] criterion=gini, max_depth=5, random_state=None, splitter=random .
[CV]  criterion=gini, max_depth=5, random_state=None, splitter=random, score=0.731, total=   0.0s
[CV] criterion=gini, max_depth=5, random_state=None, splitter=random .
[CV]  criterion=gini, max_depth=5, random_state=None, splitter=random, score=0.727, total=   0.0s
[CV] criterion=gini, max_depth=5, random_state=None, splitter=random .
[CV]  criterion=gini, max_depth=5, random_state=None, splitter=random, score=0.735, total=   0.0s
[CV] criterion=gini, max_depth=5, random_state=None, splitter=random .
[CV]  criterion=gini, max_depth=5, random_state=None, splitter=random, score=0.738, total=   0.0s
[CV] criterion=gini, max_depth=5, random_state=None, splitter=best ...
[CV]  criterion=gini, 

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 25],
                         'random_state': [None, 1, 5, 10],
                         'splitter': ['random', 'best']},
             verbose=3)

In [14]:
# print best parameter after tuning 
print(grid.best_params_) 

# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_) 

{'criterion': 'gini', 'max_depth': 5, 'random_state': None, 'splitter': 'random'}
DecisionTreeClassifier(max_depth=5, splitter='random')


In [13]:
grid_predictions = grid.predict(X_test) 
accuracy(y_test,grid_predictions)

The accuracy score is : 71.38%.


Accuracy score before grid search = 57.07% <br/>
Accuracy score before grid search = 71,38%

Better with grid search. 

## Random Forest

Random forest has nearly the same hyperparameters as a decision tree or a bagging classifier.
Random forest adds additional randomness to the model, while growing the trees. 

Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features.

In [15]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train.ravel())
pred = clf_dtc.predict(X_test)
accuracy(y_test,pred)

The accuracy score is : 57.07%.


In [19]:
param_grid = {'criterion': ['gini', 'entropy'],
			'max_depth': [5,10,25], 
            'n_estimators' : [50, 100, 150, 200, 500], 
            'bootstrap' : ['True', 'False'], 
            'random_state': [None, 1, 5, 10]} 

grid = GridSearchCV(RandomForestClassifier(), param_grid, refit = True, verbose = 3) 

# fitting the model for grid search 
grid.fit(X_train, y_train.ravel()) 

[CV] bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=None 
[CV]  bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=None, score=0.735, total=   0.1s
[CV] bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=None 
[CV]  bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=None, score=0.735, total=   0.1s
[CV] bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=None 
[CV]  bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=None, score=0.739, total=   0.1s
[CV] bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=None 
[CV]  bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=None, score=0.745, total=   0.1s
[CV] bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=1 
[CV]  bootstrap=False, criterion=gini, max_depth=10, n_estimators=50, random_state=1, score=0.742, 

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'bootstrap': ['True', 'False'],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 25],
                         'n_estimators': [50, 100, 150, 200, 500],
                         'random_state': [None, 1, 5, 10]},
             verbose=3)

In [20]:
# print best parameter after tuning 
print(grid.best_params_) 

# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_) 

{'bootstrap': 'True', 'criterion': 'gini', 'max_depth': 5, 'n_estimators': 50, 'random_state': None}
RandomForestClassifier(bootstrap='True', max_depth=5, n_estimators=50)


In [21]:
grid_predictions = grid.predict(X_test) 
accuracy(y_test,grid_predictions)

The accuracy score is : 72.26%.


Accuracy score before grid search = 57.07% <br/>
Accuracy score before grid search = 72,26%

Better with grid search. 

## K-Nearest Neighbours (KNN)

An approach to data classification that estimates how likely a data point is to be a member of one group or the other 
depending on what group the data points nearest to it are in.

In [22]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
clf.fit(X_train,y_train.ravel())
pred = clf.predict(X_test)
accuracy(y_test,pred)

The accuracy score is : 65.72%.


In [27]:
param_grid = {'n_neighbors': [8,9,10,11],
			'weights': ['uniform', 'distance'], 
            'algorithm' : ['ball_tree', 'kd_tree', 'brute', 'auto'], 
            'leaf_size' : [5,10,25, 30, 50]} 

grid = GridSearchCV(KNeighborsClassifier(), param_grid, refit = True, verbose = 3) 

# fitting the model for grid search 
grid.fit(X_train, y_train.ravel()) 

[CV] algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=uniform 
[CV]  algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=uniform, score=0.727, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=uniform 
[CV]  algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=uniform, score=0.731, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=uniform 
[CV]  algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=uniform, score=0.742, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=uniform 
[CV]  algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=uniform, score=0.719, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=distance, score=0.731, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=25, n_neighbors=10, weights=distance, sco

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto'],
                         'leaf_size': [5, 10, 25, 30, 50],
                         'n_neighbors': [8, 9, 10, 11],
                         'weights': ['uniform', 'distance']},
             verbose=3)

In [28]:
# print best parameter after tuning 
print(grid.best_params_) 

# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_) 

{'algorithm': 'ball_tree', 'leaf_size': 5, 'n_neighbors': 11, 'weights': 'distance'}
KNeighborsClassifier(algorithm='ball_tree', leaf_size=5, n_neighbors=11,
                     weights='distance')


In [29]:
grid_predictions = grid.predict(X_test) 
accuracy(y_test,grid_predictions)

The accuracy score is : 71.38%.


Accuracy score before grid search = 65.72% <br/>
Accuracy score after grid search = 71.38%

The accuracy is better with grid search.  

## Conclusion

We have the highest accuracy score by using the **Support Vector Classification**. <br/>
We are going to use this model for our API. 