# Challenge: make a neural network

For this challenge you have two options for how to use neural networks. Choose one of the following:

- Use an RBM (restricted Boltzmann machine) to perform feature extraction on an image-based dataset that you find or create. If you go this route, present the features you extract and explain why this is a useful feature extraction method in the context you're operating in. DO NOT USE either the MNIST digit recognition database or the iris data set. Both have been worked on very publicly many times and the code is easily available. (However, that code could be a useful resource to refer to.)

- Create a multi-layer perceptron neural network model to predict on a labeled dataset of your choosing. Compare this model to either a boosted tree or a random forest model and describe the relative tradeoffs between complexity and accuracy. Be sure to vary the hyperparameters of your MLP!

## Option: Create a multi-layer perceptron neural network model to predict on a labeled dataset 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.tree import DecisionTreeClassifier


from sklearn.preprocessing import StandardScaler




### Using a wine dataset to predict which of three cultivars a wine comes from (all wines are from the same region in Italy)

In [2]:
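# The first column is the class label (cultivar); the other 13 are chemical measurements.
# Column names are kept exactly as used in the rest of the notebook.
wine = pd.read_csv(
    './wine/wine.csv',
    names=["Cultivator", "Alchol", "Malic_Acid", "Ash", "Alcalinity_of_Ash",
           "Magnesium", "Total_phenols", "Falvanoids", "Nonflavanoid_phenols",
           "Proanthocyanins", "Color_intensity", "Hue", "OD280", "Proline"],
)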

wine.head()

Unnamed: 0,Cultivator,Alchol,Malic_Acid,Ash,Alcalinity_of_Ash,Magnesium,Total_phenols,Falvanoids,Nonflavanoid_phenols,Proanthocyanins,Color_intensity,Hue,OD280,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


### Data cleaning and exploration

In [3]:
wine.Cultivator.unique()

array([1, 2, 3])

In [4]:
wine.describe()

Unnamed: 0,Cultivator,Alchol,Malic_Acid,Ash,Alcalinity_of_Ash,Magnesium,Total_phenols,Falvanoids,Nonflavanoid_phenols,Proanthocyanins,Color_intensity,Hue,OD280,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


In [5]:
#Look for null values
wine.isnull().sum()/wine.isnull().count()

Cultivator              0.0
Alchol                  0.0
Malic_Acid              0.0
Ash                     0.0
Alcalinity_of_Ash       0.0
Magnesium               0.0
Total_phenols           0.0
Falvanoids              0.0
Nonflavanoid_phenols    0.0
Proanthocyanins         0.0
Color_intensity         0.0
Hue                     0.0
OD280                   0.0
Proline                 0.0
dtype: float64

In [6]:
#Looking for categorical variables
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Cultivator              178 non-null int64
Alchol                  178 non-null float64
Malic_Acid              178 non-null float64
Ash                     178 non-null float64
Alcalinity_of_Ash       178 non-null float64
Magnesium               178 non-null int64
Total_phenols           178 non-null float64
Falvanoids              178 non-null float64
Nonflavanoid_phenols    178 non-null float64
Proanthocyanins         178 non-null float64
Color_intensity         178 non-null float64
Hue                     178 non-null float64
OD280                   178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB


### Creating the MLPClassifier model

In [7]:
import time

#Set X and Y
X = wine.drop('Cultivator',axis=1)
y = wine['Cultivator']

#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)


start_time = time.time()

#A multi-layer perceptron is sensitive to feature scaling,
#so we standardize the data (the scaler is fit on the training set only)
scaler = StandardScaler()
scaler.fit(X_train)


X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


# Establish and fit the model, with three hidden layers of 15 perceptrons each.
mlp = MLPClassifier(hidden_layer_sizes=(15,15,15), max_iter=300)

mlp.fit(X_train,y_train)

#Predictions
predictions = mlp.predict(X_test)

print("--- Runtime: %s seconds ---" % (time.time() - start_time))

--- Runtime: 0.20296907424926758 seconds ---
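
To make the scaling step above concrete, here is a minimal NumPy sketch of what standardization does: the mean and standard deviation are learned from the training rows only, then applied to both sets. The arrays `X_train_toy` and `X_test_toy` are hypothetical toy data, not the wine file.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
# (think alcohol percentage vs. proline content).
X_train_toy = np.array([[14.2, 1065.0],
                        [13.2, 1050.0],
                        [12.4,  520.0],
                        [13.7,  735.0]])
X_test_toy = np.array([[13.0, 800.0]])

# "fit": learn per-column mean and standard deviation from the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to both sets
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
```

Reusing the training statistics for the test rows keeps information about the test set out of the preprocessing step.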




### Evaluating the MLP model

In [8]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))

[[19  0  0]
 [ 0 14  0]
 [ 0  1 11]]


In [9]:
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          1       1.00      1.00      1.00        19
          2       0.93      1.00      0.97        14
          3       1.00      0.92      0.96        12

avg / total       0.98      0.98      0.98        45
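
The per-class numbers in the report can be recomputed by hand from the confusion matrix printed above. The sketch below does that with NumPy: precision divides the diagonal by the column sums (predicted counts), recall divides it by the row sums (true counts).

```python
import numpy as np

# Confusion matrix printed above for the MLP
# (rows = true class, columns = predicted class)
cm = np.array([[19, 0, 0],
               [0, 14, 0],
               [0, 1, 11]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(accuracy, 3))
```

The single misclassified wine (true class 3, predicted class 2) is what lowers the precision of class 2 to 14/15 and the recall of class 3 to 11/12.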



In [10]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X_train, y_train, cv=5)

array([0.92857143, 0.96296296, 1.        , 1.        , 1.        ])
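
The cross-validation above runs on data that was already scaled using the whole training set. A possible refinement, sketched here, wraps the scaler and the MLP in a pipeline so the scaler is refit inside every fold. This is an illustration under assumptions, not the notebook's original method: `load_wine` (scikit-learn's bundled wine data, labeled 0 to 2) stands in for `./wine/wine.csv`, and `random_state=0` is added only for repeatability.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for ./wine/wine.csv
X_demo, y_demo = load_wine(return_X_y=True)

# The scaler is fit on the training part of each fold only
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(15, 15, 15), max_iter=300, random_state=0),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores)
print(scores.mean())
```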

In [11]:
y_test.value_counts()/len(y_test)

1    0.422222
2    0.311111
3    0.266667
Name: Cultivator, dtype: float64

In [12]:
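#Count how often the MLP predicted each class
dic = {1:0,2:0,3:0}
for n in predictions:
    dic[n]+=1

#Get % (divide by the number of test samples instead of hard-coding 45)
for i in dic:
    dic[i]/=len(predictions)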

print(dic)
    

{1: 0.4222222222222222, 2: 0.3333333333333333, 3: 0.24444444444444444}
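
The same class shares can be computed more compactly with `collections.Counter`. In this sketch, `predictions_demo` is a hypothetical stand-in for `predictions`, built from the column sums of the MLP confusion matrix (19, 15 and 11 predictions for classes 1, 2 and 3).

```python
from collections import Counter

# Hypothetical stand-in for `predictions`: 19, 15 and 11 predicted labels for
# classes 1, 2 and 3, i.e. the column sums of the MLP confusion matrix.
predictions_demo = [1] * 19 + [2] * 15 + [3] * 11

counts = Counter(predictions_demo)
total = len(predictions_demo)

# Share of each predicted class, in label order
shares = {label: counts[label] / total for label in sorted(counts)}
print(shares)
```

Comparing these shares with `y_test.value_counts()/len(y_test)` shows whether the model over- or under-predicts a class.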


### Creating a Random Forest Model

In [13]:
#Setting start time to calculate runtime
start_time = time.time()

#Creating an instance of the RandomForestClassifier class
rfc = ensemble.RandomForestClassifier()

#If we add parameter n_estimators, runtime increases
#rfc = ensemble.RandomForestClassifier(n_estimators=50)

#Fitting the model
rfc.fit(X_train,y_train)

#Making predictions
predictionsrfc = rfc.predict(X_test)

print("--- Runtime: %s seconds ---" % (time.time() - start_time))

--- Runtime: 0.01820993423461914 seconds ---
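
The comment in the cell above notes that runtime grows with `n_estimators`. A rough way to check this is sketched below. It assumes scikit-learn's bundled `load_wine` data in place of `./wine/wine.csv`, and the tree counts (10, 50, 200) are arbitrary illustrative values. Timings depend on the machine, so none are quoted here.

```python
import time

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled wine data stands in for ./wine/wine.csv
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for n in (10, 50, 200):
    start = time.perf_counter()
    model = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    accuracy = model.score(X_te, y_te)
    # Store (elapsed seconds for fit + score, test accuracy)
    results[n] = (time.perf_counter() - start, accuracy)

for n, (seconds, accuracy) in results.items():
    print(f"n_estimators={n}: {seconds:.3f} s, accuracy={accuracy:.3f}")
```

Note that the default `n_estimators` differs between scikit-learn releases (10 in older versions, 100 in newer ones), so setting it explicitly makes runtime comparisons easier to interpret.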


### Evaluating the Random Forest model

In [14]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictionsrfc))

[[19  0  0]
 [ 0 14  0]
 [ 0  0 12]]


In [15]:
print(classification_report(y_test,predictionsrfc))

             precision    recall  f1-score   support

          1       1.00      1.00      1.00        19
          2       1.00      1.00      1.00        14
          3       1.00      1.00      1.00        12

avg / total       1.00      1.00      1.00        45



In [16]:
y_test.value_counts()/len(y_test)

1    0.422222
2    0.311111
3    0.266667
Name: Cultivator, dtype: float64

In [17]:
#Count how often the random forest predicted each class
dic = {1:0,2:0,3:0}
for n in predictionsrfc:
    dic[n]+=1

#Get % (rounded to 3 decimals)
for i in dic:
    dic[i] = round(dic[i]/len(predictionsrfc), 3)

print(dic)


{1: 0.422, 2: 0.311, 3: 0.267}


#### Conclusions:

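- Random forest was slightly better in terms of precision, recall and f1-score (1.00 versus 0.98 on average over the 45 test samples).
- Random forest was faster than the MLP (about 0.02 seconds versus 0.20 seconds, where the MLP timing also includes scaling).
- The max_iter parameter is important: in this case, the higher its value, the worse the results.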
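
One way to examine the effect of `max_iter` and the layer layout more systematically is a small grid search. The sketch below is an illustration under assumptions rather than the experiment behind the conclusions above: it uses scikit-learn's bundled `load_wine` data in place of `./wine/wine.csv`, and the grid values and `random_state=0` are arbitrary choices.

```python
import warnings

from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for ./wine/wine.csv
X_demo, y_demo = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(random_state=0)),
])

# Illustrative grid: one vs. three hidden layers, and three iteration budgets
param_grid = {
    "mlp__hidden_layer_sizes": [(15,), (15, 15, 15)],
    "mlp__max_iter": [100, 300, 1000],
}

search = GridSearchCV(pipe, param_grid, cv=5)
with warnings.catch_warnings():
    # Small iteration budgets may stop before convergence; silence that warning
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The full table of mean scores per combination is available in `search.cv_results_`, which makes it possible to see how accuracy changes with `max_iter` instead of relying on a single train/test split.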


________________

By: Wendy Navarrete

September 2019