# Deep learning 

In this notebook I'm going to make the regrssion with a deep learning approach. The code was developed in Keras and sklearn.

In [2]:
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
np.random.seed(42)

 
# Matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
%matplotlib inline

import matplotlib
matplotlib.rcParams['font.size'] = 16
matplotlib.rcParams['figure.figsize'] = (9, 9)

import seaborn as sns

from IPython.core.pylabtools import figsize

# Scipy helper functions
from scipy.stats import percentileofscore
from scipy import stats

In [3]:
# Standard ML Models for comparison
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
    

# Splitting data into training/testing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error

# Distributions
import scipy
# Read
import csv

In [4]:
# keras 

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


# Load and preprocess data

Due to some columns of our dataset have ',', when I create a pandas dataframe those values aren't read as numeric, for this reason it's important to preprocess data to relace ',' by '.'.

In [5]:
r = []

with open('./AirQualityUCI.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=';')
    for row in spamreader:
        j2 = []
        for j in row:
            j3 = str(j).replace(',','.')
            if len(j3) > 0:
                j2.append(j3)  
        r.append(j2)
            
        
                

Load data as pandas Dataframe. REmove NA values and select columns.

In [6]:
df = pd.DataFrame(r[1:], columns=r[0])

In [7]:
df = df.dropna()

In [8]:
df = df[['Time','CO(GT)','PT08.S1(CO)','NMHC(GT)','C6H6(GT)' ,'PT08.S2(NMHC)' ,'NOx(GT)' ,'PT08.S3(NOx)'  ,'NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)'  ,'T' ,'RH','AH'  ]]

In this part of the study I convert data to the right data type.

In [9]:
print (df.dtypes)

Time             object
CO(GT)           object
PT08.S1(CO)      object
NMHC(GT)         object
C6H6(GT)         object
PT08.S2(NMHC)    object
NOx(GT)          object
PT08.S3(NOx)     object
NO2(GT)          object
PT08.S4(NO2)     object
PT08.S5(O3)      object
T                object
RH               object
AH               object
dtype: object


In [10]:
df[['CO(GT)','PT08.S1(CO)','NMHC(GT)','C6H6(GT)' ,'PT08.S2(NMHC)' ,'NOx(GT)' ,'PT08.S3(NOx)'  ,'NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)'  ,'T' ,'RH','AH' ]] = df[['CO(GT)','PT08.S1(CO)','NMHC(GT)','C6H6(GT)' ,'PT08.S2(NMHC)' ,'NOx(GT)' ,'PT08.S3(NOx)'  ,'NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)'  ,'T' ,'RH','AH']].apply(pd.to_numeric)

In [11]:
df['Time'] = df["Time"].astype('category')

In [12]:
df.describe()

Unnamed: 0,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
count,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0
mean,-34.207524,1048.990061,-159.090093,1.865683,894.595276,168.616971,794.990168,58.148873,1391.479641,975.072032,9.778305,39.48538,-6.837604
std,77.65717,329.83271,139.789093,41.380206,342.333252,257.433866,321.993552,126.940455,467.210125,456.938184,43.203623,51.216145,38.97667
min,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0
25%,0.6,921.0,-200.0,4.0,711.0,50.0,637.0,53.0,1185.0,700.0,10.9,34.1,0.6923
50%,1.5,1053.0,-200.0,7.9,895.0,141.0,794.0,96.0,1446.0,942.0,17.2,48.6,0.9768
75%,2.6,1221.0,-200.0,13.6,1105.0,284.0,960.0,133.0,1662.0,1255.0,24.1,61.9,1.2962
max,11.9,2040.0,1189.0,63.7,2214.0,1479.0,2683.0,340.0,2775.0,2523.0,44.6,88.7,2.231


In [13]:
df = df.rename(columns={'PT08.S1(CO)': 'S1', 'PT08.S3(NOx)': 'S3', 'C6H6(GT)':'C6H6' , 'CO(GT)':'CO', 'NO2(GT)':'NO2',
                        'S1(CO)':'S1', 'NMHC(GT)':'NMHC', 'NOx(GT)':'NOx', 'S3(NOx)':'S3', 'PT08.S2(NMHC)':'S2' , 'PT08.S4(NO2)': 'S4'  ,
                       'PT08.S5(O3)': 'S5'})

In [14]:
print (df.dtypes)

Time    category
CO       float64
S1         int64
NMHC       int64
C6H6     float64
S2         int64
NOx        int64
S3         int64
NO2        int64
S4         int64
S5         int64
T        float64
RH       float64
AH       float64
dtype: object


Dataset it is alredy prepared to work with.

## Modelling

In this notebook only numeric variables are being taken into account. The varaible that we want to predredict is T (temperatura).

In [16]:
x = df[['CO','S1','NMHC','C6H6','NOx',
         'S3','NO2','AH', 'RH', 'S4', 'S5']]
y= df[['T']]

Split dataset in train and test.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42)

### First Regression

In this attent, a 5-fold cross validation it is going to being use in order to ensure and evaluate the validation of the results. I seek to evaluate different number of epochs an select the best value. 


The 'deep' architecture is a netwrk with one hidden layer. The activation function are ReLu and in the output layer linear activation function it is use to make the final predictions.

The optimazer of  the neural net is Adam and the error metric is MSE.

In [24]:
c, r = y_train.values.shape
y_train2 = y_train.values.reshape(c,)

In [25]:
y_train2.shape

(7485,)

In [26]:
X_train2 = X_train.values

ep is the list with all the values that I'm going to use.

In [45]:
ep = [10,20,30,40,50,60,70,80,90,100, 110, 120,130,140,150,160]

In [46]:
resul2 = []

for i in ep:
    
    from sklearn.model_selection import StratifiedKFold
    seed = 15
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    cvscores_mae = []
    cvscores_r = []
    for train, test in kfold.split(X_train2, y_train2):
        
        model = Sequential()
        model.add(Dense(12, input_dim=11, kernel_initializer='normal', activation='relu'))
        model.add(Dense(8, activation='relu'))
        model.add(Dense(1, activation='linear'))
        model.summary()
        # Compile model
        model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])
        # Fit the model
        model.fit(X_train2[train], y_train2[train], epochs=10, batch_size=120, verbose=0)
        # evaluate the model
        resultado2 = model.predict(X_train2[test])
        # Metrics
        rmse = np.sqrt(np.mean((resultado2 - y_train2[test]) ** 2))
        print(rmse)
        cvscores_r.append(rmse)
    print([(np.mean(cvscores_r), np.std(cvscores_r))])
    
    resul2.append([i,np.mean(cvscores_r)])



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_499 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_500 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_501 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
58.36706894006485
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_502 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_503 (Dense)            (None, 8)                 104       
____________________________________________________________

58.879842914459026
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_532 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_533 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_534 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
60.12502111357824
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_535 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_536 (Dense)            (None, 8)                 104       
_________________________________________

61.6550010289821
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_568 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_569 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_570 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
45.71870237923624
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_571 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_572 (Dense)            (None, 8)                 104       
___________________________________________

62.05988283598736
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_601 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_602 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_603 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
62.59167987346282
[(61.010452776745296, 1.4227851866677657)]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_604 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_605 (Dense)            (None, 8)                 104       

62.63553994981724
[(61.03173502817905, 1.4570234557401547)]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_634 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_635 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_636 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
58.42407205682881
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_637 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_638 (Dense)            (None, 8)                 104       


_________________________________________________________________
59.82084810443449
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_670 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_671 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_672 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
61.4448683985688
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_673 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_674 (Dense)            (None, 8)     

61.34999966039792
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_703 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_704 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_705 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
62.01651643729777
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_706 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_707 (Dense)            (None, 8)                 104       
__________________________________________

62.57819528002908
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_736 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_737 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_738 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
62.560730273407096
[(61.16665069570172, 1.544383448659082)]


These are the results of the experiment, it is clear that all the values return the same MSE error.

In [47]:
for i in resul2:
    
    print(i)

[10, 60.877716638549735]
[20, 61.01109696013668]
[30, 61.12559622862127]
[40, 60.94764669197165]
[50, 57.674624687375356]
[60, 61.12087411545828]
[70, 61.010452776745296]
[80, 60.89681751992449]
[90, 61.03173502817905]
[100, 60.98396129037275]
[110, 60.79915990056973]
[120, 60.9252927856635]
[130, 60.837795056619235]
[140, 60.52592059654184]
[150, 61.044952096962675]
[160, 61.16665069570172]


### Test

With the number of 120 epochs, test comprobation.

In [40]:
#Define model
model = Sequential()
model.add(Dense(12, input_dim=11, kernel_initializer='normal', activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 12)                144       
_________________________________________________________________
dense_5 (Dense)              (None, 8)                 104       
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________


In [41]:
# Compile model
model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])
 

In [42]:
# Fit the model
model.fit(X_train, y_train, epochs=120, batch_size=120, verbose=0)
     

<keras.callbacks.History at 0x2d5d72e20f0>

In [44]:
# evaluate the model
resultado = model.predict(X_test)

In [45]:
# Metrics
rmse = np.sqrt(np.mean((resultado - y_test) ** 2))
print(rmse)

60.629684703754336


### Standarized variables

In this part I will going to repeat the same neural net experiment but in these case, The data will be standaried in the saacle (0,1)

In [21]:
scaler = MinMaxScaler()
print(scaler.fit(X_train))
print(scaler.fit(y_train))
xscale=scaler.transform(X_train)
yscale=scaler.transform(y_train)
xscale_test=scaler.transform(X_test)
yscale_test=scaler.transform(y_test)


MinMaxScaler(copy=True, feature_range=(0, 1))
MinMaxScaler(copy=True, feature_range=(0, 1))


In [31]:
type(yscale)

numpy.ndarray

In [32]:
c, r = yscale.shape
yscale = yscale.reshape(c,)


In [42]:
resul = []

for i in ep:
    
    seed = 15
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    cvscores_mae = []
    cvscores_r = []
    for train, test in kfold.split(xscale, yscale):
        
        model = Sequential()
        model.add(Dense(12, input_dim=11, kernel_initializer='normal', activation='relu'))
        model.add(Dense(8, activation='relu'))
        model.add(Dense(1, activation='linear'))
        model.summary()
        # Compile model
        model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])
        # Fit the model
        model.fit(xscale[train], yscale[train], epochs=i, batch_size=120, verbose=0)
        # evaluate the model
        resultado_sc = model.predict(xscale[test])
        resultado2 = scaler.inverse_transform(resultado_sc)
        y_t = scaler.inverse_transform(yscale[test].reshape(-1,1))
        # Metrics
        rmse = np.sqrt(np.mean((resultado2 - y_t) ** 2))
        print(rmse)
        cvscores_r.append(rmse)
    print([(np.mean(cvscores_r), np.std(cvscores_r))])
    
    resul.append([i,np.mean(cvscores_r)])

    
    



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_244 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_245 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_246 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
8.919525586497844
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_247 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_248 (Dense)            (None, 8)                 104       
____________________________________________________________

5.3265766488792
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_277 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_278 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_279 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
5.293233586145873
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_280 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_281 (Dense)            (None, 8)                 104       
____________________________________________

4.426389024793337
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_310 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_311 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_312 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
4.43282323537207
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_313 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_314 (Dense)            (None, 8)                 104       
___________________________________________

3.685724737947742
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_343 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_344 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_345 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
4.28720996793647
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_346 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_347 (Dense)            (None, 8)                 104       
___________________________________________

3.2753962621395933
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_376 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_377 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_378 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
4.540732482353048
[(3.908291330744561, 0.4237487646498284)]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_379 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_380 (Dense)            (None, 8)                 104       

3.8021130250311144
[(4.145035767073811, 0.5858709457241498)]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_409 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_410 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_411 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
3.4671100679380515
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_412 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_413 (Dense)            (None, 8)                 104      

3.401285781252559
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_445 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_446 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_447 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
3.0942535373312277
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_448 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_449 (Dense)            (None, 8)                 104       
_________________________________________

3.7621789614979115
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_478 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_479 (Dense)            (None, 8)                 104       
_________________________________________________________________
dense_480 (Dense)            (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________
3.5693940876334747
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_481 (Dense)            (None, 12)                144       
_________________________________________________________________
dense_482 (Dense)            (None, 8)                 104       
________________________________________

In this case, we find a different landscape. The mse it is clearly lower than in the case of not standariced variables. When the number of epochs increase a better result it is achive. At the value of 120 epochs the result achive its better result.

In [44]:
for i in resul:
    
    print(i)

[10, 10.385213124365004]
[20, 15.96664029935476]
[30, 5.969006377715782]
[40, 6.629873440124048]
[50, 4.59314084619202]
[60, 3.981308072203703]
[70, 4.066085142551195]
[80, 4.187322310896123]
[90, 3.908291330744561]
[100, 3.991188655854516]
[110, 4.145035767073811]
[120, 3.5250078822190196]
[130, 3.5518512970400025]
[140, 3.502519220176059]
[150, 3.394072090069856]
[160, 3.4709446932452517]


# Test

Although train an mlp is not a deterministic process with the best value of epochs at the last experiment a test comprabation it is going to being use.

In [18]:
#Define model
model = Sequential()
model.add(Dense(12, input_dim=11, kernel_initializer='normal', activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))
model.summary()

      
        

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 12)                144       
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 104       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
Total params: 257
Trainable params: 257
Non-trainable params: 0
_________________________________________________________________


In [19]:
# Compile model
model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])
 

In [22]:
# Fit the model
model.fit(xscale, yscale, epochs=120, batch_size=120, verbose=0)
        

<keras.callbacks.History at 0x2d5d2633940>

Predict test rersult and make the inverse transformation to get the real values of the predcitions.

In [23]:
# evaluate the model
resultado_sc = model.predict(xscale_test)
resultado2 = scaler.inverse_transform(resultado_sc)

In [24]:
resultado2

array([[28.54942 ],
       [24.90849 ],
       [21.831032],
       ...,
       [26.412212],
       [22.4318  ],
       [16.383698]], dtype=float32)

In [25]:
# Metrics
rmse = np.sqrt(np.mean((resultado2 - y_test) ** 2))
print(rmse)

T    3.268982
dtype: float64


# Conclusion

In this experiment a regression problem with keras has being made. Only numeric variables of the dataset was used and not accurate result has being achive. It is imporart to remark that the process of standariez variables outprefrom the same sistem without these step.