## Machine Learning Project 
### Neural Network Component

### Purpose: The goal of this notebook is to run the chosen neural network algorithms, training the model and testing its performance on the appropriate datasets

In [1]:
#import modules

from neural_classifier import neural_classifier
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

## Using a Combined Dataset

__Purpose:__ To evaluate the "science of cities" hypothesis, a combined dataset was created from the London and Paris datasets. The training/testing split is then randomized over several split ratios.

## Loading Data1 (SOC)

## Training/Validation Data 1

In [2]:
# Loading training data1
dftrain_data1 = pd.read_csv('Data1/Data1_train.csv')

In [3]:
# Loading validation data1
dfvalid_data1 = pd.read_csv('Data1/Data1_validation.csv')

In [4]:
#Concatenating the 2 sets to form ONE TRAINING SET
dftrainCombined1 = pd.concat([dftrain_data1, dfvalid_data1], axis=0)

# Subsetting training to take the first 10 features only
dftrain10feat1 = dftrainCombined1.iloc[:,:10].copy()
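By default `pd.concat` keeps each input's original row labels, so index values can repeat in the combined frame. The two steps above can be sketched on toy data as follows, assuming it is acceptable to renumber the rows with `ignore_index=True` and to pick features by name instead of by position. The toy frames and the names `train`, `valid`, `combined` and `feature_cols` are illustrative and not part of the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the training and validation files (hypothetical values).
train = pd.DataFrame({"mean0": [0.1, 0.2], "mean1": [1.0, 1.1],
                      "scenes": ["bus", "market"]})
valid = pd.DataFrame({"mean0": [0.3], "mean1": [1.2],
                      "scenes": ["tubestation"]})

# Stack the rows; ignore_index=True renumbers them 0..n-1 so labels are unique.
combined = pd.concat([train, valid], axis=0, ignore_index=True)

# Select the feature columns by name rather than by position.
feature_cols = [c for c in combined.columns if c.startswith("mean")]
features = combined[feature_cols].copy()

print(combined.index.tolist())
print(features.shape)
```

Selecting by name is less sensitive to column order than `iloc[:, :10]`, at the cost of relying on the `mean` prefix.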

In [5]:
dftrainCombined1.head(3)

| | mean0 | mean1 | mean2 | mean3 | mean4 | mean5 | mean6 | mean7 | mean8 | mean9 | ... | var_diff11 | var_diff12 | var_diff13 | var_diff14 | var_diff15 | var_diff16 | var_diff17 | var_diff18 | var_diff19 | scenes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.575796 | 0.717401 | 0.539698 | -1.06237 | 0.191049 | -1.723536 | -1.194319 | 0.053656 | -0.20937 | -0.200844 | ... | -0.655437 | -0.839057 | -0.300258 | -0.830228 | -0.783522 | -0.736217 | -0.683096 | -0.866754 | -0.571898 | tubestation |
| 1 | -0.626746 | 1.382491 | 0.447212 | -1.766357 | 0.009479 | 2.536803 | 0.380988 | 1.120369 | 3.892846 | 4.094601 | ... | -0.906597 | -0.561817 | -0.990822 | -0.910512 | -0.77032 | -1.019891 | -1.231433 | -0.857304 | -0.888658 | train-ter |
| 2 | -0.484043 | 0.597128 | 1.028187 | -1.412017 | 1.534679 | 0.511047 | 1.286969 | 0.967864 | -0.259907 | 0.790211 | ... | 1.939481 | 2.26083 | 2.335424 | 2.172651 | 2.198847 | 2.151799 | 1.910007 | 1.362777 | 1.451639 | bus |


In [6]:
dftrain10feat1.head(3)

| | mean0 | mean1 | mean2 | mean3 | mean4 | mean5 | mean6 | mean7 | mean8 | mean9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.575796 | 0.717401 | 0.539698 | -1.06237 | 0.191049 | -1.723536 | -1.194319 | 0.053656 | -0.20937 | -0.200844 |
| 1 | -0.626746 | 1.382491 | 0.447212 | -1.766357 | 0.009479 | 2.536803 | 0.380988 | 1.120369 | 3.892846 | 4.094601 |
| 2 | -0.484043 | 0.597128 | 1.028187 | -1.412017 | 1.534679 | 0.511047 | 1.286969 | 0.967864 | -0.259907 | 0.790211 |


## Loading Data2 (Generalizability)

## Training/Validation Data 2

In [11]:
# Loading training data2
dftrain_data2 = pd.read_csv('Data2/Data2_train.csv')

In [12]:
# Loading validation data2
dfvalid_data2 = pd.read_csv('Data2/Data2_validation.csv')

In [13]:
#Concatenating the 2 sets to form ONE TRAINING SET
dftrainCombined2 = pd.concat([dftrain_data2, dfvalid_data2], axis=0)

# Subsetting training to take the first 10 features only
dftrain10feat2 = dftrainCombined2.iloc[:,:10].copy()

In [14]:
dftrainCombined2.head(3)

| | mean0 | mean1 | mean2 | mean3 | mean4 | mean5 | mean6 | mean7 | mean8 | mean9 | ... | var_diff11 | var_diff12 | var_diff13 | var_diff14 | var_diff15 | var_diff16 | var_diff17 | var_diff18 | var_diff19 | scenes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.093323 | -0.399723 | 0.783561 | -2.520351 | 2.564316 | -0.604616 | 1.020544 | 1.818084 | -2.152912 | 1.866948 | ... | -0.980404 | -0.94053 | -1.060487 | -0.91527 | -0.933561 | -1.192804 | -1.120846 | -1.367337 | -1.184519 | train-ter |
| 1 | 0.076053 | 1.165681 | 0.960745 | -1.856342 | 0.366469 | 1.074824 | 0.338721 | 1.890659 | 3.694118 | 2.813581 | ... | -0.824649 | -0.92521 | -0.656642 | -0.723337 | -0.911938 | -0.838659 | -0.662626 | -0.954069 | -0.782338 | train-ter |
| 2 | 0.855206 | 1.212459 | -0.488913 | 0.914367 | -0.056152 | -0.089688 | 1.16052 | 0.783487 | 0.425175 | -0.174028 | ... | -0.758511 | -0.88437 | -1.231002 | -0.600003 | -0.559877 | -0.837631 | -0.818048 | -0.670121 | -0.492715 | busystreet |


In [15]:
dftrain10feat2.head(3)

| | mean0 | mean1 | mean2 | mean3 | mean4 | mean5 | mean6 | mean7 | mean8 | mean9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.093323 | -0.399723 | 0.783561 | -2.520351 | 2.564316 | -0.604616 | 1.020544 | 1.818084 | -2.152912 | 1.866948 |
| 1 | 0.076053 | 1.165681 | 0.960745 | -1.856342 | 0.366469 | 1.074824 | 0.338721 | 1.890659 | 3.694118 | 2.813581 |
| 2 | 0.855206 | 1.212459 | -0.488913 | 0.914367 | -0.056152 | -0.089688 | 1.16052 | 0.783487 | 0.425175 | -0.174028 |


### Creating labels

In [20]:
dftrainCombined1.scenes.unique()

array(['tubestation', 'train-ter', 'bus', 'market', 'restaurant',
       'busystreet', 'quietstreet'], dtype=object)

### Groups (zero-indexed, 6 groups in total)
0 - tubestation, 1 - quietstreet, 2 - busystreet, 3 - restaurant, 4 - market, 5 - general

#### Scene values assigned to the "general" group (number 5)
general group = bus, train-ter
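One alternative way to express this grouping is a flat dictionary applied with `Series.map`. The following is a sketch on toy rows, not the notebook's method; `scene_to_group`, `df` and `unmapped` are illustrative names, and the `tube` row is included only to show what happens to a scene that has no entry in the mapping.

```python
import pandas as pd

# Flat scene -> group mapping; 'bus' and 'train-ter' share the "general" group 5.
scene_to_group = {
    "tubestation": 0, "quietstreet": 1, "busystreet": 2,
    "restaurant": 3, "market": 4, "bus": 5, "train-ter": 5,
}

# Toy frame (hypothetical rows); 'tube' is deliberately absent from the mapping.
df = pd.DataFrame({"scenes": ["tubestation", "bus", "train-ter", "market", "tube"]})

# Series.map looks each scene up in the dict; scenes without an entry become NaN.
df["scene_type"] = df["scenes"].map(scene_to_group)

# List the scenes that did not receive a group.
unmapped = df.loc[df["scene_type"].isna(), "scenes"].unique().tolist()
print(unmapped)
```

With this approach an unexpected scene name shows up as a missing value instead of silently receiving a default group, which makes label problems easier to spot before training.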

In [21]:
# create a dictionary of tuples (the keys require the "," after the string so python sees it as tuple) 
# for each scene and group the general scenes into one. Zero-indexed for NN function
scene_list = {("tubestation",): 0, ("quietstreet",): 1, ("busystreet",): 2, ("restaurant",): 3, 
             ("market",): 4, ("bus", "train-ter",): 5}

In [22]:
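# create function to loop through and assign the number values to each scene type
# since column 'scene_type' doesn't exist, initialize with zeros then check uniques
# note: any scene missing from label_list keeps the initial value 0

def createLabels(label_list, dataset):
    # np.int was removed from NumPy; the builtin int gives the same integer dtype
    dataset['scene_type'] = np.zeros(len(dataset), dtype=int)
    
    # .items() runs on Python 2 and 3; .loc assigns in place and avoids chained assignment
    for key, value in label_list.items():
        dataset.loc[dataset['scenes'].isin(key), 'scene_type'] = value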


In [23]:
createLabels(scene_list, dftrainCombined1)
dftrainCombined1[["scenes","scene_type"]][:10]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


| | scenes | scene_type |
|---|---|---|
| 0 | tubestation | 0 |
| 1 | train-ter | 5 |
| 2 | bus | 5 |
| 3 | market | 4 |
| 4 | train-ter | 5 |
| 5 | tubestation | 0 |
| 6 | train-ter | 5 |
| 7 | bus | 5 |
| 8 | restaurant | 3 |
| 9 | busystreet | 2 |


In [24]:
createLabels(scene_list, dftrainCombined2)
dftrainCombined2[["scenes","scene_type"]][:10]


| | scenes | scene_type |
|---|---|---|
| 0 | train-ter | 5 |
| 1 | train-ter | 5 |
| 2 | busystreet | 2 |
| 3 | restaurant | 3 |
| 4 | market | 4 |
| 5 | busystreet | 2 |
| 6 | market | 4 |
| 7 | train-ter | 5 |
| 8 | market | 4 |
| 9 | restaurant | 3 |


In [25]:
createLabels(scene_list, dftest_data1)
dftest_data1[["scenes","scene_type"]][:10]


| | scenes | scene_type |
|---|---|---|
| 0 | bus | 5 |
| 1 | bus | 5 |
| 2 | bus | 5 |
| 3 | bus | 5 |
| 4 | bus | 5 |
| 5 | bus | 5 |
| 6 | bus | 5 |
| 7 | bus | 5 |
| 8 | bus | 5 |
| 9 | bus | 5 |


In [26]:
createLabels(scene_list, dftest_data2)
dftest_data2[["scenes","scene_type"]][:10]


| | scenes | scene_type |
|---|---|---|
| 0 | bus | 5 |
| 1 | busystreet | 2 |
| 2 | train-ter | 5 |
| 3 | busystreet | 2 |
| 4 | quietstreet | 1 |
| 5 | restaurant | 3 |
| 6 | tube | 0 |
| 7 | market | 4 |
| 8 | tubestation | 0 |
| 9 | bus | 5 |


In [27]:
ytrain_labels1 = dftrainCombined1.scene_type.copy()
ytrain_labels1.head(3)

0    0
1    5
2    5
Name: scene_type, dtype: int64

In [28]:
ytrain_labels2 = dftrainCombined2.scene_type.copy()
ytrain_labels2.head(3)

0    5
1    5
2    2
Name: scene_type, dtype: int64

## NN 1

In [29]:
validationSize = [0.30, 0.28, 0.26, 0.24, 0.22, 0.20]

for i in validationSize:
    
    # splitting the dataset; i is the fraction held out for testing
    X_train, X_test, Y_train, Y_test = train_test_split(dftrain10feat1, ytrain_labels1, test_size=i, random_state=39)
    
    # Printing reference header (the parenthesized form runs on Python 2 and 3)
    testing = 100 * i
    training = 100 - testing
    print('\n\n **** For %d/%d data split ratio **** : \n' % (training, testing))
    
    nc = neural_classifier()
    nc.train(learning_rate=0.85, n_epochs=7000,
             X_train=X_train, Y_train=Y_train, X_test=X_test, Y_test=Y_test, batch_size=200,
             print_frequency=2000, n_in=10, n_out=6, n_hidden=7, n_layers=2)



 **** For 70/30 data split ratio **** : 

using training set as validation set...
... building the model
... training the model
epoch 500, minibatch 4/4, validation error 10.125000 %
epoch 500, minibatch 4/4, test error of best model 22.250000 %
epoch 1000, minibatch 4/4, validation error 10.125000 %
epoch 1500, minibatch 4/4, validation error 9.125000 %
epoch 1500, minibatch 4/4, test error of best model 23.500000 %
epoch 2000, minibatch 4/4, validation error 8.875000 %
epoch 2000, minibatch 4/4, test error of best model 22.250000 %
epoch 2500, minibatch 4/4, validation error 9.250000 %
epoch 3000, minibatch 4/4, validation error 9.000000 %
epoch 3500, minibatch 4/4, validation error 8.625000 %
epoch 3500, minibatch 4/4, test error of best model 21.500000 %
epoch 4000, minibatch 4/4, validation error 8.250000 %
epoch 4000, minibatch 4/4, test error of best model 20.750000 %
epoch 4500, minibatch 4/4, validation error 8.375000 %
epoch 5000, minibatch 4/4, validation error 8.250000 %


The code for file neural_classifier.pyc ran for 5.3s


... training the model
epoch 400, minibatch 5/5, validation error 11.200000 %
epoch 400, minibatch 5/5, test error of best model 23.500000 %
epoch 800, minibatch 5/5, validation error 11.600000 %
epoch 1200, minibatch 5/5, validation error 11.500000 %
Optimization complete with best validation score of 11.200000 %,with test performance 23.500000 %
The code ran for 1600 epochs, with 1084.428038 epochs/sec


 **** For 74/26 data split ratio **** : 

using training set as validation set...
... building the model


The code for file neural_classifier.pyc ran for 1.5s


... training the model
epoch 400, minibatch 5/5, validation error 11.600000 %
epoch 400, minibatch 5/5, test error of best model 24.500000 %
epoch 800, minibatch 5/5, validation error 10.200000 %
epoch 800, minibatch 5/5, test error of best model 24.000000 %
epoch 1200, minibatch 5/5, validation error 10.300000 %
epoch 1600, minibatch 5/5, validation error 10.400000 %
epoch 2000, minibatch 5/5, validation error 10.900000 %
epoch 2400, minibatch 5/5, validation error 11.400000 %
epoch 2800, minibatch 5/5, validation error 11.100000 %
Optimization complete with best validation score of 10.200000 %,with test performance 24.000000 %
The code ran for 3200 epochs, with 1048.156983 epochs/sec


 **** For 76/24 data split ratio **** : 

using training set as validation set...
... building the model


The code for file neural_classifier.pyc ran for 3.1s


... training the model
epoch 400, minibatch 5/5, validation error 10.000000 %
epoch 400, minibatch 5/5, test error of best model 23.000000 %
epoch 800, minibatch 5/5, validation error 9.400000 %
epoch 800, minibatch 5/5, test error of best model 24.000000 %
epoch 1200, minibatch 5/5, validation error 9.000000 %
epoch 1200, minibatch 5/5, test error of best model 23.000000 %
epoch 1600, minibatch 5/5, validation error 8.600000 %
epoch 1600, minibatch 5/5, test error of best model 21.000000 %
epoch 2000, minibatch 5/5, validation error 8.800000 %
epoch 2400, minibatch 5/5, validation error 8.700000 %
epoch 2800, minibatch 5/5, validation error 9.000000 %
epoch 3200, minibatch 5/5, validation error 9.000000 %
epoch 3600, minibatch 5/5, validation error 9.000000 %
epoch 4000, minibatch 5/5, validation error 8.800000 %
epoch 4400, minibatch 5/5, validation error 8.800000 %
epoch 4800, minibatch 5/5, validation error 8.500000 %
epoch 4800, minibatch 5/5, test error of best model 22.000000 %


The code for file neural_classifier.pyc ran for 6.7s




 **** For 78/22 data split ratio **** : 

using training set as validation set...
... building the model
... training the model
epoch 400, minibatch 5/5, validation error 10.400000 %
epoch 400, minibatch 5/5, test error of best model 25.000000 %
epoch 800, minibatch 5/5, validation error 9.200000 %
epoch 800, minibatch 5/5, test error of best model 22.000000 %
epoch 1200, minibatch 5/5, validation error 9.400000 %
epoch 1600, minibatch 5/5, validation error 9.100000 %
epoch 1600, minibatch 5/5, test error of best model 21.500000 %
epoch 2000, minibatch 5/5, validation error 8.900000 %
epoch 2000, minibatch 5/5, test error of best model 22.500000 %
epoch 2400, minibatch 5/5, validation error 8.700000 %
epoch 2400, minibatch 5/5, test error of best model 22.000000 %
epoch 2800, minibatch 5/5, validation error 8.800000 %
epoch 3200, minibatch 5/5, validation error 8.800000 %
epoch 3600, minibatch 5/5, validation error 8.400000 %
epoch 3600, minibatch 5/5, test error of best model 22.000

The code for file neural_classifier.pyc ran for 6.7s


... training the model
epoch 400, minibatch 5/5, validation error 11.600000 %
epoch 400, minibatch 5/5, test error of best model 24.500000 %
epoch 800, minibatch 5/5, validation error 11.400000 %
epoch 800, minibatch 5/5, test error of best model 21.000000 %
epoch 1200, minibatch 5/5, validation error 10.500000 %
epoch 1200, minibatch 5/5, test error of best model 20.500000 %
epoch 1600, minibatch 5/5, validation error 10.300000 %
epoch 1600, minibatch 5/5, test error of best model 20.500000 %
epoch 2000, minibatch 5/5, validation error 10.200000 %
epoch 2000, minibatch 5/5, test error of best model 22.000000 %
epoch 2400, minibatch 5/5, validation error 10.300000 %
epoch 2800, minibatch 5/5, validation error 9.700000 %
epoch 2800, minibatch 5/5, test error of best model 23.000000 %
epoch 3200, minibatch 5/5, validation error 9.900000 %
epoch 3600, minibatch 5/5, validation error 9.900000 %
epoch 4000, minibatch 5/5, validation error 9.900000 %
epoch 4400, minibatch 5/5, validation err

The code for file neural_classifier.pyc ran for 6.6s


<b>
- Validation Error Average : 9.2375 %
- Test Error Average       : 22.5 %
- Best split 70/30         : 7.625 % / 22 % (Validation/Test)
</b>
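Averages like these can be computed from the per-split summary lines of the log rather than by hand. The following is a minimal sketch, assuming the `Optimization complete ...` line format shown in the output. It parses only the two complete summary lines visible in this section, so its result is not expected to match the averages reported here, which presumably cover all six splits. The names `log`, `pattern` and `results` are illustrative.

```python
import re

# Two summary lines copied from the training log above.
log = """\
Optimization complete with best validation score of 11.200000 %,with test performance 23.500000 %
Optimization complete with best validation score of 10.200000 %,with test performance 24.000000 %
"""

# Capture the validation and test error percentages from each summary line.
pattern = re.compile(
    r"best validation score of ([\d.]+) %,\s*with test performance ([\d.]+) %")

# Each match yields (validation error, test error) as floats, in percent.
results = [(float(v), float(t)) for v, t in pattern.findall(log)]

mean_validation = sum(v for v, _ in results) / len(results)
mean_test = sum(t for _, t in results) / len(results)
print("validation %.3f %%, test %.3f %%" % (mean_validation, mean_test))
```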

## NN 2

In [35]:
validationSize = [0.30, 0.28, 0.26, 0.24, 0.22, 0.20]

for i in validationSize:
    
    # splitting the dataset; i is the fraction held out for testing
    X_train, X_test, Y_train, Y_test = train_test_split(dftrain10feat2, ytrain_labels2, test_size=i, random_state=39)
    
    # Printing reference header (the parenthesized form runs on Python 2 and 3)
    testing = 100 * i
    training = 100 - testing
    print('\n\n **** For %d/%d data split ratio **** : \n' % (training, testing))
    
    nc = neural_classifier()
    nc.train(learning_rate=0.45, n_epochs=7000,
             X_train=X_train, Y_train=Y_train, X_test=X_test, Y_test=Y_test, batch_size=200,
             print_frequency=2000, n_in=10, n_out=6, n_hidden=10, n_layers=2)



 **** For 70/30 data split ratio **** : 

using training set as validation set...
... building the model
... training the model
epoch 667, minibatch 2/3, validation error 10.000000 %
epoch 667, minibatch 2/3, test error of best model 17.500000 %
epoch 1334, minibatch 1/3, validation error 7.666667 %
epoch 1334, minibatch 1/3, test error of best model 20.500000 %
epoch 2000, minibatch 3/3, validation error 6.666667 %
epoch 2000, minibatch 3/3, test error of best model 19.000000 %
epoch 2667, minibatch 2/3, validation error 5.833333 %
epoch 2667, minibatch 2/3, test error of best model 20.000000 %
epoch 3334, minibatch 1/3, validation error 5.000000 %
epoch 3334, minibatch 1/3, test error of best model 20.000000 %
epoch 4000, minibatch 3/3, validation error 4.666667 %
epoch 4000, minibatch 3/3, test error of best model 21.000000 %
epoch 4667, minibatch 2/3, validation error 4.666667 %
epoch 5334, minibatch 1/3, validation error 4.166667 %
epoch 5334, minibatch 1/3, test error of best m

The code for file neural_classifier.pyc ran for 4.3s


... training the model
epoch 667, minibatch 2/3, validation error 11.000000 %
epoch 667, minibatch 2/3, test error of best model 18.000000 %
epoch 1334, minibatch 1/3, validation error 7.833333 %
epoch 1334, minibatch 1/3, test error of best model 20.000000 %
epoch 2000, minibatch 3/3, validation error 6.666667 %
epoch 2000, minibatch 3/3, test error of best model 19.000000 %
epoch 2667, minibatch 2/3, validation error 7.333333 %
epoch 3334, minibatch 1/3, validation error 7.166667 %
epoch 4000, minibatch 3/3, validation error 6.500000 %
epoch 4000, minibatch 3/3, test error of best model 20.500000 %
epoch 4667, minibatch 2/3, validation error 5.666667 %
epoch 4667, minibatch 2/3, test error of best model 20.500000 %
epoch 5334, minibatch 1/3, validation error 5.166667 %
epoch 5334, minibatch 1/3, test error of best model 20.500000 %
epoch 6000, minibatch 3/3, validation error 4.666667 %
epoch 6000, minibatch 3/3, test error of best model 20.500000 %
epoch 6667, minibatch 2/3, validati

The code for file neural_classifier.pyc ran for 4.6s




 **** For 74/26 data split ratio **** : 

using training set as validation set...
... building the model
... training the model
epoch 667, minibatch 2/3, validation error 9.000000 %
epoch 667, minibatch 2/3, test error of best model 20.000000 %
epoch 1334, minibatch 1/3, validation error 7.166667 %
epoch 1334, minibatch 1/3, test error of best model 22.500000 %
epoch 2000, minibatch 3/3, validation error 6.000000 %
epoch 2000, minibatch 3/3, test error of best model 21.000000 %
epoch 2667, minibatch 2/3, validation error 5.333333 %
epoch 2667, minibatch 2/3, test error of best model 22.000000 %
epoch 3334, minibatch 1/3, validation error 5.500000 %
epoch 4000, minibatch 3/3, validation error 5.666667 %
epoch 4667, minibatch 2/3, validation error 5.333333 %
epoch 5334, minibatch 1/3, validation error 5.000000 %
epoch 5334, minibatch 1/3, test error of best model 23.500000 %
epoch 6000, minibatch 3/3, validation error 5.166667 %
epoch 6667, minibatch 2/3, validation error 5.000000 %
Op

The code for file neural_classifier.pyc ran for 4.4s


... training the model
epoch 500, minibatch 4/4, validation error 12.875000 %
epoch 500, minibatch 4/4, test error of best model 21.000000 %
epoch 1000, minibatch 4/4, validation error 11.125000 %
epoch 1000, minibatch 4/4, test error of best model 21.000000 %
epoch 1500, minibatch 4/4, validation error 10.250000 %
epoch 1500, minibatch 4/4, test error of best model 21.000000 %
epoch 2000, minibatch 4/4, validation error 10.500000 %
epoch 2500, minibatch 4/4, validation error 9.500000 %
epoch 2500, minibatch 4/4, test error of best model 20.000000 %
epoch 3000, minibatch 4/4, validation error 9.875000 %
epoch 3500, minibatch 4/4, validation error 10.000000 %
epoch 4000, minibatch 4/4, validation error 9.625000 %
epoch 4500, minibatch 4/4, validation error 9.375000 %
epoch 4500, minibatch 4/4, test error of best model 25.000000 %
epoch 5000, minibatch 4/4, validation error 9.250000 %
epoch 5000, minibatch 4/4, test error of best model 25.000000 %
epoch 5500, minibatch 4/4, validation er

The code for file neural_classifier.pyc ran for 5.8s


... training the model
epoch 500, minibatch 4/4, validation error 11.125000 %
epoch 500, minibatch 4/4, test error of best model 19.500000 %
epoch 1000, minibatch 4/4, validation error 10.250000 %
epoch 1000, minibatch 4/4, test error of best model 21.000000 %
epoch 1500, minibatch 4/4, validation error 10.250000 %
epoch 1500, minibatch 4/4, test error of best model 20.000000 %
epoch 2000, minibatch 4/4, validation error 9.875000 %
epoch 2000, minibatch 4/4, test error of best model 21.000000 %
epoch 2500, minibatch 4/4, validation error 9.125000 %
epoch 2500, minibatch 4/4, test error of best model 23.000000 %
epoch 3000, minibatch 4/4, validation error 8.500000 %
epoch 3000, minibatch 4/4, test error of best model 23.000000 %
epoch 3500, minibatch 4/4, validation error 8.375000 %
epoch 3500, minibatch 4/4, test error of best model 24.500000 %
epoch 4000, minibatch 4/4, validation error 8.125000 %
epoch 4000, minibatch 4/4, test error of best model 24.500000 %
epoch 4500, minibatch 4/

The code for file neural_classifier.pyc ran for 5.9s


... training the model
epoch 500, minibatch 4/4, validation error 11.500000 %
epoch 500, minibatch 4/4, test error of best model 16.000000 %
epoch 1000, minibatch 4/4, validation error 9.875000 %
epoch 1000, minibatch 4/4, test error of best model 16.500000 %
epoch 1500, minibatch 4/4, validation error 9.625000 %
epoch 1500, minibatch 4/4, test error of best model 17.000000 %
epoch 2000, minibatch 4/4, validation error 9.375000 %
epoch 2000, minibatch 4/4, test error of best model 17.000000 %
epoch 2500, minibatch 4/4, validation error 9.125000 %
epoch 2500, minibatch 4/4, test error of best model 19.000000 %
epoch 3000, minibatch 4/4, validation error 9.250000 %
epoch 3500, minibatch 4/4, validation error 8.750000 %
epoch 3500, minibatch 4/4, test error of best model 20.500000 %
epoch 4000, minibatch 4/4, validation error 8.750000 %
epoch 4500, minibatch 4/4, validation error 8.500000 %
epoch 4500, minibatch 4/4, test error of best model 20.500000 %
epoch 5000, minibatch 4/4, validati

The code for file neural_classifier.pyc ran for 5.8s


<b>
- Validation Error Average : 5.97 %
- Test Error Average       : 22.75 %
- Best split 70/30         : 4 % / 20 % (Validation/Test)
</b>
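The split loop used for both networks can be tried in isolation. The following is a minimal sketch, assuming scikit-learn 0.18 or later (where `train_test_split` lives in `sklearn.model_selection`) and using synthetic data in place of the notebook's features. The `neural_classifier` module is not shown in this notebook, so the training call is left out; `X`, `y` and `split_sizes` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 10 features, 6 classes (all hypothetical).
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 10))
y = rng.randint(0, 6, size=1000)

split_sizes = {}
for test_size in [0.30, 0.28, 0.26, 0.24, 0.22, 0.20]:
    # A fixed random_state makes each split reproducible across runs.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=39)
    split_sizes[test_size] = (len(X_train), len(X_test))
    print("test_size=%.2f -> %d train rows, %d test rows"
          % (test_size, len(X_train), len(X_test)))
```

Printing the row counts per ratio makes it easy to confirm how many minibatches of a given `batch_size` each split can supply.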