This notebook runs Naive Bayes classification on the two datasets created. In Dataset 1, the model is trained and validated on data from Paris and tested on data from London. In Dataset 2, the data from both cities is mixed, and the model is trained, validated and tested on the mixed data.
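
The two evaluation set-ups described above can be sketched on synthetic data. This is only an illustration of the protocol, not the notebook's data: `make_city`, the `offset` parameter and the two-class toy features are assumptions made up for this example, and the printed numbers say nothing about the real Paris/London recordings.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(42)


def make_city(offset, n=300):
    """Synthetic two-class data; `offset` shifts every feature to mimic a city-specific bias."""
    y = rng.randint(0, 2, size=n)
    X = rng.normal(size=(n, 5)) + y[:, None] * 2.0 + offset
    return X, y


X_a, y_a = make_city(offset=0.0)  # hypothetical stand-in for the training city
X_b, y_b = make_city(offset=1.0)  # hypothetical stand-in for the other city

# Set-up 1: fit on one city, score on the other
cross_acc = GaussianNB().fit(X_a, y_a).score(X_b, y_b)

# Set-up 2: pool both cities, shuffle, then split into train and held-out parts
X_all = np.vstack([X_a, X_b])
y_all = np.concatenate([y_a, y_b])
idx = rng.permutation(len(y_all))
train_idx, test_idx = idx[:400], idx[400:]
mixed_acc = GaussianNB().fit(X_all[train_idx], y_all[train_idx]).score(
    X_all[test_idx], y_all[test_idx])

print(cross_acc)
print(mixed_acc)
```

In this toy case the held-out city has a feature shift the model never saw, which is the kind of effect the separate-cities dataset is meant to expose.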

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from sklearn.naive_bayes import GaussianNB

Populating the interactive namespace from numpy and matplotlib


## DATASET #1 (Separate Cities) ##

**TRAINING DATA**

In [2]:
# loading training dataset 

train = pd.read_csv('Data1/Data1_train.csv')

train.columns
train.head()
#train.shape

Unnamed: 0,mean0,mean1,mean2,mean3,mean4,mean5,mean6,mean7,mean8,mean9,...,var_diff11,var_diff12,var_diff13,var_diff14,var_diff15,var_diff16,var_diff17,var_diff18,var_diff19,scenes
0,0.575796,0.717401,0.539698,-1.06237,0.191049,-1.723536,-1.194319,0.053656,-0.20937,-0.200844,...,-0.655437,-0.839057,-0.300258,-0.830228,-0.783522,-0.736217,-0.683096,-0.866754,-0.571898,tubestation
1,-0.626746,1.382491,0.447212,-1.766357,0.009479,2.536803,0.380988,1.120369,3.892846,4.094601,...,-0.906597,-0.561817,-0.990822,-0.910512,-0.77032,-1.019891,-1.231433,-0.857304,-0.888658,train-ter
2,-0.484043,0.597128,1.028187,-1.412017,1.534679,0.511047,1.286969,0.967864,-0.259907,0.790211,...,1.939481,2.26083,2.335424,2.172651,2.198847,2.151799,1.910007,1.362777,1.451639,bus
3,0.308431,-0.005826,-0.932353,0.143836,-0.61964,0.029328,-0.851747,-0.587147,0.131899,-0.957469,...,0.320351,0.050667,-0.057665,0.239617,0.260419,0.66899,0.290236,0.469636,0.520772,market
4,-1.846697,-1.260525,1.77926,0.112545,0.020945,0.628017,0.52307,1.356682,1.079687,0.648798,...,3.062474,0.293887,3.250679,-0.14649,2.59112,-0.208007,0.856651,0.327243,-0.24614,train-ter


In [3]:
#Splitting to take the first 10 features only
trainSet = train.iloc[:,:10]
trainSet.head()

Unnamed: 0,mean0,mean1,mean2,mean3,mean4,mean5,mean6,mean7,mean8,mean9
0,0.575796,0.717401,0.539698,-1.06237,0.191049,-1.723536,-1.194319,0.053656,-0.20937,-0.200844
1,-0.626746,1.382491,0.447212,-1.766357,0.009479,2.536803,0.380988,1.120369,3.892846,4.094601
2,-0.484043,0.597128,1.028187,-1.412017,1.534679,0.511047,1.286969,0.967864,-0.259907,0.790211
3,0.308431,-0.005826,-0.932353,0.143836,-0.61964,0.029328,-0.851747,-0.587147,0.131899,-0.957469
4,-1.846697,-1.260525,1.77926,0.112545,0.020945,0.628017,0.52307,1.356682,1.079687,0.648798


In [4]:
# Getting the corresponding Y scenes(text)

Y_labels = train.scenes
Y_labels[:15]

0     tubestation
1       train-ter
2             bus
3          market
4       train-ter
5     tubestation
6       train-ter
7             bus
8      restaurant
9      busystreet
10     busystreet
11     busystreet
12    tubestation
13     busystreet
14    tubestation
Name: scenes, dtype: object

In [5]:
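#The function that assigns numbers to our categories
# 'market' and 'openairmarket' share code 5; scenes outside ourLabels are coded 6 ('unknown').
# ourLabels is defined in the next cell and is looked up when the function is called.

def numericLabels(x):
    return {
        ourLabels[0]: 1,
        ourLabels[1]: 2,
        ourLabels[2]: 3,
        ourLabels[3]: 4,
        ourLabels[4]: 5,
        ourLabels[5]: 5,
        'unknown': 6,
    }[x]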

In [6]:
#The categories we classify; every other scene is treated as 'unknown'
ourLabels = ['tubestation', 'quietstreet', 'busystreet', 'restaurant', 'market', 'openairmarket']

# Works in place: scenes outside ourLabels are replaced by 'unknown' in labelsText,
# then each entry of labelsNum is set to the numeric code of the matching label.
def manageLabels(labelsText, labelsNum):
    for i in range(labelsText.size):
        if labelsText[i] not in ourLabels:
            labelsText.replace(labelsText[i], 'unknown', inplace=True)
        labelsNum[i] = numericLabels(labelsText[i])
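
The same label encoding can also be written without a Python-level loop or in-place edits. The sketch below is one possible alternative, not the notebook's method: `LABEL_CODES`, `UNKNOWN_CODE` and `encode_labels` are hypothetical names, and the codes simply mirror those in `numericLabels`. It relies on `Series.map` returning `NaN` for keys missing from the dictionary.

```python
import pandas as pd

# Same categories and codes as numericLabels: 'market' and 'openairmarket'
# share code 5, and anything else falls back to 6 ('unknown').
LABEL_CODES = {
    'tubestation': 1,
    'quietstreet': 2,
    'busystreet': 3,
    'restaurant': 4,
    'market': 5,
    'openairmarket': 5,
}
UNKNOWN_CODE = 6


def encode_labels(labels_text):
    """Return a new int64 Series of codes; the input Series is left untouched."""
    return labels_text.map(LABEL_CODES).fillna(UNKNOWN_CODE).astype('int64')


# Toy input for illustration, not read from the dataset files
demo = pd.Series(['tubestation', 'train-ter', 'bus', 'market', 'openairmarket'])
codes = encode_labels(demo)
print(codes.tolist())
```

Because the function returns a new Series instead of writing into a column taken from a DataFrame, it should not trigger pandas' copy-of-a-slice warning.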


In [7]:
#Creating the labels based on what we have defined

# Copy the text labels so that writing numeric codes does not overwrite Y_labels
Y_train = Y_labels.copy()

#Calling the function
manageLabels(Y_labels, Y_train)

#converting type of new series to int
Y_train = Y_train.astype('int64')
print(Y_train[:20])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0     1
1     6
2     6
3     5
4     6
5     1
6     6
7     6
8     4
9     3
10    3
11    3
12    1
13    3
14    1
15    5
16    3
17    2
18    1
19    3
Name: scenes, dtype: int64


**VALIDATION DATA**

In [8]:
# loading validation dataset

test = pd.read_csv('Data1/Data1_validation.csv')

test.head()

Unnamed: 0,mean0,mean1,mean2,mean3,mean4,mean5,mean6,mean7,mean8,mean9,...,var_diff11,var_diff12,var_diff13,var_diff14,var_diff15,var_diff16,var_diff17,var_diff18,var_diff19,scenes
0,-0.26375,-0.014295,1.149765,0.136582,0.398016,0.8501,2.028252,1.771847,0.884094,0.80341,...,-0.325681,-0.015936,-0.178232,0.091926,-0.120727,-0.305149,-0.221091,-0.715623,-0.561421,bus
1,0.098467,-1.334234,1.788847,-0.054485,1.273843,-0.497433,1.077276,1.745969,-0.234839,0.731322,...,0.908786,0.528635,0.45138,0.601195,0.567032,0.316984,0.207231,0.103127,0.259936,quietstreet
2,0.4042,-0.804345,-0.23149,0.401523,-0.025396,-0.414613,0.142179,-0.028487,-0.478219,-0.685566,...,1.780729,2.171094,1.456386,1.274684,1.318548,1.116542,1.785046,1.095505,1.928737,restaurant
3,0.3993,0.348493,-0.497847,-0.259436,-0.061074,-0.519319,0.145534,-0.263331,0.201231,0.030069,...,-0.577582,-0.644123,-0.750824,-0.478993,-0.44041,-0.474857,-0.416666,-0.312619,-0.65835,busystreet
4,-0.032688,-1.436713,1.478498,-0.519802,2.948268,0.365212,-0.865947,-0.826244,-1.964569,2.023665,...,1.68434,1.98412,2.184567,2.672278,2.379555,2.125408,2.289126,2.560299,4.026152,bus


In [9]:
#Splitting to take the first 10 features only
testSet = test.iloc[:,:10]
testSet.head()

Unnamed: 0,mean0,mean1,mean2,mean3,mean4,mean5,mean6,mean7,mean8,mean9
0,-0.26375,-0.014295,1.149765,0.136582,0.398016,0.8501,2.028252,1.771847,0.884094,0.80341
1,0.098467,-1.334234,1.788847,-0.054485,1.273843,-0.497433,1.077276,1.745969,-0.234839,0.731322
2,0.4042,-0.804345,-0.23149,0.401523,-0.025396,-0.414613,0.142179,-0.028487,-0.478219,-0.685566
3,0.3993,0.348493,-0.497847,-0.259436,-0.061074,-0.519319,0.145534,-0.263331,0.201231,0.030069
4,-0.032688,-1.436713,1.478498,-0.519802,2.948268,0.365212,-0.865947,-0.826244,-1.964569,2.023665


In [10]:
# Getting the corresponding Y scenes(text)

Y_labelsT = test.scenes
Y_labelsT[:15]

0             bus
1     quietstreet
2      restaurant
3      busystreet
4             bus
5     tubestation
6             bus
7       train-ter
8      busystreet
9     quietstreet
10         market
11     busystreet
12      train-ter
13            bus
14    tubestation
Name: scenes, dtype: object

In [11]:
#Creating the labels based on what we have defined

# Copy the text labels so that writing numeric codes does not overwrite Y_labelsT
Y_test = Y_labelsT.copy()

#Calling the function
manageLabels(Y_labelsT, Y_test)

#converting type of new series to int
Y_test = Y_test.astype('int64')
print(len(Y_test))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


291


**NAIVE BAYES TEST #1**

In [12]:
# Fit Gaussian Naive Bayes on the training features, then predict the validation set
gnb = GaussianNB()
y_pred = gnb.fit(trainSet, Y_train).predict(testSet)

print(len(y_pred))

291


In [13]:
# Accuracy: fraction of validation samples whose predicted code matches the true code
1.0*(y_pred==Y_test).sum()/len(y_pred)


0.68384879725085912

## DATASET #2 (Mixed data) ##

**TRAINING DATA**

In [14]:
# loading training dataset 

train2 = pd.read_csv('Data2/Data2_train.csv', header=0, skiprows=-64)

#train2 = train[:265]

#train.columns
#train2.head()
train2.shape

(794, 161)

In [15]:
trainSet2 = train2.iloc[:,:10]
trainSet2.head()

Unnamed: 0,mean0,mean1,mean2,mean3,mean4,mean5,mean6,mean7,mean8,mean9
0,0.093323,-0.399723,0.783561,-2.520351,2.564316,-0.604616,1.020544,1.818084,-2.152912,1.866948
1,0.076053,1.165681,0.960745,-1.856342,0.366469,1.074824,0.338721,1.890659,3.694118,2.813581
2,0.855206,1.212459,-0.488913,0.914367,-0.056152,-0.089688,1.16052,0.783487,0.425175,-0.174028
3,0.347927,-1.2414,-0.085663,1.107282,0.342155,-0.15128,-0.556604,0.207628,-0.366518,-0.758259
4,0.752063,-1.192575,-0.296607,-0.540139,0.113057,-1.612495,-0.691072,-0.810711,-1.709526,-0.551167


In [16]:
# Getting the corresponding Y scenes(text)

Y_labels2 = train2.scenes
Y_labels2[:15]

0       train-ter
1       train-ter
2      busystreet
3      restaurant
4          market
5      busystreet
6          market
7       train-ter
8          market
9      restaurant
10     restaurant
11      train-ter
12     busystreet
13    quietstreet
14            bus
Name: scenes, dtype: object

In [17]:
#Creating the labels based on what we have defined

# Copy the text labels so that writing numeric codes does not overwrite Y_labels2
Y_train2 = Y_labels2.copy()

#Calling the function
manageLabels(Y_labels2, Y_train2)

#converting type of new series to int
Y_train2 = Y_train2.astype('int64')
print(Y_train2[:20])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0     6
1     6
2     3
3     4
4     5
5     3
6     5
7     6
8     5
9     4
10    4
11    6
12    3
13    2
14    6
15    6
16    1
17    4
18    6
19    1
Name: scenes, dtype: int64


**VALIDATION DATA**

In [18]:
# loading validation dataset

test2 = pd.read_csv('Data2/Data2_validation.csv')

test2 = test2[:265]

test2.head()
test2.shape

(265, 161)

In [19]:
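#Splitting to take the first 10 features only
# Slice test2 (Dataset 2) so the features match Y_test2; slicing `test` here
# would pair Dataset 1 features with Dataset 2 labels.
testSet2 = test2.iloc[:,:10]
testSet2.head()

testSet2 = testSet2[:265]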

In [20]:
# getting labels
Y_labels2 = test2.scenes
Y_labels2[:15]

0     quietstreet
1             bus
2      busystreet
3          office
4      restaurant
5      restaurant
6       train-ter
7      busystreet
8     quietstreet
9      busystreet
10         market
11    quietstreet
12            bus
13     restaurant
14    tubestation
Name: scenes, dtype: object

In [21]:
#Creating the labels based on what we have defined

# Copy the text labels so that writing numeric codes does not overwrite Y_labels2
Y_test2 = Y_labels2.copy()

#Calling the function
manageLabels(Y_labels2, Y_test2)

#converting type of new series to int
#Y_test2 = Y_test2.astype('int64')
print(len(Y_test2))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


265


**NAIVE BAYES TEST #2**

In [25]:
gnb2 = GaussianNB()
y_pred2 = gnb2.fit(trainSet2, Y_train2).predict(testSet2)

len(y_pred2)

265

In [26]:
# Accuracy: fraction of validation samples whose predicted code matches the true code
1.0*(y_pred2==Y_test2).sum()/len(y_pred2)


0.21886792452830189
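
An accuracy figure is easier to read next to a trivial baseline. With six label codes, uniform random guessing would be expected to score about 1/6 ≈ 0.17. The sketch below compares `GaussianNB` with scikit-learn's `DummyClassifier` (always predicting the most frequent training class) on synthetic data; the three-class toy features and the 200/100 split are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)

# Synthetic stand-in: 3 classes, 10 features, class means shifted apart
n_per_class = 100
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, 10))
               for c in range(3)])
y = np.repeat([1, 2, 3], n_per_class)

# Shuffle, then hold out the last 100 samples
idx = rng.permutation(len(y))
train_idx, test_idx = idx[:200], idx[200:]

# Baseline ignores the features and always predicts the majority training class
baseline = DummyClassifier(strategy='most_frequent').fit(X[train_idx], y[train_idx])
model = GaussianNB().fit(X[train_idx], y[train_idx])

baseline_acc = baseline.score(X[test_idx], y[test_idx])
model_acc = model.score(X[test_idx], y[test_idx])
print(baseline_acc)
print(model_acc)
```

Running the same comparison with `trainSet2`, `Y_train2`, `testSet2` and `Y_test2` would show how far the Naive Bayes score sits from the majority-class baseline.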