# REALITY MINING
## Authors: Nerea Losada and Maitane Martinez
## Objectives:
1) Select the approach to apply to the data and the Python implementation to use. 

2) Preprocess the data as required for the approach chosen. 

3) Apply the algorithm, describe the results, and explain why these results are useful, interesting, or reveal any insight about the process.

## What is done in the notebook:

In this notebook we can find:

1) The objective of our project

2) The choice of task

3) The description of our dataset

4) The reading of the dataset and its interpretation

5) The preprocessing to get a matrix with all the data

6) The preparation of the data for classification

7) The definition of the chosen classifiers and their learning

8) The predictions made

9) The validation of the model

10) The visualization of the results

The dataset: https://drive.google.com/file/d/1e6OJhAlQ75wrHk3ZN7mFTNxwACN5yytP/view?usp=sharing

# Importing the libraries
We start by importing all the libraries used in the notebook: pandas, scikit-learn (sklearn), NumPy, SciPy and Counter (from collections).

In [1]:
# To read our dataset from a .mat file
import scipy.io as sio

import numpy as np
from collections import Counter
import pandas as pd

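# Imputation methods
# (Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead)
from sklearn.preprocessing import Imputer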

# Classifiers and metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Reading the dataset

After analyzing our dataset and its features, we decided that our task is to classify the instances according to their affiliation. To that end, we first read the realitymining.mat file, which contains 106 instances of the classification problem.

We select the features that we consider most relevant for solving our task. The selected features are:

    -'GROUP' - The subject's research group
    -'REGULAR' - If the subject reports having a regular working schedule
    -'PREDICTABLE' - If the subject reports having a predictable schedule
    -'TRAVEL' - If the subject reports often traveling
    -'TEXTS' - How often the subject reports sending text messages
    -'LOCS' - The unique set of towers seen by the subject
    -'APPS' - The set of times when a user started an application
    -'Di' - Inferred locations at each hour of the day: where the subject is at hour i, with 1<=i<=24
    -'SMS' - Number of text messages sent and received
    -'VOICE' - Number of voice calls made and received
    

Task: Classify the instances according to whether the subjects are:

    -mlgrad (Media Lab Graduate Student, not a first year)

    -1st year grad (Media Lab First Year Graduate Student)

    -mlfrosh (Media Lab First Year Undergraduate Student)

    -mlstaff (Media Lab Staff)

    -mlurop (Media Lab Undergraduate)

    -professor (Media Lab Professor)

    -sloan (Sloan Business School)

In [2]:
# We have taken the attributes from the web page of the dataset 
# (http://realitycommons.media.mit.edu/realitymining.html)
A = sio.loadmat('realitymining.mat')

In [3]:
# Now the file is read. We take the 's' struct, where the features are described
s = A['s']

Now we are going to start taking our features one by one.

In [4]:
# We get the values of the feature 'GROUP' -> The subject's research group
G1 = s[0,:]['my_group']
G=[]
for i in range(G1.shape[0]):
    if G1[i].shape[0]!=0:
        if G1[i][0,0].shape[0]==0:
            G.append(np.NaN)
        else:
            G.append(G1[i][0,0][0])
    else:
        G.append(np.NaN)

In [5]:
# We get the values of the feature 'REGULAR' -> Does the subject report having a regular working schedule
R1 = s[0,:]['my_regular']
R=[]
for i in range(R1.shape[0]):
    if R1[i].shape[0]!=0:
        if R1[i][0,0].shape[0]==0:
            R.append(np.NaN)
        else:
            R.append(R1[i][0,0][0])
    else:
        R.append(np.NaN)

In [6]:
# We get the values of the feature 'PREDICTABLE' -> Does the subject report having a predictable schedule
P1 = s[0,:]['my_predictable']
P=[]
for i in range(P1.shape[0]):
    if P1[i].shape[0]!=0:
        if P1[i][0,0].shape[0]==0:
            P.append(np.NaN)
        else:
            P.append(P1[i][0,0][0])
    else:
        P.append(np.NaN)

In [7]:
# We get the values of the feature 'TRAVEL' -> Does the subject report often traveling
T1 = s[0,:]['my_travel']
T=[]
for i in range(T1.shape[0]):
    if T1[i].shape[0]!=0:
        if T1[i][0,0].shape[0]==0:
            T.append(np.NaN)
        else:
            T.append(T1[i][0,0][0])
    else:
        T.append(np.NaN)

In [8]:
# We get the values of the feature 'TEXTS' -> How often the subject reports sending text messages
TX1 = s[0,:]['my_texts']
TX=[]
for i in range(TX1.shape[0]):
    if TX1[i].shape[0]!=0:
        if TX1[i][0,0].shape[0]==0:
            TX.append(np.NaN)
        else:
            TX.append(TX1[i][0,0][0])
    else:
        TX.append(np.NaN)
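
The five cells above repeat the same extraction loop for `my_group`, `my_regular`, `my_predictable`, `my_travel` and `my_texts`. As a minimal sketch, that loop could be factored into a single helper. The names `extract_survey_field` and `_wrap`, and the toy arrays, are our own illustrative assumptions; they only imitate the nested layout that the loops above rely on.

```python
import numpy as np

def extract_survey_field(column):
    """Return one value per subject, or NaN when the answer is missing.

    `column` is assumed to have the nested layout handled above: each entry
    is either an empty array or a 1x1 object array wrapping an array of strings.
    """
    values = []
    for cell in column:
        if cell.shape[0] != 0 and cell[0, 0].shape[0] != 0:
            values.append(cell[0, 0][0])
        else:
            values.append(np.nan)
    return values

def _wrap(answers):
    """Build a toy 1x1 object array imitating one loaded survey cell."""
    outer = np.empty((1, 1), dtype=object)
    outer[0, 0] = np.array(answers)
    return outer

# Toy column with three subjects (made-up data, not taken from the dataset)
toy_column = np.empty(3, dtype=object)
toy_column[0] = _wrap(['sandy'])   # answered
toy_column[1] = _wrap([])          # wrapper present, answer empty
toy_column[2] = np.empty((0, 0))   # no record at all

toy_values = extract_survey_field(toy_column)
print(toy_values)
```

With such a helper, each feature would need a single call, for example `G = extract_survey_field(s[0,:]['my_group'])`.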

In [9]:
# We get the values of the feature 'LOCS' -> The unique set of towers seen by the subject
# As each subject has a long list of tower IDs, we keep only the most frequent one, that is,
# the ID of the tower the subject sees most often. This tells us roughly where the subject usually is
def most_frequent(tower_lists):
    # Parameter renamed so that it no longer shadows the built-in `list`
    l=np.concatenate(tower_lists,axis=0)
    ll=[int(i) for i in l]
    occurrence_count=Counter(ll)
    return occurrence_count.most_common(1)[0][0]
    
L1 = s[0,:]['all_locs']
L=[]
for i in range(L1.shape[0]):
    if L1[i].shape[0]==1:
        L.append(np.NaN)
    else:
        L.append(most_frequent(L1[i]))

In [10]:
# We get the values of the feature 'APPS' -> The time each application was started and the total number of times 
# the app was used
# As in the previous feature, we use most_frequent, this way, we get the most used application for each subject 
def most_frequent1(list):
    l=np.concatenate(list,axis=0)
    occurrence_count=Counter(l)
    return occurrence_count.most_common(1)[0][0]
    
A1 = s[0,:]['apps']
A=[]
for i in range(A1.shape[0]):
    if A1[i].shape[1]==0:
        A.append(np.NaN)
    else:
        A.append(most_frequent1(A1[i][0]))

In [11]:
# We get the values of the feature 'DATA' -> Inferred locations at each hour of the day.
# Now, we are going to get 24 of our features, those that tell us where the subjects are at every hour of the day
# For that aim, we are going to take from each subject the most frequent location at each hour, since we have 
# information of many days
def most_frequent2(list):
    l1=list[list != 0]
    l=l1[np.logical_not(np.isnan(l1))]
    if l.shape[0] == 0:
        return np.NaN
    else:
        occurrence_count=Counter(l)
        return occurrence_count.most_common(1)[0][0]
    
D = s[0,:]['data_mat']
D1=[]; D2=[]; D3=[]; D4=[]; D5=[]; D6=[]; D7=[]; D8=[]; D9=[]; D10=[]; D11=[]; D12=[];D13=[]
D14=[]; D15=[]; D16=[]; D17=[]; D18=[]; D19=[]; D20=[]; D21=[]; D22=[]; D23=[]; D24=[]


for i in range(D.shape[0]):
    if D[i].shape[1]==0:
        D1.append(np.NaN); D2.append(np.NaN); D3.append(np.NaN); D4.append(np.NaN); D5.append(np.NaN); D6.append(np.NaN)
        D7.append(np.NaN); D8.append(np.NaN); D9.append(np.NaN); D10.append(np.NaN); D11.append(np.NaN); D12.append(np.NaN)
        D13.append(np.NaN); D14.append(np.NaN); D15.append(np.NaN); D16.append(np.NaN); D17.append(np.NaN); D18.append(np.NaN)
        D19.append(np.NaN); D20.append(np.NaN); D21.append(np.NaN); D22.append(np.NaN); D23.append(np.NaN); D24.append(np.NaN)

    else:
        D1.append(most_frequent2(D[i][0])); D2.append(most_frequent2(D[i][1])); D3.append(most_frequent2(D[i][2])); D4.append(most_frequent2(D[i][3]))
        D5.append(most_frequent2(D[i][4])); D6.append(most_frequent2(D[i][5])); D7.append(most_frequent2(D[i][6])); D8.append(most_frequent2(D[i][7]))
        D9.append(most_frequent2(D[i][8])); D10.append(most_frequent2(D[i][9])); D11.append(most_frequent2(D[i][10])); D12.append(most_frequent2(D[i][11]))
        D13.append(most_frequent2(D[i][12])); D14.append(most_frequent2(D[i][13])); D15.append(most_frequent2(D[i][14])); D16.append(most_frequent2(D[i][15]))
        D17.append(most_frequent2(D[i][16])); D18.append(most_frequent2(D[i][17])); D19.append(most_frequent2(D[i][18])); D20.append(most_frequent2(D[i][19]))
        D21.append(most_frequent2(D[i][20])); D22.append(most_frequent2(D[i][21])); D23.append(most_frequent2(D[i][22])); D24.append(most_frequent2(D[i][23]))
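
The 24 hourly features above are built with 24 separate lists. A minimal sketch of the same idea with one list per subject is shown below. The names `hourly_mode` and `hourly_features` and the toy record are our own assumptions, and the emptiness test uses `len` rather than the `shape[1]` check of the cell above, so it would need adapting to the real arrays.

```python
import math
from collections import Counter

def hourly_mode(row):
    """Most frequent location code in one hour's row, ignoring 0 and NaN."""
    valid = [v for v in row if v != 0 and not math.isnan(v)]
    if not valid:
        return math.nan
    return Counter(valid).most_common(1)[0][0]

def hourly_features(data_mat, n_hours=24):
    """One feature per hour; all NaN when the subject has no data."""
    if len(data_mat) == 0:
        return [math.nan] * n_hours
    return [hourly_mode(data_mat[hour]) for hour in range(n_hours)]

# Toy record: 24 hours x 3 days (values are made up for illustration)
toy_record = [[1.0, 2.0, 1.0] for _ in range(24)]
toy_record[5] = [0.0, math.nan, 0.0]   # nothing usable at this hour
toy_record[7] = [3.0, 3.0, 0.0]

toy_features = hourly_features(toy_record)
print(toy_features[0], toy_features[5], toy_features[7])
```

The result for subject i could then be stored as one row of a 106 x 24 table, with columns named D1 to D24.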


In [12]:
# We get the values of the feature 'SMS' -> Number of text messages sent and received
M1 = s[0,:]['comm_sms']
M=[]
for i in range(M1.shape[0]):
    M.append(M1[i][0][0])


In [13]:
# We get the values of the feature 'VOICE' -> Number of voice calls made and received
V1 = s[0,:]['comm_voice']
V=[]
for i in range(V1.shape[0]):
    V.append(V1[i][0][0])

In [14]:
# We get the values of the classes: 'AFFIL' -> The subject's affiliation
Class1 = s[0,:]['my_affil']
Class=[]
for i in range(Class1.shape[0]):
    if Class1[i].shape[0]!=0:
        Class.append(Class1[i][0,0][0])
    else:
        Class.append(np.NaN)

# Preprocessing the dataset

In this problem there are seven classes that correspond to the types of affiliations: 'mlgrad', '1st year grad',
'mlfrosh', 'mlstaff', 'mlurop', 'professor', 'sloan'.
For using the classifiers, we need to convert each of these strings in the dataset to a number between 1 and 7.
That is what we do in the next cell.

In [15]:
affiliations = np.asarray(Class)
AF = affiliations.copy()
mlg=0; st=0; mlf=0; mls=0; mlu=0; pro=0; slo=0

for i in range(affiliations.shape[0]):
    if affiliations[i] == 'mlgrad' or affiliations[i] == 'grad':
        AF[i] = 1
        mlg+=1
    if affiliations[i] == '1styeargrad ':
        AF[i] = 2
        st+=1
    if affiliations[i] == 'mlfrosh':
        AF[i] = 3
        mlf+=1
    if affiliations[i] == 'mlstaff':
        AF[i] = 4
        mls+=1
    if affiliations[i] == 'mlurop':
        AF[i] = 5
        mlu+=1
    if affiliations[i] == 'professor':
        AF[i] = 6
        pro+=1
    if affiliations[i] == 'sloan' or affiliations[i] =='sloan_2':
        AF[i] = 7
        slo+=1
print(mlg,st,mlf,mls,mlu,pro,slo)
# Now we create the first column of the new dataset, where classes are replaced by numbers
our_dataset = pd.DataFrame(AF,columns=['CLASS'])

36 15 6 6 3 1 27
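
The chain of `if` statements above can also be read as a lookup table. The sketch below expresses the same label-to-number mapping as a dictionary. The names `AFFILIATION_CODES` and `encode_affiliations` are ours, and, unlike the cell above, labels that are missing or not listed become NaN instead of being left unchanged.

```python
import math

# Label strings copied from the conditions in the cell above
AFFILIATION_CODES = {
    'mlgrad': 1, 'grad': 1,
    '1styeargrad ': 2,      # trailing space kept as in the cell above
    'mlfrosh': 3,
    'mlstaff': 4,
    'mlurop': 5,
    'professor': 6,
    'sloan': 7, 'sloan_2': 7,
}

def encode_affiliations(labels):
    """Map each label to its class code; unknown or missing labels become NaN."""
    return [AFFILIATION_CODES.get(label, math.nan) for label in labels]

# Toy labels (made up for illustration)
toy_labels = ['mlgrad', 'grad', 'sloan_2', math.nan, 'professor']
toy_codes = encode_affiliations(toy_labels)
print(toy_codes)
```

The same pattern would apply to the GROUP, REGULAR, PREDICTABLE, TRAVEL and TEXTS encodings.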


In [16]:
# We insert the data of the 'GROUP' feature into our dataset
# First, we convert each of the groups to a number between 1 and 26,
# grouped by those that are similar
groups = np.asarray(G)
GR = groups.copy()
for i in range(groups.shape[0]):
    if groups[i] == ' Wakeborders ' or groups[i] == 'Water Skiers' or groups[i] == 'waterskiiers' or groups[i] == 'Waterskiers' or groups[i] == 'Wakeboarders' or groups[i] == 'Windsurfers' or groups[i] == 'Kiteboarders' or groups[i] == 'Surfers':
        GR[i] = 1
    if groups[i] == 'Carribean Snorklers' or groups[i] == 'Snorkelers':
        GR[i] = 2
    if groups[i] == 'Parasailors':
        GR[i] = 3  
    if groups[i] == 'admin':
        GR[i] = 4
    if groups[i] == 'andy':
        GR[i] = 5
    if groups[i] == 'barry':
        GR[i] = 6
    if groups[i] == 'bove':
        GR[i] = 7
    if groups[i] == 'chris' or groups[i] == 'chrisc' or groups[i] == 'chriss':
        GR[i] = 8
    if groups[i] == 'cynthia':
        GR[i] = 9
    if groups[i] == 'dan':
        GR[i] = 10
    if groups[i] == 'gloriana':
        GR[i] = 11
    if groups[i] == 'henry':
        GR[i] = 12
    if groups[i] == 'hugh':
        GR[i] = 13
    if groups[i] == 'joej':
        GR[i] = 14
    if groups[i] == 'john' or groups[i] == 'johnm':
        GR[i] = 15
    if groups[i] == 'judith':
        GR[i] = 16
    if groups[i] == 'marvin':
        GR[i] = 17
    if groups[i] == 'mitch':
        GR[i] = 18
    if groups[i] == 'necsys':
        GR[i] = 19
    if groups[i] == 'neil':
        GR[i] = 20
    if groups[i] == 'pattie':
        GR[i] = 21
    if groups[i] == 'roz':
        GR[i] = 22
    if groups[i] == 'sandy':
        GR[i] = 23
    if groups[i] == 'semor':
        GR[i] = 24
    if groups[i] == 'tod':
        GR[i] = 25
    if groups[i] == 'walter':
        GR[i] = 26

# As we have missing data, we are going to impute it before introducing the feature in the dataset
# Define the imputer: we use the most_frequent strategy, since we think it suits this feature
# better than the mean or median
frequently_imputer = Imputer(missing_values=np.NaN, strategy="most_frequent",axis=0)
# Fit the imputer
frequently_imputer.fit(GR.reshape(-1,1))
# Transform (impute) the data
imputed_data = frequently_imputer.transform(GR.reshape(-1,1))
# Insert the fitted data into our dataset
our_dataset['GROUP'] = pd.Series(np.concatenate(imputed_data, axis=0))
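
The `Imputer` class used in this notebook is only available in older scikit-learn releases. As a hedged sketch, assuming scikit-learn 0.20 or later is installed, the same most-frequent imputation can be written with `sklearn.impute.SimpleImputer`. The toy column is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column of group codes with one missing value (made-up data)
toy_groups = np.array([23.0, np.nan, 23.0, 9.0]).reshape(-1, 1)

# SimpleImputer always works column-wise, so it takes no `axis` argument
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy_imputed = mode_imputer.fit_transform(toy_groups).ravel()
print(toy_imputed)
```

For the ordinal survey answers, `strategy="median"` would play the role of the median imputer used in the following cells.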

In [17]:
# We insert the data of the 'REGULAR' feature into our dataset
# We have three answers: very, somewhat and not at all, so we convert each of these to a number between 1 and 3
regular = np.asarray(R)
RG = regular.copy()
for i in range(regular.shape[0]):
    if regular[i] == 'very' or regular[i] == 'Very':
        RG[i] = 1
    if regular[i] == 'somewhat':
        RG[i] = 2
    if regular[i] == 'not at all':
        RG[i] = 3

# Impute missing data
# Define the imputer: in this case we have chosen the median strategy, because for instances with missing data
# we would like to have the median value, that is, somewhat
median_imputer = Imputer(missing_values=np.NaN, strategy="median",axis=0)
# Fit the imputer
median_imputer.fit(RG.reshape(-1,1))
# Transform (impute) the data
imputed_data = median_imputer.transform(RG.reshape(-1,1))
# Insert the fitted data into our dataset
our_dataset['REGULAR'] = pd.Series(np.concatenate(imputed_data, axis=0))

In [18]:
# We insert the data of the 'PREDICTABLE' feature into our dataset
# We also have three answers: very, somewhat and not at all, so we convert each of these to a number between 1 and 3
predictable = np.asarray(P)
PR = predictable.copy()
for i in range(predictable.shape[0]):
    if predictable[i] == 'very' or predictable[i] == 'Very':
        PR[i] = 1
    if predictable[i] == 'somewhat':
        PR[i] = 2
    if predictable[i] == 'not at all':
        PR[i] = 3  
        
# Impute missing data
# Define the imputer: we use the same method as in the previous case
median_imputer = Imputer(missing_values=np.NaN, strategy="median",axis=0)
# Fit the imputer
median_imputer.fit(PR.reshape(-1,1))
# Transform (impute) the data
imputed_data = median_imputer.transform(PR.reshape(-1,1))
# Insert the fitted data into our dataset
our_dataset['PREDICTABLE'] = pd.Series(np.concatenate(imputed_data, axis=0))

In [19]:
# We insert the data of the 'TRAVEL' feature into our dataset
# We have four answers: 'Very often - more than a week/month', 'Often - week/month', 'Sometimes - several days/month'
# and 'Rarely - several days/term', so we convert each of these to a number between 1 and 4
travel = np.asarray(T)
TR = travel.copy()
for i in range(travel.shape[0]):
    if travel[i] == 'Very often - more than a week/month':
        TR[i] = 1
    if travel[i] == 'Often - week/month':
        TR[i] = 2
    if travel[i] == 'Sometimes - several days/month':
        TR[i] = 3
    if travel[i] == 'Rarely - several days/term':
        TR[i] = 4  
        
# Impute missing data
# Define the imputer: we use the same method as in the previous cases
median_imputer = Imputer(missing_values=np.NaN, strategy="median",axis=0)
# Fit the imputer
median_imputer.fit(TR.reshape(-1,1))
# Transform (impute) the data
imputed_data = median_imputer.transform(TR.reshape(-1,1))
# Insert the fitted data into our dataset
our_dataset['TRAVEL'] = pd.Series(np.concatenate(imputed_data, axis=0))

In [20]:
# We insert the data of the 'TEXTS' feature into our dataset
# We have five answers: 'Very often - several times/day', 'Often - once/day', 'Occasionally - once/week'
# 'Rarely - several days/term' and 'never', so we convert each of these to a number between 1 and 5
texts = np.asarray(TX)
TT = texts.copy()
for i in range(texts.shape[0]):
    if texts[i] == 'very often' or texts[i] == 'Very often - several times/day':
        TT[i] = 1
    if texts[i] == 'often' or texts[i] == 'Often - once/day':
        TT[i] = 2
    if texts[i] == 'occasionally' or texts[i] == 'ocasionally' or texts[i] == 'Occasionally - once/week':
        TT[i] = 3
    if texts[i] == 'Rarely - several days/term' or texts[i] =='rarely':
        TT[i] = 4  
    if texts[i] == 'never':
        TT[i] = 5  
    
# Impute missing data
# Define the imputer: we use the same method as in the previous cases
median_imputer = Imputer(missing_values=np.NaN, strategy="median",axis=0)
#Fit the imputer
median_imputer.fit(TT.reshape(-1,1))
#Transform (impute) the data
imputed_data = median_imputer.transform(TT.reshape(-1,1))
# Insert the fitted data into our dataset
our_dataset['TEXTS'] = pd.Series(np.concatenate(imputed_data, axis=0))

In [21]:
# We insert the data of the 'LOCS' feature into our dataset
locs = np.asarray(L)
# Impute missing data
# Define the imputer: we use the most_frequent strategy, since we think it suits this feature
# better than the mean or median
frequently_imputer = Imputer(missing_values=np.NaN, strategy="most_frequent",axis=0)
# Fit the imputer
frequently_imputer.fit(locs.reshape(-1,1))
# Transform (impute) the data
imputed_data = frequently_imputer.transform(locs.reshape(-1,1))
# Insert the fitted data in our dataset
our_dataset['LOCS'] = pd.Series(np.concatenate(imputed_data, axis=0))

In [22]:
# We insert the data of the 'APPS' feature into our dataset
# After taking the most used application from each subject, we have just two most used apps: Phone and ScreenSaver
apps = np.asarray(A)
AP = apps.copy()
for i in range(apps.shape[0]):
    if apps[i] == 'Phone':
        AP[i] = 1
    if apps[i] == 'ScreenSaver':
        AP[i] = 2
# Impute missing data
# Define the imputer: we use the most_frequent strategy, since we think it suits this feature
# better than the mean or median
frequently_imputer = Imputer(missing_values=np.NaN, strategy="most_frequent",axis=0)
# Fit the imputer
frequently_imputer.fit(AP.reshape(-1,1))
# Transform (impute) the data
imputed_data = frequently_imputer.transform(AP.reshape(-1,1))
# Insert the fitted data in our dataset
our_dataset['APPS'] = pd.Series(np.concatenate(imputed_data, axis=0))

In [23]:
# Impute missing data
# Define the imputer: we use the most_frequent strategy, since we think it suits these features
# better than the mean or median
frequently_imputer = Imputer(missing_values=np.NaN, strategy="most_frequent",axis=0)

#1am
d1 = np.asarray(D1)
# Fit the imputer
frequently_imputer.fit(d1.reshape(-1,1))
# Transform (impute) the data
imputed_data = frequently_imputer.transform(d1.reshape(-1,1))
# Insert the fitted data in our dataset
our_dataset['D1'] = pd.Series(np.concatenate(imputed_data, axis=0))

#2am
d2 = np.asarray(D2)
frequently_imputer.fit(d2.reshape(-1,1))
imputed_data = frequently_imputer.transform(d2.reshape(-1,1))
our_dataset['D2'] = pd.Series(np.concatenate(imputed_data, axis=0))

#3am
d3 = np.asarray(D3)
frequently_imputer.fit(d3.reshape(-1,1))
imputed_data = frequently_imputer.transform(d3.reshape(-1,1))
our_dataset['D3'] = pd.Series(np.concatenate(imputed_data, axis=0))

#4am
d4 = np.asarray(D4)
frequently_imputer.fit(d4.reshape(-1,1))
imputed_data = frequently_imputer.transform(d4.reshape(-1,1))
our_dataset['D4'] = pd.Series(np.concatenate(imputed_data, axis=0))

#5am
d5 = np.asarray(D5)
frequently_imputer.fit(d5.reshape(-1,1))
imputed_data = frequently_imputer.transform(d5.reshape(-1,1))
our_dataset['D5'] = pd.Series(np.concatenate(imputed_data, axis=0))

#6am
d6 = np.asarray(D6)
frequently_imputer.fit(d6.reshape(-1,1))
imputed_data = frequently_imputer.transform(d6.reshape(-1,1))
our_dataset['D6'] = pd.Series(np.concatenate(imputed_data, axis=0))

#7am
d7 = np.asarray(D7)
frequently_imputer.fit(d7.reshape(-1,1))
imputed_data = frequently_imputer.transform(d7.reshape(-1,1))
our_dataset['D7'] = pd.Series(np.concatenate(imputed_data, axis=0))

#8am
d8 = np.asarray(D8)
frequently_imputer.fit(d8.reshape(-1,1))
imputed_data = frequently_imputer.transform(d8.reshape(-1,1))
our_dataset['D8'] = pd.Series(np.concatenate(imputed_data, axis=0))

#9am
d9 = np.asarray(D9)
frequently_imputer.fit(d9.reshape(-1,1))
imputed_data = frequently_imputer.transform(d9.reshape(-1,1))
our_dataset['D9'] = pd.Series(np.concatenate(imputed_data, axis=0))

#10am
d10 = np.asarray(D10)
frequently_imputer.fit(d10.reshape(-1,1))
imputed_data = frequently_imputer.transform(d10.reshape(-1,1))
our_dataset['D10'] = pd.Series(np.concatenate(imputed_data, axis=0))

#11am
d11 = np.asarray(D11)
frequently_imputer.fit(d11.reshape(-1,1))
imputed_data = frequently_imputer.transform(d11.reshape(-1,1))
our_dataset['D11'] = pd.Series(np.concatenate(imputed_data, axis=0))

#12pm (noon)
d12 = np.asarray(D12)
frequently_imputer.fit(d12.reshape(-1,1))
imputed_data = frequently_imputer.transform(d12.reshape(-1,1))
our_dataset['D12'] = pd.Series(np.concatenate(imputed_data, axis=0))

#1pm
d13 = np.asarray(D13)
frequently_imputer.fit(d13.reshape(-1,1))
imputed_data = frequently_imputer.transform(d13.reshape(-1,1))
our_dataset['D13'] = pd.Series(np.concatenate(imputed_data, axis=0))

#2pm
d14 = np.asarray(D14)
frequently_imputer.fit(d14.reshape(-1,1))
imputed_data = frequently_imputer.transform(d14.reshape(-1,1))
our_dataset['D14'] = pd.Series(np.concatenate(imputed_data, axis=0))

#3pm
d15 = np.asarray(D15)
frequently_imputer.fit(d15.reshape(-1,1))
imputed_data = frequently_imputer.transform(d15.reshape(-1,1))
our_dataset['D15'] = pd.Series(np.concatenate(imputed_data, axis=0))

#4pm
d16 = np.asarray(D16)
frequently_imputer.fit(d16.reshape(-1,1))
imputed_data = frequently_imputer.transform(d16.reshape(-1,1))
our_dataset['D16'] = pd.Series(np.concatenate(imputed_data, axis=0))

#5pm
d17 = np.asarray(D17)
frequently_imputer.fit(d17.reshape(-1,1))
imputed_data = frequently_imputer.transform(d17.reshape(-1,1))
our_dataset['D17'] = pd.Series(np.concatenate(imputed_data, axis=0))

#6pm
d18 = np.asarray(D18)
frequently_imputer.fit(d18.reshape(-1,1))
imputed_data = frequently_imputer.transform(d18.reshape(-1,1))
our_dataset['D18'] = pd.Series(np.concatenate(imputed_data, axis=0))

#7pm
d19 = np.asarray(D19)
frequently_imputer.fit(d19.reshape(-1,1))
imputed_data = frequently_imputer.transform(d19.reshape(-1,1))
our_dataset['D19'] = pd.Series(np.concatenate(imputed_data, axis=0))

#8pm
d20 = np.asarray(D20)
frequently_imputer.fit(d20.reshape(-1,1))
imputed_data = frequently_imputer.transform(d20.reshape(-1,1))
our_dataset['D20'] = pd.Series(np.concatenate(imputed_data, axis=0))

#9pm
d21 = np.asarray(D21)
frequently_imputer.fit(d21.reshape(-1,1))
imputed_data = frequently_imputer.transform(d21.reshape(-1,1))
our_dataset['D21'] = pd.Series(np.concatenate(imputed_data, axis=0))

#10pm
d22 = np.asarray(D22)
frequently_imputer.fit(d22.reshape(-1,1))
imputed_data = frequently_imputer.transform(d22.reshape(-1,1))
our_dataset['D22'] = pd.Series(np.concatenate(imputed_data, axis=0))

#11pm
d23 = np.asarray(D23)
frequently_imputer.fit(d23.reshape(-1,1))
imputed_data = frequently_imputer.transform(d23.reshape(-1,1))
our_dataset['D23'] = pd.Series(np.concatenate(imputed_data, axis=0))

#12am (midnight)
d24 = np.asarray(D24)
frequently_imputer.fit(d24.reshape(-1,1))
imputed_data = frequently_imputer.transform(d24.reshape(-1,1))
our_dataset['D24'] = pd.Series(np.concatenate(imputed_data, axis=0))

In [24]:
# We insert the data of the 'SMS' feature into our dataset
# As the number of sms sent by each subject doesn't give much information itself, we have grouped them by
# making 6 relevant intervals
sms = np.asarray(M)
SMS = sms.copy()
for i in range(sms.shape[0]):
    # Boundary values (300, 600, 900, 1200) are assigned to the higher interval
    if sms[i] == 0:
        SMS[i] = 1
    if sms[i] > 0 and sms[i] < 300:
        SMS[i] = 2
    if sms[i] >= 300 and sms[i] < 600:
        SMS[i] = 3
    if sms[i] >= 600 and sms[i] < 900:
        SMS[i] = 4
    if sms[i] >= 900 and sms[i] < 1200:
        SMS[i] = 5
    if sms[i] >= 1200:
        SMS[i] = 6
# As there is no missing data for this feature, we insert it directly into the dataset
our_dataset['SMS'] = SMS

In [25]:
# We insert the data of the 'VOICE' feature into our dataset
# As the number of voice calls of each subject doesn't give much information itself, we have grouped them by
# making 6 relevant intervals
voice = np.asarray(V)
VC = voice.copy()
for i in range(voice.shape[0]):
    # Boundary values (300, 600, 900, 1200) are assigned to the higher interval
    if voice[i] == 0:
        VC[i] = 1
    if voice[i] > 0 and voice[i] < 300:
        VC[i] = 2
    if voice[i] >= 300 and voice[i] < 600:
        VC[i] = 3
    if voice[i] >= 600 and voice[i] < 900:
        VC[i] = 4
    if voice[i] >= 900 and voice[i] < 1200:
        VC[i] = 5
    if voice[i] >= 1200:
        VC[i] = 6
# As there is no missing data for this feature, we insert it directly into the dataset
our_dataset['VOICE'] = VC
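
The six-interval grouping used for SMS and VOICE can be sketched with the standard `bisect` module. This is only an illustration: the names `BIN_EDGES` and `count_to_interval` are ours, and we assume that a count equal to a boundary (300, 600, 900 or 1200) belongs to the higher interval.

```python
from bisect import bisect_right

# Upper boundaries of intervals 2 to 5; interval 1 is reserved for a count of 0
BIN_EDGES = [300, 600, 900, 1200]

def count_to_interval(count):
    """Return the interval number (1 to 6) for a message or call count."""
    if count == 0:
        return 1
    # bisect_right counts how many boundaries are <= count
    return 2 + bisect_right(BIN_EDGES, count)

toy_counts = [0, 1, 299, 300, 600, 900, 1199, 1200, 5000]
toy_intervals = [count_to_interval(c) for c in toy_counts]
print(toy_intervals)
```

Keeping the boundaries in one list makes it easy to check that every possible count falls into exactly one interval.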

In [26]:
#Remove the rows in which the class is NaN
our_dataset = our_dataset.drop([0,1,17,23,33,38,44,46,50,58,63,104], axis=0)
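
The row numbers above are written by hand. As a hedged alternative, the rows without a class could be selected automatically. The sketch below assumes that the missing classes are stored as real NaN values (if they are stored as text, they would need to be converted first), and it uses a made-up toy table.

```python
import numpy as np
import pandas as pd

# Toy table with two subjects whose class is missing (made-up data)
toy = pd.DataFrame({'CLASS': [np.nan, 2, 1, np.nan, 7],
                    'GROUP': [23.0, 23.0, 9.0, 18.0, 1.0]})

# Keep only the rows whose CLASS value is present
toy_clean = toy.dropna(subset=['CLASS'])
print(toy_clean.index.tolist())
```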

In [27]:
our_dataset

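| | CLASS | GROUP | REGULAR | PREDICTABLE | TRAVEL | TEXTS | LOCS | APPS | D1 | D2 | ... | D17 | D18 | D19 | D20 | D21 | D22 | D23 | D24 | SMS | VOICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 23.0 | 1.0 | 1.0 | 4.0 | 3.0 | 5119.0 | 2.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1 | 2 |
| 3 | 1 | 23.0 | 2.0 | 1.0 | 2.0 | 1.0 | 5188.0 | 1.0 | 1.0 | 1.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | 1.0 | 1.0 | 2 | 5 |
| 4 | 1 | 23.0 | 2.0 | 2.0 | 4.0 | 2.0 | 5188.0 | 1.0 | 1.0 | 1.0 | ... | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | 2 | 6 |
| 5 | 1 | 9.0 | 2.0 | 1.0 | 4.0 | 1.0 | 5188.0 | 1.0 | 1.0 | 1.0 | ... | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | 3 | 5 |
| 6 | 1 | 23.0 | 2.0 | 2.0 | 4.0 | 5.0 | 24127.0 | 2.0 | 2.0 | 2.0 | ... | 3.0 | 2.0 | 2.0 | 3.0 | 2.0 | 2.0 | 3.0 | 3.0 | 1 | 1 |
| 7 | 1 | 23.0 | 1.0 | 1.0 | 3.0 | 5.0 | 5188.0 | 1.0 | 1.0 | 1.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 3.0 | 2 | 4 |
| 8 | 6 | 23.0 | 1.0 | 1.0 | 2.0 | 4.0 | 5187.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2 | 2 |
| 9 | 1 | 18.0 | 1.0 | 1.0 | 4.0 | 5.0 | 32104.0 | 1.0 | 2.0 | 2.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2 | 3 |
| 10 | 4 | 1.0 | 2.0 | 1.0 | 4.0 | 3.0 | 30000.0 | 1.0 | 1.0 | 1.0 | ... | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | 2 | 3 |
| 11 | 1 | 21.0 | 2.0 | 1.0 | 4.0 | 3.0 | 5188.0 | 1.0 | 1.0 | 1.0 | ... | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | 2 | 5 |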


# Preparing data for classification

To apply the classifiers, we need to separate the features and the classes into two different sets.

In [28]:
# all_theclass contains the classes (values between 1 and 7) for the 94 samples
all_theclass = our_dataset["CLASS"]

# all_features contains the features for the 94 samples.
names_attributes=['GROUP','REGULAR','PREDICTABLE','TRAVEL','TEXTS','LOCS','APPS','D1','D2','D3','D4','D5','D6','D7','D8','D9','D10','D11','D12','D13','D14','D15','D16','D17','D18','D19','D20','D21','D22','D23','D24','SMS','VOICE']
all_features = our_dataset[names_attributes]

# Dividing the dataset into train and test sets for validation

In [29]:
# We divide the data into two sets (train and test)

# Number of samples in the train and test sets (half of the number of samples)
n_samples = int(len(all_features)/2)

# The train data are the first half of all_features
train_data = all_features[:n_samples]
train_class = all_theclass[:n_samples]

# The test data are the second half of all_features
test_data = all_features[n_samples:]
test_class = all_theclass[n_samples:]
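
The split above takes the rows in their original order, so the composition of the two halves depends on how the subjects are ordered in the file. A minimal sketch of a shuffled half split is shown below. The function name `shuffled_half_split` and the seed value are our own arbitrary choices.

```python
import random

def shuffled_half_split(n_rows, seed=0):
    """Return two lists of row positions: a shuffled first and second half."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)   # fixed seed for reproducibility
    half = n_rows // 2
    return positions[:half], positions[half:]

train_pos, test_pos = shuffled_half_split(94)
# With the DataFrames of this notebook, the positions could be used as:
# train_data = all_features.iloc[train_pos]; train_class = all_theclass.iloc[train_pos]
# test_data = all_features.iloc[test_pos];   test_class = all_theclass.iloc[test_pos]
print(len(train_pos), len(test_pos))
```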

# Defining the classifiers

We define the three classifiers used.

In [46]:
dt = DecisionTreeClassifier()
knn = KNeighborsClassifier()
rf = RandomForestClassifier(n_estimators=100)

# Learning the classifiers

We use the training data to fit the three classifiers we have chosen: Decision Tree, K-Nearest Neighbors and Random Forest.

In [47]:
dt.fit(train_data,train_class)
knn.fit(train_data, train_class)
rf.fit(train_data, train_class)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

# Using the classifier for predictions

We predict the class of the samples in the test data with the three classifiers.

In [48]:
dt_test_predictions = dt.predict(test_data)
knn_test_predictions = knn.predict(test_data)
rf_test_predictions = rf.predict(test_data)

# Validation


### Cross-validation

We tried to estimate the classifier accuracy using k-fold cross-validation with k=2, whose result would be a prediction for every instance. With our data this was not possible: scikit-learn reports that the least populated class in y has only 1 member (the single professor), which is fewer than the number of folds, so that class cannot be represented in every fold. Reducing the number of folds is not an option either, because k-fold cross-validation requires n_splits=2 or more and n_splits=1 is rejected. For this reason we validate the models on the held-out test set.
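
A class can appear in each of the k folds only if it has at least k members. The sketch below checks this for the class sizes printed by the encoding cell (36 15 6 6 3 1 27). The name `largest_stratified_k` is ours, and the check is a simple counting argument, not a call to scikit-learn.

```python
from collections import Counter

# Class sizes printed earlier in the notebook, keyed by class number
class_sizes = {1: 36, 2: 15, 3: 6, 4: 6, 5: 3, 6: 1, 7: 27}
labels = [code for code, size in class_sizes.items() for _ in range(size)]

def largest_stratified_k(labels):
    """Largest k for which every class can appear in each of the k folds."""
    return min(Counter(labels).values())

print(largest_stratified_k(labels))
```

The result is 1 because of class 6 (professor), so even k=2 cannot place every class in both folds.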



### fscore

We use the F-score (F1) as the validation metric for our models.

In [49]:
dt_score=f1_score(test_class, dt_test_predictions, average='micro') 
knn_score=f1_score(test_class, knn_test_predictions, average='micro')
rf_score=f1_score(test_class, rf_test_predictions, average='micro')
print("fscore for the decision tree: " ,dt_score)
print("fscore for the k-nearest neighbors: " ,knn_score)
print("fscore for the random forest : " ,rf_score)

fscore for the decision tree:  0.468085106383
fscore for the k-nearest neighbors:  0.531914893617
fscore for the random forest :  0.404255319149
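
With `average='micro'` and a single label per instance, the F-score equals the fraction of correctly classified instances (the accuracy). As a check, the sketch below recomputes that fraction from the decision tree confusion matrix printed in the visualization section. The dictionary layout and the function name `accuracy_from_confusion` are our own.

```python
# Decision tree confusion matrix from the visualization section:
# outer keys are true classes, inner keys are predicted classes
dt_confusion = {
    1: {1: 15, 2: 2, 3: 0, 7: 0},
    2: {1: 5, 2: 1, 3: 0, 7: 0},
    3: {1: 0, 2: 0, 3: 2, 7: 0},
    4: {1: 0, 2: 2, 3: 0, 7: 1},
    5: {1: 1, 2: 1, 3: 0, 7: 0},
    7: {1: 1, 2: 6, 3: 6, 7: 4},
}

def accuracy_from_confusion(confusion):
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = sum(row.get(true_class, 0) for true_class, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

dt_accuracy = accuracy_from_confusion(dt_confusion)
print(round(dt_accuracy, 12))
```

The diagonal holds 22 of the 47 test instances, which matches the 0.468085106383 reported for the decision tree.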


# Visualization of results

## Confusion matrix

Finally, we compute the confusion matrices for the three classifiers. We print the confusion matrices and also generate the LaTeX code to insert them in our written report.


In [50]:
print("Confusion matrix decision tree")
cm_dt = pd.crosstab(test_class,dt_test_predictions)
print(cm_dt)
cm_dt.to_latex()

Confusion matrix decision tree
col_0   1  2  3  7
CLASS             
1      15  2  0  0
2       5  1  0  0
3       0  0  2  0
4       0  2  0  1
5       1  1  0  0
7       1  6  6  4


'\\begin{tabular}{lrrrr}\n\\toprule\ncol\\_0 &   1 &  2 &  3 &  7 \\\\\nCLASS &     &    &    &    \\\\\n\\midrule\n1     &  15 &  2 &  0 &  0 \\\\\n2     &   5 &  1 &  0 &  0 \\\\\n3     &   0 &  0 &  2 &  0 \\\\\n4     &   0 &  2 &  0 &  1 \\\\\n5     &   1 &  1 &  0 &  0 \\\\\n7     &   1 &  6 &  6 &  4 \\\\\n\\bottomrule\n\\end{tabular}\n'

In [53]:
print("Confusion matrix k-nearest neighbors")
cm_knn = pd.crosstab(test_class,knn_test_predictions)
print(cm_knn)
cm_knn.to_latex()

Confusion matrix k-nearest neighbors
col_0   1  2  3   7
CLASS              
1      12  1  0   4
2       4  0  0   2
3       0  0  2   0
4       0  0  0   3
5       0  0  0   2
7       2  4  0  11


'\\begin{tabular}{lrrrr}\n\\toprule\ncol\\_0 &   1 &  2 &  3 &   7 \\\\\nCLASS &     &    &    &     \\\\\n\\midrule\n1     &  12 &  1 &  0 &   4 \\\\\n2     &   4 &  0 &  0 &   2 \\\\\n3     &   0 &  0 &  2 &   0 \\\\\n4     &   0 &  0 &  0 &   3 \\\\\n5     &   0 &  0 &  0 &   2 \\\\\n7     &   2 &  4 &  0 &  11 \\\\\n\\bottomrule\n\\end{tabular}\n'

In [52]:
print("Confusion matrix random forest")
cm_rf = pd.crosstab(test_class,rf_test_predictions)
print(cm_rf)
cm_rf.to_latex()

Confusion matrix random forest
col_0   1  2  3  4  7
CLASS                
1      13  1  1  1  1
2       5  0  0  1  0
3       0  0  2  0  0
4       0  1  0  0  2
5       1  1  0  0  0
7       2  8  1  2  4


'\\begin{tabular}{lrrrrr}\n\\toprule\ncol\\_0 &   1 &  2 &  3 &  4 &  7 \\\\\nCLASS &     &    &    &    &    \\\\\n\\midrule\n1     &  13 &  1 &  1 &  1 &  1 \\\\\n2     &   5 &  0 &  0 &  1 &  0 \\\\\n3     &   0 &  0 &  2 &  0 &  0 \\\\\n4     &   0 &  1 &  0 &  0 &  2 \\\\\n5     &   1 &  1 &  0 &  0 &  0 \\\\\n7     &   2 &  8 &  1 &  2 &  4 \\\\\n\\bottomrule\n\\end{tabular}\n'
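
Per-class recall (the fraction of each true class that is predicted correctly) makes the matrices easier to compare. The sketch below computes it for the k-nearest neighbors matrix printed above. The dictionary layout and the name `per_class_recall` are our own.

```python
# k-nearest neighbors confusion matrix printed above:
# outer keys are true classes, inner keys are predicted classes
knn_confusion = {
    1: {1: 12, 2: 1, 3: 0, 7: 4},
    2: {1: 4, 2: 0, 3: 0, 7: 2},
    3: {1: 0, 2: 0, 3: 2, 7: 0},
    4: {1: 0, 2: 0, 3: 0, 7: 3},
    5: {1: 0, 2: 0, 3: 0, 7: 2},
    7: {1: 2, 2: 4, 3: 0, 7: 11},
}

def per_class_recall(confusion):
    """Recall of each true class: correct predictions divided by row total."""
    return {true_class: row.get(true_class, 0) / sum(row.values())
            for true_class, row in confusion.items()}

knn_recall = per_class_recall(knn_confusion)
for true_class, recall in knn_recall.items():
    print(true_class, round(recall, 3))
```

For this matrix, classes 2, 4 and 5 have a recall of 0: none of their test instances is recognized, while the two instances of class 3 are both classified correctly.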