# Code 17 to Code 29

In [2]:
import pandas as pd
import re

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Importing data

In [3]:
df = pd.read_csv('airline_dataset.csv',low_memory=False,encoding = 'latin1')


In [4]:
df.head()

Unnamed: 0,line,class
0,When can I web check-in?,check in
1,want to check in,check in
2,please check me in,check in
3,check in,check in
4,my flight is tomm can I check in,check in


Distribution of the "class" column (count per intent)

In [5]:
df['class'].value_counts()

login        105
other         79
baggage       76
check in      61
greetings     45
thanks        16
cancel        16
Name: class, dtype: int64

In [6]:
df["line"] = df["line"].str.lower().str.strip()

In [7]:
### TFIDF Vectorizer. Get a TFIDF Matrix

TF-IDF stands for term frequency-inverse document frequency. For a given word, this statistic indicates how important the word is to a document relative to the rest of the corpus. A word that appears in few documents gets a higher TF-IDF score, while a word that appears in every document gets a lower one.

1.	TF (Term Frequency) is the number of times a “word” (token or feature) appears in a document, normalized by the number of words in that document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
2.	IDF (Inverse Document Frequency) measures how rare the given word is across the corpus: the fewer documents contain it, the higher the value.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
3.	The final value is obtained by multiplying TF and IDF.
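
The three steps above can be worked through by hand. The sketch below uses a hypothetical three-sentence corpus (not rows from airline_dataset.csv) and the plain formulas as stated. Note that scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus, made up for illustration.
docs = [
    "please check me in",
    "check in",
    "cancel my booking",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Count of the term divided by the document length.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Natural log of (number of documents / documents containing the term).
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "check" occurs in two documents, "please" in only one.
score_check = tfidf("check", tokenized[0], tokenized)
score_please = tfidf("please", tokenized[0], tokenized)
print(round(score_check, 4), round(score_please, 4))
```

Both words occur once in the first sentence, so their TF is equal; the rarer word "please" ends up with the higher score purely because of its IDF.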



In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

The “TfidfVectorizer” object has a few important parameters that reduce the size of the matrix (the number of columns):
1.	min_df - the minimum number of documents (or, when given as a float, the minimum fraction of documents) in which a word must appear to be kept.
2.	stop_words - an optional list of stop words to drop.
The “analyzer” can be word or character, depending on the type of token we want. We want words for the problem at hand.
The ngram_range option controls the length of the n-grams (unigrams, bigrams, trigrams, etc.). For instance, ngram_range=(2, 3) produces tokens ranging from bigrams to trigrams. Once we have a vectorizer object, we fit it on the corpus and get a sparse matrix. The sparse matrix can then be converted to a dense one.
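
To make the effect of ngram_range concrete, here is a minimal pure-Python sketch of word n-gram generation. The helper word_ngrams is hypothetical and only approximates what the vectorizer's word analyzer does: for example, scikit-learn's default token pattern also drops single-character tokens, which this sketch does not.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams of `text` for lengths min_n..max_n inclusive."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

example = "want to check in"
unigrams = word_ngrams(example, (1, 1))
bi_tri = word_ngrams(example, (2, 3))
print(unigrams)  # ['want', 'to', 'check', 'in']
print(bi_tri)    # ['want to', 'to check', 'check in', 'want to check', 'to check in']
```

Each extra n-gram length adds columns to the matrix, which is why wider ranges are usually paired with min_df or a later feature selection step.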

In [9]:

tfidf_vectorizer = TfidfVectorizer(min_df=0.0,analyzer=u'word',ngram_range=(1, 1),stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(df["line"])
tf1= tfidf_matrix.todense()


The attribute tfidf_vectorizer.vocabulary_ maps each token found by the vectorizer to its column index in the matrix. As we can see, the vocabulary contains a number of digits and dates, which add little to the analysis as they stand but can be valuable once they are normalized into entities such as dates and months. We look at data normalization in the next section.


In [10]:
tfidf_vectorizer.vocabulary_

{'when': 215,
 'can': 43,
 'web': 210,
 'check': 53,
 'in': 103,
 'want': 206,
 'to': 194,
 'please': 161,
 'me': 129,
 'my': 138,
 'flight': 72,
 'is': 107,
 'tomm': 195,
 'not': 146,
 'able': 10,
 'next': 142,
 'week': 211,
 'how': 102,
 'onine': 153,
 'online': 154,
 'before': 34,
 'checking': 55,
 'hours': 101,
 'am': 21,
 'flying': 75,
 'on': 151,
 '26th': 5,
 'of': 149,
 'this': 188,
 'month': 135,
 '20th': 3,
 'monday': 133,
 'now': 147,
 'getting': 85,
 'late': 117,
 'the': 185,
 'airport': 16,
 'pnr': 162,
 '65321': 7,
 'checkin': 54,
 'get': 84,
 'internet': 106,
 'what': 214,
 'free': 79,
 'baggage': 29,
 'allowance': 20,
 'much': 137,
 'have': 94,
 '35': 6,
 'kg': 113,
 'should': 174,
 'pay': 158,
 'carry': 49,
 'lots': 123,
 'bags': 32,
 'till': 192,
 'many': 127,
 'are': 24,
 'upto': 201,
 'weight': 212,
 'number': 148,
 'carrying': 50,
 'travelling': 198,
 'with': 219,
 'money': 134,
 'for': 77,
 'luggage': 124,
 'too': 196,
 'cancel': 45,
 'booking': 39,
 'tickets': 191

In [11]:
len(tfidf_vectorizer.vocabulary_),tf1.shape

(223, (398, 223))

First ten rows of the TF-IDF matrix

In [12]:
tf1[0:10]

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

## Data normalization
Data normalization typically includes techniques such as generalizing dates, times, and amounts using regular expressions. For instance, we could map all date formats to the word “Dates”; similarly, every amount paid or received could be replaced with the word “Money”. We could also “relabel” words that mean the same thing. In our case, “Baggage” and “Luggage” mean the same thing and can be replaced by a single word. Normalization helps in two ways: it makes the data denser, and it prepares the model for “unseen” variations. For example, take the sentence “My flight is on Sunday and I want to check in”. If we build a model on that sentence and then present a new sentence that says “My flight is on Friday and I want to check in”, the model may not perform well. Hence we replace such words (in this case, days of the week) with a common name (in this case, “Dayofweek”) to generalize the model and make it more robust.
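
The day-of-week case described above can be sketched with a single regular expression. The pattern and the lowercase placeholder token "dayofweek" are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

# Assumed pattern: full English day names, matched case-insensitively.
DAY_PATTERN = re.compile(
    r"\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    flags=re.IGNORECASE,
)

def replace_days(sentence):
    # Every day name collapses to the same placeholder token.
    return DAY_PATTERN.sub("dayofweek", sentence)

s1 = replace_days("My flight is on Sunday and I want to check in")
s2 = replace_days("My flight is on Friday and I want to check in")
print(s1)
print(s1 == s2)
```

After the replacement the two sentences are identical, so a model trained on one has effectively seen the other.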

A few examples below show how regular expressions can help in preprocessing sentences.

In [13]:
str1 = "I want to be there on 19th"

In [14]:
re.sub("[0-9]+th","datepp",str1)

'I want to be there on datepp'

In [15]:

str1 = "I want to be there on 23-05-18"
str2 = "I want to be there on 23-05"

In [16]:
import re
# Raw strings avoid the invalid "\/" escape; "/" needs no escaping in a character class
print(re.sub(r"[0-9]+[/-]+[0-9]+[/-]*[0-9]*","datepp",str1))
print(re.sub(r"[0-9]+[/-]+[0-9]+[/-]*[0-9]*","datepp",str2))

I want to be there on datepp
I want to be there on datepp


Now let us apply the regular expression preprocessing to our corpus

In [17]:
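# regex=True is needed: recent pandas versions treat the pattern as a literal string by default
df["line1"] = df["line"].str.replace(r'[0-9]+th', 'datepp', regex=True)
df["line1"] = df["line1"].str.replace(r'[0-9]+[/-]+[0-9]+[/-]*[0-9]*', 'datepp', regex=True)
df["line1"] = df["line1"].str.replace(r'[0-9]+', 'digitpp', regex=True)
df["line1"] = df["line1"].str.replace(r'[^A-Za-z]+', ' ', regex=True)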
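
The same four substitutions can be tried on single strings without pandas. The helper normalize below is a hypothetical stand-alone sketch of the step above, and the sample sentences are made up. The order matters: the date patterns must run before the generic digit pattern, otherwise the digits inside a date would already have been replaced.

```python
import re

def normalize(line):
    line = line.lower().strip()
    # 1. Ordinal days such as "26th" become a date placeholder.
    line = re.sub(r"[0-9]+th", "datepp", line)
    # 2. Numeric dates such as "23-05-18" or "23/05" become a date placeholder.
    line = re.sub(r"[0-9]+[/-]+[0-9]+[/-]*[0-9]*", "datepp", line)
    # 3. Any remaining run of digits becomes a digit placeholder.
    line = re.sub(r"[0-9]+", "digitpp", line)
    # 4. Every run of non-letters collapses to a single space.
    line = re.sub(r"[^A-Za-z]+", " ", line)
    return line

a = normalize("My flight is on 26th, PNR 65321")
b = normalize("I want to be there on 23-05-18")
print(a)  # my flight is on datepp pnr digitpp
print(b)  # i want to be there on datepp
```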


We will now see an example of replacing words with a common group name: words with similar meanings are replaced by a single name. I created a file that maps similar words to a group. We shall now import the file and have a look at it.

Get preprocessing files to normalize text

In [18]:
pp = pd.read_csv('preprocess.csv',low_memory=False,encoding = 'latin1')

In [19]:
pp.head()

Unnamed: 0,word,class
0,luggage,baggage
1,bags,baggage
2,checkin,checkin
3,chckin,checkin
4,check in,checkin


The next step is to replace each word with its corresponding class name.

In [20]:

def preprocess(l1):
    # Replace the first mapped word found in the line with its group name.
    # Matching is by substring, and only the first matching word is replaced.
    for word, newcl in zip(pp["word"], pp["class"]):
        if word in l1:
            return l1.replace(word, newcl)
    return l1
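
As an alternative to stopping at the first match, the sketch below replaces every mapped word in a line and matches on word boundaries, so that a word embedded in a longer one is left alone. The dictionary is a hypothetical stand-in that mirrors the first rows of preprocess.csv shown above; this is a variation for comparison, not the notebook's method.

```python
import re

# Hypothetical mapping, mirroring the first rows of preprocess.csv.
WORD_TO_GROUP = {
    "luggage": "baggage",
    "bags": "baggage",
    "chckin": "checkin",
    "check in": "checkin",
}

# Longest entries first, so multi-word phrases win over shorter alternatives.
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(w) for w in sorted(WORD_TO_GROUP, key=len, reverse=True))
    + r")\b"
)

def replace_groups(line):
    # Substitute every whole-word match with its group name.
    return _pattern.sub(lambda m: WORD_TO_GROUP[m.group(1)], line)

out = replace_groups("please check in my luggage and bags")
untouched = replace_groups("handbags")
print(out)
print(untouched)
```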


In [21]:
df["line1"] = df["line1"].apply(preprocess)

In [22]:
tgt = df.loc[:,"class"]

To measure performance on train and test data, we need to slice the dataset into train and test sets. Since each text belongs to one of several classes, this is a multi-class classification problem. This means we should use a stratified split, so that each class gets enough representation in both train and test.

We take the stratified sample before vectorizing, so that vocabulary from the test dataset does not "leak" into training.

In [23]:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1,random_state=42,n_splits=1)

for train_index, test_index in sss.split(df, tgt):
    x_train, x_test = df[df.index.isin(train_index)], df[df.index.isin(test_index)]
    y_train, y_test = df.loc[df.index.isin(train_index),"class"], df.loc[df.index.isin(test_index),"class"]
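
To illustrate what stratification means, here is a stdlib-only sketch that samples roughly 10% of each class separately. The helper stratified_split is hypothetical and its allocation rule is simpler than the one StratifiedShuffleSplit uses; the class counts are the ones printed by value_counts() earlier.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, test_idx), sampling about test_size of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Keep at least one test example per class.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class counts taken from the value_counts() output shown earlier.
counts = {"login": 105, "other": 79, "baggage": 76, "check in": 61,
          "greetings": 45, "thanks": 16, "cancel": 16}
labels = [name for name, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels)
test_labels = {labels[i] for i in test_idx}
print(len(train_idx), len(test_idx), sorted(test_labels))
```

Even the two smallest classes (16 rows each) are guaranteed a place in the test set, which a plain random 10% sample would not ensure.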

In [24]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape

((358, 3), (358,), (40, 3), (40,))

We now apply the TF-IDF vectorizer to the train and test split. Remember to do this step after the train and test split: if the vectorizer is fitted before splitting, the vocabulary and IDF weights are learned from the test data too, and you are likely to overestimate your accuracies.
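
The fit-on-train, transform-on-test pattern can be sketched without any library. The example uses raw counts instead of TF-IDF weights to stay short, and the sentences are made up; the point is only that the vocabulary comes from the training documents, so test tokens never seen in training are dropped.

```python
train_docs = ["check in online", "cancel my booking"]
test_docs = ["cancel my checkin refund"]

# "Fit": the vocabulary is built from the training documents only.
vocab = {
    tok: i
    for i, tok in enumerate(sorted({t for d in train_docs for t in d.split()}))
}

def transform(doc):
    # "Transform": count known tokens; tokens outside the vocabulary are ignored.
    vec = [0] * len(vocab)
    for tok in doc.split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

test_vec = transform(test_docs[0])
print(vocab)
print(test_vec)
```

Here "checkin" and "refund" get no column at all, which is what an honest evaluation on unseen text should look like.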

In [25]:
tfidf_vectorizer = TfidfVectorizer(min_df=0.0001,analyzer=u'word',ngram_range=(1, 3),stop_words='english')
tfidf_matrix_tr = tfidf_vectorizer.fit_transform(x_train["line1"])

tfidf_matrix_te = tfidf_vectorizer.transform(x_test["line1"])

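# toarray() returns plain ndarrays; recent scikit-learn versions reject the np.matrix returned by todense()
x_train2 = tfidf_matrix_tr.toarray()
x_test2 = tfidf_matrix_te.toarray()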

In [26]:
x_train2.shape,y_train.shape,x_test2.shape,y_test.shape

((358, 269), (358,), (40, 269), (40,))

Before we apply a machine learning algorithm, we do a feature selection step that keeps the top 40% of the features, ranked by the ANOVA F-score (f_classif) of each feature against the class label.

In [27]:
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=40)
# fit_transform fits the selector on the training data only; the test data is only transformed
x_train3 = selector.fit_transform(x_train2, y_train)
x_test3 = selector.transform(x_test2)
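
f_classif scores each feature with a one-way ANOVA F-statistic: the variance of the feature between classes divided by its variance within classes. A minimal sketch of that statistic for a single feature is given below; the helper anova_f and the numbers are illustrative, not taken from the dataset.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of per-class value lists."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own class mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# A feature whose values differ clearly by class scores high.
f_separated = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
# A feature with the same mean in every class scores zero.
f_overlapping = anova_f([[1, 2, 3], [2, 3, 1], [3, 1, 2]])
print(f_separated, f_overlapping)
```

SelectPercentile then keeps the features with the highest scores, here the top 40%.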

In [28]:
x_train3.shape,x_test3.shape

((358, 107), (40, 107))

#### Build models

We are ready to fit our machine learning algorithm on the train dataset and test it on the held-out test set (x_test3).

In [29]:
clf_log = LogisticRegression()
clf_log.fit(x_train3,y_train)  



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [30]:
### Measuring Accuracies

In a multiclass problem, class imbalance is common: the distribution of categories is often skewed towards a few of the top categories. Hence, over and above the overall accuracy, we want to check the accuracy for individual categories. This can be done using the confusion matrix and the precision, recall, and F1 measures.

In [31]:
pred=clf_log.predict(x_test3)

print (accuracy_score(y_test, pred))

0.875


In [32]:
from sklearn.metrics import f1_score
f1_score(y_test, pred, average='macro')  


0.7529993815708103

In [33]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred, labels=None, sample_weight=None)

array([[ 7,  0,  0,  0,  0,  1,  0],
       [ 0,  1,  0,  0,  0,  1,  0],
       [ 0,  0,  5,  0,  0,  1,  0],
       [ 0,  0,  0,  4,  0,  0,  0],
       [ 0,  0,  0,  0, 10,  0,  0],
       [ 0,  0,  0,  0,  0,  8,  0],
       [ 0,  0,  0,  0,  0,  2,  0]], dtype=int64)
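
The confusion matrix above already contains everything needed for the per-class measures. The sketch below recomputes accuracy, precision, recall, and macro F1 from it in plain Python, assuming rows are true classes and columns are predictions (the scikit-learn convention) and treating an undefined precision as 0.

```python
# Confusion matrix as printed above: rows = true class, columns = predicted class.
cm = [
    [7, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 5, 0, 0, 1, 0],
    [0, 0, 0, 4, 0, 0, 0],
    [0, 0, 0, 0, 10, 0, 0],
    [0, 0, 0, 0, 0, 8, 0],
    [0, 0, 0, 0, 0, 2, 0],
]
n = len(cm)
accuracy = sum(cm[i][i] for i in range(n)) / sum(map(sum, cm))

f1_per_class = []
for k in range(n):
    tp = cm[k][k]
    predicted = sum(cm[i][k] for i in range(n))  # column sum
    actual = sum(cm[k])                          # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    f1_per_class.append(f1)
    print(k, round(precision, 3), round(recall, 3), round(f1, 3))

macro_f1 = sum(f1_per_class) / n
print(accuracy, macro_f1)
```

The last class is never predicted, so its F1 is 0 and it pulls the macro average down to about 0.753 even though overall accuracy is 0.875. This is the kind of gap that accuracy alone hides on imbalanced classes.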

##### Using XGBoost - Not in the chapter

We will now build a model using XGBoost. The first step is to convert the dependent variable to a label-encoded format.

In [34]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y_train1 = le.fit_transform(y_train)
y_test1 = le.transform(y_test)
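
Label encoding simply maps each class name to an integer. The sketch below mimics the idea in plain Python using the seven class names from the dataset; like LabelEncoder, it assigns codes in sorted order of the class names. The encode and decode helpers are hypothetical names.

```python
class_names = ["login", "other", "baggage", "check in",
               "greetings", "thanks", "cancel"]

# Sorted unique labels define the integer codes.
classes = sorted(set(class_names))
to_code = {name: code for code, name in enumerate(classes)}

def encode(labels):
    return [to_code[label] for label in labels]

def decode(codes):
    return [classes[code] for code in codes]

codes = encode(["check in", "thanks", "baggage"])
print(classes)
print(codes)          # [2, 6, 0]
print(decode(codes))  # ['check in', 'thanks', 'baggage']
```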

In [35]:
y_train1.shape

(358,)

In [36]:
n_class = y_train.nunique()

In [37]:
n_class

7

We provide class weights so that errors on rare classes count as much as errors on frequent ones: the 'balanced' setting weights each class inversely to its frequency.

In [38]:
from sklearn.utils import class_weight
import numpy as np
# classes and y must be passed by keyword in recent scikit-learn versions
class_weights = list(class_weight.compute_class_weight('balanced',
                                             classes=np.unique(y_train1),
                                             y=y_train1))

In [39]:
class_weights

[0.7521008403361344,
 3.6530612244897958,
 0.9298701298701298,
 1.2473867595818815,
 0.5383458646616541,
 0.7203219315895373,
 3.6530612244897958]
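
The 'balanced' weights follow the formula n_samples / (n_classes * count_of_class). The sketch below reproduces the values above in plain Python. The per-class training counts are an assumption: they were back-calculated from the printed weights and are not shown anywhere in the notebook.

```python
# Assumed per-class training counts, back-calculated from the weights above.
train_counts = [68, 14, 55, 41, 95, 71, 14]

n_samples = sum(train_counts)   # 358 training rows
n_classes = len(train_counts)   # 7 classes

# 'balanced' weight: rarer classes get proportionally larger weights.
balanced = [n_samples / (n_classes * c) for c in train_counts]
for count, weight in zip(train_counts, balanced):
    print(count, round(weight, 4))
```

A class with 14 training rows gets a weight of about 3.65, while the largest class (95 rows) gets about 0.54, so each class contributes roughly the same total weight.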

In [40]:
#class_weights = [0.08,0.8,0.8]

In [41]:
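from xgboost import XGBClassifier

# Note: XGBClassifier does not define a `class_weights` argument, so the value
# below is most likely ignored. To weight classes, pass per-sample weights to
# fit(), e.g. sample_weight=np.array(class_weights)[y_train1].
xgb = XGBClassifier(n_estimators=1000,max_depth= 10,subsample= 0.5,colsample_bytree= 0.3
                   ,class_weights = class_weights)
xgb.fit(x_train3,y_train1)
pred = xgb.predict(x_test3)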

Computing Accuracies for XGBoost model

In [42]:


print (accuracy_score(y_test1, pred))

0.9


In [43]:
y_test2 = le.inverse_transform(y_test1)
pred2 = le.inverse_transform(pred)

In [44]:
from sklearn.metrics import confusion_matrix as cf
cf(y_test2, pred2, labels=None, sample_weight=None)

array([[ 7,  0,  0,  0,  0,  1,  0],
       [ 0,  2,  0,  0,  0,  0,  0],
       [ 0,  0,  5,  0,  0,  1,  0],
       [ 0,  0,  0,  4,  0,  0,  0],
       [ 0,  0,  0,  0, 10,  0,  0],
       [ 0,  0,  0,  0,  0,  8,  0],
       [ 0,  0,  0,  0,  0,  2,  0]], dtype=int64)