This notebook investigates whether lemmatization improves model performance on the balanced dataset. Because the training set is data-augmented, the train and test sets are preprocessed separately.

## Contents:
- [Loading of Libraries](#Loading-of-Libraries) 
- [Load Datasets](#Load-Datasets)
- [Preprocessing: Tokenize and Lemmatize](#Preprocessing:-Tokenize-and-Lemmatize)
- [Data Modelling](#Data-Modelling)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)
- [Limitations](#Limitations)

## Loading of Libraries

In [91]:
# Import libraries
import pandas as pd       
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer # for lemmatization
from nltk.corpus import stopwords # for stopwords removal

from sklearn.pipeline import Pipeline # to compactly pack multiple modeling operations
from sklearn.naive_bayes import MultinomialNB # to build our classification model

# Import TFIDFVectorizer from feature_extraction.text module in sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import accuracy_score # for model performance assessment

## Load Datasets

In [79]:
# Load datasets
test = pd.read_csv('./output/data_augment/test_dataset.csv')
test.head()

Unnamed: 0,reply,forum
0,Originally posted by feb01mel View Post I bet...,0
1,Worse than 5 years of approx. S$25k each year...,1
2,Ouch... KrisFlyer online Star Alliance award ...,1
3,Originally posted by SQJunkie View Post I cal...,1
4,Nothing major as they are moving terminals wi...,0


In [55]:
aug_train = pd.read_csv('./output/data_augment/augmented_train_dataset.csv')
aug_train.head()

Unnamed: 0,reply,forum
0,"The one from August is there, if you did a se...",0
1,It really makes me start questioning the abil...,1
2,"SQ HKG-SIN, SQ 871, ECONOMY CLASS. VOML. Dinn...",0
3,SQ853 CAN-SIN (June 2013) SUPPER (GUANGZHOU T...,0
4,"Looking at last year, my PPS Values per fligh...",1


In [63]:
# Find null values
aug_train.isna().sum()

reply    16
forum     0
dtype: int64

In [64]:
# Drop rows with null values
aug_train.dropna(inplace=True)

## Preprocessing: Tokenize and Lemmatize

In [80]:
# Tokenize replies into words
test['reply'] = test['reply'].apply(word_tokenize)

In [81]:
# Instantiate lemmatizer.
lemmatizer = WordNetLemmatizer()

In [82]:
# Lemmatize each word in replies
test['reply'] = test['reply'].apply(lambda lst: [lemmatizer.lemmatize(word) for word in lst])

In [85]:
# Join the lemmatized tokens back into one string per reply for the vectorizer
test['reply'] = test['reply'].apply(', '.join)

In [86]:
test

Unnamed: 0,reply,forum
0,"Originally, posted, by, feb01mel, View, Post, ...",0
1,"Worse, than, 5, year, of, approx, ., S, $, 25k...",1
2,"Ouch, ..., KrisFlyer, online, Star, Alliance, ...",1
3,"Originally, posted, by, SQJunkie, View, Post, ...",1
4,"Nothing, major, a, they, are, moving, terminal...",0
...,...,...
6839,"My, letter, wa, also, dated, 14, March, ..., I...",1
6840,"Asking, on, behalf, of, a, friend, who, is, PP...",1
6841,"Originally, posted, by, SQflyergirl, View, Pos...",1
6842,"Originally, posted, by, Nick, C, View, Post, A...",1


In [65]:
# Repeat the same on train dataset
# Tokenize replies into words
aug_train['reply'] = aug_train['reply'].apply(word_tokenize)

In [67]:
# Lemmatize each word in replies
aug_train['reply'] = aug_train['reply'].apply(lambda lst: [lemmatizer.lemmatize(word) for word in lst])

In [87]:
# Join the lemmatized tokens back into one string per reply for the vectorizer
aug_train['reply'] = aug_train['reply'].apply(', '.join)

In [88]:
aug_train

Unnamed: 0,reply,forum
0,"The, one, from, August, is, there, ,, if, you,...",0
1,"It, really, make, me, start, questioning, the,...",1
2,"SQ, HKG-SIN, ,, SQ, 871, ,, ECONOMY, CLASS, .,...",0
3,"SQ853, CAN-SIN, (, June, 2013, ), SUPPER, (, G...",0
4,"Looking, at, last, year, ,, my, PPS, Values, p...",1
...,...,...
23526,"Originally, posted, by, kapitan, View, Post, H...",1
23527,"hi, all, ,, have, an, upcoming, flight, on, SQ...",0
23528,"I, 'm, missing, the, T2/T3, SKL, agent, alread...",0
23529,"Originally, posted, by, SQueeze, View, Post, W...",0
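
The test-set output in this section shows forms such as `wa` (from "was") and `a` (from "as"). This happens because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed. A possible refinement, not tried in this notebook, is to tag each token first and pass the tag along. The sketch below shows only the tag mapping; `penn_to_wordnet`, `lemmatize_tagged` and `show_pos` are hypothetical helper names, and `show_pos` is a stand-in lemmatizer so that the example runs without any NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. from nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs with a callable that takes (word, pos)."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Stand-in for lemmatizer.lemmatize: it only shows which POS would be passed.
def show_pos(word, pos):
    return f"{word}/{pos}"


print(lemmatize_tagged([("was", "VBD"), ("moving", "VBG"), ("terminals", "NNS")], show_pos))
# prints ['was/v', 'moving/v', 'terminals/n']
```

With NLTK, the call would look like `lemmatize_tagged(nltk.pos_tag(tokens), lemmatizer.lemmatize)`, which needs the tagger resource to be downloaded first. With `pos='v'`, WordNet maps "was" to "be" rather than "wa". Whether this changes the scores is untested here.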


## Data Modelling
- Set data up for modelling
- Modelling with baseline classifier for comparison

In [90]:
# Set data up for modelling
X_train = aug_train['reply']
y_train = aug_train['forum']
X_test = test['reply']
y_test = test['forum']

### Modelling using TfidfVectorizer and baseline classifier

In [92]:
# Set up a pipeline with tf-idf vectorizer and multinomial naive bayes

pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB())
])
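
Joining the lemmatized tokens with `', '` does not put commas into the features. Assuming the vectorizer keeps its defaults (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`, neither of which is overridden in the pipeline above), punctuation and single-character tokens are discarded when the text is tokenized again. The sketch below reproduces that default pattern with the standard `re` module on two rows from the test-set output; `default_tokens` is a hypothetical helper that approximates the behaviour, it is not the vectorizer itself, and stop-word removal (which runs afterwards) is left out.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Lowercase the text, then tokenize it with the default pattern."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


print(default_tokens("Nothing, major, a, they, are, moving"))
# prints ['nothing', 'major', 'they', 'are', 'moving']
print(default_tokens("My, letter, wa, also, dated, 14, March"))
# prints ['my', 'letter', 'wa', 'also', 'dated', '14', 'march']
```

The stray `a` disappears, but `wa` survives as a token. Since "was" is normally on the English stop list while `wa` is not, lemmatizing before stop-word removal may let such forms slip into the vocabulary.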

In [93]:
# Fit pipeline to training data
pipe_tvec.fit(X_train, y_train)

In [94]:
# Score model on training set
pipe_tvec.score(X_train, y_train)

0.907250691048267

In [95]:
# Score model on testing set
pipe_tvec.score(X_test, y_test)

0.8885154880187025

With the vectorizer and estimator held constant, lemmatization did not improve the scores: both the train and test accuracy dropped slightly, as the table below shows. The lemmatized model is therefore not tuned further.

|**Metrics**|**Model without lemmatization**|**Model with lemmatization**|
|:---|:---|:---|
|train_score|0.909|0.907|
|test_score|0.893|0.889|
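
To put the 0.004 gap in test accuracy into perspective, it can be compared with the sampling noise of an accuracy estimate. The sketch below uses the binomial standard error, sqrt(p(1 - p) / n). The test-set size `n_test = 6844` is an assumption: it is not stated in the notebook, but it is consistent with the displayed index and with the reported test score, which equals 6081 / 6844. Because both models are scored on the same test rows, a paired comparison such as McNemar's test would be the more rigorous check; this is only a rough yardstick.

```python
import math

n_test = 6844        # assumption: inferred from the reported score, not stated in the notebook
acc_plain = 0.893    # test accuracy without lemmatization (from the table)
acc_lemma = 0.889    # test accuracy with lemmatization (from the table)


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1 - acc) / n)


se = accuracy_std_error(acc_lemma, n_test)
gap = acc_plain - acc_lemma
print(f"standard error: {se:.4f}, gap: {gap:.3f}, gap / se: {gap / se:.2f}")
```

The gap is roughly one standard error, so the result is better read as "lemmatization did not help" than as "lemmatization clearly hurt".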

## Conclusions and Recommendations
- The objective of this project is to build a machine learning model that lets an airline company direct incoming queries to the right channel, namely the membership programme (KrisFlyer) or services (inflight catering, amenities and lounges).

1. First, replies were scraped from an online forum dedicated to the airline company, since customers post queries there, especially when they cannot reach a customer support agent.

2. A baseline model using CountVectorizer (transformer) and Multinomial Naive Bayes (estimator) was then fitted and evaluated using the following metrics.
|**Metrics**|**MultinomialNB()**|
|:---|:---|
|train_score|0.858|
|test_score|0.844|
|sensitivity|0.986|
|precision|0.801|
|f1 score|0.884|
|specificity|0.626|
|roc auc|0.806|
>A baseline accuracy above 80% is a reasonable starting point. Because the train accuracy is only slightly higher than the test accuracy, there is no evidence of overfitting. Since the objective is to classify a reply to the correct forum category, the cost of a false positive (the model predicts 'krisflyer' when the reply actually belongs elsewhere) is the same as the cost of a false negative (the model predicts 'amenity_catering_lounges' when the reply actually belongs elsewhere): a dissatisfied customer. With symmetric costs, `accuracy` would normally be a suitable metric. However, because the target variable is imbalanced, `accuracy` on its own can be misleading here.
> The other metrics, including f1 score, specificity and roc auc, were therefore optimised instead.

3. Model tuning was done with GridSearch over different vectorizers and classifiers to find the combination with the best metrics.
|**Metrics**|**MultinomialNB() (baseline)**|**TfidfVectorizer with LinearSVC (with GridSearch4)**|**Zero Shot Classification**|
|:---|:---|:---|:---|
|train_score|0.858|0.959|-|
|test_score|0.844|0.922|0.570|
|sensitivity|0.986|0.924|0.407|
|precision|0.801|0.946|0.775|
|f1 score|0.884|0.935|0.533|
|specificity|0.626|0.919|0.819|
|roc auc|0.806|0.922|0.613|
> Overall, the model with TfidfVectorizer with LinearSVC (with GridSearch4) stands out across different metrics.

4. Data augmentation was done to remedy the problem of the imbalanced dataset so that the `accuracy` metric can be used as well.
> Compared to the model with an imbalanced target variable, the model with a balanced dataset has a higher accuracy score, with the vectorizer and estimator kept constant.
|**Metrics**|**Model with an imbalanced dataset**|**Model with a balanced dataset**|
|:---|:---|:---|
|train_score|0.887|0.909|
|test_score (accuracy score)|0.873|0.893|

5. Again, GridSearch was done to find the best model yielding the highest metrics.
> Comparing the best TfidfVectorizer and LinearSVC models trained on the imbalanced and balanced datasets, the `accuracy` scores are marginally higher with the balanced dataset, as summarised below.
|**Metrics**|**Model with an imbalanced dataset**|**Model with a balanced dataset**|
|:---|:---|:---|
|train_score|0.959|0.962|
|test_score (accuracy score)|0.922|0.923|

6. Lemmatization was done in an attempt to further improve the accuracy score on the balanced dataset, but the model performed slightly worse.
> Thus, the model with lemmatization was not optimised.
|**Metrics**|**Model without lemmatization**|**Model with lemmatization**|
|:---|:---|:---|
|train_score|0.909|0.907|
|test_score|0.893|0.889|

- The best model is therefore one with TfidfVectorizer and LinearSVC with a balanced dataset, based on the `accuracy` metric.

- Moving forward, the airline company can use this model to classify new queries to the correct channel on the chatbot or to the relevant customer support agent.
- The project can also be further extended to a multiclass classification model if the airline company wants to direct the query to a more specific channel.
- Sentiment analysis can also be conducted on the scraped text data to help the airline company identify its strengths and weaknesses, so that appropriate steps can be taken to improve its reputation.

## Limitations
- As the text data was scraped from a forum whose contributors are mainly Singapore residents, the corpus contains 'Singlish' terms that are not removed by 'English' stop-word removal. However, where 'Singlish' terms are very prevalent, this is addressed through GridSearch.
- Under the 'Catering' forum, the posts are mainly informational rather than queries, so the model might not classify catering-related questions well.