## Import Libraries

All libraries required to create a model capable of classifying tweets by category are imported.  These are described in the comments below.

In [107]:
## Pandas required to manipulate data into user-friendly data structure
import pandas as pd

## Pickle and joblib allow Python objects to be saved for later use, and retrieved
import pickle
## joblib is imported directly, as sklearn.externals.joblib has been removed from recent sklearn releases
import joblib

## Numpy is used to execute various mathematical functions
import numpy as np

## Matplotlib and Seaborn are both plotting tools used to support data visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Function to enable random split of data into training and test set
from sklearn.model_selection import train_test_split

## Import the TfidfVectorizer and CountVectorizer to convert a collection of raw documents to a matrix of TF-IDF or token-count features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

## Gridsearch enables the optimal combination of parameters to be selected for a given classifier
from sklearn.model_selection import GridSearchCV

## A number of different classification models
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb

## Metrics to help evaluate the performance of each model
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve, auc



## Set Random Seed

To ensure results are reproducible, a random seed is set using Numpy.

In [3]:
## Set a random seed to ensure results are reproducible
np.random.seed(10)
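
As a quick check of what the seed does, the sketch below reseeds the global NumPy generator and confirms that the same numbers are drawn again. It is illustrative only. Note that sklearn functions fall back on this global state only when their own `random_state` argument is left unset, which is why an explicit `random_state` is also passed later in this notebook.

```python
import numpy as np

# Seeding the global NumPy generator makes subsequent draws repeatable
np.random.seed(10)
first_draw = np.random.rand(3)

# Re-seeding with the same value restarts the same sequence
np.random.seed(10)
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))
```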

## Set Pandas Display Options

Pandas display settings are chosen to ensure that the full contents of each column can be seen.

In [4]:
## Set width of pandas dataframe to ensure entire Tweet is displayed
pd.set_option('display.max_colwidth', 3000)

## Import Data

The cleaned and labelled tweets are imported from the pre-prepared DataFrame ('cleaned_labelled_tweets').  For more information on how this was created, please refer to **[Step 1 - Obtain Data](https://github.com/isobeldaley/categorising-tweets/blob/master/Step%201%20-%20Obtain%20Data.ipynb)** and **[Step 2 - Scrub Data](https://github.com/isobeldaley/categorising-tweets/blob/master/Step%202%20-%20Scrub%20Data.ipynb)**.

In [5]:
## Import saved dataframe using pickle
df = pd.read_pickle('cleaned_labelled_tweets')

Next, the first five rows of the dataframe are previewed.  

In [6]:
## Preview first five rows of DataFrame
df.head()

Unnamed: 0,network,datetime,original_tweet,subject,sentiment,lemmatized_tweets_tokens,lemmatized_tweets_string
0,@VodafoneUK,2019-12-04 08:05:14,@VodafoneUK Plus £2.28 package &amp; posting ! ! !,device,0.0,"[plus, 2.28, package, posting]",plus 2.28 package posting
1,@VodafoneUK,2019-12-04 08:04:05,I have repeatedly asked how to get a refund so I can use another provider. I have also asked how to escalate my complaint. @VodafoneIN refuses to give me this information. @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita,customer service,-0.3,"[repeatedly, asked, get, refund, use, another, provider, also, asked, escalate, complaint, refuse, give, information]",repeatedly asked get refund use another provider also asked escalate complaint refuse give information
2,@VodafoneUK,2019-12-04 08:01:19,"I have supplied visa details twice, I have been subjected to horrendously rude staff instore, and now Vodafone are stealing my money by removing services I have paid for. Tourists should not use Vodafone. @VodafoneIn @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita",customer service,-0.3,"[supplied, visa, detail, twice, subjected, horrendously, rude, staff, instore, stealing, money, removing, service, paid, tourist, use]",supplied visa detail twice subjected horrendously rude staff instore stealing money removing service paid tourist use
3,@VodafoneUK,2019-12-04 07:57:42,@VodafoneIN promised yesterday I’d receive no more calls and would get an email in 30 mins. No email received. Today I received yet another call. Vodaphone incompetence means I’ll be losing the data I’ve paid for from midnight @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita,customer service,-0.25,"[promised, yesterday, id, receive, call, would, get, email, 30, min, email, received, today, received, yet, another, call, vodaphone, incompetence, mean, ill, losing, data, ive, paid, midnight]",promised yesterday id receive call would get email 30 min email received today received yet another call vodaphone incompetence mean ill losing data ive paid midnight
4,@VodafoneUK,2019-12-04 07:57:16,@VodafoneUK you send texts about rewards - this morning Lindt. It takes me to my app but they are never there. Doesn’t matter how quickly I look. It actually becomes annoying.,promotion,-0.155556,"[send, text, reward, morning, lindt, take, app, never, doesnt, matter, quickly, look, actually, becomes, annoying]",send text reward morning lindt take app never doesnt matter quickly look actually becomes annoying


As can be seen above, the tweets are categorised by network, subject and sentiment.  They are also stored in three different forms:

- The raw/original tweet
- The cleaned/lemmatized tweet as tokens
- The cleaned/lemmatized tweet as a string

This is to allow greatest flexibility when modelling.

## Split Data into Training & Test Set

To ensure that the models created work well and do not suffer from overfitting, the dataset is split into a training and test set. The training set will be used to build the model, whilst the test set will be used to validate that it works well and can be generalised to new data.

To do this, it is first necessary to define the independent (X) and dependent (y) variables.

In [9]:
## Define the X and y variables
X = df['lemmatized_tweets_string']
y = df['subject']

The **train_test_split()** function can then be used to create this split.  Note that a random_state is specified. This is to ensure results can be reproduced by others.

In [10]:
## Split dataset into training & test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=213)
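
Because some categories are rare (for example 'broadband'), a random split can leave them unevenly represented in the test set. The sketch below shows one possible variation, which is an assumption and not what this notebook does: passing `stratify` to `train_test_split()` so that class proportions are preserved in both splits. The toy labels are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced toy labels: 16 'other' and 4 'broadband'
X_toy = [f"tweet {i}" for i in range(20)]
y_toy = ["other"] * 16 + ["broadband"] * 4

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=213, stratify=y_toy
)

print(Counter(y_tr))
print(Counter(y_te))
```

With a 75/25 split of these 20 labels, each split keeps the original 4:1 ratio between the two classes.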

## Vectorize Data

Most machine learning algorithms are not able to process raw text directly.  As such, it is first necessary to convert the raw text into vectors of numbers.  

There are a number of different ways of doing this.  This includes using a:

- **Count Vectorizer**, which simply counts the number of times a word appears in a tweet and uses this as its weight.

- **TF-IDF Vectorizer**, which evaluates how important a word is to a tweet.  A word's weight increases in proportion to the number of times it appears in a particular tweet, but is reduced according to how many tweets in the entire dataset contain that word.  In this way, it helps the algorithm determine which words are key to categorising a given tweet.  
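
The difference between the two weightings can be sketched by hand on a few toy tweets. The example below is a simplified illustration with hypothetical tweets: it uses the smoothed inverse document frequency formula that sklearn applies by default, but omits the length normalisation that `TfidfVectorizer` also performs.

```python
import math
from collections import Counter

# Three toy "tweets", already cleaned (hypothetical examples)
docs = [
    "network down network slow",
    "refund contract please",
    "network refund",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Count weighting: raw number of times each word appears in a tweet
counts = [Counter(tokens) for tokens in tokenised]

# Document frequency: number of tweets that contain each word
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

# TF-IDF weighting (before any length normalisation)
tfidf = [{w: c * idf(w) for w, c in doc.items()} for doc in counts]

print(counts[0])
print({w: round(v, 2) for w, v in tfidf[0].items()})
```

Here 'network' appears in two of the three tweets, so each occurrence is weighted less than an occurrence of 'down' or 'slow', which appear in only one.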

### Count Vectorizer

Sklearn provides an inbuilt CountVectorizer().  For greatest efficiency, this is used:

In [13]:
## Specify the CountVectorizer, as provided by sklearn
count_vectorizer = CountVectorizer()

Having specified the vectorizer, it is fitted using the training data.  The training data is then transformed.  It is important not to fit using the test data, as this would leak information from the test set into the training process.

In [17]:
## Fit and transform the training data using the count vectorizer
count_X_train = count_vectorizer.fit_transform(X_train)

Finally the test data is also transformed.

In [18]:
## Transform the test data using this vectorizer
count_X_test = count_vectorizer.transform(X_test)
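
To make the fit/transform distinction concrete, the sketch below uses two hypothetical training tweets and one hypothetical test tweet. The vocabulary is learned from the training tweets only, so a word that first appears in the test tweet ('outage') is simply ignored when the test data is transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned tweets
train_tweets = ["slow network today", "refund my contract"]
test_tweets = ["network outage refund"]

vec = CountVectorizer()

# The vocabulary is learned from the training tweets only
toy_X_train = vec.fit_transform(train_tweets)

# The test tweet is mapped onto that existing vocabulary
toy_X_test = vec.transform(test_tweets)

print(sorted(vec.vocabulary_))
print(toy_X_test.toarray())
```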

### TF-IDF Vectorizer

Sklearn provides an inbuilt TfidfVectorizer().  For greatest efficiency, this is used:

In [11]:
## Specify the tfidfvectorizer, as provided by sklearn
tfidf_vectorizer = TfidfVectorizer()

Having specified the vectorizer, it is fitted using the training data.  The training data is then transformed.  It is important not to fit using the test data, as this would leak information from the test set into the training process.

In [15]:
## Fit and transform the training data using the tf-idf vectorizer
tf_idf_X_train = tfidf_vectorizer.fit_transform(X_train)

Finally, the test data is also transformed.

In [16]:
## Transform the test data using this vectorizer
tf_idf_X_test = tfidf_vectorizer.transform(X_test)

## Model Data

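Before building and comparing alternative classification models, the following function is defined. For a given classifier, it runs a grid search on each dataset to identify the most effective combination of parameters, and reports which data transformation performs best. Note that the 'Training Score' it reports is the mean cross-validated accuracy from the grid search (`best_score_`), not accuracy measured on the full training set.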

In [20]:
## Function to identify the optimal dataset and parameters for a given classifier and parameter grid
def best_model_parameters_dataset(classifier, param_grid, datasets):
    
    ## Create a list to contain the dataset, optimal parameters, and scores for each dataset
    score_parameters = []
    
    ## Iterate through each dataset and identify the optimal parameters for the given classifier
    for data in datasets:
        
        ## Run a 3-fold cross-validated grid search on the training data
        gs = GridSearchCV(classifier, param_grid, scoring='accuracy', cv=3)
        gs.fit(data['X_train'], data['y_train'])
        
        ## Score the best estimator on the held-out test data
        y_test_preds = gs.predict(data['X_test'])
        test_score = accuracy_score(data['y_test'], y_test_preds)
        
        ## 'Training Score' is the mean cross-validated accuracy of the best parameters
        score_parameters.append({'Dataset':data['name'], 'Training Score':round(gs.best_score_,2), 'Test Score': round(test_score,2), 'Parameters':gs.best_params_})
     
    ## Generate a dataframe that contains the optimal parameters for each dataset
    results_df = pd.DataFrame(score_parameters)
    results_df.sort_values(by=['Test Score', 'Training Score'], inplace=True, ascending=False)
    
    return results_df

A number of different classification models will be assessed for their suitability for the task of predicting the category a given tweet relates to.  These are:

- K Nearest Neighbors
- Multinomial Naive Bayes Classifier
- Multinomial Logistic Regression
- Random Forest Classifier
- XG Boost
- Support Vector Machine (SVM)

Each model will be created using the two datasets specified above:

- Dataset transformed by tf-idf vectorization
- Dataset transformed by count vectorization

To make this task more efficient, a list containing all training and test datasets is created.

In [25]:
## Create a list of datasets.  Each item is a dictionary detailing training/test datasets
datasets = [{'name': 'tf_idf','X_train': tf_idf_X_train, 'y_train':y_train, 'X_test': tf_idf_X_test, 'y_test':y_test},
           {'name':'count','X_train':count_X_train,'y_train':y_train, 'X_test': count_X_test, 'y_test':y_test}]

### K Nearest Neighbours

The K Nearest Neighbours (KNN) classifier works by identifying a specified number of similar observations ('nearest neighbours') based on a specified distance metric, and then assigning the majority classification of those neighbours.  
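
The idea can be sketched in a few lines of plain Python. This is a minimal illustration with hypothetical two-dimensional points standing in for vectorised tweets, using Euclidean distance and an unweighted vote; it is not how sklearn implements `KNeighborsClassifier` internally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors standing in for vectorised tweets
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["network", "network", "network", "contract", "contract", "contract"]

print(knn_predict(points, labels, query=(0.5, 0.5)))
print(knn_predict(points, labels, query=(5.5, 5.5)))
```

Setting `weights='distance'`, as explored in the grid search below, would instead give closer neighbours a larger say in the vote.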

In [23]:
## Specify the classifier, in this case K nearest neighbours
knn = KNeighborsClassifier()

## Define the parameter grid
knn_param_grid = {'n_neighbors':[5,20,40,50,60],
              'metric': ['manhattan', 'euclidean','minkowski'],
              'weights': ['uniform', 'distance']
             }

In [24]:
## Run the best_model_parameters_dataset() function to identify the optimal dataset and parameters to use
best_model_parameters_dataset(knn, knn_param_grid, datasets)

| | Dataset | Parameters | Test Score | Training Score |
|---|---|---|---|---|
| 1 | count | {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'distance'} | 0.52 | 0.49 |
| 0 | tf_idf | {'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'distance'} | 0.44 | 0.42 |


From the above, it appears that the best performance is achieved when the dataset that has been transformed by a CountVectorizer() is used, and the following parameters are specified:

- **Distance Metric**: Euclidean
- **Number of Neighbours**: 5
- **Weights**: Distance

To assess the performance of this model in more detail, the model will be created using these optimal parameters.

In [38]:
## Create a KNN classifier using the optimal parameters specified above
knn = KNeighborsClassifier(metric='euclidean', n_neighbors=5, weights='distance')

## Fit the model using the count vectorized dataset
knn.fit(count_X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='distance')

Next, predictions can be generated using the model for both the test and training set.  These will be used to calculate a number of metrics to assess performance of the model.

In [47]:
knn_preds_train = knn.predict(count_X_train)
knn_preds_test = knn.predict(count_X_test)

#### Classification Report

The classification report provides the following key metrics:
- **Precision**: Defined as the number of times the model correctly assigned a given classification, as a proportion of all observations with that predicted classification.  A low precision suggests a high rate of false positives.  
- **Recall**: Defined as the number of times the model correctly assigned a given classification, as a proportion of all observations that actually had that classification.  A low recall suggests a high rate of false negatives.  
- **f1-score**: The harmonic mean of precision and recall.  
- **Accuracy**: The proportion of all observations that were correctly classified

Note that there is typically a trade-off between precision and recall: improving one often reduces the other.  Balance is therefore important.  
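
These definitions can be checked by hand on a small example. The sketch below computes the metrics from raw counts for a hypothetical set of eight true and predicted categories; the data is invented purely for illustration.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical true and predicted categories for eight tweets
y_true = ["other", "other", "other", "network", "network", "device", "device", "other"]
y_pred = ["other", "other", "other", "other", "network", "other", "device", "other"]

for label in ["other", "network", "device"]:
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{label:8s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}")
```

In this toy example the classifier over-predicts 'other', so that class has perfect recall but lower precision, while 'network' and 'device' have perfect precision but only 50% recall.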

In [48]:
## Print the classification report for the training dataset
print(classification_report(y_train, knn_preds_train))

                  precision    recall  f1-score   support

       broadband       1.00      1.00      1.00        74
        contract       0.98      1.00      0.99       427
customer service       0.99      0.99      0.99       753
          device       1.00      1.00      1.00       217
         network       1.00      1.00      1.00       436
           other       0.99      0.99      0.99      1341
       promotion       1.00      0.99      1.00       253

        accuracy                           0.99      3501
       macro avg       0.99      0.99      0.99      3501
    weighted avg       0.99      0.99      0.99      3501



In [49]:
## Print the classification report for the test dataset
print(classification_report(y_test, knn_preds_test))

                  precision    recall  f1-score   support

       broadband       0.50      0.10      0.17        10
        contract       0.58      0.11      0.18       101
customer service       0.64      0.35      0.45       178
          device       0.48      0.20      0.29        59
         network       0.73      0.18      0.29       106
           other       0.49      0.95      0.65       354
       promotion       0.71      0.22      0.34        68

        accuracy                           0.52       876
       macro avg       0.59      0.30      0.34       876
    weighted avg       0.58      0.52      0.46       876



The following observations can be made:

- There is a significant mismatch between accuracy on the training set and on the test set (99% vs. 52%).  This is indicative of overfitting, and suggests the model does not generalize well to new data.  

- Whilst there is good balance between precision and recall for the training data, this is not the case for the test data.  With the exception of the 'other' category, precision is notably larger than recall.  This suggests that there is an issue with false negatives.

The performance of this model is too low to take it forward to production.  Alternatives must be sought.

### Multinomial Naive Bayes Classifier

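Next, the Multinomial Naive Bayes Classifier is assessed for suitability.  This classifier is based on Bayes' theorem, and makes the 'naive' assumption that words occur independently of one another given the category.  The alpha parameter controls the amount of smoothing applied to the word counts.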
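
A minimal hand-rolled sketch of the idea is shown below, assuming hypothetical cleaned tweets. It estimates a prior for each category and a smoothed probability for each word within that category, then picks the category with the highest combined log probability. It illustrates the technique only and is not sklearn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_class_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_class_docs / len(docs))
        # alpha is added to every count so unseen words never get probability 0
        log_lik = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_mnb(model, doc):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {
        label: log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical cleaned tweets and categories
toy_docs = ["signal lost again", "no signal here", "refund my bill", "bill too high"]
toy_labels = ["network", "network", "contract", "contract"]

toy_model = train_mnb(toy_docs, toy_labels, alpha=0.8)
print(predict_mnb(toy_model, "signal lost"))
print(predict_mnb(toy_model, "refund bill please"))
```

Words that never appeared in training (such as 'please') are skipped at prediction time, mirroring what happens when a fitted vectorizer transforms new text.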

In [50]:
## Specify the classifier, in this case naive bayes
nb = MultinomialNB()

## Create a parameter grid to identify optimal parameters
nb_param_grid = {'alpha':[0.5,0.8,1]}

In [27]:
## Run the best_model_parameters_dataset() function to identify the optimal dataset and parameters to use
best_model_parameters_dataset(nb, nb_param_grid, datasets)

| | Dataset | Parameters | Test Score | Training Score |
|---|---|---|---|---|
| 1 | count | {'alpha': 0.8} | 0.69 | 0.66 |
| 0 | tf_idf | {'alpha': 0.5} | 0.64 | 0.61 |


To assess the performance further, this model will be created using the optimal combination of parameters:
- Count Vectorized dataset
- Alpha = 0.8

In [51]:
## Create a multinomial NB classifier using the optimal parameters specified above
mnb = MultinomialNB(alpha=0.8)

## Fit the model using the count vectorized dataset
mnb.fit(count_X_train, y_train)


MultinomialNB(alpha=0.8, class_prior=None, fit_prior=True)

Next, predictions can be generated using the model for both the test and training set. These will be used to calculate a number of metrics to assess performance of the model.

In [52]:
## Generate predictions for the MNB model
mnb_preds_train = mnb.predict(count_X_train)
mnb_preds_test = mnb.predict(count_X_test)

#### Classification Report

In [53]:
## Print the classification report for the training dataset
print(classification_report(y_train, mnb_preds_train))

                  precision    recall  f1-score   support

       broadband       1.00      0.49      0.65        74
        contract       0.84      0.86      0.85       427
customer service       0.75      0.92      0.83       753
          device       0.97      0.71      0.82       217
         network       0.87      0.90      0.88       436
           other       0.91      0.87      0.89      1341
       promotion       0.95      0.79      0.86       253

        accuracy                           0.86      3501
       macro avg       0.90      0.79      0.83      3501
    weighted avg       0.87      0.86      0.86      3501



In [55]:
## Print the classification report for the test dataset
print(classification_report(y_test, mnb_preds_test))

                  precision    recall  f1-score   support

       broadband       1.00      0.10      0.18        10
        contract       0.56      0.61      0.58       101
customer service       0.56      0.81      0.66       178
          device       0.71      0.34      0.46        59
         network       0.73      0.77      0.75       106
           other       0.81      0.73      0.77       354
       promotion       0.74      0.47      0.58        68

        accuracy                           0.69       876
       macro avg       0.73      0.55      0.57       876
    weighted avg       0.71      0.69      0.68       876



From the above it can be seen that:

- There is still evidence of overfitting, as accuracy on the training data (86%) is greater than on the test data (69%).  However, the gap is much smaller than under the K Nearest Neighbours classifier.
- There is better balance between precision and recall for both the training and test data, although recall remains low for the smaller categories in the test set (e.g. 'broadband' and 'device').  

### Multinomial Logistic Regression

Logistic Regression is a classification algorithm that employs Maximum Likelihood Estimation to generate a model capable of dividing observations into different groups.
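
In the multinomial case, the model produces one linear score per category and converts these scores into probabilities using the softmax function; maximum likelihood estimation then chooses the weights that make the observed categories as probable as possible. The sketch below shows only the softmax step, with hypothetical scores for a single tweet.

```python
import math

def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    # Subtracting the maximum keeps the exponentials numerically stable
    peak = max(scores.values())
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical linear scores (w . x + b) for one tweet, one per category
toy_scores = {"network": 2.0, "contract": 0.5, "other": 1.0}
toy_probs = softmax(toy_scores)

print({label: round(p, 3) for label, p in toy_probs.items()})
print(max(toy_probs, key=toy_probs.get))
```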

In [28]:
## Specify the classifier, in this case LogisticRegression()
logreg = LogisticRegression(random_state=55, max_iter=15000, multi_class='multinomial')

## Create a parameter grid to identify optimal parameters
logreg_param_grid = {'C':[1,2,10],
                     'class_weight': ['balanced', None],
                     'solver':['newton-cg', 'sag', 'saga','lbfgs']}

In [29]:
best_model_parameters_dataset(logreg, logreg_param_grid, datasets)



| | Dataset | Parameters | Test Score | Training Score |
|---|---|---|---|---|
| 0 | tf_idf | {'C': 2, 'class_weight': 'balanced', 'solver': 'newton-cg'} | 0.75 | 0.71 |
| 1 | count | {'C': 1, 'class_weight': 'balanced', 'solver': 'newton-cg'} | 0.75 | 0.71 |


The above indicates that the following combination of parameters leads to optimal performance:

- **Dataset**: both datasets achieve the same scores after rounding (0.75 test, 0.71 cross-validated); the data vectorised using tf-idf is ranked first and is used here
- **C**: 2
- **class_weight**: balanced
- **solver**: newton-cg

The model is therefore fitted using these parameters.

In [100]:
## Fit the model using optimal dataset and parameters
logreg_ = LogisticRegression(random_state=213, solver='newton-cg', max_iter=15000, C=2, class_weight='balanced')
logreg_.fit(tf_idf_X_train, y_train)



LogisticRegression(C=2, class_weight='balanced', dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=15000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=213, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

To assess the performance of this model, predictions are generated for the training and test dataset.

In [101]:
## Create predictions for the training and test dataset
logreg_preds_train = logreg_.predict(tf_idf_X_train)
logreg_preds_test = logreg_.predict(tf_idf_X_test)

#### Classification Report

Again, the classification report is generated for the training and test data.

In [102]:
## Print the classification report for the training dataset
print(classification_report(y_train, logreg_preds_train))

                  precision    recall  f1-score   support

       broadband       0.91      1.00      0.95        74
        contract       0.89      0.95      0.92       427
customer service       0.92      0.92      0.92       753
          device       0.90      0.96      0.93       217
         network       0.94      0.96      0.95       436
           other       0.97      0.92      0.94      1341
       promotion       0.95      0.96      0.96       253

        accuracy                           0.94      3501
       macro avg       0.93      0.95      0.94      3501
    weighted avg       0.94      0.94      0.94      3501



In [103]:
## Print the classification report for the test dataset
print(classification_report(y_test, logreg_preds_test))

                  precision    recall  f1-score   support

       broadband       0.50      0.50      0.50        10
        contract       0.64      0.68      0.66       101
customer service       0.75      0.72      0.74       178
          device       0.64      0.76      0.70        59
         network       0.82      0.79      0.80       106
           other       0.82      0.83      0.83       354
       promotion       0.82      0.66      0.73        68

        accuracy                           0.77       876
       macro avg       0.71      0.71      0.71       876
    weighted avg       0.77      0.77      0.77       876



There are similar issues with overfitting with this model.  Accuracy on the training data (94%) is about 17 percentage points higher than on the test data (77%).  

**Note**: By default, an l2 penalty is included within a logistic regression model solved using 'newton-cg'.  This solver does not support an l1 penalty, so the main lever for reducing overfitting is the regularisation strength, C (smaller values give stronger regularisation), which was already included in the grid search above.  

It can also be seen from the above that the model performs poorly when classifying tweets as 'broadband' and 'contract', but performs much better when classifying into the other categories.

### Random Forest

A random forest is considered as the next possible classifier.  Random forests work by creating a number of different decision trees (specified by n_estimators) and then outputting the mode of the predictions made by each decision tree.
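
The voting step can be sketched as follows, using hypothetical predictions from five trees for a single tweet. This shows the textbook 'hard vote' described above; sklearn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities, which usually, but not always, gives the same answer.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five decision trees for a single tweet
tree_predictions = ["network", "other", "network", "contract", "network"]
print(majority_vote(tree_predictions))
```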

In [30]:
## Define the classifier to be used, in this case RandomForestClassifier(), specify a random_state
## so that the results are reproducible
forest = RandomForestClassifier(random_state=55)

## Specify the parameter grid to be assessed
forest_param_grid = {'n_estimators': [75,150,300,450],
                    'criterion': ['gini', 'entropy'],
                  'max_depth':[None, 5, 10, 15],
                  'class_weight': ['balanced', None],
                  'bootstrap': [True, False]
             }

In [31]:
## Run the best_model_parameters_dataset() function to identify the optimal dataset and parameters to use
best_model_parameters_dataset(forest, forest_param_grid, datasets)

| | Dataset | Parameters | Test Score | Training Score |
|---|---|---|---|---|
| 0 | tf_idf | {'bootstrap': False, 'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': None, 'n_estimators': 450} | 0.76 | 0.7 |
| 1 | count | {'bootstrap': False, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'n_estimators': 300} | 0.71 | 0.69 |


As can be seen from the above, the following combinations of parameters yield optimal performance:
- **Dataset**: Data vectorized using tf-idf vectorization
- **bootstrap**: False
- **class_weight**: balanced
- **criterion**: gini
- **max_depth**: None
- **n_estimators**: 450

To investigate the performance of this model further, it is created below.

In [71]:
## Define the random forest classifier
forest = RandomForestClassifier(bootstrap=False, criterion='gini', max_depth=None, 
                                n_estimators=450, random_state=213, class_weight='balanced')

## Fit the model to the tf_idf data
forest.fit(tf_idf_X_train, y_train)

RandomForestClassifier(bootstrap=False, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=450, n_jobs=None, oob_score=False,
                       random_state=213, verbose=0, warm_start=False)

In [72]:
## Generate predictions using this model
forest_preds_train = forest.predict(tf_idf_X_train)
forest_preds_test = forest.predict(tf_idf_X_test)

#### Classification Report

The classification report is generated.

In [73]:
## Generate the classification report for the training data
print(classification_report(y_train, forest_preds_train))

                  precision    recall  f1-score   support

       broadband       1.00      1.00      1.00        74
        contract       0.97      1.00      0.99       427
customer service       0.98      0.99      0.99       753
          device       1.00      1.00      1.00       217
         network       0.99      1.00      1.00       436
           other       1.00      0.98      0.99      1341
       promotion       0.99      1.00      1.00       253

        accuracy                           0.99      3501
       macro avg       0.99      1.00      0.99      3501
    weighted avg       0.99      0.99      0.99      3501



In [74]:
## Generate the classification report for the testing data
print(classification_report(y_test, forest_preds_test))

                  precision    recall  f1-score   support

       broadband       0.67      0.40      0.50        10
        contract       0.65      0.67      0.66       101
customer service       0.73      0.71      0.72       178
          device       0.63      0.49      0.55        59
         network       0.81      0.78      0.80       106
           other       0.77      0.89      0.83       354
       promotion       0.95      0.51      0.67        68

        accuracy                           0.75       876
       macro avg       0.74      0.64      0.68       876
    weighted avg       0.76      0.75      0.75       876



In the above, there is evidence of marked overfitting (accuracy is 99% with training data, but 75% with test data).  The balance between precision and recall is mixed.

### Support Vector Machine

Next, the Support Vector Machine (SVM) is assessed.  This model attempts to find the decision boundary which maximises the margin, i.e. the distance between the boundary and the nearest training observations.  It includes a regularisation parameter (C), which controls the trade-off between achieving a wide margin and misclassifying training observations: smaller values of C tolerate more misclassifications in exchange for a wider margin.

In [32]:
## Specify the classification model, in this case a support vector machine
svm = SVC(gamma='auto', random_state=55)

## Specify the parameter grid to be used during the GridSearchCV
svm_param_grid = {'C':[1,5,10],
                'class_weight':['balanced', None]}

In [33]:
## Run the best_model_parameters_dataset() function to identify the optimal dataset and parameters to use
best_model_parameters_dataset(svm, svm_param_grid, datasets)

| | Dataset | Parameters | Test Score | Training Score |
|---|---|---|---|---|
| 1 | count | {'C': 10, 'class_weight': 'balanced'} | 0.55 | 0.47 |
| 0 | tf_idf | {'C': 1, 'class_weight': None} | 0.4 | 0.38 |


As can be seen from the above, performance of the SVM is weak relative to some of the other models (e.g. Multinomial Logistic Regression).  For this reason, we will not progress with this classifier.

### XG Boost

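XG Boost is provided by the separate xgboost library rather than sklearn. Although its XGBClassifier follows the sklearn estimator interface and could in principle be passed to GridSearchCV, a grid search is not performed for this model here; the default parameters are used. However, the performance of this model will be compared using both the tf-idf dataset and the count vectorized dataset.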

#### XG Boost with TF-IDF Dataset

In [34]:
## Specify the classifier, in this case XG Boost
boost = xgb.XGBClassifier()

## Fit the model using the training data
boost.fit(tf_idf_X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [35]:
## Create predictions for the training and test datasets
boost_preds_train = boost.predict(tf_idf_X_train)
boost_preds_test = boost.predict(tf_idf_X_test)

In [75]:
## Print the accuracy score for XG Boost using training data
accuracy_score(y_train, boost_preds_train)

0.7649243073407598

In [76]:
## Print the accuracy score for XG Boost using test data
accuracy_score(y_test, boost_preds_test)

0.7203196347031964

#### XG Boost with Count Vectorized Dataset

In [77]:
## Specify the classifier, in this case XG Boost
boost = xgb.XGBClassifier()

## Fit the model using the training data
boost.fit(count_X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [78]:
## Create predictions for the training and test datasets
boost_preds_train = boost.predict(count_X_train)
boost_preds_test = boost.predict(count_X_test)

In [80]:
## Print the accuracy score for XG Boost using training data
accuracy_score(y_train, boost_preds_train)

0.7223650385604113

In [81]:
## Print the accuracy score for XG Boost using test data
accuracy_score(y_test, boost_preds_test)

0.7191780821917808

#### Assessment

As can be seen from the above, XG Boost performs marginally better on the test set when using the tf-idf dataset (72.0% vs. 71.9% accuracy).  

## Label Unlabelled Data

Having built the initial model, it is possible to investigate whether predicting labels for the unlabelled data, and then  re-running this model to include the predicted categories for the unlabelled data, will improve performance.

Before doing this, it is first necessary to import the unlabelled data, which has been stored in a Pandas DataFrame using pickle.

In [83]:
## Import the unlabelled data using Pickle
unlabelled_df = pd.read_pickle('cleaned_unlabelled_tweets')

In [84]:
## Display the first five rows of the DataFrame
unlabelled_df.head()

Unnamed: 0,network,datetime,original_tweet,subject,sentiment,lemmatized_tweets_tokens,lemmatized_tweets_string
39,@VodafoneUK,2019-12-04 01:20:56,@avipan_lko @VodafoneIN @VodafoneGroup @VodafoneUK @TRAI @rssharma3 @rsprasad @narendramodi ऐसे ही रहेगा वोडाफोन सुधार हो ही नहीं सकता,,0.0,[],
41,@VodafoneUK,2019-12-04 00:48:17,@danielrome18 @VodafoneUK fucking hell 😱,,-0.6,"[fucking, hell]",fucking hell
56,@VodafoneUK,2019-12-03 22:46:47,@VodafoneUK I was hoping you’d say that.,,0.0,"[hoping, youd, say]",hoping youd say
61,@VodafoneUK,2019-12-03 22:38:07,@VodafoneUK please explain https://t.co/PhHJdMbrG9,,0.0,"[please, explain]",please explain
65,@VodafoneUK,2019-12-03 22:22:46,"@Townsley85 @VodafoneUK Hear, hear!",,0.0,"[hear, hear]",hear hear


Next, it is necessary to vectorize the unlabelled tweets using the tf_idf vectorizer.

In [None]:
## Transform unlabelled tweets using the tfidf_vectorizer
tf_idf_unlabelled_X_train = tfidf_vectorizer.transform(unlabelled_df['lemmatized_tweets_string'])

Once this is complete, it is possible to predict categories for the unlabelled data using the chosen model.  The fitted logistic regression model (`logreg_`) is chosen as it offers a good balance between accuracy for the training and test data.

In [None]:
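## Make predictions for the unlabelled tweets using the fitted model (logreg_);
## the unfitted logreg object passed to the grid search cannot be used to predict
unlabelled_y_train = logreg_.predict(tf_idf_unlabelled_X_train)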

Finally, the unlabelled tweets and the predicted categories can be combined into a single dataset.

In [85]:
## Create a combined list of tweets (X)
all_X_train = list(X_train)
all_X_train.extend(unlabelled_df['lemmatized_tweets_string'])

In [91]:
## Vectorize the complete list of tweets
tf_idf_all_X_train = tfidf_vectorizer.transform(all_X_train)

In [87]:
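## Create a combined list of categories (y), in the same order as all_X_train:
## labelled training categories first, then predicted categories for the unlabelled tweets
all_y_train = np.append(arr=np.array(y_train), values=unlabelled_y_train)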
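
When combining the two sources, the label array needs to follow the same order as the list of tweets, otherwise tweets end up paired with the wrong categories. Note that `np.append(arr, values)` places `arr` first. The sketch below uses small hypothetical stand-ins for the labelled and pseudo-labelled data to show a concatenation that keeps each tweet next to its own label.

```python
import numpy as np

# Hypothetical stand-ins for the labelled and pseudo-labelled data
X_labelled = ["slow network", "refund please"]
y_labelled = np.array(["network", "contract"])
X_unlabelled = ["new phone arrived"]
y_predicted = np.array(["device"])

# Features: labelled tweets first, then unlabelled tweets
all_X = list(X_labelled) + list(X_unlabelled)

# Labels must be concatenated in the same order as the features
all_y = np.concatenate([y_labelled, y_predicted])

# Each tweet should still sit next to its own label
for tweet, label in zip(all_X, all_y):
    print(f"{label:9s} {tweet}")
```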

## Re-Run Model

Finally, a new model can be created using the enlarged dataset.  To ensure assessment is comparable to prior models, the same test dataset will be used as before.  A multinomial logistic regression model will be used.

In [97]:
## Create a list containing the combined dataset
complete_datasets = [{'name': 'complete_tf_idf','X_train': tf_idf_all_X_train, 'y_train':all_y_train, 'X_test': tf_idf_X_test, 'y_test':y_test}]


In [98]:
## Specify the classifier, in this case LogisticRegression()
logreg = LogisticRegression(random_state=55, max_iter=15000, multi_class='multinomial')

## Create a parameter grid to identify optimal parameters
logreg_param_grid = {'C':[1,2,10],
                     'class_weight': ['balanced', None],
                     'solver':['newton-cg', 'sag', 'saga','lbfgs']}

In [99]:
## Run the best_model_parameters_dataset() function to identify the optimal dataset and parameters to use
best_model_parameters_dataset(logreg, logreg_param_grid, complete_datasets)



| | Dataset | Parameters | Test Score | Training Score |
|---|---|---|---|---|
| 0 | complete_tf_idf | {'C': 1, 'class_weight': None, 'solver': 'saga'} | 0.35 | 0.32 |


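The results above suggest that the model deteriorates when trained on unlabelled data categorised by the original model.  This score should be treated with caution, however: it is only meaningful if the combined labels follow the same order as the combined tweets (training tweets first, then unlabelled tweets), and a drop of this size (from 0.75 to 0.35 on the same test set) is more consistent with misaligned labels than with noisy predicted labels alone.  This approach is not taken further here.  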

## Save Model using Pickle

The chosen logistic regression model (run using only the labelled data) is saved to disk using joblib.  This model will be used to generate predictions in the next step, **Step 5 - Interpret Results**.

In [108]:
## Use joblib to save the fitted model to disk
joblib.dump(logreg_, 'model.pkl')

['model.pkl']
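
One practical point when reloading the model later: new tweets must be transformed with the same fitted vectorizer that was used in training, so it is typically saved alongside the model. The sketch below illustrates a save-and-reload round trip under stated assumptions: the toy tweets, the temporary directory and the file name 'vectorizer.pkl' are all hypothetical and are not part of this project.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the labelled tweets
toy_tweets = ["slow network today", "network down again",
              "refund my contract", "cancel contract please"]
toy_subjects = ["network", "network", "contract", "contract"]

toy_vectorizer = TfidfVectorizer()
toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(toy_vectorizer.fit_transform(toy_tweets), toy_subjects)

with tempfile.TemporaryDirectory() as tmp_dir:
    # File names here are illustrative, not those used in the project
    model_path = os.path.join(tmp_dir, "model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    joblib.dump(toy_clf, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Reload both objects, as a later notebook would
    loaded_clf = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

new_tweets = ["network is slow", "contract refund"]
original = toy_clf.predict(toy_vectorizer.transform(new_tweets))
reloaded = loaded_clf.predict(loaded_vectorizer.transform(new_tweets))
print(list(reloaded))
```

The reloaded pair reproduces the predictions of the original objects, which is the property the next step relies on.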