
# Natural Language Processing

## Importing the libraries

In [631]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [632]:
dataset = pd.read_csv('AppData2.csv', encoding='ISO-8859-1')

## Cleaning the texts

In [633]:
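import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build the stemmer and the stopword set once, not on every iteration
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # keep 'not' because negation carries meaning
stopword_set = set(all_stopwords)

corpus = []
for i in range(len(dataset)):
  # Keep letters only, lowercase, and split into words
  text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])
  text = text.lower()
  text = text.split()
  # Drop stopwords and stem the remaining words
  text = [ps.stem(word) for word in text if word not in stopword_set]
  text = ' '.join(text)
  corpus.append(text)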

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
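
The cleaning steps can also be wrapped in a small helper so that the same logic applies to training rows and to new statements. The sketch below is a dependency-free illustration only: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny hand-picked stand-in for NLTK's English list, and stemming is left out, so its output will differ from the notebook's corpus.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list ('not' is deliberately absent)
DEMO_STOPWORDS = {"i", "the", "a", "an", "and", "to", "of", "was", "did", "my"}

def clean_text(text, stopwords=DEMO_STOPWORDS):
    """Keep letters only, lowercase, split into words, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    words = letters_only.lower().split()
    return ' '.join(word for word in words if word not in stopwords)

cleaned = clean_text("I did NOT manage the 2021 budget!")
print(cleaned)  # prints: not manage budget
```

Digits and punctuation disappear because the regex replaces every non-letter with a space, and `not` survives because it is not in the stopword set.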


In [634]:
# View all words in corpus
#print(corpus)

## Creating the TF-IDF Bag of Words model

In [635]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values
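
To make the TF-IDF weights less of a black box, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as we understand them: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row. The function `tfidf_rows` and the toy documents are illustrative assumptions; this is not how scikit-learn is implemented internally.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-idf, L2-normalised TF-IDF weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({word for words in tokenised for word in words})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    df = Counter(word for words in tokenised for word in set(words))
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}
    rows = []
    for words in tokenised:
        counts = Counter(words)
        row = [counts[word] * idf[word] for word in vocab]
        # Scale the row to unit length (guard against an all-zero row)
        norm = math.sqrt(sum(value ** 2 for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocab, rows

# Toy documents, made up for illustration
vocab, rows = tfidf_rows(["team lead team", "data lead", "data analysi"])
for word, weight in zip(vocab, rows[0]):
    print(f"{word}: {weight:.3f}")
```

A word that is frequent in one document but rare across the corpus (here `team`) ends up with the largest weight in that document's row.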

## Splitting the dataset into the Training set and Test set

In [636]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
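
With only a few Pass [1] labels in the data, a purely random split can leave the test set with very few positives. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The arrays below are synthetic placeholders, not the application data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 20 of them labelled 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Both parts keep the 20% share of positives
print(y_tr.mean(), y_te.mean())
```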

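## Training the classifier on the Training set

Several classifiers were tried; only the K-NN cell is active below, and the alternatives are left commented out.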

In [637]:
#Classifier: GaussianNB
#from sklearn.naive_bayes import GaussianNB
#classifier = GaussianNB()
#classifier.fit(X_train, y_train)

In [638]:
#Classifier: MultinomialNB
#from sklearn.naive_bayes import MultinomialNB
#classifier = MultinomialNB()
#classifier.fit(X_train, y_train)

In [639]:
#Classifier: SVM
#from sklearn.svm import SVC
#classifier = SVC(kernel = 'linear', random_state = 0)
#classifier.fit(X_train, y_train)

In [640]:
#Classifier: Kernel SVM
#from sklearn.svm import SVC
#classifier = SVC(kernel = 'rbf', random_state = 0)
#classifier.fit(X_train, y_train)

In [641]:
#Classifier: K-NN
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

KNeighborsClassifier()

In [642]:
#Classifier: Random Forest 
#from sklearn.ensemble import RandomForestClassifier
#classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
#classifier.fit(X_train, y_train)
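
Instead of commenting cells in and out, the candidate classifiers could be compared in one loop. The sketch below uses 5-fold cross-validation on synthetic data from `make_classification` as a stand-in for `X` and `y`; the dictionary name `candidates` is hypothetical. `MultinomialNB` is left out here only because the synthetic features can be negative, which it does not accept; TF-IDF features are non-negative, so it could be added for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF matrix X and the labels y
X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=0)

candidates = {
    'GaussianNB': GaussianNB(),
    'Linear SVM': SVC(kernel='linear', random_state=0),
    'Kernel SVM': SVC(kernel='rbf', random_state=0),
    'K-NN': KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
    'Random Forest': RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Mean accuracy over 5 folds; each fold fits a fresh clone of the model
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Cross-validated scores are usually less sensitive to one lucky or unlucky split than a single 70/30 evaluation, which matters with a dataset this small.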

## Predicting the Test set results

In [643]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]]


## Measuring Performance

In [644]:
#Confusion Matrix - Accuracy
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
#Rows are actual classes, columns are predicted classes:
#TL = TN
#TR = FP
#BL = FN
#BR = TP


[[25  0]
 [ 2  3]]


0.9333333333333333
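
For a binary problem, scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, so flattening the matrix row by row yields TN, FP, FN, TP in that order. The sketch below hard-codes the matrix printed above in a hypothetical variable `cm_demo` so that it runs on its own; in the notebook one would call `cm.ravel()` directly.

```python
import numpy as np

# Matrix printed above: rows = actual class, columns = predicted class
cm_demo = np.array([[25, 0],
                    [2, 3]])

tn, fp, fn, tp = cm_demo.ravel()
accuracy = (tp + tn) / cm_demo.sum()
print(tn, fp, fn, tp)       # prints: 25 0 2 3
print(round(accuracy, 4))   # prints: 0.9333
```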

In [645]:
#What the model got right
#Reuse the test labels and the predictions from the cells above instead of retyping them
y_true = y_test
true_positives = ((y_pred == y_true) & (y_pred == 1)).sum()
true_positives

3

In [646]:
true_negatives = ((y_pred == y_true) & (y_pred == 0)).sum()
true_negatives

25

In [647]:
#What the model got wrong
false_positives = ((y_pred != y_true) & (y_pred == 1)).sum()
false_positives

0

In [648]:
false_negatives = ((y_pred != y_true) & (y_pred == 0)).sum()
false_negatives

2

In [649]:
#Precision - of the statements predicted as Pass [1], the fraction that really are Pass (1.0 = NO FALSE POSITIVES)
precision = true_positives / (true_positives + false_positives)
precision

1.0

In [650]:
#Recall - of the statements that really are Pass [1], the fraction the model detected
recall = true_positives / (true_positives + false_negatives)
recall

0.6
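
Precision and recall are often summarised in a single F1 score, their harmonic mean. The notebook does not compute it; the sketch below does so from the counts obtained above, written out as literals so the cell runs on its own.

```python
# Counts taken from the cells above
true_positives, false_positives, false_negatives = 3, 0, 2

# F1 = 2 * P * R / (P + R), which simplifies to the count form below
f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
print(f1)  # prints: 0.75
```

The score sits between precision (1.0) and recall (0.6) and is pulled towards the lower of the two.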

In [651]:
#RMSE - Root Mean Square Error
#Average the squared errors first, then take the square root, to get a single number
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"{rmse:.4f}")

0.2582

## Predicting if a single statement is Pass [1] or Reject [0]

### Pass statement

In [657]:
new_statement = "When leading a team, I am proudest when I see my direct reports develop and start working autonomously to a high standard. An example was when I worked at the charity Retail Trust as Head of Marketing & Insights and inherited a junior marketing insights team where I needed to upskill them in modern ways of data driven marketing. They had had no formal training or individual objectives. I worked with team members individually on an annual learning and development plan, covering strengths and development areas. Recognising the value of feedback, I met with everyone separately bi-weekly to track progress and discuss any training needs. Each month I got the whole team together and asked them to showcase what new projects they were working on using their new skills and how that contributed to the charity’s vision of supporting an extra 20,000 people a year with wellbeing in the retail sector. After a few months they had built up practical knowledge on data-led decision making, influential communications and proving return on investment. I arranged for them to present to the heads of department explaining what they had learnt about those most in need in the retail sector, any new skills and actions the charity should take.  Consequently, the CEO took notice and asked my team to begin working with other areas of the business in digital upskilling. What I valued most was the praise and achievements the team received as they grew and successfully helped the charity deliver its vision."
new_statement = re.sub('[^a-zA-Z]', ' ', new_statement)
new_statement = new_statement.lower()
new_statement = new_statement.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
stopword_set = set(all_stopwords)  # build the set once, not once per word
new_statement = [ps.stem(word) for word in new_statement if word not in stopword_set]
new_statement = ' '.join(new_statement)
new_corpus = [new_statement]
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
print(new_y_pred)

[1]


The statement was correctly predicted as Pass [1] by our model.

### Reject statement

We repeat the same text preprocessing as before, this time on a single statement.

In [658]:
new_statement = "As the MOD's Head of Workforce Analytics for Strategic Command I analysed the military and civilian workforce data across ten departments including Digital, Intelligence, Special Forces and Cyber. When the annual task of reviewing critical gaps in the workforce was commissioned by Head Office I sought to work with the departments in creating their analysis instead of issuing a spreadsheet and waiting for each HR team to return it. I knew UKStratCom's strategic areas were cyber, special forces, intelligence and interoperability enabled through technology. I analysed over 25k employee records to identify the strength and establishment of those strategic roles and identified shortfalls. Next, I reached out to the finance and workforce teams of each department and presented my data, aiming for the following outcomes: 1) Understanding and agreeing the workforce data 2) Assessing the significance of the critical workforce gaps 3) Discussing mitigating actions to address the gaps 4) Agreeing when the gaps would be resolved Insufficient recruitment was a dominant factor and in many instances the Army were unable to fill their posts. I wanted to dig further and get their view for a rounded analysis. I shared my findings with the Army’s workforce analytics team who recognised the affect their recruitment pipeline was having on my Command's outputs and they described the actions they were taking. By reaching out to a wide selection of stakeholders I built a comprehensive view of the problem and delivered a robust analysis which I included on the Integrated Review paper."
new_statement = re.sub('[^a-zA-Z]', ' ', new_statement)
new_statement = new_statement.lower()
new_statement = new_statement.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
stopword_set = set(all_stopwords)  # build the set once, not once per word
new_statement = [ps.stem(word) for word in new_statement if word not in stopword_set]
new_statement = ' '.join(new_statement)
new_corpus = [new_statement]
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
print(new_y_pred)

[0]


The statement was correctly predicted as Reject [0] by our model.
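
Both prediction cells repeat the cleaning steps by hand. A possible alternative, sketched here on made-up toy data, is to bundle the vectoriser and the classifier in a scikit-learn pipeline so that a single `predict` call accepts a raw string. Everything named `toy_*` is invented for illustration, `n_neighbors=1` is chosen only because the toy set is tiny, and `TfidfVectorizer`'s built-in lowercasing and tokenisation stand in for the NLTK cleaning, so results on the real data would differ from the notebook's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up toy data, only to make the sketch runnable
toy_texts = [
    "led team and developed direct reports",
    "mentored team members and tracked progress",
    "analysed workforce data across departments",
    "analysed employee records and presented data",
]
toy_labels = [1, 1, 0, 0]

# n_neighbors=1 only because the toy set has four rows
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(toy_texts, toy_labels)

# One call vectorises the raw string and classifies it
prediction = model.predict(["I mentored my team"])
print(prediction)
```

Fitting such a pipeline on the training rows only also keeps test-set vocabulary out of the TF-IDF statistics.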