<a href="https://nbviewer.jupyter.org/github/alisonmitchell/Stock-Prediction/blob/main/Sentiment_Analysis/NLP_Text_Preprocessing_and_Classification.ipynb" 
   target="_parent">
   <img src="https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg" 
      width="109" height="20" alt="render in nbviewer">
</a>

# NLP - Text Preprocessing and Classification

## 1. Introduction
Text preprocessing is an approach for cleaning and preparing text data for NLP tasks before training a model. Various preprocessing steps in an NLP pipeline, including text normalisation techniques for transforming a text into a canonical (standard) form to reduce noise, will be investigated using NLTK and spaCy libraries on NSE market news articles collected by web scraping from [moneycontrol.com](https://www.moneycontrol.com/).  

Text classification is the problem of assigning categories to text data according to its content. Different techniques to extract information from raw text data for training a classification model will be explored including Bag of Words, TF-IDF and Word Embedding.

One of the primary applications of NLP is to cut through the noise (high dimensionality from large volumes of text) and identify the signal (extract the main topics). Topic modelling is the practice of using a quantitative algorithm to automatically output the key topics that a body of text is about. Here, the Latent Dirichlet Allocation (LDA) algorithm from the Gensim package will be used.





## 2. Install/import libraries

In [1]:
!pip -q install spacy
!python -m spacy download en
!pip install -q pyLDAvis
!pip install contractions

[38;5;3m⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.[0m
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 2.4 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.24.3 which is incompatible.


Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-2.0.0-cp310-cp310-win_amd64.whl (39 kB)
Collecting anyascii
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
     -------------------------------------- 289.9/289.9 kB 1.5 MB/s eta 0:00:00
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import nltk
import re
import pprint
import string
import contractions
import spacy
import en_core_web_sm
import gensim
import gensim.corpora as corpora
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
from contractions import contractions_dict

from nltk import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn import linear_model

from IPython.display import clear_output

from spacy import displacy

from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from pprint import pprint


## 3. Import data
Read in text file of articles collected by web scraping from Investing.com. Data has been prepared for further processing by appending ---newarticle--- to the body text of each article. This will then be specified as the separator for splitting the string into a list using the split() method. New lines will also be removed.

In [None]:
txt_file = open("azn_bodytext_20210104.txt")
articles = txt_file.read().replace('\\n', ' ').split('---newarticle---')
articles

[' The FTSE 100 firm has provided 530,000 doses ready for use on Monday at six hospital trusts  The UK has begun the rollout of the coronavirus (COVID-19) vaccine produced by PLC ( ) and Oxford University.  The FTSE 100 firm has provided 530,000 doses ready for use on Monday at six hospital trusts, in Oxford, London, Sussex, Lancashire and Warwickshire.  Most other available doses will be sent to GP-led services and care homes later in the week.  \'I\'m so pleased to be getting the COVID vaccine today and really proud it is one that was invented in Oxford.\'    82-year-old Brian Pinker became the first person in the world to receive the new Oxford vaccine this morning at @OUHospitals. ???? pic.twitter.com/nhnd3Sx97m — NHS England and NHS Improvement (@NHSEngland) January 4, 2021  The country has established over 730 vaccination sites and hundreds more are opening this week to take the total to over 1,000.  The government has secured access to 100mln doses of the inoculation on behalf o

## 4. Text Preprocessing

### 4.1 Tokenisation

Tokenisation is a way of separating text into smaller units called tokens, most commonly using space as a delimiter. Types can be broadly classified as word, character and subword (n-gram) tokenisation. 

Word tokenisation is the most commonly used tokenisation algorithm and splits words into their own strings to facilitate counting, for example.

In [None]:
tokens = [word_tokenize(article) for article in articles]
text = nltk.Text(tokens)
text

<Text: ['The', 'FTSE', '100', 'firm', 'has', 'provided', '530,000', 'doses', 'ready', 'for', 'use', 'on', 'Monday', 'at', 'six', 'hospital', 'trusts', 'The', 'UK', 'has', 'begun', 'the', 'rollout', 'of', 'the', 'coronavirus', '(', 'COVID-19', ')', 'vaccine', 'produced', 'by', 'PLC', '(', ')', 'and', 'Oxford', 'University', '.', 'The', 'FTSE', '100', 'firm', 'has', 'provided', '530,000', 'doses', 'ready', 'for', 'use', 'on', 'Monday', 'at', 'six', 'hospital', 'trusts', ',', 'in', 'Oxford', ',', 'London', ',', 'Sussex', ',', 'Lancashire', 'and', 'Warwickshire', '.', 'Most', 'other', 'available', 'doses', 'will', 'be', 'sent', 'to', 'GP-led', 'services', 'and', 'care', 'homes', 'later', 'in', 'the', 'week', '.', "'I", "'m", 'so', 'pleased', 'to', 'be', 'getting', 'the', 'COVID', 'vaccine', 'today', 'and', 'really', 'proud', 'it', 'is', 'one', 'that', 'was', 'invented', 'in', 'Oxford', '.', "'", '82-year-old', 'Brian', 'Pinker', 'became', 'the', 'first', 'person', 'in', 'the', 'world', 'to

### 4.2 Removing Stopwords

Stopwords are some of the most common words and are necessary for sentences to be grammatically correct (e.g. a, is, an, the, and). However, they carry very little or no useful information and are filtered out as part of the text preprocessing stage to remove noise so that machine learning algorithms can better focus on the signal, or words which define the meaning of the text.

We will use the list from nltk.corpus and exclude tokens with characters that are not alphabetical using the isalpha() method.

In [None]:
stop_words = set(stopwords.words('english'))

# Iterate through all tokens for all articles 
filtered_articles = [[word for word in article if not word in stop_words if word.isalpha()] for article in tokens]
print(stop_words)
print(filtered_articles)

{'me', 'no', 'own', 'but', 'doesn', 'about', 'haven', 'here', 'a', 'because', 'myself', 'than', 'the', 'other', "hadn't", 'while', 'm', 'whom', "needn't", 'is', 's', 'my', 'same', 'been', 'before', 'yourself', 'any', 'have', "she's", 'for', 'needn', 'hasn', 'those', 'and', "mustn't", 'more', 'as', 'from', 'that', 'doing', 've', 'above', 'which', 'these', 'they', 'our', 'couldn', 'in', 'such', 'further', 'ain', 'after', 'into', 'herself', 'who', 'of', 'be', 'if', 'each', "you've", 'it', 'this', 'there', 'their', 'all', 'too', 'aren', 'shouldn', "shouldn't", 't', "don't", 'are', 'up', 'once', 'ourselves', 'now', 'd', "you'll", 'by', 'mightn', 'should', "that'll", 'itself', 'we', "weren't", 'wouldn', 'not', "hasn't", 'nor', 'most', 'her', 'so', 'hadn', 'you', 'was', 'an', 'weren', 'them', "it's", 'did', 'he', 'didn', "you'd", 'theirs', 'isn', 'what', "you're", "should've", 'where', "isn't", 'hers', 'with', 'against', 'until', 'very', 'am', 'mustn', 'has', 'on', 'then', 'during', 'just', '

The nltk.corpus list can be further enhanced by adding or removing custom words as appropriate using the add() or remove() methods respectively. The example below shows that stopwords from NLTK are all lower case as 'The' is not filtered out. Text could be changed to lower case using the lower() method so that later on in the process the same word is not represented as two different words in the vector space resulting in more dimensions. 

Further text preprocessing steps could include expanding contractions to their original form, and removal of punctuation, special characters and accented characters.

In [None]:
# Remove 'The' by adding it to the stopwords list

stop_words.add('The')

# Iterate through all tokens for all articles 
filtered_articles = [[word for word in article if not word in stop_words if word.isalpha()] for article in tokens]
print(stop_words)
print(filtered_articles)

{'me', 'no', 'own', 'but', 'doesn', 'about', 'haven', 'here', 'a', 'because', 'myself', 'than', 'the', 'other', "hadn't", 'while', 'm', 'whom', "needn't", 'is', 's', 'my', 'same', 'been', 'before', 'yourself', 'any', 'have', "she's", 'for', 'needn', 'hasn', 'those', 'and', "mustn't", 'more', 'as', 'from', 'that', 'doing', 've', 'above', 'which', 'these', 'they', 'our', 'couldn', 'in', 'The', 'such', 'further', 'ain', 'after', 'into', 'herself', 'who', 'of', 'be', 'if', 'each', "you've", 'it', 'this', 'there', 'their', 'all', 'too', 'aren', 'shouldn', "shouldn't", 't', "don't", 'are', 'up', 'once', 'ourselves', 'now', 'd', "you'll", 'by', 'mightn', 'should', "that'll", 'itself', 'we', "weren't", 'wouldn', 'not', "hasn't", 'nor', 'most', 'her', 'so', 'hadn', 'you', 'was', 'an', 'weren', 'them', "it's", 'did', 'he', 'didn', "you'd", 'theirs', 'isn', 'what', "you're", "should've", 'where', "isn't", 'hers', 'with', 'against', 'until', 'very', 'am', 'mustn', 'has', 'on', 'then', 'during', 'j

## 5. Text normalisation

Stemming and lemmatisation are text normalisation techniques within NLP that are used to prepare text, words, and documents for further processing and both generate the root form of the inflected words.

### 5.1 Stemming

Stemming is the process of removing the suffix from a word to reduce inflected/derived words to their word stem, base or root form. There are many ways to perform stemming such as lookup table, suffix-stripping algorithms etc. These mainly rely on chopping off ‘s’, ‘es’, ‘ed’, ‘ing’, ‘ly’ etc from the end of the words until the stem is reached. 

Stemming is a somewhat crude method for cataloguing related words and sometimes the conversion is not desirable. The stem might not be an actual word, and English has many exceptions where a more sophisticated process is required to overcome the two main stemming errors of over-stemming and under-stemming. Nevertheless, stemming helps in standardising text and follows an algorithm with steps to perform on the words which makes it simpler and faster than lemmatisation.

One of the most common — and effective — stemming algorithms is the **Porter Stemmer** developed by Martin Porter in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. The algorithm employs five phases of word reduction, each with its own set of mapping rules. 





In [None]:
ps = PorterStemmer()

def nltk_stemmer(text):
  # Iterate for all words for all articles 
  stemmed_articles = [[ps.stem(w) for w in filtered_article] for filtered_article in filtered_articles]
  return stemmed_articles

nltk_stemmed_articles = nltk_stemmer(filtered_articles)
print(nltk_stemmed_articles)

[['ftse', 'firm', 'provid', 'dose', 'readi', 'use', 'monday', 'six', 'hospit', 'trust', 'UK', 'begun', 'rollout', 'coronaviru', 'vaccin', 'produc', 'plc', 'oxford', 'univers', 'ftse', 'firm', 'provid', 'dose', 'readi', 'use', 'monday', 'six', 'hospit', 'trust', 'oxford', 'london', 'sussex', 'lancashir', 'warwickshir', 'most', 'avail', 'dose', 'sent', 'servic', 'care', 'home', 'later', 'week', 'pleas', 'get', 'covid', 'vaccin', 'today', 'realli', 'proud', 'one', 'invent', 'oxford', 'brian', 'pinker', 'becam', 'first', 'person', 'world', 'receiv', 'new', 'oxford', 'vaccin', 'morn', 'ouhospit', 'nh', 'england', 'nh', 'improv', 'nhsengland', 'januari', 'countri', 'establish', 'vaccin', 'site', 'hundr', 'open', 'week', 'take', 'total', 'govern', 'secur', 'access', 'dose', 'inocul', 'behalf', 'whole', 'UK', 'crown', 'depend', 'oversea', 'territori', 'more', 'million', 'peopl', 'UK', 'alreadi', 'vaccin', 'vaccin', 'rollout', 'continu', 'pace', 'howev', 'critic', 'govern', 'plan', 'sinc', 'req

We can see that the algorithm has stemmed 'ready' to the unusual root of 'readi', and there are other output stems such as 'provid' and 'hospit' which are not linguistically valid.

### 5.2 Lemmatisation

In contrast to stemming, which just removes the last few characters, lemmatisation does conversion properly with the use of a corpus. It removes inflectional endings and considers a language’s full vocabulary to return the base or dictionary form of a word which belongs to the language and is known as a lemma.

Lemmatisation looks at the surrounding text to determine a given word’s part of speech, which might have to be defined to obtain the correct lemma.






### Lemmatisation using NLTK 

NLTK uses the WordNet corpus, a large, free and publicly available lexical database for the English language aiming to establish structured semantic relationships between words. This has to be imported as WordNetLemmatizer.

In [None]:
# Create an instance of the WordNetLemmatizer() and call the lemmatize() function on each token iteratively

wnl = WordNetLemmatizer()

def nltk_lemmatiser(text):
  lemmatised_articles = [[wnl.lemmatize(w) for w in filtered_article] for filtered_article in filtered_articles]
  return lemmatised_articles

nltk_lemmatised_articles = nltk_lemmatiser(filtered_articles)
print(nltk_lemmatised_articles)

[['FTSE', 'firm', 'provided', 'dos', 'ready', 'use', 'Monday', 'six', 'hospital', 'trust', 'UK', 'begun', 'rollout', 'coronavirus', 'vaccine', 'produced', 'PLC', 'Oxford', 'University', 'FTSE', 'firm', 'provided', 'dos', 'ready', 'use', 'Monday', 'six', 'hospital', 'trust', 'Oxford', 'London', 'Sussex', 'Lancashire', 'Warwickshire', 'Most', 'available', 'dos', 'sent', 'service', 'care', 'home', 'later', 'week', 'pleased', 'getting', 'COVID', 'vaccine', 'today', 'really', 'proud', 'one', 'invented', 'Oxford', 'Brian', 'Pinker', 'became', 'first', 'person', 'world', 'receive', 'new', 'Oxford', 'vaccine', 'morning', 'OUHospitals', 'NHS', 'England', 'NHS', 'Improvement', 'NHSEngland', 'January', 'country', 'established', 'vaccination', 'site', 'hundred', 'opening', 'week', 'take', 'total', 'government', 'secured', 'access', 'dos', 'inoculation', 'behalf', 'whole', 'UK', 'crown', 'dependency', 'overseas', 'territory', 'More', 'million', 'people', 'UK', 'already', 'vaccinated', 'vaccine', 'r

### Lemmatisation using spaCy

spaCy is a free and open source NLP library with a lot of prebuilt models. The default model for the English language is en_core_web_sm, a pipeline trained on written web text (blogs, news, comments), that includes vocabulary, syntax and entities. Components included in the model are tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatize. Once the language model instance is downloaded, an nlp object can be created. Here nlp refers to the the language model loaded by en_core_web_sm.



In [None]:
# Load the installed pre-built statistical model "en_core_web_sm"
nlp = spacy.load('en_core_web_sm')

We will define a function to lemmatise a list of lists, or a list of articles each containing a list of tokens. This can be done by iterating over the articles, joining them and then using nlp to input text into the spaCy NLP pipeline. Processing text with the nlp object returns a Doc object which is a container for accessing linguistic annotations for a  given input string. It holds all the information about the tokens, their linguistic features and their relationships. Next, get the lemma for each item in the list and return to a list of lists again by lemmatising the item at the index where the item occurs. 

In [None]:
def spacy_lemmatiser(text):
  for filtered_article in filtered_articles:
    doc = nlp(' '.join(filtered_article))     
    for indexer, i in enumerate(doc):
      return filtered_articles

spacy_lemmatised_articles = spacy_lemmatiser(filtered_articles)
print(spacy_lemmatised_articles)


[['FTSE', 'firm', 'provided', 'doses', 'ready', 'use', 'Monday', 'six', 'hospital', 'trusts', 'UK', 'begun', 'rollout', 'coronavirus', 'vaccine', 'produced', 'PLC', 'Oxford', 'University', 'FTSE', 'firm', 'provided', 'doses', 'ready', 'use', 'Monday', 'six', 'hospital', 'trusts', 'Oxford', 'London', 'Sussex', 'Lancashire', 'Warwickshire', 'Most', 'available', 'doses', 'sent', 'services', 'care', 'homes', 'later', 'week', 'pleased', 'getting', 'COVID', 'vaccine', 'today', 'really', 'proud', 'one', 'invented', 'Oxford', 'Brian', 'Pinker', 'became', 'first', 'person', 'world', 'receive', 'new', 'Oxford', 'vaccine', 'morning', 'OUHospitals', 'NHS', 'England', 'NHS', 'Improvement', 'NHSEngland', 'January', 'country', 'established', 'vaccination', 'sites', 'hundreds', 'opening', 'week', 'take', 'total', 'government', 'secured', 'access', 'doses', 'inoculation', 'behalf', 'whole', 'UK', 'crown', 'dependencies', 'overseas', 'territories', 'More', 'million', 'people', 'UK', 'already', 'vaccinat


### Stemming vs Lemmatisation and NLTK vs spaCy

Lemmatisation is typically seen as much more accurate than simple stemming, which is why spaCy has opted to have only lemmatisation available instead of stemming, ensuring that morphological variants are always actual words.

However, if speed and performance are a priority then stemming should be used since lemmatisers scan a corpus which consumes time and processing power. Ultimately, it depends on the application you are working on. For example, if you are building a language application in which language is important you should choose lemmatisation and use a corpus to match root forms.

NLTK is a string processing library. It takes strings as input and returns strings or lists of strings as output. Whereas, spaCy uses an object-oriented approach. When we parse a text, spaCy returns a document object whose words and sentences are objects themselves.

In word tokenisation and POS-tagging spaCy performs better, but in sentence tokenisation, NLTK outperforms spaCy.

spaCy has support for word vectors whereas NLTK does not. spaCy also comes with pre-trained language models which can be used for better part-of-speech (POS) tagging, and named entity recognition (NER) to find out whether or not a word is a named entity, such as persons, locations, organisations, etc.

## 6. Text Classification

Text classification is the problem of assigning categories to text data according to its content, and strategies include Bag of Words, TF-IDF and Word Embeddings.

### 6.1 Bag of Words (BoW)

The Bag of Words (BoW) model is the simplest form of text representation in numbers and is a method of feature extraction with text data. The model builds a vocabulary from a corpus of documents and is only concerned with keeping track of word counts and whether known words occur in the document, not the order or structure of the words. Each word in the vocabulary becomes a feature and a document is represented by a vector with the same length as the vocabulary (a “bag of words”).

The scikit-learn library's CountVectorizer() class implements both tokenisation and occurrence counting in a single class. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

CountVectorizer's parameters will be set to stop_words='english' to use the built-in list, and the ngram_range will just be the default range of (1,1). An n-gram is just a string of n words in a row. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, (2, 2) means only bigrams, and (3,3) means only trigrams.



In [None]:
# Create an instance of the CountVectorizer class
CountVec = CountVectorizer(ngram_range=(1, 1), 
                           stop_words='english')
# Learn the vocabulary dictionary and transform to a vector of token counts
Count_data = CountVec.fit_transform(articles) 
 
# Create dataframe that can be used for building models
cv_dataframe = pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names())
print(cv_dataframe)

    000  004  01  012  016p  02  ...  yesterday  yield  yields  york  zero  şahin
0     3    0   0    0     0   0  ...          0      0       0     0     0      0
1     0    0   0    0     0   0  ...          0      0       0     1     0      0
2     0    0   0    0     0   0  ...          0      0       0     0     0      0
3     0    0   0    0     0   0  ...          0      0       0     0     0      0
4     0    0   0    0     0   0  ...          0      0       0     0     0      0
5     0    0   0    0     0   0  ...          0      0       0     1     0      0
6     0    0   0    0     0   1  ...          0      0       0     0     0      0
7     3    0   0    0     0   0  ...          0      0       0     0     0      1
8     0    0   0    0     0   0  ...          0      0       0     0     0      0
9     5    0   0    0     0   0  ...          0      0       0     0     0      0
10    0    1   0    0     0   0  ...          0      0       0     0     0      0
11    1    0   0

We get a dataframe for the unprocessed articles where the unique tokens are the columns and the count for each article are the rows. We will try the NLTK stemmed and lemmatised articles and the spaCy lemmatised articles in turn.



### NLTK preprocessed stemmed articles

In [None]:
# Join the NLTK stemmed tokens back into a list of articles

nltk_preprocessed_articles_stem = [','.join(article).replace(',', ' ') for article in nltk_stemmed_articles]
nltk_preprocessed_articles_stem

['ftse firm provid dose readi use monday six hospit trust UK begun rollout coronaviru vaccin produc plc oxford univers ftse firm provid dose readi use monday six hospit trust oxford london sussex lancashir warwickshir most avail dose sent servic care home later week pleas get covid vaccin today realli proud one invent oxford brian pinker becam first person world receiv new oxford vaccin morn ouhospit nh england nh improv nhsengland januari countri establish vaccin site hundr open week take total govern secur access dose inocul behalf whole UK crown depend oversea territori more million peopl UK alreadi vaccin vaccin rollout continu pace howev critic govern plan sinc requir two dose jab administ week apart british medic associ said cancel patient book second dose grossli unfair bbc report though chief medic offic said prefer vaccin mani peopl possibl first dose meanwhil astrazeneca appli get jab approv south korea vietnam bought dose talk compani purchas share pharma giant jump monday m

In [None]:
# Learn the vocabulary dictionary and transform to a vector of token counts
Count_data = CountVec.fit_transform(nltk_preprocessed_articles_stem)
 
# Create dataframe that can be used for building models
nltk_stem_cv_dataframe = pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names())
print(nltk_stem_cv_dataframe)

    aal  abcellera  abcl  abil  acceler  ...  yesterday  yield  york  zero  şahin
0     0          0     0     0        0  ...          0      0     0     0      0
1     0          0     0     0        0  ...          0      0     1     0      0
2     0          0     0     0        0  ...          0      0     0     0      0
3     0          0     0     0        0  ...          0      0     0     0      0
4     0          0     0     0        0  ...          0      0     0     0      0
5     0          0     0     0        0  ...          0      0     1     0      0
6     1          0     0     0        0  ...          0      0     0     0      0
7     0          0     0     0        1  ...          0      0     0     0      1
8     0          0     0     0        0  ...          0      0     0     0      0
9     0          0     0     0        0  ...          0      0     0     0      0
10    0          0     0     0        0  ...          0      0     0     0      0
11    0         

There are so many columns, it is unlikely we will be able to use this as an analysis. Instead, we could have a look at words that appear frequently.

In [None]:
# Create a mask so we only get the terms that have a frequency greater than 5 
nltk_stem_frequent_words = list(nltk_stem_cv_dataframe.sum()[nltk_stem_cv_dataframe.sum() > 5].index)

nltk_stem_cv_dataframe[nltk_stem_frequent_words]

Unnamed: 0,accord,activ,ad,addit,administ,administr,advanc,agenc,ahead,allow,alreadi,american,amid,analysi,analyst,announc,anoth,antibodi,apog,approv,astrazeneca,author,avail,averag,azn,barrel,base,battl,becam,benefit,biggest,billion,biontech,biotech,bitcoin,bntx,brexit,bring,britain,british,...,sunday,suppli,support,surg,target,technolog,test,therapeut,thi,thursday,tier,time,today,toll,tomorrow,track,trade,treatment,trial,trump,tuesday,uk,union,unit,univers,use,vaccin,valu,variant,viru,vote,wall,way,wednesday,week,window,world,year,yesterday,york
0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3,0,0,1,2,7,0,0,0,0,0,0,0,3,0,1,0,0,0
1,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,3,0,1
2,0,1,0,0,2,0,0,0,0,0,0,0,0,1,3,1,1,0,0,4,6,2,0,1,2,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,2,0,0,1,3,8,0,0,0,0,0,0,0,1,0,0,4,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,0,0,0,1,0,0,2,0,0,0,0,0,1,0,0,3,2,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,...,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,10,0,0,0,0,0,0,0,4,0,2,0,0,0
5,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,1,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,3,0,0,0,0,4,1,0,0,0,0,0,0,0,0,0,3,0,1
6,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0
7,1,0,0,0,1,0,0,1,0,0,1,0,2,0,0,2,0,0,0,1,2,2,1,0,2,0,0,1,1,0,0,0,2,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,2,4,15,0,1,2,0,0,0,0,5,0,1,0,0,0
8,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,4,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,2,2,...,3,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,3,0,0,0,0,1,0,2,2,1,3,0,1,0,0,1,1,0,1,0,0,2,0,0
9,3,0,1,1,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,2,1,0,0,0,1,0,0,0,1,0,0,4,0,0,1,0,0,1,0,0,...,0,0,0,0,0,4,0,0,1,0,0,0,0,1,0,1,0,0,0,2,0,0,1,0,0,0,4,1,0,1,1,0,1,2,2,0,0,4,0,0


### NLTK preprocessed lemmatised articles

In [None]:
# Join the NLTK lemmatised tokens back into a list of articles

nltk_preprocessed_articles_lem = [','.join(article).replace(',', ' ') for article in nltk_lemmatised_articles]
nltk_preprocessed_articles_lem

['FTSE firm provided dos ready use Monday six hospital trust UK begun rollout coronavirus vaccine produced PLC Oxford University FTSE firm provided dos ready use Monday six hospital trust Oxford London Sussex Lancashire Warwickshire Most available dos sent service care home later week pleased getting COVID vaccine today really proud one invented Oxford Brian Pinker became first person world receive new Oxford vaccine morning OUHospitals NHS England NHS Improvement NHSEngland January country established vaccination site hundred opening week take total government secured access dos inoculation behalf whole UK crown dependency overseas territory More million people UK already vaccinated vaccine rollout continue pace However criticism government plan since required two dos jab administered week apart British Medical Association said cancelling patient booked second dos grossly unfair BBC reported though chief medical officer said preferable vaccinate many people possible first dose Meanwhi

In [None]:
# Learn the vocabulary dictionary and transform to a vector of token counts
Count_data = CountVec.fit_transform(nltk_preprocessed_articles_lem)
 
# Create dataframe that can be used for building models
nltk_lem_cv_dataframe = pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names())
print(nltk_lem_cv_dataframe)

    aal  abcellera  abcl  ability  ...  yield  york  zero  şahin
0     0          0     0        0  ...      0     0     0      0
1     0          0     0        0  ...      0     1     0      0
2     0          0     0        0  ...      0     0     0      0
3     0          0     0        0  ...      0     0     0      0
4     0          0     0        0  ...      0     0     0      0
5     0          0     0        0  ...      0     1     0      0
6     1          0     0        0  ...      0     0     0      0
7     0          0     0        0  ...      0     0     0      1
8     0          0     0        0  ...      0     0     0      0
9     0          0     0        0  ...      0     0     0      0
10    0          0     0        0  ...      0     0     0      0
11    0          0     0        1  ...      0     0     0      0
12    0          0     0        0  ...      0     0     1      0
13    0          0     0        0  ...      0     0     0      0
14    0          0     0 

In [None]:
# Create a mask so we only get the terms that have a frequency greater than 5 
nltk_lem_frequent_words = list(nltk_lem_cv_dataframe.sum()[nltk_lem_cv_dataframe.sum() > 5].index)

nltk_lem_cv_dataframe[nltk_lem_frequent_words]

Unnamed: 0,according,added,administered,administration,agency,ahead,american,amid,analysis,analyst,announced,antibody,apog,approval,approved,astrazeneca,authorization,available,average,azn,barrel,biggest,billion,biontech,biotech,bitcoin,bntx,brexit,britain,british,business,buy,care,case,check,chief,climbed,close,closed,cocktail,...,strong,study,sunday,support,surge,surged,target,therapeutics,thursday,tier,time,today,toll,tomorrow,trade,trading,treatment,trial,trump,tuesday,uk,union,united,university,use,vaccination,vaccine,value,variant,virus,vote,wall,way,wednesday,week,window,world,year,yesterday,york
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3,0,0,1,2,1,4,0,0,0,0,0,0,0,3,0,1,0,0,0
1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,3,0,1,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,3,0,0,0,0,0,0,0,0,0,0,3,0,1
2,0,0,2,0,0,0,0,0,1,3,1,0,0,2,2,6,2,0,1,2,0,0,1,1,1,0,0,0,0,0,0,2,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,2,0,0,1,3,0,8,0,0,0,0,0,0,0,1,0,0,4,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,0,0,0,0,0,0,0,0,1,2,2,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,8,0,0,0,0,0,0,0,4,0,2,0,0,0
5,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,3,0,1,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,1,3,0,0,0,0,1,3,0,0,0,0,0,0,0,0,0,0,3,0,1
6,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0
7,1,0,1,0,1,0,0,2,0,0,1,0,0,0,1,2,0,1,0,2,0,0,0,2,0,0,1,0,0,0,0,0,1,4,0,2,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,2,2,5,9,0,1,2,0,0,0,0,5,0,1,0,0,0
8,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,4,0,0,0,0,0,1,0,0,1,0,0,1,2,2,2,1,0,2,0,0,0,0,0,0,...,0,0,3,0,1,0,0,0,1,0,0,0,0,0,0,3,0,0,0,0,1,0,2,2,1,0,3,0,1,0,0,1,1,0,1,0,0,2,0,0
9,3,1,0,1,0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,1,0,0,4,0,0,1,0,0,0,0,0,1,1,0,1,2,0,2,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,1,0,0,0,1,3,1,0,1,1,0,1,2,2,0,0,4,0,0


### spaCy preprocessed lemmatised articles

In [None]:
# Join the spaCy lemmatised tokens back into a list of articles

spacy_preprocessed_articles_lem = [','.join(article).replace(',', ' ') for article in spacy_lemmatised_articles]
spacy_preprocessed_articles_lem

['FTSE firm provided doses ready use Monday six hospital trusts UK begun rollout coronavirus vaccine produced PLC Oxford University FTSE firm provided doses ready use Monday six hospital trusts Oxford London Sussex Lancashire Warwickshire Most available doses sent services care homes later week pleased getting COVID vaccine today really proud one invented Oxford Brian Pinker became first person world receive new Oxford vaccine morning OUHospitals NHS England NHS Improvement NHSEngland January country established vaccination sites hundreds opening week take total government secured access doses inoculation behalf whole UK crown dependencies overseas territories More million people UK already vaccinated vaccine rollout continue pace However criticism government plan since required two doses jabs administered weeks apart British Medical Association said cancelling patients booked second doses grossly unfair BBC reported though chief medical officers said preferable vaccinate many people p

In [None]:
# Learn the vocabulary dictionary and transform to a vector of token counts
Count_data = CountVec.fit_transform(spacy_preprocessed_articles_lem)
 
# Create dataframe that can be used for building models
spacy_lem_cv_dataframe = pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names())
print(spacy_lem_cv_dataframe)

    aal  abcellera  abcl  ability  accelerate  ...  yield  yields  york  zero  şahin
0     0          0     0        0           0  ...      0       0     0     0      0
1     0          0     0        0           0  ...      0       0     1     0      0
2     0          0     0        0           0  ...      0       0     0     0      0
3     0          0     0        0           0  ...      0       0     0     0      0
4     0          0     0        0           0  ...      0       0     0     0      0
5     0          0     0        0           0  ...      0       0     1     0      0
6     1          0     0        0           0  ...      0       0     0     0      0
7     0          0     0        0           1  ...      0       0     0     0      1
8     0          0     0        0           0  ...      0       0     0     0      0
9     0          0     0        0           0  ...      0       0     0     0      0
10    0          0     0        0           0  ...      0       0

In [None]:
# Create a mask so we only get the terms that have a frequency greater than 5 
spacy_lem_frequent_words = list(spacy_lem_cv_dataframe.sum()[spacy_lem_cv_dataframe.sum() > 5].index)

spacy_lem_cv_dataframe[spacy_lem_frequent_words]

Unnamed: 0,according,added,administered,administration,agency,ahead,american,amid,analysis,analyst,announced,antibody,apog,approval,approved,astrazeneca,authorization,available,average,azn,biggest,billion,biontech,biotech,bitcoin,bntx,brexit,britain,british,business,buy,care,case,cases,checks,chief,climbed,close,closed,cocktail,...,support,surge,surged,target,therapeutics,thursday,tier,time,today,toll,tomorrow,trade,trading,treatment,trial,trump,tuesday,uk,union,united,university,use,vaccination,vaccine,vaccines,value,variant,virus,vote,wall,way,wednesday,week,weeks,window,world,year,years,yesterday,york
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3,0,0,1,2,1,4,0,0,0,0,0,0,0,0,2,1,0,1,0,0,0,0
1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,2,0,0,0,0,0,0,0,0,0,0,0,2,1,0,1
2,0,0,2,0,0,0,0,0,1,3,1,0,0,1,2,6,2,0,1,2,0,1,1,1,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,2,0,0,1,3,0,8,0,0,0,0,0,0,0,0,0,1,0,0,0,4,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,0,0,0,0,0,0,0,0,1,2,2,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,7,1,0,0,0,0,0,0,0,3,1,0,2,0,0,0,0
5,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,1,3,0,0,0,0,1,1,2,0,0,0,0,0,0,0,0,0,0,0,2,1,0,1
6,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,1,0,1,0,1,0,0,2,0,0,1,0,0,0,1,2,0,1,0,2,0,0,2,0,0,1,0,0,0,0,0,1,0,4,0,2,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,2,2,4,8,2,0,1,2,0,0,0,0,2,3,0,1,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,4,0,0,0,0,1,0,0,1,0,0,1,2,2,2,1,0,0,2,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0,0,0,3,0,0,0,0,1,0,2,2,1,0,2,1,0,1,0,0,1,1,0,1,0,0,0,2,0,0,0
9,3,1,0,1,0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,1,0,4,0,0,1,0,0,0,0,0,1,1,0,0,1,2,0,2,1,0,...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,1,0,0,0,1,2,1,1,0,1,1,0,1,2,2,0,0,0,3,1,0,0


Although Bag of Words is quite efficient and easy to implement, still there are some disadvantages:

*   The model ignores the location information of the word. The same words in a different order will have the same vector representation.
*   The model does not respect the semantics of the word. Words often used in the same context are represented by different vectors. 
*   This approach causes a significant dimensionality problem - the more documents you have the larger the vocabulary, and the longer the vectors. 
*   Additionally, the vectors would also contain many 0s, thereby resulting in a huge sparse feature matrix.
*   If the model comes across a new word it has not seen yet it will probably end up ignoring this word.


### 6.2 Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency is not necessarily the best representation for text. In fact, you can find in the corpus common words with the highest frequency but little predictive power over the target variable. To address this problem there is an advanced variant of the Bag of Words that, instead of simple counting, uses the Term Frequency-Inverse Document Frequency (TF-IDF).

Term Frequency (TF) is a measure of how frequently a term, t, appears in a document, d, or the number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequency.

Inverse Document Frequency (IDF) is a measure of how important a term is. We need the IDF value because computing just the TF alone is not sufficient to understand the importance of words. The IDF is the log of the number of documents divided by the number of documents that contain the word w, and determines the weight of rare words across all documents in the corpus.

Basically, the value of a word increases proportionally to count, but it is inversely proportional to the frequency of the word in the corpus.
Computing the TF-IDF score for each word in the corpus will show that words with a higher score are more important, and those with a lower score are less important.

Scikit-learn's TfidfVectorizer() class converts a collection of raw documents to a matrix of TF-IDF features. The smooth_idf parameter is used to smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. This prevents zero divisions. We will look at examples with and without smoothing.



### NLTK preprocessed stemmed articles

In [None]:
# Without smooth IDF
print("Without Smoothing:")
# Create an instance of the TfidfVectorizer class without smoothing
tf_idf_vec = TfidfVectorizer(use_idf=True, 
                        smooth_idf=False,  
                        ngram_range=(1,1),stop_words='english')
# Transform
tf_idf_data = tf_idf_vec.fit_transform(nltk_preprocessed_articles_stem)
 
# Create dataframe
nltk_stem_tf_idf_dataframe = pd.DataFrame(tf_idf_data.toarray(),columns = tf_idf_vec.get_feature_names())
print(nltk_stem_tf_idf_dataframe)
print("\n")
  

# With smooth IDF
tf_idf_vec_smooth = TfidfVectorizer(use_idf=True,  
                        smooth_idf=True,  
                        ngram_range=(1,1),stop_words='english')
 
# Transform 
tf_idf_data_smooth = tf_idf_vec_smooth.fit_transform(nltk_preprocessed_articles_stem)
 
print("With Smoothing:")
# Create dataframe
nltk_stem_tf_idf_dataframe_smooth = pd.DataFrame(tf_idf_data_smooth.toarray(),columns = tf_idf_vec_smooth.get_feature_names())
print(nltk_stem_tf_idf_dataframe_smooth)

Without Smoothing:
         aal  abcellera      abcl  ...      york     zero     şahin
0   0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.000000
1   0.000000   0.000000  0.000000  ...  0.061619  0.00000  0.000000
2   0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.000000
3   0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.000000
4   0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.000000
5   0.000000   0.000000  0.000000  ...  0.050182  0.00000  0.000000
6   0.130307   0.000000  0.000000  ...  0.000000  0.00000  0.000000
7   0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.057328
8   0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.000000
9   0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.000000
10  0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.000000
11  0.000000   0.000000  0.000000  ...  0.000000  0.00000  0.000000
12  0.000000   0.000000  0.000000  ...  0.000000  0.03468  0.000000
13  0.000000   0.000000  0.00

### NLTK preprocessed lemmatised articles

In [None]:
# Without smooth IDF
print("Without Smoothing:")
# Create an instance of the TfidfVectorizer class without smoothing
tf_idf_vec = TfidfVectorizer(use_idf=True, 
                        smooth_idf=False,  
                        ngram_range=(1,1),stop_words='english') # to use only  bigrams ngram_range=(2,2)
# Transform
tf_idf_data = tf_idf_vec.fit_transform(nltk_preprocessed_articles_lem)
 
# Create dataframe
nltk_lem_tf_idf_dataframe = pd.DataFrame(tf_idf_data.toarray(),columns=tf_idf_vec.get_feature_names())
print(nltk_lem_tf_idf_dataframe)
print("\n")
 

# With smooth IDF
tf_idf_vec_smooth = TfidfVectorizer(use_idf=True,  
                        smooth_idf=True,  
                        ngram_range=(1,1),stop_words='english')
 
# Transform 
tf_idf_data_smooth = tf_idf_vec_smooth.fit_transform(nltk_preprocessed_articles_lem)
 
print("With Smoothing:")
# Create dataframe
nltk_lem_tf_idf_dataframe_smooth = pd.DataFrame(tf_idf_data_smooth.toarray(),columns=tf_idf_vec_smooth.get_feature_names())
print(nltk_lem_tf_idf_dataframe_smooth)

Without Smoothing:
         aal  abcellera      abcl  ...      york      zero     şahin
0   0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000
1   0.000000   0.000000  0.000000  ...  0.059641  0.000000  0.000000
2   0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000
3   0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000
4   0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000
5   0.000000   0.000000  0.000000  ...  0.049126  0.000000  0.000000
6   0.125116   0.000000  0.000000  ...  0.000000  0.000000  0.000000
7   0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.057283
8   0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000
9   0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000
10  0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000
11  0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000
12  0.000000   0.000000  0.000000  ...  0.000000  0.035052  0.000000
13  0.000000   

### spaCy preprocessed stemmed articles

In [None]:
# Without smooth IDF
print("Without Smoothing:")
# Create an instance of the TfidfVectorizer class without smoothing
tf_idf_vec = TfidfVectorizer(use_idf=True, 
                        smooth_idf=False,  
                        ngram_range=(1,1),stop_words='english') # to use only  bigrams ngram_range=(2,2)
# Transform
tf_idf_data = tf_idf_vec.fit_transform(spacy_preprocessed_articles_lem)
 
# Create dataframe
spacy_lem_tf_idf_dataframe = pd.DataFrame(tf_idf_data.toarray(),columns=tf_idf_vec.get_feature_names())
print(spacy_lem_tf_idf_dataframe)
print("\n")
 

# With smooth IDF
tf_idf_vec_smooth = TfidfVectorizer(use_idf=True,  
                        smooth_idf=True,  
                        ngram_range=(1,1),stop_words='english')
 
# Transform 
tf_idf_data_smooth = tf_idf_vec_smooth.fit_transform(spacy_preprocessed_articles_lem)
 
print("With Smoothing:")
# Create dataframe
spacy_lem_tf_idf_dataframe_smooth = pd.DataFrame(tf_idf_data_smooth.toarray(),columns=tf_idf_vec_smooth.get_feature_names())
print(spacy_lem_tf_idf_dataframe_smooth)

Without Smoothing:
         aal  abcellera    abcl  ...      york      zero     şahin
0   0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.000000
1   0.000000    0.00000  0.0000  ...  0.058960  0.000000  0.000000
2   0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.000000
3   0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.000000
4   0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.000000
5   0.000000    0.00000  0.0000  ...  0.048850  0.000000  0.000000
6   0.124539    0.00000  0.0000  ...  0.000000  0.000000  0.000000
7   0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.057399
8   0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.000000
9   0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.000000
10  0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.000000
11  0.000000    0.00000  0.0000  ...  0.000000  0.000000  0.000000
12  0.000000    0.00000  0.0000  ...  0.000000  0.035064  0.000000
13  0.000000    0.00000  0.0000  ...  0.000

As above with the Bag of Words, the data is high dimensional and any useful analysis would require selecting the columns with the highest TF-IDF. 

Bag of Words and TF-IDF can convert textual data into numerical data but are unable to take into consideration context, which is necessary for detecting similarity between words or translating documents into another language.  

### 6.3 Word Embedding (Word2Vec)

A word embedding is a vector representation of a particular word in an n-dimensional space, and words that are closer in the vector space are expected to be similar in meaning.

Word2Vec is one of the most popular techniques for learning word embeddings and uses a two-layer neural network. The input of Word2Vec is a text corpus and its output is a set of vectors known as feature vectors that represent words in that corpus. 

Word2vec trains words against other words that neighbour them in the input corpus.  It does this by using a combination of two unsupervised algorithms: Continuous Bag of Words (CBOW), which uses context to predict a target word, and the Skip-gram model, which uses a word to predict a target context.





### Vector values

We will use spaCy's default English language model to create a word vector stored as an array.

In [None]:
# For example we can make a vector out of the word 'growth'
doc = nlp(u'growth')
print(doc.vector.shape)
print(doc.vector)

(96,)
[ 8.58920932e-01  6.20367289e-01 -1.31036282e+00 -1.47370267e+00
  1.15732229e+00  1.45928264e-01  3.80185628e+00 -3.77992868e-01
 -3.98049355e-02  4.33264065e+00  2.92952061e+00  8.00308824e-01
  3.01639080e+00 -3.59804606e+00  3.54130149e-01 -1.21122885e+00
  1.14853054e-01  1.39134896e+00 -1.78165305e+00 -2.06363511e+00
  2.36152411e+00  7.72602320e-01 -8.03935468e-01 -8.96134377e-01
 -1.67789042e+00 -1.00386596e+00 -1.03682268e+00 -4.12988234e+00
  3.13561988e+00 -1.83762562e+00  3.80061746e+00  7.12692738e-02
 -5.48399806e-01  1.19633079e+00  1.81020129e+00 -2.74967170e+00
  1.97480202e+00  2.59926796e-01 -5.68437576e+00  2.69245803e-01
  5.60801363e+00  4.15427685e-01 -4.15742397e-04 -3.28472829e+00
  2.21204090e+00 -2.24462688e-01 -1.26612931e-01 -1.02066851e+00
  1.14587569e+00  3.23814487e+00  1.36192608e+00 -1.05267203e+00
 -2.12705851e+00 -2.20375228e+00 -1.33382213e+00  2.47395849e+00
  1.26877412e-01  7.50522017e-01  5.21210432e-02  1.01763225e+00
  1.89651203e+00 -2

### Vector similarity

The similarity() method exposes vector relationships and it is possible to do this for several token combinations iteratively. By default spaCy uses cosine similarity, the most widely used method to compare two vectors, which considers vector orientation independent of vector magnitude. 

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalised to both have length 1. The cosine of 0° is 1 and less than 1 for any angle in the interval.

It is also the case that words with opposite meaning, but that often appear in the same context, may have similar vectors.

In [None]:
# Get the similarity for 'up' and 'growth'
nlp('up').similarity(nlp('growth'))

0.24320669822797533

The vector for the word 'growth' had a lot of dimensions so we will make use of word embeddings to look at similarity between documents. One possibility for stock prediction could be to use a changing similarity score to detect when confidence in a stock is changing. This could be used to classify articles.

We will find all the word embeddings for all the words in all the articles and then take the mean of the embeddings for each article.

We will then calculate the angle between the two articles using the dot product which considers orientation and also scales with vector magnitude.

Where the angle is ~0, the cosθ component of the formula equals ~1. If the angle is nearer to 90 (orthogonal/perpendicular), the cosθ component equals ~0, and at 180 the cosθ component equals ~-1.

Therefore, the cosθ component increases the result where there is less of an angle between the two vectors. So, a higher dot product correlates with higher orientation.

The dot product calculation is straightforward and therefore gives benefits in terms of computation time but it is not normalised — meaning larger vectors will tend to score higher dot products, despite being less similar.

A norm is a measure of the size of a matrix or vector and you can compute it in NumPy with the np.linalg.norm() function. One important use of norm is to transform a given vector into a unit-length vector, that is, making the magnitude of vector = 1, while still preserving its direction.

### NLTK preprocessed stemmed articles

In [None]:
# Find all embeddings for all words in all articles, 
# Take the mean of the embeddings for each article

article_vector_list = [np.mean([nlp(word).vector for word in article]) for article in nltk_preprocessed_articles_stem] 

# Calculate the angle between two articles. 
# The articles are represented by vectors, so their similarity is defined by the angle between them 
np.dot(article_vector_list[0], article_vector_list[1]) / (np.linalg.norm(article_vector_list[0]) * np.linalg.norm(article_vector_list[1]))

1.0

### Named Entity Recognition

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction.

Using spaCy’s built-in displaCy visualiser we can see what the named entities look like.

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

In [None]:
doc = nlp(nltk_preprocessed_articles_stem[0])

entities=[(i, i.label_) for i in doc.ents]
entities

[(readi, 'ORG'),
 (monday, 'DATE'),
 (six, 'CARDINAL'),
 (hospit, 'GPE'),
 (UK, 'GPE'),
 (monday, 'DATE'),
 (six, 'CARDINAL'),
 (hospit, 'GPE'),
 (london, 'GPE'),
 (lancashir warwickshir, 'ORG'),
 (later week, 'DATE'),
 (today, 'DATE'),
 (brian pinker, 'PERSON'),
 (first, 'ORDINAL'),
 (open week, 'DATE'),
 (UK, 'GPE'),
 (million, 'CARDINAL'),
 (UK, 'GPE'),
 (two, 'CARDINAL'),
 (british, 'NORP'),
 (cancel patient, 'PERSON'),
 (second, 'ORDINAL'),
 (grossli unfair, 'ORG'),
 (bbc, 'ORG'),
 (medic offic, 'PERSON'),
 (possibl, 'GPE'),
 (meanwhil astrazeneca appli, 'PERSON'),
 (south korea, 'GPE'),
 (vietnam, 'GPE'),
 (compani, 'ORG'),
 (monday, 'DATE')]

### NLTK preprocessed lemmatised articles

In [None]:
# Find all embeddings for all words in all articles, 
# Take the mean of the embeddings for each article

article_vector_list = [np.mean([nlp(word).vector for word in article]) for article in nltk_preprocessed_articles_lem] 

# Calculate the angle between two articles. 
# The articles are represented by vectors, so their similarity is defined by the angle between them  
np.dot(article_vector_list[0], article_vector_list[1]) / (np.linalg.norm(article_vector_list[0]) * np.linalg.norm(article_vector_list[1]))

1.0

In [None]:
doc = nlp(nltk_preprocessed_articles_lem[0])

entities=[(i, i.label_) for i in doc.ents]
entities

[(Monday six, 'DATE'),
 (UK, 'GPE'),
 (PLC Oxford University FTSE, 'ORG'),
 (Monday, 'DATE'),
 (six, 'CARDINAL'),
 (later week, 'DATE'),
 (COVID, 'ORG'),
 (today, 'DATE'),
 (Oxford, 'ORG'),
 (Brian Pinker, 'PERSON'),
 (first, 'ORDINAL'),
 (Oxford, 'ORG'),
 (OUHospitals NHS England NHS Improvement NHSEngland January, 'ORG'),
 (hundred, 'CARDINAL'),
 (UK, 'GPE'),
 (million, 'CARDINAL'),
 (UK, 'GPE'),
 (two, 'CARDINAL'),
 (British Medical Association, 'ORG'),
 (second, 'ORDINAL'),
 (BBC, 'ORG'),
 (AstraZeneca, 'ORG'),
 (South Korea, 'GPE'),
 (Vietnam, 'GPE'),
 (Monday morning, 'TIME')]

### spaCy preprocessed lemmatised articles

In [None]:
# Find all embeddings for all words in all articles, 
# Take the mean of the embeddings for each article

article_vector_list = [np.mean([nlp(word).vector for word in article]) for article in spacy_preprocessed_articles_lem] 

# Calculate the angle between two articles. 
# The articles are represented by vectors, so their similarity is defined by the angle between them 
np.dot(article_vector_list[0], article_vector_list[1]) / (np.linalg.norm(article_vector_list[0]) * np.linalg.norm(article_vector_list[1]))

1.0

In [None]:
doc = nlp(spacy_preprocessed_articles_lem[0])

entities=[(i, i.label_) for i in doc.ents]
entities

[(Monday, 'DATE'),
 (six, 'CARDINAL'),
 (UK, 'GPE'),
 (PLC Oxford University FTSE, 'ORG'),
 (Monday, 'DATE'),
 (six, 'CARDINAL'),
 (Oxford London, 'ORG'),
 (later week, 'DATE'),
 (COVID, 'ORG'),
 (today, 'DATE'),
 (Oxford, 'ORG'),
 (Brian Pinker, 'PERSON'),
 (first, 'ORDINAL'),
 (Oxford, 'ORG'),
 (OUHospitals NHS England NHS Improvement NHSEngland January, 'ORG'),
 (hundreds, 'CARDINAL'),
 (UK, 'GPE'),
 (million, 'CARDINAL'),
 (UK, 'GPE'),
 (two, 'CARDINAL'),
 (weeks, 'DATE'),
 (British Medical Association, 'ORG'),
 (second, 'ORDINAL'),
 (BBC, 'ORG'),
 (AstraZeneca, 'ORG'),
 (South Korea, 'GPE'),
 (Vietnam, 'GPE'),
 (Monday morning, 'TIME')]

## 7. Topic Modelling

Topic modelling is an unsupervised machine learning technique that's capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterise a set of documents.

A widely used topic modelling algorithm is the Latent Dirichlet Allocation (LDA) from the Gensim package.

In the LDA model, each document is viewed as a collection of topics that are present in the corpus in a certain proportion, and each topic as a collection of keywords, again in a certain proportion. Just by looking at the keywords, you can identify what the topic is all about.




### 7.1 Create the Dictionary and Corpus needed for Topic Modelling

The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

### NLTK preprocessed stemmed articles



In [None]:
# Create Dictionary
id2word_nltk_stem = corpora.Dictionary(nltk_stemmed_articles)

# Create Corpus
texts_nltk_stem = nltk_stemmed_articles

# Term Document Frequency
corpus_nltk_stem = [id2word_nltk_stem.doc2bow(text) for text in texts_nltk_stem]

# View
print(corpus_nltk_stem[:1])

[[(0, 3), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 8), (30, 1), (31, 1), (32, 2), (33, 2), (34, 2), (35, 2), (36, 1), (37, 2), (38, 1), (39, 1), (40, 2), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 2), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 3), (58, 1), (59, 2), (60, 1), (61, 1), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 4), (70, 1), (71, 1), (72, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 2), (84, 1), (85, 2), (86, 1), (87, 1), (88, 1), (89, 1), (90, 2), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 2), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 2), (109, 1), (110, 1)

Gensim creates a unique id for each word in the document. The produced corpus shown above is a mapping of (word_id, word_frequency).

For example, (0, 3) above implies word id 0 occurs three times in the first document. Likewise, word id 1 occurs once and so on.

This is used as the input by the LDA model.

To see what word a given id corresponds to, pass the id as a key to the dictionary.

In [None]:
# word id 0 passed as key to the dictionary

id2word_nltk_stem[0]

'UK'

### NLTK preprocessed lemmatised articles

In [None]:
# Create Dictionary
id2word_nltk_lem = corpora.Dictionary(nltk_lemmatised_articles)

# Create Corpus
texts_nltk_lem = nltk_lemmatised_articles

# Term Document Frequency
corpus_nltk_lem = [id2word_nltk_lem.doc2bow(text) for text in texts_nltk_lem]

# View
print(corpus_nltk_lem[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 3), (17, 1), (18, 1), (19, 2), (20, 1), (21, 1), (22, 4), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 3), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 7), (55, 1), (56, 1), (57, 2), (58, 2), (59, 1), (60, 1), (61, 1), (62, 2), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 1), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 2), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 2), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 2), (93, 1), (94, 2), (95, 1), (96, 1), (97, 1), (98, 1), (99, 2), (100, 2), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 2), (108, 1), (109, 1), (110, 1)

In [None]:
# word id 0 passed as key to the dictionary

id2word_nltk_lem[0]

'Association'

### spaCy preprocessed lemmatised articles

In [None]:
# Create Dictionary
id2word_spacy_lem = corpora.Dictionary(spacy_lemmatised_articles)

# Create Corpus
texts_spacy_lem = spacy_lemmatised_articles

# Term Document Frequency
corpus_spacy_lem = [id2word_spacy_lem.doc2bow(text) for text in texts_spacy_lem]

# View
print(corpus_spacy_lem[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 3), (17, 1), (18, 1), (19, 2), (20, 1), (21, 1), (22, 4), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 3), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 7), (56, 1), (57, 2), (58, 2), (59, 1), (60, 1), (61, 1), (62, 2), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 2), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 2), (94, 1), (95, 2), (96, 1), (97, 1), (98, 1), (99, 1), (100, 2), (101, 2), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 2), (109, 1), (110, 1)

In [None]:
id2word_spacy_lem[0]

'Association'

### 7.2 Build the Topic Model

In addition to the corpus and dictionary, you need to provide the number of topics as with any unsupervised learning technique. 

### NLTK preprocessed stemmed articles

In [None]:
# Build the topic model with 10 topics
nltk_stem_lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_nltk_stem,
                                           id2word=id2word_nltk_stem,
                                           num_topics=10,              # Define the number of topics we want
                                           random_state=100,
                                           update_every=1,             # Determines how often the model parameters should be updated
                                           chunksize=10,               # Number of documents to be used in each training chunk
                                           passes=10,                  # Total number of training passes.
                                           alpha='auto',               # Hyperparameter that affects sparsity of the topics
                                           per_word_topics=True)

### NLTK preprocessed lemmatised articles

In [None]:
# Build the topic model with 10 topics
nltk_lem_lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_nltk_lem,
                                           id2word=id2word_nltk_lem,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

### spaCy preprocessed lemmatised articles

In [None]:
# Build the topic model with 10 topics
spacy_lem_lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_spacy_lem,
                                           id2word=id2word_spacy_lem,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

### 7.3 View the topics in LDA model

The model is built with 10 different topics each of which is a combination of keywords with each keyword contributing a certain weighting to the topic.

You can see the keywords for each topic and the weighting (importance) of each keyword using lda_model.print_topics()

### NLTK preprocessed stemmed articles

In [None]:
# Print the Keyword in the 10 topics
pprint(nltk_stem_lda_model.print_topics())
nltk_stem_doc_lda = nltk_stem_lda_model[corpus_nltk_stem]

[(0,
  '0.044*"vaccin" + 0.019*"dose" + 0.019*"million" + 0.017*"astrazeneca" + '
  '0.016*"week" + 0.016*"approv" + 0.014*"receiv" + 0.013*"within" + '
  '0.012*"clinic" + 0.012*"mean"'),
 (1,
  '0.015*"year" + 0.014*"UK" + 0.014*"trade" + 0.013*"index" + 0.013*"market" '
  '+ 0.012*"wednesday" + 0.012*"new" + 0.012*"ftse" + 0.012*"close" + '
  '0.011*"vaccin"'),
 (2,
  '0.028*"compani" + 0.018*"said" + 0.015*"gain" + 0.013*"rose" + 0.012*"firm" '
  '+ 0.011*"stock" + 0.011*"share" + 0.011*"US" + 0.009*"strong" + '
  '0.009*"also"'),
 (3,
  '0.023*"stock" + 0.015*"posit" + 0.014*"nasdaq" + 0.012*"compani" + '
  '0.011*"vaccin" + 0.010*"market" + 0.009*"ad" + 0.009*"pass" + 0.009*"new" + '
  '0.009*"next"'),
 (4,
  '0.030*"vaccin" + 0.019*"million" + 0.013*"case" + 0.009*"record" + '
  '0.009*"state" + 0.009*"hospit" + 0.008*"death" + 0.008*"first" + '
  '0.008*"accord" + 0.008*"countri"'),
 (5,
  '0.035*"vaccin" + 0.029*"dose" + 0.022*"said" + 0.019*"receiv" + '
  '0.018*"pfizer" + 0.

### NLTK preprocessed lemmatised articles

In [None]:
# Print the Keyword in the 10 topics
pprint(nltk_lem_lda_model.print_topics())
nltk_lem_doc_lda = nltk_lem_lda_model[corpus_nltk_lem]

[(0,
  '0.021*"million" + 0.020*"case" + 0.016*"vaccine" + 0.013*"death" + '
  '0.009*"said" + 0.009*"patient" + 0.009*"one" + 0.008*"people" + '
  '0.008*"state" + 0.007*"November"'),
 (1,
  '0.018*"company" + 0.016*"gained" + 0.012*"Shares" + 0.010*"firm" + '
  '0.010*"target" + 0.007*"AMC" + 0.007*"BTIG" + 0.007*"education" + '
  '0.007*"Tesla" + 0.007*"CNBC"'),
 (2,
  '0.025*"UK" + 0.019*"vaccine" + 0.015*"FTSE" + 0.014*"point" + '
  '0.012*"coronavirus" + 0.011*"Wednesday" + 0.010*"higher" + 0.009*"case" + '
  '0.009*"session" + 0.008*"index"'),
 (3,
  '0.017*"gain" + 0.013*"Keator" + 0.011*"dollar" + 0.010*"rose" + '
  '0.010*"yield" + 0.009*"going" + 0.009*"gained" + 0.008*"lowest" + '
  '0.007*"robust" + 0.007*"dip"'),
 (4,
  '0.024*"dose" + 0.023*"dos" + 0.020*"second" + 0.015*"week" + '
  '0.013*"received" + 0.012*"vaccine" + 0.012*"patient" + 0.012*"time" + '
  '0.012*"receive" + 0.010*"Oxford"'),
 (5,
  '0.028*"company" + 0.026*"said" + 0.020*"vaccine" + 0.018*"rose" + '
  

### spaCy preprocessed lemmatised articles

In [None]:
# Print the Keyword in the 10 topics
pprint(spacy_lem_lda_model.print_topics())
spacy_lem_doc_lda = spacy_lem_lda_model[corpus_spacy_lem]

[(0,
  '0.019*"Inc" + 0.015*"death" + 0.013*"antibody" + 0.011*"Wednesday" + '
  '0.010*"cocktail" + 0.009*"In" + 0.009*"program" + 0.009*"said" + '
  '0.009*"premarket" + 0.008*"Phase"'),
 (1,
  '0.025*"vaccine" + 0.020*"higher" + 0.018*"NASDAQ" + 0.018*"AstraZeneca" + '
  '0.015*"approval" + 0.012*"Oxford" + 0.012*"lower" + 0.011*"stock" + '
  '0.011*"University" + 0.011*"company"'),
 (2,
  '0.000*"Inc" + 0.000*"This" + 0.000*"vaccine" + 0.000*"I" + 0.000*"said" + '
  '0.000*"new" + 0.000*"Stocks" + 0.000*"Grade" + 0.000*"announced" + '
  '0.000*"premarket"'),
 (3,
  '0.019*"vaccine" + 0.016*"doses" + 0.015*"PLC" + 0.014*"million" + '
  '0.014*"patients" + 0.012*"UK" + 0.011*"first" + 0.009*"jab" + '
  '0.009*"government" + 0.009*"people"'),
 (4,
  '0.022*"US" + 0.014*"firm" + 0.013*"results" + 0.009*"Government" + '
  '0.009*"treatment" + 0.009*"operating" + 0.009*"deal" + 0.008*"completed" + '
  '0.006*"Atacand" + 0.006*"Elsewhere"'),
 (5,
  '0.028*"vaccine" + 0.022*"said" + 0.015*

### 7.4 Compute Model Perplexity and Coherence Score

Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is.

### NLTK preprocessed stemmed articles

In [None]:
# Compute Perplexity
print('\nPerplexity: ', nltk_stem_lda_model.log_perplexity(corpus_nltk_stem))  # a measure of how good the model is, the lower the better.

# Compute Coherence Score
nltk_stem_coherence_model_lda = CoherenceModel(model=nltk_stem_lda_model, texts=nltk_stemmed_articles, dictionary=id2word_nltk_stem, coherence='c_v')
coherence_lda = nltk_stem_coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.251362434417866

Coherence Score:  0.41024711207836706


### NLTK preprocessed lemmatised articles

In [None]:
# Compute Perplexity
print('\nPerplexity: ', nltk_lem_lda_model.log_perplexity(corpus_nltk_lem))  # a measure of how good the model is, the lower the better.

# Compute Coherence Score
nltk_lem_coherence_model_lda = CoherenceModel(model=nltk_lem_lda_model, texts=nltk_lemmatised_articles, dictionary=id2word_nltk_lem, coherence='c_v')
coherence_lda = nltk_lem_coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.598879257642947

Coherence Score:  0.4153665581845173


### spaCy preprocessed lemmatised articles

In [None]:
# Compute Perplexity
print('\nPerplexity: ', spacy_lem_lda_model.log_perplexity(corpus_spacy_lem))  # a measure of how good the model is, the lower the better.

# Compute Coherence Score
spacy_lem_coherence_model_lda = CoherenceModel(model=spacy_lem_lda_model, texts=spacy_lemmatised_articles, dictionary=id2word_spacy_lem, coherence='c_v')
coherence_lda = spacy_lem_coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.738811766758385

Coherence Score:  0.40859644898360276


One approach to finding the optimal number of topics is to build many LDA models with different numbers of topics and pick the one that gives the highest coherence value.

### 7.5 Visualise the topics and keywords


pyLDAvis is a visualisation tool designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualisation. 

Each bubble on the left hand plot represents a topic, the larger the bubble the more important the topic, relative to the data.

A good topic model will have fairly large, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant. A model with too many topics, will typically have many overlapping, smaller bubbles clustered in one region of the chart indicating the similarity between topics.

Saliency is a measure of how much the term tells you about the topic.

Relevance is a weighted average of the probability of the word given the topic and the word given the topic normalised by the probability of the topic.

### NLTK preprocessed stemmed articles

In [None]:
# Visualise the 10 topics 
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(nltk_stem_lda_model, corpus_nltk_stem, id2word_nltk_stem)
vis

### NLTK preprocessed lemmatised articles

In [None]:
# Visualise the 10 topics 
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(nltk_lem_lda_model, corpus_nltk_lem, id2word_nltk_lem)
vis

### spaCy preprocessed lemmatised articles

In [None]:
# Visualise the 10 topics 
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(spacy_lem_lda_model, corpus_spacy_lem, id2word_spacy_lem)
vis