**This approach uses a Latent Dirichlet Allocation (LDA) model to identify the topics in the provided text.**

In [9]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\250\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel


import spacy


import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

**In addition to NLTK's default English stop words, we add words from the provided text that carry no meaningful information in the context of this document (for example "page", "booklet" and "edition").**

In [10]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['page', 'booklet', 'edition', 'training', 'information','2014','october','2'])

In [11]:
import json
with open("training_booklet.json","r") as f:
    data = json.load(f)

In [12]:
data = list(data.values())

**Text Preprocessing:**
- Email addresses are removed
- Newline characters and other runs of whitespace are collapsed into a single space
- Single quotes are removed
- All remaining characters other than letters, digits, underscores, whitespace, parentheses, forward slashes and backslashes are removed

In [13]:
# Remove email addresses
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse newlines and other whitespace runs into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove single quotes
data = [re.sub(r"'", "", sent) for sent in data]

# Remove everything except word characters, whitespace, parentheses, slashes and backslashes
data = [re.sub(r'[^\w\s()/\\]', '', sent) for sent in data]
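To see what each substitution does, the same four steps can be applied one after another to a single string. This is a minimal sketch: the sample sentence and the helper name `clean_text` are made up for illustration and do not come from the booklet.

```python
import re

def clean_text(text):
    # Hypothetical helper that chains the four substitutions used above.
    text = re.sub(r'\S*@\S*\s?', '', text)       # drop email addresses
    text = re.sub(r'\s+', ' ', text)             # collapse whitespace and newlines
    text = re.sub(r"'", "", text)                # drop single quotes
    text = re.sub(r'[^\w\s()/\\]', '', text)     # drop other punctuation
    return text

# Made-up sample input, not taken from the training booklet
sample = "Contact jane.doe@example.org for help.\nIt's on page 2 (see A/B)!"
cleaned = clean_text(sample)
print(cleaned)
```

Note that parentheses and slashes survive the last step because the character class explicitly excludes them from removal.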

In [14]:
data

['Mid Essex Hospital Services NHS NHS Trust Your Mandatory Training Booklet October 2014 First Edition This book is for you to support and underpin the work you do Please use it wisely abide by the content and work with us to achieve our vision CARING MINDS ARTERIA CARING HEART ',
 'We are extremely grateful to Barts Health NHS Trust who produced their first mandatory training handbook in 2013 in response to the amalgamation of three hospitals under the umbrella of Barts Health They generously shared their hard work and gave us permission to reproduce some of the generic charts and photographs page 2 Your Mandatory Training Booklet  October 2014 First Edition ',
 'Introduction Welcome to the first edition of the MEHT NHS Trust Mandatory Training Booklet We are aware that for some of our staff working specific shifts temporary patterns and family commitments may preclude undertaking training in the more traditional way Please raise any issues or concerns in relation to any of the topics

In [15]:
len(data)

80

**Tokenizing and Cleaning Up:**
As seen above, the text contains 80 entries (sentences/paragraphs).
Each entry is converted to a list of lowercase word tokens; any remaining punctuation and numbers are dropped in the process.

In [16]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; the tokenizer itself drops punctuation

data_words = list(sent_to_words(data))

print(data_words[:1])

[['mid', 'essex', 'hospital', 'services', 'nhs', 'nhs', 'trust', 'your', 'mandatory', 'training', 'booklet', 'october', 'first', 'edition', 'this', 'book', 'is', 'for', 'you', 'to', 'support', 'and', 'underpin', 'the', 'work', 'you', 'do', 'please', 'use', 'it', 'wisely', 'abide', 'by', 'the', 'content', 'and', 'work', 'with', 'us', 'to', 'achieve', 'our', 'vision', 'caring', 'minds', 'arteria', 'caring', 'heart']]
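The tokenizing step can be pictured with a standard-library stand-in. This is only a rough approximation written for illustration, not gensim's implementation: `strip_accents` and `rough_tokenize` are hypothetical names, and the length limits of 2 and 15 are assumed to mirror the defaults of `simple_preprocess`.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose characters, then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def rough_tokenize(text, min_len=2, max_len=15):
    # Lowercase, strip accents, keep runs of letters, drop very short or long tokens.
    text = strip_accents(text.lower())
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Made-up sample sentence
tokens = rough_tokenize("Café staff, please read page 2!")
print(tokens)
```

The accent on "Café" is removed by the accent-stripping step, while the comma, the exclamation mark and the digit disappear because only runs of letters are kept.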


**Remove all stopwords from the text**

In [17]:
def remove_stopwords(texts):
    # texts is already tokenized, so filter each token list directly
    return [[word for word in doc if word not in stop_words] for doc in texts]

In [18]:
data_words_nostops = remove_stopwords(data_words)

#### Adding Bigrams and Trigrams to our data
Bigrams and trigrams are sequences of adjacent words in a document: a bigram is a sequence of two adjacent words and a trigram is a sequence of three. For each paragraph/sentence in the text, word pairs that frequently occur together are detected and joined into a single token with an underscore. This extra information helps the model when it comes to identifying the topics/themes. Note that only the bigram model is applied to the data below; `make_trigrams` is defined but not called.

The bigram and trigram models were built using the gensim Python package.
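The idea of joining frequent adjacent pairs can be sketched with the standard library alone. This is a simplified illustration, not how gensim works internally: gensim's `Phrases` scores each pair against a threshold, whereas the hypothetical `merge_frequent_bigrams` below uses a raw count cutoff, and the token lists are made up.

```python
from collections import Counter

def merge_frequent_bigrams(docs, min_count=2):
    # Count every pair of adjacent tokens across all documents.
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                out.append("_".join(pair))  # join the frequent pair into one token
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs

# Made-up token lists for illustration
docs = [
    ["blood", "transfusion", "error"],
    ["report", "blood", "transfusion"],
    ["blood", "sample"],
]
merged = merge_frequent_bigrams(docs)
print(merged)
```

Only "blood transfusion" appears twice, so it is the only pair that becomes a single token; "blood sample" stays as two separate words.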

In [19]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

In [20]:
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

In [21]:
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

In [22]:
data_words_bigrams = make_bigrams(data_words_nostops)

**Lemmatization**:
Reducing each word in the document to its dictionary form (lemma). Only nouns, adjectives, verbs and adverbs are kept.

In [23]:
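def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is in allowed_postags.

    Uses the global spaCy pipeline `nlp`, which must be loaded before this is called.
    See https://spacy.io/api/annotation
    """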
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [25]:
nlp = spacy.load("en_core_web_sm",disable=['parser', 'ner'])

In [26]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
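The part-of-speech filter inside `lemmatization` can be illustrated without loading spaCy. In this sketch the tokens, lemmas and tags are written by hand as an assumption; the tags a real spaCy pipeline assigns may differ, and `keep_lemmas` is a hypothetical helper.

```python
# Hypothetical, hand-tagged tokens: (text, lemma, part-of-speech tag)
tagged = [
    ("patients", "patient", "NOUN"),
    ("were", "be", "AUX"),
    ("transfused", "transfuse", "VERB"),
    ("with", "with", "ADP"),
    ("blood", "blood", "NOUN"),
    ("safely", "safely", "ADV"),
]

def keep_lemmas(tagged_tokens, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    # Keep the lemma of every token whose POS tag is in the allowed set.
    return [lemma for _, lemma, pos in tagged_tokens if pos in allowed_postags]

lemmas = keep_lemmas(tagged)
print(lemmas)
```

Function words such as the auxiliary "were" and the preposition "with" are dropped, which leaves the content words the topic model works with.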

**Building the LDA Model:**

In [27]:
id2word = corpora.Dictionary(data_lemmatized)

In [28]:
texts = data_lemmatized

In [29]:
corpus = [id2word.doc2bow(text) for text in texts]
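Each entry of `corpus` is a bag of words: a list of `(token_id, count)` pairs for one document. The standard-library sketch below shows that format on made-up token lists. `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign (order of first appearance) are an assumption that need not match the ids gensim's `Dictionary` would produce.

```python
from collections import Counter

def build_vocab(docs):
    # Assign each distinct token an integer id in order of first appearance.
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    # Count known tokens and return sorted (token_id, count) pairs.
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Made-up token lists for illustration
docs = [["record", "care", "record"], ["care", "patient"]]
vocab = build_vocab(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]
print(bow_corpus)
```

Word order is discarded in this representation; only how often each token occurs in a document is kept.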

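**Some model parameters**
- `num_topics=20`: the model is configured to generate 20 topics
- `passes=10`: training makes 10 passes over the corpus
- `alpha='auto'`: the document-topic prior is learned from the data rather than fixed in advance
- `random_state=100`: fixes the random seed so that results are reproducible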

In [30]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

### Raw text topic output from the model:
- The model outputs 20 topic ids, each with its 10 highest-weighted keywords and their weights, from which we can determine the general themes

In [31]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.051*"record" + 0.025*"care" + 0.025*"record_keepe" + 0.015*"good" + '
  '0.014*"professional" + 0.014*"make" + 0.012*"patient" + 0.011*"clear" + '
  '0.010*"ensure" + 0.008*"entry"'),
 (1,
  '0.056*"abuse" + 0.019*"safeguard" + 0.019*"type" + 0.016*"apply" + '
  '0.015*"adult" + 0.014*"essex" + 0.014*"tick" + 0.013*"institutional" + '
  '0.012*"individual" + 0.012*"relationship"'),
 (2,
  '0.026*"blood" + 0.025*"transfusion" + 0.016*"patient" + 0.016*"component" + '
  '0.010*"error" + 0.009*"first" + 0.009*"sample" + 0.009*"contact" + '
  '0.009*"transfuse" + 0.009*"reaction"'),
 (3,
  '0.024*"abuse" + 0.020*"adult" + 0.015*"report" + 0.012*"evidence" + '
  '0.011*"vulnerable" + 0.009*"safeguarding" + 0.009*"user" + 0.008*"happen" + '
  '0.008*"people" + 0.007*"concern"'),
 (4,
  '0.001*"child" + 0.001*"abuse" + 0.001*"need" + 0.001*"family" + 0.001*"use" '
  '+ 0.001*"violence" + 0.001*"care" + 0.001*"trust" + 0.001*"make" + '
  '0.001*"health"'),
 (5,
  '0.017*"use" + 0.013
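Each topic above is printed as a string of `weight*"word"` terms. When the weights are needed as numbers, such a string can be split with a regular expression, as in this minimal sketch. `parse_topic` is a hypothetical helper, and the input is the first part of topic 0 as printed above.

```python
import re

def parse_topic(topic_string):
    # Each term looks like 0.051*"record"; capture the weight and the word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First four terms of topic 0 from the output above
topic_0 = '0.051*"record" + 0.025*"care" + 0.025*"record_keepe" + 0.015*"good"'
terms = parse_topic(topic_0)
print(terms)
```

In practice parsing is usually unnecessary, since gensim's `lda_model.show_topic(topic_id)` returns the `(word, weight)` pairs directly.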

**Interactive Visualization of the model's output**

In [32]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

**Saving the Visualization as an HTML File**

In [34]:
from IPython.display import HTML

In [35]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
pyLDAvis.save_html(vis, 'lda_visualization.html')

# filename= renders the saved file; a bare string would be treated as raw HTML text
display(HTML(filename='lda_visualization.html'))