# NLP for Health and Social Care 2018
## Skills Development Workshop
-----------------------------------------



### Before we can do anything we need data!

The data we are interested in are held as PDF files, which is typical for published reports.

We've manually stored a sample of PDFs in the repo; links to the sources of these files can be found in the issues.

We have not included a function for web-scraping PDFs in this notebook. This could be added to enable automatic collection of large numbers of documents. (Tip: scrape documents into the same file structure as the repo to enable seamless integration with the code below.)
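
As a starting point for that tip, here is a minimal sketch of a downloader. It assumes you already have a list of direct PDF links (it does not crawl pages to discover them), and the function name `download_pdfs`, the example URL and the folder name are all hypothetical, chosen only so that the saved files would match the `Data/*reports/*.pdf` pattern used later.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_pdfs(urls, folder):
    """Download each PDF link into `folder`, keeping the file name from the URL."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for url in urls:
        name = os.path.basename(urlparse(url).path)
        if not name.lower().endswith(".pdf"):
            continue  # skip anything that does not look like a PDF link
        target = os.path.join(folder, name)
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            out.write(response.read())
        saved.append(target)
    return saved


# Hypothetical usage: replace the URL with the report links listed in the repo issues.
# download_pdfs(["https://example.org/reports/sample.pdf"], "Data/example_reports")
```

Check the terms of use of the source website before collecting documents in bulk.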

---
### Read a PDF file and convert it to text

Let's start by importing a single PDF file, sample.pdf, and converting it to text.

To do this we use the pdfminer3k library (more info at https://github.com/jaepil/pdfminer3k).
If you are using it for the first time, you will need to install it into Anaconda by running the conda install command below in a terminal window.

This code creates a function called convert, which is called with the file path of the PDF file.

In [25]:
# install pdfminer3k if using for the first time (paste the command below into a terminal window)
# conda install -c conda-forge pdfminer3k

# import pdf miner libraries

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

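# create a function to read a pdf document and convert it to text; returns the text

def convert(fname, pages=None):
    # pages: optional iterable of page numbers to convert (default: all pages)
    # note: the pages option had not been made to work when this notebook was written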

    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text 

# create an object using the converted text, use the file path for the data
text_file = convert('Data/sample.pdf')
print(text_file)

The Fremantle Trust
Apthorp Care Centre

Inspection report

Nurserymans Road
London
N11 1EQ

Tel: 02082114000
Website: www.fremantletrust.org

Ratings

Date of inspection visit:
09 October 2017
10 October 2017
11 October 2017

Date of publication:
04 January 2018

Overall rating for this service

Requires Improvement  

Is the service safe?

Is the service effective?

Is the service caring?

Is the service responsive?

Is the service well-led?

Requires Improvement     

Requires Improvement     

Good     

Good     

Requires Improvement     

1 Apthorp Care Centre Inspection report 04 January 2018

Summary of findings

Overall summary

This inspection took place on 9, 10 and 11 October 2017 and was unannounced. During the last inspection 
on 1 February 2017 we found the service was in breach of four legal requirements and regulations 
associated with the Health and Social Care Act 2008. We found that people who used the service were not 
protected against the risks associated with 

---
### Read and convert multiple PDF files, create a dataframe of the text

In [24]:
import pandas as pd
import glob
from sklearn.model_selection import train_test_split
import re

files = glob.glob("Data/*reports/*.pdf")

records = []

for i in files:

    #open the file
    text_file = convert(i)

    # read and process text
    text = text_file.replace("\n", " ")

#     # let's remove the class labels from the text in case they are influencing the tests
#     to_remove = ['good', 'outstanding', 'inadequate', 'improve']

#     for term in to_remove:
#         text = text.replace(term, ' ')

    text = re.sub('[^a-zA-Z]+', ' ', text)
    # make it all lower case
    text = text.lower()

    # append to our list of reports
    records.append(text)

#convert to dataframe
df = pd.DataFrame(records)

# write the full text of all reports to file
df.to_csv("Data/full_text.txt", header = None, index= False, quotechar=" ")

# print the head to show a couple of rows
print(df.head())


                                                   0
0  miss s g howard victoria lodge care home date ...
1  healthcare homes lsc limited sandown park care...
2  the cedars baildon limited the cedars inspecti...
3  valeo limited the lodge inspection report scar...
4  summerfield private residential home limited s...


---
## NLP Basics

Now that we have our PDF files converted into text and saved as a dataframe, we can use NLP techniques to prepare the data.

Here are some definitions of words we'll be seeing a lot of in NLP:

> *Tokenization* – the process of converting a text into tokens

> *Tokens* – words or entities present in the text

> *Text object* – a sentence, a phrase, a word or an article
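
To make these definitions concrete, here is a minimal tokenization sketch using only the standard library. The `tokenize` helper and the example sentence are illustrative assumptions; NLTK's `word_tokenize`, used later in this notebook, handles many more cases.

```python
import re


def tokenize(text):
    """Split a text object into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


text_object = "People who used the service weren't always protected."
tokens = tokenize(text_object)
print(tokens)
# ['people', 'who', 'used', 'the', 'service', "weren't", 'always', 'protected']
```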

Let's go through some basic NLP functions that will help us.


| Table of Contents |
| ----------------- |
| Stemming          |
| Lemmatisation |
| Word Embeddings |
| Part-of-Speech Tagging |
| Named Entity Disambiguation |
| Named Entity Recognition |
| Sentiment Analysis |
| Semantic Text Similarity |
| Language Identification |
| Text Summarisation |

In [1]:
import nltk 
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

---
### Noise Removal

Any piece of text that is not relevant to the context of the data or the end output can be treated as noise.

For example:
- language stopwords (commonly used words of a language, such as is, am, the, of, in),
- URLs or links and social media entities (mentions, hashtags),
- punctuation and industry-specific words.

This step removes all types of noisy entities present in the text.

A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate the text object by tokens (or by words), eliminating those tokens which are present in the noise dictionary.

In [4]:
noise_list = ["is", "a", "this", "..."] 

def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

_remove_noise("this is a at the ... sample text")

'at the sample text'

In [5]:
# Sample code to remove a regex pattern 
import re


def _remove_regex(input_text, regex_pattern):
    # re.sub removes every match of the pattern in a single pass
    return re.sub(regex_pattern, '', input_text)

# raw string so that \w reaches the regex engine unchanged
regex_pattern = r"#[\w]*" 

_remove_regex("remove this #hashtag from this website", regex_pattern)

'remove this  from this website'

---
### Lexicon Normalization


Normalization is a pivotal step for feature engineering with text: it maps the many surface forms of a word (N different features) onto a single form (1 feature), which reduces the number of dimensions an ML model has to deal with.

Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s” etc.) from a word.

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).
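
To illustrate what "rule-based stripping of suffixes" means, here is a deliberately naive sketch. It is not the Porter algorithm used below; the suffix list comes from the description above, and `naive_stem` and its `min_stem` parameter are hypothetical names.

```python
# Suffixes are tried in this order, so "es" is tested before "s".
SUFFIXES = ("ing", "ly", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least `min_stem` letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for word in ["multiplying", "quickly", "boxes", "reports", "caring", "bus"]:
    print(word, "->", naive_stem(word))
```

Note that "caring" becomes "car": rules with no vocabulary knowledge can over-strip, which is the weakness lemmatization addresses.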

In [6]:
from nltk.stem.wordnet import WordNetLemmatizer 

# Defining objects 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
# Defining objects
stem = PorterStemmer()

In [7]:
word_1 = "multiplying" 
word_2 = "integrating"
print(lem.lemmatize(word_1, "v"))
print(lem.lemmatize(word_2, "v"))

multiply
integrate


In [8]:
word_1 = "multiplying" 
word_2 = "integrating"
print(stem.stem(word_1))
print(stem.stem(word_2))

multipli
integr


---
### Object Standardization

In [65]:
text_raw ="This is a rt retweeted tweet by me and its awsm"

lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
def _lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
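    for word in words:
        # swap in the standard form when the word is in the lookup dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word) 
    # join once after the loop, so empty input also returns a string
    new_text = " ".join(new_words) 
    return new_text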

_lookup_words(text_raw)

'This is a Retweet retweeted tweet by me and its awesome'

---
#### Text to Features
#### Syntactic Parsing

In [72]:
from nltk import word_tokenize, pos_tag

text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)

print(pos_tag(tokens))

# Tag meanings:
# NNP - proper noun
# VBG - verb, gerund

sentence_1 = "Please book my flight for Delhi"

sentence_2 = "I am going to read this book in the flight"


# "book" is a verb in sentence_1 and a noun in sentence_2,
# so the right tag depends on context

pos_tag(word_tokenize(sentence_2))

pos_tag(word_tokenize(sentence_1))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]


#### Named Entity Recognition (NER)

The process of detecting named entities such as person names, location names and company names in text is called Named Entity Recognition (NER).
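
As a rough illustration of the idea, the sketch below pulls two entity types out of text taken from the sample report using regular expressions. This is a pattern-based stand-in, not statistical NER: the patterns, the `extract_entities` helper and the entity labels are assumptions made for this example, and the postcode pattern is a simplification of the UK format.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Dates written like "04 January 2018"
DATE_RE = re.compile(r"\b\d{1,2} (?:%s) \d{4}\b" % MONTHS)
# Simplified UK postcode, e.g. "N11 1EQ"
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}\b")


def extract_entities(text):
    """Return the dates and postcodes found in `text`, grouped by entity type."""
    return {
        "DATE": DATE_RE.findall(text),
        "POSTCODE": POSTCODE_RE.findall(text),
    }


sample = ("Apthorp Care Centre Nurserymans Road London N11 1EQ "
          "Date of inspection visit: 09 October 2017 "
          "Date of publication: 04 January 2018")
entities = extract_entities(sample)
print(entities)
```

Patterns like these need the raw text: the cleaning step earlier in the notebook lowercases everything and strips digits, which would remove both entity types.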

---
### Data Analysis

Now that we have cleaned our text and formatted the data structure, we can carry out some analysis - *yipee!*

---
#### Topic Modelling 

Topic modeling is a process of automatically identifying the topics present in a text corpus. It derives the hidden patterns among the words in the corpus in an unsupervised manner. 

>Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. 

> A good topic model results in – “health”, “doctor”, “patient”, “hospital” for a topic – Healthcare, and “farm”, “crops”, “wheat” for a topic – “Farming”.

In [156]:
# Latent Dirichlet Allocation (LDA)

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]
doc_clean

[['Sugar',
  'is',
  'bad',
  'to',
  'consume.',
  'My',
  'sister',
  'likes',
  'to',
  'have',
  'sugar,',
  'but',
  'not',
  'my',
  'father.'],
 ['My',
  'father',
  'spends',
  'a',
  'lot',
  'of',
  'time',
  'driving',
  'my',
  'sister',
  'around',
  'to',
  'dance',
  'practice.'],
 ['Doctors',
  'suggest',
  'that',
  'driving',
  'may',
  'cause',
  'increased',
  'stress',
  'and',
  'blood',
  'pressure.']]
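
The token lists above still contain punctuation ('consume.', 'sugar,'), mixed case ('My' and 'my') and stopwords, all of which end up as separate terms in the topic model. A minimal cleaning sketch that could be applied before building the dictionary is shown here; the `clean` helper and the small stopword set are illustrative assumptions, not a complete stopword list.

```python
import string

# Illustrative stopword set only; a real list would be much longer.
STOPWORDS = {"is", "to", "my", "a", "of", "and", "that", "but", "not", "have", "may"}


def clean(doc):
    """Lowercase, strip punctuation, split on whitespace and drop stopwords."""
    no_punct = doc.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOPWORDS]


doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
cleaned = clean(doc1)
print(cleaned)
# ['sugar', 'bad', 'consume', 'sister', 'likes', 'sugar', 'father']
```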

#### Gensim

> Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. https://radimrehurek.com/gensim/index.html

In [152]:
import gensim

from gensim import corpora

In [158]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = corpora.Dictionary(doc_clean)

# Convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
# Each document becomes a list of (term id, count) pairs, i.e. how many times each term appears in it.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

print (doc_term_matrix)

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print(ldamodel.print_topics())

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2)], [(0, 1), (9, 1), (11, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)], [(17, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)]]
[(0, u'0.029*"my" + 0.029*"sister" + 0.029*"My" + 0.029*"to" + 0.029*"increased" + 0.029*"stress" + 0.029*"Doctors" + 0.029*"suggest" + 0.029*"and" + 0.029*"may"'), (1, u'0.064*"driving" + 0.037*"lot" + 0.037*"a" + 0.037*"around" + 0.037*"time" + 0.037*"dance" + 0.037*"father" + 0.037*"of" + 0.037*"practice." + 0.037*"spends"'), (2, u'0.089*"to" + 0.051*"My" + 0.051*"sister" + 0.051*"my" + 0.051*"sugar," + 0.051*"not" + 0.051*"Sugar" + 0.051*"is" + 0.051*"bad" + 0.051*"have"')]


In [154]:
text="He is a big big boy"

doc_clean= [text.split()]
dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print(ldamodel.print_topics()[0])

(0, u'0.200*"He" + 0.200*"boy" + 0.200*"is" + 0.200*"a" + 0.200*"big"')


In [163]:
# N-Grams as Features
def generate_ngrams(text, n):
    words = text.split()
    output = []  
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

# Bigrams (n = 2)
generate_ngrams('this is a sample text', 2)

[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]

#### Term Frequency (TF) 

> TF for a term “t” is defined as the count of the term “t” in a document “D”

#### Inverse Document Frequency (IDF) 
> IDF for a term “t” is defined as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term “t”.
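
The two definitions can be worked through by hand on a small corpus. This sketch implements them exactly as stated above; the helper names `tf`, `idf` and `tf_idf` are assumptions for the example, and the corpus is a punctuation-free version of the one used in the next cell.

```python
import math

corpus = [
    "this is sample document",
    "another random document",
    "third sample document text",
]
docs = [text.split() for text in corpus]


def tf(term, doc):
    """Raw count of `term` in one tokenized document."""
    return doc.count(term)


def idf(term, docs):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        raise ValueError("term does not occur in the corpus")
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc) * idf(term, docs)


for term in ("document", "sample", "random"):
    print(term, round(idf(term, docs), 3))
# document 0.0
# sample 0.405
# random 1.099
```

A term that appears in every document ("document") gets an IDF of zero, while rarer terms score higher. scikit-learn's `TfidfVectorizer` uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2-normalises each row by default, so its numbers differ from these.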

In [164]:
from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)

print(X)

  (0, 7)	0.5844829010200651
  (0, 2)	0.5844829010200651
  (0, 4)	0.444514311537431
  (0, 1)	0.34520501686496574
  (1, 1)	0.3853716274664007
  (1, 0)	0.652490884512534
  (1, 3)	0.652490884512534
  (2, 4)	0.444514311537431
  (2, 1)	0.34520501686496574
  (2, 6)	0.5844829010200651
  (2, 5)	0.5844829010200651


---
### Word Embedding (text vectors)

#### Word2Vec
#### GloVe 

The Word2Vec model is composed of a preprocessing module and two shallow neural network models: Continuous Bag of Words (CBOW) and skip-gram. These models are widely used across NLP problems. Word2Vec first constructs a vocabulary from the training corpus and then learns word embedding representations. The following code uses the gensim package to train embeddings and look up the vector for a word.
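
The similarity score between two word vectors is the cosine of the angle between them. A minimal sketch of that calculation, using made-up vectors rather than trained embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # same direction, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal, 0
```

With a corpus of only four short sentences, as in the cell below, the learned vectors are close to their random initial values, so the similarity scores carry little meaning.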

In [206]:
from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],
             ['machine', 'learning'], ['deep', 'learning']]


# train the model on your corpus
model = Word2Vec(sentences, min_count = 1)

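# query the trained word vectors through model.wv
# (direct model[...] and model.similarity access was removed in newer gensim releases)
print(model.wv.similarity('data', 'science'))

print(model.wv['learning'])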

-0.2366132946433113
[ 2.5001867e-03 -4.5279837e-03  3.2164674e-04 -4.4285697e-03
  4.1545318e-03 -2.1649762e-03 -3.8046767e-03 -1.7539723e-03
  5.9814251e-04  3.5325827e-03 -3.9156429e-03  5.8388693e-04
  2.1662123e-03 -4.5341258e-03  1.5721378e-03  4.6008709e-03
 -2.1712466e-03 -1.1091464e-03 -3.1049079e-03 -3.8329777e-03
  2.6114439e-03  6.9665466e-04  1.8230440e-03 -2.0439368e-04
 -1.3953244e-03  2.5326526e-03 -1.3467000e-03  7.9407363e-04
 -8.5678376e-04 -2.7677296e-03 -3.2886763e-03 -4.6619233e-03
  7.3562146e-06  1.1412736e-03 -1.2158153e-03  7.6474954e-04
 -4.3854597e-03  1.7202114e-03  2.6836181e-03  1.4622611e-03
 -2.5849789e-03  4.8951223e-03  4.8026382e-03 -1.6887399e-03
  3.7905199e-03 -1.8123538e-03 -1.8461549e-03 -3.6006782e-03
 -8.0820097e-04  2.4030278e-03  2.6205764e-03 -1.3611730e-03
  3.2463272e-03  4.4065714e-03 -4.9645538e-03  1.0241962e-03
  7.2724192e-04  1.0305879e-03  3.8043426e-03 -1.5256479e-03
  4.4337765e-04  8.2238915e-04  1.0542222e-03  1.5848925e-04
  2.



---
# Advanced NLP tasks

---

### Text Classification


Text classification is a technique for systematically assigning a text object (document or sentence) to one of a fixed set of categories. It is especially helpful when the amount of data is large, for organizing, information filtering and storage purposes.

A typical natural language classifier consists of two parts: *Training* and *Prediction* 
(as shown in image below) 

First, the text input is processed and features are created. The machine learning model then learns from these features and is used to predict the class of new text.

Examples of text classification applications are: Email Spam Identification, topic classification of news, sentiment classification and organization of web pages by search engines.
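
To show the training and prediction stages side by side, here is a small multinomial Naive Bayes sketch with add-one smoothing, written from scratch. It is not the implementation TextBlob uses, and the training sentences, labels and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labelled_texts):
    """Training: count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_texts:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Prediction: pick the class with the highest log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data in the style of inspection reports
training = [
    ("the staff were kind and caring", "positive"),
    ("residents felt safe and happy", "positive"),
    ("medicines were not managed safely", "negative"),
    ("the service was in breach of regulations", "negative"),
]
model = train(training)
print(predict(model, "staff were caring"))       # positive
print(predict(model, "breach of regulations"))   # negative
```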



In [250]:
# Installing
#pip install -U textblob
#python -m textblob.download_corpora


from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob


training_corpus = [ ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]

test_corpus = [("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus) 

print(model.classify("Their codes are amazing."))
print(model.classify("I don't like their computer."))

print (model.classify(test_corpus[4][0]))
print(model.accuracy(test_corpus))

Class_A
Class_B
Class_A
0.833333333333


In [259]:
import sklearn 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm

# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []

for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = []

for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)

# Fit the vectorizer on the training data and build the training feature vectors
train_vectors = vectorizer.fit_transform(train_data)

# Transform the test data with the same fitted vectorizer
test_vectors = vectorizer.transform(test_data)

# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear') 
model.fit(train_vectors, train_labels) 
prediction = model.predict(test_vectors)


print (classification_report(test_labels, prediction))
 

             precision    recall  f1-score   support

    Class_A       0.50      0.67      0.57         3
    Class_B       0.50      0.33      0.40         3

avg / total       0.50      0.50      0.49         6



---
### Text Matching / Similarity

#### A. Levenshtein Distance 
The Levenshtein distance between two strings is defined as 

> the minimum number of edits needed to transform one string into the other,
> with the allowable edit operations being insertion, deletion, or substitution of a single character. 

The following implementation is memory-efficient: it keeps only the previous row of the distance table rather than the full matrix.

In [264]:
def levenshtein(s1,s2): 
    if len(s1) > len(s2):
        s1,s2 = s2,s1 
    distances = range(len(s1) + 1) 
    for index2,char2 in enumerate(s2):
        newDistances = [index2+1]
        for index1,char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1]) 
            else:
                newDistances.append(1 + min((distances[index1], distances[index1+1], newDistances[-1]))) 
        distances = newDistances 
    return distances[-1]

print(levenshtein("analyze","analyse"))

print(levenshtein("analyze","information"))

1
9


---
### Phonetic Matching 

A phonetic matching algorithm takes a keyword as input (a person’s name, location name etc.) and produces a character string that identifies a set of words that are (roughly) phonetically similar. It is very useful for searching large text corpora, correcting spelling errors and matching relevant names. 

**Soundex** and **Metaphone** are the two main phonetic algorithms used for this purpose. 
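
To show what a Soundex code encodes, here is a simplified pure-Python sketch of the classic algorithm: keep the first letter, map the remaining consonants to digits, skip repeated codes and pad to a fixed length. It is an illustration only and may differ from library implementations in edge cases.

```python
# Letter groups that share a Soundex digit
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word, length=4):
    """Return the Soundex code of `word`, padded or cut to `length` characters."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        if ch not in "hw":  # h and w do not separate letters with the same code
            previous = code
    return (encoded + "0" * length)[:length]


print(soundex("ankita"))  # A523
print(soundex("aunkit"))  # A523
print(soundex("Robert"))  # R163
```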

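Python’s Fuzzy module can be used to compute Soundex strings for different words. The first cell below uses the separate fuzzywuzzy package, which does approximate string matching rather than phonetic matching; the second cell computes Soundex codes with Fuzzy: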

In [296]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio("ACME Factory alpha", "ACME Factory Inc.")
fuzz.partial_ratio("ACME Factory alpha", "ACME Factory Inc.")

76

In [297]:
import fuzzy

soundex = fuzzy.Soundex(4)

print(soundex('ankita'))
print(soundex('aunkit'))

A523
A523
