# NLP Introduction

![nlp](https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2019/01/NLP-Technology-in-Healthcare.jpg)

# 1.0.0 - Setup and course info

These are the notes and exercises from the following Udemy course: https://www.udemy.com/course/nlp-natural-language-processing-with-python

# 2.0.0 - Text formatting basics

## 2.0.1 - Formatted strings with f-strings and format method.
Before Python 3.6 it was common to format strings with the `.format` method. This repository targets Python 3.7 and above, so we can take advantage of formatted string literals, commonly known in the Python world as `f-strings`.

In [29]:
# an example of the .format method
v1 = 'One'
v2 = 'Two'
v3 = 'Three'
print("{}, {}, {}".format(v1,v2,v3))

One, Two, Three


In [30]:
# an f-string example 
v1 = 'One'
v2 = 'Two'
v3 = 'Three'
print(f"{v1}, {v2}, {v3}")

One, Two, Three


F-strings also let us evaluate expressions inside the braces, such as dictionary lookups.

In [6]:
# create an example dictionary
d = { "id": 12345, "ref": 3335577, "name": "Ed" }

In [7]:
print(f"Employee: {d['name']} has id: {d['id']} and reference: {d['ref']}")

Employee: Ed has id: 12345 and reference: 3335577
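
As a small sketch of what else can go inside the braces: an f-string accepts expressions, not only variable names. The dictionary repeats the one above; the `bonus` key and the arithmetic are made up purely for illustration.

```python
# the same example dictionary as above, plus a hypothetical 'bonus' key
d = {"id": 12345, "ref": 3335577, "name": "Ed", "bonus": 1250.5}

# method calls, arithmetic and format specifiers all work inside the braces
upper_name = f"{d['name'].upper()}"
next_id = f"{d['id'] + 1}"
bonus = f"{d['bonus']:,.2f}"

print(f"Employee: {upper_name}, next id: {next_id}, bonus: {bonus}")
```

The `,.2f` part is a format specifier: a thousands separator and two decimal places.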


## 2.0.2 - Formatting structures with f-strings
Let's have a look at formatting some structured data in a way that's a little more readable.

In [8]:
data = [('Author', 'Topic', 'Pages'), 
        ('A. Thakur', 'Approaching ML', 300), 
        ('J. Howard', 'fastai/Pytorch', 550 ), 
        ('D. Spiegelhalter', 'Art of Stats', 330)]

In [9]:
# show a poorly formatted (unformatted) output of the table
for author, topic, pages in data:
    print(f"{author} {topic} {pages}")

Author Topic Pages
A. Thakur Approaching ML 300
J. Howard fastai/Pytorch 550
D. Spiegelhalter Art of Stats 330


In [10]:
# pad each column to a fixed width; '>' right-aligns the page count
for author, topic, pages in data:
    print(f"{author:{16}} {topic:{30}} {pages:>{6}}")

Author           Topic                           Pages
A. Thakur        Approaching ML                    300
J. Howard        fastai/Pytorch                    550
D. Spiegelhalter Art of Stats                      330
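
The widths above are hard-coded. One possible variation, sketched here under the assumption that the whole table fits in memory, derives each width from the data and uses the `<`, `^` and `>` alignment flags. The `widths` and `rows` names are introduced only for this example.

```python
data = [('Author', 'Topic', 'Pages'),
        ('A. Thakur', 'Approaching ML', 300),
        ('J. Howard', 'fastai/Pytorch', 550),
        ('D. Spiegelhalter', 'Art of Stats', 330)]

# derive the column widths from the data instead of hard-coding them
widths = [max(len(str(row[i])) for row in data) for i in range(3)]

rows = []
for author, topic, pages in data:
    # '<' left-aligns, '^' centres and '>' right-aligns within the width
    rows.append(f"{author:<{widths[0]}} | {topic:^{widths[1]}} | {pages:>{widths[2]}}")

print("\n".join(rows))
```

Because every cell is padded to its column width, all rows come out the same length.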


## 2.0.3 - Formatting date and time data

We may want, or need, to format date and time data. http://strftime.org lists the format codes we can use to match our formatting intentions.

In [11]:
from datetime import datetime

# declare a date
today = datetime(year=2020, month=9, day=19)

print(f"{today}")

2020-09-19 00:00:00


In [16]:
# some selected field formats applicable to the day
print(f"{today:%a}")
print(f"{today:%A}")
print(f"{today:%w}")

# some selected field formats applicable to the month
print(f"{today:%b}")
print(f"{today:%B}")
print(f"{today:%m}")

Sat
Saturday
6
Sep
September
09
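
The individual codes can also be combined with literal text in a single format specifier. A minimal sketch, assuming the default C locale (day and month names are locale-dependent); `long_form` and `iso_form` are names chosen for this example.

```python
from datetime import datetime

today = datetime(year=2020, month=9, day=19)

# several codes can be combined with literal text in one format spec
long_form = f"{today:%A %d %B %Y}"
iso_form = f"{today:%Y-%m-%d}"

print(long_form)
print(iso_form)
```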


# 2.1.0 - Working with Text files in Python

In [19]:
# Jupyter cell magic for quickly writing a test file

In [31]:
%%writefile sampletext.txt
This is a sample text file for testing.
This is the second line of the file
...and this is the third.

Overwriting sampletext.txt


## 2.1.1 - opening and reading files

In [32]:
# open() is Python's built-in function for opening a file
myfile = open('sampletext.txt')

In [33]:
# read the file in its entirety
myfile.read()

'This is a sample text file for testing.\nThis is the second line of the file\n...and this is the third.\n'

Repeated calls to `read()` do not return the content again, because a file object keeps track of a cursor position. After a call to `read()` the cursor sits at the end of the file, so a subsequent call returns an empty string. Calling `seek(0)` moves the cursor back to the beginning of the file, after which `read()` returns the full content again.

In [34]:
# reset the file cursor 
myfile.seek(0)

# assign the contents of a file to a variable 
content = myfile.read()

# close the file. We should always close a file once we have finished
# with it. Forgetting to clean up here can cause errors in other scripts
# or programs that require the file.
myfile.close()

# note that we have opened, grabbed and closed the file but our variable lives 
# on and we can work with the content of a file without it needing to be open. 
print(content)

This is a sample text file for testing.
This is the second line of the file
...and this is the third.
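
The cursor behaviour described above can be observed directly. This is a minimal sketch that writes its own throwaway file (the `cursor_demo.txt` name is hypothetical) so that `sampletext.txt` is left untouched.

```python
import os
import tempfile

# write a throwaway file so the sketch does not touch sampletext.txt
path = os.path.join(tempfile.mkdtemp(), "cursor_demo.txt")
with open(path, "w") as fh:
    fh.write("line one\nline two\n")

myfile = open(path)
first = myfile.read()      # reads everything, cursor moves to the end
end_pos = myfile.tell()    # position marker for the cursor (opaque in text mode)
second = myfile.read()     # nothing left to read, so this is ''
myfile.seek(0)             # move the cursor back to the start
third = myfile.read()      # the full content again
myfile.close()

print(repr(first), end_pos, repr(second), third == first)
```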



## 2.1.2 - Reading files line by line 

When working with files we often want to read and process them line by line. The `readlines()` method reads the whole file and returns a list with one string per line.

In [50]:
myfile = open('resources/sampletext.txt')

In [51]:
mylines = myfile.readlines()

In [52]:
mylines

['This is a sample text file for testing.\n',
 'This is the second line of the file\n',
 '...and this is the third.\n']

Now that we have a variable holding the lines of the file, we can demonstrate some simple operations we can perform with it.

In [53]:
# We have a newline character at the end of each line. If we simply
# print each line they will be separated by an empty line, because
# print() also ends its output with a newline by default. To show
# the lines as they appear in the file we can iterate over them and
# drop the very last character of each line, because we know that
# for every line of our file that character is a newline.

for line in mylines:
    print(line[:-1])

This is a sample text file for testing.
This is the second line of the file
...and this is the third.
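
Slicing with `line[:-1]` relies on every line ending in a newline. A hedged alternative is `rstrip('\n')`, which only removes a newline when one is present. The sketch below uses a hypothetical scratch file whose last line has no trailing newline, and iterates over the file object directly rather than calling `readlines()`.

```python
import os
import tempfile

# hypothetical throwaway file whose last line has no trailing newline
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as fh:
    fh.write("first line\nsecond line\nlast line")

cleaned = []
with open(path) as fh:
    # a file object yields its lines one at a time when iterated
    for line in fh:
        # rstrip('\n') removes a trailing newline only if one is present,
        # whereas line[:-1] would cut the 'e' from 'last line'
        cleaned.append(line.rstrip("\n"))

print(cleaned)
```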


In [64]:
# we can do other ridiculous operations too such as showing
# only every other word

for line in mylines:
    line = line[:-1]
    words = line.split()
    subset = [x for idx, x in enumerate(words) if idx % 2 == 0]
    print(f"File content     : {line}")
    print(f"Processed content: {subset}")
    

File content     : This is a sample text file for testing.
Processed content: ['This', 'a', 'text', 'for']
File content     : This is the second line of the file
Processed content: ['This', 'the', 'line', 'the']
File content     : ...and this is the third.
Processed content: ['...and', 'is', 'third.']


In [65]:
myfile.close()

## 2.1.3 - Writing to a file 

Now we'll have a look at writing to a file. Note that writing to a file and appending to a file are two different things: writing overwrites a file's content, while appending adds new content after the existing content.

In [66]:
# open a file in both read and write mode
# caution: 'w+' truncates the file on opening, so its current content is lost
myfile = open('resources/sampletext.txt', 'w+')

In [68]:
# note that the read method returns an empty result here because our
# file has effectively been truncated and is ready to be written to
# and that content will be the entire content of the file
myfile.read()

''

In [69]:
myfile.write("This is the new content of the file.")

36

In [70]:
# reset the cursor and read the file to confirm our new content. 
myfile.seek(0)
myfile.read()

'This is the new content of the file.'

In [71]:
myfile.close()

## 2.1.4 - Appending to a file

In [72]:
# open file in append mode. Allows reading and appending to file. 
myfile = open('resources/sampletext.txt', 'a+')

In [73]:
myfile.write("This is the appended content")

28

In [75]:
# close the file 
myfile.close()

In [76]:
# reopen the file 
myfile = open('resources/sampletext.txt')
myfile.read()

'This is the new content of the file.This is the appended content'

In [77]:
myfile.close()

Pairing `open()` with `close()` can be a fragile mechanism, because the file stays open if `close()` is forgotten or an error occurs before it is reached. An alternative is the `with open(...)` statement, which opens a file in a given mode for use within a code block and closes it implicitly at the end of the block.

In [78]:
with open('resources/sampletext.txt') as myfile:
    lines = myfile.read()
    print(lines)

This is the new content of the file.This is the appended content
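
The same `with` pattern works for the write and append modes covered earlier. A minimal sketch, using a hypothetical scratch file so that `resources/sampletext.txt` is left alone, and adding explicit newlines so the appended text starts on its own line:

```python
import os
import tempfile

# hypothetical scratch file, so resources/sampletext.txt is left alone
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# 'w' creates the file (or truncates an existing one) before writing
with open(path, "w") as fh:
    fh.write("This is the new content of the file.\n")

# 'a' keeps the existing content and writes at the end
with open(path, "a") as fh:
    fh.write("This is the appended content\n")

with open(path) as fh:
    lines = fh.read().splitlines()

# the file is closed as soon as each with block ends
print(lines, fh.closed)
```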


## 2.1.5 - Working with pdf files

It is common to need to read data from PDF files. We can use the `PyPDF2` library to help here, but it is worth noting that not all PDFs have text that can be extracted. PDFs created by scanning, rather than by saving a text document in .pdf format, store each page as an image, so extracting their text requires more specialised software such as OCR. The calls used here (`PdfFileReader`, `numPages`, `getPage`, `extractText`) are the PyPDF2 1.x/2.x names; PyPDF2 3.0 replaced them with `PdfReader`, `len(reader.pages)`, `reader.pages[0]` and `extract_text()`.

In [80]:
import PyPDF2

In [82]:
myfile = open('resources/US_Declaration.pdf', mode='rb')

In [83]:
pdf_reader = PyPDF2.PdfFileReader(myfile)

In [84]:
pdf_reader.numPages

5

In [85]:
p1 = pdf_reader.getPage(0)

In [88]:
page_text = p1.extractText()

In [90]:
page_text

"Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the\npolitical bands which have connected them with another, and to assume among the powers of the\nearth, the separate and equal station to which the Laws of Nature and of Nature's God entitle\n\nthem, a decent respect to the opinions of mankind requires that they should declare the causes\n\nwhich impel them to the separation. \nWe hold these truths to be self-evident, that all men are created equal, that they are endowed by\n\ntheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving\n\ntheir just powers from the consent of the governed,ŠThat whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or 

In [91]:
myfile.close()

In [92]:
# copy the first page of the PDF into a new, single-page document
f = open('resources/US_Declaration.pdf', mode='rb')
pdf_reader = PyPDF2.PdfFileReader(f)
p1 = pdf_reader.getPage(0)
pdf_writer = PyPDF2.PdfFileWriter()
pdf_writer.addPage(p1)

In [93]:
# PDF data is binary, so the output file is opened in 'wb' mode
pdf_out = open('resources/newpdf.pdf', mode='wb')
pdf_writer.write(pdf_out)

In [94]:
pdf_out.close()
f.close()

## 2.1.6 - Regex

We're going to cover regex (regular expressions) as a way to extract pattern-matched data from text. We may know the general format but not the specifics of what we're looking for, e.g. find all phone numbers in a document. This is where regex comes into play. Syntactically, regex can be quite daunting at first, but there are countless websites that can help, and by grasping the basics of regex you can go a long way towards understanding fairly complex patterns. Let's look at some rules.
- every character type has a corresponding pattern code.
- a `\` tells the regex engine that the next character is part of a special code and not just a typical character/letter.
- e.g. digits have the `\d` pattern.
- We can use a raw-string pattern of `r'\d{3}-\d{3}-\d{4}'` to match text of the form 999-999-9999. The `r` prefix stops Python from interpreting the backslashes itself, whereas an `f` prefix would evaluate `{3}` as an expression and change the pattern.
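
A quick check of the phone-number pattern from the list above, written as a raw string. This is only a sketch; `ok` and `too_short` are names chosen for the example.

```python
import re

# raw string: backslashes reach the regex engine untouched
pattern = r'\d{3}-\d{3}-\d{4}'

# \d matches one digit and {n} repeats the preceding token n times
ok = re.fullmatch(pattern, '999-999-9999') is not None
too_short = re.fullmatch(pattern, '99-999-9999') is not None

# an f-string would evaluate {3} as an expression and yield '\d3-\d3-\d4'
print(ok, too_short)
```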

In [95]:
# we need some sample text to demonstrate regex in python
sample_text = "The Tel number of Norman, candidate #1 is 012-443-6955. Candidate #2, Morgan can be reached on 332-445-7712 after 17:00"

In [96]:
import re

In [98]:
pattern = "Morgan"

match = re.search(pattern, sample_text)

In [100]:
match.span()

(70, 76)

Matching where multiple instances may be found requires a different method.

In [101]:
sample_text = "Morgan called and said yes. Morgan can start on Monday after next"

In [106]:
# demonstrate findall, which returns a list
# of every matching string.
res = re.findall(pattern, sample_text)
res

['Morgan', 'Morgan']

In [103]:
# demonstrate the finditer method.
for match in re.finditer(pattern, sample_text):
    print(match.span())

(0, 6)
(28, 34)


#### Methods Summary
- The `search` method returns a match object holding the value and span of the first match, or `None` if there is no match.
- The `findall` method returns a list of all matching strings, which is helpful for counting the matches in a particular text.
- The `finditer` is a good approach if you want to iterate over the matches. It offers a better route to controlling _per-match_ operations.
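
The three methods summarised above can be compared side by side. A minimal sketch: the name `Norman` is used only as a pattern that does not occur in this text, to show the case where `search` finds nothing.

```python
import re

text = "Morgan called and said yes. Morgan can start on Monday after next"

# search returns a match object for the first hit, or None when nothing matches
hit = re.search("Morgan", text)
miss = re.search("Norman", text)

first_span = hit.span() if hit else None
miss_span = miss.span() if miss else None

# findall returns the matched strings, so len() gives the match count
count = len(re.findall("Morgan", text))

# finditer yields one match object per hit
spans = [m.span() for m in re.finditer("Morgan", text)]

print(first_span, miss_span, count, spans)
```

Checking for `None` before calling `.span()` or `.group()` avoids an `AttributeError` when the pattern is absent.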



In [112]:
sample_text = "Margot called while you were out. Her number is 755-634-9545 or 433-229-4429 if after 17:00"
sample_text

'Margot called while you were out. Her number is 755-634-9545 or 433-229-4429 if after 17:00'

In [109]:
pattern = r'\d{3}-\d{3}-\d{4}'

In [113]:
num = re.search(pattern, sample_text)
num.group()

'755-634-9545'

In [115]:
for match in re.finditer(pattern, sample_text):
    print(match.group(), match.span())

755-634-9545 (48, 60)
433-229-4429 (64, 76)


With a regular expression pattern we can also wrap parts of the pattern in parentheses to form groups, and then pull individual groups out of each match.

In [116]:
pattern = r'(\d{3})-(\d{3})-(\d{4})'

In [117]:
for match in re.finditer(pattern, sample_text):
    print(match.group(), match.span())

755-634-9545 (48, 60)
433-229-4429 (64, 76)


Now let's say we want just the national and area codes here, i.e. the 999-999 part.

In [119]:
for match in re.finditer(pattern, sample_text):
    print(match.group(1), match.group(2))

755 634
433 229


In [132]:
# output the groups in each match instance as a tuple
for match in re.finditer(pattern, sample_text):
    print(match.groups())
    

('755', '634', '9545')
('433', '229', '4429')
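
One detail worth a sketch: once a pattern contains groups, `findall` returns tuples of the groups rather than the full matched strings, while `group(0)` on a match object still gives the whole match. The shortened sample text below is an assumption made to keep the example compact.

```python
import re

text = "Her number is 755-634-9545 or 433-229-4429 if after 17:00"
pattern = r'(\d{3})-(\d{3})-(\d{4})'

# with capture groups in the pattern, findall returns tuples of the groups
tuples = re.findall(pattern, text)

# group(0), or group() with no argument, is always the whole match
wholes = [m.group(0) for m in re.finditer(pattern, text)]

print(tuples)
print(wholes)
```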


The patterns above are fixed: every part must appear exactly as specified. Regex also offers flexibility, with quantifiers that match zero, one or more occurrences. We can even use `|` for alternatives, which operates as a `logical or`.

In [133]:
pattern = r'man|woman'

In [136]:
sample_text = "Batman called Batwoman for help against the badman"

In [137]:
matches = re.findall(pattern, sample_text)
matches

['man', 'woman', 'man']
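
The zero, one or more cases mentioned earlier correspond to the `?`, `*` and `+` quantifiers. A minimal sketch with a made-up sample string:

```python
import re

text = "color colour colouur clr"

# '?' zero or one, '*' zero or more, '+' one or more of the preceding token
zero_or_one = re.findall(r'colou?r', text)
zero_or_more = re.findall(r'colou*r', text)
one_or_more = re.findall(r'colou+r', text)

print(zero_or_one, zero_or_more, one_or_more)
```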

We've identified the matches above, but the result doesn't make much sense because we've only selected parts of words from sample_text. To remedy that we can use wildcards.

In [155]:
# \w* also captures any word characters that precede man
pattern = r'\w*man'

In [156]:
matches = re.findall(pattern, sample_text)
matches

['Batman', 'Batwoman', 'badman']
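
Note that `\w*man` stops at `man` even when the word continues. A hedged sketch of one way to tighten it is the `\b` word-boundary anchor; the extended sample sentence (with `manager`) is an assumption added for illustration.

```python
import re

text = "Batman called Batwoman for help against the badman and his manager"

# without a boundary the pattern also fires on the start of 'manager'
loose = re.findall(r'\w*man', text)

# \b anchors the match at a word edge, so only words ending in 'man' remain
strict = re.findall(r'\w*man\b', text)

print(loose)
print(strict)
```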