# NLP Introduction

![nlp](https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2019/01/NLP-Technology-in-Healthcare.jpg)

# 1.0.0 - Setup and course info

These are the notes and exercises from the following Udemy course: https://www.udemy.com/course/nlp-natural-language-processing-with-python

# 2.0.0 - Text formatting basics

## 2.0.1 - Formatted strings with f-strings and the format method
Before Python 3.6, the `.format` method was the common way to format strings. This repository targets Python 3.7 and above, so we can use formatted string literals, commonly known in the Python world as `f-strings`.

In [29]:
# an example of the .format method
v1 = 'One'
v2 = 'Two'
v3 = 'Three'
print("{}, {}, {}".format(v1,v2,v3))

One, Two, Three


In [30]:
# an f-string example 
v1 = 'One'
v2 = 'Two'
v3 = 'Three'
print(f"{v1}, {v2}, {v3}")

One, Two, Three


Expressions can also be evaluated inside the braces of an f-string, for example a dictionary lookup.

In [6]:
# create an example dictionary
d = { "id": 12345, "ref": 3335577, "name": "Ed" }

In [7]:
print(f"Employee: {d['name']} has id: {d['id']} and reference: {d['ref']}")

Employee: Ed has id: 12345 and reference: 3335577
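The braces of an f-string accept other Python expressions as well, not only lookups. The following is a minimal sketch that reuses the example dictionary from above; the `pages` list and the result variable names are made up for illustration.

```python
# the dictionary mirrors the example above
d = {"id": 12345, "ref": 3335577, "name": "Ed"}
pages = [300, 550, 330]  # hypothetical values, for illustration only

# a method call inside the braces
upper_name = f"Name in capitals: {d['name'].upper()}"

# arithmetic inside the braces
next_id = f"Next id: {d['id'] + 1}"

# a function call inside the braces
total = f"Total pages: {sum(pages)}"

print(upper_name)
print(next_id)
print(total)
```

Note that the dictionary keys use single quotes because the f-string itself is delimited with double quotes.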


## 2.0.2 - Formatting structures with f-strings
Let's look at formatting some structured data so that it is easier to read.

In [8]:
data = [('Author', 'Topic', 'Pages'), 
        ('A. Thakur', 'Approaching ML', 300), 
        ('J. Howard', 'fastai/Pytorch', 550 ), 
        ('D. Spiegelhalter', 'Art of Stats', 330)]

In [9]:
# show a poorly formatted (unformatted) output of the table
for author, topic, pages in data:
    print(f"{author} {topic} {pages}")

Author Topic Pages
A. Thakur Approaching ML 300
J. Howard fastai/Pytorch 550
D. Spiegelhalter Art of Stats 330


In [10]:
# show an example with fixed column widths and right-aligned page counts
for author, topic, pages in data:
    print(f"{author:{16}} {topic:{30}} {pages:>{6}}")

Author           Topic                           Pages
A. Thakur        Approaching ML                    300
J. Howard        fastai/Pytorch                    550
D. Spiegelhalter Art of Stats                      330
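The part after the colon is a format specification made up of an optional fill character, an alignment (`<` left, `>` right, `^` centre) and a width. A literal width can be written directly, as in `{author:16}`; the nested braces are only needed when the width comes from a variable. The sketch below shows a few combinations; the `width` variable and the result names are made up for illustration.

```python
author, topic, pages = ('A. Thakur', 'Approaching ML', 300)

left = f"{author:<12}|"      # left aligned in 12 characters
right = f"{author:>12}|"     # right aligned in 12 characters
centred = f"{topic:^16}|"    # centred in 16 characters
filled = f"{pages:.>8}"      # right aligned, padded with dots

width = 12                   # hypothetical width held in a variable
nested = f"{author:<{width}}|"

for text in (left, right, centred, filled, nested):
    print(text)
```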


## 2.0.3 - Formatting date and time data

We may want, or need, to format date and time data. We can use http://strftime.org as a reference for the format codes that match our formatting intentions.

In [11]:
from datetime import datetime

# declare a date
today = datetime(year=2020, month=9, day=19)

print(f"{today}")

2020-09-19 00:00:00


In [16]:
# some selected field formats applicable to the day
print(f"{today:%a}")
print(f"{today:%A}")
print(f"{today:%w}")

# some selected field formats applicable to the month
print(f"{today:%b}")
print(f"{today:%B}")
print(f"{today:%m}")

Sat
Saturday
6
Sep
September
09
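Several format codes can be combined in one format specification, together with literal text such as spaces and dashes. This is a minimal sketch using the same date as above; the variable names are made up for illustration, and month names depend on the locale in use.

```python
from datetime import datetime

today = datetime(year=2020, month=9, day=19)

# day, full month name and year in one specification
long_form = f"{today:%d %B %Y}"

# a numeric year-month-day form
iso_form = f"{today:%Y-%m-%d}"

# the f-string form gives the same result as calling strftime directly
same_as_strftime = today.strftime("%d %B %Y")

print(long_form)
print(iso_form)
```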


# 2.1.0 - Working with Text files in Python

In [19]:
# Jupyter cell magic for quickly writing a test file

In [31]:
%%writefile sampletext.txt
This is a sample text file for testing.
This is the second line of the file
...and this is the third.

Overwriting sampletext.txt


## 2.1.1 - Opening and reading files

In [32]:
# Python's built-in open function opens a file.
myfile = open('sampletext.txt')

In [33]:
# reads the file in its entirety
myfile.read()

'This is a sample text file for testing.\nThis is the second line of the file\n...and this is the third.\n'

Calling `read()` more than once will not return the content again, because the file object keeps track of a cursor position. After a call to `read()` the cursor is at the end of the file, so a subsequent call returns an empty string. To move the cursor back to the beginning of the file we can call `seek(0)`, after which `read()` returns the full content again.
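The cursor behaviour can be seen in a small self-contained sketch. It writes its own throwaway file to a temporary directory so that it does not depend on `sampletext.txt`; the file name and variable names are made up for illustration.

```python
import os
import tempfile

# write a small throwaway file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "cursor_demo.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

f = open(path)
first_read = f.read()    # whole file; the cursor is now at the end
second_read = f.read()   # nothing left to read, so an empty string
f.seek(0)                # move the cursor back to the start
third_read = f.read()    # the whole file again
f.close()

print(repr(first_read))
print(repr(second_read))
print(repr(third_read))
```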

In [34]:
# reset the file cursor 
myfile.seek(0)

# assign the contents of a file to a variable 
content = myfile.read()

# close the file. We should always close a file once we have finished
# with it or no longer need it to be open. Forgetting to clean up here
# can cause errors elsewhere if the file is required or modified by
# other scripts or programs.
myfile.close()

# note that we have opened, grabbed and closed the file but our variable lives 
# on and we can work with the content of a file without it needing to be open. 
print(content)

This is a sample text file for testing.
This is the second line of the file
...and this is the third.
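A common way to make sure a file is always closed is to open it in a `with` block, which closes the file when the block ends, even if an error occurs inside it. The sketch below is self-contained and writes its own throwaway file; the file name is made up for illustration.

```python
import os
import tempfile

# write a small throwaway file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, "w") as f:
    f.write("This is a sample text file for testing.\n")

# the file is closed automatically at the end of the with block
with open(path) as f:
    content = f.read()

print(f.closed)   # the file object reports that it is closed
print(content)    # the variable is still available after closing
```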



## 2.1.2 - Reading files line by line 

When working with files, we will often want to process them line by line. The `readlines()` method reads the whole file and returns a list with one string per line.

In [50]:
myfile = open('sampletext.txt')

In [51]:
mylines = myfile.readlines()

In [52]:
mylines

['This is a sample text file for testing.\n',
 'This is the second line of the file\n',
 '...and this is the third.\n']

Now that we have a variable holding the lines of the file, we can demonstrate some simple operations on it.

In [53]:
# Each line ends with a newline character. If we print the lines as
# they are, they will be separated by an empty line, because print
# also ends its output with a newline by default. To show the lines
# as they appear in the file, we iterate over them and drop the last
# character of each line, which we know is a newline in this file.

for line in mylines:
    print(line[:-1])

This is a sample text file for testing.
This is the second line of the file
...and this is the third.
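Slicing off the last character relies on every line ending with a newline. If the last line of a file has no trailing newline, the slice removes a real character instead. A more defensive alternative is `rstrip('\n')`, sketched below with a made-up list of lines.

```python
# hypothetical lines; the last one has no trailing newline
lines = ["first line\n", "last line without newline"]

# slicing always drops the final character, whatever it is
sliced = [line[:-1] for line in lines]

# rstrip only removes trailing newline characters when present
stripped = [line.rstrip("\n") for line in lines]

print(sliced)
print(stripped)
```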


In [64]:
# we can do other, more contrived operations too, such as showing
# only every other word

for line in mylines:
    line = line[:-1]
    words = line.split()
    subset = [x for idx, x in enumerate(words) if idx % 2 == 0]
    print(f"File content     : {line}")
    print(f"Processed content: {subset}")
    

File content     : This is a sample text file for testing.
Processed content: ['This', 'a', 'text', 'for']
File content     : This is the second line of the file
Processed content: ['This', 'the', 'line', 'the']
File content     : ...and this is the third.
Processed content: ['...and', 'is', 'third.']


In [37]:
# close the file now that we have finished with it
myfile.close()
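`readlines()` loads every line into memory at once. For larger files, an alternative is to loop over the file object itself, which yields one line at a time. The sketch below is self-contained: it writes the same three lines to a throwaway file (the file name is made up) and counts the words on each line.

```python
import os
import tempfile

# recreate the sample content in a throwaway file
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as f:
    f.write("This is a sample text file for testing.\n"
            "This is the second line of the file\n"
            "...and this is the third.\n")

word_counts = []
with open(path) as f:
    for line in f:  # the file object yields one line at a time
        word_counts.append(len(line.split()))

print(word_counts)
```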