# Natural Language Processing Tutorial

Author: Matthew K. MacLeod



Some simple exploration of the text Turn of a Screw by Henry James

https://www.gutenberg.org/ebooks/209

for my friend Wendy Gordon

## Introduction



### NLP resources

https://en.wikipedia.org/wiki/Natural_language_processing

https://en.wikipedia.org/wiki/N-gram

https://en.wikipedia.org/wiki/Katz's_back-off_model

https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation

https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing


### Download data

In [2]:
import os, sys

In [3]:
os.chdir('./data')

In [4]:
os.getcwd()

'/home/matej/develop/mkm_notebooks/data'

## get data

ie the book

In [7]:
%%bash
wget http://www.gutenberg.org/cache/epub/209/pg209.txt
cp pg209.txt turn_of_a_screw.txt
./clean_gutenberg.sh pg209.txt screw

New lines:  4546 screw.txt


--2016-05-13 13:48:37--  http://www.gutenberg.org/cache/epub/209/pg209.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252946 (247K) [text/plain]
Saving to: ‘pg209.txt.1’

     0K .......... .......... .......... .......... .......... 20%  313K 1s
    50K .......... .......... .......... .......... .......... 40%  331K 0s
   100K .......... .......... .......... .......... .......... 60%  342K 0s
   150K .......... .......... .......... .......... .......... 80%  485K 0s
   200K .......... .......... .......... .......... .......   100%  498K=0.7s

2016-05-13 13:48:38 (377 KB/s) - ‘pg209.txt.1’ saved [252946/252946]



#### load python libraries

In [8]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
%matplotlib inline
mpl.rcParams['figure.figsize'] = (12.0, 8.0)

In [10]:
from functools import reduce

In [11]:
sys.path.append('/home/matej/develop/pdapt')

In [12]:
import pdapt_lib.machine_learning.nlp as nlp

## Exploration

### Simple analysis

let's see some vocabulary sizes..this is an extremely simple word count analysis but is part of more complicated approaches. (Starting with Uni-gram models, bag of words).

#### Shakespeare

In [17]:
with open("screw.txt") as myfile:
    screw_string="".join(line.rstrip() for line in myfile)

In [18]:
screw_tokens = nlp.tokenize(screw_string, 1, True)

In [19]:
screw_tuples = sorted(screw_tokens.items(), key=lambda x: -x[1])

In [20]:
# number of unique words in this book
screw_vocab_length = len(screw_tuples)
screw_vocab_length

4255

In [32]:
# total word count in the book
sum(map(lambda x: x[1], screw_tuples))

17005

In [24]:
# most common
screw_tuples[0:50]

[('I', 1766),
 ('little', 184),
 ('mrs', 108),
 ('grose', 105),
 ('did', 99),
 ('time', 89),
 ('know', 86),
 ('said', 76),
 ('oh', 70),
 ('face', 69),
 ('way', 68),
 ('just', 67),
 ('miles', 65),
 ('moment', 63),
 ('saw', 61),
 ('felt', 60),
 ('went', 59),
 ('say', 58),
 ('things', 56),
 ('miss', 56),
 ('eyes', 55),
 ('yes', 55),
 ('came', 54),
 ('looked', 54),
 ('quite', 54),
 ('flora', 53),
 ('took', 49),
 ('like', 47),
 ('thing', 47),
 ('gave', 45),
 ('day', 45),
 ('place', 45),
 ('come', 45),
 ('hand', 44),
 ('child', 44),
 ('long', 43),
 ('course', 43),
 ('seen', 42),
 ('mean', 42),
 ('turned', 42),
 ('straight', 41),
 ('house', 40),
 ('think', 40),
 ('great', 39),
 ('round', 38),
 ('room', 38),
 ('away', 38),
 ('old', 37),
 ('night', 37),
 ('tell', 37)]

In [23]:
# least common and more interesting
screw_tuples[-50:]

[('complications', 1),
 ('dashed', 1),
 ('uneasily', 1),
 ('dealt', 1),
 ('profit', 1),
 ('edifying', 1),
 ('type', 1),
 ('bib', 1),
 ('caretakers', 1),
 ('curls', 1),
 ('inspired', 1),
 ('slighted', 1),
 ('unexpectedness', 1),
 ('assented', 1),
 ('redeemed', 1),
 ('IN', 1),
 ('arch', 1),
 ('VE', 1),
 ('strike', 1),
 ('insurmountable', 1),
 ('husband', 1),
 ('unmentionable', 1),
 ('gaping', 1),
 ('interlocutress', 1),
 ('perceptible', 1),
 ('architectural', 1),
 ('flounder', 1),
 ('shining', 1),
 ('scrappy', 1),
 ('venial', 1),
 ('rosily', 1),
 ('occasionally', 1),
 ('bravery', 1),
 ('perplexed', 1),
 ('wrest', 1),
 ('judicial', 1),
 ('stars', 1),
 ('suggest', 1),
 ('twitter', 1),
 ('determine', 1),
 ('lamp', 1),
 ('rightly', 1),
 ('bewilderedly', 1),
 ('breathing', 1),
 ('discussion', 1),
 ('reentered', 1),
 ('astir', 1),
 ('finds', 1),
 ('unbruised', 1),
 ('float', 1)]

In [34]:
# get number of occurances of turn
turn_count = list(filter(lambda x: x[0] == 'turn', screw_tuples))
turn_count

[('turn', 22)]

In [29]:
# get number of occurances of screw
screw_count = list(filter(lambda x: x[0] == 'screw', screw_tuples))
screw_count

[('screw', 2)]

## Context

Often more important than the exact word is its context..let's investigate

In [46]:
screw_trigrams = nlp.tokenize(screw_string, 3, False)
screw_trigram_tuples = sorted(screw_trigrams.items(), key=lambda x: -x[1])

In [47]:
# most common trigrams
screw_trigram_tuples[0:20]

[('I do not', 33),
 ('it was a', 21),
 ('one of the', 21),
 ('that I had', 19),
 ('it was not', 18),
 ('there was a', 17),
 ('I ca not', 17),
 ('at any rate', 17),
 ('that I was', 16),
 ('on the spot', 15),
 ('mrs grose s', 14),
 ('I had been', 14),
 ('in the world', 14),
 ('I did not', 13),
 ('but it was', 13),
 ('I had seen', 13),
 ('as I had', 13),
 ('I had not', 13),
 ('I should have', 13),
 ('do you mean', 13)]

In [48]:
# least common trigrams
screw_trigram_tuples[-20:]

[('peril when do', 1),
 ('and just so', 1),
 ('given me the', 1),
 ('easy and he', 1),
 ('she dropped with', 1),
 ('shrouded as I', 1),
 ('in me the', 1),
 ('you terrible miserable', 1),
 ('saw a great', 1),
 ('the acute prevision', 1),
 ('that would open', 1),
 ('miss I would', 1),
 ('HER she is', 1),
 ('was hideous at', 1),
 ('and not too', 1),
 ('the ravage of', 1),
 ('concerned with my', 1),
 ('arms they KNOW', 1),
 ('it only in', 1),
 ('KNEW how I', 1)]

In [49]:
# finally let's print out trigrams which contain turn
list(filter(lambda x: 'turn' == x[0].split(" ")[1], screw_trigram_tuples))

[('the turn of', 2),
 ('another turn of', 2),
 ('headmasters turn infallibly', 1),
 ('her turn pale', 1),
 ('to turn over', 1),
 ('belonged turn on', 1),
 ('great turn of', 1),
 ('dreadful turn to', 1),
 ('to turn you', 1),
 ('it turn as', 1),
 ('chance turn on', 1),
 ('to turn my', 1),
 ('did turn but', 1),
 ('up turn my', 1),
 ('a turn into', 1),
 ('the turn my', 1),
 ('and turn cold', 1),
 ('the turn mistaken', 1),
 ('that turn at', 1),
 ('in turn within', 1)]

In [50]:
screw_pentigrams = nlp.tokenize(screw_string, 5, False)
screw_pentigram_tuples = sorted(screw_pentigrams.items(), key=lambda x: -x[1])
list(filter(lambda x: 'turn' == x[0].split(" ")[2], screw_pentigram_tuples))

[('at the turn of a', 2),
 ('effect another turn of the', 1),
 ('now to turn over was', 1),
 ('enough to turn you out', 1),
 ('a dreadful turn to be', 1),
 ('this to turn my back', 1),
 ('thing up turn my back', 1),
 ('which in turn within a', 1),
 ('only another turn of the', 1),
 ('saw it turn as I', 1),
 ('slope the turn mistaken at', 1),
 ('of that turn at ME', 1),
 ('made her turn pale intention', 1),
 ('the chance turn on me', 1),
 ('with the turn my matters', 1),
 ('take a turn into the', 1),
 ('breath and turn cold he', 1),
 ('sordid headmasters turn infallibly to', 1),
 ('once belonged turn on receipt', 1),
 ('companion did turn but the', 1),
 ('the great turn of the', 1)]

In [53]:
screw_heptigrams = nlp.tokenize(screw_string, 7, False)
screw_heptigram_tuples = sorted(screw_heptigrams.items(), key=lambda x: -x[1])
list(filter(lambda x: 'turn' == x[0].split(" ")[3], screw_heptigram_tuples))

[('was enough to turn you out for', 1),
 ('myself at the turn of a page', 1),
 ('the effect another turn of the screw', 1),
 ('my breath and turn cold he was', 1),
 ('instead of that turn at ME an', 1),
 ('wounded which in turn within a few', 1),
 ('has the chance turn on me yes', 1),
 ('that with the turn my matters had', 1),
 ('had once belonged turn on receipt of', 1),
 ('at this to turn my back on', 1),
 ('icy slope the turn mistaken at night', 1),
 ('front only another turn of the screw', 1),
 ('I made her turn pale intention to', 1),
 ('whole thing up turn my back and', 1),
 ('could take a turn into the grounds', 1),
 ('over the great turn of the staircase', 1),
 ('stupid sordid headmasters turn infallibly to the', 1),
 ('my companion did turn but the inquiry', 1),
 ('there at the turn of a path', 1),
 ('definitely saw it turn as I might', 1),
 ('had now to turn over was simply', 1),
 ('what a dreadful turn to be sure', 1)]

In [56]:
screw_nonigrams = nlp.tokenize(screw_string, 9, False)
screw_nonigram_tuples = sorted(screw_nonigrams.items(), key=lambda x: -x[1])
list(filter(lambda x: 'turn' == x[0].split(" ")[4], screw_nonigram_tuples))

[('fair front only another turn of the screw of', 1),
 ('he has the chance turn on me yes I', 1),
 ('only instead of that turn at ME an expression', 1),
 ('it had once belonged turn on receipt of an', 1),
 ('it was enough to turn you out for never', 1),
 ('rather wounded which in turn within a few seconds', 1),
 ('even stupid sordid headmasters turn infallibly to the vindictive', 1),
 ('gives the effect another turn of the screw what', 1),
 ('I definitely saw it turn as I might have', 1),
 ('intention I made her turn pale intention to get', 1),
 ('the icy slope the turn mistaken at night and', 1),
 ('this my companion did turn but the inquiry she', 1),
 ('disapproval what a dreadful turn to be sure miss', 1),
 ('I had now to turn over was simply and', 1),
 ('the whole thing up turn my back and retreat', 1),
 ('I could take a turn into the grounds and', 1),
 ('found myself at the turn of a page and', 1),
 ('appear there at the turn of a path and', 1),
 ('merely at this to turn my back o

note see file mkm_notebooks/license.txt for lgpl license of this notebook.