# Natural Language Processing Tutorial

Author: Matthew K. MacLeod



A simple exploration of the text of The Turn of the Screw by Henry James

https://www.gutenberg.org/ebooks/209

for my friend Wendy Gordon

## Introduction



### NLP resources

https://en.wikipedia.org/wiki/Natural_language_processing

https://en.wikipedia.org/wiki/N-gram

https://en.wikipedia.org/wiki/Katz's_back-off_model

https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation

https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing


### Download data

In [2]:
import os, sys

In [3]:
os.chdir('./data')

In [4]:
os.getcwd()

'/home/matej/develop/mkm_notebooks/data'

## get data

i.e., the text of the book, from Project Gutenberg

In [7]:
%%bash
wget http://www.gutenberg.org/cache/epub/209/pg209.txt
cp pg209.txt turn_of_a_screw.txt
./clean_gutenberg.sh pg209.txt screw

New lines:  4546 screw.txt


--2016-05-13 13:48:37--  http://www.gutenberg.org/cache/epub/209/pg209.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252946 (247K) [text/plain]
Saving to: ‘pg209.txt.1’

     0K .......... .......... .......... .......... .......... 20%  313K 1s
    50K .......... .......... .......... .......... .......... 40%  331K 0s
   100K .......... .......... .......... .......... .......... 60%  342K 0s
   150K .......... .......... .......... .......... .......... 80%  485K 0s
   200K .......... .......... .......... .......... .......   100%  498K=0.7s

2016-05-13 13:48:38 (377 KB/s) - ‘pg209.txt.1’ saved [252946/252946]
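The script clean_gutenberg.sh is not shown in this notebook, so what it does is unknown here. As a rough sketch only, assuming that it strips the Project Gutenberg header and license footer, a Python equivalent could look like the following. The function name `strip_gutenberg_boilerplate` and the sample text are hypothetical; the `*** START OF` and `*** END OF` marker lines are the ones Project Gutenberg plain-text files use.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1  # the body begins on the line after the start marker
        elif line.startswith("*** END OF"):
            end = i        # the body ends on the line before the end marker
            break
    return "\n".join(lines[start:end]).strip()


# Hypothetical miniature file, used only to exercise the function.
sample = "\n".join([
    "Project Gutenberg header",
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***",
    "first line of the story",
    "second line of the story",
    "*** END OF THIS PROJECT GUTENBERG EBOOK ***",
    "License text",
])
print(strip_gutenberg_boilerplate(sample))
```

If a marker is missing, the sketch falls back to keeping the text from the beginning or through the end of the file.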



#### load python libraries

In [8]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
%matplotlib inline
mpl.rcParams['figure.figsize'] = (12.0, 8.0)

In [10]:
from functools import reduce

In [11]:
sys.path.append('/home/matej/develop/pdapt')

In [12]:
import pdapt_lib.machine_learning.nlp as nlp

## Exploration

### Simple analysis

Let's start with vocabulary sizes. This is an extremely simple word-count analysis, but it underlies more complicated approaches, starting with unigram (bag-of-words) models.
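To make the idea concrete, here is a minimal sketch of a unigram count using only the standard library. It is not the implementation behind `nlp.tokenize`, which is not shown here; the function name `unigram_counts`, the regular expression, and the tiny stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; an assumption, not the list used by nlp.tokenize.
STOP_WORDS = {"the", "of", "a", "and", "to", "was", "it"}


def unigram_counts(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, drop stop words, and count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


counts = unigram_counts(
    "The turn of the screw was a turn of the screw, and it was a great turn."
)
print(counts.most_common(2))  # [('turn', 3), ('screw', 2)]
```

A `Counter` is a dict subclass, so the sorting and filtering steps used in the cells below would work on it unchanged.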

#### The Turn of the Screw

In [17]:
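with open("screw.txt") as myfile:
    # Note: joining with "" fuses the last word of each line with the first
    # word of the next line; " ".join(...) would keep them apart.
    screw_string="".join(line.rstrip() for line in myfile)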

In [18]:
screw_tokens = nlp.tokenize(screw_string, 1, True)

In [19]:
screw_tuples = sorted(screw_tokens.items(), key=lambda x: -x[1])

In [20]:
# number of unique words in this book
screw_vocab_length = len(screw_tuples)
screw_vocab_length

4255

In [32]:
# total count of the tokens kept by nlp.tokenize (not necessarily every word in the book)
sum(map(lambda x: x[1], screw_tuples))

17005
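The two numbers above can be combined into a type-token ratio, a common rough measure of lexical variety. This is a small worked calculation using the values reported by the cells above; the variable names are chosen here for illustration.

```python
vocab_size = 4255     # unique tokens, as reported above
total_tokens = 17005  # total tokens, as reported above

# Type-token ratio: the fraction of tokens that are distinct words.
type_token_ratio = vocab_size / total_tokens
print(round(type_token_ratio, 4))  # 0.2502
```

In other words, roughly one in four of the counted tokens is a distinct word.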

In [24]:
# most common
screw_tuples[0:50]

[('I', 1766),
 ('little', 184),
 ('mrs', 108),
 ('grose', 105),
 ('did', 99),
 ('time', 89),
 ('know', 86),
 ('said', 76),
 ('oh', 70),
 ('face', 69),
 ('way', 68),
 ('just', 67),
 ('miles', 65),
 ('moment', 63),
 ('saw', 61),
 ('felt', 60),
 ('went', 59),
 ('say', 58),
 ('things', 56),
 ('miss', 56),
 ('eyes', 55),
 ('yes', 55),
 ('came', 54),
 ('looked', 54),
 ('quite', 54),
 ('flora', 53),
 ('took', 49),
 ('like', 47),
 ('thing', 47),
 ('gave', 45),
 ('day', 45),
 ('place', 45),
 ('come', 45),
 ('hand', 44),
 ('child', 44),
 ('long', 43),
 ('course', 43),
 ('seen', 42),
 ('mean', 42),
 ('turned', 42),
 ('straight', 41),
 ('house', 40),
 ('think', 40),
 ('great', 39),
 ('round', 38),
 ('room', 38),
 ('away', 38),
 ('old', 37),
 ('night', 37),
 ('tell', 37)]

In [23]:
# least common and more interesting
screw_tuples[-50:]

[('complications', 1),
 ('dashed', 1),
 ('uneasily', 1),
 ('dealt', 1),
 ('profit', 1),
 ('edifying', 1),
 ('type', 1),
 ('bib', 1),
 ('caretakers', 1),
 ('curls', 1),
 ('inspired', 1),
 ('slighted', 1),
 ('unexpectedness', 1),
 ('assented', 1),
 ('redeemed', 1),
 ('IN', 1),
 ('arch', 1),
 ('VE', 1),
 ('strike', 1),
 ('insurmountable', 1),
 ('husband', 1),
 ('unmentionable', 1),
 ('gaping', 1),
 ('interlocutress', 1),
 ('perceptible', 1),
 ('architectural', 1),
 ('flounder', 1),
 ('shining', 1),
 ('scrappy', 1),
 ('venial', 1),
 ('rosily', 1),
 ('occasionally', 1),
 ('bravery', 1),
 ('perplexed', 1),
 ('wrest', 1),
 ('judicial', 1),
 ('stars', 1),
 ('suggest', 1),
 ('twitter', 1),
 ('determine', 1),
 ('lamp', 1),
 ('rightly', 1),
 ('bewilderedly', 1),
 ('breathing', 1),
 ('discussion', 1),
 ('reentered', 1),
 ('astir', 1),
 ('finds', 1),
 ('unbruised', 1),
 ('float', 1)]

In [34]:
# get number of occurrences of turn
turn_count = list(filter(lambda x: x[0] == 'turn', screw_tuples))
turn_count

[('turn', 22)]

In [29]:
# get number of occurrences of screw
screw_count = list(filter(lambda x: x[0] == 'screw', screw_tuples))
screw_count

[('screw', 2)]
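Since `screw_tokens` exposes `.items()`, it appears to be a dict-like mapping from word to count. If that assumption holds, a direct lookup is simpler than filtering the sorted list. The sketch below uses a toy dictionary, `token_counts`, as a stand-in, filled with the two counts reported above.

```python
# Toy stand-in for screw_tokens, holding the two counts reported above.
token_counts = {"turn": 22, "screw": 2}

# .get returns the count, or the default 0 when the word is absent from the mapping.
print(token_counts.get("turn", 0))       # 22
print(token_counts.get("xylophone", 0))  # 0
```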

## Context

Often the context of a word matters more than the word itself. Let's investigate with trigrams.

In [36]:
screw_trigrams = nlp.tokenize(screw_string, 3, True)
screw_trigram_tuples = sorted(screw_trigrams.items(), key=lambda x: -x[1])
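For readers without access to `nlp.tokenize`, the following is a minimal sketch of how n-grams can be counted from a list of words. The helper name `ngram_counts` is hypothetical and this is not necessarily how the library does it. Judging by the output below, the library appears to build its trigrams after stop words have been removed, so a trigram's words are not always adjacent in the original sentence.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words, joined by single spaces."""
    # zip over n shifted copies of the list yields each window of n words.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


words = "said mrs grose I said mrs grose".split()
trigrams = ngram_counts(words, 3)
print(trigrams.most_common(1))  # [('said mrs grose', 2)]
```

With `n=1` the same helper reduces to the plain word count used earlier.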

In [38]:
# most common trigrams
screw_trigram_tuples[0:20]

[('mrs grose I', 10),
 ('I felt I', 9),
 ('I know I', 7),
 ('I I I', 6),
 ('I saw I', 6),
 ('mrs grose looked', 5),
 ('I did know', 5),
 ('said mrs grose', 5),
 ('I said I', 5),
 ('dear little miles', 5),
 ('I say I', 4),
 ('I think I', 4),
 ('I suppose I', 4),
 ('oh yes I', 4),
 ('I did I', 4),
 ('I I felt', 4),
 ('know I know', 4),
 ('miss jessel miss', 4),
 ('I mrs grose', 3),
 ('little gentleman I', 3)]

In [39]:
# least common trigrams
screw_trigram_tuples[-20:]

[('I mean face', 1),
 ('treated possession happened', 1),
 ('bed I left', 1),
 ('groan thank god', 1),
 ('pupils mentioned looked', 1),
 ('horrible letter locked', 1),
 ('business practically settled', 1),
 ('hour spoken passed', 1),
 ('thickness ice formation', 1),
 ('quite understand feeling', 1),
 ('theory I accordingly', 1),
 ('breath terror minutes', 1),
 ('saw outside reached', 1),
 ('stared short retreated', 1),
 ('I determined proof', 1),
 ('wild irrelevance fail', 1),
 ('I left burning', 1),
 ('interview ashamed having', 1),
 ('bear poor woman', 1),
 ('cruel I like', 1)]

In [42]:
# finally, let's print the trigrams whose middle word is turn
list(filter(lambda x: 'turn' == x[0].split(" ")[1], screw_trigram_tuples))

[('merely turn abandon', 1),
 ('slope turn mistaken', 1),
 ('did turn inquiry', 1),
 ('I turn grounds', 1),
 ('thing turn retreat', 1),
 ('I turn pale', 1),
 ('chance turn yes', 1),
 ('effect turn screw', 1),
 ('belonged turn receipt', 1),
 ('great turn staircase', 1),
 ('fair turn screw', 1),
 ('thought turn truly', 1),
 ('I turn page', 1),
 ('ones turn matters', 1),
 ('instead turn ME', 1),
 ('headmasters turn infallibly', 1),
 ('dreadful turn sure', 1),
 ('appear turn path', 1),
 ('saw turn I', 1),
 ('I turn simply', 1),
 ('breath turn cold', 1),
 ('wounded turn seconds', 1)]
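The filter above only matches trigrams with the target word in the middle position. A small variation matches the word in any position. The sketch below runs on a hand-picked sample of the tuples shown in the outputs above; the helper name `containing` is hypothetical.

```python
# A few (trigram, count) pairs copied from the outputs above.
trigram_tuples = [
    ("effect turn screw", 1),
    ("fair turn screw", 1),
    ("I turn pale", 1),
    ("mrs grose I", 10),
]


def containing(word, tuples):
    """Keep the (trigram, count) pairs whose trigram has `word` in any position."""
    return [(gram, count) for gram, count in tuples if word in gram.split(" ")]


print(containing("screw", trigram_tuples))
# [('effect turn screw', 1), ('fair turn screw', 1)]
```

Splitting on spaces before testing membership avoids matching substrings, so "turn" would not match a trigram containing "turned".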

Note: see the file mkm_notebooks/license.txt for the LGPL license of this notebook.