# Introduction to NLP
Alexander O'Connor &lt;Alexander.OConnor@dcu.ie&gt;

This notebook demonstrates some of the basics of Python for natural language processing.
To run it locally, install IPython and nltk, and download this notebook.

If you want to know more, I recommend this: <http://www.nltk.org/book>

A handy reference file: <https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf>

# Outline

1. Strings and characters&mdash;how text is represented in Python
2. From body text to words&mdash;tokenisation &amp; segmentation 
3. Word Frequencies &amp; Simple Statistics
4. Topic Models
5. Parsing

In [None]:
#Library imports and setup -- this code brings in the nltk library, and other supporting utilities.
import nltk


# Strings and Characters


In [None]:
#change the text between the quotation marks to try something else.
test_string = u"Ebenezer unexpectedly bagged two tranquil aardvarks with his jiffy vacuum cleaner."

In [None]:
#print it out
test_string

In [None]:
#show the numeric code point of each character
[ord(char) for char in test_string]
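
As a complement, here is a minimal sketch of going the other way, using only the standard library. It assumes Python 3, and the string `sample` is made up for illustration: `chr` turns a code point back into a character, and `encode` shows the bytes that would actually be stored.

```python
# Assumes Python 3, where every str is a sequence of Unicode code points.
sample = "caf\u00e9"  # hypothetical example text: "cafe" with an accented e

# ord() maps each character to its code point ...
code_points = [ord(char) for char in sample]
print(code_points)

# ... and chr() maps each code point back to a character.
round_trip = "".join(chr(point) for point in code_points)
print(round_trip == sample)

# Encoding turns the code points into bytes; in UTF-8 the accented
# letter takes two bytes, so there are more bytes than characters.
utf8_bytes = sample.encode("utf-8")
print(len(sample), len(utf8_bytes))
```

The difference between characters and bytes matters as soon as the text contains anything outside plain ASCII.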

In [None]:
#simple string operations
string_one = u"hello"
string_two = u"world"

#stick them together (concatenate)
print(string_one + string_two)
#replace a letter
print(string_one.replace('e','E'))
#capitalise the first letter of the combined string (the rest is lower-cased)
print((string_one + ' ' + string_two).capitalize())
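
A short sketch of the case-changing methods, since they are easy to mix up. The variable `greeting` is a made-up example; the methods themselves are standard `str` methods.

```python
greeting = "hello world"  # hypothetical example text

# capitalize(): upper-cases the first character of the whole string
# and lower-cases everything else.
print(greeting.capitalize())

# title(): upper-cases the first letter of every word.
print(greeting.title())

# upper() and lower() change the case of the whole string, which is
# handy for case-insensitive comparisons.
print(greeting.upper() == "HELLO WORLD")
```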

# From body text to words&mdash;tokenisation &amp; segmentation 

In [None]:
#simplest possible tokenisation: split on spaces
space_split = "An example sentence"
#split it
word_list = space_split.split(' ')
#print the list
print(word_list)
#just the *second* word
print(word_list[1])

In [None]:
punctuation_example = "This example sentence has a full stop, and a comma."
print(punctuation_example.split(' '))

In [None]:
para_example = "This is two sentences. Should we take this into account?"
print(para_example.split(' '))
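
Splitting on spaces leaves punctuation stuck to the neighbouring word (`stop,` and `comma.`). One possible way around this, sketched here with the standard `re` module rather than nltk, is to match runs of word characters and single punctuation marks separately. The pattern is my own simple choice and will not handle every case (contractions and hyphenated words are split, for instance).

```python
import re

punctuation_example = "This example sentence has a full stop, and a comma."

# Splitting on spaces leaves punctuation glued to the neighbouring word.
print(punctuation_example.split(' '))

# \w+ matches a run of letters/digits; [^\w\s] matches any single
# character that is neither a word character nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", punctuation_example)
print(tokens)
```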

# Word Frequencies and Simple Statistics

In [None]:
complex_example = "Mr. Benn is an ordinary fellow, living an ordinary life in an ordinary suburban house, at Number 52, Festive Road. One day, Mr. Benn receives an invitation to a fancy dress party, and so, donning his bowler hat, he sets off to find a costume to wear. Unable to find a suitable outfit in the usual shops, he turns down a small lane and finds a shop filled with strange and unusual costumes. Inside, Mr Benn asks the fez-wearing shopkeeper if he can try on a suit of red armour; he then enters the changing room, puts on the outfit, and walks through another door... and suddenly finds himself transported back to medieval times. It is the first of many amazing and extraordinary adventures for Mr. Benn"

In [None]:
#Split the text into sentences
print(nltk.sent_tokenize(complex_example))
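
To see why a trained sentence tokeniser is useful, here is a deliberately naive splitter built with the standard `re` module. It breaks after every full stop, question mark or exclamation mark that is followed by whitespace, so it trips over abbreviations such as `Mr.`. The short text is a made-up example.

```python
import re

# Hypothetical two-sentence text containing an abbreviation.
text = "Mr. Benn lives at Festive Road. One day he got an invitation."

# Split on whitespace that directly follows '.', '?' or '!'.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

for sentence in naive_sentences:
    print(sentence)
```

The rule produces three pieces instead of two, because it treats the full stop in `Mr.` as the end of a sentence.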

In [None]:
#Split the text into words
print(nltk.word_tokenize(complex_example))

In [None]:
# What does this do?
print(nltk.word_tokenize(nltk.sent_tokenize(complex_example)[0]))

In [None]:
# Use a frequency distribution 
distribution = nltk.FreqDist(nltk.word_tokenize(complex_example))

In [None]:
#ten most common words
print(distribution.most_common(10))

In [None]:
# print some bigrams
print(list(nltk.bigrams(nltk.word_tokenize(complex_example)))[0:5])
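
For intuition about what a frequency distribution and a bigram list contain, here is a rough standard-library counterpart using `collections.Counter` and `zip`. It is a sketch for comparison, not how nltk does it internally, and the token list is made up.

```python
from collections import Counter

# Hypothetical pre-tokenised text, lower-cased for simplicity.
tokens = "the cat sat on the mat and the dog sat on the cat".split()

# Counter maps each token to the number of times it occurs.
counts = Counter(tokens)
print(counts.most_common(2))

# Pairing the list with itself shifted by one gives adjacent word pairs.
pairs = list(zip(tokens, tokens[1:]))
print(pairs[0:3])

# The same counting works on the pairs.
pair_counts = Counter(pairs)
print(pair_counts[("sat", "on")])
```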

In [None]:
# Some longer text

pride = '''It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered the rightful property
of some one or other of their daughters.

"My dear Mr. Bennet," said his lady to him one day, "have you heard that
Netherfield Park is let at last?"

Mr. Bennet replied that he had not.

"But it is," returned she; "for Mrs. Long has just been here, and she
told me all about it."

Mr. Bennet made no answer.

"Do you not want to know who has taken it?" cried his wife impatiently.

"You want to tell me, and I have no objection to hearing it."

This was invitation enough.'''



In [None]:
alice = '''Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!
Oh dear! I shall be late!' (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural); but when the Rabbit actually TOOK A WATCH
OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,
Alice started to her feet, for it flashed across her mind that she had
never before seen a rabbit with either a waistcoat-pocket, or a watch
to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large
rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how
in the world she was to get out again.

The rabbit-hole went straight on like a tunnel for some way, and then
dipped suddenly down, so suddenly that Alice had not a moment to think
about stopping herself before she found herself falling down a very deep
well.'''

In [None]:
pride_words = nltk.word_tokenize(pride)
alice_words = nltk.word_tokenize(alice)

pride_freq = nltk.FreqDist(pride_words)
alice_freq = nltk.FreqDist(alice_words)

print(pride_freq.most_common(5))
print(alice_freq.most_common(5))

## Punctuation and Stopwords 

In [None]:
punctuation = [',','.','\'','"','!',';','(',')','``','\'\'']
alice_nopunc = [word for word in alice_words if word not in punctuation]
pride_nopunc = [word for word in pride_words if word not in punctuation]

In [None]:
print(alice_nopunc)

In [None]:
print(pride_nopunc)

In [None]:
pride_freq = nltk.FreqDist(pride_nopunc)
alice_freq = nltk.FreqDist(alice_nopunc)

In [None]:
print(pride_freq.most_common(5))
print(alice_freq.most_common(5))
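
The hand-written punctuation list has no entry for `:` or `?`, both of which occur in the sample texts. One alternative, sketched below, is to drop any token made up entirely of characters from `string.punctuation`. The helper `is_punctuation` and the token list are my own illustrations, and this only covers ASCII punctuation.

```python
import string

# Hypothetical token list of the kind a tokeniser returns.
tokens = ['Oh', 'dear', '!', 'I', 'shall', 'be', 'late', '!', "''",
          'rabbit-hole', '...']

def is_punctuation(token):
    """Return True if every character in the token is ASCII punctuation."""
    return all(char in string.punctuation for char in token)

# Keep only the tokens that contain at least one non-punctuation character.
no_punct = [token for token in tokens if not is_punctuation(token)]
print(no_punct)
```

Hyphenated words such as `rabbit-hole` survive, because they contain letters as well as punctuation.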

In [None]:
from nltk.corpus import stopwords
stops = stopwords.words('english')

print(sorted(stops)[0:5])

In [None]:
alice_nostop = [word for word in alice_nopunc if word not in stops]
pride_nostop = [word for word in pride_nopunc if word not in stops]

pride_freq = nltk.FreqDist(pride_nostop)
alice_freq = nltk.FreqDist(alice_nostop)

print(pride_freq.most_common(5))
print(alice_freq.most_common(5))
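
The entries in the nltk stopword list are lower-case, so a capitalised token such as `It` or `The` passes through the filter above. A possible refinement is to lower-case each token for the comparison only. The sketch below uses a tiny made-up stopword set so that it runs without the nltk data.

```python
# Hypothetical mini stopword list; the real one comes from
# stopwords.words('english') as in the cell above.
stops = set(['it', 'is', 'a', 'the', 'that', 'in', 'of'])

tokens = ['It', 'is', 'a', 'truth', 'universally', 'acknowledged']

# Case-sensitive test: 'It' is not equal to 'it', so it is kept.
case_sensitive = [word for word in tokens if word not in stops]
print(case_sensitive)

# Lower-casing only for the comparison removes it, while the tokens
# that remain keep their original spelling.
case_insensitive = [word for word in tokens if word.lower() not in stops]
print(case_insensitive)
```

Using a `set` instead of a list also makes each membership test faster, which helps on longer texts.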

In [None]:
# Stemming
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()

print(stemmer.stem('bricks'))
print(stemmer.stem('bicycles'))
print(stemmer.stem('houses'))

words = ['walk','walker','walks','walking', 'walked']
stemmed = set([stemmer.stem(word) for word in words])

print(list(stemmed))

## Part of Speech Tagging

In [None]:
sentence = 'There once was a man from Nantucket.'

nltk.pos_tag(nltk.word_tokenize(sentence))

The tags come from the Penn Treebank tag set: <https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html>

In [None]:
#POS Tag Alice
tagged_alice = nltk.pos_tag(nltk.word_tokenize(alice))

#Show all the words tagged NN (Noun, singular or mass)
[word for word, tag in tagged_alice if tag == 'NN']
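
Since the tagger returns a list of `(word, tag)` pairs, the usual list and counting tools apply to it. The sketch below counts tags and collects every kind of noun. The tagged list is written by hand in the same shape, so the tags are illustrative rather than real tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns;
# the tags here are illustrative, not actual tagger output.
tagged = [('Alice', 'NNP'), ('saw', 'VBD'), ('a', 'DT'), ('rabbit', 'NN'),
          ('with', 'IN'), ('a', 'DT'), ('watch', 'NN'), ('.', '.')]

# Count how often each tag occurs.
tag_counts = Counter(tag for word, tag in tagged)
print(tag_counts.most_common(2))

# Collect every noun: NN, NNS, NNP and NNPS all start with 'NN'.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```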
