# BIA686 Lab Session 1: Natural Language Processing in Python
Technologies based on NLP are becoming increasingly widespread. For example, phones and handheld computers support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text; machine translation allows us to retrieve texts written in Chinese and read them in Spanish; text analysis enables us to detect sentiment in tweets and blogs.




This session is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK).

### Agenda: 
1. Set up the tools
2. String processing: word tokenizers, sentence tokenizers, stemmers
3. Part-of-speech tagging
4. Machine learning: classification using naive Bayes
5. Chunking and named-entity recognition


### What is PIP
pip is a package management system used to install and manage software packages written in Python. 
1. install a package: `sudo pip install package-name`
2. remove a package: `pip uninstall package-name`
3. upgrade a package: `pip install --upgrade package-name`
4. show installed packages: `pip list`

### How to install PIP
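The standard way to install pip is with the `get-pip.py` script. Download it securely from https://bootstrap.pypa.io/get-pip.py

Open Terminal (Mac) or Command Prompt (Windows), change to the folder that contains the downloaded file, then run the following:
`python get-pip.py`

If the command finishes without errors, pip is installed. You can confirm this with `pip --version`.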

### How to install NLTK 

1. Mac: open the terminal and type: `sudo pip install nltk`
2. Windows: open the command line tool and type: `pip install nltk`

Other libraries that you need:
1. NumPy: a scientific computing library with support for multidimensional arrays and linear algebra, required for certain probability, tagging, clustering, and classification tasks (`pip install numpy`)
2. Matplotlib: a 2D plotting library for data visualization (`pip install matplotlib`)


# Get started

In [1]:
# download the corpora and books used in this session (opens the NLTK downloader)
import nltk
nltk.download()

showing info http://www.nltk.org/nltk_data/


True

In [3]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [151]:
text4

<Text: Inaugural Address Corpus>

In [4]:
#we can find the locations of a word in the text
#This positional information can be displayed using a dispersion plot
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

### Simple Statistics 
A frequency distribution tells us how often each vocabulary item occurs in the text.

In [12]:
# number of tokens (words and punctuation marks) in this corpus
len(text4)

145735

In [6]:
#Frequency Distributions
fdist1 = FreqDist(text4)
fdist1.most_common(50)
fdist1.plot(50, cumulative=True)
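
To make concrete what a frequency distribution stores, here is a minimal standard-library sketch written for Python 3. It applies `collections.Counter` to a short, made-up token list (the `tokens` list is an assumption for illustration, not NLTK data). NLTK's `FreqDist` is built on `Counter`, so `most_common()` works the same way on both.

```python
from collections import Counter

# Hypothetical token list, standing in for a real corpus such as text4.
tokens = ["the", "people", "of", "the", "nation", "and", "the", "people"]

# Count how many times each distinct token occurs.
fdist = Counter(tokens)

# The two most frequent tokens with their counts.
top_two = fdist.most_common(2)
print(top_two)

# Relative frequency of one token: its count divided by the total token count.
freq_the = fdist["the"] / len(tokens)
print(freq_the)
```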

In [3]:
from nltk.corpus import inaugural

In [154]:
inaugural.fileids()

[u'1789-Washington.txt',
 u'1793-Washington.txt',
 u'1797-Adams.txt',
 u'1801-Jefferson.txt',
 u'1805-Jefferson.txt',
 u'1809-Madison.txt',
 u'1813-Madison.txt',
 u'1817-Monroe.txt',
 u'1821-Monroe.txt',
 u'1825-Adams.txt',
 u'1829-Jackson.txt',
 u'1833-Jackson.txt',
 u'1837-VanBuren.txt',
 u'1841-Harrison.txt',
 u'1845-Polk.txt',
 u'1849-Taylor.txt',
 u'1853-Pierce.txt',
 u'1857-Buchanan.txt',
 u'1861-Lincoln.txt',
 u'1865-Lincoln.txt',
 u'1869-Grant.txt',
 u'1873-Grant.txt',
 u'1877-Hayes.txt',
 u'1881-Garfield.txt',
 u'1885-Cleveland.txt',
 u'1889-Harrison.txt',
 u'1893-Cleveland.txt',
 u'1897-McKinley.txt',
 u'1901-McKinley.txt',
 u'1905-Roosevelt.txt',
 u'1909-Taft.txt',
 u'1913-Wilson.txt',
 u'1917-Wilson.txt',
 u'1921-Harding.txt',
 u'1925-Coolidge.txt',
 u'1929-Hoover.txt',
 u'1933-Roosevelt.txt',
 u'1937-Roosevelt.txt',
 u'1941-Roosevelt.txt',
 u'1945-Roosevelt.txt',
 u'1949-Truman.txt',
 u'1953-Eisenhower.txt',
 u'1957-Eisenhower.txt',
 u'1961-Kennedy.txt',
 u'1965-Johnson.tx

Let's look at how the words *America* and *citizen* are used over time. The following code converts the words in the Inaugural corpus to lowercase using `w.lower()`, then checks whether they start with either of the targets `america` or `citizen` using `startswith()`. Thus it will count words like *American's* and *Citizens*.

In [16]:
# Notice that the year of each text appears in its filename. To get the year out of the filename,
# we extract the first four characters, using fileid[:4].
print([fileid[:4] for fileid in inaugural.fileids()])

[u'1789', u'1793', u'1797', u'1801', u'1805', u'1809', u'1813', u'1817', u'1821', u'1825', u'1829', u'1833', u'1837', u'1841', u'1845', u'1849', u'1853', u'1857', u'1861', u'1865', u'1869', u'1873', u'1877', u'1881', u'1885', u'1889', u'1893', u'1897', u'1901', u'1905', u'1909', u'1913', u'1917', u'1921', u'1925', u'1929', u'1933', u'1937', u'1941', u'1945', u'1949', u'1953', u'1957', u'1961', u'1965', u'1969', u'1973', u'1977', u'1981', u'1985', u'1989', u'1993', u'1997', u'2001', u'2005', u'2009']


In [7]:
cfd = nltk.ConditionalFreqDist(
           (target, fileid[:4])
           for fileid in inaugural.fileids()
           for w in inaugural.words(fileid)
           for target in ['america', 'citizen']
           if w.lower().startswith(target)) 
cfd.plot()
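
A conditional frequency distribution can be thought of as one counter per condition. The sketch below, written for Python 3 with only the standard library, mimics the counting step on a tiny made-up corpus. The `speeches` dictionary and its contents are assumptions for illustration; the real code above reads the words from the Inaugural corpus instead.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: year -> tokens (made up for illustration).
speeches = {
    "1789": ["Citizens", "of", "America"],
    "1793": ["American", "citizens", "and", "citizen", "soldiers"],
}
targets = ["america", "citizen"]

# cfd[target][year] = number of tokens in that year starting with the target.
cfd = defaultdict(Counter)
for year, words in speeches.items():
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                cfd[target][year] += 1

for target in targets:
    print(target, dict(cfd[target]))
```

Each condition (`america`, `citizen`) owns its own `Counter`, which is the structure that `cfd.plot()` draws as one line per condition.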

#### Stop words

There is also a corpus of stopwords, that is, high-frequency words like *the*, *to*, and *also* that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

In [11]:
from nltk.corpus import stopwords
sw = stopwords.words('english')
sw  # display the list of English stopwords

[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your',
 u'yours',
 u'yourself',
 u'yourselves',
 u'he',
 u'him',
 u'his',
 u'himself',
 u'she',
 u'her',
 u'hers',
 u'herself',
 u'it',
 u'its',
 u'itself',
 u'they',
 u'them',
 u'their',
 u'theirs',
 u'themselves',
 u'what',
 u'which',
 u'who',
 u'whom',
 u'this',
 u'that',
 u'these',
 u'those',
 u'am',
 u'is',
 u'are',
 u'was',
 u'were',
 u'be',
 u'been',
 u'being',
 u'have',
 u'has',
 u'had',
 u'having',
 u'do',
 u'does',
 u'did',
 u'doing',
 u'a',
 u'an',
 u'the',
 u'and',
 u'but',
 u'if',
 u'or',
 u'because',
 u'as',
 u'until',
 u'while',
 u'of',
 u'at',
 u'by',
 u'for',
 u'with',
 u'about',
 u'against',
 u'between',
 u'into',
 u'through',
 u'during',
 u'before',
 u'after',
 u'above',
 u'below',
 u'to',
 u'from',
 u'up',
 u'down',
 u'in',
 u'out',
 u'on',
 u'off',
 u'over',
 u'under',
 u'again',
 u'further',
 u'then',
 u'once',
 u'here',
 u'there',
 u'when',
 u'where',
 u'why',
 u'how',
 u'all

In [47]:
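# fraction of tokens that are not stopwords
# (`text` must already hold a list of tokens; its definition is not shown in this session)
content = [w for w in text if w.lower() not in sw]
len(content) / float(len(text))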

0.5229560503653893
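
The same calculation can be wrapped in a small reusable function. This is a minimal Python 3 sketch under stated assumptions: the function name `content_fraction`, the sample tokens, and the three-word stopword list are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
def content_fraction(tokens, stop_words):
    """Return the fraction of tokens that are not stop words."""
    stop_set = set(stop_words)  # set membership tests are fast
    content = [w for w in tokens if w.lower() not in stop_set]
    return len(content) / float(len(tokens))

# Hypothetical inputs for illustration only.
sample_tokens = ["The", "whale", "is", "in", "the", "sea"]
sample_stop_words = ["the", "is", "in"]
fraction = content_fraction(sample_tokens, sample_stop_words)
print(fraction)
```

Here two of the six tokens (`whale`, `sea`) survive the filter, so the fraction is one third.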

#### Names
NLTK also includes a Names corpus containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find the names that appear in both files, i.e. names that are ambiguous for gender:

In [13]:
names = nltk.corpus.names
names.fileids()

[u'female.txt', u'male.txt']

In [49]:
# names that appear in both the male and the female list
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]

[u'Abbey',
 u'Abbie',
 u'Abby',
 u'Addie',
 u'Adrian',
 u'Adrien',
 u'Ajay',
 u'Alex',
 u'Alexis',
 u'Alfie',
 u'Ali',
 u'Alix',
 u'Allie',
 u'Allyn',
 u'Andie',
 u'Andrea',
 u'Andy',
 u'Angel',
 u'Angie',
 u'Ariel',
 u'Ashley',
 u'Aubrey',
 u'Augustine',
 u'Austin',
 u'Averil',
 u'Barrie',
 u'Barry',
 u'Beau',
 u'Bennie',
 u'Benny',
 u'Bernie',
 u'Bert',
 u'Bertie',
 u'Bill',
 u'Billie',
 u'Billy',
 u'Blair',
 u'Blake',
 u'Bo',
 u'Bobbie',
 u'Bobby',
 u'Brandy',
 u'Brett',
 u'Britt',
 u'Brook',
 u'Brooke',
 u'Brooks',
 u'Bryn',
 u'Cal',
 u'Cam',
 u'Cammy',
 u'Carey',
 u'Carlie',
 u'Carlin',
 u'Carmine',
 u'Carroll',
 u'Cary',
 u'Caryl',
 u'Casey',
 u'Cass',
 u'Cat',
 u'Cecil',
 u'Chad',
 u'Chris',
 u'Chrissy',
 u'Christian',
 u'Christie',
 u'Christy',
 u'Clair',
 u'Claire',
 u'Clare',
 u'Claude',
 u'Clem',
 u'Clemmie',
 u'Cody',
 u'Connie',
 u'Constantine',
 u'Corey',
 u'Corrie',
 u'Cory',
 u'Courtney',
 u'Cris',
 u'Daffy',
 u'Dale',
 u'Dallas',
 u'Dana',
 u'Dani',
 u'Daniel',
 u'Dann
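
The list comprehension above checks every male name against the whole female list. A set intersection gives the same names with far fewer comparisons. The sketch below is written for Python 3 and uses two short, made-up lists (`male_sample` and `female_sample` are assumptions standing in for `names.words('male.txt')` and `names.words('female.txt')`).

```python
# Hypothetical short lists standing in for the two files of the Names corpus.
male_sample = ["Adrian", "Alex", "Bob", "Chris"]
female_sample = ["Adrian", "Alice", "Alex", "Chris", "Dana"]

# Set intersection finds the shared names without scanning the female list
# once per male name; sorted() gives a stable, alphabetical result.
ambiguous = sorted(set(male_sample) & set(female_sample))
print(ambiguous)
```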

## Processing Raw Text

The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the corpora we saw in the previous section. However, you probably have your own text sources in mind, and need to learn how to access them.

In [15]:
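try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
url = "http://www.gutenberg.org/files/2554/2554.txt"  # Crime and Punishment (Project Gutenberg)
response = urlopen(url)
raw = response.read().decode('utf8')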

In [57]:
len(raw)

1176896

In [70]:
# tokenize the text into sentences
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(raw)

`sent_tokenize` uses an instance of `PunktSentenceTokenizer` from the `nltk.tokenize.punkt` module. This instance has already been trained and works well for many European languages, so it knows which punctuation and characters mark the end of one sentence and the beginning of the next.

NLTK supports sentence tokenization for 17 European languages in total. You can load the model for a specific language as follows:



In [71]:
import nltk.data
text = "this's a sent tokenize test. this is sent two. is this sent three? \
sent 4 is cool! Now it's your turn."
# load the pre-trained English Punkt model; other languages follow the same path pattern
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
tokenizer.tokenize(text)

["this's a sent tokenize test.",
 'this is sent two.',
 'is this sent three?',
 'sent 4 is cool!',
 "Now it's your turn."]

In [16]:
spanish_tokenizer = nltk.data.load("tokenizers/punkt/spanish.pickle")
spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')


['Hola amigo.', 'Estoy bien.']

In [18]:
#tokenize text into words 
from nltk.tokenize import word_tokenize
#word_tokenize('Hello World.')
word_tokenize("this's a test")

['this', "'s", 'a', 'test']
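
To see roughly what a word tokenizer has to do, here is a naive regular-expression sketch written for Python 3. It is not how `word_tokenize` works; the function name `simple_tokenize` is hypothetical, and the pattern only separates runs of word characters from individual punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello World. Is this a test?")
print(tokens)

# Contractions show the limits of the naive pattern.
contraction_tokens = simple_tokenize("this's a test")
print(contraction_tokens)
```

The naive pattern breaks `this's` into three pieces (`this`, `'`, `s`), whereas the `word_tokenize` output above keeps the clitic `'s` together. Handling such cases is one reason to prefer a trained or rule-rich tokenizer.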

### Part-of-speech tagging 
Part-of-speech (POS) tagging is one of the most important text analysis tasks. It classifies each word by its part of speech and labels it according to a tagset, which is the collection of tags used for POS tagging. Parts of speech are also known as word classes.

Once you've tokenized the sentences, you need to tag them. Tagging is not necessary for all purposes, but it does help the computer better understand the objects and references in your sentences. For example, the tag `NN` marks a singular or mass noun.

In [20]:
import nltk
text = nltk.word_tokenize("Dive into NLTK: Part-of-speech tagging and POS Tagger")
nltk.pos_tag(text)

[('Dive', 'NNP'),
 ('into', 'IN'),
 ('NLTK', 'NNP'),
 (':', ':'),
 ('Part-of-speech', 'JJ'),
 ('tagging', 'NN'),
 ('and', 'CC'),
 ('POS', 'NNP'),
 ('Tagger', 'NNP')]

In [23]:
nltk.help.upenn_tagset('CC')

CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet


### Stemming
Stemming and lemmatization are basic text processing methods for English text. The goal of both is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Stemming is the process of removing endings from words in order to strip away things like tense or plurality. It's not appropriate for all cases, but it can make it easier to connect different forms of a word and see whether two texts cover the same subject matter.


In [27]:
pstemmer = nltk.PorterStemmer()
lstemmer = nltk.LancasterStemmer()
wnlemmatizer = nltk.WordNetLemmatizer()
a = ["i", "like", "playing", "basketball"]
# stem each word with the Porter stemmer
for word in a:
    print(pstemmer.stem(word))
# other calls to try:
# pstemmer.stem('maximum')
# lstemmer.stem('maximum')
# lstemmer.stem('presumably')
# pstemmer.stem('saying')
pstemmer.stem('presumably')

i
like
play
basketbal


u'presum'

In [28]:
from nltk import word_tokenize
from nltk.stem import PorterStemmer
new_text = """Stemming algorithms attempt to automatically remove suffixes (and in some 
cases prefixes) in order to find the "root word" or stem of a given word. This 
is useful in various natural language processing scenarios, such as search """
ps = PorterStemmer()
# tokenize the passage, then print the stem of every token
for token in word_tokenize(new_text):
    print(ps.stem(token))

Stem
algorithm
attempt
to
automat
remov
suffix
(
and
in
some
case
prefix
)
in
order
to
find
the
``
root
word
''
or
stem
of
a
given
word
.
Thi
is
use
in
variou
natur
languag
process
scenario
,
such
as
search
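
The core idea of suffix stripping can be sketched in a few lines. The toy function below is written for Python 3 and is not the Porter algorithm, which applies ordered rule sets with conditions on the remaining stem. The name `toy_stem`, the suffix list, and the minimum stem length of three characters are all assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "ly", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["playing", "quickly", "jumped", "cats", "is", "like"]
stems = [toy_stem(w) for w in words]
print(stems)
```

The length check keeps short words such as `is` intact. Real stemmers need many more rules, which is why the Porter output above contains forms like `automat` and `languag` that a simple suffix list would not produce.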


### Remove punctuation 

In [32]:
import string
sentence = 'good morning, how are you doing today". the waiter said'
exclude = set(string.punctuation)
sentence_without_punc = ''.join(ch for ch in sentence if ch not in exclude)
sentence_without_punc


'good morning how are you doing today the waiter said'
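
An alternative worth knowing is `str.translate`, which removes all punctuation in a single pass. This sketch assumes Python 3, where `str.maketrans` is available; it reuses the sentence from the cell above and should produce the same result.

```python
import string

sentence = 'good morning, how are you doing today". the waiter said'

# Build a translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
cleaned = sentence.translate(table)
print(cleaned)
```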

In [35]:
# replace every non-ASCII character with a space
import re
def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)

### Chunking
Chunking extracts multi-word chunks of text, such as noun phrases or names, that may be more meaningful to your research or program than single tokens. You define patterns over parts of speech, run them over your tagged corpus, and extract the phrases that match.

Remember that you have to tailor the patterns to the tagset of the part-of-speech tagger you're using, such as the Brown tagset or the Stanford tagger's.
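
The idea of running part-of-speech patterns over tagged text can be illustrated without any library. The Python 3 sketch below is a simplified stand-in, not NLTK's chunker: the function name `noun_phrase_chunks`, the chosen tag list, and the hand-tagged sentence are assumptions for illustration. It groups consecutive determiners, adjectives, and nouns, and keeps only groups that end in a noun.

```python
def noun_phrase_chunks(tagged_tokens):
    """Group runs of determiners, adjectives, and nouns that end in a noun."""
    chunk_tags = ("DT", "JJ", "NN", "NNS", "NNP", "NNPS")
    chunks, current = [], []
    # The trailing sentinel (None, None) flushes the last open run.
    for word, tag in list(tagged_tokens) + [(None, None)]:
        if tag in chunk_tags:
            current.append((word, tag))
            continue
        # Trim trailing non-nouns so every chunk ends with a noun.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks

# Hypothetical, hand-tagged sentence (Penn Treebank tags).
tagged = [("The", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
          ("at", "IN"), ("the", "DT"), ("cat", "NN"), (".", ".")]
chunks = noun_phrase_chunks(tagged)
print(chunks)
```

Because the tag names are hard-coded, the function only works with Penn Treebank style tags, which is exactly the dependence on the tagger's tagset described above.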

In [40]:
# named-entity recognition: extract person names
import nltk
from nltk.tokenize import LineTokenizer
# read a local text file; replace the path with a file of your own
with open("247/064.txt") as fw:
    text = remove_non_ascii_2(fw.read())

In [46]:
# split the text into lines, discarding blank ones
sents = LineTokenizer(blanklines=u'discard').tokenize(text.strip())
for sent in sents:
    words = nltk.word_tokenize(sent)
    pos = nltk.pos_tag(words)
    # label named entities with their type (PERSON, ORGANIZATION, ...)
    tree = nltk.ne_chunk(pos, binary=False)
    person_list = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'PERSON'):
        person = [leaf[0] for leaf in subtree.leaves()]
        if len(person) > 1:  # avoid grabbing lone surnames
            name = ' '.join(person)
            if name not in person_list:
                person_list.append(name)
    # one list of multi-word person names per line of the input
    print(person_list)

[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
['Jersey Medical School']
[]
[]
[]
[]
[]
[]
['Lee J. Brooks']
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]


KeyboardInterrupt: 

### Text Classification
Text classification is a very useful technique in text analysis. It can be used for spam filtering, language identification, sentiment analysis, genre classification, and more.

In [56]:
from nltk.corpus import names
import random

In [57]:
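# build a list of (name, label) pairs
# note: this rebinds `names` from the corpus reader to a plain list
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])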

In [59]:
random.shuffle(names)

In [138]:
names[0:10]

[(u'Webb', 'male'),
 (u'Casper', 'male'),
 (u'Arne', 'male'),
 (u'Carine', 'female'),
 (u'Ginger', 'male'),
 (u'Jewell', 'female'),
 (u'Adrian', 'female'),
 (u'Florri', 'female'),
 (u'Prince', 'male'),
 (u'Jabez', 'male')]

The most important ingredient of a text classifier is its features, which are flexible and are defined by a human engineer. Here, we use just the final letter of a given name as the feature, and build a dictionary containing relevant information about the name:

In [61]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [62]:
gender_features('Gary')

{'last_letter': 'y'}
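
Because a feature extractor is just a function that returns a dictionary, it is easy to experiment with richer features. The Python 3 sketch below is one hypothetical variant (the name `gender_features2` and the specific features are assumptions); whether it improves accuracy is not measured in this session.

```python
def gender_features2(name):
    """Return several simple spelling features for a first name."""
    lowered = name.lower()
    return {
        "first_letter": lowered[0],
        "last_letter": lowered[-1],
        "last_two_letters": lowered[-2:],
        "length": len(name),
    }

features = gender_features2("Gary")
print(features)
```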

In [66]:
featuresets = [(gender_features(n), g) for (n, g) in names]

In [142]:
featuresets[0:10]

[({'last_letter': u'b'}, 'male'),
 ({'last_letter': u'r'}, 'male'),
 ({'last_letter': u'e'}, 'male'),
 ({'last_letter': u'e'}, 'female'),
 ({'last_letter': u'r'}, 'male'),
 ({'last_letter': u'l'}, 'female'),
 ({'last_letter': u'n'}, 'female'),
 ({'last_letter': u'i'}, 'female'),
 ({'last_letter': u'e'}, 'male'),
 ({'last_letter': u'z'}, 'male')]

In [68]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [70]:
#len(train_set)
len(test_set)

500

In [73]:
from nltk import NaiveBayesClassifier
nb_classifier = NaiveBayesClassifier.train(train_set)
# classify a name that was not necessarily seen during training
#nb_classifier.classify(gender_features('Gary'))
nb_classifier.classify(gender_features('Grace'))

'female'

In [74]:
from nltk import classify
classify.accuracy(nb_classifier, test_set)

0.702

In [147]:
nb_classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = u'a'           female : male   =     37.0 : 1.0
             last_letter = u'k'             male : female =     32.3 : 1.0
             last_letter = u'f'             male : female =     15.3 : 1.0
             last_letter = u'p'             male : female =     12.6 : 1.0
             last_letter = u'd'             male : female =     10.0 : 1.0
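
To show what a naive Bayes classifier computes, here is a small from-scratch sketch in Python 3. It is a simplified illustration, not NLTK's implementation: the function names, the add-one smoothing, and the six training examples are all assumptions, and NLTK's own probability estimates may differ. For each label it adds the log prior to the log likelihood of every feature value and picks the label with the highest total.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_features):
    """Count labels and (feature, value) pairs per label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # label -> Counter of (feature, value)
    vocab = defaultdict(set)              # feature -> set of seen values
    for features, label in labeled_features:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[label][(fname, fval)] += 1
            vocab[fname].add(fval)
    return label_counts, value_counts, vocab

def classify(model, features):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for fname, fval in features.items():
            numerator = value_counts[label][(fname, fval)] + 1
            denominator = count + len(vocab[fname])
            score += math.log(numerator / denominator)  # log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, not drawn from the Names corpus.
train = [({"last_letter": "a"}, "female"), ({"last_letter": "a"}, "female"),
         ({"last_letter": "e"}, "female"), ({"last_letter": "k"}, "male"),
         ({"last_letter": "d"}, "male"), ({"last_letter": "e"}, "male")]
model = train_naive_bayes(train)
pred_a = classify(model, {"last_letter": "a"})
pred_k = classify(model, {"last_letter": "k"})
print(pred_a, pred_k)
```

In the toy data the letter `a` appears only with female names and `k` only with male names, so the likelihood term decides both predictions, much as the most informative features listed above dominate the real classifier.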
