### Chapter 3. Processing Raw Text

##### 1. How can we write programs to access text from local files and from the Web?
##### 2. How can we split documents up into individual words and punctuation symbols?
##### 3. How can we write programs to produce formatted output and save it in a file?


In [13]:
from __future__ import division
import nltk, re, pprint
import requests
import json

#### 3.1 Accessing Text from the Web and from Disk 

In [3]:
# English translation of Crime and Punishment 
from urllib.request import urlopen
url = "https://www.gutenberg.org/files/2554/2554-0.txt"
raw_http = urlopen(url).read()
raw = raw_http.decode("utf-8")


In [5]:
len(raw)

1135214

In [7]:
type(raw)

str

In [9]:
raw[:75]

'*** START OF THE PROJECT GUTENBERG EBOOK 2554 ***\n\n\n\n\nCRIME AND PUNISHMENT\n'

In [15]:
tokens = nltk.word_tokenize(raw)  # list of words and punctuation symbols
type(tokens)
len(tokens)

253688

In [17]:
tokens[:10]

['*', '*', '*', 'START', 'OF', 'THE', 'PROJECT', 'GUTENBERG', 'EBOOK', '2554']
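
To make the idea of splitting text into words and punctuation symbols concrete, here is a minimal sketch that uses only the standard library. It is not NLTK's `word_tokenize`: the `simple_tokenize` helper and its regular expression are assumptions made for illustration, and it handles contractions and abbreviations more crudely than NLTK does.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word or a number; [^\w\s] matches one non-space symbol
    return re.findall(r"\w+|[^\w\s]", text)

# The first line of the downloaded file, as shown by raw[:75]
header = "*** START OF THE PROJECT GUTENBERG EBOOK 2554 ***"
header_tokens = simple_tokenize(header)
print(header_tokens[:10])
```

On this header line the sketch yields the same first ten tokens as the NLTK output above. On running prose the two diverge, for example on "don't", which NLTK splits into "do" and "n't".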

In [19]:
text = nltk.Text(tokens)
#text[1020:1060]

In [21]:
# Frequent collocations in the text (this run does not list 'Project Gutenberg', unlike the NLTK book's example)
text.collocations()

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Porfiry Petrovitch; Amalia Ivanovna; great deal; young man;
Nikodim Fomitch; Ilya Petrovitch; Andrey Semyonovitch; Hay Market;
Dmitri Prokofitch; Good heavens; police station; head clerk
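
The raw download still contains Project Gutenberg's own marker lines around the novel. One way to trim them before tokenizing is to slice the string with `find()` and `rfind()`. This is a minimal sketch: the `strip_gutenberg_boilerplate` helper is hypothetical, the `*** END OF` marker is an assumption (only the `*** START OF` line appears in the output above), and the sample string is invented.

```python
def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END marker lines, if present."""
    start_marker = "*** START OF"
    end_marker = "*** END OF"  # assumed closing marker, not shown in the notebook

    start = raw.find(start_marker)
    if start == -1:
        start = 0
    else:
        # Skip to the end of the marker line
        line_end = raw.find("\n", start)
        start = line_end + 1 if line_end != -1 else len(raw)

    end = raw.rfind(end_marker)
    if end == -1 or end < start:
        end = len(raw)

    return raw[start:end].strip()

# Invented miniature document standing in for the real download
sample = (
    "*** START OF THE PROJECT GUTENBERG EBOOK 2554 ***\n\n"
    "CRIME AND PUNISHMENT\n\nPART I\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK 2554 ***\n"
)
body = strip_gutenberg_boilerplate(sample)
print(repr(body))
```

Text without either marker is returned unchanged apart from stripped surrounding whitespace, so the helper can be applied to any file.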


In [27]:
# Dealing with HTML: Much of the text on the Web is in the form of HTML documents

# BBC News story: "Blondes 'to die out in 200 years'"

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read() 
# print(html) to see HTML content

In [25]:
from bs4 import BeautifulSoup 
# nltk.clean_html() is no longer implemented in NLTK 3; BeautifulSoup's get_text()
# does the job of turning an HTML string into raw text
raw = BeautifulSoup(html).get_text()
# tokenize raw text
tokens = nltk.word_tokenize(raw)

In [29]:
raw.find("gene") # index number is 1490

1490

In [40]:
tokens = tokens[96:399] # find the start and end indexes of content 
text = nltk.Text(tokens)
text.concordance('gene') # select the tokens of interest: 'gene'

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
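
The slice `tokens[96:399]` was found by inspecting the token list by hand. When the first and last tokens of the article body are known, the same boundaries can be located programmatically. This is a minimal sketch: `slice_between` is a hypothetical helper and the token list is invented, not taken from the BBC page.

```python
def slice_between(tokens, first_token, last_token):
    """Return tokens from the first occurrence of first_token
    through the last occurrence of last_token, inclusive."""
    start = tokens.index(first_token)
    # Search the reversed list to find the last occurrence
    end = len(tokens) - 1 - tokens[::-1].index(last_token)
    return tokens[start:end + 1]

# Invented tokens: navigation menu, short article body, page footer
page_tokens = ["Home", "|", "News", "|", "Sport",
               "Blondes", "may", "die", "out", ".",
               "Contact", "us", "|", "Help"]
content = slice_between(page_tokens, "Blondes", ".")
print(content)
```

`list.index` raises `ValueError` when a token is missing, which is a useful signal that the page layout has changed.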


##### Processing the Blogosphere: Accessing the content of a blog

In [31]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

'Language Log'

##### Reading Local Files 

In [37]:
# Assuming the file was saved as UTF-8: without an explicit encoding, the platform default can garble the apostrophe
f = open('chp2myowncorpus.txt', encoding='utf-8')
raw = f.read()
f.close()
print(raw)


If you have your own collection of text files that you would like to access using the
methods discussed earlier, you can easily load them with the help of NLTK’s
PlaintextCorpusReader. Check the location of your files on your file system; in the following
example, we have taken this to be the directory /usr/share/dict. Whatever the location,
set this to be the value of corpus_root. The second parameter of the
PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern
that matches all fileids, like '[abc]/.*\.txt' (see Section 3.4 for information about
regular expressions).


In [35]:
# list of files in your current directory
import os
os.listdir('.')

['.anaconda',
 '.conda',
 '.condarc',
 '.continuum',
 '.dotnet',
 '.idlerc',
 '.ipynb_checkpoints',
 '.ipython',
 '.jupyter',
 '.matplotlib',
 '.ms-ad',
 '.VirtualBox',
 '.virtual_documents',
 '3D Objects',
 'absence.ipynb',
 'anaconda3',
 'AppData',
 'Application Data',
 'BullseyeCoverageError.txt',
 'chp2myowncorpus.txt',
 'Contacts',
 'Cookies',
 'Cross Validation.ipynb',
 'Decision Tree.ipynb',
 'Documents',
 'Downloads',
 'Favorites',
 'iCloudPhotos',
 'IntelGraphicsProfiles',
 'K-Means Clustering.ipynb',
 'KNN.ipynb',
 'Links',
 'Local Settings',
 'MicrosoftEdgeBackups',
 'miktex-console.lock',
 'Multivariate Analysis.ipynb',
 'Music',
 'My Documents',
 'Natural Language Processing Chp 1,2.ipynb',
 'Natural Language Processing Chp3.ipynb',
 'NetHood',
 'NTUSER.DAT',
 'ntuser.dat.LOG1',
 'ntuser.dat.LOG2',
 'NTUSER.DAT{3bd68553-6bc1-11ed-a7c1-a928fad0adb3}.TxR.0.regtrans-ms',
 'NTUSER.DAT{3bd68553-6bc1-11ed-a7c1-a928fad0adb3}.TxR.1.regtrans-ms',
 'NTUSER.DAT{3bd68553-6bc1-11ed-a7c

In [39]:
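# The 'rU' mode was removed in Python 3.11 (it raises ValueError: invalid mode: 'rU').
# Universal newlines are the default in text mode, so plain 'r' is enough.
# encoding='utf-8' assumes the file was saved as UTF-8.
with open('chp2myowncorpus.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.strip())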
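
Reading a local file line by line, and the effect of choosing the wrong encoding, can be sketched without depending on any existing file. The sample text and the file name `sample_corpus.txt` below are hypothetical. The sketch writes a temporary UTF-8 file, reads it back correctly, then reads it again as cp1252 to show the kind of garbling that a mismatched default encoding produces.

```python
import os
import tempfile

# Hypothetical sample content; \u2019 is a curly apostrophe (non-ASCII)
sample_text = "NLTK\u2019s corpus readers\nload plain text files.\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_corpus.txt")  # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        f.write(sample_text)

    # Correct: decode with the encoding the file was written in
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]

    # Mismatch: the three UTF-8 bytes of the apostrophe become three characters
    with open(path, encoding="cp1252") as f:
        garbled = f.read()

print(lines)
print(garbled.splitlines()[0])
```

The `with` statement closes each file automatically, even if an error occurs while reading.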

##### ASCII text and HTML text are human-readable formats
##### Text often comes in binary formats, such as PDF and MS Word
##### Third-party libraries such as pypdf and pywin32 provide access to these formats

##### Capturing User Input

In [50]:
# prompt the user to type a line of input
s = input("Enter some text: ") 
print("You typed", len(nltk.word_tokenize(s)), 'words.')

Enter some text:  i love to swim


You typed 4 words.


##### Basic Operations with Strings

In [46]:
# If a string contains a single quote, we must backslash-escape the quote
circus = 'Monty Python\'s Flying Circus'
circus

"Monty Python's Flying Circus"

In [52]:
# Adjacent string literals are joined into a single string; to continue across lines, use a backslash or parentheses
couplet = "Shall I compare thee to a Summer's day?"\
"Thou art more lovely and more temperate:"
couplet

couplet = ("Shall I compare thee to a Summer's day," "Thou art more lovely and more temperate:")
couplet

"Shall I compare thee to a Summer's day,Thou art more lovely and more temperate:"

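Adjacent literals are joined with nothing in between, which is why the output above runs the two lines of verse together. A minimal sketch of the usual remedies follows; the variable names are chosen here for illustration.

```python
# No separator: the literals are simply glued together
joined = ("Shall I compare thee to a Summer's day?"
          "Thou art more lovely and more temperate:")

# Remedy 1: put the space inside one of the literals
spaced = ("Shall I compare thee to a Summer's day? "
          "Thou art more lovely and more temperate:")

# Remedy 2: build the string from a list with an explicit newline separator
two_lines = "\n".join([
    "Shall I compare thee to a Summer's day?",
    "Thou art more lovely and more temperate:",
])

print(joined)
print(spaced)
print(two_lines)
```

The concatenation happens when the code is compiled, so it works only for literals, not for variables holding strings.
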
In [64]:
# A triple-quoted string keeps any line breaks typed inside it; this one is written on a single line, so it contains no newline
couplet = """Shall I compare thee to a Summer's day? Thou art more lovely and more temperate:"""
couplet

"Shall I compare thee to a Summer's day? Thou art more lovely and more temperate:"

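For comparison, here is a short sketch of a triple-quoted string that spans two source lines. The line break typed inside the quotes becomes a newline character in the string.

```python
couplet_two_lines = """Shall I compare thee to a Summer's day?
Thou art more lovely and more temperate:"""

# repr() makes the embedded newline visible as \n
print(repr(couplet_two_lines))

# splitlines() recovers the individual lines of verse
verse_lines = couplet_two_lines.splitlines()
print(len(verse_lines))
```
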
##### Strings are IMMUTABLE, whereas lists are MUTABLE
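
The difference can be seen in a few lines. In this minimal sketch, with variable names chosen for illustration, assigning to a position in a string raises `TypeError`, while the same assignment on a list succeeds. Changing a string means building a new one.

```python
word = "Monty"
try:
    word[0] = "P"  # strings do not support item assignment
    string_changed = True
except TypeError as err:
    string_changed = False
    print("TypeError:", err)

words = ["Monty", "Python"]
words[0] = "Flying"  # lists can be modified in place
print(words)

# To "change" a string, build a new one from slices of the old
new_word = "P" + word[1:]
print(word, new_word)
```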

#### 3.3 Text Processing with Unicode