
#Summarizing Data
The n-gram code from Chapter 8 can be modified to look for 2-grams and return a `Counter` object that holds every 2-gram along with its frequency:

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
from collections import Counter

In [0]:
def cleanSentence(sentence):
  sentence = sentence.split(' ')
  sentence = [word.strip(string.punctuation + string.whitespace) for word in sentence]
  sentence = [word for word in sentence if len(word) > 1
              or (word.lower() == 'a' or word.lower() == 'i')]
  return sentence

def cleanInput(content):
  content = content.upper()
  content = re.sub('\n', ' ', content)
  content = bytes(content, 'UTF-8')
  content = content.decode('ascii', 'ignore')
  sentences = content.split('. ')
  return [cleanSentence(sentence) for sentence in sentences]

def getNgramsFromSentence(content, n):
  output = []
  for i in range(len(content)-n+1):
    output.append(content[i:i+n])
  return output

def getNgrams(content, n):
  content = cleanInput(content)
  #Counter maps each n-gram string to the number of times it occurs
  ngrams = Counter()
  for sentence in content:
    newNgrams = [' '.join(ngram) for ngram in getNgramsFromSentence(sentence, n)]
    ngrams.update(newNgrams)
  return ngrams

In [13]:
content = str(urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read(), 'utf-8')
ngrams = getNgrams(content, 2)
print(ngrams)
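Printing the whole `Counter` can be hard to read. A minimal sketch, assuming a short made-up sample string in place of the downloaded speech, of how `Counter.most_common` lists the most frequent 2-grams first:

```python
from collections import Counter

# Hypothetical sample text standing in for the inauguration speech
sample = 'the constitution of the united states protects the united states'

words = sample.upper().split()
# Pair each word with its successor to form 2-grams
two_grams = [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]
counts = Counter(two_grams)

# The three most frequent 2-grams, highest count first
top = counts.most_common(3)
print(top)
```

The same call should work on the `ngrams` object returned by `getNgrams`, since that object is also a `Counter`.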



Not all of these 2-grams are meaningful; many consist only of very common words. To filter them out, use a list of frequent words drawn from the Corpus of Contemporary American English. The list of the 5,000 most frequently found words is available for free, which is far more than enough to act as a basic filter that weeds out the most common 2-grams.

Filtering on just the first 100 words vastly improves the results, with the addition of an `isCommon` function:

In [0]:
def isCommon(ngram):
  commonWords = ['THE', 'BE', 'AND', 'OF', 'A', 'IN', 'TO', 'HAVE', 'IT', 'I', 'THAT', 
                 'FOR', 'YOU', 'HE', 'WITH', 'ON', 'DO', 'SAY', 'THIS', 'THEY', 'IS', 
                 'AN', 'AT', 'BUT', 'WE', 'HIS', 'FROM', 'THAT', 'NOT', 'BY', 'SHE', 
                 'OR', 'AS', 'WHAT', 'GO', 'THEIR', 'CAN', 'WHO', 'GET', 'IF', 'WOULD', 
                 'HER', 'ALL', 'MY', 'MAKE', 'ABOUT', 'KNOW', 'WILL', 'AS', 'UP', 'ONE', 
                 'TIME', 'HAS', 'BEEN', 'THERE', 'YEAR', 'SO', 'THINK', 'WHEN', 'WHICH', 
                 'THEM', 'SOME', 'ME', 'PEOPLE', 'TAKE', 'OUT', 'INTO', 'JUST', 'SEE', 
                 'HIM', 'YOUR', 'COME', 'COULD', 'NOW', 'THAN', 'LIKE', 'OTHER', 'HOW', 
                 'THEN', 'ITS', 'OUR', 'TWO', 'MORE', 'THESE', 'WANT', 'WAY', 'LOOK', 
                 'FIRST', 'ALSO', 'NEW', 'BECAUSE', 'DAY', 'MORE', 'USE', 'NO', 'MAN', 
                 'FIND', 'HERE', 'THING', 'GIVE', 'MANY', 'WELL']
  for word in ngram:
    if word in commonWords:
      return True
  return False
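The check above scans a list for every word. As a possible alternative, here is a minimal sketch that stores the words in a `frozenset` and uses `any`; the name `is_common` and the shortened word list are assumptions made for illustration, not part of the original code:

```python
# Shortened, hypothetical stand-in for the 100-word list
COMMON_WORDS = frozenset(['THE', 'OF', 'AND', 'TO', 'A', 'IN'])

def is_common(ngram):
    """Return True if any word of the n-gram (a list of words) is common."""
    return any(word in COMMON_WORDS for word in ngram)

print(is_common(['OF', 'GOVERNMENT']))         # contains 'OF'
print(is_common(['EXECUTIVE', 'DEPARTMENT']))  # no common word
```

Set membership tests do not depend on the length of the word list, which may matter if the filter grows toward the full 5,000 words.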

Then, update the `getNgramsFromSentence` function:

In [0]:
def getNgramsFromSentence(content, n):
  output = []
  for i in range(len(content)-n+1):
    if not isCommon(content[i:i+n]):
      output.append(content[i:i+n])
  return output

In [9]:
ngrams = getNgrams(content, 2)
print(ngrams)



Note that a list of common words drawn from relatively modern usage may not be an ideal filter, given that the text was written in 1841.

To write a summary of the text around its key topics, search for the first sentence that contains each "popular" n-gram. The theory is that the first instance will yield a satisfactory overview of the body of the content.

In [0]:
def getFirstSentenceContaining(ngram, content):
  #print(ngram)
  sentences = content.upper().split('.')
  for sentence in sentences:
    if ngram in sentence:
      return sentence + '\n'
  return ''

In [15]:
print(getFirstSentenceContaining('EXECUTIVE DEPARTMENT', content))

 SUCH A ONE WAS AFFORDED BY THE EXECUTIVE DEPARTMENT CONSTITUTED BY THE CONSTITUTION
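The lookup above can be repeated for several popular n-grams to assemble a rough summary. A minimal sketch, assuming a made-up three-sentence text and a hand-picked list of n-grams (in practice the list might come from the most frequent entries of the `Counter`):

```python
def first_sentence_containing(ngram, text):
    """Return the first sentence of text that contains ngram, or ''."""
    for sentence in text.upper().split('.'):
        if ngram in sentence:
            return sentence.strip()
    return ''

# Hypothetical text and n-grams, chosen only for illustration
text = ('The executive department enforces the law. '
        'The general government has limited powers. '
        'The executive department answers to the people.')
popular = ['EXECUTIVE DEPARTMENT', 'GENERAL GOVERNMENT']

summary = []
for ngram in popular:
    sentence = first_sentence_containing(ngram, text)
    # Skip empty results and sentences already selected
    if sentence and sentence not in summary:
        summary.append(sentence)

print('. '.join(summary) + '.')
```

Skipping duplicates matters because two popular n-grams often first appear in the same sentence.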



With longer blocks of text, or more varied text, it may be worth looking at 3-grams or
even 4-grams when retrieving the “most important” sentences of a passage.

#Markov Models
The following code generates Markov chains of arbitrary length (here set to 100 words) based on the structure of a source text:

In [0]:
from urllib.request import urlopen
from random import randint

In [0]:
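def wordListSum(wordList):
  #Total number of occurrences across all words in the list
  #(named total so the built-in sum is not shadowed)
  total = 0
  for word, value in wordList.items():
    total += value
  return total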

def retrieveRandomWord(wordList):
  randIndex = randint(1, wordListSum(wordList))
  for word, value in wordList.items():
    randIndex -= value
    if randIndex <= 0:
      return word
    
def buildWordDict(text):
  #Remove newlines and quotes
  text = text.replace('\n', ' ')
  text = text.replace('"', '')
  
  #Make sure punctuation marks are treated as their own "words"
  #so that they will be included in the Markov chain
  punctuation = [',','.',';',':']
  for symbol in punctuation:
    text = text.replace(symbol, ' {} '.format(symbol))
    
  words = text.split(' ')
  #Filter out empty words
  words = [word for word in words if word != '']
  
  wordDict = {}
  for i in range(1, len(words)):
    if words[i-1] not in wordDict:
      #Create a new dictionary for this word
      wordDict[words[i-1]] = {}
    if words[i] not in wordDict[words[i-1]]:
      wordDict[words[i-1]][words[i]] = 0
    wordDict[words[i-1]][words[i]] += 1
  return wordDict

In [0]:
text = str(urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read(), 'utf-8')
wordDict = buildWordDict(text)

In [22]:
#Generate a Markov chain of length 100
length = 100
chain = ['I']
for i in range(0, length):
  newWord = retrieveRandomWord(wordDict[chain[-1]])
  chain.append(newWord)
  
print(' '.join(chain))

I shall be guaranteed by all . Fellow-citizens , placed in our constitutional principle that of them , as well as citizens of devising schemes of some cases , but in the Commonwealth and forbearance . Long the rich are told by the amount of my earnest desire to liberty . But with or classed with any or republic , it appears to be exposed from the senate continued to say that it is called me by the management of the banishment from his own votes , therefore , either of their interference in the remark of hasty enactment . A


This output changes every time it is run.
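The run-to-run variation comes from the random draws in `retrieveRandomWord`. A minimal sketch, assuming Python 3.6 or later, of the same weighted draw using the standard library's `random.choices`; the seeded `random.Random` instances and the `rng` parameter are additions for illustration and are not in the original code:

```python
import random

def retrieve_random_word(word_list, rng=random):
    """Pick a follower word with probability proportional to its count."""
    words = list(word_list.keys())
    weights = list(word_list.values())
    return rng.choices(words, weights=weights, k=1)[0]

followers = {'word_b': 2, 'word_c': 1, 'word_d': 1}

# Two generators seeded identically produce the same sequence of picks
rng_1 = random.Random(42)
rng_2 = random.Random(42)
picks_1 = [retrieve_random_word(followers, rng_1) for _ in range(10)]
picks_2 = [retrieve_random_word(followers, rng_2) for _ in range(10)]
print(picks_1 == picks_2)
```

Seeding is mainly useful for debugging or for tests; leaving the generator unseeded keeps the original behavior of a different chain on each run.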

The function buildWordDict takes in the string of text, which was retrieved from the
internet. It then does some cleaning and formatting, removing quotes and putting
spaces around other punctuation so it is effectively treated as a separate word. After
this, it builds a two-dimensional dictionary—a dictionary of dictionaries—that has
the following form:


```
{word_a : {word_b : 2, word_c : 1, word_d : 1}, word_e : {word_b : 5, word_d : 2},...}
```

This means that `word_a` was found four times: two instances were followed by `word_b`, one by `word_c`, and one by `word_d`. Likewise, `word_e` was found seven times: five instances were followed by `word_b` and two by `word_d`.

Interpreted as a Markov model, `word_a` has a 50% chance of being followed by `word_b`, a 25% chance of being followed by `word_c`, and a 25% chance of being followed by `word_d`.
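The step from counts to percentages can be sketched as follows. The helper name `to_probabilities` is an assumption made for this example; the original code keeps raw counts and never builds such a table.

```python
def to_probabilities(word_dict):
    """Convert nested follower counts into nested follower probabilities."""
    probabilities = {}
    for word, followers in word_dict.items():
        total = sum(followers.values())
        probabilities[word] = {
            follower: count / total for follower, count in followers.items()
        }
    return probabilities

# The example dictionary from the text above
word_dict = {
    'word_a': {'word_b': 2, 'word_c': 1, 'word_d': 1},
    'word_e': {'word_b': 5, 'word_d': 2},
}
probs = to_probabilities(word_dict)
print(probs['word_a'])
```

Each inner dictionary sums to 1, which is what makes it a valid set of transition probabilities out of that word.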

Markov chains model how websites link from one page to the next. Large collections of these
links as pointers can form weblike graphs that are useful to store, track, and analyze.
In this way, Markov chains form the foundation for both how to think about web
crawling, and how the web crawlers can think.

##Six Degrees of Wikipedia: Conclusion
A *breadth-first search* is performed by first searching all pages that are linked directly from the starting page. If those pages do not include the target page, then a second level of links (pages that are linked by a page that is linked by the starting page) is searched. This process continues until either the depth limit is reached or the target page is found.

A complete solution to the breadth-first search, using a table of links as described in Chapter 6, is as follows:

In [0]:
import pymysql

In [0]:
conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock', user='root', passwd='root', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute('USE wikipedia')

In [0]:
def getUrl(pageId):
  #Query parameters are passed as a one-element tuple
  cur.execute('SELECT url FROM pages WHERE id = %s', (int(pageId),))
  return cur.fetchone()[0]

def getLinks(fromPageId):
  cur.execute('SELECT toPageId FROM links WHERE fromPageId = %s', (int(fromPageId),))
  if cur.rowcount == 0:
    return []
  return [x[0] for x in cur.fetchall()]


In [0]:
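def searchBreadth(targetPageId, paths=[[1]]):
  #Each path is a list of page ids that starts at the root page (id 1)
  newPaths = []
  for path in paths:
    links = getLinks(path[-1])
    for link in links:
      if link == targetPageId:
        return path + [link]
      else:
        newPaths.append(path+[link])
  if not newPaths:
    #No links left to follow, so the target cannot be reached
    return []
  return searchBreadth(targetPageId, newPaths)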

In [0]:
nodes = getLinks(1)
targetPageId = 28624
pageIds = searchBreadth(targetPageId)
for pageId in pageIds:
  print(getUrl(pageId))
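The search can also be tried without a database. The following is a minimal sketch of a breadth-first search over an in-memory link table, written iteratively with a queue rather than recursively. The `links` dictionary, the `max_depth` parameter, and the `visited` set are assumptions added for illustration; the database version above has neither a depth limit nor cycle detection.

```python
from collections import deque

def search_breadth(links, start, target, max_depth=6):
    """Return the shortest path from start to target as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # A path that already has max_depth links is not extended further
        if len(path) - 1 >= max_depth:
            continue
        for neighbor in links.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical link table: page id -> ids of the pages it links to
links = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6], 6: [1]}

print(search_breadth(links, 1, 6))
print(search_breadth(links, 1, 6, max_depth=2))
```

Tracking visited pages keeps the search from looping when pages link back to each other, as page 6 does to page 1 here.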

#Natural Language Toolkit
The *Natural Language Toolkit* (NLTK) is a suite of Python libraries designed to identify and tag parts of speech found in natural English text.

##Installation and Setup
Install the `nltk` module with pip:

In [24]:
!pip3 install nltk



After installing the module, it’s a good idea to download its preset text repositories:

In [0]:
import nltk
nltk.download()

##Statistical Analysis with NLTK
Analysis with NLTK always starts with the `Text` object.

In [39]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

True

In [0]:
from nltk import word_tokenize
from nltk import Text

tokens = word_tokenize('Here is some not very interesting text')
text = Text(tokens)

In [32]:
print(text)

<Text: Here is some not very interesting text...>


Text objects can be manipulated much like normal Python lists. For example, you can count the number of unique words in a text and compare it against the total number of words:

In [40]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [41]:
len(text6)/len(set(text6))

7.833333333333333
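The value above is the average number of times each distinct token appears in the text. A plain-Python sketch of the same calculation, using a short made-up token list instead of `text6`:

```python
# Hypothetical token list standing in for an NLTK Text object
tokens = ['spam', 'spam', 'eggs', 'spam', 'ham', 'eggs']

total = len(tokens)         # every token, repeats included
unique = len(set(tokens))   # distinct tokens only

# Average number of occurrences per distinct token
print(total / unique)
```

A higher ratio means the text reuses a smaller vocabulary more often.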

You can also put the text into a frequency distribution object to find the most common words and the frequencies of particular words:

In [42]:
from nltk import FreqDist

fdist = FreqDist(text6)
fdist.most_common(10)

[(':', 1197),
 ('.', 816),
 ('!', 801),
 (',', 731),
 ("'", 421),
 ('[', 319),
 (']', 312),
 ('the', 299),
 ('I', 255),
 ('ARTHUR', 225)]
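Most of the top entries above are punctuation tokens. `FreqDist` is a subclass of `collections.Counter`, so the idea of filtering them out can be sketched with a plain `Counter`. The token list here is made up, and filtering with `str.isalpha` is one possible choice rather than something the original code does:

```python
from collections import Counter

# Hypothetical tokens, including punctuation as a tokenizer would emit them
tokens = ['ARTHUR', ':', 'the', 'Grail', '!', 'the', 'Grail', '!', 'the', '.']

# Keep only tokens made up entirely of letters
words = [token for token in tokens if token.isalpha()]
word_counts = Counter(words)

print(word_counts.most_common(2))
```

Note that `str.isalpha` also drops tokens such as numbers and hyphenated words, which may or may not be desirable for a given text.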

To create, search and list 2-grams:

In [0]:
from nltk import bigrams

#Use a separate name for the result so the imported function is not overwritten
bigramPairs = bigrams(text6)
bigramsDist = FreqDist(bigramPairs)

To search for the 2-gram “Sir Robin,” you need to break it into the tuple (“Sir”, “Robin”) to match the way 2-grams are represented in the frequency distribution:

In [44]:
bigramsDist[('Sir', 'Robin')]

18
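A plain-Python sketch of why the lookup key has to be a tuple, assuming a made-up token list: pairing each token with its successor through `zip` yields tuples, and a `Counter` keyed on those tuples behaves like the frequency distribution above.

```python
from collections import Counter

# Hypothetical tokens standing in for text6
tokens = ['Sir', 'Robin', 'ran', 'away', ',', 'brave', 'Sir', 'Robin']

# zip pairs each token with the one after it, producing 2-tuples
pairs = list(zip(tokens, tokens[1:]))
pair_counts = Counter(pairs)

print(pair_counts[('Sir', 'Robin')])
# A plain string is a different key, so its count is zero
print(pair_counts['Sir Robin'])
```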

For other n-grams, use the `ngrams` function and pass the size of the n-gram as the second argument:

In [47]:
from nltk import ngrams
#ngrams returns a one-shot iterator; store it as a list so it can be reused
fourgrams = list(ngrams(text6, 4))
fourgramsDist = FreqDist(fourgrams)
fourgramsDist[('father', 'smelt', 'of', 'elderberries')]

1

Frequency distributions, text objects, and n-grams can also be iterated through and operated on in a loop. The following prints out all 4-grams that begin with the word “coconut”:

In [0]:
for fourgram in fourgrams:
  if fourgram[0] == 'coconut':
    print(fourgram)

##Lexicographical Analysis with NLTK
NLTK uses by default a popular system of tagging parts of speech developed by the
University of Pennsylvania’s Penn Treebank Project. Although some of the tags make
sense (e.g., CC is a coordinating conjunction), others can be confusing (e.g., RP is a
particle).

In addition to measuring language, NLTK can assist in finding meaning in the words based on context and its own sizable dictionaries. At a basic level, NLTK can identify parts of speech:

In [0]:
from nltk.book import *
from nltk import word_tokenize

text = word_tokenize('Strange women lying in ponds distributing swords is no basis for a system of government.')

In [51]:
from nltk import pos_tag

pos_tag(text)

[('Strange', 'JJ'),
 ('women', 'NNS'),
 ('lying', 'VBG'),
 ('in', 'IN'),
 ('ponds', 'NNS'),
 ('distributing', 'VBG'),
 ('swords', 'NNS'),
 ('is', 'VBZ'),
 ('no', 'DT'),
 ('basis', 'NN'),
 ('for', 'IN'),
 ('a', 'DT'),
 ('system', 'NN'),
 ('of', 'IN'),
 ('government', 'NN'),
 ('.', '.')]

Each word is separated into a tuple containing the word and a tag identifying the part of speech.
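A small lookup table can help when reading tagged output like this. The descriptions below follow the Penn Treebank tag set and cover only the tags that appear in this section; the dictionary name `TAG_DESCRIPTIONS` is an assumption made for this sketch and is not part of NLTK.

```python
# Descriptions for the Penn Treebank tags used in this section
TAG_DESCRIPTIONS = {
    'CC': 'coordinating conjunction',
    'DT': 'determiner',
    'IN': 'preposition or subordinating conjunction',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'NNP': 'proper noun, singular',
    'NNPS': 'proper noun, plural',
    'PRP': 'personal pronoun',
    'RB': 'adverb',
    'RP': 'particle',
    'TO': 'to',
    'VB': 'verb, base form',
    'VBD': 'verb, past tense',
    'VBG': 'verb, gerund or present participle',
    'VBZ': 'verb, third person singular present',
}

# First few (word, tag) tuples from the pos_tag output above
tagged = [('Strange', 'JJ'), ('women', 'NNS'), ('lying', 'VBG'), ('in', 'IN')]
for word, tag in tagged:
    print('{}: {}'.format(word, TAG_DESCRIPTIONS.get(tag, 'unknown tag')))
```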

Although this might seem like a straightforward lookup, the complexity needed to perform the task correctly becomes apparent with the following example:

In [52]:
text = word_tokenize('The dust was thick so he had to dust')
pos_tag(text)

[('The', 'DT'),
 ('dust', 'NN'),
 ('was', 'VBD'),
 ('thick', 'RB'),
 ('so', 'RB'),
 ('he', 'PRP'),
 ('had', 'VBD'),
 ('to', 'TO'),
 ('dust', 'VB')]

Notice that the word “dust” is used twice in the sentence: once as a noun, and again as a verb. NLTK identifies both usages correctly, based on their context in the sentence.
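To make the double role of “dust” explicit, the tagged output can be grouped by word. A minimal sketch that works on the tuples shown above (copied in by hand, so it runs without NLTK); the variable names are assumptions made for illustration:

```python
from collections import defaultdict

# (word, tag) tuples copied from the pos_tag output above
tagged = [('The', 'DT'), ('dust', 'NN'), ('was', 'VBD'), ('thick', 'RB'),
          ('so', 'RB'), ('he', 'PRP'), ('had', 'VBD'), ('to', 'TO'),
          ('dust', 'VB')]

# Collect every tag assigned to each lowercased word
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Words that received more than one tag
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items()
             if len(tags) > 1}
print(ambiguous)
```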

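The `pos_tag` function is useful when searching scraped text, because it lets you match a word only when it is used as a particular part of speech. For instance, you might want to find sentences that use “google” as a noun rather than as a verb: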

In [0]:
from nltk import word_tokenize, sent_tokenize, pos_tag

sentences = sent_tokenize('Google is one of the best companies in the world. I constantly google myself to see what I\'m up to.')
nouns = ['NN', 'NNS', 'NNP', 'NNPS']

In [56]:
for sentence in sentences:
  if 'google' in sentence.lower():
    taggedWords = pos_tag(word_tokenize(sentence))
    for word in taggedWords:
      if word[0].lower() == 'google' and word[1] in nouns:
        print(sentence)

Google is one of the best companies in the world.


This prints only sentences that contain the word “google” (or “Google”) as some sort of a noun, not a verb.