# Finding and building corpora from the web

* Extracting text from structured data to build a corpus for text analysis
    * Simple text formats
    * Web pages (HTML)

### COMM313 Spring 2019 (02/11/19)



## Readings


* Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum - **Read Chs. 3&4**
* NLTK Book Ch. 3 http://www.nltk.org/book/ch03.html 



## Setup

* Run the following cells to import the modules used and set up the needed parameters

In [59]:
%matplotlib inline

import matplotlib.pyplot as plt
from collections import Counter
import os

import requests
from bs4 import BeautifulSoup

## Our Python text analysis steps so far...

1. Create a string object containing text 
    * either directly, e.g.
        ```
        text = "this is my text...."
        ```
    * or most often by reading the contents of a file
        ```
        text = open('path/to/file.txt').read()
        ```

In [60]:
three_bears_text = open('/data/kids/threebears.txt').read()

2. Normalize the text
    * using string functions to put into lowercase and remove unwanted characters

In [61]:
three_bears_text_lc = three_bears_text.lower()

In [62]:
rdict=str.maketrans('','','!,."')
three_bears_text_lc_norm = three_bears_text_lc.translate(rdict)
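
* The translation table above deletes only four characters (`!`, `,`, `.` and `"`).
* As a minimal sketch, assuming you want to remove every ASCII punctuation mark instead, the standard library's `string.punctuation` can supply the characters to delete. The sample sentence is made up for illustration.

```python
import string

# Hypothetical sample text, used only for illustration
sample = 'Once upon a time, there was a little girl named "Goldilocks"!'

# Build a table that deletes every ASCII punctuation character
remove_punct = str.maketrans('', '', string.punctuation)

sample_norm = sample.lower().translate(remove_punct)
print(sample_norm)
```

* Note that this also deletes apostrophes, so a token like `someone's` would become `someones`. Whether that is acceptable depends on the analysis.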

3. Tokenize the text
    * split the normalized string into a list of tokens

In [63]:
three_bears_tokens = three_bears_text_lc_norm.split()

In [64]:
three_bears_tokens[:20]

['once',
 'upon',
 'a',
 'time',
 'there',
 'was',
 'a',
 'little',
 'girl',
 'named',
 'goldilocks',
 'she',
 'went',
 'for',
 'a',
 'walk',
 'in',
 'the',
 'forest',
 'pretty']

### We could do it all in one line like this...

In [65]:
three_bears_tokens = three_bears_text.lower().translate(str.maketrans('','','!,."')).split()

4. Create a frequency list
    * We can tally up the __TYPES__ in a list of __TOKENS__ with the `Counter` object

In [66]:
frequency_list = Counter(three_bears_tokens)

In [67]:
frequency_list.most_common(20)

[('the', 34),
 ('she', 29),
 ('in', 14),
 ('and', 13),
 ('porridge', 10),
 ('chair', 10),
 ("someone's", 9),
 ('been', 9),
 ('my', 9),
 ('bear', 9),
 ('was', 8),
 ('too', 8),
 ('goldilocks', 7),
 ('this', 7),
 ('it', 7),
 ('to', 7),
 ('three', 6),
 ('is', 6),
 ('so', 6),
 ('bed', 6)]
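
* To make the __TYPES__ versus __TOKENS__ distinction concrete, here is a small sketch with a made-up token list: the number of tokens is the length of the list, and the number of types is the number of distinct keys in the `Counter`.

```python
from collections import Counter

# Hypothetical token list, used only for illustration
tokens = ['the', 'bear', 'said', 'the', 'porridge', 'the', 'bear']

freq = Counter(tokens)

n_tokens = len(tokens)   # every running word counts
n_types = len(freq)      # each distinct word counts once

print(n_tokens, n_types)
print(freq.most_common(2))
```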

## Loop and filter idiom in Python

* We have started to look at using 

    1. __LOOPS__ to walk through all items in a list
    
    2. __CONDITIONS__ to select when or what to process in the list


#### Example: find words ending in `ing`

In [68]:
"said".endswith('ing')

False

In [69]:
"saying".endswith('ing')

True

In [70]:
ing_words = []

for token in three_bears_tokens:
    
    if token.endswith('ing'):
        ing_words.append(token)

In [71]:
ing_words

['feeling',
 'living',
 'sleeping',
 'eating',
 'eating',
 'eating',
 'sitting',
 'sitting',
 'sitting',
 'sleeping',
 'sleeping',
 'sleeping']

#### Example: find words ending in `ed`

In [74]:
my_list=[]

In [75]:
my_list.append('a')

In [76]:
my_list

['a']

In [77]:
my_list.append('b')

In [81]:
my_list

['a', 'b', 1]

In [82]:
my_list.append(1)
my_list.append(1)
my_list.append(1)

In [83]:
my_list

['a', 'b', 1, 1, 1, 1]

In [84]:
ed_words = []

for token in three_bears_tokens:
    
    if token.endswith('ed'):
        ed_words.append(token)

In [73]:
ed_words

['named',
 'knocked',
 'answered',
 'walked',
 'tasted',
 'exclaimed',
 'tasted',
 'tasted',
 'decided',
 'tired',
 'walked',
 'exclaimed',
 'whined',
 'tried',
 'sighed',
 'settled',
 'tired',
 'bed',
 'bed',
 'bed',
 'growled',
 'cried',
 'growled',
 'cried',
 'decided',
 'growled',
 'bed',
 'bed',
 'bed',
 'exclaimed',
 'screamed',
 'jumped',
 'opened',
 'returned']

* Then we can create a frequency list with a `Counter`

In [85]:
Counter(ed_words).most_common()

[('bed', 6),
 ('tasted', 3),
 ('exclaimed', 3),
 ('growled', 3),
 ('walked', 2),
 ('decided', 2),
 ('tired', 2),
 ('cried', 2),
 ('named', 1),
 ('knocked', 1),
 ('answered', 1),
 ('whined', 1),
 ('tried', 1),
 ('sighed', 1),
 ('settled', 1),
 ('screamed', 1),
 ('jumped', 1),
 ('opened', 1),
 ('returned', 1)]

#### Example: find instances of specific words (_types_)

In [11]:
word_list = ['ate','eat','eating','eaten', 
             'sleep', 'sleeping', 'slept',
             'sit', 'sitting', 'sat'
            ]

In [86]:
word_list

['ate',
 'eat',
 'eating',
 'eaten',
 'sleep',
 'sleeping',
 'slept',
 'sit',
 'sitting',
 'sat']

In [89]:
'ATE' in word_list

False

In [90]:
filtered_tokens = []

for t in three_bears_tokens:
    if t in word_list:
        filtered_tokens.append(t)

In [91]:
filtered_tokens

['ate',
 'eaten',
 'sat',
 'sat',
 'sleeping',
 'eating',
 'eating',
 'eating',
 'ate',
 'sitting',
 'sitting',
 'sitting',
 'sleeping',
 'sleeping',
 'sleeping']

#### A simpler example

In [93]:
alphabet = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
alphabet[:5]

['A', 'B', 'C', 'D', 'E']

In [94]:
len(alphabet)

26

In [95]:
alphabet[1], alphabet[7], alphabet[12]

('B', 'H', 'M')

In [96]:
for letter in alphabet:
    print(letter, end=', ')

A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 

In [101]:
list(range(10,20))

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [98]:
for pos in range(len(alphabet)):
    print(pos, alphabet[pos])

0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
10 K
11 L
12 M
13 N
14 O
15 P
16 Q
17 R
18 S
19 T
20 U
21 V
22 W
23 X
24 Y
25 Z


In [38]:
for pos, letter in enumerate(alphabet):
    print(pos, letter)

0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
10 K
11 L
12 M
13 N
14 O
15 P
16 Q
17 R
18 S
19 T
20 U
21 V
22 W
23 X
24 Y
25 Z


In [102]:
vowels = 'aeiou'

for pos, letter in enumerate(alphabet):
    if letter.lower() in vowels:
        print(pos, letter)

0 A
4 E
8 I
14 O
20 U


#### Example: A simple concordance listing

* This gives a simple way to build a KWIC concordance

In [40]:
for pos, word in enumerate(three_bears_tokens):
    if word in word_list:
        # max() keeps the start index from going negative near the start of the list
        print(three_bears_tokens[max(0, pos-4):pos+4])

['said', 'happily', 'and', 'she', 'ate', 'it', 'all', 'up']
['all', 'up', 'after', "she'd", 'eaten', 'the', 'three', "bears'"]
['saw', 'three', 'chairs', 'goldilocks', 'sat', 'in', 'the', 'first']
['she', 'exclaimed', 'so', 'she', 'sat', 'in', 'the', 'second']
['asleep', 'as', 'she', 'was', 'sleeping', 'the', 'three', 'bears']
['came', 'home', "someone's", 'been', 'eating', 'my', 'porridge', 'growled']
['papa', 'bear', "someone's", 'been', 'eating', 'my', 'porridge', 'said']
['mama', 'bear', "someone's", 'been', 'eating', 'my', 'porridge', 'and']
['my', 'porridge', 'and', 'they', 'ate', 'it', 'all', 'up']
['baby', 'bear', "someone's", 'been', 'sitting', 'in', 'my', 'chair']
['papa', 'bear', "someone's", 'been', 'sitting', 'in', 'my', 'chair']
['mama', 'bear', "someone's", 'been', 'sitting', 'in', 'my', 'chair']
['bear', 'growled', "someone's", 'been', 'sleeping', 'in', 'my', 'bed']
['my', 'bed', "someone's", 'been', 'sleeping', 'in', 'my', 'bed']
['mama', 'bear', "someone's", 'been', '
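
* The same idea can be wrapped in a function. This is only a sketch: the name `kwic`, the `width` parameter and the sample tokens are assumptions made for illustration, not part of the notebook.
* It keeps the keyword separate from its left and right context, and guards against a negative start index when a match falls near the beginning of the list.

```python
def kwic(tokens, keywords, width=4):
    """Return (left, keyword, right) context tuples for each keyword hit."""
    lines = []
    for pos, word in enumerate(tokens):
        if word in keywords:
            # max() stops a negative start index from wrapping around to the end
            left = tokens[max(0, pos - width):pos]
            right = tokens[pos + 1:pos + 1 + width]
            lines.append((' '.join(left), word, ' '.join(right)))
    return lines

# Hypothetical tokens, used only for illustration
tokens = "she ate it all up and then she sat in the first chair".split()

for left, word, right in kwic(tokens, ['ate', 'sat'], width=3):
    # right-align the left context so the keywords line up in a column
    print('{: >20}  {: <8}{}'.format(left, word, right))
```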

![](pythonista_badge.png)
### List comprehensions 

* OK so here is the cool stuff!
* **List comprehensions** are a way to do this kind of loop+condition list filtering in one line of code

In [42]:
some_letters = ['A','B','C','D','E']

* The syntax may seem a little strange at first.
* We use the same `for _ in _list_` construction, but without the `:` and indented block.
* Instead the code that would be in the block comes **BEFORE** the `for`
* Like this:

In [43]:
[l for l in some_letters]

['A', 'B', 'C', 'D', 'E']

* So here we've:
    1. looped over each item in the list and pointed to it with `l` 
    2. created a new list one item at a time, adding the value currently pointed to by `l` on each pass

In [44]:
some_letters2 = [letter for letter in some_letters]

In [45]:
some_letters2

['A', 'B', 'C', 'D', 'E']

* Ummm... so not a big deal right now BUT...
* We can add a condition after the `for _ in _list_` like this:

In [46]:
[l for l in some_letters if l in ['A','E']]

['A', 'E']

In [47]:
[l for l in some_letters*5 if l in ['A','E']]

['A', 'E', 'A', 'E', 'A', 'E', 'A', 'E', 'A', 'E']
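
* The expression before the `for` does not have to be the bare loop variable; it can transform each item as well.
* A small sketch with made-up tokens, shown next to the equivalent loop so the two can be compared:

```python
# Hypothetical tokens, used only for illustration
tokens = ['sleeping', 'bear', 'eating', 'porridge', 'sitting']

# Expression (token.upper()) comes first, then the loop, then the condition
ing_upper = [token.upper() for token in tokens if token.endswith('ing')]

# The equivalent loop-and-filter version
ing_upper_loop = []
for token in tokens:
    if token.endswith('ing'):
        ing_upper_loop.append(token.upper())

print(ing_upper)
print(ing_upper == ing_upper_loop)
```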

### And...

* This...

In [None]:
filtered_tokens = []
for token in three_bears_tokens:
    if token in word_list:
        filtered_tokens.append(token)

* Becomes this...

In [49]:
filtered_tokens2 = [token for token in three_bears_tokens if token in word_list]

In [50]:
filtered_tokens

['ate',
 'eaten',
 'sat',
 'sat',
 'sleeping',
 'eating',
 'eating',
 'eating',
 'ate',
 'sitting',
 'sitting',
 'sitting',
 'sleeping',
 'sleeping',
 'sleeping']

In [51]:
filtered_tokens2

['ate',
 'eaten',
 'sat',
 'sat',
 'sleeping',
 'eating',
 'eating',
 'eating',
 'ate',
 'sitting',
 'sitting',
 'sitting',
 'sleeping',
 'sleeping',
 'sleeping']

### OK enough for now... 

* But spend some time trying this idiom out... you will see it **A LOT**!

## Building a corpus from a single text file

* You'll often find you get a large text file that contains many texts that you want to be able to analyze individually.
* It is usually best to split them into a series of smaller files, or into a list that you save in a spreadsheet-like CSV file (rows and columns).
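
* As a rough sketch of the CSV option, assuming the texts have already been split into (title, text) pairs: the standard library's `csv` module takes care of quoting commas and line breaks inside the text. The file name `corpus_example.csv` and the sample rows are hypothetical.

```python
import csv

# Hypothetical (title, text) pairs standing in for texts split out of a larger file
texts = [
    ('first_story', 'Once upon a time...'),
    ('second_story', 'There was once a miller, who...'),
]

# newline='' is what the csv module documentation recommends when opening files
with open('corpus_example.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['title', 'text'])   # header row
    writer.writerows(texts)              # one row per text

# Read the file back to check the round trip
with open('corpus_example.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

print(rows[0])
print(len(rows) - 1, 'texts saved')
```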

### A couple of quick examples

1. The Brothers Grimm Fairy Tales from _Project Gutenberg_ (http://www.gutenberg.org/files/2591/2591-0.txt)
      
2. The downloaded results of a search using _LexisNexis_

In [1]:
grimm_url = 'http://www.gutenberg.org/files/2591/2591-0.txt'

* We can use the `requests.get()` function to load the contents of an internet resource (e.g. a web page) into a string

In [103]:
resp=requests.get(grimm_url)
grimm_text = resp.text

In [108]:
print(grimm_text[:3000])

﻿The Project Gutenberg EBook of Grimms’ Fairy Tales, by The Brothers Grimm

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Grimms’ Fairy Tales

Author: The Brothers Grimm

Translator: Edgar Taylor and Marian Edwardes

Posting Date: December 14, 2008 [EBook #2591]
Release Date: April, 2001
Last Updated: November 7, 2016

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***




Produced by Emma Dudding, John Bickers, and Dagny





FAIRY TALES

By The Brothers Grimm



PREPARER’S NOTE

     The text is based on translations from
     the Grimms’ Kinder und Hausmarchen by
     Edgar Taylor and Marian Edwardes.




CONTENTS:

     THE GOLDEN BIRD
     HANS IN LUCK
     

* Then we can use familiar string functions and slicing to locate key bits of the large file that we might want to split up.

In [105]:
len(grimm_text)

549812

In [106]:
grimm_text.find('CONTENTS')

941

In [109]:
grimm_text.find('THE BROTHERS GRIMM')

2761

In [110]:
contents=grimm_text[950:2761]
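
* The start index `950` is the position found for `CONTENTS` (941) plus the nine characters of `CONTENTS:`.
* Rather than typing the numbers in by hand, the slice can be computed from the `find()` results. A minimal sketch on a made-up miniature of the file layout (the `sample` string and the marker variable names are assumptions for illustration):

```python
# Hypothetical miniature of the file layout, used only for illustration
sample = (
    "PREPARER'S NOTE\r\n\r\n"
    "CONTENTS:\r\n\r\n"
    "     THE GOLDEN BIRD\r\n"
    "     HANS IN LUCK\r\n\r\n"
    "THE BROTHERS GRIMM FAIRY TALES\r\n\r\n"
    "THE GOLDEN BIRD\r\n\r\nStory text goes here\r\n"
)

start_marker = 'CONTENTS:'
end_marker = 'THE BROTHERS GRIMM'

# find() returns -1 when the marker is missing, so check before slicing
start = sample.find(start_marker)
if start == -1:
    raise ValueError('start marker not found')
start += len(start_marker)          # begin just after the marker

end = sample.find(end_marker, start)  # search only after the start position
if end == -1:
    raise ValueError('end marker not found')

contents = sample[start:end]
titles = [line.strip() for line in contents.split('\r\n') if line.strip()]
print(titles)
```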

In [111]:
print(contents)



     THE GOLDEN BIRD
     HANS IN LUCK
     JORINDA AND JORINDEL
     THE TRAVELLING MUSICIANS
     OLD SULTAN
     THE STRAW, THE COAL, AND THE BEAN
     BRIAR ROSE
     THE DOG AND THE SPARROW
     THE TWELVE DANCING PRINCESSES
     THE FISHERMAN AND HIS WIFE
     THE WILLOW-WREN AND THE BEAR
     THE FROG-PRINCE
     CAT AND MOUSE IN PARTNERSHIP
     THE GOOSE-GIRL
     THE ADVENTURES OF CHANTICLEER AND PARTLET
       1. HOW THEY WENT TO THE MOUNTAINS TO EAT NUTS
       2. HOW CHANTICLEER AND PARTLET WENT TO VISIT MR KORBES
     RAPUNZEL
     FUNDEVOGEL
     THE VALIANT LITTLE TAILOR
     HANSEL AND GRETEL
     THE MOUSE, THE BIRD, AND THE SAUSAGE
     MOTHER HOLLE
     LITTLE RED-CAP [LITTLE RED RIDING HOOD]
     THE ROBBER BRIDEGROOM
     TOM THUMB
     RUMPELSTILTSKIN
     CLEVER GRETEL
     THE OLD MAN AND HIS GRANDSON
     THE LITTLE PEASANT
     FREDERICK AND CATHERINE
     SWEETHEART ROLAND
     SNOWDROP
     THE PINK
     CLEVER ELSIE
 

In [113]:
# keep the non-empty lines and strip their leading indentation
contents_list=[line.strip() for line in contents.split('\r\n') if line]

In [114]:
for i,item in enumerate(contents_list):
    print(i,item)

0 THE GOLDEN BIRD
1 HANS IN LUCK
2 JORINDA AND JORINDEL
3 THE TRAVELLING MUSICIANS
4 OLD SULTAN
5 THE STRAW, THE COAL, AND THE BEAN
6 BRIAR ROSE
7 THE DOG AND THE SPARROW
8 THE TWELVE DANCING PRINCESSES
9 THE FISHERMAN AND HIS WIFE
10 THE WILLOW-WREN AND THE BEAR
11 THE FROG-PRINCE
12 CAT AND MOUSE IN PARTNERSHIP
13 THE GOOSE-GIRL
14 THE ADVENTURES OF CHANTICLEER AND PARTLET
15 1. HOW THEY WENT TO THE MOUNTAINS TO EAT NUTS
16 2. HOW CHANTICLEER AND PARTLET WENT TO VISIT MR KORBES
17 RAPUNZEL
18 FUNDEVOGEL
19 THE VALIANT LITTLE TAILOR
20 HANSEL AND GRETEL
21 THE MOUSE, THE BIRD, AND THE SAUSAGE
22 MOTHER HOLLE
23 LITTLE RED-CAP [LITTLE RED RIDING HOOD]
24 THE ROBBER BRIDEGROOM
25 TOM THUMB
26 RUMPELSTILTSKIN
27 CLEVER GRETEL
28 THE OLD MAN AND HIS GRANDSON
29 THE LITTLE PEASANT
30 FREDERICK AND CATHERINE
31 SWEETHEART ROLAND
32 SNOWDROP
33 THE PINK
34 CLEVER ELSIE
35 THE MISER IN THE BUSH
36 ASHPUTTEL
37 THE WHITE SNAKE
38 THE WOLF AND THE SEVEN LITTLE KIDS
39 THE QUEEN BEE
40 THE ELVES

In [115]:
story_titles=[item for i,item in enumerate(contents_list) if i not in (15,16,42,57,58)]

In [116]:
story_titles

['THE GOLDEN BIRD',
 'HANS IN LUCK',
 'JORINDA AND JORINDEL',
 'THE TRAVELLING MUSICIANS',
 'OLD SULTAN',
 'THE STRAW, THE COAL, AND THE BEAN',
 'BRIAR ROSE',
 'THE DOG AND THE SPARROW',
 'THE TWELVE DANCING PRINCESSES',
 'THE FISHERMAN AND HIS WIFE',
 'THE WILLOW-WREN AND THE BEAR',
 'THE FROG-PRINCE',
 'CAT AND MOUSE IN PARTNERSHIP',
 'THE GOOSE-GIRL',
 'THE ADVENTURES OF CHANTICLEER AND PARTLET',
 'RAPUNZEL',
 'FUNDEVOGEL',
 'THE VALIANT LITTLE TAILOR',
 'HANSEL AND GRETEL',
 'THE MOUSE, THE BIRD, AND THE SAUSAGE',
 'MOTHER HOLLE',
 'LITTLE RED-CAP [LITTLE RED RIDING HOOD]',
 'THE ROBBER BRIDEGROOM',
 'TOM THUMB',
 'RUMPELSTILTSKIN',
 'CLEVER GRETEL',
 'THE OLD MAN AND HIS GRANDSON',
 'THE LITTLE PEASANT',
 'FREDERICK AND CATHERINE',
 'SWEETHEART ROLAND',
 'SNOWDROP',
 'THE PINK',
 'CLEVER ELSIE',
 'THE MISER IN THE BUSH',
 'ASHPUTTEL',
 'THE WHITE SNAKE',
 'THE WOLF AND THE SEVEN LITTLE KIDS',
 'THE QUEEN BEE',
 'THE ELVES AND THE SHOEMAKER',
 'THE JUNIPER-TREE',
 'THE TURNIP',
 'C

In [119]:
grimm_text[-10000:]

'ied in Section 4, “Information about donations to\r\n     the Project Gutenberg Literary Archive Foundation.”\r\n\r\n- You provide a full refund of any money paid by a user who notifies\r\n     you in writing (or by e-mail) within 30 days of receipt that s/he\r\n     does not agree to the terms of the full Project Gutenberg-tm\r\n     License.  You must require such a user to return or\r\n     destroy all copies of the works possessed in a physical medium\r\n     and discontinue all use of and all access to other copies of\r\n     Project Gutenberg-tm works.\r\n\r\n- You provide, in accordance with paragraph 1.F.3, a full refund of any\r\n     money paid for a work or a replacement copy, if a defect in the\r\n     electronic work is discovered and reported to you within 90 days\r\n     of receipt of the work.\r\n\r\n- You comply with all other terms of this agreement for free\r\n     distribution of Project Gutenberg-tm works.\r\n\r\n1.E.9.  If you wish to charge a fee or distribute a

In [120]:
stories=grimm_text[2800:grimm_text.find('*****')]

In [122]:
stories[-200:]

'd peacefully and happily with\r\nher children for many years. She took the two rose-trees with her, and\r\nthey stood before her window, and every year bore the most beautiful\r\nroses, white and red.\r\n\r\n\r\n'

In [124]:
story_titles

['THE GOLDEN BIRD',
 'HANS IN LUCK',
 'JORINDA AND JORINDEL',
 'THE TRAVELLING MUSICIANS',
 'OLD SULTAN',
 'THE STRAW, THE COAL, AND THE BEAN',
 'BRIAR ROSE',
 'THE DOG AND THE SPARROW',
 'THE TWELVE DANCING PRINCESSES',
 'THE FISHERMAN AND HIS WIFE',
 'THE WILLOW-WREN AND THE BEAR',
 'THE FROG-PRINCE',
 'CAT AND MOUSE IN PARTNERSHIP',
 'THE GOOSE-GIRL',
 'THE ADVENTURES OF CHANTICLEER AND PARTLET',
 'RAPUNZEL',
 'FUNDEVOGEL',
 'THE VALIANT LITTLE TAILOR',
 'HANSEL AND GRETEL',
 'THE MOUSE, THE BIRD, AND THE SAUSAGE',
 'MOTHER HOLLE',
 'LITTLE RED-CAP [LITTLE RED RIDING HOOD]',
 'THE ROBBER BRIDEGROOM',
 'TOM THUMB',
 'RUMPELSTILTSKIN',
 'CLEVER GRETEL',
 'THE OLD MAN AND HIS GRANDSON',
 'THE LITTLE PEASANT',
 'FREDERICK AND CATHERINE',
 'SWEETHEART ROLAND',
 'SNOWDROP',
 'THE PINK',
 'CLEVER ELSIE',
 'THE MISER IN THE BUSH',
 'ASHPUTTEL',
 'THE WHITE SNAKE',
 'THE WOLF AND THE SEVEN LITTLE KIDS',
 'THE QUEEN BEE',
 'THE ELVES AND THE SHOEMAKER',
 'THE JUNIPER-TREE',
 'THE TURNIP',
 'C

In [30]:
for sidx, title in enumerate(story_titles):
    start_pos = stories.find(title)
    
    if sidx==len(story_titles)-1:
        end_pos = len(stories)
    else:
        end_pos = stories.find(story_titles[sidx+1])
        
    print("{: <5}{: <50}{: >10}{: >10}".format(sidx, title, start_pos, end_pos))

0    THE GOLDEN BIRD                                            1     12719
1    HANS IN LUCK                                           12719     24688
2    JORINDA AND JORINDEL                                   24688     30710
3    THE TRAVELLING MUSICIANS                               30710     37764
4    OLD SULTAN                                             37764     42232
5    THE STRAW, THE COAL, AND THE BEAN                      42232     44987
6    BRIAR ROSE                                             44987     52935
7    THE DOG AND THE SPARROW                                52935     59887
8    THE TWELVE DANCING PRINCESSES                          59887     68395
9    THE FISHERMAN AND HIS WIFE                             68395     79350
10   THE WILLOW-WREN AND THE BEAR                           79350     84258
11   THE FROG-PRINCE                                        84258     90425
12   CAT AND MOUSE IN PARTNERSHIP                           90425     95618
13   THE GOO
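
* The start/end logic in the loop above can also be packaged as a function that returns the sections directly. This is a sketch under assumptions: `split_by_titles` and the sample text are hypothetical, and each search starts from the previous hit so that a title quoted earlier in the text is not matched by mistake.

```python
def split_by_titles(text, titles):
    """Return a dict mapping each title to the text from that title up to the next one."""
    positions = []
    search_from = 0
    for title in titles:
        # search onward from the previous title, not from the start of the text
        pos = text.find(title, search_from)
        if pos == -1:
            raise ValueError('title not found: {}'.format(title))
        positions.append(pos)
        search_from = pos + len(title)

    # Each section ends where the next one starts; the last runs to the end
    ends = positions[1:] + [len(text)]
    return {t: text[s:e] for t, s, e in zip(titles, positions, ends)}

# Hypothetical miniature text, used only for illustration
sample = "\nTHE FOX\n\nA fox ran.\n\nTHE CAT\n\nA cat sat.\n"
sections = split_by_titles(sample, ['THE FOX', 'THE CAT'])

for title, body in sections.items():
    print(title, len(body))
```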

In [125]:
# create the output folder; exist_ok=True makes this a no-op if it already exists
os.makedirs('data/fairy_tales', exist_ok=True)

In [126]:


for sidx, title in enumerate(story_titles):
    start_pos = stories.find(title)
    
    if sidx==len(story_titles)-1:
        end_pos = len(stories)
    else:
        end_pos = stories.find(story_titles[sidx+1])
        
    print(sidx, story_titles[sidx], start_pos, end_pos)
    
    filename = title.replace(' ','_').replace('[','').replace(']','').replace(',','')
    
    # the stories contain curly quotes, so state the encoding explicitly
    with open('data/fairy_tales/{}.txt'.format(filename.lower()),'w', encoding='utf-8') as out:
        out.write(stories[start_pos:end_pos])
        print('\twritten to data/fairy_tales/{}.txt\n'.format(filename.lower()))

0 THE GOLDEN BIRD 1 12719
	written to data/fairy_tales/the_golden_bird.txt

1 HANS IN LUCK 12719 24688
	written to data/fairy_tales/hans_in_luck.txt

2 JORINDA AND JORINDEL 24688 30710
	written to data/fairy_tales/jorinda_and_jorindel.txt

3 THE TRAVELLING MUSICIANS 30710 37764
	written to data/fairy_tales/the_travelling_musicians.txt

4 OLD SULTAN 37764 42232
	written to data/fairy_tales/old_sultan.txt

5 THE STRAW, THE COAL, AND THE BEAN 42232 44987
	written to data/fairy_tales/the_straw_the_coal_and_the_bean.txt

6 BRIAR ROSE 44987 52935
	written to data/fairy_tales/briar_rose.txt

7 THE DOG AND THE SPARROW 52935 59887
	written to data/fairy_tales/the_dog_and_the_sparrow.txt

8 THE TWELVE DANCING PRINCESSES 59887 68395
	written to data/fairy_tales/the_twelve_dancing_princesses.txt

9 THE FISHERMAN AND HIS WIFE 68395 79350
	written to data/fairy_tales/the_fisherman_and_his_wife.txt

10 THE WILLOW-WREN AND THE BEAR 79350 84258
	written to data/fairy_tales/the_willow-wren_and_the_bear.
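
* The chain of `.replace()` calls handles the characters that occur in these particular titles. As a more general sketch, a regular expression can replace any run of other characters in one step. The helper name `title_to_filename` is hypothetical.

```python
import re

def title_to_filename(title):
    """Turn a story title into a lowercase, underscore-separated file name."""
    # Replace each run of characters other than letters, digits or hyphens with one underscore
    slug = re.sub(r'[^a-z0-9-]+', '_', title.lower())
    # Drop any underscore left at either end, then add the extension
    return slug.strip('_') + '.txt'

print(title_to_filename('THE STRAW, THE COAL, AND THE BEAN'))
print(title_to_filename('LITTLE RED-CAP [LITTLE RED RIDING HOOD]'))
```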

## Navigating a web page and extracting content with Python

* In the Jupyter file browser, open the file `my_webpage.html` in two ways:
    1. by clicking on it (this should open it as a webpage)
    2. by selecting its checkbox and then clicking the Edit button
    

In [127]:
content=open('my_webpage.html').read()

print(content)

<html>
    
    
    <head>
        <title>My webpage</title>
        
        <style>
            
        </style>
    </head>
    
    <body>
        <h1>This is a top level heading</h1>
        
        <p>A paragraph of text.</p>
        
        <p>Another one</p>
        
        <h2>A second level heading</h2>
        
        <p>And then another paragraph with text.</p>
    </body>
</html>


In [128]:
Counter(content.split()).most_common()

[('level', 2),
 ('paragraph', 2),
 ('text.</p>', 2),
 ('<html>', 1),
 ('<head>', 1),
 ('<title>My', 1),
 ('webpage</title>', 1),
 ('<style>', 1),
 ('</style>', 1),
 ('</head>', 1),
 ('<body>', 1),
 ('<h1>This', 1),
 ('is', 1),
 ('a', 1),
 ('top', 1),
 ('heading</h1>', 1),
 ('<p>A', 1),
 ('of', 1),
 ('<p>Another', 1),
 ('one</p>', 1),
 ('<h2>A', 1),
 ('second', 1),
 ('heading</h2>', 1),
 ('<p>And', 1),
 ('then', 1),
 ('another', 1),
 ('with', 1),
 ('</body>', 1),
 ('</html>', 1)]

In [129]:
html_doc = BeautifulSoup(content, 'lxml')

In [130]:
html_doc.find_all('p')

[<p>A paragraph of text.</p>,
 <p>Another one</p>,
 <p>And then another paragraph with text.</p>]

In [131]:
html_doc.find_all(['p','h1','h2'])

[<h1>This is a top level heading</h1>,
 <p>A paragraph of text.</p>,
 <p>Another one</p>,
 <h2>A second level heading</h2>,
 <p>And then another paragraph with text.</p>]

In [38]:
content_tags = html_doc.find_all(['p','h1','h2'])

for tag in content_tags:
    print(tag.text)

This is a top level heading
A paragraph of text.
Another one
A second level heading
And then another paragraph with text.


In [132]:
content = [t.text for t in content_tags]

In [133]:
content

['This is a top level heading',
 'A paragraph of text.',
 'Another one',
 'A second level heading',
 'And then another paragraph with text.']

In [134]:
Counter('\n '.join(content).split()).most_common()

[('level', 2),
 ('heading', 2),
 ('A', 2),
 ('paragraph', 2),
 ('text.', 2),
 ('This', 1),
 ('is', 1),
 ('a', 1),
 ('top', 1),
 ('of', 1),
 ('Another', 1),
 ('one', 1),
 ('second', 1),
 ('And', 1),
 ('then', 1),
 ('another', 1),
 ('with', 1)]
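
* To see roughly what a parser does when it separates tags from text, here is a sketch that uses only the standard library's `html.parser` module instead of BeautifulSoup. This is a swapped-in technique for illustration, not how the notebook does it; the class name `TagTextExtractor` is hypothetical and the HTML is a shortened copy of the page above.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect the text found inside a chosen set of tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.depth = 0        # how many wanted tags we are currently inside
        self.chunks = []      # one string per wanted element

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self.depth += 1
            self.chunks.append('')

    def handle_endtag(self, tag):
        if tag in self.wanted_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # text outside the wanted tags (e.g. the <title>) is ignored
        if self.depth > 0:
            self.chunks[-1] += data

html = """<html><head><title>My webpage</title></head>
<body><h1>This is a top level heading</h1>
<p>A paragraph of text.</p><p>Another one</p></body></html>"""

parser = TagTextExtractor(['p', 'h1', 'h2'])
parser.feed(html)
print(parser.chunks)
```

* For real pages BeautifulSoup is far more forgiving of messy markup, which is why the notebook uses it.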

### OK, a real-world example

* Example news article: 
    * https://www.theguardian.com/technology/shortcuts/2019/feb/10/how-did-apples-airpods-go-from-mockery-to-millennial-status-symbol

In [135]:
article_url = 'https://www.theguardian.com/technology/shortcuts/2019/feb/10/how-did-apples-airpods-go-from-mockery-to-millennial-status-symbol'

In [136]:
page_content = requests.get(article_url).text

In [137]:
print(page_content[:3000])


<!DOCTYPE html>
<html id="js-context" class="js-off is-not-modern id--signed-out" lang="en" data-page-path="/technology/shortcuts/2019/feb/10/how-did-apples-airpods-go-from-mockery-to-millennial-status-symbol">
<head>
<!--
     __        __                      _     _      _
     \ \      / /__    __ _ _ __ ___  | |__ (_)_ __(_)_ __   __ _
      \ \ /\ / / _ \  / _` | '__/ _ \ | '_ \| | '__| | '_ \ / _` |
       \ V  V /  __/ | (_| | | |  __/ | | | | | |  | | | | | (_| |
        \_/\_/ \___|  \__,_|_|  \___| |_| |_|_|_|  |_|_| |_|\__, |
                                                            |___/
    Ever thought about joining us?
    https://workforus.theguardian.com/careers/digital-development/
     --->
<title>How did Apple’s AirPods go from mockery to millennial status symbol? | Technology | The Guardian</title>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge"/>
<meta name="format-detection" content="telephone=no"/>
<meta name="HandheldFriendly" co

In [138]:
Counter(page_content.split()).most_common(20)

[('(min-width:', 646),
 ('0', 304),
 (':', 230),
 ('.element-pullquote.element--showcase', 160),
 ('solid', 144),
 ('<li', 141),
 ('</li>', 137),
 ('<a', 135),
 ('.meta__twitter', 133),
 ('(max-width:', 130),
 ('.element-pullquote', 130),
 ('<div', 126),
 ('.meta__email', 123),
 ('</a>', 118),
 ('data-link-name="nav2', 112),
 ('.inline-garnett-quote', 104),
 ('</div>', 101),
 ('<span', 89),
 ('and', 77),
 ('.content__meta-container', 77)]

In [139]:
doc = BeautifulSoup(page_content, 'lxml')

In [140]:
article_content = doc.find('div', {'class': 'content__article-body'})

In [141]:
article_content.find_all('p')

[<p><span class="drop-cap"><span class="drop-cap__inner">O</span></span>f all the widely ridiculed tech products, Apple’s AirPods have experienced an extraordinary turnaround. Back in 2016, they were roundly mocked by the tech industry. Tiny wireless earbuds? It seemed like a recipe for disaster – streets would be littered with these lost headphones, which would clutter up city pavements like discarded gloves and babies’ socks.</p>,
 <p>“If only there were an invention that could keep those AirPods tethered together, like a string,” wrote Ashley Esqueda from the tech website CNET <a class="u-underline" data-link-name="in body link" href="https://twitter.com/AshleyEsqueda/status/773590547625746432">on Twitter</a>. “The beauty of the headphone cable is just like the beauty of a tampon string: it is there to help you keep track of a very important item,” wrote <a class="u-underline" data-link-name="in body link" href="https://www.theguardian.com/technology/2016/sep/07/apple-airpods-launch

In [142]:
para_text=[p.text for p in article_content.find_all('p')]

In [143]:
print(para_text)

['Of all the widely ridiculed tech products, Apple’s AirPods have experienced an extraordinary turnaround. Back in 2016, they were roundly mocked by the tech industry. Tiny wireless earbuds? It seemed like a recipe for disaster – streets would be littered with these lost headphones, which would clutter up city pavements like discarded gloves and babies’ socks.', '“If only there were an invention that could keep those AirPods tethered together, like a string,” wrote Ashley Esqueda from the tech website CNET on Twitter. “The beauty of the headphone cable is just like the beauty of a tampon string: it is there to help you keep track of a very important item,” wrote Julia Carrie Wong in the Guardian.', 'But fast-forward to 2019 and, somehow, the £159-a-pair little pods have transformed into a bona fide status symbol. Diana Ross has a pair, Kristen Stewart wears them and a woman in Virginia has even started a cottage industry by turning them into earrings for people (which does solve the pr

In [144]:
Counter('\n '.join(para_text).split()).most_common(20)

[('the', 26),
 ('a', 12),
 ('of', 11),
 ('AirPods', 5),
 ('like', 5),
 ('–', 5),
 ('and', 5),
 ('is', 5),
 ('to', 5),
 ('have', 4),
 ('them', 4),
 ('people', 4),
 ('who', 4),
 ('tech', 3),
 ('in', 3),
 ('by', 3),
 ('for', 3),
 ('with', 3),
 ('on', 3),
 ('it', 3)]
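
* Notice that the counts above still separate `AirPods` from any lowercase form and treat the dash `–` as a word, because the paragraphs were not normalized first.
* A minimal sketch of applying the earlier lowercase-and-strip steps before counting. The `paragraphs` list is a made-up stand-in for `para_text`, and the set of characters to delete is an assumption that you would adjust for your own texts.

```python
from collections import Counter

# Hypothetical paragraph strings standing in for `para_text`
paragraphs = [
    'The pods were mocked. The pods were lost!',
    '“Tiny” earbuds – the status symbol?',
]

# Characters to delete: some ASCII punctuation plus the curly double quotes and dash
remove_chars = str.maketrans('', '', '!,."?“”–')

# Join the paragraphs, normalize, then tokenize on whitespace
tokens = ' '.join(paragraphs).lower().translate(remove_chars).split()

print(Counter(tokens).most_common(3))
```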