
# Natural Language Processing (NLP)

Analyzing, understanding, and generating natural human languages using computers.


# Web data extraction (Web scraping)



### 1. Using Python urllib with urlopen 

Use urlopen from Python's urllib.request module to access data from the web in an automated fashion.


Alternatively, you can use a web browser to go to a specific URL and save the page as text to a
local file. Then read the text data from your computer with open(), as shown in NLP Part I.



The Internet is a large source of text data. Let's access more text from the web ...


### Accessing free ebooks from Project Gutenberg:

The code below loads the text file at the specified URL, just as a browser would:

- The instruction print(raw) displays the downloaded text.

In [1]:
#Loading text 2554:

from urllib.request import urlopen
URL = "http://www.gutenberg.org/files/2554/2554-0.txt"
file = urlopen(URL)
raw = file.read().decode("utf8")

In [2]:
type(raw)

str
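The loading step above can be wrapped in a small helper. The sketch below is one possible way to do it, not the notebook's own method: `fetch_text`, its `timeout` and `default_encoding` parameters are hypothetical names chosen for illustration. It closes the connection with a context manager and uses the charset declared by the server when one is present. The demonstration uses a `data:` URL so that it runs without network access.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=30, default_encoding="utf-8"):
    """Download `url` and return its body decoded to a str (hypothetical helper)."""
    # The context manager closes the connection even if read() fails.
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_encoding
    return data.decode(charset)


# Offline demonstration: urlopen also understands data: URLs.
demo = fetch_text("data:text/plain;charset=utf-8,Hello%20Gutenberg")
print(demo)
```

Calling `fetch_text(URL)` with the Project Gutenberg address used above should return the same kind of string as `raw`, assuming the server is reachable.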

In [3]:
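import nltk
from nltk import word_tokenize, sent_tokenize
# Both tokenizers need NLTK's Punkt models; if they are missing, run nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead)
text = word_tokenize(raw)  # list of word and punctuation tokens
sent = sent_tokenize(raw)  # list of sentences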

In [4]:
type(text)

list

In [5]:
type(sent)

list

In [6]:
text[600:625]

['to',
 'the',
 'subject',
 'in',
 'his',
 'writings',
 '.',
 'He',
 'describes',
 'the',
 'awful',
 'agony',
 'of',
 'the',
 'condemned',
 'man',
 'and',
 'insists',
 'on',
 'the',
 'cruelty',
 'of',
 'inflicting',
 'such',
 'torture']

In [13]:
sent[20]

'The intense suffering of this experience left a lasting stamp on\r\nDostoevsky’s mind.'
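The sentence above still contains the line break (`\r\n`) from the source file. A minimal sketch of one way to flatten it, assuming a single space is an acceptable replacement for any run of whitespace; `normalize_whitespace` is a hypothetical helper name:

```python
def normalize_whitespace(sentence):
    # str.split() with no argument splits on any run of whitespace,
    # including "\r\n", so joining with one space flattens the line break.
    return " ".join(sentence.split())


raw_sentence = "The intense suffering of this experience left a lasting stamp on\r\nDostoevsky’s mind."
flat_sentence = normalize_whitespace(raw_sentence)
print(flat_sentence)
```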

In [17]:
#Load text 18440:

from urllib.request import urlopen
URLbook = "https://www.gutenberg.org/files/18440/18440-0.txt"
file_book = urlopen(URLbook)
raw_book = file_book.read().decode("utf8")

In [18]:
text_book = word_tokenize(raw_book)

In [21]:
text_book[510:525]

['Reasoning',
 ',',
 "''",
 'gives',
 'an',
 'admirably',
 'succinct',
 'account',
 'of',
 'their',
 'position',
 '.',
 'I',
 'agree',
 'with']

In [22]:
sent_book = sent_tokenize(raw_book)

In [29]:
sent_book[180:184]

['Precise thinking needs precise language                         348\r\n§2.',
 'Nomenclature and Terminology                                    349\r\n§3.',
 'Definition                                                      352\r\n§4.',
 'Rules for testing a Definition                                  352\r\n§5.']

##  HTML 

###  urlopen request

In [7]:
from urllib.request import urlopen
url ="https://www.bbc.com/future/article/20210119-why-saving-whales-can-help-fight-climate-change"

file = urlopen(url)
html = file.read().decode("utf8")
html[799:848]

'title" content="How whales help cool the Earth"/>'
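Slicing by fixed character positions, as in `html[799:848]`, breaks as soon as the page changes. The sketch below shows one alternative that uses only the standard library's `html.parser`. The sample markup, the `og:title` property name, and the `MetaContentFinder` class are assumptions made for illustration; the real page may label its title differently.

```python
from html.parser import HTMLParser


class MetaContentFinder(HTMLParser):
    """Collect the content of <meta> tags whose name/property ends with a suffix."""

    def __init__(self, key_suffix):
        super().__init__()
        self.key_suffix = key_suffix
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name") or ""
        if key.endswith(self.key_suffix) and attr.get("content") is not None:
            self.contents.append(attr["content"])


# Hypothetical fragment standing in for the downloaded page.
sample_html = (
    '<html><head>'
    '<meta property="og:title" content="How whales help cool the Earth"/>'
    '</head><body></body></html>'
)
finder = MetaContentFinder("title")
finder.feed(sample_html)
print(finder.contents)
```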

In [8]:
# Type print(html) to display the HTML document

### Python API  requests  


Some websites offer data downloadable in text format. 


Twitter and Facebook provide access to some data through their APIs.


- API (Application Programming Interface) = an interface to a server that lets you retrieve (and send) data using code.

To access data from websites that don’t offer such options, download the HTML of their pages directly:

- Requests = a simple HTTP library for Python that makes it easy to send requests, with no need to manually add query strings to URLs.


The code below makes a GET request to a web server, which downloads the HTML contents of the specified webpage:

In [9]:
import requests
page = requests.get("https://www.mcdonalds.com/us/en-us/full-menu.html")
page

<Response [200]>

In [10]:
# Status code 200 means the webpage was downloaded successfully
page.status_code

200
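Two points from this section can be sketched in code: Requests builds the query string for you, and the status code can be checked automatically. In the example below the `example.com` endpoint, its parameters, and the `download_html` helper are hypothetical. The URL is only prepared, not sent, so this part runs offline.

```python
import requests

# Hypothetical endpoint and parameters, used only to show URL construction.
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "whales", "page": 2},
)
prepared = request.prepare()
print(prepared.url)  # the query string was appended and encoded by requests


def download_html(url, timeout=10):
    """Return the HTML of `url` as text, raising on HTTP errors (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turns 4xx/5xx status codes into exceptions
    return response.text
```

With `raise_for_status()` there is no need to inspect `status_code` by hand; a failed download stops the program instead of passing an error page on to the parser.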

In [11]:
# Type print(page.content) to display the HTML content

# Extracting Text out of HTML:



Using Python's BeautifulSoup to parse HTML 


In [14]:
from bs4 import BeautifulSoup

In [15]:
page = BeautifulSoup(page.content, 'html.parser')

Get the text out of the HTML:

In [16]:
# Find all <p> tags and extract the text of the first one:

page.find_all('p')[0].get_text()

'We’re putting safety first. Learn how.'
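To collect the text of every paragraph rather than only the first, `find_all` can be combined with a list comprehension. This is a sketch on a hypothetical HTML fragment, not on the live page, so the result does not depend on what the site currently serves:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page.
sample_html = """
<html><body>
  <p>We’re putting safety first. <a href="#">Learn how.</a></p>
  <p>  Order Now  </p>
  <div>Not a paragraph</div>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")
# get_text(" ", strip=True) joins nested strings with a space and trims each one.
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
print(paragraphs)
```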

In [17]:
text_page = page.get_text()
text_page[:100]

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMcDonald's Menu: Our Full McDonald's Food Menu | McDonald's\n\n\n\n\n\n\n\n"

In [29]:
# Slice the string first, then print it. print() returns None, so
# print(text_page)[:50] prints the whole page and then raises
# TypeError: 'NoneType' object is not subscriptable
print(text_page[:50])

In [18]:
#get text out of the other HTML file:

text_html = BeautifulSoup(html, 'html.parser').get_text()

text_html[:25]

'\n\n\n\n\nHow whales help cool'

In [19]:
text_html[:100]

'\n\n\n\n\nHow whales help cool the Earth - BBC Future\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
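Most of the extracted text is blank lines. A minimal sketch of one way to compact it before further cleaning, assuming blank lines carry no meaning for the analysis; `compact_lines` is a hypothetical helper, and the second line of the sample string is invented for the demonstration:

```python
def compact_lines(text):
    """Drop blank lines and trim the whitespace around the remaining ones."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


sample_text = (
    "\n\n\n\n\nHow whales help cool the Earth - BBC Future\n\n\n\n  \n\n"
    "Second line (hypothetical)\n\n"
)
compact = compact_lines(sample_text)
print(compact)
```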

## Using Regular Expressions

Further text cleaning is needed before tokenizing: besides HTML tags, the text may still contain leftovers from image maps, JavaScript, forms, and tables.


In [20]:
import re

def clean_text(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML markup
    # Find emoticons before the punctuation they are made of is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower())  # lowercase the text and replace non-word characters with spaces
    text = text + " ".join(emoticons).replace('-', '')  # append the emoticons at the end
    return text

In [21]:
clean_text_html = clean_text(text_html)
clean_text_html[:100]

' how whales help cool the earth bbc future homepageaccessibility linksskip to contentaccessibility h'
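To see what each regular expression contributes, the steps can be run one at a time on a short string. The sample below is hypothetical and only meant to illustrate the patterns:

```python
import re

sample = "<p>Whales are GREAT :-) </p>"

# 1. Remove anything that looks like an HTML tag.
no_tags = re.sub(r'<[^>]*>', '', sample)

# 2. Collect emoticons such as :-) ;) =D while the punctuation is still there.
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_tags)

# 3. Lowercase and replace every run of non-word characters with one space.
words_only = re.sub(r'[\W]+', ' ', no_tags.lower())

print(repr(no_tags))
print(emoticons)
print(repr(words_only))
```

Step 3 deletes the characters that emoticons are made of, which is why the emoticon search has to run on the text before that step.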

### Tokenize

Tokenizing converts the text from a single string into a list of tokens.


#### Using regex split:

\s = any whitespace character

In [23]:
words = re.split(r'\s+', clean_text_html)
words[:20]

['',
 'how',
 'whales',
 'help',
 'cool',
 'the',
 'earth',
 'bbc',
 'future',
 'homepageaccessibility',
 'linksskip',
 'to',
 'contentaccessibility',
 'help',
 'bbc',
 'accountnotifications',
 'homenewssportweatheriplayersoundscbbccbeebiesfoodbitesizeartstasterlocalthreemenu',
 'searchsearch',
 'the',
 'bbcsearch']
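The empty string at the start of the list appears because the cleaned text begins with a space, so `re.split` reports an empty piece before the first separator. A small sketch on a hypothetical sample string, comparing it with the built-in `str.split()`, which discards leading and trailing whitespace:

```python
import re

sample = " how whales help cool the earth "

with_regex = re.split(r"\s+", sample)   # keeps empty strings at both ends
with_split = sample.split()             # no empty strings

print(with_regex)
print(with_split)
```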

#### Using nltk word_tokenize:

In [24]:
import nltk
from nltk import word_tokenize
tokens = word_tokenize(text_html)
tokens[:20]

['How',
 'whales',
 'help',
 'cool',
 'the',
 'Earth',
 '-',
 'BBC',
 'Future',
 'HomepageAccessibility',
 'linksSkip',
 'to',
 'contentAccessibility',
 'Help',
 'BBC',
 'AccountNotifications',
 'HomeNewsSportWeatheriPlayerSoundsCBBCCBeebiesFoodBitesizeArtsTasterLocalThreeMenu',
 'SearchSearch',
 'the',
 'BBCSearch']

To use nltk methods such as findall(), convert the token list into an nltk Text object with nltk.Text(tokens).

In [57]:
Text_html = nltk.Text(tokens)

In [60]:
# Find every word that occurs between "the" and "ocean" (typically an adjective)

Text_html.findall(r"<the> (<.*>) <ocean>")  # each <...> matches one token; whitespace between the <> groups is ignored

open; deep
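What the pattern matches can be approximated in plain Python. The sketch below is not how nltk implements `findall()`; it only mirrors the idea of "one token between two fixed tokens" on a hypothetical token list, and `words_between` is an illustrative name:

```python
def words_between(tokens, before, after):
    """Return every token that sits directly between `before` and `after`."""
    return [
        tokens[i + 1]
        for i in range(len(tokens) - 2)
        if tokens[i] == before and tokens[i + 2] == after
    ]


# Hypothetical tokens, already split into words.
sample_tokens = "Whales dive into the deep ocean and cross the open ocean".split()
matches = words_between(sample_tokens, "the", "ocean")
print("; ".join(matches))
```

Like the nltk pattern, the comparison is case-sensitive, so "The" at the start of a sentence would not match "the".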


### Using Chrome DevTools to Explore HTML


https://developers.google.com/web/tools/chrome-devtools

Other browsers:

- Web Console UI (Firefox)

https://developer.mozilla.org/en-US/docs/Tools/Web_Console/UI_Tour

- Web Development Tools (Safari)

https://developer.apple.com/safari/tools/

