# Custom Sources - HTML

In [1]:
import nltk
import urllib.request

Web pages are written in HTML, so when you fetch a page directly from a site you get all of the markup back along with the text.

In [2]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

In [3]:
response = urllib.request.urlopen(url)

In [4]:
html = response.read()
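Some sites reject requests that arrive without a descriptive User-Agent header. As a minimal sketch, assuming you want to identify your script, you can build a `Request` object with a header and pass that to `urlopen` instead of the bare URL. The User-Agent string below is a made-up placeholder, and the request is only constructed here, not sent.

```python
import urllib.request

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Hypothetical User-Agent string; replace it with one that identifies your script.
request = urllib.request.Request(
    url,
    headers={"User-Agent": "nltk-notebook-example/0.1"},
)

# The request is only built here, not sent. To fetch the page, pass it to urlopen:
# response = urllib.request.urlopen(request)
print(request.get_header("User-agent"))
```

Note that `Request` stores header names with only the first letter capitalized, which is why the lookup uses `"User-agent"`.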

In [5]:
html[:1000]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Python (programming language) - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Python_(programming_language)","wgTitle":"Python (programming language)","wgCurRevisionId":864962678,"wgRevisionId":864962678,"wgArticleId":23862,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: external links","Wikipedia semi-protected pages","Use dmy dates from August 2015","Wikipedia articles needing clarification from May 2018","All articles with unsourced statements","Articles with unsourced statements from May 2018","Articles containing potenti
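The `b'...'` prefix in the output above shows that `response.read()` returns bytes, not a string. BeautifulSoup accepts bytes directly, but if you want to inspect the markup as text you can decode it first. The sketch below uses a short stand-in byte string (`sample_html`, not the real page) and assumes the page is UTF-8, as its `<meta charset>` tag declares.

```python
# Stand-in for the bytes returned by response.read() (a small sample, not the real page).
sample_html = (
    b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8"/>\n'
    b"<title>Python (programming language) - Wikipedia</title>\n</head>\n</html>"
)

# Decode using the charset the page declares (UTF-8 here).
sample_text = sample_html.decode("utf-8")

print(type(sample_html).__name__, type(sample_text).__name__)
print(sample_text.splitlines()[4])
```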

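We will use a Python library called BeautifulSoup to make working with the HTML easier. The second argument to `BeautifulSoup` names the parser; `html5lib` is a separate package that must be installed alongside `bs4`.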

In [6]:
from bs4 import BeautifulSoup

In [7]:
soup = BeautifulSoup(html, "html5lib")

Wikipedia places most of its readable text inside paragraph (`<p>`) tags.

In [8]:
web_paragraph = [p_tag.text for p_tag in soup.find_all('p')]
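To see roughly what `find_all('p')` plus `.text` is doing, here is a standard-library sketch of the same idea using `html.parser.HTMLParser`. This is not how BeautifulSoup is implemented, and `ParagraphCollector` is a hypothetical helper written for illustration. The sample markup is made up and includes an empty `<p>` to show that empty paragraphs come back as empty strings.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is kept, as long as we are inside a <p>.
        if self._in_p:
            self._chunks.append(data)


sample = (
    "<div><p>Python is a <b>programming</b> language.</p>"
    "<p></p><p>It was first released in 1991.</p></div>"
)
collector = ParagraphCollector()
collector.feed(sample)
collector.close()
print(collector.paragraphs)
# ['Python is a programming language.', '', 'It was first released in 1991.']
```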

In [9]:
web_tokens = [nltk.word_tokenize(paragraph) for paragraph in web_paragraph]

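Now that each paragraph is tokenized, we can print the first paragraph of article text. In this run it sits at index 2 rather than 0, because the first two `<p>` tags on the page do not hold the lead paragraph.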

In [10]:
print(web_tokens[2])

['Python', 'is', 'an', 'interpreted', 'high-level', 'programming', 'language', 'for', 'general-purpose', 'programming', '.', 'Created', 'by', 'Guido', 'van', 'Rossum', 'and', 'first', 'released', 'in', '1991', ',', 'Python', 'has', 'a', 'design', 'philosophy', 'that', 'emphasizes', 'code', 'readability', ',', 'notably', 'using', 'significant', 'whitespace', '.', 'It', 'provides', 'constructs', 'that', 'enable', 'clear', 'programming', 'on', 'both', 'small', 'and', 'large', 'scales', '.', '[', '27', ']', 'In', 'July', '2018', ',', 'Van', 'Rossum', 'stepped', 'down', 'as', 'the', 'leader', 'in', 'the', 'language', 'community', 'after', '30', 'years', '.', '[', '28', ']', '[', '29', ']']
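Two things stand out in that output: the index 2 is hardcoded, and Wikipedia's footnote markers survive as tokens such as `'['`, `'27'`, `']'`. One possible cleanup, sketched below on stand-in strings instead of the live page, is to strip bracketed numbers with a regular expression and then take the first paragraph that still has text. The empty leading entries in `sample_paragraphs` are an assumption about why the lead paragraph is not at index 0.

```python
import re

# Sample paragraph strings standing in for web_paragraph (not the live page).
sample_paragraphs = [
    "",
    "\n",
    "Python is an interpreted high-level programming language.[27] "
    "In July 2018, Van Rossum stepped down.[28][29]",
]

# Remove bracketed numeric citation markers such as [27].
citation = re.compile(r"\[\d+\]")
cleaned = [citation.sub("", p).strip() for p in sample_paragraphs]

# First paragraph that still has text after cleaning.
first_paragraph = next(p for p in cleaned if p)
print(first_paragraph)
```

The cleaned strings can then be passed to `nltk.word_tokenize` as before.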
