# Extracting text from an HTML file

There are lots of data sources from which we might want to extract information, such as the prospectuses that companies file for their initial public offerings (IPOs), e.g., [Tesla's IPO prospectus](https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm). One can imagine mining such documents in an effort to predict which IPOs will do well or poorly.

HTML contains both text and so-called markup: tags like `<b>` that specify formatting information.

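We will use the well-known [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python library to extract text. Install it with `pip install bs4`; the `'lxml'` parser used in the code below is a separate package (`pip install lxml`).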
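
To make the text-versus-markup distinction concrete, here is a minimal sketch on a made-up HTML snippet. The snippet and variable names are illustrative assumptions, not taken from the prospectus, and the sketch uses Python's built-in `html.parser` backend so that only `bs4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; it is not from the Tesla document.
snippet = "<html><body><p>Tesla <b>Motors</b>, Inc.</p></body></html>"

# "html.parser" ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(snippet, "html.parser")

# get_text() concatenates the text nodes and drops the tags.
plain = soup.get_text()
print(plain)
```

The `<p>` and `<b>` tags disappear and only the text they wrap remains.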

First, either do a "save as" or do what the cool kids do:

In [1]:
! curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4792  100  4792    0     0  90415      0 --:--:-- --:--:-- --:--:-- 90415


If you then do `open /tmp/TeslaIPO.html` from the command line (on a Mac), it will pop up in your browser window. Also take a look at what HTML looks like in the wild:

In [2]:
! head -15 /tmp/TeslaIPO.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>SEC.gov | Request Rate Threshold Exceeded</title>
<style>
html {height: 100%}
body {height: 100%; margin:0; padding:0;}
#header {background-color:#003968; color:#fff; padding:15px 20px 10px 20px;font-family:Arial, Helvetica, sans-serif; font-size:20px; border-bottom:solid 5px #000;}
#footer {background-color:#003968; color:#fff; padding:15px 20px;font-family:Arial, Helvetica, sans-serif; font-size:20px;}
#content {max-width:650px;margin:60px auto; padding:0 20px 100px 20px; background-image:url(seal_bw.png);background-repeat:no-repeat;background-position:50% 100%;}
h1 {font-family:Georgia, Times, serif; font-size:20px;}
h2 {text-align:center; font-family:Georgia, Times, serif; font-size:20px; width:100%; border-bottom:solid #999 1px;padding
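
The output above is the SEC's "Request Rate Threshold Exceeded" page rather than the prospectus: the server rejected the request as coming from an undeclared automated tool. One possible workaround, sketched here under assumptions, is to declare who you are in a `User-Agent` header. The function names `build_request` and `download` and the contact string are hypothetical placeholders, and whether the SEC accepts a given header depends on its current access policy.

```python
import urllib.request

URL = "https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm"

def build_request(url, user_agent):
    """Create a request that declares the client via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def download(url, user_agent, filename):
    """Fetch url and save the raw bytes to filename."""
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req) as response:
        data = response.read()
    with open(filename, "wb") as f:
        f.write(data)

# Placeholder identity (an assumption); replace it with your own name and email.
req = build_request(URL, "Your Name your.email@example.com")
print(req.get_header("User-agent"))

# Uncomment to perform the actual download:
# download(URL, "Your Name your.email@example.com", "/tmp/TeslaIPO.html")
```

With `curl`, the `-A` option sets the same header.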

## Main script

Our main program accepts a file name parameter from the command line, opens the file, reads its contents, converts the HTML to text, and closes the file. Our first attempt, after looking at the documentation, might be the following (file `ipo-text.py`; in this notebook cell the file name is hard-coded rather than read from the command line):

In [3]:
%%time
import sys
from bs4 import BeautifulSoup

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()

soup = BeautifulSoup(html_text, 'lxml')
text = soup.get_text()
print(text[0:300])




SEC.gov | Request Rate Threshold Exceeded



U.S. Securities and Exchange Commission

Your Request Originates from an Undeclared Automated Tool
To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been i
CPU times: user 81.6 ms, sys: 25.1 ms, total: 107 ms
Wall time: 157 ms


## Tidy up

Let's improve our program by creating a function to extract text from HTML:

In [4]:
def html2text(html_text):
    soup = BeautifulSoup(html_text, 'lxml')
    text = soup.get_text()
    return text

Then, our main program looks like:

In [5]:
import sys
from bs4 import BeautifulSoup

def html2text(html_text):
    soup = BeautifulSoup(html_text, 'lxml')
    text = soup.get_text()
    return text

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
text = html2text(html_text)
print(text[0:300])




SEC.gov | Request Rate Threshold Exceeded



U.S. Securities and Exchange Commission

Your Request Originates from an Undeclared Automated Tool
To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been i
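
The extracted text above contains runs of blank lines where the markup used to be. If that spacing is unwanted, one option is to collapse every run of whitespace into a single space. This is a minimal sketch, assuming paragraph boundaries do not need to be preserved; the helper name `collapse_whitespace` is hypothetical.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with a single space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # empty strings, so joining the pieces yields single-spaced text.
    return " ".join(text.split())

raw = "\n\n\nSEC.gov | Request Rate Threshold Exceeded\n\n\n\nU.S. Securities and Exchange Commission\n"
print(collapse_whitespace(raw))
```

The same helper could be applied to the string returned by `html2text` before printing it.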


### Exercise

Copy that program into a Python file called `ipo-text.py` and run it from the command line. You will also have to download the [TeslaIPO.html](https://github.com/parrt/msds692/blob/master/data/TeslaIPO.html) file using your browser or `curl` from the command line.
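
One way the command-line version might look is sketched below. It assumes the file name arrives as the first argument in `sys.argv`, and it swaps in the built-in `html.parser` backend where the notebook uses `'lxml'`; the `main` function and the usage message are illustrative choices, not part of the original program.

```python
import sys
from bs4 import BeautifulSoup

def html2text(html_text):
    # "html.parser" is used here to avoid the extra lxml dependency.
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.get_text()

def main(argv):
    # argv[0] is the script name; argv[1] should be the HTML file to read.
    if len(argv) != 2:
        print("usage: python ipo-text.py file.html", file=sys.stderr)
        return
    with open(argv[1], "r", encoding="utf-8") as f:
        html_text = f.read()
    print(html2text(html_text)[0:300])

if __name__ == "__main__":
    main(sys.argv)
```

It would then be run as `python ipo-text.py /tmp/TeslaIPO.html`.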

### Exercise

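Print out the number of unique words in the document (split on whitespace). For Tesla's IPO, I get 10602 unique words. The counts shown below are much smaller because the file that `curl` fetched above was the SEC's rate-limit page, not the prospectus.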

In [6]:
len(set(text.split()))

253

In [7]:
import numpy as np
len(np.unique(text.split()))

253
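
Splitting on whitespace alone treats `To` and `to`, or `users,` and `users`, as different words. A hedged variation, assuming case and surrounding punctuation should be ignored (a choice the exercise does not require), lowercases each token and strips punctuation from its ends before counting. The helper name `unique_words` is hypothetical, and the sample is a shortened sentence from the output above.

```python
import string

def unique_words(text):
    """Return the set of distinct words, ignoring case and edge punctuation."""
    words = set()
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were only punctuation, such as "|"
            words.add(word)
    return words

sample = "To allow for equitable access to all users, SEC reserves the right."
print(len(set(sample.split())))   # whitespace split only
print(len(unique_words(sample)))  # case and edge punctuation folded
```

The normalized count can only be equal to or smaller than the plain whitespace count.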