# Extracting text from an HTML file

There are many data sources from which we might want to extract information, such as the prospectuses companies file for their initial public offerings (IPOs); for example, [Tesla's IPO prospectus](https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm). One can imagine mining such documents in an effort to predict which IPOs will do well or poorly.

HTML contains both text and so-called markup, such as `<b>`, which specifies formatting information.
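To make the distinction concrete, here is a minimal sketch that separates tags from text in a made-up snippet. It uses Python's standard-library `html.parser` module only for illustration; the class name `TagTextSplitter` and the sample string are assumptions, not part of the original notes.

```python
from html.parser import HTMLParser


class TagTextSplitter(HTMLParser):
    """Collect tag names and text chunks separately (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []   # markup: names of opening tags, in order
        self.text = []   # text: everything between the tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


splitter = TagTextSplitter()
splitter.feed("<p>Tesla is <b>going public</b>.</p>")
splitter.close()

print(splitter.tags)            # the markup
print("".join(splitter.text))   # the text with markup stripped
```

The `<p>` and `<b>` tags carry only formatting; the sentence itself is what we want to keep.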

We will use the well-known [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python library to extract text. Install it with `pip install bs4`.

First, either do a "save as" or do what the cool kids do:

In [1]:
! curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2306k    0 2306k    0     0  8906k      0 --:--:-- --:--:-- --:--:-- 8940k


If you then do `open /tmp/TeslaIPO.html` from the command line (on macOS), it will pop up in your browser window. Also take a look at what HTML looks like in the wild:

In [2]:
! head -15 /tmp/TeslaIPO.html

<DOCUMENT>
<TYPE>S-1
<SEQUENCE>1
<FILENAME>ds1.htm
<DESCRIPTION>REGISTRATION STATEMENT ON FORM S-1
<TEXT>
<HTML><HEAD>
<TITLE>Registration Statement on Form S-1</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>

 <P STYLE="margin-top:0px;margin-bottom:0px" ALIGN="center"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>As filed with the Securities and Exchange Commission on January 29, 2010 </B></FONT></P>
<P STYLE="margin-top:0px;margin-bottom:0px" ALIGN="right"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>Registration No.&nbsp;333-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</B></FONT></P>
<P STYLE="font-size:2px;margin-top:0px;margin-bottom:0px">&nbsp;</P> <P STYLE="line-height:0px;margin-top:0px;margin-bottom:0px;border-bottom:0.5pt solid #000000">&nbsp;</P> <P


## Main script

Our main program should accept a file name from the command line, open the file, read its contents, convert the HTML to text, and close the file. Our first attempt, after looking at the documentation, might be the following (file `ipo-text.py`), which hard-codes the file name for now:

In [3]:
import sys
from bs4 import BeautifulSoup

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
soup = BeautifulSoup(html_text, 'html.parser')
text = soup.get_text()
print(text[0:300])


S-1
1
ds1.htm
REGISTRATION STATEMENT ON FORM S-1


Registration Statement on Form S-1


Table of Contents
As filed with the Securities and Exchange Commission on January 29, 2010 
Registration No. 333-                
      UNITED STATES  SECURITIES AND EXCHANGE COMMISSION  Washington, D.C. 20549  


## Tidy up

Let's improve our program by creating a function to extract text from HTML:

In [4]:
def html2text(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text

Then, our main program looks like:

In [5]:
import sys
from bs4 import BeautifulSoup

def html2text(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
text = html2text(html_text)
print(text[0:300])


S-1
1
ds1.htm
REGISTRATION STATEMENT ON FORM S-1


Registration Statement on Form S-1


Table of Contents
As filed with the Securities and Exchange Commission on January 29, 2010 
Registration No. 333-                
      UNITED STATES  SECURITIES AND EXCHANGE COMMISSION  Washington, D.C. 20549  


### Exercise

Copy that program into a Python file called `ipo-text.py` and run it from the command line. You will also have to download the [TeslaIPO.html](https://github.com/parrt/msds692/blob/master/data/TeslaIPO.html) file using your browser or `curl` from the command line.
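When run as a script, the program can take the file name from the command line instead of hard-coding it. The following is one possible sketch, assuming the file name arrives as the first argument (`sys.argv[1]`); the `main` function and the usage message are illustrative choices rather than part of the original program.

```python
import sys
from bs4 import BeautifulSoup


def html2text(html_text):
    """Return the text of an HTML string with the markup stripped."""
    soup = BeautifulSoup(html_text, 'html.parser')
    return soup.get_text()


def main(argv):
    """Print the first 300 characters of the text in the file named by argv[1]."""
    if len(argv) != 2:
        print("usage: python ipo-text.py file.html", file=sys.stderr)
        return 1
    # The with-statement closes the file for us.
    with open(argv[1], "r") as f:
        html_text = f.read()
    print(html2text(html_text)[0:300])
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

You would then run it as `python ipo-text.py /tmp/TeslaIPO.html`.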

### Exercise

Print out the number of unique words in the document (split on whitespace). For Tesla's IPO, I get 10546 unique words.
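To see what "unique words, split on whitespace" means in practice, here is a small sketch on a made-up sentence (the sentence is an assumption chosen for illustration). Note that splitting on whitespace leaves punctuation attached, so `cars.` and `cars` count as different words.

```python
sentence = "Tesla makes cars. Tesla sells cars"

words = sentence.split()      # split on runs of whitespace
unique_words = set(words)     # a set keeps one copy of each word

print(len(words))             # 6 words in total
print(sorted(unique_words))   # ['Tesla', 'cars', 'cars.', 'makes', 'sells']
print(len(unique_words))      # 5 unique words
```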

In [6]:
len(set(text.split()))

10546

In [8]:
import numpy as np
len(np.unique(text.split()))

10546