<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/nateraw/BeautifulSauce/blob/master/notebooks/getting_started.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/nateraw/BeautifulSauce/blob/master/notebooks/getting_started.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>

# Install BeautifulSauce

In [0]:
!pip install BeautifulSauce --upgrade

In [0]:
from BeautifulSauce import Sauce
from BeautifulSauce.featurizer import Featurizer

import re

# Utilities
BeautifulSauce comes with a few handy utilities built on top of the BeautifulSoup object.

## Reading from file
BeautifulSauce has a built-in class method, `from_file(filepath)`, that initializes a soup directly from a file. Here, we show how that works after creating and saving an example HTML file.

In [0]:
# First, we'll make a file and save it to the Colab server
html_str = """
<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8"></meta>
    </head>
    <body>
        <div style="font-weight: bold;">
            <div>
                bold text here
                    <p>bold text</p>
                    <div style="font-weight: normal;">
                        normal text
                        <p>normal text</p>
                    </div>            
            </div>
        </div>

        <div>
            <div>
                normal text
            </div>
            <div>
                <b>bold text</b>
            </div>
        </div>

    </body>
</html>
"""

with open("example.html","w") as w:
    w.write(html_str)

In [42]:
soup = Sauce.from_file('example.html')
print(soup)

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
 </head>
 <body>
  <div style="font-weight: bold;">
   <div>
    bold text here
    <p>
     bold text
    </p>
    <div style="font-weight: normal;">
     normal text
     <p>
      normal text
     </p>
    </div>
   </div>
  </div>
  <div>
   <div>
    normal text
   </div>
   <div>
    <b>
     bold text
    </b>
   </div>
  </div>
 </body>
</html>



## Reading from URL
Instead of writing the boilerplate needed to initialize a soup object from a hosted URL, you can call the `from_url(url)` class method of `Sauce`.

In [0]:
url = "https://en.wikipedia.org/wiki/Grace_Hopper"
soup = Sauce.from_url(url)
tag = soup.find(id='firstHeading')
print(tag.text)

Grace Hopper
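For comparison, the boilerplate that `from_url` stands in for might look roughly like the sketch below. The helper name `soup_from_url`, the `html.parser` choice, the User-Agent header, and the timeout are assumptions made for illustration; this is not BeautifulSauce's implementation.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def soup_from_url(url, parser="html.parser", timeout=10):
    """Hypothetical helper: download a page and parse it with BeautifulSoup."""
    # Some servers reject requests without a User-Agent header, so send one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        html = response.read()  # raw bytes; BeautifulSoup detects the encoding
    return BeautifulSoup(html, parser)
```

Calling `soup_from_url("https://en.wikipedia.org/wiki/Grace_Hopper")` returns a plain `BeautifulSoup` object, so it supports `find(id='firstHeading')` but has none of the extras that `Sauce` adds.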


## Indexing HTML trees
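BeautifulSauce also adds a unique attribute, `idx`, to each tag. This attribute denotes the position of the tag within the HTML tree as a list of indices, one per level from the root down, where each index counts the tag's position among its sibling tags. Take a look at the example below.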

In [0]:
soup = Sauce.from_file('example.html')
print(soup)
for tag in soup.find_all():
    print("Tag Name: {:4s} | Tag idx: {}".format(tag.name, tag.idx))

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
 </head>
 <body>
  <div style="font-weight: bold;">
   <div>
    bold text here
    <p>
     bold text
    </p>
    <div style="font-weight: normal;">
     normal text
     <p>
      normal text
     </p>
    </div>
   </div>
  </div>
  <div>
   <div>
    normal text
   </div>
   <div>
    <b>
     bold text
    </b>
   </div>
  </div>
 </body>
</html>

Tag Name: html | Tag idx: [0]
Tag Name: head | Tag idx: [0, 0]
Tag Name: meta | Tag idx: [0, 0, 0]
Tag Name: body | Tag idx: [0, 1]
Tag Name: div  | Tag idx: [0, 1, 0]
Tag Name: div  | Tag idx: [0, 1, 0, 0]
Tag Name: p    | Tag idx: [0, 1, 0, 0, 0]
Tag Name: div  | Tag idx: [0, 1, 0, 0, 1]
Tag Name: p    | Tag idx: [0, 1, 0, 0, 1, 0]
Tag Name: div  | Tag idx: [0, 1, 1]
Tag Name: div  | Tag idx: [0, 1, 1, 0]
Tag Name: div  | Tag idx: [0, 1, 1, 1]
Tag Name: b    | Tag idx: [0, 1, 1, 1, 0]
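
To make the numbering scheme concrete, here is a minimal sketch that computes the same kind of path index using only Python's standard `html.parser`. The `PathIndexer` class and `index_paths` function are hypothetical names, the sketch assumes properly nested markup, and it is not how BeautifulSauce itself assigns `idx`.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathIndexer(HTMLParser):
    """Collect (tag name, path) pairs; a path lists the tag's position
    among its sibling tags at every level, starting from the root."""

    def __init__(self):
        super().__init__()
        self.paths = []           # results, in document order
        self._path = []           # path of the innermost open tag
        self._child_counts = [0]  # child tags seen so far at each open level

    def handle_starttag(self, tag, attrs):
        position = self._child_counts[-1]
        self._child_counts[-1] += 1
        path = self._path + [position]
        self.paths.append((tag, path))
        if tag not in VOID_TAGS:
            # Descend: later tags are children of this one until it closes.
            self._path = path
            self._child_counts.append(0)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._path:
            return
        # Ascend back to the parent level.
        self._path = self._path[:-1]
        self._child_counts.pop()


def index_paths(html):
    indexer = PathIndexer()
    indexer.feed(html)
    indexer.close()
    return indexer.paths


sample = ("<html><head><meta charset='utf-8'></head>"
          "<body><div>text<p>a</p></div><div><b>c</b></div></body></html>")
for name, path in index_paths(sample):
    print("Tag Name: {:4s} | Tag idx: {}".format(name, path))
```

Text nodes are skipped when counting, which is why a `<p>` that follows loose text inside a `<div>` still gets position 0.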


---

**You can get tags by their `.idx` attribute by calling `sauce.get_from_idx(indices)`**

In [11]:
soup.get_from_idx([0,1,0,0,0]).name

'p'
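
Looking a tag up by its index amounts to walking down the tree one level per entry. The sketch below illustrates that walk with the standard library's `xml.etree.ElementTree`, which only accepts well-formed markup. The function name `get_from_path` and the compact `DOC` string are made up for this example, and the code is an illustration of the idea rather than BeautifulSauce's `get_from_idx`.

```python
import xml.etree.ElementTree as ET

# A well-formed, compact version of the example document's structure.
DOC = """<html>
  <head><meta charset="utf-8"></meta></head>
  <body>
    <div><div>bold text here<p>bold text</p><div>normal text<p>normal text</p></div></div></div>
    <div><div>normal text</div><div><b>bold text</b></div></div>
  </body>
</html>"""


def get_from_path(root, path):
    """Follow a list of sibling positions from the root element downward."""
    if not path or path[0] != 0:
        raise IndexError("path must start at the single root element (index 0)")
    node = root
    for position in path[1:]:
        node = node[position]  # indexing an Element counts child elements only
    return node


root = ET.fromstring(DOC)
print(get_from_path(root, [0, 1, 0, 0, 0]).tag)  # p
print(get_from_path(root, [0, 1, 1, 1, 0]).tag)  # b
```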

# Featurization of HTML Documents
The main purpose of BeautifulSauce is to help you featurize HTML documents. In practice, this means adding attributes to `Tag` elements via `BeautifulSauce.featurizer.Featurizer`. These attributes can be:
  - Categorical - Downstream, you can automatically dummy code these in a dataframe (if you want to).
  - Numerical - Downstream, you can standardize these on a scale of 0.0-1.0 (if you want to).
  - Text

Any feature can be added via one of the three built-in decorators on the Featurizer:
  - @ftrs.add_categorical_feature
  - @ftrs.add_numerical_feature
  - @ftrs.add_text_feature
  
Let's walk through an example...

In [0]:
# Initialize a Featurizer
ftrs = Featurizer()

In [0]:
# Add a categorical feature
@ftrs.add_categorical_feature("tag_name")
def f_tag_name(tag):
    return tag.name

In [0]:
# Add numerical feature
@ftrs.add_numerical_feature('char_cnt')
def f_char_cnt(tag):
    if tag.name in ['head', 'meta', 'script']:
        return 0
    texts = list(tag.find_all(text=True, recursive=False))
    if len(texts) < 1:
        return 0
    texts = " ".join(texts).strip()
    texts = re.sub("\n", " ", texts)
    return len(texts)

In [0]:
# Add text feature
@ftrs.add_text_feature('text')
def f_text(tag):
    if tag.name in ['head', 'meta', 'script']:
        return ""
    texts = list(tag.find_all(text=True, recursive=False))
    if len(texts) < 1:
        return ""
    texts = " ".join(texts).strip()
    texts = re.sub("\n", " ", texts)
    return texts
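
Each decorator used so far takes a feature name and registers the decorated function under it. A registry of that kind can be sketched in plain Python as follows. `MiniFeaturizer` and its `compute` method are hypothetical and only mirror the shape of the API used in this notebook; BeautifulSauce's internals may differ.

```python
class MiniFeaturizer:
    """Hypothetical stand-in showing how name-taking decorators can register functions."""

    def __init__(self):
        self._features = {"categorical": {}, "numerical": {}, "text": {}}

    def _register(self, kind, name):
        def decorator(func):
            self._features[kind][name] = func
            return func  # hand the function back unchanged
        return decorator

    def add_categorical_feature(self, name):
        return self._register("categorical", name)

    def add_numerical_feature(self, name):
        return self._register("numerical", name)

    def add_text_feature(self, name):
        return self._register("text", name)

    def compute(self, item):
        """Run every registered function on one item, grouped by feature kind."""
        return {
            kind: {name: func(item) for name, func in funcs.items()}
            for kind, funcs in self._features.items()
        }


mini = MiniFeaturizer()


@mini.add_categorical_feature("first_char")
def f_first_char(s):
    return s[0]


@mini.add_numerical_feature("char_cnt")
def f_len(s):
    return len(s)


@mini.add_text_feature("text")
def f_identity(s):
    return s


print(mini.compute("bold text"))
# {'categorical': {'first_char': 'b'}, 'numerical': {'char_cnt': 9}, 'text': {'text': 'bold text'}}
```

The sketch runs on plain strings instead of tags so that it works without any parsing library.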

In [0]:
# Read in soup from file
soup = Sauce.from_file('example.html')
# Apply your featurizer to this soup object
ftrs.featurize(soup)

In [38]:
# Take a look at what the .features attribute looks like
for tag in soup.find_all():
    print(tag.features)

{'categorical': {'tag_name': 'html'}, 'numerical': {'char_cnt': 0}, 'text': {'text': ''}}
{'categorical': {'tag_name': 'head'}, 'numerical': {'char_cnt': 0}, 'text': {'text': ''}}
{'categorical': {'tag_name': 'meta'}, 'numerical': {'char_cnt': 0}, 'text': {'text': ''}}
{'categorical': {'tag_name': 'body'}, 'numerical': {'char_cnt': 0}, 'text': {'text': ''}}
{'categorical': {'tag_name': 'div'}, 'numerical': {'char_cnt': 0}, 'text': {'text': ''}}
{'categorical': {'tag_name': 'div'}, 'numerical': {'char_cnt': 14}, 'text': {'text': 'bold text here'}}
{'categorical': {'tag_name': 'p'}, 'numerical': {'char_cnt': 9}, 'text': {'text': 'bold text'}}
{'categorical': {'tag_name': 'div'}, 'numerical': {'char_cnt': 11}, 'text': {'text': 'normal text'}}
{'categorical': {'tag_name': 'p'}, 'numerical': {'char_cnt': 11}, 'text': {'text': 'normal text'}}
{'categorical': {'tag_name': 'div'}, 'numerical': {'char_cnt': 0}, 'text': {'text': ''}}
{'categorical': {'tag_name': 'div'}, 'numerical': {'char_cnt':

---
### to_dataframe()
The `.features` attribute on each tag, while helpful, is awkward to work with directly on the soup object. Its real purpose is to make it easy to export these features to a Pandas DataFrame. Let's do that now.

In [39]:
df = ftrs.to_dataframe(soup)
df

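| | tag_name | char_cnt | text |
|---|---|---|---|
| 0 | html | 0 | |
| 1 | head | 0 | |
| 2 | meta | 0 | |
| 3 | body | 0 | |
| 4 | div | 0 | |
| 5 | div | 14 | bold text here |
| 6 | p | 9 | bold text |
| 7 | div | 11 | normal text |
| 8 | p | 11 | normal text |
| 9 | div | 0 | |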
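
Conceptually, going from per-tag feature dictionaries to a table means flattening each nested dictionary into one row. The sketch below shows that step in plain Python. `flatten_features` is a hypothetical helper, and this is not a claim about how `to_dataframe` is implemented.

```python
def flatten_features(features):
    """Merge the categorical, numerical and text groups into one flat row."""
    row = {}
    for kind in ("categorical", "numerical", "text"):
        row.update(features.get(kind, {}))
    return row


# Two feature dictionaries in the shape printed earlier in this notebook.
tag_features = [
    {"categorical": {"tag_name": "div"}, "numerical": {"char_cnt": 14},
     "text": {"text": "bold text here"}},
    {"categorical": {"tag_name": "p"}, "numerical": {"char_cnt": 9},
     "text": {"text": "bold text"}},
]

rows = [flatten_features(f) for f in tag_features]
for row in rows:
    print(row)
```

A list of flat dictionaries like `rows` can be passed straight to `pandas.DataFrame(rows)` to get one column per feature.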


### Normalization + Dummy Coding
As mentioned previously, the decorators are split into categorical, numerical, and text features for a reason: it lets us dummy code the categorical features and standardize the numerical features if we choose to. Both happen when you call `ftrs.to_dataframe(soup, normalize=True)`. Take a look...

In [40]:
df = ftrs.to_dataframe(soup, normalize=True)
df

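| | char_cnt | tag_name_b | tag_name_body | tag_name_div | tag_name_head | tag_name_html | tag_name_meta | tag_name_p |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0.962713 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0.979648 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 0.979648 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |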
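
For readers unfamiliar with the two terms, the sketch below shows dummy coding and one common way of scaling numbers to 0.0-1.0 (min-max scaling) in plain Python. The helper names `dummy_code` and `min_max_scale` are hypothetical. Min-max scaling is an assumption chosen for illustration: it does not reproduce the `char_cnt` values in the output above, so BeautifulSauce evidently scales differently.

```python
def dummy_code(values, prefix):
    """One 0/1 column per distinct category, named '<prefix>_<category>'."""
    categories = sorted(set(values))
    return [
        {"{}_{}".format(prefix, c): int(v == c) for c in categories}
        for v in values
    ]


def min_max_scale(values):
    """Linearly map values so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


tag_names = ["html", "div", "p", "div"]
char_cnts = [0, 14, 7, 14]

scaled = min_max_scale(char_cnts)
dummies = dummy_code(tag_names, "tag_name")
for value, dummy_row in zip(scaled, dummies):
    print({"char_cnt": value, **dummy_row})
```

With pandas, the dummy-coding half is typically done with `pandas.get_dummies`.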
