In [1]:
from bs4 import BeautifulSoup

In [2]:
with open('rt_html/et_the_extraterrestrial.html') as file:
    soup = BeautifulSoup(file, 'lxml')

In [3]:
#soup   # Returns the HTML data as seen in a text editor

Using the [soup.find()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) method:

In [4]:
soup.find('title')  # To get title of webpage

<title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title>

The __find_all()__ method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one `<body>` tag, it’s a waste of time to scan the entire document looking for more.  
Rather than passing in __limit=1__ every time you call find_all, you can use the __find()__ method.
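The difference between the two calls can be sketched on a tiny document. The HTML string below is made up for illustration, and `html.parser` is used only so the sketch does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, used only for illustration.
html = """
<html><head><title>Example page</title></head>
<body><p>first</p><p>second</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")               # a single Tag, or None if nothing matches
limited = soup.find_all("p", limit=1)  # a list holding at most one Tag
all_p = soup.find_all("p")             # a list of every matching Tag

print(first_p.string)        # text of the first <p>
print(len(limited))          # number of results with limit=1
print(len(all_p))            # number of <p> tags in the document
print(soup.find("table"))    # no <table> in the document, so find() gives None
```

Note that `find()` returns `None` when nothing matches, while `find_all()` returns an empty list, so chained calls on a `find()` result can raise `AttributeError` if the tag is missing.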

The result is the title of the web page, not just the title of the movie.

To get only the movie title, we have to use string slicing.

In [5]:
soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]

'E.T. The Extra-Terrestrial\xa0(1982)'

__\xa0__ is the Unicode non-breaking space (U+00A0); it will be removed during cleaning.
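As a rough sketch of that cleaning step, two standard-library options are shown below. Which one the later cleaning actually uses is not stated here, so treat both as possibilities rather than the notebook's method.

```python
import unicodedata

raw_title = "E.T. The Extra-Terrestrial\xa0(1982)"

# Option 1: replace the non-breaking space (U+00A0) with a regular space.
cleaned = raw_title.replace("\xa0", " ")

# Option 2: NFKC normalization also maps U+00A0 to a regular space,
# but it rewrites other compatibility characters as well.
normalized = unicodedata.normalize("NFKC", raw_title)

print(repr(cleaned))
print(cleaned == normalized)
```

The explicit `replace` is the narrower tool: it touches only the one character, which makes it easier to reason about when titles contain other non-ASCII text.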

In [6]:
len(' - Rotten Tomatoes')

18
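Slicing with `[:-len(' - Rotten Tomatoes')]` always drops the last 18 characters, whether or not the string really ends with that suffix. A slightly more defensive variant is sketched below; `strip_suffix` is a hypothetical helper name, not something from Beautiful Soup or the original notebook.

```python
SUFFIX = " - Rotten Tomatoes"

def strip_suffix(text, suffix=SUFFIX):
    """Return text without suffix if it ends with it, else text unchanged."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

page_title = "E.T. The Extra-Terrestrial\xa0(1982) - Rotten Tomatoes"

print(repr(strip_suffix(page_title)))    # suffix removed
print(repr(strip_suffix("Other page")))  # no suffix, returned as-is
```

On Python 3.9 and later, the built-in `str.removesuffix()` does the same job.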

The Jupyter Notebook below contains template code that:

- Creates an empty list, df_list, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the [most efficient way of building a DataFrame row by row](https://stackoverflow.com/a/28058264)).
- Loops through each movie's Rotten Tomatoes HTML file in the rt_html folder.
- Opens each HTML file and passes it into a file handle called file.
- Creates a DataFrame called df by converting df_list using the pd.DataFrame constructor.

Your task is to extract the title, audience score, and number of audience ratings in each HTML file so each trio can be appended as a dictionary to df_list.

The Beautiful Soup methods required for this task are:

- find()
- find_all()

There is an excellent tutorial on these methods ([Searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree)) in the Beautiful Soup documentation. Please consult that tutorial if you are stuck.

In [7]:
from bs4 import BeautifulSoup
import os
import pandas as pd

In [8]:
# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as fh:
        soup = BeautifulSoup(fh, 'lxml')
        # To get title of webpage
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        # To get audience score
        audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
        # To get the number of user ratings (thousands separators removed)
        user_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor').find_all('div')[1].contents[-1].replace(',', '')
        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(user_ratings)})


df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])
        

The audience score is in a __div tag__ with __class="audience-score meter"__, which contains exactly one __span tag__. The span can be reached by navigating down from the div with a chained `find()` call.

In the code:
>soup.find('div', class_= "audience-info hidden-xs superPageFontColor").find('div').contents[-1]

an underscore follows __class__ because `class` is a reserved keyword in Python, so Beautiful Soup uses the keyword argument `class_` instead.
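The class lookup and the div-to-span navigation can be tried on a small fragment. The HTML below is a made-up imitation of the markup described above, not a copy of the real Rotten Tomatoes page, and `html.parser` is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the markup described above.
html = '<div class="audience-score meter"><span>72%</span></div>'
soup = BeautifulSoup(html, "html.parser")

# class_ is the keyword-argument form; attrs={...} is an equivalent spelling.
by_keyword = soup.find("div", class_="audience-score meter")
by_attrs = soup.find("div", attrs={"class": "audience-score meter"})

# A multi-word string is compared against the exact attribute value,
# so the same classes in a different order do not match.
reversed_order = soup.find("div", class_="meter audience-score")

# Navigate from the div to its span, then drop the trailing '%'.
score_text = by_keyword.find("span").contents[0]
audience_score = int(score_text[:-1])

print(by_attrs is not None)
print(reversed_order)
print(audience_score)
```

Because the multi-word form depends on the exact attribute string, matching on a single class such as `class_="meter"` may be more robust if the page's class order ever changes.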





Further reading on handling __\xa0__:

- [Stack Overflow: Beautiful Soup and Unicode Problems](https://stackoverflow.com/questions/19508442/beautiful-soup-and-unicode-problems)
- [Stack Overflow: Python: Removing \xa0 from string](https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string)

In [9]:
df.head()

| | title | audience_score | number_of_audience_ratings |
|---|---|---|---|
| 0 | 12 Angry Men (Twelve Angry Men) (1957) | 97 | 103672 |
| 1 | The 39 Steps (1935) | 86 | 23647 |
| 2 | The Adventures of Robin Hood (1938) | 89 | 33584 |
| 3 | All About Eve (1950) | 94 | 44564 |
| 4 | All Quiet on the Western Front (1930) | 89 | 17768 |
