## Quiz

With your knowledge of HTML file structure, you will use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to extract the Audience Score and the number of audience ratings from each HTML file, along with the movie title (as in the video above, so there is a key to merge the datasets on later). You will then save the results in a pandas DataFrame.

The Jupyter Notebook below contains template code that:
* Creates an empty list, `df_list`, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the [most efficient way of building a DataFrame row by row](https://stackoverflow.com/a/28058264)).
* Loops through each movie's Rotten Tomatoes HTML file in the `rt_html` folder.
* Opens each HTML file as a file handle called `file`.
* Creates a DataFrame called `df` by converting `df_list` using the [pd.DataFrame constructor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

Your task is to extract the title, audience score, and number of audience ratings from each HTML file so that each trio can be appended to `df_list` as a dictionary.
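
The list-of-dictionaries pattern described above can be sketched as follows. This is only an illustration: the titles and numbers are made-up placeholders standing in for values parsed from the HTML files, and `df_example` is a hypothetical name chosen so it does not clash with the notebook's `df`.

```python
import pandas as pd

# Placeholder rows; in the quiz these values come from each parsed HTML file
placeholder_rows = [("Movie A (2000)", 90, 1200),
                    ("Movie B (2010)", 75, 340)]

df_list = []
for title, score, ratings in placeholder_rows:
    # One dictionary per movie, with one key per future column
    df_list.append({'title': title,
                    'audience_score': score,
                    'number_of_audience_ratings': ratings})

# Convert once, after the loop, instead of growing a DataFrame row by row
df_example = pd.DataFrame(df_list,
                          columns=['title', 'audience_score', 'number_of_audience_ratings'])
```

Passing `columns` explicitly fixes the column order regardless of the order of keys in the dictionaries.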

The [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) methods required for this task are:
* `find()`
* `find_all()`

**Need a Hint?**    
There is an excellent tutorial on these methods in the Beautiful Soup documentation: [Searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree). Please consult that tutorial if you are stuck.

**Helpful Resources:**
* Beautiful Soup `.find` [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)
* Beautiful Soup `.contents` [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children)
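
As a rough illustration of how `find()`, `find_all()`, and `.contents` fit together, here is a minimal sketch on a tiny hand-written HTML string. The markup, class names (`score`, `value`, `info`), and variable names are hypothetical and much simpler than the real Rotten Tomatoes pages. It also uses Python's built-in `"html.parser"` so that it runs without extra dependencies, whereas the notebook uses `"lxml"`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken from the actual rt_html files
html = (
    '<html><head><title>Sample Movie (1999) - Rotten Tomatoes</title></head>'
    '<body>'
    '<div class="score"><span class="value">88%</span></div>'
    '<div class="info">'
    '<div>Average Rating: 4.1/5</div>'
    '<div><span>User Ratings:</span> 12,345</div>'
    '</div>'
    '</body></html>'
)

sample_soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches)
title_tag = sample_soup.find("title")

# .contents is a list of the tag's direct children; here it holds one string
full_title = title_tag.contents[0]

# find() calls can be chained to narrow the search step by step
score_text = (sample_soup.find("div", class_="score")
                         .find("span", class_="value")
                         .contents[0])

# find_all() returns a list of every match, so the result can be indexed
info_divs = sample_soup.find("div", class_="info").find_all("div")
ratings_text = info_divs[1].contents[-1].strip()
```

The last child of the second inner `div` is the text that follows the `span`, which is why `.contents[-1]` is used before stripping whitespace.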

In [1]:
from bs4 import BeautifulSoup
import os
import pandas as pd

In [2]:
!ls

df_solution.pkl  gathering.ipynb  rt_html


In [3]:
!ls "./rt_html"

1000013-12_angry_men.html
1000121-39_steps.html
1000355-adventures_of_robin_hood.html
1000626-all_about_eve.html
1000642-all_quiet_on_the_western_front.html
1003707-casablanca.html
1007818-frankenstein.html
1011615-king_kong.html
1012007-laura.html
1012928-m.html
1013139-maltese_falcon.html
1013775-metropolis.html
1017289-rear_window.html
1017293-rebecca.html
1020333-streetcar_named_desire.html
1021749-touch_of_evil.html
1046060-high_noon.html
1048445-snow_white_and_the_seven_dwarfs.html
12_years_a_slave.html
400_blows.html
alien.html
apocalypse_now.html
argo_2012.html
army_of_shadows.html
arrival_2016.html
baby_driver.html
battleship_potemkin.html
beatles_a_hard_days_night.html
bicycle_thieves.html
boyhood.html
bride_of_frankenstein.html
brooklyn.html
citizen_kane.html
dr_strangelove.html
dunkirk_2017.html
et_the_extraterrestrial.html
finding_nemo.html
get_out.html
godfather.html
godfather_part_ii.html
gone_with_the_wind.html
grapes_of_wrath.ht

In [4]:
with open(os.path.join("rt_html", "zootopia.html")) as file:
   
    soup = BeautifulSoup(file,"lxml")
    
    title = soup.find("title").contents[0][:-len(" - Rotten Tomatoes")]
    
    # Find div.audience-score.meter, then its span.superPageFontColor,
    # take the first child string, and remove the % sign
    audience_score = soup.find("div", class_="audience-score meter") \
                         .find("span", class_="superPageFontColor") \
                         .contents[0] \
                         .replace("%","")
    
    num_audience_ratings = soup.find("div", class_="audience-info hidden-xs superPageFontColor") \
                               .find_all("div")[1] \
                               .contents[-1] \
                               .strip() \
                               .replace(",","")

In [5]:
print(soup)

<!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<script src="//cdn.optimizely.com/js/594670329.js"></script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="VPPXtECgUUeuATBacnqnCm4ydGO99reF-xgNklSbNbc" name="google-site-verification"/>
<meta content="034F16304017CA7DCF45D43850915323" name="msvalidate.01"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/iphone/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/icons/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/styles/css/rt_main.css" rel="stylesheet"/>
<script id="jsonLdSchema" type="application/ld+json">{"@context":"http

In [6]:
title

'Zootopia\xa0(2016)'
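
The `\xa0` in the output above is a non-breaking space (U+00A0) carried over from the page title. The slicing used for `title` can be sketched in isolation as below; `raw_title` is an assumed form of the `<title>` text, reconstructed from the output rather than copied from the file. The final `replace` is shown only for illustration, since the solution test may expect the title unchanged.

```python
suffix = " - Rotten Tomatoes"

# Assumed <title> text, rebuilt from the notebook output above
raw_title = "Zootopia\xa0(2016)" + suffix

# A negative slice bound drops exactly len(suffix) trailing characters
sliced = raw_title[:-len(suffix)]

# Illustration only: swap the non-breaking space for a regular space
normalized = sliced.replace("\xa0", " ")
```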

In [7]:
audience_score

'92'

In [8]:
num_audience_ratings

'98633'

In [9]:
# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
folder = 'rt_html'

for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        # Your code here
        # Note: a correct implementation may take ~15 seconds to run
        soup = BeautifulSoup(file,"lxml")
    
        title = soup.find("title").contents[0][:-len(" - Rotten Tomatoes")]

        audience_score = soup.find("div", class_="audience-score meter") \
                             .find("span", class_="superPageFontColor") \
                             .contents[0] \
                             .replace("%","")

        num_audience_ratings = soup.find("div", class_="audience-info hidden-xs superPageFontColor") \
                                   .find_all("div")[1] \
                                   .contents[-1] \
                                   .strip() \
                                   .replace(",","")
        
        
        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})
df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])

In [10]:
df.head()

| | title | audience_score | number_of_audience_ratings |
|---|---|---|---|
| 0 | Dr. Strangelove Or How I Learned to Stop Worry... | 94 | 208215 |
| 1 | Frankenstein (1931) | 87 | 41140 |
| 2 | All About Eve (1950) | 94 | 44564 |
| 3 | Roman Holiday (1953) | 94 | 62895 |
| 4 | The Night of the Hunter (1955) | 90 | 24322 |


In [11]:
df.shape

(100, 3)

## Solution Test
Run the cell below to see if your solution is correct. If an `AssertionError` is thrown, your solution is incorrect. If no error is thrown, your solution is correct.

In [12]:
df_solution = pd.read_pickle('df_solution.pkl')
df.sort_values('title', inplace = True)
df.reset_index(inplace = True, drop = True)
df_solution.sort_values('title', inplace = True)
df_solution.reset_index(inplace = True, drop = True)
pd.testing.assert_frame_equal(df, df_solution)
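
The test cell sorts both DataFrames by title and resets their indexes before comparing them, because `assert_frame_equal` is sensitive to row order and index labels. A minimal sketch of why that matters, using two small made-up frames and a hypothetical helper named `normalize`:

```python
import pandas as pd

# Made-up frames holding the same rows in a different order
a = pd.DataFrame({'title': ['B', 'A'], 'audience_score': [75, 90]})
b = pd.DataFrame({'title': ['A', 'B'], 'audience_score': [90, 75]})

def normalize(frame):
    # Sort by title, then rebuild a 0..n-1 index so labels line up
    return frame.sort_values('title').reset_index(drop=True)

# Passes silently: the frames are identical once normalized
pd.testing.assert_frame_equal(normalize(a), normalize(b))

# Without normalization the row order differs, so the check fails
try:
    pd.testing.assert_frame_equal(a, b)
    raised = False
except AssertionError:
    raised = True
```

Unlike the test cell, this helper returns new frames instead of using `inplace = True`, which leaves the originals untouched.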