## Web Scraping with BeautifulSoup

### In this example, we will:
1) Extract movie data from the IMDb and Metacritic websites.

2) Compare the ratings that each website gives to the movies they have in common.

### Import the necessary packages and extract the page content from the IMDb website:

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import pandas as pd

# Hosted HTML copy of the IMDb page
url_imdb = "https://assets-datascientest.s3.eu-west-1.amazonaws.com/IMDB_en.html"
page_imdb = urlopen(url_imdb)
bs_imdb = bs(page_imdb, "html.parser")

# Preview the first 30 lines of the parsed document
print("\n".join(bs_imdb.prettify().splitlines()[0:30]))

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <script type="text/javascript">
   var ue_t0=ue_t0||+new Date();
  </script>
  <script type="text/javascript">
   window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);


    var ue_err_chan = 'jserr';
(function(d,e){function h(f,b){if(!(a.ec>a.mxe)&&f){a.ter.push(f);b=b||{};var c=f.logLevel||b.logLevel;c&&c!==k&&c!==m&&c!==n&&c!==p||a.ec++;c&&c!=k||a.ec

#### Find all the movies listed on the website:

In [2]:
# Each movie sits in a <td class="titleColumn"> cell; find_all is the current name of findAll
films_imdb = bs_imdb.find_all("td", class_="titleColumn")
print(films_imdb[:5])

[<td class="titleColumn">
<a href="/title/tt11564570/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=ea4e08e1-c8a3-47b5-ac3a-75026647c16e&amp;pf_rd_r=BQWZRBFAM81S7K6ZBPJP&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=moviemeter&amp;ref_=chtmvm_tt_1" title="Rian Johnson (dir.), Daniel Craig, Edward Norton">Glass Onion: une histoire à couteaux tirés</a>
<span class="secondaryInfo">(2022)</span>
<div class="velocity">1
<span class="secondaryInfo">(
<span class="global-sprite titlemeter up"></span>
1)</span>
</div>
</td>, <td class="titleColumn">
<a href="/title/tt1630029/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=ea4e08e1-c8a3-47b5-ac3a-75026647c16e&amp;pf_rd_r=BQWZRBFAM81S7K6ZBPJP&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=moviemeter&amp;ref_=chtmvm_tt_2" title="James Cameron (dir.), Sam Worthington, Zoe Saldana">Avatar: la voie de l'eau</a>
<span class="secondaryInfo">(2022)</span>
<div class="velocity">2
<span class="secondaryInfo">(
<span class="global-sprite titlemeter down"></span>
1)</sp

#### Get the first movie listed:

In [3]:
film_first = films_imdb[0]
# The title is the text of the link inside the cell
print(film_first.find("a").string.strip())

Glass Onion: une histoire à couteaux tirés


#### Get the ranking of the first movie:

In [4]:
# The rank is the first text node of the "velocity" div; strip() drops the trailing newline
rank = film_first.find("div", class_="velocity").contents[0].strip()
print(rank)

1



#### Get the year of the first movie:

In [5]:
# The first "secondaryInfo" span holds the year in parentheses, e.g. "(2022)"
year = film_first.find("span", class_="secondaryInfo").string.strip("()")
print(year)

2022
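
The three lookups above (title, rank, year) can be bundled into one helper. The following is a minimal sketch, not part of the original notebook: `parse_film_cell` and `SAMPLE_ROW` are hypothetical names, and the HTML fragment only imitates the structure of the cells printed earlier, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one <td class="titleColumn"> cell.
SAMPLE_ROW = """
<td class="titleColumn">
<a href="/title/tt0000000/">Example Movie</a>
<span class="secondaryInfo">(2022)</span>
<div class="velocity">1
<span class="secondaryInfo">(
<span class="global-sprite titlemeter up"></span>
1)</span>
</div>
</td>
"""


def parse_film_cell(cell):
    """Return (title, year, rank) for one title cell."""
    title = cell.find("a").get_text(strip=True)
    # find() returns the first match in document order, which is the year span.
    year_text = cell.find("span", class_="secondaryInfo").get_text(strip=True)
    year = int(year_text.strip("()"))
    # The rank is the first text node inside the velocity div.
    rank = int(cell.find("div", class_="velocity").contents[0].strip())
    return title, year, rank


cell = BeautifulSoup(SAMPLE_ROW, "html.parser").find("td", class_="titleColumn")
print(parse_film_cell(cell))
```

Converting the year and rank to `int` is a choice made for this sketch; the notebook itself keeps them as strings.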


#### Get the rating of the first movie, the total number of ratings listed and the first 5 ratings:

In [6]:
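# Rating cells are separate <td> elements, in the same order as the title cells
rating_ = bs_imdb.find_all("td", class_="ratingColumn imdbRating")
rating_first = rating_[0].find("strong").string.strip()
print(rating_first)   # rating of the first movie
print(len(rating_))   # number of rating cells
print(rating_[:5])    # first five rating cells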

7,3
100
[<td class="ratingColumn imdbRating">
<strong title="7,3 based on 207 962 user ratings">7,3</strong>
</td>, <td class="ratingColumn imdbRating">
<strong title="7,9 based on 183 394 user ratings">7,9</strong>
</td>, <td class="ratingColumn imdbRating">
<strong title="7,9 based on 664 235 user ratings">7,9</strong>
</td>, <td class="ratingColumn imdbRating">
<strong title="7,4 based on 7 995 user ratings">7,4</strong>
</td>, <td class="ratingColumn imdbRating">
<strong title="7,9 based on 1 289 668 user ratings">7,9</strong>
</td>]
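
Some rating cells may be empty, in which case `find("strong")` returns `None` and calling `.string` on it fails. A small sketch of a guard for that case is shown here, assuming an inline fragment shaped like the cells above; `parse_rating` and `SAMPLE_RATINGS` are hypothetical names, and the second, empty cell is invented for illustration.

```python
from bs4 import BeautifulSoup

# First cell copied from the output above; the empty second cell is hypothetical.
SAMPLE_RATINGS = """
<table><tr>
<td class="ratingColumn imdbRating">
<strong title="7,3 based on 207 962 user ratings">7,3</strong>
</td>
<td class="ratingColumn imdbRating"></td>
</tr></table>
"""


def parse_rating(cell):
    """Return the rating text of a cell, or None when the cell has no rating."""
    strong = cell.find("strong")
    if strong is None:
        return None
    return strong.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_RATINGS, "html.parser")
cells = soup.find_all("td", class_="ratingColumn imdbRating")
ratings_demo = [parse_rating(c) for c in cells]
print(ratings_demo)
```

This keeps one entry per cell, so the list of ratings stays aligned with the list of titles.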


#### Create a DataFrame of IMDb data with title, year and rating:

In [7]:
import re

# Extract the rating (e.g. "7,3") from each rating cell; cells without a rating give None
ratings = []
for rating in rating_:
    match = re.search(r"\d+,\d", str(rating))
    ratings.append(match.group() if match else None)

# Collect the title and year of each movie
titles = []
years = []
for film in films_imdb:
    titles.append(film.find("a").string.strip())
    years.append(film.find("span", class_="secondaryInfo").string.strip("()"))

df_imdb = pd.DataFrame({"Title": titles, "Year": years, "Rating IMDB": ratings})
df_imdb.head()

| | Title | Year | Rating IMDB |
|---|---|---|---|
| 0 | Glass Onion: une histoire à couteaux tirés | 2022 | 7,3 |
| 1 | Avatar: la voie de l'eau | 2022 | 7,9 |
| 2 | À couteaux tirés | 2019 | 7,9 |
| 3 | Babylon | 2022 | 7,4 |
| 4 | Avatar | 2009 | 7,9 |
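
The ratings in this DataFrame are strings with a decimal comma, so they cannot be used in arithmetic directly. One possible conversion is sketched below on a small stand-in frame; `demo` is a hypothetical name, and the unrated third row is invented to show how missing values behave.

```python
import pandas as pd

# Stand-in for df_imdb: two rows from the table above plus a hypothetical unrated film.
demo = pd.DataFrame({
    "Title": ["Glass Onion: une histoire à couteaux tirés", "Babylon", "Hypothetical unrated film"],
    "Year": ["2022", "2022", "2023"],
    "Rating IMDB": ["7,3", "7,4", None],
})

# Years are plain digit strings, so a direct cast is enough.
demo["Year"] = demo["Year"].astype(int)

# Swap the decimal comma for a dot, then let pandas parse the numbers;
# missing ratings become NaN instead of raising an error.
demo["Rating IMDB"] = pd.to_numeric(
    demo["Rating IMDB"].str.replace(",", ".", regex=False)
)

print(demo.dtypes)
print(demo)
```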


### Extract the page content from the Metacritic website:

In [8]:
import requests

url_meta = "https://www.metacritic.com/browse/movies/score/metascore/year/filtered"
# Send a browser-like User-Agent header with the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
res = requests.get(url_meta, headers=headers).text
soup_meta = bs(res, "html.parser")

# Preview the first 30 lines of the parsed document
soup_meta.prettify().splitlines()[0:30]

['<!DOCTYPE html>',
 '<html xmlns:fb="http://ogp.me/ns/fb#" xmlns:og="http://opengraphprotocol.org/schema/">',
 ' <head>',
 '  <title>',
 '   Best Movies for 2023 - Metacritic',
 '  </title>',
 '  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>',
 '  <meta content="See how well critics are rating the Best Movies for 2023" name="description"/>',
 '  <meta content="Metacritic" name="application-name"/>',
 '  <meta content="#000000" name="msapplication-TileColor"/>',
 '  <meta content="/images/win8tile/76bf1426-2886-4b87-ae1c-06424b6bb8a2.png" name="msapplication-TileImage"/>',
 '  <meta content="618k3mbeki8tar7u6wvrum5lxs5cka" name="facebook-domain-verification">',
 '   <meta content="Best Movies for 2023" property="og:title"/>',
 '   <meta content="website" property="og:type"/>',
 '   <meta content="https://www.metacritic.com/browse/movies/score/metascore/year/filtered" property="og:url"/>',
 '   <meta content="https://static.metacritic.com/images/icons/mc_fb_og.png

#### Get all the movies from the Metacritic website:

In [9]:
# Each movie sits in a <td class="clamp-summary-wrap"> cell
films_meta = soup_meta.find_all("td", class_="clamp-summary-wrap")
print(len(films_meta))

100


#### Get the title of the first movie listed:

In [10]:
title_meta = films_meta[0].find("h3").string.strip()
print(title_meta)

Saint Omer


#### Get the rating of the first movie listed:

In [11]:
rating_meta = films_meta[0].find("div", class_="metascore_w").string.strip()
print(rating_meta)

91


#### Get the complete list of movie titles and ratings and create a DataFrame:

In [12]:
ratings_ = []
titles_ = []
for film in films_meta:
    # The title is in the <h3> tag, the Metascore in the "metascore_w" div
    title = film.find("h3").get_text(strip=True)
    rating = film.find("div", class_="metascore_w").string.strip()
    titles_.append(title)
    ratings_.append(rating)

df_meta = pd.DataFrame({"Title": titles_, "Rating Meta": ratings_})
df_meta.head()

| | Title | Rating Meta |
|---|---|---|
| 0 | Saint Omer | 91 |
| 1 | The Blue Caftan | 86 |
| 2 | Alcarràs | 85 |
| 3 | Shin Ultraman | 85 |
| 4 | Full Time | 83 |
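
An alternative to two parallel lists is to build one dictionary per movie, which keeps each title paired with its score even when a score is missing. This is only a sketch under stated assumptions: `SAMPLE_META`, `text_or_none` and `df_demo` are hypothetical names, the fragment merely imitates the tags used above (`td.clamp-summary-wrap`, `h3`, `div.metascore_w`), and the unscored second row is invented.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical fragment; only the tag and class names come from the cells above.
SAMPLE_META = """
<table>
<tr><td class="clamp-summary-wrap"><h3>Saint Omer</h3><div class="metascore_w">91</div></td></tr>
<tr><td class="clamp-summary-wrap"><h3>Hypothetical Unscored Film</h3></td></tr>
</table>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


rows = []
soup = BeautifulSoup(SAMPLE_META, "html.parser")
for cell in soup.find_all("td", class_="clamp-summary-wrap"):
    rows.append({
        "Title": text_or_none(cell.find("h3")),
        "Rating Meta": text_or_none(cell.find("div", class_="metascore_w")),
    })

df_demo = pd.DataFrame(rows)
print(df_demo)
```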


#### Convert all titles to uppercase:

In [13]:
# Uppercase both title columns so the merge key does not depend on letter case
df_imdb["Title"] = df_imdb["Title"].str.upper()
df_meta["Title"] = df_meta["Title"].str.upper()
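
Uppercasing handles differences in letter case, but titles can also differ in stray whitespace or in how accented characters are encoded. A slightly stricter normalization could look like the sketch below; `normalize_title` is a hypothetical helper, not something the notebook defines.

```python
import unicodedata


def normalize_title(title):
    """Return a title in a canonical form suitable as a merge key."""
    # NFKC gives accented characters a single composed representation.
    text = unicodedata.normalize("NFKC", title)
    # Collapse runs of whitespace and trim both ends.
    text = " ".join(text.split())
    # Uppercase, matching the convention used for the merge key.
    return text.upper()


print(normalize_title("  Knock  at the Cabin "))
print(normalize_title("Alcarra\u0300s") == normalize_title("Alcarr\u00e0s"))
```

With pandas, such a helper could be applied to a column with `df["Title"].map(normalize_title)`.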

#### Merge the IMDb and Metacritic DataFrames:

In [14]:
# Inner join (the default): keep only the titles present in both DataFrames
df = pd.merge(df_imdb, df_meta, on="Title")
df.head()

| | Title | Year | Rating IMDB | Rating Meta |
|---|---|---|---|---|
| 0 | KNOCK AT THE CABIN | 2023 | None | 63 |
| 1 | M3GAN | 2022 | 6,1 | 72 |
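
An inner merge silently drops every title that appears on only one site. To see what was dropped, an outer merge with `indicator=True` adds a `_merge` column marking where each row came from. The sketch below uses two tiny stand-in frames, `left` and `right` (hypothetical names), with a few values taken from the tables above.

```python
import pandas as pd

# Stand-ins for df_imdb and df_meta with uppercase titles.
left = pd.DataFrame({"Title": ["M3GAN", "BABYLON"], "Rating IMDB": ["6,1", "7,4"]})
right = pd.DataFrame({"Title": ["M3GAN", "SAINT OMER"], "Rating Meta": ["72", "91"]})

# how="outer" keeps unmatched rows; indicator=True adds the "_merge" column
# with the values "both", "left_only" or "right_only".
merged = pd.merge(left, right, on="Title", how="outer", indicator=True)
print(merged)

# Rows found on both sites, equivalent to the inner merge.
both = merged[merged["_merge"] == "both"]
print(both["Title"].tolist())
```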


#### Compare the ratings:

Only two movies appear in both lists, and IMDb has no rating for the first one. IMDb rates movies out of 10, while Metacritic rates them out of 100. Converted to the same 100-point scale, M3GAN's IMDb rating corresponds to 61/100, so Metacritic gives it the higher score, at 72/100.
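
The rescaling described above can be written out as a short calculation. This is an illustrative sketch: `imdb_to_100` is a hypothetical helper, and it assumes the IMDb rating arrives as a string with a decimal comma, as scraped earlier.

```python
def imdb_to_100(rating_text):
    """Convert an IMDb rating such as "6,1" (out of 10) to a 100-point scale."""
    return round(float(rating_text.replace(",", ".")) * 10)


imdb_score = imdb_to_100("6,1")   # M3GAN on IMDb
meta_score = int("72")            # M3GAN on Metacritic
difference = meta_score - imdb_score

print(imdb_score, meta_score, difference)
```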