# Web Scraping


Today we will do a simple web scraping exercise. By the end of the class we will be able to extract a dataset from the following [link](http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1).

This page lists the most-voted movies on IMDB for 2017. The goal is to extract two lists: the first should contain each movie's certificate (its rating classification, such as PG-13 or R), and the second its runtime in minutes.

## Links we will use throughout the class:

- [IMDB](https://www.imdb.com/)
- [IMDB scraping guide](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)
- [Pythex - online tool for testing regular expressions](https://pythex.org/)
- [Python regular expressions cheat sheet](https://www.debuggex.com/cheatsheet/regex/python)

In [2]:
import re
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

Always remember that the key to any web scraping exercise is getting the right HTML. There are two ways to work with HTML code. The first is to view it directly in the browser, using the *View page source* option.

In [2]:
# One way to start working with this HTML is to copy the text from the browser and assign it to a Python string


texto = '''
<a href="/title/tt0468569/?ref_=adv_li_i"
> <img alt="The Dark Knight"
class="loadlate"
loadlate="https://ia.media-imdb.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg"
data-tconst="tt0468569"
height="98"
src="https://images-na.ssl-images-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB499613450_.png"
width="67" />
</a>        </div>
        <div class="lister-item-content">
<h3 class="lister-item-header">
        <span class="lister-item-index unbold text-primary">2.</span>
    
    <a href="/title/tt0468569/?ref_=adv_li_tt"
>The Dark Knight</a>
    <span class="lister-item-year text-muted unbold">(2008)</span>
</h3>
    <p class="text-muted ">
            <span class="certificate">PG-13</span>
             <span class="ghost">|</span> 
                <span class="runtime">152 min</span>
             <span class="ghost">|</span> 
            <span class="genre">
Action, Crime, Drama            </span>
    </p>
    <div class="ratings-bar">
    <div class="inline-block ratings-imdb-rating" name="ir" data-value="9">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>9.0</strong>
    </div>
            <div class="inline-block ratings-user-rating">
                <span class="userRatingValue" id="urv_tt0468569" data-tconst="tt0468569">
                    <span class="global-sprite rating-star no-rating"></span>
                    <span name="ur" data-value="0" class="rate" data-no-rating="Rate this">Rate this</span>
                </span>
    <div class="starBarWidget" id="sb_tt0468569">
<div class="rating rating-list" data-starbar-class="rating-list" data-auth="" data-user="" id="tt0468569|imdb|9|9|||search|title" data-ga-identifier=""
title="Users rated this 9/10 (1,909,907 votes) - click stars to rate" itemtype="http://schema.org/AggregateRating" itemscope itemprop="aggregateRating">
  <meta itemprop="ratingValue" content="9" />
  <meta itemprop="bestRating" content="10" />
  <meta itemprop="ratingCount" content="1909907" />
<span class="rating-bg">&nbsp;</span>
<span class="rating-imdb " style="width: 126px">&nbsp;</span>
<span class="rating-stars">
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>1</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>2</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>3</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>4</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>5</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>6</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>7</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>8</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>9</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>10</span></a>
</span>
<span class="rating-rating "><span class="value">9</span><span class="grey">/</span><span class="grey">10</span></span>
<span class="rating-cancel "><a href="/title/tt0468569/vote?v=X;k=" title="Delete" rel="nofollow"><span>X</span></a></span>
&nbsp;</div>
    </div>
            </div>
            <div class="inline-block ratings-metascore">
<span class="metascore  favorable">82        </span>
        Metascore
            </div>
    </div>
<p class="text-muted">
When the menace known as the Joker emerges from his mysterious past, he wreaks havoc and chaos on the people of Gotham, the Dark Knight must accept one of the greatest psychological and physical tests of his ability to fight injustice.</p>
    <p class="">
    Director:
<a href="/name/nm0634240/?ref_=adv_li_dr_0"
>Christopher Nolan</a>
             <span class="ghost">|</span> 
    Stars:
<a href="/name/nm0000288/?ref_=adv_li_st_0"
>Christian Bale</a>, 
<a href="/name/nm0005132/?ref_=adv_li_st_1"
>Heath Ledger</a>, 
<a href="/name/nm0001173/?ref_=adv_li_st_2"
>Aaron Eckhart</a>, 
<a href="/name/nm0000323/?ref_=adv_li_st_3"
>Michael Caine</a>
    </p>
        <p class="sort-num_votes-visible">
                <span class="text-muted">Votes:</span>
                <span name="nv" data-value="1909907">1,909,907</span>
<span class="ghost">|</span>                <span class="text-muted">Gross:</span>
                <span name="nv" data-value="534,858,444">$534.86M</span>
        </p>
        </div>
    </div>
    <div class="lister-item mode-advanced">
        <div class="lister-top-right">
    <div class="ribbonize" data-tconst="tt1375666" data-caller="filmosearch"></div>
        </div>
        <div class="lister-item-image float-left">


<a href="/title/tt1375666/?ref_=adv_li_i"
> <img alt="Inception"
class="loadlate"
loadlate="https://ia.media-imdb.com/images/M/MV5BMjAxMzY3NjcxNF5BMl5BanBnXkFtZTcwNTI5OTM0Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg"
data-tconst="tt1375666"
height="98"
src="https://images-na.ssl-images-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB499613450_.png"
width="67" />
</a>        </div>
        <div class="lister-item-content">
<h3 class="lister-item-header">
        <span class="lister-item-index unbold text-primary">3.</span>
    
    <a href="/title/tt1375666/?ref_=adv_li_tt"
>Inception</a>
    <span class="lister-item-year text-muted unbold">(2010)</span>
</h3>
    <p class="text-muted ">
            <span class="certificate">PG-13</span>
             <span class="ghost">|</span> 
                <span class="runtime">148 min</span>
             <span class="ghost">|</span> 
            <span class="genre">
Action, Adventure, Sci-Fi            </span>
    </p>
    <div class="ratings-bar">
    <div class="inline-block ratings-imdb-rating" name="ir" data-value="8.8">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>8.8</strong>
    </div>
            <div class="inline-block ratings-user-rating">
                <span class="userRatingValue" id="urv_tt1375666" data-tconst="tt1375666">
                    <span class="global-sprite rating-star no-rating"></span>
                    <span name="ur" data-value="0" class="rate" data-no-rating="Rate this">Rate this</span>
                </span>
    <div class="starBarWidget" id="sb_tt1375666">
<div class="rating rating-list" data-starbar-class="rating-list" data-auth="" data-user="" id="tt1375666|imdb|8.8|8.8|||search|title" data-ga-identifier=""
title="Users rated this 8.8/10 (1,697,078 votes) - click stars to rate" itemtype="http://schema.org/AggregateRating" itemscope itemprop="aggregateRating">
  <meta itemprop="ratingValue" content="8.8" />
  <meta itemprop="bestRating" content="10" />
  <meta itemprop="ratingCount" content="1697078" />
<span class="rating-bg">&nbsp;</span>
<span class="rating-imdb " style="width: 123.2px">&nbsp;</span>
<span class="rating-stars">
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>1</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>2</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>3</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>4</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>5</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>6</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>7</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>8</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>9</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>10</span></a>
</span>
<span class="rating-rating "><span class="value">8.8</span><span class="grey">/</span><span class="grey">10</span></span>
<span class="rating-cancel "><a href="/title/tt1375666/vote?v=X;k=" title="Delete" rel="nofollow"><span>X</span></a></span>
&nbsp;</div>
    </div>
            </div>
            <div class="inline-block ratings-metascore">
<span class="metascore  favorable">74        </span>
        Metascore
            </div>
    </div>
<p class="text-muted">
A thief, who steals corporate secrets through the use of dream-sharing technology, is given the inverse task of planting an idea into the mind of a CEO.</p>
    <p class="">
    Director:
<a href="/name/nm0634240/?ref_=adv_li_dr_0"
>Christopher Nolan</a>
             <span class="ghost">|</span> 
    Stars:
<a href="/name/nm0000138/?ref_=adv_li_st_0"
>Leonardo DiCaprio</a>, 
<a href="/name/nm0330687/?ref_=adv_li_st_1"
>Joseph Gordon-Levitt</a>, 
<a href="/name/nm0680983/?ref_=adv_li_st_2"
>Ellen Page</a>, 
<a href="/name/nm0913822/?ref_=adv_li_st_3"
>Ken Watanabe</a>
    </p>
        <p class="sort-num_votes-visible">
                <span class="text-muted">Votes:</span>
                <span name="nv" data-value="1697078">1,697,078</span>
<span class="ghost">|</span>                <span class="text-muted">Gross:</span>
                <span name="nv" data-value="292,576,195">$292.58M</span>
        </p>
        </div>
    </div>
    <div class="lister-item mode-advanced">
        <div class="lister-top-right">
    <div class="ribbonize" data-tconst="tt0137523" data-caller="filmosearch"></div>
        </div>
        <div class="lister-item-image float-left">


<a href="/title/tt0137523/?ref_=adv_li_i"
> <img alt="Fight Club"
class="loadlate"
loadlate="https://ia.media-imdb.com/images/M/MV5BMzFjMWNhYzQtYTIxNC00ZWQ1LThiOTItNWQyZmMxNDYyMjA5XkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UX67_CR0,0,67,98_AL_.jpg"
data-tconst="tt0137523"
height="98"
src="https://images-na.ssl-images-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB499613450_.png"
width="67" />
</a>        </div>
        <div class="lister-item-content">
<h3 class="lister-item-header">
        <span class="lister-item-index unbold text-primary">4.</span>
    
    <a href="/title/tt0137523/?ref_=adv_li_tt"
>Fight Club</a>
    <span class="lister-item-year text-muted unbold">(1999)</span>
</h3>
    <p class="text-muted ">
            <span class="certificate">R</span>
             <span class="ghost">|</span> 
                <span class="runtime">139 min</span>
             <span class="ghost">|</span> 
            <span class="genre">
Drama            </span>
    </p>
    <div class="ratings-bar">
    <div class="inline-block ratings-imdb-rating" name="ir" data-value="8.8">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>8.8</strong>
    </div>
            <div class="inline-block ratings-user-rating">
                <span class="userRatingValue" id="urv_tt0137523" data-tconst="tt0137523">
                    <span class="global-sprite rating-star no-rating"></span>
                    <span name="ur" data-value="0" class="rate" data-no-rating="Rate this">Rate this</span>
                </span>
    <div class="starBarWidget" id="sb_tt0137523">
<div class="rating rating-list" data-starbar-class="rating-list" data-auth="" data-user="" id="tt0137523|imdb|8.8|8.8|||search|title" data-ga-identifier=""
title="Users rated this 8.8/10 (1,553,532 votes) - click stars to rate" itemtype="http://schema.org/AggregateRating" itemscope itemprop="aggregateRating">
  <meta itemprop="ratingValue" content="8.8" />
  <meta itemprop="bestRating" content="10" />
  <meta itemprop="ratingCount" content="1553532" />
<span class="rating-bg">&nbsp;</span>
<span class="rating-imdb " style="width: 123.2px">&nbsp;</span>
<span class="rating-stars">
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>1</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>2</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>3</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>4</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>5</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>6</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>7</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>8</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>9</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>10</span></a>
</span>
<span class="rating-rating "><span class="value">8.8</span><span class="grey">/</span><span class="grey">10</span></span>
<span class="rating-cancel "><a href="/title/tt0137523/vote?v=X;k=" title="Delete" rel="nofollow"><span>X</span></a></span>
&nbsp;</div>
    </div>
            </div>
            <div class="inline-block ratings-metascore">
<span class="metascore  favorable">66        </span>
        Metascore
            </div>
    </div>
<p class="text-muted">
An insomniac office worker, looking for a way to change his life, crosses paths with a devil-may-care soapmaker, forming an underground fight club that evolves into something much, much more.</p>
    <p class="">
    Director:
<a href="/name/nm0000399/?ref_=adv_li_dr_0"
>David Fincher</a>
             <span class="ghost">|</span> 
    Stars:
<a href="/name/nm0000093/?ref_=adv_li_st_0"
>Brad Pitt</a>, 
<a href="/name/nm0001570/?ref_=adv_li_st_1"
>Edward Norton</a>, 
<a href="/name/nm0001533/?ref_=adv_li_st_2"
>Meat Loaf</a>, 
<a href="/name/nm0340260/?ref_=adv_li_st_3"
>Zach Grenier</a>
    </p>
        <p class="sort-num_votes-visible">
                <span class="text-muted">Votes:</span>
                <span name="nv" data-value="1553532">1,553,532</span>
<span class="ghost">|</span>                <span class="text-muted">Gross:</span>
                <span name="nv" data-value="37,030,102">$37.03M</span>
        </p>
        </div>
    </div>
    <div class="lister-item mode-advanced">
        <div class="lister-top-right">
    <div class="ribbonize" data-tconst="tt0110912" data-caller="filmosearch"></div>
        </div>
        <div class="lister-item-image float-left">


<a href="/title/tt0110912/?ref_=adv_li_i"
> <img alt="Pulp Fiction"
class="loadlate"
loadlate="https://ia.media-imdb.com/images/M/MV5BMTkxMTA5OTAzMl5BMl5BanBnXkFtZTgwNjA5MDc3NjE@._V1_UX67_CR0,0,67,98_AL_.jpg"
data-tconst="tt0110912"
height="98"
src="https://images-na.ssl-images-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB499613450_.png"
width="67" />
</a>        </div>
        <div class="lister-item-content">
<h3 class="lister-item-header">
        <span class="lister-item-index unbold text-primary">5.</span>
    
    <a href="/title/tt0110912/?ref_=adv_li_tt"
>Pulp Fiction</a>
    <span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
    <p class="text-muted ">
            <span class="certificate">R</span>
             <span class="ghost">|</span> 
                <span class="runtime">154 min</span>
             <span class="ghost">|</span> 
            <span class="genre">
Crime, Drama            </span>
    </p>
    <div class="ratings-bar">
    <div class="inline-block ratings-imdb-rating" name="ir" data-value="8.9">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>8.9</strong>
    </div>
            <div class="inline-block ratings-user-rating">
                <span class="userRatingValue" id="urv_tt0110912" data-tconst="tt0110912">
                    <span class="global-sprite rating-star no-rating"></span>
                    <span name="ur" data-value="0" class="rate" data-no-rating="Rate this">Rate this</span>
                </span>
    <div class="starBarWidget" id="sb_tt0110912">
<div class="rating rating-list" data-starbar-class="rating-list" data-auth="" data-user="" id="tt0110912|imdb|8.9|8.9|||search|title" data-ga-identifier=""
title="Users rated this 8.9/10 (1,515,663 votes) - click stars to rate" itemtype="http://schema.org/AggregateRating" itemscope itemprop="aggregateRating">
  <meta itemprop="ratingValue" content="8.9" />
  <meta itemprop="bestRating" content="10" />
  <meta itemprop="ratingCount" content="1515663" />
<span class="rating-bg">&nbsp;</span>
<span class="rating-imdb " style="width: 124.6px">&nbsp;</span>
<span class="rating-stars">
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>1</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>2</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>3</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>4</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>5</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>6</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>7</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>8</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>9</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>10</span></a>
</span>
<span class="rating-rating "><span class="value">8.9</span><span class="grey">/</span><span class="grey">10</span></span>
<span class="rating-cancel "><a href="/title/tt0110912/vote?v=X;k=" title="Delete" rel="nofollow"><span>X</span></a></span>
&nbsp;</div>
    </div>
            </div>
            <div class="inline-block ratings-metascore">
<span class="metascore  favorable">94        </span>
        Metascore
            </div>
    </div>
<p class="text-muted">
The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.</p>
    <p class="">
    Director:
<a href="/name/nm0000233/?ref_=adv_li_dr_0"
>Quentin Tarantino</a>
             <span class="ghost">|</span> 
    Stars:
<a href="/name/nm0000237/?ref_=adv_li_st_0"
>John Travolta</a>, 
<a href="/name/nm0000235/?ref_=adv_li_st_1"
>Uma Thurman</a>, 
<a href="/name/nm0000168/?ref_=adv_li_st_2"
>Samuel L. Jackson</a>, 
<a href="/name/nm0000246/?ref_=adv_li_st_3"
>Bruce Willis</a>
    </p>
        <p class="sort-num_votes-visible">
                <span class="text-muted">Votes:</span>
                <span name="nv" data-value="1515663">1,515,663</span>
<span class="ghost">|</span>                <span class="text-muted">Gross:</span>
                <span name="nv" data-value="107,928,762">$107.93M</span>
        </p>
        </div>
    </div>
    <div class="lister-item mode-advanced">
        <div class="lister-top-right">
    <div class="ribbonize" data-tconst="tt0109830" data-caller="filmosearch"></div>
        </div>
        <div class="lister-item-image float-left">


<a href="/title/tt0109830/?ref_=adv_li_i"
> <img alt="Forrest Gump"
class="loadlate"
loadlate="https://ia.media-imdb.com/images/M/MV5BNWIwODRlZTUtY2U3ZS00Yzg1LWJhNzYtMmZiYmEyNmU1NjMzXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_UY98_CR0,0,67,98_AL_.jpg"
data-tconst="tt0109830"
height="98"
src="https://images-na.ssl-images-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB499613450_.png"
width="67" />
</a>        </div>
        <div class="lister-item-content">
<h3 class="lister-item-header">
        <span class="lister-item-index unbold text-primary">6.</span>
    
    <a href="/title/tt0109830/?ref_=adv_li_tt"
>Forrest Gump</a>
    <span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
    <p class="text-muted ">
            <span class="certificate">PG-13</span>
             <span class="ghost">|</span> 
                <span class="runtime">142 min</span>
             <span class="ghost">|</span> 
            <span class="genre">
Drama, Romance            </span>
    </p>
    <div class="ratings-bar">
    <div class="inline-block ratings-imdb-rating" name="ir" data-value="8.8">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>8.8</strong>
    </div>
            <div class="inline-block ratings-user-rating">
                <span class="userRatingValue" id="urv_tt0109830" data-tconst="tt0109830">
                    <span class="global-sprite rating-star no-rating"></span>
                    <span name="ur" data-value="0" class="rate" data-no-rating="Rate this">Rate this</span>
                </span>
    <div class="starBarWidget" id="sb_tt0109830">
<div class="rating rating-list" data-starbar-class="rating-list" data-auth="" data-user="" id="tt0109830|imdb|8.8|8.8|||search|title" data-ga-identifier=""
title="Users rated this 8.8/10 (1,469,648 votes) - click stars to rate" itemtype="http://schema.org/AggregateRating" itemscope itemprop="aggregateRating">
  <meta itemprop="ratingValue" content="8.8" />
  <meta itemprop="bestRating" content="10" />
  <meta itemprop="ratingCount" content="1469648" />
<span class="rating-bg">&nbsp;</span>
<span class="rating-imdb " style="width: 123.2px">&nbsp;</span>
<span class="rating-stars">
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>1</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>2</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>3</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>4</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>5</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>6</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>7</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>8</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>9</span></a>
      <a href="/register/login?why=vote&ref_=tt_ov_rt"
rel="nofollow" title="Register or login to rate this title" ><span>10</span></a>
</span>
'''
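
With the HTML held in a string, regular expressions can already pull out individual fields. A minimal sketch, assuming a shortened two-movie fragment (`sample` is a hypothetical stand-in for the pasted text, so the block runs on its own):

```python
import re

# Hypothetical, shortened stand-in for the HTML pasted above
sample = '''
<span class="certificate">PG-13</span>
<span class="runtime">152 min</span>
<span class="certificate">R</span>
<span class="runtime">139 min</span>
'''

# Capture the digits that follow the opening runtime tag
runtimes = re.findall(r'<span class="runtime">(\d+)', sample)

# Capture the text between the certificate tags ('.' does not cross newlines)
certificates = re.findall(r'<span class="certificate">(.+)</span>', sample)

print(runtimes)      # ['152', '139']
print(certificates)  # ['PG-13', 'R']
```

The captured runtimes are strings; they would still need `int()` before any arithmetic.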

## However, this is impractical

Python can fetch the HTML of a link directly using the requests library.

In [9]:
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'

response = get(url)

print(response.text)




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
        <title>IMDb: Most Voted Titles Released 2017-01-01 to 2017-12-31 - IMDb</title>
  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'functio
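
The request above assumes the server answers with the page we asked for. A slightly more defensive sketch follows; `fetch_html`, the timeout value and the User-Agent string are illustrative assumptions, not part of the original class:

```python
import requests

# Assumption: a generic User-Agent string; adjust it to your own needs
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (scraping-class example)"}


def fetch_html(url, timeout=10):
    """Return the page's HTML, raising an exception on HTTP errors."""
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text


# Example call (performs a real network request):
# html = fetch_html('http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1')
```

Failing early on a bad status code avoids running the extraction step on an error page.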

## Now that we have the HTML in Python, let's start extracting the information we care about

- We can apply regular expressions to the HTML to find the information we are looking for

In [10]:
# Raw strings keep the backslashes in the patterns intact
lista_duration = re.findall(r'<span class="runtime">(\d+)', response.text)
lista_class = re.findall(r'<span class="certificate">(.+)</span>', response.text)

print("Number of elements in the list of runtimes", len(lista_duration))
print("Number of elements in the list of certificates", len(lista_class))

Number of elements in the list of runtimes 50
Number of elements in the list of certificates 46


Problem: the two lists have different lengths, so we cannot tell which certificate belongs to which runtime.
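
The mismatch can be reproduced on a small scale. A sketch, assuming a hypothetical three-movie fragment in which the second movie has no certificate:

```python
import re

# Hypothetical fragment: the second movie has no certificate
sample = '''
<div class="lister-item mode-advanced">
<span class="certificate">R</span>
<span class="runtime">137 min</span>
</div>
<div class="lister-item mode-advanced">
<span class="runtime">95 min</span>
</div>
<div class="lister-item mode-advanced">
<span class="certificate">PG-13</span>
<span class="runtime">141 min</span>
</div>
'''

runtimes = re.findall(r'<span class="runtime">(\d+)', sample)
certificates = re.findall(r'<span class="certificate">(.+)</span>', sample)

print(len(runtimes), len(certificates))  # 3 2

# zip pairs by position, so the third movie's certificate lands on the second runtime
print(list(zip(runtimes, certificates)))  # [('137', 'R'), ('95', 'PG-13')]
```

Searching the whole page at once loses the link between a value and the movie it came from.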

What can we do?

Find a pattern that extracts the block of HTML corresponding to each movie, and then apply the expressions we already have to each block, so that a movie without the field is recorded as missing instead of being silently skipped.
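
The idea can be sketched with regular expressions alone. This example assumes a hypothetical fragment and uses `re.split` on the container's opening tag as one simple way to obtain the blocks; `first_match` is an illustrative helper:

```python
import re

# Hypothetical fragment: the second movie has no certificate
sample = '''
<div class="lister-item mode-advanced">
<span class="certificate">R</span>
<span class="runtime">137 min</span>
</div>
<div class="lister-item mode-advanced">
<span class="runtime">95 min</span>
</div>
<div class="lister-item mode-advanced">
<span class="certificate">PG-13</span>
<span class="runtime">141 min</span>
</div>
'''

# Assumption: every movie block starts with this opening tag.
# The first piece is whatever precedes the first movie, so it is dropped.
blocks = re.split(r'<div class="lister-item mode-advanced">', sample)[1:]


def first_match(pattern, text):
    """Return the first captured group, or None when the pattern is absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


runtimes = [first_match(r'<span class="runtime">(\d+)', b) for b in blocks]
certificates = [first_match(r'<span class="certificate">(.+)</span>', b) for b in blocks]

print(runtimes)      # ['137', '95', '141']
print(certificates)  # ['R', None, 'PG-13']
```

Both lists now have one entry per movie, with `None` marking the missing certificate.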


In [11]:

# Parse the HTML so we can search it by tag and class
html_soup = BeautifulSoup(response.text, 'html.parser')

# Each movie lives in its own <div class="lister-item mode-advanced"> container
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

classm = []
time = []

for movie in movie_containers:
    # findall returns one list per movie (empty if the field is missing),
    # so both lists receive exactly one entry per container
    classm.append(re.findall(r'<span class="certificate">(.+)</span>', str(movie)))
    time.append(re.findall(r'<span class="runtime">(\d+)', str(movie)))
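
Once the containers are parsed, the tags can also be queried directly instead of running a regular expression over `str(movie)`. A sketch on a hypothetical two-movie fragment; `tag_text` is an illustrative helper, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second movie has no certificate
sample_html = """
<div class="lister-item mode-advanced">
  <span class="certificate">PG-13</span>
  <span class="runtime">152 min</span>
</div>
<div class="lister-item mode-advanced">
  <span class="runtime">95 min</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
containers = soup.find_all("div", class_="lister-item mode-advanced")


def tag_text(container, css_class):
    """Return the text of the first matching <span>, or None if it is absent."""
    tag = container.find("span", class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


certificates = [tag_text(c, "certificate") for c in containers]
runtimes = [tag_text(c, "runtime") for c in containers]

print(certificates)  # ['PG-13', None]
print(runtimes)      # ['152 min', '95 min']
```

Here the runtime keeps its ` min` suffix, so it would still need cleaning before conversion to a number.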



In [12]:
print("Number of elements in the list of runtimes", len(time))
print("Number of elements in the list of certificates", len(classm))

Number of elements in the list of runtimes 50
Number of elements in the list of certificates 50


We create a dictionary with both variables

In [13]:
db = {"duration": time, "certificate": classm}

In [14]:
database = pd.DataFrame(db)

database

| | duration | certificate |
|---|---|---|
| 0 | [137] | [R] |
| 1 | [141] | [PG-13] |
| 2 | [106] | [PG-13] |
| 3 | [152] | [PG-13] |
| 4 | [136] | [PG-13] |
| 5 | [130] | [PG-13] |
| 6 | [133] | [PG-13] |
| 7 | [104] | [R] |
| 8 | [164] | [R] |
| 9 | [112] | [R] |
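
The bracketed cells in the table above are one-element lists, because `re.findall` returns a list for every movie. A sketch for unwrapping them before building the DataFrame; the sample values, `first_or_none`, and the `clean_db` keys are hypothetical:

```python
# Hypothetical values in the same shape the loop produces: one list per movie
time = [['137'], ['141'], []]
classm = [['R'], ['PG-13'], []]


def first_or_none(matches):
    """Return the first match, or None when the list is empty."""
    return matches[0] if matches else None


# Unwrap each list, then convert the runtimes that exist to integers
durations = [first_or_none(m) for m in time]
durations = [int(d) if d is not None else None for d in durations]
certificates = [first_or_none(m) for m in classm]

clean_db = {"duration_min": durations, "certificate": certificates}

print(clean_db)
# {'duration_min': [137, 141, None], 'certificate': ['R', 'PG-13', None]}
```

A dictionary in this shape can be passed to `pd.DataFrame` in the same way, giving scalar cells and explicit missing values.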
