# From IMDb: the Top-rated Movies

### *Katherine Li*

**A web scraping session using *BeautifulSoup***

The first half of this project scrapes movies released in 2020, along with their ratings and vote counts. <br>
The second half scrapes the all-time top-rated movies according to IMDb.com.


## Movies Released in 2020

This part scrapes the movies released in 2020, sorted by number of votes in descending order, and collects each movie's name, year, IMDb rating, Metascore, and vote count. The original page can be found [here](https://www.imdb.com/search/title?release_date=2020&sort=num_votes,desc&page=1).
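
The search URL ends with `page=1`, which suggests the listing is paginated. As a minimal sketch, the helper below builds the same URL for any page number using only the standard library. The name `build_search_url` is hypothetical and not part of the original notebook, and the sketch assumes the `page` parameter behaves the way the URL implies, which has not been checked against the live site.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.imdb.com/search/title'


def build_search_url(page=1, year=2020):
    """Build the search URL for one results page (hypothetical helper)."""
    params = {
        'release_date': year,
        'sort': 'num_votes,desc',
        'page': page,
    }
    # safe=',' keeps the comma in 'num_votes,desc' unescaped,
    # so page 1 reproduces the URL used in this notebook
    return BASE_URL + '?' + urlencode(params, safe=',')


print(build_search_url(1))
```

Each generated URL could then be passed to `get()` in the same way as the single URL in the cell that follows.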

In [3]:
# Movies released in 2020

# Get package and url 
from requests import get
url = 'https://www.imdb.com/search/title?release_date=2020&sort=num_votes,desc&page=1'
response = get(url)
print(response.text[:500])



<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle",


In [5]:
# Parse response.text into a BeautifulSoup object and assign it to html_soup.
# The 'html.parser' argument selects Python's built-in HTML parser.
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

In [6]:
# Use find_all() to extract every div container whose class attribute is 'lister-item mode-advanced'

movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50


In [7]:
# See what the information looks like for a single movie by accessing the first container

first_movie = movie_containers[0]
first_movie

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt6723592"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt6723592/"> <img alt="Tenet" class="loadlate" data-tconst="tt6723592" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzg0NGM2NjAtNmIxOC00MDJmLTg5ZmYtYzM0MTE4NWE2NzlhXkEyXkFqcGdeQXVyMTA4NjE0NjEy._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt6723592/">Tenet</a>
<span class="lister-item-year text-muted unbold">(2020)</span>
</h3>
<p class="text-muted">
<span class="certificate">PG-13</span>
<span class="ghost">|</span>
<span class="runtime">150 min</span>
<span class="ghost">|</span>
<span class="genre">
Action, Sci-Fi, Thriller            </

**Now the actual part of scraping the information begins!**

In [9]:
# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from each movie container
for container in movie_containers:
    # Only keep movies that have a Metascore
    if container.find('div', class_='ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)
        # The IMDb rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_='metascore').text
        metascores.append(int(m_score))
        # The number of votes
        vote = container.find('span', attrs={'name': 'nv'})['data-value']
        votes.append(int(vote))

Output the table using pandas.

In [10]:
import pandas as pd
test_df = pd.DataFrame({
    'movie': names,
    'year': years,
    'imdb': imdb_ratings,
    'metascore': metascores,
    'votes': votes,
})
print(test_df.info())
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 5 columns):
movie        35 non-null object
year         35 non-null object
imdb         35 non-null float64
metascore    35 non-null int64
votes        35 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.5+ KB
None


| | movie | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Tenet | (2020) | 7.4 | 69 | 347498 |
| 1 | Soul | (2020) | 8.1 | 83 | 237327 |
| 2 | Wonder Woman 1984 | (2020) | 5.4 | 60 | 199514 |
| 3 | Birds of Prey | (2020) | 6.1 | 60 | 182636 |
| 4 | The Invisible Man | (I) (2020) | 7.1 | 72 | 179769 |
| 5 | Extraction | (2020) | 6.7 | 56 | 166962 |
| 6 | The Trial of the Chicago 7 | (2020) | 7.8 | 76 | 142933 |
| 7 | Enola Holmes | (2020) | 6.6 | 68 | 137356 |
| 8 | Bad Boys for Life | (2020) | 6.6 | 59 | 136444 |
| 9 | The Old Guard | (2020) | 6.7 | 70 | 135178 |
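
The `year` column above holds raw strings such as `(2020)` and `(I) (2020)` rather than numbers. As a small sketch of one possible cleanup step, the hypothetical helper `parse_year` (not part of the original notebook) pulls out the four-digit year with a regular expression and returns `None` when no year is found.

```python
import re


def parse_year(raw):
    """Return the four-digit year in a string like '(I) (2020)', or None."""
    # Match four digits wrapped in parentheses; '(I)' is skipped because
    # it contains no digits
    match = re.search(r'\((\d{4})\)', raw)
    return int(match.group(1)) if match else None


print(parse_year('(2020)'))
print(parse_year('(I) (2020)'))
```

With pandas, the same function could be applied to the whole column, for example with `test_df['year'].map(parse_year)`.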


## All-time Top Movies on IMDb.com

The second section of this project scrapes information about the all-time top movies from IMDb.com. The original page can be found [here](http://www.imdb.com/chart/top).


In [11]:
import requests
import re

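# Download IMDb's Top 250 data (the 'lxml' parser requires the lxml package)
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

# Pull the title cells plus each movie's link, crew, and rating
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
# Note: attrs.get() returns None when the <strong> tag has no 'data-value'
# attribute, which leaves the vote column empty in the table below
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]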

imdb = []

# Store each item into dictionary (data), then put those into a list (imdb)
for index in range(0, len(movies)):
    # Separate movie into: 'place', 'title', 'year'
    movie_string = movies[index].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    # The rank is index + 1, so count its digits rather than the 0-based index's
    rank_width = len(str(index + 1))
    movie_title = movie[rank_width + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    place = movie[:rank_width]
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

for item in imdb:
    print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:', item['star_cast'])

1 - The Shawshank Redemption (1994) - Starring: Frank Darabont (dir.), Tim Robbins, Morgan Freeman
2 - The Godfather (1972) - Starring: Francis Ford Coppola (dir.), Marlon Brando, Al Pacino
3 - The Godfather: Part II (1974) - Starring: Francis Ford Coppola (dir.), Al Pacino, Robert De Niro
4 - The Dark Knight (2008) - Starring: Christopher Nolan (dir.), Christian Bale, Heath Ledger
5 - 12 Angry Men (1957) - Starring: Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb
6 - Schindler's List (1993) - Starring: Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes
7 - The Lord of the Rings: The Return of the King (2003) - Starring: Peter Jackson (dir.), Elijah Wood, Viggo Mortensen
8 - Pulp Fiction (1994) - Starring: Quentin Tarantino (dir.), John Travolta, Uma Thurman
9 - The Good, the Bad and the Ugly (1966) - Starring: Sergio Leone (dir.), Clint Eastwood, Eli Wallach
10 - The Lord of the Rings: The Fellowship of the Ring (2001) - Starring: Peter Jackson (dir.), Elijah Wood, Ian McKellen
11 - Fi
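
Slicing by character position depends on how many digits the rank has, which makes it easy to get wrong. A regular expression is one possible alternative. The sketch below assumes the cell text has the form `rank. title (year)`, as the slicing in the loop implies. The names `CELL_PATTERN` and `split_title_cell` are hypothetical and not part of the original notebook.

```python
import re

# Assumed layout after whitespace normalization: '<rank>. <title> (<year>)'
CELL_PATTERN = re.compile(r'^(\d+)\.\s+(.+)\s+\((\d{4})\)$')


def split_title_cell(text):
    """Split one title cell into place, title, and year (hypothetical helper)."""
    # Collapse newlines and runs of spaces into single spaces
    normalized = ' '.join(text.split())
    match = CELL_PATTERN.match(normalized)
    if match is None:
        return None
    place, title, year = match.groups()
    return {'place': place, 'movie_title': title, 'year': year}


cell = '  10.\n  The Lord of the Rings: The Fellowship of the Ring\n  (2001) '
print(split_title_cell(cell))
```

Because only the period after the rank is consumed, any periods inside a title are left intact, unlike with `replace('.', '')`.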

Make it look better with pandas.

In [13]:
imdb_df = pd.DataFrame(imdb)
imdb_df

| | movie_title | year | place | star_cast | rating | vote | link |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 1 | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 9.219578158515473 | | /title/tt0111161/ |
| 1 | The Godfather | 1972 | 2 | Francis Ford Coppola (dir.), Marlon Brando, Al... | 9.14749576670067 | | /title/tt0068646/ |
| 2 | The Godfather: Part II | 1974 | 3 | Francis Ford Coppola (dir.), Al Pacino, Robert... | 8.97983715793398 | | /title/tt0071562/ |
| 3 | The Dark Knight | 2008 | 4 | Christopher Nolan (dir.), Christian Bale, Heat... | 8.969327451981416 | | /title/tt0468569/ |
| 4 | 12 Angry Men | 1957 | 5 | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 8.934273321261427 | | /title/tt0050083/ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 245 | Anand | 1971 | 246 | Hrishikesh Mukherjee (dir.), Rajesh Khanna, Am... | 8.021323036178087 | | /title/tt0066763/ |
| 246 | Time of the Gypsies | 1988 | 247 | Emir Kusturica (dir.), Davor Dujmovic, Bora To... | 8.019245806734084 | | /title/tt0097223/ |
| 247 | Nights of Cabiria | 1957 | 248 | Federico Fellini (dir.), Giulietta Masina, Fra... | 8.018106001107357 | | /title/tt0050783/ |
| 248 | Throne of Blood | 1957 | 249 | Akira Kurosawa (dir.), Toshirô Mifune, Minoru ... | 8.018083024864954 | | /title/tt0050613/ |
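
In the table above, `rating` is a long string taken from the `data-value` attribute and `link` is a relative path. As a rough sketch of a post-processing step, the hypothetical helper `clean_row` (not part of the original notebook) rounds the rating to one decimal and turns the link into an absolute URL. The base URL `https://www.imdb.com` is taken from the search URL used earlier.

```python
from urllib.parse import urljoin

IMDB_BASE = 'https://www.imdb.com'


def clean_row(row):
    """Return a copy of one scraped record with a tidier rating and link."""
    cleaned = dict(row)
    # The scraped rating is a string such as '9.219578158515473'
    cleaned['rating'] = round(float(row['rating']), 1)
    # Resolve the relative '/title/...' path against the site root
    cleaned['link'] = urljoin(IMDB_BASE, row['link'])
    return cleaned


sample = {
    'movie_title': 'The Shawshank Redemption',
    'rating': '9.219578158515473',
    'link': '/title/tt0111161/',
}
print(clean_row(sample))
```

The cleaned records could be built with `[clean_row(item) for item in imdb]` before being passed to `pd.DataFrame`.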
