# IMDB Crawler

## This notebook extracts information from the official IMDB website

The main goal is to create a **dataset** that contains information about the **Top Rated Movies**.


by Ion Petropoulos

In [1]:
import requests
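try:
    r = requests.get("https://www.imdb.com/chart/top?ref_=nv_mv_250")
    # requests.get does not raise on 4xx/5xx responses by itself
    r.raise_for_status()
    # print only after a successful request, so r is always defined here
    print(r.content)
except requests.exceptions.HTTPError as he:
    print(he)
except requests.exceptions.ConnectionError as ce:
    print(ce)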

b'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    \n    \n    \n\n    \n    \n    \n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">\n            <style>\n                body#styleguide-v2 {\n                    background: no-repeat fixed center top #000;\n                }\n            </style>\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>IMDb Top 250 - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_tit

In [2]:
from bs4 import BeautifulSoup

html = r.content
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1)

<h1 class="header">Top Rated Movies</h1>


This is the header of the page we are going to crawl.

In [3]:
print(soup.prettify())  # prettify is a method, so it has to be called

<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb Top 250 - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitl

What we want to extract from the page is the **titles** of the movies, their **dates** and their **ratings**.

If we inspect the page we can see that the information is inside a table with:

* The title of the movie under an ```<a>``` tag
* The date under a ```<span>``` tag
* The rating under a ```<strong>``` tag
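
To make the layout described above concrete, here is a minimal sketch that walks one hand-written table row. It uses only the standard library's `html.parser` as a stand-in for BeautifulSoup, so it runs without extra installs. The sample row, its attribute values and the `RowParser` class are hypothetical and only imitate the structure listed above.

```python
from html.parser import HTMLParser

# Hypothetical row imitating the layout described above; not copied from IMDB.
SAMPLE_ROW = """
<tr>
  <td class="titleColumn">
    <a href="/title/tt0000000/" title="Jane Doe (dir.), John Roe">Example Movie</a>
    <span class="secondaryInfo">(1999)</span>
  </td>
  <td class="ratingColumn">
    <strong title="8.7 based on 1,234 user ratings">8.7</strong>
  </td>
</tr>
"""


class RowParser(HTMLParser):
    """Collect the text found inside <a>, <span> and <strong> tags."""

    def __init__(self):
        super().__init__()
        self.current = None  # tag whose text is currently being read
        self.fields = {}     # tag name -> text content

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "span", "strong"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] = data.strip()


parser = RowParser()
parser.feed(SAMPLE_ROW)
fields = parser.fields
print(fields)
```

Each of the three tags maps to one piece of information: the `<a>` text is the movie name, the `<span>` text is the date and the `<strong>` text is the rating.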

In [4]:
table = soup.find("table", {"class":"chart full-width"})
# This is the table with all the movie content

We can find the movie rating inside the ```title``` attribute of the ```<strong>``` tag and extract it with a regular expression.

In [5]:
import re

scores = table.find_all('strong')
rates = []
for score in scores:
    full_score = score.get('title')
    # raw string, so \d and \. are passed to the regex engine unchanged
    rate = re.findall(r"\d\.\d", full_score)
    rates.append(rate.pop())
print(rates)

['9.2', '9.1', '9.0', '9.0', '8.9', '8.9', '8.9', '8.9', '8.8', '8.8', '8.8', '8.8', '8.8', '8.7', '8.7', '8.7', '8.6', '8.6', '8.6', '8.6', '8.6', '8.6', '8.6', '8.6', '8.6', '8.6', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.5', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.4', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.3', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2', '8.2'
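
The regular-expression step can also be tried in isolation. The sketch below assumes that the `title` attribute looks like `"9.2 based on 2,345,678 user ratings"`; the sample strings and the `extract_rating` helper are hypothetical, and the wording on the live page may differ.

```python
import re

# Assumed shape of the title attribute; made-up values for illustration.
sample_titles = [
    "9.2 based on 2,345,678 user ratings",
    "8.5 based on 987,654 user ratings",
    "no rating available",
]

# One digit, a literal dot, one digit.
RATING_PATTERN = re.compile(r"\d\.\d")


def extract_rating(title):
    """Return the first rating-like number in title as a float, or None."""
    match = RATING_PATTERN.search(title)
    return float(match.group()) if match else None


ratings = [extract_rating(t) for t in sample_titles]
print(ratings)
```

Returning `None` for a title without a match avoids the `IndexError` that `pop()` raises on an empty list. The pattern expects a single digit before the dot, so a value such as `10.0` would need `\d+\.\d` instead.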

In [6]:
import csv

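# newline='' prevents blank lines between rows on Windows
with open('movies.csv', mode='w', newline='', encoding='utf-8') as movie_file:
    movie_writer = csv.writer(movie_file)
    count = 0  # index of the rating that belongs to the current movie
    movie_writer.writerow(["Director", "Movie", "Date", "Rating"])

    for link in table.find_all("a"):
        # only the links that carry a title attribute are processed
        if link.get('title') is not None:
            # the element right after the link holds the date
            date_list = link.find_next_sibling()
            date = date_list.get_text()
            stars = link.get('title')
            movie_name_list = link.contents
            movie_name = movie_name_list.pop()
            movie_writer.writerow([stars, movie_name, date, rates[count]])
            count += 1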
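
As a possible variation on the cell above, the rows can be paired with `zip` instead of a manual counter, and the result can be read back to check it. This is only a sketch: the sample lists are hypothetical stand-ins for the scraped values, and an in-memory `io.StringIO` buffer replaces the `movies.csv` file so that nothing is written to disk.

```python
import csv
import io

# Hypothetical sample values standing in for the scraped lists.
people = ["Jane Doe (dir.), John Roe", "Ann Poe (dir.), Max Moe"]
names = ["Example Movie", "Another Movie"]
dates = ["(1999)", "(2005)"]
rates = ["8.7", "8.5"]

buffer = io.StringIO()  # in-memory stand-in for the movies.csv file
writer = csv.writer(buffer)
writer.writerow(["Director", "Movie", "Date", "Rating"])

# zip stops at the shortest list, so a missing rating cannot raise IndexError
for row in zip(people, names, dates, rates):
    writer.writerow(row)

# Read the data back to confirm that every column landed where expected.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["Movie"], rows[0]["Rating"])
```

Values that contain commas, such as the list of names in the first column, are quoted by `csv.writer` automatically and come back as a single field.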

What we still need is the rest of the information.

As you can see, we managed to extract the stars of the movie, the movie name and the date.