# Web Scraping IMDB Top 250

### 1 - What is web scraping and web crawling?
**Web scraping** is the use of an automated program that queries a web server, requests data (usually in the form of HTML), and then parses that data to extract the needed information. Examples: Google Web Scraper, content aggregators, price monitoring, and building machine learning datasets.

**Web crawling** means downloading and storing the contents of a large number of websites by following the links in their pages. Crawlers get their name because they crawl across the web; search engines are the typical example.
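To make the "following links" part concrete, here is a minimal sketch of a breadth-first crawl. It assumes a made-up, in-memory site (`PAGES`) so that no network access is involved; the `crawl` function and the page contents are hypothetical and only illustrate the idea.

```python
from collections import deque

from bs4 import BeautifulSoup

# Hypothetical in-memory "website": URL path -> HTML. Nothing is downloaded.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": '<a href="/">Home</a>',
}


def crawl(start):
    """Visit every page reachable from `start`, breadth-first, each page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        soup = BeautifulSoup(PAGES[url], "html.parser")
        # Follow every link on the page that we have not visited yet
        for link in soup.find_all("a", href=True):
            target = link["href"]
            if target in PAGES and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(crawl("/"))  # ['/', '/a', '/b']
```

A scraper, by contrast, would stop at a single page and pull specific fields out of it, which is what the rest of this notebook does.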

### 2 - HTML Introduction

**2.1 What is HTML?**   
-  HTML is the standard markup language for creating Web pages.
-  HTML elements, which are represented by tags, are the building blocks of HTML pages.
-  HTML tags label pieces of content such as "heading", "paragraph", "table", and so on.
-  Browsers do not display the HTML tags, but use them to render the content of the page.

**2.2 A Simple HTML Document**

    <!DOCTYPE html>
    <html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
    </html>
    
-  The `<!DOCTYPE html>` declaration defines this document to be HTML5.
-  The `<html>` element is the root element of an HTML page.
-  The `<head>` element contains meta information about the document.
-  The `<title>` element specifies a title for the document.
-  The `<body>` element contains the visible page content.
-  The `<h1>` element defines a large heading.
-  The `<p>` element defines a paragraph. HTML tags normally come in pairs like `<p>` and `</p>`; the closing tag starts with a forward slash.
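As a quick illustration of how these tags map onto a parse tree, the sketch below feeds the same document to Beautiful Soup (the parser used in section 3) and reads three elements back. The variable names are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html>
<head>
    <title>Page Title</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each element is reachable by its tag name; .string gives the text inside it
title = soup.title.string
heading = soup.find("h1").string
paragraph = soup.find("p").string

print(title, heading, paragraph, sep=" | ")
```

The tags themselves never appear in the extracted strings, which mirrors how a browser uses them for structure without displaying them.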

### 3 - Scraping IMDB Top 250

####  1 - Load Packages

In [1]:
import requests
print('Requests version: ' + requests.__version__)

import bs4
print('Beautiful Soup version: ' + bs4.__version__)
# HTML Parser Package
from bs4 import BeautifulSoup

Requests version: 2.22.0
Beautiful Soup version: 4.8.0


####  2 - Load HTML Document

In [5]:
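# To scrape the live page instead, uncomment the two lines below:
# r = requests.get('https://www.imdb.com/chart/top')
# page_html = r.text

# Here we read a saved copy of the page source and strip the newlines,
# so the whole document becomes one string
with open('source_code.txt', 'r') as myfile:
    page_html = myfile.read().replace('\n', '')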
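The two loading options above (live request or saved copy) can be combined. The following is a minimal sketch, not part of the original notebook: `load_html` is a hypothetical helper that reads the saved file when it exists and otherwise downloads the page and caches it. The `User-Agent` value and the 10-second timeout are assumptions; some sites reject requests that do not look like they come from a browser.

```python
from pathlib import Path

import requests


def load_html(path, url):
    """Return the HTML stored at `path`; if the file is missing, download `url` and cache it."""
    cache = Path(path)
    if cache.exists():
        return cache.read_text(encoding="utf-8")

    # Assumption: a browser-like User-Agent and a 10 s timeout
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    cache.write_text(response.text, encoding="utf-8")
    return response.text
```

With this helper, the cell above could become `page_html = load_html('source_code.txt', 'https://www.imdb.com/chart/top')`, which also avoids re-requesting the page every time the notebook is rerun.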

In [6]:
type(page_html)

str

In [7]:
# one string for the HTML code
page_html[:500]

'<!DOCTYPE html><html    xmlns:og="http://ogp.me/ns#"    xmlns:fb="http://www.facebook.com/2008/fbml">    <head>                 <meta charset="utf-8">        <meta http-equiv="X-UA-Compatible" content="IE=edge">                            <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">            <style>                body#styleguide-v2 {                    background: no-repeat fixed center top #000;                }            </style>        <script '

#### 3 - Pass the HTML Content to BeautifulSoup and construct a tree (BS object) to parse

In [9]:
page_soup = BeautifulSoup(page_html, "html.parser")

#### 4 - Find all the tags inside the tree that include top 250 movies' information 

In [13]:
# Every table row (<tr>) holds one movie; the first row is the column header
movies = page_soup.find_all(name = "tr")
movies = movies[1:]  # drop the header row
len(movies)

250

In [21]:
# Get movie name (commas are replaced with "|" so the name cannot break the CSV columns later)
movie = movies[0]
name = movie.find(name="td",attrs={"class":"titleColumn"}).find(name="a").string
name = name.replace(",","|").strip()
print('Movie Name: ' + name)

# Get movie year (strip the surrounding parentheses)
year = movie.find(name="td",attrs={"class":"titleColumn"}).find(name="span").string
year = year.replace(")", "").replace("(", "").strip()
print('Movie Year: ' + year)

# Get movie rating
rating = movie.find(name="td",attrs={"class":"ratingColumn imdbRating"}).find(name="strong").string
rating = rating.strip()
print('Movie Rating: ' + rating)

# Get number of user ratings: the fourth space-separated token of the title attribute
num_user_rating = movie.find(name="td",attrs={"class":"ratingColumn imdbRating"})\
                       .find(name="strong").attrs['title']
num_user_rating = num_user_rating.split(" ")[3].replace(",","")
print('Number of user ratings: ' + num_user_rating)

Movie Name: The Shawshank Redemption
Movie Year: 1994
Movie Rating: 9.2
Number of user ratings: 2239582
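Since the same four lookups are needed for every row, they can be wrapped in one function. The sketch below is an illustration under stated assumptions: `parse_movie_row` is a hypothetical helper, and `SAMPLE_ROW` is markup reconstructed from the selectors used above (including the wording of the `title` attribute), not copied from the IMDb page.

```python
from bs4 import BeautifulSoup

# Assumed row markup, rebuilt from the class names and tags the cell above searches for
SAMPLE_ROW = """
<table><tr>
  <td class="titleColumn">
    1. <a href="#">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span>
  </td>
  <td class="ratingColumn imdbRating">
    <strong title="9.2 based on 2,239,582 user ratings">9.2</strong>
  </td>
</tr></table>
"""


def parse_movie_row(row):
    """Extract name, year, rating and vote count from one <tr> of the chart."""
    title_cell = row.find(name="td", attrs={"class": "titleColumn"})
    rating_tag = row.find(name="td", attrs={"class": "ratingColumn imdbRating"}).find(name="strong")
    return {
        "name": title_cell.find(name="a").string.strip(),
        # strip whitespace and the surrounding parentheses from "(1994)"
        "year": title_cell.find(name="span").string.strip("() \n"),
        "rating": rating_tag.string.strip(),
        # fourth space-separated token of the title attribute, thousands separators removed
        "num_user_rating": rating_tag.attrs["title"].split(" ")[3].replace(",", ""),
    }


row = BeautifulSoup(SAMPLE_ROW, "html.parser").find(name="tr")
print(parse_movie_row(row))
```

Looking up each `td` once and reusing it keeps the selectors in a single place, so a change in the page layout only has to be fixed here.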


#### 5 - Extract movie features and save data in a csv file

In [22]:
# create a file name: 'imdb_top_250.csv'
filename = "imdb_top_250.csv"
f = open(filename, "w", encoding='utf-8')
headers = "Rank,Name,Year,Rating,Num_user_rating"

# Write header in csv
f.write(headers + '\n')

38

In [23]:
# Write data into csv file
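# `movies` already had its header row removed above, so loop over the whole list;
# slicing it again as movies[1:251] would skip the top-ranked movie
for Rank, movie in enumerate(movies, start=1):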
    
    # Movie Name
    Name = movie.find(name="td",attrs={"class":"titleColumn"}).find(name="a").string
    Name = Name.replace(",","|").strip()
    
    # Movie Year
    Year = movie.find(name="td",attrs={"class":"titleColumn"}).find(name="span").string
    Year = Year.replace(")", "").replace("(", "").strip()
    
    # Movie Rating
    Rating = movie.find(name="td",attrs={"class":"ratingColumn imdbRating"}).find(name="strong").string
    Rating = Rating.strip()
    
    # Number of User Ratings
    Num_user_rating = movie.find(name="td",attrs={"class":"ratingColumn imdbRating"}).find(name="strong").attrs['title']
    Num_user_rating = Num_user_rating.split(" ")[3].replace(",","")
    
    f.write(str(Rank) + "," + Name + "," + Year + "," + Rating + "," + Num_user_rating + "\n")

In [24]:
f.close()
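Replacing commas with "|" keeps the columns aligned but changes the movie titles. As an alternative sketch, Python's built-in `csv` module quotes any field that contains a comma, so titles can be written unchanged. The rows and the file name below are placeholders for illustration: the first row mirrors the output shown earlier, and the second is an invented title whose only purpose is to contain a comma.

```python
import csv

# Placeholder rows, not scraped results
rows = [
    (1, "The Shawshank Redemption", "1994", "9.2", "2239582"),
    (2, "Example, With Comma", "2000", "0.0", "0"),
]

demo_file = "imdb_top_250_demo.csv"  # hypothetical file name

# newline="" lets the csv module control line endings; `with` closes the file for us
with open(demo_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Rank", "Name", "Year", "Rating", "Num_user_rating"])
    writer.writerows(rows)

# Read the file back to confirm the comma inside the title survived
with open(demo_file, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back[2])  # ['2', 'Example, With Comma', '2000', '0.0', '0']
```

Using `with` also removes the need for a separate `f.close()` cell, since the file is closed as soon as the block ends.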