# Scraping IMDB website


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/IMDB_Logo_2016.svg/330px-IMDB_Logo_2016.svg.png" align="right" width=200 height=200/>



**Task**: Scrape the drama movies among the top 1000 titles on [IMDB](https://www.imdb.com/search/title/?genres=drama&groups=top_1000&sort=user_rating,desc&ref_=adv_prv), sorted by user rating

In [105]:
import requests
from lxml import html
import re
from tqdm import tqdm
from urllib.parse import urljoin
from pymongo import MongoClient

In [2]:
response = requests.get(url="https://www.imdb.com/search/title/?genres=drama&groups=top_1000&sort=user_rating,desc&ref_=adv_prv",
                       headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"})
response

<Response [200]>

In [4]:
tree = html.fromstring(response.text)
tree

<Element html at 0x12811669278>

## Scraping the data for a single movie

We want the following data for each movie:
- title 
- year of release
- duration
- rating 

In [24]:
movies = tree.xpath("//div[contains(@class,'advance')]")
print(len(movies))
print(movies[2])

50
<Element div at 0x12811669548>


In [40]:
_id = movies[2].xpath(".//div/h3/span[1]/text()")[0]
title = movies[2].xpath(".//div/h3/a/text()")[0]
year = movies[2].xpath(".//div/h3/span[2]/text()")[0]
duration = movies[2].xpath(".//div/p/span[@class='runtime']/text()")[0]
rating = movies[2].xpath(".//div/div[@class='ratings-bar']/div/strong/text()")[0]
print(_id,"|",title,"|", year,"|", duration,"|", rating)

3. | The Dark Knight | (2008) | 152 min | 9.0


### Cleaning some data

Two fields need cleaning before they can be stored as integers:
- id: the rank carries a trailing period (`3.`)
- year: the year is wrapped in parentheses (`(2008)`)

In [81]:
def str_to_int(index):
    '''
    Extract the first run of digits from a string and return it as an integer.
    Returns None if the input has no digits or is not a string.
    '''
    num_regex = re.compile(r"\d+")
    try:
        return int(num_regex.findall(index)[0])
    except (IndexError, TypeError):
        print("Unable to convert to integer")
        return None

In [82]:
print("before cleaning:", _id, "after cleaning:",str_to_int(_id))
print("before cleaning:", year, "after cleaning:",str_to_int(year))

before cleaning: 3. after cleaning: 3
before cleaning: (2008) after cleaning: 2008
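
The `duration` and `rating` fields are also scraped as strings (`'152 min'`, `'9.0'`), which makes numeric sorting or filtering awkward later. A minimal sketch of how they could be converted, assuming the formats shown above; the helper names `parse_duration_minutes` and `parse_rating` are hypothetical and not part of the notebook:

```python
import re

def parse_duration_minutes(text):
    """Return the runtime in minutes from a string like '152 min', or None."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None

def parse_rating(text):
    """Return the rating as a float from a string like '9.0', or None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

print(parse_duration_minutes("152 min"))  # 152
print(parse_rating("9.0"))                # 9.0
```

Both helpers return `None` instead of raising, so a movie with a missing field would not abort the whole scrape.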


## Scraping data for all the movies

Scrape the data from the first page:

In [85]:
d = []
movies = tree.xpath("//div[contains(@class,'advance')]")
for movie in tqdm(movies):
    
    _id = movie.xpath(".//div/h3/span[1]/text()")[0]
    title = movie.xpath(".//div/h3/a/text()")[0]
    year = movie.xpath(".//div/h3/span[2]/text()")[0]
    duration = movie.xpath(".//div/p/span[@class='runtime']/text()")[0]
    rating = movie.xpath(".//div/div[@class='ratings-bar']/div/strong/text()")[0]
    
    data = {
        '_id': str_to_int(_id),
        'title': title,
        'year': str_to_int(year),
        'duration': duration,
        'rating': rating
    }
    
    d.append(data)

100%|████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 5570.57it/s]
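
Each field in the loop above is read with `[0]`, which raises an `IndexError` if an XPath query returns an empty list (for example, a listing without a runtime). One possible guard is sketched below; `first_or_none` is a hypothetical helper name, not something the notebook defines:

```python
def first_or_none(results):
    """Return the first XPath result (stripped if it is a string), or None if the list is empty."""
    if not results:
        return None
    value = results[0]
    return value.strip() if isinstance(value, str) else value

print(first_or_none(["152 min"]))  # 152 min
print(first_or_none([]))           # None
```

It would be used as `duration = first_or_none(movie.xpath(".//div/p/span[@class='runtime']/text()"))`, leaving `None` in the record where the page has no value.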


In [84]:
d

[{'_id': 1,
  'title': 'The Shawshank Redemption',
  'year': 1994,
  'duration': '142 min',
  'rating': '9.3'},
 {'_id': 2,
  'title': 'The Godfather',
  'year': 1972,
  'duration': '175 min',
  'rating': '9.2'},
 {'_id': 3,
  'title': 'The Dark Knight',
  'year': 2008,
  'duration': '152 min',
  'rating': '9.0'},
 {'_id': 4,
  'title': 'The Godfather: Part II',
  'year': 1974,
  'duration': '202 min',
  'rating': '9.0'},
 {'_id': 5,
  'title': '12 Angry Men',
  'year': 1957,
  'duration': '96 min',
  'rating': '9.0'},
 {'_id': 6,
  'title': 'The Lord of the Rings: The Return of the King',
  'year': 2003,
  'duration': '201 min',
  'rating': '8.9'},
 {'_id': 7,
  'title': 'Pulp Fiction',
  'year': 1994,
  'duration': '154 min',
  'rating': '8.9'},
 {'_id': 8,
  'title': "Schindler's List",
  'year': 1993,
  'duration': '195 min',
  'rating': '8.9'},
 {'_id': 9,
  'title': 'Fight Club',
  'year': 1999,
  'duration': '139 min',
  'rating': '8.8'},
 {'_id': 10,
  'title': 'The Lord of the

## Recursion 

Repeat the same process for every results page by following the link to the next page:

In [87]:
next_page = tree.xpath("//div[@class='desc']/a/@href")[0]
next_page

'/search/title/?genres=drama&groups=top_1000&sort=user_rating,desc&start=51&ref_=adv_nxt'
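
The extracted `href` is relative to the site root, so it has to be resolved against the page URL before it can be requested. A small sketch of that step with `urljoin` from the standard library, using the two URLs shown above (the variable names `base_url` and `next_url` are illustrative):

```python
from urllib.parse import urljoin

base_url = "https://www.imdb.com/search/title/?genres=drama&groups=top_1000&sort=user_rating,desc&ref_=adv_prv"
next_page = "/search/title/?genres=drama&groups=top_1000&sort=user_rating,desc&start=51&ref_=adv_nxt"

# an href starting with "/" replaces the path and query of the base URL
next_url = urljoin(base_url, next_page)
print(next_url)
# https://www.imdb.com/search/title/?genres=drama&groups=top_1000&sort=user_rating,desc&start=51&ref_=adv_nxt
```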

In [101]:
%%time

movies_data = []
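def scrap_movie_data(url):
    '''
    Scrape every movie on the page at `url` into `movies_data`,
    then recurse into the next page if there is one.
    '''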
    response = requests.get(url=url,
                       headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"})

    tree = html.fromstring(response.text)
    movies = tree.xpath("//div[contains(@class,'advance')]")
    
    for movie in movies:
    
        _id = movie.xpath(".//div/h3/span[1]/text()")[0]
        title = movie.xpath(".//div/h3/a/text()")[0]
        year = movie.xpath(".//div/h3/span[2]/text()")[0]
        duration = movie.xpath(".//div/p/span[@class='runtime']/text()")[0]
        rating = movie.xpath(".//div/div[@class='ratings-bar']/div/strong/text()")[0]

        data = {
            '_id': str_to_int(_id),
            'title': title,
            'year': str_to_int(year),
            'duration': duration,
            'rating': rating
        }

        movies_data.append(data)
        
        
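    # look for the "next page" link; the last page has none, which ends the recursion
    next_page = tree.xpath("//div[@class='desc']/a[contains(@class,'next-page')]/@href")
    if len(next_page) > 0:
        # the href is relative, so resolve it against the current page URL
        new_page_url = urljoin(url,next_page[0])
        scrap_movie_data(new_page_url)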
        

url = "https://www.imdb.com/search/title/?genres=drama&groups=top_1000&sort=user_rating,desc&ref_=adv_prv"
scrap_movie_data(url)

Wall time: 30 s
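
The recursive version adds one stack frame per page, which is fine for a few dozen pages but would hit Python's recursion limit on a much longer listing. The same traversal can be written as a loop. The sketch below is one way to do it, not the notebook's method; `crawl_all_pages` and the `fetch_page` callable are hypothetical, and the fake site stands in for real HTTP requests so the example runs offline:

```python
def crawl_all_pages(start_url, fetch_page, max_pages=100):
    """
    Collect items from a chain of pages without recursion.

    fetch_page(url) is an assumed callable returning (items, next_url),
    where next_url is None on the last page.
    """
    collected = []
    url = start_url
    pages_seen = 0
    # max_pages is a safety cap in case the "next" links ever form a cycle
    while url is not None and pages_seen < max_pages:
        items, url = fetch_page(url)
        collected.extend(items)
        pages_seen += 1
    return collected

# stand-in for real requests: three fake pages linked together
fake_site = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c"], "page3"),
    "page3": (["d"], None),
}
print(crawl_all_pages("page1", fake_site.get))  # ['a', 'b', 'c', 'd']
```

In the scraper, `fetch_page` would wrap the request, the XPath extraction, and the `urljoin` call, returning the list of movie dicts and the resolved next URL.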


In [103]:
len(movies_data)

745

In [104]:
movies_data[455]

{'_id': 456,
 'title': 'Le passé',
 'year': 2013,
 'duration': '130 min',
 'rating': '7.8'}

## Writing the data to MongoDB

We will store the data in a collection called "drama_movies".

In [108]:
import os

def insert_data_to_db(movies_data):
    
    # connect to mongodb cloud server; the connection string (with its credentials)
    # is read from an environment variable, here assumed to be named MONGODB_URI,
    # so that the password is not stored in the notebook
    client = MongoClient(os.environ["MONGODB_URI"])
    # create a database
    db = client["movies"]
    # create a table / collection in the db
    collection = db["drama_movies"]
    # insert the data into collection
    for movie in tqdm(movies_data):
        record = collection.find_one({"_id":movie["_id"]})
        if record:
            if record["title"] == movie["title"] and record["rating"] != movie["rating"]:
                ## replace the stored document if the scraped rating differs from it
                collection.replace_one({"_id":record["_id"]}, movie)
                print(f"Old item: {record}, new item: {movie}")
        else:
            collection.insert_one(movie)
#         collection.insert_many(movies_data) # to insert many documents at a time
    # finally close the connection
    client.close()

In [109]:
%%time
insert_data_to_db(movies_data)

100%|████████████████████████████████████████████████████████████████████████████████| 745/745 [07:30<00:00,  1.65it/s]


Wall time: 7min 30s
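
Most of that time is probably spent on network round trips, since the loop issues one `find_one` (and usually one write) per movie. A possible alternative is to load the stored documents once, decide locally what needs writing, and then send far fewer requests. The sketch below covers only the decision step in plain Python; `plan_writes` is a hypothetical helper and the sample records are made up for illustration:

```python
def plan_writes(existing, scraped):
    """
    Split scraped movies into documents to insert and documents to replace.

    existing: dict mapping _id -> stored document (loaded with a single query)
    scraped:  list of freshly scraped movie dicts
    """
    to_insert, to_replace = [], []
    for movie in scraped:
        record = existing.get(movie["_id"])
        if record is None:
            to_insert.append(movie)
        elif record["title"] == movie["title"] and record["rating"] != movie["rating"]:
            to_replace.append(movie)
    return to_insert, to_replace

# illustrative data only, not real database contents
existing = {1: {"_id": 1, "title": "The Shawshank Redemption", "rating": "9.2"}}
scraped = [
    {"_id": 1, "title": "The Shawshank Redemption", "rating": "9.3"},
    {"_id": 2, "title": "The Godfather", "rating": "9.2"},
]
to_insert, to_replace = plan_writes(existing, scraped)
print(len(to_insert), len(to_replace))  # 1 1
```

With pymongo, `existing` could be built as `{doc["_id"]: doc for doc in collection.find()}`, the new documents sent with a single `collection.insert_many(to_insert)` (only when the list is non-empty), and each changed one written with `collection.replace_one`.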
