# **Scraping Data from the IMDb Website using BeautifulSoup**

We will scrape the first 250 search results for feature films released within the past five years, restricted to titles with at least 1,000 votes.

## Importing Packages

In [22]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

## IMDb URL for the Search Page

In [24]:

url = "https://www.imdb.com/search/title/"
query_params = {
    # search parameters: feature films released from January 1 of the year
    # five years back through today
    "release_date": f'{pd.Timestamp.now().year - 5}-01-01, {pd.Timestamp.now().strftime("%Y-%m-%d")}',
    "title_type": "feature",
    "num_votes": "1000,",  # lower bound only: at least 1,000 votes
    "count": "250",
}
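
To make the query concrete, the sketch below builds the full search URL for a fixed date using only the standard library. The helper name `build_search_url` and the example date are assumptions made for illustration; `requests` performs a comparable encoding step internally when it receives `params`, so the URL it actually sends may differ in small details.

```python
from datetime import date
from urllib.parse import urlencode


def build_search_url(today, years_back=5, count=250):
    """Build the search URL for films released in the last `years_back` years.

    Hypothetical helper: mirrors the query_params dictionary above, but takes
    the current date as an argument so the result is reproducible.
    """
    start = f"{today.year - years_back}-01-01"
    params = {
        "release_date": f"{start},{today.isoformat()}",
        "title_type": "feature",
        "num_votes": "1000,",
        "count": str(count),
    }
    # urlencode percent-encodes reserved characters such as the comma
    return "https://www.imdb.com/search/title/?" + urlencode(params)


# Example date chosen arbitrarily for illustration
example_url = build_search_url(date(2023, 6, 15))
print(example_url)
```

Passing the date in explicitly keeps the helper deterministic, which makes the resulting URL easy to inspect or log before any request is sent.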

## Sending the Request with Our Query Parameters

In [25]:
response = requests.get(url, params=query_params)
html = response.content
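
The request above assumes the server answers successfully. A slightly more defensive variant might look like the sketch below, which adds a timeout and raises on HTTP error statuses. The function name `fetch_html`, the `get` parameter, and the `User-Agent` value are assumptions introduced for illustration; whether IMDb requires a browser-like `User-Agent` is not something the original notebook establishes.

```python
import requests


def fetch_html(url, params, get=requests.get, timeout=10):
    """Return the response body as bytes, raising on 4xx/5xx statuses.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised without network access.
    """
    # Hypothetical header value; adjust or remove as needed
    headers = {"User-Agent": "Mozilla/5.0 (compatible; imdb-scraper-tutorial)"}
    response = get(url, params=params, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for error responses
    response.raise_for_status()
    return response.content
```

With this helper, the cell above could be written as `html = fetch_html(url, query_params)`.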

## BeautifulSoup

### Using the `html` Content to Create a BeautifulSoup Object

In [26]:
soup = bs(html, 'html.parser')

# find the movie containers in the page
containers = soup.find_all('div', {'class': 'lister-item mode-advanced'})
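
One detail of the lookup above is worth illustrating: passing a string with two class names to `find_all` matches the exact value of the `class` attribute, so the order of the classes matters. The sketch below uses a small made-up HTML snippet (not real IMDb markup) to compare that behavior with a CSS selector, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
sample_html = """
<div class="lister-item mode-advanced"><h3><a>First</a></h3></div>
<div class="mode-advanced lister-item"><h3><a>Second</a></h3></div>
<div class="lister-item"><h3><a>Third</a></h3></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact match on the full class string: class order must be identical
exact = sample_soup.find_all("div", {"class": "lister-item mode-advanced"})

# CSS selector: requires both classes, in any order
by_selector = sample_soup.select("div.lister-item.mode-advanced")

exact_titles = [div.h3.a.text for div in exact]
selector_titles = [div.h3.a.text for div in by_selector]
print(exact_titles)
print(selector_titles)
```

If the page ever lists the classes in a different order, the selector form would still find the containers while the exact-string form would not.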

In [27]:

data = []

# loop through each container to extract one movie's data
for movie in containers:

    # extract the movie title
    title = movie.h3.a.text.strip()

    # extract the release year, dropping the parentheses
    year = movie.h3.find('span', {'class': 'lister-item-year'}).text.strip()
    year = year.replace('(', '').replace(')', '').split()[-1]

    # extract the movie duration
    runtime = movie.find('span', {'class': 'runtime'}).text.strip()

    # extract the genre category of the movie
    genre = movie.find('span', {'class': 'genre'}).text.strip()

    # extract the movie rating
    rating = movie.strong.text.strip()

    # extract the metascore of the movie; not every movie has one
    metascore = movie.find('span', {'class': 'metascore'})
    if metascore is not None:
        metascore = metascore.text.strip()
    else:
        metascore = ""

    # extract the total votes of the movie
    votes = movie.find('span', {'name': 'nv'})['data-value']

    # add the movie's fields to the data list
    data.append([title, year, runtime, genre, rating, metascore, votes])
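
The loop guards against a missing metascore, but the other `find` calls would raise an `AttributeError` if a container lacked, say, a runtime. One possible way to apply the same guard everywhere is a small helper, sketched below. The name `text_or_default` and the sample markup are assumptions for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, attrs, default=""):
    """Return the stripped text of the first matching tag, or `default`."""
    tag = parent.find(name, attrs)
    return tag.text.strip() if tag is not None else default


# Made-up container with a genre but no runtime
sample = BeautifulSoup(
    '<div><span class="genre"> Drama, Thriller </span></div>', "html.parser"
).div

genre = text_or_default(sample, "span", {"class": "genre"})
runtime = text_or_default(sample, "span", {"class": "runtime"})
print(repr(genre), repr(runtime))
```

Inside the loop, a call such as `text_or_default(movie, 'span', {'class': 'runtime'})` would then yield an empty string instead of stopping the scrape.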

In [30]:
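# Load the data into a DataFrame, sorted by rating from highest to lowest.
# Note that rating is still a string here, so the sort is lexicographic.
df = pd.DataFrame(
    data,
    columns=['title', 'year', 'runtime', 'genre', 'rating', 'metascore', 'votes'],
).sort_values("rating", ascending=False)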
df.head()

| | title | year | runtime | genre | rating | metascore | votes |
|---|---|---|---|---|---|---|---|
| 94 | Gisaengchung | 2019 | 132 min | Drama, Thriller | 8.5 | 96 | 824682 |
| 139 | Avengers: Infinity War | 2018 | 149 min | Action, Adventure, Sci-Fi | 8.4 | 68 | 1097740 |
| 194 | Spider-Man: Into the Spider-Verse | 2018 | 117 min | Animation, Action, Adventure | 8.4 | 87 | 543009 |
| 63 | Avengers: Endgame | 2019 | 181 min | Action, Adventure, Drama | 8.4 | 78 | 1150975 |
| 73 | Joker | 2019 | 122 min | Crime, Drama, Thriller | 8.4 | 59 | 1314422 |


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 250 entries, 94 to 118
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      250 non-null    object
 1   year       250 non-null    object
 2   runtime    250 non-null    object
 3   genre      250 non-null    object
 4   rating     250 non-null    object
 5   metascore  250 non-null    object
 6   votes      250 non-null    object
dtypes: object(7)
memory usage: 15.6+ KB
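
The `df.info()` output shows every column stored as `object`, meaning the values are still strings. String sorting is lexicographic (for example, `"10.0"` sorts before `"9.0"`), so a numeric conversion is usually worthwhile before sorting or aggregating. The sketch below shows one possible conversion on a two-row sample. The sample frame is illustrative, and its second metascore is deliberately left blank to show how a missing value is handled.

```python
import pandas as pd

# Illustrative sample; the blank metascore is intentional
sample_df = pd.DataFrame(
    {
        "year": ["2019", "2018"],
        "runtime": ["132 min", "149 min"],
        "rating": ["8.5", "8.4"],
        "metascore": ["96", ""],
        "votes": ["824682", "1097740"],
    }
)

typed_df = sample_df.assign(
    year=pd.to_numeric(sample_df["year"], errors="coerce"),
    # strip the " min" suffix before converting
    runtime=pd.to_numeric(
        sample_df["runtime"].str.replace(" min", "", regex=False), errors="coerce"
    ),
    rating=pd.to_numeric(sample_df["rating"], errors="coerce"),
    # errors="coerce" turns unparseable values such as "" into NaN
    metascore=pd.to_numeric(sample_df["metascore"], errors="coerce"),
    votes=pd.to_numeric(sample_df["votes"], errors="coerce"),
)

print(typed_df.dtypes)
```

Applying the same `assign` call to `df` and then sorting by `rating` would order the films numerically rather than alphabetically.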
