# IMDb Top 250 Movie Scraper

This notebook uses Python and BeautifulSoup to scrape IMDb's Top 250 movies. The goal is to extract useful information such as movie titles, release years, and IMDb ratings.

This custom dataset can be used for data analysis, visualizations, or recommendation systems.


In [73]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

In [18]:
url = 'https://www.imdb.com/chart/top/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36"}

## Fetch the IMDb Page

We use the `requests` library to get the HTML content of the IMDb Top 250 movies page.


In [21]:
page = requests.get(url, headers=headers)
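
The request above does not check whether the server actually returned the chart. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests

def fetch_html(url, headers, timeout=10, get=requests.get):
    """Return the page HTML, raising if the request fails or returns an error status.

    `get` is a hypothetical hook that defaults to requests.get; it only exists
    so the function can be exercised without a network connection.
    """
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Usage in this notebook (performs a real network request):
# html = fetch_html(url, headers)
```

Failing early here means a blocked or rate-limited request surfaces as an exception instead of as an empty DataFrame later on.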


## Parse the HTML Using BeautifulSoup

We use BeautifulSoup to locate the HTML elements that contain movie details.


In [30]:
soup = BeautifulSoup(page.text,"html.parser")
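
To illustrate how elements are located by tag and class, here is a small sketch that runs on an inline HTML snippet rather than the live page. The snippet is a hypothetical imitation of one chart entry built from the class names used in this notebook; the real page markup may differ and can change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a single chart entry (an assumption, not copied from IMDb).
html = """
<div class="sc-52ea7f05-0 YKXtw">
  <h3 class="ipc-title__text ipc-title__text--reduced">1. The Shawshank Redemption</h3>
  <span class="sc-15ac7568-7 cCsint cli-title-metadata-item">1994</span>
  <span class="sc-15ac7568-7 cCsint cli-title-metadata-item">2h 22m</span>
  <span class="sc-15ac7568-7 cCsint cli-title-metadata-item">R</span>
  <span class="ipc-rating-star--rating">9.3</span>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Passing a single class name matches any element that carries it among its classes.
meta = [s.get_text(strip=True)
        for s in soup_demo.find_all("span", class_="cli-title-metadata-item")]
rating = soup_demo.find("span", class_="ipc-rating-star--rating").get_text(strip=True)

# find() returns None when nothing matches, so check before reading .text.
missing = soup_demo.find("span", class_="ipc-rating-star--voteCount")

print(meta, rating, missing)
```

Matching on one descriptive class such as `cli-title-metadata-item` may be less brittle than matching the full class string, since prefixes like `sc-15ac7568-7` look auto-generated.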

## Extract Data and Build a DataFrame

We extract each movie's title, IMDb rating, release year, duration, and parents guide (content rating), then store the data in a Pandas DataFrame. The cell also reads the vote count, although it is not stored in the DataFrame.
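
The title text is assumed to carry a rank prefix such as `1. `, which has to be removed. The sketch below shows why splitting on every period can truncate titles that contain periods themselves; `strip_rank` is a hypothetical helper and the rank `42` is made up for illustration.

```python
import re

def strip_rank(text):
    """Remove a leading chart rank such as '1. ' from a title string."""
    return re.sub(r"^\s*\d+\.\s*", "", text).strip()

sample = "42. Mr. Smith Goes to Washington"  # hypothetical rank, for illustration only

naive = sample.split(".")[1]   # keeps only the text between the first two periods
cleaned = strip_rank(sample)   # removes just the leading rank

print(repr(naive))    # ' Mr'
print(repr(cleaned))  # 'Mr. Smith Goes to Washington'
```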


In [72]:
Top_250_Movies = []
movies = soup.find_all('div', class_='sc-52ea7f05-0 YKXtw')
for movie in movies:
    # Look up each element first; find() returns None when nothing matches.
    title_tag = movie.find('h3', class_='ipc-title__text ipc-title__text--reduced')
    rating_tag = movie.find('span', class_='ipc-rating-star--rating')
    votes_tag = movie.find('span', class_='ipc-rating-star--voteCount')
    # Metadata spans are read by position: release year, duration, parents guide.
    meta = movie.find_all('span', class_='sc-15ac7568-7 cCsint cli-title-metadata-item')
    # Split only on the first period so the rank prefix is dropped but the title stays whole.
    title = title_tag.text.split('.', 1)[1].strip() if title_tag else np.nan
    rating = rating_tag.text if rating_tag else np.nan
    release_year = meta[0].text if len(meta) > 0 else np.nan
    duration = meta[1].text if len(meta) > 1 else np.nan
    parents_guide = meta[2].text if len(meta) > 2 else np.nan
    voting_count = votes_tag.text[2:-1] if votes_tag else np.nan  # extracted but not stored below
    Top_250_Movies.append({'Title': title, 'Rating': rating, 'Release_Year': release_year, 'Duration': duration, 'Parents_Guide': parents_guide})
df = pd.DataFrame(Top_250_Movies)
df

| | Title | Rating | Release_Year | Duration | Parents_Guide |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 9.3 | 1994 | 2h 22m | R |
| 1 | The Godfather | 9.2 | 1972 | 2h 55m | R |
| 2 | The Dark Knight | 9.1 | 2008 | 2h 32m | PG-13 |
| 3 | The Godfather Part II | 9.0 | 1974 | 3h 22m | R |
| 4 | 12 Angry Men | 9.0 | 1957 | 1h 36m | Approved |
| 5 | The Lord of the Rings: The Return of the King | 9.0 | 2003 | 3h 21m | PG-13 |
| 6 | Schindler's List | 9.0 | 1993 | 3h 15m | R |
| 7 | Pulp Fiction | 8.8 | 1994 | 2h 34m | R |
| 8 | The Lord of the Rings: The Fellowship of the ... | 8.9 | 2001 | 2h 58m | PG-13 |
| 9 | The Good, the Bad and the Ugly | 8.8 | 1966 | 2h 58m | R |
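
The scraped values are strings, so a column such as `Duration` needs converting before it can be used numerically. Below is a minimal sketch, assuming durations always look like `2h 22m`, `3h`, or `45m`; `duration_to_minutes` is a hypothetical helper that is not part of the original notebook.

```python
import re

def duration_to_minutes(text):
    """Convert strings such as '2h 22m', '3h', or '45m' to total minutes.

    Returns None when the text does not match the assumed format.
    """
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

# Durations taken from the first three rows of the output above.
minutes = [duration_to_minutes(d) for d in ["2h 22m", "2h 55m", "2h 32m"]]
print(minutes)  # [142, 175, 152]
```

In the notebook this could be applied with `df['Duration'].map(duration_to_minutes)`, and `Rating` and `Release_Year` could be converted with `pd.to_numeric`.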


## Saving the Dataset

We save the final dataset to a CSV file for future use or analysis.


In [72]:
df.to_csv('Top_250_Movies.csv')
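
Whether the row index is written to the file affects what a later `read_csv` returns. The round-trip sketch below uses a two-row stand-in frame and in-memory buffers instead of the real file; the variable names are illustrative assumptions.

```python
import io
import pandas as pd

# Stand-in for the scraped DataFrame, using two rows from the output above.
sample = pd.DataFrame({"Title": ["The Shawshank Redemption", "The Godfather"],
                       "Rating": [9.3, 9.2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Title', 'Rating']
print(list(reloaded_clean.columns))       # ['Title', 'Rating']
```

If the index column is not needed, `df.to_csv('Top_250_Movies.csv', index=False)` avoids the extra `Unnamed: 0` column on reload; alternatively, `pd.read_csv('Top_250_Movies.csv', index_col=0)` restores it as the index.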