<a href="https://colab.research.google.com/github/karahud/karahud/blob/main/Numerical_Programming_in_Python_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.

```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```
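The two filters can be sketched in pandas as follows. This is a minimal illustration, not the required solution: the column names `year` and `imdb_rating`, the helper name `filter_recent_high_rated`, and the sample rows are all assumptions. Because only the release year is scraped, "last 2 years" is interpreted here as release year >= current year - 2.

```python
from datetime import date

import pandas as pd


def filter_recent_high_rated(df, today=None, years=2, min_rating=7.0):
    """Keep rows released within `years` of `today` with rating >= min_rating."""
    today = today or date.today()
    cutoff_year = today.year - years
    # Coerce to numeric so placeholder strings such as "NA" become NaN
    year = pd.to_numeric(df["year"], errors="coerce")
    rating = pd.to_numeric(df["imdb_rating"], errors="coerce")
    # Comparisons against NaN are False, so unparseable rows are dropped
    mask = (year >= cutoff_year) & (rating >= min_rating)
    return df[mask].reset_index(drop=True)


# Illustrative rows only; not scraped data
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2024, 2023, 2015, "NA"],
    "imdb_rating": [8.0, 6.5, 9.0, 7.5],
})
recent = filter_recent_high_rated(sample, today=date(2024, 6, 30))
```

Passing `today` explicitly keeps the result reproducible; leaving it out uses the current date.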

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
      
   ```   
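One possible way to compute these three figures is sketched below. The column names (`imdb_rating`, `genres`, `streaming`) and the sample rows are assumptions for illustration; the sketch also assumes that genres and services are stored as comma-separated strings and that `"NA"` marks a missing value.

```python
import pandas as pd

# Illustrative rows only; column names are assumptions
df = pd.DataFrame({
    "imdb_rating": [8.0, 6.0, 7.0],
    "genres": ["Comedy, Drama", "Drama", "Drama, Horror"],
    "streaming": ["Netflix", "Netflix,Hotstar", "NA"],
})

# Average rating; mean() skips missing values by default
average_rating = df["imdb_rating"].mean()

# One row per genre, then count how often each genre occurs
genre_counts = (
    df["genres"].str.split(",").explode().str.strip().value_counts()
)
top_genres = genre_counts.head(5)

# Same idea for streaming services, dropping the "NA" placeholder
services = df["streaming"].str.split(",").explode().str.strip()
service_counts = services[services != "NA"].value_counts()
top_service = service_counts.idxmax()
```

Splitting and exploding first means a title listed under several genres or services counts once for each of them.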

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access granted to anyone who has the link.
```
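A short sketch of the export step, assuming a small illustrative DataFrame and a hypothetical file name. It writes to a temporary directory so that it can be run anywhere.

```python
import os
import tempfile

import pandas as pd

# Illustrative rows only; not scraped data
df = pd.DataFrame({"title": ["A", "B"], "imdb_rating": [8.0, 7.5]})

# Hypothetical output path inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "filtered_titles.csv")

# index=False leaves out the row index, so reading the file back
# does not produce an extra "Unnamed: 0" column
df.to_csv(path, index=False)

restored = pd.read_csv(path)
```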

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code should run without errors, end to end, in a single pass.

- Place your dataset's Drive link in the notebook, just before the conclusion.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.
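
For the first note, one common pattern is to retry a failed request a few times before giving up. The helper below is a sketch under stated assumptions: `fetch_with_retries` is a hypothetical name, and the `fetch` argument stands for any callable that returns the page text or raises, such as a thin wrapper around `requests.get`.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on any exception up to `retries` attempts.

    `fetch` is any callable that returns the page text or raises,
    for example: lambda u: requests.get(u, timeout=10).text
    Returns None if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper might catch narrower exception types
            print(f"Attempt {attempt} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(delay * attempt)  # wait longer after each failure
    return None
```

Returning `None` instead of raising lets the calling loop record a placeholder for that page and continue with the next URL.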








# **Start The Project**

## **Task 1: Web Scraping**

In [None]:
# Install the required libraries
!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [None]:
# Import the required libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import time

## **Scraping Movie Data**

In [None]:
# Specifying the URL from which movies related data will be fetched
url='https://www.justwatch.com/in/movies?release_year_from=2000'

# Sending an HTTP GET request to the URL
page=requests.get(url)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(page.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching Movie URLs**

In [None]:
# Collect the links to the individual movie pages
Movie_url= []

# Each title card on the listing page is an <a> tag with this class
for x in soup.find_all('a',attrs={'class':'title-list-grid__item--link'}):
  Movie_url.append('https://www.justwatch.com'+ x['href']) # prefix the relative href with the site root to get a full URL

# Print the number of links found
print("Total No of links:- ",len(Movie_url))

# Display the list of links
Movie_url

Total No of links:-  100


['https://www.justwatch.com/in/movie/aavesham-2024',
 'https://www.justwatch.com/in/movie/the-crew-2024',
 'https://www.justwatch.com/in/movie/godzilla-minus-one',
 'https://www.justwatch.com/in/movie/laapataa-ladies',
 'https://www.justwatch.com/in/movie/manjummel-boys',
 'https://www.justwatch.com/in/movie/godzilla-x-kong-the-new-empire',
 'https://www.justwatch.com/in/movie/family-star',
 'https://www.justwatch.com/in/movie/hit-man',
 'https://www.justwatch.com/in/movie/maidaan',
 'https://www.justwatch.com/in/movie/munjha',
 'https://www.justwatch.com/in/movie/the-fall-guy',
 'https://www.justwatch.com/in/movie/madgaon-express',
 'https://www.justwatch.com/in/movie/mad-max-fury-road',
 'https://www.justwatch.com/in/movie/dune-part-two',
 'https://www.justwatch.com/in/movie/bade-miyan-chote-miyan-2023',
 'https://www.justwatch.com/in/movie/premalu',
 'https://www.justwatch.com/in/movie/black-magic-2024',
 'https://www.justwatch.com/in/movie/the-gangster-the-cop-the-devil',
 'https:/
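
The link-building step can also be written with `urllib.parse.urljoin`, which handles both relative and already-absolute hrefs. The sketch below additionally drops repeated links while keeping their order; the helper name `build_links` and its `limit` parameter are hypothetical, and whether the listing page actually repeats links is an assumption.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.justwatch.com"


def build_links(hrefs, limit=None):
    """Turn relative hrefs into absolute URLs, dropping duplicates in order."""
    seen = set()
    links = []
    for href in hrefs:
        full = urljoin(BASE_URL, href)
        if full not in seen:
            seen.add(full)
            links.append(full)
    return links[:limit] if limit else links


# The hrefs would normally come from the <a> tags found on the listing page
links = build_links(["/in/movie/hit-man", "/in/movie/hit-man", "/in/movie/maidaan"])
```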

## **Scraping Movie Title**

In [None]:
# Collect the title of each movie
Movie_title= []
for url in Movie_url:

  # Use a try block so one failed page does not stop the loop
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    title = soup.find_all('div', attrs = {'data-testid':'titleBlock'})[0].find_all('h1')[0].text
  except Exception as e:
    title = "NA"
  Movie_title.append(title) #appending the list

  time.sleep(4)

#Printing the list
Movie_title


[' Aavesham ',
 ' Crew ',
 ' Godzilla Minus One ',
 ' Laapataa Ladies ',
 ' Manjummel Boys ',
 ' Godzilla x Kong: The New Empire ',
 ' Family Star ',
 ' Hit Man ',
 ' Maidaan ',
 ' Munjya ',
 ' The Fall Guy ',
 ' Madgaon Express ',
 ' Mad Max: Fury Road ',
 ' Dune: Part Two ',
 ' Bade Miyan Chote Miyan ',
 ' Premalu ',
 ' Shaitaan ',
 ' The Gangster, the Cop, the Devil ',
 ' Hereditary ',
 ' Swatantrya Veer Savarkar ',
 ' Rockstar ',
 ' Kung Fu Panda 4 ',
 ' Oppenheimer ',
 ' The First Omen ',
 ' Salaar ',
 ' Challengers ',
 ' Hanu-Man ',
 ' Aquaman and the Lost Kingdom ',
 ' 12th Fail ',
 ' Rathnam ',
 ' 365 Days ',
 ' Zara Hatke Zara Bachke ',
 ' Varshangalkku Shesham ',
 ' Animal ',
 ' Yodha ',
 ' Atlas ',
 ' Inside Out ',
 ' Anyone But You ',
 ' The Goat Life ',
 ' Furiosa: A Mad Max Saga ',
 ' Aranmanai 4 ',
 ' Kingdom of the Planet of the Apes ',
 ' Do Aur Do Pyaar ',
 " Harry Potter and the Philosopher's Stone ",
 ' Gangs of Godavari ',
 ' Spider-Man: No Way Home ',
 ' Teri Baat

## **Scraping Release Year**

In [None]:
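# Collect the release year of each movie
Movie_year=[]
for url in Movie_url:

  # Use a try block so one failed page does not stop the loop
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    # Extract the four-digit year with a regex instead of eval(), which would execute scraped text as code
    year_text = soup.find_all('div', attrs = {'data-testid':'titleBlock'})[0].find_all('span')[0].text
    year = int(re.search(r'\d{4}', year_text).group())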

  except Exception as e:
    year = "NA"
  Movie_year.append(year) #appending the list

  time.sleep(4)

#Printing the list
Movie_year

## **Scraping Genres**

In [None]:
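# Collect the genres listed on each movie page
Movie_genres= []
for url in Movie_url:
  genre = "NA"  # reset per page so a missing field never reuses the previous movie's value

  # Use a try block so one failed page does not stop the loop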
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    for x in soup.find_all('div',attrs={'class':'detail-infos'}):
      if x.find_all('h3',attrs={'class':'detail-infos__subheading'})[0].text=='Genres':
          genre = x.find_all('span',attrs={'class':'detail-infos__value'})[0].text

  except Exception as e:
    genre = "NA"
  Movie_genres.append(genre) #appending the list

  time.sleep(4)

#Printing the list
Movie_genres

['Action & Adventure, Comedy',
 'Comedy, Drama',
 'Science-Fiction, Horror, Action & Adventure, Drama',
 'Comedy, Drama',
 'History, Mystery & Thriller, Drama, Action & Adventure',
 'Science-Fiction, Mystery & Thriller, Action & Adventure, Fantasy',
 'Action & Adventure, Drama, Romance, Comedy',
 'Crime, Romance, Comedy',
 'Drama, History, Sport',
 'Comedy, Horror',
 'Action & Adventure, Comedy, Drama',
 'Comedy, Drama',
 'Science-Fiction, Action & Adventure, Mystery & Thriller',
 'Action & Adventure, Science-Fiction, Drama',
 'Mystery & Thriller, Science-Fiction, Action & Adventure, Comedy',
 'Romance, Comedy',
 'Mystery & Thriller, Horror, Drama',
 'Crime, Action & Adventure, Mystery & Thriller',
 'Mystery & Thriller, Horror, Drama',
 'Drama, History',
 'Drama, Music & Musical, Romance',
 'Kids & Family, Fantasy, Animation, Action & Adventure, Comedy',
 'Drama, History',
 'Horror',
 'Action & Adventure, Crime, Drama, Mystery & Thriller',
 'Romance, Drama, Sport',
 'Science-Fiction, F

## **Scraping IMDb Rating**

In [None]:
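# Collect the rating shown on each movie page
Movie_rating=[]
for url in Movie_url:
  rating = "NA"  # reset per page so a missing field never reuses the previous movie's value

  # Use a try block so one failed page does not stop the loop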
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    for x in soup.find_all('div',attrs = {'class':'detail-infos'}):
      if x.find_all('h3', attrs = {'class':'detail-infos__subheading'})[0].text =='Rating':
        rating = x.find_all('div', attrs = {'class':'detail-infos__value'})[0].text.strip()

  except Exception as e:
    rating = "NA"
  Movie_rating.append(rating) #appending the list

  time.sleep(4)

#Printing the list
Movie_rating

['8.0  (11k)',
 '5.9  (30k)',
 '7.8  (119k)',
 '8.5  (32k)',
 '8.3  (16k)',
 '6.2  (74k)',
 '5.1  (3k)',
 '7.0  (43k)',
 '8.1  (16k)',
 '7.7  (12k)',
 '7.0  (102k)',
 '7.1  (35k)',
 '8.1  (1m)',
 '8.6  (451k)',
 '4.2  (36k)',
 '7.9  (12k)',
 '6.6  (46k)',
 '6.9  (23k)',
 '7.3  (383k)',
 '7.7  (15k)',
 '7.7  (49k)',
 '6.3  (43k)',
 '8.3  (755k)',
 '6.5  (35k)',
 '6.5  (65k)',
 '7.3  (69k)',
 '7.9  (24k)',
 '5.6  (88k)',
 '8.9  (118k)',
 '6.5  (3k)',
 '3.3  (99k)',
 '6.1  (15k)',
 '6.9  (2k)',
 '6.2  (92k)',
 '5.7  (6k)',
 '5.6  (40k)',
 '8.1  (795k)',
 '6.1  (93k)',
 '8.6  (6k)',
 '7.8  (85k)',
 '6.2',
 '7.2  (53k)',
 '6.3  (5k)',
 '7.6  (860k)',
 '5.1',
 '8.2  (888k)',
 '6.3  (50k)',
 '8.0  (879k)',
 '7.8  (1k)',
 '4.4',
 '7.9  (20k)',
 '6.9  (9k)',
 '6.6  (44k)',
 '7.2  (114k)',
 '6.9',
 '7.9',
 '5.2  (18k)',
 '6.2  (49k)',
 '6.2  (49k)',
 '7.6  (88k)',
 '6.4  (47k)',
 '8.2  (59k)',
 '8.7  (40k)',
 '3.9  (66k)',
 '2.9  (10k)',
 '7.9  (9k)',
 '6.9  (44k)',
 '7.6  (574k)',
 '7.4  (207k)
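
The scraped rating strings combine the score with a vote count (for example `'8.0  (11k)'`), and a few contain only the score or nothing at all. Before the rating filter can be applied they need to be numeric. The helper below is a sketch, not part of the original notebook: `parse_rating` is a hypothetical name, and it assumes the `k` and `m` suffixes mean thousand and million.

```python
import re

_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_rating(raw):
    """Split a string like '8.0  (11k)' into (rating, votes).

    Either part is None when it is missing or cannot be parsed.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)(?:\s*\(([\d.]+)([km]?)\))?", raw or "")
    if not match:
        return None, None
    rating = float(match.group(1))
    votes = None
    if match.group(2):
        # Scale the count by its suffix; no suffix means the number is used as is
        votes = int(float(match.group(2)) * _MULTIPLIERS.get(match.group(3), 1))
    return rating, votes
```

Applied to a whole column, the first element of each tuple gives a float column that can be compared against 7 directly.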

## **Scraping Runtime/Duration**

In [None]:
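# Collect the runtime of each movie
Movie_runtime=[]
for url in Movie_url:  # the list is named Movie_url; Movies_url would raise a NameError
  runtime = "NA"  # reset per page so a missing field never reuses the previous movie's value

  # Use a try block so one failed page does not stop the loop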
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    for x in soup.find_all('div',attrs={'class':'detail-infos'}):
      if x.find_all('h3',attrs={'class':'detail-infos__subheading'})[0].text=='Runtime':
        runtime= x.find_all('div',attrs={'class':'detail-infos__value'})[0].text.strip()

  except Exception as e:
    runtime = "NA"
  Movie_runtime.append(runtime) #appending the list

  time.sleep(4)

#Printing the list
Movie_runtime

['2h 38min',
 '1h 58min',
 '2h 5min',
 '2h 2min',
 '2h 15min',
 '1h 55min',
 '2h 39min',
 '1h 56min',
 '3h 1min',
 '2h 20min',
 '2h 6min',
 '2h 23min',
 '2h 0min',
 '2h 47min',
 '2h 44min',
 '2h 36min',
 '2h 12min',
 '1h 50min',
 '2h 7min',
 '2h 56min',
 '2h 39min',
 '1h 34min',
 '3h 0min',
 '1h 59min',
 '2h 55min',
 '2h 12min',
 '2h 39min',
 '2h 4min',
 '2h 27min',
 '2h 39min',
 '1h 54min',
 '2h 20min',
 '2h 46min',
 '3h 21min',
 '2h 10min',
 '1h 58min',
 '1h 35min',
 '1h 44min',
 '2h 52min',
 '2h 28min',
 '2h 28min',
 '2h 25min',
 '2h 20min',
 '2h 32min',
 '2h 23min',
 '2h 28min',
 '2h 21min',
 '2h 35min',
 '2h 28min',
 '2h 12min',
 '2h 38min',
 '2h 14min',
 '1h 49min',
 '1h 49min',
 '2h 8min',
 '2h 16min',
 '1h 44min',
 '1h 55min',
 '2h 1min',
 '2h 59min',
 '1h 56min',
 '1h 53min',
 '2h 46min',
 '1h 56min',
 '2h 13min',
 '2h 19min',
 '2h 0min',
 '1h 45min',
 '2h 21min',
 '1h 57min',
 '2h 2min',
 '2h 52min',
 '2h 30min',
 '2h 23min',
 '2h 36min',
 '2h 32min',
 '1h 45min',
 '3h 0min',
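
Runtimes in the form `'2h 38min'` are easier to sort and summarise once converted to minutes. A possible conversion is sketched below; `runtime_to_minutes` is a hypothetical helper, and the accepted formats (`'2h 38min'`, `'45min'`, `'2h'`) are an assumption based on the values shown above.

```python
import re


def runtime_to_minutes(raw):
    """Convert '2h 38min', '45min' or '2h' to total minutes; None if unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", raw or "")
    # Both parts are optional in the pattern, so require at least one of them
    if not match or not (match.group(1) or match.group(2)):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes
```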

## **Scraping Age Rating**

In [None]:
# Collect the age rating of each movie
Movie_age_rating=[]
for url in Movie_url:
  age_rating = "NA"  # default when the page lists no age rating
  # Use a try block so one failed page does not stop the loop
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text, 'html.parser')

    for x in soup.find_all('div', attrs = {'class':'detail-infos'}):
      if x.find_all('h3', attrs = {'class':'detail-infos__subheading'})[0].text == 'Age rating':
          age_rating= x.find_all('div', attrs = {'class':'detail-infos__value'})[0].text

  except Exception as e:
      pass
  Movie_age_rating.append(age_rating) #appending the list

  time.sleep(4)

# Display the list
Movie_age_rating

['NA',
 'UA',
 'NA',
 'UA',
 'UA',
 'UA',
 'NA',
 'NA',
 'NA',
 'NA',
 'UA',
 'UA',
 'A',
 'NA',
 'UA',
 'U',
 'UA',
 'NA',
 'A',
 'NA',
 'UA',
 'U',
 'UA',
 'A',
 'A',
 'A',
 'UA',
 'NA',
 'NA',
 'NA',
 'NA',
 'UA',
 'NA',
 'A',
 'UA',
 'NA',
 'U',
 'A',
 'UA',
 'NA',
 'NA',
 'NA',
 'UA',
 'U',
 'UA',
 'UA',
 'NA',
 'UA',
 'A',
 'NA',
 'NA',
 'UA',
 'NA',
 'NA',
 'UA',
 'UA',
 'NA',
 'NA',
 'NA',
 'UA',
 'NA',
 'A',
 'UA',
 'NA',
 'UA',
 'UA',
 'NA',
 'NA',
 'A',
 'A',
 'NA',
 'U',
 'UA',
 'UA',
 'UA',
 'NA',
 'NA',
 'A',
 'UA',
 'NA',
 'UA',
 'A',
 'U',
 'UA',
 'NA',
 'NA',
 'A',
 'NA',
 'UA',
 'NA',
 'NA',
 'A',
 'NA',
 'UA',
 'NA',
 'A',
 'UA',
 'NA',
 'UA',
 'A']

## **Fetching Production Countries Details**

In [None]:
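# Collect the production country of each movie
Production_country=[]
for url in Movie_url:
  prod_country = "NA"  # reset per page so a missing field never reuses the previous movie's value

  # Use a try block so one failed page does not stop the loop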
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    for x in soup.find_all('div',attrs={'class':'detail-infos'}):
      if x.find_all('h3',attrs={'class':'detail-infos__subheading'})[0].text==' Production country ':
        prod_country= x.find_all('div',attrs={'class':'detail-infos__value'})[0].text

  except Exception as e:
    prod_country= "NA"
  Production_country.append(prod_country) #appending the list

  time.sleep(4)

# Display the list
Production_country

['India',
 'India',
 'Japan',
 'India',
 'India',
 'United States, Australia',
 'India',
 'United States',
 'India',
 'India',
 'Australia, Canada, United States',
 'India',
 'Australia, United States',
 'United States',
 'India',
 'India',
 'India',
 'South Korea',
 'United States',
 'India',
 'India',
 'United States',
 'United States, United Kingdom',
 'United States',
 'India',
 'Italy, United States',
 'India',
 'United States',
 'India',
 'India',
 'Poland',
 'India',
 'India',
 'India',
 'India, United States',
 'United States',
 'United States',
 'United States',
 'India, United States',
 'United States, Australia',
 'India',
 'United States',
 'India',
 'United Kingdom, United States',
 'India',
 'United States',
 'India',
 'United States',
 'India',
 'India',
 'India',
 'South Korea',
 'United States',
 'United Kingdom, United States',
 'India',
 'India',
 'France',
 'United States',
 'United States, Canada, Singapore',
 'India',
 'United States',
 'Sweden, India',
 'India',


## **Fetching Streaming Service Details**

In [None]:
# Collect the streaming services offering each movie
Movie_stream=[]
for url in Movie_url:

  # Use a try block so one failed page does not stop the loop
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    Stream_provider = soup.find("div",attrs ={'class':'buybox-row__offers'}).find_all("img",attrs={'class':'offer__icon'})
    alt_values = [img['alt'] for img in Stream_provider]
    alt_values = (",".join(alt_values))

  except Exception as e:
    alt_values = "NA"
  Movie_stream.append(alt_values) #appending the list

  time.sleep(4)

#Printing the list
Movie_stream

['Amazon Prime Video',
 'Netflix',
 'Netflix',
 'Netflix',
 'Hotstar',
 'Apple TV',
 'Amazon Prime Video',
 'Netflix',
 'Amazon Prime Video',
 'Bookmyshow',
 'Apple TV',
 'Amazon Prime Video',
 'Amazon Prime Video',
 'Apple TV',
 'Netflix',
 'Hotstar,aha',
 'Netflix',
 'Lionsgate Play',
 'Amazon Prime Video',
 'Zee5',
 'Apple TV',
 'Apple TV',
 'Apple TV',
 'Apple TV',
 'Netflix,Hotstar',
 'Apple TV',
 'Hotstar,Jio Cinema,Zee5',
 'Apple TV',
 'Hotstar',
 'Amazon Prime Video',
 'Netflix',
 'Jio Cinema',
 'Sony Liv',
 'Netflix',
 'Amazon Prime Video',
 'Netflix',
 'Apple TV',
 'Apple TV',
 'NA',
 'Bookmyshow',
 'Bookmyshow',
 'Bookmyshow',
 'Hotstar',
 'Apple TV',
 'Netflix',
 'Apple TV',
 'Amazon Prime Video',
 'Apple TV',
 'aha',
 'Bookmyshow',
 'Netflix',
 'NA',
 'Apple TV',
 'NA',
 'Amazon Prime Video',
 'Amazon Prime Video',
 'Netflix',
 'Apple TV',
 'NA',
 'Amazon Prime Video',
 'Amazon Prime Video',
 'NA',
 'Amazon Prime Video',
 'Apple TV',
 'NA',
 'Sony Liv',
 'NA',
 'Apple TV',

## **Creating the Movies DataFrame**

In [None]:
# Combine the scraped lists into a single DataFrame, one row per movie
data = {
    'Movie_title':Movie_title,
    'Movie_year':Movie_year,
    'Movie_genres':Movie_genres,
    'Movie_rating':Movie_rating,
    'Movie_runtime':Movie_runtime,
    'Movie_age_rating':Movie_age_rating,
    'Production_country':Production_country,
    'Movie_stream':Movie_stream,
    'Movie_url':Movie_url
}
df = pd.DataFrame(data)
df


| Unnamed: 0 | Movie_title | Movie_year | Movie_genres | Movie_rating | Movie_runtime | Movie_age_rating | Production_country | Movie_stream | Movie_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Aavesham | 2024 | Action & Adventure, Comedy | 8.0 (11k) | 2h 38min | | India | Amazon Prime Video | https://www.justwatch.com/in/movie/aavesham-2024 |
| 1 | Crew | 2024 | Comedy, Drama | 5.9 (30k) | 1h 58min | UA | India | Netflix | https://www.justwatch.com/in/movie/the-crew-2024 |
| 2 | Godzilla Minus One | 2023 | Science-Fiction, Horror, Action & Adventure, D... | 7.8 (119k) | 2h 5min | | Japan | Netflix | https://www.justwatch.com/in/movie/godzilla-mi... |
| 3 | Laapataa Ladies | 2024 | Comedy, Drama | 8.5 (32k) | 2h 2min | UA | India | Netflix | https://www.justwatch.com/in/movie/laapataa-la... |
| 4 | Manjummel Boys | 2024 | History, Mystery & Thriller, Drama, Action & A... | 8.3 (16k) | 2h 15min | UA | India | Hotstar | https://www.justwatch.com/in/movie/manjummel-boys |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Her | 2013 | Romance, Science-Fiction, Drama | 8.0 (674k) | 2h 6min | A | United States | Amazon Prime Video | https://www.justwatch.com/in/movie/her |
| 96 | K.G.F: Chapter 2 | 2022 | Action & Adventure, Crime, Drama, Mystery & Th... | 8.3 (150k) | 2h 46min | UA | India | Amazon Prime Video | https://www.justwatch.com/in/movie/k-g-f-chapt... |
| 97 | Kurangu Pedal | 2024 | Drama | 8.2 | 1h 59min | | India | Amazon Prime Video | https://www.justwatch.com/in/movie/kurangu-pedal |
| 98 | Stree | 2018 | Horror, Comedy, Drama | 7.5 (38k) | 2h 8min | UA | India | Apple TV | https://www.justwatch.com/in/movie/stree |


In [None]:
df.to_csv('Movie_data.csv')
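
The cells above request every movie page once per field, with a four-second pause after each request. An alternative worth considering is to download each page once and extract all fields from the same parsed document. The sketch below shows only the parsing half and works on an HTML string, so it can be tried without network access. The function names `detail_value` and `parse_title_page` are hypothetical; the selectors are the ones used in the cells above and depend on JustWatch's markup at the time of scraping. Headings are compared after stripping whitespace, which is an assumption about how the page pads them.

```python
from bs4 import BeautifulSoup


def detail_value(soup, heading):
    """Return the text of the detail block whose subheading equals `heading`."""
    for block in soup.find_all("div", class_="detail-infos"):
        sub = block.find("h3", class_="detail-infos__subheading")
        val = block.find(class_="detail-infos__value")
        if sub and val and sub.get_text(strip=True) == heading:
            return val.get_text(" ", strip=True)
    return "NA"  # the page does not list this field


def parse_title_page(html):
    """Extract every field from one page's HTML in a single parse."""
    soup = BeautifulSoup(html, "html.parser")
    title_block = soup.find("div", attrs={"data-testid": "titleBlock"})
    h1 = title_block.find("h1") if title_block else None
    span = title_block.find("span") if title_block else None
    offers = soup.find("div", class_="buybox-row__offers")
    icons = offers.find_all("img", class_="offer__icon") if offers else []
    return {
        "title": h1.get_text(strip=True) if h1 else "NA",
        "year": span.get_text(strip=True) if span else "NA",
        "genres": detail_value(soup, "Genres"),
        "rating": detail_value(soup, "Rating"),
        "runtime": detail_value(soup, "Runtime"),
        "age_rating": detail_value(soup, "Age rating"),
        "country": detail_value(soup, "Production country"),
        "streaming": ",".join(img["alt"] for img in icons) or "NA",
    }
```

Calling `parse_title_page(requests.get(url).text)` inside a single loop over the URL list would yield one dictionary per page, and `pd.DataFrame` accepts a list of such dictionaries directly.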

## **Scraping TV Show Data**

In [None]:
# Specifying the URL from which tv show related data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'
# Sending an HTTP GET request to the URL
page=requests.get(tv_url)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(page.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching TV Show URLs**

In [None]:
# Collect the links to the individual TV show pages
Shows_url= []

# Each title card on the listing page is an <a> tag with this class
for x in soup.find_all('a',attrs={'class':'title-list-grid__item--link'}):
  Shows_url.append('https://www.justwatch.com'+ x['href']) # prefix the relative href with the site root to get a full URL

# Print the number of links found
print("Total No of links:- ",len(Shows_url))

# Display the list of links
Shows_url

Total No of links:-  100


['https://www.justwatch.com/in/tv-show/panchayat',
 'https://www.justwatch.com/in/tv-show/gullak',
 'https://www.justwatch.com/in/tv-show/heeramandi',
 'https://www.justwatch.com/in/tv-show/mirzapur',
 'https://www.justwatch.com/in/tv-show/house-of-the-dragon',
 'https://www.justwatch.com/in/tv-show/bridgerton',
 'https://www.justwatch.com/in/tv-show/shogun-2024',
 'https://www.justwatch.com/in/tv-show/game-of-thrones',
 'https://www.justwatch.com/in/tv-show/raising-voices',
 'https://www.justwatch.com/in/tv-show/the-boys',
 'https://www.justwatch.com/in/tv-show/fallout',
 'https://www.justwatch.com/in/tv-show/demon-slayer-kimetsu-no-yaiba',
 'https://www.justwatch.com/in/tv-show/murder-in-mahim',
 'https://www.justwatch.com/in/tv-show/apharan',
 'https://www.justwatch.com/in/tv-show/3-body-problem',
 'https://www.justwatch.com/in/tv-show/eric',
 'https://www.justwatch.com/in/tv-show/sunflower-2021',
 'https://www.justwatch.com/in/tv-show/maxton-hall-the-world-between-us',
 'https://ww

## **Fetching TV Show Title Details**

In [None]:
# Collect the title of each TV show
Show_title= []
for url in Shows_url:
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    title = soup.find_all('div', attrs = {'data-testid':'titleBlock'})[0].find_all('h1')[0].text
  except Exception as e:
    title = "NA"
  Show_title.append(title)

  time.sleep(4)
Show_title

[' Panchayat ',
 ' Gullak ',
 ' Heeramandi: The Diamond Bazaar ',
 ' Mirzapur ',
 ' House of the Dragon ',
 ' Bridgerton ',
 ' Shōgun ',
 ' Game of Thrones ',
 ' Raising Voices ',
 ' The Boys ',
 ' Fallout ',
 ' Demon Slayer: Kimetsu no Yaiba ',
 ' Murder in Mahim ',
 ' Apharan ',
 ' 3 Body Problem ',
 ' Eric ',
 ' Sunflower ',
 ' Maxton Hall: The World Between Us ',
 ' Dark Matter ',
 ' Aashram ',
 ' Farzi ',
 ' The 8 Show ',
 ' The Good Doctor ',
 ' Attack on Titan ',
 ' The Family Man ',
 ' Jurassic World: Chaos Theory ',
 ' Young Sheldon ',
 ' The Great Indian Kapil Show ',
 ' Jujutsu Kaisen ',
 ' Reacher ',
 ' From ',
 ' MTV Splitsvilla ',
 ' The Legend of Hanuman ',
 ' The Acolyte ',
 ' True Detective ',
 ' Asur: Welcome to Your Dark Side ',
 ' Peaky Blinders ',
 ' Dehati Ladke ',
 ' Mastram ',
 ' Euphoria ',
 ' Breaking Bad ',
 ' Jamnapaar ',
 ' Money Heist ',
 ' The Last of Us ',
 ' Stranger Things ',
 ' Naruto ',
 ' Scorpion ',
 ' Lucifer ',
 ' Presumed Innocent ',
 ' Gandii B

## **Fetching Release Year**

In [None]:
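# Collect the release year of each TV show
Show_year=[]
for url in Shows_url:

  # Use a try block so one failed page does not stop the loop
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    # Extract the four-digit year with a regex instead of eval(), which would execute scraped text as code
    year_text = soup.find_all('div', attrs = {'data-testid':'titleBlock'})[0].find_all('span')[0].text
    year = int(re.search(r'\d{4}', year_text).group())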

  except Exception as e:
    year = "NA"
  Show_year.append(year) #appending the list

  time.sleep(4)

#Printing the list
Show_year


[2020,
 2019,
 2024,
 2018,
 2022,
 2020,
 2024,
 2011,
 2024,
 2019,
 2024,
 2019,
 2024,
 2018,
 2024,
 2024,
 2021,
 2024,
 2024,
 2020,
 2023,
 2024,
 2017,
 2013,
 2019,
 2024,
 2017,
 2024,
 2020,
 2022,
 2022,
 2008,
 2021,
 2024,
 2014,
 2020,
 2013,
 2023,
 2020,
 2019,
 2008,
 2024,
 2017,
 2023,
 2016,
 2002,
 2014,
 2016,
 2024,
 2018,
 2020,
 2018,
 2020,
 2007,
 2017,
 2019,
 2018,
 2020,
 2021,
 2020,
 2024,
 2004,
 2023,
 2010,
 2020,
 2005,
 2024,
 2022,
 2009,
 2020,
 2018,
 2024,
 2024,
 2024,
 2024,
 2021,
 2005,
 2018,
 2010,
 2011,
 2024,
 2024,
 2017,
 2022,
 2023,
 2024,
 2018,
 2024,
 2024,
 2021,
 2013,
 2024,
 2021,
 2022,
 2004,
 2021,
 2006,
 2004,
 2018,
 2018]

## **Fetching TV Show Genre Details**

In [None]:
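# Collect the genres listed on each TV show page
Show_genres= []
for url in Shows_url:
  genre = "NA"  # reset per page so a missing field never reuses the previous show's value

  # Use a try block so one failed page does not stop the loop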
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    for x in soup.find_all('div',attrs={'class':'detail-infos'}):
      if x.find_all('h3',attrs={'class':'detail-infos__subheading'})[0].text=='Genres':
          genre = x.find_all('span',attrs={'class':'detail-infos__value'})[0].text

  except Exception as e:
    genre = "NA"
  Show_genres.append(genre) #appending the list

  time.sleep(4)

#Printing the list
Show_genres

['Comedy, Drama',
 'Drama, Kids & Family, Comedy',
 'Romance, Drama, History, War & Military',
 'Crime, Action & Adventure, Drama, Mystery & Thriller',
 'Drama, Fantasy, Romance, Action & Adventure, Science-Fiction',
 'Drama, Romance',
 'War & Military, Drama, History',
 'Drama, Fantasy, Action & Adventure, Science-Fiction',
 'Drama',
 'Science-Fiction, Action & Adventure, Comedy, Crime, Drama',
 'Science-Fiction, War & Military, Action & Adventure, Drama, Mystery & Thriller',
 'Animation, Action & Adventure, Science-Fiction, Fantasy, Mystery & Thriller',
 'Crime',
 'Drama, Action & Adventure, Crime, Mystery & Thriller',
 'Science-Fiction, Drama, Fantasy',
 'Crime, Drama, Mystery & Thriller',
 'Comedy, Crime',
 'Drama, Romance',
 'Science-Fiction, Mystery & Thriller, Drama',
 'Crime, Drama, Mystery & Thriller',
 'Crime, Drama, Mystery & Thriller',
 'Drama, Mystery & Thriller, Comedy',
 'Drama',
 'Animation, Action & Adventure, Drama, Fantasy, Horror, Science-Fiction',
 'Action & Advent

## **Fetching IMDb Rating Details**

In [None]:
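# Collect the rating shown on each TV show page
Show_rating=[]
for url in Shows_url:
  rating = "NA"  # reset per page so a missing field never reuses the previous show's value

  # Use a try block so one failed page does not stop the loop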
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    for x in soup.find_all('div',attrs = {'class':'detail-infos'}):
      if x.find_all('h3', attrs = {'class':'detail-infos__subheading'})[0].text =='Rating':
        rating = x.find_all('div', attrs = {'class':'detail-infos__value'})[0].text.strip()

  except Exception as e:
    rating = "NA"
  Show_rating.append(rating) #appending the list

  time.sleep(4)

#Printing the list
Show_rating

['9.0  (93k)',
 '9.1  (23k)',
 '6.4  (26k)',
 '8.5  (82k)',
 '8.4  (389k)',
 '7.4  (182k)',
 '8.7  (142k)',
 '9.2  (2m)',
 '7.2  (2k)',
 '8.7  (669k)',
 '8.4  (208k)',
 '8.6  (151k)',
 '8.0  (25k)',
 '8.2  (19k)',
 '7.5  (124k)',
 '7.0  (20k)',
 '7.4  (24k)',
 '7.6  (9k)',
 '7.6  (8k)',
 '7.4  (32k)',
 '8.4  (46k)',
 '7.2  (7k)',
 '8.0  (115k)',
 '9.0  (325k)',
 '8.7  (99k)',
 '7.6  (1k)',
 '7.7  (110k)',
 '6.7  (25k)',
 '8.6  (120k)',
 '8.1  (216k)',
 '8.1  (216k)',
 '3.8',
 '9.1  (12k)',
 '3.5  (53k)',
 '8.9  (661k)',
 '8.5  (65k)',
 '8.8  (657k)',
 '7.2',
 '6.9  (2k)',
 '8.3  (246k)',
 '9.5  (2m)',
 '8.3  (2k)',
 '8.2  (536k)',
 '8.7  (539k)',
 '8.7  (1m)',
 '8.4  (129k)',
 '7.0  (59k)',
 '8.1  (358k)',
 '7.6  (2k)',
 '3.4  (3k)',
 '',
 '8.0  (75k)',
 '3.7  (18k)',
 '8.7  (172k)',
 '8.7  (449k)',
 '9.0  (81k)',
 '8.3  (28k)',
 '9.3  (150k)',
 '8.1  (63k)',
 '6.5  (9k)',
 '8.4  (3k)',
 '8.3  (599k)',
 '6.2  (12k)',
 '8.4  (155k)',
 '6.9  (24k)',
 '8.6  (248k)',
 '6.8',
 '7.2  (23k)',

## **Fetching Age Rating Details**

In [None]:
# Collect the age rating of each TV show
Show_age_rating=[]
for url in Shows_url:
  age_rating = "NA"  # default when the page lists no age rating
  # Use a try block so one failed page does not stop the loop
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text, 'html.parser')

    for x in soup.find_all('div', attrs = {'class':'detail-infos'}):
      if x.find_all('h3', attrs = {'class':'detail-infos__subheading'})[0].text == 'Age rating':
          age_rating= x.find_all('div', attrs = {'class':'detail-infos__value'})[0].text

  except Exception as e:
      pass
  Show_age_rating.append(age_rating) #appending the list

  time.sleep(4)

# Display the list
Show_age_rating

['NA',
 'NA',
 'NA',
 'NA',
 'A',
 'A',
 'NA',
 'A',
 'NA',
 'NA',
 'A',
 'NA',
 'NA',
 'NA',
 'A',
 'NA',
 'A',
 'NA',
 'A',
 'NA',
 'NA',
 'NA',
 'U',
 'UA',
 'NA',
 'NA',
 'U',
 'NA',
 'NA',
 'A',
 'NA',
 'U',
 'NA',
 'NA',
 'U',
 'NA',
 'A',
 'NA',
 'NA',
 'A',
 'U',
 'NA',
 'NA',
 'A',
 'NA',
 'NA',
 'U',
 'U',
 'A',
 'A',
 'A',
 'NA',
 'A',
 'U',
 'NA',
 'NA',
 'A',
 'NA',
 'NA',
 'NA',
 'NA',
 'U',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'A',
 'NA',
 'NA',
 'NA',
 'NA',
 'A',
 'NA',
 'U',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'U',
 'NA',
 'NA',
 'NA',
 'NA',
 'A',
 'NA',
 'UA',
 'NA',
 'A',
 'NA',
 'NA',
 'U',
 'NA',
 'NA',
 'U',
 'NA',
 'NA']

## **Fetching Production Country details**

In [None]:
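# Collect the production country of each TV show
Production_country=[]
for url in Shows_url:
  prod_country = "NA"  # reset per page so a missing field never reuses the previous show's value

  # Use a try block so one failed page does not stop the loop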
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    for x in soup.find_all('div',attrs={'class':'detail-infos'}):
      if x.find_all('h3',attrs={'class':'detail-infos__subheading'})[0].text==' Production country ':
        prod_country= x.find_all('div',attrs={'class':'detail-infos__value'})[0].text

  except Exception as e:
    prod_country= "NA"
  Production_country.append(prod_country) #appending the list

  time.sleep(4)

# Display the list
Production_country

['India',
 'India',
 'India',
 'India',
 'United States',
 'United States',
 'United States',
 'United Kingdom, United States',
 'Spain',
 'United States',
 'United States',
 'Japan',
 'India',
 'India',
 'United States',
 'United Kingdom, United States',
 'India',
 'Germany',
 'United States',
 'India',
 'India',
 'South Korea',
 'United States',
 'Japan',
 'India',
 'United States',
 'United States',
 'India',
 'Japan, United States',
 'United States',
 'United States',
 'India',
 'India',
 'United States',
 'United States',
 'India',
 'United Kingdom',
 'India',
 'India',
 'United States',
 'United States',
 'India',
 'Spain',
 'United States',
 'United States',
 'Japan',
 'United States',
 'United States',
 'United States',
 'India',
 'India',
 'United States',
 'India',
 'Japan',
 'Germany',
 'India',
 'India',
 'India',
 'United States',
 'Mexico',
 'Japan',
 'United States',
 'United States',
 'United Kingdom',
 'India',
 'United Kingdom',
 'Indonesia',
 'United States',
 'Unite
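
The production-country lookup walks the `detail-infos` blocks and matches on the subheading text, and the same pattern can serve any labelled field. A minimal sketch of a reusable helper follows, assuming the page structure used in this notebook. The function name `detail_value` and the sample HTML are hypothetical and only imitate that structure; they are not taken from the JustWatch site.

```python
from bs4 import BeautifulSoup

def detail_value(html, label, default="NA"):
    """Return the value text of the detail-infos block whose subheading equals label."""
    soup = BeautifulSoup(html, "html.parser")
    for block in soup.find_all("div", attrs={"class": "detail-infos"}):
        heading = block.find("h3", attrs={"class": "detail-infos__subheading"})
        value = block.find("div", attrs={"class": "detail-infos__value"})
        if heading is None or value is None:
            continue  # skip blocks that do not have both parts
        if heading.get_text(strip=True) == label:
            return value.get_text(strip=True)
    return default  # label not present on this page

# Hypothetical markup imitating the structure assumed in this notebook
sample_html = """
<div class="detail-infos"><h3 class="detail-infos__subheading"> Production country </h3>
  <div class="detail-infos__value">India</div></div>
<div class="detail-infos"><h3 class="detail-infos__subheading">Runtime</h3>
  <div class="detail-infos__value"> 35min </div></div>
"""

country = detail_value(sample_html, "Production country")
runtime = detail_value(sample_html, "Runtime")
missing = detail_value(sample_html, "Age rating")
print(country, runtime, missing)
```

Because the helper takes the HTML text rather than a URL, one `requests.get` per show could feed several lookups instead of fetching the same page once per field.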

## **Fetching Streaming Service details**

In [None]:
# Write Your Code here
#Creating an empty list
Show_stream=[]
for url in Shows_url:

  #Using try block to avoid errors
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    Stream_provider = soup.find("div",attrs ={'class':'buybox-row__offers'}).find_all("img",attrs={'class':'offer__icon'})
    #Join the provider names; fall back to "NA" when no icons are found
    alt_values = ",".join(img['alt'] for img in Stream_provider) or "NA"

  except Exception as e:
    alt_values = "NA"
  Show_stream.append(alt_values) #appending the list

  time.sleep(4)

#Printing the list
Show_stream

['Amazon Prime Video',
 'Sony Liv',
 'Netflix',
 'Amazon Prime Video',
 'Jio Cinema',
 'Netflix',
 'Hotstar',
 'Jio Cinema',
 'Netflix',
 'Amazon Prime Video',
 'Amazon Prime Video',
 'Crunchyroll',
 'Jio Cinema',
 'Jio Cinema,Alt Balaji',
 'Netflix',
 'Netflix',
 'Zee5,VI movies and tv',
 'Amazon Prime Video',
 'Apple TV Plus',
 'MX Player',
 'Amazon Prime Video',
 'Netflix',
 'Amazon Prime Video',
 'Amazon Prime Video',
 'Amazon Prime Video',
 'Netflix',
 'Amazon Prime Video',
 'Netflix',
 'Crunchyroll',
 'Amazon Prime Video',
 'NA',
 'Jio Cinema',
 'Hotstar',
 'Hotstar',
 'Jio Cinema',
 'Jio Cinema',
 'Netflix',
 'Amazon miniTV',
 'NA',
 'Jio Cinema',
 'Netflix',
 'Amazon miniTV',
 'Netflix',
 'Jio Cinema',
 'Netflix',
 'Amazon Prime Video',
 'NA',
 'Netflix',
 'Apple TV Plus',
 'Alt Balaji',
 'NA',
 'NA',
 'Alt Balaji',
 'Crunchyroll',
 'Netflix',
 'Netflix',
 'Sony Liv',
 'Sony Liv',
 'Jio Cinema,Netflix',
 'Netflix',
 'Crunchyroll',
 'NA',
 'Amazon Prime Video',
 'Amazon Prime Vi

## **Fetching Duration Details**

In [None]:
# Write Your Code here
#Creating an empty list
Show_runtime=[]
for url in Shows_url:
  #Reset per URL so a page without a runtime never reuses the previous show's value
  runtime = "NA"

  #Using try block so a failed request or unexpected layout keeps the "NA" default
  try:
    content = requests.get(url)
    soup = BeautifulSoup(content.text , 'html.parser')

    for x in soup.find_all('div',attrs={'class':'detail-infos'}):
      heading = x.find('h3',attrs={'class':'detail-infos__subheading'})
      if heading is not None and heading.text.strip()=='Runtime':
        runtime= x.find_all('div',attrs={'class':'detail-infos__value'})[0].text.strip()

  except Exception as e:
    runtime = "NA"
  Show_runtime.append(runtime) #appending the list

  time.sleep(4)

#Printing the list
Show_runtime

['35min',
 '30min',
 '54min',
 '50min',
 '58min',
 '1h 2min',
 '59min',
 '58min',
 '46min',
 '1h 1min',
 '59min',
 '25min',
 '43min',
 '24min',
 '56min',
 '53min',
 '38min',
 '45min',
 '52min',
 '43min',
 '56min',
 '52min',
 '43min',
 '25min',
 '45min',
 '24min',
 '19min',
 '54min',
 '24min',
 '48min',
 '50min',
 '44min',
 '21min',
 '41min',
 '1h 1min',
 '47min',
 '58min',
 '24min',
 '28min',
 '58min',
 '47min',
 '32min',
 '50min',
 '58min',
 '1h 1min',
 '23min',
 '42min',
 '47min',
 '42min',
 '44min',
 '35min',
 '43min',
 '25min',
 '24min',
 '56min',
 '41min',
 '31min',
 '52min',
 '44min',
 '34min',
 '24min',
 '43min',
 '43min',
 '58min',
 '31min',
 '50min',
 '52min',
 '49min',
 '21min',
 '43min',
 '50min',
 '1h 28min',
 '41min',
 '24min',
 '54min',
 '48min',
 '24min',
 '44min',
 '54min',
 '44min',
 '40min',
 '48min',
 '51min',
 '34min',
 '1h 6min',
 '1h 4min',
 '26min',
 '50min',
 '40min',
 '44min',
 '43min',
 '45min',
 '52min',
 '22min',
 '43min',
 '49min',
 '24min',
 '44min',
 '31m
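
The runtimes above are strings in two shapes, such as `'35min'` and `'1h 2min'`, so they cannot be averaged or sorted numerically as they stand. A minimal sketch of a converter to minutes is shown below. The name `runtime_to_minutes` is hypothetical, and the sketch assumes only the two shapes visible in the output (plus `"NA"`).

```python
import re

def runtime_to_minutes(text):
    """Convert '35min' or '1h 2min' to total minutes; return None if unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)min)?\s*", text)
    # Reject strings that match only because every part is optional (e.g. "")
    if match is None or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

samples = ["35min", "1h 2min", "1h 28min", "NA"]
converted = [runtime_to_minutes(s) for s in samples]
print(converted)
```

Returning `None` for unparseable values keeps missing runtimes distinguishable from a genuine zero.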

## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here
Shows_data = {
    'Show_title':Show_title,
    'Show_year':Show_year,
    'Show_genres':Show_genres,
    'Show_rating':Show_rating,
    'Show_runtime':Show_runtime,
    'Show_age_rating':Show_age_rating,
    'Production_country':Production_country,
    'Show_stream':Show_stream,
    'Shows_url':Shows_url
}
df = pd.DataFrame(Shows_data)
df


| | Show_title | Show_year | Show_genres | Show_rating | Show_runtime | Show_age_rating | Production_country | Show_stream | Shows_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Panchayat | 2020 | Comedy, Drama | 9.0 (93k) | 35min | | India | Amazon Prime Video | https://www.justwatch.com/in/tv-show/panchayat |
| 1 | Gullak | 2019 | Drama, Kids & Family, Comedy | 9.1 (23k) | 30min | | India | Sony Liv | https://www.justwatch.com/in/tv-show/gullak |
| 2 | Heeramandi: The Diamond Bazaar | 2024 | Romance, Drama, History, War & Military | 6.4 (26k) | 54min | | India | Netflix | https://www.justwatch.com/in/tv-show/heeramandi |
| 3 | Mirzapur | 2018 | Crime, Action & Adventure, Drama, Mystery & Th... | 8.5 (82k) | 50min | | India | Amazon Prime Video | https://www.justwatch.com/in/tv-show/mirzapur |
| 4 | House of the Dragon | 2022 | Drama, Fantasy, Romance, Action & Adventure, S... | 8.4 (389k) | 58min | A | United States | Jio Cinema | https://www.justwatch.com/in/tv-show/house-of-... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Loki | 2021 | Fantasy, Science-Fiction, Action & Adventure | 8.2 (415k) | 49min | | United States | Hotstar | https://www.justwatch.com/in/tv-show/loki |
| 96 | DEATH NOTE | 2006 | Animation, Mystery & Thriller, Science-Fiction... | 8.9 (383k) | 24min | | Japan | Netflix | https://www.justwatch.com/in/tv-show/death-note |
| 97 | House | 2004 | Drama, Mystery & Thriller | 8.7 (519k) | 44min | U | United States | Amazon Prime Video | https://www.justwatch.com/in/tv-show/house |
| 98 | Flames | 2018 | Comedy, Drama, Romance, Made in Europe | 8.9 (32k) | 31min | | India | Amazon Prime Video | https://www.justwatch.com/in/tv-show/flames |


In [None]:
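#index=False leaves out the row index, so reloading the file does not add an extra 'Unnamed: 0' column
df.to_csv('Shows_data.csv', index=False)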

## **Task 2 :- Data Filtering & Analysis**

In [None]:
# Write Your Code here
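
One possible approach to the filtering task is sketched below. It assumes the column names `Show_year` and `Show_rating` from the DataFrame built earlier, assumes ratings are stored as strings like `"9.0 (93k)"`, and reads "released in the last 2 years" as `year >= current_year - 2`. The function name `filter_recent_high_rated` and the sample rows are hypothetical and purely illustrative.

```python
import pandas as pd

# Hypothetical rows in the shape of the scraped frame (values are illustrative only)
sample = pd.DataFrame({
    "Show_title": ["A", "B", "C", "D"],
    "Show_year": ["2024", "2019", "2023", "NA"],
    "Show_rating": ["8.1 (12k)", "9.1 (23k)", "6.4 (26k)", "NA"],
})

def filter_recent_high_rated(df, current_year, years_back=2, min_rating=7.0):
    """Keep rows released within years_back of current_year and rated >= min_rating."""
    out = df.copy()
    # Non-numeric years such as "NA" become NaN and fail the comparison below
    out["year_num"] = pd.to_numeric(out["Show_year"], errors="coerce")
    # Take the leading number of "9.0 (93k)" as the IMDb score
    score = out["Show_rating"].astype(str).str.extract(r"^\s*(\d+(?:\.\d+)?)")[0]
    out["rating_num"] = pd.to_numeric(score, errors="coerce")
    mask = (out["year_num"] >= current_year - years_back) & (out["rating_num"] >= min_rating)
    return out[mask]

# A fixed year keeps the example reproducible
filtered = filter_recent_high_rated(sample, current_year=2024)
print(filtered[["Show_title", "year_num", "rating_num"]])
```

In the notebook itself, `pd.Timestamp.now().year` could be passed as `current_year` so the window follows the current date.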


## **Calculating Mean IMDb Ratings for both Movies and TV Shows**

In [None]:
# Write Your Code here
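
A minimal sketch of the mean-rating calculation follows, assuming ratings are stored as strings like `"9.0 (93k)"` as in the TV show table. The helper name `mean_imdb_rating` and both sample series are hypothetical; in the notebook the inputs would be the rating columns of the movie and TV show DataFrames.

```python
import pandas as pd

def mean_imdb_rating(rating_column):
    """Average the leading numeric score of each rating string, ignoring missing values."""
    score = rating_column.astype(str).str.extract(r"^\s*(\d+(?:\.\d+)?)")[0]
    # "NA" entries yield NaN, which mean() skips by default
    return pd.to_numeric(score, errors="coerce").mean()

# Hypothetical values for illustration only
show_ratings = pd.Series(["9.0 (93k)", "8.0 (10k)", "NA"])
movie_ratings = pd.Series(["7.0 (5k)", "6.0 (2k)"])

show_mean = mean_imdb_rating(show_ratings)
movie_mean = mean_imdb_rating(movie_ratings)
print(f"TV shows: {show_mean:.2f}, movies: {movie_mean:.2f}")
```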


## **Analyzing Top Genres**

In [None]:
# Write Your Code here
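
Each genre cell holds several comma-separated genres, so counting whole cells would undercount. One way to handle this, sketched here under that assumption, is to split each cell and count the individual entries. The helper name `top_values` and the sample data are hypothetical.

```python
import pandas as pd

def top_values(column, n=5, sep=","):
    """Split multi-valued cells on sep and return the n most frequent entries."""
    items = column.dropna().astype(str).str.split(sep).explode().str.strip()
    items = items[(items != "") & (items != "NA")]  # drop placeholders
    return items.value_counts().head(n)

# Hypothetical genre strings in the same comma-separated shape as the scraped data
genres = pd.Series([
    "Comedy, Drama",
    "Drama, Kids & Family, Comedy",
    "Crime, Drama",
    "NA",
])

top_genres = top_values(genres, n=3)
print(top_genres)
```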


In [None]:
#Let's visualize it using a word cloud


## **Finding Predominant Streaming Service**

In [None]:
# Write Your Code here
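
The streaming entries collected earlier are comma-joined provider names, with `"NA"` where nothing was found. A minimal sketch for finding the most common provider is given below, using only the standard library. The function name `provider_counts` and the sample list are hypothetical.

```python
from collections import Counter

def provider_counts(stream_entries):
    """Count each provider across comma-joined entries, skipping 'NA' placeholders."""
    counts = Counter()
    for entry in stream_entries:
        for name in entry.split(","):
            name = name.strip()
            if name and name != "NA":
                counts[name] += 1
    return counts

# Hypothetical entries in the same shape as the Show_stream list
sample_streams = ["Netflix", "Jio Cinema,Netflix", "NA", "Amazon Prime Video", "Netflix"]

counts = provider_counts(sample_streams)
top_provider, top_count = counts.most_common(1)[0]
print(top_provider, top_count)
```

The resulting `Counter` maps names to frequencies, which is the kind of mapping a word cloud is typically built from.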


In [None]:
#Let's visualize it using a word cloud


## **Task 3 :- Data Export**

In [None]:
#saving final dataframe as Final Data in csv format


In [None]:
#saving filter data as Filter Data in csv format
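
Two pandas defaults affect the exported files. `to_csv` writes the row index unless told otherwise, which shows up as an extra `Unnamed: 0` column on reload, and `read_csv` turns the literal string `"NA"` into a missing value. The sketch below illustrates both with a hypothetical two-row frame and an in-memory buffer standing in for the CSV file. For the actual export, a file name of your choice would replace the buffer.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the final or filtered DataFrame
final_df = pd.DataFrame({
    "Show_title": ["Example One", "Example Two"],
    "Show_age_rating": ["NA", "A"],
})

buffer = io.StringIO()                 # in-memory stand-in for a CSV file on disk
final_df.to_csv(buffer, index=False)   # index=False omits the row index column
buffer.seek(0)

# keep_default_na=False keeps "NA" as text instead of converting it to NaN
reloaded = pd.read_csv(buffer, keep_default_na=False)
print(reloaded)
```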


# **Dataset Drive Link (View Access with Anyone) -**

# ***Congratulations!!! You have completed your Assignment.***