# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```
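A minimal sketch of these two filters, assuming a DataFrame with hypothetical columns `release_year` (integer) and `imdb_rating` (float). The column names, the helper name, and the sample rows are illustrative only. Because the scraped data carries just a release year, "last 2 years" is approximated at year granularity here.

```python
from datetime import date

import pandas as pd


def filter_recent_and_well_rated(df, min_rating=7.0, years_back=2, today=None):
    """Keep rows released within the last `years_back` years with rating >= min_rating."""
    today = today or date.today()
    cutoff_year = today.year - years_back
    # Comparisons against NaN evaluate to False, so unrated titles are dropped.
    mask = (df["release_year"] >= cutoff_year) & (df["imdb_rating"] >= min_rating)
    return df.loc[mask].reset_index(drop=True)


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "release_year": [2024, 2023, 2019, 2024],
        "imdb_rating": [7.5, 6.9, 8.8, None],
    }
)
filtered = filter_recent_and_well_rated(sample, today=date(2024, 6, 1))
print(filtered)
```

Passing `today` explicitly keeps the result reproducible; leaving it out uses the current date, as the task asks.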

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
      
   ```   
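One way the three analysis steps could look in Pandas, sketched under the assumption that genres are stored as comma-separated strings and streaming services as a list per title. The column names `imdb_rating`, `genres`, and `services` and the function `summarize` are hypothetical.

```python
import pandas as pd


def summarize(df):
    """Return (average rating, top-5 genre counts, most common streaming service)."""
    avg_rating = df["imdb_rating"].mean()  # NaN ratings are skipped by default

    # Split "Drama, Comedy" into one row per genre before counting.
    genres = df["genres"].str.split(",").explode().str.strip()
    top_genres = genres.value_counts().head(5)

    # One row per (title, service); empty lists become NaN and are dropped.
    services = df["services"].explode().dropna()
    top_service = services.value_counts().idxmax() if not services.empty else None
    return avg_rating, top_genres, top_service


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "imdb_rating": [8.0, 7.0, 6.0],
        "genres": ["Drama, Comedy", "Drama", "Drama, History"],
        "services": [["Netflix"], ["Netflix", "Hotstar"], []],
    }
)
avg_rating, top_genres, top_service = summarize(sample)
print(avg_rating, top_service)
print(top_genres)
```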

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access enabled for anyone with the link.
```
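A small sketch of the export step, assuming the filtered result lives in a DataFrame called `filtered_df` (a hypothetical name). The file name and temporary directory are placeholders; in Colab the path would point into the mounted Drive folder instead.

```python
import os
import tempfile

import pandas as pd

# Hypothetical filtered result, for illustration only.
filtered_df = pd.DataFrame({"title": ["12th Fail"], "imdb_rating": [9.1]})

# index=False avoids writing the row index as an extra unnamed column.
out_path = os.path.join(tempfile.gettempdir(), "filtered_titles.csv")
filtered_df.to_csv(out_path, index=False)

# Reading the file back is a quick check that it round-trips.
round_trip = pd.read_csv(out_path)
print(round_trip)
```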

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code shouldn't have any errors and should be executable in one go.

- Before the conclusion, include your dataset Drive link in the notebook.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.
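
For note 1, a minimal sketch of a fetch helper with a timeout, a status check, and a few retries. `fetch_html` and its parameters are hypothetical names chosen for illustration; the retry count and backoff are arbitrary defaults, not requirements of the assignment.

```python
import time

import requests


def fetch_html(url, session=None, retries=3, timeout=10, backoff=1.0):
    """Return the page HTML, or None if every attempt fails."""
    session = session or requests.Session()
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # turn 4xx/5xx responses into exceptions
            return response.text
        except requests.exceptions.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(backoff * attempt)  # simple linear backoff
    return None
```

Returning `None` instead of raising lets the caller record a missing row and continue with the next URL.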








# **Start The Project**

## **Task 1:- Web Scraping**

In [None]:
#Installing all necessary libraries
!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [None]:
#import all necessary libraries
import requests  # Library for making HTTP requests
from bs4 import BeautifulSoup  # Library for parsing HTML and XML documents
import re  # Regular expression operations
import pandas as pd  # Library for data manipulation and analysis
import numpy as np  # Library for numerical computing

## **Scraping Movie Data**

In [None]:
# Specifying the URL from which movies related data will be fetched
url='https://www.justwatch.com/in/movies?release_year_from=2000'

# Sending an HTTP GET request to the URL
page=requests.get(url)
# Here, page is the Response object returned by the HTTP request, holding all the response data (content, encoding, status, etc.).

# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(page.text,'html.parser')

# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching Movie URLs**

In [None]:
# Function definition to create a list of URLs from which information can be scraped later on.
def extract_movie_urls(soup):
    """
    This function extracts URLs of movies from the parsed HTML content.
    I/P args: soup (BeautifulSoup): Parsed HTML content of the webpage.
    O/P args: movie_url_list: A list of URLs of movies.
    """
    try:
        # Initialize an empty list to store the URLs of movies
        movie_url_list = []

        # Find all <div> elements with the class 'title-list-grid__item'
        for movie_div in soup.find_all('div', attrs={'class': 'title-list-grid__item'}):
            # Find the <a> element within each <div> and extract the value of the 'href' attribute
            movie_url = 'https://www.justwatch.com' + movie_div.find('a')['href']
            # Append the complete URL to the movie_url_list
            movie_url_list.append(movie_url)

        return movie_url_list

    except Exception as e:
        print("An error occurred while extracting movie URLs:", e)
        return None
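
As a variation on the function above, here is a sketch that builds absolute URLs with `urljoin` and skips grid items that have no link, so one malformed item does not discard the whole list. The helper name `extract_title_urls` and the HTML fragment are hypothetical; only the class name `title-list-grid__item` comes from the notebook.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.justwatch.com"


def extract_title_urls(soup, base_url=BASE_URL):
    """Collect absolute URLs from grid items, skipping items without a link."""
    urls = []
    for item in soup.find_all("div", class_="title-list-grid__item"):
        link = item.find("a", href=True)
        if link is None:
            continue  # item without an anchor: nothing to collect
        urls.append(urljoin(base_url, link["href"]))
    return urls


# Hypothetical HTML fragment that mimics the class name used above.
html = """
<div class="title-list-grid__item"><a href="/in/movie/dunki">Dunki</a></div>
<div class="title-list-grid__item"></div>
"""
urls = extract_title_urls(BeautifulSoup(html, "html.parser"))
print(urls)
```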

In [None]:
#Function calling
movie_url_list=extract_movie_urls(soup)

In [None]:
print(movie_url_list)



## **Scraping Movie Title**

In [None]:
#Function definition for scraping movie titles from the movie URLs
def scrape_movie_titles(movie_url_list):
    """ This function scrapes movie titles from the provided list of movie URLs.
    I/Ps: movie_url_list (list): A list of URLs of movies.
    O/Ps: movie_titles: A list of movie titles.
    """
    try:
        # Initialize an empty list to store the movie titles
        movie_titles = []

        # Iterate through each movie URL in the movie_url_list
        for movie_url in movie_url_list:
            try:
                # Sending an HTTP GET request to the URL
                movie_page = requests.get(movie_url)
                # Check if the request was successful (status code 200)
                if movie_page.status_code == 200:
                    # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                    movie_soup = BeautifulSoup(movie_page.text, 'html.parser')
                    # Find the movie title element and extract the text
                    movie_title = movie_soup.find('div', class_='title-block').find('h1').text
                    # Append the movie title to the list
                    movie_titles.append(movie_title)
                else:
                    print(f"Failed to retrieve data from {movie_url}: Status code {movie_page.status_code}")
            except Exception as e:
                print(f"An error occurred while processing {movie_url}: {e}")

        return movie_titles

    except Exception as e:
        print("An error occurred:", e)
        return None

In [None]:
#Function calling
movie_titles=scrape_movie_titles(movie_url_list)

In [None]:
print(len(movie_titles))
print(movie_titles)

100


## **Scraping Release Year**

In [None]:
#Function definition for scraping release year.
def scrape_movie_release_years(movie_url_list):
    """ This function scrapes release years of movies from the provided list of movie URLs.
    I/P args: movie_url_list (list): A list of URLs of movies.
    O/P args: movie_release_years: A list of release years of movies.
    """
    try:
        # Initialize an empty list to store the release years of movies
        movie_release_years = []

        # Iterate through each movie URL in the movie_url_list
        for movie_url in movie_url_list:
            try:
                # Sending an HTTP GET request to the URL
                movie_page = requests.get(movie_url)

                # Check if the request was successful (status code 200)
                if movie_page.status_code == 200:
                    # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                    movie_soup = BeautifulSoup(movie_page.text, 'html.parser')
                    # Find the release year element and extract the text
                    release_year = movie_soup.find('div', attrs={'class': 'title-block'}).find('span').text
                    # Append the release year to the list
                    movie_release_years.append(release_year)
                else:
                    print(f"Failed to retrieve data from {movie_url}: Status code {movie_page.status_code}")
            except Exception as e:
                print(f"An error occurred while processing {movie_url}: {e}")

        return movie_release_years

    except Exception as e:
        print("An error occurred:", e)
        return None

In [None]:
#Function calling
movie_release_years = scrape_movie_release_years(movie_url_list)

In [None]:
print(len(movie_release_years))
print(movie_release_years)

100
[' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2024) ', ' (2024) ', ' (2021) ', ' (2024) ', ' (2024) ', ' (2024) ', ' (2023) ', ' (2024) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2024) ', ' (2024) ', ' (2024) ', ' (2024) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2020) ', ' (2014) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2024) ', ' (2024) ', ' (2001) ', ' (2023) ', ' (2019) ', ' (2023) ', ' (2022) ', ' (2013) ', ' (2023) ', ' (2023) ', ' (2024) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2024) ', ' (2023) ', ' (2024) ', ' (2023) ', ' (2018) ', ' (2019) ', ' (2023) ', ' (2016) ', ' (2024) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2024) ', ' (2023) ', ' (2024) ', ' (2009) ', ' (2023) ', ' (2011) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2023) ', ' (2024) ', ' (2018) ', ' (2023) ', ' (2023) ',
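
The scraped values keep their parentheses and padding (for example `' (2023) '`), so they need cleaning before any year-based filter. A minimal sketch, with `parse_release_year` as a hypothetical helper name:

```python
import re


def parse_release_year(raw):
    """Turn a string like ' (2023) ' into the integer 2023, or None if no year is found."""
    match = re.search(r"\((\d{4})\)", raw or "")
    return int(match.group(1)) if match else None


years = [parse_release_year(s) for s in [" (2023) ", " (2001) ", ""]]
print(years)  # [2023, 2001, None]
```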

## **Scraping Genres**

In [None]:
#This function extracts the movie genres.
def extract_movie_genres(movie_url_list):
    """ This extracts movie genres from the provided list of movie URLs.
    I/P args: movie_url_list (list): A list of URLs of movies.
    O/Ps: movie_genres: A list of movie genres.
    """
    try:
        movie_genres = []  # Initialize an empty list to store the movie genres

        # Iterate through each movie URL in the movie_url_list
        for movie_url in movie_url_list:
            try:
                # Sending an HTTP GET request to the URL
                movie_page = requests.get(movie_url)

                # Check if the request was successful (status code 200)
                if movie_page.status_code == 200:
                    # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                    movie_soup = BeautifulSoup(movie_page.text, 'html.parser')

                    # Find the genre information
                    for val in movie_soup.select("div.detail-infos"):
                        if val.select('h3.detail-infos__subheading')[0].text == "Genres":
                            movie_genres.append(val.select('div.detail-infos__value')[0].text)
                            break
                else:
                    print(f"Failed to retrieve data from {movie_url}: Status code {movie_page.status_code}")
            except Exception as e:
                print(f"An error occurred while processing {movie_url}: {e}")

        return movie_genres

    except Exception as e:
        print("An error occurred:", e)
        return None

In [None]:
movie_genres=extract_movie_genres(movie_url_list)

In [None]:
print(len(movie_genres))
print(movie_genres)

100
['Action & Adventure, Drama, Crime, Mystery & Thriller', 'Action & Adventure, Crime, Drama, Mystery & Thriller', 'Drama, Comedy', 'Drama', 'War & Military, Drama, History', 'Drama, History', 'Drama', 'Fantasy, Science-Fiction, Action & Adventure', 'Comedy, Romance', 'Action & Adventure, Drama, Kids & Family', 'Action & Adventure, Drama, Mystery & Thriller', 'Action & Adventure, Science-Fiction, Drama', 'Action & Adventure, Science-Fiction', 'Comedy, Fantasy, Action & Adventure', 'Action & Adventure, Mystery & Thriller, War & Military', 'Drama, Romance, Action & Adventure', 'Action & Adventure, Mystery & Thriller, Drama', 'Drama, Mystery & Thriller', 'Science-Fiction, Action & Adventure, Fantasy', 'Mystery & Thriller, Action & Adventure, Drama', 'Comedy, Kids & Family, Animation, Action & Adventure', 'Kids & Family, Romance, Drama', 'Mystery & Thriller, Crime, Drama', 'Drama, Crime', 'Action & Adventure, Crime, Drama, Mystery & Thriller', 'Comedy, Drama, Romance, Science-Fiction', '

## **Scraping IMDb Rating**

In [None]:
#Function definition for scraping IMDb rating.
def extract_movie_imdb_ratings(movie_url_list):
    """ It extracts IMDb ratings of movies from the provided list of movie URLs.
    I/P args: movie_url_list (list): A list of URLs of movies.
    O/P args: list: A list of IMDb ratings of movies.
    """
    movie_imdb_ratings = []  # Initialize an empty list to store the IMDb ratings

    # Iterate through each movie URL in the movie_url_list
    for movie_url in movie_url_list:
        # Sending an HTTP GET request to the URL
        movie_page = requests.get(movie_url)

        # Check if the request was successful (status code 200)
        if movie_page.status_code == 200:
            # Parsing the HTML content using BeautifulSoup with the 'html.parser'
            movie_soup = BeautifulSoup(movie_page.text, 'html.parser')

            # Find the IMDb rating information
            rating = "None"  # Default so every successfully fetched page contributes exactly one entry
            for val in movie_soup.select("div.detail-infos"):
                if val.select('h3.detail-infos__subheading')[0].text == "Rating":
                    # Try-except block to handle cases where IMDb rating is not available
                    try:
                        rating = val.select('span span')[0].text
                    except IndexError:
                        pass  # IMDb rating is not available; keep the "None" default
                    break
            movie_imdb_ratings.append(rating)
        else:
            print(f"Failed to retrieve data from {movie_url}: Status code {movie_page.status_code}")

    return movie_imdb_ratings

In [None]:
#Function calling
movie_imdb_ratings=extract_movie_imdb_ratings(movie_url_list)

In [None]:
print(len(movie_imdb_ratings))
print(movie_imdb_ratings)

100
[' 6.3  (82k) ', ' 6.5  (60k) ', ' 6.8  (68k) ', ' 9.1  (99k) ', ' 7.8  (13k) ', ' 8.4  (649k) ', ' 7.1  (68k) ', ' 5.7  (58k) ', ' 6.3  (38k) ', ' 5.4  (19k) ', ' 6.8  (5k) ', ' 8.0  (766k) ', ' 6.1  (2k) ', ' 8.4  (17k) ', ' 7.0  (36k) ', ' 8.1  (2k) ', ' 6.4  (57k) ', ' 7.6  (5k) ', ' 5.6  (110k) ', ' 7.0  (92k) ', ' 6.7  (17k) ', ' 8.2  (13k) ', ' 7.8  (72k) ', ' 7.2  (3k) ', ' 5.0  (1k) ', ' 7.1  (41k) ', ' 6.1  (11k) ', ' 7.7  (208k) ', ' 7.8  (107k) ', ' 7.9  (8k) ', ' 7.7  (233k) ', ' 8.4  (101k) ', ' 6.7  (1k) ', ' 7.0  (8k) ', ' 3.3  (97k) ', ' 8.7  (2m) ', ' 7.9  (90k) ', ' 6.9  (499k) ', ' 5.8  (52k) ', ' 8.8  (5k) ', ' 7.5  ', ' 7.6  (849k) ', ' 7.8  (1k) ', ' 8.2  (71k) ', ' 6.8  ', ' 8.7  (39k) ', ' 8.2  (1m) ', ' 6.0  (89k) ', ' 6.7  (14k) ', ' 6.6  ', ' 7.7  (108k) ', ' 7.0  (178k) ', ' 7.2  (57k) ', ' 7.1  (132k) ', ' 5.7  (11k) ', ' 7.7  (19k) ', ' 8.1  (5k) ', ' 7.0  ', ' 8.0  (103k) ', ' 8.2  (97k) ', ' 6.1  (18k) ', ' 7.1  (99k) ', ' 8.0  (1m) ', ' 4.6  ', ' 6
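
The raw strings mix the rating with an abbreviated vote count (`' 6.3  (82k) '`), and some entries have no count at all (`' 7.5  '`). A sketch of one way to separate them before numeric filtering; `parse_imdb_rating` is a hypothetical helper, and reading `k`/`m` as thousands/millions is an assumption about the abbreviation.

```python
import re

_SUFFIX = {"k": 1_000, "m": 1_000_000}
_PATTERN = re.compile(
    r"\s*(\d+(?:\.\d+)?)"                      # rating, e.g. 6.3
    r"\s*(?:\((\d+(?:\.\d+)?)([km]?)\))?",     # optional vote count, e.g. (82k)
    re.IGNORECASE,
)


def parse_imdb_rating(raw):
    """Split a string like ' 6.3  (82k) ' into (rating, approximate vote count)."""
    match = _PATTERN.match(raw or "")
    if not match:
        return None, None
    rating = float(match.group(1))
    votes = None
    if match.group(2):
        multiplier = _SUFFIX.get(match.group(3).lower(), 1)
        votes = int(round(float(match.group(2)) * multiplier))
    return rating, votes


print(parse_imdb_rating(" 6.3  (82k) "))  # (6.3, 82000)
print(parse_imdb_rating(" 7.5  "))        # (7.5, None)
print(parse_imdb_rating("None"))          # (None, None)
```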

## **Scraping Runtime/Duration**

In [None]:
#Function definition for scraping runtime/duration
def extract_movie_runtimes(movie_url_list):
    """ This function extracts movie runtimes from the provided list of movie URLs.
    I/P args: movie_url_list (list): A list of URLs of movies.
    O/P args: movie_runtimes: A list of movie runtimes.
    """
    movie_runtimes = []  # Initialize an empty list to store the movie runtimes
    # Iterate through each movie URL in the movie_url_list
    for movie_url in movie_url_list:
        try:
            # Sending an HTTP GET request to the URL
            movie_page = requests.get(movie_url)

            # Check if the request was successful (status code 200)
            if movie_page.status_code == 200:
                # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                movie_soup = BeautifulSoup(movie_page.text, 'html.parser')

                # Find the runtime information
                for val in movie_soup.select("div.detail-infos"):
                    if val.select('h3.detail-infos__subheading')[0].text == "Runtime":
                        movie_runtimes.append(val.select('div.detail-infos__value')[0].text)
                        break
            else:
                print(f"Failed to retrieve data from {movie_url}: Status code {movie_page.status_code}")
        except Exception as e:
            print(f"An error occurred while processing {movie_url}: {e}")
    return movie_runtimes

In [None]:
movie_runtimes=extract_movie_runtimes(movie_url_list)

In [None]:
print(len(movie_runtimes))
print(movie_runtimes)

100
['3h 21min', '2h 55min', '2h 40min', '2h 26min', '2h 30min', '3h 1min', '2h 18min', '2h 4min', '1h 43min', '2h 39min', '2h 37min', '2h 35min', '2h 35min', '2h 39min', '2h 47min', '2h 28min', '1h 45min', '2h 30min', '1h 45min', '2h 49min', '1h 23min', '2h 35min', '2h 32min', '2h 14min', '2h 18min', '2h 21min', '1h 45min', '3h 26min', '2h 24min', '2h 7min', '2h 44min', '2h 21min', '1h 57min', '2h 38min', '1h 54min', '2h 49min', '1h 46min', '1h 54min', '2h 35min', '2h 38min', '2h 5min', '2h 32min', '1h 45min', '2h 18min', '2h 35min', '2h 46min', '3h 0min', '2h 8min', '2h 20min', '2h 17min', '2h 11min', '2h 11min', '2h 43min', '1h 35min', '2h 28min', '2h 24min', '2h 15min', '2h 23min', '2h 13min', '2h 36min', '2h 0min', '1h 57min', '1h 48min', '2h 30min', '1h 52min', '2h 38min', '2h 21min', '1h 30min', '1h 52min', '1h 47min', '2h 5min', '1h 33min', '2h 33min', '2h 24min', '2h 39min', '1h 59min', '2h 48min', '1h 38min', '2h 48min', '2h 34min', '1h 53min', '2h 24min', '2h 8min', '1h 47mi
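
Runtimes arrive as text such as `'3h 21min'`. If they are needed for sorting or statistics, they could be converted to minutes along these lines (`runtime_to_minutes` is a hypothetical helper name):

```python
import re


def runtime_to_minutes(raw):
    """Convert '3h 21min', '2h' or '45min' to total minutes; return None if unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)min)?\s*", raw or "")
    if not match or not (match.group(1) or match.group(2)):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


print(runtime_to_minutes("3h 21min"))  # 201
print(runtime_to_minutes("1h 43min"))  # 103
```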

## **Scraping Age Rating**

In [None]:
#This function extracts the movie age ratings.
def extract_movie_age_rating(movie_url_list):
    """ This extracts movie age ratings from the provided list of movie URLs.
    I/P args: movie_url_list (list): A list of URLs of movies.
    O/Ps: movie_age_rating: A list of movie age ratings.
    """
    try:
        movie_age_rating = []  # Initialize an empty list to store the movie age ratings
        # Iterate through each movie URL in the movie_url_list
        for movie_url in movie_url_list:
            try:
                # Sending an HTTP GET request to the URL
                movie_page = requests.get(movie_url)

                # Check if the request was successful (status code 200)
                if movie_page.status_code == 200:
                    # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                    movie_soup = BeautifulSoup(movie_page.text, 'html.parser')

                    # Find the age rating information
                    age_rating_found = False
                    for val in movie_soup.select("div.detail-infos"):
                        if val.select('h3.detail-infos__subheading')[0].text == "Age rating":
                            movie_age_rating.append(val.select('div.detail-infos__value')[0].text)
                            age_rating_found = True
                            break
                    # If age rating information is not found, append NaN
                    if not age_rating_found:
                        movie_age_rating.append(np.nan)
                else:
                    print(f"Failed to retrieve data from {movie_url}: Status code {movie_page.status_code}")
            except Exception as e:
                print(f"An error occurred while processing {movie_url}: {e}")
        return movie_age_rating
    except Exception as e:
        print("An error occurred:", e)
        return None

In [None]:
movie_age_rating = extract_movie_age_rating(movie_url_list)
print(len(movie_age_rating))
print(movie_age_rating)

100
['A', 'A', nan, nan, 'UA', 'UA', 'A', nan, nan, 'UA', 'UA', 'UA', 'U', 'UA', 'UA', 'UA', nan, 'UA', nan, 'UA', nan, 'U', nan, 'A', 'UA', nan, nan, 'A', nan, 'UA', nan, 'A', 'UA', 'UA', nan, nan, 'UA', 'UA', nan, nan, nan, 'U', 'U', 'UA', nan, 'UA', 'A', nan, 'UA', 'UA', nan, nan, 'UA', nan, 'UA', 'UA', 'U', nan, nan, 'UA', nan, 'U', 'A', 'UA', nan, 'UA', 'UA', 'UA', 'UA', nan, nan, nan, 'A', 'A', 'UA', 'A', 'UA', nan, nan, 'UA', 'A', nan, nan, nan, 'UA', nan, 'UA', 'UA', 'A', nan, 'UA', 'A', nan, nan, nan, 'A', 'UA', nan, 'UA', 'UA']


## **Fetching Production Countries Details**

In [None]:
#Function definition for scraping production countries
def extract_movie_Production_country(movie_url_list):
    """ This function extracts movie production countries from the provided list of movie URLs.
    I/P args: movie_url_list (list): A list of URLs of movies.
    O/P args: movie_Production_country: A list of movie production countries.
    """
    movie_Production_country = []  # Initialize an empty list to store the production countries
    # Iterate through each movie URL in the movie_url_list
    for movie_url in movie_url_list:
        #print(movie_url)
        try:
            # Sending an HTTP GET request to the URL
            movie_page = requests.get(movie_url)
            # Check if the request was successful (status code 200)
            if movie_page.status_code == 200:
                # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                movie_soup = BeautifulSoup(movie_page.text, 'html.parser')
                # Find the production country information
                for val in movie_soup.select("div.detail-infos"):
                    if val.select('h3.detail-infos__subheading')[0].text.strip() == "Production country":
                        movie_Production_country.append(val.select('div.detail-infos__value')[0].text)
                        #print(movie_Production_country)
                        break
            else:
                print(f"Failed to retrieve data from {movie_url}: Status code {movie_page.status_code}")
        except Exception as e:
            print(f"An error occurred while processing {movie_url}: {e}")
    return movie_Production_country

movie_Production_country=extract_movie_Production_country(movie_url_list) #Function calling
print(movie_Production_country)
print(len(movie_Production_country))

['India', 'India', 'India', 'India', 'India', 'United States, United Kingdom', 'India', 'United States', 'United States, Australia', 'India', 'India', 'United States', 'India', 'India', 'India', 'India', 'United States, United Kingdom', 'India', 'United States', 'India', 'France, United States, Canada', 'India', 'France', 'India', 'India', 'India', 'United Kingdom, United States', 'United States', 'Spain, United States', 'India', 'United States', 'United Kingdom, United States, Ireland', 'India', 'India', 'Poland', 'United States, United Kingdom', 'South Korea, United States', 'United Kingdom, United States', 'India', 'India', 'India', 'United States, United Kingdom', 'India', 'India', 'India', 'India', 'United States', 'United States', 'India', 'India', 'United States', 'United Kingdom, United States', 'India', 'Australia, United Kingdom', 'India', 'India', 'India', 'India', 'United States', 'India', 'United Kingdom', 'United States, United Kingdom', 'United States', 'India', 'India',

## **Fetching Streaming Service Details**

In [None]:
def extract_movie_streaming_service(movie_url_list):
    """
    This function extracts movie streaming services from the provided list of movie URLs.

    Args:
    movie_url_list (list): A list of URLs of movies.

    Returns:
    list: A list of movie streaming services.
    """
    movie_streaming_service = []  # Initialize an empty list to store the streaming services

    # Iterate through each movie URL in the movie_url_list
    for movie_url in movie_url_list:
        try:
            # Sending an HTTP GET request to the URL
            movie_page = requests.get(movie_url)

            # Check if the request was successful (status code 200)
            if movie_page.status_code == 200:
                # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                movie_soup = BeautifulSoup(movie_page.text, 'html.parser')

                # Find the streaming service information
                streaming_services = []  # Initialize an empty list to store streaming services

                for val in movie_soup.select("article[data-v-3f103c69]"):
                    text = val.find('p', {'data-v-3f103c69': True})
                    if text is not None:
                        streaming_text = text.text.split("streaming on", 1)[-1].strip().rstrip('.')
                        streaming_services.extend(streaming_text.split(','))  # Extend with multiple services
                movie_streaming_service.append(streaming_services)
            else:
                print(f"Failed to retrieve data from {movie_url}: Status code {movie_page.status_code}")
        except Exception as e:
            print(f"An error occurred while processing {movie_url}: {e}")
            movie_streaming_service.append([])  # Append an empty list if an error occurs

    return movie_streaming_service

movie_streaming_service=extract_movie_streaming_service(movie_url_list)
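
Splitting on `"streaming on"` returns the whole sentence when that phrase is absent, which is how unsplit text can end up in the service list. The sketch below returns an empty list in that case and trims trailing clauses. `services_from_sentence` is a hypothetical helper, and the sample sentences and the clause wording it cuts on (`for free`, `or buy`, `or rent`) are assumptions rather than confirmed JustWatch text.

```python
import re


def services_from_sentence(sentence, marker="streaming on"):
    """Return service names listed after `marker`, or [] if the marker is absent."""
    if marker not in sentence:
        return []  # no streaming offer mentioned in this sentence
    tail = sentence.split(marker, 1)[1]
    # Cut at the first clause that is not part of the service list (assumed wording).
    tail = re.split(r"\s+(?:for free|or buy|or rent)\b|\.", tail, maxsplit=1)[0]
    return [name.strip() for name in tail.split(",") if name.strip()]


# Hypothetical sentences, for illustration only.
print(services_from_sentence('You can watch "Salaar" streaming on Netflix, Hotstar.'))
print(services_from_sentence('You can buy "Napoleon" on Apple TV, Google Play.'))
```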

In [None]:
print(movie_streaming_service)
print(len(movie_streaming_service))

100


## **Now Creating Movies DataFrame**

In [None]:
# Create a dictionary of movie data
data_dict = {
    'Movie Names': movie_titles,
    'Release Year': movie_release_years,
    'Movie Genres': movie_genres,
    'Movie IMDb Ratings': movie_imdb_ratings,
    'Movie Url': movie_url_list,
    'Movie Streaming Service': movie_streaming_service
}
#Create a DataFrame from the scraped movie data
movie_df=pd.DataFrame(data_dict)
movie_df

Unnamed: 0,Movie Names,Release Year,Movie Genres,Movie IMDb Ratings,Movie Url,Movie Streaming Service
0,Animal,(2023),"Action & Adventure, Drama, Crime, Mystery & Th...",6.3 (82k),https://www.justwatch.com/in/movie/animal-2022,[Netflix]
1,Salaar,(2023),"Action & Adventure, Crime, Drama, Mystery & Th...",6.5 (60k),https://www.justwatch.com/in/movie/salaar,"[Netflix, Hotstar]"
2,Dunki,(2023),"Drama, Comedy",6.8 (68k),https://www.justwatch.com/in/movie/dunki,[Netflix]
3,12th Fail,(2023),Drama,9.1 (99k),https://www.justwatch.com/in/movie/12th-fail,[Hotstar]
4,Sam Bahadur,(2023),"War & Military, Drama, History",7.8 (13k),https://www.justwatch.com/in/movie/sam-bahadur,[Zee5]
...,...,...,...,...,...,...
95,Napoleon,(2023),"Action & Adventure, Drama, History, War & Mili...",6.4 (111k),https://www.justwatch.com/in/movie/napoleon-2023,"[You can buy ""Napoleon"" on Apple TV, Google P..."
96,Andhadhun,(2018),"Mystery & Thriller, Comedy, Crime, Music & Mus...",8.2 (101k),https://www.justwatch.com/in/movie/andhadhun,[Jio Cinema for free with ads or buy it as dow...
97,Naadu,(2023),Drama,8.9,https://www.justwatch.com/in/movie/naadu,[Amazon Prime Video]
98,Chithha,(2023),"Mystery & Thriller, Drama",8.3 (6k),https://www.justwatch.com/in/movie/chittha,[Hotstar]
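
Each scraping function above requests every movie page again, and `pd.DataFrame` needs all lists to have the same length. A possible alternative, sketched here, parses each page once into a single record and fills missing fields with `None`, so rows stay aligned by construction. `parse_title_page` is a hypothetical helper; the selectors are the ones used in this notebook, while the sample HTML is made up to match them.

```python
from bs4 import BeautifulSoup


def parse_title_page(html, url):
    """Extract all fields from one detail page so every row stays aligned."""
    soup = BeautifulSoup(html, "html.parser")
    record = {"url": url, "title": None, "release_year": None,
              "genres": None, "imdb_rating": None}

    block = soup.find("div", class_="title-block")
    if block is not None:
        heading = block.find("h1")
        year = block.find("span")
        record["title"] = heading.get_text(strip=True) if heading else None
        record["release_year"] = year.get_text(strip=True) if year else None

    # Walk the detail sections once and pick out the ones we need.
    for info in soup.select("div.detail-infos"):
        label = info.select_one("h3.detail-infos__subheading")
        if label is None:
            continue
        name = label.get_text(strip=True)
        if name == "Genres":
            value = info.select_one("div.detail-infos__value")
            record["genres"] = value.get_text(strip=True) if value else None
        elif name == "Rating":
            rating = info.select_one("span span")
            record["imdb_rating"] = rating.get_text(strip=True) if rating else None
    return record


# Hypothetical page fragment, for illustration only.
html = """
<div class="title-block"><h1>Dunki</h1><span> (2023) </span></div>
<div class="detail-infos"><h3 class="detail-infos__subheading">Genres</h3>
  <div class="detail-infos__value">Drama, Comedy</div></div>
<div class="detail-infos"><h3 class="detail-infos__subheading">Rating</h3>
  <div class="detail-infos__value"><span><span> 6.8  (68k) </span></span></div></div>
"""
record = parse_title_page(html, "https://www.justwatch.com/in/movie/dunki")
print(record)
```

A list of such records can be passed straight to `pd.DataFrame(records)`.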



## **Scraping TV Show Data**

In [None]:
# Specifying the URL from which tv show related data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'
# Sending an HTTP GET request to the URL
page=requests.get(tv_url)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(page.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching TV Show URLs**

In [None]:
# Function definition to create a list of TV show URLs from which information can be scraped later on.
def extract_tv_urls(soup):
    """
    This function extracts URLs of TV shows from the parsed HTML content.
    I/P args: soup (BeautifulSoup): Parsed HTML content of the webpage.
    O/P args: tv_url_list: A list of URLs of TV shows.
    """
    try:
        # Initialize an empty list to store the URLs of TV shows
        tv_url_list = []

        # Find all <div> elements with the class 'title-list-grid__item'
        for tv_div in soup.find_all('div', attrs={'class': 'title-list-grid__item'}):
            # Find the <a> element within each <div> and extract the value of the 'href' attribute
            tv_url = 'https://www.justwatch.com' + tv_div.find('a')['href']
            # Append the complete URL to the tv_url_list
            tv_url_list.append(tv_url)

        return tv_url_list

    except Exception as e:
        print("An error occurred while extracting TV show URLs:", e)
        return None

In [None]:
tv_url_list=extract_tv_urls(soup)

In [None]:
print(tv_url_list)
print(len(tv_url_list))

['https://www.justwatch.com/in/tv-show/panchayat', 'https://www.justwatch.com/in/tv-show/game-of-thrones', 'https://www.justwatch.com/in/tv-show/true-detective', 'https://www.justwatch.com/in/tv-show/mirzapur', 'https://www.justwatch.com/in/tv-show/solo-leveling-2024', 'https://www.justwatch.com/in/tv-show/indian-police-force', 'https://www.justwatch.com/in/tv-show/death-and-other-details', 'https://www.justwatch.com/in/tv-show/aarya', 'https://www.justwatch.com/in/tv-show/bigg-boss', 'https://www.justwatch.com/in/tv-show/one-day', 'https://www.justwatch.com/in/tv-show/jack-reacher', 'https://www.justwatch.com/in/tv-show/mr-and-mrs-smith', 'https://www.justwatch.com/in/tv-show/yellowstone', 'https://www.justwatch.com/in/tv-show/mastram', 'https://www.justwatch.com/in/tv-show/halo', 'https://www.justwatch.com/in/tv-show/farzi', 'https://www.justwatch.com/in/tv-show/jujutsu-kaisen', 'https://www.justwatch.com/in/tv-show/young-sheldon', 'https://www.justwatch.com/in/tv-show/the-twelve', '
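
The URL list above is built by concatenating the site root with each `href`. A slightly more defensive variant is sketched below using `urllib.parse.urljoin`, which resolves relative paths and leaves already-absolute links untouched. The helper name `build_title_url` and the constant `BASE_URL` are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the URLs printed above
BASE_URL = "https://www.justwatch.com"

def build_title_url(href, base=BASE_URL):
    """Return an absolute URL for a title link, or None if the href is missing."""
    if not href:
        return None
    # urljoin resolves relative paths against base and keeps absolute URLs as they are
    return urljoin(base, href)

print(build_title_url("/in/tv-show/panchayat"))
```

Returning `None` for a missing `href` lets the caller skip grid items that contain no link instead of raising inside the loop.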

## **Fetching TV Show Title Details**

In [None]:
#Function definition for scraping TV show titles from the TV show URLs
def scrape_tv_titles(tv_url_list):
    """ This function scrapes tv show titles from the provided list of tv show URLs.
    I/Ps: tv_url_list (list): A list of URLs of tv shows.
    O/Ps: tv_titles: A list of tv show titles.
    """
    try:
        # Initialize an empty list to store the tv show titles
        tv_titles = []

        # Iterate through each tv URL in the tv_url_list
        for tv_url in tv_url_list:
            try:
                # Sending an HTTP GET request to the URL
                tv_page = requests.get(tv_url)
                # Check if the request was successful (status code 200)
                if tv_page.status_code == 200:
                    # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                    tv_soup = BeautifulSoup(tv_page.text, 'html.parser')
                    # Find the tv show title element and extract the text
                    tv_title = tv_soup.find('div', class_='title-block').find('h1').text
                    # Append the TV show title to the list
                    tv_titles.append(tv_title)
                else:
                    print(f"Failed to retrieve data from {tv_url}: Status code {tv_page.status_code}")
            except Exception as e:
                print(f"An error occurred while processing {tv_url}: {e}")

        return tv_titles

    except Exception as e:
        print("An error occurred:", e)
        return None

In [None]:
#Function calling
tv_titles=scrape_tv_titles(tv_url_list)

In [None]:
print(len(tv_titles))
print(tv_titles)

100
['Panchayat', 'Game of Thrones', 'True Detective', 'Mirzapur', 'Solo Leveling', 'Indian Police Force', 'Death and Other Details', 'Aarya', 'Bigg Boss', 'One Day', 'Reacher', 'Mr. & Mrs. Smith', 'Yellowstone', 'Mastram', 'Halo', 'Farzi', 'Jujutsu Kaisen', 'Young Sheldon', 'The Twelve', 'Avatar: The Last Airbender', 'Love Never Lies: Poland', 'Money Heist', 'Criminal Justice', 'Masters of the Air', 'Loki', 'Superman & Lois', 'Aashram', 'The Last of Us', 'Scam 1992', 'Shark Tank India', 'The Freelancer', 'Griselda', 'Marry My Husband', 'The Legend of Hanuman', 'The Bear', 'Fargo', 'Gandii Baat', 'Ek Thi Begum', 'Breaking Bad', 'Spartacus', 'Avatar: The Last Airbender', 'Poacher', 'Mashle: Magic and Muscles', 'Lucifer', 'Succession', 'Berlin', 'Gullak', 'Euphoria', 'The Family Man', 'Monarch: Legacy of Monsters', 'The Rookie', 'The Railway Men - The Untold Story of Bhopal 1984', 'House', 'Naruto Shippūden', 'Tokyo Vice', 'Alexander: The Making of a God', 'Modern Family', 'Dark Desire',

## **Fetching Release Year**

In [None]:
#Function definition for scraping TV show release years.
def scrape_movie_release_years(tv_url_list):
    """ This function scrapes release years of tv shows from the provided list of tv show URLs.
    I/P args: tv_url_list (list): A list of URLs of tv shows.
    O/P args: tv_release_years: A list of release years of tv shows.
    """
    try:
        # Initialize an empty list to store the release years of tv shows
        tv_release_years = []

        # Iterate through each tv URL in the tv_url_list
        for tv_url in tv_url_list:
            try:
                # Sending an HTTP GET request to the URL
                tv_page = requests.get(tv_url)

                # Check if the request was successful (status code 200)
                if tv_page.status_code == 200:
                    # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                    tv_soup = BeautifulSoup(tv_page.text, 'html.parser')
                    # Find the release year element and extract the text
                    release_year = tv_soup.find('div', attrs={'class': 'title-block'}).find('span').text
                    # Append the release year to the list
                    tv_release_years.append(release_year)
                else:
                    print(f"Failed to retrieve data from {tv_url}: Status code {tv_page.status_code}")
            except Exception as e:
                print(f"An error occurred while processing {tv_url}: {e}")

        return tv_release_years

    except Exception as e:
        print("An error occurred:", e)
        return None

In [None]:
tv_release_years=scrape_movie_release_years(tv_url_list)

In [None]:
print(tv_release_years)
print(len(tv_release_years))

[' (2020) ', ' (2011) ', ' (2014) ', ' (2018) ', ' (2024) ', ' (2024) ', ' (2024) ', ' (2020) ', ' (2006) ', ' (2024) ', ' (2022) ', ' (2024) ', ' (2018) ', ' (2020) ', ' (2022) ', ' (2023) ', ' (2020) ', ' (2017) ', ' (2019) ', ' (2024) ', ' (2022) ', ' (2017) ', ' (2019) ', ' (2024) ', ' (2021) ', ' (2021) ', ' (2020) ', ' (2023) ', ' (2020) ', ' (2021) ', ' (2023) ', ' (2024) ', ' (2024) ', ' (2021) ', ' (2022) ', ' (2014) ', ' (2018) ', ' (2020) ', ' (2008) ', ' (2010) ', ' (2005) ', ' (2023) ', ' (2023) ', ' (2016) ', ' (2018) ', ' (2023) ', ' (2019) ', ' (2019) ', ' (2019) ', ' (2023) ', ' (2024) ', ' (2023) ', ' (2004) ', ' (2007) ', ' (2022) ', ' (2024) ', ' (2009) ', ' (2020) ', ' (2021) ', ' (2024) ', ' (2023) ', ' (2022) ', ' (2020) ', ' (2023) ', ' (2024) ', ' (2023) ', ' (2024) ', ' (2013) ', ' (2018) ', ' (2015) ', ' (2023) ', ' (2016) ', ' (2023) ', ' (2019) ', ' (2018) ', ' (2024) ', ' (2019) ', ' (2017) ', ' (2010) ', ' (2024) ', ' (2023) ', ' (2020) ', ' (2024) ', ' (
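
The scraped values above are strings wrapped in parentheses and spaces, such as `' (2020) '`. A minimal sketch of turning them into integers at scraping time is shown below; the helper name `parse_release_year` is an assumption introduced for illustration.

```python
import re

def parse_release_year(text):
    """Extract a four-digit year from text such as ' (2020) '; return None if absent."""
    match = re.search(r"\b(?:19|20)\d{2}\b", str(text))
    return int(match.group(0)) if match else None

raw_years = [' (2020) ', ' (2011) ', 'unknown']
print([parse_release_year(y) for y in raw_years])
```

Applied to the scraped list, this would give a numeric column directly, so no parenthesis stripping is needed later in the analysis.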

## **Fetching TV Show Genre Details**

In [None]:
#This function extracts the TV show genres.
def extract_tv_genres(tv_url_list):
    """ This extracts tv show genres from the provided list of tv show URLs.
    I/P args: tv_url_list (list): A list of URLs of tv shows.
    O/Ps: tv_genres: A list of tv show genres.
    """
    try:
        tv_genres = []  # Initialize an empty list to store the tv show genres

        # Iterate through each tv show URL in the tv_url_list
        for tv_url in tv_url_list:
            try:
                # Sending an HTTP GET request to the URL
                tv_page = requests.get(tv_url)

                # Check if the request was successful (status code 200)
                if tv_page.status_code == 200:
                    # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                    tv_soup = BeautifulSoup(tv_page.text, 'html.parser')

                    # Find the genre information
                    for val in tv_soup.select("div.detail-infos"):
                        if val.select('h3.detail-infos__subheading')[0].text == "Genres":
                            tv_genres.append(val.select('div.detail-infos__value')[0].text)
                            break
                else:
                    print(f"Failed to retrieve data from {tv_url}: Status code {tv_page.status_code}")
            except Exception as e:
                print(f"An error occurred while processing {tv_url}: {e}")

        return tv_genres

    except Exception as e:
        print("An error occurred:", e)
        return None

In [None]:
tv_genres=extract_tv_genres(tv_url_list)
print(tv_genres)
print(len(tv_genres))

['Comedy, Drama', 'Science-Fiction, Action & Adventure, Drama, Fantasy', 'Drama, Mystery & Thriller, Crime', 'Action & Adventure, Drama, Crime, Mystery & Thriller', 'Fantasy, Animation, Action & Adventure, Science-Fiction', 'Crime, Action & Adventure', 'Crime, Mystery & Thriller, Drama', 'Crime, Drama, Mystery & Thriller, Action & Adventure', 'Reality TV, Kids & Family', 'Comedy, Drama, Romance', 'Drama, Mystery & Thriller, Action & Adventure, Crime', 'Action & Adventure, Drama, Comedy, Mystery & Thriller, Crime', 'Drama, Western', 'Drama, Comedy, Fantasy', 'Action & Adventure, Science-Fiction, Mystery & Thriller, War & Military', 'Crime, Drama, Mystery & Thriller', 'Fantasy, Mystery & Thriller, Animation, Action & Adventure, Science-Fiction', 'Comedy, Kids & Family', 'Drama, Mystery & Thriller', 'Science-Fiction, Action & Adventure, Drama, Comedy, Kids & Family, Fantasy', 'Reality TV', 'Mystery & Thriller, Action & Adventure, Crime, Drama, Made in Europe', 'Drama, Crime, Mystery & Thr

## **Fetching IMDb Rating Details**

In [None]:
#Function definition for scraping IMDb ratings.
def extract_tv_imdb_ratings(tv_url_list):
    """ It extracts IMDb ratings of tv shows from the provided list of tv show URLs.
    I/P args: tv_url_list (list): A list of URLs of tv shows.
    O/P args: tv_imdb_ratings: A list of IMDb ratings of tv shows.
    """
    tv_imdb_ratings = []  # Initialize an empty list to store the IMDb ratings

    # Iterate through each TV show URL in the tv_url_list
    for tv_url in tv_url_list:
        # Sending an HTTP GET request to the URL
        tv_page = requests.get(tv_url)

        # Check if the request was successful (status code 200)
        if tv_page.status_code == 200:
            # Parsing the HTML content using BeautifulSoup with the 'html.parser'
            tv_soup = BeautifulSoup(tv_page.text, 'html.parser')

            # Find the IMDb rating information
            for val in tv_soup.select("div.detail-infos"):
                if val.select('h3.detail-infos__subheading')[0].text == "Rating":
                    # Try-except block to handle cases where IMDb rating is not available
                    try:
                        tv_imdb_ratings.append(val.select('span span')[0].text)
                    except IndexError:
                        tv_imdb_ratings.append("None")  # If IMDb rating is not available, append "None"
                    break
        else:
            print(f"Failed to retrieve data from {tv_url}: Status code {tv_page.status_code}")

    return tv_imdb_ratings

In [None]:
tv_imdb_ratings=extract_tv_imdb_ratings(tv_url_list)
print(tv_imdb_ratings)
print(len(tv_imdb_ratings))

[' 8.9  (83k) ', ' 9.2  (2m) ', ' 8.9  (643k) ', ' 8.5  (81k) ', ' 8.5  (9k) ', ' 4.9  (57k) ', ' 6.9  (4k) ', ' 7.9  (13k) ', ' 3.7  (3k) ', ' 8.2  (19k) ', ' 8.1  (201k) ', ' 6.9  (25k) ', ' 8.7  (212k) ', ' 7.1  (2k) ', ' 7.2  (80k) ', ' 8.4  (44k) ', ' 8.6  (106k) ', ' 7.6  (96k) ', ' 7.4  (6k) ', ' 7.5  (22k) ', 'None', ' 8.2  (526k) ', ' 8.1  (19k) ', ' 7.9  (15k) ', ' 8.2  (406k) ', ' 7.8  (39k) ', ' 7.4  (32k) ', ' 8.8  (501k) ', ' 9.3  (150k) ', ' 8.7  (4k) ', ' 8.1  (6k) ', ' 7.2  (29k) ', ' 7.9  (3k) ', ' 9.2  (12k) ', ' 8.6  (199k) ', ' 8.9  (416k) ', ' 3.4  (2k) ', ' 8.6  (6k) ', ' 9.5  (2m) ', ' 8.5  (255k) ', ' 9.3  (358k) ', ' 7.5  ', ' 7.6  (5k) ', ' 8.1  (352k) ', ' 8.8  (264k) ', ' 7.0  (25k) ', ' 9.1  (20k) ', ' 8.3  (234k) ', ' 8.7  (97k) ', ' 7.0  (34k) ', ' 8.0  (66k) ', ' 8.5  (20k) ', ' 8.7  (504k) ', ' 8.7  (163k) ', ' 8.0  (37k) ', ' 5.2  (10k) ', ' 8.5  (476k) ', ' 6.5  (9k) ', ' 7.9  (8k) ', ' 6.3  (3k) ', ' 7.1  (31k) ', ' 8.4  (363k) ', 'None', 'None', ' 
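
Each scraped rating combines the score with an abbreviated vote count, for example `' 8.9  (83k) '`, and some entries are the string `'None'` or have no vote count at all. Passing such strings straight to `pd.to_numeric(..., errors='coerce')` yields `NaN`, so the score has to be separated out first. The sketch below shows one way to do that; `parse_imdb_rating` and `_MULTIPLIERS` are hypothetical names, and the `k`/`m` suffixes are assumed to mean thousands and millions.

```python
import re

# Assumed meaning of the vote-count suffixes seen in the output above
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}

def parse_imdb_rating(text):
    """Split a string like ' 8.9  (83k) ' into (rating, approximate_votes)."""
    text = str(text)
    rating_match = re.search(r"\d+(?:\.\d+)?", text)
    if rating_match is None:
        # Covers 'None', NaN and empty strings
        return None, None
    rating = float(rating_match.group(0))
    votes_match = re.search(r"\((\d+(?:\.\d+)?)([km]?)\)", text, flags=re.IGNORECASE)
    if votes_match is None:
        return rating, None
    number, suffix = votes_match.groups()
    votes = int(float(number) * _MULTIPLIERS.get(suffix.lower(), 1))
    return rating, votes

for raw in [' 8.9  (83k) ', ' 9.2  (2m) ', ' 7.5  ', 'None']:
    print(repr(raw), '->', parse_imdb_rating(raw))
```

Because the vote counts are abbreviated on the page, the second value is only an approximation of the true count.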

## **Fetching Age Rating Details**

In [None]:
#This function extracts the TV show age ratings.
def extract_tv_age_rating(tv_url_list):
    """ This extracts TV show age ratings from the provided list of TV show URLs.
    I/P args: tv_url_list (list): A list of URLs of tv shows.
    O/Ps: tv_age_rating: A list of TV show age ratings (NaN where none is listed).
    """
    try:
        tv_age_rating = []  # Initialize an empty list to store the TV show age ratings
        # Iterate through each tv show URL in the tv_url_list
        for tv_url in tv_url_list:
            try:
                # Sending an HTTP GET request to the URL
                tv_page = requests.get(tv_url)

                # Check if the request was successful (status code 200)
                if tv_page.status_code == 200:
                    # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                    tv_soup = BeautifulSoup(tv_page.text, 'html.parser')

                    # Find the age rating information
                    age_rating_found = False
                    for val in tv_soup.select("div.detail-infos"):
                        if val.select('h3.detail-infos__subheading')[0].text == "Age rating":
                            tv_age_rating.append(val.select('div.detail-infos__value')[0].text)
                            age_rating_found = True
                            break
                    # If age rating information is not found, append NaN
                    if not age_rating_found:
                        tv_age_rating.append(np.nan)
                else:
                    print(f"Failed to retrieve data from {tv_url}: Status code {tv_page.status_code}")
            except Exception as e:
                print(f"An error occurred while processing {tv_url}: {e}")
        return tv_age_rating
    except Exception as e:
        print("An error occurred:", e)
        return None

In [None]:
tv_age_rating=extract_tv_age_rating(tv_url_list)
print(tv_age_rating)
print(len(tv_age_rating))

[nan, 'U', 'U', nan, nan, 'A', nan, nan, nan, nan, 'A', nan, nan, nan, nan, nan, nan, 'U', nan, nan, nan, nan, nan, 'A', nan, nan, nan, 'A', nan, 'U', nan, nan, nan, nan, nan, nan, 'A', nan, 'U', nan, 'U', nan, nan, 'U', 'U', nan, nan, 'A', nan, nan, nan, nan, 'UA', 'U', 'A', nan, nan, nan, 'UA', nan, nan, 'A', nan, nan, nan, nan, nan, 'A', nan, 'A', nan, nan, nan, nan, nan, nan, 'A', nan, 'A', nan, nan, nan, nan, nan, 'A', nan, nan, 'A', 'A', 'U', nan, 'U', nan, 'A', nan, nan, nan, nan, nan, 'A']
100


## **Fetching Production Country details**

In [None]:
#Function definition for scraping production country
def extract_tv_Production_country(tv_url_list):
    """ This function extracts tv show production countries from the provided list of tv show URLs.
    I/P args: tv_url_list (list): A list of URLs of tv shows.
    O/P args: tv_Production_country: A list of tv show production countries.
    """
    tv_Production_country = []  # Initialize an empty list to store the production countries
    # Iterate through each tv show URL in the tv_url_list
    for tv_url in tv_url_list:
        try:
            # Sending an HTTP GET request to the URL
            tv_page = requests.get(tv_url)
            # Check if the request was successful (status code 200)
            if tv_page.status_code == 200:
                # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                tv_soup = BeautifulSoup(tv_page.text, 'html.parser')
                # Find the production country information
                for val in tv_soup.select("div.detail-infos"):
                    if val.select('h3.detail-infos__subheading')[0].text.strip() == "Production country":
                        tv_Production_country.append(val.select('div.detail-infos__value')[0].text)
                        break
            else:
                print(f"Failed to retrieve data from {tv_url}: Status code {tv_page.status_code}")
        except Exception as e:
            print(f"An error occurred while processing {tv_url}: {e}")
    return tv_Production_country

tv_Production_country=extract_tv_Production_country(tv_url_list) #Function calling
print(tv_Production_country)
print(len(tv_Production_country))

['India', 'United States', 'United States', 'India', 'South Korea, Japan', 'India', 'United States', 'India', 'India', 'United Kingdom', 'United States', 'United States', 'United States', 'India', 'United States', 'India', 'Japan, United States', 'United States', 'Belgium', 'United States', 'Poland', 'Spain', 'India', 'United States', 'United States', 'United States', 'India', 'United States', 'India', 'India', 'India', 'United States', 'South Korea', 'India', 'United States', 'United States', 'India', 'India', 'United States', 'United States', 'United States', 'United States, India', 'Japan', 'United States', 'United States', 'Spain', 'India', 'United States', 'India', 'United States', 'United States', 'India', 'United States', 'Japan', 'United States', 'United Kingdom', 'United States', 'Mexico', 'India', 'India', 'United States', 'United States', 'South Korea', 'India', 'United States', 'South Korea', 'United States', 'United Kingdom', 'India', 'United States', 'India', 'United Stat

## **Fetching Streaming Service details**

In [None]:
def extract_tv_streaming_service(tv_url_list):
    """
    This function extracts tv show streaming services from the provided list of tv show URLs.
    I/P args: tv_url_list (list): A list of URLs of tv shows.
    O/P args: tv_streaming_service: A list of tv show streaming services.
    """
    tv_streaming_service = []  # Initialize an empty list to store the streaming services

    # Iterate through each tv show URL in the tv_url_list
    for tv_url in tv_url_list:
        try:
            # Sending an HTTP GET request to the URL
            tv_page = requests.get(tv_url)

            # Check if the request was successful (status code 200)
            if tv_page.status_code == 200:
                # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                tv_soup = BeautifulSoup(tv_page.text, 'html.parser')

                # Find the streaming service information
                streaming_services = []  # Initialize an empty list to store streaming services

                for val in tv_soup.select("article[data-v-3f103c69]"):
                    text = val.find('p', {'data-v-3f103c69': True})
                    if text is not None:
                        streaming_text = text.text.split("streaming on", 1)[-1].strip().rstrip('.')
                        streaming_services.extend(streaming_text.split(','))  # Extend with multiple services
                tv_streaming_service.append(streaming_services)
            else:
                print(f"Failed to retrieve data from {tv_url}: Status code {tv_page.status_code}")
        except Exception as e:
            print(f"An error occurred while processing {tv_url}: {e}")
            tv_streaming_service.append([])  # Append an empty list if an error occurs

    return tv_streaming_service

tv_streaming_service=extract_tv_streaming_service(tv_url_list)

In [None]:
print(tv_streaming_service)
print(len(tv_streaming_service))

[['Amazon Prime Video'], ['Jio Cinema'], ['Jio Cinema'], ['Amazon Prime Video'], ['Crunchyroll or for free with ads on Crunchyroll'], ['Amazon Prime Video'], ['Hotstar'], ['Hotstar or for free with ads on Hotstar'], ['Voot for free with ads'], ['Netflix'], ['Amazon Prime Video'], ['Amazon Prime Video'], ['Netflix'], ['We try to add new providers constantly but we couldn\'t find an offer for "Mastram" online. Please come back again soon to check if there\'s something new'], ['Voot', ' Jio Cinema'], ['Amazon Prime Video'], ['Netflix', ' Crunchyroll or for free with ads on Crunchyroll'], ['Amazon Prime Video', ' Netflix'], ['Netflix'], ['Netflix'], ['Netflix'], ['Netflix'], ['Hotstar or for free with ads on Hotstar'], ['Apple TV Plus'], ['Hotstar'], ['Amazon Prime Video'], ['MX Player for free with ads'], ['Jio Cinema'], ['Sony Liv'], ['Sony Liv'], ['Hotstar or for free with ads on Hotstar'], ['Netflix'], ['Amazon Prime Video'], ['Hotstar or for free with ads on Hotstar'], ['Hotstar'], ['
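
The raw output above mixes provider names with qualifiers (`'Hotstar or for free with ads on Hotstar'`, `'Voot for free with ads'`), leading spaces (`' Jio Cinema'`), and a placeholder sentence for titles with no offer. The sketch below normalises these patterns before analysis. It is based only on the strings visible in this notebook's output, so other phrasings on the site may need extra rules; `clean_services` is a hypothetical helper name.

```python
import re

def clean_services(raw_services):
    """Normalise raw provider strings into a de-duplicated list of service names."""
    cleaned = []
    for raw in raw_services:
        name = raw.strip()
        # Placeholder sentence shown when no offer was found: not a provider
        if name.lower().startswith("we try to add new providers"):
            continue
        # Drop any trailing sentence such as 'It is also possible to buy ...'
        name = name.split(". ", 1)[0]
        # Drop ad-supported qualifiers such as 'or for free with ads on X'
        name = re.split(r"\s+(?:or\s+)?for free with ads\b", name, maxsplit=1)[0].strip()
        if name and name not in cleaned:
            cleaned.append(name)
    return cleaned

print(clean_services(['Netflix', ' Crunchyroll or for free with ads on Crunchyroll']))
print(clean_services(['Voot for free with ads']))
```

With names normalised this way, counting offerings per service no longer treats `'Netflix'` and `' Netflix'` as different providers.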

## **Fetching Duration Details**

In [None]:
#Function definition for scraping duration details
def extract_tv_show_runtime(tv_url_list):
    """ This function extracts tv show runtimes from the provided list of tv show URLs.
    I/P args: tv_url_list (list): A list of URLs of tv shows.
    O/P args: tv_show_runtime: A list of tv show runtimes.
    """
    tv_show_runtime = []  # Initialize an empty list to store the tv show runtimes
    # Iterate through each tv show URL in the tv_url_list
    for tv_url in tv_url_list:
        try:
            # Sending an HTTP GET request to the URL
            tv_page = requests.get(tv_url)
            # Check if the request was successful (status code 200)
            if tv_page.status_code == 200:
                # Parsing the HTML content using BeautifulSoup with the 'html.parser'
                tv_soup = BeautifulSoup(tv_page.text, 'html.parser')
                # Find the runtime information
                for val in tv_soup.select("div.detail-infos"):
                    if val.select('h3.detail-infos__subheading')[0].text == "Runtime":
                        tv_show_runtime.append(val.select('div.detail-infos__value')[0].text)
                        break
            else:
                print(f"Failed to retrieve data from {tv_url}: Status code {tv_page.status_code}")
        except Exception as e:
            print(f"An error occurred while processing {tv_url}: {e}")
    return tv_show_runtime

tv_show_runtime=extract_tv_show_runtime(tv_url_list) #Function calling
print(tv_show_runtime)
print(len(tv_show_runtime))

['33min', '58min', '1h 1min', '50min', '24min', '38min', '47min', '46min', '1h 15min', '29min', '48min', '49min', '50min', '28min', '51min', '56min', '23min', '19min', '54min', '54min', '48min', '50min', '50min', '53min', '49min', '42min', '43min', '58min', '52min', '56min', '50min', '55min', '1h 3min', '21min', '33min', '52min', '44min', '30min', '47min', '54min', '24min', '47min', '23min', '47min', '1h 4min', '48min', '29min', '58min', '45min', '46min', '43min', '59min', '44min', '23min', '1h 0min', '39min', '21min', '34min', '44min', '51min', '39min', '1h 2min', '1h 13min', '44min', '1h 4min', '1h 6min', '36min', '58min', '31min', '50min', '29min', '1h 1min', '56min', '26min', '24min', '53min', '1h 1min', '52min', '46min', '25min', '51min', '47min', '1h 6min', '23min', '31min', '42min', '53min', '58min', '57min', '33min', '24min', '42min', '48min', '51min', '54min', '22min', '34min', '53min', '49min', '48min']
100
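
The runtimes above are strings in two shapes, `'58min'` and `'1h 1min'`, which cannot be averaged or compared as they stand. A small sketch of converting them to total minutes follows; `runtime_to_minutes` is an illustrative name, and the pattern covers only the two shapes seen in this output.

```python
import re

def runtime_to_minutes(text):
    """Convert '1h 1min' or '58min' to total minutes; return None if unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", str(text))
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

print([runtime_to_minutes(r) for r in ['33min', '1h 1min', '1h 0min']])
```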


## **Creating TV Show DataFrame**

In [None]:
# Create a dictionary of tv show data
data_dict = {
    'TV show Names': tv_titles,
    'Release Year': tv_release_years,
    'TV show Genres': tv_genres,
    'TV show IMDb Ratings': tv_imdb_ratings,
    'TV show Url': tv_url_list,
    'TV show Streaming Service': tv_streaming_service,
    'TV show duration': tv_show_runtime
}
#Create a DataFrame from the scraped TV show data
tv_df=pd.DataFrame(data_dict)
tv_df

Unnamed: 0,TV show Names,Release Year,TV show Genres,TV show IMDb Ratings,TV show Url,TV show Streaming Service,TV show duration
0,Panchayat,(2020),"Comedy, Drama",8.9 (83k),https://www.justwatch.com/in/tv-show/panchayat,[Amazon Prime Video],33min
1,Game of Thrones,(2011),"Science-Fiction, Action & Adventure, Drama, Fa...",9.2 (2m),https://www.justwatch.com/in/tv-show/game-of-t...,[Jio Cinema],58min
2,True Detective,(2014),"Drama, Mystery & Thriller, Crime",8.9 (643k),https://www.justwatch.com/in/tv-show/true-dete...,[Jio Cinema],1h 1min
3,Mirzapur,(2018),"Action & Adventure, Drama, Crime, Mystery & Th...",8.5 (81k),https://www.justwatch.com/in/tv-show/mirzapur,[Amazon Prime Video],50min
4,Solo Leveling,(2024),"Fantasy, Animation, Action & Adventure, Scienc...",8.5 (9k),https://www.justwatch.com/in/tv-show/solo-leve...,[Crunchyroll or for free with ads on Crunchyroll],24min
...,...,...,...,...,...,...,...
95,Parks and Recreation,(2009),Comedy,8.6 (285k),https://www.justwatch.com/in/tv-show/parks-and...,[Jio Cinema],22min
96,Undekhi,(2020),"Crime, Drama",,https://www.justwatch.com/in/tv-show/undekhi,[Sony Liv],34min
97,The Curse,(2023),"Comedy, Drama, Mystery & Thriller",7.2 (9k),https://www.justwatch.com/in/tv-show/the-curse,"[Lionsgate Play, Lionsgate Play Apple TV Chan...",53min
98,"Good Morning, Verônica",(2020),"Crime, Drama, Mystery & Thriller",7.5 (6k),https://www.justwatch.com/in/tv-show/good-morn...,[Netflix],49min
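
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths. Several of the scraping functions above append nothing when a request fails or a field is missing, so the lists can drift out of step. A quick check before building the frame is sketched below; `report_length_mismatch` is a hypothetical helper name.

```python
def report_length_mismatch(columns):
    """Return {column_name: length} if the column lengths differ, otherwise {}."""
    lengths = {name: len(values) for name, values in columns.items()}
    return lengths if len(set(lengths.values())) > 1 else {}

# Toy example: the ratings list is one entry short
columns = {
    'TV show Names': ['Panchayat', 'Game of Thrones'],
    'Release Year': [2020, 2011],
    'TV show IMDb Ratings': [8.9],
}
print(report_length_mismatch(columns))
```

An empty result means the lists line up. Otherwise the reported lengths point to the field whose scraper skipped entries.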


## **Task 2 :- Data Filtering & Analysis**

## **Task 2 (a):- Data Filtering & Analysis on Movies Dataframe**

In [None]:
# Remove parentheses and surrounding whitespace from Release Year column
movie_df['Release Year'] = movie_df['Release Year'].str.replace(r'[()]', '', regex=True).str.strip()

# Join the list elements of the Movie Streaming Service column into a single string, stripping stray spaces
movie_df['Movie Streaming Service'] = movie_df['Movie Streaming Service'].apply(lambda x: ', '.join(s.strip() for s in x))


In [None]:
movie_df.head()

Unnamed: 0,Movie Names,Release Year,Movie Genres,Movie IMDb Ratings,Movie Url,Movie Streaming Service
0,Animal,2023,"Action & Adventure, Drama, Crime, Mystery & Th...",6.3 (82k),https://www.justwatch.com/in/movie/animal-2022,Netflix
1,Salaar,2023,"Action & Adventure, Crime, Drama, Mystery & Th...",6.5 (60k),https://www.justwatch.com/in/movie/salaar,"Netflix, Hotstar"
2,Dunki,2023,"Drama, Comedy",6.8 (68k),https://www.justwatch.com/in/movie/dunki,Netflix
3,12th Fail,2023,Drama,9.1 (99k),https://www.justwatch.com/in/movie/12th-fail,Hotstar
4,Sam Bahadur,2023,"War & Military, Drama, History",7.8 (13k),https://www.justwatch.com/in/movie/sam-bahadur,Zee5


In [None]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Movie Names              100 non-null    object
 1   Release Year             100 non-null    object
 2   Movie Genres             100 non-null    object
 3   Movie IMDb Ratings       100 non-null    object
 4   Movie Url                100 non-null    object
 5   Movie Streaming Service  100 non-null    object
dtypes: object(6)
memory usage: 4.8+ KB


In [None]:
# Convert 'Release Year' column to numeric
movie_df['Release Year'] = pd.to_numeric(movie_df['Release Year'], errors='coerce')

# Filter movies released in the last 2 years
current_year = pd.Timestamp.now().year
filtered_movie_df = movie_df[(current_year - movie_df['Release Year'] <= 2)]

# Extract the leading number from strings such as '6.3 (82k)' so rated titles are not coerced to NaN
movie_ratings = pd.to_numeric(
    filtered_movie_df['Movie IMDb Ratings'].astype(str).str.extract(r'(\d+(?:\.\d+)?)', expand=False),
    errors='coerce')

# Filter movies with an IMDb rating of 7 or higher (the mask shares the filtered frame's index)
filtered_movie_df = filtered_movie_df[movie_ratings >= 7]


In [None]:
filtered_movie_df

Unnamed: 0,Movie Names,Release Year,Movie Genres,Movie IMDb Ratings,Movie Url,Movie Streaming Service
40,Warning 2,2024,"Action & Adventure, Drama, Mystery & Thriller",7.5,https://www.justwatch.com/in/movie/warning-2,We try to add new providers constantly but we ...
57,Abraham Ozler,2024,"Crime, Drama, Mystery & Thriller",7.0,https://www.justwatch.com/in/movie/abraham-ozler,We try to add new providers constantly but we ...
67,Aatmapamphlet,2023,"Comedy, Drama, Romance",7.9,https://www.justwatch.com/in/movie/aatmapamphlet,Amazon Prime Video. It is also possible to buy...
97,Naadu,2023,Drama,8.9,https://www.justwatch.com/in/movie/naadu,Amazon Prime Video
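
Only the release year is scraped, so "released in the last 2 years" can only be approximated by comparing years. The sketch below spells out that rule as a plain function, including a lower bound so that a wrongly parsed future year does not pass. The name `is_recent` and its parameters are assumptions for illustration.

```python
from datetime import date

def is_recent(release_year, window_years=2, today=None):
    """True if release_year is within window_years of the current calendar year."""
    if release_year is None:
        return False
    current_year = (today or date.today()).year
    # Comparisons with NaN are False, so missing years are excluded as well
    return 0 <= current_year - release_year <= window_years

reference_day = date(2024, 3, 1)
print([is_recent(y, today=reference_day) for y in (2024, 2022, 2021)])
```

Passing `today` explicitly keeps the rule reproducible when the notebook is rerun in a later year.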


In [None]:
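# Convert 'Movie IMDb Ratings' column to float, keeping only the leading score from strings such as '6.3 (82k)'
movie_df['Movie IMDb Ratings'] = pd.to_numeric(
    movie_df['Movie IMDb Ratings'].astype(str).str.extract(r'(\d+(?:\.\d+)?)', expand=False),
    errors='coerce')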

## **Calculating Mean IMDb Ratings of Movies**

In [None]:
# Calculate the average IMDb rating for the scraped movies
average_imdb_rating = movie_df['Movie IMDb Ratings'].astype(float).mean()
print(f"The average IMDb rating for the scraped movies is {average_imdb_rating:.2f}.")

The average IMDb rating for the scraped movies and TV shows is 7.02.


## **Analyzing Top Genres**

In [None]:
# Identify the top 5 genres with the highest number of movies
top_genres = movie_df['Movie Genres'].str.split(', ').explode().value_counts().head(5)
print("Top 5 genres:\n", top_genres)

Top 5 genres:
 Drama                 72
Action & Adventure    47
Mystery & Thriller    45
Comedy                26
Crime                 19
Name: Movie Genres, dtype: int64
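
The `str.split(', ').explode().value_counts()` chain counts every genre once per title that lists it. The same tally is sketched below with `collections.Counter` on a few sample strings, which makes the counting rule easy to check by hand. `count_genres` and the sample data are illustrative assumptions.

```python
from collections import Counter

def count_genres(genre_strings):
    """Count how many titles list each genre in comma-separated genre strings."""
    counts = Counter()
    for genres in genre_strings:
        # Strip whitespace so ' Drama' and 'Drama' are tallied together
        counts.update(g.strip() for g in genres.split(",") if g.strip())
    return counts

sample = ["Comedy, Drama", "Drama, Mystery & Thriller, Crime", "Crime, Drama"]
print(count_genres(sample).most_common(2))
# → [('Drama', 3), ('Crime', 2)]
```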


## **Finding Predominant Streaming Service**

In [None]:
# Determine the streaming service with the largest number of offerings
# (split on commas and strip whitespace so ' Netflix' and 'Netflix' are counted together)
streaming_service_counts = movie_df['Movie Streaming Service'].str.split(',').explode().str.strip().value_counts()
most_common_streaming_service = streaming_service_counts.idxmax()

In [None]:
print(most_common_streaming_service)

 Google Play Movies


## **Task 2 (b):- Data Filtering & Analysis on TV show Dataframe**

In [None]:
# Remove parentheses and surrounding whitespace from Release Year column
tv_df['Release Year'] = tv_df['Release Year'].str.replace(r'[()]', '', regex=True).str.strip()

# Join the list elements of the TV show Streaming Service column into a single string,
# stored in a new 'Movie Streaming Service' column
tv_df['Movie Streaming Service'] = tv_df['TV show Streaming Service'].apply(lambda x: ', '.join(s.strip() for s in x))


In [None]:
tv_df.head()

Unnamed: 0,TV show Names,Release Year,TV show Genres,TV show IMDb Ratings,TV show Url,TV show Streaming Service,TV show duration,Movie Streaming Service
0,Panchayat,2020,"Comedy, Drama",8.9 (83k),https://www.justwatch.com/in/tv-show/panchayat,[Amazon Prime Video],33min,Amazon Prime Video
1,Game of Thrones,2011,"Science-Fiction, Action & Adventure, Drama, Fa...",9.2 (2m),https://www.justwatch.com/in/tv-show/game-of-t...,[Jio Cinema],58min,Jio Cinema
2,True Detective,2014,"Drama, Mystery & Thriller, Crime",8.9 (643k),https://www.justwatch.com/in/tv-show/true-dete...,[Jio Cinema],1h 1min,Jio Cinema
3,Mirzapur,2018,"Action & Adventure, Drama, Crime, Mystery & Th...",8.5 (81k),https://www.justwatch.com/in/tv-show/mirzapur,[Amazon Prime Video],50min,Amazon Prime Video
4,Solo Leveling,2024,"Fantasy, Animation, Action & Adventure, Scienc...",8.5 (9k),https://www.justwatch.com/in/tv-show/solo-leve...,[Crunchyroll or for free with ads on Crunchyroll],24min,Crunchyroll or for free with ads on Crunchyroll


## **Calculating Mean IMDb Ratings of TV Shows**

In [None]:
tv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   TV show Names              100 non-null    object
 1   Release Year               100 non-null    object
 2   TV show Genres             100 non-null    object
 3   TV show IMDb Ratings       100 non-null    object
 4   TV show Url                100 non-null    object
 5   TV show Streaming Service  100 non-null    object
 6   TV show duration           100 non-null    object
 7   Movie Streaming Service    100 non-null    object
dtypes: object(8)
memory usage: 6.4+ KB


In [None]:
# Convert 'Release Year' column to numeric
tv_df['Release Year'] = pd.to_numeric(tv_df['Release Year'], errors='coerce')

# Filter TV shows released in the last 2 years
current_year = pd.Timestamp.now().year
filtered_tv_df = tv_df[(current_year - tv_df['Release Year'] <= 2)]

# Extract the leading number from strings such as '8.9 (83k)' so rated titles are not coerced to NaN
tv_ratings = pd.to_numeric(
    filtered_tv_df['TV show IMDb Ratings'].astype(str).str.extract(r'(\d+(?:\.\d+)?)', expand=False),
    errors='coerce')

# Filter TV shows with an IMDb rating of 7 or higher, using the TV show ratings rather than the movie ratings
filtered_tv_df = filtered_tv_df[tv_ratings >= 7]


In [None]:
filtered_tv_df

Unnamed: 0,TV show Names,Release Year,TV show Genres,TV show IMDb Ratings,TV show Url,TV show Streaming Service,TV show duration,Movie Streaming Service
97,The Curse,2023,"Comedy, Drama, Mystery & Thriller",7.2 (9k),https://www.justwatch.com/in/tv-show/the-curse,"[Lionsgate Play, Lionsgate Play Apple TV Chan...",53min,"Lionsgate Play, Lionsgate Play Apple TV Chann..."


In [None]:
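# Convert 'TV show IMDb Ratings' to float, keeping only the leading number
# of values such as "8.5 (81k)"
tv_df['TV show IMDb Ratings'] = pd.to_numeric(
    tv_df['TV show IMDb Ratings'].astype(str).str.extract(r'(\d+(?:\.\d+)?)', expand=False),
    errors='coerce'
)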

In [None]:
# Calculate the average IMDb rating for the scraped TV shows
average_imdb_rating = tv_df['TV show IMDb Ratings'].astype(float).mean()
print(f"The average IMDb rating for the scraped TV shows is {average_imdb_rating:.2f}.")

The average IMDb rating for the scraped TV shows is 7.37.
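
The rating strings scraped above carry two pieces of information, the score and an abbreviated vote count (for example `8.5 (81k)`). As a rough sketch, assuming that format holds for every row, both can be separated with one regular expression before averaging. The sample values and the `k`/`m` suffix mapping are assumptions made for illustration.

```python
import pandas as pd

# Illustrative values in the format seen in the scraped data; None models a missing rating.
raw = pd.Series(["8.9 (643k)", "8.5 (81k)", "8.5 (9k)", None])

# Named groups: the score, then an optional "(<number><suffix>)" vote count.
pattern = r"^\s*(?P<rating>\d+(?:\.\d+)?)\s*(?:\((?P<votes>[\d.]+)(?P<suffix>[km]?)\))?"
parts = raw.str.extract(pattern)

rating = pd.to_numeric(parts["rating"], errors="coerce")

# Assumed meaning of the suffixes: k = thousand, m = million.
multiplier = parts["suffix"].map({"k": 1_000, "m": 1_000_000, "": 1})
votes = pd.to_numeric(parts["votes"], errors="coerce") * multiplier

# mean() skips missing values by default.
print(f"Average rating: {rating.mean():.2f}")
print(votes.tolist())
```

Keeping the vote count as its own numeric column makes it possible to weight or filter by popularity later, which a plain float conversion of the whole string would not allow.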


## **Analyzing Top Genres**

In [None]:
# Identify the top 5 genres by number of scraped TV shows
top_genres = tv_df['TV show Genres'].str.split(', ').explode().value_counts().head(5)
print("Top 5 genres:\n", top_genres)

Top 5 genres:
 Drama                 83
Mystery & Thriller    40
Action & Adventure    39
Crime                 32
Comedy                30
Name: TV show Genres, dtype: int64
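
The task asks for the top genres across movies and TV shows together, while the cell above counts TV shows only. A minimal sketch of a combined count follows. It uses two small illustrative Series in place of the real genre columns, and the variable names are assumptions.

```python
import pandas as pd

# Stand-ins for the movie and TV show genre columns (illustrative values only).
movie_genres = pd.Series(["Drama, Crime", "Comedy", "Action & Adventure, Drama"])
tv_genres = pd.Series(["Drama, Mystery & Thriller, Crime", "Fantasy, Animation"])

# Stack both columns, split on commas, and strip stray whitespace
# so that " Drama" and "Drama" are counted as the same genre.
all_genres = (
    pd.concat([movie_genres, tv_genres], ignore_index=True)
    .str.split(",")
    .explode()
    .str.strip()
)

# Drop empty fragments, then take the five most frequent genres.
top = all_genres[all_genres != ""].value_counts().head(5)
print(top)
```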


## **Finding Predominant Streaming Service**

In [None]:
# Determine the streaming service with the largest number of offerings.
# Note: in tv_df this column is named 'Movie Streaming Service' (see tv_df.info() above).
streaming_service_counts = tv_df['Movie Streaming Service'].str.split(', ').explode().value_counts()
most_common_streaming_service = streaming_service_counts.idxmax()

In [None]:
print(most_common_streaming_service)

 Google Play Movies
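
The leading space in the printed result suggests that some service names carry stray whitespace, in which case the same service could be counted under two different labels. The sketch below strips whitespace before counting. The sample values are made up for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Illustrative values with inconsistent spacing around the separators.
services = pd.Series([
    "Netflix, Google Play Movies",
    " Google Play Movies",
    "Amazon Prime Video,Netflix",
    "Netflix",
])

# Split on the bare comma, then strip each name so that
# " Google Play Movies" and "Google Play Movies" are merged.
counts = services.str.split(",").explode().str.strip().value_counts()

print(counts.idxmax())
print(counts.to_dict())
```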


## **Task 3 :- Data Export**

In [None]:
# Save the full movie and TV show DataFrames as CSV files

movie_df.to_csv('movie_data.csv', index=False)
tv_df.to_csv('tv_data.csv', index=False)


In [None]:
# Save the filtered movie and TV show DataFrames as CSV files

filtered_tv_df.to_csv('filtered_tv_data.csv', index=False)
filtered_movie_df.to_csv('filtered_movie_data.csv', index=False)
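
A quick way to confirm that an export worked is to read the file back and compare it with the original DataFrame. This is a minimal sketch under stated assumptions: it writes a small illustrative DataFrame to a temporary directory rather than touching the real output files.

```python
import os
import tempfile

import pandas as pd

# Small illustrative frame standing in for tv_df.
df = pd.DataFrame({
    "TV show Names": ["Mirzapur", "Solo Leveling"],
    "Release Year": [2018, 2024],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tv_data.csv")
    # index=False keeps the row index out of the file, as in the cells above.
    df.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Compare shape, column names, and cell values after the round trip.
round_trip_ok = (
    restored.shape == df.shape
    and list(restored.columns) == list(df.columns)
    and restored.values.tolist() == df.values.tolist()
)
print(round_trip_ok)
```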

In [None]:
from google.colab import files

# Download the exported movie CSV file from the Colab runtime
files.download('movie_data.csv')




