# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
      
   ```   

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access set to anyone with the link.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code shouldn't have any errors and should run end to end in one go.

- Before the conclusion, include your dataset Drive link in the notebook.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.








# **Start The Project**

In [10]:
# Install the necessary libraries
!pip install bs4
!pip install requests



## **Task 1: Web Scraping**

In [11]:
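# Import the necessary libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
from urllib.parse import urljoin
import matplotlib.pyplot as plt   # used by the word-cloud cells in Task 2
from wordcloud import WordCloud   # used by the word-cloud cells in Task 2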

## **Scraping Movies Data**

In [12]:
# Specifying the URL from which movies related data will be fetched
url='https://www.justwatch.com/in/movies?release_year_from=2000'

# Sending an HTTP GET request to the URL
page=requests.get(url)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(page.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
 403
</title>
403 Forbidden
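
The output above shows that the listing request was rejected with `403 Forbidden`, so every later cell parses an empty page. Below is a minimal sketch of a fetch helper that sends a browser-like `User-Agent` and returns `None` for anything other than a 200 response. The helper name `fetch_html`, the `get` parameter, and the header value are assumptions made for illustration, and a custom header is not guaranteed to get past the site's bot protection.

```python
# Assumed header value; any realistic browser User-Agent string could be used.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}


def fetch_html(url, get=None, timeout=10):
    """Return the page HTML, or None if the request fails or is not a 200."""
    if get is None:
        import requests  # imported lazily so the helper can be exercised offline

        get = requests.get
    try:
        response = get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    except OSError:
        # requests' RequestException derives from IOError/OSError,
        # so connection errors and timeouts end up here.
        return None
    if response.status_code != 200:
        return None
    return response.text
```

A caller would check for `None` before building the soup, for example `html = fetch_html(url)` followed by `if html is None: ...`, instead of parsing the error page.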



### Scraping Movie Title

In [None]:
# Movie title
movie_title_list=[]# List to store all movie title

# Extracting all movie links from <a> tags and storing them in movie_titles
movie_titles = soup.find_all('a',class_='title-list-grid__item--link',attrs={'href':True})

# Extracting each movie title from movie_titles and storing in movie_title_list
for movie_title in movie_titles:

    # Extract the 'href' attribute value, which contains the movie title
    data_id_value = movie_title['href']

    # Removing the '/in/movie/' prefix to get the clean movie title
    data_id_value = data_id_value.replace("/in/movie/","")

    # Converting the movie title to uppercase and appending to the list
    movie_title_list.append(data_id_value.upper())
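
The cell above stores the URL slug in upper case, which still contains hyphens. As a sketch, the slug could instead be turned into a readable title as shown below. The function name `slug_to_title` is hypothetical, and the result only approximates the title displayed on the page (punctuation and special characters are not recoverable from a slug).

```python
def slug_to_title(href, prefix="/in/movie/"):
    """Turn an href such as '/in/movie/some-title' into 'Some Title'."""
    # Drop the listing prefix if present, then any stray slashes.
    slug = href[len(prefix):] if href.startswith(prefix) else href
    slug = slug.strip("/")
    # Hyphens separate the words of the slug.
    return " ".join(word.capitalize() for word in slug.split("-") if word)


print(slug_to_title("/in/movie/the-dark-knight"))  # The Dark Knight
```

For TV shows the same helper can be called with `prefix="/in/tv-show/"`.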

## **Fetching Movie URLs**

In [None]:
# Write Your Code here
Movie_url  =[]
movies_link = soup.find_all("a",class_="title-list-grid__item--link")
for link in movies_link:
    movie_url = "https://www.justwatch.com"+link["href"]
    Movie_url.append(movie_url)

In [None]:
Movie_url

[]
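
The notebook imports `urljoin` but builds URLs by string concatenation. A small sketch of the `urljoin` variant follows; `BASE_URL` and `to_absolute` are illustrative names, and the example path is made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.justwatch.com"


def to_absolute(href):
    """Resolve a relative href from the listing page against the site root."""
    return urljoin(BASE_URL, href)


print(to_absolute("/in/movie/example-title"))
# https://www.justwatch.com/in/movie/example-title
```

Unlike plain concatenation, `urljoin` leaves an href that is already absolute unchanged.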

## **Scraping Release Year**

In [None]:
# Movie release year
movie_release_year_list = []# List to store all movie release year

# For every movie title in movie_title_list, find its release year
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie

    # Sending an HTTP GET request to the movie page
    response_ry = requests.get(absolute_url)

    # Parsing HTML content with Beautiful Soup
    soup_ry = BeautifulSoup(response_ry.text,'html.parser')

    # The release year is shown in parentheses inside a muted span
    year_span = soup_ry.find('span',class_='text-muted')
    if year_span:
        movie_release_year = year_span.text.strip().replace("(","").replace(")","")
        movie_release_year_list.append(movie_release_year)
    else:
        # Keep the list aligned with movie_title_list when the span is missing
        movie_release_year_list.append("Release Year Not Listed")
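
The release year is stored as text with the parentheses stripped. A sketch of a stricter parser that returns an integer year, or `None` when no four-digit year is present, is shown below; `parse_year` is a hypothetical helper, and the assumption is that the year appears as four digits starting with 19 or 20.

```python
import re


def parse_year(text):
    """Return the first four-digit year (19xx or 20xx) found in text, else None."""
    match = re.search(r"(?:19|20)\d{2}", text or "")
    return int(match.group()) if match else None


print(parse_year("(2023)"))  # 2023
```

Storing integers here makes the later "released in the last 2 years" filter a plain numeric comparison.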

## **Scraping Genres**

In [None]:
# Movie genre

movie_genre_list = []# List to store all movie genre

# For every movie title present in movies_title_list , Finding their genre
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response_g = requests.get(absolute_url)
    soup = BeautifulSoup(response_g.text,'html.parser')

    # Selecting only those h3 whose heading is genres
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Genres')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            movie_genre_list.append(div_element.text.strip())
        else:
            movie_genre_list.append("Genre Not Listed")
    else:
        movie_genre_list.append("Genre Not Listed")
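
The genre, rating, runtime, age-rating, and production-country cells all repeat the same lookup: find the `h3` subheading, then read the sibling value `div`. A sketch of one shared helper is given below. The name `get_detail_value` is hypothetical, the class names are the ones used in this notebook, and the sample HTML is made up for the demonstration.

```python
from bs4 import BeautifulSoup


def get_detail_value(soup, heading, default="Not Listed"):
    """Return the text of the value div that follows the given subheading."""
    for h3 in soup.find_all("h3", class_="detail-infos__subheading"):
        # Compare stripped, case-insensitive text so ' Production country '
        # and 'Production country' both match.
        if h3.get_text(strip=True).casefold() == heading.casefold():
            value = h3.find_next_sibling("div", class_="detail-infos__value")
            if value is not None:
                return value.get_text(" ", strip=True)
    return default


# Made-up markup that mimics the structure the notebook expects.
sample_html = (
    '<div><h3 class="detail-infos__subheading">Genres</h3>'
    '<div class="detail-infos__value">Drama, Crime</div></div>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(get_detail_value(sample_soup, "Genres"))  # Drama, Crime
```

With such a helper each field cell reduces to one call per title, for example `get_detail_value(soup, "Runtime", default="No Runtime/Duration mentioned")`.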

## **Scraping IMDb Rating**

In [None]:
movie_imdb_list = []# List to store all movie imdb rating

# For every movie title present in movies_title_list , Finding their Imdb Rating
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response_g = requests.get(absolute_url)
    soup = BeautifulSoup(response_g.text, 'html.parser')

    # Selecting only those h3 whose heading is Rating
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Rating')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            inside_div = div_element.find_all('div', class_='jw-scoring-listing__rating')

            # Check if inside_div is non-empty
            if inside_div:
                inside_div_last = inside_div[-1] # take the last rating div; its last span contains the IMDb rating

                # Check if inside_div_last is non-empty
                if inside_div_last:
                    span_all = inside_div_last.find_all('span')

                    # Check if span_all is non-empty
                    if span_all:
                        span_last = span_all[-1] # Here we are extracting rating from the last span(span_last) inside last div(inside_div_last) of main div_element(div_element)
                        movie_imdb_list.append(span_last.text.strip())
                    else:
                        movie_imdb_list.append("Imdb Rating Not Listed.")
                else:
                    movie_imdb_list.append("Imdb Rating Not Listed.")
            else:
                movie_imdb_list.append("Imdb Rating Not Listed.")
        else:
            movie_imdb_list.append("Imdb Rating Not Listed.")
    else:
        movie_imdb_list.append("Imdb Rating Not Listed.")
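
The scraped rating is kept as text and only converted to a number during filtering. A sketch of converting it at scrape time follows. `parse_rating` is a hypothetical helper, and the sample string `"7.8 (120k)"` is an assumed format (rating followed by a vote count), not something confirmed by the page.

```python
import re


def parse_rating(text):
    """Return the first number in text as a float on the 0-10 scale, else None."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    if match is None:
        return None
    value = float(match.group())
    # Anything outside the IMDb scale is treated as a parsing mistake.
    return value if 0 <= value <= 10 else None


print(parse_rating("7.8 (120k)"))  # 7.8
```

Placeholder strings such as "Imdb Rating Not Listed." contain no digits, so they map to `None`, which pandas treats as a missing value.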

## **Scraping Runtime/Duration**

In [None]:
movie_runtime_list=[]# List to store all movie runtime/duration

# For every movie title present in movies_title_list , Finding their Runtime/Duration
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response_g = requests.get(absolute_url)
    soup = BeautifulSoup(response_g.text,'html.parser')

    # Selecting only those h3 whose heading is Runtime
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Runtime')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            movie_runtime_list.append(div_element.text.strip())
        else:
            movie_runtime_list.append("No Runtime/Duration mentioned")
    else:
        movie_runtime_list.append("No Runtime/Duration mentioned")

## **Scraping Age Rating**

In [None]:
movie_age_rating_list = []# List to store all movie age rating

# For every movie title present in movies_title_list , Finding their Age Rating
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response_g = requests.get(absolute_url)
    soup = BeautifulSoup(response_g.text,'html.parser')

    # Selecting only those h3 whose heading is Age rating
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Age rating')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            movie_age_rating_list.append(div_element.text.strip())
        else:
            movie_age_rating_list.append("Age Rating Not Listed.")
    else:
        movie_age_rating_list.append("Age Rating Not Listed.")


## **Fetching Production Countries Details**

In [None]:
movie_production_country_list=[]# List to store all movie production country

# For every movie title present in movies_title_list , Finding their Production country
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Selecting only those h3 whose subheading inside detail-infos is 'Production country'
    h3_element = soup.find('h3', class_='detail-infos__subheading', string=' Production country ')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            movie_production_country_list.append(div_element.text.strip())
        else:
            movie_production_country_list.append("Production Country Not Listed")
    else:
        movie_production_country_list.append("Production Country Not Listed")
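
Each field cell above downloads every detail page again, so a single title is requested once per field. A sketch of a loop that fetches each page once and applies all extractors to it is shown below. The names `scrape_titles`, `fetch`, and `extractors` are hypothetical; `fetch` stands for any callable that returns a parsed page or `None`.

```python
def scrape_titles(urls, fetch, extractors):
    """Fetch each URL once and build one record (dict) per title.

    fetch:      callable taking a URL, returning a parsed page or None
    extractors: dict mapping a column name to a callable taking the page
    """
    records = []
    for url in urls:
        page = fetch(url)
        record = {"Url": url}
        for field, extract in extractors.items():
            try:
                record[field] = extract(page) if page is not None else None
            except (AttributeError, KeyError, IndexError, ValueError):
                # A missing element should not abort the whole run.
                record[field] = None
        records.append(record)
    return records
```

The resulting list of dicts can be passed directly to `pd.DataFrame(records)`, which also removes the need to keep several parallel lists the same length.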

## **Fetching Streaming Service Details**

In [None]:
movie_streaming_list=[]# List to store all movie streaming platform

# For every movie title present in movies_title_list , Finding their Streaming Platform
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Finding the outer div element with the class "buybox-row stream"
    outer_div = soup.find('div', class_='buybox-row stream')

    if outer_div:
        # Finding the nested div with class "buybox-row__offers" inside the outer div
        inner_div = outer_div.find('div', class_='buybox-row__offers')

        if inner_div:
            # Find the picture element within the nested div
            picture_element = inner_div.find('picture')

            if picture_element:
                # Extract the alt attribute from the img element inside the picture which contains streaming platform name
                img_element = picture_element.find('img')
                if img_element:
                    alt_text = img_element['alt']
                    movie_streaming_list.append(alt_text)
                else:
                    movie_streaming_list.append("Not Available for Streaming.")
            else:
                movie_streaming_list.append("Not Available for Streaming.")
        else:
            movie_streaming_list.append("Not Available for Streaming.")
    else:
        movie_streaming_list.append("Not Available for Streaming.")
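
The cell above records only the first provider in the streaming row, while the task asks for all available services. A sketch that collects every provider name is given below. `get_streaming_services` is a hypothetical helper, the class name is the one used in this notebook, and the sample markup is made up.

```python
from bs4 import BeautifulSoup


def get_streaming_services(soup):
    """Return the unique provider names in the streaming row, in page order."""
    row = soup.find("div", class_="buybox-row stream")
    if row is None:
        return []
    names = [img.get("alt", "").strip() for img in row.find_all("img")]
    # dict.fromkeys removes duplicates while keeping the first-seen order.
    return list(dict.fromkeys(name for name in names if name))


# Made-up markup that mimics the structure the notebook expects.
sample_html = (
    '<div class="buybox-row stream"><div class="buybox-row__offers">'
    '<picture><img alt="Netflix"/></picture>'
    '<picture><img alt="Amazon Prime Video"/></picture>'
    '<picture><img alt="Netflix"/></picture>'
    "</div></div>"
)
print(get_streaming_services(BeautifulSoup(sample_html, "html.parser")))
# ['Netflix', 'Amazon Prime Video']
```

An empty list then plays the role of the "Not Available for Streaming." placeholder.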

## **Now Creating Movies DataFrame**

In [None]:
data_movies = {
    'Movie Title':movie_title_list,
    'IMDB Rating':movie_imdb_list,
    'Release Year':movie_release_year_list,
    'Genre':movie_genre_list,
    'Runtime/Duration':movie_runtime_list,
    'Age Rating':movie_age_rating_list,
    'Production Country':movie_production_country_list,
    'Streaming Platform':movie_streaming_list,
    'Url':Movie_url  # Movie_url is the list built in the movie URL cell above
}

df_movies = pd.DataFrame(data_movies)
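
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one scraping loop fails part-way. A sketch of a check that reports which column is short before the DataFrame is built follows; `check_column_lengths` is a hypothetical helper.

```python
def check_column_lengths(columns):
    """Raise ValueError naming each column length if they differ; else return the length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


print(check_column_lengths({"Movie Title": ["A", "B"], "Genre": ["Drama", "Crime"]}))  # 2
```

Calling `check_column_lengths(data_movies)` just before `pd.DataFrame(data_movies)` turns a generic pandas error into a message that names the offending list.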

## **Scraping TV Show Data**

In [None]:
# URL from which tv shows data is fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'

# Sending an HTTP GET request to the URL
response = requests.get(tv_url)

# Parsing HTML content with Beautiful Soup
soup_tv = BeautifulSoup(response.text,'html.parser')
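
The assignment notes ask for proper error handling, and a single failed request here leaves every TV show list empty. A sketch of a retry wrapper with exponential backoff is shown below. `with_retries` and its parameters are hypothetical names; it retries only on `OSError`, the base class of the exceptions raised by `requests`.

```python
import time


def with_retries(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except OSError as error:
            last_error = error
            if attempt < attempts - 1:
                # Waits base_delay, then 2x, then 4x, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A request would be wrapped as `with_retries(lambda: requests.get(tv_url, timeout=10))`; the pause between attempts also keeps the scraper from hammering the site.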

## **Fetching TV Show Title Details**

In [None]:
tv_show_title_list=[] # List to store all tv show title

# Extracting all tv show titles and storing them in tv_show_titles
tv_show_titles = soup_tv.find_all('a',class_='title-list-grid__item--link',attrs={'href':True})

# Extracting each tv show title from tv_show_titles and storing in tv_show_title_list
for tv_show_title in tv_show_titles:

    # Extract the 'href' attribute value, which contains the tv_show title
    data_id_value = tv_show_title['href']

    # Removing the '/in/tv-show/' prefix to get the clean tv_show title
    data_id_value = data_id_value.replace("/in/tv-show/","")

    # Converting the tv_show title to uppercase and appending to the list
    tv_show_title_list.append(data_id_value.upper())

### Fetching TV Show URL Details


In [None]:
# Write Your Code here

# Tv Shows url
tv_show_url_list=[] # List to store all tv show urls

# For every tv show title present in tv_show_title_list , Finding their url
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show

    tv_show_url_list.append(absolute_url)

## **Fetching Release Year**

In [None]:
# Write Your Code here

# TV show release year
tv_show_release_year_list = [] # List to store all tv show Release Year

# For every tv show title present in tv_show_title_list , Finding their release year
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show

    # Sending an HTTP GET request to the url
    response = requests.get(absolute_url)

    # Parsing HTML content with Beautiful Soup
    soup = BeautifulSoup(response.text,'html.parser')
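    # The release year is shown in parentheses inside a muted span
    year_span = soup.find('span',class_='text-muted')
    if year_span:
        tv_show_release_year = year_span.text.strip().replace("(","").replace(")","")
        tv_show_release_year_list.append(tv_show_release_year)
    else:
        # Keep the list aligned with tv_show_title_list when the span is missing
        tv_show_release_year_list.append("Release Year Not Listed")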


## **Fetching TV Show Genre Details**

In [None]:
# Write Your Code here
# Tv Shows Genre

tv_show_genre_list = [] # List to store all tv show Genres

# For every tv show title present in tv_show_title_list , Finding their Genre
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Selecting only those h3 whose heading is genres
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Genres')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            tv_show_genre_list.append(div_element.text.strip())
        else:
            tv_show_genre_list.append("Genre Not Listed")
    else:
        tv_show_genre_list.append("Genre Not Listed")

## **Fetching IMDB Rating Details**

In [None]:

tv_show_imdb_list = [] # List to store all tv show Imdb Rating

# For every tv show title present in tv_show_title_list , Finding their Imdb Rating
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Selecting only those h3 whose heading is Rating
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Rating')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            inside_div = div_element.find_all('div', class_='jw-scoring-listing__rating')

            # Check if inside_div is non-empty
            if inside_div:
                inside_div_last = inside_div[-1] # take the last rating div; its last span contains the IMDb rating

                # Check if inside_div_last is non-empty
                if inside_div_last:
                    span_all = inside_div_last.find_all('span')

                    # Check if span_all is non-empty
                    if span_all:
                        span_last = span_all[-1] # Here we are extracting rating from the last span(span_last) inside last div(inside_div_last) of main div_element(div_element)
                        tv_show_imdb_list.append(span_last.text.strip())
                    else:
                        tv_show_imdb_list.append("Imdb Rating Not Listed.")
                else:
                    tv_show_imdb_list.append("Imdb Rating Not Listed.")
            else:
                tv_show_imdb_list.append("Imdb Rating Not Listed.")
        else:
            tv_show_imdb_list.append("Imdb Rating Not Listed.")
    else:
        tv_show_imdb_list.append("Imdb Rating Not Listed.")

## **Fetching Age Rating Details**

In [None]:
tv_show_age_rating_list = [] # List to store all tv show Age Ratings

# For every tv show title present in tv_show_title_list , Finding their Age Rating
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Selecting only those h3 whose heading is Age rating
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Age rating')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            tv_show_age_rating_list.append(div_element.text.strip())
        else:
            tv_show_age_rating_list.append("Age Rating Not Listed.")
    else:
        tv_show_age_rating_list.append("Age Rating Not Listed.")

## **Fetching Production Country details**

In [None]:
# Write Your Code here
tv_show_production_country_list=[] # List to store all tv show Production Countries

# For every tv show title present in tv_show_title_list , Finding their Production country
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Selecting only those h3 whose subheading inside detail-infos is 'Production country'
    h3_element = soup.find('h3', class_='detail-infos__subheading', string=' Production country ')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            tv_show_production_country_list.append(div_element.text.strip())
        else:
            tv_show_production_country_list.append("Production Country Not Listed")
    else:
        tv_show_production_country_list.append("Production Country Not Listed")


## **Fetching Streaming Service details**

In [None]:
# Write Your Code here
tv_show_streaming_list=[] # List to store all tv show Streaming Platforms

# For every tv show title present in tv_show_title_list , Finding their Streaming Platform
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Finding the outer div element with the class "buybox-row stream"
    outer_div = soup.find('div', class_='buybox-row stream')

    if outer_div:
        # Finding the nested div with class "buybox-row__offers" inside the outer div
        inner_div = outer_div.find('div', class_='buybox-row__offers')

        if inner_div:
            # Find the picture element within the nested div
            picture_element = inner_div.find('picture')

            if picture_element:
                # Extract the alt attribute from the img element inside the picture which contains streaming platform name
                img_element = picture_element.find('img')
                if img_element:
                    alt_text = img_element['alt']
                    tv_show_streaming_list.append(alt_text)
                else:
                    tv_show_streaming_list.append("Not Available for Streaming.")
            else:
                tv_show_streaming_list.append("Not Available for Streaming.")
        else:
            tv_show_streaming_list.append("Not Available for Streaming.")
    else:
        tv_show_streaming_list.append("Not Available for Streaming.")

## **Fetching Duration Details**

In [None]:
# Write Your Code here

tv_show_runtime_list=[] # List to store all tv show Runtimes

# For every tv show title present in tv_show_title_list , Finding their Runtime/Duration
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Selecting only those h3 whose heading is Runtime
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Runtime')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            tv_show_runtime_list.append(div_element.text.strip())
        else:
            tv_show_runtime_list.append("No Runtime/Duration mentioned")
    else:
        tv_show_runtime_list.append("No Runtime/Duration mentioned")

## **Creating TV Show DataFrame**

In [None]:
# Creating the TV shows DataFrame from the lists collected above.
# tv_show_production_country_list is already filled by the production-country cell,
# so it must not be re-initialised here: an empty list would discard the scraped
# values and would not match the length of the other columns.

data_tv_shows = {
    'Tv_Show Title':tv_show_title_list,
    'IMDB Rating':tv_show_imdb_list,
    'Release Year':tv_show_release_year_list,
    'Genre':tv_show_genre_list,
    'Runtime/Duration':tv_show_runtime_list,
    'Age Rating':tv_show_age_rating_list,
    'Production Country':tv_show_production_country_list,
    'Streaming Platform':tv_show_streaming_list,
    'Url':tv_show_url_list
}

df_tv_shows = pd.DataFrame(data_tv_shows)

## **Task 2: Data Filtering & Analysis**

In [None]:
# Write Your Code here
from datetime import datetime, timedelta

# Get the current date
current_date = datetime.now()

# Calculate the date 2 years ago from the current date
two_years_ago = current_date - timedelta(days=365 * 2)

def filter_df(df, release_year_col, imdb_rating_col, years_ago, current_date):
    # Note: `years_ago` is the cutoff date (e.g. two_years_ago), not a number of years

    # Convert 'Release Year' to datetime format (a bare year string such as "2023" becomes 2023-01-01)
    df[release_year_col] = pd.to_datetime(df[release_year_col], errors='coerce')

    # Keep only entries released between the cutoff date and the current date
    filtered_df = df[(df[release_year_col] >= years_ago) & (df[release_year_col] <= current_date)].copy()

    # Converting 'IMDB Rating' column to a string so that, in the next step, we can convert it to numeric values
    filtered_df.loc[:, imdb_rating_col] = filtered_df[imdb_rating_col].astype(str)

    # Extract numeric part and convert to numeric
    filtered_df[imdb_rating_col] = pd.to_numeric(filtered_df[imdb_rating_col].str.extract(r'([\d.]+)', expand=False), errors='coerce')

    # Filter the DataFrame to include only entries whose IMDb Rating >= 7
    filtered_df = filtered_df[filtered_df[imdb_rating_col] >= 7]

    return filtered_df

# Filtering Movies
filtered_df_movies = filter_df(df_movies, 'Release Year', 'IMDB Rating', two_years_ago, current_date)

# Filtering TV Shows
filtered_df_tv_shows = filter_df(df_tv_shows, 'Release Year', 'IMDB Rating', two_years_ago, current_date)
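
Because only the release year is scraped, the date comparison above places every title on 1 January of its year. A sketch of an alternative that compares integer years directly is shown below. `filter_recent_high_rated` is a hypothetical function, it assumes the columns still hold the scraped text, and "last 2 years" is interpreted here as `current_year - 2` or later, which is one possible reading of the task.

```python
import pandas as pd


def filter_recent_high_rated(df, current_year, years=2, min_rating=7.0):
    """Keep rows released within `years` of current_year and rated >= min_rating."""
    # Pull a four-digit year and the first number of the rating out of the raw text.
    year = pd.to_numeric(
        df["Release Year"].astype(str).str.extract(r"(\d{4})", expand=False),
        errors="coerce",
    )
    rating = pd.to_numeric(
        df["IMDB Rating"].astype(str).str.extract(r"(\d+(?:\.\d+)?)", expand=False),
        errors="coerce",
    )
    # Comparisons with NaN are False, so unparseable rows are dropped.
    mask = (year >= current_year - years) & (year <= current_year) & (rating >= min_rating)
    return df.loc[mask].assign(
        **{"Release Year": year[mask].astype(int), "IMDB Rating": rating[mask]}
    )
```

The function returns a new DataFrame and leaves the input untouched, so the unfiltered data can still be exported as scraped.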

## **Calculating Mean IMDB Ratings for both Movies and Tv Shows**

In [None]:
# Calculating Movies mean IMDb rating
movie_mean_imdb = filtered_df_movies['IMDB Rating'].mean()
movie_mean_imdb_rounded = round(movie_mean_imdb, 2)
print("Mean IMDb Rating for Movies is:", movie_mean_imdb_rounded)

# Calculating Tv Shows mean IMDb rating
tv_mean_imdb = filtered_df_tv_shows['IMDB Rating'].mean()
tv_mean_imdb_rounded = round(tv_mean_imdb, 2)
print("Mean IMDb Rating for Tv Shows is:", tv_mean_imdb_rounded)

Mean IMDb Rating for Movies is: nan
Mean IMDb Rating for Tv Shows is: nan


In [None]:
# Write Your Code here
def get_top_5_imdb(df):

  # Convert 'IMDB Rating' column to string
  df['IMDB Rating'] = df['IMDB Rating'].astype(str)

  # Extract only the IMDb rating value (a raw string avoids invalid-escape warnings)
  df['IMDB Rating'] = df['IMDB Rating'].str.extract(r'(\d+\.\d+)', expand=False)

  # Convert the 'IMDB Rating' column to numeric
  df['IMDB Rating'] = pd.to_numeric(df['IMDB Rating'], errors='coerce')

  # Select the top 5 movies/Tv Shows based on IMDb rating
  top_5 = df.nlargest(5, 'IMDB Rating')

  return top_5

In [None]:
# Top 5 movies with the highest IMDb rating

top_5_movies = get_top_5_imdb(filtered_df_movies)
print(top_5_movies.loc[:, ['Movie Title', 'IMDB Rating']])

Empty DataFrame
Columns: [Movie Title, IMDB Rating]
Index: []


In [None]:
# Top 5 Highest IMDB Rating Tv Shows

top_5_tv_shows = get_top_5_imdb(filtered_df_tv_shows)
print(top_5_tv_shows.loc[:, ['Tv_Show Title', 'IMDB Rating']])

Empty DataFrame
Columns: [Tv_Show Title, IMDB Rating]
Index: []


### Analyzing Top Genres

In [None]:
print(filtered_df_movies)


Empty DataFrame
Columns: [Movie Title, IMDB Rating, Release Year, Genre, Runtime/Duration, Age Rating, Production Country, Streaming Platform, Url]
Index: []


In [None]:
print(filtered_df_movies['Genre'])

Series([], Name: Genre, dtype: float64)


In [None]:
all_genres = ' '.join(filtered_df_movies['Genre'])
print(all_genres)




In [None]:
# Check the filtered DataFrame
if filtered_df_movies.empty:
    print("The filtered DataFrame is empty.")
else:
    # Check the Genre column
    if 'Genre' not in filtered_df_movies.columns:
        print("The 'Genre' column is missing from the DataFrame.")
    else:
        all_genres = ' '.join(filtered_df_movies['Genre'])
        if not all_genres:
            print("The 'Genre' column contains no data.")
        else:
            wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_genres)
            plt.figure(figsize=(10, 5))
            plt.imshow(wordcloud, interpolation='bilinear')
            plt.axis('off')
            plt.show()


The filtered DataFrame is empty.


In [None]:
print(df_movies.head())


Empty DataFrame
Columns: [Movie Title, IMDB Rating, Release Year, Genre, Runtime/Duration, Age Rating, Production Country, Streaming Platform, Url]
Index: []


In [None]:
df_movies['Genre'] = df_movies['Genre'].astype(str)


In [None]:
# Example filtering criteria (modify according to your use case)
# filtered_df_movies = df_movies[df_movies['Genre'].str.contains('Action')]
print(df_movies[df_movies['Genre'].str.contains('Action')])


Empty DataFrame
Columns: [Movie Title, IMDB Rating, Release Year, Genre, Runtime/Duration, Age Rating, Production Country, Streaming Platform, Url]
Index: []


In [None]:
# Top Movies Genres
if 'Genre' in filtered_df_movies.columns:
    all_genres = ' '.join(filtered_df_movies['Genre'].dropna()) # Handle potential missing values in 'Genre' column
    if all_genres: # Check if all_genres is not empty
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_genres)
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()
    else:
        print("The 'Genre' column contains no data.")
else:
    print("The 'Genre' column is missing from the DataFrame.")

The 'Genre' column contains no data.
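
A word cloud gives a visual impression but not the ranked list of the top 5 genres that the task asks for. A sketch of counting genres with pandas follows. `top_genres` is a hypothetical function, and it assumes that a title's genres are stored as one comma-separated string.

```python
import pandas as pd


def top_genres(df, n=5, column="Genre", placeholder="Genre Not Listed"):
    """Return the n most frequent genres as a Series of counts."""
    genres = (
        df[column]
        .dropna()
        .astype(str)
        .str.split(",")   # "Drama, Crime" -> ["Drama", " Crime"]
        .explode()        # one row per genre
        .str.strip()
    )
    # Ignore empty entries and the placeholder used when no genre was found.
    genres = genres[(genres != "") & (genres != placeholder)]
    return genres.value_counts().head(n)
```

Calling `top_genres(filtered_df_movies)` would return the genre names as the index and their counts as the values.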


## **Finding Predominant Streaming Service**

In [None]:
# Write Your Code here

def visualize_streaming_distribution_wordcloud(df):
    # Filter streaming information available
    streaming_platforms = df[df['Streaming Platform'] != 'Not Available for Streaming.']['Streaming Platform']

    # Create a string of streaming platforms
    streaming_text = ' '.join(streaming_platforms)

    # Generate the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(streaming_text)

    # Display the word cloud
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Streaming Service Distribution - Word Cloud')
    plt.show()

    # Identify the predominant streaming service
    predominant_service = streaming_platforms.mode().iloc[0]
    print(f"The predominant streaming service is: {predominant_service}")

In [None]:
# Before calling visualize_streaming_distribution_wordcloud,
# check if filtered_df_movies is empty and handle it appropriately.

if filtered_df_movies.empty:
    print("The filtered DataFrame is empty. Cannot generate word cloud.")
else:
    visualize_streaming_distribution_wordcloud(filtered_df_movies)

The filtered DataFrame is empty. Cannot generate word cloud.
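
If several services are collected per title, the predominant service can be found by counting each service once per title. A standard-library sketch is given below; `count_services` is a hypothetical function that takes one list of service names per title.

```python
from collections import Counter


def count_services(service_lists):
    """Count how many titles each streaming service offers."""
    counts = Counter()
    for services in service_lists:
        # set() ensures a service is counted once per title.
        counts.update(set(services))
    return counts


counts = count_services([["Netflix", "Hotstar"], ["Netflix"], []])
print(counts.most_common(1))  # [('Netflix', 2)]
```

Titles with an empty list contribute nothing, so no placeholder string has to be filtered out first.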


## **Task 3: Data Export**

In [None]:
# Save the full (unfiltered) DataFrames in CSV format
df_movies.to_csv('Final_Movies_Data.csv', index=False)
df_tv_shows.to_csv('Final_Tv_Shows_Data.csv', index=False)

In [None]:
# Save the filtered DataFrames in CSV format
filtered_df_movies.to_csv('Filtered_Movies_Data.csv', index=False)
filtered_df_tv_shows.to_csv('Filtered_Tv_Shows_Data.csv', index=False)

# **Dataset Drive Link (View Access with Anyone) -**

# ***Congratulations!!! You have completed your Assignment.***