<a href="https://colab.research.google.com/github/msaqib015/Web_Scrapping_and_Data_Handling/blob/main/Numerical_Programming_in_Python_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the most significant number of offerings.
      
   ```   

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access for anyone who has the link.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code shouldn't have any errors and should be executable in one go.

- Before the conclusion, include your dataset Drive link in the notebook.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.








# **Start The Project**

## **Task 1: Web Scraping**

In [None]:
# Install the necessary libraries

!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [None]:
# Import the necessary libraries

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## **Scraping Movies Data**

In [None]:
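# Specify the request headers and the URL from which movie data will be fetched

# Browser-like headers. 'br' (Brotli) is left out of Accept-Encoding on purpose:
# requests can only decode Brotli when the brotli or brotlicffi package is installed,
# and otherwise page.text contains unreadable compressed bytes.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}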
url='https://www.justwatch.com/in/movies?release_year_from=2000'

# Sending an HTTP GET request to the URL
page = requests.get(url, headers = headers)

# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup = BeautifulSoup(page.text,'html.parser')

# Printing the prettified HTML content
print(soup.prettify())

## **Fetching Movie URLs**

In [None]:
# Collect the detail-page URL of every movie on the listing page

movie_url = []
for x in soup.find_all('a', attrs={'class': 'title-list-grid__item--link'}):
  movie_url.append('https://www.justwatch.com' + x['href'])
print(movie_url)
print(len(movie_url))

[]
0


## **Scraping Movie Title**

In [None]:
# Write Your Code here

import time

# Initialize an empty list to store movie titles.
movie_title = []

# Iterate over each URL in the movie_url list.
for url in movie_url:
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }
    # Send a GET request to the URL with the specified headers
    data = requests.get(url, headers=headers)

    # Parse the response text using BeautifulSoup
    soup = BeautifulSoup(data.text, 'html.parser')

    # Extract the full title from the parsed HTML
    full_title = soup.find_all('div', attrs={'class': 'title-detail-hero__details'})[0].find_all('h1')[0].text

    # Clean up the title by removing any year information and whitespace
    title = full_title.split('(')[0].strip()

  # Fall back to "NA" if the request or the parsing fails
  except Exception:
    title = "NA"

  # Append the title (or "NA") to the movie_title list
  movie_title.append(title)

  # Pause execution for 1 second to avoid overwhelming the server
  time.sleep(1)

# Print the list of movie titles
print(movie_title)

# Print the total number of movie titles collected
print(len(movie_title))

[]
0


## **Scraping Release Year**

In [None]:
# Write Your Code here

# Import the time module to enable sleep functionality
import time

# Initialize an empty list to store release years
release_year = []

# Iterate over each URL in the movie_url list
for url in movie_url:
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }

    # Send a GET request to the URL with the specified headers
    data = requests.get(url, headers=headers)

    # Parse the response text using BeautifulSoup
    soup = BeautifulSoup(data.text, 'html.parser')

    # Extract the release year from the parsed HTML.
    # int() replaces eval(): scraped text must never be executed as code.
    year_text = soup.find_all('div', attrs={'class': 'title-detail-hero__details'})[0].find_all('span')[0].text
    year = int(year_text.strip().strip('()'))
  except Exception:
    year = 'NA'
  release_year.append(year)

  # Pause execution for 1 second to avoid overwhelming the server
  time.sleep(1)

# Print the list of release years
print(release_year)

# Print the total number of release years collected
print(len(release_year))

[]
0
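The year extraction can also be isolated in a small helper that never executes scraped text. This is a minimal sketch, assuming the year appears on the page as a four-digit number such as `(2023)`; the name `parse_year` is hypothetical and not part of the notebook.

```python
import re
from typing import Optional


def parse_year(text: str) -> Optional[int]:
    """Return the first four-digit year (19xx or 20xx) found in text, or None."""
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    return int(match.group(0)) if match else None


print(parse_year(" (2023) "))
print(parse_year("NA"))
```

Returning `None` instead of the string `'NA'` keeps the column numeric once it is loaded into Pandas.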


## **Scraping Genres**

In [None]:
# Write Your Code here

# Import the time module to enable sleep functionality
import time

# Initialize an empty list to store movie genres
genre_list = []

# Iterate over each URL in the movie_url list
for url in movie_url:
  # Default in case the page has no 'Genres' entry; without it, the value
  # from the previous movie would be appended again
  genres = 'NA'
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }

    # Send a GET request to the URL with the specified headers
    data = requests.get(url, headers=headers)

    # Parse the response text using BeautifulSoup
    soup = BeautifulSoup(data.text, 'html.parser')

    # Iterate through all divs with class 'detail-infos'
    for x in soup.find_all('div', attrs={'class': 'detail-infos'}):

      # Check if the subheading is 'Genres'
      if x.find_all('h3', attrs={'class': 'detail-infos__subheading'})[0].text.strip() == 'Genres':

        # Extract the genres from the corresponding div
        genres = x.find_all('div', attrs={'class': 'detail-infos__value'})[0].text
        break
  except Exception:
    genres = 'NA'
  genre_list.append(genres)
  time.sleep(1)

# Print the list of genres
print(genre_list)

# Print the total number of genres collected
print(len(genre_list))

[]
0


## **Scraping IMDb Rating**

In [None]:
# Write Your Code here

import time
movie_rating = []

for url in movie_url:
  # Default in case the page has no IMDb score; without it, the value
  # from the previous movie would be appended again
  imdbrating = 'NA'
  try:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    imdbratings = soup.find_all('span', class_='imdb-score')
    if imdbratings:
      # Keep only the first token (the score); drop .split()[0] to keep the full text
      imdbrating = imdbratings[0].text.strip().split()[0]
  except Exception:
    imdbrating = 'NA'

  movie_rating.append(imdbrating)
  time.sleep(1)

print(movie_rating)
print(len(movie_rating))

[]
0
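Because the rating is stored as text, it has to be converted before any numeric comparison. The following is a minimal sketch, assuming the score is the first whitespace-separated token of the scraped string; `parse_rating` is a hypothetical helper name and the sample strings are made up.

```python
from typing import Optional


def parse_rating(text: str) -> Optional[float]:
    """Convert the first whitespace-separated token of text to float, or return None."""
    tokens = text.strip().split()
    if not tokens:
        return None
    try:
        return float(tokens[0])
    except ValueError:
        # Placeholder strings such as 'NA' are not numbers
        return None


print(parse_rating("7.5  (12k)"))
print(parse_rating("NA"))
```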


## **Scraping Runtime/Duration**

In [None]:
# Write Your Code here

import time
movie_runtime = []
for url in movie_url:
  # Default in case the page has no 'Runtime' entry
  Runtime = 'NA'
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }
    data = requests.get(url, headers=headers)
    soup = BeautifulSoup(data.text, 'html.parser')
    for x in soup.find_all('div', attrs={'class': 'detail-infos'}):
      if x.find_all('h3', attrs={'class': 'detail-infos__subheading'})[0].text.strip() == 'Runtime':
        Runtime = x.find_all('div', attrs={'class': 'detail-infos__value'})[0].text
        break
  except Exception:
    Runtime = 'NA'
  movie_runtime.append(Runtime)
  time.sleep(1)
print(movie_runtime)

[]


## **Scraping Age Rating**

In [None]:
# Write Your Code here

# Initialize an empty list to store age rating
Age_rating = []


# Iterate over each URL in the movie_url list
for url in movie_url:

    # Initialize rating_age
    rating_age = "NA"
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        }

        # Send a GET request to the URL with the specified headers
        data = requests.get(url, headers=headers)

        # Parse the response text using BeautifulSoup
        soup = BeautifulSoup(data.text, 'html.parser')

        # Iterate through all divs with class 'detail-infos'
        for x in soup.find_all('div', attrs = {'class': 'detail-infos'}):

            # Check if the subheading is 'Age rating'
            if x.find_all('h3', attrs = {'class': 'detail-infos__subheading'})[0].text == 'Age rating':

                # Extract the age rating value
                rating_age = x.find_all('div', attrs = {'class': 'detail-infos__value'})[0].text

                # Stop searching once we've found the rating
                break
    except Exception as e:
        print(f"An error occurred: {e}")

    Age_rating.append(rating_age)
    time.sleep(1)

print(Age_rating)
print(len(Age_rating))


[]
0


## **Fetching Production Countries Details**

In [None]:
# Write Your Code here
import time
movie_production_country = []
for url in movie_url:
  # Default in case the page has no 'Production country' entry
  country = "NA"
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }
    data = requests.get(url, headers=headers)
    soup = BeautifulSoup(data.text, 'html.parser')

    for x in soup.find_all('div', attrs={'class': 'detail-infos'}):
      # strip() makes the comparison independent of the padding around the heading text
      if x.find_all('h3', attrs={'class': 'detail-infos__subheading'})[0].text.strip() == 'Production country':
        country = x.find_all('div', attrs={'class': 'detail-infos__value'})[0].text
        break

  except Exception:
    country = "NA"
  movie_production_country.append(country)
  time.sleep(1)
print(movie_production_country)
print(len(movie_production_country))

[]
0
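The genre, runtime, age rating, and production country cells all repeat the same lookup: find the `detail-infos` block whose subheading matches a label and read its value. That lookup can be sketched as one reusable function. The helper name `detail_value` and the HTML fragment are hypothetical; the fragment only mimics the class names used in the cells above and is not real JustWatch markup.

```python
from bs4 import BeautifulSoup


def detail_value(soup, heading, default="NA"):
    """Return the 'detail-infos__value' text whose subheading equals heading."""
    for block in soup.find_all('div', class_='detail-infos'):
        sub = block.find('h3', class_='detail-infos__subheading')
        value = block.find('div', class_='detail-infos__value')
        # get_text(strip=True) ignores padding around the heading text
        if sub and value and sub.get_text(strip=True) == heading:
            return value.get_text(strip=True)
    return default


# Hypothetical fragment, for illustration only
sample_html = """
<div class="detail-infos">
  <h3 class="detail-infos__subheading">Genres</h3>
  <div class="detail-infos__value">Drama, Crime</div>
</div>
<div class="detail-infos">
  <h3 class="detail-infos__subheading"> Production country </h3>
  <div class="detail-infos__value">India</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
print(detail_value(sample_soup, 'Genres'))
print(detail_value(sample_soup, 'Production country'))
print(detail_value(sample_soup, 'Runtime'))
```

With such a helper, each detail page could be requested once and all four fields read from the same `soup`, instead of sending one request per field.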


## **Fetching Streaming Service Details**

In [None]:
# Write Your Code here
Movie_Streaming_Provider = []
for url in movie_url:
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        names = [
            x['src'].split('/')[-1].split('.')[0]  # Extract the last part of the URL and remove file extension
            for x in soup.find_all('img', class_="provider-icon wide icon") if 'src' in x.attrs]
    except Exception as e:
        print(f"Error for URL {url}: {e}")
        names = ['NA']

    # Store 'NA' when no provider icons are found, consistent with the other columns
    Movie_Streaming_Provider.append(", ".join(names) or 'NA')
print(Movie_Streaming_Provider)
print(len(Movie_Streaming_Provider))

[]
0
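The provider name is derived from the icon's file name. A slightly more defensive variant parses the URL first, so a query string after the file name does not end up in the result. This is a minimal sketch; `provider_from_src` is a hypothetical helper and the URL is a made-up example, not a real icon address.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def provider_from_src(src: str) -> str:
    """Return the file name of an image URL without directory, extension, or query string."""
    path = urlparse(src).path          # drops '?query' and '#fragment'
    return PurePosixPath(path).stem    # file name without its final extension


# Made-up URL, for illustration only
print(provider_from_src("https://images.example.com/icon/123/s100/netflix.webp?v=2"))
```

Unlike `split('.')[0]`, `stem` removes only the final extension, so a file name that contains extra dots is kept whole.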


## **Now Creating Movies DataFrame**

In [None]:
# Write Your Code here
info = {
    'movies_title': movie_title,
    'release_year' : release_year,
    'imdb_rating': movie_rating,
    'age_rating': Age_rating,
    'stream_provider':Movie_Streaming_Provider,
    'runtime':movie_runtime,
    'genre_list':genre_list,
    'movie_prod_country': movie_production_country,
    'movie_url': movie_url

}



movies_df = pd.DataFrame(info)



In [None]:
movies_df

Unnamed: 0,movies_title,release_year,imdb_rating,age_rating,stream_provider,runtime,genre_list,movie_prod_country,movie_url


In [None]:
# Make a csv file
movies_df.to_csv('movie_file.csv')
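By default `DataFrame.to_csv` writes the row index as an unnamed first column, and reading the file back with `pd.read_csv` labels that column `Unnamed: 0`. The sketch below shows the difference on made-up rows; passing `index=False` is an option to consider, not something the notebook does.

```python
import io

import pandas as pd

# Made-up rows, for illustration only
sample = pd.DataFrame({'movies_title': ['A', 'B'], 'imdb_rating': [7.1, 8.0]})

with_index = io.StringIO()
sample.to_csv(with_index)                    # default: index is written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)    # index is skipped

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))

# Reading the first file back yields an extra 'Unnamed: 0' column
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(list(reloaded.columns))
```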

## **Scraping TV Show Data**

In [None]:
# Specify the request headers and the URL from which TV show data will be fetched

# Without browser-like headers the request returns a 403 error, so these headers are sent.
# 'br' (Brotli) is left out of Accept-Encoding: requests can only decode Brotli when the
# brotli or brotlicffi package is installed.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'

# Sending an HTTP GET request to the URL
page=requests.get(url, headers= headers)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(page.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

## **Fetching TV Show URLs**

In [None]:
# Write Your Code here

# Collect the detail-page URL of every TV show on the listing page
tv_show_url = []
for x in soup.find_all('a',attrs={'class':'title-list-grid__item--link'}):
  tv_show_url.append('https://www.justwatch.com'+x['href'])
print(tv_show_url)

[]


## **Fetching TV Show Title Details**

In [None]:
# Write Your Code here

# Import the time module to enable sleep functionality
import time

# Initialize an empty list to store tv shows title
tv_shows_title = []

# Iterate over each URL in the tv_show_url list
for url in tv_show_url:
  try:
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}

    # Send a GET request to the URL with the specified headers
    data = requests.get(url, headers = headers)

    # Parse the response text using BeautifulSoup
    soup = BeautifulSoup(data.text,'html.parser')

    # Extract the full title from the specified div and h1 tags
    full_title = soup.find_all('div', attrs = {'class':'title-detail-hero__details'})[0].find_all('h1')[0].text

    # Clean up the title by removing any text in parentheses and stripping whitespace
    title = full_title.split('(')[0].strip()
  except:
    title = "NA"
  tv_shows_title.append(title)
  time.sleep(1)
print(tv_shows_title)

[]


## **Fetching Release Year**

In [None]:
# Write Your Code here

# Import the time module to enable sleep functionality
import time

# Initialize an empty list to store tv shows year
tv_shows_year = []

# Iterate over each URL in the tv_show_url list
for url in tv_show_url:
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }

    # Send a GET request to the URL with the specified headers
    data = requests.get(url, headers=headers)

    # Parse the response text using BeautifulSoup
    soup = BeautifulSoup(data.text, 'html.parser')

    # Extract the year from the specified div and span tags.
    # int() replaces eval(): scraped text must never be executed as code.
    year_text = soup.find_all('div', attrs={'class': 'title-detail-hero__details'})[0].find_all('span')[0].text
    year = int(year_text.strip().strip('()'))
  except Exception:
    year = 'NA'
  tv_shows_year.append(year)

  # Pause for 1 second to avoid overwhelming the server
  time.sleep(1)
print(tv_shows_year)

[]


## **Fetching TV Show Genre Details**

In [None]:
# Write Your Code here

# Initialize an empty list to store tv show genres
genre_list = []

# Iterate over each URL in the tv_show_url list
for url in tv_show_url:
  # Default in case the page has no 'Genres' entry; without it, the value
  # from the previous TV show would be appended again
  genres = 'NA'
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }

    # Send a GET request to the URL with the specified headers
    data = requests.get(url, headers=headers)

    # Parse the response text using BeautifulSoup
    soup = BeautifulSoup(data.text, 'html.parser')

    # Iterate through all divs with class 'detail-infos'
    for x in soup.find_all('div', attrs={'class': 'detail-infos'}):

      # Check if the subheading is 'Genres'
      if x.find_all('h3', attrs={'class': 'detail-infos__subheading'})[0].text.strip() == 'Genres':

        # Extract the genres from the corresponding div
        genres = x.find_all('div', attrs={'class': 'detail-infos__value'})[0].text
        break
  except Exception:
    genres = 'NA'
  genre_list.append(genres)

  # Pause for 1 second to avoid overwhelming the server
  time.sleep(1)
print(genre_list)

[]


## **Fetching IMDB Rating Details**

In [None]:
# Write Your Code here

import time
tv_show_rating = []

for url in tv_show_url:
  # Default in case the page has no IMDb score; without it, the value
  # from the previous TV show would be appended again
  imdbrating = 'NA'
  try:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    imdbratings = soup.find_all('span', class_='imdb-score')
    if imdbratings:
      # Keep only the first token (the score); drop .split()[0] to keep the full text
      imdbrating = imdbratings[0].text.strip().split()[0]
  except Exception:
    imdbrating = 'NA'

  tv_show_rating.append(imdbrating)
  time.sleep(1)

print(tv_show_rating)
print(len(tv_show_rating))

[]
0


## **Fetching Age Rating Details**

In [None]:
# Write Your Code here


Age_rating = []

for url in tv_show_url:

    # Initialize rating_age here
    rating_age = "NA"
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        }

        # Send a GET request to the URL with the specified headers
        data = requests.get(url, headers=headers)

        # Parse the response text using BeautifulSoup
        soup = BeautifulSoup(data.text, 'html.parser')

        # Iterate through all divs with class 'detail-infos'
        for x in soup.find_all('div', attrs = {'class': 'detail-infos'}):

            # Check if the subheading is 'Age rating'
            if x.find_all('h3', attrs = {'class': 'detail-infos__subheading'})[0].text == 'Age rating':
                rating_age = x.find_all('div', attrs={'class': 'detail-infos__value'})[0].text

                # Stop searching once we've found the rating
                break
    except Exception as e:
        print(f"An error occurred: {e}")

    Age_rating.append(rating_age)
    time.sleep(1)

print(Age_rating)
print(len(Age_rating))


[]
0


## **Fetching Production Country details**

In [None]:
# Write Your Code here
import time
tv_production_country = []
for url in tv_show_url:
  # Default in case the page has no 'Production country' entry
  country = "NA"
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }
    data = requests.get(url, headers=headers)
    soup = BeautifulSoup(data.text, 'html.parser')

    for x in soup.find_all('div', attrs={'class': 'detail-infos'}):
      # strip() makes the comparison independent of the padding around the heading text
      if x.find_all('h3', attrs={'class': 'detail-infos__subheading'})[0].text.strip() == 'Production country':
        country = x.find_all('div', attrs={'class': 'detail-infos__value'})[0].text
        break

  except Exception:
    country = "NA"
  tv_production_country.append(country)
  time.sleep(1)
print(tv_production_country)
print(len(tv_production_country))

[]
0


## **Fetching Streaming Service details**

In [None]:
# Write Your Code here
tv_show_streaming_services = []
for url in tv_show_url:
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        names = [
            x['src'].split('/')[-1].split('.')[0]  # Extract the last part of the URL and remove file extension
            for x in soup.find_all('img', class_="provider-icon wide icon") if 'src' in x.attrs]
    except Exception as e:
        print(f"Error for URL {url}: {e}")
        names = ['NA']

    tv_show_streaming_services.append(", ".join(names))

print(tv_show_streaming_services)
print(len(tv_show_streaming_services))

[]
0


## **Fetching Duration Details**

In [None]:
# Write Your Code here
import time
tv_runtime = []
for url in tv_show_url:
  # Default in case the page has no 'Runtime' entry
  Runtime = "NA"
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests cannot decode Brotli without an extra package
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }
    data = requests.get(url, headers=headers)
    soup = BeautifulSoup(data.text, 'html.parser')
    for x in soup.find_all('div', attrs={'class': 'detail-infos'}):
      if x.find_all('h3', attrs={'class': 'detail-infos__subheading'})[0].text.strip() == 'Runtime':
        Runtime = x.find_all('div', attrs={'class': 'detail-infos__value'})[0].text
        break

  except Exception:
    Runtime = "NA"
  tv_runtime.append(Runtime)
  time.sleep(1)
print(tv_runtime)
print(len(tv_runtime))

[]
0


## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here
info = {
    'tv_title' : tv_shows_title,
    'release_year': tv_shows_year,
    'Imdb_rating' : tv_show_rating,
    'age_rating':Age_rating,
    'stream_provider':tv_show_streaming_services,
    'runtime': tv_runtime,
    'genre_list': genre_list,
    'tv_prod_country':tv_production_country,
    'tv_show_url':tv_show_url
}
tv_df = pd.DataFrame(info)

In [None]:
tv_df

Unnamed: 0,tv_title,release_year,Imdb_rating,age_rating,stream_provider,runtime,genre_list,tv_prod_country,tv_show_url


In [None]:
# Make a csv file
tv_df.to_csv('tv_file.csv')

## **Task 2: Data Filtering & Analysis**

In [None]:
# Work on copies of the DataFrames so that the originals stay unchanged

# For movies
data1 = movies_df.copy()
data1.head(5)


Unnamed: 0,movies_title,release_year,imdb_rating,age_rating,stream_provider,runtime,genre_list,movie_prod_country,movie_url


In [None]:
 # For TV shows
data2 = tv_df.copy()
data2

Unnamed: 0,tv_title,release_year,Imdb_rating,age_rating,stream_provider,runtime,genre_list,tv_prod_country,tv_show_url


In [None]:
# basic check row and column for movies
data1.shape

(0, 9)

In [None]:
# basic check row and column for tv_shows
data2.shape

(0, 9)

In [None]:
# checking null value for movies
data1.isnull().sum()

Unnamed: 0,0
movies_title,0
release_year,0
imdb_rating,0
age_rating,0
stream_provider,0
runtime,0
genre_list,0
movie_prod_country,0
movie_url,0


In [None]:
# checking null value for tv shows
data2.isnull().sum()

Unnamed: 0,0
tv_title,0
release_year,0
Imdb_rating,0
age_rating,0
stream_provider,0
runtime,0
genre_list,0
tv_prod_country,0
tv_show_url,0


In [None]:
# Info for movies
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movies_title        0 non-null      float64
 1   release_year        0 non-null      float64
 2   imdb_rating         0 non-null      float64
 3   age_rating          0 non-null      float64
 4   stream_provider     0 non-null      float64
 5   runtime             0 non-null      float64
 6   genre_list          0 non-null      float64
 7   movie_prod_country  0 non-null      float64
 8   movie_url           0 non-null      float64
dtypes: float64(9)
memory usage: 132.0 bytes


In [None]:
# Info for tv_shows
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   tv_title         0 non-null      float64
 1   release_year     0 non-null      float64
 2   Imdb_rating      0 non-null      float64
 3   age_rating       0 non-null      float64
 4   stream_provider  0 non-null      float64
 5   runtime          0 non-null      float64
 6   genre_list       0 non-null      float64
 7   tv_prod_country  0 non-null      float64
 8   tv_show_url      0 non-null      float64
dtypes: float64(9)
memory usage: 132.0 bytes


In [None]:
# statistical report for movies
data1.describe(include='object')

ValueError: No objects to concatenate

In [None]:
# statistical report for tv_shows
data2.describe(include='object')

In [None]:
# Finding duplicates for movies
data1.duplicated().sum()

In [None]:
# Finding duplicates for tv_shows
data2.duplicated().sum()

In [None]:
# Replace 'NA' strings and actual NaN values in the 'age_rating' and 'stream_provider' columns for movies

movies_df['age_rating'] = movies_df['age_rating'].replace('NA', pd.NA).fillna('Age rating not found')

# The original filled tv_df here by mistake; this cell should only touch movies_df
movies_df['stream_provider'] = movies_df['stream_provider'].replace('NA', pd.NA).fillna('Streaming Service Not found')


In [None]:
# Cleaned data for movies
movies_df.head()

In [None]:
# Replace 'NA' strings and actual NaN values in the 'age_rating' and 'stream_provider' columns for TV shows

tv_df['age_rating'] = tv_df['age_rating'].replace('NA', pd.NA).fillna('Age rating not found')

tv_df['stream_provider'] = tv_df['stream_provider'].replace('NA', pd.NA).fillna('Streaming Service Not found')


In [None]:
# Cleaned data for TV shows
tv_df.head()
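The task asks for titles released in the last 2 years with an IMDb rating of 7 or higher. One way to express that in Pandas is sketched below. It assumes only the release year is available, so "last 2 years" is approximated by calendar year. The function name `filter_recent_high_rated` and the sample rows are hypothetical.

```python
from datetime import date

import pandas as pd


def filter_recent_high_rated(df, year_col, rating_col, years=2, min_rating=7.0, today=None):
    """Keep rows released within the last `years` calendar years with rating >= min_rating."""
    today = today or date.today()
    # 'NA' placeholders become NaN; comparisons with NaN are False, so those rows drop out
    year = pd.to_numeric(df[year_col], errors='coerce')
    rating = pd.to_numeric(df[rating_col], errors='coerce')
    mask = (year >= today.year - years) & (rating >= min_rating)
    return df[mask]


# Made-up rows, for illustration only
sample = pd.DataFrame({
    'movies_title': ['A', 'B', 'C', 'D'],
    'release_year': [2024, 2015, 2024, 'NA'],
    'imdb_rating': ['7.5', '8.2', '6.1', 'NA'],
})
filtered = filter_recent_high_rated(sample, 'release_year', 'imdb_rating', today=date(2025, 1, 1))
print(filtered)
```

The same function could be called on `movies_df` with `'imdb_rating'` and on `tv_df` with `'Imdb_rating'`, since the two frames spell the rating column differently.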

## **Calculating Mean IMDb Ratings for both Movies and TV Shows**

In [None]:
# Convert 'imdb_rating' to numeric, handling errors
data1['imdb_rating'] = pd.to_numeric(data1['imdb_rating'], errors='coerce')

# Calculate mean rating, excluding missing values
mean_rating = data1['imdb_rating'].mean()

In [None]:
# Convert 'Imdb_rating' column to numeric, handling non-numeric values
data2['Imdb_rating'] = pd.to_numeric(data2['Imdb_rating'], errors='coerce')

# Calculate the mean IMDb rating, ignoring missing values
mean_imdb_rating = data2['Imdb_rating'].mean()

In [None]:
print(f"Mean IMDb rating for movies: {mean_rating:.2f}")
print(f"Mean IMDb rating for TV shows: {mean_imdb_rating:.2f}")
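If a single average across movies and TV shows is wanted, the two rating columns can be concatenated before converting them. This is a minimal sketch on made-up ratings; in the notebook the inputs would be `data1['imdb_rating']` and `data2['Imdb_rating']`.

```python
import pandas as pd

# Made-up ratings, for illustration only; 'NA' marks a missing score
movie_ratings = pd.Series(['7.5', '8.0', 'NA'])
tv_ratings = pd.Series(['6.5', 'NA', '9.0'])

# Stack both columns, turn non-numeric entries into NaN, and average the rest
all_ratings = pd.to_numeric(
    pd.concat([movie_ratings, tv_ratings], ignore_index=True),
    errors='coerce',
)
overall_mean = all_ratings.mean()   # NaN values are skipped by default
print(f"Mean IMDb rating overall: {overall_mean:.2f}")
```

Note that the mean of an empty or all-NaN column is NaN, so an empty scrape prints `nan` rather than raising an error.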

## **Analyzing Top Genres**

In [None]:
# Write Your Code here

# TOP GENRE FOR MOVIES
# Split genres and explode into separate rows
genre_counts_movies = movies_df['genre_list'].str.split(', ').explode().value_counts()

# Display top genres
print(genre_counts_movies)

In [None]:
# MEAN RATING BY GENRE FOR MOVIES

# Create a copy to avoid modifying the original DataFrame
movies_df_copy = movies_df.copy()

# Convert 'imdb_rating' column to numeric, coercing errors to NaN
movies_df_copy['imdb_rating'] = pd.to_numeric(movies_df_copy['imdb_rating'], errors='coerce')

# Explode the 'genre_list' column
movies_df_exploded = movies_df_copy.assign(genre=movies_df_copy['genre_list'].str.split(', ')).explode('genre')


# Group by genre and calculate the mean IMDb rating
mean_rating_by_genre = movies_df_exploded.groupby('genre')['imdb_rating'].mean()

# Display the result
mean_rating_by_genre

In [None]:
# TOP GENRE COUNT FOR TV SHOWS
# Split the 'genre_list' column and count genre occurrences
genre_counts_tv = tv_df['genre_list'].str.split(', ').explode().value_counts()

# Display the top genres
print(genre_counts_tv)


In [None]:
# MEAN RATING BY GENRE FOR TV SHOWS
# Create a copy to avoid modifying the original DataFrame
tv_df_copy = tv_df.copy()

# Convert 'Imdb_rating' column to numeric, coercing errors to NaN
tv_df_copy['Imdb_rating'] = pd.to_numeric(tv_df_copy['Imdb_rating'], errors='coerce')

# Explode the 'genre_list' column
tv_df_exploded = tv_df_copy.assign(genre=tv_df_copy['genre_list'].str.split(', ')).explode('genre')

# Group by genre and calculate the mean IMDb rating
mean_rating_by_genre_tv = tv_df_exploded.groupby('genre')['Imdb_rating'].mean()

# Display the result
mean_rating_by_genre_tv
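
The genre counts and the mean ratings above are computed in separate cells. As a possible alternative, both can be produced in one pass with a single `groupby` and named aggregations. This is only a sketch on a made-up frame: `toy` and its values are hypothetical, and it assumes genres are stored as one comma-separated string per title, as in `genre_list`.

```python
import pandas as pd

# Hypothetical data in the same shape as the scraped frames
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre_list": ["Drama, Crime", "Drama", "Comedy, Drama"],
    "imdb_rating": ["8.0", "7.0", "NA"],
})
toy["imdb_rating"] = pd.to_numeric(toy["imdb_rating"], errors="coerce")

# One row per (title, genre) pair
exploded = toy.assign(genre=toy["genre_list"].str.split(", ")).explode("genre")

# titles: rows per genre, rated: rows with a usable rating, mean_rating: mean over rated rows
genre_summary = (
    exploded.groupby("genre")["imdb_rating"]
    .agg(titles="size", rated="count", mean_rating="mean")
    .sort_values("titles", ascending=False)
)
print(genre_summary)
```

Keeping `titles` and `rated` side by side shows when a genre's mean rests on very few rated titles.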

In [None]:
# Let's visualize the movie genre counts using a word cloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

genre_freq = dict(genre_counts_movies)

# Create word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(genre_freq)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Genre Counts Word Cloud for movies')
plt.show()

In [None]:
# Let's visualize the TV show genre counts using a word cloud
# (genre_counts_tv was computed in the "TOP GENRE COUNT FOR TV SHOWS" cell)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

genre_freq = dict(genre_counts_tv)

# Create word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(genre_freq)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Genre Counts Word Cloud for TV shows')
plt.show()

## **Finding Predominant Streaming Service**

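**Note:** `index[1]` is used instead of `index[0]` because most of the fields are NA. The most frequent entry is therefore the missing-value entry, so the second index is taken to get the most common actual stream provider.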

In [None]:
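# Count how often each stream provider appears for movies
stream_provider_counts = movies_df['stream_provider'].value_counts()

# The most frequent entry is the missing-value entry (see the note above),
# so the second entry is the most common actual stream provider
predominant_stream_provider = stream_provider_counts.index[1]

print(f'The predominant stream provider for movies is: {predominant_stream_provider}')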

In [None]:
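# Count how often each stream provider appears for TV shows
stream_provider_counts = tv_df['stream_provider'].value_counts()

# The most frequent entry is the missing-value placeholder,
# so the second entry is the most common actual stream provider
predominant_stream_provider = stream_provider_counts.index[1]

print(f'The predominant stream provider for TV shows is: {predominant_stream_provider}')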
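
Taking `index[1]` only works while the missing-value entry happens to be the most frequent one. A more defensive variant, sketched below, filters that entry out first and then takes the top count. The helper `predominant_provider` and the `demo` series are hypothetical; the placeholder text is the fill value used for `tv_df` earlier in this notebook, and it is an assumption that the movie data uses the same one.

```python
import pandas as pd

# Fill value used for missing providers earlier in this notebook
PLACEHOLDER = "Streaming Service Not found"

def predominant_provider(providers: pd.Series, placeholder: str = PLACEHOLDER):
    """Return the most frequent real provider, or None if there is none."""
    real = providers.dropna()             # drop genuine missing values
    real = real[real != placeholder]      # drop the placeholder string
    counts = real.value_counts()          # sorted by frequency, descending
    return counts.index[0] if not counts.empty else None

# Hypothetical data: the placeholder is the most frequent entry
demo = pd.Series([PLACEHOLDER, PLACEHOLDER, PLACEHOLDER, "Netflix", "Netflix", "Hotstar"])
print(predominant_provider(demo))
```

With this approach the result does not depend on how many titles lack a provider, and an all-missing column returns `None` instead of raising an `IndexError`.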

In [None]:
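# Let's visualize the movie stream providers using a word cloud
from wordcloud import WordCloud
# Check the movie data itself (stream_provider_counts now holds the TV show counts)
if not movies_df['stream_provider'].dropna().empty: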
    # Create a string of all stream providers for the word cloud
    all_providers_text = " ".join(movies_df['stream_provider'].dropna())

    # Create the word cloud
    wordcloud = WordCloud(
        width=800,
        height=400,
        background_color="white",
        colormap="viridis"
    ).generate(all_providers_text)

    # Plot the word cloud
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title("Stream Provider Distribution")
    plt.show()
else:
    print("No stream provider data available for visualization.")

In [None]:
# Let's visualize the TV show stream providers using a word cloud
from wordcloud import WordCloud
if not tv_df['stream_provider'].dropna().empty:
    # Create a string of all stream providers for the word cloud
    all_providers_text = " ".join(tv_df['stream_provider'].dropna())

    # Create the word cloud
    wordcloud = WordCloud(
        width=800,
        height=400,
        background_color="white",
        colormap="plasma"
    ).generate(all_providers_text)

    # Plot the word cloud
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title("TV Stream Provider Distribution")
    plt.show()
else:
    print("No stream provider data available for visualization.")

## **Task 3 :- Data Export**

In [None]:
# Save the final filtered DataFrames in CSV format
# Export movies_df to CSV
movies_df.to_csv('filtered_movies_data.csv', index=False)

# Export tv_df to CSV
tv_df.to_csv('filtered_tv_shows_data.csv', index=False)
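
After exporting, it can be worth reading the CSV back to confirm that the columns and row count survived the round trip. The sketch below does this on a small made-up frame and writes to an in-memory buffer rather than a file; `frame` and its values are hypothetical, and the same check could be pointed at the exported file names instead.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the filtered data
frame = pd.DataFrame({"title": ["A", "B"], "imdb_rating": [7.5, 8.1]})

# Write without the index, as in the export cell, then read the text back
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Compare structure and size of the original and the re-read data
same_columns = list(restored.columns) == list(frame.columns)
same_rows = len(restored) == len(frame)
print(f"columns match: {same_columns}, row count matches: {same_rows}")
```

Because `index=False` is used, no extra unnamed index column appears when the file is read back.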


# **Dataset Drive Link (View Access with Anyone) -**

https://drive.google.com/drive/folders/1i7JBGPQcAK7JRRiu-1De3Cvwm1Hs7StG?usp=drive_link

# ***Congratulations!!! You have completed your Assignment.***