# Day5_Pandas_Grouping_Aggregations.ipynb
# Content: Pandas Grouping & Aggregations with IMDB dataset

---

## Step 1: Import Libraries

In [19]:
import pandas as pd
import numpy as np

## Step 2: Load IMDB Dataset

In [20]:
imdb_df = pd.read_csv("/content/imdb_top_1000.csv")

# Inspect dataset
print("First 5 rows:\n", imdb_df.head())
print("\nShape:", imdb_df.shape)
print("\nColumns:", imdb_df.columns)
print("\nInfo:")
imdb_df.info()  # info() prints its report directly and returns None
print("\nSummary Statistics:\n", imdb_df.describe())

First 5 rows:
                                          Poster_Link  \
0  https://m.media-amazon.com/images/M/MV5BMDFkYT...   
1  https://m.media-amazon.com/images/M/MV5BM2MyNj...   
2  https://m.media-amazon.com/images/M/MV5BMTMxNT...   
3  https://m.media-amazon.com/images/M/MV5BMWMwMG...   
4  https://m.media-amazon.com/images/M/MV5BMWU4N2...   

               Series_Title Released_Year Certificate  Runtime  \
0  The Shawshank Redemption          1994           A  142 min   
1             The Godfather          1972           A  175 min   
2           The Dark Knight          2008          UA  152 min   
3    The Godfather: Part II          1974           A  202 min   
4              12 Angry Men          1957           U   96 min   

                  Genre  IMDB_Rating  \
0                 Drama          9.3   
1          Crime, Drama          9.2   
2  Action, Crime, Drama          9.0   
3          Crime, Drama          9.0   
4          Crime, Drama          9.0   
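
The preview shows `Runtime` stored as text such as `142 min`, so it cannot be averaged per group until it is converted to a number. Below is a minimal sketch on a toy frame built from the preview rows, assuming every value follows the `<digits> min` pattern. The `Runtime_min` column name is hypothetical.

```python
import pandas as pd

# Toy frame copied from the preview rows above
toy = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Godfather", "12 Angry Men"],
    "Runtime": ["142 min", "175 min", "96 min"],
})

# Pull out the leading digits; values that do not match the pattern
# become NaN instead of raising. Runtime_min is a hypothetical name.
minutes = toy["Runtime"].str.extract(r"(\d+)", expand=False)
toy["Runtime_min"] = pd.to_numeric(minutes, errors="coerce")

print(toy)
```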

          

## Step 3: Handle Missing Data

In [21]:
# Fill missing Meta_score with the column mean.
# Assign the result back: calling fillna(..., inplace=True) on a column
# selection raises a FutureWarning and will not work in pandas 3.0.
imdb_df['Meta_score'] = imdb_df['Meta_score'].fillna(imdb_df['Meta_score'].mean())

# Fill missing Certificate with 'Not Rated'
imdb_df['Certificate'] = imdb_df['Certificate'].fillna('Not Rated')

# Fill missing Gross with the string '0'
imdb_df['Gross'] = imdb_df['Gross'].fillna('0')
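
The three fills in this step can also be expressed as one `DataFrame.fillna` call with a column-to-value mapping, which avoids in-place calls on a column selection. Below is a minimal sketch on a toy frame with illustrative values that are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column; values are illustrative only
toy = pd.DataFrame({
    "Meta_score": [80.0, np.nan, 100.0],
    "Certificate": ["A", None, "U"],
    "Gross": ["1000", "2000", None],  # kept as strings to mirror the '0' fill
})

# One call, one fill value per column; the result is assigned back
toy = toy.fillna({
    "Meta_score": toy["Meta_score"].mean(),
    "Certificate": "Not Rated",
    "Gross": "0",
})

print(toy)
```

Because `Gross` is filled with the string `'0'`, it would still need a numeric conversion before it can be summed or averaged.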

## Step 4: Grouping & Aggregation

In [22]:
# Average IMDB rating per Genre (each distinct comma-separated combination is its own group)
avg_rating_genre = imdb_df.groupby('Genre')['IMDB_Rating'].mean().sort_values(ascending=False)
print("\nAverage IMDB Rating per Genre:\n", avg_rating_genre)

# Count of movies per Genre
count_genre = imdb_df['Genre'].value_counts()
print("\nNumber of movies per Genre:\n", count_genre)

# Average Meta_score per Director
avg_meta_director = imdb_df.groupby('Director')['Meta_score'].mean().sort_values(ascending=False).head(10)
print("\nTop 10 Directors by average Meta_score:\n", avg_meta_director)

# Total No_of_Votes per Director
votes_per_director = imdb_df.groupby('Director')['No_of_Votes'].sum().sort_values(ascending=False).head(10)
print("\nTop 10 Directors by total votes:\n", votes_per_director)


Average IMDB Rating per Genre:
 Genre
Animation, Drama, War         8.50
Action, Sci-Fi                8.40
Drama, Musical                8.40
Drama, Mystery, War           8.35
Western                       8.35
                              ... 
Action, Adventure, Mystery    7.60
Action, Adventure, Family     7.60
Action, Adventure, Crime      7.60
Animation, Drama, Romance     7.60
Drama, War, Western           7.60
Name: IMDB_Rating, Length: 202, dtype: float64

Number of movies per Genre:
 Genre
Drama                        85
Drama, Romance               37
Comedy, Drama                35
Comedy, Drama, Romance       31
Action, Crime, Drama         30
                             ..
Action, Adventure, Family     1
Action, Crime, Mystery        1
Animation, Drama, Romance     1
Drama, War, Western           1
Adventure, Comedy, War        1
Name: count, Length: 202, dtype: int64
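
Both outputs report `Length: 202` because `Genre` holds comma-separated combinations, so each combination is treated as its own group. To aggregate per individual genre, one option is to split the string and explode it into one row per genre. Below is a minimal sketch on a toy frame built from the first rows of the dataset preview.

```python
import pandas as pd

toy = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "Genre": ["Drama", "Crime, Drama", "Action, Crime, Drama"],
    "IMDB_Rating": [9.3, 9.2, 9.0],
})

# Split on ", " and emit one row per (movie, genre) pair
per_genre = toy.assign(Genre=toy["Genre"].str.split(", ")).explode("Genre")

# A movie now contributes to every genre it lists
avg_by_genre = per_genre.groupby("Genre")["IMDB_Rating"].mean()
count_by_genre = per_genre["Genre"].value_counts()

print(avg_by_genre)
print(count_by_genre)
```

With this layout a film is counted once for each genre it lists, so the per-genre counts add up to more than the number of films.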

Top 10 Directors by average Meta_score:
 Director
Orson Welles            99.5
Charles Laughton      
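
A ranking by mean alone can be dominated by directors who have a single film in the list. Named aggregation reports the film count next to the mean so the two can be read together. Below is a minimal sketch with placeholder director names and illustrative scores. The `films` and `avg_meta` column names are hypothetical.

```python
import pandas as pd

# Placeholder directors and illustrative scores, not dataset values
toy = pd.DataFrame({
    "Director": ["Director A", "Director A", "Director A", "Director B"],
    "Meta_score": [80.0, 90.0, 85.0, 99.0],
})

# One output column per (source column, aggregation) pair
summary = (
    toy.groupby("Director")
    .agg(films=("Meta_score", "count"), avg_meta=("Meta_score", "mean"))
    .sort_values("avg_meta", ascending=False)
)

print(summary)
```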

## Step 5: Top-Rated Movies

In [23]:
top_10_movies = imdb_df.sort_values('IMDB_Rating', ascending=False).head(10)
print("\nTop 10 highest rated movies:\n", top_10_movies[['Series_Title', 'Genre', 'IMDB_Rating', 'Director']])


Top 10 highest rated movies:
                                      Series_Title                      Genre  \
0                        The Shawshank Redemption                      Drama   
1                                   The Godfather               Crime, Drama   
4                                    12 Angry Men               Crime, Drama   
2                                 The Dark Knight       Action, Crime, Drama   
3                          The Godfather: Part II               Crime, Drama   
5   The Lord of the Rings: The Return of the King   Action, Adventure, Drama   
7                                Schindler's List  Biography, Drama, History   
6                                    Pulp Fiction               Crime, Drama   
8                                       Inception  Action, Adventure, Sci-Fi   
12                Il buono, il brutto, il cattivo                    Western   

    IMDB_Rating              Director  
0           9.3        Frank Darabont  
1       
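
`DataFrame.nlargest` is an alternative to sorting the whole frame and taking `head(10)`. With the default `keep='first'`, tied ratings keep their original row order, whereas the output above lists row 4 ahead of rows 2 and 3. Below is a minimal sketch on the five preview rows from Step 2.

```python
import pandas as pd

# Titles and ratings copied from the Step 2 preview
toy = pd.DataFrame({
    "Series_Title": [
        "The Shawshank Redemption",
        "The Godfather",
        "The Dark Knight",
        "The Godfather: Part II",
        "12 Angry Men",
    ],
    "IMDB_Rating": [9.3, 9.2, 9.0, 9.0, 9.0],
})

# Select the four highest ratings; ties are resolved by first appearance
top4 = toy.nlargest(4, "IMDB_Rating")

print(top4)
```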

## Step 6: Practice Tasks

In [24]:
# 1. Find average IMDB rating per genre for movies released after 2010
# (Released_Year is stored as text, so this is a string comparison)
avg_rating_2010 = imdb_df[imdb_df['Released_Year'] > '2010'].groupby('Genre')['IMDB_Rating'].mean()
print("\nAverage rating per genre (after 2010):\n", avg_rating_2010)

# 2. Count movies per Certificate type
cert_counts = imdb_df['Certificate'].value_counts()
print("\nNumber of movies per Certificate type:\n", cert_counts)

# 3. Directors with the highest average rating (min 3 movies); the top 5 are printed
director_counts = imdb_df['Director'].value_counts()
directors_3plus = director_counts[director_counts >= 3].index
highest_avg_director = imdb_df[imdb_df['Director'].isin(directors_3plus)].groupby('Director')['IMDB_Rating'].mean().sort_values(ascending=False)
print("\nDirector with highest average rating (min 3 movies):\n", highest_avg_director.head(5))


Average rating per genre (after 2010):
 Genre
Action, Adventure              8.400000
Action, Adventure, Comedy      7.842857
Action, Adventure, Drama       8.000000
Action, Adventure, Sci-Fi      7.900000
Action, Adventure, Thriller    7.700000
                                 ...   
Drama, Thriller, War           8.300000
Drama, War                     7.950000
Drama, Western                 8.400000
Horror, Mystery, Thriller      7.700000
Mystery, Thriller              7.850000
Name: IMDB_Rating, Length: 84, dtype: float64
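
The filter in task 1 compares `Released_Year` with the string `'2010'`, so the comparison is lexicographic. That matches numeric order for four-digit years, but any non-numeric entry is compared as text too. Converting the column first makes the intent explicit. Below is a minimal sketch with placeholder rows, where `"PG"` is an assumed example of a malformed year and `Year_num` is a hypothetical column name.

```python
import pandas as pd

# Placeholder rows; "PG" stands in for a malformed year value
toy = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C"],
    "Released_Year": ["1994", "2014", "PG"],
    "IMDB_Rating": [8.0, 8.2, 7.6],
})

# Text comparison: "PG" sorts after "2010", so the malformed row slips in
as_text = toy[toy["Released_Year"] > "2010"]

# Numeric comparison: unparseable years become NaN and are filtered out
toy["Year_num"] = pd.to_numeric(toy["Released_Year"], errors="coerce")
as_number = toy[toy["Year_num"] > 2010]

print(as_text)
print(as_number)
```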

Number of movies per Certificate type:
 Certificate
U            234
A            197
UA           175
R            146
Not Rated    101
PG-13         43
PG            37
Passed        34
G             12
Approved      11
TV-PG          3
GP             2
TV-14          1
Unrated        1
TV-MA          1
16             1
U/A            1
Name: count, dtype: int64
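
Some of the labels above look like spelling variants of one another, for example `UA` and `U/A`, or `Not Rated` and `Unrated`. If they are meant to be the same category (an assumption worth checking against the certification scheme), they can be merged before counting. Below is a minimal sketch in which the `aliases` mapping is hypothetical.

```python
import pandas as pd

# A few labels taken from the counts above
certs = pd.Series(["UA", "U/A", "Not Rated", "Unrated", "U"], name="Certificate")

# Hypothetical mapping from variant spelling to preferred label
aliases = {"U/A": "UA", "Unrated": "Not Rated"}

# Labels not listed in the mapping pass through unchanged
normalized = certs.replace(aliases)
counts = normalized.value_counts()

print(counts)
```

In this notebook `Not Rated` is also the fill value for missing certificates, so merging `Unrated` into it combines two different situations.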

Director with highest average rating (min 3 movies):
 Director
Christopher Nolan       8.462500
Pe
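
The minimum-film threshold in task 3 can also be applied with `GroupBy.transform`, which attaches each director's film count to every row and removes the need for the intermediate `value_counts` and `isin` steps. Below is a minimal sketch with placeholder director names and illustrative ratings.

```python
import pandas as pd

# Placeholder directors and illustrative ratings, not dataset values
toy = pd.DataFrame({
    "Director": ["Director A", "Director A", "Director A", "Director B"],
    "IMDB_Rating": [8.5, 8.0, 9.0, 9.3],
})

# Per-row film count for that row's director
films_per_director = toy.groupby("Director")["IMDB_Rating"].transform("size")

# Keep only rows whose director has at least 3 films, then rank by mean
eligible = toy[films_per_director >= 3]
ranking = (
    eligible.groupby("Director")["IMDB_Rating"]
    .mean()
    .sort_values(ascending=False)
)

print(ranking)
```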