
Dataset Link: https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/IMDB-Movie-Data.csv

Tasks:

1) Import pandas, numpy

In [None]:
import pandas as pd
import numpy as np

2) Load dataset using pd.read_csv()

In [None]:
url = "https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/IMDB-Movie-Data.csv"
df = pd.read_csv(url)

3) Display: shape, head, column names, data types, missing values

In [None]:
# Display the shape of the dataset
print("Shape:", df.shape)

# Display the first few rows of the dataset
print("Head:\n", df.head())

# Display column names
print("Columns:", df.columns)

# Display data types of each column
print("Data Types:\n", df.dtypes)

# Check for missing values
print("Missing Values:\n", df.isnull().sum())

Shape: (1000, 12)
Head:
    Rank                    Title                     Genre  \
0     1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1     2               Prometheus  Adventure,Mystery,Sci-Fi   
2     3                    Split           Horror,Thriller   
3     4                     Sing   Animation,Comedy,Family   
4     5            Suicide Squad  Action,Adventure,Fantasy   

                                         Description              Director  \
0  A group of intergalactic criminals are forced ...            James Gunn   
1  Following clues to the origin of mankind, a te...          Ridley Scott   
2  Three girls are kidnapped by a man with a diag...    M. Night Shyamalan   
3  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet   
4  A secret government agency recruits some of th...            David Ayer   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
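
A raw missing-value count is easier to act on when it sits next to the share of rows it represents. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of the IMDB data; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for the IMDB data
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [10.5, np.nan, 3.2, np.nan],
    "Metascore": [70.0, 55.0, np.nan, 61.0],
})

# Count and percentage of missing values per column, in one table
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps
missing = missing[missing["missing"] > 0]
print(missing)
```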

4) Check and remove duplicates

In [None]:
# Check for duplicate rows
print("Duplicate Rows:", df.duplicated().sum())

# Remove duplicate rows
df = df.drop_duplicates()


Duplicate Rows: 0
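
`df.duplicated()` only flags rows that match in every column, so a count of 0 does not rule out repeated titles. A hedged sketch on made-up data (`toy` and its values are assumptions, not taken from the dataset) shows how `subset` and `keep=False` surface such rows:

```python
import pandas as pd

# Toy data (made up): rows 0 and 2 share a title but differ in Year
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie A"],
    "Year": [2006, 2016, 2013],
})

# No row is an exact copy of another row
full_dupes = toy.duplicated().sum()

# keep=False marks every member of a duplicate group, not just the later ones
same_title = toy[toy.duplicated(subset=["Title"], keep=False)]

print("Exact duplicate rows:", full_dupes)
print(same_title)
```

Whether rows that share a title should be dropped depends on the data; here they are different releases, so inspecting them first is the safer choice.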


5) Rename columns ("Title" > "Movie_Title")

In [None]:
df = df.rename(columns={"Title": "Movie_Title"})

6. Getting Unique Genres

In [None]:
unique_genres = df['Genre'].unique()
print("Unique Genres:", unique_genres)

Unique Genres: ['Action,Adventure,Sci-Fi' 'Adventure,Mystery,Sci-Fi' 'Horror,Thriller'
 'Animation,Comedy,Family' 'Action,Adventure,Fantasy' 'Comedy,Drama,Music'
 'Comedy' 'Action,Adventure,Biography' 'Adventure,Drama,Romance'
 'Adventure,Family,Fantasy' 'Biography,Drama,History'
 'Animation,Adventure,Comedy' 'Action,Comedy,Drama' 'Action,Thriller'
 'Biography,Drama' 'Drama,Mystery,Sci-Fi' 'Adventure,Drama,Thriller'
 'Drama' 'Crime,Drama,Horror' 'Action,Adventure,Drama' 'Drama,Thriller'
 'Action,Adventure,Comedy' 'Action,Horror,Sci-Fi' 'Adventure,Drama,Sci-Fi'
 'Action,Adventure,Western' 'Comedy,Drama' 'Horror'
 'Adventure,Drama,Fantasy' 'Action,Crime,Thriller' 'Action,Crime,Drama'
 'Adventure,Drama,History' 'Crime,Horror,Thriller' 'Drama,Romance'
 'Comedy,Drama,Romance' 'Horror,Mystery,Thriller' 'Crime,Drama,Mystery'
 'Drama,Romance,Thriller' 'Drama,History,Thriller' 'Action,Drama,Thriller'
 'Drama,History' 'Action,Drama,Romance' 'Drama,Fantasy' 'Action,Sci-Fi'
 'Adventure,Drama,War' 
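
`unique()` on the `Genre` column returns unique combinations such as 'Action,Adventure,Sci-Fi', not individual genres. If the individual genres are wanted, one possible approach is to split the strings and explode the lists into rows. The sketch below uses three made-up rows (`toy`) to illustrate:

```python
import pandas as pd

# Toy data (made up) in the same comma-separated format as the Genre column
toy = pd.DataFrame({
    "Genre": ["Action,Adventure,Sci-Fi", "Horror,Thriller", "Action,Thriller"]
})

# Split each string into a list, then give every list element its own row
individual = toy["Genre"].str.split(",").explode()

genres = sorted(individual.unique())   # distinct single genres
counts = individual.value_counts()     # how often each genre appears

print(genres)
print(counts)
```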

7. Extracting Number of Genres per Movie

In [None]:
df['Num_Genres'] = df['Genre'].apply(lambda x: len(x.split(',')))
print("Movies with Number of Genres:\n", df[['Movie_Title', 'Num_Genres']].head())

Movies with Number of Genres:
                Movie_Title  Num_Genres
0  Guardians of the Galaxy           3
1               Prometheus           3
2                    Split           2
3                     Sing           3
4            Suicide Squad           3
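
The `apply` with a lambda works here because `Genre` has no missing values; a missing entry would raise an `AttributeError` on `x.split`. A vectorized alternative, sketched on made-up data (`toy` is an assumption), counts commas with the `.str` accessor and leaves missing entries as missing:

```python
import numpy as np
import pandas as pd

# Toy data (made up), including one missing Genre
toy = pd.DataFrame({
    "Genre": ["Action,Adventure,Sci-Fi", "Horror,Thriller", "Comedy", np.nan]
})

# Number of genres = number of commas + 1; a missing Genre stays missing
num_genres = toy["Genre"].str.count(",") + 1
print(num_genres)
```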


8. Handling Missing Values in 'Revenue (Millions)' and 'Metascore'

In [None]:
# Fill missing values in 'Revenue (Millions)' with 0.
# Assign the result back to the column: calling fillna(..., inplace=True)
# on df[col] raises a FutureWarning and will stop working in pandas 3.0.
df['Revenue (Millions)'] = df['Revenue (Millions)'].fillna(0)

# Fill missing values in 'Metascore' with the median
df['Metascore'] = df['Metascore'].fillna(df['Metascore'].median())
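
Both fills can also be written as a single call by passing a dict to `DataFrame.fillna`, which maps each column to its own fill value. A minimal sketch on made-up data (`toy` and `fill_values` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy data (made up) with one gap in each column
toy = pd.DataFrame({
    "Revenue (Millions)": [100.0, np.nan, 50.0],
    "Metascore": [60.0, 80.0, np.nan],
})

# Map each column to its own fill value and fill both in one call
fill_values = {
    "Revenue (Millions)": 0,
    "Metascore": toy["Metascore"].median(),
}
toy = toy.fillna(fill_values)
print(toy)
```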


9. Top 5 Movies by Rating

In [None]:
top_5_movies = df[['Movie_Title', 'Rating']].sort_values(by='Rating', ascending=False).head(5)
print("Top 5 Movies by Rating:\n", top_5_movies)


Top 5 Movies by Rating:
           Movie_Title  Rating
54    The Dark Knight     9.0
80          Inception     8.8
117            Dangal     8.8
36       Interstellar     8.6
249  The Intouchables     8.6
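
The list above ends on a tie at 8.6, and `head(5)` cuts ties arbitrarily by row order. If tied movies should all be shown, `nlargest` with `keep="all"` is one option. A sketch on made-up data (`toy` is an assumption):

```python
import pandas as pd

# Toy data (made up): B and C tie for second place
toy = pd.DataFrame({
    "Movie_Title": ["A", "B", "C", "D"],
    "Rating": [9.0, 8.6, 8.6, 7.0],
})

# Default behaviour: exactly 2 rows, the tie is broken by row order
top2 = toy.nlargest(2, "Rating")

# keep="all": every row tied with the last selected value is included
top2_with_ties = toy.nlargest(2, "Rating", keep="all")

print(top2)
print(top2_with_ties)
```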


10. Movies with Highest/Lowest Revenue

In [None]:
# Movie with highest revenue
highest_revenue = df[['Movie_Title', 'Revenue (Millions)']].sort_values(by='Revenue (Millions)', ascending=False).head(1)
print("Movie with Highest Revenue:\n", highest_revenue)

# Movie with lowest revenue
lowest_revenue = df[['Movie_Title', 'Revenue (Millions)']].sort_values(by='Revenue (Millions)').head(1)
print("Movie with Lowest Revenue:\n", lowest_revenue)


Movie with Highest Revenue:
                                    Movie_Title  Revenue (Millions)
50  Star Wars: Episode VII - The Force Awakens              936.63
Movie with Lowest Revenue:
       Movie_Title  Revenue (Millions)
998  Search Party                 0.0
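
Because step 8 replaced missing revenues with 0, a lowest revenue of 0.0 may be an imputed value rather than a reported figure. One way to separate the two, sketched on made-up data, is to record which rows were missing before filling; `Revenue_Missing` is a hypothetical helper column, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up): movie B has no reported revenue
toy = pd.DataFrame({
    "Movie_Title": ["A", "B", "C"],
    "Revenue (Millions)": [12.5, np.nan, 0.4],
})

# Record which revenues were missing before imputing them with 0
toy["Revenue_Missing"] = toy["Revenue (Millions)"].isnull()
toy["Revenue (Millions)"] = toy["Revenue (Millions)"].fillna(0)

# Rank only the rows whose revenue was actually reported
reported = toy[~toy["Revenue_Missing"]]
lowest_reported = reported.nsmallest(1, "Revenue (Millions)")

# For comparison: the naive minimum picks the imputed row
lowest_naive = toy.nsmallest(1, "Revenue (Millions)")

print(lowest_reported)
print(lowest_naive)
```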


11. Average Rating by Director

In [None]:
avg_rating_by_director = df.groupby('Director')['Rating'].mean().reset_index()
print("Average Rating by Director:\n", avg_rating_by_director.head())


Average Rating by Director:
               Director  Rating
0           Aamir Khan     8.5
1  Abdellatif Kechiche     7.8
2            Adam Leon     6.5
3           Adam McKay     7.0
4        Adam Shankman     6.3
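
`groupby` returns the directors in alphabetical order, so `head()` shows the first names, not the best-rated ones. Sorting the result by rating is a small extension, sketched here on made-up data (`toy` is an assumption):

```python
import pandas as pd

# Toy data (made up): director X has two movies, Y and Z one each
toy = pd.DataFrame({
    "Director": ["X", "X", "Y", "Z"],
    "Rating": [8.0, 7.0, 9.0, 6.0],
})

# Mean rating per director, highest first
avg = (
    toy.groupby("Director")["Rating"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)
print(avg)
```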


12. Movies per Year

In [None]:
movies_per_year = df.groupby('Year')['Movie_Title'].count().reset_index(name='Movie_Count')
print("Movies per Year:\n", movies_per_year.head())


Movies per Year:
    Year  Movie_Count
0  2006           44
1  2007           53
2  2008           52
3  2009           51
4  2010           60
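
The same per-year count can be obtained without `groupby` by using `value_counts`, which also makes it easy to pick out the busiest year. A short sketch on made-up data (`toy` is an assumption):

```python
import pandas as pd

# Toy data (made up): one row per movie
toy = pd.DataFrame({"Year": [2007, 2006, 2007, 2008, 2007]})

# value_counts sorts by frequency; sort_index restores chronological order
per_year = toy["Year"].value_counts().sort_index()

# Year with the most movies
busiest_year = per_year.idxmax()

print(per_year)
print("Busiest year:", busiest_year)
```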


13. Movies with Rating > 8 & Revenue > 100M

In [None]:
filtered_movies = df[(df['Rating'] > 8) & (df['Revenue (Millions)'] > 100)]
print("Movies with Rating > 8 & Revenue > 100M:\n", filtered_movies[['Movie_Title', 'Rating', 'Revenue (Millions)']].head())


Movies with Rating > 8 & Revenue > 100M:
                                    Movie_Title  Rating  Revenue (Millions)
0                      Guardians of the Galaxy     8.1              333.13
6                                   La La Land     8.3              151.06
36                                Interstellar     8.6              187.99
50  Star Wars: Episode VII - The Force Awakens     8.1              936.63
54                             The Dark Knight     9.0              533.32
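
The same filter can be written with `DataFrame.query`, where column names that are not valid Python identifiers are wrapped in backticks (this assumes pandas 1.0 or later). The sketch below uses made-up data and also shows that the strict `>` excludes rows sitting exactly on a threshold:

```python
import pandas as pd

# Toy data (made up): B sits exactly on the rating threshold,
# C exactly on the revenue threshold
toy = pd.DataFrame({
    "Movie_Title": ["A", "B", "C", "D"],
    "Rating": [8.1, 8.0, 9.0, 8.5],
    "Revenue (Millions)": [333.0, 500.0, 100.0, 150.0],
})

# Backticks let query() refer to a column name with spaces and parentheses
hits = toy.query("Rating > 8 and `Revenue (Millions)` > 100")
print(hits)
```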


14. Group by Director: Average Rating and Revenue

In [None]:
director_stats = df.groupby('Director').agg(
    avg_rating=('Rating', 'mean'),
    avg_revenue=('Revenue (Millions)', 'mean')
).reset_index()
print("Director Statistics:\n", director_stats.head())


Director Statistics:
               Director  avg_rating  avg_revenue
0           Aamir Khan         8.5        1.200
1  Abdellatif Kechiche         7.8        2.200
2            Adam Leon         6.5        0.000
3           Adam McKay         7.0      109.535
4        Adam Shankman         6.3       78.665
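
Averages for directors with a single movie in the data rest on one observation. Adding a movie count to the aggregation makes it possible to filter them out. This is a sketch on made-up data; `min_movies` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy data (made up): X directed two movies, Y one
toy = pd.DataFrame({
    "Director": ["X", "X", "Y"],
    "Rating": [8.0, 7.0, 9.0],
    "Revenue (Millions)": [100.0, 50.0, 10.0],
})

stats = toy.groupby("Director").agg(
    movie_count=("Rating", "count"),
    avg_rating=("Rating", "mean"),
    avg_revenue=("Revenue (Millions)", "mean"),
).reset_index()

# Hypothetical threshold: keep directors with at least this many movies
min_movies = 2
established = stats[stats["movie_count"] >= min_movies]
print(established)
```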


15. Group by Year: Count and Average Rating

In [None]:
year_stats = df.groupby('Year').agg(
    movie_count=('Movie_Title', 'count'),
    avg_rating=('Rating', 'mean')
).reset_index()
print("Yearly Statistics:\n", year_stats.head())


Yearly Statistics:
    Year  movie_count  avg_rating
0  2006           44    7.125000
1  2007           53    7.133962
2  2008           52    6.784615
3  2009           51    6.960784
4  2010           60    6.826667


16. Creating a 'Profitability' Column

In [None]:
# Revenue per minute of runtime ('Runtime (Minutes)' is already in minutes)
df['Profitability'] = df['Revenue (Millions)'] / df['Runtime (Minutes)']

# Top 10 most profitable movies
top_profitability = df[['Movie_Title', 'Profitability']].sort_values(by='Profitability', ascending=False).head(10)
print("Top 10 Most Profitable Movies:\n", top_profitability)

Top 10 Most Profitable Movies:
                                     Movie_Title  Profitability
50   Star Wars: Episode VII - The Force Awakens       6.886985
85                               Jurassic World       5.259516
119                                Finding Dory       5.013299
87                                       Avatar       4.694506
76                                 The Avengers       4.358601
15                      The Secret Life of Pets       4.233448
688                                 Toy Story 3       4.028932
12                                    Rogue One       4.001278
174                                      Frozen       3.928824
797                             Despicable Me 2       3.755612
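
The 'Profitability' column is revenue per minute of runtime, not profit, since the dataset has no budget column. If a runtime of 0 ever appeared, the division would produce `inf`. A defensive variant, sketched on made-up data, treats a zero runtime as missing; `Revenue_Per_Minute` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the last row has an invalid runtime of 0
toy = pd.DataFrame({
    "Revenue (Millions)": [300.0, 50.0, 20.0],
    "Runtime (Minutes)": [120, 100, 0],
})

# Treat a zero runtime as missing so the division yields NaN instead of inf
runtime = toy["Runtime (Minutes)"].replace(0, np.nan)
toy["Revenue_Per_Minute"] = toy["Revenue (Millions)"] / runtime
print(toy)
```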


17. Categorizing Movies by Rating

In [None]:
def categorize_rating(rating):
    if rating < 6:
        return 'Poor'
    elif 6 <= rating < 7.5:
        return 'Average'
    else:
        return 'Excellent'

df['Rating_Category'] = df['Rating'].apply(categorize_rating)
print("Movies with Rating Categories:\n", df[['Movie_Title', 'Rating', 'Rating_Category']].head())


Movies with Rating Categories:
                Movie_Title  Rating Rating_Category
0  Guardians of the Galaxy     8.1       Excellent
1               Prometheus     7.0         Average
2                    Split     7.3         Average
3                     Sing     7.2         Average
4            Suicide Squad     6.2         Average
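
The same three bands can be expressed with `pd.cut` instead of a row-by-row function. The sketch below uses made-up ratings that sit on and around the boundaries; `right=False` makes each bin include its left edge, which matches the `<` comparisons in `categorize_rating`.

```python
import numpy as np
import pandas as pd

# Made-up ratings chosen to sit on and around the band boundaries
ratings = pd.Series([5.9, 6.0, 7.4, 7.5, 8.1])

# Bins: [-inf, 6) -> Poor, [6, 7.5) -> Average, [7.5, inf) -> Excellent
categories = pd.cut(
    ratings,
    bins=[-np.inf, 6, 7.5, np.inf],
    labels=["Poor", "Average", "Excellent"],
    right=False,
)
print(categories)
```

One difference worth knowing: `pd.cut` leaves a missing rating as missing, whereas `categorize_rating` would fall through to 'Excellent' because every comparison with NaN is false.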


18. Custom Scoring Function

In [None]:
df['Custom_Score'] = df['Rating'] * 10 + df['Revenue (Millions)']
print("Movies with Custom Scores:\n", df[['Movie_Title', 'Custom_Score']].head())


Movies with Custom Scores:
                Movie_Title  Custom_Score
0  Guardians of the Galaxy        414.13
1               Prometheus        196.46
2                    Split        211.12
3                     Sing        342.32
4            Suicide Squad        387.02
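
In this score, `Rating * 10` is at most 100 while revenue reaches several hundred, so revenue dominates the ranking. If both factors should carry comparable weight, one option is to rescale each to the range 0 to 1 first. This is a sketch on made-up data; `min_max`, `Balanced_Score`, and the equal weights are assumptions for illustration, not part of the original task.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "Rating": [6.0, 8.0, 9.0],
    "Revenue (Millions)": [0.0, 500.0, 1000.0],
})

def min_max(series):
    """Rescale a numeric Series to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

# Equal weights (an arbitrary choice) on the two rescaled columns
toy["Balanced_Score"] = (
    0.5 * min_max(toy["Rating"]) + 0.5 * min_max(toy["Revenue (Millions)"])
)
print(toy)
```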
