# Case Study: Movie Data Analysis
This case study introduces data analysis with pandas using a real-world dataset from the
MovieLens website: the MovieLens 20 million dataset. You will learn how to explore, clean,
and analyze large datasets effectively using Python.


## Step 1: Inspect the dataset

In [11]:
import pandas as pd

# Load the files
movies_df = pd.read_csv("CaseStudy-Movie/movies.csv")
ratings_df = pd.read_csv("CaseStudy-Movie/ratings.csv")
tags_df = pd.read_csv("CaseStudy-Movie/tags.csv")

print("Movies Data:")
display(movies_df.head())  

print("\nRatings Data:")
display(ratings_df.head())

print("\nTags Data:")
display(tags_df.head())

Movies Data:


   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy



Ratings Data:


   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580



Tags Data:


   userId  movieId            tag   timestamp
0      18     4141    Mark Waters  1240597180
1      65      208      dark hero  1368150078
2      65      353      dark hero  1368150079
3      65      521  noir thriller  1368149983
4      65      592      dark hero  1368150078
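
For files of this size, it can help to read only the columns you need and to pick compact dtypes at load time. The sketch below is illustrative only: it uses a small in-memory stand-in for `ratings.csv` (the `csv_text` string and the name `sample_df` are assumptions, not part of the case study), and the chosen dtypes are one reasonable option rather than a recommendation from the dataset's authors.

```python
import io
import pandas as pd

# Hypothetical stand-in for ratings.csv; the real file is much larger.
csv_text = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
"""

# Read only the columns that are needed, with compact dtypes.
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["userId", "movieId", "rating"],
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

print(sample_df.dtypes)
print(sample_df.shape)
```

Skipping `timestamp` at load time would also make the later `del` step unnecessary.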


In [14]:
# Count the number of rows in movies.csv
print(f" Number of rows in movie {len(movies_df)}")
# Count the number of rows in ratings.csv
print(f" Number of rows in ratings {len(ratings_df)}")

 Number of rows in movie 27278
 Number of rows in ratings 3981986


In [15]:
# Data preprocessing

# 1. Remove the timestamp columns
del ratings_df['timestamp']
del tags_df['timestamp']

print("\nRatings Data:")
display(ratings_df.head())

print("\nTags Data:")
display(tags_df.head())
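
`del` modifies the DataFrame in place, so re-running the cell raises a `KeyError` once the column is gone. A non-mutating alternative is `DataFrame.drop`, sketched here on a tiny hand-made frame (the names `ratings_sample` and `trimmed` are illustrative assumptions).

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [2, 29],
    "rating": [3.5, 3.5],
    "timestamp": [1112486027, 1112484676],
})

# drop() returns a new DataFrame; ratings_sample keeps its timestamp column.
trimmed = ratings_sample.drop(columns=["timestamp"])

print(trimmed.columns.tolist())
print(ratings_sample.columns.tolist())
```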

## Pandas Structure - Series
A pandas Series is a one-dimensional labeled array capable of holding data of any type
(integers, strings, floating-point numbers, etc.). The axis labels are collectively
referred to as the index.
Understanding Series is foundational for row-wise operations in pandas. For example:
• Extracting a row as a Series to inspect or manipulate its data.
• Renaming rows for better interpretability in logs or reports.

In [17]:
# When a row is extracted from a DataFrame, it is returned as a Series object.
row_0 = tags_df.iloc[0]
print(type(row_0))
print(row_0)


<class 'pandas.core.series.Series'>
userId              18
movieId           4141
tag        Mark Waters
Name: 0, dtype: object
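
The renaming use case mentioned above can be sketched as follows. This is a minimal example on a two-row stand-in for `tags_df`; the new labels (`first_tag`, `user`, `movie`) are hypothetical names chosen for illustration.

```python
import pandas as pd

tags_sample = pd.DataFrame({
    "userId": [18, 65],
    "movieId": [4141, 208],
    "tag": ["Mark Waters", "dark hero"],
})

row = tags_sample.iloc[0]       # one row, returned as a Series
print(row.index.tolist())       # the column names become the Series index
print(row["tag"])               # label-based access to a single value

# A scalar argument renames the Series itself (the "Name:" shown when printed).
named = row.rename("first_tag")
print(named.name)

# A dict argument renames individual index labels instead.
relabeled = row.rename({"userId": "user", "movieId": "movie"})
print(relabeled.index.tolist())
```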


## Descriptive Statistics

In [23]:
# Use describe() to summarize the numerical columns.
# The output includes count, mean, std, min/max, and percentiles.

print(ratings_df.describe())

print("\n")
# Use mode() to find the most frequently occurring value in each column.
ratings_df.mode()

             userId       movieId        rating
count  3.981986e+06  3.981986e+06  3.981986e+06
mean   1.352100e+04  9.052952e+03  3.518330e+00
std    7.842245e+03  1.973555e+04  1.053910e+00
min    1.000000e+00  1.000000e+00  5.000000e-01
25%    6.719000e+03  9.080000e+02  3.000000e+00
50%    1.349900e+04  2.174000e+03  3.500000e+00
75%    2.026200e+04  4.792000e+03  4.000000e+00
max    2.710200e+04  1.310150e+05  5.000000e+00




   userId  movieId  rating
0    8405      296     4.0
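
`mode()` returns a Series or DataFrame rather than a single number, because several values can tie for most frequent. The sketch below, on made-up ratings, shows a tie and uses `value_counts()` to expose the full frequency table behind it.

```python
import pandas as pd

ratings_sample = pd.DataFrame({"rating": [4.0, 4.0, 3.0, 3.0, 5.0]})

# Two values tie for most frequent, so mode() returns two entries.
modes = ratings_sample["rating"].mode()
print(modes.tolist())

# value_counts() shows how often each distinct rating occurs.
counts = ratings_sample["rating"].value_counts()
print(counts)
```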


In [24]:
# Correlation analysis using .corr()
# Measures the pairwise correlation between numerical columns.
# Correlation values range from -1 (strong negative) to +1 (strong positive).

print(ratings_df.corr())

           userId   movieId    rating
userId   1.000000  0.014892 -0.001913
movieId  0.014892  1.000000  0.003052
rating  -0.001913  0.003052  1.000000
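
The near-zero values above are plausible because `userId` and `movieId` are identifiers, so no linear relationship with `rating` would be expected. To see what the extremes of the scale look like, here is a small sketch on synthetic columns (all names are illustrative) where the relationship is exactly linear.

```python
import pandas as pd

demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
demo["double_x"] = 2 * demo["x"]   # rises in step with x
demo["minus_x"] = -demo["x"]       # falls as x rises

# Pearson correlation is the default method for DataFrame.corr().
corr = demo.corr()
print(corr)
```

Here `x` and `double_x` correlate at +1, while `x` and `minus_x` correlate at -1.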


In [26]:
# Filter for ratings greater than 0
# (every row qualifies, because the minimum rating is 0.5)

filter_movies = ratings_df.loc[ratings_df['rating'] > 0]
print(filter_movies)

         userId  movieId  rating
0             1        2     3.5
1             1       29     3.5
2             1       32     3.5
3             1       47     3.5
4             1       50     3.5
...         ...      ...     ...
3981981   27102     2268     3.5
3981982   27102     2273     2.5
3981983   27102     2279     1.0
3981984   27102     2294     3.0
3981985   27102     2318     3.5

[3981986 rows x 3 columns]


In [27]:
# Filter for ratings less than 1 (that is, the 0.5 ratings)

filter_movies = ratings_df.loc[ratings_df['rating'] < 1]
print(filter_movies)

         userId  movieId  rating
982          11      286     0.5
1020         11      671     0.5
1038         11     1077     0.5
1082         11     1977     0.5
1092         11     2107     0.5
...         ...      ...     ...
3981658   27100     6013     0.5
3981667   27100     6464     0.5
3981672   27100     6827     0.5
3981868   27102      788     0.5
3981949   27102     1862     0.5

[48340 rows x 3 columns]
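
Boolean filters can also be combined. The following sketch uses a five-row stand-in for `ratings_df` (the data and variable names are made up for illustration) to show `&` with parenthesized conditions and the `between` helper.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [2, 29, 2, 50, 29],
    "rating": [3.5, 0.5, 4.0, 5.0, 2.0],
})

# Each condition needs its own parentheses when combined with & or |.
high_for_movie_2 = ratings_sample[
    (ratings_sample["rating"] >= 3.5) & (ratings_sample["movieId"] == 2)
]
print(high_for_movie_2)

# between() is inclusive on both ends by default.
mid_range = ratings_sample[ratings_sample["rating"].between(2.0, 4.0)]
print(mid_range)
```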


In [29]:
# Grouping and Aggregation

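# Group by movieId and calculate the mean.
# Purpose: find the average rating for each movie.
# Note: mean() is applied to every remaining column, so the output also
# contains an averaged userId, which has no meaningful interpretation.

movie_avg = ratings_df.groupby("movieId").mean()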
print(movie_avg)

               userId    rating
movieId                        
1        13509.197104  3.913517
2        13471.614422  3.200409
3        13496.324670  3.149340
4        13927.104982  2.876335
5        13244.477801  3.072727
...               ...       ...
130976   26497.000000  2.500000
130978   26497.000000  5.000000
130980   24036.000000  4.000000
130982   24036.000000  3.000000
131015   25978.000000  2.500000

[19365 rows x 2 columns]
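
The averaged `userId` column in the output above is a side effect of calling `mean()` on the whole group. One way to avoid it, sketched here on a small made-up frame, is to select the `rating` column before aggregating.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "userId": [1, 2, 3, 1],
    "movieId": [10, 10, 20, 20],
    "rating": [4.0, 3.0, 5.0, 4.0],
})

# Selecting the column first means only rating is averaged.
movie_avg = ratings_sample.groupby("movieId")["rating"].mean()
print(movie_avg)
```

The result is a Series indexed by `movieId`, which is usually easier to sort or join than a two-column frame.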


## Key Takeaways
1. Descriptive Statistics: Helps summarize the central tendency, spread, and shape of the
data distribution.
2. Mode Analysis: Useful for identifying the most common rating in the dataset.
3. Correlation: Gives insight into relationships between numerical columns.
4. Filtering and Grouping: Enables specific data analysis, such as focusing on movies
with positive ratings or calculating average ratings.

## Data Cleaning

In [31]:
# Check the dataset shape

print(movies_df.shape)  # output: 27,278 rows and 3 columns

(27278, 3)


In [32]:
# Check for null values

print(movies_df.isnull().any()) #none

print(ratings_df.isnull().any()) #none

print(tags_df.isnull().any()) # some nulls

movieId    False
title      False
genres     False
dtype: bool
userId     False
movieId    False
rating     False
dtype: bool
userId     False
movieId    False
tag         True
dtype: bool


In [33]:
# Handle the nulls by dropping them

tags_df = tags_df.dropna()

print(tags_df.isnull().any())

print(tags_df.shape)

userId     False
movieId    False
tag        False
dtype: bool
(465548, 3)
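
`isnull().any()` only says whether a column has missing values. To see how many, and to drop rows based on one column only, something like the following sketch can be used (the three-row `tags_sample` frame is a made-up stand-in for `tags_df`).

```python
import pandas as pd

tags_sample = pd.DataFrame({
    "userId": [18, 65, 65],
    "movieId": [4141, 208, 353],
    "tag": ["Mark Waters", None, "dark hero"],
})

# Summing the boolean mask counts the missing values in each column.
null_counts = tags_sample.isnull().sum()
print(null_counts)

# Drop only the rows where the tag itself is missing.
cleaned = tags_sample.dropna(subset=["tag"])
print(cleaned.shape)
```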


# Advanced Data Analysis with Pandas

In [35]:
# Filtering

is_highly_rated = ratings_df['rating'] >= 4.0
print(ratings_df[is_highly_rated].head())

print("\n")
is_animation = movies_df['genres'].str.contains('Animation')
print(movies_df[is_animation].head(10))

print("\n")


    userId  movieId  rating
6        1      151     4.0
7        1      223     4.0
8        1      253     4.0
9        1      260     4.0
10       1      293     4.0


     movieId                                       title  \
0          1                            Toy Story (1995)   
12        13                                Balto (1995)   
47        48                           Pocahontas (1995)   
236      239                       Goofy Movie, A (1995)   
241      244                     Gumby: The Movie (1995)   
310      313                   Swan Princess, The (1994)   
360      364                       Lion King, The (1994)   
388      392  Secret Adventures of Tom Thumb, The (1993)   
547      551      Nightmare Before Christmas, The (1993)   
553      558                      Pagemaster, The (1994)   

                                              genres  
0        Adventure|Animation|Children|Comedy|Fantasy  
12                      Adventure|Animation|Children  
47  

In [39]:
# Group by and aggregate

# Count the number of occurrences of each rating
ratings_count = ratings_df[['movieId', 'rating']].groupby('rating').count()
print(ratings_count)
print("\n")

# Compute the average rating for each movie
avg_rating = ratings_df[['movieId', 'rating']].groupby('movieId').mean()
print(avg_rating)
print("\n")

#find movies with an average rating of 5.0
print(avg_rating.loc[avg_rating.rating == 5.0].head())

        movieId
rating         
0.5       48340
1.0      137160
1.5       55523
2.0      289204
2.5      178387
3.0      857403
3.5      437655
4.0     1103468
4.5      302309
5.0      572537


           rating
movieId          
1        3.913517
2        3.200409
3        3.149340
4        2.876335
5        3.072727
...           ...
130976   2.500000
130978   5.000000
130980   4.000000
130982   3.000000
131015   2.500000

[19365 rows x 1 columns]


         rating
movieId        
7145        5.0
7356        5.0
7447        5.0
7950        5.0
8536        5.0
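
An average of 5.0 can come from a single rating, so it is often worth looking at the rating count alongside the mean. The sketch below uses named aggregation on made-up data; the column names `avg_rating` and `num_ratings` and the threshold `min_ratings` are arbitrary choices for illustration, not values from the case study.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "movieId": [10, 10, 10, 20],
    "rating": [4.0, 5.0, 3.0, 5.0],
})

# Named aggregation: one column for the mean, one for the number of ratings.
stats = ratings_sample.groupby("movieId").agg(
    avg_rating=("rating", "mean"),
    num_ratings=("rating", "count"),
)
print(stats)

# Keep only movies with at least min_ratings ratings (arbitrary threshold).
min_ratings = 3
well_rated = stats[stats["num_ratings"] >= min_ratings]
print(well_rated)
```

In this toy data, movie 20 has a perfect average but only one rating, so it is filtered out.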


# Vectorized String Operations in pandas


In [40]:
# The genres column contains multiple genres separated by a |
# Split this column into multiple columns, one genre per column

movie_genres = movies_df['genres'].str.split('|', expand=True)
movie_genres.head(10)

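           0          1         2       3        4     5     6     7     8     9
0  Adventure  Animation  Children  Comedy  Fantasy  None  None  None  None  None
1  Adventure   Children   Fantasy    None     None  None  None  None  None  None
2     Comedy    Romance      None    None     None  None  None  None  None  None
3     Comedy      Drama   Romance    None     None  None  None  None  None  None
4     Comedy       None      None    None     None  None  None  None  None  None
5     Action      Crime  Thriller    None     None  None  None  None  None  None
6     Comedy    Romance      None    None     None  None  None  None  None  None
7  Adventure   Children      None    None     None  None  None  None  None  None
8     Action       None      None    None     None  None  None  None  None  None
9     Action  Adventure  Thriller    None     None  None  None  None  None  None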
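
The wide layout above leaves most cells empty. If the goal is to count how often each genre appears, an alternative is to keep the split result as lists and flatten it with `explode()`. This is a sketch on three rows copied from the movies output; `movies_sample` and the other names are illustrative.

```python
import pandas as pd

movies_sample = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Without expand=True, each cell holds a list of genres.
genre_lists = movies_sample["genres"].str.split("|")

# explode() gives every list element its own row; value_counts() tallies them.
genre_counts = genre_lists.explode().value_counts()
print(genre_counts)
```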


In [42]:
#adding a flag for a specific genre

movie_genres['IsComedy'] = movies_df['genres'].str.contains('Comedy')
movie_genres.head()

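           0          1         2       3        4     5     6     7     8     9  IsComedy
0  Adventure  Animation  Children  Comedy  Fantasy  None  None  None  None  None      True
1  Adventure   Children   Fantasy    None     None  None  None  None  None  None     False
2     Comedy    Romance      None    None     None  None  None  None  None  None      True
3     Comedy      Drama   Romance    None     None  None  None  None  None  None      True
4     Comedy       None      None    None     None  None  None  None  None  None      True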
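
The single `IsComedy` flag can be generalized to one indicator column per genre with `str.get_dummies`. The sketch below runs on three genre strings taken from the output above; the variable names are illustrative.

```python
import pandas as pd

genres = pd.Series([
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
])

# One 0/1 indicator column per distinct genre, split on the | separator.
genre_flags = genres.str.get_dummies(sep="|")
print(genre_flags)
```

Because the values are 0 and 1 rather than booleans, `genre_flags.sum()` directly gives the number of movies per genre.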


In [43]:
# Extract the release year from the movie titles

movies_df['year'] = movies_df['title'].str.extract(r'\((\d{4})\)', expand=True)
# This adds a new column called year to the dataset

movies_df[['title','year']].head()

                                title  year
0                    Toy Story (1995)  1995
1                      Jumanji (1995)  1995
2             Grumpier Old Men (1995)  1995
3            Waiting to Exhale (1995)  1995
4  Father of the Bride Part II (1995)  1995
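
The extracted year is stored as text, and any title without a `(YYYY)` suffix yields a missing value. One possible follow-up, sketched here, converts the column to a nullable integer type. The third title is a hypothetical example added to show the missing-value case, and using `Int64` is an assumption about what the analysis needs.

```python
import pandas as pd

titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Project"])

# expand=False returns a Series; titles without "(YYYY)" become missing.
year_text = titles.str.extract(r"\((\d{4})\)", expand=False)

# Convert to numbers; the nullable Int64 dtype keeps missing years as <NA>.
year = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(year)
```

With a numeric year, comparisons such as `year >= 1990` or grouping by decade become straightforward.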
