                              **BIG DATA ANALYSIS**

In [1]:
!pip install pyspark



IMPORTING REQUIRED LIBRARY

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Movie Rating Big Data Analysis") \
    .getOrCreate()

In [3]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


IMPORTING THE DATASET

In [4]:
df = spark.read.csv(
    "/content/drive/MyDrive/Netflix_User_Ratings.csv",
    header=True,
    inferSchema=True
)

Preview data

In [5]:
df.show(5)
df.printSchema()


+-------+------+----------+-------+
| CustId|Rating|      Date|MovieId|
+-------+------+----------+-------+
|1488844|     3|2005-09-06|      1|
| 822109|     5|2005-05-13|      1|
| 885013|     4|2005-10-19|      1|
|  30878|     4|2005-12-26|      1|
| 823519|     3|2004-05-03|      1|
+-------+------+----------+-------+
only showing top 5 rows
root
 |-- CustId: integer (nullable = true)
 |-- Rating: integer (nullable = true)
 |-- Date: date (nullable = true)
 |-- MovieId: integer (nullable = true)



Dataset Understanding

In [6]:
print("Total records:", df.count())
print("Total users:", df.select("CustId").distinct().count())
print("Total movies:", df.select("MovieId").distinct().count())


Total records: 100480507
Total users: 480189
Total movies: 17770


                              Data Cleaning

Remove rows with missing values, then check that every rating falls in the expected 1-5 range:

In [7]:
df_clean = df.dropna()


In [8]:
#Check rating range:
df_clean.groupBy("Rating").count().show()


+------+--------+
|Rating|   count|
+------+--------+
|     1| 4617990|
|     3|28811247|
|     5|23168232|
|     4|33750958|
|     2|10132080|
+------+--------+



Convert Date Column

In [9]:
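from pyspark.sql.functions import to_date

# inferSchema already parsed Date as a date type (see printSchema above),
# so this conversion is a safeguard and leaves the column unchanged.
df_clean = df_clean.withColumn("Date", to_date("Date"))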


             Exploratory Data Analysis (EDA)

In [10]:
# Rating Distribution
df_clean.groupBy("Rating").count().orderBy("Rating").show()


+------+--------+
|Rating|   count|
+------+--------+
|     1| 4617990|
|     2|10132080|
|     3|28811247|
|     4|33750958|
|     5|23168232|
+------+--------+



In [11]:
#Average Rating per Movie
from pyspark.sql.functions import avg

avg_rating = df_clean.groupBy("MovieId") \
    .agg(avg("Rating").alias("Avg_Rating"))

avg_rating.show(7)


+-------+------------------+
|MovieId|        Avg_Rating|
+-------+------------------+
|    148| 3.304947283049473|
|    463| 4.092738407699038|
|    471|4.0713333333333335|
|    496|3.7106382978723405|
|    833| 3.516949152542373|
|    243|3.0252100840336134|
|    392|3.2333333333333334|
+-------+------------------+
only showing top 7 rows


In [12]:
#Top 10 Highest Rated Movies
avg_rating.orderBy("Avg_Rating", ascending=False).show(10)


+-------+------------------+
|MovieId|        Avg_Rating|
+-------+------------------+
|  14961| 4.723269925683507|
|   7230| 4.716610825093296|
|   7057| 4.702611063648014|
|   3456|4.6709891019450955|
|   9864| 4.638809387521466|
|  15538| 4.605021432945499|
|   8964|               4.6|
|  14791|               4.6|
|  10464| 4.595505617977528|
|  14550| 4.593383932407275|
+-------+------------------+
only showing top 10 rows


User Behavior Analysis

In [13]:
#Most Active Users
df_clean.groupBy("CustId").count() \
    .orderBy("count", ascending=False) \
    .show(10)


+-------+-----+
| CustId|count|
+-------+-----+
| 305344|17653|
| 387418|17436|
|2439493|16565|
|1664010|15813|
|2118461|14831|
|1461435| 9822|
|1639792| 9767|
|1314869| 9740|
|2606799| 9064|
|1932594| 8880|
+-------+-----+
only showing top 10 rows


Time-Based Analysis

In [14]:
#Ratings Over Years
from pyspark.sql.functions import year

df_clean.withColumn("Year", year("Date")) \
    .groupBy("Year") \
    .count() \
    .orderBy("Year") \
    .show()

+----+--------+
|Year|   count|
+----+--------+
|1999|    2178|
|2000|  924443|
|2001| 1769031|
|2002| 4342871|
|2003| 9985337|
|2004|30206574|
|2005|53250073|
+----+--------+



Scalability Demonstration

In [15]:

df_clean.rdd.getNumPartitions()

21

✔️ Explain:

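Spark splits the DataFrame into partitions (21 here) and processes them in parallel.

This makes it suitable for datasets with millions of records, such as the roughly 100 million ratings used here.

For datasets too large to fit comfortably in one machine's memory, Spark can be faster than pandas, which loads and processes data in memory on a single machine.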

In [16]:
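# Save the per-movie average ratings as CSV.
# mode("overwrite") lets the cell be re-run without failing on an existing output path.
avg_rating.write.mode("overwrite").csv("avg_movie_ratings", header=True)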

In [17]:
#Stop Spark Session
spark.stop()

📌 Insights:



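About 85% of ratings are 3, 4, or 5 stars, indicating generally positive user sentiment.

The highest-rated movies average above 4.5 stars, although this ranking does not account for how many ratings each movie received.

A small group of users is highly active: the most active user submitted 17,653 ratings, while the average across all 480,189 users is about 209.

Rating volume grew every year from 1999 to 2005; 2004 and 2005 together account for about 83% of all ratings.

PySpark processed more than 100 million records in this notebook, which demonstrates that the approach scales to large datasets.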
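The figures behind these insights can be checked with plain arithmetic on the counts printed earlier in the notebook. The sketch below copies those counts by hand; the variable names are only illustrative.

```python
# Counts copied from the outputs printed earlier in this notebook.
rating_counts = {1: 4617990, 2: 10132080, 3: 28811247, 4: 33750958, 5: 23168232}
year_counts = {
    1999: 2178, 2000: 924443, 2001: 1769031, 2002: 4342871,
    2003: 9985337, 2004: 30206574, 2005: 53250073,
}
top10_user_counts = [17653, 17436, 16565, 15813, 14831, 9822, 9767, 9740, 9064, 8880]
total_users = 480189

total = sum(rating_counts.values())

# Share of ratings that are 3, 4, or 5 stars.
share_3_to_5 = sum(rating_counts[r] for r in (3, 4, 5)) / total

# Share of ratings submitted in 2004 and 2005.
share_2004_2005 = (year_counts[2004] + year_counts[2005]) / total

# Share of all ratings contributed by the ten most active users.
share_top10_users = sum(top10_user_counts) / total

# Mean number of ratings per user.
ratings_per_user = total / total_users

print(f"3-5 stars: {share_3_to_5:.1%}")
print(f"2004-2005: {share_2004_2005:.1%}")
print(f"top 10 users: {share_top10_users:.2%}")
print(f"ratings per user: {ratings_per_user:.0f}")
```

The ten most active users account for well under 1% of all ratings, so heavy raters stand out individually without dominating the dataset.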