# Regression Models on MovieLens

## Dataset

We use the MovieLens `ml-latest-small` dataset, which GroupLens distributes as a public ZIP archive.

**Download:** http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

**Files used from the archive:**

- ratings.csv → columns: userId, movieId, rating, timestamp
- movies.csv → columns: movieId, title, genres

In [1]:
import requests
import pandas as pd
from zipfile import ZipFile
from io import BytesIO

## Download & Load the Data

In [2]:
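url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
r = requests.get(url)
r.raise_for_status()  # stop on an HTTP error instead of parsing an error page as a ZIP
z = ZipFile(BytesIO(r.content))  # open the archive directly from memory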

ratings = pd.read_csv(z.open('ml-latest-small/ratings.csv'))
movies = pd.read_csv(z.open('ml-latest-small/movies.csv'))
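
The loading step above assumes that the archive contains the expected member paths. A minimal sketch of a more defensive reader follows; `read_csv_from_zip` is a hypothetical helper (not part of the notebook), and the tiny in-memory archive is a stand-in so that the example runs without network access.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd


def read_csv_from_zip(zip_bytes: bytes, member: str) -> pd.DataFrame:
    """Read one CSV member from a ZIP archive held in memory (hypothetical helper)."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        if member not in archive.namelist():
            raise KeyError(f"{member!r} not found; archive contains {archive.namelist()}")
        with archive.open(member) as fh:
            return pd.read_csv(fh)


# Build a tiny stand-in archive (two rows copied from ratings.csv).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "ml-latest-small/ratings.csv",
        "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n1,3,4.0,964981247\n",
    )

demo_ratings = read_csv_from_zip(buffer.getvalue(), "ml-latest-small/ratings.csv")
print(demo_ratings)
```

With the real download, the bytes in `r.content` would be passed as `zip_bytes`.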

In [3]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [4]:
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


## Data Preparation

1. Merge ratings and movies on movieId.

In [5]:
data = pd.merge(ratings, movies, on='movieId')
data.head()

| | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 1 | 3 | 4.0 | 964981247 | Grumpier Old Men (1995) | Comedy\|Romance |
| 2 | 1 | 6 | 4.0 | 964982224 | Heat (1995) | Action\|Crime\|Thriller |
| 3 | 1 | 47 | 5.0 | 964983815 | Seven (a.k.a. Se7en) (1995) | Mystery\|Thriller |
| 4 | 1 | 50 | 5.0 | 964982931 | Usual Suspects, The (1995) | Crime\|Mystery\|Thriller |
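
`pd.merge` defaults to an inner join, so any rating whose `movieId` is missing from `movies` would be dropped without warning. The sketch below shows one way to check for that, using the `validate` and `indicator` parameters of `pd.merge`. The toy frames and their values (including `movieId` 99) are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy frames standing in for `ratings` and `movies` (illustrative values).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 99],  # movieId 99 has no entry in toy_movies
    "rating": [4.0, 4.0, 3.5],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
})

# validate="many_to_one" raises MergeError if movieId is duplicated in toy_movies.
# how="left" with indicator=True keeps unmatched ratings and marks them "left_only".
checked = pd.merge(
    toy_ratings, toy_movies,
    on="movieId", how="left", validate="many_to_one", indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} rating(s) without a matching movie")
```

If `unmatched` is empty on the real data, the inner join used in this notebook loses no ratings.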


2. Convert timestamp into datetime and extract year and month.

In [6]:
# Convert the whole column from Unix epoch seconds; take .head() only when printing.
timestamp_col = pd.to_datetime(data['timestamp'], unit='s')
print("timestamp=", timestamp_col.head(), "\n")

print("year=", timestamp_col.dt.year.head(), "\n")
print("month=", timestamp_col.dt.month.head(), "\n")

timestamp= 0   2000-07-30 18:45:03
1   2000-07-30 18:20:47
2   2000-07-30 18:37:04
3   2000-07-30 19:03:35
4   2000-07-30 18:48:51
Name: timestamp, dtype: datetime64[ns] 

year= 0    2000
1    2000
2    2000
3    2000
4    2000
Name: timestamp, dtype: int32 

month= 0    7
1    7
2    7
3    7
4    7
Name: timestamp, dtype: int32 
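
The cell above only prints the converted values. To use year and month as model features, they would normally be stored as columns. A minimal sketch follows, assuming the column names `year` and `month` (a choice made here, not taken from the notebook) and reusing the first three timestamps shown earlier.

```python
import pandas as pd

# First three timestamps from the ratings preview above.
toy = pd.DataFrame({"timestamp": [964982703, 964981247, 964982224]})

# Unix epoch seconds -> timezone-naive datetime64 values.
rated_at = pd.to_datetime(toy["timestamp"], unit="s")

# assign() returns a new frame; `toy` itself is left unchanged.
with_dates = toy.assign(year=rated_at.dt.year, month=rated_at.dt.month)
print(with_dates)
```

Keeping the raw `timestamp` column alongside the derived ones makes it easy to extract further features later.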



3. Aggregate at the movie level:

In [7]:
# `data` is already a DataFrame; take an explicit copy so later edits do not alter `data`.
movies_df = data.copy()

- Average rating per movie

In [10]:
avg_ratings = data.groupby(['movieId', 'title'])['rating'].mean().reset_index()
avg_ratings.columns = ['movieId', 'title', 'avg_rating']
print(avg_ratings[['title', 'avg_rating']].to_string(index=False))

                   title  avg_rating
        Toy Story (1995)    3.920930
          Jumanji (1995)    3.431818
 Grumpier Old Men (1995)    3.259615
Waiting to Exhale (1995)    2.357143
Father of the Bride P
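
The average can also be computed with named aggregation, which names the output column directly instead of reassigning `columns` by position. This is a sketch on a toy frame; the titles and ratings are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the merged `data` frame (illustrative values).
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 2, 2],
    "title": ["A (1995)", "A (1995)", "B (1995)", "B (1995)", "B (1995)"],
    "rating": [4.0, 5.0, 3.0, 3.5, 4.0],
})

# as_index=False keeps movieId and title as ordinary columns,
# so no reset_index() or manual renaming is needed.
avg = (
    toy.groupby(["movieId", "title"], as_index=False)
       .agg(avg_rating=("rating", "mean"))
)
print(avg.head())  # preview a few rows rather than printing every movie
```

Printing `head()` instead of the full `to_string()` output keeps the cell output short on the real dataset.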

- Number of ratings per movie