# Regression Models on MovieLens

## Dataset

We use the MovieLens `ml-latest-small` dataset, which GroupLens distributes as a public ZIP archive.

**Download:** http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

**Files used from the archive:**

- ratings.csv → columns: userId, movieId, rating, timestamp
- movies.csv → columns: movieId, title, genres

In [None]:
import requests
import pandas as pd
from zipfile import ZipFile
from io import BytesIO

## Download & Load the Data

In [None]:
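url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
r = requests.get(url, timeout=60)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page as a ZIP
z = ZipFile(BytesIO(r.content))  # open the archive in memory, nothing is written to disk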

ratings = pd.read_csv(z.open('ml-latest-small/ratings.csv'))
movies = pd.read_csv(z.open('ml-latest-small/movies.csv'))

In [None]:
ratings.head()

In [None]:
movies.head()
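
The in-memory ZIP pattern used above can be tried without a network connection. The sketch below builds a tiny archive in a `BytesIO` buffer and reads a CSV back out of it; the archive path `demo/ratings.csv` and the single data row are made up for illustration and are not part of MovieLens.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Build a small ZIP archive entirely in memory (hypothetical path and contents).
buffer = BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr(
        "demo/ratings.csv",
        "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n",
    )

# Re-open the same bytes for reading, as one would with a downloaded archive.
buffer.seek(0)
with ZipFile(buffer) as zf:
    names = zf.namelist()  # useful for discovering the paths inside an archive
    with zf.open("demo/ratings.csv") as fh:
        demo_ratings = pd.read_csv(fh)

print(names)
print(demo_ratings)
```

Calling `namelist()` first is a cheap way to confirm the exact member paths before passing them to `open()`.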

## Data Preparation

1. Merge ratings and movies on movieId.

In [None]:
data = pd.merge(ratings, movies, on='movieId')
data.head()
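
A merge like this is expected to be many-to-one: many ratings point at one movie. As a hedged sketch, pandas can check that assumption with the `validate` argument of `pd.merge`; the toy frames below (`ratings_toy`, `movies_toy`) are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (values are made up).
ratings_toy = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.5, 5.0]}
)
movies_toy = pd.DataFrame(
    {"movieId": [10, 20], "title": ["A", "B"], "genres": ["Drama", "Comedy"]}
)

# validate="many_to_one" raises pandas.errors.MergeError if movieId is
# duplicated on the right-hand side, which would silently multiply rows.
merged_toy = pd.merge(ratings_toy, movies_toy, on="movieId", validate="many_to_one")

# With an inner join, a lower row count would mean some ratings had no movie.
rows_kept = len(merged_toy) == len(ratings_toy)
print(merged_toy)
print("all ratings kept:", rows_kept)
```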

2. Convert timestamp into datetime and extract year and month.

In [None]:
# Convert the full column of Unix timestamps (seconds); .head() is applied only when printing
timestamp_col = pd.to_datetime(data['timestamp'], unit='s')
print("timestamp=", timestamp_col.head(), "\n")

print("year=", timestamp_col.dt.year.head(), "\n")
print("month=", timestamp_col.dt.month.head(), "\n")
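
To use the year and month as features later, they can be stored as columns instead of only being printed. A minimal sketch on a toy frame, assuming the column names `rated_at`, `year`, and `month` (these names are illustrative, not taken from the dataset):

```python
import pandas as pd

# Two arbitrary Unix timestamps in seconds.
toy = pd.DataFrame({"timestamp": [0, 964982703]})

# unit="s" tells pandas the integers count seconds since 1970-01-01 UTC.
toy["rated_at"] = pd.to_datetime(toy["timestamp"], unit="s")

# The .dt accessor exposes calendar fields of a datetime column.
toy["year"] = toy["rated_at"].dt.year
toy["month"] = toy["rated_at"].dt.month

print(toy)
```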

3. Aggregate at the movie level:

In [None]:
movies_df = data.copy()  # data is already a DataFrame; copy it so later steps leave data unchanged

- Average rating per movie

In [None]:
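# Mean rating per movie, named avg_rating so it does not collide with the per-rating 'rating' column
avg_ratings = movies_df.groupby('movieId')['rating'].mean().rename('avg_rating').reset_index()
# Join on movieId only; without on=, pandas would also join on the shared 'rating' column
merged = pd.merge(movies_df, avg_ratings, on='movieId', how='left')
print(merged.head())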
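
When a per-movie average is attached back to the ratings table, the join key deserves care. If the aggregate keeps the name `rating`, a merge without `on=` joins on every shared column, here both `movieId` and `rating`. The sketch below shows the difference on toy data; the frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Three toy ratings: movie 1 is rated twice, movie 2 once.
toy = pd.DataFrame(
    {"userId": [10, 11, 10], "movieId": [1, 1, 2], "rating": [4.0, 2.0, 5.0]}
)

# Aggregate that still uses the column name 'rating'.
avg_same_name = toy.groupby("movieId")["rating"].mean().reset_index()

# No on= given: pandas joins on movieId AND rating, so the mean 3.0 for
# movie 1 matches no rating row and shows up as an extra row without a userId.
naive = pd.merge(toy, avg_same_name, how="outer")

# Renaming the aggregate and joining on movieId alone keeps one row per rating.
avg_renamed = toy.groupby("movieId")["rating"].mean().rename("avg_rating").reset_index()
fixed = pd.merge(toy, avg_renamed, on="movieId", how="left")

print(naive)
print(fixed)
```

Comparing `len(naive)` and `len(fixed)` against `len(toy)` is a quick sanity check after any merge of this kind.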

- Number of ratings per movie