# Part 1: Building a Recommendation System from Scratch

We will use a [MovieLens](https://movielens.org/) dataset to build a simple recommendation system in Python. MovieLens is a platform where users rate movies and receive personalized recommendations based on their preferences. It publishes several public datasets that are widely used in recommendation system tutorials.

### Imports

In [7]:
import numpy as np
import pandas as pd
import sklearn

import os

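Now, let's load a small version of the MovieLens dataset. The zip file URL is listed on the [GroupLens datasets page](https://grouplens.org/datasets/movielens/). We're working with the data in `ml-latest-small.zip`, and the code below reads `ratings.csv` from a local `data` directory.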

In [8]:
ratings = pd.read_csv(os.path.join("data", "ratings.csv"))
ratings.head()

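|   | userId | movieId | rating | timestamp  |
|---|--------|---------|--------|------------|
| 0 | 1      | 31      | 2.5    | 1260759144 |
| 1 | 1      | 1029    | 3.0    | 1260759179 |
| 2 | 1      | 1061    | 3.0    | 1260759182 |
| 3 | 1      | 1129    | 2.0    | 1260759185 |
| 4 | 1      | 1172    | 4.0    | 1260759205 |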
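The `timestamp` column stores rating times as Unix epoch seconds, which are hard to read directly. As a minimal sketch, the snippet below rebuilds the first rows shown above in a small DataFrame and converts them to datetimes. The names `sample` and `rated_at` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# First three rows of the ratings preview shown above.
sample = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [31, 1029, 1061],
    "rating": [2.5, 3.0, 3.0],
    "timestamp": [1260759144, 1260759179, 1260759182],
})

# Interpret the integers as seconds since 1970-01-01 UTC.
# "rated_at" is a hypothetical column name chosen for this sketch.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")

print(sample[["movieId", "rating", "rated_at"]])
```

The same call should work on the full `ratings` DataFrame, assuming its `timestamp` column has the same format as the preview.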


Let's take a look at the data and perform some basic exploratory data analysis.

In [29]:
n_ratings = len(ratings)
n_movies = ratings['movieId'].nunique()
n_users = ratings['userId'].nunique()

print("Number of ratings:", n_ratings)
print("Number of unique movieIds:", n_movies)
print("Number of unique users:", n_users)
print("Average number of ratings per user:", round(n_ratings/n_users, 2))
print("Average number of ratings per movie:", round(n_ratings/n_movies, 2))

Number of ratings: 100004
Number of unique movieIds: 9066
Number of unique users: 671
Average number of ratings per user: 149.04
Average number of ratings per movie: 11.03
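These counts also show how sparse the data is: most users have rated only a small fraction of the movies. The sketch below plugs in the numbers printed above to estimate what share of all possible user-movie pairs has a rating. The variable names `n_possible`, `density`, and `sparsity` are illustrative additions, not part of the original notebook.

```python
# Counts taken from the output above.
n_ratings = 100004
n_users = 671
n_movies = 9066

# Every (user, movie) pair is one potential rating.
n_possible = n_users * n_movies

# Fraction of pairs that have a rating, and its complement.
density = n_ratings / n_possible
sparsity = 1 - density

print(f"Possible user-movie pairs: {n_possible}")
print(f"Density: {density:.2%}")
print(f"Sparsity: {sparsity:.2%}")
```

Under these counts, fewer than 2% of the possible user-movie pairs carry a rating.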


In [31]:
movies = pd.read_csv(os.path.join("data", "movies.csv"))
movies.head()

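|   | movieId | title                              | genres                                          |
|---|---------|------------------------------------|-------------------------------------------------|
| 0 | 1       | Toy Story (1995)                   | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2       | Jumanji (1995)                     | Adventure\|Children\|Fantasy                    |
| 2 | 3       | Grumpier Old Men (1995)            | Comedy\|Romance                                 |
| 3 | 4       | Waiting to Exhale (1995)           | Comedy\|Drama\|Romance                          |
| 4 | 5       | Father of the Bride Part II (1995) | Comedy                                          |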
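The `genres` column packs several genres into a single pipe-delimited string, so it usually needs to be split before it can be counted or filtered. The following is a minimal sketch on three of the rows shown above. The names `movies_sample`, `genre_list`, and `genre_counts` are hypothetical and are not taken from the original notebook.

```python
import pandas as pd

# Three rows copied from the movies preview above.
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split each pipe-delimited string into a Python list of genres.
movies_sample["genre_list"] = movies_sample["genres"].str.split("|")

# Expand to one row per (movie, genre) pair, then count movies per genre.
genre_counts = movies_sample.explode("genre_list")["genre_list"].value_counts()

print(genre_counts)
```

The same two steps could be applied to the full `movies` DataFrame, assuming every row uses `|` as the separator.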
