## Install dependencies (only if not already installed)
The following cell installs the required Python packages with the `%pip` magic, which works in Colab and in any local Jupyter notebook running on an IPython kernel.

In [1]:
%pip install -q pandas scikit-learn scipy numpy requests

Note: you may need to restart the kernel to use updated packages.





## Dataset download and extraction (ml-latest-small)
If `movies.csv` and `ratings.csv` are already present in the working directory this cell will skip downloading. Otherwise it downloads the MovieLens `ml-latest-small` zip, extracts the two CSVs, and moves them to the working directory.

In [2]:
import os
import requests
import zipfile
import io
import shutil

def download_movielens_small(target_dir='.'):
    movie_path = os.path.join(target_dir, 'movies.csv')
    ratings_path = os.path.join(target_dir, 'ratings.csv')
    if os.path.exists(movie_path) and os.path.exists(ratings_path):
        print('movies.csv and ratings.csv already exist in the working directory.')
        return movie_path, ratings_path

    print('Downloading MovieLens ml-latest-small dataset...')
    url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
    resp = requests.get(url, timeout=60)  # avoid hanging indefinitely on a stalled connection
    resp.raise_for_status()
    z = zipfile.ZipFile(io.BytesIO(resp.content))
    # Extract the needed CSVs to a temporary folder inside current working directory
    tmp_dir = os.path.join(target_dir, 'ml-latest-small')
    if not os.path.exists(tmp_dir):
        os.makedirs(tmp_dir)
    for filename in z.namelist():
        if filename.endswith('movies.csv') or filename.endswith('ratings.csv'):
            z.extract(filename, target_dir)
    # Move the files from tmp folder to target_dir root
    movies_in_zip = os.path.join(target_dir, 'ml-latest-small', 'movies.csv')
    ratings_in_zip = os.path.join(target_dir, 'ml-latest-small', 'ratings.csv')
    if os.path.exists(movies_in_zip) and os.path.exists(ratings_in_zip):
        shutil.move(movies_in_zip, movie_path)
        shutil.move(ratings_in_zip, ratings_path)
        # Remove the extracted folder
        shutil.rmtree(os.path.join(target_dir, 'ml-latest-small'))
    else:
        raise FileNotFoundError('Downloaded zip but could not find expected CSVs inside')

    print('Download and extraction complete')
    return movie_path, ratings_path

movie_file, ratings_file = download_movielens_small('.')
print('Files available:')
print(f' - {movie_file}')
print(f' - {ratings_file}')
# Print line counts as a quick sanity check. The explicit UTF-8 encoding avoids
# the UnicodeDecodeError that a platform default codec (e.g. cp1252 on Windows)
# can raise on these files.
with open(movie_file, encoding='utf-8') as f:
    print('movies.csv lines:', sum(1 for _ in f))
with open(ratings_file, encoding='utf-8') as f:
    print('ratings.csv lines:', sum(1 for _ in f))

movies.csv and ratings.csv already exist in the working directory.
Files available:
 - .\movies.csv
 - .\ratings.csv
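
The extract-and-move step described above can also be sketched without the intermediate `ml-latest-small` folder, by copying each wanted archive member straight to its final path. This is a minimal sketch rather than the notebook's method; `extract_flat` and its parameters are hypothetical names introduced for illustration.

```python
import os
import shutil
import zipfile


def extract_flat(zip_file, wanted, target_dir='.'):
    """Copy archive members whose base name is in `wanted` into target_dir.

    `extract_flat` is a hypothetical helper, not part of the notebook.
    `zip_file` may be a path or a file-like object such as io.BytesIO.
    """
    written = {}
    with zipfile.ZipFile(zip_file) as z:
        for member in z.namelist():
            base = os.path.basename(member)
            if base in wanted:
                dest = os.path.join(target_dir, base)
                # Stream the member to its final location, dropping the folder prefix
                with z.open(member) as src, open(dest, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
                written[base] = dest
    missing = set(wanted) - set(written)
    if missing:
        raise FileNotFoundError(f'Not found in archive: {sorted(missing)}')
    return written
```

With the download code above, this could be called as `extract_flat(io.BytesIO(resp.content), {'movies.csv', 'ratings.csv'}, target_dir)`, which would make the later move and cleanup steps unnecessary.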

## Load Data and Preview
Read the CSVs into DataFrames and preview the data so we can understand shapes and fields.

In [3]:
import pandas as pd

movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

print('Movies shape:', movies.shape)
print('Ratings shape:', ratings.shape)

display(movies.head())
display(ratings.head())

Movies shape: (9742, 3)
Ratings shape: (100836, 4)


   movieId  title                               genres
0  1        Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1  2        Jumanji (1995)                      Adventure|Children|Fantasy
2  3        Grumpier Old Men (1995)             Comedy|Romance
3  4        Waiting to Exhale (1995)            Comedy|Drama|Romance
4  5        Father of the Bride Part II (1995)  Comedy


   userId  movieId  rating  timestamp
0  1       1        4.0     964982703
1  1       3        4.0     964981247
2  1       6        4.0     964982224
3  1       47       5.0     964983815
4  1       50       5.0     964982931


## Preprocessing & Pivot
Create a movie-user matrix (rows: movies, columns: users), fill missing ratings with 0, and convert to a CSR matrix for efficient math operations.

In [4]:
from scipy.sparse import csr_matrix

movie_user_matrix = ratings.pivot(index='movieId', columns='userId', values='rating')
movie_user_matrix.fillna(0, inplace=True)
movie_csr = csr_matrix(movie_user_matrix.values)

print('Movie-user pivot shape:', movie_user_matrix.shape)
print('CSR shape:', movie_csr.shape)

Movie-user pivot shape: (9724, 610)
CSR shape: (9724, 610)
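
The dense pivot above allocates every movie-user cell before converting to CSR. As a hedged alternative, the same kind of matrix can be assembled directly from (movieId, userId, rating) triples. The sketch below uses a tiny made-up ratings sample and a hypothetical helper name (`build_item_user_csr`); it assumes each (movie, user) pair appears at most once, because the `csr_matrix` constructor sums duplicate entries.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_item_user_csr(movie_ids, user_ids, ratings):
    """Build a movies x users CSR matrix without a dense intermediate.

    Hypothetical helper for illustration. Returns the matrix plus the sorted
    unique movie and user IDs that label its rows and columns.
    """
    # np.unique returns the sorted unique IDs and each input's position among them
    movies_u, row = np.unique(np.asarray(movie_ids), return_inverse=True)
    users_u, col = np.unique(np.asarray(user_ids), return_inverse=True)
    data = np.asarray(ratings, dtype=float)
    mat = csr_matrix((data, (row.ravel(), col.ravel())),
                     shape=(len(movies_u), len(users_u)))
    return mat, movies_u, users_u


# Made-up sample: three movies rated by three users
mat, movie_index, user_index = build_item_user_csr(
    movie_ids=[1, 1, 3, 6, 6],
    user_ids=[1, 2, 1, 2, 5],
    ratings=[4.0, 3.5, 4.0, 5.0, 2.0],
)
print(mat.toarray())
```

With the notebook's `ratings` DataFrame, the call would be `build_item_user_csr(ratings['movieId'], ratings['userId'], ratings['rating'])`. The returned `movie_index` then plays the role of `movie_user_matrix.index` when mapping row positions back to movie IDs.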


## Train KNN Model
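Fit a nearest-neighbors model on the movie-user matrix using cosine distance. Each row is a movie, so the neighbors of a movie are other movies that the same users rated similarly. This is the item-based KNN approach.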

In [11]:
from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
model_knn.fit(movie_csr)
print('Model trained successfully!')

Model trained successfully!
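
To make the `metric='cosine'` setting concrete, here is a small worked example in plain Python. The rating vectors are made up, and `cosine_similarity` is an illustrative helper written for this sketch, not a scikit-learn call. The distance the model ranks by is one minus this similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero (unrated) vector
    return dot / (norm_a * norm_b)


# Made-up ratings from four users; 0 means "not rated", as in the pivot above
movie_a = [5.0, 4.0, 0.0, 0.0]
movie_b = [4.0, 5.0, 0.0, 3.0]

similarity = cosine_similarity(movie_a, movie_b)
distance = 1 - similarity  # the quantity a cosine-metric neighbor search ranks by
print(f'similarity={similarity:.4f}, distance={distance:.4f}')
```

Because unrated cells are stored as 0, only users who rated both movies contribute to the dot product, while every rating contributes to the norms. A movie rated by many users who did not rate the other one is therefore pushed further away.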


## Recommendation Function
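This function takes a movie name (or any substring of a title), uses the first matching row of the `movies` DataFrame as the query, and asks the fitted KNN model for its nearest neighbors by cosine distance. It prints each neighbor with a match percentage computed as (1 - cosine distance) × 100. The closest neighbor is normally the query movie itself and is skipped, so `n_neighbors=6` yields 5 recommendations.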

In [12]:
def recommend_movie(movie_name, n_neighbors=6):
    # Case-insensitive substring match on the titles; regex=False treats
    # characters such as "(" or "*" in the query literally
    movie_list = movies[movies['title'].str.contains(movie_name, case=False, na=False, regex=False)]
    if len(movie_list) == 0:
        print(f"Movie not found: {movie_name}. Please try a different title fragment.")
        return
    # Use the first matching title as the query movie
    movie_id = movie_list.iloc[0]['movieId']
    movie_title = movie_list.iloc[0]['title']

    try:
        query_index = movie_user_matrix.index.get_loc(movie_id)
    except KeyError:
        # Movies without any rating never appear as a row of the pivot
        print('Movie ID found in list but not in matrix (the movie has no ratings)')
        return

    distances, indices = model_knn.kneighbors(
        movie_user_matrix.iloc[query_index, :].values.reshape(1, -1),
        n_neighbors=n_neighbors
    )
    distances, indices = distances.flatten(), indices.flatten()
    print(f'Because you liked "{movie_title}":')
    # Position 0 is normally the query movie itself (distance 0), so start at 1
    for i in range(1, len(distances)):
        result_movie_id = movie_user_matrix.index[indices[i]]
        result_title = movies[movies['movieId'] == result_movie_id]['title'].values[0]
        similarity = (1 - distances[i]) * 100  # cosine similarity as a percentage
        print(f'{i}: {result_title} (Match: {similarity:.2f}%)')


## Demo
Run a demonstration using `'Toy Story'` as the query with `n_neighbors=16`, which prints the top 15 recommendations (the nearest neighbor is the query movie itself and is skipped).

In [15]:
recommend_movie('Toy Story', n_neighbors=16)

Because you liked "Toy Story (1995)":
1: Toy Story 2 (1999) (Match: 57.26%)
2: Jurassic Park (1993) (Match: 56.56%)
3: Independence Day (a.k.a. ID4) (1996) (Match: 56.43%)
4: Star Wars: Episode IV - A New Hope (1977) (Match: 55.74%)
5: Forrest Gump (1994) (Match: 54.71%)
6: Lion King, The (1994) (Match: 54.11%)
7: Star Wars: Episode VI - Return of the Jedi (1983) (Match: 54.11%)
8: Mission: Impossible (1996) (Match: 53.89%)
9: Groundhog Day (1993) (Match: 53.42%)
10: Back to the Future (1985) (Match: 53.04%)
11: Shrek (2001) (Match: 52.80%)
12: Aladdin (1992) (Match: 52.79%)
13: Apollo 13 (1995) (Match: 52.03%)
14: Pulp Fiction (1994) (Match: 51.80%)
15: Star Wars: Episode V - The Empire Strikes Back (1980) (Match: 51.42%)


---
### How to use this notebook in Google Colab
1. Upload this notebook file to Google Drive or open directly in Colab (File > Open Notebook > Upload).
2. Run the cells from top to bottom. The dataset will be downloaded and extracted automatically if not present in the working directory.
3. Modify the `recommend_movie()` demo to try other movies (e.g. `recommend_movie('Toy Story', n_neighbors=5)`) or loop through a list of inputs to generate multiple recommendations.

---
Possible extensions: an evaluation cell (e.g. precision@k), saving/serializing the model, accepting user input, or converting this notebook into a Streamlit demo to share interactively.
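
As one possible shape for the evaluation cell mentioned above, precision@k is the fraction of the top k recommended items that are relevant. The sketch below is a generic definition run on made-up movie IDs; `precision_at_k` is a hypothetical helper, and what counts as "relevant" (for example, movies a held-out user rated 4.0 or higher) is an assumption this notebook does not specify.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that appear in `relevant`.

    Hypothetical helper for illustration. `recommended` is an ordered sequence
    of item IDs (best first) and `relevant` is any collection of item IDs.
    """
    if k <= 0:
        raise ValueError('k must be a positive integer')
    relevant_set = set(relevant)
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant_set)
    # Dividing by k (not len(top_k)) penalizes lists shorter than k
    return hits / k


# Made-up IDs: one of the first three recommendations is relevant
score = precision_at_k(recommended=[10, 20, 30, 40], relevant={20, 40, 50}, k=3)
print(f'precision@3 = {score:.3f}')
```

To use this with the recommender, `recommend_movie` would first need to return the neighbor movie IDs instead of only printing them.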