# ITCS 6162: Data Mining - Programming Assignment

**In this assignment, you will explore data analysis, recommendation algorithms, and graph-based techniques using the MovieLens dataset. Your tasks will range from basic data exploration to advanced recommendation models, including:**
- Data manipulation with pandas
- User-item collaborative filtering
- Similarity-based recommendation models
- A Pixie-inspired graph-based recommender using adjacency lists with weighted random walks (without using NetworkX)


#### **Dataset Files:**
- **`u.data`**: User-movie ratings (`user_id  movie_id  rating  timestamp`)
- **`u.item`**: Movie metadata (`movie_id | title | release date | IMDB_website`)
- **`u.user`**: User demographics (`user_id | age | gender | occupation | zip_code`)

## **Part 1: Exploring and Cleaning Data**

### Inspecting the Dataset Format

The dataset is not in a traditional CSV format. To examine its structure, use the following shell command to display the first 10 lines of the file:

```sh
!head <file_name>
```

**In the cells below, write the code to read the files.**

In [115]:
# Cell to have all the imports for the project
import pandas as pd
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity # For computing user similarity
import random  # For random walks

In [116]:
# Displaying first 10 rows of u.data dataset
!head u.data

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013


In [117]:
# Displaying first 10 rows of u.item dataset
!head u.item

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

In [118]:
# Displaying first 10 rows of u.user dataset
!head u.user

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703


#### Loading the Dataset with Pandas

Use **pandas** to load the dataset into a DataFrame for analysis. Follow these steps:  

1. Import the necessary library: `pandas`.  
2. Use `pd.read_csv()` (or an appropriate function) to read the dataset file.  
3. Ensure the dataset is loaded with the correct delimiter (e.g., `','`, `'\t'`, `'|'`, or another separator if needed).  
4. Select and display the first few rows using `.head()`.

Ensure that:  

- The `ratings` dataset is read from `"u.data"` using tab (`'\t'`) as a separator and column names (`"user_id"`, `"movie_id"`, `"rating"` and `"timestamp"`).  
- The `movies` dataset is read from `"u.item"` using `'|'` as a separator, use columns (`0`, `1`, `2`), encoding (`"latin-1"`) and name the columns (`movie_id`, `title`, and `release_date`).  
- The `users` dataset is read from `"u.user"` using `'|'` as a separator, use columns (`0`, `1`, `2`, `3`) and name the columns (`user_id`, `age`, `gender`, and `occupation`).
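
The reading rules above can be tried without the real files. The following is a minimal sketch that parses a few in-memory sample rows laid out like `u.data` and `u.item`; the names `U_DATA_SAMPLE`, `U_ITEM_SAMPLE`, `ratings_demo` and `movies_demo` are illustrative, and the placeholder URLs are not from the dataset.

```python
import io

import pandas as pd

# Sample rows in the u.data layout: tab-separated, no header row.
U_DATA_SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

# Sample rows in the u.item layout: pipe-separated, with extra columns
# after the first three (shortened here; the URLs are placeholders).
U_ITEM_SAMPLE = (
    "1|Toy Story (1995)|01-Jan-1995||http://example.invalid/1|0|0|0\n"
    "2|GoldenEye (1995)|01-Jan-1995||http://example.invalid/2|0|1|1\n"
)

# header=None tells pandas the first line is data, not column names.
ratings_demo = pd.read_csv(
    io.StringIO(U_DATA_SAMPLE),
    sep="\t",
    names=["user_id", "movie_id", "rating", "timestamp"],
    header=None,
)

# usecols keeps only the first three columns; names labels exactly those.
# When reading the real u.item file, also pass encoding="latin-1".
movies_demo = pd.read_csv(
    io.StringIO(U_ITEM_SAMPLE),
    sep="|",
    usecols=[0, 1, 2],
    names=["movie_id", "title", "release_date"],
    header=None,
)

print(ratings_demo.head())
print(movies_demo.head())
```

Swapping `io.StringIO(...)` for the file name gives the calls the assignment asks for.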

In [119]:
# ratings
# Converting u.data into dataframe named ratings
ratings_column = ['user_id', 'movie_id', 'rating', 'timestamp'] # Defining columns for the data frame
ratings = pd.read_csv("u.data", sep="\t", names=ratings_column, header=None) # Column names are provided explicitly, so header=None keeps the first row as data
ratings.head() #Displaying first 5 rows

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [120]:
# movies
movies_column = ['movie_id', 'title', 'release_date'] # Defining columns for the data frame
# Column names are provided explicitly, so header=None keeps the first row as data. Per the instructions, only columns 0, 1 and 2 are read, using latin-1 encoding.
movies = pd.read_csv("u.item", sep="|", encoding='latin-1', usecols=[0, 1, 2], names=movies_column, header=None)
movies.head() #Displaying first 5 rows

Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995


In [121]:
# users
users_column = ['user_id', 'age', 'gender', 'occupation']  # Defining columns for the data frame
# Column names are provided explicitly, so header=None keeps the first row as data. Per the instructions, only columns 0, 1, 2 and 3 are read.
users = pd.read_csv("u.user", sep="|", usecols=[0, 1, 2, 3], names=users_column, header=None)
users.head() #Displaying first 5 rows

Unnamed: 0,user_id,age,gender,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


**Note:** As a **bonus** task, save the `ratings`, `movies` and `users` DataFrames as `.csv` files. <br>
**Hint:** Use the `to_csv()` function in pandas to save these DataFrames as CSV files.

In [122]:
# ratings
#Saving ratings dataframe as .csv and gets saved to the same location where .ipynb is located
ratings.to_csv("ratings.csv", index=False)

In [123]:
# movies
#Saving movies dataframe as .csv and gets saved to the same location where .ipynb is located
movies.to_csv("movies.csv", index=False)

In [124]:
# users
#Saving users dataframe as .csv and gets saved to the same location where .ipynb is located
users.to_csv("users.csv", index=False)

**Display the first 10 rows of each file.**

In [125]:
# ratings
# Load the CSV file
ratings_data = pd.read_csv("ratings.csv")

# Display the first 10 rows
print(ratings_data.head(10))

   user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742
2       22       377       1  878887116
3      244        51       2  880606923
4      166       346       1  886397596
5      298       474       4  884182806
6      115       265       2  881171488
7      253       465       5  891628467
8      305       451       3  886324817
9        6        86       3  883603013


In [126]:
# movies
# Load the CSV file
movies_data = pd.read_csv("movies.csv")

# Display the first 10 rows
print(movies_data.head(10))

   movie_id                                              title release_date
0         1                                   Toy Story (1995)  01-Jan-1995
1         2                                   GoldenEye (1995)  01-Jan-1995
2         3                                  Four Rooms (1995)  01-Jan-1995
3         4                                  Get Shorty (1995)  01-Jan-1995
4         5                                     Copycat (1995)  01-Jan-1995
5         6  Shanghai Triad (Yao a yao yao dao waipo qiao) ...  01-Jan-1995
6         7                              Twelve Monkeys (1995)  01-Jan-1995
7         8                                        Babe (1995)  01-Jan-1995
8         9                            Dead Man Walking (1995)  01-Jan-1995
9        10                                 Richard III (1995)  22-Jan-1996


In [127]:
# users
# Load the CSV file
users_data = pd.read_csv("users.csv")

# Display the first 10 rows
print(users_data.head(10))

   user_id  age gender     occupation
0        1   24      M     technician
1        2   53      F          other
2        3   23      M         writer
3        4   24      M     technician
4        5   33      F          other
5        6   42      M      executive
6        7   57      M  administrator
7        8   36      M  administrator
8        9   29      M        student
9       10   53      M         lawyer


### Data Cleaning and Exploration with Pandas  

After loading the dataset, it’s important to clean and explore the data to ensure consistency and accuracy. Below are key **pandas** functions for cleaning and understanding the dataset.

#### 1. Handle Missing Values  
- `df.dropna()` – Removes rows with missing values.  
- `df.fillna(value)` – Fills missing values with a specified value.  

#### 2. Remove Duplicates  
- `df.drop_duplicates()` – Drops duplicate rows from the dataset.  

#### 3. Handle Incorrect Data Types  
- `df.astype(dtype)` – Converts columns to the appropriate data type.  

#### 4. Filter Outliers (if applicable)  
- `df[df['column_name'] > threshold]` – Filters rows based on a condition.  

#### 5. Rename Columns (if needed)  
- `df.rename(columns={'old_name': 'new_name'})` – Renames columns for clarity.  

#### 6. Reset Index  
- `df.reset_index(drop=True, inplace=True)` – Resets the index after cleaning.  

### Data Exploration Functions  

To better understand the dataset, use these **pandas** functions:  

- `df.shape` – Returns the number of rows and columns in the dataset.  
- `df.nunique()` – Displays the number of unique values in each column.  
- `df['column_name'].unique()` – Returns unique values in a specific column.  

**Example Usage in Pandas:**  
```python
import pandas as pd

# Load dataset
df = pd.read_csv("your_file.csv")

# Drop missing values
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Convert 'timestamp' column to datetime format (unit='s' assumes Unix seconds, as in u.data)
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'], unit='s')

# Display dataset shape
print("Dataset shape:", df_cleaned.shape)

# Display number of unique values in each column
print("Unique values per column:\n", df_cleaned.nunique())

# Display unique movie IDs
print("Unique movie IDs:", df_cleaned['movie_id'].unique()[:10])  # Show first 10 unique movie IDs
```

**Note:** The functions above are some of the most widely used **pandas** functions for data cleaning and exploration. Not all of them will be needed in the exercises below; use them as the dataset and the specific task require.

**Convert Timestamps into Readable dates.**

In [128]:
# ratings
ratings_data['timestamp'] = pd.to_datetime(ratings_data['timestamp'], unit='s') #As the timestamp is in seconds unit is given as 's'
print(ratings_data.head(10))

   user_id  movie_id  rating           timestamp
0      196       242       3 1997-12-04 15:55:49
1      186       302       3 1998-04-04 19:22:22
2       22       377       1 1997-11-07 07:18:36
3      244        51       2 1997-11-27 05:02:03
4      166       346       1 1998-02-02 05:33:16
5      298       474       4 1998-01-07 14:20:06
6      115       265       2 1997-12-03 17:51:28
7      253       465       5 1998-04-03 18:34:27
8      305       451       3 1998-02-01 09:20:17
9        6        86       3 1997-12-31 21:16:53


**Check for Missing Values**

In [129]:
# ratings
print(ratings_data.isnull().values.any())
rating_missing_rows = ratings_data[ratings_data.isnull().any(axis=1)]
print(rating_missing_rows)
# There is no missing value or cell

False
Empty DataFrame
Columns: [user_id, movie_id, rating, timestamp]
Index: []


In [130]:
# movies
print(movies_data.isnull().values.any())
movie_missing_rows = movies_data[movies_data.isnull().any(axis=1)]
print(movie_missing_rows)
# Only movie_id 267 (title "unknown") has a missing release_date; the gap is minor and is left as-is

True
     movie_id    title release_date
266       267  unknown          NaN


In [131]:
# users
print(users_data.isnull().values.any())
user_missing_rows = users_data[users_data.isnull().any(axis=1)]
print(user_missing_rows)
# There is no missing value or cell

False
Empty DataFrame
Columns: [user_id, age, gender, occupation]
Index: []


In [132]:
#Check for duplicate values
ratings_duplicates = ratings_data[ratings_data.duplicated(subset=['user_id', 'movie_id'])]
movies_duplicates = movies_data[movies_data['movie_id'].duplicated()]
users_duplicates = users_data[users_data['user_id'].duplicated()]
print("ratings:\n", ratings_duplicates)
print("movies:\n", movies_duplicates)
print("users:\n", users_duplicates)
# No user or movie is duplicated, and no user has rated the same movie more than once

ratings:
 Empty DataFrame
Columns: [user_id, movie_id, rating, timestamp]
Index: []
movies:
 Empty DataFrame
Columns: [movie_id, title, release_date]
Index: []
users:
 Empty DataFrame
Columns: [user_id, age, gender, occupation]
Index: []


**Print the total number of users, movies, and ratings.**

In [133]:
print(f"Total Users: { users_data['user_id'].nunique() }") # no.of unique users
print(f"Total Movies: { movies_data['movie_id'].nunique() }") # no.of unique movies
print(f"Total Ratings: { len(ratings_data) }") # As no user_movie pair is duplicated count the number of total rows

Total Users: 943
Total Movies: 1682
Total Ratings: 100000


## **Part 2: Collaborative Filtering-Based Recommendation**

### **Create a User-Item Matrix**

#### Instructions for Creating a User-Movie Rating Matrix

In this exercise, you will create a user-movie rating matrix using **pandas**. This matrix will represent the ratings that users have given to different movies.

1. **Dataset Overview**:  
   The dataset has already been loaded. It includes the following key columns:
   - `user_id`: The ID of the user.
   - `movie_id`: The ID of the movie.
   - `rating`: The rating the user gave to the movie.

2. **Create the User-Movie Rating Matrix**:  
   Use the **`pivot()`** function in **pandas** to reshape the data. Your goal is to create a matrix where:
   - Each **row** represents a **user**.
   - Each **column** represents a **movie**.
   - Each **cell** contains the **rating** that the user has given to the movie.

   Specify the following parameters for the `pivot()` function:
   - **`index`**: The `user_id` column (this will define the rows).
   - **`columns`**: The `movie_id` column (this will define the columns).
   - **`values`**: The `rating` column (this will fill the matrix with ratings).

3. **Inspect the Matrix**:  
   After creating the matrix, examine the first few rows of the resulting matrix to ensure it has been constructed correctly.

4. **Handle Missing Values**:  
   It's likely that some users have not rated every movie, resulting in `NaN` values in the matrix. You will need to handle these missing values. Consider the following options:
   - **Fill with 0**: If you wish to represent missing ratings as zeros (indicating no rating).
   - **Fill with the average rating**: Alternatively, replace missing values with the average rating for each movie.
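
The reshaping and the two fill strategies can be seen on a small example. This is a minimal sketch on made-up ratings; `toy`, `matrix`, `filled_zero` and `filled_mean` are illustrative names, not part of the assignment.

```python
import pandas as pd

# Made-up ratings in long format: one row per (user, movie) pair.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 20, 30],
    "rating":   [5, 3, 4, 2, 1],
})

# Rows become users, columns become movies, cells hold the rating.
# Pairs that never occur in the long table show up as NaN.
matrix = toy.pivot(index="user_id", columns="movie_id", values="rating")

# Option 1: treat a missing rating as 0.
filled_zero = matrix.fillna(0)

# Option 2: fill each movie's missing cells with that movie's mean rating.
# matrix.mean() is a Series of per-column means, and fillna matches it
# to the columns by label.
filled_mean = matrix.fillna(matrix.mean())

print(matrix)
print(filled_zero)
print(filled_mean)
```

Filling with 0 leaves the scale of observed ratings untouched but makes "not rated" look like a very low rating; filling with the movie mean avoids that at the cost of inventing neutral ratings.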

**Create the user-movie rating matrix using the `pivot()` function.**

In [110]:
user_movie_matrix = ratings_data.pivot(index='user_id', columns='movie_id', values='rating') #Formation of user-movie matrix

**Display the matrix to verify the transformation.**

In [105]:
user_movie_matrix_1 = user_movie_matrix.fillna(0) # Replace Nan with 0
user_similarity = cosine_similarity(user_movie_matrix_1) # Find similarity between users using cosine similarity
user_user_sim = pd.DataFrame(user_similarity, index=user_movie_matrix_1.index, columns=user_movie_matrix_1.index) # Convert into 2d similarity dataframe with both rows and columns as users

### **User-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement a **user-based collaborative filtering** movie recommendation system using the **Movie dataset**. The goal is to recommend movies to a user based on the preferences of similar users.

##### **Step 1: Import Required Libraries**
Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing user similarity
```

##### **Step 2: Compute User-User Similarity**
- We will use **cosine similarity** to measure how similar each pair of users is based on their movie ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.

##### **Instructions:**
1. Fill missing values with `0` using `.fillna(0)`.
2. Compute similarity using `cosine_similarity()`.
3. Convert the result into a **Pandas DataFrame**, with users as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
```
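
To make the similarity measure concrete, the sketch below computes cosine similarity by hand with NumPy on a small zero-filled matrix: each row is scaled to unit length, and the dot products of the scaled rows are the similarities. The helper `cosine_sim_matrix` and the array `toy` are illustrative; leaving an all-zero row as a zero vector is a choice made for this sketch.

```python
import numpy as np


def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    # Euclidean length of every row, kept as a column for broadcasting.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Avoid dividing by zero: an all-zero row stays a zero vector.
    norms[norms == 0] = 1.0
    unit = X / norms
    # Dot product of unit vectors = cosine of the angle between them.
    return unit @ unit.T


# Three users, three movies; 0 stands for "not rated" after fillna(0).
toy = np.array([
    [5, 3, 0],
    [4, 0, 0],
    [0, 2, 1],
])

sim = cosine_sim_matrix(toy)
print(np.round(sim, 3))
```

Users 2 and 3 share no rated movie here, so their similarity is exactly 0, while every user has similarity 1 with themselves.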

##### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies_for_user(user_id, num=5)` to recommend movies for a given user.

##### **Function Inputs:**
- `user_id`: The target user for whom we need recommendations.
- `num`: The number of movies to recommend (default is 5).

##### **Function Steps:**
1. Find **similar users**:
   - Retrieve the similarity scores for the given `user_id`.
   - Sort them in **descending** order (highest similarity first).
   - Exclude the user themselves.
   
2. Get the **movie ratings** from these similar users.

3. Compute the **average rating** for each movie based on these users' preferences.

4. Sort the movies in **descending order** based on the computed average ratings.

5. Retrieve the **top `num` recommended movies**.

6. Map **movie IDs** to their **titles** using the `movies` DataFrame.

7. Return the results as a **Pandas DataFrame** with rankings.

##### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names     
})
result_df.set_index('Ranking', inplace=True)
```
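
One possible shape for the function, following the listed steps on toy data, is sketched here. It is not the reference solution: the keyword arguments `matrix`, `sim_df` and `movies` and the neighbour count `k` are assumptions added so the example is self-contained, and the similarity values are written by hand rather than computed.

```python
import numpy as np
import pandas as pd


def recommend_movies_for_user(user_id, num=5, *, matrix, sim_df, movies, k=10):
    """Recommend up to `num` movies from the k most similar users (sketch)."""
    if user_id not in sim_df.index:
        raise KeyError(f"unknown user_id: {user_id}")
    # Step 1: most similar users first, excluding the user themselves.
    neighbours = (
        sim_df.loc[user_id]
        .drop(user_id)
        .sort_values(ascending=False)
        .head(k)
        .index
    )
    # Steps 2-3: mean rating per movie over those users; NaN means
    # "not rated" and is skipped, movies nobody rated are dropped.
    avg_rating = matrix.loc[neighbours].mean(axis=0).dropna()
    # Steps 4-5: highest average first, keep the top `num`.
    top_ids = avg_rating.sort_values(ascending=False).head(num).index
    # Step 6: translate movie IDs into titles.
    titles = movies.set_index("movie_id")["title"].reindex(top_ids).tolist()
    # Step 7: ranked DataFrame in the requested layout.
    result = pd.DataFrame({
        "Ranking": range(1, len(titles) + 1),
        "Movie Name": titles,
    })
    return result.set_index("Ranking")


# Toy inputs (hypothetical values).
toy_matrix = pd.DataFrame(
    [[5, np.nan, np.nan], [4, 5, 2], [np.nan, 1, 4]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1], [0.9, 1.0, 0.5], [0.1, 0.5, 1.0]],
    index=toy_matrix.index,
    columns=toy_matrix.index,
)
toy_movies = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

demo = recommend_movies_for_user(
    1, num=2, matrix=toy_matrix, sim_df=toy_sim, movies=toy_movies, k=1
)
print(demo)
```

With `k=1` the only neighbour of user 1 is user 2, so the ranking follows user 2's ratings. Using `len(titles)` for the ranking keeps the frame valid when fewer than `num` movies are available.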

#### **Example: User-Based Collaborative Filtering**
```python
recommend_movies_for_user(10, num = 5)
```
**Output:**
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | In the Company of Men (1997)   |
| 2       | Misérables, Les (1995)         |
| 3       | Thin Blue Line, The (1988)     |
| 4       | Braindead (1992)               |
| 5       | Boys, Les (1997)               |
```

In [138]:
# Code the function here
user_id = int(input("Enter the user id : ")) # Get the user_id from user
num = 5 # No.of recommendations (As default is 5)
if user_id < 1 or user_id > len(user_user_sim): # Check if user_id is between 1 and 943 inclusive
    print("Invalid Input \nAccepted values are: \nuser_id=[1,943]") #Throw error
else: # If user_id is valid
    # Mask intended to mark rated cells. Caveat: if user_movie_matrix still holds NaN for unrated movies (as produced by pivot()), NaN != 0 evaluates to True, so this mask is True everywhere; user_movie_matrix.notna() would mark only the rated cells.
    non_zero_mask = user_movie_matrix != 0
    similarity_row = user_user_sim.loc[user_id].copy() # Copy the similarity row of the given user_id so that user_user_sim itself is not modified
    similarity_row[user_id] = 0 # Exclude the user's own similarity
    masked_similarities = non_zero_mask.mul(similarity_row, axis=0) # Broadcast each user's similarity across that user's row wherever the mask is True
    # Multiply each rating by the similarity of the user who gave it; unrated (NaN) cells stay NaN and are skipped by sum()
    weighted_sum = user_movie_matrix.mul(masked_similarities, axis=0).sum(axis=0)
    sim_sum = masked_similarities.sum(axis=0) # Per-movie sum of the masked similarities (the denominator)
    predicted_ratings = (weighted_sum / sim_sum).fillna(0) # A zero denominator yields NaN, which is replaced with 0
    top_movie_ids = predicted_ratings.sort_values(ascending=False).head(num).index # Get top 5 movie IDs from predicted ratings
    movie_id_to_title = movies.set_index('movie_id')['title'] # Map movie_id to title from the movies DataFrame
    top_movie_titles = top_movie_ids.map(movie_id_to_title)
    # Do the below to print in the required format
    # Convert to Series (if needed) for apply
    top_movie_titles_series = pd.Series(top_movie_titles)
    # Calculate max width for alignment
    title_width = max(top_movie_titles_series.apply(len).max(), len("Movie Name"))
    # Print table header
    print(f"| Ranking | {'Movie Name'.ljust(title_width)} |")
    print(f"|{'-' * 9}|{'-' * (title_width + 2)}|")
    # Print each movie with ranking
    for rank, title in enumerate(top_movie_titles, start=1):
        print(f"| {str(rank).ljust(7)} | {title.ljust(title_width)} |")

Enter the user id :  1


| Ranking | Movie Name                       |
|---------|----------------------------------|
| 1       | Star Wars (1977)                 |
| 2       | Return of the Jedi (1983)        |
| 3       | Fargo (1996)                     |
| 4       | Raiders of the Lost Ark (1981)   |
| 5       | Silence of the Lambs, The (1991) |


### **Item-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement an **item-based collaborative filtering** recommendation system using the **Movie dataset**. The goal is to recommend movies similar to a given movie based on user rating patterns.

#### **Step 1: Import Required Libraries**
This step repeats the imports from the previous task for emphasis.

Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing item similarity
```

#### **Step 2: Compute Item-Item Similarity**
- We will use **cosine similarity** to measure how similar each pair of movies is based on their user ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.
- Unlike user-based filtering, we need to **transpose** (`.T`) the `user_movie_matrix` because we want similarity between movies (columns) instead of users (rows).

##### **Instructions:**
1. Transpose the user-movie matrix using `.T` to make movies the rows.
2. Fill missing values with `0` using `.fillna(0)`.
3. Compute similarity using `cosine_similarity()`.
4. Convert the result into a **Pandas DataFrame**, with movies as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)
```

#### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies(movie_name, num=5)` to recommend movies similar to a given movie.

##### **Function Inputs:**
- `movie_name`: The target movie for which we need recommendations.
- `num`: The number of similar movies to recommend (default is 5).

##### **Function Steps:**
1. Find the **movie_id** corresponding to the given `movie_name` in the `movies` DataFrame.
2. If the movie is not found, return an appropriate message.
3. Extract the **similarity scores** for this movie from `item_sim_df`.
4. Sort the movies in **descending order** based on similarity (excluding the movie itself).
5. Retrieve the **top `num` similar movies**.
6. Map **movie IDs** to their **titles** using the `movies` DataFrame.
7. Return the results as a **Pandas DataFrame** with rankings.

#### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'ranking': range(1, num+1),
    'movie_name': movie_names
})
result_df.set_index('ranking', inplace=True)
```
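
A possible shape for the function, following the listed steps on toy data, is sketched below. It is an illustration rather than the reference solution: the keyword arguments `movies` and `item_sim_df` are added so the example is self-contained, the similarity values are written by hand, and returning a plain string for an unknown title is one way to read "an appropriate message".

```python
import pandas as pd


def recommend_movies(movie_name, num=5, *, movies, item_sim_df):
    """Return up to `num` movies most similar to `movie_name` (sketch)."""
    # Steps 1-2: look up the movie ID, or report that the title is unknown.
    match = movies.loc[movies["title"] == movie_name, "movie_id"]
    if match.empty:
        return f"Movie not found: {movie_name!r}"
    movie_id = match.iloc[0]
    # Steps 3-4: similarity scores for this movie, without the movie
    # itself, highest similarity first.
    scores = item_sim_df[movie_id].drop(movie_id).sort_values(ascending=False)
    # Step 5: keep the top `num` movie IDs.
    top_ids = scores.head(num).index
    # Step 6: translate movie IDs into titles.
    titles = movies.set_index("movie_id")["title"].reindex(top_ids).tolist()
    # Step 7: ranked DataFrame in the requested layout.
    result = pd.DataFrame({
        "ranking": range(1, len(titles) + 1),
        "movie_name": titles,
    })
    return result.set_index("ranking")


# Toy inputs (hypothetical values).
toy_movies = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ids = pd.Index([10, 20, 30], name="movie_id")
toy_item_sim = pd.DataFrame(
    [[1.0, 0.8, 0.3], [0.8, 1.0, 0.6], [0.3, 0.6, 1.0]],
    index=ids,
    columns=ids,
)

similar = recommend_movies("Movie A", num=2, movies=toy_movies, item_sim_df=toy_item_sim)
missing = recommend_movies("Movie Z", movies=toy_movies, item_sim_df=toy_item_sim)
print(similar)
print(missing)
```

The title match here is exact; a case-insensitive comparison is an easy extension if user input is typed by hand.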

#### **Example: Item-Based Collaborative Filtering**
```python
recommend_movies("Jurassic Park (1993)", num=5)
```
**Output:**
```
| Ranking | Movie Name                                |
|---------|-------------------------------------------|
| 1       | Top Gun (1986)                            |
| 2       | Empire Strikes Back, The (1980)           |
| 3       | Raiders of the Lost Ark (1981)            |
| 4       | Indiana Jones and the Last Crusade (1989) |
| 5       | Speed (1994)                              |
```

In [106]:
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0)) # Find the movie-movie similarity, by filling Nan as 0
item_item_sim = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns) # Convert into data frame
item_item_sim.to_csv('item_item_similarity.csv', index=True) # Save movie-movie similarity as .csv 

In [107]:
# Code the function here
movie_name = input("Enter the movie name: ") # Receive the movie name from the user
num = 5
matched_movie = movies[movies['title'].str.lower() == movie_name.lower()] # Fetch the movie_id for the inputted movie name

if matched_movie.empty: # No title matched, so the movie name is invalid; print a message
    print("Please enter a valid movie name")
else: # If movie_id is valid, proceed with below
    movie_id = matched_movie['movie_id'].values[0]  # Extract the movie ID as scalar value
    movie_sim_score = item_item_sim[movie_id] # Get similarity scores for the selected movie
    # Convert it into data frame with movie_id as index and values as movie_sim score
    movie_sim_score = pd.DataFrame({
        'movie_id': movie_sim_score.index,
        'sim_score': movie_sim_score.values
    })
    #movie_sim_score.to_csv('movie_sim_score.csv', index=True) # Save sim_score of movie as .csv 
    movie_sim_score = movie_sim_score[movie_sim_score['movie_id'] != movie_id] # Remove the selected movie itself from the list
    movie_sim_score = movie_sim_score.merge(movies[['movie_id', 'title']], on='movie_id') # Merge movie titles with movie_id for sorting
    movie_sim_score = movie_sim_score.sort_values(by=['sim_score', 'title'], ascending=[False, True]) # Sort the similarity scores in descending order
    top_5_titles = movie_sim_score['title'].head(num) # Get top num movies
    # Do the below for formatting the printing
    title_width = max(top_5_titles.apply(len).max(), len("Movie Name"))
    print(f"| Ranking | {'Movie Name'.ljust(title_width)} |") # Print table header
    print(f"|{'-' * 9}|{'-' * (title_width + 2)}|")
    # Print each movie with ranking
    for rank, title in enumerate(top_5_titles, start=1):
        print(f"| {str(rank).ljust(7)} | {title.ljust(title_width)} |")

Enter the movie name:  Jurassic Park (1993)


| Ranking | Movie Name                                |
|---------|-------------------------------------------|
| 1       | Top Gun (1986)                            |
| 2       | Speed (1994)                              |
| 3       | Raiders of the Lost Ark (1981)            |
| 4       | Empire Strikes Back, The (1980)           |
| 5       | Indiana Jones and the Last Crusade (1989) |


## **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm)**

### **Adjacency List**

#### **Objective**
In this task, you will preprocess the Movie dataset and construct a **graph representation** where:
- **Users** are connected to the movies they have rated.
- **Movies** are connected to users who have rated them.
  
This graph structure will help in exploring **user-movie relationships** for recommendations.

#### **Step 1: Merge Ratings with Movie Titles**
Since we have **movie IDs** in the ratings dataset but need human-readable movie titles, we will:
1. Merge the `ratings` DataFrame with the `movies` DataFrame using the `'movie_id'` column.
2. This allows each rating to be associated with a **movie title**.

#### **Hint:**
Use the following Pandas operation to merge:
```python
ratings = ratings.merge(movies, on='movie_id')
```


#### **Step 2: Aggregate Ratings**
Since a user may have rated the same movie more than once, we:
1. Group the dataset by `['user_id', 'movie_id', 'title']`.
2. Compute the **mean rating** for each movie by each user.
3. Reset the index to ensure we maintain a clean DataFrame structure.

#### **Hint:**  
Use `groupby()` and `mean()` as follows:
```python
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
```
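
The effect of this aggregation is easiest to see when a duplicate is actually present. The duplicate check in Part 1 found no repeated user-movie pairs in this dataset, so the sketch below uses made-up rows; `toy` and `aggregated` are illustrative names.

```python
import pandas as pd

# Made-up ratings in which user 1 rated movie 10 twice.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "movie_id": [10, 10, 10],
    "title":    ["Movie A", "Movie A", "Movie A"],
    "rating":   [4, 2, 5],
})

# One row per (user, movie, title); repeated ratings collapse to their mean.
aggregated = (
    toy.groupby(["user_id", "movie_id", "title"])["rating"]
    .mean()
    .reset_index()
)

print(aggregated)
```

User 1's two ratings (4 and 2) become a single rating of 3.0, and user 2's single rating is unchanged apart from becoming a float.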

#### **Step 3: Normalize Ratings**
Since different users have different rating biases, we normalize ratings by:
1. **Computing each user's mean rating**.
2. **Subtracting the mean rating** from each individual rating.

#### **Instructions:**
- Use `groupby('user_id')` to group ratings by users.
- Apply `transform(lambda x: x - x.mean())` to adjust ratings.

#### **Hint:**  
Normalize ratings using:
```python
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
```
This ensures each user’s ratings are centered around zero, making similarity calculations fairer.
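
As a small worked example of this centering, the sketch below applies the same `transform` to made-up ratings from two users; `toy` and `centered` are illustrative names.

```python
import pandas as pd

# Made-up ratings: user 1 averages 4.0, user 2 averages 3.0.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "rating":  [5, 3, 4, 4, 1],
})

# transform returns a Series aligned with the original rows, so each
# rating has its own user's mean subtracted.
centered = toy.groupby("user_id")["rating"].transform(lambda x: x - x.mean())

print(centered.tolist())
```

A rating of 4 from user 2 becomes +1.0 (above that user's average), while a rating that equals the user's mean becomes exactly 0. That matters for any later step that skips zero-weight edges.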

#### **Step 4: Construct the Graph Representation**
We represent the user-movie interactions as an **undirected graph** using an **adjacency list**:
- Each **user** is a node connected to movies they rated.
- Each **movie** is a node connected to users who rated it.

#### **Graph Construction Steps:**
1. Initialize an empty dictionary `graph = {}`.
2. Iterate through the **ratings dataset**.
3. For each `user_id` and `movie_id` pair:
   - Add the movie to the user’s set of connections.
   - Add the user to the movie’s set of connections.

#### **Hint:**  
The following code builds the graph:

```python
graph = {}
for _, row in ratings.iterrows():
    user, movie = row['user_id'], row['movie_id']
    if user not in graph:
        graph[user] = set()
    if movie not in graph:
        graph[movie] = set()
    graph[user].add(movie)
    graph[movie].add(user)
```

This results in a **bipartite graph**, where:
- **Users** are connected to multiple movies.
- **Movies** are connected to multiple users.

#### **Step 5: Understanding the Graph**
- **Nodes** in the graph represent **users and movies**.
- **Edges** exist between a user and a movie **if the user has rated the movie**.
- This structure allows us to find **users with similar movie tastes** and **movies frequently watched together**.

#### **Exploring the Graph**
- **Find a user’s rated movies:**  
  ```python
  user_id = 1
  print(graph[user_id])  # Movies rated by user 1
  ```

- **Find users who rated a movie:**  
  ```python
  movie_id = 50
  print(graph[movie_id])  # Users who rated movie 50
  ```
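
In this dataset both `user_id` and `movie_id` are integers starting at 1, so a dictionary keyed by the raw IDs stores user 1 and movie 1 under the same key. One way to keep the two node types apart, shown here as a sketch rather than as the required approach, is to tag each key with its type; `build_graph` and the `('user', id)` / `('movie', id)` tuples are illustrative choices.

```python
def build_graph(pairs):
    """Build an undirected bipartite adjacency list from (user_id, movie_id) pairs."""
    graph = {}
    for user_id, movie_id in pairs:
        # Tag every node with its type so user 1 and movie 1 stay distinct.
        user_node = ("user", user_id)
        movie_node = ("movie", movie_id)
        # setdefault creates the neighbour set on first sight of a node.
        graph.setdefault(user_node, set()).add(movie_node)
        graph.setdefault(movie_node, set()).add(user_node)
    return graph


# Made-up interactions: user 1 rated movies 1 and 2, user 2 rated movie 1.
toy_graph = build_graph([(1, 1), (1, 2), (2, 1)])

print(toy_graph[("user", 1)])   # movies rated by user 1
print(toy_graph[("movie", 1)])  # users who rated movie 1
```

Keying movies by title, as long as titles are unique strings and user IDs are integers, achieves the same separation.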

In [156]:
#### Code the function here
ratings = ratings.drop(columns=['title'], errors='ignore') # To avoid errors during execution of merge multiple times or duplication of columns.
ratings = ratings.merge(movies[['movie_id', 'title']], on='movie_id') # Merge only 'title' from movies
# Normalizing ratings by calculating mean rating for a movie and subtracting mean from each rating
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
graph = {} # Initialize adjacency list as a dictionary
# Build a bipartite graph with users as one group of nodes and movies as the other
for _, row in ratings.iterrows(): # Loop through the ratings data frame
    user, movie, weight = row['user_id'], row['title'], row['rating']
    if pd.isna(weight) or weight == 0: # Skip connections with zero or missing weight
        continue
    if user not in graph: # Create the user node (keyed by user_id) with an empty neighbor dictionary
        graph[user] = {}
    if movie not in graph: # Create the movie node (keyed by title) with an empty neighbor dictionary
        graph[movie] = {}
    # Establish an undirected link between the user and the movie,
    # storing the mean-centered rating as the edge weight in both directions
    graph[user][movie] = weight # Add the movie to the user's neighbors
    graph[movie][user] = weight # Add the user to the movie's neighbors
user_id = int(input("Enter the user id to find the movies rated by them : ")) # Receive the user id from tester
if user_id in graph: # Guard against ids that have no node in the graph
    print("Movies rated by user", user_id, ":", list(graph[user_id].keys())) # Print the respective movies rated by user
else:
    print("Invalid user id")
movie_id = int(input("Enter the movie id to find the users who rated : ")) # Receive the movie_id from tester
movie_title_arr = movies.loc[movies['movie_id'] == movie_id, 'title'].values # Look up the title for this movie_id
if len(movie_title_arr) == 0 or movie_title_arr[0] not in graph: # Check that the movie exists and has a node in the graph
    print("Invalid movie id")
else:
    movie_title = movie_title_arr[0]  # Extract the string from array
    print("Users who rated movie", movie_title, ":", list(graph[movie_title].keys())) # print the users who rated the movie

Enter the user id to find the movies rated by them :  1


Movies rated by user 1 : ['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', 'Get Shorty (1995)', 'Copycat (1995)', 'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)', 'Twelve Monkeys (1995)', 'Babe (1995)', 'Dead Man Walking (1995)', 'Richard III (1995)', 'Seven (Se7en) (1995)', 'Usual Suspects, The (1995)', 'Mighty Aphrodite (1995)', 'Postino, Il (1994)', "Mr. Holland's Opus (1995)", 'French Twist (Gazon maudit) (1995)', 'From Dusk Till Dawn (1996)', 'White Balloon, The (1995)', "Antonia's Line (1995)", 'Angels and Insects (1995)', 'Muppet Treasure Island (1996)', 'Braveheart (1995)', 'Taxi Driver (1976)', 'Rumble in the Bronx (1995)', 'Birdcage, The (1996)', 'Brothers McMullen, The (1995)', 'Bad Boys (1995)', 'Apollo 13 (1995)', 'Batman Forever (1995)', 'Belle de jour (1967)', 'Crimson Tide (1995)', 'Crumb (1994)', 'Desperado (1995)', 'Doom Generation, The (1995)', 'Free Willy 2: The Adventure Home (1995)', 'Mad Love (1995)', 'Nadja (1994)', 'Net, The (1995)', 'Strange

Enter the movie id to find the users who rated :  2


Users who rated movie 2 : [1, 5, 13, 22, 30, 42, 49, 64, 72, 83, 87, 92, 95, 102, 110, 130, 178, 193, 197, 200, 201, 207, 213, 217, 222, 234, 249, 250, 256, 267, 268, 271, 276, 279, 280, 292, 293, 301, 303, 305, 320, 325, 327, 346, 363, 373, 374, 378, 379, 385, 387, 393, 398, 399, 405, 407, 416, 425, 429, 435, 442, 450, 455, 466, 472, 484, 487, 495, 497, 506, 521, 532, 536, 543, 551, 561, 566, 600, 618, 621, 622, 627, 632, 640, 642, 643, 648, 650, 653, 655, 660, 671, 682, 686, 705, 709, 715, 727, 738, 746, 749, 751, 757, 764, 773, 774, 790, 795, 796, 798, 804, 806, 807, 815, 826, 830, 844, 846, 864, 868, 870, 880, 886, 889, 896, 899, 916, 924, 934, 943]


### **Implement Weighted Random Walks**

#### **Random Walk-Based Movie Recommendation System (Weighted Pixie)**

#### **Objective**
In this task, you will implement a **random-walk-based recommendation algorithm** using the **Weighted Pixie** method. This technique uses a **user-movie bipartite graph** to recommend movies by simulating a random walk from a given user or movie.

#### **Step 1: Import Required Libraries**
Make sure you have the necessary libraries:

```python
import random  # For random walks
import pandas as pd  # For handling data
```

#### **Step 2: Implement the Random Walk Algorithm**
Your task is to **simulate a random walk** from a given starting point in the **bipartite user-movie graph**.

##### **Hints for Implementation**
- Start from **either a user or a movie**.
- At each step, **randomly move** to a connected node.
- Keep track of **how many times each movie is visited**.
- After completing the walk, **rank movies by visit count**.
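
The hints above can be sketched on a toy graph as follows. This is a minimal, unweighted version meant only to show the walk-and-count loop. The graph contents and the helper name `random_walk_counts` are hypothetical, and the convention that movie nodes are strings while user nodes are integers is an assumption of this sketch.

```python
import random
from collections import Counter

# Toy graph (hypothetical): user nodes are ints, movie nodes are title strings.
toy_graph = {
    1: ["Movie A", "Movie B"],
    2: ["Movie B", "Movie C"],
    "Movie A": [1],
    "Movie B": [1, 2],
    "Movie C": [2],
}

def random_walk_counts(graph, start, walk_length, rng=random):
    """Walk `walk_length` steps from `start`, counting visits to movie nodes."""
    visits = Counter()
    current = start
    for _ in range(walk_length):
        neighbors = graph.get(current, [])
        if not neighbors:                 # dead end: stop early
            break
        current = rng.choice(neighbors)   # uniform step to a connected node
        if isinstance(current, str):      # only movie nodes are counted
            visits[current] += 1
    return visits

counts = random_walk_counts(toy_graph, start=1, walk_length=20, rng=random.Random(0))
print(counts.most_common())  # movies ranked by visit count
```

Because the graph is bipartite, a walk that starts at a user lands on a movie every other step, so 20 steps yield 10 movie visits here.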

#### **Step 3: Implement User-Based Recommendation**
**Hints:**
- Check if the `user_id` exists in the `graph`.
- Start a loop that runs for `walk_length` steps.
- Randomly pick a **connected node** (user or movie).
- Track how many times each **movie** is visited.
- Sort movies by visit frequency and return the **top N**.
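
One possible reading of these hints, sketched under stated assumptions: the next node is sampled with probability proportional to the edge weight, using `random.choices` from the standard library. The toy graph, the function name `weighted_walk_top_n`, and the `seed` parameter are hypothetical. The weights here are raw 1-5 ratings, because `random.choices` expects non-negative weights; mean-centered ratings would need to be shifted or clipped first.

```python
import random
from collections import Counter

# Toy weighted graph (hypothetical): edge weights are raw 1-5 ratings.
toy_graph = {
    1: {"Movie A": 5, "Movie B": 1},
    2: {"Movie A": 4, "Movie C": 5},
    "Movie A": {1: 5, 2: 4},
    "Movie B": {1: 1},
    "Movie C": {2: 5},
}

def weighted_walk_top_n(graph, user_id, walk_length=15, num=5, seed=None):
    """Return up to `num` movie names ranked by visits in a weighted random walk."""
    if user_id not in graph:              # unknown user: nothing to recommend
        return []
    rng = random.Random(seed)
    visits = Counter()
    current = user_id
    for _ in range(walk_length):
        neighbors = list(graph[current])
        if not neighbors:                 # dead end: stop early
            break
        weights = [graph[current][n] for n in neighbors]
        # Sample the next node with probability proportional to its edge weight.
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        if isinstance(current, str):      # movie nodes are strings in this sketch
            visits[current] += 1
    return [movie for movie, _ in visits.most_common(num)]

print(weighted_walk_top_n(toy_graph, user_id=1, walk_length=50, num=2, seed=42))
```

Sampling in proportion to the weight keeps some randomness in the walk, whereas always following the heaviest edge makes it largely deterministic.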

#### **Step 4: Implement Movie-Based Recommendation**
**Hints:**
- Find the `movie_id` corresponding to the given `movie_name`.
- Ensure the movie exists in the `graph`.
- Start a random walk from that movie.
- Follow the same **tracking and ranking** process as the user-based version.

**Your task:** Implement **either** Step 3 **or** Step 4 as a single function:
- User-based: `weighted_pixie_recommend(user_id, walk_length=15, num=5)`
- Movie-based: `weighted_pixie_recommend(movie_name, walk_length=15, num=5)`

#### **Step 5: Running Your Recommendation System**
Once your function is implemented, test it by calling:

##### **Example: User-Based Recommendation**
```python
weighted_pixie_recommend(1, walk_length=15, num=5)
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | My Own Private Idaho (1991)   |
| 2       | Aladdin (1992)                |
| 3       | 12 Angry Men (1957)           |
| 4       | Happy Gilmore (1996)          |
| 5       | Copycat (1995)                |


##### **Example: Movie-Based Recommendation**
```python
weighted_pixie_recommend("Jurassic Park (1993)", walk_length=10, num=5)
```
| Ranking | Movie Name                           |
|---------|-------------------------------------|
| 1       | Rear Window (1954)                 |
| 2       | Great Dictator, The (1940)         |
| 3       | Field of Dreams (1989)             |
| 4       | Casablanca (1942)                  |
| 5       | Nightmare Before Christmas, The (1993) |


#### **Step 6: Understanding the Results**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

Each movie is ranked based on **how frequently it was visited** during the walk.
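
As a rough illustration of this output format, the sketch below turns a visit-count mapping into the two-column table. The counts and the helper name `to_ranking_frame` are hypothetical; the `Ranking` column is sized to the number of movies actually found, since a short walk may visit fewer than `num` movies.

```python
import pandas as pd
from collections import Counter

# Hypothetical visit counts, as a random walk might produce them.
visit_counts = Counter({"Movie A": 7, "Movie B": 3, "Movie C": 5})

def to_ranking_frame(counts, num=5):
    """Build the Ranking / Movie Name table from a visit-count mapping."""
    top = [movie for movie, _ in Counter(counts).most_common(num)]
    return pd.DataFrame({
        "Ranking": range(1, len(top) + 1),  # sized to the movies actually found
        "Movie Name": top,
    })

print(to_ranking_frame(visit_counts, num=2).to_string(index=False))
```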

#### **Experiment with Different Parameters**
- Try different **`walk_length`** values and observe how the recommendations change.
- Adjust the number of recommended movies (`num`).

In [158]:
# Code the function here
def weighted_choice_with_tie_break(nodes, weights):
    """
    Select the node with the highest weight.
    If there is a tie (multiple nodes with the same max weight), choose randomly among them.
    """
    max_weight = max(weights) # Get the maximum weight
    max_weight_nodes = []

    for i in range(len(nodes)): # Collect all nodes with the highest weight
        if weights[i] == max_weight:
            max_weight_nodes.append(nodes[i])
    return random.choice(max_weight_nodes)     # Randomly pick one among the tied nodes if there is more than one node with same maximum weights
    
def weighted_pixie_recommend(user_id, walk_length, graph):
    if user_id not in graph: # Unknown user: there is no node to start the walk from
        return {}
    movie_visit_count = {} # Initialize the movie_visit_count dictionary
    current_node = user_id # Initialize current_node as the user_id provided

    # Perform the random walk for the specified number of steps
    for i in range(walk_length):
        neighbors = list(graph[current_node].keys())  # Get neighbors (movies or users) as list
        if not neighbors: # If no neighbors (dead end), break out of the loop
            break
        neighbor_weights = [graph[current_node][neighbor] for neighbor in neighbors] # Get the corresponding weights for the neighbors
        next_node = weighted_choice_with_tie_break(neighbors, neighbor_weights) # Select the next node using weighted choice with tie-breaking
        if isinstance(next_node, str):  # Movie nodes are strings. If the next node is a movie, increment its visit count
            if next_node not in movie_visit_count: # Add movie to dictionary and increment
                movie_visit_count[next_node] = 0
            movie_visit_count[next_node] += 1    
        current_node = next_node # Move to the next node
    
    return movie_visit_count

user_id = int(input("Enter user id : ")) # Get the user_id from the user
walk_length = int(input("Enter the walk length : ")) # Get the walk length
num = int(input("Enter the number of top movies to be listed : ")) # Get the number of recommendations to be given
recommended_movies = weighted_pixie_recommend(user_id, walk_length, graph) # Call the function to get the movies
sorted_movies = sorted(recommended_movies, key=lambda k: recommended_movies[k], reverse=True) #Sort the movies in reverse order based on visit count
# Form data frame
top_movies = sorted_movies[:num]  # Select at most num movies (the walk may visit fewer)
movies_df = pd.DataFrame({
    'Ranking': range(1, len(top_movies) + 1),  # Match the number of movies actually found
    'Movie Name': top_movies
})
print(movies_df.to_string(index=False))  # Print without index

Enter user id :  1
Enter the walk length :  12
Enter the number of top movies to be listed :  5


 Ranking                       Movie Name
       1           Terminator, The (1984)
       2                  Net, The (1995)
       3 Shawshank Redemption, The (1994)
       4               Stand by Me (1986)
       5                    Gandhi (1982)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [181]:
train_ratings, test_ratings = train_test_split(ratings_data, test_size=0.2, random_state=42)
user_movie_matrix = train_ratings.pivot(index='user_id', columns='movie_id', values='rating').fillna(0) #Formation of user-movie matrix
user_similarity = cosine_similarity(user_movie_matrix) # Find similarity between users using cosine similarity
user_user_sim = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index) # Convert into 2d similarity dataframe with both rows and columns as users
actual_ratings = []
predicted_list = []
for _, row in test_ratings.iterrows():
    user_id = row["user_id"]
    movie_id = row["movie_id"]
    actual = row["rating"]

    if user_id in user_user_sim.index and movie_id in user_movie_matrix.columns:
        similarity_row = user_user_sim.loc[user_id].copy() # Similarities between this user and all others (copied so user_user_sim is not modified)
        similarity_row[user_id] = 0 # Exclude the user's own similarity
        movie_ratings = user_movie_matrix[movie_id] # Training ratings of this movie by every user (0 means not rated)
        rated_mask = movie_ratings != 0 # True only for users who rated this movie
        masked_similarities = similarity_row[rated_mask] # Keep similarity scores only for users who rated the movie
        weighted_sum = (movie_ratings[rated_mask] * masked_similarities).sum() # Similarity-weighted sum of their ratings
        sim_sum = masked_similarities.sum() # Sum of similarities over users who rated the movie
        predicted_rating = weighted_sum / sim_sum if sim_sum != 0 else 0 # Avoid division by zero (predict 0 when no similar user rated the movie)
        actual_ratings.append(actual)
        predicted_list.append(predicted_rating)
# Final accuracy scores
rmse = np.sqrt(mean_squared_error(actual_ratings, predicted_list))
mae = mean_absolute_error(actual_ratings, predicted_list)

print(f"\nUser-User Collaborative Filtering Accuracy:")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE : {mae:.4f}")


User-User Collaborative Filtering Accuracy:
  RMSE: 1.0146
  MAE : 0.8060


In [183]:
# Split the data
train_ratings, test_ratings = train_test_split(ratings_data, test_size=0.2, random_state=42)

# Create user-movie matrix
user_movie_matrix = train_ratings.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)

# Transpose for item-item CF
movie_user_matrix = user_movie_matrix.T  # Now rows are movies, columns are users

# Compute movie similarity
item_similarity = cosine_similarity(movie_user_matrix)
item_item_sim = pd.DataFrame(item_similarity, index=movie_user_matrix.index, columns=movie_user_matrix.index)

actual_ratings = []
predicted_ratings = []

# Loop over test data
for _, row in test_ratings.iterrows():
    user_id = row["user_id"]
    movie_id = row["movie_id"]
    actual = row["rating"]
    
    if movie_id in item_item_sim.index and user_id in user_movie_matrix.index:
        similarity_scores = item_item_sim.loc[movie_id].copy()  # Similarities between this movie and all others (copied so item_item_sim is not modified)
        similarity_scores[movie_id] = 0  # Exclude self-similarity

        user_ratings = user_movie_matrix.loc[user_id]  # Ratings by this user
        rated_movies = user_ratings[user_ratings > 0].index  # Movies the user has rated
        
        relevant_similarities = similarity_scores[rated_movies]
        relevant_ratings = user_ratings[rated_movies]

        if relevant_similarities.sum() > 0:
            weighted_sum = np.dot(relevant_similarities, relevant_ratings)
            sim_sum = relevant_similarities.sum()
            predicted_rating = weighted_sum / sim_sum
        else:
            predicted_rating = user_ratings.mean()  # Fallback: mean of the user's full row (unrated movies count as 0)
        
        actual_ratings.append(actual)
        predicted_ratings.append(predicted_rating)

# Final accuracy scores
rmse = np.sqrt(mean_squared_error(actual_ratings, predicted_ratings))
mae = mean_absolute_error(actual_ratings, predicted_ratings)

print(f"\nItem-Item Collaborative Filtering Accuracy:")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE : {mae:.4f}")


Item-Item Collaborative Filtering Accuracy:
  RMSE: 1.0128
  MAE : 0.8067


In [None]:
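from collections import defaultdict
from tqdm import tqdm

# Step 1: Train-test split (if not already done)
train_ratings, test_ratings = train_test_split(ratings_data, test_size=0.2, random_state=42)

# Step 2: Create ground truth mapping: user_id -> set of movie titles in test set
# (movie nodes in `graph` are keyed by title, so the ground truth must use titles too)
id_to_title = movies.set_index('movie_id')['title']
test_user_movies = defaultdict(set)
for _, row in test_ratings.iterrows():
    test_user_movies[row['user_id']].add(id_to_title[int(row['movie_id'])])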

# Step 3: Set parameters
K = 10
walk_length = 500  # Reduced for speed
num_sample_users = 100  # Evaluate only on 100 users

# Step 4: Sampling users
test_user_list = list(test_user_movies.keys())
random.shuffle(test_user_list)
sample_users = test_user_list[:num_sample_users]

# Step 5: Initialize metrics
precision_total = 0
recall_total = 0
hit_total = 0
evaluated_users = 0

# Step 6: Evaluate Pixie algorithm
for user_id in tqdm(sample_users, desc="Evaluating Pixie"):
    if user_id not in graph:
        continue  # skip users not present in the graph

    visit_counts = weighted_pixie_recommend(user_id, walk_length, graph)
    recommended_movies = sorted(visit_counts, key=visit_counts.get, reverse=True)
    top_k_recommendations = recommended_movies[:K]

    ground_truth = test_user_movies[user_id]
    if not ground_truth:
        continue

    hits = len(set(top_k_recommendations) & ground_truth)
    precision = hits / K
    recall = hits / len(ground_truth)
    hit_rate = 1 if hits > 0 else 0

    precision_total += precision
    recall_total += recall
    hit_total += hit_rate
    evaluated_users += 1

# Step 7: Final averaged scores
if evaluated_users > 0:
    precision_at_k = precision_total / evaluated_users
    recall_at_k = recall_total / evaluated_users
    hit_rate_at_k = hit_total / evaluated_users

    print(f"\n📊 Pixie Algorithm Evaluation @K={K} (on {evaluated_users} users)")
    print(f"  Precision@K: {precision_at_k:.4f}")
    print(f"  Recall@K   : {recall_at_k:.4f}")
    print(f"  Hit Rate@K : {hit_rate_at_k:.4f}")
else:
    print("No valid users were evaluated.")
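
The per-user metrics computed in the loop above can be checked by hand on a small example. The sketch below is a standalone illustration; the helper name `precision_recall_at_k` and the movie lists are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k, hit) for one user's ranked list."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k                                 # share of the top k that is relevant
    recall = hits / len(relevant) if relevant else 0.0   # share of relevant items recovered
    return precision, recall, int(hits > 0)

# Hypothetical example: 2 of the top-4 recommendations are in the held-out set of 3.
recommended = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
relevant = {"Movie B", "Movie D", "Movie Z"}
print(precision_recall_at_k(recommended, relevant, k=4))
```

Here precision@4 is 2/4 and recall@4 is 2/3. Both sides of the intersection must use the same identifier type (titles or IDs), otherwise every user scores zero hits.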

---

## **Submission Requirements:**

To successfully complete this assignment, ensure that you submit the following:


### **1. Jupyter Notebook Submission**
- Submit a **fully completed Jupyter Notebook** that includes:
  - **All implemented recommendation functions** (user-based, item-based, and random walk-based recommendations).
  - **Code explanations** in markdown cells to describe each step.
  - **Results and insights** from running your recommendation models.


### **2. Explanation of Pixie-Inspired Algorithms (3-5 Paragraphs)**
- Write a **detailed explanation** of **Pixie-inspired random walk algorithms** used for recommendations.
- Your explanation should cover:
  - What **Pixie-inspired recommendation systems** are.
  - How **random walks** help in identifying relevant recommendations.
  - Any real-world applications of such algorithms in industry.


### **3. Report for the Submitted Notebook**
Your report should be structured as follows:

#### **Title: Movie Recommendation System Report**

#### **1. Introduction**
- Briefly introduce **movie recommendation systems** and why they are important.
- Explain the **different approaches used** (user-based, item-based, random-walk).

#### **2. Dataset Description**
- Describe the **MovieLens 100K dataset**:
  - Number of users, movies, and ratings.
  - What features were used.
  - Any preprocessing performed.

#### **3. Methodology**
- Explain the three recommendation techniques implemented:
  - **User-based collaborative filtering** (how user similarity was calculated).
  - **Item-based collaborative filtering** (how item similarity was determined).
  - **Random-walk-based Pixie algorithm** (why graph-based approaches are effective).
  
#### **4. Implementation Details**
- Discuss the steps taken to build the functions.
- Describe how the **adjacency list graph** was created.
- Explain how **random walks** were performed and how visited movies were ranked.

#### **5. Results and Evaluation**
- Present **example outputs** from each recommendation approach.
- Compare the different methods in terms of accuracy and usefulness.
- Discuss any **limitations** in the implementation.

#### **6. Conclusion**
- Summarize the key takeaways from the project.
- Discuss potential improvements (e.g., **hybrid models, additional features**).
- Suggest real-world applications of the methods used.

### **Submission Instructions**

- Submit a `.zip` file containing the Jupyter Notebook, all provided data files, and the CSV files you saved (i.e., `users.csv`, `movies.csv`, and `ratings.csv`). Also include the Report and the Pixie Algorithm explanation document.
- [`Bonus 10 Points`] **Upload your Jupyter Notebook, Explanation Document, and Report** to your GitHub repository.
- Ensure the repository is public and contains:
  - `users.csv`, `movies.csv`, and `ratings.csv` (the DataFrames created in Part 1, saved and exported as `.csv` files)
  - `Movie_Recommendation.ipynb`
  - `Pixie_Algorithm_Explanation.pdf` or `.md`
  - `Recommendation_Report.pdf` or `.md`
- **Submit the GitHub repository link in the cell below.**


#### **Example Submission Format**
```text
GitHub Repository: https://github.com/username/Movie-Recommendation
```

In [None]:
# Submit the GitHub link here:
# https://github.com/gowshiksaravanan19/KDD-6162

### **Grading Rubric: ITCS 6162 - Data Mining Assignment**


| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **Part 1: Exploring and Cleaning Data (15 pts)**  | Properly loads `u.user`, `u.data`, and `u.item` datasets into DataFrames | 5 |
|                                           | Handles missing values, duplicates, and inconsistencies appropriately | 5 |
|                                           | Saves the cleaned datasets into CSV files: `users.csv`, `movies.csv`, `ratings.csv` | 5 |
| **Part 2: Collaborative Filtering-Based Recommendation (30 pts)** | Implements user-based collaborative filtering correctly | 10 |
|                                           | Implements item-based collaborative filtering correctly | 10 |
|                                           | Computes similarity measures accurately and provides valid recommendations | 10 |
| **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm) (35 pts)** | Constructs adjacency lists properly from user-movie interactions | 10 |
|                                           | Implements weighted random walk-based recommendation correctly | 15 |
|                                           | Explains and justifies the algorithm design choices (Pixie-inspired) | 10 |
| **Code Quality & Documentation (10 pts)** | Code is well-structured, efficient, and follows best practices | 5 |
|                                           | Markdown explanations and comments are clear and enhance understanding | 5 |
| **Results & Interpretation (5 pts)**      | Provides meaningful insights from the recommendation system's output | 5 |
| **Submission & Report (5 pts)**          | Submits all required files in the correct format (ZIP file with Jupyter notebook, processed CSV files, and project report) | 5 |
| **Total**                                 |                              | 100 |

#### **Bonus (10 pts)**
| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **GitHub Submission**                     | Provides a well-documented GitHub repository with CSV files, a structured README, and a properly formatted Jupyter Notebook | 10 |