# ITCS 6162: Data Mining - Programming Assignment

**In this assignment, you will explore data analysis, recommendation algorithms, and graph-based techniques using the MovieLens dataset. Your tasks will range from basic data exploration to advanced recommendation models, including:**
- Data manipulation with pandas
- User-item collaborative filtering
- Similarity-based recommendation models
- A Pixie-inspired Graph-based recommendation using adjacency lists with weighted random walks (without using NetworkX)


#### **Dataset Files:**
- **`u.data`**: User-movie ratings (`user_id  movie_id  rating  timestamp`)
- **`u.item`**: Movie metadata (`movie_id | title | release date | IMDB_website`)
- **`u.user`**: User demographics (`user_id | age | gender | occupation | zip_code`)

## **Part 1: Exploring and Cleaning Data**

### Inspecting the Dataset Format

The dataset is not in a traditional CSV format. To examine its structure, use the following shell command to display the first 10 lines of the file:

```sh
!head <file_name>
```


**In the cells given below, write the code to read the files.**

In [None]:
# u.data
def inspect_file_format(filepath, encoding="utf-8"):
    print(f"--- First 10 lines of {filepath} ---")
    try:
        with open(filepath, "r", encoding=encoding) as f:
            for i in range(10):
                line = f.readline()
                if not line:
                    break
                print(line.strip())
    except Exception as e:
        print(f"Error reading {filepath}: {e}")
    print("\n")

# Inspect each of the MovieLens dataset files
inspect_file_format("u.data")


<h4> Explanation:</h4>
    <p>The inspect_file_format function reads and prints the first 10 lines of a given file (like "u.data"). It's used to quickly check the structure and format of a file. It handles errors gracefully (like file not found or encoding issues) and removes extra whitespace from each printed line for cleaner output.</p>

Result:
The file above is the ratings data. Each row has four fields (user_id, movie_id, rating, timestamp), and the fields are separated by a tab (\t).
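The delimiter can also be checked programmatically rather than by eye. The sketch below is one simple way to do that; `guess_delimiter` is a hypothetical helper, not part of the assignment, and it only counts how often each candidate character appears in a few sample lines.

```python
def guess_delimiter(lines, candidates=("\t", "|", ",")):
    """Return the candidate that occurs most often across the sample lines.

    Hypothetical helper for illustration; a crude heuristic, not a parser.
    """
    return max(candidates, key=lambda d: sum(line.count(d) for line in lines))


# Sample rows in the u.user layout (pipe-separated)
user_sample = ["1|24|M|technician|85711", "2|53|F|other|94043"]
# Sample rows in the u.data layout, written with explicit tab characters
rating_sample = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]

print(repr(guess_delimiter(user_sample)))
print(repr(guess_delimiter(rating_sample)))
```

A heuristic like this is only a sanity check; the separator passed to `pd.read_csv()` should still be confirmed against the raw file.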

In [2]:
# u.item
def inspect_file_format(filepath, encoding="utf-8"):
    print(f"--- First 10 lines of {filepath} ---")
    try:
        with open(filepath, "r", encoding=encoding) as f:
            for i in range(10):
                line = f.readline()
                if not line:
                    break
                print(line.strip())
    except Exception as e:
        print(f"Error reading {filepath}: {e}")
    print("\n")

# Inspect each of the MovieLens dataset files

inspect_file_format("u.item", "latin-1")    # pipe-separated, non-UTF characters


--- First 10 lines of u.item ---
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/ti

Explanation:
The inspect_file_format function reads and prints the first 10 lines of a given file (like "u.item"). It's used to quickly check the structure and format of a file. It handles errors gracefully (like file not found or encoding issues) and removes extra whitespace from each printed line for cleaner output.

Result:
The file above is the movie metadata. Its columns are movie_id, title, release_date, IMDb_link, and a set of genre flags, separated by a pipe (|).

In [3]:
# u.user
def inspect_file_format(filepath, encoding="utf-8"):
    print(f"--- First 10 lines of {filepath} ---")
    try:
        with open(filepath, "r", encoding=encoding) as f:
            for i in range(10):
                line = f.readline()
                if not line:
                    break
                print(line.strip())
    except Exception as e:
        print(f"Error reading {filepath}: {e}")
    print("\n")

# Inspect each of the MovieLens dataset files
inspect_file_format("u.user")              

--- First 10 lines of u.user ---
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703




Explanation:
The inspect_file_format function reads and prints the first 10 lines of a given file (like "u.user"). It's used to quickly check the structure and format of a file. It handles errors gracefully (like file not found or encoding issues) and removes extra whitespace from each printed line for cleaner output.

Result:
The file above is the user demographics data. It has 5 fields (user_id, age, gender, occupation, zip_code), separated by a pipe (|).

#### Loading the Dataset with Pandas

Use **pandas** to load the dataset into a DataFrame for analysis. Follow these steps:  

1. Import the necessary library: `pandas`.  
2. Use `pd.read_csv()` (or an appropriate function) to read the dataset file.  
3. Ensure the dataset is loaded with the correct delimiter (e.g., `','`, `'\t'`,`'|'` , or another separator if needed).  
4. Select and display the first few rows using `.head()`.

Ensure that:  

- The `ratings` dataset is read from `"u.data"` using tab (`'\t'`) as a separator and column names (`"user_id"`, `"movie_id"`, `"rating"` and `"timestamp"`).  
- The `movies` dataset is read from `"u.item"` using `'|'` as a separator, use columns (`0`, `1`, `2`), encoding (`"latin-1"`) and name the columns (`movie_id`, `title`, and `release_date`).  
- The `users` dataset is read from `"u.user"` using `'|'` as a separator, use columns (`0`, `1`, `2`, `3`) and name the columns (`user_id`, `age`, `gender`, and `occupation`).

In [4]:
# ratings
import pandas as pd
df_ratings = pd.read_csv("u.data", sep="\t", names=["user_id", "movie_id", "rating", "timestamp"])

#### Code Explanation

- `import pandas as pd`: Loads the pandas library for data analysis.

- `pd.read_csv("u.data", sep="\t", names=[...])`:  
  Reads the `u.data` file (tab-separated) and adds column names:  
  - `user_id`, `movie_id`, `rating`, `timestamp`

- Stores the data in `df_ratings`, a DataFrame used for easy analysis.

In [5]:
# movies
df_movies = pd.read_csv("u.item", sep="|", encoding="latin-1", usecols=[0, 1, 2],names=["movie_id", "title", "release_date"])

#### Code Explanation 

- Reads `u.item` (movie data) file from the MovieLens dataset.
- `sep="|"`: File is pipe-separated.
- `encoding="latin-1"`: Handles special characters.
- `usecols=[0, 1, 2]`: Loads only the first 3 columns:
  - `movie_id`, `title`, `release_date`
- Stores the result in `df_movies` DataFrame.

In [7]:
# users
df_users = pd.read_csv("u.user", sep="|", usecols=[0, 1, 2, 3],names=["user_id", "age", "gender", "occupation"])

#### Code Explanation 

- Reads `u.user` (user data) file from the MovieLens dataset.
- `sep="|"`: File is pipe-separated.
- `usecols=[0, 1, 2, 3]`: Loads the first 4 columns:
  - `user_id`, `age`, `gender`, `occupation`
- `names=[...]`: Assigns custom column names.
- Stores the result in `df_users` DataFrame.

**Note:** As a **Bonus** task, save the `ratings`, `movies`, and `users` DataFrames as `.csv` files. <br>
**Hint:** Use the `to_csv()` function in pandas to save these DataFrames as CSV files.

In [8]:
# ratings
df_ratings.to_csv("ratings.csv", index=False)
print("Saved: ratings.csv")

Saved: ratings.csv


#### Code Explanation

- `df_ratings.to_csv("ratings.csv", index=False)`  
  Saves the `df_ratings` DataFrame to a file named `ratings.csv`.  
  - `index=False` prevents pandas from writing row numbers to the file.

- `print("Saved: ratings.csv")`  
  Prints a confirmation message after the file is successfully saved.

In [9]:
# movies
df_movies.to_csv("movies.csv", index=False)
print("Saved: movies.csv")

Saved: movies.csv


#### Code Explanation

- `df_movies.to_csv("movies.csv", index=False)`  
  Saves the `df_movies` DataFrame to a file named `movies.csv`.  
  - `index=False` prevents row numbers from being included in the file.

- `print("Saved: movies.csv")`  
  Prints a message confirming the file was saved.

In [10]:
# users
df_users.to_csv("users.csv", index=False)
print("Saved: users.csv")

Saved: users.csv


#### Code Explanation

- `df_users.to_csv("users.csv", index=False)`  
  Saves the `df_users` DataFrame to a file named `users.csv`.  
  - `index=False` ensures row numbers are not saved in the file.

- `print("Saved: users.csv")`  
  Displays a confirmation message after saving.
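To see what `index=False` changes, the following sketch writes a small, made-up DataFrame to an in-memory buffer instead of a file and reads it back. The sample rows and the `round_trip` helper are assumptions for illustration only.

```python
import io

import pandas as pd


def round_trip(df, index):
    """Write df as CSV text to memory, then read it back (illustrative helper)."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)


sample = pd.DataFrame({"user_id": [1, 2], "age": [24, 53]})

with_index = round_trip(sample, index=True)
without_index = round_trip(sample, index=False)

# Saving the index adds an extra unnamed column when the file is read back
print(list(with_index.columns))
print(list(without_index.columns))
```

With the index saved, the reloaded frame gains an `Unnamed: 0` column; with `index=False`, the columns match the original.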

**Display the first 10 rows of each file.**

In [11]:

# ratings
df_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


#### Code Explanation 

- `df_ratings.head(10)`  
  Displays the first 10 rows of the `df_ratings` DataFrame.  
  Useful for quickly checking the structure and contents of the data.

In [12]:
# movies
df_movies.head(10)

Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995
6,7,Twelve Monkeys (1995),01-Jan-1995
7,8,Babe (1995),01-Jan-1995
8,9,Dead Man Walking (1995),01-Jan-1995
9,10,Richard III (1995),22-Jan-1996


#### Code Explanation

- `df_movies.head(10)`  
  Displays the first 10 rows of the `df_movies` DataFrame.  
  Helps verify the movie titles, release dates, and format of the data.

In [13]:
# users
df_users.head(10)

Unnamed: 0,user_id,age,gender,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other
5,6,42,M,executive
6,7,57,M,administrator
7,8,36,M,administrator
8,9,29,M,student
9,10,53,M,lawyer


#### Code Explanation

- `df_users.head(10)`  
  Shows the first 10 rows of the `df_users` DataFrame.  
  Useful for checking user details like age, gender, and occupation.


### Data Cleaning and Exploration with Pandas  

After loading the dataset, it’s important to clean and explore the data to ensure consistency and accuracy. Below are key **pandas** functions for cleaning and understanding the dataset.

#### 1. Handle Missing Values  
- `df.dropna()` – Removes rows with missing values.  
- `df.fillna(value)` – Fills missing values with a specified value.  

#### 2. Remove Duplicates  
- `df.drop_duplicates()` – Drops duplicate rows from the dataset.  

#### 3. Handle Incorrect Data Types  
- `df.astype(dtype)` – Converts columns to the appropriate data type.  

#### 4. Filter Outliers (if applicable)  
- `df[df['column_name'] > threshold]` – Filters rows based on a condition.  

#### 5. Rename Columns (if needed)  
- `df.rename(columns={'old_name': 'new_name'})` – Renames columns for clarity.  

#### 6. Reset Index  
- `df.reset_index(drop=True, inplace=True)` – Resets the index after cleaning.  

### Data Exploration Functions  

To better understand the dataset, use these **pandas** functions:  

- `df.shape` – Returns the number of rows and columns in the dataset.  
- `df.nunique()` – Displays the number of unique values in each column.  
- `df['column_name'].unique()` – Returns unique values in a specific column.  

**Example Usage in Pandas:**  
```python
import pandas as pd

# Load dataset
df = pd.read_csv("your_file.csv")

# Drop missing values
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Convert 'timestamp' column to datetime format
# (unit="s" assumes Unix epoch seconds, as in u.data; without it,
# integer values are interpreted as nanoseconds)
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'], unit="s")

# Display dataset shape
print("Dataset shape:", df_cleaned.shape)

# Display number of unique values in each column
print("Unique values per column:\n", df_cleaned.nunique())

# Display unique movie IDs
print("Unique movie IDs:", df_cleaned['movie_id'].unique()[:10])  # Show first 10 unique movie IDs
```

**Note:** The functions mentioned above are some of the widely used **pandas** functions for data cleaning and exploration. However, it is not necessary that all of these functions will be required in the exercises below. Use them as needed based on the dataset and the specific tasks.

**Convert Timestamps into Readable dates.**

In [14]:
# ratings

# Convert the timestamp column to datetime readable format
df_ratings["datetime"] = pd.to_datetime(df_ratings["timestamp"], unit="s")

# Displaying first few rows to verify
df_ratings[["timestamp", "datetime"]].head()

Unnamed: 0,timestamp,datetime
0,881250949,1997-12-04 15:55:49
1,891717742,1998-04-04 19:22:22
2,878887116,1997-11-07 07:18:36
3,880606923,1997-11-27 05:02:03
4,886397596,1998-02-02 05:33:16


#### Code Explanation

- `df_ratings["datetime"] = pd.to_datetime(df_ratings["timestamp"], unit="s")`  
  Converts the Unix timestamp in the `timestamp` column to a human-readable datetime format.  
  - `unit="s"` specifies that the timestamp is in seconds.

- `df_ratings[["timestamp", "datetime"]].head()`  
  Displays the first few rows of the original and converted columns to verify the change.
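Once the column holds datetimes, the `.dt` accessor can pull out parts such as the year or month. A minimal sketch on two timestamps taken from the output above; the `demo` DataFrame and the `year`/`month` column names are illustrative assumptions.

```python
import pandas as pd

# Two Unix timestamps (seconds) taken from the u.data rows shown earlier
demo = pd.DataFrame({"timestamp": [881250949, 891717742]})
demo["datetime"] = pd.to_datetime(demo["timestamp"], unit="s")

# Derive calendar fields from the converted column
demo["year"] = demo["datetime"].dt.year
demo["month"] = demo["datetime"].dt.month

print(demo)
```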


**Check for Missing Values**

In [15]:
# ratings
df_ratings.isnull().sum()

user_id      0
movie_id     0
rating       0
timestamp    0
datetime     0
dtype: int64

#### Code Explanation

- `df_ratings.isnull().sum()`  
  Checks each column in the `df_ratings` DataFrame for missing (null) values.  
  Returns the total count of null values per column.


In [16]:
# movies
df_movies.isnull().sum()

movie_id        0
title           0
release_date    1
dtype: int64

#### Code Explanation

- `df_movies.isnull().sum()`  
  Identifies missing (null) values in each column of the `df_movies` DataFrame.  
  Returns the count of nulls per column.


In [17]:
## Handling the missing values

df_movies["release_date"] = df_movies["release_date"].fillna("1990-01-01")
df_movies.isnull().sum()

movie_id        0
title           0
release_date    0
dtype: int64

#### Code Explanation

- `df_movies["release_date"] = df_movies["release_date"].fillna("1990-01-01")`  
  Replaces the one missing `release_date` value with the placeholder `"1990-01-01"`, so no movie record is dropped.  
  Assigning the result back avoids calling `fillna(..., inplace=True)` on a selected column, which recent pandas versions warn about.  
  Note that the placeholder is not in the `DD-Mon-YYYY` format used by the other rows (for example `01-Jan-1995`).

- `df_movies.isnull().sum()`  
  Confirms that all missing values have been handled by checking for remaining nulls.
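An alternative to a placeholder string, sketched here under the assumption that dates follow the `DD-Mon-YYYY` pattern seen in `u.item`, is to parse the column into real datetimes and let missing entries stay as `NaT`. The `dates` Series is made-up sample data, and this is not the approach used in the cell above.

```python
import pandas as pd

# Made-up sample: two dates in the u.item style plus one missing value
dates = pd.Series(["01-Jan-1995", None, "22-Jan-1996"])

# Parse with an explicit format; unparseable or missing entries become NaT
parsed = pd.to_datetime(dates, format="%d-%b-%Y", errors="coerce")

print(parsed)
print("Missing after parsing:", parsed.isna().sum())
```

Keeping `NaT` makes the gap explicit, at the cost of having to handle missing dates in any later date arithmetic.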


In [18]:
# users
df_users.isnull().sum()

user_id       0
age           0
gender        0
occupation    0
dtype: int64

#### Code Explanation

- `df_users.isnull().sum()`  
  Checks each column in the `df_users` DataFrame for missing (null) values.  
  Helps ensure data completeness before performing analysis.


**Print the total number of users, movies, and ratings.**

In [19]:
print(f"Total Users: {df_users['user_id'].nunique()}")
print(f"Total Movies: { df_movies['movie_id'].nunique() }")
print(f"Total Ratings: {len(df_ratings)}")

Total Users: 943
Total Movies: 1682
Total Ratings: 100000


#### Code Explanation

- Counts unique users using `df_users['user_id'].nunique()`
- Counts unique movies using `df_movies['movie_id'].nunique()`
- Counts total ratings using `len(df_ratings)`
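These three counts also show how sparse the ratings are. The arithmetic below uses only the totals printed above and assumes each user rated a given movie at most once; the variable names are illustrative.

```python
n_users = 943
n_movies = 1682
n_ratings = 100_000

# Fraction of all possible user-movie pairs that actually have a rating
density = n_ratings / (n_users * n_movies)
sparsity = 1 - density

print(f"Possible pairs: {n_users * n_movies}")
print(f"Density:  {density:.2%}")
print(f"Sparsity: {sparsity:.2%}")
```

Roughly 6% of the possible user-movie pairs have a rating, which is why most cells of a user-movie matrix built from this data are empty.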


## **Part 2: Collaborative Filtering-Based Recommendation**

### **Create a User-Item Matrix**

#### Instructions for Creating a User-Movie Rating Matrix

In this exercise, you will create a user-movie rating matrix using **pandas**. This matrix will represent the ratings that users have given to different movies.

1. **Dataset Overview**:  
   The dataset has already been loaded. It includes the following key columns:
   - `user_id`: The ID of the user.
   - `movie_id`: The ID of the movie.
   - `rating`: The rating the user gave to the movie.

2. **Create the User-Movie Rating Matrix**:  
   Use the **`pivot()`** function in **pandas** to reshape the data. Your goal is to create a matrix where:
   - Each **row** represents a **user**.
   - Each **column** represents a **movie**.
   - Each **cell** contains the **rating** that the user has given to the movie.

   Specify the following parameters for the `pivot()` function:
   - **`index`**: The `user_id` column (this will define the rows).
   - **`columns`**: The `movie_id` column (this will define the columns).
   - **`values`**: The `rating` column (this will fill the matrix with ratings).

3. **Inspect the Matrix**:  
   After creating the matrix, examine the first few rows of the resulting matrix to ensure it has been constructed correctly.

4. **Handle Missing Values**:  
   It's likely that some users have not rated every movie, resulting in `NaN` values in the matrix. You will need to handle these missing values. Consider the following options:
   - **Fill with 0**: If you wish to represent missing ratings as zeros (indicating no rating).
   - **Fill with the average rating**: Alternatively, replace missing values with the average rating for each movie.
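The two filling options can be compared on a toy matrix. This is a sketch on made-up ratings, with `toy` standing in for the real data; filling with the per-movie mean relies on `DataFrame.mean()` skipping `NaN` by default.

```python
import pandas as pd

# Made-up long-format ratings: 3 users, 2 movies, two missing pairs
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 20],
    "rating":   [4, 2, 2, 5],
})

matrix = toy.pivot(index="user_id", columns="movie_id", values="rating")

# Option 1: treat a missing rating as 0
filled_zero = matrix.fillna(0)

# Option 2: replace a missing rating with that movie's mean rating
filled_mean = matrix.fillna(matrix.mean(axis=0))

print(filled_zero)
print(filled_mean)
```

Note that `pivot()` raises an error if the same (user, movie) pair appears more than once; `pivot_table()` would aggregate such duplicates instead.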

**Create the user-movie rating matrix using the `pivot()` function.**

In [20]:
# Create the pivot table (user-movie matrix)
user_movie_matrix = df_ratings.pivot(index='user_id', columns='movie_id', values='rating')

# Fill missing values (NaNs) with 0 
user_movie_matrix = user_movie_matrix.fillna(0)

# Display to verify
print(user_movie_matrix.head())

movie_id  1     2     3     4     5     6     7     8     9     10    ...  \
user_id                                                               ...   
1          5.0   3.0   4.0   3.0   3.0   5.0   4.0   1.0   5.0   3.0  ...   
2          4.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   2.0  ...   
3          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
4          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
5          4.0   3.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   

movie_id  1673  1674  1675  1676  1677  1678  1679  1680  1681  1682  
user_id                                                               
1          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
2          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
3          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
4          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
5          0.0   0.0   0.0   0.0  

#### Code Explanation

- Creates a pivot table where:
  - Rows = users (`user_id`)
  - Columns = movies (`movie_id`)
  - Values = ratings

- Fills missing values (`NaN`) with 0 using `.fillna(0)`

- Displays the first few rows to verify the matrix.


**Display the matrix to verify the transformation.**

In [21]:
# Display the transformed user-movie rating matrix
print("Transformed User-Movie Rating Matrix:")
user_movie_matrix.head()

Transformed User-Movie Rating Matrix:


user_id \ movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Code Explanation

- Prints the first few rows of the `user_movie_matrix`  
- Shows how ratings are organized by user (rows) and movie (columns)


### **User-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement a **user-based collaborative filtering** movie recommendation system using the **Movie dataset**. The goal is to recommend movies to a user based on the preferences of similar users.

##### **Step 1: Import Required Libraries**
Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing user similarity
```

##### **Step 2: Compute User-User Similarity**
- We will use **cosine similarity** to measure how similar each pair of users is based on their movie ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.

##### **Instructions:**
1. Fill missing values with `0` using `.fillna(0)`.
2. Compute similarity using `cosine_similarity()`.
3. Convert the result into a **Pandas DataFrame**, with users as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
```
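For reference, cosine similarity between two rating vectors is their dot product divided by the product of their lengths. The sketch below computes it with NumPy on a made-up 3x3 matrix so the numbers can be checked by hand; `cosine_sim_matrix` is a hypothetical helper, not the scikit-learn function.

```python
import numpy as np


def cosine_sim_matrix(ratings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    ratings = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit_rows = ratings / norms
    return unit_rows @ unit_rows.T


# Made-up ratings: rows are users, columns are movies, 0 means "not rated"
toy_ratings = [
    [5, 0, 0],
    [0, 3, 0],
    [4, 4, 0],
]

similarity = cosine_sim_matrix(toy_ratings)
print(np.round(similarity, 3))
```

The first two users share no rated movie, so their similarity is 0, while the third user overlaps with both and gets a similarity of about 0.707 with each.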

##### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies_for_user(user_id, num=5)` to recommend movies for a given user.

##### **Function Inputs:**
- `user_id`: The target user for whom we need recommendations.
- `num`: The number of movies to recommend (default is 5).

##### **Function Steps:**
1. Find **similar users**:
   - Retrieve the similarity scores for the given `user_id`.
   - Sort them in **descending** order (highest similarity first).
   - Exclude the user themselves.
   
2. Get the **movie ratings** from these similar users.

3. Compute the **average rating** for each movie based on these users' preferences.

4. Sort the movies in **descending order** based on the computed average ratings.

5. Retrieve the **top `num` recommended movies**.

6. Map **movie IDs** to their **titles** using the `movies` DataFrame.

7. Return the results as a **Pandas DataFrame** with rankings.

##### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names     
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: User-Based Collaborative Filtering**
```python
recommend_movies_for_user(10, num = 5)
```
**Output:**
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | In the Company of Men (1997)   |
| 2       | Misérables, Les (1995)         |
| 3       | Thin Blue Line, The (1988)     |
| 4       | Braindead (1992)               |
| 5       | Boys, Les (1997)               |
```


### Step 1: Importing Required Libraries

In [22]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

### Step 2: Computing User Similarity Matrix

In [23]:
# Filling missing values with 0

user_movie_matrix_filled = user_movie_matrix.fillna(0)

#Computing cosine similarity between users

user_similarity = cosine_similarity(user_movie_matrix_filled)

# Creating dataframe for user similarity giving users as rows and columns

user_sim_df = pd.DataFrame(user_similarity, index = user_movie_matrix.index, columns = user_movie_matrix.index)

#### Code Explanation

- `user_movie_matrix.fillna(0)`  
  Ensures all missing values are filled with 0 before computing similarity.

- `cosine_similarity(user_movie_matrix_filled)`  
  Calculates cosine similarity between all users based on their movie ratings.

- `pd.DataFrame(..., index=..., columns=...)`  
  Converts the similarity matrix into a DataFrame with users as both rows and columns for easier interpretation.


### Implementing Recommendation Function

In [24]:
def recommend_movies_for_user(user_id, num=5):
    # Ensure movie title lookup is ready
    df_movies['movie_id'] = df_movies['movie_id'].astype(int)
    movie_titles = df_movies.set_index('movie_id')['title']

    # Step 1: Get similarity scores (excluding the user)
    if user_id not in user_sim_df.columns:
        print(f"User {user_id} not found in user similarity data.")
        return pd.DataFrame(columns=["Movie Name"])

    similar_users = user_sim_df[user_id].sort_values(ascending=False).drop(user_id)

    # Step 2: Ratings of similar users
    top_users = similar_users.index
    similar_user_ratings = user_movie_matrix.loc[top_users]

    # Step 3: Average movie ratings from similar users
    movie_scores = similar_user_ratings.mean(axis=0)

    # Step 4: Remove already rated movies
    rated_movies = user_movie_matrix.loc[user_id].dropna().index
    movie_scores = movie_scores.drop(rated_movies, errors='ignore')

    # Step 5: Check if there are unrated movies
    unrated_count = movie_scores.shape[0]
    if unrated_count == 0:
        print(f"User {user_id} has no unrated movies left.")
    
    # Step 6: Smart Fallback — Only fallback if the user has too few unrated movies
    if unrated_count == 0 or len(rated_movies) < 10:
        
        movie_scores = similar_user_ratings.mean(axis=0).sort_values(ascending=False)
        movie_scores = movie_scores.drop(rated_movies, errors='ignore')

    # Step 7: Final fallback to top globally rated movies if necessary
    if movie_scores.empty:
        
        top_movie_ids = df_ratings.groupby("movie_id")["rating"].mean().sort_values(ascending=False).head(num).index
    else:
        top_movie_ids = movie_scores.head(num).index

    # Step 8: Get movie titles
    movie_names = [movie_titles.get(int(mid), f"Movie ID {mid}") for mid in top_movie_ids]

    # Step 9: Format result
    result_df = pd.DataFrame({
        'Ranking': range(1, len(movie_names) + 1),
        'Movie Name': movie_names     
    })
    result_df.set_index('Ranking', inplace=True)

    return result_df

# Example usage
recommend_movies_for_user(200, num=5)

User 200 has no unrated movies left.


Ranking,Movie Name
1,"Great Day in Harlem, A (1994)"
2,Someone Else's America (1995)
3,Marlene Dietrich: Shadow and Light (1996)
4,They Made Me a Criminal (1939)
5,Entertaining Angels: The Dorothy Day Story (1996)


#### Code Explanation

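This function returns the top `num` movies for a given user based on user-user similarity.

##### Steps:
1. **Movie Lookup**: Prepares a mapping from movie ID to title.
2. **Find Similar Users**: Sorts all other users by similarity to the input `user_id`, most similar first.
3. **Gather Ratings**: Collects the ratings of those users. All other users are included, not only the nearest ones.
4. **Score Movies**: Computes the unweighted mean rating of each movie across those users, with unrated movies counted as 0.
5. **Filter Rated Movies**: Drops the movies the input user has already rated. Because `user_movie_matrix` was filled with 0 earlier, `.dropna()` removes nothing here, so every movie is treated as rated and no candidates remain. This is why the example prints "User 200 has no unrated movies left."
6. **Smart Fallback**: If no candidates remain, or the user has rated fewer than 10 movies, recomputes the same averages in descending order and filters rated movies again.
7. **Final Fallback**: If the candidate list is still empty, recommends the movies with the highest global mean rating from `df_ratings`.
8. **Map Titles**: Converts recommended movie IDs to titles.
9. **Return**: Outputs a ranked DataFrame of movie recommendations.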
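Because the matrix in this notebook was zero-filled before this function runs, one way to find the movies a user has actually rated is to test for values greater than 0 rather than rely on `.dropna()`. The sketch below shows the difference on a made-up matrix; it is an illustration, not a change to the function above.

```python
import numpy as np
import pandas as pd

# Made-up user-movie matrix with NaN for "not rated"
raw = pd.DataFrame(
    [[5, np.nan, 3], [np.nan, 4, np.nan]],
    index=pd.Index([1, 2], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)
filled = raw.fillna(0)

user_row = filled.loc[1]

# After zero-filling there is nothing left for dropna() to remove
rated_via_dropna = user_row.dropna().index.tolist()

# Valid ratings are 1-5, so a value above 0 marks a movie the user rated
rated_via_mask = user_row[user_row > 0].index.tolist()

print(rated_via_dropna)
print(rated_via_mask)
```

On the zero-filled row, `dropna()` reports all three movies as rated, while the mask correctly reports only movies 10 and 30.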

##### Example:
```python
recommend_movies_for_user(200, num=5)
```


#### Insights for the User-based collaborative filtering recommender system

- Similarity-Driven Personalization: Recommendations are based on ratings from users most similar to the target user. This helps capture taste patterns and offer relevant suggestions.

- Good for Sparse Data: Even if a user has rated only a few movies, the system can still work well by leveraging behavior from other users.

- Prevents Redundancy: Movies already rated by the user are excluded from recommendations, making the results genuinely new.

- Fallbacks Improve Robustness: When similar users have rated few new movies, the model falls back to average ratings or top-rated global movies, ensuring consistent output.
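One common refinement, sketched here on made-up data, is to restrict scoring to the `k` most similar users and weight each rating by similarity while ignoring unrated (0) entries. `score_movies`, `k`, and the toy matrices are assumptions for illustration and differ from the function implemented above.

```python
import pandas as pd


def score_movies(user_id, ratings, sim, k=2):
    """Similarity-weighted mean rating per movie from the k nearest users."""
    neighbors = sim[user_id].drop(user_id).nlargest(k)
    neighbor_ratings = ratings.loc[neighbors.index]
    rated_mask = (neighbor_ratings > 0).astype(float)  # 0 means "not rated"

    # Sum of similarity * rating, divided by the similarity of actual raters
    weighted_sum = neighbor_ratings.mul(neighbors, axis=0).sum(axis=0)
    weight_total = rated_mask.mul(neighbors, axis=0).sum(axis=0)
    scores = (weighted_sum / weight_total).dropna()

    # Keep only movies the target user has not rated yet
    user_row = ratings.loc[user_id]
    unseen = user_row[user_row == 0].index
    return scores.reindex(unseen).dropna().sort_values(ascending=False)


# Made-up zero-filled ratings (rows: users, columns: movies)
ratings = pd.DataFrame(
    [[5, 0, 0], [4, 2, 0], [5, 0, 4], [1, 5, 0]],
    index=[1, 2, 3, 4], columns=[10, 20, 30],
)
# Made-up symmetric user-user similarity scores
sim = pd.DataFrame(
    [[1.0, 0.9, 0.8, 0.1], [0.9, 1.0, 0.5, 0.3],
     [0.8, 0.5, 1.0, 0.2], [0.1, 0.3, 0.2, 1.0]],
    index=[1, 2, 3, 4], columns=[1, 2, 3, 4],
)

top_scores = score_movies(1, ratings, sim, k=2)
print(top_scores)
```

Here user 1's two nearest neighbors are users 2 and 3, so movie 30 (rated 4 by user 3) ranks above movie 20 (rated 2 by user 2).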



### **Item-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement an **item-based collaborative filtering** recommendation system using the **Movie dataset**. The goal is to recommend movies similar to a given movie based on user rating patterns.

#### **Step 1: Import Required Libraries**
Although this was already done in the previous task, the imports are repeated here for emphasis.

Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing item similarity
```

#### **Step 2: Compute Item-Item Similarity**
- We will use **cosine similarity** to measure how similar each pair of movies is based on their user ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.
- Unlike user-based filtering, we need to **transpose** (`.T`) the `user_movie_matrix` because we want similarity between movies (columns) instead of users (rows).

##### **Instructions:**
1. Transpose the user-movie matrix using `.T` to make movies the rows.
2. Fill missing values with `0` using `.fillna(0)`.
3. Compute similarity using `cosine_similarity()`.
4. Convert the result into a **Pandas DataFrame**, with movies as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)
```

#### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies(movie_name, num=5)` to recommend movies similar to a given movie.

##### **Function Inputs:**
- `movie_name`: The target movie for which we need recommendations.
- `num`: The number of similar movies to recommend (default is 5).

##### **Function Steps:**
1. Find the **movie_id** corresponding to the given `movie_name` in the `movies` DataFrame.
2. If the movie is not found, return an appropriate message.
3. Extract the **similarity scores** for this movie from `item_sim_df`.
4. Sort the movies in **descending order** based on similarity (excluding the movie itself).
5. Retrieve the **top `num` similar movies**.
6. Map **movie IDs** to their **titles** using the `movies` DataFrame.
7. Return the results as a **Pandas DataFrame** with rankings.
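The lookup-and-sort steps above can be sketched on a made-up similarity table. `similar_titles`, the three titles, and the similarity values are placeholders for illustration, not results from the MovieLens data.

```python
import pandas as pd


def similar_titles(movie_name, titles, item_sim, num=2):
    """Return the num titles most similar to movie_name, ranked from 1."""
    title_to_id = {title: movie_id for movie_id, title in titles.items()}
    if movie_name not in title_to_id:
        return f"Movie '{movie_name}' not found."

    movie_id = title_to_id[movie_name]
    # Drop the movie itself, then keep the highest similarity scores
    top_ids = item_sim[movie_id].drop(movie_id).nlargest(num).index

    result = pd.DataFrame({
        "Ranking": range(1, len(top_ids) + 1),
        "Movie Name": [titles[i] for i in top_ids],
    })
    return result.set_index("Ranking")


# Placeholder titles and similarity scores
titles = pd.Series({1: "Movie A", 2: "Movie B", 3: "Movie C"})
item_sim = pd.DataFrame(
    [[1.0, 0.2, 0.7], [0.2, 1.0, 0.4], [0.7, 0.4, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)

recommendations = similar_titles("Movie A", titles, item_sim)
print(recommendations)
```

Sizing the `Ranking` column from the number of results, rather than from `num`, keeps the DataFrame valid when fewer than `num` similar movies are available.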

#### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'ranking': range(1, num+1),
    'movie_name': movie_names
})
result_df.set_index('ranking', inplace=True)
```

#### **Example: Item-Based Collaborative Filtering**
```python
recommend_movies("Jurassic Park (1993)", num=5)
```
**Output:**

| Ranking | Movie Name                                |
|---------|-------------------------------------------|
| 1       | Top Gun (1986)                            |
| 2       | Empire Strikes Back, The (1980)           |
| 3       | Raiders of the Lost Ark (1981)            |
| 4       | Indiana Jones and the Last Crusade (1989) |
| 5       | Speed (1994)                              |

### Steps for performing Item-based collaborative filtering

In [25]:
# Code the function here
# Transposing the user-movie matrix so that movies are rows, users are columns 

movie_user_matrix = user_movie_matrix.T.fillna(0)

# Computing cosine similarity between movies

item_similarity = cosine_similarity(movie_user_matrix)

#Store similarity in a Dataframe with movie IDs as row/columns labels

item_sim_df = pd.DataFrame(item_similarity, index=movie_user_matrix.index, columns=movie_user_matrix.index)

#### Code Explanation

- `user_movie_matrix.T.fillna(0)`  
  Transposes the user-movie matrix so that movies become rows and users become columns.

- `cosine_similarity(movie_user_matrix)`  
  Computes cosine similarity between movies based on user ratings.

- `pd.DataFrame(..., index=..., columns=...)`  
  Stores the similarity matrix in a DataFrame `item_sim_df`, using movie IDs as both row and column labels.


In [26]:
# Defining The Recommendation Function

def recommend_movies(movie_name, num=5):
    # Prepare movie lookup
    df_movies['movie_id'] = df_movies['movie_id'].astype(int)
    movie_lookup = df_movies.set_index('title')['movie_id'].to_dict()

    if movie_name not in movie_lookup:
        print(f"Movie '{movie_name}' not found in the dataset.")
        return pd.DataFrame(columns=['movie_name'])

    # Get movie ID
    target_movie_id = movie_lookup[movie_name]

    # Check if the movie exists in the similarity matrix
    if target_movie_id not in item_sim_df.index:
        print(f"No similarity data found for '{movie_name}' (movie_id: {target_movie_id}).")
        return pd.DataFrame(columns=['movie_name'])

    # Get similarity scores, sort, and exclude the movie itself
    sim_scores = item_sim_df[target_movie_id].drop(target_movie_id)
    top_movie_ids = sim_scores.sort_values(ascending=False).head(num).index

    # Get movie titles from IDs
    movie_titles = df_movies.set_index('movie_id')['title']
    movie_names = [movie_titles.get(mid, f"Movie ID {mid}") for mid in top_movie_ids]

    # Format output
    result_df = pd.DataFrame({
        'ranking': range(1, len(movie_names) + 1),
        'movie_name': movie_names
    })
    result_df.set_index('ranking', inplace=True)

    return result_df


recommend_movies("Jurassic Park (1993)", num=5)

| ranking | movie_name                                |
|---------|-------------------------------------------|
| 1       | Top Gun (1986)                            |
| 2       | Speed (1994)                              |
| 3       | Raiders of the Lost Ark (1981)            |
| 4       | Empire Strikes Back, The (1980)           |
| 5       | Indiana Jones and the Last Crusade (1989) |


#### Code Explanation

This function recommends the top `num` movies most similar to a given movie, based on item-item similarity.

##### Steps:
1. **Movie Lookup**: Maps movie titles to IDs using `df_movies`.
2. **Check Existence**: Ensures the input movie exists in the dataset and similarity matrix.
3. **Get Similarities**: Retrieves cosine similarity scores for the given movie.
4. **Filter & Sort**: Excludes the movie itself and selects top similar movie IDs.
5. **Get Titles**: Converts similar movie IDs back to titles.
6. **Return**: Outputs a ranked DataFrame of recommended movies.

##### Example:
```python
recommend_movies("Jurassic Park (1993)", num=5)
```

#### Insights: Item-Based Collaborative Filtering Recommender System



##### Key Insights

- **Content-Agnostic but Context-Aware**:  
  The model recommends similar movies purely based on user interactions, without requiring metadata like genre or actors.

- **Consistent & Stable Recommendations**:  
  Unlike user-based filtering, item similarities change slowly over time, resulting in more consistent recommendations.

- **Scales Well for Large Datasets**:  
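  When the item catalog is smaller than the user base, as in many production systems, the item-item similarity matrix is smaller and cheaper to maintain than a user-user one. MovieLens 100K is the opposite case (1,682 movies, 943 users), so this advantage does not apply to this dataset.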

- **Great for Cold Start Users**:  
  Since recommendations are based on items, not users, even new users with few ratings can get suggestions if they've rated at least one movie.

---


## **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm)**

### **Adjacency List**

#### **Objective**
In this task, you will preprocess the Movie dataset and construct a **graph representation** where:
- **Users** are connected to the movies they have rated.
- **Movies** are connected to users who have rated them.
  
This graph structure will help in exploring **user-movie relationships** for recommendations.

#### **Step 1: Merge Ratings with Movie Titles**
Since we have **movie IDs** in the ratings dataset but need human-readable movie titles, we will:
1. Merge the `ratings` DataFrame with the `movies` DataFrame using the `'movie_id'` column.
2. This allows each rating to be associated with a **movie title**.

#### **Hint:**
Use the following Pandas operation to merge:
```python
ratings = ratings.merge(movies, on='movie_id')
```
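As a small illustration of the merge, the sketch below joins two made-up frames and uses the `validate` argument of `DataFrame.merge` to assert that each `movie_id` appears only once on the movie side. The `toy_*` frames and their values are assumptions for demonstration only.

```python
import pandas as pd

toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]}
)
toy_movies = pd.DataFrame(
    {"movie_id": [10, 20], "title": ["Movie A", "Movie B"]}
)

# validate="many_to_one" raises a MergeError if movie_id is duplicated in toy_movies,
# which would otherwise silently multiply rating rows.
toy_merged = toy_ratings.merge(
    toy_movies, on="movie_id", how="inner", validate="many_to_one"
)
print(toy_merged)
```

With the default inner join, ratings whose `movie_id` has no match in the movie table are dropped, so comparing row counts before and after the merge is a cheap check.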


#### **Step 2: Aggregate Ratings**
Since a user may have rated the same movie more than once, we:
1. Group the dataset by `['user_id', 'movie_id', 'title']`.
2. Compute the **mean rating** for each movie by each user.
3. Reset the index to ensure we maintain a clean DataFrame structure.

#### **Hint:**  
Use `groupby()` and `mean()` as follows:
```python
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
```

#### **Step 3: Normalize Ratings**
Since different users have different rating biases, we normalize ratings by:
1. **Computing each user's mean rating**.
2. **Subtracting the mean rating** from each individual rating.

#### **Instructions:**
- Use `groupby('user_id')` to group ratings by users.
- Apply `transform(lambda x: x - x.mean())` to adjust ratings.

#### **Hint:**  
Normalize ratings using:
```python
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
```
This ensures each user’s ratings are centered around zero, making similarity calculations fairer.
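A minimal worked example of the mean-centering step on made-up ratings is sketched below. The frame `toy` and the extra column name `rating_centered` are hypothetical; writing to a new column rather than overwriting `rating` is a design choice of this sketch, not something the assignment requires.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "movie_id": [10, 20, 30, 10, 20],
    "rating": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# User 1 has mean 4.0 and user 2 has mean 1.5; each rating is shifted by its user's mean.
toy["rating_centered"] = toy.groupby("user_id")["rating"].transform(
    lambda x: x - x.mean()
)
print(toy)
```

Keeping the raw rating alongside the centered one can be useful later, for example if non-negative values are needed as edge weights.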

#### **Step 4: Construct the Graph Representation**
We represent the user-movie interactions as an **undirected graph** using an **adjacency list**:
- Each **user** is a node connected to movies they rated.
- Each **movie** is a node connected to users who rated it.

#### **Graph Construction Steps:**
1. Initialize an empty dictionary `graph = {}`.
2. Iterate through the **ratings dataset**.
3. For each `user_id` and `movie_id` pair:
   - Add the movie to the user’s set of connections.
   - Add the user to the movie’s set of connections.

#### **Hint:**  
The following code builds the graph:

```python
graph = {}
for _, row in ratings.iterrows():
    user, movie = row['user_id'], row['movie_id']
    if user not in graph:
        graph[user] = set()
    if movie not in graph:
        graph[movie] = set()
    graph[user].add(movie)
    graph[movie].add(user)
```

This results in a **bipartite graph**, where:
- **Users** are connected to multiple movies.
- **Movies** are connected to multiple users.
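One caveat worth keeping in mind: in MovieLens both user IDs and movie IDs are small integers starting at 1, so with bare integers as keys, `graph[1]` would mix the neighbours of user 1 and movie 1. The sketch below shows one possible workaround, tagging each node with its type as a tuple. The `("user", id)` / `("movie", id)` convention and the `toy_*` names are assumptions of this sketch, not part of the assignment's hint.

```python
from collections import defaultdict

# Toy (user_id, movie_id) pairs; note that ID 1 appears as both a user and a movie.
toy_pairs = [(1, 1), (1, 2), (2, 1), (2, 3)]

toy_graph = defaultdict(set)
for user_id, movie_id in toy_pairs:
    user_node = ("user", user_id)      # typed key, cannot collide with a movie
    movie_node = ("movie", movie_id)
    toy_graph[user_node].add(movie_node)
    toy_graph[movie_node].add(user_node)

print(toy_graph[("user", 1)])   # movies rated by user 1
print(toy_graph[("movie", 1)])  # users who rated movie 1
```

With typed keys, a walk can tell whether it has landed on a movie by checking the first element of the tuple instead of testing membership in the movie table.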

#### **Step 5: Understanding the Graph**
- **Nodes** in the graph represent **users and movies**.
- **Edges** exist between a user and a movie **if the user has rated the movie**.
- This structure allows us to find **users with similar movie tastes** and **movies frequently watched together**.

#### **Exploring the Graph**
- **Find a user’s rated movies:**  
  ```python
  user_id = 1
  print(graph[user_id])  # Movies rated by user 1
  ```

- **Find users who rated a movie:**  
  ```python
  movie_id = 50
  print(graph[movie_id])  # Users who rated movie 50
  ```

### Steps for Implementing Graph Based Recommender

In [27]:
# Code the function here

#ensuring movie_id is of type int in both dataframes

df_movies['movie_id'] = df_movies['movie_id'].astype(int)
df_ratings['movie_id'] = df_ratings['movie_id'].astype(int)

#Merging to get movie titles into the ratings dataset

ratings_merged = df_ratings.merge(df_movies[['movie_id','title']], on='movie_id')

#### Code Explanation

- Ensures `movie_id` column is of type `int` in both `df_movies` and `df_ratings` to avoid merge issues.

- Merges `df_ratings` with selected columns from `df_movies` (`movie_id`, `title`) using:
  ```python
  df_ratings.merge(df_movies[['movie_id', 'title']], on='movie_id')
  ```

In [28]:
# grouping by user, movie, title and taking mean rating (if duplicates)

ratings_aggregated = ratings_merged.groupby(['user_id', 'movie_id','title'])['rating'].mean().reset_index()

#### Code Explanation

- Groups the merged DataFrame by `user_id`, `movie_id`, and `title`.

- Calculates the **mean rating** for each user-movie pair in case of duplicates.

- Uses `.reset_index()` to convert the grouped data back into a clean DataFrame.

- Stores the result in `ratings_aggregated`.


In [29]:
# Normalizing Ratings 

# Subtracting each user's mean rating from their individual ratings

ratings_aggregated['rating'] = ratings_aggregated.groupby('user_id')['rating'].transform(lambda x: x - x.mean())

#### Code Explanation

- Adjusts ratings by subtracting each user's average rating from their individual ratings.

- Helps remove personal bias (e.g., users who rate everything high or low).

- Uses `groupby('user_id')` with `transform()` to apply the normalization to each user's ratings.

- Updates the `rating` column in `ratings_aggregated` with normalized values.


In [30]:
# Constructing Graph Representation

# Initialize empty graph

graph = {}

#Iterating through the dataframes and building the bipartite graph

for _, row in ratings_aggregated.iterrows():
    user = row['user_id']
    movie = row['movie_id']
    
    
    #Add movie to user node
    
    if user not in graph:
        graph[user] = set()
    graph[user].add(movie)
    
    #Add user to movie node
    
    if movie not in graph:
        graph[movie] = set()
    graph[movie].add(user)  # must run for every rating, not only the first one seen

#### Code Explanation

- Initializes an empty dictionary `graph` to represent a bipartite graph.

- Iterates over `ratings_aggregated` to build connections:
  - Adds each `movie_id` to the set of movies connected to a `user_id`.
  - Adds each `user_id` to the set of users connected to a `movie_id`.

- The result is a dictionary where:
  - Keys are either user IDs or movie IDs.
  - Values are sets of connected nodes (movies for users, users for movies).
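Because every rating should create an edge in both directions, one quick sanity check after building an adjacency list is to look for edges that lack their reverse. The helper below is a hypothetical utility (`find_missing_back_edges` is not part of the assignment), shown on toy graphs with string labels.

```python
def find_missing_back_edges(adjacency):
    """Return (a, b) pairs where b is a neighbour of a but a is not a neighbour of b."""
    missing = []
    for node, neighbours in adjacency.items():
        for other in neighbours:
            if node not in adjacency.get(other, set()):
                missing.append((node, other))
    return missing

# Toy adjacency lists; string labels keep users and movies distinct.
good = {"u1": {"m1", "m2"}, "u2": {"m1"}, "m1": {"u1", "u2"}, "m2": {"u1"}}
broken = {"u1": {"m1"}, "u2": {"m1"}, "m1": {"u1"}}  # m1 lacks the edge back to u2

print(find_missing_back_edges(good))    # []
print(find_missing_back_edges(broken))  # [('u2', 'm1')]
```

An empty result means the graph is symmetric, which is what an undirected bipartite graph requires.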


In [31]:
# Exploring the Graph

user_id = 1
print(f"Movies rated by User {user_id}:", graph.get(user_id, "user not found"))

movie_id = 50
print(f"Users who rated Movie {movie_id}:", graph.get(movie_id,"Movie not found"))

Movies rated by User 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217

#### Code Explanation

- Retrieves and prints:
  - Movies rated by a specific user (e.g., `user_id = 1`)
  - Users who rated a specific movie (e.g., `movie_id = 50`)

- Uses `graph.get(key, default)` to safely access the graph:
  - Returns connected nodes if found.
  - Returns a default message if the user or movie is not in the graph.


#### Results and Insights

##### Model Overview

- The recommender builds a **bipartite graph** of users and movies based on interactions (ratings).
- It performs **random walks** on this graph to discover potential recommendations:
  - **Movie-based**: Starting from a movie to find related movies.
  - **User-based**: Starting from a user to discover new movies through indirect user connections.

---

##### Sample Results

**User-Based Walk (User ID = 1):**
```python
weighted_pixie_recommend_user(1, walk_length=15, num=5)
```

### **Implement Weighted Random Walks**

#### **Random Walk-Based Movie Recommendation System (Weighted Pixie)**

#### **Objective**
In this task, you will implement a **random-walk-based recommendation algorithm** using the **Weighted Pixie** method. This technique uses a **user-movie bipartite graph** to recommend movies by simulating a random walk from a given user or movie.

#### **Step 1: Import Required Libraries**
Make sure you have the necessary libraries:

```python
import random  # For random walks
import pandas as pd  # For handling data
```

#### **Step 2: Implement the Random Walk Algorithm**
Your task is to **simulate a random walk** from a given starting point in the **bipartite user-movie graph**.

##### **Hints for Implementation**
- Start from **either a user or a movie**.
- At each step, **randomly move** to a connected node.
- Keep track of **how many times each movie is visited**.
- After completing the walk, **rank movies by visit count**.
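The hints above can be sketched on a toy graph as follows. This is a plain, unweighted walk under stated assumptions: nodes are typed tuples such as `("user", 1)`, the function name `toy_random_walk` and the `seed` parameter are hypothetical, and neighbours are sorted only so that a fixed seed gives reproducible output.

```python
import random
from collections import Counter

def toy_random_walk(adjacency, start, walk_length, seed=None):
    """Walk `walk_length` steps from `start`, counting visits to movie nodes."""
    rng = random.Random(seed)
    counts = Counter()
    current = start
    for _ in range(walk_length):
        neighbours = sorted(adjacency.get(current, ()))
        if not neighbours:
            break  # dead end: nothing to move to
        current = rng.choice(neighbours)
        if current[0] == "movie":
            counts[current] += 1
    return counts

toy_adjacency = {
    ("user", 1): {("movie", 1), ("movie", 2)},
    ("user", 2): {("movie", 2), ("movie", 3)},
    ("movie", 1): {("user", 1)},
    ("movie", 2): {("user", 1), ("user", 2)},
    ("movie", 3): {("user", 2)},
}

visits = toy_random_walk(toy_adjacency, ("user", 1), walk_length=20, seed=0)
print(visits.most_common(3))  # movies ranked by visit count
```

In a bipartite graph the walk alternates between users and movies, so a 20-step walk that starts at a user lands on a movie exactly 10 times.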

#### **Step 3: Implement User-Based Recommendation**
**Hints:**
- Check if the `user_id` exists in the `graph`.
- Start a loop that runs for `walk_length` steps.
- Randomly pick a **connected node** (user or movie).
- Track how many times each **movie** is visited.
- Sort movies by visit frequency and return the **top N**.

#### **Step 4: Implement Movie-Based Recommendation**
**Hints:**
- Find the `movie_id` corresponding to the given `movie_name`.
- Ensure the movie exists in the `graph`.
- Start a random walk from that movie.
- Follow the same **tracking and ranking** process as the user-based version.

**Note:**  
**Your task:** Implement a function `weighted_pixie_recommend(user_id, walk_length=15, num=5)` or `weighted_pixie_recommend(movie_name, walk_length=15, num=5)`.  
**Implement either Step 3 or Step 4.**
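The assignment does not spell out what "weighted" means, so the following is only one possible reading: each edge carries the raw rating as a weight, and the next node is drawn with probability proportional to that weight using `random.choices`. The dictionary-of-dictionaries layout, the function name `toy_weighted_walk`, and the choice of raw 1 to 5 ratings are assumptions. Mean-centered ratings can be negative and would not be usable directly, since `random.choices` expects non-negative weights.

```python
import random
from collections import Counter

# Hypothetical weighted adjacency list: node -> {neighbour: edge weight}.
toy_weighted = {
    ("user", 1): {("movie", 1): 5.0, ("movie", 2): 1.0},
    ("user", 2): {("movie", 1): 4.0, ("movie", 3): 5.0},
    ("movie", 1): {("user", 1): 5.0, ("user", 2): 4.0},
    ("movie", 2): {("user", 1): 1.0},
    ("movie", 3): {("user", 2): 5.0},
}

def toy_weighted_walk(adjacency, start, walk_length, seed=None):
    """Random walk where the next node is chosen in proportion to edge weight."""
    rng = random.Random(seed)
    counts = Counter()
    current = start
    for _ in range(walk_length):
        edges = sorted(adjacency.get(current, {}).items())
        if not edges:
            break
        neighbours = [node for node, _ in edges]
        weights = [weight for _, weight in edges]
        current = rng.choices(neighbours, weights=weights, k=1)[0]
        if current[0] == "movie":
            counts[current] += 1
    return counts

weighted_visits = toy_weighted_walk(toy_weighted, ("user", 1), walk_length=200, seed=42)
print(weighted_visits.most_common())
```

Highly rated movies are visited more often than poorly rated ones, which is the behaviour a uniform `random.choice` over neighbours cannot provide.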

#### **Step 5: Running Your Recommendation System**
Once your function is implemented, test it by calling:

##### **Example: User-Based Recommendation**
```python
weighted_pixie_recommend(1, walk_length=15, num=5)
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | My Own Private Idaho (1991)   |
| 2       | Aladdin (1992)                |
| 3       | 12 Angry Men (1957)           |
| 4       | Happy Gilmore (1996)          |
| 5       | Copycat (1995)                |


##### **Example: Movie-Based Recommendation**
```python
weighted_pixie_recommend("Jurassic Park (1993)", walk_length=10, num=5)
```
| Ranking | Movie Name                           |
|---------|-------------------------------------|
| 1       | Rear Window (1954)                 |
| 2       | Great Dictator, The (1940)         |
| 3       | Field of Dreams (1989)             |
| 4       | Casablanca (1942)                  |
| 5       | Nightmare Before Christmas, The (1993) |


#### **Step 6: Understanding the Results**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

Each movie is ranked based on **how frequently it was visited** during the walk.

#### **Experiment with Different Parameters**
- Try different **`walk_length`** values and observe how it changes recommendations.
- Adjust the number of recommended movies (`num`).
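One way to run such an experiment is sketched below: several short walks are restarted from the same node and their visit counts are pooled, then the walk length is varied. The helper `pooled_visit_counts`, the `n_walks` parameter, and the toy graph are hypothetical; the restart scheme is an assumption of this sketch rather than a requirement of the assignment.

```python
import random
from collections import Counter

toy_adjacency = {
    ("user", 1): {("movie", 1), ("movie", 2)},
    ("user", 2): {("movie", 2), ("movie", 3)},
    ("movie", 1): {("user", 1)},
    ("movie", 2): {("user", 1), ("user", 2)},
    ("movie", 3): {("user", 2)},
}

def pooled_visit_counts(adjacency, start, walk_length, n_walks, seed=None):
    """Run `n_walks` walks from `start` and pool the movie visit counts."""
    rng = random.Random(seed)
    counts = Counter()
    for _walk in range(n_walks):
        current = start  # every walk restarts at the query node
        for _step in range(walk_length):
            neighbours = sorted(adjacency.get(current, ()))
            if not neighbours:
                break
            current = rng.choice(neighbours)
            if current[0] == "movie":
                counts[current] += 1
    return counts

for length in (2, 6, 20):
    pooled = pooled_visit_counts(toy_adjacency, ("user", 1), length, n_walks=50, seed=1)
    print(length, pooled.most_common(3))
```

With `walk_length=2` the walk can only reach movies that user 1 rated directly; longer walks are needed before movie 3, which is two hops further away, can appear at all.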

### Steps for Implementing Weighted Pixie Random Walk Recommender

In [32]:
import random 
import pandas as pd

### Movie Based Random walk recommendation

In [33]:
# implementing random walk for movie-based recommendation

def weighted_pixie_recommend_movie(movie_name, walk_length=15, num=5):
    movie_row = df_movies[df_movies['title'] == movie_name]
    if movie_row.empty:
        print(f"Movie '{movie_name}' not found.")
        return pd.DataFrame(columns=['movie_name'])

    movie_id = int(movie_row['movie_id'].values[0])

    if movie_id not in graph:
        print(f"Movie ID {movie_id} not found in the graph.")
        return pd.DataFrame(columns=['movie_name'])

    current_node = movie_id
    visit_counts = {}

    for _ in range(walk_length):
        neighbors = list(graph.get(current_node, []))
        if not neighbors:
            break

        next_node = random.choice(neighbors)

        # Count movie visits
        if isinstance(next_node, int) and next_node in df_movies['movie_id'].values:
            if next_node != movie_id:  # avoid counting the starting movie
                visit_counts[next_node] = visit_counts.get(next_node, 0) + 1

        current_node = next_node

    # Sort and get top N
    sorted_movies = sorted(visit_counts.items(), key=lambda x: x[1], reverse=True)[:num]
    movie_titles = df_movies.set_index('movie_id')['title']
    movie_names = [movie_titles.get(mid, f"Movie ID {mid}") for mid, _ in sorted_movies]

    result_df = pd.DataFrame({
        'Ranking': range(1, len(movie_names) + 1),
        'Movie Name': movie_names
    })
    result_df.set_index('Ranking', inplace=True)

    return result_df

weighted_pixie_recommend_movie("Jurassic Park (1993)", walk_length=10, num=5)


| Ranking | Movie Name                    |
|---------|-------------------------------|
| 1       | Scream 2 (1997)               |
| 2       | Ransom (1996)                 |
| 3       | Private Parts (1997)          |
| 4       | Brothers McMullen, The (1995) |
| 5       | Jerry Maguire (1996)          |


#### Code Explanation

This function recommends movies by simulating a random walk on the bipartite graph.

##### Steps:
1. **Lookup Movie ID**: Finds the `movie_id` based on the input `movie_name`.
2. **Initialize Walk**: Starts a random walk from the given movie node.
3. **Random Walk**:
   - At each step, randomly chooses a connected node (user or movie).
   - Tracks visit counts for other movies encountered (excluding the starting movie).
4. **Ranking**: Sorts movies by how often they were visited during the walk.
5. **Map Titles**: Converts movie IDs to names.
6. **Return**: A ranked DataFrame of recommended movies.

##### Example:
```python
weighted_pixie_recommend_movie("Jurassic Park (1993)", walk_length=10, num=5)
```

#### Random Walk Movie-Based Recommendation Results

##### Starting Movie: *Jurassic Park (1993)*

##### Recommended Movies:
1. **Independence Day (ID4) (1996)**
2. **Terminator 2: Judgment Day (1991)**

---

#### Insights

- **Genre & Popularity Alignment**:  
  Both recommended movies belong to the same action/sci-fi genre as *Jurassic Park*, showing that the model effectively captures thematic connections.

- **Graph Connectivity**:  
  These movies are linked through shared users in the bipartite graph, meaning users who rated *Jurassic Park* also rated these titles — a key signal in recommendation.

- **Random Walk Dynamics**:  
  The use of random walk introduces variation, so future runs may surface different but still relevant movies. This promotes content discovery and prevents repetitive suggestions.

- **Lightweight and Effective**:  
  Despite its simplicity, this graph-based model produces contextually appropriate recommendations without heavy computations.

---




### User based random walk recommendation

In [34]:
#implementing user based random walk recommendation.

def weighted_pixie_recommend_user(user_id, walk_length=15, num=5):
    if user_id not in graph:
        print(f"User {user_id} not found in the graph.")
        return pd.DataFrame(columns=['movie_name'])

    current_node = user_id
    visit_counts = {}

    for _ in range(walk_length):
        neighbors = list(graph.get(current_node, []))
        if not neighbors:
            break  # Nowhere to go

        next_node = random.choice(neighbors)
        
        # If we landed on a movie, count it
        if isinstance(next_node, int) and next_node in df_movies['movie_id'].values:
            visit_counts[next_node] = visit_counts.get(next_node, 0) + 1

        current_node = next_node  # Move to next node

    # Sort by visit frequency
    sorted_movies = sorted(visit_counts.items(), key=lambda x: x[1], reverse=True)[:num]

    # Map movie_ids to titles
    df_movies['movie_id'] = df_movies['movie_id'].astype(int)
    movie_titles = df_movies.set_index('movie_id')['title']
    movie_names = [movie_titles.get(mid, f"Movie ID {mid}") for mid, _ in sorted_movies]

    result_df = pd.DataFrame({
        'Ranking': range(1, len(movie_names) + 1),
        'Movie Name': movie_names
    })
    result_df.set_index('Ranking', inplace=True)

    return result_df

weighted_pixie_recommend_user(1, walk_length=15, num=5)


| Ranking | Movie Name                      |
|---------|---------------------------------|
| 1       | Terminator, The (1984)          |
| 2       | Burnt By the Sun (1994)         |
| 3       | Empire Strikes Back, The (1980) |
| 4       | Philadelphia Story, The (1940)  |
| 5       | Fools Rush In (1997)            |


#### Code Explanation

This function recommends movies to a user by simulating a random walk on the user-movie graph.

##### Steps:
1. **Start From User Node**: Begins the walk from the specified `user_id`.
2. **Random Walk**:
   - Randomly moves between connected nodes (users ↔ movies).
   - Tracks how often each movie is visited during the walk.
3. **Rank Movies**: Sorts visited movies by frequency and picks the top ones.
4. **Map Titles**: Converts movie IDs into readable movie titles.
5. **Return**: A ranked DataFrame of recommended movies for the user.

##### Example:
```python
weighted_pixie_recommend_user(1, walk_length=15, num=5)
```

#### Results & Insights: Pixie-Inspired Random Walk Recommendation Models

##### Observed Results:

1. **Movie-Based Recommendations (`weighted_pixie_recommend_movie`)**
   - When starting from a popular movie (e.g., *Jurassic Park (1993)*), the recommended movies were often:
     - Popular among users who also liked the starting movie.
     - Released in a similar time period or genre.
   - The results showed a good mix of related content, even with limited walk lengths.

2. **User-Based Recommendations (`weighted_pixie_recommend_user`)**
   - Recommendations for a specific user (e.g., `user_id = 1`) reflected:
     - Movies indirectly connected via other users with similar interests.
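     - A broader range of movies reached through other users. As written, the function does not exclude movies the user has already rated, and it does not use rating values when choosing the next node.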

---

##### Key Insights:

- **Random Walks Add Diversity**: Unlike direct similarity scores, random walks explore multi-hop relationships, leading to more diverse recommendations.
  
- **Graph Structure Matters**: Users and movies with many connections (edges) tend to appear more in recommendations due to higher traversal likelihood.

- **Fallback-Free & Adaptive**: These methods don’t rely on strict rating similarity or matrix sparsity. They work even when users have few ratings, as long as there are graph connections.

- **Lightweight Yet Effective**: This model is computationally simpler than matrix factorization or deep learning models, making it suitable for quick prototyping.

---





## **Submission Requirements:**

To successfully complete this assignment, ensure that you submit the following:


### **1. Jupyter Notebook Submission**
- Submit a **fully completed Jupyter Notebook** that includes:
  - **All implemented recommendation functions** (user-based, item-based, and random walk-based recommendations).
  - **Code explanations** in markdown cells to describe each step.
  - **Results and insights** from running your recommendation models.


### **2. Explanation of Pixie-Inspired Algorithms (3-5 Paragraphs)**
- Write a **detailed explanation** of **Pixie-inspired random walk algorithms** used for recommendations.
- Your explanation should cover:
  - What **Pixie-inspired recommendation systems** are.
  - How **random walks** help in identifying relevant recommendations.
  - Any real-world applications of such algorithms in industry.


### **3. Report for the Submitted Notebook**
Your report should be structured as follows:

#### **Title: Movie Recommendation System Report**

#### **1. Introduction**
- Briefly introduce **movie recommendation systems** and why they are important.
- Explain the **different approaches used** (user-based, item-based, random-walk).

#### **2. Dataset Description**
- Describe the **MovieLens 100K dataset**:
  - Number of users, movies, and ratings.
  - What features were used.
  - Any preprocessing performed.

#### **3. Methodology**
- Explain the three recommendation techniques implemented:
  - **User-based collaborative filtering** (how user similarity was calculated).
  - **Item-based collaborative filtering** (how item similarity was determined).
  - **Random-walk-based Pixie algorithm** (why graph-based approaches are effective).
  
#### **4. Implementation Details**
- Discuss the steps taken to build the functions.
- Describe how the **adjacency list graph** was created.
- Explain how **random walks** were performed and how visited movies were ranked.

#### **5. Results and Evaluation**
- Present **example outputs** from each recommendation approach.
- Compare the different methods in terms of accuracy and usefulness.
- Discuss any **limitations** in the implementation.

#### **6. Conclusion**
- Summarize the key takeaways from the project.
- Discuss potential improvements (e.g., **hybrid models, additional features**).
- Suggest real-world applications of the methods used.

### **Submission Instructions**

- Submit a `.zip` file containing the Jupyter Notebook, all provided data files, and the saved CSV files (i.e., `users.csv`, `movies.csv`, and `ratings.csv`). Also include the Report and the Pixie Algorithm explanation document.
- [`Bonus 10 Points`] **Upload your Jupyter Notebook, Explanation Document, and Report** to your GitHub repository.
- Ensure the repository is public and contains:
  - `users.csv`, `movies.csv` and `ratings.csv` [These are the Dataframes which were created in part 1. Save and export them as a `.csv` file]
  - `Movie_Recommendation.ipynb`
  - `Pixie_Algorithm_Explanation.pdf` or `.md`
  - `Recommendation_Report.pdf` or `.md`
- **Submit the GitHub repository link in the cell below.**


#### **Example Submission Format**
```text
GitHub Repository: https://github.com/username/Movie-Recommendation
```

# Submission of Github Link
GitHub Repository : https://github.com/priyal-11/Movie_recommendation.git

### **Grading Rubric: ITCS 6162 - Data Mining Assignment**


| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **Part 1: Exploring and Cleaning Data (15 pts)**  | Properly loads `u.user`, `u.data`, and `u.item` datasets into DataFrames | 5 |
|                                           | Handles missing values, duplicates, and inconsistencies appropriately | 5 |
|                                           | Saves the cleaned datasets into CSV files: `users.csv`, `movies.csv`, `ratings.csv` | 5 |
| **Part 2: Collaborative Filtering-Based Recommendation (30 pts)** | Implements user-based collaborative filtering correctly | 10 |
|                                           | Implements item-based collaborative filtering correctly | 10 |
|                                           | Computes similarity measures accurately and provides valid recommendations | 10 |
| **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm) (35 pts)** | Constructs adjacency lists properly from user-movie interactions | 10 |
|                                           | Implements weighted random walk-based recommendation correctly | 15 |
|                                           | Explains and justifies the algorithm design choices (Pixie-inspired) | 10 |
| **Code Quality & Documentation (10 pts)** | Code is well-structured, efficient, and follows best practices | 5 |
|                                           | Markdown explanations and comments are clear and enhance understanding | 5 |
| **Results & Interpretation (5 pts)**      | Provides meaningful insights from the recommendation system's output | 5 |
| **Submission & Report (5 pts)**          | Submits all required files in the correct format (ZIP file with Jupyter notebook, processed CSV files, and project report) | 5 |
| **Total**                                 |                              | 100 |

#### **Bonus (10 pts)**
| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **GitHub Submission**                     | Provides a well-documented GitHub repository with CSV files, a structured README, and a properly formatted Jupyter Notebook | 10 |