# Recommendation System


## Data Preprocessing:

#### Load the dataset into a suitable data structure (e.g., pandas DataFrame).

In [101]:
#import libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [102]:
#Read csv file
df = pd.read_csv("anime.csv")
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


#### Handle missing values, if any.

In [103]:
#Check for null values
df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [104]:
#Handle missing values
df['rating'] = df['rating'].fillna(df['rating'].mean())
df['genre'] = df['genre'].fillna('Unknown')
df['type'] = df['type'].fillna('Unknown')

#Recheck missing values
print("\nMissing values after handling:")
print(df.isnull().sum())
df.head()


Missing values after handling:
anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


#### Explore the dataset to understand its structure and attributes.

In [105]:
#Get shape
df.shape

(12294, 7)

In [106]:
#Get info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [107]:
#Get summary
df.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12294.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.017096,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.9,225.0
50%,10260.5,6.55,1550.0
75%,24794.5,7.17,9437.0
max,34527.0,10.0,1013917.0


In [108]:
#Replace unknown and TBD value
df['episodes'] = df['episodes'].replace({'Unknown': 0, 'TBD': 0})
df['episodes'] = pd.to_numeric(df['episodes'])

## Feature Extraction:

#### Decide on the features that will be used for computing similarity (e.g., genres, user ratings).

In [109]:
#Feature selection
df_features = df[['genre', 'type', 'rating', 'members']]
df_features.head()

Unnamed: 0,genre,type,rating,members
0,"Drama, Romance, School, Supernatural",Movie,9.37,200630
1,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,9.26,793665
2,"Action, Comedy, Historical, Parody, Samurai, S...",TV,9.25,114262
3,"Sci-Fi, Thriller",TV,9.17,673572
4,"Action, Comedy, Historical, Parody, Samurai, S...",TV,9.16,151266


### Reasoning for Feature Selection:

For calculating anime similarity, the following columns from the df DataFrame are chosen as features:

**genre:** This column is crucial because anime genres (e.g., Action, Drama, Sci-Fi) are fundamental in defining its category and target audience. Similar genres often indicate similar content and themes, which is a primary factor in user preferences.

**type:** The type of anime (e.g., TV, Movie, OVA, Special) provides important context about its format and usual length. Users often have preferences for specific types, and similarity in type can suggest similar viewing experiences.

**rating:** The user rating is a direct indicator of how viewers perceive an anime's quality. Anime with similar ratings are often considered to be of comparable quality, which makes rating a useful signal for similarity.

**members:** The number of members indicates the size of the community interested in a particular anime. It does not describe content directly, but a higher member count often correlates with popularity and broader appeal, so anime with similar community sizes might attract similar audiences.

#### Convert categorical features into numerical representations if necessary.


In [110]:
#Convert categorical features into numerical 
df_type_encoded = pd.get_dummies(df_features['type'], prefix='type')
#convert to numeric
df_type_encoded = df_type_encoded.astype(float)
df_type_encoded.head()

Unnamed: 0,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [111]:
#Convert the comma-separated genre strings into one indicator column per genre
df_genre_encoded = df_features['genre'].str.get_dummies(sep=', ')
df_genre_encoded.head()

Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
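
To make the encoding above concrete, here is a plain-Python sketch of the idea behind splitting a comma-separated `genre` string into one 0/1 column per label. It is not how pandas implements `str.get_dummies`; the helper name `multi_hot` and the three sample strings are assumptions for illustration.

```python
def multi_hot(rows, sep=", "):
    """Turn delimiter-separated label strings into 0/1 indicator rows."""
    # Collect every distinct label, sorted so the column order is stable
    labels = sorted({label for row in rows for label in row.split(sep)})
    # One row of 0/1 flags per input string
    encoded = [[int(label in row.split(sep)) for label in labels] for row in rows]
    return labels, encoded

labels, encoded = multi_hot(["Drama, Romance", "Sci-Fi, Thriller", "Drama"])
print(labels)   # ['Drama', 'Romance', 'Sci-Fi', 'Thriller']
print(encoded)  # [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]]
```

Each anime can carry several genres at once, which is why a multi-label (multi-hot) encoding is used for `genre` while `type` gets an ordinary one-hot encoding.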


In [112]:
#Combine the numeric and encoded features, then drop the original 'genre' and 'type' columns
df_features_encoded = pd.concat([df_features, df_type_encoded, df_genre_encoded], axis=1)
df_features_encoded = df_features_encoded.drop(columns=['genre', 'type'])
df_features_encoded.head()

Unnamed: 0,rating,members,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown,Action,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri
0,9.37,200630,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,...,0,0,0,0,1,0,0,0,0,0
1,9.26,793665,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,...,0,0,0,0,0,0,0,0,0,0
2,9.25,114262,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,...,0,0,0,0,0,0,0,0,0,0
3,9.17,673572,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,...,0,0,0,0,0,1,0,0,0,0
4,9.16,151266,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,...,0,0,0,0,0,0,0,0,0,0


#### Normalize numerical features if required.

In [113]:
#Normalize numerical features
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_features_encoded[['rating', 'members']] = scaler.fit_transform(df_features_encoded[['rating', 'members']])
df_features_encoded.head()

Unnamed: 0,rating,members,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown,Action,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri
0,0.92437,0.197872,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,...,0,0,0,0,1,0,0,0,0,0
1,0.911164,0.78277,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,...,0,0,0,0,0,0,0,0,0,0
2,0.909964,0.112689,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,...,0,0,0,0,0,0,0,0,0,0
3,0.90036,0.664325,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,...,0,0,0,0,0,1,0,0,0,0
4,0.89916,0.149186,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,...,0,0,0,0,0,0,0,0,0,0
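
As a sanity check on the scaled values above, min-max scaling maps each value x to (x - min) / (max - min). The sketch below reproduces the first `rating` and `members` entries by hand, taking the column minima and maxima from the `df.describe()` output; the helper name `min_max_scale` is illustrative only and is not part of scikit-learn.

```python
def min_max_scale(x, col_min, col_max):
    """Scale x into [0, 1] given the column's minimum and maximum."""
    return (x - col_min) / (col_max - col_min)

# Column min/max values copied from the df.describe() output
rating_scaled = min_max_scale(9.37, 1.67, 10.0)
members_scaled = min_max_scale(200630, 5.0, 1013917.0)

print(round(rating_scaled, 5))   # 0.92437
print(round(members_scaled, 6))  # 0.197872
```

Because `members` is heavily skewed (the median is 1,550 while the maximum is above one million), most scaled `members` values end up very close to 0.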


## Recommendation System:

#### Design a function to recommend anime based on cosine similarity.

In [114]:
#import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

#Apply cosine_similarity
cosine_sim_matrix = cosine_similarity(df_features_encoded)

print("Shape of the cosine similarity matrix:", cosine_sim_matrix.shape)
print('-'*60)
print(cosine_sim_matrix[:5, :5])


Shape of the cosine similarity matrix: (12294, 12294)
------------------------------------------------------------
[[1.         0.26770973 0.11961836 0.19251447 0.11930264]
 [0.26770973 1.         0.42873892 0.36935073 0.43103174]
 [0.11961836 0.42873892 1.         0.47204416 0.99991818]
 [0.19251447 0.36935073 0.47204416 1.         0.47468026]
 [0.11930264 0.43103174 0.99991818 0.47468026 1.        ]]
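
For intuition about the numbers in this matrix, cosine similarity is the dot product of two feature vectors divided by the product of their lengths. The sketch below applies that definition to three made-up feature rows (the layout `[rating_scaled, is_TV, Action, Comedy]` and all values are assumptions, not rows from the dataset).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical rows: [rating_scaled, is_TV, Action, Comedy]
a = [0.91, 1, 1, 1]
b = [0.90, 1, 1, 1]   # same indicator columns as a, slightly different rating
c = [0.92, 0, 0, 0]   # similar rating, but no indicator columns in common

print(cosine(a, b))   # very close to 1
print(cosine(a, c))   # much lower, despite the similar rating
```

Rows that share the same indicator columns and differ only slightly in the scaled numeric features score close to 1, which is consistent with the 0.9999 similarity between rows 2 and 4 above.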


In [115]:
#Define recommendation function
def get_recommendations(anime_id, num_recommendations=10, similarity_threshold=0.0):
    # Return early if the anime_id is not in the dataset
    if anime_id not in df['anime_id'].values:
        print(f"Anime ID {anime_id} not found in the dataset.")
        return []
    # Get the row index of the anime that matches the anime_id
    idx = df[df['anime_id'] == anime_id].index[0]

    # Get the pairwise similarity scores of all anime with that anime
    sim_scores = list(enumerate(cosine_sim_matrix[idx]))

    # Sort the anime based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Filter recommendations based on similarity_threshold
    filtered_sim_scores = [score for score in sim_scores if score[1] >= similarity_threshold]

    # Collect up to `num_recommendations` entries from the filtered list,
    # skipping the query anime itself
    recommendations_list = []
    for item_idx, score in filtered_sim_scores:
        if item_idx == idx:  # Skip the anime itself
            continue
        recommendations_list.append((item_idx, score))
        if len(recommendations_list) == num_recommendations:
            break

    # Get the anime indices
    anime_indices = [i[0] for i in recommendations_list]

    # Return the names of the top recommended anime
    return df['name'].iloc[anime_indices].tolist()
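
The ranking step in the function above can also be written with NumPy instead of a Python sort and loop. This is a minimal sketch of that alternative, not a drop-in replacement: `top_k_similar` and `toy_sim` are hypothetical names, and the 4x4 matrix holds made-up similarities.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10, threshold=0.0):
    """Return up to k row indices most similar to row idx, best first."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                     # never recommend the query item itself
    order = np.argsort(-scores)               # indices by descending similarity
    keep = order[scores[order] >= threshold]  # apply the similarity threshold
    return keep[:k].tolist()

toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.6],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.5],
    [0.6, 0.3, 0.5, 1.0],
])
print(top_k_similar(toy_sim, idx=0, k=2))                 # [2, 3]
print(top_k_similar(toy_sim, idx=0, k=3, threshold=0.5))  # [2, 3]
```

With the notebook's objects, the returned positions could be passed to `df['name'].iloc[...]` in the same way `anime_indices` is used above.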

In [116]:
#Select sample anime id to find similarity
sample_anime_id = 32281 

print(f"\nRecommendations for Anime ID {sample_anime_id}  with similarity_threshold = 0.5:")
print('-'*70)
reco_thresh_0_5 = get_recommendations(sample_anime_id, num_recommendations=10, similarity_threshold=0.5)
if reco_thresh_0_5:
    for i, anime_name in enumerate(reco_thresh_0_5):
        print(f"{i+1}. {anime_name}")
else:
    print("No recommendations found above the threshold.")



print(f"\nRecommendations for Anime ID {sample_anime_id}  with similarity_threshold = 0.9:")
print('-'*70)
reco_thresh_0_9 = get_recommendations(sample_anime_id, num_recommendations=10, similarity_threshold=0.9)
if reco_thresh_0_9:
    for i, anime_name in enumerate(reco_thresh_0_9):
        print(f"{i+1}. {anime_name}")
else:
    print("No recommendations found above the threshold.")


Recommendations for Anime ID 32281  with similarity_threshold = 0.5:
----------------------------------------------------------------------
1. Aura: Maryuuin Kouga Saigo no Tatakai
2. Kokoro ga Sakebitagatterunda.
3. Harmonie
4. Air Movie
5. Hotarubi no Mori e
6. &quot;Bungaku Shoujo&quot; Movie
7. Clannad Movie
8. Suki ni Naru Sono Shunkan wo.: Kokuhaku Jikkou Iinkai
9. Taifuu no Noruda
10. Wind: A Breath of Heart OVA

Recommendations for Anime ID 32281  with similarity_threshold = 0.9:
----------------------------------------------------------------------
1. Aura: Maryuuin Kouga Saigo no Tatakai
2. Kokoro ga Sakebitagatterunda.
3. Harmonie
4. Air Movie


In [117]:
# Define liked as rating >= 7 (arbitrary threshold)
df['liked'] = df['rating'].apply(lambda x: 1 if x >= 7 else 0)
df['liked'].head()

0    1
1    1
2    1
3    1
4    1
Name: liked, dtype: int64

## Evaluation:

#### Split the dataset into training and testing sets.

In [118]:
#import libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Split dataset
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(train_df.shape)
print(test_df.shape)

(9835, 8)
(2459, 8)


#### Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.


In [119]:
# Example evaluation (binary liked/not liked prediction)
y_true = test_df['liked']  
y_pred = []  # predicted based on recommendations

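# Note: the query is always sample_anime_id, so `recs` (and therefore every
# prediction) is the same for all test rows; the loop variable is not used.
for anime in test_df['name']:
    recs = get_recommendations(sample_anime_id, num_recommendations=5, similarity_threshold=0.6)
    # If any recommended anime is in test set liked list, mark as positive
    y_pred.append(1 if any(r in list(test_df[test_df['liked']==1]['name']) for r in recs) else 0)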

# Metrics
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))

Precision: 0.3298088653924359
Recall: 1.0
F1-score: 0.4960244648318043


#### Analyze the performance of the recommendation system and identify areas of improvement.

#### **Analyze the performance of the recommendation system:**

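**Precision (≈ 0.33):**

- The evaluation loop calls `get_recommendations` with the same `sample_anime_id` for every test row, so the predicted label is identical for all of them. A recall of 1.0 shows that this constant prediction is 1 ("liked").

- With every row predicted positive, precision equals the share of test anime with `liked == 1` (rating ≥ 7), about 33%. It reflects the class balance of the test set, not the quality of the recommendations.

**Recall (1.0):**

- Predicting "liked" for every row trivially captures all liked anime in the test set.

- The perfect recall is therefore a consequence of the constant prediction, not evidence that the system misses no relevant items.

**F1-score (≈ 0.50):**

- The F1-score is the harmonic mean of precision and recall: 2 × 0.33 × 1.0 / (0.33 + 1.0) ≈ 0.50.

- It summarizes the same all-positive baseline. A meaningful evaluation needs predictions that depend on the anime being evaluated.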
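
One possible way to make each prediction depend on the anime being evaluated is to use every test item as its own query and let its nearest training items vote on the label. The sketch below illustrates that protocol on made-up data; it is an assumption about how the evaluation could be set up, not the notebook's method, and `cosine_matrix`, `predict_liked`, `precision_recall_f1`, and the seven toy feature rows are all hypothetical.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

def predict_liked(sim, liked, train_idx, test_idx, k=2):
    """Label each test item by majority vote of its k most similar training items."""
    train_idx = np.asarray(train_idx)
    preds = []
    for t in test_idx:
        nearest = train_idx[np.argsort(-sim[t, train_idx])[:k]]
        preds.append(int(liked[nearest].mean() >= 0.5))
    return np.array(preds)

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up 2-D feature rows and liked labels for seven items
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
              [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]])
liked = np.array([1, 1, 0, 0, 1, 0, 1])
train_idx, test_idx = [0, 1, 2, 3], [4, 5, 6]

y_pred = predict_liked(cosine_matrix(X), liked, train_idx, test_idx)
print(y_pred.tolist())  # [1, 0, 0]
print(precision_recall_f1(liked[test_idx], y_pred))
```

In the notebook, `liked` is derived from `rating`, which is also one of the similarity features, so a protocol like this would likely need `rating` removed from the feature matrix to avoid leaking the label.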

#### **Areas of improvement:** 
**Hybrid models:** Combine content-based (genres, ratings) with collaborative filtering (user preferences).

**Dimensionality reduction:** Use PCA or embeddings to reduce feature space.

**Cold-start problem:** Handle new anime with limited data.

## Interview Questions:

#### 1. Can you explain the difference between user-based and item-based collaborative filtering?

#### **User-based Collaborative Filtering (User-User CF):**

**How it works:** This approach finds users who are similar to the active user (the one for whom we want to make recommendations). Similarity is typically determined by common ratings or preferences. Once similar users are identified, the system recommends items that these similar users liked but the active user has not yet interacted with.

**Analogy:** "People who are like you liked these things, so you might like them too."

**Pros:** Can recommend novel items (serendipity) that the active user might not have discovered otherwise, as it's not limited to items directly similar to those the user has already seen.

**Cons:**
- **Scalability:** Can be computationally expensive for systems with many users, as it requires calculating user-to-user similarity dynamically.
- **Sparsity:** Difficulty in finding truly similar users when rating data is sparse (most users only rate a small fraction of available items).
- **Cold Start (Users):** Hard to make recommendations for new users with few ratings.


#### **Item-based Collaborative Filtering (Item-Item CF):**

**How it works:** This approach finds items that are similar to items the active user has already liked. Similarity between items is determined by how consistently users rate them similarly. For example, if users who like Movie A also tend to like Movie B, then Movie A and Movie B are considered similar. The system then recommends items that are similar to the active user's past positive interactions.

**Analogy:** "You liked this item, and people who liked this item also liked these other items, so you might like them too."

**Pros:**
- **Scalability:** Item-to-item similarity is usually pre-calculated offline and remains relatively stable, making it more scalable for systems with many users but a more stable item catalog.
- **Performance:** Generally performs better with sparse datasets compared to user-based approaches.

**Cons:**
- **Less Serendipitous:** Tends to recommend items that are very similar to what the user already knows, potentially limiting exposure to new categories.
- **Cold Start (Items):** Hard to recommend new items that haven't been rated by many users yet.

In summary, user-based CF focuses on finding similar users, while item-based CF focuses on finding similar items. Item-based CF is often preferred in large-scale commercial systems due to its better scalability and performance characteristics.
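
The difference between the two approaches comes down to which axis of the rating matrix is compared. A minimal sketch on a made-up 3x4 rating matrix follows; treating 0 as "not rated" inside the cosine is a simplification, and `cosine_rows` is a hypothetical helper.

```python
import numpy as np

# Toy ratings: rows = users, columns = items, 0 = not rated (made-up values)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_rows(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return unit @ unit.T

user_sim = cosine_rows(R)     # user-based CF compares rows (users)
item_sim = cosine_rows(R.T)   # item-based CF compares columns (items)

print(user_sim.shape, item_sim.shape)  # (3, 3) (4, 4)
print(np.round(user_sim, 2))
print(np.round(item_sim, 2))
```

The user-user matrix grows with the number of users and the item-item matrix with the number of items, which is one reason item-based CF tends to scale better when users far outnumber items.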




#### 2. What is collaborative filtering, and how does it work?

#### **Collaborative Filtering (CF):** 
**Definition:** A type of recommender system that relies on the collective behavior of users rather than item attributes.

**Core Assumption:** “Users who agreed in the past will likely agree again in the future.”

**Key Idea:** If two users have rated or interacted with items similarly, they are considered similar, and recommendations can be made based on this similarity.


**How It Works**

**User-Item Matrix:**

- Construct a matrix where rows = users, columns = items, and values = ratings or implicit feedback (e.g., clicks, views).

- This matrix is usually sparse because most users interact with only a small fraction of items.

**Similarity Calculation:**

- Measure similarity between users (user-based CF) or between items (item-based CF).

- Common metrics: cosine similarity, Pearson correlation, or Jaccard index.

**Prediction:**

- Fill in missing entries in the matrix by predicting how a user would rate an unseen item.

- Example: If User A and User B both liked Naruto and Bleach, and User B also liked One Piece, then recommend One Piece to User A.

**Recommendation Generation:**

- Rank items by predicted scores.

- Recommend the top-N items to the user.
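
The four steps above can be sketched end to end for the user-based case. The ratings below are made up to mirror the Naruto / Bleach / One Piece example, with a fourth title added for contrast; `predict_for_user` is a hypothetical helper, and a similarity-weighted average is only one of several common prediction rules.

```python
import numpy as np

items = ["Naruto", "Bleach", "One Piece", "Death Note"]
# Rows: User A, User B, User C; 0 means "not rated" (made-up ratings)
R = np.array([
    [5, 4, 0, 0],   # User A has not rated One Piece or Death Note
    [5, 5, 5, 2],   # User B
    [1, 2, 1, 5],   # User C
], dtype=float)

def predict_for_user(R, user, eps=1e-9):
    """Similarity-weighted average of the other users' ratings for each item."""
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user] + eps)  # cosine to every user
    sims[user] = 0.0                                  # ignore the user's own row
    rated = (R > 0).astype(float)                     # mask of observed ratings
    weighted = sims @ R                               # sum of similarity * rating
    weights = sims @ rated                            # sum of similarity over raters
    return weighted / (weights + eps)

scores = predict_for_user(R, user=0)
unseen = np.where(R[0] == 0)[0]                # items User A has not rated
ranked = unseen[np.argsort(-scores[unseen])]   # best predicted score first
print([items[i] for i in ranked])  # ['One Piece', 'Death Note']
```

User B's ratings agree closely with User A's, so User B's high rating for One Piece dominates the weighted average and One Piece is ranked first.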

**Types of Collaborative Filtering:**
- **User-Based CF:** Finds similar users and recommends items they liked.

- **Item-Based CF:** Finds similar items to those the user liked and recommends them.

- **Model-Based CF:** Uses machine learning (e.g., matrix factorization, neural networks) to learn latent features and make predictions.
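
As an illustration of the model-based variant, the sketch below fits a small matrix factorization with stochastic gradient descent so that the product of user and item factors approximates the observed ratings. The rating matrix, the function name `factorize`, and all hyperparameters are assumptions chosen for a toy example, not tuned values.

```python
import numpy as np

def factorize(R, n_factors=2, lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn user factors P and item factors Q so that P @ Q.T fits observed ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    observed = np.argwhere(R > 0)              # only observed entries are fitted
    for _ in range(epochs):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_u = P[u].copy()                  # keep the old value for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Toy ratings: rows = users, columns = items, 0 = not rated (made-up values)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

P, Q = factorize(R)
pred = P @ Q.T                                 # dense matrix of predicted ratings
mask = R > 0
rmse = float(np.sqrt(np.mean((R[mask] - pred[mask]) ** 2)))
print(np.round(pred, 2))
print("RMSE on observed ratings:", round(rmse, 3))
```

The entries of `pred` at positions where `R` is 0 are the model's predictions for unseen user-item pairs, which is what a recommender would rank.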