This notebook shows a simple way to use a modern transformer embedding model for feature engineering instead of the traditional *TF-IDF* method. Exploratory data analysis (EDA) is not demonstrated in this notebook.

In [1]:
# import necessary libraries

# for data wrangling
import numpy as np
import pandas as pd

# for creating feature embeddings
from sentence_transformers import SentenceTransformer

# for computing cosine similarity between items
from sklearn.metrics.pairwise import cosine_similarity

In this project, I use the "*Anime Recommendations Database*" from **[Kaggle](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database/data)** to demonstrate the modeling process. Only "anime.csv" is used in this notebook.

In [None]:
# set the path of the data
path_anime = "/content/sample_data/anime.csv" # remember to type in the correct path

# read anime data
df_anime_raw = pd.read_csv(path_anime)

**Basic data wrangling process**.
For this simple demonstration, we drop rows containing NaN values directly, then check that the remaining data still covers most of the dataset. Finally, we sort the data by the **"name"** column so that the recommendation results are easier to inspect.

In [None]:
# drop rows containing NaN values directly
df_anime = df_anime_raw.dropna(axis=0).copy()

# rename ['genre'] column as ['tags']
df_anime = df_anime.rename(columns={"genre":"tags"}).reset_index(drop = True)

# check that the remaining data still covers most of the dataset
print( round( len(df_anime) / len(df_anime_raw), 4 ) )
# 0.9775: more than 95% of the raw dataset remains!

# sort df_anime by ['name'] so that results are easier to inspect
df_anime = df_anime.sort_values(['name'], axis = 0).reset_index(drop = True)

0.9775


In the simplest application scenario, a user may want to search for relevant works by the name of a given work. Traditional methods include *TF-IDF* and edit-distance measures, which are all statistics-based. Instead, this notebook demonstrates how to use a modern transformer-based model to extract latent features from text columns.
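
For contrast, the edit-distance baseline mentioned above compares characters only. The sketch below is a plain-Python Levenshtein distance; `edit_distance` is an illustrative helper and not part of this notebook's pipeline, and the titles are taken from the outputs shown later in this notebook.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# Titles from the same series differ by only a few characters.
print(edit_distance("Ohayo! Spank", "Ohayo! Spank (Movie)"))

# Titles with no shared wording are far apart, whatever the works are about.
print(edit_distance("Slam Dunk (Movie)", "Haikyuu!! Second Season"))
```

A measure like this can only relate titles that share characters, which is the gap the embedding approach in the next cell tries to close.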

In [None]:
# set embedding model
model_embedding = SentenceTransformer("all-MiniLM-L6-v2")

# get the list of anime names
list_name = df_anime['name'].tolist()

# compute unit-normalized embeddings with model_embedding
array_name_embeddings = model_embedding.encode(list_name, normalize_embeddings = True)

# get the cosine similarity matrix of array_name_embeddings
array_cosine_sim_name = cosine_similarity(array_name_embeddings, array_name_embeddings)

# convert array_cosine_sim_name to pd.DataFrame() and join it with df_anime by index
df_sim_score_name = pd.merge(df_anime, pd.DataFrame(array_cosine_sim_name),
                left_index = True, right_index = True)
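
Because the embeddings are requested with `normalize_embeddings = True`, every vector has unit length, so cosine similarity reduces to a plain dot product. The following is a dependency-free sketch of that identity with made-up 3-dimensional vectors (the helper names are illustrative; real `all-MiniLM-L6-v2` embeddings have 384 dimensions).

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine(u, v):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Toy vectors standing in for two title embeddings (illustrative values).
u = [3.0, 4.0, 0.0]
v = [4.0, 3.0, 0.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

print(cosine(u, v))       # cosine similarity of the raw vectors
print(dot(u_hat, v_hat))  # the same value, from the normalized vectors
```

Under this assumption, the matrix product of `array_name_embeddings` with its own transpose should match the output of `cosine_similarity` up to floating-point error.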

To check the recommendation results, we build a simple helper function and use it to test the recommendations.

In [None]:
# build a function to check recommendation results conveniently.
def get_recommended_results(index, df_input, return_qty = 15):
    """
    Get the recommended results as a pd.DataFrame from a similarity-score
    DataFrame such as df_sim_score_name.
    # ------------------------------------------------
    Args:
      index: integer
          The row index (in df_sim_score_name) of the query work; it is also
          the name of the column holding the similarity scores to that work.
      df_input: pd.DataFrame
          The pd.DataFrame in which to look up the recommended results.
      return_qty: integer
          The maximum number of rows to return (default 15).
    # ------------------------------------------------
    Returns:
        df_result: pd.DataFrame
            The recommended results, sorted by similarity in descending order.
    """
    list_common_col = ['anime_id', 'name', 'tags', 'type', 'episodes', 'rating', 'members']

    if len(df_input) == 0:
      df_result = pd.DataFrame(columns = list_common_col + ['sim'])
    else:
      if return_qty > len(df_input):
        return_qty = len(df_input)

      df_result = df_input[list_common_col + [index]].sort_values(index, ascending = False).iloc[:return_qty].copy()
      df_result = df_result.rename(columns={index: 'sim'})

    return df_result
# End of function: get_recommended_results()


# test: find the top 15 relevant works for index 8304 ("Pokemon Best Wishes! Season 2" in the output below)
print(df_anime['name'].iat[8304])
index = 8304

df_result = get_recommended_results(index, df_sim_score_name)
print(df_result)

Pokemon Best Wishes! Season 2
      anime_id                                               name  \
8304     14093                      Pokemon Best Wishes! Season 2   
8308     17115           Pokemon Best Wishes! Season 2: Episode N   
8305     17873  Pokemon Best Wishes! Season 2: Decolora Adventure   
8303      9107                               Pokemon Best Wishes!   
8310     16680  Pokemon Best Wishes! Season 2: Shinsoku no Gen...   
8307     23299  Pokemon Best Wishes! Season 2: Decolora Advent...   
8306     20743  Pokemon Best Wishes! Season 2: Decolora Advent...   
8309     12671  Pokemon Best Wishes! Season 2: Kyurem vs. Seik...   
8312     10740  Pokemon Best Wishes!: Victini to Shiroki Eiyuu...   
8295       527                                            Pokemon   
8313     14123      Pokemon Black and White 2: Introduction Movie   
8325     34514                                Pokemon Generations   
3524     28891                            Haikyuu!! Second Season   
8340

The result looks very good! But that is because "Pokemon Best Wishes! Season 2" has many related works in the same series. What if we want recommendations for a relatively unpopular work? Let's see the following example.

In [None]:
# test: find the top 15 relevant works for index 7673 ("Ohayo! Spank (Movie)" in the output below)
print(df_anime['name'].iat[7673])
index = 7673

df_result_1 = get_recommended_results(index, df_sim_score_name)
print(df_result_1)

Ohayo! Spank (Movie)
       anime_id                                               name  \
7673      19897                               Ohayo! Spank (Movie)   
7672       2912                                       Ohayo! Spank   
4739       9617                                        K-On! Movie   
8992       4031                         Sakigake!! Otokojuku Movie   
8088      29832       Panpaka Pants Movie: Bananan Oukoku no Hihou   
1759       3745  Crayon Shin-chan Movie 02: Buriburi Oukoku no ...   
9747       1764                                  Slam Dunk (Movie)   
1771       8366  Crayon Shin-chan Movie 14: Densetsu wo Yobu Od...   
3952       5956                     High School! Kimengumi (Movie)   
4084       1358                                Hokuto no Ken Movie   
11083       711                        Uchuu Senkan Yamato (Movie)   
11645      6693  Yatterman the Movie: Shin Yattermecha Osu Gou!...   
4746      31344                    K: Missing Kings - Manner Movie   

To deal with this, we can combine more information from other columns. The following code shows how to integrate the ['name', 'tags', 'members', 'rating'] columns into a new feature!

In [None]:
df_anime_copy = df_anime[['anime_id']].copy()
df_anime_copy['tags'] = df_anime['tags']

# get the list of tags (genres) of anime works
list_tags = df_anime_copy['tags'].tolist()

# compute unit-normalized embeddings of the tags with model_embedding
array_tags_embeddings = model_embedding.encode(list_tags, normalize_embeddings = True)

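# scale ['rating'] by dividing by 10
df_anime_copy['rating_norm'] = np.round(df_anime['rating'] / 10, 6)
# log-scale ['members'], divide by the maximum log value, then by 10
# (note: despite its name, 'rating_log_norm' is derived from ['members'])
df_anime_copy['rating_log_norm'] = np.round(np.log(df_anime['members']) / max(np.log(df_anime['members']) ) / 10, 6)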

# combine the following features: name embeddings + tags embeddings + ['rating_norm', 'rating_log_norm']
array_numeric = df_anime_copy[['rating_norm', 'rating_log_norm']].to_numpy()

# concatenate column-wise so that row i holds all features of work i
array_temp_0 = np.concatenate([array_name_embeddings, array_tags_embeddings, array_numeric], axis=1)

# keep one feature vector per work, as expected by the next cell
list_new_features = list(array_temp_0)
del array_numeric, array_temp_0
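
To see how concatenation changes the similarity score, note that each unit-length embedding block adds its own cosine to the dot product, while the two small scalar columns add comparatively little. The sketch below uses made-up 2-dimensional "embeddings" and made-up scalar values; it is only an illustration of the arithmetic, not the notebook's actual data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Illustrative unit-length blocks for two works, a and b.
name_a, name_b = [1.0, 0.0], [0.0, 1.0]   # names unrelated
tags_a, tags_b = [0.6, 0.8], [0.6, 0.8]   # tags identical
# Illustrative [rating_norm, rating_log_norm] pairs.
extra_a, extra_b = [0.80, 0.09], [0.75, 0.08]

# Same layout as the notebook's feature vector: name + tags + scalars.
combined_a = name_a + tags_a + extra_a
combined_b = name_b + tags_b + extra_b

print(cosine(name_a, name_b))          # similarity from names alone
print(cosine(combined_a, combined_b))  # similarity after concatenation
```

Two works with unrelated names but matching tags and similar ratings now score well above zero, which is the effect the combined feature is meant to produce. If one block should count more than another, scaling that block before concatenation is one possible adjustment.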

Now let's check the recommendation results with the new features!

In [None]:
# get array_new_features
array_new_features = np.array(list_new_features)
array_cosine_sim_multi_col = cosine_similarity(array_new_features, array_new_features)

# convert array_cosine_sim_multi_col to pd.DataFrame()
df_sim_score_multi_col = pd.merge(df_anime, pd.DataFrame(array_cosine_sim_multi_col),
                  left_index = True, right_index = True)


# test: find the top 15 relevant works for index 7673 ("Ohayo! Spank (Movie)" in the output below)
print(df_anime['name'].iat[7673])
index = 7673

df_result_2 = get_recommended_results(index, df_sim_score_multi_col)
print(df_result_2)

Ohayo! Spank (Movie)
       anime_id                                       name  \
7673      19897                       Ohayo! Spank (Movie)   
7672       2912                               Ohayo! Spank   
4739       9617                                K-On! Movie   
3323        852                         Gokinjo Monogatari   
5986        191    Love Hina Christmas Special: Silent Eve   
7734        470                Okusama wa Joshikousei (TV)   
7692      16730                Ojamanga Yamada-kun (Movie)   
8100      17875           Papa no Iukoto wo Kikinasai! OVA   
10324     21647                          Tamako Love Story   
8099      11179               Papa no Iukoto wo Kikinasai!   
6325       1453                              Maison Ikkoku   
3574      19495                              Hakusai Anime   
5608       6571                         Koume-chan ga Iku!   
3933       6372  Higashi no Eden Movie I: The King of Eden   
8140      27399        Peeping Life Movie: We Are

# **Summary**
This notebook showed a simple way to use a modern transformer embedding model for feature engineering, instead of the traditional *TF-IDF* method, to build a content-based recommendation system.
# Some may ask: "Is that all?"
# Yes!
A production-ready data science project in the real world must consider many more factors and much more domain knowledge.
Take this "anime.csv" dataset as an example: like a movie, an anime work comes with:
*   Author of the original work
*   Release year
*   Director
*   Screenwriter
*   Voice actors
*   OP/ED music and its singers
*   OST music and its composer
*   The company that produced the anime
*   etc.

This information does not exist in the demo dataset; with it, we could make the recommendations much more diverse.
A production-ready data science project must be implemented across functional departments in an organization, with many project-management details. This is also valuable experience for every data scientist. We may have an interview to talk about that. Thanks for reading to the end of this notebook! 🙂