# Description-based and Content-based Recommendations

We will make basic recommendation engines in two ways: using the text descriptions of each anime, and using the genres, relevant tags, animation studio and key staff members behind the media.

In [1]:
import pandas as pd
import numpy as np

from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import display

In [None]:
# from google.colab import drive 
# drive.mount ('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Read-in and cleaning

In [98]:
df = pd.read_csv('/content/drive/MyDrive/Anime project/anime-1991-2021_v4.csv',lineterminator='\n')
df = df.set_index('id',drop=True)

In [99]:
def convert_to_list(x):
  s = x.replace("[","").replace("]","").replace("'","")
  return s.split(", ")

df['genres'] = df['genres'].apply(convert_to_list)
df['tags_cleaned'] = df['tags_cleaned'].apply(convert_to_list)
df['staff'] = df['staff'].apply(convert_to_list)

In [100]:
df.drop_duplicates(subset='title',inplace=True)

In [101]:
df.head()

id,popularity,averageScore,genres,episodes,format,description,season,seasonYear,favourites,source,duration,siteUrl,title,studio,tags_cleaned,staff
1029,25577,72.0,"[Drama, Romance, Slice of Life]",1.0,MOVIE,"Taeko Okajima is a typical ""office lady"" in a ...",SUMMER,1991.0,699,MANGA,118.0,https://anilist.co/anime/1029,Only Yesterday,Studio Ghibli,"[Female Protagonist, Iyashikei, Rural, Coming ...","[Hotaru Okamoto, Yuko Tone, Isao Takahata]"
898,21162,66.0,"[Action, Adventure, Comedy, Fantasy, Sci-Fi]",1.0,MOVIE,"After defeating Freeza, Goku returns to Earth ...",SUMMER,1991.0,151,MANGA,48.0,https://anilist.co/anime/898,Dragon Ball Z: Cooler's Revenge,Toei Animation,"[Martial Arts, Shounen, Super Power, Aliens]","[Akira Toriyama, Mitsuo Hashimoto, Yasuyuki Fuse]"
897,17656,60.0,"[Action, Adventure, Comedy, Fantasy, Sci-Fi]",1.0,MOVIE,A Super Namekian named Slug comes to invade Ea...,SPRING,1991.0,87,MANGA,52.0,https://anilist.co/anime/897,Dragon Ball Z: Lord Slug,Toei Animation,"[Shounen, Super Power]","[Akira Toriyama, Mitsuo Hashimoto, Minoru Maeda]"
795,11209,77.0,"[Drama, Psychological]",39.0,TV,"Before leaving her cram school, Nanako Misonō ...",SUMMER,1991.0,503,MANGA,25.0,https://anilist.co/anime/795,Dear Brother,Tezuka Productions,"[Tragedy, School, Ojou-sama, Primarily Female ...","[Riyoko Ikeda, Osamu Dezaki, Tomoko Konparu]"
2000,8195,68.0,"[Action, Comedy, Drama, Mecha, Sci-Fi]",1.0,MOVIE,The Z Project was intended to give the new gen...,SUMMER,1991.0,86,ORIGINAL,79.0,https://anilist.co/anime/2000,Roujin Z,APPP,"[Artificial Intelligence, Primarily Adult Cast...","[Katsuhiro Ootomo, Hisashi Eguchi, Hiroyuki Ki..."


## Description-based recommendations

In [21]:
temp = df.loc[21].description
temp

"Gold Roger was known as the Pirate King, the strongest and most infamous being to have sailed the Grand Line. The capture and death of Roger by the World Government brought a change throughout the world. His last words before his death revealed the location of the greatest treasure in the world, One Piece. It was this revelation that brought about the Grand Age of Pirates, men who dreamed of finding One Piece (which promises an unlimited amount of riches and fame), and quite possibly the most coveted of titles for the person who found it, the title of the Pirate King.<br><br>\nEnter Monkey D. Luffy, a 17-year-old boy that defies your standard definition of a pirate. Rather than the popular persona of a wicked, hardened, toothless pirate who ransacks villages for fun, Luffy’s reason for being a pirate is one of pure wonder; the thought of an exciting adventure and meeting new and intriguing people, along with finding One Piece, are his reasons of becoming a pirate. Following in the foo

In [22]:
# example of removing HTML tags from description

df.loc[[21,21827]].description.str.replace(r'<[^<>]*>', '', regex=True).loc[21827]

"A certain point in time, in the continent of Telesis. The great war which divided the continent into North and South has ended after four years, and the people are welcoming a new generation. Violet Evergarden, a young girl formerly known as “the weapon”, has left the battlefield to start a new life at CH Postal Service. There, she is deeply moved by the work of “Auto Memories Dolls”, who carry people's thoughts and convert them into words. Violet begins her journey as an Auto Memories Doll, and comes face to face with various people's emotions and differing shapes of love. There are words Violet heard on the battlefield, which she cannot forget. These words were given to her by someone she holds dear, more than anyone else. She does not yet know their meaning but she searches to find it.\n\n(Source: Anime News Network)"

Notice that many descriptions cite their source (e.g. "Source: Anime News Network", "Source: Crunchyroll"), which can be tricky to strip out cleanly. For now, as a temporary workaround, we remove newline characters ("\n") and the phrases "source" and "written by".

In [23]:
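df['description_2'] = (
  df.description
  .str.lower()
  .str.replace(r'<[^<>]*>', ' ', regex=True)        # strip HTML tags such as <br>
  .str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True)  # punctuation and underscores -> space
  .str.replace('.', ' ', regex=False)               # literal periods, not the regex wildcard
  .str.replace('\n', ' ', regex=False)              # newline characters
  .str.replace("source", ' ', regex=False)          # crude removal of source credits
  .str.replace("written by", ' ', regex=False)
)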
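The blanket removal of "source" above also deletes those letters inside longer words such as "resource". As a rough sketch of a more targeted alternative, the snippet below strips only bracketed credit lines. The sample strings, the `CREDIT` pattern, and the `strip_credits` helper are hypothetical illustrations; the pattern assumes credits always appear inside parentheses or square brackets, which may not hold for every row in the dataset.

```python
import re

import pandas as pd

# Hypothetical sample descriptions, not rows from the dataset.
sample = pd.Series([
    "A girl starts a new life.<br><br>\n(Source: Anime News Network)",
    "Two friends drift apart.\n\n[Written by Some Reviewer]",
    "A resourceful pirate sets sail.",
])

# Assumed credit format: "(Source: ...)" or "[Written by ...]" in brackets.
CREDIT = re.compile(
    r"[\(\[]\s*(?:source|written by)\b[^\)\]]*[\)\]]",
    flags=re.IGNORECASE,
)

def strip_credits(text: str) -> str:
    """Remove HTML tags and bracketed credit lines, then tidy whitespace."""
    text = re.sub(r"<[^<>]*>", " ", text)  # drop HTML tags such as <br>
    text = CREDIT.sub(" ", text)           # drop bracketed credits only
    return re.sub(r"\s+", " ", text).strip()

cleaned = sample.apply(strip_credits)
print(cleaned.tolist())
```

Because the pattern requires an opening bracket before "source", the word "resourceful" in the third sample is left untouched.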

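We use a weighting scheme called tf-idf (term frequency-inverse document frequency) to score how relevant each word is to a description: a word scores higher the more often it appears in that description and the rarer it is across all descriptions. In addition, the vectorizer's `stop_words="english"` option removes common English words, such as the articles "the" and "a" and the pronouns "he" and "her", before scoring.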

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(norm=None,stop_words="english") # Do not normalize.
vec.fit(df.description_2) # This determines the vocabulary.
tf_idf_sparse = vec.transform(df.description_2)
tf_idf_sparse

<3173x21968 sparse matrix of type '<class 'numpy.float64'>'
	with 137070 stored elements in Compressed Sparse Row format>
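To make the weighting concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences and the `toy_` names are illustrative assumptions, not part of the dataset; the vectorizer settings mirror the ones used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
toy_docs = [
    "the pirate sails the sea",
    "the pirate finds a treasure",
    "a witch lives on the island",
]

toy_vec = TfidfVectorizer(norm=None, stop_words="english")
toy_matrix = toy_vec.fit_transform(toy_docs)

vocab = toy_vec.vocabulary_  # maps each kept term to its column index
idf = toy_vec.idf_           # one inverse-document-frequency weight per column

# "the" is an English stop word, so it never enters the vocabulary.
print("the" in vocab)

# "pirate" appears in two documents and "witch" in one,
# so "witch" receives the larger idf weight.
print(idf[vocab["pirate"]], idf[vocab["witch"]])
```

The rarer a term is across the corpus, the more it contributes to a description's vector, which is why distinctive words dominate the similarity scores computed next.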

In [25]:
# calculating description similarities within the tf_idf sparse matrix via linear kernel
# and returning top 10 titles most similar to given anime

from sklearn.metrics.pairwise import linear_kernel

def most_similar(name):
  indices = pd.Series(df.index, index=df['title'])
  try:
    anime_id = indices[name]  # named anime_id to avoid shadowing the built-in id()
  except KeyError:
    print("This anime does not exist in our database. Please make sure the spelling is correct. "
          "The title may also be in all caps.")
    return None

  # the tf-idf vectors are unnormalized (norm=None), so this is a raw dot product
  linear_sims = linear_kernel(tf_idf_sparse)
  np.fill_diagonal(linear_sims,0)  # an anime should not recommend itself
  tfidf = pd.DataFrame(linear_sims,index=df.index,columns=df.index)
  anime = tfidf[anime_id].sort_values(ascending=False)
  return df.loc[anime.index[:10]]["title"]

In [26]:
def f(x):
    display(x)
    return x

anime = sorted(list(df['title']))
selection = interactive(f, x=widgets.Dropdown(options=anime,description='Anime name:',disabled=False))
print("Select an anime: ")
display(selection)

Select an anime: 


interactive(children=(Dropdown(description='Anime name:', options=('.hack//Legend Of The Twilight', '.hack//Li…

In [27]:
most_similar(selection.result)

id
4896                         Umineko: When They Cry
2563                         ARIA The OVA ~ARIETTA~
472                                        To Heart
169                          Lunar Legend Tsukihime
21366                    March comes in like a lion
1691                                 Kaze no Stigma
109125                  Omoi, Omoware, Furi, Furare
477                              ARIA The ANIMATION
101371    Ms. vampire who lives in my neighborhood.
5681                                    Summer Wars
Name: title, dtype: object

In [28]:
df[df.title=="5 Centimeters per Second"].description.iloc[0]

"Tohno Takaki and Shinohara Akari, two very close friends and classmates, are torn apart when Akari's family is transferred to another region of Japan due to her family's job. Despite separation, they continue to keep in touch through mail. When Takaki finds out that his family is also moving, he decides to meet with Akari one last time.<br><br>\nAs years pass by, they continue down their own paths, their distance slowly growing wider and their contact with one another fades. Yet, they keep remembering one another and the times they have shared together, wondering if they will have the chance to meet once again.\n"

In [29]:
df.loc[4896].description

'Considered as the third installment in the highly popular "When They Cry" series by 07th Expansion, Umineko no Naku Koro ni takes place on the island of Rokkenjima (&#20845;&#36562;&#23798;), owned by the immensely wealthy Ushiromiya family. As customary per year, the entire family is gathering on the island for a conference that discusses the current financial situations of each respective person. Because of the family head\'s poor health, this year involves the topic of the head of the family&rsquo;s inheritance and how it will be distributed.<br><br>\nHowever, the family is unaware that the distribution of his wealth is the least of Ushiromiya Kinzo\'s (family head) concerns for this year\'s family conference. After being told that his end was approaching by his longtime friend and physician, Kinzo is desperate to meet his life\'s true love one last time: the Golden Witch, Beatrice. Having immersed himself in black magic for many of the later years in his life, Kinzo instigates a c

By this measure, *5 Centimeters per Second* and its top match, *Umineko: When They Cry*, count as similar because their descriptions share many highly weighted words, but plot-wise they have very little to do with each other.
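One plausible contributor to mismatches like this, offered as a hypothesis rather than a diagnosis: the vectorizer was created with `norm=None`, so `linear_kernel` returns raw dot products, and those tend to grow with description length. The toy sketch below uses made-up documents and hypothetical `toy_` names to compare the raw dot product with cosine similarity, which divides out vector length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up documents for illustration only.
toy_docs = [
    "friends separated distance",          # the query
    "friends separated distance letters",  # short and on-topic
    # long and mostly off-topic, but repeats two query words
    "family island witch family island witch "
    "friends friends friends distance distance",
]

toy_tfidf = TfidfVectorizer(norm=None).fit_transform(toy_docs)

raw = linear_kernel(toy_tfidf)         # unnormalized dot products
cos = cosine_similarity(toy_tfidf)     # dot products divided by vector lengths

# Similarity of the query (row 0) to the short and the long document.
print("raw dot product:", raw[0, 1], raw[0, 2])
print("cosine:         ", cos[0, 1], cos[0, 2])
```

In this toy case the raw dot product ranks the long document first, while cosine similarity ranks the short, on-topic document first. Whether the same effect explains the results above would need to be checked against the actual descriptions.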

## Content-based Recommendations

Some viewers may be in the mood for certain genres of anime, while others want to watch works by particular studios, authors and/or directors. We now investigate whether considering only the main genres, animation studio, relevant tags and key staff members makes for a better recommendation engine. We do this by concatenating those four pieces of text into one long string, or "soup", for each anime and measuring how similar the soups are to one another.

In [102]:
df[["genres","studio","tags_cleaned",'staff']]

id,genres,studio,tags_cleaned,staff
1029,"[Drama, Romance, Slice of Life]",Studio Ghibli,"[Female Protagonist, Iyashikei, Rural, Coming ...","[Hotaru Okamoto, Yuko Tone, Isao Takahata]"
898,"[Action, Adventure, Comedy, Fantasy, Sci-Fi]",Toei Animation,"[Martial Arts, Shounen, Super Power, Aliens]","[Akira Toriyama, Mitsuo Hashimoto, Yasuyuki Fuse]"
897,"[Action, Adventure, Comedy, Fantasy, Sci-Fi]",Toei Animation,"[Shounen, Super Power]","[Akira Toriyama, Mitsuo Hashimoto, Minoru Maeda]"
795,"[Drama, Psychological]",Tezuka Productions,"[Tragedy, School, Ojou-sama, Primarily Female ...","[Riyoko Ikeda, Osamu Dezaki, Tomoko Konparu]"
2000,"[Action, Comedy, Drama, Mecha, Sci-Fi]",APPP,"[Artificial Intelligence, Primarily Adult Cast...","[Katsuhiro Ootomo, Hisashi Eguchi, Hiroyuki Ki..."
...,...,...,...,...
117002,"[Drama, Mahou Shoujo, Mystery, Psychological, ...",Shaft,"[Female Protagonist, Magic, Primarily Female C...","[Magica Quartet, Ume Aoki, Gekidan Inu Curry]"
130713,[Fantasy],Studio Kafka,[],"[Kore Yamazaki, Kazuaki Terasawa, Hirotaka Katou]"
131584,"[Music, Supernatural]",A-1 Pictures,"[Vampire, Idol, Primarily Male Cast, Band, Mal...","[Noriyasu Agematsu, Ikumi Katagiri, Takeshi Fu..."
123494,"[Drama, Sports]",LIDENFILMS,"[Football, Shounen, Female Protagonist, Primar...","[Naoshi Arakawa, Seiki Takuno, Natsuko Takahashi]"


In [103]:
# combining two-word terms into one word so two people with same first names do not get mixed up

df['staff'] = df['staff'].apply(lambda x: [s.replace(" ","") for s in x])
df['genres'] = df['genres'].apply(lambda x: [s.replace(" ","") for s in x])
df['tags_cleaned'] = df['tags_cleaned'].apply(lambda x: [s.replace(" ","") for s in x])

def create_soup(x):
  return  ' ' + ' '.join(x['genres']) + ' ' + ''.join(x['studio']).replace(" ","") + ' ' \
          + ' '.join(x['tags_cleaned']) + ' ' + ' '.join(x['staff'])

df['soup'] = df.apply(create_soup, axis=1)

In [104]:
df['soup'].iloc[295]

' Action Ecchi Horror Mecha Sci-Fi AshiProductions Demons Post-Apocalyptic MartialArts Shounen Gore Tokusatsu Dystopian TakayukiYamaguchi ToshikiHirano IkuoKomiya'

Since all the terms in this soup are considered important, we can simply use a count vectorizer that creates word frequencies for each string. This is typically referred to as the "bag-of-words" model. 
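As a minimal sketch of the bag-of-words idea, the snippet below counts terms in three made-up soups and compares them with cosine similarity. The soups and the `toy_` names are illustrative assumptions. Note that `CountVectorizer` lowercases by default and, with its default token pattern, splits on hyphens, so "Sci-Fi" becomes the two tokens "sci" and "fi".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up soups in the same format as df['soup'], for illustration only.
toy_soups = [
    " Drama Romance CoMixWave MakotoShinkai",
    " Drama Romance Sci-Fi CoMixWave MakotoShinkai",
    " Action Comedy ToeiAnimation AkiraToriyama",
]

toy_count = CountVectorizer()
toy_counts = toy_count.fit_transform(toy_soups)  # rows: soups, columns: terms
toy_sims = cosine_similarity(toy_counts)

# Each merged name is a single lowercase token.
print(sorted(toy_count.vocabulary_))

# The first two soups share four terms; the first and third share none.
print(toy_sims[0, 1], toy_sims[0, 2])
```

Merging multi-word names into one token, as done in the cell above, is what keeps a staff member's name from partially matching a different person with the same first name.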

In [105]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_content_recs(name):

  # look up the id first so an unknown title fails before any heavy computation
  indices = pd.Series(df.index, index=df['title'])
  try:
    anime_id = indices[name]  # named anime_id to avoid shadowing the built-in id()
  except KeyError:
    print("This anime does not exist in our database. Please make sure the spelling is correct. "
          "The title may also be in all caps.")
    return None

  # bag-of-words counts over the soup column built above
  count = CountVectorizer(stop_words='english')
  count_matrix = count.fit_transform(df['soup'])

  cosine_sims = cosine_similarity(count_matrix)
  np.fill_diagonal(cosine_sims,0)  # an anime should not recommend itself

  sim_scores = pd.DataFrame(cosine_sims,
             index=df.index,
             columns=df.index).loc[anime_id].sort_values(ascending=False)[:10]

  return df.loc[sim_scores.index, 'title']

In [106]:
def f(x):
    display(x)
    return x

anime = sorted(list(df['title']))
selection = interactive(f, x=widgets.Dropdown(options=anime,description='Anime name:',disabled=False))
print("Select an anime: ")
display(selection)

Select an anime: 


interactive(children=(Dropdown(description='Anime name:', options=('.hack//Legend Of The Twilight', '.hack//Li…

In [152]:
get_content_recs(selection.result)

id
433       The Place Promised in Our Early Days
106286                     Weathering With You
21519                               Your Name.
16782                      The Garden of Words
17121                      Dareka no Manazashi
256                   Voices of a Distant Star
114065                        Remake Our Life!
147                            Rumbling Hearts
16067              Nagi-Asu: A Lull in the Sea
98820                            Just Because!
Name: title, dtype: object

In [96]:
df[df.title=="5 Centimeters per Second"].soup.iloc[0]

' Drama Romance SliceofLife CoMixWave ComingofAge Heterosexual TimeSkip Tragedy MakotoShinkai MakotoShinkai TakayoNishimura MakotoShinkai MakotoShinkai'

In [147]:
df.soup.loc[433]

' Drama Romance Sci-Fi CoMixWave ComingofAge AlternateUniverse Military MakotoShinkai MakotoShinkai UshioTazawa'

The results look much better this time. *5 Centimeters per Second* shares genres such as Drama and Romance with *The Place Promised in Our Early Days*, and both are directed by Makoto Shinkai, who also directed *Weathering With You*, *Your Name.*, and *The Garden of Words*. Results may be better for some anime than for others, however.