<a href="https://colab.research.google.com/github/Imanisima/anime-match/blob/master/anime_match.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anime Match
----
A classification model that uses anime transcripts to filter out specific themes.

## The Problem
---
What do we consider when choosing anime? Genre, art style, content length, and popularity are typically what we think about.

Current recommendation systems filter only by broad genre tags. This lets specific themes slip through, and it also closes off entire categories because of themes they are assumed to contain. Broad genres such as horror and fantasy include many anime that deal with death and supernatural elements. A viewer who wants to avoid either theme may exclude the whole genre, even though several shows in that genre deal with neither.

Conversely, an anime may be tagged with a traditionally light-hearted genre yet include darker themes. For example, “Your Lie in April” is tagged as romance overall but regularly deals with death, from the passing of the main character’s relative to the passing of one of the main characters by the finale. A thematic filter can complement existing recommendation systems and improve the experience of anime enthusiasts in two ways: by reducing the number of anime with minor mentions of excluded themes that slip through under their overarching genre tags, and by broadening the available recommendations to include larger genres that were previously excluded.

## The Big Picture
----
Although the classification model we are building targets anime, it could also be applied to manga, TV shows, books, newspapers, and other content. It could also support parental controls for children searching the internet or watching Netflix.

## Dataset
---
The dataset used in this project consists of raw transcripts from [Kitsunekko](https://kitsunekko.net). It contains transcripts from over 2000 anime in 4 languages: English, Japanese, Chinese, and Korean. For the purposes of this project, we will stick with English.

Transcripts can be found in the /content/transcripts path.


## I. Web Scraper
First, we need to build a web scraper for the kitsunekko.net site! Here are our steps:

(1) Build a web scraper using BeautifulSoup.

(2) Retrieve all zip files for each anime listed.

(3) Extract all transcripts from each compressed file.

(4) Remove leftover compressed files to save some space.
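
Steps (3) and (4) can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's implementation: it handles only `.zip` archives through `zipfile` (the notebook relies on `pyunpack` to also cover `.rar` and `.7z`), and the helper name `extract_and_remove` and the demo file names are hypothetical.

```python
import os
import tempfile
import zipfile


def extract_and_remove(folder):
    """Extract every .zip in `folder` into it, then delete the archive.

    Hypothetical helper for illustration; returns the names of the
    archives that were extracted and removed.
    """
    extracted = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".zip"):
            continue
        path = os.path.join(folder, name)
        try:
            with zipfile.ZipFile(path) as archive:
                archive.extractall(folder)
        except zipfile.BadZipFile:
            # leave corrupt archives in place so they can be inspected later
            continue
        os.remove(path)  # step (4): reclaim space once extraction succeeded
        extracted.append(name)
    return extracted


# Demo on a throwaway directory containing one small archive
demo_dir = tempfile.mkdtemp()
with zipfile.ZipFile(os.path.join(demo_dir, "episode.zip"), "w") as z:
    z.writestr("episode01.srt", "1\n00:00:01,000 --> 00:00:02,000\nHello\n")

done = extract_and_remove(demo_dir)
remaining = sorted(os.listdir(demo_dir))
print(done, remaining)
```

Deleting the archive only after a successful extraction means a corrupt download is never silently lost.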

In [None]:
# run this to import from google drive
from google.colab import drive
drive.mount('/content/drive/') 

Mounted at /content/drive/


In [None]:
import os
import requests

In [None]:
'''
Web-scraper for kitsunekko.net
'''

domain = "https://kitsunekko.net"
sub_query = "/dirlist.php?dir=subtitles"
url = domain + sub_query
res = requests.get(url)

res

<Response [200]>

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(res.content, 'html.parser')

table_res = soup.find(id='flisttable') # id that points to the transcripts
trans_elem = table_res.find_all('a', class_='') # Using the table results, retrieve the rows with links to transcripts

In [None]:
import re

''' Strip html tags from text '''
def clean_html(raw_html):
  strip_tags = re.compile('<.*?>')
  clean_text = re.sub(strip_tags, '', raw_html)
  return clean_text
  

In [None]:
'''
Each Anime has a title and a link for download
'''
anime_list = {}
for a_tag in trans_elem:
    title_elem = a_tag.find('strong', class_='')
    title = clean_html(str(title_elem))
    anime_list[title] = a_tag["href"]
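
To make the title-to-link mapping concrete, here is a small self-contained sketch that builds the same kind of dictionary from an inline HTML snippet. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra installs. The sample markup, the `LinkCollector` class, and the example titles are assumptions made for illustration; the real page layout may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {title: href} for every <a href=...><strong>title</strong></a>."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None        # href of the <a> currently open, if any
        self._in_strong = False  # True while inside <strong> within that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "strong" and self._href is not None:
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None
        elif tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        if self._in_strong and data.strip():
            self.links[data.strip()] = self._href


# Made-up markup that mimics a table of links with bold titles
sample_html = """
<table id="flisttable">
  <tr><td><a href="/dirlist.php?dir=subtitles%2FExample+A%2F"><strong>Example A</strong></a></td></tr>
  <tr><td><a href="/dirlist.php?dir=subtitles%2FExample+B%2F"><strong>Example B</strong></a></td></tr>
</table>
"""

collector = LinkCollector()
collector.feed(sample_html)
links = collector.links
print(links)
```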

In [None]:
# write list of anime to txt file for later use.
print("Writing to file...")
anime_list_path = "/content/drive/My Drive/Colab Notebooks/anime_list.txt"
with open(anime_list_path, "w") as f: # the with block closes the file automatically
  for anime in anime_list.keys():
    f.write(anime)
    f.write("\n")

print("Writing complete.")


Writing to file...
Writing complete.


In [None]:
# !pip install pyunpack # uncomment in case of error
from pyunpack import Archive

def decompress_files(trans_path):
  '''extract files from .zip, .rar, and .7z archives, then remove the archives'''
  for file in os.listdir(trans_path):
      if file.lower().endswith((".rar", ".zip", ".7z")):
        archive_path = os.path.join(trans_path, file)
        try:
          # extract next to the archive; the target must be an existing directory
          Archive(archive_path).extractall(trans_path)
          os.remove(archive_path) # step (4): drop the archive once extracted
        except Exception as err: # in case of a bad zip file or magic number error
          print("Could not extract", file, "-", err)



In [None]:
# !pip install wget # uncomment to install wget in case of error
import wget

def download_files(url, dest_dir, title):
  '''
  Download a file from kitsunekko and extract it

  url: link to download the transcripts
  dest_dir: where to store the download
  title: name of the file; its extension is dropped to name the folder
  '''
  dest_dir = os.path.expanduser(dest_dir)
  folder = os.path.splitext(dest_dir + title)[0]
  download_to = folder + "/"

  # create the target folder if needed, then download into it
  os.makedirs(folder, exist_ok=True)
  wget.download(url=url, out=download_to)

  decompress_files(download_to)


In [None]:
''' Uncomment this cell to delete the transcripts folder and its contents '''

# import shutil

# shutil.rmtree("/content/drive/My Drive/Colab Notebooks/transcripts/")
# print("folder removed")

folder removed


In [None]:
'''Get path to zip files for downloads.'''

trans_path = "/content/drive/My Drive/Colab Notebooks/transcripts/" # where to store the transcripts

print("Downloading from Kitsunekko.net...")
for page_href in anime_list.values():
  zip_url = domain + page_href
  zip_res = requests.get(zip_url)

  soup = BeautifulSoup(zip_res.content, 'html.parser')

  table_res = soup.find(id='flisttable')
  if table_res is None: # page without a file table; skip it
    continue
  trans_elem = table_res.find_all('a', class_='')

  for a_tag in trans_elem:
    title_elem = a_tag.find('strong', class_='')
    trans_title = clean_html(str(title_elem))

    download_url = domain + "/" + a_tag["href"]

    download_files(download_url, trans_path, trans_title)

print("Download complete.")

Downloading from Kitsunekko.net...


## II. Random Generator
This will be used to randomly select the anime we will train the model on!



In [121]:
import random


def list_random(n):
    ''' Read txt file into a list. Select 'N' random anime to be used for training.'''
    with open(anime_list_path, "r") as read_file:
        anime_list = [line.strip() for line in read_file]

    random.shuffle(anime_list)
    return anime_list[0:n]

In [122]:
train_anime = list_random(10)
train_anime

['Rokudenashi Majutsu Koushi to Akashic Records',
 'Hime Chen Otogi Chikku Idol Lilpri',
 'Shoujo Shuumatsu Ryokou',
 'Oh Edo Rocket',
 'Kino no Tabi - the Beautiful World',
 'Osamu Tezuka Buddha Movie 2',
 'Kodomo no Omocha',
 'Kaze no Stigma',
 'Toaru Majutsu no Index III',
 'Hunter X Hunter - OVA']
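
The selection above changes on every run. If a reproducible training set is wanted, one option is to seed the random generator. The sketch below is an assumption about how that could look, not part of the notebook; `sample_titles` and its `seed` parameter are hypothetical names.

```python
import random


def sample_titles(titles, n, seed=None):
    """Return up to `n` distinct titles chosen uniformly at random.

    `seed` is an assumption added here so that a training selection
    can be reproduced; the notebook's own version is unseeded.
    """
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.sample(titles, min(n, len(titles)))


titles = ["Kaze no Stigma", "Kodomo no Omocha", "Oh Edo Rocket", "Shoujo Shuumatsu Ryokou"]
picked = sample_titles(titles, 2, seed=0)
print(picked)
```

Calling `sample_titles` again with the same list and seed returns the same selection, which makes later experiments comparable.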

## III. Clean Transcripts
After randomly selecting the anime, we will select 10 episodes from each anime and run them through the parser.

In [None]:
# work in progress
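
While the parser is still in progress, here is a rough sketch of what the cleaning step might look like. It assumes the transcripts are SubRip (`.srt`) subtitle files, which the notebook does not confirm; other formats such as `.ass` would need different handling. The function name `srt_to_lines` and the sample text are hypothetical.

```python
import re

# Matches SubRip timing lines such as "00:00:01,000 --> 00:00:02,500"
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
# Matches <i>...</i> style markup and {\an8} style position overrides
TAG = re.compile(r"<[^>]+>|\{[^}]*\}")


def srt_to_lines(srt_text):
    """Return only the spoken lines from SubRip-formatted text."""
    lines = []
    for raw in srt_text.splitlines():
        line = raw.strip()
        # skip blank lines, cue counters, and timing lines
        # (caveat: a spoken line made only of digits is dropped as well)
        if not line or line.isdigit() or TIMING.match(line):
            continue
        line = TAG.sub("", line).strip()
        if line:
            lines.append(line)
    return lines


sample = """1
00:00:01,000 --> 00:00:02,500
<i>Good morning!</i>

2
00:00:03,000 --> 00:00:04,000
{\\an8}Where are we going?
"""

dialogue = srt_to_lines(sample)
print(dialogue)
```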

## IV. Prototype

In [114]:
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.manifold import TSNE
import spacy
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

In [115]:
def parseQuadLinesFile(file):
    ''' read quartets of lines from a file '''

    transcript = []
    title = []
    episode = []
    label = []

    # assumed order within each quartet: transcript, title, episode, label
    fields = [transcript, title, episode, label]

    with open(file, mode="r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            # i % 4 is the position of the line inside its quartet
            fields[i % 4].append(line.strip())
    return transcript, title, episode, label

In [116]:
def buildDataframe():
    ''' organize information from the txt file into a dataframe
        easier to visualize and separate conceptually '''

    ### Note: As of November 04, we are still building the files as well
    ### "filename.txt" is a placeholder for the time being
    
    file_name = "filename.txt"
    transcript, title, episode, label = parseQuadLinesFile(file_name)
    data_df = pd.DataFrame({'Transcript': transcript, 
                          'Title of Anime': title,
                          'Title of Episode': episode,
                          'Human Gold Label': label})
    data_df = data_df[['Transcript', 'Title of Anime', 'Title of Episode', 'Human Gold Label']]
    return data_df

In [117]:
def normalize_document(doc):
    # requires the NLTK stopword list: run nltk.download('stopwords') once
    wpt = nltk.WordPunctTokenizer()
    stop_words = set(nltk.corpus.stopwords.words('english'))
    
    # lower case and remove special characters and whitespaces
    # flags must be passed by keyword; the fourth positional argument is count
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()

    # tokenize document
    tokens = wpt.tokenize(doc)

    # filter out stopwords from document
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

In [None]:
'''Parse original file'''
transcript_df = buildDataframe()

''' Normalize document '''
normalize_transcript = np.vectorize(normalize_document)
norm_transcript = normalize_transcript(transcript_df['Transcript'].values)
norm_transcript

# vocabulary: every distinct word across the normalized transcripts
unique_words = list(set(word for doc in norm_transcript for word in doc.split()))

In [None]:
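''' Use GloVe to assess word embeddings '''
# assumption: any spaCy model that ships word vectors will do; en_core_web_lg is one example
nlp = spacy.load('en_core_web_lg')
word_glove_vectors = np.array([nlp(word).vector for word in unique_words])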
pd.DataFrame(word_glove_vectors, index=unique_words)

tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=3)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(word_glove_vectors)
labels = unique_words


In [None]:
''' K-means clustering on the mean word embedding of each transcript to determine its overall leaning.
    The cluster label indicates the leaning; on a test file, it would give the suggested group to which a given transcript belongs.
'''
from sklearn.cluster import KMeans

# a spaCy document vector is the mean of its token vectors
doc_glove_vectors = np.array([nlp(str(doc)).vector for doc in norm_transcript])

km = KMeans(n_clusters=3, random_state=0)
km.fit(doc_glove_vectors)
cluster_labels = pd.DataFrame(km.labels_, columns=['ClusterLabel'])
pd.concat([transcript_df, cluster_labels], axis=1)
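
To illustrate why averaging word embeddings gives the clustering step something useful to work with, here is a tiny sketch in plain Python. The 2-D vectors are made up for illustration and are not GloVe values; `toy_vectors` and `mean_vector` are hypothetical names. It covers only the document-vector step, not K-means itself.

```python
import math

# Made-up 2-D "embeddings" for illustration only;
# real GloVe vectors have hundreds of dimensions
toy_vectors = {
    "death": (0.9, 0.1),
    "funeral": (0.8, 0.2),
    "ghost": (0.7, 0.1),
    "love": (0.1, 0.9),
    "confession": (0.2, 0.8),
}


def mean_vector(doc):
    """Average the vectors of the known words in a whitespace-separated document."""
    vecs = [toy_vectors[w] for w in doc.split() if w in toy_vectors]
    if not vecs:
        return (0.0, 0.0)  # no known words: fall back to the origin
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


docs = ["death funeral", "ghost death", "love confession"]
doc_vecs = [mean_vector(d) for d in docs]

# Distance between the two darker documents vs. a dark and a romantic one
dark_pair = math.dist(doc_vecs[0], doc_vecs[1])
mixed_pair = math.dist(doc_vecs[0], doc_vecs[2])
print(dark_pair < mixed_pair)
```

Documents that share a theme end up close together in the averaged space, which is the property a clustering algorithm such as K-means relies on when grouping transcripts.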