<a href="https://colab.research.google.com/github/Imanisima/anime-match/blob/clean-data/kitsunekko_scrubber.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anime Match
----
A classification model that uses transcripts from anime to filter out specific themes.

## The Problem
---
What do we consider when choosing anime? Genre, art style, content length, and popularity are typically what we think about.

Current recommendation systems that filter only by broad genre tags have two weaknesses: specific themes slip through, and entire categories get closed off because of thematic elements that are merely assumed. Broad genres such as horror and fantasy include many anime that deal with death and supernatural elements. A viewer who wants to avoid those themes may exclude the whole genre, even though several shows in that genre deal with neither.

Conversely, an anime may be tagged with a traditionally light-hearted genre yet include darker themes. For example, “Your Lie in April” is listed under romance but regularly deals with death, from the passing of the main character’s relative to the passing of one of the main characters by the finale. A thematic filter can be added to existing recommendation systems to improve the experience of anime enthusiasts in two ways: it reduces the number of anime with minor mentions of excluded themes that slip through because of their overarching genre tags, and it broadens the available recommendations to include larger genres that were previously excluded wholesale.

## The Big Picture
----
Although the classification model we are building targets anime, it could also be applied to manga, TV shows, books, newspapers, and other content. It could also support parental controls for children who are searching the internet or watching Netflix.

## Dataset
---
The dataset used in this project consists of raw transcripts from [Kitsunekko](https://kitsunekko.net). It contains transcripts from over 2000 anime in 4 languages: English, Japanese, Chinese, and Korean. For the purposes of this project, we will stick with English.

Transcripts can be found in the /content/transcripts path.


In [157]:
# run this to import from google drive
from google.colab import drive
drive.mount('/content/drive/') 

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
import os
import requests
from pprint import pprint

### I. Web Scraper
First, we need to build a web scraper for the kitsunekko.net site. We will do this using BeautifulSoup.

In [None]:
# web scraper for kitsunekko.net
domain = "https://kitsunekko.net"
sub_query = "/dirlist.php?dir=subtitles"
url = domain + sub_query
res = requests.get(url)

res

<Response [200]>

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(res.content, 'html.parser')

table_res = soup.find(id='flisttable') # id that points to the transcripts
trans_elem = table_res.find_all('a', class_='') # Using the table results, retrieve the rows with links to transcripts

In [None]:
import re

def clean_html(raw_html):
  """Strip HTML tags from text."""
  strip_tags = re.compile('<.*?>')
  clean_text = re.sub(strip_tags, '', raw_html)
  return clean_text
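
As a quick sanity check of the tag-stripping approach, the sketch below applies the same regular expression to made-up inputs. The helper `clean_title` is hypothetical (it is not part of the notebook) and rests on two assumptions: titles may contain HTML entities such as `&amp;`, and a lookup that finds no `<strong>` element returns `None`, which `str()` would otherwise turn into the literal title "None".

```python
import html
import re

TAG_RE = re.compile(r"<.*?>")  # same non-greedy pattern used by clean_html


def clean_title(element):
    """Hypothetical helper: strip tags, unescape entities, and trim a title.

    `element` is anything str() can render (for example a BeautifulSoup tag).
    None (no element found) yields an empty string instead of "None".
    """
    if element is None:
        return ""
    text = TAG_RE.sub("", str(element))
    return html.unescape(text).strip()


# Made-up inputs for illustration only.
print(clean_title("<strong>Your Lie in April</strong>"))
print(clean_title("<strong> Show &amp; Tell </strong>"))
print(repr(clean_title(None)))
```

Whether the extra entity handling is needed depends on the actual page markup, so treat it as optional.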
  

In [None]:
# save link and anime title
title_list = []
anime_list = {}
for a_tag in trans_elem:
    title_elem = a_tag.find('strong', class_='')
    title = clean_html(str(title_elem))
    anime_list[title] = a_tag["href"]

    title_list.append(title)

In [None]:
print(f"Total Anime: {len(anime_list.keys())}")

Total Anime: 2000
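
To test the title-to-link extraction without hitting the site, the sketch below runs the same idea on an inline HTML sample. It deliberately swaps BeautifulSoup for the standard-library `html.parser.HTMLParser` so that it has no dependencies. The class name `LinkTableParser`, the sample markup, and the `href` values are all made up, and the parser assumes the target table contains no nested tables.

```python
from html.parser import HTMLParser


class LinkTableParser(HTMLParser):
    """Collect {link text: href} for <a> tags inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False      # True while inside the target table
        self.current_href = None   # href of the <a> currently being read
        self.current_text = []     # text fragments seen inside that <a>
        self.links = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            self.in_table = True
        elif tag == "a" and self.in_table and "href" in attrs:
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False  # assumption: no nested tables
        elif tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            if title:
                self.links[title] = self.current_href
            self.current_href = None

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


# Made-up markup that mimics a table with id "flisttable".
sample_html = """
<a href="/ignored">outside the table</a>
<table id="flisttable">
  <tr><td><a href="/dirlist.php?dir=subtitles%2FShow+A%2F"><strong>Show A</strong></a></td></tr>
  <tr><td><a href="/dirlist.php?dir=subtitles%2FShow+B%2F"><strong>Show B</strong></a></td></tr>
</table>
"""

parser = LinkTableParser("flisttable")
parser.feed(sample_html)
print(f"Total Anime: {len(parser.links)}")
print(parser.links)
```

Because the result is a dictionary keyed by title, two rows with the same title would collapse into one entry, which is also true of `anime_list` in the notebook.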


In [None]:
import wget

def download_files(url, dir, title):
  """
  Download a file from kitsunekko.

  url: link to download the transcripts
  dir: where to store the download
  title: file name used to check whether the transcript was already downloaded
  """
  dir = os.path.expanduser(dir)
  os.makedirs(dir, exist_ok=True)  # create the target directory if needed

  print(url)
  print("\nDownloading kitsunekko transcripts...")
  print(dir)
  print(title)

  # skip the download when a file with this title is already in the directory
  if os.path.exists(os.path.join(dir, title)):
      print(title, "already downloaded")
  else:
      wget.download(url=url, out=dir)
      print(f"{title} downloaded successfully!")

ModuleNotFoundError: ignored
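
The `ModuleNotFoundError` above suggests that the `wget` package was not available in this runtime. One way around that, sketched here as an alternative rather than as the notebook's method, is to download with the standard library. The function name `download_if_missing` is hypothetical, and the sketch assumes that the last segment of the URL path is a usable file name.

```python
import os
import urllib.request
from urllib.parse import unquote, urlsplit


def download_if_missing(url, out_dir):
    """Download url into out_dir unless a file with that name already exists.

    Returns (path, downloaded); downloaded is False when the file was skipped.
    Assumption: the last segment of the URL path is the file name.
    """
    out_dir = os.path.expanduser(out_dir)
    os.makedirs(out_dir, exist_ok=True)

    # Percent-decoded final path segment, e.g. "my%20subs.zip" -> "my subs.zip"
    filename = unquote(os.path.basename(urlsplit(url).path))
    target = os.path.join(out_dir, filename)

    if os.path.exists(target):
        return target, False

    urllib.request.urlretrieve(url, target)
    return target, True


# Usage (needs network access, so it is left commented out):
# path, fresh = download_if_missing(download_url, trans_path)
```

Returning a flag instead of printing makes it easy to count how many files were actually fetched on a re-run.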

In [None]:
# zip files
trans_title = ""
trans_path = "/content/drive/My Drive/Colab Notebooks/transcripts/" # where to store the file

for zip_link in anime_list:
  zip_url = domain + anime_list[zip_link]
  zip_res = requests.get(zip_url)

  soup = BeautifulSoup(zip_res.content, 'html.parser')

  table_res = soup.find(id='flisttable')
  trans_elem = table_res.find_all('a', class_='')

  for a_tag in trans_elem:
    title_elem = a_tag.find('strong', class_='')
    trans_title = clean_html(str(title_elem))

    download_url = domain + "/" + a_tag["href"]

    # pass the transcript's own title so the "already downloaded" check can match a file name
    download_files(download_url, trans_path, trans_title)
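
The loop builds URLs by string concatenation, which only works when each `href` has exactly the expected leading slash (or lack of one). A more forgiving option is `urllib.parse.urljoin`, sketched below. The two `href` values are made up to show an absolute and a relative path; the real links on the site may look different.

```python
from urllib.parse import urljoin

domain = "https://kitsunekko.net"

# Made-up hrefs: one starts with "/", the other does not.
hrefs = ["/dirlist.php?dir=subtitles", "subtitles/example/file.zip"]

# urljoin resolves both forms against the site root.
urls = [urljoin(domain + "/", href) for href in hrefs]
for url in urls:
    print(url)

# Plain concatenation doubles the slash when the href is already absolute.
naive = domain + "/" + hrefs[0]
print(naive)
```

With `urljoin`, the same expression can serve both the directory pages and the file links, so the two concatenation styles in the loop would not need to differ.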

In [None]:
# extract files from .zip, .rar, and .7z archives
from pyunpack import Archive
import sys
import subprocess

trans_folder = os.listdir(trans_path) # get files from the transcript directory

for file in trans_folder:
    # match on the file extension rather than anywhere in the name
    if file.endswith((".rar", ".zip", ".7z")):
      print(file, "will be unpacked")
      print(trans_path)

      try:
        Archive(trans_path + file).extractall(trans_path)
      except Exception as err: # e.g. a bad zip file or magic number error
        print(file, "could not be unpacked:", err)
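
Both the extraction step and the later cleanup step need to decide whether a file name is an archive. The sketch below isolates that check in a hypothetical helper, `is_archive`. The case-insensitive comparison is an added assumption, since the listing may or may not contain upper-case extensions. It also shows why suffixes must be passed to `str.endswith` as a tuple: an expression such as `".zip" or ".7z" or ".rar"` evaluates to the first string only.

```python
ARCHIVE_SUFFIXES = (".zip", ".rar", ".7z")


def is_archive(filename):
    """Return True when filename ends with a known archive suffix (case-insensitive)."""
    return filename.lower().endswith(ARCHIVE_SUFFIXES)


# Made-up file names for illustration.
names = ["show.zip", "show.RAR", "show.7z", "episode_01.ass", "notes.zip.txt"]
archives = [name for name in names if is_archive(name)]
print(archives)

# Pitfall: `or` between non-empty strings returns the first one,
# so only ".zip" would ever be tested.
only_first = ".zip" or ".7z" or ".rar"
print(only_first)
```

Keeping the suffix list in one constant means the extraction and deletion steps cannot drift apart.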

In [200]:
# remove compressed files
for item in trans_folder:
    print(item)
    if item.endswith((".zip", ".7z", ".rar")): # a tuple is needed to test several suffixes
      try:
        os.remove(trans_path + item)
      except OSError: # file already gone or not removable
        pass

[LuPerry]_dot_hack_GU_-_returner.zip
[LuPerry]_dot_hack_GU_TRILOGY_1920x1080(x264_AC3).zip
[WhyNot] .hack%u2044%u2044Quantum.zip
[WhyNot] .hack⁄⁄Quantum.zip
.hack SIGN (01-28) [ssa] [AHQ].zip
[Commie] 3-gatsu no Lion (1-22).zip
3D_Kanojo_Real_Girl_TV_2018_Eng(1).rar
3D_Kanojo_Real_Girl_TV_2018_Eng.rar
3D_Kanojo_Real_Girl_TV2_2019_Eng.rar
5-tou ni Naritai (DVD 720x480 x264 10bit AC3) [53A56E8C].ass
[AniSubs]_07-Ghost_TV_Eng.rar
[Eng.Softsub]_07-GHOST-Hatsuyuki.7z
8 Man After R1 dialogue subs.zip
009-1 signs and songs track.zip
Neosubs_11_eyes_BD_01_animesave_com.ass
Neosubs_11_eyes_BD_02_animesave_com.ass
Neosubs_11_eyes_BD_03_animesave_com.ass
Neosubs_11_eyes_BD_04_animesave_com.ass
Neosubs_11_eyes_BD_05_animesave_com.ass
Neosubs_11_eyes_BD_06_animesave_com.ass
Neosubs_11_eyes_BD_07_animesave_com.ass
Neosubs_11_eyes_BD_08_animesave_com.ass
Neosubs_11_eyes_BD_09_animesave_com.ass
Neosubs_11_eyes_BD_10_animesave_com.ass
Neosubs_11_eyes_BD_11_animesave_com.ass
Neosubs_11_eyes_BD_12_animes