# Anime Synopsis Classifier

**Goal:** Using synopses from the website MyAnimeList.net, I want to accurately predict the genres of each anime in the dataset.

This notebook is organized into the following sections:

#### I. Reading and Cleaning the Data
#### II. Vectorizing Synopses
#### III. Preparing Training/Testing Sets
#### IV. Running Several Multiclass Classification Algorithms (Naive Bayes, Random Forests, Nearest Neighbors, ANN, LSTM)
#### V. Conclusions

## I. Reading and Cleaning the Data

All the synopses are stored in text files, so we first load them and do some preprocessing to get cleaner data to work with.

In [50]:
import glob
import os
import sys

import pandas as pd


def load_txt_files(directory):
    """Load every synopsis file under `directory` into a pandas DataFrame.

    Each file holds the genres on its first line and the synopsis on the rest.
    """
    data_files = glob.glob(os.path.join(directory, '*', '*.txt'))

    output = []
    for i, file_name in enumerate(data_files):
        # print a progress counter every 1000 files
        if i % 1000 == 0:
            print(i)

        # the anime name is the file name without its directory or '.txt'
        # (os.path handles both '/' and '\\' separators)
        name = os.path.splitext(os.path.basename(file_name))[0]

        with open(file_name, encoding='utf-8') as file:
            genres = file.readline().strip()
            # keep only the non-blank synopsis lines
            lines = [line.strip() for line in file if line.strip()]

        # remove credits for analysis: drop a final line wrapped in () or [];
        # lines are stripped first so a trailing newline cannot hide the bracket
        if len(lines) > 1 and lines[-1][0] in '([' and lines[-1][-1] in ')]':
            lines.pop()

        output.append((name, genres, '\n'.join(lines)))

    print('Creating DataFrame...')
    sys.stdout.flush()

    return pd.DataFrame(output, columns=['Anime', 'Genres', 'Synopsis'])
    
    

In [51]:
dir_ = '../animeScrape'
df = load_txt_files(dir_)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
Creating DataFrame...
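
The parsing rules used by the loader (genres on the first line, blank lines skipped, a final bracketed credit line dropped) can be checked in isolation on a small string. The sketch below is only an illustration: `parse_synopsis_text` is a hypothetical helper that is not part of the notebook, the sample text is made up rather than taken from the dataset, and stripping each line before testing for brackets is an assumption of this sketch.

```python
def parse_synopsis_text(text):
    """Split raw file text into (genres, synopsis), dropping a trailing credit line.

    Hypothetical helper: assumes the first line holds the genres and the
    remaining non-blank lines hold the synopsis.
    """
    first, _, rest = text.partition('\n')
    genres = first.strip()
    lines = [line.strip() for line in rest.splitlines() if line.strip()]

    # A final line fully wrapped in () or [] is treated as a credit line.
    # A synopsis consisting of a single line is always kept.
    if len(lines) > 1 and lines[-1][0] in '([' and lines[-1][-1] in ')]':
        lines.pop()

    return genres, '\n'.join(lines)


# Made-up sample text, not taken from the dataset.
sample = (
    "Action;Sci-Fi\n"
    "\n"
    "First synopsis line.\n"
    "\n"
    "Second synopsis line.\n"
    "(Source: placeholder)\n"
)
genres, synopsis = parse_synopsis_text(sample)
print(genres)
print(synopsis)
```

Testing the stripped line matters because lines read from a file normally end in a newline, so checking the raw last character for `)` or `]` would only succeed when the file has no trailing newline.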


In [52]:
df

| | Anime | Genres | Synopsis |
|---|---|---|---|
| 0 | 009-1 | Action;Mecha;Sci-Fi;Seinen | Mylene Hoffman, a beautiful cyborg spy with th... |
| 1 | 009-1__R_B | Action;Sci-Fi;Seinen | Mylene Hoffman, also known by the code name "0... |
| 2 | 009_Re_Cyborg | Action;Adventure;Mecha;Sci-Fi | Nine regular humans from different parts of th... |
| 3 | 07-Ghost | Action;Demons;Fantasy;Josei;Magic;Military | Barsburg Empire's Military Academy is known fo... |
| 4 | 11-nin_Iru | Action;Adventure;Drama;Mystery;Romance;Sci-Fi;... | The elite Cosmo Academy attracts applicants fr... |
| 5 | 11eyes | Action;Ecchi;Super Power;Supernatural | When the Sky turns Red, the Moon turns Black, ... |
| 6 | 3x3_Eyes | Action;Demons;Fantasy;Horror;Romance | 3X3 Eyes is the story of a young man named Yak... |
| 7 | 3x3_Eyes_Seima_Densetsu | Action;Adventure;Demons;Fantasy;Horror;Romance | Yakumo has trained and searched for 4 years, f... |
| 8 | 6_Angels | Action;Sci-Fi | A group of female mercenaries known as the Ros... |
| 9 | 91_Days | Action;Historical;Drama | As a child living in the town of Lawless, Ange... |
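
Each `Genres` value is a single semicolon-separated string, so an anime usually carries several labels at once. One possible way to turn those strings into per-genre targets is sketched below in plain Python. It is an illustration, not necessarily the encoding this notebook uses later; the three genre strings are copied from the rows shown above, and the variable names are made up for this example.

```python
# Three Genres strings copied from the rows displayed above.
genre_strings = {
    '009-1': 'Action;Mecha;Sci-Fi;Seinen',
    '6_Angels': 'Action;Sci-Fi',
    '91_Days': 'Action;Historical;Drama',
}

# One list of genre labels per anime.
genre_lists = {anime: genres.split(';') for anime, genres in genre_strings.items()}

# Sorted vocabulary of every genre seen in the sample.
vocabulary = sorted({g for labels in genre_lists.values() for g in labels})

# One 0/1 indicator per vocabulary entry, in vocabulary order.
multi_hot = {
    anime: [int(g in labels) for g in vocabulary]
    for anime, labels in genre_lists.items()
}

print(vocabulary)
print(multi_hot['6_Angels'])
```

On the full DataFrame, `df['Genres'].str.get_dummies(sep=';')` in pandas produces the same kind of indicator columns in one call.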
