# Anime Synopsis Classifier

**Goal:** Using synopses from MyAnimeList.net, I want to accurately predict the genres of each anime in the dataset.

To do this, the notebook is organized into the following sections:

#### I. Reading and Cleaning the Data
#### II. Vectorize Text
#### III. Dimension Reduction (LSA)
#### IV. Run several Multilabel Classification algorithms (Naive Bayes, Random Forests, Nearest Neighbors, ANN, LSTM)
#### V. Conclusions

## I. Reading and Cleaning the Data

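The synopses are stored as text files, grouped into one subdirectory per genre. The first line of each file lists the anime's genres and the remaining lines hold the synopsis, so we first load the files and strip blank lines and trailing credit lines to get cleaner data to work with.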

In [2]:
import glob
import os
import sys

import pandas as pd


def load_txt_files(directory):
    """Return a DataFrame with the raw data from every <directory>/<genre>/<anime>.txt file."""
    output = []
    for i, file_name in enumerate(glob.glob(os.path.join(directory, '*', '*.txt'))):
        # progress indicator
        if i % 1000 == 0:
            print(i)

        # file name without its directory or '.txt' extension (independent of the OS path separator)
        name = os.path.splitext(os.path.basename(file_name))[0]

        with open(file_name, encoding='utf-8') as file:
            # the first line holds the semicolon-separated genres
            genres = file.readline().strip()
            # the remaining non-blank lines make up the synopsis
            lines = [line.strip() for line in file if line.strip()]

        # remove credits for analysis: a last line wrapped in (...) or [...]
        if lines and lines[-1][0] in '([' and lines[-1][-1] in ')]':
            lines = lines[:-1]

        output.append((name, genres, '\n'.join(lines)))

    print('Creating DataFrame...')
    sys.stdout.flush()

    return pd.DataFrame(output, columns=['Anime', 'Genres', 'Synopsis'])
    

In [3]:
dir_ = '../animeScrape'
df = load_txt_files(dir_)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
Creating DataFrame...
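
The credit-removal rule used by the loader can be tried in isolation. The following is a minimal sketch, assuming that a credit is a final line wrapped in parentheses or square brackets; the helper name `strip_credit` and the sample text are made up for illustration and are not part of the dataset.

```python
def strip_credit(synopsis):
    """Drop blank lines and a trailing credit line wrapped in (...) or [...]."""
    lines = [line.strip() for line in synopsis.splitlines() if line.strip()]
    if lines and lines[-1][0] in "([" and lines[-1][-1] in ")]":
        lines = lines[:-1]
    return "\n".join(lines)


# hypothetical sample, not taken from the scraped files
sample = "A cyborg spy works in a divided world.\n\n(Source: example)\n"
cleaned = strip_credit(sample)
print(cleaned)
```

Stripping each line before the check matters: a credit line that still carries its trailing newline would otherwise end in `'\n'` rather than `')'` and slip through.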


In [4]:
df

Unnamed: 0,Anime,Genres,Synopsis
0,009-1,Action;Mecha;Sci-Fi;Seinen,"Mylene Hoffman, a beautiful cyborg spy with th..."
1,009-1__R_B,Action;Sci-Fi;Seinen,"Mylene Hoffman, also known by the code name ""0..."
2,009_Re_Cyborg,Action;Adventure;Mecha;Sci-Fi,Nine regular humans from different parts of th...
3,07-Ghost,Action;Demons;Fantasy;Josei;Magic;Military,Barsburg Empire's Military Academy is known fo...
4,11-nin_Iru,Action;Adventure;Drama;Mystery;Romance;Sci-Fi;...,The elite Cosmo Academy attracts applicants fr...
5,11eyes,Action;Ecchi;Super Power;Supernatural,"When the Sky turns Red, the Moon turns Black, ..."
6,3x3_Eyes,Action;Demons;Fantasy;Horror;Romance,3X3 Eyes is the story of a young man named Yak...
7,3x3_Eyes_Seima_Densetsu,Action;Adventure;Demons;Fantasy;Horror;Romance,"Yakumo has trained and searched for 4 years, f..."
8,6_Angels,Action;Sci-Fi,A group of female mercenaries known as the Ros...
9,91_Days,Action;Historical;Drama,"As a child living in the town of Lawless, Ange..."


In [5]:
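import os

# each genre has its own subdirectory; use the directory names as binary column labels
genres = os.listdir(dir_)

for genre in genres:
    # directory names use underscores ('Super_Power') while the Genres strings use
    # spaces ('Super Power'); compare whole genre names so that one name cannot
    # match inside another (e.g. 'Shounen' inside 'Shounen Ai')
    label = genre.replace('_', ' ')
    df[genre] = df['Genres'].apply(lambda g: int(label in g.split(';')))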

In [6]:
# don't need genres column anymore
df = df.drop('Genres', axis = 1)
df

Unnamed: 0,Anime,Synopsis,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Shounen_Ai,Slice_of_Life,Space,Sports,Supernatural,Super_Power,Thriller,Vampire,Yaoi,Yuri
0,009-1,"Mylene Hoffman, a beautiful cyborg spy with th...",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,009-1__R_B,"Mylene Hoffman, also known by the code name ""0...",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,009_Re_Cyborg,Nine regular humans from different parts of th...,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,07-Ghost,Barsburg Empire's Military Academy is known fo...,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,11-nin_Iru,The elite Cosmo Academy attracts applicants fr...,1,1,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
5,11eyes,"When the Sky turns Red, the Moon turns Black, ...",1,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
6,3x3_Eyes,3X3 Eyes is the story of a young man named Yak...,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7,3x3_Eyes_Seima_Densetsu,"Yakumo has trained and searched for 4 years, f...",1,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
8,6_Angels,A group of female mercenaries known as the Ros...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,91_Days,"As a child living in the town of Lawless, Ange...",1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
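
The binary genre columns could also be built in one step from the semicolon-separated strings. The sketch below uses pandas' `Series.str.get_dummies` on a small made-up frame (`toy` and its two rows are hypothetical); it assumes the genre strings use spaces where the column labels use underscores, as the `Super Power` / `Super_Power` pair suggests.

```python
import pandas as pd

# hypothetical two-row frame in the same layout as the loaded data
toy = pd.DataFrame({
    "Anime": ["A", "B"],
    "Genres": ["Action;Super Power", "Drama;Shounen Ai"],
})

# one 0/1 column per distinct genre, split on ';' so only whole names match
dummies = toy["Genres"].str.get_dummies(sep=";")
dummies.columns = [c.replace(" ", "_") for c in dummies.columns]

toy_encoded = pd.concat([toy.drop(columns="Genres"), dummies], axis=1)
print(toy_encoded)
```

Unlike a loop over directory names, this only creates columns for genres that actually occur in the data.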


In [7]:
# serialize dataframe for future use
df.to_pickle('anime_df.pkl')

In [8]:
# read in dataframe
df = pd.read_pickle('anime_df.pkl')

In [9]:
df

Unnamed: 0,Anime,Synopsis,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Shounen_Ai,Slice_of_Life,Space,Sports,Supernatural,Super_Power,Thriller,Vampire,Yaoi,Yuri
0,009-1,"Mylene Hoffman, a beautiful cyborg spy with th...",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,009-1__R_B,"Mylene Hoffman, also known by the code name ""0...",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,009_Re_Cyborg,Nine regular humans from different parts of th...,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,07-Ghost,Barsburg Empire's Military Academy is known fo...,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,11-nin_Iru,The elite Cosmo Academy attracts applicants fr...,1,1,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
5,11eyes,"When the Sky turns Red, the Moon turns Black, ...",1,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
6,3x3_Eyes,3X3 Eyes is the story of a young man named Yak...,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7,3x3_Eyes_Seima_Densetsu,"Yakumo has trained and searched for 4 years, f...",1,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
8,6_Angels,A group of female mercenaries known as the Ros...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,91_Days,"As a child living in the town of Lawless, Ange...",1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


## II. Vectorize Text

In [15]:
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.asarray(df['Synopsis'])
# Index.difference returns the remaining column labels in sorted order,
# so column 0 of y is 'Action'
y = np.asarray(df[df.columns.difference(['Anime','Synopsis'])])

sf = ShuffleSplit(n_splits = 1, train_size = .8, test_size = .2)

# separate into 80% training 20% testing set (n_splits = 1 yields a single split)
train_idx, test_idx = next(sf.split(X))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
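
The split above is random on every run, so the scores reported later will vary slightly between executions. As a sketch of one way to make it repeatable, `train_test_split` with a fixed `random_state` produces the same 80/20 partition each time. The toy arrays and the seed value `0` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-ins for the synopsis array and the binary label matrix
rng = np.random.RandomState(0)
X_toy = np.array(["synopsis %d" % i for i in range(10)])
y_toy = rng.randint(0, 2, size=(10, 3))

# 80% training, 20% testing; a fixed seed makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```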

In [66]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

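# count unigrams, bigrams and trigrams, keeping the 15,000 most frequent terms
count_vect = CountVectorizer(stop_words='english', ngram_range=(1, 3), max_features = 15000)
X_train_count = count_vect.fit_transform(X_train)
# use_idf=False: L2-normalized term frequencies, without inverse-document-frequency weighting
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
X_train_tf = tf_transformer.transform(X_train_count)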

In [67]:
# use fitted model to vectorize test data
X_test_count = count_vect.transform(X_test)
X_test_tf = tf_transformer.transform(X_test_count)
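
To make the fit/transform distinction concrete, here is a small sketch on made-up sentences (the documents and variable names are hypothetical). The vocabulary is learned from the training documents only, so terms that appear only in the test document are ignored, and with `use_idf=False` each row is simply the term counts scaled to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# hypothetical toy corpus
train_docs = ["a cyborg spy fights spies", "students attend the space academy"]
test_docs = ["a spy at the academy meets a dragon"]

vec = CountVectorizer(stop_words="english", ngram_range=(1, 2))
tf = TfidfTransformer(use_idf=False)

# fit on the training documents only, then reuse the fitted objects on the test set
train_tf = tf.fit_transform(vec.fit_transform(train_docs))
test_tf = tf.transform(vec.transform(test_docs))

# every training row has unit L2 norm after the transform
row_norms = np.sqrt(np.asarray(train_tf.multiply(train_tf).sum(axis=1)).ravel())
print(train_tf.shape, test_tf.shape, row_norms)
```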

In [68]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = count_vect.get_feature_names_out()
pd.DataFrame(features)

Unnamed: 0,0
0,00
1,000
2,000 000
3,000 years
4,009
5,01
6,02
7,03
8,08th
9,09


## III. Dimension Reduction

In [11]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from scipy import sparse

normalizer = Normalizer(copy=False)
svd = TruncatedSVD(n_components = 5000)
lsa = make_pipeline(svd,normalizer)

lsa.fit(X_train_tf)

Pipeline(steps=[('truncatedsvd', TruncatedSVD(algorithm='randomized', n_components=5000, n_iter=5,
       random_state=None, tol=0.0)), ('normalizer', Normalizer(copy=False, norm='l2'))])

In [11]:
X_train_tf_reduced = sparse.csr_matrix(lsa.transform(X_train_tf))
X_train_tf_reduced.shape


In [22]:
X_test_tf_reduced = sparse.csr_matrix(lsa.transform(X_test_tf))
X_test_tf_reduced.shape

(2145, 5000)
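
One way to judge whether a given number of components is reasonable is to look at how much variance the SVD retains. The sketch below runs the same `TruncatedSVD` plus `Normalizer` pipeline on a small random sparse matrix; the matrix, its size, and `n_components=10` are assumptions chosen only to keep the example fast, not values from this notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# hypothetical stand-in for the term-frequency matrix: 50 documents, 200 terms
rng = np.random.RandomState(0)
toy_tf = sparse.random(50, 200, density=0.1, format="csr", random_state=rng)

toy_svd = TruncatedSVD(n_components=10, random_state=0)
toy_lsa = make_pipeline(toy_svd, Normalizer(copy=False))
toy_reduced = toy_lsa.fit_transform(toy_tf)

# fraction of the variance kept by the retained components
retained = float(toy_svd.explained_variance_ratio_.sum())
print(toy_reduced.shape, round(retained, 3))
```

Note that the LSA output is dense and contains negative values, so it cannot be passed to classifiers that require non-negative features, such as `MultinomialNB`.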

In [57]:
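from sklearn.feature_selection import SelectKBest, chi2

# alternative to LSA: keep the 5,000 features with the highest chi-squared score
# against the first label column; this overwrites the LSA-reduced matrices above
# (the selector gets its own name so that it does not shadow the chi2 function)
selector = SelectKBest(chi2, k=5000)
selector.fit(X_train_tf, y_train[:,0])

X_train_tf_reduced = selector.transform(X_train_tf)
X_test_tf_reduced = selector.transform(X_test_tf)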
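
The selection above scores features against a single label column. For a multilabel target, one possible extension is to compute the chi-squared score of every feature against every label and keep the features with the highest score for any label. This is only a sketch on random toy data; the aggregation by maximum, the value of `k`, and all variable names are assumptions, not something this notebook does.

```python
import numpy as np
from sklearn.feature_selection import chi2

# hypothetical non-negative feature matrix and two binary label columns
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 3, size=(40, 12)).astype(float)
Y_toy = np.column_stack([
    (X_toy[:, 0] > 0).astype(int),
    (X_toy[:, 5] > 1).astype(int),
])

# chi2 returns (scores, p-values); collect one score column per label
scores = np.column_stack(
    [chi2(X_toy, Y_toy[:, j])[0] for j in range(Y_toy.shape[1])]
)

# keep the k features with the highest score for any label
k = 4
best = np.nan_to_num(scores).max(axis=1)
keep = np.sort(np.argsort(best)[-k:])
X_sel = X_toy[:, keep]
print(keep, X_sel.shape)
```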

## IV. Test Different Classifiers

### 1) Naive Bayes

In [69]:
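from sklearn.naive_bayes import MultinomialNB

# binary classifier for a single genre: the cells below train and score it on
# y[:,0], the first label column ('Action', since the label columns are sorted)
clf = MultinomialNB(alpha=1e-3)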

In [72]:
clf.fit(X_train_tf,y_train[:,0])

MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True)

In [74]:
clf.score(X_train_tf,y_train[:,0])

0.91057479305118338

In [76]:
y_pred = clf.predict(X_test_tf)

In [77]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_curve

print('Accuracy: ' + str(accuracy_score(y_test[:,0],y_pred)))
print('F1 Score: ' + str(f1_score(y_test[:,0],y_pred)))
print('Precision Score: ' + str(precision_score(y_test[:,0],y_pred)))
print('Recall Score: ' + str(recall_score(y_test[:,0],y_pred)))

Accuracy: 0.793006993007
F1 Score: 0.556
Precision Score: 0.646511627907
Recall Score: 0.487719298246
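
The cells above fit and score a single genre. To predict all genres at once, one common approach is to wrap the same estimator in `OneVsRestClassifier`, which trains one binary classifier per label column. The following is a sketch on random toy data, not a result from this dataset; the feature matrix, the three synthetic labels, and the 45/15 split are all made up.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# hypothetical count features and three binary labels derived from them
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 4, size=(60, 8)).astype(float)
Y_toy = np.column_stack([
    (X_toy[:, 0] > X_toy[:, 1]).astype(int),
    (X_toy[:, 2] > 1).astype(int),
    (X_toy[:, 3] + X_toy[:, 4] > 3).astype(int),
])

# one MultinomialNB per label column
ovr = OneVsRestClassifier(MultinomialNB(alpha=1e-3))
ovr.fit(X_toy[:45], Y_toy[:45])
Y_pred = ovr.predict(X_toy[45:])

# micro averages over all label decisions, macro averages the per-label F1 scores
micro_f1 = f1_score(Y_toy[45:], Y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(Y_toy[45:], Y_pred, average="macro", zero_division=0)
print(Y_pred.shape, micro_f1, macro_f1)
```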


### 2) Support Vector Machines

In [78]:
from sklearn.svm import SVC

svc = SVC(kernel = 'linear')
svc.fit(X_train_tf,y_train[:,0])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [79]:
y_pred = svc.predict(X_test_tf)

In [80]:
print('Accuracy: ' + str(accuracy_score(y_test[:,0],y_pred)))
print('F1 Score: ' + str(f1_score(y_test[:,0],y_pred)))
print('Precision Score: ' + str(precision_score(y_test[:,0],y_pred)))
print('Recall Score: ' + str(recall_score(y_test[:,0],y_pred)))

Accuracy: 0.803263403263
F1 Score: 0.560416666667
Precision Score: 0.689743589744
Recall Score: 0.471929824561
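
Both classifier sections print the same four metrics, so the reporting could be collected in a small helper. This is a sketch: the name `binary_report` and the six-element toy labels are hypothetical. The last lines check that the F1 score is the harmonic mean of precision and recall, using the Naive Bayes precision and recall values printed above.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def binary_report(y_true, y_pred):
    """Return the four metrics used in this notebook as a dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }


# hypothetical labels: 2 true positives, 1 false negative, 1 false positive, 2 true negatives
scores = binary_report([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
for name, value in scores.items():
    print(f"{name}: {value:.3f}")

# F1 = 2PR / (P + R), with the Naive Bayes precision and recall from above
p, r = 0.646511627907, 0.487719298246
f1_from_pr = 2 * p * r / (p + r)
print(round(f1_from_pr, 3))
```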
