# Notebook for Jutsu Classification

---
## Import the necessary Libraries

In [9]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from datasets import Dataset

from bs4 import BeautifulSoup

## Load the Classification Dataset

In [2]:
data_path = "../data/jutsus.jsonl"
df = pd.read_json(data_path, lines=True)
df.head()

Unnamed: 0,jutsu_name,jutsu_type,jutsu_description
0,10 Hit Combo,Taijutsu,Lars punches the opponent before striking them...
1,Air Raid Shot,Ninjutsu,"Kankurō's puppet, Karasu, soars into the air w..."
2,Air Gold Dust Protective Wall,"Kekkei Genkai, Ninjutsu","Making use of his Gold Dust, the Fourth Kazeka..."
3,Akuta,"Ninjutsu, Kinjutsu",Akuta is an Earth Release technique that's cre...
4,Afterimage Clone,"Ninjutsu, Clone Techniques","Shisui uses the Body Flicker Technique, and mo..."


* The `jutsu_type` column contains many types, and a single jutsu can carry several of them (for example, "Kekkei Genkai, Ninjutsu").
* This notebook classifies each jutsu into one of three classes: Genjutsu, Ninjutsu, or Taijutsu.
* The function below maps each `jutsu_type` value to one of these three classes.

### Categorize the jutsu type as Genjutsu, Ninjutsu, or Taijutsu

In [3]:
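def simplify_jutsu(jutsu):
    # Checks run in this order, so a type containing both "Genjutsu" and
    # "Ninjutsu" maps to "Genjutsu".
    if "Genjutsu" in jutsu:
        return "Genjutsu"
    if "Ninjutsu" in jutsu:
        return "Ninjutsu"
    if "Taijutsu" in jutsu:
        return "Taijutsu"
    # No match: return None explicitly; these rows are removed later by dropna().
    return None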

In [4]:
df['jutsu_type_simplified'] = df['jutsu_type'].apply(simplify_jutsu)
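
To make the mapping's behavior concrete, the sketch below re-declares the same function and applies it to a few type strings. Some of the inputs are hypothetical and chosen only for illustration: they show that the checks are ordered (Genjutsu wins over Ninjutsu) and that a type outside the three classes yields `None`.

```python
def simplify_jutsu(jutsu):
    # Same logic as the notebook's function: the first matching check wins.
    if "Genjutsu" in jutsu:
        return "Genjutsu"
    if "Ninjutsu" in jutsu:
        return "Ninjutsu"
    if "Taijutsu" in jutsu:
        return "Taijutsu"
    return None


# Illustrative inputs; "Ninjutsu, Genjutsu" and "Kinjutsu" are hypothetical values.
examples = ["Kekkei Genkai, Ninjutsu", "Ninjutsu, Genjutsu", "Taijutsu", "Kinjutsu"]
results = {type_string: simplify_jutsu(type_string) for type_string in examples}

for type_string, simplified in results.items():
    print(f"{type_string!r} -> {simplified}")
```

Because the match is a case-sensitive substring test, "Kinjutsu" does not match "Ninjutsu" and falls through to `None`.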

In [5]:
df.head()

Unnamed: 0,jutsu_name,jutsu_type,jutsu_description,jutsu_type_simplified
0,10 Hit Combo,Taijutsu,Lars punches the opponent before striking them...,Taijutsu
1,Air Raid Shot,Ninjutsu,"Kankurō's puppet, Karasu, soars into the air w...",Ninjutsu
2,Air Gold Dust Protective Wall,"Kekkei Genkai, Ninjutsu","Making use of his Gold Dust, the Fourth Kazeka...",Ninjutsu
3,Akuta,"Ninjutsu, Kinjutsu",Akuta is an Earth Release technique that's cre...,Ninjutsu
4,Afterimage Clone,"Ninjutsu, Clone Techniques","Shisui uses the Body Flicker Technique, and mo...",Ninjutsu


In [6]:
df['jutsu_type_simplified'].value_counts()

jutsu_type_simplified
Ninjutsu    2269
Taijutsu     398
Genjutsu     101
Name: count, dtype: int64

* The class distribution is highly imbalanced.
* Ninjutsu dominates the other classes (2269 of 2768 samples, about 82%). This can lead to:
    * **Bias towards the Ninjutsu class**, causing the model to predict Ninjutsu more often.
    * **Poor recall for Taijutsu and Genjutsu**, meaning the model may struggle to identify these minority classes.
    * **Misleading accuracy**, as high accuracy might simply reflect the model predicting the dominant class.
    * **Overfitting to Ninjutsu**, reducing the model’s ability to generalize across all classes.
    * **Difficulty in learning minority class patterns**, since the model receives fewer examples of Taijutsu and Genjutsu.
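
One common way to counter this kind of imbalance is to weight each class inversely to its frequency. The sketch below is not part of the original notebook: it takes the counts shown above and computes the majority-class baseline accuracy together with "balanced" weights of the form `n_samples / (n_classes * class_count)`. How such weights would be used in training (for example, in a weighted loss) is left open here.

```python
# Class counts copied from the value_counts() output above.
counts = {"Ninjutsu": 2269, "Taijutsu": 398, "Genjutsu": 101}

n_samples = sum(counts.values())
n_classes = len(counts)

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = max(counts.values()) / n_samples

# Inverse-frequency ("balanced") weights: rarer classes get larger weights.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
for label, weight in class_weights.items():
    print(f"{label}: weight = {weight:.4f}")
```

The baseline makes the "misleading accuracy" point tangible: a model has to beat roughly 82% accuracy before it shows that it has learned anything beyond the majority class.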

## Feature Engineering
* Combine `jutsu_name` and `jutsu_description` into a single `text` column.
* Copy `jutsu_type_simplified` into a `jutsus` column, which serves as the label.

In [None]:
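# Combine name and description into one text field
df['text'] = df['jutsu_name'] + ". " + df['jutsu_description']
# Use the simplified type as the target label
df['jutsus'] = df['jutsu_type_simplified']
df = df[['text', 'jutsus']]
# Drop rows with missing text or label (e.g. types outside the three classes)
df = df.dropna()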

In [8]:
df.head()

Unnamed: 0,text,jutsus
0,10 Hit Combo. Lars punches the opponent before...,Taijutsu
1,"Air Raid Shot. Kankurō's puppet, Karasu, soars...",Ninjutsu
2,Air Gold Dust Protective Wall. Making use of h...,Ninjutsu
3,Akuta. Akuta is an Earth Release technique tha...,Ninjutsu
4,Afterimage Clone. Shisui uses the Body Flicker...,Ninjutsu


### Data Cleaning

In [None]:
class Cleaner:
    """Strip HTML markup from text while keeping paragraph breaks."""

    def put_line_breaks(self, text):
        # Add a newline after each closing </p> so paragraphs stay separated
        return text.replace("</p>", "</p>\n")

    def remove_html_tags(self, text):
        # Parse with the lxml backend (requires the lxml package) and keep only the text
        return BeautifulSoup(text, "lxml").text

    def clean(self, text):
        text = self.put_line_breaks(text)
        text = self.remove_html_tags(text)
        return text.strip()

In [12]:
text_column_name = 'text'
label_column_name = "jutsus"

In [13]:
# Clean Text
cleaner = Cleaner()
df['text_cleaned'] = df[text_column_name].apply(cleaner.clean)

In [14]:
df.head(2)

Unnamed: 0,text,jutsus,text_cleaned
0,10 Hit Combo. Lars punches the opponent before...,Taijutsu,10 Hit Combo. Lars punches the opponent before...
1,"Air Raid Shot. Kankurō's puppet, Karasu, soars...",Ninjutsu,"Air Raid Shot. Kankurō's puppet, Karasu, soars..."


### Label Encoding

In [15]:
# Encode Labels 
le = preprocessing.LabelEncoder()
le.fit(df[label_column_name].tolist())

In [16]:
# Map each integer id to its class name via the public classes_ attribute
label_dict = {index: label_name
              for index, label_name in enumerate(le.classes_.tolist())}
label_dict

{0: 'Genjutsu', 1: 'Ninjutsu', 2: 'Taijutsu'}
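
`LabelEncoder` assigns integer ids in sorted order of the class names, which is why Genjutsu gets 0. The sketch below reproduces that mapping in plain Python on a toy list and also builds the inverse lookup. The names `id2label` and `label2id` and the toy `labels` list are illustrative and do not appear in the notebook.

```python
# Toy label list for illustration only (not the notebook's data).
labels = ["Taijutsu", "Ninjutsu", "Ninjutsu", "Genjutsu"]

# Sorted unique class names, mirroring the order LabelEncoder uses.
classes = sorted(set(labels))

id2label = dict(enumerate(classes))
label2id = {label: index for index, label in id2label.items()}

# Encode the labels as integers, then decode them back.
encoded = [label2id[label] for label in labels]
decoded = [id2label[index] for index in encoded]

print(id2label)
print(encoded)
```

Keeping both directions at hand is convenient later, when integer predictions need to be turned back into class names.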

In [17]:
df['label'] = le.transform(df[label_column_name].tolist())

In [18]:
df.head()

Unnamed: 0,text,jutsus,text_cleaned,label
0,10 Hit Combo. Lars punches the opponent before...,Taijutsu,10 Hit Combo. Lars punches the opponent before...,2
1,"Air Raid Shot. Kankurō's puppet, Karasu, soars...",Ninjutsu,"Air Raid Shot. Kankurō's puppet, Karasu, soars...",1
2,Air Gold Dust Protective Wall. Making use of h...,Ninjutsu,Air Gold Dust Protective Wall. Making use of h...,1
3,Akuta. Akuta is an Earth Release technique tha...,Ninjutsu,Akuta. Akuta is an Earth Release technique tha...,1
4,Afterimage Clone. Shisui uses the Body Flicker...,Ninjutsu,Afterimage Clone. Shisui uses the Body Flicker...,1


## Train-Test Split

In [19]:
test_size = 0.2
# Stratify on the label so both splits keep the same class proportions
df_train, df_test = train_test_split(df,
                                     test_size=test_size,
                                     stratify=df['label'])

In [20]:
df_train['jutsus'].value_counts()

jutsus
Ninjutsu    1815
Taijutsu     318
Genjutsu      81
Name: count, dtype: int64
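
As a quick sanity check on the stratified split, the sketch below compares each class's share of the full dataset with its share of the training set, using the counts printed above. It is plain arithmetic rather than part of the original notebook.

```python
# Counts copied from the two value_counts() outputs above.
full_counts = {"Ninjutsu": 2269, "Taijutsu": 398, "Genjutsu": 101}
train_counts = {"Ninjutsu": 1815, "Taijutsu": 318, "Genjutsu": 81}


def shares(counts):
    """Return each class's fraction of the total."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


full_shares = shares(full_counts)
train_shares = shares(train_counts)

for label in full_counts:
    print(f"{label}: full = {full_shares[label]:.4f}, train = {train_shares[label]:.4f}")
```

The shares agree closely, which is the effect of passing `stratify` to `train_test_split`.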

## Download Transformers Model and Tokenizer
* The model used here is [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased), available on Hugging Face.


In [21]:
model_name = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)




### Tokenize the text with the model's tokenizer

In [22]:
def preprocess_function(tokenizer, examples):
    # Truncate inputs that exceed the model's maximum sequence length
    return tokenizer(examples['text_cleaned'], truncation=True)

In [None]:
# Convert the pandas DataFrames to Hugging Face datasets
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

# Tokenize both datasets in batches
tokenized_train = train_dataset.map(lambda examples: preprocess_function(tokenizer, examples),
                                    batched=True)
tokenized_test = test_dataset.map(lambda examples: preprocess_function(tokenizer, examples),
                                  batched=True)

Map: 100%|██████████| 2214/2214 [00:00<00:00, 2806.09 examples/s]
Map: 100%|██████████| 554/554 [00:00<00:00, 2715.71 examples/s]
