# Japanese Similarity Analysis
Author: Harrison Loh

## Abstract
asdf

## Introduction
Immersion learning is an approach to foreign language learning (also called acquisition) that uses native content in the target language as the primary study material.
For Japanese, one source of content for immersion learning is anime.
Various methods for using anime to learn Japanese have been presented across internet sites and platforms, one example being AJATT (All Japanese All The Time) [[1](https://tatsumoto-ren.github.io/blog/whats-ajatt.html)] and its various adaptations and modifications.

One source of tips, strategies, and tools for applying an AJATT-style approach to learning Japanese with anime is a YouTube channel called Matt vs Japan [[2](https://www.youtube.com/@mattvsjapan)].
One idea presented by Matt vs Japan, as well as in the Refold language learning guide, is that of language "domains": genres of content that share a specific subset of commonly used language (e.g. fantasy vs. crime drama vs. slice-of-life) [[3](https://refold.la/simplified/stage-2/b/immersion-guide)].
By focusing on a single domain, words unique to that domain are encountered more frequently, which increases the chance of acquiring them for long-term retention.
The acquisition of words has been described as highly important for learning a language, for example by Steve Kaufmann (one of the founders of LingQ) [[4](https://www.youtube.com/@Thelinguist)][[5](https://www.lingq.com/en/)].
Therefore, focusing on a single domain when immersing is an attractive strategy for quickly acquiring foreign language vocabulary.

One way to determine the domain of a show or piece of content is by the genre of the media (e.g. slice-of-life).
While this seems a sensible way to sort media into language domains, the question remains (at least to me) whether shows within a single genre are quantitatively more similar in language than shows with different genre tags.

The aim of this repo is to provide an analysis of the language content from different anime shows to quantify the degree of similarity in the language used.
The objectives are as follows:
- Develop criteria for comparing the similarity of the language between any two shows.
- Identify and differentiate between "core language" and "domain language".
- Compare the degree of language similarity among shows within a single genre against that among shows across genres.

References:
- [1] https://tatsumoto-ren.github.io/blog/whats-ajatt.html
- [2] https://www.youtube.com/@mattvsjapan
- [3] https://refold.la/simplified/stage-2/b/immersion-guide
- [4] https://www.youtube.com/@Thelinguist
- [5] https://www.lingq.com/en/

## Methods
### Dataset description
The subtitles for 89 shows were obtained as the dataset for analysis.
Subtitle files were downloaded from Kitsunekko.net under the Japanese subtitles page (https://kitsunekko.net/).
Only SRT subtitle files were used for the analysis.
Shows whose subtitles were in the ASS format were converted to SRT with the subtitle tool Aegisub (https://aegisub.org/), by exporting as SRT after choosing the "clean tags" option in the export window.
The genres for the chosen shows were taken from the information present in their respective listings on MyAnimeList (https://myanimelist.net).
A complete list of all the shows used with their genres and additional information is included in the "show_genres.xlsx" spreadsheet.
The distribution of genres for the shows is as follows. A show can carry several genre tags, so the counts sum to more than the 89 shows:

- Action: 33
- Drama: 26
- Fantasy: 21
- Sci-Fi: 18
- Mystery: 17
- Romance: 15
- Adventure: 12
- Comedy: 12
- Sports: 11
- Supernatural: 10
- Slice of Life: 10
- Suspense: 7
- Ecchi: 2
- Avant Garde: 1

Because the Ecchi and Avant Garde genres appear only twice and once, respectively, these two categories are excluded from the analysis.
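
The counting and exclusion step can be sketched as follows. This is only an illustration: the show names, genre tags, and the `min_shows` threshold are hypothetical placeholders and not the contents of "show_genres.xlsx", and the functions `count_genres` and `drop_rare_genres` are names made up for this sketch.

```python
from collections import Counter

# Hypothetical genre tags per show (the real data lives in show_genres.xlsx)
show_genres = {
    "show-a": ["Action", "Fantasy"],
    "show-b": ["Action", "Drama"],
    "show-c": ["Drama", "Ecchi"],
    "show-d": ["Action", "Avant Garde"],
}

def count_genres(genre_map):
    """Count how many shows carry each genre tag."""
    return Counter(genre for genres in genre_map.values() for genre in genres)

def drop_rare_genres(genre_map, min_shows):
    """Remove genre tags that appear on fewer than min_shows shows."""
    counts = count_genres(genre_map)
    return {
        show: [g for g in genres if counts[g] >= min_shows]
        for show, genres in genre_map.items()
    }

genre_counts = count_genres(show_genres)
filtered = drop_rare_genres(show_genres, min_shows=2)  # assumed threshold
print(genre_counts.most_common())
print(filtered)
```

Dropping a tag does not drop the show itself: a show tagged both Drama and Ecchi still contributes to the Drama group.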

### Lemma Extraction
To quantify the similarity between two or more selections of Japanese text, the first step is to break the text into its component lemmas.
In linguistics, lemmas are the "dictionary form" of a word, and can be thought of as the 'base' form.
For example, in English the words _break_, _broke_, _broken_, and _breaking_ all share the same lemma: **break** (See [Wiki](https://en.wikipedia.org/wiki/Lemma_(morphology))).
For a similarity analysis between two bodies of text, I am more interested in whether unique words are shared between shows, not whether the same forms of a word are shared.
In other words, whether the base word 'to go' (行く) is shared, and not whether specific conjugations (such as 行きます, 行きません) are shared.
Therefore, the lemmas present in a block of Japanese text are chosen as the components for further comparison.

To give an example, consider the following 4 sentences, each changing one more word relative to the first, original sentence:
- original: "私の友達は親切な人です"
- one change: "彼の友達は親切な人です"
- two changes: "彼の彼女は親切な人です"
- three changes: "彼の彼女は内気な人です"

With each sentence, the content becomes more distinct from the original sentence.

Using the fugashi package with the Tagger class, we can extract the lemmas present in each of the above sentences.

In [1]:
"""
Lemma extraction from text using fugashi
"""
from fugashi import Tagger

def lemma_extract(text):
    """
    Tokenize text and return the list of word nodes and the list of their lemmas
    """
    words = tagger(text)  # uses the module-level tagger defined below

    # word.feature.lemma is the dictionary form supplied by the tagger's dictionary
    lemma_list = [word.feature.lemma for word in words]

    return words, lemma_list


tagger = Tagger('-Owakati')

orig_sent = "私の友達は親切な人です"  # base sentence for comparison
sent_1diff = "彼の友達は親切な人です"  # one word difference
sent_2diff = "彼の彼女は親切な人です"  # two words different
sent_3diff = "彼の彼女は内気な人です"  # three words different

text = [orig_sent, sent_1diff, sent_2diff, sent_3diff]

word_list = []
lemma_list = []
for sentence in text:
    words, lemmas = lemma_extract(sentence)

    word_list.append(words)
    lemma_list.append(lemmas)


print(f"Original Sentence: {text[0]}")
print(f"Original lemmas: {lemma_list[0]}\n")
print(f"1 diff Sentence: {text[1]}")
print(f"1 diff lemmas: {lemma_list[1]}\n")
print(f"2 diff Sentence: {text[2]}")
print(f"2 diff lemmas: {lemma_list[2]}\n")
print(f"3 diff Sentence: {text[3]}")
print(f"3 diff lemmas: {lemma_list[3]}\n")

Original Sentence: 私の友達は親切な人です
Original lemmas: ['私-代名詞', 'の', '友達', 'は', '親切', 'だ', '人', 'です']

1 diff Sentence: 彼の友達は親切な人です
1 diff lemmas: ['彼', 'の', '友達', 'は', '親切', 'だ', '人', 'です']

2 diff Sentence: 彼の彼女は親切な人です
2 diff lemmas: ['彼', 'の', '彼女', 'は', '親切', 'だ', '人', 'です']

3 diff Sentence: 彼の彼女は内気な人です
3 diff lemmas: ['彼', 'の', '彼女', 'は', '内気', 'だ', '人', 'です']



### Calculating the Similarity of Lemma sets

For English text, similarity scores between documents can be computed fairly easily using measures such as Term Frequency-Inverse Document Frequency (TF-IDF) and libraries such as scikit-learn (sklearn).
One challenge I faced while setting these tools up, however, was adapting the workflow from English to Japanese.
A few sites describe adapting sklearn to Asian languages, specifically by using the TfidfVectorizer class with a custom tokenizer (such as [here](https://investigate.ai/text-analysis/how-to-make-scikit-learn-natural-language-processing-work-with-japanese-chinese/)), which is what I originally wanted to do.
However, I wasn't able to work out how to apply this with the fugashi package, which I was more comfortable using, so I decided to try a different approach.

In the previous code block, lists of the lemmas present in each of the example sentences were generated.
If the frequency of occurrence of each lemma in the sentence is ignored, then each list can be converted into a set, giving the collection of unique lemmas present in a given text.
From here, methods that quantify the similarity between two sets can be applied to measure how similar the lemma collections of two sentences are.
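
As a small illustration of that conversion, the sketch below uses a hypothetical lemma list (typed by hand, not fugashi output) in which two lemmas repeat. Turning it into a set keeps one copy of each lemma and discards how often it occurred.

```python
# Hypothetical lemma list for a short text in which some lemmas repeat
lemmas = ["彼", "の", "友達", "は", "彼", "の", "彼女", "だ"]

unique_lemmas = set(lemmas)  # duplicates collapse; frequency is discarded

print(len(lemmas))         # number of tokens in the text
print(len(unique_lemmas))  # number of distinct lemmas
```

The trade-off is that a lemma used once and a lemma used a hundred times count the same, which is the information a frequency-based measure such as TF-IDF would have retained.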

The value I am using to evaluate the similarity of sets is the [Jaccard Similarity Coefficient](https://en.wikipedia.org/wiki/Jaccard_index), which is defined as the size of the intersection of two sets divided by the size of their union.

$$
J(A, B) = \frac{\left| A \cap B \right|}{\left| A \cup B \right|}
$$

where $A$ and $B$ are two sets for comparison.
The calculation is symmetric, $J(A, B) = J(B, A)$, so the order of the sets does not matter.

The Jaccard Similarity is easy to compute in Python, as shown [here](https://www.annasguidetopython.com/python3/data%20structures/lists-finding-the-jaccard-similarity-between-two-sets-in-a-list/) and below.

In [2]:
"""
Quantifying the similarity of sentences using Jaccard Similarity on sets of the lemmas present
"""
set1 = set(lemma_list[0])
set2 = set(lemma_list[1])
set3 = set(lemma_list[2])
set4 = set(lemma_list[3])

def jaccard_similar(set1, set2):
    return len(set1.intersection(set2)) / len(set1.union(set2))

print(f"1 against 1: {jaccard_similar(set1, set1)}")
print(f"1 against 2: {jaccard_similar(set1, set2)}")
print(f"1 against 3: {jaccard_similar(set1, set3)}")
print(f"1 against 4: {jaccard_similar(set1, set4)}")

1 against 1: 1.0
1 against 2: 0.7777777777777778
1 against 3: 0.6
1 against 4: 0.45454545454545453


As expected, the similarity value for a lemma set compared against itself is 1, meaning the sets are identical.
As each sentence differs more from the original, the Jaccard coefficient decreases, reaching ~0.45 for the sentence with three changed words.
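
These values can be checked by hand from the printed lemma lists. Each sentence yields 8 distinct lemmas, and every substituted word removes one lemma from the intersection while adding one to the union, so for $k$ changed words the coefficient is $(8 - k)/(8 + k)$:

$$
J_{1,2} = \frac{7}{9} \approx 0.78, \qquad
J_{1,3} = \frac{6}{10} = 0.60, \qquad
J_{1,4} = \frac{5}{11} \approx 0.45
$$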

To recap, the language used in two anime shows is compared with the following steps:
- extract the set of lemmas present in the subtitle files of each show
- calculate the Jaccard Similarity Coefficient between the two sets
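
The two steps can be sketched end to end as below. To keep the sketch runnable without a tagger, the lemma lists are copied from the example output earlier in this section rather than extracted with fugashi. The function names are hypothetical, and returning 0.0 for two empty sets is an assumption made here to avoid a division by zero, not something the analysis specifies.

```python
def jaccard_similarity(set_a, set_b):
    """Jaccard coefficient of two sets; defined here as 0.0 when both are empty."""
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: treat two empty sets as having no similarity
    return len(set_a & set_b) / len(union)

def compare_shows(lemmas_a, lemmas_b):
    """Reduce two lemma lists to sets and return their Jaccard coefficient."""
    return jaccard_similarity(set(lemmas_a), set(lemmas_b))

# Lemma lists copied from the example output (original vs. three changes)
show_a = ['私-代名詞', 'の', '友達', 'は', '親切', 'だ', '人', 'です']
show_b = ['彼', 'の', '彼女', 'は', '内気', 'だ', '人', 'です']
print(compare_shows(show_a, show_b))
```

The empty-set guard matters for real data only if a subtitle file fails to parse and produces no lemmas at all.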

## Analysis
In this section, the subtitle files for each show in the 'data' folder are parsed to create sets of the unique lemmas present, and the Jaccard Similarity between each pair of shows is calculated.
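
Parsing requires separating the dialogue text from the SRT cue numbers and timing lines. The helper used in this repo is not reproduced in this write-up, so the following is only a sketch of one way this could be done; `srt_dialogue` and the `TIMING` pattern are hypothetical names, and the sample cues are made up.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000"
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def srt_dialogue(srt_text):
    """Return only the dialogue lines of an SRT document."""
    dialogue = []
    # Cues are separated by one or more blank lines
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # A cue is: index line, timing line, then one or more text lines
        if len(lines) >= 3 and TIMING.match(lines[1]):
            dialogue.extend(line.strip() for line in lines[2:])
    return dialogue

sample = """1
00:00:01,000 --> 00:00:03,000
私の友達は親切な人です

2
00:00:03,500 --> 00:00:05,000
彼の彼女は内気な人です
"""
print(srt_dialogue(sample))
```

The returned lines could then be passed to a tagger to collect lemmas, one show at a time.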

### Creating database of lemmas from shows

In [3]:
"""
Creating a database of lemmas by reading in and parsing the subtitles files in the 'data' folder
"""
from fugashi import Tagger
from subtitleparsing import create_lemma_database

data_folder = 'data'  # folder with subtitles
tagger = Tagger('-Owakati')

lemma_database = create_lemma_database(data_folder, tagger)

-- Beginning parse of shows in subtitle folder --

Show currently parsing: 07-ghost
Show currently parsing: 3-gatsu-no-lion
Show currently parsing: 7seeds
Show currently parsing: 91-days
Show currently parsing: acca-13
Show currently parsing: aico-incarnation
Show currently parsing: akebi-chan-no-sailor-fuku
Show currently parsing: amagi-brilliant-park
Show currently parsing: appare-ranman
Show currently parsing: assassination-classroom
Show currently parsing: baby-steps
Show currently parsing: ballroom-e-youkoso
Show currently parsing: banana-fish
Show currently parsing: barakamon
Show currently parsing: blue-lock
Show currently parsing: blue-period
Show currently parsing: bocchi-the-rock
Show currently parsing: boku-no-hero-academia
Show currently parsing: bungo-stray-dogs
Show currently parsing: burn-the-witch
Show currently parsing: chainsaw-man
Show currently parsing: charlotte
Show currently parsing: chihayafuru
Show currently parsing: cider-no-you-ni-kotoba-ga-wakiagaru
Show cur

In [4]:
# Example of the contents of the lemma_database dictionary
print(lemma_database['charlotte'])

{'シチュー-stew', '只', '処分', '無い', '盛り上がる', 'えー', '男', '可能', '近付く', '追い追い', '彼', '信ずる', '好み', '一塁', '年', '熟す', '襲う', '終わり', 'ピッチャー-pitcher', '知れる', '病み上がり', '布団', '体調', '日本', 'やんちゃ', '小さい', '次', '倒す', '未だ', '馬鹿馬鹿しい', '強いる', '体', '禁止', '見付かる', '忘れ物', '両方', '仲間', '無量', '巻き返す', '絶景', '親元', '女の子', 'パスタ-pasta', '匙', '糞', '委員', '其の', '直接', 'じゃん', '此処', 'のみ', '宜しく', '一寸', '放課', '暴力', '追う', '雨霰', '普通', '何方', 'ワン-one', '彗星', '頭脳', '自身', '静か', '着る', '会', '下される', '寝込む', '勝つ', '御早う', '嘗て', '五', '局', '見舞い', '代走', '合宿', 'どんな', '性格', '他人', '都合', '失敗', '連行', '初めて', '見え透く', '選手', 'コニシ', '台', 'オーライ-all right', '如何に', '業者', '変わり', '幾', '聞く', '勿論', '読者', '安心', '単刀', '騙す', '安全', '小学', '仕方無い', '派', '一員', 'けれど', '地', 'あからさま', '付ける', 'ばき', '我慢', '急行', '的', '利用', 'フライング-flying', '平凡', 'ない', '番', '悪', '料理', '多重', '今日は', '選び取る', '新', '空', '所', '首尾', 'ジャミング-jamming', '死人', '影', '見失う', '無し', '品', '仕方', '未', '訳', 'でかい', '揺らす', 'マンション-mansion', '好', '連れ出す', '親友', '払う', '運動', '唯一', '観念', '得る', '空く', '折る', '収穫', '驕る', '演ず

### Calculate a similarity matrix between each show

In [26]:
from itertools import product  # for helping iterate through the shows
import numpy as np

# Get number of shows
num_shows = len(lemma_database)

similarity_matrix = np.zeros((num_shows, num_shows))  # zero matrix for over-writing with values

# Calculate Jaccard similarity, also create index of show names for future reference
def jaccard_similar(set1, set2):
    return len(set1.intersection(set2)) / len(set1.union(set2))


show_list = list(lemma_database)  # show names, in dictionary order


for i,j in product(range(num_shows), range(num_shows)):
    jaccard_value = jaccard_similar(lemma_database[show_list[i]], lemma_database[show_list[j]])
    similarity_matrix[i, j] = jaccard_value
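
Because the Jaccard coefficient does not depend on the order of its arguments, the matrix is symmetric and each pair of shows only needs to be computed once. The sketch below illustrates this with `itertools.combinations` on a hypothetical three-show stand-in for `lemma_database`. Plain nested lists are used so the sketch has no dependencies, but the same loop could fill a NumPy array.

```python
from itertools import combinations

def jaccard_similar(set1, set2):
    return len(set1 & set2) / len(set1 | set2)

# Hypothetical stand-in for lemma_database with three tiny "shows"
toy_database = {
    "show-a": {"行く", "人", "友達"},
    "show-b": {"行く", "人", "魔法"},
    "show-c": {"行く", "試合", "選手"},
}

show_list = list(toy_database)
n = len(show_list)

# Diagonal is 1.0 because every non-empty set is identical to itself
matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Visit each unordered pair once and mirror the value across the diagonal
for i, j in combinations(range(n), 2):
    value = jaccard_similar(toy_database[show_list[i]], toy_database[show_list[j]])
    matrix[i][j] = matrix[j][i] = value

for name, row in zip(show_list, matrix):
    print(name, [round(v, 2) for v in row])
```

For 89 shows this takes the number of set comparisons from 89 × 89 = 7921 down to 89 × 88 / 2 = 3916.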


## Results and Discussion
asdf

## Conclusion
asdf