# Japanese Similarity Analysis
Author: Harrison Loh

## Abstract
asdf

## Introduction
Immersion learning is a method of foreign language learning (also called acquisition) that emphasizes using native content in the target language as the primary study material.
For Japanese, one source of content for use in immersion learning is anime.
Various methods for using anime to learn Japanese have been presented on internet sites and platforms, one example being AJATT (All Japanese All The Time) [[1](https://tatsumoto-ren.github.io/blog/whats-ajatt.html)] and its various adaptations and modifications.

One source of tips, strategies, and tools for applying an AJATT-style approach to learning Japanese with anime is a YouTube channel called Matt vs Japan [[2](https://www.youtube.com/@mattvsjapan)].
One idea presented by Matt vs Japan, as well as in the Refold language learning guide, is that of language "domains": genres of content that share a specific subset of commonly used language (e.g. fantasy vs. crime drama vs. slice-of-life) [[3](https://refold.la/simplified/stage-2/b/immersion-guide)].
By focusing on a single domain, a learner encounters the words unique to that domain more frequently, which increases the chance of retaining them long term.
The acquisition of words has been described as highly important for learning a language, for example by Steve Kaufmann (one of the founders of LingQ) [[4](https://www.youtube.com/@Thelinguist)][[5](https://www.lingq.com/en/)].
Therefore, focusing on a single domain when immersing is an attractive strategy for quickly acquiring foreign language vocabulary.

One way to determine the domain of a show or other piece of content is by the genre of the media (e.g. slice-of-life).
While this seems to be a sensible way to sort media into language domains, the question remains (at least to me) whether shows within a single genre are measurably more similar in language than shows from different genres.

The aim of this repo is to provide an analysis of the language content from different anime shows to quantify the degree of similarity in the language used.
The objectives are as follows:
- Develop criteria for comparing the similarity of the language present between any two shows.
- Identify and differentiate between "core language" and "domain language".
- Compare the language similarity of shows within a single genre against that of shows across genres.

References:
- [1] https://tatsumoto-ren.github.io/blog/whats-ajatt.html
- [2] https://www.youtube.com/@mattvsjapan
- [3] https://refold.la/simplified/stage-2/b/immersion-guide
- [4] https://www.youtube.com/@Thelinguist
- [5] https://www.lingq.com/en/

## Methods
### Dataset description
The subtitles for 89 shows were obtained as the dataset for analysis.
Subtitle files were downloaded from Kitsunekko.net under the Japanese subtitles page (https://kitsunekko.net/).
Only SRT subtitle files were used for the analysis.
For shows with subtitles in the ASS format, the files were converted to SRT using the subtitle tool Aegisub (https://aegisub.org/).
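
How the dialogue text is read out of the SRT files is not described here. As a rough sketch, assuming the usual SRT layout (a numeric index line, a `-->` timestamp line, then one or more text lines, with cues separated by blank lines), the spoken lines could be pulled out with the standard library alone. The function name `srt_to_text` and the sample cues are made up for illustration.

```python
import re

# Matches SRT timestamp lines such as "00:00:01,000 --> 00:00:03,500".
TIMESTAMP = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Matches simple inline markup such as <i>...</i> or {\an8} left over from ASS.
MARKUP = re.compile(r"<[^>]+>|\{[^}]*\}")


def srt_to_text(srt_content):
    """Return the dialogue lines of an SRT document as a list of strings."""
    lines = []
    for raw in srt_content.splitlines():
        # Strip whitespace and a possible byte-order mark on the first line.
        line = raw.strip().lstrip("\ufeff")
        # Skip blank lines, cue indices, and timestamps.
        # Limitation: a dialogue line made up only of digits is also skipped.
        if not line or line.isdigit() or TIMESTAMP.search(line):
            continue
        line = MARKUP.sub("", line).strip()
        if line:
            lines.append(line)
    return lines


# Hypothetical two-cue sample reusing sentences from this README.
sample = """1
00:00:01,000 --> 00:00:03,500
私の友達は親切な人です

2
00:00:04,000 --> 00:00:06,000
<i>彼の彼女は内気な人です</i>
"""

print(srt_to_text(sample))
```

The returned list of lines can then be joined and passed on to the lemma extraction step.
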
The genres for the chosen shows were taken from the information present in their respective listings on MyAnimeList (https://myanimelist.net).
A complete list of all the shows used with their genres and additional information is included in the "show_genres.xlsx" spreadsheet.
The distribution of genres for the shows is as follows (the counts sum to more than 89, so shows carry more than one genre on average):

- Action: 33
- Drama: 26
- Fantasy: 21
- Sci-Fi: 18
- Mystery: 17
- Romance: 15
- Adventure: 12
- Comedy: 12
- Sports: 11
- Supernatural: 10
- Slice of Life: 10
- Suspense: 7
- Ecchi: 2
- Avant Garde: 1

Given that the Ecchi and Avant Garde genres appear only twice and once respectively, these two categories are excluded from the analysis.
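
The exclusion can be expressed as a simple count threshold. The sketch below hard-codes the counts from the list above. The cutoff `MIN_SHOWS = 5` is an assumption chosen only so that it separates the two rare genres from the rest; the text does not state an explicit cutoff.

```python
# Genre counts as listed above.
genre_counts = {
    "Action": 33,
    "Drama": 26,
    "Fantasy": 21,
    "Sci-Fi": 18,
    "Mystery": 17,
    "Romance": 15,
    "Adventure": 12,
    "Comedy": 12,
    "Sports": 11,
    "Supernatural": 10,
    "Slice of Life": 10,
    "Suspense": 7,
    "Ecchi": 2,
    "Avant Garde": 1,
}

MIN_SHOWS = 5  # assumed cutoff, not taken from the original analysis

# Keep only genres with enough shows to compare within the genre.
kept = {genre: n for genre, n in genre_counts.items() if n >= MIN_SHOWS}
excluded = sorted(set(genre_counts) - set(kept))

print(f"Kept {len(kept)} genres; excluded: {excluded}")
```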

### Lemma Extraction
To quantify the similarity between two or more selections of Japanese text, the first step is to break the text into its component lemmas.
In linguistics, a lemma is the "dictionary form" of a word and can be thought of as its base form.
For example, in English the words _break_, _broke_, _broken_, and _breaking_ all share the same lemma: **break** (See [Wiki](https://en.wikipedia.org/wiki/Lemma_(morphology))).
For a similarity analysis between two bodies of text, I am more interested in whether unique words are shared between shows, not whether the same forms of a word are shared.
In other words, whether the base word 'to go' (行く) is shared, and not whether specific conjugations (such as 行きます, 行きません) are shared.
Therefore, the lemmas present in a block of Japanese text are chosen as the components for further comparison.

To give an example, consider the following four sentences, each of which changes one more word relative to the original first sentence:
- original: "私の友達は親切な人です"
- one change: "彼の友達は親切な人です"
- two changes: "彼の彼女は親切な人です"
- three changes: "彼の彼女は内気な人です"

With each sentence, the content becomes more distinct from the original sentence.

Using the `Tagger` class from the fugashi package, we can extract the lemmas present in each of the above sentences.

```python
from fugashi import Tagger

# MeCab tagger; '-Owakati' selects the space-separated (wakati) output format
tagger = Tagger('-Owakati')


def lemma_extract(text):
    """Return the list of parsed words and the list of their lemmas."""
    words = tagger(text)
    lemma_list = [word.feature.lemma for word in words]
    return words, lemma_list


orig_sent = "私の友達は親切な人です"  # base sentence for comparison
sent_1diff = "彼の友達は親切な人です"  # one word different
sent_2diff = "彼の彼女は親切な人です"  # two words different
sent_3diff = "彼の彼女は内気な人です"  # three words different

text = [orig_sent, sent_1diff, sent_2diff, sent_3diff]

for sentence in text:
    words, lemma_list = lemma_extract(sentence)

    print(f"Sentence: {sentence}")
    print(f"Words: {words}")
    print(f"Lemmas: {lemma_list}\n")
```


```text
Sentence: 私の友達は親切な人です
Words: [私, の, 友達, は, 親切, な, 人, です]
Lemmas: ['私-代名詞', 'の', '友達', 'は', '親切', 'だ', '人', 'です']

Sentence: 彼の友達は親切な人です
Words: [彼, の, 友達, は, 親切, な, 人, です]
Lemmas: ['彼', 'の', '友達', 'は', '親切', 'だ', '人', 'です']

Sentence: 彼の彼女は親切な人です
Words: [彼, の, 彼女, は, 親切, な, 人, です]
Lemmas: ['彼', 'の', '彼女', 'は', '親切', 'だ', '人', 'です']

Sentence: 彼の彼女は内気な人です
Words: [彼, の, 彼女, は, 内気, な, 人, です]
Lemmas: ['彼', 'の', '彼女', 'は', '内気', 'だ', '人', 'です']
```



### Calculate Similarity of Lemma sets

To compare the example sentences, the lemma list of each one is collected into `lemma_total`.

```python
from fugashi import Tagger

# MeCab tagger; '-Owakati' selects the space-separated (wakati) output format
tagger = Tagger('-Owakati')


def lemma_extract(text):
    """Return the list of parsed words and the list of their lemmas."""
    words = tagger(text)
    lemma_list = [word.feature.lemma for word in words]
    return words, lemma_list


orig_sent = "私の友達は親切な人です"  # base sentence for comparison
sent_1diff = "彼の友達は親切な人です"  # one word different
sent_2diff = "彼の彼女は親切な人です"  # two words different
sent_3diff = "彼の彼女は内気な人です"  # three words different

text = [orig_sent, sent_1diff, sent_2diff, sent_3diff]

lemma_total = []  # one lemma list per sentence, in the same order as `text`
for sentence in text:
    words, lemma_list = lemma_extract(sentence)

    print(f"Sentence: {sentence}")
    print(f"Words: {words}")
    print(f"Lemmas: {lemma_list}\n")

    lemma_total.append(lemma_list)
```







Note: a first attempt at vectorizing the lemmas with scikit-learn's `TfidfVectorizer` failed with `InvalidParameterError`, because the `JapaneseTokenizer` module itself was passed as the `tokenizer` parameter, which must be a callable or `None`.
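
The similarity measure itself is not settled in this section. One simple candidate for comparing lemma sets is the Jaccard index, the size of the intersection divided by the size of the union. The sketch below hard-codes the lemma lists printed above so that it runs without fugashi. The choice of Jaccard and the helper name `jaccard` are assumptions for illustration, not necessarily the metric used in the final analysis.

```python
def jaccard(lemmas_a, lemmas_b):
    """Jaccard index of two lemma lists, treated as sets (1.0 = identical)."""
    set_a, set_b = set(lemmas_a), set(lemmas_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(union)


# Lemma lists copied from the fugashi output shown earlier.
lemma_total = [
    ['私-代名詞', 'の', '友達', 'は', '親切', 'だ', '人', 'です'],  # original
    ['彼', 'の', '友達', 'は', '親切', 'だ', '人', 'です'],        # one change
    ['彼', 'の', '彼女', 'は', '親切', 'だ', '人', 'です'],        # two changes
    ['彼', 'の', '彼女', 'は', '内気', 'だ', '人', 'です'],        # three changes
]
labels = ["original", "one change", "two changes", "three changes"]

# Compare every sentence against the original (first) sentence.
scores = [jaccard(lemma_total[0], lemmas) for lemmas in lemma_total]

for label, score in zip(labels, scores):
    print(f"{label}: {score:.3f}")
# original: 1.000
# one change: 0.778
# two changes: 0.600
# three changes: 0.455
```

The score drops with each changed word, as expected. Even with three words changed, the shared particles and copulas (の, は, だ, です) keep the score well above zero, which is one reason to separate "core language" from "domain language" before comparing shows.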

## Results and Discussion
asdf

## Conclusion
asdf