# 1. Data Exploration

### Information on the Dataset

Data Fields for SNOW T15 and SNOW T23 ⛄<br>
Resource: https://huggingface.co/datasets/snow_simplified_japanese_corpus <br>
Paper: https://aclanthology.org/L18-1072.pdf

- <strong>ID</strong>: sentence ID.
- <strong>original_ja</strong>: original Japanese sentence.
- <strong>simplified_ja</strong>: simplified Japanese sentence.
- <strong>original_en</strong>: original English sentence.
- <strong>proper_noun</strong>: (included ONLY in SNOW T23) Proper nouns extracted by the workers. The authors instructed workers not to rewrite proper nouns, leaving the determination of proper nouns to the workers.

# 2. Baseline Model

The SNOW T15 dataset description states: <br>
<i>Core vocabulary is restricted to 2,000 words where it is selected by accounting for several factors such as meaning preservation, variation, simplicity and the UniDic word segmentation criterion.</i>

#### Step 1: Take a sample from the SNOW T15 dataset and extract 2,000 simplified terms.

In [1]:
""" Required installations """

!pip install mecab-python3
# These wheels include a copy of the MeCab library, but not a dictionary.
# In order to use MeCab you'll need to install a dictionary. unidic-lite is a good one to start with:
!pip install unidic-lite

# normalization tool
!pip install neologdn

!pip install openpyxl

# To be able to see in Japanese!
!pip install japanize_matplotlib



In [2]:
import os
import pandas as pd

# Preprocessing
import MeCab
import neologdn
import collections
from nltk import FreqDist
from nltk.corpus import stopwords

# Visualization
import matplotlib.pyplot as plt
import japanize_matplotlib

In [3]:
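def get_data(file):
    """
    Reads an Excel file under 'simply-japanese/data/'
    Returns a DataFrame where columns=['original','simplified']
    """

    # FIXME:  Make sure to
    # 1. Change these when you transfer to .py file
    # 2. Put these global variables somewhere else

    # NOTE: __file__ is not defined inside a notebook kernel, so the path
    # arithmetic below only works once this code runs from a .py file
    # (with CURRENT_PATH updated to match that file's location).
    CURRENT_PATH = 'notebooks/Untitled.ipynb'
    DATA_PATH = 'data/2_RawData'
    csv_path = os.path.abspath(__file__)[:-len(CURRENT_PATH)] + DATA_PATH
    df = pd.read_excel(os.path.join(csv_path, file))

    # Drop the English source ('#英語(原文)') and proper-noun ('#固有名詞') columns if present
    df.drop(columns=['#英語(原文)','#固有名詞'], inplace=True, errors='ignore')
    # '#日本語(原文)' = original Japanese, '#やさしい日本語' = simplified Japanese
    df.rename(columns={"#日本語(原文)": "original", "#やさしい日本語": "simplified"}, inplace=True)

    return df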
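
Because `__file__` is undefined in a notebook kernel, one possible workaround for `get_data` is to walk upward from the current working directory until the `data/2_RawData` folder appears. The sketch below assumes that folder layout; `find_data_dir` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# Folder name taken from DATA_PATH in get_data; the layout is an assumption.
DATA_SUBDIR = Path("data") / "2_RawData"


def find_data_dir(start=None):
    """Walk upward from `start` (default: cwd) until DATA_SUBDIR exists."""
    start = Path(start).resolve() if start is not None else Path.cwd()
    for candidate in [start, *start.parents]:
        data_dir = candidate / DATA_SUBDIR
        if data_dir.is_dir():
            return data_dir
    raise FileNotFoundError(f"No '{DATA_SUBDIR}' folder found above {start}")
```

With such a helper, the read could become `pd.read_excel(find_data_dir() / file)`, which behaves the same whether the code runs from `notebooks/` or from the project root.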

In [22]:
# FIXME: Set df in __init__
def term_frequency(df, col='original'):
    """
    Count how often each term occurs in one column of the corpus
    Ignore function words and symbols ["助動詞", "助詞", "補助記号"] and words in japanese stopwords
    Returns a collections.Counter mapping term -> frequency
    """
    # FIXME : Need to find a way to implement japanese_stopword.txt when this file is used externally
    # A set gives constant-time membership checks inside the token loop
    jp_stopwords = set(stopwords.words('japanese'))
    # 助動詞 = auxiliary verb, 助詞 = particle, 補助記号 = supplementary symbol (punctuation)
    skipped_pos = {"助動詞", "助詞", "補助記号"}
    all_terms = collections.Counter()
    t = MeCab.Tagger("-O wakati")
    for text in df[col]:
        # Skip the BOS node; the loop ends at the EOS node, whose .next is None
        node = t.parseToNode(text).next
        while node.next:
            part_of_speech = node.feature.split(',')[0]
            # TBD
            if part_of_speech not in skipped_pos and node.surface not in jp_stopwords:
                all_terms[node.surface] += 1
            node = node.next
    return all_terms
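
The filtering rule in `term_frequency` can be checked without MeCab by feeding it hand-written `(surface, part_of_speech)` pairs. In this sketch the token list, the stopword argument, and the helper name `count_content_terms` are all made up for illustration; in the real function the tag comes from `node.feature`.

```python
import collections

# 助動詞 = auxiliary verb, 助詞 = particle, 補助記号 = supplementary symbol
SKIPPED_POS = {"助動詞", "助詞", "補助記号"}


def count_content_terms(tokens, stopwords=frozenset()):
    """Count surfaces whose POS is not skipped and that are not stopwords."""
    counts = collections.Counter()
    for surface, pos in tokens:
        if pos in SKIPPED_POS or surface in stopwords:
            continue
        counts[surface] += 1
    return counts


# Hand-tagged tokens for 「猫が魚を食べた。」 (illustrative, not MeCab output)
tokens = [
    ("猫", "名詞"), ("が", "助詞"), ("魚", "名詞"), ("を", "助詞"),
    ("食べ", "動詞"), ("た", "助動詞"), ("。", "補助記号"),
]
print(count_content_terms(tokens))
```

Only the noun and verb surfaces survive; particles, the auxiliary verb, and the punctuation mark are dropped before counting.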

In [8]:
def get_simplified_terms(df, n_most_common):
    """
    Ranks terms by (count in simplified column) - (count in original column)
    Returns up to the top 'n_most_common' terms whose difference is >= 0
    """
    # Filter out rows where original and simplified are exactly the same
    diff_corpus_df = df[df['original'] != df['simplified']]

    # Create Counters of original and simplified terms
    original_terms = term_frequency(diff_corpus_df, 'original')
    simplified_terms = term_frequency(diff_corpus_df, 'simplified')

    # Subtract original counts from simplified counts.
    # Counter.subtract works in place and keeps zero and negative counts.
    diff_terms = simplified_terms
    diff_terms.subtract(original_terms)

    diff_terms_df = pd.DataFrame(list(diff_terms.items()), columns=['word', 'count'])
    return diff_terms_df[diff_terms_df['count'] >= 0].sort_values(by='count', ascending=False)['word'].tolist()[:n_most_common]
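
The behaviour of `Counter.subtract` matters for the final filter, so here is a small worked example with made-up counts. It shows that terms used equally often in both columns end up with a count of 0 and therefore pass a `>= 0` filter but not a `> 0` filter; which of the two is intended is not stated in the notebook.

```python
from collections import Counter

# Invented counts, for illustration only
original = Counter({"購入": 3, "する": 2, "本": 1})
simplified = Counter({"買う": 3, "する": 2, "本": 2})

diff = Counter(simplified)  # copy, so `simplified` itself is left untouched
diff.subtract(original)     # in place; zero and negative counts are kept

kept_ge0 = {word for word, count in diff.items() if count >= 0}
kept_gt0 = {word for word, count in diff.items() if count > 0}

print(dict(diff))
print(kept_ge0 - kept_gt0)  # terms that only the >= 0 filter lets through
```

Here `購入` ends at -3 and is excluded either way, while `する` ends at 0 and is kept only by the `>= 0` variant.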

In [None]:
df = get_data('SNOW_T15_10000.xlsx')
len(get_simplified_terms(df, 2000))

#### Step 2: Using the list of 2,000 simplified terms from Step 1, find the nearest term
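
The notebook does not yet say how "nearest" is measured. One common reading is cosine similarity between word vectors; the sketch below assumes that reading and uses invented 3-dimensional vectors, so the vectors, the `nearest_term` helper, and the metric itself are assumptions rather than the project's method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_term(query_vec, candidates):
    """Return the candidate term whose vector is most similar to query_vec."""
    return max(candidates, key=lambda term: cosine(query_vec, candidates[term]))


# Toy vectors, invented for illustration only
simplified_vectors = {
    "買う": [0.9, 0.1, 0.0],
    "食べる": [0.0, 0.2, 0.9],
}
purchase_vec = [0.8, 0.2, 0.1]  # stands in for a harder word such as 購入
print(nearest_term(purchase_vec, simplified_vectors))
```

In practice the candidate dictionary would hold one vector per term from Step 1, taken from whichever embedding model the project settles on.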

# 3. Preprocessing

### 3.1) Data Organization and Clean Up!

In [None]:
# All the imported libraries go here for Section 2