# 🀄 Ideogram-based vs. Phonogram-based Language
#### Jason Heesang Lee

In [None]:
!pip install -q whoosh

*Disclaimer:* Initially, this project was just a quick review to satisfy my curiosity.<br>
But as I developed this notebook, it somehow grew into a large project.<br>
I will try to finish this notebook by ***October 2023***.<br>

------------------------------------------------------------------
<br>

***There*** are more than **7.8 billion** people in the world, speaking more than **7,000 languages**.<br><br>
Broadly speaking, there are two types of languages: Ideogram-based Languages and Phonogram-based Languages.<br>
Phonogram-based Languages are languages whose writing is built on phonemes (speech sounds) or combinations of phonemes.<br>
Languages written in the Latin alphabet and Korean (Hangul) are examples.<br>
<br>
Ideogram-based Languages are languages whose writing is built on symbols that represent ideas or meanings rather than sounds.
Chinese, Egyptian Hieroglyphs, and Sumero-Akkadian Cuneiform are examples.<br> [Source: Wikipedia](https://en.wikipedia.org/wiki/Ideogram)<br>
<br>
As I am fluent in Korean, English, and Chinese, I was suddenly curious about the possible differences in Natural Language Processing (NLP) techniques dealing with these two types of languages.<br> [Source: Wikipedia](https://en.wikipedia.org/wiki/Phonogram_(linguistics))<br>
<br>
At first thought, I believe it is easier to process Ideogram-based Languages than Phonogram-based Languages.<br>
<br>
This is due to the characteristics of Ideogram-based Languages.
<br>
Taking Chinese (which I am familiar with) as an example, each character carries a meaning of its own. Each character (or letter) in Phonogram-based Languages such as English and Korean (Hangul) often needs to be combined with other characters to carry a meaning.<br>
<br>
**`Hypothesis`** : Ideogram-based Languages might not need special Tokenizations or Embeddings for Natural Language Processing.<br>
<br>
I tried to discuss this topic with the lecturers here at Year-Dream School (Data Science Bootcamp).<br>
I only had a meaningful discussion with [@Yongdam Kim](https://www.kaggle.com/emphymachine), as he and some of his friends have some (not a lot, as he claims) experience in this field.<br>
He mentioned that Natural Language Processing can be easier for Ideogram-based languages, as each character in such a language carries meaning, which could already be similar to an embedding.<br>
<br>
As I wanted to research this topic further, I first asked ChatGPT and Google Bard to satisfy my curiosity.<br>
<br>
***My query was as below.***<br>

> *I was recently wondering that NLP process could be different between Phonogram based languages like English, and Ideogram based language like Chinese.<br>
Like Tokenization, Embedding, Vectorization, Lemmatization, Stemming, etc.<br>
Could you tell me the main differences in process of Natural Language Processing?*
>

Below are the responses from the LLMs (redirected to my Notion page)<br><br>
**`ChatGPT`**<br>
[GPT - NLP Phonogram Ideogram.pdf](https://www.notion.so/jason-heesang-lee/Ideogram-Based-Language-vs-Phonogram-Based-Language-6ba064e320e2413aaba60f6aba5e6e19?pvs=4#b378a20a10ca4580b4442a5a4486b87f)<br>
<br>
**`Google Bard`**<br>
[Bard - NLP Phonogram Ideogram.pdf](https://www.notion.so/jason-heesang-lee/Ideogram-Based-Language-vs-Phonogram-Based-Language-6ba064e320e2413aaba60f6aba5e6e19?pvs=4#11e5c642f4394e1c82311677d4ea1268)<br>
<br>
There were some points that were interesting.<br>
<br>
First, both GPT and Bard told me that the Tokenization process might be harder for Ideogram-based Languages.<br>
Tokenization is the process of decomposing a sentence into words.<br>
In a language like English, each word is a sequence of characters separated from its neighbours by spaces, which makes tokenizing easier; Chinese text has no such spaces between words.<br>
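
To see what the LLMs meant, here is a minimal sketch using nothing but Python's built-in string methods. The two sentences are my own illustrative assumptions (they are not taken from the datasets used below), and this is only a naive baseline, not how Jieba or any real tokenizer works.

```python
# Illustrative sentences (assumptions, not taken from the datasets used below).
english = "I came to Tsinghua University in Beijing"
chinese = "我来到北京清华大学"

# English: whitespace already marks the word boundaries.
english_tokens = english.split()

# Chinese: there is no whitespace, so split() returns the whole sentence as one token.
chinese_tokens = chinese.split()

# Falling back to single characters over-segments multi-character words such as 北京.
chinese_chars = list(chinese)

print(english_tokens)
print(chinese_tokens)
print(chinese_chars)
```

Neither whitespace splitting nor character splitting gives words for the Chinese sentence, which is why a dedicated segmenter is needed.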

My plan is to open up each key module and figure out how it works.<br>
For Jieba, I will try to understand how this module is able to perform such segmentation.<br>
Also, for the DeBERTa Tokenizer or AutoTokenizer (I need to find out which module makes the difference), I want to know the inner logic that processes English and Chinese with the same lines of code.<br>

In [None]:
import sys
sys.path.insert(0, "../input/sentencepiece-pb2/")

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")
import re
import nltk
import jieba
import numpy as np
import pandas as pd
from tqdm import tqdm
import sentencepiece_pb2
import sentencepiece as spm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Dataset
I have brought [Chinese Daily News](https://www.kaggle.com/datasets/noxmoon/chinese-official-daily-news-since-2016) by [@noxmoon](https://www.kaggle.com/noxmoon) & True news from [Fake and Real News](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset) by [@clmentbisaillon](https://www.kaggle.com/clmentbisaillon) to compare the process.<br>

In [None]:
cn_df = pd.read_csv('/kaggle/input/chinese-official-daily-news-since-2016/chinese_news.csv', encoding='utf-8')
display(cn_df.head(5))
print()
en_df = pd.read_csv('/kaggle/input/fake-and-real-news-dataset/True.csv')
display(en_df.head(5))

# Checking DataFrame Information

In [None]:
# DataFrame.info() prints its summary itself and returns None, so call it directly
cn_df.info()

In [None]:
# DataFrame.info() prints its summary itself and returns None, so call it directly
en_df.info()

# Dropping unnecessary columns
I will drop the date and tag (subject) columns from each DataFrame<br>
and match the column names.

In [None]:
cn_df = cn_df.drop(columns=['date', 'tag'])
en_df = en_df.drop(columns=['date', 'subject'])

In [None]:
print(f"list(cn_df.columns) :\n{list(cn_df.columns)}")

In [None]:
en_df = en_df.rename(columns={'title':'headline', 'text':'content'})
print(f"list(en_df.columns) :\n{list(en_df.columns)}")

##### Most of the NLP techniques were retrieved from [@jhoward](https://www.kaggle.com/jhoward)'s notebook
***[Getting started with NLP for absolute beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners)***

In [None]:
cn_example = pd.DataFrame(cn_df.iloc[0]).T
print(f"cn_example :\n{cn_example}")

In [None]:
en_example = pd.DataFrame(en_df.iloc[0]).T
print(f"en_example :\n{en_example}")

# CN Text Preprocessing
##### CN Definition retrieved from [Baidu Wenku](https://wenku.baidu.com/view/039d6d4e551252d380eb6294dd88d0d233d43cc8.html?_wkts_=1693548615550&bdQuery=%E4%B8%AD%E6%96%87+%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80+%E5%89%8D%E5%A4%84%E7%90%86)

In [None]:
# stopwords = [k.strip() for k in open('/kaggle/input/english-and-chinese-stopwords/stopwords.txt', encoding='utf8').readlines() if k.strip() != '']

def find_chinese(text):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    chinese_txt = re.sub(pattern,'',text)
    return str(chinese_txt)

def cut_words(text):
    jieba_txt = ' '.join(jieba.cut(find_chinese(text), cut_all=False))
    return jieba_txt

# def seg_sentence(text_list):
#     seg_text = [word for word in text_list if word not in stopwords]
#     return seg_text

In [None]:
%%writefile cn_text.txt
 

In [None]:
cn_text = ''
for column in cn_example.columns:
    temp = []
    for row in tqdm(range(cn_example.shape[0])):
        text = cn_example.iloc[row][column]
        text = cut_words(text)

        temp.append(text)
        cn_text = cn_text + '; ' + text
#     text = seg_sentence(temp)
    cn_example[column] = pd.Series(temp)

#     cn_example[column] = pd.Series(seg_sentence(temp))
# Use a context manager so the file is flushed and closed before it is read back later
with open('/kaggle/working/cn_text.txt', 'w', encoding='utf-8') as cn_text_file:
    cn_text_file.write(cn_text)
display(cn_example.head())

# EN Text Preprocessing
I will drop the names of the news companies.

In [None]:
%%writefile en_text.txt
 

In [None]:
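temp = []
en_text = ''
for row in tqdm(range(en_example.shape[0])):
    text_h = en_example.headline[row]
    # Keep only what follows the first ' - ', which drops the news company name
    text = " ".join(en_example.content[row].split(' - ')[1:])
    en_text = en_text + '; ' + text_h
    en_text = en_text + "; " + text
    temp.append(text)

# Use a context manager so the file is flushed and closed before it is read back later
with open('/kaggle/working/en_text.txt', 'w', encoding='utf-8') as en_text_file:
    en_text_file.write(en_text)
en_example.content = pd.Series(temp)
display(en_example.head())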

# Checking text files

In [None]:
with open('/kaggle/working/cn_text.txt') as cn_text_file:
    print(cn_text_file.read())

In [None]:
with open('/kaggle/working/en_text.txt') as en_text_file:
    print(en_text_file.read())

# Jieba
***I guess this is where I have to examine the [Jieba Github](https://github.com/fxsjy/jieba)...!***

**This is what the repository looks like.**<br>
<img src="https://github.com/jasonheesanglee/Ideogram_Phonogram/blob/main/IDEOPHONO/jieba_main.png?raw=true" height="100" /><br>
There are three directories (extra_dict, jieba, and test) and some config files.<br>
Let's first look into README.md to grasp what this module is about.<br>
I brought in the English version of the README.<br>
(The content is identical to the Chinese version.)

## jieba
-----------------------------------

*Jieba (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.*<br>
***This is the explanation of what jieba is***<br>

### Features
-----------------------------------
- Support three types of segmentation mode:
1. Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
2. Full Mode gets all the possible words from the sentence. Fast but not accurate.
3. Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
- Supports Traditional Chinese
- Supports customized dictionaries
- MIT License

***There are 3 different segmentation modes, and which one to use depends on the purpose.***

### Online demo 
-----------------------------------

[http://jiebademo.ap01.aws.af.cm/](http://jiebademo.ap01.aws.af.cm/)<br>
***This online demo is not working anymore (404 Error)***

### Usage
-----------------------------------

- Fully automatic installation: `easy_install jieba` or `pip install jieba`<br>
- Semi-automatic installation: Download [http://pypi.python.org/pypi/jieba/](https://pypi.org/project/jieba/) , run `python setup.py install` after extracting.<br>
- Manual installation: place the `jieba` directory in the current directory or python `site-packages` directory.<br>
- `import jieba`.<br>

***This section explains how to import the module***

### Algorithm
-----------------------------------

- Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
- Use dynamic programming to find the most probable combination based on the word frequency.
- For unknown words, a HMM-based model is used with the Viterbi algorithm.

***Wait, there are so many terms I have no clue about.<br>What is Directed Acyclic Graph? <br>What is HMM-based model? <br>What is Viterbi algorithm?***

#### Directed Acyclic Graph (DAG)

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Tred-G.svg/1920px-Tred-G.svg.png" width=200 />

Based on [Wikipedia](https://en.wikipedia.org/wiki/Directed_acyclic_graph), Directed Acyclic Graph is a 
>*Directed graph with no directed cycles. A directed graph is a DAG if and only if it can be [topologically ordered](https://en.wikipedia.org/wiki/Topological_order), by arranging the vertices as a linear ordering that is consistent with all edge directions.* <br>

What? Still no clue. Let's move on to the definitions.<br>

> *A graph is formed by vertices and by edges connecting pairs of vertices, where the vertices can be any kind of object that is connected in pairs by edges. In the case of a directed graph, each edge has an orientation, from one vertex to another vertex. A path in a directed graph is a sequence of edges having the property that the ending vertex of each edge in the sequence is the same as the starting vertex of the next edge in the sequence; a path forms a cycle if the starting vertex of its first edge equals the ending vertex of its last edge. A directed acyclic graph is a directed graph that has no cycles.*<br><br>

**TL;DR** (I shouldn't though)

Instead of learning it from Wikipedia, I searched Google a bit more and completely understood the concept from [StackExchange](https://math.stackexchange.com/questions/3782987/difference-between-oriented-graph-and-directed-acyclic-graphs-dag#:~:text=Basically%20directed%20graphs%20can%20have,two%20vertices%20A%20and%20B.&text=In%20mathematics%2C%20particularly%20graph%20theory,graph%20with%20no%20directed%20cycles.)<br>
Please correct me if I am wrong;<br>
Basically, a DAG is a graph whose vertices are connected by edges that each have a direction, and following those edges only ever leads forward, never back to a vertex already visited.<br> That makes it a graph with direction, but without cycles.<br>
***OH*** That is why its name is **Directed** **A**cyclic Graph!!
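
To connect this back to the Algorithm section of the README, here is a minimal sketch of how a sentence can be turned into a DAG of candidate words and then segmented with dynamic programming over word frequencies. The `FREQ` dictionary and its numbers are made up for illustration, and the helper names `build_dag`, `best_route`, and `segment` are my own; this only follows the README's description and is not Jieba's actual code.

```python
import math

# Hypothetical toy dictionary: word -> frequency (made-up numbers, not Jieba's dict.txt).
FREQ = {"我": 50, "来到": 30, "来": 20, "到": 20, "北京": 40, "北": 5, "京": 5,
        "清华": 10, "清华大学": 15, "大学": 25, "清": 3, "华": 3, "大": 8, "学": 8}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        # Unknown character: fall back to a single-character edge.
        dag[start] = ends or [start]
    return dag


def best_route(sentence, dag):
    """Dynamic programming from right to left, maximising the summed log probability."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1)) - math.log(TOTAL)
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence):
    """Follow the best end index stored for each start position."""
    dag = build_dag(sentence)
    route = best_route(sentence, dag)
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


print(build_dag("我来到北京清华大学"))
print(segment("我来到北京清华大学"))
```

Every edge in this graph points from an earlier character position to a later one, so no path can ever return to where it started, which is exactly the "directed, no cycles" property.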

#### HMM-based model
Based on this [article](https://medium.com/data-science-in-your-pocket/pos-tagging-using-hidden-markov-models-hmm-viterbi-algorithm-in-nlp-mathematics-explained-d43ca89347c4) by [Mehul Gupta](https://medium.com/@mehulgupta_7991), to understand the concept of a Hidden Markov Model (HMM), we first need to understand what a ***Markov Chain*** is.<br>
It gave a simple definition of a Markov chain, and I completely got it!
> *A Markov chain is a model that tells us something about the probabilities of sequences of random states/variables. A Markov chain makes a very strong assumption that if we want to predict the future in the sequence, all that matters is the current state. All the states before the current state have no impact on the future except via the current state.*

Below is an example the writer gave, and I believe this is just a perfect example.
> *A Markov Chain model based on Weather might have Hot, Cool, and Rainy as its states & to predict tomorrow’s weather you could examine today’s weather but yesterday’s weather isn’t significant in the prediction.*<br>

Below are all the components of a Markov Chain, as specified by the writer.
<img src="https://miro.medium.com/v2/resize:fit:1104/format:webp/1*tI5HGo_cFTxgcxiEDHQMOw.png" height=100 />
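
To make the weather example concrete, here is a minimal sketch of a Markov chain in plain Python. The transition probabilities are numbers I made up (they are not from the article), and `step` is a hypothetical helper name. The point is only that tomorrow's distribution is computed from today's alone.

```python
# Hypothetical transition probabilities (made-up numbers, not from the article).
# TRANSITIONS[today][tomorrow] = P(tomorrow | today); each row sums to 1.
TRANSITIONS = {
    "Hot":   {"Hot": 0.6, "Cool": 0.3, "Rainy": 0.1},
    "Cool":  {"Hot": 0.3, "Cool": 0.4, "Rainy": 0.3},
    "Rainy": {"Hot": 0.2, "Cool": 0.3, "Rainy": 0.5},
}


def step(distribution):
    """Advance a probability distribution over weather states by one day."""
    nxt = {state: 0.0 for state in TRANSITIONS}
    for today, p_today in distribution.items():
        for tomorrow, p_move in TRANSITIONS[today].items():
            nxt[tomorrow] += p_today * p_move
    return nxt


today = {"Hot": 1.0, "Cool": 0.0, "Rainy": 0.0}  # we know today is Hot
tomorrow = step(today)
day_after = step(tomorrow)

print(tomorrow)
print(day_after)
```

Note that `step` never looks at yesterday: the current distribution is all it needs, which is the Markov assumption from the quote above.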

Moving on to ***HMM-based model***.<br>
> *Sometimes, what we want to predict is a sequence of states that aren’t directly observable in the environment. Though we are given another sequence of states that are observable in the environment, these hidden states have some dependence on the observable states.*

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*EXjrDa28pUnGmI0ehjhR8A.png" height=100 /><br>
> *In the above HMM, we are given Walk, Shop & Clean as observable states. But we are more interested in tracing the sequence of the hidden states that will be followed which are Rainy & Sunny.*<br>

***As I understand it, simply put, this is a step beyond the Markov Chain.<br>
According to the provided image, the actions (Walk, Shop, Clean) are what we can observe, and the weather conditions from the Markov Chain example are the hidden states that we try to infer from those actions.***<br>

A Hidden Markov Model is used for Part-of-Speech Tagging (labelling each word with its category: verb, noun, adjective, and so on).<br>
> If you notice closely, we can have the words in a sentence as Observable States (given to us in the data) but their POS Tags as Hidden states and hence we use HMM for estimating POS tags. It must be noted that we call Observable states ‘Observation’ & Hidden states ‘States’.

Below are all the components of an HMM, as specified by the writer.<br>
<img src="https://miro.medium.com/v2/resize:fit:1240/format:webp/1*ARltONawvqjzKeZOMvD-tg.png" height=100 /><br>
> *$Q$: Set of possible Tags<br><br>
$A$: The $A$ matrix contains the tag transition probabilities $P(t_i \mid t_{i-1})$, which represent the probability of a tag occurring given the previous tag. Example: Calculating $A$[Verb][Noun]:<br><br>
$P$(Noun | Verb): Count(Noun & Verb) / Count(Verb)<br><br>
$O$: Sequence of observations (words in the sentence)<br><br>
$B$: The $B$ emission probabilities, $P(w_i \mid t_i)$, represent the probability, given a tag (say Verb), that it will be associated with a given word (say Playing). The emission probability $B$[Verb][Playing] is calculated using:<br><br>
$P$(Playing | Verb): Count(Playing & Verb) / Count(Verb)<br><br>
It must be noted that we get all these Count() from the corpus itself used for training.<br><br>
A sample HMM with both ‘A’ & ‘B’ matrices will look like this :*

<img src="https://miro.medium.com/v2/resize:fit:1026/format:webp/1*TY_h8WgfRH7iJy1PZKN5UQ.png" height=100 />

> *Here, the black, straight arrows represent values of Transition matrix ‘A’ while the dotted black arrow represents Emission Matrix ‘B’ for a system with Q: {MD, VB, NN}.*<br>
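
As a sanity check on the Count() formulas quoted above, here is a minimal sketch that estimates the $A$ (transition) and $B$ (emission) entries from a tiny tagged corpus. The two tagged sentences are hand-made assumptions for illustration, and the function names `transition` and `emission` are my own.

```python
from collections import Counter

# Tiny hand-made tagged corpus (an assumption for illustration only).
corpus = [
    [("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("Will", "NNP"), ("will", "MD"), ("bill", "VB"), ("Janet", "NNP")],
]

tag_counts = Counter()         # Count(tag)
transition_counts = Counter()  # Count(previous tag followed by tag)
emission_counts = Counter()    # Count(word & tag)

for sentence in corpus:
    previous_tag = None
    for word, tag in sentence:
        tag_counts[tag] += 1
        emission_counts[(tag, word.lower())] += 1
        if previous_tag is not None:
            transition_counts[(previous_tag, tag)] += 1
        previous_tag = tag


def transition(previous_tag, tag):
    """A[previous_tag][tag] = Count(previous_tag followed by tag) / Count(previous_tag)."""
    return transition_counts[(previous_tag, tag)] / tag_counts[previous_tag]


def emission(tag, word):
    """B[tag][word] = Count(word & tag) / Count(tag)."""
    return emission_counts[(tag, word.lower())] / tag_counts[tag]


print(transition("MD", "VB"))   # how often a verb follows a modal in this corpus
print(emission("VB", "back"))   # how often a verb is the word "back" in this corpus
```

With a real training corpus the same counting would fill the whole $A$ and $B$ matrices; here it only covers the handful of tags that appear in the two sentences.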

The writer also explained Decoding using HMMs.<br>
While skimming through the upcoming content (Viterbi Algorithm) by the same writer, I thought it was necessary to go through this part as well.<br>
> Given an input as HMM (Transition Matrix, Emission Matrix) and a sequence of observations $O = o_1, o_2, \ldots, o_T$ (Words in sentences of a corpus), find the most probable sequence of states $Q = q_1 q_2 q_3 \ldots q_T$ (POS Tags in our case)<br>
The two major assumptions followed while decoding tag sequence using HMMs:
> - The probability of a word appearing depends only on its **own tag** and is independent of neighboring words and tags.
> - The probability of a tag depends only on ***the previous tag(bigram HMM)*** that occurred rather than the entire previous tag sequence i.e. shows Markov Property. Though we can be flexible with this.

Which, I believe, means that, just as in a Markov Chain, each tag depends only on the previous tag, and not on the entire sequence of tags before it.<br>
Let's move on to the Viterbi Algorithm.

### Viterbi Algorithm
[Mehul Gupta](https://medium.com/@mehulgupta_7991) also explains the Viterbi Algorithm well in the same [article](https://medium.com/data-science-in-your-pocket/pos-tagging-using-hidden-markov-models-hmm-viterbi-algorithm-in-nlp-mathematics-explained-d43ca89347c4).<br><br>
The Viterbi Algorithm is a decoding algorithm used for HMMs.<br>
The writer mentioned that setting up the Lattice, the probability matrix, is necessary.<br>
Before proceeding further, it helps to be familiar with the Part of Speech tags.<br><br>

<img src="https://m-clark.github.io/text-analysis-with-R/img/POS-Tags.png" height=200 /> <br>
[***Source: Text Analysis in R by Michael Clark***](https://m-clark.github.io/text-analysis-with-R/part-of-speech-tagging.html)<br>

    
With a sample sentence ***Janet will back the bill***, it will look like this on Lattice:<br>
<img src="https://miro.medium.com/v2/resize:fit:1080/format:webp/1*8-5KZVj-_jZOWN83gGhD5A.png" height=100 />

As all the words in this sentence are commonly used words, there is no word with an "Unknown" tag.<br><br>

Each cell of the lattice is represented by $V_t(j)$: $V$ for Viterbi, $t$ for the column, and $j$ for the row.<br>
It represents the probability that the HMM is in state $j$ (the present POS Tag) after seeing the first $t$ observations (the past words for which lattice values have been calculated)<br>
and passing through the most **probable state sequence (previous POS Tags)** $q_1, q_2, \ldots, q_{t-1}$.<br>
Which means that, for the word **back**, the cell takes into account the best tag sequence for the preceding words **Janet** and **will**.<br>

$V_t(j)$ is calculated as :
$$V_t(j) = \max_{i} \; V_{t-1}(i) \, a_{ij} \, b_j(o_t)$$
where $a$ (the transition matrix) and $b$ (the emission matrix) come from the HMM calculations discussed above:<br>
<img src="https://miro.medium.com/v2/resize:fit:1014/format:webp/1*1UylhpDw7suhH9WpnPYFaw.png" height=20 />
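
To check my understanding of the lattice, here is a minimal sketch of the Viterbi recursion in plain Python. It reuses the Rainy/Sunny and Walk/Shop/Clean setting from the HMM image, but the probabilities below are numbers I assumed for illustration (they are not read from the article), and the function name `viterbi` and its arguments are my own.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # lattice[t][j] holds V_t(j); backpointer[t][j] holds the best previous state i.
    lattice = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        lattice.append({})
        backpointer.append({})
        for j in states:
            # max over i of V_{t-1}(i) * a_ij, then multiply by b_j(o_t)
            best_prev = max(states, key=lambda i: lattice[t - 1][i] * trans_p[i][j])
            lattice[t][j] = (lattice[t - 1][best_prev] * trans_p[best_prev][j]
                             * emit_p[j][observations[t]])
            backpointer[t][j] = best_prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: lattice[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, backpointer[t][path[0]])
    return path, lattice[-1][last]


# Assumed probabilities, for illustration only.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
          "Sunny": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1}}

path, prob = viterbi(["Walk", "Shop", "Clean"], states, start_p, trans_p, emit_p)
print(path, prob)
```

For POS tagging, the states would be the tags and the observations would be the words of the sentence; the recursion itself stays the same.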

### Main Functions
-----------------------------------

### 1. Cut
-----------------------------------
***This section shows how to use jieba.cut method.***<br>
- The `jieba.cut` function accepts three input parameters: the first parameter is the string to be cut; the second parameter is `cut_all`, controlling the cut mode; the third parameter controls whether to use the Hidden Markov Model.
- `jieba.cut_for_search` accepts two parameters: the string to be cut; whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines.
- The input string can be a unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
- `jieba.cut` and `jieba.cut_for_search` return a generator, from which you can use a `for` loop to get the segmentation result (in unicode).
- `jieba.lcut` and `jieba.lcut_for_search` return a list.
- `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. `jieba.dt` is the default Tokenizer, to which almost all global functions are mapped.
<br>

In [None]:
print("Code example: Segmentation\n\nOutput: ")

#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # 全模式
print()
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # 默认模式
print()
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))
print()
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # 搜索引擎模式
print(", ".join(seg_list))

### 2. Add a custom dictionary
-----------------------------------
***This section explains how to load & modify the dictionary***<br>
#### Load dictionary
- Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
- Usage: `jieba.load_userdict(file_name)` # file_name is a file-like object or the path of the custom dictionary
- The dictionary format is the same as that of `dict.txt`: one word per line; each line is divided into three parts separated by a space: word, word frequency, POS tag. If `file_name` is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
- The word frequency and POS tag can be omitted respectively. The word frequency will be filled with a suitable value if omitted.

**Example:** <br>
*创新办 3 i*<br>
*云计算 5<br>
凱特琳 nz<br>
台中<br>*
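
To make the line format clearer, here is a minimal sketch that splits the four example lines into word, frequency, and POS tag. `parse_dict_line` is a hypothetical helper of my own, not Jieba's parser; it simply assumes that a purely numeric field is the frequency and anything else is the tag.

```python
def parse_dict_line(line):
    """Split one dictionary line into (word, frequency, POS tag); missing parts become None."""
    parts = line.strip().split()
    word, freq, tag = parts[0], None, None
    for part in parts[1:]:
        if part.isdigit():
            freq = int(part)   # numeric field: word frequency
        else:
            tag = part         # anything else: POS tag
    return word, freq, tag


sample = ["创新办 3 i", "云计算 5", "凱特琳 nz", "台中"]
entries = [parse_dict_line(line) for line in sample]
for entry in entries:
    print(entry)
```

This shows why both the frequency and the tag can be left out independently: each line still has an unambiguous reading.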

- Change a Tokenizer's `tmp_dir` and `cache_file` to specify the path of the cache file, for using on a restricted file system.

**Example:** <br>
*云计算 5<br>
  李小福 2<br>
  创新办 3<br>
  [Before]： 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /<br>
  [After]：　李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /*<br>
  
#### Modify dictionary
- Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in programs.
- Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented.
- Note that HMM may affect the final result.


In [None]:
print("Example :\n")
# The README shows these as interactive >>> sessions; the prompts are removed so the cell runs.
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
# 如果/放到/post/中将/出错/。
print(jieba.suggest_freq(('中', '将'), True))
# 494
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
# 如果/放到/post/中/将/出错/。
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
# 「/台/中/」/正确/应该/不会/被/切开
print(jieba.suggest_freq('台中', True))
# 69
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
# 「/台中/」/正确/应该/不会/被/切开


### 3. Keyword Extraction
-----------------------------------
***This section explains how to extract keywords***<br>

`import jieba.analyse`

- `jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())`<br>
    - `sentence`: the text to be extracted<br>
    - `topK`: how many keywords with the highest TF/IDF weights to return. The default value is 20<br>
    - `withWeight`: whether to return TF/IDF weights with the keywords. The default value is False<br>
    - `allowPOS`: keep only words with the listed POS tags. Empty for no filtering.<br>
- `jieba.analyse.TFIDF(idf_path=None)` creates a new TF/IDF instance; `idf_path` specifies the IDF file path.<br><br>
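
To remind myself what a "TF/IDF weight" is, here is a minimal sketch of the idea in plain Python. The three pre-segmented documents are made up, the helper names `idf` and `top_keywords` are my own, and the smoothed IDF formula is just one common variant that I assumed. Jieba itself reads IDF values from a file (see `idf_path`), so this is not its implementation.

```python
import math
from collections import Counter

# Hypothetical, already-segmented documents (made up for illustration).
documents = [
    ["北京", "大学", "学生", "北京"],
    ["上海", "大学", "教授"],
    ["北京", "天气", "很", "好"],
]


def idf(word):
    """Inverse document frequency: words found in fewer documents score higher."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + containing)) + 1


def top_keywords(doc, top_k=2):
    """Rank the words of one document by term frequency times IDF."""
    tf = Counter(doc)
    scores = {word: (count / len(doc)) * idf(word) for word, count in tf.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


print(top_keywords(documents[0]))
```

The `top_k` argument plays the same role as `topK` above: it caps how many of the highest-weighted words are returned.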

**Example (keyword extraction)**<br>
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py<br>
Developers can specify their own custom IDF corpus in jieba keyword extraction<br>
- Usage: `jieba.analyse.set_idf_path(file_name)`<br>
`file_name` is the path for the custom corpus<br>
- Custom Corpus Sample: (not working)<br>
https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
- Sample Code:<br>
https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py<br>
Developers can specify their own custom stop words corpus in jieba keyword extraction

- Usage: `jieba.analyse.set_stop_words(file_name)`<br>
`file_name` is the path for the custom corpus
- Custom Corpus Sample:<br>
https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
- Sample Code:<br>
https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py<br>

There's also a [TextRank](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) implementation available.

- Use: `jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))`

Note that it filters POS by default.<br>
`jieba.analyse.TextRank()` creates a new TextRank instance.


### 4. Part of Speech Tagging
-----------------------------------
***This section explains how to tag Parts of Speech***<br>

- `jieba.posseg.POSTokenizer(tokenizer=None)` creates a new customized Tokenizer.<br>
`tokenizer` specifies the `jieba.Tokenizer` to internally use. jieba.posseg.dt is the default POSTokenizer.
- Tags the POS of each word after segmentation, using labels compatible with ictclas.<br>

**Example:**

In [None]:
# The README shows this as an interactive >>> session; the prompts are removed so the cell runs.
import jieba.posseg as pseg
words = pseg.cut("我爱北京国安")
for w in words:
    print('%s %s' % (w.word, w.flag))

### 5. Parallel Processing
-----------------------------------
***This section explains parallel processing.***<br><br>
I believe this would be helpful on a large dataset, but I will not use it in this notebook.<br>

- Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster.

- Based on the multiprocessing module of Python.

- Usage:

    - `jieba.enable_parallel(4)`<br>
    Enable parallel processing. The parameter is the number of processes.
    - `jieba.disable_parallel()`<br>
    Disable parallel processing.
    
- **Example:** https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py

- Result: On a four-core 3.4GHz Linux machine, do accurate word segmentation on Complete Works of Jin Yong, and the speed reaches 1MB/s, which is 3.3 times faster than the single-process version.

- Note that parallel processing supports only default tokenizers, `jieba.dt` and `jieba.posseg.dt`.


### 6. Tokenize: return words with position
-----------------------------------
***This section explains how to tokenize words and get their positions.***
- The input must be unicode


In [None]:
print('Default Mode\n')
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

In [None]:
print("Search Mode\n")
result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

### 7. ChineseAnalyzer for Whoosh
-----------------------------------
***This section explains how to use Jieba as the analyzer for the Whoosh search library.***

`from jieba.analyse import ChineseAnalyzer`<br>
**Example:** (Copy & Pasted below to see the result)<br> https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py<br>

In [None]:
# -*- coding: UTF-8 -*-
from __future__ import unicode_literals
import sys,os
sys.path.append("../")
from whoosh.index import create_in,open_dir
from whoosh.fields import *
from whoosh.qparser import QueryParser

from jieba.analyse.analyzer import ChineseAnalyzer

analyzer = ChineseAnalyzer()

schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True, analyzer=analyzer))
if not os.path.exists("tmp"):
    os.mkdir("tmp")

ix = create_in("tmp", schema) # for create new index
#ix = open_dir("tmp") # for read only
writer = ix.writer()

writer.add_document(
    title="document1",
    path="/a",
    content="This is the first document we’ve added!"
)

writer.add_document(
    title="document2",
    path="/b",
    content="The second one 你 中文测试中文 is even more interesting! 吃水果"
)

writer.add_document(
    title="document3",
    path="/c",
    content="买水果然后来世博园。"
)

writer.add_document(
    title="document4",
    path="/c",
    content="工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
)

writer.add_document(
    title="document4",
    path="/c",
    content="咱俩交换一下吧。"
)

writer.commit()
searcher = ix.searcher()
parser = QueryParser("content", schema=ix.schema)

for keyword in ("水果世博园","你","first","中文","交换机","交换"):
    print("result of ",keyword)
    q = parser.parse(keyword)
    results = searcher.search(q)
    for hit in results:
        print(hit.highlights("content"))
    print("="*10)

for t in analyzer("我的好朋友是李明;我爱北京国安;IBM和Microsoft; I have a dream. this is intetesting and interested me a lot"):
    print(t.text)

### 8. Command Line Interface
-----------------------------------
```
$> python -m jieba --help
Jieba command line interface.

positional arguments:
  filename              input file

optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM instead of ' / ' for word delimiter; or a
                        space if it is used without DELIM
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is specified, use DELIM
                        instead of '_' for POS delimiter
  -D DICT, --dict DICT  use DICT as dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT together with the default dictionary or
                        DICT (if specified)
  -a, --cut-all         full pattern cutting (ignored with POS tagging)
  -n, --no-hmm          don't use the Hidden Markov Model
  -q, --quiet           don't print loading messages to stderr
  -V, --version         show program's version number and exit

If no filename specified, use STDIN instead.

```

## Initialization
-----------------------------------

By default, Jieba doesn't build the prefix dictionary until it is needed. Building it takes 1-3 seconds, after which it is not built again.<br>
If you want to initialize Jieba manually, you can call:


In [None]:
import jieba
jieba.initialize()  # (optional)

You can also specify the dictionary (not supported before version 0.28) :

In [None]:
jieba.set_dictionary('/kaggle/input/ideogram-phonogram-dataset/dict.txt.big')


## Using Other Dictionaries
-----------------------------------
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:<br>
1. A smaller dictionary for a smaller memory footprint: <br>
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small<br>

2. There is also a bigger dictionary that has better support for traditional Chinese (繁體):<br>
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big<br>
***You can find both files from [here](https://www.kaggle.com/datasets/jasonheesanglee/ideogram-phonogram-dataset)***<br><br>

By default, an in-between dictionary is used, called `dict.txt` and included in the distribution.<br>

In either case, download the file you want, and then call `jieba.set_dictionary('data/dict.txt.big')` or just replace the existing `dict.txt`.



## jieba/jieba
-----------------------------------
Now that I have finished going through README.md, I will return to the original plan.<br>
Below is what the jieba/jieba directory looks like.<br><br>
Code explanations were written with the help of ChatGPT & Bard.
<img src="https://github.com/jasonheesanglee/Ideogram_Phonogram/blob/main/IDEOPHONO/jieba_jieba.png?raw=true" height="100" />

### dict.txt
-----------------------------------
Let's take a look at dict.txt<br>
From the code output below, we can see that each line of this txt file follows the format `[word]` | `[word frequency]` | `[POS]`, as explained in [2.Add a custom dictionary](https://www.kaggle.com/code/jasonheesanglee/ideogram-based-vs-phonogram-based-language?scriptVersionId=142722802&cellId=36).

In [None]:
with open(r'/kaggle/input/ideogram-phonogram-dataset/dict.txt') as dict_txt:
    display(dict_txt.readlines()[0:10])
#     display(dict_txt.read()[:])
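
As a rough sketch of how one such line could be read, the snippet below splits a `[word] [frequency] [POS]` entry on single spaces. The sample lines and the `parse_dict_line` helper are made up by me for illustration and are not part of Jieba; I am also assuming the POS tag may be missing, as in a user dictionary.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    word: str
    freq: int
    pos: Optional[str]


def parse_dict_line(line: str) -> DictEntry:
    """Split one 'word freq pos' line into its fields (hypothetical helper)."""
    parts = line.strip().split(' ')
    word, freq = parts[0], int(parts[1])
    # The POS tag is treated as optional in this sketch.
    pos = parts[2] if len(parts) > 2 else None
    return DictEntry(word, freq, pos)


# Made-up sample lines in the same layout, not actual entries from dict.txt.
sample_lines = ["你好 100 l\n", "世界 250 n\n", "jieba 5\n"]
entries = [parse_dict_line(line) for line in sample_lines]
for entry in entries:
    print(entry)
```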

### _compat.py
-----------------------------------
I will start with `_compat.py` as both `__main__.py` and `__init__.py` start by importing this module.<br>

I will break the module down per `def`.

### Importing Modules
-----------------------------------

In [None]:
# -*- coding: utf-8 -*-
import logging
import os
import sys

### Logging Configurations
-----------------------------------

In [None]:
log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)

- `log_console` is created as a stream handler that directs log messages to the standard error (`sys.stderr`).<br>
- `default_logger` is a logger object created for the current module.<br>
- `__name__` refers to the current module (`_compat.py`).<br>
- The logger is configured to log messages with a minimum level of `DEBUG`.
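
To see the handler and the logger working together, here is a minimal sketch. The snippet shown above only creates the handler, so attaching it with `addHandler` is my assumption for this demo. I also swap `sys.stderr` for an in-memory stream, and `compat_demo` is a made-up logger name.

```python
import io
import logging

# Stand-in stream so the output can be inspected; the original uses sys.stderr.
stream = io.StringIO()
log_console = logging.StreamHandler(stream)

demo_logger = logging.getLogger("compat_demo")  # hypothetical logger name
demo_logger.setLevel(logging.DEBUG)
demo_logger.addHandler(log_console)  # assumption: the handler gets attached
demo_logger.propagate = False        # keep the demo out of the root logger

demo_logger.debug("Building prefix dict ...")
captured = stream.getvalue()
print(captured)
```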

### `setLogLevel`
-----------------------------------

In [None]:
# def setLogLevel(log_level):
#     default_logger.setLevel(log_level)

This function is defined to allow changing the log level of the `default_logger`.<br>
By calling this function with a log level, the logger's level will be set.<br>
For example, if it is called with `logging.INFO`, the log level will be changed to `INFO`, and the logger will only display messages of `INFO` level or higher (`DEBUG` messages are dropped).
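
A small sketch of that threshold behaviour, using a throwaway logger (`set_level_demo` is a made-up name) and an in-memory stream instead of Jieba's real logger:

```python
import io
import logging

stream = io.StringIO()
demo_logger = logging.getLogger("set_level_demo")  # hypothetical logger name
demo_logger.addHandler(logging.StreamHandler(stream))
demo_logger.propagate = False
demo_logger.setLevel(logging.DEBUG)


def setLogLevel(log_level):
    demo_logger.setLevel(log_level)


setLogLevel(logging.INFO)
demo_logger.debug("debug message")      # below INFO: dropped
demo_logger.info("info message")        # at INFO: kept
demo_logger.warning("warning message")  # above INFO: kept

lines = stream.getvalue().splitlines()
print(lines)
```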

In [None]:
# check_paddle_install = {'is_paddle_installed': False}

# try:
#     import pkg_resources

#     get_module_res = lambda *res: pkg_resources.resource_stream(__name__,
#                                                                 os.path.join(*res))
# except ImportError:
#     get_module_res = lambda *res: open(os.path.normpath(os.path.join(
#         os.getcwd(), os.path.dirname(__file__), *res)), 'rb')

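- `check_paddle_install` is a flag dictionary that records whether PaddlePaddle is available. It starts as `False` and is only switched to `True` later, inside `enable_paddle`.<br>
- The `try` block does not check for PaddlePaddle. It defines `get_module_res`, a helper that opens resource files shipped with the package.<br>
- It uses `pkg_resources.resource_stream` if `pkg_resources` can be imported. If not, it constructs the resource path using `os.getcwd()` and `os.path.dirname(__file__)` and opens the resource as a binary file.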

### `enable_paddle`
-----------------------------------

In [None]:
# def enable_paddle():
#     try:
#         import paddle
#     except ImportError:
#         default_logger.debug("Installing paddle-tiny, please wait a minute......")
#         os.system("pip install paddlepaddle-tiny")
#         try:
#             import paddle
#         except ImportError:
#             default_logger.debug(
#                 "Import paddle error, please use command to install: pip install paddlepaddle-tiny==1.6.1."
#                 "Now, back to jieba basic cut......")
#     if paddle.__version__ < '1.6.1':
#         default_logger.debug("Find your own paddle version doesn't satisfy the minimum requirement (1.6.1), "
#                              "please install paddle tiny by 'pip install --upgrade paddlepaddle-tiny', "
#                              "or upgrade paddle full version by "
#                              "'pip install --upgrade paddlepaddle (-gpu for GPU version)' ")
#     else:
#         try:
#             import jieba.lac_small.predict as predict
#             default_logger.debug("Paddle enabled successfully......")
#             check_paddle_install['is_paddle_installed'] = True
#         except ImportError:
#             default_logger.debug("Import error, cannot find paddle.fluid and jieba.lac_small.predict module. "
#                                  "Now, back to jieba basic cut......")

- This function begins by trying to import `paddle`.<br>
- If that raises an `ImportError`, it logs a message and tries to install `paddlepaddle-tiny` with `pip` through `os.system`, then attempts the import again.
- If the nested import raises an `ImportError` again, it logs another message to the default logger, indicating that PaddlePaddle couldn't be imported even after the installation, and suggests a specific command to install a particular version of PaddlePaddle.
- If the installed version is lower than 1.6.1, it logs a message asking for an upgrade. Otherwise, it tries to import `jieba.lac_small.predict` and, on success, sets the `is_paddle_installed` flag to `True`.
- The function does not return any value. It only decides whether PaddlePaddle is available for use in the Jieba library and logs the relevant messages.
- In short, this function handles the installation and availability of the PaddlePaddle library, which may be used by Jieba for certain tasks.<br> If PaddlePaddle is available and of the correct version, PaddlePaddle support is enabled.<br> Otherwise, it falls back to the basic Jieba functionality.
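
One thing I noticed in the code above: `paddle.__version__ < '1.6.1'` compares two strings, character by character. The sketch below shows how that can differ from a numeric comparison. `version_tuple` is my own hypothetical helper (not part of Jieba or PaddlePaddle), it assumes purely numeric dotted versions, and `'1.10.0'` is a made-up version used only for the comparison.

```python
def version_tuple(version):
    """Turn '1.6.1' into (1, 6, 1); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split('.'))


minimum = '1.6.1'
candidate = '1.10.0'  # made-up version string, used only for this comparison

# String comparison: '1' < '6' at the third character decides the result.
string_says_too_old = candidate < minimum
# Tuple comparison: 10 > 6 in the second field decides the result.
tuple_says_too_old = version_tuple(candidate) < version_tuple(minimum)

print(string_says_too_old)
print(tuple_says_too_old)
```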

### Python 2 / 3 Compatibility
-----------------------------------

In [None]:
# PY2 = sys.version_info[0] == 2

# default_encoding = sys.getfilesystemencoding()

# if PY2:
#     text_type = unicode
#     string_types = (str, unicode)

#     iterkeys = lambda d: d.iterkeys()
#     itervalues = lambda d: d.itervalues()
#     iteritems = lambda d: d.iteritems()

# else:
#     text_type = str
#     string_types = (str,)
#     xrange = range

#     iterkeys = lambda d: iter(d.keys())
#     itervalues = lambda d: iter(d.values())
#     iteritems = lambda d: iter(d.items())

This section deals with defining variables and functions based on Python version compatibility (Python 2 and Python 3).

- `PY2 = sys.version_info[0] == 2`:<br>This line determines whether the Python version being used is Python 2.<br>It checks if the major version number (`sys.version_info[0]`) is equal to 2 and assigns the result to the variable PY2.
- `default_encoding = sys.getfilesystemencoding()`:<br>This line obtains the default encoding used by the file system and assigns it to the variable `default_encoding`.<br>This is often used for encoding and decoding file paths.

***I will pass the first `if` statement as we are using Python 3***

- `text_type = str`: In Python 3, `str` always holds Unicode text (byte strings are a separate `bytes` type), so it assigns the name `text_type` to `str`.
- `string_types = (str,)`: It defines string_types as a tuple containing only str since there's no need for unicode in Python 3.
- `xrange = range`: In Python 2, there was a separate xrange function for creating efficient iterators over a range of numbers.<br>In Python 3, the range function provides the same functionality, so it assigns range to xrange.
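
Here is the Python 3 branch on its own, as a small runnable sketch. The `word_freq` dictionary is made-up sample data.

```python
# Python 3 branch of the compatibility block, copied as-is.
text_type = str
string_types = (str,)
xrange = range

iterkeys = lambda d: iter(d.keys())
itervalues = lambda d: iter(d.values())
iteritems = lambda d: iter(d.items())

word_freq = {"你好": 3, "世界": 5}  # made-up sample data

print(isinstance("你好", string_types))  # text is a str
print(isinstance(b"abc", text_type))     # bytes are not text in Python 3
print(list(xrange(3)))
print(sorted(iteritems(word_freq)))
```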

### `strdecode`
-----------------------------------

In [None]:
# def strdecode(sentence):
#     if not isinstance(sentence, text_type):
#         try:
#             sentence = sentence.decode('utf-8')
#         except UnicodeDecodeError:
#             sentence = sentence.decode('gbk', 'ignore')
#     return sentence

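- The `strdecode` function makes sure the sentence is a text (Unicode) string. If it already is one, it is returned unchanged.
- Otherwise (a byte string), it first tries to decode the sentence as UTF-8.
- If that raises a `UnicodeDecodeError`, it decodes the sentence with the `gbk` encoding, ignoring bytes that cannot be decoded.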
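
To try this out in Python 3, the sketch below feeds the function a text string, UTF-8 bytes, and GBK bytes. The function body is the one shown above; the sample inputs are mine.

```python
text_type = str  # as defined in the Python 3 branch of _compat.py


def strdecode(sentence):
    if not isinstance(sentence, text_type):
        try:
            sentence = sentence.decode('utf-8')
        except UnicodeDecodeError:
            # The bytes were not valid UTF-8, so fall back to GBK.
            sentence = sentence.decode('gbk', 'ignore')
    return sentence


utf8_bytes = "你好".encode('utf-8')
gbk_bytes = "你好".encode('gbk')

from_text = strdecode("你好")      # already text: returned unchanged
from_utf8 = strdecode(utf8_bytes)  # decoded on the first attempt
from_gbk = strdecode(gbk_bytes)    # UTF-8 fails, GBK fallback is used
print(from_text, from_utf8, from_gbk)
```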

### `resolve_filename`
-----------------------------------

In [None]:
# def resolve_filename(f):
#     try:
#         return f.name
#     except AttributeError:
#         return repr(f)

- `resolve_filename` takes one argument called `f`, which is expected to be a file object.
- It first tries to return the file's `name` attribute.
- If the `name` attribute is not available, it returns a string representation of the file object `f` using the `repr()` function.<br>This representation includes information about the object, which can be helpful for debugging or providing more context.
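
A quick sketch of both paths: a temporary file on disk has a `name`, while an in-memory `io.StringIO` does not. The two test objects are my own choice.

```python
import io
import tempfile


def resolve_filename(f):
    try:
        return f.name
    except AttributeError:
        return repr(f)


# A real file on disk has a `name` attribute (its path).
with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
    real_file_name = resolve_filename(tmp)

# An in-memory text stream has no `name`, so repr() is used instead.
in_memory_name = resolve_filename(io.StringIO("no name here"))

print(real_file_name)
print(in_memory_name)
```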

### __main__.py
-----------------------------------
I have hidden this part of the analysis as it is mostly configuration.

### Importing modules
-----------------------------------

In [None]:
"""Jieba command line interface."""

# import sys
# import jieba
# from argparse import ArgumentParser
# from ._compat import *

### Argument Parsing
-----------------------------------

In [None]:
# parser = ArgumentParser(usage="%s -m jieba [options] filename" % sys.executable, description="Jieba command line interface.", epilog="If no filename specified, use STDIN instead.")

This section sets up the argument parser for the command-line interface of `jieba`.<br>
It defines various command-line options that can be used when running the script.

### Argument Definitions
-----------------------------------

In [None]:
# parser.add_argument("-d", "--delimiter", metavar="DELIM", default=' / ',
#                     nargs='?', const=' ',
#                     help="use DELIM instead of ' / ' for word delimiter; or a space if it is used without DELIM")
# parser.add_argument("-p", "--pos", metavar="DELIM", nargs='?', const='_',
#                     help="enable POS tagging; if DELIM is specified, use DELIM instead of '_' for POS delimiter")
# parser.add_argument("-D", "--dict", help="use DICT as dictionary")
# parser.add_argument("-u", "--user-dict",
#                     help="use USER_DICT together with the default dictionary or DICT (if specified)")
# parser.add_argument("-a", "--cut-all",
#                     action="store_true", dest="cutall", default=False,
#                     help="full pattern cutting (ignored with POS tagging)")
# parser.add_argument("-n", "--no-hmm", dest="hmm", action="store_false",
#                     default=True, help="don't use the Hidden Markov Model")
# parser.add_argument("-q", "--quiet", action="store_true", default=False,
#                     help="don't print loading messages to stderr")
# parser.add_argument("-V", '--version', action='version',
#                     version="Jieba " + jieba.__version__)
# parser.add_argument("filename", nargs='?', help="input file")

This section adds arguments to the parser.
- `-d` is used to specify the word delimiter.
- `-p` is used to enable part of speech tagging.
- `-D` allows specifying a custom dictionary.
- `-u` is for a user-defined dictionary.
- `-a` enables full pattern cutting.
- `-n` disables the Hidden Markov Model.
- `-q` makes the script run quietly without loading messages.

***I guess these flags work in a similar way to the*** `-q` ***in*** `!pip install -q ...`.<br>
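
To check how these options behave, here is a reduced parser with only three of the arguments, fed with hand-written argument lists instead of a real command line. The option definitions are copied from above (minus the help texts); the test inputs are mine.

```python
from argparse import ArgumentParser

parser = ArgumentParser(description="Reduced copy of the Jieba CLI options.")
parser.add_argument("-d", "--delimiter", metavar="DELIM", default=' / ',
                    nargs='?', const=' ')
parser.add_argument("-a", "--cut-all", action="store_true", dest="cutall",
                    default=False)
parser.add_argument("-n", "--no-hmm", dest="hmm", action="store_false",
                    default=True)

no_flags = parser.parse_args([])                       # defaults only
bare_d = parser.parse_args(["-d"])                     # -d without DELIM
custom_d = parser.parse_args(["-d", "|", "-a", "-n"])  # everything set

print(repr(no_flags.delimiter), no_flags.cutall, no_flags.hmm)
print(repr(bare_d.delimiter))
print(repr(custom_d.delimiter), custom_d.cutall, custom_d.hmm)
```

With `nargs='?'`, `default` applies when the flag is absent and `const` applies when the flag is given without a value, which is why a bare `-d` gives a single space.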

### Parsing Command-Line Arguments
-----------------------------------

In [None]:
# args = parser.parse_args()

- This section parses the command-line arguments using the previously defined argument parser.<br>
- The parsed arguments are stored in the `args` variable, which is an object with attributes corresponding to the defined arguments.

### Configuration based on Command-Line Arguments.
-----------------------------------

In [None]:
# if args.quiet:
#     jieba.setLogLevel(60)

# if args.pos:
#     import jieba.posseg
#     posdelim = args.pos
#     def cutfunc(sentence, _, HMM=True):
#         for w, f in jieba.posseg.cut(sentence, HMM):
#             yield w + posdelim + f
# else:
#     cutfunc = jieba.cut

- If the `-q` flag is provided in the command line, it sets the logging level of the jieba library to 60, which is higher than the `CRITICAL` log level (50).<br>
- This means that no log messages, including the loading messages, will be printed to the standard error (stderr).<br>

- If the `-p` flag is provided in the command line, it imports the `jieba.posseg` module and sets up a custom word segmentation function (`cutfunc`) that incorporates POS tags based on the specified delimiter.
- If the `-p` flag is not provided, it sets `cutfunc` to the default word segmentation function (`jieba.cut`)
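
The `-p` branch can be sketched without Jieba by swapping in a stand-in tagger. `fake_posseg_cut` below is a hypothetical replacement for `jieba.posseg.cut` that splits on spaces and tags every word with a made-up `"x"` flag, so only the wrapping logic is real.

```python
def fake_posseg_cut(sentence, HMM=True):
    """Stand-in for jieba.posseg.cut: yields (word, flag) pairs."""
    # Hypothetical tags; a real tagger would decide these from the text.
    for word in sentence.split():
        yield word, "x"


posdelim = "_"


def cutfunc(sentence, _, HMM=True):
    # The unused second parameter lets cutfunc be called with the same
    # three arguments (sentence, cutall, hmm) as the jieba.cut branch.
    for w, f in fake_posseg_cut(sentence, HMM):
        yield w + posdelim + f


tagged = list(cutfunc("I love ice cream", False))
print(" / ".join(tagged))
```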

### Variable Assignments
-----------------------------------

In [None]:
# delim = text_type(args.delimiter)
# cutall = args.cutall
# hmm = args.hmm
# fp = open(args.filename, 'r') if args.filename else sys.stdin

- `delim` converts the specified delimiter into the text type (`unicode` in Python 2, `str` in Python 3).
- `cutall` stores whether the `-a` flag was provided.
- `hmm` stores whether the Hidden Markov Model should be used: `True` by default, `False` if the `-n` flag was provided.
- `fp` is the input: the file specified in the command line (`args.filename`), opened for reading, or `sys.stdin` if no filename is provided.

### `jieba` Configuration
-----------------------------------

In [None]:
# if args.dict:
#     jieba.initialize(args.dict)
# else:
#     jieba.initialize()
# if args.user_dict:
#     jieba.load_userdict(args.user_dict)

- If the `-D` flag is provided, it initializes jieba with the specified dictionary.<br>Otherwise, it uses the default dictionary.
- If the `-u` flag is provided, it loads the specified user dictionary.

### Processing and Output
-----------------------------------

In [None]:
# ln = fp.readline()
# while ln:
#     l = ln.rstrip('\r\n')
#     result = delim.join(cutfunc(ln.rstrip('\r\n'), cutall, hmm))
#     if PY2:
#         result = result.encode(default_encoding)
#     print(result)
#     ln = fp.readline()

# fp.close()

- This section reads lines from the input file (or stdin if no filename is provided) using fp.readline().
- It applies the word segmentation function (cutfunc) to each line, joining the resulting tokens with the specified delimiter.
- If the Python version is 2.x (PY2 is True), it encodes the result using the default encoding.
- It prints the segmented and possibly encoded text to the standard output.
- This process continues until there are no more lines to read.
- Finally, it closes the input file (if opened).
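
Here is a Python 3 sketch of the same loop that runs without Jieba. `demo_cut` is a hypothetical stand-in for `jieba.cut` that only splits on whitespace, and the input text is made up. I also iterate over the file object directly instead of calling `readline()` in a `while` loop.

```python
import io


def demo_cut(sentence, cut_all=False, HMM=True):
    """Stand-in for jieba.cut: splits on whitespace only."""
    return iter(sentence.split())


delim = " / "
fp = io.StringIO("I love ice cream\r\nhello world\n")  # made-up input

results = []
for ln in fp:
    line = ln.rstrip('\r\n')  # drop the line ending before segmenting
    results.append(delim.join(demo_cut(line, False, True)))
fp.close()

print(results)
```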

### __init__.py
-----------------------------------

# $Below\ In\ Progress$


# Combining Columns

In [None]:
cn_new = pd.DataFrame()
en_new = pd.DataFrame()
cn_new['input'] = "Headline: " + cn_example['headline'] + "; Content: " + cn_example['content']
en_new['input'] = "Headline: " + en_example['headline'] + "; Content: " + en_example['content']

In [None]:
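# Rows whose 'input' is NaN (the headline or the content was missing)
display(cn_new[cn_new['input'].isna()])
cn_new = cn_new.dropna()
display(cn_new[cn_new['input'].isna()])  # should be empty after dropna()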

In [None]:
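# Rows whose 'input' is NaN (the headline or the content was missing)
display(en_new[en_new['input'].isna()])
en_new = en_new.dropna()
display(en_new[en_new['input'].isna()])  # should be empty after dropna()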

In [None]:
display(cn_new.head(5))
display(en_new.head(5))

# Tokenization
Tokenization is the process of breaking down a text into smaller units, which are typically words or subwords.<br>These smaller units are called tokens.<br><br>
In English, tokenization usually involves splitting text into words based on spaces, punctuation, or other delimiters.<br>
For example, the sentence "I love ice cream" would be tokenized into the tokens: ["I", "love", "ice", "cream"].<br><br>
In languages like Chinese, tokenization can be more complex since words are not separated by spaces.<br>Tokenization might involve segmenting text into characters or meaningful subword units.
<br>
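
A tiny sketch of the difference, using only built-in string methods (no tokenizer library). The Chinese sentence is my own translation of the English example.

```python
english = "I love ice cream"
chinese = "我爱冰淇淋"  # "I love ice cream"

# Splitting on whitespace works for English ...
english_tokens = english.split()
# ... but a Chinese sentence has no spaces, so it stays in one piece.
naive_chinese_tokens = chinese.split()
# Splitting into characters is one option, but it breaks up the
# three-character word 冰淇淋 (ice cream).
char_tokens = list(chinese)

print(english_tokens)
print(naive_chinese_tokens)
print(char_tokens)
```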

### Trying Out SentencePiece
Got the [Colab Doc](https://colab.research.google.com/github/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb#scrollTo=SUcAbKnRVAv6), but not sure how to use it yet.<br>
As always, [abhishek](https://www.kaggle.com/abhishek)'s [notebook](https://www.kaggle.com/code/abhishek/sentencepiece-tokenizer-with-offsets/notebook) helped me a lot on SentencePiece implementation.

In [None]:
class SentencePieceTokenizer:
    '''
    from Abhishek Thakur's notebook
    https://www.kaggle.com/code/abhishek/sentencepiece-tokenizer-with-offsets
    '''
    def __init__(self, model_path):
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(model_path +'.model')
        
    def encode(self, sentence):
        spt = sentencepiece_pb2.SentencePieceText()
        spt.ParseFromString(self.sp.encode_as_serialized_proto(sentence))
        offsets = []
        tokens = []
        for piece in spt.pieces:
            tokens.append(piece.id)
            offsets.append((piece.begin, piece.end))
        return tokens, offsets

In [None]:
deberta_tok = AutoTokenizer.from_pretrained('/kaggle/input/debertav3base')
deberta_tok.save_pretrained('/kaggle/working/deberta_tok/')

In [None]:
spt = SentencePieceTokenizer('/kaggle/input/debertav3base/spm')

# Encoding
Here I have tried encoding different words, languages and tokens.<br>
Let's see if it works well!<br>

#### Tokens
It is fun to see how these tokens are treated *similarly*

In [None]:
print(f"[MASK] encoded into \t{spt.encode('[MASK]')}")
print(f"[CLS] encoded into \t{spt.encode('[CLS]')}")
print(f"[EOS] encoded into \t{spt.encode('[EOS]')}")
print(f"[UNK] encoded into \t{spt.encode('[UNK]')}")
print(f"[SEP] encoded into \t{spt.encode('[SEP]')}")
print(f"[SPECIAL] encoded into \t{spt.encode('[SPECIAL]')}")
print()
print(f"MASK encoded into \t{spt.encode('MASK')}")
print(f"CLS encoded into \t{spt.encode('CLS')}")
print(f"EOS encoded into \t{spt.encode('EOS')}")
print(f"UNK encoded into \t{spt.encode('UNK')}")
print(f"SEP encoded into \t{spt.encode('SEP')}")
print(f"SPECIAL encoded into \t{spt.encode('SPECIAL')}")

In [None]:
print(f"[MASK] encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('[MASK]', add_special_tokens=False)}")
print(f"\n[CLS] encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('[CLS]', add_special_tokens=False)}")
print(f"\n[EOS] encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('[EOS]', add_special_tokens=False)}")
print(f"\n[UNK] encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('[UNK]', add_special_tokens=False)}")
print(f"\n[SEP] encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('[SEP]', add_special_tokens=False)}")
print(f"\n[SPECIAL] encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('[SPECIAL]', add_special_tokens=False)}")
print()
print(f"\nMASK encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('MASK', add_special_tokens=False)}")
print(f"\nCLS encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('CLS', add_special_tokens=False)}")
print(f"\nEOS encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('EOS', add_special_tokens=False)}")
print(f"\nUNK encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('UNK', add_special_tokens=False)}")
print(f"\nSEP encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('SEP', add_special_tokens=False)}")
print(f"\nSPECIAL encoded into :\nvvvvvvvvvvvvvvvvvvvvvvvvvv\n{deberta_tok('SPECIAL', add_special_tokens=False)}")

### English words
We can see that these sample English words are encoded properly *(maybe?)*

In [None]:
display(spt.encode(en_new.input[0][:10]))
display(spt.encode(en_new.input[0][:15]))
display(spt.encode(en_new.input[0][:30]))
display(spt.encode(en_new.input[0][:50]))

In [None]:
display(deberta_tok(en_new.input[0][:10], add_special_tokens=False))
display(deberta_tok(en_new.input[0][:15], add_special_tokens=False))
display(deberta_tok(en_new.input[0][:30], add_special_tokens=False))
display(deberta_tok(en_new.input[0][:50], add_special_tokens=False))

### Chinese words
Now the problem begins: the last digits of the encoded tensors are different.<br> *(which I think is the number of bytes taken)*<br>
This implies that it doesn't work at all for Chinese words.<br><br>
**edit**<br>
Wait what...?<br>
Last time I checked, there were no differences between 你 and 您.<br>
But when I check it now, there is a difference...<br>
It seems like deberta is also working well for Chinese characters.

In [None]:
display(spt.encode(cn_new.input[0][:10]))
display(spt.encode(cn_new.input[0][:15]))
display(spt.encode(cn_new.input[0][:30]))
display(spt.encode(cn_new.input[0][:40]))


In [None]:
display(deberta_tok(cn_new.input[0][:10], add_special_tokens=False))
display(deberta_tok(cn_new.input[0][:15], add_special_tokens=False))
display(deberta_tok(cn_new.input[0][:30], add_special_tokens=False))
display(deberta_tok(cn_new.input[0][:40], add_special_tokens=False))


So I tried a Chinese-specific SentencePiece model... and this works!!<br>
**edit** I will leave below as it is.<br>

In [None]:
cn_spt = SentencePieceTokenizer('/kaggle/input/sentencepiece-chinese-bpe/chinese/chinese')

In [None]:
display(cn_spt.encode(cn_new.input[0][:10]))
display(cn_spt.encode(cn_new.input[0][:15]))
display(cn_spt.encode(cn_new.input[0][:30]))
display(cn_spt.encode(cn_new.input[0][:40]))


### Korean words
Same here, I am pretty... no, very sure that this doesn't work on foreign languages.<br><br>
**edit**<br>
Oops.. again, this seems like it works well...!

In [None]:
display(spt.encode('나'))
display(spt.encode('제'))
print()
display(spt.encode('안녕'))
display(spt.encode('안녕하세요'))
print()
display(spt.encode('이름'))
display(spt.encode('이름은'))
print()
display(spt.encode('이희상'))
display(spt.encode('이희상입니다'))
print()
display(spt.encode('안녕하세요 제 이름은'))
display(spt.encode('안녕하세요 제 이름은 이희상입니다'))
display(spt.encode('안녕하세요. 제 이름은 이희상입니다. 홍대에서 공부를 하고 있습니다.'))

In [None]:
display(deberta_tok('나', add_special_tokens=False))
display(deberta_tok('제', add_special_tokens=False))
print()
display(deberta_tok('안녕', add_special_tokens=False))
display(deberta_tok('안녕하세요', add_special_tokens=False))
print()
display(deberta_tok('이름', add_special_tokens=False))
display(deberta_tok('이름은', add_special_tokens=False))
print()
display(deberta_tok('이희상', add_special_tokens=False))
display(deberta_tok('이희상입니다', add_special_tokens=False))
print()
display(deberta_tok('안녕하세요 제 이름은', add_special_tokens=False))
display(deberta_tok('안녕하세요 제 이름은 이희상입니다', add_special_tokens=False))
display(deberta_tok('안녕하세요. 제 이름은 이희상입니다. 홍대에서 공부를 하고 있습니다.', add_special_tokens=False))

Again, I tried a Korean-specific SentencePiece model as below!<br>
It seems like it is working properly! :)<br><br>

**edit**<br>
I will leave below as it is.<br>

In [None]:
ko_spt = SentencePieceTokenizer('/kaggle/input/airc-keti-ke-t5/vocab/sentencepiece_v2')

In [None]:
display(ko_spt.encode('나'))
display(ko_spt.encode('제'))
print()
display(ko_spt.encode('안녕'))
display(ko_spt.encode('안녕하세요'))
print()
display(ko_spt.encode('이름'))
display(ko_spt.encode('이름은'))
print()
display(ko_spt.encode('이희상'))
display(ko_spt.encode('이희상입니다'))
print()
display(ko_spt.encode('안녕하세요 제 이름은'))
display(ko_spt.encode('안녕하세요 제 이름은 이희상입니다'))
display(ko_spt.encode('안녕하세요. 제 이름은 이희상입니다. 홍대에서 공부를 하고 있습니다.'))