# Preprocessing Notebooks

Several preprocessing variants are defined here to serve as research scenarios.

Preprocessing that follows IndoBERTweet
1. The same preprocessing used to build IndoBERTweet: lowercasing, replacing user mentions, converting emoji, and replacing HTTP URLs (scenario 1)
2. Scenario 1 followed by stemming (scenario 2)

Preprocessing for full text only, without Twitter attributes  
1. Lowercasing, removing URLs and mentions from the text, and converting emoji (scenario 3)
2. Scenario 3 followed by stemming (scenario 4)

Preprocessing for full text only, without Twitter attributes or emoji, so that the text resembles news headlines
1. Lowercasing, removing URLs and mentions from the text, and removing emoji (scenario 5)
2. Scenario 5 followed by stemming (scenario 6)
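
The individual steps come from `kaelib.processor.preprocessing_func`, whose source is not shown in this notebook. As a rough illustration only, the mention and URL steps could be written with regular expressions as below. The `sketch_*` names and both patterns are hypothetical stand-ins, not the actual `kaelib` code; only the `@USER` and `HTTPURL` placeholders are taken from the IndoBERTweet-style output shown later in this notebook. Emoji handling is left out to keep the sketch dependency-free.

```python
import re

# Hypothetical stand-ins for the kaelib helpers; the patterns are assumptions.
USER_PATTERN = re.compile(r"@\w+")
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def sketch_lowercasing(text: str) -> str:
    return text.lower()


def sketch_change_user(text: str) -> str:
    # Replace every mention with a fixed placeholder (scenarios 1 and 2).
    return USER_PATTERN.sub("@USER", text)


def sketch_change_web_url(text: str) -> str:
    # Replace every URL with a fixed placeholder (scenarios 1 and 2).
    return URL_PATTERN.sub("HTTPURL", text)


def sketch_remove_username(text: str) -> str:
    # Drop mentions entirely (scenarios 3 to 6).
    return USER_PATTERN.sub("", text)


def sketch_remove_url(text: str) -> str:
    # Drop URLs entirely (scenarios 3 to 6).
    return URL_PATTERN.sub("", text)


sample = "Konflik di Ukraina http://dlvr.it/Sc20gN @kaenova"

scenario_1_like = sketch_change_web_url(
    sketch_change_user(sketch_lowercasing(sample))
)
scenario_3_like = sketch_remove_url(
    sketch_remove_username(sketch_lowercasing(sample))
).strip()

print(scenario_1_like)  # konflik di ukraina HTTPURL @USER
print(scenario_3_like)  # konflik di ukraina
```

Lowercasing runs first in this sketch so that the uppercase placeholders survive, which matches the order of steps listed for scenario 1.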

## Instantiate the processing classes

In [5]:
from NDETCStemmer import NDETCStemmer, CustomModelDownloader
from kaelib.processor.NDETCStemmerWraper import NDETCStemmerWraper

downloader = CustomModelDownloader(
    model_1="https://is3.cloudhost.id/s3.kaenova.my.id/NDETCStemmer/Model/w2vec_wiki_id_case",
    model_2="https://is3.cloudhost.id/s3.kaenova.my.id/NDETCStemmer/Model/w2vec_wiki_id_case.trainables.syn1neg.npy",
    model_3="https://is3.cloudhost.id/s3.kaenova.my.id/NDETCStemmer/Model/w2vec_wiki_id_case.wv.vectors.npy"
)

original_stemmer=NDETCStemmer(custom_downloader=downloader)
stemmer = NDETCStemmerWraper(original_stemmer)

In [6]:
from kaelib.processor.TextProcessingPipeline import TextProcessingPipeline
import kaelib.processor.preprocessing_func as pf

scenario_processor = {
    1: TextProcessingPipeline([
        pf.lowercasing,
        pf.change_user,
        pf.change_emoji,
        pf.change_web_url
    ]),
    2: TextProcessingPipeline([
        pf.lowercasing,
        pf.change_user,
        pf.change_emoji,
        pf.change_web_url,
        stemmer.stem
    ]),
    3: TextProcessingPipeline([
        pf.lowercasing,
        pf.remove_username,
        pf.remove_url,
        pf.change_emoji,
    ]),
    4: TextProcessingPipeline([
        pf.lowercasing,
        pf.remove_username,
        pf.remove_url,
        pf.change_emoji,
        stemmer.stem
    ]),
    5: TextProcessingPipeline([
        pf.lowercasing,
        pf.remove_username,
        pf.remove_url,
        pf.remove_emoji,
    ]),
    6: TextProcessingPipeline([
        pf.lowercasing,
        pf.remove_username,
        pf.remove_url,
        pf.remove_emoji,
        stemmer.stem
    ])
}

test_text = """😲😲 Miliarder Rusia Oleg Tinkov pada Senin (31/10/2022), mengaku telah  melepaskan kewarganegaraan Rusianya karena konflik di Ukraina. http://dlvr.it/Sc20gN @kaenova """

print("Example")
for i in scenario_processor:
    print(f"Scenario {i}:", scenario_processor[i].process_text(test_text))

Example
Scenario 1: :astonished_face::astonished_face: miliarder rusia oleg tinkov pada senin (31/10/2022), mengaku telah  melepaskan kewarganegaraan rusianya karena konflik di ukraina. HTTPURL @USER
Scenario 2: :astonished_face::astonished_face: miliarder rusia oleg tinkov pada senin 31 10 2022 aku telah lepas warga negara rusianya karena konflik di ukraina HTTPURL @USER
Scenario 3: :astonished_face::astonished_face: miliarder rusia oleg tinkov pada senin (31/10/2022), mengaku telah  melepaskan kewarganegaraan rusianya karena konflik di ukraina.
Scenario 4: :astonished_face::astonished_face: miliarder rusia oleg tinkov pada senin 31 10 2022 aku telah lepas warga negara rusianya karena konflik di ukraina
Scenario 5: miliarder rusia oleg tinkov pada senin (31/10/2022), mengaku telah  melepaskan kewarganegaraan rusianya karena konflik di ukraina.
Scenario 6: miliarder rusia oleg tinkov pada senin 31 10 2022 aku telah lepas warga negara rusianya karena konflik di ukraina
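
`TextProcessingPipeline` is part of `kaelib`, and its implementation is not included here. Judging only from how it is used above (it takes a list of callables and exposes `process_text`), it plausibly behaves like sequential function composition. The sketch below shows that idea under this assumption; `SketchPipeline` is a hypothetical name and may differ from the real class.

```python
from typing import Callable, List


class SketchPipeline:
    """Hypothetical stand-in: feeds each step the output of the previous one."""

    def __init__(self, steps: List[Callable[[str], str]]):
        self.steps = list(steps)

    def process_text(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text


# Any str -> str callable can be a step, including built-in string methods.
pipeline = SketchPipeline([str.lower, str.strip])
result = pipeline.process_text("  Miliarder Rusia  ")
print(result)  # miliarder rusia
```

Under this reading, step order matters: because lowercasing is listed first in every scenario, the `HTTPURL` and `@USER` placeholders added afterwards stay uppercase, as seen in the output above.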


## Combining the Organization and Replies Batches

In [7]:
import pandas as pd

path = "../../data/3.1. Annotated"
save_path = "../../data/3.2. Annotated Combined"
organization_file = [
    "organization_1.xlsx",
    "organization_2.xlsx",
]
replies_file = [
    "replies_100_1.xlsx",
    "replies_100_2.xlsx",
]

organization_df = pd.concat(
    [
        pd.read_excel(f"{path}/{filename}").replace(
            {"\n": " ", "\r": "", "_x000D_": " "}, regex=True
        )
        for filename in organization_file
    ]
)
replies_df = pd.concat(
    [
        pd.read_excel(f"{path}/{filename}").replace(
            {"\n": " ", "\r": "", "_x000D_": " "}, regex=True
        )
        for filename in replies_file
    ]
)

organization_df.to_csv(f"{save_path}/organization.csv", index=False)
replies_df.to_csv(f"{save_path}/replies.csv", index=False)


## How do we preprocess so that every scenario uses the same data?
1. Combine `organization.csv` and `replies.csv` from the `3.2. Annotated Combined` folder
2. Split the combined data into train, validation, and test sets using `scikit-learn` with a fixed random state
3. Apply each scenario's preprocessor to that same split, so the scenarios differ only in preprocessing
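
The two-stage split described above can be checked on toy data. The sketch below is only an illustration: the 200 synthetic texts and balanced labels are made up, and `split_once` is a hypothetical helper. It uses the same `test_size` values and random state as the notebook cell that follows, and shows that repeating the split with the same seed gives identical partitions in roughly 67.5% / 22.5% / 10% proportions.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 200 texts with a balanced binary label.
texts = [f"tweet {i}" for i in range(200)]
labels = [i % 2 for i in range(200)]


def split_once(seed: int):
    # Stage 1: hold out 10% of the data as the test set.
    X_rest, X_test, y_rest, _ = train_test_split(
        texts, labels, test_size=0.1, random_state=seed, stratify=labels
    )
    # Stage 2: take 25% of the remaining 90% as validation (22.5% overall).
    X_train, X_val, _, _ = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=seed, stratify=y_rest
    )
    return X_train, X_val, X_test


first = split_once(2023)
second = split_once(2023)

sizes = [len(part) for part in first]
print(sizes)            # [135, 45, 20]
print(first == second)  # True: same seed, same partitions
```

Because the split happens before any preprocessing, each scenario sees exactly the same tweets in train, validation, and test.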

In [8]:
from sklearn.model_selection import train_test_split
import os

processed_path = "../../data/4. Processed"

X_header_name = 'tweet'
y_header_name = 'labels (Non-Headline 0 / Headline 1)'

# Combine
organization_combined_file = "../../data/3.2. Annotated Combined/organization.csv"
organization_combined_df = pd.read_csv(organization_combined_file)[[X_header_name, y_header_name]]
replies_combined_file = "../../data/3.2. Annotated Combined/replies.csv"
replies_combined_df = pd.read_csv(replies_combined_file)[[X_header_name, y_header_name]]
combined = pd.concat([organization_combined_df, replies_combined_df], ignore_index=True)  # fresh 0..n-1 index, no duplicate labels

# Split
X = combined[X_header_name]
y = combined[y_header_name]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2023, stratify=y) # 0.1 Test Data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=2023, stratify=y_train) # 0.25 x 0.9 = 0.225 Validation Data | 0.75 x 0.9 = 0.675 Training Data

# Preprocess each scenario
for scenario_number in scenario_processor:
    scenario_save_path = f"{processed_path}/{scenario_number}"
    
    df_train = pd.DataFrame({
        'tweet': X_train.map(lambda x: scenario_processor[scenario_number].process_text(x)),
        'labels': y_train
    })
    
    df_validation = pd.DataFrame({
        'tweet': X_val.map(lambda x: scenario_processor[scenario_number].process_text(x)),
        'labels': y_val
    })
    
    df_test = pd.DataFrame({
        'tweet': X_test.map(lambda x: scenario_processor[scenario_number].process_text(x)),
        'labels': y_test
    })
    
    # Create the scenario folder (and any missing parent folders) if needed
    os.makedirs(scenario_save_path, exist_ok=True)
    
    df_train.to_csv(f"{scenario_save_path}/train.csv", index=False, encoding='utf-8')
    df_validation.to_csv(f"{scenario_save_path}/validation.csv", index=False, encoding='utf-8')
    df_test.to_csv(f"{scenario_save_path}/test.csv", index=False, encoding='utf-8')