# IndoXTC - Extracting Yelp Features [XLM-R] 9
Exploring Indonesian hate speech/abusive language and sentiment text classification using a multilingual language model.  
  
This kernel is part of my undergraduate final-year project.  
Check out the full GitHub repository:  
https://github.com/ilhamfp/indonesian-text-classification-multilingual

In [1]:
import numpy as np
import pandas as pd
from load_data import load_dataset_foreign
from extract_feature import FeatureExtractor

# Row window taken from each class (positive and negative) in this notebook.
START = 60000
END   = 67500
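
The window above spans 7,500 rows per class, and `60000 = 8 * 7500`, which lines up with the "9" in the title. A minimal sketch of how such windows could be derived from a part number follows. It assumes (the notebook does not state this) that parts are numbered from 1 and all share the same size; `CHUNK_SIZE` and `chunk_bounds` are hypothetical names used only for illustration.

```python
CHUNK_SIZE = 7500  # assumption: END - START as used in this notebook


def chunk_bounds(part, chunk_size=CHUNK_SIZE):
    """Return (start, end) row offsets for a 1-based part number."""
    if part < 1:
        raise ValueError("part numbers start at 1")
    start = (part - 1) * chunk_size
    return start, start + chunk_size


start, end = chunk_bounds(9)
print(start, end)
```

Under these assumptions, part 9 gives the same `START` and `END` values that are hard-coded above.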

## Load Data

In [2]:
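def slice_data(start, end):
    """Return rows [start, end) of each class from the Yelp data, positives first."""
    data = load_dataset_foreign(data_name='yelp')
    # Split by class and reindex so the same slice picks the same row count from each.
    data_pos = data[data['label'] == 1].reset_index(drop=True)
    data_neg = data[data['label'] == 0].reset_index(drop=True)

    # Concatenate the equal-sized positive and negative slices into one balanced frame.
    train = pd.concat([data_pos[start:end],
                       data_neg[start:end]]).reset_index(drop=True)
    return train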

train = slice_data(START, END)
print(train.shape)
train.head()

~~~Data~~~
Shape:  (560000, 2)
   label                                               text
0      0  unfortunately the frustration of being dr gold...
1      1  been going to dr goldberg for over 10 years i ...

Label:
1    280000
0    280000
Name: label, dtype: int64
(15000, 2)


   label                                               text
0      1  i know this place typically has a long wait bu...
1      1  great gem of pittsburgh half the reason we wen...
2      1  best breakfast in town if you re not there at ...
3      1  descriptive phrase greasy spoon diner n ndeluc...
4      1  yes deluca s is very good yes they make a grea...
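
The `(15000, 2)` shape comes from taking the same 7,500-row window from each class and stacking the two slices. The sketch below reproduces that slicing logic on a small made-up frame so it can be run without the Yelp data. `slice_balanced` and the toy reviews are illustrative stand-ins, not part of the project code.

```python
import pandas as pd


def slice_balanced(data, start, end):
    """Take rows [start, end) from each class and stack positives above negatives."""
    pos = data[data['label'] == 1].reset_index(drop=True)
    neg = data[data['label'] == 0].reset_index(drop=True)
    return pd.concat([pos[start:end], neg[start:end]]).reset_index(drop=True)


# Toy data: ten rows with alternating labels 0, 1, 0, 1, ...
toy = pd.DataFrame({
    'label': [0, 1] * 5,
    'text': [f'review {i}' for i in range(10)],
})

subset = slice_balanced(toy, 1, 3)
print(subset.shape)
print(subset)
```

Because the positive slice is concatenated first, the head of the result contains only label `1`, which matches the preview above.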


## Extract Feature

In [3]:
FE = FeatureExtractor(model_name='xlm-r')

Downloading: "https://github.com/pytorch/fairseq/archive/master.zip" to /root/.cache/torch/hub/master.zip



100%|██████████| 1028340964/1028340964 [00:22<00:00, 44742013.03B/s]


In [4]:
# Replace each review string with its XLM-R feature array.
train['text'] = train['text'].apply(FE.extract_features)
train.head()

   label                                               text
0      1  [[0.026627354, -0.07264962, 0.055043723, 0.030...
1      1  [[0.018357918, -0.08263896, 0.13109441, -0.004...
2      1  [[-0.029584464, -0.02880721, 0.12528573, 0.001...
3      1  [[0.0300733, -0.0039952938, 0.13389704, -0.000...
4      1  [[0.0137354825, -0.010533312, 0.085911624, -0....


## Saving Results

In [5]:
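# The column holds one array per row, so this is saved as a pickled object array;
# reload it with np.load("train_text.npy", allow_pickle=True).
np.save("train_text.npy", train['text'].values)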

In [6]:
train['label'].to_csv('train_label.csv', index=False, header=['label'])
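
For a later notebook that consumes these two files, the round trip might look like the sketch below. It uses stand-in data and a temporary directory instead of the real outputs. It also assumes each row's feature is an array of shape `(1, dim)`, as the nested brackets in the preview suggest; `dim = 4` is a made-up size for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

dim = 4  # hypothetical feature size, chosen only for this example

# Stand-in for train['text'].values: an object array holding one (1, dim) array per row.
features = np.empty(3, dtype=object)
for i in range(3):
    features[i] = np.full((1, dim), i, dtype=np.float32)

labels = pd.Series([1, 0, 1], name='label')

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, 'train_text.npy')
    label_path = os.path.join(tmp, 'train_label.csv')

    # Same calls as in the cells above.
    np.save(text_path, features)
    labels.to_csv(label_path, index=False, header=['label'])

    # Object arrays are stored with pickle, so loading must opt in explicitly.
    loaded = np.load(text_path, allow_pickle=True)
    y = pd.read_csv(label_path)['label'].to_numpy()

# Stack the per-row (1, dim) arrays into one (n_rows, dim) matrix.
X = np.vstack(list(loaded))
print(X.shape, y.shape)
```

Only load pickled `.npy` files from sources you trust, since unpickling can execute arbitrary code.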