# IndoXTC - Extracting Yelp Features [XLM-R] 1
Exploring Indonesian hate speech/abusive language and sentiment text classification using a multilingual language model.  
   
This kernel is part of my undergraduate final-year project.  
Check out the full GitHub repository:  
https://github.com/ilhamfp/indonesian-text-classification-multilingual

In [1]:
import numpy as np
import pandas as pd
from load_data import load_dataset_foreign
from extract_feature import FeatureExtractor

START = 0
END   = 7500

## Load Data

In [2]:
def slice_data(start, end):
    # Load the full Yelp dataset and split it by class.
    data = load_dataset_foreign(data_name='yelp')
    data_pos = data[data['label'] == 1].reset_index(drop=True)
    data_neg = data[data['label'] == 0].reset_index(drop=True)

    # Take the same row range from each class so the subset stays balanced.
    train = pd.concat([data_pos[start:end],
                       data_neg[start:end]]).reset_index(drop=True)
    return train

train = slice_data(START, END)
print(train.shape)
train.head()

~~~Data~~~
Shape:  (560000, 2)
   label                                               text
0      0  unfortunately the frustration of being dr gold...
1      1  been going to dr goldberg for over 10 years i ...

Label:
1    280000
0    280000
Name: label, dtype: int64
(15000, 2)


   label                                               text
0      1  been going to dr goldberg for over 10 years i ...
1      1  all the food is great here but the best thing ...
2      1  before i finally made it over to this range i ...
3      1  i drove by yesterday to get a sneak peak it re...
4      1  wonderful reuben map shown on yelp page is inc...
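
The balanced slicing done by `slice_data` can be illustrated on a small toy DataFrame. This is only a sketch: `slice_balanced` and the toy data are hypothetical stand-ins, since `load_dataset_foreign` lives in the repository and is not shown here. With `START = 0` and `END = 7500`, the same logic yields 7,500 rows per class, which matches the printed shape `(15000, 2)`.

```python
import pandas as pd

def slice_balanced(data, start, end):
    """Take rows start..end-1 of each class and stack positives above negatives."""
    pos = data[data["label"] == 1].reset_index(drop=True)
    neg = data[data["label"] == 0].reset_index(drop=True)
    return pd.concat([pos[start:end], neg[start:end]]).reset_index(drop=True)

# Hypothetical toy data standing in for the Yelp reviews.
toy = pd.DataFrame({
    "label": [0, 1] * 5,
    "text": [f"review {i}" for i in range(10)],
})

subset = slice_balanced(toy, 0, 3)
print(subset.shape)              # (6, 2)
print(subset["label"].tolist())  # [1, 1, 1, 0, 0, 0]
```

Note that the positive rows come first and the negative rows second, so the result is not shuffled.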


## Extract Feature

In [3]:
FE = FeatureExtractor(model_name='xlm-r')

Downloading: "https://github.com/pytorch/fairseq/archive/master.zip" to /root/.cache/torch/hub/master.zip


running build_ext
cythoning fairseq/data/data_utils_fast.pyx to fairseq/data/data_utils_fast.cpp
cythoning fairseq/data/token_block_utils_fast.pyx to fairseq/data/token_block_utils_fast.cpp
building 'fairseq.libbleu' extension
creating build
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/fairseq
creating build/temp.linux-x86_64-3.6/fairseq/clib
creating build/temp.linux-x86_64-3.6/fairseq/clib/libbleu
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.6m -c fairseq/clib/libbleu/libbleu.cpp -o build/temp.linux-x86_64-3.6/fairseq/clib/libbleu/libbleu.o -std=c++11 -O3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=libbleu -D_GLIBCXX_USE_CXX11_ABI=0
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.6m -c fairseq/clib/libbleu/module.cpp -o bui

100%|██████████| 1028340964/1028340964 [01:21<00:00, 12576124.44B/s]


In [4]:
train['text'] = train['text'].apply(FE.extract_features)
train.head()

   label                                               text
0      1  [[0.021567786, -0.049458537, 0.038058378, -0.0...
1      1  [[0.026237741, -0.00597712, 0.13274254, 0.0315...
2      1  [[0.032446913, 0.024457924, 0.09015657, 0.0121...
3      1  [[0.024372717, -0.018759903, 0.09137537, 0.016...
4      1  [[0.017360827, -0.04467188, 0.081432335, 0.043...
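
The internals of `FeatureExtractor` are not shown in this notebook, so the following is only a rough sketch of one way such an extractor could be built on the fairseq XLM-R hub interface that the log above downloads. The model name `xlmr.large`, the helper names, and the choice of mean pooling over tokens are all assumptions, not the project's confirmed method. The pooling helper is kept separate so it can be checked on dummy data without downloading the model.

```python
import numpy as np

def mean_pool(token_features):
    """Average token vectors of shape (1, seq_len, dim) into shape (1, dim)."""
    return np.asarray(token_features).mean(axis=1)

def load_xlmr(name="xlmr.large"):
    """Load XLM-R through torch.hub (assumed model name; downloads on first use)."""
    import torch  # imported lazily so mean_pool works without torch installed
    model = torch.hub.load("pytorch/fairseq", name)
    model.eval()  # disable dropout for deterministic features
    return model

def extract_sentence_vector(model, text):
    """Encode text, run XLM-R, and mean-pool the last-layer token features."""
    import torch
    tokens = model.encode(text)
    with torch.no_grad():
        last_layer = model.extract_features(tokens)  # (1, seq_len, dim)
    return mean_pool(last_layer.cpu().numpy())

# Pooling check on dummy data; no model download needed.
dummy = np.arange(12, dtype=np.float32).reshape(1, 3, 4)
print(mean_pool(dummy))  # [[4. 5. 6. 7.]]
```

Mean pooling is one common way to turn per-token features into a fixed-size sentence vector; other choices, such as taking the first token's vector, would fit the same interface.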


## Saving Results

In [5]:
np.save("train_text.npy", train['text'].values)

In [6]:
train['label'].to_csv('train_label.csv', index=False, header=['label'])
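
Because `train['text'].values` is an object array (one NumPy array per row), `np.save` stores it with pickle, and `np.load` then needs `allow_pickle=True` to read it back. The sketch below round-trips the two files using the same save calls as above on hypothetical dummy features. It assumes each row holds an array of shape `(1, D)`, which the printed `[[...` suggests but the notebook does not confirm.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the extracted features: one (1, 4) array per row.
features = np.empty(3, dtype=object)
for i in range(3):
    features[i] = np.full((1, 4), i, dtype=np.float32)
labels = pd.Series([1, 1, 0], name="label")

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "train_text.npy")
    label_path = os.path.join(tmp, "train_label.csv")

    # Same calls as in the notebook; the object array is stored via pickle.
    np.save(text_path, features)
    labels.to_csv(label_path, index=False, header=["label"])

    # Object arrays must be loaded with allow_pickle=True.
    loaded_text = np.load(text_path, allow_pickle=True)
    loaded_label = pd.read_csv(label_path)["label"].values

# Stack the per-row (1, D) arrays into a single (N, D) feature matrix.
X = np.vstack(loaded_text)
print(loaded_text.shape, X.shape)  # (3,) (3, 4)
print(loaded_label.tolist())       # [1, 1, 0]
```

Only load pickled `.npy` files from sources you trust, since unpickling can execute arbitrary code.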