# IndoXTC - Extracting Yelp Features [XLM-R] 5
Exploring Indonesian hate speech/abusive language and sentiment text classification using a multilingual language model.  
   
This kernel is part of my undergraduate final-year project.  
Check out the full GitHub repository:  
https://github.com/ilhamfp/indonesian-text-classification-multilingual

In [1]:
import numpy as np
import pandas as pd
from load_data import load_dataset_foreign
from extract_feature import FeatureExtractor

START = 30000
END   = 37500
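
The bounds above, together with the trailing 5 in the title, are consistent with this kernel processing the fifth chunk of 7,500 rows per class. That reading is an assumption, as the notebook does not state it. Under that assumption, a minimal sketch of deriving the bounds from a chunk number could look like this (`chunk_bounds` and `CHUNK_SIZE` are hypothetical names, not part of the repository):

```python
CHUNK_SIZE = 7500  # assumed number of rows per class handled by one kernel


def chunk_bounds(chunk_number, chunk_size=CHUNK_SIZE):
    """Return (start, end) row positions for a 1-based chunk number."""
    if chunk_number < 1:
        raise ValueError("chunk_number must be >= 1")
    start = (chunk_number - 1) * chunk_size
    return start, start + chunk_size


START, END = chunk_bounds(5)
print(START, END)  # 30000 37500
```

Deriving the bounds this way would keep sibling kernels from overlapping or skipping rows when the chunk size changes.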

## Load Data

In [2]:
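def slice_data(start, end):
    """Return a balanced sample: rows start..end of each class, positives first."""
    data = load_dataset_foreign(data_name='yelp')
    # Split by class and renumber each subset from 0 so both are sliced at the same positions.
    data_pos = data[data['label'] == 1].reset_index(drop=True)
    data_neg = data[data['label'] == 0].reset_index(drop=True)

    train = pd.concat([data_pos[start:end],
                       data_neg[start:end]]).reset_index(drop=True)
    return train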

train = slice_data(START, END)
print(train.shape)
train.head()

~~~Data~~~
Shape:  (560000, 2)
   label                                               text
0      0  unfortunately the frustration of being dr gold...
1      1  been going to dr goldberg for over 10 years i ...

Label:
1    280000
0    280000
Name: label, dtype: int64
(15000, 2)


   label                                               text
0      1  i first came to the beehive 3 years ago wow th...
1      1  i lived in pittsburgh from 2004 2006 for a bit...
2      1  i spend a lot of time in coffee shops let me c...
3      1  huge coffeeshop with extended vegetarian menu ...
4      1  in the burgh for thanksgiving and had to stop ...
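
The slicing step takes the same positional range from each class, which is why the result has 15,000 rows (7,500 positive followed by 7,500 negative) and why `head()` shows only label 1. As a small illustration on made-up data (the `toy` frame and `slice_balanced` are hypothetical stand-ins, not the project's loader):

```python
import pandas as pd

# Toy stand-in for the Yelp data: 4 negative and 4 positive rows, interleaved.
toy = pd.DataFrame({
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
    "text": [f"review {i}" for i in range(8)],
})


def slice_balanced(data, start, end):
    """Take rows start..end (positional) from each class, positives first."""
    pos = data[data["label"] == 1].reset_index(drop=True)
    neg = data[data["label"] == 0].reset_index(drop=True)
    return pd.concat([pos[start:end], neg[start:end]]).reset_index(drop=True)


sample = slice_balanced(toy, 1, 3)
print(sample["label"].tolist())  # [1, 1, 0, 0]
print(sample["text"].tolist())   # ['review 3', 'review 5', 'review 2', 'review 4']
```

Because the positives come first, any later train/validation split on this output would need shuffling or stratification to stay balanced.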


## Extract Feature

In [3]:
FE = FeatureExtractor(model_name='xlm-r')

Downloading: "https://github.com/pytorch/fairseq/archive/master.zip" to /root/.cache/torch/hub/master.zip


running build_ext
cythoning fairseq/data/data_utils_fast.pyx to fairseq/data/data_utils_fast.cpp
cythoning fairseq/data/token_block_utils_fast.pyx to fairseq/data/token_block_utils_fast.cpp
building 'fairseq.libbleu' extension
creating build
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/fairseq
creating build/temp.linux-x86_64-3.6/fairseq/clib
creating build/temp.linux-x86_64-3.6/fairseq/clib/libbleu
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.6m -c fairseq/clib/libbleu/libbleu.cpp -o build/temp.linux-x86_64-3.6/fairseq/clib/libbleu/libbleu.o -std=c++11 -O3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=libbleu -D_GLIBCXX_USE_CXX11_ABI=0
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.6m -c fairseq/clib/libbleu/module.cpp -o bui

100%|██████████| 1028340964/1028340964 [01:15<00:00, 13553607.18B/s]


In [4]:
train['text'] = train['text'].apply(FE.extract_features)
train.head()

   label                                               text
0      1  [[0.027621489, -0.013727403, 0.11518512, 0.004...
1      1  [[0.011348342, 0.014897246, 0.12003265, 0.0325...
2      1  [[0.06116158, -0.07809255, 0.03845828, 0.01635...
3      1  [[0.030743556, 0.004215678, 0.14954628, 0.0136...
4      1  [[0.010227636, -0.01266838, 0.13766402, 0.0021...


## Saving Results

In [5]:
np.save("train_text.npy", train['text'].values)

In [6]:
train['label'].to_csv('train_label.csv', index=False, header=['label'])
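
Since the `text` column holds one array per row, `train['text'].values` is an object array, so NumPy pickles it on save and `np.load` needs `allow_pickle=True` to read it back. The following is a rough round-trip sketch using stand-in features; the `(1, 4)` feature shape and the temporary directory are assumptions for illustration only, and the real feature width depends on the model.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for train['text'].values: an object array holding one (1, 4) array per row.
features = np.empty(3, dtype=object)
for i in range(3):
    features[i] = np.full((1, 4), i, dtype=np.float32)
labels = pd.Series([1, 1, 0], name="label")

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "train_text.npy")
    label_path = os.path.join(tmp, "train_label.csv")

    # Same calls as the cells above, pointed at a temporary directory.
    np.save(text_path, features)
    labels.to_csv(label_path, index=False, header=["label"])

    # Object arrays are stored with pickle, so loading must opt in explicitly.
    loaded = np.load(text_path, allow_pickle=True)
    loaded_labels = pd.read_csv(label_path)["label"].values

# Stack the per-row arrays into one 2-D matrix for a downstream classifier.
X = np.vstack(list(loaded))
print(X.shape, loaded_labels.shape)
```

Only load pickled `.npy` files from sources you trust, since unpickling can execute arbitrary code.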