<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Libraries" data-toc-modified-id="Libraries-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#EDA" data-toc-modified-id="EDA-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>EDA</a></span><ul class="toc-item"><li><span><a href="#Target-labels" data-toc-modified-id="Target-labels-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Target labels</a></span></li><li><span><a href="#Input-text" data-toc-modified-id="Input-text-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Input text</a></span></li></ul></li><li><span><a href="#Train-and-test-splits" data-toc-modified-id="Train-and-test-splits-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Train and test splits</a></span></li><li><span><a href="#Models" data-toc-modified-id="Models-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Models</a></span><ul class="toc-item"><li><span><a href="#Random-Classifier" data-toc-modified-id="Random-Classifier-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Random Classifier</a></span></li><li><span><a href="#Most-Frequent-Classifier" data-toc-modified-id="Most-Frequent-Classifier-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Most Frequent Classifier</a></span></li><li><span><a href="#Multilingual-BERT" data-toc-modified-id="Multilingual-BERT-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Multilingual BERT</a></span></li></ul></li></ul></div>

# Introduction

This notebook contains my submission for Media Prima Digital's Data Science Assessment.

# Setup

In [1]:
%matplotlib inline

Navigate to the project's root directory:

In [2]:
%cd ..

C:\Users\mshukri\Documents\GitHub\mpd


# Libraries

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import src.metrics as metrics

# Data

Load the data:

In [4]:
df = pd.read_excel('data/MPD data science assessment NLP v1.1.xlsx',
                    sheet_name='nstp_articles_sample',
                    keep_default_na=False)

In [5]:
df.describe()

| | title | body | pub | sect | subsect |
|---|---|---|---|---|---|
| count | 1428 | 1428 | 1428 | 1428 | 1428 |
| unique | 1361 | 1405 | 3 | 27 | 46 |
| top | Keputusan Liga Perdana Inggeris | © New Straits Times Press (M) Bhd\n\n | bh | berita | |
| freq | 4 | 3 | 569 | 297 | 536 |


Create input and target columns:

In [6]:
df['title'].str.cat(df['body'], sep=' ').str

<pandas.core.strings.StringMethods at 0x2933f1dd978>

In [7]:
df['text'] = df['title'].str.cat(df['body'], sep=' ').str.replace('\n', ' ')
df.head()

| | title | body | pub | sect | subsect | text |
|---|---|---|---|---|---|---|
| 0 | Iran warns protesters will &#039;pay the price... | TEHRAN: Iran warned on Sunday that protesters ... | nst | world | | Iran warns protesters will &#039;pay the price... |
| 1 | Tiga minit, RM1.5 juta | [kalidevi@hmetro.com.my](mailto:kalidevi@hmetr... | hm | mutakhir | | Tiga minit, RM1.5 juta [kalidevi@hmetro.com.my... |
| 2 | Tiada tol di Lebuhraya Persekutuan selepas 24 ... | [norrasyidah@bh.com.my](mailto:norrasyidah@bh.... | bh | berita | nasional | Tiada tol di Lebuhraya Persekutuan selepas 24 ... |
| 3 | Pogba urges misfiring Man Utd to &#039;wake up... | LONDON: Paul Pogba has called on Manchester Un... | nst | sports | football | Pogba urges misfiring Man Utd to &#039;wake up... |
| 4 | Hundreds of Indonesian couples ring in the new... | JAKARTA: Hundreds of Indonesian couples celebr... | nst | world | | Hundreds of Indonesian couples ring in the new... |


In [8]:
def create_target(rows):
    sect, subsect = rows
    return subsect if subsect else sect

df['target'] = df[['sect', 'subsect']].apply(create_target, axis=1)
df.head()

| | title | body | pub | sect | subsect | text | target |
|---|---|---|---|---|---|---|---|
| 0 | Iran warns protesters will &#039;pay the price... | TEHRAN: Iran warned on Sunday that protesters ... | nst | world | | Iran warns protesters will &#039;pay the price... | world |
| 1 | Tiga minit, RM1.5 juta | [kalidevi@hmetro.com.my](mailto:kalidevi@hmetr... | hm | mutakhir | | Tiga minit, RM1.5 juta [kalidevi@hmetro.com.my... | mutakhir |
| 2 | Tiada tol di Lebuhraya Persekutuan selepas 24 ... | [norrasyidah@bh.com.my](mailto:norrasyidah@bh.... | bh | berita | nasional | Tiada tol di Lebuhraya Persekutuan selepas 24 ... | nasional |
| 3 | Pogba urges misfiring Man Utd to &#039;wake up... | LONDON: Paul Pogba has called on Manchester Un... | nst | sports | football | Pogba urges misfiring Man Utd to &#039;wake up... | football |
| 4 | Hundreds of Indonesian couples ring in the new... | JAKARTA: Hundreds of Indonesian couples celebr... | nst | world | | Hundreds of Indonesian couples ring in the new... | world |


I assume that the goal is to classify `text` according to `target`.
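
The row-wise `apply` used to build `target` can also be written as a single vectorised expression. A minimal sketch on a toy frame (the `toy` data is made up for illustration; only the `sect` and `subsect` column names come from the notebook):

```python
import pandas as pd

# Toy frame mirroring the `sect` / `subsect` columns. Empty strings mark a
# missing sub-section, as they do after loading with keep_default_na=False.
toy = pd.DataFrame({
    'sect': ['world', 'berita', 'sports'],
    'subsect': ['', 'nasional', 'football'],
})

# Keep `subsect` where it is non-empty, otherwise fall back to `sect`.
toy['target'] = toy['subsect'].where(toy['subsect'] != '', toy['sect'])
print(toy['target'].tolist())  # ['world', 'nasional', 'football']
```

Both forms give the same labels; the vectorised one avoids a Python-level call per row.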

# EDA

## Target labels

View distribution of target labels:

In [9]:
labels_df = df\
    .groupby('target')\
    .size()\
    .to_frame('count')\
    .sort_values('count', ascending=False)

labels_df['count'].quantile([0, 0.25, 0.5, 0.75, 0.90, 0.95, 0.99, 1])

0.00      1.0
0.25      3.0
0.50     12.0
0.75     30.0
0.90     74.2
0.95     96.1
0.99    134.0
1.00    134.0
Name: count, dtype: float64

Replace targets that occur fewer than 10 times with a single 'unknown' class:

In [10]:
minority_labels = labels_df.query('count < 10').reset_index()['target'].tolist()

df['target'] = df['target'].where(~df['target'].isin(minority_labels), 'unknown')
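
The same collapsing step can be packaged as a small reusable helper. This is a sketch under assumptions: `collapse_rare_labels` and the toy series are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def collapse_rare_labels(labels, min_count, other='unknown'):
    """Replace labels seen fewer than `min_count` times with `other`."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    # Series.where keeps values where the condition holds and substitutes
    # `other` elsewhere.
    return labels.where(~labels.isin(rare), other)

toy = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])
collapsed = collapse_rare_labels(toy, min_count=2)
print(collapsed.tolist())  # ['a', 'a', 'a', 'b', 'b', 'unknown']
```

Note that the counts here are taken over the full dataset, before the train/dev split, which mirrors what the cell above does.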

## Input text

View distribution of `text` lengths:

In [11]:
df['text_len'] = df['text'].str.split(' ').str.len()

df['text_len'].quantile([0, 0.25, 0.5, 0.75, 0.90, 0.95, 1])

0.00      10.00
0.25     205.00
0.50     278.00
0.75     367.25
0.90     512.30
0.95     673.00
1.00    3160.00
Name: text_len, dtype: float64

How many articles exceed 512 words?

In [12]:
n, _ = df.query('text_len > 512').shape
print(f'There are {n} articles with more than 512 words')

There are 143 articles with more than 512 words
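
The 512 threshold matters because BERT accepts at most 512 WordPiece tokens per input, special tokens included, and a word usually maps to one or more tokens, so these articles will certainly need shortening. How `src/train.py` handles this is not shown here. One common option, sketched below purely as an assumption, is to keep the start and the end of the article; `truncate_head_tail` and its default `head_len` are hypothetical.

```python
def truncate_head_tail(tokens, max_len=512, head_len=128):
    """Keep the first `head_len` tokens and fill the rest from the end."""
    if not 0 <= head_len <= max_len:
        raise ValueError('head_len must lie between 0 and max_len')
    if len(tokens) <= max_len:
        return list(tokens)
    tail_len = max_len - head_len
    # Slice from an explicit start index so that tail_len == 0 yields an
    # empty tail instead of the whole sequence.
    return list(tokens[:head_len]) + list(tokens[len(tokens) - tail_len:])

tokens = list(range(10))
print(truncate_head_tail(tokens, max_len=6, head_len=2))  # [0, 1, 6, 7, 8, 9]
```

In practice the function would be applied to tokenizer output rather than whitespace-split words, with `max_len` reduced to leave room for the special tokens.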


# Train and test splits

In [13]:
input_df = df[['text', 'pub', 'target']]

Recode `target` column:

In [14]:
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(input_df['target'])

print('class\tlabel')
for idx, label in enumerate(label_encoder.classes_):
    print(f'{idx}\t{label}')

class	label
0	amerika
1	arena
2	asia
3	bisnes
4	bola
5	business
6	columnists
7	crime-courts
8	football
9	global
10	groove
11	hati
12	kes
13	korporat
14	lain-lain
15	mutakhir
16	nasional
17	nation
18	nuansa
19	others
20	pasaran
21	pendidikan
22	politics
23	politik
24	raket
25	rap
26	selebriti
27	setempat
28	unknown
29	utama
30	wilayah
31	world


In [15]:
# Build a new frame rather than assigning into a slice of `df`, which is
# what triggers pandas' SettingWithCopyWarning.
input_df = input_df.assign(target=label_encoder.transform(input_df['target']))


In [16]:
X_train, X_dev, Y_train, Y_dev = train_test_split(input_df[['text', 'pub']],
                                                  input_df['target'],
                                                  test_size=0.3,
                                                  random_state=123, 
                                                  shuffle=True,
                                                  stratify=input_df['target'])

In [17]:
train_df = pd.concat([X_train, Y_train], axis=1)
dev_df = pd.concat([X_dev, Y_dev], axis=1)

In [18]:
train_df.to_csv('data/train.tsv', header=True, index=False, sep='\t')
dev_df.to_csv('data/dev.tsv', header=True, index=False, sep='\t')
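
Passing `stratify` asks `train_test_split` to preserve the class proportions in both splits. A quick way to check this, sketched on made-up data (the `toy` frame is hypothetical; the split arguments match the call above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 70% class 0, 20% class 1, 10% class 2.
toy = pd.DataFrame({
    'text': [f'doc {i}' for i in range(100)],
    'target': [0] * 70 + [1] * 20 + [2] * 10,
})

train, dev = train_test_split(toy, test_size=0.3, random_state=123,
                              shuffle=True, stratify=toy['target'])

# Class shares should be the same in both splits.
print(train['target'].value_counts(normalize=True).sort_index())
print(dev['target'].value_counts(normalize=True).sort_index())
```

The same two `value_counts` calls can be run on `train_df` and `dev_df` to confirm the real split.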

# Models

## Random Classifier

In [19]:
np.random.seed(123)
y_random = np.random.choice(label_encoder.classes_, size=Y_dev.size)
y_random = label_encoder.transform(y_random)

random_df = dev_df.copy()
random_df['pred'] = y_random
metrics_random_classifier = metrics.compute_eval_metrics(random_df)



In [20]:
metrics.print_metrics(metrics_random_classifier)

Overall accuracy:     0.0350
Overall f1-macro:     0.0344

BM accuracy:          0.0375
BM f1-macro:          0.0311

EN accuracy:          0.0294
EN f1-macro:          0.0131
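
For reference, guessing uniformly over 32 classes should be right about 1/32 ≈ 0.031 of the time, which is in line with the overall accuracy above. The `src.metrics` module itself is not shown in this notebook. The sketch below is an assumption about what it might compute: accuracy and macro-F1 overall and per language, treating `bh` and `hm` as BM and `nst` as EN. All function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over labels seen in truth or prediction."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def metrics_by_language(rows):
    """rows: iterable of (pub, target, pred) tuples."""
    rows = list(rows)
    groups = {
        'Overall': rows,
        'BM': [r for r in rows if r[0] in ('bh', 'hm')],  # assumed Malay titles
        'EN': [r for r in rows if r[0] == 'nst'],         # assumed English title
    }
    result = {}
    for name, group in groups.items():
        y_true = [r[1] for r in group]
        y_pred = [r[2] for r in group]
        result[name] = {'accuracy': accuracy(y_true, y_pred),
                        'f1_macro': f1_macro(y_true, y_pred)}
    return result

toy_rows = [('nst', 0, 0), ('nst', 1, 0), ('bh', 2, 2), ('hm', 2, 3)]
toy_metrics = metrics_by_language(toy_rows)
print(toy_metrics['Overall'])
```

Because macro-F1 weights every class equally, classes that are never predicted correctly pull the score towards zero regardless of how rare they are.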



## Most Frequent Classifier

In [21]:
# idxmax() returns the class label with the highest count; indexing the
# sorted counts with [0] would return a count rather than a label.
most_freq_class = Y_train.value_counts().idxmax()

most_freq_df = dev_df.copy()
most_freq_df['pred'] = most_freq_class
metrics_most_freq_classifier = metrics.compute_eval_metrics(most_freq_df)



In [22]:
metrics.print_metrics(metrics_most_freq_classifier)

Overall accuracy:     0.0932
Overall f1-macro:     0.0053

BM accuracy:          0.1365
BM f1-macro:          0.0104

EN accuracy:          0.0000
EN f1-macro:          0.0000
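
The gap between accuracy and macro-F1 for a constant predictor follows from simple arithmetic. If the predicted class makes up a share p of the evaluation set, precision is p and recall is 1 for that class, while every other class scores 0. The sketch below assumes that all 32 classes appear in the evaluation set and that macro-F1 averages over them; `constant_predictor_scores` is a hypothetical helper.

```python
def constant_predictor_scores(majority_share, n_classes):
    """Accuracy and macro-F1 when every prediction is the same class."""
    precision = majority_share  # fraction of predictions that are correct
    recall = 1.0                # every true member of the class is found
    f1_majority = 2 * precision * recall / (precision + recall)
    # All other classes contribute an F1 of zero to the macro average.
    return majority_share, f1_majority / n_classes

acc, f1 = constant_predictor_scores(0.0932, 32)
print(f'{acc:.4f} {f1:.4f}')  # 0.0932 0.0053
```

Plugging in the overall accuracy reported above reproduces the reported overall macro-F1, which suggests the metric is averaged over all 32 classes.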



## Multilingual BERT

In [23]:
!source activate mpd; python src/train.py --data_dir ~/mpd_data --output_dir ~/mpd_output --num_classes 32