## Preprocessing

Let us first keep only the necessary columns, combine the data into one collection, and convert the labels to binary.

In [None]:
import pandas as pd
import numpy as np


am = pd.read_csv('adverse_media_training.csv.zip')
nam = pd.read_csv('non_adverse_media_training.csv.zip')

# Keep only needed columns
am_cropped = am[['article','title','label']]
nam_cropped = nam[['article','title', 'label']]

# Combine source files and re-label to binary
am = pd.concat(
    [ am_cropped.loc[(am_cropped.label == 'am') | (am_cropped.label == 'am ')],
     nam_cropped.loc[(nam_cropped.label == 'am')] ]
)
am['label'] = 1

nam = pd.concat(
    [ am_cropped.loc[(am_cropped.label == 'nam')], 
     nam_cropped.loc[(nam_cropped.label == 'nam')] ]
)
nam['label'] = 0


# Combine data into one table
data = pd.concat([am,nam])
data


Unnamed: 0,article,title,label
8,"Bernie Madoff, who is scheduled to be sentence...",Top 10 Crooked CEOs,1
10,Published\n\nOne of the world's leading fund m...,Top fund manager forced to resign after BBC in...,1
11,Published\n\nThe founder of US futures broker ...,Peregrine Financial Group boss admits $100m fraud,1
12,WASHINGTON (AP) — An American security contrac...,American accuses Congo officials of unlawful a...,1
17,"A senior figure in the Bitcoin Foundation, whi...",Bitcoin Foundation vice chair arrested for mon...,1
...,...,...,...
513,)--As banks around the world continue fighting...,Leading UK Bank Strengthens Fight Against Risi...,0
514,"The shadow chancellor, Anneliese Dodds, is cal...",Shadow chancellor calls on ministers to fulfil...,0
516,"(Washington, DC) – The way Peru ’s Congress re...",Peru: Ousting of President Threatens Rule of Law,0
517,image copyrightGetty Images\n\nSocial media an...,France gives online firms one hour to pull 'te...,0
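
The relabelling step above can also be written as a single normalise-and-map pass. The following is only a minimal sketch on toy data, not the notebook's actual pipeline: the frames `am_toy` and `nam_toy` and their rows are hypothetical stand-ins for the two CSV files, and stripping whitespace from every label (instead of matching `'am '` in one file only) is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the two source files
am_toy = pd.DataFrame({
    "article": ["text a", "text b", "text c"],
    "title": ["title a", "title b", "title c"],
    "label": ["am", "am ", "nam"],
})
nam_toy = pd.DataFrame({
    "article": ["text d", "text e"],
    "title": ["title d", "title e"],
    "label": ["nam", "random"],
})

combined = pd.concat([am_toy, nam_toy], ignore_index=True)

# Strip stray whitespace, then map the two known labels to 1/0.
# Any other label becomes NaN and is dropped, mirroring the filtering above.
combined["label"] = combined["label"].str.strip().map({"am": 1, "nam": 0})
combined = combined.dropna(subset=["label"]).astype({"label": int})

print(combined["label"].tolist())
```

Unlike the cell above, this keeps the rows in their original file order rather than grouping all positive rows first.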


Now let us combine the article and title columns, remove punctuation, lowercase the text, etc.

In [None]:
data["article"] = data["title"] + " " + data["article"]
data.drop(["title"], axis =1)

Unnamed: 0,article,label
8,"Top 10 Crooked CEOs Bernie Madoff, who is sche...",1
10,Top fund manager forced to resign after BBC in...,1
11,Peregrine Financial Group boss admits $100m fr...,1
12,American accuses Congo officials of unlawful a...,1
17,Bitcoin Foundation vice chair arrested for mon...,1
...,...,...
513,Leading UK Bank Strengthens Fight Against Risi...,0
514,Shadow chancellor calls on ministers to fulfil...,0
516,Peru: Ousting of President Threatens Rule of L...,0
517,France gives online firms one hour to pull 'te...,0
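
String concatenation in pandas propagates missing values, so a row with a missing title would end up with a missing article. Below is a minimal sketch of a guarded version on hypothetical toy rows; whether the real files contain missing titles is not known here.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Some headline", None],
    "article": ["Body of the first article.", "Body of the second article."],
    "label": [1, 0],
})

# Without fillna, the second row becomes NaN
naive = toy["title"] + " " + toy["article"]

# Replace missing titles with an empty string before concatenating
toy["article"] = (toy["title"].fillna("") + " " + toy["article"]).str.strip()

# drop returns a new frame; assign the result to actually remove the column
toy = toy.drop(columns=["title"])
print(toy)
```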


In [None]:
!pip install spacy-langdetect

Successfully installed langdetect-1.0.7 spacy-langdetect-0.1.2


In [None]:
import spacy
import re

nlp = spacy.load('en_core_web_sm')


# Should be (almost) the same as Canberk's, but slightly faster, since the regexes are compiled only once
regex1 = re.compile(r'(http\S+)|(#(\w+))|(@(\w+))|[^\w\s]|(\w*\d\w*)')
regex2 = re.compile(r'( +)|(\n)')

def lemmatize(article):
    article = re.sub(regex1, '', article)
    article = re.sub(regex2,' ', article).strip().lower()
    
    doc = nlp(article)
    lemmatized_article = " ".join(token.lemma_ for token in doc if not token.is_stop)
    
    return lemmatized_article
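
To see what the two regular expressions do on their own, here is a small sketch of the cleaning step without the spaCy part. The helper name `clean` and the sample sentence are made up for illustration; the patterns are copied from the cell above.

```python
import re

regex1 = re.compile(r'(http\S+)|(#(\w+))|(@(\w+))|[^\w\s]|(\w*\d\w*)')
regex2 = re.compile(r'( +)|(\n)')

def clean(text):
    # Remove URLs, hashtags, mentions, punctuation and any word containing a digit
    text = regex1.sub('', text)
    # Turn each run of spaces and each newline into a single space
    return regex2.sub(' ', text).strip().lower()

sample = "Top 10 Crooked CEOs: see https://example.com #fraud @sec\nConvicted in 2002!"
cleaned = clean(sample)

# split() is used because a space right next to a newline can leave two spaces in a row
print(cleaned.split())
```

Note that any word containing a digit is removed entirely, which is why "10" and "2002" disappear rather than being kept as numbers.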

In [None]:
example = data.article[8]
lemmatized = lemmatize(example)

print('Before Lemmatization:')
print()
print(example)
print()

print('After Lemmatization:')
print()
print(lemmatized)
print()

Before Lemmatization:

Top 10 Crooked CEOs Bernie Madoff, who is scheduled to be sentenced June 29 for perpetrating history's biggest Ponzi scheme, is just be the latest in a long line of industry titans turned crooks

CRIMINAL EXECUTIVE OFFICER
Sam Waksal

CEO: ImClone
Convicted: October 15, 2002 of securities fraud, bank fraud, obstruction of justice, and perjury
Known for his networking skills as much as for his scientific expertise, immunologist Sam Waksal founded ImClone in 1984. The New York-based biotech firm remained relatively unknown until 1999, when it announced the creation of Erbitux — a cancer-fighting drug so promising it convinced pharmaceutical giant Bristol-Myers to purchase $1 billion of ImClone stock in one of the largest biotechnology partnerships in U.S. history. But when the Food and Drug Administration rejected the drug, Waksal alerted several relatives and friends to dump their stock as soon as possible — before the FDA's decision had been made public. Waksal's

Lemmatizing the whole dataset:

In [None]:
train = data[['article', 'label']].copy()
train["article"] = train["article"].apply(lemmatize)
train = train.reset_index(drop=True)
train

Unnamed: 0,article,label
0,crooked ceos bernie madoff schedule sentence j...,1
1,fund manager force resign bbc investigation pu...,1
2,peregrine financial group boss admit fraud pub...,1
3,american accuse congo official unlawful arrest...,1
4,bitcoin foundation vice chair arrest money lau...,1
...,...,...
709,lead uk bank strengthen fight rise payment fra...,0
710,shadow chancellor call minister fulfil pledge ...,0
711,peru oust president threaten rule law washingt...,0
712,france give online firm hour pull terrorist co...,0


In [None]:
from google.colab import drive
import glob
drive.mount('/content/drive')

import os
path = '/content/drive/My Drive/Colab Notebooks/Machine Learning/project/Karl'
os.chdir(path)

Mounted at /content/drive


# BERT with Keras:

In [None]:
!pip install keras-bert # https://pypi.org/project/keras-bert/#Download-Pretrained-Checkpoints


We only want the embeddings from a pre-trained BERT model for now, so let us do only that:

In [None]:
import keras_bert

from keras_bert import extract_embeddings, PretrainedList, get_pretrained


model_path = get_pretrained(PretrainedList.multi_cased_base)
example_texts = ['all work and no play', 'makes jack a dull boy~']

embeddings = extract_embeddings(model_path, example_texts)

Downloading data from https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip


BERT produces a vector of length 768 for every WordPiece token in the input text, plus one vector each for the special [CLS] and [SEP] tokens that are added around it.

In [None]:
print(len(embeddings))
embeddings[0].shape

2


(7, 768)
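
The 7 rows are consistent with five single-piece words plus the [CLS] and [SEP] tokens. The sketch below illustrates that count with a toy greedy longest-match WordPiece splitter. It is an assumption-laden illustration, not the tokenizer shipped with keras-bert: the vocabulary `VOCAB` and the helpers `wordpiece` and `tokenize` are hypothetical.

```python
# Toy vocabulary (hypothetical); the real multilingual model has a far larger one
VOCAB = {"all", "work", "and", "no", "play", "make", "##s", "[UNK]"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing in the vocabulary matched
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text, vocab=VOCAB):
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens + ["[SEP]"]

tokens = tokenize("all work and no play")
print(len(tokens), tokens)
print(tokenize("makes"))
```

A word that is missing from the vocabulary is split into several pieces, so the number of embedding rows can exceed the number of words plus two.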

In [None]:
embeddings[0]

array([[ 0.07531585, -0.15103045,  0.16370857, ...,  0.7767107 ,
         0.02733361, -0.02975313],
       [-0.12953976, -0.35776514,  0.02478126, ...,  1.4056004 ,
         0.16759607, -0.29797885],
       [-0.2674758 , -0.26116055,  0.11113371, ...,  1.3588166 ,
         0.10443059, -0.4157839 ],
       ...,
       [-0.34334993, -0.25273603, -0.6840705 , ...,  1.327804  ,
        -0.15623444, -0.47893643],
       [-0.2459211 , -0.12426332, -0.07056609, ...,  1.3984779 ,
        -0.03810974, -0.19883168],
       [-0.02836556, -0.25108787,  0.3347791 , ...,  0.78832954,
         0.0526384 , -0.1193769 ]], dtype=float32)

Let us try to embed our data:

In [None]:
example_article = lemmatized.split()
len(example_article)

165

In [None]:
test_article = " ".join(example_article[:50])

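# NOTE: extract_embeddings expects a list of texts. A bare string is iterated
# character by character, which is why the result below has 363 entries (one per
# character of test_article), the first of shape (3, 768): [CLS], the character, [SEP].
# Pass [test_article] to embed the excerpt as a single text.
article_embedding = extract_embeddings(model_path, test_article)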

In [None]:
print(len(article_embedding))

363


In [None]:
article_embedding[0].shape

(3, 768)

In [None]:
test_article

'crooked ceos bernie madoff schedule sentence june perpetrate historys big ponzi scheme late long line industry titan turn crook criminal executive officer sam waksal ceo imclone convict october securities fraud bank fraud obstruction justice perjury know networking skill scientific expertise immunologist sam waksal found imclone new yorkbase biotech firm remain'

393.64565826330534


Second attempt: https://www.analyticsvidhya.com/blog/2020/10/simple-text-multi-classification-task-using-keras-bert/


In [None]:
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [None]:
!pip install sentencepiece

Successfully installed sentencepiece-0.1.94


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import logging
logging.basicConfig(level=logging.INFO)

In [None]:
import tensorflow_hub as hub
import tokenization
module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(module_url, trainable=True)

INFO:absl:resolver HttpCompressedFileResolver does not support the provided handle.
INFO:absl:resolver GcsCompressedFileResolver does not support the provided handle.


# Possible improvement: 
Read TODO comment below

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            

        # TODO: Should be changed to split the text into chunks, process each chunk separately, and later combine
        text = text[:max_len-2]



        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence) + [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
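
The TODO about splitting long texts could be approached roughly as follows. This is a minimal sketch under stated assumptions, not part of the notebook: the helper `chunk_tokens` and its optional `stride` parameter are hypothetical, and it works on plain token lists rather than on the real tokenizer output.

```python
def chunk_tokens(tokens, max_len=512, stride=None):
    """Split a token list into windows that fit max_len once [CLS]/[SEP] are added."""
    body_len = max_len - 2          # room left after the two special tokens
    stride = stride or body_len     # by default the windows do not overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), stride):
        window = tokens[start:start + body_len]
        chunks.append(["[CLS]"] + window + ["[SEP]"])
        if start + body_len >= len(tokens):
            break                   # the last window already reaches the end
    return chunks

# Ten dummy tokens with a tiny max_len so the windows are easy to inspect
toy_tokens = ["tok%d" % i for i in range(10)]
plain = chunk_tokens(toy_tokens, max_len=6)
overlapping = chunk_tokens(toy_tokens, max_len=6, stride=2)

print(len(plain), [len(c) for c in plain])
print(len(overlapping))
```

Each chunk would then be padded and encoded as in `bert_encode`, and the per-chunk predictions combined afterwards, for example by averaging them; that combination step is one possible choice and is not shown here.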

# Possible improvements: 
Add layers, change parameters, or swap in a different model altogether; it works for now. Just keep the "bert_layer" in there as one of the first steps.

In [None]:
from keras import backend as K

from keras.layers import Dense, LSTM, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.layers import Dropout, Concatenate
from keras.layers import SpatialDropout1D, concatenate

In [None]:
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
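
As a sanity check, the same arithmetic can be followed by hand in plain Python. This is a sketch on made-up numbers: the function name `precision_recall_f1` and the toy labels and probabilities are hypothetical, and `eps` stands in for `K.epsilon()`.

```python
def precision_recall_f1(y_true, y_prob, eps=1e-7):
    # Round probabilities to hard 0/1 predictions, as K.round does
    y_pred = [round(p) for p in y_prob]
    true_positives = sum(t * p for t, p in zip(y_true, y_pred))
    possible_positives = sum(y_true)
    predicted_positives = sum(y_pred)
    recall = true_positives / (possible_positives + eps)
    precision = true_positives / (predicted_positives + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Toy example: 3 real positives, 3 predicted positives, 2 of them correct
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.6, 0.4, 0.8]
precision, recall, f1 = precision_recall_f1(y_true, y_prob)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Keep in mind that the Keras versions are evaluated per batch and then averaged, so the value reported during training can differ from an F1 score computed once over the whole validation set.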

In [None]:
def build_model(bert_layer, max_len=512):
    input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # The LSTM needs the full 3-D token sequence (batch, max_len, 768), not only the [CLS] vector
    x = Bidirectional(LSTM(100, return_sequences=True))(sequence_output)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    # Two softmax units match the one-hot labels and the categorical cross-entropy loss
    outp = Dense(2, activation="softmax")(conc)

    model = tf.keras.models.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=outp)
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='categorical_crossentropy', metrics=['accuracy', f1_m])
    
    return model

In [None]:
## test-train split: 
from sklearn.model_selection import train_test_split
train = pd.read_csv('all_lemmatized.csv', lineterminator='\n').iloc[:, 1:3]

bert_train = train.sample(frac = 1) 

x_train, x_val, y_train, y_val = train_test_split(bert_train['article'], 
                                                    bert_train['label'], 
                                                    test_size=0.1, 
                                                    random_state=42,
                                                    stratify= bert_train['label'])

print(x_train.shape, x_val.shape, y_train.shape, y_val.shape)

(642,) (72,) (642,) (72,)


# Increase the max_len param for better results at the cost of longer training (BERT accepts at most 512 tokens)

In [None]:
import keras
max_len = 500 # Larger takes longer
train_input = bert_encode(x_train, tokenizer, max_len=max_len)
test_input = bert_encode(x_val, tokenizer, max_len=max_len)
train_labels =keras.utils.to_categorical(y_train, num_classes=2)

In [None]:
model = build_model(bert_layer, max_len=max_len)
model.summary()

Model: "functional_19"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 500)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 500)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        multiple             109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]     

In [None]:
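# val_f1_m is a custom metric that should be maximized, so mode='max' is set explicitly;
# the default mode='auto' only infers 'max' for accuracy-like metric names
checkpoint = tf.keras.callbacks.ModelCheckpoint('model.h5', monitor='val_f1_m', mode='max', save_best_only=True, verbose=1)
earlystopping = tf.keras.callbacks.EarlyStopping(monitor='val_f1_m', mode='max', patience=5, verbose=1)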

train_history = model.fit(
    train_input, train_labels, 
    validation_split=0.2,
    epochs=10,
    callbacks=[checkpoint, earlystopping],
    batch_size=8,
    verbose=1)

Epoch 1/10
Epoch 00001: val_accuracy improved from -inf to 0.89922, saving model to model.h5
Epoch 2/10
Epoch 00002: val_accuracy did not improve from 0.89922
Epoch 3/10
Epoch 00003: val_accuracy did not improve from 0.89922
Epoch 4/10
Epoch 00004: val_accuracy did not improve from 0.89922
Epoch 5/10
Epoch 00005: val_accuracy did not improve from 0.89922
Epoch 6/10
Epoch 00006: val_accuracy did not improve from 0.89922
Epoch 00006: early stopping
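
The log is what a patience of 5 produces when the best value is reached in the first epoch: five epochs without improvement follow, and training stops at epoch 6. The sketch below simulates that rule for a metric that should be maximized. The helper `early_stop_epoch` is hypothetical, and only the first value in `val_acc` comes from the log; the others are made up for illustration.

```python
def early_stop_epoch(history, patience):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(history, start=1):
        if value > best:
            best, wait = value, 0   # improvement: remember it and reset the counter
        else:
            wait += 1               # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Only 0.89922 is taken from the log above; the remaining values are invented
val_acc = [0.89922, 0.88, 0.89, 0.87, 0.89, 0.88, 0.90, 0.91, 0.90, 0.90]
stopped_at = early_stop_epoch(val_acc, patience=5)
print(stopped_at)
```

In this made-up history the metric would have improved at epoch 7, which shows how a short patience can end training before a later gain.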


In [None]:
model.load_weights('model.h5')
test_pred = model.predict(test_input)





In [None]:
test_pred.shape

(72, 2)

In [None]:
y_val.shape

(72,)

In [None]:
# from probability to binary
pred = [1 if el[1]> 0.5 else 0 for el in test_pred]
pred[:5]

[1, 1, 0, 1, 1]
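
For a two-column output whose rows sum to 1, thresholding the second column at 0.5 picks the same class as taking the largest entry in each row. A small sketch with hypothetical probability rows:

```python
# Hypothetical softmax outputs: column 0 = non-adverse, column 1 = adverse
probs = [[0.1, 0.9], [0.7, 0.3], [0.45, 0.55]]

# Thresholding the positive-class column, as in the cell above
by_threshold = [1 if row[1] > 0.5 else 0 for row in probs]

# Picking the index of the largest probability in each row
by_argmax = [max(range(len(row)), key=row.__getitem__) for row in probs]

print(by_threshold, by_argmax)
```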

In [None]:
from sklearn.metrics import f1_score


val_f1_score = f1_score(y_val, pred)

print('F1 score for model on validation data:', round(val_f1_score*100, 3))

F1 score for model on validation data: 87.805


In [None]:
public_test = pd.read_csv('../public_test.csv')
public_test

Unnamed: 0,id,title,article,label
0,931,Caputo concealed Cayman Island offshore firms ...,"By Sandra Crucianelli, Emilia Delfino y From B...",1
1,644,California Man Pleads Guilty in $6 Million Art...,A California man pleaded guilty in federal cou...,1
2,881,Couple jailed for laundering £50m,A couple who ran a diamond trading business ha...,1
3,841,John Gilligan charged with money laundering of...,image copyrightRTÉ\n\nA Dublin man has been ch...,1
4,31,Grace Mugabe faces arrest in Mary Chiwenga Sty...,Zimbabwe News\n\nGrace Mugabe faces arrest in ...,1
...,...,...,...,...
154,348,Kanye West's strange presidential bid unravels...,(CNN) Kanye West is on the ballot in Minnesota...,0
155,295,Anti-money laundering software startup TookiTa...,"TookiTaki, a startup that develops machine lea...",0
156,311,If we really want to know what makes terrorist...,In the last two and half years I’ve studied th...,0
157,545,An effective e-declaration system will be a wa...,"BY MARCUS BRAND,\n\nTwo-and-a-half years ago, ...",0


In [None]:
public_test["article"] = public_test["title"] + " " + public_test["article"]
public_test.drop(["title"], axis =1)

INFO:numexpr.utils:NumExpr defaulting to 4 threads.


Unnamed: 0,id,article,label
0,931,Caputo concealed Cayman Island offshore firms ...,1
1,644,California Man Pleads Guilty in $6 Million Art...,1
2,881,Couple jailed for laundering £50m A couple who...,1
3,841,John Gilligan charged with money laundering of...,1
4,31,Grace Mugabe faces arrest in Mary Chiwenga Sty...,1
...,...,...,...
154,348,Kanye West's strange presidential bid unravels...,0
155,295,Anti-money laundering software startup TookiTa...,0
156,311,If we really want to know what makes terrorist...,0
157,545,An effective e-declaration system will be a wa...,0


In [None]:
public_test_lemmatized = public_test[['article', 'label']].copy()
public_test_lemmatized["article"] = public_test_lemmatized["article"].apply(lemmatize)
public_test_lemmatized = public_test_lemmatized.reset_index(drop=True)
public_test_lemmatized

Unnamed: 0,article,label
0,caputo conceal cayman island offshore firm arg...,1
1,california man plead guilty million art fraud ...,1
2,couple jail launder couple run diamond trading...,1
3,john gilligan charge money laundering offence ...,1
4,grace mugabe face arrest mary chiwenga style s...,1
...,...,...
154,kanye west strange presidential bid unravel th...,0
155,antimoney laundering software startup tookitak...,0
156,want know make terrorist commit atrocity half ...,0
157,effective edeclaration system watershed countr...,0


In [None]:

public_test_tokenized = bert_encode(public_test_lemmatized['article'], tokenizer, max_len=max_len)
public_test_pred = model.predict(public_test_tokenized)

public_test_pred = [1 if el[1]> 0.5 else 0 for el in public_test_pred]

public_test_f1_score = f1_score(public_test_lemmatized['label'], public_test_pred)

print('F1 score for model on public test data:', round(public_test_f1_score*100, 3))

F1 score for model on public test data: 94.301
