# Building an engagement classifier

In this notebook, we will build a simple, fast, and accurate text classification model in three steps. We will fine-tune a binary BERT model that classifies a piece of text as either high or low on a target measure of your choice.

You will need to use your own dataset: because of privacy concerns, I can't share the social media data I'm using in this tutorial.

Each entry in the dataset should include a 'text' field and a target measure that you want to predict. In my case, the text is an Instagram post and the target measure is an engagement metric: the number of likes for the post. The target measure is converted into a binary label by assigning posts in the top 50% by likes the high label (1) and posts in the bottom 50% the low label (0).
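
As a minimal sketch of the median split described above, assuming a toy set of like counts (the numbers are made up for illustration), `pd.qcut` can assign the two labels directly:

```python
import pandas as pd

# Hypothetical like counts for six posts (illustrative values only)
likes = pd.Series([10, 200, 35, 980, 60, 410])

# Two equal-sized quantile bins: bottom half -> 'low', top half -> 'high'
labels = pd.qcut(likes, q=2, labels=['low', 'high'])

print(labels.tolist())
```

With an odd number of posts, or with many posts tied at the median, the two classes may not come out exactly the same size.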


In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # make only the first GPU visible

In [None]:
# Import the needed libraries

# importing ktrain
import ktrain
from ktrain import text

# importing tensorflow
import tensorflow as tf
from tensorflow.keras import activations

import math
import os
import re

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.utils.class_weight import compute_class_weight
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

seed_value = 13  # random seed used when shuffling the data
classes_number = 2  # binary classification: low vs. high
engagement_metric = 'likes'  # the name of the target measure in my dataset

# Path to the data file. Use a full path if the file is not in the same folder as this notebook.
file_name = 'yourDataFile.csv'  # change this to your own CSV file with 'text' and 'likes' fields



In [None]:
# Function to clean the text
# Thanks to this repository, from which the clean function is taken:
# https://github.com/Hind-Almerekhi/toxicityChangesReddit/blob/main/Classification/fineTunedBERT.ipynb

def clean(text, newline=True, quote=True, bullet_point=True, 
          link=True, strikethrough=True, spoiler=True,
          code=True, superscript=True, table=True, heading=True):
    """
    Cleans text (string).
      * \n (newlines)
      * &gt; (> quotes)
      * * or &amp;#x200B; (bullet points)
      * []() (links)
      * etc (see below)
    Specific removals can be turned off, but everything is on by default.
    Note: the first substitution below replaces every non-letter character
    (digits, punctuation, emoji, newlines) with a space, so the
    markup-specific removals that follow mostly find nothing left to remove.
    """
    # Replace all non-letters with spaces (this also turns newlines into spaces)
    text = re.sub("[^a-zA-Z]",  # Search for all non-letters
                  " ",          # Replace all non-letters with spaces
                  str(text))
    
    if newline:
        text = re.sub(r'\n+', ' ', text)

        # Remove resulting ' '
        text = text.strip()
        text = re.sub(r'\s\s+', ' ', text)

    # > Quotes
    if quote:
        text = re.sub(r'\"?\\?&?gt;?', '', text)

    # Bullet points/asterisk (bold/italic)
    if bullet_point:
        text = re.sub(r'\*', '', text)
        text = re.sub('&amp;#x200B;', '', text)

    # []() Link (Also removes the hyperlink)
    if link:
        text = re.sub(r'\[.*?\]\(.*?\)', '', text)

    # Strikethrough
    if strikethrough:
        text = re.sub('~', '', text)

    # Spoiler, which is used with < less-than (Preserves the text)
    if spoiler:
        text = re.sub('&lt;', '', text)
        text = re.sub(r'!(.*?)!', r'\1', text)

    # Code, inline and block
    if code:
        text = re.sub('`', '', text)

    # Superscript (Preserves the text)
    if superscript:
        text = re.sub(r'\^\((.*?)\)', r'\1', text)

    # Table
    if table:
        text = re.sub(r'\|', ' ', text)
        text = re.sub(':-', '', text)

    # Heading
    if heading:
        text = re.sub('#', '', text)
    return text
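
To see what the first substitution in `clean` does on its own, here is a small sketch. `letters_only` is a hypothetical helper written for this illustration, not part of the original function, and the sample caption is made up.

```python
import re

def letters_only(raw):
    """Keep ASCII letters and turn everything else into single spaces."""
    spaced = re.sub(r"[^a-zA-Z]", " ", str(raw))
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", spaced).strip()

sample = "Check #BERT2024!\nNew post :)"
cleaned = letters_only(sample)
print(cleaned)
```

Digits, hashtag symbols, emoji, and any letters outside a-z and A-Z are dropped, which is worth keeping in mind for multilingual posts.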
          

## STEP 1:  Loading and Preprocessing the Dataset
We set `val_pct` to 0.2, which automatically samples 20% of the data for validation. We specify `preprocess_mode='bert'` because we will be fine-tuning a BERT model in this example.
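
To make the effect of a 20% validation share concrete, here is a plain-Python sketch of a random 80/20 holdout split. `holdout_split` and the toy posts are hypothetical and only illustrate the idea; this is not how ktrain implements the split internally.

```python
import random

def holdout_split(items, val_pct=0.2, seed=13):
    """Shuffle a copy of items and hold out val_pct of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_pct))
    return shuffled[n_val:], shuffled[:n_val]

posts = [f"post {i}" for i in range(10)]
train, val = holdout_split(posts)
print(len(train), len(val))
```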


In [None]:
# Reading the data file
df = pd.read_csv(file_name)

# Shuffle the rows reproducibly, then clean the text
df = df.sample(frac=1, random_state=seed_value)
df['text'] = df['text'].apply(clean)


# Remove outliers: drop records at or beyond the 1st and 99th percentiles of the metric
q_low = df[engagement_metric].quantile(0.01)
q_hi  = df[engagement_metric].quantile(0.99)
df = df[(df[engagement_metric] < q_hi) & (df[engagement_metric] > q_low)]


# Labeling the data based on the quantile value:
# the top 50% of the records are labeled 1, and the bottom 50% are labeled 0
df[engagement_metric]=pd.qcut(df[engagement_metric], classes_number, labels=list(range(0,classes_number)))

# Keep only the columns we need from the dataframe
df = df[[engagement_metric, 'text']] 

# Labeling the target as high or low by converting 1 to 'high' and 0 to 'low'
df[engagement_metric] = df[engagement_metric].apply(lambda x: 'low' if x <1 else 'high')

# Renaming the columns to label and text
df.columns = ['label', 'text'] 
print(df.head())

# We need a separate column for each label, so the training dataframe ends up with 3 columns
# (text, low, high). This is the format needed for the binary BERT model.
df = pd.concat([df, df.label.astype('str').str.get_dummies()], axis=1, sort=False)
df = df[['text', 'low', 'high']]
df.head()
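
The one-hot step above can be checked on a tiny example. The three rows below are hypothetical; the point is that `str.get_dummies` returns its columns in alphabetical order, so they are reordered to match `label_columns=['low', 'high']`.

```python
import pandas as pd

# Hypothetical labels for three posts
toy = pd.DataFrame({'text': ['a', 'b', 'c'], 'label': ['low', 'high', 'high']})

# One indicator column per label value; columns come back sorted alphabetically
indicators = toy['label'].str.get_dummies()

# Reorder so that 'low' comes first, then attach to the text column
toy = pd.concat([toy[['text']], indicators[['low', 'high']]], axis=1)
print(toy)
```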


In [None]:
# Splitting the data into training and validation sets (20% held out for validation)
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(df, 
                                                                   'text', # the column containing text
                                                                   label_columns=['low', 'high'],#The labels
                                                                   maxlen=256, 
                                                                   max_features=100000,
                                                                   preprocess_mode='bert',
                                                                   val_pct=0.2)

## STEP 2:  Creating the BERT Model and Wrapping It in a Learner Object

We will create a pretrained BERT classifier with ktrain's `text_classifier` and wrap it in a Learner object for training.

In [None]:
# Creating the BERT model using text_classifier 
model = text.text_classifier('bert', (x_train, y_train) , preproc=preproc)
learner = ktrain.get_learner(model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=16)

## STEP 3: Training the BERT Model
We will use the `fit_onecycle` method, which applies a 1cycle learning rate policy with a maximum learning rate of 5e-5, and train for 4 epochs.
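
For intuition about what a 1cycle policy looks like, here is a sketch of a simple triangular schedule that ramps the learning rate up to its maximum and back down. `one_cycle_lr` and its `start_div` parameter are assumptions made for this illustration; the exact schedule that ktrain applies may differ.

```python
def one_cycle_lr(step, total_steps, max_lr=5e-5, start_div=10.0):
    """Triangular 1cycle-style schedule (illustrative shape only)."""
    min_lr = max_lr / start_div  # assumed starting rate, not taken from ktrain
    half = total_steps / 2
    if step <= half:
        frac = step / half  # warm-up: fraction of the way to the peak
    else:
        frac = (total_steps - step) / half  # cool-down: back toward min_lr
    return min_lr + frac * (max_lr - min_lr)

schedule = [one_cycle_lr(s, 100) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate starts low, peaks at the midpoint of training, and returns to its starting value by the final step.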

In [None]:
learner.fit_onecycle(5e-5, 4)

In [None]:
learner.validate()  # validating the model
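
Standard classification metrics can also be computed by hand, which helps when interpreting validation results. The sketch below uses made-up true and predicted labels (1 for high, 0 for low); `binary_metrics` is a hypothetical helper, not part of ktrain.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels only, not results from the model in this notebook
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because the labels come from a median split, the two classes are roughly balanced, so accuracy is a reasonable headline number here.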

### Making Predictions on New Data

In [None]:
p = ktrain.get_predictor(learner.model, preproc)

Predicting label for:
> *"#Austria’s supreme court has dropped its terrorism case against university professor #FaridHafez after an Al Jazeera documentary revealed charges were based on false evidence and fabricated accusations. We spoke to him about the case."*

In [None]:
p.predict("#Austria’s supreme court has dropped its terrorism case against university professor #FaridHafez after an Al Jazeera documentary revealed charges were based on false evidence and fabricated accusations. We spoke to him about the case.")


### Save our Predictor for Later Deployment

In [None]:
# save model for later use
p.save('Trained_BERT_model_4epochs')

In [1]:
# reload from disk
p_saved = ktrain.load_predictor('Trained_BERT_model_4epochs')

In [None]:
# prediction using the saved model after loading
p_saved.predict("#Austria’s supreme court has dropped its terrorism case against university professor #FaridHafez after an Al Jazeera documentary revealed charges were based on false evidence and fabricated accusations. We spoke to him about the case.")
