## Data Loading

In [1]:
import pandas as pd
import numpy as np

In [2]:
input_df = pd.read_csv('supervised_input.csv')
input_df.head()

Unnamed: 0,Document,Topic_Label
0,"American Airlines Flyer Charged, Banned For Li...",AIRLINE INCIDENTS
1,23 Of The Funniest Tweets About Cats And Dogs ...,FUNNY TWEETS
2,Man Sets Himself On Fire In Apparent Protest O...,HOLIDAYS
3,Russian Cosmonaut Valery Polyakov Who Broke Re...,AIRLINE INCIDENTS
4,4 Russian-Controlled Ukrainian Regions Schedul...,WORLD POLITICS


## Preparing Dataset

In [3]:
docs = input_df['Document'].to_list()
text_labels = input_df['Topic_Label'].to_list()

## Encode Labels - the model expects numeric class IDs, not text

In [4]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(text_labels)

for label, number in zip(encoder.classes_, encoder.transform(encoder.classes_)):
    print(f'{label}: {number}')

AIRLINE INCIDENTS: 0
ANIMALS: 1
ART & HOME: 2
BUSINESS: 3
CLIMATE: 4
CRIME: 5
DATING & MARRIAGE: 6
EDUCATION: 7
FASHION: 8
FOOD & DRINK: 9
FUNNY TWEETS: 10
HEALTH: 11
HOLIDAYS: 12
MENTAL HEALTH: 13
MOVIES: 14
MUSIC: 15
OTHER: 16
PARENTING: 17
QUEER VOICES: 18
ROYAL FAMILY: 19
SCIENCE & HISTORY: 20
SPORTS: 21
STYLE: 22
TECHNOLOGY: 23
TRAVEL: 24
US POLITICS: 25
WEATHER NEWS: 26
WEIRD NEWS: 27
WELLNESS: 28
WORLD POLITICS: 29
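
The mapping above is alphabetical because `LabelEncoder` sorts the distinct labels and numbers them from 0. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation. The helper `encode_labels` is hypothetical, and the sample uses only the five labels visible in the `head()` output, so its codes differ from the full 30-class mapping.

```python
def encode_labels(labels):
    """Number the distinct labels in sorted order, then encode the list."""
    classes = sorted(set(labels))  # plays the role of encoder.classes_
    to_code = {label: code for code, label in enumerate(classes)}
    return classes, [to_code[label] for label in labels]


# Labels taken from the first five rows shown earlier.
sample = [
    "AIRLINE INCIDENTS",
    "FUNNY TWEETS",
    "HOLIDAYS",
    "AIRLINE INCIDENTS",
    "WORLD POLITICS",
]
classes, codes = encode_labels(sample)
print(classes)
print(codes)
```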


In [5]:
data_texts = docs
data_labels = encoded_labels

## Train-Test-Validation Split

In [8]:
from sklearn.model_selection import train_test_split

# First hold out 20% of all rows for validation, stratified by label.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    data_texts, data_labels, test_size=0.2, random_state=42, stratify=data_labels)
# Then hold out 1% of the remaining rows as a small test set.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    train_texts, train_labels, test_size=0.01, random_state=42, stratify=train_labels)

print(len(train_texts))
print(len(train_labels))
print(len(test_texts))
print(len(test_labels))
print(len(val_texts))
print(len(val_labels))

58596
58596
592
592
14797
14797
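
The printed sizes can be reproduced by hand, which also shows the effective proportions of the two-stage split: roughly 79.2% train, 0.8% test and 20% validation. This sketch assumes the held-out count is rounded up, which is consistent with the numbers above. The helper `split_sizes` is hypothetical and uses exact fractions to avoid floating-point rounding.

```python
import math
from fractions import Fraction


def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out), rounding the held-out count up (assumed behavior)."""
    held_out = math.ceil(Fraction(held_out_fraction) * n_samples)
    return n_samples - held_out, held_out


total = 58596 + 592 + 14797                        # all rows, from the printed sizes
remaining, n_val = split_sizes(total, "0.2")       # first split: 20% validation
n_train, n_test = split_sizes(remaining, "0.01")   # second split: 1% of the rest

print(n_train, n_test, n_val)
for name, n in [("train", n_train), ("test", n_test), ("val", n_val)]:
    print(f"{name}: {n / total:.1%}")
```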


## Use Tokenizer to create encodings

In [9]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
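
With `truncation=True` and `padding=True`, over-long inputs are cut to the model's maximum length and shorter ones are padded to the longest sequence in the call, with an attention mask marking the real tokens. The sketch below imitates that on made-up token ids. The helper `pad_batch` is hypothetical, the `max_length=512` and `pad_id=0` defaults are assumptions about `distilbert-base-uncased`, and the plain slice is a simplification (the real tokenizer keeps the closing special token when truncating).

```python
def pad_batch(sequences, pad_id=0, max_length=512):
    """Truncate to max_length, then right-pad to the longest sequence in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up ids; 101 and 102 stand in for the [CLS] and [SEP] markers.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```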



In [10]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

## Model Training

In [11]:
from transformers import TFDistilBertForSequenceClassification

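# Derive the number of classes from the encoder (30 here) instead of hard-coding it.
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=len(encoder.classes_))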

optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=5e-5, epsilon=1e-08)
model.compile(optimizer=optimizer, loss=model.hf_compute_loss, metrics=['accuracy'])

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [12]:
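# Note: with epochs=2, patience=3 can never stop training early;
# the callback only matters if the number of epochs is raised.
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# The datasets are already batched, so batch_size is not passed to fit().
# Validation data does not need shuffling.
model.fit(train_dataset.shuffle(1000).batch(16),
        epochs=2,
        validation_data=val_dataset.batch(16),
        callbacks=[early_stopping])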

Epoch 1/2
Epoch 2/2


<tf_keras.src.callbacks.History at 0x17229ff70>
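
The interaction between `patience=3` and `epochs=2` is easier to see with a small simulation of the patience rule: training stops once the monitored loss has failed to improve for `patience` consecutive epochs. This is a simplified sketch, not the Keras implementation. The helper `epochs_run` and the loss values are hypothetical.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before the patience rule stops training."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0    # improvement resets the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch          # stop after this epoch
    return len(val_losses)            # ran out of epochs first


# Hypothetical validation losses that get worse after the first epoch.
print(epochs_run([1.0, 1.1, 1.2, 1.3, 1.4], patience=3))
print(epochs_run([1.0, 1.1], patience=3))
```

Even in the worst case, stopping needs at least `patience + 1` epochs, so a two-epoch run always completes.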

In [13]:
# Display the model's architecture
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  23070     
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66976542 (255.50 MB)
Trainable params: 66976542 (255.50 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
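
The parameter counts in the summary can be checked by hand. A dense layer has `inputs * outputs` weights plus one bias per output. The sketch assumes a hidden size of 768 for DistilBERT (not shown in the summary) and 4 bytes per float32 parameter. `dense_params` is a hypothetical helper.

```python
hidden_size = 768   # assumed DistilBERT hidden dimension
num_labels = 30     # number of topic classes


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


pre_classifier = dense_params(hidden_size, hidden_size)
classifier = dense_params(hidden_size, num_labels)
total = 66_362_880 + pre_classifier + classifier   # backbone count from the summary
size_mb = total * 4 / 1024**2                      # float32 = 4 bytes

print(pre_classifier, classifier, total, f"{size_mb:.2f} MB")
```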


## Saving the Model

In [14]:
save_directory = "iprinka_news_classifier"

model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

('iprinka_news_classifier/tokenizer_config.json',
 'iprinka_news_classifier/special_tokens_map.json',
 'iprinka_news_classifier/vocab.txt',
 'iprinka_news_classifier/added_tokens.json')

## Loading the Model

In [15]:
loaded_tokenizer = DistilBertTokenizer.from_pretrained(save_directory)
loaded_model = TFDistilBertForSequenceClassification.from_pretrained(save_directory)

Some layers from the model checkpoint at iprinka_news_classifier were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at iprinka_news_classifier and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Test the model

In [16]:
def predict_category(text):

    predict_input = loaded_tokenizer.encode(text,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

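    output = loaded_model(predict_input)

    # Convert logits to class probabilities, then take the top class and its probability.
    probabilities = tf.nn.softmax(output.logits, axis=1)
    prediction_value = tf.argmax(probabilities, axis=1).numpy()[0]
    prediction_probability = tf.reduce_max(probabilities).numpy()

    # Treat low-confidence predictions as unknown (-1).
    if prediction_probability < 0.5:
        prediction_value = -1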

    return prediction_value,prediction_probability
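
The decision rule in `predict_category` can be illustrated without TensorFlow: apply softmax to the logits, take the most probable class, and return -1 when its probability is below the 0.5 threshold. The helpers `softmax` and `pick_class` and the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_class(logits, threshold=0.5):
    """Return (class index or -1, top probability)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return -1, probs[best]   # not confident enough
    return best, probs[best]


print(pick_class([4.0, 1.0, 0.5]))   # one class clearly dominates
print(pick_class([0.2, 0.1, 0.0]))   # near-uniform, falls below the threshold
```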


## Map numeric predictions back to text labels for readability

In [33]:
label_map = {}
for label, number in zip(encoder.classes_, encoder.transform(encoder.classes_)):
    label_map[number] = label

label_map

{np.int64(0): np.str_('AIRLINE INCIDENTS'),
 np.int64(1): np.str_('ANIMALS'),
 np.int64(2): np.str_('ART & HOME'),
 np.int64(3): np.str_('BUSINESS'),
 np.int64(4): np.str_('CLIMATE'),
 np.int64(5): np.str_('CRIME'),
 np.int64(6): np.str_('DATING & MARRIAGE'),
 np.int64(7): np.str_('EDUCATION'),
 np.int64(8): np.str_('FASHION'),
 np.int64(9): np.str_('FOOD & DRINK'),
 np.int64(10): np.str_('FUNNY TWEETS'),
 np.int64(11): np.str_('HEALTH'),
 np.int64(12): np.str_('HOLIDAYS'),
 np.int64(13): np.str_('MENTAL HEALTH'),
 np.int64(14): np.str_('MOVIES'),
 np.int64(15): np.str_('MUSIC'),
 np.int64(16): np.str_('OTHER'),
 np.int64(17): np.str_('PARENTING'),
 np.int64(18): np.str_('QUEER VOICES'),
 np.int64(19): np.str_('ROYAL FAMILY'),
 np.int64(20): np.str_('SCIENCE & HISTORY'),
 np.int64(21): np.str_('SPORTS'),
 np.int64(22): np.str_('STYLE'),
 np.int64(23): np.str_('TECHNOLOGY'),
 np.int64(24): np.str_('TRAVEL'),
 np.int64(25): np.str_('US POLITICS'),
 np.int64(26): np.str_('WEATHER NEWS'),


In [36]:
test_set = test_texts[0:10]

for test_text in test_set:
    print(test_text)
    val, prob = predict_category(test_text)
    label = label_map[val] if val > -1 else "UNKNOWN"
    print(label, prob)


How Some Evangelicals Rationalize Their Support Of Donald Trump [SEP] "Full Frontal with Samantha Bee" asked these conservative Christians why they're supporting Trump.
QUEER VOICES 0.99202865
Russia Detains Anti-Corruption Protesters In Moscow [SEP] The turnout at the demonstrtion was far smaller than at last week's wave of protests.
QUEER VOICES 0.99713624
Post-feast Quinoa Salad Makes Thrifty Use of Thanksgiving Leftovers [SEP] It's a delicious way to use a few extra cups of roasted vegetables from your Thanksgiving dinner. My recipe uses three key Cook for Good ideas that can help you sail through the holidays.
FOOD & DRINK 0.99853826
Valerie Jarrett On Obama's Secret To Finding Fulfillment (VIDEO) [SEP] Watch the video clip above, and head over to our Third Metric page for more inspiration from the conference. In addition
UNKNOWN 0.44775578
Who Owns Donald Trump? The Fiery Speech Senator Chuck Schumer Must Give To Save Our Country [SEP] Here’s the speech Senate Minority Leader Chu