[Link to Example](https://medium.com/geekculture/hugging-face-distilbert-tensorflow-for-custom-text-classification-1ad4a49e26a7)

In [24]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from sklearn.datasets import fetch_20newsgroups
import tensorflow as tf
import transformers
from transformers import DistilBertTokenizer
from transformers import TFDistilBertForSequenceClassification
pd.set_option('display.max_colwidth', None)

import warnings
warnings.filterwarnings('ignore')

In [7]:
# Variables:
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
BATCH_SIZE = 16
N_EPOCHS = 3

In [8]:
dataset = fetch_20newsgroups(subset ='train', 
                             remove=('headers', 'footers', 'quotes'), 
                             shuffle=True, 
                             random_state=42)

df = pd.DataFrame()
df['text'] = dataset.data
df['source'] = dataset.target

label = []

for i in df['source']:
    label.append(dataset.target_names[i])

df['label']=label

# Remove source feature as label will be encoded for dependant:
df.drop(['source'], axis = 1, inplace = True)

def label_targets(a):
    if a['label'] == 'talk.politics.misc' or a['label'] == 'talk.politics.guns' or a['label'] == 'talk.politics.mideast':
        return 'politics'

    elif a['label'] == 'rec.sport.hockey' or a['label'] == 'rec.sport.baseball':
        return 'sport'

    elif a['label'] == 'soc.religion.christian' or a['label'] == 'talk.religion.misc':
        return 'religion'

    elif a['label'] == 'comp.windows.x' or a['label'] == 'comp.sys.ibm.pc.hardware' or a['label'] == 'comp.os.ms-windows.misc' or a['label'] == 'comp.graphics' or a['label'] == 'comp.sys.mac.hardware':
        return 'computer'

    elif a['label'] == 'misc.forsale':
        return 'sales'

    elif a['label'] == 'rec.autos' or a['label'] == 'rec.motorcycles':
        return 'automobile'

    elif a['label'] == 'sci.crypt' or a['label'] == 'sci.electronics' or a['label'] == 'sci.space':
        return 'science'

    elif a['label'] == 'sci.med':
        return 'medicine'

    else:
        return 'ERROR'

df['label'] = df.apply(label_targets, axis = 1)

# Apply word count on text data:
df['Number_of_words'] = df['text'].apply(lambda x:len(str(x).split()))

# Drop all words that equal 0:
no_text = df[df['Number_of_words']==0]

# drop these rows
df.drop(no_text.index,inplace=True)

In [9]:
df.head()

Unnamed: 0,text,label,Number_of_words
0,"I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.",automobile,91
1,"A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.",computer,90
2,"well folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i'm in the market for a\nnew machine a bit sooner than i intended to be...\n\ni'm looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected? i'd heard the 185c was supposed to make an\nappearence ""this summer"" but haven't heard anymore on it - and since i\ndon't have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo's just went through recently?\n\n* what's the impression of the display on the 180? i could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don't really have\na feel for how much ""better"" the display is (yea, it looks great in the\nstore, but is that all ""wow"" or is it really that good?). could i solicit\nsome opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk size and money hit to get the active display? (i realize\nthis is a real subjective question, but i've only played around with the\nmachines in a computer store breifly and figured the opinions of somebody\nwho actually uses the machine daily might prove helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - if you could email, i'll post a\nsummary (news reading time is at a premium with finals just around the\ncorner... :( )\n--\nTom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering",computer,307
3,\nDo you have Weitek's address/phone number? I'd like to get some information\nabout this chip.\n,computer,15
4,"From article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker):\n\n\nMy understanding is that the 'expected errors' are basically\nknown bugs in the warning system software - things are checked\nthat don't have the right values in yet because they aren't\nset till after launch, and suchlike. Rather than fix the code\nand possibly introduce new bugs, they just tell the crew\n'ok, if you see a warning no. 213 before liftoff, ignore it'.",science,72


In [14]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
df['label'] = label_encoder.fit_transform(df['label'])

from sklearn.model_selection import train_test_split

X= df['text']
y= df['label']
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=4)

In [15]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((8811,), (2203,), (8811,), (2203,))

In [17]:
#define a tokenizer object
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

#tokenize the text
train_encodings = tokenizer(list(x_train.values),
                            truncation=True, 
                            padding=True)
test_encodings = tokenizer(list(x_test.values),
                           truncation=True, 
                           padding=True)

Downloading: 100%|██████████| 232k/232k [00:00<00:00, 1.47MB/s]
Downloading: 100%|██████████| 48.0/48.0 [00:00<?, ?B/s]


In [27]:
#print(f'First Paragraph: \'{x_train[:1]}\'')
#print(f'Input ids: {train_encodings["input_ids"][0]}')
#print(f'Attention mask: {train_encodings["attention_mask"][0]}')

In [28]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings),
                                    list(y_train.values)))
                                    
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings),
                                    list(y_test.values)))

In [29]:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)

#chose the optimizer
optimizerr = tf.keras.optimizers.Adam(learning_rate=5e-5)

#define the loss function 
losss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

#build the model
model.compile(optimizer=optimizerr,
              loss=losss,
              metrics=['accuracy'])

# train the model 
model.fit(train_dataset.shuffle(len(x_train)).batch(BATCH_SIZE),
          epochs=N_EPOCHS,
          batch_size=BATCH_SIZE)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3


InvalidArgumentError:  Received a label value of 8 which is outside the valid range of [0, 2).  Label values: 4 1 2 2 1 8 4 1 3 6 2 1 5 2 2 2
	 [[node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at <ipython-input-29-2e6874b6e573>:15) ]] [Op:__inference_train_function_40512]

Function call stack:
train_function


In [26]:
# model evaluation on the test set
model.evaluate(test_dataset.shuffle(len(x_test)).batch(BATCH_SIZE), 
               return_dict=True, 
               batch_size=BATCH_SIZE)



InvalidArgumentError:  Received a label value of 8 which is outside the valid range of [0, 2).  Label values: 4 7 0 4 8 2 1 2 7 1 1 1 7 2 2 1
	 [[node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at <ipython-input-26-8fb55fccd9df>:2) ]] [Op:__inference_test_function_27589]

Function call stack:
test_function
