<a href="https://colab.research.google.com/github/jcdumlao14/CustomSentimentAnalysis-HuggingFace-/blob/main/%F0%9F%A4%97DistilBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Custom Sentiment Analysis with Hugging Face 🤗 - DistilBert**

# **Install the necessary libraries:**

This step installs the Python libraries the notebook needs: Transformers and TensorFlow. pandas and scikit-learn, which are imported later, are assumed to be available already (they typically are on Colab).

In [None]:
! pip install transformers

In [None]:
%%capture
!pip install tensorflow

# **Import the necessary libraries and load the dataset:**

After installation, the libraries are imported: pandas, TensorFlow, scikit-learn's `train_test_split`, and the Hugging Face DistilBERT tokenizer. pandas reads the CSV file of SMS messages, each labeled `ham` or `spam`.

In [None]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

df = pd.read_csv('/content/sms_spam.csv')  # columns: type (ham/spam), text
df.head()

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or £10,000 ..."
4,spam,okmail: Dear Dave this is your final notice to...


In [None]:
df.shape

(5559, 2)

In [None]:
X = list(df['text'])

In [None]:
y = list(df['type'])

In [None]:
y

['ham',
 'ham',
 'ham',
 'spam',
 'spam',
 ...

In [None]:
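# Encode labels as integers: ham -> 0, spam -> 1.
# dtype=int keeps the labels numeric on pandas versions where get_dummies returns booleans.
y = list(pd.get_dummies(y, drop_first=True, dtype=int)['spam'])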
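As a rough illustration of what the `get_dummies(..., drop_first=True)` call produces, the pure-Python sketch below maps each label to 0 or 1 and counts the classes. The `labels` list is a small made-up sample that mimics the first rows of `df.head()`, not the real dataset, and `encode_labels` is a hypothetical helper rather than anything from pandas.

```python
from collections import Counter

def encode_labels(labels, positive="spam"):
    """Map each string label to 1 if it equals `positive`, else 0."""
    return [1 if label == positive else 0 for label in labels]

# Hypothetical sample mimicking the first five rows shown by df.head()
labels = ["ham", "ham", "ham", "spam", "spam"]
encoded = encode_labels(labels)

# Counting the encoded labels gives a quick view of class balance
counts = Counter(encoded)
print(encoded)
print(counts)
```

Checking the class counts early is useful because spam datasets tend to be imbalanced, which affects how accuracy should be read later.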

In [None]:
# Hold out 20% of the messages for testing; random_state fixes the shuffle
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)


In [None]:
X_train

['Yup ok...',
 'See? I thought it all through',
 'Your account has been refilled successfully by INR  DECIMAL . Your KeralaCircle prepaid account balance is Rs  DECIMAL . Your Transaction ID is KR # .',
 'FREE camera phones with linerental from 4.49/month with 750 cross ntwk mins. 1/2 price txt bundle deals also avble. Call 08001950382 or call2optout/J MF',
 'How come it takes so little time for a child who is afraid of the dark to become a teenager who wants to stay out all night?',
 'Oh sorry please its over',
 'K. I will sent it again',
 'Missing you too.pray inshah allah',
 'How is your schedule next week? I am out of town this weekend.',
 "When people see my msgs, They think Iam addicted to msging... They are wrong, Bcoz They don\\'t know that Iam addicted to my sweet Friends..!! BSLVYL",
 'Sday only joined.so training we started today:)',
 "I'll be late...",
 'U have a Secret Admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 

In [None]:
pd.get_dummies(y,drop_first=True)

Unnamed: 0,1
0,0
1,0
2,0
3,1
4,1
...,...
5554,0
5555,0
5556,1
5557,1


# **Load the pre-trained transformer model:**

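This step loads the tokenizer that matches the pre-trained DistilBERT checkpoint (`distilbert-base-uncased`) from the Hugging Face transformers library and uses it to encode the messages. The model itself is loaded in the fine-tuning section; it is the component that learns the relationship between the input text and its labels.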

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [None]:
train_encodings = tokenizer(X_train, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)
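To make the effect of `truncation=True, padding=True` more concrete, here is a minimal pure-Python sketch of the idea: cut sequences that are too long, pad the rest to the longest one, and record which positions are real tokens. The token ids are made up, `pad_and_truncate` is a hypothetical helper, and `pad_id=0` is an assumption; the real tokenizer also keeps special tokens intact when truncating, which this sketch does not.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one.

    Returns (input_ids, attention_mask) as lists of equal-length lists.
    """
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        pad = target - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask

# Made-up token ids, for illustration only
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102]]
ids, mask = pad_and_truncate(batch, max_length=5)
print(ids)
print(mask)
```

Padding to the longest sequence is why every example in the resulting dataset has the same length.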

In [None]:
y_train

In [None]:
train_encodings

In [None]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))

In [None]:
train_dataset

<TensorSliceDataset element_spec=({'input_ids': TensorSpec(shape=(221,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(221,), dtype=tf.int32, name=None)}, TensorSpec(shape=(), dtype=tf.int32, name=None))>

# **Fine-tune the model:**

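The pre-trained transformer model is fine-tuned on the training data to adapt it to the classification task at hand (here, ham versus spam). Fine-tuning adjusts the model's weights over one or more epochs of training. Note that `TFTrainer` and `TFTrainingArguments` were deprecated and later removed from transformers, so the cells below require a release that still ships them.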

In [None]:
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,             # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=16,  # batch size for evaluation
    warmup_steps=500,               # number of warmup steps for learning rate scheduler
    weight_decay=0.01,              # strength of weight decay
    logging_dir='./log',            # directory for storing logs
    logging_steps=10,               # log training metrics every 10 steps
)
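The arguments above imply a certain number of optimizer steps, and it can help to compare that with `warmup_steps=500`. The sketch below does the arithmetic and evaluates a linear warmup followed by linear decay. Several things here are assumptions rather than facts from the notebook: a single device, a training set of 4447 messages (5559 rows minus the 1112 test rows), a base learning rate of 5e-5, and the shape of the schedule itself. `linear_warmup_decay` is a hypothetical helper, not a transformers function.

```python
import math

def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)

train_size = 4447   # assumed: 5559 rows minus the 1112 held-out test rows
batch_size = 8      # per_device_train_batch_size, single device assumed
epochs = 2          # num_train_epochs
warmup_steps = 500  # warmup_steps from the training arguments
base_lr = 5e-5      # assumed base learning rate

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
print("steps per epoch:", steps_per_epoch, "total steps:", total_steps)

# Sample the schedule at a few points across training
for step in (0, 250, 500, 806, total_steps):
    lr = linear_warmup_decay(step, base_lr, warmup_steps, total_steps)
    print(step, lr)
```

Under these assumptions, warmup takes up a large share of the run, which may be worth revisiting when training for only two epochs on a dataset of this size.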

In [None]:
from transformers import TFDistilBertForSequenceClassification

with training_args.strategy.scope():
  model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

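# Uncomment the next line to actually fine-tune the model.
# The outputs shown in the evaluation section are consistent with this call having been skipped.
# trainer.train()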

# **Evaluate the model:**

After fine-tuning, the model is evaluated on the held-out test set to check its performance. `trainer.evaluate` reports only the loss here; accuracy, precision, recall, and F1-score can be computed from the predictions.

In [None]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.6783758435930525}

In [None]:
trainer.predict(test_dataset)

PredictionOutput(predictions=array([[-0.04963374, -0.06237536],
       [-0.05643223, -0.06064964],
       [-0.06984001, -0.05349755],
       ...,
       [-0.00449148, -0.1056397 ],
       [-0.08056724, -0.05775277],
       [-0.0536454 , -0.06774441]], dtype=float32), label_ids=array([0, 0, 0, ..., 0, 0, 0], dtype=int32), metrics={'eval_loss': 0.6783446175711495})
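One way to read these numbers is to recompute the per-example loss from the logits. The sketch below applies softmax cross-entropy to the first three logit rows printed above, whose labels are all 0, and compares the result with ln 2, the loss of a uniform guess between two classes. `cross_entropy` is a hypothetical helper written for this illustration; it is not the trainer's own loss code.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]

# First three logit rows from the predict output above; their labels are 0
rows = [
    (-0.04963374, -0.06237536),
    (-0.05643223, -0.06064964),
    (-0.06984001, -0.05349755),
]
losses = [cross_entropy(row, 0) for row in rows]
chance = math.log(2)  # loss of a uniform two-class guess

print("per-example losses:", losses)
print("ln(2):", chance)
```

Logits this close to each other give losses near ln 2, and the reported `eval_loss` is in the same range. That suggests, without proving it, that the classification head had not been trained when these outputs were produced.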

In [None]:
trainer.predict(test_dataset)[1].shape

(1112,)

In [None]:
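# Index 0 holds the logits; index 1 holds label_ids, which are just the true labels.
# argmax over the class axis turns the logits into predicted class ids (0 = ham, 1 = spam).
output = trainer.predict(test_dataset)[0].argmax(axis=1)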

In [None]:
import plotly.figure_factory as ff
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, output)

fig = ff.create_annotated_heatmap(
    z=cm,
    x=['Predicted Ham', 'Predicted Spam'],
    y=['Actual Ham', 'Actual Spam'],
    colorscale='Viridis',
    showscale=True
)

fig.update_layout(
    title='Confusion Matrix',
    xaxis=dict(title='Predicted Label'),
    yaxis=dict(title='Actual Label')
)

fig.show()
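The accuracy, precision, recall, and F1-score mentioned earlier can all be derived from a 2x2 confusion matrix. The sketch below assumes the layout used by scikit-learn's `confusion_matrix` for labels 0 and 1, that is `[[tn, fp], [fn, tp]]` with rows as actual classes. `binary_metrics` is a hypothetical helper, and the counts in `example_cm` are made up for illustration; they are not results from this notebook.

```python
def binary_metrics(cm):
    """Accuracy, precision, recall, and F1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up counts for illustration only
example_cm = [[90, 10], [5, 45]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With the notebook's own matrix, the call would be `binary_metrics(cm.tolist())`. Precision and recall on the spam class are usually more informative than accuracy when ham dominates the data.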


In [None]:
trainer.save_model('senti_model')

![image](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS-eV-vDP0_ZcP9GxCEzJFBzAoffWM8zVlwQw&usqp=CAU)