<a href="https://colab.research.google.com/github/hikmat690/AI-programming/blob/main/bbcnewspaper888.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **News Category Classification (Text Classification)**
*   **Downloading Dataset from Kaggle to Google Colab**
*   **Text Cleaning**
*   **BERT Model (Feature Engineering)**
*   **DL Model**

# **Importing Preprocessing Libraries**

In [None]:
!pip install -U "tensorflow-text==2.13.*"



In [None]:
#!pip install --quiet tensorflow_text

import re
import nltk
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,precision_score,accuracy_score,confusion_matrix

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from nltk.corpus import stopwords
nltk.download('stopwords')



exclude = string.punctuation

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **Reading Data**

In [None]:
temp = pd.read_csv('/content/bbc-text.csv')
df = temp.iloc[:1200].copy()  # copy the slice so later column assignments do not trigger SettingWithCopyWarning

In [None]:
df.shape

(1200, 2)

In [None]:

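# Build the stopword set once; use a new name so the imported `stopwords` module is not shadowed
stop_words = set(nltk.corpus.stopwords.words('english'))
def remove_stopwords(text):
    """Drop English stopwords from a whitespace-tokenized string."""
    return " ".join([word for word in str(text).split() if word not in stop_words])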


def remove_special_characters(text):
    """
    Remove special characters from the input text.

    Args:
        text (str): Input string.

    Returns:
        str: Cleaned string with only alphanumeric characters and spaces.
    """
    return re.sub(r'[^A-Za-z0-9\s]', '', text)
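A small worked example may help show what the two helpers do and why the order in which they run matters. The sketch below is illustrative only: it swaps NLTK's English list for a hypothetical five-word stopword set (`STOPWORDS`) and uses a made-up sentence, so it runs without downloading any corpus.

```python
import re

# Hypothetical mini stopword set; the notebook uses NLTK's full English list.
STOPWORDS = {"the", "is", "a", "of", "on"}

def remove_stopwords(text):
    # Whitespace tokenization: a token such as "on," keeps its punctuation.
    return " ".join(w for w in str(text).split() if w not in STOPWORDS)

def remove_special_characters(text):
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

sample = "The deal is on, the firm said."

# Order used in this notebook: lowercase -> stopwords -> special characters.
notebook_order = remove_special_characters(remove_stopwords(sample.lower()))

# Alternative order: strip punctuation first so "on," is recognised as "on".
alternative_order = remove_stopwords(remove_special_characters(sample.lower()))

print(notebook_order)     # deal on firm said
print(alternative_order)  # deal firm said
```

Because stopwords are matched on whitespace-separated tokens, a stopword that is directly followed by punctuation survives when stopword removal runs first. Whether that matters for the classifier is an empirical question; the sketch only shows the difference.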


# **Text Cleaning & Preprocessing**

In [None]:
df['text'] = df['text'].str.lower()
df['text'] = df['text'].apply(remove_stopwords)
df['text'] = df['text'].apply(remove_special_characters)



In [None]:
df['text']

0       tv future hands viewers home theatre systems p...
1       worldcom boss left books alone former worldcom...
2       tigers wary farrell gamble leicester say rushe...
3       yeading face newcastle fa cup premiership side...
4       ocean twelve raids box office ocean twelve cri...
                              ...                        
1195    duff ruled barcelona clash chelsea damien duff...
1196    original exorcist screened original version ho...
1197    record year chilean copper chile copper indust...
1198    sales fail boost high street january sales fai...


In [None]:
df.isnull().sum()

category    0
text        0


# **Feature Engineering**

**Target Column Encoding**

In [None]:

from sklearn.preprocessing import LabelEncoder

X = df['text']
Y = df['category']

encoder = LabelEncoder()
Y = encoder.fit_transform(Y)

print(Y)

X_train,X_test,y_train,y_test = train_test_split(df['text'],Y,test_size=0.2,random_state=42)
print(X_train)

[4 0 3 ... 0 0 1]
331     cebit fever takes hanover thousands products t...
409     eu aiming fuel development aid european union ...
76      yukos sues four firms 20bn russian oil firm yu...
868     moody joins england lewis moody flown dublin j...
138     safin relieved aussie recovery marat safin adm...
                              ...                        
1044    domain system scam fear system make easier cre...
1095    african double edinburgh world 5000m champion ...
1130    price trusted pc security buy trusted computer...
860     new year texting breaks record mobile phone es...
1126    stuart joins norwich addicks norwich signed ch...
Name: text, Length: 960, dtype: object
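To make the printed label array easier to read, here is a minimal pure-Python sketch of the convention `LabelEncoder` follows: class names are sorted and each one is assigned its position in that sorted order. The category names below are assumed for illustration, and `encode_labels` is a hypothetical helper, not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each label to its index in the sorted list of unique labels."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    codes = [index[name] for name in labels]
    return classes, codes

# Assumed category names, used only for this illustration.
labels = ["tech", "business", "sport", "sport", "politics", "entertainment"]

classes, codes = encode_labels(labels)
print(classes)  # ['business', 'entertainment', 'politics', 'sport', 'tech']
print(codes)    # [4, 0, 3, 3, 2, 1]

# Decoding reverses the lookup, much like encoder.inverse_transform would.
decoded = [classes[c] for c in codes]
print(decoded == labels)  # True
```

With the fitted encoder in this notebook, `encoder.classes_` holds the sorted class names and `encoder.inverse_transform` maps integer predictions back to category strings.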


**Training a Deep Learning Classifier on BERT Embeddings**

In [None]:
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
from transformers import BertTokenizer, TFBertModel
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the BERT tokenizer and model (bert-base-uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Set the maximum sequence length to 128
max_length = 128

# Function to tokenize and preprocess text data
def preprocess_text(text_data):
    # Tokenize the text data using the BERT tokenizer
    encoding = tokenizer(
        text_data,
        truncation=True,
        padding=True,
        max_length=max_length,
        return_tensors='tf',  # Return as TensorFlow tensors
        add_special_tokens=True  # Add special tokens like [CLS], [SEP]
    )
    return encoding

# Function to extract embeddings from the pre-trained BERT model (its weights are not updated here)
def extract_embeddings(text_data):
    # Preprocess the input text data
    # Convert the input to a list of strings if it's a Pandas Series
    if isinstance(text_data, pd.Series):
        text_data = text_data.tolist()
    encoding = preprocess_text(text_data)

    # Extract embeddings from BERT (use the output of the BERT model)
    outputs = bert_model(encoding['input_ids'], attention_mask=encoding['attention_mask'])

    # Use 'pooler_output': the final [CLS] hidden state passed through BERT's pooling layer (dense + tanh)
    embeddings = outputs.pooler_output  # (batch_size, hidden_size)

    return embeddings

# Features and labels from the cleaned BBC DataFrame
X = df['text']
Y = df['category']

# Encode labels
encoder = LabelEncoder()
Y = encoder.fit_transform(Y)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Get the embeddings for both training and testing sets
X_train_embeddings = extract_embeddings(X_train)
X_test_embeddings = extract_embeddings(X_test)

# Convert embeddings to numpy arrays (if needed)
X_train_embeddings = np.array(X_train_embeddings)
X_test_embeddings = np.array(X_test_embeddings)

# Print the shape of embeddings
print(X_train_embeddings.shape)  # (num_train_samples, 768)
print(X_test_embeddings.shape)   # (num_test_samples, 768)

# 1. Define an Input layer with the shape of your embeddings
input_layer = tf.keras.Input(shape=(X_train_embeddings.shape[1],), name='input_embeddings')

# 2. Apply Dropout and Dense layers to the input layer
drop_out = tf.keras.layers.Dropout(0.2, name='dropout')(input_layer)
output = tf.keras.layers.Dense(5, activation='softmax', name='output')(drop_out)

# 3. Create the Keras model using the input and output layers
model = tf.keras.Model(inputs=[input_layer], outputs=[output])

# Compile the model with the RMSprop optimizer
# Note: learning_rate=0.1 is a large step size; the Keras default for RMSprop is 0.001
optimizer = RMSprop(learning_rate=0.1)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Convert y_train and y_test to one-hot encoding
y_train = to_categorical(y_train, num_classes=5)
y_test = to_categorical(y_test, num_classes=5)

# Train the model for 4 epochs with a mini-batch size of 8, validating on the test split
history = model.fit(X_train_embeddings, y_train, epochs=4, batch_size=8, validation_data=(X_test_embeddings, y_test))

# Print model summary
model.summary()


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

(960, 768)
(240, 768)
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_embeddings (InputLay  [(None, 768)]             0         
 er)                                                             
                                                                 
 dropout (Dropout)           (None, 768)               0         
                                                                 
 output (Dense)              (None, 5)                 3845      
                                                                 
Total params: 3845 (15.02 KB)
Trainable params: 3845 (15.02 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
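The parameter count in the summary can be checked by hand. The short sketch below reproduces the arithmetic for the dense output layer and adds a plain-Python softmax to show what the `softmax` activation computes. The logits are made up, and the 4-byte parameter size is an assumption (float32 weights).

```python
import math

hidden_size = 768   # width of each BERT embedding vector
num_classes = 5     # units in the softmax output layer

# Dense layer: one weight per (input, class) pair plus one bias per class.
# The Dropout layer has no trainable parameters.
dense_params = hidden_size * num_classes + num_classes
size_kb = dense_params * 4 / 1024  # assuming 4-byte float32 parameters

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1, 0.0, -1.0])  # made-up logits for one sample

print(dense_params, round(size_kb, 2))  # 3845 15.02
print(round(sum(probs), 6))             # 1.0
```

Only these 3845 parameters are trained; the BERT encoder that produced the embeddings is not part of this Keras model.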


In [None]:


# Recompute the embeddings and convert them to NumPy arrays for scikit-learn
X_train_embeddings = np.array(extract_embeddings(X_train))
X_test_embeddings = np.array(extract_embeddings(X_test))
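`extract_embeddings` sends every text through BERT in a single forward pass, which can exhaust memory on larger datasets. One common workaround is to process the texts in fixed-size chunks and collect the results. The sketch below shows that pattern in plain Python; `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in for the real model call.

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Apply embed_fn to texts chunk by chunk and collect one row per text."""
    rows = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        rows.extend(embed_fn(batch))
    return rows

batch_sizes = []  # records how many texts each call received

def fake_embed(batch):
    # Stand-in for the BERT call: returns a 2-dimensional "embedding" per text.
    batch_sizes.append(len(batch))
    return [[float(len(t)), float(t.count(" "))] for t in batch]

texts = [f"document number {i}" for i in range(70)]
embeddings = embed_in_batches(texts, fake_embed, batch_size=32)

print(len(embeddings), len(embeddings[0]))  # 70 2
print(batch_sizes)                          # [32, 32, 6]
```

In this notebook, `embed_fn` could be a small wrapper that calls `extract_embeddings` on the chunk and converts the result to a NumPy array; the batch size of 32 is an arbitrary choice, not a tuned value.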


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest model on the BERT embeddings
rf.fit(X_train_embeddings, y_train.argmax(axis=1))  # y_train is one-hot encoded, use argmax to get class labels

# Make predictions on the test set
y_pred = rf.predict(X_test_embeddings)

# Evaluate the Random Forest model
# Convert y_test to class labels using argmax to match y_pred format
accuracy = accuracy_score(y_test.argmax(axis=1), y_pred)
print(f"Random Forest model accuracy: {accuracy:.4f}")

Random Forest model accuracy: 0.8167
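Accuracy alone hides which categories are confused with one another. The notebook imports `confusion_matrix` and `f1_score` from scikit-learn but does not call them; the pure-Python sketch below shows what those metrics compute, using made-up labels for three classes. The function names mirror the scikit-learn ones for readability, but these are simplified re-implementations, not the library code.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fp = sum(matrix[r][k] for r in range(n)) - tp  # predicted k, truly other
        fn = sum(matrix[k]) - tp                       # truly k, predicted other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
accuracy = sum(cm[k][k] for k in range(3)) / len(y_true)

print(cm)                      # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(accuracy, 4))      # 0.6667
print(round(macro_f1(cm), 4))  # 0.6556
```

On the real predictions, the equivalent scikit-learn calls would be `confusion_matrix(y_test.argmax(axis=1), y_pred)` and `f1_score(y_test.argmax(axis=1), y_pred, average='macro')`.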
