
# **Sentiment Analysis (Text Classification)**
*   **Downloading Dataset from Kaggle to Google Colab**
*   **Text Cleaning**
*   **BERT Model (Feature Engineering)**
*   **DL Model**

# **Importing Preprocessing Libraries**

In [7]:
!pip install -U "tensorflow-text==2.13.*"

Collecting tensorflow-text==2.13.*
  Downloading tensorflow_text-2.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Collecting tensorflow<2.14,>=2.13.0 (from tensorflow-text==2.13.*)
  Downloading tensorflow-2.13.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting gast<=0.4.0,>=0.2.1 (from tensorflow<2.14,>=2.13.0->tensorflow-text==2.13.*)
  Downloading gast-0.4.0-py3-none-any.whl.metadata (1.1 kB)
Collecting keras<2.14,>=2.13.1 (from tensorflow<2.14,>=2.13.0->tensorflow-text==2.13.*)
  Downloading keras-2.13.1-py3-none-any.whl.metadata (2.4 kB)
Collecting numpy<=1.24.3,>=1.22 (from tensorflow<2.14,>=2.13.0->tensorflow-text==2.13.*)
  Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting tensorboard<2.14,>=2.13 (from tensorflow<2.14,>=2.13.0->tensorflow-text==2.13.*)
  Downloading tensorboard-2.13.0-py3-none-any.whl.metadata (1.8 kB)
Collecting tensorflow-es

In [37]:
#!pip install --quiet tensorflow_text

import re
import nltk
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,precision_score,accuracy_score,confusion_matrix

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab')


stopwords.words('english')
exclude = string.punctuation

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# **Reading Data**

In [38]:
df = pd.read_csv('/content/bbc-text.csv')


In [39]:
df.shape

(2225, 2)

In [40]:

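# Use a set for fast membership tests, and a distinct name so the imported
# `stopwords` module is not shadowed.
stop_words = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(text):
    """Drop English stopwords from a whitespace-tokenized string."""
    return " ".join(word for word in str(text).split() if word not in stop_words)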


def remove_special_characters(text):
    """
    Remove special characters from the input text.

    Args:
        text (str): Input string.

    Returns:
        str: Cleaned string with only alphanumeric characters and spaces.
    """
    return re.sub(r'[^A-Za-z0-9\s]', '', text)


# **Text Cleaning & Preprocessing**

In [41]:
df['text'] = df['text'].str.lower()
df['text'] = df['text'].apply(remove_stopwords)
df['text'] = df['text'].apply(remove_special_characters)

In [42]:
df['text']

Unnamed: 0,text
0,tv future hands viewers home theatre systems p...
1,worldcom boss left books alone former worldcom...
2,tigers wary farrell gamble leicester say rushe...
3,yeading face newcastle fa cup premiership side...
4,ocean twelve raids box office ocean twelve cri...
...,...
2220,cars pull us retail figures us retail sales fe...
2221,kilroy unveils immigration policy exchatshow h...
2222,rem announce new glasgow concert us band rem a...
2223,political squabbles snowball become commonplac...
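
The combined effect of the three cleaning steps is easier to see on a single string. The sketch below mirrors the pipeline above but, to stay self-contained, swaps NLTK's stopword list for a tiny hand-picked set (`DEMO_STOPWORDS` and the `*_demo` helpers are assumptions for illustration, not the list or functions used in this notebook).

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
DEMO_STOPWORDS = {"the", "is", "in", "a", "of", "to", "and"}

def remove_stopwords_demo(text):
    # Keep only whitespace-separated tokens that are not stopwords
    return " ".join(w for w in text.split() if w not in DEMO_STOPWORDS)

def remove_special_characters_demo(text):
    # Same pattern as the notebook: delete everything except letters, digits, whitespace
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

def clean(text):
    # Same order as the notebook: lowercase, stopwords, special characters
    text = text.lower()
    text = remove_stopwords_demo(text)
    return remove_special_characters_demo(text)

sample = "The ex-chatshow host is in London, and the U.S. market fell 2%."
cleaned = clean(sample)
print(cleaned)  # exchatshow host london us market fell 2
```

Two side effects are visible here: punctuation is deleted rather than replaced by a space, so `ex-chatshow` fuses into `exchatshow` and `u.s.` becomes `us`, and because stopwords are removed before punctuation, a stopword with punctuation attached (for example `the,`) would survive the filter.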


In [43]:
df.isnull().sum()

Unnamed: 0,0
category,0
text,0


# **Feature Engineering**

**Target Column Encoding**

In [44]:

from sklearn.preprocessing import LabelEncoder

X = df['text']
Y = df['category']

encoder = LabelEncoder()
Y = encoder.fit_transform(Y)

print(Y)

X_train,X_test,y_train,y_test = train_test_split(df['text'],Y,test_size=0.2,random_state=42)
print(X_train)

[4 0 3 ... 1 2 3]
1490    farrell due make us tv debut actor colin farre...
2001    china continues rapid growth china economy exp...
1572    ebbers aware worldcom fraud former worldcom bo...
1840    school tribute tv host carson 1 000 people tur...
610     broadband fuels online expression fast web acc...
                              ...                        
1638    november remember last saturday one newspaper ...
1095    african double edinburgh world 5000m champion ...
1130    price trusted pc security buy trusted computer...
1294    driscollgregan lead aid stars ireland brian dr...
860     new year texting breaks record mobile phone es...
Name: text, Length: 1780, dtype: object
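
`LabelEncoder` numbers the sorted unique labels, which is consistent with the printed array starting `4 0 3`. A minimal pure-Python sketch of that mapping follows. The five category names are an assumption (they are not printed anywhere in this notebook), and the short `labels` list is made up for illustration.

```python
# Assumed category names; LabelEncoder sorts the unique labels before numbering them.
labels = ["tech", "business", "sport", "sport", "entertainment", "politics"]

classes = sorted(set(labels))                       # plays the role of encoder.classes_
label_to_id = {name: i for i, name in enumerate(classes)}
encoded = [label_to_id[name] for name in labels]    # like encoder.fit_transform(labels)
decoded = [classes[i] for i in encoded]             # like encoder.inverse_transform(encoded)

print(label_to_id)
print(encoded)  # [4, 0, 3, 3, 1, 2]
```

In the notebook itself, `encoder.classes_` and `encoder.inverse_transform(...)` give the same information for the fitted encoder, which helps when reading predictions back as category names.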


**Training a Deep Learning Classifier on BERT Embeddings**

In [None]:
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
from transformers import BertTokenizer, TFBertModel
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the BERT tokenizer and model (bert-base-uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Set the maximum sequence length to 128
max_length = 128

# Function to tokenize and preprocess text data
def preprocess_text(text_data):
    # Tokenize the text data using the BERT tokenizer
    encoding = tokenizer(
        text_data,
        truncation=True,
        padding=True,
        max_length=max_length,
        return_tensors='tf',  # Return as TensorFlow tensors
        add_special_tokens=True  # Add special tokens like [CLS], [SEP]
    )
    return encoding

# Function to extract embeddings from the pretrained (frozen) BERT model
def extract_embeddings(text_data, batch_size=32):
    # Convert the input to a list of strings if it's a Pandas Series
    if isinstance(text_data, pd.Series):
        text_data = text_data.tolist()

    # Encode in small batches so the whole dataset is never pushed through BERT at once
    batches = []
    for start in range(0, len(text_data), batch_size):
        encoding = preprocess_text(text_data[start:start + batch_size])
        outputs = bert_model(encoding['input_ids'], attention_mask=encoding['attention_mask'])
        # 'pooler_output' is the [CLS] hidden state passed through a dense + tanh layer
        batches.append(outputs.pooler_output.numpy())  # (batch, hidden_size)

    return np.concatenate(batches, axis=0)  # (n_samples, hidden_size)

# Example dataset (replace this with your actual data)
X = df['text']
Y = df['category']

# Encode labels
encoder = LabelEncoder()
Y = encoder.fit_transform(Y)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Get the embeddings for both training and testing sets
X_train_embeddings = extract_embeddings(X_train)
X_test_embeddings = extract_embeddings(X_test)

# Convert embeddings to numpy arrays (if needed)
X_train_embeddings = np.array(X_train_embeddings)
X_test_embeddings = np.array(X_test_embeddings)

# Print the shape of embeddings
print(X_train_embeddings.shape)  # (n_train_samples, 768), here (1780, 768)
print(X_test_embeddings.shape)   # (n_test_samples, 768), here (445, 768)

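# Build a simple classification head on top of the precomputed embeddings.
# A Keras model needs a symbolic Input layer, not the data array itself.
inputs = tf.keras.Input(shape=(X_train_embeddings.shape[1],), name='bert_embedding')
drop_out = tf.keras.layers.Dropout(0.2, name='dropout')(inputs)
output = tf.keras.layers.Dense(5, activation='softmax', name='output')(drop_out)

model = tf.keras.Model(inputs=inputs, outputs=output)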

# Compile the model with RMSprop optimizer
optimizer = RMSprop(learning_rate=0.1)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Convert y_train and y_test to one-hot encoding
y_train = to_categorical(y_train, num_classes=5)
y_test = to_categorical(y_test, num_classes=5)

# Train the model with mini-batch size of 8 and 4 epochs
history = model.fit(X_train_embeddings, y_train, epochs=4, batch_size=8, validation_data=(X_test_embeddings, y_test))

# Print model summary
model.summary()


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w
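
The tokenizer options used in `preprocess_text` (`truncation=True`, `padding=True`, `max_length`) can be pictured with a toy function over lists of token ids. This is only an illustration under simplifying assumptions, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, and unlike the real tokenizer it cuts sequences blindly instead of keeping the closing `[SEP]` token after truncation. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`; the other ids are made up.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy version of truncation=True, padding=True for lists of token ids."""
    # Truncation: cap every sequence at max_length tokens
    truncated = [seq[:max_length] for seq in sequences]
    # padding=True pads to the longest sequence in the batch, not to max_length
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that BERT should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
ids, mask = pad_and_truncate(batch, max_length=5)
print(ids)
print(mask)
```

Every row ends up with the same length, which is what lets the batch be stacked into one tensor, and the attention mask is what `extract_embeddings` passes to BERT alongside `input_ids`.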

In [None]:
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import to_categorical

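optimizer = RMSprop(learning_rate=0.1)
# The labels are one-hot vectors over 5 classes, so use categorical (not binary) cross-entropy
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Train on the BERT embeddings (the model does not accept raw text)
# with a mini-batch size of 8 and 4 epochs
history = model.fit(X_train_embeddings, y_train, epochs=4, batch_size=8, validation_split=0.2)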


Epoch 1/2
Epoch 2/2
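
For a five-class softmax output with one-hot targets, the matching loss is categorical cross-entropy. The sketch below computes it by hand in plain Python, as a way to sanity-check the loss values Keras reports. The helper names and the toy labels are assumptions for illustration, not Keras internals.

```python
import math

def one_hot(label, num_classes):
    # Integer class id -> one-hot row, like to_categorical for a single label
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_prob[k])."""
    losses = []
    for t_row, p_row in zip(y_true, y_prob):
        # Clip probabilities from below so log(0) never occurs
        losses.append(-sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row)))
    return sum(losses) / len(losses)

y_true = [one_hot(label, 5) for label in (4, 0, 3)]
uniform = [[0.2] * 5 for _ in y_true]        # a model that guesses uniformly
loss_uniform = categorical_crossentropy(y_true, uniform)
print(round(loss_uniform, 4))  # 1.6094, i.e. ln(5)
```

A uniform guess over five classes gives a loss of ln 5 ≈ 1.609, which is a handy baseline: a training loss that stays near or above this value suggests the classifier has not learned anything useful yet.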


In [None]:
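# The dense head was trained on frozen BERT features, so BERT itself is not fine-tuned.
# model.predict returns class probabilities of shape (n_samples, 5), not embeddings,
# and it expects the BERT embeddings as input rather than raw text.
def predict_probabilities(embeddings):
    return model.predict(embeddings)

# Store the probabilities under separate names so the 768-dimensional
# BERT embeddings stay available for the Random Forest below
train_probs = predict_probabilities(X_train_embeddings)
test_probs = predict_probabilities(X_test_embeddings)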




In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest model on the BERT embeddings
rf.fit(X_train_embeddings, y_train.argmax(axis=1))  # y_train is one-hot encoded, use argmax to get class labels

# Make predictions on the test set
y_pred = rf.predict(X_test_embeddings)

# Evaluate the Random Forest model
# y_test was one-hot encoded above, so convert it back to class labels before scoring
accuracy = accuracy_score(y_test.argmax(axis=1), y_pred)
print(f"Random Forest model accuracy: {accuracy:.4f}")


Random Forest model accuracy: 0.3000
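
Accuracy is only meaningful when the true and predicted labels use the same format, so one-hot rows have to be collapsed to class ids before comparison. A small plain-Python sketch of that step, using made-up toy labels and hypothetical helper names rather than this notebook's data:

```python
def argmax(row):
    # Index of the largest entry, i.e. the class id of a one-hot row
    return max(range(len(row)), key=row.__getitem__)

def accuracy(true_labels, predicted_labels):
    # Fraction of positions where the two label lists agree
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Toy example: four samples, five classes
y_test_one_hot = [
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
y_pred_demo = [4, 0, 2, 1]                     # class ids, as a classifier's predict would return

y_true_demo = [argmax(row) for row in y_test_one_hot]
acc = accuracy(y_true_demo, y_pred_demo)
print(y_true_demo, acc)  # [4, 0, 3, 1] 0.75
```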
