<a href="https://colab.research.google.com/github/hikmat690/AI-programming/blob/main/bbcnewspaper22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis (Text Classification)**
*   **Downloading Dataset from Kaggle to Google Colab**
*   **Text Cleaning**
*   **BERT Model (Feature Engineering)**
*   **DL Model**

# **Importing Preprocessing Libraries**

In [2]:
!pip install -U "tensorflow-text==2.13.*"



In [3]:
#!pip install --quiet tensorflow_text

import re
import nltk
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,precision_score,accuracy_score,confusion_matrix

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab')


# Punctuation characters stripped by remove_punc in the cleaning step
exclude = string.punctuation

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# **Reading Data**

In [4]:
temp_df = pd.read_csv('/content/bbc-text.csv')
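# Take an explicit copy of the first 2000 rows so later column assignments
# modify df itself rather than a slice of temp_df (avoids SettingWithCopyWarning)
df = temp_df.iloc[:2000].copy()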

In [5]:
df.shape

(2000, 2)

# **Text Cleaning & Preprocessing**

In [6]:

# Compile the patterns once instead of on every call
HTML_TAG_PATTERN = re.compile(r'<.*?>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_html_tags(text):
    return HTML_TAG_PATTERN.sub('', text)

def remove_url(text):
    return URL_PATTERN.sub('', text)

# Strip every character listed in `exclude` (string.punctuation)
def remove_punc(text):
    return text.translate(str.maketrans('', '', exclude))

# Use a separate name so the imported `stopwords` corpus module is not overwritten
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in stop_words])

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Reduce each word to its Porter stem, e.g. "future" -> "futur"
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])
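
The cleaning functions above are applied one after another, so their order matters. The sketch below chains the same steps in a single function so the effect can be checked on one string. It is a simplified illustration, not the notebook's exact pipeline: `clean_text`, `TOY_STOPWORDS` and the sample sentence are hypothetical, the stopword list is a tiny stand-in for NLTK's English list, and stemming is left out to keep the example free of NLTK downloads.

```python
import re
import string

# Same patterns as the notebook's cleaning functions, compiled once.
HTML_TAG = re.compile(r"<.*?>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Hypothetical mini stopword list, for illustration only.
TOY_STOPWORDS = {"the", "a", "is", "at", "for", "on"}

def clean_text(raw: str) -> str:
    """Lowercase, strip HTML tags and URLs, drop punctuation, then drop stopwords."""
    text = raw.lower()
    text = HTML_TAG.sub("", text)
    text = URL.sub("", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

sample = "<p>The BBC is at https://www.bbc.co.uk for News!</p>"
cleaned = clean_text(sample)
print(cleaned)  # bbc news
```

URLs are removed before punctuation here, as in the notebook; stripping punctuation first would delete the `://` and dots that the URL pattern relies on.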

In [7]:
df['text'] = df['text'].str.lower()

df['text'] = df['text'].apply(remove_html_tags)

df['text'] = df['text'].apply(remove_url)

df['text'] = df['text'].apply(remove_punc)
df['text'] = df['text'].apply(remove_stopwords)
df['text'] = df['text'].apply(stem_words)


In [8]:
df.head()

        category                                               text
0           tech  tv futur hand viewer home theatr system plasma...
1       business  worldcom boss left book alon former worldcom b...
2          sport  tiger wari farrel gambl leicest say rush make ...
3          sport  yead face newcastl fa cup premiership side new...
4  entertainment  ocean twelv raid box offic ocean twelv crime c...


In [9]:
df.isnull().sum()

category    0
text        0


# **Feature Engineering**

**Target Column Encoding**

In [10]:

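from sklearn.preprocessing import LabelEncoder

X = df['text']
Y = df['category']

# Map category names to integer ids (classes are numbered in sorted order).
# Named label_encoder so the BERT encoder defined later does not overwrite it.
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)

print(Y)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print(X_train)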

[4 0 3 ... 3 0 3]
968     leari agre new villa contract aston villa boss...
240     retail sale show festiv fervour uk retail sale...
819     mobil game take india game move one fastestgro...
692     uk troop ivori coast standbi down street confi...
420     small firm hit rise cost rise fuel materi cost...
                              ...                        
1130    price trust pc secur buy trust comput realli t...
1294    driscollgregan lead aid star ireland brian dri...
860     new year text break record mobil phone essenti...
1459    michael film signal retir singer georg michael...
1126    stuart join norwich addick norwich sign charlt...
Name: text, Length: 1600, dtype: object
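
To read the encoded array printed above, it helps to know that `LabelEncoder` numbers the classes in sorted order of their names. The pure-Python sketch below mimics that mapping. The first five labels are the categories shown in the `df.head()` output; "politics" is an assumption about the fifth class, and `label_to_id` / `id_to_label` are illustrative names.

```python
# Categories of the first five rows in the notebook, plus "politics",
# which is assumed here to be the fifth class.
categories = ["tech", "business", "sport", "sport", "entertainment", "politics"]

# LabelEncoder assigns integer ids in sorted order of the unique labels.
classes = sorted(set(categories))
label_to_id = {label: idx for idx, label in enumerate(classes)}
encoded = [label_to_id[label] for label in categories]

# Inverse mapping, useful for turning predictions back into category names.
id_to_label = {idx: label for label, idx in label_to_id.items()}

print(classes)   # ['business', 'entertainment', 'politics', 'sport', 'tech']
print(encoded)   # [4, 0, 3, 3, 1, 2]
```

Under this assumption the first three ids (4, 0, 3) agree with the start of the printed array, where the first rows are tech, business and sport.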


**Fine-tuning using Deep Learning**

In [11]:
# BERT preprocessing (tokenization) layer and encoder layer.
# trainable=True lets the encoder weights be updated during fine-tuning.
preprocessor = hub.KerasLayer("https://kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/en-uncased-preprocess/versions/3")
bert_encoder = hub.KerasLayer("https://www.kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/en-uncased-l-12-h-768-a-12/versions/4", trainable=True)

# The model takes raw strings as input, one string per example
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
outputs = bert_encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 768]

# Classification head: dropout followed by a 5-way softmax (one unit per news category)
drop_out = tf.keras.layers.Dropout(0.2, name='dropout')(pooled_output)
output = tf.keras.layers.Dense(5, activation='softmax', name='output')(drop_out)

model = tf.keras.Model(inputs=[text_input], outputs=[output])



In [None]:
from tensorflow.keras.utils import to_categorical

# categorical_crossentropy expects one-hot targets, so convert the integer labels first
y_train = to_categorical(y_train, num_classes=5)

# Compile the model for multi-class classification.
# The BERT encoder is trainable, and Adam's default learning rate (1e-3) is usually
# too large for fine-tuning; 2e-5 is a common choice and is an assumption here.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5), loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=2, validation_split=0.1)

Epoch 1/2
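
The training cell converts integer labels to one-hot vectors because `categorical_crossentropy` compares a one-hot target with the softmax output. The sketch below reproduces both steps in plain Python to show what is being computed. `one_hot`, `categorical_crossentropy` and the probability vector are illustrative stand-ins, not the Keras implementations.

```python
import math

def one_hot(label: int, num_classes: int) -> list:
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_crossentropy(target, probs, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    # Clip probabilities from below to avoid log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

target = one_hot(3, 5)                   # class id 3 out of 5 classes
probs = [0.05, 0.05, 0.10, 0.70, 0.10]   # hypothetical softmax output
loss = categorical_crossentropy(target, probs)

print(target)          # [0.0, 0.0, 0.0, 1.0, 0.0]
print(round(loss, 4))  # 0.3567
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the true class. Keras also offers `sparse_categorical_crossentropy`, which takes the integer labels directly and makes the one-hot conversion unnecessary.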

In [1]:
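# Run this cell after the training cells above: it needs X_train, X_test and the trained model.
# Note: model.predict returns the 5-class softmax probabilities from the output layer,
# not the 768-dimensional pooled BERT embedding, so each text is represented by 5 features.
def extract_embeddings(text_data):
    return model.predict(text_data)

# Get the model outputs for both training and testing sets
X_train_embeddings = extract_embeddings(X_train)
X_test_embeddings = extract_embeddings(X_test)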


NameError: name 'X_train' is not defined

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest on the features returned by extract_embeddings.
# y_train is one-hot encoded, so argmax recovers the integer class labels.
rf.fit(X_train_embeddings, y_train.argmax(axis=1))

# Make predictions on the test set
y_pred = rf.predict(X_test_embeddings)

# y_test was never one-hot encoded, so it can be compared with y_pred directly
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest model accuracy: {accuracy:.4f}")


Random Forest model accuracy: 0.4500
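
A single accuracy figure does not show which categories are being confused. The notebook imports `confusion_matrix` but never uses it, so the sketch below illustrates the idea in plain Python: count (true, predicted) pairs and derive per-class recall. The labels are toy values invented for the example, not the notebook's predictions, and `confusion_counts` / `per_class_recall` are hypothetical helper names.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def per_class_recall(y_true, y_pred):
    """Fraction of examples of each true class that were predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    totals = Counter(y_true)
    return {c: counts[(c, c)] / totals[c] for c in sorted(totals)}

# Toy labels for five classes (hypothetical, for illustration only).
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 0, 1, 0, 2, 0, 0, 0, 4, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

print(accuracy)  # 0.5
print(recall)    # {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.0, 4: 0.5}
```

In this toy case the overall accuracy hides the fact that class 3 is never predicted correctly; the same breakdown on `y_test` and `y_pred` would show whether the low score comes from a few classes or from all of them.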
