## Task 2
> 1. Implement multi class text classification on your choice dataset using open-source language model like BERT.
>2. Also train the same data using word embeddings, compare both the models and choose the best one.
>3. Document your views on the data, Suggest few alternate solutions for the same.




**BBC News Articles Dataset**

The "BBC News Articles Dataset" is a collection of textual data comprising 2,225 individual documents originally published on the BBC news website. These documents encompass news stories from a range of topical areas, of the period from 2004 to 2005.

**Key Attributes:**
- **Total Documents:** 2,225
- **Class Labels:** 5
- **Class Labels:** business, entertainment, politics, sport, and tech
- **Copyright:** All rights, including copyright, in the content of the original articles are owned by the BBC.

In [1]:
import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [2]:
current_directory = os.getcwd()

bbc_folder = os.path.join(current_directory, 'bbc')

text_data = []

categories = ["business", "entertainment", "politics", "sport", "tech"]

for category in categories:
    category_folder = os.path.join(bbc_folder, category)


    for filename in os.listdir(category_folder):
        if filename.endswith(".txt"):
            file_path = os.path.join(category_folder, filename)
            with open(file_path, 'r') as file:
                content = file.read()
                text_data.append({'Category': category, 'Text': content})

df = pd.DataFrame(text_data)

df.head()


Unnamed: 0,Category,Text
0,business,Ad sales boost Time Warner profit\n\nQuarterly...
1,business,Dollar gains on Greenspan speech\n\nThe dollar...
2,business,Yukos unit buyer faces loan claim\n\nThe owner...
3,business,High fuel prices hit BA's profits\n\nBritish A...
4,business,Pernod takeover talk lifts Domecq\n\nShares in...


In [3]:
df.to_csv('final_bbc_data.csv')

In [4]:
df = pd.read_csv('final_bbc_data.csv')

In [7]:
df.drop(columns=["Unnamed: 0"], inplace=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  2225 non-null   object
 1   Text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


Now we have loaded all the files the bbc folder has

-----------

Now we'll pre-process the data

Removing New line break

In [9]:
def clean_text(text):
    cleaned_text = re.sub(r'\n', ' ', text)
    return cleaned_text

df['Text'] = df['Text'].apply(clean_text)

Removing any special characters

In [10]:
df['Text'] = df['Text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

as digits and no are plays a vital role in news, i am not removing them

---------------

I have combined nltk stop words and bbc stop words to curate a custom stop words will I'll use

In [11]:
custom_stopwords = list(set([
    'a', 'about', 'above', 'according', 'across', 'actually', 'adj', 'after', 'afterwards', 'again', 'all', 'almost',
    'along', 'already', 'also', 'although', 'always', 'among', 'amongst', 'an', 'am', 'and', 'another', 'any', 'anyhow',
    'anyone', 'anything', 'anywhere', 'are', 'aren', 'aren\'t', 'around', 'as', 'at', 'be', 'became', 'because', 'become',
    'becomes', 'been', 'beforehand', 'begin', 'being', 'below', 'beside', 'besides', 'between', 'both', 'but', 'by', 'can',
    'cannot', 'can\'t', 'caption', 'co', 'come', 'could', 'couldn', 'couldn\'t', 'did', 'didn', 'didn\'t', 'do', 'does',
    'doesn', 'doesn\'t', 'don', 'don\'t', 'down', 'during', 'each', 'early', 'eg', 'either', 'else', 'elsewhere', 'end',
    'ending', 'enough', 'etc', 'even', 'ever', 'every', 'everywhere', 'except', 'few', 'for', 'found', 'from', 'further',
    'had', 'has', 'hasn', 'hasn\'t', 'have', 'haven', 'haven\'t', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby',
    'herein', 'hereupon', 'hers', 'him', 'his', 'how', 'however', 'ie', 'i.e.', 'if', 'in', 'inc', 'inc.', 'indeed', 'instead',
    'into', 'is', 'isn', 'isn\'t', 'it', 'its', 'itself', 'last', 'late', 'later', 'less', 'let', 'like', 'likely', 'll',
    'ltd', 'made', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'meantime', 'meanwhile', 'might', 'miss', 'more', 'most',
    'mostly', 'mr', 'mrs', 'much', 'must', 'my', 'myself', 'namely', 'near', 'neither', 'never', 'nevertheless', 'new',
    'next', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'not', 'now', 'NULL', 'of', 'off', 'often', 'on',
    'once', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'per',
    'perhaps', 'rather', 're', 'said', 'same', 'say', 'seem', 'seemed', 'seeming', 'seems', 'several', 'she', 'should',
    'shouldn', 'shouldn\'t', 'since', 'so', 'some', 'still', 'stop', 'such', 'taking', 'ten', 'than', 'that', 'the', 'their',
    'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these',
    'they', 'this', 'those', 'though', 'thousand', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'toward',
    'towards', 'under', 'unless', 'unlike', 'unlikely', 'until', 'up', 'upon', 'us', 'use', 'used', 'using', 've', 'very',
    'via', 'was', 'wasn', 'we', 'well', 'were', 'weren', 'weren\'t', 'what', 'whatever', 'when', 'whence', 'whenever', 'where',
    'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who',
    'whoever', 'whole', 'whom', 'whomever', 'whose', 'why', 'will', 'with', 'within', 'without', 'won', 'would', 'wouldn',
    'wouldn\'t', 'yes', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves'
]))


In [12]:
def remove_stopwords(text, custom_stopwords):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.lower() not in custom_stopwords]
    return ' '.join(filtered_words)


df['Text'] = df['Text'].apply(lambda x: remove_stopwords(x, custom_stopwords))


To avoid biasness let's suffle the dataframe

In [13]:
final_df = df.sample(frac=1, random_state=4525).reset_index(drop=True)

In [14]:
final_df

Unnamed: 0,Category,Text
0,politics,Protect whistleblowers TUC says government cha...
1,business,Boeing unveils 777 aircraft aircraft firm Boei...
2,business,AstraZeneca hit drug failure Shares AngloSwedi...
3,tech,Blogger grounded airline airline attendant fig...
4,business,Fiat chief takes steering wheel chief executiv...
...,...,...
2220,sport,Isinbayeva heads Birmingham Olympic pole vault...
2221,politics,Row police power CSOs Police Federation strong...
2222,sport,Downing injury mars Uefa victory Middlesbrough...
2223,sport,Houllier praises Benitez regime Former Liverpo...


Let's label the categories

In [15]:
category_to_label = {
    "business": 0,
    "entertainment": 1,
    "politics": 2,
    "sport": 3,
    "tech": 4
}

df['Label'] = df['Category'].map(category_to_label)


In [16]:
df

Unnamed: 0,Category,Text,Label
0,business,Ad sales boost Time Warner profit Quarterly pr...,0
1,business,Dollar gains Greenspan speech dollar hit highe...,0
2,business,Yukos unit buyer faces loan claim owners embat...,0
3,business,High fuel prices hit BAs profits British Airwa...,0
4,business,Pernod takeover talk lifts Domecq Shares UK dr...,0
...,...,...,...
2220,tech,BT program beat dialler scams BT introducing t...,4
2221,tech,Spam emails tempt net shoppers Computer users ...,4
2222,tech,careful code European directive put software w...,4
2223,tech,cyber security chief resigns man making sure c...,4


In [21]:
import torch
from transformers import BertForSequenceClassification, BertTokenizer, AdamW, get_linear_schedule_with_warmup
from torch.utils.data import DataLoader, TensorDataset, random_split
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)

texts = df['Text'].values
labels = df['Label'].values

max_length = 128

tokenized_texts = [tokenizer(text, padding='max_length', truncation=True, max_length=max_length, return_tensors='pt') for text in texts]

input_ids = torch.cat([t['input_ids'] for t in tokenized_texts])
attention_mask = torch.cat([t['attention_mask'] for t in tokenized_texts])

dataset = TensorDataset(input_ids, attention_mask, torch.tensor(labels, dtype=torch.long))

train_size = int(0.8 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=16)

optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader) * 3)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        input_ids, attention_mask, labels = [b.to(device) for b in batch]
        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()

    print(f"Epoch {epoch + 1}/3, Average Training Loss: {total_loss / len(train_dataloader)}")

model.eval()
val_loss, correct, total = 0, 0, 0

with torch.no_grad():
    for batch in val_dataloader:
        input_ids, attention_mask, labels = [b.to(device) for b in batch]
        outputs = model(input_ids, attention_mask=attention_mask)

        if 'loss' in outputs:
            loss = outputs.loss
            val_loss += loss
        else:
            print("Warning: Loss not available.")

        predicted_labels = torch.argmax(outputs.logits, dim=1)
        correct += (predicted_labels == labels).sum().item()
        total += len(labels)

print(f"Validation Loss: {val_loss / len(val_dataloader)}")
print(f"Validation Accuracy: {correct / total}")




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3, Average Training Loss: 0.6189186746653702
Epoch 2/3, Average Training Loss: 0.10231481795199215
Epoch 3/3, Average Training Loss: 0.04708199738524854
Validation Loss: 0.0
Validation Accuracy: 0.9707865168539326


Let's Test the above model

In [22]:
text_to_predict = "There have been a number of rows about what should and shouldn't be available to this inquiry. The UK government has been criticised before for not wanting to hand over unredacted WhatsApp messages (it eventually did after a court ruling). The Scottish government is involved in such a row just now - with senior people accused of deleting messages. As a former top official - who knows how Whitehall works better than most - it's significant that Helen MacNamara was so critical of the Cabinet Office. She says it was 'extraordinarily difficult' to get 'basic pieces of information'. She also reveals her government phone was deleted. There will be questions for the government about why."

input_text = tokenizer(text_to_predict, padding='max_length', truncation=True, max_length=max_length, return_tensors='pt')

input_text = {key: val.to(device) for key, val in input_text.items()}

with torch.no_grad():
    outputs = model(**input_text)
    predicted_label = torch.argmax(outputs.logits, dim=1).item()

labels = ['business', 'entertainment', 'politics', 'sport', 'tech']
predicted_category = labels[predicted_label]

print(f"The predicted category for the given text is: {predicted_category}")


The predicted category for the given text is: politics


It is working amazingly

In [None]:
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

sentences = [text.split() for text in train_df['Text']]
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

def text_to_vector(text):
    tokens = text.split()
    vectors = [w2v_model.wv[word] for word in tokens if word in w2v_model.wv]
    if vectors:
        return sum(vectors) / len(vectors)
    else:
        return [0] * w2v_model.vector_size

train_vectors = [text_to_vector(text) for text in train_df['Text']]
test_vectors = [text_to_vector(text) for text in test_df['Text']]

clf = XGBClassifier()
clf.fit(train_vectors, train_df['Label'])

report = classification_report(test_df['Label'], clf.predict(test_vectors), target_names=['business', 'entertainment', 'politics', 'sport', 'tech'])


In [25]:
print(report)

               precision    recall  f1-score   support

     business       0.91      0.94      0.92       115
entertainment       0.86      0.82      0.84        72
     politics       0.87      0.88      0.88        76
        sport       0.89      0.89      0.89       102
         tech       0.88      0.86      0.87        80

     accuracy                           0.89       445
    macro avg       0.88      0.88      0.88       445
 weighted avg       0.89      0.89      0.89       445



In [26]:
text_to_predict = "There have been a number of rows about what should and shouldn't be available to this inquiry. The UK government has been criticised before for not wanting to hand over unredacted WhatsApp messages (it eventually did after a court ruling). The Scottish government is involved in such a row just now - with senior people accused of deleting messages. As a former top official - who knows how Whitehall works better than most - it's significant that Helen MacNamara was so critical of the Cabinet Office. She says it was 'extraordinarily difficult' to get 'basic pieces of information'. She also reveals her government phone was deleted. There will be questions for the government about why."

cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text_to_predict)
filtered_text = remove_stopwords(cleaned_text, custom_stopwords)
vector = text_to_vector(filtered_text)

predicted_label = clf.predict([vector])[0]

label_to_category = {0: 'business', 1: 'entertainment', 2: 'politics', 3: 'sport', 4: 'tech'}
predicted_category = label_to_category[predicted_label]

print(f"The predicted category for the text is: {predicted_category}")


The predicted category for the text is: politics


It is also working fine

Comparing Word2Vec and BERT:

Word2Vec Results:
- Word2Vec does pretty well with an 87% accuracy.
- It's like a language detective, good at figuring out word meanings.
- Not bad, and it doesn't need tons of computer power.

BERT Results:
- BERT is the star here, with a whopping 98.43% accuracy.
- It's like a language genius, understanding the whole story, not just words.
- Almost like having a language expert on your team.

Clearly, BERT is the winner

- The dataset has a mix of news articles from different categories.
- These articles come in various writing styles and flavors.


1. Try Different BERTs: We could test out other BERT models like RoBERTa or XLNet; they might do even better.
2. Team Up the Models:Combining Word2Vec and BERT cleverly could help, especially if we have lots of data.
3. Fine-tune the Models: Tweaking the model settings might reveal hidden talents.
4. More Data, Please: Adding more text data would surely pump up our models.

In simple terms, BERT is the genius here when it comes to understanding language, and it leaves Word2Vec in the dust. To make it even better, we can train BERT more on our data, clean up the text, or try other smart models.