TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS

1. Data Exploration and Preprocessing

•	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('/content/blogs.csv')

In [3]:
df

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism
...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


In [4]:
df.shape

(2000, 2)

There are 2000 rows and 2 columns in the dataset.

In [5]:
df.columns

Index(['Data', 'Labels'], dtype='object')

In [6]:
df.isnull().sum()

Unnamed: 0,0
Data,0
Labels,0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


•	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.

In [8]:
import string

Removing Punctuation:

In [9]:
df['Data'] = df['Data'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

Converting to Lowercase:

In [10]:
df['Data'] = df['Data'].str.lower()

Tokenizing:

In [11]:
df['Data'] = df['Data'].apply(lambda x: x.split())

Removing Stopwords:

In [12]:
import nltk

In [13]:
from nltk.corpus import stopwords

In [14]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [15]:
stop_words = set(stopwords.words('english'))

In [16]:
df['Data'] = df['Data'].apply(lambda words: [w for w in words if w not in stop_words])


Lemmitization:

In [44]:
from nltk.stem import WordNetLemmatizer

In [45]:
lemmatizer = WordNetLemmatizer()

In [47]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [48]:
df['Data'] = df['Data'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))


In [49]:
df.head()

Unnamed: 0,Data,Labels,Sentiment
0,path cantaloupesrvcscmuedumagnesiumclubcccmued...,alt.atheism,Negative
1,newsgroups altatheism path cantaloupesrvcscmue...,alt.atheism,Positive
2,path cantaloupesrvcscmuedudasnewsharvardedunoc...,alt.atheism,Negative
3,path cantaloupesrvcscmuedumagnesiumclubcccmued...,alt.atheism,Negative
4,xref cantaloupesrvcscmuedu altatheism53485 tal...,alt.atheism,Positive


•	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
df['Data'] = df['Data'].apply(lambda x: ' '.join(x))


In [20]:
tfidf = TfidfVectorizer(max_features=5000)

In [21]:
x = tfidf.fit_transform(df['Data']).toarray()

In [22]:
y = df['Labels']

2. Naive Bayes Model for Text Classification

•	Split the data into training and test sets.

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

•	Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.

In [25]:
from sklearn.naive_bayes import MultinomialNB

In [26]:
nb=MultinomialNB()

•	Train the model on the training set and make predictions on the test set.

In [27]:
nb.fit(x_train,y_train)

In [28]:
y_pred=nb.predict(x_test)

3. Sentiment Analysis

•	Choose a suitable library or method for performing sentiment analysis on the blog post texts.

Using VADER (Valence Aware Dictionary and sEntiment Reasoner) for sentiment analysis because it is specifically designed for text data from social media, reviews, and other informal sources, which makes it well-suited for blog posts.

VADER is lightweight, requires no training, works well on small datasets, and is easily interpretable for reporting purposes.

 Additionally, it is included in the NLTK library, making it convenient to implement without complex dependencies.

•	Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.

In [29]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Download VADER lexicon

In [30]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

Initialize VADER

In [31]:
sia = SentimentIntensityAnalyzer()

Function to classify sentiment

In [32]:
def get_sentiment_label(text):
    score = sia.polarity_scores(text)['compound']
    if score > 0.05:
        return "Positive"
    elif score < -0.05:
        return "Negative"
    else:
        return "Neutral"


Apply sentiment analysis

In [33]:
df['Sentiment'] = df['Data'].apply(get_sentiment_label)

In [34]:
df

Unnamed: 0,Data,Labels,Sentiment
0,path cantaloupesrvcscmuedumagnesiumclubcccmued...,alt.atheism,Negative
1,newsgroups altatheism path cantaloupesrvcscmue...,alt.atheism,Positive
2,path cantaloupesrvcscmuedudasnewsharvardedunoc...,alt.atheism,Negative
3,path cantaloupesrvcscmuedumagnesiumclubcccmued...,alt.atheism,Negative
4,xref cantaloupesrvcscmuedu altatheism53485 tal...,alt.atheism,Positive
...,...,...,...
1995,xref cantaloupesrvcscmuedu talkabortion120945 ...,talk.religion.misc,Positive
1996,xref cantaloupesrvcscmuedu talkreligionmisc837...,talk.religion.misc,Positive
1997,xref cantaloupesrvcscmuedu talkorigins41030 ta...,talk.religion.misc,Positive
1998,xref cantaloupesrvcscmuedu talkreligionmisc836...,talk.religion.misc,Positive


TextBlob:

In [50]:
from textblob import TextBlob

# TextBlob sentiment analysis
df['TextBlob_Polarity'] = df['Data'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['TextBlob_Sentiment'] = df['TextBlob_Polarity'].apply(lambda score: 'Positive' if score > 0
                                                         else ('Negative' if score < 0 else 'Neutral'))


In [52]:
df['TextBlob_Sentiment']

Unnamed: 0,TextBlob_Sentiment
0,Positive
1,Negative
2,Positive
3,Positive
4,Positive
...,...
1995,Positive
1996,Positive
1997,Positive
1998,Positive


•	Examine the distribution of sentiments across different categories and summarize your findings.

Sentiment distribution per category

VANDER:

In [35]:
sentiment_dist = df.groupby(['Labels', 'Sentiment']).size().unstack(fill_value=0)

In [36]:
print("\nSentiment Distribution:\n", sentiment_dist)



Sentiment Distribution:
 Sentiment                 Negative  Neutral  Positive
Labels                                               
alt.atheism                     40        1        59
comp.graphics                   10        3        87
comp.os.ms-windows.misc         22        2        76
comp.sys.ibm.pc.hardware        18        3        79
comp.sys.mac.hardware           17        4        79
comp.windows.x                  21        2        77
misc.forsale                     8       10        82
rec.autos                       26        3        71
rec.motorcycles                 32        1        67
rec.sport.baseball              25        3        72
rec.sport.hockey                28        3        69
sci.crypt                       22        2        76
sci.electronics                 13        6        81
sci.med                         29        2        69
sci.space                       28        5        67
soc.religion.christian          28        1        71
ta

Most blog categories show a higher proportion of Positive sentiments compared to Negative or Neutral.

Strong Poitive Categories are comp.graphics,misc.forsale and sci.electronics.

rec.sport.baseball, rec.sport.hockey, rec.autos have a healthy mix of positive posts.

Strong Negative Categoris are talk.politics.guns,talk.politics.mideast and talk.politics.misc.

Neutral sentiment is minimal across all categories.



TextBlob:

In [53]:
textblob_sentiment_dist = df.groupby(['Labels', 'TextBlob_Sentiment']).size().unstack(fill_value=0)



In [54]:
print(textblob_sentiment_dist)

TextBlob_Sentiment        Negative  Neutral  Positive
Labels                                               
alt.atheism                     36        0        64
comp.graphics                   27        0        73
comp.os.ms-windows.misc         23        0        77
comp.sys.ibm.pc.hardware        18        0        82
comp.sys.mac.hardware           26        0        74
comp.windows.x                  20        2        78
misc.forsale                    20        0        80
rec.autos                       24        0        76
rec.motorcycles                 29        0        71
rec.sport.baseball              38        0        62
rec.sport.hockey                44        0        56
sci.crypt                       23        0        77
sci.electronics                 24        0        76
sci.med                         34        0        66
sci.space                       29        0        71
soc.religion.christian          25        0        75
talk.politics.guns          

The TextBlob sentiment analysis shows very few neutral statements, with most texts classified as either positive or negative.

Positive sentiment dominates in technical and science-related categories, while political and sports-related categories have relatively higher negative sentiment.

Comparing VADER and TextBlob sentiment analysis, TextBlob produces more positive classifications in most categories, while VADER detects a higher proportion of neutral and negative sentiments, especially in political discussions.

4. Evaluation

•	Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.

In [55]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,classification_report

In [56]:
print("Accuracy Score:",accuracy_score(y_test,y_pred))

Accuracy Score: 0.8225


In [57]:
print("Precision Score:",precision_score(y_test,y_pred,average='weighted'))

Precision Score: 0.8281299559144919


In [58]:
print("Recall Score:",recall_score(y_test,y_pred,average='weighted'))

Recall Score: 0.8225


In [59]:
print("F1-Score:",f1_score(y_test,y_pred,average='weighted'))

F1-Score: 0.8174927019571058


In [60]:
print("\nClassification Report:")
report=classification_report(y_test,y_pred)
print(report)


Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.50      0.83      0.62        18
           comp.graphics       0.83      0.83      0.83        18
 comp.os.ms-windows.misc       0.86      0.82      0.84        22
comp.sys.ibm.pc.hardware       0.76      0.76      0.76        25
   comp.sys.mac.hardware       0.83      0.90      0.86        21
          comp.windows.x       0.91      0.84      0.88        25
            misc.forsale       0.78      0.78      0.78        18
               rec.autos       0.89      0.94      0.92        18
         rec.motorcycles       0.94      0.94      0.94        16
      rec.sport.baseball       0.77      0.94      0.85        18
        rec.sport.hockey       0.94      1.00      0.97        15
               sci.crypt       0.95      0.95      0.95        19
         sci.electronics       0.59      0.62      0.61        16
                 sci.med       0.88      0.88      

•	Discuss the performance of the model and any challenges encountered during the classification process.

The classification model achieved an overall accuracy of 82%, with high precision and recall across most categories. Categories such as sci.crypt, rec.motorcycles, and talk.politics.mideast exhibited near-perfect score. However, categories like talk.religion.misc and sci.electronics showed lower performance.




 As this was my first time working on such a task, I needed extra time to understand the dataset structure and requirements.

 The dataset was not clean initially, so I had to perform preprocessing steps such as removing noise and formatting text properly before modeling.

 Implementing TF-IDF was slightly challenging, especially in understanding its parameters and how it converts text into numerical features.

 Since this was a supervised classification problem, building the model was relatively straightforward once the data was ready.

 The main challenge was more about learning and understanding new concepts rather than technical difficulty in model implementation.

•	Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.

Most technical and hobby-related categories, such as comp.graphics (87% positive), misc.forsale (82% positive), and sci.electronics (81% positive), displayed a strong positive sentiment. This indicates that discussions in these areas were generally supportive.

Sports and science topics like rec.sport.baseball (72% positive) and sci.crypt (76% positive) maintained an overall positive tone but contained moderate amounts of negative sentiment, which indicates some debates in this area.

Politically inclined categories, such as talk.politics.guns (65% negative) and talk.politics.mideast (68% negative), showed a clear tendency towards negative sentiment indicating disagreements.

