----
# **TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS**
-----

### OBJECTIVE :

##### Build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately, then perform sentiment analysis to gauge the general sentiment (positive, negative, or neutral) expressed in these posts. This assignment reinforces text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).

### TASKS :

##### DATA PREPROCESSING :

In [6]:
import warnings
warnings.filterwarnings('ignore')

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [9]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [10]:
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [2]:
from tensorflow.keras.optimizers import Adam, RMSprop
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [12]:
from sklearn.metrics import classification_report

In [13]:
df=pd.read_csv('blogs.csv')

In [14]:
df

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism
...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [16]:
df.isnull().sum()

Data      0
Labels    0
dtype: int64

In [17]:
df.describe()

Unnamed: 0,Data,Labels
count,2000,2000
unique,2000,20
top,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
freq,1,100


#### **MODEL IMPLEMENTATION :**

In [25]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation (digits and underscores are kept because \w matches them)
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize and remove stopwords
    tokens = [word for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return ' '.join(tokens)

# Apply preprocessing to the 'Data' column
df['Processed_Data'] = df['Data'].apply(preprocess_text)
df

Unnamed: 0,Data,Labels,Processed_Data
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism path cantaloupesrvcscmue...
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,path cantaloupesrvcscmuedudasnewsharvardedunoc...
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,xref cantaloupesrvcscmuedu altatheism53485 tal...
...,...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc,xref cantaloupesrvcscmuedu talkabortion120945 ...
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,xref cantaloupesrvcscmuedu talkreligionmisc837...
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc,xref cantaloupesrvcscmuedu talkorigins41030 ta...
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,xref cantaloupesrvcscmuedu talkreligionmisc836...
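
The `Processed_Data` column above shows two side effects of deleting punctuation outright: dotted names collapse into one long token (`cantaloupesrvcscmuedu`) and digits survive (`altatheism53485`). The sketch below reproduces that behaviour on a single header line and compares it with a letters-only variant. The names `STOP_WORDS`, `preprocess_keep_digits` and `preprocess_letters_only` are illustrative and not part of the notebook, and the tiny stop-word set is a stand-in for `ENGLISH_STOP_WORDS` so the example runs without scikit-learn.

```python
import re

# Stand-in stop-word set (assumption); the notebook uses sklearn's ENGLISH_STOP_WORDS
STOP_WORDS = {"the", "a", "is", "in", "of", "to", "and"}

def preprocess_keep_digits(text):
    """Same steps as preprocess_text: lowercase, delete punctuation, drop stop words."""
    text = re.sub(r'[^\w\s]', '', text.lower())
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

def preprocess_letters_only(text):
    """Variant: replace every non-letter with a space, so digits go and dotted names split."""
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

sample = "Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53485"
print(preprocess_keep_digits(sample))
print(preprocess_letters_only(sample))
```

Whether the letters-only variant helps the classifier is an open question for this dataset; it is worth comparing both on the held-out set rather than assuming one is better.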


#### **Implement the Naive Bayes Classifier :**

In [27]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Use max_features to limit the number of terms

# Transform the processed text data
X = tfidf_vectorizer.fit_transform(df['Processed_Data'])

# Assign labels to the target variable
y = df['Labels']

In [29]:
# Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
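
The split above is not stratified, so the number of test posts per category can drift away from 20 even though every category has 100 posts. With scikit-learn the usual remedy is to pass `stratify=y` to `train_test_split`. The plain-Python sketch below only illustrates what stratification does; `stratified_split` and the placeholder class names are hypothetical and not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the same test fraction drawn from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Mirror the dataset's shape: 20 categories with 100 posts each (placeholder names)
labels = [f"class_{k}" for k in range(20) for _ in range(100)]
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), set(test_counts.values()))
```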

In [32]:
# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

In [34]:
# Train the model
nb_classifier.fit(X_train, y_train)

In [36]:
# Make predictions on the test set
y_pred = nb_classifier.predict(X_test)
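
To make the fit and predict steps above less of a black box, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It works on raw word counts from a made-up four-document corpus, whereas the notebook feeds TF-IDF weights into `MultinomialNB`, so treat it as an illustration of the idea rather than a reimplementation of scikit-learn. The function names `train_nb` and `predict_nb` are assumptions for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    vocab = sorted({w for d in docs for w in d})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        log_lik = {w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab}
        model[label] = (math.log(n_docs / len(docs)), log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; words outside the vocabulary are ignored."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(model, key=score)

# Toy, made-up corpus (not taken from blogs.csv)
docs = [
    "goal puck ice team".split(),
    "team playoff goal skate".split(),
    "cipher key encryption secure".split(),
    "key privacy encryption chip".split(),
]
labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.crypt", "sci.crypt"]

model = train_nb(docs, labels)
print(predict_nb(model, "secure key chip".split()))
print(predict_nb(model, "ice team goal".split()))
```

Each word contributes its own log-likelihood term and the terms are simply added, which is the independence assumption in practice.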

#### **Sentiment Analysis :**

In [39]:
# Download VADER lexicon if not already available
nltk.download('vader_lexicon')

# Initialize the VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\AUSUS\AppData\Roaming\nltk_data...


In [50]:
def get_sentiment(text):
    # Calculate sentiment scores
    sentiment_scores = sid.polarity_scores(text)
    compound_score = sentiment_scores['compound']
    
    # Determine sentiment based on compound score
    if compound_score >= 0.05:
        return 'Positive'
    elif compound_score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the function to determine sentiment for each blog post
df['Sentiment'] = df['Data'].apply(get_sentiment)

df

Unnamed: 0,Data,Labels,Processed_Data,Sentiment
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,Negative
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism path cantaloupesrvcscmue...,Positive
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,path cantaloupesrvcscmuedudasnewsharvardedunoc...,Negative
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,Negative
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,xref cantaloupesrvcscmuedu altatheism53485 tal...,Positive
...,...,...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc,xref cantaloupesrvcscmuedu talkabortion120945 ...,Positive
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,xref cantaloupesrvcscmuedu talkreligionmisc837...,Positive
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc,xref cantaloupesrvcscmuedu talkorigins41030 ta...,Positive
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,xref cantaloupesrvcscmuedu talkreligionmisc836...,Positive
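
`get_sentiment` scores the raw `Data` column, so routing headers such as `Path:` and `Xref:` are passed to VADER along with the message body. The sketch below isolates the two pieces of logic involved: the compound-score thresholds used above, and a hypothetical `strip_headers` helper that keeps only the text after the first blank line. That helper assumes headers and body are separated by an empty line, which should be checked against the actual posts; the sample post is invented.

```python
def strip_headers(post):
    """Return the text after the first blank line, or the whole post if there is none."""
    _, sep, body = post.partition("\n\n")
    return body if sep else post

def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score to the three labels used in get_sentiment."""
    if score >= threshold:
        return 'Positive'
    if score <= -threshold:
        return 'Negative'
    return 'Neutral'

# Invented example post (not taken from blogs.csv)
post = "Path: news.example.edu!relay\nNewsgroups: alt.atheism\n\nI really enjoyed this discussion."
print(strip_headers(post))
print([label_from_compound(s) for s in (0.6, 0.05, 0.0, -0.05, -0.7)])
```

In the notebook this would amount to calling `sid.polarity_scores(strip_headers(text))` inside `get_sentiment`, if the header assumption holds.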


In [52]:
# Calculate the distribution of sentiments across categories
sentiment_distribution = df.groupby(['Labels', 'Sentiment']).size().unstack().fillna(0)

# Display the distribution
print(sentiment_distribution)

Sentiment                 Negative  Neutral  Positive
Labels                                               
alt.atheism                   42.0      1.0      57.0
comp.graphics                 13.0      4.0      83.0
comp.os.ms-windows.misc       24.0      2.0      74.0
comp.sys.ibm.pc.hardware      21.0      0.0      79.0
comp.sys.mac.hardware         24.0      3.0      73.0
comp.windows.x                20.0      2.0      78.0
misc.forsale                   7.0      8.0      85.0
rec.autos                     27.0      1.0      72.0
rec.motorcycles               30.0      2.0      68.0
rec.sport.baseball            27.0      1.0      72.0
rec.sport.hockey              28.0      1.0      71.0
sci.crypt                     29.0      0.0      71.0
sci.electronics               18.0      4.0      78.0
sci.med                       38.0      1.0      61.0
sci.space                     32.0      3.0      65.0
soc.religion.christian        29.0      0.0      71.0
talk.politics.guns          

#### **EVALUATION CRITERIA**

In [54]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Get precision, recall, and F1-score
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

Accuracy: 0.825
Classification Report:
                           precision    recall  f1-score   support

             alt.atheism       0.54      0.83      0.65        18
           comp.graphics       0.75      0.83      0.79        18
 comp.os.ms-windows.misc       0.86      0.82      0.84        22
comp.sys.ibm.pc.hardware       0.77      0.80      0.78        25
   comp.sys.mac.hardware       0.79      0.90      0.84        21
          comp.windows.x       0.91      0.84      0.88        25
            misc.forsale       0.88      0.78      0.82        18
               rec.autos       0.89      0.94      0.92        18
         rec.motorcycles       0.94      0.94      0.94        16
      rec.sport.baseball       0.71      0.94      0.81        18
        rec.sport.hockey       0.94      1.00      0.97        15
               sci.crypt       0.95      0.95      0.95        19
         sci.electronics       0.64      0.56      0.60        16
                 sci.med       0.88
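
As a reading aid for the report, the sketch below computes precision, recall and F1 for one class from label lists. The counts are an assumption chosen to be consistent with the `alt.atheism` row (18 actual posts, 15 of them predicted correctly, 28 predictions in total); the notebook does not show the underlying confusion matrix, and `precision_recall_f1` is an illustrative name.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute one-vs-rest precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Assumed counts: 15 true positives, 3 false negatives, 13 false positives, 7 true negatives
y_true_demo = ["alt.atheism"] * 18 + ["other"] * 20
y_pred_demo = ["alt.atheism"] * 15 + ["other"] * 3 + ["alt.atheism"] * 13 + ["other"] * 7

p, r, f1 = precision_recall_f1(y_true_demo, y_pred_demo, "alt.atheism")
print(round(p, 2), round(r, 2), round(f1, 2))
```

Low precision with high recall, as in this row, means the model assigns the label too freely: most true posts are found, but many posts from other categories are pulled in as well.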

### Model Evaluation and Challenges
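- **Accuracy**: Naive Bayes achieved **82.5% accuracy** on the 400-post test set, performing well overall for a 20-class problem.
- **High-Performing Categories**: Categories like **rec.sport.hockey** (precision 0.94, recall 1.00) and **sci.crypt** (0.95 for both) showed high precision and recall, likely due to distinctive vocabulary.
- **Vocabulary Overlap**: Topics with overlapping vocabulary, such as the **religion**-related and **technology**-related newsgroups, posed classification challenges: **alt.atheism** reached only 0.54 precision and **sci.electronics** an F1-score of 0.60. Naive Bayes scores each word independently, so it has little to separate classes whose word distributions are similar.
- **Class Balance**: The dataset itself is balanced (100 posts in each of the 20 categories), but the unstratified 80/20 split leaves uneven test support (15 to 25 posts per class in the rows shown), so per-class scores rest on small and unequal samples.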
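
One way to check the vocabulary-overlap explanation is to count which (true, predicted) pairs account for most of the errors. The sketch below does this with `collections.Counter`; `top_confusions` is a hypothetical helper and the label lists are made up for illustration. In the notebook it could be called with `y_test` and `y_pred`.

```python
from collections import Counter

def top_confusions(y_true, y_pred, n=3):
    """Return the n most frequent (true label, predicted label) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)

# Made-up labels for illustration only
y_true_toy = ["talk.religion.misc"] * 3 + ["sci.electronics"] * 2 + ["alt.atheism"]
y_pred_toy = [
    "alt.atheism", "alt.atheism", "talk.religion.misc",
    "comp.sys.ibm.pc.hardware", "sci.electronics", "alt.atheism",
]

for (true_label, predicted_label), count in top_confusions(y_true_toy, y_pred_toy):
    print(f"{true_label} -> {predicted_label}: {count}")
```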

### Sentiment Analysis Results and Implications
- **Positive Sentiment in Sales and Technical Topics**: **misc.forsale** (85 of 100 posts positive) and **comp.graphics** (83 of 100) were largely positive, likely due to factual or promotional content.
- **Negative Sentiment in Controversial Topics**: Categories like **talk.politics.guns** and **talk.politics.mideast** contained more negative sentiment, reflecting polarized or critical discussions. Among the categories listed in the table above, **alt.atheism** (42 negative posts) and **sci.med** (38) have the most negative posts, although positive posts still form the majority in each.
- **Engagement and Moderation Insights**: Sentiment analysis can help identify areas that may need **content moderation** (e.g., contentious topics) or opportunities for **targeted advertising** (e.g., categories with mostly positive sentiment).

These points summarize the model's strengths and challenges, with sentiment trends providing valuable context for engagement and content strategies.