# **Task : 1. Data Exploration and Preprocessing**

**•	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.**

In [26]:
import pandas as pd

In [27]:
df= pd.read_csv("/content/blogs.csv")
df

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism
...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


In [28]:
df.describe()

Unnamed: 0,Data,Labels
count,2000,2000
unique,2000,20
top,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,alt.atheism
freq,1,100


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [30]:
df.isnull().sum()

Unnamed: 0,0
Data,0
Labels,0


**•	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.**

In [31]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources needed for tokenizing and stopword removal
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove digits, punctuation and special characters (keep letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Join tokens back into a string
    return ' '.join(tokens)

df['cleaned_data'] = df['Data'].apply(preprocess_text)
display(df[['Data', 'cleaned_data']].head())

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Data,cleaned_data
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,path cantaloupesrvcscmuedumagnesiumclubcccmued...
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,newsgroups altatheism path cantaloupesrvcscmue...
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,path cantaloupesrvcscmuedudasnewsharvardedunoc...
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,path cantaloupesrvcscmuedumagnesiumclubcccmued...
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,xref cantaloupesrvcscmuedu altatheism talkreli...
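The cleaned samples above still begin with message header fields (`Path`, `Newsgroups`, `Xref`), and the `Newsgroups` field can name the label itself. Below is a minimal sketch of dropping the header block before cleaning. It assumes each post follows the usual layout of header lines, one blank line, then the body. `strip_headers` is a hypothetical helper and not part of this notebook.

```python
def strip_headers(raw_post: str) -> str:
    """Return the body of a post, i.e. the text after the first blank line.

    Assumption: headers and body are separated by one blank line. If no
    blank line is found, the text is returned unchanged.
    """
    normalized = raw_post.replace("\r\n", "\n")
    head, sep, body = normalized.partition("\n\n")
    return body.strip() if sep else normalized


sample = (
    "Newsgroups: alt.atheism\n"
    "Path: cantaloupe.srv.cs.cmu.edu\n"
    "Subject: example\n"
    "\n"
    "This is the body of the post.\n"
)
print(strip_headers(sample))
```

If used, it would run before `preprocess_text`, for example `df['Data'].apply(strip_headers)`. Whether this raises or lowers the reported accuracy would need to be checked by re-running the notebook.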


**•	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.**

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # You can adjust max_features

# Separate features (x) and labels (y)
x = tfidf_vectorizer.fit_transform(df['cleaned_data'])
y = df['Labels']

print("Shape of TF-IDF matrix:", x.shape)

Shape of TF-IDF matrix: (2000, 5000)
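To show what each cell of that matrix holds, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The whitespace tokenisation and the name `tfidf_rows` are simplifications for illustration, so the numbers will not match scikit-learn exactly on real text.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows (whitespace tokens)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit length; guard against an empty document.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["god exists", "graphics card sale", "graphics software"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(w, 3) for w in rows[2]])
```

In the third document, "software" gets a larger weight than "graphics" because it appears in fewer documents. In the notebook, the 2000 rows are the posts and the 5000 columns are the terms kept by `max_features`.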


# **Task : 2. Naive Bayes Model for Text Classification**

**•	Split the data into training and test sets.**

In [33]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

**•	Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.**

In [34]:
from sklearn.naive_bayes import MultinomialNB

# Initialize and train the Naive Bayes classifier
model = MultinomialNB()
model.fit(x_train, y_train)

**•	Train the model on the training set and make predictions on the test set.**

In [35]:
# Make predictions on the test set
y_pred = model.predict(x_test)

In [36]:
model.score(x_test, y_test)*100, model.score(x_train, y_train)*100

(84.0, 98.3125)

# **Task : 3. Sentiment Analysis**

**•	Choose a suitable library or method for performing sentiment analysis on the blog post texts.**

In [37]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon
import nltk
nltk.download('vader_lexicon')

# Initialize VADER sentiment intensity analyzer
analyzer = SentimentIntensityAnalyzer()

# Function to get sentiment score
def get_sentiment_score(text):
    scores = analyzer.polarity_scores(text)
    return scores['compound']

# Apply sentiment analysis to the cleaned data
df['sentiment_score'] = df['cleaned_data'].apply(get_sentiment_score)

display(df[['cleaned_data', 'sentiment_score']].head())

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Unnamed: 0,cleaned_data,sentiment_score
0,path cantaloupesrvcscmuedumagnesiumclubcccmued...,-0.9896
1,newsgroups altatheism path cantaloupesrvcscmue...,0.875
2,path cantaloupesrvcscmuedudasnewsharvardedunoc...,-0.994
3,path cantaloupesrvcscmuedumagnesiumclubcccmued...,-0.9996
4,xref cantaloupesrvcscmuedu altatheism talkreli...,0.989
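Four of the five scores above have a magnitude of 0.98 or more. One likely reason is the way VADER turns a sum of word valences into the compound score. As I understand the reference implementation, the sum `x` is squashed with `x / sqrt(x^2 + alpha)`, with `alpha = 15`. Treat that constant as an assumption. The sketch below only illustrates that formula and does not call NLTK.

```python
import math


def normalize_compound(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash a summed valence score into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# Longer texts tend to accumulate larger sums, which pushes the score toward +/-1.
for raw_sum in (1.0, 5.0, 20.0, 60.0):
    print(raw_sum, round(normalize_compound(raw_sum), 4))
```

Under this formula, a long post with even a modest surplus of positive or negative words ends up close to +1 or -1, so very few posts fall in a narrow neutral band around zero.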


**•	Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.**

In [38]:
# Function to categorize sentiment score
def categorize_sentiment(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the function to create a new column with sentiment categories
df['sentiment_category'] = df['sentiment_score'].apply(categorize_sentiment)

# Display the first few rows with the new sentiment category
display(df[['cleaned_data', 'sentiment_score', 'sentiment_category']].head())

Unnamed: 0,cleaned_data,sentiment_score,sentiment_category
0,path cantaloupesrvcscmuedumagnesiumclubcccmued...,-0.9896,Negative
1,newsgroups altatheism path cantaloupesrvcscmue...,0.875,Positive
2,path cantaloupesrvcscmuedudasnewsharvardedunoc...,-0.994,Negative
3,path cantaloupesrvcscmuedumagnesiumclubcccmued...,-0.9996,Negative
4,xref cantaloupesrvcscmuedu altatheism talkreli...,0.989,Positive


**•	Examine the distribution of sentiments across different categories and summarize your findings.**

In [39]:
# Group by original labels and sentiment category to see the distribution
sentiment_distribution = df.groupby(['Labels', 'sentiment_category']).size().unstack(fill_value=0)

# Display the sentiment distribution
display(sentiment_distribution)

# Summarize findings
print("\nSentiment Distribution Summary:")
print(sentiment_distribution.sum(axis=1)) # Total count per label
print("\nPercentage of Sentiment Distribution within each Label:")
display(sentiment_distribution.apply(lambda x: x / x.sum() * 100, axis=1))

Labels,Negative,Neutral,Positive
alt.atheism,40,1,59
comp.graphics,10,3,87
comp.os.ms-windows.misc,21,2,77
comp.sys.ibm.pc.hardware,20,4,76
comp.sys.mac.hardware,17,4,79
comp.windows.x,20,2,78
misc.forsale,8,10,82
rec.autos,26,3,71
rec.motorcycles,33,1,66
rec.sport.baseball,24,3,73



Sentiment Distribution Summary:
Labels
alt.atheism                 100
comp.graphics               100
comp.os.ms-windows.misc     100
comp.sys.ibm.pc.hardware    100
comp.sys.mac.hardware       100
comp.windows.x              100
misc.forsale                100
rec.autos                   100
rec.motorcycles             100
rec.sport.baseball          100
rec.sport.hockey            100
sci.crypt                   100
sci.electronics             100
sci.med                     100
sci.space                   100
soc.religion.christian      100
talk.politics.guns          100
talk.politics.mideast       100
talk.politics.misc          100
talk.religion.misc          100
dtype: int64

Percentage of Sentiment Distribution within each Label:


Labels,Negative,Neutral,Positive
alt.atheism,40.0,1.0,59.0
comp.graphics,10.0,3.0,87.0
comp.os.ms-windows.misc,21.0,2.0,77.0
comp.sys.ibm.pc.hardware,20.0,4.0,76.0
comp.sys.mac.hardware,17.0,4.0,79.0
comp.windows.x,20.0,2.0,78.0
misc.forsale,8.0,10.0,82.0
rec.autos,26.0,3.0,71.0
rec.motorcycles,33.0,1.0,66.0
rec.sport.baseball,24.0,3.0,73.0


# **Task : 4. Evaluation**

**•	Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.**

In [40]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Display the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

# Display classification report for more detailed metrics per class
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.8400
Precision: 0.8471
Recall: 0.8400
F1-score: 0.8332

Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.50      0.83      0.62        18
           comp.graphics       0.75      0.83      0.79        18
 comp.os.ms-windows.misc       0.91      0.91      0.91        22
comp.sys.ibm.pc.hardware       0.81      0.84      0.82        25
   comp.sys.mac.hardware       0.86      0.90      0.88        21
          comp.windows.x       0.95      0.84      0.89        25
            misc.forsale       1.00      0.78      0.88        18
               rec.autos       0.90      1.00      0.95        18
         rec.motorcycles       1.00      0.94      0.97        16
      rec.sport.baseball       0.80      0.89      0.84        18
        rec.sport.hockey       0.88      1.00      0.94        15
               sci.crypt       0.90      1.00      0.95        19
         sci.electronics       0.67      0.75     

*  **Discuss the performance of the model and any challenges encountered during the classification process.**

Ans:- The Naive Bayes classifier achieved an accuracy of **0.8400**, a precision of **0.8471**, a recall of **0.8400**, and an F1-score of **0.8332** on the test set (precision, recall, and F1 are weighted averages across classes).

The classification report gives per-category detail, and precision, recall, and F1-score vary across labels. Categories such as 'rec.autos' and 'sci.crypt' (precision 0.90 and recall 1.00 each) are classified well. Others, such as 'alt.atheism' (precision 0.50, recall 0.83) and 'talk.religion.misc', score lower, which suggests the model has trouble separating these classes, possibly because their vocabularies overlap. The low precision for 'alt.atheism' means that many posts from other groups are assigned to it. The gap between training accuracy (98.31%) and test accuracy (84.00%) also points to some overfitting.
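The reported recall (0.8400) is identical to the accuracy (0.8400). That is expected with `average='weighted'`, because support-weighted recall reduces to the overall fraction of correct predictions. Here is a minimal pure-Python sketch of the weighted averages on made-up labels. `weighted_scores` is a hypothetical helper, not the scikit-learn implementation.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    support = Counter(y_true)      # true instances per class
    predicted = Counter(y_pred)    # predictions per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        p = hits[label] / predicted[label] if predicted[label] else 0.0
        r = hits[label] / n_true
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n_true / total    # class share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(hits.values()) / total
    return accuracy, precision, recall, f1


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Weighted precision and F1 do not collapse in the same way, which is why they differ from accuracy in the report above.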

**Challenges Encountered:**

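* **Text Complexity:** The posts contain diverse language, slang, and domain-specific terminology, which makes accurate classification harder. The cleaned text also still includes message headers (`Path`, `Newsgroups`, `Xref`), and stripping punctuation fuses host names into single long tokens such as `cantaloupesrvcscmuedu`. Both add noise to the vocabulary, and because a `Newsgroups` header can name the target label, part of the measured accuracy may come from the headers rather than from the post body.
* **Class Imbalance:** The dataset is balanced, with 100 posts per class, so imbalance was not a problem here. The train/test split is not stratified, however, so per-class support in the test set ranges from 15 to 25 in the rows shown, which makes the per-class scores noisier. Real-world text classification tasks are often imbalanced, which can hurt performance on minority classes.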
* **Feature Engineering:** The choice of `max_features` in the `TfidfVectorizer` can impact performance. A larger vocabulary might capture more nuances but could also introduce noise.
* **Naive Bayes Assumptions:** The Naive Bayes model assumes independence between features, which may not hold true for text data where word order and context are important.
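The independence assumption in the last bullet can be made concrete with a small multinomial Naive Bayes written by hand. This is a sketch on raw word counts with Laplace smoothing. The notebook instead feeds TF-IDF weights to scikit-learn's `MultinomialNB`, so the numbers differ. The names `train_nb` and `predict_nb` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed word log-likelihoods."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_like


def predict_nb(doc, log_prior, log_like):
    """Score each class as its log-prior plus one independent term per word."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)


docs = ["god faith belief", "faith church god",
        "engine car wheel", "car brake engine"]
labels = ["religion", "religion", "autos", "autos"]
log_prior, log_like = train_nb(docs, labels)
print(predict_nb("god car faith", log_prior, log_like))
print(predict_nb("faith car god", log_prior, log_like))
```

Both calls give the same class because the score is a plain sum over words, so word order and context play no part in the decision.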

**Further Improvements:**

* **More Advanced Preprocessing:** Techniques like stemming or lemmatization could be explored.
* **Different Feature Extraction:** Consider using word embeddings (e.g., Word2Vec, GloVe) or more advanced transformer-based models.
* **Different Classification Algorithms:** Experiment with other classifiers like Support Vector Machines (SVM), Logistic Regression, or deep learning models.
* **Hyperparameter Tuning:** Optimize the hyperparameters of the current Naive Bayes model and other potential models.
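As one possible shape for the hyperparameter-tuning item, the sketch below wraps the vectorizer and the classifier in a scikit-learn `Pipeline` and searches over the smoothing parameter `alpha` with `GridSearchCV`. The toy corpus, the grid values, and `cv=2` are assumptions chosen so the example runs on its own. In the notebook, the inputs would be `df['cleaned_data']` and `df['Labels']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the cleaned posts and their labels.
texts = [
    "rocket launch orbit", "orbit satellite launch",
    "satellite rocket moon", "moon orbit mission",
    "engine brake car", "car wheel engine",
    "brake wheel tire", "tire car road",
]
labels = ["space"] * 4 + ["autos"] * 4

# The vectorizer sits inside the pipeline, so it is refit on each training fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
search = GridSearchCV(pipeline, {"nb__alpha": [0.1, 0.5, 1.0]}, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["rocket orbit", "car engine"]))
```

Keeping `TfidfVectorizer` inside the pipeline means the vocabulary and IDF weights are learned only from the training portion of each fold, which avoids the small leak that comes from fitting the vectorizer on all 2000 posts before splitting.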

*  **Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.**

Ans:- The sentiment analysis using VADER provides insights into the emotional tone of the blog posts across different categories. Looking at the `sentiment_distribution` table and the percentage breakdown, we can observe some interesting patterns:

* **Highly Positive Categories:** Categories like 'comp.graphics' (87% positive), 'comp.sys.mac.hardware' (79%), 'misc.forsale' (82%), and 'sci.electronics' show a high percentage of positive sentiment. This could indicate that discussions in these areas are generally constructive, enthusiastic, or related to sharing positive experiences and information (e.g., about products or new developments).
* **Mixed Sentiment Categories:** Categories such as 'alt.atheism' (40% negative, 59% positive), 'rec.autos' (26% negative, 71% positive), and 'soc.religion.christian' have a significant presence of both positive and negative sentiments. This suggests that conversations in these areas are likely to involve debates, differing opinions, or discussions of both favorable and unfavorable aspects of the topics.
* **Highly Negative Categories:** Categories like 'talk.politics.guns', 'talk.politics.mideast', and 'talk.politics.misc' exhibit a considerably higher proportion of negative sentiment compared to others. This is not surprising, as political discussions often involve strong disagreements, criticism, and negative framing of events or viewpoints.

**Implications:**

* **Understanding Community Tone:** The sentiment distribution can help understand the general tone and emotional climate within different online communities or discussion groups.
* **Content Moderation:** For platforms hosting such discussions, sentiment analysis can be a valuable tool for identifying potentially toxic or negative conversations that might require moderation.
* **Targeted Content Creation:** Understanding the prevalent sentiment in a category can inform content creators on how to tailor their message to resonate with the audience (e.g., using a more positive tone in a generally positive category).
* **Further Analysis:** Categories with mixed or highly negative sentiments might warrant deeper analysis to understand the specific triggers and topics that lead to these sentiments.

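It's important to remember that lexicon-based sentiment analysis like VADER has limitations and may not capture complex language, sarcasm, or context as well as trained machine learning models. Two details of this run add to that. First, the scores were computed on `cleaned_data` rather than the raw `Data` column, so the punctuation, capitalization, and negation words that VADER relies on had already been removed (NLTK's English stopword list includes "not" and "no"). Second, very few posts land in the Neutral band (1 to 10 per category in the rows shown), which suggests that the compound score for long posts tends toward the extremes. The results are therefore best read as an initial overview of the sentiment landscape within the dataset rather than a precise measurement.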