# Data Exploration and Preprocessing

Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.

In [1]:
import pandas as pd 
import numpy as np
df=pd.read_csv("blogs.csv")
df

                                                   Data              Labels
0     Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...         alt.atheism
1     Newsgroups: alt.atheism\nPath: cantaloupe.srv....         alt.atheism
2     Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...         alt.atheism
3     Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...         alt.atheism
4     Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...         alt.atheism
...                                                 ...                 ...
1995  Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...  talk.religion.misc
1996  Xref: cantaloupe.srv.cs.cmu.edu talk.religion....  talk.religion.misc
1997  Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...  talk.religion.misc
1998  Xref: cantaloupe.srv.cs.cmu.edu talk.religion....  talk.religion.misc
1999  Xref: cantaloupe.srv.cs.cmu.edu sci.skeptic:43...  talk.religion.misc


In [2]:
df.head()

                                                Data       Labels
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  alt.atheism
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  alt.atheism


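Checking the number of rows and columns. The count of three columns includes Sentiment: as the execution count shows, this cell was run after the sentiment analysis step further down.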

In [19]:
df.shape

(2000, 3)

Checking the data type and non-null count of each column

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Data       2000 non-null   object
 1   Labels     2000 non-null   object
 2   Sentiment  2000 non-null   object
dtypes: object(3)
memory usage: 47.0+ KB


Checking for missing values

In [22]:
df.isna().sum()

Data         0
Labels       0
Sentiment    0
dtype: int64
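
Beyond shape, dtypes, and missing values, an exploratory pass usually also looks at how the posts are distributed over the labels and how long they are. The following is a minimal sketch on a tiny made-up DataFrame (`sample` and its contents are hypothetical stand-ins for `df`); only the column names `Data` and `Labels` come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df, using the same column names as the notebook.
sample = pd.DataFrame({
    "Data": ["first post text", "a second, longer post text", "third"],
    "Labels": ["alt.atheism", "sci.space", "sci.space"],
})

# Number of posts per category.
label_counts = sample["Labels"].value_counts()

# Length of each post in characters, plus summary statistics.
text_lengths = sample["Data"].str.len()
length_summary = text_lengths.describe()

# Exact duplicate posts; duplicates split across train and test would inflate scores.
duplicate_count = int(sample.duplicated(subset="Data").sum())

print(label_counts)
print(length_summary)
print("duplicates:", duplicate_count)
```

Running the same three checks on `df` would show whether the categories are balanced and whether any posts are unusually short or repeated.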

Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.
Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.
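TfidfVectorizer is used here because it covers these steps in one call: by default it lowercases the text and tokenizes on word characters (which drops punctuation), with stop_words='english' it removes common English stop words, and it converts the result into a numerical TF-IDF matrix.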


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
x = TfidfVectorizer(stop_words='english').fit_transform(df['Data'])  # 'stop_words' removes common words like 'the', 'and'
x

<2000x51096 sparse matrix of type '<class 'numpy.float64'>'
	with 312644 stored elements in Compressed Sparse Row format>
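
To see what the vectorizer actually does to a piece of text, its analyzer can be called directly. This is a small sketch on made-up sentences (the strings and the name `docs` are hypothetical); it also keeps the vectorizer object in a variable, unlike the one-line call above, so that the vocabulary can be inspected afterwards.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")

# build_analyzer() returns the lowercasing + tokenizing + stop-word step as a function.
analyzer = vectorizer.build_analyzer()
tokens = analyzer("The cat, the DOG!")
print(tokens)

# Hypothetical two-document corpus, only to show the fitted vocabulary.
docs = ["The cat purred.", "The dog barked!"]
matrix = vectorizer.fit_transform(docs)
vocabulary = sorted(vectorizer.vocabulary_)
print(matrix.shape, vocabulary)
```

Punctuation and the stop word "the" disappear, and the remaining tokens are lowercased, which is the cleaning the task description asks for.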

In [4]:
dense_matrix = x.toarray()  # Convert sparse matrix to dense format
print(dense_matrix)
  

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.01855593 0.         0.         ... 0.         0.         0.        ]]


In [5]:
y=df['Labels']
y

0              alt.atheism
1              alt.atheism
2              alt.atheism
3              alt.atheism
4              alt.atheism
               ...        
1995    talk.religion.misc
1996    talk.religion.misc
1997    talk.religion.misc
1998    talk.religion.misc
1999    talk.religion.misc
Name: Labels, Length: 2000, dtype: object

In [6]:
df['Labels'].unique()

array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
       'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
       'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
       'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
       'sci.electronics', 'sci.med', 'sci.space',
       'soc.religion.christian', 'talk.politics.guns',
       'talk.politics.mideast', 'talk.politics.misc',
       'talk.religion.misc'], dtype=object)

Convert the target variable to numerical form with LabelEncoder so that it can be used to fit the model.

In [7]:
from sklearn.preprocessing import LabelEncoder
y_df=LabelEncoder().fit_transform(y)
y_df

array([ 0,  0,  0, ..., 19, 19, 19])
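
The cell above discards the encoder after use, which makes it harder to turn predicted integers back into category names. A minimal sketch of the alternative, assuming a short made-up label list (`labels` is hypothetical), keeps the fitted encoder and uses `classes_` and `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real ones come from df['Labels'].
labels = ["sci.space", "alt.atheism", "sci.space"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the original names in sorted order; position equals the integer code.
print(list(encoder.classes_))
print(codes.tolist())

# Map integer codes (for example, model predictions) back to category names.
decoded = encoder.inverse_transform([0, 1])
print(list(decoded))
```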

# Naive Bayes Model for Text Classification

Split the data into training and test sets.

In [8]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y_df,test_size=0.2,random_state=42)
print("x_train:",x_train)
print("x_test:",x_test)
print("y_train",y_train)
print("y_test",y_test)

x_train:   (0, 26056)	0.14189030943272113
  (0, 30690)	0.12878173684436658
  (0, 47633)	0.14189030943272113
  (0, 11)	0.14955833283488978
  (0, 32306)	0.14955833283488978
  (0, 19948)	0.25756347368873317
  (0, 14848)	0.14955833283488978
  (0, 47489)	0.29911666566977957
  (0, 47490)	0.29911666566977957
  (0, 33094)	0.14955833283488978
  (0, 44805)	0.11912116985811903
  (0, 31867)	0.13644976024653524
  (0, 16057)	0.13644976024653524
  (0, 32817)	0.13222974244647356
  (0, 44802)	0.22831884553832324
  (0, 39655)	0.14189030943272113
  (0, 44187)	0.06136489918757282
  (0, 13944)	0.07293230218758116
  (0, 51026)	0.21017978731130924
  (0, 16525)	0.09387406667484614
  (0, 18993)	0.12586648964617783
  (0, 40174)	0.11567316425601205
  (0, 19071)	0.2644594848929471
  (0, 364)	0.09834457386759583
  (0, 19724)	0.11023261506982616
  :	:
  (1599, 38214)	0.0452122383114489
  (1599, 46265)	0.10492592154864856
  (1599, 27465)	0.02681952531388639
  (1599, 38431)	0.026081974048705065
  (1599, 35506)	0.0268
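
Printing the full sparse matrices gives long output that is hard to read; the shapes and the class counts are usually enough to check a split. The sketch below uses small made-up arrays (`features` and `targets` are hypothetical) and additionally passes `stratify`, which is not used in the notebook, so that each class keeps the same proportion in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 20 samples, 2 features, two balanced classes.
features = np.arange(40).reshape(20, 2)
targets = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, targets, test_size=0.2, random_state=42, stratify=targets
)

# Shapes confirm the 80/20 split; bincount shows the per-class counts.
print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```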

Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.

In [9]:
from sklearn.naive_bayes import MultinomialNB
model= MultinomialNB().fit(x_train,y_train)
model

Train the model on the training set and make predictions on the test set.

In [10]:
y_pred=model.predict(x_test)
y_pred

array([18,  3, 13,  9,  3, 12,  9, 17,  0, 13,  0, 13, 11,  9,  3,  2,  7,
        1, 16, 18,  6, 18, 19, 10,  0, 11, 11,  9,  7,  0,  6, 10,  5, 10,
       10,  4, 13, 10, 10,  2,  3, 19,  2,  1, 15,  8, 11,  8,  0, 18, 15,
       11, 14,  2,  9, 17, 12, 16, 11,  3, 14,  6, 16, 10, 18, 18, 10, 15,
        1, 14, 14,  3, 13, 10,  7,  3, 18,  7, 12, 18,  0,  8, 14, 15, 18,
       10, 17,  4,  2, 11, 15, 17,  2,  2, 18, 18,  4, 12, 15, 18,  9,  9,
        4, 13,  2,  8,  7,  2,  6, 11,  7, 14, 10, 17,  7,  3, 11,  9,  3,
       18, 18, 10, 11,  4,  3, 15,  4,  8, 11,  3, 19, 18,  7,  5,  8, 12,
        0,  0, 16, 15, 13,  3, 18,  8, 15,  5,  6,  4,  2,  1, 16,  9,  6,
       11,  3,  6, 13, 16,  3, 16,  6, 16,  1, 11, 19, 14, 17, 12,  2, 17,
       11,  6,  0, 18,  4, 13, 12, 13,  8, 16, 12,  7, 18, 11,  8,  7,  0,
       16, 11, 12,  7,  2, 13,  2, 19,  4, 11,  8,  6, 15, 10, 12,  1,  7,
       11,  2, 14,  9, 10,  0, 17,  0,  2,  1,  7, 18, 11,  3,  1,  1,  9,
        1,  5, 11, 15, 19
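
In the cells above the vectorizer is fitted on all 2000 posts before the split, so the IDF weights are computed with the test posts included. One common way to avoid this, sketched here on made-up texts (all strings and the names `train_texts`, `train_labels`, and `text_clf` are hypothetical), is to split the raw text first and wrap the vectorizer and classifier in a pipeline that is fitted on the training portion only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training texts and labels.
train_texts = [
    "rocket launch orbit shuttle",
    "nasa shuttle orbit mission",
    "pitcher threw strike inning",
    "batter scored inning pitcher",
]
train_labels = ["space", "space", "baseball", "baseball"]

# The vectorizer is fitted inside the pipeline, on training text only.
text_clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
text_clf.fit(train_texts, train_labels)

# New text goes through the same fitted vectorizer before classification.
predictions = text_clf.predict(["shuttle orbit launch", "pitcher strike inning"])
print(list(predictions))
```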

# Sentiment Analysis

Sentiment analysis is a natural language processing (NLP) task used to determine the sentiment or emotion expressed in a piece of text. It typically classifies text into categories such as positive, negative, or neutral.


•	Choose a suitable library or method for performing sentiment analysis on the blog post texts.

•	Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.


In [16]:
from textblob import TextBlob
# Sentiment Analysis
def get_sentiment(text):
    analysis = TextBlob(text).sentiment.polarity
    if analysis > 0:
        return 'Positive'
    elif analysis < 0:
        return 'Negative'
    else:
        return 'Neutral'

# Apply sentiment analysis to the 'Data' column
df['Sentiment'] = df['Data'].apply(get_sentiment)
print(df)

                                                   Data              Labels  \
0     Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...         alt.atheism   
1     Newsgroups: alt.atheism\nPath: cantaloupe.srv....         alt.atheism   
2     Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...         alt.atheism   
3     Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...         alt.atheism   
4     Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...         alt.atheism   
...                                                 ...                 ...   
1995  Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...  talk.religion.misc   
1996  Xref: cantaloupe.srv.cs.cmu.edu talk.religion....  talk.religion.misc   
1997  Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...  talk.religion.misc   
1998  Xref: cantaloupe.srv.cs.cmu.edu talk.religion....  talk.religion.misc   
1999  Xref: cantaloupe.srv.cs.cmu.edu sci.skeptic:43...  talk.religion.misc   

     Sentiment  
0     Positive  
1     Negative  
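
In `get_sentiment` above, a post is Neutral only when its polarity is exactly 0, which tends to be rare for long texts. A possible variation, not part of the notebook, is to treat small polarity values as Neutral. The sketch below works on plain numbers so it does not need TextBlob; the function name `label_polarity` and the threshold of 0.05 are illustrative assumptions, not tuned values.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label, treating small magnitudes as Neutral."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical polarity scores, such as TextBlob(text).sentiment.polarity would return.
scores = [0.30, 0.01, -0.02, -0.40]
labels = [label_polarity(s) for s in scores]
print(labels)
```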


Examine the distribution of sentiments across different categories and summarize your findings.

In [17]:
# Distribution across all data
sentiment_distribution = df['Sentiment'].value_counts()
print("Sentiment Distribution:")
print(sentiment_distribution)

# Distribution across categories (e.g., 'Labels')
category_sentiment_distribution = df.groupby('Labels')['Sentiment'].value_counts()
print("\nCategory Sentiment Distribution:")
print(category_sentiment_distribution)


Sentiment Distribution:
Sentiment
Positive    1543
Negative     457
Name: count, dtype: int64

Category Sentiment Distribution:
Labels                    Sentiment
alt.atheism               Positive     77
                          Negative     23
comp.graphics             Positive     76
                          Negative     24
comp.os.ms-windows.misc   Positive     78
                          Negative     22
comp.sys.ibm.pc.hardware  Positive     80
                          Negative     20
comp.sys.mac.hardware     Positive     76
                          Negative     24
comp.windows.x            Positive     73
                          Negative     27
misc.forsale              Positive     84
                          Negative     16
rec.autos                 Positive     83
                          Negative     17
rec.motorcycles           Positive     74
                          Negative     26
rec.sport.baseball        Positive     71
                          Negative    
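
Raw counts per category are harder to compare than proportions. A compact alternative to the grouped `value_counts` is a row-normalized cross-tabulation; this is a sketch on a tiny made-up DataFrame (`sample` and its values are hypothetical), using the notebook's column names `Labels` and `Sentiment`.

```python
import pandas as pd

# Hypothetical stand-in for df[['Labels', 'Sentiment']].
sample = pd.DataFrame({
    "Labels": ["a", "a", "a", "a", "b", "b"],
    "Sentiment": ["Positive", "Positive", "Positive", "Negative", "Positive", "Negative"],
})

# Rows are categories, columns are sentiments, values are shares within each category.
shares = pd.crosstab(sample["Labels"], sample["Sentiment"], normalize="index")
print(shares)

# Categories ordered from the largest to the smallest share of Negative posts.
most_negative = shares["Negative"].sort_values(ascending=False)
print(most_negative)
```

Applied to `df`, the sorted column would show directly which categories lean most negative.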

# Evaluation

Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.

In [25]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
score=accuracy_score(y_test,y_pred)
score

0.7825

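Because the target is multiclass, precision, recall, and F1 need an explicit average setting, one of [None, 'micro', 'macro', 'weighted']. The default, 'binary', raises an error for multiclass targets.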

In [28]:
precision = precision_score(y_test, y_pred, average='weighted')  # Change average as needed
recall = recall_score(y_test, y_pred, average='weighted')        
f1 = f1_score(y_test, y_pred, average='weighted')   

In [29]:
precision

0.8074292078467984

In [30]:
recall

0.7825

In [31]:
f1

0.769562107458443
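
Single averaged numbers hide which categories the model confuses. A per-class report and a confusion matrix give that detail; the sketch below uses short made-up label lists (`true_labels` and `predicted` are hypothetical) and also checks that weighted recall equals accuracy, which is why the two values above are both 0.7825.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Hypothetical true and predicted class codes for three classes.
true_labels = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

# Per-class precision, recall, F1, and support.
print(classification_report(true_labels, predicted))

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(true_labels, predicted)
print(matrix)

# Weighted recall is the support-weighted mean of per-class recall, which equals accuracy.
accuracy = accuracy_score(true_labels, predicted)
weighted_recall = recall_score(true_labels, predicted, average="weighted")
print(accuracy, weighted_recall)
```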

# Evaluation Criteria Assessment

Correct Implementation of Data Preprocessing and Feature Extraction

TfidfVectorizer handled lowercasing, tokenization, stop-word removal, and TF-IDF weighting in a single step, producing a 2000 x 51096 sparse matrix.
 
Accuracy and Robustness of the Naive Bayes Model

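The model reached 0.7825 accuracy on the 400-post test set, with weighted precision 0.8074, weighted recall 0.7825, and weighted F1 0.7696. Weighted recall always equals accuracy, so it adds no separate information here.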

Depth and Insightfulness of Sentiment Analysis

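TextBlob polarity was used to label each post as Positive, Negative, or Neutral. In this run, 1543 posts were labeled Positive and 457 Negative, and none were Neutral. In the categories shown in the output above, Positive accounts for 71 to 84 posts per category, so the sentiment split looks similar across them.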

Clarity and Thoroughness of Evaluation

Ensure that your report clearly explains how the results relate to the blog post labels (e.g., are certain labels predominantly positive or negative?).

Overall Quality and Organization

Provide a clear summary of each step, metrics, and results in your report for better readability.