### TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS


### Tasks -

#### 1. Data Exploration and Preprocessing

•	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.

•	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.

•	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.
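The exploration and cleaning steps above can be sketched on a toy frame before running them on the full dataset. This is a minimal sketch under stated assumptions: the three rows are invented, the `Clean` column and `clean_text` helper are hypothetical names, and the tiny `STOP_WORDS` set stands in for a full English stopword list such as NLTK's.

```python
import re
import pandas as pd

# Toy stand-in for blogs_categories.csv (rows are invented for illustration).
toy = pd.DataFrame({
    "Data": [
        "Subject: Re: Is morality objective?\nI don't think so!",
        "Subject: Atheism FAQ\nAn overview of common questions.",
        "Subject: 3D rendering\nWhich library draws polygons fastest?",
    ],
    "Labels": ["alt.atheism", "alt.atheism", "comp.graphics"],
})

# Exploration: class balance and document length in words.
label_counts = toy["Labels"].value_counts()
word_counts = toy["Data"].str.split().str.len()
print(label_counts)
print(word_counts.describe())

# Tiny hypothetical stopword set; a real run would use a full English list.
STOP_WORDS = {"a", "an", "the", "is", "of", "i", "so", "which", "re"}

def clean_text(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

toy["Clean"] = toy["Data"].apply(clean_text)
print(toy["Clean"].iloc[0])
```

Replacing punctuation with a space rather than deleting it keeps tokens such as e-mail addresses from fusing into one long word; whether that helps the classifier on this dataset is something to check, not a given.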


#### 2. Naive Bayes Model for Text Classification -

•	Split the data into training and test sets.

•	Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.

•	Train the model on the training set and make predictions on the test set.
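One way to keep the vectorizer and classifier together is a scikit-learn pipeline, which fits TF-IDF only on the data passed to `fit`. The sketch below assumes a tiny invented corpus and the hypothetical variable names `train_texts`, `train_labels`, and `test_texts`; it illustrates the technique and is not the notebook's own training run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training documents, two per category.
train_texts = [
    "the team won the hockey game",
    "great goal in the hockey match",
    "new graphics card renders images fast",
    "the gpu improves image rendering",
]
train_labels = [
    "rec.sport.hockey",
    "rec.sport.hockey",
    "comp.graphics",
    "comp.graphics",
]

# TF-IDF features feed a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Unseen documents are transformed with the vocabulary learned during fit.
test_texts = ["hockey game tonight", "image rendering on gpu"]
predictions = model.predict(test_texts)
print(list(predictions))
```

Because the pipeline owns the vectorizer, calling `model.fit` on the training split and `model.predict` on the test split avoids leaking test vocabulary or document frequencies into training.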

#### 3. Sentiment Analysis -

•	Choose a suitable library or method for performing sentiment analysis on the blog post texts.

•	Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.

•	Examine the distribution of sentiments across different categories and summarize your findings.
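The categorization and per-category distribution can be sketched as follows. The polarity scores are invented stand-ins for what a sentiment library would return, the `Sentiment_Label` column name is hypothetical, and the `threshold` of 0.05 is an assumption (a cutoff of exactly 0 marks almost every long post as positive or negative, leaving very few neutral ones).

```python
import pandas as pd

def categorize_sentiment(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"

# Invented polarity scores, one per post, alongside the post's category.
toy = pd.DataFrame({
    "Labels": ["alt.atheism", "alt.atheism", "rec.autos", "rec.autos"],
    "Sentiment": [-0.30, 0.01, 0.40, 0.20],
})

# Keep the sentiment label in its own column so the category column stays intact.
toy["Sentiment_Label"] = toy["Sentiment"].apply(categorize_sentiment)

# Rows are categories, columns are sentiment labels, values are row-wise shares.
distribution = pd.crosstab(toy["Labels"], toy["Sentiment_Label"], normalize="index")
print(distribution)
```

Normalizing by row makes categories of different sizes comparable, since each row sums to 1.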


#### 4. Evaluation -

•	Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.

•	Discuss the performance of the model and any challenges encountered during the classification process.

•	Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.
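When discussing classification challenges, it can help to pull the most frequently confused pair of categories out of the confusion matrix. The sketch below uses invented `y_true` and `y_pred` lists over three of the dataset's category names; the `most_confused` tuple is a hypothetical helper result, not a scikit-learn feature.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

labels = ["alt.atheism", "comp.graphics", "misc.forsale"]

# Invented ground truth and predictions for eight posts.
y_true = ["alt.atheism", "alt.atheism", "alt.atheism",
          "comp.graphics", "comp.graphics",
          "misc.forsale", "misc.forsale", "misc.forsale"]
y_pred = ["alt.atheism", "alt.atheism", "comp.graphics",
          "comp.graphics", "comp.graphics",
          "misc.forsale", "comp.graphics", "comp.graphics"]

# Rows are true categories, columns are predicted categories.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Macro averaging weights every category equally, regardless of its size.
macro_recall = recall_score(y_true, y_pred, labels=labels, average="macro")

# Zero the diagonal so only misclassifications remain, then find the largest cell.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_idx, pred_idx = np.unravel_index(errors.argmax(), errors.shape)
most_confused = (labels[true_idx], labels[pred_idx], int(errors[true_idx, pred_idx]))

print(cm)
print(f"Macro recall: {macro_recall:.4f}")
print("Most confused (true, predicted, count):", most_confused)
```

Applied to the real test split, the same steps point to the category pairs whose vocabulary overlaps most, which is usually where Naive Bayes struggles.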


In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('blogs_categories.csv')

In [5]:
data.columns

Index(['Unnamed: 0', 'Data', 'Labels'], dtype='object')

In [7]:
data.head()

| | Unnamed: 0 | Data | Labels |
|---|---|---|---|
| 0 | 0 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49... | alt.atheism |
| 1 | 1 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51... | alt.atheism |
| 2 | 2 | Newsgroups: alt.atheism\nPath: cantaloupe.srv.... | alt.atheism |
| 3 | 3 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51... | alt.atheism |
| 4 | 4 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51... | alt.atheism |


In [8]:
#Load the dataset 
data=pd.read_csv('blogs_categories.csv')

# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

# The counts printed below show no missing values, so no imputation is needed.
# 'Unnamed: 0' is a leftover row index, so mean or mode imputation on it would be meaningless.
# As a safeguard, drop any row that lacks text or a label.
data = data.dropna(subset=['Data', 'Labels'])

# The source CSV is left untouched so the notebook can be re-run from the raw data.


Unnamed: 0    0
Data          0
Labels        0
dtype: int64


In [9]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('blogs_categories.csv')

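# Text preprocessing
# Convert text to lowercase and remove punctuation.
# A raw string avoids invalid-escape warnings, and regex=True is required because
# pandas 2.0 and later treat the pattern as a literal string by default.
df['Data'] = df['Data'].str.lower().str.replace(r'[^\w\s]', '', regex=True)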

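# Tokenization and removing stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time download of the NLTK resources used below
# (newer NLTK releases may also need 'punkt_tab' for word_tokenize)
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

stop_words = set(stopwords.words('english'))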

def preprocess_text(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_tokens)

df['Data'] = df['Data'].apply(preprocess_text)

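# Split the raw text first so the vectorizer never sees the test documents
X_train, X_test, y_train, y_test = train_test_split(df['Data'], df['Labels'], test_size=0.2, random_state=42)

# Feature extraction (TF-IDF): fit on the training split only, then transform the test split
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)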






In [10]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score


# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Data'], df['Labels'], test_size=0.2, random_state=42)

# Feature extraction (TF-IDF)
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train the Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)

# Make predictions on the test set
y_pred = nb_classifier.predict(X_test_tfidf)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.89775

Classification Report:
                           precision    recall  f1-score   support

             alt.atheism       0.72      0.79      0.75       173
           comp.graphics       0.88      0.91      0.90       179
 comp.os.ms-windows.misc       0.93      0.88      0.91       226
comp.sys.ibm.pc.hardware       0.84      0.85      0.85       204
   comp.sys.mac.hardware       0.90      0.96      0.93       205
          comp.windows.x       0.97      0.95      0.96       186
            misc.forsale       0.91      0.77      0.84       190
               rec.autos       0.91      0.95      0.93       203
         rec.motorcycles       1.00      0.97      0.98       218
      rec.sport.baseball       0.99      0.98      0.99       192
        rec.sport.hockey       0.97      0.99      0.98       203
               sci.crypt       0.90      0.98      0.94       200
         sci.electronics       0.94      0.90      0.92       227
                 sci.med       1

#### 3. Sentiment Analysis -



In [8]:
!pip install textblob


In [11]:
from textblob import TextBlob

# Assuming df is your DataFrame containing the preprocessed data

# Perform sentiment analysis
df['Sentiment'] = df['Data'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Categorize sentiments as positive, negative, or neutral
def categorize_sentiment(sentiment):
    if sentiment > 0:
        return 'Positive'
    elif sentiment < 0:
        return 'Negative'
    else:
        return 'Neutral'

# Store the result in its own column so the category labels in 'Labels' are preserved
df['Sentiment_Label'] = df['Sentiment'].apply(categorize_sentiment)

# Examine the distribution of sentiments across different categories
# (each row is a category; the columns give the share of Negative, Neutral, and Positive posts)
sentiment_distribution = df.groupby('Labels')['Sentiment_Label'].value_counts(normalize=True).unstack().fillna(0)

# Summarize findings
print("\nSentiment Distribution Across Categories:")
print(sentiment_distribution)


#### 4. Evaluation -



In [11]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Split the data into training and test sets
X = df['Data']
y = df['Labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [13]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')

# Confusion Matrix and Classification Report
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

print('Classification Report:')
print(classification_report(y_test, y_pred))


Accuracy: 0.8978
Precision: 0.8934
Recall: 0.8939
F1 Score: 0.8924
Confusion Matrix:
[[137   0   0   0   0   0   0   0   0   0   0   1   0   0   0   5   0   1
    1  28]
 [  1 163   1   4   2   3   0   0   0   0   0   2   0   0   1   0   0   1
    0   1]
 [  1   6 199  13   1   0   0   0   0   0   0   2   1   0   1   0   0   0
    0   2]
 [  0   1   9 174   5   2   6   1   0   0   0   2   3   0   0   0   0   0
    0   1]
 [  0   0   1   3 197   0   0   0   0   0   0   0   1   0   0   0   0   0
    3   0]
 [  0   6   2   0   1 176   0   0   0   0   0   1   0   0   0   0   0   0
    0   0]
 [  1   1   1   9   6   0 147   9   1   0   4   2   4   0   0   0   2   1
    1   1]
 [  0   0   0   0   1   0   5 192   0   0   0   0   2   0   0   0   2   0
    1   0]
 [  0   0   0   1   0   0   1   3 211   0   0   2   0   0   0   0   0   0
    0   0]
 [  1   0   0   0   0   0   0   2   0 188   0   1   0   0   0   0   0   0
    0   0]
 [  0   1   0   0   0   0   0   0   0   0 200   0   1   0   0   1