## Name: Snehal Shyam Jagtap

## Assignment No. 19



### Naive Bayes and Text Mining

### TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS

In this assignment, you will work with the "blogs_categories.csv" dataset (the code below reads a file named `blogs.csv`), which contains blog posts labelled by category. Your task is to build a Naive Bayes text classifier that assigns each post to its category, and then to run sentiment analysis that labels each post as positive, negative, or neutral. Together these steps cover text classification, sentiment analysis, and the practical use of the Naive Bayes algorithm in Natural Language Processing (NLP).

In [None]:
%pip install pandas nltk scikit-learn textblob

### 1: Import Libraries and Download Resources

In [1]:
# Import necessary libraries
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from textblob import TextBlob

In [2]:
# Download NLTK stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### 2: Load and Explore the Dataset

In [3]:
# Load the dataset
df = pd.read_csv('blogs.csv')

In [4]:
# Display the first few rows to understand the data structure
df.head()

,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism


### 3: Data Preprocessing (Cleaning the Text)

In [5]:
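# Assuming 'Data' is the column for blog text and 'Labels' is for categories
# .copy() makes the selection an independent DataFrame, so adding columns later
# does not raise a SettingWithCopyWarning
df = df[['Data', 'Labels']].copy()  # Adjust column names if necessary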

In [6]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Function to clean text: lowercase, remove punctuation, remove stopwords
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation (characters are deleted, so "alt.atheism" becomes "altatheism")
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)
    return text
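The cleaned output further down contains long fused tokens such as `cantaloupesrvcscmuedu...`. The sketch below suggests why: deleting punctuation joins the pieces of a host name into one token, whereas mapping punctuation to spaces keeps them apart. It uses a tiny stand-in stopword set (an assumption, not NLTK's list), and the helper names are hypothetical.

```python
import string

# Tiny stand-in stopword list (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "is", "a", "to", "of", "and", "in"}

def strip_punctuation(text):
    """Lowercase and delete punctuation outright, as preprocess_text does."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def punctuation_to_spaces(text):
    """Alternative: lowercase and map each punctuation character to a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)

def drop_stopwords(text):
    """Remove stopwords and collapse repeated whitespace."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "Path: cantaloupe.srv.cs.cmu.edu is the host."
deleted = drop_stopwords(strip_punctuation(sample))
spaced = drop_stopwords(punctuation_to_spaces(sample))
print(deleted)  # path cantaloupesrvcscmuedu host
print(spaced)   # path cantaloupe srv cs cmu edu host
```

Which variant works better for classification is an empirical question; the second one simply gives TF-IDF more reusable tokens to count.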

In [7]:
# Apply the preprocessing function to the dataset
df['Cleaned_Text'] = df['Data'].apply(preprocess_text)

In [8]:
# Display the cleaned data
df.head()

,Data,Labels,Cleaned_Text
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism path cantaloupesrvcscmue...
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,path cantaloupesrvcscmuedudasnewsharvardedunoc...
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,xref cantaloupesrvcscmuedu altatheism53485 tal...


### 4: Feature Extraction (TF-IDF)

In [9]:
# Initialize TF-IDF vectorizer with a maximum of 5000 features
tfidf = TfidfVectorizer(max_features=5000)

In [10]:
# Convert the cleaned text data into numeric form
# (kept as a sparse matrix; MultinomialNB and train_test_split accept it directly)
X = tfidf.fit_transform(df['Cleaned_Text'])

In [11]:
# Labels (target variable)
y = df['Labels']

In [12]:
# Display the shape of the resulting feature matrix and labels
X.shape, y.shape

((2000, 5000), (2000,))
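To make the numbers inside this matrix less opaque, here is a small pure-Python sketch of TF-IDF weighting. It assumes raw term counts, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which is how the scikit-learn documentation describes `TfidfVectorizer`'s defaults. The function names `fit_idf` and `transform` are hypothetical and only mirror the fit/transform split.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn a smoothed IDF weight per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """TF-IDF weights for one document, L2-normalised; unseen terms are skipped."""
    tf = Counter(t for t in doc.split() if t in idf)
    weights = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy corpus, made up for illustration.
train_docs = ["god exists proof", "graphics card driver", "graphics image proof"]
idf = fit_idf(train_docs)
vec = transform("graphics card card unknownword", idf)

print(idf["card"] > idf["graphics"])  # True: "card" is rarer, so it weighs more
print(vec)
```

Separating the two steps also makes it possible to learn the IDF weights from the training documents only and then apply them to the test documents, so that no statistics from the test set influence the features.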

### 5: Split the Data into Training and Test Sets

In [13]:
# Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
# Display the shapes of training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1600, 5000), (400, 5000), (1600,), (400,))
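A plain random split does not guarantee that every category is equally represented in the test set; the per-class support in the classification report ranges from 15 to 25. A stratified split keeps the class proportions fixed. In scikit-learn this is available through the `stratify=y` argument of `train_test_split`; the sketch below shows the idea in pure Python with made-up labels and a hypothetical helper name.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with about test_size of each class in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)  # shuffle within the class, then cut
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Illustrative labels: two balanced classes of 100 posts each (an assumption).
labels = ["alt.atheism"] * 100 + ["sci.med"] * 100
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 160 40
```

With balanced classes, every category then contributes the same number of test posts, which makes the per-class scores easier to compare.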

### 6: Train Naive Bayes Classifier

In [15]:
# Initialize the Multinomial Naive Bayes classifier
model = MultinomialNB()

In [16]:
# Train the model on the training data
model.fit(X_train, y_train)
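For intuition about what fitting does, the following is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, the same default value `MultinomialNB` uses). It works on raw token counts rather than the TF-IDF weights used in this notebook, the toy documents are made up, and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token counts."""
    vocab = {t for d in docs for t in d.split()}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Pick the class with the highest log posterior; unknown tokens are ignored."""
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in doc.split() if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = ["engine tires road", "road bike helmet", "puck goalie ice", "ice rink goalie"]
labels = ["rec.autos", "rec.autos", "rec.sport.hockey", "rec.sport.hockey"]
log_prior, log_like = train_nb(docs, labels)
print(predict_nb("goalie on the ice", log_prior, log_like))  # rec.sport.hockey
```

Smoothing matters here: without `alpha`, a single word never seen in a class would drive that class's probability to zero.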

### 7: Make Predictions and Evaluate the Model

In [17]:
# Make predictions on the test data
y_pred = model.predict(X_test)

In [18]:
# Print accuracy score
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.8225


In [19]:
# Print classification report for detailed performance metrics
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
                           precision    recall  f1-score   support

             alt.atheism       0.50      0.83      0.62        18
           comp.graphics       0.79      0.83      0.81        18
 comp.os.ms-windows.misc       0.86      0.82      0.84        22
comp.sys.ibm.pc.hardware       0.76      0.76      0.76        25
   comp.sys.mac.hardware       0.83      0.90      0.86        21
          comp.windows.x       0.91      0.84      0.87        25
            misc.forsale       0.82      0.78      0.80        18
               rec.autos       0.89      0.94      0.92        18
         rec.motorcycles       0.94      0.94      0.94        16
      rec.sport.baseball       0.77      0.94      0.85        18
        rec.sport.hockey       0.88      1.00      0.94        15
               sci.crypt       0.95      0.95      0.95        19
         sci.electronics       0.62      0.62      0.62        16
                 sci.med       0.88      0.88      
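The report's columns follow directly from counts of correct and incorrect predictions. As a worked example, the `alt.atheism` row (precision 0.50, recall 0.83, support 18) is consistent with 15 correct predictions out of 30 posts predicted as `alt.atheism`. These counts are inferred from the rounded figures, not read from the notebook, and the helper below is hypothetical.

```python
from fractions import Fraction

def precision_recall_f1(tp, fp, fn):
    """Exact precision, recall and F1 from true-positive, false-positive, false-negative counts."""
    precision = Fraction(tp, tp + fp)
    recall = Fraction(tp, tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts for alt.atheism (support = tp + fn = 18):
# 15 correct, 15 posts from other groups labelled alt.atheism, 3 alt.atheism posts missed.
p, r, f = precision_recall_f1(tp=15, fp=15, fn=3)
print(p, r, f)  # 1/2 5/6 5/8
print(float(p), float(r), float(f))
```

Read this way, the low precision means the model over-predicts `alt.atheism`: half of the posts it assigns to that group belong elsewhere, even though it finds most of the true ones.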

### 8: Sentiment Analysis on Blog Posts

In [20]:
# Function to determine the sentiment of a text
def get_sentiment(text):
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'
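Note that `get_sentiment` returns 'Neutral' only when the polarity is exactly 0, which a long post rarely produces. One common variation is a small neutral band around zero. The sketch below takes a polarity score in [-1, 1] as input, so it runs without TextBlob; the band width of 0.05 and the function name are assumptions chosen for illustration, not tuned values.

```python
def label_polarity(polarity, band=0.05):
    """Map a polarity score in [-1, 1] to a label; |polarity| <= band counts as neutral."""
    if polarity > band:
        return "Positive"
    if polarity < -band:
        return "Negative"
    return "Neutral"

# Made-up polarity scores for illustration.
for score in (0.40, 0.02, -0.01, -0.30):
    print(score, label_polarity(score))
```

A wider band trades fewer weakly supported Positive/Negative labels for more Neutral ones; the right width depends on how the labels will be used.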

In [21]:
# Apply sentiment analysis to the original blog post data (not the cleaned version)
df['Sentiment'] = df['Data'].apply(get_sentiment)

In [22]:
# Display the first few rows with sentiment labels
df.head()

,Data,Labels,Cleaned_Text,Sentiment
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,Positive
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism path cantaloupesrvcscmue...,Negative
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,path cantaloupesrvcscmuedudasnewsharvardedunoc...,Positive
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,Positive
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,xref cantaloupesrvcscmuedu altatheism53485 tal...,Positive


### 9: Analyze Sentiment Distribution

In [23]:
# Display the overall sentiment distribution
df['Sentiment'].value_counts()

Sentiment
Positive    1543
Negative     457
Name: count, dtype: int64

In [24]:
# Group by category and analyze sentiment distribution per category
sentiment_by_category = df.groupby('Labels')['Sentiment'].value_counts(normalize=True).unstack()

In [25]:
# Display sentiment by category
sentiment_by_category

Labels,Negative,Positive
alt.atheism,0.23,0.77
comp.graphics,0.24,0.76
comp.os.ms-windows.misc,0.22,0.78
comp.sys.ibm.pc.hardware,0.2,0.8
comp.sys.mac.hardware,0.24,0.76
comp.windows.x,0.27,0.73
misc.forsale,0.16,0.84
rec.autos,0.17,0.83
rec.motorcycles,0.26,0.74
rec.sport.baseball,0.29,0.71


### 10: Save Results to a CSV File

In [26]:
# Save the dataset with sentiments to a new CSV file
df.to_csv('blog_sentiments.csv', index=False)

In [27]:
# Confirm that the file was saved
print("Results saved to 'blog_sentiments.csv'")

Results saved to 'blog_sentiments.csv'
