<a href="https://colab.research.google.com/github/mishba-ai/Learning-ML/blob/main/Naivebayes_classifier_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis on Reddit Posts


- Fetch data from the Reddit API.

- Preprocess the text data.

- Build a vocabulary.

- Extract features using Bag of Words (BoW).

- Train a Naive Bayes classifier.

- Evaluate the model.
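As a quick illustration of the Bag of Words step listed above, here is a minimal sketch that counts tokens with `collections.Counter`. The helper name `bag_of_words` and the sample sentence are made up for this example; the classifier later in the notebook uses its own `extract_features` method.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Hypothetical sample sentence, for illustration only
bow = bag_of_words("Great product great support")
print(bow)  # Counter({'great': 2, 'product': 1, 'support': 1})
```

Word order is discarded; only the per-word counts are kept.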

In [3]:
!pip install asyncpraw
import math
from collections import defaultdict, Counter
import asyncpraw
import asyncio
import pandas as pd



In [4]:
# access secret keys
from google.colab import userdata
client_id = userdata.get('CLIENT_ID')
client_secret = userdata.get('CLIENT_SECRET')
user_agent = userdata.get('USER_AGENT')

In [5]:
# Authenticate with Reddit API
reddit = asyncpraw.Reddit(
    client_id=client_id,          # Your Client ID
    client_secret=client_secret,  # Your Client Secret
    user_agent=user_agent         # A unique identifier for your app
)

In [6]:
# fetch data

async def fetch_reddit_data(subreddit_name, limit=100):
    data = []
    subreddit = await reddit.subreddit(subreddit_name)

    async for submission in subreddit.hot(limit=limit):
        # Combine title and selftext for analysis
        full_text = f"{submission.title} {submission.selftext}"

        # Proxy label based on score: upvotes stand in for sentiment here,
        # so this reflects popularity rather than the tone of the text.
        # You might want to adjust this threshold.
        sentiment = 'positive' if submission.score > 5 else 'negative'

        data.append({
            'text': full_text,
            'score': submission.score,
            'url': submission.url,
            'sentiment': sentiment
        })

    return data

In [8]:
import nest_asyncio
nest_asyncio.apply()

async def main():
    reddit_data = await fetch_reddit_data("SAAS", limit=50)
    return reddit_data

# Run the async function
reddit_data = asyncio.run(main())

# Convert to DataFrame for easier viewing
df = pd.DataFrame(reddit_data)
print(f"Fetched {len(df)} posts")
df.head()

Fetched 50 posts


| | text | score | url | sentiment |
|---|---|---|---|---|
| 0 | Upcoming AmA: "Bootstrapped to 25,000,000 user... | 16 | https://www.reddit.com/r/SaaS/comments/1iecmnj... | positive |
| 1 | Weekly Feedback Post - SaaS Products, Ideas, C... | 6 | https://www.reddit.com/r/SaaS/comments/1ics6vd... | positive |
| 2 | Your users don’t care if you’re using OpenAI, ... | 42 | https://www.reddit.com/r/SaaS/comments/1if64xj... | positive |
| 3 | The Hidden Superpower in Building Successful S... | 18 | https://www.reddit.com/r/SaaS/comments/1ifa29a... | positive |
| 4 | What is most genius marketing you can have com... | 29 | https://www.reddit.com/r/SaaS/comments/1if6lw5... | positive |


# Naive Bayes Classifier



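Naive Bayes is a probabilistic classifier based on Bayes' Theorem:

P(A|B) = P(B|A) × P(A) / P(B)

In the context of text classification:

- P(Class|Text) = P(Text|Class) × P(Class) / P(Text)
- We want to find the class with the highest probability given the text
- P(Text) is the same for every class, so it can be ignored when comparing classes
- The "naive" assumption treats words as independent given the class, so P(Text|Class) is the product of P(Word|Class) over the words in the text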
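To make the formula concrete, here is a small sketch that applies Bayes' Theorem to two classes. The priors and likelihoods are invented numbers chosen for easy arithmetic, and `posterior` is a hypothetical helper, not part of the classifier below.

```python
def posterior(priors, likelihoods):
    """Return P(Class|Text) for each class from P(Class) and P(Text|Class)."""
    # P(Text) is the sum of P(Text|Class) * P(Class) over all classes
    evidence = sum(priors[c] * likelihoods[c] for c in priors)
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}

# Invented numbers, for illustration only
priors = {"positive": 0.4, "negative": 0.6}          # P(Class)
likelihoods = {"positive": 0.02, "negative": 0.005}  # P(Text|Class)

post = posterior(priors, likelihoods)
print(post)  # positive is about 0.727, negative about 0.273
```

Even though the negative class has the larger prior here, the text is four times more likely under the positive class, so the positive posterior wins.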

## When to Use Naive Bayes?

Naive Bayes is particularly good for:

- Text classification (spam detection, sentiment analysis)
- Document categorization
- Email filtering
- Language detection
- Disease diagnosis

It works best when:

- You need fast training and prediction
- You have high-dimensional data (like text)
- You want probabilistic predictions
- You have relatively independent features
- You need to handle missing data well



---

The process involves two phases:

1. Training phase:

   a. Calculate P(Class) for each class

   - Count how often each class appears in the training data
   - Example: P(positive) = number of positive reviews / total reviews

   b. Calculate P(Word|Class) for each word and class

   - Count word frequencies in each class
   - Apply smoothing so that words never seen in a class do not get zero probability
   - Example: P(great|positive) = (count of "great" in positive reviews + 1) / (total words in positive reviews + vocabulary size)

2. Prediction phase:

   - For a new text, calculate P(Class|Text) for each class
   - Choose the class with the highest probability
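The add-one smoothing formula from the training phase can be checked by hand on a tiny example. The sketch below assumes a made-up four-word vocabulary and four positive tokens; `smoothed_word_probs` is a hypothetical helper that mirrors the formula, not the class defined later.

```python
from collections import Counter

def smoothed_word_probs(tokens, vocabulary):
    """P(word|class) with add-one (Laplace) smoothing for one class."""
    counts = Counter(tokens)
    total_words = len(tokens)
    vocab_size = len(vocabulary)
    return {
        word: (counts[word] + 1) / (total_words + vocab_size)
        for word in vocabulary
    }

# Invented data: all tokens seen in the positive class, and the full vocabulary
positive_tokens = ["great", "great", "love", "tool"]
vocabulary = {"great", "love", "tool", "bad"}

probs = smoothed_word_probs(positive_tokens, vocabulary)
print(probs["great"])  # (2 + 1) / (4 + 4) = 0.375
print(probs["bad"])    # (0 + 1) / (4 + 4) = 0.125
```

The word "bad" never appears in the positive tokens, yet it still gets a small non-zero probability, and the probabilities over the vocabulary still sum to 1.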

In [9]:
class NaiveBayesClassifier:
    def __init__(self):
        self.class_probs = {}  # P(class)
        self.word_probs = {}   # P(word|class)
        self.vocabulary = set()

    def preprocess_text(self, text):
        """Clean and tokenize text."""
        # Convert to lowercase
        text = text.lower()
        # Basic punctuation removal, including straight and curly quotes
        # (you can expand this)
        for punct in '.,!?;:()[]{}"\'“”‘’':
            text = text.replace(punct, '')
        # Split into words
        return text.split()

    def build_vocabulary(self, texts):
        """Build vocabulary from all texts."""
        for text in texts:
            words = self.preprocess_text(text)
            self.vocabulary.update(words)

    def extract_features(self, text):
        """Convert text to bag of words."""
        words = self.preprocess_text(text)
        return Counter(words)

    def train(self, texts, labels):
        """Train the Naive Bayes classifier."""
        # Build vocabulary
        self.build_vocabulary(texts)

        # Count documents per class
        class_counts = Counter(labels)
        total_docs = len(texts)

        # Calculate P(class)
        for class_label in class_counts:
            self.class_probs[class_label] = class_counts[class_label] / total_docs

        # Initialize word counts per class
        word_counts = defaultdict(lambda: defaultdict(int))
        total_words_per_class = defaultdict(int)

        # Count word occurrences per class
        for text, label in zip(texts, labels):
            words = self.preprocess_text(text)
            for word in words:
                word_counts[label][word] += 1
                total_words_per_class[label] += 1

        # Calculate P(word|class) with Laplace smoothing
        vocab_size = len(self.vocabulary)
        self.word_probs = defaultdict(dict)

        for class_label in class_counts:
            for word in self.vocabulary:
                # Add-one smoothing
                numerator = word_counts[class_label][word] + 1
                denominator = total_words_per_class[class_label] + vocab_size
                self.word_probs[class_label][word] = numerator / denominator

    def predict(self, text):
        """Predict sentiment of text."""
        features = self.extract_features(text)

        # Calculate log probabilities for each class
        scores = {}
        for class_label in self.class_probs:
            # Start with log(P(class))
            scores[class_label] = math.log(self.class_probs[class_label])

            # Add log(P(word|class)) for each word
            for word, count in features.items():
                if word in self.vocabulary:
                    scores[class_label] += count * math.log(self.word_probs[class_label][word])

        # Return class with highest probability
        return max(scores.items(), key=lambda x: x[1])[0]

    def evaluate(self, test_texts, test_labels):
        """Calculate accuracy on test data."""
        correct = 0
        total = len(test_texts)

        predictions = []
        for text in test_texts:
            predictions.append(self.predict(text))

        for pred, true in zip(predictions, test_labels):
            if pred == true:
                correct += 1

        accuracy = correct / total
        return accuracy


## Prepare and clean the data


In [13]:
# Remove any empty texts
reddit_data = [post for post in reddit_data if post['text'].strip()]

# Extract texts and labels
texts = [post['text'] for post in reddit_data]
labels = [post['sentiment'] for post in reddit_data]

# Print some statistics
print(f"Total samples: {len(texts)}")
print(f"Positive samples: {labels.count('positive')}")
print(f"Negative samples: {labels.count('negative')}")

Total samples: 50
Positive samples: 16
Negative samples: 34


## Split data and train model


In [14]:
# Split into training and testing sets (80-20 split, in fetch order, no shuffling)
split_idx = int(0.8 * len(texts))
train_texts = texts[:split_idx]
train_labels = labels[:split_idx]
test_texts = texts[split_idx:]
test_labels = labels[split_idx:]

# Initialize and train classifier
classifier = NaiveBayesClassifier()
classifier.train(train_texts, train_labels)
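The slice-based split above keeps the posts in the order returned by `subreddit.hot`, so the test set is simply the last 20% of that listing. One common alternative is to shuffle before splitting. The sketch below is one way to do that with the standard library; `shuffled_split`, its `seed` parameter, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import random

def shuffled_split(texts, labels, train_fraction=0.8, seed=42):
    """Shuffle (text, label) pairs together, then split into train and test."""
    pairs = list(zip(texts, labels))
    # A seeded Random instance keeps the split reproducible
    random.Random(seed).shuffle(pairs)
    split_idx = int(train_fraction * len(pairs))
    train_pairs, test_pairs = pairs[:split_idx], pairs[split_idx:]
    return (
        [text for text, _ in train_pairs],
        [label for _, label in train_pairs],
        [text for text, _ in test_pairs],
        [label for _, label in test_pairs],
    )

# Toy data, for illustration only
toy_texts = [f"post {i}" for i in range(10)]
toy_labels = ["positive" if i % 2 == 0 else "negative" for i in range(10)]

toy_train_x, toy_train_y, toy_test_x, toy_test_y = shuffled_split(toy_texts, toy_labels)
print(len(toy_train_x), len(toy_test_x))  # 8 2
```

Shuffling the pairs together keeps each text attached to its own label. With only 50 posts, a single split will still give a noisy estimate either way.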

## Evaluate the model


In [15]:
# Calculate and print accuracy
accuracy = classifier.evaluate(test_texts, test_labels)
print(f"Model Accuracy: {accuracy:.2%}")

# Print some example predictions
print("\nExample Predictions:")
for text, true_label in zip(test_texts[:5], test_labels[:5]):
    pred = classifier.predict(text)
    print(f"\nText: {text[:100]}...")  # Show first 100 chars
    print(f"True sentiment: {true_label}")
    print(f"Predicted sentiment: {pred}")

Model Accuracy: 80.00%

Example Predictions:

Text: I made a project to convert syllabus into AI generated study material PDFs Hey everyone! 👋

Recently...
True sentiment: negative
Predicted sentiment: negative

Text: Looking for feedback on my SaaS which can make Blogs to infographics i have built out a tool which c...
True sentiment: negative
Predicted sentiment: positive

Text: Best place to find Technical Co-Founder? 
I’ve been working on an idea for a sports-specific app tha...
True sentiment: negative
Predicted sentiment: negative

Text: New to development! Want to build something like this https://www.owayo.com/. Stuck at text moulding...
True sentiment: negative
Predicted sentiment: negative

Text: Would You Use a Gamified Learning & Rewards System for Tech Upskilling? Hi! I’m exploring an idea fo...
True sentiment: negative
Predicted sentiment: negative
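Accuracy on its own can be hard to read here: the dataset has 34 negative and 16 positive samples, and all five examples shown above are labelled negative. A rough sanity check is to compare against a baseline that always predicts the most common training label, and to look at precision and recall for one class. The sketch below uses hypothetical toy labels rather than the notebook's data, and both helper functions are assumptions added for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

def precision_recall(predictions, true_labels, target):
    """Precision and recall for one class, treating `target` as the class of interest."""
    pairs = list(zip(predictions, true_labels))
    tp = sum(p == target and t == target for p, t in pairs)
    fp = sum(p == target and t != target for p, t in pairs)
    fn = sum(p != target and t == target for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels, for illustration only (not the notebook's data)
toy_train = ["negative", "negative", "negative", "positive"]
toy_true = ["negative", "negative", "negative", "positive", "negative"]
toy_pred = ["negative", "positive", "negative", "positive", "negative"]

baseline = majority_baseline_accuracy(toy_train, toy_true)
precision, recall = precision_recall(toy_pred, toy_true, "positive")
print(f"Baseline accuracy: {baseline:.2f}")                  # 0.80
print(f"Precision: {precision:.2f}, recall: {recall:.2f}")   # 0.50, 1.00
```

If the classifier's accuracy is close to the baseline, it may be learning little beyond the class imbalance.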


In [None]:
# Function to analyze new posts
def analyze_new_post(text):
    """Analyze sentiment of a new post using the trained classifier."""
    prediction = classifier.predict(text)
    return prediction

### Summary
The `NaiveBayesClassifier` above:
- Calculates class probabilities P(class)
- Calculates word probabilities P(word|class)
- Uses Laplace smoothing so that words unseen in a class do not get zero probability
- Works with log probabilities to prevent underflow
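
To see why log probabilities matter, consider multiplying many small word probabilities directly. The sketch below uses an invented list of 100 identical probabilities; the values are assumptions chosen only to trigger floating-point underflow.

```python
import math

# Invented word probabilities, for illustration only
word_probs = [1e-5] * 100

# Multiplying directly: the true value (1e-500) is too small for a float
product = 1.0
for p in word_probs:
    product *= p

# Summing logs instead keeps the score in a representable range
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)  # about -1151.29
```

Once the product collapses to 0.0 for every class, the classes can no longer be compared, whereas the log scores remain distinct. Because the logarithm is monotonic, picking the class with the highest log score gives the same answer as picking the highest probability.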