# ***To Develop an NLP Application***

**Name:** Prexit Joshi  
**Roll No.:** 118

---

## Aim
To develop an NLP application that classifies messages as *Spam* or *Not Spam (Ham)* using Bag-of-Words and Multinomial Naïve Bayes. This version is designed to be easy to explain in a viva.

## Objective
1. Preprocess text (lowercase it and remove non-letter characters).  
2. Convert text into numerical form using **CountVectorizer (Bag-of-Words)**.  
3. Train a simple **Multinomial Naïve Bayes** classifier.  
4. Predict whether a new message is spam or ham, and explain each step in the viva.


In [1]:
# 1. Imports and tiny dataset (easy to explain)
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Very small dataset for demonstration and viva
data = {
    'message': [
        "Congratulations! You won a free lottery",
        "Call now to claim your prize",
        "This is a meeting reminder",
        "Let's have lunch tomorrow",
        "Earn money fast by clicking this link",
        "Your appointment is scheduled"
    ],
    'label': ["spam", "spam", "ham", "ham", "spam", "ham"]
}

df = pd.DataFrame(data)
df

| | message | label |
|---|---|---|
| 0 | Congratulations! You won a free lottery | spam |
| 1 | Call now to claim your prize | spam |
| 2 | This is a meeting reminder | ham |
| 3 | Let's have lunch tomorrow | ham |
| 4 | Earn money fast by clicking this link | spam |
| 5 | Your appointment is scheduled | ham |


In [2]:
# 2. Preprocessing function (explain: lowercasing + remove non-letters)
def clean_text(msg):
    msg = msg.lower()  # convert to lowercase
    msg = re.sub(r'[^a-zA-Z ]', '', msg)  # keep only letters and spaces (drops digits and punctuation)
    msg = msg.strip()
    return msg

df['clean'] = df['message'].apply(clean_text)
df[['message','clean','label']]

| | message | clean | label |
|---|---|---|---|
| 0 | Congratulations! You won a free lottery | congratulations you won a free lottery | spam |
| 1 | Call now to claim your prize | call now to claim your prize | spam |
| 2 | This is a meeting reminder | this is a meeting reminder | ham |
| 3 | Let's have lunch tomorrow | lets have lunch tomorrow | ham |
| 4 | Earn money fast by clicking this link | earn money fast by clicking this link | spam |
| 5 | Your appointment is scheduled | your appointment is scheduled | ham |
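
To see how the cleaning step behaves at its edges, the minimal sketch below re-declares `clean_text` with the same rule as the cell above and runs it on two made-up inputs (the example strings are assumptions, not part of the dataset). It suggests two points worth mentioning in a viva: digits vanish entirely, and removing a hyphen glues the neighbouring words together.

```python
import re

def clean_text(msg):
    """Same rule as the notebook cell: lowercase, keep only letters and spaces."""
    msg = msg.lower()
    msg = re.sub(r'[^a-zA-Z ]', '', msg)
    return msg.strip()

# Hypothetical inputs chosen to expose edge cases
examples = ["Win $1000 now!!!", "FREE-entry"]
for text in examples:
    print(repr(text), "->", repr(clean_text(text)))
# 'Win $1000 now!!!' -> 'win  now'   (digits removed, a double space is left behind)
# 'FREE-entry' -> 'freeentry'        (hyphen removed, the two words merge)
```

The leftover double space is harmless here, because the vectorizer in the next step splits on word boundaries rather than on single spaces.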


In [3]:
# 3. Convert to Bag-of-Words and train Naive Bayes (simple and fast)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['clean'])  # Bag-of-Words
y = df['label']

model = MultinomialNB()
model.fit(X, y)

print("Vocabulary (sample):", list(vectorizer.vocabulary_.items())[:10])
print("Model classes:", model.classes_)

Vocabulary (sample): [('congratulations', 5), ('you', 25), ('won', 24), ('free', 8), ('lottery', 13), ('call', 2), ('now', 17), ('to', 22), ('claim', 3), ('your', 26)]
Model classes: ['ham' 'spam']
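
The fitted vocabulary has 27 entries, and the word "a" is not among them: with its default settings `CountVectorizer` only keeps tokens of two or more characters. The minimal sketch below uses a separate two-sentence toy corpus (the names `docs`, `demo_vectorizer`, and `columns` are illustrative and not part of the notebook) to show how each sentence becomes a row of word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two already-cleaned sentences taken from the dataset
docs = ["this is a meeting reminder", "your appointment is scheduled"]

demo_vectorizer = CountVectorizer()
counts = demo_vectorizer.fit_transform(docs).toarray()

# Recover the column names in index order from the learned vocabulary
vocab = demo_vectorizer.vocabulary_
columns = sorted(vocab, key=vocab.get)
print("Columns:", columns)

# Show only the non-zero counts for each sentence
for doc, row in zip(docs, counts):
    present = {word: int(n) for word, n in zip(columns, row) if n > 0}
    print(doc, "->", present)

# Single-character tokens such as "a" are dropped by the default tokenizer
print("'a' in vocabulary:", "a" in vocab)
```

Each column index is simply the word's position in the alphabetically sorted vocabulary, which is why `'congratulations'` maps to 5 and `'your'` to 26 in the output above.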


In [4]:
# 4. Test with new messages (explain: transform -> predict)
def predict_message(msg):
    cleaned = clean_text(msg)
    vec = vectorizer.transform([cleaned])
    pred = model.predict(vec)[0]
    proba = model.predict_proba(vec)[0]
    return pred, proba

tests = ["Win cash now!", "Are we meeting today?", "Claim your free prize", "See you at the meeting"]
for t in tests:
    p, pr = predict_message(t)
    print(f"Input: {t}\nPrediction: {p}  (prob_spam={pr[list(model.classes_).index('spam')]:.2f})\n")

Input: Win cash now!
Prediction: spam  (prob_spam=0.63)

Input: Are we meeting today?
Prediction: ham  (prob_spam=0.30)

Input: Claim your free prize
Prediction: spam  (prob_spam=0.82)

Input: See you at the meeting
Prediction: ham  (prob_spam=0.43)
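
The probabilities above can be reproduced by hand, which is a useful way to explain Naïve Bayes in a viva. The sketch below is a plain-Python re-implementation written for illustration (it is not scikit-learn's internal code). It assumes the defaults used in this notebook: Laplace smoothing with `alpha = 1`, class priors taken from the label frequencies, and tokens of at least two letters. The helper names `tokenize` and `spam_probability` are hypothetical.

```python
import re
from collections import Counter

messages = [
    "Congratulations! You won a free lottery",
    "Call now to claim your prize",
    "This is a meeting reminder",
    "Let's have lunch tomorrow",
    "Earn money fast by clicking this link",
    "Your appointment is scheduled",
]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

def tokenize(text):
    """Lowercase, keep letters and spaces, then keep tokens of 2+ characters."""
    text = re.sub(r"[^a-z ]", "", text.lower())
    return re.findall(r"\b\w\w+\b", text)

# Count how often each word appears in each class
word_counts = {"spam": Counter(), "ham": Counter()}
for msg, label in zip(messages, labels):
    word_counts[label].update(tokenize(msg))

vocab = set(word_counts["spam"]) | set(word_counts["ham"])
alpha = 1.0  # Laplace smoothing

def spam_probability(msg):
    """Return P(spam | msg) under a multinomial Naive Bayes model."""
    scores = {}
    for label in ("spam", "ham"):
        total = sum(word_counts[label].values())
        score = labels.count(label) / len(labels)  # class prior
        for word in tokenize(msg):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            score *= (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        scores[label] = score
    return scores["spam"] / (scores["spam"] + scores["ham"])

print(round(spam_probability("Win cash now!"), 2))  # 0.63
```

For "Win cash now!" only "now" is in the vocabulary. It appears once among the 18 spam tokens and never among the 12 ham tokens, so the smoothed likelihoods are 2/45 for spam and 1/39 for ham, which normalises to about 0.63.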



## Conclusion
A basic spam detection application was implemented using Bag-of-Words (CountVectorizer) and Multinomial Naïve Bayes. The steps are short and easy to explain in a viva: clean the text (lowercase it and remove non-letter characters), convert it to word counts, train a simple probabilistic classifier, and predict labels for new messages. With only six training messages the predicted probabilities are illustrative rather than reliable, so the method is best treated as a baseline for demonstration and oral examination.
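
The same steps could also be packaged as a single scikit-learn pipeline, so that cleaning, vectorizing, and classification always run together. This is a minimal sketch of one possible arrangement, not the structure used in the cells above; the name `spam_pipeline` is hypothetical, and `clean_text` is re-declared here so the block runs on its own.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean_text(msg):
    """Lowercase the message and keep only letters and spaces."""
    msg = msg.lower()
    msg = re.sub(r'[^a-zA-Z ]', '', msg)
    return msg.strip()

messages = [
    "Congratulations! You won a free lottery",
    "Call now to claim your prize",
    "This is a meeting reminder",
    "Let's have lunch tomorrow",
    "Earn money fast by clicking this link",
    "Your appointment is scheduled",
]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# clean_text replaces the vectorizer's built-in preprocessing step
spam_pipeline = make_pipeline(
    CountVectorizer(preprocessor=clean_text),
    MultinomialNB(),
)
spam_pipeline.fit(messages, labels)

# Raw, uncleaned text goes straight in
print(spam_pipeline.predict(["Claim your free prize", "Are we meeting today?"]))
# ['spam' 'ham']
```

One design benefit of this form is that new messages cannot accidentally skip the cleaning step, since it is part of the fitted object.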

