NATURAL LANGUAGE PROCESSING (NLP)

Aim: Build a Spam Email Classifier

In [1]:
import pandas as pd

In [2]:
spam_data = pd.read_csv('spam_mail_classifier.csv')

In [34]:
spam_data.head()

Unnamed: 0,email_text,label
0,Let's catch up sometime next week!,ham
1,Don't forget to submit your project by Friday.,ham
2,Win a free iPhone now!!! Click here.,spam
3,Can you send me the report when it's ready?,ham
4,Meeting has been rescheduled to next Monday.,ham


# Vectorizer

In [5]:
X_text = spam_data['email_text']
y = spam_data['label']

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
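# Unigrams and bigrams, English stop words removed; ignore terms that appear
# in more than 95% of emails (max_df) or in fewer than 2 emails (min_df)
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.95, min_df=2)
X = vectorizer.fit_transform(X_text)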

# Train and Test Splitting

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Machine Learning Model

In [13]:
from sklearn.linear_model import LogisticRegression

In [14]:
lr = LogisticRegression(solver='liblinear', max_iter=1000)

In [15]:
lr.fit(X_train, y_train)

In [16]:
y_pred = lr.predict(X_test)
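
Note that the vectorizer above is fitted on the full dataset before the split, so the TF-IDF vocabulary and IDF weights are influenced by the test emails. One possible alternative, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and the model in a scikit-learn pipeline so that both are fitted on the training portion only. The toy emails, the `stratify` argument and the names `spam_pipeline`, `train_texts`, `test_texts` are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for spam_data['email_text'] and spam_data['label']
texts = [
    "Win a free iPhone now, click here",
    "Claim your free prize today",
    "Free cash offer, click to claim",
    "You won a lottery prize, claim now",
    "Exclusive free offer just for you",
    "Can you send me the report",
    "Meeting rescheduled to next Monday",
    "Submit your project by Friday",
    "Let's catch up sometime next week",
    "Please review the attached report",
]
labels = ["spam"] * 5 + ["ham"] * 5

# Split the raw text first; stratify keeps the ham/spam ratio in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(solver='liblinear', max_iter=1000),
)
spam_pipeline.fit(train_texts, train_labels)
test_predictions = spam_pipeline.predict(test_texts)
print(list(zip(test_labels, test_predictions)))
```

With this layout the pipeline accepts raw strings directly, so `spam_pipeline.predict([some_email])` needs no separate `transform` step.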

## Evaluate Our Model

In [17]:
from sklearn.metrics import classification_report

In [21]:
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))

              precision    recall  f1-score   support

         ham       1.00      1.00      1.00       122
        spam       1.00      1.00      1.00        78

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
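
A perfect score on a single 200-email split is worth double-checking. One common way, sketched here under assumptions, is stratified k-fold cross-validation, which refits the vectorizer and the model on each training fold and reports one score per fold. The toy data and the names `cv_pipeline`, `folds` and `fold_scores` are hypothetical; on the real data, `spam_data['email_text']` and `spam_data['label']` would take the place of `toy_texts` and `toy_labels`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only
toy_texts = [
    "free prize click now", "win free cash today", "claim your free reward",
    "cheap offer click here", "free lottery winner claim",
    "send me the report", "meeting moved to monday", "submit project by friday",
    "lunch next week", "review the attached notes",
]
toy_labels = ["spam"] * 5 + ["ham"] * 5

cv_pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(solver='liblinear', max_iter=1000),
)

# Five stratified folds: each fold holds out one spam and one ham email here
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(cv_pipeline, toy_texts, toy_labels, cv=folds, scoring='accuracy')
print(fold_scores, fold_scores.mean())
```

If the scores stay near 1.00 across folds on the real data, the result is at least stable for this dataset; it does not by itself show that the model generalizes to emails written in a different style.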



# Manual Testing

In [22]:
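def predict_spam(email_text):
    """Return the predicted label ('ham' or 'spam') for a single email string."""
    # Reuse the fitted vectorizer; do not refit on new text
    X_new = vectorizer.transform([email_text])
    return lr.predict(X_new)[0]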

In [31]:
test_data = """

Hey there, it's Robert from Gigabrain...

If I gave you a test to see if you could tell if a Reddit profile was real... or a fake.

How do you think you would do? 

I know I would absolutely fail at this... The bots and sheer number of AI tools are getting insane...

And It’s not AI that’s the problem.

It’s shady people using it to manufacture trust. 

That "verified review" you trusted?

Those 200 upvotes you used to rely on?

It could all be fake... pushing hidden agendas you’d never notice on your own.

That's why we built DERPY...

It gives you x-ray vision into reddit profiles... exposing patterns that reveal REAL human insight... or automated bot behavior...

All at the push of a button... 

Because when you're making decisions for you and your family... 

You deserve better than made up credibility.

Want to see what's really going on behind those "expert" comments?

>> Run any username through DERPY now <<

Once you see it, you can't unsee it. 

- Robert

P.S. Want to make this even easier? Download our extension... and Gigabrain will follow you around Reddit... showing you who’s real (or not) at the tap of a button. 
"""

In [32]:
print(predict_spam(test_data))

ham
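
The model labels this email, which reads like a promotional message, as ham. A bare label hides how confident the model is, so it can help to look at class probabilities as well. The sketch below is an assumption-laden illustration: it trains a small stand-in model on hypothetical toy emails, and the helper `predict_spam_proba` and its parameter names are not part of the original notebook. In the notebook itself, the fitted `vectorizer` and `lr` would be passed in instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def predict_spam_proba(email_text, fitted_vectorizer, fitted_model):
    """Return a {label: probability} dict for a single email string."""
    features = fitted_vectorizer.transform([email_text])
    probabilities = fitted_model.predict_proba(features)[0]
    # classes_ gives the label order that predict_proba uses for its columns
    return {str(label): float(p) for label, p in zip(fitted_model.classes_, probabilities)}


# Hypothetical toy training data so the sketch runs on its own
demo_texts = [
    "win a free iphone click here", "claim your free prize now",
    "free cash offer click to claim",
    "send me the report when ready", "meeting rescheduled to monday",
    "submit your project by friday",
]
demo_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

demo_vectorizer = TfidfVectorizer(stop_words='english')
demo_model = LogisticRegression(solver='liblinear', max_iter=1000)
demo_model.fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

demo_probs = predict_spam_proba("download our extension at the push of a button",
                                demo_vectorizer, demo_model)
print(demo_probs)
```

A probability close to 0.5 suggests the email shares little vocabulary with the training data, in which case the predicted label carries little information.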
