<a href="https://colab.research.google.com/github/mltrev23/tech-test/blob/main/6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem
6. Deploy a spam/not-spam classification model built with natural language processing techniques and a count vectorizer.
   - Case study: [Spam Email](https://www.kaggle.com/datasets/mfaisalqureshi/spam-email)

# Solution
## Setup Environment

In [28]:
pip install pandas numpy scikit-learn fastapi uvicorn requests pyngrok nest_asyncio

Collecting pyngrok
  Downloading pyngrok-7.2.0-py3-none-any.whl.metadata (7.4 kB)
Downloading pyngrok-7.2.0-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.0


## Data Preprocessing
### Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import pickle

### Download dataset

In [None]:
import subprocess

kaggle_url = 'mfaisalqureshi/spam-email'
file_name = 'spam.csv'

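# Requires Kaggle API credentials to be configured; check=True raises if the download fails
subprocess.run(['kaggle', 'datasets', 'download', '-d', kaggle_url, '-f', file_name], check=True)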

CompletedProcess(args=['kaggle', 'datasets', 'download', '-d', 'mfaisalqureshi/spam-email', '-f', 'spam.csv'], returncode=0)

### Loading Data and Text Cleaning and Preprocessing

In [19]:
df = pd.read_csv(file_name)

def clean_text(text):
    # Replace non-word characters (punctuation, symbols) with spaces; digits and underscores are kept
    text = re.sub(r'\W', ' ', text)
    # Convert to lowercase
    text = text.lower()
    return text

# Apply text cleaning
df['message'] = df['Message'].apply(clean_text)

# Display the first few cleaned messages
df.head()

|   | Category | Message | message |
| --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | go until jurong point crazy available only ... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entry in 2 a wkly comp to win fa cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... | u dun say so early hor u c already then say |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | nah i don t think he goes to usf he lives aro... |
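To make the effect of the `\W` substitution concrete, here is a small sketch that reruns the same `clean_text` logic on a message from the output above. It also compares it with `clean_text_compact`, a hypothetical variant that is not part of the original notebook and that additionally collapses repeated whitespace. In both versions digits are kept, since `\W` only matches non-word characters.

```python
import re

def clean_text(text):
    # Same logic as the notebook cell: non-word characters become spaces
    text = re.sub(r'\W', ' ', text)
    return text.lower()

def clean_text_compact(text):
    # Hypothetical variant: also collapse runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', clean_text(text)).strip()

sample = "Ok lar... Joking wif u oni..."
print(repr(clean_text(sample)))          # each punctuation mark leaves a space behind
print(repr(clean_text_compact(sample)))  # single spaces only
print(clean_text_compact("Free entry in 2 a wkly comp"))  # the digit is still there
```

The extra whitespace left by `clean_text` does not affect the features later on, because the vectorizer splits on word boundaries anyway; the compact variant only makes the cleaned column easier to read.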


### Encode the labels

In [20]:
# Encode labels: 'ham' -> 0, 'spam' -> 1
df['label'] = df['Category'].map({'ham': 0, 'spam': 1})

# Display label distribution
df['label'].value_counts()

| label | count |
| --- | --- |
| 0 | 4825 |
| 1 | 747 |
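The counts above show a clear class imbalance. As a rough sketch, the arithmetic below turns them into a spam share and into the accuracy that a trivial classifier would reach by always answering "ham". The variable names are illustrative; only the two counts come from the notebook output.

```python
# Label counts taken from the value_counts() output above
counts = {0: 4825, 1: 747}

total = sum(counts.values())
spam_share = counts[1] / total
# Accuracy of a model that always predicts the majority class (ham)
majority_baseline = max(counts.values()) / total

print(f"Total messages: {total}")
print(f"Spam share: {spam_share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

This baseline is worth keeping in mind when reading the accuracy figure later: with this imbalance, accuracy alone says little, and the per-class precision and recall are more informative.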


### Split the Dataset into Training and Testing Sets

In [21]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)

# Display the size of the train and test sets
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Training set size: 4457
Testing set size: 1115
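The printed sizes can be reproduced by hand. The sketch below assumes that the test share is rounded up to a whole number of samples and that the training set takes the remainder, which matches the sizes printed above.

```python
import math

n_samples = 5572   # 4825 ham + 747 spam
test_size = 0.2

# Assumption: the fractional test share is rounded up to a whole sample
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Training set size: {n_train}")
print(f"Testing set size: {n_test}")
```

Because spam is only about 13% of the data, passing `stratify=df['label']` to `train_test_split` is a common option for keeping that share identical in both sets. The notebook does not do this, and enabling it would change the numbers reported in the following sections.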


## Feature Extraction
### Vectorize the Text Data

In [22]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the training data
X_train_vect = vectorizer.fit_transform(X_train)

# Transform the test data
X_test_vect = vectorizer.transform(X_test)

# Display the shape of the vectorized data
print(f"Training data shape: {X_train_vect.shape}")
print(f"Testing data shape: {X_test_vect.shape}")


Training data shape: (4457, 7701)
Testing data shape: (1115, 7701)
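The shapes above say that 7701 distinct tokens were found in the training messages, and that test messages are mapped onto those same 7701 columns. The following pure-Python sketch illustrates the bag-of-words idea on toy data. It is a simplified stand-in, not scikit-learn's implementation, and the helper names `tokenize`, `fit_vocabulary`, and `transform` are made up for this example.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters, after lowercasing
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    # One column per distinct token seen in the training documents
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:      # tokens unseen during fitting are dropped
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows

train_docs = ["free entry free prize", "ok lar see u later"]
vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(["free cash later"], vocabulary)

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

The key point is that the vocabulary is learned from the training data only, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.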


## Model Selection and Training
### Train a Naive Bayes Classifier

In [23]:
# Initialize the Multinomial Naive Bayes model
model = MultinomialNB()

# Train the model on the training data
model.fit(X_train_vect, y_train)

# Display the training completion message
print("Model training completed.")

Model training completed.
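For intuition about what the fitted model has learned, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing on toy, pre-tokenized data. It is an illustration of the technique under simplifying assumptions, not the scikit-learn implementation; `train_nb`, `predict_nb`, and the toy messages are invented for this example.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc}
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        # Smoothed denominator: class token count plus alpha per vocabulary entry
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((counts[tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

docs = [
    ["free", "prize", "win"],
    ["win", "cash", "free"],
    ["see", "you", "later"],
    ["ok", "see", "you"],
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham, as in the notebook

log_prior, log_likelihood = train_nb(docs, labels)
print(predict_nb(["free", "cash"], log_prior, log_likelihood))
print(predict_nb(["see", "you"], log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a token that never appeared in one class from forcing that class's probability to zero.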


## Model Evaluation
### Make Predictions on the Test Data

In [24]:
# Make predictions on the test data
y_pred = model.predict(X_test_vect)

# Display the first few predictions
y_pred[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

### Evaluate the Model

In [25]:
# Calculate and display the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Display a detailed classification report
report = classification_report(y_test, y_pred, target_names=['ham', 'spam'])
print("Classification Report:\n", report)

Model Accuracy: 0.99
Classification Report:
               precision    recall  f1-score   support

         ham       0.99      1.00      1.00       966
        spam       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
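The report's columns follow from the confusion counts for each class. The sketch below shows the formulas for the spam class using hypothetical counts: the support of 149 comes from the report, but the split into 140 true positives, 9 false negatives, and 0 false positives is an assumption chosen only because it is consistent with the rounded figures above. The notebook does not print the actual confusion matrix.

```python
# Hypothetical confusion counts for the spam class (support = 149 in the report)
tp = 140  # spam correctly flagged (assumed)
fn = 9    # spam missed (assumed)
fp = 0    # ham wrongly flagged as spam (assumed)

precision = tp / (tp + fp)   # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)      # of all actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Read this way, the report suggests the model almost never flags legitimate mail as spam but lets a small share of spam through, which is usually the safer direction of error for a mail filter.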



## Model Saving and Deployment
### Save the Trained Model and Vectorizer

In [26]:
# Save the model to a file
with open('spam_classifier.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

# Save the vectorizer to a file
with open('count_vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)

print("Model and vectorizer saved successfully.")

Model and vectorizer saved successfully.
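As a quick illustration of what the save step does, the sketch below round-trips a stand-in object through `pickle` in memory. The `artifact` dictionary is a placeholder for the fitted model and vectorizer, not their real contents.

```python
import pickle

# Stand-in for the fitted objects; any picklable Python object behaves the same
artifact = {"vocabulary": {"free": 0, "win": 1}, "alpha": 1.0}

blob = pickle.dumps(artifact)   # same serialization that pickle.dump writes to a file
restored = pickle.loads(blob)   # same deserialization that pickle.load performs

print(restored == artifact)     # equal contents
print(restored is artifact)     # but a separate object
```

Two caveats apply to the real files: unpickling can execute arbitrary code, so only load pickle files from a trusted source, and a pickled scikit-learn object is best loaded with the same library version that created it.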


### Deploy the Model Using FastAPI

In [34]:
from fastapi import FastAPI, Query
import pickle

# Initialize FastAPI app
app = FastAPI()

# Load the saved model and vectorizer
with open('spam_classifier.pkl', 'rb') as model_file:
    model = pickle.load(model_file)
with open('count_vectorizer.pkl', 'rb') as vectorizer_file:
    vectorizer = pickle.load(vectorizer_file)

# Define a prediction endpoint
@app.post("/predict/")
def predict_spam(email: str = Query(..., description="Email content to classify")):
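    # Vectorize the raw input. CountVectorizer lowercases and tokenizes on word
    # characters by default, so it yields the same tokens as the clean_text step.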
    email_vect = vectorizer.transform([email])
    # Make a prediction
    prediction = model.predict(email_vect)
    # Return the prediction result
    result = {"prediction": "spam" if prediction[0] == 1 else "not spam"}
    print(result)
    return result

### Serving the model

In [30]:
# Use your own ngrok authtoken here; never commit a real token to a public notebook
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
import uvicorn
import nest_asyncio
from pyngrok import ngrok

# Patch asyncio so uvicorn can start inside the notebook's already-running event loop
nest_asyncio.apply()

public_url = ngrok.connect(9005, "http")
print('Public URL:', public_url)

uvicorn.run(app, host='0.0.0.0', port=9005)

Public URL: NgrokTunnel: "https://d2c7-104-199-181-62.ngrok-free.app" -> "http://localhost:9005"


INFO:     Started server process [374]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9005 (Press CTRL+C to quit)


INFO:     204.44.96.131:0 - "POST /predict?email=Congratulations!%20You%27ve%20won%20a%20$1000%20Walmart%20gift%20card.%20Click%20here%20to%20claim%20now. HTTP/1.1" 307 Temporary Redirect
{'prediction': 'spam'}
INFO:     204.44.96.131:0 - "POST /predict/?email=Congratulations!%20You%27ve%20won%20a%20$1000%20Walmart%20gift%20card.%20Click%20here%20to%20claim%20now. HTTP/1.1" 200 OK
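In the log above, the first request goes to `/predict` without a trailing slash and is answered with a 307 Temporary Redirect; the follow-up request to `/predict/` then succeeds, because the route is declared as `/predict/`. A client can avoid the extra round trip by calling the slash-terminated path directly. The sketch below uses only the standard library; `build_predict_url` and `classify` are hypothetical helper names, and the base URL is a placeholder for whatever public URL ngrok prints.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

def build_predict_url(base_url, email):
    # The endpoint reads the text from the "email" query parameter
    query = urlencode({"email": email}, quote_via=quote)
    # Keep the trailing slash so the server does not have to redirect
    return f"{base_url.rstrip('/')}/predict/?{query}"

def classify(base_url, email):
    # POST with an empty body; the message travels in the query string
    request = Request(build_predict_url(base_url, email), data=b"", method="POST")
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Placeholder base URL; replace it with the tunnel address printed by ngrok
url = build_predict_url("http://localhost:9005", "You've won a $1000 gift card")
print(url)
```

Calling `classify(base_url, text)` against the running server should return the same JSON dictionary that the endpoint prints, for example a `prediction` key with the value `spam` or `not spam`.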
