# Task
Train a Random Forest Classification model to predict 'ham' or 'spam' using the dataset in "dataset.csv".

## Load the dataset

### Subtask:
Load the `dataset.csv` file into a pandas DataFrame.

In [8]:
import pandas as pd

df = pd.read_csv('dataset.csv')
df.head()

Unnamed: 0,sender_raw_domain,sender_decoded,return_raw_domain,return_decoded,message_id_raw,message_id_decoded,sender_registered,return_registered,mid_registered,spf,...,suspicious_keywords,total_links,idn_links,emd_embed,attachments,suspicious_attachments,total_images,suspicious_images,base64_images,label
0,umail.accounts.riotgames.com,umail.accounts.riotgames.com,sg.umail.accounts.riotgames.com,sg.umail.accounts.riotgames.com,geopod-ismtpd-17,geopod-ismtpd-17,riotgames.com,riotgames.com,,pass,...,33,46,0,0,0,0,0,0,0,ham
1,naukri.com,naukri.com,naukri.com,naukri.com,n5plimla01.ieil.net,n5plimla01.ieil.net,naukri.com,naukri.com,ieil.net,pass,...,3,60,0,0,0,0,0,0,0,ham
2,id.supercell.com,id.supercell.com,mail.id.supercell.com,mail.id.supercell.com,email.amazonses.com,email.amazonses.com,supercell.com,supercell.com,amazonses.com,pass,...,0,12,0,0,0,0,0,0,0,ham
3,amazonpay.in,amazonpay.in,ap-south-1.amazonses.com,ap-south-1.amazonses.com,ap-south-1.amazonses.com,ap-south-1.amazonses.com,amazonpay.in,amazonses.com,amazonses.com,pass,...,2,21,0,0,0,0,0,0,0,ham
4,amazonpay.in,amazonpay.in,ap-south-1.amazonses.com,ap-south-1.amazonses.com,ap-south-1.amazonses.com,ap-south-1.amazonses.com,amazonpay.in,amazonses.com,amazonses.com,pass,...,2,21,0,0,0,0,0,0,0,ham
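
The `Unnamed: 0` header above suggests the CSV was written with its row index as the first column. As a minimal sketch of how the load could be checked, the snippet below reads a small hypothetical stand-in for `dataset.csv` (the three rows and the `.example` domains are invented for illustration), treats the first column as the index, and summarises the label balance and missing values.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for dataset.csv (illustrative values only).
csv_text = """,sender_raw_domain,message_id_raw,label
0,naukri.com,n5plimla01.ieil.net,ham
1,amazonpay.in,,ham
2,nic.example,bulk.example,spam
"""

# index_col=0 reads the unnamed first column as the index
# instead of creating an "Unnamed: 0" column.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

label_counts = toy_df["label"].value_counts()   # rows per class
missing_counts = toy_df.isna().sum()            # NaN count per column

print(toy_df.shape)
print(label_counts)
print(missing_counts)
```

Running the same three checks on the real DataFrame would show how imbalanced the labels are and which columns need filling before they are joined into text.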


## Preprocessing

### Subtask:
Prepare the text data for model training by handling missing values, encoding labels, and splitting the data into training and testing sets.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Fill missing values in relevant text columns
text_columns = ['sender_raw_domain', 'message_id_raw'] # Using available text-like columns
for col in text_columns:
    if col in df.columns:
        df[col] = df[col].fillna('')

# Encode labels
le = LabelEncoder()
df['label_encoded'] = le.fit_transform(df['label'])

# Combine text columns for features
df['text_features'] = df[text_columns].agg(' '.join, axis=1)


# Split data into training and testing sets
X = df['text_features']
y = df['label_encoded']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (6598,) (6598,)
Testing set shape: (1650,) (1650,)
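
The split above is purely random. When one class is rare, passing `stratify` to `train_test_split` keeps the class ratio the same in both partitions. The sketch below uses hypothetical toy data (90 ham, 10 spam), not the notebook's dataset, to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 90 ham (0) and 10 spam (1).
toy_X = pd.Series([f"sender{i}.example" for i in range(100)])
toy_y = pd.Series([0] * 90 + [1] * 10)

# stratify=toy_y preserves the 10% spam rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_spam_rate = y_tr.mean()
test_spam_rate = y_te.mean()
print(f"Train spam rate: {train_spam_rate:.2f}")
print(f"Test spam rate: {test_spam_rate:.2f}")
```

Using this on the real data would change which rows land in the test set, so the metrics reported in this notebook would not be reproduced exactly.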


## Feature extraction

### Subtask:
Convert the text data into numerical features using a technique like TF-IDF.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform on training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform on testing data
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("TF-IDF transformed training data shape:", X_train_tfidf.shape)
print("TF-IDF transformed testing data shape:", X_test_tfidf.shape)

TF-IDF transformed training data shape: (6598, 1058)
TF-IDF transformed testing data shape: (1650, 1058)
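
The inputs here are domain names and message IDs rather than sentences, so it helps to see how the default `TfidfVectorizer` settings tokenize them. The sketch below calls `build_analyzer()` on an unfitted vectorizer. With the default settings, text is lowercased and split into runs of two or more word characters, so dots and hyphens act as separators and single-character fragments are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The analyzer applies the default preprocessing and tokenization
# (lowercasing, token_pattern r"(?u)\b\w\w+\b") without needing a fit.
analyzer = TfidfVectorizer().build_analyzer()

tokens_domain = analyzer("umail.accounts.riotgames.com geopod-ismtpd-17")
tokens_idn = analyzer("nic.xn--n-iga bestsolutionsoft.com")

print(tokens_domain)
print(tokens_idn)
```

Each domain label becomes its own feature, which is likely why a few thousand rows produce only about a thousand features. A custom `token_pattern` or character n-grams would be alternatives if whole domains should stay intact.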


## Model training

### Subtask:
Train a Random Forest Classification model on the training data.

In [11]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate a RandomForestClassifier model
model = RandomForestClassifier(random_state=42)

# Train the model
model.fit(X_train_tfidf, y_train)

## Evaluation

### Subtask:
Evaluate the trained model on the testing data using appropriate metrics.

In [12]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the testing data
y_pred = model.predict(X_test_tfidf)

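# Calculate evaluation metrics
# LabelEncoder sorts classes alphabetically ('ham' -> 0, 'spam' -> 1), so the
# default pos_label=1 makes precision, recall, and F1 describe the 'spam' class.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)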

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.9964
Precision: 0.9661
Recall: 0.9344
F1-score: 0.9500
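
The four rounded scores and the 1650-row test set are enough to work backwards to a confusion matrix. The counts below (57 true positives, 4 false negatives, 2 false positives, 1587 true negatives) are inferred from those numbers, not printed by the notebook. The sketch checks that they reproduce every reported metric, and compares the result with a baseline that always predicts 'ham'.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Inferred counts (an assumption derived from the rounded metrics above).
tn, fp, fn, tp = 1587, 2, 4, 57

# Rebuild label vectors that realise those counts (0 = ham, 1 = spam).
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_hat = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted
metrics = {
    "accuracy": round(accuracy_score(y_true, y_hat), 4),
    "precision": round(precision_score(y_true, y_hat), 4),
    "recall": round(recall_score(y_true, y_hat), 4),
    "f1": round(f1_score(y_true, y_hat), 4),
}

# Baseline: label every message as ham.
baseline_accuracy = round(accuracy_score(y_true, np.zeros_like(y_true)), 4)

print(cm)
print(metrics)
print("All-ham baseline accuracy:", baseline_accuracy)
```

Under these inferred counts only 61 of the 1650 test messages are spam, so accuracy alone says little; precision and recall on the spam class are the more informative figures.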


## Summary:

### Data Analysis Key Findings

* The dataset was loaded successfully, containing columns such as 'sender\_raw\_domain' and 'message\_id\_raw', but not 'subject' or 'body'.
* Missing values in 'sender\_raw\_domain' and 'message\_id\_raw' were filled with empty strings.
* The 'label' column was successfully encoded into numerical values.
* The data was split into training and testing sets, with 6598 samples for training and 1650 for testing.
* TF-IDF vectorization transformed the text data into numerical features, resulting in 1058 features.
* A Random Forest Classifier model was trained on the TF-IDF transformed training data.
* On the 1650-sample test set the model reached an accuracy of 0.9964, with precision 0.9661, recall 0.9344, and F1-score 0.9500 for the 'spam' class.

### Insights or Next Steps

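* Recall (0.9344) is lower than precision (0.9661), so the model misses spam more often than it mislabels ham; inspecting the misclassified test messages would show which sender domains or message IDs cause these errors.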
* Consider hyperparameter tuning for the Random Forest Classifier to potentially improve performance.
* Explore other text feature engineering techniques or models to see if they yield better results.
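
The hyperparameter tuning suggested above could be sketched with `GridSearchCV` over a `Pipeline`, so that the TF-IDF vocabulary is refit inside each cross-validation fold. The toy strings, the grid values, and the choice of F1 as the scoring metric are all assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (0 = ham, 1 = spam).
toy_texts = [
    "naukri.com ieil.net", "amazonpay.in amazonses.com",
    "supercell.com amazonses.com", "riotgames.com geopod",
    "naukri.com amazonses.com", "supercell.com ieil.net",
    "promo.example bulk.example", "offers.example bulk.example",
    "deals.example bulk.example", "win.example bulk.example",
    "prize.example bulk.example", "loan.example bulk.example",
]
toy_labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Step-prefixed names route each parameter to the right pipeline stage.
param_grid = {
    "rf__n_estimators": [25, 50],
    "rf__max_depth": [None, 5],
    "rf__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

On the real data the search would be fit on `X_train` and `y_train` only, keeping the test set untouched for the final evaluation.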

## Prediction

### Subtask:
Predict the label for a new input using the trained model.

In [13]:
# Example input (hard-coded here rather than read from the user)
new_sender_raw_domain = "nic.xn--n-iga"
new_message_id_raw = "bestsolutionsoft.com"

# Combine the text features
new_text_features = new_sender_raw_domain + ' ' + new_message_id_raw

# Transform the new text using the fitted TF-IDF vectorizer
new_text_features_tfidf = tfidf_vectorizer.transform([new_text_features])

# Predict the label using the trained model
predicted_label_encoded = model.predict(new_text_features_tfidf)

# Decode the predicted label
predicted_label = le.inverse_transform(predicted_label_encoded)

print(f"The predicted label is: {predicted_label[0]}")

The predicted label is: spam


In [14]:
import joblib

# Save the trained model
joblib.dump(model, 'random_forest_model.pkl')

# Save the TF-IDF vectorizer
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')

print("Model and vectorizer saved successfully!")

Model and vectorizer saved successfully!


In [None]:
import joblib

# Load the saved model (run this only after the save cell above)
loaded_model = joblib.load('random_forest_model.pkl')

# Load the saved TF-IDF vectorizer
loaded_tfidf_vectorizer = joblib.load('tfidf_vectorizer.pkl')

# Note: the LabelEncoder (`le`) is not saved above, so a new session must
# map predictions back itself (0 -> 'ham', 1 -> 'spam').
print("Model and vectorizer loaded successfully!")
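
The prediction steps can be wrapped in a small helper. If the vectorizer and classifier are combined in one `Pipeline` trained directly on the string labels, a single saved file is enough and no separate label decoding is needed. The sketch below assumes hypothetical toy data, a hypothetical helper name `predict_label`, and a hypothetical file name `spam_pipeline.joblib`; it is not the artifact layout used in this notebook.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy training data with string labels.
toy_texts = [
    "naukri.com ieil.net", "amazonpay.in amazonses.com",
    "supercell.com amazonses.com", "riotgames.com geopod",
    "promo.example bulk.example", "offers.example bulk.example",
    "deals.example bulk.example", "win.example bulk.example",
]
toy_labels = ["ham"] * 4 + ["spam"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(toy_texts, toy_labels)


def predict_label(fitted_pipeline, sender_raw_domain, message_id_raw):
    """Join the two fields as in preprocessing and return the predicted label."""
    text = f"{sender_raw_domain} {message_id_raw}"
    return fitted_pipeline.predict([text])[0]


# Round trip through disk: one file holds the vectorizer and the model.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

original_pred = predict_label(pipeline, "naukri.com", "ieil.net")
restored_pred = predict_label(restored, "naukri.com", "ieil.net")
print(original_pred, restored_pred)
```

Files written by `joblib.dump` should only be loaded from trusted sources and with a compatible scikit-learn version, since loading unpickles arbitrary Python objects.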