# Mini Project: Spam Email

In [8]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Load the synthetic data
data = pd.read_csv('synthetic_spam_data.csv')

# Split the data into features and labels
X = data.drop('label', axis=1).values
y = data['label'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a Sequential model
model = tf.keras.models.Sequential()

# Adding layers to the model
model.add(tf.keras.layers.Dense(16, activation='relu', input_shape=(X_train.shape[1],)))  # Input size taken from the data; 100 features are assumed for each email
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # Output layer to predict spam or not

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Loss: {loss}, Accuracy: {accuracy}')


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Loss: 0.7077093720436096, Accuracy: 0.5


#### Explanation 

1. **Importing Necessary Libraries**:
   - `pandas` for data manipulation.
   - `tensorflow` for building and training the neural network.
   - `sklearn.model_selection` for splitting the dataset into training and testing sets.

2. **Loading Data**:
   - The synthetic spam dataset is loaded from a CSV file into a pandas DataFrame.

3. **Preparing Data**:
   - The data is split into features (`X`) and labels (`y`).
   - It's then further split into training and testing sets, with 80% of the data used for training and 20% for testing.
   
   The line `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)` uses the `train_test_split` function from `scikit-learn` to split the dataset into training and testing sets.

   1. **Function Call**: `train_test_split` is a function from the `sklearn.model_selection` module.

   2. **Input Arguments**:
      - `X` and `y` are the features and labels of the dataset, respectively.
      - `test_size=0.2` allocates 20% of the data to the testing set; the remaining 80% goes to the training set.
      - `random_state=42` seeds the random number generator so that the split is reproducible: every run produces the same split, which keeps results consistent and makes models comparable.

   3. **Output**:
      - `X_train` and `y_train` are the features and labels for the training set, respectively.
      - `X_test` and `y_test` are the features and labels for the testing set, respectively.

   4. **Operation**:
      - The function shuffles the dataset (using the specified random seed) and then splits it into training and testing sets according to the specified test size ratio. Holding out a test set lets the model be evaluated on data it has not seen during training.


4. **Building the Model**:
   - A Sequential model is initialized.
   - Two layers are added: a dense layer with 16 neurons and ReLU activation, followed by a dense output layer with 1 neuron and sigmoid activation. The sigmoid output is a value between 0 and 1 that can be read as the predicted probability of spam, which makes this simple architecture a reasonable fit for a binary classification task like spam detection.

5. **Compiling the Model**:
   - The model is compiled using the Adam optimizer, binary crossentropy loss (which is common for binary classification tasks), and accuracy as the evaluation metric.

6. **Training the Model**:
   - The model is trained for 10 epochs with a batch size of 32 using the training data, and validation data is provided for evaluating the model after each epoch.

7. **Evaluating the Model**:
   - The model is evaluated on the testing set, and the loss and accuracy are printed out.
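
The numbers behind steps 3, 4, and 6 can be worked out by hand. The sketch below does that arithmetic with the standard library only. The dataset size of 1000 rows is a hypothetical value chosen for illustration (the source does not state how many rows the CSV contains), and `dense_params` is a helper name introduced here, not a Keras function. The 100 input features follow the assumption made in the model definition.

```python
import math

# Hypothetical dataset size: the source does not say how many rows the CSV has.
n_samples = 1000
n_features = 100   # feature count assumed in the model definition
test_size = 0.2
batch_size = 32

# Step 3: 20% of the rows are held out for testing, the rest are used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Step 4: a dense layer has one weight per (input, unit) pair plus one bias per unit.
def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

hidden_params = dense_params(n_features, 16)   # Dense(16, activation='relu')
output_params = dense_params(16, 1)            # Dense(1, activation='sigmoid')
total_params = hidden_params + output_params

# Step 6: one weight update per batch; the last batch may be smaller than the rest.
steps_per_epoch = math.ceil(n_train / batch_size)

print("train rows:", n_train, "test rows:", n_test)
print("parameters:", hidden_params, "+", output_params, "=", total_params)
print("steps per epoch:", steps_per_epoch)
```

With only 1,633 trainable parameters the network is very small, which is worth keeping in mind when reading the results below.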

**Output Explanation**:

The output shows the progress of training across 10 epochs. For each epoch, Keras reports the following:
- Training loss (`loss`) and accuracy (`accuracy`) on the training data.
- Validation loss (`val_loss`) and accuracy (`val_accuracy`) on the validation data (which in this case is the testing set).
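
To make these two metrics concrete, here is a minimal sketch of how binary cross-entropy and accuracy can be computed from labels and predicted probabilities. It is a plain-Python illustration of the definitions, not Keras's own implementation; the clipping constant `eps`, the 0.5 threshold, and the toy values are assumptions chosen for the example.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)

# Toy labels and predicted spam probabilities, made up for illustration only.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

loss_value = binary_crossentropy(y_true, y_prob)
accuracy_value = accuracy(y_true, y_prob)
print(f"loss: {loss_value:.4f}, accuracy: {accuracy_value:.2f}")
```

Accuracy only counts which side of the threshold each prediction falls on, while the loss also penalizes how confident a wrong prediction is, so the two can move independently from one epoch to the next.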

The goal during training is to minimize the loss while maximizing the accuracy. However, the training and validation metrics suggest that the model is struggling to learn from the data. Specifically:
- The training accuracy starts at around 51% and increases slightly to about 56%, while the validation accuracy hovers around 50% throughout.
- The losses decrease only slightly, and the validation loss even increases in some epochs.

The final evaluation on the test set shows a loss of about 0.708 and an accuracy of 50%. On a balanced binary classification task, 50% accuracy is no better than random guessing, which suggests that the model has not learned to distinguish spam from non-spam emails in the synthetic data provided.
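
The reported loss can be put in context with a chance-level reference. A model that outputs a probability of 0.5 for every email has a binary cross-entropy of ln 2, whatever the labels are. The short check below compares that reference with the loss printed by `model.evaluate`; the variable names are introduced here for illustration.

```python
import math

# Loss of a model that predicts 0.5 for every email, regardless of its label.
chance_loss = -math.log(0.5)   # equals ln 2

# Test loss printed by model.evaluate in the output above.
reported_loss = 0.7077093720436096

print(f"chance-level loss: {chance_loss:.4f}")
print(f"reported loss:     {reported_loss:.4f}")
print("worse than chance level:", reported_loss > chance_loss)
```

The reported loss sits slightly above ln 2, so in terms of loss the trained model does no better than a constant prediction of 0.5.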

Several factors could explain this: the model may be too simple, the synthetic data may carry little or no signal linking the features to the labels, or the features may need further preprocessing or feature engineering.
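
One way to probe the data-quality explanation is to check whether any feature is correlated with the label at all. The sketch below illustrates the idea on toy data generated on the spot, not on the actual CSV: `noise_feature` is drawn independently of the label, while `signal_feature` shifts with it. The `pearson` helper and both feature names are hypothetical and introduced only for this example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)                   # fixed seed so the toy data is reproducible
labels = [i % 2 for i in range(2000)]    # balanced toy labels

# A feature drawn independently of the label: it carries no signal.
noise_feature = [rng.gauss(0.0, 1.0) for _ in labels]
# A feature whose mean shifts with the label: it carries signal.
signal_feature = [rng.gauss(2.0 * y, 1.0) for y in labels]

noise_corr = pearson(noise_feature, labels)
signal_corr = pearson(signal_feature, labels)
print(f"noise feature vs label:  {noise_corr:+.3f}")
print(f"signal feature vs label: {signal_corr:+.3f}")
```

If every column of the real dataset behaves like the noise feature, no architecture is likely to do better than chance. With pandas, and assuming all columns are numeric, `data.corr()['label']` gives the same per-feature correlations for the loaded DataFrame.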
