Developed by WeCloudData

# Capstone Project: Your First NLP Classification Model

Welcome to your first capstone project in natural language processing (NLP)! In this notebook, you will build a complete text classification pipeline using Python and the SMS Spam Collection dataset. You'll learn how to:

- Load and explore the dataset
- Clean and preprocess text data
- Convert text into numerical features using TF-IDF
- Split the data into training and testing sets
- Adjust hyperparameters and train several classification models
- Evaluate model performance

After following along with the SMS spam example, you'll be encouraged to try similar techniques on other text datasets.

Let's get started!

## Step 1: Choosing a Dataset

For this project, we are using the **SMS Spam Collection** dataset. You can download the dataset from the following link:

[Kaggle SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset)

Make sure to download the CSV file (commonly named `spam.csv`) and place it in the same folder as this notebook.

In [None]:
# Example: Loading the SMS Spam Collection dataset
import pandas as pd

# Load the dataset (ensure spam.csv is in your working directory)
df = pd.read_csv('spam.csv', encoding='latin-1')

# The dataset may contain extra unnamed columns; keep only the relevant ones
df = df[['v1', 'v2']]
df.columns = ['label', 'message']

print('Example Dataset: SMS Spam Collection')
display(df.head())
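
If you want to see the column handling in isolation, here is a minimal sketch. It assumes the file has the `v1`/`v2` columns plus empty `Unnamed` columns, as the loading code above implies. The two rows in `sample_csv` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the layout of spam.csv,
# including the empty "Unnamed" columns; it is not real dataset content.
sample_csv = io.StringIO(
    "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    "ham,See you at lunch,,,\n"
    "spam,Claim your free prize now,,,\n"
)

# usecols drops the extra columns at read time;
# rename makes the old-name -> new-name mapping explicit.
sample_df = pd.read_csv(sample_csv, usecols=["v1", "v2"]).rename(
    columns={"v1": "label", "v2": "message"}
)

print(sample_df)
print(list(sample_df.columns))
```

Renaming with a dictionary is a little safer than assigning to `df.columns`, because it does not depend on the column order.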

### Exercise

1. Download the SMS Spam Collection dataset from Kaggle.
2. Alternatively, choose another text dataset of your interest.
3. Place the dataset file in the working directory and note the file name for later use.

## Step 2: Exploring the Data

In this step, we will explore the SMS Spam Collection dataset by:

- Viewing the first few rows
- Getting summary statistics using `.describe()`
- Checking the data structure with `.info()`
- Looking for missing values

This initial exploration helps you understand the dataset before diving into text preprocessing.

In [None]:
# View the first few rows
print('First 5 rows of the SMS Spam dataset:')
display(df.head())

# Summary statistics
print('Summary statistics:')
display(df.describe())

# Data structure and missing values
# (df.info() prints its report directly and returns None, so it is not wrapped in display())
print('Dataset info:')
df.info()

print('Missing values in each column:')
display(df.isnull().sum())
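
For a classification task it is also worth checking how the labels are distributed and how long the messages are. The sketch below uses a small made-up DataFrame (`toy_df`) so that it runs on its own; on the real data you would apply the same calls to `df`.

```python
import pandas as pd

# Made-up messages standing in for the real dataset
toy_df = pd.DataFrame({
    "label": ["ham", "ham", "ham", "spam"],
    "message": [
        "See you at lunch",
        "Running late",
        "Call me later please",
        "WIN a FREE prize now!!!",
    ],
})

# Share of each class; a strong imbalance affects how accuracy should be read
class_share = toy_df["label"].value_counts(normalize=True)
print(class_share)

# Message length in characters, summarized per class
toy_df["length"] = toy_df["message"].str.len()
print(toy_df.groupby("label")["length"].describe())
```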

### Exercise

Using the dataset you loaded (or your own dataset):

1. Display the first few rows of the dataset.
2. Print summary statistics and the dataset information.
3. Identify any missing values in the dataset.

## Step 3: Text Cleaning and Preprocessing

Real-world text data is often noisy. In this step, you'll clean and preprocess your text data by:

- Converting text to lowercase
- Removing punctuation and special characters
- Removing extra whitespace

This will help standardize the data for effective feature extraction.

In [None]:
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove every character that is not a lowercase letter or whitespace
    # (this drops punctuation, digits, and symbols)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply text cleaning to the 'message' column
df['clean_message'] = df['message'].apply(clean_text)

print('Cleaned text sample:')
display(df[['message', 'clean_message']].head())
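
To see exactly what the cleaning function keeps and drops, you can run it on a few hand-written strings. The sketch below repeats `clean_text` so that it is self-contained. The sample strings are invented for illustration.

```python
import re


def clean_text(text):
    # Same steps as the cell above: lowercase, keep letters/whitespace, collapse spaces
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


samples = [
    "Hello,   World!!",
    "WIN £1000 now!!!",
    "Don't be late",
]

for s in samples:
    print(repr(s), "->", repr(clean_text(s)))
```

Note that digits and currency symbols disappear and that `Don't` becomes `dont`. Numbers can be a useful signal in spam messages, so you may want to experiment with keeping them.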

### Exercise

1. Write a function to clean the text data using the steps described above.
2. Apply your cleaning function to the text column of your dataset.
3. Inspect the cleaned text to ensure it meets expectations.

## Step 4: Feature Engineering - Converting Text to Numerical Features

Machine learning models require numerical input. One common approach to convert text data into numbers is TF-IDF vectorization. In this step, we will:

- Use `TfidfVectorizer` from Scikit-Learn to convert text into numerical TF-IDF features

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

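# Fit and transform the cleaned text.
# Note: fitting on the full dataset lets vocabulary and IDF statistics from rows
# that later end up in the test set influence the features. Fitting on the
# training split only avoids this leak.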
X_features = vectorizer.fit_transform(df['clean_message'])

print('TF-IDF features shape:', X_features.shape)

### Exercise

1. Use `TfidfVectorizer` to transform your cleaned text data into TF-IDF features.
2. (Optional) Experiment with other vectorizers such as `CountVectorizer` and compare the results.

## Step 5: Splitting the Data

Before training our text classification models, we need to split the dataset into training and testing sets so that performance is measured on messages the models have not seen. The target variable here is `label`, which marks each message as `spam` or `ham` (not spam).

In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = X_features
y = df['label']

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Training set shape:', X_train.shape)
print('Testing set shape:', X_test.shape)
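
As a variation, here is a sketch of a split that keeps the class proportions equal in both sets (`stratify`) and fits the vectorizer on the training messages only. The ten messages are made up, and the 50/50 split is chosen only so that the tiny example has a spam message on each side.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up messages: 8 ham, 2 spam
texts = [
    "see you at lunch",
    "running late sorry",
    "call me when you are home",
    "are we still on for tonight",
    "thanks for the lift",
    "happy birthday to you",
    "meeting moved to three",
    "do you need anything from the shop",
    "win a free prize now",
    "free entry claim your cash reward",
]
labels = ["ham"] * 8 + ["spam"] * 2

# Split the raw text first; stratify keeps the ham/spam ratio in both halves
text_train, text_test, y_train_toy, y_test_toy = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# Learn the vocabulary and IDF weights from the training messages only,
# then reuse them to transform the test messages
split_vectorizer = TfidfVectorizer()
X_train_toy = split_vectorizer.fit_transform(text_train)
X_test_toy = split_vectorizer.transform(text_test)

print(X_train_toy.shape, X_test_toy.shape)
print(list(y_train_toy).count("spam"), list(y_test_toy).count("spam"))
```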

### Exercise

1. Define your features (`X`) and target (`y`).
2. Use `train_test_split` to divide your data into training and test sets (e.g., 80% training, 20% testing).

## Step 5.5: Adjusting Hyperparameters

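Hyperparameters are settings you choose before training rather than values the model learns from the data. We will use three classifiers and set the following for each:

- **Logistic Regression:** Adjust `max_iter`, the maximum number of solver iterations
- **Multinomial Naive Bayes:** Adjust `alpha`, the additive smoothing parameter
- **Support Vector Machine (SVM):** Adjust `C`, the inverse regularization strength, and the `kernel`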

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Initialize models with hyperparameters
lr_model = LogisticRegression(max_iter=200, solver='lbfgs')
nb_model = MultinomialNB(alpha=1.0)
# probability=True enables predict_proba but makes training slower
svm_model = SVC(C=1.0, kernel='linear', probability=True)

models = {
    'Logistic Regression': lr_model,
    'Multinomial Naive Bayes': nb_model,
    'SVM': svm_model
}

print('Models initialized with adjusted hyperparameters:')
for name, model in models.items():
    print(name, model)
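
Instead of choosing values by hand, you can let scikit-learn try several candidates with cross-validation. The following is a minimal sketch on made-up messages. The `alpha` grid, `cv=2`, and the `f1_macro` scoring are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up messages: 6 ham, 6 spam
ham_texts = [
    "see you at lunch",
    "running late sorry",
    "call me when you are home",
    "are we still on for tonight",
    "thanks for the lift",
    "happy birthday to you",
]
spam_texts = [
    "win a free prize now",
    "free entry claim your cash reward",
    "urgent claim your free voucher",
    "you have won a cash prize call now",
    "free ringtone text now to claim",
    "win cash now reply to claim",
]
grid_texts = ham_texts + spam_texts
grid_labels = ["ham"] * len(ham_texts) + ["spam"] * len(spam_texts)

# Putting the vectorizer in the pipeline means it is refit inside every fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# "nb__alpha" addresses the alpha parameter of the pipeline step named "nb"
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(grid_texts, grid_labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```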

### Exercise

1. Initialize each model with the suggested hyperparameters.
2. (Optional) Experiment with different hyperparameter values and observe how they affect performance.

## Step 6: Training Different Classification Models

Now it's time to train our text classification models using the training data from the SMS Spam dataset.

In [None]:
# Train each model and store predictions
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    predictions[name] = pred
    print(f"{name} model trained.")

print('\nAll models have been trained on the SMS Spam dataset!')
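
Once a model is trained, you will usually want to classify new raw messages. One way to do this, sketched below under the assumption that you bundle the vectorizer and the classifier in a scikit-learn `Pipeline`, is to call `predict` directly on strings. The training and test messages are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training messages
train_texts = [
    "see you at lunch",
    "running late sorry",
    "call me when you are home",
    "thanks for the lift",
    "win a free prize now",
    "free entry claim your cash reward",
    "urgent claim your free voucher",
    "you have won a cash prize call now",
]
train_labels = ["ham"] * 4 + ["spam"] * 4

# The pipeline applies TF-IDF and then the classifier in a single call
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])
spam_pipeline.fit(train_texts, train_labels)

new_messages = ["claim your free prize", "see you at home"]
new_preds = spam_pipeline.predict(new_messages)
new_probs = spam_pipeline.predict_proba(new_messages)

print(list(spam_pipeline.classes_))
print(list(new_preds))
print(new_probs.round(2))
```

The columns of `predict_proba` follow the order of `classes_`, so it is worth printing both together.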

### Exercise

1. Train each model on the training set and store the predictions for the test set.
2. (Optional) Try adding an additional model or two to compare performance.

## Step 7: Evaluating Model Performance

After training our models, it's essential to evaluate their performance on the test data. We will use:

- A classification report to assess metrics such as accuracy, precision, recall, and F1-score
- A confusion matrix to visualize the distribution of correct and incorrect predictions

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

for name, pred in predictions.items():
    print(f"\nModel: {name}")
    print('Classification Report:')
    print(classification_report(y_test, pred))
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, pred))
    print('---------------------------')
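
To connect the confusion matrix with the numbers in the classification report, it can help to compute precision and recall for the spam class by hand. The sketch below uses short made-up label lists rather than real model output.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# With labels=["ham", "spam"], rows are true classes and columns are predictions,
# so ravel() yields: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["ham", "spam"]).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", round(precision, 3))
print("recall:", round(recall, 3))

# The same values from scikit-learn's helpers
sk_precision = precision_score(y_true, y_pred, pos_label="spam")
sk_recall = recall_score(y_true, y_pred, pos_label="spam")
```

For a spam filter, a false positive (a real message marked as spam) is often more costly than a missed spam message, so precision on the spam class deserves a close look.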

### Exercise

1. For each model, generate a classification report and confusion matrix.
2. Compare the performance metrics to determine which model performs best on the SMS Spam dataset.
3. Write a brief analysis of why one model might outperform the others based on the metrics.

## Final Thoughts and Next Steps

Great job! You have now built a complete NLP classification pipeline using the SMS Spam Collection dataset:

1. **Choosing a Dataset:** Downloaded the SMS Spam Collection dataset from Kaggle.
2. **Loading and Exploring the Data:** Loaded and explored the dataset.
3. **Text Cleaning and Preprocessing:** Cleaned and preprocessed the text data.
4. **Feature Engineering:** Converted text into numerical features using TF-IDF.
5. **Splitting the Data:** Divided the data into training and testing sets.
6. **Adjusting Hyperparameters:** Set and adjusted hyperparameters for each model.
7. **Training Models:** Built several text classification models.
8. **Evaluating Performance:** Assessed model performance on unseen data.

### Next Steps

- Experiment with other text datasets and replicate these steps.
- Explore additional text preprocessing techniques (e.g., stemming, lemmatization, or advanced stopword removal).
- Try more advanced models, including deep learning approaches for NLP.

Keep exploring and enjoy your journey into natural language processing!

## References and Further Reading

Here are some useful resources for the modules and functions used in this notebook:

- **Pandas:** [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
- **Scikit-Learn:** [Scikit-Learn Documentation](https://scikit-learn.org/stable/documentation.html)
  - [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
  - [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
  - [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
  - [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
  - [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- **Regular Expressions (re):** [Python re Module](https://docs.python.org/3/library/re.html)
- **NLTK:** [NLTK Documentation](https://www.nltk.org/)

These resources will help you dive deeper into the topics covered in this notebook.