# Twitter Sentiment Analysis: Feature Engineering and Data Preprocessing

## Notebook Overview
This notebook is dedicated to preparing the cleaned Sentiment140 dataset for machine learning. The goal is to transform the raw text data into a format suitable for modeling by applying feature engineering techniques and preprocessing steps. This includes text normalization, stopword removal, expansion of contractions, and tokenization. We will also generate the necessary features from the temporal data and n-grams to optimize our sentiment analysis model.

## Table of Contents
1. **Introduction**  
   - Purpose of the notebook  
   - Description of the cleaned dataset  

2. **Text Preprocessing**  
   - Removal of stopwords  
   - Expansion of contractions  
   - Lemmatization and Tokenization  
   - Text vectorization (TF-IDF and CountVectorizer)  

3. **Feature Engineering**  
   - Temporal features (Day of the week, Hour of the day, Month)  
   - N-gram feature extraction (Unigrams, Bigrams, Trigrams)  

4. **Handling Imbalanced Data**  
   - Techniques for balancing sentiment classes (SMOTE, undersampling)  

5. **Train/Test Split**  
   - Splitting the dataset into training and testing sets  

6. **Conclusion**  
   - Summary of feature engineering and preprocessing steps  
   - Next steps and preparation for modeling

---


# 1. Introduction

### Purpose of the Notebook
The purpose of this notebook is to perform feature engineering and preprocessing on the cleaned Sentiment140 dataset to prepare it for sentiment analysis modeling. This will involve transforming the text data into numerical features that can be used by machine learning algorithms, while also addressing common challenges like imbalanced classes.

### Description of the Cleaned Dataset
The dataset under analysis is the cleaned version of the Sentiment140 dataset. After performing initial data cleaning in previous steps, the dataset consists of:

- **Sentiment Labels**:  
  - 0: Negative sentiment  
  - 1: Positive sentiment

- **Text Column**:  
  Contains the normalized text of tweets, ready for analysis. The text has been cleaned of URLs, mentions, and special characters.

- **Date Column**:  
  The date column has been standardized to the correct datetime format with UTC offsets, allowing for temporal analysis.
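
As an illustration of what the offset-aware format allows, the sketch below parses one timestamp in the stored format and reads off the fields used later as temporal features. It uses the standard library's `datetime.fromisoformat` instead of pandas only to stay self-contained, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

# One timestamp in the stored format: local time plus a UTC offset.
raw_timestamp = "2009-04-06 22:19:45-07:00"
parsed = datetime.fromisoformat(raw_timestamp)

day_of_week = parsed.weekday()   # Monday=0 ... Sunday=6
hour_of_day = parsed.hour        # hour at the recorded offset, not converted to UTC
month = parsed.month
utc_offset = parsed.utcoffset()  # timedelta of -7 hours for this row

print(day_of_week, hour_of_day, month, utc_offset)
```

Because the hour is read at the recorded offset, "hour of the day" here reflects the timestamp's own clock time and is not normalized to UTC.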

---


#### Let's import the necessary libraries and load the dataset.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from collections import Counter

import re

from imblearn.under_sampling import RandomUnderSampler

from scipy.sparse import hstack, csr_matrix

import os

import joblib
# Run these downloads once only
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [2]:
# Load the dataset
file_path = './clean_data/cleaned_twitter_data_After_EDA.csv'
data = pd.read_csv(file_path, encoding='latin1')

In [3]:
data.head()

Unnamed: 0,target,date,text,text_length,day_of_week
0,0,2009-04-06 22:19:45-07:00,a thats a bummer you shoulda got david carr of...,67.0,0
1,0,2009-04-06 22:19:49-07:00,is upset that he cant update his facebook by t...,104.0,0
2,0,2009-04-06 22:19:53-07:00,i dived many times for the ball managed to sav...,76.0,0
3,0,2009-04-06 22:19:57-07:00,my whole body feels itchy and like its on fire,46.0,0
4,0,2009-04-06 22:19:57-07:00,no its not behaving at all im mad why am i her...,85.0,0


In [4]:
data.dtypes

target           int64
date            object
text            object
text_length    float64
day_of_week      int64
dtype: object

In [5]:
data.shape

(1578237, 5)

# 2. Text Preprocessing

In this section, we will apply several preprocessing techniques to the text data.

**Stopword Removal**:  
  Removing common words that don't add much value to sentiment analysis (e.g., "the", "and", "is").

In [6]:
# Build the stopword set once; rebuilding it inside the function would repeat the work for every row
stop_words = set(stopwords.words('english'))

# Define the function to remove stopwords
def remove_stopwords(text):
    # Tokenize the lowercased text
    words = word_tokenize(text.lower())
    # Keep only the tokens that are not stopwords
    filtered_text = [word for word in words if word not in stop_words]
    # Rebuild the text
    return ' '.join(filtered_text)

# Apply stopword removal to the 'text' column
data['cleaned_text'] = data['text'].apply(remove_stopwords)

# Verify the results
print("Sample cleaned text after stopword removal:")
print(data[['text', 'cleaned_text']].head())

Sample cleaned text after stopword removal:
                                                text  \
0  a thats a bummer you shoulda got david carr of...   
1  is upset that he cant update his facebook by t...   
2  i dived many times for the ball managed to sav...   
3     my whole body feels itchy and like its on fire   
4  no its not behaving at all im mad why am i her...   

                                        cleaned_text  
0      thats bummer shoulda got david carr third day  
1  upset cant update facebook texting might cry r...  
2  dived many times ball managed save 50 rest go ...  
3                   whole body feels itchy like fire  
4                           behaving im mad cant see  
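
The last sample row shows a side effect worth checking for sentiment work: "no its not behaving..." loses both "no" and "not", so the negation is gone from the cleaned text. Below is a minimal sketch of one way to keep negation words. It assumes a small hypothetical stopword set in place of the NLTK list and plain whitespace splitting in place of `word_tokenize`.

```python
# Hypothetical mini stopword list; in the notebook this would come from
# nltk.corpus.stopwords.words('english').
BASE_STOPWORDS = {"a", "the", "is", "its", "at", "all", "why", "am", "i", "no", "not"}

# Words we choose to keep because they can flip the polarity of a sentence.
NEGATIONS = {"no", "not", "nor", "never"}

STOPWORDS_KEEP_NEGATION = BASE_STOPWORDS - NEGATIONS

def remove_stopwords_keep_negation(text):
    """Drop stopwords but keep negation words; splits on whitespace only."""
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS_KEEP_NEGATION)

print(remove_stopwords_keep_negation("no its not behaving at all im mad"))
```

Whether keeping negations helps is an empirical question for the modeling stage; this only shows how the stopword set could be adjusted.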


In [7]:
# Define a function to check stopwords removal
def check_stopword_removal(original_text, cleaned_text):
    # Define the list of stopwords
    stop_words = set(stopwords.words('english'))
    
    # Tokenize both original and cleaned texts
    original_tokens = word_tokenize(original_text.lower())
    cleaned_tokens = word_tokenize(cleaned_text.lower())
    
    # Count stopwords in original text
    original_stopwords = [word for word in original_tokens if word in stop_words]
    cleaned_stopwords = [word for word in cleaned_tokens if word in stop_words]
    
    # Print the most common stopwords before and after removal
    original_stopwords_freq = Counter(original_stopwords).most_common(10)
    cleaned_stopwords_freq = Counter(cleaned_stopwords).most_common(10)
    
    print("Top 10 stopwords in original text:")
    print(original_stopwords_freq)
    print("\nTop 10 stopwords in cleaned text:")
    print(cleaned_stopwords_freq)

# Test the function with a sample row
check_stopword_removal(data['text'].iloc[0], data['cleaned_text'].iloc[0])

Top 10 stopwords in original text:
[('a', 2), ('you', 1), ('of', 1), ('to', 1), ('do', 1), ('it', 1), ('d', 1)]

Top 10 stopwords in cleaned text:
[]


**Contraction Expansion**:  
  Expanding contractions like **"don't"** to **"do not"** and **"can't"** to **"cannot"** to standardize the text. Since the apostrophes were removed previously (e.g., "can't" became "cant"), we can update the contraction dictionary to handle contractions in their modified form, without apostrophes.

In [8]:
# Dictionary mapping common contractions (apostrophes already stripped) to their expanded forms.
# Forms that collide with ordinary words once the apostrophe is gone ("were", "shed", "wed")
# are left out; "its", "id", and "lets" are kept but remain ambiguous.
contraction_dict = {
    "dont": "do not",
    "cant": "cannot",
    "wont": "will not",
    "isnt": "is not",
    "arent": "are not",
    "wasnt": "was not",
    "werent": "were not",
    "hasnt": "has not",
    "havent": "have not",
    "didnt": "did not",
    "doesnt": "does not",
    "wouldnt": "would not",
    "shouldnt": "should not",
    "couldnt": "could not",
    "im": "i am",
    "youre": "you are",
    "hes": "he is",
    "shes": "she is",
    "its": "it is",
    "theyre": "they are",
    "whats": "what is",
    "thats": "that is",
    "whos": "who is",
    "heres": "here is",
    "theres": "there is",
    "lets": "let us",
    "ive": "i have",
    "youve": "you have",
    "weve": "we have",
    "theyve": "they have",
    "imma": "i am going to",
    "wouldve": "would have",
    "shouldve": "should have",
    "couldve": "could have",
    "mightve": "might have",
    "mustve": "must have",
    "id": "i would",
    "youd": "you would",
    "hed": "he would",
    "theyd": "they would",
    "itd": "it would",
    "thered": "there would",
    "whod": "who would",
    "whatd": "what would",
    "whered": "where would",
    "whend": "when would",
    "whyd": "why would",
    "howd": "how would",
    "yall": "you all",
    "aint": "is not",
    "gonna": "going to",
    "wanna": "want to",
    "lemme": "let me",
    "gimme": "give me",
    "gotta": "got to",
    "kinda": "kind of",
    "sorta": "sort of",
    "outta": "out of",
    "lotta": "lot of",
    "dunno": "do not know",
    "yknow": "you know",
    "cmon": "come on"
}


def expand_contractions(text):
    """
    Replace common contractions in the given text with their expanded forms.

    Parameters
    ----------
    text : str
        The input string containing contractions.

    Returns
    -------
    str
        The input text with all recognized contractions expanded.
    """
    for contraction, expanded_form in contraction_dict.items():
        text = re.sub(r'\b' + contraction + r'\b', expanded_form, text, flags=re.IGNORECASE)
    return text

def check_contraction_expansion(original_text, expanded_text, contraction_dict):
    """
    Check if all contractions found in the original text have been expanded in the new text.

    Parameters
    ----------
    original_text : str
        The original text that may contain contractions.
    expanded_text : str
        The text after contraction expansion.
    contraction_dict : dict
        Dictionary of contractions and their expansions.

    Returns
    -------
    bool
        True if all contractions in the original text are expanded, False otherwise.
    """
    orig_tokens = word_tokenize(original_text.lower())
    exp_tokens = word_tokenize(expanded_text.lower())

    orig_contractions = [w for w in orig_tokens if w in contraction_dict]
    exp_contractions = [w for w in exp_tokens if w in contraction_dict]

    print("Original text contractions:", orig_contractions)
    print("Expanded text contractions:", exp_contractions)
    return len(exp_contractions) == 0

# Example usage (assuming 'data' is a DataFrame with a 'cleaned_text' column):
data['expanded_text'] = data['cleaned_text'].apply(expand_contractions)

# Test on a sample row:
sample_original = data['cleaned_text'].iloc[0]
sample_expanded = data['expanded_text'].iloc[0]

if check_contraction_expansion(sample_original, sample_expanded, contraction_dict):
    print("Contractions expanded successfully!")
else:
    print("Some contractions were not expanded.")


Original text contractions: ['thats']
Expanded text contractions: []
Contractions expanded successfully!
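
The `expand_contractions` loop runs one regex substitution per dictionary entry for every tweet. On a dataset of this size, a single precompiled pattern may be noticeably faster. The sketch below shows that variant. The shortened mapping and the function name are hypothetical and only serve the illustration.

```python
import re

# Hypothetical, shortened mapping for illustration only.
SMALL_CONTRACTIONS = {
    "cant": "cannot",
    "dont": "do not",
    "thats": "that is",
    "im": "i am",
}

# One alternation over all keys, longest first, anchored on word boundaries.
_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, SMALL_CONTRACTIONS), key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_fast(text):
    """Expand every known contraction in a single pass over the text."""
    return _PATTERN.sub(lambda m: SMALL_CONTRACTIONS[m.group(1).lower()], text)

print(expand_contractions_fast("im sad thats all i cant"))
```

The word boundaries keep "im" from matching inside words such as "important". The output is always the lowercase expansion, regardless of the casing of the match.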


#### The sample row was expanded correctly. Note that this check covers a single row, not the whole column.

**Check for duplicate tweets**:  
Earlier cleaning may have missed some duplicates. Now that the text is fully cleaned, we recheck and remove any rows whose `expanded_text` is identical to an earlier row.

In [9]:
# Step 1: Check the number of duplicates before removal
initial_duplicates = data.duplicated(subset=["expanded_text"]).sum()
print(f"Number of duplicates before removal: {initial_duplicates}")

# Step 2: Remove duplicates based on the 'expanded_text' column (update the dataset in place)
data.drop_duplicates(subset=["expanded_text"], inplace=True)

# Step 3: Check for duplicates again after removal
remaining_duplicates = data.duplicated(subset=["expanded_text"]).sum()
print(f"Number of duplicates after removal: {remaining_duplicates}")

# Reset the index after in-place modification
data.reset_index(drop=True, inplace=True)

Number of duplicates before removal: 94700
Number of duplicates after removal: 0


**Lemmatization and Tokenization**:  
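  Breaking the text into tokens (words) and reducing them to their base form. With its default settings, `WordNetLemmatizer` treats every token as a noun, so plurals are reduced (e.g., "times" -> "time"), while verb forms such as "running" are left unchanged unless a part-of-speech tag is passed (e.g., `pos='v'`).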

In [10]:
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def tokenize_and_lemmatize(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Lemmatize each token
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

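# Apply the function to the original 'text' column.
# Note: this bypasses 'expanded_text', so the stopword removal and contraction expansion
# above are not reflected in these tokens (see "a", "thats", "you" in the output below).
# Pass data['expanded_text'] instead to build on those steps.
data['lemmatized_tokens'] = data['text'].apply(tokenize_and_lemmatize)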

data['lemmatized_tokens'].head()

0    [a, thats, a, bummer, you, shoulda, got, david...
1    [is, upset, that, he, cant, update, his, faceb...
2    [i, dived, many, time, for, the, ball, managed...
3    [my, whole, body, feel, itchy, and, like, it, ...
4    [no, it, not, behaving, at, all, im, mad, why,...
Name: lemmatized_tokens, dtype: object

**Text Vectorization**:  
  Using techniques like **TF-IDF** or **CountVectorizer** to transform the text into numerical form for machine learning models.

In [11]:
# Join tokens into a single string per document
data['lemmatized_text'] = data['lemmatized_tokens'].apply(lambda tokens: ' '.join(tokens))

# Initialize the TF-IDF Vectorizer with parameters to control feature space size
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,  # limit to 10k features
    stop_words='english', # remove common English stopwords
    min_df=5,            # ignore terms that appear in fewer than 5 documents
    max_df=0.9           # ignore terms that appear in more than 90% of documents
)

# Fit and transform the joined lemmatized text
tfidf_features = tfidf_vectorizer.fit_transform(data['lemmatized_text'])

print("TF-IDF feature matrix shape:", tfidf_features.shape)

TF-IDF feature matrix shape: (1483537, 10000)
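
To make the weighting concrete, here is a small pure-Python sketch of the computation as we understand `TfidfVectorizer`'s defaults: raw term counts, a smoothed IDF of $\ln\frac{1+n}{1+\mathrm{df}(t)}+1$, then L2 normalization per document. The toy corpus and function name are made up for illustration, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized rows)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: count * idf[term] for term, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

toy_docs = ["good day", "bad day", "good good movie"]
weights = tfidf(toy_docs)
for doc, row in zip(toy_docs, weights):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

In the second toy document, "bad" appears in only one document and therefore outweighs "day", which appears in two. This is the behavior that lets rarer, more distinctive terms dominate a tweet's vector.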


# 3. Feature Engineering

This section focuses on creating new features that can be used for modeling, including:

**Temporal Features**:  
  - Extracting features like **day of the week**, **hour of the day**, and **month** from the `date` column to capture temporal patterns in tweet activity.

In [12]:
# `date` was loaded from CSV as an object (string) column, so convert it to datetime:
data['date'] = pd.to_datetime(data['date'], errors='coerce')

# Extract temporal features:
data['day_of_week'] = data['date'].dt.dayofweek      # Monday=0, Sunday=6
data['hour_of_day'] = data['date'].dt.hour
data['month'] = data['date'].dt.month

# For inspection:
print(data[['date', 'day_of_week', 'hour_of_day', 'month']].head())

                       date  day_of_week  hour_of_day  month
0 2009-04-06 22:19:45-07:00            0           22      4
1 2009-04-06 22:19:49-07:00            0           22      4
2 2009-04-06 22:19:53-07:00            0           22      4
3 2009-04-06 22:19:57-07:00            0           22      4
4 2009-04-06 22:19:57-07:00            0           22      4
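
Treating `hour_of_day` as a plain integer places 23:00 and 00:00 at opposite ends of the scale. One optional alternative, not used in this notebook, is a sine/cosine encoding that keeps adjacent hours close together. A minimal sketch with the standard library (the function names are hypothetical):

```python
import math

def encode_cyclical(value, period):
    """Map a periodic integer (e.g. hour 0-23 with period 24) to a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

late_evening = encode_cyclical(23, 24)
midnight = encode_cyclical(0, 24)
noon = encode_cyclical(12, 24)

# 23:00 ends up much closer to 00:00 than 12:00 does.
print(distance(late_evening, midnight), distance(noon, midnight))
```

The same function could be applied to `day_of_week` with a period of 7 or to `month` with a period of 12. Whether it improves results would need to be checked during modeling.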


**N-gram Features**:  
  - Extracting **unigrams**, **bigrams**, and **trigrams** to capture patterns in word sequences that are meaningful for sentiment analysis.

In [13]:
# TfidfVectorizer was already imported in the first cell.
# Refit on data['lemmatized_text'], this time with unigrams, bigrams, and trigrams.
# This replaces the unigram-only tfidf_vectorizer and tfidf_features from the previous cell.
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,    # Limit vocabulary to top 10,000 features
    stop_words='english',  # Remove common English stopwords
    min_df=5,              # Ignore terms appearing in fewer than 5 documents
    max_df=0.9,            # Ignore terms appearing in more than 90% of documents
    ngram_range=(1, 3)     # Include unigrams, bigrams, and trigrams
)

tfidf_features = tfidf_vectorizer.fit_transform(data['lemmatized_text'])

print("TF-IDF feature matrix shape with n-grams:", tfidf_features.shape)

TF-IDF feature matrix shape with n-grams: (1483537, 10000)
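
The matrix still has 10,000 columns because `max_features` caps the vocabulary across all n-gram lengths combined, so bigrams and trigrams compete with unigrams for the same slots. To show what `ngram_range=(1, 3)` generates for one tweet, here is a simplified pure-Python sketch (the function name is hypothetical; in scikit-learn the stopword filter is applied to the tokens before n-grams are formed):

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous n-grams of length min_n..max_n as space-joined strings."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["whole", "body", "feel", "itchy"]
for gram in word_ngrams(tokens):
    print(gram)
```

Four tokens yield 4 unigrams, 3 bigrams, and 2 trigrams, which is why the candidate vocabulary grows quickly and the `max_features` and `min_df` limits matter.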


**Combine temporal features with TF-IDF features**:  
  - To include the temporal features (`day_of_week`, `hour_of_day`, `month`) alongside the TF-IDF features, we combine them before undersampling. This ensures that undersampling removes the same rows from both the text features and the temporal features.

In [14]:
# 1. Extract temporal features into a NumPy array
temporal_features = data[['day_of_week', 'hour_of_day', 'month']].values

# 2. Convert the temporal features to a sparse matrix so the combined matrix stays sparse
temporal_sparse = csr_matrix(temporal_features)

# 3. Stack horizontally: TF-IDF columns first, then the three temporal columns.
#    format='csr' makes the result format explicit so rows can be sliced efficiently.
X_full = hstack([tfidf_features, temporal_sparse], format='csr')

# 4. Handling Imbalanced Data

The sentiment classes are only mildly imbalanced: after deduplication there are slightly more negative tweets (750,145) than positive ones (733,392). In this section, we will:

- Apply **random undersampling** to the majority class so that both classes have the same number of rows and the model is not biased towards the majority class. **SMOTE (Synthetic Minority Over-sampling Technique)** is an alternative that synthesizes minority-class samples instead, but it is not used in this notebook.

---


In [15]:
# Check class distribution before undersampling
print("Class Distribution Before Undersampling:")
print(data['target'].value_counts())

y = data['target']

# Perform undersampling on the combined feature set
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_full, y)

# Check class distribution after undersampling
print("\nClass Distribution After Undersampling:")
print(pd.Series(y_resampled).value_counts())

Class Distribution Before Undersampling:
target
0    750145
1    733392
Name: count, dtype: int64

Class Distribution After Undersampling:
target
0    733392
1    733392
Name: count, dtype: int64
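
In the output above, undersampling drops 750,145 − 733,392 = 16,753 negative rows, roughly 1.1% of the 1,483,537 rows. The idea behind random undersampling can be sketched in a few lines of plain Python. This is an illustration of the technique with a hypothetical function name, not the imbalanced-learn implementation:

```python
import random
from collections import Counter, defaultdict

def random_undersample(labels, seed=42):
    """Return sorted row indices so that every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_keep = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement; the minority class is kept in full.
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)

toy_labels = [0, 0, 0, 0, 0, 1, 1, 1]
kept = random_undersample(toy_labels)
print(Counter(toy_labels[i] for i in kept))
```

Because whole rows are selected by index, the text and temporal columns of a row are always kept or dropped together, which is the reason for combining the features before this step.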


# 5. Train/Test Split

We will split the cleaned and feature-engineered dataset into **training** and **testing** sets. This will allow us to evaluate the model's performance on unseen data.


In [16]:
# Ensure the output directory exists (the files below are written to ./modeling_data/ML)
os.makedirs('./modeling_data/ML', exist_ok=True)

# Perform train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled,
    y_resampled,
    test_size=0.2,      # 20% of data for testing
    random_state=42,     # for reproducibility
    stratify=y_resampled # maintain class balance in split
)

# Save the datasets for modeling
joblib.dump(X_train, './modeling_data/ML/X_train.joblib')
joblib.dump(X_test, './modeling_data/ML/X_test.joblib')
joblib.dump(y_train, './modeling_data/ML/y_train.joblib')
joblib.dump(y_test, './modeling_data/ML/y_test.joblib')

print("Datasets saved in ./modeling_data")

Datasets saved in ./modeling_data
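
With `test_size=0.2`, about one fifth of the 1,466,784 balanced rows (roughly 293,000) are held out for testing. The `stratify` argument asks for the same class proportions in both parts. The sketch below shows that idea in plain Python. It is a simplified illustration with hypothetical names, not scikit-learn's algorithm:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx), sum(toy_labels[i] for i in test_idx))
```

On the toy labels, the 20 test rows contain exactly 10 of each class, mirroring the 50/50 balance of the full set.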


# 6. Conclusion


### Summary of Feature Engineering and Preprocessing Steps

In this notebook, we have:

- **Text Preprocessing:**  
  - Removed stopwords to focus on meaningful tokens.
  - Expanded contractions to ensure words are in their standard forms.
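  - Applied lemmatization (on the original `text` column, with the default noun setting) to reduce plural forms to their base forms; the TF-IDF features are built from this lemmatized text, with stopwords filtered by the vectorizer.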

- **Temporal Feature Extraction:**  
  - Derived features such as the day of the week, hour of the day, and month from the tweet timestamps, capturing patterns that may influence sentiment.

- **N-gram Creation:**  
  - Extracted unigrams, bigrams, and trigrams, allowing the model to leverage context from sequences of words rather than single tokens alone.

- **Handling Imbalanced Data:**  
  - Employed random undersampling to balance the classes, so the model is trained on an equal number of positive and negative tweets.

- **Data Integration and Splitting:**  
  - Combined text-based TF-IDF features with temporal features.
  - Split the resulting dataset into training and testing subsets for unbiased model evaluation.
  - Saved the processed datasets for direct use in the next phase, avoiding the need for repeated preprocessing.

### Next Steps and Preparation for Modeling

With our dataset now fully preprocessed and enriched with a variety of engineered features, we are ready to proceed to the modeling stage. In the following notebook(s), we will:

- Load the prepared data.
- Experiment with various machine learning models.
- Evaluate their performance using appropriate metrics.
- Optimize the model hyperparameters to achieve the best possible results.

This sets the stage for building, training, and refining robust sentiment analysis models that leverage both textual and temporal signals.
