# Feature Engineering Notebook

This notebook transforms the cleaned and analyzed dataset into features suitable for machine learning models.

Based on the insights gathered from the Exploratory Data Analysis (EDA), we are preparing features for **two separate models**:

1. **Binary Classification Model**  
   - Predict whether a review is **Positive (4–5)** or **Negative (1–2)**; neutral reviews (score = 3) are dropped.

2. **Regression Model**  
   - Predict the **exact review score** (from 1 to 5) using numeric and text-based features.

---

#### Key Feature Engineering Steps:

- Generate custom features:
  - `helpfulness_ratio` from vote counts
  - `review_length_words` and `review_length_chars` from text
- Transform review text into numeric features using **TF-IDF Vectorization**
- Prepare final datasets separately for classification and regression models
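
The TF-IDF step itself is not shown in this notebook, so the following is only a minimal sketch of what a TF-IDF transform computes, written in plain Python on three made-up reviews. The function name `tfidf` and the smoothed IDF variant are assumptions chosen for illustration; in practice a library vectorizer would normally be used, and it may weight or normalize terms differently.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency (one common variant)
    idf = {
        term: math.log((1 + n_docs) / (1 + n_containing)) + 1
        for term, n_containing in doc_freq.items()
    }

    # Weight = raw term count in the document * idf of the term
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

# Toy reviews (made up, not taken from the dataset)
reviews = ["great taffy great price", "not as advertised", "great dog food"]
weights = tfidf(reviews)
print(weights[0])
```

Terms that appear in many reviews (here `great`) receive a lower weight per occurrence than terms confined to a single review (here `taffy`).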


#### Step 1: Load Cleaned Dataset & Setup

In this step, we load the dataset produced by the Data Cleaning phase and import the libraries used throughout the feature engineering process.


In [1]:
# Basic Imports
import pandas as pd
import numpy as np

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Load the cleaned dataset
df = pd.read_csv('cleaned_reviews.csv')

# Quick shape and preview
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (568401, 8)


|   | ProductId | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text | cleaned_text |
|---|---|---|---|---|---|---|---|---|
| 0 | B001E4KFG0 | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... | i have bought several of the vitality canned d... |
| 1 | B00813GRG4 | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... | product arrived labeled as jumbo salted peanut... |
| 2 | B000LQOCH0 | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... | this is a confection that has been around a fe... |
| 3 | B000UA0QIQ | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... | if you are looking for the secret ingredient i... |
| 4 | B006K2ZZ7K | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... | great taffy at a great price there was a wide ... |


#### Step 2: Feature Engineering – Common Features

In this step, we engineer features that are shared by both the binary classification and regression models.

#### Features:
- **review_length_words**: Total number of words in the cleaned review text.
- **review_length_chars**: Total number of characters in the cleaned review text.
- **helpfulness_ratio**: Ratio of helpful votes to total votes (numerator/denominator); reviews with no votes get a ratio of 0, and ratios above 1 are capped at 1.
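
To make the edge cases of the helpfulness ratio concrete, here is a small sketch on made-up vote counts (the `votes` frame is hypothetical, not drawn from the dataset). It covers a review with no votes, a normal case, and a numerator larger than the denominator. This variant masks zero denominators with `NaN` before dividing and uses `Series.clip` for the cap; it is one possible way to write the step, not necessarily the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up vote counts: no votes, 3 of 4 helpful, and an inconsistent 3 of 2
votes = pd.DataFrame({
    "HelpfulnessNumerator":   [0, 3, 3],
    "HelpfulnessDenominator": [0, 4, 2],
})

# Mask zero denominators so the division yields NaN instead of inf or 0/0
denominator = votes["HelpfulnessDenominator"].replace(0, np.nan)
ratio = votes["HelpfulnessNumerator"] / denominator

# No votes -> 0; anything above 1 is capped at 1
votes["helpfulness_ratio"] = ratio.fillna(0).clip(upper=1.0)
print(votes)
```

Note that `fillna(0)` in this variant would also map missing vote counts to 0, which may or may not be the desired treatment.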


In [4]:
# Feature 1: Review length (words)
df['review_length_words'] = df['cleaned_text'].apply(lambda x: len(str(x).split()))

# Feature 2: Review length (characters)
df['review_length_chars'] = df['cleaned_text'].apply(lambda x: len(str(x)))

# Feature 3: Helpfulness ratio
# Where the denominator is 0 (no votes), use 0 instead of the division result
df['helpfulness_ratio'] = np.where(
    df['HelpfulnessDenominator'] == 0,
    0,
    df['HelpfulnessNumerator'] / df['HelpfulnessDenominator']
)

# Cap ratios above 1 (observed in EDA)
df['helpfulness_ratio'] = df['helpfulness_ratio'].clip(upper=1.0)

# Preview
df[['review_length_words', 'review_length_chars', 'helpfulness_ratio']].describe()


|       | review_length_words | review_length_chars | helpfulness_ratio |
|---|---|---|---|
| count | 568401.0 | 568401.0 | 568401.0 |
| mean | 78.095874 | 409.870753 | 0.407885 |
| std | 76.164102 | 410.642152 | 0.462061 |
| min | 1.0 | 3.0 | 0.0 |
| 25% | 33.0 | 171.0 | 0.0 |
| 50% | 55.0 | 286.0 | 0.0 |
| 75% | 95.0 | 497.0 | 1.0 |
| max | 3348.0 | 20272.0 | 1.0 |
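
One thing worth checking in the length features: `str(x)` converts a missing value into the string `'nan'`, which counts as 1 word and 3 characters. Whether `cleaned_text` actually contains missing values is not shown here, so the sketch below only illustrates the effect on a made-up frame, along with one possible way to count missing text as length 0.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing review text
toy = pd.DataFrame({"cleaned_text": ["great taffy at a great price", np.nan]})

# Approach used above: str() turns NaN into the literal string 'nan'
via_str = toy["cleaned_text"].apply(lambda x: len(str(x).split()))

# Alternative: treat missing text as empty before measuring
text = toy["cleaned_text"].fillna("")
words = text.str.split().str.len()
chars = text.str.len()

print(via_str.tolist(), words.tolist(), chars.tolist())
```

The vectorized `.str` accessors also avoid a Python-level `apply` over every row, which can matter on a dataset of this size.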


#### Step 3: Classification-Specific Feature – `sentiment_binary`

For the **binary classification model**, we need to convert the original 1–5 `Score` values into two classes:
- **Positive (1):** Scores of 4 or 5  
- **Negative (0):** Scores of 1 or 2  
- **Neutral (NaN):** Score of 3 (to be dropped later during classification preprocessing)

This binary feature, `sentiment_binary`, will act as the target variable (`y`) for the classification model.


In [12]:
# For Classification Model – Create sentiment_binary target
# Positive = 1 (Score 4 or 5), Negative = 0 (Score 1 or 2), Neutral = NaN (Score 3)
df['sentiment_binary'] = df['Score'].apply(lambda x: 1 if x >= 4 else (0 if x <= 2 else np.nan))

# For Regression Model – Keep the original Score as target
# Already exists as df['Score']

# Preview target variables
df[['Score', 'sentiment_binary']].head(10)


|   | Score | sentiment_binary |
|---|---|---|
| 0 | 5 | 1.0 |
| 1 | 1 | 0.0 |
| 2 | 4 | 1.0 |
| 3 | 2 | 0.0 |
| 4 | 5 | 1.0 |
| 5 | 4 | 1.0 |
| 6 | 5 | 1.0 |
| 7 | 5 | 1.0 |
| 8 | 5 | 1.0 |
| 9 | 5 | 1.0 |
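
The neutral rows are only marked here, not removed, which is also why `sentiment_binary` is stored as a float. A minimal sketch of how a classification notebook might drop them and restore an integer label, assuming a made-up `scores` frame and the hypothetical name `clf_df`:

```python
import numpy as np
import pandas as pd

# Made-up scores covering all three cases
scores = pd.DataFrame({"Score": [5, 3, 1, 4, 2]})

# Same mapping as above, written with np.select instead of apply
scores["sentiment_binary"] = np.select(
    [scores["Score"] >= 4, scores["Score"] <= 2],
    [1.0, 0.0],
    default=np.nan,
)

# Drop neutral reviews, then cast the label from float to int
clf_df = scores.dropna(subset=["sentiment_binary"]).copy()
clf_df["sentiment_binary"] = clf_df["sentiment_binary"].astype(int)
print(clf_df)
```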


#### Step 4: Dropping Unused Columns

To simplify our modeling pipeline, we drop columns that are:

- Redundant,
- Not useful for modeling,
- Already encoded into derived features.

Columns dropped:

- `Id`, `ProductId`: Not informative for prediction.
- `UserId`, `ProfileName`: Identifying info – irrelevant for modeling.
- `HelpfulnessNumerator` and `HelpfulnessDenominator`: Already captured in `helpfulness_ratio`.
- `Time`: Not processed in this project.
- `Summary`: Often overlaps with `Text`; we're focusing only on `cleaned_text`.
- `Text`: Raw text column – already cleaned and stored in `cleaned_text`.

We retain only the cleaned and engineered features for further use.

In [8]:
# Drop unwanted columns
columns_to_drop = [
    'Id', 'ProductId', 'UserId', 'ProfileName',
    'HelpfulnessNumerator', 'HelpfulnessDenominator',
    'Time', 'Summary', 'Text'  # Keep 'cleaned_text'
]

df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

# Preview final columns
df.head()


|   | Score | cleaned_text | review_length_words | review_length_chars | helpfulness_ratio | sentiment_binary |
|---|---|---|---|---|---|---|
| 0 | 5 | i have bought several of the vitality canned d... | 48 | 259 | 1.0 | 1.0 |
| 1 | 1 | product arrived labeled as jumbo salted peanut... | 31 | 183 | 0.0 | 0.0 |
| 2 | 4 | this is a confection that has been around a fe... | 92 | 484 | 1.0 | 1.0 |
| 3 | 2 | if you are looking for the secret ingredient i... | 41 | 212 | 1.0 | 0.0 |
| 4 | 5 | great taffy at a great price there was a wide ... | 27 | 132 | 0.0 | 1.0 |
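
Because `errors='ignore'` skips names that are not present, a typo in `columns_to_drop` would pass silently. A small sketch, on a made-up frame, of listing which requested columns actually exist before dropping them (`present` and `missing` are illustrative names):

```python
import pandas as pd

# Made-up frame with only some of the columns named in the drop list
toy = pd.DataFrame({"ProductId": ["B001"], "Score": [5], "Text": ["Great taffy"]})
columns_to_drop = ["Id", "ProductId", "UserId", "Text"]

# Split the requested names into those that exist and those that do not
present = [c for c in columns_to_drop if c in toy.columns]
missing = [c for c in columns_to_drop if c not in toy.columns]

toy = toy.drop(columns=present)
print(f"Dropped: {present}; not found: {missing}")
```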


#### Step 5: Saving the Feature-Engineered Dataset

Now that feature engineering is complete, we save the cleaned and processed dataset to a `.csv` file.

This step allows us to:
- **Decouple** the feature engineering from model training.
- **Avoid re-running** preprocessing every time we run modeling notebooks.
- Keep our workflow **modular and efficient**.

We will load this file in both the **classification** and **regression** modeling notebooks.


In [10]:
# Save the feature-engineered dataset to a CSV file
df.to_csv("feature_engineered_reviews.csv", index=False)

print("Feature-engineered dataset saved as 'feature_engineered_reviews.csv'")


Feature-engineered dataset saved as 'feature_engineered_reviews.csv'
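
When the file is read back, the neutral rows still carry a missing `sentiment_binary`, since `NaN` is written to CSV as an empty field. A sketch of the round trip, using an in-memory buffer in place of the real file and made-up rows; the names `reg_df` and `clf_df` are hypothetical stand-ins for what each modeling notebook might build:

```python
import io
import numpy as np
import pandas as pd

# Made-up rows shaped like the saved dataset, including one neutral review
toy = pd.DataFrame({
    "Score": [5, 3, 1],
    "cleaned_text": ["great taffy", "it was okay", "not as advertised"],
    "sentiment_binary": [1.0, np.nan, 0.0],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# Regression can use every row; classification drops the neutral ones
reg_df = loaded
clf_df = loaded.dropna(subset=["sentiment_binary"])
print(len(reg_df), len(clf_df))
```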
