 ## 03_feature_engineering.ipynb
Step 3: Feature Engineering

In this notebook, we will transform the cleaned data into model-friendly features.
The output will be X (features) and y (target), stored as CSV files for use in Step 4.


## Step 1: Import required libraries

In [13]:
import pandas as pd
import numpy as np

print("Successfully imported!")

Successfully imported!


## Step 2: Load cleaned dataset
We start by loading `laptop_price_clean.csv`, the cleaned dataset prepared in Step 1 (01_data_cleaning.ipynb).


In [None]:
import os
print(os.getcwd())

df = pd.read_csv('../data/processed/laptop_price_clean.csv', encoding='latin1')

# Quick check of the dataset
df.head()
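
Every later cell assumes certain columns are present, so a quick check right after loading can catch a wrong file early. The sketch below is only an illustration: `REQUIRED_COLS` is collected from the column names used in this notebook, `missing_columns` is a hypothetical helper, and the single CSV row is made up.

```python
import io
import pandas as pd

# Columns that the later cells in this notebook read from
REQUIRED_COLS = ["Company", "TypeName", "Inches", "Resolution", "Cpu", "Ram",
                 "Storage", "Gpu", "OpSys", "Weight", "Price"]

def missing_columns(frame, required=REQUIRED_COLS):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]

# Tiny in-memory stand-in for the real CSV (made-up row, "Weight" left out on purpose)
sample_csv = io.StringIO(
    "Company,TypeName,Inches,Resolution,Cpu,Ram,Storage,Gpu,OpSys,Price\n"
    "Acme,Notebook,15.6,1920x1080,Intel Core i5,8GB,256GB SSD,Intel HD,Windows 10,500.0\n"
)
sample = pd.read_csv(sample_csv)

missing = missing_columns(sample)
print(missing)  # ['Weight']
```

On the real data, `missing_columns(df)` should return an empty list before moving on.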


## Step 3: Feature Transformation
Now we will create new features from existing columns so that machine learning models can work effectively.


### 3.1 Convert RAM to numeric (GB)
RAM is stored as a string such as "8GB"; we keep only the numeric part, as an integer number of gigabytes.



In [None]:
# Remove "GB" and convert RAM to integer
df["Ram_GB"] = df["Ram"].str.replace("GB", "").astype(int)

df[["Ram", "Ram_GB"]].head()
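
The `astype(int)` call above raises an error if any value is missing or not in the "8GB" form. If the cleaned file might still contain such rows, a more tolerant variant could look like the following sketch. The sample values are invented for illustration; the real column may not need this.

```python
import pandas as pd

# Invented sample values, including a missing entry and an unparseable one
ram = pd.Series(["8GB", "16 GB", "4gb", None, "unknown"])

# Pull out the first run of digits; rows without digits become NaN
digits = ram.str.extract(r"(\d+)", expand=False)

# Convert to a nullable integer column so missing values are kept as <NA>
ram_gb = pd.to_numeric(digits, errors="coerce").astype("Int64")

print(ram_gb.tolist())
```

Unparseable entries end up as missing values instead of stopping the notebook, so they can be inspected or dropped explicitly.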


### 3.2 Convert Weight to numeric (kg)
Weights may be in string form like "2.3kg". We'll extract the numeric part.


In [None]:
# Remove "kg" from the Weight column and convert it to float
df["Weight_kg"] = df["Weight"].str.replace("kg", "").astype(float)

# Check the transformation
df[["Weight", "Weight_kg"]].head()


### 3.3 Compute Pixels Per Inch (PPI)
Combining the resolution with the diagonal screen size gives pixels per inch (PPI), which describes display sharpness better than resolution alone.


In [None]:
# Split resolution into Screen Width and Height (e.g. "1920x1080" → 1920, 1080)
df[["Screen_W", "Screen_H"]] = df["Resolution"].str.split("x", expand=True).astype(int)

# Calculate PPI using the formula: sqrt(W^2 + H^2) / Inches
df["PPI"] = (((df["Screen_W"]**2 + df["Screen_H"]**2) ** 0.5) / df["Inches"]).astype(float)

# Check result
df[["Resolution", "Inches", "PPI"]].head()
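
As a worked example of the formula, take a 1920x1080 panel with a 15.6 inch diagonal: sqrt(1920² + 1080²) ≈ 2202.91 pixels along the diagonal, and 2202.91 / 15.6 ≈ 141.21 PPI. The same arithmetic as a small standalone function (the name `pixels_per_inch` is ours, not part of the notebook):

```python
import math

def pixels_per_inch(width_px, height_px, diagonal_in):
    """Diagonal pixel count divided by the diagonal length in inches."""
    if diagonal_in <= 0:
        raise ValueError("diagonal must be positive")
    return math.hypot(width_px, height_px) / diagonal_in

ppi = pixels_per_inch(1920, 1080, 15.6)
print(round(ppi, 2))  # 141.21
```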


### 3.4 Extract CPU Brand
Raw CPU names have too many distinct values to encode directly, so we reduce each one to its brand: Intel, AMD, Apple, or Other.


In [None]:
# Function to categorize CPU into broad brands
def extract_cpu_brand(cpu_name):
    cpu_name = cpu_name.lower()
    if "intel" in cpu_name:
        return "Intel"
    elif "amd" in cpu_name:
        return "AMD"
    elif "apple" in cpu_name:
        return "Apple"
    else:
        return "Other"

# Apply function to CPU column
df["Cpu_Brand"] = df["Cpu"].apply(extract_cpu_brand)

# Check result
df[["Cpu", "Cpu_Brand"]].head()
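
The function above only separates brands. If a finer split by tier turns out to be useful, one possible approach is to look for the Intel Core i3/i5/i7/i9 marker in the name. This is a sketch under the assumption that CPU strings look like "Intel Core i5 7200U 2.5GHz"; `extract_cpu_tier` and its labels are hypothetical and not part of the original notebook.

```python
import re

def extract_cpu_tier(cpu_name):
    """Hypothetical helper: bucket a CPU string into a coarse brand/tier label."""
    name = str(cpu_name).lower()
    if "intel" in name:
        # Look for a standalone "i3", "i5", "i7" or "i9" token
        match = re.search(r"\bi([3579])\b", name)
        if match:
            return f"Intel Core i{match.group(1)}"
        return "Intel Other"
    if "amd" in name:
        return "AMD"
    if "apple" in name:
        return "Apple"
    return "Other"

print(extract_cpu_tier("Intel Core i5 7200U 2.5GHz"))     # Intel Core i5
print(extract_cpu_tier("Intel Celeron Dual Core N3350"))  # Intel Other
print(extract_cpu_tier("AMD A9-Series 9420 3GHz"))        # AMD
```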


### 3.5 Extract GPU Brand
We apply the same keyword matching to the GPU column.


In [None]:
# Function to categorize GPU into broad brands
def extract_gpu_brand(gpu_name):
    gpu_name = gpu_name.lower()
    if "nvidia" in gpu_name:
        return "Nvidia"
    elif "amd" in gpu_name:
        return "AMD"
    elif "intel" in gpu_name:
        return "Intel"
    else:
        return "Other"

# Apply function to GPU column
df["Gpu_Brand"] = df["Gpu"].apply(extract_gpu_brand)

# Check result
df[["Gpu", "Gpu_Brand"]].head()
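
Because the CPU and GPU functions differ only in their keyword lists, they could be folded into one helper. The sketch below is an optional refactoring, not the notebook's own code; `brand_from_keywords` and the two keyword lists are names introduced here. It also converts the input with `str()`, so a missing value falls into "Other" instead of raising.

```python
def brand_from_keywords(name, keywords, default="Other"):
    """Return the label of the first keyword found in the name (case-insensitive)."""
    text = str(name).lower()
    for keyword, label in keywords:
        if keyword in text:
            return label
    return default

# Order matters: the first matching keyword wins
CPU_KEYWORDS = [("intel", "Intel"), ("amd", "AMD"), ("apple", "Apple")]
GPU_KEYWORDS = [("nvidia", "Nvidia"), ("amd", "AMD"), ("intel", "Intel")]

print(brand_from_keywords("Nvidia GeForce GTX 1050", GPU_KEYWORDS))  # Nvidia
print(brand_from_keywords("Intel HD Graphics 620", GPU_KEYWORDS))    # Intel
print(brand_from_keywords(float("nan"), GPU_KEYWORDS))               # Other
```

With pandas this would be applied as `df["Gpu"].apply(brand_from_keywords, keywords=GPU_KEYWORDS)`, since `Series.apply` forwards extra keyword arguments to the function.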


### 3.6 Storage Sizes (SSD / HDD)
Storage strings like "512GB SSD + 1TB HDD" are split into two numeric columns holding the SSD and HDD capacity in GB (0 when that drive type is absent).


In [None]:
# Helper function to extract storage size in GB for a given keyword (SSD/HDD)
def check_storage(storage, keyword):
    storage = storage.lower()
    if keyword in storage:
        for part in storage.split("+"):       # Handle multiple drives (e.g. SSD + HDD)
            if keyword in part:
                num = part.strip().split(" ")[0]  # Extract size (e.g. "512GB")
                if "tb" in num:                   # Convert TB to GB
                    return int(float(num.replace("tb", "")) * 1024)
                elif "gb" in num:
                    return int(num.replace("gb", ""))
    return 0

# Create separate columns for SSD and HDD
df["SSD_GB"] = df["Storage"].apply(lambda x: check_storage(x, "ssd"))
df["HDD_GB"] = df["Storage"].apply(lambda x: check_storage(x, "hdd"))

# Check results
df[["Storage", "SSD_GB", "HDD_GB"]].head()
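
The helper above returns the first matching drive and expects the size and unit to be written together ("512GB"). If the data also contains entries such as "128 GB SSD" or two drives of the same type, a regex-based variant is one way to handle them. This is a sketch with assumed input formats; `storage_gb` is a hypothetical name, and it keeps the notebook's convention of 1 TB = 1024 GB.

```python
import re

# "<number><optional space><gb|tb> <description>", e.g. "512gb ssd" or "1 tb hdd"
_DRIVE = re.compile(r"(\d+(?:\.\d+)?)\s*(gb|tb)\s+(.+)")

def storage_gb(storage, keyword):
    """Sum the capacity in GB of all drives whose description contains the keyword."""
    total = 0
    for part in str(storage).lower().split("+"):
        match = _DRIVE.match(part.strip())
        if match and keyword in match.group(3):
            size = float(match.group(1))
            if match.group(2) == "tb":
                size *= 1024  # same TB-to-GB factor as the notebook
            total += int(size)
    return total

print(storage_gb("512GB SSD + 1TB HDD", "ssd"))    # 512
print(storage_gb("512GB SSD + 1TB HDD", "hdd"))    # 1024
print(storage_gb("256GB SSD + 256GB SSD", "ssd"))  # 512
```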


### 3.7 Simplify Operating System
We don't need to distinguish editions such as "Windows 10 Home" and "Windows 10 Pro", so we collapse each value into one of "Windows", "MacOS", "Linux", or "Other".


In [None]:
# Function to simplify OS into broad categories
def simplify_os(os_name):
    os_name = os_name.lower()
    if "windows" in os_name:
        return "Windows"
    elif "mac" in os_name:
        return "MacOS"
    elif "linux" in os_name:
        return "Linux"
    else:
        return "Other"

# Apply function to OpSys column
df["OS_Simplified"] = df["OpSys"].apply(simplify_os)

# Check results
df[["OpSys", "OS_Simplified"]].head()


## Step 4: Encode categorical variables
Now we will apply One-Hot Encoding (OHE) to categorical columns like Company, TypeName, Cpu_Brand, Gpu_Brand, OS_Simplified.


In [None]:
# List of categorical columns to encode
categorical_cols = ["Company", "TypeName", "Cpu_Brand", "Gpu_Brand", "OS_Simplified"]

# Apply one-hot encoding and drop the first level to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Preview the encoded dataset
df_encoded.head()
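
To see what `drop_first=True` does, here is a toy example with a single made-up column. Without it, each category gets its own indicator column; with it, the first category (in sorted order) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({"OS_Simplified": ["Windows", "MacOS", "Linux", "Windows"]})

full = pd.get_dummies(toy, columns=["OS_Simplified"])
reduced = pd.get_dummies(toy, columns=["OS_Simplified"], drop_first=True)

print(list(full.columns))
# ['OS_Simplified_Linux', 'OS_Simplified_MacOS', 'OS_Simplified_Windows']
print(list(reduced.columns))
# ['OS_Simplified_MacOS', 'OS_Simplified_Windows']
```

A row with zeros in both remaining columns therefore means "Linux" in this toy example.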


## Step 5: Prepare X (features) and y (target)
Our target variable is **Price**.  
All other processed columns will be used as **X**.


In [None]:
# Define target variable
y = df_encoded["Price"]

# Drop raw/unnecessary columns (already engineered into new ones)
drop_cols = ["Ram", "Weight", "Resolution", "Cpu", "Gpu", "Storage", "OpSys", "Price"]
X = df_encoded.drop(columns=drop_cols)

# Preview features
X.head()
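
Most models expect purely numeric input, so it can be worth checking whether any text column survived the drop. We do not know which other columns the cleaned file contains, so the sketch below uses a made-up frame; `non_numeric_columns` is a hypothetical helper and "Product" is an invented leftover column.

```python
import pandas as pd

def non_numeric_columns(features):
    """List columns that are neither numeric nor boolean."""
    return list(features.select_dtypes(exclude=["number", "bool"]).columns)

# Made-up feature frame for illustration
toy_X = pd.DataFrame({
    "Ram_GB": [8, 16],
    "PPI": [141.2, 220.5],
    "Cpu_Brand_Intel": [True, False],
    "Product": ["Model A", "Model B"],  # hypothetical leftover text column
})

leftover = non_numeric_columns(toy_X)
print(leftover)  # ['Product']
```

If `non_numeric_columns(X)` is not empty on the real data, those columns need to be encoded or added to `drop_cols`.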


## Step 6: Save processed features
We will save `X.csv` and `y.csv` into the `data/processed/` folder for use in Step 4 (Model Building).


In [None]:
# Save X and y into processed folder for modeling
X.to_csv("../data/processed/X.csv", index=False)
y.to_csv("../data/processed/y.csv", index=False, header=["Price"])

print("✅ Feature Engineering Complete! Files saved to data/processed/")
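
As an optional last check, the saved files can be read back to confirm that features and target still line up row for row. The sketch below writes to a temporary directory with made-up values so that it runs on its own; in the notebook the same check would read from `../data/processed/`.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-ins for X and y
X_toy = pd.DataFrame({"Ram_GB": [8, 16], "Weight_kg": [1.37, 2.1]})
y_toy = pd.Series([1339.69, 898.94], name="Price")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X.csv")
    y_path = os.path.join(tmp, "y.csv")

    # Same calls as in the cell above
    X_toy.to_csv(x_path, index=False)
    y_toy.to_csv(y_path, index=False, header=["Price"])

    # Read both files back while the temporary directory still exists
    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)["Price"]

rows_match = len(X_back) == len(y_back) == len(X_toy)
print(rows_match)  # True
```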
