 ## 03_feature_engineering.ipynb
Step 3: Feature Engineering

In this notebook, we transform the cleaned data into model-friendly features.
The output is X (features) and y (target), stored as CSV files for use in Step 4.


## Step 1: Import required libraries

In [29]:
import pandas as pd
import numpy as np

print("Successfully imported!")

Successfully imported!


## Step 2: Load cleaned dataset
We start by loading `laptop_price_clean.csv`, the file prepared in Step 1 (01_data_cleaning.ipynb).


In [30]:
import os
print(os.getcwd())

df = pd.read_csv('../data/processed/laptop_price_clean.csv', encoding='latin1')

# Quick check of the dataset
df.head()


/home/prince/Laptop_price_prediction_model/notebooks


Unnamed: 0,laptop_ID,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_euros,SSD_Flag,HDD_Flag,Storage_GB,Ram_GB,Memory_GB,Company_std,OpSys_std
0,1,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,1339.69,1,0,128.0,8.0,128.0,Apple,Mac
1,2,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,898.94,0,0,128.0,8.0,128.0,Apple,Mac
2,3,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,575.0,1,0,256.0,8.0,256.0,HP,Other
3,4,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,2537.45,1,0,512.0,16.0,512.0,Apple,Mac
4,5,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,1803.6,1,0,256.0,8.0,256.0,Apple,Mac


## Step 3: Feature Transformation
Now we will create new features from existing columns so that machine learning models can work effectively.


### 3.1 Convert RAM to numeric (GB)
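RAM is stored as a string such as "8GB"; we keep only the numeric part. This overwrites the float `Ram_GB` column loaded from the cleaned file with an integer version.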



In [31]:
# Remove "GB" and convert RAM to integer
df["Ram_GB"] = df["Ram"].str.replace("GB", "").astype(int)

df[["Ram", "Ram_GB"]].head()


Unnamed: 0,Ram,Ram_GB
0,8GB,8
1,8GB,8
2,8GB,8
3,16GB,16
4,8GB,8


### 3.2 Convert Weight to numeric (kg)
Weight is stored as a string such as "1.37kg". We strip the unit and convert the rest to float.


In [32]:
# Remove "kg" from the Weight column and convert it to float
df["Weight_kg"] = df["Weight"].str.replace("kg", "").astype(float)

# Check the transformation
df[["Weight", "Weight_kg"]].head()


Unnamed: 0,Weight,Weight_kg
0,1.37kg,1.37
1,1.34kg,1.34
2,1.86kg,1.86
3,1.83kg,1.83
4,1.37kg,1.37
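
The `.str.replace(...).astype(...)` pattern used in 3.1 and 3.2 raises an error as soon as one value does not fit the expected shape, for example a missing weight or a stray placeholder. A more defensive variant, sketched below, extracts the leading number with a regular expression and lets `pd.to_numeric` turn anything unparseable into `NaN`. The helper name `to_number` and the sample values are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Extract the first decimal number from each value; unparseable values become NaN."""
    extracted = series.astype(str).str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")

# Illustrative sample, not taken from the dataset
sample = pd.DataFrame({"Ram": ["8GB", "16GB", None], "Weight": ["1.37kg", "2kg", "?"]})
sample["Ram_GB"] = to_number(sample["Ram"])
sample["Weight_kg"] = to_number(sample["Weight"])
print(sample)
```

Rows that come back as `NaN` can then be inspected or imputed explicitly instead of stopping the notebook.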


### 3.3 Compute Pixels Per Inch (PPI)
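Pixel density (PPI) combines resolution and screen size into one number: the diagonal length in pixels divided by the diagonal size in inches. It is often a more informative feature than resolution alone.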


In [33]:
# Extract numeric width and height only
df[["Screen_W", "Screen_H"]] = df["ScreenResolution"].str.extract(r"(\d+)x(\d+)").astype(int)

# Calculate PPI
df["PPI"] = (((df["Screen_W"]**2 + df["Screen_H"]**2) ** 0.5) / df["Inches"]).astype(float)

# Check result
df[["ScreenResolution", "Inches", "Screen_W", "Screen_H", "PPI"]].head()


Unnamed: 0,ScreenResolution,Inches,Screen_W,Screen_H,PPI
0,IPS Panel Retina Display 2560x1600,13.3,2560,1600,226.983005
1,1440x900,13.3,1440,900,127.67794
2,Full HD 1920x1080,15.6,1920,1080,141.211998
3,IPS Panel Retina Display 2880x1800,15.4,2880,1800,220.534624
4,IPS Panel Retina Display 2560x1600,13.3,2560,1600,226.983005
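
As a worked check of the formula, PPI = sqrt(W² + H²) / diagonal, the sketch below recomputes the value for the first row of the table (2560x1600 on a 13.3-inch panel). The function name `pixels_per_inch` is an illustrative assumption.

```python
import math

def pixels_per_inch(width_px: int, height_px: int, diagonal_in: float) -> float:
    """Diagonal resolution in pixels divided by the diagonal size in inches."""
    if diagonal_in <= 0:
        raise ValueError("diagonal_in must be positive")
    return math.hypot(width_px, height_px) / diagonal_in

ppi = pixels_per_inch(2560, 1600, 13.3)
print(round(ppi, 2))  # 226.98
```

The result agrees with the `PPI` column computed by the vectorized pandas expression above.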


### 3.4 Extract CPU Brand
Instead of using raw CPU names, we reduce each one to its brand (Intel, AMD, Apple, or Other).


In [34]:
# Function to categorize CPU into broad brands
def extract_cpu_brand(cpu_name):
    cpu_name = cpu_name.lower()
    if "intel" in cpu_name:
        return "Intel"
    elif "amd" in cpu_name:
        return "AMD"
    elif "apple" in cpu_name:
        return "Apple"
    else:
        return "Other"

# Apply function to CPU column
df["Cpu_Brand"] = df["Cpu"].apply(extract_cpu_brand)

# Check result
df[["Cpu", "Cpu_Brand"]].head()


Unnamed: 0,Cpu,Cpu_Brand
0,Intel Core i5 2.3GHz,Intel
1,Intel Core i5 1.8GHz,Intel
2,Intel Core i5 7200U 2.5GHz,Intel
3,Intel Core i7 2.7GHz,Intel
4,Intel Core i5 3.1GHz,Intel


### 3.5 Extract GPU Brand
We apply the same keyword matching to the GPU column.


In [35]:
# Function to categorize GPU into broad brands
def extract_gpu_brand(gpu_name):
    gpu_name = gpu_name.lower()
    if "nvidia" in gpu_name:
        return "Nvidia"
    elif "amd" in gpu_name:
        return "AMD"
    elif "intel" in gpu_name:
        return "Intel"
    else:
        return "Other"

# Apply function to GPU column
df["Gpu_Brand"] = df["Gpu"].apply(extract_gpu_brand)

# Check result
df[["Gpu", "Gpu_Brand"]].head()


Unnamed: 0,Gpu,Gpu_Brand
0,Intel Iris Plus Graphics 640,Intel
1,Intel HD Graphics 6000,Intel
2,Intel HD Graphics 620,Intel
3,AMD Radeon Pro 455,AMD
4,Intel Iris Plus Graphics 650,Intel
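
The CPU and GPU functions follow the same pattern: lower-case the text and return the label of the first keyword found. Both call `.lower()` directly, so a missing value (a float `NaN`) would raise an `AttributeError`. The sketch below shows one way to share the logic and tolerate missing values. The helper `label_by_keyword`, the rule list, and the sample strings are assumptions for illustration only.

```python
import pandas as pd

def label_by_keyword(series: pd.Series, rules: list[tuple[str, str]], default: str = "Other") -> pd.Series:
    """Return the label of the first rule whose keyword occurs in the lower-cased text."""
    def label(value) -> str:
        if pd.isna(value):  # Missing values fall back to the default label
            return default
        text = str(value).lower()
        for keyword, name in rules:
            if keyword in text:
                return name
        return default
    return series.apply(label)

# Rule order matters: the first matching keyword wins
gpu_rules = [("nvidia", "Nvidia"), ("amd", "AMD"), ("intel", "Intel")]
gpus = pd.Series(["Intel HD Graphics 620", "AMD Radeon Pro 455", "Nvidia GeForce MX150", None])
gpu_brand = label_by_keyword(gpus, gpu_rules)
print(gpu_brand.tolist())
```

The same helper could be reused for the CPU column by passing a different rule list.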


### 3.6 Storage Flags (SSD / HDD)
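Storage strings like "512GB SSD + 1TB HDD" can describe more than one drive. Despite the section title, the helper below returns the capacity in GB of the part that matches a keyword (for example "ssd" or "hdd"), or 0 if no part matches. This cell only defines the function; it is not applied to `df` here, and the `SSD_Flag` and `HDD_Flag` columns come from the cleaned file.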


In [36]:
def check_storage(storage, keyword):
    """Return the capacity in GB of the first storage part containing `keyword`, or 0."""
    if pd.isna(storage):  # Handle missing values
        return 0

    storage_str = str(storage).lower()  # Normalize to a lower-case string
    for part in storage_str.split("+"):
        if keyword in part:
            num = part.strip().split(" ")[0]  # Leading token, e.g. "512gb" or "1tb"
            if "tb" in num:
                return int(float(num.replace("tb", "")) * 1024)
            elif "gb" in num:
                return int(float(num.replace("gb", "")))
            else:  # If just a number without gb/tb
                return int(float(num))
    return 0
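
To show how such a helper behaves on typical `Memory` strings, here is a self-contained sketch. It uses a compact re-implementation under the hypothetical name `capacity_gb`, and the output column names `SSD_GB` and `HDD_GB` are also assumptions; the notebook itself does not create them.

```python
import pandas as pd

def capacity_gb(storage, keyword: str) -> int:
    """Capacity in GB of the first '+'-separated part that mentions keyword, else 0."""
    if pd.isna(storage):
        return 0
    for part in str(storage).lower().split("+"):
        if keyword in part:
            size = part.split()[0]  # Leading token, e.g. "512gb" or "1tb"
            if size.endswith("tb"):
                return int(float(size[:-2]) * 1024)
            if size.endswith("gb"):
                return int(float(size[:-2]))
            return int(float(size))
    return 0

# Illustrative sample strings
memory = pd.Series(["512GB SSD +  1TB HDD", "128GB Flash Storage", "256GB SSD", None])
ssd_gb = memory.apply(capacity_gb, keyword="ssd")
hdd_gb = memory.apply(capacity_gb, keyword="hdd")
print(pd.DataFrame({"Memory": memory, "SSD_GB": ssd_gb, "HDD_GB": hdd_gb}))
```

Note that "Flash Storage" matches neither keyword, so it yields 0 in both columns, which is consistent with `SSD_Flag` being 0 for that row in the loaded data.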


### 3.7 Simplify Operating System
We don't need to distinguish editions such as "Windows 10 Home" and "Windows 10 Pro". Four broad categories are enough: "Windows", "MacOS", "Linux", and "Other".


In [37]:
# Function to simplify OS into broad categories
def simplify_os(os_name):
    os_name = os_name.lower()
    if "windows" in os_name:
        return "Windows"
    elif "mac" in os_name:
        return "MacOS"
    elif "linux" in os_name:
        return "Linux"
    else:
        return "Other"

# Apply function to OpSys column
df["OS_Simplified"] = df["OpSys"].apply(simplify_os)

# Check results
df[["OpSys", "OS_Simplified"]].head()


Unnamed: 0,OpSys,OS_Simplified
0,macOS,MacOS
1,macOS,MacOS
2,No OS,Other
3,macOS,MacOS
4,macOS,MacOS


## Step 4: Encode categorical variables
We now apply one-hot encoding (OHE) to the categorical columns `Company`, `TypeName`, `Cpu_Brand`, `Gpu_Brand`, and `OS_Simplified`.


In [38]:
# List of categorical columns to encode
categorical_cols = ["Company", "TypeName", "Cpu_Brand", "Gpu_Brand", "OS_Simplified"]

# Apply one-hot encoding and drop the first level to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Preview the encoded dataset
df_encoded.head()


Unnamed: 0,laptop_ID,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_euros,...,TypeName_Ultrabook,TypeName_Workstation,Cpu_Brand_Intel,Cpu_Brand_Other,Gpu_Brand_Intel,Gpu_Brand_Nvidia,Gpu_Brand_Other,OS_Simplified_MacOS,OS_Simplified_Other,OS_Simplified_Windows
0,1,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,1339.69,...,True,False,True,False,True,False,False,True,False,False
1,2,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,898.94,...,True,False,True,False,True,False,False,True,False,False
2,3,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,575.0,...,False,False,True,False,True,False,False,False,True,False
3,4,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,2537.45,...,True,False,True,False,False,False,False,True,False,False
4,5,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,1803.6,...,True,False,True,False,True,False,False,True,False,False
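
To make the effect of `drop_first=True` concrete, the sketch below encodes a toy column. The toy frame is an assumption for illustration, and `dtype=int` is an optional choice that yields 0/1 instead of the `True`/`False` values seen above.

```python
import pandas as pd

# Toy data, not taken from the dataset
toy = pd.DataFrame({"OS_Simplified": ["Windows", "MacOS", "Linux", "Windows"]})

# The first category in sorted order ("Linux") is dropped and becomes the baseline
encoded = pd.get_dummies(toy, columns=["OS_Simplified"], drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in every remaining indicator column therefore represents the dropped baseline category, here "Linux".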


## Step 5: Prepare X (features) and y (target)
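Our target variable is `Price_euros`.  
All remaining columns go into **X**. As the preview below shows, that still includes the identifier `laptop_ID` and the text columns `Memory`, `Company_std`, and `OpSys_std`, which need to be dropped or encoded before fitting most models.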


In [42]:
# Define target variable
y = df_encoded["Price_euros"]

# Drop raw columns that have engineered replacements, plus the target
# Note: "Memory_GB" is dropped here, while the raw "Memory" string stays in X
drop_cols = ["Ram", "Weight", "ScreenResolution", "Cpu", "Gpu", "Memory_GB", "OpSys", "Price_euros"]
X = df_encoded.drop(columns=drop_cols)

# Preview features
X.head()


Unnamed: 0,laptop_ID,Inches,Memory,SSD_Flag,HDD_Flag,Storage_GB,Ram_GB,Company_std,OpSys_std,Weight_kg,...,TypeName_Ultrabook,TypeName_Workstation,Cpu_Brand_Intel,Cpu_Brand_Other,Gpu_Brand_Intel,Gpu_Brand_Nvidia,Gpu_Brand_Other,OS_Simplified_MacOS,OS_Simplified_Other,OS_Simplified_Windows
0,1,13.3,128GB SSD,1,0,128.0,8,Apple,Mac,1.37,...,True,False,True,False,True,False,False,True,False,False
1,2,13.3,128GB Flash Storage,0,0,128.0,8,Apple,Mac,1.34,...,True,False,True,False,True,False,False,True,False,False
2,3,15.6,256GB SSD,1,0,256.0,8,HP,Other,1.86,...,False,False,True,False,True,False,False,False,True,False
3,4,15.4,512GB SSD,1,0,512.0,16,Apple,Mac,1.83,...,True,False,True,False,False,False,False,True,False,False
4,5,13.3,256GB SSD,1,0,256.0,8,Apple,Mac,1.37,...,True,False,True,False,True,False,False,True,False,False
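
The preview still contains an ID column and several text columns. One possible way to keep only model-ready columns, sketched here on a toy frame, is to drop the identifier and then select numeric and boolean dtypes. The frame `toy_X` and the name `model_X` are illustrative assumptions; whether to drop or encode each text column is a modeling decision the notebook leaves open.

```python
import pandas as pd

# Toy frame mimicking a few columns from the preview above
toy_X = pd.DataFrame({
    "laptop_ID": [1, 2],
    "Memory": ["128GB SSD", "256GB SSD"],
    "Company_std": ["Apple", "HP"],
    "Ram_GB": [8, 8],
    "Weight_kg": [1.37, 1.86],
    "Cpu_Brand_Intel": [True, True],
})

# Drop the identifier, then keep only numeric and boolean columns
model_X = toy_X.drop(columns=["laptop_ID"]).select_dtypes(include=["number", "bool"])
print(model_X.columns.tolist())
```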


## Step 6: Save processed features
We save `X.csv` and `y.csv` to the `data/processed/` folder for use in Step 4 (Model Building). The target column is written under the header `Price`.


In [43]:
# Save X and y into processed folder for modeling
X.to_csv("../data/processed/X.csv", index=False)
y.to_csv("../data/processed/y.csv", index=False, header=["Price"])

print("✅ Feature Engineering Complete! Files saved to data/processed/")


✅ Feature Engineering Complete! Files saved to data/processed/
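
Before moving on, it can be worth confirming that the saved files load back with matching row counts. The sketch below performs that round trip in a temporary directory with made-up demo data, so it does not touch `data/processed/`. The names `X_demo`, `y_demo`, `X_back`, and `y_back` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Demo data standing in for the real X and y
X_demo = pd.DataFrame({"Ram_GB": [8, 16], "Weight_kg": [1.37, 1.83]})
y_demo = pd.Series([1339.69, 2537.45], name="Price_euros")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    X_demo.to_csv(out_dir / "X.csv", index=False)
    y_demo.to_csv(out_dir / "y.csv", index=False, header=["Price"])

    # Reload and confirm features and target stay aligned
    X_back = pd.read_csv(out_dir / "X.csv")
    y_back = pd.read_csv(out_dir / "y.csv")["Price"]

assert len(X_back) == len(y_back)
print(X_back.shape, y_back.name)
```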
