<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo">
    </a>
</p>


# Test Environment for Generative AI classroom labs

This lab provides a test environment for code generated in the Generative AI classroom.

Follow the instructions below to set up this environment for further use.


# Setup


### Install required libraries

If your task requires additional Python libraries, install them as shown below.


In [7]:
%pip install seaborn
import piplite

await piplite.install(['nbformat', 'plotly'])

### Dataset URL from the GenAI lab
In the cell below, use the URL provided in the GenAI lab.


In [8]:
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod2.csv"

### Downloading the dataset

Execute the following code to download the dataset into the interface.

> Please note that this step is essential in JupyterLite. If you are running a downloaded copy of this notebook in JupyterLab, you can skip this step and pass the URL directly to the `pandas.read_csv()` function to read the dataset as a dataframe.


In [9]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

path = URL

await download(path, "dataset.csv")
file_name  = "dataset.csv"
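
The note above says the download step can be skipped outside JupyterLite. A minimal sketch of that choice follows; it assumes the same URL and local file name as the cells above, and `dataset_source` is a hypothetical helper name, not part of the lab.

```python
from pathlib import Path

# Assumption: same URL and local file name as used in the cells above.
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod2.csv"
LOCAL_FILE = "dataset.csv"


def dataset_source(local_file=LOCAL_FILE, url=URL):
    """Return the local copy if it has been downloaded, otherwise the URL."""
    return local_file if Path(local_file).is_file() else url
```

In JupyterLab the result can be passed straight to pandas, for example `df = pd.read_csv(dataset_source(), header=0)`. In JupyterLite the download cell above should still be run first, so the local file exists.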

---


# Test Environment


In [13]:
import pandas as pd

file_path = "dataset.csv"
try:
    df = pd.read_csv(file_path, header=0)
    print("CSV loaded successfully!")
    print(df.head())
except FileNotFoundError:
    print(f"Error: File not found at {file_path}. Please verify the path or download the file.")
except Exception as e:
    print(f"An error occurred: {e}")
     

CSV loaded successfully!
   Unnamed: 0.1  Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  \
0             0           0         Acer         4    2   1         5   
1             1           1         Dell         3    1   1         3   
2             2           2         Dell         3    1   1         7   
3             3           3         Dell         4    2   1         5   
4             4           4           HP         4    2   1         7   

   Screen_Size_inch  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_pounds  \
0              14.0       0.551724       8             256        3.52800   
1              15.6       0.689655       4             256        4.85100   
2              15.6       0.931034       8             256        4.85100   
3              13.3       0.551724       8             128        2.69010   
4              15.6       0.620690       8             256        4.21155   

   Price Price-binned  Screen-Full_HD  Screen-IPS_panel  
0    978       

In [16]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Configuration: adjust these for your data
file_path = "dataset.csv"  # CSV with header row
feature_col = "feature"  # source variable (X) - UPDATE THIS BASED ON INSPECTION
target_col = "target"    # target variable (y) - UPDATE THIS BASED ON INSPECTION

# Load data
df = pd.read_csv(file_path, header=0)

# Inspect the DataFrame
print("DataFrame columns:", df.columns.tolist())
print("DataFrame shape:", df.shape)
print("First 5 rows:")
print(df.head())

# Prepare features and target (only if columns exist)
if feature_col in df.columns and target_col in df.columns:
    X = df[[feature_col]]
    y = df[target_col]

    # Train model
    model = LinearRegression()
    model.fit(X, y)

    # Predict and evaluate
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)

    # Output results
    print("MSE:", mse)
    print("R^2:", r2)
else:
    print(f"Error: Columns '{feature_col}' or '{target_col}' not found. Please update the column names in the configuration.")

DataFrame columns: ['Unnamed: 0.1', 'Unnamed: 0', 'Manufacturer', 'Category', 'GPU', 'OS', 'CPU_core', 'Screen_Size_inch', 'CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'Weight_pounds', 'Price', 'Price-binned', 'Screen-Full_HD', 'Screen-IPS_panel']
DataFrame shape: (238, 16)
First 5 rows:
   Unnamed: 0.1  Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  \
0             0           0         Acer         4    2   1         5   
1             1           1         Dell         3    1   1         3   
2             2           2         Dell         3    1   1         7   
3             3           3         Dell         4    2   1         5   
4             4           4           HP         4    2   1         7   

   Screen_Size_inch  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_pounds  \
0              14.0       0.551724       8             256        3.52800   
1              15.6       0.689655       4             256        4.85100   
2              15.6       0.931034       

In [18]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Configuration: adjust these for your data
file_path = "dataset.csv"  # CSV with header row
feature_col = "feature"  # source variable (X) - UPDATE THIS TO THE ACTUAL COLUMN NAME FROM YOUR CSV
target_col = "target"    # target variable (y) - UPDATE THIS TO THE ACTUAL COLUMN NAME FROM YOUR CSV

# Load data
df = pd.read_csv(file_path, header=0)

# Inspect the DataFrame (run this first to see column names)
print("DataFrame columns:", df.columns.tolist())
print("DataFrame shape:", df.shape)
print("First 5 rows:")
print(df.head())

# Prepare features and target (only if columns exist)
if feature_col in df.columns and target_col in df.columns:
    X = df[[feature_col]]
    y = df[target_col]

    # Train model
    model = LinearRegression()
    model.fit(X, y)

    # Predict and evaluate
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)

    # Output results
    print("MSE:", mse)
    print("R^2:", r2)
else:
    print(f"Error: Columns '{feature_col}' or '{target_col}' not found. Please update the column names in the configuration above based on the printed columns.")


DataFrame columns: ['Unnamed: 0.1', 'Unnamed: 0', 'Manufacturer', 'Category', 'GPU', 'OS', 'CPU_core', 'Screen_Size_inch', 'CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'Weight_pounds', 'Price', 'Price-binned', 'Screen-Full_HD', 'Screen-IPS_panel']
DataFrame shape: (238, 16)
First 5 rows:
   Unnamed: 0.1  Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  \
0             0           0         Acer         4    2   1         5   
1             1           1         Dell         3    1   1         3   
2             2           2         Dell         3    1   1         7   
3             3           3         Dell         4    2   1         5   
4             4           4           HP         4    2   1         7   

   Screen_Size_inch  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_pounds  \
0              14.0       0.551724       8             256        3.52800   
1              15.6       0.689655       4             256        4.85100   
2              15.6       0.931034       

In [23]:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Configuration: adjust file path and column names as needed
file_path = "dataset.csv"  # CSV with header row
feature_col = "feature"  # source variable (X) - UPDATE THIS TO THE ACTUAL COLUMN NAME FROM YOUR CSV
target_col = "target"    # target variable (y) - UPDATE THIS TO THE ACTUAL COLUMN NAME FROM YOUR CSV

# Load data
df = pd.read_csv(file_path, header=0)

# Inspect the DataFrame (run this first to see column names)
print("DataFrame columns:", df.columns.tolist())
print("DataFrame shape:", df.shape)
print("First 5 rows:")
print(df.head())

# Prepare single-feature input (only if columns exist)
if feature_col in df.columns and target_col in df.columns:
    X = df[[feature_col]]
    y = df[target_col]

    # Degrees for polynomial regression
    degrees = [2, 3, 5]
    results = []

    for degree in degrees:
        # Generate polynomial features for the single input feature
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        X_poly = poly.fit_transform(X)

        # Train linear regression on the expanded features
        model = LinearRegression()
        model.fit(X_poly, y)

        # Evaluate on training data
        y_pred = model.predict(X_poly)
        mse = mean_squared_error(y, y_pred)
        r2 = r2_score(y, y_pred)

        results.append({"degree": degree, "mse": mse, "r2": r2})

    # Best model by R^2
    best = max(results, key=lambda r: r["r2"])

    # Display results
    print("Results by degree:")
    for r in results:
        print(f"Degree {r['degree']}: MSE={r['mse']:.6f}, R2={r['r2']:.6f}")
    print("Best model: degree", best["degree"], "R2=", best["r2"])
else:
    print(f"Error: Columns '{feature_col}' or '{target_col}' not found. Please update the column names in the configuration above based on the printed columns.")


DataFrame columns: ['Unnamed: 0.1', 'Unnamed: 0', 'Manufacturer', 'Category', 'GPU', 'OS', 'CPU_core', 'Screen_Size_inch', 'CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'Weight_pounds', 'Price', 'Price-binned', 'Screen-Full_HD', 'Screen-IPS_panel']
DataFrame shape: (238, 16)
First 5 rows:
   Unnamed: 0.1  Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  \
0             0           0         Acer         4    2   1         5   
1             1           1         Dell         3    1   1         3   
2             2           2         Dell         3    1   1         7   
3             3           3         Dell         4    2   1         5   
4             4           4           HP         4    2   1         7   

   Screen_Size_inch  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_pounds  \
0              14.0       0.551724       8             256        3.52800   
1              15.6       0.689655       4             256        4.85100   
2              15.6       0.931034       

In [29]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Configuration: adjust as needed
file_path = "dataset.csv"  # CSV with header row
target_col = "Price"  # name of the target column

# Load data
df = pd.read_csv(file_path, header=0)

# Inspect the DataFrame (optional, but helpful to confirm columns)
print("DataFrame columns:", df.columns.tolist())
print("DataFrame shape:", df.shape)
print("First 5 rows:")
print(df.head())

# Drop unnecessary columns (e.g., index-like columns)
df = df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore')  # Adjust if needed

# Handle categorical columns: One-hot encode them
# Identify categorical columns (non-numeric)
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
if categorical_cols:
    df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
    print("Categorical columns encoded:", categorical_cols)

# Validate columns
if target_col not in df.columns:
    raise ValueError(f"Target column '{target_col}' not found in data. Columns: {list(df.columns)}")
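# Note: 'Price-binned' appears to be derived from 'Price'; keeping it as a feature
# may leak the target and inflate the scores.
feature_cols = df.columns.drop(target_col)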
if len(feature_cols) == 0:
    raise ValueError("No feature columns found besides the target column.")
X = df[feature_cols]
y = df[target_col]

# Ensure X is numeric (should be after encoding)
print("Feature columns (after encoding):", feature_cols)
print("X dtypes:", X.dtypes)

# Create a pipeline: scale features, generate polynomial features, then fit linear regression
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression())
])

# Train model
pipeline.fit(X, y)

# Evaluate on training data
y_pred = pipeline.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print("MSE:", mse)
print("R^2:", r2)


DataFrame columns: ['Unnamed: 0.1', 'Unnamed: 0', 'Manufacturer', 'Category', 'GPU', 'OS', 'CPU_core', 'Screen_Size_inch', 'CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'Weight_pounds', 'Price', 'Price-binned', 'Screen-Full_HD', 'Screen-IPS_panel']
DataFrame shape: (238, 16)
First 5 rows:
   Unnamed: 0.1  Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  \
0             0           0         Acer         4    2   1         5   
1             1           1         Dell         3    1   1         3   
2             2           2         Dell         3    1   1         7   
3             3           3         Dell         4    2   1         5   
4             4           4           HP         4    2   1         7   

   Screen_Size_inch  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_pounds  \
0              14.0       0.551724       8             256        3.52800   
1              15.6       0.689655       4             256        4.85100   
2              15.6       0.931034       

In [33]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

file_path = "dataset.csv"
target_col = "Price"  # name of the target column

df = pd.read_csv(file_path, header=0)

# Inspect the DataFrame (run this to confirm columns)
print("DataFrame columns:", df.columns.tolist())
print("First 5 rows:")
print(df.head())

# Drop unnecessary columns if present
df = df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore')

# Validate target column
if target_col not in df.columns:
    raise ValueError(f"Target column '{target_col}' not found. Available columns: {list(df.columns)}")

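# Separate features and target
# Note: 'Price-binned' appears to be derived from 'Price'; keeping it as a feature
# may leak the target and inflate the scores.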
feature_cols = df.columns.drop(target_col).tolist()
X = df[feature_cols]
y = df[target_col]

# Identify numeric and categorical columns for preprocessing
numeric_cols = X.select_dtypes(include=['number']).columns.tolist()
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# Select first 2 numeric columns for polynomial features (adjust if needed)
poly_features_cols = numeric_cols[:2] if len(numeric_cols) >= 2 else numeric_cols

# Define preprocessor: encode categoricals, apply poly to selected numeric, passthrough others
preprocessor = ColumnTransformer(
    transformers=[
        ("poly", PolynomialFeatures(degree=2, include_bias=False), poly_features_cols),
        ("cat", OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols),  # ignore categories unseen in a CV training fold
        ("passthrough", "passthrough", [c for c in numeric_cols if c not in poly_features_cols])
    ]
)

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", Ridge())
])

param_grid = {
    "preprocessor__poly__degree": [2, 3],
    "model__alpha": [0.1, 1.0, 10.0]
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error", n_jobs=-1)
grid.fit(X, y)

best = grid.best_estimator_
y_pred = best.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print("MSE:", mse)
print("R^2:", r2)
print("Best params:", grid.best_params_)


DataFrame columns: ['Unnamed: 0.1', 'Unnamed: 0', 'Manufacturer', 'Category', 'GPU', 'OS', 'CPU_core', 'Screen_Size_inch', 'CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'Weight_pounds', 'Price', 'Price-binned', 'Screen-Full_HD', 'Screen-IPS_panel']
First 5 rows:
   Unnamed: 0.1  Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  \
0             0           0         Acer         4    2   1         5   
1             1           1         Dell         3    1   1         3   
2             2           2         Dell         3    1   1         7   
3             3           3         Dell         4    2   1         5   
4             4           4           HP         4    2   1         7   

   Screen_Size_inch  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_pounds  \
0              14.0       0.551724       8             256        3.52800   
1              15.6       0.689655       4             256        4.85100   
2              15.6       0.931034       8             256        4.



MSE: 42875.20350329165
R^2: 0.8695956962898167
Best params: {'model__alpha': 0.1, 'preprocessor__poly__degree': 2}
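
For reference, the two metrics printed above can also be computed by hand. The sketch below uses plain Python and made-up values (not taken from the laptop dataset); `mse` and `r2` are hypothetical helper names that follow the standard definitions of mean squared error and the coefficient of determination.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (assumption: not from dataset.csv)
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.5, 7.0, 9.5]

print("MSE:", mse(y_true, y_pred))
print("R^2:", r2(y_true, y_pred))
```

Note that the cell above predicts on the same rows it was fitted on, so its MSE and R^2 describe the fit to the training data rather than performance on unseen laptops.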


## Authors


[Abhishek Gagneja](https://www.linkedin.com/in/abhishek-gagneja-23051987/)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-12-10|0.1|Abhishek Gagneja|Initial Draft created|


Copyright © 2023 IBM Corporation. All rights reserved.
