# Generative AI for Models Development

In this lab, we will use generative AI to create Python scripts to develop and evaluate different predictive models for a given data set.

### Learning Objectives

In this lab, you will learn how to use generative AI to create Python code that can:

- Use linear regression in one variable to fit the parameters to a model
- Use linear regression in multiple variables to fit the parameters to a model
- Use polynomial regression in a single variable to fit the parameters to a model
- Create a pipeline that performs linear regression on scaled polynomial features built from multiple attributes
- Use grid search with cross-validation and ridge regression to create a model with optimum hyperparameters

In [2]:
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod2.csv"

The first step is to ask the generative AI model for code that imports the provided dataset into a pandas data frame. In the prompt, you must specify that the dataset headers are in the first row of the CSV file.

You can structure the prompt to create the code as follows:

**Write a Python code that can perform the following tasks.**

**Read the CSV file, located on a given file path, into a pandas data frame, assuming that the first row of the file can be used as the headers for the data.**

In [3]:
import pandas as pd

# Path to the CSV file. Replace with the actual file path.
#csv_file_path = "path/to/your/file.csv"
csv_file_path = URL

# Read the CSV file into a DataFrame, using the first row as column headers.
df = pd.read_csv(csv_file_path, header=0)

# df now contains the data with headers from the first row.

In [4]:
df.head()

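| Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | Manufacturer | Category | GPU | OS | CPU_core | Screen_Size_inch | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_pounds | Price | Price-binned | Screen-Full_HD | Screen-IPS_panel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Acer | 4 | 2 | 1 | 5 | 14.0 | 0.551724 | 8 | 256 | 3.528 | 978 | Low | 0 | 1 |
| 1 | 1 | 1 | Dell | 3 | 1 | 1 | 3 | 15.6 | 0.689655 | 4 | 256 | 4.851 | 634 | Low | 1 | 0 |
| 2 | 2 | 2 | Dell | 3 | 1 | 1 | 7 | 15.6 | 0.931034 | 8 | 256 | 4.851 | 946 | Low | 1 | 0 |
| 3 | 3 | 3 | Dell | 4 | 2 | 1 | 5 | 13.3 | 0.551724 | 8 | 128 | 2.6901 | 1244 | Low | 0 | 1 |
| 4 | 4 | 4 | HP | 4 | 2 | 1 | 7 | 15.6 | 0.62069 | 8 | 256 | 4.21155 | 837 | Low | 1 | 0 |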


## Linear regression in one variable

**Write a Python code that performs the following tasks.**
1. **Develops and trains a linear regression model that uses one attribute of a data frame as the source variable and another as a target variable.**
2. **Calculate and display the MSE and R^2 values for the trained model.**

You can use this code to develop a linear regression model with the target variable as `Price` and the source variable as `CPU_frequency`. Try this out in the Test environment.

In [5]:
#import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Path to the CSV file (replace with your actual path)
#csv_file_path = "path/to/your/file.csv"

# Read CSV into a DataFrame with the first row as headers
#df = pd.read_csv(csv_file_path, header=0)

# Names of the feature (source variable) and target (response) columns
feature_col = 'CPU_frequency' # replace with your actual feature column name
target_col = "Price"    # replace with your actual target column name

# Prepare data
X = df[[feature_col]]  # 2D array as required by scikit-learn
y = df[target_col]

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict on the training data
y_pred = model.predict(X)

# Evaluate model performance
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

# Display results
print(f"MSE: {mse}")
print(f"R^2: {r2}")

MSE: 284583.4405868629
R^2: 0.1344436321024326
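The two metrics printed above can also be computed by hand, which makes their meaning concrete. The sketch below is a plain-Python illustration, not the scikit-learn implementation: `mse_and_r2` is a hypothetical helper, the target values are the five `Price` entries shown by `df.head()`, and the predictions are made up for the example.

```python
def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual sum of squares: error left over after using the model
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return ss_res / n, 1 - ss_res / ss_tot


y_true = [978, 634, 946, 1244, 837]   # Price values from df.head()
y_pred = [900, 700, 900, 1100, 850]   # made-up predictions for illustration

mse, r2 = mse_and_r2(y_true, y_pred)
print(f"MSE: {mse:.1f}")
print(f"R^2: {r2:.4f}")
```

MSE is expressed in squared price units, so its size depends on the scale of `Price`. R^2 is unitless and compares the model against a baseline that always predicts the mean.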


## Linear regression in multiple variables

**Write a Python code that performs the following tasks.**
1. **Develops and trains a linear regression model that uses some attributes of a data frame as the source variables and one of the attributes as a target variable.**
2. **Calculate and display the MSE and R^2 values for the trained model.**

You can use the generated code and build a linear regression model for the data set that uses `CPU_frequency`, `RAM_GB`, `Storage_GB_SSD`, `CPU_core`, `OS`, `GPU` and `Category` as source variables and `Price` as the target variable.
You may compare the performance of the two models by comparing their `MSE` and `R^2` values.

In [6]:
#import pandas as pd
from sklearn.model_selection import train_test_split
#from sklearn.linear_model import LinearRegression
#from sklearn.metrics import mean_squared_error, r2_score

# Path to the CSV file (replace with your actual path)
#csv_file_path = "path/to/your/file.csv"

# Read CSV into a DataFrame with the first row as headers
#try:
#    df = pd.read_csv(csv_file_path, header=0)
#except FileNotFoundError:
#    raise SystemExit(f"CSV file not found: {csv_file_path}")
#except Exception as e:
#    raise SystemExit(f"Error reading CSV: {e}")

# Specify the target column name and the feature columns to use
target_col = "Price"  # replace with your actual target column name
#feature_cols = [col for col in df.columns if col != target_col]
feature_cols=["CPU_frequency","RAM_GB","Storage_GB_SSD","CPU_core","OS","GPU","Category"]

if not feature_cols:
    raise SystemExit("No feature columns found. Ensure the target column name is correct.")

# Prepare data
X = df[feature_cols]
y = df[target_col]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model (multiple features)
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display results
print(f"MSE: {mse}")
print(f"R^2: {r2}")

MSE: 168575.62043820196
R^2: 0.26853839463024776
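When comparing these numbers with the single-variable result, note that the earlier cell scored its model on the same rows it was trained on, whereas this cell scores on a held-out 20% split, so the two pairs of metrics are not measured the same way. The sketch below illustrates the difference on synthetic data using only the standard library. The helpers `fit_line` and `mse`, the data, and the seed are all assumptions made for the example, and the closed-form fit stands in for `LinearRegression` with one feature.

```python
import random


def fit_line(xs, ys):
    """Closed-form least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


def mse(xs, ys, a, b):
    """Mean squared error of the line a + b * x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)                                # fixed seed, arbitrary choice
xs = [i / 20 for i in range(40)]                       # synthetic feature values
ys = [500 + 800 * x + rng.gauss(0, 100) for x in xs]   # synthetic noisy target

# Hold out 20% of the rows, mirroring test_size=0.2
indices = list(range(len(xs)))
rng.shuffle(indices)
test_idx, train_idx = indices[:8], indices[8:]
x_train = [xs[i] for i in train_idx]
y_train = [ys[i] for i in train_idx]
x_test = [xs[i] for i in test_idx]
y_test = [ys[i] for i in test_idx]

# Fit on the training rows only, then score on both sets
a, b = fit_line(x_train, y_train)
print(f"Train MSE: {mse(x_train, y_train, a, b):.1f}")
print(f"Test MSE:  {mse(x_test, y_test, a, b):.1f}")
```

The training score is usually the more optimistic of the two, which is why a fair comparison evaluates both models on the same held-out rows.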


## Polynomial regression

You can explore creating a model that uses higher-order features derived from the original attributes. Higher-order terms of the same feature let the model capture nonlinear relationships with the target variable, although a higher order does not guarantee better performance on unseen data. This is called polynomial regression, and you can use generative AI to create code for it.

Assume you are given a single attribute as the source variable and one as a target variable. You must create a model using polynomial regression for a given order. You can also make the model for different order values and compare their performance based on MSE and R^2 scores.
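To make "higher-order features" concrete: for a single attribute x and order d, the model is still a linear regression, but it is fitted on the columns x, x^2, ..., x^d. The helper below is a hypothetical plain-Python illustration of that expansion, with made-up input values. It mirrors what `PolynomialFeatures(degree=d, include_bias=False)` produces for one input column.

```python
def polynomial_features(x, degree):
    """Return [x, x**2, ..., x**degree]; no constant (bias) column."""
    return [x ** p for p in range(1, degree + 1)]


values = [0.5, 1.0, 2.0]   # made-up values of a single attribute
expanded = [polynomial_features(x, 3) for x in values]

for row in expanded:
    print(row)
# [0.5, 0.25, 0.125]
# [1.0, 1.0, 1.0]
# [2.0, 4.0, 8.0]
```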

You can write a prompt similar to the following:

**Write a Python code that performs the following tasks.**
1. **Develops and trains multiple polynomial regression models, with orders 2, 3, and 5, that use one attribute of a data frame as the source variable and another as a target variable.**
2. **Calculate and display the MSE and R^2 values for the trained models.**
3. **Compare the performance of the models.**

Try to run the generated code on the testing interface with the source variable as `CPU_frequency` and the target variable as `Price`.

In [7]:
#import pandas as pd
#from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
#from sklearn.linear_model import LinearRegression
#from sklearn.metrics import mean_squared_error, r2_score

# Path to the CSV file (replace with actual path)
#csv_file_path = "path/to/your/file.csv"
# Names of the feature (source variable) and target column
feature_col = "CPU_frequency"  # replace with your actual feature column name
target_col = "Price"    # replace with your actual target column name

# Read CSV into a DataFrame with headers from the first row
#df = pd.read_csv(csv_file_path, header=0)
X = df[[feature_col]]
y = df[target_col]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = []
for degree in [2, 3, 5]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    y_pred = model.predict(X_test_poly)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    results.append({"degree": degree, "mse": mse, "r2": r2})

for r in results:
    print(f"Degree {r['degree']}: MSE={r['mse']:.6f}, R^2={r['r2']:.6f}")

best = max(results, key=lambda x: x['r2'])
print(f"Best degree: {best['degree']} with R^2={best['r2']:.6f} and MSE={best['mse']:.6f}")

Degree 2: MSE=196263.561458, R^2=0.148398
Degree 3: MSE=205918.030208, R^2=0.106507
Degree 5: MSE=207335.703610, R^2=0.100356
Best degree: 2 with R^2=0.148398 and MSE=196263.561458


## Creating a Pipeline

Pipelines are processes containing a sequence of steps that lead to creating a trained model.

You will now use the Generative AI model to create a pipeline for performing feature scaling, creating polynomial features for multiple attributes, and performing linear regression using these variables.
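The idea of chaining steps can be sketched in plain Python. In the illustration below, `Standardizer`, `SquareExpander`, and `run_steps` are hypothetical stand-ins written for this example, not scikit-learn classes, the numbers are made up, and the final regression step is omitted for brevity. The point is that each step learns its parameters from the training data only and then applies the same transformation to new data.

```python
from statistics import fmean, pstdev


class Standardizer:
    """Scale a column to zero mean and unit variance (population std)."""

    def fit(self, column):
        self.mean_ = fmean(column)
        self.std_ = pstdev(column)
        return self

    def transform(self, column):
        return [(v - self.mean_) / self.std_ for v in column]


class SquareExpander:
    """Turn each value x into the pair [x, x**2]; nothing to learn."""

    def fit(self, column):
        return self

    def transform(self, column):
        return [[v, v ** 2] for v in column]


def run_steps(steps, train, new):
    """Fit each step on the training data, then transform both sets."""
    for step in steps:
        step.fit(train)
        train, new = step.transform(train), step.transform(new)
    return train, new


train_ram = [4, 4, 12, 12]   # made-up training values
new_ram = [8, 16]            # made-up unseen values

train_out, new_out = run_steps([Standardizer(), SquareExpander()], train_ram, new_ram)
print(train_out)
print(new_out)
```

Here the unseen values are scaled with the mean and standard deviation learned from `train_ram`, not with their own statistics.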

You can build a prompt similar to the following:

**Write a Python code that performs the following tasks.**
1. **Create a pipeline that performs parameter scaling, Polynomial Feature generation, and Linear regression. Use the set of multiple features as before to create this pipeline.**
2. **Calculate and display the MSE and R^2 values for the trained model.**

Adjust the attributes used in the generated code as needed. Use the same set of attributes as for the multiple-feature linear regression: `CPU_frequency`, `RAM_GB`, `Storage_GB_SSD`, `CPU_core`, `OS`, `GPU` and `Category` as source variables, and `Price` as the target variable.

In [8]:
#import pandas as pd
#from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
#from sklearn.linear_model import LinearRegression
#from sklearn.metrics import mean_squared_error, r2_score

# Path to the CSV file (replace with your actual path)
#csv_file_path = "path/to/your/file.csv"

# Read CSV into a DataFrame with headers from the first row

#df = pd.read_csv(csv_file_path, header=0)

# Target column name and feature columns
target_col = "Price"  # replace with your actual target column name
#feature_cols = [col for col in df.columns if col != target_col]
feature_cols=["CPU_frequency","RAM_GB","Storage_GB_SSD","CPU_core","OS","GPU","Category"]

# Features and target
X = df[feature_cols]
y = df[target_col]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline: scale features, generate polynomial features, then fit linear regression
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression())
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display results
print(f"MSE: {mse}")
print(f"R^2: {r2}")

MSE: 241404.50736010206
R^2: -0.04747132496449025
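A negative R^2, as in the output above, means the model's squared error on the test set is larger than that of a baseline that always predicts the mean of the test targets. A small plain-Python illustration with made-up numbers follows. The helper `r2` is hypothetical and written out only for this example.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
baseline = [2.0, 2.0, 2.0]    # always predict the mean of y_true
bad_model = [3.0, 2.0, 1.0]   # predictions that trend the wrong way

print(r2(y_true, baseline))    # 0.0
print(r2(y_true, bad_model))   # -3.0
```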


## Grid search and Ridge regression

An improved way to train your model is to use ridge regression instead of plain linear regression, again with polynomial features of multiple attributes. Ridge regression adds a penalty on the size of the coefficients, and the strength of that penalty is set by the hyperparameter `alpha`. Using grid search, you can determine the best value of this hyperparameter for the given set of features from a list of candidates. Grid search scores each candidate with cross-validation and then refits the best one to produce the final model.
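As a small illustration of what `alpha` does, consider ridge regression with a single centered feature and no intercept. In that case the penalized least-squares solution has the closed form w = sum(x*y) / (sum(x^2) + alpha). The sketch below applies that formula to made-up data. The helper `ridge_slope` is hypothetical and not part of scikit-learn.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((y - w*x)**2) + alpha * w**2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


xs = [-2, -1, 0, 1, 2]   # centered, made-up feature values
ys = [-4, -2, 0, 2, 4]   # exactly y = 2 * x

for alpha in [0, 10, 90]:
    print(alpha, ridge_slope(xs, ys, alpha))
# 0 2.0
# 10 1.0
# 90 0.2
```

With `alpha = 0` the ordinary least-squares slope is recovered. Larger values shrink the coefficient toward zero, trading some fit on the training data for a less flexible model.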

You can use generative AI to create the Python code to perform a grid search for the optimum ridge regression model, which uses polynomial features generated from multiple parameters.

You may use the following prompt to create this:

**Write a Python code that performs the following tasks.**
1. **Use polynomial features for some of the attributes of a data frame.**
2. **Perform Grid search on a ridge regression model for a set of values of hyperparameter alpha and polynomial features as input.**
3. **Use cross-validation in the Grid search.**
4. **Evaluate the resulting model's MSE and R^2 values.**


Use the following parameter values for this purpose.

- Source Variables: `CPU_frequency`, `RAM_GB`, `Storage_GB_SSD`, `CPU_core`, `OS`, `GPU` and `Category`
- Target Variable: `Price`
- Set of values for alpha: `[0.0001,0.001,0.01, 0.1, 1, 10]`
- Cross Validation: `4-fold`
- Polynomial Feature order: `2`
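The mechanics of a grid search with k-fold cross-validation can be sketched without any library. The example below is an illustration under simplifying assumptions, not a description of what `GridSearchCV` does internally: it uses synthetic data, a single feature, a no-intercept ridge model, and contiguous unshuffled folds. The helpers `ridge_slope`, `k_fold_indices`, and `cv_mse` are hypothetical names introduced here.

```python
import random


def ridge_slope(xs, ys, alpha):
    """Closed-form ridge slope for one feature and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_mse(xs, ys, alpha, k):
    """Mean validation MSE over k folds for one value of alpha."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the other k-1 folds, score on the held-out fold
        w = ridge_slope([xs[i] for i in train], [ys[i] for i in train], alpha)
        scores.append(sum((ys[i] - w * xs[i]) ** 2 for i in fold) / len(fold))
    return sum(scores) / len(scores)


rng = random.Random(0)                               # fixed seed, arbitrary choice
xs = [rng.uniform(-1, 1) for _ in range(40)]         # synthetic feature
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]         # synthetic noisy target

alphas = [0.0001, 0.001, 0.01, 0.1, 1, 10]           # candidate values from the list above
scores = {alpha: cv_mse(xs, ys, alpha, k=4) for alpha in alphas}
best_alpha = min(scores, key=scores.get)

for alpha, score in scores.items():
    print(f"alpha={alpha}: mean CV MSE={score:.4f}")
print(f"Best alpha: {best_alpha}")
```

Each candidate is scored only on rows it was not fitted on, and the candidate with the lowest average validation error is selected.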

In [9]:
#import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
#from sklearn.pipeline import Pipeline
#from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
#from sklearn.metrics import mean_squared_error, r2_score

# Path to the CSV file (replace with actual path)
#csv_file_path = "path/to/your/file.csv"

# Read CSV into DataFrame with headers from the first row
#df = pd.read_csv(csv_file_path, header=0)

# Target column and feature columns
target_col = "Price"  # replace with your actual target column name
#feature_cols = [col for col in df.columns if col != target_col]
feature_cols=["CPU_frequency","RAM_GB","Storage_GB_SSD","CPU_core","OS","GPU","Category"]
X = df[feature_cols]
y = df[target_col]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline: polynomial features followed by Ridge regression
pipeline = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # degree is overridden by the grid below
    ("ridge", Ridge())
])

# Grid over polynomial degree and ridge alpha.
# Note: the parameter list above fixes the polynomial order at 2; this generated
# code also searches degrees 3 and 5.
param_grid = {
    "poly__degree": [2, 3, 5],
    "ridge__alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10]
}

# GridSearch with cross-validation
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=4, scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)

# Best estimator and evaluation on test set
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Best degree: {grid_search.best_params_['poly__degree']}")
print(f"Best alpha: {grid_search.best_params_['ridge__alpha']}")
print(f"Test MSE: {mse}")
print(f"Test R^2: {r2}")

Best degree: 2
Best alpha: 10
Test MSE: 240003.5674572872
Test R^2: -0.04139254709805984


