
# Test Environment for Generative AI classroom labs

This lab provides a test environment for the code generated in the Generative AI classroom.

Follow the instructions below to set up this environment for further use.


# Setup


### Install required libraries

If your task requires additional Python libraries, install them as shown below.


In [1]:
%pip install seaborn
import piplite

await piplite.install(['nbformat', 'plotly'])

### Dataset URL from the GenAI lab
Use the URL provided in the GenAI lab in the cell below. 


In [2]:
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod1.csv"

### Downloading the dataset

Execute the following code to download the dataset into the interface.

> Please note that this step is essential in JupyterLite. If you are running a downloaded copy of this notebook in JupyterLab, you can skip this step and pass the URL directly to the `pandas.read_csv()` function to read the dataset as a dataframe.


In [3]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

path = URL

await download(path, "dataset.csv")
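
The note above describes two ways to reach the data: a local file under JupyterLite, or the URL elsewhere. A minimal sketch of choosing between them is shown below. The helper name `dataset_source` and the placeholder URL are hypothetical and not part of the lab, and the check assumes that JupyterLite runs on Pyodide, where `sys.platform` is `"emscripten"`.

```python
import sys


def dataset_source(url, local_name="dataset.csv"):
    """Return the local file name under Pyodide (JupyterLite), else the URL.

    Hypothetical helper for illustration; it does not download anything.
    """
    running_in_pyodide = sys.platform == "emscripten"
    return local_name if running_in_pyodide else url


# Placeholder URL for illustration only; in the notebook, pass the URL variable.
source = dataset_source("https://example.com/laptops.csv")
print(source)
```

In the notebook this could be used as `df = pd.read_csv(dataset_source(URL))`. Under JupyterLite the download cell still has to run first so that `dataset.csv` exists.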

---


# Test Environment


In [20]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Ridge
%matplotlib inline

In [6]:
def read_csv_to_dataframe(file_path):
    """
    Reads a CSV file located at the provided file path into a Pandas DataFrame.
    
    Args:
    file_path (str): The path to the CSV file.
    
    Returns:
    pandas.DataFrame: The data read from the CSV file.
    """
    try:
        # Reading the CSV file into DataFrame
        df = pd.read_csv(file_path)
        return df
    except FileNotFoundError:
        print(f"The file at {file_path} was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")
    
    return None

# Example usage: read the dataset file downloaded in the setup step
file_path = "./dataset.csv"
df = read_csv_to_dataframe(file_path)

if df is not None:
    print(df.head())

   Unnamed: 0 Manufacturer  Category     Screen  GPU  OS  CPU_core  \
0           0         Acer         4  IPS Panel    2   1         5   
1           1         Dell         3    Full HD    1   1         3   
2           2         Dell         3    Full HD    1   1         7   
3           3         Dell         4  IPS Panel    2   1         5   
4           4           HP         4    Full HD    2   1         7   

   Screen_Size_cm  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_kg  Price  
0          35.560            1.6       8             256       1.60    978  
1          39.624            2.0       4             256       2.20    634  
2          39.624            2.7       8             256       2.20    946  
3          33.782            1.6       8             128       1.22   1244  
4          39.624            1.8       8             256       1.91    837  


In [11]:
# Generate ML Linear Regression Model:
# Check and display NaN values in the DataFrame
print(df.isnull().sum())

# Initialize the imputer to fill NaN values with median
imputer = SimpleImputer(strategy='median')

# Fit the imputer on the 'Weight_kg' column and transform the dataframe
df[['Weight_kg']] = imputer.fit_transform(df[['Weight_kg']])

# Select the source (predictor) and target (response) variables
X = df[['Weight_kg']]  # Using 'Weight_kg' as the predictor
y = df['Price']       # Using 'Price' as the response

# Splitting the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and display the Mean Squared Error (MSE) and R-squared (R²)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")

Unnamed: 0        0
Manufacturer      0
Category          0
Screen            0
GPU               0
OS                0
CPU_core          0
Screen_Size_cm    4
CPU_frequency     0
RAM_GB            0
Storage_GB_SSD    0
Weight_kg         5
Price             0
dtype: int64
Mean Squared Error (MSE): 269971.1773376724
R-squared (R²): -0.17142413752152041
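
A negative R² can look like a bug, so a short worked example may help. R² is computed as 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the test targets. It is 0 for a model that always predicts that mean and drops below 0 when the predictions are worse than that. The sketch below uses made-up prices (not values from the dataset) and a hypothetical helper `r_squared`.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Compute R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot


# Made-up prices for illustration only (not taken from the dataset)
y_true = np.array([600.0, 800.0, 1000.0, 1200.0])

mean_pred = np.full_like(y_true, y_true.mean())       # always predict the mean
off_pred = np.array([1200.0, 1000.0, 800.0, 600.0])   # trend reversed

r2_mean = r_squared(y_true, mean_pred)
r2_off = r_squared(y_true, off_pred)
print(r2_mean, r2_off)
```

Read this way, the R² of about -0.17 above says that `Weight_kg` alone predicts the test prices slightly worse than a constant equal to their mean would.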


In [12]:
# Generate ML Linear Regression Model:
# Select source (predictor) and target (response) variables
X = df[['CPU_frequency']]  # Using 'CPU_frequency' as the predictor
y = df['Price']             # Using 'Price' as the response

# Splitting the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and display the Mean Squared Error (MSE) and R-squared (R²)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")

Mean Squared Error (MSE): 239035.99429436037
R-squared (R²): -0.03719417833496452


In [13]:
# Generate ML Linear Regression Model with multiple features for X:
# Select source (predictor) variables and target (response) variable
X = df[['CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'CPU_core', 'OS', 'GPU', 'Category']]
y = df['Price']  # Using 'Price' as the response variable

# One-hot encode any non-numeric columns (get_dummies leaves numeric columns unchanged)
X = pd.get_dummies(X, drop_first=True)

# Splitting the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and display the Mean Squared Error (MSE) and R-squared (R²)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")

Mean Squared Error (MSE): 168575.62043820196
R-squared (R²): 0.26853839463024776


### The model performance is "better" when we use multiple features: R² rises to about 0.27, from about -0.17 and -0.04 for the single-feature models.
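
One detail of the cell above is worth checking. The `df.head()` preview shows `OS`, `GPU`, and `Category` as integer codes, and `pd.get_dummies` by default converts only object, string, and category columns. The sketch below uses a few rows copied from that preview, with the hypothetical names `X_demo`, `unchanged`, and `encoded`, to show the difference that the `columns` argument makes.

```python
import pandas as pd

# A few rows copied from the df.head() preview above
X_demo = pd.DataFrame({
    "CPU_frequency": [1.6, 2.0, 2.7, 1.6, 1.8],
    "OS": [1, 1, 1, 1, 1],
    "GPU": [2, 1, 1, 2, 2],
    "Category": [4, 3, 3, 4, 4],
})

# Default call: every column is numeric, so nothing is encoded
unchanged = pd.get_dummies(X_demo, drop_first=True)

# Naming the columns forces one-hot encoding of the integer codes.
# OS has a single level in this tiny sample, so drop_first leaves no OS column.
encoded = pd.get_dummies(X_demo, columns=["OS", "GPU", "Category"], drop_first=True)

print(list(unchanged.columns))
print(list(encoded.columns))
```

Whether these codes should be treated as categories or as ordered numbers is a modelling choice. The sketch only shows how to request the encoding explicitly.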

In [15]:
# Generate ML Polynomial Regression Model:
# Select source (predictor) and target (response) variables
X = df[['CPU_frequency']]
y = df['Price']

# Splitting the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define polynomial feature degrees
degrees = [2, 3, 5]

# Initialize lists to store results
mse_scores = []
r2_scores = []

# Train and evaluate polynomial regression models for different degrees
for degree in degrees:
    # Generate polynomial features up to the given degree (no bias column)
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly_train = poly.fit_transform(X_train)
    X_poly_test = poly.transform(X_test)
    
    # Initialize and train Linear Regression model
    model = LinearRegression()
    model.fit(X_poly_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_poly_test)

    # Calculate and store MSE and R²
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)

    print(f"Polynomial Degree: {degree}")
    print(f"Mean Squared Error (MSE): {mse}")
    print(f"R-squared (R²): {r2}")
    print("\n")

# Compare performance
for i in range(len(degrees)):
    print(f"Model Degree {degrees[i]}: MSE = {mse_scores[i]}, R² = {r2_scores[i]}")

Polynomial Degree: 2
Mean Squared Error (MSE): 196263.5614577202
R-squared (R²): 0.14839844951318837


Polynomial Degree: 3
Mean Squared Error (MSE): 205918.03020812548
R-squared (R²): 0.10650702302573634


Polynomial Degree: 5
Mean Squared Error (MSE): 207335.70360838601
R-squared (R²): 0.1003556373238833


Model Degree 2: MSE = 196263.5614577202, R² = 0.14839844951318837
Model Degree 3: MSE = 205918.03020812548, R² = 0.10650702302573634
Model Degree 5: MSE = 207335.70360838601, R² = 0.1003556373238833


In [17]:
# Generate ML Linear Regression Model with pipeline:
# Select source (predictor) variables and target (response) variable
X = df[['CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'CPU_core', 'OS', 'GPU', 'Category']]
y = df['Price']

# Splitting the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing steps
numeric_features = ['CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'CPU_core']
numeric_transformer = StandardScaler()

categorical_features = ['OS', 'GPU', 'Category']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Bundle preprocessing steps and estimator into a single pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Train the model using the pipeline
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Calculate and display the Mean Squared Error (MSE) and R-squared (R²)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")

Mean Squared Error (MSE): 137926.64583333334
R-squared (R²): 0.40152647504862826


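**With the preprocessing pipeline in place, the model performs better than the earlier multi-feature model: R² rises from about 0.27 to about 0.40. Standardizing features does not by itself change ordinary least squares predictions, so the gain most likely comes from one-hot encoding `OS`, `GPU`, and `Category`.**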

In [21]:
# Generate ML Ridge Regression Model with GridSearchCV:
# Select source (predictor) variables and target (response) variable
X = df[['CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'CPU_core', 'OS', 'GPU', 'Category']]
y = df['Price']

# Splitting the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define polynomial feature generator
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_transformed = poly.fit_transform(X_train)
X_test_transformed = poly.transform(X_test)

# Define the Ridge Regression model and grid for hyperparameter 'alpha'
ridge = Ridge()
param_grid = {'ridge__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10]}

# Setup GridSearchCV
grid_search = GridSearchCV(Pipeline(steps=[('poly', poly), ('ridge', ridge)]),
                           param_grid, cv=4, scoring='neg_mean_squared_error')

# Fit the model
grid_search.fit(X_train_transformed, y_train)

# Get the best model and parameters
best_model = grid_search.best_estimator_
alpha_best = grid_search.best_params_['ridge__alpha']
print(f"Best Alpha from Grid Search: {alpha_best}")

# Predict using the best model
y_pred = best_model.predict(X_test_transformed)

# Calculate and display MSE and R²
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE) with best model: {mse}")
print(f"R-squared (R²) with best model: {r2}")



Best Alpha from Grid Search: 0.0001
Mean Squared Error (MSE) with best model: 2508725.0008351263
R-squared (R²) with best model: -9.88553618709533
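
A likely contributor to this poor score: the cell expands the features with `poly` before the search, and the pipeline inside `GridSearchCV` contains `poly` again, so the degree 2 expansion is applied twice. The sketch below uses synthetic data with seven columns (an assumption that mirrors the seven source variables, not the lab dataset) to count the resulting features and to show the search fitted on the raw features, so the pipeline expands them once. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with seven predictors (not the lab dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 7))
y_demo = X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Count features after one expansion and after two stacked expansions
once = PolynomialFeatures(degree=2, include_bias=False).fit(X_tr)
twice = PolynomialFeatures(degree=2, include_bias=False).fit(once.transform(X_tr))
n_once, n_twice = once.n_output_features_, twice.n_output_features_
print(n_once, n_twice)

# Let the pipeline do the expansion: fit and predict on the raw features
pipe_demo = Pipeline(steps=[
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("ridge", Ridge()),
])
grid_demo = GridSearchCV(
    pipe_demo,
    {"ridge__alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10]},
    cv=4,
    scoring="neg_mean_squared_error",
)
grid_demo.fit(X_tr, y_tr)
y_hat_demo = grid_demo.predict(X_te)
```

Seven inputs become 35 features after one expansion and 665 after two, which is far more than a small training set can support. In the notebook, the equivalent change would be to call `grid_search.fit(X_train, y_train)` and `best_model.predict(X_test)`. The scores that would result are not shown here because they have not been run.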


### My comments:
----
#### Practicing with Gen AI:
In this script, we practice writing generative AI prompts so that the model generates code that fits our needs.
In this example, we use it for ML model development on a given dataset to make the process quicker and more efficient. We try to get insights with statistical descriptions, correlation analysis, and some plotting using Gen AI prompts.

The AI-generated code is based on specific prompts that I wrote, following the instructions specified in the lab:

**Examples of the prompts that were given to the AI (IBM Granite 3.2 8B (Reasoning)):**

**prompt 1:**

```
Write a Python code that can perform the following tasks.
Read the CSV file, located on a given file path, into a pandas data frame, assuming that the first row of the file can be used as the headers for the data.
```

**prompt 2:** 
```
Write a Python code that performs the following tasks.
1. Develops and trains a linear regression model that uses one attribute of a data frame as the source variable and another as a target variable.
2. Calculate and display the MSE and R^2 values for the trained model
```
This produced a model with `Weight_kg` as the source variable `X` and `Price` as the target variable `y`.

**prompt 3:**
```
Write a Python code that performs the following tasks.
1. Develops and trains a linear regression model that uses one attribute of a data frame as the source variable 'CPU_frequency' and another as a target variable 'Price'.
2. Calculate and display the MSE and R^2 values for the trained model
```
**prompt 4:**
```
Write a Python code that performs the following tasks.
1. Develops and trains a linear regression model that uses some attributes of a data frame as the source variables ("CPU_frequency, RAM_GB, Storage_GB_SSD, CPU_core, OS, GPU and Category") and one of the attributes as a target variable ("Price").
2. Calculate and display the MSE and R^2 values for the trained model.
```
#### The model performance is "better" when we use multiple features than in the previous prompts, where we used a single feature for `X`.

**prompt 5:**
```
Write a Python code that performs the following tasks.
1. Develops and trains multiple polynomial regression models, with orders 2, 3, and 5, that use one attribute of a data frame as the source variable ('CPU_frequency') and another as a target variable ('Price').
2. Calculate and display the MSE and R^2 values for the trained models.
3. Compare the performance of the models.
```
#### Comparing the results of prompt 3 and prompt 5:
**Note: prompt 3 and prompt 5 both use `CPU_frequency` as the source variable and `Price` as the target variable, but prompt 3 fits a linear regression model, whereas prompt 5 fits polynomial regression models.**

> The polynomial regression model generated by prompt 5 with degree = 2 is the 'best' of the `CPU_frequency`-only models tested here (R² ≈ 0.15, compared with about -0.04 for the linear model and about 0.11 and 0.10 for degrees 3 and 5).

**prompt 6:**
```
Write a Python code that performs the following tasks.
1. Create a pipeline that performs parameter scaling, Polynomial Feature generation, and Linear regression. Use the set of multiple features as before to create this pipeline. Source variables are: 'CPU_frequency, RAM_GB, Storage_GB_SSD, CPU_core, OS, GPU and Category' and target variable 'Price'.
2. Calculate and display the MSE and R^2 values for the trained model.
```
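
The code generated for this prompt (see the Test Environment section) scales and one-hot encodes the features but has no polynomial step. For comparison, here is a minimal sketch of a pipeline with all three stages named in the prompt. It runs on synthetic data, and the step names `scale`, `poly`, and `regressor` as well as the `_syn` variables are assumptions made for this illustration, not output from the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with an exactly quadratic target (not the lab dataset)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0.0, 1.0, size=(100, 2))
y_syn = 3.0 * X_syn[:, 0] ** 2 + 2.0 * X_syn[:, 0] * X_syn[:, 1] - X_syn[:, 1] + 5.0

# Scaling, polynomial feature generation, then linear regression
pipe_syn = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("regressor", LinearRegression()),
])

# Train on the first 80 rows, evaluate on the remaining 20
pipe_syn.fit(X_syn[:80], y_syn[:80])
y_hat_syn = pipe_syn.predict(X_syn[80:])

mse_syn = mean_squared_error(y_syn[80:], y_hat_syn)
r2_syn = r2_score(y_syn[80:], y_hat_syn)
print(f"MSE: {mse_syn}, R²: {r2_syn}")
```

Because the synthetic target is exactly quadratic, the degree 2 pipeline fits it almost perfectly. On the laptop data the same structure would need the column-wise preprocessing from the generated code, and its scores would have to be measured rather than assumed.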

**prompt 7:**
```
Write a Python code that performs the following tasks.
1. Use polynomial features for some of the attributes of a data frame. Source Variables: CPU_frequency, RAM_GB, Storage_GB_SSD, CPU_core, OS, GPU and Category. Target variable is Price.
2. Perform Grid search on a ridge regression model for a set of values of hyperparameter alpha and polynomial features as input.Set of values for alpha: 0.0001,0.001,0.01, 0.1, 1, 10
Cross Validation: 4-fold
Polynomial Feature order: 2
3. Use cross-validation in the Grid search.
4. Evaluate the resulting model's MSE and R^2 values.
```
#### Overall, the generated code does the job well enough for the tasks required in this lab, although the result for prompt 7 (R² ≈ -9.89) shows that it still needs to be checked.

----

## Authors


[Abhishek Gagneja](https://www.linkedin.com/in/abhishek-gagneja-23051987/)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-12-10|0.1|Abhishek Gagneja|Initial Draft created|


Copyright © 2023 IBM Corporation. All rights reserved.
