<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo">
    </a>
</p>


# Test Environment for Generative AI classroom labs

This lab provides a test environment for the code generated in the Generative AI classroom.

Follow the instructions below to set up this environment for further use.


In this lab, you will learn how to use generative AI to create Python code that can:

1. Use `linear regression` in **one variable** to fit the parameters to a model
2. Use `linear regression` in **multiple variables** to fit the parameters to a model
3. Use `polynomial regression` in a **single variable** to fit the parameters to a model
4. Create a pipeline that scales multiple features, generates polynomial features from them, and performs `linear regression`
5. Use **grid search** with `cross-validation` and `ridge regression` to create a model with optimum hyperparameters

# Setup


### Install required libraries

If your task requires additional Python libraries, you can install them as shown below.


In [1]:
%pip install seaborn
import piplite

await piplite.install(['nbformat', 'plotly'])

### Dataset URL from the GenAI lab
Use the URL provided in the GenAI lab in the cell below. 


In [2]:
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod2.csv"

### Downloading the dataset

Execute the following code to download the dataset into the interface.

> Please note that this step is essential in JupyterLite. If you are using a downloaded version of this notebook and running it in JupyterLab, you can skip this step and pass the URL directly to the `pandas.read_csv()` function to read the dataset as a dataframe.
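
As a minimal sketch of the direct-read alternative mentioned in the note, `pandas.read_csv()` accepts a URL, a local path, or a file-like object. The helper name `load_dataset` and the two-row in-memory sample are assumptions made for illustration, so that the example runs without network access.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read a CSV into a DataFrame (hypothetical helper).

    `source` may be a URL, a local file path, or a file-like object.
    """
    return pd.read_csv(source)


# Outside JupyterLite one would typically pass the URL itself:
#     df = load_dataset(URL)
# A tiny in-memory CSV stands in for the remote file so this runs offline.
sample_csv = io.StringIO(
    "Manufacturer,CPU_frequency,Price\n"
    "Acer,0.551724,978\n"
    "Dell,0.689655,634\n"
)
df_sample = load_dataset(sample_csv)
print(df_sample.shape)
```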


In [3]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

path = URL

await download(path, "dataset.csv")

---


In [4]:
import pandas as pd
df = pd.read_csv('dataset.csv')



In [5]:
df.head()

|Unnamed: 0.2|Unnamed: 0.1|Unnamed: 0|Manufacturer|Category|GPU|OS|CPU_core|Screen_Size_inch|CPU_frequency|RAM_GB|Storage_GB_SSD|Weight_pounds|Price|Price-binned|Screen-Full_HD|Screen-IPS_panel|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|0|Acer|4|2|1|5|14.0|0.551724|8|256|3.528|978|Low|0|1|
|1|1|1|Dell|3|1|1|3|15.6|0.689655|4|256|4.851|634|Low|1|0|
|2|2|2|Dell|3|1|1|7|15.6|0.931034|8|256|4.851|946|Low|1|0|
|3|3|3|Dell|4|2|1|5|13.3|0.551724|8|128|2.6901|1244|Low|0|1|
|4|4|4|HP|4|2|1|7|15.6|0.62069|8|256|4.21155|837|Low|1|0|


### Linear regression in one variable

**AI prompt 1:**  
Write a Python code that performs the following tasks.
1. Develops and trains a linear regression model that uses one attribute of a data frame as the source variable and another as a target variable.
2. Calculate and display the MSE and R^2 values for the trained model
  
**Result**  
Sure! Here is a Python code that performs the tasks you mentioned:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load your data into a pandas DataFrame
data = pd.read_csv('your_data.csv')

# Define your source and target variables
X = data['source_variable'].values.reshape(-1, 1)
y = data['target_variable'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate MSE and R^2 values
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)
```

This code will load your data, split it into training and testing sets, train a linear regression model using the source variable as the input and the target variable as the output, and then calculate and display the Mean Squared Error (MSE) and R^2 values for the trained model.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [7]:
X = df[['CPU_frequency']]
y = df[['Price']]
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=.2,
                                                   random_state=14)

In [8]:
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [9]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

Mean Squared Error: 377282.07458982826
R2 Score: 0.14979264117084223
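
To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy. MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares, so an R² of about 0.15 means the model explains roughly 15% of the variance in the test-set prices. The function name `mse_and_r2` and the four sample values are assumptions chosen for illustration.

```python
import numpy as np


def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) for two equal-length sequences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # Mean of the squared residuals
    mse = np.mean(residuals ** 2)
    # Residual and total sums of squares
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2


# Illustrative values only, not taken from the laptop dataset
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
mse, r2 = mse_and_r2(y_true, y_pred)
print("MSE:", mse)
print("R2:", r2)
```

For these values the squared residuals are 1, 0, 1, 0, giving an MSE of 0.5, and the total sum of squares is 20, giving an R² of 0.9.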


### Linear Regression in Multiple Variables

**AI prompt 2:**  
Write a Python code that performs the following tasks.
1. Develops and trains a linear regression model that uses some attributes of a data frame as the source variables and one of the attributes as a target variable.
2. Calculate and display the MSE and R^2 values for the trained model.
  
**Result**  
Certainly! Here is a Python code that accomplishes the tasks you specified:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load your data into a pandas DataFrame
data = pd.read_csv('your_data.csv')

# Define your source variables and target variable
X = data[['source_variable1', 'source_variable2']].values
y = data['target_variable'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate MSE and R^2 values
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)
```

This code will load your data, split it into training and testing sets, train a linear regression model using multiple source variables as inputs and one target variable as the output, and then calculate and display the Mean Squared Error (MSE) and R^2 values for the trained model.

In [10]:
X = df[["CPU_frequency", "RAM_GB", "Storage_GB_SSD",
        "CPU_core", "OS", "GPU", "Category"]]
y = df[['Price']]
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.8,
                                                    random_state=43)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((47, 7), (191, 7), (47, 1), (191, 1))

In [11]:
lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
print("Mean Square Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))

Mean Square Error: 200688.32777900505
R2 Score: 0.3566912736946105


### Polynomial regression

You can explore creating a model that uses higher-order features derived from the original attributes. Higher-order terms of the same feature let the model capture non-linear relationships with the target variable, although very high orders can overfit the training data. This is called polynomial regression, and you can use generative AI to create the code for it.

Assume you are given a single attribute as the source variable and one as a target variable. You must create a model using polynomial regression for a given order. You can also make the model for different order values and compare their performance based on MSE and $R^2$ scores.

**AI prompt 3:**  
Write a Python code that performs the following tasks.
1. Develops and trains multiple polynomial regression models, with orders 2, 3, and 5, that use one attribute of a data frame as the source variable and another as a target variable.
2. Calculate and display the MSE and R^2 values for the trained models.
3. Compare the performance of the models.
  
**Result**  

Certainly! Here is a Python code that meets your requirements:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

# Load your data into a pandas DataFrame
data = pd.read_csv('your_data.csv')

# Define your source and target variables
X = data['source_variable'].values.reshape(-1, 1)
y = data['target_variable'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train polynomial regression models with orders 2, 3, and 5
orders = [2, 3, 5]
for order in orders:
    poly_features = PolynomialFeatures(degree=order)
    X_poly_train = poly_features.fit_transform(X_train)
    X_poly_test = poly_features.transform(X_test)
    
    model = LinearRegression()
    model.fit(X_poly_train, y_train)
    
    y_pred = model.predict(X_poly_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f"Polynomial Regression Model (Order {order}):")
    print("Mean Squared Error:", mse)
    print("R^2 Score:", r2)
    print()

# Compare the performance of the models based on MSE and R^2 values
# The model with the lowest MSE and highest R^2 score is considered the best performing model
```

This code will load your data, split it into training and testing sets, train polynomial regression models with orders 2, 3, and 5 using the source variable as input and the target variable as output, calculate and display the Mean Squared Error (MSE) and R^2 values for each model, and then you can compare the performance of the models based on these values.

In [12]:
from sklearn.preprocessing import PolynomialFeatures
X = df[['CPU_frequency']]
y = df[['Price']]
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.2,
                                                    random_state=24)

In [13]:
x_train.head()

|Unnamed: 0|CPU_frequency|
|-|-|
|209|0.931034|
|220|0.931034|
|150|0.862069|
|203|0.965517|
|226|0.793103|


In [14]:
import numpy as np

orders = [2,3,5]
mses = []
r2s = []
for order in orders:
    pf = PolynomialFeatures(degree=order)
    x_pf_train = pf.fit_transform(x_train)
    x_pf_test = pf.transform(x_test)
    
    lr = LinearRegression()
    lr.fit(x_pf_train, y_train)
    y_pred = lr.predict(x_pf_test)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mses.append(mse)
    r2s.append(r2)
    
    print(f"{order} order\n*******")
    print("Mean Square Error:", mse)
    print("R2 Score:", r2)
    print()

# print(mses)
# print(r2s)
# Look up the order in the list directly instead of deriving it from the index
print("The best polynomial order:", orders[np.argmin(mses)])
print("The worst polynomial order:", orders[np.argmin(r2s)])

2 order
*******
Mean Square Error: 289080.97135422117
R2 Score: 0.3744275363323871

3 order
*******
Mean Square Error: 265526.22990816
R2 Score: 0.42540009799370293

5 order
*******
Mean Square Error: 228955.85281412964
R2 Score: 0.5045385511018243

The best polynomial order: 5
The worst polynomial order: 2


### Creating a Pipeline

A pipeline chains a sequence of processing steps, ending in an estimator, so that the whole chain can be fitted and used for prediction as a single model.

You will now use the Generative AI model to create a pipeline for performing feature scaling, creating polynomial features for multiple attributes, and performing linear regression using these variables.
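
A pipeline fitted on several features needs a train/test split made from that same feature matrix. The sketch below illustrates the full sequence on synthetic data. The arrays `X_demo` and `y_demo`, the step names `scale`, `poly`, and `model`, and the random seeds are assumptions made for illustration and are not part of the lab dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: three features and a target that is quadratic in them
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (
    2.0 * X_demo[:, 0]
    + X_demo[:, 1] ** 2
    - X_demo[:, 0] * X_demo[:, 2]
    + rng.normal(scale=0.1, size=200)
)

# Split the same matrix that the pipeline will be fitted on
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scaling, polynomial expansion, and regression run in order on fit/predict
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)
y_hat = pipe.predict(X_te)

print("Mean Squared Error:", mean_squared_error(y_te, y_hat))
print("R2 Score:", r2_score(y_te, y_hat))
```

Because the scaler and the polynomial transformer are fitted only inside `pipe.fit`, the test rows never influence the scaling statistics.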

**AI prompt 4:**  
Write a Python code that performs the following tasks.
1. Create a pipeline that performs parameter scaling, Polynomial Feature generation, and Linear regression. Use the set of multiple features as before to create this pipeline.
2. Calculate and display the MSE and R^2 values for the trained model.
  
**Result**  
Sure! Here is the Python code that performs the tasks you mentioned:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Assuming X_train and y_train are your training data
# Assuming X_test and y_test are your test data

# Create a pipeline
pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    LinearRegression()
)

# Fit the pipeline on training data
pipeline.fit(X_train, y_train)

# Predict on test data
y_pred = pipeline.predict(X_test)

# Calculate MSE and R^2 values
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)
```

This code creates a pipeline with parameter scaling, polynomial feature generation, and linear regression. It then fits the pipeline on the training data, predicts on the test data, and calculates the MSE and R^2 values for the trained model.

In [15]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [19]:
X = df[["CPU_frequency", "RAM_GB", "Storage_GB_SSD",
        "CPU_core", "OS", "GPU", "Category"]]
y = df[['Price']]

In [20]:
pipeline = make_pipeline(StandardScaler(), 
                         PolynomialFeatures(), 
                         LinearRegression())
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)

In [21]:
print("Mean Square Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))

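Mean Square Error: 289080.9713542206
R2 Score: 0.37442753633238834

> Note that `X` and `y` are redefined above but not split again, so the pipeline is fitted on the `x_train` and `y_train` left over from the polynomial regression section, which contain only `CPU_frequency`. This is why the scores match the order-2 polynomial model. To evaluate the pipeline on all seven features, call `train_test_split(X, y, ...)` again before fitting.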


### Grid search and Ridge regression

An improved way to train your model is to use ridge regression instead of linear regression. Ridge regression adds a penalty on the size of the coefficients, and the strength of that penalty is controlled by the hyperparameter alpha. You can use the polynomial features of multiple attributes as input. Using grid search, you can determine the best value of alpha from a set of candidates for the given features. Grid search scores each candidate with cross-validation and then refits the model with the best-scoring value.

You can use generative AI to create the Python code to perform a grid search for the optimum ridge regression model, which uses polynomial features generated from multiple parameters.
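
As a hedged illustration of the idea, the sketch below runs a grid search over both the polynomial degree and alpha by wrapping the steps in a pipeline, so that each cross-validation fold fits its own scaler. The synthetic arrays `X_demo` and `y_demo`, the candidate values, and the seed are assumptions for illustration. The parameter names follow `make_pipeline`'s convention of lowercased class names joined to the parameter with a double underscore.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic dependence on the second feature
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 2))
y_demo = 1.5 * X_demo[:, 0] - X_demo[:, 1] ** 2 + rng.normal(scale=0.2, size=120)

# Steps are named automatically: polynomialfeatures, standardscaler, ridge
pipe = make_pipeline(
    PolynomialFeatures(include_bias=False),
    StandardScaler(),
    Ridge(),
)

# 3 degrees x 4 alphas = 12 candidates, each scored with 5-fold CV
param_grid = {
    "polynomialfeatures__degree": [1, 2, 3],
    "ridge__alpha": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated MSE:", -search.best_score_)
```

The scoring string `neg_mean_squared_error` returns negated MSE because `GridSearchCV` always maximizes its score, which is why the sign is flipped when printing.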

**AI prompt 5:**  
Write a Python code that performs the following tasks.
1. Use polynomial features for some of the attributes of a data frame.
2. Perform Grid search on a ridge regression model for a set of values of hyperparameter alpha and polynomial features as input.
3. Use cross-validation in the Grid search.
4. Evaluate the resulting model's MSE and R^2 values.
  
**Result**  
Sure! Here is the Python code that performs the tasks you mentioned:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Assuming df is your dataframe with the data

# Select the attributes for which you want to create polynomial features
selected_attributes = ['attribute1', 'attribute2']

# Create polynomial features for selected attributes
poly = PolynomialFeatures(degree=2)
df_poly = poly.fit_transform(df[selected_attributes])

# Define the Ridge regression model
ridge = Ridge()

# Define the hyperparameters for Grid search
param_grid = {
    'alpha': [0.1, 1, 10],
}

# Perform Grid search with cross-validation
grid_search = GridSearchCV(estimator=ridge, param_grid=param_grid, cv=5)
grid_search.fit(df_poly, df['target_variable'])

# Evaluate the resulting model
best_model = grid_search.best_estimator_
mse = -cross_val_score(best_model, df_poly, df['target_variable'], scoring='neg_mean_squared_error', cv=5)
r2 = cross_val_score(best_model, df_poly, df['target_variable'], scoring='r2', cv=5)

print("Mean Squared Error:", mse.mean())
print("R^2 Score:", r2.mean())
```

This code creates polynomial features for selected attributes of a dataframe, performs Grid search on a Ridge regression model with specified hyperparameters and polynomial features, uses cross-validation in the Grid search, and evaluates the resulting model by calculating the MSE and R^2 values.

In [22]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import Ridge

In [36]:
X = df[["CPU_frequency", "RAM_GB", "Storage_GB_SSD",
        "CPU_core", "OS", "GPU", "Category"]]
y = df[['Price']]

h_paras = {'alpha':[0.0001,0.001,0.01, 0.1, 1, 10]}

In [43]:
pf = PolynomialFeatures(degree=2)
ridge = Ridge()
X_pf = pf.fit_transform(X)

In [44]:
grid_search = GridSearchCV(ridge, h_paras, cv=5)
grid_search.fit(X_pf, y)

In [45]:
best_model = grid_search.best_estimator_
best_model

In [41]:
y_pred = grid_search.predict(X_pf)

In [46]:
mse = -cross_val_score(best_model, X_pf, y, 
                       scoring='neg_mean_squared_error', 
                       cv=5)
r2 = cross_val_score(best_model, X_pf, y, 
                     scoring='r2', 
                     cv=5)

In [48]:
print("Mean Square Error:", mse.mean())
print("R2 Score:", r2.mean())

Mean Square Error: 172816.64569997336
R2 Score: 0.2789678362300345


**Result from ChatGPT**  

Here is a Python code that performs the tasks you specified. It uses polynomial features for some of the attributes in a data frame, performs grid search on a Ridge regression model, includes cross-validation, and evaluates the resulting model's MSE (Mean Squared Error) and R² values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Create a sample dataframe or load your dataset
# Assume `df` is your DataFrame containing the data

# Example: A dataframe with some attributes
data = {
    'CPU_frequency': [2.3, 3.0, 2.8, 2.5, 3.2],
    'RAM_GB': [8, 16, 8, 32, 16],
    'Storage_GB_SSD': [256, 512, 128, 1024, 512],
    'CPU_core': [4, 6, 4, 8, 6],
    'Price': [1000, 1500, 1100, 1800, 1600]
}

df = pd.DataFrame(data)

# Step 2: Define features and target
X = df[['CPU_frequency', 'RAM_GB', 'Storage_GB_SSD', 'CPU_core']]  # Features
y = df['Price']  # Target

# Step 3: Create polynomial features for some attributes
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Step 4: Standardize the data (especially important for Ridge regression)
scaler = StandardScaler()
X_poly_scaled = scaler.fit_transform(X_poly)

# Step 5: Define the Ridge regression model
ridge = Ridge()

# Step 6: Perform Grid Search on Ridge regression model
# Define hyperparameter grid for Ridge model
h_param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10]}

# Perform Grid Search with Cross-validation (5 folds)
grid_search = GridSearchCV(estimator=ridge, param_grid=h_param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_poly_scaled, y)

# Step 7: Get the best hyperparameters from the grid search
print("Best alpha parameter: ", grid_search.best_params_['alpha'])

# Step 8: Evaluate the model with Cross-Validation
# Perform cross-validated predictions
y_pred = cross_val_predict(grid_search.best_estimator_, X_poly_scaled, y, cv=5)

# Step 9: Evaluate model performance
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R^2 Score: {r2}")
```
##### Explanation:
1. **Polynomial Features**: The `PolynomialFeatures()` function is used to generate polynomial features for the attributes. In this case, degree 2 features are created.
2. **Grid Search**: The `GridSearchCV()` is used to search for the best alpha parameter for Ridge regression using cross-validation (5-fold).
3. **Cross-Validation**: Cross-validation is performed using the `cv=5` argument in the `GridSearchCV()` and also through `cross_val_predict()`.
4. **Model Evaluation**: The resulting model's performance is evaluated using Mean Squared Error (MSE) and R² score to measure how well the model predicts.  
Let me know if you want further details or adjustments!

In [50]:
from sklearn.model_selection import GridSearchCV, cross_val_predict

# Define features and target
X = df[['CPU_frequency', 'RAM_GB', 
        'Storage_GB_SSD', 'CPU_core']]
y = df['Price']

# Create polynomial features for some attributes
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
display(X_poly[0])

# Standardize the data (especially important for Ridge regression)
scaler = StandardScaler()
X_poly_scaled = scaler.fit_transform(X_poly)
display(X_poly_scaled[0])

# Define the Ridge regression model
ridge = Ridge()

# Perform Grid Search on Ridge regression model
# Define hyperparameter grid for Ridge model
h_param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10]}

# Perform Grid Search with Cross-validation (5 folds)
grid_search = GridSearchCV(estimator=ridge, 
                           param_grid=h_param_grid, 
                           cv=5, 
                           scoring='neg_mean_squared_error')
grid_search.fit(X_poly_scaled, y)

# Get the best hyperparameters from the grid search
print("Best alpha parameter: ", grid_search.best_params_['alpha'])

# Evaluate the model with Cross-Validation
# Perform cross-validated predictions
y_pred = cross_val_predict(grid_search.best_estimator_, 
                           X_poly_scaled, y, 
                           cv=5)

# Evaluate model performance
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R^2 Score: {r2}")

array([5.51724138e-01, 8.00000000e+00, 2.56000000e+02, 5.00000000e+00,
       3.04399524e-01, 4.41379310e+00, 1.41241379e+02, 2.75862069e+00,
       6.40000000e+01, 2.04800000e+03, 4.00000000e+01, 6.55360000e+04,
       1.28000000e+03, 2.50000000e+01])

array([-1.85148175,  0.04748847,  0.29454709, -0.50860609, -1.77049832,
       -0.77561347, -1.28562575, -1.28153227, -0.08510008,  0.11290304,
       -0.27942649,  0.29454709, -0.31220354, -0.60776085])

Best alpha parameter:  1
Mean Squared Error (MSE): 205189.8615745415
R^2 Score: 0.3759180402499378


## Authors


[Abhishek Gagneja](https://www.linkedin.com/in/abhishek-gagneja-23051987/)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-12-10|0.1|Abhishek Gagneja|Initial Draft created|


Copyright © 2023 IBM Corporation. All rights reserved.
