<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo">
    </a>
</p>


# Test Environment for Generative AI classroom labs

This lab provides a test environment for the code generated in the Generative AI classroom.

Follow the instructions below to set up this environment for further use.


# Setup


### Install required libraries

If your task requires additional Python libraries, you can install them as shown below.


In [1]:
%pip install seaborn
import piplite

await piplite.install(['nbformat', 'plotly'])

### Dataset URL from the GenAI lab
Use the URL provided in the GenAI lab in the cell below. 


In [2]:
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod1.csv"

### Downloading the dataset

Execute the following code to download the dataset into the interface.

> Please note that this step is essential in JupyterLite. If you are using a downloaded version of this notebook and running it in JupyterLab, you can skip this step and pass the URL directly to the `pandas.read_csv()` function to read the dataset as a data frame.


In [3]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

path = URL

await download(path, "dataset.csv")
file_name  = "dataset.csv"
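
As the note above mentions, the download step can be skipped outside JupyterLite. The following is a minimal sketch of that route and is not part of the original lab: `read_dataset` is a hypothetical helper name, and the two-row sample stands in for the real file so that the snippet runs without network access.

```python
import io

import pandas as pd


def read_dataset(source):
    """Read a CSV into a DataFrame; `source` may be a URL, a local path, or a file-like object."""
    return pd.read_csv(source, header=0)


# Outside JupyterLite, the lab URL could be passed directly (needs network access):
# df = read_dataset(URL)

# Offline stand-in with illustrative rows so the sketch can run anywhere
sample_csv = io.StringIO("Manufacturer,Price\nAcer,978\nDell,634\n")
sample_df = read_dataset(sample_csv)
print(sample_df.shape)
```
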

---


# Test Environment


## Building the prompt: Importing the data set

First, ask the Gen AI model to generate code that imports the provided data set into a Pandas data frame. The prompt must state whether the data set includes headers. In this case, the headers are available as the first row of the CSV file.

The prompt to create the code can be structured as follows.

**Write a Python code that can perform the following tasks:**

**Read the CSV file, located on a given file path, into a Pandas data frame, assuming that the first row of the file contains the headers for the data.**

In [7]:
# Keep appending the code generated to this cell, or add more cells below this to execute in parts
import pandas as pd

# Path to the CSV file
file_path = file_name

# Read the CSV into a DataFrame; first row is used as header by default
df = pd.read_csv(file_path, header=0)

# Show the first few rows (optional for quick inspection)
df.head()

| |Unnamed: 0|Manufacturer|Category|Screen|GPU|OS|CPU_core|Screen_Size_cm|CPU_frequency|RAM_GB|Storage_GB_SSD|Weight_kg|Price|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|Acer|4|IPS Panel|2|1|5|35.56|1.6|8|256|1.6|978|
|1|1|Dell|3|Full HD|1|1|3|39.624|2.0|4|256|2.2|634|
|2|2|Dell|3|Full HD|1|1|7|39.624|2.7|8|256|2.2|946|
|3|3|Dell|4|IPS Panel|2|1|5|33.782|1.6|8|128|1.22|1244|
|4|4|HP|4|Full HD|2|1|7|39.624|1.8|8|256|1.91|837|
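
To see why the prompt has to mention the header row, the sketch below reads the same small CSV text twice with different values of the `header` parameter of `pd.read_csv`. It uses a three-column excerpt rather than the full data set, and the variable names are illustrative.

```python
import io

import pandas as pd

# Small excerpt: one header line followed by two data rows
csv_text = "Manufacturer,RAM_GB,Price\nAcer,8,978\nDell,4,634\n"

# header=0: the first line supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None: the first line is treated as data and columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.columns.tolist())
print(len(with_header), len(no_header))
```

With `header=None`, the header line ends up as an extra data row, which also forces every column to be read as text.
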


## Building the prompt: Handle missing data

You can now ask the Generative AI model to generate a script to handle the missing data.

First, use the model to identify the attributes with missing data.

For this, you can use the following prompt:

**Write a Python code that identifies the columns with missing values in a pandas data frame and gives missing value counts per column.**

In [6]:
# Compute missing value counts per column
missing_counts = df.isnull().sum()

# Identify columns with at least one missing value
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Output results
print('Columns with missing values:', cols_with_missing)
print('Missing value counts per column:\n', missing_counts)
# Optional: missing_counts_dict = missing_counts.to_dict() and print if needed

Columns with missing values: ['Screen_Size_cm', 'Weight_kg']
Missing value counts per column:
 Unnamed: 0        0
Manufacturer      0
Category          0
Screen            0
GPU               0
OS                0
CPU_core          0
Screen_Size_cm    4
CPU_frequency     0
RAM_GB            0
Storage_GB_SSD    0
Weight_kg         5
Price             0
dtype: int64


Once run, you will see that the two attributes with missing values are `Screen_Size_cm`, which is treated as a categorical variable because screen sizes take a small set of standard values, and `Weight_kg`, a continuous variable.

Now, you need to replace the missing values with appropriate values. The following are the rules for this:

1. Missing entries in columns containing categorical values need to be replaced with the most frequent entries
2. Missing entries in columns with continuous data need to be replaced with the mean value of the column
3. If a value is missing in the target column, you may need to drop that row

You can build a prompt for this, as shown below.

**Write a Python code to replace the missing values in a pandas data frame, per the following guidelines.**
1. **For a categorical attribute "Screen_Size_cm", replace the missing values with the most frequent value in the column.**
2. **For a continuous value attribute "Weight_kg", replace the missing values with the mean value of the entries in the column.**

In [8]:
# Impute the categorical column 'Screen_Size_cm' with its most frequent value (mode)
most_frequent_screen = df['Screen_Size_cm'].mode().iloc[0]
df['Screen_Size_cm'] = df['Screen_Size_cm'].fillna(most_frequent_screen)

# Impute the continuous column 'Weight_kg' with its mean value
mean_weight = df['Weight_kg'].mean()
df['Weight_kg'] = df['Weight_kg'].fillna(mean_weight)

# Optional: quick verification
print('Imputation completed.')
print(df[['Screen_Size_cm', 'Weight_kg']].head())

Imputation completed.
   Screen_Size_cm  Weight_kg
0          35.560       1.60
1          39.624       2.20
2          39.624       2.20
3          33.782       1.22
4          39.624       1.91
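
Rule 3 is not exercised above because `Price` has no missing values in this data set. A minimal sketch of that rule on a hypothetical three-row frame, assuming `Price` is the target column, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second row has no value in the assumed target column 'Price'
demo = pd.DataFrame({
    "Weight_kg": [1.6, 2.2, np.nan],
    "Price": [978.0, np.nan, 946.0],
})

# Drop only the rows whose target value is missing, then renumber the index
cleaned = demo.dropna(subset=["Price"]).reset_index(drop=True)

print(len(demo), len(cleaned))
```

Restricting `dropna` with `subset` keeps rows that are missing only a feature value, so those can still be imputed with rules 1 and 2.
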


## Building the prompt: Modify data type

Next, update the data type of both attributes to float.

A prompt such as the following should produce a similar response:



**Write a Python code snippet to change the data type of the attributes "Screen_Size_cm" and "Weight_kg" of a data frame to float.**

In [9]:
# Convert specific columns to float using a robust approach that coerces errors to NaN
df['Screen_Size_cm'] = pd.to_numeric(df['Screen_Size_cm'], errors='coerce')
df['Weight_kg'] = pd.to_numeric(df['Weight_kg'], errors='coerce')

In [10]:
df.dtypes

Unnamed: 0          int64
Manufacturer       object
Category            int64
Screen             object
GPU                 int64
OS                  int64
CPU_core            int64
Screen_Size_cm    float64
CPU_frequency     float64
RAM_GB              int64
Storage_GB_SSD      int64
Weight_kg         float64
Price               int64
dtype: object
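
The generated code uses `pd.to_numeric` with `errors='coerce'` rather than a plain cast. The sketch below, on hypothetical string values, shows what that choice does: entries that cannot be parsed become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values; the last entry cannot be parsed as a number
raw = pd.Series(["35.56", "39.624", "n/a"])

# Unparseable entries are turned into NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isna().sum())
```

Because coercion can introduce new missing values silently, it may be worth re-running the missing-value count after a conversion like this.
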

## Building the prompt: Standardization and Normalization

You may notice that the parameter `Screen_Size_cm` represents the screen size in centimeters instead of the standard unit, which is inches. Also notice that the parameter `Weight_kg` would be better expressed in pounds.

You can ask generative AI to generate code that converts these parameters without your having to specify the conversion factors.

You can build a prompt for this, as shown below.



**Write a Python code to modify the contents under the following attributes of the data frame as required.**
1. **Data under 'Screen_Size_cm' is assumed to be in centimeters. Convert this data into inches. Modify the name of the attribute to 'Screen_Size_inch'.**
2. **Data under 'Weight_kg' is assumed to be in kilograms. Convert this data into pounds. Modify the name of the attribute to 'Weight_pounds'.**

In [11]:
# Transform units: cm -> inches and kg -> pounds; rename columns accordingly
# Assumes an existing DataFrame named 'df' with columns 'Screen_Size_cm' and 'Weight_kg'
df = df.assign(
    Screen_Size_inch=df['Screen_Size_cm'] / 2.54,
    Weight_pounds=df['Weight_kg'] * 2.2046226218
).drop(columns=['Screen_Size_cm', 'Weight_kg'])
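
The generated code relies on `assign` and `drop`. An equivalent step-by-step variant, sketched here on sample values taken from the first rows of the data set, converts the columns and then renames them with `rename`. The constant names are illustrative, and the conversion factors are the same ones used above.

```python
import pandas as pd

CM_PER_INCH = 2.54
POUNDS_PER_KG = 2.2046226218

# Sample values taken from the first rows of the data set
demo = pd.DataFrame({"Screen_Size_cm": [35.56, 39.624], "Weight_kg": [1.6, 2.2]})

# Convert the values first, while the columns still carry their old names
demo["Screen_Size_cm"] = demo["Screen_Size_cm"] / CM_PER_INCH
demo["Weight_kg"] = demo["Weight_kg"] * POUNDS_PER_KG

# Then rename the columns to reflect the new units
demo = demo.rename(columns={
    "Screen_Size_cm": "Screen_Size_inch",
    "Weight_kg": "Weight_pounds",
})

print(demo.round(2))
```
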

It may also be required to normalize the data under some attributes. Since there are many normalization forms, mentioning the exact needs and tasks is important. Also, you can save the normalized data as a new attribute or change the original attribute. You need to ensure that all the details of the prompt are clear.

For example, let us assume that the data under `CPU_frequency` needs to be normalized with respect to the maximum value under the attribute. You need the changes to be reflected directly under the attribute instead of creating a new attribute.

You can ask Generative AI to generate a script that does this for you.

You can build a prompt for this, as shown below.

**Write a Python code to normalize the content under the attribute "CPU_frequency" in a data frame df with respect to its maximum value. Make changes to the original data, and do not create a new attribute.**

In [12]:
df['CPU_frequency'] = df['CPU_frequency'] / df['CPU_frequency'].max()
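
The two options mentioned earlier, saving the result as a new attribute or overwriting the original, can be compared side by side. This is a sketch on sample frequency values, and the column name `CPU_frequency_norm` is a hypothetical choice for the illustration.

```python
import pandas as pd

# Sample frequency values for illustration
demo = pd.DataFrame({"CPU_frequency": [1.6, 2.0, 2.7, 2.9]})

# Option 1: keep the original values and store the result in a new attribute
demo["CPU_frequency_norm"] = demo["CPU_frequency"] / demo["CPU_frequency"].max()

# Option 2: overwrite the original attribute (what the prompt above asks for)
demo["CPU_frequency"] = demo["CPU_frequency"] / demo["CPU_frequency"].max()

print(demo.round(3))
```

Overwriting loses the original values, so rerunning the cell or normalizing again later operates on already scaled data.
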

## Building the prompt: Categorical to numerical

For predictive modeling, categorical variables are not usable in their current form. So, you must convert the important categorical variables into numerical indicator variables. Indicator variables are typically new attributes, with content being 1 for the indicated category and 0 for all others. Once you create the indicator variables, you may drop the original attribute.

For example, assume that attribute `Screen` needs to be converted into individual indicator variables for each entry. Once done, the attribute `Screen` needs to be dropped.

You can build a prompt for this, as shown below.

**Write a Python code to perform the following tasks.**

1. **Convert a data frame df attribute "Screen", into indicator variables, saved as df1, with the naming convention "Screen_<unique value of the attribute>".**
2. **Append df1 into the original data frame df.**
3. **Drop the original attribute from the data frame df.**

In [14]:
# One-hot encode the 'Screen' column into indicator variables
# df1 will contain columns named like 'Screen_<value>'
df1 = pd.get_dummies(df['Screen'], prefix='Screen')

# Append the new indicator columns to the original dataframe
df = df.join(df1)

# Drop the original 'Screen' attribute
df = df.drop(columns=['Screen'])

In [15]:
df.head()

| |Unnamed: 0|Manufacturer|Category|GPU|OS|CPU_core|CPU_frequency|RAM_GB|Storage_GB_SSD|Price|Screen_Size_inch|Weight_pounds|Screen_Full HD|Screen_IPS Panel|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|Acer|4|2|1|5|0.551724|8|256|978|14.0|3.527396|False|True|
|1|1|Dell|3|1|1|3|0.689655|4|256|634|15.6|4.85017|True|False|
|2|2|Dell|3|1|1|7|0.931034|8|256|946|15.6|4.85017|True|False|
|3|3|Dell|4|2|1|5|0.551724|8|128|1244|13.3|2.68964|False|True|
|4|4|HP|4|2|1|7|0.62069|8|256|837|15.6|4.210829|True|False|
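
The indicator columns in the output above hold `True`/`False` rather than 1/0, which is the default behavior of `pd.get_dummies` in pandas 2.0 and later. If numeric indicators are preferred, one option is the `dtype` argument, sketched here on a three-row sample:

```python
import pandas as pd

# Three-row sample for illustration
demo = pd.DataFrame({
    "Screen": ["IPS Panel", "Full HD", "Full HD"],
    "Price": [978, 634, 946],
})

# dtype=int yields 1/0 indicators instead of True/False
indicators = pd.get_dummies(demo["Screen"], prefix="Screen", dtype=int)

# Append the indicators and drop the original attribute in one step
demo = pd.concat([demo.drop(columns=["Screen"]), indicators], axis=1)

print(demo)
```
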


## Practice problems

1. Create a prompt to generate a Python code that converts the values under `Price` from USD to Euros.

2. Modify the normalization prompt to perform min-max normalization on the `CPU_frequency` parameter.

**Write a Python code to perform the following tasks:**
1. **Convert the column Price from USD to Euros. Modify the name of the attribute to 'Price_Euros'.**
2. **Perform a min-max normalization on the attribute "CPU_frequency" in a data frame df. Make changes to the original data, and do not create a new attribute.**

In [17]:
usd_to_eur_rate = 0.92  # exchange rate assumption: 1 USD = 0.92 EUR
# Convert 'Price' from USD to Euros; pop() removes the original column, so this acts as a rename
df['Price_Euros'] = df.pop('Price') * usd_to_eur_rate

# Min-max normalize in-place for 'CPU_frequency'
min_v = df['CPU_frequency'].min()
max_v = df['CPU_frequency'].max()
range_v = max_v - min_v
# Normalize; if range_v == 0, results become NaN, replace with 0
df['CPU_frequency'] = (df['CPU_frequency'] - min_v) / range_v
df['CPU_frequency'] = df['CPU_frequency'].fillna(0)

In [19]:
df.head()

| |Unnamed: 0|Manufacturer|Category|GPU|OS|CPU_core|CPU_frequency|RAM_GB|Storage_GB_SSD|Screen_Size_inch|Weight_pounds|Screen_Full HD|Screen_IPS Panel|Price_Euros|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|Acer|4|2|1|5|0.235294|8|256|14.0|3.527396|False|True|899.76|
|1|1|Dell|3|1|1|3|0.470588|4|256|15.6|4.85017|True|False|583.28|
|2|2|Dell|3|1|1|7|0.882353|8|256|15.6|4.85017|True|False|870.32|
|3|3|Dell|4|2|1|5|0.235294|8|128|13.3|2.68964|False|True|1144.48|
|4|4|HP|4|2|1|7|0.352941|8|256|15.6|4.210829|True|False|770.04|


### Personal Insights
* Gen AI generated code is not always correct and may lack context. In this case, the code proposed for the standardization of `Screen_Size_cm` and `Weight_kg` took a functional programming approach. Although it was valid and correct, it did not match the context of the project, so the prompt should be tuned to regenerate the code accordingly.
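* The last normalization of the `CPU_frequency` attribute is applied to the already max-normalized attribute rather than to the original data. The values are unaffected here, because min-max scaling is invariant to a prior division by a positive constant, but the dependence on cell execution order is worth noting.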
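
To check how much the order of the two normalizations matters, the sketch below compares min-max normalization of raw values with min-max normalization of values that were first divided by their maximum. The frequencies are sample values, and `min_max` is a helper written for this illustration.

```python
import pandas as pd

# Sample CPU frequencies for illustration
raw = pd.Series([1.6, 2.0, 2.7, 2.9, 1.2])


def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    return (values - values.min()) / (values.max() - values.min())


# Min-max on the raw values
from_raw = min_max(raw)

# Min-max on values that were already divided by their maximum
from_scaled = min_max(raw / raw.max())

# Largest absolute difference between the two results
print((from_raw - from_scaled).abs().max())
```

The printed difference should be zero up to floating-point rounding, since the earlier division rescales the minimum and the maximum by the same factor.
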

## Authors


[Abhishek Gagneja](https://www.linkedin.com/in/abhishek-gagneja-23051987/)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-12-10|0.1|Abhishek Gagneja|Initial Draft created|


Copyright © 2023 IBM Corporation. All rights reserved.
