<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo">
    </a>
</p>


# Test Environment for Generative AI classroom labs

This lab provides a test environment for the code generated in the Generative AI classroom.

Follow the instructions below to set up this environment for further use.


# Setup


### Install required libraries

If your task requires additional Python libraries, you can install them as shown below.


In [71]:
%pip install nbformat plotly


Note: you may need to restart the kernel to use updated packages.


### Dataset URL from the GenAI lab
Use the URL provided in the GenAI lab in the cell below. 


In [83]:
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod1.csv"

---


# Test Environment


Now, you will start building your prompt. First, ask the Gen AI model to generate code that imports the provided data set into a pandas data frame. You must specify how the headers are handled when importing the data. Here, the data set headers are available as the first row of the CSV file.

The prompt to create the code can be structured as follows.

# Building the prompt: Importing the data set

### PROMPT: Write a Python code that can perform the following task: Read the CSV file, located on a given file path (URL above), into a Pandas data frame, assuming that the first row of the file contains the headers for the data.

 Assuming the first row of the file contains the headers, you don't need to specify any additional parameters.
 Additional details:
 - The `pd.read_csv()` function is used to read a CSV file into a Pandas data frame.
 - By default, it assumes that the first row of the file contains the headers for the data.
 - If your file doesn't have headers, you can specify `header=None` as an additional parameter.
 - You can also specify other parameters, such as `sep` to specify the delimiter used in the file.
 - Make sure you have the Pandas library installed in your Python environment before running this code.
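
As a minimal sketch of the header behaviour described above, the example below reads the same CSV text twice, once with the default settings and once with `header=None`. The two-row `csv_text` sample is a hypothetical stand-in for the real file and is read from memory rather than from the lab URL.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file
csv_text = "Manufacturer,Price\nAcer,978\nDell,634\n"

# Default behaviour: the first row supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every row, including the first, is treated as data
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.columns.tolist())
print(with_header.shape)
print(without_header.shape)
```

With `header=None`, the header text ends up as the first data row and the columns are labelled with integers, which is usually not what you want for a file that does have a header row.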

In [87]:
import pandas as pd

# Specify the file path to the CSV file
file_path = URL

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)

# Display the DataFrame
print(df)

     Unnamed: 0 Manufacturer  Category     Screen  GPU  OS  CPU_core  \
0             0         Acer         4  IPS Panel    2   1         5   
1             1         Dell         3    Full HD    1   1         3   
2             2         Dell         3    Full HD    1   1         7   
3             3         Dell         4  IPS Panel    2   1         5   
4             4           HP         4    Full HD    2   1         7   
..          ...          ...       ...        ...  ...  ..       ...   
233         233       Lenovo         4  IPS Panel    2   1         7   
234         234      Toshiba         3    Full HD    2   1         5   
235         235       Lenovo         4  IPS Panel    2   1         5   
236         236       Lenovo         3    Full HD    3   1         5   
237         237      Toshiba         3    Full HD    2   1         5   

     Screen_Size_cm  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_kg  Price  
0            35.560            1.6       8             2

# Building the prompt: Handle missing data

### PROMPT: Write a Python code that identifies the columns with missing values in a pandas data frame.

In [97]:
# Identify columns with missing values
columns_with_missing_values = df.columns[df.isnull().any()]

# Additional details:
# - The `df.isnull()` function returns a Boolean data frame where each cell is True if it contains a missing value (NaN), and False otherwise.
# - The `df.columns` attribute returns the column labels of the data frame.
# - The `.any()` method returns a Boolean Series indicating whether any value in the given axis (in this case, columns) is True.
# - Finally, indexing `df.columns` with this Boolean Series keeps only the labels of the columns that contain at least one missing value.

# You can now use the 'columns_with_missing_values' variable to further analyze or handle the columns with missing values.

In [99]:
print("Columns with missing values:", columns_with_missing_values)

Columns with missing values: Index(['Screen_Size_cm', 'Weight_kg'], dtype='object')
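
The same check can be tried on a small frame where the missing entries are easy to see. This is only a sketch: the `sample` frame and its values are made up for illustration, and the per-column count via `.isnull().sum()` is an optional extra beyond what the prompt asked for.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one gap in each numeric column
sample = pd.DataFrame({
    "Manufacturer": ["Acer", "Dell", "HP"],
    "Screen_Size_cm": [35.56, np.nan, 39.624],
    "Weight_kg": [1.6, 2.2, np.nan],
})

# Number of missing entries per column
missing_counts = sample.isnull().sum()

# Labels of the columns that contain at least one missing entry
columns_with_missing = sample.columns[sample.isnull().any()].tolist()

print(missing_counts)
print(columns_with_missing)
```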


### PROMPT: Write a Python code to replace the missing values in a pandas data frame, per the following guidelines.
#### 1. For a categorical attribute "Screen_Size_cm", replace the missing values with the most frequent value in the column.
#### 2. For a continuous value attribute "Weight_kg", replace the missing values with the mean value of the entries in the column.

In [108]:
# Replace missing values in the 'Screen_Size_cm' column with the most frequent value
most_frequent_value = df['Screen_Size_cm'].mode()[0]
df['Screen_Size_cm'] = df['Screen_Size_cm'].fillna(most_frequent_value)

# Replace missing values in the 'Weight_kg' column with the mean value
mean_value = df['Weight_kg'].mean()
df['Weight_kg'] = df['Weight_kg'].fillna(mean_value)

# Additional details:
# - The `.mode()` method is used to calculate the most frequent value in a column.
# - The `[0]` indexing is used to retrieve the most frequent value from the resulting Series.
# - The `.fillna()` method is used to replace missing values with a specified value.
# - The result is assigned back to the column, because `df[col].fillna(value, inplace=True)` acts on an intermediate object and triggers a warning in recent pandas versions.

# You can now use the modified 'df' data frame, which has the missing values replaced according to the guidelines.

Note: In pandas 2.x, the chained form `df[col].fillna(value, inplace=True)` emits a FutureWarning stating that the behavior will change in pandas 3.0: the inplace method will never work because the intermediate object on which the values are set always behaves as a copy. The warning suggests using `df.fillna({col: value}, inplace=True)` or `df[col] = df[col].method(value)` instead, so that the operation is performed on the original object.
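
The two alternatives to a chained `inplace=True` call can be sketched on a small frame. The `sample` data here is hypothetical; only the column names are taken from the lab data set.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing entry per column
sample = pd.DataFrame({
    "Screen_Size_cm": [35.56, 39.624, 39.624, np.nan],
    "Weight_kg": [1.0, 2.0, np.nan, 3.0],
})

# Option 1: fill the column and assign the result back
most_frequent = sample["Screen_Size_cm"].mode()[0]
sample["Screen_Size_cm"] = sample["Screen_Size_cm"].fillna(most_frequent)

# Option 2: call fillna on the whole frame with a {column: value} mapping
sample = sample.fillna({"Weight_kg": sample["Weight_kg"].mean()})

print(sample)
print(sample.isna().sum())
```

Both forms operate on the data frame itself rather than on an intermediate object, so they should not depend on the chained-assignment behavior that the warning describes.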


In [111]:
# Display the DataFrame after replacing missing values
print("\nDataFrame after replacing missing values:")
print(df)
print("\nMissing values count after replacement:")
print(df.isna().sum())


DataFrame after replacing missing values:
     Unnamed: 0 Manufacturer  Category     Screen  GPU  OS  CPU_core  \
0             0         Acer         4  IPS Panel    2   1         5   
1             1         Dell         3    Full HD    1   1         3   
2             2         Dell         3    Full HD    1   1         7   
3             3         Dell         4  IPS Panel    2   1         5   
4             4           HP         4    Full HD    2   1         7   
..          ...          ...       ...        ...  ...  ..       ...   
233         233       Lenovo         4  IPS Panel    2   1         7   
234         234      Toshiba         3    Full HD    2   1         5   
235         235       Lenovo         4  IPS Panel    2   1         5   
236         236       Lenovo         3    Full HD    3   1         5   
237         237      Toshiba         3    Full HD    2   1         5   

     Screen_Size_cm  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_kg  Price  
0            

# Building the prompt: Modify data type

### PROMPT: Write a Python code snippet to change the data type of the attributes "Screen_Size_cm" and "Weight_kg" of a data frame to float.

In [122]:
# Change the data type of 'Screen_Size_cm' and 'Weight_kg' to float
df['Screen_Size_cm'] = df['Screen_Size_cm'].astype(float)
df['Weight_kg'] = df['Weight_kg'].astype(float)

# Additional details:
# - The `.astype()` method is used to change the data type of a column.
# - In this case, we're specifying `float` as the desired data type.
# - Make sure the columns contain numeric values that can be converted to float.
# - If there are any non-numeric values in the columns, the conversion will raise an error.

# You can now use the modified 'df' data frame, which has the data types of 'Screen_Size_cm' and 'Weight_kg' changed to float.
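
The caveat about non-numeric values can be seen in a short sketch. The `raw` Series and its "unknown" entry are hypothetical. `pd.to_numeric(..., errors='coerce')` is shown as one possible way to handle such entries, not as a step this lab requires.

```python
import pandas as pd

# Hypothetical column read as text, with one entry that is not a number
raw = pd.Series(["35.56", "39.624", "unknown"], name="Screen_Size_cm")

# astype(float) refuses to convert the whole column if any entry is invalid
try:
    raw.astype(float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# One alternative: coerce invalid entries to NaN and handle them afterwards
clean = pd.to_numeric(raw, errors="coerce")

print(conversion_failed)
print(clean)
```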

# Building the prompt: Standardization and Normalization

### PROMPT: Write a Python code to modify the contents under the following attributes of the data frame as required.
#### 1. Data under 'Screen_Size_cm' is assumed to be in centimeters. Convert this data into inches. Modify the name of the attribute to 'Screen_Size_inch'.
#### 2. Data under 'Weight_kg' is assumed to be in kilograms. Convert this data into pounds. Modify the name of the attribute to 'Weight_pounds'.

In [126]:
# Convert 'Screen_Size_cm' from centimeters to inches and modify the attribute name
df['Screen_Size_inch'] = df['Screen_Size_cm'] * 0.393701
df.drop('Screen_Size_cm', axis=1, inplace=True)

# Convert 'Weight_kg' from kilograms to pounds and modify the attribute name
df['Weight_pounds'] = df['Weight_kg'] * 2.20462
df.drop('Weight_kg', axis=1, inplace=True)

# Additional details:
# - The code multiplies the values under 'Screen_Size_cm' by 0.393701 to convert centimeters to inches.
# - The resulting values are stored in a new attribute named 'Screen_Size_inch'.
# - The original 'Screen_Size_cm' attribute is dropped from the data frame using the `.drop()` method.
# - Similarly, the code multiplies the values under 'Weight_kg' by 2.20462 to convert kilograms to pounds.
# - The resulting values are stored in a new attribute named 'Weight_pounds'.
# - The original 'Weight_kg' attribute is dropped from the data frame.

# You can now use the modified 'df' data frame, which has the contents and attribute names modified as required.
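
An alternative sketch of the same conversion is shown below, using `.rename()` instead of creating a new column and dropping the old one. The `sample` values are hypothetical, and the constants use the exact definitions (1 inch = 2.54 cm, 1 pound = 0.45359237 kg) rather than the rounded factors in the cell above, so results may differ slightly in the last decimals.

```python
import pandas as pd

CM_PER_INCH = 2.54          # exact by definition
KG_PER_POUND = 0.45359237   # exact by definition

# Hypothetical two-row frame using the lab's column names
sample = pd.DataFrame({
    "Screen_Size_cm": [35.56, 39.624],
    "Weight_kg": [1.6, 2.2],
})

# Convert the values in place, then rename the columns to match the new units
sample["Screen_Size_cm"] = sample["Screen_Size_cm"] / CM_PER_INCH
sample["Weight_kg"] = sample["Weight_kg"] / KG_PER_POUND
sample = sample.rename(columns={
    "Screen_Size_cm": "Screen_Size_inch",
    "Weight_kg": "Weight_pounds",
})

print(sample)
```

Renaming keeps the converted columns in their original positions, whereas adding a new column and dropping the old one moves them to the end of the frame.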

### PROMPT: Write a Python code to normalize the content under the attribute "CPU_frequency" in a data frame df with respect to its maximum value. Make changes to the original data, and do not create a new attribute.

In [137]:
# Normalize the content under 'CPU_frequency' with respect to its maximum value
max_value = df['CPU_frequency'].max()
df['CPU_frequency'] = df['CPU_frequency'] / max_value

# Additional details:
# - The code calculates the maximum value of the 'CPU_frequency' attribute using the `.max()` method.
# - It then divides the values under 'CPU_frequency' by the maximum value to normalize them.
# - The resulting normalized values overwrite the original values in the 'CPU_frequency' attribute.

# You can now use the modified 'df' data frame, which has the content under the 'CPU_frequency' attribute normalized.
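
As a quick worked example of this scaling, the sketch below divides three hypothetical frequencies by their maximum. The values are illustrative only and are not read from the lab data set.

```python
import pandas as pd

# Hypothetical CPU frequencies in GHz
sample = pd.DataFrame({"CPU_frequency": [1.6, 2.0, 2.9]})

# Divide by the column maximum so that the largest value becomes 1.0
sample["CPU_frequency"] = sample["CPU_frequency"] / sample["CPU_frequency"].max()

print(sample)
```

After this step the largest entry is exactly 1.0 and the others are fractions of it. The smallest entry is not mapped to 0, which is the main difference from min-max normalization.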

# Building the prompt: Categorical to numerical

Categorical variables cannot be used directly in predictive modeling. So, you must convert the important categorical variables into numerical indicator variables. Indicator variables are typically new attributes whose content is 1 for the indicated category and 0 for all others. Once you create the indicator variables, you may drop the original attribute.

### PROMPT: Write a Python code to perform the following tasks.
#### 1. Convert a data frame df attribute "Screen", into indicator variables, saved as df1, with the naming convention "Screen_<unique value of the attribute>".
#### 2. Append df1 into the original data frame df.
#### 3. Drop the original attribute from the data frame df.

In [142]:
# Convert the 'Screen' attribute into indicator variables
# dtype=int yields 1/0 values; recent pandas versions return True/False by default
df1 = pd.get_dummies(df['Screen'], prefix='Screen', dtype=int)

# Append df1 into the original data frame df
df = pd.concat([df, df1], axis=1)

# Drop the original 'Screen' attribute from the data frame
df.drop('Screen', axis=1, inplace=True)

# Additional details:
# - The `pd.get_dummies()` function is used to convert a categorical attribute into indicator variables.
# - The resulting indicator variables are stored in a new data frame named 'df1'.
# - The `prefix` parameter is used to specify the naming convention for the indicator variables.
# - The `pd.concat()` function is used to concatenate the original data frame 'df' and 'df1' along the column axis (axis=1).
# - The resulting concatenated data frame is assigned back to 'df'.
# - Finally, the `.drop()` method is used to drop the original 'Screen' attribute from 'df'.

# You can now use the modified 'df' data frame, which has the 'Screen' attribute converted into indicator variables, appended, and the original attribute dropped.
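
The three steps can be followed on a small frame where the resulting columns are easy to inspect. This is a sketch with hypothetical rows; `dtype=int` is passed so that the indicators are 1/0, since recent pandas versions return Boolean columns by default.

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column
sample = pd.DataFrame({
    "Manufacturer": ["Acer", "Dell", "HP"],
    "Screen": ["IPS Panel", "Full HD", "Full HD"],
})

# 1. Build one indicator column per unique value of 'Screen'
indicators = pd.get_dummies(sample["Screen"], prefix="Screen", dtype=int)

# 2. Append the indicators, and 3. drop the original attribute
sample = pd.concat([sample, indicators], axis=1).drop(columns="Screen")

print(sample)
```

Each row has exactly one indicator set to 1, so the indicator columns together carry the same information as the original attribute.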

### PROMPT: Create a prompt to generate a Python code that converts the values under Price from USD to Euros.

In [145]:
import numpy as np

# Set a fixed exchange rate (1 USD = 0.85 EUR)
# In a real application, you might want to fetch the current rate from an API
USD_TO_EUR_RATE = 0.85

# Display the original DataFrame
print("Original DataFrame:")
print(df)
print("\nOriginal data types:")
print(df.dtypes)

# Convert 'Price' column to numeric, coercing errors to NaN
# This handles non-numeric values gracefully
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

# Create a new column 'Price_EUR' with converted values
df['Price_EUR'] = df['Price'] * USD_TO_EUR_RATE



Original DataFrame:
     Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  CPU_frequency  \
0             0         Acer         4    2   1         5       0.551724   
1             1         Dell         3    1   1         3       0.689655   
2             2         Dell         3    1   1         7       0.931034   
3             3         Dell         4    2   1         5       0.551724   
4             4           HP         4    2   1         7       0.620690   
..          ...          ...       ...  ...  ..       ...            ...   
233         233       Lenovo         4    2   1         7       0.896552   
234         234      Toshiba         3    2   1         5       0.827586   
235         235       Lenovo         4    2   1         5       0.896552   
236         236       Lenovo         3    3   1         5       0.862069   
237         237      Toshiba         3    2   1         5       0.793103   

     RAM_GB  Storage_GB_SSD  Price  Screen_Size_inch  Weight_pounds

In [147]:
# Count how many values couldn't be converted
invalid_count = df['Price'].isna().sum()
if invalid_count > 0:
    print(f"\nWarning: {invalid_count} invalid price value(s) detected and set to NaN")


In [149]:

# Format the EUR prices to 2 decimal places for display
# Note: This creates a string representation for display purposes
df['Price_EUR_Formatted'] = df['Price_EUR'].apply(lambda x: f'{x:.2f}' if not pd.isna(x) else 'N/A')



In [151]:


# Display the DataFrame with converted prices
print("\nDataFrame with USD and EUR prices:")
print(df[['Price', 'Price_EUR', 'Price_EUR_Formatted']])

# Summary statistics of prices (excluding NaN values)
print("\nSummary of prices:")
price_summary = df[['Price', 'Price_EUR']].describe()
print(price_summary)

# Optional: If you need to save the DataFrame with the new column
# df.to_csv('products_with_eur_prices.csv', index=False)


DataFrame with USD and EUR prices:
     Price  Price_EUR Price_EUR_Formatted
0      978     831.30              831.30
1      634     538.90              538.90
2      946     804.10              804.10
3     1244    1057.40             1057.40
4      837     711.45              711.45
..     ...        ...                 ...
233   1891    1607.35             1607.35
234   1950    1657.50             1657.50
235   2236    1900.60             1900.60
236    883     750.55              750.55
237   1499    1274.15             1274.15

[238 rows x 3 columns]

Summary of prices:
             Price    Price_EUR
count   238.000000   238.000000
mean   1462.344538  1242.992857
std     574.607699   488.416544
min     527.000000   447.950000
25%    1066.500000   906.525000
50%    1333.000000  1133.050000
75%    1777.000000  1510.450000
max    3810.000000  3238.500000
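
The conversion steps above can be wrapped in a small helper, sketched below. The function name `usd_to_eur` and the sample prices are hypothetical, and the 0.85 rate is the same fixed assumption used in the cell above rather than a live exchange rate.

```python
import pandas as pd


def usd_to_eur(prices, rate):
    """Convert a Series of USD prices to EUR, rounded to 2 decimals.

    Entries that cannot be parsed as numbers become NaN.
    """
    numeric = pd.to_numeric(prices, errors="coerce")
    return (numeric * rate).round(2)


# Hypothetical prices: an integer, a numeric string, and an invalid entry
sample_prices = pd.Series([978, "634", "n/a"])
converted = usd_to_eur(sample_prices, rate=0.85)

print(converted)
```

Rounding keeps the column numeric, unlike the string-formatted `Price_EUR_Formatted` column, so it can still be used in calculations such as `.describe()`.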


In [153]:
print(df)

     Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  CPU_frequency  \
0             0         Acer         4    2   1         5       0.551724   
1             1         Dell         3    1   1         3       0.689655   
2             2         Dell         3    1   1         7       0.931034   
3             3         Dell         4    2   1         5       0.551724   
4             4           HP         4    2   1         7       0.620690   
..          ...          ...       ...  ...  ..       ...            ...   
233         233       Lenovo         4    2   1         7       0.896552   
234         234      Toshiba         3    2   1         5       0.827586   
235         235       Lenovo         4    2   1         5       0.896552   
236         236       Lenovo         3    3   1         5       0.862069   
237         237      Toshiba         3    2   1         5       0.793103   

     RAM_GB  Storage_GB_SSD  Price  Screen_Size_inch  Weight_pounds  \
0         8     

### PROMPT: Modify the normalization prompt to perform min-max normalization on the CPU_frequency parameter.

In [158]:
# Get the minimum and maximum values of CPU_frequency
min_value = df['CPU_frequency'].min()
max_value = df['CPU_frequency'].max()

# Perform min-max normalization: (x - min) / (max - min)
df['CPU_frequency_normalized'] = (df['CPU_frequency'] - min_value) / (max_value - min_value)

# Display the DataFrame with normalized values
print("\nDataFrame with min-max normalized CPU_frequency:")
print(df)




DataFrame with min-max normalized CPU_frequency:
     Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  CPU_frequency  \
0             0         Acer         4    2   1         5       0.551724   
1             1         Dell         3    1   1         3       0.689655   
2             2         Dell         3    1   1         7       0.931034   
3             3         Dell         4    2   1         5       0.551724   
4             4           HP         4    2   1         7       0.620690   
..          ...          ...       ...  ...  ..       ...            ...   
233         233       Lenovo         4    2   1         7       0.896552   
234         234      Toshiba         3    2   1         5       0.827586   
235         235       Lenovo         4    2   1         5       0.896552   
236         236       Lenovo         3    3   1         5       0.862069   
237         237      Toshiba         3    2   1         5       0.793103   

     RAM_GB  Storage_GB_SSD  Price  S

In [69]:
# Verify the normalization: min should be 0 and max should be 1
print("\nMinimum normalized value:", df['CPU_frequency_normalized'].min())
print("Maximum normalized value:", df['CPU_frequency_normalized'].max())

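# Keep the pre-normalization values for reference
# (run this cell only once; re-running it overwrites them with the already normalized values)
df['CPU_frequency_original'] = df['CPU_frequency']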

# Replace the original column with normalized values
df['CPU_frequency'] = df['CPU_frequency_normalized']

# Display the final DataFrame
print("\nFinal DataFrame:")
print(df[['CPU_frequency', 'CPU_frequency_original']])


Minimum normalized value: 0.0
Maximum normalized value: 1.0

Final DataFrame:
     CPU_frequency  CPU_frequency_original
0         0.235294                0.235294
1         0.470588                0.470588
2         0.882353                0.882353
3         0.235294                0.235294
4         0.352941                0.352941
..             ...                     ...
233       0.823529                0.823529
234       0.705882                0.705882
235       0.823529                0.823529
236       0.764706                0.764706
237       0.647059                0.647059

[238 rows x 2 columns]
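
Min-max normalization can also be written as a small reusable function, sketched below. The helper name `min_max_normalize`, the guard for a constant column, and the sample frequencies are assumptions added for illustration.

```python
import pandas as pd


def min_max_normalize(series):
    """Rescale a numeric Series to the range [0, 1] via (x - min) / (max - min)."""
    span = series.max() - series.min()
    if span == 0:
        # All values are equal; return zeros to avoid dividing by zero
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - series.min()) / span


# Hypothetical CPU frequencies in GHz
freq = pd.Series([1.2, 1.6, 2.9], name="CPU_frequency")
normalized = min_max_normalize(freq)

# Applying the same function to values already divided by the maximum
renormalized = min_max_normalize(freq / freq.max())

print(normalized)
print(renormalized)
```

Both results agree up to floating-point rounding, because min-max normalization is unaffected by multiplying every value by the same positive constant. This is why applying it to a column that was already divided by its maximum still produces a valid 0 to 1 range.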


## Authors


[Abhishek Gagneja](https://www.linkedin.com/in/abhishek-gagneja-23051987/)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-12-10|0.1|Abhishek Gagneja|Initial Draft created|


Copyright © 2023 IBM Corporation. All rights reserved.
