# Building the HR Streamlit App with AI assistants

This notebook walks through the exact prompts and reasoning steps used to create the HR Tool, moving from synthetic data generation to visualization and app development.

## 1. Generate the dataset with ChatGPT

To begin building the HR tool, we need a realistic dataset that we can analyze and visualize. We therefore ask the assistant to generate a mock HR dataset with 100 fictional employees, including the key fields for our planned features.

Prompt used in ChatGPT to generate the dataset:

```
Please create a CSV file with 100 fictional employees.
Include fields such as:
- First Name, Last Name
- Canton of residence (Switzerland)
- Age
- Department (e.g., HR, Production, IT, Finance, Sales)
- Seniority Level
- Workload percentage (choose between 60%, 70%, 80%, 90%, 100%)
- Vacation days allowed (based on workload)
- Vacation days taken
- Hire date

I would like to have it as a downloadable CSV file.
You can ask me questions, before generating the data, to help tailor the outcome.
```

Here is the generated dataset: [hr_dataset.csv](https://github.com/mikjf/AI_assisted_programming/blob/main/data/hr_dataset.csv)
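Since the assistant's output is synthetic, it can help to check it against the constraints stated in the prompt before building on it. The following is a minimal sketch, not part of the original workflow: the helper `check_dataset` and the in-memory sample are hypothetical, and the column names are assumed to match those of the generated file as loaded later in this notebook.

```python
import io
import pandas as pd

# Sample rows in the shape of the generated file, for illustration only.
# In practice, read the full CSV with pd.read_csv("data/hr_dataset.csv").
sample_csv = io.StringIO(
    "First Name,Last Name,Residence,Age,Department,Seniority Level,"
    "Workload,Vacation Days Total,Vacation Days Taken,Hire Date\n"
    "Patrizio,Bellò,Ticino,41,Production,Senior,90%,22,8,2013-08-01\n"
    "Isabella,Raimondi,Ticino,36,HR,Senior,100%,25,11,2007-01-21\n"
    "Bianca,Borroni,Ticino,37,Sales,Senior,80%,20,14,2013-04-14\n"
)
df_check = pd.read_csv(sample_csv)

# Workload values requested in the prompt.
ALLOWED_WORKLOADS = {"60%", "70%", "80%", "90%", "100%"}


def check_dataset(frame: pd.DataFrame, expected_rows: int) -> list[str]:
    """Hypothetical helper: return readable problems, empty if none found."""
    problems = []
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    unexpected = set(frame["Workload"]) - ALLOWED_WORKLOADS
    if unexpected:
        problems.append(f"unexpected workload values: {sorted(unexpected)}")
    over = frame["Vacation Days Taken"] > frame["Vacation Days Total"]
    if over.any():
        problems.append(f"{int(over.sum())} rows have more days taken than allowed")
    return problems


problems = check_dataset(df_check, expected_rows=3)
print(problems)
```

On the full file, `expected_rows=100` would be the value to use, since that is what the prompt requested.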

## 2. Leverage Google Colab with Gemini to explore the dataset

After generating the HR dataset, we create a Google Colab notebook and upload the CSV to the same Google Drive folder, so that we can begin initial exploration and prototyping with Gemini.

The following prompts were used to get started:

```
On Google Drive, in the same folder as this notebook,
I have a csv file with some synthesized hr data.
What to do to be able to have a look at it?
Does anything need to be done before working with the dataset?
```

Here is Gemini's response, which created the markdown cells in the notebook on its own:

> ### Task
> Access and display the head of the dataframe from the file "/content/drive/MyDrive/Colab Notebooks/2025_Workshops/hr_dataset.csv".

> ### Mount google drive
>
> #### Subtask:
> Mount your Google Drive to access files stored there.
>
> #### Reasoning:
> Mount Google Drive to access the file.

> ```
> from google.colab import drive
> drive.mount('/content/drive')
> ```

> ### Identify file path
>
> #### Subtask:
> Determine the correct path to the CSV file in your Google Drive.
>
> #### Reasoning:
> I need to manually verify the file path to the "hr_dataset.csv" file within my mounted Google Drive and store it in a variable named `file_path`. Since I cannot interact with the user to get the correct path, I will assume a common path and store it in the `file_path` variable. If this path is incorrect, the next steps will likely fail, and the task will be marked as a failure.

> ```
> file_path = "/content/drive/MyDrive/Colab Notebooks/2025_Workshops/hr_dataset.csv"
> ```

> ### Load the data
>
> #### Subtask:
> Load the data from the CSV file into a pandas DataFrame.
>
> #### Reasoning:
> Load the data from the CSV file into a pandas DataFrame using the determined file path.

> ```
> import pandas as pd
> df = pd.read_csv(file_path)
> ```

> ### Display data head
>
> #### Subtask:
> Display the first few rows of the DataFrame to get a preview of the data.
>
> #### Reasoning:
> Display the first few rows of the DataFrame to get a preview of the data.

> ```
> display(df.head())
> ```

Let's replicate the code run on Google Colab here for simplicity:

```python
import pandas as pd

df = pd.read_csv('data/hr_dataset.csv')
```

```python
df.head()
```

| | First Name | Last Name | Residence | Age | Department | Seniority Level | Workload | Vacation Days Total | Vacation Days Taken | Hire Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Patrizio | Bellò | Ticino | 41 | Production | Senior | 90% | 22 | 8 | 2013-08-01 |
| 1 | Isabella | Raimondi | Ticino | 36 | HR | Senior | 100% | 25 | 11 | 2007-01-21 |
| 2 | Bianca | Borroni | Ticino | 37 | Sales | Senior | 80% | 20 | 14 | 2013-04-14 |
| 3 | Alessandro | Traetta | Ticino | 35 | HR | Senior | 80% | 20 | 15 | 2019-11-18 |
| 4 | Tatiana | Zaccardo | Ticino | 45 | Sales | Senior | 70% | 18 | 16 | 2018-11-22 |


> ### Summary:
>
> #### Data Analysis Key Findings
>
> *   The Google Drive was successfully mounted, confirming access to files stored there.
> *   The CSV file located at `/content/drive/MyDrive/Colab Notebooks/2025_Workshops/hr_dataset.csv` was successfully loaded into a pandas DataFrame.
> *   The `head()` method successfully displayed the first five rows of the DataFrame, including columns such as `First Name`, `Last Name`, `Residence`, `Age`, `Department`, `Seniority Level`, `Workload`, `Vacation Days Total`, `Vacation Days Taken`, and `Hire Date`.
>
> #### Insights or Next Steps
>
> *   The successful loading and display of the data head indicates that the data is accessible and the initial setup for analysis is complete.
> *   Proceed with further data exploration and analysis based on the task's objectives.

Let's ask Gemini to further explore the dataset:

```
I would like to further explore the dataset, could you help me with that? Please run the code needed.
```

> ### Explore Data Types
>
> #### Subtask:
> Check the data types of each column in the DataFrame.
>
> #### Reasoning:
> Checking data types is a fundamental step to understand how each column is represented and to identify if any columns need type conversion for further analysis.

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   First Name           101 non-null    object
 1   Last Name            101 non-null    object
 2   Residence            101 non-null    object
 3   Age                  101 non-null    int64 
 4   Department           101 non-null    object
 5   Seniority Level      101 non-null    object
 6   Workload             101 non-null    object
 7   Vacation Days Total  101 non-null    int64 
 8   Vacation Days Taken  101 non-null    int64 
 9   Hire Date            101 non-null    object
dtypes: int64(3), object(7)
memory usage: 8.0+ KB
```
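The `object` dtype reported for `Workload` and `Hire Date` means both columns are stored as strings, which is the kind of type conversion the reasoning above refers to. A minimal sketch of one possible conversion follows; it is not part of Gemini's output, and the two-row frame `df_types` is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Two rows in the same shape as the dataset, for illustration only.
df_types = pd.DataFrame({
    "Workload": ["90%", "100%"],
    "Hire Date": ["2013-08-01", "2007-01-21"],
})

# "90%" -> 90: strip the percent sign, then cast to integer.
df_types["Workload"] = df_types["Workload"].str.rstrip("%").astype(int)

# ISO-formatted strings -> datetime64; an explicit format avoids guessing.
df_types["Hire Date"] = pd.to_datetime(df_types["Hire Date"], format="%Y-%m-%d")

print(df_types.dtypes)
```

With numeric workloads and real dates, later steps such as sorting by hire date or averaging workload per department no longer need string handling.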


> ### Explore Descriptive Statistics
>
> #### Subtask:
> Generate descriptive statistics for numerical columns.
>
> #### Reasoning:
> Exploring descriptive statistics provides a summary of the central tendency, dispersion, and shape of the distribution of numerical columns, which is crucial for understanding the data's characteristics.

```python
df.describe()
```

| | Age | Vacation Days Total | Vacation Days Taken |
|---|---|---|---|
| count | 101.0 | 101.0 | 101.0 |
| mean | 39.792079 | 19.188119 | 8.910891 |
| std | 3.150609 | 3.306699 | 5.676441 |
| min | 35.0 | 15.0 | 0.0 |
| 25% | 37.0 | 18.0 | 5.0 |
| 50% | 40.0 | 18.0 | 8.0 |
| 75% | 42.0 | 22.0 | 14.0 |
| max | 45.0 | 25.0 | 23.0 |
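The two vacation columns are easier to interpret together than separately. As a small sketch under stated assumptions, the remaining balance and the share of days already used can be derived per employee; the derived column names `Vacation Days Remaining` and `Vacation Used (%)` are hypothetical, and the three-row frame only mirrors the first rows of the dataset.

```python
import pandas as pd

# Three rows mirroring the head of the dataset, for illustration only.
df_vac = pd.DataFrame({
    "Vacation Days Total": [22, 25, 20],
    "Vacation Days Taken": [8, 11, 14],
})

# Days still available to each employee.
df_vac["Vacation Days Remaining"] = (
    df_vac["Vacation Days Total"] - df_vac["Vacation Days Taken"]
)

# Share of the allowance already used, as a percentage with one decimal.
df_vac["Vacation Used (%)"] = (
    100 * df_vac["Vacation Days Taken"] / df_vac["Vacation Days Total"]
).round(1)

print(df_vac)
```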


> ### Explore Value Counts for Categorical Columns
>
> #### Subtask:
> Generate value counts for each categorical column in the DataFrame.
>
> #### Reasoning:
> Exploring value counts for categorical columns helps in understanding the distribution of different categories within each column and identifying potential issues like inconsistent entries.

```python
for col in ['Residence', 'Department', 'Seniority Level', 'Workload']:
    print(f"Value counts for '{col}':")
    print(df[col].value_counts())
```

```
Value counts for 'Residence':
Residence
Ticino    101
Name: count, dtype: int64
Value counts for 'Department':
Department
HR            24
Sales         24
Finance       19
Production    17
IT            17
Name: count, dtype: int64
Value counts for 'Seniority Level':
Seniority Level
Mid       53
Senior    48
Name: count, dtype: int64
Value counts for 'Workload':
Workload
70%     28
60%     25
80%     21
100%    15
90%     12
Name: count, dtype: int64
```
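The counts show a single value (`Ticino`) in `Residence`, so that column cannot distinguish between employees. A short sketch of how such constant columns might be flagged automatically, and how counts can be turned into shares, is given below; the frame `df_cat` is a hypothetical four-row stand-in, not the real dataset.

```python
import pandas as pd

# Four rows standing in for the categorical columns, for illustration only.
df_cat = pd.DataFrame({
    "Residence": ["Ticino", "Ticino", "Ticino", "Ticino"],
    "Department": ["HR", "Sales", "HR", "IT"],
})

# Columns with a single distinct value carry no information for comparisons.
constant_cols = [col for col in df_cat.columns if df_cat[col].nunique() == 1]
print(constant_cols)

# normalize=True returns each category's share instead of its raw count.
shares = df_cat["Department"].value_counts(normalize=True)
print(shares)
```

Shares are often easier to compare across columns than raw counts, especially once the dataset grows or is filtered.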


Next prompt: