---
title: Assignment 03 AI Prompts
author:
    - name: Ava Godsy
      affiliations:
        - id: bu
          name: Boston University
          city: Boston
          state: MA
number-sections: true
date: '2025-09-22'
format:
    html:
        toc: true
        toc-depth: 2
        theme: cosmo
    docx: default
    pdf: default
date-modified: today
date-format: long
---

# Cleaning Education column

## Prompt
using python remove "/n" from values in column EDUCATION_LEVELS_NAME in dataframe "df"

## Response
To remove "\n" (newline characters) from the values in the column EDUCATION_LEVELS_NAME in a DataFrame df, you can use the .str.replace() method from pandas.

Here's the code:

```{python}
#| echo: true
#| eval: false

df['EDUCATION_LEVELS_NAME'] = df['EDUCATION_LEVELS_NAME'].str.replace('\n', '', regex=False)
```


Explanation:

str.replace('\n', '', regex=False) replaces all newline characters (\n) with an empty string (''), effectively removing them.

regex=False tells pandas to treat '\n' as a literal string rather than a regular expression. The result is the same either way here, because \n matches only a newline character in both modes.

If you also want to remove any leading or trailing spaces (common after removing \n), you can chain .str.strip():

```{python}
#| echo: true
#| eval: false

df['EDUCATION_LEVELS_NAME'] = df['EDUCATION_LEVELS_NAME'].str.replace('\n', '', regex=False).str.strip()
```

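As a quick check of the pandas approach above, the following sketch applies the same two calls to a small made-up Series. The sample values are hypothetical and only stand in for EDUCATION_LEVELS_NAME.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from the job postings data.
sample = pd.Series(["Bachelor's\n degree ", " GED\n"])

# Remove literal newline characters, then trim surrounding whitespace.
cleaned = sample.str.replace("\n", "", regex=False).str.strip()

print(cleaned.tolist())
```

Note that this only works when df is a pandas DataFrame; a PySpark DataFrame has no .str accessor with these methods.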

## Additional Prompt
i got an error TypeError: 'Column' object is not callable

## Additional Response
Thanks — the error:

```text
TypeError: 'Column' object is not callable
```

means you're likely using PySpark rather than pandas. In PySpark, DataFrame columns are represented by Column objects, and they behave differently than pandas Series.

✅ Fix for PySpark

If you're using PySpark and want to remove \n from a column (like EDUCATION_LEVELS_NAME), use withColumn() and regexp_replace() from pyspark.sql.functions.

Here's the correct code:

```{python}
#| echo: true
#| eval: false

from pyspark.sql.functions import regexp_replace

df = df.withColumn("EDUCATION_LEVELS_NAME", regexp_replace("EDUCATION_LEVELS_NAME", "\n", ""))
```

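If the column might also contain Windows-style line endings, a character class can cover both cases. The sketch below uses Python's re module only to illustrate the pattern on a hypothetical value; passing the same pattern string to regexp_replace is an assumption that relies on Spark's Java regex engine interpreting this character class the same way.

```python
import re

# Pattern matching one or more carriage returns or newlines.
LINE_BREAKS = r"[\r\n]+"

# Hypothetical raw value with mixed line endings.
raw = "[\r\n  \"Master's degree\"\n]"

# Strip every run of line-break characters.
cleaned = re.sub(LINE_BREAKS, "", raw)
print(cleaned)
```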

# Salary Distribution by Industry and Employment Type

## Prompt
Using Plotly, Create a box plot where: X-axis = NAICS2_NAME Y-axis = SALARY_FROM Group by EMPLOYMENT_TYPE_NAME. dataframe = data_selected

## Response
Here's how you can create a box plot using Plotly in Python, with:

- X-axis = NAICS2_NAME
- Y-axis = SALARY_FROM
- Grouped by = EMPLOYMENT_TYPE_NAME
- Dataframe = data_selected

✅ Code using Plotly Express:

```{python}
#| echo: true
#| eval: false

import plotly.express as px

fig = px.box(
    data_selected,
    x="NAICS2_NAME",
    y="SALARY_FROM",
    color="EMPLOYMENT_TYPE_NAME",
    title="Salary Distribution by Industry and Employment Type",
    points="all",  # Optional: shows all data points
    notched=True   # Optional: adds notches to box for median CI
)

fig.update_layout(
    xaxis_title="Industry (NAICS2)",
    yaxis_title="Starting Salary",
    boxmode="group",  # Groups boxes side by side
    xaxis_tickangle=45  # Rotate x-axis labels for better readability
)

fig.show()
```


## Additional Prompt
make the height taller, change the colors to purple, blue and green. make the font garamond

## Additional Response
You can make the figure taller with the height parameter, set the box colors with the color_discrete_sequence argument, and set the font to Garamond through the font parameter of update_layout().

Here's the updated code with:

- Increased height
- Custom colors: purple, blue, green
- Garamond font

✅ Final Updated Code

```{python}
#| echo: true
#| eval: false

import plotly.express as px

fig = px.box(
    data_selected,
    x="NAICS2_NAME",
    y="SALARY_FROM",
    color="EMPLOYMENT_TYPE_NAME",
    title="Salary Distribution by Industry and Employment Type",
    points="all",  # Show all points
    notched=True,  # Notched boxes
    height=700,  # Taller figure
    color_discrete_sequence=["purple", "blue", "green"]  # Custom colors
)

fig.update_layout(
    xaxis_title="Industry (NAICS2)",
    yaxis_title="Starting Salary",
    boxmode="group",  # Grouped box plots
    xaxis_tickangle=45,  # Rotate x-axis labels
    font=dict(
        family="Garamond, serif",  # Set font to Garamond
        size=14  # Optional: adjust font size
    )
)

fig.show()
```

# Salary Analysis by ONET Occupation Type Table

## Prompt
Using Python, create a table of median "SALARY" by ONET_NAME. also include a count of records for each ONET_NAME, call the column Job_Postings

## Response
To create a table of the median "SALARY" grouped by "ONET_NAME", and include a count of records (as "Job_Postings"), you can use pandas in Python. Here's how you can do it:

✅ Sample Code

```{python}
#| echo: true
#| eval: false

import pandas as pd

# Sample: Load your DataFrame
# df = pd.read_csv('your_data.csv')  # Or however your DataFrame is created

# Group by ONET_NAME, calculate median salary and count of job postings
result = df.groupby('ONET_NAME').agg(
    SALARY_Median=('SALARY', 'median'),
    Job_Postings=('SALARY', 'count')
).reset_index()

# Optional: Sort by Job_Postings or SALARY_Median if needed
# result = result.sort_values(by='Job_Postings', ascending=False)

# Display the result
print(result)
```




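For reference, here is a minimal sketch of how the pandas named-aggregation syntax above behaves when df really is a pandas DataFrame. The sample rows and the variable name postings are hypothetical.

```python
import pandas as pd

# Hypothetical postings; the real data has many more rows and columns.
postings = pd.DataFrame({
    "ONET_NAME": ["Data Scientists", "Data Scientists", "Data Scientists", "Statisticians"],
    "SALARY": [100000, 120000, None, 90000],
})

# Named aggregation: each keyword becomes an output column.
result = postings.groupby("ONET_NAME").agg(
    SALARY_Median=("SALARY", "median"),
    Job_Postings=("SALARY", "count"),  # counts non-null salaries only
).reset_index()

print(result)
```

Because count skips missing values, the posting with no salary is left out of Job_Postings in this sketch.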
## Additional Prompt
I got the error TypeError: GroupedData.agg() got an unexpected keyword argument 'SALARY_Median'

## Additional Response

Ah, the error message you're getting:

```text
TypeError: GroupedData.agg() got an unexpected keyword argument 'SALARY_Median'
```


suggests you're not using pandas, but PySpark instead. In PySpark, the syntax for .agg() is different — it doesn't accept keyword arguments with new column names directly.

✅ Correct PySpark Version

Here’s how you can do what you need in PySpark:

```{python}
#| echo: true
#| eval: false

from pyspark.sql import functions as F

# Group by ONET_NAME, then compute the approximate median salary
# (via percentile_approx) and the number of postings per group
result = df.groupBy("ONET_NAME").agg(
    F.expr("percentile_approx(SALARY, 0.5)").alias("SALARY_Median"),
    F.count("*").alias("Job_Postings")
)

# Show the result
result.show()
```


# Salary Analysis by ONET Occupation Type Plot

## Prompt
Using Plotly Create a bubble chart where dataframe = onet_salary X-axis = ONET_NAME Y-axis = Median Salary Size = Job_Postings Apply custom colors and font styles.

## Response

Here's how you can create a bubble chart using Plotly with the following settings:

- DataFrame: onet_salary
- X-axis: ONET_NAME
- Y-axis: Median Salary
- Bubble Size: Job_Postings
- Custom Colors & Font Styles: Applied using Plotly's update_layout

✅ Prerequisites

Ensure you have Plotly installed:

```bash
pip install plotly
```


✅ Sample Code

```{python}
#| echo: true
#| eval: false

import plotly.express as px

# Sample usage (assumes 'onet_salary' is a DataFrame with relevant columns)
fig = px.scatter(
    onet_salary,
    x="ONET_NAME",
    y="Median Salary",
    size="Job_Postings",
    color="Median Salary",  # You can change this to another column if desired
    hover_name="ONET_NAME",
    size_max=60,
    title="Bubble Chart of ONET Median Salaries vs Job Postings",
)

# Customize layout: fonts, colors, background
fig.update_layout(
    title_font=dict(family="Arial Black", size=24, color="darkblue"),
    font=dict(family="Verdana", size=14, color="black"),
    plot_bgcolor="white",
    paper_bgcolor="#f7f7f7",
    xaxis=dict(title="ONET Job Titles", tickangle=45),
    yaxis=dict(title="Median Salary ($)", gridcolor="#e5e5e5"),
)

# Show the figure
fig.show()
```

# Salary by Education Level Table

## Prompt
Create a table using pyspark from dataframe df_selected where there is an EDU_GROUP column, which is a grouping of column "EDUCATION_LEVELS_NAME". If EDUCATION_LEVELS_NAME contains No Education Listed, GED, or Associate degree then "Associate's or lower". If EDUCATION_LEVELS_NAME contains Bachelor's degree then Bachelor’s. If EDUCATION_LEVELS_NAME contains Master's degree then Master's. If EDUCATION_LEVELS_NAME contains Ph.D. or professional degree then PhD. ALSO Include LOT_V6_SPECIALIZED_OCCUPATION_NAME Average_Salary and MAX_YEARS_EXPERIENCE in the table

## Response
To create a table in PySpark from df_selected with the specified logic for EDU_GROUP, and include the requested columns, follow these steps:

✅ Steps Overview

1. Define EDU_GROUP using when and col from pyspark.sql.functions
2. Select required columns
3. Create a temporary view or save as a table if needed

```{python}
#| echo: true
#| eval: false

from pyspark.sql.functions import col, when

# Create the EDU_GROUP column based on EDUCATION_LEVELS_NAME
df_with_edu_group = df_selected.withColumn(
    "EDU_GROUP",
    when(
        col("EDUCATION_LEVELS_NAME").rlike("(?i)No Education Listed|GED|Associate"),
        "Associate's or lower"
    ).when(
        col("EDUCATION_LEVELS_NAME").rlike("(?i)Bachelor"),
        "Bachelor's"
    ).when(
        col("EDUCATION_LEVELS_NAME").rlike("(?i)Master"),
        "Master's"
    ).when(
        col("EDUCATION_LEVELS_NAME").rlike("(?i)Ph\\.D\\.|professional degree"),
        "PhD"
    ).otherwise("Other")  # Optional: handle unmatched entries
)

# Select required columns
final_df = df_with_edu_group.select(
    "EDU_GROUP",
    "LOT_V6_SPECIALIZED_OCCUPATION_NAME",
    "Average_Salary",
    "MAX_YEARS_EXPERIENCE"
)

# Optional: Create a temporary view or save as a table
final_df.createOrReplaceTempView("education_summary_view")

# Or save as permanent Hive table (optional and requires Hive support)
# final_df.write.mode("overwrite").saveAsTable("education_summary")
```

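Because chained when() clauses are evaluated in order, a value that lists several degrees falls into the first matching group. The plain-Python sketch below mirrors that precedence with the re module so the rule order can be checked without a Spark session. The function name edu_group, the RULES list, and the sample strings are hypothetical and not part of the PySpark code.

```python
import re

# Rules in the same order as the chained when() calls; first match wins.
RULES = [
    (r"No Education Listed|GED|Associate", "Associate's or lower"),
    (r"Bachelor", "Bachelor's"),
    (r"Master", "Master's"),
    (r"Ph\.D\.|professional degree", "PhD"),
]

def edu_group(value):
    """Return the first matching group label, or "Other" if nothing matches."""
    if value is None:
        return "Other"
    for pattern, label in RULES:
        if re.search(pattern, value, flags=re.IGNORECASE):
            return label
    return "Other"

# A value listing two degrees is assigned to the earlier rule.
print(edu_group("Bachelor's degree, Master's degree"))
print(edu_group("Ph.D. or professional degree"))
```

If the higher degree should win instead, the rules would need to be listed from PhD down to Associate's.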



# Salary by Education Level Plot

## Prompt
Plot scatter plot using plotly where colors are EDU_GROUP (Associate's or lower is yellow, Bachelor's is green, Master's is blue, and PhD is purple), dataframe = final_df, x-axis is MAX_YEARS_EXPERIENCE (with jitter), y-axis is Average_Salary. Title is Garamond size 24 and all other text is Garamond size 12. All text is black.

## Response
Here's how you can create a Plotly scatter plot from a PySpark DataFrame (final_df) with the following requirements:

✅ Requirements Recap:

- x-axis: MAX_YEARS_EXPERIENCE with jitter
- y-axis: Average_Salary
- Color by: EDU_GROUP
    - "Associate's or lower" → yellow
    - "Bachelor's" → green
    - "Master's" → blue
    - "PhD" → purple
- Title: font Garamond, size 24, black
- All other text: Garamond, size 12, black

✅ Full Code (with Explanation)

```{python}
#| echo: true
#| eval: false

import plotly.express as px
import numpy as np

# Step 1: Convert PySpark DataFrame to Pandas
pdf = final_df.toPandas()

# Step 2: Add jitter to MAX_YEARS_EXPERIENCE
np.random.seed(0)
jitter_strength = 0.3
pdf['JITTERED_EXPERIENCE'] = pdf['MAX_YEARS_EXPERIENCE'] + np.random.uniform(
    -jitter_strength, jitter_strength, size=len(pdf)
)

# Step 3: Define custom color mapping
color_map = {
    "Associate's or lower": 'yellow',
    "Bachelor's": 'green',
    "Master's": 'blue',
    "PhD": 'purple'
}

# Step 4: Create the Plotly scatter plot
fig = px.scatter(
    pdf,
    x='JITTERED_EXPERIENCE',
    y='Average_Salary',
    color='EDU_GROUP',
    color_discrete_map=color_map,
    title="Average Salary vs. Max Years of Experience by Education Group",
    labels={
        'JITTERED_EXPERIENCE': 'Max Years of Experience (jittered)',
        'Average_Salary': 'Average Salary',
        'EDU_GROUP': 'Education Group'
    },
    opacity=0.7
)

# Step 5: Update layout with Garamond font and sizes
fig.update_layout(
    title_font=dict(family='Garamond', size=24, color='black'),
    font=dict(family='Garamond', size=12, color='black'),
    legend_title_font=dict(family='Garamond', size=12, color='black'),
    legend_font=dict(family='Garamond', size=12, color='black')
)

# Step 6: Show the figure
fig.show()
```

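The jitter step can be checked in isolation. This sketch uses a hypothetical array of experience values and a seeded NumPy generator (an assumption; the code above uses np.random.seed instead) to confirm that uniform jitter of strength 0.3 keeps each point within about 0.3 of its original value, so points from neighbouring whole years stay visually separate.

```python
import numpy as np

# Hypothetical experience values in whole years.
experience = np.array([0, 1, 1, 2, 5, 5, 5, 10], dtype=float)

rng = np.random.default_rng(0)  # seeded generator for reproducibility
jitter_strength = 0.3
jittered = experience + rng.uniform(-jitter_strength, jitter_strength, size=experience.size)

# Largest displacement introduced by the jitter.
max_shift = np.abs(jittered - experience).max()
print(max_shift)
```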

# Salary by Remote Work Type Histograms

## Prompt
create a plotly histogram where the x-axis is "MAX_YEARS_EXPERIENCE" split into groups of 1, and the y axis is "Average_Salary", dataframe is remote_df, and group by REMOTE_TYPE_NAME

## Response
To create a Plotly histogram where:

- x = MAX_YEARS_EXPERIENCE (binned by 1 year),
- y = Average_Salary,
- grouped by = REMOTE_TYPE_NAME,

you’ll want to use Plotly Express with histfunc='avg' so that it computes the average salary per bin. Note that nbins only sets the maximum number of bins, so Plotly may still choose wider bins; for exact 1-year groups, bin the experience column manually before plotting.

✅ Option 1: Use Plotly’s Built-in Histogram Aggregation

This is the simplest approach:

```{python}
#| echo: true
#| eval: false

import plotly.express as px

fig = px.histogram(
    remote_df,
    x="MAX_YEARS_EXPERIENCE",
    y="Average_Salary",
    color="REMOTE_TYPE_NAME",
    histfunc="avg",               # compute average salary per bin
    nbins=int(remote_df['MAX_YEARS_EXPERIENCE'].max()) + 1,
    title="Average Salary by Years of Experience and Remote Type",
    labels={
        'MAX_YEARS_EXPERIENCE': 'Max Years of Experience',
        'Average_Salary': 'Average Salary',
        'REMOTE_TYPE_NAME': 'Remote Work Type'
    },
    barmode='group'  # group bars side by side
)

fig.update_layout(
    xaxis_title="Max Years of Experience",
    yaxis_title="Average Salary",
    bargap=0.2
)

fig.show()
```


## Additional Prompt
I got the error TypeError: 'Column' object is not callable

## Additional Response
Thanks. That error suggests you're passing a PySpark DataFrame (remote_df) directly to Plotly, which doesn't work because Plotly expects a pandas DataFrame, not a PySpark one.

The error likely came from this line:

```{python}
#| echo: true
#| eval: false

nbins=int(remote_df['MAX_YEARS_EXPERIENCE'].max()) + 1
```


In PySpark, remote_df['MAX_YEARS_EXPERIENCE'] returns a Column object rather than a pandas Series. A Column has no max() method, so .max is itself resolved as another Column, and calling it raises TypeError: 'Column' object is not callable.

✅ Fix: Convert to Pandas First

You must convert your PySpark DataFrame to Pandas before plotting with Plotly:

✔️ Step-by-step Fix

```{python}
#| echo: true
#| eval: false

# Step 1: Convert to Pandas
pdf = remote_df.toPandas()

# Step 2: Create plot using Plotly
import plotly.express as px

fig = px.histogram(
    pdf,
    x="MAX_YEARS_EXPERIENCE",
    y="Average_Salary",
    color="REMOTE_TYPE_NAME",
    histfunc="avg",
    nbins=int(pdf['MAX_YEARS_EXPERIENCE'].max()) + 1,
    barmode='group',
    title="Average Salary by Years of Experience and Remote Type",
    labels={
        'MAX_YEARS_EXPERIENCE': 'Max Years of Experience',
        'Average_Salary': 'Average Salary',
        'REMOTE_TYPE_NAME': 'Remote Work Type'
    }
)

fig.update_layout(
    xaxis=dict(dtick=1),
    yaxis_title="Average Salary",
    bargap=0.2
)

fig.show()
```

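Plotly treats nbins as an upper bound rather than an exact bin count, so one way to guarantee 1-year groups is to aggregate in pandas before plotting. The sketch below uses hypothetical sample rows in place of the converted DataFrame; the EXPERIENCE_BIN column name is also an assumption. It floors experience to whole years and averages salary per year and remote type.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the converted pandas DataFrame.
pdf = pd.DataFrame({
    "MAX_YEARS_EXPERIENCE": [1.0, 1.5, 2.0, 2.0, 3.2],
    "Average_Salary": [60000, 70000, 80000, 90000, 100000],
    "REMOTE_TYPE_NAME": ["Remote", "Remote", "Remote", "Onsite", "Onsite"],
})

# Floor experience to whole years so each bin spans exactly one year.
pdf["EXPERIENCE_BIN"] = np.floor(pdf["MAX_YEARS_EXPERIENCE"]).astype(int)

# Average salary per (bin, remote type) pair.
binned = (
    pdf.groupby(["EXPERIENCE_BIN", "REMOTE_TYPE_NAME"], as_index=False)["Average_Salary"]
    .mean()
)

print(binned)
```

Passing binned to px.bar with x="EXPERIENCE_BIN", y="Average_Salary", color="REMOTE_TYPE_NAME", and barmode="group" would give a comparable grouped chart with one bar group per year.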