# Pandas for ML Engineers: AI/ML Salaries Dataset Analysis

<div style="background-color:#1e293b;padding:15px;border-left:6px solid #38bdf8;color:#e2e8f0">

This notebook is **your hands-on playground** to master essential pandas operations and best practices every Machine Learning engineer should know.  
You'll work through exercises, explore a real-world dataset (the AI/ML Salaries Dataset), and learn how to write clean, efficient, production-ready pandas code.

---

<h4> 💪 Why this matters </h4>
For an ML engineer, **data wrangling takes up a large share of the job** (80% is the figure often quoted). The faster and cleaner you can manipulate data:
- The quicker you get to meaningful insights.
- The fewer bugs you introduce.
- The smoother your pipeline runs in production.

Here, you'll learn not only *how* to use pandas, but also *why* certain practices matter in the real world.

---

<h4> 🛠 How to use this notebook (especially if you're new to Jupyter Notebooks) </h4>

1. **Run cells one-by-one**  
   - Click on a cell (grey or white box) and press `Shift` + `Enter` to run it.  
   - Code cells execute Python; Markdown cells render formatted text.

2. **Write your answers in the `# TODO` sections**  
   - Each exercise has space for your solution.  
   - Try to solve it before looking at the answer.

3. **Reveal solutions** (only after trying!)  
   - Scroll to the end of an exercise.  
   - Click the small triangle ▶ next to **"Solution"** to expand it.  
   - Compare your answer and learn from any differences.

4. **Restart & Run All**  
   - If you think something's broken, go to **Kernel → Restart & Run All** in the menu to start fresh.

---

<h4> 💡 Tips for success </h4>
- Read the <b>Best Practice</b> boxes carefully: these are the habits that make you an effective ML engineer.
- Experiment! Change parameters, try different methods, and see what happens.
- Save your work periodically.
- If you're new to pandas, keep the [official documentation](https://pandas.pydata.org/pandas-docs/stable/) handy.

---

Let's dive in! 🚀

</div>


<h2 style="color:#2b6cb0">1. Setup & Data Loading</h2>

<div style="background-color:#1e293b;padding:15px;border-left:6px solid #38bdf8;color:#e2e8f0">

<h4> ⚙️ Setup & Environment </h4>

To avoid conflicts with other Python projects, it's recommended to create a virtual environment.

---

<h5> Local Setup (VS Code) </h5>

Step 1: Open a terminal and create a virtual environment (only once):  
🐍 Using Python's built-in venv
```bash
python -m venv .venv
# If that's not working, try
py -m venv .venv

#  Activate (Windows)
.venv\Scripts\activate

#  Activate (Mac/Linux)
source .venv/bin/activate
```
💡 Tip: Always make sure your VS Code kernel is set to the same environment you activated here.  
In VS Code: Command Palette → Python: Select Interpreter → Choose .venv

Step 2: Install dependencies
```bash
pip install pandas
```

Step 3: Select the kernel in the Jupyter notebook:  
In the top right corner of this notebook: Select Kernel → Python Environments → Choose .venv
</div>

In [1]:
import pandas as pd


print("✅ Pandas:", pd.__version__)


ModuleNotFoundError: No module named 'pandas'

# Dataset Structure

The dataset contains one table structured as follows:

**work_year**: The year the salary was paid.

**experience_level**: The experience level in the job during the year with the following possible values:
- EN: Entry-level / Junior
- MI: Mid-level / Intermediate
- SE: Senior-level / Expert
- EX: Executive-level / Director

**employment_type**: The type of employment for the role:
- PT: Part-time
- FT: Full-time
- CT: Contract
- FL: Freelance

**job_title**: The role worked in during the year.

**salary**: The total gross salary amount paid.

**salary_currency**: The currency of the salary paid as an ISO 4217 currency code.

**salary_in_usd**: The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).

**employee_residence**: Employee's primary country of residence during the work year as an ISO 3166 country code.

**remote_ratio**: The overall amount of work done remotely, possible values are as follows:
- 0: No remote work (less than 20%)
- 50: Partially remote
- 100: Fully remote (more than 80%)

**company_location**: The country of the employer's main office or contracting branch as an ISO 3166 country code.

**company_size**: The average number of people that worked for the company during the year:
- S: less than 50 employees (small)
- M: 50 to 250 employees (medium)
- L: more than 250 employees (large)
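
The coded columns described above map naturally onto ordered categoricals. The following is a minimal sketch, assuming the column names and codes listed above; the three sample rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows that follow the schema described above
# (the values are invented for illustration only).
sample = pd.DataFrame({
    "experience_level": ["EN", "SE", "MI"],
    "employment_type": ["FT", "CT", "FT"],
    "remote_ratio": [0, 100, 50],
    "company_size": ["S", "L", "M"],
})

# Human-readable labels for the experience codes.
experience_labels = {
    "EN": "Entry-level",
    "MI": "Mid-level",
    "SE": "Senior-level",
    "EX": "Executive-level",
}
sample["experience_label"] = sample["experience_level"].map(experience_labels)

# Ordered categories make sorting and comparisons follow seniority and size
# instead of alphabetical order.
sample["experience_level"] = pd.Categorical(
    sample["experience_level"], categories=["EN", "MI", "SE", "EX"], ordered=True
)
sample["company_size"] = pd.Categorical(
    sample["company_size"], categories=["S", "M", "L"], ordered=True
)

# Rows now sort from junior to senior.
print(sample.sort_values("experience_level"))

# Ordered categoricals also support comparisons against a category.
mid_or_above = sample["experience_level"] >= "MI"
```

Declaring the order once keeps later sorts and filters consistent, which is easy to get wrong when the codes are plain strings.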

In [2]:
# Load data
df = pd.read_csv('Iris.csv')

In [9]:
# YOUR TURN: Basic checks (TODO)
# 1) Print the shape of the DataFrame
print('shape:', df.shape)
# 2) Show column names and dtypes
print('\ncolumns & dtypes:')
print(df.dtypes)
# 3) Display first 5 rows
print('\nfirst rows:')
display(df.head())
# Extra: summary statistics for the numeric columns
print(df.describe())

shape: (150, 6)

columns & dtypes:
Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species              str
dtype: object

first rows:


   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000


<details>
<summary><b>Solution - click to expand</b></summary>

```python
# SOLUTION
print('shape:', df.shape)
print('\ncolumns & dtypes:')
print(df.dtypes)
print('\nfirst rows:')
display(df.head())
```
</details>

<h2 style="color:#2c7a7b">2. Basic Exploration</h2>

Before building models, it's essential to **understand your dataset's structure and behavior**.  
Exploring basic statistics and distributions helps you:

- Detect **data quality issues** (missing values, outliers, inconsistent scales)
- Identify **feature ranges and variability** (important for scaling and normalization)
- Understand **relationships** that may guide feature engineering
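
Each of these checks takes only a line or two of pandas. The sketch below is illustrative: the toy DataFrame, its column names, and the 1.5 × IQR outlier rule are assumptions chosen for the example, not part of the exercise dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one obvious outlier.
toy = pd.DataFrame({
    "feature_a": [1.0, 1.2, 0.9, np.nan, 1.1, 25.0],
    "feature_b": [100, 110, 95, 105, 98, 102],
})

# Data quality: count missing values per column.
missing_per_column = toy.isna().sum()

# Ranges and variability: min, max, and standard deviation per column.
ranges = toy.agg(["min", "max", "std"])

# Outliers: flag values more than 1.5 * IQR outside the quartiles
# (a common rule of thumb, used here as an assumption).
q1 = toy["feature_a"].quantile(0.25)
q3 = toy["feature_a"].quantile(0.75)
iqr = q3 - q1
is_outlier = (toy["feature_a"] < q1 - 1.5 * iqr) | (toy["feature_a"] > q3 + 1.5 * iqr)

# Relationships: pairwise Pearson correlation between numeric columns.
correlations = toy.corr()

print(missing_per_column)
print(ranges)
print(toy[is_outlier])
```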


In [4]:
# YOUR TURN: Basic exploration (TODO)
# - Run df.info()
df.info()
print('\nSummary statistics:')
# - Run df.describe(include='all')
display(df.describe(include='all'))
# - Show number of missing values per column
print('\nMissing values per column:')
print(df.isna().sum())

<class 'pandas.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    str    
dtypes: float64(4), int64(1), str(1)
memory usage: 7.2 KB

Summary statistics:


               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
count       150.0          150.0         150.0          150.0         150.0          150
unique        NaN            NaN           NaN            NaN           NaN            3
top           NaN            NaN           NaN            NaN           NaN  Iris-setosa
freq          NaN            NaN           NaN            NaN           NaN           50
mean         75.5       5.843333         3.054       3.758667      1.198667          NaN
std     43.445368       0.828066      0.433594        1.76442      0.763161          NaN
min           1.0            4.3           2.0            1.0           0.1          NaN
25%         38.25            5.1           2.8            1.6           0.3          NaN
50%          75.5            5.8           3.0           4.35           1.3          NaN
75%        112.75            6.4           3.3            5.1           1.8          NaN



Missing values per column:
Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64


<details>
<summary><b>Solution - click to expand</b></summary>

```python
# SOLUTION
df.info()
print('\nSummary statistics:')
display(df.describe(include='all'))

print('\nMissing values per column:')
print(df.isna().sum())
```
</details>

<h2 style="color:#276749">3. Filtering & Selection</h2>

Your Mission: Master the art of data selection and indexing.
Use boolean masks and `.loc` for safe selection & assignment.

In [6]:
# YOUR TURN
# TODO: Filter data for specific conditions:
# 1. Iris-setosa species only
setosa = df.loc[df['Species'] == 'Iris-setosa']
print(f"Iris-setosa samples: {len(setosa)} records")
# 2. Sepal length above 5.0 cm
large_sepal = df.loc[df['SepalLengthCm'] > 5.0]
print(f"Large sepal (>5.0cm): {len(large_sepal)} records")
# 3. Petal width above 1.0 cm
wide_petal = df.loc[df['PetalWidthCm'] > 1.0]
print(f"Wide petal (>1.0cm): {len(wide_petal)} records")
# 4. Combine multiple conditions
premium_flowers = df.loc[
    (df['SepalLengthCm'] > 5.0) &
    (df['PetalWidthCm'] > 1.0) &
    (df['Species'] != 'Iris-setosa')
]
print(f"Premium flowers (Large sepal + Wide petal + Not setosa): {len(premium_flowers)} records")
print("\nSpecies breakdown for premium flowers:")
print(premium_flowers['Species'].value_counts())


Iris-setosa samples: 50 records
Large sepal (>5.0cm): 118 records
Wide petal (>1.0cm): 93 records
Premium flowers (Large sepal + Wide petal + Not setosa): 92 records

Species breakdown for premium flowers:
Species
Iris-virginica     49
Iris-versicolor    43
Name: count, dtype: int64



<details>
<summary><b>Solution - click to expand</b></summary>

```python
# SOLUTION
# Filter senior-level employees
senior_employees = df.loc[df['experience_level'] == 'SE']
senior_employees
print(f"Senior employees:  {senior_employees.head()}")
print(f"Senior employees: {len(senior_employees)} records")

# High-salary positions
high_salary = df.loc[df['salary_in_usd'] > 100000]
print(f"High salary positions (>$100K): {len(high_salary)} records")

# Fully remote positions
fully_remote = df.loc[df['remote_ratio'] == 100]
print(f"Fully remote positions: {len(fully_remote)} records")

# Combined conditions - Senior, high salary, fully remote
premium_jobs = df.loc[
    (df['experience_level'] == 'SE') & 
    (df['salary_in_usd'] > 100000) & 
    (df['remote_ratio'] == 100)
]
print(f"Premium jobs (Senior + High Salary + Remote): {len(premium_jobs)} records")

# Display top job titles for premium positions
print("\nTop job titles for premium positions:")
print(premium_jobs['job_title'].value_counts().head())
```
</details>


<div style="background-color:#1e293b;padding:15px;border-left:6px solid #38bdf8;color:#e2e8f0">

<b>💡 Best practice: Use <code>.loc[mask, columns]</code> for filtering and assignment</b>

When selecting rows and columns in Pandas, always use <code>.loc</code> with explicit row and column indexing, like:

```python
df.loc[mask, ['col1']]
```
instead of chained indexing like:

```python
df[mask]['col1']
```
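**Why?** Chained indexing runs two separate indexing operations, and the intermediate result may be a view or a copy. An assignment made through it may therefore never reach the original DataFrame, and pandas may emit a <code>SettingWithCopyWarning</code>. A single <code>.loc</code> call selects and assigns in one step, so the result is predictable.

</div>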
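
To see the single-step pattern in action, here is a small sketch. The toy DataFrame and its column names (`score`, `label`) are hypothetical and only serve to illustrate the indexing style.

```python
import pandas as pd

# Hypothetical toy data (column names are illustrative only).
toy = pd.DataFrame({
    "score": [0.2, 0.9, 0.7],
    "label": ["low", "low", "low"],
})

# Boolean mask: True for the rows we want to select or update.
mask = toy["score"] > 0.5

# Selection: rows and columns are chosen in one indexing step.
high_scores = toy.loc[mask, ["score"]]

# Assignment: a single .loc call writes directly into `toy`.
toy.loc[mask, "label"] = "high"

print(toy)
```

Because the mask and the column are passed to the same `.loc` call, there is no intermediate object for the assignment to get lost in.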


<h2 style="color:#2b6cb0">4. Sorting</h2>

In [7]:
# YOUR TURN
# TODO: Perform the following sorting operations:
# 1. Sort by sepal length in descending order and print top 5 longest
sepal_sorted = df.sort_values('SepalLengthCm', ascending=False)
print("Top 5 longest sepals:")
print(sepal_sorted[['Species', 'SepalLengthCm', 'SepalWidthCm']].head())
# 2. Sort by multiple columns: Species, then SepalLengthCm
multi_sorted = df.sort_values(['Species', 'SepalLengthCm'], ascending = [True, False])
print("\nTop flowers by species and sepal length:")
print(multi_sorted[['Species', 'SepalLengthCm', 'PetalLengthCm']].head(10))
# 3. Find top 10 largest petals (by length)
top_10_petals = df.nlargest(10, 'PetalLengthCm')
print("\nTop 10 longest petals:")
print(top_10_petals[['Species', 'PetalLengthCm', 'PetalWidthCm', 'SepalLengthCm']])


Top 5 longest sepals:
            Species  SepalLengthCm  SepalWidthCm
131  Iris-virginica            7.9           3.8
135  Iris-virginica            7.7           3.0
122  Iris-virginica            7.7           2.8
117  Iris-virginica            7.7           3.8
118  Iris-virginica            7.7           2.6

Top flowers by species and sepal length:
        Species  SepalLengthCm  PetalLengthCm
14  Iris-setosa            5.8            1.2
15  Iris-setosa            5.7            1.5
18  Iris-setosa            5.7            1.7
33  Iris-setosa            5.5            1.4
36  Iris-setosa            5.5            1.3
5   Iris-setosa            5.4            1.7
10  Iris-setosa            5.4            1.5
16  Iris-setosa            5.4            1.3
20  Iris-setosa            5.4            1.7
31  Iris-setosa            5.4            1.5

Top 10 longest petals:
            Species  PetalLengthCm  PetalWidthCm  SepalLengthCm
118  Iris-virginica            6.9           2.3

<details>
<summary><b>Hint - click to expand</b></summary>

Use sort_values
</details>

<details>
<summary><b>Solution - click to expand</b></summary>

```python
# SOLUTION
# 1. Sort by salary (descending)
salary_sorted = df.sort_values('salary_in_usd', ascending=False)
print("Top 5 highest salaries:")
print(salary_sorted[['job_title', 'experience_level', 'salary_in_usd']].head())

# 2. Multi-column sorting: experience level, then salary
multi_sorted = df.sort_values(['experience_level', 'salary_in_usd'], ascending=[True, False])
print("\nTop salaries by experience level:")
print(multi_sorted[['experience_level', 'job_title', 'salary_in_usd']].head(10))

# 3. Top 10 highest paid positions (more efficient than sort_values)
top_10_salaries = df.nlargest(10, 'salary_in_usd')
print("\nTop 10 highest paid positions:")
print(top_10_salaries[['job_title', 'experience_level', 'company_size', 'salary_in_usd']])
```
</details>
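
The sorting tools used in this section can be compared side by side on a tiny example. This is a sketch on a hypothetical toy DataFrame (the `group` and `value` columns are invented); it checks that `nlargest` returns the same rows as sorting and taking the head when there are no ties.

```python
import pandas as pd

# Hypothetical toy data with distinct values, so there are no ties.
toy = pd.DataFrame({
    "group": ["x", "y", "x", "y", "x"],
    "value": [3, 9, 1, 7, 5],
})

# Full sort, then keep the first three rows.
top_sorted = toy.sort_values("value", ascending=False).head(3)

# nlargest selects the top three rows without sorting the whole frame.
top_nlargest = toy.nlargest(3, "value")

# Multi-column sort: group ascending, value descending within each group.
multi_sorted = toy.sort_values(["group", "value"], ascending=[True, False])

print(top_nlargest)
print(multi_sorted)
```

When only the top few rows are needed, `nlargest` states that intent directly and avoids ordering rows you will discard.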


<h2 style="color:#2c7a7b">5. Grouping & Aggregation</h2>

Your Mission: Unleash the power of groupby for insightful aggregations.
Use `groupby` + `agg`. 

In [None]:
# YOUR TURN
# TODO: Calculate average measurements by:
# 1. Species (average sepal length by species)
stats_by_species = (df.groupby('Species')
                    .agg(avg_sepal_length = ('SepalLengthCm', 'mean'))
                    .reset_index().round(2))
print("Average sepal length by species:")
print(stats_by_species)
# 2. Species with multiple aggregations (mean, min, max)
multi_stats_species = (df.groupby('Species')
                       .agg(
                           avg_sepal_length = ('SepalLengthCm', 'mean'),
                           min_petal_length = ('PetalLengthCm', 'min'),
                           max_petal_width = ('PetalWidthCm', 'max')
                       )
                       .reset_index()
                       .sort_values('min_petal_length', ascending = False)
                       .round(2))
print("\nMultiple stats by species:")
print(multi_stats_species)
# 3. Create size categories and group by them
df['size_category'] = pd.cut(df['SepalLengthCm'],
                             bins=[0, 5, 6, 10],
                             labels=['Small', 'Medium', 'Large'])

# observed=True keeps only the categories that actually occur in the data
stats_by_size = (df.groupby('size_category', observed=True)['PetalLengthCm']
                 .agg(avg_petal_length='mean')
                 .reset_index()
                 .sort_values('avg_petal_length', ascending=False)
                 .round(2))
print("\nAverage petal length by size category:")
print(stats_by_size)
# 4. Species AND size category (multi-level groupby)
multi_group = (df.groupby(['Species', 'size_category'], observed=True)['PetalLengthCm']
               .agg(avg_petal_length='mean')
               .reset_index()
               .sort_values('avg_petal_length', ascending=False)
               .round(2))
print("\nPetal length by species and size category:")
print(multi_group)


Average sepal length by species:
           Species  avg_sepal_length
0      Iris-setosa              5.01
1  Iris-versicolor              5.94
2   Iris-virginica              6.59

Multiple stats by species:
           Species  avg_sepal_length  min_petal_length  max_petal_width
2   Iris-virginica              6.59               4.5              2.5
1  Iris-versicolor              5.94               3.0              1.8
0      Iris-setosa              5.01               1.0              0.6

Average petal length by size category:
  size_category  avg_petal_length
2         Large              5.32
1        Medium              3.24
0         Small              1.71

Petal length by species and size category:
           Species size_category  avg_petal_length
7   Iris-virginica         Large              5.68
6   Iris-virginica        Medium              5.01
4  Iris-versicolor         Large              4.58
5   Iris-virginica         Small              4.50
3  Iris-versicolor        Me

<details>
<summary><b>Solution - click to expand</b></summary>

```python
# SOLUTION
salary_by_experience = (df.groupby('experience_level')
                       .agg(avg_salary=('salary_in_usd', 'mean'))
                       .reset_index().round(2))
print("Salary statistics by experience level:")
print(salary_by_experience)

# Average salary by company size 
salary_by_company_size = (df.groupby('company_size')
                         .agg(avg_salary=('salary_in_usd', 'mean'))
                         .reset_index()
                         .sort_values('avg_salary', ascending=False)
                         .round(2))
print("\nAverage salary by company size:")
print(salary_by_company_size)

# Average salary by remote ratio 
salary_by_remote = (df.groupby('remote_ratio')['salary_in_usd']
                   .agg(avg_salary='mean')
                   .reset_index()
                   .sort_values('avg_salary', ascending=False)
                   .round(2))
print("\nAverage salary by remote ratio:")
print(salary_by_remote)

# Multi-level groupby with named aggregation
multi_group = (df.groupby(['experience_level', 'company_size'])['salary_in_usd']
              .agg(
                  avg_salary='mean'
              )
              .reset_index()
              .sort_values('avg_salary', ascending=False)
              .round(2))
print("\nSalary by experience level and company size:")
print(multi_group)
```
</details>

<div style="background-color:#1e293b;padding:15px;border-left:6px solid #38bdf8;color:#e2e8f0">

<b>💡 Best practice: Use <code>groupby</code> + <code>agg</code> with named aggregation and reset the index</b>

In Pandas, the <code>groupby</code> operation is fundamental for summarizing data by categories, such as calculating averages, counts, or other statistics for each group.

---

**Key points for ML engineers:**

- Use the <code>.agg()</code> method with **named aggregation** syntax to create clear, readable summaries with custom column names.  
- After aggregation, use <code>.reset_index()</code> to convert the group keys from the index back into regular columns. This makes the DataFrame easier to merge or use downstream.

---

**Why it matters:**

- Named aggregation makes your code **self-documenting**, so it's easier to understand which statistics are being calculated.
- Resetting the index avoids confusing MultiIndex objects, which can complicate further processing or merging.
- This approach fits well into ML pipelines where feature engineering often requires grouped summary statistics.
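
The points above can be illustrated with a short sketch. The toy DataFrame and its column names (`group`, `tier`, `value`) are hypothetical; the example contrasts the MultiIndex you get without `reset_index()` with the flat result, then merges the group statistics back onto the rows as features.

```python
import pandas as pd

# Hypothetical toy data (column names are illustrative only).
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "tier": ["S", "L", "S", "S", "L"],
    "value": [10.0, 20.0, 30.0, 50.0, 70.0],
})

# Without reset_index, two group keys end up in a MultiIndex.
indexed = toy.groupby(["group", "tier"]).agg(avg_value=("value", "mean"))

# Named aggregation plus reset_index gives a flat, self-describing table.
flat = (toy.groupby(["group", "tier"])
        .agg(avg_value=("value", "mean"), n_rows=("value", "count"))
        .reset_index())

# The flat table merges back onto the original rows as group-level features.
with_features = toy.merge(flat, on=["group", "tier"], how="left")

print(flat)
print(with_features)
```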

---

**Example:**

```python
# Compute the mean of 'salary_in_usd' grouped by 'experience_level'
salary_by_experience = (df.groupby('experience_level')
                       .agg(avg_salary=('salary_in_usd', 'mean'))
                       .reset_index())
```

</div>

<h2 style="color:#2c7a7b">6. Creating New Columns</h2>

Your Mission: Transform raw data into ML-ready features.


<div style="background-color:#1e293b;padding:15px;border-left:6px solid #38bdf8;color:#e2e8f0">

<b>💡 Best practice: Avoid in-place edits on DataFrame slices; use <code>.copy()</code> and return new DataFrames</b>

When you select a subset (slice) of a DataFrame using filtering or indexing, Pandas may return a **view** or a **copy**. Modifying this slice directly can lead to unexpected behavior or the infamous <code>SettingWithCopyWarning</code>.

---

**Why avoid in-place edits on slices?**

- Pandas does not guarantee whether your slice is a view (modifies original data) or a copy (modifies only the slice).
- Modifying a view may cause side effects on the original DataFrame.
- Modifying a copy may not affect the original DataFrame, leading to bugs if you expect it to.
- You may receive warnings, making your code noisy and less reliable.

---

**How to avoid these issues:**

- Always use <code>.copy()</code> when creating slices you intend to modify:

```python
subset = df.loc[mask].copy()
subset['new_col'] = some_transformation
```

</div>
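
Here is a runnable version of that pattern, as a minimal sketch on a hypothetical toy DataFrame. The helper `add_doubled` is an invented name used only to illustrate returning a new DataFrame instead of editing the input.

```python
import pandas as pd

# Hypothetical toy data (the column name is illustrative only).
toy = pd.DataFrame({"value": [1, 5, 10]})
mask = toy["value"] > 2

# Explicit copy: `subset` is independent of `toy`.
subset = toy.loc[mask].copy()
subset["doubled"] = subset["value"] * 2


def add_doubled(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame with a 'doubled' column; the input is not modified."""
    return frame.assign(doubled=frame["value"] * 2)


enriched = add_doubled(toy)

print(subset)
print(enriched)
print(toy)  # still has only the 'value' column
```

Functions that return new DataFrames are easier to test and to chain in a pipeline, because no caller has to worry about its input changing underneath it.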

In [None]:
# TODO: Create the following new features:
# 1. is_setosa (boolean for Iris-setosa species)
# 2. is_large_flower (boolean for SepalLengthCm > 5.5)


In [None]:
# Create new features
df_features = df.copy()

# 1. Setosa species indicator
df_features['is_setosa'] = df['Species'].isin(['Iris-setosa'])
# 2. Large flower indicator
df_features['is_large_flower'] = df['SepalLengthCm'] > 5.5


# Display the new features
print("New features created:")
print(f"\nSetosa flowers: {df_features['is_setosa'].value_counts()}")
print(f"\nLarge flowers: {df_features['is_large_flower'].value_counts()}")

New features created:

Setosa flowers: is_setosa
False    100
True      50
Name: count, dtype: int64

Large flowers: is_large_flower
True     91
False    59
Name: count, dtype: int64


<details>
<summary><b>Solution - click to expand</b></summary>

```python
# SOLUTION
# Create new features
df_features = df.copy()

# 1. Senior position indicator
df_features['is_senior'] = df['experience_level'].isin(['SE', 'EX'])

# 2. Remote friendly indicator
df_features['is_remote_friendly'] = df['remote_ratio'] > 50


# Display the new features
print("New features created:")
print(f"\nSenior positions: {df_features['is_senior'].value_counts()}")
print(f"\nRemote friendly: {df_features['is_remote_friendly'].value_counts()}")
```
</details>