# Data Cleaning and Imputation
06/29/2025

We use the salaries dataframe for this.

```python
import pandas as pd

df_salaries = pd.read_csv('datasets/salaries.csv')
df_salaries.head()
```

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/salaries.csv'

```python
df_salaries.groupby('Experience')['Salary_USD'].mean()
```

## Addressing Missing Data

### Checking For Missing Values

- We chain: `.isna()` and `.sum()`

Example:
`df.isna().sum()`
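
A minimal sketch of this check on a small made-up frame. The column names and values below are illustrative assumptions, not taken from the salaries data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the salaries dataset
df = pd.DataFrame({
    "experience": ["Junior", "Senior", None, "Senior"],
    "salary": [50_000.0, np.nan, 62_000.0, np.nan],
})

# .isna() marks each missing cell as True; .sum() counts the Trues per column
missing_counts = df.isna().sum()
print(missing_counts)

# Taking .mean() of the same boolean mask gives the share of missing values per column
missing_share = df.isna().mean()
print(missing_share)
```

The share per column is handy later on when comparing against a percentage threshold.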

Always ask yourself why and when the values went missing, since the answer can tell you something about the data. **Doing so helps you identify what kind of missing data you have.**

One way to check is to **isolate** the rows with missing values from the rows without them and run `.describe()` on each.

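The three kinds of missing data:

- `missing completely at random` (MCAR): the missingness is unrelated to any values, observed or missing
- `missing at random` (MAR): the missingness depends on other observed columns, not on the missing value itself
- `missing not at random` (MNAR): the missingness depends on the missing value itself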
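
The isolate-and-describe idea can be sketched as follows. The frame, its column names, and its values are assumptions made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: salary is missing for some rows
df = pd.DataFrame({
    "years_experience": [1, 2, 3, 10, 12, 15],
    "salary": [40_000.0, 45_000.0, 50_000.0, np.nan, np.nan, 120_000.0],
})

# Boolean mask: True where salary is missing
is_missing = df["salary"].isna()

# Summarise another column separately for the missing and non-missing rows
missing_summary = df.loc[is_missing, "years_experience"].describe()
present_summary = df.loc[~is_missing, "years_experience"].describe()

print(missing_summary)
print(present_summary)
```

If the two summaries differ a lot, as they do in this toy example, the missingness probably depends on other columns, which suggests the data is not missing completely at random.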

### Strategies when Dealing w/ Missing Data

- Drop missing values
  - Drop them if the missing values amount to 5% or less of all values
- Impute them (mean, median, or mode)
  - This depends on distribution and context
- Impute by sub-group
  - Different subgroups have different summary statistics, so impute each missing value with the statistic of the subgroup it belongs to

**Dropping Missing Values:**
1. Calculate the threshold (5% of the dataset)
   ```python
   threshold = len(df) * 0.05
   ```
2. Determine the columns whose count of missing values is at or below the threshold
   ```python
   cols_to_drop = df.columns[df.isna().sum() <= threshold]
   ```
3. Drop the rows that have missing values in those columns
   ```python
   df.dropna(subset=cols_to_drop, inplace=True)
   ```
   - According to the documentation, `subset` takes the labels to check along the other axis: when dropping rows, these are the columns to inspect for missing values
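
Put together, the three steps might look like this on a small made-up frame. The columns `a` and `b` and their values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 40 rows, "a" has 1 missing value, "b" has 10
df = pd.DataFrame({
    "a": [np.nan] + [1.0] * 39,
    "b": [np.nan] * 10 + [2.0] * 30,
})

# 5% of 40 rows
threshold = len(df) * 0.05

# Columns whose missing count is at or below the threshold
cols_to_drop = df.columns[df.isna().sum() <= threshold]

# Drop rows that are missing a value in any of those columns
df = df.dropna(subset=cols_to_drop)

print(list(cols_to_drop))
print(len(df))
print(df.isna().sum())
```

Only `a` falls under the threshold, so only the row where `a` is missing is removed. Column `b` still has missing values afterwards and is a candidate for imputation instead.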


**Imputing:**
- This can be done with scikit-learn as well (for example `sklearn.impute.SimpleImputer`)

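1. Find columns with missing values
   ```python
   cols_with_na = df.columns[df.isna().sum() > 0]
   ```
2. Loop through the columns and fill each one with its imputed value (here, the mode)
   ```python
   for col in cols_with_na:
       # .fillna() returns a new Series, so assign it back to the column
       df[col] = df[col].fillna(df[col].mode()[0])
   ```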
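
Since the right statistic depends on distribution and context, one possible variation picks the fill value by column type. This is only a sketch under assumptions: the toy frame is made up, and using the median for numeric columns and the mode for everything else is one reasonable choice, not a rule:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
df = pd.DataFrame({
    "salary": [40_000.0, 42_000.0, np.nan, 45_000.0, 400_000.0],
    "level": ["Junior", "Junior", "Senior", None, "Junior"],
})

cols_with_na = df.columns[df.isna().sum() > 0]

for col in cols_with_na:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Median is less affected by the 400,000 outlier than the mean
        fill_value = df[col].median()
    else:
        # Mode: the most frequent category
        fill_value = df[col].mode()[0]
    # Assign the result back, because .fillna() returns a new Series
    df[col] = df[col].fillna(fill_value)

print(df)
```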

**Imputing by Subgroups:**

1. Compute the aggregate you want to use for each group
   ```python
   mean_series = df.groupby('category')['numerical_col'].mean()
   ```
2. Convert the series into a dictionary
   ```python
   mean_dict = mean_series.to_dict()
   ```
3. Fill each missing value with its group's statistic using `.fillna()`
   ```python
   df['numerical_col'] = df['numerical_col'].fillna(df['category'].map(mean_dict))
   ```
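
A runnable sketch of the subgroup steps, using a made-up frame that reuses the placeholder column names `category` and `numerical_col`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in each category
df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "numerical_col": [10.0, 20.0, np.nan, 100.0, np.nan, 300.0],
})

# Mean per category; .mean() skips missing values by default
mean_dict = df.groupby("category")["numerical_col"].mean().to_dict()

# .map() looks up each row's category in the dictionary,
# and .fillna() uses that value only where numerical_col is missing
df["numerical_col"] = df["numerical_col"].fillna(df["category"].map(mean_dict))

print(mean_dict)
print(df)
```

Each missing value receives the mean of its own category rather than the overall mean.
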

**What if there is more than one grouping column?**

- You can call `.transform()` on a `groupby()` with multiple grouping columns:

1. Group by multiple columns and compute the group mean
   ```python
   mean_series = df.groupby(['store_id', 'dep_no'])['sales'].transform('mean')
   ```
2. Use `.fillna()` to impute missing values with the corresponding group mean
   ```python
   df['sales'] = df['sales'].fillna(mean_series)
   ```
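
A small worked sketch of the multi-column case. The frame below is made up for illustration and reuses the `store_id`, `dep_no`, and `sales` names from the steps above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: missing sales in two different (store, department) groups
df = pd.DataFrame({
    "store_id": [1, 1, 1, 1, 2, 2],
    "dep_no": [10, 10, 20, 20, 10, 10],
    "sales": [100.0, np.nan, 50.0, 70.0, np.nan, 30.0],
})

# transform('mean') returns one value per original row, aligned on the index,
# so no dictionary or .map() step is needed
group_means = df.groupby(["store_id", "dep_no"])["sales"].transform("mean")

# Fill only the missing entries with their group's mean
df["sales"] = df["sales"].fillna(group_means)

print(group_means.tolist())
print(df)
```

Because the transformed series has the same length and index as `df`, it can be passed straight to `.fillna()`.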