In [33]:
import pandas as pd
import numpy as np

# Say we have a dataframe with lots of cars
(or any other kind of entries). For each entry we know the brand, price and power, but some of the power values are missing.

In [34]:
autos = pd.read_csv('ebay_cars_power.csv',index_col=0)
autos

Unnamed: 0,brand,price,power
0,peugeot,5000.0,158
1,bmw,8500.0,286
2,volkswagen,8990.0,102
3,smart,4350.0,71
4,ford,1350.0,0
...,...,...,...
49995,audi,24900.0,239
49996,opel,1980.0,75
49997,fiat,13200.0,69
49998,audi,22900.0,150


In [35]:
len(autos[autos['power']==0])

3623

The dataset has 43626 rows, and in 3623 of them the power value is missing (recorded as 0). We could fill the missing entries with the average power of the whole dataset, but an average Fiat or Smart won't have the same power as an average Porsche or Mercedes. It's better to fill each missing value with the average power of its own brand.
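To see why the per-brand average is the better fill value, here is a minimal sketch on a tiny made-up dataframe. The `toy` data and its numbers are invented for illustration and are not taken from `ebay_cars_power.csv`.

```python
import pandas as pd

# Toy data, made up for illustration (0 means "power is missing")
toy = pd.DataFrame({
    "brand": ["fiat", "fiat", "fiat", "porsche", "porsche", "porsche"],
    "power": [60, 70, 0, 300, 340, 0],
})

# Only rows with a recorded power value should contribute to the averages
known = toy[toy["power"] != 0]

# One average for everything vs. one average per brand
overall_mean = known["power"].mean()
brand_means = known.groupby("brand")["power"].mean()

print(overall_mean)
print(brand_means)
```

In this toy example the overall average sits between the two brands and fits neither of them, while the per-brand averages stay close to the known values of each brand.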

# Sidenote: average power, before we modify the power column

In [36]:
autos['power'].mean()

118.82900105441709

# Solution
Let's collect all the unique brands, then loop over them.

In [37]:
# Collect all the unique brand names:
brand_names = autos["brand"].unique()
# Loop over every brand:
for car in brand_names:
    # Mask of the rows of this brand where power is missing (recorded as 0)
    mask_car = (autos['power'] == 0) & (autos['brand'] == car)
    # Average power of this brand, computed from the non-zero values only
    mean_power = autos[(autos['brand'] == car) & (autos['power'] != 0)]['power'].mean()
    # Replace the masked values with the brand average:
    autos['power'] = autos['power'].mask(mask_car, mean_power)

In [38]:
len(autos[autos['power']==0])

0
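The loop works fine, but the same idea can also be written without an explicit loop. Below is a hedged sketch using `groupby(...).transform("mean")` on a small made-up dataframe (`toy` is an assumption for illustration, not the real `autos` data). It is an alternative to the loop, not a replacement you have to use.

```python
import pandas as pd

# Toy data, made up for illustration (0 means "power is missing")
toy = pd.DataFrame({
    "brand": ["fiat", "fiat", "fiat", "porsche", "porsche", "porsche"],
    "power": [60, 70, 0, 300, 340, 0],
})

# Turn the zeros into NaN so that mean() ignores them
power = toy["power"].mask(toy["power"] == 0)

# Per-brand mean, repeated on every row of that brand
brand_mean = power.groupby(toy["brand"]).transform("mean")

# Fill only the missing entries with the mean of their brand
toy["power"] = power.fillna(brand_mean)

print(toy)
```

One thing to keep in mind with either version: if a brand has no known power value at all, its average is NaN, so its missing entries end up as NaN rather than 0.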

# As expected, the average power for the whole dataset went up:

In [39]:
autos['power'].mean()

129.15194458438143

# Voilà
Now you can apply the same approach to any category column and any value column you need to fill.
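As a rough sketch of how that could look for arbitrary columns, here is a small helper. The function name `fill_with_group_mean`, its parameters and the toy data are all hypothetical, made up for this example; it assumes missing values are marked with a placeholder such as 0.

```python
import pandas as pd

def fill_with_group_mean(df, group_col, value_col, missing=0):
    """Return a copy of df where entries equal to `missing` in value_col
    are replaced by the mean of the known values in the same group."""
    out = df.copy()
    # Turn the placeholder into NaN so that it is ignored by mean()
    known = out[value_col].mask(out[value_col] == missing)
    # Mean per group, repeated on every row of that group
    group_mean = known.groupby(out[group_col]).transform("mean")
    out[value_col] = known.fillna(group_mean)
    return out

# Toy data, made up for illustration: this time the price is missing
toy = pd.DataFrame({
    "brand": ["opel", "opel", "opel", "audi", "audi"],
    "price": [2000.0, 0.0, 3000.0, 20000.0, 0.0],
})

filled = fill_with_group_mean(toy, "brand", "price")
print(filled)
```

The helper returns a new dataframe and leaves the original untouched, which makes it easy to compare the column before and after filling.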