- After identifying outliers in the data,  
  a cleaning step is needed:  
  either delete every row that contains an outlier, or  
  replace only the outlier values with a missing value or a specific value so that the data in the other columns is preserved.
- To replace outliers, use the pandas "replace" method or the "np.where" function.
- When an outlier is defined not by a single value but by a condition or range,  
  "np.where" is more convenient than "replace".
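- The difference between the two approaches can be sketched on a small made-up table.  
  The "toy" DataFrame, its "length" column, and the -999 error code are assumptions for illustration only, not part of the sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; -999 stands in for a single known error code.
toy = pd.DataFrame({"length": [12.0, -999.0, 18.5, 40.2, 3.1]})

# A single known value: Series.replace is enough.
toy["by_replace"] = toy["length"].replace(-999.0, np.nan)

# A condition or range: build a boolean mask and pass it to np.where.
is_outlier = (toy["length"] <= 5) | (toy["length"] >= 35)
toy["by_where"] = np.where(is_outlier, np.nan, toy["length"])

# Alternative: delete the rows that contain an outlier.
toy_dropped = toy[~is_outlier]
```

- Here "replace" only catches the exact value -999, while the mask used with "np.where" catches every value in the range.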

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_excel("data/sample_data_outlier.xlsx")
print(df.shape)
df.head()

(100, 2)


,id_banana,length_banana
0,1,11.9
1,2,17.69
2,3,16.06
3,4,12.34
4,5,17.53


- In the sample data of banana lengths,  
  treat any "length_banana" value of 5 or less, or 35 or more, as an error/outlier.

In [3]:
df[(df['length_banana']<=5)|(df['length_banana']>=35)]

,id_banana,length_banana
77,78,36.6
98,99,0.0
99,100,3.1
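
- Because the same condition is reused in the following cells, it may help to define it once.  
  The sketch below uses a hypothetical helper "outlier_mask" and made-up toy values rather than the sample file.

```python
import pandas as pd

def outlier_mask(s, low=5, high=35):
    """Hypothetical helper: True where a value is at or beyond either bound."""
    return (s <= low) | (s >= high)

# Hypothetical toy data, not the banana sample file.
toy = pd.DataFrame({"length_banana": [11.9, 36.6, 0.0, 3.1, 17.5]})

mask = outlier_mask(toy["length_banana"])
n_outliers = int(mask.sum())   # number of flagged rows
outlier_rows = toy[mask]       # same kind of selection as the filter above
```

- Comparisons against NaN evaluate to False, so missing values are not flagged by this mask.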


- Using the "np.where" function as shown below,  
  the three values that fall in the outlier range can be replaced  
  with np.nan, the number 0, or the mean of "length_banana".

In [4]:
# replaced with np.nan

df["length_banana_1"] = np.where((df['length_banana']<=5)|(df['length_banana']>=35),
                                 np.nan,
                                 df["length_banana"])

In [5]:
# replaced with the number 0

df["length_banana_2"] = np.where((df['length_banana']<=5)|(df['length_banana']>=35),
                                 0,
                                 df["length_banana"])

In [6]:
# replaced with the mean of "length_banana"
# (note: this mean is computed over all rows, outliers included)

df["length_banana_3"] = np.where((df['length_banana']<=5)|(df['length_banana']>=35),
                                 df['length_banana'].mean(),
                                 df["length_banana"])

In [7]:
df.loc[[77,98,99]]

,id_banana,length_banana,length_banana_1,length_banana_2,length_banana_3
77,78,36.6,,0.0,16.6569
98,99,0.0,,0.0,16.6569
99,100,3.1,,0.0,16.6569
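
- The replacement mean above is computed from every row, so the outliers themselves influence it.  
  One possible variation, sketched on made-up toy values, computes the mean from the non-outlier rows only and applies it with the pandas "mask" method. The column name "length_banana_clean" is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the banana sample file.
toy = pd.DataFrame({"length_banana": [10.0, 20.0, 0.0, 40.0]})
is_outlier = (toy["length_banana"] <= 5) | (toy["length_banana"] >= 35)

# Mean over all rows (outliers included) vs. mean over non-outlier rows only.
mean_all = toy["length_banana"].mean()
mean_clean = toy.loc[~is_outlier, "length_banana"].mean()

# Series.mask replaces values where the condition is True.
toy["length_banana_clean"] = toy["length_banana"].mask(is_outlier, mean_clean)
```

- Which mean is appropriate depends on the analysis; this is one option, not the only reasonable choice.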
