Handling outliers is an important step in data preprocessing. The right method depends on the context and the goals of the analysis. Here are some common methods for handling outliers:

In [1]:
import pandas as pd # type: ignore
import numpy as np # type: ignore

df = pd.read_csv('E:/learning/UT DataScience/Python for Data Science/Session 1/Heart data.csv')
df

Unnamed: 0,Age (age in year),sex,chest pain,blood pressure,cholestoral,blood sugar,electrocardiographic,heart rate,exercise induced,depression,slope,ca,thal,c
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0.0
1,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0.0
2,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0.0
3,56.0,1.0,2.0,120.0,236.0,0.0,0.0,178.0,0.0,0.8,1.0,0.0,3.0,0.0
4,57.0,0.0,4.0,120.0,354.0,0.0,0.0,163.0,1.0,0.6,1.0,0.0,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
592,52.0,1.0,4.0,140.0,266.0,0.0,0.0,134.0,1.0,2.0,2.0,,,1.0
593,43.0,1.0,4.0,140.0,288.0,0.0,0.0,135.0,1.0,2.0,2.0,,,1.0
594,41.0,1.0,4.0,120.0,336.0,0.0,0.0,118.0,1.0,3.0,2.0,,,1.0
595,44.0,1.0,4.0,135.0,491.0,0.0,0.0,135.0,0.0,0.0,,,,1.0


In [6]:
feature = df.columns
feature

Index(['Age (age in year)', 'sex', 'chest pain', 'blood pressure',
       'cholestoral ', 'blood sugar', 'electrocardiographic ', 'heart rate',
       'exercise induced', 'depression ', 'slope', 'ca', 'thal', 'c'],
      dtype='object')

#### 1. Removal of Outliers:

1-1 Manual Removal: Identify and remove outliers manually based on domain knowledge or visual inspection.

In [2]:
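# Manual removal based on domain knowledge or visual inspection:
# here, cholesterol values below 100 are treated as outliers
outlier_mask = df['cholestoral '] < 100
df_manual_removed = df[~outlier_mask]  # keep every row that was not flagged

print("Rows removed manually:\n", df[outlier_mask])

Rows removed manually: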
      Age (age in year)  sex  chest pain  blood pressure  cholestoral   \
227               56.0  1.0         4.0           120.0          85.0   

     blood sugar  electrocardiographic   heart rate  exercise induced  \
227          0.0                    0.0       140.0               0.0   

     depression   slope  ca  thal    c  
227          0.0    NaN NaN   NaN  0.0  
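
The same idea can be written with explicit lower and upper plausibility bounds. The following is a minimal sketch on made-up toy data, not the heart dataset; the bounds `LOWER` and `UPPER` and the column name `cholesterol` are hypothetical and would need to come from domain knowledge.

```python
import pandas as pd

# Toy data standing in for the cholesterol column (values are made up).
toy = pd.DataFrame({"cholesterol": [233.0, 250.0, 85.0, 204.0, 620.0, None]})

# Hypothetical plausibility bounds; choose real ones from domain knowledge.
LOWER, UPPER = 100, 600

# between() is inclusive on both ends by default and is False for NaN,
# so missing values are kept explicitly to avoid dropping them by accident.
col = toy["cholesterol"]
keep = col.between(LOWER, UPPER) | col.isna()

toy_flagged = toy[~keep]  # rows judged implausible
toy_cleaned = toy[keep]   # everything else, including missing values

print(toy_flagged)
print(toy_cleaned.shape)
```

Keeping the boolean mask in its own variable makes it easy to inspect the flagged rows before discarding them.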


1-2 Statistical Removal: Remove outliers based on statistical methods such as Z-scores, IQR (Interquartile Range), etc.

1-2-1 Statistical Removal Using Z-scores
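
A Z-score measures how many standard deviations a value lies from the mean: z = (x - mean) / std. Values whose absolute Z-score exceeds a chosen threshold (3 is a common choice) are treated as outliers. As a minimal sketch on made-up toy data, the calculation can be done directly in pandas; `ddof=0` is used here because it matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Toy series: nineteen ordinary values plus one extreme value (made up).
values = pd.Series([200.0 + i for i in range(19)] + [600.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z = (values - values.mean()) / values.std(ddof=0)

# Flag every value whose absolute Z-score reaches the threshold.
threshold = 3
outliers = values[z.abs() >= threshold]
print(outliers)
```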

In [5]:
from scipy import stats
import numpy as np

# Fill missing cholesterol values with the median so every row gets a Z-score
df_zscore = df.copy()
df_zscore['cholestoral '] = df_zscore['cholestoral '].fillna(df_zscore['cholestoral '].median())

# Keep only the rows of the original data whose absolute Z-score is below the threshold
z_score = np.abs(stats.zscore(df_zscore['cholestoral ']))
threshold = 3
df_zscore_removed = df[z_score < threshold]

print("Original Data:\n", df.shape)
print("Removal using Z-scores:\n", df_zscore_removed.shape)

Original Data:
 (597, 14)
Removal using Z-scores:
 (590, 14)
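
The cell above fills missing values with the median before computing Z-scores. One possible alternative, sketched here on made-up toy data, is to compute the statistics from the observed values only and keep rows with missing values untouched, so that imputation can be handled as a separate step. The column name `cholesterol` is hypothetical.

```python
import pandas as pd

# Toy data: nineteen ordinary values, one extreme value, one missing value.
toy = pd.DataFrame({"cholesterol": [200.0 + i for i in range(19)] + [600.0, None]})
col = toy["cholesterol"]

# pandas skips NaN in mean() and std(), so z is NaN where the value is missing.
z = (col - col.mean()) / col.std(ddof=0)

# Keep rows below the threshold, plus rows whose value is missing.
threshold = 3
keep = (z.abs() < threshold) | col.isna()
toy_kept = toy[keep]

print(toy.shape, toy_kept.shape)
```

Whether to impute first or to skip missing values depends on how the missing entries will be treated later in the pipeline.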
