# 1. Methodology

## 1.1. Data quality
Clean data must satisfy a set of quality criteria: logical rules or constraints based on business knowledge. These constraints fall into the following categories:
- Data-type constraints: Each column must be of a particular data type such as numeric, date or text.
- Accuracy: You have to verify that the data is close to the true values, sometimes by using external sources.
- Range constraints: Typically, numbers or dates should fall within a certain range.
- Set-membership constraints: Values of a column must come from a pre-defined set.
- Pattern constraints: Certain text fields have to match regular expression patterns.
- Cross-field validation: For example, in a dataset of sales contracts, the delivery date cannot be earlier than the signature date.
- Uniqueness: A field or a combination of fields must be unique across the dataset. For example, two customers cannot have the same ID.
- Consistency: The same fact must not be recorded differently in different places. For example, a customer recorded in two different tables with two different addresses is inconsistent.
- Completeness: Certain columns cannot be empty.
- Uniformity: Each field can only have one unit of measure, such as kg or lb, USD or EUR.

*Reference: [Wikipedia - Data Cleansing](https://en.wikipedia.org/wiki/Data_cleansing)*
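
Most of the constraints above can be expressed as boolean checks in Pandas. The following is a minimal sketch on a hypothetical `contracts` table; the column names, values and rules are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sales contracts; all names and values are illustrative only.
contracts = pd.DataFrame({
    'contract_id': ['C001', 'C002', 'C002', 'c-4'],
    'currency': ['USD', 'EUR', 'USD', 'VND'],
    'amount': [1200.0, -50.0, 300.0, None],
    'signed': pd.to_datetime(['2020-01-05', '2020-02-01', '2020-02-10', '2020-03-01']),
    'delivered': pd.to_datetime(['2020-01-20', '2020-01-15', '2020-02-25', '2020-03-09']),
})

# Each column is a boolean Series: True where the row VIOLATES the constraint.
violations = pd.DataFrame({
    'range': contracts['amount'] < 0,
    'set_membership': ~contracts['currency'].isin(['USD', 'EUR']),
    'pattern': ~contracts['contract_id'].str.fullmatch(r'C\d{3}'),
    'cross_field': contracts['delivered'] < contracts['signed'],
    'uniqueness': contracts['contract_id'].duplicated(keep=False),
    'completeness': contracts['amount'].isna(),
})

# Number of violating rows per constraint.
violation_counts = violations.sum()
print(violation_counts)
```

Keeping one boolean column per rule makes it easy to see which constraint a given row breaks before deciding how to clean it.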

## 1.2. The workflow

#### Inspecting
The inspection can be done in the data exploration step. Here are the two most important methods to inspect your dataset:
- Data profiling: Calculating summary statistics gives a general idea of the quality of the data. You will have to answer questions such as *How many values are missing?*, *Is this field constrained by another field?* and *Which data type should this column have?*.
- Data visualization: Visualization, especially when combined with statistical methods, helps you answer *How is the data distributed?* and *Which points are outliers?*.
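
A data profile can be sketched in a few lines of Pandas. The example below uses a small hypothetical dataset (the column names and values are assumptions) and collects the data type, missing count, missing rate and number of distinct values for each column.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset used only to illustrate profiling.
df = pd.DataFrame({
    'age': [25, 31, np.nan, 47, 52],
    'city': ['Hanoi', 'Hue', None, 'Hanoi', 'Hue'],
    'income': ['1200', '1500', '900', 'n/a', '2100'],
})

# One row per column: data type, missing count and share, distinct values.
profile = pd.DataFrame({
    'dtype': df.dtypes.astype(str),
    'missing': df.isna().sum(),
    'missing_rate': df.isna().mean(),
    'n_unique': df.nunique(),
})
print(profile)

# Summary statistics of the numeric columns (count, mean, std, quartiles).
print(df.describe())
```

Note that `income` is stored as text and its `'n/a'` placeholder is not counted as missing, which is exactly the kind of issue profiling is meant to surface.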

#### Cleaning
In this step, we take into account all the criteria mentioned above. Overall, incorrect data will be either removed, corrected or imputed. For the rest of this topic, we mainly discuss how to apply cleaning techniques using Pandas.

*Reference: [Towards Data Science](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)*

# 2. Basic data cleaning

## 2.1. Common techniques

In [None]:
import numpy as np
import pandas as pd

#### Selecting columns
You can either select the columns you want to use or remove the unnecessary ones.

In [None]:
aqua = pd.DataFrame({
    'year': pd.Series([2020, 2020, 2020, 2020, 2020, 2020]),
    'month_name': pd.Series(['Jan', 'Jan', 'Jun', 'Jun', 'Jul', 'Jul']),
    'month_number': pd.Series([1, 1, 6, 6, 7, 7]),
    'commodity': pd.Series(['Fish', 'Shrimp', 'Fish', 'Shrimp', 'Fish', 'Shrimp']),
    'profit': pd.Series([7415, 3239, 7280, 2007, 3574, 9285]),
    'company': pd.Series(['Pandas', 'Pandas', 'Pandas', 'Pandas', 'Pandas', 'Pandas'])
})
aqua

In [None]:
aqua[['year', 'month_number', 'commodity', 'profit']]

In [None]:
aqua.drop(columns=['month_name', 'company'])

#### Renaming columns
Column names should follow either `PascalCase` or `snake_case`.

In [7]:
aqua = pd.DataFrame({
    'Year': pd.Series([2020, 2020, 2020, 2020, 2020, 2020]),
    'Month name': pd.Series(['Jan', 'Jan', 'Jun', 'Jun', 'Jul', 'Jul']),
    'Month number': pd.Series([1, 1, 6, 6, 7, 7]),
    'Product name': pd.Series(['Fish', 'Shrimp', 'Fish', 'Shrimp', 'Fish', 'Shrimp']),
    'Profit': pd.Series([7415, 3239, 7280, 2007, 3574, 9285]),
    'Company name': pd.Series(['Pandas', 'Pandas', 'Pandas', 'Pandas', 'Pandas', 'Pandas'])
})
aqua

Unnamed: 0,Year,Month name,Month number,Product name,Profit,Company name
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


In [8]:
# PascalCase
aqua_pascal = aqua.copy()

aqua_pascal.columns = aqua.columns.str.title().str.replace(' ', '')
aqua_pascal

Unnamed: 0,Year,MonthName,MonthNumber,ProductName,Profit,CompanyName
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


In [9]:
# snake_case
aqua_snake = aqua.copy()

aqua_snake.columns = aqua.columns.str.lower().str.replace(' ', '_')
aqua_snake

Unnamed: 0,year,month_name,month_number,product_name,profit,company_name
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


The `rename()` method allows renaming specific columns.

In [None]:
aqua_snake.rename(columns={
    'product_name': 'commodity',
    'company_name': 'company'
})

#### Correcting data types

In [41]:
athletes = pd.DataFrame({
    'year': [2019, 2019, 2020., 2020, 2020, 2020],
    'date': ['20191103', '20190812', '20200125', '20200129', '20200412', '20200220'],
    'time': ['145509', '135433', '214412', '124254', '123349', '233517'],
    'medal': ['Gold', 'Bronze', 'Silver', 'Bronze', 'Silver', 'Silver'],
    'name': ['Wayne', 'Robert', 'Ashley', 'Jamie', 'Jessie', 'Sergio'],
    'left_handed': [1, 0, 0, 0, 1, 0]
})
athletes

Unnamed: 0,year,date,time,medal,name,left_handed
0,2019.0,20191103,145509,Gold,Wayne,1
1,2019.0,20190812,135433,Bronze,Robert,0
2,2020.0,20200125,214412,Silver,Ashley,0
3,2020.0,20200129,124254,Bronze,Jamie,0
4,2020.0,20200412,123349,Silver,Jessie,1
5,2020.0,20200220,233517,Silver,Sergio,0


In [43]:
athletes.dtypes

year           float64
date            object
time            object
medal           object
name            object
left_handed      int64
dtype: object

Simple data types (string or numeric) can easily be corrected using the `astype()` method.

In [None]:
athletes = athletes.astype({
    'year': int,
    'left_handed': bool
})
athletes

For more complex data types (date or categorical), the corresponding function has to be used.

In [None]:
pd.to_datetime(athletes.date, format='%Y%m%d')

In [42]:
pd.to_datetime(athletes.date + ' ' + athletes.time, format='%Y%m%d %H%M%S')

0   2019-11-03 14:55:09
1   2019-08-12 13:54:33
2   2020-01-25 21:44:12
3   2020-01-29 12:42:54
4   2020-04-12 12:33:49
5   2020-02-20 23:35:17
dtype: datetime64[ns]

In [None]:
pd.Categorical(athletes.medal, categories=['Bronze', 'Silver', 'Gold'])

In [None]:
athletes.date = pd.to_datetime(athletes.date, format='%Y%m%d')
athletes.medal = pd.Categorical(athletes.medal, categories=['Bronze', 'Silver', 'Gold'])
athletes

In [None]:
athletes.sort_values(by='medal')
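
Sorting a categorical column follows the order of its categories. As a hedged extension of the cell above, declaring the categories as ordered (`ordered=True`) additionally allows comparisons such as "at least Silver". The sketch below uses a standalone `medal` Series rather than the `athletes` table.

```python
import pandas as pd

medal = pd.Series(['Gold', 'Bronze', 'Silver', 'Bronze'])

# ordered=True declares Bronze < Silver < Gold.
medal = medal.astype(
    pd.CategoricalDtype(categories=['Bronze', 'Silver', 'Gold'], ordered=True)
)

# Sorting follows the category order, not the alphabetical order.
sorted_medals = medal.sort_values().tolist()

# Comparisons with a category are only allowed on ordered categoricals.
at_least_silver = (medal >= 'Silver').tolist()
print(sorted_medals, at_least_silver)
```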

#### Filtering

In [2]:
ds = pd.DataFrame({
    'worker': [
        'Wayne', 'Robert', 'Ashley',
        'Jamie', 'Jessie', 'Sergio',
        'Harry', 'Johnny', 'Aaron'
    ],
    'age': [8, 37, 25, 26, 80, 30, 20, 31, 28],
    'job': [
        'Student', 'Data Scientist', 'DATA ANALYST',
        'data engineer', 'Retired', 'Business Intelligence',
        'Student', 'Data Analyst', 'AI Engineer'
    ],
    'years_on_job': [0, 12, 2, 6, 0, 18, 12, 2, 8]
})
ds

Unnamed: 0,worker,age,job,years_on_job
0,Wayne,8,Student,0
1,Robert,37,Data Scientist,12
2,Ashley,25,DATA ANALYST,2
3,Jamie,26,data engineer,6
4,Jessie,80,Retired,0
5,Sergio,30,Business Intelligence,18
6,Harry,20,Student,12
7,Johnny,31,Data Analyst,2
8,Aaron,28,AI Engineer,8


In the dataset above, we only consider people who are of legal working age (15 to 60) and work in the data industry. Notice that `age` minus `years_on_job` (the age at which the person started working) cannot be smaller than 15.

In [3]:
ds[
    (ds.job.str.lower().str.contains('data')) &
    (ds.age >= 15) &
    (ds.age <= 60) &
    (ds.age - ds.years_on_job >= 15)
]

Unnamed: 0,worker,age,job,years_on_job
1,Robert,37,Data Scientist,12
2,Ashley,25,DATA ANALYST,2
3,Jamie,26,data engineer,6
7,Johnny,31,Data Analyst,2
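
The same filter can also be written with one named mask per rule, which makes each constraint easier to audit. This is only an alternative sketch; the mask names are hypothetical, and `between()` and `case=False` are used in place of the explicit comparisons and `str.lower()`.

```python
import pandas as pd

ds = pd.DataFrame({
    'worker': ['Wayne', 'Robert', 'Ashley', 'Jamie', 'Jessie',
               'Sergio', 'Harry', 'Johnny', 'Aaron'],
    'age': [8, 37, 25, 26, 80, 30, 20, 31, 28],
    'job': ['Student', 'Data Scientist', 'DATA ANALYST', 'data engineer',
            'Retired', 'Business Intelligence', 'Student', 'Data Analyst',
            'AI Engineer'],
    'years_on_job': [0, 12, 2, 6, 0, 18, 12, 2, 8],
})

# One named mask per rule.
in_data_industry = ds['job'].str.contains('data', case=False)
working_age = ds['age'].between(15, 60)  # inclusive on both ends
plausible_start = (ds['age'] - ds['years_on_job']) >= 15

valid = ds[in_data_industry & working_age & plausible_start]
print(valid['worker'].tolist())
```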


## 2.2. Text cleaning

In [None]:
import numpy as np
import pandas as pd

#### Trimming
Extra spaces and newline characters often appear in text columns because of how users type data in.

In [None]:
trade = pd.DataFrame({
    'year': pd.Series([2017, 2018, 2019, 2020]),
    'country': pd.Series([
        'United\nKingdom  ',
        '  United\nKingdom',
        'United    Kingdom',
        ' United Kingdom\n']),
    'export': pd.Series([5466, 8558, 8435, 8435]),
    'import': pd.Series([1546, 3546, 2007, 3574])
})
trade

In [None]:
trade.country.unique()

In [None]:
trade.country.str.split().str.join(' ')

In [None]:
trade.country = trade.country.str.split().str.join(' ')
trade.country.unique()
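
An alternative to `split()` followed by `join()` is a regular expression that collapses every run of whitespace. The sketch below assumes the same four spellings as in the `trade` table.

```python
import pandas as pd

country = pd.Series(['United\nKingdom  ', '  United\nKingdom',
                     'United    Kingdom', ' United Kingdom\n'])

# Collapse every run of whitespace into one space, then trim both ends.
cleaned = country.str.replace(r'\s+', ' ', regex=True).str.strip()
print(cleaned.unique())
```

Both approaches give the same result here; the regular expression is convenient when the pattern has to be extended, for example to also remove tabs mixed with punctuation.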

#### Standardization
The approach is to translate different naming conventions, abbreviations or formats into a single standard value.

In [None]:
shrimp = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'commodity': ['Shrimp, frozen, chem free', 'Shrimp, frz, chemical-free', 'Prawn, frz, chemical-free'],
    'price': [10, 13, 14],
    'unit': ['usd/kg', 'USD/KG', 'USD/kg']
})
shrimp

In [None]:
shrimp.commodity = shrimp.commodity.str.replace('Prawn', 'Shrimp')
shrimp.commodity = shrimp.commodity.str.replace('frz', 'frozen')
shrimp.commodity = shrimp.commodity.str.replace('chem free', 'chemical-free')
shrimp.unit = shrimp.unit.str.replace('usd', 'USD')
shrimp.unit = shrimp.unit.str.replace('KG', 'kg')

In [None]:
shrimp
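
When the list of variants grows, the substitutions can be kept in a lookup table instead of one line per replacement. The following is a minimal sketch; the `standard` dictionary is a hypothetical name, and its entries simply mirror the replacements used above.

```python
import pandas as pd

commodity = pd.Series(['Shrimp, frozen, chem free',
                       'Shrimp, frz, chemical-free',
                       'Prawn, frz, chemical-free'])

# Hypothetical lookup table: variant spelling -> standard spelling.
standard = {'Prawn': 'Shrimp', 'frz': 'frozen', 'chem free': 'chemical-free'}

# Apply each literal (non-regex) substitution in turn.
for variant, canonical in standard.items():
    commodity = commodity.str.replace(variant, canonical, regex=False)

# The units differ only by letter case, so one fixed spelling is enough here.
unit = pd.Series(['usd/kg', 'USD/KG', 'USD/kg'])
unit = unit.str.upper().str.replace('/KG', '/kg', regex=False)

print(commodity.unique(), unit.unique())
```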

#### Padding numbers

In [None]:
info = pd.DataFrame({
    'customer_id': [3, 423, 5464],
    'phone': [363334444, 913334444, 123334444],
    'name': ['Jack', 'James', 'Gabriel'],
    'information': ['England Male', 'Colombia Male', 'France Female']
})
info

In [None]:
info = info.astype(str)
info.dtypes

In [None]:
info.customer_id = info.customer_id.str.pad(width=4, fillchar='0')
info.phone = info.phone.str.pad(width=10, fillchar='0')

In [None]:
info
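
For the common case of left-padding with zeros, `str.zfill()` is a shorter equivalent of `str.pad(..., fillchar='0')`. A small sketch on a standalone Series:

```python
import pandas as pd

customer_id = pd.Series([3, 423, 5464])

# Convert to text first, then left-pad with zeros up to 4 characters.
padded = customer_id.astype(str).str.zfill(4)
print(padded.tolist())
```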

#### Splitting a column

In [4]:
info = pd.DataFrame({
    'customer_id': [3, 423, 5464],
    'phone': [363334444, 913334444, 123334444],
    'name': ['Jack', 'James', 'Gabriel'],
    'information': ['England Male', 'Colombia Male', 'France Female']
})
info

Unnamed: 0,customer_id,phone,name,information
0,3,363334444,Jack,England Male
1,423,913334444,James,Colombia Male
2,5464,123334444,Gabriel,France Female


In [8]:
info['information'].str.split()

0     [England, Male]
1    [Colombia, Male]
2    [France, Female]
Name: information, dtype: object

In [6]:
# split into two new columns (expand=True returns one column per part)
info[['nationality', 'gender']] = info['information'].str.split(expand=True)

info.drop(columns=['information'])

Unnamed: 0,customer_id,phone,name,nationality,gender
0,3,363334444,Jack,England,Male
1,423,913334444,James,Colombia,Male
2,5464,123334444,Gabriel,France,Female
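
Splitting on every space breaks down when one of the parts contains a space itself. As a sketch under the assumption that the gender is always the last word, `str.rsplit()` with `n=1` splits only once, starting from the right. The `'South Africa Female'` value is a hypothetical addition for illustration.

```python
import pandas as pd

information = pd.Series(['England Male', 'South Africa Female'])

# Split once from the right so that multi-word nationalities stay intact.
parts = information.str.rsplit(n=1, expand=True)
parts.columns = ['nationality', 'gender']
print(parts)
```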


#### Concatenating columns

In [None]:
football = pd.DataFrame({
    'first_name': ['Wayne', 'Cristiano', 'Lionel'],
    'last_name': ['Rooney', 'Ronaldo', 'Messi'],
    'position': ['Second Striker', 'Left Winger', 'Right Winger']
})
football

In [None]:
football['player'] = football.first_name + ' ' + football.last_name

In [None]:
football

# 3. Handling missing data

## 3.1. Why is data missing?

In [2]:
import numpy as np
import pandas as pd

<table border="0">
    <tr>
        <td style="width:33%; text-align:center">
            <b>Missing Completely At Random (MCAR)</b>
        </td>
        <td style="width:33%; text-align:center">
            <b>Missing At Random (MAR)</b>
        </td>
        <td style="width:34%; text-align:center">
            <b>Missing Not At Random (MNAR)</b>
        </td>
    </tr>
    <tr>
        <td style="text-align:justify">
            The name says it all: there is no systematic reason behind the missing values.
            This type of missingness does not lead to bias, so
            deletion and imputation are both suitable solutions.
        </td>
        <td style="text-align:justify">
            The missing values in a feature are related to another observed feature.
            For example, people under 25 years old are missing their IQ score.
            Deleting these records causes bias, which makes imputation the better choice.
        </td>
        <td style="text-align:justify"> 
            Assume people with an IQ score of 100 or less tend to refuse to answer the survey.
            The missing data cannot be inferred just by looking at the collected data.
            Both deletion and imputation bias the data, and the Data Scientist may not even realize it's MNAR.
        </td>
    </tr>
    <tr>
        <td>
            <table>
                <tr><th>Complete data</th><th>Real data</th></tr>
<tr><td>
    
Age |IQ Score|
:---|:-------|
20  |120     |
22  |112     |
24  |127     |
29  |97      |
30  |103     |
40  |95      |
45  |141     |
47  |92      |
52  |115     |

</td><td>

Age |IQ Score|
:---|:-------|
20  |120     |
22  |        |
24  |127     |
29  |        |
30  |103     |
40  |95      |
45  |        |
47  |92      |
52  |115     |

</td></tr>
            </table>
        </td>
        <td>
            <table>
                <tr><th>Complete data</th><th>Real data</th></tr>
<tr><td>
    
Age |IQ Score|
:---|:-------|
20  |120     |
22  |112     |
24  |127     |
29  |97      |
30  |103     |
40  |95      |
45  |141     |
47  |92      |
52  |115     |

</td><td>

Age |IQ Score|
:---|:-------|
20  |        |
22  |        |
24  |        |
29  |97      |
30  |103     |
40  |95      |
45  |141     |
47  |92      |
52  |115     |

</td></tr>
            </table>
        </td>
        <td>
            <table>
                <tr><th>Complete data</th><th>Real data</th></tr>
<tr><td>
    
Age |IQ Score|
:---|:-------|
20  |120     |
22  |112     |
24  |127     |
29  |97      |
30  |103     |
40  |95      |
45  |141     |
47  |92      |
52  |115     |

</td><td>

Age |IQ Score|
:---|:-------|
20  |120     |
22  |112     |
24  |127     |
29  |        |
30  |103     |
40  |        |
45  |141     |
47  |        |
52  |115     |

</td></tr>
            </table>
        </td>
    </tr>
</table>
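
The three mechanisms in the table can be reproduced on the same nine records. This is a sketch only: the random seed and variable names are assumptions, and with so few rows the resulting means are illustrative rather than statistically meaningful.

```python
import numpy as np
import pandas as pd

# Complete data taken from the tables above.
complete = pd.DataFrame({
    'age': [20, 22, 24, 29, 30, 40, 45, 47, 52],
    'iq': [120, 112, 127, 97, 103, 95, 141, 92, 115],
})

# MAR: missingness depends on another observed column (age under 25).
mar = complete['iq'].mask(complete['age'] < 25)

# MNAR: missingness depends on the unobserved value itself (IQ of 100 or less).
mnar = complete['iq'].mask(complete['iq'] <= 100)

# MCAR: missingness is unrelated to any column; three rows picked at random.
rng = np.random.default_rng(0)
mcar = complete['iq'].astype(float)
mcar.iloc[rng.choice(len(complete), size=3, replace=False)] = np.nan

# Mean IQ computed from whatever values remain in each scenario.
means = pd.Series({
    'complete': complete['iq'].mean(),
    'MCAR': mcar.mean(),
    'MAR': mar.mean(),
    'MNAR': mnar.mean(),
})
print(means.round(1))
```

In the MNAR case every low score is removed, so the mean of the remaining values is pushed above the true mean, and nothing in the collected data reveals that shift.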

In [None]:
# COVID-19 data
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa',
           'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, 1570237, 817593, 580330, 236260, 255945, 236209, 311431, 247230, 197842]
area = ['North America', 'South America', 'Asia', 'Europe', 'Africa',
        'South America', 'North America', 'South America', 'Asia', 'Europe']

pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered,
    'area': area
})

## 3.2. Deleting

In [2]:
import numpy as np
import pandas as pd

#### Deleting columns
A column with more than 50% missing data can be dropped.

In [3]:
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa', 'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, None, None, 580330, None, None, 236209, None, 247230, None]

covid = pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered
})
covid

Unnamed: 0,country,cases,deaths,recovered
0,USA,4169991,147333,1979617.0
1,Brazil,2289951,84207,
2,India,1288130,30645,
3,Russia,795038,12892,580330.0
4,South Africa,408052,6093,
5,Peru,371096,17645,
6,Mexico,370712,41908,236209.0
7,Chile,338759,8838,
8,Iran,284034,15074,247230.0
9,Italy,245338,35029,


In [5]:
covid.isna().mean().map('{:.0%}'.format)

country       0%
cases         0%
deaths        0%
recovered    60%
dtype: object

In [None]:
covid.drop(columns='recovered')
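
Instead of naming the column by hand, the 50% rule can be applied to every column at once. The sketch below uses a small hypothetical frame, and the `threshold` value is an assumption that should be adjusted to the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'recovered' is 60% missing, 'deaths' is 20% missing.
df = pd.DataFrame({
    'cases': [10, 20, 30, 40, 50],
    'deaths': [1, np.nan, 3, 4, 5],
    'recovered': [5, np.nan, np.nan, np.nan, 20],
})

threshold = 0.5  # assumed cut-off
missing_rate = df.isna().mean()

# Keep only the columns whose missing rate does not exceed the threshold.
kept = df.loc[:, missing_rate <= threshold]
print(kept.columns.tolist())
```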

#### Deleting rows

In [None]:
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa', 'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, None, 817593, None, 236260, 255945, 236209, 311431, 247230, 197842]

covid = pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered
})
covid

In [None]:
covid.dropna(subset=['recovered'])

## 3.3. Filling
Common values used to fill missing data are the mean, median, mode and zero.

In [None]:
import numpy as np
import pandas as pd

In [None]:
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa',
           'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, None, 817593, None, 236260, 255945, 236209, 311431, 247230, 197842]
area = ['North America', 'South America', 'Asia', np.nan, 'Africa',
        'South America', 'North America', 'South America', np.nan, 'Europe']

covid = pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered,
    'area': area
})
covid

In [None]:
recovered_mean = covid.recovered.mean()
recovered_mean

In [None]:
area_mode = covid.area.mode()[0]
area_mode

In [None]:
covid.recovered = covid.recovered.fillna(recovered_mean)
covid.area = covid.area.fillna(area_mode)
covid
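
A single global mean can be a poor substitute when groups differ a lot. One possible refinement, sketched here on hypothetical data, is to fill each gap with the median of its own group using `groupby().transform()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per area.
df = pd.DataFrame({
    'area': ['Asia', 'Asia', 'Asia', 'Europe', 'Europe', 'Europe'],
    'recovered': [100.0, np.nan, 300.0, 10.0, 30.0, np.nan],
})

# Median of each area, broadcast back to the original rows.
area_median = df.groupby('area')['recovered'].transform('median')

# Fill each gap with the median of its own area.
df['recovered'] = df['recovered'].fillna(area_median)
print(df)
```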

## 3.4. Imputing
k-NN (k-Nearest Neighbors) is one of the machine learning algorithms that can be used to impute missing values. This algorithm considers the $k$ nearest observations (according to some distance metric) to predict each missing value.

In [14]:
import numpy as np
import pandas as pd

In [15]:
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa',
           'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, 1570237, 817593, 580330, 236260, 255945, 236209, 311431, 247230, 197842]
area = ['America', 'America', 'Asia', np.nan, 'Africa',
        'America', 'America', np.nan, 'Asia', 'Europe']

covid = pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered,
    'area': area
})
covid

Unnamed: 0,country,cases,deaths,recovered,area
0,USA,4169991,147333,1979617,America
1,Brazil,2289951,84207,1570237,America
2,India,1288130,30645,817593,Asia
3,Russia,795038,12892,580330,
4,South Africa,408052,6093,236260,Africa
5,Peru,371096,17645,255945,America
6,Mexico,370712,41908,236209,America
7,Chile,338759,8838,311431,
8,Iran,284034,15074,247230,Asia
9,Italy,245338,35029,197842,Europe


In [16]:
train = covid[~covid.area.isna()]
x_train = train[['cases', 'deaths', 'recovered']]
y_train = train.area

predict = covid[covid.area.isna()]
x_predict = predict[['cases', 'deaths', 'recovered']]

In [17]:
from sklearn.neighbors import KNeighborsClassifier as kNN
clf = kNN(3, weights='distance').fit(x_train, y_train)
y_predict = clf.predict(x_predict)
y_predict

array(['America', 'America'], dtype=object)

In [18]:
pd.concat([train, predict.assign(area=y_predict)]).sort_values('cases')

Unnamed: 0,country,cases,deaths,recovered,area
9,Italy,245338,35029,197842,Europe
8,Iran,284034,15074,247230,Asia
7,Chile,338759,8838,311431,America
6,Mexico,370712,41908,236209,America
5,Peru,371096,17645,255945,America
4,South Africa,408052,6093,236260,Africa
3,Russia,795038,12892,580330,America
2,India,1288130,30645,817593,Asia
1,Brazil,2289951,84207,1570237,America
0,USA,4169991,147333,1979617,America
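
To make the idea of "nearest observations" concrete, here is a hand-written sketch of the prediction step using NumPy only. It assumes Euclidean distance and an unweighted majority vote, so it is simpler than the `weights='distance'` setting used above; `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest rows."""
    x_train = np.asarray(x_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training row.
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbours.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (illustrative values only): two features, two labels.
x_train = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8]]
y_train = ['A', 'A', 'A', 'B', 'B']

label = knn_predict(x_train, y_train, [2, 2], k=3)
print(label)
```

Because the prediction depends entirely on distances, features measured on very different scales (such as `cases` and `deaths`) do not contribute equally unless they are rescaled first.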


# 4. Handling abnormal data

## 4.1. Duplicated values
Duplicated values violate the unique constraint of a column or a combination of columns. If duplicated values occur, at most one of them can be the true value.

Depending on the context, you have several options to handle duplicated values:
- List and sort all duplicated values, then manually remove incorrect records.
- Remove duplicated values based on a specific criterion, such as keeping only the greatest value.
- Calculate a value such as sum or mean representing all duplicated records.

In [None]:
import numpy as np
import pandas as pd

In [None]:
report = pd.DataFrame({
    'year': pd.Series([2019, 2019, 2020, 2020, 2020, 2020]),
    'company': pd.Series(['Pandas', 'Numpy', 'Pandas', 'Numpy', 'Numpy', 'Pandas']),
    'sales': pd.Series([5466, 8558, 8435, 7280, 9285, 6650]),
    'profit': pd.Series([1546, 3546, 3574, 3352, 4678, 2007])
})
report

In this example, the combination of `year` and `company` forms a unique constraint. This means that in each year, a company cannot have two values of sales and profit.

In [None]:
subset = ['year', 'company']

#### Removing manually

In [None]:
report[report.duplicated(subset, keep=False)].sort_values(subset)

In [None]:
report.drop(index=[4, 2])

#### Removing based on a criterion

In [None]:
# keep the biggest sales values only
report\
    .sort_values(by=['year', 'company', 'sales'])\
    .drop_duplicates(subset=subset, keep='last')

#### Aggregating

In [None]:
report.groupby(by=['year', 'company']).sum().reset_index()

## 4.2. Outliers
An outlier is a data point that differs significantly from other observations. Outliers can cause serious problems in statistical analysis. Detecting outliers is more of an art than a science, so you need both quantitative and qualitative methods to identify them.

However, there's no best rule for handling outliers. You need to ask yourself *Why are they outliers?* and *How can they affect your analysis?*. In this section, we discuss how to detect and handle outliers using Pandas.

In [None]:
import numpy as np
import pandas as pd

#### Using z-score
Given a vector $x$, we calculate the z-score (denoted $z$) with the following formula:

$$z = \frac{x-\mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

The approach of this method is to eliminate values with $z<-3$ or $z>3$. Equivalently, you can remove values with $x<\mu-3\sigma$ or $x>\mu+3\sigma$, which gives the same result. Notice that the threshold can be changed to 2.5 or 3.5 depending on the problem.

In [None]:
def outliers_zscore(array, z):
    'Return a copy of the array with the outliers replaced by NaN.'
    import numpy as np
    array = np.array(array, dtype=float)
    mean = array.mean()
    std = array.std()
    lower = mean - z*std
    upper = mean + z*std
    array[(array < lower) | (array > upper)] = np.nan
    return array

In [None]:
wine = pd.read_excel(r'data\wine_quality.xlsx')
wine.head()

In [None]:
# handling outliers for all columns
for i in wine.columns:
    wine[i] = outliers_zscore(wine[i], z=3)

In [None]:
pd.DataFrame({
    'removed_count': wine.isna().sum(),
    'removed_rate': (wine.isna().sum() / wine.shape[0]).apply(lambda x: f'{x:.2%}')
})
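
The same rule can also be written directly on a Pandas Series with `where()`. The sketch below runs on synthetic data (an assumption, since the wine file is not reproduced here) and uses `ddof=0` so that the standard deviation matches NumPy's default.

```python
import numpy as np
import pandas as pd

# Synthetic data: 50 values alternating 9 and 11, plus one extreme value.
values = pd.Series([9.0, 11.0] * 25 + [100.0])

# Population standard deviation (ddof=0) matches NumPy's default std().
z = (values - values.mean()) / values.std(ddof=0)

# Keep values with |z| <= 3; everything else becomes NaN.
cleaned = values.where(z.abs() <= 3)
print(cleaned.isna().sum())
```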

#### Using interquartile range
This method eliminates values that are lower than $Q_1-1.5\times \mbox{IQR}$ or higher than $Q_3+1.5\times \mbox{IQR}$, where $Q_1$ and $Q_3$ are the first and third quartiles and $\mbox{IQR}=Q_3-Q_1$ is the interquartile range.

In [None]:
def outliers_iqr(array):
    import numpy as np
    array = np.array(array, dtype=float)
    Q1, Q3 = np.quantile(array, [0.25, 0.75])
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    array[(array < lower) | (array > upper)] = np.nan
    return array

In [None]:
wine = pd.read_excel(r'data\wine_quality.xlsx')
wine.head()

In [None]:
# handling outliers for all columns
for i in wine.columns:
    wine[i] = outliers_iqr(wine[i])

In [None]:
pd.DataFrame({
    'removed_count': wine.isna().sum(),
    'removed_rate': (wine.isna().sum() / wine.shape[0]).apply(lambda x: f'{x:.2%}')
})
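
One caveat of the NumPy-based function is that `np.quantile` returns NaN when the input already contains NaN. A Pandas-based sketch avoids this, because `Series.quantile()` skips missing values. The data below is synthetic and for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data with one missing value and one extreme value.
values = pd.Series([10, 12, 11, 13, 12, np.nan, 11, 95])

# Series.quantile ignores NaN when computing the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] become NaN; an existing NaN stays NaN.
cleaned = values.where(values.between(lower, upper))
print(cleaned.tolist())
```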

#### Logarithmic transformation
Another strategy for handling outliers is to apply a log transformation to the data, which dampens the effect of outliers.

In [None]:
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 10, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style='whitegrid')

fig, ax = plt.subplots(ncols=2, sharex=True, figsize=(15,4))
sns.regplot(x=x, y=y, ax=ax[0]).set_title('Effect of outliers')
sns.regplot(x=x, y=np.log(y), ax=ax[1]).set_title('Effect of log transformed outliers')
plt.axis('equal')
plt.show()
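
The dampening effect can also be checked numerically. The sketch below uses hypothetical right-skewed data and `np.log1p`, which computes $\log(1+x)$ and is therefore also defined at zero; a plain logarithm requires strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data with one very large value.
income = pd.Series([20, 25, 30, 35, 40, 45, 50, 1000])

# log1p computes log(1 + x).
log_income = np.log1p(income)

# Before the transform, the largest value is more than 20 times the median...
ratio_raw = income.max() / income.median()
# ...after it, the largest value is less than twice the median.
ratio_log = log_income.max() / log_income.median()
print(round(ratio_raw, 1), round(ratio_log, 2))
```

The extreme value is still the largest one after the transformation, but it sits much closer to the rest of the data, so it has less influence on statistics such as the mean or a regression line.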