<a href="https://colab.research.google.com/github/luferIPCA/MIA-MLA-24-25/blob/main/2_Data_Manipulation_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Masters' in Applied Artificial Intelligence
## Machine Learning Algorithms Course (MLA)

Notebooks for MLA course

by [*lufer*](mailto:lufer@ipca.pt)

(2024)

---

# Part II - Datasets Manipulation (I)

This is the first notebook on dataset manipulation. Cleaning, normalizing, and initializing data are some of the tasks required when preparing a dataset for training.

**Contents**:

1. **Features Engineering**
2. **Cleaning Data**
3. **Outliers**



## Environment preparation


### Importing necessary Libraries

In [None]:
import pandas as pd
import numpy as np

Mounting Drive

In [None]:

from google.colab import drive

# it will ask for your Google Drive credentials
drive.mount('/content/gDrive/', force_remount=True)

*Loading dataset*

In [None]:
#global path variable
path="/content/gDrive/MyDrive/Colab Notebooks/MIA - ML - 2024-2025/Datasets/"
#path

In [None]:
import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = path+"nbaAll.csv"

#create a local file with remote csv data
response = requests.get(download_url)
response.raise_for_status()
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")

nbaOriginal = pd.read_csv(path+"nbaAll.csv")

In [None]:
nba = nbaOriginal.copy()


## 1 - Features Engineering

Features Engineering means a set of actions to deal with features (columns) of a dataset.

It involves selecting, manipulating and transforming raw data into features that can be used in training models.

**Feature Engineering Definition**

*Feature engineering is the process of selecting, manipulating and transforming raw data into features that can be used in supervised learning. It consists of five processes: feature creation, transformations, feature extraction, exploratory data analysis and benchmarking.*

Processes involving:


*   Feature creation (adding or removing some features)
*   Transformations
*   Feature extraction
*   Exploratory data analysis
*   Benchmark

Concrete actions:

*   Imputation (missing values: categorical and numerical)
*   Handling outliers (removing, replacing values, discretization)
*   Log transform (reduce the skew of long-tailed data)
*   One-hot encoding (one binary column for each possible category)
*   Scaling (normalization, standardization)
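
Some of the actions listed above can be sketched on a toy DataFrame. The column names (`income`, `city`) and the values are made up for illustration; they are not part of the NBA dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "income": [1000.0, 2000.0, 4000.0, 100000.0],
    "city": ["Braga", "Porto", "Braga", "Lisboa"],
})

# Log transform: compresses a long right tail (log1p computes log(1 + x))
toy["income_log"] = np.log1p(toy["income"])

# Min-max normalization: rescale to the [0, 1] interval
inc = toy["income"]
toy["income_norm"] = (inc - inc.min()) / (inc.max() - inc.min())

# Standardization: zero mean, unit (sample) standard deviation
toy["income_std"] = (inc - inc.mean()) / inc.std()

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(toy, columns=["city"])
print(encoded.columns.tolist())
```

Imputation and outlier handling are covered in their own sections of this notebook.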



### Warming up...

Analysing the dataset

In [None]:
#checking dataset structure
nba.shape

In [None]:
nba.info()

In [None]:
nba.head()

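*Filtering rows with "isin"*

The *isin()* method checks, element by element, whether each value of a column is contained in the given list of values. The resulting boolean mask selects the matching rows.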

In [None]:
#get values in the period 1948-1949
nbaYear = nba[nba["year_id"].isin([1948, 1949])]
nbaYear

*Get first N columns from a dataframe*

In [None]:
n=3
aux = nba.iloc[:,:n]
aux
#question: how to check the first n rows?

*Get last N columns from a dataframe*

In [None]:
n=3
aux = nba.iloc[:,-n:]
aux
#question: how to check the last n rows?
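
One possible answer to the two questions above is `head(n)` and `tail(n)`, which are equivalent to row slicing with `iloc`. The small frame `demo` below is a hypothetical stand-in for `nba`.

```python
import pandas as pd

# Hypothetical toy frame standing in for nba
demo = pd.DataFrame({"a": range(10), "b": range(10, 20)})
n = 3

first_rows = demo.head(n)   # same rows as demo.iloc[:n]
last_rows = demo.tail(n)    # same rows as demo.iloc[-n:]
print(first_rows)
print(last_rows)
```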

### Deriving new Feature

(this will be explored further in the later *Categorical to Numerical* section)

*Create new Feature (column)*

In [None]:
#convert object to datetime value
nba["date_played"] = pd.to_datetime(nba["date_game"])
#nba
#nba.columns

*Create new Feature from calculus over others*

In [None]:
#See https://www.plus2net.com/python/pandas-dt-timedelta64.php
import datetime
from datetime import date
today = pd.to_datetime(date.today())
#new column
nba['DaysCPassed'] = (today-nba['date_played']) / np.timedelta64(1, 'D')
nba.shape

In [None]:
nba.head()

In [None]:
nba.DaysCPassed.max()

Separate datetime in columns for month, day, year

In [None]:
#check
nba["date_played"]

In [None]:
#check
nba.dtypes

In [None]:
#create new columns year, month and day
nba['year'] = nba["date_played"].dt.year
nba['month'] = nba["date_played"].dt.month
nba['day'] = nba["date_played"].dt.day  # day of the month (dt.dayofyear would give 1-366)

In [None]:
nba.head()

Create new column for Holidays (boolean)

In [None]:
import holidays
#print(dir(holidays))
# Create a dict-like object for Portuguese holidays
pt_holidays = holidays.Portugal()
#show all
for feriado in pt_holidays['2020-01-01': '2020-12-31'] :
    print(feriado)


In [None]:
# Create a function to check whether a game date falls on a holiday
def is_holiday(game_date):
    # Compare only month and day against the holidays loaded above (year 2020)
    return any(h.month == game_date.month and h.day == game_date.day for h in pt_holidays)

# Apply the function to create the "Holiday" column
nba["Holiday"] = nba["date_played"].apply(is_holiday)
nba.head()

### Change features names

In [None]:
renamedNba = nba.rename(columns={"DaysCPassed": "DaysPassed"})

In [None]:
renamedNba.info()
print('-'*50)
nba.info()

### Deleting Features

*Delete a particular Feature (column)*

In [None]:
renamedNba.info()

In [None]:
renamedNba = renamedNba.drop(columns=['notes'])
#or
#renamedNba.drop(['notes'],axis=1)

renamedNba.info()   # 'notes' is no longer listed
print('-'*50)
nba.info()          # the original DataFrame still has 'notes'


### Changing the Data Type of Columns

In [None]:
df = nba.copy()
df.info()
#df

*Convert column types*

In [None]:
#object to datetime
df["date_played"] = pd.to_datetime(df["date_game"])

#pd.to_numeric
#astype(): str, int, float...

#dt.strftime()
#df["date_played"] = df["date_played"].dt.strftime('%d-%m-%Y')

*Identify unique values*

In [None]:
df.head()

In [None]:
a=df["game_location"].unique()
print(a)

*Counting distinct values*

In [None]:
a=df["game_location"].nunique()
a

*Occurrences*

In [None]:
df['game_location'].value_counts()

In [None]:
df['team_id'].value_counts()

Make columns Category type

>Attention: what are "categorical" values? Compare them with strings or objects!

>Categorical data is often used for grouping and aggregating data

In [None]:
nba.info()

In [None]:
t= pd.Categorical(nba['team_id'] )

In [None]:
t

In [None]:
df["game_location"] = pd.Categorical(df["game_location"])
df["game_location"].dtype

In [None]:
df.info()
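
To see what the category type changes, the sketch below compares a plain string column with its categorical version. The series `loc_str` is hypothetical data that only mimics a column with few distinct values.

```python
import pandas as pd

# Hypothetical column: few distinct values repeated many times
loc_str = pd.Series(["H", "A", "N", "H", "A"] * 1000)
loc_cat = loc_str.astype("category")

print(loc_cat.cat.categories.tolist())    # distinct labels, stored only once
print(loc_cat.cat.codes.head().tolist())  # one small integer code per row

# Compare the memory footprint (deep=True also counts the string contents)
print(loc_str.memory_usage(deep=True), loc_cat.memory_usage(deep=True))
```

Each row stores only an integer code that points to its label, which is why the categorical version usually needs less memory when the labels repeat often.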

### Grouping features

Grouping combines the rows that share the same key value and applies an aggregation function to each group: mean, sum, count, etc.

In [None]:
nba.groupby("fran_id", sort=False)["pts"].sum()
# Expected:
# fran_id
# Huskies           3995
# Knicks          582497
# Stags            20398
# Falcons           3797
# Capitols         22387

## 2 - Cleaning Data

Cleaning data means taking actions to overcome problems that may exist in the data. It may be necessary to handle null values, duplicates, imbalanced datasets, outliers, etc.

In [None]:
nba.info()

### Imputing Missing Values


Missing values can be handled by:
1. removing them (just that!)
2. replacing them (imputing):

* numeric values with the "median" (if the data is not normally distributed or has outliers)
* numeric values with the "mean" (if the data is normally distributed)
* categorical values with the "mode"


> **Note**: the median is more robust than the mean because it is not distorted by extreme values. Otherwise, what would happen if a millionaire appeared in the average calculation?
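
The millionaire question can be checked numerically. The salaries below are invented for illustration.

```python
import pandas as pd

# Hypothetical salaries, then the same list plus one millionaire
salaries = pd.Series([30000, 40000, 50000, 60000])
with_millionaire = pd.concat([salaries, pd.Series([5_000_000])], ignore_index=True)

print(salaries.mean(), salaries.median())                  # 45000.0 45000.0
print(with_millionaire.mean(), with_millionaire.median())  # 1036000.0 50000.0
```

A single extreme value moves the mean from 45000 to 1036000, while the median only moves from 45000 to 50000.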


Let's avoid *null values*

The current nba dataset has null values (*Null/None/NaN values*) (how to check that?).

The column "*notes*" has only 5424 *non-null* values. All remaining columns have 126314 values.

(try to get these results)

Let's analyse the following example:

In [None]:
#import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 25, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}

df = pd.DataFrame(data)

In [None]:
df.info()

*It* is easy to realize that *Name* has 5  *non-null* values, but the other columns have only 4.

In [None]:
#preserve original dataset
dfCopy = df.copy()
dfCopy

*Identify the Missing Values*

By default, pandas converts missing values to its own markers (for example, *None* in a numeric column becomes *NaN*). The functions to identify these missing values are:

*   **isnull()**
*   **notnull()**


The output is a boolean value indicating whether the value that is passed into the argument is in fact missing data.

"True" means the value is a missing value while "False" means the value is not a missing value.

In [None]:
missing_data = dfCopy.isnull()
missing_data.head()
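
A common way to summarize the boolean mask is to count the missing values per column. The sketch below rebuilds the same example data under the name `people` so that it runs on its own.

```python
import pandas as pd

# Same example data as above, rebuilt so the snippet is self-contained
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 31, 25, None, 27],
    "Gender": ["F", "M", None, "M", "F"],
    "Salary": [50000, None, 30000, 40000, 60000],
})

missing_per_column = people.isnull().sum()     # number of missing values per column
missing_share = people.isnull().mean() * 100   # percentage of missing values per column
print(missing_per_column)
print(missing_share)
```

The same call on the NBA data (`nba.isnull().sum()`) is one way to answer the earlier question about which columns contain null values.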

### Replacing missing values

Replace null values of *Age* feature by *Unknown*

In [None]:
dfCopy["Age"]= dfCopy["Age"].fillna(value="Unknown")

In [None]:
dfCopy

*Replace null values by a particular value*

In [None]:
#dfCopy.fillna({'Age':'Unknown', 'Gender': 'Other'}, inplace=True)
#or
dfCopy=dfCopy.fillna({'Age':'Unknown', 'Gender': 'Other'})

In [None]:
display(dfCopy)

Sometimes NaN values are represented by "?". Replace the "?" symbol with *NaN* so the dropna() can remove the missing values:

In [None]:
#set NaN values
df1=dfCopy.replace('?',np.nan)  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
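
The example data here has no "?" entries, so the following sketch uses a made-up frame `raw` to show the effect. It also shows the `na_values` option of `pd.read_csv`, which declares the placeholder while loading; the CSV text is hypothetical.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical data where "?" marks a missing value
raw = pd.DataFrame({"Age": [25, "?", 27], "Gender": ["F", "M", "?"]})

# Turn the "?" placeholders into real missing values, then drop those rows
cleaned = raw.replace("?", np.nan).dropna()
print(cleaned)

# When reading from CSV, the placeholder can be declared up front
csv_text = "Age,Gender\n25,F\n?,M\n27,?\n"
from_csv = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(from_csv.isnull().sum())
```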

Fill **numerical features** with the *mean* value

In [None]:
#reset dfCopy
dfCopy = df.copy()
dfCopy
#dfCopy.info()

In [None]:
#Use mean() to impute the NaN values with fillna
#Note: fillna returns a new DataFrame; dfCopy itself is not modified here
dfCopy.fillna({'Salary':dfCopy['Salary'].mean()})

In [None]:
dfCopy

Fill numerical features with the *mode* value

> mode() - most frequent value

In [None]:
#checking...
dfCopy.mode()
#explain the result...

In [None]:
#Use mode() to impute the NaN values with fillna
#Why [0]?
dfCopy.fillna({'Salary':dfCopy['Salary'].mode()[0]}, inplace = True)
dfCopy

Fill categorical features with the *mode* value

In [None]:
mode=dfCopy['Age'].mode()[0] #mode()[0] gives the first mode if multiple exist
mode
#age_mode = dfCopy['Age'].mode()[0]
#age_mode

In [None]:
dfCopy

In [None]:
df3=dfCopy['Age'].fillna(mode)

In [None]:
#Replace Age NaN values with the mode()
ag=dfCopy['Age'].mode()[0]
df2=dfCopy.fillna({'Age':ag, 'Gender':'Other'})
#or
df3=dfCopy.fillna({'Age':ag})
#dfCopy=dfCopy.fillna({'Age':dfCopy['Age'].mode()[0]})
#dfCopy.fillna({'Age':dfCopy['Age'].mode()[0]}, inplace = True)
#what does this do?
df4=dfCopy['Age'].fillna(mode)
df4

In [None]:
mode=dfCopy['Age'].mode()[0]
dfCopy=dfCopy.fillna({'Age':mode})
dfCopy

### See the *null* values

In [None]:
#preserve original dataset
dfCopy = df.copy()
dfCopy

In [None]:
n1 = dfCopy.isnull().any(axis=1)
n1

### Get only the rows with *null* values

In [None]:
nullRows = dfCopy[n1]
nullRows

### Get only the rows without *null* values

In [None]:
n2 = dfCopy.notnull().all(axis=1)
n2

In [None]:
nonNullRows = dfCopy[n2]
nonNullRows

### Checking *Null Values* using Query Method

In this example, the != operator compares each column's values with themselves. Since *NaN* is never equal to itself, the comparison returns *True* only where the value is *null*.

In [None]:
#preserve original dataset
dfCopy = df.copy()
dfCopy

In [None]:
nullRows = dfCopy.query('Age != Age or Gender != Gender or Salary != Salary')

In [None]:
nullRows
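
The self-comparison trick relies on the floating-point rule that NaN compares unequal to everything, including itself. A small check on hypothetical data (`ages`) shows that it selects the same rows as `isnull()`:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False
print(np.nan != np.nan)   # True

# Hypothetical column with one missing value
ages = pd.DataFrame({"Age": [25.0, np.nan, 27.0]})
by_query = ages.query("Age != Age")
by_isnull = ages[ages["Age"].isnull()]
print(by_query.equals(by_isnull))
```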

###  Remove rows with missing values

The easiest way to deal with records containing missing values (incomplete records) is to ignore them!


In [None]:
#preserve original dataset
dfCopy = df.copy()
dfCopy

In [None]:
dfCopy.shape
#dfCopy

In [None]:
#default axis=0 (index==rows)
rowsWithoutMissingData = dfCopy.dropna()

In [None]:
rowsWithoutMissingData.shape

In [None]:
rowsWithoutMissingData

### Remove *features* (columns) with null-values

Remove problematic columns if they’re not relevant for your analysis.

In [None]:
#Features==Columns (axis 1)
dataWithoutMissingColumns = dfCopy.dropna(axis=1)

In [None]:
dataWithoutMissingColumns
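
`dropna` can also be made less drastic with its `subset`, `thresh` and `how` parameters. The sketch below rebuilds the example data as `people` so that it runs on its own.

```python
import pandas as pd

# Same example data as above, rebuilt so the snippet is self-contained
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 31, 25, None, 27],
    "Gender": ["F", "M", None, "M", "F"],
    "Salary": [50000, None, 30000, 40000, 60000],
})

# Drop a row only when its Salary is missing
with_salary = people.dropna(subset=["Salary"])

# Keep only rows that have at least 4 non-null values
mostly_complete = people.dropna(thresh=4)

# Drop a row only if every value in it is missing (none here)
not_empty = people.dropna(how="all")

print(len(with_salary), len(mostly_complete), len(not_empty))
```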

### Change *Null Values*

In [None]:
nba.info()

In [None]:
#see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
data_with_default_notes = nba.copy()
data_with_default_notes["notes"]=data_with_default_notes["notes"].fillna(value="no notes at all")
data_with_default_notes["notes"].describe()
# Expected:
# count              126314
# unique                232
# top       no notes at all
# freq               120890
# Name: notes, dtype: object

### Invalid Values

For instance, check when pts are zero (0)

In [None]:
nba[nba["pts"] == 0]
#nba[nba["pts"] == 0]['pts']

### Inconsistencies Between Values in Different Columns

In [None]:
nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != "W")].empty
# Expected:
# True

In [None]:
nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != "L")].empty
# Expected:
# True

## 3 - Outliers

### Understanding outliers

Analysing outliers, i.e., values that lie very far from the mean (many standard deviations away).

See
* [How To Find Outliers in Data Using Python](https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/)
* [What Is an Outlier](https://careerfoundry.com/en/blog/data-analytics/what-is-an-outlier/#what-is-an-outlier)
* [How to Find Outliers in a Data Set](https://humansofdata.atlan.com/2017/10/how-to-find-outliers-data-set/)

Outliers are the extreme values within the dataset. They can be found in several ways:
* with describe()
* with statistical calculations (using the standard deviation, including z-scores)
* graphically (boxplot)
* with other methods: Tukey’s Fences (explore!!)

In [None]:
#recover original df

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 25, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}

df = pd.DataFrame(data)

In [None]:
df



---



**A - Find outliers statistically - with *describe*()**

In [None]:
df.describe()

In [None]:
#all standard deviations
std_devs = df.std(numeric_only=True,ddof=1)
std_devs

In [None]:
df.describe()['Salary']



---



**B - Find outliers statistically - with std()**

> To identify potential outliers using the standard deviation method, calculate the mean and standard deviation of a numeric column and check whether any values lie outside the range μ ± kσ, where μ is the mean, σ is the standard deviation, and k is a chosen threshold (commonly k=3).

Attention: To get better results, remember:

1. Remove NaN values
2. Sort the entire DataFrame by the intended feature


Calculating the standard deviations with *std()* and *mean()*

In [None]:
#Analyse only the Salary
desv = df['Salary'].std()
desv

In [None]:
mean = df['Salary'].mean()
mean

In [None]:
#find uncommon values, e.g. more than 2 or 3 (threshold) standard deviations above the mean
df.loc[df['Salary']>= mean + 3 * desv, 'Salary'].count()
#the count is 0: no outliers

Another perspective:

In [None]:
import numpy as np

# Calculate the mean and standard deviation for 'Salary' and 'Age'
for column in ['Salary', 'Age']:
    mean = df[column].mean()
    std_dev = df[column].std()
    threshold = 3  # Common threshold for outlier detection

    # Define the range for outliers
    lower_bound = mean - threshold * std_dev
    upper_bound = mean + threshold * std_dev

    # Identify outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

    print(f"Analysis for {column}:")
    print(f"Mean: {mean}, Std Dev: {std_dev}")
    print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
    print(f"Outliers:\n{outliers}\n")

Ok, according to std, we don't have outliers!

However, this approach works better for columns without missing values. For Salary and Age, missing values are ignored automatically by pandas.
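
The μ ± kσ rule is often written in terms of z-scores, z = (x − μ) / σ, flagging values with |z| > k. The sketch below uses a hypothetical sample (`values`) that is large enough for the rule to trigger.

```python
import pandas as pd

# Hypothetical sample: 20 values close to 50 plus one extreme value
values = pd.Series([48, 49, 50, 51, 52] * 4 + [500])

# z-score of each value, using the sample standard deviation (ddof=1)
z_scores = (values - values.mean()) / values.std()

k = 3
outliers = values[z_scores.abs() > k]
print(outliers)
```

Sample size matters here: with the sample standard deviation, the largest possible |z| in a sample of n values is (n − 1)/√n. For the four non-null salaries above this is 1.5, so a threshold of k=3 can never flag anything in a sample that small.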



---



**C - See it Graphically with Boxplot and IQR**


Steps:

1. Compute Quartiles:
* Q1: The 25th percentile (lower quartile).
* Q3: The 75th percentile (upper quartile).


2. Calculate IQR:
* IQR=Q3-Q1


3. Outlier Range:
* Lower Bound: Q1 - 1.5 x IQR (a factor of 3 is sometimes used to flag only extreme outliers)
* Upper Bound: Q3 + 1.5 x IQR (a factor of 3 is sometimes used to flag only extreme outliers)

4. Outliers:
* Any values below the lower bound or above the upper bound are considered outliers.
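
The steps above can be sketched as a small helper. The name `iqr_bounds` is made up for this example; the rule itself is also known as Tukey's fences.

```python
import pandas as pd

def iqr_bounds(series, factor=1.5):
    """Return the (lower, upper) bounds of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)   # quantile() skips NaN values
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Same salaries as in the example DataFrame
salary = pd.Series([50000, None, 30000, 40000, 60000])
lower, upper = iqr_bounds(salary)
outliers = salary[(salary < lower) | (salary > upper)]

print(lower, upper)   # 15000.0 75000.0
print(outliers)       # empty: no salary lies outside the bounds
```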

In [None]:
df

In [None]:
import plotly.express as px

In [None]:
#create a horizontal box-plot
fig = px.box(df, x='Salary')
fig.update_xaxes(tickangle=-90)  # Rotate tick labels by -90 degrees
fig.update_layout(width=700, height=500)
fig.show()

In [None]:
#create a box-plot
#fig = px.box(df, x='Salary')
fig = px.box(df, y='Salary', title="Vertical Box Plot of Salaries", labels={'Salary': 'Salary ($)'})
fig.update_layout(width=500, height=700)  # Specify width and height in pixels
fig.show()

#what does this mean?
#fig = px.box(df, y='Salary', x='Age', title="Vertical Box Plot of Salaries", labels={'Salary': 'Salary ($)', 'Job': 'Job Role'})

In [None]:
# or
# Create a boxplot for Salary
import matplotlib.pyplot as plt
plt.figure(figsize=(5, 3))
plt.boxplot(df['Salary'].dropna(), vert=False, patch_artist=True,
            boxprops=dict(facecolor='lightblue', color='blue'),
            whiskerprops=dict(color='blue'), capprops=dict(color='blue'),
            medianprops=dict(color='red'))

# Add labels and title
plt.title("Boxplot of Salary", fontsize=14)
plt.xlabel("Salary", fontsize=12)

# Display the plot
plt.show()

Boxes Analysis:

Boxplot Components:

* The box represents the interquartile range (IQR).
* The line inside the box represents the median.
* The whiskers extend to the smallest and largest data points within 1.5 x IQR from Q1 and Q3, respectively.
* Points outside the whiskers are plotted individually as outliers.

Expected Result:

* The salaries (30000, 40000, 50000, 60000) fall within the whiskers.
* No individual points outside the whiskers indicate no outliers in this dataset.

Final remarks:

* No points lie outside the whiskers, confirming there are no outliers in the Salary data

Other tools:

Using Scatter Plots:

In [None]:

import matplotlib.pyplot as plt
# Scatter plot
plt.figure(figsize=(5, 6))
plt.scatter(df.index, df['Salary'], color='blue', label='Salary')
plt.axhline(y=df['Salary'].median(), color='red', linestyle='--', label='Median')
plt.title("Scatter Plot of Salary")
plt.xlabel("Index")
plt.ylabel("Salary")
plt.legend()
plt.grid(True)
plt.show()

Using a Histogram

In [None]:
import matplotlib.pyplot as plt

# Calculate the median
median_value = df['Salary'].median()

# Histogram
plt.figure(figsize=(5, 6))
plt.hist(df['Salary'].dropna(), bins=10, color='skyblue', edgecolor='black')
# Add a vertical line for the median
plt.axvline(median_value, color='red', linestyle='--', linewidth=2, label=f'Median: {median_value}')

# Labels, Title, and Legend
plt.title("Histogram of Salary with Median")
plt.xlabel("Salary")
plt.ylabel("Frequency")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()



---



*What happens if the salary of Eva is 600000?*


In [None]:
#same data as before, but Eva's salary is now 600000

data2 = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 25, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 600000]}

df2 = pd.DataFrame(data2)
df2

In [None]:
df2.describe()

Let's use the IQR (IQR = Q3 - Q1)

In [None]:
import pandas as pd

# Dataset
data2 = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 31, 25, None, 27],
    'Gender': ['F', 'M', None, 'M', 'F'],
    'Salary': [50000, None, 30000, 40000, 600000]
}

df2 = pd.DataFrame(data2)

# Drop NaN values and sort the entire DataFrame by 'Salary'
df_sorted = df2.dropna(subset=['Salary']).sort_values(by='Salary')

# Calculate Q1 (25th percentile) and Q3 (75th percentile) on the 'Salary' column
q1 = df_sorted['Salary'].quantile(0.25)
q3 = df_sorted['Salary'].quantile(0.75)

# IQR = Q3 - Q1
iqr = q3 - q1

print("Q1=",q1)
print("Q3=",q3)
print("IQR=",iqr)


In [None]:
import plotly.express as px

# Box plot for Salary
fig = px.box(df2, y='Salary', title="Box Plot for Salary (Outlier Detection)", labels={'Salary': 'Salary ($)'})
fig.update_layout(width=500, height=700)  # Specify width and height in pixels
fig.show()

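Remarks:
* no point lies beyond the whiskers in this plot.
* Thus, the box plot flags no outliers. Note that with only four salary values the result depends on how the quartiles are estimated: with the pandas values printed earlier (Q1 = 37500, Q3 = 187500, IQR = 150000), the upper bound Q3 + 1.5 x IQR is 412500, and 600000 lies above it.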


**Final Explanation:**

* 600000 affects both the std and the mean!

* Standard deviation works well if the data is normally distributed
* Standard deviation is sensitive to extreme values (outliers): they distort both μ (mean) and σ (standard deviation)
* Thus, std is less effective in datasets with large variability or skewed distributions.
* In these cases use more robust methods like ***Median Absolute Deviation (MAD)*** for outlier detection.
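
A minimal sketch of MAD-based detection on the salaries with Eva at 600000 follows. The scaling constant 0.6745 and the 3.5 cut-off are conventional choices for the "modified z-score" and are assumptions here, not values taken from this notebook.

```python
import pandas as pd

# Salaries from the example above (Eva = 600000), missing value dropped
salary = pd.Series([50000, None, 30000, 40000, 600000]).dropna()

# Standard deviation rule: mean and std are both inflated by the extreme value
mean, std = salary.mean(), salary.std()
flagged_by_std = salary[(salary - mean).abs() > 3 * std]

# MAD rule: median of the absolute deviations from the median
median = salary.median()
mad = (salary - median).abs().median()

# Modified z-score; 0.6745 and 3.5 are conventional values (assumption)
modified_z = 0.6745 * (salary - median) / mad
flagged_by_mad = salary[modified_z.abs() > 3.5]

print(flagged_by_std.tolist())
print(flagged_by_mad.tolist())
```

Here the median is 45000 and the MAD is 10000, so 600000 receives a modified z-score of about 37 and is flagged, while the 3σ rule flags nothing.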



---



**Exercise**

What happens if the salary values are:


'Salary': [50000, 40000, 30000, 70000, 1000000]





---



### Handling outliers


Outliers are the extreme values within the dataset.

Let's analyse a new dataset

**Get the dataset**



In [None]:
filePath="/content/gDrive/MyDrive/Colab Notebooks/MIA - ML - 2024-2025/Datasets/"

In [None]:
dataset = pd.read_csv(filePath+'credit_simple.csv', sep=';')
dataset.shape

In [None]:
dataset.dtypes



---



**Preparing the dataset**

* Identify the dependent variable
* Isolate the feature "CLASSE"

In [None]:
y = dataset['CLASSE']
X = dataset.iloc[:,:-1].copy() #all rows, all columns except the last one; copy() so later edits do not affect dataset

**Find outliers statistically - with *describe*()**

In [None]:
X.describe()

In [None]:

X.describe()['SALDO_ATUAL']
#Note:
# describe() reports the sample standard deviation (ddof=1)
# pandas std() also uses ddof=1 by default; NumPy's np.std() uses ddof=0 (population)
# to make them equal, pass the same ddof to both, e.g. np.std(values, ddof=1)
# "ddof" stands for "Delta Degrees of Freedom"
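
The effect of `ddof` can be checked numerically on a small, hypothetical series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(s.std())                # pandas default: ddof=1 (sample)
print(s.describe()["std"])    # same value as s.std()
print(np.std(s.to_numpy()))   # NumPy default: ddof=0 (population), 2.0 here
print(s.std(ddof=0))          # pandas with ddof=0 matches NumPy's default
```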

Analysing it:

Statistic	| Value
----------|----------
count	| 993 (number of entries)
mean | 24,258,570 (average value)
std	| 688,349,600 (standard deviation)
min	| 250 (minimum value)
25%	| 1,371 (first quartile, lower 25%)
50%	| 2,323 (median, middle value)
75%	| 3,976 (third quartile, upper 25%)
max	| 21,544,410,000 (maximum value)

Notes:
1. Disparity Between std and Quartiles:

* The std (standard deviation) is 688,349,600, which is an extremely large value compared to the interquartile range (IQR), from 1,371 (25%) to 3,976 (75%).
* This suggests that the dataset has outliers or extreme values, especially near the maximum.

2. Skewness:

* The *mean* (24,258,570) is much larger than the *median* (2,323), indicating that the data is right-skewed, likely due to extreme high values (e.g., the maximum: 21,544,410,000).

Conclusion:

Considering SALDO_ATUAL, the maximum value (21,544,410,000) is far above the third quartile (3,976), and the std (688,349,600) is huge compared with the typical values. There are outliers, definitely!

In [None]:
X['SALDO_ATUAL'].max()

In [None]:
X.sort_values('SALDO_ATUAL',ascending=False)['SALDO_ATUAL']

In [None]:
#import numpy as np

# Calculate the mean and standard deviation for 'SALDO_ATUAL'
column='SALDO_ATUAL'
mean = X[column].mean()
std_dev = X[column].std()
threshold = 1.5  # k = 1.5 standard deviations here (k = 3 is the more common choice)

# Define the range for outliers
lower_bound = mean - threshold * std_dev
upper_bound = mean + threshold * std_dev

# Identify outliers
outliers = X[(X[column] < lower_bound) | (X[column] > upper_bound)]
#outliers = X[(X[column] > upper_bound)]



print(f"Analysis for {column}:")
print(f"Mean: {mean}, Std Dev: {std_dev}")
print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
print(f"Outliers:\n{outliers['SALDO_ATUAL']}\n")

#or removing outliers
#import seaborn as sns
#sns.set()
#df_clean = X[(X[column]>lower_bound)&(X[column]<upper_bound)]
#sns.boxplot(y = df_clean[column])

There are two outliers!

**Find outliers statistically - with std()**

Calculating standard deviations with *std*()

In [None]:
#all standard deviations
std_devs = X.std(numeric_only=True,ddof=1)
std_devs

In [None]:
#Analyse only  SALDO_ATUAL
desv = X['SALDO_ATUAL'].std()
desv

In [None]:
#find uncommon values: those above the upper bound computed earlier
X.loc[X['SALDO_ATUAL']> upper_bound, 'SALDO_ATUAL']
#there are two rows (127 and 160) that have such values

It is confirmed that SALDO_ATUAL has outliers (rows 127 and 160)!

**Get outliers, graphically**

There are many methods for visualizing and finding outliers in data:

* Histogram
* Box plot
* Scatter plot

See [Most Common Types of Data Visualization](https://careerfoundry.com/en/blog/data-analytics/data-visualization-types/)

In [None]:
import plotly.express as px

In [None]:
#create a box-plot
fig = px.box(X, x='SALDO_ATUAL')
fig.show()

Correcting Outliers

1. Removing the affected rows
2. Replacing values



Removing the affected rows

In [None]:
#X = X.drop([127,160])

In [None]:
#X.info()

In [None]:
#X.describe()

In [None]:
print(filePath)

Let's handle the outliers by replacing them with the median.

In [None]:
#replace those values by the median
mediana = X['SALDO_ATUAL'].median()
mediana

Replacing with *median*

In [None]:
X.loc[X['SALDO_ATUAL']> upper_bound, 'SALDO_ATUAL']=mediana
#check again
X.loc[X['SALDO_ATUAL']> upper_bound,'SALDO_ATUAL'].count()


In [None]:
#output the resulting DataFrame
#X.to_csv(filePath+'newDataSet4.csv',columns=['SALDO_ATUAL'])

SALDO_ATUAL now has no outliers!
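
Replacing with the median is one option; capping the values at the bound is another. The sketch below compares both on a made-up series `balance` with a hand-picked bound, so neither the data nor the bound comes from `credit_simple.csv`.

```python
import pandas as pd

# Hypothetical balances with one extreme value
balance = pd.Series([1200.0, 2300.0, 3100.0, 4000.0, 2_500_000.0])
upper = 10_000.0   # hypothetical upper bound, chosen by hand for this toy data

# Option 1: replace values above the bound with the median
replaced = balance.mask(balance > upper, balance.median())

# Option 2: cap values at the bound, keeping them as the largest values
capped = balance.clip(upper=upper)

print(replaced.tolist())   # [1200.0, 2300.0, 3100.0, 4000.0, 3100.0]
print(capped.tolist())     # [1200.0, 2300.0, 3100.0, 4000.0, 10000.0]
```

Capping keeps the information that the row had a very high value, while the median replacement hides it; which one fits better depends on the model and on why the extreme value occurred.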

In [None]:
#create a box-plot
fig = px.box(X, x='SALDO_ATUAL')
fig.show()



---




Next Notebook will explore:

1. Data Binning
2. Categorical to Numeric
3. Datasets Manipulation

## 4 - References

[Complete Guide to Feature Engineering: Zero to Hero](https://www.analyticsvidhya.com/blog/2021/09/complete-guide-to-feature-engineering-zero-to-hero/)

End!