### Outlier
An outlier is a data point that lies far from the other observations in a data set, outside the overall distribution of the data.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Criteria for identifying an outlier

1. A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
2. A data point that lies more than 3 standard deviations from the mean, i.e. whose z score has an absolute value greater than 3. (A stricter cutoff of 2 standard deviations is sometimes used instead.)

### Various ways of finding outliers
1. Using scatter plots
2. Using a box plot
3. Using the z score
4. Using the IQR (interquartile range)
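The two visual methods are not demonstrated elsewhere in this notebook, so here is a minimal sketch of both. It assumes the same sample values as the dataset used in the z score section; the variable names (`values`, `ax_scatter`, `ax_box`, `flier_values`) are illustrative only. By default, matplotlib's `boxplot` draws whiskers at 1.5 times the IQR and marks points beyond them as fliers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed sample data: the same values as the dataset used later in this notebook.
values = np.array([-5, 11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19,
                   107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12,
                   10, 14, 13, 15, 10])

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: position in the list on the x-axis, value on the y-axis.
# Points far from the main cluster are candidate outliers.
ax_scatter.scatter(np.arange(len(values)), values)
ax_scatter.set_title("Scatter plot")
ax_scatter.set_xlabel("Index")
ax_scatter.set_ylabel("Value")

# Box plot: with the default whis=1.5, points beyond the whiskers are drawn as fliers.
box = ax_box.boxplot(values)
ax_box.set_title("Box plot")

# Read back the points that matplotlib placed outside the whiskers.
flier_values = sorted(box["fliers"][0].get_ydata().tolist())
print("Points outside the whiskers:", flier_values)

# In a notebook with %matplotlib inline the figure renders automatically;
# call plt.show() when running this as a script.
```

Reading the fliers back from the returned dictionary is optional; the plot alone is usually enough for a first visual check.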

## Detecting outliers using the Z score

### Using the Z score

The Z score expresses how many standard deviations an observation lies from the mean:

Z score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ

In [10]:
import numpy as np

In [11]:
dataset = [-5, 11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

In [12]:
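def detect_outliers(data, threshold=3):
    """Return the values whose absolute z score exceeds `threshold`."""
    # Build a fresh list on every call; a module-level list would
    # accumulate duplicates if this cell were run more than once.
    outliers = []
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)

    for value in data:
        z_score = (value - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(value)

    return outliers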

In [13]:
outlier_pt = detect_outliers(dataset)
print(outlier_pt)

[102, 107, 108]
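The same check can be written without an explicit loop. The following is a minimal sketch using NumPy boolean masking; the names `values`, `z_scores` and `mask` are illustrative, and the threshold of 3 is the one used above.

```python
import numpy as np

dataset = [-5, 11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14,
           13, 15, 10]

values = np.asarray(dataset, dtype=float)

# z score of every observation at once: (x - mean) / population std
z_scores = (values - values.mean()) / values.std()

# Boolean mask selecting observations more than 3 standard deviations from the mean
mask = np.abs(z_scores) > 3
print("Outliers (z score):", values[mask])

# z score of the smallest value, for comparison
print("z score of -5:", z_scores[values == -5][0])
```

Note that -5 is not flagged here: the three large values pull the mean and the standard deviation upward, so the z score of -5 stays well inside the threshold. This sensitivity to the outliers themselves is a known limitation of the z score method.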


## Interquartile Range (IQR)

The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset.

### Steps
1. Arrange the data in increasing order
2. Calculate the first quartile (q1) and the third quartile (q3)
3. Find the interquartile range: iqr = q3 - q1
4. Find the lower bound: q1 - 1.5 * iqr
5. Find the upper bound: q3 + 1.5 * iqr

Any value that lies below the lower bound or above the upper bound is an outlier.

In [14]:

print(sorted(dataset))


[-5, 10, 10, 10, 10, 10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 17, 19, 102, 107, 108]


In [15]:

quantile1, quantile3 = np.percentile(dataset,[25,75])

print("Q1:", quantile1, " Q3:", quantile3)


Q1: 12.0  Q3: 15.0


In [16]:

iqr_value = quantile3 - quantile1

print("IQR:", iqr_value)


IQR: 3.0


In [17]:

lower_bound_val = quantile1 - (1.5 * iqr_value) 

upper_bound_val = quantile3 + (1.5 * iqr_value) 

print("Lower Bound:", lower_bound_val, " Upper Bound:", upper_bound_val)


Lower Bound: 7.5  Upper Bound: 19.5


In [None]:

iqr_outliers = [x for x in dataset if (x < lower_bound_val) or (x > upper_bound_val)]
print("Outliers (IQR):", iqr_outliers)
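The IQR steps can also be collected into one reusable function. This is a sketch only: `find_iqr_outliers` and its `k` parameter are hypothetical names introduced here, with `k=1.5` matching the multiplier used above.

```python
import numpy as np

def find_iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return [x for x in data if x < lower_bound or x > upper_bound]

dataset = [-5, 11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14,
           13, 15, 10]

print(find_iqr_outliers(dataset))  # [-5, 102, 107, 108]
```

Unlike the z score check with a threshold of 3, the IQR rule also flags -5 for this dataset, because quartiles are barely affected by the extreme values.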


### Skewness

Skewness measures the asymmetry of a data distribution. 

Positive skew means the tail is on the right, negative skew means the tail is on the left.


In [None]:
import pandas as pd

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

series = pd.Series(data)

# Calculate skewness
skewness = series.skew()
print("Skewness:", skewness)


In [None]:
from scipy.stats import skew

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

# Calculate skewness
skewness_value = skew(data)
print("Skewness:", skewness_value)
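The pandas and SciPy calls above return different numbers for the same data. `scipy.stats.skew` defaults to the biased moment estimator (`bias=True`), while `Series.skew()` applies a sample-size correction. The sketch below computes the biased estimator by hand from central moments and shows that `skew(data, bias=False)` lines up with pandas; the names `m2`, `m3` and `manual_skew` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
values = np.asarray(data, dtype=float)

# Biased estimator: third central moment divided by the
# second central moment raised to the power 3/2.
deviations = values - values.mean()
m2 = np.mean(deviations ** 2)
m3 = np.mean(deviations ** 3)
manual_skew = m3 / m2 ** 1.5

scipy_biased = skew(data)                # SciPy default (bias=True)
scipy_unbiased = skew(data, bias=False)  # bias-corrected variant
pandas_skew = pd.Series(data).skew()     # pandas applies the correction by default

print("Manual (biased):   ", manual_skew)
print("SciPy (default):   ", scipy_biased)
print("SciPy (bias=False):", scipy_unbiased)
print("pandas:            ", pandas_skew)
```

For small samples such as this one the two estimators can differ noticeably, so it helps to state which one is being reported.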



skewness > 0 → Right-skewed (long tail on right)

skewness < 0 → Left-skewed (long tail on left)

skewness ≈ 0 → Symmetrical distribution


Right Skew (Positive Skew)

    Tail: Longer on the right side (towards higher values).

    Data concentration: Most values are on the left.

    Mean, Median, Mode: Typically Mean > Median > Mode (a rule of thumb, not a guarantee).

Left Skew (Negative Skew)

    Tail: Longer on the left side (towards lower values).

    Data concentration: Most values are on the right.

    Mean, Median, Mode: Typically Mean < Median < Mode (a rule of thumb, not a guarantee).
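As a quick check of the mean and median ordering, the sketch below reuses the right-skewed sample from the skewness cells and mirrors it to get a left-skewed one. The mirrored list is an assumption made purely for illustration.

```python
import numpy as np

right_skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
# Mirror the data around zero to obtain a left-skewed sample (illustrative only).
left_skewed = [-x for x in right_skewed]

right_mean, right_median = np.mean(right_skewed), np.median(right_skewed)
left_mean, left_median = np.mean(left_skewed), np.median(left_skewed)

# The single extreme value drags the mean towards the tail,
# while the median stays near the bulk of the data.
print("Right skew: mean =", right_mean, " median =", right_median)
print("Left skew:  mean =", left_mean, " median =", left_median)
```

The mode is left out here because every value in this sample occurs exactly once.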
    

### Kurtosis

Kurtosis measures the "tailedness" of a distribution, i.e., how heavy its tails are compared to a normal distribution.

High kurtosis (Leptokurtic):

    Distribution has heavy tails (more outliers).

    Peak is sharper than normal.

    Example: Stock returns with extreme highs/lows.

Low kurtosis (Platykurtic):

    Distribution has light tails (fewer outliers).

    Peak is flatter than normal.

Normal kurtosis (Mesokurtic):

    Kurtosis ≈ 3 under the standard (Pearson) definition, which corresponds to an excess kurtosis of ≈ 0 under Fisher's definition.

    Example: Normal distribution.

Kurtosis is often described as how "peaked" or "flat" a distribution is compared to a normal distribution, but it is driven mainly by the heaviness of the tails (outliers).

In [None]:
from scipy.stats import kurtosis

data = [1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 50]  

kurt = kurtosis(data)  

print("Kurtosis:", kurt)


In [None]:
import pandas as pd

data = [1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 50]
series = pd.Series(data)

kurt = series.kurt()

print("Kurtosis:", kurt)


Interpretation (Fisher's definition, i.e. excess kurtosis, which is the default in both SciPy and pandas):

    kurtosis > 0 → Leptokurtic (peaked, heavy tails)
    
    kurtosis < 0 → Platykurtic (flat, light tails)
    
    kurtosis ≈ 0 → Mesokurtic (normal-like)
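To relate the two definitions and the two libraries, here is a short sketch. It relies on the `fisher` and `bias` parameters of `scipy.stats.kurtosis`; the variable names are illustrative. Pearson's kurtosis is Fisher's value plus 3, and `Series.kurt()` corresponds to SciPy's bias-corrected Fisher estimate.

```python
import pandas as pd
from scipy.stats import kurtosis

data = [1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 50]

excess = kurtosis(data)                 # Fisher definition, biased (SciPy defaults)
pearson = kurtosis(data, fisher=False)  # Pearson definition: normal distribution gives 3
unbiased = kurtosis(data, bias=False)   # Fisher definition with bias correction
pandas_kurt = pd.Series(data).kurt()    # pandas: bias-corrected Fisher kurtosis

print("Fisher (excess):      ", excess)
print("Pearson:              ", pearson)
print("Fisher, bias-corrected:", unbiased)
print("pandas:               ", pandas_kurt)
```

For this sample the excess kurtosis is positive, which is consistent with the single extreme value of 50 producing a heavy right tail.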