### Machine Learning

* ML is not about crunching numbers but about finding valuable insights in the available data.
* Proper domain knowledge of the problem we are working on is essential.

### Balanced dataset

Data in which every category has an (approximately) equal number of samples (categorical data).

* Equal class distribution.

### Imbalanced dataset

Data in which the categories have unequal numbers of samples (categorical data).

* Unequal class distribution.
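One way to check which case a dataset falls into is to count the share of each class. The sketch below is only illustrative: the helper name `class_distribution` and the toy labels are assumptions, not part of the original notes.

```python
from collections import Counter

def class_distribution(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels, used only for illustration
balanced_labels = ["cat", "dog", "cat", "dog"]
imbalanced_labels = ["cat", "cat", "cat", "dog"]

print(class_distribution(balanced_labels))    # {'cat': 0.5, 'dog': 0.5}
print(class_distribution(imbalanced_labels))  # {'cat': 0.75, 'dog': 0.25}
```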

### Linearly separable

> If two or more class distributions can be separated by drawing a line, the data is called linearly separable.
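As a rough illustration, the sketch below tests whether one candidate line `w·x + b = 0` puts two classes on opposite sides. The function names, the points, and the candidate lines are all hypothetical. It checks a given line only; it does not search for a separating line.

```python
def side_of_line(w, b, point):
    """Value of w·x + b; its sign tells which side of the line the point is on."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b

def line_separates(w, b, class_a, class_b):
    """True if all class_a points are on the positive side and all class_b points on the negative side."""
    return (all(side_of_line(w, b, p) > 0 for p in class_a)
            and all(side_of_line(w, b, p) < 0 for p in class_b))

# Hypothetical 2D points for two classes
class_a = [(3, 4), (4, 5), (5, 5)]
class_b = [(0, 1), (1, 0), (1, 1)]

# Candidate line x + y - 5 = 0 separates the two classes
print(line_separates((1, 1), -5, class_a, class_b))  # True
# Candidate line x - y = 0 does not
print(line_separates((1, -1), 0, class_a, class_b))  # False
```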

### 1D Scatter plots

> 1D scatter plots are hard to read: the points pile up along a single axis and overlap, so very little can be interpreted from them.

### Histogram

* A histogram plots a frequency distribution: how often each value (or range of values) occurs in a set of data. It looks much like a bar chart.

* PDF → the Probability Density Function can be viewed as the smoothed form of the histogram.

    - Density Plot → shows how dense (compact) a region is with points.

* Histograms and PDFs are widely used for univariate analysis.
* The farther apart the class distributions are, the better separated the classes are and the easier classification becomes.

* CDF → the Cumulative Distribution Function tells what fraction of the values lie at or below a given point; it is the running total (integral) of the PDF.

    - The range of the CDF is 0 to 1.
    - The cumulative probabilities always lie within this range.

> CDF and PDF play a major role in data analysis, model building, and classification.
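The ideas in this section can be sketched in plain Python: count values per bin for a histogram, and compute the fraction of values at or below a point for an empirical CDF. The helper names, the bin width, and the sample values are assumptions chosen for illustration.

```python
from bisect import bisect_right

def histogram(data, bin_width, start=0):
    """Count how many values fall into each bin, keyed by the bin's left edge."""
    counts = {}
    for x in data:
        left_edge = start + ((x - start) // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

def empirical_cdf(data, x):
    """Fraction of values that are less than or equal to x (always between 0 and 1)."""
    ordered = sorted(data)
    return bisect_right(ordered, x) / len(ordered)

# Hypothetical sample values
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

print(histogram(values, bin_width=2))  # {0: 1, 2: 5, 4: 3, 8: 1}
print(empirical_cdf(values, 3))        # 0.6
print(empirical_cdf(values, 9))        # 1.0
```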

### Mean

Formula

$$\mu = \frac{\sum x_i}{n}$$

* Measures the central tendency.

### Standard deviation

Formula

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

* Standard deviation (spread) is a numerical measure of how far the data points typically lie from their mean.

### Variance

Formula

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

* The average of the squared distances from the mean to each data point.
* Squared distances are never negative, so neither is the variance.
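The three formulas above translate directly into code. This is a minimal sketch of the population versions (dividing by the number of points), cross-checked against Python's `statistics` module. The function names and the sample values are assumptions for illustration.

```python
import math
import statistics

def mean(data):
    """Sum of the values divided by how many there are."""
    return sum(data) / len(data)

def variance(data):
    """Average squared distance from the mean (population variance)."""
    mu = mean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def std_dev(data):
    """Square root of the population variance."""
    return math.sqrt(variance(data))

# Hypothetical sample values
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(mean(sample))      # 5.0
print(variance(sample))  # 4.0
print(std_dev(sample))   # 2.0

# The standard library's population variants should agree
assert math.isclose(variance(sample), statistics.pvariance(sample))
assert math.isclose(std_dev(sample), statistics.pstdev(sample))
```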

**Note**

* Try to remove or account for outliers before computing measures of central tendency.
* Mean, variance, and SD are easily corrupted by outliers, whereas the median is not.
    - The median is corrupted only if more than 50% of the points are corrupted.
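A small experiment makes the note concrete: adding a single extreme value moves the mean a long way but leaves the median where it was. The numbers below are made up for illustration.

```python
import statistics

# Hypothetical measurements, then the same data with one extreme outlier appended
clean = [10, 12, 11, 13, 12]
with_outlier = clean + [1000]

print(statistics.mean(clean), statistics.median(clean))  # 11.6 12
print(round(statistics.mean(with_outlier), 2))           # 176.33
print(statistics.median(with_outlier))                   # 12.0
```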

### Percentile → [50th percentile = Median ($\tilde{x}$)]

* A percentile is a term used in statistics to express how a score compares to other scores in the same set.
* While there is technically no standard definition of percentile, it's typically communicated as the percentage of values that fall below a particular value in a dataset.

Suppose there are $N$ data points and we want to find the $M^{th}$ percentile:

* The data should be sorted.
* The $M^{th}$ percentile is the value found at the position $M\%$ of the way through the sorted data.
* It tells us what percentage of points are less than that value and what percentage are greater.

> The $25^{th}$, $50^{th}$, and $75^{th}$ percentiles are called quartiles; the $100^{th}$ percentile is the maximum value.

> * $50^{th}$ → Median ($\tilde{x}$)
> * Percentiles above the $90^{th}$ are very important because they describe the tail of the distribution.

**Formula**

$$L_p = \bigg [(n - 1) * \frac{p}{100} \bigg ] + 1$$

where - 

* $L_p$ → Locator of p
* $n$ → Total number of elements
* $p$ → the percentile to compute (the $M$ in $M^{th}$ percentile)

In [1]:
dummy_data = [31, 33, 18, 12, 5, 39, 25, 30, 22, 16, 32]

### Percentile exercise

In [2]:
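def compute_percentile(p, data):
    """
    Percentile by linear interpolation between neighbouring sorted values.

    Formula          → l_p = ((n - 1) * (p / 100)) + 1
    percentile_value → data[int(l_p) - 1] + (l_p - int(l_p)) * (data[int(l_p)] - data[int(l_p) - 1])

    :param p   : percentile to compute (0 to 100)
    :param data: data for which the percentile is calculated
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")

    data = sorted(data)

    # 1-based, possibly fractional, position of the percentile in the sorted data
    l_p = (len(data) - 1) * (p / 100) + 1

    int_l_p = int(l_p)
    fl_l_p = l_p - int_l_p

    # Position is the last element (p == 100 or a single data point): nothing to interpolate
    if int_l_p >= len(data):
        return data[-1]

    val1 = data[int_l_p - 1]
    val2 = data[int_l_p]
    pval = val1 + (fl_l_p * (val2 - val1))

    return round(pval, 2)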

In [3]:
compute_percentile(p=46, data=dummy_data)

23.8
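The result can be traced by hand. Sorted, the data is 5, 12, 16, 18, 22, 25, 30, 31, 32, 33, 39, so $n = 11$ and for $p = 46$:

$$L_{46} = \bigg [(11 - 1) * \frac{46}{100} \bigg ] + 1 = 5.6$$

Position 5.6 lies between the 5th value (22) and the 6th value (25), so the percentile is interpolated 0.6 of the way between them:

$$22 + 0.6 * (25 - 22) = 23.8$$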

### Median Absolute Deviation (MAD)

* Absolute deviation of a point from the median =

$$|x_i - \tilde{x}|$$

* MAD = the median of these absolute deviations
    * $x_i$ are the original data values
    * Median → $f(a) = \tilde{a}$
    * $y_i = |x_i - f(x)|$ → the deviations of each point from the median
    * MAD = $f(y)$

### MAD exercise

In [4]:
def get_median(data):
    data = sorted(data)
    inx = len(data) // 2
    
    if (len(data) % 2 == 0):
        inx_l = inx - 1
        median = (data[inx] + data[inx_l]) / 2
    else:
        median = data[inx]
    
    return median

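def compute_mad(data, c=0.6745):
    # Dividing by c ≈ 0.6745 rescales the raw MAD so it estimates the standard deviation for normal data
    med = get_median(data)
    abs_dev = [abs(i - med) for i in data]
    mad = get_median(data=abs_dev) / c
    return round(mad, 2)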

In [5]:
get_median(data=dummy_data)

25

In [6]:
compute_mad(data=dummy_data)

10.38

### IQR (Inter Quartile Range)

* The IQR is the difference between the $75^{th}$ and $25^{th}$ percentiles; the middle 50% of the data lies in this range.

In [7]:
def get_iqr(data):
    p75 = compute_percentile(p=75, data=data)
    p25 = compute_percentile(p=25, data=data)
    return p75 - p25

In [8]:
get_iqr(data=dummy_data)

14.5
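As a cross-check, the standard library can compute the same quartiles. The sketch below assumes Python 3.8+ and relies on `statistics.quantiles` with `method="inclusive"`, which interpolates using the same $(n - 1)$-based position as the formula above. The data is repeated so the block runs on its own.

```python
import statistics

dummy_data = [31, 33, 18, 12, 5, 39, 25, 30, 22, 16, 32]

# n=4 splits the data into quartiles and returns the three cut points
q1, q2, q3 = statistics.quantiles(dummy_data, n=4, method="inclusive")

print(q1, q2, q3)  # 17.0 25.0 31.5
print(q3 - q1)     # 14.5
```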