# **Day02 of Machine Learning**

## **Estimates of Location** 
Variables with measured or count data might have thousands of distinct values. A basic step in exploring the data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located. \
**Estimates of location**, also known as *measures of central tendency*, are statistical metrics that describe the center or typical value of a dataset.

The main estimates of location are:

#### **1. Mean** 
The mean is the sum of all values in a dataset divided by the number of values. It represents the *"average"* of the data.
- It is sensitive to extreme values (outliers), meaning it can be distorted if the data includes significantly high or low numbers. 
- It is most representative when the data is roughly symmetric, as in a *normal distribution*.
$$ 
    \text{Mean} (\mu) = \frac{1}{n} \sum_{i=1}^{n} x_i
$$


In [1]:
import statistics
import numpy as np
import pandas as pd
import scipy.stats as stats

In [2]:
marks = [77, 59, 64, 85, 75, 68, 80, 73, 59]

In [3]:
#using the built-in sum() and len() functions
sum(marks) / len(marks)

71.11111111111111

In [4]:
#using Python Standard Library
statistics.mean(marks)

71.11111111111111

In [5]:
#using numpy
arr = np.array([18, 26, 31, 9, 10, 26, 22, 36, 20])
np.mean(arr)

22.0

In [22]:
#using pandas
women = pd.read_csv("datasets/womenR.csv", index_col=0)
women.head()

|   | height | weight |
|---|--------|--------|
| 1 | 58     | 115    |
| 2 | 59     | 117    |
| 3 | 60     | 120    |
| 4 | 61     | 123    |
| 5 | 62     | 126    |


In [7]:
women.mean()

height     65.000000
weight    136.733333
dtype: float64
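
To make the outlier sensitivity noted above concrete, here is a minimal sketch that appends one extreme value to the `marks` list and recomputes the mean. The value 300 is a made-up outlier chosen only for illustration; it is not part of the original data.

```python
import statistics

marks = [77, 59, 64, 85, 75, 68, 80, 73, 59]

# Hypothetical outlier, added for illustration only
marks_with_outlier = marks + [300]

mean_before = statistics.mean(marks)               # about 71.1
mean_after = statistics.mean(marks_with_outlier)   # 94

# A single extreme value moves the mean by more than 20 points
shift = mean_after - mean_before
print(mean_before, mean_after, shift)
```

After the change, the mean is higher than every one of the original nine marks, so it no longer describes a typical student.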

&nbsp;

#### **2. Median**
The median is the middle value of a dataset when the data points are arranged in ascending or descending order. If the number of data points is odd, it is the exact middle value; if even, it is the average of the two middle values. Compared to the mean, which uses all observations, the median depends only on the values in the center of the sorted data.
- The median is less affected by outliers and skewed data, making it a better measure of central tendency for non-normal or skewed distributions.
- It is also well suited to ordinal data, where values can be ranked but differences between them are not meaningful.

In [8]:
#using python standard library
statistics.median(marks)

73

In [9]:
#using numpy
np.median(arr)

22.0

In [10]:
#using pandas
women.median()

height     65.0
weight    135.0
dtype: float64
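
The odd and even cases described above can be sketched by hand and checked against `statistics.median`. The extra value 300 is again a hypothetical outlier, used here only to make the list even in length and to show how little the median moves.

```python
import statistics

marks = [77, 59, 64, 85, 75, 68, 80, 73, 59]

# Odd count (9 values): the median is the middle element of the sorted data
ordered = sorted(marks)
middle = ordered[len(ordered) // 2]   # 73

# Even count (10 values): the median is the average of the two middle elements
even_marks = marks + [300]            # hypothetical outlier, illustration only
even_sorted = sorted(even_marks)
half = len(even_sorted) // 2
manual_even = (even_sorted[half - 1] + even_sorted[half]) / 2   # (73 + 75) / 2

median_even = statistics.median(even_marks)
print(middle, manual_even, median_even)
```

The extreme value shifts the median by only one point (73 to 74.0), because only the values in the middle of the sorted list matter.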

&nbsp;

#### **3. Mode**
The mode is the value (or values, in case of a tie) that appears most frequently in a dataset. A dataset can have more than one mode (bimodal, multimodal), or no mode at all if all values occur with equal frequency.
- It is a simple summary statistic for categorical data, and it is generally not used for numeric data.
- It is less sensitive to outliers but can be less informative for continuous or large datasets where many values are unique.

In [11]:
#using python standard library
statistics.mode(marks)

59

In [12]:
#using scipy
stats.mode(arr)

ModeResult(mode=26, count=2)

In [13]:
#using pandas
demo = pd.DataFrame(arr, columns=["arr"])
demo.mode()

|   | arr |
|---|-----|
| 0 | 26  |
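
The tie and "no mode" cases mentioned above can be explored with `statistics.multimode`, which is available in the standard library from Python 3.8. The small lists below are made-up examples, not data from this notebook.

```python
import statistics

# Two values tie for most frequent, so this data is bimodal
bimodal = [1, 2, 2, 3, 3, 4]
modes = statistics.multimode(bimodal)              # [2, 3]

# Every value occurs exactly once: multimode returns all of them,
# which is a sign that there is no meaningful mode
all_unique = [5, 6, 7]
no_clear_mode = statistics.multimode(all_unique)   # [5, 6, 7]

# The mode is a natural summary for categorical data
colors = ["red", "blue", "red", "green"]
most_common_color = statistics.mode(colors)        # 'red'

print(modes, no_clear_mode, most_common_color)
```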


&nbsp;

#### **4. Trimmed Mean**
The trimmed mean is similar to the arithmetic mean but is calculated after removing a fixed percentage of the sorted values from each end of the dataset (the largest and the smallest). This helps reduce the influence of outliers.
- It provides a more robust estimate of the central tendency in datasets that contain extreme values.
- It is commonly used in situations where data may have errors or outliers that could distort the simple mean.

In [14]:
#using scipy: 0.2 cuts 20% of the values from each end
stats.trim_mean(arr, 0.2)

21.857142857142858

In [15]:
#using scipy
stats.trim_mean(women, 0.2)

array([ 65.        , 135.77777778])
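
As a rough sketch of what happens behind the call above, the trimmed mean can be written by hand: sort the values, drop a fixed number from each end, and average the rest. The helper `trimmed_mean` is a hypothetical name introduced here. It assumes the number of values cut per end is the proportion times the count, rounded down, and that the proportion is below 0.5.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end.

    Assumes 0 <= proportion < 0.5 so that at least one value remains.
    """
    ordered = sorted(values)
    k = int(proportion * len(ordered))      # values cut per end, rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

arr = [18, 26, 31, 9, 10, 26, 22, 36, 20]

# With 9 values and proportion 0.2, k = int(1.8) = 1:
# the smallest (9) and largest (36) values are dropped
result = trimmed_mean(arr, 0.2)
print(result)
```

For `arr` this gives 153 / 7, about 21.857, which agrees with the value shown above.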

&nbsp;

#### **5. Weighted Mean**
The weighted mean accounts for the importance (or weight) of each data point. Each value is multiplied by a corresponding weight (user specified), and the sum of these products is divided by the sum of the weights.
- It is useful when some values are intrinsically more variable than others: highly variable observations are given a lower weight.
- It is also useful when the data collected does not equally represent the different groups that we are interested in measuring: underrepresented groups can be given a higher weight.

$$ 
   \text{Weighted Mean} (\bar{x}) = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
$$

In [16]:
#using numpy
w = np.array([2,5,6,5,2,2,3,6,3])
np.average(arr, weights=w)

23.852941176470587

In [17]:
# using numpy on pandas columns
np.average(women["height"], weights=women["weight"])

65.47098976109216
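
The formula above translates directly into a few lines of plain Python. The helper `weighted_mean` is a hypothetical name used only for this sketch; the values and weights are the same as in the NumPy example.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / total_weight

arr = [18, 26, 31, 9, 10, 26, 22, 36, 20]
w = [2, 5, 6, 5, 2, 2, 3, 6, 3]

result = weighted_mean(arr, w)                      # 811 / 34

# With equal weights the weighted mean reduces to the ordinary mean
equal_weights = weighted_mean(arr, [1] * len(arr))  # 22.0
print(result, equal_weights)
```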

&nbsp;

#### **6. Harmonic Mean**
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the data values. It is particularly useful for rates and ratios.
- It is appropriate for averaging rates (e.g., speed or price per unit), and large values influence it less than they influence the arithmetic mean.
- It is highly sensitive to small values in the dataset.

$$ 
\text{Harmonic Mean (H)} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}
$$

In [18]:
#using scipy 
stats.hmean(arr)

18.033176307516147

In [19]:
#using scipy on pandas dataframes
stats.hmean(women)

array([ 64.71181223, 135.12230245])
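
A classic case for the harmonic mean is average speed over equal distances. The trip below is a made-up example: 120 km out at 40 km/h and 120 km back at 60 km/h. The sketch compares the true average speed (total distance over total time) with the harmonic and arithmetic means of the two speeds.

```python
import math
import statistics

speeds = [40, 60]   # km/h, hypothetical values
distance = 120      # km per leg, hypothetical value

# True average speed = total distance / total time
total_time = sum(distance / s for s in speeds)         # 3 h + 2 h
true_average = (distance * len(speeds)) / total_time   # 240 km / 5 h

harmonic = statistics.harmonic_mean(speeds)
arithmetic = statistics.mean(speeds)

# The harmonic mean matches the true average speed; the arithmetic mean overstates it
print(true_average, harmonic, arithmetic)
assert math.isclose(harmonic, true_average)
```

The arithmetic mean gives 50 km/h, but more time is spent on the slower leg, so the true average is 48 km/h.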

&nbsp;

#### **7. Geometric Mean**
The geometric mean is the nth root of the product of n numbers. It is used for datasets that are multiplicative in nature or have exponential growth (e.g., rates of return, growth rates).
- It is more appropriate than the arithmetic mean when dealing with ratios, percentages, or exponential data.
- It tends to dampen the impact of extreme values, making it less sensitive to outliers than the arithmetic mean.

$$ 
    \text{Geometric Mean (G)} = \left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}
$$

In [20]:
#using scipy 
stats.gmean(arr)

20.109330884536913

In [21]:
#using scipy on pandas dataframe
stats.gmean(women)

array([ 64.85599887, 135.92201301])
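
To show why the geometric mean suits multiplicative data, consider three yearly growth factors. The figures are made up for illustration: +10%, +20%, and -10%. Applying the geometric mean as a constant factor for three years reproduces the actual total growth, which the arithmetic mean does not. The sketch uses `statistics.geometric_mean` and `math.prod`, both available from Python 3.8.

```python
import math
import statistics

# Hypothetical yearly growth factors: +10%, +20%, -10%
factors = [1.10, 1.20, 0.90]

total_growth = math.prod(factors)               # overall growth over 3 years
geometric = statistics.geometric_mean(factors)
arithmetic = statistics.mean(factors)

# Compounding the geometric mean for 3 years matches the real total growth
compounded_geometric = geometric ** len(factors)
compounded_arithmetic = arithmetic ** len(factors)

print(total_growth, compounded_geometric, compounded_arithmetic)
assert math.isclose(compounded_geometric, total_growth)
```

Compounding the arithmetic mean overstates the total growth here, which is why average growth rates are usually reported with the geometric mean.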