## Central Tendency
    A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution.
    
    Common measures are:
    1. Arithmetic mean
    2. Median
    3. Mode
    4. Geometric mean
    5. Harmonic mean

### Mean
    The mean is the average value of the data: the sum of the values divided by the number of values.

### Use of mean in Machine Learning
    The arithmetic mean is useful in machine learning when summarizing a variable, e.g. reporting its expected (typical) value. This is most meaningful when the variable has a Gaussian or Gaussian-like distribution. The arithmetic mean can be calculated using statistics.mean() from the standard library, the mean() NumPy function, or the mean() method of a pandas DataFrame.

![mean.jpg](attachment:mean.jpg)

In [1]:
import statistics

data = [11, 21, 11, 19, 46, 21, 19, 29, 21, 18, 3, 11, 11]

x = statistics.mean(data)
print(x)

18.53846153846154


In [2]:
# Mean using numpy.mean() function
from numpy import mean
number_list = [19, 21, 46, 11, 18]
avg = mean(number_list)
print("The average of List is ", round(avg, 2))

The average of List is  23.0


In [3]:
# Pandas dataframe.mean()

import pandas as pd 
  
# Creating the dataframe  
df = pd.DataFrame({"A":[12, 4, 5, 44, 1], 
                   "B":[5, 2, 54, 3, 2],  
                   "C":[20, 16, 7, 3, 8], 
                   "D":[14, 3, 17, 2, 6]}) 
  
# Print the dataframe 
df 

    A   B   C   D
0  12   5  20  14
1   4   2  16   3
2   5  54   7  17
3  44   3   3   2
4   1   2   8   6


In [4]:
# Column Wise
df.mean(axis = 0) 

A    13.2
B    13.2
C    10.8
D     8.4
dtype: float64

In [5]:
# Row Wise
df.mean(axis = 1) 

0    12.75
1     6.25
2    20.75
3    13.00
4     4.25
dtype: float64

In [8]:
# If Dataset has NA values
import pandas as pd 
  
# Creating the dataframe  
df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                   "B":[7, 2, 54, 3, None], 
                   "C":[20, 16, 11, 3, 8], 
                   "D":[14, 3, None, 2, 6]}) 
  
# Skip the NA values when computing the mean (skipna=True is the default)
df.mean(axis = 0, skipna = True) 

A     5.50
B    16.50
C    11.60
D     6.25
dtype: float64

    The mean is essentially a model of your data set: a single value that stands in for all the others. You will notice, however, that the mean is often not one of the actual values that you have observed in your data set. One of its important properties is that it minimises error in the prediction of any one value in your data set. That is, it is the value that produces the lowest total squared error when compared against all other values in the data set.

    An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
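
Both properties can be checked numerically. The sketch below is a minimal illustration that reuses the sample list from the first cell and measures error as the sum of squared differences; the helper `squared_error` and the list of candidate centers are made up for the example.

```python
import math

data = [11, 21, 11, 19, 46, 21, 19, 29, 21, 18, 3, 11, 11]
mean = sum(data) / len(data)

# Deviations from the mean cancel out (up to floating-point rounding).
deviation_sum = sum(x - mean for x in data)


def squared_error(center):
    """Sum of squared differences between each value and `center`."""
    return sum((x - center) ** 2 for x in data)


# Every other candidate center gives a larger squared error than the mean.
candidates = [mean - 1, mean + 1, 11, 21]
mean_is_best = all(squared_error(mean) < squared_error(c) for c in candidates)

print(math.isclose(deviation_sum, 0.0, abs_tol=1e-9))
print(mean_is_best)
```

The four candidates are only spot checks; the general result follows from setting the derivative of the squared error to zero.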

### Disadvantage of mean
    The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

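| Staff  | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  |
|--------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| Salary | 15k | 18k | 16k | 14k | 15k | 15k | 12k | 17k | 90k | 95k |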

    The mean salary for these ten staff is 30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the 12k to 18k range. The mean is being skewed by the two large salaries. Therefore, in this situation, we would like to have a better measure of central tendency. As we will find out later, taking the median would be a better measure of central tendency in this situation.
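
The effect of the two large salaries is easy to reproduce. This is a minimal sketch using the standard-library `statistics` module; the 50k cut-off used to drop the two outliers is an arbitrary choice for illustration.

```python
import statistics

# Salaries from the table above, in thousands.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# Arbitrary cut-off (an assumption) to exclude the two large salaries.
typical = [s for s in salaries if s < 50]
mean_without_outliers = statistics.mean(typical)

print(mean_salary)            # 30.7
print(median_salary)          # 15.5
print(mean_without_outliers)  # 15.25
```

The median lands inside the 12k to 18k range where most salaries sit, and so does the mean once the two outliers are removed.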

    Another time when we usually prefer the median over the mean (or mode) is when our data is skewed (i.e., the frequency distribution for our data is skewed). If we consider the normal distribution - as this is the most frequently assessed in statistics - when the data is perfectly normal, the mean, median and mode are identical. Moreover, they all represent the most typical value in the data set. However, as the data becomes skewed the mean loses its ability to provide the best central location for the data because the skewed data is dragging it away from the typical value. However, the median best retains this position and is not as strongly influenced by the skewed values. This is explained in more detail in the skewed distribution section later in this guide.



### Geometric Mean
    The geometric mean is calculated as the N-th root of the product of all values, where N is the number of values.
    For example, if the data contains only two values, the square root of the product of the two values is the geometric mean. For three values, the cube-root is used, and so on.
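
The definition translates directly into code. The sketch below shows two equivalent ways to compute it; the function names are made up for this example, and the logarithm form is included because the plain product can overflow for long lists.

```python
import math


def geometric_mean(values):
    """N-th root of the product of N positive values."""
    n = len(values)
    return math.prod(values) ** (1 / n)


def geometric_mean_logs(values):
    """Same quantity computed as exp(mean of logs)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


print(geometric_mean([2, 8]))          # square root of 2 * 8
print(geometric_mean([1, 3, 9]))       # cube root of 1 * 3 * 9
print(geometric_mean_logs([1, 3, 9]))  # same value, up to rounding
```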

### When do we use the geometric mean?
    The geometric mean is appropriate when the data contains values with different units of measure, e.g. some measures are heights, some are dollars, some are miles, etc.

### Use of geometric mean in machine learning
    
    One common example of the geometric mean in machine learning is the so-called G-Mean, a model evaluation metric calculated as the geometric mean of the sensitivity and specificity metrics.
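
As a rough sketch of how the G-Mean could be computed by hand, the example below starts from confusion-matrix counts. The counts are hypothetical and describe an imbalanced binary classifier; they do not come from any real model.

```python
import math

# Hypothetical confusion-matrix counts (assumed for illustration).
tp, fn = 40, 10    # actual positives: correctly and incorrectly classified
tn, fp = 900, 50   # actual negatives: correctly and incorrectly classified

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
g_mean = math.sqrt(sensitivity * specificity)

accuracy = (tp + tn) / (tp + fn + tn + fp)

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
print(f"G-Mean={g_mean:.3f}, accuracy={accuracy:.3f}")
```

Because the two rates are multiplied, a classifier that does poorly on either class gets a low G-Mean even when its overall accuracy looks high.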

    Geometric means can be useful in machine learning and artificial intelligence applications when comparing items that have different properties and numerical ranges. Because the geometric mean is built from a product rather than a sum, a value with a large numerical range has much less effect on it than on the arithmetic mean, so items measured on different scales can be compared more directly.

    Geometric means are useful when growth is proportional or varies nonlinearly, which can be true of systems in data science and machine learning.


The geometric mean does not accept negative or zero values, i.e. all values must be positive.

In [9]:
# example of calculating the geometric mean
from scipy.stats import gmean
# define the dataset
data = [1, 2, 3, 40, 50, 60, 0.7, 0.88, 0.9, 1000]
# calculate the mean
result = gmean(data)
print('Geometric Mean: %.3f' % result)

Geometric Mean: 7.246


### Harmonic Mean
    The harmonic mean is the reciprocal of the average of the reciprocals of the values.

![harmonic-mean.svg](attachment:harmonic-mean.svg)
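
The definition can be written out in a couple of lines. This is a minimal sketch; the hand-written `harmonic_mean` function is for illustration only, and the standard-library `statistics.harmonic_mean` is shown next to it for comparison.

```python
import statistics


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


data = [1, 2, 4]

print(harmonic_mean(data))             # 3 / (1/1 + 1/2 + 1/4)
print(statistics.harmonic_mean(data))  # standard-library equivalent
```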

### How harmonic mean is used in machine learning
    1. The harmonic mean is used in machine learning to calculate the F-score (or F-measure), a measure for evaluating the performance of algorithms in information retrieval.
    2. The harmonic mean is the appropriate mean if the data is comprised of rates. For example, in machine learning we have rates when evaluating models, such as the true positive rate or the false positive rate of predictions.
    3. The harmonic mean does not accept rates with a negative or zero value, i.e. all rates must be positive.

    One common example of the use of the harmonic mean in machine learning is the F-Measure (also called the F1-Measure; the Fbeta-Measure is its weighted generalization), a model evaluation metric calculated as the harmonic mean of the precision and recall metrics.
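
To make the link between the F-Measure and the harmonic mean concrete, here is a small sketch. The function name `f_beta` and the precision and recall values are assumptions for illustration; with `beta=1` the result equals the plain harmonic mean of the two inputs.

```python
import statistics


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical scores for some classifier.
precision, recall = 0.5, 1.0

f1 = f_beta(precision, recall)
print(f1)
print(statistics.harmonic_mean([precision, recall]))  # same value as F1
print(f_beta(precision, recall, beta=2.0))            # weights recall more
```

Note that the arithmetic mean of these two scores would be 0.75, while the harmonic mean is pulled toward the weaker of the two.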

In [10]:
# example of calculating the harmonic mean
from scipy.stats import hmean
# define the dataset
data = [0.11, 0.22, 0.33, 0.44, 0.55, 0.66, 0.77, 0.88, 0.99]
# calculate the mean
result = hmean(data)
print('Harmonic Mean: %.3f' % result)

Harmonic Mean: 0.350


### How to Choose the Correct Mean?
1. If values have the same units: Use the arithmetic mean.
2. If values have differing units: Use the geometric mean.
3. If values are rates: Use the harmonic mean.
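
The sketch below applies each mean to a small made-up data set using the standard-library `statistics` module (Python 3.8 or later). All numbers are hypothetical, and the geometric case is illustrated with growth factors, following the earlier note on proportional growth, rather than with mixed units.

```python
import statistics

# Same units (exam scores): arithmetic mean.
scores = [70, 80, 90]
average_score = statistics.mean(scores)

# Multiplicative growth factors per year: geometric mean gives the
# constant yearly factor that produces the same overall growth.
growth_factors = [1.10, 1.20, 0.95]
average_growth = statistics.geometric_mean(growth_factors)

# Rates (speeds over two legs of equal distance): harmonic mean gives
# the average speed for the whole trip.
speeds = [40, 60]
average_speed = statistics.harmonic_mean(speeds)

print(average_score)
print(round(average_growth, 4))
print(average_speed)
```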

## Median
    The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

    65	55	89	56	35	14	56	55	87	45	92
    
    We first need to rearrange that data into order of magnitude (smallest first):

    14	35	45	55	55	56	56	65	87	89	92
    
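    The median is the middle value of the ordered data. Here there are 11 values, so the median is the 6th one: 56. When the number of values is even, the median is the average of the two middle values.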
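
The same procedure, sort and then pick the middle, can be sketched as a small function. The function is illustrative only (in practice `statistics.median` or `numpy.median` does the same job), and the ten-value list is an added example of the even-length case.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
print(median(scores))       # 56

# Dropping the last score leaves ten values, so the two middle
# values (55 and 56) are averaged.
print(median(scores[:-1]))  # 55.5
```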

### When should you use the median?
    The median is the most informative measure of central tendency for skewed distributions or distributions with outliers.

    In skewed distributions, more values fall on one side of the center than the other, and the mean, median and mode all differ from each other.

    In a positively skewed distribution, there’s a cluster of lower scores and a spread out tail on the right.
    
![positively-skewed-distribution.png](attachment:positively-skewed-distribution.png)

    In a negatively skewed distribution, there’s a cluster of higher scores and a spread out tail on the left.

![negatively-skewed-distribution.png](attachment:negatively-skewed-distribution.png)

    Because the median only uses one or two values from the middle of a data set, it’s largely unaffected by extreme outliers or non-symmetric distributions of scores. In contrast, the positions of the mean and mode can vary in skewed distributions.

    For this reason, the median is often reported as a measure of central tendency for variables such as income, because these distributions are usually positively skewed.
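
A quick simulation shows how the mean and median separate under skew. This is only a sketch: the exponential sample is a stand-in for a positively skewed variable such as income, and the distribution parameters, sample size and seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Symmetric data: draws from a normal distribution.
symmetric = [rng.gauss(50, 10) for _ in range(10_000)]

# Positively skewed data: draws from an exponential distribution.
skewed = [rng.expovariate(1 / 50) for _ in range(10_000)]

for name, sample in [("symmetric", symmetric), ("skewed", skewed)]:
    mean = statistics.fmean(sample)
    median = statistics.median(sample)
    print(f"{name}: mean={mean:.1f}, median={median:.1f}")
```

For the symmetric sample the two values nearly coincide, while for the skewed sample the long right tail pulls the mean well above the median.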

    The level of measurement of your variable also determines whether you can use the median. The median can only be used on data that can be ordered – that is, from ordinal, interval and ratio levels of measurement.

In [11]:
import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

87.0


In [12]:
import statistics 
  
# Unsorted list of integers
data1 = [2, -2, 3, 6, 9, 4, 5, -1]

# With an even number of values, the median is the
# average of the two middle values (3 and 4 here)
print(f"Median of data-set is : {statistics.median(data1)}")

Median of data-set is : 3.5


## Mode
    The mode is the most frequent score in our data set. On a bar chart or histogram it corresponds to the highest bar. You can, therefore, sometimes consider the mode as being the most popular option. An example of a mode is presented below:
![mode-1.png](attachment:mode-1.png)

    Normally, the mode is used for categorical data where we wish to know which is the most common category, as illustrated below:

![mode-1a.png](attachment:mode-1a.png)

### How many modes can you have?
    A data set can have no mode, one mode or more than one mode; it all depends on how many different values repeat most frequently.

    Your data can be:

1. without any mode
2. unimodal, with one mode,
3. bimodal, with two modes,
4. trimodal, with three modes, or
5. multimodal, with four or more modes.
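
These cases can be told apart with `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest count. The wrapper `describe_modes` below is a made-up helper, and treating "no value repeats" as "no mode" is a convention assumed for this sketch.

```python
import statistics


def describe_modes(values):
    """Return the most frequent values, or [] when no value repeats."""
    # multimode returns every value when all counts are 1, so that case
    # is handled separately and reported as "no mode".
    if len(values) > 1 and len(set(values)) == len(values):
        return []
    return statistics.multimode(values)


print(describe_modes([1, 2, 3, 4]))     # []      no mode
print(describe_modes([1, 2, 2, 3]))     # [2]     unimodal
print(describe_modes([1, 1, 2, 3, 3]))  # [1, 3]  bimodal
```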

### When to use the mode
    The level of measurement of your variables determines when you should use the mode.

    The mode works best with categorical data. It is the only measure of central tendency for nominal variables, where it can reflect the most commonly found characteristic (e.g., demographic information). The mode is also useful with ordinal variables – for example, to reflect the most popular answer on a ranked scale (e.g., level of agreement).
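
For nominal data the mode is simply the category with the highest count. A minimal sketch with hypothetical survey responses, using `collections.Counter` from the standard library:

```python
import statistics
from collections import Counter

# Hypothetical nominal data: eye colour recorded for six respondents.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(eye_colors)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)       # brown 3
print(statistics.mode(eye_colors))  # brown
```

A mean or median would be meaningless here, since the categories have no numeric value or natural order.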
    
    For quantitative data, such as reaction time or height, the mode may not be a helpful measure of central tendency. That’s because there are often many more possible values for quantitative data than there are for categorical data, so it’s unlikely for values to repeat.

    Example of quantitative data with no mode
    
    You collect data on reaction times in a computer task, and your data set contains values that are all different from each other.

In [14]:
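from scipy import stats

# 86 and 1 both occur three times; when values tie,
# scipy.stats.mode returns only the smallest of them
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86,1,1,1]

x = stats.mode(speed)

# Older SciPy releases print arrays, as below; newer ones print scalars
print(x)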

ModeResult(mode=array([1]), count=array([3]))


### When should you use the mean, median or mode?
    The 3 main measures of central tendency are best used in combination with each other because they have complementary strengths and limitations. But sometimes only 1 or 2 of them are applicable to your data set, depending on the level of measurement of the variable.

1. The mode can be used for any level of measurement, but it’s most meaningful for nominal and ordinal levels.
2. The median can only be used on data that can be ordered – that is, from ordinal, interval and ratio levels of measurement.
3. The mean can only be used on interval and ratio levels of measurement because it requires equal spacing between adjacent values or scores in the scale.

    To decide which measures of central tendency to use, you should also consider the distribution of your data set.

    For normally distributed data, all three measures of central tendency will give you approximately the same answer, so they can all be used.

    In skewed distributions, the median is the best measure because it is largely unaffected by extreme outliers or non-symmetric distributions of scores. The mean and mode can vary in skewed distributions.