# Descriptive Statistics

Descriptive statistics (also called summary statistics) describe a data set using a small number of statistical quantities, such as:

1. measures of location or centrality
2. measures of variability or spread
3. measures of distribution
4. measures of distribution shape

In this notebook, statistical quantities are used to describe a 1-D dataset.

In [1]:
#Standard Imports
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
#Display options
pd.options.display.max_rows = 10

In [2]:
# Dataset : demo dataset
tips = sns.load_dataset("tips") #tips is a Pandas DataFrame 
tips.head()

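| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |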


In [3]:
#Convert total_bill column to Numpy Array
data = tips.total_bill.values
data[:10]

array([16.99, 10.34, 21.01, 23.68, 24.59, 25.29,  8.77, 26.88, 15.04,
       14.78])

## Measures of Variability (Dispersion)

Once we have a measure for the central location of a data set, the next step is to quantify the spread or dispersion of the data around this location. Inherent in this process is the computation of a distance between the data and the central location. Variations in the distance measurement can lead to differences in variability measures, which include the following dispersion measurements:

1. Range
2. Mean Deviation
3. Mean Absolute Deviation
4. Variance
5. Standard Deviation


#### RANGE 

Of these, the simplest is the range, which is the difference between two characteristic values, 
most often the minimum and maximum values in a data set. While simple, the range provides little insight into the 
true spread or dispersion of the data, in particular its spread about the centrality measure (since the mean might lie 
anywhere between the maximum and minimum values).

In [5]:
# Compute Range Quantities
print('Maximum = {:6.4f}'.format(np.max(data)))
print('Minimum = {:6.4f}'.format(np.min(data)))
print("Range = {:6.4f}".format(np.max(data) - np.min(data)))
print('Peak To Peak (PTP) Range = {:6.4f}'.format(np.ptp(data)))

Maximum = 50.8100
Minimum = 3.0700
Range = 47.7400
Peak To Peak (PTP) Range = 47.7400
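
The "two characteristic values" need not be the extremes. As a minimal sketch, using the first ten `total_bill` values printed above as a small stand-in for `data`, the interquartile range (75th percentile minus 25th percentile) is one common alternative. The variable names here are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# First ten total_bill values from the output above (a stand-in for `data`)
sample = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78])

# Full range: maximum minus minimum
full_range = np.ptp(sample)

# Interquartile range: 75th percentile minus 25th percentile
q25, q75 = np.percentile(sample, [25, 75])
iqr = q75 - q25

print('Range = {:6.4f}'.format(full_range))
print('Interquartile Range = {:6.4f}'.format(iqr))
```

Because it ignores the lowest and highest quarters of the data, the interquartile range is less affected by a single unusually large or small bill than the full range is.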


#### MEAN DEVIATION

Each of the remaining four dispersion measurements is made with respect to a location measure, which is typically the mean value. Thus, the mean deviation is typically measured with respect to the mean value, as shown by the following formula:
![Mean Deviation](images/md.png)

The mean deviation is simple to compute: sum the deviations from the mean value and divide by the number of points. However, positive and negative deviations cancel, so the measure produces a small value (exactly zero when measured about the mean, up to floating-point rounding) for the spread of the data. This problem can be overcome by using two different techniques: summing the absolute deviations or summing the squares of the deviations, which leads to the last three measures.
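
The cancellation can be written out in one line. Here $x_i$ denotes the data values, $N$ the number of points, and $\mu$ the mean. These symbols are chosen for illustration and may differ from those in the figure.

$$
\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)
= \frac{1}{N}\sum_{i=1}^{N} x_i - \mu
= \mu - \mu
= 0
$$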


#### MEAN ABSOLUTE DEVIATION

First, if we sum up the absolute deviations, we generate a sum of intrinsically non-negative values.
- __The sum of absolute deviations is the l1-norm (MANHATTAN NORM) of the deviation vector; dividing it by the number of points gives the MEAN ABSOLUTE DEVIATION.__

The l1-norm finds use in many applications, including machine learning.
![Mean Absolute Deviation](images/mad.png)

In [10]:
# Compute Deviations

tmp = data - np.mean(data)
n = data.shape[0]

print('Mean Deviation from Mean = {:6.4f}'.format(np.sum(tmp/n)))
print('Mean Absolute Deviation from Mean = {:6.4f}'.format(np.sum(np.absolute(tmp)/n)))

Mean Deviation from Mean = -0.0000
Mean Absolute Deviation from Mean = 6.8694
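
The link between the mean absolute deviation and the l1-norm can be checked numerically. This is a minimal sketch, again using the first ten `total_bill` values as a stand-in for `data`. The variable names are illustrative.

```python
import numpy as np

# First ten total_bill values from the output above (a stand-in for `data`)
sample = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78])

deviations = sample - np.mean(sample)
n = sample.shape[0]

# Mean absolute deviation computed directly
mad_direct = np.mean(np.abs(deviations))

# Same quantity: l1-norm of the deviation vector, divided by n
mad_from_norm = np.linalg.norm(deviations, ord=1) / n

print('MAD (direct)       = {:6.4f}'.format(mad_direct))
print('MAD (l1-norm / n)  = {:6.4f}'.format(mad_from_norm))
```

Both lines print the same number, since `np.linalg.norm(..., ord=1)` on a 1-D array is the sum of absolute values.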


#### VARIANCE

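Second, if we sum up the squares of the deviations from the mean, we again end up with a sum of intrinsically non-negative numbers.
- __The square root of this sum is the l2-norm or EUCLIDEAN NORM of the deviation vector; the sum itself is the squared l2-norm, and dividing it by the number of points gives the VARIANCE.__

The l2-norm underlies the Pythagorean formula for distance and is widely used in machine learning.

![Variance](images/var.png)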

In [11]:
# Compute Variance (np.var defaults to ddof=0, i.e. the population variance)
print("Variance = {:6.4f}".format(np.var(data)))

Variance = 78.9281
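
Two details of this computation are worth checking by hand. `np.var` divides by the number of points by default (`ddof=0`), while `ddof=1` gives the sample variance. The population variance also equals the squared l2-norm of the deviation vector divided by the number of points. This is a minimal sketch on the first ten `total_bill` values. The variable names are illustrative.

```python
import numpy as np

# First ten total_bill values from the output above (a stand-in for `data`)
sample = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78])

n = sample.shape[0]
deviations = sample - np.mean(sample)

# Population variance: sum of squared deviations divided by n (NumPy default)
var_population = np.var(sample)

# Sample variance: divide by n - 1 instead
var_sample = np.var(sample, ddof=1)

# Squared l2-norm of the deviation vector, divided by n
var_from_norm = np.linalg.norm(deviations) ** 2 / n

print('Population Variance (ddof=0) = {:6.4f}'.format(var_population))
print('Sample Variance (ddof=1)     = {:6.4f}'.format(var_sample))
print('Squared l2-norm / n          = {:6.4f}'.format(var_from_norm))
```

Note that pandas uses `ddof=1` by default, so `tips.total_bill.var()` would not match `np.var(data)` exactly.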


#### STANDARD DEVIATION

One concern with the variance, however, is that its units are the square of the original units (for example, length * length rather than length). So that the variability around the mean can be compared directly with the mean, we generally use the standard deviation, which is simply the square root of the variance:

![Standard Deviation](images/sd.png)

In [12]:
# Compute Standard Deviation (np.std defaults to ddof=0, i.e. the population value)
print("Standard Deviation = {:6.4f}".format(np.std(data)))

Standard Deviation = 8.8842
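
The point about units can be illustrated by rescaling the data. This is a minimal sketch that assumes, purely for illustration, that the bills are in dollars and converts them to cents. The standard deviation scales with the unit, while the variance scales with its square.

```python
import numpy as np

# First ten total_bill values from the output above, assumed to be in dollars
dollars = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                    25.29, 8.77, 26.88, 15.04, 14.78])
cents = dollars * 100

# Ratio of each measure after and before the change of units
std_ratio = np.std(cents) / np.std(dollars)
var_ratio = np.var(cents) / np.var(dollars)

print('Standard deviation scales by {:.1f}'.format(std_ratio))
print('Variance scales by {:.1f}'.format(var_ratio))
```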


#### COEFFICIENT OF VARIATION

The value of a dispersion measure can be confusing on its own: a large or small dispersion is less informative than a dispersion measure combined with a location measure. For example, a dispersion measure of 10 means something different if the mean is 1 versus 100. 
As a result, 
- __the coefficient of variation, the ratio of the standard deviation to the mean, is sometimes used to express the size of the dispersion relative to the mean__:
![Coefficient of Variation](images/cov.png)

In [13]:
# Compute Coefficient of Variation
print('Coefficient of Variation = {:6.4f}'.format(np.std(data)/np.mean(data)))

Coefficient of Variation = 0.4490
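
Because it is a ratio, the coefficient of variation has no units and does not change when the data are rescaled. This is a minimal sketch. `coefficient_of_variation` is a hypothetical helper defined here, not a NumPy function, and the dollars-to-cents conversion is an illustrative assumption.

```python
import numpy as np

def coefficient_of_variation(values):
    """Hypothetical helper: population standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return np.std(values) / np.mean(values)

# First ten total_bill values from the output above (a stand-in for `data`)
sample = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78])

# Rescaling the data (e.g. dollars to cents) leaves the ratio unchanged
cv_original = coefficient_of_variation(sample)
cv_rescaled = coefficient_of_variation(sample * 100)

# The example from the text: the same dispersion of 10 with means of 1 and 100
cv_mean_1 = 10 / 1
cv_mean_100 = 10 / 100

print('CV (original) = {:6.4f}'.format(cv_original))
print('CV (rescaled) = {:6.4f}'.format(cv_rescaled))
print('Dispersion 10, mean 1   -> CV = {:6.4f}'.format(cv_mean_1))
print('Dispersion 10, mean 100 -> CV = {:6.4f}'.format(cv_mean_100))
```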


Q. Compute the mean absolute deviation and standard deviation for the total_bill column, separately for the rows corresponding to lunch and to dinner. What do these values suggest about the typical lunch or dinner party? Also, compare the coefficient of variation for the same data. Do these values also make sense?

In [14]:
lunch = tips[tips['time'] == 'Lunch']
dinner = tips[tips['time'] == 'Dinner']

In [15]:
lunch_data = lunch.total_bill.values
dinner_data = dinner.total_bill.values

In [16]:
# Deviations of each bill from its own group's mean, and the group sizes
temp_lunch = lunch_data - np.mean(lunch_data)
n_lunch = lunch_data.shape[0]

temp_dinner = dinner_data - np.mean(dinner_data)
n_dinner = dinner_data.shape[0]

In [19]:
print("Lunch")
print("Mean Absolute Deviation = {:6.4f}".format(np.sum(np.absolute(temp_lunch/n_lunch))))
print("Standard Deviation = {:6.4f}".format(np.std(lunch_data)))
print("Coefficient of Variation = {:6.4f}".format(np.std(lunch_data)/np.mean(lunch_data)))

Lunch
Mean Absolute Deviation = 5.6674
Standard Deviation = 7.6570
Coefficient of Variation = 0.4460


In [20]:
print("Dinner")
print("Mean Absolute Deviation = {:6.4f}".format(np.sum(np.absolute(temp_dinner/n_dinner))))
print("Standard Deviation = {:6.4f}".format(np.std(dinner_data)))
print("Coefficient of Variation = {:6.4f}".format(np.std(dinner_data)/np.mean(dinner_data)))

Dinner
Mean Absolute Deviation = 7.0580
Standard Deviation = 9.1160
Coefficient of Variation = 0.4383
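
The lunch and dinner cells repeat the same three calculations, so they could be wrapped in a small function. This is a minimal sketch. `dispersion_summary` is a hypothetical helper, and the group labels `'A'` and `'B'` are arbitrary labels attached to the first ten `total_bill` values purely to exercise it. They are not the real lunch and dinner groups.

```python
import numpy as np

def dispersion_summary(values):
    """Hypothetical helper: MAD, standard deviation, and CV of a 1-D array."""
    values = np.asarray(values, dtype=float)
    mean = np.mean(values)
    mad = np.mean(np.abs(values - mean))
    std = np.std(values)  # ddof=0, as in the cells above
    return {'mad': mad, 'std': std, 'cv': std / mean}

# Stand-in data: first ten total_bill values with arbitrary group labels
bills = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                  25.29, 8.77, 26.88, 15.04, 14.78])
labels = np.array(['A'] * 5 + ['B'] * 5)

# Compute the same three measures for each group using a boolean mask
summaries = {}
for label in np.unique(labels):
    s = dispersion_summary(bills[labels == label])
    summaries[label] = s
    print('{}: MAD = {:6.4f}, SD = {:6.4f}, CV = {:6.4f}'.format(
        label, s['mad'], s['std'], s['cv']))
```

With the notebook's own arrays, `dispersion_summary(lunch_data)` and `dispersion_summary(dinner_data)` should reproduce the values printed above.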
