# Descriptive Statistics

Descriptive statistics (also called summary statistics) describe a data set using a small number of statistical quantities, such as:

1. measures of location or centrality
2. measure of variability or spread
3. measures of distribution
4. measures of distribution shape

In this notebook, statistical quantities are used to describe a 1-D dataset.

In [2]:
#Standard Imports
import numpy as np
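import scipy as sp
import scipy.stats  # ensures sp.stats resolves even on SciPy versions without lazy submodule loading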
import pandas as pd
import seaborn as sns
#Display options
pd.options.display.max_rows = 10

In [5]:
# Dataset : demo dataset
tips = sns.load_dataset("tips") #tips is a Pandas DataFrame 
tips.head()

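| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |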


In [10]:
#Convert total_bill column to Numpy Array
data = tips.total_bill.values
data[:10]

array([16.99, 10.34, 21.01, 23.68, 24.59, 25.29,  8.77, 26.88, 15.04,
       14.78])

## Measures of Centrality

#### MEAN

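1. Arithmetic Mean: the sum of the values divided by their count (the usual average, or typical middle value)
2. Geometric Mean: the n-th root of the product of n positive values
3. Harmonic Mean: the reciprocal of the arithmetic mean of the reciprocals, suited to averaging rates and ratios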
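
As a rough illustration of these three definitions, the sketch below computes each mean directly from its formula. The `values` array is a hypothetical sample, not part of the tips dataset, and the geometric mean is computed through logarithms, which is one common way to avoid overflow in the product.

```python
import numpy as np

# Hypothetical sample of positive values (not taken from the tips dataset)
values = np.array([1.0, 2.0, 4.0])
n = len(values)

# Arithmetic mean: sum of the values divided by their count
arithmetic = values.sum() / n

# Geometric mean: n-th root of the product, computed via logarithms
geometric = np.exp(np.log(values).sum() / n)

# Harmonic mean: reciprocal of the arithmetic mean of the reciprocals
harmonic = n / (1.0 / values).sum()

print("Arithmetic = {:.4f}".format(arithmetic))
print("Geometric  = {:.4f}".format(geometric))
print("Harmonic   = {:.4f}".format(harmonic))
```

For positive data the three means are ordered: arithmetic is at least geometric, which is at least harmonic. The same ordering appears in the `total_bill` results.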


In [16]:
# Computing the different Pythagorean Means
print('Arithmetic Mean = {:6.4f}'.format(np.mean(data)))
print('Geometric Mean = {:6.4f}'.format(sp.stats.gmean(data)))
print('Harmonic Mean = {:6.4f}'.format(sp.stats.hmean(data)))

Arithmetic Mean = 19.7859
Geometric Mean = 17.9989
Harmonic Mean = 16.3090


Summary:
- The mean (or average) is a commonly used descriptive statistic. However, some data are distributed in ways that make the mean a poor measure of the typical value.
- For example, if data are bimodal (with many low values and many high values), the mean will typically lie between the two clumps, where few data points actually fall.
- The same problem arises when a data set has a number of outliers at one end of its range (such as a few sales contracts that are much larger than the rest in a list), because the outliers pull the mean toward them.

In these cases, other measures can provide more robust estimates. The more common ones include the median, the mode, and the trimmed mean.
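
To make the two failure cases concrete, here is a minimal sketch using made-up numbers. Both `bimodal` and `contracts` are hypothetical arrays chosen only for illustration.

```python
import numpy as np

# Hypothetical bimodal data: a clump of low values and a clump of high values
bimodal = np.array([1.0, 2.0, 2.0, 3.0, 17.0, 18.0, 18.0, 19.0])
bimodal_mean = np.mean(bimodal)  # lands between the clumps, where no data lie

# Hypothetical sales contracts (in thousands); the last one is an outlier
contracts = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])
mean_all = np.mean(contracts)
median_all = np.median(contracts)
mean_without_outlier = np.mean(contracts[:-1])

print("Bimodal mean            = {:.2f}".format(bimodal_mean))
print("Mean with outlier       = {:.2f}".format(mean_all))
print("Median with outlier     = {:.2f}".format(median_all))
print("Mean without outlier    = {:.2f}".format(mean_without_outlier))
```

A single large contract roughly doubles the mean, while the median stays close to the bulk of the values.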

#### MEDIAN

The median is conceptually a simple measure of centrality: half the data lie above the median and half below. To compute it, you first sort the data and then select the middle element. While conceptually simple, in practice computing the median is complicated by two issues:

- We need to sort the data, which is challenging for big data (and also implies the data can be sorted).
- For a data set with an even number of values there is no single middle value.

In the second case, for example, given the list of points [0, 1, 2, 3, 4, 5], we can use three different techniques to compute a median:

- a. average the two middle values (median=2.5),
- b. take the lower of the two middle values (median=2), or
- c. take the higher of the two middle values (median=3).

Python modules often employ the first technique by default (this is what `np.median` does), although exceptions exist: the built-in Python `statistics` module provides separate functions (`median`, `median_low`, and `median_high`) that return each of these three values.
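
The three techniques can be sketched on the example list with the standard-library `statistics` module, with `np.median` shown for comparison. The variable names are chosen here for illustration only.

```python
import statistics
import numpy as np

points = [0, 1, 2, 3, 4, 5]

# a. average of the two middle values
med_avg = statistics.median(points)
# b. lower of the two middle values
med_low = statistics.median_low(points)
# c. higher of the two middle values
med_high = statistics.median_high(points)
# NumPy averages the two middle values, like technique a
med_numpy = np.median(points)

print(med_avg, med_low, med_high, med_numpy)
```

For a list with an odd number of values, all four calls return the same middle element.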


In [25]:
print("Median Value = {:6.4f}".format(np.median(data)))

Median Value = 17.7950


#### MODE

The mode is simply the most common value in a data set. Often, a mode makes the most sense when the data have been binned, 
in which case the data have been aggregated and it becomes simple to find the bin with the most values (we will see
this in a later lesson on distributions). A mode can also make sense when data are categorical, or explicitly limited to a discrete set of values. 
When calculating the mode (or modal value), we also have the number of times this value occurred, as demonstrated below.

In [23]:
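mode = sp.stats.mode(data)
# np.ravel copes with both array results (older SciPy) and scalar results (newer SciPy)
value, count = np.ravel(mode.mode)[0], np.ravel(mode.count)[0]
print(" Modal Value = {0:6.4f}, occurred {1} times.".format(value, count))

 Modal Value = 13.4200, occurred 3 times.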
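
For continuous data, the binned mode described above could be found along these lines. This is a minimal sketch: the `values` array, the number of bins, and the bin range are all hypothetical choices, and `np.histogram` is used only as one convenient way to aggregate the data.

```python
import numpy as np

# Hypothetical continuous measurements (a stand-in for a column such as total_bill)
values = np.array([10.2, 11.7, 12.1, 12.9, 13.4, 13.8, 14.1, 18.5, 22.3, 27.0])

# Aggregate into 4 equal-width bins covering the range 10 to 30
counts, edges = np.histogram(values, bins=4, range=(10.0, 30.0))

# The modal bin is the one holding the most values
modal_bin = np.argmax(counts)
print("Modal bin = [{:.1f}, {:.1f}) with {} values".format(
    edges[modal_bin], edges[modal_bin + 1], counts[modal_bin]))
```

The modal bin depends on the bin width and the bin edges, so it is worth checking a few binnings before reading much into the result.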


#### TRIMMED MEAN
One simple technique to compute a more robust average value is to trim outlier points before calculating the mean, producing
a _trimmed mean_. The trimming process generally eliminates either a set number of points from the calculation,
or any values that are more extreme than a certain range (such as three standard deviations from the untrimmed mean).
SciPy's `tmean` function takes the range-based route: a lower and an upper limit are supplied explicitly, and values outside
this range are excluded from the computation of the trimmed mean.

In [27]:
low = 12.5
up = 27.5
tm  = sp.stats.tmean(data, (low,up))
print( " Trimmed Mean = {0:6.4f} with bounds = {1:4.2f} : {2:4.2f}".format(tm,low,up))

 Trimmed Mean = 18.5894 with bounds = 12.50 : 27.50
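
The two trimming strategies mentioned above can be sketched as follows. The `sample` array is hypothetical; `scipy.stats.trim_mean` removes a fixed fraction of points from each end, while the standard-deviation cut is written out by hand. The threshold `k = 2.0` is an assumption made for this tiny sample, because a single extreme value inflates the standard deviation enough that a three-sigma cut would keep it.

```python
import numpy as np
from scipy import stats

# Hypothetical sample with one extreme value
sample = np.array([2.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 7.0, 50.0])

# Approach 1: drop a fixed fraction of points from each end
# (10% of 10 values, i.e. one point per side)
fixed_fraction = stats.trim_mean(sample, proportiontocut=0.1)

# Approach 2: drop values more than k standard deviations from the untrimmed mean
k = 2.0  # assumed threshold for this small sample
center, spread = np.mean(sample), np.std(sample)
kept = sample[np.abs(sample - center) <= k * spread]
sigma_trimmed = np.mean(kept)

print("Untrimmed mean          = {:.4f}".format(center))
print("Fixed-fraction trimmed  = {:.4f}".format(fixed_fraction))
print("Sigma-trimmed (k = {})  = {:.4f}".format(k, sigma_trimmed))
```

Both trimmed values sit near the bulk of the data, while the untrimmed mean is pulled upward by the single large value.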


Q. Compute the mean and median of the total_bill column separately for the rows corresponding to lunch and to dinner. What do these values suggest about the typical lunch or dinner party?

In [37]:
lunch = tips[tips['time'] == 'Lunch']
dinner = tips[tips["time"] == 'Dinner']

In [38]:
total_bill_lunch = lunch.total_bill.values
total_bill_dinner = dinner.total_bill.values

In [41]:
print("MEDIAN")
print("Median Value for Lunch = {:6.4f}".format(np.median(total_bill_lunch)))
print("Median Value for Dinner = {:6.4f}".format(np.median(total_bill_dinner)))
print("MEAN")
print("Mean Value for Lunch = {:6.4f}".format(np.mean(total_bill_lunch)))
print("Mean Value for Dinner = {:6.4f}".format(np.mean(total_bill_dinner)))

MEDIAN
Median Value for Lunch = 15.9650
Median Value for Dinner = 18.3900
MEAN
Mean Value for Lunch = 17.1687
Mean Value for Dinner = 20.7972
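
The same per-group summary could also be produced in one step with a pandas `groupby`. The sketch below uses a hypothetical miniature DataFrame named `mini` in place of `tips` so that it runs without downloading the dataset; the column names match those used above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips DataFrame
mini = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 12.0, 20.0, 15.0, 18.0, 30.0],
})

# One row per meal time, one column per statistic
summary = mini.groupby("time")["total_bill"].agg(["mean", "median"])
print(summary)
```

Applied to the real data, the equivalent call would presumably be `tips.groupby("time")["total_bill"].agg(["mean", "median"])`, which avoids building separate lunch and dinner arrays.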
