# Descriptive Statistics

Descriptive statistics (also called summary statistics) describe a data set using a small number of statistical quantities, such as:

1. measures of location or centrality
2. measures of variability or spread
3. measures of distribution
4. measures of distribution shape

In this notebook, these statistical quantities are used to describe a 1-D dataset.

In [1]:
#Standard Imports
import numpy as np
import scipy as sp
import scipy.stats  # make sure the sp.stats submodule is loaded
import pandas as pd
import seaborn as sns
#Display options
pd.options.display.max_rows = 10

In [2]:
# Dataset : demo dataset
tips = sns.load_dataset("tips") #tips is a Pandas DataFrame 
tips.head()

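| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |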


In [3]:
#Convert total_bill column to Numpy Array
data = tips.total_bill.values
data[:10]

array([16.99, 10.34, 21.01, 23.68, 24.59, 25.29,  8.77, 26.88, 15.04,
       14.78])

## Measures of Distribution

While useful, measures of location and dispersion each reduce the data to a single number. Other techniques therefore exist for quantifying a distribution of data in more detail. Fundamentally, these techniques all rely on sorting the data and finding the value below which a certain percentage of the data lie. In this manner, the first percentile is the value below which one percent of the data lie, while the median (the 50th percentile) is the value below which fifty percent of the data lie. Formally, the following pre-defined measures are commonly used:

1. Percentiles: divide the data into one hundred chunks, each containing 1% of the data; the value passed indicates which percentile is of interest
2. Deciles: divide the data into ten chunks, each containing 10% of the data
3. Quintiles: divide the data into five chunks, each containing 20% of the data (quantile is the general term for any such cut point)
4. Quartiles: divide the data into four chunks, each containing 25% of the data

__With NumPy, we use the percentile function for all of these and specify the appropriate percentile value.__
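
Because `np.percentile` accepts a sequence of percentile values, a whole family of cut points can be computed in a single call. The sketch below uses a small synthetic array (`demo`, an assumption made for illustration, not the tips data) so that the results are easy to check by eye. Here "quintile points" means the 20% steps and "decile points" the 10% steps.

```python
import numpy as np

# Synthetic data: the numbers 0..100, so the p-th percentile is simply p
demo = np.arange(101, dtype=float)

# Percentile values that define each family of cut points
quartile_points = [25, 50, 75]
quintile_points = [20, 40, 60, 80]
decile_points = list(range(10, 100, 10))

# One call per family; each returns an array with one value per cut point
quartiles = np.percentile(demo, quartile_points)
quintiles = np.percentile(demo, quintile_points)
deciles = np.percentile(demo, decile_points)

print("Quartiles:", quartiles)
print("Quintiles:", quintiles)
print("Deciles:  ", deciles)
```

Passing a list keeps related cut points together and avoids repeating the call once per value.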

In [21]:
# Demonstrate percentile with the Median

print('Median (via percentile) = {:6.4f}'.format(np.percentile(data, 50)))

Median (via percentile) = 17.7950
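
The sort-and-locate idea behind percentiles can also be written out by hand. The following is a minimal sketch, assuming linear interpolation between neighbouring sorted values (the scheme `np.percentile` uses by default). The helper name `percentile_linear` is hypothetical, and `sample` holds the first ten total_bill values shown earlier.

```python
import numpy as np

def percentile_linear(values, p):
    """Percentile via sorting and linear interpolation (hypothetical helper)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    # Fractional position of the p-th percentile within the sorted array
    pos = (len(ordered) - 1) * p / 100.0
    lower = int(np.floor(pos))
    upper = int(np.ceil(pos))
    frac = pos - lower
    # Interpolate between the two neighbouring sorted values
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# First ten total_bill values from the tips data
sample = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78])

manual_median = percentile_linear(sample, 50)
print("Manual median     = {:6.4f}".format(manual_median))
print("np.percentile(50) = {:6.4f}".format(np.percentile(sample, 50)))
```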


In [22]:
# Compute quartiles (the 100th percentile, labeled "Fourth Quartile" here, is simply the maximum value)

print("First Quartile = {:6.4f}".format(np.percentile(data,25)))
print("Second Quartile = {:6.4f}".format(np.percentile(data, 50)))
print("Third Quartile = {:6.4f}".format(np.percentile(data, 75)))
print("Fourth Quartile = {:6.4f}".format(np.percentile(data,100)))

First Quartile = 13.3475
Second Quartile = 17.7950
Third Quartile = 24.1275
Fourth Quartile = 50.8100


In [25]:
# Interquartile range is used as a distribution measure
print("Interquartile Range = {:6.4f}".format(np.percentile(data, 75) - np.percentile(data,25)))

Interquartile Range = 10.7800
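
As a cross-check, SciPy offers `scipy.stats.iqr`, which should agree with the difference of the 75th and 25th percentiles under its default settings. The sketch below compares the two on the first ten total_bill values shown earlier (the variable name `sample` is an assumption for illustration).

```python
import numpy as np
from scipy import stats

# First ten total_bill values from the tips data
sample = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78])

# Interquartile range from two percentiles, computed in one call
q1, q3 = np.percentile(sample, [25, 75])
iqr_manual = q3 - q1

# Same quantity via SciPy's helper
iqr_scipy = stats.iqr(sample)

print("IQR (percentiles) = {:6.4f}".format(iqr_manual))
print("IQR (scipy)       = {:6.4f}".format(iqr_scipy))
```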


In [26]:
# Compute quintiles (20% steps; the 100th percentile is the maximum value)
print("First Quintile = {:6.4f}".format(np.percentile(data,20)))
print("Second Quintile = {:6.4f}".format(np.percentile(data,40)))
print("Third Quintile = {:6.4f}".format(np.percentile(data,60)))
print("Fourth Quintile = {:6.4f}".format(np.percentile(data,80)))
print("Fifth Quintile = {:6.4f}".format(np.percentile(data, 100)))

First Quintile = 12.6360
Second Quintile = 16.2220
Third Quintile = 19.8180
Fourth Quintile = 26.0980
Fifth Quintile = 50.8100


## Weighted Statistics

![weighted stats](images\ws.png)

In [28]:
import math

# Get Mean
print("Mean = {:6.4f}".format(np.mean(data)))

# Generate random weights (unseeded, so the weighted results vary between runs)

w = np.random.uniform(size=data.shape)

# Compute weighted mean

wm = np.average(data, weights=w)

print("Weighted Mean = {:6.4f}".format(wm))

# Get Std Deviation

print("Standard Deviation = {:6.4f}".format(np.std(data)))

# Compute weighted standard deviation

wstd = math.sqrt(np.average((data-wm)**2, weights=w))

print("Weighted Std Deviation = {:6.4f}".format(wstd))

Mean = 19.7859
Weighted Mean = 19.5267
Standard Deviation = 8.8842
Weighted Std Deviation = 8.6896
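
To make explicit what the cell above computes, the weighted mean is sum(w_i * x_i) / sum(w_i), and the weighted standard deviation is the square root of the weighted average of squared deviations from that mean. The sketch below spells both out on a tiny example with fixed weights (the arrays `values` and `weights` are illustrative assumptions, chosen so the arithmetic can be checked by hand).

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])
weights = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative weights

# Weighted mean: sum(w_i * x_i) / sum(w_i) = 300 / 10
wmean_manual = np.sum(weights * values) / np.sum(weights)

# Weighted variance about the weighted mean: 1000 / 10
wvar_manual = np.sum(weights * (values - wmean_manual) ** 2) / np.sum(weights)
wstd_manual = np.sqrt(wvar_manual)

# The same quantities through np.average
wmean_numpy = np.average(values, weights=weights)
wstd_numpy = np.sqrt(np.average((values - wmean_numpy) ** 2, weights=weights))

print("Weighted Mean = {:6.4f}".format(wmean_manual))          # 30.0000
print("Weighted Std Deviation = {:6.4f}".format(wstd_manual))  # 10.0000
```

With equal weights, both expressions reduce to the ordinary mean and the population standard deviation.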


## Measures of Shape

In contrast to measures of the distribution, such as the percentile or quantile, there are several other pre-defined quantities that __provide insight into the shape of a data set, especially in relation to the mean and standard deviation__.

- __SKEWNESS__
  - Skewness measures the lack of symmetry with respect to the mean value.
  - Values near zero indicate a symmetric distribution, while values of larger magnitude indicate increasing asymmetry. Positive values indicate a longer right tail, negative values a longer left tail.

- __KURTOSIS__
  - Kurtosis measures how heavy the tails of a distribution are relative to the Normal distribution.
  - scipy.stats.kurtosis reports excess kurtosis by default, so a Normal distribution scores zero. Negative values indicate lighter tails (fewer extreme values) than the Normal distribution, while positive values indicate heavier tails, that is, more data lying far from the mean value.

In [29]:
# Compute Skew
skew = sp.stats.skew(data)
print("Skewness = {:6.4f}".format(skew))

# Compute Kurtosis
kurt = sp.stats.kurtosis(data)
print("Kurtosis = {:6.4f}".format(kurt))

Skewness = 1.1262
Kurtosis = 1.1692
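
To connect these numbers to the mean and standard deviation, skewness and excess kurtosis can be written as the third and fourth standardized moments. The sketch below computes them by hand and compares with SciPy's default (biased, Fisher) estimators. The array `x` is a made-up, right-skewed example, not the tips data.

```python
import numpy as np
from scipy import stats

# Illustrative data with one large value, giving a long right tail
x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 20.0])

mean = x.mean()
std = x.std()  # population standard deviation (ddof=0)

# Third standardized moment: skewness
skew_manual = np.mean((x - mean) ** 3) / std ** 3

# Fourth standardized moment minus 3: excess kurtosis (Normal -> 0)
kurt_manual = np.mean((x - mean) ** 4) / std ** 4 - 3.0

print("Skewness (manual) = {:6.4f}".format(skew_manual))
print("Skewness (scipy)  = {:6.4f}".format(stats.skew(x)))
print("Kurtosis (manual) = {:6.4f}".format(kurt_manual))
print("Kurtosis (scipy)  = {:6.4f}".format(stats.kurtosis(x)))
```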


## Population or Sample

To this point, we have focused on describing a data set as if it stood alone. For example, you might have access to the full financial statements of a company or information about every transaction that took place in a given market. In some cases, however, you may only have a subset of this full information. This may arise when the original data is very large and you simply want to explore a subset of the full data, or perhaps you were simply given a subset in order to ascertain any issues or problems.

Formally, this division can be described as being given either the full population or simply a sample. Traditional statistics, being developed many years ago when data sets were small and all calculations were done with pencil and paper, focused on using samples to make predictions (or estimates) of the full population. In practice, this introduces several new terms and small changes in calculating descriptive statistics. First, the originating data set is often called the parent population. Second, the process of selecting data from the parent population is known as sampling. 

__When using a sample to describe the parent population, we must account for the fact that we are using limited information to describe something (potentially) much larger__. 
Thus, when we use a sample to estimate the mean value of the parent population, we have used up some of the information content of our sample. If we then use this estimated mean to compute the standard deviation, dividing by N gives an estimate that is systematically too small. This is formalized by a concept known as __degrees of freedom: a sample of N data points (N is the length of the data set in Python) has N degrees of freedom, and estimating the mean uses one of them, so the sample standard deviation divides by N − 1 instead of N__.


![population sample](images\sample.png)

In [31]:
# To calculate this in Python, pass the delta degrees of freedom parameter (ddof) to numpy.std:
np.std(data,ddof=1)

8.902411954856856

In [32]:
# Compare Population and Sample std deviation
print("Population Std Deviation = {:6.4f}".format(np.std(data)))
print("Sample Std Deviation = {:6.4f}".format(np.std(data,ddof=1)))

Population Std Deviation = 8.8842
Sample Std Deviation = 8.9024
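
The effect of `ddof` can be seen by writing both formulas out. This is a minimal sketch on the first ten total_bill values shown earlier (the variable name `sample` is an assumption for illustration). The only difference between the two estimates is the divisor, N versus N − 1.

```python
import numpy as np

# First ten total_bill values from the tips data
sample = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78])

n = sample.size
squared_dev = (sample - sample.mean()) ** 2

# Population formula: divide by N (what np.std does by default)
pop_std = np.sqrt(np.sum(squared_dev) / n)

# Sample formula: divide by N - 1 (np.std with ddof=1)
samp_std = np.sqrt(np.sum(squared_dev) / (n - 1))

print("Population Std Deviation = {:6.4f}".format(pop_std))
print("Sample Std Deviation     = {:6.4f}".format(samp_std))
print("Ratio (sample / population) = {:6.4f}".format(samp_std / pop_std))
```

The ratio equals sqrt(N / (N − 1)), which approaches 1 as N grows. This is why the two values for the full data set differ only in the second decimal place.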


### Sample Standard Error of the Mean

Another statistic that is often of interest is the precision with which we can measure the mean value.
Intuitively, as the number of points in our sample increases, the precision should increase (and
the uncertainty in the estimated mean should decrease). This is formalized by the standard error, or
SE, which measures the uncertainty in an estimated statistic.
Thus, if we want to know the precision of our measurement of the mean, we can compute the sample standard error of
the mean. Formally, this value is computed by dividing the sample standard deviation by the square root of the
number of data points in the sample:
    
![std error](images\se.png)

NOTE: For large data sets, the difference between dividing by N and by N−1 is minimal, so we can often ignore it in practice.


In [33]:
# Compute the sample standard error of the mean (SEM)
N = data.shape[0]
print("Sample Standard Error = {:6.4f}".format(np.std(data,ddof=1)/np.sqrt(N)))

Sample Standard Error = 0.5699
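
SciPy provides `scipy.stats.sem`, which uses `ddof=1` by default and so should match the manual calculation above. The sketch below checks this on the first ten total_bill values, then uses seeded synthetic Normal data (an assumption for illustration, not part of the tips data) to show the standard error shrinking as the sample grows.

```python
import numpy as np
from scipy import stats

# First ten total_bill values from the tips data
sample = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78])

# Manual SEM: sample standard deviation over sqrt(N)
sem_manual = np.std(sample, ddof=1) / np.sqrt(sample.size)
sem_scipy = stats.sem(sample)
print("SEM (manual) = {:6.4f}".format(sem_manual))
print("SEM (scipy)  = {:6.4f}".format(sem_scipy))

# Synthetic data: the SEM falls roughly as 1 / sqrt(N)
rng = np.random.default_rng(0)
synthetic = rng.normal(size=10_000)
sem_small = stats.sem(synthetic[:100])
sem_large = stats.sem(synthetic)
print("SEM with N=100   = {:6.4f}".format(sem_small))
print("SEM with N=10000 = {:6.4f}".format(sem_large))
```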
