# Descriptive Statistics (Chapter 3)

Descriptive statistics describe, show, or summarize data in a meaningful way. **We do not draw conclusions beyond the data at hand from descriptive statistics!**  That is the role of inferential statistics, which we will explore in the next chapter.  Let us differentiate the two:
* Suppose that we take a sample of everyone's age in the class
* We can use **descriptive statistics** to summarize the data that is collected (mean, median, standard deviation, quantiles, ...)
* We would use **inferential statistics** to infer information about all graduate students enrolled in a master's program at Michigan Tech

Often, we want to infer information about a **population**, but are only able to observe a part of the population, known as the **sample**.
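
As a minimal sketch of the descriptive side of this distinction, the snippet below summarizes a small sample of ages. The values in `ages` are made up for illustration and are not real class data.

```python
import numpy as np

# Hypothetical ages for a small class (made-up values, not real data)
ages = np.array([22, 23, 23, 24, 25, 27, 31])

mean_age = ages.mean()                   # arithmetic mean of the sample
median_age = np.median(ages)             # middle value of the sorted sample
q1, q3 = np.percentile(ages, [25, 75])   # first and third quartiles

print("mean:", mean_age)
print("median:", median_age)
print("quartiles:", q1, q3)
```

These numbers only describe the seven observed ages. Saying anything about all graduate students would require inferential statistics.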

Goals for this module:
* Learn statistical terms and concepts to describe a data sample
* Use Python to compute descriptive statistics of data
* Gain more experience visualizing data using Python

I will cover a separate example with roughly the same topics/layout as our textbook, so that you have two distinct examples to work from.  The data set I have chosen to use is the wine quality data set from the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Wine+Quality.  As we will see, this data set provides many attributes of wine, including a summary "quality score" based on an expert opinion.  We may visit this data set in a later week to try to use machine learning to assign a quality score based on wine attributes.  For now, we are just interested in generating **descriptive statistics** from the data.  First, let's set up our Python libraries and toolboxes.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Next, we need to prepare our data.  For this notebook, we will pull our data directly from the UCI repository.  It turns out that there are two data sets: one for red wine and one for white wine. Let's pull the data for the red wine. In this case, we are reading a CSV file (from the web) into a pandas DataFrame object.

In [2]:
url  =  "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url)
wine.head()

Unnamed: 0,"fixed acidity;""volatile acidity"";""citric acid"";""residual sugar"";""chlorides"";""free sulfur dioxide"";""total sulfur dioxide"";""density"";""pH"";""sulphates"";""alcohol"";""quality"""
0,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
1,7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
2,7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;...
3,11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58...
4,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5


Oops, it looks like semicolons were used as the delimiter (separator) for the data, whereas the default delimiter for `read_csv` is a comma. Let's re-import using the semicolon as the separator.

In [3]:
wine = pd.read_csv(url, sep=";")
wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
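
The effect of the `sep` argument can be reproduced without a network connection. The sketch below parses a tiny in-memory string holding three of the columns and two of the rows shown above; the variable names `raw`, `wrong`, and `right` are just illustrative.

```python
import io
import pandas as pd

# A small semicolon-delimited sample: three columns, two rows from the table above
raw = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.2;5\n"

# Default comma separator: each line is read as a single field
wrong = pd.read_csv(io.StringIO(raw))

# Semicolon separator: the three columns are split correctly
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

Checking `shape` right after reading a file is a quick way to catch a wrong delimiter, since a single-column result is usually a sign that the separator was not recognized.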


That's better.  Let's find out the number of observations and features (the shape) of our data set.

In [4]:
wine.shape

(1599, 12)

So, we have 1599 observations (rows) and 12 features (columns).  The "quality" column is the output variable in some sense.  Let's group by quality, and find out how many wines have each quality rating.

In [5]:
counts = wine.groupby('quality').size()
print(counts)

quality
3     10
4     53
5    681
6    638
7    199
8     18
dtype: int64
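
The same tally can also be obtained with `value_counts()`. The sketch below compares the two approaches on a small made-up `quality` column (the frame `wine_demo` is hypothetical, not the real data set).

```python
import pandas as pd

# Hypothetical quality ratings (made-up values for illustration)
wine_demo = pd.DataFrame({"quality": [5, 5, 6, 7, 5, 6, 3]})

# Count rows per rating by grouping, as in the cell above
by_group = wine_demo.groupby("quality").size()

# Equivalent tally; sort_index() orders the result by rating rather than by count
by_counts = wine_demo["quality"].value_counts().sort_index()

print(by_group.to_dict())
print(by_counts.to_dict())
```

Both give the same rating-to-count mapping; `value_counts()` sorts by frequency by default, which is why `sort_index()` is applied here.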


Looks like most wines are mediocre (quality rating of 5 or 6). Few wines are truly excellent or poor.  Let's proceed with defining some useful descriptive statistics
* mean, $\mu$ (some people use the non-technical term, average)
* standard deviation, $\sigma$ (or variance, $\sigma^2$)
* quantiles / percentiles
* distributions

Let's proceed with finding the mean pH of wines, grouped by quality.

In [6]:
a = wine[['pH','quality']].groupby('quality').mean()
print(a)

               pH
quality          
3        3.398000
4        3.381509
5        3.304949
6        3.318072
7        3.290754
8        3.267222


Mathematically, how are these numbers obtained?  Well, we know for example, there are 10 wines with quality 3 (based on our counts above).  We can extract the pH values of these 10 wines.  The mean is the sum of the values divided by the number of values, $$\mu = \frac{1}{n} \sum_{i=1}^n x_i$$



In [7]:
wine[wine['quality']==3 ]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
459,11.6,0.58,0.66,2.2,0.074,10.0,47.0,1.0008,3.25,0.57,9.0,3
517,10.4,0.61,0.49,2.1,0.2,5.0,16.0,0.9994,3.16,0.63,8.4,3
690,7.4,1.185,0.0,4.25,0.097,5.0,14.0,0.9966,3.63,0.54,10.7,3
832,10.4,0.44,0.42,1.5,0.145,34.0,48.0,0.99832,3.38,0.86,9.9,3
899,8.3,1.02,0.02,3.4,0.084,6.0,11.0,0.99892,3.48,0.49,11.0,3
1299,7.6,1.58,0.0,2.1,0.137,5.0,9.0,0.99476,3.5,0.4,10.9,3
1374,6.8,0.815,0.0,1.2,0.267,16.0,29.0,0.99471,3.32,0.51,9.8,3
1469,7.3,0.98,0.05,2.1,0.061,20.0,49.0,0.99705,3.31,0.55,9.7,3
1478,7.1,0.875,0.05,5.7,0.082,3.0,14.0,0.99808,3.4,0.52,10.2,3
1505,6.7,0.76,0.02,1.8,0.078,6.0,12.0,0.996,3.55,0.63,9.95,3


In [8]:
wine[wine['quality']==3 ]['pH']

459     3.25
517     3.16
690     3.63
832     3.38
899     3.48
1299    3.50
1374    3.32
1469    3.31
1478    3.40
1505    3.55
Name: pH, dtype: float64

In [9]:
my_sum = 0
for pH in wine[wine['quality']==3 ]['pH']:
    my_sum += pH
my_mean = my_sum/wine[wine['quality']==3 ]['pH'].size
print("mean pH of wine with quality 3 is " + str(my_mean) + ".")

mean pH of wine with quality 3 is 3.398.
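
The loop above can also be written without explicit iteration. As a sketch, the snippet below hard-codes the ten pH values listed earlier (so that it runs without downloading the data) and applies the formula $\mu = \frac{1}{n}\sum_i x_i$ directly; the name `ph_q3` is only illustrative.

```python
import numpy as np
import pandas as pd

# The ten pH values of the quality-3 wines, copied from the output above
ph_q3 = pd.Series([3.25, 3.16, 3.63, 3.38, 3.48, 3.50, 3.32, 3.31, 3.40, 3.55])

# Mean from the definition: sum of the values divided by the number of values
mean_formula = ph_q3.sum() / ph_q3.size

# Built-in equivalent
mean_builtin = ph_q3.mean()

print(round(mean_formula, 3))
print(np.isclose(mean_formula, mean_builtin))
```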


The mean is often not a sufficient descriptor of data.  One often cares about how the data *deviates* from the mean, the so-called "spread" of the data.  A common measure is the variance, which is the mean squared deviation from the mean, $$\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i-\mu)^2.$$  The standard deviation, $\sigma$, is the square root of the variance.  Let's use the `groupby` function to find the standard deviation by quality, and then recover the standard deviation manually.

In [13]:
a = wine[['pH','quality']].groupby('quality').std()
print(a)

               pH
quality          
3        0.144052
4        0.181441
5        0.150618
6        0.153995
7        0.150101
8        0.200640


In [15]:
my_sum = 0
for pH in wine[wine['quality']==3 ]['pH']:
    my_sum += (pH-my_mean) ** 2
my_variance = my_sum/(wine[wine['quality']==3 ]['pH'].size)
my_std = np.sqrt(my_variance)
print("(mean and standard deviation) of pH, wine with quality 3 is (%g,%g)."% (my_mean,my_std) ) 

(mean and standard deviation) of pH, wine with quality 3 is (3.398,0.13666).


Well, that's not quite the same.  The built-in `std()` function gives a standard deviation of 0.144052 for wines of quality 3, and our manual calculation gives us a standard deviation of 0.13666.  What's going on?  This is a subtle point, related to populations and samples.  Here, our data set is assumed to be the entire **population** of red wines, so the formula and the manual implementation of the standard deviation are correct.  What pandas has computed is the "sample" standard deviation, $s$, defined by $$s^2 = \frac{1}{n-1} \sum_i(x_i-\bar{x})^2.$$  The population and sample variances are related by the formula $$ s^2 = \frac{n}{n-1}\sigma^2,$$ so that $s = \sigma\sqrt{n/(n-1)}$.  As $n\to\infty$, the population and sample standard deviations approach each other.

In [18]:
sample_std = my_std*np.sqrt( float(wine[wine['quality']==3 ]['pH'].size )/ (wine[wine['quality']==3 ]['pH'].size - 1))
print(sample_std)

0.1440524595802206
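
Both versions are also available directly through the `ddof` ("delta degrees of freedom") argument, which sets the divisor to $n - \mathrm{ddof}$. The sketch below again hard-codes the ten quality-3 pH values so that it runs on its own; `ph_q3` is an illustrative name.

```python
import numpy as np
import pandas as pd

# The ten pH values of the quality-3 wines, copied from the output above
ph_q3 = pd.Series([3.25, 3.16, 3.63, 3.38, 3.48, 3.50, 3.32, 3.31, 3.40, 3.55])
n = ph_q3.size

pop_std = ph_q3.std(ddof=0)         # population form: divide by n
sample_std = ph_q3.std(ddof=1)      # sample form: divide by n - 1 (the pandas default)
np_std = np.std(ph_q3.to_numpy())   # NumPy's default is ddof=0, unlike pandas

print(round(pop_std, 5), round(sample_std, 6))

# The two are related by the factor sqrt(n / (n - 1))
print(np.isclose(sample_std, pop_std * np.sqrt(n / (n - 1))))
```

Because pandas and NumPy use different defaults, passing `ddof` explicitly helps avoid mixing the two definitions.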
