## How to Carry Out Descriptive Statistics in Python
This Jupyter Notebook shows how to compute common descriptive statistics in Python with pandas, NumPy, and SciPy. It contains the code for the blog post (https://www.marsja.se/pandas-python-descriptive-statistics/).

In [16]:
import numpy as np
import pandas as pd
from scipy.stats import trim_mean, kurtosis
from scipy.stats.mstats import mode, gmean, hmean

### Simulate Data using Python:

In [17]:
N = 20                   # observations per design cell
P = ["noise", "quiet"]   # levels of iv1
Q = [1, 2, 3]            # levels of iv2

# Mean rt for each level of iv2, given as [noise, quiet]
values = [[998, 511], [1119, 620], [1300, 790]]

# One mean per row, in the same order as the iv1 and iv2 columns built below
mus = np.concatenate([np.repeat(value, N) for value in values])

data = pd.DataFrame(data = {'id': [subid for subid in range(N)]*(len(P)*len(Q))
    ,'iv1': np.concatenate([np.array([p]*N) for p in P]*len(Q))
    ,'iv2': np.concatenate([np.array([q]*(N*len(P))) for q in Q])
    ,'rt': np.random.normal(mus, scale=112.0, size=N*len(P)*len(Q))})
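
The simulation above is unseeded, so every run draws different response times. A minimal sketch of a repeatable variant follows, assuming NumPy's `default_rng` generator and an arbitrary seed of 42 (neither is part of the original notebook, and the numbers it produces will not match the outputs shown in the later cells). The name `data_seeded` is hypothetical.

```python
import numpy as np
import pandas as pd

N = 20
P = ["noise", "quiet"]
Q = [1, 2, 3]
values = [[998, 511], [1119, 620], [1300, 790]]

# Assumption: a fixed, arbitrary seed so the draw is repeatable.
rng = np.random.default_rng(42)

# One mean per row: N copies of each [noise, quiet] pair, per iv2 level.
mus = np.concatenate([np.repeat(value, N) for value in values])

data_seeded = pd.DataFrame({
    "id": list(range(N)) * (len(P) * len(Q)),
    "iv1": np.tile(np.repeat(P, N), len(Q)),
    "iv2": np.repeat(Q, N * len(P)),
    "rt": rng.normal(loc=mus, scale=112.0),
})

# Sanity check: every iv1 x iv2 cell should hold exactly N rows.
cell_counts = data_seeded.groupby(["iv1", "iv2"]).size()
print(cell_counts)
```

Checking the cell counts is a cheap way to confirm that the `iv1`, `iv2`, and `mus` vectors line up as intended.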

### Summary Statistics using Pandas:

In [5]:
data.describe()

,id,iv2,rt
count,120.0,120.0,120.0
mean,9.5,2.0,904.343846
std,5.790459,0.81992,302.341847
min,0.0,1.0,222.222632
25%,4.75,1.0,638.579718
50%,9.5,2.0,884.637538
75%,14.25,3.0,1157.212703
max,19.0,3.0,1546.588711


#### Grouped Descriptive Statistics:

In [6]:
grouped_data = data.groupby(['iv1', 'iv2'])
grouped_data['rt'].describe().unstack()

,count,count,count,mean,mean,mean,std,std,std,min,...,25%,50%,50%,50%,75%,75%,75%,max,max,max
iv2,1,2,3,1,2,3,1,2,3,1,...,3,1,2,3,1,2,3,1,2,3
iv1,,,,,,,,,,,,,,,,,,,,,
noise,20.0,20.0,20.0,987.657338,1171.814312,1306.581987,109.622704,107.250082,129.9964,751.064899,...,1211.461213,979.890264,1168.127591,1310.417271,1056.748712,1248.701845,1382.162473,1231.046135,1371.051372,1546.588711
quiet,20.0,20.0,20.0,525.298462,633.285676,801.425297,131.003698,90.74082,118.551951,222.222632,...,743.438071,500.223343,625.475771,788.500727,633.096902,683.698141,860.746124,718.711884,814.830784,1125.306481


#### Getting the Mean Values in Pandas:

In [7]:
grouped_data['rt'].mean().reset_index()

,iv1,iv2,rt
0,noise,1,987.657338
1,noise,2,1171.814312
2,noise,3,1306.581987
3,quiet,1,525.298462
4,quiet,2,633.285676
5,quiet,3,801.425297


In [8]:
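# Same result via aggregate(); the string 'mean' avoids the FutureWarning that recent pandas versions emit for np.mean
grouped_data['rt'].aggregate('mean').reset_index()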

,iv1,iv2,rt
0,noise,1,987.657338
1,noise,2,1171.814312
2,noise,3,1306.581987
3,quiet,1,525.298462
4,quiet,2,633.285676
5,quiet,3,801.425297


### Geometric & Harmonic Mean in Python

#### SciPy and Pandas Method:

In [9]:
grouped_data['rt'].apply(gmean, axis=None).reset_index()

,iv1,iv2,rt
0,noise,1,981.804049
1,noise,2,1167.169641
2,noise,3,1300.404909
3,quiet,1,507.310731
4,quiet,2,626.880486
5,quiet,3,793.254215
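
To make clear what `gmean` computes, the sketch below reproduces it by hand as the exponential of the mean of the logs. The four response times are made-up illustration values, not data from this notebook.

```python
import numpy as np
from scipy.stats import gmean

# Hypothetical response times (must all be positive).
rt = np.array([980.0, 1010.0, 1125.0, 890.0])

# Geometric mean: exp of the arithmetic mean of the log-values.
manual_gmean = np.exp(np.mean(np.log(rt)))
scipy_gmean = gmean(rt)

print(manual_gmean, scipy_gmean)
```

The log form is numerically safer than multiplying all values and taking the n-th root, because the product can overflow for long series.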


#### Harmonic Mean using SciPy & Pandas:

In [11]:
grouped_data['rt'].apply(hmean, axis=None).reset_index()

,iv1,iv2,rt
0,noise,1,975.853716
1,noise,2,1162.550537
2,noise,3,1294.193974
3,quiet,1,485.916679
4,quiet,2,620.19477
5,quiet,3,785.132515
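
The harmonic mean is the number of observations divided by the sum of their reciprocals. The sketch below checks a hand computation against `hmean`, using made-up values rather than the notebook's data.

```python
import numpy as np
from scipy.stats import gmean, hmean

# Hypothetical response times (must all be positive).
rt = np.array([980.0, 1010.0, 1125.0, 890.0])

# Harmonic mean: n divided by the sum of reciprocals.
manual_hmean = len(rt) / np.sum(1.0 / rt)
scipy_hmean = hmean(rt)

print(manual_hmean, scipy_hmean)
print(scipy_hmean <= gmean(rt) <= rt.mean())
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean; the grouped outputs above follow that ordering in every cell.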


#### Trimmed Mean in Python

In [14]:
trimmed_mean = grouped_data['rt'].apply(trim_mean, .1)
trimmed_mean.reset_index()

,iv1,iv2,rt
0,noise,1,988.172512
1,noise,2,1169.006037
2,noise,3,1304.570969
3,quiet,1,530.993236
4,quiet,2,636.176287
5,quiet,3,797.767809
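
With 20 observations per cell and a proportion of .1, `trim_mean` drops the two smallest and the two largest values before averaging. A minimal hand-rolled version, on made-up values with one extreme observation at each end, might look like this:

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical sample of 20 values with an outlier at each end.
rt = np.array([100, 480, 490, 495, 500, 505, 510, 515, 520, 525,
               530, 535, 540, 545, 550, 555, 560, 565, 570, 1500],
              dtype=float)

proportion = 0.1
k = int(proportion * len(rt))        # observations cut from each tail

# Sort, slice off k values at each end, then average the remainder.
manual_trimmed = np.sort(rt)[k:len(rt) - k].mean()
scipy_trimmed = trim_mean(rt, proportion)

print(k, manual_trimmed, scipy_trimmed, rt.mean())
```

Comparing the trimmed result with `rt.mean()` shows how much the two extreme values pull the ordinary mean.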


### Pandas Median

In [13]:
# Pandas Only:
# grouped_data['rt'].median().reset_index()
# Pandas + NumPy
grouped_data['rt'].aggregate(np.median).reset_index()

,iv1,iv2,rt
0,noise,1,979.890264
1,noise,2,1168.127591
2,noise,3,1310.417271
3,quiet,1,500.223343
4,quiet,2,625.475771
5,quiet,3,788.500727


### SciPy Mode

In [None]:
grouped_data['rt'].apply(mode, axis=None).reset_index()
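
Because `rt` is continuous, each simulated value typically occurs only once, so a mode of the raw values carries little information. One possible workaround, shown here as a sketch on made-up data, is to round the values into bins first and take the most frequent bin with pandas. The 10 ms bin width and the `rt_binned` column are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical response times for two conditions.
df = pd.DataFrame({
    "iv1": ["noise"] * 5 + ["quiet"] * 5,
    "rt": [1003.2, 998.7, 1001.9, 1047.3, 952.1,
           512.4, 508.8, 549.6, 511.2, 470.3],
})

# Assumption: round to the nearest 10 ms so that repeated values can occur.
df["rt_binned"] = (df["rt"] / 10).round() * 10

# Series.mode() may return several values on ties; keep the smallest one.
modes = df.groupby("iv1")["rt_binned"].agg(lambda s: s.mode().iloc[0])
print(modes)
```

The choice of bin width changes the answer, so it is worth reporting alongside the mode.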

### Median, Standard Deviation, Mean, and Trimmed Mean in a Pandas DataFrame

In [18]:
# String names select the pandas implementations (std uses ddof=1, matching .std())
descr = grouped_data['rt'].aggregate(['median', 'std', 'mean']).reset_index()
descr['trimmed_mean'] = pd.Series(trimmed_mean.values, index=descr.index)
descr

,iv1,iv2,median,std,mean,trimmed_mean
0,noise,1,979.890264,109.622704,987.657338,988.172512
1,noise,2,1168.127591,107.250082,1171.814312,1169.006037
2,noise,3,1310.417271,129.9964,1306.581987,1304.570969
3,quiet,1,500.223343,131.003698,525.298462,530.993236
4,quiet,2,625.475771,90.74082,633.285676,636.176287
5,quiet,3,788.500727,118.551951,801.425297,797.767809


### Measures of Variability in Python

### Pandas Standard Deviation

In [19]:
grouped_data['rt'].std().reset_index()

,iv1,iv2,rt
0,noise,1,109.622704
1,noise,2,107.250082
2,noise,3,129.9964
3,quiet,1,131.003698
4,quiet,2,90.74082
5,quiet,3,118.551951


### Interquartile Range

In [21]:
grouped_data['rt'].quantile([.25, .5, .75]).unstack()

iv1,iv2,0.25,0.5,0.75
noise,1,927.630752,979.890264,1056.748712
noise,2,1082.168931,1168.127591,1248.701845
noise,3,1211.461213,1310.417271,1382.162473
quiet,1,469.351283,500.223343,633.096902
quiet,2,592.876669,625.475771,683.698141
quiet,3,743.438071,788.500727,860.746124
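
The cell above lists the quartiles; the interquartile range itself is the third quartile minus the first. A minimal sketch on made-up data, relying on pandas' default linear interpolation for quantiles:

```python
import pandas as pd

# Hypothetical response times for two conditions.
df = pd.DataFrame({
    "iv1": ["noise"] * 5 + ["quiet"] * 5,
    "rt": [900.0, 950.0, 1000.0, 1050.0, 1100.0,
           400.0, 450.0, 500.0, 550.0, 600.0],
})

grouped = df.groupby("iv1")["rt"]

# IQR = Q3 - Q1, computed per group.
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
print(iqr)
```

The same subtraction works on a frame with `0.25` and `0.75` columns, such as the unstacked quantile table.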


### Pandas Variance

In [22]:
grouped_data['rt'].var().reset_index()

,iv1,iv2,rt
0,noise,1,12017.137211
1,noise,2,11502.580176
2,noise,3,16899.063947
3,quiet,1,17161.968844
4,quiet,2,8233.896362
5,quiet,3,14054.565128
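
pandas computes the sample variance (`ddof=1`) by default, so each value above is the square of the corresponding standard deviation (for example, 109.622704 squared is about 12017.14). NumPy's `np.var` defaults to the population variance (`ddof=0`) instead. The sketch below illustrates the difference on made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical response times.
rt = pd.Series([980.0, 1010.0, 1125.0, 890.0])
n = len(rt)

sample_var = rt.var()                  # pandas default: ddof=1
population_var = np.var(rt.to_numpy()) # NumPy default: ddof=0

print(sample_var, rt.std() ** 2)       # the same quantity
print(population_var, sample_var * (n - 1) / n)
```

Passing `ddof=1` to `np.var`, or `ddof=0` to the pandas method, makes the two libraries agree.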
