## Descriptive Statistics

 Import **NumPy**, **SciPy**, and **Pandas**

In [2]:
import numpy as np
import pandas as pd
import scipy 

 Randomly generate 1,000 samples from a normal distribution (mean = 100, standard deviation = 15) using `np.random.normal()`.

In [3]:
samples = np.random.normal(loc=100,scale=15,size=1000)
samples

array([ 94.99499135, 115.43966277,  92.39191197, 107.83166172,
       100.03989646, 111.72485486,  93.9213909 ,  89.27264879,
       103.73879905, 125.63869814,  74.73186015,  77.19728831,
       110.92507655, 110.9998776 , 109.32857417,  77.77591393,
        72.86391933, 101.71505845, 109.09054677,  97.11924593,
       133.34584442, 123.83158116,  98.44499649, 112.13567634,
        86.81999349,  83.53584845, 102.30479241,  75.39136827,
        97.82707814, 117.4464219 , 101.63315331, 121.92024805,
       119.75669661, 104.91628665,  86.65361005, 114.13522469,
        81.7805784 ,  94.60770962, 109.06119387,  85.055953  ,
       108.25877781, 109.75836432,  73.86750904, 108.48197557,
       100.38727999, 133.59988281,  95.58527234, 113.71647487,
        92.23973703,  91.43355527,  82.82323435, 106.51246889,
       109.69902949, 112.07255338, 100.0664671 ,  95.10422618,
        99.67462489, 113.91402933, 101.56034462,  97.41117172,
       122.67164217, 113.59300729,  95.52503466,  90.84
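
`np.random.normal()` draws from NumPy's global random state, so the values printed above change on every run. The sketch below shows one way to make the draw reproducible, assuming NumPy's `default_rng` generator API. The seed value 42 and the names `rng`, `rng_again`, `reproducible_samples`, and `same_samples` are arbitrary illustrative choices and do not come from the original notebook.

```python
import numpy as np

# Seeded generator: the same seed yields the same samples on every run.
rng = np.random.default_rng(seed=42)  # 42 is an arbitrary, illustrative seed
reproducible_samples = rng.normal(loc=100, scale=15, size=1000)

# A second generator created with the same seed reproduces the draw exactly.
rng_again = np.random.default_rng(seed=42)
same_samples = rng_again.normal(loc=100, scale=15, size=1000)

print(reproducible_samples.shape)
print(np.array_equal(reproducible_samples, same_samples))
```

The statistics computed from such a draw will differ slightly from the outputs shown in this notebook, which came from an unseeded run.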

Compute the **mean**, **median**, and **mode**

In [4]:
from scipy import stats
mean = samples.mean()
median = np.median(samples)
# For continuous data nearly every value is unique (count = 1),
# so stats.mode() simply returns the smallest sample.
mode = stats.mode(samples)
print("mean: ",mean,"\n","median: ",median,"\n","mode: ",mode,sep="")

mean: 100.82159442968276
median: 101.67175307166639
mode: ModeResult(mode=array([52.82178326]), count=array([1]))
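
The mode reported above has a count of 1 and equals the minimum sample, which says little about where the data are concentrated. For continuous data, one common workaround is to bin the values and report the most populated bin. The following is a minimal sketch under that assumption. The choice of 20 bins, the seed, and the names `demo_samples` and `modal_center` are illustrative and not part of the original notebook.

```python
import numpy as np

# Illustrative data: a seeded draw with the same parameters as the notebook.
rng = np.random.default_rng(seed=0)
demo_samples = rng.normal(loc=100, scale=15, size=1000)

# Count how many samples fall into each of 20 equal-width bins.
counts, bin_edges = np.histogram(demo_samples, bins=20)

# The modal bin is the one with the highest count.
modal_bin = np.argmax(counts)
modal_center = (bin_edges[modal_bin] + bin_edges[modal_bin + 1]) / 2

print("modal bin:", bin_edges[modal_bin], "to", bin_edges[modal_bin + 1])
print("approximate mode:", modal_center)
```

The result depends on the number of bins, so it is an estimate rather than an exact statistic.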


Compute the **min**, **max**, **Q1**, **Q3**, and **interquartile range**

In [5]:
# Use min_val / max_val so the built-in min() and max() are not overwritten.
min_val = np.min(samples)
max_val = np.max(samples)
q1 = np.percentile(samples,25)
q3 = np.percentile(samples,75)
iqr = q3 - q1
print("min: ",min_val,"\n","max: ",max_val,"\n","q1: ",q1,"\n","q3: ",q3,"\n","iqr: ",iqr,sep="")

min: 52.82178325511321
max: 146.23870131060457
q1: 90.89783372889514
q3: 110.27260581227576
iqr: 19.37477208338062
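
A frequent use of the interquartile range is flagging potential outliers with the 1.5 × IQR rule (Tukey's fences). The sketch below applies that rule to a small hand-made array so the arithmetic is easy to follow. The array `data` and the fence variable names are illustrative assumptions, not part of the original exercise.

```python
import numpy as np

# Small illustrative array with one obviously extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# np.percentile accepts a list, so both quartiles come from one call.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("q1:", q1, "q3:", q3, "iqr:", iqr)
print("outliers:", outliers)
```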


Compute the **variance** and **standard deviation**

In [6]:
variance = samples.var()
std_dev = samples.std()
print("variance: ",variance,"\n","standard deviation: ",std_dev,"\n",sep="")

variance: 225.04545545316643
standard deviation: 15.001515105254082
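
By default, NumPy's `var()` and `std()` divide by n (`ddof=0`), which gives the population variance and standard deviation. Passing `ddof=1` divides by n - 1 and gives the sample estimates. The short sketch below shows the difference on a small illustrative array, where `data` is a made-up example and not from the notebook.

```python
import numpy as np

# Illustrative data: mean is 5, squared deviations sum to 32.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Default ddof=0 divides by n (population statistics).
population_var = data.var()
population_std = data.std()

# ddof=1 divides by n - 1 (sample statistics).
sample_var = data.var(ddof=1)
sample_std = data.std(ddof=1)

print("population variance:", population_var, "std:", population_std)
print("sample variance:", sample_var, "std:", sample_std)
```

With 1,000 samples the two versions differ only slightly, but the distinction matters for small data sets.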



Compute the **skewness** and **kurtosis**

In [7]:
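from scipy.stats import kurtosis, skew
skewness = skew(samples)
# Store the result under a different name so the imported kurtosis()
# function is not overwritten and the cell can be re-run.
excess_kurtosis = kurtosis(samples)
print("skewness: ",skewness,"\n","kurtosis: ",excess_kurtosis,"\n",sep="")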

skewness: -0.12711730721463557
kurtosis: 0.09311264219509052
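
`scipy.stats.kurtosis` returns excess kurtosis by default (Fisher's definition), for which a normal distribution scores 0. That is why the value above is close to zero. Passing `fisher=False` returns Pearson's definition, where a normal distribution scores 3. A minimal sketch of the relationship follows, using a seeded illustrative draw. The name `demo_samples` and the seed are assumptions.

```python
import numpy as np
from scipy.stats import kurtosis

# Illustrative data: a seeded draw with the same parameters as the notebook.
rng = np.random.default_rng(seed=0)
demo_samples = rng.normal(loc=100, scale=15, size=1000)

# Fisher definition (default): normal distribution -> 0.
excess = kurtosis(demo_samples)

# Pearson definition: normal distribution -> 3.
pearson = kurtosis(demo_samples, fisher=False)

print("excess (Fisher) kurtosis:", excess)
print("Pearson kurtosis:", pearson)
print("difference:", pearson - excess)
```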



## NumPy Correlation Calculation

Create an array x of the integers from 10 to 20 (both inclusive) using `np.arange()`. The stop argument is exclusive, so pass 21.

In [8]:
x = np.arange(10,21)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

Then use `np.array()` to create a second array y containing 11 arbitrary integers, one for each element of x.

In [9]:
y = np.array([1,5,4,2,3,7,8,12,45,78,21])
y

array([ 1,  5,  4,  2,  3,  7,  8, 12, 45, 78, 21])

Once you have two arrays of the same length, you can compute the **correlation coefficient** between x and y

In [10]:
r = np.corrcoef(x,y)
r

array([[1.        , 0.68095243],
       [0.68095243, 1.        ]])
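
`np.corrcoef()` returns the full 2 × 2 correlation matrix. The diagonal entries are the correlation of each array with itself (always 1), and the off-diagonal entries hold the coefficient between x and y. The sketch below extracts that scalar and checks it against the Pearson formula written out by hand. The names `r_xy`, `r_manual`, `dx`, and `dy` are illustrative.

```python
import numpy as np

x = np.arange(10, 21)
y = np.array([1, 5, 4, 2, 3, 7, 8, 12, 45, 78, 21])

# Off-diagonal element of the correlation matrix is r between x and y.
r_matrix = np.corrcoef(x, y)
r_xy = r_matrix[0, 1]

# Pearson's r by hand: covariance term divided by the product of spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print("r from corrcoef:", r_xy)
print("r from formula: ", r_manual)
```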

## Pandas Correlation Calculation

Run the code below

In [11]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

Call the relevant method to calculate Pearson's r correlation.

In [12]:
r = y.corr(x)
r

0.7586402890911869

OPTIONAL. Call the relevant method to calculate Spearman's rho correlation.

In [13]:
rho = y.corr(x, method="spearman")
rho

0.9757575757575757
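
Spearman's rho is Pearson's r computed on the ranks of the values rather than on the values themselves. This explains why rho is higher than r here: y rises almost monotonically with x, but not linearly. A minimal sketch of the equivalence follows, where `rho_from_ranks` is an illustrative name.

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Replace each value by its rank, then compute ordinary Pearson correlation.
rho_from_ranks = y.rank().corr(x.rank())

print(rho_from_ranks)
```

The printed value should agree with the `method="spearman"` result shown above, up to floating-point rounding.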

## Seaborn Dataset Tips

Import Seaborn Library

In [14]:
import seaborn as sns

Load "tips" dataset from Seaborn

In [15]:
tips = sns.load_dataset("tips")

Generate descriptive statistics, including those that summarize central tendency and dispersion.

In [16]:
tips.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |
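
By default, `describe()` summarizes only the numeric columns, which is why `sex`, `smoker`, `day`, and `time` are absent from the table above. Passing `include="all"` adds count, unique, top, and freq for non-numeric columns. The sketch below uses a small stand-in frame so it runs without downloading the dataset. The frame `tips_head` is built by hand from the first five rows printed by `tips.head()` in this notebook and keeps only four of the columns, so its numbers will not match the full 244-row summary.

```python
import pandas as pd

# Stand-in for the tips dataset: first five rows, four columns only.
tips_head = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4],
})

# Default: numeric columns only.
numeric_summary = tips_head.describe()

# include="all": also summarizes the non-numeric "sex" column.
full_summary = tips_head.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary)
```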


Call the relevant method to calculate pairwise Pearson's r correlation of columns

In [None]:
%pip install pingouin

In [18]:
tips.head()

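| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |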


In [17]:
import pingouin as pg
pg.pairwise_corr(tips, method='pearson')

| | X | Y | method | alternative | n | r | CI95% | p-unc | BF10 | power |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | total_bill | tip | pearson | two-sided | 244 | 0.675734 | [0.6, 0.74] | 6.692471e-34 | 4.952e+30 | 1.0 |
| 1 | total_bill | size | pearson | two-sided | 244 | 0.598315 | [0.51, 0.67] | 4.39351e-25 | 1.002e+22 | 1.0 |
| 2 | tip | size | pearson | two-sided | 244 | 0.489299 | [0.39, 0.58] | 4.300543e-16 | 1.472e+13 | 1.0 |
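
Pandas can also produce pairwise Pearson correlations without an extra package, through `DataFrame.corr()`. It returns a symmetric correlation matrix rather than pingouin's long table, and it reports no confidence intervals or p-values. A minimal sketch follows, again using a hand-built stand-in (`tips_head`, the first five rows shown by `tips.head()`), so the coefficients will differ from the 244-row results above.

```python
import pandas as pd

# Stand-in for the tips dataset: numeric columns of the first five rows.
tips_head = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "size": [2, 3, 3, 2, 4],
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr_matrix = tips_head[["total_bill", "tip", "size"]].corr(method="pearson")

print(corr_matrix)
```

On the full dataset, the same call on the numeric columns of `tips` should give the r values reported by pingouin in the off-diagonal cells.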
