<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Descriptive-Statistics" data-toc-modified-id="Descriptive-Statistics-1">Descriptive Statistics</a></span><ul class="toc-item"><li><span><a href="#Measures" data-toc-modified-id="Measures-1.1">Measures</a></span></li><li><span><a href="#Packages-for-descriptive-statistics" data-toc-modified-id="Packages-for-descriptive-statistics-1.2">Packages for descriptive statistics</a></span><ul class="toc-item"><li><span><a href="#The-numpy-version" data-toc-modified-id="The-numpy-version-1.2.1">The numpy version</a></span><ul class="toc-item"><li><span><a href="#Average" data-toc-modified-id="Average-1.2.1.1">Average</a></span></li><li><span><a href="#Measures-of-dispersion" data-toc-modified-id="Measures-of-dispersion-1.2.1.2">Measures of dispersion</a></span></li></ul></li><li><span><a href="#The-pandas-version" data-toc-modified-id="The-pandas-version-1.2.2">The pandas version</a></span></li></ul></li><li><span><a href="#Measures-of-correlation-between-pairs" data-toc-modified-id="Measures-of-correlation-between-pairs-1.3">Measures of correlation between pairs</a></span><ul class="toc-item"><li><span><a href="#From-the-base-python-statistics-module" data-toc-modified-id="From-the-base-python-statistics-module-1.3.1">From the base python statistics module</a></span></li></ul></li></ul></li><li><span><a href="#Got-to-Intro-to-SciPy-notebook" data-toc-modified-id="Got-to-Intro-to-SciPy-notebook-2">Got to Intro to SciPy notebook</a></span></li></ul></div>

# Descriptive Statistics

Descriptive statistics is the study of the sample itself: we compute measures such as the mean, median, and variance, and examine how they depend on the features in the data.

Inferential statistics generalizes from the sample to the whole population. After studying the measures and relationships in the sample, we might estimate the population mean of a numerical feature or hypothesize a relationship between two or more features.

![descriptive_statistics.png](attachment:descriptive_statistics.png)

## Measures

- Central tendency tells you about the center of the data.
    - Useful measures include the mean, median, and mode.
- Variability tells you about the spread of the data. 
    - Useful measures include variance and standard deviation.
- Correlation or joint variability tells you about the relation between a pair of variables in a dataset. 
    - Useful measures include covariance and the correlation coefficient.
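
As a rough illustration of the three groups of measures listed above, here is a minimal sketch. The `heights` and `weights` lists are made-up values used only for this example, and the variable names are assumptions rather than anything defined elsewhere in this notebook.

```python
import statistics
import numpy as np

# Hypothetical paired measurements, chosen only for illustration
heights = [150, 160, 160, 170, 180]
weights = [50, 58, 60, 68, 80]

# Central tendency
mean_h = statistics.mean(heights)
median_h = statistics.median(heights)
mode_h = statistics.mode(heights)

# Variability (sample versions, which divide by n - 1)
var_h = statistics.variance(heights)
std_h = statistics.stdev(heights)

# Joint variability between the two variables
cov_hw = np.cov(heights, weights)[0, 1]
corr_hw = np.corrcoef(heights, weights)[0, 1]

print(mean_h, median_h, mode_h)
print(var_h, round(std_h, 2))
print(round(cov_hw, 2), round(corr_hw, 3))
```

The correlation is close to 1 here because the made-up weights rise almost in step with the heights.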



## Packages for descriptive statistics

- `statistics` is a module in Python's standard library for descriptive statistics. You can use it if your datasets are not too large or if you can't rely on importing other libraries.
- NumPy is for numerical computing, optimized for working with single- and multi-dimensional arrays.
- SciPy is for scientific computing and is built on NumPy. It offers additional functionality, including `scipy.stats` for statistical analysis.
- Pandas is for data analysis and is built on NumPy. It excels at handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.
- Matplotlib and seaborn are packages that can be used to visualize data.


In [None]:
import pandas as pd
import numpy as np
from numpy.random import randn
import scipy.stats
import statistics
from matplotlib import pyplot as plt

### The numpy version

https://numpy.org/doc/stable/reference/routines.statistics.html

In [None]:
a = [65,23,45,67,83,12,67,90,43,22,56,65,67,76]
b = np.array([[10,20,30], [40,50,60]])

In [None]:
np.mean(a)

In [None]:
np.mean(b)

In [None]:
np.mean(b, axis=0)  # going down each column

In [None]:
np.mean(b, axis=1)  # going across each row

#### Average

In [None]:
np.average(a)

In [None]:
np.average(b)

In [None]:
np.average(b, axis=0)  # going down each column

In [None]:
# Using a weighted average
x = [2,4, 8]
y = [0.2, 0.5, 0.3]
np.average(x, weights=y)
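
The weighted average in the cell above can be checked by hand: multiply each value by its weight, add the products, and divide by the sum of the weights. The sketch below reuses the same `x` and `y`; the names `manual` and `library` are introduced here only for the comparison.

```python
import numpy as np

x = [2, 4, 8]
y = [0.2, 0.5, 0.3]

# Sum of value * weight, divided by the sum of the weights
manual = sum(xi * wi for xi, wi in zip(x, y)) / sum(y)
library = np.average(x, weights=y)

print(manual, library)
print(np.isclose(manual, library))
```

Because the weights here already sum to 1, the division does not change the result, but it matters whenever the weights are raw counts or scores.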

In [None]:
np.median(a)

In [None]:
# Don't try this.  Numpy does not have a mode function
# np.mode(a)
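
If you want a mode without leaving NumPy, one possible workaround is to count the unique values and pick the most frequent one. This is only a sketch, not a NumPy-provided mode function; `mode_value` and `mode_count` are names chosen for this example.

```python
import numpy as np

a = [65, 23, 45, 67, 83, 12, 67, 90, 43, 22, 56, 65, 67, 76]

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(a, return_counts=True)

# argmax picks the first maximum, so ties resolve to the smallest value
mode_value = values[np.argmax(counts)]
mode_count = counts.max()

print(mode_value, mode_count)
```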

#### Measures of dispersion

In [None]:
np.var(a)

In [None]:
np.std(a)

In [None]:
np.percentile(a, 50)

In [None]:
np.ptp(a)
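
One detail worth checking when comparing libraries: `np.var` and `np.std` divide by n by default (population formulas, `ddof=0`), while the `statistics` module's `variance` and pandas' `Series.var` divide by n - 1 (sample formulas). The sketch below shows the difference on the same list `a`; the variable names are assumptions made for this example.

```python
import math
import statistics
import numpy as np

a = [65, 23, 45, 67, 83, 12, 67, 90, 43, 22, 56, 65, 67, 76]
n = len(a)

pop_var = np.var(a)             # divides by n (ddof=0, the NumPy default)
sample_var = np.var(a, ddof=1)  # divides by n - 1

# The two differ only by the factor n / (n - 1)
print(math.isclose(sample_var, pop_var * n / (n - 1)))

# statistics.variance is the sample version, statistics.pvariance the population one
print(math.isclose(sample_var, statistics.variance(a)))
print(math.isclose(pop_var, statistics.pvariance(a)))
```

Passing `ddof=1` to the NumPy functions is the usual way to make the results line up across packages.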

Some descriptive statistics that are not available in NumPy are found in SciPy. For example:

- scipy.stats.mode(a)
- scipy.stats.skew(a)
- scipy.stats.kurtosis(a)

https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html

![skewness.png](attachment:skewness.png)

In [None]:
scipy.stats.mode(a)

In [None]:
skewness = scipy.stats.skew(a, bias=False)
skewness

# skewness = 0 : symmetric distribution (a normal distribution is one example).
# skewness > 0 : longer right tail; most values sit on the left.
# skewness < 0 : longer left tail; most values sit on the right.
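
To see how the sign of the skewness relates to the shape of the data, here is a small sketch with three made-up arrays (hypothetical values, not taken from the dataset): one with a long right tail, its mirror image, and a symmetric one.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical data: one large value stretches the right tail
right_tailed = np.array([1, 1, 2, 2, 3, 10])
# Mirroring the data moves the long tail to the left
left_tailed = -right_tailed
# Evenly spaced values are symmetric around their mean
symmetric = np.array([1, 2, 3, 4, 5])

skew_right = skew(right_tailed)
skew_left = skew(left_tailed)
skew_sym = skew(symmetric)

print(skew_right > 0, skew_left < 0, abs(skew_sym) < 1e-12)
```

A long right tail gives positive skewness, a long left tail gives negative skewness, and symmetric data gives a value of zero.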

In [None]:
# Don't try this.  Scipy does not have a mean function
# sp.mean(a)

### The pandas version

![pandas_stats.png](attachment:pandas_stats.png)

In [None]:
df = pd.read_csv('/Users/jimcody/Documents/2021Python/statistics/data/obesity_data.csv')
df.columns= df.columns.str.lower()
df.head()

In [None]:
df.describe()  # All numeric columns

In [None]:
df.weight.describe()  # Just the weight column

In [None]:
# for categorical data
df.mtrans.describe()
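
The output of `describe()` depends on the column type. The sketch below uses a tiny made-up DataFrame (the values are hypothetical, and only the column names `weight` and `mtrans` mirror the dataset above) so it can run without the CSV file.

```python
import pandas as pd

# Hypothetical rows, used only to show the shape of the output
demo = pd.DataFrame({
    "weight": [60.0, 72.5, 80.0, 95.5],
    "mtrans": ["Walking", "Bike", "Walking", "Automobile"],
})

# On a mixed DataFrame, describe() summarizes only the numeric columns by default
num_summary = demo.describe()

# On a text column, describe() reports count, unique, top, and freq instead
cat_summary = demo["mtrans"].describe()

print(num_summary)
print(cat_summary)
```

For categorical data, `top` is the most frequent value and `freq` is how many times it occurs.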

In [None]:
df.weight.count()

In [None]:
s = df.weight.mean()
t = df.weight.mode()
u = df.weight.median()
v = df.weight.min()
w = df.weight.max()
print('Mean:   ',s)
print('Median: ',u)
print('Min:    ',v)
print('Max:    ',w)
print('Mode:   ',t)

In [None]:
s = df.weight.quantile(q=0.25)
t = df.weight.quantile(q=0.50)
u = df.weight.quantile(q=0.75)
print('25: ',s)
print('50: ',t)
print('75: ',u)


## Measures of correlation between pairs

![correlation-2.png](attachment:correlation-2.png)

Covariance and correlation are closely related measures, and both are widely used in statistics and regression analysis. Covariance shows the direction in which two variables vary together, whereas correlation also shows how strongly they are related, on a standardized scale.

Covariance describes a systematic relationship between two random variables in which a change in one variable is accompanied by a change in the other.

The covariance value can range from -∞ to +∞, with a negative value indicating a negative relationship and a positive value indicating a positive relationship. Its magnitude depends on the units of the variables, so it is hard to interpret on its own.

Correlation measures the degree to which two random variables move together. It is the covariance rescaled by the two standard deviations, so it always lies between -1 and +1.
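
The relationship between the two measures can be sketched with NumPy: the Pearson correlation equals the covariance divided by the product of the two standard deviations. The data below is randomly generated for illustration only, and the seed and variable names are assumptions of this example.

```python
import numpy as np

# Synthetic data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

cov_xy = np.cov(x, y)[0, 1]       # sample covariance (divides by n - 1)
r_xy = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient

# Correlation is covariance scaled by both sample standard deviations
manual_r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Changing the units of x (here a factor of 100) scales the covariance
# by the same factor but leaves the correlation unchanged
cov_scaled = np.cov(100 * x, y)[0, 1]
r_scaled = np.corrcoef(100 * x, y)[0, 1]

print(np.isclose(manual_r, r_xy))
print(np.isclose(cov_scaled, 100 * cov_xy), np.isclose(r_scaled, r_xy))
```

This is why correlation is usually easier to compare across pairs of variables than covariance.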

In [None]:
df.cov(numeric_only=True)     # covariance; numeric_only (pandas 1.5+) skips text columns such as mtrans

In [None]:
df.fcvc.cov(df.ncp)

In [None]:
df.corr(numeric_only=True)    # correlation between numeric columns (numeric_only needs pandas 1.5+)

In [None]:
# A scipy alternative

from scipy.stats import pearsonr
from scipy.stats import spearmanr

data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
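# calculate Pearson's correlation; pearsonr returns (statistic, p-value)
corr1, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr1)
# calculate Spearman's rank correlation; spearmanr also returns (statistic, p-value)
corr2, _ = spearmanr(data1, data2)
print('Spearmans correlation: %.3f' % corr2)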

In [None]:
# Used with a dataframe
corr1= pearsonr(df.fcvc, df.ncp)
corr1

### From the base python statistics module

In [None]:
statistics.variance(df.weight)


# Got to Intro to SciPy notebook