# Introduction to Statistics

- __Theoretical perspective:__ Statistics is primarily an applied branch of mathematics, which tries to make sense of observations in the real world. 

- __Practical perspective:__ Statistics can help us make decisions in uncertain situations.

--

`Statistics and Probability are not the same. Probability theory enables us to find the consequences of a given ideal world, while statistical theory enables us to measure the extent to which our world is ideal`

---

## Statistics

- __The Field of Statistics:__ the study and practice of collecting and analyzing data.

- __Statistics:__ facts about, or summaries of data.


--


- __Descriptive statistics (sample):__ describe what the data shows, making the data we get more digestible even though we lose information about individual data points.

- __Inferential statistics (population):__ allow us to draw conclusions that extend beyond the data (e.g.: testing a hypothesis) and help us make decisions about data when there's uncertainty.

--

`A statistic works as a proxy: it is related to what we want to measure, but isn't exactly what we want to measure`


---

## Exploratory Data Analysis (EDA)

It may seem counterintuitive, but classical statistics focused almost exclusively on _inference_ (i.e.: a complex set of procedures for drawing conclusions about large populations based on small samples). However, with the ready availability of computing power and expressive data analysis software, exploratory data analysis (a.k.a.: descriptive statistics), consisting of simple plots and summary statistics, has evolved well beyond its original scope.



---

### Elements of Structured Data (Data Types)

__Numeric:__ Data that are expressed on a numeric scale.

- __Continuous ->__ Data that can take on any value in an interval (Synonyms: interval, float, numeric)

- __Discrete ->__ Data that can take on only integer values, such as counts (Synonyms: integer, count)



__Categorical:__ Data that can take on only a specific set of values representing a set of possible categories (Synonyms: enums, enumerated, factors, nominal).

- __Binary ->__ A special case of categorical data with just two categories of values, e.g.: 0/1, true/false (Synonyms: dichotomous, logical, indicator, boolean)

- __Ordinal ->__ Categorical data that has an explicit ordering (Synonyms: ordered factor).
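
As a rough sketch, these data types can be mapped onto pandas dtypes as shown here. The column names and values are made up for illustration, and other encodings are possible.

```python
import pandas as pd

# Hypothetical toy data; every column name and value is illustrative only.
df = pd.DataFrame({
    'height_m': [1.62, 1.80, 1.75],               # continuous
    'n_children': [0, 2, 1],                      # discrete (counts)
    'eye_color': ['brown', 'blue', 'brown'],      # categorical (nominal)
    'smoker': [False, True, False],               # binary
    'shirt_size': ['small', 'large', 'medium'],   # ordinal (ordered below)
})

# Nominal categories: a fixed set of labels with no ordering.
df['eye_color'] = df['eye_color'].astype('category')

# Ordinal categories: the same idea, plus an explicit order.
df['shirt_size'] = pd.Categorical(df['shirt_size'],
                                  categories=['small', 'medium', 'large'],
                                  ordered=True)

print(df.dtypes)
print(df['shirt_size'].min(), '<', df['shirt_size'].max())
```

Because `shirt_size` is declared as ordered, comparisons such as `min()` and `max()` follow the category order rather than alphabetical order.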

---

### Rectangular Data

__Dataframe:__ Rectangular data (like a spreadsheet) is the basic data structure for statistical (and machine learning) models.

__Feature:__ A column within a table is commonly referred to as a feature (Synonyms: attribute, input, predictor, variable).

__Record:__ A row within a table is commonly referred to as a record (Synonyms: case, example, instance, observation, pattern, sample)
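
A minimal sketch of these terms on a small pandas DataFrame. The table contents are invented for illustration only.

```python
import pandas as pd

# Hypothetical records; the values are made up for illustration.
df = pd.DataFrame({
    'city': ['A', 'B', 'C'],
    'population': [120_000, 45_000, 980_000],
    'has_airport': [False, False, True],
})

n_records, n_features = df.shape   # rows are records, columns are features
print(n_records, 'records x', n_features, 'features')

print(list(df.columns))   # feature names
print(df.iloc[0])         # first record, as a Series indexed by feature name
print(df['population'])   # one feature across all records
```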

---

### Libraries

- [SciPy](https://docs.scipy.org/doc/scipy/reference/index.html#scipy-api)

- [statsmodels](https://www.statsmodels.org/stable/api.html)

- [wquantiles](https://pypi.org/project/wquantiles/)

In [None]:
# imports

import pandas as pd
import numpy as np
from scipy.stats import trim_mean   # conda install scipy
from statsmodels import robust      # conda install -c conda-forge statsmodels 
import wquantiles                   # pip install wquantiles

import seaborn as sns
import matplotlib.pyplot as plt

---

### Estimates of Location

An estimate of where most of the data is located (i.e.: its central tendency)

- __Mean:__ The sum of all values divided by the number of values (a.k.a.: average)

- __Weighted mean:__ The sum of each value multiplied by its weight, divided by the sum of the weights (a.k.a.: weighted average)

- __Trimmed mean:__ The average of all values after dropping a fixed number of extreme values (a.k.a.: truncated mean).

- __Median:__ The value such that one-half of the data lies above it and one-half below it (a.k.a.: 50th percentile)

- __Weighted median:__ The value such that one-half of the sum of the weights lies above it and one-half below it in the sorted data.

--

- __Percentile:__ The value such that _P_ percent of the values take on this value or less and (100 - _P_) percent take on this value or more (a.k.a.: quantile).

- __Robust:__ Not sensitive to extreme values (a.k.a.: resistant).

- __Outlier:__ A data value that is very different from most of the data (a.k.a.: extreme value).
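
To make the idea of robustness concrete, the following sketch compares the estimates on a small made-up sample that contains one outlier. The trimmed mean is computed by hand here (sort, then drop 10% of the values at each end), and the zero weight given to the outlier is an arbitrary choice for illustration.

```python
import numpy as np

# Made-up sample: nine ordinary values and one outlier (1000).
values = np.array([10, 12, 11, 13, 12, 11, 10, 14, 12, 1000], dtype=float)

mean = values.mean()
median = np.median(values)

# Trimmed mean by hand: sort, then drop the lowest and highest 10% of values.
k = int(0.1 * len(values))           # number of values to drop at each end
trimmed = np.sort(values)[k:len(values) - k].mean()

# Weighted mean: the outlier gets a (hypothetical) weight of zero.
weights = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype=float)
wmean = np.average(values, weights=weights)

print('Mean:', mean)
print('Trimmed mean:', trimmed)
print('Median:', median)
print('Weighted mean:', wmean)
```

The single outlier pulls the mean up to 110.5, while the median (12.0) and the trimmed mean (11.875) stay close to the bulk of the data.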


In [None]:
state = pd.read_csv('./datasets/state.csv')
state

In [None]:
# Mean

mean = state['Population'].mean()

print('Mean:', mean)

In [None]:
# Weighted mean

mean = state['Murder.Rate'].mean()

wmean = np.average(state['Murder.Rate'], weights=state['Population'])

print('Mean:', mean, '\nWeighted mean:', wmean)

In [None]:
# Trimmed mean

tmean = trim_mean(state['Murder.Rate'], 0.1)

print('Trimmed mean:', tmean)

In [None]:
# Median 

median = state['Population'].median()

print('Median:', median)

In [None]:
# Weighted median

median = state['Murder.Rate'].median()

wmedian = wquantiles.median(state['Murder.Rate'], weights=state['Population'])

print('Median:', median, '\nWeighted median:', wmedian)
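
The definition of the weighted median can be sketched directly in NumPy. The helper `weighted_median` below is a hypothetical illustration, not the `wquantiles` implementation: it does no interpolation, so it may return slightly different values than `wquantiles.median` on the same data.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)                 # sort values, carry weights along
    values, weights = values[order], weights[order]
    cum_weights = np.cumsum(weights)
    # First position where the running weight is at least half of the total.
    idx = np.searchsorted(cum_weights, 0.5 * cum_weights[-1])
    return values[idx]

# Made-up example: the value 5 carries most of the weight.
values = [1, 2, 3, 4, 5]
weights = [1, 1, 1, 1, 6]

print('Median:', np.median(values))
print('Weighted median:', weighted_median(values, weights))
```

With equal weights the helper returns the ordinary median for an odd number of values. With an even number of values it returns the lower of the two middle values instead of their average.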

---

### Estimates of Variability

Variability measures whether the data values are tightly clustered or spread out. These estimators are not equivalent: they generally give different values for the same data.

- __Deviations:__ The difference between the observed values and the estimate of location (a.k.a.: errors, residuals).

- __Mean absolute deviation:__ The mean of the absolute values of the deviations from the mean (a.k.a.: l1-norm, Manhattan norm).

- __Variance:__ The sum of squared deviations from the mean divided by `n - 1` where _n_ is the number of data values (a.k.a.: mean-squared-error).

- __Standard deviation:__ The square root of the variance.

--

- __Order statistics:__ Metrics based on the data values sorted from smallest to biggest (a.k.a.: ranks).

- __Range:__ The difference between the largest and the smallest value in a data set.

- __Interquartile range:__ The difference between the 75th percentile and the 25th percentile (a.k.a.: IQR).
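
The following sketch computes several of these estimates on a small made-up sample, once as is and once with its largest value replaced by an outlier. The helper name `spread_summary` and the data are assumptions made for illustration.

```python
import numpy as np

def spread_summary(x):
    """Return several variability estimates for a 1-D sequence of numbers."""
    x = np.asarray(x, dtype=float)
    deviations = x - x.mean()
    variance = (deviations ** 2).sum() / (len(x) - 1)
    q25, q75 = np.percentile(x, [25, 75])
    return {
        'mean_abs_dev': np.abs(deviations).mean(),
        'variance': variance,
        'std_dev': np.sqrt(variance),
        'range': x.max() - x.min(),
        'iqr': q75 - q25,
    }

clean = [2, 4, 4, 5, 6, 7, 8, 12]
with_outlier = clean[:-1] + [120]   # replace the largest value with an outlier

for name, data in [('clean', clean), ('with outlier', with_outlier)]:
    print(name, spread_summary(data))
```

In this example the outlier inflates the range, variance and standard deviation, while the interquartile range is unchanged because it only depends on the middle half of the sorted data.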


In [None]:
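# Deviations (observed value minus the estimate of location, here the mean)

state['dev_population'] = state['Population'] - state['Population'].mean()

state['dev_murder'] = state['Murder.Rate'] - state['Murder.Rate'].mean()

# Deviations from the mean sum to (approximately) zero, so their raw sum says nothing about spread
print('Sum of Population deviations:', state['dev_population'].sum(), '\nSum of Murder Rate deviations:', state['dev_murder'].sum())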

state

In [None]:
# Mean absolute deviation

state['mean_dev_population'] = abs(state['Population'].mean() - state['Population'])

state['mean_dev_murder'] = abs(state['Murder.Rate'].mean() - state['Murder.Rate'])

print('Population mean absolute deviation:', state['mean_dev_population'].sum() / len(state['mean_dev_population']),
      '\nMurder Rate mean absolute deviation:', state['mean_dev_murder'].sum() / len(state['mean_dev_murder']))

state

In [None]:
# Variance

state['var_population'] = (state['Population'].mean() - state['Population'])**2

variance = state['var_population'].sum() / (len(state['var_population']) - 1)

print('Variance:', variance)
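
As a cross-check of the `n - 1` formula, the sketch below compares a manual computation with the built-in functions on made-up values. Note that pandas and NumPy use different defaults for the divisor (`ddof`).

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 5.0, 6.0, 7.0, 8.0, 12.0])   # made-up values

# Sum of squared deviations from the mean, divided by n - 1.
manual = ((x - x.mean()) ** 2).sum() / (len(x) - 1)

print('Manual (n - 1):    ', manual)
print('pandas .var():     ', x.var())                       # ddof=1 by default
print('numpy var, ddof=1: ', np.var(x.to_numpy(), ddof=1))
print('numpy var, ddof=0: ', np.var(x.to_numpy()))          # divides by n instead
```

The first three values agree. The last one is smaller because `np.var` divides by `n` unless `ddof=1` is passed.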

In [None]:
# Standard deviation

std_dev = state['Population'].std()

print('Standard deviation:', std_dev)

In [None]:
# Order statistics (spread of sorted data)

feature = 'Population'

order_data = state[['State', feature]].sort_values(by=[feature]).reset_index(drop=True)

order_data

In [None]:
# Range

population_range = state['Population'].max() - state['Population'].min()

murder_range = state['Murder.Rate'].max() - state['Murder.Rate'].min()

print('Population range:', population_range, '\nMurder Rate range:', murder_range)

In [None]:
# Percentiles (a percentile is NOT necessarily a data point: pandas interpolates between values by default)

per_75 = state['Population'].quantile(0.75)

per_25 = state['Population'].quantile(0.25)

per_50 = state['Population'].quantile(0.50)   # == Median

print('Percentile 75th:', per_75, '\nPercentile 25th:', per_25, '\nPercentile 50th:', per_50)
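
A short sketch of why a percentile need not be an actual data point: by default `Series.quantile` interpolates linearly between neighbouring values, and its `interpolation` parameter can be used to snap to an observed value instead. The data are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])   # made-up values

q_linear = s.quantile(0.5)                          # linear interpolation (default)
q_lower = s.quantile(0.5, interpolation='lower')    # nearest data point below
q_higher = s.quantile(0.5, interpolation='higher')  # nearest data point above

print('linear:', q_linear)
print('lower: ', q_lower)
print('higher:', q_higher)
```

Here the default result is 2.5, which does not occur in the data, whereas `'lower'` and `'higher'` return the observed values 2 and 3.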

In [None]:
# Interquartile range

iqr = per_75 - per_25

print('Interquartile range:', iqr)

---
