## Branches of statistics
Knowing which type of statistics your question calls for helps you choose appropriate methods and get the most accurate answer possible. There are two branches of statistics:

*   Descriptive statistics: describes and summarizes the data at hand.
*   Inferential statistics: uses sample data to draw inferences about a larger population.
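
As a rough illustration of the two branches, the sketch below first summarizes a sample (descriptive) and then uses it to estimate a range for the population mean (inferential). The height values are made up for illustration, and the 95% interval relies on the t-distribution from `scipy.stats`, which assumes the data are roughly normal.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: heights (cm) of 8 people drawn from a larger population
sample = np.array([162.0, 170.5, 158.2, 175.1, 168.4, 181.0, 165.3, 172.8])

# Descriptive statistics: summarize the sample itself
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Inferential statistics: estimate the population mean from the sample
# (95% confidence interval based on the t-distribution)
standard_error = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=sample_mean, scale=standard_error)

print(f"Sample mean: {sample_mean:.2f}, sample std: {sample_std:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first two numbers only describe these eight people; the interval is a statement about the population they were drawn from.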



## Two types of data

*   Numeric, or quantitative data

    These are numeric values, further divided into discrete (counted items) and continuous (measured variables). For this type of data we can use summary statistics such as the mean and plots such as scatter plots.

*   Categorical, or qualitative data

    These are values that belong to distinct groups, further divided into nominal (unordered) and ordinal (ordered). For this type of data we use summary statistics such as counts and plots such as bar plots.
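
A small sketch of how these data types might look in pandas. The column names and values are invented for illustration; the point is only that numeric columns are summarized with statistics like the mean, while categorical columns are summarized with counts.

```python
import pandas as pd

# Hypothetical data set with one column of each kind
df_types = pd.DataFrame({
    "age": [23, 35, 41, 29],                      # numeric, discrete
    "height_cm": [162.5, 180.1, 175.0, 168.3],    # numeric, continuous
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],  # categorical, nominal
    "size": pd.Categorical(["S", "L", "M", "S"],
                           categories=["S", "M", "L"],
                           ordered=True),         # categorical, ordinal
})

# Numeric data: summary statistics such as the mean
print(df_types[["age", "height_cm"]].mean())

# Categorical data: counts per category
print(df_types["city"].value_counts())

# Ordinal categories keep their order, so min and max are meaningful
print(df_types["size"].min(), df_types["size"].max())
```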

<div style="text-align:center"><img alt="Python Data Type" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/data.png?raw=true"/></div>

## Measure of Centre
A measure of centre gives the value at the centre or middle of the data set.

*   Mean
*   Median
*   Mode
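
A minimal sketch of computing the mean, median, and mode on a small made-up array. NumPy has no mode function, so this example assumes the standard library's `statistics.mode` is acceptable for that part.

```python
import numpy as np
from statistics import mode

values = np.array([2, 3, 3, 4, 5, 5, 5, 9])

mean_value = np.mean(values)        # sum of the values divided by their count
median_value = np.median(values)    # middle value of the sorted data
mode_value = mode(values.tolist())  # most frequent value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```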



<div style="text-align:center"><img alt="Python Data Type" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/measure_centre.png?raw=true"/></div>

<div style="text-align:center"><img alt="Python Data Type" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/measure_centre_diagram.png?raw=true"/></div>

## Measure of Spread / Measure of Variation
Variability is central to statistics, so when you describe data, never rely on the centre alone. Measures of spread describe how far the values are spread out.

*   Standard Deviation
*   Inter-Quartile Range (IQR)
*   Variance
*   Range
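
The four measures listed above can be sketched with NumPy as follows. The array is made up, and note that `np.var` and `np.std` use the population formulas (dividing by n) unless `ddof=1` is passed.

```python
import numpy as np

values = np.array([4, 8, 15, 16, 23, 42])

variance = np.var(values)   # population variance (ddof=0 by default)
std_dev = np.std(values)    # square root of the variance

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1               # spread of the middle 50% of the data

value_range = values.max() - values.min()  # largest value minus smallest value

print("Variance:", variance)
print("Standard deviation:", std_dev)
print("IQR:", iqr)
print("Range:", value_range)
```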



<div style="text-align:center"><img alt="Python Data Type" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/measure_frequency1.png?raw=true"/></div>

<div style="text-align:center"><img alt="Python Data Type" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/measure_frequency2.png?raw=true"/></div>

## Which measure to use?

*   If the distribution is normal or symmetrical, use the mean and standard deviation. The mean works well for symmetrical data but is sensitive to extreme values.
*   If the distribution is skewed or has large outliers, use the median and IQR. The median is usually the better measure of centre when the data is skewed, i.e. not symmetrical. The range is less suitable here because it is determined entirely by the most extreme values.
*   If the distribution is bimodal, use the mode to figure out whether the two modes represent different groups, or report the range.
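
To see why the median is preferred when outliers are present, the sketch below adds one extreme value to a small made-up data set and compares how far the mean and the median move.

```python
import numpy as np

salaries = np.array([30, 32, 35, 36, 38])  # hypothetical values, in thousands
with_outlier = np.append(salaries, 500)    # one extreme value added

print("Mean:  ", np.mean(salaries), "->", np.mean(with_outlier))
print("Median:", np.median(salaries), "->", np.median(with_outlier))
```

The mean is pulled far towards the extreme value, while the median barely changes.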




## NumPy Statistical Functions

*   np.mean() - Computes the mean value
*   np.median() - Computes the median value
*   np.std() - Computes the standard deviation
*   np.var() - Computes the variance
*   np.average() - Computes the average, weighted if weights are passed
*   np.percentile() - Computes the nth percentile
*   np.amin() - Returns the minimum value
*   np.amax() - Returns the maximum value


```python
import numpy as np

arr1 = np.array([[12, 43, 56], [78, 88, 95], [79, 89, 43], [101, 34, 67]])
arr2 = np.array([5, 6, 7, 12, 34, 67, 89])

# Mean
print("Mean:", np.mean(arr2))

# Median
print("Median:", np.median(arr2))

# Standard deviation (population formula, ddof=0 by default)
print("Standard Deviation:", np.std(arr2))

# Variance (population formula, ddof=0 by default)
print("Variance:", np.var(arr2))

# Average (no weights given, so this equals the mean)
print("Average:", np.average(arr2))

# 5th percentile along axis 0
print("5th percentile:", np.percentile(arr2, 5, axis=0))

# Minimum over all elements of the 2-D array
print("Minimum element:", np.amin(arr1))

# Maximum over all elements of the 2-D array
print("Maximum element:", np.amax(arr1))
```

```
Mean: 31.428571428571427
Median: 12.0
Standard Deviation: 31.409084867994768
Variance: 986.530612244898
Average: 31.428571428571427
5th percentile: 5.3
Minimum element: 12
Maximum element: 101
```
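
Two details are easy to miss here: `np.average` only differs from `np.mean` when weights are passed, and `np.std` and `np.var` default to the population formulas. The sketch below illustrates both; the scores and weights are hypothetical.

```python
import numpy as np

scores = np.array([80, 90, 70])
weights = np.array([0.5, 0.3, 0.2])  # hypothetical weights

# Weighted average: sum(scores * weights) / sum(weights)
weighted = np.average(scores, weights=weights)
plain = np.average(scores)           # no weights, same as np.mean(scores)

data = np.array([5, 6, 7, 12, 34, 67, 89])
population_std = np.std(data)        # divides by n
sample_std = np.std(data, ddof=1)    # divides by n - 1

print("Weighted average:", weighted, "Plain average:", plain)
print("Population std:", population_std, "Sample std:", sample_std)
```

Use `ddof=1` when the data is a sample and you want to estimate the spread of the population it came from.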


## Groupby and Aggregate Function
When you want to compare summary statistics between groups, it’s much easier to use .groupby() and .agg(). The groupby function can be combined with one or more aggregation functions to easily summarize data.

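```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 10, 20, 2],
                   'B': [1, 2, 30, 40],
                   'C': np.random.randn(4)})  # random values, so column C differs on every run

# Group by column B and compute the min and max of every other column
df.groupby('B').agg(['min', 'max'])
```

| B | A min | A max | C min | C max |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 0.067527 | 0.067527 |
| 2 | 10 | 10 | -0.329537 | -0.329537 |
| 30 | 20 | 20 | -0.122889 | -0.122889 |
| 40 | 2 | 2 | 0.114895 | 0.114895 |

Each value of B occurs only once here, so every group contains a single row and its min equals its max.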
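
Aggregation becomes more informative when group labels repeat. The sketch below uses a hypothetical `sales` table (the column names and values are invented) together with pandas named aggregation to produce one summary row per group.

```python
import pandas as pd

# Hypothetical sales data with repeated group labels
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "units": [10, 4, 6, 8, 2],
    "price": [2.5, 3.0, 2.0, 3.5, 4.0],
})

# One row per region, with a different aggregation for each output column
summary = sales.groupby("region").agg(
    units_total=("units", "sum"),
    units_max=("units", "max"),
    price_mean=("price", "mean"),
)
print(summary)
```

Named aggregation keeps the resulting column names flat, which is often easier to work with than the two-level header produced by passing a list of function names.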


## Measure of Spread: Quantiles
Quantiles are a great way of summarizing numerical data since they can be used to measure center and spread, as well as to get a sense of where a data point stands in relation to the rest of the data set.

<div style="text-align:center"><img alt="Python Data Type" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/measure_spread_figure.png?raw=true"/></div>

```python
import numpy as np

arr2 = np.array([5, 6, 7, 12, 34, 67, 89])

print("Q2 quantile of arr : ", np.quantile(arr2, .50))
print("Q1 quantile of arr : ", np.quantile(arr2, .25))
print("Q3 quantile of arr : ", np.quantile(arr2, .75))
print("0.1 quantile (10th percentile) of arr : ", np.quantile(arr2, .1))
```

```
Q2 quantile of arr :  12.0
Q1 quantile of arr :  6.5
Q3 quantile of arr :  50.5
0.1 quantile (10th percentile) of arr :  5.6
```
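
Quantiles can also be read in the other direction: given a data point, what share of the data lies at or below it? A minimal NumPy sketch, assuming the "at or below" definition of a percentile rank (other definitions exist and give slightly different numbers):

```python
import numpy as np

data = np.array([5, 6, 7, 12, 34, 67, 89])
score = 34

# Percentage of values that are less than or equal to the score
percent_at_or_below = np.mean(data <= score) * 100

print(f"{score} is at or above {percent_at_or_below:.1f}% of the data")
```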


```python
import numpy as np

arr = [31, 35, 45, 49, 59, 69, 74, 79, 80, 81, 89, 94, 96, 99, 101, 104, 112, 117, 119, 127, 134]

# First quartile (Q1) and third quartile (Q3), using linear interpolation
Q1, Q3 = np.percentile(arr, [25, 75])

# Interquartile range (IQR)
IQR = Q3 - Q1

print(IQR)
```

```
35.0
```
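
One common use of the IQR is flagging potential outliers with the 1.5 × IQR rule: values further than 1.5 × IQR below Q1 or above Q3 are treated as suspicious. The sketch below applies that rule to a made-up array with one deliberately extreme value; the 1.5 multiplier is a convention, not a fixed law.

```python
import numpy as np

# Hypothetical data; 250 is a deliberately extreme value
data = np.array([31, 35, 45, 49, 59, 69, 74, 79, 80, 81, 250])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences beyond which a value is flagged as a potential outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("Fences:", lower_fence, upper_fence)
print("Potential outliers:", outliers)
```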


## Continuous distribution
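Continuous distributions in scipy.stats take location (loc) and scale (scale) parameters. A continuous distribution can be uniform, or can take a form where some values have a higher probability than others. For the uniform distribution, the values are spread evenly over the interval from loc to loc + scale.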

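```python
import numpy as np
from scipy.stats import uniform

arr2 = np.array([5, 6, 7, 12, 34, 67, 89])

# CDF of a uniform distribution on [4, 9], evaluated at each element
print(uniform.cdf(arr2, loc=4, scale=5))
```

```
[0.2 0.4 0.6 1.  1.  1.  1. ]
```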
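
The CDF gives the probability of falling at or below a value, so the probability of landing in an interval is the difference of two CDF values. A short sketch, assuming a hypothetical waiting time that is uniformly distributed between 4 and 9 minutes:

```python
from scipy.stats import uniform

# Hypothetical waiting time, uniform on [loc, loc + scale] = [4, 9]
wait = uniform(loc=4, scale=5)

p_under_6 = wait.cdf(6)                        # P(X <= 6)
p_between_5_and_7 = wait.cdf(7) - wait.cdf(5)  # P(5 <= X <= 7)
density = wait.pdf(6)                          # constant 1 / scale inside the interval

print("P(X <= 6):", p_under_6)
print("P(5 <= X <= 7):", p_between_5_and_7)
print("Density inside the interval:", density)
```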


## Normal Distribution

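```python
import numpy as np
from scipy.stats import norm

arr2 = np.array([0.91, 0.17, 0.99996833, 0.81, 0.97, 0.54])

# ppf (percent point function) is the inverse of the CDF: it maps each
# probability to the matching z-score of the standard normal distribution
print(norm.ppf(arr2))
```

```
[ 1.34075503 -0.95416525  4.00000928  0.8778963   1.88079361  0.10043372]
```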
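
Since `norm.ppf` is the inverse of `norm.cdf`, applying one after the other returns the starting value. The sketch below shows that round trip for the standard normal distribution (mean 0, standard deviation 1) and uses the CDF to compute the share of values within 1, 2, and 3 standard deviations of the mean.

```python
from scipy.stats import norm

# ppf maps a probability to a z-score; cdf maps the z-score back
z = norm.ppf(0.975)  # z-score below which 97.5% of values fall
p = norm.cdf(z)      # recovers the original probability

# Share of values within k standard deviations of the mean
within = [norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)]

print(f"z = {z:.4f}, cdf(z) = {p:.4f}")
for k, share in zip((1, 2, 3), within):
    print(f"Within {k} standard deviation(s): {share:.4f}")
```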
