### Scatter plot
* Provides a case-by-case view of data for two numerical variables. Each point represents a single case. Useful for spotting an association between the variables.

### Dot plots
* The most basic display for one-variable analysis. It is also called a one-variable scatter plot.
* Shows the exact value of each observation.
* Can get busy as the sample size increases.
![dot plot](images/dot_plot.JPG)

### Histograms
* A dot plot works for a small dataset. For a larger dataset we assign observations to bins and plot the count in each bin. This gives a view of the data density: a higher bar means more observations fall in that bin.
* Also describes the shape of the data distribution. Right skewed: long thin tail to the right. Left skewed: long thin tail to the left. Symmetric: roughly equal trailing off in both directions.
![skewness](images/skewness.JPG)
* We can identify modes using a histogram.
* A histogram can be unimodal (1 prominent peak), bimodal (2 prominent peaks) or multimodal (more than 2 prominent peaks).
![modality](images/modal.JPG)
* Bin width can alter the story a histogram tells. Wide bins lose interesting details. Narrow bins make it difficult to see the overall picture and also produce discontinuities (empty bins).
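
The effect of bin width can be sketched by binning the same sample twice with `np.histogram`. The sample values below are made up for illustration only.

```python
import numpy as np

# Hypothetical sample with two clusters, made up for illustration.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 10, 10, 11])

# Wide bins: 2 equal-width bins over the range [1, 11].
wide_counts, wide_edges = np.histogram(data, bins=2)

# Narrow bins: 10 equal-width bins over the same range.
narrow_counts, narrow_edges = np.histogram(data, bins=10)

print(wide_counts)    # the shape inside each cluster is lost
print(narrow_counts)  # more detail, but empty bins appear between the clusters
```

With 2 bins only the two cluster totals survive. With 10 bins the peak around 3 is visible, at the cost of several empty bins.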

# Measure of center
### Mean, Average, Arithmetic mean
* Sum of all observations divided by the number of observations.
* The sample mean is written $\bar{x}$. The population mean is written $\mu$.
* The mean is the balance point: the total distance of the values below the mean equals the total distance of the values above it.

### Weighted mean
* A mean that is influenced more by some observations than by others. We assign a weight to each observation based on its importance.

$$ weighted\ mean = \frac{w_1x_1 + w_2x_2 + w_3x_3 + ... + w_nx_n}{w_1 + w_2 + w_3 + ... + w_n}$$
* The simple mean is a special case of the weighted mean in which all the weights are equal (for example, every $w_i = 1$).

```
import numpy as np
np.average([1, 2, 3], weights=[3, 2, 1])  # weights has one entry per observation
```
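
To check the weighted mean formula against the library call, here is a minimal sketch. The scores and weights are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical scores and weights, chosen only for illustration.
values = np.array([80.0, 90.0, 70.0])
weights = np.array([0.5, 0.3, 0.2])

# Weighted mean written out from the formula: sum(w_i * x_i) / sum(w_i).
manual = np.sum(weights * values) / np.sum(weights)

# The same quantity computed by NumPy.
library = np.average(values, weights=weights)

# With equal weights the weighted mean reduces to the simple mean.
equal = np.average(values, weights=np.ones_like(values))

print(manual, library, equal, values.mean())
```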
### Mode
* Most frequent observation.

### Median
* Midpoint of the distribution, i.e. the 50th percentile.
* If the data are ordered from smallest to largest, the middle element is the median. If there is an even number of elements, the median is the mean of the two middle elements.
* In a left-skewed distribution the mean is lower than the median, because the low values in the tail pull the mean toward them.
* In a right-skewed distribution the mean is larger than the median, because the large values in the tail pull the mean toward them.
![skewness_mean_median](images/skewness_mean_median.JPG)
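
The pull of the tail on the mean can be checked on a small made-up sample. The mirrored sample is constructed only to get a left-skewed counterpart.

```python
import numpy as np

# Hypothetical right-skewed sample: most values are small, one is large.
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 20])

# Mirror the sample around its range to get a left-skewed version.
left_skewed = right_skewed.max() + right_skewed.min() - right_skewed

print(right_skewed.mean(), np.median(right_skewed))  # 4.75 3.0
print(left_skewed.mean(), np.median(left_skewed))    # 16.25 18.0
```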

# Measure of Spread
* Describes how far a typical observation is from the mean. The distance of an observation from the mean is called its deviation.

### Range
* Difference between the maximum and minimum values.

### Variance
* Average squared distance from the mean.
* Square each deviation and take the average. The sample variance is denoted by $s^2$. The population variance is $\sigma^2$.
* Squaring the deviations does two things: it makes large deviations count even more, and it gets rid of negative signs so that positive and negative deviations do not cancel each other out.

$$s^2 = \frac{\sum^n_{i=1}(x_i - \bar{x})^2}{n-1}$$

### Standard deviation
* Square root of the variance.
* It has the same unit as the observed data.
* Useful for checking how close the data are to the mean.
* The sample standard deviation is denoted by $s$. The population standard deviation is $\sigma$.

$$s = \sqrt{\frac{\sum^n_{i=1}(x_i - \bar{x})^2}{n-1}}$$

#### n-1 vs n in sample and population (Bessel's correction)
* The standard deviation calculated with a divisor of n−1 is a standard deviation calculated from the sample as an estimate of the standard deviation of the population from which the sample was drawn. Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation which is calculated using deviations from the sample mean underestimates the desired standard deviation of the population. Using n−1 instead of n as the divisor corrects for that by making the result a little bit bigger.
* Note that the correction has a larger proportional effect when n is small than when it is large, which is what we want because when n is larger the sample mean is likely to be a good estimator of the population mean.
* When the sample is the whole population we use the standard deviation with n as the divisor, because the sample mean is the population mean.
* When we sample from a population, the spread measured around the sample mean is lower, so the uncorrected sample standard deviation underestimates the population standard deviation.
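```
import numpy as np
import pandas as pd
x = [2, 4, 4, 4, 5, 5, 7, 9]
pd.Series(x).std()  # pandas defaults to ddof=1 (sample SD); pass ddof=0 for the population SD
np.std(x)           # NumPy defaults to ddof=0 (population SD); pass ddof=1 for the sample SD
```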
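
The variance and standard deviation formulas, and the effect of the divisor, can be checked by hand. This is a minimal sketch on made-up data, compared against NumPy's `ddof` parameter.

```python
import numpy as np

# Hypothetical data, chosen so the arithmetic is easy to follow (mean = 5).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

deviations = x - x.mean()
sum_sq = np.sum(deviations ** 2)     # sum of squared deviations = 32

sample_var = sum_sq / (n - 1)        # divisor n-1 (Bessel's correction)
pop_var = sum_sq / n                 # divisor n

sample_sd = np.sqrt(sample_var)
pop_sd = np.sqrt(pop_var)

print(sample_var, np.var(x, ddof=1))  # same value
print(pop_sd, np.std(x))              # same value, 2.0
```

The sample version is always a little larger than the population version, and the gap shrinks as n grows.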

### Interquartile range
* Range of middle 50% of data.
* Distance between first quartile and 3rd quartile.
$$IQR = Q3 - Q1$$

* The variance of the sum of two independent random variables equals the sum of their individual variances.

### Box plots
* Summarizes a dataset using 5 statistics: min, Q1, median, Q3 and max. Also plots unusual observations (outliers).
![box plot](images/box_plot.JPG)
* The more variable the data, the larger the SD and IQR.
* Whiskers attempt to capture the data outside the box, but they are never allowed to reach more than `1.5 * IQR` from it. So the upper whisker does not extend beyond `Q3 + 1.5*IQR` and the lower whisker does not extend below `Q1 - 1.5*IQR`. Observations beyond the whiskers are marked with dots. Such observations are generally treated as outliers.
![box plot and skewness](images/box_plot_skewness.JPG)
* Examining outliers helps to identify strong skew in the distribution, identify data entry or collection errors, and provide insight into interesting properties of the data.
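
The quartiles, the IQR and the whisker limits can be computed directly. This sketch uses made-up data and assumes `np.percentile` with its default linear interpolation. Other quartile conventions give slightly different numbers.

```python
import numpy as np

# Hypothetical data with one unusually large value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Limits the whiskers may not cross.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond the limits are drawn as individual dots.
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Each whisker ends at the most extreme observation inside the limits.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

print(q1, median, q3, iqr)           # 3.25 5.5 7.75 4.5
print(lower_whisker, upper_whisker)  # 1 9
print(outliers)                      # [30]
```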


### Intensity map
* Geographical data are usually shown on an intensity map.
* Colors are used to show higher and lower values of the variable.
* Usually good for checking geographic trends.
![intensity map](images/intensity_map.JPG)

### Showing categorical variable
#### Frequency table
* A table for a single variable is called a frequency table. When it shows percentages or proportions instead of counts, it is a relative frequency table.
![frequency table](images/frequency_table.JPG)

#### Bar plot
* Common way to plot single categorical variable.
![bar_plot](images/bar_plot.JPG)
* Histogram vs bar plot
    - A bar plot is for categorical data, a histogram is for numerical data.
    - The x-axis of a histogram is a number line, so the order of the bars cannot be changed. In a bar plot the order of the bars can be changed.
    
#### Pie chart
![pie chart](images/pie_chart.JPG)

#### Contingency table
* Table that summarizes data for 2 categorical variables. Each value in the table shows the number of times a particular combination of variable outcomes occurs.
![contingency_table](images/contingency_table.JPG)

#### Segmented bar plot
* Graphical display of the information in a contingency table.
![segmented bar](images/segmented_bar.JPG)
![segmented bar relative](images/segmented_bar_relative.JPG)

#### Mosaic plot
![mosaic plot](images/mosaic_plot.JPG)
* column width also represents marginal distribution.

#### Side by side box plot
* Compares a numerical variable against a categorical variable.
* Useful for comparing a numerical variable across groups.
![side by side plot](images/side_by_side.JPG)


### Robust statistics
* A statistic on which extreme observations have little effect is called a robust statistic.
* The median and IQR are robust to extreme observations.
* The mean, SD and range are not robust.
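
Robustness can be seen by replacing one value with an extreme one and recomputing the summaries. This is a minimal sketch on made-up data. The helper `summary` is hypothetical, written only for this example.

```python
import numpy as np

# Hypothetical clean sample.
clean = np.array([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0])

# Same sample with the largest value replaced by an extreme one
# (for example a data entry error).
contaminated = clean.copy()
contaminated[-1] = 180.0

def summary(x):
    """Return the non-robust and robust summaries of a sample."""
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "mean": x.mean(),
        "sd": x.std(ddof=1),
        "range": x.max() - x.min(),
        "median": np.median(x),
        "iqr": q3 - q1,
    }

before = summary(clean)
after = summary(contaminated)
for name in before:
    print(name, before[name], after[name])
```

The mean, SD and range change substantially, while the median and IQR stay the same.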

### Kernel Density Estimation
* One way to estimate a probability density function. Each observation is represented as a small lump of area, and stacking these lumps gives the final density curve. The vertical axis represents the density of the data. The probability of falling between 2 values is the area under the curve between them.
* When the sample size is small a histogram has jumps. KDE provides a smooth estimate of the overall distribution.
* We can also make inferences about the population from a finite sample using KDE.
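
The "lump per observation" idea can be written out by hand. This sketch assumes a Gaussian lump and a fixed, arbitrarily chosen bandwidth of 0.5. The function name `gaussian_kde_manual` and the sample are hypothetical. Libraries such as `scipy.stats.gaussian_kde` choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_manual(sample, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on a grid of points."""
    # One Gaussian lump per observation, evaluated at every grid point.
    z = (grid[:, None] - sample[None, :]) / bandwidth
    lumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging the lumps gives a curve whose total area is 1.
    return lumps.mean(axis=1)

# Hypothetical sample, made up for illustration.
sample = np.array([1.0, 1.5, 2.0, 4.0, 4.2, 5.0])
grid = np.linspace(-5.0, 11.0, 1601)
density = gaussian_kde_manual(sample, grid, bandwidth=0.5)

# Approximate areas with a simple Riemann sum.
dx = grid[1] - grid[0]
total_area = np.sum(density) * dx

# Probability of falling between 1 and 2 is the area under the curve there.
mask = (grid >= 1.0) & (grid <= 2.0)
prob_1_to_2 = np.sum(density[mask]) * dx

print(total_area, prob_1_to_2)
```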

### Outlier detection methods

#### SD method
* 2 SD method or 3 SD method: an observation more than 2 (or 3) SDs away from the mean is an outlier.

#### z-score
* An observation with |z-score| > 3 is an outlier.
* Not good for small samples. The SD can be inflated by a single extreme value, so less extreme outliers go unnoticed.

#### modified z-score
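* To overcome the extreme value problem, the modified z-score uses the median and the deviation from the median in place of the mean and SD.

#### Box plot
* Points below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR` are outliers.

#### Median absolute deviation
* 2 MAD method: observations outside median $\pm$ 2 MAD are outliers.
* 3 MAD method: observations outside median $\pm$ 3 MAD are outliers.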
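
The methods above can be compared on one small made-up sample. For the modified z-score this sketch assumes a common convention that is not stated in these notes: the deviation from the median is scaled by 0.6745 / MAD and flagged when its absolute value exceeds 3.5.

```python
import numpy as np

# Hypothetical small sample with one extreme value.
data = np.array([10.0, 11.0, 12.0, 12.0, 13.0, 13.0, 14.0, 15.0, 100.0])

# Classic z-score: deviations from the mean in units of the sample SD.
z = (data - data.mean()) / data.std(ddof=1)
z_outliers = data[np.abs(z) > 3]

# Median absolute deviation (MAD).
median = np.median(data)
mad = np.median(np.abs(data - median))

# 3 MAD method: flag values outside median +- 3 * MAD.
mad_outliers = data[np.abs(data - median) > 3 * mad]

# Modified z-score (assumed convention: constant 0.6745, cut-off 3.5).
modified_z = 0.6745 * (data - median) / mad
modified_outliers = data[np.abs(modified_z) > 3.5]

print(z_outliers)         # empty: the extreme value inflates the SD
print(mad_outliers)       # [100.]
print(modified_outliers)  # [100.]
```

With only 9 points the extreme value inflates the SD so much that its own z-score stays below 3, while the median-based methods still flag it.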


### Quantile plot
* Displays all the data along with quantile information.
* For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i$% of the data are less than or equal to $x_i$.
* For example, the 0.05 quantile (5th percentile) is the value that 5% of the data are less than or equal to.
![](images/quanitle_plot.PNG)

### Quantile - Quantile plot (Q-Q plot)
* Graphs the quantiles of one univariate distribution against the quantiles of another. It shows whether there is a shift from one variable to the other. If the 2 distributions are similar, the points lie along the line y = x.
* If we assume the data are normally distributed, we can draw a normal Q-Q plot to check that assumption.
* It is a scatter plot created by plotting 2 sets of quantiles against each other. If both sets of quantiles come from the same type of distribution, the points form a roughly straight line.
* Quantiles are basically your data sorted in ascending order, with various data points labelled as the point below which a certain proportion of the data fall.
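
The points of a Q-Q plot can be computed without drawing anything. This sketch uses two simulated normal samples that differ only by a shift of 2. The samples, the seed and the chosen probabilities are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two simulated samples of different sizes: same shape, shifted by 2.
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=2.0, scale=1.0, size=300)

# Quantiles of both samples at the same probabilities.
probs = np.linspace(0.05, 0.95, 19)
qa = np.quantile(a, probs)
qb = np.quantile(b, probs)

# The pairs (qa, qb) are the points of the Q-Q plot.
# A straight-line fit summarizes how they line up.
slope, intercept = np.polyfit(qa, qb, 1)

print(slope, intercept)
```

A slope near 1 with an intercept near 2 says the two samples have the same shape and spread but are shifted, so the points run parallel to the line y = x rather than on it. For a normal Q-Q plot of a single sample, `scipy.stats.probplot` is one existing option.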

### Correlation
* Quantifies the strength of the relationship between 2 variables. A challenge is that the variables are often not expressed in the same units, so the values are transformed first. There are two common options:
    - Transform each value to a standard score (number of SDs away from the mean). This leads to the Pearson product-moment correlation coefficient, which suits normally distributed data.
    - Transform each value to its rank (its index in the sorted list of values). This leads to the Spearman rank correlation coefficient, which suits non-normal distributions.

### Covariance
* Measure of the tendency of 2 variables to vary together.
* Take 2 variables X and Y.
* $dx_i = x_i - \bar{x}$
* $dy_i = y_i - \bar{y}$. These are the deviations from the mean. If x and y vary together, their deviations tend to have the same sign.

$$COV(x,y) = \frac{1}{n}\sum dx_i\, dy_i$$
* Dot product of the deviations divided by their length.
* So it is maximized when the two deviation vectors are identical, 0 if they are orthogonal, and negative if they point in opposite directions.

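```
import numpy as np

# Covariance from the definition (xs, ys are NumPy arrays of equal length):
# cov = np.dot(xs - mean_x, ys - mean_y) / len(xs)

# np.cov(xs, ys) returns the covariance matrix (it divides by n-1 unless bias=True)
```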

### Pearson correlation coefficient
* The unit of covariance is the product of the units of x and y, which makes it hard to interpret. Dividing each deviation by the standard deviation gives standard scores, and the result is unitless:

$$ pearson\ correlation = \frac{COV(x,y)}{S_x S_y}$$
* It lies between -1 and 1 and indicates the strength and sign of the association between the 2 variables. Its sign is the sign of the association. Closer to 1 or -1 means a stronger linear association.

![](images/pearson_correlation.PNG)

* In the second row, we can see that the correlation does not depend on the slope of the relationship.

```
from scipy import stats
corr, p = stats.pearsonr(df['colA'], df['colB'])  # df is a pandas DataFrame
```

* Pearson correlation is a normalization of covariance.

* `np.cov(x, y)` and `np.corrcoef(x, y)` return the covariance and Pearson correlation matrices respectively.
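
The covariance and Pearson formulas can be checked against the NumPy functions. This is a minimal sketch on made-up data. Note the assumption about divisors: the formula in these notes divides by n, while `np.cov` divides by n-1 unless `bias=True` is passed.

```python
import numpy as np

# Hypothetical paired observations.
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the mean.
dx = xs - xs.mean()
dy = ys - ys.mean()

# Covariance: dot product of the deviations divided by n.
cov = np.dot(dx, dy) / len(xs)

# Pearson correlation: covariance divided by the product of the SDs.
# np.std defaults to divisor n, which matches the divisor used for cov.
pearson = cov / (xs.std() * ys.std())

# Library versions; the off-diagonal entry holds the x-y value.
cov_matrix = np.cov(xs, ys, bias=True)
corr_matrix = np.corrcoef(xs, ys)

print(cov, cov_matrix[0, 1])
print(pearson, corr_matrix[0, 1])
```

The correlation does not depend on the choice of divisor, because the same divisor appears in the covariance and in both standard deviations.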

#### Spearman's Rank correlation
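* Pearson works well when the relationship is linear and the variables are roughly normal. It is not robust to outliers.
* Spearman's correlation computes the rank of each value and then computes the Pearson correlation of the ranks.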
```
xs.corr(ys, method = 'spearman') # xs, ys is pandas series
```

* If the relation is not linear, Pearson underestimates the strength of the relation (row 3 in the above image). Pearson is also affected if one of the distributions is not normal or has outliers.

* To remove the effect of skewness we can compute the Pearson correlation of the logarithms of x and y (appropriate for positive, roughly lognormal values).
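
The difference between Pearson and Spearman on a monotonic but nonlinear relation can be sketched with NumPy alone. The data are made up, and the helper `ranks` is hypothetical and assumes there are no ties.

```python
import numpy as np

# Hypothetical data: y grows exponentially with x (monotonic, not linear).
xs = np.arange(1.0, 11.0)
ys = np.exp(xs)

def ranks(values):
    """Rank of each value, 1 for the smallest. Assumes no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

# Pearson on the raw values underestimates the strength of the relation.
pearson = np.corrcoef(xs, ys)[0, 1]

# Spearman: Pearson correlation of the ranks.
spearman = np.corrcoef(ranks(xs), ranks(ys))[0, 1]

# Pearson after a log transform of the skewed variable.
pearson_log = np.corrcoef(xs, np.log(ys))[0, 1]

print(pearson, spearman, pearson_log)
```

Spearman and the log-transformed Pearson both report a perfect relation, while the raw Pearson value is noticeably lower.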

---------

* Saturation is a loss of information that occurs when multiple points are plotted on top of each other.

### Estimator
#### Maximum likelihood estimator
* Estimates the underlying probability from the given data.
* 1 is heads, 0 is tails. For the outcome 100101, P(H) = 3/6 = 0.5
* 11011: P(H) = 4/5 = 0.8
* 00000000: P(H) = 0

* For outcomes $x_1, x_2, x_3, ..., x_n$ the maximum likelihood estimator is $\frac{1}{n}\sum x_i$

* We take the sum of the outcomes and normalize by the total number of outcomes.


* Given data is 1,6,6,3,2,6,5,4,6,2

* P(1) = 1/10, P(2) = 2/10, P(3) = 1/10, P(4) = 1/10, P(5) = 1/10, P(6) = 4/10

* Suppose we have data from a single coin flip. The MLE will always conclude the coin is loaded and assign probability 1 to the observed outcome. A solution is to add fake data:
    - Data is 1 => P(H) = 1
    - Add fake data: 1,[1,0] gives P(H) = 2/3 = 0.667
    - The fake data pull the estimate of P(H) toward 0.5
    - The estimate after adding fake data is called the Laplacian estimator. Use it when only a few samples are available.
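
Both estimators for the coin example can be written in a few lines. This is a minimal sketch. The function names and the parameter `k` (number of fake observations added per outcome) are hypothetical, chosen for this example.

```python
def mle(outcomes):
    """Maximum likelihood estimate of P(H): the fraction of 1s."""
    return sum(outcomes) / len(outcomes)

def laplace(outcomes, k=1):
    """Estimate of P(H) after adding k fake heads and k fake tails."""
    return (sum(outcomes) + k) / (len(outcomes) + 2 * k)

print(mle([1, 0, 0, 1, 0, 1]))  # 0.5
print(mle([1, 1, 0, 1, 1]))     # 0.8
print(mle([1]))                 # 1.0, a single flip looks like a loaded coin
print(laplace([1]))             # 0.666..., pulled toward 0.5 by the fake data
```

As the number of real observations grows, the two estimates get closer, because the fixed amount of fake data matters less and less.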