In [4]:
import pandas as pd

# 1. Structured Data

Two types of structured data: **Numeric** and **Categorical**

1. **Numeric** - `Continuous` (wind speed) & `Discrete` (count of occurrences of an event)

2. **Categorical** - Takes only a fixed set of values. Special cases: `Binary` (exactly two categories) & `Ordinal` (categories with an explicit ordering)


Scikit-learn supports ordinal data with **`sklearn.preprocessing.OrdinalEncoder`**
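
As a rough illustration of ordinal encoding, the sketch below maps a made-up "size" feature to integers. The feature values and their order are assumptions for the example; only `OrdinalEncoder` comes from the note above.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: sizes with a natural order.
sizes = [["small"], ["large"], ["medium"], ["small"]]

# Passing `categories` fixes the order explicitly; without it the
# encoder sorts the categories alphabetically.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel().tolist())  # [0.0, 2.0, 1.0, 0.0]
```

The integers only encode rank; the gaps between them carry no meaning.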

# 2. Rectangular Data

General term for a 2D matrix with <u>rows</u> indicating *records/entries* and <u>columns</u> indicating *features/variables*.

Like a spreadsheet or database table
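
A minimal sketch of rectangular data as a pandas `DataFrame`. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical records: each row is one record, each column one feature.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Pune"],   # categorical
        "wind_speed": [12.5, 7.1, 3.4],     # numeric, continuous
        "storm_count": [3, 0, 1],           # numeric, discrete
    }
)

n_records, n_features = df.shape
print(n_records, n_features)  # 3 3
```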


### Nonrectangular Data

* Time Series - records successive measurements of the same variable
* Spatial data - used in mapping and location analytics
* Graphs (or networks) - used to represent physical, social and abstract relationships

# 3. Estimation of Location

1. **Mean** - Sum of all values divided by the number of values.
2. **Weighted Mean** - The sum of all values times a weight divided by the sum of the weights.
3. **Median** - The value such that one-half of the data lies above and below.
4. **Percentile** - The value such that *P* percent of the data lies below
5. **Weighted Median** -  The value such that one-half of the sum of the weights lies above and below the sorted data.
6. **Trimmed Mean** -  The average of all values after dropping a fixed number of extreme values.
7. **Robust** - Not sensitive to extreme values

### Trimmed Mean

Calculated by dropping a fixed number of sorted values at each end and then averaging the remaining values.


Formula to compute the trimmed mean with *p* smallest and largest values omitted is:

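$$\text{Trimmed mean} = \bar{x} = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n - 2p}$$

where $x_{(i)}$ is the *i*-th smallest value and *n* is the number of values.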

**A trimmed mean eliminates the influence of extreme values.**
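
The trimmed mean described above can be sketched in plain Python. The helper name `trimmed_mean` and the sample data are assumptions for illustration, not part of any library.

```python
def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values."""
    n = len(values)
    if p < 0 or 2 * p >= n:
        raise ValueError("p must satisfy 0 <= p < n / 2")
    kept = sorted(values)[p:n - p]
    return sum(kept) / (n - 2 * p)


data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 1000]  # made-up data with one outlier
plain_mean = sum(data) / len(data)
trimmed = trimmed_mean(data, p=1)

print(plain_mean)  # 105.5
print(trimmed)     # 6.625
```

Dropping one value from each end is enough here to remove the pull of the single outlier.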

### Weighted Mean

Calculated by multiplying each data value *x_i* by a user-specified weight *w_i* and dividing their sum by the sum of the weights.

The formula for a weighted mean is:

$$\text{Weighted mean} = \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

There are two main motivations for using a weighted mean:

-  Some values are intrinsically more variable than others, and highly variable observations are given a lower weight.
- The data collected does not equally represent the different groups that we are interested in measuring. We give a higher weight to the values from the groups that were underrepresented.
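
A small sketch of the weighted mean, assuming made-up sensor readings where the less reliable reading gets a lower weight. The function name `weighted_mean` is hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Made-up example: the third reading is noisier, so it is down-weighted.
readings = [10.0, 12.0, 20.0]
weights = [2.0, 2.0, 1.0]
result = weighted_mean(readings, weights)

print(result)  # 12.8 (the unweighted mean would be 14.0)
```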

### Median (Robust estimation of location)

- The median is the middle number on a sorted list of the data.

- Instead of the middle number, the weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list.

**The median depends only on the values in the center of the sorted data.** 

**The median and weighted median are robust to outliers.**
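
One possible way to compute a weighted median is sketched below. It uses a simple convention (the smallest value whose cumulative weight reaches half the total weight); other conventions exist, and the function name and data are assumptions.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must not be empty")


values = [1, 2, 3, 4, 100]
equal = weighted_median(values, [1, 1, 1, 1, 1])
skewed = weighted_median(values, [1, 1, 1, 1, 6])

print(equal)   # 3, the ordinary median
print(skewed)  # 100, because that value carries more than half the weight
```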

`df['col'].mean()` and `df['col'].median()` - *Mean & Median*

`scipy.stats.trim_mean(df['col'], 0.1)` - *Trimmed mean (drops 10% from each end)*

# 4. Estimates of Variability

*Variability*, also referred to as dispersion, measures whether the data values are tightly clustered or spread out. 

1. **Deviations** - The difference between observed values and estimates of location.
2. **Variance** - The sum of squared deviations from the mean divided by n-1 (n - number of data values)
3. **Mean absolute deviation** - The mean of the absolute values of the deviations from the mean
4. **Percentile/ Quantile** - The value such that P % of the values take on this value or less and (100 - P) % take on this value or more.
5. **IQR** - Difference between the 75th and 25th percentiles.

**Why is the average deviation from the mean not a good measure of variability?**

The negative deviations offset the positive ones, and as a result the sum of the deviations from the mean is exactly zero.

Instead, take the average of the absolute values of the deviations from the mean.
$$\text{Mean absolute deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
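
A quick numeric check of the cancellation argument, using made-up values: the raw deviations sum to zero, while their absolute values give a usable measure of spread.

```python
from statistics import mean

data = [1, 4, 4, 7, 9]  # made-up values
x_bar = mean(data)                      # 5
deviations = [x - x_bar for x in data]  # [-4, -1, -1, 2, 4]

sum_of_deviations = sum(deviations)     # positives and negatives cancel
mean_abs_dev = mean(abs(d) for d in deviations)

print(sum_of_deviations)  # 0
print(mean_abs_dev)       # 2.4
```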

The best-known estimates of variability are the *variance* and the *standard deviation (S.D)*, the square root of the variance.

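$$\text{Variance} = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

$$\text{Standard deviation} = s = \sqrt{\text{Variance}}$$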
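
The sample variance and standard deviation (with the n - 1 divisor) can be computed by hand and cross-checked against Python's `statistics` module. The data is made up for illustration.

```python
from math import isclose, sqrt
from statistics import stdev, variance

data = [1, 4, 4, 7, 9]  # made-up values
n = len(data)
x_bar = sum(data) / n

# Sample variance: squared deviations summed and divided by n - 1.
manual_variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)
manual_sd = sqrt(manual_variance)

print(manual_variance)  # 9.5
print(isclose(manual_variance, variance(data)))  # True
print(isclose(manual_sd, stdev(data)))           # True
```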

#### <u>Robust to outliers</u>

Variance, S.D and mean absolute deviation are not robust to outliers and extreme values.

*Median absolute deviation from the median* is robust to outliers.
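$$\text{Median absolute deviation} = \text{Median}\left(|x_1 - m|, |x_2 - m|, \ldots, |x_n - m|\right)$$

where $m$ is the median.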
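
To see the robustness claim in action, the sketch below compares the standard deviation with the median absolute deviation on made-up data, before and after replacing one value with an outlier. The helper `median_abs_deviation` is written here for illustration.

```python
from statistics import median, stdev


def median_abs_deviation(values):
    """Median of the absolute deviations from the median."""
    m = median(values)
    return median(abs(x - m) for x in values)


clean = [1, 4, 4, 7, 9]
with_outlier = [1, 4, 4, 7, 900]  # last value replaced by an outlier

mad_clean = median_abs_deviation(clean)
mad_outlier = median_abs_deviation(with_outlier)

print(mad_clean, mad_outlier)             # 3 3
print(stdev(clean), stdev(with_outlier))  # the S.D grows by orders of magnitude
```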

**For roughly normal data: S.D > Mean absolute deviation > Median absolute deviation**

# 5. Estimates based on Percentiles

In a data set, the *Pth* percentile is a value such that at least *P%* of the values take on this value or less and at least *(100 – P)%* of the values
take on this value or more. 

A common measurement of variability is the difference between the **25th percentile** and the **75th percentile**, called the ***interquartile range*** (or IQR).

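For the sorted data {1,2,3,3,5,6,7,9}, the 25th percentile is 2.5 and the 75th percentile is 6.5, so the IQR is 6.5 - 2.5 = 4.

When picking the middle value of a sorted list (e.g. the median):

- for an even number of values, take the average of the two middle values
- for an odd number of values, take the middle value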

Use `numpy.quantile` or `df[col].quantile(0.75)` to compute quantiles in Python.
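
The quartiles quoted for {1,2,3,3,5,6,7,9} can be reproduced with a simple convention: take the medians of the lower and upper halves of the sorted data. This is only one convention; library functions such as `numpy.quantile` apply interpolation rules and may return slightly different values.

```python
from statistics import median

data = sorted([1, 2, 3, 3, 5, 6, 7, 9])
half = len(data) // 2

# For an even-length list, use the medians of the two halves.
q1 = median(data[:half])   # median of [1, 2, 3, 3]
q3 = median(data[half:])   # median of [5, 6, 7, 9]
iqr = q3 - q1

print(q1, q3, iqr)  # 2.5 6.5 4.0
```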

In [10]:
state = pd.read_csv('.../dataset/state.csv')
state

FileNotFoundError: [Errno 2] No such file or directory: '.../dataset/state.csv'

In [7]:
import os
os.listdir()

['.ipynb_checkpoints', 'Data, Estimation of Data.ipynb']