# Exploratory Data Analysis

## Statistics 

Statistics is a branch of applied mathematics concerned with collecting, describing, and analyzing quantitative data and drawing inferences from it.

## Key terms for data types

- **Numeric**: Data expressed on a numerical scale.
    - **Continuous**: Data that can take on any value in an interval (interval, float, numeric). E.g. time duration or wind speed.
    - **Discrete**: Data that can take on only integer values, such as counts (integer, count). E.g. the number of occurrences of an event.
- **Categorical**: Data that can take on only a specific set of values representing possible categories (enums, enumerated, factors, nominal, polychotomous). E.g. the type of something or a state name.
    - **Binary**: A special case of categorical data with just two categories of values (dichotomous, logical, indicator, boolean). E.g. 0 or 1, true or false.
    - **Ordinal**: Categorical data that has an explicit ordering (ordered factor). E.g. a numerical rating (1, 2 or 3).


Knowing the data type helps determine which visual display, data analysis, or statistical model is appropriate.
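As a rough illustration of how these types can be represented in pandas, the sketch below builds a small table with one column per data type. The column names and values are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical weather observations, one column per data type described above.
obs = pd.DataFrame({
    "wind_speed": [12.4, 7.9, 15.2],        # continuous
    "storm_count": [0, 2, 1],               # discrete
    "state": ["Ohio", "Texas", "Ohio"],     # categorical
    "is_raining": [False, True, True],      # binary
    "severity": ["low", "high", "medium"],  # ordinal (ordering declared below)
})

# Unordered categories: only equality comparisons are meaningful.
obs["state"] = obs["state"].astype("category")

# Ordered categories: the listed order defines low < medium < high.
obs["severity"] = pd.Categorical(
    obs["severity"], categories=["low", "medium", "high"], ordered=True
)

print(obs.dtypes)
print(obs["severity"].max())  # the highest level present in the column
```

Declaring the ordering explicitly is what lets pandas sort or compare ordinal values by rank instead of alphabetically.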

## Rectangular data

Rectangular data is a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables).
Unstructured data must be processed and manipulated so that it can be represented as a set of features in rectangular form.

- **Data frame**: Rectangular data (like a spreadsheet).
- **Feature**: A column within a table (attribute, input, predictor, variable).
- **Outcome**: The variable that the features are used to predict (dependent variable, response, target, output).
- **Record**: A row within a table (case, example, instance, observation, pattern, sample).
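A minimal sketch of turning nested records into rectangular data is shown below. The `raw` records and their field names are hypothetical. Each input record becomes one row, and the `churned` column plays the role of the outcome.

```python
import pandas as pd

# Hypothetical nested records, e.g. as parsed from JSON.
raw = [
    {"id": 1, "user": {"age": 34, "plan": "free"}, "churned": False},
    {"id": 2, "user": {"age": 51, "plan": "paid"}, "churned": True},
    {"id": 3, "user": {"age": 27, "plan": "paid"}, "churned": False},
]

# Flatten each record into one row with one value per feature.
rows = [
    {"age": r["user"]["age"], "plan": r["user"]["plan"], "churned": r["churned"]}
    for r in raw
]
table = pd.DataFrame(rows)

features = table[["age", "plan"]]  # columns used as predictors
outcome = table["churned"]         # column to be predicted

print(table.shape)  # (number of records, number of columns)
print(table)
```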

## Nonrectangular Data Structures

Graph data structures are used to represent physical, social, and abstract relationships. They are useful for certain types of problems, such as network optimization.
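To make the contrast with rectangular data concrete, here is a small sketch of a graph stored as an adjacency list, with a breadth-first search that finds a shortest path between two nodes. The network and the `shortest_path` helper are hypothetical, written only for illustration.

```python
from collections import deque

# Hypothetical undirected network stored as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```

The records here are relationships between nodes rather than rows of features, which is why this structure does not fit naturally into a data frame.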


## Estimates of Location

- **Mean**
- **Weighted mean**
- **Median**
- **Percentile**
- **Weighted median**
- **Trimmed mean**
- **Robust** 
- **Outlier** 

### Mean or Average value: 

The mean is the sum of all values divided by the number of values.

$$\bar x =  \frac{\sum _{i=1}^{n} x_i}{n} $$


### Trimmed mean: 

The trimmed mean drops a fixed number $p$ of sorted values at each end and averages the remaining values, where $x_{(i)}$ denotes the $i$-th smallest value. This reduces the influence of extreme values.

$$ \bar x =  \frac{\sum _{i=p+1}^{n-p} x_{(i)}}{n-2p} $$
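The formula can be sketched directly in plain Python. The `trimmed_mean` helper below is a hypothetical name for this example, and the data are the ten `points` values used in this notebook's DataFrame.

```python
def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must satisfy 0 <= 2p < n")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)


points = [25, 12, 15, 14, 19, 23, 25, 30, 1, 20]

print(sum(points) / len(points))  # ordinary mean: 18.4
print(trimmed_mean(points, 1))    # drops 1 and 30: 19.125
```

Note that `scipy.stats.trim_mean` takes the proportion to cut from each end rather than a count, so `p = 1` out of ten values corresponds to a proportion of 0.1.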

### Weighted mean

Multiply each data value $x_i$ by a user-specified weight $w_i$ and divide the sum of these products by the sum of the weights.

It can be used when some values are intrinsically more variable than others: highly variable (and therefore less accurate) observations are given a lower weight.
It can also be used when the collected data does not equally represent the different groups that we are interested in measuring. Values from underrepresented groups can be given a higher weight.

$$\bar x_{w} =  \frac{\sum _{i=1}^{n} w_{i}x_{i}}{\sum _{i=1}^{n} w_{i}} $$
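A minimal sketch of the weighted mean follows. The readings and weights are hypothetical: the third reading is assumed to come from a less accurate source and therefore receives a lower weight.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


readings = [10.0, 12.0, 20.0]
weights = [2, 2, 1]  # hypothetical: the last reading is trusted less

print(sum(readings) / len(readings))     # unweighted mean: 14.0
print(weighted_mean(readings, weights))  # (20 + 24 + 20) / 5 = 12.8
```

With equal weights the result reduces to the ordinary mean.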

### Median

The median is the middle number in a sorted list, or the average of the two middle values when the list has an even number of values.

It is referred to as a robust estimate of location because it is not influenced by outliers (extreme cases).
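The robustness of the median can be seen in a small sketch using the standard library. The `incomes` values are hypothetical. A single extreme value is appended and both estimates are recomputed.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]  # hypothetical values
with_outlier = incomes + [1000]  # one extreme case added

print(mean(incomes), median(incomes))            # 35 35
print(mean(with_outlier), median(with_outlier))  # mean jumps, median barely moves
```

The outlier pulls the mean from 35 to roughly 196, while the median only shifts from 35 to 36.5.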

In [44]:
# Install the third-party packages before importing them
!pip install wquantiles
!pip install statsmodels
import numpy as np
import pandas as pd
import wquantiles
from scipy import stats
from statsmodels import robust



In [2]:
#define DataFrame
df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 30, 1, 20],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4, 5, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12, 7, 10]})
df


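|   | points | assists | rebounds |
|---|---|---|---|
| 0 | 25 | 5 | 11 |
| 1 | 12 | 7 | 8 |
| 2 | 15 | 7 | 10 |
| 3 | 14 | 9 | 6 |
| 4 | 19 | 12 | 6 |
| 5 | 23 | 9 | 5 |
| 6 | 25 | 9 | 9 |
| 7 | 30 | 4 | 12 |
| 8 | 1 | 5 | 7 |
| 9 | 20 | 4 | 10 |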


In [3]:
points = df["points"]
mean_ = points.mean()
# Trim 10% of the sorted values from each end before averaging
trim = stats.trim_mean(points, 0.1)
median_ = points.median()
# Use the assists column as the weights
weighted_mean = np.average(points, weights=df["assists"])
weighted_median = wquantiles.median(points, weights=df["assists"])
print(f"mean is {mean_}")
print(f"trimmed mean is {trim}")
print(f"median is {median_}")
print(f"weighted mean is {weighted_mean}")
print(f"weighted median is {weighted_median}")

mean is 18.4
trimmed mean is 19.125
median is 19.5
weighted mean is 18.380281690140844
weighted median is 19.1875


## Estimates of Variability

- **Deviations** (errors, residuals)
- **Variance** (mean-squared-error)
- **Standard Deviation** (square root of the variance)
- **Mean absolute deviation** (l1-norm, Manhattan norm)
- **Median absolute deviation from the median** 
- **Range**
- **Order statistics** (ranks)
- **Percentile** (quantile)
- **Interquartile range** (IQR)

### Standard deviation

The standard deviation and related estimates are based on deviations: the differences between the observed data and the estimate of location. They tell us how dispersed the data is around the central value.

In [55]:
list_ = pd.DataFrame({"col1": [1, 25, 3, 5, 7]})
mean_ = list_["col1"].mean()
median_ = list_["col1"].median()
print(f"The mean is {mean_} and the median is {median_}")

The mean is 8.2 and the median is 5.0


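The deviations from the mean are the differences between each observed value and the mean:


In [56]:
# Subtract the mean (not the median) from every value
list_["deviations"] = list_["col1"] - mean_
list_

|   | col1 | deviations |
|---|---|---|
| 0 | 1 | -7.2 |
| 1 | 25 | 16.8 |
| 2 | 3 | -5.2 |
| 3 | 5 | -3.2 |
| 4 | 7 | -1.2 |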


The deviations from the mean always sum to zero, because the negative and positive deviations cancel out, so averaging them directly says nothing about dispersion.

The mean absolute deviation instead averages the absolute values of the deviations from the mean:

$$\text{Mean absolute deviation} = \frac{\sum _{i=1}^{n} \left | x_{i}-\bar{x} \right |}{n} $$
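Both statements can be checked by hand. The sketch below uses the same five values as `col1` and plain Python, so the variable names are local to this example.

```python
values = [1, 25, 3, 5, 7]
mean = sum(values) / len(values)  # 8.2

# Signed deviations from the mean cancel out (up to floating-point rounding).
deviations = [x - mean for x in values]
print(abs(sum(deviations)) < 1e-9)  # True

# Averaging the absolute deviations gives a usable measure of spread.
mean_abs_dev = sum(abs(d) for d in deviations) / len(values)
print(round(mean_abs_dev, 2))  # 6.72
```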

### Variance and standard deviation

$$ s^{2} =  \frac{\sum _{i=1}^{n}  (x_{i}-\bar{x})^{2} }{n-1}  $$

$$s = \sqrt{s^{2}}$$

Both are sensitive to outliers because they are based on squared deviations. In this example both are inflated by the value 25 at index 1 of `col1`.
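The effect of the outlier can be made visible by computing the sample variance with and without it. The `sample_variance` helper is a hypothetical name for this sketch; it follows the formula above with the $n-1$ denominator.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


with_outlier = [1, 25, 3, 5, 7]  # same values as col1
without_outlier = [1, 3, 5, 7]   # the value 25 removed

for data in (with_outlier, without_outlier):
    s2 = sample_variance(data)
    s = s2 ** 0.5  # the standard deviation is the square root of the variance
    print(data, round(s2, 2), round(s, 2))
```

Removing the single value 25 drops the variance from 93.2 to about 6.67 and the standard deviation from about 9.65 to about 2.58.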

### Median absolute deviation 

The median absolute deviation from the median (MAD) is a robust estimate of variability:

$$MAD = \operatorname{Median}(\left | x_{1} -m \right |, \left | x_{2} -m \right |,...,\left | x_{n} -m \right |)$$ 
where $m$ is the median. 
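A sketch of the MAD computed by hand follows, using the same five values as `col1`. The helper name is hypothetical. The second result rescales the raw MAD by the 0.75 quantile of the standard normal distribution (about 0.6745), which is the default normalization that statsmodels' `robust.scale.mad` applies so that the value is comparable to a standard deviation for normally distributed data.

```python
from statistics import NormalDist, median


def median_abs_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    m = median(values)
    return median([abs(x - m) for x in values])


values = [1, 25, 3, 5, 7]

raw_mad = median_abs_deviation(values)
print(raw_mad)  # absolute deviations are 4, 20, 2, 0, 2, so the median is 2

# Rescaled version, dividing by the normal 0.75 quantile (about 0.6745).
scaled_mad = raw_mad / NormalDist().inv_cdf(0.75)
print(round(scaled_mad, 4))  # 2.9652
```

This explains why the notebook output for the MAD is about 2.965 rather than 2.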

### Percentiles 

The simplest measure of spread in sorted data is the range, the difference between the largest and smallest values, but it is extremely sensitive to outliers. To avoid this sensitivity, we can look at the range of the data after dropping values from each end. The Pth percentile is a value such that at least P percent of the values take on this value or less and at least (100-P) percent of the values take on this value or more.

The interquartile range (IQR) is a common measure of variability: the difference between the 75th percentile and the 25th percentile.
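The sketch below compares the range with the IQR on the same five values as `col1`, using only the standard library. It assumes the `"inclusive"` method of `statistics.quantiles`, which interpolates linearly between sorted values and, for this data, gives the same quartiles as the pandas `quantile` call used in this notebook. Other interpolation conventions can give slightly different percentiles.

```python
from statistics import quantiles

values = [1, 25, 3, 5, 7]

# The range uses only the two extremes, so one outlier dominates it.
data_range = max(values) - min(values)

# Quartiles: 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(data_range)   # 24
print(q1, q2, q3)   # 3.0 5.0 7.0
print(iqr)          # 4.0
```

The value 25 inflates the range to 24 but has no effect on the IQR, which only depends on the middle half of the data.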

In [59]:
col1 = list_["col1"]
# Series.mad() was removed in pandas 2.0; this is the same computation
mean_abs_dev = (col1 - col1.mean()).abs().mean()
sd = col1.std()
var_ = col1.var()
# statsmodels rescales the MAD (divides by about 0.6745) by default
mad = robust.scale.mad(col1)
print(f"Mean absolute deviation {mean_abs_dev}\nStandard deviation {sd}\nVariance {var_}\nMedian absolute deviation {mad}")
iqr = col1.quantile(0.75) - col1.quantile(0.25)
print(f"the IQR is {iqr}")

Mean absolute deviation 6.719999999999999
Standard deviation 9.654014708917737
Variance 93.2
Median absolute deviation 2.965204437011204
the IQR is 4.0


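The standard deviation is never smaller than the mean absolute deviation. In this example the median absolute deviation is smaller still, which is typical for data with outliers, although it is not guaranteed for every data set (for the values -10, 0, 10 the unscaled MAD is 10 while the mean absolute deviation is about 6.67).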