# Exploratory Data Analysis

## Statistics 

Statistics is a branch of applied mathematics concerned with collecting, describing, and analyzing quantitative data, and with drawing inferences from it.

## Key terms for data types

- **Numeric**: Data expressed on a numerical scale.
    - **Continuous**: Data that can take on any value in an interval (interval, float, numeric). E.g. time duration or wind speed.
    - **Discrete**: Data that can take on only integer values, such as counts (integer, count). E.g. the number of occurrences of an event.
- **Categorical**: Data that can take on only a specific set of values representing possible categories (enums, enumerated, factors, nominal, polychotomous). E.g. the type of something or a state name.
    - **Binary**: A special case of categorical data with just two categories (dichotomous, logical, indicator, boolean). E.g. 0 or 1, true or false.
    - **Ordinal**: Categorical data that has an explicit ordering (ordered factor). E.g. a numerical rating (1, 2 or 3).


The data type helps determine the appropriate type of visual display, data analysis, or statistical model.
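The taxonomy above can be sketched in pandas. This is a minimal illustration, assuming a made-up table (`df_types` and its column names are hypothetical) and using pandas' categorical dtype to represent nominal and ordinal data:

```python
import pandas as pd

# Hypothetical toy data: one column per data type described above.
df_types = pd.DataFrame({
    "wind_speed": [3.2, 7.8, 5.1, 6.4],          # continuous
    "event_count": [0, 2, 1, 4],                 # discrete
    "state": ["Ohio", "Texas", "Ohio", "Utah"],  # categorical (nominal)
    "is_weekend": [True, False, False, True],    # binary
    "rating": [1, 3, 2, 3],                      # ordinal, stored as integers for now
})

# Nominal categories have no order.
df_types["state"] = df_types["state"].astype("category")

# Ordinal categories carry an explicit order, so min/max and comparisons are meaningful.
rating_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
df_types["rating"] = df_types["rating"].astype(rating_type)

print(df_types.dtypes)
print(df_types["rating"].max())
```

Declaring the type explicitly is one way to keep later steps honest: for example, an unordered categorical column will refuse a `max()` call, while the ordered one allows it.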

## Rectangular data

Rectangular data is a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables).
Unstructured data must be processed and manipulated so that it can be represented as a set of features in rectangular form.

- **Data Frame**: Rectangular data (like a spreadsheet).
- **Feature**: A column within a table (attribute, input, predictor, variable).
- **Outcome**: The variable to be predicted from the features (dependent variable, response, target, output).
- **Record**: A row within a table (case, example, instance, observation, pattern, sample).
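As a rough sketch of turning non-rectangular input into a data frame, the example below flattens nested records with `pd.json_normalize` and then separates features from the outcome. The records, the field names, and the choice of `purchased` as the outcome are all assumptions made for illustration:

```python
import pandas as pd

# Hypothetical semi-structured records, e.g. parsed from a JSON log.
raw_records = [
    {"id": 1, "user": {"age": 34, "state": "Ohio"}, "purchased": True},
    {"id": 2, "user": {"age": 27, "state": "Texas"}, "purchased": False},
    {"id": 3, "user": {"age": 45, "state": "Utah"}, "purchased": True},
]

# Flatten nested fields into columns: one row per record, one column per variable.
table = pd.json_normalize(raw_records).set_index("id")

features = table[["user.age", "user.state"]]  # predictors
outcome = table["purchased"]                  # target

print(table.shape)
print(features)
```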

## Nonrectangular Data Structures

Graph data structures are used to represent physical, social, and abstract relationships. They are useful for certain types of problems, such as network optimization.
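One common way to hold a graph in memory is an adjacency list. The sketch below assumes a small made-up undirected network and uses a breadth-first search to count the edges on a shortest path; `network` and `shortest_path_length` are illustrative names, not part of any library:

```python
from collections import deque

# Hypothetical undirected network stored as an adjacency list.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path_length(graph, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None

print(shortest_path_length(network, "A", "E"))  # 3 (for example A -> B -> D -> E)
```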


## Estimates of Location

- **Mean**
- **Weighted mean**
- **Median**
- **Percentile**
- **Weighted median**
- **Trimmed mean**
- **Robust** 
- **Outlier** 

### Mean (average)

The mean is the sum of all values divided by the number of values, $n$.
$$\bar x =  \frac{\sum _{i=1}^{n} x_i}{n} $$
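
As a quick worked instance of the formula, assuming four illustrative values $3, 5, 7, 9$ (not taken from any dataset):
$$\bar x = \frac{3 + 5 + 7 + 9}{4} = \frac{24}{4} = 6 $$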


### Trimmed mean

Sort the values, drop a fixed number $p$ of them at each end, and average the remaining $n - 2p$ values. Here $x_{(i)}$ denotes the $i$-th smallest value. Trimming reduces the influence of extreme values.
$$\bar x =  \frac{\sum _{i=p+1}^{n-p} x_{(i)}}{n-2p} $$
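
The formula can be checked with a few lines of NumPy. This is a sketch under stated assumptions: `trimmed_mean` is a hypothetical helper that takes the count $p$ directly, whereas `scipy.stats.trim_mean` takes the proportion to cut from each end.

```python
import numpy as np
from scipy import stats

def trimmed_mean(values, p):
    """Average the sorted values after dropping the p smallest and p largest."""
    ordered = np.sort(np.asarray(values, dtype=float))
    if 2 * p >= ordered.size:
        raise ValueError("p is too large for the number of values")
    return ordered[p:ordered.size - p].mean()

sample = [1, 12, 14, 15, 19, 20, 23, 25, 25, 30]  # illustrative values

print(trimmed_mean(sample, 1))       # 19.125
print(stats.trim_mean(sample, 0.1))  # 10% of 10 values is 1 value per end
```

With ten values, cutting a proportion of 0.1 from each end corresponds to $p = 1$, so both calls drop the values 1 and 30.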

### Weighted mean

Multiply each data value $x_i$ by a user-specified weight $w_i$, sum the products, and divide by the sum of the weights.

It is useful when some values are intrinsically more variable than others: a highly variable (and therefore less accurate) observation is given a lower weight.
It is also useful when the collected data does not represent equally the groups we are interested in measuring: values from underrepresented groups can be given a higher weight.

$$\bar x_{w} =  \frac{\sum _{i=1}^{n} w_{i}x_{i}}{\sum _{i=1}^{n} w_{i}} $$
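
A small sketch of the weighted mean, written out term by term and compared with `np.average`. The values and weights are made up for illustration:

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0])   # illustrative data values x_i
weights = np.array([1.0, 1.0, 2.0])     # illustrative weights w_i

# Direct translation of the formula: sum(w_i * x_i) / sum(w_i).
manual_weighted_mean = (weights * values).sum() / weights.sum()

# NumPy's built-in equivalent.
numpy_weighted_mean = np.average(values, weights=weights)

print(manual_weighted_mean)  # 22.5
print(numpy_weighted_mean)   # 22.5
print(values.mean())         # 20.0, the unweighted mean for comparison
```

Doubling the weight of the last value pulls the result from 20.0 toward 30.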

### Median

The median is the middle number in a sorted list, or the average of the two middle values when the list has an even number of entries.

It is referred to as a robust estimate because it is far less sensitive to outliers (extreme cases) than the mean.
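
To make the robustness claim concrete, the sketch below compares the mean and the median before and after adding a single extreme value. The helper `median` and the numbers are illustrative assumptions:

```python
def median(values):
    """Middle value of the sorted list, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def mean(values):
    return sum(values) / len(values)

clean = [12, 14, 15, 19, 20]   # illustrative values
with_outlier = clean + [500]   # same values plus one extreme case

print(mean(clean), median(clean))                # 16.0 15
print(mean(with_outlier), median(with_outlier))  # mean jumps to about 96.7, median is 17.0
```

One extreme value moves the mean by roughly 80 units but the median by only 2.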

```python
# Notebook-only shell command: install the weighted-quantile helper first.
!pip install wquantiles

import numpy as np
import pandas as pd
import wquantiles
from scipy import stats
```



```python
# Define a small example DataFrame.
df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 30, 1, 20],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4, 5, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12, 7, 10]})
df
```


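|   | points | assists | rebounds |
|---|--------|---------|----------|
| 0 | 25 | 5 | 11 |
| 1 | 12 | 7 | 8 |
| 2 | 15 | 7 | 10 |
| 3 | 14 | 9 | 6 |
| 4 | 19 | 12 | 6 |
| 5 | 23 | 9 | 5 |
| 6 | 25 | 9 | 9 |
| 7 | 30 | 4 | 12 |
| 8 | 1 | 5 | 7 |
| 9 | 20 | 4 | 10 |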


```python
mean_ = df["points"].mean()
# Trim 10% of the sorted values from each end before averaging.
trim = stats.trim_mean(df["points"], 0.1)
median_ = df["points"].median()
# Use the number of assists as the weight of each points value.
weighted_mean = np.average(df["points"], weights=df["assists"])
weighted_median = wquantiles.median(df["points"], weights=df["assists"])

print(f"mean is {mean_},")
print(f"trim_mean is {trim},")
print(f"median is {median_},")
print(f"weighted mean is {weighted_mean},")
print(f"weighted median is {weighted_median}")
```

```
mean is 18.4,
trim_mean is 19.125,
median is 19.5,
weighted mean is 18.380281690140844,
weighted median is 19.1875
```
