# Descriptive and Inferential Statistics

* statistics - the practice of collecting and analysing data to discover useful findings or to predict what causes those findings

* machine learning in itself is a statistical tool

* lots of blind spots in statistics, even for experienced professionals - e.g. forgetting to check where the data comes from

  * this problem becomes even more significant as we automate most statistical algorithms

* good to have a solid understanding of statistics and hypothesis testing to avoid treating statistical automation as a black box


## What is Data?

* data provides snapshots of a story

  * may be biased

  * may have gaps

  * may be missing relevant data

* data by itself is not what matters; how it is collected and analysed is

* often seen not just as a source of truth but also as the raw material for AI

* process of collecting data focused on a particular objective is *data mining*

* data provides clues, not truth

  * clues may lead to truth or erroneous conclusion

* be curious about where the data comes from

  * garbage in garbage out

### Ground Truth

* the correct, factual answer to a specific problem (as opposed to the answer inferred by an agent or sensor)

* if a self-driving car fails to see a pedestrian on the road through its camera, can it not detect the failure and stop?

  * no, it can't, as the system has no access to the ground truth; there is nothing for it to fall back on unless another system or a human provides it

## Descriptive vs. Inferential Statistics

* *descriptive statistics* summarises/describes data

* *inferential statistics* tries to uncover attributes of a larger population from a smaller sample

  * could be wrong as our sample may not be representative of the population

## Populations, Samples and Bias

* a *population* is a particular group of interest we want to study

  * example - all golden retrievers in Scotland

  * does not have to be tangible - it can be abstract too

    * example - we want to study all the flights taking off between 2am and 3am

      * not enough data for that time so our population is very small

      * now we treat that population as a sample - a sample of all theoretical flights between 2am and 3am

      * theoretical flights are an abstract population

* a *sample* is a subset of the population we are interested in

  * ideally random and unbiased

  * sample must be as random as possible to avoid skewed conclusion

* bias

  * inevitable that our data will be biased

    * so many `confounding variables` and other factors

      * confounding variable - an unmeasured third variable that influences both the supposed cause and the observed effect

    * only way to overcome this is to sample truly at random
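
A minimal Python sketch of why random selection matters. The population values, the sample size of 50 and the "self-selected" rule are all made up for illustration; only `random.sample` from the standard library does the actual sampling.

```python
import random
from statistics import mean

# Hypothetical population: weekly streaming hours for 1,000 people (made-up values).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.uniform(0, 20) for _ in range(1000)]

# Random sample: every member has the same chance of being picked.
random_sample = rng.sample(population, 50)

# Biased sample: only the 50 heaviest users "respond" (a stand-in for self-selection).
biased_sample = sorted(population)[-50:]

print(f"population mean:    {mean(population):.2f}")
print(f"random sample mean: {mean(random_sample):.2f}")
print(f"biased sample mean: {mean(biased_sample):.2f}")
```

The random sample's mean should land near the population mean, while the biased sample always overshoots because it only contains the largest values.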

### A Whirlwind Tour of Bias Types

* geographical bias

* confirmation bias

  * collecting only data that supports your belief (knowingly or unknowingly)

* self-selection bias

  * a specific group is more likely to include itself in a sample

  * example 1 - polling on social media to find out how many people use Netflix - respondents already have internet access and may use Netflix more than people who are not on social media, so the sample is not representative

  * example 2 - polling passengers on a flight about whether they prefer this airline over others - the passengers are self-selected, as they already chose this airline to fly

* survival bias

  * captures only the living or surviving subjects

  * examples are diverse but not obvious

    * WWII aircraft armour - people examined returning planes to see where they had been hit, so that those areas could be armoured. Mathematician Abraham Wald pointed out that they should look at where the returning planes had *not* been hit: planes hit in those areas were the ones that did not return

    * management consulting companies looking only at successful companies and using their traits as predictors of future success

      * <https://xkcd.com/1827/>

    * a veterinary study found that cats falling from more than 6 stories suffered less severe injuries

      * theory was that falling from more than 6 stories gave the cats enough time to brace for impact

      * but it was later pointed out that dead cats were not counted (more cats died falling from more than 6 stories)

        * who brings a dead cat to the veterinarian?

### Sample and Bias in Machine Learning

* math and computers cannot detect bias in the data. It is on you, as a good data science professional, to detect it

  * always scrutinise the data

* biased data causes machine learning algorithms to reach biased conclusions

  * criminal justice is one example - due to minority-heavy datasets, the results are biased

  * Volvo self-driving car tests in 2017 - the system could not recognise kangaroos in Australia, as the training data only contained deer and similar animals


## Descriptive Statistics

### Mean

* *mean* is the average of a set of values

* shows the `center of gravity` of the data

* $\bar{x}$ is the sample mean

  * $\bar{x} = \frac{\sum x_i}{n}$

* $\mu$ is the population mean

  * $\mu = \frac{\sum x_i}{N}$
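
As a quick check of the formula, here is a minimal Python sketch. The `sample` values are hypothetical, and the standard library's `statistics.mean` is shown only as a cross-check.

```python
import statistics

# Hypothetical sample: number of pets owned by each of 6 people.
sample = [1, 4, 6, 7, 9, 12]

# Mean: sum of the values divided by how many there are.
mean = sum(sample) / len(sample)
print(mean)                     # 6.5

# The standard library gives the same answer.
print(statistics.mean(sample))  # 6.5
```

The same calculation applies to $\bar{x}$ and $\mu$; the only difference is whether the values are a sample or the whole population.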

### Weighted Mean

* similar to the *mean*, but the plain mean gives equal importance to each item

* weighted mean uses different weight for each item

* calc

  * $\frac{(x_1 \cdot w_1) + (x_2 \cdot w_2) + \dots + (x_n \cdot w_n)}{w_1 + w_2 + \dots + w_n}$

* why?

  * one good example is using weighted exam scores to compute a final grade

    * of 3 exams, give 20% weight to each of the first 2 and 60% to the last
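
The exam example can be sketched in a few lines of Python. The three scores are hypothetical; the 20% / 20% / 60% weights are the ones from the notes above.

```python
# Hypothetical exam scores and the 20% / 20% / 60% weights.
scores = [90, 80, 70]
weights = [0.2, 0.2, 0.6]

# Weighted mean: sum of (value * weight) divided by the sum of the weights.
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(f"{weighted_mean:.1f}")  # 76.0

# Plain mean for comparison: every exam counts equally.
plain_mean = sum(scores) / len(scores)
print(f"{plain_mean:.1f}")     # 80.0
```

The weighted result is lower than the plain mean here because the weakest score belongs to the exam with the largest weight.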

### Median

* middle-most value of an ordered set of values

* if there is an even number of values, the median is the average of the 2 middle-most values

* useful compared to mean when the data is skewed by outliers

* example - in one university the average salary for Geography graduates is \$250k, but in other universities it is just \$22k

  * turns out Michael Jordan, the basketball player, is a Geography graduate of that university

* when data has many outliers then median is better as it is less sensitive to outliers

* when mean and median are very different then the data is skewed by the outliers

* median is `50% quantile` where 50% of the data is less than or equal to it

* there are 25%, 50% and 75% quantiles

  * these are referred to as `quartiles`

    * 1st, 2nd and 3rd quartiles respectively
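
A small Python sketch of the median's resistance to outliers, plus quartiles. The salary figures are hypothetical, and `statistics.quantiles` (Python 3.8+) is only one of several conventions for placing the quartile cut points, so other tools may report slightly different values for the 1st and 3rd quartiles.

```python
from statistics import median, quantiles

# Hypothetical salaries in thousands of dollars; the last value is an outlier.
salaries = [20, 21, 22, 23, 24, 25, 1000]

mean_salary = sum(salaries) / len(salaries)
print(f"{mean_salary:.1f}")   # the mean is dragged up by the outlier
print(median(salaries))       # 23, the middle value, barely affected

# Even number of values: the median is the average of the two middle ones.
print(median([1, 3, 5, 7]))   # 4.0

# Quartiles are the 25%, 50% and 75% quantiles; the 2nd quartile is the median.
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")
print(q1, q2, q3)
```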

### Mode

* mode is the most frequently occurring value

* if no value occurs more than once, there is no mode

* if two values tie for the highest frequency, the data is `bimodal`

* not used a lot in practice unless data is highly repetitive
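
The three rules above can be sketched with `collections.Counter`. The helper name `modes` and the input lists are hypothetical, and returning an empty list for "no mode" is a convention chosen here to match the notes.

```python
from collections import Counter

def modes(values):
    """Return the most frequent values, or [] when nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # No value occurs more than once, so there is no mode.
        return []
    return [value for value, count in counts.items() if count == top]

print(modes([1, 2, 2, 3, 3, 3, 4]))  # [3]
print(modes([1, 1, 2, 3, 3, 4]))     # [1, 3] (bimodal)
print(modes([1, 2, 3, 4]))           # []
```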

### Variance and Standard Deviation

#### Population Variance and Standard Deviation

* measures how spread out the data is

* example: pets owned by colleagues (population)

  * mean $\mu = 6.5$

  * we want to see how different each value is from mean

    * so we subtract mean from each value

    * we will have some positive and negative values

    * and we can somewhat see how spread out each value is

  * is there a way to summarise this with one number?

    * we could take the average of the differences

      * but there are negative and positive numbers so they may cancel each other out and won't give real picture
    
    * we can use absolute values but even better approach is to square the differences

      * main reason to square is - it amplifies larger differences

      * absolute values are also harder to work with in derivatives, since $|x|$ has a sharp corner at 0 and is not differentiable there, whereas $x^2$ is smooth everywhere

    * so we find the average of squared differences and this is called `variance`

    * $\text{population variance} = \frac{(x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \dots + (x_N - \text{mean})^2}{N}$

    * $\sigma^2 = \frac{\sum(x_i - \mu)^2}{N}$

* variance is great, as it shows how spread out the data is in one number, but it is in squared units rather than the unit of the data

* we need to change variance to be in the same unit as the data, so we can compare it with the data directly

  * so we take the square root of the variance - which is the standard deviation

  * $\sigma = \sqrt{\frac{\sum(x_i - \mu)^2}{N}}$
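
The steps above can be followed in a short Python sketch. The `pets` values are hypothetical, chosen only so that the mean comes out at the 6.5 used in the notes; they are not the original data.

```python
from math import sqrt

# Hypothetical population: pets owned by 6 colleagues (chosen so the mean is 6.5).
pets = [1, 4, 6, 7, 9, 12]

mu = sum(pets) / len(pets)                     # population mean
squared_diffs = [(x - mu) ** 2 for x in pets]  # squared distance from the mean
variance = sum(squared_diffs) / len(pets)      # average squared difference
std_dev = sqrt(variance)                       # back in the unit of the data

print(mu, variance, std_dev)  # 6.5 12.25 3.5
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.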


