# 1.2 Overview of stats and scales of measurement

### Descriptive Stats
Deals with collection, summary, presentation, and analysis of data in order to transform data into info that can be easily understood and interpreted

### Inferential Stats
Use data from a sample to make claims, predictions, estimates and test hypotheses about the chars of a pop.

### Population and Sample
- **Population** - set of all individuals or elements of interest
- **Sample** - subset of the population that is actually observed
- **Parameter** - specific characteristic of a population; fixed for that population
- **Statistic** - specific characteristic of a sample; varies from sample to sample
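
A minimal sketch of the parameter vs statistic distinction. The population of ages is made up for illustration; the point is only that the population mean is one fixed number while the sample mean changes with each sample drawn.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: ages of 1,000 individuals
population = [random.randint(18, 80) for _ in range(1000)]

# Parameter: computed from the whole population, so it is one fixed number
population_mean = statistics.mean(population)

# Statistic: computed from a sample, so it changes with the sample drawn
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(5)
]

print("population mean (parameter):", population_mean)
print("sample means (statistics):  ", sample_means)
```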

### Data
- refers to the facts and figures collected, summarized and analysed for presentation and interpretation
- it consists of individuals, variables, and observations

### Individuals, Variables and Observations
- **Individual** = object of interest in study - Ex. cars, cities, people etc.
- **Variables** = attributes or characteristics of interest for an individual (ex. color, height, weight)
- **Observations** = set of all outcomes or measurements collected for particular individual in study
- **Data Set** = collection of all data observations in a study

### Scales of Measurement
- **Nominal:** arbitrary labels or names, no order *Ex. walk, drive, train, bus, other*
    - Numeric codes can be used *Ex. 1 represents walk, 2 drive etc.*
- **Binary:** special case of categorical with just two categories 0 or 1, true or false etc.
- **Ordinal:** all properties of nominal data but with **order or rank**
    - When numeric codes used, there is NO measurable meaning to the number differences but there is an order
    - Ex. Education Level: 4 High School, 3 Under Grad, 2 Masters, 1 PhD
    
- **Interval:** numeric scales in which data exhibits props of ordinal scale 
    - **diff b/w values meaningful and specified on fixed unit of measure**
    - Ex. SAT scores - 500, 600, 750 - scores can be ranked and differences have meaning
    - No true 0 point - 0 does not mean absence of value but indicates another value on scale
    
- **Ratio:** all props of interval BUT **ratio of any two values is meaningful** and **has true 0**
    - Ex. yearly income: 36k is 1.2 times 30k
    - 0 income means the worker earned nothing: absence of a value
    - Other examples: weight, height, time
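
A small sketch of why ratios only make sense on a ratio scale. Temperature is used here as an assumed example of an interval scale (it is not from the notes); income is the ratio-scale example from above.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from Celsius to Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: the zero point is arbitrary, so the ratio of two values
# changes when the unit changes
low_c, high_c = 10.0, 20.0
ratio_in_celsius = high_c / low_c
ratio_in_fahrenheit = celsius_to_fahrenheit(high_c) / celsius_to_fahrenheit(low_c)

# Ratio scale: zero means "none", so the ratio is the same in any unit
income_ratio_dollars = 36_000 / 30_000
income_ratio_thousands = 36 / 30

print(ratio_in_celsius, ratio_in_fahrenheit)          # ratios disagree
print(income_ratio_dollars, income_ratio_thousands)   # ratios agree
```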

*Intro Readings - Applied stats for business and econ* [Ch. 1 pp 1-13](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=1648178)

# 1.6 - Types of Data
### Categorical or Qualitative
- Uses **nominal** or **ordinal** scales of measurement

### Numerical or Quantitative
- Uses **interval or ratio** scales of measurement
- **Discrete**: usually integer values, usually a quantity
    - Examples: Number of cars, test scores with half marks (non-integer discrete)
- **Continuous:** measure 'how much'
    - Examples: Height, weight, time
    - Sometimes rounded to nearest integer but still continuous (Ex. Age)

### Statistical Analysis for Categorical Vars
- Count number of observations, evaluate proportions or frequencies
- Arithmetic ops provide no meaningful results, e.g. an average of category codes

### Statistical Analysis for Quantitative Vars
- Arithmetic ops provide meaningful results
- Ex. Get average age

#### Time-Series: data collected over a range of time periods
#### Cross-Sectional: data collected from a number of subjects during a single time period

### Presentation of Data
- Tabular presentation
    - Frequency, % Freq., Relative Freq., Cumulative Freq.
    
### Frequency Distribution
- Summarize data in a table, show number of observations (frequency) of each category
- Categories are *nonoverlapping*
- Can summarize both qualitative and quantitative
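
A minimal sketch of a frequency distribution built with the standard library. The survey responses are hypothetical; the columns follow the tabular presentation listed above (frequency, relative frequency, percent frequency, cumulative frequency).

```python
from collections import Counter

# Hypothetical survey responses: how each person commutes
responses = ["walk", "drive", "bus", "drive", "train",
             "drive", "walk", "bus", "drive", "walk"]

counts = Counter(responses)   # frequency of each non-overlapping category
n = len(responses)

table = []
cumulative = 0
for category, freq in counts.most_common():   # most frequent category first
    cumulative += freq
    table.append({
        "category": category,
        "frequency": freq,
        "relative": freq / n,
        "percent": 100 * freq / n,
        "cumulative": cumulative,
    })

for row in table:
    print(row)
```

Cumulative frequency is mostly informative when the categories have a natural order (ordinal or numeric classes); for nominal data like this it only confirms that the counts add up to n.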

*Readings - Practical Stats for Data Scientists* [Ch. 1 pp 1-7](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=6173908)

# 1.9 Measures of Location (mean, median, and mode)

### Mean
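- most basic estimate of location: sum of the values divided by the number of values
- denoted x̄ (x bar) for a sample
- *Trimmed Mean*: drop a fixed number of the smallest and largest sorted values, then average the rest; reduces the influence of extreme values
    - text pp. 9 and 10
- *Weighted Mean*: multiply each value by its weight, sum, then divide by the sum of the weights
    - text pp. 9 and 10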
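
A minimal sketch of the trimmed and weighted means, assuming the usual definitions (drop a fixed number of values from each end of the sorted data; weight each value before averaging). The function names and the data are illustrative only.

```python
def trimmed_mean(values, trim):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    if 2 * trim >= len(ordered):
        raise ValueError("trim would remove every value")
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


scores = [2, 4, 5, 6, 7, 8, 100]          # hypothetical data, one extreme value
plain = sum(scores) / len(scores)         # pulled up by the value 100
trimmed = trimmed_mean(scores, trim=1)    # drops 2 and 100
weighted = weighted_mean([70, 90], weights=[1, 3])  # second value counts 3x

print(plain, trimmed, weighted)
```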
    
### Median
- Middle number in a sorted list
    - If even number of vals - average of two middle numbers
- Compared to Mean - only depends on 1 or 2 observations vs all observations
- Useful if large outliers (Ex. Bill Gates living in a neighborhood when measuring household income)
- Less affected by outliers
- If data **normally distributed** - value of median and mean should be close
- If data **skewed** median and mean might be quite different
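
A short sketch of how one large outlier moves the mean but barely moves the median. The incomes are hypothetical.

```python
import statistics

# Hypothetical household incomes in one neighborhood
incomes = [42_000, 48_000, 51_000, 55_000, 60_000]

# Same neighborhood after one extremely high earner moves in
with_outlier = incomes + [1_000_000_000]

print("mean:  ", statistics.mean(incomes), "->", statistics.mean(with_outlier))
print("median:", statistics.median(incomes), "->", statistics.median(with_outlier))
```

With an even number of values the median is the average of the two middle numbers, which is what `statistics.median` returns for the second list.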


### Mode
- Value or values that occur the most often - highest frequency
- Simple stat for categorical data - generally not used for numeric data
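
A minimal sketch of the mode for categorical data using the standard library. The category lists are made up; `statistics.multimode` is used for the case where several values tie for the highest frequency.

```python
import statistics

# Hypothetical categorical data
commute = ["walk", "drive", "bus", "drive", "train", "drive", "walk"]
colors = ["red", "blue", "red", "blue", "green"]

single_mode = statistics.mode(commute)     # one most frequent value
all_modes = statistics.multimode(colors)   # every value tied for most frequent

print(single_mode)
print(all_modes)
```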

#### Readings
[Chapter 1 Exploratory Data Analysis, Estimates of Location section, pp.7–13.](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=6173908)

[Chapter 3 Describing Data: Summary Statistics, Section 3.2.1 Measures of Central Tendency, pp.43–50.](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=1648178)

# 1.12 Measures of Location (percentiles and quartiles)

### Percentiles
- *pth* percentile: value which divides the data into two parts
    - at least *p* percent of the observations are less than or equal to this value
    - and at least *100-p* percent of observations are greater than or equal to this value
- sort data in ascending order
- compute position i of the *pth* percentile as i = (p/100) × n
    - p = percentile of interest
    - n = number of observations
- if i is an integer: percentile is the average of the values in positions i and i + 1
- if i is not an integer: round up to obtain the position of the *pth* percentile
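
The rule above can be sketched as a small function. This follows the position rule from these notes only; libraries such as NumPy default to other interpolation conventions, so their results can differ slightly. The function name and data are illustrative.

```python
import math


def percentile(values, p):
    """pth percentile using the position rule i = (p/100) * n (1-based)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)            # sort in ascending order
    n = len(ordered)
    position = p * n / 100              # multiply first to limit float error
    if position.is_integer():
        i = int(position)
        # integer position: average the values in positions i and i + 1
        return (ordered[i - 1] + ordered[i]) / 2
    # non-integer position: round up to the next position
    return ordered[math.ceil(position) - 1]


data = [15, 20, 35, 40, 50, 55, 60, 70]   # hypothetical data, n = 8

print(percentile(data, 25))   # position 2.0
print(percentile(data, 50))   # position 4.0
print(percentile(data, 80))   # position 6.4, rounded up to 7
```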
    
    
### Quartiles
- Divide ranked data into 4 parts each containing ~ 25% of data
    - Q1 = 25%
    - Q2 = 50% = Median
    - Q3 = 75%


### Range (R)
- Difference between largest and smallest value

### Interquartile Range (IQR)
- Middle 50% of data - IQR = Q3 - Q1
- Measure of spread that is not affected by extreme values; also used to identify outliers
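
A minimal sketch of the range, quartiles and IQR with the standard library. The data is hypothetical. `statistics.quantiles` uses its own default convention ("exclusive"); other conventions can give slightly different quartiles on other data sets.

```python
import statistics

data = [15, 20, 35, 40, 50, 55, 70]        # hypothetical data, n = 7
stretched = [15, 20, 35, 40, 50, 55, 700]  # same data with one extreme value


def spread(values):
    """Return (range, Q1, Q2, Q3, IQR) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)   # three cut points
    return max(values) - min(values), q1, q2, q3, q3 - q1


data_range, q1, q2, q3, iqr = spread(data)
stretched_range, _, _, _, stretched_iqr = spread(stretched)

print("range:", data_range, "->", stretched_range)   # range reacts to the extreme value
print("IQR:  ", iqr, "->", stretched_iqr)            # IQR does not
```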


#### Readings
[Bruce, Bruce and Gedeck (2020) Chapter 1 Exploratory Data Analysis, pp.20–3.](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=6173908)



```python
import numpy as np

data = np.array([23, 21, 16, 22])

# Sample standard deviation: ddof=1 divides by n - 1
np.std(data, ddof=1)
```

```
3.1091263510296048
```

# 1.15 Measures of Variation (variance and standard deviation)

### Variance
- average of the squared deviations of values from the mean
    - get mean of the population
    - subtract mean from each observation and square that difference
    - sum the squared differences
    - divide by pop size N (*n-1 for sample variance*)
    
### Standard Deviation
- Most commonly used measure of spread
- **Shows variation around the mean**
- Expressed in same unit as original data
- = Positive square root of the variance
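
The variance steps and the square-root step can be sketched by hand as follows. The data values are the four numbers used in the notebook cell in these notes; the standard-library calls at the end are only a cross-check.

```python
import math
import statistics

data = [23, 21, 16, 22]
n = len(data)

mean = sum(data) / n                                   # step 1: mean
squared_deviations = [(x - mean) ** 2 for x in data]   # step 2: squared differences
total = sum(squared_deviations)                        # step 3: sum them

population_variance = total / n        # divide by N when data is the whole population
sample_variance = total / (n - 1)      # divide by n - 1 when data is a sample

population_sd = math.sqrt(population_variance)   # same unit as the data
sample_sd = math.sqrt(sample_variance)

print(population_variance, sample_variance)
print(population_sd, sample_sd)

# Cross-check against the standard library
print(statistics.pvariance(data), statistics.variance(data))
```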

#### Data in a frequency table
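- Each distinct value x has a frequency f (how many times it occurs); n = sum of f
- Mean = sum of (f × x) / n
- Variance = sum of (f × (x - mean)^2) / N (*n-1 for sample variance*)
- Same result as working from the raw observations, just computed from value and frequency pairs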
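
A minimal sketch of the mean and sample variance computed from a frequency table, assuming the standard frequency-weighted formulas (each value weighted by how often it occurs). The table is hypothetical; expanding it back into raw observations is only a cross-check.

```python
import statistics

# Hypothetical frequency table: value -> number of times it was observed
frequency = {1: 2, 2: 5, 3: 3}

n = sum(frequency.values())                              # total observations
mean = sum(f * x for x, f in frequency.items()) / n      # frequency-weighted mean

# Frequency-weighted sum of squared deviations from the mean
weighted_squares = sum(f * (x - mean) ** 2 for x, f in frequency.items())
sample_variance = weighted_squares / (n - 1)
sample_sd = sample_variance ** 0.5

# Cross-check: expand the table into raw observations
expanded = [x for x, f in frequency.items() for _ in range(f)]

print(mean, sample_variance, sample_sd)
print(statistics.mean(expanded), statistics.variance(expanded))
```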

### Coefficient of Variation (CV)
- measures how large the standard deviation is in relation to the mean
- CV = (s / x̄) × 100%
- Dimensionless: independent of the unit of measurement
- Shows extent of variability in relation to mean of pop
- *Ex. in finance, investors calculate the risk they take on in comparison to the amount of return expected*
    - The lower the CV, the better the risk-return trade-off
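
A short sketch of the CV using the sample standard deviation. The two sets of yearly returns are hypothetical and chosen to have the same mean, so the CV difference comes only from the spread. The function name is illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean (mean must be non-zero)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical yearly returns (%) of two investments, both with mean 5
steady = [4, 5, 6, 5, 5]
volatile = [1, 9, 5, 12, -2]

cv_steady = coefficient_of_variation(steady)
cv_volatile = coefficient_of_variation(volatile)

# Dimensionless: rescaling the unit of measurement leaves the CV unchanged
cv_rescaled = coefficient_of_variation([x * 100 for x in steady])

print(round(cv_steady, 1), round(cv_volatile, 1), round(cv_rescaled, 1))
```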


#### Readings
[Lecture Slides](https://dr3vr6j2erh62.cloudfront.net/videostore/dsm030/transcripts/dsm030_rd_topic01_lecture05.pdf)

[Bruce, Bruce and Gedeck (2020), Chapter 1 Exploratory Data Analysis, pp.14–15.](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=6173908)

[Leekley (2010), Chapter 3 Describing Data: Summary Statistics, Section 3.2.2 Measures of Variation, pp.51–58.](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=1648178)


# 1.18 Basic Plots and Python Code Examples


### Bar Chart
- A bar chart is used to visualise categorical data.
- The categorical data needs to have corresponding values.
- The lengths (or heights) of such bars are proportional to those values.
- Display some numeric frequency for different categories or discrete groups
- Flexible: can contain more than one set of values for a category

### Histogram
- distribution of numerical data
- consists of several bars showing freq. of a value
- divide data into a series of non-overlapping intervals - count how many values in each interval
- **bins:** intervals of the histogram
- used to plot one continuous variable
- *Ex. present marks of all students in a certain module*
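
A minimal sketch of the binning step behind a histogram, without a plotting library. The marks and the choice of four equal-width bins from 20 to 100 are assumptions for illustration; the bars are drawn as text.

```python
# Hypothetical module marks for 14 students
marks = [35, 42, 47, 51, 55, 58, 62, 64, 67, 71, 73, 78, 85, 91]

edges = [20, 40, 60, 80, 100]   # bin edges: [20,40), [40,60), [60,80), [80,100]
bin_width = edges[1] - edges[0]
counts = [0] * (len(edges) - 1)

for mark in marks:
    # Each value falls in exactly one bin; the top edge goes in the last bin
    index = min((mark - edges[0]) // bin_width, len(counts) - 1)
    counts[index] += 1

# Text version of the histogram: one bar per bin, length = frequency
for low, high, count in zip(edges, edges[1:], counts):
    print(f"{low:>3}-{high:<3} | {'#' * count} ({count})")
```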

### Boxplot
- display data through quartiles
- illustrates the 3 quartiles and any extreme values
- has **whiskers** - lines extending from the box indicating variability outside Q1 and Q3
- can depict outliers as individual points
- view range and interquartile range on a boxplot
- **designed to give an easy to read representation of the location and spread of a distribution**
- **Upper and Lower Fences:** separate outliers from the bulk of data
    - Upper fence = Q3 + (1.5 * IQR)
    - Lower fence = Q1 - (1.5 * IQR)
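
The fence formulas can be sketched as follows. The data is hypothetical, and the quartiles come from `statistics.quantiles` with its default convention, which is an assumption; plotting libraries may compute quartiles slightly differently.

```python
import statistics

data = [15, 20, 35, 40, 50, 55, 150]   # hypothetical data with one extreme value

q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Values beyond the fences are the points a boxplot draws individually
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q2, Q3:", q1, q2, q3)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```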



#### Readings
[Lecture Slides](https://dr3vr6j2erh62.cloudfront.net/videostore/dsm030/transcripts/dsm030_rd_topic01_lecture06.pdf)

[Bruce, Bruce and Gedeck (2020), Chapter 1 Exploratory Data Analysis, pp.14–15.](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=6173908)

[Leekley (2010), Chapter 3 Describing Data: Summary Statistics, Section 3.2.2 Measures of Variation, pp.51–58. ](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=1648178)


### Readings

[Leekley - Applied Stats for biz and econ Ch.1 -3 pp 1-70](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=1648178)

[Bruce - Practical stats for data scientists Ch.1 pp. 1-46](https://ebookcentral.proquest.com/lib/londonww/detail.action?docID=6173908)