# Statistics Fundamentals

We'll cover:
- Intro:
    - Definition
    - Importance of Statistics
    - Common Terms in Stats
    - Types
- Types of Data
- Statistical Measures
    - Measures of Central Tendency
    - Measures of Dispersion
    - Measures of Shape
    - Multivariate Analysis (Covariance and Correlation)

In [1]:
import pandas as pd
import numpy as np

## Intro On Statistics
### What is Statistics?
- A part of mathematics that involves the collection, analysis, interpretation, presentation, and organization of data.
- It's a discipline that deals with the collection, description, and analysis of quantitative and qualitative data, and with drawing conclusions from it. (Note: there are techniques for performing statistics on categorical data, e.g. the chi-square test.)

### Importance of Statistics
- The journey of a DS project starts with statistical analysis (e.g. inferential analysis) to help us explore and interpret our data.
- It helps extract impactful findings and define an objective.

### Common Terms in Statistics
- Population and Sample
    - A **population** is the complete pool from which the sample is drawn for further analysis. 
    - A **sample** is a subset of the **population**
    - Why study a sample instead of the population? Studying the entire population is usually too resource-intensive, too slow, or simply infeasible.
    - Therefore, it's essential to have a good sample at hand: the sample should be a **good representative** of the population.
- **Measurement**: A number that is calculated or measured for each member of the sample, e.g. weight, blood pressure, vehicle speed.
- **Parameter**: A characteristic of the population, e.g. the population mean.
- **Statistic**: A characteristic of the sample, e.g. the sample mean.
- **Distribution**: How the sample data is distributed (spread) across a range of values.
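
To make the terms above concrete, here is a minimal sketch using made-up data: the integers 1 to 100 stand in for a population, and a random draw of 20 of them plays the role of the sample. The variable names and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Hypothetical population: the integers 1..100
population = np.arange(1, 101)

# Draw a random sample of 20 members without replacement (seed fixed for repeatability)
rng = np.random.default_rng(seed=42)
sample = rng.choice(population, size=20, replace=False)

parameter = population.mean()   # population mean (a parameter)
statistic = sample.mean()       # sample mean (a statistic that estimates the parameter)

print('Population mean (parameter):', parameter)
print('Sample mean (statistic):', statistic)
```

The sample mean will usually be close to, but not exactly equal to, the population mean; how close depends on how representative the sample is.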

In [2]:
student_data = {
                'name': ['Mark', 'Mike', 'Tammy', 'Becky', 'John'],
                'age': [55,43,27,35, 46],
                'score':[99,78,83,87, 79],
                'city': ['New York', 'Nashville', 'San Diego', 'Atlanta', 'Boston']
                }

In [3]:
df = pd.DataFrame(student_data)
df

| | name | age | score | city |
|---|---|---|---|---|
| 0 | Mark | 55 | 99 | New York |
| 1 | Mike | 43 | 78 | Nashville |
| 2 | Tammy | 27 | 83 | San Diego |
| 3 | Becky | 35 | 87 | Atlanta |
| 4 | John | 46 | 79 | Boston |


- `df` is a sample drawn from all students of a school (the population).
- Each row, for example row 0, is a record (a member of the sample).

### Types of Statistics

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_06_Maths_and_Stats/Statistics_Fundamentals/Image_1.png)

- Descriptive statistics summarizes and describes the data at hand, inferential statistics draws logical inferences from the data, and predictive statistics forecasts future data.
- The key difference between descriptive and inferential statistics is that inferential statistics builds conclusions (inferences) about the population based on the analyzed sample, e.g. an increase in blood pressure may cause a heart attack. Basically, inferential = creating a derived understanding about some aspect of the population.
- Predictive statistics anticipates what will happen to a characteristic in the future. Basically, predictive = predicting future outcomes based on historical data.


## Types of Data

Two main types:
- Categorical: represents a characteristic, e.g. gender, blood type, country, marital status.
- Numeric: represents a measurement or calculation, e.g. height, age, weight, number of sales, product temperature, credit limit.

What about dates?
- It depends on the context and on which date component is used.
- A full date is considered categorical.
- If you extract a part of the date, such as the month or the day, that value is considered numeric.
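
As a small illustration of these data types, the sketch below builds a made-up DataFrame (the column names and values are assumptions for demonstration only) and extracts the month from a date column, which yields a plain numeric value.

```python
import pandas as pd

# Hypothetical records mixing categorical, numeric, and date columns
records = pd.DataFrame({
    'blood_type': ['A', 'O', 'B'],                 # categorical
    'height_cm': [170.5, 182.0, 165.2],            # numeric
    'visit_date': pd.to_datetime(['2024-01-15', '2024-03-02', '2024-03-28']),
})

# Extracting a date component yields a plain number
records['visit_month'] = records['visit_date'].dt.month

print(records.dtypes)
print(records['visit_month'].tolist())   # [1, 3, 3]
```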

## Statistical Measures

### Measures of Central Tendency
- A measure of central tendency is a summary value that describes the central position of the dataset.
- They are also called the 3 Ms: mean, median, and mode.
- Note: If the data is skewed (not symmetrically distributed), the three measures diverge, and the mean in particular may no longer represent the center of the data well.

Mean:
- The most popular statistical measure.
- The sum of the data points divided by their count.


In [4]:
df['age'].mean(skipna=True)  # skipna=True (the default) ignores nulls (NaN) in the data

41.2

In [5]:
#using numpy
np.mean(df['age'])

41.2

In [6]:
# using statistics library
import statistics as stat
stat.mean(df['age'])

41.2

Median:
- The middle value of a dataset once it has been sorted in ascending or descending order.
- For an odd count of values, it's straightforward: take the single middle value.
- For an even count of values, you need to take the average of the 2 data points in the center. For example, the median of `[4,5,6,7,8,9]` is (6+7)/2 = 6.5.
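
The even-count case can be checked by hand. This is a minimal sketch using the list from the example above; the variable names are illustrative.

```python
import numpy as np

values = [4, 5, 6, 7, 8, 9]          # even count: 6 values
sorted_values = sorted(values)
mid = len(sorted_values) // 2        # index of the upper middle value

# Average the two middle values (6 and 7)
manual_median = (sorted_values[mid - 1] + sorted_values[mid]) / 2

print(manual_median)         # 6.5
print(np.median(values))     # 6.5
```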

In [7]:
df['age'].median()

43.0

In [8]:
np.median(df['age'])

43.0

Mode:
- It's the most frequent value in the dataset.
- It's the only measure of central tendency that can be applied to both numerical and categorical data. For example, `['Red', 'Blue', 'Red', 'White', 'Red', 'Blue']` has the mode `'Red'`.
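
The categorical example above can be reproduced with pandas. A short sketch, with `colors` as an illustrative variable name:

```python
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Red', 'White', 'Red', 'Blue'])

print(colors.value_counts())   # frequency of each category
print(colors.mode()[0])        # Red
```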

In [9]:
df['age'].mode()

0    27
1    35
2    43
3    46
4    55
Name: age, dtype: int64

`mode()` returns every value here because each age occurs exactly once, so all of them share the same (highest) frequency.

In [10]:
X = pd.Series([2,3,4,4,6,7,4,8,4,8])
X.mode()

0    4
dtype: int64

### Measures of Dispersion
- Also known as measures of variability. They are used to understand the spread of the data and its diversity, e.g. standard deviation or range.
- The mean alone can be a problematic description of the data. First, it can be unreliable if the data has outliers. Second, it may not be enough to explain the difference between some datasets.
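
The first point, sensitivity to outliers, can be seen with a few made-up numbers. In this sketch (the values are hypothetical), a single extreme value pulls the mean far away while the median barely moves.

```python
import numpy as np

# Hypothetical values, then the same values with one extreme outlier added
values = np.array([30, 32, 35, 38, 40])
with_outlier = np.append(values, 500)

print(np.mean(values), np.median(values))               # 35.0 35.0
print(np.mean(with_outlier), np.median(with_outlier))   # 112.5 36.5
```

The second point, that equal means can hide different spreads, is what the three datasets in the next cells illustrate.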

In [12]:
# Dataset 1
data1 = np.array([3,5,6,7,8])

# Dataset 2
data2 = np.array([2,5,5,8,9])

# Dataset 3
data3 = np.array([2,4,4,9,10])

In [13]:
print('Data 1 mean:',np.mean(data1))
print('Data 2 mean:',np.mean(data2))
print('Data 3 mean:',np.mean(data3))

Data 1 mean: 5.8
Data 2 mean: 5.8
Data 3 mean: 5.8


**Observation**
- Although the three datasets are different, they all share the same mean.
- The average/mean is not giving the full picture.
- Therefore, we need additional measures to get a better explanation.

In [14]:
print('Data 1 Standard Deviation:',np.std(data1))
print('Data 2 Standard Deviation:',np.std(data2))
print('Data 3 Standard Deviation:',np.std(data3))

Data 1 Standard Deviation: 1.7204650534085253
Data 2 Standard Deviation: 2.4819347291981715
Data 3 Standard Deviation: 3.1240998703626617


![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_05_Data_Visualization/standard_deviation.png)

- Standard deviation explains the dispersion (spread) of the data. For example, it shows that the 3 datasets are dispersed differently even though they share the same mean.
- It's the most popular measure of dispersion.
- It's defined as the square root of the variance, where the variance is the sum of the squared deviations from the mean divided by the number of observations.
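
As a minimal sketch of that definition, the standard deviation of the first dataset can be computed step by step and compared with `np.std`. The intermediate variable names are illustrative.

```python
import numpy as np

data1 = np.array([3, 5, 6, 7, 8])

deviations = data1 - data1.mean()
variance = np.sum(deviations ** 2) / len(data1)   # divide by n (population formula)
std_manual = np.sqrt(variance)

print(std_manual)               # same value as np.std(data1)
print(np.std(data1))            # ddof=0 by default: divides by n
print(np.std(data1, ddof=1))    # sample standard deviation: divides by n - 1
```

Note that pandas defaults to `ddof=1` in `Series.std()` and `describe()`, so its `std` for a column is slightly larger than what `np.std` returns for the same values.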

Range
- The distance between the smallest and largest data points.
- It's sensitive to outliers.

In [15]:
df['age'].max() - df['age'].min()

28

#### Percentiles and Quartiles

- Percentiles and quartiles divide the sorted data into equal parts to help understand its distribution and the ranking of individual values.
- Types:
    - Percentile: divides the data into 100 equal parts, with each part representing 1% of the data. Some common percentiles:
        - 25th percentile (Q1): 25% of the data falls below this value.
        - 50th percentile (Q2 or median): 50% of the data falls below this value.
        - 90th percentile: 90% of the data falls below this value.
- Examples, based on the school grades dataset (loaded later in this notebook):
    - The 50th percentile of the reading score is 70. This means half of the students scored below 70 and the other half scored above it, so 70 is the median.
    - The 25th percentile of the math score is 56. This means 25% of the students scored below 56 and 75% of the students did better (scored higher than 56).

In [16]:
arr = np.array([4,5,6,6,7,8,2,3,4,5,5,10,9,9,7,9])

percnt_30 = np.percentile(arr, 30)
percnt_30

5.0

In [18]:
df.describe()

| | age | score |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 41.2 | 85.2 |
| std | 10.68644 | 8.497058 |
| min | 27.0 | 78.0 |
| 25% | 35.0 | 79.0 |
| 50% | 43.0 | 83.0 |
| 75% | 46.0 | 87.0 |
| max | 55.0 | 99.0 |


> Quartiles divide the data into 4 equal segments; the cut points Q1, Q2, and Q3 correspond to the 25th, 50th, and 75th percentiles.
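
The quartiles can also be computed directly with `np.percentile`. The sketch below reuses the array from the percentile cell; the interquartile range (Q3 - Q1) is shown as an extra measure of spread and is not part of the original notebook.

```python
import numpy as np

arr = np.array([4, 5, 6, 6, 7, 8, 2, 3, 4, 5, 5, 10, 9, 9, 7, 9])

# Q1, Q2 (median), and Q3 in one call
q1, q2, q3 = np.percentile(arr, [25, 50, 75])
iqr = q3 - q1   # interquartile range: spread of the middle 50% of the data

print(q1, q2, q3)   # 4.75 6.0 8.25
print(iqr)          # 3.5
```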

### Measures of Shape

Skewness
- A measure of the asymmetry of the data's distribution.
- It measures how far, and in which direction, the data deviates from a symmetric distribution.
- A normal distribution has zero skewness.


![meanmedianmode.png](https://s3.us-east-1.amazonaws.com/static2.simplilearn.com/lms/testpaper_images/ADSP/Advanced_Statistics/Probimages/Statistics_Fundamentals/Statistics_Notebookupdated/meanmedianmode.png)

In [20]:
df['age'].skew()

-0.12192809131980528

- Positive skew (longer right tail): value > 0
- Negative skew (longer left tail): value < 0
- Symmetric distribution (e.g. normal): value = 0 or very close to it
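
The sign convention can be checked on a few made-up series. This is a sketch with hypothetical values chosen so that one series has a long right tail, one is its mirror image, and one is symmetric.

```python
import pandas as pd

right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 20])          # one large value stretches the right tail
left_skewed = pd.Series([-20, -3, -3, -3, -2, -2, -1])    # mirror image: long left tail
symmetric = pd.Series([1, 2, 3, 4, 5])                    # evenly spaced around the mean

print(right_skewed.skew())   # positive
print(left_skewed.skew())    # negative
print(symmetric.skew())      # 0 (or extremely close to it)
```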

Kurtosis

![positivenegativekurt.PNG](https://s3.us-east-1.amazonaws.com/static2.simplilearn.com/lms/testpaper_images/ADSP/Advanced_Statistics/Probimages/Statistics_Fundamentals/Statistics_Notebookupdated/positivenegativekurt.PNG)
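
The notebook shows only a figure for kurtosis. As a minimal sketch, assuming the usual reading that kurtosis describes how heavy the tails of a distribution are relative to a normal distribution, pandas `Series.kurt()` can be used to compare two made-up series. It reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0, heavier tails score above 0, and flatter, lighter-tailed data scores below 0.

```python
import pandas as pd

# Hypothetical data: evenly spread values (flat shape, light tails)
flat = pd.Series(range(1, 21))

# Hypothetical data: mostly identical values plus two extreme ones (heavy tails)
heavy_tailed = pd.Series([0] * 18 + [-50, 50])

print(flat.kurt())           # negative excess kurtosis
print(heavy_tailed.kurt())   # positive excess kurtosis
```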

### Covariance and Correlation
- The previous measures fall under the univariate analysis category.
- There's also multivariate analysis, such as covariance and correlation.
- They analyze the relationship (or dependency) between 2 or more variables.
- They apply to numeric data only.

- Covariance gives the direction of the relationship. For example, given x and y columns:
    - Negative means if x goes up, y goes down (and vice versa).
    - Positive means if x goes up, y goes up too.
- Correlation:
    - It's the more commonly used measure of the two because it captures both the direction and the strength (magnitude) of the relationship.
    - The value calculated for correlation is called the Pearson correlation coefficient.
    - The value ranges from -1 to +1:
        - If the value is 0: no linear relationship.
        - If the value is close to -1: high negative correlation (if x goes up, y goes down).
        - If the value is close to +1: high positive correlation (if x goes up, y goes up too).
    - As a common rule of thumb, any value higher than 0.6 or lower than -0.6 is considered a high correlation.
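
The link between the two measures can be sketched with NumPy: the Pearson coefficient is the covariance divided by the product of the two standard deviations, which removes the units and bounds the result between -1 and +1. The data and variable names below are made up for illustration.

```python
import numpy as np

# Hypothetical paired observations
hours_studied = np.array([1, 2, 3, 4, 5])
exam_score = np.array([52, 55, 61, 70, 72])

# Sample covariance (np.cov divides by n - 1 by default)
cov_xy = np.cov(hours_studied, exam_score)[0, 1]

# Pearson r = covariance / (std_x * std_y), using the same n - 1 convention
r_manual = cov_xy / (np.std(hours_studied, ddof=1) * np.std(exam_score, ddof=1))
r_numpy = np.corrcoef(hours_studied, exam_score)[0, 1]

print(cov_xy)              # 13.75: positive, so the variables move together
print(r_manual, r_numpy)   # both give the same coefficient, close to +1
```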

In [22]:
#importing csv
df = pd.read_csv('/Users/bassel_instructor/Documents/Datasets/school_grades.csv')
df.head()

| | gender | class group | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | male | group A | high school | standard | completed | 67 | 67 | 63 |
| 1 | female | group D | some high school | free/reduced | none | 40 | 59 | 55 |
| 2 | male | group E | some college | free/reduced | none | 59 | 60 | 50 |
| 3 | male | group B | high school | standard | none | 77 | 78 | 68 |
| 4 | male | group E | associate's degree | standard | completed | 78 | 73 | 68 |


In [24]:
df.corr(numeric_only=True) # correlation matrix

| | math score | reading score | writing score |
|---|---|---|---|
| math score | 1.0 | 0.819398 | 0.805944 |
| reading score | 0.819398 | 1.0 | 0.954274 |
| writing score | 0.805944 | 0.954274 | 1.0 |


- Math and reading scores have a high positive correlation (0.82).
- Reading and writing scores have an even higher correlation (0.95).

In [25]:
data = {
            'income':[10000, 20000, 15000, 25000],
            'age': [25, 34, 27, 37],
            'health_score': [95, 60, 90, 57],
            'movies_watched': [7,4,4,9] 
}

df = pd.DataFrame(data)
df

| | income | age | health_score | movies_watched |
|---|---|---|---|---|
| 0 | 10000 | 25 | 95 | 7 |
| 1 | 20000 | 34 | 60 | 4 |
| 2 | 15000 | 27 | 90 | 4 |
| 3 | 25000 | 37 | 57 | 9 |


In [26]:
df.corr(numeric_only=True)

| | income | age | health_score | movies_watched |
|---|---|---|---|---|
| income | 1.0 | 0.977525 | -0.940153 | 0.316228 |
| age | 0.977525 | 1.0 | -0.987 | 0.33548 |
| health_score | -0.940153 | -0.987 | 1.0 | -0.233988 |
| movies_watched | 0.316228 | 0.33548 | -0.233988 | 1.0 |


**Observations**
- income and age have a high positive correlation (0.98).
- health_score and age have a high negative correlation (-0.99).
- income and movies_watched have a low correlation (0.32), as do age and movies_watched (0.34).
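
Rather than reading the matrix by eye, the 0.6 rule of thumb mentioned earlier can be applied programmatically. This is one possible sketch, not the notebook's own method: it rebuilds the same small dataset, keeps each pair of columns once, and filters by absolute correlation. The names `upper`, `pairs`, and `strong` are illustrative.

```python
import numpy as np
import pandas as pd

data = {
    'income': [10000, 20000, 15000, 25000],
    'age': [25, 34, 27, 37],
    'health_score': [95, 60, 90, 57],
    'movies_watched': [7, 4, 4, 9],
}
df = pd.DataFrame(data)
corr = df.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs; masked cells are NaN and never pass the filter below
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.6]

print(strong)
```

With this data, the three pairs among income, age, and health_score pass the threshold, while every pair involving movies_watched is filtered out.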