# ***INTRODUCTION TO STATISTICS***


Welcome to this study on **Statistics for Data Science**!

In this journey, we'll dive into key statistical concepts that are essential for AI and Machine Learning. We'll start with foundational concepts and progressively tackle more advanced topics, including probability, statistical inference, hypothesis testing, and regression.

This notebook covers:
- **Measures of Center**
- **Measures of Spread**


## Introduction

What is Statistics?

- Statistics is the study of how to collect, analyze, and draw conclusions from data. It is a valuable tool for anticipating outcomes and answering a wide range of questions. For example: how likely is someone to purchase your product? How many jeans sizes should you manufacture to fit 95% of the population? Statistics can answer many different types of questions, but identifying which type of statistics a question calls for is essential to drawing accurate conclusions.

- Summary statistics give a condensed view of your data, such as its center and its spread.






There are two types of statistics: descriptive and inferential.

- Descriptive Statistics describes and summarizes our data.

- Inferential statistics is essential for making inferences and drawing conclusions that go beyond the data at hand, typically about a population based on a sample.
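
The distinction between the two types can be made concrete with a minimal sketch. The population here (the integers 1 to 1000) and the sample size of 100 are made up for illustration and are not part of the notebook's dataset. Summarizing the full population is descriptive; estimating the population mean from a random sample is a simple form of inference.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population, chosen only for illustration.
population = list(range(1, 1001))

# Descriptive: summarize the data we actually have (here, all of it).
population_mean = mean(population)

# Inferential: estimate the population mean from a random sample of 100 values.
sample = random.sample(population, 100)
sample_mean = mean(sample)

print(population_mean)  # 500.5
print(sample_mean)      # close to, but usually not exactly, the population mean
```

The sample mean differs from run to run when the seed changes, which is why inferential methods also quantify how uncertain such an estimate is.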


[<img src="https://drive.google.com/uc?id=1nD0D1tIX1gZtqggMmg5-YFpwb2oFKwvA" alt="Image Description" width="800"/>](https://github.com/anhhaibkhn/Data-Science-selfstudy-notes-Blog/blob/master/_notebooks/Introduction%20to%20Statistics%20in%20Python/pdfs/chapter1.pdf)

In [1]:
import gdown

In [2]:
file_id = '1yB5qSBOLl96Y563nIewKOU8RN_gsY3dO'  # Make sure it's a string
gdown.download(f'https://drive.google.com/uc?id={file_id}', 'data.csv', quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1yB5qSBOLl96Y563nIewKOU8RN_gsY3dO
To: d:\clean_ai_engineering\module_2\week8\data.csv
100%|██████████| 52.0k/52.0k [00:00<00:00, 15.2MB/s]


'data.csv'

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Read the downloaded CSV file
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,country,food_category,consumption,co2_emission
0,1,Argentina,pork,10.51,37.2
1,2,Argentina,poultry,38.66,41.53
2,3,Argentina,beef,55.48,1712.0
3,4,Argentina,lamb_goat,1.56,54.63
4,5,Argentina,fish,4.36,6.96


## Measures of Center

Measures of center are statistical values that describe the central tendency of a dataset. They provide a single summary value that represents the entire dataset, allowing you to understand where the majority of your data points lie. Understanding these measures helps you summarize and describe data effectively.

### 1. Mean

- The **mean**, often referred to as the average, is calculated by summing all the values in a dataset and dividing by the number of values. It is a commonly used measure of center.
  
  **Formula:**

$$
\text{Mean} (\mu) = \frac{\sum_{i=1}^{n} x_i}{n}
$$

**Where:**
- $x_i$ represents each value in the dataset.
- $n$ is the total number of values.

**Example:**
For the dataset [5, 10, 15, 20]:

$$
\text{Mean} = \frac{5 + 10 + 15 + 20}{4} = \frac{50}{4} = 12.5
$$



- **Considerations:** The mean can be influenced by outliers. For instance, in the dataset [1, 2, 3, 100], the mean is $26.5$, far higher than three of the four data points, because of the outlier 100.
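
The formula and the outlier effect above can be checked with a few lines of Python. This is a minimal sketch using only the standard library's `statistics` module (the notebook itself uses NumPy later on); the variable names are illustrative.

```python
from statistics import mean

# Mean by the formula: sum of all values divided by the number of values.
values = [5, 10, 15, 20]
mean_manual = sum(values) / len(values)
print(mean_manual)  # 12.5

# A single outlier pulls the mean well above most of the data points.
with_outlier = [1, 2, 3, 100]
mean_outlier = mean(with_outlier)
print(mean_outlier)  # 26.5
```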

### 2. Median

- The **median** is the middle value when the dataset is ordered from least to greatest. If the dataset has an even number of observations, the median is the average of the two middle values. The median is a robust measure of center that is less affected by outliers.
  
  **Steps to Calculate Median:**
  1. Sort the data in ascending order.
  2. If the number of observations $n$ is odd, the median is the middle number.
  3. If $n$ is even, the median is the average of the two middle numbers.

  **Example:**

  For the dataset [1, 3, 6, 3, 7, 9, 8]:
  - Sorted: [1, 3, 3, 6, 7, 8, 9]
  - The median is $6$, the 4th value.

  For the dataset [1, 2, 3, 5, 4, 6]:
  - Sorted: [1, 2, 3, 4, 5, 6]
  - The median is:
    $$
    \text{Median} = \frac{3 + 4}{2} = 3.5
    $$
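
The three steps above translate almost directly into code. The helper below, `median_by_steps`, is a hypothetical name introduced for illustration; it assumes a non-empty list of numbers, and its results are compared against the standard library's `statistics.median`.

```python
from statistics import median

def median_by_steps(data):
    """Median of a non-empty list, following the sort-then-pick-middle steps."""
    ordered = sorted(data)          # step 1: sort in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                  # step 2: odd count, take the middle value
        return ordered[mid]
    # step 3: even count, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median_by_steps([1, 3, 6, 3, 7, 9, 8])
even_result = median_by_steps([1, 2, 3, 5, 4, 6])
print(odd_result)   # 6
print(even_result)  # 3.5

# The library function agrees with the manual version.
print(median([1, 2, 3, 5, 4, 6]))  # 3.5
```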

### 3. Mode

- The **mode** is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode (bimodal or multimodal), or no mode at all if all values occur with the same frequency.

- **Example:** For the dataset [1, 2, 2, 3, 4]:
  - Mode: $2$

  For the dataset [1, 1, 2, 2, 3, 4]:
  - The modes are $1$ and $2$, i.e., the dataset is bimodal.
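
Both examples can be reproduced with `statistics.multimode` from the Python standard library (available since Python 3.8). This is a small sketch, not the notebook's own method. Note one difference in convention: when every value occurs equally often, `multimode` returns all of the values rather than reporting "no mode".

```python
from statistics import multimode

single = multimode([1, 2, 2, 3, 4])
print(single)    # [2]

bimodal = multimode([1, 1, 2, 2, 3, 4])
print(bimodal)   # [1, 2]

# Every value appears once, so all of them tie for "most frequent".
all_tied = multimode([1, 2, 3])
print(all_tied)  # [1, 2, 3]
```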



### Summary of Measures of Center

- **Mean:** Sensitive to outliers, best for symmetric/normal distributions.
- **Median:** Robust against outliers, represents the center in skewed distributions.
- **Mode:** Useful for categorical data, indicates the most common value.




Measures of center matter in data science because they summarize a dataset with a single representative value. That makes it easier to compare datasets, spot trends, and make informed decisions based on data.

[<img src="https://drive.google.com/uc?id=1fYdyU8Ris_FP1FFD_6ianuo8LgNSmJ68" alt="Measures of Center" width="600"/>](https://www.google.com/url?sa=i&url=https%3A%2F%2Ftowardsai.net%2Fp%2Fl%2Fpython-statistical-analysis-measures-of-central-tendency-and-dispersion&psig=AOvVaw3eoQpaEtgizlksZ5lHMCZQ&ust=1730369714671000&source=images&cd=vfe&opi=89978449&ved=0CBQQjRxqFwoTCPij1dTvtYkDFQAAAAAdAAAAABAE)



In [5]:
df

Unnamed: 0.1,Unnamed: 0,country,food_category,consumption,co2_emission
0,1,Argentina,pork,10.51,37.20
1,2,Argentina,poultry,38.66,41.53
2,3,Argentina,beef,55.48,1712.00
3,4,Argentina,lamb_goat,1.56,54.63
4,5,Argentina,fish,4.36,6.96
...,...,...,...,...,...
1425,1426,Bangladesh,dairy,21.91,31.21
1426,1427,Bangladesh,wheat,17.47,3.33
1427,1428,Bangladesh,rice,171.73,219.76
1428,1429,Bangladesh,soybeans,0.61,0.27


In [6]:
# Drop the leftover 'Unnamed: 0' column (a row counter, as the preview above shows)
df.drop('Unnamed: 0', axis = 1, inplace = True)

In [7]:
df

Unnamed: 0,country,food_category,consumption,co2_emission
0,Argentina,pork,10.51,37.20
1,Argentina,poultry,38.66,41.53
2,Argentina,beef,55.48,1712.00
3,Argentina,lamb_goat,1.56,54.63
4,Argentina,fish,4.36,6.96
...,...,...,...,...
1425,Bangladesh,dairy,21.91,31.21
1426,Bangladesh,wheat,17.47,3.33
1427,Bangladesh,rice,171.73,219.76
1428,Bangladesh,soybeans,0.61,0.27


In [8]:
df.describe(include = 'all')

Unnamed: 0,country,food_category,consumption,co2_emission
count,1430,1430,1430.0,1430.0
unique,130,11,,
top,Argentina,pork,,
freq,11,130,,
mean,,,28.110406,74.383993
std,,,49.818044,152.098566
min,,,0.0,0.0
25%,,,2.365,5.21
50%,,,8.89,16.53
75%,,,28.1325,62.5975


In [19]:
# filter for Belgium 

be_consumption = df[df['country'] == 'Belgium']

In [11]:
be_consumption

Unnamed: 0,country,food_category,consumption,co2_emission
396,Belgium,pork,38.65,136.8
397,Belgium,poultry,12.2,13.11
398,Belgium,beef,15.63,482.31
399,Belgium,lamb_goat,1.32,46.23
400,Belgium,fish,18.97,30.29
401,Belgium,eggs,12.59,11.57
402,Belgium,dairy,236.19,336.43
403,Belgium,wheat,111.91,21.34
404,Belgium,rice,8.61,11.02
405,Belgium,soybeans,0.07,0.03


In [14]:
be_consume = be_consumption['consumption']        # consumption values for Belgium only

In [15]:
# Calculate mean and median consumption in Belgium

print(np.mean(be_consume))

42.13272727272727


In [18]:
print(np.median(be_consume))

12.59


In [20]:
# filter for USA
usa_consumption = df[df['country'] == 'USA']

In [21]:
usa_consumption

Unnamed: 0,country,food_category,consumption,co2_emission
55,USA,pork,27.64,97.83
56,USA,poultry,50.01,53.72
57,USA,beef,36.24,1118.29
58,USA,lamb_goat,0.43,15.06
59,USA,fish,12.35,19.72
60,USA,eggs,14.58,13.39
61,USA,dairy,254.69,362.78
62,USA,wheat,80.43,15.34
63,USA,rice,6.88,8.8
64,USA,soybeans,0.04,0.02


In [23]:
us_consume = usa_consumption['consumption']     # consumption values for the USA only

In [24]:
us_consume

55     27.64
56     50.01
57     36.24
58      0.43
59     12.35
60     14.58
61    254.69
62     80.43
63      6.88
64      0.04
65      7.86
Name: consumption, dtype: float64

In [25]:
# Calculate mean and median consumption in USA
print(np.mean(us_consume))


44.650000000000006


In [26]:
print(np.median(us_consume))

14.58
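
The gap between the USA mean (about 44.65) and median (14.58) reflects how strongly one large value can pull the mean. The sketch below copies the eleven `us_consume` values printed above into a plain list and uses the standard library instead of NumPy. Dropping the largest value (254.69, the dairy row) is done purely for illustration, not as a recommended cleaning step.

```python
from statistics import mean, median

# Values copied from the us_consume output above.
us_values = [27.64, 50.01, 36.24, 0.43, 12.35, 14.58,
             254.69, 80.43, 6.88, 0.04, 7.86]

us_mean = mean(us_values)
us_median = median(us_values)
print(round(us_mean, 2))  # 44.65
print(us_median)          # 14.58

# Recompute the mean without the single largest value to see its influence.
largest = max(us_values)
mean_without_largest = mean([v for v in us_values if v != largest])
print(round(mean_without_largest, 3))  # 23.646
```

Removing one value roughly halves the mean, while the median barely depends on how large that value is. This is the sense in which the median is the more robust measure for skewed data.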


In [29]:
df['country'].value_counts()

country
Argentina       11
Australia       11
Albania         11
Iceland         11
New Zealand     11
                ..
Sierra Leone    11
Sri Lanka       11
Indonesia       11
Liberia         11
Bangladesh      11
Name: count, Length: 130, dtype: int64

In [30]:
# Check the mode of the country column
mode_country = df['country'].mode()

In [31]:
print("Mode Country:", mode_country)

Mode Country: 0        Albania
1        Algeria
2         Angola
3      Argentina
4        Armenia
         ...    
125      Uruguay
126    Venezuela
127      Vietnam
128       Zambia
129     Zimbabwe
Name: country, Length: 130, dtype: object
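
The result lists all 130 countries because, as `value_counts()` shows, each country appears exactly 11 times. Every value ties for most frequent, and pandas `Series.mode()` returns all tied values. The same behaviour can be seen on a toy list with the standard library; the three-country list below is a stand-in for illustration, not the real column.

```python
from collections import Counter
from statistics import multimode

# Toy stand-in for the country column: each name appears 11 times.
countries = ["Argentina", "Albania", "Bangladesh"] * 11

counts = Counter(countries)
print(counts["Argentina"], counts["Albania"], counts["Bangladesh"])  # 11 11 11

# With a perfect tie, every value is returned as a mode.
modes = multimode(countries)
print(sorted(modes))  # ['Albania', 'Argentina', 'Bangladesh']
```

For a column like this, the mode carries no information. It becomes useful only when some categories occur more often than others.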
