# Chapter 01 - Exploratory Data Analysis

Dataset link: https://www.kaggle.com/datasets/arjunprasadsarkhel/2021-olympics-in-tokyo

In [7]:
import pandas as pd
import numpy as np


In [4]:
data = pd.read_excel('Data/Medals.xlsx')

  warn("Workbook contains no default style, apply openpyxl's default")


In [8]:
data.head()

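| | Rank | Team/NOC | Gold | Silver | Bronze | Total | Rank by Total |
|---|---|---|---|---|---|---|---|
| 0 | 1 | United States of America | 39 | 41 | 33 | 113 | 1 |
| 1 | 2 | People's Republic of China | 38 | 32 | 18 | 88 | 2 |
| 2 | 3 | Japan | 27 | 14 | 17 | 58 | 5 |
| 3 | 4 | Great Britain | 22 | 21 | 22 | 65 | 4 |
| 4 | 5 | ROC | 20 | 28 | 23 | 71 | 3 |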


## Estimates of Location

**Mean (Average)**:
The mean, or average, is the sum of all values in a dataset divided by the number of values. It is a measure of the central tendency of the data.
- The mean is sensitive to outliers: a single extreme value can shift it substantially.

In [10]:
# Mean
data['Total'].mean()

11.612903225806452
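To make the sensitivity to outliers concrete, here is a minimal sketch. The values are hypothetical toy numbers chosen for illustration, not rows from the dataset.

```python
from statistics import mean

# Hypothetical toy values, not taken from the Olympics dataset.
typical = [2, 3, 4, 5, 6]
with_outlier = typical + [100]   # same values plus one extreme observation

mean_typical = mean(typical)            # 20 / 5
mean_with_outlier = mean(with_outlier)  # 120 / 6

print(mean_typical, mean_with_outlier)
```

One extreme value out of six moves the mean from 4 to 20, well away from where most of the values sit.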

**Weighted Mean (Weighted Average)**:
The weighted mean multiplies each value by a weight, sums the products, and divides by the sum of the weights. The weights represent the importance or significance of each value. In the cell below, each team's total medal count is weighted by its gold medal count.

In [13]:
#Weighted Mean

np.average(data['Total'],weights=data['Gold'])

46.832352941176474
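As a rough check of what `np.average` does with `weights`, the sketch below computes the weighted mean by hand. It uses only the five rows shown by `data.head()` above, so the result differs from the full-dataset value.

```python
import numpy as np

# Total and Gold for the five teams shown by data.head() above.
totals = np.array([113, 88, 58, 65, 71])
golds = np.array([39, 38, 27, 22, 20])

# Weighted mean = sum(weight * value) / sum(weights).
manual = (golds * totals).sum() / golds.sum()
library = np.average(totals, weights=golds)

print(manual, library)
```

Both values agree. The weighted mean is higher than the plain mean of these five totals because the teams with the most gold medals also have the largest totals.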

**Trimmed Mean (Truncated Mean)**:
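The trimmed mean is the average of the values that remain after sorting the data and removing a fixed number or percentage of values from each end. It reduces the impact of outliers on the mean. In the cell below, `trim_mean(..., 0.1)` drops the lowest 10% and the highest 10% of the values before averaging.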

In [22]:
#Trimmed Mean
from scipy.stats import trim_mean
trim_mean(data['Total'],0.1)

6.8933333333333335
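The trimming step can be sketched by hand as follows. The helper `trimmed_mean` and the toy values are hypothetical and written for illustration. The sketch assumes the number of values cut from each end is the proportion times the sample size, rounded down.

```python
import numpy as np

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * ordered.size)   # number of values cut per end
    return ordered[k:ordered.size - k].mean()

# Hypothetical toy values with one extreme observation.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

plain = float(np.mean(values))           # includes the outlier
trimmed = trimmed_mean(values, 0.1)      # drops 1 and 100 before averaging

print(plain, trimmed)
```

With ten values and a proportion of 0.1, one value is removed from each end, so the outlier no longer affects the result.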

**Robust**: A robust statistical measure or method is one that is not sensitive to extreme values or outliers. It gives more stable and reliable results even when outliers are present.

**Median**:
The median is the middle value of a dataset when it is sorted in ascending or descending order. It divides the data into two equal halves. If the number of values is even, the median is the average of the two middle values.

In [21]:
#Median 
data['Total'].median()

4.0
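The odd and even cases described above, and the robustness of the median, can be illustrated with a small sketch. The lists are hypothetical toy values.

```python
from statistics import median

# Hypothetical toy values.
odd_count = [7, 1, 3]             # sorted: 1, 3, 7
even_count = [7, 1, 3, 5]         # sorted: 1, 3, 5, 7
with_outlier = even_count + [1000]

median_odd = median(odd_count)            # the single middle value
median_even = median(even_count)          # average of the two middle values
median_outlier = median(with_outlier)     # barely moves despite the 1000

print(median_odd, median_even, median_outlier)
```

Adding the value 1000 shifts the median only from 4.0 to 5, whereas the mean of the same list would jump above 200.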

**Weighted Median**:
The weighted median is similar to the median, but it takes into account the weight assigned to each value. It is the value at which the cumulative weight of the sorted data reaches half of the total weight, so that half of the sum of the weights lies on either side of it.

In [25]:
# Weighted Median

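def weighted_median(values, weights):
    # Sort the values and carry each weight along with its value.
    values, weights = np.asarray(values), np.asarray(weights)
    keep = weights > 0                      # zero-weight entries cannot be the median
    order = np.argsort(values[keep])
    sorted_values = values[keep][order]
    cum_weights = np.cumsum(weights[keep][order])
    half = cum_weights[-1] / 2.0
    # First position where the cumulative weight reaches half of the total.
    idx = np.searchsorted(cum_weights, half)
    if cum_weights[idx] == half:
        # The weight splits exactly in half here: average the two neighbours.
        return (sorted_values[idx] + sorted_values[idx + 1]) / 2.0
    return sorted_values[idx]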

result = weighted_median(data['Total'], data['Gold'])
print(result)


40
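One way to sanity-check a weighted median, assuming the weights are non-negative integers, is to repeat each value as many times as its weight and take the ordinary median of the expanded data. The sketch below does this for the five rows shown by `data.head()` above, so it is not the full-dataset result.

```python
import numpy as np

# Total and Gold for the five teams shown by data.head() above.
totals = np.array([113, 88, 58, 65, 71])
golds = np.array([39, 38, 27, 22, 20])

# Repeat each total once per gold medal, then take the ordinary median.
# This only works because the weights are non-negative integers.
expanded = np.repeat(totals, golds)
expanded_median = np.median(expanded)

plain_median = np.median(totals)

print(plain_median, expanded_median)
```

The plain median of the five totals is 71, while weighting by gold medals pulls the median up to 88, because the teams with larger totals carry more weight.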


**Percentile (Quantile)**:
The percentile is the value below which a given percentage of the data falls. For example, the 50th percentile is the median, which divides the data into two equal parts.

In [26]:
#Percentile
# 75th percentile
np.percentile(data['Total'], 75)

11.0
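The sketch below shows a few percentiles on hypothetical toy values and confirms that the 50th percentile matches the median. It relies on the default linear interpolation of `np.percentile`.

```python
import numpy as np

# Hypothetical toy values: the integers 1 through 9.
values = np.arange(1, 10)

# 25th, 50th and 75th percentiles (default linear interpolation).
q25, q50, q75 = np.percentile(values, [25, 50, 75])

# The interquartile range is the spread of the middle half of the data.
iqr = q75 - q25

print(q25, q50, q75, iqr)
```

Here the quartiles are 3, 5 and 7, so the middle half of the values spans a range of 4.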

**Outlier (Extreme Value)**:
An outlier or extreme value is a data point that is significantly different from the other values in a dataset. It is an observation that lies an abnormal distance away from other observations. Outliers can sometimes skew the results of statistical analysis.

In [27]:
data.describe()

| | Rank | Gold | Silver | Bronze | Total | Rank by Total |
|---|---|---|---|---|---|---|
| count | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 |
| mean | 46.333333 | 3.655914 | 3.634409 | 4.322581 | 11.612903 | 43.494624 |
| std | 26.219116 | 7.022471 | 6.626339 | 6.210372 | 19.091332 | 24.171769 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 25% | 24.0 | 0.0 | 0.0 | 1.0 | 2.0 | 23.0 |
| 50% | 46.0 | 1.0 | 1.0 | 2.0 | 4.0 | 47.0 |
| 75% | 70.0 | 3.0 | 4.0 | 5.0 | 11.0 | 66.0 |
| max | 86.0 | 39.0 | 41.0 | 33.0 | 113.0 | 77.0 |
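The summary above hints at outliers in `Total`: the maximum (113) is far above the 75% value (11). One common convention, used here as an assumption rather than as the method of this notebook, flags values more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch using the quartiles from the table and the five totals from `data.head()`:

```python
# Quartiles of the Total column, read from the describe() output above.
q1, q3 = 2.0, 11.0
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Totals of the five teams shown by data.head().
top_totals = [113, 88, 58, 65, 71]
flagged = [t for t in top_totals if t < lower_fence or t > upper_fence]

print(lower_fence, upper_fence, flagged)
```

Under this convention the upper fence is 24.5, so all five leading teams count as extreme values relative to the bulk of the table.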


## Estimation of Variability

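Variability, also called dispersion, describes how tightly the values cluster or how widely they spread. It is at the heart of statistics, and much of the information in a dataset can be gleaned from it.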

**Deviations**: The differences between the observed values and the estimate of location. Also called errors or residuals.

**Variance**: The sum of the squared deviations from the mean, divided by n-1, where n is the number of values.

In [30]:
# Variance

from statistics import variance
variance(data['Gold'])

49.315100514259
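The definition can be followed step by step. The sketch below uses the gold medal counts of the five teams shown by `data.head()`, so the result differs from the full-dataset value, and compares the hand computation with `statistics.variance`.

```python
from statistics import variance

# Gold medal counts of the five teams shown by data.head() above.
golds = [39, 38, 27, 22, 20]

n = len(golds)
mean_gold = sum(golds) / n

# Deviations from the mean, squared and summed.
squared_deviations = [(g - mean_gold) ** 2 for g in golds]

# Sample variance: divide by n - 1 rather than n.
manual_variance = sum(squared_deviations) / (n - 1)
library_variance = variance(golds)

print(manual_variance, library_variance)
```

Both computations give about 78.7 for these five values.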

**Standard Deviation**: The square root of the variance. It is expressed in the same units as the original data.

In [31]:
# Standard Deviation 

from statistics import stdev
stdev(data['Gold'])

7.022471111671375

**Mean Absolute Deviation**: The mean of the absolute values of the deviations from the mean. Also known as the L1 norm or Manhattan norm.

In [34]:
from numpy import mean, absolute

mean(absolute(data['Gold']-mean(data['Gold'])))

4.0048560527228565
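The same computation, broken into steps, looks roughly like this. It uses the gold medal counts of the five teams shown by `data.head()`, so the number differs from the full-dataset result.

```python
import numpy as np

# Gold medal counts of the five teams shown by data.head() above.
golds = np.array([39, 38, 27, 22, 20])

# Step 1: deviations from the mean.
deviations = golds - golds.mean()

# Step 2: absolute values, so positive and negative deviations do not cancel.
absolute_deviations = np.abs(deviations)

# Step 3: average the absolute deviations.
mean_abs_dev = absolute_deviations.mean()

print(mean_abs_dev)
```

Without the absolute value in step 2, the deviations from the mean would sum to zero.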

**Mean Absolute Deviation from the Median**: The mean of the absolute values of the deviations from the median.

In [35]:
from numpy import absolute, mean, median

mean(absolute(data['Gold']-median(data['Gold'])))

3.2580645161290325
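Centring on the median never gives a larger mean absolute deviation than centring on the mean, which matches the two results above (about 3.26 versus 4.00). A small sketch on the gold medal counts of the five teams shown by `data.head()`:

```python
import numpy as np

# Gold medal counts of the five teams shown by data.head() above.
golds = np.array([39, 38, 27, 22, 20])

# Mean absolute deviation around the median and around the mean.
around_median = np.abs(golds - np.median(golds)).mean()
around_mean = np.abs(golds - golds.mean()).mean()

print(around_median, around_mean)
```

For these five values the deviation around the median is 7.0, slightly below the 7.44 obtained around the mean.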