In [8]:
import pandas as pd
import numpy as np
from scipy import stats
import wquantiles
from statsmodels import robust

In [2]:
gym = pd.read_csv('/home/satire/PycharmProjects/Statistics/csv/gym_members_exercise_tracking.csv')
gym.head()

Unnamed: 0,Age,Gender,Weight (kg),Height (m),Max_BPM,Avg_BPM,Resting_BPM,Session_Duration (hours),Calories_Burned,Workout_Type,Fat_Percentage,Water_Intake (liters),Workout_Frequency (days/week),Experience_Level,BMI
0,56,Male,88.3,1.71,180,157,60,1.69,1313.0,Yoga,12.6,3.5,4,3,30.2
1,46,Female,74.9,1.53,179,151,66,1.3,883.0,HIIT,33.9,2.1,4,2,32.0
2,32,Female,68.1,1.66,167,122,54,1.11,677.0,Cardio,33.4,2.3,4,2,24.71
3,25,Male,53.2,1.7,190,164,56,0.59,532.0,Strength,28.8,2.1,3,1,18.41
4,38,Male,46.1,1.79,188,158,68,0.64,556.0,Strength,29.2,2.8,3,1,14.39


Mean
* The sum of all values divided by the number of all values
$$
\text{Set of numbers: } \{3, 5, 1, 2\}
$$
$$
\text{The mean is } \frac{3 + 5 + 1 + 2}{4} = \frac{11}{4} = 2.75
$$
Mean formula
$$
\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
$$
$$
n = \text{Total number of records or observations}
$$
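
A minimal sketch of the formula above, applied to the same four numbers. The variable names are illustrative only.

```python
import numpy as np

values = [3, 5, 1, 2]

# Sum of all values divided by the number of values
manual_mean = sum(values) / len(values)

# The same calculation with NumPy
numpy_mean = float(np.mean(values))

print(manual_mean, numpy_mean)
```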

Trimmed mean
* Calculated by dropping a fixed number of sorted values at each end and then taking an average of the remaining values.
Formula
$$
\text{Trimmed mean} = \bar{x} = \frac{\sum_{i = p+1}^{n - p} x_{(i)}}{n - 2p}
$$
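* Here $x_{(i)}$ is the i-th smallest value and $p$ is the number of values dropped from each end.
* Eliminates the influence of extreme values: the trimmed mean excludes the largest and the smallest values (a proportion of 0.1, as in `stats.trim_mean(x, 0.1)`, drops 10% from each end).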
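
A small sketch of the trimmed mean, assuming a made-up sample with one extreme value. The helper `trimmed_mean` is hypothetical and only mirrors the formula; `stats.trim_mean` is the SciPy function used later in these notes.

```python
import numpy as np
from scipy import stats

def trimmed_mean(values, p):
    """Drop the p smallest and p largest values, then average the rest."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = len(ordered)
    if 2 * p >= n:
        raise ValueError("p is too large for the sample size")
    return float(ordered[p:n - p].mean())

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # 100 is an extreme value

plain_mean = float(np.mean(sample))                   # pulled up by 100
manual = trimmed_mean(sample, p=1)                    # drops 1 and 100
scipy_version = float(stats.trim_mean(sample, 0.1))   # 10% of 10 values = 1 per end

print(plain_mean, manual, scipy_version)
```

With ten values, a proportion of 0.1 corresponds to p = 1, so both versions average the values 2 through 9.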

Weighted mean
* Useful when some values are intrinsically more variable than others: highly variable observations are given a lower weight.
* Also useful when the data collected does not equally represent the different groups that we are interested in measuring: underrepresented groups can be given a higher weight.
Formula
$$
\bar{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}
$$
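
A short sketch of the weighted mean formula. The values and weights below are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0, 2.0])  # the last value counts twice as much

# Sum of w_i * x_i divided by the sum of the weights
manual = float((w * x).sum() / w.sum())

# The same calculation with NumPy
with_numpy = float(np.average(x, weights=w))

print(manual, with_numpy)
```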

In [3]:
gym.describe()

Unnamed: 0,Age,Weight (kg),Height (m),Max_BPM,Avg_BPM,Resting_BPM,Session_Duration (hours),Calories_Burned,Fat_Percentage,Water_Intake (liters),Workout_Frequency (days/week),Experience_Level,BMI
count,973.0,973.0,973.0,973.0,973.0,973.0,973.0,973.0,973.0,973.0,973.0,973.0,973.0
mean,38.683453,73.854676,1.72258,179.883864,143.766701,62.223022,1.256423,905.422405,24.976773,2.626619,3.321686,1.809866,24.912127
std,12.180928,21.2075,0.12772,11.525686,14.345101,7.32706,0.343033,272.641516,6.259419,0.600172,0.913047,0.739693,6.660879
min,18.0,40.0,1.5,160.0,120.0,50.0,0.5,303.0,10.0,1.5,2.0,1.0,12.32
25%,28.0,58.1,1.62,170.0,131.0,56.0,1.04,720.0,21.3,2.2,3.0,1.0,20.11
50%,40.0,70.0,1.71,180.0,143.0,62.0,1.26,893.0,26.2,2.6,3.0,2.0,24.16
75%,49.0,86.0,1.8,190.0,156.0,68.0,1.46,1076.0,29.3,3.1,4.0,2.0,28.56
max,59.0,129.9,2.0,199.0,169.0,74.0,2.0,1783.0,35.0,3.7,5.0,3.0,49.84


In [4]:
mean = gym['Age'].mean()
trim_mean = stats.trim_mean(gym['Age'], 0.1)
median = gym['Age'].median()
print(mean)
print(trim_mean)
print(median)

38.68345323741007
38.79204107830552
40.0


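For right-skewed data the mean is typically larger than the trimmed mean, which is larger than the median. For `Age` the order is reversed (38.68 < 38.79 < 40.0), so this ordering should not be assumed in general.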

In [5]:
weighted_mean = np.average(gym['Height (m)'], weights=gym['Weight (kg)'])
weighted_median = wquantiles.median(gym['Height (m)'], weights=gym['Weight (kg)'])
print(weighted_mean)
print(weighted_median)

1.7359640331419441
1.73


Deviations
* The difference between the observed values and the estimate of location. (errors, residuals)
* For a set of numbers
$$
\{1, 4, 4\}
$$
* the mean is 3 and the median is 4.
* the deviations from the mean are the differences:
$$
1 - 3 = -2, \quad 4 - 3 = 1, \quad 4 - 3 = 1.
$$

Mean absolute deviation
* The mean of the absolute values of the deviations from the mean. (l1-norm, Manhattan norm)
* Take the absolute values of the deviations
$$
\{2, 1, 1\}
$$
* their average is
$$
\frac{2 + 1 + 1}{3} \approx 1.33
$$
* Formula
$$
\text{Mean absolute deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}
$$
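
The worked example for the set {1, 4, 4} can be reproduced with a few lines of NumPy. This is a sketch that follows the formula directly.

```python
import numpy as np

x = np.array([1, 4, 4])

# Deviations from the mean (the mean is 3)
deviations = x - x.mean()

# Mean of the absolute deviations
mean_abs_dev = float(np.abs(deviations).mean())

print(deviations, mean_abs_dev)
```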


Variance
* The sum of squared deviations from the mean divided by the number of observations minus 1. (mean-squared error)
* Formula
$$
\text{Variance} = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
$$

Standard deviation
* The square root of the variance.
* Formula
$$
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
$$
* The standard deviation is on the same scale as the original data.
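
A sketch of the variance and standard deviation for the set {1, 4, 4}. One detail worth noting: NumPy divides by n unless `ddof=1` is passed, whereas pandas `Series.std()` divides by n - 1 by default.

```python
import numpy as np

x = np.array([1, 4, 4])

# Squared deviations from the mean are 4, 1, 1; their sum is 6
sample_var = float(np.var(x, ddof=1))   # 6 / (n - 1)
sample_std = float(np.std(x, ddof=1))   # square root of the sample variance

# NumPy's default (ddof=0) divides by n instead of n - 1
population_var = float(np.var(x))       # 6 / n

print(sample_var, sample_std, population_var)
```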

Median absolute deviation from the median
* The median of the absolute values of the deviations from the median.
* Formula
$$
\text{Median absolute deviation} = \text{median}(|x_1 - \text{median}|, |x_2 - \text{median}|, \ldots, |x_n - \text{median}|)
$$
* The MAD is not influenced by extreme values.
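
A sketch of the MAD on a made-up sample with one extreme value, compared with the standard deviation. The scaled version at the end is a common convention (dividing by the 75th percentile of the standard normal, about 0.6745); whether a given library applies it by default is worth checking in its documentation.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an extreme value

center = np.median(x)                           # 3.0
raw_mad = float(np.median(np.abs(x - center)))  # median of {2, 1, 0, 1, 97}
sample_std = float(np.std(x, ddof=1))           # inflated by the extreme value

# Dividing by about 0.6745 makes the MAD comparable to the
# standard deviation when the data are normally distributed.
scaled_mad = float(raw_mad / stats.norm.ppf(0.75))

print(raw_mad, scaled_mad, sample_std)
```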

In [11]:
standard_deviation = gym['Weight (kg)'].std()
iqr = gym['Weight (kg)'].quantile(0.75) - gym['Weight (kg)'].quantile(0.25)
mad = robust.scale.mad(gym['Weight (kg)'])
print(standard_deviation)
print(iqr)
print(mad)

21.20750049840716
27.9
19.866869727975075


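* Here the standard deviation (21.21) is only slightly larger than the MAD (19.87). By default `robust.scale.mad` divides the raw MAD by a constant of about 0.6745, so that it is comparable to the standard deviation for normally distributed data. Extreme values inflate the standard deviation but have little effect on the MAD.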

Range
* The difference between the largest and the smallest value in a data set.

Order statistics
* Metrics based on the data values sorted from smallest to biggest. (ranks)

Percentile
* The value such that P percent of the values take on this value or less and (100 - P) percent take on this value or more. (quantile)

Interquartile range
* The difference between the 75th percentile and the 25th percentile. (IQR)
* Take this set of numbers
$$
\{3,1,5,3,6,7,2,9\}
$$
* We sort them
$$
\{1,2,3,3,5,6,7,9\}
$$
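* Taking the median of the lower half {1, 2, 3, 3} puts the 25th percentile at 2.5, and the median of the upper half {5, 6, 7, 9} puts the 75th percentile at 6.5. So, the IQR is 6.5 - 2.5 = 4.
* Software that interpolates between data points may give slightly different percentiles.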
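
A sketch of the IQR for the same eight numbers, computed two ways: with the median of each half, and with `np.percentile`, whose default linear interpolation places the quartiles slightly differently.

```python
import statistics
import numpy as np

data = sorted([3, 1, 5, 3, 6, 7, 2, 9])
half = len(data) // 2

# Median-of-halves approach
q1 = statistics.median(data[:half])   # median of 1, 2, 3, 3
q3 = statistics.median(data[half:])   # median of 5, 6, 7, 9
iqr_halves = q3 - q1

# NumPy's default linear interpolation between neighbouring values
q1_np, q3_np = np.percentile(data, [25, 75])
iqr_numpy = float(q3_np - q1_np)

print(q1, q3, iqr_halves)
print(q1_np, q3_np, iqr_numpy)
```

Both results are legitimate; they differ only in how a percentile that falls between two data points is defined.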