# Descriptive Statistics

Descriptive statistics summarizes and organizes data so that it can be easily understood. Unlike inferential statistics, it describes only the data in the sample and does not try to draw inferences about the wider population. As a result, descriptive statistics generally does not rely on probability theory.

Descriptive statistical analysis helps us understand the data and is an important part of machine learning. Machine learning is about making predictions, whereas statistics is about drawing conclusions from data, which is a necessary first step. In this session we will learn the most important concepts of descriptive statistics. They will help us better understand what our data is telling us, which should lead to better machine learning models and a better understanding overall.

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [2]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
print(x)
print(x_with_nan)

[8.0, 1, 2.5, 4, 28.0]
[8.0, 1, 2.5, nan, 4, 28.0]


In [3]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
print(y)
print(y_with_nan)
print(z_with_nan)

[ 8.   1.   2.5  4.  28. ]
[ 8.   1.   2.5  nan  4.  28. ]
0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64


In [4]:
mean_ = sum(x) / len(x)
mean_

8.7

In [5]:
mean_ = statistics.mean(x)
print(mean_)

8.7


In [6]:
mean_ = statistics.mean(x_with_nan)
print(mean_)

nan


In [8]:
statistics.median(x_with_nan)

6.0

In [9]:
np.mean(y)


8.7

In [10]:
y.mean()

8.7

In [12]:
np.nanmean(y_with_nan)

8.7

In [13]:
mean_ = z.mean()
mean_

8.7

In [14]:
z_with_nan.mean()

8.7

## Weighted Mean

The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that lets us define the relative contribution of each data point to the result.
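
Written as a formula, with weights $w_i$ and data points $x_i$ (symbols chosen here for illustration; they mirror the `w` and `x` lists used in the cells of this section), the weighted mean is:

$$
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
$$

When the weights sum to 1, the denominator drops out and the result is simply $\sum_i w_i x_i$.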

In [15]:

0.2 * 2 + 0.5 * 4 + 0.3 * 8

4.8

In [16]:
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
print(wmean)

wmean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
print(wmean)

6.95
6.95


In [17]:
y, z, w = np.array(x), pd.Series(x), np.array(w)

wmean = np.average(y, weights=w)
print(wmean)

wmean = np.average(z, weights=w)
print(wmean)

6.95
6.95


In [18]:
w * y

array([0.8 , 0.2 , 0.75, 1.  , 4.2 ])

In [20]:
w.tolist() * y

array([0.8 , 0.2 , 0.75, 1.  , 4.2 ])

In [21]:
sum(w * y)/sum(w)

6.95

## Harmonic Mean

Technically, the harmonic mean is the reciprocal of the average of the reciprocals.

The reciprocal of a value is 1/value.

In [22]:
print(x)
hmean = len(x) / sum(1 / item for item in x)
hmean

[8.0, 1, 2.5, 4, 28.0]


2.7613412228796843

In [23]:
hmean = statistics.harmonic_mean(x)
hmean

2.7613412228796843

In [25]:
print(y)
scipy.stats.hmean(y)

[ 8.   1.   2.5  4.  28. ]


2.7613412228796843

In [26]:
print(z)
scipy.stats.hmean(z)

0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64


2.7613412228796843
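
A classic use of the harmonic mean is averaging rates. The sketch below uses a hypothetical trip (the speeds and distance are made up for illustration) to show that the average speed over two equal-length legs equals the harmonic mean of the two speeds, not their arithmetic mean:

```python
import statistics

# Hypothetical trip: the same distance is driven at 60 km/h out and 40 km/h back.
speeds = [60, 40]
distance = 120  # km per leg; any positive value gives the same average speed

# Average speed = total distance / total time.
total_time = sum(distance / s for s in speeds)
avg_speed = len(speeds) * distance / total_time

harmonic = statistics.harmonic_mean(speeds)
arithmetic = statistics.mean(speeds)

print(avg_speed, harmonic, arithmetic)
```

The first two printed values agree, while the arithmetic mean overstates the average speed.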

## Geometric Mean

In [30]:
print(x)
gmean = 1

for item in x:
    gmean *= item

gmean **= 1 / len(x)
gmean

[8.0, 1, 2.5, 4, 28.0]


4.677885674856041

In [28]:
print(y)
scipy.stats.gmean(y)

[ 8.   1.   2.5  4.  28. ]


4.67788567485604

In [29]:
print(z)
scipy.stats.gmean(z)

0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64


4.67788567485604

## Median

The sample median is the middle element of the sorted dataset. When the number of elements is even, it is the mean of the two middle elements. The dataset can be sorted in ascending or descending order.

In [32]:
sorted(x)

[1, 2.5, 4, 8.0, 28.0]

In [31]:
n = len(x)
if n % 2:
    median_ = sorted(x)[round(0.5*(n-1))]
else:
    x_ord, index = sorted(x), round(0.5 * n)
    median_ = 0.5 * (x_ord[index-1] + x_ord[index])

median_

4

In [33]:
x

[8.0, 1, 2.5, 4, 28.0]

In [34]:
x[:-1]

[8.0, 1, 2.5, 4]

In [38]:
sorted(x[:-1])

[1, 2.5, 4, 8.0]

In [35]:
statistics.median_low(x[:-1])

2.5

In [36]:
statistics.median_high(x[:-1])

4

In [37]:
statistics.median(x[:-1])

3.25

In [39]:
print(sorted(x_with_nan))
print(statistics.median(x_with_nan))
print(statistics.median_low(x_with_nan))
print(statistics.median_high(x_with_nan))

[1, 2.5, 4, 8.0, nan, 28.0]
6.0
4
8.0
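
The output above shows that `sorted()` does not put `nan` in a well-defined position, because every comparison with `nan` is false. The median functions in `statistics` therefore return values that depend on where the `nan` happens to sit. One possible workaround, sketched here as an assumption rather than as the notebook's own approach, is to drop the missing values first with `math.isnan()`. The names `x_clean` and `median_clean` are introduced only for this illustration.

```python
import math
import statistics

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Keep only the entries that are not NaN before computing the median.
x_clean = [item for item in x_with_nan if not math.isnan(item)]

median_clean = statistics.median(x_clean)
print(x_clean)
print(median_clean)
```

After filtering, the result matches the median of the original `x` computed earlier in this section.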


## Mode

The sample mode is the value in the dataset that occurs most frequently. If there is no single such value, the dataset is multimodal because it has several modal values. For example, in the set containing the points 2, 3, 2, 8, and 12, the number 2 is the mode because it occurs twice, unlike the other items, which appear only once.

In [40]:
u = [2, 3, 2, 8, 12]

v = [12, 15, 12, 15, 21, 15, 12]

mode_ = max((u.count(item), item) for item in set(u))[1]
mode_

2

In [41]:
mode_ = statistics.mode(u)
mode_

2
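
For multimodal data, `statistics.mode()` returns only a single value. As a sketch, assuming Python 3.8 or later, `statistics.multimode()` returns all modal values in the order they are first encountered. The list `v` is the same one defined above; the name `modes` is introduced only for this illustration.

```python
import statistics

v = [12, 15, 12, 15, 21, 15, 12]

# 12 and 15 both occur three times, so both are modes.
modes = statistics.multimode(v)
print(modes)
```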

In [42]:
u, v = np.array(u), np.array(v)

mode_ = scipy.stats.mode(u)
mode_


ModeResult(mode=array([2]), count=array([2]))

In [43]:
mode_ = scipy.stats.mode(v)
mode_


ModeResult(mode=array([12]), count=array([3]))

In [44]:
print(mode_.mode)
print(mode_.count)

[12]
[3]


In [45]:
u, v, w = pd.Series(u), pd.Series(v), pd.Series([2, 2, math.nan])

print(u.mode())

print(v.mode())

print(w.mode())

0    2
dtype: int64
0    12
1    15
dtype: int64
0    2.0
dtype: float64


# Measures of Variability

Measures of central tendency are not enough to describe the data. We also need measures of variability that measure the spread of data points. In this section, we will learn how to identify and calculate measures of variability:

- Variance
- Standard deviation
- Skewness
- Percentiles
- Ranges

## Variance

In [46]:
print(x)
n = len(x)

mean_ = sum(x) / n

var_ = sum((item - mean_)**2 for item in x) / (n - 1)
var_

[8.0, 1, 2.5, 4, 28.0]


123.19999999999999

In [47]:
var_ = statistics.variance(x)
var_

123.2

In [48]:
print(y)
var_ = np.var(y, ddof=1)
var_

[ 8.   1.   2.5  4.  28. ]


123.19999999999999

In [49]:
var_ = y.var(ddof=1)
var_

123.19999999999999

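It is important to pass ddof=1 to NumPy's variance functions. ddof stands for delta degrees of freedom: the divisor becomes n - ddof, so ddof=1 gives the sample variance (dividing by n - 1), whereas NumPy's default, ddof=0, gives the population variance (dividing by n). The pandas var() method already uses ddof=1 by default.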

In [50]:
z.var(ddof=1)

123.19999999999999
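
To make the effect of `ddof` concrete, here is a minimal pure-Python sketch. The helper `variance_with_ddof` is hypothetical and written only for this illustration; it is compared against `statistics.pvariance()` and `statistics.variance()` from the standard library.

```python
import statistics

def variance_with_ddof(data, ddof=0):
    """Hypothetical helper: sum of squared deviations divided by n - ddof."""
    n = len(data)
    mean_ = sum(data) / n
    return sum((item - mean_) ** 2 for item in data) / (n - ddof)

x = [8.0, 1, 2.5, 4, 28.0]

pop_var = variance_with_ddof(x, ddof=0)     # divides by n
sample_var = variance_with_ddof(x, ddof=1)  # divides by n - 1

print(pop_var, statistics.pvariance(x))
print(sample_var, statistics.variance(x))
```

With only five data points the two divisors give noticeably different results; the gap shrinks as n grows.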

## Standard Deviation

In [51]:
std_ = var_ ** 0.5
std_

11.099549540409285

In [58]:
print(x)
print(type(x))
std_ = statistics.stdev(x)
std_

[8.0, 1, 2.5, 4, 28.0]
<class 'list'>


11.099549540409287

In [59]:
print(y)
print(type(y))
np.std(y, ddof=1)

[ 8.   1.   2.5  4.  28. ]
<class 'numpy.ndarray'>


11.099549540409285

In [60]:
print(z)
print(type(z))
z.std(ddof=1)

0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64
<class 'pandas.core.series.Series'>


11.099549540409285

## Skewness


In [61]:
x = [8.0, 1, 2.5, 4, 28.0]

n = len(x)

mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n-1)
std_ = var_ ** 0.5

skew_ = (sum((item - mean_)**3 for item in x) * n / ((n - 1) * (n - 2) * std_**3))

In [62]:
skew_

1.9470432273905929
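
The expression used to compute `skew_` above corresponds to the adjusted Fisher-Pearson sample skewness. Written out, with $\bar{x}$ for the sample mean and $s$ for the sample standard deviation (computed with $n - 1$ in the denominator, as in the code), it reads:

$$
G_1 = \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}
$$

The symbol $G_1$ is a common name for this statistic and is used here only as a label.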

In [68]:
print(y,y_with_nan)

y, y_with_nan = np.array(x), np.array(x_with_nan)

print(y,y_with_nan)

scipy.stats.skew(y, bias=False)


[ 8.   1.   2.5  4.  28. ] [ 8.   1.   2.5  nan  4.  28. ]
[ 8.   1.   2.5  4.  28. ] [ 8.   1.   2.5  nan  4.  28. ]


1.9470432273905927

In [69]:
scipy.stats.skew(y_with_nan, bias=False)

nan

In [70]:
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

z.skew()

1.9470432273905924

In [71]:
z_with_nan.skew()

1.9470432273905924

## Percentiles

In [72]:
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

In [73]:
statistics.quantiles(x, n=2)

[8.0]

In [74]:
statistics.quantiles(x, n=4, method='inclusive')

[0.1, 8.0, 21.0]

In [75]:
y = np.array(x)
np.percentile(y, 5)

-3.44

In [76]:
np.percentile(y, 95)

34.919999999999995

In [77]:
np.percentile(y, [25, 50, 75])

array([ 0.1,  8. , 21. ])

In [78]:
np.median(y)

8.0

In [79]:
y_with_nan = np.insert(y, 2, np.nan)
y_with_nan

array([-5. , -1.1,  nan,  0.1,  2. ,  8. , 12.8, 21. , 25.8, 41. ])

In [80]:
np.quantile(y, 0.05)

-3.44

In [81]:
np.quantile(y, 0.95)

34.919999999999995

In [82]:
np.quantile(y, [0.25, 0.5, 0.75])

array([ 0.1,  8. , 21. ])

In [83]:
np.nanquantile(y_with_nan, [0.25, 0.5, 0.75])

array([ 0.1,  8. , 21. ])

In [90]:
z, z_with_nan = pd.Series(y), pd.Series(y_with_nan)
print(sorted(z),sorted(z_with_nan))
z.quantile(0.05)

[-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0] [-5.0, -1.1, nan, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]


-3.44

In [85]:
z.quantile(0.95)

34.919999999999995

In [86]:
z.quantile([0.25, 0.5, 0.75])

0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64

In [87]:
z_with_nan.quantile([0.25, 0.5, 0.75])

0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64

## Ranges
The range of the data is the difference between the maximum and minimum elements in the dataset. We can get it with the function np.ptp():



In [91]:
print(y)
np.ptp(y)

[-5.  -1.1  0.1  2.   8.  12.8 21.  25.8 41. ]


46.0

In [92]:
print(z)
np.ptp(z)

0    -5.0
1    -1.1
2     0.1
3     2.0
4     8.0
5    12.8
6    21.0
7    25.8
8    41.0
dtype: float64


46.0

In [93]:
print((y_with_nan))
np.ptp(y_with_nan)

[-5.  -1.1  nan  0.1  2.   8.  12.8 21.  25.8 41. ]


nan

In [94]:
print(z_with_nan)
np.ptp(z_with_nan)

0    -5.0
1    -1.1
2     NaN
3     0.1
4     2.0
5     8.0
6    12.8
7    21.0
8    25.8
9    41.0
dtype: float64


nan

In [95]:
np.amax(y) - np.amin(y)

46.0

In [96]:
np.nanmax(y_with_nan) - np.nanmin(y_with_nan)

46.0

In [97]:
y.max() - y.min()

46.0

In [98]:
z.max() - z.min()

46.0

In [99]:
z_with_nan.max() - z_with_nan.min()

46.0

### Interquartile range

The interquartile range is the difference between the first and third quartiles. Once we have computed the quartiles, we can take their difference:

In [100]:
quartiles = np.quantile(y, [0.25, 0.75])
quartiles[1] - quartiles[0]

20.9

In [101]:
quartiles = z.quantile([0.25, 0.75])
quartiles[0.75] - quartiles[0.25]

20.9
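
The same quantity can be sketched with the standard library alone. This assumes Python 3.8 or later for `statistics.quantiles()`. The `'inclusive'` method is chosen because it reproduced the NumPy quartiles for this dataset earlier in the Percentiles section. The names `q1`, `q3`, and `iqr` are introduced only for this illustration.

```python
import statistics

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

# quantiles(n=4) returns the three cut points: Q1, median, Q3.
q1, _, q3 = statistics.quantiles(x, n=4, method='inclusive')
iqr = q3 - q1
print(iqr)
```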

# Measures of Correlation Between Pairs of Data

In [102]:
x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

In [103]:
x_, y_ = np.array(x), np.array(y)

In [104]:
x__, y__ = pd.Series(x_), pd.Series(y_)

Sample covariance is a measure that quantifies the strength and direction of the relationship between a pair of variables:

- If the correlation is positive, then the covariance is positive too. A stronger relationship corresponds to a higher covariance.
- If the correlation is negative, then the covariance is negative too. A stronger relationship corresponds to a lower (or higher in absolute value) covariance.
- If the correlation is weak, then the covariance is close to zero.


In [106]:
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))
        / (n - 1))
cov_xy

19.95

In [107]:
cov_matrix = np.cov(x_, y_)
cov_matrix

array([[38.5       , 19.95      ],
       [19.95      , 13.91428571]])

In [108]:
x_.var(ddof=1)


38.5

In [109]:
y_.var(ddof=1)

13.914285714285711

In [111]:
cov_xy = cov_matrix[0, 1]
cov_xy

19.95

In [113]:
cov_xy = x__.cov(y__)
cov_xy

19.95

In [115]:
cov_xy = cov_matrix[1, 0]
cov_xy

19.95

In [114]:
cov_xy = y__.cov(x__)
cov_xy

19.95

## Correlation coefficient
The correlation coefficient, or Pearson product-moment correlation coefficient, is denoted by the symbol 𝑟. It is another measure of the correlation between data. We can think of it as a standardized covariance. Here are some key facts about it:

- The value 𝑟 > 0 indicates positive correlation.
- The value 𝑟 < 0 indicates negative correlation.
- The value 𝑟 = 1 is the maximum possible value of 𝑟. It corresponds to a perfect positive linear relationship between variables.
- The value 𝑟 = −1 is the minimum possible value of 𝑟. It corresponds to a perfect negative linear relationship between variables.
- The value 𝑟 ≈ 0, that is, 𝑟 around zero, means that the correlation between variables is weak.
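
To illustrate what "standardized covariance" means, the sketch below rescales one variable and compares the results. The helper names `sample_cov` and `pearson_r` are hypothetical and written in plain Python for this illustration. The data matches the `x` and `y` lists defined at the start of this section.

```python
def sample_cov(a, b):
    """Sample covariance with n - 1 in the denominator."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

def pearson_r(a, b):
    """Covariance divided by the product of the sample standard deviations."""
    return sample_cov(a, b) / (sample_cov(a, a) ** 0.5 * sample_cov(b, b) ** 0.5)

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
y_scaled = [100 * item for item in y]  # same data expressed in different units

print(sample_cov(x, y), sample_cov(x, y_scaled))  # covariance grows 100-fold
print(pearson_r(x, y), pearson_r(x, y_scaled))    # r is unchanged up to rounding
```

Covariance depends on the units of the variables, while 𝑟 does not, which is why 𝑟 is easier to compare across datasets.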

In [116]:
var_x = sum((item - mean_x)**2 for item in x) / (n - 1)
var_y = sum((item - mean_y)**2 for item in y) / (n - 1)
std_x, std_y = var_x ** 0.5, var_y ** 0.5
r = cov_xy / (std_x * std_y)
r

0.861950005631606

In [117]:
r, p = scipy.stats.pearsonr(x_, y_)
r

0.8619500056316061

In [118]:
p

5.122760847201132e-07

In [119]:
corr_matrix = np.corrcoef(x_, y_)
corr_matrix

array([[1.        , 0.86195001],
       [0.86195001, 1.        ]])

In [120]:
r = corr_matrix[0, 1]
r

0.8619500056316061

In [121]:
r = corr_matrix[1, 0]
r

0.861950005631606

In [122]:
scipy.stats.linregress(x_, y_)

LinregressResult(slope=0.5181818181818181, intercept=5.714285714285714, rvalue=0.861950005631606, pvalue=5.122760847201164e-07, stderr=0.06992387660074979, intercept_stderr=0.4234100995002589)

In [123]:
result = scipy.stats.linregress(x_, y_)
r = result.rvalue
r

0.861950005631606

In [124]:
r = x__.corr(y__)
r

0.8619500056316061

In [125]:
r = y__.corr(x__)
r

0.861950005631606

In [126]:
a = np.array(
    [[1, 1, 1],
    [2, 3, 1],
    [4, 9, 2],
    [8, 27, 4],
    [16, 1, 1]])
a

array([[ 1,  1,  1],
       [ 2,  3,  1],
       [ 4,  9,  2],
       [ 8, 27,  4],
       [16,  1,  1]])