# DESCRIPTIVE STATISTICS
* descriptive: summarizes the data itself, e.g. mean, median, and so on
* inferential: draws conclusions and makes predictions from the data
* population: the complete set of elements
* sample: a subset of the population
* outliers: data points whose values lie far from the typical values
* a large standard deviation means the data are spread far from the mean, and vice versa
* the standard deviation is the square root of the variance
* use the standard deviation for interpretation
* the variance is in squared units, so it is used mainly for calculation
* range = highest value - lowest value
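
As a minimal sketch of the last few bullets, the snippet below uses the standard library `statistics` module on a small made-up list of ages (an assumption for illustration only). It checks that the standard deviation is the square root of the variance and that the range is the maximum minus the minimum.

```python
import math
import statistics

# Hypothetical sample (illustration only): ages in years
ages = [0, 4, 8]

variance = statistics.variance(ages)   # sample variance, in years squared
stdev = statistics.stdev(ages)         # sample standard deviation, in years
data_range = max(ages) - min(ages)     # highest value - lowest value

print(variance, stdev, data_range)
print(math.isclose(stdev, math.sqrt(variance)))
```

Because the standard deviation is in the same unit as the data (years), it is easier to interpret than the variance (years squared).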

# CALCULATING DESCRIPTIVE STATISTICS

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [2]:
math.nan

nan

In [3]:
np.nan

nan

In [4]:
x = [8, 1, 2.5, 4, 28]
x_with_nan = [8, 1, 2.5, math.nan, 4, 28]
print(x)
print(x_with_nan)

[8, 1, 2.5, 4, 28]
[8, 1, 2.5, nan, 4, 28]


Create an array (y) and a Series (z) from x.

In [5]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
print(y)
print(y_with_nan)
print(z_with_nan)

[ 8.   1.   2.5  4.  28. ]
[ 8.   1.   2.5  nan  4.  28. ]
0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64


In [6]:
z_with_nan.isna()

0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

# MEAN/AVERAGE
![image.png](attachment:ee42a9f5-bcd9-4f03-8b92-3431768b4118.png)

## ARITHMETIC MEAN
1. x is a list
2. y is an array
3. z is a Series

In [7]:
sum(x)

43.5

In [8]:
len(x)

5

In [9]:
mean_x = sum(x)/len(x)
mean_x

8.7

In [10]:
mean_stats = statistics.mean(x)
print(mean_stats)

8.7


In [11]:
mean_np = np.mean(x)
print(mean_np)

8.7


In [12]:
y

array([ 8. ,  1. ,  2.5,  4. , 28. ])

In [13]:
# array
y.mean()

8.7

In [14]:
# series
z.mean()

8.7

In [15]:
y_with_nan.mean()

nan

In [16]:
np.mean(y_with_nan)

nan

In [17]:
np.nan*3

nan

In [18]:
# ignore NaN: get the mean without considering NaN values
np.nanmean(y_with_nan)

8.7

In [19]:
z_with_nan

0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64

In [20]:
# a Series computes the mean directly, skipping NaN values by default
z_with_nan.mean()

8.7
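
The NaN behavior above can also be reproduced without NumPy or pandas. This is a rough sketch using only the standard library; the variable names `clean` and `mean_ignoring_nan` are made up for the illustration.

```python
import math

x_with_nan = [8, 1, 2.5, math.nan, 4, 28]

# NaN propagates through arithmetic, so the plain mean is NaN
plain_mean = sum(x_with_nan) / len(x_with_nan)

# Keep only the values that are not NaN, then average them
clean = [v for v in x_with_nan if not math.isnan(v)]
mean_ignoring_nan = sum(clean) / len(clean)

print(math.isnan(plain_mean))
print(mean_ignoring_nan)
```

Note that the divisor is the number of non-NaN values (5), not the original length (6), which is the same convention `np.nanmean` follows.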

## WEIGHTED MEAN
![image.png](attachment:2eeeb830-6048-4c97-a27b-e7b4b004617a.png)
* a mean that takes the weight of each value into account
* used when the values carry different weights

In [21]:
x = [8, 1, 2.5, 4, 28]  # values
w = [0.1, 0.2, 0.3, 0.25, 0.15]  # weights

# convert to array and Series
y, z, w = np.array(x), pd.Series(x), np.array(w)

wmean = np.average(y, weights=w)
print(wmean)

wmean = np.average(z, weights=w)
print(wmean)

6.95
6.95


In [22]:
np.mean(x)

8.7

In [23]:
# manual calculation
wmean_manual = sum(wi*xi for wi, xi in zip(w, x))/sum(w)
print(wmean_manual)

6.95


`sum(w)` is the total weight; here the weights sum to 1.
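
A small helper makes the role of `sum(w)` clearer when the weights do not sum to 1. This is only a sketch: the function name `weighted_mean` and the scores and weights are hypothetical, not part of the notebook.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for v, w in zip(values, weights)) / total_weight

# Hypothetical scores (assignment, midterm, final) and their weights
scores = [80, 70, 90]
weights = [2, 3, 5]   # these weights sum to 10, not 1

result = weighted_mean(scores, weights)
print(result)
```

Dividing by the total weight is what lets the same formula work for weights given as fractions, percentages, or counts.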

## HARMONIC MEAN
![image.png](attachment:8dfcfcd3-86e2-418a-b359-ad80c8f89ddb.png)
Often used to average rates, for example:
* 10 km/hour
* 100 kg/day
* 1,000,000 people/year

In [24]:
x

[8, 1, 2.5, 4, 28]

In [25]:
hmean = len(x)/sum(1/item for item in x)
hmean

2.7613412228796843

In [26]:
arit_mean = np.mean(x)
arit_mean

8.7

In [27]:
for item in x:
    print(1/item)

0.125
1.0
0.4
0.25
0.03571428571428571


In [28]:
sum(1/item for item in x)

1.8107142857142857

In [29]:
list(1/item for item in x)

[0.125, 1.0, 0.4, 0.25, 0.03571428571428571]

Example: you travel 10 km at 60 km/h, then the next 10 km at 20 km/h. What is the average speed?

In [30]:
speed = [60,20]
hmean_speed = len(speed)/sum(1/item for item in speed)
hmean_speed

30.0
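
To see why the harmonic mean is the right average here, the sketch below recomputes the average speed from first principles as total distance divided by total time. The variable names are made up for the illustration.

```python
# Two legs of equal distance, taken from the example above
distance_km = 10
speeds_kmh = [60, 20]

# Time spent on each leg (hours) = distance / speed
times_h = [distance_km / v for v in speeds_kmh]

# Average speed = total distance / total time
avg_speed = (distance_km * len(speeds_kmh)) / sum(times_h)

# Harmonic mean of the two speeds
hmean = len(speeds_kmh) / sum(1 / v for v in speeds_kmh)

# Arithmetic mean, for comparison
amean = sum(speeds_kmh) / len(speeds_kmh)

print(avg_speed, hmean, amean)
```

The arithmetic mean (40 km/h) overstates the result because more time is spent on the slow leg than on the fast one.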

Harmonic mean with a library

In [31]:
hmean_speed_stats =statistics.harmonic_mean(speed)
hmean_speed_stats

30.0

In [32]:
hmean_speed_stats2 = scipy.stats.hmean(speed)
hmean_speed_stats2

30.0

In [33]:
hmean_stats = statistics.harmonic_mean(x)
hmean_stats

2.7613412228796843

## GEOMETRIC MEAN
![image.png](attachment:8abbdb7e-fde1-4e53-bcb4-17c22b98c7be.png)
* well suited to quantities that compound
* gives a result closer to what actually happens
* example: a company is founded in 2017 with revenue of 1000 that year; revenue grows 30% in 2018, 50% in 2019, and 40% in 2020

In [34]:
x = 1000
x*=1.3 #1300
x*=1.5 #1950
x*=1.4 #2730
x

2730.0

In [35]:
# arithmetic mean
(30+50+40)/3

40.0

In [36]:
x = 1000
x*=1.4
x*=1.4
x*=1.4
x

2743.9999999999995

In [37]:
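# Geometric mean of the percentage values
# (note: for compound growth, average the growth factors 1.3, 1.5, 1.4 instead)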
gmean = ((30*50*40)**(1/3))
gmean

39.148676411688626

In [38]:
x = 1000
x*=1.3915
x*=1.3915
x*=1.3915
x

2694.322835875

In [39]:
# geometric mean using the scipy library
scipy.stats.gmean([30,50,40])

39.14867641168864
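
The cells above take the geometric mean of the percentages (30, 50, 40), which is why compounding at 39.15% does not land on 2730. A sketch of the usual approach for compound growth is to take the geometric mean of the growth factors instead; the variable names below are illustrative only.

```python
import math

growth_factors = [1.3, 1.5, 1.4]   # +30%, +50%, +40% from the example
start = 1000

# Geometric mean of the factors: n-th root of their product
g = math.prod(growth_factors) ** (1 / len(growth_factors))

# Growing by the same factor g every year reproduces the real final value
final_actual = start * math.prod(growth_factors)
final_from_g = start * g ** len(growth_factors)

print(round((g - 1) * 100, 2))   # average yearly growth in percent
print(round(final_actual, 6), round(final_from_g, 6))
```

With this definition, three years of growth at the average rate give the same final revenue as the actual sequence of 30%, 50%, and 40%.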

# MEDIAN
![image.png](attachment:9ce9bbdc-0e2d-4ba9-be64-5cf71780e20f.png)

In [40]:
# Odd number of data points
a=[1,2,4,8,9]
n=len(a)
n

5

In [41]:
median = a[2]
median

4

In [42]:
# Even number of data points
a = [1,2,4,8]
n=len(a)
n

4

In [43]:
(n+1)/2

2.5

In [44]:
median = (2+4)/2
median

3.0

In [45]:
# using numpy
x = [8,1,2.5,4,28]
c = [2,3,4,5]
np.median(x)

4.0

In [46]:
# manual verification
n = len(x)
if n%2 ==1: # check whether the number of data points is odd or even
    median_ = sorted(x)[round(0.5*(n-1))] # sort ascending, then take the middle element
else: # even
    x_ord, index = sorted(x), round(0.5*n)
    median_=0.5*(x_ord[index-1]+x_ord[index])
median_

4

In [47]:
# manual verification
def median_function(x):
    n = len(x)
    if n%2 ==1: # odd
        median_ = sorted(x)[round(0.5*(n-1))] # take the middle element
    else: # even
        x_ord, index = sorted(x), round(0.5*n)
        median_=0.5*(x_ord[index-1]+x_ord[index])
    return median_
median_function(x)

4

In [48]:
median_function(c)

3.5

In [49]:
sorted(x)

[1, 2.5, 4, 8, 28]

In [50]:
n=len(x)
round(0.5*(n-1))

2

In [51]:
n

5

In [52]:
n=len([1,2,3,4,5,6,7])
round(0.5*(n-1))

3
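
The standard library also covers the median, including variants for even-length data. A short sketch using the same two lists as above (renamed here for clarity):

```python
import statistics

odd_data = [8, 1, 2.5, 4, 28]
even_data = [2, 3, 4, 5]

med_odd = statistics.median(odd_data)         # middle value of the sorted data
med_even = statistics.median(even_data)       # average of the two middle values
med_low = statistics.median_low(even_data)    # lower of the two middle values
med_high = statistics.median_high(even_data)  # higher of the two middle values

print(med_odd, med_even, med_low, med_high)
```

`median_low` and `median_high` can be useful when the median has to be an actual member of the data set.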

# MODE/MODUS
![image.png](attachment:f7201290-441a-4874-a0ee-9e18bbef556c.png)

The value that occurs most often, i.e. the one with the highest frequency.

In [53]:
u = [2,3,2,8,12]
mode_ = max((u.count(i),i)for i in set(u))[1]
mode_

2

In [54]:
u.count(2)

2

In [55]:
[(u.count(i),i)for i in set(u)]

[(1, 8), (2, 2), (1, 3), (1, 12)]

In [56]:
max([(u.count(i),i)for i in set(u)])

(2, 2)

In [57]:
set(u)

{2, 3, 8, 12}

In [58]:
statistics.mode(u)

2

In [59]:
scipy.stats.mode(u)

ModeResult(mode=array([2]), count=array([2]))

In [60]:
scipy.stats.mode(u)[0][0]

2

In [61]:
mode_lib = scipy.stats.mode(u)

In [62]:
print(mode_lib)
print(mode_lib.mode)
print(mode_lib.count)

ModeResult(mode=array([2]), count=array([2]))
[2]
[2]


In [63]:
u

[2, 3, 2, 8, 12]

In [64]:
u_series = pd.Series(u)
u_series.mode()

0    2
dtype: int64

In [65]:
u_array = np.array(u)
statistics.mode(u_array)

2

In [66]:
v = pd.Series([1,2,3,3,3,np.nan,np.nan,np.nan,np.nan])
v.mode()

0    3.0
dtype: float64

In [67]:
statistics.mode(v)

3.0

In [68]:
h = [1,2,3,3,3,np.nan,np.nan,np.nan,np.nan]
statistics.mode(h)

nan

* with a Series, NaN is ignored
* with a list, NaN is taken into account
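
When several values tie for the highest frequency, a single mode hides part of the picture. As a sketch, `statistics.multimode` (available since Python 3.8) returns every most-frequent value; the second list is hypothetical data with a tie.

```python
import statistics

single_mode = [2, 3, 2, 8, 12]
two_modes = [2, 3, 2, 8, 12, 8]   # hypothetical: 2 and 8 both appear twice

modes_a = statistics.multimode(single_mode)
modes_b = statistics.multimode(two_modes)

print(modes_a)
print(modes_b)
```

The modes are returned in the order in which they first appear in the data.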

# MEASURES OF VARIABILITY
## VARIANCE
![image.png](attachment:eeb84f3a-074a-49ab-8989-5177807986ed.png)

In [69]:
x

[8, 1, 2.5, 4, 28]

In [70]:
n = len(x)
mean_var = np.mean(x)
var_ = sum((item-mean_var)**2 for item in x)/(n-1)
var_

123.19999999999999

variance with statistics library

In [71]:
var_stat = statistics.variance(x)
var_stat

123.2

In [72]:
var_np = np.var(x,ddof=1)
var_np

123.19999999999999

In [73]:
z = pd.Series(x)
z.var()

123.19999999999999
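
All the calls above compute the sample variance (divide by n-1). `np.var` divides by n by default, which is why the notebook passes `ddof=1`. The sketch below shows both divisors side by side using only the standard library.

```python
import statistics

x = [8, 1, 2.5, 4, 28]
n = len(x)
mean = sum(x) / n
squared_dev = sum((item - mean) ** 2 for item in x)

sample_var = squared_dev / (n - 1)   # ddof=1: divide by n-1
population_var = squared_dev / n     # ddof=0: divide by n

print(sample_var, statistics.variance(x))
print(population_var, statistics.pvariance(x))
```

Use the sample version when the data are a sample from a larger population, and the population version when the data are the whole population.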

## STANDARD DEVIATION

In [74]:
# using a library
x = [1,2,3,6,7,8,30,40,50]
std_ = statistics.stdev(x)
std_

18.58090417606205

In [75]:
stdev = np.std(x,ddof=1)
stdev

18.58090417606205

In [76]:
# manual verification
n = len(x)
mean_var = np.mean(x)
var_ = sum((item-mean_var)**2 for item in x)/(n-1)
sd_ = var_**(1/2)
sd_

18.58090417606205

In [77]:
var_np  # still the variance of the earlier x = [8, 1, 2.5, 4, 28]

123.19999999999999

In [78]:
stdev=var_np**(1/2)
stdev

11.099549540409285

In [79]:
stdev = np.sqrt(var_np)
stdev

11.099549540409285

# SKEWNESS
![image.png](attachment:1c496c66-3a11-46d5-925f-b2cddd45bd5c.png)
![image.png](attachment:601ac338-4097-4f1a-aae6-36fe9815e76f.png)
* negative skew: the mean is smaller than the median, because very low values stretch the left tail


ages: [0, 4, 8]
mean = 4
n = 3
stdev = ((16 + 0 + 16)/(3 - 1))^(1/2) = (32/2)^(1/2) = 4 years
variance = 4^2 = 16 years^2
![image.png](attachment:d522941a-c92f-4535-8dba-c6fb3886a889.png)

In [80]:
x= [8, 1, 2.5, 4, 28]
n=len(x)
mean = np.mean(x)
stdev = statistics.stdev(x)
skew_ = sum((item-mean)**3 for item in x)*n/((n-1)*(n-2)*(stdev**3))
skew_

1.947043227390592

In [81]:
scipy.stats.skew(x)

1.3061163034727836

In [82]:
# bias=False corrects for statistical bias; bias=True (the default) does not
scipy.stats.skew(x,bias=False)

1.9470432273905927

In [83]:
z = pd.Series(x)
z.skew()

1.9470432273905924

In [84]:
scipy.stats.skew(x_with_nan)

nan

In [85]:
z_with_nan = pd.Series(x_with_nan)
z_with_nan.skew()

1.9470432273905924

In [86]:
scipy.stats.skew(z_with_nan)

nan
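
The two skewness values above (about 1.31 and 1.95) differ only by the bias correction. A sketch of both in plain Python, assuming the usual moment-based definitions; the function name `skewness` is made up for this illustration.

```python
import math

def skewness(data, bias=True):
    """Moment-based sample skewness, with optional bias correction."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in data) / n   # third central moment
    g1 = m3 / m2 ** 1.5                            # biased estimate
    if bias:
        return g1
    # Adjusted Fisher-Pearson coefficient (needs n > 2)
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)

x = [8, 1, 2.5, 4, 28]
print(skewness(x))               # comparable to scipy.stats.skew(x)
print(skewness(x, bias=False))   # comparable to scipy.stats.skew(x, bias=False)
print(skewness([-v for v in x])) # mirrored data: the sign flips
```

The large value 28 stretches the right tail, so the skewness is positive; mirroring the data makes it negative.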

## PERCENTILES

In [87]:
x=[-5,-1.1,0.1,2,8,12.8,21,25.8,41]
np.percentile(x,5) #percentile 5

-3.44

In [88]:
np.percentile(x,0)

-5.0

In [89]:
np.percentile(x,100)

41.0

In [90]:
np.percentile(x,90)

28.840000000000003
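
The percentile values above come from linear interpolation between neighboring sorted values, which is the default method of `np.percentile`. A plain-Python sketch of that rule (the function name `percentile_linear` is hypothetical):

```python
def percentile_linear(data, q):
    """Percentile q (0-100) using linear interpolation between sorted values."""
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(data)
    pos = (len(ordered) - 1) * q / 100   # fractional index into the sorted data
    lower = int(pos)                      # index at or just below pos
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

x = [-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]
for q in (0, 5, 50, 90, 100):
    print(q, percentile_linear(x, q))
```

For example, the 5th percentile sits at fractional index 0.4, so it lies 40% of the way from -5 to -1.1.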

## QUANTILE

In [91]:
np.quantile(x,0)

-5.0

In [92]:
np.quantile(x,1)

41.0

In [93]:
statistics.quantiles(x,n=2) # split the data into n equal parts; returns the n-1 cut points


[8.0]

In [94]:
statistics.quantiles(x,n=4)

[-0.5, 8.0, 23.4]

In [95]:
statistics.quantiles(x,n=4,method='inclusive') # treats the data's minimum and maximum as the population bounds

[0.1, 8.0, 21.0]

In [96]:
np.quantile(x,0.01)

-4.688

In [97]:
np.quantile(x,0.5)

8.0

In [98]:
x

[-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

In [99]:
x_with_nan = [-5, -1.1, np.nan, 0.1, 2, 8, 12.8, 21, 25.8, 41]
x_with_nan

[-5, -1.1, nan, 0.1, 2, 8, 12.8, 21, 25.8, 41]

In [100]:
np.quantile(x_with_nan,0)

nan

In [101]:
# handle NaN values
np.nanquantile(x_with_nan,0.5)

8.0

In [102]:
np.quantile(x,[0.25,0.5,0.75])

array([ 0.1,  8. , 21. ])

In [103]:
statistics.quantiles(x,n=4,method='inclusive')

[0.1, 8.0, 21.0]

In [104]:
np.percentile(x,[25,50,75])

array([ 0.1,  8. , 21. ])

# KURTOSIS
![image.png](attachment:9f6b6b8f-928c-4e4c-869e-0958b6691c69.png)
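* a flatter curve means the data are more spread out (higher standard deviation)
* kurtosis is about the tails, i.e. the presence of outliers: higher kurtosis means heavier tails and more extreme outliers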
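
As a rough sketch, excess kurtosis can be computed from the second and fourth central moments. This is the biased (population) form; `scipy.stats.describe(..., bias=False)` applies a correction, so its values differ. The function name and both data sets are hypothetical.

```python
def excess_kurtosis(data):
    """Biased (population) excess kurtosis: m4 / m2**2 - 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]             # hypothetical evenly spread data
with_outlier = [0, 0, 0, 0, 100]   # hypothetical data with one extreme value

k_flat = excess_kurtosis(flat)
k_outlier = excess_kurtosis(with_outlier)
print(k_flat, k_outlier)
```

The evenly spread data give a negative excess kurtosis, while the data with one extreme value give a positive one.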

# RANGES
maximum value - minimum value

In [105]:
x  # range = 41 - (-5) = 46

[-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

In [106]:
# np.ptp computes the range (peak to peak)
np.ptp(x)

46.0

In [107]:
np.amax(x)

41.0

In [108]:
np.amin(x)

-5.0

In [109]:
np.amax(x) - np.amin(x)

46.0

# INTERQUARTILE RANGE (Q3-Q1)
Commonly used to detect outliers.
![image.png](attachment:cb795c29-496e-4e38-91b8-17a7d9890c8a.png)

In [110]:
x

[-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

In [111]:
quart = np.quantile(x,[0.25,0.75])
IQR = quart[1] - quart[0]
IQR

20.9
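
One common way to use the IQR for outlier detection is the 1.5 x IQR rule: flag anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The sketch below assumes that rule; the helper name `iqr_outliers`, the factor `k`, and the extra value 100 are illustrative, not from the notebook.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lower or v > upper]

x = [-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

outliers_x = iqr_outliers(x)
outliers_extended = iqr_outliers(x + [100])   # 100 is a hypothetical extreme value

print(outliers_x)
print(outliers_extended)
```

With the original data nothing is flagged; once the extreme value is added, it falls above the upper fence.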

# SUMMARY OF DESC. STATS

In [112]:
x

[-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

In [113]:
result = scipy.stats.describe(x,ddof=1,bias=False)
result

DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)

In [114]:
(result.variance)**(1/2)

15.12454774346805

In [115]:
result.minmax[1]-result.minmax[0]

46.0

In [116]:
result.skewness

0.9249043136685094

In [117]:
z = pd.Series(x)
z

0    -5.0
1    -1.1
2     0.1
3     2.0
4     8.0
5    12.8
6    21.0
7    25.8
8    41.0
dtype: float64

In [118]:
z.describe()

count     9.000000
mean     11.622222
std      15.124548
min      -5.000000
25%       0.100000
50%       8.000000
75%      21.000000
max      41.000000
dtype: float64

# CORRELATION
![image.png](attachment:d0b6ea99-720f-46f4-9bce-552482fe9ade.png)
![image.png](attachment:382484de-b581-4234-a186-ef9044a5344f.png)

In [119]:
x = list(range(-10,11))
y = [0,2,2,2,2,3,3,6,7,4,7,6,6,9,4,5,5,10,11,12,14]
x_ar, y_ar = np.array(x),np.array(y)
x_s,y_s = pd.Series(x),pd.Series(y)

In [120]:
len(x)

21

## COVARIANCE
![image.png](attachment:a95df9ad-39d9-4e7c-bfb8-755c49ebbcc6.png)
![image.png](attachment:e8b523ef-1f94-4342-9b6d-7cd80ea387b6.png)

In [121]:
n = len(x)
mean_x = np.mean(x)
mean_y = np.mean(y)

cov_xy = (sum((x[item]-mean_x)*(y[item]-mean_y)for item in range(n)))/(n-1)
cov_xy

19.95

![image.png](attachment:68ca4e45-2aac-48ca-bc60-a4b8d8c56250.png)

In [122]:
cov_matrix = np.cov(x,y)
cov_matrix

array([[38.5       , 19.95      ],
       [19.95      , 13.91428571]])

In [123]:
np.var(x,ddof=1)

38.5

In [124]:
np.var(y,ddof=1)

13.914285714285711

In [125]:
cov_xy = cov_matrix[0,1]
cov_xy

19.95

In [126]:
cov_xy = cov_matrix[1,0]
cov_xy

19.95

## CORRELATION COEFFICIENT
* correlation: the relationship between 2 variables
* here the correlation is computed with the Pearson coefficient, which measures the linear relationship between the two variables
![image.png](attachment:ffa3f2c4-3088-4c9c-b675-038eca138a13.png)

r = 1 or r = -1 (or a value close to either) is a sign that the two variables are multicollinear, i.e. strongly linearly related

In [127]:
var_x = np.var(x,ddof=1)
var_y = np.var(y,ddof=1)
cov_xy
std_x = var_x**(1/2)
std_y =  var_y**(1/2)
r = cov_xy / (std_x*std_y)
r

0.861950005631606

Conclusion: the correlation is strong and positive because r = 0.86 is close to 1.
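
One reason to prefer r over the raw covariance is that r does not depend on the units of either variable. The sketch below recomputes both in plain Python and then rescales y; the helper names and `y_scaled` are made up for this illustration.

```python
def covariance(a, b):
    """Sample covariance (divides by n-1)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

def pearson_r(a, b):
    """Pearson r = cov(a, b) / (std(a) * std(b))."""
    return covariance(a, b) / (covariance(a, a) * covariance(b, b)) ** 0.5

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
y_scaled = [2 * v + 3 for v in y]   # hypothetical change of units

print(covariance(x, y), pearson_r(x, y))
print(covariance(x, y_scaled), pearson_r(x, y_scaled))
```

Rescaling y doubles the covariance but leaves r unchanged, which is why r can be compared across different pairs of variables.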

In [137]:
xy_cor = scipy.stats.pearsonr(x,y).statistic
xy_cor

0.8619500056316061

In [128]:
r, p = scipy.stats.pearsonr(x,y)
r

0.8619500056316061

In [129]:
p

5.122760847201135e-07

In [138]:
corr_matrix = np.corrcoef(x,y)
corr_matrix

array([[1.        , 0.86195001],
       [0.86195001, 1.        ]])

The correlation is the off-diagonal value, 0.86195001.

In [140]:
scipy.stats.linregress(x,y)

LinregressResult(slope=0.5181818181818181, intercept=5.714285714285714, rvalue=0.861950005631606, pvalue=5.122760847201164e-07, stderr=0.06992387660074979, intercept_stderr=0.4234100995002589)

In [141]:
scipy.stats.linregress(x,y).rvalue

0.861950005631606
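
The slope and intercept reported by `linregress` follow from the ordinary least-squares formulas: slope = Sxy / Sxx and intercept = mean_y - slope * mean_x. A plain-Python sketch on the same data (variable names are illustrative):

```python
x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sums of cross-deviations and squared deviations
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(slope, intercept)
```

Because the mean of x is 0 here, the intercept equals the mean of y.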

### Statistics on a 2D array

In [142]:
a = np.array([[1,1,1],
             [2,3,1],
             [4,9,2],
             [8,27,4],
             [16,1,1]])
np.mean(a)

5.4

In [143]:
np.median(a)

2.0

In [144]:
# mean of each column
np.mean(a, axis=0) # axis=0: computed down the rows (vertical)

array([6.2, 8.2, 1.8])

In [145]:
np.mean(a, axis=1) # axis=1: computed across the columns (horizontal)

array([ 1.,  2.,  5., 13.,  6.])

In [146]:
np.var(a, ddof=1)

53.40000000000001

In [147]:
np.var(a,ddof=1,axis=0)

array([ 37.2, 121.2,   1.7])
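
To make the `axis` argument concrete, the sketch below reproduces the column and row means with plain Python lists. The names `rows`, `col_means`, and `row_means` are made up for the illustration.

```python
rows = [[1, 1, 1],
        [2, 3, 1],
        [4, 9, 2],
        [8, 27, 4],
        [16, 1, 1]]

# axis=0 equivalent: one mean per column (zip(*rows) transposes the rows)
col_means = [sum(col) / len(col) for col in zip(*rows)]

# axis=1 equivalent: one mean per row
row_means = [sum(row) / len(row) for row in rows]

# No axis: one mean over every element
overall_mean = sum(sum(row) for row in rows) / sum(len(row) for row in rows)

print(col_means)
print(row_means)
print(overall_mean)
```

In other words, the axis you pass is the one that gets collapsed: `axis=0` collapses the rows and leaves one value per column.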

### Statistics on a DataFrame

In [148]:
row_names = ['first','second','third','fourth','fifth']
col_names=['A','B','C']
df = pd.DataFrame(a,index=row_names,columns=col_names)
df

         A   B  C
first    1   1  1
second   2   3  1
third    4   9  2
fourth   8  27  4
fifth   16   1  1


In [149]:
np.mean(df.A)

6.2

In [150]:
df['A'].mean()

6.2

In [152]:
# variance per column
np.var(df, ddof=1)

A     37.2
B    121.2
C      1.7
dtype: float64

In [154]:
df.var() # pandas uses ddof=1 by default

A     37.2
B    121.2
C      1.7
dtype: float64

In [155]:
df.mean(axis=0) # mean down each column

A    6.2
B    8.2
C    1.8
dtype: float64

In [156]:
df.mean(axis=1) # mean across each row

first      1.0
second     2.0
third      5.0
fourth    13.0
fifth      6.0
dtype: float64

Homework due next Friday (optional): if you take material from kode.id, rewrite it in your own words (definitions and so on)
statistical hypothesis test cheat sheet
* each segment: optionally one of your own

### nadia syachrani