## Skewness

![image.png](attachment:image.png)

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd 

In [2]:
x = [8,1,2.5,4,28]
x_with_nan = [8,1,2.5,math.nan,4,28]

In [3]:
print(x)
print(x_with_nan)

[8, 1, 2.5, 4, 28]
[8, 1, 2.5, nan, 4, 28]


In [4]:
# Create an array (y) and a Series (z) from x

y, y_with_nan = np.array(x), np.array(x_with_nan) # array
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

print(y)
print(y_with_nan)
print(z)
print(z_with_nan)

[ 8.   1.   2.5  4.  28. ]
[ 8.   1.   2.5  nan  4.  28. ]
0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64
0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64


In [5]:
z_with_nan.isnull()

0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

# Central Tendency

![image.png](attachment:image.png)

## Arithmetic Mean

- x = list
- y = array
- z = series

In [6]:
sum(x)

43.5

In [7]:
len(x)

5

In [8]:
mean_x = sum(x)/len(x)
mean_x

8.7

In [9]:
mean_stats = statistics.mean(x)
print(mean_stats)

8.7


In [10]:
mean_np = np.mean(x)
print(mean_np)

8.7


In [11]:
y.mean() #array

8.7

In [12]:
z.mean() #series

8.7

In [13]:
y_with_nan.mean()

nan

In [14]:
np.nan*100

nan

In [15]:
np.nanmean(y_with_nan) # Compute the mean while ignoring NaN values.

8.7

In [16]:
z_with_nan.mean() # A Series skips NaN by default when computing the mean

8.7
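
The NaN handling shown above can also be reproduced without NumPy or pandas. The following is a minimal sketch using only the standard library; the name `clean` is introduced here for illustration and is not part of the original notebook.

```python
import math
import statistics

x_with_nan = [8, 1, 2.5, math.nan, 4, 28]

# Keep only the entries that are not NaN.
clean = [v for v in x_with_nan if not math.isnan(v)]

# Mean of the remaining values, comparable to np.nanmean on the same data.
mean_ignoring_nan = statistics.fmean(clean)
print(mean_ignoring_nan)
```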

## Weighted Mean

![image.png](attachment:image.png)

Weighted mean = an average in which each value is multiplied by its weight before summing.

In [17]:
x = [8,1,2.5,4,28] # Metrics
w = [0.1,0.2,0.3,0.25,0.15] # Weightages

# Convert to arrays
y,z,w = np.array(x), pd.Series(x), np.array(w)

wmean = np.average(y, weights=w)
print(wmean)

6.95


In [18]:
# Manual calculation
wmean_manual = sum(w[i]*x[i] for i in range(len(x)))/sum(w)
print(wmean_manual)

6.95


![image.png](attachment:image.png)

In [19]:
sum(w)

1.0
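
The manual calculation above can be wrapped in a small reusable function. This is a sketch only; `weighted_mean` is a hypothetical helper name, not something defined elsewhere in this notebook. It also illustrates that the weights do not need to sum to 1, because the formula divides by their total.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for v, w in zip(values, weights)) / total_weight


x = [8, 1, 2.5, 4, 28]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

result = weighted_mean(x, w)

# Multiplying every weight by the same constant leaves the result unchanged.
scaled = weighted_mean(x, [10 * wi for wi in w])
print(result, scaled)
```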

## Harmonic Mean

![image.png](attachment:image.png)

Often used to average rates (for example 10 km/hour, 100 kg/day, 10,000,000 people/year).

![image.png](attachment:image.png)

In [20]:
hmean = len(x)/sum(1/item for item in x)
x

[8, 1, 2.5, 4, 28]

In [21]:
for item in x:
    print(1/item)

0.125
1.0
0.4
0.25
0.03571428571428571


In [22]:
hmean

2.7613412228796843

In [23]:
sum(1/item for item in x)

1.8107142857142857

In [24]:
speed = [60,20]
hmean_speed = len(speed)/sum(1/item for item in speed)
hmean_speed

30.0

In [25]:
# Harmonic mean using a library
hmean_stats = statistics.harmonic_mean(x)
hmean_stats

2.7613412228796843

In [26]:
hmean_stats_speed = statistics.harmonic_mean(speed)
hmean_stats_speed

30.0

In [27]:
hmean_stats_speed2 = scipy.stats.hmean(speed)
hmean_stats_speed2

30.0
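
One way to see why the harmonic mean of 60 and 20 is 30 is to treat the two speeds as two legs of equal length and compute total distance divided by total time. The sketch below assumes a leg length of 60 km; that value is hypothetical, and any positive distance gives the same average.

```python
import statistics

distance = 60        # km per leg (assumed for illustration)
speeds = [60, 20]    # km/h on each leg

total_distance = distance * len(speeds)
total_time = sum(distance / s for s in speeds)   # hours spent on both legs
average_speed = total_distance / total_time

print(average_speed)
print(statistics.harmonic_mean(speeds))
```

The arithmetic mean (40 km/h) would overstate the average because more time is spent on the slower leg.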

## Geometric Mean

![image.png](attachment:image.png)

![image.png](attachment:image.png)

The company was founded in 2017.
In 2017 its revenue was 1000.
- in 2018 revenue grew by 30%
- in 2019 revenue grew by 50%
- in 2020 revenue grew by 40%

In [28]:
x = 1000
x*=1.3 # 1300
x*=1.5 # 1950
x*=1.4 # 2730
x

2730.0

In [29]:
# Arithmetic Mean
(30+50+40)/3

40.0

In [30]:
x = 1000
x*=1.4
x*=1.4
x*=1.4
x

2743.9999999999995

In [31]:
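# Geometric mean of the percentage values (30, 50, 40).
# For growth rates it is normally taken over the growth factors (1.3, 1.5, 1.4);
# the next cell shows that 39.15% does not reproduce the actual total of 2730.
gmean = ((30*50*40)**(1/3))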
gmean

39.148676411688626

In [32]:
x = 1000
x*=1.3915 
x*=1.3915 
x*=1.3915
x

2694.322835875

In [33]:
scipy.stats.gmean([30,50,40]) # Geometric mean Scipy

39.14867641168864
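
Applying an average growth of 39.15% three times gives about 2694, not the actual 2730, because the geometric mean above was taken over the percentages rather than the growth factors. The sketch below takes the geometric mean of the factors 1.3, 1.5 and 1.4 instead, using `statistics.geometric_mean` (available from Python 3.8). The variable names are introduced here for illustration.

```python
import statistics

factors = [1.3, 1.5, 1.4]   # yearly growth factors for 2018, 2019, 2020

# Average yearly growth factor.
avg_factor = statistics.geometric_mean(factors)
avg_growth_pct = (avg_factor - 1) * 100

# Applying the average factor three times should reproduce the real total.
revenue = 1000 * avg_factor ** 3

print(avg_growth_pct)
print(revenue)
```

The average growth comes out slightly below 40%, and compounding it over three years returns to roughly 2730.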

## Median

In [34]:
# Odd number of elements
a = [1,2,4,8,9]
n = len(a)
median = a[n // 2] # middle element of the sorted list
median

4

In [35]:
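# Even number of elements
a = [1,2,4,8]
n = len(a)
# (n+1)/2 = 2.5, so the median lies between the 2nd and 3rd values
median = (a[n // 2 - 1] + a[n // 2]) / 2
median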

3.0

In [36]:
x 

2694.322835875

In [37]:
# Using NumPy
x = [8, 1, 2.5, 4, 28]
c = [2,3,4,5]
np.median(x)

4.0

In [38]:
# Manual verification
def median_function(x):
    n = len(x)
    x_ord = sorted(x)
    if n % 2 == 1: # Odd number of elements: take the middle value
        median_ = x_ord[n // 2]
    else: # Even number of elements: average the two middle values
        index = n // 2
        median_ = 0.5 * (x_ord[index - 1] + x_ord[index])
    return median_

In [39]:
median_function(x)

4
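
The odd and even cases above can be cross-checked against the standard library. This is a small sketch using `statistics.median`, plus `median_low` and `median_high`, which return one of the two middle values instead of their average when the count is even.

```python
import statistics

odd = [8, 1, 2.5, 4, 28]
even = [1, 2, 4, 8]

median_odd = statistics.median(odd)         # middle value after sorting
median_even = statistics.median(even)       # average of the two middle values
median_low = statistics.median_low(even)    # lower of the two middle values
median_high = statistics.median_high(even)  # higher of the two middle values

print(median_odd, median_even, median_low, median_high)
```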

## Mode/Modus

The value that occurs most often (the one with the highest frequency).

In [40]:
u = [2,3,2,8,12]

mode_ = max((u.count(i),i)for i in set(u))[1]

In [41]:
[(u.count(i),i) for i in set(u)]

[(1, 8), (2, 2), (1, 3), (1, 12)]

In [42]:
set(u)

{2, 3, 8, 12}

In [43]:
statistics.mode(u)

2

In [46]:
mode_lib = scipy.stats.mode(u)

In [47]:
print(mode_lib)
print(mode_lib.mode)
print(mode_lib.count)

ModeResult(mode=array([2]), count=array([2]))
[2]
[2]


In [48]:
u

[2, 3, 2, 8, 12]

In [49]:
u_series = pd.Series(u)
u_series.mode()

0    2
dtype: int64

In [50]:
u_array = np.array(u)
statistics.mode(u_array)

2

In [51]:
v = pd.Series([1,2,3,3,3,3,np.nan,np.nan,np.nan,np.nan])
v.mode()

0    3.0
dtype: float64

In [52]:
h = [1,2,3,3,3,np.nan,np.nan,np.nan,np.nan]
statistics.mode(h)

nan

A pandas Series ignores NaN when computing the mode, whereas `statistics.mode` on a list counts NaN as a value.
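
To get the Series-like behaviour from a plain list, one option is to drop the NaN entries before calling `statistics.mode`. The sketch below also shows `statistics.multimode` (Python 3.8+), which returns every value tied for the highest frequency. The second list is made up for illustration.

```python
import math
import statistics

h = [1, 2, 3, 3, 3, math.nan, math.nan, math.nan, math.nan]

# Remove NaN entries so they cannot win the frequency count.
h_clean = [v for v in h if not math.isnan(v)]
mode_clean = statistics.mode(h_clean)

# With a tie, multimode lists all modes in order of first appearance.
ties = statistics.multimode([2, 3, 2, 8, 8])

print(mode_clean)
print(ties)
```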

## Measures of Variability

![image.png](attachment:image.png)

In [53]:
x

[8, 1, 2.5, 4, 28]

In [54]:
n = len(x)
mean_var = np.mean(x)
var_ = sum((item-mean_var)**2 for item in x) / (n-1)
var_

123.19999999999999

In [55]:
# Variance w/ statistics library
var_stat = statistics.variance(x)
var_stat

123.2

In [56]:
# Variance w/ numpy
var_np = np.var(x, ddof=1)
var_np

123.19999999999999

In [57]:
# Variance using Series Function
z = pd.Series(x)
z.var()

123.19999999999999
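
All the variance calls above divide by `n-1` (the sample variance, `ddof=1`). As a sketch of the difference, the block below computes both the sample and the population variance by hand and compares them with `statistics.variance` and `statistics.pvariance`. The variable names are introduced here for illustration.

```python
import statistics

x = [8, 1, 2.5, 4, 28]
n = len(x)
mean = sum(x) / n

# Sum of squared deviations from the mean.
ss = sum((v - mean) ** 2 for v in x)

sample_var = ss / (n - 1)   # divides by n-1, like np.var(x, ddof=1)
population_var = ss / n     # divides by n, like np.var(x) with the default ddof=0

print(sample_var, statistics.variance(x))
print(population_var, statistics.pvariance(x))
```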

## Standard Deviation

Quiz: compute the variance and the standard deviation.

In [58]:
x = [1,2,3,6,7,8,30,40,50]

In [59]:
# std
n = len(x)
mean_std = np.mean(x)
std_ = (sum((item-mean_std)**2 for item in x) / (n-1))**(1/2)
std_

18.58090417606205

In [60]:
#variance
n = len(x)
mean_var = np.mean(x)
var_ = (sum((item-mean_var)**2 for item in x) / (n-1))
var_

345.25000000000006

In [61]:
import numpy as np

x = [1, 2, 3, 6, 7, 8, 30, 40, 50]

variance = np.var(x, ddof=1)

# Compute the standard deviation
std_deviation = np.std(x,ddof=1)


print("Variance of x: ", variance)
print("Standard deviation of x: ", std_deviation)

Variance of x:  345.25000000000006
Standard deviation of x:  18.58090417606205
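
The quiz result can be double-checked with the standard library alone. This sketch relies on `statistics.variance` and `statistics.stdev`, both of which use the `n-1` denominator, and confirms that the standard deviation is the square root of the variance.

```python
import math
import statistics

x = [1, 2, 3, 6, 7, 8, 30, 40, 50]

sample_var = statistics.variance(x)   # sample variance (n-1 denominator)
sample_std = statistics.stdev(x)      # sample standard deviation

print(sample_var)
print(sample_std)
print(math.sqrt(sample_var))          # should agree with sample_std
```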


## Skewness

![image-2.png](attachment:image-2.png)

In [62]:
x = [8,1,2.5,4,28]

In [63]:
x

[8, 1, 2.5, 4, 28]

In [64]:
n = len(x)
mean = np.mean(x)
stdev = statistics.stdev(x)
skew_ = (sum((item - mean)**3 for item in x))*n / ((n-1)*(n-2)*(stdev**3))
skew_

1.947043227390592

In [65]:
scipy.stats.skew(x, bias=False)

1.9470432273905927

In [66]:
z = pd.Series(x)
z.skew()

1.9470432273905924

In [67]:
scipy.stats.skew(x_with_nan)

nan

In [68]:
z_with_nan = pd.Series(x_with_nan)
z_with_nan.skew()

1.9470432273905924

In [69]:
scipy.stats.skew(z_with_nan)

nan
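
Since `scipy.stats.skew` returns `nan` here while the pandas Series skips the missing value, one workaround is to filter out NaN before applying the same adjusted formula used in the manual calculation above. The following is a pure-Python sketch; `sample_skewness` is a hypothetical helper name.

```python
import math
import statistics


def sample_skewness(data):
    """Bias-adjusted sample skewness, ignoring NaN entries."""
    values = [v for v in data if not math.isnan(v)]
    n = len(values)
    if n < 3:
        raise ValueError("at least three non-NaN values are required")
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    third_moment_sum = sum((v - mean) ** 3 for v in values)
    return third_moment_sum * n / ((n - 1) * (n - 2) * stdev ** 3)


x_with_nan = [8, 1, 2.5, math.nan, 4, 28]
skew_ignoring_nan = sample_skewness(x_with_nan)
print(skew_ignoring_nan)
```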

## Percentiles

In [70]:
x = [-5,-1.1,0.1,2,8,12.8,21,25.8,41]
np.percentile(x,5)

-3.44

In [71]:
np.percentile(x,0)

-5.0

In [72]:
np.quantile(x,0)

-5.0

In [73]:
np.quantile(x,0.5)

8.0

In [74]:
statistics.quantiles(x, n=2)

[8.0]

In [75]:
statistics.quantiles(x,n=4, method='inclusive')

[0.1, 8.0, 21.0]
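
The value `-3.44` for the 5th percentile comes from linear interpolation between the two nearest sorted values. The sketch below reimplements that idea in pure Python, assuming NumPy's default `linear` method (position `(n-1) * q / 100` in the sorted data). `percentile_linear` is a hypothetical helper name.

```python
import math


def percentile_linear(data, q):
    """Percentile by linear interpolation between neighbouring sorted values."""
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(data)
    pos = (len(ordered) - 1) * q / 100   # fractional index into the sorted data
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


x = [-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

p5 = percentile_linear(x, 5)     # position 0.4, between -5 and -1.1
p50 = percentile_linear(x, 50)   # position 4.0, exactly the middle value
print(p5, p50)
```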

In [76]:
y

array([ 8. ,  1. ,  2.5,  4. , 28. ])

In [77]:
a = np.array([[1,1,1],
             [2,3,1],
             [4,9,2],
             [8,27,4],
             [16,1,1]])

In [78]:
np.mean(a)

5.4

In [79]:
np.median(a)

2.0

In [80]:
np.mean(a, axis=1)

array([ 1.,  2.,  5., 13.,  6.])

In [81]:
np.mean(a, axis=0)

array([6.2, 8.2, 1.8])

In [82]:
np.var(a,ddof=1)

53.40000000000001

In [83]:
np.var(a,ddof=1, axis=1)

array([  0.,   1.,  13., 151.,  75.])

In [84]:
np.var(a,ddof=1, axis=0)

array([ 37.2, 121.2,   1.7])
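
The `axis` argument can be easier to follow with a plain nested list. In this sketch, `axis=1` corresponds to one result per row and `axis=0` to one result per column; `zip(*a)` is used to iterate over columns. The names `row_means` and `col_means` are introduced for illustration.

```python
a = [[1, 1, 1],
     [2, 3, 1],
     [4, 9, 2],
     [8, 27, 4],
     [16, 1, 1]]

# One mean per row, comparable to np.mean(a, axis=1).
row_means = [sum(row) / len(row) for row in a]

# One mean per column, comparable to np.mean(a, axis=0).
col_means = [sum(col) / len(col) for col in zip(*a)]

print(row_means)
print(col_means)
```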

In [85]:
row_names = ['first','second','third','fourth','fifth']
col_names=['A','B','C']
df = pd.DataFrame(a,index=row_names,columns=col_names)
df

         A   B  C
first    1   1  1
second   2   3  1
third    4   9  2
fourth   8  27  4
fifth   16   1  1


In [86]:
np.mean(df.A)

6.2

In [87]:
df['A'].mean()

6.2

In [88]:
np.var(df, ddof=1)

A     37.2
B    121.2
C      1.7
dtype: float64

In [89]:
df.var()

A     37.2
B    121.2
C      1.7
dtype: float64

## Range (Max-Min)

In [90]:
x

[-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

In [91]:
# np.ptp ("peak to peak") computes the range
np.ptp(x)

46.0

In [93]:
np.amax(x)

41.0

In [94]:
np.amin(x)

-5.0

In [95]:
np.amax(x)-np.amin(x)

46.0

## Interquartile Range (Q3-Q1)

Used to detect outliers.

In [96]:
x

[-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

In [97]:
quartiles = np.quantile(x,[0.25,0.75])

In [99]:
IQR = quartiles[1]-quartiles[0]
IQR

20.9
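
A common way to use the IQR for outlier detection is to flag values that fall more than 1.5 × IQR below Q1 or above Q3. The multiplier 1.5 is a widely used convention, not something defined in this notebook. The sketch below uses `statistics.quantiles` with `method='inclusive'`; `iqr_outliers` and the extra value 120 are hypothetical.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lower or v > upper]


x = [-5, -1.1, 0.1, 2, 8, 12.8, 21, 25.8, 41]

no_outliers = iqr_outliers(x)             # every value is inside the fences
with_outlier = iqr_outliers(x + [120])    # 120 is an added, made-up extreme value
print(no_outliers)
print(with_outlier)
```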

## Summary of Desc. Stats

In [100]:
result = scipy.stats.describe(x, ddof=1, bias=False)
result

DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)

In [102]:
result.variance

228.75194444444446

In [104]:
(result.variance)**(1/2) # Standard deviation: square root of the variance

15.12454774346805

In [105]:
result.minmax[1]-result.minmax[0] # Range

46.0

In [106]:
result.skewness

0.9249043136685094

In [108]:
z = pd.Series(x)
z

0    -5.0
1    -1.1
2     0.1
3     2.0
4     8.0
5    12.8
6    21.0
7    25.8
8    41.0
dtype: float64

In [109]:
z.describe()

count     9.000000
mean     11.622222
std      15.124548
min      -5.000000
25%       0.100000
50%       8.000000
75%      21.000000
max      41.000000
dtype: float64

## Correlation (relationship between variables)

In [113]:
x = list(range(-10,11))
y = [0,2,2,2,2,3,3,6,7,4,7,6,6,9,4,5,5,10,11,12,14]

x_ar, y_ar = np.array(x), np.array(y)
x_s, y_s = pd.Series(x), pd.Series(y)

In [114]:
len(x)==len(y)

True

## Covariance

In [117]:
n = len(x)
mean_x = np.mean(x)
mean_y = np.mean(y)

cov_xy = (sum((x[item]-mean_x)*(y[item]-mean_y)for item in range(n)))/(n-1)
cov_xy

19.95

In [118]:
cov_matrix = np.cov(x,y)
cov_matrix

array([[38.5       , 19.95      ],
       [19.95      , 13.91428571]])

In [119]:
np.var(x, ddof=1)

38.5

In [120]:
cov_xy = cov_matrix[0,1] # [0,1] = row 0, column 1
cov_xy

19.95

## Correlation Coefficient
A measure of the strength and direction of the linear relationship between two variables.

In [125]:
var_x = np.var(x, ddof=1)
var_y = np.var(y, ddof=1) 
std_x = var_x**(1/2)
std_y = var_y**(1/2)
r = cov_xy / (std_x*std_y)
r 

0.861950005631606

A positive correlation means the two variables tend to increase together; the closer r is to 1, the stronger that positive linear relationship.

A negative correlation means one variable tends to decrease as the other increases; the closer r is to -1, the stronger that negative linear relationship.

A correlation of 0 means there is no linear relationship between the two variables.

In [126]:
scipy.stats.pearsonr(x,y)

(0.8619500056316061, 5.122760847201135e-07)

In [127]:
r, p = scipy.stats.pearsonr(x,y)
r

0.8619500056316061

In [128]:
p

5.122760847201135e-07
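
The covariance and the correlation coefficient computed above can be reproduced with the standard library only, which makes the relationship r = cov(x, y) / (std_x · std_y) explicit. This is a sketch; the loop-free `zip` form is a stylistic choice, not the notebook's original code.

```python
import math

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance (n-1 denominator).
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations (n-1 denominator).
std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Pearson correlation coefficient.
r = cov_xy / (std_x * std_y)

print(cov_xy)
print(r)
```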