# **Basic Statistics**

Statistics can be a powerful tool when performing the art of Data Science (DS). From a high-level view, statistics is the use of mathematics to perform technical analysis of data. A basic visualisation such as a bar chart might give you some high-level information, but with statistics we get to operate on the data in a much more information-driven and targeted way. The math involved helps us form concrete conclusions about our data rather than just guesstimating.

Using statistics, we can gain deeper, more fine-grained insight into how our data is structured and, based on that structure, into how best to apply other data science techniques to extract even more information.

## ``Topics:``

#### 1. **Descriptive Analysis** 
#### 2. **Probability Distribution**
#### 3. **Correlation**

<hr>

In [1]:
import statistics as st
import pandas as pd
import numpy as np

<hr>

# **Descriptive Statistics**

## 1. Central Tendency
Central tendency summarizes an entire distribution with a single central value. Its measures include the mean, median, and mode.

### **a. mean()**
This function returns the arithmetic average of the data it operates on.
``Best suited to data that are normally distributed and contain no outliers.``
``All values are summed, then divided by the number of values.``

In [63]:
data = [1,2,3,5,7,9]
st.mean(data)

4.5

### **b. mode()**
This function returns the most common value in a set of data, which shows where the values concentrate. ``It is used more often with categorical data``.

In [3]:
data1 = [1,2,3,5,7,9,7,2,7,6]
st.mode(data1)

7

In [72]:
jns_kelamin = [1,1,1,2,2,2,2] # gender code: 1 = male, 2 = female
st.mode(jns_kelamin)

2
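When two or more values tie for the highest count, a single mode is ambiguous. The sketch below assumes Python 3.8 or later, where `statistics.multimode()` is available and returns every tied value; the list `tied` is made up for illustration.

```python
import statistics as st

# Made-up example data in which 2 and 7 both appear three times.
tied = [1, 2, 2, 2, 5, 7, 7, 7, 9]

# multimode() returns all of the most common values, in first-seen order.
modes = st.multimode(tied)
print(modes)  # [2, 7]
```

On the same data, `st.mode()` reports only one of the tied values, so `multimode()` is the safer choice when ties are possible.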

### **c. median()**
For data of odd length, this returns the middle item; for data of even length, it returns the average of the two middle items. ``Use it when the data are not normally distributed or contain outliers (extreme values). To compute it, sort the data and take the midpoint (the 50% position).``

In [68]:
st.median([1,2,3,4,5,6,7,8,9,10, 1000])

6
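To see why the median is preferred when outliers are present, the following sketch computes both the mean and the median of the same list used above. Only the single value 1000 is extreme, yet it is enough to pull the mean far away from the bulk of the data.

```python
import statistics as st

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

mean_value = st.mean(values)      # pulled far upward by the single value 1000
median_value = st.median(values)  # middle of the sorted data, unaffected by 1000

print("mean:", mean_value)      # roughly 95.9
print("median:", median_value)  # 6
```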

### **d. harmonic_mean()**

This function returns the harmonic mean of the data. For three values $a$, $b$, and $c$, the harmonic mean is

$$H = \frac{3}{\frac{1}{a} + \frac{1}{b} + \frac{1}{c}}$$

It is a measure of center suited to rates; averaging speeds is one example.

In [73]:
st.harmonic_mean([1,2,3])

1.6363636363636365

For the same set of data, the arithmetic mean would give us a value of 2.
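The speed example can be made concrete with a small sketch. The trip below is hypothetical: two legs of equal distance, one driven at 40 km/h and one at 60 km/h. The average speed over the whole trip is the harmonic mean of the two speeds, not their arithmetic mean.

```python
import statistics as st

speeds = [40, 60]  # km/h, hypothetical legs of equal distance

harmonic = st.harmonic_mean(speeds)
arithmetic = st.mean(speeds)

# Cross-check with distance / time: 120 km out and 120 km back
# takes 3 h + 2 h, so the average speed is 240 km / 5 h.
distance = 120
total_time = distance / 40 + distance / 60
average_speed = 2 * distance / total_time

print("harmonic mean:", harmonic)        # approximately 48
print("arithmetic mean:", arithmetic)    # 50
print("distance / time:", average_speed) # 48.0
```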

### **e. median_low()**

When the data is of an even length, this returns the lower of the two middle values. Otherwise, it returns the middle value. ``Use it when the number of values is even and you want the lower median.``

In [74]:
# example using the regular median
st.median([1,2,3,4])

2.5

In [75]:
# example using median_low
st.median_low([1,2,3,4])

2

### **f. median_high()**

Like median_low, this returns the higher of the two middle values when the data is of an even length. Otherwise, it returns the middle value. ``Use it when the number of values is even and you want the higher median.``

In [76]:
st.median_high([1,2,3,4])

3

<hr>

## 2. Dispersion

``Dispersion/spread gives us an idea of how the data strays from the typical value.``

Dispersion describes how far the members of a distribution lie from the center and from each other. Variance and standard deviation are two such measures of variability.

### **a. variance()**

This returns the variance of the sample. This is the second moment about the mean and a larger value denotes a rather spread-out set of data. You can use this when your data is a sample out of a population.

In [10]:
st.variance(data1)

7.433333333333334

### **b. pvariance()**

This returns the population variance of data. Use this to calculate variance from an entire population.

In [11]:
st.pvariance(data1)

6.69
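The two variance functions differ only in the divisor. As a minimal sketch, the code below computes both by hand for `data1` and compares them with the library results: the sample variance divides the sum of squared deviations by `n - 1`, while the population variance divides it by `n`.

```python
import statistics as st

data1 = [1, 2, 3, 5, 7, 9, 7, 2, 7, 6]

n = len(data1)
mean = sum(data1) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((x - mean) ** 2 for x in data1)

sample_variance = squared_deviations / (n - 1)  # matches st.variance
population_variance = squared_deviations / n    # matches st.pvariance

print(sample_variance, st.variance(data1))
print(population_variance, st.pvariance(data1))
```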

### **c. stdev()**

This returns the standard deviation for the sample. This is equal to the square root of the sample variance.

In [12]:
st.stdev(data1)

2.7264140062238043

### **d. pstdev()**

This returns the population standard deviation. This is the square root of population variance.

In [13]:
st.pstdev(data1)

2.5865034312755126

<hr>

## **Pandas with Descriptive Statistics in Python**

In [14]:
df = pd.DataFrame(data1)

In [15]:
# mean
df.mean()

0    4.9
dtype: float64

In [16]:
# mode
df.mode()

|  | 0 |
|---|---|
| 0 | 7 |


In [17]:
# standard deviation
df.std()

0    2.726414
dtype: float64
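The value 2.726414 matches `st.stdev`, not `st.pstdev`, because pandas computes the sample standard deviation by default. NumPy's default is the opposite. The sketch below compares the two libraries on `data1`, using the `ddof` parameter that both accept to switch between the sample and population versions.

```python
import statistics as st
import numpy as np
import pandas as pd

data1 = [1, 2, 3, 5, 7, 9, 7, 2, 7, 6]

sample_pd = pd.Series(data1).std()   # pandas default: ddof=1 (sample)
population_np = np.std(data1)        # NumPy default: ddof=0 (population)
sample_np = np.std(data1, ddof=1)    # ask NumPy for the sample version

print(sample_pd, sample_np, st.stdev(data1))
print(population_np, st.pstdev(data1))
```

Checking which default is in effect is worthwhile whenever results from the two libraries are compared.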

In [18]:
# show all descriptive analysis
df.describe()

|  | 0 |
|---|---|
| count | 10.0 |
| mean | 4.9 |
| std | 2.726414 |
| min | 1.0 |
| 25% | 2.25 |
| 50% | 5.5 |
| 75% | 7.0 |
| max | 9.0 |
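The 25%, 50%, and 75% rows are the quartiles of the data. As a sketch, assuming Python 3.8 or later, the standard library can reproduce them with `statistics.quantiles()`. The `method='inclusive'` argument is needed here because it uses the same linear interpolation between data points as the pandas result above; the default `'exclusive'` method can give different values.

```python
import statistics as st

data1 = [1, 2, 3, 5, 7, 9, 7, 2, 7, 6]

# n=4 splits the data into four equal groups and returns the three cut points.
quartiles = st.quantiles(data1, n=4, method='inclusive')
print(quartiles)  # [2.25, 5.5, 7.0]
```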


<hr>

## **In-Class Exercise**: Analyze the Salaries Dataset

### ``Show your code and answer these questions!``
### Open Salaries.csv and create a DataFrame

#### 1. What is the median ``BasePay`` in ``2013``?
#### 2. What is the mean ``OtherPay`` in ``2014``?
#### 3. What is the mean ``Benefits`` of employees whose ``TotalPay`` is above the employee average in 2014?
#### 4. What is the standard deviation of ``TotalPay`` for ``Transit Operator`` employees?
#### 5. Which 5 people work as ``Special Nurse`` and have an above-average ``BasePay``?
#### 6. Which 5 job titles (``JobTitle``) have the highest pay (``TotalPay``)?
#### 7. What is the variance of ``BasePay`` for ``Public Svc Aide-Public Works`` employees in 2014?
#### 8. Which 5 people working as ``Police Officer 3`` have the lowest ``TotalPay``?
#### 9. What is the standard deviation of ``BasePay`` for ``Registered Nurse`` employees in 2012 whose ``Benefits`` are above average?
#### 10. Which job title appears most often (mode) among employees whose ``BasePay`` and ``Benefits`` are both above average?

In [2]:
df = pd.read_csv('Salaries.csv')
df.head(2)

|  | Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.0 | 400184.25 |  | 567595.43 | 567595.43 | 2011 |  | San Francisco |  |
| 1 | 2 | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966.02 | 245131.88 | 137811.38 |  | 538909.28 | 538909.28 | 2011 |  | San Francisco |  |


In [79]:
# question 1
df[df['Year'] == 2013]['BasePay'].median()

67669.0

In [3]:
# question 2
df[df['Year'] == 2014]['OtherPay'].mean()

3505.4212505574646

In [83]:
# question 3 (the threshold here is the mean TotalPay over all years)
df[(df['Year']==2014) & (df['TotalPay'] > df['TotalPay'].mean())]['Benefits'].mean()

36934.831982886295
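Question 3 can be read two ways: "above average" may refer to the mean `TotalPay` over all years, as in the cell above, or to the mean over 2014 only. The sketch below shows that the two readings can give different answers. The `toy` DataFrame is made up for illustration and is not taken from Salaries.csv.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from Salaries.csv.
toy = pd.DataFrame({
    'Year':     [2013, 2013, 2014, 2014, 2014],
    'TotalPay': [300.0, 500.0, 100.0, 250.0, 300.0],
    'Benefits': [30.0, 50.0, 10.0, 25.0, 30.0],
})

year_2014 = toy[toy['Year'] == 2014]

# Reading A: threshold is the mean TotalPay over every year in the table.
overall_mean = toy['TotalPay'].mean()
benefits_a = year_2014[year_2014['TotalPay'] > overall_mean]['Benefits'].mean()

# Reading B: threshold is the mean TotalPay over 2014 rows only.
mean_2014 = year_2014['TotalPay'].mean()
benefits_b = year_2014[year_2014['TotalPay'] > mean_2014]['Benefits'].mean()

print(benefits_a)  # 30.0
print(benefits_b)  # 27.5
```

Stating which threshold is intended avoids ambiguity when reporting the answer.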

In [15]:
# question 4
(df['TotalPay'][df["JobTitle"]=='Transit Operator']).std()

30035.249899362447

In [19]:
# question 5
df[(df['JobTitle']=='Special Nurse') & (df['BasePay'] > df['BasePay'].mean())][['EmployeeName', 'BasePay']].head(5)

|  | EmployeeName | BasePay |
|---|---|---|
| 39881 | Christian Kitchin | 133932.19 |
| 41793 | Laurie Towns | 135691.93 |
| 42130 | Leah Custis | 127700.35 |
| 42394 | Jennifer Chiu | 124073.96 |
| 42535 | Genevieve Hamer | 128163.62 |


In [24]:
# question 6: Which 5 job titles (JobTitle) have the highest mean pay (TotalPay)?
# Select the column with a list ([['TotalPay']]); tuple-style selection after groupby is deprecated.
df.groupby('JobTitle')[['TotalPay']].mean().sort_values('TotalPay', ascending=False).head()


| JobTitle | TotalPay |
|---|---|
| GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 399211.275 |
| Chief Investment Officer | 339653.7 |
| Chief of Police | 329183.646667 |
| Chief, Fire Department | 325971.683333 |
| DEPUTY DIRECTOR OF INVESTMENTS | 307899.46 |


In [26]:
# question 7: What is the variance of BasePay for Public Svc Aide-Public Works employees in 2014?
df[(df['Year'] == 2014) & (df['JobTitle'].str.lower() == 'public svc aide-public works')]['BasePay'].var()

45111291.10909389

In [36]:
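# question 8: Which 5 people working as Police Officer 3 have the lowest TotalPay?
# Filter first, then sort; each condition needs its own parentheses because & binds tighter than >.
# Rows with TotalPay of 0 or less are excluded.
df[(df['JobTitle'] == 'Police Officer 3') & (df['TotalPay'] > 0)].sort_values(by='TotalPay', ascending=True)[['EmployeeName','JobTitle','TotalPay']].head(5)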


|  | EmployeeName | JobTitle | TotalPay |
|---|---|---|---|
| 72907 | William Pyne | Police Officer 3 | 30.78 |
| 72895 | Anthony Nelson | Police Officer 3 | 38.11 |
| 72867 | Richard Sheehan | Police Officer 3 | 60.09 |
| 72864 | Robert Wood | Police Officer 3 | 63.02 |
| 72855 | Richard Vanwinkle | Police Officer 3 | 67.42 |


In [38]:
df[(df['JobTitle'] == 'Police Officer 3') & (df['TotalPay'] > 0)].sort_values(by = 'TotalPay')[['EmployeeName', 'TotalPay']].head(5)

|  | EmployeeName | TotalPay |
|---|---|---|
| 72907 | William Pyne | 30.78 |
| 72895 | Anthony Nelson | 38.11 |
| 72867 | Richard Sheehan | 60.09 |
| 72864 | Robert Wood | 63.02 |
| 72855 | Richard Vanwinkle | 67.42 |


In [47]:
# question 9: What is the standard deviation of BasePay for Registered Nurse employees in 2012 whose Benefits are above average?
df[(df['JobTitle'].str.lower() == 'registered nurse') & (df['Benefits'] > df['Benefits'].mean()) & (df['Year']==2012)]['BasePay'].std()

17852.79774406112

In [51]:
# question 10: Which job title appears most often (mode) among employees whose BasePay and Benefits are both above average?
df[(df['BasePay'] > df['BasePay'].mean()) & (df['Benefits'] > df['Benefits'].mean())]['JobTitle'].mode()

0    Registered Nurse
dtype: object

<hr>

## **Take-Home Exercise**: Reimplement the Statistics Module
### ``Create your own functions without importing packages/modules!``

#### 1. Mean
#### 2. Median
#### 3. Mode
#### 4. Standard Deviation
#### 5. Variance

In [12]:
def describe(yourList):
    # Work on a sorted copy so the caller's list is left unchanged.
    data = sorted(yourList)
    n = len(data)

    # Median of an already sorted list.
    def median_of(values):
        mid = len(values) // 2
        if len(values) % 2 == 0:
            return (values[mid - 1] + values[mid]) / 2
        return values[mid]

    # mean
    total = 0
    for value in data:
        total += value
    mean = total / n

    # median (Q2)
    median = median_of(data)

    # Q1 and Q3: medians of the lower and upper halves
    # (the middle value is left out of both halves when n is odd)
    Q1 = median_of(data[:n // 2])
    Q3 = median_of(data[(n + 1) // 2:])

    # mode: the first value (in sorted order) with the highest count
    counts = []
    for value in data:
        counts.append(data.count(value))
    mode = data[counts.index(max(counts))]

    # sample standard deviation (divides by n - 1)
    total2 = 0
    for value in data:
        total2 += (value - mean) ** 2
    std = (total2 / (n - 1)) ** 0.5

    # sample variance
    variance = std ** 2

    print("Number of values = ", n)
    print("Minimum = ", data[0])
    print("Maximum = ", data[-1])
    print("Mean = ", mean)
    print("Median (Q2) = {}".format(median))
    print("Q1: {} and Q3: {}.".format(Q1, Q3))
    print("Most frequent value (mode) = {}".format(mode))
    print("Standard deviation = ", std)
    print("Variance = ", variance)

In [54]:
contoh_list = [1,5,6,7,8,3,4,5,6,7,1,2,3,4,6,1,2,3,9,8,9,7,8,6,7,6,6,7,3,4,9,3]
describe(contoh_list)

Number of values =  32
Minimum =  1
Maximum =  9
Mean =  5.1875
Median (Q2) = 6.0
Q1: 3.0 and Q3: 7.0.
Most frequent value (mode) = 6
Standard deviation =  2.4683480174975068
Variance =  6.092741935483872


<hr>

### **Student Code**

In [57]:
# MODE: the value that appears most often
def modus(x):
    x.sort()
    b=x[0]
    d=[]
    d.append([x[0],x.count(x[0])])
    for i in range(1,len(x)):
        if x[i]!=b:
            b=x[i]
            d.append([x[i],x.count(x[i])])
    for i in range (len(d)):
        for j in range (i,len(d)):
            if (d[i][1]<d[j][1]):
                y=d[i]
                d[i]=d[j]
                d[j]=y
    z=[d[0][0]]
    for i in range (1,len(d)):
        if d [i][1]>=d[0][1]:
            z.append(d [i][0])
    return z
    
print (modus(contoh_list))

[6]


In [61]:
def unimodus(x):
    a=0
    b=0
    for i in x:
        if x.count(i)>a:
            a= x.count(i)
            b=i
    return b
unimodus(contoh_list)

6
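Both student solutions call `count()` once per element, so the work grows quadratically with the length of the list. A possible alternative, still without imports, is to tally the values in a dictionary in a single pass. The function name `mode_by_counting` and the sample list are hypothetical and used only for this sketch.

```python
def mode_by_counting(values):
    # Tally how many times each value occurs in one pass over the data.
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1

    # Keep every value whose tally equals the highest tally (handles ties).
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(mode_by_counting([1, 2, 2, 3, 3]))  # [2, 3]
print(mode_by_counting([4, 4, 1]))        # [4]
```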

In [60]:
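# STANDARD DEVIATION = subtract the mean from each value, square the result, divide the sum by the number of values, then take the square root
# (dividing by the number of values gives the population standard deviation)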
def rata2(YourList):
    jumlah = 0
    for i in YourList:
        jumlah += i

    rata2 = jumlah / len(YourList)
    return rata2

def stdev(YourList):
    
    jumlah = 0
    bagi = len(YourList)
    
    mean = rata2(YourList)  # compute the mean once, not on every iteration
    for i in YourList:
        y = (i - mean)**2
        jumlah += y
    variance = jumlah/bagi  # population variance: divides by n
    return variance**(0.5)

stdev(contoh_list)

2.429473965697101

In [59]:
def variansi(data):
    x = sum(data)/len(data)
    y = []
    for i in data:
        z = (i - x)**2
        y.append(z)
    a = sum(y)/(len(data)-1)  # sample variance: divide once, after the loop
    return a

variansi(contoh_list)

6.092741935483871
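Note that the student `stdev` above divides by `n` (population) while `variansi` divides by `n - 1` (sample), so squaring the first result does not give the second. One way to keep the two consistent is a single function with a switch. The name `variance_of` and its `sample` parameter are hypothetical, chosen only for this sketch.

```python
def variance_of(values, sample=True):
    # Sum of squared deviations from the mean.
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)

    # Sample variance divides by n - 1; population variance divides by n.
    divisor = n - 1 if sample else n
    return squared_deviations / divisor


contoh_list = [1,5,6,7,8,3,4,5,6,7,1,2,3,4,6,1,2,3,9,8,9,7,8,6,7,6,6,7,3,4,9,3]

sample_var = variance_of(contoh_list)                    # same divisor as variansi
population_var = variance_of(contoh_list, sample=False)  # same divisor as the student stdev

print(sample_var, sample_var ** 0.5)
print(population_var, population_var ** 0.5)
```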

<hr>

### **Reference**:
- George Seif, "The 5 Basic Statistics Concepts Data Scientists Need to Know", https://towardsdatascience.com/the-5-basic-statistics-concepts-data-scientists-need-to-know-2c96740377ae
- Diogo Menezes Borges, "Introduction to Statistics for Data Science", https://www.kdnuggets.com/2018/12/introduction-statistics-data-science.html
- DataFlair Team, "Python Descriptive Statistics – Measuring Central Tendency & Variability", https://data-flair.training/blogs/python-descriptive-statistics/
- Gaël Varoquaux, "3.1. Statistics in Python", https://scipy-lectures.org/packages/statistics/index.html
- Mirko Stojiljković, "Python Statistics Fundamentals: How to Describe Your Data", https://realpython.com/python-statistics/
- Karlijn Willems, "40+ Python Statistics For Data Science Resources", https://www.datacamp.com/community/tutorials/python-statistics-data-science