# **DataFrame** - Pandas Data Structure

We often work with **tabular data** (data arranged in rows and columns) when using MS Excel. By analogy, a *Pandas* **DataFrame** is a 2-dimensional, tabular data structure.

A DataFrame is a collection of *Series*: each column is a *Series*, and all columns share the same index.

<img src="https://drive.google.com/uc?export=view&id=1bWZ-ysctcAL-KvSdxp66F0f1J7-MQu2a" alt="Drawing" width= 750px;/>

<small>[Source](https://drive.google.com/file/d/1bWZ-ysctcAL-KvSdxp66F0f1J7-MQu2a/view?usp=sharing)</small>

* The data above is tabular and has several columns (*satisfaction_level, last_evaluation, number_project, average_montly_hours*)
* In *Pandas* terms, this data is a *DataFrame*

In this section, we will cover the following:
1. Creating a *DataFrame*
2. *DataFrame* attributes and *methods*
3. Operating on a *DataFrame*

Before continuing, please import the *Pandas* library.

In [6]:
# import the pandas library under the alias 'pd'
import pandas as pd

---
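## Creating a *DataFrame*

A *Pandas DataFrame* can be created in several ways:
1. From *Python Lists* (collected in a dictionary of columns)
2. From *Python Dictionaries* (one dictionary per row)
3. From *Pandas Series*
4. From an imported .csv dataset

---
1. A *DataFrame* can be created from **Python Lists**. Each list holds the values of one column, and a dictionary maps each column name to its list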

In [4]:
data = {"PassengerID": [1, 2, 3, 4, 5],
        "Survived": [0, 1, 1, 1, 0],
        "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle", "Allen"]}

df = pd.DataFrame(data=data)
df

Unnamed: 0,PassengerID,Survived,Name
0,1,0,Braund
1,2,1,Cumings
2,3,1,Heikkinen
3,4,1,Futrelle
4,5,0,Allen

The output above is a *Pandas DataFrame* made up of an *index*, *values*, and *columns*
* The numbers `0, 1, 2, 3, 4` are the **index** of `df`
* **PassengerID**, **Survived**, and **Name** are the **columns** of `df`
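
As a minimal sketch of how these parts can be inspected, the example below rebuilds a shortened version of the toy data (three rows instead of five, an assumption made only to keep the output small) and reads the index, the columns, and a single column.

```python
import pandas as pd

# Shortened copy of the toy data, used for illustration only
data = {"PassengerID": [1, 2, 3],
        "Survived": [0, 1, 1],
        "Name": ["Braund", "Cumings", "Heikkinen"]}
df = pd.DataFrame(data=data)

# Row labels: a default integer index is created when none is given
print(list(df.index))       # [0, 1, 2]

# Column labels, in the order of the dictionary keys
print(list(df.columns))     # ['PassengerID', 'Survived', 'Name']

# Selecting a single column returns a Series
name_column = df["Name"]
print(type(name_column).__name__)   # Series
```
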

We can assign index labels to the data, just as with a *Series*

In [None]:
# The index labels we want to use
nama_index = ['a', 'b', 'c', 'd', 'e']

df = pd.DataFrame(data=data, index=nama_index)
df

Unnamed: 0,PassengerID,Survived,Name
a,1,0,Braund
b,2,1,Cumings
c,3,1,Heikkinen
d,4,1,Futrelle
e,5,0,Allen
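Once rows carry labels, those labels can be used for lookups. The sketch below assumes the same toy data and uses `.loc`, which selects by label; the variable `row_c` is only an illustrative name.

```python
import pandas as pd

data = {"PassengerID": [1, 2, 3, 4, 5],
        "Survived": [0, 1, 1, 1, 0],
        "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle", "Allen"]}
df = pd.DataFrame(data=data, index=['a', 'b', 'c', 'd', 'e'])

# Look up one row by its label; the result is a Series
row_c = df.loc['c']
print(row_c["Name"])              # Heikkinen

# Look up one cell by row label and column name
print(df.loc['e', "Survived"])    # 0
```
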


---
2. A *DataFrame* can be created from **Python Dictionaries**: a list of dictionaries, where each dictionary is one row

In [3]:
data_dict1 = {"PassengerID": 1, "Survived": 0, "Name":"Braund"}
data_dict2 = {"PassengerID": 2, "Survived": 1, "Name":"Cumings"}
data_dict3 = {"PassengerID": 3, "Survived": 1, "Name":"Heikkinen"}
data_dict4 = {"PassengerID": 4, "Survived": 1, "Name":"Futrelle"}
data_dict5 = {"PassengerID": 5, "Survived": 0, "Name":"Allen"}

df = pd.DataFrame(data=[data_dict1, data_dict2, data_dict3,
                        data_dict4, data_dict5])
df

Unnamed: 0,PassengerID,Survived,Name
0,1,0,Braund
1,2,1,Cumings
2,3,1,Heikkinen
3,4,1,Futrelle
4,5,0,Allen
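One detail worth knowing with this approach: the dictionaries do not need identical keys. The sketch below, with made-up records, shows that a key missing from one dictionary becomes `NaN` in the resulting column.

```python
import pandas as pd

records = [{"PassengerID": 1, "Survived": 0, "Name": "Braund"},
           {"PassengerID": 2, "Name": "Cumings"}]   # no "Survived" key here

df_records = pd.DataFrame(data=records)

# The missing entry is stored as NaN
print(df_records["Survived"].isna().tolist())   # [False, True]

# NaN is a float, so the whole column is stored as float64
print(df_records["Survived"].dtype)             # float64
```
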


---
3. A *DataFrame* can be created from **Pandas Series**

In [None]:
data = {"PassengerID": pd.Series(data=[1, 2, 3, 4, 5]),
        "Survived": pd.Series(data=[0, 1, 1, 1, 0]),
        "Name": pd.Series(data=["Braund", "Cumings", "Heikkinen", 
                                "Futrelle", "Allen"])}

df = pd.DataFrame(data=data)
df

Unnamed: 0,PassengerID,Survived,Name
0,1,0,Braund
1,2,1,Cumings
2,3,1,Heikkinen
3,4,1,Futrelle
4,5,0,Allen
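When the *Series* have different indexes, pandas aligns them by label rather than by position. The following sketch uses two small, made-up *Series* to illustrate this; labels missing from one *Series* are filled with `NaN`.

```python
import pandas as pd

ids = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'])
names = pd.Series(data=["Braund", "Cumings"], index=['a', 'b'])   # no label 'c'

df_aligned = pd.DataFrame(data={"PassengerID": ids, "Name": names})

# The resulting index is the union of both indexes
print(df_aligned.shape)                       # (3, 2)

# Row 'c' has no name, so it is filled with a missing value
print(df_aligned["Name"].isna().tolist())     # [False, False, True]
```
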


---
4. A *DataFrame* can be created by importing a .csv file, as in the previous session.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [7]:
dataset_path = "titanic.csv"

df_titanic = pd.read_csv(dataset_path)
display(df_titanic)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
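If `titanic.csv` is not at hand, `pd.read_csv` can be tried on an in-memory string instead. The sketch below uses a tiny, made-up CSV wrapped in `io.StringIO`; it also shows the `index_col` parameter, which uses one of the columns as the index.

```python
import io
import pandas as pd

# A tiny CSV held in memory, for illustration only
csv_text = """PassengerId,Survived,Name
1,0,Braund
2,1,Cumings
3,1,Heikkinen
"""

# read_csv accepts a file path or any file-like object
df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small.shape)            # (3, 3)

# Use the PassengerId column as the index instead of 0, 1, 2, ...
df_by_id = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")
print(list(df_by_id.index))      # [1, 2, 3]
```
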


---
**Exercise**

1. Create the following *DataFrame*

| Name | Occupation |
| --- | --- |
| Bobi | Livestock Farmer |
| Broskuy | YouTuber |
| James Bond | Mission Impossible Agent |

In [None]:
## write your answer here
# 1. build it from a dictionary
# 2. build it from series

---
## *DataFrame* Attributes & *Methods*

According to the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), a *DataFrame* has a large number of attributes & *methods*. Below are some of the most commonly used ones

---
Some commonly used *DataFrame* attributes:

| Attribute | Description |
| --- | --- |
| `index` | returns the row *index* of the *DataFrame* |
| `columns` | returns the column names of the *DataFrame* |
| `axes` | returns a *list* holding the row index and the column names |
| `dtypes` | returns the data type of each column |
| `size` | returns the total number of elements (rows × columns) |
| `shape` | returns the number of rows and columns as a tuple |
| `ndim` | returns the number of dimensions (always 2 for a *DataFrame*) |
| `values` | returns all *values* of the *DataFrame* as a NumPy array |
| `empty` | returns whether the *DataFrame* is empty |
| `T` | returns the *transposed* *DataFrame* |
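
To make the size-related attributes concrete, here is a minimal sketch on a small, made-up *DataFrame* with 3 rows and 2 columns.

```python
import pandas as pd

df = pd.DataFrame({"PassengerID": [1, 2, 3],
                   "Name": ["Braund", "Cumings", "Heikkinen"]})

print(df.shape)     # (3, 2)  -> 3 rows, 2 columns
print(df.size)      # 6       -> 3 * 2 elements
print(df.ndim)      # 2       -> rows and columns
print(df.empty)     # False
print(df.T.shape)   # (2, 3)  -> rows and columns swapped
```
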

Let's apply some of them to the Titanic *DataFrame*

In [8]:
print("Here are some attributes of df_titanic\n")

# Check the index of df_titanic
print(f"index     = {df_titanic.index}")

# Check the columns of df_titanic
print(f"columns   = {list(df_titanic.columns)}")

# Check the data type of each column of df_titanic
print(f"dtypes  = \n{df_titanic.dtypes}")

# Check the shape of df_titanic
print(f"shape   = {df_titanic.shape}")

Here are some attributes of df_titanic

index     = RangeIndex(start=0, stop=891, step=1)
columns   = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
dtypes  = 
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
shape   = (891, 12)


---
Some commonly used *DataFrame* *methods*:

| *Method* | Description |
| --- | --- |
| `.describe()` | returns summary statistics of the *DataFrame* |
| `.sum()` | returns the sum of the values in each column |
| `.mean()` | returns the mean of each numeric column |
| `.median()` | returns the median of each numeric column |
| `.mode()` | returns the mode of each column |
| `.head(n)` | returns the first n rows of the *DataFrame* |
| `.tail(n)` | returns the last n rows of the *DataFrame* |
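
As a quick illustration of these *methods*, the sketch below applies them to a small *DataFrame* whose values are copied from the first five Titanic rows shown earlier. The methods are called on a single column here to keep the output short.

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 35.0],
                   "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05]})

print(df["Age"].sum())             # 156.0
print(df["Age"].mean())            # 31.2
print(df["Age"].median())          # 35.0
print(df["Age"].mode().tolist())   # [35.0]

# head(n) and tail(n) return DataFrames with n rows
print(len(df.head(2)))             # 2
print(len(df.tail(3)))             # 3
```
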

In [None]:
# Summary statistics of the DataFrame
df_titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


**Quick question**: which data types does the `.describe()` *method* summarize?

In [None]:
# First 5 rows of df_titanic
df_titanic.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# Last 5 rows of df_titanic
df_titanic.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


---
## Basic *DataFrame* Operations

Suppose we want to work with `df_titanic`

In [None]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


To **access** data by column, use `["Column Name"]`

In [None]:
# Select one column: Name
df_titanic["Name"]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [None]:
# Select multiple columns: Name & Age
list_kolom = ["Name", "Age"]
df_titanic[list_kolom]

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
...,...,...
886,"Montvila, Rev. Juozas",27.0
887,"Graham, Miss. Margaret Edith",19.0
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",26.0


To **access** rows, use `loc`, which selects rows by their index label

In [None]:
# Access one row by its index label: 0
indeks_data = 0
df_titanic.loc[indeks_data]

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

<img src="https://drive.google.com/uc?export=view&id=1SBrFTp2CZPy-a9A9sRTLYWNhUbiPlM_R" alt="Drawing" width= 1000px;/>

<small>[Source](https://drive.google.com/file/d/1SBrFTp2CZPy-a9A9sRTLYWNhUbiPlM_R/view?usp=sharing)</small>
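
`loc` selects by label, while its counterpart `iloc` selects by integer position. The two only look the same when the index is the default `0, 1, 2, ...`. The sketch below uses a small, made-up *DataFrame* with index labels `10, 20, 30` to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund", "Cumings", "Heikkinen"],
                   "Age": [22.0, 38.0, 26.0]},
                  index=[10, 20, 30])

# loc: by label
print(df.loc[20, "Name"])    # Cumings

# iloc: by position (row 0, column 0)
print(df.iloc[0, 0])         # Braund

# A label slice with loc includes the end label
print(len(df.loc[10:20]))    # 2

# A position slice with iloc excludes the end position
print(len(df.iloc[0:1]))     # 1
```
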

In [None]:
# Access multiple rows: the first 5 rows
n_data = 5
df_titanic[:n_data]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Next, to **update** the age of Mr. Owen Harris, we can *re-assign* the value directly. Using `.loc[row, column]` performs the assignment in a single step and avoids the `SettingWithCopyWarning` that chained indexing such as `df_titanic[kolom_data][indeks_data]` triggers.

In [None]:
# Setup
indeks_data = 0         # index of Mr. Owen Harris's row
kolom_data = "Age"
umur_baru = 25

# Update the value
df_titanic.loc[indeks_data, kolom_data] = umur_baru
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,25.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
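
Values can also be updated for every row that matches a condition, which is handy when the row index is not known in advance. The sketch below uses a small, made-up *DataFrame* and a boolean mask inside `.loc`.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund", "Cumings", "Heikkinen"],
                   "Age": [22.0, 38.0, 26.0]})

# Select the rows where Name is "Braund", then assign to the Age column
df.loc[df["Name"] == "Braund", "Age"] = 25

print(df["Age"].tolist())   # [25.0, 38.0, 26.0]
```
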


To **add** a new column, use a syntax similar to that of a *dictionary*. The data being added can be a *Series*

In [None]:
import numpy as np

# Get the number of rows
nRow = len(df_titanic['Name'])

# Add new_column as a new column filled with random numbers
df_titanic['new_column'] = pd.Series(np.random.randn(nRow), index=df_titanic.index)
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,new_column
0,1,0,3,"Braund, Mr. Owen Harris",male,25.0,1,0,A/5 21171,7.25,,S,-0.435133
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,-0.920152
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0.914544
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,-0.92364
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0.386136
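
A new column does not have to come from an explicit *Series*. As a sketch, the example below derives one column from existing columns and fills another with a single repeated value. The column names `FamilySize` and `Source` are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"SibSp": [1, 1, 0],
                   "Parch": [0, 0, 2]})

# Derived column: computed element-wise from existing columns
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Scalar value: repeated for every row
df["Source"] = "titanic"

print(df["FamilySize"].tolist())   # [2, 2, 3]
print(df["Source"].tolist())       # ['titanic', 'titanic', 'titanic']
```
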


To **add** new rows, combine the *DataFrame* with another *DataFrame* that holds the new rows. Make sure the column names match. `DataFrame.append()` was removed in pandas 2.0, so the rows are added here with `pd.concat()`

In [None]:
# for example, build df_titanic_addition from the first 3 rows
df_titanic_addition = df_titanic[:3]

# Then add df_titanic_addition to df_titanic using pd.concat()
df_titanic = pd.concat([df_titanic, df_titanic_addition], ignore_index=True)
df_titanic.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,new_column
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C,1.054154
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q,0.029503
891,1,0,3,"Braund, Mr. Owen Harris",male,25.0,1,0,A/5 21171,7.25,,S,-0.435133
892,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,-0.920152
893,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0.914544
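To add a single new record, one option is to wrap a dictionary in a one-row *DataFrame* and combine it with the existing one. This is a minimal sketch on made-up data; `new_row` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"PassengerID": [1, 2],
                   "Name": ["Braund", "Cumings"]})

# The keys must match the existing column names
new_row = {"PassengerID": 3, "Name": "Heikkinen"}

# Wrap the dictionary in a list to get a one-row DataFrame, then combine
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(list(df.index))       # [0, 1, 2]
print(df.loc[2, "Name"])    # Heikkinen
```
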


To **delete** a column, use the *method* `.drop(columns=[])`

In [None]:
# Delete new_column
df_titanic = df_titanic.drop(columns=["new_column"])
df_titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
891,1,0,3,"Braund, Mr. Owen Harris",male,25.0,1,0,A/5 21171,7.25,,S
892,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
893,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


To **delete** rows, use the same *method*, but pass the **index** labels of the rows to remove

In [None]:
# Delete the rows with index 891, 892, and 893
df_titanic = df_titanic.drop([891, 892, 893])
df_titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
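Dropping rows from the middle of a *DataFrame* leaves gaps in the index. The sketch below, on made-up data, shows how `reset_index(drop=True)` renumbers the rows, and how a boolean filter can be used as an alternative when rows should be removed by condition rather than by label.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
                   "Survived": [0, 1, 1, 1]})

# Drop by index label: the remaining labels keep their old values
df_dropped = df.drop([1, 2])
print(list(df_dropped.index))           # [0, 3]

# Renumber the rows from 0 and discard the old index
df_reset = df_dropped.reset_index(drop=True)
print(list(df_reset.index))             # [0, 1]

# Alternative: keep only the rows that satisfy a condition
df_survivors = df[df["Survived"] == 1]
print(df_survivors["Name"].tolist())    # ['Cumings', 'Heikkinen', 'Futrelle']
```
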


**Exercise**

1. Take the first 5 rows of the columns *Name*, *Sex*, *Age*, *Survived*

In [None]:
## write your answer here