# **Introduction to Pandas**

Pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. It is often used together with numerical computing libraries such as NumPy and SciPy, analytics libraries such as statsmodels and scikit-learn, and data visualization libraries such as matplotlib. Pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for processing data without `for` loops.

Since becoming open source in 2010, pandas has matured into a fairly large library that is applicable to a broad range of real-world use cases. The developer community has grown to more than 800 distinct contributors, who have helped build the project as they used it to solve their day-to-day data problems.

___

## **1. Pandas Data Structure**

Let's begin our exploration of `pandas` with an overview of its data structures. You need to be familiar with its two most important data structures: `Series` and `DataFrame`. While they are not a universal solution for every problem, they provide a solid, easy-to-use foundation for most applications.

For the most part, `pandas` objects use NumPy arrays for their internal data representation. For some data types, however, `pandas` builds on NumPy to create its own arrays (https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html). For this reason, depending on the data type, `values` can return either a pandas array object or a NumPy `ndarray`. We therefore need to be explicit when we want a specific type.
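As a small sketch of that last point (the sample values and variable names are illustrative and not part of the original notebook), `to_numpy()` asks for a NumPy array explicitly, while the `array` attribute returns the pandas-side array that backs the `Series`:

```python
import numpy as np
import pandas as pd

# Illustrative data: one NumPy-backed Series and one using a pandas extension dtype
plain = pd.Series([1, 2, 3])
nullable = pd.Series([1, 2, None], dtype="Int64")

# to_numpy() returns a numpy.ndarray regardless of the backing array type
plain_np = plain.to_numpy()
nullable_np = nullable.to_numpy(dtype="float64", na_value=np.nan)

# .array returns the pandas array object that backs the Series
nullable_arr = nullable.array

print(type(plain_np).__name__, type(nullable_np).__name__, type(nullable_arr).__name__)
```

Choosing `to_numpy()` or `array` on purpose avoids depending on what `values` happens to return for a given dtype.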

### `Series`

A `Series` is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index. The simplest `Series` is formed from only an array of data.

In [None]:
# pip install pandas

In [4]:
import pandas as pd
import numpy as np

In [15]:
np.array([35000, 71000, 16000, 5000])

array([35000, 71000, 16000,  5000])

In [6]:
# create a Series from a list
series1 = pd.Series([35000, 71000, 16000, 5000])
series1

0    35000
1    71000
2    16000
3     5000
dtype: int64

The string representation of a `Series` displayed interactively shows the index on the left and the values on the right. You can get the array representation and the index object of a `Series` through its attributes. Here are some commonly used attributes:

| Attribute | Returns |
| --- | --- |
| `name` | The name of the `Series` object |
| `dtype` | The data type of the `Series` object |
| `shape` | Dimensions of the `Series` object in a tuple of the form (`number of rows`, ) |
| `index` | The `Index` object that is part of the `Series` object |
| `values` | The data in the `Series` object |

Now let's look at a few examples using these attributes.

A `Series` object itself has a `name` attribute, which integrates with other key areas of pandas functionality.

In [10]:
# give a name to a Series that does not have one yet
series1.name = 'population'
series1

0    35000
1    71000
2    16000
3     5000
Name: population, dtype: int64

In [11]:
# retrieve the name of the Series
series1.name

'population'

In [18]:
# # example: creating a DataFrame
# pd.DataFrame(data=series1)

A `Series` object holds a single data type, which we can check through `dtype`.

In [14]:
series1.dtype

dtype('int64')

As with NumPy, we can use `shape` to get the dimensions as `(rows, columns)`. A `Series` object is a single column, so it only has a value for the row dimension.

In [16]:
# check the shape of the data
series1.shape

(4,)

By default, the index of a `Series` object consists of the integers 0 through N - 1 (where N is the length of the data).

In [17]:
# display its index
series1.index

RangeIndex(start=0, stop=4, step=1)

This `Series` object stores its values as a NumPy array.

In [20]:
# display its values
series1.values

array([35000, 71000, 16000,  5000], dtype=int64)

### `Index`

The addition of an `Index` makes a `Series` more powerful than a NumPy array. We can get the index from the `index` attribute of a `Series` object, or build one directly with `pd.Index`:

In [23]:
labels = pd.Index(['California', 'Ohio', 'Oregon', 'Texas'])
labels

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object')

Index objects are immutable, so their elements cannot be modified by the user. This makes it safer to share `Index` objects among data structures.
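A minimal sketch of what immutability means in practice (the variable names here are illustrative): assigning to an element of an `Index` raises a `TypeError` and leaves the labels untouched.

```python
import pandas as pd

labels_demo = pd.Index(['California', 'Ohio', 'Oregon', 'Texas'])

assignment_failed = False
try:
    labels_demo[1] = 'Nevada'      # element assignment is not supported
except TypeError as err:
    assignment_failed = True
    print(f'TypeError: {err}')

# The labels are unchanged after the failed assignment
print(labels_demo.tolist())
```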

Here are some attributes commonly used with `Index` objects:

| Attribute | Returns |
| --- | --- |
| `name` | The name of the `Index` object |
| `dtype` | The data type of the `Index` object |
| `shape` | Dimensions of the `Index` object |
| `values` | The data in the `Index` object |
| `is_unique` | Check if the `Index` object has all unique values |

An `Index` object also has a `name` attribute:

In [25]:
# give the index a name
labels.name = 'state'
labels

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')

In [26]:
# retrieve the name of the index
labels.name

'state'

We can check the underlying data type:

In [27]:
# check the data type ('O' stands for object)
labels.dtype

dtype('O')

And likewise for the dimensions:

In [28]:
# display the shape
labels.shape

(4,)

This `Index` object is also built on top of a NumPy array:

In [29]:
# display the values of the index
labels.values

array(['California', 'Ohio', 'Oregon', 'Texas'], dtype=object)

An `Index` object can contain duplicate labels. We can check whether all of its labels are unique with `is_unique`:

In [30]:
labels.is_unique

True
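For contrast, here is a small sketch with a deliberately duplicated label (the data is made up for illustration). Selecting a repeated label returns a `Series` rather than a single value:

```python
import pandas as pd

dup_labels = pd.Index(['Ohio', 'Ohio', 'Texas'])
print(dup_labels.is_unique)

dup_series = pd.Series([1.5, 1.7, 2.4], index=dup_labels)

# A unique label yields a scalar; a repeated label yields a Series
texas_value = dup_series['Texas']
ohio_values = dup_series['Ohio']
print(texas_value)
print(ohio_values)
```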

It is often desirable to create a `Series` with an `Index` that identifies each data point with a label. Note that the number of labels in the `Index` must match the number of values in the `Series`.

In [34]:
series1.index = labels
series1

state
California    35000
Ohio          71000
Oregon        16000
Texas          5000
Name: population, dtype: int64
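The length requirement can be seen in a short sketch (the `demo` Series below is a fresh copy of the same numbers, used only for illustration): assigning three labels to four values raises a `ValueError`, and the original index stays in place.

```python
import pandas as pd

demo = pd.Series([35000, 71000, 16000, 5000])

mismatch_rejected = False
try:
    demo.index = ['California', 'Ohio', 'Oregon']   # only 3 labels for 4 values
except ValueError as err:
    mismatch_rejected = True
    print(f'ValueError: {err}')

print(demo.index.tolist())   # still the default integer index
```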

Unlike with NumPy arrays, we can use labels in the index when selecting a single value or a set of values. Here `['Ohio', 'Oregon']` is interpreted as a list of index labels, even though it contains strings instead of integers.

In [37]:
series1[['Ohio', 'Oregon']]

state
Ohio      71000
Oregon    16000
Name: population, dtype: int64

Or change a value in the `Series` based on its `Index` label:

In [40]:
series1['California'] = 40000
series1


state
California    40000
Ohio          71000
Oregon        16000
Texas          5000
Name: population, dtype: int64

In [43]:
pd.DataFrame(series1)

state,population
California,40000
Ohio,71000
Oregon,16000
Texas,5000


### `DataFrame`

Having a `Series` object for each column is an improvement over the NumPy representation, but we still face the same problem when we want to sort by value or grab an entire row. A `DataFrame` gives us a tabular representation built from many `Series` objects that form the columns and an `Index` object that labels the rows.

In [54]:
# a DataFrame can be created with pd.DataFrame()
# keys: become the column names
# values: become the column contents

population = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2021, 2022, 2023, 2021, 2022, 2023],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

frame = pd.DataFrame(
    data=population,
    columns = ['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five', 'six']
)

frame

Unnamed: 0,year,state,pop,debt
one,2021,Ohio,1.5,
two,2022,Ohio,1.7,
three,2023,Ohio,3.6,
four,2021,Nevada,2.4,
five,2022,Nevada,2.9,
six,2023,Nevada,3.2,
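To illustrate why the tabular form helps with sorting and row access, here is a hedged sketch that rebuilds a similar frame (without the empty `debt` column) and uses `sort_values()` and `loc`, two standard pandas tools that this section has not introduced yet. The names `frame_demo`, `by_pop`, and `row_three` are illustrative.

```python
import pandas as pd

population = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2021, 2022, 2023, 2021, 2022, 2023],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame_demo = pd.DataFrame(
    population, index=['one', 'two', 'three', 'four', 'five', 'six']
)

# Sorting by one column reorders all columns together
by_pop = frame_demo.sort_values('pop', ascending=False)

# A whole row comes back as a Series labeled by the column names
row_three = frame_demo.loc['three']

print(by_pop.index.tolist())
print(row_three)
```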


Here are some commonly used attributes:

| Attribute | Returns |
| --- | --- |
| `dtypes` | The data type of each column |
| `shape` | Dimensions of the `DataFrame` object in a tuple of the form `(number of rows, number of columns)` |
| `index` | The `Index` object along the rows of the `DataFrame` object |
| `columns` | The name of the columns (as an `Index` object) |
| `values` | The data in the `DataFrame` object |
| `empty` | Check if the `DataFrame` object is empty |

We can check the underlying data types with `dtypes` (note that this is not `dtype` as on `Series` and `Index` objects, because each column has its own data type):

In [56]:
frame.dtypes

year       int64
state     object
pop      float64
debt      object
dtype: object

We can get the underlying data with the `values` attribute. Note that this looks very similar to the NumPy representation.

In [57]:
frame.values

array([[2021, 'Ohio', 1.5, nan],
       [2022, 'Ohio', 1.7, nan],
       [2023, 'Ohio', 3.6, nan],
       [2021, 'Nevada', 2.4, nan],
       [2022, 'Nevada', 2.9, nan],
       [2023, 'Nevada', 3.2, nan]], dtype=object)

We can isolate the columns with the `columns` attribute. Note that the columns are actually an `Index` object too, just along a different axis (the columns are the horizontal index, while the rows are the vertical index).

In [59]:
frame.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

The `Index` object along the rows of the dataframe can be accessed through the `index` attribute (just as with `Series` objects):

In [60]:
frame.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

As with `Series` and `Index` objects, we can get the dimensions of the dataframe with the `shape` attribute. The result has the form `(nrows, ncols)`. Our DataFrame has 6 rows and 4 columns.

In [61]:
frame.shape

(6, 4)

## **2. Creating DataFrames**

We will create `DataFrame` objects from other data structures in Python.

First, we import the pandas and NumPy libraries.

In [62]:
import numpy as np
import pandas as pd

### Creating a `DataFrame` object from a `Series` object

Using the `pd.DataFrame()` constructor or the `to_frame()` method:

In [64]:
# option 1
pd.DataFrame(series1)

state,population
California,40000
Ohio,71000
Oregon,16000
Texas,5000


In [65]:
# option 2
series1.to_frame()

state,population
California,40000
Ohio,71000
Oregon,16000
Texas,5000


In [69]:
pd.Series(np.linspace(10,100,10)).to_frame(name='salary')

Unnamed: 0,salary
0,10.0
1,20.0
2,30.0
3,40.0
4,50.0
5,60.0
6,70.0
7,80.0
8,90.0
9,100.0


### Creating a `DataFrame` object from Python Data Structures

First, from a *dictionary of list-like structures*. The dictionary values can be lists, NumPy arrays, and so on, as long as they have a length (generators have no length, so we can't use them here):

In [77]:
import datetime as dt

In [84]:
np.random.seed(2024)

pd.DataFrame(
    {
        'random': np.random.rand(5),
        'text': ['hot', 'warm', 'cool', 'cold', None],
        'truth': [np.random.choice([True, False]) for _ in range(5)]
    },
    index = pd.date_range(
        end=dt.date(2024,2,27),
        freq='1D',
        periods=5,
        name='date'
    )
)

date,random,text,truth
2024-02-23,0.588015,hot,False
2024-02-24,0.699109,warm,True
2024-02-25,0.188152,cool,False
2024-02-26,0.043809,cold,True
2024-02-27,0.205019,,True
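One way to work around the generator restriction mentioned above is to materialize the generator into a list first. This is only a sketch, and the `squares` generator is a hypothetical helper written for this illustration:

```python
import pandas as pd

def squares(n):
    """Hypothetical generator used only for this illustration."""
    for i in range(n):
        yield i ** 2

# A generator has no len(), so turn it into a list before using it as a column
df_gen = pd.DataFrame({
    'n': range(5),
    'n_squared': list(squares(5)),
})
print(df_gen)
```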


Second, from a *list of dictionaries*:

In [85]:
pd.DataFrame(
    [
        {'mag': 5.2, 'place': 'California'},
        {'mag': 1.2, 'place': 'Alaska'},
        {'mag': 0.2, 'place': 'California'},
    ]
)

Unnamed: 0,mag,place
0,5.2,California
1,1.2,Alaska
2,0.2,California


Third, from a *list of tuples*:

In [87]:
pd.DataFrame(
    [(n, n**2, n**3) for n in range(5)],
    columns=['n', 'n_squared', 'n_cubed']
)

Unnamed: 0,n,n_squared,n_cubed
0,0,0,0
1,1,1,1
2,2,4,8
3,3,9,27
4,4,16,64


Fourth, from a *NumPy array*:

In [None]:
pd.DataFrame(
    np.array([
        [0, 0, 0],
        [1, 1, 1],
        [2, 4, 8],
        [3, 9, 27],
        [4, 16, 64]
    ]), columns=['n', 'n_squared', 'n_cubed']
)

Unnamed: 0,n,n_squared,n_cubed
0,0,0,0
1,1,1,1
2,2,4,8
3,3,9,27
4,4,16,64


## **3. Reading and Writing Data in Text Format**

Accessing data typically falls into a few main categories: reading text files and other, more efficient on-disk formats, loading data from databases, and interacting with network sources such as web APIs.

Pandas has a number of functions for reading tabular data as a DataFrame object. The table below summarizes some of them, though `read_csv` and `read_table` are likely the ones used most often.

| **Function** | **Description** |
| --- | --- |
| `read_csv` | Load delimited data from a file, URL, or file-like object; use comma as default delimiter |
| `read_table` | Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter |
| `read_excel` | Read tabular data from an Excel XLS or XLSX file |
| `read_html` | Read all tables found in the given HTML document |
| `read_json` | Read data from a JSON (JavaScript Object Notation) string representation |
| `read_sql` | Read the result of a SQL query (using SQLAlchemy) as a pandas DataFrame |

### Creating a `DataFrame` object from the contents of a CSV File

It helps to find out a little about a file before reading it.

Our file is small, has a header on the first row, and is comma-separated, so we don't need to pass any additional arguments to read it with `pd.read_csv()`. Still, be sure to check the documentation for the arguments that are available:

In [90]:
df = pd.read_csv('earthquakes.csv')
df

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.020030,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.021370,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.026180,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.077990,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9327,,,73086771,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.018060,,185.0,",nc73086771,",0.62,md,...,",nc,",reviewed,1537230228060,"M 0.6 - 9km ENE of Mammoth Lakes, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1537285598315,https://earthquake.usgs.gov/earthquakes/eventp...
9328,,,38063967,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.030410,,50.0,",ci38063967,",1.00,ml,...,",ci,",reviewed,1537230135130,"M 1.0 - 3km W of Julian, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1537276800970,https://earthquake.usgs.gov/earthquakes/eventp...
9329,,,2018261000,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.452600,,276.0,",pr2018261000,",2.40,md,...,",pr,",reviewed,1537229908180,"M 2.4 - 35km NNE of Hatillo, Puerto Rico",0,earthquake,",geoserve,origin,phase-data,",-240.0,1537243777410,https://earthquake.usgs.gov/earthquakes/eventp...
9330,,,38063959,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.018650,,61.0,",ci38063959,",1.10,ml,...,",ci,",reviewed,1537229545350,"M 1.1 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537230211640,https://earthquake.usgs.gov/earthquakes/eventp...
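When a file does not follow those defaults, `pd.read_csv()` accepts arguments to describe it. The sketch below uses made-up, semicolon-separated text without a header row, read from an in-memory buffer so it can be run without any file on disk:

```python
import io
import pandas as pd

# Made-up sample: semicolon-separated, no header row
raw_text = "1.35;ml;CA\n4.70;mb;Indonesia\n"

df_custom = pd.read_csv(
    io.StringIO(raw_text),
    sep=';',                            # delimiter other than a comma
    header=None,                        # the data has no header row
    names=['mag', 'magType', 'place'],  # so we supply the column names
)
print(df_custom)
```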


We can also pass in a URL. Let's read the same file from GitHub.

In [91]:
df =  pd.read_csv(
    'https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/blob/master/ch_02/data/earthquakes.csv?raw=True'
)

In [93]:
df.head()

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...


### Creating a `DataFrame` object from the contents of a JSON File

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data via HTTP requests between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV.

In [94]:
df = pd.read_json('population_data.json')
df.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1960,96388069.0
1,Arab World,ARB,1961,98882541.4
2,Arab World,ARB,1962,101474075.8
3,Arab World,ARB,1963,104169209.2
4,Arab World,ARB,1964,106978104.6
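The layout of `population_data.json` itself is not shown here, so the following is only a sketch under the assumption that the data is stored as a list of records, one object per row. It reads a small JSON string modeled on the output above from an in-memory buffer:

```python
import io
import pandas as pd

# Assumed record-per-row layout; the two rows mirror the output shown above
json_text = '''
[
  {"Country Name": "Arab World", "Country Code": "ARB", "Year": 1960, "Value": 96388069.0},
  {"Country Name": "Arab World", "Country Code": "ARB", "Year": 1961, "Value": 98882541.4}
]
'''

df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```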


### Creating a `DataFrame` object from the contents of an Excel File

pandas also supports reading tabular data stored in Excel 2003 (and later) files using either the `ExcelFile` class or the `pd.read_excel()` function.

In [2]:
import pandas as pd
import numpy as np

In [4]:
xlsx = pd.ExcelFile('superhero_info.xlsx')
df = pd.read_excel(xlsx)
df.head()

Unnamed: 0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Alignment,Weight
0,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191,Dark Horse Comics,good,65
1,Alien,Male,-,Xenomorph XX121,No Hair,244,Dark Horse Comics,bad,169
2,Angel,Male,-,Vampire,-,-99,Dark Horse Comics,good,-99
3,Buffy,Female,green,Human,Blond,157,Dark Horse Comics,good,52
4,Captain Midnight,Male,-,Human,-,-99,Dark Horse Comics,good,-99


### Creating a `DataFrame` object by Querying a Database

Here we use a SQLite database, which pandas can query through a `sqlite3` connection from the standard library. For other databases, you need to install SQLAlchemy.

In [5]:
import sqlite3

In [6]:
with sqlite3.connect('quakes.db') as connection:
    tsunamis = pd.read_sql('SELECT * FROM tsunamis LIMIT 10', connection)

tsunamis

Unnamed: 0,alert,type,title,place,magType,mag,time
0,,earthquake,"M 5.0 - 165km NNW of Flying Fish Cove, Christm...","165km NNW of Flying Fish Cove, Christmas Island",mww,5.0,1539459504090
1,green,earthquake,"M 6.7 - 262km NW of Ozernovskiy, Russia","262km NW of Ozernovskiy, Russia",mww,6.7,1539429023560
2,green,earthquake,"M 5.6 - 128km SE of Kimbe, Papua New Guinea","128km SE of Kimbe, Papua New Guinea",mww,5.6,1539312723620
3,green,earthquake,"M 6.5 - 148km S of Severo-Kuril'sk, Russia","148km S of Severo-Kuril'sk, Russia",mww,6.5,1539213362130
4,green,earthquake,"M 6.2 - 94km SW of Kokopo, Papua New Guinea","94km SW of Kokopo, Papua New Guinea",mww,6.2,1539208835130
5,green,earthquake,"M 5.9 - 117km ESE of Kimbe, Papua New Guinea","117km ESE of Kimbe, Papua New Guinea",mww,5.9,1539205996680
6,green,earthquake,"M 5.9 - 113km ESE of Kimbe, Papua New Guinea","113km ESE of Kimbe, Papua New Guinea",mww,5.9,1539205141060
7,green,earthquake,"M 7.0 - 117km E of Kimbe, Papua New Guinea","117km E of Kimbe, Papua New Guinea",mww,7.0,1539204500290
8,green,earthquake,"M 6.1 - 132km E of Kimbe, Papua New Guinea","132km E of Kimbe, Papua New Guinea",mb,6.1,1539204326420
9,green,earthquake,"M 5.0 - 61km SSW of Chignik Lake, Alaska","61km SSW of Chignik Lake, Alaska",ml,5.0,1539152878406
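Since `quakes.db` may not be at hand, here is a self-contained sketch of the same idea using an in-memory SQLite database. The table name `quakes_demo` and its rows are made up for illustration. It also shows the `params` argument of `pd.read_sql()`, which passes values separately from the SQL text:

```python
import sqlite3
import pandas as pd

connection = sqlite3.connect(':memory:')   # temporary in-memory database
connection.execute('CREATE TABLE quakes_demo (place TEXT, mag REAL)')
connection.executemany(
    'INSERT INTO quakes_demo VALUES (?, ?)',
    [('Alaska', 5.0), ('Russia', 6.7), ('California', 1.3)],
)

# The threshold is passed as a parameter instead of being formatted into the SQL
strong = pd.read_sql(
    'SELECT * FROM quakes_demo WHERE mag >= ? ORDER BY mag',
    connection,
    params=(5.0,),
)
connection.close()
print(strong)
```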


### Creating a `DataFrame` object by Connecting to mySQL

You need to install mysql-connector-python first.

In [3]:
# pip install mysql-connector-python

Next, import the `mysql.connector` module and connect to the database using your own account.

In [5]:
# import the module
import mysql.connector

# connect to the database
mydb = mysql.connector.connect(
    host = 'localhost',
    user = 'root',
    passwd = 'your_password',               # enter your own MySQL password
    database = 'sakila'                     # name of the database
)

Create a function named `sql_table` with a `query` parameter that pulls data from the database and converts it into a DataFrame.

In [7]:
# access the database
curs = mydb.cursor()                                                # create a cursor for the database

def sql_table(query):
    curs.execute(query)                                             # run the query
    result = curs.fetchall()                                        # fetch all result rows
    tabel1 = pd.DataFrame(result, columns = curs.column_names)      # convert to a DataFrame
    return tabel1                                                   # return the DataFrame

In [None]:
sql_table(
    '''
    SELECT * FROM film
    LIMIT 10
    '''
)
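To try the same cursor-to-DataFrame pattern without a MySQL server, the sketch below swaps in the standard-library `sqlite3` module. Instead of the `column_names` attribute used above, it reads the column names from `cursor.description`, which the Python DB-API defines. The function name `query_to_frame`, the `film_demo` table, and its single row are hypothetical.

```python
import sqlite3
import pandas as pd

def query_to_frame(connection, query):
    """Run a query and return the result as a DataFrame."""
    cursor = connection.cursor()
    cursor.execute(query)
    rows = cursor.fetchall()
    # In the DB-API, each entry of cursor.description starts with the column name
    columns = [col[0] for col in cursor.description]
    cursor.close()
    return pd.DataFrame(rows, columns=columns)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE film_demo (film_id INTEGER, title TEXT)')
conn.execute("INSERT INTO film_demo VALUES (1, 'EXAMPLE TITLE')")
films = query_to_frame(conn, 'SELECT * FROM film_demo LIMIT 10')
conn.close()
print(films)
```

Passing the connection as an argument, rather than relying on a module-level cursor, keeps the helper reusable across databases.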

### Writing a `DataFrame` object to a CSV File

Note that the index of this DataFrame is just row numbers, so we don't want to save it. We therefore pass `index=False` to the `to_csv()` method:

In [9]:
tsunamis.to_csv('tsunamis.csv', index=False)
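The effect of `index=False` is easy to see on a small made-up frame. In this sketch `to_csv()` is called without a path, in which case it returns the CSV text as a string instead of writing a file:

```python
import io
import pandas as pd

small = pd.DataFrame({'mag': [5.0, 6.7], 'magType': ['mww', 'mww']})

with_index = small.to_csv()                # no path given: returns the CSV as a string
without_index = small.to_csv(index=False)

print(with_index.splitlines()[0])          # header row starts with an empty field
print(without_index.splitlines()[0])

# Reading the indexed version back adds an extra 'Unnamed: 0' column
round_trip = pd.read_csv(io.StringIO(with_index))
print(round_trip.columns.tolist())
```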

## **4. Indexing, Selection, and Filtering**

We will work with the `earthquakes.csv` file to practice this, so we need to handle the imports and read it in.

In [2]:
df = pd.read_csv('earthquakes.csv')
df.head()

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...


In [3]:
# the data has 9332 rows and 26 columns
df.shape

(9332, 26)

In [4]:
df.columns

Index(['alert', 'cdi', 'code', 'detail', 'dmin', 'felt', 'gap', 'ids', 'mag',
       'magType', 'mmi', 'net', 'nst', 'place', 'rms', 'sig', 'sources',
       'status', 'time', 'title', 'tsunami', 'type', 'types', 'tz', 'updated',
       'url'],
      dtype='object')

### Selecting `column`

Selecting a column using attribute notation:

In [5]:
# be careful when a column name matches a DataFrame method or attribute
# e.g., if there were a column named shape,
# df.shape would return the dimensions rather than the shape column
df.mag

0       1.35
1       1.29
2       3.42
3       0.44
4       2.16
        ... 
9327    0.62
9328    1.00
9329    2.40
9330    1.10
9331    0.66
Name: mag, Length: 9332, dtype: float64
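The caveat about attribute notation can be reproduced with a tiny made-up frame (the column names below are chosen only to trigger the problem):

```python
import pandas as pd

shapes = pd.DataFrame({'shape': ['circle', 'square'], 'side count': [0, 4]})

print(shapes.shape)        # the DataFrame attribute: (2, 2)
print(shapes['shape'])     # the column named 'shape'

# A name containing a space cannot be written as an attribute, only with brackets
side_counts = shapes['side count']
print(side_counts.tolist())
```

Bracket notation works for any column name, so it is the safer choice in reusable code.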

Or with dictionary-like syntax:

In [7]:
# select the mag column, returned as a Series
df['mag']

0       1.35
1       1.29
2       3.42
3       0.44
4       2.16
        ... 
9327    0.62
9328    1.00
9329    2.40
9330    1.10
9331    0.66
Name: mag, Length: 9332, dtype: float64

In [9]:
# select the mag column, returned as a DataFrame
df[['mag']]

Unnamed: 0,mag
0,1.35
1,1.29
2,3.42
3,0.44
4,2.16
...,...
9327,0.62
9328,1.00
9329,2.40
9330,1.10


Selecting multiple columns:

In [11]:
# dataframe_name[[column_1, column_2, column_3, ...]]
df[['mag', 'alert', 'tsunami']]

Unnamed: 0,mag,alert,tsunami
0,1.35,,0
1,1.29,,0
2,3.42,,0
3,0.44,,0
4,2.16,,0
...,...,...,...
9327,0.62,,0
9328,1.00,,0
9329,2.40,,0
9330,1.10,,0


### Selecting `rows`

Using row numbers (the first index is included, the last index is excluded):

In [13]:
# select rows by slicing [start:stop:step]; stop is exclusive
df[0:5]

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...


In [14]:
# select rows at positions 1 through 10 with a step of 2 (1, 3, 5, 7, and 9)
df[1:11:2]

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
5,,,2018286011,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.4373,,158.0,",pr2018286011,",2.61,md,...,",pr,",reviewed,1539473686440,"M 2.6 - 55km ESE of Punta Cana, Dominican Repu...",0,earthquake,",geoserve,origin,phase-data,",-300.0,1539500579236,https://earthquake.usgs.gov/earthquakes/eventp...
7,,,73096936,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.01622,,83.0,",nc73096936,",1.13,md,...,",nc,",automatic,1539473060280,"M 1.1 - 10km NW of Parkfield, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539476642808,https://earthquake.usgs.gov/earthquakes/eventp...
9,,,1000hbtn,https://earthquake.usgs.gov/fdsnws/event/1/que...,3.191,,37.0,",us1000hbtn,",4.7,mb,...,",us,",reviewed,1539472814760,"M 4.7 - 219km SSE of Saparua, Indonesia",0,earthquake,",geoserve,origin,phase-data,",540.0,1539473712040,https://earthquake.usgs.gov/earthquakes/eventp...


Selecting rows and columns by chaining:

In [16]:
# chaining: dataframe[rows][columns]
df[1:11:2][['mag', 'magType']]

Unnamed: 0,mag,magType
1,1.29,ml
3,0.44,ml
5,2.61,md
7,1.13,md
9,4.7,mb


In [18]:
# chaining: dataframe[columns][rows]
df[['mag', 'magType']][1:11:2]

Unnamed: 0,mag,magType
1,1.29,ml
3,0.44,ml
5,2.61,md
7,1.13,md
9,4.7,mb


### Indexing with `loc`

Selection with `loc` takes the form `loc[row_indexer, column_indexer]`, where `:` can be used to select everything.

In [22]:
# with loc, the stop is inclusive
# take the rows labeled 10 through 15 (inclusive) and the columns 'magType', 'alert', and 'status'
df.loc[10:15, ['magType', 'alert', 'status']]

Unnamed: 0,magType,alert,status
10,ml,,automatic
11,md,,reviewed
12,ml,,automatic
13,mb,,reviewed
14,md,,automatic
15,ml,,automatic


We can use `loc` to select specific rows and columns without chaining. If we use row numbers with `loc`, they now **include** the end index:

In [23]:
df.loc[5:10, ['title', 'mag']]

Unnamed: 0,title,mag
5,"M 2.6 - 55km ESE of Punta Cana, Dominican Repu...",2.61
6,"M 1.7 - 105km W of Talkeetna, Alaska",1.7
7,"M 1.1 - 10km NW of Parkfield, CA",1.13
8,"M 0.9 - 6km NW of The Geysers, CA",0.92
9,"M 4.7 - 219km SSE of Saparua, Indonesia",4.7
10,"M 0.5 - 10km NE of Aguanga, CA",0.5
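The difference between the two slicing behaviors is easiest to see side by side. This sketch uses a small made-up frame with the default integer index:

```python
import pandas as pd

demo_df = pd.DataFrame({'mag': [1.35, 1.29, 3.42, 0.44, 2.16, 2.61]})

positional = demo_df[1:3]       # rows 1 and 2; the stop is excluded
by_label = demo_df.loc[1:3]     # rows 1, 2, and 3; the stop is included

print(positional.index.tolist())
print(by_label.index.tolist())
```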


In [27]:
# take the rows labeled 0 through 5
# and the columns from 'detail' through 'magType'
df.loc[:5, 'detail':'magType']

Unnamed: 0,detail,dmin,felt,gap,ids,mag,magType
0,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml
1,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml
2,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml
3,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml
4,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md
5,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.4373,,158.0,",pr2018286011,",2.61,md


In [29]:
# take all rows
# and the columns from 'magType' through the last one
df.loc[:, 'magType':]

Unnamed: 0,magType,mmi,net,nst,place,rms,sig,sources,status,time,title,tsunami,type,types,tz,updated,url
0,ml,,ci,26.0,"9km NE of Aguanga, CA",0.19,28,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,ml,,ci,20.0,"9km NE of Aguanga, CA",0.29,26,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,ml,,ci,111.0,"8km NE of Aguanga, CA",0.22,192,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,ml,,ci,26.0,"9km NE of Aguanga, CA",0.17,3,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,md,,nc,18.0,"10km NW of Avenal, CA",0.05,72,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9327,md,,nc,13.0,"9km ENE of Mammoth Lakes, CA",0.03,6,",nc,",reviewed,1537230228060,"M 0.6 - 9km ENE of Mammoth Lakes, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1537285598315,https://earthquake.usgs.gov/earthquakes/eventp...
9328,ml,,ci,28.0,"3km W of Julian, CA",0.21,15,",ci,",reviewed,1537230135130,"M 1.0 - 3km W of Julian, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1537276800970,https://earthquake.usgs.gov/earthquakes/eventp...
9329,md,,pr,9.0,"35km NNE of Hatillo, Puerto Rico",0.41,89,",pr,",reviewed,1537229908180,"M 2.4 - 35km NNE of Hatillo, Puerto Rico",0,earthquake,",geoserve,origin,phase-data,",-240.0,1537243777410,https://earthquake.usgs.gov/earthquakes/eventp...
9330,ml,,ci,27.0,"9km NE of Aguanga, CA",0.10,19,",ci,",reviewed,1537229545350,"M 1.1 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537230211640,https://earthquake.usgs.gov/earthquakes/eventp...
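Label slices work on columns the same way they do on rows. The sketch below uses a hypothetical four-column `toy` DataFrame to show a closed column range and an open-ended one.

```python
import pandas as pd

# Hypothetical toy data with columns 'a' through 'd'.
toy = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9], "d": [10, 11, 12]}
)

# Column slice from 'b' through 'c'; both end points are included.
middle = toy.loc[:, "b":"c"]

# Rows labelled 0 and 1, columns from 'c' to the last column.
tail = toy.loc[:1, "c":]

print(list(middle.columns))  # ['b', 'c']
print(tail.shape)            # (2, 2)
```

Column slices follow the order in which the columns appear in the DataFrame, not alphabetical order.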


### Indexing with `iloc`

`iloc` selects by integer position and excludes the end point, just like Python slicing.

In [32]:
# iloc uses integer positions for both rows and columns
# with iloc, the stop is exclusive
# select the rows at positions 5 through 9
# and the columns at positions 3 (detail) and 1 (cdi)

df.iloc[5:10, [3, 1]]

Unnamed: 0,detail,cdi
5,https://earthquake.usgs.gov/fdsnws/event/1/que...,
6,https://earthquake.usgs.gov/fdsnws/event/1/que...,
7,https://earthquake.usgs.gov/fdsnws/event/1/que...,
8,https://earthquake.usgs.gov/fdsnws/event/1/que...,
9,https://earthquake.usgs.gov/fdsnws/event/1/que...,


We can use slicing syntax with `iloc` for both rows and columns.

In [34]:
df.columns

Index(['alert', 'cdi', 'code', 'detail', 'dmin', 'felt', 'gap', 'ids', 'mag',
       'magType', 'mmi', 'net', 'nst', 'place', 'rms', 'sig', 'sources',
       'status', 'time', 'title', 'tsunami', 'type', 'types', 'tz', 'updated',
       'url'],
      dtype='object')

In [35]:
# select the rows at positions 5 through 9
# and the columns at positions 4 (dmin) through 8 (mag) with a step of 2 (positions 4, 6, 8)
# the stop is exclusive
df.iloc[5:10, 4:9:2]

Unnamed: 0,dmin,gap,mag
5,0.4373,158.0,2.61
6,,,1.7
7,0.01622,83.0,1.13
8,0.009138,52.0,0.92
9,3.191,37.0,4.7
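The difference between `iloc` and `loc` becomes clearer when the index labels do not match the row positions. This is a small sketch on hypothetical data with the index set to 10, 20, 30, 40.

```python
import pandas as pd

# Hypothetical toy data with a non-default integer index,
# so that labels and positions differ.
toy = pd.DataFrame(
    {"dmin": [0.1, 0.2, 0.3, 0.4], "gap": [10, 20, 30, 40], "mag": [1.5, 2.5, 3.5, 4.5]},
    index=[10, 20, 30, 40],
)

# iloc: row positions 1 and 2 (stop 3 is excluded), every second column.
by_position = toy.iloc[1:3, 0:3:2]

# loc: row labels 20 through 30 (the stop label is included).
by_label = toy.loc[20:30, ["dmin", "mag"]]

print(by_position.equals(by_label))  # True
```

Both selections return the rows labelled 20 and 30 with the columns `dmin` and `mag`; only the way of addressing them differs.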


### Filtering `DataFrame`

We can filter our dataframe using a **Boolean mask**, which can be created as follows:

In [41]:
# condition (the result is a Series of Booleans)
df['mag'] > 5

0       False
1       False
2       False
3       False
4       False
        ...  
9327    False
9328    False
9329    False
9330    False
9331    False
Name: mag, Length: 9332, dtype: bool
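Because a mask is an ordinary `Series` of Booleans, it can be inspected before it is used for selection. A short sketch, assuming a hypothetical `toy` DataFrame with a `mag` column:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"mag": [4.2, 5.6, 6.1, 3.3]})

mask = toy["mag"] > 5      # Series of True/False with the same index as toy

print(mask.dtype)          # bool
print(int(mask.sum()))     # 2  (each True counts as 1)
print(bool(mask.any()))    # True
```

Summing the mask is a quick way to check how many rows a filter will keep.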

To use the mask above for selection, place it inside the square brackets:

In [46]:
# data_frame[condition]
# show all columns, but only the rows where 'mag' is greater than 5
df[df['mag'] > 5]

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
118,green,,1000hbkz,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.623,,25.0,",pt18286001,at00pgjb1a,us1000hbkz,",6.7,mww,...,",pt,at,us,",reviewed,1539429023560,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake,",geoserve,ground-failure,impact-link,losspager...",600.0,1539455437040,https://earthquake.usgs.gov/earthquakes/eventp...
180,green,,1000hbhw,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.077,,23.0,",us1000hbhw,",5.2,mww,...,",us,",reviewed,1539405255580,"M 5.2 - 25km E of Bitung, Indonesia",0,earthquake,",geoserve,losspager,origin,phase-data,shakemap,",480.0,1539412565560,https://earthquake.usgs.gov/earthquakes/eventp...
226,green,,1000hbff,https://earthquake.usgs.gov/fdsnws/event/1/que...,7.385,,27.0,",us1000hbff,",5.7,mww,...,",us,",reviewed,1539389626220,"M 5.7 - 42km WNW of Sola, Vanuatu",0,earthquake,",geoserve,losspager,origin,phase-data,shakemap,",660.0,1539396937285,https://earthquake.usgs.gov/earthquakes/eventp...
227,,3.1,1000hbfe,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.822,9.0,90.0,",us1000hbfe,",5.2,mb,...,",us,",reviewed,1539389603790,"M 5.2 - 15km WSW of Pisco, Peru",0,earthquake,",dyfi,geoserve,origin,phase-data,",-300.0,1539403377538,https://earthquake.usgs.gov/earthquakes/eventp...
258,,,1000hbde,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.644,,46.0,",us1000hbde,",5.1,mb,...,",us,",reviewed,1539380306940,"M 5.1 - 236km NNW of Kuril'sk, Russia",0,earthquake,",geoserve,origin,phase-data,",600.0,1539381450040,https://earthquake.usgs.gov/earthquakes/eventp...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9175,,,2000hgb0,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.746,,38.0,",us2000hgb0,",5.2,mb,...,",us,",reviewed,1537262729590,"M 5.2 - 126km N of Dili, East Timor",1,earthquake,",geoserve,origin,phase-data,",480.0,1537264531040,https://earthquake.usgs.gov/earthquakes/eventp...
9176,,,2000hgax,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.839,,83.0,",us2000hgax,",5.2,mb,...,",us,",reviewed,1537262656830,"M 5.2 - 90km S of Raoul Island, New Zealand",0,earthquake,",geoserve,origin,phase-data,",-720.0,1537263853040,https://earthquake.usgs.gov/earthquakes/eventp...
9211,green,,2000hg93,https://earthquake.usgs.gov/fdsnws/event/1/que...,8.749,,22.0,",us2000hg93,",6.0,mww,...,",us,",reviewed,1537255661330,M 6.0 - Southwest Indian Ridge,0,earthquake,",geoserve,losspager,moment-tensor,origin,phase...",180.0,1538206958458,https://earthquake.usgs.gov/earthquakes/eventp...
9213,,,2000hg99,https://earthquake.usgs.gov/fdsnws/event/1/que...,5.359,,92.0,",us2000hg99,",5.1,mb,...,",us,",reviewed,1537255481060,M 5.1 - South of Tonga,0,earthquake,",geoserve,origin,phase-data,",-720.0,1538204240040,https://earthquake.usgs.gov/earthquakes/eventp...


In [47]:
# data_frame[condition][columns]
# show only the 'title' and 'mag' columns for the rows where 'mag' is greater than 5
df[df['mag'] > 5][['title', 'mag']]

Unnamed: 0,title,mag
118,"M 6.7 - 262km NW of Ozernovskiy, Russia",6.7
180,"M 5.2 - 25km E of Bitung, Indonesia",5.2
226,"M 5.7 - 42km WNW of Sola, Vanuatu",5.7
227,"M 5.2 - 15km WSW of Pisco, Peru",5.2
258,"M 5.1 - 236km NNW of Kuril'sk, Russia",5.1
...,...,...
9175,"M 5.2 - 126km N of Dili, East Timor",5.2
9176,"M 5.2 - 90km S of Raoul Island, New Zealand",5.2
9211,M 6.0 - Southwest Indian Ridge,6.0
9213,M 5.1 - South of Tonga,5.1


We can also use the mask with `loc`:

In [48]:
# dataframe.loc[row_condition, column_indexer]
df.loc[df['mag'] > 5, ['title', 'mag']]

Unnamed: 0,title,mag
118,"M 6.7 - 262km NW of Ozernovskiy, Russia",6.7
180,"M 5.2 - 25km E of Bitung, Indonesia",5.2
226,"M 5.7 - 42km WNW of Sola, Vanuatu",5.7
227,"M 5.2 - 15km WSW of Pisco, Peru",5.2
258,"M 5.1 - 236km NNW of Kuril'sk, Russia",5.1
...,...,...
9175,"M 5.2 - 126km N of Dili, East Timor",5.2
9176,"M 5.2 - 90km S of Raoul Island, New Zealand",5.2
9211,M 6.0 - Southwest Indian Ridge,6.0
9213,M 5.1 - South of Tonga,5.1
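The chained form and the `loc` form return the same rows and columns. The sketch below checks this on hypothetical data; the column names mirror the notebook, the values do not.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "mag": [4.9, 5.3, 6.0], "tsunami": [0, 0, 1]}
)
mask = toy["mag"] > 5

chained = toy[mask][["title", "mag"]]      # two selections, one after the other
single = toy.loc[mask, ["title", "mag"]]   # rows and columns in one call

print(chained.equals(single))  # True
```

For reading data the two are interchangeable; `loc` expresses the row filter and the column choice in a single step.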


A Boolean mask can combine several criteria using the bitwise operators `&` for AND and `|` for OR. Each criterion must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators. We cannot use `and`/`or` here because they expect a single truth value, whereas we need the conditions evaluated row by row.

In [50]:
# syntax
# AND: dataframe[(condition_1) & (condition_2)]
# OR: dataframe[(condition_1) | (condition_2)]

# show the rows that have a tsunami indication and a red alert
# and only the columns 'alert', 'mag', 'magType', 'title', 'tsunami', 'type'

df[(df['tsunami']==1) & (df['alert']=='red')][['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake


In [52]:
df.loc[(df['tsunami']==1) & (df['alert']=='red'), 
       ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake


An example with an OR condition:

In [53]:
# show the rows that have a tsunami indication or a red alert
# and only the columns 'alert', 'mag', 'magType', 'title', 'tsunami', 'type'

df.loc[(df['tsunami']==1) | (df['alert']=='red'), 
       ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
36,,5.0,mww,"M 5.0 - 165km NNW of Flying Fish Cove, Christm...",1,earthquake
118,green,6.7,mww,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake
501,green,5.6,mww,"M 5.6 - 128km SE of Kimbe, Papua New Guinea",1,earthquake
799,green,6.5,mww,"M 6.5 - 148km S of Severo-Kuril'sk, Russia",1,earthquake
816,green,6.2,mww,"M 6.2 - 94km SW of Kokopo, Papua New Guinea",1,earthquake
...,...,...,...,...,...,...
8561,,5.4,mb,"M 5.4 - 228km S of Taron, Papua New Guinea",1,earthquake
8624,,5.1,mb,"M 5.1 - 278km SE of Pondaguitan, Philippines",1,earthquake
9133,green,5.1,ml,"M 5.1 - 64km SSW of Kaktovik, Alaska",1,earthquake
9175,,5.2,mb,"M 5.2 - 126km N of Dili, East Timor",1,earthquake
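To see why `and`/`or` cannot replace `&`/`|`, the following sketch combines two conditions on hypothetical data and then tries the same thing with Python's `and`.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"tsunami": [1, 0, 1, 0], "alert": ["red", "red", "green", None]}
)

both = toy[(toy["tsunami"] == 1) & (toy["alert"] == "red")]
either = toy[(toy["tsunami"] == 1) | (toy["alert"] == "red")]

print(list(both.index))    # [0]
print(list(either.index))  # [0, 1, 2]

# Python's `and` asks for one truth value for the whole Series,
# which pandas refuses to guess, so it raises ValueError.
try:
    (toy["tsunami"] == 1) and (toy["alert"] == "red")
    raised = False
except ValueError:
    raised = True

print(raised)  # True
```

The comparison with a missing `alert` value evaluates to `False`, so the last row is excluded from both results.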


A Boolean mask can be built from any expression that yields Booleans. For example, we can select all earthquakes with the string `Alaska` in the `place` column and a non-null value in the `alert` column. To find non-null values, we can use the `isnull()` method with the bitwise negation operator (`~`), or the `notnull()` method:

In [60]:
df[(df['place'].str.contains('Alaska')) & (df['alert'].notnull())][['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
1015,green,5.0,ml,"M 5.0 - 61km SSW of Chignik Lake, Alaska",1,earthquake
1273,green,4.0,ml,"M 4.0 - 71km SW of Kaktovik, Alaska",1,earthquake
1795,green,4.0,ml,"M 4.0 - 60km WNW of Valdez, Alaska",1,earthquake
2752,green,4.0,ml,"M 4.0 - 67km SSW of Kaktovik, Alaska",1,earthquake
3260,green,3.9,ml,"M 3.9 - 44km N of North Nenana, Alaska",0,earthquake
4101,green,4.2,ml,"M 4.2 - 131km NNW of Arctic Village, Alaska",0,earthquake
6897,green,3.8,ml,"M 3.8 - 80km SSW of Kaktovik, Alaska",0,earthquake
8524,green,3.8,ml,"M 3.8 - 69km SSW of Kaktovik, Alaska",0,earthquake
9133,green,5.1,ml,"M 5.1 - 64km SSW of Kaktovik, Alaska",1,earthquake


In [64]:
df.loc[(df['place'].str.contains('Alaska')) & (df['alert'].notnull()),
       ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
1015,green,5.0,ml,"M 5.0 - 61km SSW of Chignik Lake, Alaska",1,earthquake
1273,green,4.0,ml,"M 4.0 - 71km SW of Kaktovik, Alaska",1,earthquake
1795,green,4.0,ml,"M 4.0 - 60km WNW of Valdez, Alaska",1,earthquake
2752,green,4.0,ml,"M 4.0 - 67km SSW of Kaktovik, Alaska",1,earthquake
3260,green,3.9,ml,"M 3.9 - 44km N of North Nenana, Alaska",0,earthquake
4101,green,4.2,ml,"M 4.2 - 131km NNW of Arctic Village, Alaska",0,earthquake
6897,green,3.8,ml,"M 3.8 - 80km SSW of Kaktovik, Alaska",0,earthquake
8524,green,3.8,ml,"M 3.8 - 69km SSW of Kaktovik, Alaska",0,earthquake
9133,green,5.1,ml,"M 5.1 - 64km SSW of Kaktovik, Alaska",1,earthquake
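The two ways of testing for non-null values mentioned above give the same mask. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data; None marks a missing alert level.
toy = pd.DataFrame(
    {
        "place": ["5km N of Adak, Alaska", "9km NE of Aguanga, CA", "2km S of Nome, Alaska"],
        "alert": ["green", None, None],
    }
)

in_alaska = toy["place"].str.contains("Alaska")

has_alert = toy["alert"].notnull()
also_has_alert = ~toy["alert"].isnull()   # negated isnull() is equivalent

print(has_alert.equals(also_has_alert))        # True
print(list(toy[in_alaska & has_alert].index))  # [0]
```

If the text column itself may contain missing values, `str.contains` accepts an `na` argument (for example `na=False`) that sets the result for those rows; whether that is needed for the `place` column here is not something this notebook shows.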


We can use the `between()` method to replace two separate comparisons (greater than or equal to the minimum and less than or equal to the maximum) with a single call. Note that both end points are included by default:

In [66]:
df.loc[
    df.mag.between(6.5, 7.5),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

Unnamed: 0,alert,mag,magType,title,tsunami,type
118,green,6.7,mww,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake
799,green,6.5,mww,"M 6.5 - 148km S of Severo-Kuril'sk, Russia",1,earthquake
837,green,7.0,mww,"M 7.0 - 117km E of Kimbe, Papua New Guinea",1,earthquake
4363,green,6.7,mww,"M 6.7 - 263km NNE of Ndoi Island, Fiji",1,earthquake
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake


In [68]:
df.loc[
    df['mag'].between(6.5, 7.5),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

Unnamed: 0,alert,mag,magType,title,tsunami,type
118,green,6.7,mww,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake
799,green,6.5,mww,"M 6.5 - 148km S of Severo-Kuril'sk, Russia",1,earthquake
837,green,7.0,mww,"M 7.0 - 117km E of Kimbe, Papua New Guinea",1,earthquake
4363,green,6.7,mww,"M 6.7 - 263km NNE of Ndoi Island, Fiji",1,earthquake
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake
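As a quick check of the default inclusive behaviour, this sketch compares `between()` with the two explicit comparisons on hypothetical magnitudes that sit exactly on and just outside the bounds.

```python
import pandas as pd

# Hypothetical toy data with values on and around the bounds 6.5 and 7.5.
toy = pd.DataFrame({"mag": [6.4, 6.5, 7.0, 7.5, 7.6]})

with_between = toy["mag"].between(6.5, 7.5)
with_operators = (toy["mag"] >= 6.5) & (toy["mag"] <= 7.5)

print(with_between.equals(with_operators))  # True
print(list(toy[with_between]["mag"]))       # [6.5, 7.0, 7.5]
```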


We can use the `isin()` method to test membership in a list of values:

In [71]:
df['magType'].value_counts()

ml       6803
md       1796
mb        601
mww        68
mb_lg      30
mwr        14
mh         12
mw          4
mwb         2
ms_20       1
Name: magType, dtype: int64

In [73]:
df.loc[
    df['magType'].isin(['mwb', 'mw', 'ms_20']),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

Unnamed: 0,alert,mag,magType,title,tsunami,type
995,,3.35,mw,"M 3.4 - 9km WNW of Cobb, CA",0,earthquake
1465,green,3.83,mw,"M 3.8 - 109km WNW of Trinidad, CA",0,earthquake
2414,green,3.83,mw,"M 3.8 - 5km SW of Tres Pinos, CA",1,earthquake
4988,green,4.41,mw,"M 4.4 - 1km SE of Delta, B.C., MX",1,earthquake
5196,green,5.7,ms_20,"M 5.7 - 107km N of Palu, Indonesia",1,earthquake
6307,green,5.8,mwb,"M 5.8 - 297km NNE of Ndoi Island, Fiji",0,earthquake
8257,green,5.7,mwb,"M 5.7 - 175km SSE of Lambasa, Fiji",0,earthquake
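`isin()` is shorthand for a chain of equality tests joined with `|`, and the resulting mask can be negated with `~` to select everything else. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"magType": ["ml", "mw", "mwb", "md", "ms_20"]})

rare_types = ["mwb", "mw", "ms_20"]
with_isin = toy["magType"].isin(rare_types)
with_or = (
    (toy["magType"] == "mwb")
    | (toy["magType"] == "mw")
    | (toy["magType"] == "ms_20")
)

print(with_isin.equals(with_or))    # True
print(list(toy[~with_isin].index))  # [0, 3]
```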


We can get the index labels of the minimum and maximum values in a column and use them to select the full rows where those values occur:

In [80]:
[df['mag'].idxmax(), df['mag'].idxmin()]

[5263, 2409]

In [78]:
df.loc[
    [df['mag'].idxmax(), df['mag'].idxmin()],
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']    
]

Unnamed: 0,alert,mag,magType,title,tsunami,type
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake
2409,,-1.26,ml,"M -1.3 - 41km ENE of Adak, Alaska",0,earthquake
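`idxmax()` and `idxmin()` return index labels rather than positions, which is why they pair with `loc` and not `iloc`. The sketch below uses a hypothetical DataFrame with a non-default index to make that visible.

```python
import pandas as pd

# Hypothetical toy data with index labels that differ from row positions.
toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "mag": [2.1, 7.5, -1.26]},
    index=[100, 200, 300],
)

largest = toy["mag"].idxmax()   # label of the row with the largest value
smallest = toy["mag"].idxmin()  # label of the row with the smallest value

print(largest, smallest)                            # 200 300
print(list(toy.loc[[largest, smallest], "title"]))  # ['B', 'C']
```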
