<a href="https://colab.research.google.com/github/harishmuh/Python-for-Data-Science-Analysis/blob/main/Introduction_to_Pandas_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Pandas**

Pandas contains data structures and data manipulation tools designed for fast and easy data cleaning and analysis in Python. Pandas is often used in conjunction with numerical computing libraries like NumPy and SciPy, analytics libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. Pandas adopts a significant part of NumPy's idiomatic style of array-based computing, especially its array-based functions and its preference for loop-free data processing.

Since becoming open source in 2010, pandas has grown to be large enough to be applied in a variety of use cases in the real world. The developer community has grown to over 800 different contributors who have helped build this project because they have used it to solve everyday data problems.

___

## **1. Pandas Data Structure**

Let's start our exploration of `pandas` with an overview of its data structures. You should be familiar with the two most important ones: `Series` and `DataFrame`. While they are not a universal solution to every problem, they provide a solid and usable foundation for most applications.

For the most part, `pandas` objects use NumPy arrays for their internal data representation. However, for some data types, `pandas` builds on NumPy to create its own arrays (https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html). For this reason, depending on the data type, `values` can return either a pandas extension array or a `numpy.ndarray` object. Therefore, when we need a specific type back, it is safer to use the `array` attribute (pandas array) or the `to_numpy()` method (NumPy array) instead of `values`.
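
The difference between `values`, `to_numpy()`, and `array` can be sketched as follows. This is a minimal illustration; the variable names and sample data are made up for the example.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])
categories = pd.Series(['low', 'high', 'low'], dtype='category')

# .values returns whatever backs the Series: a NumPy array for int64,
# but a pandas Categorical for the category dtype
numbers_values_type = type(numbers.values).__name__        # 'ndarray'
categories_values_type = type(categories.values).__name__  # 'Categorical'

# .to_numpy() returns a NumPy array regardless of the dtype
categories_np = categories.to_numpy()

# .array returns a pandas extension array regardless of the dtype
numbers_ext = numbers.array

print(numbers_values_type, categories_values_type)
print(type(categories_np).__name__, type(numbers_ext).__name__)
```

Choosing `to_numpy()` or `array` explicitly keeps downstream code independent of the column's data type.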

### `Series`

A `Series` is a one-dimensional array-like object that contains a sequence of values and their corresponding data labels, called indices. Series are formed from only one array of data.

In [None]:
# Installing Pandas library
# !pip install pandas

In [None]:
# Importing pandas and numpy library
import pandas as pd
import numpy as np

In [None]:
# Creating array
np.array([35000, 71000, 16000, 5000])

array([35000, 71000, 16000,  5000])

In [None]:
# create a series from a list
series1 = pd.Series([35000, 71000, 16000, 5000])
series1

Unnamed: 0,0
0,35000
1,71000
2,16000
3,5000


The string representation of a Series displayed interactively shows the index on the left and the value on the right. You can get the array and index representation of a Series through its attributes. Here are some commonly used attributes:

| Attribute | Returns |
| --- | --- |
| `name` | The name of the `Series` object |
| `dtype` | The data type of the `Series` object |
| `shape` | Dimensions of the `Series` object in a tuple of the form (`number of rows`, ) |
| `index` | The `Index` object that is part of the `Series` object |
| `values` | The data in the `Series` object |

Now let's see some examples of using these attributes.

The `Series` object itself has a `name` attribute, which integrates with other key areas of pandas functionality.

In [None]:
# give a name to a series that does not yet have a name
series1.name = 'population'
series1

Unnamed: 0,population
0,35000
1,71000
2,16000
3,5000


In [None]:
# calling the name of the series
series1.name

'population'

In [None]:
# # example of creating a dataframe
# pd.DataFrame(data=series1)

A `Series` object has a single data type, which we can check through its `dtype` attribute.

In [None]:
series1.dtype

dtype('int64')

Just like NumPy, we can use `shape` to get dimensions as `(row, column)`. The `Series` object is a single column, so it only has values for the row dimension.

In [None]:
# check data shape
series1.shape

(4,)

By default, a `Series` object gets an index consisting of the integers 0 to N - 1 (where N is the length of the data).

In [None]:
# display the index
series1.index

RangeIndex(start=0, stop=4, step=1)

This `Series` object stores its values as a NumPy array.

In [None]:
# display the values
series1.values

array([35000, 71000, 16000,  5000])

### `Index`

The addition of an `Index` makes a `Series` more powerful than a NumPy array. We can get the index from the `index` attribute of a `Series` object, or create one directly with `pd.Index()`:

In [None]:
labels = pd.Index(['California', 'Ohio', 'Oregon', 'Texas'])
labels

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object')

`Index` objects are immutable and thus cannot be modified by the user. This makes it safer to share `Index` objects between data structures.
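
As a small sketch of what immutability means in practice, item assignment on an `Index` raises a `TypeError`, and a "changed" index has to be built as a new object. The replacement label `'Nevada'` is only an illustrative value.

```python
import pandas as pd

labels = pd.Index(['California', 'Ohio', 'Oregon', 'Texas'])

try:
    labels[1] = 'Nevada'  # item assignment is not supported on an Index
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Build a new Index instead of modifying the existing one
renamed = labels.map(lambda state: 'Nevada' if state == 'Ohio' else state)

print(error_message)
print(list(labels))   # the original labels are unchanged
print(list(renamed))
```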

Here are some commonly used attributes with `Index` objects:

| Attribute | Returns |
| --- | --- |
| `name` | The name of the `Index` object |
| `dtype` | The data type of the `Index` object |
| `shape` | Dimensions of the `Index` object |
| `values` | The data in the `Index` object |
| `is_unique` | Check if the `Index` object has all unique values |

The `Index` object also has a name attribute:

In [None]:
# name the index
labels.name = 'state'
labels

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')

In [None]:
# call index name
labels.name

'state'

We can check the underlying data type:

In [None]:
# check data type ('O' indicates object)
labels.dtype

dtype('O')

And the same for dimensions:

In [None]:
# Displaying shapes
labels.shape

(4,)

This `Index` object is also built on top of a NumPy array:

In [None]:
# display values from index
labels.values

array(['California', 'Ohio', 'Oregon', 'Texas'], dtype=object)

An `Index` object may contain duplicate labels. We can check whether all labels are unique with `is_unique`:

In [None]:
labels.is_unique

True
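
For contrast, here is a hedged sketch of an index that does contain duplicates. The labels and values are invented for the example; the point is that selecting a duplicated label returns several rows, while a unique label returns a single value.

```python
import pandas as pd

dup_labels = pd.Index(['Ohio', 'Ohio', 'Texas'])
has_unique_labels = dup_labels.is_unique  # False, because 'Ohio' appears twice

visits = pd.Series([10, 20, 30], index=dup_labels)
ohio = visits['Ohio']    # a Series with two rows
texas = visits['Texas']  # a single value

print(has_unique_labels)
print(ohio)
print(texas)
```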

It is often desirable to create a `Series` with an `Index` that identifies each data point with a label. Note that the number of labels in the `Index` must match the number of values in the `Series`.

In [None]:
series1.index = labels
series1

Unnamed: 0_level_0,population
state,Unnamed: 1_level_1
California,35000
Ohio,71000
Oregon,16000
Texas,5000
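
The same labeled `Series` can also be built in one step by passing `index` and `name` to the constructor. The sketch below additionally shows what happens when the number of labels does not match the number of values; the variable names are chosen for illustration only.

```python
import pandas as pd

population = pd.Series(
    [35000, 71000, 16000, 5000],
    index=pd.Index(['California', 'Ohio', 'Oregon', 'Texas'], name='state'),
    name='population',
)

try:
    # Only three labels for four values: pandas rejects the assignment
    population.index = ['California', 'Ohio', 'Oregon']
    mismatch = None
except ValueError as exc:
    mismatch = str(exc)

print(population)
print(mismatch)
```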


Unlike with NumPy arrays, we can use index labels when selecting a single value or a set of values. Here `['Ohio', 'Oregon']` is interpreted as a list of index labels, even though it contains strings instead of integers.

In [None]:
series1[['Ohio', 'Oregon']]

Unnamed: 0_level_0,population
state,Unnamed: 1_level_1
Ohio,71000
Oregon,16000


Or change the value in `Series` based on `Index`:

In [None]:
series1['California'] = 40000
series1


Unnamed: 0_level_0,population
state,Unnamed: 1_level_1
California,40000
Ohio,71000
Oregon,16000
Texas,5000


In [None]:
pd.DataFrame(series1)

Unnamed: 0_level_0,population
state,Unnamed: 1_level_1
California,40000
Ohio,71000
Oregon,16000
Texas,5000


### `DataFrame`

Having a `Series` object for each column is an improvement over the NumPy representation, but we still have the same problem when we want to sort by value or retrieve the entire row. `DataFrame` gives us a table representation formed from many `Series` objects that form the columns and `Index` objects that label the rows.

In [None]:
# create a DataFrame from a dictionary with pd.DataFrame()
# keys: become the column names
# values: become the contents of each column
# 'debt' is listed in `columns` but missing from the data, so it is filled with NaN

population = {
 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2021, 2022, 2023, 2021, 2022, 2023],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

frame = pd.DataFrame(
 data=population,
 columns=['year', 'state', 'pop', 'debt'],
 index=['one', 'two', 'three', 'four', 'five', 'six']
)

frame

Unnamed: 0,year,state,pop,debt
one,2021,Ohio,1.5,
two,2022,Ohio,1.7,
three,2023,Ohio,3.6,
four,2021,Nevada,2.4,
five,2022,Nevada,2.9,
six,2023,Nevada,3.2,
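
The empty `debt` column above comes from listing a column name that is not a key in the dictionary. A reduced sketch of that behavior, using a smaller made-up table, also shows that selecting one column gives back a `Series` that carries the row labels.

```python
import pandas as pd

frame_demo = pd.DataFrame(
    {'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]},
    columns=['state', 'pop', 'debt'],  # 'debt' is not in the data
    index=['one', 'two'],
)

# Every value in the requested-but-missing column is NaN
all_missing = bool(frame_demo['debt'].isna().all())

# Assigning a scalar fills the whole column
frame_demo['debt'] = 0.0

# A single column is a Series that shares the DataFrame's row index
state_col = frame_demo['state']

print(all_missing)
print(frame_demo)
print(state_col)
```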


Here are some commonly used attributes:

| Attribute | Returns |
| --- | --- |
| `dtypes` | The data type of each column |
| `shape` | The dimensions of the `DataFrame` object in a tuple of the form `(number of rows, number of columns)` |
| `index` | The `Index` object along the rows of the `DataFrame` object |
| `columns` | The names of the columns (as an `Index` object) |
| `values` | The data in the `DataFrame` object |
| `empty` | Check if the `DataFrame` object is empty |

We can check the underlying data type with `dtypes` (note that this is not dtype as in the `Series` and `Index` objects since each column will have its own data type):

In [None]:
frame.dtypes

Unnamed: 0,0
year,int64
state,object
pop,float64
debt,object


We can get the underlying data with the `values` attribute. Note that this looks very similar to the NumPy representation.

In [None]:
frame.values

array([[2021, 'Ohio', 1.5, nan],
       [2022, 'Ohio', 1.7, nan],
       [2023, 'Ohio', 3.6, nan],
       [2021, 'Nevada', 2.4, nan],
       [2022, 'Nevada', 2.9, nan],
       [2023, 'Nevada', 3.2, nan]], dtype=object)
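
The `dtype=object` in this output is a consequence of mixing strings and numbers in a single array. As a rough sketch with a small made-up table, restricting the selection to numeric columns gives a numeric array instead:

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame({
    'year': [2021, 2022],
    'state': ['Ohio', 'Ohio'],
    'pop': [1.5, 1.7],
})

# Mixed strings and numbers only fit into a generic object array
mixed = frame_demo.to_numpy()

# Integers and floats share a common numeric type (float64)
numeric = frame_demo[['year', 'pop']].to_numpy()

print(mixed.dtype)
print(numeric.dtype)
```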

We can isolate columns with the `columns` attribute. Note that columns are actually `Index` objects just on different axes (columns are horizontal indices while rows are vertical indices).

In [None]:
frame.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [None]:
frame.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

Just like the `Series` and `Index` objects, we can get the dimensions of a `DataFrame` with the `shape` attribute. The result has the form `(nrows, ncols)`. Our `DataFrame` has 6 rows and 4 columns.

In [None]:
frame.shape

(6, 4)

## **2. Creating DataFrames**

* We will create a `DataFrame` object from other data structures in Python.

* We import the pandas and numpy libraries first.

In [None]:
import numpy as np
import pandas as pd

### Creating a `DataFrame` object from a `Series` object

Using the `pd.DataFrame()` constructor or the `to_frame()` method:

In [None]:
# method 1
pd.DataFrame(series1)

Unnamed: 0_level_0,population
state,Unnamed: 1_level_1
California,40000
Ohio,71000
Oregon,16000
Texas,5000


In [None]:
# method 2
series1.to_frame()

Unnamed: 0_level_0,population
state,Unnamed: 1_level_1
California,40000
Ohio,71000
Oregon,16000
Texas,5000


In [None]:
pd.Series(np.linspace(10,100,10)).to_frame(name='salary')

Unnamed: 0,salary
0,10.0
1,20.0
2,30.0
3,40.0
4,50.0
5,60.0
6,70.0
7,80.0
8,90.0
9,100.0


### Creating a `DataFrame` object from Python Data Structures

First, from a dictionary of list-like structures. The dictionary values can be lists, NumPy arrays, etc., as long as they have a length (generators don't have a length so we can't use them here):

In [None]:
import datetime as dt

In [None]:
np.random.seed(2024)

pd.DataFrame(
    {
        'random': np.random.rand(5),
        'text': ['hot', 'warm', 'cool', 'cold', None],
        'truth': [np.random.choice([True, False]) for _ in range(5)]
    },
    index = pd.date_range(
        end=dt.date(2024,2,27),
        freq='1D',
        periods=5,
        name='date'
    )
)

Unnamed: 0_level_0,random,text,truth
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2024-02-23,0.588015,hot,False
2024-02-24,0.699109,warm,True
2024-02-25,0.188152,cool,False
2024-02-26,0.043809,cold,True
2024-02-27,0.205019,,True


Second, from the *list of dictionaries*:

In [None]:
pd.DataFrame(
    [
        {'mag': 5.2, 'place': 'California'},
        {'mag': 1.2, 'place': 'Alaska'},
        {'mag': 0.2, 'place': 'California'},
    ]
)

Unnamed: 0,mag,place
0,5.2,California
1,1.2,Alaska
2,0.2,California
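
When the dictionaries in the list do not all share the same keys, pandas uses the union of the keys as columns and fills the gaps with `NaN`. The records below are a made-up variation of the example above, and the `depth` key is hypothetical.

```python
import pandas as pd

records = [
    {'mag': 5.2, 'place': 'California'},
    {'mag': 1.2},                                        # no 'place'
    {'mag': 0.2, 'place': 'California', 'depth': 10.0},  # extra key
]

quakes_demo = pd.DataFrame(records)
print(quakes_demo)
```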


Third, from a *list of tuples*:

In [None]:
pd.DataFrame(
    [(n, n**2, n**3) for n in range(5)],
    columns=['n', 'n_squared', 'n_cubed']
)

Unnamed: 0,n,n_squared,n_cubed
0,0,0,0
1,1,1,1
2,2,4,8
3,3,9,27
4,4,16,64


Fourth, from a *NumPy array*:

In [None]:
pd.DataFrame(
    np.array([
        [0, 0, 0],
        [1, 1, 1],
        [2, 4, 8],
        [3, 9, 27],
        [4, 16, 64]
    ]), columns=['n', 'n_squared', 'n_cubed']
)

Unnamed: 0,n,n_squared,n_cubed
0,0,0,0
1,1,1,1
2,2,4,8
3,3,9,27
4,4,16,64


## **3. Reading and Writing Data in Text Format**

Data access falls into several categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources such as web APIs.

Pandas has a number of functions for reading tabular data as `DataFrame` objects. The table below summarizes some of them, although `read_csv` and `read_table` are probably the most commonly used.

| **Function** | **Description** |
| --- | --- |
| `read_csv` | Load delimited data from a file, URL, or file-like object; use comma as default delimiter |
| `read_table` | Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter |
| `read_excel` | Read tabular data from an Excel XLS or XLSX file |
| `read_html` | Read all tables found in the given HTML document |
| `read_json` | Read data from a JSON (JavaScript Object Notation) string representation |
| `read_sql` | Read the result of a SQL query (using SQLAlchemy) as a pandas DataFrame |
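
To illustrate the "file-like object" and delimiter parts of the table, here is a small sketch that reads the same made-up data from in-memory text buffers, once comma-separated and once tab-separated. No files are needed to run it.

```python
import io

import pandas as pd

csv_text = "mag,place\n5.2,California\n1.2,Alaska\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv accepts any file-like object, not only a path or URL
from_csv = pd.read_csv(io.StringIO(csv_text))

# The delimiter can be changed with the sep argument
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv)
print(from_csv.equals(from_tsv))
```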

### Creating a `DataFrame` object from the contents of a CSV File

It is a good habit to inspect a file before reading it: its size, whether it has a header row, and which delimiter it uses. Our file is small, has headers on the first line, and is comma-separated, so we don't need to provide any additional arguments to read in the file with `pd.read_csv()`, but be sure to check the documentation for possible arguments. The **earthquakes.csv** file below can be accessed [here](https://github.com/harishmuh/Python-for-Data-Science-Analysis/blob/main/datasets/earthquakes.csv).

In [None]:
df = pd.read_csv('earthquakes.csv')
df

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.020030,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.021370,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.026180,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.077990,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9327,,,73086771,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.018060,,185.0,",nc73086771,",0.62,md,...,",nc,",reviewed,1537230228060,"M 0.6 - 9km ENE of Mammoth Lakes, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1537285598315,https://earthquake.usgs.gov/earthquakes/eventp...
9328,,,38063967,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.030410,,50.0,",ci38063967,",1.00,ml,...,",ci,",reviewed,1537230135130,"M 1.0 - 3km W of Julian, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1537276800970,https://earthquake.usgs.gov/earthquakes/eventp...
9329,,,2018261000,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.452600,,276.0,",pr2018261000,",2.40,md,...,",pr,",reviewed,1537229908180,"M 2.4 - 35km NNE of Hatillo, Puerto Rico",0,earthquake,",geoserve,origin,phase-data,",-240.0,1537243777410,https://earthquake.usgs.gov/earthquakes/eventp...
9330,,,38063959,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.018650,,61.0,",ci38063959,",1.10,ml,...,",ci,",reviewed,1537229545350,"M 1.1 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537230211640,https://earthquake.usgs.gov/earthquakes/eventp...


We can also read from a URL. Let's use a file hosted on GitHub.

In [None]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/harishmuh/Python-for-Data-Science-Analysis/refs/heads/main/datasets/earthquakes.csv'
)

In [None]:
df.head()

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...


### Creating a DataFrame object from the contents of a JSON File

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data via HTTP requests between web browsers and other applications. It is a much more free-form data format than tabular text forms such as CSV. The JSON file below can be accessed [here](https://github.com/harishmuh/Python-for-Data-Science-Analysis/blob/main/datasets/population_data.json).

In [None]:
df = pd.read_json('population_data.json')
df.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1960,96388069.0
1,Arab World,ARB,1961,98882541.4
2,Arab World,ARB,1962,101474075.8
3,Arab World,ARB,1963,104169209.2
4,Arab World,ARB,1964,106978104.6
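
For readers without the file at hand, a minimal sketch of the same idea reads a JSON string from an in-memory buffer. The two records reuse values from the output above; assuming a list-of-records layout for the JSON text is a choice made for this example, not a statement about how `population_data.json` is structured.

```python
import io

import pandas as pd

json_text = """
[
  {"Country Name": "Arab World", "Year": 1960, "Value": 96388069.0},
  {"Country Name": "Arab World", "Year": 1961, "Value": 98882541.4}
]
"""

# orient='records' tells pandas that the JSON is a list of row objects
population_demo = pd.read_json(io.StringIO(json_text), orient='records')
print(population_demo)
```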


### Creating a `DataFrame` object from the contents of an Excel File

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using the `ExcelFile` class or the `pd.read_excel()` function. The xlsx file can be accessed [here](https://github.com/harishmuh/Python-for-Data-Science-Analysis/blob/main/datasets/superhero_info.xlsx).

In [None]:
import pandas as pd
import numpy as np

In [None]:
xlsx = pd.ExcelFile('superhero_info.xlsx')
df = pd.read_excel(xlsx)
df.head()

Unnamed: 0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Alignment,Weight
0,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191,Dark Horse Comics,good,65
1,Alien,Male,-,Xenomorph XX121,No Hair,244,Dark Horse Comics,bad,169
2,Angel,Male,-,Vampire,-,-99,Dark Horse Comics,good,-99
3,Buffy,Female,green,Human,Blond,157,Dark Horse Comics,good,52
4,Captain Midnight,Male,-,Human,-,-99,Dark Horse Comics,good,-99


### Creating a `DataFrame` object by Querying a Database

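This example uses a SQLite database, which Python supports out of the box through the standard-library `sqlite3` module. For other database engines, you will need to install SQLAlchemy.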

In [None]:
import sqlite3

In [None]:
with sqlite3.connect('quakes.db') as connection:
    tsunamis = pd.read_sql('SELECT * FROM tsunamis LIMIT 10', connection)

tsunamis

Unnamed: 0,alert,type,title,place,magType,mag,time
0,,earthquake,"M 5.0 - 165km NNW of Flying Fish Cove, Christm...","165km NNW of Flying Fish Cove, Christmas Island",mww,5.0,1539459504090
1,green,earthquake,"M 6.7 - 262km NW of Ozernovskiy, Russia","262km NW of Ozernovskiy, Russia",mww,6.7,1539429023560
2,green,earthquake,"M 5.6 - 128km SE of Kimbe, Papua New Guinea","128km SE of Kimbe, Papua New Guinea",mww,5.6,1539312723620
3,green,earthquake,"M 6.5 - 148km S of Severo-Kuril'sk, Russia","148km S of Severo-Kuril'sk, Russia",mww,6.5,1539213362130
4,green,earthquake,"M 6.2 - 94km SW of Kokopo, Papua New Guinea","94km SW of Kokopo, Papua New Guinea",mww,6.2,1539208835130
5,green,earthquake,"M 5.9 - 117km ESE of Kimbe, Papua New Guinea","117km ESE of Kimbe, Papua New Guinea",mww,5.9,1539205996680
6,green,earthquake,"M 5.9 - 113km ESE of Kimbe, Papua New Guinea","113km ESE of Kimbe, Papua New Guinea",mww,5.9,1539205141060
7,green,earthquake,"M 7.0 - 117km E of Kimbe, Papua New Guinea","117km E of Kimbe, Papua New Guinea",mww,7.0,1539204500290
8,green,earthquake,"M 6.1 - 132km E of Kimbe, Papua New Guinea","132km E of Kimbe, Papua New Guinea",mb,6.1,1539204326420
9,green,earthquake,"M 5.0 - 61km SSW of Chignik Lake, Alaska","61km SSW of Chignik Lake, Alaska",ml,5.0,1539152878406
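
If `quakes.db` is not available, the round trip can be tried against a throwaway in-memory SQLite database. This is a sketch under stated assumptions: the table name `quakes_demo` and its rows are invented for the example.

```python
import sqlite3
from contextlib import closing

import pandas as pd

demo = pd.DataFrame({
    'place': ['Alaska', 'Russia', 'Indonesia'],
    'mag': [5.0, 6.7, 4.7],
})

# ':memory:' creates a temporary database that disappears when closed
with closing(sqlite3.connect(':memory:')) as connection:
    demo.to_sql('quakes_demo', connection, index=False)
    strong = pd.read_sql(
        'SELECT place, mag FROM quakes_demo WHERE mag >= ? ORDER BY mag',
        connection,
        params=(5.0,),
    )

print(strong)
```

Note that `with sqlite3.connect(...)` on its own manages transactions but does not close the connection, which is why `closing` is used here.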


### Writing a `DataFrame` object to a CSV File

Note that the index of `tsunamis` is just the row number, so we don't want to store it. Therefore, we pass `index=False` to the `to_csv()` method.

In [None]:
tsunamis.to_csv('tsunamis.csv', index=False)
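
The effect of `index=False` is easy to see without writing a file, because `to_csv()` returns the CSV text when no path is given. The tiny table below is made up for the comparison.

```python
import pandas as pd

demo = pd.DataFrame({'mag': [5.0, 6.7]})

with_index = demo.to_csv()                # row numbers become an unnamed first column
without_index = demo.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)
```

Reading the first version back with `pd.read_csv()` would produce an extra `Unnamed: 0` column, which is what `index=False` avoids.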

## **4. Indexing, Selection, and Filtering**

We'll be working with the same `earthquakes.csv` file to implement this, so we need to handle importing and reading it.

In [None]:
df = pd.read_csv('earthquakes.csv')
df.head()

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...


In [None]:
# data shape 9332 rows and 26 columns
df.shape

(9332, 26)

In [None]:
df.columns

Index(['alert', 'cdi', 'code', 'detail', 'dmin', 'felt', 'gap', 'ids', 'mag',
       'magType', 'mmi', 'net', 'nst', 'place', 'rms', 'sig', 'sources',
       'status', 'time', 'title', 'tsunami', 'type', 'types', 'tz', 'updated',
       'url'],
      dtype='object')

### Selecting `column`

Selecting columns using attribute notation:

In [None]:
# be careful if the column name is the same as the function or attribute name
# for example there is a column named shape
# df.shape --> displays the size shape instead of taking the shape column
df.mag

Unnamed: 0,mag
0,1.35
1,1.29
2,3.42
3,0.44
4,2.16
...,...
9327,0.62
9328,1.00
9329,2.40
9330,1.10


In [None]:
# get mag column and display in Series
df['mag']

Unnamed: 0,mag
0,1.35
1,1.29
2,3.42
3,0.44
4,2.16
...,...
9327,0.62
9328,1.00
9329,2.40
9330,1.10


In [None]:
# get mag column and display in DataFrame
df[['mag']]

Unnamed: 0,mag
0,1.35
1,1.29
2,3.42
3,0.44
4,2.16
...,...
9327,0.62
9328,1.00
9329,2.40
9330,1.10
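
The three selections above look alike in the output but differ in type, and attribute notation can be shadowed by existing attributes. A short sketch with a made-up table that deliberately contains a column named `shape`:

```python
import pandas as pd

shapes = pd.DataFrame({
    'mag': [1.35, 1.29],
    'shape': ['circle', 'square'],
})

as_series = shapes['mag']   # single brackets: a Series
as_frame = shapes[['mag']]  # double brackets: a one-column DataFrame

attr_result = shapes.shape    # the DataFrame attribute, not the column
col_result = shapes['shape']  # bracket notation always reaches the column

print(type(as_series).__name__, type(as_frame).__name__)
print(attr_result)
print(col_result.tolist())
```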


Selecting multiple columns:

In [None]:
# dataframe_name[[column_1, column_2, column_3,....]]
df[['mag', 'alert', 'tsunami']]

Unnamed: 0,mag,alert,tsunami
0,1.35,,0
1,1.29,,0
2,3.42,,0
3,0.44,,0
4,2.16,,0
...,...,...,...
9327,0.62,,0
9328,1.00,,0
9329,2.40,,0
9330,1.10,,0


### Selecting `rows`

Using row numbers (including the first index, excluding the last index):

In [None]:
# select rows with indexing [start:stop:step] -- stop(exclusive)
df[0:5]

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...


In [None]:
# select rows at index 1 to 10 with step 2 (1, 3, 5, 7, and 9)
df[1:11:2]

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
5,,,2018286011,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.4373,,158.0,",pr2018286011,",2.61,md,...,",pr,",reviewed,1539473686440,"M 2.6 - 55km ESE of Punta Cana, Dominican Repu...",0,earthquake,",geoserve,origin,phase-data,",-300.0,1539500579236,https://earthquake.usgs.gov/earthquakes/eventp...
7,,,73096936,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.01622,,83.0,",nc73096936,",1.13,md,...,",nc,",automatic,1539473060280,"M 1.1 - 10km NW of Parkfield, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539476642808,https://earthquake.usgs.gov/earthquakes/eventp...
9,,,1000hbtn,https://earthquake.usgs.gov/fdsnws/event/1/que...,3.191,,37.0,",us1000hbtn,",4.7,mb,...,",us,",reviewed,1539472814760,"M 4.7 - 219km SSE of Saparua, Indonesia",0,earthquake,",geoserve,origin,phase-data,",540.0,1539473712040,https://earthquake.usgs.gov/earthquakes/eventp...


In [None]:
# chain dataframe[row][column]
df[1:11:2][['mag', 'magType']]

Unnamed: 0,mag,magType
1,1.29,ml
3,0.44,ml
5,2.61,md
7,1.13,md
9,4.7,mb


### Indexing with `loc`

The selection format is `loc[row_indexer, column_indexer]`, where `:` can be used to select everything along an axis. `loc` selects by label.

In [None]:
# in loc, stop is inclusive
# fetch rows at indexes 10 through 15 (inclusive) and columns 'magType', 'alert', and 'status'
df.loc[10:15, ['magType', 'alert', 'status']]

Unnamed: 0,magType,alert,status
10,ml,,automatic
11,md,,reviewed
12,ml,,automatic
13,mb,,reviewed
14,md,,automatic
15,ml,,automatic


We can use `loc` to select specific rows and columns without chaining. If we use row numbers with `loc`, the slice now includes the end index:

In [None]:
df.loc[5:10, ['title', 'mag']]

Unnamed: 0,title,mag
5,"M 2.6 - 55km ESE of Punta Cana, Dominican Repu...",2.61
6,"M 1.7 - 105km W of Talkeetna, Alaska",1.7
7,"M 1.1 - 10km NW of Parkfield, CA",1.13
8,"M 0.9 - 6km NW of The Geysers, CA",0.92
9,"M 4.7 - 219km SSE of Saparua, Indonesia",4.7
10,"M 0.5 - 10km NE of Aguanga, CA",0.5
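
The difference between the two slicing styles can be checked by counting rows. This sketch uses a made-up 20-row table with the default integer index, plus a small string-labeled table to show that `loc` slices work on labels in general.

```python
import pandas as pd

numbers = pd.DataFrame({'mag': range(20)})

positional = numbers[5:10]       # plain slicing: rows 5 to 9 (end excluded)
label_based = numbers.loc[5:10]  # loc slicing: rows 5 to 10 (end included)

lettered = pd.DataFrame({'mag': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
by_label = lettered.loc['a':'b']  # both 'a' and 'b' are returned

print(len(positional), len(label_based))
print(by_label.index.tolist())
```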


In [None]:
# fetch rows from index 0 to index 5
# from column 'detail' to 'magType'
df.loc[:5, 'detail':'magType']

Unnamed: 0,detail,dmin,felt,gap,ids,mag,magType
0,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml
1,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml
2,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml
3,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml
4,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md
5,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.4373,,158.0,",pr2018286011,",2.61,md


In [None]:
# fetch all rows
# from column 'magType' to the last
df.loc[:, 'magType':]

Unnamed: 0,magType,mmi,net,nst,place,rms,sig,sources,status,time,title,tsunami,type,types,tz,updated,url
0,ml,,ci,26.0,"9km NE of Aguanga, CA",0.19,28,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,ml,,ci,20.0,"9km NE of Aguanga, CA",0.29,26,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,ml,,ci,111.0,"8km NE of Aguanga, CA",0.22,192,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,ml,,ci,26.0,"9km NE of Aguanga, CA",0.17,3,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,md,,nc,18.0,"10km NW of Avenal, CA",0.05,72,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9327,md,,nc,13.0,"9km ENE of Mammoth Lakes, CA",0.03,6,",nc,",reviewed,1537230228060,"M 0.6 - 9km ENE of Mammoth Lakes, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1537285598315,https://earthquake.usgs.gov/earthquakes/eventp...
9328,ml,,ci,28.0,"3km W of Julian, CA",0.21,15,",ci,",reviewed,1537230135130,"M 1.0 - 3km W of Julian, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1537276800970,https://earthquake.usgs.gov/earthquakes/eventp...
9329,md,,pr,9.0,"35km NNE of Hatillo, Puerto Rico",0.41,89,",pr,",reviewed,1537229908180,"M 2.4 - 35km NNE of Hatillo, Puerto Rico",0,earthquake,",geoserve,origin,phase-data,",-240.0,1537243777410,https://earthquake.usgs.gov/earthquakes/eventp...
9330,ml,,ci,27.0,"9km NE of Aguanga, CA",0.10,19,",ci,",reviewed,1537229545350,"M 1.1 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537230211640,https://earthquake.usgs.gov/earthquakes/eventp...


### Indexing with `iloc`

`iloc` selects rows and columns by integer position. The stop of a slice is exclusive, just like in regular Python slicing.

In [None]:
# iloc uses integer positions for both rows and columns
# in iloc the stop is exclusive
# fetch rows at positions 5 to 9
# and columns at position 3 (detail) and position 1 (cdi)

df.iloc[5:10, [3, 1]]

Unnamed: 0,detail,cdi
5,https://earthquake.usgs.gov/fdsnws/event/1/que...,
6,https://earthquake.usgs.gov/fdsnws/event/1/que...,
7,https://earthquake.usgs.gov/fdsnws/event/1/que...,
8,https://earthquake.usgs.gov/fdsnws/event/1/que...,
9,https://earthquake.usgs.gov/fdsnws/event/1/que...,


We can use slicing syntax with `iloc` for both rows and columns.

In [None]:
df.columns

Index(['alert', 'cdi', 'code', 'detail', 'dmin', 'felt', 'gap', 'ids', 'mag',
       'magType', 'mmi', 'net', 'nst', 'place', 'rms', 'sig', 'sources',
       'status', 'time', 'title', 'tsunami', 'type', 'types', 'tz', 'updated',
       'url'],
      dtype='object')

In [None]:
# fetch rows at index 5 to 9
# and columns at index 4 (dmin) to index 8 (mag) with step 2 (index 4, 6, 8)
# stop is exclusive
df.iloc[5:10, 4:9:2]

Unnamed: 0,dmin,gap,mag
5,0.4373,158.0,2.61
6,,,1.7
7,0.01622,83.0,1.13
8,0.009138,52.0,0.92
9,3.191,37.0,4.7
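
To make the difference between the two indexers concrete, here is a minimal sketch on a small made-up frame (the `toy` DataFrame and its labels are hypothetical, not part of the earthquake data). It assumes only that `loc` slices by label with an inclusive stop and `iloc` slices by position with an exclusive stop, as described above.

```python
import pandas as pd

# Hypothetical toy frame (not the earthquake data) with string row labels
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["w", "x", "y", "z"],
)

# loc slices by label and INCLUDES the stop label
by_label = toy.loc["x":"y", "a":"b"]

# iloc slices by position and EXCLUDES the stop position
by_position = toy.iloc[1:3, 0:2]

# A step works for columns too: positions 0 and 2 -> columns 'a' and 'c'
stepped = toy.iloc[:, 0:3:2]

print(by_label.equals(by_position))
print(list(stepped.columns))
```

Both selections return rows `x` and `y` with columns `a` and `b`, even though the slice bounds are written differently.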


### Filtering `DataFrame`

We can filter our `DataFrame` by using a **Boolean mask**, which can be created as follows:

In [None]:
# condition (result is boolean)
df['mag'] > 5

Unnamed: 0,mag
0,False
1,False
2,False
3,False
4,False
...,...
9327,False
9328,False
9329,False
9330,False


To select rows with this mask, we place it inside the indexing brackets:

In [None]:
# data_frame[condition]
# display all columns, but only the rows where the value of 'mag' is greater than 5
df[df['mag'] > 5]

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
118,green,,1000hbkz,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.623,,25.0,",pt18286001,at00pgjb1a,us1000hbkz,",6.7,mww,...,",pt,at,us,",reviewed,1539429023560,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake,",geoserve,ground-failure,impact-link,losspager...",600.0,1539455437040,https://earthquake.usgs.gov/earthquakes/eventp...
180,green,,1000hbhw,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.077,,23.0,",us1000hbhw,",5.2,mww,...,",us,",reviewed,1539405255580,"M 5.2 - 25km E of Bitung, Indonesia",0,earthquake,",geoserve,losspager,origin,phase-data,shakemap,",480.0,1539412565560,https://earthquake.usgs.gov/earthquakes/eventp...
226,green,,1000hbff,https://earthquake.usgs.gov/fdsnws/event/1/que...,7.385,,27.0,",us1000hbff,",5.7,mww,...,",us,",reviewed,1539389626220,"M 5.7 - 42km WNW of Sola, Vanuatu",0,earthquake,",geoserve,losspager,origin,phase-data,shakemap,",660.0,1539396937285,https://earthquake.usgs.gov/earthquakes/eventp...
227,,3.1,1000hbfe,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.822,9.0,90.0,",us1000hbfe,",5.2,mb,...,",us,",reviewed,1539389603790,"M 5.2 - 15km WSW of Pisco, Peru",0,earthquake,",dyfi,geoserve,origin,phase-data,",-300.0,1539403377538,https://earthquake.usgs.gov/earthquakes/eventp...
258,,,1000hbde,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.644,,46.0,",us1000hbde,",5.1,mb,...,",us,",reviewed,1539380306940,"M 5.1 - 236km NNW of Kuril'sk, Russia",0,earthquake,",geoserve,origin,phase-data,",600.0,1539381450040,https://earthquake.usgs.gov/earthquakes/eventp...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9175,,,2000hgb0,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.746,,38.0,",us2000hgb0,",5.2,mb,...,",us,",reviewed,1537262729590,"M 5.2 - 126km N of Dili, East Timor",1,earthquake,",geoserve,origin,phase-data,",480.0,1537264531040,https://earthquake.usgs.gov/earthquakes/eventp...
9176,,,2000hgax,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.839,,83.0,",us2000hgax,",5.2,mb,...,",us,",reviewed,1537262656830,"M 5.2 - 90km S of Raoul Island, New Zealand",0,earthquake,",geoserve,origin,phase-data,",-720.0,1537263853040,https://earthquake.usgs.gov/earthquakes/eventp...
9211,green,,2000hg93,https://earthquake.usgs.gov/fdsnws/event/1/que...,8.749,,22.0,",us2000hg93,",6.0,mww,...,",us,",reviewed,1537255661330,M 6.0 - Southwest Indian Ridge,0,earthquake,",geoserve,losspager,moment-tensor,origin,phase...",180.0,1538206958458,https://earthquake.usgs.gov/earthquakes/eventp...
9213,,,2000hg99,https://earthquake.usgs.gov/fdsnws/event/1/que...,5.359,,92.0,",us2000hg99,",5.1,mb,...,",us,",reviewed,1537255481060,M 5.1 - South of Tonga,0,earthquake,",geoserve,origin,phase-data,",-720.0,1538204240040,https://earthquake.usgs.gov/earthquakes/eventp...


In [None]:
# data_frame[condition][list_of_columns]
# display only the 'title' and 'mag' columns for rows where the value of 'mag' is greater than 5
df[df['mag'] > 5][['title', 'mag']]

Unnamed: 0,title,mag
118,"M 6.7 - 262km NW of Ozernovskiy, Russia",6.7
180,"M 5.2 - 25km E of Bitung, Indonesia",5.2
226,"M 5.7 - 42km WNW of Sola, Vanuatu",5.7
227,"M 5.2 - 15km WSW of Pisco, Peru",5.2
258,"M 5.1 - 236km NNW of Kuril'sk, Russia",5.1
...,...,...
9175,"M 5.2 - 126km N of Dili, East Timor",5.2
9176,"M 5.2 - 90km S of Raoul Island, New Zealand",5.2
9211,M 6.0 - Southwest Indian Ridge,6.0
9213,M 5.1 - South of Tonga,5.1


We can also pass the mask to `loc`, which selects the rows and columns in a single step:

In [None]:
# dataframe.loc[condition_for_row_selection, column_pointer]
df.loc[df['mag'] > 5, ['title', 'mag']]

Unnamed: 0,title,mag
118,"M 6.7 - 262km NW of Ozernovskiy, Russia",6.7
180,"M 5.2 - 25km E of Bitung, Indonesia",5.2
226,"M 5.7 - 42km WNW of Sola, Vanuatu",5.7
227,"M 5.2 - 15km WSW of Pisco, Peru",5.2
258,"M 5.1 - 236km NNW of Kuril'sk, Russia",5.1
...,...,...
9175,"M 5.2 - 126km N of Dili, East Timor",5.2
9176,"M 5.2 - 90km S of Raoul Island, New Zealand",5.2
9211,M 6.0 - Southwest Indian Ridge,6.0
9213,M 5.1 - South of Tonga,5.1
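
The two selection styles shown above can be compared on a small made-up frame. This is only a sketch: the `quakes` frame and its values are hypothetical stand-ins for the earthquake data.

```python
import pandas as pd

# Hypothetical stand-in for the earthquake data
quakes = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "mag": [4.9, 5.2, 6.7, 5.0]}
)

# The mask is a Boolean Series aligned to the frame's index
mask = quakes["mag"] > 5

# True counts as 1, so summing the mask counts the matching rows
n_matches = int(mask.sum())

# Two indexing steps: filter the rows, then pick the columns
chained = quakes[mask][["title", "mag"]]

# One indexing step with loc: rows and columns together
with_loc = quakes.loc[mask, ["title", "mag"]]

print(n_matches)
print(chained.equals(with_loc))
```

Both forms return the same rows when reading data. The `loc` form is generally the safer habit because it does everything in one indexing operation, which matters when the selection is later used for assignment.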


Boolean masks can combine multiple criteria with the bitwise operators `&` (AND) and `|` (OR). Each criterion must be wrapped in parentheses because `&` and `|` bind more tightly than comparison operators such as `==` and `>`. We cannot use `and`/`or` here because they ask for the truth value of a whole `Series` at once instead of comparing element by element.

In [None]:
# syntax
# AND: dataframe[(condition_1) & (condition_2)]
# OR: dataframe[(condition_1) | (condition_2)]

# displays data that contains tsunami indications and red alerts
# and only displays the columns 'alert', 'mag', 'magType', 'title', 'tsunami', 'type'

df[(df['tsunami']==1) & (df['alert']=='red')][['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake


In [None]:
df.loc[(df['tsunami']==1) & (df['alert']=='red'),
       ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake


Example with OR

In [None]:
# displays data that contains tsunami indications or red alerts
# and only displays the columns 'alert', 'mag', 'magType', 'title', 'tsunami', 'type'

df.loc[(df['tsunami']==1) | (df['alert']=='red'),
       ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
36,,5.0,mww,"M 5.0 - 165km NNW of Flying Fish Cove, Christm...",1,earthquake
118,green,6.7,mww,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake
501,green,5.6,mww,"M 5.6 - 128km SE of Kimbe, Papua New Guinea",1,earthquake
799,green,6.5,mww,"M 6.5 - 148km S of Severo-Kuril'sk, Russia",1,earthquake
816,green,6.2,mww,"M 6.2 - 94km SW of Kokopo, Papua New Guinea",1,earthquake
...,...,...,...,...,...,...
8561,,5.4,mb,"M 5.4 - 228km S of Taron, Papua New Guinea",1,earthquake
8624,,5.1,mb,"M 5.1 - 278km SE of Pondaguitan, Philippines",1,earthquake
9133,green,5.1,ml,"M 5.1 - 64km SSW of Kaktovik, Alaska",1,earthquake
9175,,5.2,mb,"M 5.2 - 126km N of Dili, East Timor",1,earthquake
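
The following sketch illustrates why `and`/`or` do not work on masks while `&`/`|` do. The `toy` frame is hypothetical and only mimics the `tsunami` and `alert` columns.

```python
import pandas as pd

# Hypothetical frame that mimics two columns of the earthquake data
toy = pd.DataFrame({"tsunami": [1, 0, 1], "alert": ["red", "red", None]})

is_tsunami = toy["tsunami"] == 1
is_red = toy["alert"] == "red"

# Python's `and` needs a single True/False for the whole Series,
# which pandas refuses to guess, so it raises ValueError
try:
    is_tsunami and is_red
    raised = False
except ValueError:
    raised = True

# The bitwise operators combine the masks element by element
both = toy[is_tsunami & is_red]
either = toy[is_tsunami | is_red]

print(raised)
print(list(both.index), list(either.index))
```

Storing each criterion in a named variable, as done here, is one way to avoid forgetting the parentheses around inline comparisons.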


Boolean masks can be created from any expression that evaluates to a Boolean `Series`. For example, we can select all earthquakes with the string `Alaska` in the `place` column and non-null values in the `alert` column. To get non-null values, we can use the `isnull()` method with bitwise negation (`~`) or the `notnull()` method.

In [None]:
df[(df['place'].str.contains('Alaska')) & (df['alert'].notnull())][['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
1015,green,5.0,ml,"M 5.0 - 61km SSW of Chignik Lake, Alaska",1,earthquake
1273,green,4.0,ml,"M 4.0 - 71km SW of Kaktovik, Alaska",1,earthquake
1795,green,4.0,ml,"M 4.0 - 60km WNW of Valdez, Alaska",1,earthquake
2752,green,4.0,ml,"M 4.0 - 67km SSW of Kaktovik, Alaska",1,earthquake
3260,green,3.9,ml,"M 3.9 - 44km N of North Nenana, Alaska",0,earthquake
4101,green,4.2,ml,"M 4.2 - 131km NNW of Arctic Village, Alaska",0,earthquake
6897,green,3.8,ml,"M 3.8 - 80km SSW of Kaktovik, Alaska",0,earthquake
8524,green,3.8,ml,"M 3.8 - 69km SSW of Kaktovik, Alaska",0,earthquake
9133,green,5.1,ml,"M 5.1 - 64km SSW of Kaktovik, Alaska",1,earthquake


In [None]:
df.loc[(df['place'].str.contains('Alaska')) & (df['alert'].notnull()),
       ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']]

Unnamed: 0,alert,mag,magType,title,tsunami,type
1015,green,5.0,ml,"M 5.0 - 61km SSW of Chignik Lake, Alaska",1,earthquake
1273,green,4.0,ml,"M 4.0 - 71km SW of Kaktovik, Alaska",1,earthquake
1795,green,4.0,ml,"M 4.0 - 60km WNW of Valdez, Alaska",1,earthquake
2752,green,4.0,ml,"M 4.0 - 67km SSW of Kaktovik, Alaska",1,earthquake
3260,green,3.9,ml,"M 3.9 - 44km N of North Nenana, Alaska",0,earthquake
4101,green,4.2,ml,"M 4.2 - 131km NNW of Arctic Village, Alaska",0,earthquake
6897,green,3.8,ml,"M 3.8 - 80km SSW of Kaktovik, Alaska",0,earthquake
8524,green,3.8,ml,"M 3.8 - 69km SSW of Kaktovik, Alaska",0,earthquake
9133,green,5.1,ml,"M 5.1 - 64km SSW of Kaktovik, Alaska",1,earthquake
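
As a small sketch of the two ways to test for non-null values, the hypothetical `toy` frame below contains missing entries in both columns. It also passes `na=False` to `str.contains()`, an assumption added here so that a missing `place` counts as "no match"; the cells above do not use it.

```python
import pandas as pd

# Hypothetical frame with missing values in both columns
toy = pd.DataFrame(
    {
        "place": ["61km SSW of Chignik Lake, Alaska", "9km NE of Aguanga, CA", None],
        "alert": ["green", None, "green"],
    }
)

# notnull() and ~isnull() produce the same mask
has_alert = toy["alert"].notnull()
same = has_alert.equals(~toy["alert"].isnull())

# na=False treats a missing place as "does not contain the string"
in_alaska = toy["place"].str.contains("Alaska", na=False)

selected = toy[in_alaska & has_alert]

print(same)
print(list(selected.index))
```

Without `na=False`, the result of `str.contains()` can itself contain missing values when the column has them, and such a result may not be usable as a mask.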


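We can use the `between()` method to replace two individual comparisons (greater than or equal to the minimum value and less than or equal to the maximum value) with a single call. Note that both endpoints are included by default. The two cells below are equivalent: `df.mag` (attribute access) and `df['mag']` (bracket access) refer to the same column.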

In [None]:
df.loc[
    df.mag.between(6.5, 7.5),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

Unnamed: 0,alert,mag,magType,title,tsunami,type
118,green,6.7,mww,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake
799,green,6.5,mww,"M 6.5 - 148km S of Severo-Kuril'sk, Russia",1,earthquake
837,green,7.0,mww,"M 7.0 - 117km E of Kimbe, Papua New Guinea",1,earthquake
4363,green,6.7,mww,"M 6.7 - 263km NNE of Ndoi Island, Fiji",1,earthquake
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake


In [None]:
df.loc[
    df['mag'].between(6.5, 7.5),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

Unnamed: 0,alert,mag,magType,title,tsunami,type
118,green,6.7,mww,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake
799,green,6.5,mww,"M 6.5 - 148km S of Severo-Kuril'sk, Russia",1,earthquake
837,green,7.0,mww,"M 7.0 - 117km E of Kimbe, Papua New Guinea",1,earthquake
4363,green,6.7,mww,"M 6.7 - 263km NNE of Ndoi Island, Fiji",1,earthquake
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake
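
Here is a minimal sketch of how `between()` treats its endpoints, using a made-up `mags` Series. The string form `inclusive="neither"` assumes a reasonably recent pandas release (1.3 or later).

```python
import pandas as pd

# Hypothetical magnitudes around the two endpoints
mags = pd.Series([6.4, 6.5, 7.0, 7.5, 7.6])

# Default: both endpoints are included
default = mags.between(6.5, 7.5)

# Equivalent pair of comparisons combined with &
manual = (mags >= 6.5) & (mags <= 7.5)

# Exclude both endpoints (string values assume pandas >= 1.3)
strict = mags.between(6.5, 7.5, inclusive="neither")

print(default.equals(manual))
print(strict.tolist())
```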


We can use the `isin()` method to check membership in a list of values:

In [None]:
df['magType'].value_counts()

magType,count
ml,6803
md,1796
mb,601
mww,68
mb_lg,30
mwr,14
mh,12
mw,4
mwb,2
ms_20,1


In [None]:
df.loc[
    df['magType'].isin(['mwb', 'mw', 'ms_20']),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

Unnamed: 0,alert,mag,magType,title,tsunami,type
995,,3.35,mw,"M 3.4 - 9km WNW of Cobb, CA",0,earthquake
1465,green,3.83,mw,"M 3.8 - 109km WNW of Trinidad, CA",0,earthquake
2414,green,3.83,mw,"M 3.8 - 5km SW of Tres Pinos, CA",1,earthquake
4988,green,4.41,mw,"M 4.4 - 1km SE of Delta, B.C., MX",1,earthquake
5196,green,5.7,ms_20,"M 5.7 - 107km N of Palu, Indonesia",1,earthquake
6307,green,5.8,mwb,"M 5.8 - 297km NNE of Ndoi Island, Fiji",0,earthquake
8257,green,5.7,mwb,"M 5.7 - 175km SSE of Lambasa, Fiji",0,earthquake
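
The sketch below shows that `isin()` is shorthand for a chain of `==` comparisons joined with `|`, and that the mask can be negated with `~` to exclude the listed values. The `mag_types` Series is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of magnitude types
mag_types = pd.Series(["ml", "mw", "mb", "mwb", "ml", "ms_20"])
rare = ["mwb", "mw", "ms_20"]

# Membership test against the list
is_rare = mag_types.isin(rare)

# The same mask written as chained equality checks
manual = (mag_types == "mwb") | (mag_types == "mw") | (mag_types == "ms_20")

# Negate the mask to keep everything NOT in the list
common_only = mag_types[~is_rare]

print(is_rare.equals(manual))
print(common_only.tolist())
```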


We can use `idxmax()` and `idxmin()` to get the index labels of the maximum and minimum values of a given column, and then pass those labels to `loc` to select the corresponding rows:

In [None]:
[df['mag'].idxmax(), df['mag'].idxmin()]

[5263, 2409]

In [None]:
df.loc[
    [df['mag'].idxmax(), df['mag'].idxmin()],
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

Unnamed: 0,alert,mag,magType,title,tsunami,type
5263,red,7.5,mww,"M 7.5 - 78km N of Palu, Indonesia",1,earthquake
2409,,-1.26,ml,"M -1.3 - 41km ENE of Adak, Alaska",0,earthquake
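
One detail worth keeping in mind is that `idxmax()` and `idxmin()` return a single label, the first occurrence, even when the extreme value appears more than once. The sketch below uses a hypothetical `toy` frame with a tied maximum and shows a mask-based alternative that returns every matching row.

```python
import pandas as pd

# Hypothetical frame in which the maximum appears twice
toy = pd.DataFrame({"mag": [2.0, 7.5, -1.0, 7.5]}, index=[10, 11, 12, 13])

# Each call returns one index label: the first occurrence
first_max = toy["mag"].idxmax()
first_min = toy["mag"].idxmin()

# Passing a list of labels to loc selects those rows in that order
extremes = toy.loc[[first_max, first_min]]

# A Boolean mask returns every row that shares the maximum value
all_max = toy[toy["mag"] == toy["mag"].max()]

print(first_max, first_min)
print(list(all_max.index))
```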
