In [1]:
# Set the timezone of Google Colab

# According to https://stackoverflow.com/questions/55918562/changing-the-system-time-in-google-colaboratory
# Change from the default UTC time zone in Colab to Bangkok's time zone
!rm /etc/localtime
!ln -s /usr/share/zoneinfo/Asia/Bangkok /etc/localtime
!date

Wed Aug 21 05:43:58 PM +07 2024


In [2]:
import sys
import pandas as pd
import numpy as np
import IPython
from IPython.display import display

print( f"Python {sys.version}" )
print( f"Pandas {pd.__version__}" )
print( f"NumPy {np.__version__}" )
print( f"IPython {IPython.__version__}" )

Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
Pandas 2.1.4
NumPy 1.26.4
IPython 7.34.0


In [3]:
# Load the pokemon dataset
df_pokemon = pd.read_csv('https://raw.githubusercontent.com/ShaileshDhama/Exploratory-Data-Analysis-On-Pokemon-Dataset/master/Complete%20Pokemon.csv')
df_pokemon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

# 1. Pandas: Conditionally select/access/change data

In `Pandas1.ipynb`, we learned how to select/access/change the desired data by directly specifying its index with:
- **Position-based indexing:** `df.iloc[]` and `df.iat[]`
- **Label-based indexing:** `df.loc[]` and `df.at[]`

In this file, we will learn how to select/access the data that matches some conditions or criteria by:
1. **Boolean-based indexing:**
  - Use boolean values to flag the data to keep (True) or abandon (False)
  - Specify a Series of bool to filter <u>rows and/or columns</u> in `df.iloc[]` and `df.loc[]`
2. **The `query()` method:**
  - <u>Filter rows</u> based on a string-based boolean expression
  - A compact syntax that resembles the `WHERE` clause of SQL
  - Can refer to a column with a shorter notation; for example, `df["column_name"] > 0` vs. `"column_name > 0"`
  - Can be faster than boolean-based indexing on large DataFrames when the `numexpr` engine is available; on small DataFrames the difference is negligible
3. **The `filter()` method:**
  - Compared to 1 and 2, the `filter()` method is less common, but it provides some features that the others lack.
  - This method subsets the DataFrame's <u>rows or columns</u> according to the specified index labels. For DataFrame, the default is to subset columns (`axis=1`).
  - Note that this method does not filter a DataFrame on its contents <u>but on the labels</u>.
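
As a minimal sketch of the three approaches listed above, the following uses a small hypothetical DataFrame named `toy` (its columns are made up for illustration and are not part of the Pokémon dataset):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"name": ["a", "b", "c"], "hp": [10, 0, 25], "hp_max": [20, 30, 25]}
)

# 1. Boolean-based indexing: a Series of bool flags the rows to keep
by_mask = toy.loc[toy["hp"] > 0]

# 2. query(): the same row filter written as a string expression
by_query = toy.query("hp > 0")

# 3. filter(): selects by label, not by content (here: column labels containing 'hp')
by_label = toy.filter(like="hp")

print(by_mask.equals(by_query))   # the two row filters keep the same rows
print(list(by_label.columns))
```

Approaches 1 and 2 look at the cell values, while approach 3 only looks at the row or column labels.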

In [4]:
# Prepare the data to play around
df = df_pokemon

## 1.1 Row filtering with boolean

This section will use the boolean-based indexing and the `query()` method for row filtering.

### 1.1.1 Row filtering: Single condition

In [5]:
# Create a Series of bool whose True values indicate rows where the 'type1' column equals 'grass'
# This command returns a pandas.Series of bool
(df['type1'] == 'grass')    # This equals (df.loc[ : , 'type1' ] == 'grass') and (df.type1 == 'grass')

Unnamed: 0,type1
0,True
1,True
2,True
3,False
4,False
...,...
796,False
797,True
798,False
799,False


In [6]:
# Use the boolean-based indexing and the query() method to select rows that meet a condition
# The five commands in this cell produce the exact same results despite different coding styles

# Style 1
#df[ df['type1'] == 'grass' ][ ['name','japanese_name', 'type1', 'type2'] ]

# Style 2
#df.loc[ df['type1'] == 'grass' ][ ['name','japanese_name', 'type1', 'type2'] ]

# Style 3
df.loc[ df['type1'] == 'grass', ['name','japanese_name', 'type1', 'type2'] ]

# Style 4
# For a column name that isn't a valid Python identifier (e.g., a column name with a space), surround it with backticks in the query string
#df.query( "type1 == 'grass'" )[ ['name','japanese_name', 'type1', 'type2'] ]

# Style 5
# Refer to variables in the environment by prefixing them with an ‘@’ character in the query string
#target_type = 'grass'
#df.query( "type1 == @target_type" )[ ['name','japanese_name', 'type1', 'type2'] ]

Unnamed: 0,name,japanese_name,type1,type2
0,Bulbasaur,Fushigidaneフシギダネ,grass,poison
1,Ivysaur,Fushigisouフシギソウ,grass,poison
2,Venusaur,Fushigibanaフシギバナ,grass,poison
42,Oddish,Nazonokusaナゾノクサ,grass,poison
43,Gloom,Kusaihanaクサイハナ,grass,poison
...,...,...,...,...
760,Bounsweet,Amakajiアマカジ,grass,
761,Steenee,Amamaikoアママイコ,grass,
762,Tsareena,Amajoアマージョ,grass,
786,Tapu Bulu,Kapu-bululカプ・ブルル,grass,fairy


In [7]:
# Query rows whose 'type1' attribute is specified in the list
targets = ['dragon', 'fairy']

# Style 1
df.loc[ df['type1'].isin(targets) , ['name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

# Style 2
#df.query( "type1 in @targets" )[ ['name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

Unnamed: 0,name,abilities,type1,type2
0,Clefairy,"['Cute Charm', 'Magic Guard', 'Friend Guard']",fairy,
1,Clefable,"['Cute Charm', 'Magic Guard', 'Unaware']",fairy,
2,Dratini,"['Shed Skin', 'Marvel Scale']",dragon,
3,Dragonair,"['Shed Skin', 'Marvel Scale']",dragon,
4,Dragonite,"['Inner Focus', 'Multiscale']",dragon,flying
5,Cleffa,"['Cute Charm', 'Magic Guard', 'Friend Guard']",fairy,
6,Togepi,"['Hustle', 'Serene Grace', 'Super Luck']",fairy,
7,Togetic,"['Hustle', 'Serene Grace', 'Super Luck']",fairy,flying
8,Snubbull,"['Intimidate', 'Run Away', 'Rattled']",fairy,
9,Granbull,"['Intimidate', 'Quick Feet', 'Rattled']",fairy,


In [8]:
# Query rows whose 'attack' attribute is more than 2 times its 'defense' attribute

# Style 1
df.loc[ df['attack'] > 2 * df['defense'] , ['name','abilities', 'attack', 'defense'] ].reset_index(drop=True)

# Style 2
#df.query( "attack > (2 * defense)" )[ ['name','abilities', 'attack', 'defense'] ].reset_index(drop=True)

Unnamed: 0,name,abilities,attack,defense
0,Beedrill,"['Swarm', 'Sniper']",150,40
1,Jigglypuff,"['Cute Charm', 'Competitive', 'Friend Guard']",45,20
2,Mankey,"['Vital Spirit', 'Anger Point', 'Defiant']",80,35
3,Bellsprout,"['Chlorophyll', 'Gluttony']",75,35
4,Hitmonlee,"['Limber', 'Reckless', 'Unburden']",120,53
5,Flareon,"['Flash Fire', 'Guts']",130,60
6,Mewtwo,"['Pressure', 'Unnerve']",150,70
7,Pichu,"['Static', 'Lightningrod']",40,15
8,Murkrow,"['Insomnia', 'Super Luck', 'Prankster']",85,42
9,Magby,"['Flame Body', 'Vital Spirit']",75,37


### 1.1.2 Row filtering: Compound condition

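Combine multiple conditions with the bitwise operators `&` (and), `|` (or), and `~` (not). Wrap each condition in parentheses, because these operators bind more tightly than comparison operators such as `==` and `!=`.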
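
The sketch below illustrates why the bitwise operators are used instead of Python's `and`/`or`. It assumes a hypothetical three-row DataFrame named `toy` that only mimics the `type1`/`type2` columns:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"type1": ["grass", "grass", "fire"], "type2": ["poison", None, "flying"]}
)

# Each condition is parenthesized, then combined element-wise with &
mask = (toy["type1"] == "grass") & (toy["type2"] != "poison")
print(toy.loc[mask])

# Python's `and` asks each Series for a single truth value, which is ambiguous
raised = False
try:
    toy.loc[(toy["type1"] == "grass") and (toy["type2"] != "poison")]
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

Note that a missing `type2` compares as not equal to `'poison'`, which is why Pokémon without a second type pass the `!=` filter in this section.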

In [9]:
# Use boolean indexing to select rows that meet the final condition
# The five commands in this cell produce the exact same results despite different coding styles

# Style 1
#df[ (df['type1'] == 'grass') & (df['type2'] != 'poison') ][ ['name','japanese_name', 'type1', 'type2'] ]

# Style 2
#df.loc[ (df['type1'] == 'grass') & (df['type2'] != 'poison') ][ ['name','japanese_name', 'type1', 'type2'] ]

# Style 3
df.loc[ (df['type1'] == 'grass') & (df['type2'] != 'poison') , ['name','japanese_name', 'type1', 'type2'] ]

# Style 4
#df.query( "(type1 == 'grass') & (type2 != 'poison')" )[ ['name','japanese_name', 'type1', 'type2'] ]

# Style 5
#df.query( "(type1 == 'grass') and (type2 != 'poison')" )[ ['name','japanese_name', 'type1', 'type2'] ]

Unnamed: 0,name,japanese_name,type1,type2
101,Exeggcute,Tamatamaタマタマ,grass,psychic
102,Exeggutor,Nassyナッシー,grass,psychic
113,Tangela,Monjaraモンジャラ,grass,
151,Chikorita,Chicoritaチコリータ,grass,
152,Bayleef,Bayleafベイリーフ,grass,
...,...,...,...,...
760,Bounsweet,Amakajiアマカジ,grass,
761,Steenee,Amamaikoアママイコ,grass,
762,Tsareena,Amajoアマージョ,grass,
786,Tapu Bulu,Kapu-bululカプ・ブルル,grass,fairy


In [10]:
# Store the filtered data in a variable
df2 = df.loc[ (df['type1'] == 'grass') & (df['type2'] != 'poison') , : ]

# Find the number of rows that meet our condition
df2.shape

(64, 41)

### 1.1.3 String column

In [11]:
# Preview the dataset (just a reminder)
df.head()

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,1,65,65,45,grass,poison,6.9,1,0
1,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,2,80,80,60,grass,poison,13.0,1,0
2,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,3,122,120,80,grass,poison,100.0,1,0
3,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,5,80,65,80,fire,,19.0,1,0


In [12]:
# Select rows based on string equality

# Style 1
df.loc [ df['abilities'] == "['Blaze', 'Solar Power']", : ]

# Style 2
# When there are quote characters in the search string, using escape characters doesn't help
# Avoid confusion in quote characters by putting the search string in an external variable
#a = "['Blaze', 'Solar Power']"
#df.query( "abilities == @a" )

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
3,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,5,80,65,80,fire,,19.0,1,0
5,"['Blaze', 'Solar Power']",0.25,1.0,1.0,2.0,0.5,0.5,0.5,1.0,1.0,...,88.1,6,159,115,100,fire,flying,90.5,1,0


Use the `pandas.Series.str` accessor for more flexible string matching. Its methods accept regular expressions written in the syntax of Python's `re` module.
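
One detail worth sketching before the examples: columns such as `type2` contain missing values, and `str.contains()` accepts an `na` argument to decide how they are treated. The Series below is hypothetical and only imitates such a column:

```python
import pandas as pd

# Hypothetical Series with one missing value, used only for illustration
toy = pd.Series(["poison", None, "Flying", "fairy"], name="type2")

# na=False treats missing values as non-matches, so the mask contains only True/False
mask = toy.str.contains("^f", case=False, regex=True, na=False)
print(mask.tolist())
```

With `na=False`, the resulting mask can be passed straight to `df.loc[]` without any missing entries in it.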

In [13]:
# re is a powerful built-in Python package for Regular Expression
# Read its syntax in https://docs.python.org/3/library/re.html
import re
print(f're {re.__version__}')

re 2.2.1


In [14]:
# Query rows whose 'abilities' attribute includes the 'blaze' substring (case-insensitive)

# Style 1
df.loc[ df['abilities'].str.contains('blaze', case=False) , ['pokedex_number', 'name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

# Style 2
#df.query( "abilities.str.contains('blaze', case=False)", engine='python' )[ ['pokedex_number', 'name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

Unnamed: 0,pokedex_number,name,abilities,type1,type2
0,4,Charmander,"['Blaze', 'Solar Power']",fire,
1,5,Charmeleon,"['Blaze', 'Solar Power']",fire,
2,6,Charizard,"['Blaze', 'Solar Power']",fire,flying
3,155,Cyndaquil,"['Blaze', 'Flash Fire']",fire,
4,156,Quilava,"['Blaze', 'Flash Fire']",fire,
5,157,Typhlosion,"['Blaze', 'Flash Fire']",fire,
6,255,Torchic,"['Blaze', 'Speed Boost']",fire,
7,256,Combusken,"['Blaze', 'Speed Boost']",fire,fighting
8,257,Blaziken,"['Blaze', 'Speed Boost']",fire,fighting
9,390,Chimchar,"['Blaze', 'Iron Fist']",fire,


In [15]:
# Query rows whose 'name' attribute starts with 'char' or ends with 'bug' (case-insensitive)

# Style 1
df.loc[ df['name'].str.contains('^char|bug$', case=False, regex=True) , ['pokedex_number', 'name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

# Style 2
#df.query( "name.str.contains('^char|bug$', case=False, regex=True)", engine='python' )[ ['pokedex_number', 'name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

Unnamed: 0,pokedex_number,name,abilities,type1,type2
0,4,Charmander,"['Blaze', 'Solar Power']",fire,
1,5,Charmeleon,"['Blaze', 'Solar Power']",fire,
2,6,Charizard,"['Blaze', 'Solar Power']",fire,flying
3,664,Scatterbug,"['Shield Dust', 'Compoundeyes', 'Friend Guard']",bug,
4,737,Charjabug,['Battery'],bug,electric


In [16]:
# Query rows whose 'type' attributes include 'dragon' and whose 'name' attribute doesn't start with 'drag'

# Style 1
df.loc[ ((df['type1']=='dragon') | (df['type2']=='dragon')) & ~df['name'].str.contains('^drag', case=False, regex=True) , ['pokedex_number', 'name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

# Style 2
#df.query( "((type1=='dragon') | (type2=='dragon')) & ~(name.str.contains('^drag', case=False, regex=True))", engine='python' )[ ['pokedex_number', 'name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

Unnamed: 0,pokedex_number,name,abilities,type1,type2
0,147,Dratini,"['Shed Skin', 'Marvel Scale']",dragon,
1,230,Kingdra,"['Swift Swim', 'Sniper', 'Damp']",water,dragon
2,329,Vibrava,['Levitate'],ground,dragon
3,330,Flygon,['Levitate'],ground,dragon
4,334,Altaria,"['Natural Cure', 'Cloud Nine']",dragon,flying
5,371,Bagon,"['Rock Head', 'Sheer Force']",dragon,
6,372,Shelgon,"['Rock Head', 'Overcoat']",dragon,
7,373,Salamence,"['Intimidate', 'Moxie']",dragon,flying
8,380,Latias,['Levitate'],dragon,psychic
9,381,Latios,['Levitate'],dragon,psychic


### 1.1.4 Datetime column

In [17]:
# Create a dummy DataFrame to play around
df_dt = pd.DataFrame( { 'name':['A','B', 'C', 'D'] ,
                        'date1':['2000-12-31', '2005-4-15', '2020-7-9', '2022-2-28'],
                        'date2':['2000-12-31 23:55', '2005-4-15 1:15', '2020-7-9 0:45', '2022-2-28 12:30'] } )

print( df_dt.info(), end='\n\n' )
display(df_dt)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   date1   4 non-null      object
 2   date2   4 non-null      object
dtypes: object(3)
memory usage: 224.0+ bytes
None



Unnamed: 0,name,date1,date2
0,A,2000-12-31,2000-12-31 23:55
1,B,2005-4-15,2005-4-15 1:15
2,C,2020-7-9,2020-7-9 0:45
3,D,2022-2-28,2022-2-28 12:30


In [18]:
# Convert pandas.Series to datetime
df_dt.date1 = pd.to_datetime( df_dt.date1 )
df_dt.date2 = pd.to_datetime( df_dt.date2 )

print( df_dt.info(), end='\n\n' )
display(df_dt)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   name    4 non-null      object        
 1   date1   4 non-null      datetime64[ns]
 2   date2   4 non-null      datetime64[ns]
dtypes: datetime64[ns](2), object(1)
memory usage: 224.0+ bytes
None



Unnamed: 0,name,date1,date2
0,A,2000-12-31,2000-12-31 23:55:00
1,B,2005-04-15,2005-04-15 01:15:00
2,C,2020-07-09,2020-07-09 00:45:00
3,D,2022-02-28,2022-02-28 12:30:00


In [19]:
# Access an attribute available in the datetime column
print( df_dt['date2'].dt.date , end='\n\n' )
print( df_dt['date2'].dt.month_name() , end='\n\n' )
print( df_dt['date2'].dt.day_name() )

0    2000-12-31
1    2005-04-15
2    2020-07-09
3    2022-02-28
Name: date2, dtype: object

0    December
1       April
2        July
3    February
Name: date2, dtype: object

0      Sunday
1      Friday
2    Thursday
3      Monday
Name: date2, dtype: object


In [20]:
# Extract datetime info and put them in new columns
df_dt['date2_y'] = df_dt['date2'].dt.year
df_dt['date2_m'] = df_dt['date2'].dt.month
df_dt['date2_d'] = df_dt['date2'].dt.day
df_dt['date2_hr'] = df_dt['date2'].dt.hour
df_dt['date2_min'] = df_dt['date2'].dt.minute
df_dt['date2_sec'] = df_dt['date2'].dt.second

df_dt

Unnamed: 0,name,date1,date2,date2_y,date2_m,date2_d,date2_hr,date2_min,date2_sec
0,A,2000-12-31,2000-12-31 23:55:00,2000,12,31,23,55,0
1,B,2005-04-15,2005-04-15 01:15:00,2005,4,15,1,15,0
2,C,2020-07-09,2020-07-09 00:45:00,2020,7,9,0,45,0
3,D,2022-02-28,2022-02-28 12:30:00,2022,2,28,12,30,0


In [21]:
# Get the current date and time
today = pd.to_datetime('today')

print( f"today (Bangkok) = {today.day_name()} {today}\n> type = {type(today)}\n> year = {today.year}" )

today (Bangkok) = Wednesday 2024-08-21 17:57:38.582359
> type = <class 'pandas._libs.tslibs.timestamps.Timestamp'>
> year = 2024


In [22]:
# Calculate approximate ages (calendar-year difference and elapsed seconds) and put the results in new columns
df_dt['age_y'] = today.year - df_dt['date2'].dt.year
df_dt['age_s'] = (today - df_dt['date2']).dt.total_seconds()

print( df_dt.info(), end='\n\n' )
display(df_dt)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   name       4 non-null      object        
 1   date1      4 non-null      datetime64[ns]
 2   date2      4 non-null      datetime64[ns]
 3   date2_y    4 non-null      int32         
 4   date2_m    4 non-null      int32         
 5   date2_d    4 non-null      int32         
 6   date2_hr   4 non-null      int32         
 7   date2_min  4 non-null      int32         
 8   date2_sec  4 non-null      int32         
 9   age_y      4 non-null      int32         
 10  age_s      4 non-null      float64       
dtypes: datetime64[ns](2), float64(1), int32(7), object(1)
memory usage: 368.0+ bytes
None



Unnamed: 0,name,date1,date2,date2_y,date2_m,date2_d,date2_hr,date2_min,date2_sec,age_y,age_s
0,A,2000-12-31,2000-12-31 23:55:00,2000,12,31,23,55,0,24,745956200.0
1,B,2005-04-15,2005-04-15 01:15:00,2005,4,15,1,15,0,19,610735400.0
2,C,2020-07-09,2020-07-09 00:45:00,2020,7,9,0,45,0,4,130007600.0
3,D,2022-02-28,2022-02-28 12:30:00,2022,2,28,12,30,0,2,78211660.0


In [23]:
# Select rows whose 'date2' attribute is between 2010 and 2022

# Style 1
#df_dt.loc[ ( 2010 <= df_dt['date2'].dt.year <= 2022), : ].sort_values('date2', ascending=False)  # ValueError
df_dt.loc[ (df_dt['date2'].dt.year >= 2010) & (df_dt['date2'].dt.year <= 2022), : ].sort_values('date2', ascending=False)

# Style 2
#df_dt.query("2010 <= date2.dt.year <= 2022").sort_values('date2', ascending=False)

# Style 3
#df_dt.loc[ (df_dt['date2'].dt.date >= pd.Timestamp('2010-01-01').date()) & (df_dt['date2'].dt.date <= pd.Timestamp('2022-12-31').date()), : ].sort_values('date2', ascending=False)

# Style 4
#df_dt.query("@pd.to_datetime('2010-01-01').date() <= date2.dt.date <= @pd.to_datetime('2022-12-31').date()").sort_values('date2', ascending=False)

Unnamed: 0,name,date1,date2,date2_y,date2_m,date2_d,date2_hr,date2_min,date2_sec,age_y,age_s
3,D,2022-02-28,2022-02-28 12:30:00,2022,2,28,12,30,0,2,78211660.0
2,C,2020-07-09,2020-07-09 00:45:00,2020,7,9,0,45,0,4,130007600.0
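
The year-range filter in the cell above can also be sketched with `Series.between()`, which is inclusive on both ends by default and avoids the chained comparison that raises `ValueError`. The `toy` frame below is hypothetical and only mirrors the `date2` column of `df_dt`:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"date2": pd.to_datetime(["2000-12-31 23:55", "2020-07-09 00:45", "2022-02-28 12:30"])}
)

# between() builds the same bool mask as (year >= 2010) & (year <= 2022)
in_range = toy["date2"].dt.year.between(2010, 2022)
print(toy.loc[in_range])
```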


## 1.2 Conditional change

In [24]:
# Prepare the subset of data to play around
df = df_pokemon.query( "((type1=='dragon') | (type2=='dragon')) & ~name.str.contains('^drag', case=False, regex=True)", engine='python' )
df = df[ ['pokedex_number', 'name','abilities', 'type1', 'type2'] ].reset_index(drop=True)

df

Unnamed: 0,pokedex_number,name,abilities,type1,type2
0,147,Dratini,"['Shed Skin', 'Marvel Scale']",dragon,
1,230,Kingdra,"['Swift Swim', 'Sniper', 'Damp']",water,dragon
2,329,Vibrava,['Levitate'],ground,dragon
3,330,Flygon,['Levitate'],ground,dragon
4,334,Altaria,"['Natural Cure', 'Cloud Nine']",dragon,flying
5,371,Bagon,"['Rock Head', 'Sheer Force']",dragon,
6,372,Shelgon,"['Rock Head', 'Overcoat']",dragon,
7,373,Salamence,"['Intimidate', 'Moxie']",dragon,flying
8,380,Latias,['Levitate'],dragon,psychic
9,381,Latios,['Levitate'],dragon,psychic


In [27]:
# For any rows with 'type1' attribute of 'dark', change it to 'black'

# Style 1
df.loc[ df['type1']=='dark', 'type1' ] = 'black'

# Style 2
#df[ df['type1']=='dark', 'type1' ] = 'black'  # TypeError: unhashable type: 'Series'

df

Unnamed: 0,pokedex_number,name,abilities,type1,type2
0,147,Dratini,"['Shed Skin', 'Marvel Scale']",dragon,
1,230,Kingdra,"['Swift Swim', 'Sniper', 'Damp']",water,dragon
2,329,Vibrava,['Levitate'],ground,dragon
3,330,Flygon,['Levitate'],ground,dragon
4,334,Altaria,"['Natural Cure', 'Cloud Nine']",dragon,flying
5,371,Bagon,"['Rock Head', 'Sheer Force']",dragon,
6,372,Shelgon,"['Rock Head', 'Overcoat']",dragon,
7,373,Salamence,"['Intimidate', 'Moxie']",dragon,flying
8,380,Latias,['Levitate'],dragon,psychic
9,381,Latios,['Levitate'],dragon,psychic


In [28]:
# Add a dummy column containing an empty string
df[ 'type3' ] = ''    # This equals (df.loc[:,'type3'] = '') and (df = df.assign(type3=''))
df

Unnamed: 0,pokedex_number,name,abilities,type1,type2,type3
0,147,Dratini,"['Shed Skin', 'Marvel Scale']",dragon,,
1,230,Kingdra,"['Swift Swim', 'Sniper', 'Damp']",water,dragon,
2,329,Vibrava,['Levitate'],ground,dragon,
3,330,Flygon,['Levitate'],ground,dragon,
4,334,Altaria,"['Natural Cure', 'Cloud Nine']",dragon,flying,
5,371,Bagon,"['Rock Head', 'Sheer Force']",dragon,,
6,372,Shelgon,"['Rock Head', 'Overcoat']",dragon,,
7,373,Salamence,"['Intimidate', 'Moxie']",dragon,flying,
8,380,Latias,['Levitate'],dragon,psychic,
9,381,Latios,['Levitate'],dragon,psychic,


In [29]:
# For any rows with 'type1' attribute of 'black', change their 'type1' and 'type3' attributes
df.loc[ df['type1']=='black', ['type1', 'type3'] ] = ['dark' , 'very dark' ]

df

Unnamed: 0,pokedex_number,name,abilities,type1,type2,type3
0,147,Dratini,"['Shed Skin', 'Marvel Scale']",dragon,,
1,230,Kingdra,"['Swift Swim', 'Sniper', 'Damp']",water,dragon,
2,329,Vibrava,['Levitate'],ground,dragon,
3,330,Flygon,['Levitate'],ground,dragon,
4,334,Altaria,"['Natural Cure', 'Cloud Nine']",dragon,flying,
5,371,Bagon,"['Rock Head', 'Sheer Force']",dragon,,
6,372,Shelgon,"['Rock Head', 'Overcoat']",dragon,,
7,373,Salamence,"['Intimidate', 'Moxie']",dragon,flying,
8,380,Latias,['Levitate'],dragon,psychic,
9,381,Latios,['Levitate'],dragon,psychic,


## 1.3 Column filtering with boolean

This section will use boolean-based indexing to filter columns. Most examples are similar to row filtering, except that the `query()` method is not available for this task because it filters rows only.
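
As a minimal sketch of the idea, a column mask can be built from the column labels and passed as the second indexer of `df.loc[]`. The DataFrame `toy` below is hypothetical; its column names only imitate those of the Pokémon dataset:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"against_bug": [1.0], "against_fire": [2.0], "hp": [45], "name": ["x"]}
)

# One bool per column label: True where the label starts with 'against_'
mask = toy.columns.str.startswith("against_")

# ~ inverts the mask, keeping the columns that do NOT start with 'against_'
kept = toy.loc[:, ~mask]
print(list(kept.columns))
```

The cells in this section build the same kind of mask with a list comprehension instead, which works equally well.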

In [30]:
# Prepare the data to play around
df = df_pokemon

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

In [31]:
# Choose only columns whose positional indices are divisible by 4

# Step 1: Create a 1D array/Series of bool
lst = [ (df.columns.get_loc(col_name) % 4 == 0) for col_name in df.columns ]
print('----- Step 1 -----')
print( pd.Series( lst ) )

# Step 2: Use the 1D array/Series to filter columns
print('\n----- Step 2 -----')
df.loc[ : , lst ]     # This equals (df.iloc[ :, lst ])

----- Step 1 -----
0      True
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12     True
13    False
14    False
15    False
16     True
17    False
18    False
19    False
20     True
21    False
22    False
23    False
24     True
25    False
26    False
27    False
28     True
29    False
30    False
31    False
32     True
33    False
34    False
35    False
36     True
37    False
38    False
39    False
40     True
dtype: bool

----- Step 2 -----


Unnamed: 0,abilities,against_electric,against_flying,against_ice,against_rock,base_egg_steps,classfication,hp,pokedex_number,type1,is_legendary
0,"['Overgrow', 'Chlorophyll']",0.5,2.0,2.0,1.0,5120,Seed Pokémon,45,1,grass,0
1,"['Overgrow', 'Chlorophyll']",0.5,2.0,2.0,1.0,5120,Seed Pokémon,60,2,grass,0
2,"['Overgrow', 'Chlorophyll']",0.5,2.0,2.0,1.0,5120,Seed Pokémon,80,3,grass,0
3,"['Blaze', 'Solar Power']",1.0,1.0,0.5,2.0,5120,Lizard Pokémon,39,4,fire,0
4,"['Blaze', 'Solar Power']",1.0,1.0,0.5,2.0,5120,Flame Pokémon,58,5,fire,0
...,...,...,...,...,...,...,...,...,...,...,...
796,['Beast Boost'],2.0,0.5,1.0,1.0,30720,Launch Pokémon,97,797,steel,1
797,['Beast Boost'],0.5,1.0,1.0,0.5,30720,Drawn Sword Pokémon,59,798,grass,1
798,['Beast Boost'],0.5,1.0,2.0,1.0,30720,Junkivore Pokémon,223,799,dark,1
799,['Prism Armor'],1.0,1.0,1.0,1.0,30720,Prism Pokémon,97,800,psychic,1


In [32]:
# Choose only columns whose labels don't include the 'against_' substring

# Step 1: Create a 1D array/Series of bool
lst = [ ('against_' not in col_name) for col_name in df.columns ]
print('----- Step 1 -----')
print( pd.Series( lst ) )

# Step 2: Use the 1D array/Series to filter columns
print('\n----- Step 2 -----')
df.loc[ : , lst ]     # This equals (df.iloc[ :, lst ])

----- Step 1 -----
0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19     True
20     True
21     True
22     True
23     True
24     True
25     True
26     True
27     True
28     True
29     True
30     True
31     True
32     True
33     True
34     True
35     True
36     True
37     True
38     True
39     True
40     True
dtype: bool

----- Step 2 -----


Unnamed: 0,abilities,attack,base_egg_steps,base_happiness,base_total,capture_rate,classfication,defense,experience_growth,height_m,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",49,5120,70,318,45,Seed Pokémon,49,1059860,0.7,...,88.1,1,65,65,45,grass,poison,6.9,1,0
1,"['Overgrow', 'Chlorophyll']",62,5120,70,405,45,Seed Pokémon,63,1059860,1.0,...,88.1,2,80,80,60,grass,poison,13.0,1,0
2,"['Overgrow', 'Chlorophyll']",100,5120,70,625,45,Seed Pokémon,123,1059860,2.0,...,88.1,3,122,120,80,grass,poison,100.0,1,0
3,"['Blaze', 'Solar Power']",52,5120,70,309,45,Lizard Pokémon,43,1059860,0.6,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",64,5120,70,405,45,Flame Pokémon,58,1059860,1.1,...,88.1,5,80,65,80,fire,,19.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,['Beast Boost'],101,30720,0,570,25,Launch Pokémon,103,1250000,9.2,...,,797,107,101,61,steel,flying,999.9,7,1
797,['Beast Boost'],181,30720,0,570,255,Drawn Sword Pokémon,131,1250000,0.3,...,,798,59,31,109,grass,steel,0.1,7,1
798,['Beast Boost'],101,30720,0,570,15,Junkivore Pokémon,53,1250000,5.5,...,,799,97,53,43,dark,dragon,888.0,7,1
799,['Prism Armor'],107,30720,0,600,3,Prism Pokémon,101,1250000,2.4,...,,800,127,89,79,psychic,,230.0,7,1


## 1.4 The `filter()` method

- When we want to <u>filter rows or columns based on their labels</u> but don't want to compute a 1D boolean array by ourselves, we may consider using the `filter()` method.

- This method takes three selection arguments (`items`, `like`, and `regex`). They are mutually exclusive, so exactly one of them must be passed per call.
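
The mutual exclusivity can be checked with a short sketch. The DataFrame `toy` is hypothetical and exists only for this illustration:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"against_bug": [1.0], "name": ["x"], "japanese_name": ["y"]})

# A single selection argument works as expected
name_cols = list(toy.filter(regex="name$").columns)
print(name_cols)

# Passing two selection arguments together is rejected with a TypeError
try:
    toy.filter(like="against", regex="name$")
    both_allowed = True
except TypeError as err:
    both_allowed = False
    print("TypeError:", err)
```

To combine several label criteria, one option is to express them as a single `regex` pattern joined with `|`.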

In [33]:
# The 'items' argument to specify *labels* to keep

# Column filtering
print('----- Column filtering -----')
display( df.filter(items=['pokedex_number', 'name', 'japanese_name', 'is_legendary']) )   # default: axis=1

# Row filtering
print('\n----- Row filtering -----')
display( df.filter(items=[0, 10, 20, 30, 40], axis=0) )  # axis=0 filters row labels (default: axis=1)

----- Column filtering -----


Unnamed: 0,pokedex_number,name,japanese_name,is_legendary
0,1,Bulbasaur,Fushigidaneフシギダネ,0
1,2,Ivysaur,Fushigisouフシギソウ,0
2,3,Venusaur,Fushigibanaフシギバナ,0
3,4,Charmander,Hitokageヒトカゲ,0
4,5,Charmeleon,Lizardoリザード,0
...,...,...,...,...
796,797,Celesteela,Tekkaguyaテッカグヤ,1
797,798,Kartana,Kamiturugiカミツルギ,1
798,799,Guzzlord,Akuzikingアクジキング,1
799,800,Necrozma,Necrozmaネクロズマ,1



----- Row filtering -----


Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,1,65,65,45,grass,poison,6.9,1,0
10,['Shed Skin'],1.0,1.0,1.0,1.0,1.0,0.5,2.0,2.0,1.0,...,50.0,11,25,25,30,bug,,9.9,1,0
20,"['Keen Eye', 'Sniper']",0.5,1.0,1.0,2.0,1.0,1.0,1.0,1.0,0.0,...,50.0,21,31,31,70,normal,flying,2.0,1,0
30,"['Poison Point', 'Rivalry', 'Sheer Force']",0.5,1.0,1.0,0.0,0.5,0.5,1.0,1.0,1.0,...,0.0,31,75,85,76,poison,ground,60.0,1,0
40,"['Inner Focus', 'Infiltrator']",0.25,1.0,1.0,2.0,0.5,0.25,1.0,1.0,1.0,...,50.0,41,30,40,55,poison,flying,7.5,1,0


In [34]:
# The 'like' argument to specify the substring in *labels*

# Column filtering
# Choose columns whose labels contain the 'against' substring, i.e., ('against' in col_label) is True
print('----- Column filtering -----')
display( df.filter(like="against") )   # default: axis=1

# Row filtering
# Choose rows whose labels contain the '97' substring, i.e., ('97' in str(row_label)) is True
print('\n----- Row filtering -----')
display( df.filter(like='97', axis=0) )  # axis=0 filters row labels (default: axis=1)

----- Column filtering -----


Unnamed: 0,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,against_grass,against_ground,against_ice,against_normal,against_poison,against_psychic,against_rock,against_steel,against_water
0,1.00,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,1.0,2.0,1.0,1.0,2.0,1.0,1.0,0.5
1,1.00,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,1.0,2.0,1.0,1.0,2.0,1.0,1.0,0.5
2,1.00,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,1.0,2.0,1.0,1.0,2.0,1.0,1.0,0.5
3,0.50,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,0.50,2.0,0.5,1.0,1.0,1.0,2.0,0.5,2.0
4,0.50,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,0.50,2.0,0.5,1.0,1.0,1.0,2.0,0.5,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,0.25,1.0,0.5,2.0,0.5,1.0,2.0,0.5,1.0,0.25,0.0,1.0,0.5,0.0,0.5,1.0,0.5,1.0
797,1.00,1.0,0.5,0.5,0.5,2.0,4.0,1.0,1.0,0.25,1.0,1.0,0.5,0.0,0.5,0.5,0.5,0.5
798,2.00,0.5,2.0,0.5,4.0,2.0,0.5,1.0,0.5,0.50,1.0,2.0,1.0,1.0,0.0,1.0,1.0,0.5
799,2.00,2.0,1.0,1.0,1.0,0.5,1.0,1.0,2.0,1.00,1.0,1.0,1.0,1.0,0.5,1.0,1.0,1.0



----- Row filtering -----


Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
97,"['Hyper Cutter', 'Shell Armor', 'Sheer Force']",1.0,1.0,1.0,2.0,1.0,1.0,0.5,1.0,1.0,...,50.0,98,25,25,50,water,,6.5,1,0
197,"['Insomnia', 'Super Luck', 'Prankster']",1.0,0.5,1.0,2.0,2.0,1.0,1.0,1.0,0.5,...,50.0,198,85,42,91,dark,flying,2.1,2,0
297,"['Thick Fat', 'Huge Power', 'Sap Sipper']",0.5,0.5,0.0,1.0,1.0,1.0,1.0,1.0,0.0,...,24.6,298,20,40,20,normal,fairy,2.0,3,0
397,"['Intimidate', 'Reckless']",0.5,1.0,1.0,2.0,1.0,1.0,1.0,1.0,0.0,...,50.0,398,50,60,100,normal,flying,24.9,4,0
497,"['Blaze', 'Thick Fat']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,498,45,45,45,fire,,9.9,5,0
597,"['Iron Barbs', 'Anticipation']",1.0,1.0,0.5,0.5,0.5,2.0,4.0,1.0,1.0,...,50.0,598,54,116,20,grass,steel,110.0,5,0
697,"['Refrigerate', 'Snow Warning']",1.0,1.0,1.0,1.0,1.0,4.0,1.0,0.5,1.0,...,88.1,698,67,63,46,rock,ice,25.2,6,0
797,['Beast Boost'],1.0,1.0,0.5,0.5,0.5,2.0,4.0,1.0,1.0,...,,798,59,31,109,grass,steel,0.1,7,1


In [35]:
# Use the 'regex' argument to specify a search pattern for *labels*

# Column filtering
# Choose columns whose labels include the 'fairy' substring OR end with 'name'
print('----- Column filtering -----')
display( df.filter(regex="fairy|name$") )   # default: axis=1

# Row filtering
# Choose rows whose labels contain a digit from '5' to '7' immediately followed by '97'
print('\n----- Row filtering -----')
display( df.filter(regex='[5-7]97', axis=0) )  # axis=0 filters row labels (default: axis=1)

----- Column filtering -----


Unnamed: 0,against_fairy,japanese_name,name
0,0.5,Fushigidaneフシギダネ,Bulbasaur
1,0.5,Fushigisouフシギソウ,Ivysaur
2,0.5,Fushigibanaフシギバナ,Venusaur
3,0.5,Hitokageヒトカゲ,Charmander
4,0.5,Lizardoリザード,Charmeleon
...,...,...,...
796,0.5,Tekkaguyaテッカグヤ,Celesteela
797,0.5,Kamiturugiカミツルギ,Kartana
798,4.0,Akuzikingアクジキング,Guzzlord
799,1.0,Necrozmaネクロズマ,Necrozma



----- Row filtering -----


Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
597,"['Iron Barbs', 'Anticipation']",1.0,1.0,0.5,0.5,0.5,2.0,4.0,1.0,1.0,...,50.0,598,54,116,20,grass,steel,110.0,5,0
697,"['Refrigerate', 'Snow Warning']",1.0,1.0,1.0,1.0,1.0,4.0,1.0,0.5,1.0,...,88.1,698,67,63,46,rock,ice,25.2,6,0
797,['Beast Boost'],1.0,1.0,0.5,0.5,0.5,2.0,4.0,1.0,1.0,...,,798,59,31,109,grass,steel,0.1,7,1
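
The `regex` pattern is applied as a search, so a label matches if the pattern occurs anywhere inside it. With only 801 rows this makes no difference above, but on a larger index an unanchored pattern can pick up extra labels. The sketch below uses a small hypothetical frame (`toy`, with made-up row labels) to compare an unanchored pattern with one anchored by `^` and `$`.

```python
import pandas as pd

# Hypothetical toy frame; the row labels are chosen to expose partial matches
toy = pd.DataFrame({"value": [1, 2, 3, 4]}, index=[597, 1597, 5970, 97])

# Unanchored: keeps every label that *contains* a digit 5-7 followed by 97
loose = toy.filter(regex="[5-7]97", axis=0)

# Anchored: the whole label must be exactly one digit 5-7 followed by 97
strict = toy.filter(regex="^[5-7]97$", axis=0)

print(loose.index.tolist())    # [597, 1597, 5970]
print(strict.index.tolist())   # [597]
```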


## 1.5 Column filtering with dtype

In [36]:
# Select columns by their dtypes

print('----- include -----\n')
df2 = df.select_dtypes(include=['int'])   # Choose only columns whose dtype is 'int'
print( df2.info() )

print('\n----- exclude -----\n')
df3 = df.select_dtypes(exclude=['int'])   # Choose all columns except those whose dtype is 'int'
print( df3.info() )

----- include -----

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   attack             801 non-null    int64
 1   base_egg_steps     801 non-null    int64
 2   base_happiness     801 non-null    int64
 3   base_total         801 non-null    int64
 4   defense            801 non-null    int64
 5   experience_growth  801 non-null    int64
 6   hp                 801 non-null    int64
 7   pokedex_number     801 non-null    int64
 8   sp_attack          801 non-null    int64
 9   sp_defense         801 non-null    int64
 10  speed              801 non-null    int64
 11  generation         801 non-null    int64
 12  is_legendary       801 non-null    int64
dtypes: int64(13)
memory usage: 81.5 KB
None

----- exclude -----

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 28 columns):
 #   Column 
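
`select_dtypes()` also accepts broader dtype groups such as `'number'`, which covers both integer and float columns. The following is a minimal sketch on a hypothetical three-column frame (`toy`), not on the Pokemon data itself.

```python
import pandas as pd

# Hypothetical toy frame with one integer, one float, and one text column
toy = pd.DataFrame({
    "hp": [45, 60],
    "weight_kg": [6.9, 13.0],
    "name": ["Bulbasaur", "Ivysaur"],
})

# 'number' selects every numeric column (integers and floats)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# 'float' selects only the floating-point columns
float_cols = toy.select_dtypes(include=["float"]).columns.tolist()

# Excluding 'number' leaves the non-numeric columns
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)       # ['hp', 'weight_kg']
print(float_cols)         # ['weight_kg']
print(non_numeric_cols)   # ['name']
```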

# 2. Pandas: Handle missing data

- In pandas (NumPy backend), missing values refer to `np.nan` (numeric), `None` (object), `pd.NA` (object), and `pd.NaT` (datetime).
- `np.inf` is treated as missing data only if `pd.options.mode.use_inf_as_na` is set to `True`. Note that this option is deprecated as of pandas 2.1.
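
As a quick check of the list above, the sketch below passes each marker to `pd.isna()`. It assumes the default pandas options (so infinities are not treated as missing); the names `markers`, `is_missing`, `s`, and `s_clean` are only for illustration.

```python
import numpy as np
import pandas as pd

# Candidate markers, keyed by a readable label
markers = {
    "np.nan": np.nan,
    "None": None,
    "pd.NA": pd.NA,
    "pd.NaT": pd.NaT,
    "np.inf": np.inf,
}

# pd.isna() on a scalar returns a single boolean
is_missing = {label: bool(pd.isna(value)) for label, value in markers.items()}
print(is_missing)

# To treat infinities as missing without a global option,
# convert them to NaN explicitly
s = pd.Series([1.0, np.inf, -np.inf])
s_clean = s.replace([np.inf, -np.inf], np.nan)
print(s_clean.isna().tolist())   # [False, True, True]
```

Converting infinities explicitly keeps the behaviour local to one Series instead of changing how the whole session detects missing data.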

In [37]:
# Create a subset of data for playing around
df = df_pokemon.loc[ : , ['pokedex_number', 'name', 'hp', 'weight_kg', 'height_m', 'type1', 'type2'] ]

print( df.info(), end='\n\n' )
display(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   pokedex_number  801 non-null    int64  
 1   name            801 non-null    object 
 2   hp              801 non-null    int64  
 3   weight_kg       781 non-null    float64
 4   height_m        781 non-null    float64
 5   type1           801 non-null    object 
 6   type2           417 non-null    object 
dtypes: float64(2), int64(2), object(3)
memory usage: 43.9+ KB
None



Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,1,Bulbasaur,45,6.9,0.7,grass,poison
1,2,Ivysaur,60,13.0,1.0,grass,poison
2,3,Venusaur,80,100.0,2.0,grass,poison
3,4,Charmander,39,8.5,0.6,fire,
4,5,Charmeleon,58,19.0,1.1,fire,
...,...,...,...,...,...,...,...
796,797,Celesteela,97,999.9,9.2,steel,flying
797,798,Kartana,59,0.1,0.3,grass,steel
798,799,Guzzlord,223,888.0,5.5,dark,dragon
799,800,Necrozma,97,230.0,2.4,psychic,


## 2.1 Detect/Count NA

For pandas.Series:

In [38]:
# Count non-NA in a pandas.Series
df['type2'].count()

417

In [39]:
# Boolean mask for NA values in a pandas.Series
df['type2'].isna()   # This equals df['type2'].isnull()

Unnamed: 0,type2
0,False
1,False
2,False
3,True
4,True
...,...
796,False
797,False
798,False
799,True


For pandas.DataFrame:

In [40]:
# Count non-NA values for each column in a pandas.DataFrame
df.count()   # default: axis=0

Unnamed: 0,0
pokedex_number,801
name,801
hp,801
weight_kg,781
height_m,781
type1,801
type2,417


In [41]:
# Count non-NA values for each row in a pandas.DataFrame
df.count(axis=1)

Unnamed: 0,0
0,7
1,7
2,7
3,6
4,6
...,...
796,7
797,7
798,7
799,6


In [42]:
# Boolean mask for NA values in a pandas.DataFrame
df.isna()    # This equals df.isnull()

Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,True
4,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...
796,False,False,False,False,False,False,False
797,False,False,False,False,False,False,False
798,False,False,False,False,False,False,False
799,False,False,False,False,False,False,True
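
Because `True` counts as 1 when summed, the boolean mask from `isna()` can be aggregated to count missing values directly, which complements `count()` (the number of non-NA values). A minimal sketch on a hypothetical frame `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "weight_kg": [6.9, np.nan, 100.0, np.nan],
    "type2": ["poison", None, None, None],
})

na_per_column = toy.isna().sum()        # number of NA values in each column
na_per_row = toy.isna().sum(axis=1)     # number of NA values in each row
na_share = toy.isna().mean()            # fraction of NA values in each column

print(na_per_column.to_dict())   # {'name': 0, 'weight_kg': 2, 'type2': 3}
print(na_per_row.tolist())       # [0, 2, 1, 2]
print(na_share.to_dict())        # {'name': 0.0, 'weight_kg': 0.5, 'type2': 0.75}
```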


## 2.2 Filter NA

In [43]:
# Show pokemons whose weight or height is NA

# Style 1
df.loc[ df['weight_kg'].isna() | df['height_m'].isna() , : ].reset_index(drop=True)

# Style 2
#df.query("weight_kg.isna() | height_m.isna()", engine='python').reset_index(drop=True)

Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,19,Rattata,30,,,normal,dark
1,20,Raticate,75,,,normal,dark
2,26,Raichu,60,,,electric,electric
3,27,Sandshrew,50,,,ground,ice
4,28,Sandslash,75,,,ground,ice
5,37,Vulpix,38,,,fire,ice
6,38,Ninetales,73,,,fire,ice
7,50,Diglett,10,,,ground,ground
8,51,Dugtrio,35,,,ground,ground
9,52,Meowth,40,,,normal,dark


In [44]:
# Show pokemons whose 'type2' attribute isn't NA

# Style 1
df.loc[ df['type2'].notna() , : ].reset_index(drop=True)

# Style 2
#df.query("type2.notna()", engine='python').reset_index(drop=True)

Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,1,Bulbasaur,45,6.9,0.7,grass,poison
1,2,Ivysaur,60,13.0,1.0,grass,poison
2,3,Venusaur,80,100.0,2.0,grass,poison
3,6,Charizard,78,90.5,1.7,fire,flying
4,12,Butterfree,60,32.0,1.1,bug,flying
...,...,...,...,...,...,...,...
412,795,Pheromosa,71,25.0,1.8,bug,fighting
413,797,Celesteela,97,999.9,9.2,steel,flying
414,798,Kartana,59,0.1,0.3,grass,steel
415,799,Guzzlord,223,888.0,5.5,dark,dragon


In [45]:
# Reminder: When isna() is applied to a DataFrame, the result is itself a DataFrame of bool
# So how do we filter rows with this bool DataFrame?
df.isna()

Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,True
4,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...
796,False,False,False,False,False,False,False
797,False,False,False,False,False,False,False
798,False,False,False,False,False,False,False
799,False,False,False,False,False,False,True


In [46]:
# Show all pokemons (rows) whose column values include at least one NA

# Style 1: Not recommended
'''
filter = pd.Series( [False] * len(df) )   # len(df) = df.shape[0]
for col in df.columns:
  filter |= df[col].isna()
df.loc[ filter , : ].reset_index(drop=True)
'''

# Style 2
df.loc[ df.isna().any(axis=1) , : ].reset_index(drop=True)

# Style 3
#df.query("@df.isna().any(axis=1)", engine='python').reset_index(drop=True)

Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,4,Charmander,39,8.5,0.6,fire,
1,5,Charmeleon,58,19.0,1.1,fire,
2,7,Squirtle,44,9.0,0.5,water,
3,8,Wartortle,59,22.5,1.0,water,
4,9,Blastoise,79,85.5,1.6,water,
...,...,...,...,...,...,...,...
398,782,Jangmo-o,45,29.7,0.6,dragon,
399,789,Cosmog,43,0.1,0.2,psychic,
400,790,Cosmoem,43,999.9,0.1,psychic,
401,796,Xurkitree,83,100.0,3.8,electric,
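
The same pattern works on a subset of columns, and `all(axis=1)` can replace `any(axis=1)` when every selected column must be NA. The sketch below uses a hypothetical frame `toy` and a hypothetical column list `cols`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 misses both values, row 2 misses only the weight
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "weight_kg": [6.9, np.nan, np.nan, 8.5],
    "height_m": [0.7, np.nan, 1.0, 0.6],
})
cols = ["weight_kg", "height_m"]

# Rows where at least one of the selected columns is NA
any_rows = toy.loc[toy[cols].isna().any(axis=1), :]

# Rows where all of the selected columns are NA
all_rows = toy.loc[toy[cols].isna().all(axis=1), :]

print(any_rows.index.tolist())   # [1, 2]
print(all_rows.index.tolist())   # [1]
```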


## 2.3 Drop missing data

Dropping NA values from a 1D pandas.Series is straightforward.

In [47]:
# Preview the dataframe (just a reminder)
df

Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,1,Bulbasaur,45,6.9,0.7,grass,poison
1,2,Ivysaur,60,13.0,1.0,grass,poison
2,3,Venusaur,80,100.0,2.0,grass,poison
3,4,Charmander,39,8.5,0.6,fire,
4,5,Charmeleon,58,19.0,1.1,fire,
...,...,...,...,...,...,...,...
796,797,Celesteela,97,999.9,9.2,steel,flying
797,798,Kartana,59,0.1,0.3,grass,steel
798,799,Guzzlord,223,888.0,5.5,dark,dragon
799,800,Necrozma,97,230.0,2.4,psychic,


In [48]:
# Drop missing data in a 1D pandas.Series
df2 = df['weight_kg'].dropna()

print(df2.shape)
df2

(781,)


Unnamed: 0,weight_kg
0,6.9
1,13.0
2,100.0
3,8.5
4,19.0
...,...
796,999.9
797,0.1
798,888.0
799,230.0


In a 2D pandas.DataFrame, we cannot drop single values; we can only drop entire rows or entire columns.

In [49]:
# Drop all columns containing NA
df.dropna(axis=1)    # This equals df.dropna(axis=1, how='any')

Unnamed: 0,pokedex_number,name,hp,type1
0,1,Bulbasaur,45,grass
1,2,Ivysaur,60,grass
2,3,Venusaur,80,grass
3,4,Charmander,39,fire
4,5,Charmeleon,58,fire
...,...,...,...,...
796,797,Celesteela,97,steel
797,798,Kartana,59,grass
798,799,Guzzlord,223,dark
799,800,Necrozma,97,psychic


In [50]:
# Drop all rows in which any NA is present.

# Style 1
df.dropna()   # This equals df.dropna(axis=0, how='any')

# Style 2
#df.loc[ df.notna().all(axis=1) , : ]

# Style 3
#df.query("@df.notna().all(axis=1)", engine='python')

Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,1,Bulbasaur,45,6.9,0.7,grass,poison
1,2,Ivysaur,60,13.0,1.0,grass,poison
2,3,Venusaur,80,100.0,2.0,grass,poison
5,6,Charizard,78,90.5,1.7,fire,flying
11,12,Butterfree,60,32.0,1.1,bug,flying
...,...,...,...,...,...,...,...
794,795,Pheromosa,71,25.0,1.8,bug,fighting
796,797,Celesteela,97,999.9,9.2,steel,flying
797,798,Kartana,59,0.1,0.3,grass,steel
798,799,Guzzlord,223,888.0,5.5,dark,dragon
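
`dropna()` has a few more arguments that control which rows are removed: `subset` limits the check to certain columns, `how='all'` drops a row only when all checked values are NA, and `thresh` keeps rows with at least that many non-NA values. A minimal sketch on a hypothetical frame `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NA values spread over two columns
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "weight_kg": [6.9, np.nan, 100.0, np.nan],
    "type2": ["poison", None, None, "flying"],
})

# Drop rows only when 'weight_kg' is NA
by_subset = toy.dropna(subset=["weight_kg"])

# Drop rows only when both 'weight_kg' and 'type2' are NA
by_all = toy.dropna(how="all", subset=["weight_kg", "type2"])

# Keep rows that have at least 3 non-NA values
by_thresh = toy.dropna(thresh=3)

print(by_subset.index.tolist())   # [0, 2]
print(by_all.index.tolist())      # [0, 2, 3]
print(by_thresh.index.tolist())   # [0]
```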


## 2.4 Fill missing data

In [51]:
df2 = df.fillna('empty value')
df2

Unnamed: 0,pokedex_number,name,hp,weight_kg,height_m,type1,type2
0,1,Bulbasaur,45,6.9,0.7,grass,poison
1,2,Ivysaur,60,13.0,1.0,grass,poison
2,3,Venusaur,80,100.0,2.0,grass,poison
3,4,Charmander,39,8.5,0.6,fire,empty value
4,5,Charmeleon,58,19.0,1.1,fire,empty value
...,...,...,...,...,...,...,...
796,797,Celesteela,97,999.9,9.2,steel,flying
797,798,Kartana,59,0.1,0.3,grass,steel
798,799,Guzzlord,223,888.0,5.5,dark,dragon
799,800,Necrozma,97,230.0,2.4,psychic,empty value


In [52]:
print( df2.count(), end='\n\n' )
print( df2.info() )

pokedex_number    801
name              801
hp                801
weight_kg         801
height_m          801
type1             801
type2             801
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   pokedex_number  801 non-null    int64 
 1   name            801 non-null    object
 2   hp              801 non-null    int64 
 3   weight_kg       801 non-null    object
 4   height_m        801 non-null    object
 5   type1           801 non-null    object
 6   type2           801 non-null    object
dtypes: int64(2), object(5)
memory usage: 43.9+ KB
None
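
In the output above, filling every column with one string turns `weight_kg` and `height_m` into `object` columns. One way to avoid this is to pass a dict that maps each column to its own fill value. The sketch below uses a hypothetical frame `toy`; filling with the column mean and the placeholder `'none'` are illustrative choices, not recommendations for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one text column
toy = pd.DataFrame({
    "weight_kg": [6.9, np.nan, 100.0],
    "type2": ["poison", None, "flying"],
})

# Map each column to its own fill value so that dtypes are preserved
filled = toy.fillna({
    "weight_kg": toy["weight_kg"].mean(),   # numeric fill keeps float64
    "type2": "none",                        # text placeholder for the text column
})

print(filled)
print(filled["weight_kg"].dtype)   # float64
```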


# 3. Pandas: Handle duplicated data

## 3.1 Check for duplication

- In pandas, we can check for duplicated values (in a Series) or duplicated rows (in a DataFrame) with `pandas.Series.duplicated()` and `pandas.DataFrame.duplicated()`, respectively.

In [53]:
# Create a subset of data for playing around
df = df_pokemon.loc[ : , ['pokedex_number', 'name', 'type1', 'type2'] ]

print( df.info(), end='\n\n' )
display(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   pokedex_number  801 non-null    int64 
 1   name            801 non-null    object
 2   type1           801 non-null    object
 3   type2           417 non-null    object
dtypes: int64(1), object(3)
memory usage: 25.2+ KB
None



Unnamed: 0,pokedex_number,name,type1,type2
0,1,Bulbasaur,grass,poison
1,2,Ivysaur,grass,poison
2,3,Venusaur,grass,poison
3,4,Charmander,fire,
4,5,Charmeleon,fire,
...,...,...,...,...
796,797,Celesteela,steel,flying
797,798,Kartana,grass,steel
798,799,Guzzlord,dark,dragon
799,800,Necrozma,psychic,


In [54]:
# By default, pandas uses *all columns* to consider duplication
result = df.duplicated()   # Return a Series of bool
display( result )

# Count the number of occurrences
print()
display( result.value_counts() )

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
796,False
797,False
798,False
799,False





Unnamed: 0,count
False,801


In [55]:
# To consider duplication based on some columns, use the 'subset' argument
result1 = df.duplicated(subset=['type1','type2'])  # default: keep='first' means mark duplicates as True except for the first occurrence
result2 = df.duplicated(subset=['type1','type2'], keep=False) # keep=False: mark all duplicates as True

# Display result
print("===== result1 (default: keep='first') =====")
display( result1 )
print()
display( result1.value_counts() )

print('\n===== result2 (keep=False) =====')
display( result2 )
print()
display( result2.value_counts() )

===== result1 (default: keep='first') =====


Unnamed: 0,0
0,False
1,True
2,True
3,False
4,True
...,...
796,True
797,True
798,True
799,True





Unnamed: 0,count
True,635
False,166



===== result2 (keep=False) =====


Unnamed: 0,0
0,True
1,True
2,True
3,True
4,True
...,...
796,True
797,True
798,True
799,True





Unnamed: 0,count
True,755
False,46
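
The effect of `keep` is easier to see on a handful of values. The sketch below applies the three settings to a small hypothetical Series `s`.

```python
import pandas as pd

# Hypothetical toy Series in which 'a' appears three times
s = pd.Series(["a", "b", "a", "c", "a"])

mark_first = s.duplicated(keep="first")   # first 'a' is not marked
mark_last = s.duplicated(keep="last")     # last 'a' is not marked
mark_all = s.duplicated(keep=False)       # every 'a' is marked

print(mark_first.tolist())   # [False, False, True, False, True]
print(mark_last.tolist())    # [True, False, True, False, False]
print(mark_all.tolist())     # [True, False, True, False, True]
```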


In [56]:
# Show pokemons whose (type1, type2) combination is unique
df.loc[ ~df.duplicated(subset=['type1','type2'] , keep=False) , : ]

Unnamed: 0,pokedex_number,name,type1,type2
25,26,Raichu,electric,electric
104,105,Marowak,ground,fire
207,208,Steelix,steel,ground
218,219,Magcargo,fire,rock
247,248,Tyranitar,rock,dark
250,251,Celebi,psychic,grass
289,290,Nincada,bug,ground
291,292,Shedinja,bug,ghost
301,302,Sableye,dark,ghost
388,389,Torterra,grass,ground


In [57]:
# Check the correctness of the previous cell
# Is it true that the only pokemon with type1='electric' and type2='electric' is Raichu?
df.query("type1=='electric' and type2=='electric'")

Unnamed: 0,pokedex_number,name,type1,type2
25,26,Raichu,electric,electric


## 3.2 Drop duplication

In [58]:
"""
- Return DataFrame with duplicate rows removed
- The 'keep' argument determines which duplicates (if any) to keep
  - first (default) : Drop duplicates except for the first occurrence.
  - last : Drop duplicates except for the last occurrence.
  - False : Drop all duplicates.
"""
df.drop_duplicates(subset=['type1','type2'] , keep=False).reset_index()

Unnamed: 0,index,pokedex_number,name,type1,type2
0,25,26,Raichu,electric,electric
1,104,105,Marowak,ground,fire
2,207,208,Steelix,steel,ground
3,218,219,Magcargo,fire,rock
4,247,248,Tyranitar,rock,dark
5,250,251,Celebi,psychic,grass
6,289,290,Nincada,bug,ground
7,291,292,Shedinja,bug,ghost
8,301,302,Sableye,dark,ghost
9,388,389,Torterra,grass,ground
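
The three `keep` settings can be compared side by side on a small hypothetical frame `toy`. The sketch also uses `ignore_index=True`, which renumbers the result from 0 and is an alternative to calling `reset_index(drop=True)` afterwards. Note that missing values in the subset columns are treated as equal to each other when looking for duplicates.

```python
import pandas as pd

# Hypothetical toy frame; rows C and D share type1='fire' with a missing type2
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "type1": ["grass", "grass", "fire", "fire", "water"],
    "type2": ["poison", "poison", None, None, None],
})
cols = ["type1", "type2"]

keep_first = toy.drop_duplicates(subset=cols, keep="first", ignore_index=True)
keep_last = toy.drop_duplicates(subset=cols, keep="last", ignore_index=True)
keep_none = toy.drop_duplicates(subset=cols, keep=False, ignore_index=True)

print(keep_first["name"].tolist())   # ['A', 'C', 'E']
print(keep_last["name"].tolist())    # ['B', 'D', 'E']
print(keep_none["name"].tolist())    # ['E']
```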


# PRACTICE

1. (1 point) Display the pokedex number, English name, Japanese name, and hp of all 1st-generation pokemons. Order the results from the maximum hp to the minimum hp; for equal hp values, order by pokedex number from small to large.

2. (1 point) How many pokemons have non-NA values for both type1 and type2, and have a type1 of either water or grass?

3. (1.5 points) Display the pokedex number, English name, hp, abilities, and capture rate of all pokemons whose abilities attribute includes 'super luck' and whose capture rate is greater than 90.