## Chapter 14: Data Indexing and Selection

---
* Author:  [Yuttapong Mahasittiwat](mailto:khala1391@gmail.com)
* Technologist | Data Modeler | Data Analyst
* [YouTube](https://www.youtube.com/khala1391)
* [LinkedIn](https://www.linkedin.com/in/yuttapong-m/)
---

Source: [**Python Data Science Handbook** by **VanderPlas**](https://jakevdp.github.io/PythonDataScienceHandbook/)

In [2]:
import numpy as np
import pandas as pd
print("numpy version :",np.__version__)
print("pandas version :",pd.__version__)

numpy version : 1.26.4
pandas version : 2.2.1


## Data Selection in Series

### series as dictionary

In [6]:
df = pd.Series([.25,.5,.75,1.],
              index=['a','b','c','d'])
df

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [10]:
df['b']

0.5

In [14]:
'a' in df

True

In [19]:
df.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [25]:
list(df.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [36]:
df.items()

<zip at 0x226ce9abb40>

In [44]:
df['e']=1.25
df

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

### series as 1D array

In [50]:
df['a':'c']  # explicit label slice: the end label 'c' is included

a    0.25
b    0.50
c    0.75
dtype: float64

In [56]:
df[0:2]  # implicit integer slice: the end position 2 is excluded

a    0.25
b    0.50
dtype: float64

In [70]:
df[(df>0.3) & (df<0.8)]

b    0.50
c    0.75
dtype: float64

In [76]:
df['a']

0.25

In [78]:
df[['a','e']]

a    0.25
e    1.25
dtype: float64
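
The two slicing styles above differ in how they treat the end point. The following is a minimal sketch of that difference using `loc` and `iloc` to make the intent explicit; the names `data`, `by_label`, and `by_position` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Same values and labels as the Series used in this section.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: both end points are included.
by_label = data.loc['a':'c']

# Position-based slice: the stop position is excluded, as in plain Python.
by_position = data.iloc[0:2]

print(list(by_label.index))      # ['a', 'b', 'c']
print(list(by_position.index))   # ['a', 'b']
```

Spelling out `loc` or `iloc` is optional here, but it removes any doubt about which convention a slice follows.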

### indexers: loc and iloc

In [82]:
df = pd.Series(['a','b','c'], index=[1,3,5])
df

1    a
3    b
5    c
dtype: object

In [86]:
df[1]  # explicit index: selects the label 1

'a'

In [88]:
df[1:3]  # implicit index: selects positions 1 and 2

3    b
5    c
dtype: object

In [92]:
df[0:3]  # implicit index: selects positions 0 through 2

1    a
3    b
5    c
dtype: object

In [94]:
df.loc[3]

'b'

In [98]:
df.loc[5]

'c'
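
The cells above show `loc` only. As a sketch of the counterpart, `iloc` always uses the implicit, Python-style position, which matters when the index itself consists of integers. The variable names below are illustrative.

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2.
letters = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

first_by_label = letters.loc[1]        # label 1
second_by_position = letters.iloc[1]   # position 1 (the second element)

label_slice = letters.loc[1:3]         # labels 1 through 3, end included
position_slice = letters.iloc[1:3]     # positions 1 and 2, end excluded

print(first_by_label, second_by_position)   # a b
print(list(label_slice))                    # ['a', 'b']
print(list(position_slice))                 # ['b', 'c']
```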

## Data Selection in DataFrames

### dataframe as dictionary

In [100]:
area = pd.Series({'California': 423967,
                  'Texas': 695662,
                  'Florida': 170312,
                  'New York': 141297,
                  'Pennsylvania': 119280})

pop = pd.Series({'California': 39538223,
                 'Texas': 29145505,
                 'Florida': 21538187,
                 'New York': 20201249,
                 'Pennsylvania': 13002700})
df = pd.DataFrame({'area':area, 'pop':pop})
df

                area       pop
California    423967  39538223
Texas         695662  29145505
Florida       170312  21538187
New York      141297  20201249
Pennsylvania  119280  13002700


In [104]:
df.index

Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')

In [106]:
df['area']

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64

In [108]:
df.area

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
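
Attribute-style access works for `area`, but it is not a general substitute for dictionary-style access. A small sketch of the pitfall, assuming a reduced two-row frame named `states` (an illustrative name): the column `pop` collides with the `DataFrame.pop` method, so `states.pop` returns the method rather than the column.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 39538223, 'Texas': 29145505})
states = pd.DataFrame({'area': area, 'pop': pop})

# 'area' does not clash with any DataFrame attribute, so both forms agree.
same_column = states.area.equals(states['area'])

# 'pop' is also a DataFrame method, so attribute access finds the method.
pop_is_method = callable(states.pop)

# Dictionary-style access always returns the column.
pop_column = states['pop']

print(same_column, pop_is_method)   # True True
```

For this reason, dictionary-style access is the safer habit, especially when assigning to a column.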

In [120]:
df['density'] = df['pop']/df['area']
df

                area       pop     density
California    423967  39538223   93.257784
Texas         695662  29145505   41.896072
Florida       170312  21538187  126.463121
New York      141297  20201249  142.970120
Pennsylvania  119280  13002700  109.009893


### dataframe as 2D array

In [128]:
df.values   # return dataframe as 2D array

array([[4.23967000e+05, 3.95382230e+07, 9.32577842e+01],
       [6.95662000e+05, 2.91455050e+07, 4.18960717e+01],
       [1.70312000e+05, 2.15381870e+07, 1.26463121e+02],
       [1.41297000e+05, 2.02012490e+07, 1.42970120e+02],
       [1.19280000e+05, 1.30027000e+07, 1.09009893e+02]])

In [130]:
df.T

         California       Texas     Florida    New York  Pennsylvania
area       423967.0    695662.0    170312.0    141297.0      119280.0
pop      39538220.0  29145500.0  21538190.0  20201250.0    13002700.0
density    93.25778    41.89607    126.4631    142.9701      109.0099


In [132]:
df.values[0]

array([4.23967000e+05, 3.95382230e+07, 9.32577842e+01])
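
The array output above is entirely floating point even though `area` and `pop` are integer columns. A minimal sketch of why, assuming a reduced two-row frame named `states` (illustrative): a NumPy array has a single dtype, so mixed integer and float columns are upcast to a common float type. `to_numpy()` is used here as the method counterpart of the `values` attribute.

```python
import numpy as np
import pandas as pd

states = pd.DataFrame(
    {'area': [423967, 695662], 'pop': [39538223, 29145505]},
    index=['California', 'Texas'],
)
states['density'] = states['pop'] / states['area']

# Two integer columns plus one float column: the array is upcast to float.
mixed = states.to_numpy()

# Integer columns on their own keep an integer dtype.
ints_only = states[['area', 'pop']].to_numpy()

# Indexing the array with a single integer returns a row, not a column.
first_row = mixed[0]

print(mixed.dtype)
print(ints_only.dtype)
```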

In [134]:
df['area']

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64

In [136]:
df.iloc[:3,:2]

              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187


In [138]:
df.loc[:'Florida',:'pop'] # explicit indexing

              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187


In [144]:
df.loc[df.density>120,['pop','density']]

               pop     density
Florida   21538187  126.463121
New York  20201249  142.970120


In [155]:
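# display floats with thousands separators and no decimals (affects display only, not the stored values)
pd.options.display.float_format = '{:,.0f}'.format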

In [164]:
# pd.options.display.*?

In [160]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, California to Pennsylvania
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   area     5 non-null      int64  
 1   pop      5 non-null      int64  
 2   density  5 non-null      float64
dtypes: float64(1), int64(2)
memory usage: 332.0+ bytes


In [157]:
df.iloc[0,2]=90
df

                area       pop  density
California    423967  39538223       90
Texas         695662  29145505       42
Florida       170312  21538187      126
New York      141297  20201249      143
Pennsylvania  119280  13002700      109
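
The assignment above addresses the cell by position. As a sketch, the same cell can be set by label with `loc`, which does not depend on the row or column order; the two-row frame `states` is an illustrative stand-in for the notebook's data.

```python
import pandas as pd

states = pd.DataFrame(
    {'area': [423967, 695662], 'pop': [39538223, 29145505]},
    index=['California', 'Texas'],
)
states['density'] = states['pop'] / states['area']

# Label-based counterpart of states.iloc[0, 2] = 90
states.loc['California', 'density'] = 90

# Both indexers now read back the same stored value.
by_position = states.iloc[0, 2]
by_label = states.loc['California', 'density']

print(by_position, by_label)   # 90.0 90.0
```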


### additional indexing conventions

In [166]:
df['Florida':'New York']

            area       pop  density
Florida   170312  21538187      126
New York  141297  20201249      143


In [172]:
df[1:3]

           area       pop  density
Texas    695662  29145505       42
Florida  170312  21538187      126


In [176]:
df[df.density>120]

            area       pop  density
Florida   170312  21538187      126
New York  141297  20201249      143
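
The three cells in this section all select rows, whereas a single column name inside the brackets selects a column. The sketch below puts the four cases side by side; it rebuilds the frame under the illustrative name `states` so that it runs on its own.

```python
import pandas as pd

names = ['California', 'Texas', 'Florida', 'New York', 'Pennsylvania']
states = pd.DataFrame(
    {
        'area': [423967, 695662, 170312, 141297, 119280],
        'pop': [39538223, 29145505, 21538187, 20201249, 13002700],
    },
    index=names,
)
states['density'] = states['pop'] / states['area']

column = states['area']                         # a column name selects a column
rows_by_label = states['Florida':'New York']    # a label slice selects rows
rows_by_position = states[1:3]                  # an integer slice selects rows
rows_by_mask = states[states['density'] > 120]  # a Boolean mask selects rows

print(list(rows_by_label.index))      # ['Florida', 'New York']
print(list(rows_by_position.index))   # ['Texas', 'Florida']
print(list(rows_by_mask.index))       # ['Florida', 'New York']
```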

