<h3>Data Indexing</h3>

In [2]:
# Import dependencies
import pandas as pd
import numpy as np

In [4]:
# Data Selections in Series
# A Series object acts in many ways like a one-dimensional NumPy array, and in
# many ways like a standard Python dictionary

# Series as a Dictionary
data = pd.Series(
    [0.24, .5, .54, 2.0],
    index = ['a', 'b', 'c', 'd']
)
data

a    0.24
b    0.50
c    0.54
d    2.00
dtype: float64

In [5]:
data['d']

np.float64(2.0)

In [7]:
# Using dictionary-like expressions
'a' in data

True

In [8]:
data.a

np.float64(0.24)

In [9]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [12]:
list(data.items())

[('a', 0.24), ('b', 0.5), ('c', 0.54), ('d', 2.0)]

In [13]:
# Modifying an existing element; attribute-style assignment works when the label already exists
data.d = 90
data

a     0.24
b     0.50
c     0.54
d    90.00
dtype: float64

In [16]:
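# Attribute assignment with a new name does not add an element ('e' is absent from the output);
# item assignment is what modifies the Series
data.e = 100
data['a'] = .34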
data

a     0.34
b     0.50
c     0.54
d    90.00
dtype: float64
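
The output above shows that `data.e = 100` left the Series unchanged. A minimal sketch of the difference between attribute and item assignment follows; the variable names `s`, `attr_added_label`, and `item_added_label` are illustrative and not part of the original notebook.

```python
import pandas as pd

s = pd.Series([0.24, 0.5, 0.54, 2.0], index=['a', 'b', 'c', 'd'])

# Assigning to an attribute whose name is not an existing label only
# sets a plain Python attribute on the object; the data is untouched.
s.e = 100
attr_added_label = 'e' in s.index

# Item assignment with a new label extends the Series by one element.
s['e'] = 100
item_added_label = 'e' in s.index
```

Item assignment is therefore the safer way to add or change elements, since it behaves the same whether or not the label already exists.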

In [17]:
# Series as a one-dimensional array: selection mechanisms such as slicing, masking, and fancy indexing
# Slicing
# With explicit labels, the final label is included in the slice, unlike ordinary Python slicing
data['a':'c']

a    0.34
b    0.50
c    0.54
dtype: float64

In [20]:
# data[0:2] also slices by position here; iloc states that intent explicitly
data.iloc[0:2]

a    0.34
b    0.50
dtype: float64
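
The two slicing cells above differ in whether the end point is included. The sketch below puts them side by side, assuming the same values as `data`; the names `s`, `by_label`, and `by_position` are illustrative.

```python
import pandas as pd

s = pd.Series([0.34, 0.5, 0.54, 90.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included.
by_label = s.loc['a':'c']

# Position-based slice: the end position 2 is excluded, as in plain Python.
by_position = s.iloc[0:2]

label_keys = list(by_label.index)
position_keys = list(by_position.index)
```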

In [23]:
# Masking
data[(data > .2) & (data < 1)]

a    0.34
b    0.50
c    0.54
dtype: float64
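
Masking works in two steps: the comparison builds a boolean Series, and that Series then selects the matching rows. A small sketch with the intermediate mask made visible, using illustrative names `s`, `mask`, and `selected`:

```python
import pandas as pd

s = pd.Series([0.34, 0.5, 0.54, 90.0], index=['a', 'b', 'c', 'd'])

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
mask = (s > 0.2) & (s < 1)

# Indexing with the boolean Series keeps only the rows where it is True.
selected = s[mask]
```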

In [24]:
# Fancy indexing
data[['a', 'c']]

a    0.34
c    0.54
dtype: float64

In [25]:
# Using implicit indexing
data[1:3]

b    0.50
c    0.54
dtype: float64

In [26]:
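# An integer key on a label-indexed Series: data[2] falls back to a positional
# (implicit) lookup and emits a FutureWarning in recent pandas; iloc avoids the ambiguity
data.iloc[2]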

np.float64(0.54)

In [34]:
# Indexers: loc and iloc
data = pd.Series(
    ['Kampala',
     'Mukono',
     'Jinja'],
    index=[1, 2, 3]
)
data

1    Kampala
2     Mukono
3      Jinja
dtype: object

In [35]:
# loc selects rows (and/or columns) by label and always references the explicit index
# iloc selects rows (and/or columns) by integer position, the implicit Python-style index
data.loc[1]

'Kampala'

In [37]:
data.iloc[1]

'Mukono'
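
The same distinction applies to slices, where an integer index makes the difference easy to miss. A brief sketch, assuming the same three towns as above; `towns`, `loc_slice`, and `iloc_slice` are illustrative names.

```python
import pandas as pd

towns = pd.Series(['Kampala', 'Mukono', 'Jinja'], index=[1, 2, 3])

# loc slices by label and includes the end label: labels 1 and 2.
loc_slice = towns.loc[1:2]

# iloc slices by position and excludes the end position: position 1 only.
iloc_slice = towns.iloc[1:2]
```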

In [38]:
# Conclusions
# Explicit is better than implicit.
# loc and iloc are explicit.
# Learn from the Zen of Python
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [53]:
# Data Selection in DataFrames
# A DataFrame acts in many ways like a two-dimensional or structured array,
# and in other ways like a dictionary of Series structures sharing the same index
hivPrevalence = pd.Series(
    {
    'Central 1': 8.6,
    'Central 2': 7.6,
    'Kampala': 6.9,
    'East-Central': 4.7,
    'Mid-Central': 5.1,
}
)

population = pd.Series(
    {
    'Central 1': 1_904_035,
    'Central 2': 2_485_890,
    'Kampala': 5_000_234,
    'East-Central': 3_000_000,
    'Mid-Central': 2_904_342,
}
)

data = pd.DataFrame({'HIV Prevalence': hivPrevalence,
                     'pop': population
                    })
data

              HIV Prevalence      pop
Central 1                8.6  1904035
Central 2                7.6  2485890
Kampala                  6.9  5000234
East-Central             4.7  3000000
Mid-Central              5.1  2904342


In [54]:
# Individual Series that make up the DataFrame can be accessed via dictionary-style indexing
data['HIV Prevalence']

Central 1       8.6
Central 2       7.6
Kampala         6.9
East-Central    4.7
Mid-Central     5.1
Name: HIV Prevalence, dtype: float64

In [56]:
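# Using attribute-style access
# This does not work in all cases: a column name that is not a valid Python identifier
# (such as 'HIV Prevalence') or that collides with a DataFrame method (such as 'pop')
# cannot be reached this way
data.pop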

<bound method DataFrame.pop of               HIV Prevalence      pop
Central 1                8.6  1904035
Central 2                7.6  2485890
Kampala                  6.9  5000234
East-Central             4.7  3000000
Mid-Central              5.1  2904342>
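
The output above is the `DataFrame.pop` method, not the population column. A minimal sketch of the collision and the dictionary-style alternative, using two of the rows from the table; `df`, `attr_is_method`, and `col` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'HIV Prevalence': [8.6, 7.6], 'pop': [1_904_035, 2_485_890]},
    index=['Central 1', 'Central 2'],
)

# Attribute access resolves to the DataFrame.pop method, which shadows
# the column of the same name.
attr_is_method = callable(df.pop)

# Dictionary-style access always returns the column as a Series.
col = df['pop']
```

Dictionary-style indexing is the dependable choice whenever a column name might clash with a method or contain spaces.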