# Pandas: Data Selection

Here we’ll look at ways of accessing and modifying values in Pandas Series and DataFrame objects.

In [69]:
import pandas as pd
import numpy as np

## 1. Data Selection in Series

As we saw, a Pandas Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. Data selection on a Series follows the patterns of both.

#### Series as Dictionary

In [70]:
s = pd.Series([1,12,30,15], index=['f','g','e','a'])
s

f     1
g    12
e    30
a    15
dtype: int64

In [71]:
s['e']

30

In [72]:
'a' in s          # checks the index labels only, like keys in a dictionary

True

In [73]:
1 in s            # False: 1 is a value in s, not an index label

False
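Because `in` looks only at the index labels, testing for a value has to go through the values explicitly. The following is a minimal sketch of two ways to do that; the name `s_demo` and the result variables are introduced here for illustration only.

```python
import pandas as pd

# Same data as the Series s above, rebuilt so the example is self-contained
s_demo = pd.Series([1, 12, 30, 15], index=['f', 'g', 'e', 'a'])

label_present = 'a' in s_demo          # membership test against the index labels
value_as_label = 1 in s_demo           # 1 is a value, not a label
value_present = 1 in s_demo.values     # membership test against the underlying array
isin_mask = s_demo.isin([1, 30])       # element-wise membership, returns a boolean Series
```

`isin` is usually the more convenient choice when the result is meant to be used as a mask.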

In [74]:
s.keys()

Index(['f', 'g', 'e', 'a'], dtype='object')

In [75]:
list(s.items())

[('f', 1), ('g', 12), ('e', 30), ('a', 15)]

In [76]:
s['a'] = 48             # like a Python dictionary, a Series is mutable
s

f     1
g    12
e    30
a    48
dtype: int64
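The dictionary analogy also covers adding entries: assigning to a label that does not exist yet extends the Series, just as assigning to a new key extends a dictionary. A small sketch, with `s_demo` and the label `'z'` chosen here purely for illustration:

```python
import pandas as pd

s_demo = pd.Series([1, 12, 30, 15], index=['f', 'g', 'e', 'a'])

s_demo['a'] = 48    # existing label: the value is overwritten
s_demo['z'] = 7     # new label: the Series grows by one entry

labels_after = list(s_demo.index)
```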

#### Series as 1D array

In [77]:
# slicing by explicit index labels
s['f':'a':2]         # [start:stop:step]; with labels, the stop label is included

f     1
e    30
dtype: int64

In [78]:
s[0:5:2]          # using implicit integer positions; the stop position is excluded

f     1
e    30
dtype: int64
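The two slices above give the same result only because of the step. Without a step, the difference between label slicing and positional slicing becomes visible: a label slice includes its stop label, while a positional slice excludes its stop position. A minimal sketch, using `loc` and `iloc` to make the intent explicit (`s_demo` is a name introduced here):

```python
import pandas as pd

s_demo = pd.Series([1, 12, 30, 48], index=['f', 'g', 'e', 'a'])

by_label = s_demo.loc['f':'e']     # label slice: the stop label 'e' is included
by_position = s_demo.iloc[0:2]     # positional slice: the stop position 2 is excluded
```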

In [79]:
# masking
s[(s>0)&(s<45)]

f     1
g    12
e    30
dtype: int64

In [80]:
# iloc: implicit (positional) indexing
# loc: explicit (label-based) indexing
ind = pd.Series(['a','b','c'], index=[1,5,4])
ind.loc[1]

'a'

In [81]:
ind.iloc[1]

'b'
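The distinction matters most for slices over an integer index, where the same numbers can mean labels or positions. As a sketch with the same `ind` Series (the result names are introduced here):

```python
import pandas as pd

ind = pd.Series(['a', 'b', 'c'], index=[1, 5, 4])

by_label = ind.loc[1:5]       # from label 1 through label 5, inclusive
by_position = ind.iloc[1:5]   # positions 1 up to (not including) 5, clipped to the length
```

Here `by_label` holds `'a'` and `'b'`, while `by_position` holds `'b'` and `'c'`, which is why spelling out `loc` or `iloc` is generally safer than relying on plain `[]` with integer indexes.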

## 2. Data Selection in DataFrames

A DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

#### DataFrame as Dictionary

In [82]:
area = pd.Series({'California': 423967, 
                  'Texas': 695662, 
                  'New York': 141297, 
                  'Florida': 170312,
                  'Illinois': 149995})

pop = pd.Series({'California': 38332521, 
                 'Texas': 26448193,
                 'New York': 19651127, 
                 'Florida': 19552860,
                 'Illinois': 12882135})

data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [83]:
data.area    # attribute-style access

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [84]:
data['area'] # dictionary-style access

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [85]:
# in the pandas version used here, attribute-style column access returns the exact same object as dictionary-style access
data['area'] is data.area

True
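Attribute-style access is a convenience that does not work for every column. If a column name collides with a DataFrame method, the attribute refers to the method, not the column. The `pop` column is such a case, since `DataFrame.pop` is an existing method. A short sketch on a two-row subset of the data (the subset and the result names are chosen here for brevity):

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

pop_is_method = callable(data.pop)                   # data.pop is the DataFrame.pop method
pop_is_column = isinstance(data.pop, pd.Series)      # so it is not the 'pop' column
pop_column = data['pop']                             # dictionary-style access always reaches the column
```

For the same reason, assigning a new column is best done with `data['name'] = ...` and not with `data.name = ...`.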

In [86]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


#### DataFrame as two-dimensional array

In [87]:
data.values             # the underlying data as a two-dimensional NumPy array

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [88]:
data.shape       #shape of dataframe

(5, 3)

In [89]:
data.ndim        #number of dimensions

2

In [90]:
data.T          #transpose the dataframe

         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density    90.41393    38.01874    139.0767    114.8061    85.88376


In [91]:
data.iloc[:3, :2]       # implicit (positional) indexing; the stop positions are excluded

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [92]:
data.loc[:'Illinois', :'pop']        # explicit (label-based) indexing; the stop labels are included

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
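The masking shown earlier for Series carries over to `loc` on a DataFrame, where a boolean mask on the rows can be combined with a list of column labels. A minimal sketch on the same data, with the threshold of 100 picked arbitrarily for illustration:

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

# Rows where density exceeds 100, restricted to two columns
dense = data.loc[data['density'] > 100, ['pop', 'density']]
```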


In [93]:
# .ix, a hybrid of iloc and loc, was deprecated in pandas 0.20 and removed in pandas 1.0;
# select rows by position and columns by label explicitly instead
data.loc[data.index[:3], :'pop']


              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
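Since `.ix` is no longer available from pandas 1.0 onward, mixing positional rows with label-based columns has to be spelled out. The sketch below shows two possible ways to get the same selection; the variable names are introduced here, and neither option is claimed to be the canonical replacement.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

# Option 1: chain a positional row selection with a label-based column slice
chained = data.iloc[:3].loc[:, :'pop']

# Option 2: translate the column label into a position, then use iloc only
pop_pos = data.columns.get_loc('pop')        # integer position of the 'pop' column
positional = data.iloc[:3, :pop_pos + 1]     # +1 because iloc excludes the stop
```

Chaining is fine for reading values; for assignment, a single `loc` or `iloc` call is the safer choice because it operates on the original DataFrame directly.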
