# Data indexing and selection
------------------------
[pandas-docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

The **axis labeling** information in pandas objects serves many purposes:
* Identifies data (i.e. provides metadata) using known indicators (important for analysis, visualization, and interactive console display)
* Enables automatic and explicit data alignment
* Allows **intuitive** getting and setting of subsets of the data set


Because the type of the data being accessed is not known in advance, the standard Python and NumPy **indexing operator** `[]` and **attribute operator** `.` have some **optimization** limits: pandas first has to work out what kind of access is being requested. For production code, the **optimized pandas data access** methods (such as `.loc`, `.iloc`, `.at`, and `.iat`) are recommended.
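
As a minimal sketch of what the dedicated accessors look like, the cell below reads the same element in four ways. The Series `s` and the variable names are illustrative assumptions; `.loc`, `.iloc`, `.at`, and `.iat` are standard pandas accessors.

```python
import pandas as pd

# Illustrative Series, not part of the notebook's own data
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']      # label-based selection
by_position = s.iloc[1]    # position-based selection
scalar_label = s.at['b']   # scalar access by label
scalar_pos = s.iat[1]      # scalar access by position

print(by_label, by_position, scalar_label, scalar_pos)
```

`.at` and `.iat` accept only a single label or position, which is what lets them skip the more general lookup logic.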

## 1. Data selection in Series
_________________________

#### 1.1. ``Series`` as a dictionary
____________________________

In [2]:
import pandas as pd

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
data['b']

* Using dictionary-like Python expressions and methods to examine the keys/indices and values:

In [None]:
'a' in data

In [None]:
data.keys()

In [None]:
type(data.items())

In [None]:
help(zip)

In [None]:
list(zip('abcdefg', range(3), range(2)))

In [None]:
list(data.items())
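
The link between ``items()`` and ``zip`` can be sketched as follows: ``zip`` pairs elements up and stops at its shortest input, and pairing the index with the values gives the same (label, value) tuples as ``items()``. The variable names are illustrative assumptions.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# zip stops at the shortest input, so only two tuples are produced here.
truncated = list(zip('abcdefg', range(3), range(2)))
print(truncated)

# Pairing the index with the values reproduces the (label, value) pairs.
pairs_from_items = list(data.items())
pairs_from_zip = list(zip(data.index, data.values))
print(pairs_from_items == pairs_from_zip)
```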

* ``Series`` objects can be modified with a dictionary-like syntax:

In [None]:
data['e'] = 1.25
data

#### 1.2. ``Series`` as a 1D array
_________________________

* Slicing by **explicit** index: final index is **included** in the slice

In [None]:
data['a':'c'] # [a,c]

* Slicing by **implicit integer** index: final index is **excluded** from the slice

In [None]:
data[0:2] # [0,2)

* Masking: selecting values with a Boolean filter

In [None]:
data[(data > 0.3) & (data < 0.8)]

In [None]:
f038 = (data > 0.3) & (data < 0.8)
data[f038]

In [None]:
type(f038)

In [None]:
f038

* Fancy indexing: passing a ``list`` of specific index labels

In [None]:
data[['a', 'e']]

In [None]:
lae = ['a', 'e']
data[lae]

#### 1.3. Indexers: ``loc``, ``iloc``, and ``ix``
__________________________

* Slicing and indexing conventions can be a source of confusion

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

* **Explicit** index when indexing

In [None]:
data[1]

* **Implicit** index when slicing (by default with integer indexing)

In [None]:
data[1:3] #[1,3)

* Because of this potential confusion in the **case of integer indexes**, Pandas provides some special *indexer* attributes that **explicitly** expose certain indexing schemes
* One guiding principle of Python code: **explicit is better than implicit**

* ``loc`` -- an attribute of the ``Series``:
    * is not a callable method; it is used with square brackets
    * allows indexing and slicing
    * always references the **explicit** index as a **label**

In [None]:
data.loc[1]

In [None]:
data.loc[1:3] #[1,3]

* ``iloc`` -- an attribute of the ``Series``:
    * is not a callable method; it is used with square brackets
    * allows indexing and slicing
    * always references the **implicit** Python-style index as an **integer** position:

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3] #[1,3)
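
The four cases can be put side by side in one small sketch, using the same integer-indexed ``Series`` as above; the variable names are illustrative assumptions.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = data.loc[1]             # label 1 -> 'a'
position_item = data.iloc[1]         # position 1 -> 'b'
label_slice = list(data.loc[1:3])    # labels 1 through 3, end included
position_slice = list(data.iloc[1:3])  # positions 1 and 2, end excluded

print(label_item, position_item)
print(label_slice, position_slice)
```

With an integer index the same numbers mean different things to the two indexers, which is why spelling out ``loc`` or ``iloc`` removes the ambiguity.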

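* ``ix`` -- an attribute that was a hybrid of the two (deprecated in pandas 0.20 and removed in pandas 1.0; use ``loc`` or ``iloc`` instead):
    * for ``Series`` objects it was equivalent to standard ``[]``-based indexing
    * it was mainly useful in the context of ``DataFrame`` objects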

## 2. Data Selection in ``DataFrame``
_________________________________

#### 2.1. ``DataFrame`` as a dictionary
____________________________

In [3]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

In [4]:
#data = pd.DataFrame({'area':area, 'pop':pop})
data = pd.DataFrame({'pop':pop, 'area':area})
data

| | pop | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


* The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [None]:
data['pop']

In [None]:
type(data['pop'])

In [None]:
type(data[['pop',]])

In [None]:
data[['pop',]]

* Equivalently, attribute-style access with column names (that are strings) can be used:

In [None]:
data.area

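* Attribute-style column access returns the same column data as dictionary-style access; whether the two expressions are the very same Python object (the ``is`` check below) can depend on the pandas version: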

In [None]:
data.area is data['area']

* Attribute-style access is not possible if the column names are not strings or if the column names conflict with methods of the ``DataFrame``:

In [None]:
data.pop is data['pop']

In [None]:
type(data.pop)

In [None]:
help(data.pop) #  data.pop points to DataFrame.pop()

In [None]:
data.pop('pop')
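
The name clash can be checked programmatically. The sketch below uses a small stand-in frame `df`, and `shadowed_columns` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

# Small stand-in frame with the same column names as above
df = pd.DataFrame({'pop': [38332521, 26448193], 'area': [423967, 695662]},
                  index=['California', 'Texas'])

pop_is_method = callable(df.pop)                # DataFrame.pop shadows the column
area_is_column = isinstance(df.area, pd.Series)  # no clash for 'area'


def shadowed_columns(frame):
    """Hypothetical helper: column names that attribute access cannot reach."""
    return [c for c in frame.columns
            if not isinstance(c, str) or hasattr(pd.DataFrame, c)]


print(pop_is_method, area_is_column, shadowed_columns(df))
```

Bracket access such as ``df['pop']`` is unaffected by these clashes, so it is the safer default.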

* **Dictionary**-style syntax can also be used to modify the object, in this case **adding a new column**:

In [None]:
data

In [None]:
data['pop'] = pop

In [None]:
data

In [5]:
data['density'] = data['pop'] / data['area']

In [6]:
data

| | pop | area | density |
|---|---|---|---|
| California | 38332521 | 423967 | 90.413926 |
| Texas | 26448193 | 695662 | 38.01874 |
| New York | 19651127 | 141297 | 139.076746 |
| Florida | 19552860 | 170312 | 114.806121 |
| Illinois | 12882135 | 149995 | 85.883763 |


In [None]:
data.vaccinated = 0.6 * data['pop']  # sets a plain attribute, not a column

In [None]:
data['vaccinated'] = 0.6 * data['pop']  # creates the column

You can use attribute access to modify an **existing** element of a Series or column of a DataFrame, **but** using attribute access to create a new column results in a new attribute being created rather than a new column.

In [None]:
data.__dict__
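
The difference can be verified on a small stand-in frame. This is a sketch under the assumption that pandas emits a ``UserWarning`` for the attribute assignment, which is silenced here; `df` and the variable names are illustrative.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'pop': [38332521, 26448193], 'area': [423967, 695662]},
                  index=['California', 'Texas'])

# Assigning to an existing column through attribute access updates the column.
df.area = df.area * 2

# Assigning to a new name only sets a Python attribute on the object.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', UserWarning)
    df.vaccinated = 0.6 * df['pop']
column_after_attribute = 'vaccinated' in df.columns

# Bracket syntax is what actually creates the column.
df['vaccinated'] = 0.6 * df['pop']
column_after_brackets = 'vaccinated' in df.columns

print(column_after_attribute, column_after_brackets)
```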

#### 2.2. ``DataFrame`` as a 2D array
_________________________

* Examining the underlying two-dimensional array with the ``values`` attribute:

In [None]:
data.values

* Getting the transposed ``DataFrame``:

In [None]:
data.T

In [None]:
data

In [None]:
data.values[0]

In [None]:
data.values

In [None]:
data['area']
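
The cells above contrast two behaviours: a single index on the NumPy array returns a row, while a single key on the ``DataFrame`` returns a column. A minimal sketch on a stand-in frame `df` (an assumption for illustration) makes the shapes explicit.

```python
import pandas as pd

df = pd.DataFrame({'pop': [38332521, 26448193, 19651127],
                   'area': [423967, 695662, 141297]},
                  index=['California', 'Texas', 'New York'])

arr = df.values            # plain NumPy array; row and column labels are dropped
first_row = arr[0]         # single index on the array -> first row
area_column = df['area']   # single key on the DataFrame -> one column
transposed = df.T          # rows and columns swapped, labels kept

print(arr.shape, transposed.shape)
print(first_row)
print(area_column)
```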

#### 2.3. Convention for array-style indexing with ``loc``, ``iloc``, and ``ix`` indexers
_______________________________________

* ``iloc`` -- indexing the underlying array as if it is a simple NumPy array (using the **implicit** Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:

In [None]:
data.iloc[:1, :2]

* ``loc`` -- indexing the underlying data in an array-like style but using the **explicit** index and column names:

In [None]:
data.loc[:'Illinois', :'pop']
#data.loc[:'Illinois', :'area']

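* ``ix`` -- hybrid indexing of these two approaches. ``ix`` was deprecated in pandas 0.20 and removed in pandas 1.0, so the mixed selection is written with ``loc`` after converting row positions to labels:

In [None]:
data.loc[data.index[:3], :'pop']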
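
When a selection mixes positions and labels, one option is to translate the labels into positions with ``Index.get_loc`` and pass everything to ``iloc``. The sketch below assumes unique column labels (so ``get_loc`` returns a single integer) and uses a stand-in frame `df`.

```python
import pandas as pd

df = pd.DataFrame({'pop': [38332521, 26448193, 19651127],
                   'area': [423967, 695662, 141297]},
                  index=['California', 'Texas', 'New York'])

# Translate the column label into its integer position.
# The +1 keeps the inclusive end that a label slice would have.
stop = df.columns.get_loc('pop') + 1

# First two rows by position, columns up to and including 'pop'.
subset = df.iloc[:2, :stop]
print(subset)
```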

* Any of the familiar NumPy-style data access patterns can be used within these indexers

In [None]:
data.loc[data.density > 100, ['pop', 'density']]

In [None]:
fltr = data.density > 100
data.loc[fltr, ['pop', 'density']]

* Any of these indexing conventions may also be used to set or modify values

In [None]:
data.iloc[0, 2] = 90
data
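
Assignment works through a Boolean mask as well as through positions. The following sketch uses a two-row stand-in frame with rounded density values; the cap of 100 is an arbitrary number chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'pop': [38332521, 19651127],
                   'area': [423967, 141297],
                   'density': [90.4, 139.1]},
                  index=['California', 'New York'])

# Label-based assignment: cap every density above 100 at 100.
df.loc[df['density'] > 100, 'density'] = 100.0

# Position-based assignment: row 0, column 2 (the 'density' column).
df.iloc[0, 2] = 90

print(df)
```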

#### 2.4. Additional indexing conventions
________________________________

* With plain ``[]`` on a ``DataFrame``, **indexing** (a single key) refers to columns
* **slicing** refers to rows:

In [None]:
data['Florida':'Illinois']

* Such slices can also refer to rows by integer position rather than by index label:

In [None]:
data[1:3]

* Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [None]:
data[data.density > 100]
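
The conventions for plain ``[]`` on a ``DataFrame`` can be collected in one sketch. The frame `df` is a three-row stand-in and the variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'pop': [38332521, 26448193, 19651127],
                   'area': [423967, 695662, 141297]},
                  index=['California', 'Texas', 'New York'])
df['density'] = df['pop'] / df['area']

column = df['area']                      # single key    -> column (Series)
rows_by_label = df['Texas':'New York']   # label slice   -> rows, end included
rows_by_position = df[1:3]               # integer slice -> rows, end excluded
rows_by_mask = df[df['density'] > 100]   # Boolean mask  -> rows

print(list(rows_by_label.index))
print(list(rows_by_position.index))
print(list(rows_by_mask.index))
```

Because the same brackets switch meaning with the type of the key, ``loc`` and ``iloc`` are the less surprising choice whenever rows are being selected.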