# Pandas

## Index and select data

There are two ways to index and select data in pandas:

1. Using common square brackets **[ ]**
2. Using **loc** and **iloc** from pandas

### Column access [ ]

In [32]:
import pandas as pd
import numpy as np

df = pd.read_csv("./brics.csv", index_col="country_ab")
df

| country_ab | country | capital | area | population |
| --- | --- | --- | --- | --- |
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
| SA | South Africa | Pretoria | 1.221 | 52.98 |


To get data from a specific column with square brackets we do the following:

In [14]:
df["country"]

country_ab
BR          Brazil
RU          Russia
IN           India
CH           China
SA    South Africa
Name: country, dtype: object

The result is a pandas **Series**.

To get a DataFrame instead, we pass a list of column names inside the square brackets:

In [15]:
df[["country"]]

| country_ab | country |
| --- | --- |
| BR | Brazil |
| RU | Russia |
| IN | India |
| CH | China |
| SA | South Africa |
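The difference between single and double brackets can be checked directly. The following is a minimal sketch; it assumes the BRICS table is rebuilt inline from the output shown above instead of being read from `brics.csv`, and the names `as_series` and `as_frame` are only illustrative.

```python
import pandas as pd

# Assumption: the data is rebuilt inline so the snippet runs without brics.csv
df = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.1, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=pd.Index(["BR", "RU", "IN", "CH", "SA"], name="country_ab"),
)

as_series = df["country"]    # a single label returns a Series
as_frame = df[["country"]]   # a list of labels returns a DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_series.shape)           # (5,)
print(as_frame.shape)            # (5, 1)
```

The Series is one-dimensional, while the DataFrame keeps a column axis even when it holds a single column.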


Notice that the inner brackets hold a Python list. To select several columns at once, we just add more column names to that list.

In [16]:
df[["country", "capital"]]

| country_ab | country | capital |
| --- | --- | --- |
| BR | Brazil | Brasilia |
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |
| SA | South Africa | Pretoria |


### Row access [ ]

To get rows from a DataFrame with square brackets, we use Python's slicing syntax.

We specify an interval of integer positions (the start is inclusive, the end is exclusive) as follows:

In [18]:
df[1:4]

| country_ab | country | capital | area | population |
| --- | --- | --- | --- | --- |
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
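The slice boundaries are easy to get wrong, so here is a small sketch that checks them. It assumes the same table rebuilt inline (not read from `brics.csv`) and, for contrast, adds a label slice with `loc`, which includes both endpoints.

```python
import pandas as pd

# Assumption: the data is rebuilt inline so the snippet runs without brics.csv
df = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.1, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=pd.Index(["BR", "RU", "IN", "CH", "SA"], name="country_ab"),
)

# Positional slice: position 1 is included, position 4 (SA) is not
by_position = df[1:4]
print(list(by_position.index))  # ['RU', 'IN', 'CH']

# Label slice with loc: both "RU" and "CH" are included
by_label = df.loc["RU":"CH"]
print(list(by_label.index))     # ['RU', 'IN', 'CH']
```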


### loc and iloc

In NumPy, we access data in a two-dimensional array with square brackets, using this syntax:

`numpy_array[row, column]`

So then, how can we apply this to pandas dataframes?

The most common way to access data by row and column in pandas is with loc and iloc:

* loc (label-based)
* iloc (integer position based)

In [23]:
df.loc["RU"]

country       Russia
capital       Moscow
area            17.1
population     143.5
Name: RU, dtype: object

In [26]:
df.loc[["RU", "IN", "CH"], ["country", "capital", "area"]]

| country_ab | country | capital | area |
| --- | --- | --- | --- |
| RU | Russia | Moscow | 17.1 |
| IN | India | New Delhi | 3.286 |
| CH | China | Beijing | 9.597 |


In [27]:
df.loc[:, ["country", "capital"]]

| country_ab | country | capital |
| --- | --- | --- |
| BR | Brazil | Brasilia |
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |
| SA | South Africa | Pretoria |
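The cells above only use `loc`. As a hedged sketch of the integer-position counterpart, the same selections can be written with `iloc`, again assuming the table is rebuilt inline rather than read from `brics.csv`.

```python
import pandas as pd

# Assumption: the data is rebuilt inline so the snippet runs without brics.csv
df = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.1, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=pd.Index(["BR", "RU", "IN", "CH", "SA"], name="country_ab"),
)

# One row: position 1 corresponds to the label "RU"
print(df.iloc[1]["capital"])  # Moscow

# Rows and columns by label versus by integer position
by_label = df.loc[["RU", "IN", "CH"], ["country", "capital", "area"]]
by_position = df.iloc[[1, 2, 3], [0, 1, 2]]
print(by_label.equals(by_position))  # True

# All rows, first two columns
first_two = df.iloc[:, [0, 1]]
print(list(first_two.columns))  # ['country', 'capital']
```

`loc` is usually the more readable choice when the index has meaningful labels, while `iloc` helps when only the position is known.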


## Filtering data

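We will select countries with an area over 8 million km^2. First we compare the column with the threshold, which gives a Boolean Series; then we pass that Series inside the square brackets to keep only the rows marked True.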

In [30]:
df.loc[:, "area"] > 8

country_ab
BR     True
RU     True
IN    False
CH     True
SA    False
Name: area, dtype: bool

In [31]:
df[df.loc[:, "area"] > 8]

| country_ab | country | capital | area | population |
| --- | --- | --- | --- | --- |
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| CH | China | Beijing | 9.597 | 1357.0 |
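The two steps can also be written with a named mask, which then works with `loc` to filter rows and pick columns in one call. This is a sketch under the assumption that the table is rebuilt inline; `is_large` and `large` are illustrative names.

```python
import pandas as pd

# Assumption: the data is rebuilt inline so the snippet runs without brics.csv
df = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.1, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=pd.Index(["BR", "RU", "IN", "CH", "SA"], name="country_ab"),
)

# Step 1: Boolean Series, one True/False value per row
is_large = df["area"] > 8

# Step 2: keep the rows where the mask is True, and only two columns
large = df.loc[is_large, ["country", "area"]]
print(list(large.index))    # ['BR', 'RU', 'CH']
print(list(large.columns))  # ['country', 'area']
```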


### Multiple conditions

To combine multiple conditions when filtering a DataFrame, we can use NumPy's logical functions, as follows.

`np.logical_and` takes two Boolean arrays and returns their element-wise AND, so a row is kept only when both conditions are True.

In [39]:
df[np.logical_and(df.loc[:, "area"] > 8, df.loc[:, "area"] < 10)]

| country_ab | country | capital | area | population |
| --- | --- | --- | --- | --- |
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| CH | China | Beijing | 9.597 | 1357.0 |
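As an alternative to the NumPy functions, pandas Series also support the element-wise operators `&` (and) and `|` (or). The sketch below assumes the table is rebuilt inline; note that each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<`.

```python
import numpy as np
import pandas as pd

# Assumption: the data is rebuilt inline so the snippet runs without brics.csv
df = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.1, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=pd.Index(["BR", "RU", "IN", "CH", "SA"], name="country_ab"),
)

# Same filter written with np.logical_and and with the & operator
with_numpy = df[np.logical_and(df["area"] > 8, df["area"] < 10)]
with_operator = df[(df["area"] > 8) & (df["area"] < 10)]
print(list(with_operator.index))         # ['BR', 'CH']
print(with_numpy.equals(with_operator))  # True

# Element-wise OR: very small or very large countries
extremes = df[(df["area"] < 2) | (df["area"] > 10)]
print(list(extremes.index))              # ['RU', 'SA']
```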


## Looping over a dataframe

To iterate over the rows of a pandas DataFrame, we use the `iterrows()` method, which yields each row label together with the row as a Series:

In [46]:
for label, row in df.iterrows():
    print("---")
    print(label)
    print("---")
    print(row)

---
BR
---
country         Brazil
capital       Brasilia
area             8.516
population       200.4
Name: BR, dtype: object
---
RU
---
country       Russia
capital       Moscow
area            17.1
population     143.5
Name: RU, dtype: object
---
IN
---
country           India
capital       New Delhi
area              3.286
population       1252.0
Name: IN, dtype: object
---
CH
---
country         China
capital       Beijing
area            9.597
population     1357.0
Name: CH, dtype: object
---
SA
---
country       South Africa
capital           Pretoria
area                 1.221
population           52.98
Name: SA, dtype: object


In [47]:
for label, row in df.iterrows():
    print(f'{label}: {row["capital"]}')

BR: Brasilia
RU: Moscow
IN: New Delhi
CH: Beijing
SA: Pretoria


### Adding a column

We can add a new column by looping over the rows and assigning one value per label with `loc`:

In [57]:
for lab, row in df.iterrows():
    # Store the length of the country name in the new column, one row at a time
    df.loc[lab, "name_length"] = len(row["country"])

df

| country_ab | country | capital | area | population | name_length |
| --- | --- | --- | --- | --- | --- |
| BR | Brazil | Brasilia | 8.516 | 200.4 | 6.0 |
| RU | Russia | Moscow | 17.1 | 143.5 | 6.0 |
| IN | India | New Delhi | 3.286 | 1252.0 | 5.0 |
| CH | China | Beijing | 9.597 | 1357.0 | 5.0 |
| SA | South Africa | Pretoria | 1.221 | 52.98 | 12.0 |


In [58]:
df["name_length"]

country_ab
BR     6.0
RU     6.0
IN     5.0
CH     5.0
SA    12.0
Name: name_length, dtype: float64
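The lengths above are stored as floats, most likely because the first assignment creates the column with missing values (NaN) in the remaining rows, and NaN requires a float column. A common alternative, sketched here with the table rebuilt inline as an assumption, is to compute the whole column at once with the `.str.len()` accessor or with `apply(len)`.

```python
import pandas as pd

# Assumption: the data is rebuilt inline so the snippet runs without brics.csv
df = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.1, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=pd.Index(["BR", "RU", "IN", "CH", "SA"], name="country_ab"),
)

# Compute every value in one step instead of looping with iterrows()
df["name_length"] = df["country"].str.len()
print(df["name_length"].tolist())  # [6, 6, 5, 5, 12]

# Equivalent result by applying Python's len to each element
with_apply = df["country"].apply(len)
print(with_apply.tolist())         # [6, 6, 5, 5, 12]
```

Besides keeping the values as integers, this avoids a Python-level loop, which tends to matter on larger tables.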