# 03. Indexing and Slicing DataFrames

There are several ways to select rows and columns from a DataFrame or Series. This section covers how to:

- Select the rows from a dataframe
- Select columns from a dataframe
- Select subsets of dataframes

#### Import the Pandas and NumPy Libraries

In [1]:
import numpy as np
import pandas as pd

#### Load the data

In [2]:
df = pd.read_csv('./data/Company.csv')
df

,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
0,1318,1/3/1954,61,Vancouver,Executive,CEO,M
1,1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
2,1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
3,1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
4,1322,1/9/1958,57,Vancouver,Executive,VP Finance,M
...,...,...,...,...,...,...,...
6279,8036,8/9/1992,23,New Westminister,Customer Service,Cashier,F
6280,8181,9/26/1993,22,Prince George,Customer Service,Cashier,M
6281,8223,2/11/1994,21,Trail,Customer Service,Cashier,M
6282,8226,2/16/1994,21,Victoria,Customer Service,Cashier,F


In [3]:
df.head()

,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
0,1318,1/3/1954,61,Vancouver,Executive,CEO,M
1,1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
2,1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
3,1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
4,1322,1/9/1958,57,Vancouver,Executive,VP Finance,M


______________

## 1. The Column index and row index

- One or more columns can be selected from a DataFrame with:
    - `df['column_name']` or `df.column_name`: returns a Series
    - `df[['col_x', 'col_y']]`: returns a DataFrame
- One or more rows can be selected with a slice:
    - `df[2:5]`: returns a DataFrame holding the rows at positions 2, 3 and 4. When given a slice, the DataFrame indexing operator selects rows; an integer slice works by position and excludes the end, while a slice of non-integer index labels (for example strings) selects by label.
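
The return types listed above can be checked on a small frame. The sketch below uses a hypothetical six-row miniature of the Company data (the name `demo` and its values are assumptions made for illustration, not the real file):

```python
import pandas as pd

# Hypothetical miniature of the Company data, used only for illustration.
demo = pd.DataFrame({
    "EmployeeID": [1318, 1319, 1320, 1321, 1322, 1323],
    "age": [61, 58, 60, 56, 57, 53],
    "gender": ["M", "F", "F", "M", "M", "M"],
})

one_col = demo["gender"]                    # single label -> Series
two_cols = demo[["EmployeeID", "gender"]]   # list of labels -> DataFrame
rows = demo[2:5]                            # integer slice -> rows at positions 2, 3, 4

print(type(one_col).__name__)    # Series
print(type(two_cols).__name__)   # DataFrame
print(rows.index.tolist())       # [2, 3, 4]

# Attribute access returns the same column as bracket access.
print(demo.gender.equals(one_col))  # True
```

Attribute access (`df.column_name`) only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.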
    

### 1. Column index 

In [4]:
df['gender'].head(10)

0    M
1    F
2    F
3    M
4    M
5    M
6    F
7    F
8    F
9    F
Name: gender, dtype: object

In [5]:
df['gender'].unique()

array(['M', 'F'], dtype=object)

In [6]:
df[['EmployeeID', 'gender']]

,EmployeeID,gender
0,1318,M
1,1319,F
2,1320,F
3,1321,M
4,1322,M
...,...,...
6279,8036,F
6280,8181,M
6281,8223,M
6282,8226,F


### 2. Row index

In [7]:
df.head()

,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
0,1318,1/3/1954,61,Vancouver,Executive,CEO,M
1,1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
2,1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
3,1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
4,1322,1/9/1958,57,Vancouver,Executive,VP Finance,M


In [8]:
df[2:5]

,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
2,1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
3,1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
4,1322,1/9/1958,57,Vancouver,Executive,VP Finance,M


_____________

## 2. `loc` and `iloc` indexers

### 1. `loc` indexer
- Accesses a group of rows and columns by label(s) or by a boolean array.
- Commonly used to slice data out of a DataFrame.
- `loc[]` is primarily label-based: you pass the row and column labels (names), not their positions.
- Syntax: `dataframe.loc[[list_of_row_labels], [list_of_column_labels]]`
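
As a minimal sketch of the points above, the following uses a hypothetical four-row frame indexed by `EmployeeID` (the name `demo` and its values are assumptions for illustration). It also shows one detail worth knowing: a label slice with `loc` includes both endpoints.

```python
import pandas as pd

# Hypothetical frame indexed by EmployeeID, for illustration only.
demo = pd.DataFrame(
    {
        "age": [61, 58, 60, 56],
        "department": ["Executive", "Executive", "Executive", "Executive"],
    },
    index=pd.Index([1318, 1319, 1320, 1321], name="EmployeeID"),
)

single = demo.loc[1319]                     # one row label -> Series
subset = demo.loc[[1318, 1320], ["age"]]    # lists of labels -> DataFrame
label_slice = demo.loc[1318:1320]           # label slice: both ends included
older = demo.loc[demo["age"] > 57, "age"]   # boolean mask for rows, one column label

print(type(single).__name__)        # Series
print(subset.shape)                 # (2, 1)
print(label_slice.index.tolist())   # [1318, 1319, 1320]
print(older.tolist())               # [61, 58, 60]
```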

In [9]:
df_2 = df.set_index('EmployeeID')
df_2

EmployeeID,birthdate_key,age,city_name,department,job_title,gender
1318,1/3/1954,61,Vancouver,Executive,CEO,M
1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
1322,1/9/1958,57,Vancouver,Executive,VP Finance,M
...,...,...,...,...,...,...
8036,8/9/1992,23,New Westminister,Customer Service,Cashier,F
8181,9/26/1993,22,Prince George,Customer Service,Cashier,M
8223,2/11/1994,21,Trail,Customer Service,Cashier,M
8226,2/16/1994,21,Victoria,Customer Service,Cashier,F


In [10]:
s = df_2.loc[1332]
print(s)
type(s)

birthdate_key                           2/5/1955
age                                           60
city_name                              Vancouver
department                             Executive
job_title        Exec Assistant, Human Resources
gender                                         F
Name: 1332, dtype: object


pandas.core.series.Series

In [11]:
df_3 = df_2.loc[[1332, 8223]]
df_3
type(df_3)

pandas.core.frame.DataFrame

In [12]:
df_2.loc[[1332, 8223], ['age', 'department']]

EmployeeID,age,department
1332,60,Executive
8223,21,Customer Service


In [13]:
df_2.loc[(df_2.age > 55)]

EmployeeID,birthdate_key,age,city_name,department,job_title,gender
1318,1/3/1954,61,Vancouver,Executive,CEO,M
1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
1322,1/9/1958,57,Vancouver,Executive,VP Finance,M
...,...,...,...,...,...,...
4434,12/15/1946,69,Vancouver,Meats,Meat Cutter,M
4442,12/22/1946,69,Vancouver,Meats,Meat Cutter,M
4445,12/26/1946,69,New Westminster,Produce,Produce Clerk,M
4448,12/27/1946,69,Kamloops,Bakery,Baker,M


### 2. `iloc` indexer

- `iloc[]` is an integer-position-based selection method: you pass integer positions to select specific rows and columns.
- Unlike `loc[]`, `iloc[]` does not accept a boolean Series; pass a plain boolean array instead (for example via `.values`).
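
The difference in boolean handling can be sketched as follows, again on a hypothetical frame (`demo` and its values are assumptions for illustration). The `try` block is there only to show that a boolean Series is rejected; the exact exception type depends on the index, so both likely types are caught.

```python
import pandas as pd

# Hypothetical frame indexed by EmployeeID, for illustration only.
demo = pd.DataFrame(
    {
        "age": [61, 58, 60, 56],
        "department": ["Executive", "Executive", "Executive", "Executive"],
    },
    index=pd.Index([1318, 1319, 1320, 1321], name="EmployeeID"),
)

first_two = demo.iloc[0:2]            # positions 0 and 1; the end is excluded
cell_block = demo.iloc[[1, 3], [0]]   # rows at positions 1 and 3, first column

mask = demo["age"] > 57               # boolean Series indexed by EmployeeID

try:
    demo.iloc[mask]                   # a boolean Series is not accepted by iloc
except (NotImplementedError, ValueError) as exc:
    print("iloc rejected the boolean Series:", type(exc).__name__)

by_array = demo.iloc[mask.to_numpy()]  # a plain boolean array is accepted
print(by_array.index.tolist())         # [1318, 1319, 1320]
```

`mask.to_numpy()` does the same job as the `.values` attribute here; both hand `iloc` an array without index labels.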

In [14]:
df_2.head()

EmployeeID,birthdate_key,age,city_name,department,job_title,gender
1318,1/3/1954,61,Vancouver,Executive,CEO,M
1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
1322,1/9/1958,57,Vancouver,Executive,VP Finance,M


In [15]:
df_2.iloc[[3,4,5], 0:4]

EmployeeID,birthdate_key,age,city_name,department
1321,1/2/1959,56,Vancouver,Executive
1322,1/9/1958,57,Vancouver,Executive
1323,1/9/1962,53,Vancouver,Executive


In [16]:
df_2.iloc[[3,4,5], [0,1,2,3]]
# Equivalent label-based selection:
# df_2.loc[[1321, 1322, 1323], ['birthdate_key', 'age', 'city_name', 'department']]

EmployeeID,birthdate_key,age,city_name,department
1321,1/2/1959,56,Vancouver,Executive
1322,1/9/1958,57,Vancouver,Executive
1323,1/9/1962,53,Vancouver,Executive


In [17]:
df_2.iloc[(df_2.age > 55).values].head()

EmployeeID,birthdate_key,age,city_name,department,job_title,gender
1318,1/3/1954,61,Vancouver,Executive,CEO,M
1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
1322,1/9/1958,57,Vancouver,Executive,VP Finance,M


## 3. Query - [pd.DataFrame.query](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)

- Queries the columns of a DataFrame with a boolean expression.
- Signature: `DataFrame.query(expr, inplace=False, **kwargs)`
- The query expression is passed to the `query` method as a string and must evaluate to **Boolean** values; the rows where it is true are returned.
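
A short sketch of a combined condition, assuming a hypothetical four-row frame (`demo`, its values, and the `min_age` threshold are illustrative assumptions). Inside the expression string, `and`/`or` combine conditions and `@name` refers to a Python variable from the surrounding scope.

```python
import pandas as pd

# Hypothetical frame indexed by EmployeeID, for illustration only.
demo = pd.DataFrame(
    {
        "age": [61, 58, 25, 31],
        "gender": ["M", "F", "M", "M"],
        "department": ["Executive", "Executive", "Customer Service", "Customer Service"],
    },
    index=pd.Index([1318, 1319, 7741, 6918], name="EmployeeID"),
)

min_age = 30  # hypothetical threshold, referenced as @min_age in the query

males_over = demo.query("gender == 'M' and age > @min_age")

# The same selection written with a boolean mask and loc.
same = demo.loc[(demo["gender"] == "M") & (demo["age"] > min_age)]

print(males_over.index.tolist())   # [1318, 6918]
print(males_over.equals(same))     # True
```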

In [18]:
df_2.head()

EmployeeID,birthdate_key,age,city_name,department,job_title,gender
1318,1/3/1954,61,Vancouver,Executive,CEO,M
1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
1322,1/9/1958,57,Vancouver,Executive,VP Finance,M


In [19]:
df_2.query('(gender == "M")')

EmployeeID,birthdate_key,age,city_name,department,job_title,gender
1318,1/3/1954,61,Vancouver,Executive,CEO,M
1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
1322,1/9/1958,57,Vancouver,Executive,VP Finance,M
1323,1/9/1962,53,Vancouver,Executive,"Exec Assistant, VP Stores",M
1334,2/6/1961,54,Vancouver,Executive,"Exec Assistant, Finance",M
...,...,...,...,...,...,...
6918,2/6/1984,31,Vancouver,Customer Service,Cashier,M
7416,1/14/1988,27,Chilliwack,Customer Service,Cashier,M
7741,4/7/1990,25,Nanaimo,Customer Service,Cashier,M
8181,9/26/1993,22,Prince George,Customer Service,Cashier,M


In [20]:
sorted_df = df_2.query('(age >= age.mean())').sort_values(by='age', ascending=False)
sorted_df

EmployeeID,birthdate_key,age,city_name,department,job_title,gender
2489,7/25/1941,74,Surrey,Meats,Meat Cutter,F
2388,4/20/1941,74,New Westminster,Meats,Meat Cutter,F
2365,3/30/1941,74,Trail,Meats,Meat Cutter,F
2360,3/26/1941,74,Nanaimo,Meats,Meat Cutter,F
2517,9/12/1941,74,Squamish,Meats,Meat Cutter,F
...,...,...,...,...,...,...
5166,12/10/1969,46,Prince George,Bakery,Baker,M
5167,12/13/1969,46,West Vancouver,Bakery,Baker,M
5168,12/16/1969,46,Fort St John,Produce,Produce Clerk,F
5169,12/19/1969,46,Richmond,Produce,Produce Clerk,F


____
## Assignment:
- Load the `Company.csv` dataset.
- Sort the dataset by age.
- Print only the first 10 records.
- Print the last 10 records.
- Select the first 10 records where gender is `M`, department is `Bakery` and age is less than 50.

## Recommended Readings
- [GeeksforGeeks - Difference between iloc() and loc()](https://www.geeksforgeeks.org/difference-between-loc-and-iloc-in-pandas-dataframe/)
- [Use loc and iloc for selecting data in Pandas](https://towardsdatascience.com/how-to-use-loc-and-iloc-for-selecting-data-in-pandas-bd09cb4c3d79)