# 3.SELECTING/SLICING DATA IN DATAFRAME

## Section

1)[Selecting By Column Name](#SELECTING-BY-COLUMN-NAME/WITH-RANGE)<br>
2)[Selecting By Column Number](#SELECTING-BY-COLUMN-NUMBER/WITH-RANGE)<br>
3)[Selecting By Row Name](#SELECTING-BY-ROW-NAME/WITH-RANGE)<br>
4)[Selecting By Row Number](#SELECTING-BY-ROW-NUMBER/WITH-RANGE)<br>
5)[Selecting Data](#SELECTING-DATA)<br>
6)[Selecting Columns And Rows](#SELECTING-COLUMNS-AND-ROWS)<br>
7)[Swapping Columns](#SWAPPING-COLUMNS)<br>

## DIFFERENCE BETWEEN SERIES AND DATAFRAME
<sup>1</sup>A Series holds a single one-dimensional list of values with an index, whereas a DataFrame is made up of one or more Series that share the same index. In other words, a DataFrame is a collection of Series that can be used to analyse data.
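
The relationship described above can be sketched as follows. This is a minimal illustration only; the names `height`, `width` and `df_example` and their values are made up for the example and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two Series that share the same index labels (values are illustrative).
height = pd.Series([1.2, 3.4, 5.6], index=["x", "y", "z"], name="height")
width = pd.Series([10, 20, 30], index=["x", "y", "z"], name="width")

# Placing the Series side by side gives a DataFrame: one column per Series.
df_example = pd.concat([height, width], axis=1)

# Selecting a single column hands back a Series with the same index.
column = df_example["height"]
print(type(column).__name__)
print(df_example.shape)
```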

## DIFFERENCE BETWEEN loc and iloc
<sup>2</sup>.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.

<sup>2</sup>.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing.
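
A small sketch of the error behaviour described above, assuming a hypothetical three-row frame `df_demo` (not part of this notebook). It only records which exception each accessor raises and shows that an out-of-bounds slice is clipped rather than rejected.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration.
df_demo = pd.DataFrame(
    np.arange(6).reshape(3, 2), columns=["A", "B"], index=["r0", "r1", "r2"]
)

# .loc with a label that does not exist raises KeyError.
try:
    df_demo.loc["missing"]
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# .iloc with a single position past the end raises IndexError.
try:
    df_demo.iloc[10]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

# A slice that runs past the end is allowed and simply stops at the last row.
clipped = df_demo.iloc[1:10]

# Both accessors also accept a boolean array of the same length as the axis.
mask = np.array([True, False, True])
masks_agree = df_demo.loc[mask].equals(df_demo.iloc[mask])

print(loc_error, iloc_error, len(clipped), masks_agree)
```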

In [3]:
import pandas as pd
import numpy as np

date = pd.date_range('04-01-2020',periods=5)
df_slicing = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'), index=date)
df_slicing

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2020-04-01 | -0.389731 | 1.69228 | 1.849792 | 0.282607 | -0.565654 |
| 2020-04-02 | 0.571418 | 0.872103 | -1.82633 | -0.755075 | -1.262654 |
| 2020-04-03 | 0.144592 | 0.60561 | -1.848609 | -0.413979 | 0.649762 |
| 2020-04-04 | -1.517552 | -0.905539 | -0.350625 | -0.556692 | -0.47066 |
| 2020-04-05 | -0.339092 | 0.165935 | 0.47053 | -1.279291 | 1.7166 |


## SELECTING BY COLUMN NAME/WITH RANGE

[Top](#Section)

In [4]:
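df_slicing.A # attribute access, single column
df_slicing['A'] # the same column via []
df_slicing[['A','C']] # a list of names selects several columns
df_slicing.loc[:,['C','D']] # all rows, columns C and D
# A label range of columns also works with .loc, e.g. df_slicing.loc[:,'B':'D'] (both endpoints are included). Use iloc to select a column range by position.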

|  | C | D |
|---|---|---|
| 2020-04-01 | 1.849792 | 0.282607 |
| 2020-04-02 | -1.82633 | -0.755075 |
| 2020-04-03 | -1.848609 | -0.413979 |
| 2020-04-04 | -0.350625 | -0.556692 |
| 2020-04-05 | 0.47053 | -1.279291 |
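
As a hedged aside on selecting columns by name, `.loc` accepts a label slice on the column axis as well as a list of names. The sketch below uses a hypothetical frame `df_cols` filled with consecutive integers rather than the random data above, so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape and labels as df_slicing.
dates = pd.date_range("2020-04-01", periods=5)
df_cols = pd.DataFrame(
    np.arange(25).reshape(5, 5), columns=list("ABCDE"), index=dates
)

# Label slice: both endpoints ('B' and 'D') are part of the result.
by_label_range = df_cols.loc[:, "B":"D"]

# The equivalent selection written out as an explicit list of names.
by_label_list = df_cols.loc[:, ["B", "C", "D"]]

print(list(by_label_range.columns))
```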


## SELECTING BY COLUMN NUMBER/WITH RANGE

[Top](#Section)

In [5]:
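#df_slicing[df_slicing.columns[0:2,4:5]] #<- won't work: an Index cannot be indexed with two slices at once
df_slicing[df_slicing.columns[np.concatenate([range(0,2),range(3,4)])]] # columns at positions 0, 1 and 3
df_slicing[df_slicing.columns[0:2]] # columns at positions 0 and 1
df_slicing.iloc[:,[0,2,4]] # columns at positions 0, 2 and 4
#df_slicing.iloc[:,[0:2]] <- won't work: a slice inside a list is a syntax error
df_slicing.iloc[:,np.r_[0:2,4:5]] # np.r_ joins the ranges into positions 0, 1 and 4
df_slicing.iloc[:,0:3] # columns at positions 0 to 2 (the end position is excluded)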

|  | A | B | C |
|---|---|---|---|
| 2020-04-01 | -0.389731 | 1.69228 | 1.849792 |
| 2020-04-02 | 0.571418 | 0.872103 | -1.82633 |
| 2020-04-03 | 0.144592 | 0.60561 | -1.848609 |
| 2020-04-04 | -1.517552 | -0.905539 | -0.350625 |
| 2020-04-05 | -0.339092 | 0.165935 | 0.47053 |
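
The `np.r_` call used in the cell above can be looked at in isolation. The sketch below assumes a hypothetical frame `df_pos` with integer contents; it shows that `np.r_` expands several slices into one flat array of positions, which `iloc` then accepts like any list of integers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values only serve to make the columns distinguishable.
df_pos = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list("ABCDE"))

# np.r_ concatenates the slices 0:2 and 4:5 into the positions 0, 1 and 4.
positions = np.r_[0:2, 4:5]

# The position array can be used with iloc directly ...
selected = df_pos.iloc[:, positions]

# ... or turned into column names first and used with [].
same = df_pos[df_pos.columns[positions]]

print(positions.tolist())
print(list(selected.columns))
```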


## SELECTING BY ROW NAME/WITH RANGE

[Top](#Section)

In [6]:
df_slicing[0:2] #<- slices rows by position, not by name
df_slicing.loc['2020-04-03']
df_slicing.loc[date[2]] #<- selects the entire 3rd row
#df_slicing['2020-04-03'] <- won't work: [] with a single label looks for a column
df_slicing.loc[[False,True,True,True,False]] #<- Boolean mask, keeps the 2nd to 4th rows
df_slicing.loc[['2020-04-01','2020-04-03']]
df_slicing.loc['2020-04-01':'2020-04-03'] #<- label range, both endpoints are included

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2020-04-01 | -0.389731 | 1.69228 | 1.849792 | 0.282607 | -0.565654 |
| 2020-04-02 | 0.571418 | 0.872103 | -1.82633 | -0.755075 | -1.262654 |
| 2020-04-03 | 0.144592 | 0.60561 | -1.848609 | -0.413979 | 0.649762 |
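
One detail worth spelling out is that a label slice with `.loc` includes its end label, while a positional slice with `.iloc` excludes its end position. A minimal sketch, assuming a hypothetical frame `df_rows` with the same date index as above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame indexed by five consecutive days.
dates = pd.date_range("2020-04-01", periods=5)
df_rows = pd.DataFrame(
    np.arange(25).reshape(5, 5), columns=list("ABCDE"), index=dates
)

# Label slice: 2020-04-03 is included, so three rows come back.
by_label = df_rows.loc["2020-04-01":"2020-04-03"]

# Positional slice: position 3 is excluded, so 0:3 is needed for three rows.
by_position = df_rows.iloc[0:3]

# Stopping at position 2 gives only two rows.
by_position_short = df_rows.iloc[0:2]

print(len(by_label), len(by_position), len(by_position_short))
```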


## SELECTING BY ROW NUMBER/WITH RANGE

[Top](#Section)

In [7]:
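df_slicing[0:2] # first two rows
df_slicing[::2] # every second row
df_slicing.iloc[2:5,:] # rows at positions 2 to 4, all columns
df_slicing.iloc[2] # single row, returned as a Series
#df_slicing[1:][1:3] <- sub-slice: a slice of a slice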

A    0.144592
B    0.605610
C   -1.848609
D   -0.413979
E    0.649762
Name: 2020-04-03 00:00:00, dtype: float64

In [8]:
df_slicing['2020-04-01':'2020-04-04'] # [] with a label slice selects rows by name; both endpoints are included

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2020-04-01 | -0.389731 | 1.69228 | 1.849792 | 0.282607 | -0.565654 |
| 2020-04-02 | 0.571418 | 0.872103 | -1.82633 | -0.755075 | -1.262654 |
| 2020-04-03 | 0.144592 | 0.60561 | -1.848609 | -0.413979 | 0.649762 |
| 2020-04-04 | -1.517552 | -0.905539 | -0.350625 | -0.556692 | -0.47066 |


## SELECTING DATA

[Top](#Section)

In [9]:
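df_slicing['A'][date[2]] # column A first, then the row labelled date[2]
df_slicing.loc['2020-04-01','A'] # row label, column label
df_slicing.iloc[1,2] # row position 1, column position 2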

-1.826329502411529
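
The three ways of reaching a single value shown above can be compared on one cell. The sketch below assumes a hypothetical frame `df_cell` filled with consecutive floats, so the value at row position 1 and column `C` is known in advance.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the value in row i, column j is i * 5 + j.
dates = pd.date_range("2020-04-01", periods=5)
df_cell = pd.DataFrame(
    np.arange(25.0).reshape(5, 5), columns=list("ABCDE"), index=dates
)

# By labels: row label first, then column label.
by_label = df_cell.loc[dates[1], "C"]

# By positions: row position 1, column position 2.
by_position = df_cell.iloc[1, 2]

# Chained: pick the column as a Series, then the row label inside it.
chained = df_cell["C"][dates[1]]

print(by_label, by_position, chained)
```

All three reads return the same cell. For assignments, a single `.loc` or `.iloc` call is generally the safer choice, since the chained form goes through an intermediate Series.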

## SELECTING COLUMNS AND ROWS

[Top](#Section)

In [10]:
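df_slicing.loc['2020-04-01':'2020-04-04',['C','D']] # rows from 2020-04-01 to 2020-04-04 (inclusive), columns C and D
df_slicing.iloc[0:2,2:5] # first two rows, columns at positions 2 to 4
df_slicing.iloc[[0,3],[0,3,4]] # rows at positions 0 and 3, columns at positions 0, 3 and 4
#df_slicing.loc['2020-04-01',['C','D']] <- single row, columns C and D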

|  | A | D | E |
|---|---|---|---|
| 2020-04-01 | -0.389731 | 0.282607 | -0.565654 |
| 2020-04-04 | -1.517552 | -0.556692 | -0.47066 |


## SWAPPING COLUMNS

[Top](#Section)

In [11]:
df_slicing.loc[:, ['B', 'A']] = df_slicing[['A', 'B']].to_numpy() # swaps A and B: raw values are assigned by position
df_slicing.loc[:, ['B', 'A']] = df_slicing[['A', 'B']] #<- this does not swap: pandas aligns columns by label before assigning, so nothing changes
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
df_slicing

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2020-04-01 | 1.69228 | -0.389731 | 1.849792 | 0.282607 | -0.565654 |
| 2020-04-02 | 0.872103 | 0.571418 | -1.82633 | -0.755075 | -1.262654 |
| 2020-04-03 | 0.60561 | 0.144592 | -1.848609 | -0.413979 | 0.649762 |
| 2020-04-04 | -0.905539 | -1.517552 | -0.350625 | -0.556692 | -0.47066 |
| 2020-04-05 | 0.165935 | -0.339092 | 0.47053 | -1.279291 | 1.7166 |
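
The difference between the two assignments in the cell above can be reproduced on a tiny frame. This is a sketch under the assumption of a hypothetical two-column frame `df_swap`. It follows the behaviour described in the pandas indexing guide: `.loc` aligns a DataFrame on its column labels before assigning, whereas a raw NumPy array carries no labels and is assigned by position.

```python
import pandas as pd

# Hypothetical frame with easily recognisable values.
df_swap = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# Assigning a DataFrame: columns are matched by label, so A goes back
# into A and B into B, and the frame is left unchanged.
aligned = df_swap.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# Assigning raw values: no labels to align on, so the first value column
# lands in B and the second in A, which swaps the two columns.
swapped = df_swap.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].to_numpy()

print(aligned)
print(swapped)
```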


### Reference
<sup>1</sup>*DF vs Series:* https://www.geeksforgeeks.org/creating-a-dataframe-from-pandas-series/
<br>
<sup>2</sup>*loc & iloc:* https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html