<a href="https://colab.research.google.com/github/manjulamishra/DS-Code-Pandas_Useful_Functions/blob/master/data_manipulation_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Indexing/Selection/Slicing of Data

https://i.stack.imgur.com/FzimB.png
https://stackoverflow.com/questions/22149584/what-does-axis-in-pandas-mean
In Pandas:

axis=0 means along the index (the rows). It's a row-wise operation.
Suppose we perform a concat() operation on dataframe1 & dataframe2 with axis=0. We take the 1st row of dataframe1 and place it into the new DF, then we take the next row of dataframe1 and put it into the new DF, and we repeat this process until we reach the bottom of dataframe1. Then, we do the same for dataframe2.

Basically, this stacks dataframe2 underneath dataframe1 (or the other way round, depending on the order in which they are passed).

E.g. making a pile of books on a table or floor.

axis=1 means along the columns. It's a column-wise operation.
Suppose we perform a concat() operation on dataframe1 & dataframe2 with axis=1. We take the 1st complete column (a.k.a. the 1st Series) of dataframe1 and place it into the new DF, then we take the second column of dataframe1 and place it next to the first (sideways), and we repeat this until all columns of dataframe1 are used. Then, we repeat the same process for dataframe2. Basically, this places dataframe2 side by side with dataframe1.
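
To make the two directions concrete, here is a minimal sketch using two small frames. The names dataframe1 and dataframe2 follow the description above; the values are made up for illustration.

```python
import pandas as pd

# Two tiny frames with the same column names (illustrative values)
dataframe1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
dataframe2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# axis=0: rows of dataframe2 are appended below the rows of dataframe1
stacked = pd.concat([dataframe1, dataframe2], axis=0)

# axis=1: columns of dataframe2 are placed to the right of dataframe1
side_by_side = pd.concat([dataframe1, dataframe2], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```

Note that with axis=0 the original row labels are kept (0, 1, 0, 1 here); passing ignore_index=True to concat() renumbers them.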



In [None]:
import pandas as pd
import numpy as np

### Let's create a DataFrame

In [None]:
dates = pd.date_range('1/1/2000', periods=8)

In [None]:
# dates are treated as index
#  that means you can access a particular entry using that index
df = pd.DataFrame(np.random.randn(8,4), 
                 index=dates, columns=['A', 'B', 'C', 'D'])

In [None]:
# selecting a single column returns a Series (a 1-D labelled array)
# we can get the desired entries by indexing into it
# the dates are the index, so we look a value up by its date label
# it's the same concept as indexing a list / list of lists:
# [[1,2,3]] has 1 list, and that list has 3 elements inside
s = df['A']
print(s.shape)
s[dates[5]]

(8,)


0.33497695138320416

### Interchange the values of two columns (swap)

This is useful for applying a transformation in place to a subset of the columns.

In [None]:
df[['A', 'B']] = df[['B', 'A']]
df

Unnamed: 0,A,B,C,D
2000-01-01,-1.03663,1.904858,1.364472,-0.736082
2000-01-02,-0.73614,1.532126,-0.209754,-1.858622
2000-01-03,1.113099,-0.005548,-1.550566,0.522875
2000-01-04,-2.152696,0.919469,-0.797773,0.239495
2000-01-05,2.062501,-2.42864,1.204805,-0.885867
2000-01-06,1.322407,0.334977,-0.22023,-1.599317
2000-01-07,1.020576,-0.350545,-0.145013,0.98113
2000-01-08,0.344439,1.27987,0.365789,-1.607499


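### Interchange multiple columns with .loc (swap)
When assigning through .loc, pandas aligns on the column labels before assigning, so `df.loc[:, ['B', 'A']] = df[['A', 'B']]` leaves df unchanged. The correct way to swap column values with .loc is to use the raw values: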

In [None]:
#  getting the subset of data/only few columns
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-1.03663,1.904858
2000-01-02,-0.73614,1.532126
2000-01-03,1.113099,-0.005548
2000-01-04,-2.152696,0.919469
2000-01-05,2.062501,-2.42864
2000-01-06,1.322407,0.334977
2000-01-07,1.020576,-0.350545
2000-01-08,0.344439,1.27987


In [None]:
# the column values move around:
# the values of A become the values of B and vice versa
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
df[["A", "B"]]

Unnamed: 0,A,B
2000-01-01,1.904858,-1.03663
2000-01-02,1.532126,-0.73614
2000-01-03,-0.005548,1.113099
2000-01-04,0.919469,-2.152696
2000-01-05,-2.42864,2.062501
2000-01-06,0.334977,1.322407
2000-01-07,-0.350545,1.020576
2000-01-08,1.27987,0.344439
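
The difference between label-aligned assignment and raw-value assignment can be checked on a small frame. This is a sketch with made-up values; the name demo is only for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, 2.0], "B": [10.0, 20.0]})
original = demo.copy()

# Label-aligned assignment: the right-hand columns are matched by name,
# so A is written back into A and B into B.
demo.loc[:, ["B", "A"]] = demo[["A", "B"]]
unchanged = demo.equals(original)

# Raw values carry no labels, so they are assigned by position:
# the old A values land in B and the old B values land in A.
demo.loc[:, ["B", "A"]] = demo[["A", "B"]].to_numpy()
swapped = (demo["A"].tolist() == [10.0, 20.0]
           and demo["B"].tolist() == [1.0, 2.0])

print(unchanged, swapped)
```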


## Attribute Access

You can access an index label of a Series, or a column of a DataFrame, directly as an attribute.

In [None]:
# let's create a series
sa = pd.Series([1,2,3], index = list('abc'))

dfa = df.copy()

In [None]:
sa

a    1
b    2
c    3
dtype: int64

In [None]:
#  accessing the elements 
# using the index you can access the item in the series 
sa.b

2

In [None]:
#  let's select a column from our earlier df

dfa.A

2000-01-01    1.904858
2000-01-02    1.532126
2000-01-03   -0.005548
2000-01-04    0.919469
2000-01-05   -2.428640
2000-01-06    0.334977
2000-01-07   -0.350545
2000-01-08    1.279870
Freq: D, Name: A, dtype: float64

#### Assigning values to a new column or an existing column

In [None]:
# creating a new column of the same length as the index
dfa['F'] = list(range(len(df.index))) # use dfa['F'] (bracket notation) to create a new column 'F'

In [None]:
# overwrite the values of an existing column
dfa.A = list(range(len(df.index))) # dfa.A notation is fine if the column already exists
dfa

Unnamed: 0,A,B,C,D,F
2000-01-01,0,-1.03663,1.364472,-0.736082,0
2000-01-02,1,-0.73614,-0.209754,-1.858622,1
2000-01-03,2,1.113099,-1.550566,0.522875,2
2000-01-04,3,-2.152696,-0.797773,0.239495,3
2000-01-05,4,2.062501,1.204805,-0.885867,4
2000-01-06,5,1.322407,-0.22023,-1.599317,5
2000-01-07,6,1.020576,-0.145013,0.98113,6
2000-01-08,7,0.344439,0.365789,-1.607499,7


In [None]:
dfa = dfa.drop(columns='F')

In [None]:
dfa.keys()

Index(['A', 'B', 'C', 'D'], dtype='object')
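
Attribute access has a couple of pitfalls worth seeing once. The sketch below uses a hypothetical frame with a column named min, chosen only because it collides with a DataFrame method.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "min": [7, 8, 9]})

# A column whose name collides with a DataFrame method is shadowed:
# frame.min is the method, frame["min"] is the column.
attr_is_method = callable(frame.min)
col_values = frame["min"].tolist()

# Assigning to a new attribute name does not create a column;
# it only sets a plain Python attribute on the object.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    frame.G = [0, 1, 2]
created_column = "G" in frame.columns

print(attr_is_method, col_values, created_column)
```

Bracket notation (frame["G"] = ...) works in every case, so it is the safer default when creating columns.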

### Assigning a dict to a row of a DataFrame

In [None]:
#  create a two col df 
X = pd.DataFrame({'x': [1,2,3], 'y':[3,4,5]})

In [None]:
X

Unnamed: 0,x,y
0,1,3
1,2,4
2,3,5


### Changing values of a column using row indexing

In [None]:
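# replace the 2nd row (position 1) with values from a dict
# X.loc[1] = {...} also works here, because the index labels are 0, 1, 2
X.iloc[1] = {'x':99, 'y':88}
# set a single cell with one .loc call; chained indexing such as
# X.iloc[1]['y'] = 500 may write to a temporary copy instead of X
X.loc[1, 'y'] = 500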
X

Unnamed: 0,x,y
0,1,3
1,99,500
2,3,5


In [None]:
# if the dict covers only some of the columns, the rest of the row gets NaN
X.iloc[2] = {'y':55} # 'x' in the 3rd row (position 2) becomes NaN, so the values are shown as floats
X

Unnamed: 0,x,y
0,1.0,3.0
1,99.0,500.0
2,,55.0


In [None]:
# adding another column
X['z'] = [1,2,3]
# a scalar is broadcast to every column of the row
X.iloc[1] = 100
X

Unnamed: 0,x,y,z
0,1.0,3.0,1
1,100.0,100.0,100
2,,55.0,3
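
For changing one cell, a single indexing call that names both the row and the column avoids chained indexing. A minimal sketch on a fresh copy of the same two-column frame:

```python
import pandas as pd

X = pd.DataFrame({"x": [1, 2, 3], "y": [3, 4, 5]})

# Label-based: row label 1, column 'y'
X.loc[1, "y"] = 500

# Position-based: row position 2, column position 0
X.iloc[2, 0] = 42

# .at is a scalar-only accessor for a single (row label, column label) pair
X.at[0, "x"] = -1

print(X)
```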


# Slicing Ranges

In [None]:
# s was df['A']
# so s is a Series
s

2000-01-01    1.904858
2000-01-02    1.532126
2000-01-03   -0.005548
2000-01-04    0.919469
2000-01-05   -2.428640
2000-01-06    0.334977
2000-01-07   -0.350545
2000-01-08    1.279870
Freq: D, Name: A, dtype: float64

In [None]:
s[::] # same as s[:]; a third value, as in s[::2], sets the step (used for skipping)

2000-01-01    1.904858
2000-01-02    1.532126
2000-01-03   -0.005548
2000-01-04    0.919469
2000-01-05   -2.428640
2000-01-06    0.334977
2000-01-07   -0.350545
2000-01-08    1.279870
Freq: D, Name: A, dtype: float64

In [None]:
s[:5] # first 5, same as s[0:5]

2000-01-01    1.904858
2000-01-02    1.532126
2000-01-03   -0.005548
2000-01-04    0.919469
2000-01-05   -2.428640
Freq: D, Name: A, dtype: float64

In [None]:
s[::2] # every other row

2000-01-01    1.904858
2000-01-03   -0.005548
2000-01-05   -2.428640
2000-01-07   -0.350545
Freq: 2D, Name: A, dtype: float64

In [None]:
s[::-1] # in reverse order

2000-01-08    1.279870
2000-01-07   -0.350545
2000-01-06    0.334977
2000-01-05   -2.428640
2000-01-04    0.919469
2000-01-03   -0.005548
2000-01-02    1.532126
2000-01-01    1.904858
Freq: -1D, Name: A, dtype: float64

In [None]:
s2 = s.copy()

In [None]:
# turns first five rows into 0
s2[:5] = 0
s2

2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06    0.334977
2000-01-07   -0.350545
2000-01-08    1.279870
Freq: D, Name: A, dtype: float64

In [None]:
df[:3] # outputs first 3 rows

Unnamed: 0,A,B,C,D
2000-01-01,1.904858,-1.03663,1.364472,-0.736082
2000-01-02,1.532126,-0.73614,-0.209754,-1.858622
2000-01-03,-0.005548,1.113099,-1.550566,0.522875


In [None]:
df[::-1] # rows in reverse order

Unnamed: 0,A,B,C,D
2000-01-08,1.27987,0.344439,0.365789,-1.607499
2000-01-07,-0.350545,1.020576,-0.145013,0.98113
2000-01-06,0.334977,1.322407,-0.22023,-1.599317
2000-01-05,-2.42864,2.062501,1.204805,-0.885867
2000-01-04,0.919469,-2.152696,-0.797773,0.239495
2000-01-03,-0.005548,1.113099,-1.550566,0.522875
2000-01-02,1.532126,-0.73614,-0.209754,-1.858622
2000-01-01,1.904858,-1.03663,1.364472,-0.736082
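
The slices above are positional. The same selections can be written with .iloc, which makes the positional intent explicit. This sketch uses a Series with simple values 0.0 to 7.0 instead of random numbers so the results are easy to read.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1/1/2000", periods=8)
s = pd.Series(np.arange(8.0), index=dates, name="A")

first_five = s.iloc[:5]    # positions 0..4, end position excluded
every_other = s.iloc[::2]  # positions 0, 2, 4, 6
reversed_s = s.iloc[::-1]  # last row first

print(first_five.tolist())   # [0.0, 1.0, 2.0, 3.0, 4.0]
print(every_other.tolist())  # [0.0, 2.0, 4.0, 6.0]
print(reversed_s.index[0])   # 2000-01-08 00:00:00
```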


## Selection by Label

In [None]:

dfl = pd.DataFrame(np.random.randn(5,4), 
                  columns = list('ABCD'),
                  index=pd.date_range('20130101', periods=5))

dfl

Unnamed: 0,A,B,C,D
2013-01-01,-0.15012,1.66607,0.020874,-1.582633
2013-01-02,0.318356,1.138704,0.454163,-1.706591
2013-01-03,0.304486,-0.735436,0.147157,-0.238521
2013-01-04,-0.237939,-1.856986,0.191768,-0.893753
2013-01-05,-1.177202,0.564525,0.21641,0.65399


In [None]:
dfl.loc['20130101': '20130103']

Unnamed: 0,A,B,C,D
2013-01-01,-0.15012,1.66607,0.020874,-1.582633
2013-01-02,0.318356,1.138704,0.454163,-1.706591
2013-01-03,0.304486,-0.735436,0.147157,-0.238521


In [None]:
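# to access a single item of a row, select the row and then index it by position:
# dfl.loc['20130101'].iloc[0]

# slice from the first row up to and including the label '20130103'
dfl.loc[:'20130103']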

Unnamed: 0,A,B,C,D
2013-01-01,-0.15012,1.66607,0.020874,-1.582633
2013-01-02,0.318356,1.138704,0.454163,-1.706591
2013-01-03,0.304486,-0.735436,0.147157,-0.238521
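
Unlike positional slices, label slices with .loc include the end label. A short sketch comparing the two, with integer values in place of random numbers:

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(20).reshape(5, 4),
                   columns=list("ABCD"),
                   index=pd.date_range("20130101", periods=5))

by_label = dfl.loc["20130101":"20130103"]  # end label is included: 3 rows
by_position = dfl.iloc[0:3]                # end position is excluded: 3 rows
shorter = dfl.iloc[0:2]                    # only 2 rows

print(len(by_label), len(by_position), len(shorter))  # 3 3 2
```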


### More index (row) based indexing

In [None]:
s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
s1

a    0.017725
b   -0.969186
c   -0.490467
d   -0.593311
e    1.683164
f    0.195339
dtype: float64

In [None]:
# 'c' row onwards
s1.loc['c':]

c   -0.490467
d   -0.593311
e    1.683164
f    0.195339
dtype: float64

In [None]:
s1.loc['c'] = 0

In [None]:
s1

a    0.017725
b   -0.969186
c    0.000000
d   -0.593311
e    1.683164
f    0.195339
dtype: float64

In [None]:
s1.loc['c':] = 0
s1

a    0.017725
b   -0.969186
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64

In [None]:
df1 = pd.DataFrame(np.random.randn(6,4),
                  index = list('abcded'),
                  columns=list('ABDC'))

In [None]:
df1

Unnamed: 0,A,B,D,C
a,-1.432712,-0.45702,-0.427773,0.637294
b,2.180567,-0.054165,0.59299,-0.644671
c,-1.037832,0.020412,-0.306324,-0.326059
d,1.084503,0.45506,0.199775,-1.979673
e,-0.221576,1.572959,-1.804321,-0.587746
d,-0.494115,0.587223,-1.334331,-0.716053


In [None]:
df1.loc['d']

Unnamed: 0,A,B,D,C
d,1.084503,0.45506,0.199775,-1.979673
d,-0.494115,0.587223,-1.334331,-0.716053


### Selecting particular rows and columns

In [None]:
# rows a, c, d and columns B through D (label slice, end included)
# 'd' appears twice in the index, so both 'd' rows are returned
df1.loc[['a', 'c', 'd'], 'B':'D']

Unnamed: 0,B,D
a,-0.45702,-0.427773
c,0.020412,-0.306324
d,0.45506,0.199775
d,0.587223,-1.334331


In [None]:
df1.loc['c':, 'A':'C']

Unnamed: 0,A,B,D,C
c,-1.037832,0.020412,-0.306324,-0.326059
d,1.084503,0.45506,0.199775,-1.979673
e,-0.221576,1.572959,-1.804321,-0.587746
d,-0.494115,0.587223,-1.334331,-0.716053


#### If the index isn't unique, you might have problems slicing the DataFrame with .loc

Slicing from a duplicated label such as 'd' raises:

KeyError: "Cannot get left slice bound for non-unique label: 'd'"

In [None]:
df1.loc['d':, 'A':'C']

KeyError: ignored
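
One way to deal with this, sketched below under the assumption that the row order does not matter, is to check index.is_unique first and sort the index before slicing. A boolean mask is another option when only the duplicated label is needed. The frame mirrors df1 above but uses integer values.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(24).reshape(6, 4),
                   index=list("abcded"), columns=list("ABDC"))

print(df1.index.is_unique)  # False

# Slicing from a duplicated label on an unsorted index fails
try:
    df1.loc["d":, "A":"C"]
    raised = False
except KeyError:
    raised = True

# After sorting, the duplicated labels are adjacent and the slice is well defined
sorted_df = df1.sort_index()
tail = sorted_df.loc["d":, "A":"C"]

# A boolean mask selects the duplicated rows without any slicing
d_rows = df1[df1.index == "d"]

print(raised, tail.shape, len(d_rows))
```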

## iloc, loc