# <span style="color:#130654; font-family: Helvetica; font-size: 200%; font-weight:700"> Pandas | <span style="font-size: 50%; font-weight:300">Index: Set, Reset & Use</span></span>

To use pandas in Python, import it first:

In [1]:
# import pandas
import pandas as pd

# import other libraries here
import numpy as np

<br>

## <span style="color:#130654">**Index**</span>

- An index holds the <u>labels that identify each row</u> of a dataframe. Labels are usually unique, although pandas does not require it.
- Like a dict, a DataFrame's <u>index is backed by a hash table</u>. Looking up rows based on index values is like looking up dict values based on a key.
- In contrast, the values in a column are like values in a list.
- Because the index uses <u>fast hash lookup</u>, looking up rows by index value is generally <u>faster than looking up rows by column value</u>.
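
To make the dict analogy concrete, here is a minimal sketch. The `prices` and `dupes` frames are hypothetical toy data, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
prices = pd.DataFrame(
    {"price": [1.2, 0.5, 2.0]},
    index=["apple", "banana", "cherry"],
)

# Label lookup on the index works much like a dict key lookup
as_dict = {"apple": 1.2, "banana": 0.5, "cherry": 2.0}
from_frame = prices.loc["banana", "price"]
from_dict = as_dict["banana"]

# pandas does not force index labels to be unique, so check before relying on it
has_unique_labels = prices.index.is_unique

# With a duplicated label, .loc returns every matching row
dupes = pd.DataFrame({"price": [1.2, 1.3]}, index=["apple", "apple"])
apple_rows = dupes.loc["apple"]
```

Unlike a dict, an index can hold the same label more than once, which is why `Index.is_unique` is worth checking when a single row is expected.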

*Example:*

In [2]:
# let's create a dataframe; np.random.random() returns a single value, which is repeated on every row
df_1 = pd.DataFrame({'foo': np.random.random(), 'index': range(10000)})

# Show first 5 records
df_1.head()

| | foo | index |
|---|---|---|
| 0 | 0.889971 | 0 |
| 1 | 0.889971 | 1 |
| 2 | 0.889971 | 2 |
| 3 | 0.889971 | 3 |
| 4 | 0.889971 | 4 |


**Look up by column**

To look up any row where the `df_1['index']` <u>column equals 999</u>, pandas has to <u>scan every value in the column</u> to find the ones equal to 999.

In [3]:
df_1[df_1['index'] == 999]

| | foo | index |
|---|---|---|
| 999 | 0.889971 | 999 |


Time taken for the column lookup, measured with `%timeit`:

In [4]:
%timeit df_1[df_1['index'] == 999]

1.07 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


**Look up by index**

In [5]:
# let's set the "index" column as the dataframe index
df_1_index = df_1.set_index(['index'])

# Show first 5 records
df_1_index.head()

| index | foo |
|---|---|
| 0 | 0.889971 |
| 1 | 0.889971 |
| 2 | 0.889971 |
| 3 | 0.889971 |
| 4 | 0.889971 |


To look up any row where the <u>index equals 999</u>, pandas uses the <u>hash value to find the rows</u>:

In [6]:
df_1_index.loc[999]

foo    0.889971
Name: 999, dtype: float64

Time taken for the index lookup, measured with `%timeit`:

In [7]:
%timeit df_1_index.loc[999]

144 µs ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


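The timings show that looking up a row through the `index` (about 144 µs) is several times faster than looking it up through a `column` (about 1.07 ms).

Setting the index has its own cost, though. A single `set_index` call takes longer than one index lookup, so it pays off when the same index is reused for many lookups.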

In [8]:
%timeit df_1.set_index(['index'])

593 µs ± 142 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
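
As a rough sketch of that trade-off outside a notebook, the standard-library `timeit` module can stand in for `%timeit`. The `labels` list and the loop counts are arbitrary choices for illustration, and the absolute timings depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as df_1 above: one random value repeated over 10,000 rows
df = pd.DataFrame({"foo": np.random.random(), "index": range(10000)})

# Pay the set_index cost once, outside any loop
indexed = df.set_index("index")

labels = [10, 999, 5000]  # hypothetical labels to look up
by_index = [indexed.loc[label, "foo"] for label in labels]
by_column = [df.loc[df["index"] == label, "foo"].iloc[0] for label in labels]

# Rough timings in seconds for 200 repetitions of each lookup style
t_index = timeit.timeit(lambda: indexed.loc[999], number=200)
t_column = timeit.timeit(lambda: df[df["index"] == 999], number=200)
print(f"index lookup: {t_index:.4f}s, column scan: {t_column:.4f}s")
```

Both lookup styles return the same values; only the cost per lookup differs.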


<br>

#### <span style="color:#130654">When do we use an index?</span>

- Sometimes the index plays a role in reshaping the DataFrame.
- Many functions, such as `set_index`, `stack`, `unstack`, `pivot`, `pivot_table`, `melt`, `lreshape`, and `crosstab`, all use or manipulate the index.
- Behind the scenes, `join`, `merge` and `groupby` take advantage of fast index lookups when possible.
- Time series have `resample`, `asfreq` and `interpolate` methods whose underlying implementations take advantage of fast index lookups too.
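
A small sketch of two of these index-driven operations, `unstack` and `join`, is shown here. The `sales` and `regions` frames are hypothetical data invented for the example.

```python
import pandas as pd

# Hypothetical long-format data, invented for this example
sales = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Rome", "Rome"],
        "year": [2020, 2021, 2020, 2021],
        "units": [10, 12, 7, 9],
    }
)

# set_index builds a two-level index; unstack moves the "year" level to the columns
wide = sales.set_index(["city", "year"])["units"].unstack("year")

# join aligns the rows of both frames on their index labels
regions = pd.DataFrame({"region": ["North", "South"]}, index=["Oslo", "Rome"])
joined = wide.join(regions)
```

Here `wide` has one row per city and one column per year, and `joined` adds the `region` column by matching city labels in the two indexes.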

<br>

#### <span style="color:#130654">set_index()</span>

The `set_index()` method assigns one or more existing columns as the index of a dataframe.

*Syntax:*
```python
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
```

|Param |Details |
|:----:|--------|
|**keys** | Column name or list of column names. |
|**drop** | If True, removes the column(s) used as the new index from the columns. |
|**append** | If True, adds the column(s) to the existing index instead of replacing it. |
|**inplace** | If True, modifies the dataframe in place and returns None. |
|**verify_integrity** | If True, checks the new index for duplicates. |
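
The following sketch exercises each parameter on a hypothetical two-column frame (`df` and `dup` are invented for the example).

```python
import pandas as pd

# Hypothetical frame, invented for this example
df = pd.DataFrame({"code": ["a", "b", "c"], "value": [1, 2, 3]})

# drop=False keeps "code" as a regular column as well as the index
kept = df.set_index("code", drop=False)

# append=True adds "code" as a second level under the existing index
appended = df.set_index("code", append=True)

# inplace=False (the default) returns a new frame and leaves df untouched
original_columns = list(df.columns)

# verify_integrity=True raises ValueError if the new index has duplicates
dup = pd.DataFrame({"code": ["a", "a"], "value": [1, 2]})
try:
    dup.set_index("code", verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```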

<br>

Creating a dataset using dictionary:

In [9]:
data = {
    'country':['United States', 'China', 'Japan', 'Russia', 'India'],
    'pop_million':[331.0, 1439.32, 126.48, 145.93, 1380.0],
    'pop_rank':[3, 1, 11, 9, 2],
    'gdp_trillion':[19.485, 12.238, 4.872 , 1.578 , 2.651],
    'gdp_rank':[1, 2, 3, 11, 5],
    
}

Let's create a dataframe with the automatic index assigned by pandas.

In [10]:
df_2 = pd.DataFrame(data)
df_2

| | country | pop_million | pop_rank | gdp_trillion | gdp_rank |
|---|---|---|---|---|---|
| 0 | United States | 331.0 | 3 | 19.485 | 1 |
| 1 | China | 1439.32 | 1 | 12.238 | 2 |
| 2 | Japan | 126.48 | 11 | 4.872 | 3 |
| 3 | Russia | 145.93 | 9 | 1.578 | 11 |
| 4 | India | 1380.0 | 2 | 2.651 | 5 |


Check the index of dataframe using `dataframe.index`:

In [11]:
df_2.index

RangeIndex(start=0, stop=5, step=1)

pandas has automatically assigned a range of integers as the index.

<br>

Use the `set_index()` method to assign the column `gdp_rank` as a custom index.

In [12]:
df_2.set_index(keys='gdp_rank', inplace=True) # No need to type "keys=", just pass column name directly
df_2

| gdp_rank | country | pop_million | pop_rank | gdp_trillion |
|---|---|---|---|---|
| 1 | United States | 331.0 | 3 | 19.485 |
| 2 | China | 1439.32 | 1 | 12.238 |
| 3 | Japan | 126.48 | 11 | 4.872 |
| 11 | Russia | 145.93 | 9 | 1.578 |
| 5 | India | 1380.0 | 2 | 2.651 |


Now assign another column, `country`, as the index.

In [13]:
df_2.set_index(keys='country', inplace=True)
df_2

| country | pop_million | pop_rank | gdp_trillion |
|---|---|---|---|
| United States | 331.0 | 3 | 19.485 |
| China | 1439.32 | 1 | 12.238 |
| Japan | 126.48 | 11 | 4.872 |
| Russia | 145.93 | 9 | 1.578 |
| India | 1380.0 | 2 | 2.651 |


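Note: `set_index()` replaces the existing index by default (`append=False`), so the previously assigned index is discarded. The `drop` param, which is `True` by default, only controls whether the column used as the new index is also removed from the columns. Extra care is therefore needed when setting an index twice.

As the above example shows, `gdp_rank` is no longer in the dataframe once `country` is assigned as the new index. To keep it, call `reset_index()` before setting the new index, or pass `append=True` to add `country` as an additional index level.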
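
A minimal sketch of keeping the previous index when a new one is set, using a cut-down, two-row copy of the table above (the variable names are illustrative only):

```python
import pandas as pd

# A cut-down copy of the table above, already indexed by gdp_rank
df = pd.DataFrame(
    {"country": ["United States", "China"], "gdp_rank": [1, 2]}
).set_index("gdp_rank")

# Plain set_index replaces the old index, so gdp_rank is lost
replaced = df.set_index("country")

# Option 1: turn the old index back into a column, then set the new index
via_reset = df.reset_index().set_index("country")

# Option 2: keep the old index as an extra level with append=True
via_append = df.set_index("country", append=True)
```

Option 1 leaves `gdp_rank` as an ordinary column, while option 2 produces a two-level index of `gdp_rank` and `country`.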

<br>

#### <span style="color:#130654">reset_index()</span>

`reset_index()` resets a custom index and replaces it with the default `RangeIndex`. The custom index is converted back into a column.

In [14]:
df_2.reset_index(inplace=True)
df_2

| | country | pop_million | pop_rank | gdp_trillion |
|---|---|---|---|---|
| 0 | United States | 331.0 | 3 | 19.485 |
| 1 | China | 1439.32 | 1 | 12.238 |
| 2 | Japan | 126.48 | 11 | 4.872 |
| 3 | Russia | 145.93 | 9 | 1.578 |
| 4 | India | 1380.0 | 2 | 2.651 |
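
When the old index is not worth keeping, `reset_index()` also accepts `drop=True`. A short sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame indexed by country
df = pd.DataFrame(
    {"pop_million": [1439.32, 1380.0]},
    index=pd.Index(["China", "India"], name="country"),
)

# Default: the old index comes back as a "country" column
restored = df.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = df.reset_index(drop=True)
```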


<br>

#### <span style="color:#130654">sort_index()</span>

The `sort_index()` method sorts a Series or DataFrame by its index labels.

It returns a new object sorted by label if the `inplace` argument is False; otherwise it updates the original object and returns None.

*Syntax:*
```python
Series.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)
```

|        Name        | Description                                                  | Type                                     | Required |
| :----------------: | :----------------------------------------------------------- | :--------------------------------------- | :------- |
|      **axis**      | Axis to sort along. Only 0 is valid for a Series; for a DataFrame, 1 sorts the column labels. | int, default 0 | Optional |
|     **level**      | If not None, sort on values in specified index level(s).     | int, default None | Optional |
|   **ascending**    | Sort ascending vs. descending.                               | bool, default True | Optional |
|    **inplace**     | If True, perform operation in-place.                         | bool, default False | Optional |
|      **kind**      | Choice of sorting algorithm.                                 | {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’ | Optional |
|  **na_position**   | 1. ‘first’ puts NaNs at the beginning.<br />2. ‘last’ puts NaNs at the end. <br />Not implemented for MultiIndex. | {‘first’, ‘last’}, default ‘last’ | Optional |
| **sort_remaining** | If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level. | bool, default True | Optional |

<img src="./img/sort_index.png" width=200/>

In [15]:
# set "country" column as index in dataframe
df_2.set_index('country', inplace=True)

# sort row index in ascending order
df_2.sort_index(ascending=True, inplace=True)
df_2

| country | pop_million | pop_rank | gdp_trillion |
|---|---|---|---|
| China | 1439.32 | 1 | 12.238 |
| India | 1380.0 | 2 | 2.651 |
| Japan | 126.48 | 11 | 4.872 |
| Russia | 145.93 | 9 | 1.578 |
| United States | 331.0 | 3 | 19.485 |


By default, `sort_index()` sorts the row index. To sort the column labels instead, pass `axis=1`.

In [16]:
# sort column index in ascending order
df_2.sort_index(axis=1, ascending=True, inplace=True)
df_2

| country | gdp_trillion | pop_million | pop_rank |
|---|---|---|---|
| China | 12.238 | 1439.32 | 1 |
| India | 2.651 | 1380.0 | 2 |
| Japan | 4.872 | 126.48 | 11 |
| Russia | 1.578 | 145.93 | 9 |
| United States | 19.485 | 331.0 | 3 |


The column labels are now sorted alphabetically in ascending order.
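
The `ascending` and `na_position` parameters can be sketched on a small Series. The Series `s` and its float labels are hypothetical, chosen only so that one label is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical Series whose index contains one missing label
s = pd.Series([1, 2, 3, 4], index=[3.0, np.nan, 1.0, 2.0])

# Descending order; the missing label still goes last by default
desc = s.sort_index(ascending=False)

# na_position="first" moves the missing label to the front
na_first = s.sort_index(na_position="first")
```

Neither call modifies `s`, because `inplace` defaults to False.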

<br>

#### <span style="color:#130654">Retrieve values using index</span>

The index can be used to get values or to filter a dataframe based on a specific condition.
- Use the `.loc[row_name, column_name]` indexer to get a value by label.
- Use the `.iloc[row_index, column_index]` indexer to get a value by position.
- Both return the selection and leave the dataframe unchanged (they are `not inplace`).

In [17]:
# lets see our dataframe
df_2

| country | gdp_trillion | pop_million | pop_rank |
|---|---|---|---|
| China | 12.238 | 1439.32 | 1 |
| India | 2.651 | 1380.0 | 2 |
| Japan | 4.872 | 126.48 | 11 |
| Russia | 1.578 | 145.93 | 9 |
| United States | 19.485 | 331.0 | 3 |


<br>

Get the row / index values for `country = India`

In [18]:
# Using label
df_2.loc['India']

gdp_trillion       2.651
pop_million     1380.000
pop_rank           2.000
Name: India, dtype: float64

In [19]:
# Using index position
df_2.iloc[1]

gdp_trillion       2.651
pop_million     1380.000
pop_rank           2.000
Name: India, dtype: float64

<br>

Get the `pop_million` column value for row / index `country = India`

In [20]:
# Using label
df_2.loc['India', 'pop_million' ]

1380.0

In [21]:
# Using index position
df_2.iloc[1, 1]

1380.0

<br>

`iloc` can also slice rows and columns to get the desired part of a dataframe:

In [22]:
# Slicing rows, use ":" to select all columns after ","

df_2.iloc[0:3,:]

| country | gdp_trillion | pop_million | pop_rank |
|---|---|---|---|
| China | 12.238 | 1439.32 | 1 |
| India | 2.651 | 1380.0 | 2 |
| Japan | 4.872 | 126.48 | 11 |


In [23]:
# Slicing columns, use ":" to select all rows before ","
df_2.iloc[:,0:2]

| country | gdp_trillion | pop_million |
|---|---|---|
| China | 12.238 | 1439.32 |
| India | 2.651 | 1380.0 |
| Japan | 4.872 | 126.48 |
| Russia | 1.578 | 145.93 |
| United States | 19.485 | 331.0 |
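
Label-based slicing and condition-based filtering with `.loc` can be sketched as follows. The frame is rebuilt from the values shown above so that the block runs on its own.

```python
import pandas as pd

# Rebuilt from the values shown above so the block is self-contained
df = pd.DataFrame(
    {
        "gdp_trillion": [12.238, 2.651, 4.872, 1.578, 19.485],
        "pop_million": [1439.32, 1380.0, 126.48, 145.93, 331.0],
    },
    index=pd.Index(
        ["China", "India", "Japan", "Russia", "United States"], name="country"
    ),
)

# .loc slices by label and includes the end label
by_label = df.loc["China":"Japan"]

# .iloc slices by position and excludes the end position
by_position = df.iloc[0:3]

# A boolean mask keeps only the rows where the condition holds
large = df.loc[df["pop_million"] > 1000, "gdp_trillion"]
```

Both slices select the same three rows here, because `.loc` includes its end label while `.iloc` stops before its end position.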


-------------

## <span style="color:#130654">Reindexing</span>

Reindexing changes the row labels and column labels of a DataFrame.

Reindexing can accomplish several things, such as:
- Reorder the existing data to match a new set of labels.
- Insert missing value (NA) markers in label locations where no data for the label existed.

The `reindex()` method conforms a DataFrame to a new index with optional filling logic, placing NA/NaN at labels that had no value in the previous index.

*Syntax:*
```python
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
```

|          Name          | Description                                                  | Type                                                 | Required |
| :--------------------: | :----------------------------------------------------------- | :--------------------------------------------------- | :------- |
|       **labels**       | New labels / index to conform the axis specified by ‘axis’ to. | array-like                                           | Optional |
| **index**, **columns** | New labels / index to conform to, should be specified using keywords. Preferably an Index object to avoid duplicating data. | array-like                                           | Optional |
|        **axis**        | Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). | int or str                                           | Optional |
|       **method**       | Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.<br />1. None (default): don’t fill gaps<br />2. pad / ffill: propagate last valid observation forward to next valid<br />3. backfill / bfill: use next valid observation to fill gap<br />4. nearest: use nearest valid observations to fill gap | {None, ‘backfill’/‘bfill’, ‘pad’/‘ffill’, ‘nearest’} | Optional |
|        **copy**        | Return a new object, even if the passed indexes are the same. | bool, default True                                   | Optional |
|       **level**        | Broadcast across a level, matching Index values on the passed MultiIndex level. | int or name                                          | Optional |
|     **fill_value**     | Value to use for missing values. Defaults to NaN, but can be any “compatible” value. | scalar, default NaN                                  | Optional |
|       **limit**        | Maximum number of consecutive elements to forward or backward fill. | int                                                  | Optional |
|     **tolerance**      | Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance. Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type. | scalar or list-like                                  | Optional |

In [24]:
N=6

df_xx = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'X': np.linspace(0,stop=N-1,num=N),
   'Y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

df_xx

| | A | X | Y | C | D |
|---|---|---|---|---|---|
| 0 | 2016-01-01 | 0.0 | 0.930182 | Medium | 73.676793 |
| 1 | 2016-01-02 | 1.0 | 0.447519 | Medium | 91.989409 |
| 2 | 2016-01-03 | 2.0 | 0.285539 | Medium | 88.898705 |
| 3 | 2016-01-04 | 3.0 | 0.22709 | Low | 106.118143 |
| 4 | 2016-01-05 | 4.0 | 0.415857 | High | 105.812838 |
| 5 | 2016-01-06 | 5.0 | 0.036463 | Low | 106.182851 |


In [25]:
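# reindex the DataFrame: keep rows 0, 2, 5 and columns A, C;
# column B does not exist in df_xx, so it is filled with 'missing'
df_reindexed = df_xx.reindex(index=[0,2,5], columns=['A', 'C', 'B'], fill_value='missing')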

In [26]:
df_reindexed

| | A | C | B |
|---|---|---|---|
| 0 | 2016-01-01 | Medium | missing |
| 2 | 2016-01-03 | Medium | missing |
| 5 | 2016-01-06 | Low | missing |
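
The `method` and `fill_value` parameters can be sketched on a sorted date index. The `readings` Series and its values are hypothetical, invented for this example.

```python
import pandas as pd

# Hypothetical readings on a sorted (monotonic) date index with gaps
readings = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2016-01-01", "2016-01-03", "2016-01-05"]),
)
daily = pd.date_range("2016-01-01", "2016-01-06", freq="D")

# No method: dates missing from the original index become NaN
plain = readings.reindex(daily)

# method="ffill" carries the last earlier observation forward
filled = readings.reindex(daily, method="ffill")

# fill_value replaces NaN with a constant for the new labels
zeros = readings.reindex(daily, fill_value=0.0)
```

Forward filling only works here because the original index is monotonically increasing, as the `method` row of the table requires.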
