---
---

<center><h1>📍 📍 Overview of Subsetting in Pandas 📍 📍</h1></center>



---

#### `TABLE OF CONTENTS`

- What is an index?
- How to subset the first N rows based on their position index?
- Can we change the index?
- Will the index always be numeric?
- How to subset the data based on a label of the index?
- Can we reset the index?
- How to subset the data based on a value of a column?

---


#### `READ THE DATA`

- In this notebook, we use the Big Mart sales data from the previous notebook. It is stored in the folder named `datasets`.


In [1]:
# import the pandas library
import pandas as pd

In [22]:
# read the big mart sales data
data = pd.read_csv('datasets/big_mart_sales.csv')

#### `WHAT IS AN INDEX?`

![](index.png)

- The index holds the row labels of a dataframe. You can see the index object with `DataFrame.index`.

In [3]:
# index of the dataframe
data.index

RangeIndex(start=0, stop=8523, step=1)

***So, in this case, we have a numeric index that starts at 0 and ends at 8522. The `stop=8523` value is exclusive, which matches the 8523 rows in the data.***
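
As a small sketch of how the `RangeIndex` bounds behave, the block below builds a made-up three-row frame (the name `toy` and its values are illustrative and are not the Big Mart data). It assumes only the default index that pandas creates when none is given.

```python
import pandas as pd

# hypothetical toy frame, used only for illustration
toy = pd.DataFrame({'Item_Type': ['Dairy', 'Meat', 'Household']})

idx = toy.index                 # default RangeIndex
print(idx.start, idx.stop)      # stop equals the number of rows
print(idx[-1])                  # the last label is stop - 1
print(len(toy) == idx.stop)
```

The same relationship should hold for the sales data: 8523 rows, labels 0 through 8522.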


In [4]:
# no. of rows and columns in the data
data.shape

(8523, 12)

In [5]:
# column index
data.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales'],
      dtype='object')

#### `HOW TO SUBSET THE FIRST 'N' ROWS BASED ON THEIR POSITION INDEX?`


In [6]:
# view the top rows of the data
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [7]:
# view the bottom rows of the data
data.tail()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.38,Regular,0.046982,Baking Goods,108.157,OUT045,2002,,Tier 2,Supermarket Type1,549.285
8520,NCJ29,10.6,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.21,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976
8522,DRG01,14.8,Low Fat,0.044878,Soft Drinks,75.467,OUT046,1997,Small,Tier 1,Supermarket Type1,765.67
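
Both `head` and `tail` return 5 rows by default and accept the number of rows as an argument. The sketch below shows this on a made-up five-row frame (`toy` is a stand-in, not the full dataset) and compares `head(n)` with a positional slice through `iloc`.

```python
import pandas as pd

# hypothetical toy frame standing in for the sales data
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDN15', 'FDX07', 'NCD19'],
    'Item_Type': ['Dairy', 'Soft Drinks', 'Meat',
                  'Fruits and Vegetables', 'Household'],
})

first_three = toy.head(3)       # first 3 rows
last_two = toy.tail(2)          # last 2 rows
by_position = toy.iloc[:3]      # same rows as head(3), selected by position

print(first_three)
print(last_two)
```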


---

#### `CAN WE CHANGE THE INDEX?`

---

***The answer is `Yes`. We can change the index. Let's see how.***

---

#### Create a random list of 8523 numbers and set it as the index
---

In [8]:
# create a random list of 8523 integers between 1 and 8523 (values may repeat)
import random
random_list = [random.randint(1, 8523) for _ in range(8523)]


In [9]:
random_list

[584,
 116,
 401,
 6349,
 388,
 ...]

In [10]:
# set the index
data.index = random_list

In [11]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
584,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
116,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
401,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
6349,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
388,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
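
A list built with `random.randint` can contain repeated values, and pandas accepts that: index labels do not have to be unique. The list only has to be as long as the dataframe. A minimal sketch on a made-up three-row frame (`toy` and its labels are illustrative):

```python
import pandas as pd

# hypothetical toy frame, used only for illustration
toy = pd.DataFrame({'Item_Identifier': ['FDA15', 'DRC01', 'FDN15']})

# labels may repeat; pandas does not require a unique index
toy.index = [584, 116, 584]
print(toy.index.is_unique)      # False, because 584 appears twice
print(toy.loc[584])             # both rows labelled 584

# the list length must match the number of rows
try:
    toy.index = [1, 2]
    length_error = False
except ValueError:
    length_error = True
print(length_error)
```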


***We can also set a column of the dataframe as the index with the `set_index` function. You can read more about it here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html***

---

In [27]:
# 1. change the index of the dataframe
# 2. drop=True is used to drop the column that's set as index
# 3. inplace=False returns a new dataframe and leaves the original unchanged;
#    inplace=True would make the change in the original dataframe instead

data = pd.read_csv('datasets/big_mart_sales.csv')

data.set_index('Outlet_Establishment_Year', drop=True, inplace=False)

Outlet_Establishment_Year,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
1999,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,Medium,Tier 1,Supermarket Type1,3735.1380
2009,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,Medium,Tier 3,Supermarket Type2,443.4228
1999,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,Medium,Tier 1,Supermarket Type1,2097.2700
1998,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,,Tier 3,Grocery Store,732.3800
1987,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,High,Tier 3,Supermarket Type1,994.7052
...,...,...,...,...,...,...,...,...,...,...,...
1987,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,High,Tier 3,Supermarket Type1,2778.3834
2002,FDS36,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,,Tier 2,Supermarket Type1,549.2850
2004,NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,Small,Tier 2,Supermarket Type1,1193.1136
2009,FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,Medium,Tier 3,Supermarket Type2,1845.5976
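
The effect of the `drop` and `inplace` arguments can be checked on a small example. This is a sketch on a made-up frame (`toy`, `by_year` and `kept` are illustrative names), not a re-run of the cell above.

```python
import pandas as pd

# hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDN15'],
    'Outlet_Establishment_Year': [1999, 2009, 1999],
})

# inplace=False (the default) returns a new dataframe and leaves `toy` unchanged
by_year = toy.set_index('Outlet_Establishment_Year', drop=True)
print(list(toy.columns))        # the original still has both columns
print(by_year.index.name)       # name of the column that became the index

# drop=False keeps the column in the frame as well as in the index
kept = toy.set_index('Outlet_Establishment_Year', drop=False)
print('Outlet_Establishment_Year' in kept.columns)
```

Assigning the result back, as in `toy = toy.set_index(...)`, is an alternative to `inplace=True`.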


In [28]:
# `data` still has its positional index because inplace=False returned a new dataframe
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


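In [11]:
# after running set_index('Outlet_Establishment_Year') with inplace=True,
# the index holds the establishment years
data.index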

Int64Index([1999, 2009, 1999, 1998, 1987, 2009, 1987, 1985, 2002, 2007,
            ...
            2004, 2002, 2009, 2009, 1997, 1987, 2002, 2004, 2009, 1997],
           dtype='int64', name='Outlet_Establishment_Year', length=8523)

In [26]:
data.head()

Outlet_Establishment_Year,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
1999,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,Medium,Tier 1,Supermarket Type1,3735.138
2009,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,Medium,Tier 3,Supermarket Type2,443.4228
1999,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,Medium,Tier 1,Supermarket Type1,2097.27
1998,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,,Tier 3,Grocery Store,732.38
1987,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,High,Tier 3,Supermarket Type1,994.7052


In [15]:
# 1. change the index of the dataframe
# 2. drop=True is used to drop the column that's set as index
# 3. inplace=True is used to make changes in the original dataframe
# data = pd.read_csv('datasets/big_mart_sales.csv')

data.set_index('Item_Fat_Content', drop=True, inplace=True)

In [16]:
# example of a label based index
data.head()

Item_Fat_Content,Item_Weight,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Low Fat,9.3,0.016047,Dairy,249.8092,OUT049,Medium,Tier 1,Supermarket Type1,3735.138
Regular,5.92,0.019278,Soft Drinks,48.2692,OUT018,Medium,Tier 3,Supermarket Type2,443.4228
Low Fat,17.5,0.01676,Meat,141.618,OUT049,Medium,Tier 1,Supermarket Type1,2097.27
Regular,19.2,0.0,Fruits and Vegetables,182.095,OUT010,,Tier 3,Grocery Store,732.38
Low Fat,8.93,0.0,Household,53.8614,OUT013,High,Tier 3,Supermarket Type1,994.7052


In [17]:
data.index

Index(['Low Fat', 'Regular', 'Low Fat', 'Regular', 'Low Fat', 'Regular',
       'Regular', 'Low Fat', 'Regular', 'Regular',
       ...
       'Regular', 'Regular', 'Low Fat', 'Low Fat', 'reg', 'Low Fat', 'Regular',
       'Low Fat', 'Regular', 'Low Fat'],
      dtype='object', name='Item_Fat_Content', length=8523)

---

#### `WILL THE INDEX ALWAYS BE NUMERIC?`

---


***No. We can also use a column of string labels (a categorical variable) as the index of a dataframe.***

---

In [24]:
# set the Item_Identifier column as the index
data.set_index('Item_Identifier', drop=True, inplace=True)

In [25]:
# view the top rows of the data
data.head()

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,Medium,Tier 1,Supermarket Type1,3735.138
DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,Medium,Tier 3,Supermarket Type2,443.4228
FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,Medium,Tier 1,Supermarket Type1,2097.27
FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,,Tier 3,Grocery Store,732.38
NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,High,Tier 3,Supermarket Type1,994.7052


In [26]:
# index of the data
data.index

Index(['FDA15', 'DRC01', 'FDN15', 'FDX07', 'NCD19', 'FDP36', 'FDO10', 'FDP10',
       'FDH17', 'FDU28',
       ...
       'FDH31', 'FDA01', 'FDH24', 'NCJ19', 'FDF53', 'FDF22', 'FDS36', 'NCJ29',
       'FDN46', 'DRG01'],
      dtype='object', name='Item_Identifier', length=8523)

---

#### `HOW TO SUBSET THE DATA BASED ON THE LABEL OF THE INDEX?`

- We can subset the data based on the label of the index using the `.loc` indexer. If a label occurs more than once, every matching row is returned.
---

In [27]:
# select all rows whose index label is 'FDA15'
data.loc['FDA15']

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,Medium,Tier 1,Supermarket Type1,3735.138
FDA15,9.3,Low Fat,0.016055,Dairy,250.2092,OUT045,,Tier 2,Supermarket Type1,5976.2208
FDA15,9.3,Low Fat,0.016019,Dairy,248.5092,OUT035,Small,Tier 2,Supermarket Type1,6474.2392
FDA15,9.3,Low Fat,0.016088,Dairy,249.6092,OUT018,Medium,Tier 3,Supermarket Type2,5976.2208
FDA15,9.3,Low Fat,0.026818,Dairy,248.9092,OUT010,,Tier 3,Grocery Store,498.0184
FDA15,9.3,Low Fat,0.016009,Dairy,250.6092,OUT013,High,Tier 3,Supermarket Type1,6474.2392
FDA15,,Low Fat,0.015945,Dairy,249.5092,OUT027,Medium,Tier 3,Supermarket Type3,6474.2392
FDA15,9.3,LF,0.016113,Dairy,248.8092,OUT017,,Tier 2,Supermarket Type1,5976.2208
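
What `.loc` returns depends on the label passed to it. The sketch below uses a made-up three-row frame with a repeated label (`toy` and the variable names are illustrative) to show the common cases.

```python
import pandas as pd

# hypothetical toy frame with a repeated index label
toy = pd.DataFrame(
    {'Outlet_Identifier': ['OUT049', 'OUT018', 'OUT045'],
     'Item_MRP': [249.8092, 48.2692, 250.2092]},
    index=pd.Index(['FDA15', 'DRC01', 'FDA15'], name='Item_Identifier'),
)

repeated = toy.loc['FDA15']              # label occurs twice: DataFrame with 2 rows
single = toy.loc['DRC01']                # label occurs once: Series for that row
several = toy.loc[['DRC01', 'FDA15']]    # list of labels: DataFrame
cell = toy.loc['DRC01', 'Item_MRP']      # row label and column label: single value

print(repeated)
print(single)
print(cell)
```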


***What if we want to change the index back to positional?***

---

#### `CAN WE RESET THE INDEX?`

***Yes, we can reset the index with the `reset_index` function. You can read more about it here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html***

---

In [28]:
# reset the index
data.reset_index(inplace=True)

In [29]:
# view the top rows of the data
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,High,Tier 3,Supermarket Type1,994.7052
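
By default, `reset_index` moves the old index back into the dataframe as a regular column, which is why `Item_Identifier` reappears above. Passing `drop=True` discards the old index instead. A minimal sketch on a made-up frame (`toy`, `restored` and `discarded` are illustrative names):

```python
import pandas as pd

# hypothetical toy frame indexed by Item_Identifier
toy = pd.DataFrame(
    {'Item_MRP': [249.8092, 48.2692]},
    index=pd.Index(['FDA15', 'DRC01'], name='Item_Identifier'),
)

restored = toy.reset_index()             # old index becomes a regular column
discarded = toy.reset_index(drop=True)   # old index is thrown away

print(list(restored.columns))
print(list(discarded.columns))
print(list(restored.index))              # positional labels again
```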


#### `HOW TO SUBSET THE DATA BASED ON A VALUE OF A COLUMN?`

---

In [29]:
# keep only the rows where Item_Type is 'Dairy'
data[data['Item_Type'] == 'Dairy']

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
11,FDA03,18.500,Regular,0.045464,Dairy,144.1102,OUT046,1997,Small,Tier 1,Supermarket Type1,2187.1530
19,FDU02,13.350,Low Fat,0.102492,Dairy,230.5352,OUT035,2004,Small,Tier 2,Supermarket Type1,2748.4224
28,FDE51,5.925,Regular,0.161467,Dairy,45.5086,OUT010,1998,,Tier 3,Grocery Store,178.4344
30,FDV38,19.250,Low Fat,0.170349,Dairy,55.7956,OUT010,1998,,Tier 3,Grocery Store,163.7868
...,...,...,...,...,...,...,...,...,...,...,...,...
8424,FDC39,7.405,Low Fat,0.159165,Dairy,207.1296,OUT035,2004,Small,Tier 2,Supermarket Type1,3739.1328
8447,FDS26,20.350,Low Fat,0.089975,Dairy,261.6594,OUT017,2007,,Tier 2,Supermarket Type1,7588.1226
8448,FDV50,14.300,Low Fat,0.123071,Dairy,121.1730,OUT018,2009,Medium,Tier 3,Supermarket Type2,2093.9410
8457,FDY50,5.800,Low Fat,0.130931,Dairy,89.9172,OUT035,2004,Small,Tier 2,Supermarket Type1,1516.6924
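
The comparison inside the brackets produces a boolean Series, and the dataframe keeps the rows where it is `True`. The sketch below illustrates this on a made-up four-row frame (`toy` and the variable names are illustrative) and shows two common extensions: combining conditions and matching several values.

```python
import pandas as pd

# hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Item_Type': ['Dairy', 'Soft Drinks', 'Meat', 'Dairy'],
    'Item_MRP': [249.8092, 48.2692, 141.618, 144.1102],
})

mask = toy['Item_Type'] == 'Dairy'       # boolean Series, one value per row
dairy = toy[mask]                        # keeps rows where the mask is True

# combine conditions with & (and) or | (or); wrap each condition in parentheses
pricey_dairy = toy[(toy['Item_Type'] == 'Dairy') & (toy['Item_MRP'] > 200)]

# isin matches any value from a list
dairy_or_meat = toy[toy['Item_Type'].isin(['Dairy', 'Meat'])]

print(dairy)
print(pricey_dairy)
print(dairy_or_meat)
```

Note that the filtered frames keep the original index labels of the rows they select.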


In [30]:
# select the row whose index label is 30
data.loc[30]

Item_Identifier                      FDV38
Item_Weight                          19.25
Item_Fat_Content                   Low Fat
Item_Visibility                   0.170349
Item_Type                            Dairy
Item_MRP                           55.7956
Outlet_Identifier                   OUT010
Outlet_Establishment_Year             1998
Outlet_Size                            NaN
Outlet_Location_Type                Tier 3
Outlet_Type                  Grocery Store
Item_Outlet_Sales                  163.787
Name: 30, dtype: object
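
Because `.loc` looks up labels and not positions, it behaves differently from `.iloc` once rows have been filtered out. A small sketch on a made-up frame (`toy` and the variable names are illustrative):

```python
import pandas as pd

# hypothetical toy frame, not the real sales data
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'FDN15', 'FDA03'],
    'Item_Type': ['Dairy', 'Meat', 'Dairy'],
})

dairy = toy[toy['Item_Type'] == 'Dairy']   # keeps the original labels 0 and 2

by_label = dairy.loc[2]         # row whose index label is 2
by_position = dairy.iloc[1]     # second row of the filtered frame (same row here)

# label 1 was filtered out, so looking it up by label fails
try:
    dairy.loc[1]
    label_missing = False
except KeyError:
    label_missing = True

print(by_label)
print(label_missing)
```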

----

#### `WE WILL SEE HOW TO SUBSET THE DATA BASED ON POSITION, LABEL AND VALUES IN DETAIL`

---