# Indexing and Selecting Data

In this section, you will:

* Select rows from a dataframe
* Select columns from a dataframe
* Select subsets of dataframes

### Selecting Rows

Selecting rows in a dataframe is similar to slicing numpy arrays. The syntax ```df[start_index:end_index]``` returns the rows from position ```start_index``` up to, but not including, ```end_index```.

In [None]:
import numpy as np
import pandas as pd

market_df = pd.read_csv("./global_sales_data/market_fact.csv", delimiter = ',')
market_df.head(15)

Notice that, by default, pandas assigns integer labels to the rows, starting at 0.

In [4]:
# Selecting the rows from indices 2 to 6
market_df[2:7]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38
5,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37
6,Ord_31,Prod_12,SHP_41,Cust_26,14.76,0.01,5,1.32,0.5,0.36
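The end index of a slice is excluded, just as with Python lists, which is why ```market_df[2:7]``` returns rows 2 to 6. A minimal sketch on a hypothetical toy dataframe (```toy_df``` is made up for illustration and is not the market data):

```python
import pandas as pd

# Hypothetical toy data, not read from market_fact.csv
toy_df = pd.DataFrame({"Sales": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Rows at positions 1, 2 and 3; position 4 is excluded
subset = toy_df[1:4]
print(subset.index.tolist())   # [1, 2, 3]

# Omitting the start index begins at the first row
first_two = toy_df[:2]
print(len(first_two))          # 2
```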


In [10]:
# Summary statistics at a custom set of percentiles
market_df.describe(percentiles=[0.25, 0.5, 0.75, 0.8, 0.9, 0.95, 0.97, 0.99, 0.991, 0.992, 0.995, 0.996, 0.997, 0.998, 0.999, 0.9995])

Unnamed: 0,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
count,8399.0,8399.0,8399.0,8399.0,8399.0,8336.0
mean,1775.878179,0.049671,25.571735,181.184424,12.838557,0.512513
std,3585.050525,0.031823,14.481071,1196.653371,17.264052,0.135589
min,2.24,0.0,1.0,-14140.7,0.49,0.35
25%,143.195,0.02,13.0,-83.315,3.3,0.38
50%,449.42,0.05,26.0,-1.5,6.07,0.52
75%,1709.32,0.08,38.0,162.75,13.99,0.59
80%,2302.901,0.08,41.0,274.932,19.99,0.6
90%,4918.979,0.09,46.0,769.038,35.0,0.72
95%,7844.335,0.1,48.0,1542.309,55.351,0.78


In [11]:
# Selecting alternate rows starting from index = 5
market_df[5::2].head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
5,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37
7,Ord_4725,Prod_4,SHP_6593,Cust_1641,3410.1575,0.1,48,1137.91,0.99,0.55
9,Ord_4725,Prod_6,SHP_6593,Cust_1641,57.22,0.07,8,-27.72,6.6,0.37
11,Ord_1925,Prod_6,SHP_2637,Cust_708,465.9,0.05,38,79.34,4.86,0.38
13,Ord_2207,Prod_11,SHP_3093,Cust_839,3364.248,0.1,15,-693.23,61.76,0.78
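The general slice form is ```df[start:stop:step]```. As a small sketch, again on a hypothetical toy dataframe rather than the market data, a step of 2 picks every other row and a step of -1 reverses the row order:

```python
import pandas as pd

# Hypothetical toy data with ten rows labelled 0..9
toy_df = pd.DataFrame({"Sales": range(10)})

# Every second row, starting at position 5
alternate = toy_df[5::2]
print(alternate.index.tolist())      # [5, 7, 9]

# A negative step walks the rows backwards
reversed_df = toy_df[::-1]
print(reversed_df.index.tolist()[:3])  # [9, 8, 7]
```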


### Selecting Columns

There are two simple ways to select a single column from a dataframe - ```df['column_name']``` and ```df.column_name```.

In [None]:
# Using df['column']
sales = market_df['Sales']
sales.head()


In [None]:
# Using df.column
sales = market_df.Sales
sales.head()

In [None]:
# Notice that in both these cases, the result is a Series object
print(type(market_df['Sales']))
print(type(market_df.Sales))
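The two forms are not always interchangeable. Attribute access only works when the column name is a valid Python identifier and does not collide with an existing dataframe attribute or method. The sketch below uses a hypothetical toy dataframe whose column names (```count```, ```Unit Price```) were chosen to show the problem; they are not columns of the market data:

```python
import pandas as pd

# Hypothetical columns chosen to illustrate the pitfalls
toy_df = pd.DataFrame({
    "Sales": [100.0, 250.0],
    "count": [3, 7],             # same name as the DataFrame.count method
    "Unit Price": [9.5, 12.0],   # contains a space
})

# For an ordinary name, both forms return the same Series
same = toy_df["Sales"].equals(toy_df.Sales)
print(same)                      # True

# toy_df.count is the method, not the column
print(callable(toy_df.count))    # True

# Bracket notation always reaches the column
count_col = toy_df["count"]
price_col = toy_df["Unit Price"]
print(count_col.tolist())        # [3, 7]
```

For this reason, bracket notation is the safer default in scripts.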


#### Selecting Multiple Columns 

You can select multiple columns by passing the list of column names inside the ```[]```: ```df[['column_1', 'column_2', 'column_n']]```.

For instance, to select only the columns Cust_id, Sales and Profit:

In [None]:
# Select Cust_id, Sales and Profit:
market_df[['Cust_id', 'Sales', 'Profit']].head()

Notice that in this case, the output is itself a dataframe.

In [None]:
type(market_df[['Cust_id', 'Sales', 'Profit']])

In [None]:
# Similarly, if you select one column using double square brackets,
# you'll get a dataframe, not a Series

type(market_df[['Sales']])
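The difference between single and double brackets also shows up in the shape of the result. A minimal sketch on a hypothetical two-row dataframe (the values are made up):

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Cust_id": ["Cust_1", "Cust_2"],
    "Sales": [100.0, 250.0],
})

as_series = toy_df["Sales"]     # single brackets: one-dimensional Series
as_frame = toy_df[["Sales"]]    # double brackets: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```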

### Selecting Subsets of Dataframes

So far, you have selected rows and columns in the following ways:
* Selecting rows: ```df[start:stop]```
* Selecting columns: ```df['column']``` or ```df.column``` or ```df[['col_x', 'col_y']]```
    * ```df['column']``` or ```df.column``` return a series
    * ```df[['col_x', 'col_y']]``` returns a dataframe

However, this is not the recommended way to index dataframes, because it is ambiguous. For instance, let's try to select the third row of the dataframe.



In [None]:
# Trying to select the third row: raises a KeyError
market_df[2]

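Pandas raises a ```KeyError``` because a single value inside ```[]``` is treated as a column label, and there is no column named ```2```. Even if ```[2]``` did select rows, it would be unclear whether it refers to a *position* or a *label*. Recall from the previous section that you can change the row indices.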
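A minimal sketch on a hypothetical toy dataframe (```toy_df``` is not part of the dataset) reproduces the error and shows that a one-row slice still works with plain ```[]```:

```python
import pandas as pd

# Hypothetical toy data with a single column named "Sales"
toy_df = pd.DataFrame({"Sales": [10.0, 20.0, 30.0, 40.0]})

try:
    toy_df[2]                # looked up as a column label; no such column
    outcome = "no error"
except KeyError:
    outcome = "KeyError"
print(outcome)               # KeyError

# A slice is treated as a row selection, so this returns the third row
third_row = toy_df[2:3]
print(third_row["Sales"].tolist())  # [30.0]
```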

In [None]:
# Changing the row indices to Ord_id
market_df.set_index('Ord_id').head()
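Note that ```set_index``` returns a new dataframe by default and leaves the original unchanged. A small sketch on hypothetical toy data (the ```Ord_id``` values are made up):

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Ord_id": ["Ord_1", "Ord_2"],
    "Sales": [10.0, 20.0],
})

indexed = toy_df.set_index("Ord_id")

print(toy_df.index.tolist())        # [0, 1]  (original is untouched)
print(indexed.index.tolist())       # ['Ord_1', 'Ord_2']
print("Ord_id" in indexed.columns)  # False: the column became the index
```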

Now imagine you had a column with entries ```[2, 4, 7, 8 ...]```, and you set that as the index. What should ```df[2]``` return?
The third row (position 2), or the row with the index label 2 (the first row)?

Taking an example from this dataset, say you decide to assign the ```Order_Quantity``` column as the index.

In [None]:
market_df.set_index('Order_Quantity', inplace = True)

Now, what should ```df[13]``` return - the 14th row, or the row with index label 13 (i.e. the second row)?

Because of this and other similar ambiguities, pandas provides **explicit ways** to subset dataframes: position-based indexing and label-based indexing, which we'll study next.

In [None]:
market_df.head(20)

In [None]:
# Raises a KeyError: a single value inside [] is looked up as a column label
market_df[13]
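The same behaviour can be seen on a small scale. The sketch below assumes a hypothetical toy dataframe with made-up ```Order_Quantity``` values, including a repeated 13. A single integer inside ```[]``` is looked up among the column labels, while an integer slice on an integer index selects rows by position, not by label:

```python
import pandas as pd

# Hypothetical toy data; the quantities are made up
toy_df = pd.DataFrame({
    "Order_Quantity": [26, 13, 43, 13],
    "Sales": [10.0, 20.0, 30.0, 40.0],
})
toy_df = toy_df.set_index("Order_Quantity")

try:
    toy_df[13]               # neither position 13 nor index label 13
    outcome = "no error"
except KeyError:
    outcome = "KeyError"
print(outcome)               # KeyError

# An integer slice picks rows by position, regardless of the index labels
first_two = toy_df[0:2]
print(first_two.index.tolist())  # [26, 13]
```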