**loc and iloc**
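loc selects rows and columns by **label**.
iloc selects rows and columns by **integer position**.

The main difference is that loc works with index and column labels, whereas iloc works with zero-based integer positions. Slices also behave differently: loc includes the end label, while iloc excludes the end position.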

In [4]:
import pandas as pd
pd.options.display.width = 1000

sales = pd.read_csv('sales.csv')
print("access using loc")
print(sales.loc[:4, ["product_code","product_group"]])

print("access using iloc")
print(sales.iloc[:4, [0, 1]])

access using loc
   product_code product_group
0          4187           PG2
1          4195           PG2
2          4204           PG2
3          4219           PG2
4          4718           PG2
access using iloc
   product_code product_group
0          4187           PG2
1          4195           PG2
2          4204           PG2
3          4219           PG2
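
The two outputs above differ by one row because a loc slice includes its end label while an iloc slice stops before its end position. The sketch below reproduces this on a small stand-in frame; `demo` and its last row are made up for illustration and are not taken from sales.csv.

```python
import pandas as pd

# Hypothetical stand-in for sales.csv with the default integer index 0..5
demo = pd.DataFrame({
    "product_code": [4187, 4195, 4204, 4219, 4718, 4720],
    "product_group": ["PG2", "PG2", "PG2", "PG2", "PG2", "PG1"],
})

# Label-based slice: the end label 4 is included
by_label = demo.loc[:4, ["product_code", "product_group"]]

# Position-based slice: the end position 4 is excluded
by_position = demo.iloc[:4, [0, 1]]

print(len(by_label))     # 5
print(len(by_position))  # 4
```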


In [5]:
import numpy as np
df = pd.DataFrame(
  np.random.randint(10, size=(4,4)),
  index = ["a","b","c","d"],
  columns = ["col_a","col_b","col_c","col_d"]
  )

print(df)

print("\nSelect two rows and two columns using loc:")
print(df.loc[["b","d"], ["col_a","col_c"]])
print("\nSelect two rows and two columns using iloc:")
print(df.iloc[[1,3], [0,2]])

   col_a  col_b  col_c  col_d
a      4      9      0      9
b      5      4      4      0
c      6      8      3      9
d      2      9      3      5

Select two rows and two columns using loc:
   col_a  col_c
b      5      4
d      2      3

Select two rows and two columns using iloc:
   col_a  col_c
b      5      4
d      2      3
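
The distinction matters most when the index labels are integers that do not match the row positions, for example after sorting or filtering. A minimal sketch, using a made-up frame `shuffled` whose index is deliberately out of order:

```python
import pandas as pd

# Hypothetical frame: integer labels that do not match row positions
shuffled = pd.DataFrame({"value": ["x", "y", "z"]}, index=[2, 0, 1])

# loc looks up the row whose label is 0 (the second row)
by_label = shuffled.loc[0, "value"]

# iloc looks up the row at position 0 (the first row)
by_position = shuffled.iloc[0, 0]

print(by_label)     # y
print(by_position)  # x
```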


In [11]:
# select columns by name
# pass the column names in a list to get a DataFrame, even when selecting a single column
# passing a single column name as a plain string returns a Series instead
print(sales[['product_code', 'product_group']].head(5))
# returns a DataFrame
print(sales[["product_code"]].head(5))
# also returns a DataFrame; sales["product_code"] would return a Series


   product_code product_group
0          4187           PG2
1          4195           PG2
2          4204           PG2
3          4219           PG2
4          4718           PG2
   product_code
0          4187
1          4195
2          4204
3          4219
4          4718
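
To make the Series versus DataFrame distinction concrete, the sketch below checks the type and shape of both forms of selection. The frame `demo` is a two-row stand-in, not the real sales data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the sales data
demo = pd.DataFrame({
    "product_code": [4187, 4195],
    "product_group": ["PG2", "PG2"],
})

as_series = demo["product_code"]    # single string -> Series
as_frame = demo[["product_code"]]   # list with one name -> DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```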


In [18]:
# filter using OR (|); for AND use & and keep each condition in parentheses
sales_filtered = sales[(sales["product_group"] == "PG1") | (sales["product_group"] == "PG2")]
sales_filtered.head(5)

# filter using isin
sales_filtered = sales[sales["product_group"].isin(["PG1", "PG2", "PG10"])]
sales_filtered.head(5)

# filter using NOT (~)
sales_filtered = sales[~sales["product_group"].isin(["PG1", "PG2", "PG10"])]
sales_filtered.head(5)

# filter using query
sales_filtered = sales.query("product_group == 'PG1' or product_group == 'PG2'")
sales_filtered.head(5)


Unnamed: 0  product_code  product_group  stock_qty    cost    price  last_week_sales  last_month_sales
         0          4187            PG2        498  420.76   569.91               13                58
         1          4195            PG2        473  545.64   712.41               16                58
         2          4204            PG2        968  640.42   854.91               22                88
         3          4219            PG2        241  869.69  1034.55               14                45
         4          4718            PG2       1401   12.54    26.59               50               285
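
The cell above only exercises OR. As a minimal sketch of the AND case, the example below combines two conditions with & and writes the same filter with query. The frame `demo` and the price threshold of 100 are made up for illustration.

```python
import pandas as pd

# Hypothetical data, not taken from sales.csv
demo = pd.DataFrame({
    "product_group": ["PG1", "PG2", "PG3", "PG2"],
    "price": [10.0, 250.0, 40.0, 15.0],
})

# AND: wrap each condition in parentheses, because & binds tighter than == and >
pg2_expensive = demo[(demo["product_group"] == "PG2") & (demo["price"] > 100)]

# The same filter expressed with query, which accepts the keyword "and"
pg2_expensive_q = demo.query("product_group == 'PG2' and price > 100")

print(pg2_expensive)
print(pg2_expensive.equals(pg2_expensive_q))
```

Omitting the parentheses around each condition typically raises an error, since Python would try to evaluate the & between the string and the column first.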


The challenge is to find the number of products that are priced higher than the average product price in the sales data frame. As a hint, keep in mind that you need to use the operations specified with the bullet points above.

In [19]:
average_price = sales["price"].mean()
# count the rows priced above the average (assumes one row per product)
n_above_average = (sales["price"] > average_price).sum()
sales.head(5)

Unnamed: 0  product_code  product_group  stock_qty    cost    price  last_week_sales  last_month_sales
         0          4187            PG2        498  420.76   569.91               13                58
         1          4195            PG2        473  545.64   712.41               16                58
         2          4204            PG2        968  640.42   854.91               22                88
         3          4219            PG2        241  869.69  1034.55               14                45
         4          4718            PG2       1401   12.54    26.59               50               285
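
One way to approach the challenge is to compute the mean price, filter the rows above it, and count what remains. The sketch below does this on a small made-up frame where the answer can be checked by hand; it assumes each row represents one product, and the names `demo`, `above`, and `n_above` are illustrative only.

```python
import pandas as pd

# Hypothetical prices chosen so the result is easy to verify by hand
demo = pd.DataFrame({
    "product_code": [1, 2, 3, 4],
    "price": [10.0, 20.0, 30.0, 100.0],
})

average_price = demo["price"].mean()  # (10 + 20 + 30 + 100) / 4 = 40.0

# Boolean filter, then count the remaining rows
above = demo[demo["price"] > average_price]
n_above = len(above)

# The same count with query; @ refers to a local Python variable
n_above_q = len(demo.query("price > @average_price"))

print(average_price)  # 40.0
print(n_above)        # 1
```

On the real data, replacing `demo` with `sales` should give the count asked for in the challenge, provided sales.csv has one row per product.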
