##### <b> Filtering DataFrames </b></br> Filter the rows of a DataFrame by passing a logical test to the `.loc[]` accessor, just as with a Series or NumPy array. </br>

In [28]:
import pandas as pd
import numpy as np

### **Operators and Methods to Create Boolean Filters for Logical Tests**

| Description                 | Python Operator | Pandas Method |
|-----------------------------|-----------------|---------------|
| Equal                       | `==`            | `.eq()`       |
| Not Equal                   | `!=`            | `.ne()`       |
| Less Than or Equal          | `<=`            | `.le()`       |
| Less Than                   | `<`             | `.lt()`       |
| Greater Than or Equal       | `>=`            | `.ge()`       |
| Greater Than                | `>`             | `.gt()`       |
| Membership Test             | `in`            | `.isin()`     |
| Inverse Membership Test     | `not in`        | `~.isin()`    |
##### .isin() method syntax: df[`column_name_to_be_searched`].isin([`list of EXACT search strings`]); for partial matches use .str.contains(`string characters`) instead
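
The table can be checked on a small example. The sketch below uses a made-up `toy_df` (its name and values are illustrative, not taken from the retail data) to compare an operator with its method form and to contrast `.isin()` with `.str.contains()`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the retail file
toy_df = pd.DataFrame({
    "family": ["AUTOMOTIVE", "DAIRY", "DELI", "BEAUTY"],
    "sales": [8.0, 0.0, 15.5, 9.0],
})

# The operator and the method build the same boolean Series
op_mask = toy_df["sales"] >= 9
method_mask = toy_df["sales"].ge(9)
same_result = op_mask.equals(method_mask)

# .isin() matches whole values only; ~ inverts the mask
in_mask = toy_df["family"].isin(["DAIRY", "DELI"])
not_in_mask = ~toy_df["family"].isin(["DAIRY", "DELI"])

# .str.contains() matches substrings (the pattern is a regex by default)
contains_mask = toy_df["family"].str.contains("AUTO")

print(toy_df.loc[in_mask])
print(toy_df.loc[contains_mask])
```

Passing `'AUTO'` to `.isin()` would match nothing here, because no value is exactly `'AUTO'`.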

In [29]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [30]:
# build a boolean mask; .loc returns every row where the mask is True
mask = retail_df['date'] == '2016-10-28'
retail_df.loc[mask]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
536382,2482326,2016-10-28,1,AUTOMOTIVE,8.000,0
536383,2482327,2016-10-28,1,BABY CARE,0.000,0
536384,2482328,2016-10-28,1,BEAUTY,9.000,1
536385,2482329,2016-10-28,1,BEVERAGES,2576.000,38
536386,2482330,2016-10-28,1,BOOKS,0.000,0
...,...,...,...,...,...,...
538159,2484103,2016-10-28,9,POULTRY,391.292,24
538160,2484104,2016-10-28,9,PREPARED FOODS,78.769,1
538161,2484105,2016-10-28,9,PRODUCE,993.760,5
538162,2484106,2016-10-28,9,SCHOOL AND OFFICE SUPPLIES,0.000,0


In [31]:
# build a boolean mask for the rows, then pass it to .loc with the list of columns to keep
mask1 = retail_df['date'] == '2016-10-28'
retail_df.loc[mask1, ['date', 'sales']]

Unnamed: 0,date,sales
536382,2016-10-28,8.000
536383,2016-10-28,0.000
536384,2016-10-28,9.000
536385,2016-10-28,2576.000
536386,2016-10-28,0.000
...,...,...
538159,2016-10-28,391.292
538160,2016-10-28,78.769
538161,2016-10-28,993.760
538162,2016-10-28,0.000
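
Selecting rows and columns in one step can be sketched on a small frame. The first argument to `.loc` is the row mask and the second is the list of column labels. The `toy_df` and `date_mask` names and the values below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "date": ["2016-10-27", "2016-10-28", "2016-10-28"],
    "family": ["BOOKS", "BEAUTY", "BEVERAGES"],
    "sales": [0.0, 9.0, 2576.0],
})

# Row filter: True where the date matches
date_mask = toy_df["date"] == "2016-10-28"

# .loc[rows, columns] keeps the matching rows and only the listed columns
subset = toy_df.loc[date_mask, ["date", "sales"]]
print(subset)
```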


##### <b> Applying Multiple Filters </b></br> Join logical tests with `&` (and) or `|` (or), wrapping each test in its own parentheses.
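
Each test needs its own parentheses because `&` and `|` bind more tightly than comparison operators such as `>` and `==`. The following is a minimal sketch on made-up data (`toy_df` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "family": ["AUTOMOTIVE", "CLEANING", "DAIRY", "CLEANING"],
    "sales": [4.0, 0.0, 12.0, 734.0],
})

# AND: both tests must be True for a row to pass
and_mask = (toy_df["family"].isin(["AUTOMOTIVE", "CLEANING"])) & (toy_df["sales"] > 0)

# OR: a row passes when either test is True
or_mask = (toy_df["family"] == "DAIRY") | (toy_df["sales"] > 500)

print(toy_df.loc[and_mask])
print(toy_df.loc[or_mask])
```

Without the parentheses, an expression like `a.isin(...) & b > 0` is evaluated as `(a.isin(...) & b) > 0`, which is not the intended test.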

In [32]:
# create a complex boolean mask for string character search using .str.contains('characters')
mask2 = (retail_df['family'].str.contains('AUTO')) | (retail_df['family'].str.contains('DAIR'))
retail_df.loc[mask2]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
8,1945952,2016-01-01,1,DAIRY,0.0,0
33,1945977,2016-01-01,10,AUTOMOTIVE,0.0,0
41,1945985,2016-01-01,10,DAIRY,0.0,0
66,1946010,2016-01-01,11,AUTOMOTIVE,0.0,0
...,...,...,...,...,...,...
1054853,3000797,2017-08-15,7,DAIRY,1279.0,25
1054878,3000822,2017-08-15,8,AUTOMOTIVE,4.0,0
1054886,3000830,2017-08-15,8,DAIRY,1330.0,24
1054911,3000855,2017-08-15,9,AUTOMOTIVE,15.0,0


In [33]:
# create a complex boolean mask using the membership test .isin([])
# each test is wrapped in parentheses because & binds more tightly than >
mask3 = ((retail_df['family'].isin(['AUTOMOTIVE', 'CLEANING'])) &
         (retail_df['sales'] > 0))
retail_df.loc[mask3]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
561,1946505,2016-01-01,25,AUTOMOTIVE,4.0,0
568,1946512,2016-01-01,25,CLEANING,734.0,0
1782,1947726,2016-01-02,1,AUTOMOTIVE,7.0,0
1789,1947733,2016-01-02,1,CLEANING,526.0,3
1815,1947759,2016-01-02,10,AUTOMOTIVE,1.0,0
...,...,...,...,...,...,...
1054852,3000796,2017-08-15,7,CLEANING,1139.0,9
1054878,3000822,2017-08-15,8,AUTOMOTIVE,4.0,0
1054885,3000829,2017-08-15,8,CLEANING,1198.0,13
1054911,3000855,2017-08-15,9,AUTOMOTIVE,15.0,0


In [34]:
# create boolean mask for year 2016 with sales greater than 500
mask2016 = ((retail_df['date'].str[:4] == '2016') 
            & (retail_df['sales'] > 500))
retail_df.loc[mask2016]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
564,1946508,2016-01-01,25,BEVERAGES,5104.000,1
566,1946510,2016-01-01,25,BREAD/BAKERY,680.952,0
568,1946512,2016-01-01,25,CLEANING,734.000,0
569,1946513,2016-01-01,25,DAIRY,1033.000,11
572,1946516,2016-01-01,25,FROZEN FOODS,596.125,0
...,...,...,...,...,...,...
650409,2596353,2016-12-31,9,GROCERY I,7657.226,147
650415,2596359,2016-12-31,9,HOME CARE,515.000,7
650422,2596366,2016-12-31,9,PERSONAL CARE,516.000,13
650425,2596369,2016-12-31,9,POULTRY,687.853,1
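
The string slice works because the dates are stored as ISO-formatted text. As an alternative sketch (a different technique from the one used above), the column can be parsed with `pd.to_datetime` and filtered on its `.dt.year` component. The `toy_df` data below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "date": ["2016-01-01", "2016-12-31", "2017-05-11"],
    "sales": [680.952, 120.0, 1194.0],
})

# Parse the text dates once, then compare the year as an integer
dates = pd.to_datetime(toy_df["date"])
mask_2016 = (dates.dt.year == 2016) & (toy_df["sales"] > 500)

result = toy_df.loc[mask_2016]
print(result)
```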


### **Query Method** </br> The `.query()` method filters a DataFrame with SQL-like syntax. </br> Whether to use it often depends on what teams within the company allow. </br> - Create complex filters with the `and` and `or` keywords </br> - Use the `in` keyword from base Python: df.query("`column_to_filter` in [`'list_of_search_text'`]"), adding an `and`/`or` condition if needed </br> - Similar to SQL, prefix a variable with `@` to reference it in the query statement (`@variable_name`) </br> - Slicing cannot be used inside `.query()`, but date values can be parsed into their own columns and filtered on
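
The points above can be sketched on a small made-up frame. The names `toy_df`, `min_sales`, and the `year` column are assumptions introduced for illustration. The last step shows one way around the slicing limitation: derive the year as its own column first, then query it.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "date": ["2016-01-02", "2016-01-02", "2017-08-15", "2017-08-15"],
    "family": ["CLEANING", "DAIRY", "DELI", "DAIRY"],
    "sales": [526.0, 0.0, 276.639, 1330.0],
})

# A local Python variable, referenced inside the query string with @
min_sales = 100

# Whole statement in double quotes, search text in single quotes
filtered = toy_df.query("family in ['CLEANING', 'DAIRY'] and sales > @min_sales")

# Slicing is not available inside the query string,
# so the year is parsed into its own column before querying
toy_df["year"] = toy_df["date"].str[:4].astype(int)
filtered_2017 = toy_df.query("year == 2017 and sales > @min_sales")

print(filtered)
print(filtered_2017)
```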

In [35]:
# .query() takes the whole statement in double quotes, with search text in single quotes
retail_df.query(
    "family in ['CLEANING', 'DAIRY'] and sales > 0"
    )

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
568,1946512,2016-01-01,25,CLEANING,734.0,0
569,1946513,2016-01-01,25,DAIRY,1033.0,11
1789,1947733,2016-01-02,1,CLEANING,526.0,3
1790,1947734,2016-01-02,1,DAIRY,627.0,15
1822,1947766,2016-01-02,10,CLEANING,1216.0,4
...,...,...,...,...,...,...
1054853,3000797,2017-08-15,7,DAIRY,1279.0,25
1054885,3000829,2017-08-15,8,CLEANING,1198.0,13
1054886,3000830,2017-08-15,8,DAIRY,1330.0,24
1054918,3000862,2017-08-15,9,CLEANING,1439.0,25


In [36]:
# store the average sales in a variable
avg_sales = retail_df['sales'].mean()
avg_sales

457.72248700136413

In [37]:
# reference a Python variable inside the query string with @variable_name
retail_df.query(
    "family in ['CLEANING', 'DAIRY'] and sales > @avg_sales"
    )

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
568,1946512,2016-01-01,25,CLEANING,734.0,0
569,1946513,2016-01-01,25,DAIRY,1033.0,11
1789,1947733,2016-01-02,1,CLEANING,526.0,3
1790,1947734,2016-01-02,1,DAIRY,627.0,15
1822,1947766,2016-01-02,10,CLEANING,1216.0,4
...,...,...,...,...,...,...
1054853,3000797,2017-08-15,7,DAIRY,1279.0,25
1054885,3000829,2017-08-15,8,CLEANING,1198.0,13
1054886,3000830,2017-08-15,8,DAIRY,1330.0,24
1054918,3000862,2017-08-15,9,CLEANING,1439.0,25


### **Sorting DataFrames by Indices** </br> Sort a DataFrame by its indices with the `.sort_index()` method </br> - Rows are sorted by default (`axis=0`); pass `axis=1` to sort columns by their labels
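
A minimal sketch of both axes follows, using a tiny made-up frame with an unordered index (`toy_df` is an illustrative name). Note that `.sort_index()` returns a new DataFrame and leaves the original untouched unless `inplace=True` is passed.

```python
import pandas as pd

# Hypothetical toy data with a deliberately unordered row index
toy_df = pd.DataFrame(
    {"sales": [212.0, 131.545, 1194.0], "family": ["DELI", "DELI", "BEVERAGES"]},
    index=[74292, 13506, 882588],
)

# Sort rows by index label, ascending (default) and descending
rows_ascending = toy_df.sort_index()
rows_descending = toy_df.sort_index(ascending=False)

# Sort columns alphabetically by label
columns_sorted = toy_df.sort_index(axis=1)

print(rows_ascending)
print(columns_sorted)
```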

In [39]:
# create sample DataFrame by filtering rows for 3 product families
condition = retail_df['family'].isin(['BEVERAGES', 'DAIRY', 'DELI'])
retail_df[condition]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
3,1945947,2016-01-01,1,BEVERAGES,0.000,0
8,1945952,2016-01-01,1,DAIRY,0.000,0
9,1945953,2016-01-01,1,DELI,0.000,0
36,1945980,2016-01-01,10,BEVERAGES,0.000,0
41,1945985,2016-01-01,10,DAIRY,0.000,0
...,...,...,...,...,...,...
1054886,3000830,2017-08-15,8,DAIRY,1330.000,24
1054887,3000831,2017-08-15,8,DELI,276.639,8
1054914,3000858,2017-08-15,9,BEVERAGES,3530.000,26
1054919,3000863,2017-08-15,9,DAIRY,835.000,19


In [43]:
# grab 5 sample rows into sample_df
sample_df = retail_df[condition].sample(5, random_state=2021)

In [44]:
# sort sample_df by index ascending (default)
sample_df.sort_index()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
13506,1959450,2016-01-08,38,DELI,131.545,43
74292,2020236,2016-02-11,43,DELI,212.0,2
445008,2390952,2016-09-06,45,BEVERAGES,8339.0,19
495966,2441910,2016-10-05,25,DELI,0.0,0
882588,2828532,2017-05-11,23,BEVERAGES,1194.0,22


In [45]:
# sort sample_df by index descending 
sample_df.sort_index(ascending=False)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
882588,2828532,2017-05-11,23,BEVERAGES,1194.0,22
495966,2441910,2016-10-05,25,DELI,0.0,0
445008,2390952,2016-09-06,45,BEVERAGES,8339.0,19
74292,2020236,2016-02-11,43,DELI,212.0,2
13506,1959450,2016-01-08,38,DELI,131.545,43


In [53]:
# sort sample_df by column index (or column labels) ascending (default)
sample_df.sort_index(axis=1, inplace=True)
sample_df

Unnamed: 0,date,family,id,onpromotion,sales,store_nbr
74292,2016-02-11,DELI,2020236,2,212.0,43
13506,2016-01-08,DELI,1959450,43,131.545,38
882588,2017-05-11,BEVERAGES,2828532,22,1194.0,23
445008,2016-09-06,BEVERAGES,2390952,19,8339.0,45
495966,2016-10-05,DELI,2441910,0,0.0,25


### **Sorting DataFrames by Their Values** </br> Sort a DataFrame by its values with the `.sort_values()` method </br> - Sort by a single column or by multiple columns </br> - Set ascending/descending per column with .sort_values([`list of columns`], ascending=[`list of True or False for each column`])
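
The per-column sort direction can be sketched as follows, on made-up data (`toy_df` and its values are assumptions for illustration). The `ascending` list is matched to the column list by position.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "family": ["DELI", "BEVERAGES", "DELI", "BEVERAGES"],
    "sales": [212.0, 1194.0, 131.545, 8339.0],
    "store_nbr": [43, 23, 38, 45],
})

# Single column, ascending by default
by_store = toy_df.sort_values("store_nbr")

# Two columns: family ascending, then sales descending within each family
by_family_then_sales = toy_df.sort_values(
    ["family", "sales"], ascending=[True, False]
)

print(by_store)
print(by_family_then_sales)
```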

In [54]:
# sort sample_df by 1 column 'store_nbr'
sample_df.sort_values('store_nbr')

Unnamed: 0,date,family,id,onpromotion,sales,store_nbr
882588,2017-05-11,BEVERAGES,2828532,22,1194.0,23
495966,2016-10-05,DELI,2441910,0,0.0,25
13506,2016-01-08,DELI,1959450,43,131.545,38
74292,2016-02-11,DELI,2020236,2,212.0,43
445008,2016-09-06,BEVERAGES,2390952,19,8339.0,45


In [57]:
# sort sample_df by 2 columns, with 'family' sorted ascending and 'sales' sorted descending
sample_df.sort_values(['family', 'sales'], ascending=[True,False])

Unnamed: 0,date,family,id,onpromotion,sales,store_nbr
445008,2016-09-06,BEVERAGES,2390952,19,8339.0,45
882588,2017-05-11,BEVERAGES,2828532,22,1194.0,23
74292,2016-02-11,DELI,2020236,2,212.0,43
13506,2016-01-08,DELI,1959450,43,131.545,38
495966,2016-10-05,DELI,2441910,0,0.0,25
