In this lab, we will experiment with some useful pandas features for manipulating dataframes: appending rows, joining and combining dataframes, and filtering with lambdas. Pandas is a widely used Python library that helps to organize and analyze data.

In [2]:
import pandas as pd

We have a dictionary containing order information for some products: the order IDs, the product names, the order quantities, and the unit price of each product.

In [3]:
order_info = {
    "order_id": [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    "product_name": ["Keyboard", "Jacket", "Shoe", "Tennis Racket","Basketball", "Microphone", "Couch"],
    "quantity": [2, 1, 3, 4, 6, 2, 1],
    "unit_price": [22.5, 79.99, 49.5, 30, 15, 35.5, 349.99]
}

In [4]:
# Now, we create a dataframe named "order_df" from this order information
order_df = pd.DataFrame(order_info)

In [5]:
order_df

Unnamed: 0,order_id,product_name,quantity,unit_price
0,1001,Keyboard,2,22.5
1,1002,Jacket,1,79.99
2,1003,Shoe,3,49.5
3,1004,Tennis Racket,4,30.0
4,1005,Basketball,6,15.0
5,1006,Microphone,2,35.5
6,1007,Couch,1,349.99


We have another dictionary containing the names, categories, and manufacturers of the products.

In [6]:
sales_info = {
    "product_name": ["Shoe", "Keyboard","Tennis Racket", "Guitar", "Biscuits", "Basketball"],
    "category": ["Clothing", "Electronics", "Sports", "Music", "Food", "Sports"],
    "manufacturer": ["Manufacturer_A", "Manufacturer_B", "Manufacturer_C", "Manufacturer_D", "Manufacturer_E", "Manufacturer_C"]
}

Let's convert this dictionary into a dataframe for further processing of the data.

In [7]:
sales_df = pd.DataFrame(sales_info)

In [8]:
sales_df

Unnamed: 0,product_name,category,manufacturer
0,Shoe,Clothing,Manufacturer_A
1,Keyboard,Electronics,Manufacturer_B
2,Tennis Racket,Sports,Manufacturer_C
3,Guitar,Music,Manufacturer_D
4,Biscuits,Food,Manufacturer_E
5,Basketball,Sports,Manufacturer_C


In [9]:
# Exercise:
# Combine the rows given in this cell with order_df

new_order_info = {
    "order_id": [1008, 1009],
    "product_name": ["Watch", "Pen"],
    "quantity": [2, 12],
    "unit_price": [68, 0.4]
}

# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
# new_order_df = pd.DataFrame(new_order_info)
# order_df = pd.concat([order_df, new_order_df], ignore_index=True)
# order_df
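
On pandas 2.0 and later, `DataFrame.append` is no longer available, so the exercise above is usually solved with `pd.concat`. The following is a minimal sketch on a shortened two-row stand-in for `order_df` (an assumption made for brevity); `combined_df` is a hypothetical name used only here.

```python
import pandas as pd

# Shortened stand-in for order_df (assumption: only two of the original rows)
order_df = pd.DataFrame({
    "order_id": [1001, 1002],
    "product_name": ["Keyboard", "Jacket"],
    "quantity": [2, 1],
    "unit_price": [22.5, 79.99],
})

new_order_df = pd.DataFrame({
    "order_id": [1008, 1009],
    "product_name": ["Watch", "Pen"],
    "quantity": [2, 12],
    "unit_price": [68, 0.4],
})

# pd.concat stacks the rows of both dataframes;
# ignore_index=True renumbers the resulting index from 0
combined_df = pd.concat([order_df, new_order_df], ignore_index=True)
print(combined_df)
```

Without `ignore_index=True`, the new rows would keep their own index labels (0 and 1), which would duplicate labels already present in `order_df`.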

Both dataframes, order_df and sales_df, have the 'product_name' column in common. We can combine the two dataframes on this column to get a fuller view of our data. The "join" method in pandas lets us do that in several ways.

The "join" method combines two dataframes based on a common key. By default it matches against the index of the other dataframe, which is why the cells in this lab call set_index('product_name') on sales_df before joining. Pandas supports several types of join:

Inner Join: returns only the rows that have matching values for the selected column in both DataFrames

Outer Join: returns all rows from both DataFrames, with NaN values where there's no matching value for the selected column

Left Join: returns all rows from the left DataFrame and the matched rows from the right DataFrame, with NaN values where there are no matches

Right Join: returns all rows from the right DataFrame and the matched rows from the left DataFrame, with NaN values where there are no matches
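
To make the difference between the four join types concrete, here is a small sketch that counts the rows each one returns. The three-row dataframes are cut-down stand-ins for `order_df` and `sales_df` (an assumption for brevity), and `row_counts` is a hypothetical name.

```python
import pandas as pd

# Cut-down stand-ins: 'Keyboard' and 'Shoe' appear in both dataframes,
# 'Jacket' only in order_df, 'Guitar' only in sales_df
order_df = pd.DataFrame({
    "product_name": ["Keyboard", "Jacket", "Shoe"],
    "quantity": [2, 1, 3],
})
sales_df = pd.DataFrame({
    "product_name": ["Shoe", "Keyboard", "Guitar"],
    "category": ["Clothing", "Electronics", "Music"],
})

# Count the rows produced by each join type on 'product_name'
row_counts = {
    how: len(order_df.join(sales_df.set_index("product_name"),
                           on="product_name", how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)  # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}
```

The inner join keeps only the two shared products, the left and right joins keep every row of one side, and the outer join keeps all four distinct products.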

For example, if we want to get all the information for the orders that have entries in both order_df and sales_df, we can use an 'inner' join.

In [10]:
# Performing inner join between the dataframes based on 'product_name' column
# returns the product information for products which have rows in both dataframes
order_df.join(sales_df.set_index('product_name'), on='product_name', how="inner")

Unnamed: 0,order_id,product_name,quantity,unit_price,category,manufacturer
0,1001,Keyboard,2,22.5,Electronics,Manufacturer_B
2,1003,Shoe,3,49.5,Clothing,Manufacturer_A
3,1004,Tennis Racket,4,30.0,Sports,Manufacturer_C
4,1005,Basketball,6,15.0,Sports,Manufacturer_C


We will now perform an outer join between the two dataframes, order_df and sales_df. This matches on the 'product_name' column and keeps all the rows from both dataframes. Cells with no matching data are populated with NaN.

In [11]:
outer_results = order_df.join(sales_df.set_index('product_name'), on='product_name', how="outer")
outer_results

Unnamed: 0,order_id,product_name,quantity,unit_price,category,manufacturer
0.0,1001.0,Keyboard,2.0,22.5,Electronics,Manufacturer_B
1.0,1002.0,Jacket,1.0,79.99,,
2.0,1003.0,Shoe,3.0,49.5,Clothing,Manufacturer_A
3.0,1004.0,Tennis Racket,4.0,30.0,Sports,Manufacturer_C
4.0,1005.0,Basketball,6.0,15.0,Sports,Manufacturer_C
5.0,1006.0,Microphone,2.0,35.5,,
6.0,1007.0,Couch,1.0,349.99,,
,,Guitar,,,Music,Manufacturer_D
,,Biscuits,,,Food,Manufacturer_E


In [28]:
# Exercise:
# Join order_df and sales_df using the common column, 'product_name'.
# While joining, discard the rows from sales_df whose 'product_name' value doesn't match any row in order_df

left_results = order_df.join(sales_df.set_index('product_name'), on='product_name', how="left")

In [26]:
left_results

Unnamed: 0,order_id,product_name,quantity,unit_price,category,manufacturer
0,1001,Keyboard,2,22.5,Electronics,Manufacturer_B
1,1002,Jacket,1,79.99,,
2,1003,Shoe,3,49.5,Clothing,Manufacturer_A
3,1004,Tennis Racket,4,30.0,Sports,Manufacturer_C
4,1005,Basketball,6,15.0,Sports,Manufacturer_C
5,1006,Microphone,2,35.5,,
6,1007,Couch,1,349.99,,


In [29]:
# Exercise:
# Join order_df and sales_df using the common column, 'product_name'.
# While joining, discard the rows from order_df whose 'product_name' value doesn't match any row in sales_df

right_df = order_df.join(sales_df.set_index('product_name'), on='product_name', how="right")

In [15]:
right_df

Unnamed: 0,order_id,product_name,quantity,unit_price,category,manufacturer
2.0,1003.0,Shoe,3.0,49.5,Clothing,Manufacturer_A
0.0,1001.0,Keyboard,2.0,22.5,Electronics,Manufacturer_B
3.0,1004.0,Tennis Racket,4.0,30.0,Sports,Manufacturer_C
,,Guitar,,,Music,Manufacturer_D
,,Biscuits,,,Food,Manufacturer_E
4.0,1005.0,Basketball,6.0,15.0,Sports,Manufacturer_C


# Lambda

A lambda in Python is an anonymous function that can take any number of arguments and returns the value of a single expression. The format for writing a lambda is: "lambda inputs: expression".

Let's look at the order_df dataframe again. We want to get the rows for which the quantity of the product is less than or equal to 2. First, we can use a lambda to get the boolean status of each row, based on its quantity value, as a pandas Series. If the quantity is less than or equal to 2, the boolean value is True; otherwise it is False.
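
As a quick illustration of the syntax, the sketch below defines a two-argument lambda and the equivalent named function. The names `total_price` and `total_price_def` are hypothetical and used only for this example.

```python
# A lambda with two inputs; the value of the expression is returned
total_price = lambda quantity, unit_price: quantity * unit_price

# The equivalent function written with def
def total_price_def(quantity, unit_price):
    return quantity * unit_price

print(total_price(2, 22.5))      # 45.0
print(total_price_def(2, 22.5))  # 45.0
```

Lambdas are most useful when passed directly to another function, as in the `apply` calls used in this lab, rather than being assigned to a name.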

In [16]:
# Get a boolean series using lambda considering the quantity of the products
low_purchase_series = order_df['quantity'].apply(lambda x: x <= 2)
low_purchase_series

0     True
1     True
2    False
3    False
4    False
5     True
6     True
Name: quantity, dtype: bool

In [17]:
# Check the type of the result we got
type(low_purchase_series)

pandas.core.series.Series

In [18]:
# Display the rows in order_df whose quantity is at most 2
order_df[low_purchase_series]

Unnamed: 0,order_id,product_name,quantity,unit_price
0,1001,Keyboard,2,22.5
1,1002,Jacket,1,79.99
5,1006,Microphone,2,35.5
6,1007,Couch,1,349.99


In [19]:
# All together, in a single line:
order_df[order_df['quantity'].apply(lambda x: x <= 2)]

Unnamed: 0,order_id,product_name,quantity,unit_price
0,1001,Keyboard,2,22.5
1,1002,Jacket,1,79.99
5,1006,Microphone,2,35.5
6,1007,Couch,1,349.99


In [20]:
# Exercise:
# Display all information (including category and manufacturer, if present) about the products with a unit price greater than 40
# Hint: Try to use results from left join

# left_results[left_results['unit_price'].apply(lambda u: u > 40)]

In [21]:
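# Check whether the total price (quantity * unit_price) of each order is less than a threshold
# axis=1 passes each row to the lambda instead of each column
# left_results.apply(lambda x: x['quantity'] * x['unit_price'] < 300, axis=1)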
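
When a lambda needs values from more than one column, `DataFrame.apply` has to be called with `axis=1` so that each row is passed to the lambda. The sketch below shows this on a three-row stand-in for `order_df` (an assumption for brevity), together with an equivalent column-wise expression; `threshold`, `below_threshold`, and `below_threshold_vec` are hypothetical names.

```python
import pandas as pd

# Three-row stand-in for order_df
order_df = pd.DataFrame({
    "product_name": ["Keyboard", "Tennis Racket", "Couch"],
    "quantity": [2, 4, 1],
    "unit_price": [22.5, 30, 349.99],
})

threshold = 300

# axis=1: the lambda receives one row at a time (as a Series)
below_threshold = order_df.apply(
    lambda row: row["quantity"] * row["unit_price"] < threshold, axis=1
)

# The same check written with whole-column arithmetic, without a lambda
below_threshold_vec = order_df["quantity"] * order_df["unit_price"] < threshold

print(below_threshold.tolist())      # [True, True, False]
print(below_threshold_vec.tolist())  # [True, True, False]

# Either boolean Series can be used to filter the dataframe
print(order_df[below_threshold])
```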

Now, let's get all the product information from the right-joined dataframe (right_df) where the category is 'Electronics'.

In [22]:
# Product information for Electronics category
electronics_df = right_df[(right_df['category'] == 'Electronics')] 
electronics_df

Unnamed: 0,order_id,product_name,quantity,unit_price,category,manufacturer
0.0,1001.0,Keyboard,2.0,22.5,Electronics,Manufacturer_B


In [23]:
# Exercise:
# Get the same output as the previous cell using a 'lambda' function

# right_df[right_df['category'].apply(lambda x: x == 'Electronics')]
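
Joined dataframes often contain NaN in the category column, so it helps to know how an equality test treats missing values. In the sketch below, on a small stand-in for a left-joined result (an assumption for brevity), both the lambda and the direct comparison treat NaN as a non-match; `mask_lambda` and `mask_direct` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Small stand-in for a left-joined result; 'Jacket' has no category
left_results = pd.DataFrame({
    "product_name": ["Keyboard", "Jacket", "Shoe"],
    "category": ["Electronics", np.nan, "Clothing"],
})

# Element-wise test with a lambda: NaN == 'Electronics' evaluates to False
mask_lambda = left_results["category"].apply(lambda x: x == "Electronics")

# Direct comparison on the whole column gives the same boolean Series
mask_direct = left_results["category"] == "Electronics"

print(mask_lambda.tolist())  # [True, False, False]
print(mask_direct.tolist())  # [True, False, False]
print(left_results[mask_lambda])
```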

In [24]:
# Let's get all the information (from both dataframes) for products with unit price less than 40
unit_price_under_40 = left_results[left_results['unit_price'] < 40]
unit_price_under_40

Unnamed: 0,order_id,product_name,quantity,unit_price,category,manufacturer
0,1001,Keyboard,2,22.5,Electronics,Manufacturer_B
3,1004,Tennis Racket,4,30.0,Sports,Manufacturer_C
4,1005,Basketball,6,15.0,Sports,Manufacturer_C
5,1006,Microphone,2,35.5,,


In [25]:
# Exercise:
# Get the same output as the previous cell using 'lambda'

left_results[left_results['unit_price'].apply(lambda x: x < 40)]

Unnamed: 0,order_id,product_name,quantity,unit_price,category,manufacturer
0,1001,Keyboard,2,22.5,Electronics,Manufacturer_B
3,1004,Tennis Racket,4,30.0,Sports,Manufacturer_C
4,1005,Basketball,6,15.0,Sports,Manufacturer_C
5,1006,Microphone,2,35.5,,
