In [175]:
import pandas as pd 

# Join

The "join" method provided by pandas combines two dataframes based on the values of a common column. pandas supports four types of join:

Inner Join: returns only the rows that have matching values for the selected column in both DataFrames

Outer Join: returns all rows from both DataFrames, with NaN values where there's no matching value for the selected column

Left Join: returns all rows from the left DataFrame and the matched rows from the right DataFrame, with NaN values where there are no matches

Right Join: returns all rows from the right DataFrame and the matched rows from the left DataFrame, with NaN values where there are no matches

The product_info dictionary contains a list of product ids and a list of product names. Let's create a DataFrame for these products with only two columns: one for the product id and one for the product name.

In [176]:
product_info = {
                  "product_id": [1,2,3,4], 
                  "name":["headphone","watch","laptop","water bottle"]
               }

In [177]:
# Create dataframe from product_info
product_df = pd.DataFrame(product_info)
product_df

Unnamed: 0,product_id,name
0,1,headphone
1,2,watch
2,3,laptop
3,4,water bottle


In [178]:
price_info = {
                "product_id": [1,2,3,7,8], 
                "price":[50, 85, 1500, 70, 100]
             }

Let's create another DataFrame containing prices for different products. This DataFrame also has two columns, "product_id" and "price".

In [179]:
price_df = pd.DataFrame(price_info)
price_df

Unnamed: 0,product_id,price
0,1,50
1,2,85
2,3,1500
3,7,70
4,8,100


The join method in pandas combines two dataframes based on matching values in a selected column. We select the "product_id" column as the common column for matching.

We are combining product_df and price_df here, and we pass two additional parameters to the join method. The 'on' parameter names the column of the calling dataframe (product_df) to match on, and the 'how' parameter sets the type of join (inner, outer, left, or right). Because join matches the 'on' column against the index of the other dataframe, we first call set_index('product_id') on price_df.

With an 'inner' join, only the rows whose 'product_id' value appears in both dataframes are kept. Rows whose 'product_id' appears in only one of the dataframes are discarded.

In [180]:
# Inner Join
product_df.join(price_df.set_index('product_id'), on='product_id', how='inner')

Unnamed: 0,product_id,name,price
0,1,headphone,50
1,2,watch,85
2,3,laptop,1500
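
The same inner join can also be written with pd.merge, which matches on a shared column directly and does not need "product_id" to be moved into the index first. The following is a minimal sketch that rebuilds the two dataframes from above so that it can run on its own; the variable name inner_df is introduced here only for illustration.

```python
import pandas as pd

product_df = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "name": ["headphone", "watch", "laptop", "water bottle"],
})
price_df = pd.DataFrame({
    "product_id": [1, 2, 3, 7, 8],
    "price": [50, 85, 1500, 70, 100],
})

# merge() matches the 'product_id' columns of both dataframes directly,
# so no set_index() call is needed
inner_df = pd.merge(product_df, price_df, on="product_id", how="inner")
print(inner_df)
```

Which of the two to use is largely a matter of taste: join is convenient when the right-hand dataframe is already indexed by the key, while merge works on plain columns.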


For an 'outer' join, all the rows from both dataframes are kept, even if the values in the selected column don't match. The missing entries in the combined dataframe are filled with NaN values.

Here, product id 4 has no price entry in price_df, and product ids 7 and 8 have no product name in product_df. In the combined dataframe, these missing entries are set to NaN.

In [181]:
# Outer Join
product_df.join(price_df.set_index('product_id'), on='product_id', how='outer')

Unnamed: 0,product_id,name,price
0.0,1,headphone,50.0
1.0,2,watch,85.0
2.0,3,laptop,1500.0
3.0,4,water bottle,
,7,,70.0
,8,,100.0
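
To check which dataframe each row of an outer join came from, pd.merge offers an indicator option. The sketch below is one possible way to do this, not part of the join method itself; it rebuilds the two dataframes so that it is self-contained, and the variable name outer_df is hypothetical.

```python
import pandas as pd

product_df = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "name": ["headphone", "watch", "laptop", "water bottle"],
})
price_df = pd.DataFrame({
    "product_id": [1, 2, 3, 7, 8],
    "price": [50, 85, 1500, 70, 100],
})

# indicator=True adds a '_merge' column whose value is
# 'both', 'left_only' or 'right_only' for each row
outer_df = pd.merge(product_df, price_df, on="product_id",
                    how="outer", indicator=True)
print(outer_df[["product_id", "_merge"]])
```

Rows marked 'left_only' or 'right_only' are exactly the ones that contain NaN values in the outer join.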


With a 'left' join, all the rows of the left dataframe (product_df) are kept, and where the right dataframe (price_df) has no matching row, its columns are set to NaN. Rows of the right dataframe whose 'product_id' does not appear in the left dataframe are omitted.

In [182]:
# Left Join
product_df.join(price_df.set_index('product_id'), on='product_id', how='left')

Unnamed: 0,product_id,name,price
0,1,headphone,50.0
1,2,watch,85.0
2,3,laptop,1500.0
3,4,water bottle,


In [183]:
# Exercise:
# Perform right join on product_df and price_df 
# Write a comment on what you see as the result

# product_df.join(price_df.set_index('product_id'), on='product_id', how='right')

In [184]:
# import pandas as pd

# # create the customers dataset
# customers = pd.DataFrame({
#     'customer_id': [1, 2, 3, 4, 5],
#     'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Erling'],
#     'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'dave@example.com','erling@example.com']
# })

# # create the orders dataset
# orders = pd.DataFrame({
#     'order_id': [1, 2, 3, 4, 5, 6],
#     'customer_id': [1, 2, 3, 1, 3, 4],
#     'order_date': ['2022-01-01', '2022-01-02', '2022-01-03', '2021-02-04', '2022-01-05', '2022-01-06'],
#     'total': [100, 200, 300, 150, 250, 200]
# })

# # join the two datasets on the customer_id column
# customer_orders = customers.join(orders.set_index('customer_id'), on='customer_id', how='left')

# print(customer_orders)

# Append

When working with pandas dataframes, we often need to add the rows of one dataframe to another. The **append()** method in pandas can be used to achieve this. Note that **append()** was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat takes its place.

In [185]:
# We create a dataframe containing two columns, "name" and "age"
person_data = {
          "name": ["Alice", "Bob", "Charlie", "David"],
          "age": [27, 33, 21, 19]
        }

person_df = pd.DataFrame(person_data)

In [186]:
# Display the dataframe
person_df

Unnamed: 0,name,age
0,Alice,27
1,Bob,33
2,Charlie,21
3,David,19


Now, we create another dataframe with the same column names. We want to append its rows to the dataframe named "person_df" that we created above.

In [187]:
# Set the values for the new dataframe
person_data_2 = {
          "name": ["Erling", "Frank"],
          "age": [26, 42]
        }

In [188]:
# Create and display the new dataframe named "person_df_2" 
person_df_2 = pd.DataFrame(person_data_2)
person_df_2

Unnamed: 0,name,age
0,Erling,26
1,Frank,42


We will now use the **append()** method to combine the rows from "person_df_2" with "person_df". We need to save the result to a new variable because the append() function doesn't modify the original dataframe. 

In [189]:
new_person_df = person_df.append(person_df_2, ignore_index=True) # ignore_index is set True to ignore the indices in the original DFs and reset new indices


In [190]:
# Display the combined dataframe
new_person_df

Unnamed: 0,name,age
0,Alice,27
1,Bob,33
2,Charlie,21
3,David,19
4,Erling,26
5,Frank,42
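
On pandas versions where DataFrame.append is not available (it was removed in pandas 2.0), pd.concat gives the same result. The following is a minimal sketch that rebuilds the two dataframes so that it runs on its own; the variable name combined_df is introduced here only for illustration.

```python
import pandas as pd

person_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [27, 33, 21, 19],
})
person_df_2 = pd.DataFrame({
    "name": ["Erling", "Frank"],
    "age": [26, 42],
})

# concat() stacks the rows of the listed dataframes;
# ignore_index=True renumbers the rows from 0, as it does for append()
combined_df = pd.concat([person_df, person_df_2], ignore_index=True)
print(combined_df)
```

Like append(), concat() returns a new dataframe and leaves the originals unchanged.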


Let's try to add some rows to "new_person_df" that have an extra "gender" column. We can see that the "gender" column is filled with NaN for the rows where that value is missing.

In [191]:
extra_df = pd.DataFrame({"name":["Kevin","Liam","Martha"],
                         "age": [28, 31, 49],
                         "gender": ["M","M","F"]})

new_person_df.append(extra_df, ignore_index=True) # Don't forget to save this result to another variable if you intend to use this new dataframe later


Unnamed: 0,name,age,gender
0,Alice,27,
1,Bob,33,
2,Charlie,21,
3,David,19,
4,Erling,26,
5,Frank,42,
6,Kevin,28,M
7,Liam,31,M
8,Martha,49,F


In [192]:
# Exercise:
# Add another row to the "new_person_df" using append function

# temp_person_df = pd.DataFrame({"name":["Jack"],
#                                "age":[41]})

# another_person_df = new_person_df.append(temp_person_df, ignore_index=True)
# another_person_df


# Lambdas

A lambda in Python is an anonymous function that can take any number of arguments and returns the value of a single expression. The general format is: "lambda inputs: expression".
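
As a quick illustration of this format, here is a small sketch in plain Python. The names square, add and square_named are made up for this example.

```python
# A lambda with one input
square = lambda x: x ** 2

# A lambda with two inputs
add = lambda a, b: a + b

# The equivalent named function, for comparison
def square_named(x):
    return x ** 2

print(square(4))        # 16
print(add(2, 3))        # 5
print(square_named(4))  # 16
```

In practice a lambda is usually not assigned to a name like this but passed directly to another function, as in the pandas examples that follow.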

Here's a dictionary named salary_data with the keys 'Name', 'Age' and 'Salary'.

In [193]:
salary_data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
        'Age': [25, 45, 35, 30],
        'Salary': [50000, 60000, 70000, 80000]}

Let's convert this dictionary into a dataframe named 'salary_df' using pandas.

In [194]:
salary_df = pd.DataFrame(salary_data)
print(salary_df)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   45   60000
2  Charlie   35   70000
3     Dave   30   80000


Now, let's say we want to select the rows from this dataframe where the salary is greater than 65000. The easiest way to do it is shown in the next cell. 

In [195]:
salary_df[salary_df['Salary'] > 65000]

Unnamed: 0,Name,Age,Salary
2,Charlie,35,70000
3,Dave,30,80000


But lambdas in Python can also be used to achieve the same thing. A lambda can take each column value as input and evaluate an expression on it.

First, compute a boolean series indicating whether the salary in each row is greater than 65000. In the next cell, we can see that the first two values are False, which means the salaries in those rows are not greater than 65000.

In [196]:
salary_series = salary_df['Salary'].apply(lambda x: x > 65000)
salary_series

0    False
1    False
2     True
3     True
Name: Salary, dtype: bool

In [197]:
# Check the type to verify
type(salary_series)

pandas.core.series.Series

Now, we can use this series to select the rows of salary_df that have a salary greater than 65000.

In [198]:
salary_df[salary_series]

Unnamed: 0,Name,Age,Salary
2,Charlie,35,70000
3,Dave,30,80000


In a single line, it can be written as -

In [199]:
salary_df[salary_df['Salary'].apply(lambda x: x > 65000)]

Unnamed: 0,Name,Age,Salary
2,Charlie,35,70000
3,Dave,30,80000


In [200]:
# Exercise:
# Get the rows from salary_df using lambda where the age is greater than 30
salary_df[salary_df['Age'].apply(lambda x: x > 30)]

Unnamed: 0,Name,Age,Salary
1,Bob,45,60000
2,Charlie,35,70000


If we can get the same output with just "salary_df[salary_df['Salary'] > 65000]", why use a lambda, which looks more complicated?

Because a lambda lets us define an anonymous function inline, which permits more complex manipulation of the data than a simple comparison.

Let's consider another case where we want to create another column in salary_df. The new column, named 'Age_Group', will contain an age category based on the values in the 'Age' column. For simplicity, let's assume that if the age is less than 30, the person is young; otherwise the person is old.

In [201]:
salary_df['Age_Group'] = salary_df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')

In [202]:
salary_df

Unnamed: 0,Name,Age,Salary,Age_Group
0,Alice,25,50000,Young
1,Bob,45,60000,Old
2,Charlie,35,70000,Old
3,Dave,30,80000,Old
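
A lambda can also look at several columns at once when it is applied row by row with axis=1. The sketch below is illustrative only: the column name 'Label', its values, and the thresholds are assumptions made up for this example, and the dataframe is rebuilt so that the cell runs on its own.

```python
import pandas as pd

salary_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 45, 35, 30],
    "Salary": [50000, 60000, 70000, 80000],
})

# With axis=1 the lambda receives one row at a time (as a Series),
# so it can use both the 'Age' and the 'Salary' value of that row
salary_df["Label"] = salary_df.apply(
    lambda row: "under 40, above 65000"
    if row["Age"] < 40 and row["Salary"] > 65000
    else "other",
    axis=1,
)
print(salary_df)
```

Row-wise apply is easy to read but can be slow on large dataframes, so it is best kept for logic that is hard to express with plain column comparisons.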


In [203]:
# Exercise:
# Add a new column to salary_df named 'Salary_Group'. The values in the 'Salary_Group' column will be 'mid income' if the 'Salary'
# is greater than 55000, otherwise the value will be 'low income'.

salary_df['Salary_Group'] = salary_df['Salary'].apply(lambda x: 'mid income' if x > 55000 else 'low income')
salary_df

Unnamed: 0,Name,Age,Salary,Age_Group,Salary_Group
0,Alice,25,50000,Young,low income
1,Bob,45,60000,Old,mid income
2,Charlie,35,70000,Old,mid income
3,Dave,30,80000,Old,mid income


# Combine

The 'combine' method in pandas combines two dataframes column by column. A function that we supply receives the matching columns from each dataframe and returns the combined column. The row and column labels of the result are the union of those of the two dataframes.

For further reference, follow this link: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine.html

Let's create two dataframes named df1 and df2, both of which have integer-valued 'X' and 'Y' columns.

In [204]:
df1 = pd.DataFrame({'X': [0, 1, 2, 20], 'Y': [4, 5, 6, 21]})
df1

Unnamed: 0,X,Y
0,0,4
1,1,5
2,2,6
3,20,21


In [205]:
df2 = pd.DataFrame({'X': [7, 8, 9], 'Y': [10, 11, 13]})
df2

Unnamed: 0,X,Y
0,7,10
1,8,11
2,9,13


In [206]:
# Let's combine these two dataframes by computing a pairwise sum.
# We can define a lambda function which takes two inputs (each one is an entire column), and
# returns the sum of the pairwise entries
sum_func = lambda x, y: x + y

In [207]:
# Now, we can combine df1 and df2 using the combine method
# by providing the lambda for summation
# Note that, as df2 has only 3 rows, the 4th row in the resulting dataframe is populated with NaN values,
# although df1 has 4 rows.
df1.combine(df2, sum_func)

Unnamed: 0,X,Y
0,7.0,14.0
1,9.0,16.0
2,11.0,19.0
3,,


In [208]:
# If we want to replace the missing values (e.g. the 4th row of df2 here) with a specific value before combining,
# the 'fill_value' parameter can be used.
df1.combine(df2, sum_func, fill_value=0)

Unnamed: 0,X,Y
0,7.0,14.0
1,9.0,16.0
2,11.0,19.0
3,20.0,21.0
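
For the specific case of a pairwise sum, pandas also has a built-in DataFrame.add method with the same fill_value option. The following is a small sketch, assuming the same df1 and df2 as above, that compares the two approaches; the variable names combined and added are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'X': [0, 1, 2, 20], 'Y': [4, 5, 6, 21]})
df2 = pd.DataFrame({'X': [7, 8, 9], 'Y': [10, 11, 13]})

# Pairwise sum through combine() with a lambda
combined = df1.combine(df2, lambda x, y: x + y, fill_value=0)

# Pairwise sum through the built-in add() method
added = df1.add(df2, fill_value=0)

# Compare the values of the two results entry by entry
print((combined == added).all().all())
```

combine is the more general tool, since the function passed to it can be anything that accepts two columns; add only covers addition.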


Let's define a lambda function that, for each column, takes the column from whichever dataframe has the larger sum of values in that column.

In [209]:
take_larger = lambda x, y: x if x.sum() > y.sum() else y

In [210]:
# Note that the sum of the 'X' column is greater for df2 (24 vs. 23), so the result has only three values in this column and NaN in the 4th row.
# For the 'Y' column, the sum is larger for df1 (36 vs. 34), so all four values from df1 are present in the result.
df1.combine(df2, take_larger)

Unnamed: 0,X,Y
0,7.0,4.0
1,8.0,5.0
2,9.0,6.0
3,,21.0
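
To see why each column was chosen, it helps to look at the column sums that take_larger compares. The following is a minimal sketch, assuming the same df1 and df2 as above. DataFrame.align is used here only to mimic how the two dataframes are lined up on their row labels; the names left and right are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'X': [0, 1, 2, 20], 'Y': [4, 5, 6, 21]})
df2 = pd.DataFrame({'X': [7, 8, 9], 'Y': [10, 11, 13]})

# Line up both dataframes on the union of their row labels;
# df2 gets a 4th row filled with NaN
left, right = df1.align(df2)

# sum() skips NaN values by default
print(left.sum())
print(right.sum())
```

The 'X' sum is larger in the second dataframe and the 'Y' sum is larger in the first, which matches the columns picked in the result above.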


In [211]:
# Exercise:
# Combine df1 and df2 in such a way that each entry in the resulting dataframe is the sum of the squares of the corresponding entries in the two dataframes

sqr_sum = lambda x, y: x**2 + y**2
df1.combine(df2, sqr_sum, fill_value=0)

Unnamed: 0,X,Y
0,49.0,116.0
1,65.0,146.0
2,85.0,205.0
3,400.0,441.0
