
# **Week 4 - Combining Data**



Python Pandas is a powerful library for data analysis that provides various data manipulation tools and data structures for cleaning, processing, and analyzing data in Python. It plays a crucial role in data science as it enables data scientists and analysts to work with data in an efficient and flexible manner.

One of the key features of Pandas is its ability to combine and merge datasets, which is essential in data science. Pandas provides a range of functions for this, including join(), merge(), and concat(). (The older DataFrame.append() method was removed in pandas 2.0; use concat() instead.)

Joining two or more datasets is a common operation in data science, and Pandas provides several types of join operations, including left join, right join, inner join, and outer join. Joining two or more datasets involves combining rows from each dataset based on a common key or index. The type of join to use depends on the data and the specific problem being solved.

A left join returns all the rows from the left dataset and any matching rows from the right dataset. A right join returns all the rows from the right dataset and any matching rows from the left dataset. An inner join returns only the rows that have matching keys in both datasets, and an outer join returns all the rows from both datasets, with missing values filled in where there is no match.
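The four join types can be compared side by side on a toy example. The sketch below is only an illustration: the `left` and `right` tables and their column names are made up for this purpose, and it uses `pd.merge()` with each value of `how` to show which keys survive.

```python
import pandas as pd

# Hypothetical toy tables: keys 1-3 on the left, keys 2-4 on the right
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Run the same merge with each join type and collect the keys that survive
keys_by_join = {
    how: sorted(pd.merge(left, right, on="key", how=how)["key"].tolist())
    for how in ["inner", "left", "right", "outer"]
}

for how, keys in keys_by_join.items():
    print(f"{how:>5}: {keys}")
# inner: [2, 3]
#  left: [1, 2, 3]
# right: [2, 3, 4]
# outer: [1, 2, 3, 4]
```

Only keys 2 and 3 exist in both tables, so they are the only ones an inner join keeps; the other join types add back the unmatched keys from one or both sides.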

Pandas also provides other tools for combining datasets, such as concatenation, which combines two or more datasets along a particular axis, and merging, which is similar to join but allows more flexibility in how the datasets are combined.
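Concatenation is not demonstrated elsewhere in this notebook, so here is a minimal sketch of `pd.concat()` along both axes. The `jan`, `feb`, and `prices` frames are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical monthly sales tables with identical columns
jan = pd.DataFrame({"Book ID": [1, 2], "Units Sold": [10, 5]})
feb = pd.DataFrame({"Book ID": [2, 3], "Units Sold": [7, 20]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([jan, feb], ignore_index=True)

# axis=1 places frames side by side, aligning rows on the index
prices = pd.DataFrame({"Price": [9.99, 12.50]})
side_by_side = pd.concat([jan, prices], axis=1)

print(stacked)
print(side_by_side)
```

Unlike a join, concatenation does not match rows on a key column: it simply stacks along the chosen axis and aligns on the other axis's labels.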

Overall, Pandas is an essential tool for combining and manipulating data in data science, providing data scientists and analysts with a powerful set of functions for cleaning, transforming, and analyzing data.


# Combining data with Join

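The join() method in Pandas combines two or more DataFrames. By default it matches rows on the index; with the on parameter it matches a column of the calling DataFrame against the index of the other DataFrame. The syntax of the join() method is as follows: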



```
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
```

where:

    other: The DataFrame or Series to be joined.
    on: The column or index level name(s) in the calling DataFrame to match against the index of other. If omitted, the join is index-on-index.
    how: The type of join to be performed. Default is a left join.
    lsuffix: The suffix to add to overlapping column names from the left DataFrame.
    rsuffix: The suffix to add to overlapping column names from the right DataFrame.
    sort: Whether to sort the resulting DataFrame by the join key.
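The lsuffix and rsuffix parameters only matter when both DataFrames share a column name. The following sketch, using two hypothetical price tables (`store_a` and `store_b`), shows that join() refuses to proceed without suffixes in that case and how the suffixes resolve the clash.

```python
import pandas as pd

# Hypothetical frames indexed by product, both with a 'price' column
store_a = pd.DataFrame({"price": [1.0, 2.0]}, index=["milk", "eggs"])
store_b = pd.DataFrame({"price": [1.2, 1.9]}, index=["milk", "eggs"])

# Without suffixes, the overlapping column name makes join() raise ValueError
overlap_error = None
try:
    store_a.join(store_b)
except ValueError as err:
    overlap_error = err
    print("join failed:", err)

# With suffixes, each 'price' column gets a distinct name
compared = store_a.join(store_b, lsuffix="_a", rsuffix="_b")
print(compared)
```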

The different types of joins in Pandas are:

    Inner Join
    Left Join
    Right Join
    Full Join (or Outer Join)


Here's an example of creating a DataFrame with books and sales:

In [1]:
import pandas as pd

# Create a DataFrame of books
books_df = pd.DataFrame({
    'Book ID': [1, 2, 3, 4, 5],
    'Title': ['The Great Gatsby', 'To Kill a Mockingbird', '1984', 'Pride and Prejudice', 'The Catcher in the Rye'],
    'Author': ['F. Scott Fitzgerald', 'Harper Lee', 'George Orwell', 'Jane Austen', 'J.D. Salinger'],
})

# Create a DataFrame of sales
sales_df = pd.DataFrame({
    'Book ID': [2, 3, 3, 4, 5, 5],
    'Units Sold': [10, 5, 7, 20, 15, 12],
    'Sales Date': ['2022-01-01', '2022-01-01', '2022-02-01', '2022-01-15', '2022-01-02', '2022-02-15'],
})

In [2]:
books_df

Unnamed: 0,Book ID,Title,Author
0,1,The Great Gatsby,F. Scott Fitzgerald
1,2,To Kill a Mockingbird,Harper Lee
2,3,1984,George Orwell
3,4,Pride and Prejudice,Jane Austen
4,5,The Catcher in the Rye,J.D. Salinger


In [3]:
sales_df

Unnamed: 0,Book ID,Units Sold,Sales Date
0,2,10,2022-01-01
1,3,5,2022-01-01
2,3,7,2022-02-01
3,4,20,2022-01-15
4,5,15,2022-01-02
5,5,12,2022-02-15


Now, let's explore the different types of joins using these DataFrames.

## Inner Join

An inner join returns only the rows that have matching values in both DataFrames. It is performed using the inner option in the how parameter of the join() method.

In [4]:
# Inner join on Book ID
inner_join_df = books_df.join(sales_df.set_index('Book ID'), on='Book ID', how='inner')
inner_join_df.head(10)

Unnamed: 0,Book ID,Title,Author,Units Sold,Sales Date
1,2,To Kill a Mockingbird,Harper Lee,10,2022-01-01
2,3,1984,George Orwell,5,2022-01-01
2,3,1984,George Orwell,7,2022-02-01
3,4,Pride and Prejudice,Jane Austen,20,2022-01-15
4,5,The Catcher in the Rye,J.D. Salinger,15,2022-01-02
4,5,The Catcher in the Rye,J.D. Salinger,12,2022-02-15


The resulting DataFrame only contains the rows where the Book ID is present in both DataFrames.
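The `set_index(...)` plus `on=` pattern is needed because join() matches against the index of the other DataFrame. As a rough comparison, the sketch below shows that `merge()` on the shared column should give the same rows without that step. The `books` and `sales` frames are shortened, hypothetical versions of the tables above.

```python
import pandas as pd

# Shortened, hypothetical versions of the books and sales tables
books = pd.DataFrame({
    "Book ID": [1, 2, 3],
    "Title": ["The Great Gatsby", "To Kill a Mockingbird", "1984"],
})
sales = pd.DataFrame({"Book ID": [2, 3, 3], "Units Sold": [10, 5, 7]})

# join(): the right frame must carry the key in its index
via_join = books.join(sales.set_index("Book ID"), on="Book ID", how="inner")

# merge(): both frames keep the key as an ordinary column
via_merge = books.merge(sales, on="Book ID", how="inner")

# Compare the contents after sorting and discarding the index labels
def normalized(df):
    return df.sort_values(["Book ID", "Units Sold"]).reset_index(drop=True)

same = normalized(via_join).equals(normalized(via_merge))
print(same)
```

The main visible difference is the index: join() keeps the index labels of the left DataFrame (which is why labels repeat in the output above), whereas merge() returns a fresh 0..n-1 index.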

## Left Join

A left join returns all the rows from the left DataFrame and the matching rows from the right DataFrame. If there are no matching rows in the right DataFrame, then the values for the right DataFrame columns will be filled with NaN. It is performed using the left option in the how parameter of the join() method.

In [5]:
# Left join on Book ID
left_join_df = books_df.join(sales_df.set_index('Book ID'), on='Book ID', how='left')
left_join_df.head(10)

Unnamed: 0,Book ID,Title,Author,Units Sold,Sales Date
0,1,The Great Gatsby,F. Scott Fitzgerald,,
1,2,To Kill a Mockingbird,Harper Lee,10.0,2022-01-01
2,3,1984,George Orwell,5.0,2022-01-01
2,3,1984,George Orwell,7.0,2022-02-01
3,4,Pride and Prejudice,Jane Austen,20.0,2022-01-15
4,5,The Catcher in the Rye,J.D. Salinger,15.0,2022-01-02
4,5,The Catcher in the Rye,J.D. Salinger,12.0,2022-02-15


The resulting DataFrame contains all the rows from the left DataFrame and the matching rows from the right DataFrame. The values for the right DataFrame columns are filled with NaN where there are no matching rows.

## Right Join

A right join returns all the rows from the right DataFrame and the matching rows from the left DataFrame. If there are no matching rows in the left DataFrame, then the values for the left DataFrame columns will be filled with NaN. It is performed using the right option in the how parameter of the join() method.

In [6]:
# Right join on Book ID
right_join_df = books_df.join(sales_df.set_index('Book ID'), on='Book ID', how='right')
right_join_df.head(10)

Unnamed: 0,Book ID,Title,Author,Units Sold,Sales Date
1,2,To Kill a Mockingbird,Harper Lee,10,2022-01-01
2,3,1984,George Orwell,5,2022-01-01
2,3,1984,George Orwell,7,2022-02-01
3,4,Pride and Prejudice,Jane Austen,20,2022-01-15
4,5,The Catcher in the Rye,J.D. Salinger,15,2022-01-02
4,5,The Catcher in the Rye,J.D. Salinger,12,2022-02-15


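The resulting DataFrame contains all the rows from the right DataFrame and the matching rows from the left DataFrame. The values for the left DataFrame columns would be filled with NaN where there are no matching rows. In this data every sale refers to a book that exists in books_df, so no NaN appears, and book 1 (The Great Gatsby) is absent because it has no sales.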

## Full Join (or Outer Join)

A full join (or outer join) returns all the rows from both DataFrames. If there are no matching rows in either DataFrame, then the values for the corresponding columns will be filled with NaN. It is performed using the outer option in the how parameter of the join() method.

In [7]:
# Full join on Book ID
full_join_df = books_df.join(sales_df.set_index('Book ID'), on='Book ID', how='outer')
full_join_df.head(10)


Unnamed: 0,Book ID,Title,Author,Units Sold,Sales Date
0,1,The Great Gatsby,F. Scott Fitzgerald,,
1,2,To Kill a Mockingbird,Harper Lee,10.0,2022-01-01
2,3,1984,George Orwell,5.0,2022-01-01
2,3,1984,George Orwell,7.0,2022-02-01
3,4,Pride and Prejudice,Jane Austen,20.0,2022-01-15
4,5,The Catcher in the Rye,J.D. Salinger,15.0,2022-01-02
4,5,The Catcher in the Rye,J.D. Salinger,12.0,2022-02-15


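The resulting DataFrame contains all the rows from both DataFrames. The values for the corresponding columns are filled with NaN where there are no matching rows. Here the result is the same as the left join, because every Book ID in sales_df also appears in books_df and only book 1 lacks a match.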

# Combining data with Merge

## Inner Join

Let's say we have two datasets: a sales dataset and a customer dataset. The sales dataset contains information about sales transactions, including the customer ID and the amount of the sale. The customer dataset contains information about customers, including their name and contact information, as well as their customer ID.

We want to join these two datasets together to create a new dataset that contains information about sales transactions, as well as the name and contact information of the customers who made the purchases.

Here's the code to join the two datasets using Pandas:

### Example 1

In [8]:
#import libraries
import csv
import pandas as pd
import numpy as np

In [9]:
# create the sales dataset
sales_df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'amount': [100.0, 150.0, 200.0, 75.0, 300.0]
})
sales_df.head(5)

Unnamed: 0,customer_id,amount
0,1,100.0
1,2,150.0
2,3,200.0
3,4,75.0
4,5,300.0


In [10]:
# create the customer dataset
customers_df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Sara Lee', 'David Lee'],
    'email': ['john.doe@example.com', 'jane.smith@example.com', 'bob.johnson@example.com', 'sara.lee@example.com', 'david.lee@example.com'],
    'phone': ['555-1234', '555-5678', '555-9876', '555-4321', '555-8765']
})
customers_df.head(5)

Unnamed: 0,customer_id,name,email,phone
0,1,John Doe,john.doe@example.com,555-1234
1,2,Jane Smith,jane.smith@example.com,555-5678
2,3,Bob Johnson,bob.johnson@example.com,555-9876
3,4,Sara Lee,sara.lee@example.com,555-4321
4,5,David Lee,david.lee@example.com,555-8765


In [11]:
merged_df = pd.merge(sales_df, customers_df, on='customer_id', how='inner')

# display the resulting merged dataset
print(merged_df)

   customer_id  amount         name                    email     phone
0            1   100.0     John Doe     john.doe@example.com  555-1234
1            2   150.0   Jane Smith   jane.smith@example.com  555-5678
2            3   200.0  Bob Johnson  bob.johnson@example.com  555-9876
3            4    75.0     Sara Lee     sara.lee@example.com  555-4321
4            5   300.0    David Lee    david.lee@example.com  555-8765


In this example, we use the merge function to join the two datasets on the customer_id column, using an inner join. The resulting merged dataset contains all of the sales transactions, as well as the name and contact information of the customers who made the purchases.
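The example above works cleanly because each customer_id appears exactly once in both tables. When that assumption might not hold, the `validate` argument of `pd.merge()` can check it. The sketch below uses a hypothetical customers table with a duplicated ID to show the check failing.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical data: customer_id 2 is accidentally listed twice
customers = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "name": ["John Doe", "Jane Smith", "Jane S."],
})
sales = pd.DataFrame({"customer_id": [1, 2], "amount": [100.0, 150.0]})

# validate="one_to_one" asks pandas to confirm keys are unique on both sides
try:
    pd.merge(sales, customers, on="customer_id", how="inner",
             validate="one_to_one")
    duplicate_detected = False
except MergeError as err:
    duplicate_detected = True
    print("merge rejected:", err)
```

Without the check, the merge would succeed silently and the sale for customer 2 would appear twice, which is easy to miss in a larger dataset.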

This is just one example of how to use Pandas to join two datasets. Pandas provides many other functions and options for combining and manipulating data, making it a powerful tool for data scientists and analysts.


### Example 2

Suppose we have two datasets: sales and customers. The sales dataset contains information about sales transactions, including the customer ID, the product ID, the date of the sale, and the quantity sold. The customers dataset contains information about customers, including their name, age, and gender, as well as their customer ID.

In [12]:
# generate sales data
sales = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3, 4, 5],
    'product_id': [1, 2, 2, 3, 2, 1, 1, 2],
    'sale_date': pd.date_range(start='2022-01-01', periods=8),
    'quantity_sold': [3, 4, 1, 2, 2, 5, 1, 3],
    'revenue': np.random.randint(100, 1000, 8)
})

# generate customer data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Jane', 'Bob', 'Sara', 'David'],
    'age': [30, 25, 40, 35, 28],
    'gender': ['M', 'F', 'M', 'F', 'M']
})


In [14]:
sales

Unnamed: 0,customer_id,product_id,sale_date,quantity_sold,revenue
0,1,1,2022-01-01,3,832
1,1,2,2022-01-02,4,633
2,2,2,2022-01-03,1,466
3,3,3,2022-01-04,2,217
4,3,2,2022-01-05,2,711
5,3,1,2022-01-06,5,559
6,4,1,2022-01-07,1,593
7,5,2,2022-01-08,3,706


In [15]:
customers

Unnamed: 0,customer_id,name,age,gender
0,1,John,30,M
1,2,Jane,25,F
2,3,Bob,40,M
3,4,Sara,35,F
4,5,David,28,M


In this example, we use Pandas to generate two sample datasets: sales and customers. The sales dataset contains information about eight sales transactions, while the customers dataset contains information about five customers.

Now, let's say we want to join these two datasets together to create a new dataset that contains information about each sale, including the customer's name, age, and gender. We can do this by performing an inner join on the customer_id column, which both datasets share. Here's how we can do it:

In [16]:
# join the sales and customers dataframes
merged = pd.merge(sales, customers, on='customer_id', how='inner')
merged = merged.rename(columns={'name': 'customer_name', 'age': 'customer_age', 'gender': 'customer_gender'})

In this example, we use the Pandas merge() function to join the sales and customers dataframes. We specify the on parameter to join on the customer_id column, and we also specify how='inner' to perform an inner join. We also rename some of the columns in the resulting merged dataframe for clarity.

The resulting merged dataframe contains information about each sale, including the customer's name, age, and gender:


In [18]:
# display the resulting merged dataset
merged.head(8)

Unnamed: 0,customer_id,product_id,sale_date,quantity_sold,revenue,customer_name,customer_age,customer_gender
0,1,1,2022-01-01,3,832,John,30,M
1,1,2,2022-01-02,4,633,John,30,M
2,2,2,2022-01-03,1,466,Jane,25,F
3,3,3,2022-01-04,2,217,Bob,40,M
4,3,2,2022-01-05,2,711,Bob,40,M
5,3,1,2022-01-06,5,559,Bob,40,M
6,4,1,2022-01-07,1,593,Sara,35,F
7,5,2,2022-01-08,3,706,David,28,M


### Example 3

Suppose we have three datasets: orders, customers, and products. The orders dataset contains information about orders placed by customers, including the order ID, the customer ID, the product ID, and the quantity ordered. The customers dataset contains information about customers, including their name, age, and gender, as well as their customer ID. The products dataset contains information about the products available for sale, including the product ID, the product name, and the price.

Here's how we can generate the sample datasets:

In [19]:
import pandas as pd
import numpy as np

# generate orders data
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'customer_id': [1, 1, 2, 2, 2, 3, 3, 4],
    'product_id': [1, 2, 2, 3, 1, 1, 2, 3],
    'quantity': [3, 2, 1, 2, 4, 1, 3, 2],
    'order_date': pd.date_range(start='2022-01-01', periods=8)
})

# generate customers data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['John', 'Jane', 'Bob', 'Sara'],
    'age': [30, 25, 40, 35],
    'gender': ['M', 'F', 'M', 'F']
})

# generate products data
products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'name': ['Product A', 'Product B', 'Product C'],
    'price': [100, 200, 300]
})

In [20]:
orders

Unnamed: 0,order_id,customer_id,product_id,quantity,order_date
0,1,1,1,3,2022-01-01
1,2,1,2,2,2022-01-02
2,3,2,2,1,2022-01-03
3,4,2,3,2,2022-01-04
4,5,2,1,4,2022-01-05
5,6,3,1,1,2022-01-06
6,7,3,2,3,2022-01-07
7,8,4,3,2,2022-01-08


In [21]:
customers

Unnamed: 0,customer_id,name,age,gender
0,1,John,30,M
1,2,Jane,25,F
2,3,Bob,40,M
3,4,Sara,35,F


In [22]:
products

Unnamed: 0,product_id,name,price
0,1,Product A,100
1,2,Product B,200
2,3,Product C,300


In this example, we use Pandas to generate three sample datasets: orders, customers, and products. The orders dataset contains information about eight orders, while the customers dataset contains information about four customers, and the products dataset contains information about three products.

Now, let's say we want to join these three datasets together to create a new dataset that contains information about each order, including the customer's name, age, and gender, as well as the product name and price. We can do this by performing an inner join on the customer_id column in the orders and customers datasets, as well as on the product_id column in the orders and products datasets. Here's how we can do it:

In [23]:
# join the orders and customers dataframes
merged1 = pd.merge(orders, customers, on='customer_id', how='inner')

# join the resulting dataframe with the products dataframe
merged2 = pd.merge(merged1, products, on='product_id', how='inner')

# rename columns for clarity
merged2 = merged2.rename(columns={'name_x': 'customer_name', 'age': 'customer_age', 'gender': 'customer_gender', 'name_y': 'product_name'})

# display the resulting merged dataset
merged2.head(8)

Unnamed: 0,order_id,customer_id,product_id,quantity,order_date,customer_name,customer_age,customer_gender,product_name,price
0,1,1,1,3,2022-01-01,John,30,M,Product A,100
1,5,2,1,4,2022-01-05,Jane,25,F,Product A,100
2,6,3,1,1,2022-01-06,Bob,40,M,Product A,100
3,2,1,2,2,2022-01-02,John,30,M,Product B,200
4,3,2,2,1,2022-01-03,Jane,25,F,Product B,200
5,7,3,2,3,2022-01-07,Bob,40,M,Product B,200
6,4,2,3,2,2022-01-04,Jane,25,F,Product C,300
7,8,4,3,2,2022-01-08,Sara,35,F,Product C,300


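In this example, we use the Pandas merge() function twice to join the orders, customers, and products dataframes. First, we join the orders and customers dataframes on the customer_id column using an inner join. Then, we join the resulting dataframe with the products dataframe on the product_id column using another inner join. Because both customers and products have a name column, the second merge disambiguates them with the default suffixes as name_x (customer) and name_y (product), which is why the rename step maps those two names.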
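As an alternative to renaming after the fact, the `suffixes` argument of `merge()` can label the clashing name columns directly. This is a sketch on shortened, hypothetical versions of the three tables; the suffix strings `_customer` and `_product` are arbitrary choices.

```python
import pandas as pd

# Shortened, hypothetical versions of the orders, customers, and products tables
orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [1, 2], "product_id": [2, 1]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["John", "Jane"]})
products = pd.DataFrame({"product_id": [1, 2], "name": ["Product A", "Product B"]})

# Chain both merges; the second one has a 'name' clash, resolved by suffixes
merged = (
    orders
    .merge(customers, on="customer_id", how="inner")
    .merge(products, on="product_id", how="inner",
           suffixes=("_customer", "_product"))
)

print(merged.columns.tolist())
# ['order_id', 'customer_id', 'product_id', 'name_customer', 'name_product']
```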

## Left Join

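In Pandas, a left join is a type of merge operation that combines two dataframes based on a common key, keeping all rows from the left dataframe and only the matching rows from the right dataframe. Rows of the left dataframe without a match get NaN values in the columns that come from the right dataframe, and unmatched rows of the right dataframe are dropped. The result has the same number of rows as the left dataframe as long as each key appears at most once in the right dataframe; a key that appears several times on the right repeats the corresponding left row once per match.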

The syntax for a left join in Pandas is similar to that of an inner join. Here's an example:

In [24]:
import pandas as pd

# create two dataframes to merge
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# perform a left join on the 'key' column
merged_df = pd.merge(df1, df2, on='key', how='left')

merged_df.head()

Unnamed: 0,key,value_x,value_y
0,A,1,
1,B,2,5.0
2,C,3,
3,D,4,6.0


In this example, we have two dataframes, df1 and df2, with a common 'key' column. We perform a left join on the 'key' column using the pd.merge() function and the how='left' parameter. The resulting dataframe includes all rows from df1, and any matching rows from df2, with NaN values in the value_y column for non-matching rows.

In summary, a left join in Pandas is useful when you want to keep all rows from the left dataframe, and only include matching rows from the right dataframe. This can be helpful for data analysis and visualization tasks when you want to keep all the data from one dataframe and only supplement it with additional information from another dataframe.
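One thing to watch for with left joins is that the row count can grow when a key occurs more than once in the right dataframe. The following sketch uses hypothetical `products` and `reviews` tables in which one product has two reviews.

```python
import pandas as pd

# Hypothetical data: product 1 has two reviews, product 2 has none
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Milk", "Bread", "Eggs"],
})
reviews = pd.DataFrame({
    "product_id": [1, 1, 3],
    "review": ["Great product", "Too pricey", "Needs improvement"],
})

merged = pd.merge(products, reviews, on="product_id", how="left")

# 3 products become 4 rows: Milk appears once per review, Bread gets NaN
print(len(products), "products ->", len(merged), "rows")
print(merged)
```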

### Example 1



In [25]:
import pandas as pd

# create supermarket products dataframe
products_df = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'product_name': ['Milk', 'Bread', 'Eggs', 'Meat', 'Vegetables'],
    'price': [1.5, 1, 2, 5, 3]
})

# create consumer reviews dataframe
reviews_df = pd.DataFrame({
    'product_id': [1, 3, 5, 7],
    'review': ['Great product', 'Needs improvement', 'Excellent!', 'Hated it']
})




In [26]:
products_df

Unnamed: 0,product_id,product_name,price
0,1,Milk,1.5
1,2,Bread,1.0
2,3,Eggs,2.0
3,4,Meat,5.0
4,5,Vegetables,3.0


In [27]:
reviews_df

Unnamed: 0,product_id,review
0,1,Great product
1,3,Needs improvement
2,5,Excellent!
3,7,Hated it


In [28]:
# perform left join on 'product_id' column
merged_df = pd.merge(products_df, reviews_df, on='product_id', how='left')

print(merged_df)

   product_id product_name  price             review
0           1         Milk    1.5      Great product
1           2        Bread    1.0                NaN
2           3         Eggs    2.0  Needs improvement
3           4         Meat    5.0                NaN
4           5   Vegetables    3.0         Excellent!


In this example, we have two dataframes products_df and reviews_df with a common 'product_id' column. We perform a left join on the 'product_id' column using the pd.merge() function and the how='left' parameter. The resulting dataframe includes all rows from products_df, and any matching rows from reviews_df, with NaN values in the review column for non-matching rows.

### Example 2

In [30]:
import pandas as pd

# create supermarket products dataframe
products_df = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'product_name': ['Milk', 'Bread', 'Eggs', 'Meat', 'Vegetables'],
    'price': [1.5, 1, 2, 5, 3]
})

# create consumer reviews dataframe
reviews_df = pd.DataFrame({
    'product_id': [1, 3, 5, 7],
    'review': ['Great product', 'Needs improvement', 'Excellent!', 'Hated it'],
    'rating': [4, 3, 5, 1]
})


In [31]:
products_df

Unnamed: 0,product_id,product_name,price
0,1,Milk,1.5
1,2,Bread,1.0
2,3,Eggs,2.0
3,4,Meat,5.0
4,5,Vegetables,3.0


In [32]:
reviews_df

Unnamed: 0,product_id,review,rating
0,1,Great product,4
1,3,Needs improvement,3
2,5,Excellent!,5
3,7,Hated it,1


In [33]:
# perform left join on 'product_id' column
merged_df = pd.merge(products_df, reviews_df[['product_id', 'rating']], on='product_id', how='left')

print(merged_df)

   product_id product_name  price  rating
0           1         Milk    1.5     4.0
1           2        Bread    1.0     NaN
2           3         Eggs    2.0     3.0
3           4         Meat    5.0     NaN
4           5   Vegetables    3.0     5.0


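In this example, we have two dataframes products_df and reviews_df with a common 'product_id' column. We perform a left join on the 'product_id' column using the pd.merge() function and the how='left' parameter, but this time we only include the 'product_id' and 'rating' columns from reviews_df. The resulting dataframe includes all rows from products_df, with the rating of each matching review and NaN in the rating column for products without a review. Because NaN is a floating-point value, the rating column is converted from integers to floats.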
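After a left join, the NaN values are a convenient way to find the rows that had no match. The sketch below, on shortened hypothetical tables, lists the unrated products and then optionally converts the ratings to pandas' nullable `Int64` dtype so they display as whole numbers again.

```python
import pandas as pd

# Shortened, hypothetical versions of the products and ratings tables
products_df = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Milk", "Bread", "Eggs"],
})
ratings = pd.DataFrame({"product_id": [1, 3], "rating": [4, 3]})

merged_df = pd.merge(products_df, ratings, on="product_id", how="left")

# Products that found no match on the right have a missing rating
unrated = merged_df.loc[merged_df["rating"].isna(), "product_name"].tolist()
print(unrated)  # ['Bread']

# Optional: the nullable integer dtype stores missing values as <NA>
merged_df["rating"] = merged_df["rating"].astype("Int64")
print(merged_df)
```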

## Right Join

A right join, also known as a right outer join, returns all the rows from the right dataframe and the matching rows from the left dataframe. If there is no match in the left dataframe, then the resulting dataframe will contain NaN values in the columns from the left dataframe.

Here's an example of a right join in Pandas:

In [34]:
import pandas as pd

# create a dataframe of students and their scores
students = pd.DataFrame({
    'student_id': ['A001', 'A002', 'A003', 'A004'],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [85, 90, 75, 80]
})

# create a dataframe of students and their addresses
addresses = pd.DataFrame({
    'student_id': ['A001', 'A002', 'A005'],
    'address': ['123 Main St.', '456 Elm St.', '789 Oak St.']
})


In [35]:
students

Unnamed: 0,student_id,name,score
0,A001,Alice,85
1,A002,Bob,90
2,A003,Charlie,75
3,A004,David,80


In [36]:
addresses

Unnamed: 0,student_id,address
0,A001,123 Main St.
1,A002,456 Elm St.
2,A005,789 Oak St.


In [38]:
# perform right join on 'student_id' column
merged_df = pd.merge(students, addresses, on='student_id', how='right')

print(merged_df)

  student_id   name  score       address
0       A001  Alice   85.0  123 Main St.
1       A002    Bob   90.0   456 Elm St.
2       A005    NaN    NaN   789 Oak St.


In this example, we have two dataframes: students and addresses, with a common 'student_id' column. We perform a right join on the 'student_id' column using the pd.merge() function and the how='right' parameter. The resulting dataframe includes all rows from addresses, and any matching rows from students, with NaN values in the 'name' and 'score' columns for non-matching rows.
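A right join is a left join with the two dataframes swapped. The sketch below checks this on shortened, hypothetical versions of the students and addresses tables; only the column order differs between the two results.

```python
import pandas as pd

# Shortened, hypothetical versions of the students and addresses tables
students = pd.DataFrame({
    "student_id": ["A001", "A002", "A003"],
    "name": ["Alice", "Bob", "Charlie"],
})
addresses = pd.DataFrame({
    "student_id": ["A001", "A005"],
    "address": ["123 Main St.", "789 Oak St."],
})

right_join = pd.merge(students, addresses, on="student_id", how="right")
swapped_left = pd.merge(addresses, students, on="student_id", how="left")

# Align the column order before comparing the contents
same_rows = right_join.equals(swapped_left[right_join.columns])
print(same_rows)
```

In practice many analysts write only left joins and choose which dataframe goes first, since that keeps "the table whose rows I want to keep" in a consistent position.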

### Example

In [40]:
import pandas as pd

# create a dataframe of sales data
sales = pd.DataFrame({
    'order_id': ['O001', 'O002', 'O003', 'O004', 'O005'],
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P002'],
    'quantity': [3, 2, 4, 1, 2],
    'price': [10.99, 5.99, 12.99, 9.99, 4.99]
})

# create a dataframe of product data
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Milk', 'Eggs', 'Bread', 'Cheese', 'Butter'],
    'category': ['Dairy', 'Dairy', 'Bakery', 'Dairy', 'Dairy']
})


In [41]:
sales

Unnamed: 0,order_id,product_id,quantity,price
0,O001,P001,3,10.99
1,O002,P002,2,5.99
2,O003,P003,4,12.99
3,O004,P001,1,9.99
4,O005,P002,2,4.99


In [42]:
products

Unnamed: 0,product_id,product_name,category
0,P001,Milk,Dairy
1,P002,Eggs,Dairy
2,P003,Bread,Bakery
3,P004,Cheese,Dairy
4,P005,Butter,Dairy


In [43]:

# perform right join on 'product_id' column
merged_df = pd.merge(sales, products, on='product_id', how='right')

print(merged_df)

  order_id product_id  quantity  price product_name category
0     O001       P001       3.0  10.99         Milk    Dairy
1     O004       P001       1.0   9.99         Milk    Dairy
2     O002       P002       2.0   5.99         Eggs    Dairy
3     O005       P002       2.0   4.99         Eggs    Dairy
4     O003       P003       4.0  12.99        Bread   Bakery
5      NaN       P004       NaN    NaN       Cheese    Dairy
6      NaN       P005       NaN    NaN       Butter    Dairy


In this example, we have two dataframes: sales and products, with a common 'product_id' column. We perform a right join on the 'product_id' column using the pd.merge() function and the how='right' parameter. The resulting dataframe includes all rows from products, and any matching rows from sales, with NaN values in the 'order_id', 'quantity', and 'price' columns for non-matching rows.

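## Full Join (or Outer Join)

In pandas, the join() method can also be used to perform a full join (outer join) between two dataframes. The join() method is similar to merge(), but by default it joins dataframes on their indices rather than on a common column; when the on parameter is given, that column of the calling dataframe is matched against the index of the other dataframe.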

Here's an example dataframe that we can use to demonstrate join() with the 'outer' parameter:

In [44]:
import pandas as pd

# Create the first dataframe with supermarket products
products_df = pd.DataFrame({'Product': ['Apple', 'Banana', 'Pear', 'Orange', 'Grapes', 'Strawberries'],
                            'Price': [1.00, 0.50, 0.75, 1.25, 2.00, 2.50],
                            'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Fruit', 'Fruit']})

# Create the second dataframe with sales data
sales_df = pd.DataFrame({'Product': ['Apple', 'Banana', 'Pear', 'Orange', 'Pineapple', 'Mango'],
                         'Sales': [100, 200, 50, 75, 25, 50]})



In [45]:
products_df

Unnamed: 0,Product,Price,Category
0,Apple,1.0,Fruit
1,Banana,0.5,Fruit
2,Pear,0.75,Fruit
3,Orange,1.25,Fruit
4,Grapes,2.0,Fruit
5,Strawberries,2.5,Fruit


In [46]:
sales_df

Unnamed: 0,Product,Sales
0,Apple,100
1,Banana,200
2,Pear,50
3,Orange,75
4,Pineapple,25
5,Mango,50


In [47]:
# Join the two dataframes using an outer join
full_join = products_df.join(sales_df.set_index('Product'), on='Product', how='outer')

print(full_join)

          Product  Price Category  Sales
0.0         Apple   1.00    Fruit  100.0
1.0        Banana   0.50    Fruit  200.0
2.0          Pear   0.75    Fruit   50.0
3.0        Orange   1.25    Fruit   75.0
4.0        Grapes   2.00    Fruit    NaN
5.0  Strawberries   2.50    Fruit    NaN
NaN     Pineapple    NaN      NaN   25.0
NaN         Mango    NaN      NaN   50.0


In this example, we have two dataframes products_df and sales_df. The first dataframe contains information about supermarket products, such as their name, price, and category. The second dataframe contains sales data, such as the number of units sold for each product.

We use the join() method with the how parameter set to 'outer' so that all rows from both dataframes are included. Because join() matches against the index of the other dataframe, we first call set_index('Product') on sales_df, and we set the on parameter to 'Product' so that the 'Product' column of products_df is compared with that index.

The resulting dataframe full_join contains all rows and columns from both dataframes, with NaNs in cells where there is no corresponding data. For example, the last two rows (Pineapple and Mango) have NaN values in the 'Price' and 'Category' columns because these products do not appear in products_df; their index labels are NaN for the same reason. Similarly, the fifth and sixth rows (Grapes and Strawberries) have NaN values in the 'Sales' column because there is no corresponding data in sales_df.
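When it matters which side each row of an outer join came from, `pd.merge()` offers an `indicator` argument that adds a `_merge` column. The sketch below uses shortened, hypothetical versions of the two tables; it also returns a plain 0..n-1 index rather than NaN index labels.

```python
import pandas as pd

# Shortened, hypothetical versions of the products and sales tables
products_df = pd.DataFrame({"Product": ["Apple", "Grapes"], "Price": [1.00, 2.00]})
sales_df = pd.DataFrame({"Product": ["Apple", "Mango"], "Sales": [100, 50]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
full = pd.merge(products_df, sales_df, on="Product", how="outer", indicator=True)

origin = dict(zip(full["Product"], full["_merge"].astype(str)))
print(origin)
```

Here `origin` maps Apple to 'both', Grapes to 'left_only', and Mango to 'right_only', which makes the source of each NaN explicit.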

# Aggregation

Pandas join aggregation is a way of combining data from multiple dataframes by joining them and aggregating the data based on certain criteria. This is useful when you have data that is spread across multiple tables, and you want to combine it in a way that summarizes the data and makes it more manageable.

In Pandas, join aggregation is typically performed using the groupby() function, which groups the data by one or more columns, and then applies an aggregation function to each group to produce a summary of the data. Some common aggregation functions include sum(), mean(), count(), min(), and max().
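Several aggregation functions can be applied in a single groupby() call. The sketch below uses named aggregation via `agg()` on a small hypothetical sales table; the output column names (`total_quantity`, `mean_price`, `n_orders`) are arbitrary labels chosen for this illustration.

```python
import pandas as pd

# Hypothetical sales table with two orders for P001 and one for P002
sales = pd.DataFrame({
    "product_id": ["P001", "P002", "P001"],
    "quantity": [3, 2, 1],
    "price": [10.99, 5.99, 9.99],
})

# Each keyword is an output column: (input column, aggregation function)
summary = sales.groupby("product_id").agg(
    total_quantity=("quantity", "sum"),
    mean_price=("price", "mean"),
    n_orders=("quantity", "count"),
)

print(summary)
```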

For example, suppose you have two dataframes, sales and products, with a common 'product_id' column, and you want to calculate the total quantity sold for each product. You can perform a join on the 'product_id' column using the merge() function, and then group the resulting dataframe by the 'product_name' column and use the sum() function on the 'quantity' column:

In [48]:
import pandas as pd

# create a dataframe of sales data
sales = pd.DataFrame({
    'order_id': ['O001', 'O002', 'O003', 'O004', 'O005'],
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P002'],
    'quantity': [3, 2, 4, 1, 2],
    'price': [10.99, 5.99, 12.99, 9.99, 4.99]
})

# create a dataframe of product data
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Milk', 'Eggs', 'Bread', 'Cheese', 'Butter'],
    'category': ['Dairy', 'Dairy', 'Bakery', 'Dairy', 'Dairy']
})



In [50]:
sales

Unnamed: 0,order_id,product_id,quantity,price
0,O001,P001,3,10.99
1,O002,P002,2,5.99
2,O003,P003,4,12.99
3,O004,P001,1,9.99
4,O005,P002,2,4.99


In [51]:
products

Unnamed: 0,product_id,product_name,category
0,P001,Milk,Dairy
1,P002,Eggs,Dairy
2,P003,Bread,Bakery
3,P004,Cheese,Dairy
4,P005,Butter,Dairy


In [53]:
merged_df = pd.merge(sales, products, on='product_id')
merged_df

Unnamed: 0,order_id,product_id,quantity,price,product_name,category
0,O001,P001,3,10.99,Milk,Dairy
1,O004,P001,1,9.99,Milk,Dairy
2,O002,P002,2,5.99,Eggs,Dairy
3,O005,P002,2,4.99,Eggs,Dairy
4,O003,P003,4,12.99,Bread,Bakery


In [55]:
total_product = merged_df.groupby('product_name')['quantity'].sum()

print(total_product)

product_name
Bread    4
Eggs     4
Milk     4
Name: quantity, dtype: int64


In this example, we perform a join on the 'product_id' column using the merge() function, and then group the resulting dataframe by the 'product_name' column and apply the sum() function to the 'quantity' column to calculate the total quantity sold for each product. The resulting output is a Series object indexed by product name. Products without any sales (Cheese and Butter) do not appear, because merge() performs an inner join by default.
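If the goal is total revenue rather than units sold, one possible approach is to compute quantity times price per order before grouping. This sketch uses shortened, hypothetical versions of the tables, and the `revenue` column is an addition not present in the original data.

```python
import pandas as pd

# Shortened, hypothetical versions of the sales and products tables
sales = pd.DataFrame({
    "product_id": ["P001", "P002", "P001"],
    "quantity": [3, 2, 1],
    "price": [10.99, 5.99, 9.99],
})
products = pd.DataFrame({
    "product_id": ["P001", "P002"],
    "product_name": ["Milk", "Eggs"],
})

merged_df = pd.merge(sales, products, on="product_id")

# Revenue per order line, then summed per product
merged_df["revenue"] = merged_df["quantity"] * merged_df["price"]
revenue_by_product = merged_df.groupby("product_name")["revenue"].sum().round(2)

print(revenue_by_product)
```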

## Mean example

Suppose you have two dataframes, products and reviews, with a common 'product_id' column, and you want to calculate the average rating for each product based on consumer reviews. You can perform a join on the 'product_id' column using the merge() function, and then group the resulting dataframe by the 'product_name' column and use the mean() function to calculate the average rating for each product:

In [56]:
import pandas as pd

# create a dataframe of product data
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Milk', 'Eggs', 'Bread', 'Cheese', 'Butter'],
    'category': ['Dairy', 'Dairy', 'Bakery', 'Dairy', 'Dairy']
})

# create a dataframe of review data
reviews = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P002', 'P003', 'P004', 'P005', 'P005'],
    'rating': [4, 3, 5, 2, 4, 5, 4, 3, 2]
})


In [58]:
products

Unnamed: 0,product_id,product_name,category
0,P001,Milk,Dairy
1,P002,Eggs,Dairy
2,P003,Bread,Bakery
3,P004,Cheese,Dairy
4,P005,Butter,Dairy


In [57]:
reviews

Unnamed: 0,product_id,rating
0,P001,4
1,P002,3
2,P003,5
3,P001,2
4,P002,4
5,P003,5
6,P004,4
7,P005,3
8,P005,2


In [59]:
# perform join and aggregation
merged_df = pd.merge(products, reviews, on='product_id')
average_rating_by_product = merged_df.groupby('product_name')['rating'].mean()

print(average_rating_by_product)

product_name
Bread     5.0
Butter    2.5
Cheese    4.0
Eggs      3.5
Milk      3.0
Name: rating, dtype: float64


In this example, we join products and reviews on the 'product_id' column with merge(), group the result by the 'product_name' column, and apply mean() to the 'rating' column. The output is a Series, indexed by product name, that holds the average rating for each product based on consumer reviews.
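An average is easier to interpret when you also know how many reviews it is based on. As a small sketch using the same data, agg() can compute several aggregations in one pass; the name `rating_summary` is only illustrative.

```python
import pandas as pd

# Same product and review data as in the cells above
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Milk', 'Eggs', 'Bread', 'Cheese', 'Butter'],
    'category': ['Dairy', 'Dairy', 'Bakery', 'Dairy', 'Dairy'],
})
reviews = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P002', 'P003', 'P004', 'P005', 'P005'],
    'rating': [4, 3, 5, 2, 4, 5, 4, 3, 2],
})

merged_df = pd.merge(products, reviews, on='product_id')

# agg() with a list of function names returns a DataFrame
# with one column per aggregation
rating_summary = merged_df.groupby('product_name')['rating'].agg(['mean', 'count'])
print(rating_summary)
```

Here the result has one row per product and two columns, 'mean' and 'count'. It shows, for instance, that the 4.0 average for Cheese rests on a single review.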

## Count example

In [61]:
import pandas as pd

# create a dataframe of product data
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Milk', 'Eggs', 'Bread', 'Cheese', 'Butter'],
    'category': ['Dairy', 'Dairy', 'Bakery', 'Dairy', 'Dairy']
})

# create a dataframe of review data
reviews = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P002', 'P003', 'P004', 'P005', 'P005'],
    'rating': [4, 3, 5, 2, 4, 5, 4, 3, 2]
})


In [62]:
products

Unnamed: 0,product_id,product_name,category
0,P001,Milk,Dairy
1,P002,Eggs,Dairy
2,P003,Bread,Bakery
3,P004,Cheese,Dairy
4,P005,Butter,Dairy


In [63]:
reviews

Unnamed: 0,product_id,rating
0,P001,4
1,P002,3
2,P003,5
3,P001,2
4,P002,4
5,P003,5
6,P004,4
7,P005,3
8,P005,2


In [64]:
# perform join and aggregation
merged_df = pd.merge(products, reviews, on='product_id')
review_count_by_product = merged_df.groupby('product_name')['rating'].count()

print(review_count_by_product)

product_name
Bread     2
Butter    2
Cheese    1
Eggs      2
Milk      2
Name: rating, dtype: int64


In this example, we join on the 'product_id' column with merge(), group the result by the 'product_name' column, and apply count() to the 'rating' column. count() counts the non-null values in each group, so the output is a Series that holds the number of reviews for each product.
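The fact that count() only counts non-null values matters once a join can produce missing values. The sketch below uses a deliberately tiny dataset with a hypothetical product ('P006', 'Yogurt') that has no reviews; it is not part of the original data. It compares count() with size() after a left join.

```python
import pandas as pd

# Tiny illustrative dataset; 'P006' / 'Yogurt' is a hypothetical product
# with no reviews, added only to show the difference
products = pd.DataFrame({
    'product_id': ['P001', 'P006'],
    'product_name': ['Milk', 'Yogurt'],
})
reviews = pd.DataFrame({
    'product_id': ['P001', 'P001'],
    'rating': [4, 2],
})

# Left join keeps Yogurt, with NaN in the 'rating' column
merged_left = pd.merge(products, reviews, on='product_id', how='left')
grouped = merged_left.groupby('product_name')['rating']

review_count = grouped.count()  # non-null ratings per product
row_count = grouped.size()      # rows per product, NaN included

print(review_count)
print(row_count)
```

For Yogurt, count() gives 0 while size() gives 1, because the left join leaves one row with a missing rating. With an inner join, as in the example above, the two agree because unmatched products are dropped.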

## Min example

In [66]:
import pandas as pd

# create a dataframe of product data
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Milk', 'Eggs', 'Bread', 'Cheese', 'Butter'],
    'price': [2.50, 1.99, 1.25, 3.50, 2.75],
    'category': ['Dairy', 'Dairy', 'Bakery', 'Dairy', 'Dairy']
})

# create a dataframe of sales data
sales = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P002', 'P003', 'P004', 'P005', 'P005'],
    'quantity': [2, 1, 3, 1, 2, 1, 4, 3, 2]
})


In [67]:
products

Unnamed: 0,product_id,product_name,price,category
0,P001,Milk,2.5,Dairy
1,P002,Eggs,1.99,Dairy
2,P003,Bread,1.25,Bakery
3,P004,Cheese,3.5,Dairy
4,P005,Butter,2.75,Dairy


In [68]:
sales

Unnamed: 0,product_id,quantity
0,P001,2
1,P002,1
2,P003,3
3,P001,1
4,P002,2
5,P003,1
6,P004,4
7,P005,3
8,P005,2


In [69]:
# perform the join (the aggregation follows in the next cell)
merged_df = pd.merge(products, sales, on='product_id')
merged_df.head()

Unnamed: 0,product_id,product_name,price,category,quantity
0,P001,Milk,2.5,Dairy,2
1,P001,Milk,2.5,Dairy,1
2,P002,Eggs,1.99,Dairy,1
3,P002,Eggs,1.99,Dairy,2
4,P003,Bread,1.25,Bakery,3


In [70]:
min_price_by_product = merged_df.groupby('product_name')['price'].min()

min_price_by_product.head()

product_name
Bread     1.25
Butter    2.75
Cheese    3.50
Eggs      1.99
Milk      2.50
Name: price, dtype: float64

In this example, we join on the 'product_id' column with merge(), group the result by the 'product_name' column, and apply min() to the 'price' column. The output is a Series that holds the minimum price for each product. Because 'price' comes from the products dataframe, every row in a group carries the same price, so here min() simply returns each product's listed price.
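min() tends to be more informative on a column that varies within each group, such as 'quantity'. As a sketch with the same data, named aggregation in agg() can report the smallest and largest sale per product side by side; the column names `smallest_order` and `largest_order` are made up for this illustration.

```python
import pandas as pd

# Same product and sales data as in the cells above
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Milk', 'Eggs', 'Bread', 'Cheese', 'Butter'],
    'price': [2.50, 1.99, 1.25, 3.50, 2.75],
    'category': ['Dairy', 'Dairy', 'Bakery', 'Dairy', 'Dairy'],
})
sales = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P002', 'P003', 'P004', 'P005', 'P005'],
    'quantity': [2, 1, 3, 1, 2, 1, 4, 3, 2],
})

merged_df = pd.merge(products, sales, on='product_id')

# Named aggregation: output_column=(input_column, function)
quantity_range = merged_df.groupby('product_name').agg(
    smallest_order=('quantity', 'min'),
    largest_order=('quantity', 'max'),
)
print(quantity_range)
```

Each row of `quantity_range` then shows the range of quantities sold for one product, for example 1 to 3 for Bread and 4 to 4 for Cheese, which has a single sale.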