# Exercise: Data Merging

Ideally, data analysts would start their work with complete datasets. In practice, however, the data is often spread across several sources and has to be combined first. In this exercise, you will use pandas to merge data from multiple sources in different ways.

In [1]:
# for this exercise, only use pandas
import pandas as pd

##### 1. Read the customer table (customers.csv) and order table (orders.csv) files into two separate dataframes

In [2]:
df_customers = pd.read_csv("customers.csv")
df_orders = pd.read_csv("orders.csv")

##### 2. Create a dataframe that contains each customer and their associated information from the order table. This new dataframe should keep all entries of the customers.csv table.

In [11]:
df_customers_extended = pd.merge(df_customers, 
                                 df_orders, 
                                 left_on="ID", 
                                 right_on="Customer", 
                                 how="left",
                                 suffixes=("_customer", "_order"))

# Note that pd.merge renames overlapping columns, i.e. columns that have the same name in both dataframes.
# By default it appends the suffixes "_x" and "_y"; the "suffixes" argument overrides them.
# It is good practice to choose suffixes that describe the content of the data,
# e.g. "ID_x" --> "ID_customer" and "ID_y" --> "ID_order".
# You can also keep a column name unchanged by passing an empty suffix, e.g. suffixes=("", "_order").
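
The effect of the `suffixes` argument can be checked on a small example. The sketch below uses two hypothetical toy tables (`toy_customers`, `toy_orders`) that both contain an `ID` column; the real customers.csv and orders.csv files may have different columns.

```python
import pandas as pd

# Hypothetical toy tables for illustration only; the real CSV columns may differ.
toy_customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ann", "Ben", "Cleo"]})
toy_orders = pd.DataFrame({"ID": [100, 101], "Customer": [1, 1], "Amount": [9.5, 20.0]})

# Default behaviour: the overlapping "ID" columns get "_x" / "_y" suffixes.
default_cols = pd.merge(toy_customers, toy_orders,
                        left_on="ID", right_on="Customer",
                        how="left").columns.tolist()

# Descriptive suffixes make clear which table each "ID" came from.
named_cols = pd.merge(toy_customers, toy_orders,
                      left_on="ID", right_on="Customer",
                      how="left",
                      suffixes=("_customer", "_order")).columns.tolist()

print(default_cols)
print(named_cols)
```

Columns that occur in only one table, such as `Name` or `Amount` here, keep their names in both cases.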

##### 3. Create a dataframe that contains only customers who have already placed at least one order

In [4]:
df_customers_order = pd.merge(df_customers, 
                              df_orders, 
                              left_on="ID", 
                              right_on="Customer", 
                              how="inner",
                              suffixes=("_customer", "_order"))
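
One thing to keep in mind: an inner merge returns one row per matching customer-order pair, so a customer with several orders appears several times. The sketch below illustrates this with hypothetical toy data (the `OrderID` and `Name` columns are assumptions) and shows one possible way to obtain each ordering customer exactly once.

```python
import pandas as pd

# Hypothetical toy tables for illustration only.
toy_customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ann", "Ben", "Cleo"]})
toy_orders = pd.DataFrame({"OrderID": [100, 101, 102], "Customer": [1, 1, 3]})

# Inner merge: customer 1 has two orders and therefore appears in two rows.
inner = pd.merge(toy_customers, toy_orders,
                 left_on="ID", right_on="Customer", how="inner")

# If each customer should appear only once, filter the customer table instead.
customers_with_orders = toy_customers[toy_customers["ID"].isin(toy_orders["Customer"])]

print(len(inner), len(customers_with_orders))
```

Which of the two results is wanted depends on whether the order details are needed in the output or only the list of customers.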

##### 4. Create a dataframe that merges and keeps _all_ entries from both datasets

In [5]:
df_customers_full = pd.merge(df_customers, 
                             df_orders, 
                             left_on="ID", 
                             right_on="Customer", 
                             how="outer",
                             suffixes=("_customer", "_order"))
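
After an outer merge it is often useful to know where each row came from. As a minimal sketch on hypothetical toy data, the `indicator=True` argument of `pd.merge` adds a `_merge` column with the values `"left_only"`, `"right_only"`, or `"both"`.

```python
import pandas as pd

# Hypothetical toy tables; order 102 refers to a customer (99) missing from the customer table.
toy_customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ann", "Ben", "Cleo"]})
toy_orders = pd.DataFrame({"OrderID": [100, 101, 102], "Customer": [1, 1, 99]})

outer = pd.merge(toy_customers, toy_orders,
                 left_on="ID", right_on="Customer",
                 how="outer", indicator=True)

# Count rows by origin: matched, customers without orders, orders without a known customer.
n_both = int((outer["_merge"] == "both").sum())
n_left_only = int((outer["_merge"] == "left_only").sum())
n_right_only = int((outer["_merge"] == "right_only").sum())

print(n_both, n_left_only, n_right_only)
```

Rows without a match on one side are filled with missing values (`NaN`) in the columns of the other table.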

##### 5. Create a dataframe that contains all customers that have _not_ placed an order yet

In [6]:
# For this task, we use a boolean mask, i.e. a Series containing True/False values.
# For each customer, i.e. each "ID" in df_customers, check whether the ID occurs in
# the "Customer" column of df_orders. If not, this customer has not placed an order yet.
bitmask_customers_order = df_customers["ID"].isin(df_orders["Customer"])
df_customers_no_orders = df_customers[~bitmask_customers_order]  # "~" inverts the mask element-wise
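
The same result can also be reached with a merge, sometimes called an anti-join. The following is a sketch on hypothetical toy data, not a replacement for the solution above: a left merge with `indicator=True` marks customers without a matching order as `"left_only"`.

```python
import pandas as pd

# Hypothetical toy tables for illustration only.
toy_customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ann", "Ben", "Cleo"]})
toy_orders = pd.DataFrame({"OrderID": [100, 101], "Customer": [1, 1]})

# Keep only the distinct customer keys so that no customer row is duplicated.
order_keys = toy_orders[["Customer"]].drop_duplicates()

merged = pd.merge(toy_customers, order_keys,
                  left_on="ID", right_on="Customer",
                  how="left", indicator=True)

# Rows marked "left_only" are customers without any order; keep the customer columns only.
no_orders = merged.loc[merged["_merge"] == "left_only", toy_customers.columns]

# Cross-check against the boolean-mask approach.
no_orders_mask = toy_customers[~toy_customers["ID"].isin(toy_orders["Customer"])]

print(no_orders["ID"].tolist(), no_orders_mask["ID"].tolist())
```

For a single key column, the `isin` approach is shorter; the merge variant may be easier to extend when the match has to be made on several columns at once.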