# Exercise: Data Merging

Ideally, data analysts would start their work with complete datasets. In practice, however, data is often not even bundled in one place and has to be aggregated from multiple sources. In this exercise, you will use pandas to merge data from multiple sources in different ways.

In [2]:
# for this exercise, only use pandas
import pandas as pd

##### 1. Read the customer table (customers.csv) and order table (orders.csv) files into two separate dataframes

In [3]:
df_customers = pd.read_csv("customers.csv")
df_orders = pd.read_csv("orders.csv")
df_customers.head(3)

| | ID | Name | Street | Phone |
|---|---|---|---|---|
| 0 | 1 | Gerry Schaefer | Elizbeth Carroll Street | 9624155983 |
| 1 | 2 | Lizabeth Armstrong | Art Kirlin Street | 6174621765 |
| 2 | 3 | Ming Veum | Eusebio Pagac Street | 6845739684 |


In [4]:
df_orders.head(3)

| | ID | Item | Amount | Prize | Customer |
|---|---|---|---|---|---|
| 0 | 10735 | Lorenzo Hagenes | 3 | 20.798804 | 399 |
| 1 | 10736 | Margie Gibson | 4 | 89.046203 | 498 |
| 2 | 10737 | Melodie Dietrich | 5 | 19.707403 | 26 |


##### 2. Create a dataframe that contains each customer together with their associated information from the order table. This new dataframe should keep all entries of the customers.csv table.

In [7]:
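# Left join: keep every customer and attach matching orders (NaN where a customer has none)
merged_df1 = pd.merge(df_customers, df_orders, left_on="ID", right_on="Customer", how="left")
# Optional: drop the join key from the order table, which duplicates ID_x for matched rows
# merged_df1.drop(columns=["Customer"], inplace=True)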
merged_df1.head(3)

| | ID_x | Name | Street | Phone | ID_y | Item | Amount | Prize | Customer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Gerry Schaefer | Elizbeth Carroll Street | 9624155983 | 10784.0 | Dillon Crist | 2.0 | 29.916634 | 1.0 |
| 1 | 1 | Gerry Schaefer | Elizbeth Carroll Street | 9624155983 | 10804.0 | Jermaine D'Amore | 6.0 | 93.976604 | 1.0 |
| 2 | 2 | Lizabeth Armstrong | Art Kirlin Street | 6174621765 | 11005.0 | Gennie Ferry | 8.0 | 62.931166 | 2.0 |
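
Both tables have an `ID` column, so pandas disambiguates them with the default suffixes `_x` (left table) and `_y` (right table). The following is a minimal sketch of how the `suffixes` parameter could give these columns more readable names. The two small dataframes are hypothetical stand-ins for the CSV files, not the real data.

```python
import pandas as pd

# Hypothetical toy data that mimics the column names of the exercise files
customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame(
    {"ID": [10, 11, 12], "Item": ["pen", "ink", "cap"], "Customer": [1, 1, 2]}
)

# Left join with explicit suffixes instead of the default "_x" / "_y"
left = pd.merge(
    customers,
    orders,
    left_on="ID",
    right_on="Customer",
    how="left",
    suffixes=("_customer", "_order"),
)
print(left)
```

Customer 3 has no order in the toy data, so its order columns are filled with NaN. This is also why integer columns such as `ID_y` and `Amount` are shown as floats in the output above.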


##### 3. Create a dataframe that contains only customers that have already placed at least one order

In [8]:
# Inner join: keep only rows where a customer matches at least one order
merged_df2 = pd.merge(df_customers, df_orders, left_on="ID", right_on="Customer", how="inner")
merged_df2["Amount"].value_counts()

Amount
1     41
6     36
8     35
9     33
4     33
2     31
3     27
7     25
5     21
10    20
Name: count, dtype: int64
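
Note that an inner join returns one row per matching order, so a customer with several orders appears several times. If the goal is one row per customer, a filter with `isin` is a possible alternative. The sketch below uses hypothetical toy data with the same column names as the exercise files.

```python
import pandas as pd

# Hypothetical toy data that mimics the column names of the exercise files
customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame(
    {"ID": [10, 11, 12], "Item": ["pen", "ink", "cap"], "Customer": [1, 1, 2]}
)

# Inner join: one row per order, so customer 1 appears twice
inner = pd.merge(customers, orders, left_on="ID", right_on="Customer", how="inner")

# Filter: one row per customer that has at least one order
customers_with_orders = customers[customers["ID"].isin(orders["Customer"])]

print(len(inner), len(customers_with_orders))
```

Which variant is appropriate depends on whether the order details are needed in the result or only the list of customers.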

##### 4. Create a dataframe that merges and keeps _all_ entries from both datasets

In [9]:
# Outer join: keep all rows from both tables, matched where possible
merged_df3 = pd.merge(df_customers, df_orders, left_on="ID", right_on="Customer", how="outer")
merged_df3.head(3)

| | ID_x | Name | Street | Phone | ID_y | Item | Amount | Prize | Customer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Gerry Schaefer | Elizbeth Carroll Street | 9624156000.0 | 10784.0 | Dillon Crist | 2.0 | 29.916634 | 1.0 |
| 1 | 1.0 | Gerry Schaefer | Elizbeth Carroll Street | 9624156000.0 | 10804.0 | Jermaine D'Amore | 6.0 | 93.976604 | 1.0 |
| 2 | 2.0 | Lizabeth Armstrong | Art Kirlin Street | 6174622000.0 | 11005.0 | Gennie Ferry | 8.0 | 62.931166 | 2.0 |
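
After an outer join it is not obvious which rows matched and which come from only one table. A minimal sketch of how the `indicator` parameter of `pd.merge` could make this visible is shown below, using hypothetical toy data in which one order refers to a customer ID (9) that is missing from the customer table.

```python
import pandas as pd

# Hypothetical toy data; the order for customer 9 has no matching customer row
customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame(
    {"ID": [10, 11, 12], "Item": ["pen", "ink", "cap"], "Customer": [1, 1, 9]}
)

# indicator=True adds a "_merge" column with the values
# "both", "left_only" or "right_only" for every row
outer = pd.merge(
    customers,
    orders,
    left_on="ID",
    right_on="Customer",
    how="outer",
    indicator=True,
)
print(outer["_merge"].value_counts())
```

Rows that exist only in the order table have NaN in the customer columns. Integer columns that receive NaN are converted to floats, which would be consistent with `ID_x` and `Phone` being displayed as floats in the output above.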


##### 5. Create a dataframe that contains all customers that have _not_ placed an order yet

In [10]:
# Left join, then keep only the rows where no order matched (order columns are NaN)
merged_df4 = pd.merge(df_customers, df_orders, left_on="ID", right_on="Customer", how="left")
df4_without_customer_orders = merged_df4[merged_df4["Customer"].isna()]
df4_without_customer_orders.head(3)

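| | ID_x | Name | Street | Phone | ID_y | Item | Amount | Prize | Customer |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | Ming Veum | Eusebio Pagac Street | 6845739684 | NaN | NaN | NaN | NaN | NaN |
| 4 | 4 | Marcelino Larson | Jules Gutkowski Road | 1594525216 | NaN | NaN | NaN | NaN | NaN |
| 5 | 5 | Brooke Ortiz | Monte Predovic Road | 7618645478 | NaN | NaN | NaN | NaN | NaN |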
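
The result above still carries the empty order columns. If only the customer information is needed, the same selection could be sketched without a merge by negating an `isin` filter. The example below uses hypothetical toy data and checks that both approaches select the same customer.

```python
import pandas as pd

# Hypothetical toy data that mimics the column names of the exercise files
customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame(
    {"ID": [10, 11, 12], "Item": ["pen", "ink", "cap"], "Customer": [1, 1, 2]}
)

# Approach from the exercise: left join, then keep rows without a matching order
merged = pd.merge(customers, orders, left_on="ID", right_on="Customer", how="left")
via_merge = merged[merged["Customer"].isna()]

# Alternative: keep customers whose ID never appears in the order table
via_isin = customers[~customers["ID"].isin(orders["Customer"])]

print(via_merge["ID_x"].tolist(), via_isin["ID"].tolist())
```

The `isin` variant keeps the original column names and dtypes of the customer table, while the merge variant is convenient when the merged dataframe is needed anyway.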
