# Ex 5.2  Left Joins

- [Part 1: Merging the Orders and Customers Tables](#Part-1:-Merging-the-Orders-and-Customers-Tables)  
  - Question 1a: How Many Orders has each customer placed? (without Join)  
  - Question 1b: How Many Orders has each customer placed? (with Join)  
  - Question 1c: Who Are Our Best Customers in terms of Number of Orders?  
- [Part 2: Merging the *Orders* and *Employees* Tables](#Part-2:-Merging-the-Orders-and-Employees-Tables) 
  - Question 2: Who are our Top Employees Based on Number of Orders?


- **My References**  
  - [Sorting](../0_References/1_Pandas_Reference/Sorting.ipynb#Sorting-by-One-Column)   
    
  
  
#### Note: All data used in this Notebook comes from: *w3schools_Data.xlsx*   

In [171]:
from IPython.display import display, HTML
import pandas as pd
import math

import plotly.express as px
import numpy as np
from scipy import special

# Part 1: Merging the *Orders* and *Customers* Tables  
- The Objective is to display a plot that answers the question:  *How Many Orders Has Each Customer Placed?*


In [172]:
# Read the w3schools Orders data
df_orders = pd.read_excel("Data/w3schools_Data.xlsx", "Orders")

print("Number of Rows:  ", df_orders.shape[0])
print("Number of Columns:  ", df_orders.shape[1])
df_orders.head(3)

Number of Rows:   196
Number of Columns:   5


,OrderID,CustomerID,EmployeeID,OrderDate,ShipperID
0,10248,90,5,1996-07-04,3
1,10249,81,6,1996-07-05,1
2,10250,34,4,1996-07-08,2


# Question 1a:  How Many Orders has each customer placed?  *(without Join)*
- Note: Since we're just using the Orders table, we only have the CustomerID - not the Customer Name!

In [173]:
# Group By:  CustomerID
df_orders_by_cust = df_orders.groupby("CustomerID").count()

print("Number of Rows:  ", df_orders_by_cust.shape[0])
print("Number of Columns:  ", df_orders_by_cust.shape[1])
df_orders_by_cust.head()

Number of Rows:   74
Number of Columns:   4


CustomerID,OrderID,EmployeeID,OrderDate,ShipperID
2,1,1,1,1
3,1,1,1,1
4,2,2,2,2
5,3,3,3,3
7,4,4,4,4
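
Because `count()` repeats the same count in every non-key column, a single count column is often easier to plot. The sketch below is one possible alternative using `size()`; the toy `orders` frame and the `NumOrders` column name are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the Orders table (values are made up for illustration)
orders = pd.DataFrame({
    "OrderID": [1, 2, 3, 4, 5],
    "CustomerID": [90, 81, 90, 34, 90],
    "EmployeeID": [5, 6, 4, 3, 5],
})

# size() counts rows per group and returns one value per CustomerID;
# reset_index(name=...) turns the result into a two-column dataframe
orders_per_customer = (
    orders.groupby("CustomerID")
          .size()
          .reset_index(name="NumOrders")
)

print(orders_per_customer)
```

With this shape, `NumOrders` can be passed directly as the `y` column of a bar chart, and `CustomerID` is already a regular column.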


### Plot the results  
- **Note**:  We don't have to use *melt()* because we just want a simple Bar Chart (not a Stacked or Grouped Bar Chart) 

In [174]:
# Get CustomerID out of the index
df_orders_by_cust.reset_index(inplace=True)

print("Number of Rows:  ", df_orders_by_cust.shape[0])
print("Number of Columns:  ", df_orders_by_cust.shape[1])
df_orders_by_cust.head()

Number of Rows:   74
Number of Columns:   5


,CustomerID,OrderID,EmployeeID,OrderDate,ShipperID
0,2,1,1,1,1
1,3,1,1,1,1
2,4,2,2,2,2
3,5,3,3,3,3
4,7,4,4,4,4


In [175]:
df_orders_by_cust.dtypes

CustomerID    int64
OrderID       int64
EmployeeID    int64
OrderDate     int64
ShipperID     int64
dtype: object

In [176]:
# Vertical Bar Chart:  Count of Orders by Customer  
fig = px.bar(df_orders_by_cust, 
             x='CustomerID', 
             y='OrderID',
             width=900,
             height=400,
             labels={'OrderID':'Number of Orders'},
             template='seaborn',
             title='Question 1a: How Many Orders has each customer placed? (without Join)')

fig.update_yaxes(showgrid=True,
                title_text='Number of Orders')

fig.show()

# Question 1b:  How Many Orders has each customer placed?  *(with Join)*
- To get the CustomerName column from the Customers table, we will join Orders with Customers  
- More precisely:  We will do a Left Join on Orders and Customers, with Orders being the Left Table
- Note: Now we will have Customer Names for our x-axis

### Read the Customers table data

In [177]:
df_customers = pd.read_excel("Data/w3schools_Data.xlsx", "Customers", skiprows=2)

print("Number of Rows:  ", df_customers.shape[0])
print("Number of Columns:  ", df_customers.shape[1])
df_customers.head(3)

Number of Rows:   93
Number of Columns:   7


,CustomerID,CustomerName,ContactName,Address,City,PostalCode,Country
0,1,Alfreds Futterkiste,Maria Anders,Obere Str. 57,Berlin,12209,Germany
1,2,Ana Trujillo Emparedados y helados,Ana Trujillo,Avda. de la Constitución 2222,México D.F.,5021,Mexico
2,3,Antonio Moreno Taquería,Antonio Moreno,Mataderos 2312,México D.F.,5023,Mexico


In [178]:
df_customers.dtypes

CustomerID      object
CustomerName    object
ContactName     object
Address         object
City            object
PostalCode      object
Country         object
dtype: object

In [179]:
# Convert CustomerID to str in both tables so the join keys have the same dtype
# (it is int64 in df_orders but object in df_customers)

df_customers['CustomerID'] = df_customers['CustomerID'].astype(str)
df_orders['CustomerID'] = df_orders['CustomerID'].astype(str)
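
Casting both keys to `str` works, but another option is to make the Customers key numeric so that it matches Orders. The sketch below assumes the `object` dtype is caused by a few non-numeric entries in the sheet; that cause is a guess, and the toy data (including the stray `"n/a"` row) is invented for illustration.

```python
import pandas as pd

# Toy stand-ins; the stray "n/a" row is an assumed cause of the object dtype
customers = pd.DataFrame({
    "CustomerID": ["1", "2", "n/a", "3"],
    "CustomerName": ["A", "B", "?", "C"],
})
orders = pd.DataFrame({"OrderID": [10, 11], "CustomerID": [2, 3]})

# Coerce non-numeric IDs to NaN, drop those rows, then cast the key to int64
customers = (
    customers.assign(CustomerID=pd.to_numeric(customers["CustomerID"], errors="coerce"))
             .dropna(subset=["CustomerID"])
             .astype({"CustomerID": "int64"})
)

# Both keys are now int64, so the left join lines up without string casts
merged = orders.merge(customers, on="CustomerID", how="left")
print(merged)
```

Keeping the key numeric also keeps a numeric sort order, whereas string keys sort lexicographically ("10" before "2").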

### Do a Left Join:  Orders and Customers  
- This Join will give us the CustomerName column, which is what we want.  
- If this Join is successful, the following should be true:  
  - The number of Rows should equal the number of Rows in the Left Table (Orders) = 196, assuming each CustomerID appears only once in Customers
  - The number of Columns should be the Left Table Cols (5) + Right Table Cols (7) - 1 = 11 (the shared CustomerID key column appears only once)
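
The checks listed above can also be expressed in code. The following is a minimal sketch on toy data, not the notebook's actual tables: it uses the `validate` and `indicator` options of `pd.merge` to confirm the right table has unique keys and to flag orders with no matching customer. The CustomerID 999 is made up to show an unmatched row.

```python
import pandas as pd

orders = pd.DataFrame({
    "OrderID": [10248, 10249, 10250],
    "CustomerID": [90, 81, 999],   # 999 has no match in customers (made up)
})
customers = pd.DataFrame({
    "CustomerID": [81, 90],
    "CustomerName": ["Tradição Hipermercados", "Wilman Kala"],
})

merged = pd.merge(
    orders, customers,
    on="CustomerID", how="left",
    validate="many_to_one",   # raises MergeError if a CustomerID repeats in customers
    indicator=True,           # adds a _merge column: "both" or "left_only"
)

# Rows match the left table; columns = left + right - 1 shared key, + 1 for _merge
assert merged.shape[0] == orders.shape[0]
assert merged.shape[1] == orders.shape[1] + customers.shape[1] - 1 + 1

# Orders that found no customer keep their row, with NaN in CustomerName
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["OrderID", "CustomerID"]])
```

Dropping `indicator=True` gives exactly the "left + right - 1" column count described in the list.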

In [180]:
# Left Join: Orders and Customers 
df_orders_custnames = pd.merge(df_orders, df_customers, on='CustomerID', how='left')

print("Number of Rows:  ", df_orders_custnames.shape[0])
print("Number of Columns:  ", df_orders_custnames.shape[1])
df_orders_custnames.head(2)

Number of Rows:   196
Number of Columns:   11


,OrderID,CustomerID,EmployeeID,OrderDate,ShipperID,CustomerName,ContactName,Address,City,PostalCode,Country
0,10248,90,5,1996-07-04,3,Wilman Kala,Matti Karttunen,Keskuskatu 45,Helsinki,21240,Finland
1,10249,81,6,1996-07-05,1,Tradição Hipermercados,Anabela Domingues,"Av. Inês de Castro, 414",São Paulo,05634-030,Brazil


### Plot the results  
- Use groupby to get the aggregated values we want to plot  
- **Note**:  We don't have to use *melt()* because we just want a simple Bar Chart (not a Stacked or Grouped Bar Chart) 

In [181]:
# Groupby:  CustomerName
df_orders_by_custname = df_orders_custnames.groupby('CustomerName').count()

print("Number of Rows:  ", df_orders_by_custname.shape[0])
print("Number of Columns:  ", df_orders_by_custname.shape[1])
df_orders_by_custname.head()

Number of Rows:   74
Number of Columns:   10


CustomerName,OrderID,CustomerID,EmployeeID,OrderDate,ShipperID,ContactName,Address,City,PostalCode,Country
Ana Trujillo Emparedados y helados,1,1,1,1,1,1,1,1,1,1
Antonio Moreno Taquería,1,1,1,1,1,1,1,1,1,1
Around the Horn,2,2,2,2,2,2,2,2,2,2
B's Beverages,1,1,1,1,1,1,1,1,1,1
Berglunds snabbköp,3,3,3,3,3,3,3,3,3,3
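
One thing to watch when grouping after a Left Join: by default `groupby` drops rows whose key is missing, so orders without a matching customer would silently vanish from the counts. In this notebook both groupings return 74 rows, which suggests every order found a customer. The sketch below uses made-up data to show the difference when that is not the case.

```python
import pandas as pd

# Toy result of a left join where one order found no customer (made-up data)
merged = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "CustomerName": ["Around the Horn", None, "Around the Horn", "B's Beverages"],
})

# Default: rows whose CustomerName is missing are dropped from the result
counts_default = merged.groupby("CustomerName")["OrderID"].count()

# dropna=False keeps a NaN group so unmatched orders remain visible
counts_all = merged.groupby("CustomerName", dropna=False)["OrderID"].count()

print(counts_default.sum(), counts_all.sum())
```

Comparing the sum of the grouped counts with the number of rows in the merged dataframe is a quick way to spot dropped orders.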


In [182]:
# Get CustomerName out of the index
df_orders_by_custname.reset_index(inplace=True)

print("Number of Rows:  ", df_orders_by_custname.shape[0])
print("Number of Columns:  ", df_orders_by_custname.shape[1])
df_orders_by_custname.head()

Number of Rows:   74
Number of Columns:   11


,CustomerName,OrderID,CustomerID,EmployeeID,OrderDate,ShipperID,ContactName,Address,City,PostalCode,Country
0,Ana Trujillo Emparedados y helados,1,1,1,1,1,1,1,1,1,1
1,Antonio Moreno Taquería,1,1,1,1,1,1,1,1,1,1
2,Around the Horn,2,2,2,2,2,2,2,2,2,2
3,B's Beverages,1,1,1,1,1,1,1,1,1,1
4,Berglunds snabbköp,3,3,3,3,3,3,3,3,3,3


In [183]:
# Plot Vertical Bar Chart:  Count of Orders by Customer  
fig = px.bar(df_orders_by_custname, 
             x='CustomerName', 
             y='OrderID',
             width=900,
             height=600,
             labels={},
             template='seaborn',
             title='Question 1b: How Many Orders has each customer placed? (with Join)')

fig.update_yaxes(showgrid=True,
                title_text='Number of Orders')

fig.show()

# Question 1c: Who Are Our Best Customers in terms of Number of Orders?  
- One way to get this is to sort the dataframe - greatest to least  
- And then create a new dataframe with just the first 5 rows of the sorted dataframe 
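
The two steps above (sort, then keep the first 5 rows) can be sketched as follows. The toy `counts` frame stands in for the grouped dataframe and its values are invented; `nlargest` is shown as an equivalent shortcut.

```python
import pandas as pd

# Toy stand-in for the grouped dataframe (OrderID holds the order count)
counts = pd.DataFrame({
    "CustomerName": ["A", "B", "C", "D", "E", "F", "G"],
    "OrderID": [3, 10, 1, 7, 6, 2, 5],
})

# Option 1: sort greatest to least, then keep the first 5 rows
top5_sorted = counts.sort_values("OrderID", ascending=False).head(5)

# Option 2: nlargest does the sort and the cut in one call
top5_nlargest = counts.nlargest(5, "OrderID")

print(top5_sorted)
```

Either result can be passed to `px.bar` in place of the full dataframe to plot only the top customers.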


- **My References**  
  - [Sorting](../0_References/1_Pandas_Reference/Sorting.ipynb#Sorting-by-One-Column)   

### Sort the dataframe - greatest to least

In [184]:
# Sort

df_orders_by_custname.sort_values('OrderID', ascending=False, inplace=True)

In [185]:
# Plot Vertical Bar Chart:  Count of Orders by Customer  
fig = px.bar(df_orders_by_custname, 
             x='CustomerName', 
             y='OrderID',
             text='OrderID',
             width=900,
             height=500,
             #labels={},
             template='seaborn',
             title='Question 1c: Who Are Our Best Customers in terms of Number of Orders?')

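# Order counts are whole numbers, so use the integer format 'd'
# ('.1s' rounds to one significant digit and would label 15 as 20)
fig.update_yaxes(showgrid=False,
                 title_text='Number of Orders',
                 tick0=0,
                 dtick=5,
                 tickformat='d',
                 hoverformat='d')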

fig.update_xaxes(title_text='Customer')

fig.update_traces(textposition='auto',
                 #texttemplate='.f'
                 )

fig.show()



# Part 2: Merging the *Orders* and *Employees* Tables  

1. Merge the following w3schools data tables: *Orders* and *Employees*  
2. Check that your Left Join was successful 
3. Sort the dataframe by # of Orders - Greatest to Least
4. Plot the Top Employees based on # of Orders  


### Read the Orders data

In [186]:
# Read the w3schools Orders data
df_orders = pd.read_excel("Data/w3schools_Data.xlsx", "Orders")

print("Number of Rows:  ", df_orders.shape[0])
print("Number of Columns:  ", df_orders.shape[1])
df_orders.head(3)

Number of Rows:   196
Number of Columns:   5


,OrderID,CustomerID,EmployeeID,OrderDate,ShipperID
0,10248,90,5,1996-07-04,3
1,10249,81,6,1996-07-05,1
2,10250,34,4,1996-07-08,2


### Read the Employees data

In [187]:
# Read the w3schools Employees data
df_employees = pd.read_excel("Data/w3schools_Data.xlsx", "Employees")

print("Number of Rows:  ", df_employees.shape[0])
print("Number of Columns:  ", df_employees.shape[1])
df_employees.head(3)

Number of Rows:   10
Number of Columns:   6


,EmployeeID,LastName,FirstName,BirthDate,Photo,Notes
0,1,Davolio,Nancy,25180,EmpID1.pic,Education includes a BA in psychology from Col...
1,2,Fuller,Andrew,19043,EmpID2.pic,Andrew received his BTS commercial and a Ph.D....
2,3,Leverling,Janet,23253,EmpID3.pic,Janet has a BS degree in chemistry from Boston...


### Left Join: Orders (Left) and Employees

In [188]:
df_orders_by_emp = pd.merge(df_orders, df_employees, on='EmployeeID', how='left')

print("Number of Rows:  ", df_orders_by_emp.shape[0])
print("Number of Columns:  ", df_orders_by_emp.shape[1])
df_orders_by_emp.head(3)

Number of Rows:   196
Number of Columns:   10


,OrderID,CustomerID,EmployeeID,OrderDate,ShipperID,LastName,FirstName,BirthDate,Photo,Notes
0,10248,90,5,1996-07-04,3,Buchanan,Steven,20152,EmpID5.pic,Steven Buchanan graduated from St. Andrews Uni...
1,10249,81,6,1996-07-05,1,Suyama,Michael,23194,EmpID6.pic,Michael is a graduate of Sussex University (MA...
2,10250,34,4,1996-07-08,2,Peacock,Margaret,21447,EmpID4.pic,Margaret holds a BA in English literature from...


### Plot the results  
- Use groupby to get the aggregated values we want to plot  
- **Note**:  We don't have to use *melt()* because we just want a simple Bar Chart (not a Stacked or Grouped Bar Chart) 

In [189]:
# Group By: LastName
df_orders_by_emp = df_orders_by_emp.groupby('LastName').count()

df_orders_by_emp

LastName,OrderID,CustomerID,EmployeeID,OrderDate,ShipperID,FirstName,BirthDate,Photo,Notes
Buchanan,11,11,11,11,11,11,11,11,11
Callahan,27,27,27,27,27,27,27,27,27
Davolio,29,29,29,29,29,29,29,29,29
Dodsworth,6,6,6,6,6,6,6,6,6
Fuller,20,20,20,20,20,20,20,20,20
King,14,14,14,14,14,14,14,14,14
Leverling,31,31,31,31,31,31,31,31,31
Peacock,40,40,40,40,40,40,40,40,40
Suyama,18,18,18,18,18,18,18,18,18
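
Grouping by `LastName` alone works here because the nine last names in the output are all different. If two employees shared a last name, their orders would be added together. A hedged sketch of a safer grouping, using invented employees and a hypothetical `NumOrders` column name:

```python
import pandas as pd

orders = pd.DataFrame({"OrderID": [1, 2, 3, 4], "EmployeeID": [1, 2, 2, 3]})
employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "LastName": ["King", "Fuller", "King"],   # two made-up employees share a name
    "FirstName": ["Robert", "Andrew", "Ann"],
})

merged = orders.merge(employees, on="EmployeeID", how="left")

# Grouping by name alone merges both Kings into one bar
by_name = merged.groupby("LastName")["OrderID"].count()

# Grouping by the key plus the name keeps each employee separate
by_employee = (
    merged.groupby(["EmployeeID", "LastName"])["OrderID"]
          .count()
          .reset_index(name="NumOrders")
)

print(by_employee)
```

The name column can still be used for the x-axis labels while the ID keeps the groups distinct.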


### Sort the dataframe - greatest to least and Plot

In [190]:
# Sort

df_orders_by_emp.sort_values('OrderID', ascending=False, inplace=True)

df_orders_by_emp

LastName,OrderID,CustomerID,EmployeeID,OrderDate,ShipperID,FirstName,BirthDate,Photo,Notes
Peacock,40,40,40,40,40,40,40,40,40
Leverling,31,31,31,31,31,31,31,31,31
Davolio,29,29,29,29,29,29,29,29,29
Callahan,27,27,27,27,27,27,27,27,27
Fuller,20,20,20,20,20,20,20,20,20
Suyama,18,18,18,18,18,18,18,18,18
King,14,14,14,14,14,14,14,14,14
Buchanan,11,11,11,11,11,11,11,11,11
Dodsworth,6,6,6,6,6,6,6,6,6


In [191]:
# Pop LastName out as its own column
df_orders_by_emp.reset_index(inplace=True)

In [192]:
# Plot Vertical Bar Chart:  Count of Orders by Employee
fig = px.bar(df_orders_by_emp, 
             x='LastName', 
             y='OrderID',
             text='OrderID',
             width=900,
             height=500,
             #labels={},
             template='plotly_dark',
             title='Question 2: Who are our Top Employees Based on Number of Orders?')

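# '.s' and '.f' are not valid d3 format strings (the precision digits are missing);
# order counts are whole numbers, so use the integer format 'd'
fig.update_yaxes(showgrid=False,
                 title_text='Number of Orders',
                 tick0=0,
                 dtick=5,
                 tickformat='d',
                 hoverformat='d')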

fig.update_xaxes(title_text='Employee')
fig.update_traces(textposition='auto',
                 #texttemplate='.f'
                 )

fig.show()