## 2021: Week 5 - Dealing with Duplication

Challenge by: Jenny Martin

Have you ever been working with a dataset in Tableau Desktop and noticed some duplication occurring? Of course, this is something you can fix with some potentially tricky LODs or Table Calc filters, but wouldn't it be nicer for your dataset to be viz ready before heading into Desktop? 

If you attended the Tableau Fringe Festival last year, this concept may feel familiar, as I did a quick demo explaining why I, personally, would prefer to use Prep to solve my duplication issues. You can find the video here if you like

### Input

The dataset we'll be working with for this challenge follows the same theme as the Fringe Festival. We have information relating to which of our Clients are attending our training sessions. Also included in our dataset is which Account Managers look after which Clients. However, we have historical information about Account Ownership which is leading to duplication. So how can we fix it?

![img](https://1.bp.blogspot.com/-d0bHWDavGPk/X-HeazG_L2I/AAAAAAAAAp8/OwmFLFlx8zc6TOKgk6vDLSMTEtsys9JtwCLcBGAsYHQ/w640-h173/Joined%2BDataset.png)

### Requirements

- Input the data 
- For each Client, work out who the most recent Account Manager is (help)
- Filter the data so that only the most recent Account Manager remains (help)
    - Be careful not to lose any attendees from the training sessions!
- In some instances, the Client ID has changed along with the Account Manager. Ensure only the most recent Client ID remains
- Output the data

### Output

![img2](https://1.bp.blogspot.com/-Kr_f9TYhdhg/X-HgeNwEzPI/AAAAAAAAAqQ/CDMGEy5J64Yml_B_JXOfX4sB95xEtuccQCLcBGAsYHQ/w640-h212/Output%2B2020W5.png)

7 fields
- Training
- Contact Email
- Contact Name
- Client
- Client ID
- Account Manager
- From Date

13,528 rows (13,529 including headers)

In [312]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [313]:
df = pd.read_csv("./data/Joined Dataset.csv")
df.shape

(13623, 7)

In [314]:
df.head()

Unnamed: 0,Training,Contact Email,Contact Name,Client,Client ID,Account Manager,From Date
0,Prep 101 - 2020-10-01,abagael.matresse@brauninc.com,Abagael Matresse,Braun Inc,1200,Xiaoxuan Ma,31/12/2019
1,Prep 101 - 2020-10-01,abagail.macconnell@lakinllc.com,Abagail MacConnell,Lakin LLC,924,Lucy Stevenson,01/01/2019
2,Prep 101 - 2020-10-01,abagail.moodey@raynorinc.com,Abagail Moodey,Raynor Inc,444,Nancy Smith,01/07/2015
3,Prep 101 - 2020-10-01,abby.eager@paucekgroup.com,Abby Eager,Paucek Group,893,Nancy Smith,20/09/2018
4,Prep 101 - 2020-10-01,abelard.mechell@lehner.swiftanddickinson.com,Abelard Mechell,"Lehner, Swift and Dickinson",1323,Xiaoxuan Ma,31/12/2019


In [315]:
# For each Client, work out who the most recent Account Manager is
df.loc[:, "From Date"] = pd.to_datetime(df["From Date"])
df = df.sort_values(by=["Client", "From Date"], ascending=[True, False]).reset_index(drop=True)
recent_account_manager = df.drop_duplicates(subset=["Client", "From Date"], keep="last").drop_duplicates(subset="Client", keep="first")
recent_account_manager.sample(10)

Unnamed: 0,Training,Contact Email,Contact Name,Client,Client ID,Account Manager,From Date
3017,Tableau 104 - 2020-09-17,waly.hainey`@durgan.wilkinsonandwest.com,Waly Hainey`,"Durgan, Wilkinson and West",872,Louisa James,2017-01-07
12544,Tableau 104 - 2020-09-17,zia.easterling@turcotte-borer.com,Zia Easterling,Turcotte-Borer,1300,Lucy Stevenson,2019-07-31
4546,Tableau 104 - 2020-09-17,winston.sleep@hammes.schneiderandbeahan.com,Winston Sleep,"Hammes, Schneider and Beahan",1346,Nancy Smith,2019-11-22
10080,Tableau 104 - 2020-10-08,tonya.crump@reilly-dickinson.com,Tonya Crump,Reilly-Dickinson,568,Louisa James,2015-01-07
7910,Tableau 104 - 2020-10-08,dana.beseke@lynch.dibbertandmitchell.com,Dana Beseke,"Lynch, Dibbert and Mitchell",1392,Lucy Stevenson,2020-04-03
3194,Tableau 104 - 2020-09-17,sigfried.perillio@erdman-ruecker.com,Sigfried Perillio,Erdman-Ruecker,2412,Xiaoxuan Ma,2020-03-31
3079,Tableau 104 - 2020-09-17,vivianne.rowlatt@eichmann-kessler.com,Vivianne Rowlatt,Eichmann-Kessler,696,Xiaoxuan Ma,2020-07-02
10307,Tableau 104 - 2020-09-17,willie.secrett@robertsinc.com,Willie Secrett,Roberts Inc,1010,Lucy Stevenson,2020-04-03
13268,Tableau 104 - 2020-10-08,valenka.anthon@wisokyllc.com,Valenka Anthon,Wisoky LLC,1088,Nancy Smith,2018-01-07
5060,Tableau 104 - 2020-10-08,whitaker.naldrett@herman-davis.com,Whitaker Naldrett,Herman-Davis,1003,Louisa James,2015-01-07


In [316]:
# Filter the data so that only the most recent Account Manager remains
test = df.groupby(["Client", "Client ID", "Account Manager"])["From Date"].apply(lambda df_: df_.sort_values(ascending=False)).reset_index().drop_duplicates(subset=["Client"], keep="last")
test = test.drop("level_3", axis=1)
test.loc[test["Client"] == "Braun Inc"]

Unnamed: 0,Client,Client ID,Account Manager,From Date
1510,Braun Inc,2460,Oscar Adams,2020-06-30


In [317]:
# Preparing the original df with columns remained
df = df.loc[:, ["Training", "Contact Email", "Contact Name", "Client"]]
df.head()

Unnamed: 0,Training,Contact Email,Contact Name,Client
0,Prep 101 - 2020-10-01,emilie.booton@abbott-runolfsson.com,Emilie Booton,Abbott-Runolfsson
1,Prep 101 - 2020-10-01,marleen.draper@abbott-runolfsson.com,Marleen Draper,Abbott-Runolfsson
2,Prep 101 - 2020-10-01,moira.beneteau@abbott-runolfsson.com,Moira Beneteau,Abbott-Runolfsson
3,Prep 101 - 2020-10-01,quinn.padkin@abbott-runolfsson.com,Quinn Padkin,Abbott-Runolfsson
4,Prep 102 - 2020-09-10,emilie.booton@abbott-runolfsson.com,Emilie Booton,Abbott-Runolfsson


In [318]:
# Merge the original df with filtered data
final_output = df.merge(test, on="Client", how="left").drop_duplicates()
final_output.shape

(13528, 7)

In [319]:
final_output.to_csv("./output/Week5_output.csv")