# Dedupe Contacts

## Problem

Contacts need to be uploaded into NB.

The data extract from CiviCRM contains Contact records that are duplicated on the `Primary Email` field. Further, these duplicate records have unique Contact IDs. There are also secondary email fields whose value can match the same record's `Primary Email`, or even *another record's* `Primary Email`.

NB cannot hold two contacts with the same email, so we need to clean these up to ensure that every unique contact also has a unique primary email.

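The three patterns described above can be reproduced on a toy table. This is only a minimal sketch: the rows are invented, and the column name `Secondary Email` is an assumption, since the actual secondary email column names in the extract are not shown here.

```python
import pandas as pd

# Invented rows; 'Secondary Email' is a hypothetical column name.
contacts = pd.DataFrame({
    "Contact ID": [1, 2, 3, 4],
    "Primary Email": ["ann@example.org", "ann@example.org",
                      "bob@example.org", "cat@example.org"],
    "Secondary Email": [None, "ann@example.org", "cat@example.org", None],
})

# Pattern 1: the same Primary Email recorded under different Contact IDs.
shared_primary = contacts[contacts.duplicated(subset=["Primary Email"], keep=False)]

# Pattern 2: the Secondary Email repeats the same record's Primary Email.
same_record = contacts[contacts["Secondary Email"] == contacts["Primary Email"]]

# Pattern 3: the Secondary Email is some record's Primary Email, but not its own.
in_primaries = contacts["Secondary Email"].isin(contacts["Primary Email"])
differs_from_own = contacts["Secondary Email"] != contacts["Primary Email"]
other_record = contacts[in_primaries & differs_from_own]

print(shared_primary["Contact ID"].tolist())
print(same_record["Contact ID"].tolist())
print(other_record["Contact ID"].tolist())
```

Note that the notebook below only acts on the first pattern; the secondary email checks are shown purely to illustrate the problem.
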
## Solution

*TBC*

The `Primary Email` field will define the 'uniqueness' of a contact. Because one Primary Email can map to many Contact IDs, we need to decide which Contact ID to 'keep'. That choice also affects other relationships in the database, e.g., Members, Member relationships, Contributions ...

Currently, the user is simply prompted to keep the first or last record in a set with the same Primary Email field. The associated record and its Contact ID are then kept. The rest (which have the same Primary Email but a different Contact ID) are written to a separate file for safe keeping and potential future follow-up.

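To make the effect of the first/last choice concrete, here is a small sketch on invented data. The helper name `split_on_email` is hypothetical and not part of the notebook; it only wraps the two pandas calls that the notebook uses later.

```python
import pandas as pd

# Invented rows: one email shared by three Contact IDs.
contacts = pd.DataFrame({
    "Contact ID": [101, 102, 103, 104],
    "Primary Email": ["ann@example.org", "ann@example.org",
                      "ann@example.org", "bob@example.org"],
})

def split_on_email(df, keep):
    """Return (kept, set_aside) Contact IDs for keep='first' or keep='last'."""
    kept = df.drop_duplicates(subset=["Primary Email"], keep=keep)
    set_aside = df[df.duplicated(subset=["Primary Email"], keep=keep)]
    return kept["Contact ID"].tolist(), set_aside["Contact ID"].tolist()

for keep in ("first", "last"):
    print(keep, split_on_email(contacts, keep))
```

"First" and "last" refer to row order in the extract, so the surviving Contact ID depends on how the source file happens to be sorted.
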

## Order of Ops

#### 1. Identify ALL records duplicated on [Contact ID] but do not remove them
- generate an output file of these records `file_dupeContactIDRecords` - for safe keeping, and to investigate why they occur and what to do with them

#### 2. Remove all records that have blank [Primary Email] fields
- generate an output file of these records `file_blankRecords` - for safe keeping

#### 3. Remove records that have the same `dedupeField`, keeping only the first (or as defined by user)
- generate an output file of the discarded records `file_dupeRecords` - for safe keeping and potential future followup
- generate an output file of the remaining records `file_withoutDupes` - this is the file that gets loaded into NB

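The three steps above can be sketched as a single function run on a toy table. The function name `dedupe_contacts` and the sample rows are illustrative assumptions; the notebook cells below perform the same steps one at a time and write each result to a file.

```python
import pandas as pd

def dedupe_contacts(df, dedupe_field="Primary Email", keep="first"):
    """Split df into the four outputs described in the steps above."""
    # Step 1: every row whose Contact ID appears more than once (nothing is removed).
    dupe_contact_ids = df[df.duplicated(subset=["Contact ID"], keep=False)]
    # Step 2: rows with a blank dedupe field are set aside.
    blank_mask = df[dedupe_field].isna()
    blanks = df[blank_mask]
    non_blank = df[~blank_mask]
    # Step 3: among the non-blank rows, keep one row per dedupe value.
    dupe_mask = non_blank.duplicated(subset=[dedupe_field], keep=keep)
    return dupe_contact_ids, blanks, non_blank[dupe_mask], non_blank[~dupe_mask]

# Invented sample data.
sample = pd.DataFrame({
    "Contact ID": [1, 2, 2, 3, 4],
    "Primary Email": ["ann@example.org", "bob@example.org",
                      "bob@example.org", None, "ann@example.org"],
})
dupe_ids, blanks, dupes, kept = dedupe_contacts(sample)
print(kept["Contact ID"].tolist())
```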

In [None]:
import pandas as pd
from pathlib import Path

# file_withDupes - filename of original file 
# file_withoutDupes - filename of file with (blanks and) duplicates removed
# file_blankRecords - filename of file with records where dedupeField is blank
# file_dupeRecords - filename of file with records where dedupeField is not blank and duplicates an existing record
# file_dupeContactIDRecords - filename of file with all records where ContactID is identical
# df_withDupes - data frame of original file
# df_withoutDupes - data frame with duplicates removed
# mask_dupeRecords - boolean mask indicating which rows are duplicates
# df_dupeRecords - data frame of records identified as the duplicates


In [None]:
# get/set parameters here

# get the name of the original file OR default to: ./01.DataExtract/KR_Contacts_Names_Emails_00000_80000.csv
file_withDupesInput = input("\nEnter filename - no value will default to `./01.DataExtract/KR_Contacts_Names_Emails_00000_80000.csv`\n")
if file_withDupesInput == "":
    file_withDupes = Path(
        "./01.DataExtract/KR_Contacts_Names_Emails_00000_80000.csv")
    print(file_withDupes)
else:
    file_withDupes = Path(file_withDupesInput)
    
# get the name of the column to check for duplicates
dedupeField = input("\nEnter name of column to check for duplicates - no value will default to `Primary Email`\n")
if dedupeField == "":
    dedupeField = "Primary Email"
    print(dedupeField)
    
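# get whether to keep the first or last record as the master
keepRecord = input("\nEnter record to keep. Valid values are 'first' or 'last' - no value will default to 'first'\n").strip().lower()
if keepRecord == "":
    keepRecord = "first"
    print(keepRecord)
elif keepRecord not in ("first", "last"):
    # any other value would make drop_duplicates/duplicated fail later on
    raise ValueError(f"keepRecord must be 'first' or 'last', got '{keepRecord}'")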

In [None]:
# This section defines the output filenames

# file of ALL duplicate ContactID records will be named:
# [file_withDupes]_deduped_on[DupeField]_keep[keepRecord]_dupeContactIDRecords.csv
file_dupeContactIDRecords = (
    file_withDupes.stem +
    '_deduped_on' + dedupeField.title().replace(" ", "") +
    '_keep' + keepRecord.capitalize() +
    '_dupeContactIDRecords' +
    '.csv')

# file of records with blank dedupeField will be named:
# [file_withDupes]_deduped_on[DupeField]_keep[keepRecord]_blankRecords.csv
file_blankRecords = (
    file_withDupes.stem + 
    '_deduped_on' + dedupeField.title().replace(" ", "") +
    '_keep' + keepRecord.capitalize() +
    '_blankRecords' +
    '.csv')

# file of duplicate records will be named:
# [file_withDupes]_deduped_on[DupeField]_keep[keepRecord]_dupeRecords.csv
file_dupeRecords = (
    file_withDupes.stem +
    '_deduped_on' + dedupeField.title().replace(" ", "") +
    '_keep' + keepRecord.capitalize() +
    '_dupeRecords' +
    '.csv')

# file without dupes will be named:
# [file_withDupes]_deduped_on[DupeField]_keep[keepRecord].csv
file_withoutDupes = (
    file_withDupes.stem +
    '_deduped_on' + dedupeField.title().replace(" ", "") +
    '_keep' + keepRecord.capitalize() +
    '.csv')

print(f"\n{file_withDupes} --> extract of only duplicate 'Contact ID' records \n= {file_dupeContactIDRecords}")
print(f"\n{file_withDupes} --> extract of records with blank '{dedupeField}' \n= {file_blankRecords}")
print(f"\n{file_withDupes} \n- {file_blankRecords} \n- {file_dupeRecords} \n= {file_withoutDupes}")
print(f"\n{file_withDupes} \n- {file_blankRecords} \n- {file_withoutDupes} \n= {file_dupeRecords}")

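The four filenames above share one pattern, so they could be built by a single helper. This is a sketch only; `build_output_name` is a hypothetical name and not part of the notebook.

```python
from pathlib import Path

def build_output_name(source, dedupe_field, keep, suffix=""):
    """Return '<stem>_deduped_on<Field>_keep<Keep><suffix>.csv'."""
    field_part = dedupe_field.title().replace(" ", "")
    return f"{Path(source).stem}_deduped_on{field_part}_keep{keep.capitalize()}{suffix}.csv"

source = Path("./01.DataExtract/KR_Contacts_Names_Emails_00000_80000.csv")
for suffix in ("_dupeContactIDRecords", "_blankRecords", "_dupeRecords", ""):
    print(build_output_name(source, "Primary Email", "first", suffix))
```

Because only the `stem` of the input path is used, the output files are written to the current working directory rather than to the folder that holds the source file.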

In [None]:
# read the original file into a dataframe df_withDupes
df_withDupes = pd.read_csv(file_withDupes)
print(df_withDupes.head())

Let's find unique records by `Contact ID`

In [None]:
# create a dataframe with records where duplicate 'Contact ID' are removed, keeping only the first
df_withoutDupeContactID = df_withDupes.drop_duplicates(subset=['Contact ID'], keep='first')
print(df_withoutDupeContactID.head())

Let's find all records duplicated on `Contact ID` and write them to a file

In [None]:
# create a boolean mask flagging every record whose 'Contact ID' appears more than once (keep=False marks all copies)
mask_withoutDupeContactID = df_withDupes.duplicated(subset=['Contact ID'], keep=False)
print(mask_withoutDupeContactID)

In [None]:
# apply boolean mask to the original dataframe to create a dataframe of all the records duplicated on the 'Contact ID' field
df_dupeContactIDRecords = df_withDupes[mask_withoutDupeContactID]
print(df_dupeContactIDRecords.head())

In [None]:
# write the above dataframe of records with a duplicated 'Contact ID' to a csv
df_dupeContactIDRecords.to_csv(file_dupeContactIDRecords,index=False)

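The `keep=False` argument used for the Contact ID mask behaves differently from `keep='first'`, which is used later for the dedupe field. A small illustration on invented IDs:

```python
import pandas as pd

ids = pd.DataFrame({"Contact ID": [7, 8, 7, 9, 7]})

# keep=False flags every member of a duplicated group, including the first.
all_members = ids.duplicated(subset=["Contact ID"], keep=False)
# keep='first' leaves the first occurrence unflagged and flags only the repeats.
repeats_only = ids.duplicated(subset=["Contact ID"], keep="first")

print(all_members.tolist())   # [True, False, True, False, True]
print(repeats_only.tolist())  # [False, False, True, False, True]
```
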
NOTE: At this point we have not yet removed any records from the original file

In [None]:
# create a dataframe of only the records where dedupeField is not blank/NaN; blanks are removed first because they would otherwise all be treated as duplicates of each other
df_withoutBlanks = df_withDupes.dropna(subset=[dedupeField])
print(df_withoutBlanks.head())

In [None]:
# create a boolean mask indicating which records have a blank dedupeField
mask_blankDedupeFieldRecords = df_withDupes[dedupeField].isna()
print(mask_blankDedupeFieldRecords.head())

In [None]:
# apply boolean mask to original file to create dataframe of only the records with a blank dedupeField
df_blankDedupeFieldRecords = df_withDupes[mask_blankDedupeFieldRecords]
print(df_blankDedupeFieldRecords.head())

In [None]:
# write the above dataframe of records with a blank dedupeField to a csv
df_blankDedupeFieldRecords.to_csv(file_blankRecords,index=False)

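One caveat: `isna()` and `dropna()` only catch truly missing values. Whitespace-only entries, and addresses that differ only in case or padding, would pass through as distinct values. The sketch below shows one possible normalisation step; it is not part of the notebook, the helper name `normalise_emails` is hypothetical, and treating addresses as case-insensitive is an assumption that should be checked against how NB compares emails.

```python
import pandas as pd

def normalise_emails(emails):
    """Trim whitespace, lower-case, and turn empty strings into missing values."""
    cleaned = emails.str.strip().str.lower()
    return cleaned.mask(cleaned == "")

# Invented values: padded/mixed-case, plain, whitespace-only, and missing.
raw = pd.Series(["  Ann@Example.org ", "ann@example.org", "   ", None])
clean = normalise_emails(raw)

print(clean.isna().tolist())
print(clean.iloc[0] == clean.iloc[1])
```
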
In [None]:
# create a boolean mask indicating which non-blank records are duplicates on dedupeField (the "kept" record is not flagged)
mask_dupeRecords = df_withoutBlanks.duplicated(subset=[dedupeField], keep=keepRecord)
print(mask_dupeRecords.head())

In [None]:
# apply boolean mask to df_withoutBlanks to create a dataframe of only the duplicate records (not including the "kept" record)
df_dupeRecords = df_withoutBlanks[mask_dupeRecords]
print(df_dupeRecords.head())

In [None]:
# write the above dataframe of duplicate records to a csv
df_dupeRecords.to_csv(file_dupeRecords,index=False)

In [None]:
# create a dataframe without dupes, by dedupeField, and keeping the keepRecord
df_withoutDupes = df_withoutBlanks.drop_duplicates(subset=[dedupeField], keep=keepRecord)
print(df_withoutDupes.head())

In [None]:
# write the unique records (duplicates removed) to a csv
df_withoutDupes.to_csv(file_withoutDupes, index=False)
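
As a final sanity check, the blank, duplicate and kept records should together account for every row of the original file. A minimal sketch of such a check, using a hypothetical helper `check_partition` and invented data:

```python
import pandas as pd

def check_partition(original, blanks, dupes, kept, dedupe_field="Primary Email"):
    """Raise AssertionError if the three outputs do not account for every input row."""
    assert len(blanks) + len(dupes) + len(kept) == len(original), "row counts do not add up"
    assert kept[dedupe_field].notna().all(), "kept records still contain blank values"
    assert kept[dedupe_field].is_unique, "kept records still contain duplicate values"
    return True

# Invented data, split the same way as in the cells above.
original = pd.DataFrame({
    "Contact ID": [1, 2, 3, 4],
    "Primary Email": ["ann@example.org", None, "ann@example.org", "bob@example.org"],
})
blanks = original[original["Primary Email"].isna()]
non_blank = original.dropna(subset=["Primary Email"])
dupes = non_blank[non_blank.duplicated(subset=["Primary Email"], keep="first")]
kept = non_blank.drop_duplicates(subset=["Primary Email"], keep="first")

print(check_partition(original, blanks, dupes, kept))
```

In this notebook the equivalent call would be `check_partition(df_withDupes, df_blankDedupeFieldRecords, df_dupeRecords, df_withoutDupes, dedupeField)`.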