### Data Prerequisites

Splink is a data linking package for linking and deduplicating records, typically across datasets that lack a shared unique identifier, by comparing the values in columns such as names and dates of birth. To use the package effectively, the input datasets need to meet the prerequisites below.


### Unique IDs

- Each input dataset must have a unique id column whose values are unique _within_ the dataset. It need not be unique across datasets. Splink uses this column to identify and keep track of each record, not as a matching key, so a duplicated id within a dataset makes the affected rows ambiguous. For example, one dataset might hold a unique value for each customer in a column called "customer_id", while another holds a unique value for each employee in a column called "employee_id".
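
A quick check for this requirement can be sketched with pandas. The column name `unique_id`, the sample rows and the helper `find_duplicate_ids` are assumptions made for illustration; they are not part of Splink.

```python
import pandas as pd

# Hypothetical input data; "unique_id" is an assumed column name.
df = pd.DataFrame(
    {
        "unique_id": [1, 2, 2, 4],
        "first_name": ["john", "jane", "jane", "amir"],
    }
)


def find_duplicate_ids(frame: pd.DataFrame, id_column: str) -> pd.DataFrame:
    """Return every row whose id value occurs more than once in the frame."""
    return frame[frame.duplicated(subset=id_column, keep=False)]


duplicates = find_duplicate_ids(df, "unique_id")
print(df["unique_id"].is_unique)  # False, because id 2 appears twice
print(duplicates)
```

An empty result from `find_duplicate_ids` means the column satisfies the uniqueness requirement for that dataset.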

### Conformant input datasets

- Input datasets must be conformant. This means that the same column names must be used for the same data in every dataset. For example, if one dataset has a column called "date of birth" and another has a column called "dob", one of them should be renamed so that both use the same name; otherwise Splink cannot compare the values across the datasets.
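
As a minimal sketch, the renaming could be done with pandas before the data is passed to Splink. The two sample datasets and the target names `unique_id` and `dob` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical datasets that describe the same kind of data under different names.
customers = pd.DataFrame(
    {"customer_id": [1], "date of birth": ["1990-01-31"], "surname": ["smith"]}
)
employees = pd.DataFrame(
    {"employee_id": [7], "dob": ["1990-01-31"], "surname": ["smith"]}
)

# Map each dataset's own names onto one shared set of names (assumed here).
customers = customers.rename(
    columns={"customer_id": "unique_id", "date of birth": "dob"}
)
employees = employees.rename(columns={"employee_id": "unique_id"})

print(list(customers.columns))
print(list(employees.columns))
```

Keeping the mapping in an explicit dictionary per dataset makes it easy to review which source column feeds each shared column.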

### Pre-cleaning

- Data should be pre-cleaned to ensure consistency. This includes things like ensuring that dates are in the same format, that text is in the same case (e.g. all uppercase or all lowercase), and that any missing or invalid data has been dealt with. For example, if one dataset has dates in the format "yyyy-mm-dd" and another has dates in the format "mm/dd/yyyy", these should be converted to the same format before using Splink.
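
The date example above can be sketched with Python's standard library. The helper `to_iso_date` and the sample values are hypothetical, and the sketch assumes the source format of each dataset is known in advance.

```python
from datetime import datetime


def to_iso_date(value: str, source_format: str) -> str:
    """Parse a date string in a known source format and return it as yyyy-mm-dd."""
    return datetime.strptime(value, source_format).strftime("%Y-%m-%d")


# Hypothetical values from a dataset that stores dates as mm/dd/yyyy.
us_style = ["01/31/1990", "12/05/2001"]
standardized = [to_iso_date(d, "%m/%d/%Y") for d in us_style]
print(standardized)  # ['1990-01-31', '2001-12-05']
```

`datetime.strptime` raises a `ValueError` for a value that does not fit the stated format, which helps surface invalid dates instead of silently passing them through.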

### Ensure nulls are consistently and correctly represented

- Any null values in the datasets must be true nulls, not zero-length strings. Splink treats the two differently: a true null is recognized as missing data, whereas an empty string is an ordinary value, so two empty strings can be compared as if they agreed. For example, if a cell in a dataset contains an empty string, it should be replaced with a true null value in order for Splink to handle it correctly.
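
One way to make this replacement, assuming the data is held in a pandas DataFrame, is sketched below. The sample rows are hypothetical, and `np.nan` stands in for the true null here.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which missing values were stored as empty strings.
df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "first_name": ["john", "", "jane"],
        "surname": ["smith", "jones", ""],
    }
)

# Replace exact zero-length strings with a true null in every column.
cleaned = df.replace("", np.nan)
print(cleaned.isna().sum())
```

This only targets exact zero-length strings; values that contain nothing but whitespace would need trimming first.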

Here are some additional examples of data cleaning rules that can improve the accuracy of data matching:

- Trimming leading and trailing whitespace from string values. For example, if a dataset contains the value " john smith " in a name column, it should be trimmed to "john smith" to avoid mismatches with the value "john smith" in another dataset.
- Removing special characters from string values. For example, if a dataset contains the value "john smith!" in a name column, it should be cleaned to "john smith" to avoid mismatches with the value "john smith" in another dataset.
- Converting string values to a consistent case. For example, if a dataset contains the values "John Smith" and "john smith" in a name column, they should be converted to a consistent case (e.g. all lowercase) to avoid mismatches.
- Standardizing date formats. For example, if a dataset contains dates in the formats "yyyy-mm-dd", "mm/dd/yyyy", and "dd-mm-yyyy", they should all be converted to a consistent format (e.g. "yyyy-mm-dd") to avoid mismatches.
- Replacing abbreviations with full words. For example, if a dataset contains the values "St." and "Street" in an address column, they should be standardized to the full word (e.g. "Street") to avoid mismatches.
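
The string rules above can be combined into one small helper, sketched here with the standard library only. The function `clean_text` and the `ABBREVIATIONS` map are hypothetical; a real abbreviation list depends on the data (for instance, "st" can also mean "saint").

```python
import re

# Hypothetical abbreviation map; extend or replace it to suit the data.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def clean_text(value: str) -> str:
    """Trim, lowercase, strip special characters and expand abbreviations."""
    value = value.strip().lower()
    # Drop anything that is not a letter, digit, underscore or whitespace.
    value = re.sub(r"[^\w\s]", "", value)
    # Collapse runs of whitespace left behind by the removals.
    value = re.sub(r"\s+", " ", value).strip()
    # Expand whole-word abbreviations only, never substrings of longer words.
    words = [ABBREVIATIONS.get(word, word) for word in value.split(" ")]
    return " ".join(words)


print(clean_text(" John Smith! "))  # john smith
print(clean_text("12 High St."))    # 12 high street
```

Applying the same function to the corresponding column in every input dataset keeps the cleaning consistent across them.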

Overall, implementing these data cleaning rules can help improve the accuracy of data matching by ensuring that the data is consistent and free of any anomalies that could cause mismatches.