# Constraints, Levenshtein Distance and Edit Distance

In this small notebook, we use the dataset from the lecture to try out different methods for finding inconsistencies.

As always, we first import our required dependencies!

In [1]:
import sklearn
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import CustomSeriesValidation, LeadingWhitespaceValidation, TrailingWhitespaceValidation, CanConvertValidation, MatchesPatternValidation, IsDistinctValidation, InRangeValidation, InListValidation

For the first time, we are going to create a dataset ourselves:

In [38]:


df = pd.DataFrame(df_dict)
df.set_index('Id', inplace=True)

In [39]:
df.head()

| Id | First Name | Last Name | Age | Mail |
|---|---|---|---|---|
| 1 | Hilko | Janßen | 26 | mail@hilko.eu |
| 2 | Hilko | Janssen | Twenty Six | mail@hilko.eu |
| 3 | Christoph | Kolumbus | 226 | christoph@kolumbus.online |
| 4 |  | Homer | 26 | homer@ |
| 5 | Theresa | May | 61 | theresa@may.com |


Now we can define a Schema with one Column entry per field and validate the DataFrame against it:

In [44]:
not_none_validation = CustomSeriesValidation(lambda s: ~s.isnull(), 'is none / null')
str_length_gt_0 = CustomSeriesValidation(lambda s: s.str.len() > 0, 'length is not greater than 0')

schema = Schema([
    Column('First Name', [not_none_validation, str_length_gt_0]),
    Column('Last Name', [not_none_validation, str_length_gt_0]),
    Column('Age', [InRangeValidation(0, 120)]),
    Column('Mail', [IsDistinctValidation()])
])

errors = schema.validate(df)

for error in errors:
    print(error)

{row: 2, column: "Age"}: "Twenty Six" was not in the range [0, 120)
{row: 2, column: "Mail"}: "mail@hilko.eu" contains values that are not unique
{row: 3, column: "Age"}: "226" was not in the range [0, 120)
is none / null
length is not greater than 0
{row: 6, column: "Last Name"}: "" length is not greater than 0
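
To make the constraints behind the schema explicit, here is a minimal plain-Python sketch of the same four checks (not null, non-empty, age range, distinct mail). It does not use pandas_schema: the function name `check_rows` and the dictionary layout are hypothetical, the rows are copied from the `df.head()` output, and the sixth row is left out because its values are not shown. Storing the numeric ages as integers is also an assumption.

```python
# Rows copied from the df.head() output; the sixth row is not shown there
# and is therefore omitted. Numeric ages are assumed to be integers.
rows = {
    1: {"First Name": "Hilko", "Last Name": "Janßen", "Age": 26, "Mail": "mail@hilko.eu"},
    2: {"First Name": "Hilko", "Last Name": "Janssen", "Age": "Twenty Six", "Mail": "mail@hilko.eu"},
    3: {"First Name": "Christoph", "Last Name": "Kolumbus", "Age": 226, "Mail": "christoph@kolumbus.online"},
    4: {"First Name": None, "Last Name": "Homer", "Age": 26, "Mail": "homer@"},
    5: {"First Name": "Theresa", "Last Name": "May", "Age": 61, "Mail": "theresa@may.com"},
}


def check_rows(rows, min_age=0, max_age=120):
    """Return (row_id, column, message) tuples for every violated constraint."""
    problems = []
    seen_mails = set()
    for row_id, row in rows.items():
        # Name columns must be present and non-empty.
        for column in ("First Name", "Last Name"):
            value = row[column]
            if value is None:
                problems.append((row_id, column, "is none / null"))
            elif len(value) == 0:
                problems.append((row_id, column, "length is not greater than 0"))

        # Age must be a number in the half-open range [min_age, max_age).
        age = row["Age"]
        if not isinstance(age, int) or not (min_age <= age < max_age):
            problems.append((row_id, "Age", f"{age!r} was not in the range [{min_age}, {max_age})"))

        # Every mail address may appear only once; later repeats are reported.
        mail = row["Mail"]
        if mail in seen_mails:
            problems.append((row_id, "Mail", f"{mail!r} contains values that are not unique"))
        seen_mails.add(mail)
    return problems


for problem in check_rows(rows):
    print(problem)
```

Note that neither this sketch nor the schema above flags `homer@` as an invalid address, since only uniqueness is checked for the Mail column.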


## Distances

In [46]:
import Levenshtein

In [51]:
last_names = df['Last Name'].values

print(last_names)

['Janßen' 'Janssen' 'Kolumbus' 'Homer' 'May' '']


In [62]:
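for i in range(len(last_names)):
    for j in range(i + 1, len(last_names)):
        s1 = last_names[i]
        s2 = last_names[j]

        # Pad the shorter name with trailing spaces so both have the same length.
        # Levenshtein.distance does not require this, and the padding counts
        # towards the distance (e.g. 3 instead of 2 for Janßen / Janssen).
        width = max(len(s1), len(s2))
        s1 = s1.ljust(width)
        s2 = s2.ljust(width)
        print(f'Difference between {s1} and {s2} is {Levenshtein.distance(s1, s2)}')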

Difference between Janßen  and Janssen is 3
Difference between Janßen   and Kolumbus is 8
Difference between Janßen and Homer  is 6
Difference between Janßen and May    is 5
Difference between Janßen and        is 6
Difference between Janssen  and Kolumbus is 8
Difference between Janssen and Homer   is 7
Difference between Janssen and May     is 6
Difference between Janssen and         is 7
Difference between Kolumbus and Homer    is 7
Difference between Kolumbus and May      is 8
Difference between Kolumbus and          is 8
Difference between Homer and May   is 5
Difference between Homer and       is 5
Difference between May and     is 3
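
To see what the distance measures, the following is a minimal pure-Python sketch of the classic dynamic-programming formulation of the Levenshtein distance. It is not the implementation used by the `Levenshtein` package, and the function name `levenshtein` and the threshold of 2 are assumptions chosen for illustration. It works directly on strings of different lengths, so no padding is needed, and it can be used to flag name pairs that are suspiciously similar.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Last names as printed earlier in the notebook.
last_names = ['Janßen', 'Janssen', 'Kolumbus', 'Homer', 'May', '']

# Hypothetical threshold: pairs at or below it are candidates for the same person.
max_distance = 2
close_pairs = [
    (s1, s2, levenshtein(s1, s2))
    for s1, s2 in combinations(last_names, 2)
    if levenshtein(s1, s2) <= max_distance
]
print(close_pairs)
```

Without padding, `Janßen` and `Janssen` are two edits apart (substitute `ß` with `s`, insert one `s`), whereas the padded comparison reports 3 because the extra space has to be removed as well.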
