# Data Prep
When building the bank fraud graph database, real data is likely to have data quality issues. <br>
For example, a single address may be written in two different ways: '123 Porter Street 15390' vs '123 PORTER STREET'. <br>
This file contains initial thoughts and investigations into how to address such data quality issues.

Author: Mei Yong <br>
github.com/mei-yong/BankFraudDetection

### Potential Problems
* Inconsistent whitespace
* Addresses with different components missing
* Addresses in different casing, i.e. upper or lower case
* Telephone numbers in different formats
* National ID numbers in different formats
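
The formatting problems in this list (whitespace, casing, phone and ID formats) can often be reduced by normalising values before any matching is attempted. The following is a minimal sketch using only the standard library. The function names and the chosen rules (upper-casing, keeping digits only, stripping separators) are assumptions for illustration, not a method taken from this project. Addresses with missing components are not solved by normalisation and still need fuzzy matching.

```python
import re

def normalise_text(value: str) -> str:
    """Trim, collapse internal whitespace, and upper-case a free-text field."""
    return re.sub(r"\s+", " ", value).strip().upper()

def normalise_phone(value: str) -> str:
    """Keep only the digits of a telephone number (assumed rule)."""
    return re.sub(r"\D", "", value)

def normalise_national_id(value: str) -> str:
    """Drop separators and upper-case any letters in an ID number (assumed rule)."""
    return re.sub(r"[^0-9A-Za-z]", "", value).upper()

print(normalise_text("  123 Porter   Street "))   # 123 PORTER STREET
print(normalise_phone("+44 (0)20 7946-0958"))     # 4402079460958
print(normalise_national_id("ab-12 34 56-c"))     # AB123456C
```

Keeping digits only discards the leading `+` of international numbers, so a real pipeline may need a country-aware rule instead.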

#### fuzzywuzzy - pip install fuzzywuzzy
https://en.wikipedia.org/wiki/Levenshtein_distance <br>
https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/ <br>
https://github.com/seatgeek/fuzzywuzzy
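
For intuition, the Levenshtein distance described in the Wikipedia link can be sketched in pure Python as below. This is only an illustration of the underlying idea: fuzzywuzzy reports similarity scores from 0 to 100 rather than raw edit distances, and this sketch is not the library's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))                # 3
print(levenshtein("new york mets", "new york meats"))  # 1
```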

In [None]:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

# String similarity
fuzz.ratio("new york mets","new york meats") # 96

# Partial string similarity - subsets in this case
fuzz.partial_ratio("yankees","new york yankees") # 100
fuzz.partial_ratio("new york mets","new york yankees") # 69

# Token sort - when 2 labels are the same thing but in a different order
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") # 100

# Token set - when the 2 phrases share words but differ greatly in length and word order
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") # 90

# Extracting the best matches for a query from a list of choices
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=2) # [('New York Jets', 100), ('New York Giants', 78)]
process.extractOne("cowboys", choices) # ("Dallas Cowboys", 90)
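
To show roughly what a token-sort comparison does with the address example from the introduction, here is a stand-in built on the standard library's `difflib`. It is an assumption-laden sketch, not fuzzywuzzy's code, and its scores may differ from `fuzz.token_sort_ratio`. The name `token_sort_score` is hypothetical.

```python
from difflib import SequenceMatcher

def token_sort_score(a: str, b: str) -> int:
    """Similarity from 0 to 100 after lower-casing and sorting whitespace-separated tokens."""
    key_a = " ".join(sorted(a.lower().split()))
    key_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, key_a, key_b).ratio())

# Same tokens in a different order compare as identical
same = token_sort_score("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")

# Same address, but one record has an extra trailing token and different casing
address = token_sort_score("123 Porter Street 15390", "123 PORTER STREET")

print(same, address)
```

The address pair scores high but below 100 because of the extra token, so any matching rule would need a threshold chosen against real data.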
    

#### fuzzymatcher - pip install fuzzymatcher
https://github.com/RobinL/fuzzymatcher <br>
https://hub.gke.mybinder.org/user/robinl-fuzzymatcher-wzp8zn0a/notebooks/examples.ipynb

In [None]:
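import fuzzymatcher
import pandas as pd

# df_left and df_right are assumed to be existing pandas DataFrames
# that contain the columns named in the calls below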

# Basic left join using only 1 identifier column
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on = "ons_name", right_on = "os_name")

# Left join using more than 1 identifier column
left_on = ["fname", "lname",  "dob"]
right_on = ["name", "surname", "date"]
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

# Outputs the link table without actually joining the dfs - Note that if left_id_col or right_id_col are omitted, a unique id will be autogenerated
left_on = ["fname", "mname", "lname",  "dob"]
right_on = ["name", "middlename", "surname", "date"]
fuzzymatcher.link_table(df_left, df_right, left_on, right_on, left_id_col = "id", right_id_col = "id")

#### autocorrect - pip install autocorrect
https://stackoverflow.com/questions/13928155/spell-checker-for-python/48280566 <br>
Open question: will this work for things like addresses, which may not contain dictionary words?

In [None]:
from autocorrect import Speller
spell = Speller(lang='en')
print(spell('hte'))

#### spellchecker - pip install pyspellchecker
https://pypi.org/project/pyspellchecker/

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

#### Write a custom spelling corrector
http://norvig.com/spell-correct.html

In [None]:
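# Characters tried for replacements and insertions (lowercase English letters)
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # Return every string that is one edit (delete, transpose, replace, insert) away from `word`
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)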
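
The `edits1` function above only generates candidate strings; a corrector also needs a vocabulary to choose among them. The sketch below is a simplified version of the approach in the Norvig article, limited to candidates one edit away. `WORD_COUNTS` is a hypothetical, hand-made vocabulary for illustration; a real corrector would count words from a large corpus or from the project's own address data.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical word counts, invented for this example
WORD_COUNTS = Counter({"street": 50, "road": 30, "avenue": 20, "porter": 5})

def edits1(word):
    """Every string one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in ALPHABET if b]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of words that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}

def correction(word):
    """Most frequent known candidate, or the word itself if none is known."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])

print(correction("stret"))  # street
print(correction("raod"))   # road
print(correction("15390"))  # 15390
```

Tokens with no known candidate, such as numbers, are returned unchanged, which is relevant to the earlier question about addresses that contain non-dictionary words. The sketch assumes input is already lower-cased.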