# Reconciliations with Fuzzy Matching

In [18]:
%matplotlib inline
import fuzzywuzzy
from fuzzywuzzy import process, fuzz
import csv
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from pprint import pprint
from dateutil import parser
import re
import itertools



- Code for reconciling free-text transcript fields. A number of fields are defined below, with a few different methods for each field type. First, the code for reading in the file.

In [19]:
def read_clean_data(file_name):
    """Read the transcript CSV and group its rows by subject filename."""
    data = {}
    with open(file_name, 'r', newline='') as in_file:
        reader = csv.DictReader(in_file)
        headers = reader.fieldnames
        for row in reader:
            # Each subject (filename) has several transcripts; collect them in a list.
            data.setdefault(row['filename'], []).append(row)
        # Lines read from the file, header line included.
        print(reader.line_num)
    return data, headers

In [20]:
data,headers = read_clean_data('Brit_40transcriptsforhabitat.csv')
pprint(headers)
print(len(data))

137
['',
 '',
 'subject_id',
 'filename',
 'user_name',
 'created_at',
 'Collected by',
 'Collection date',
 'Collector Number',
 'Country',
 'County',
 'Habitat',
 'Description',
 'Species',
 'State',
 'Skip',
 '',
 '']
40


- **Scientific Name:**  
    - _Method_: Fuzzy match on tokens in order (`fuzz.ratio`). A name counts as a match if it is entirely contained in the other. Note that `fuzz.ratio` alone does not score a contained name as 100, whereas `fuzz.token_set_ratio` does.
    
    
    - a. If all 3 transcripts match at 100 on tokens: full agreement --> go to 2.
    - b. If 2 of 3 match at 100: majority rule --> go to 2.
    - c. If no pair matches at 100, report the two highest-scoring transcripts and flag --> done.

- 2. To deal with varieties and hybrids (x in plant data), report the longest name (by string length or token number?)
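
The rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: `contains` and `reconcile_scientific_name` are hypothetical helpers, "longest" is taken to mean "most tokens", and the `difflib` fallback is a stand-in so the sketch still runs where fuzzywuzzy is not installed.

```python
from itertools import combinations

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Stand-in (assumption) so the sketch runs without fuzzywuzzy.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return round(100 * SequenceMatcher(None, a, b).ratio())


def contains(a, b):
    """True if every token of one name also appears in the other."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return ta <= tb or tb <= ta


def reconcile_scientific_name(names):
    """Return (values, flag) for the transcribed names of one subject."""
    names = [" ".join(n.split()) for n in names if n and n.strip()]
    pairs = list(combinations(names, 2))
    if not pairs:
        return names, "too few transcripts"
    matched = [pair for pair in pairs if contains(*pair)]
    if matched:
        flag = "full agreement" if len(matched) == len(pairs) else "majority rule"
        # Step 2: among the agreeing names, keep the one with the most tokens
        # so that varieties and hybrids are not truncated.
        agreeing = [n for n in names if any(n in pair for pair in matched)]
        return [max(agreeing, key=lambda n: len(n.split()))], flag
    # Case c: no pair agrees, so report the two closest names for review.
    best = max(pairs, key=lambda p: ratio(p[0].lower(), p[1].lower()))
    return list(best), "no match: review"


print(reconcile_scientific_name(
    ["Quercus alba", "Quercus alba var. latiloba", "Quercus rubra"]))
```

Whether "longest" should mean characters or tokens is still open; swapping the `key` in `max` is the only change needed to try the other option.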


In [25]:
#def fuzz_ratio_sn():

- **Collected By:**  
    - _Method_: `fuzz.token_sort_ratio()` sorts the tokens before comparing, so it handles names written in a different order. By default fuzzywuzzy lowercases both strings and replaces punctuation with whitespace before scoring, so capitalization and most punctuation differences are ignored (e.g. "Smith, J." and "j smith" score 100). NEED TO CHECK initials: "A.I" is split into the two tokens "a i", which is not the same as the single token "ai".
    
    
    - a. If all 3 transcripts match at 100 on tokens: full agreement, show the original form.
    - b. If 2 of 3 match at 100: majority rule, show the original form.
    - c. If no pair matches at 100, report the two highest-scoring transcripts in their original form.
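
A minimal sketch of these three cases, assuming one list of collector strings per subject. `reconcile_collector` is a hypothetical name, and the `difflib` fallback only approximates `fuzz.token_sort_ratio` so that the example runs without fuzzywuzzy.

```python
import re
from itertools import combinations

try:
    from fuzzywuzzy import fuzz
    token_sort_ratio = fuzz.token_sort_ratio
except ImportError:
    from difflib import SequenceMatcher

    def token_sort_ratio(a, b):
        """Stand-in (assumption): lowercase, punctuation to spaces, sort tokens."""
        def norm(s):
            return " ".join(sorted(re.sub(r"[^0-9a-z]+", " ", s.lower()).split()))
        return round(100 * SequenceMatcher(None, norm(a), norm(b)).ratio())


def reconcile_collector(names):
    """Return (values, flag), always reporting names in their original form."""
    names = [n.strip() for n in names if n and n.strip()]
    scored = sorted(
        ((token_sort_ratio(a, b), a, b) for a, b in combinations(names, 2)),
        key=lambda t: t[0], reverse=True)
    if not scored:
        return names, "too few transcripts"
    perfect = [t for t in scored if t[0] == 100]
    if len(perfect) == len(scored):
        return [names[0]], "full agreement"      # case a
    if perfect:
        return [perfect[0][1]], "majority rule"  # case b
    return [scored[0][1], scored[0][2]], "no match: review"  # case c


print(reconcile_collector(["Smith, J. A.", "J. A. Smith", "John Smith"]))
```

When several original forms agree, the sketch simply reports the first one; choosing the most frequent original spelling would be a reasonable alternative.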
    

In [None]:
#def collected_by():

- **Habitat and Description:** 

    - **General Idea**: _Want the label exactly as written._ 
        Transcribers were instructed to write the label exactly as it appears, so the first step is to look for exact matches. 
        If there is no exact match, we want to maximize information while minimizing interpretation. When people do not write what is on the label, it is generally because they have expanded an abbreviation (e.g. hwy --> highway), so we prefer the shortest string length. At the same time, we prefer the transcript with the most tokens to ensure that all words are accounted for.
        The main issue with these fields is that the information can often be found in either one: for example, habitat information written in the description column and vice versa. Therefore we first find the best label within each category and then compare across categories.
    
    - **Within Category Method**: _fuzz.token\_set\_ratio_ THEN _fuzz.token\_sort\_ratio_. First check whether all the words in one transcript are present in another, which gives a score of 100. Then look at the sort ratio score. Note that token\_sort\_ratio sorts the tokens before comparing, so it confirms that two transcripts contain the same words but does not check word order; _fuzz.ratio_ would be needed for that.
    
       - a. If all set scores are 100, check the sort scores and, if a comparison also scores 100, choose that comparison. --> go to 2.
       - b. If not all set scores are 100: NEED MORE LOGIC HERE.

       2. Do an exact match. --> If the transcripts differ, use the one with the shortest string length and the most words.
       
       

    - **Across Category Method**:
    - THEN take a frequency-based approach for both habitat and location and, secondarily, ask whether location and habitat match too closely (significant overlap).
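
One possible reading of step 2 and of the cross-field check is sketched below, using only the standard library. `choose_label`, `token_overlap`, and the 0.8 threshold are hypothetical. The tie-break order (most words first, then fewest characters) is an assumption, since the notes do not say which criterion wins.

```python
import re
from collections import Counter


def choose_label(candidates):
    """Pick one transcript: exact-match majority, else most words, then shortest."""
    texts = [" ".join(c.split()) for c in candidates if c and c.strip()]
    if not texts:
        return ""
    top, count = Counter(texts).most_common(1)[0]
    if count > 1:
        return top  # at least two transcribers typed exactly the same label
    # Most tokens keeps every word; fewest characters favours unexpanded
    # abbreviations such as "hwy" over "highway".
    return min(texts, key=lambda t: (-len(t.split()), len(t)))


def token_overlap(a, b):
    """Share of the smaller field's tokens that also occur in the other field."""
    ta = set(re.findall(r"[0-9a-z]+", a.lower()))
    tb = set(re.findall(r"[0-9a-z]+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


habitat = choose_label([
    "along hwy 5 in sandy soil",
    "along highway 5 in sandy soil",
    "along hwy 5",
])
description = "Shrub 2 m tall; along Hwy 5 in sandy soil"
print(habitat)
# Flag the subject when the two fields largely repeat each other.
print(token_overlap(habitat, description) >= 0.8)
```

The overlap measure is deliberately crude; it only signals that habitat text may have been entered in the other column and that the subject deserves a closer look.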


In [13]:
# Score every pair of transcripts for one subject on the Habitat field.
test = data['BRIT118160']
n = len(test)
pprint(n)

lst = []
for x, y in itertools.combinations(range(n), 2):
    score = fuzz.token_set_ratio(test[x]['Habitat'], test[y]['Habitat'])
    # If every pair scores 100, take the transcript with the most words.
    # If the transcripts differ in length, flag the subject: most likely
    # one transcript in the set is shorter than the others.
    lst.append((score, x, y))

# Highest-scoring pairs first.
slst = sorted(lst, key=lambda t: t[0], reverse=True)

pprint(slst)

3
[(100, 0, 1), (100, 0, 2), (100, 1, 2)]


In [16]:
def reconcile_text_fields(data, field):
    """Print pairwise token_set_ratio scores for one field, subject by subject."""
    # TODO: flag subjects where every transcript is blank, and return the
    # scores and comparisons instead of printing them.
    for key, xscripts in data.items():
        # Use this subject's own transcript count, not a global left over
        # from an earlier cell.
        n = len(xscripts)
        for x, y in itertools.combinations(range(n), 2):
            score = fuzz.token_set_ratio(xscripts[x][field], xscripts[y][field])
            print(key, score, x, y)
        # Stop after the first subject while testing.
        break
    return key

In [17]:
test = reconcile_text_fields(data,'Habitat')


BRIT118134 100 0 1
BRIT118134 100 0 2
BRIT118134 100 1 2
