## Byline Similarities of names using [jellyfish](https://github.com/jamesturk/jellyfish)

Bylines were extracted from the last paragraph of each letter. This was done in [Extract bylines from raw letters](/notebooks/extract_letter_bylines.ipynb), which attempts to identify the different parts of a byline, including name, occupation and location. This notebook is concerned with linking the named identities, despite errors introduced by the OCR process, and with counting the number of letters each of them sent to the editor.

Compared with semantic-based approaches such as the one provided by the spaCy NLP library (see the [attempt at Byline Similarities of names using spaCy](/notebooks/cluster_similar_reader_names_spacy.ipynb)), the Jaro measure, one of the string similarity algorithms provided by the jellyfish package and called here as `jaro_distance`, performed much better overall and was therefore used in the research. It scores a pair of strings between 0 and 1, where 1.0 means the two strings are identical.
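
To make the scoring concrete, the following is a minimal textbook sketch of the Jaro measure in plain Python. It is not jellyfish's implementation: the name `jaro_similarity_sketch` is hypothetical, transpositions are rounded down (an assumption), and its scores may differ slightly from what jellyfish reports. It only illustrates how matched characters and transpositions combine into a score between 0 and 1.

```python
def jaro_similarity_sketch(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (hypothetical helper, not jellyfish's code)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they sit within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


# Classic textbook pair: six matches, one transposition (TH vs HT).
print(round(jaro_similarity_sketch("MARTHA", "MARHTA"), 4))  # 0.9444
```

Because the score is normalised by the length of both strings, extra noise characters in only one of the names pull the score down, which is why cleaning the names first matters.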

To maximise the quality of the matches, the `cleanup_name` function removes noise characters such as dots and other non-alphanumeric symbols, drops single-character words, and collapses repeated spaces.

In [1]:
import re

def cleanup_name(_input):
    output = re.sub(r'[^A-Za-z0-9 ]+', '', _input) # strip dots and other non-alphanumeric characters
    output = re.sub(r"\b[a-zA-Z0-9]\b", "", output) # remove single-character words
    output = re.sub(' +', ' ', output) # remove successive spaces
    output = output.strip() # remove leading and trailing spaces
    return output

This step makes it possible to establish similarities that would otherwise be missed because of OCR errors. For example, in the case below the two byline identities “HM Mwafusi” and “A H..M. Mwafusi” become identical after clean-up.

In [2]:
import jellyfish

id1 = 'HM Mwafusi'
id2 = 'A H..M. Mwafusi'
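# note: if jaro_distance is unavailable in your jellyfish version, jaro_similarity is the newer name
ratio = jellyfish.jaro_distance(id1, id2)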
print(f"{'Similarity before clean-up:':>30} {ratio}")

id1 = cleanup_name(id1)
id2 = cleanup_name(id2)
ratio = jellyfish.jaro_distance(id1, id2)
print(f"{'Similarity after clean-up:':>30} {ratio}")

   Similarity before clean-up: 0.8388888888888889
    Similarity after clean-up: 1.0
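
The effect of each clean-up stage can be traced on the example above. The sketch below repeats the four regular-expression steps of `cleanup_name` one at a time; `cleanup_stages` is a hypothetical helper introduced only for this illustration.

```python
import re


def cleanup_stages(name):
    """Return the intermediate result of each clean-up step (illustrative helper)."""
    no_symbols = re.sub(r'[^A-Za-z0-9 ]+', '', name)          # strip non-alphanumerics
    no_singles = re.sub(r'\b[a-zA-Z0-9]\b', '', no_symbols)   # drop single-character words
    one_space = re.sub(' +', ' ', no_singles)                 # collapse repeated spaces
    return [no_symbols, no_singles, one_space, one_space.strip()]


for raw in ['A H..M. Mwafusi', 'H.M. Mwafusi', 'H M Mwafusi']:
    print(f'{raw!r:>20} -> {cleanup_stages(raw)}')
```

One consequence worth keeping in mind: initials written with dots (`H.M.`) are fused into `HM` and kept, whereas initials separated by spaces (`H M`) are treated as single-character words and dropped, so those two spellings of the same name do not reduce to the same string.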


### Generating similarities

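Two key parameters were used to decide on matches. `min_length` excludes names that are too short to compare reliably (only names longer than 5 characters are considered), and the similarity threshold was set to `0.84`, the level at and above which the results were found to be reliable.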

In [19]:
import os
import pandas as pd

current_directory = os.getcwd()
prj_root = os.path.dirname(current_directory)
data_dir = f'{prj_root}/data'
bylines_and_files_dir = f'{data_dir}/bylines_and_files'

proc_year = "1978"

data = pd.read_csv(f'{bylines_and_files_dir}/{proc_year}.tsv', 
                   delimiter='\t', 
                   usecols=['reader_name', 'reader_location', 'reader_title', 'reader_org', 'txt_name', 'letter_date', 'lines_count'],
                   na_filter=False
                  )
min_length = 5 # min length of valid string to compare

edges_list = []
nodes_list = []
skip_list = []
for i, irow in data.iterrows():
    ireader_name = irow['reader_name']
    ireader_location = irow['reader_location']
    iletter_date = irow['letter_date']
    itxt_name = irow['txt_name']
    
    if i not in skip_list:  # skip rows already matched to an earlier name with high similarity
        subj_len = len(ireader_name)
        # generate a unique id for the record across the whole corpus
        uid = f'{itxt_name[4:9]}{i:>04}'
    
        if subj_len > min_length:  # consider valid strings only

            subject_name = cleanup_name(ireader_name)
            subject_name = str(subject_name)
            ignore_these = [] # list of matches to ignore same page letters
            for j, jrow in data.iloc[i+1:].iterrows(): # inner loop begins after subject to avoid duplications
                jreader_name = jrow['reader_name']
                jtxt_name = jrow['txt_name']
                jletter_date = jrow['letter_date']
                
                comp_len = len(jreader_name)

                # again, consider valid strings only
                # (length is measured on the raw name, before clean-up)
                if comp_len > min_length:
                    comp_name = cleanup_name(jreader_name)
                    comp_name = str(comp_name)
                    sim_val = jellyfish.jaro_distance(subject_name, comp_name)

                    if sim_val >= 0.84:
                        ujd = f'{jtxt_name[4:9]}{j:>04}'
                        foundlings = [x for x in ignore_these if x.startswith(jtxt_name[:10])]
                        if len(foundlings) > 0:
                            # this letter is on the same page as an earlier match, and the same
                            # reader cannot have more than one letter in the same column, so skip
                            continue
                        skip_list.append(j)
                        print(f'{uid:>09} | {ujd:>09} | {subject_name:>20} | {comp_name:>20} | {sim_val:.3f} | {jletter_date:>10}')
                        
                        edges_list.append([ujd, uid, ireader_name, jreader_name, sim_val, jletter_date, jtxt_name])
                        
                        nodes_list.append([uid, subject_name, ireader_name, iletter_date, itxt_name])
                        nodes_list.append([ujd, comp_name, jreader_name, jletter_date, jtxt_name])
                        
                        ignore_these.append(jtxt_name)

903250003 | 903250004 |   the resovmed eters | athe resovmed aaters | 0.912 | 1978-01-03
903350083 | 905881736 | Tappeal to the Government to do | appeal to the Governme | 0.903 | 1978-12-02
903360087 | 903360092 | As far as Gema is concerned we | As far as Gema is concerned we | 1.000 | 1978-01-17
903380103 | 904731083 |          FpA Apollos |           FM Apollos | 0.869 | 1978-06-28
903400115 | 903880495 |      Wilfred Waititu |    Woailfred Waititu | 0.861 | 1978-03-18
903400115 | 903910518 |      Wilfred Waititu |      Wilfred Waititu | 1.000 | 1978-03-22
903400115 | 904200728 |      Wilfred Waititu |      Wilfred Waititu | 1.000 | 1978-04-26
903440144 | 904731083 |            FM Appoll |           FM Apollos | 0.896 | 1978-06-28
903440145 | 903480165 |  ion Mrs KeliNairobi | Ring Mrs KeliNairobi | 0.949 | 1978-01-31
903470162 | 903710348 |       Jimmy Mohammed |       Jimmy Mohammed | 1.000 | 1978-02-27
903470162 | 903750382 |       Jimmy Mohammed |       Jimmy Mohammed | 1.000 |
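
The core of the loop above, stripped of the clean-up call, the same-page filter and the node and edge bookkeeping, can be sketched as a small function. This is an illustrative simplification, not the notebook's code: `find_similar_pairs` and `ratio` are hypothetical names, and `difflib.SequenceMatcher` from the standard library stands in for jellyfish's Jaro measure so that the example runs without extra dependencies. Its scores will therefore not equal the ones printed above.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in similarity in [0, 1]; swap in a Jaro function if available."""
    return SequenceMatcher(None, a, b).ratio()


def find_similar_pairs(names, similarity=ratio, min_length=5, threshold=0.84):
    """Return (i, j, score) for each later name j that matches subject i."""
    pairs = []
    matched = set()  # rows already attached to an earlier subject
    for i, subject in enumerate(names):
        # A row that was already matched is not used as a subject again,
        # and names that are too short are never compared.
        if i in matched or len(subject) <= min_length:
            continue
        for j in range(i + 1, len(names)):
            candidate = names[j]
            if len(candidate) <= min_length:
                continue
            score = similarity(subject, candidate)
            if score >= threshold:
                matched.add(j)
                pairs.append((i, j, score))
    return pairs


names = ['Wilfred Waititu', 'Woailfred Waititu', 'Jimmy Mohammed',
         'Wilfred Waititu', 'FM']
for i, j, score in find_similar_pairs(names):
    print(f'{names[i]:>20} | {names[j]:>20} | {score:.3f}')
```

Passing the similarity function as a parameter keeps the two thresholds visible in one place and makes it easy to compare how different measures behave on the same names.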

### Create nodes and edges for visualization

In [20]:
# create the edges output
edge_headers = ['source', 'target', 'ireader_name', 'jreader_name', 'sim_val', 'jletter_date', 'jtxt_name']
sims_df = pd.DataFrame(edges_list, columns=edge_headers)

# export to edges.tsv
processed_tsv = os.path.join(bylines_and_files_dir, f'{proc_year}_jellyfish_edges.tsv')
sims_df.to_csv(processed_tsv, 
                sep='\t',
                encoding='utf-8', 
                index=False,
                columns = edge_headers)

# create the nodes output
node_headers = ['id', 'label', 'ireader_name', 'iletter_date', 'itxt_name']
nodes_df = pd.DataFrame(nodes_list, columns=node_headers)

# drop duplicate nodes, keeping the first occurrence of each 'id'
nodes_df.drop_duplicates(subset ="id", keep = 'first', inplace = True)

# export to nodes.tsv
processed_tsv = os.path.join(bylines_and_files_dir, f'{proc_year}_jellyfish_nodes.tsv')
nodes_df.to_csv(processed_tsv, 
                sep='\t',
                encoding='utf-8', 
                index=False,
                columns = node_headers)
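
One way to get from these edges to the number of letters per identity, which is the stated aim of this notebook, is to group ids that are connected to each other. The sketch below is an assumption about how that could be done, not something the notebook itself does: `group_identities` is a hypothetical helper built on a simple union-find, and the sample edges are a handful of id pairs copied from the output above.

```python
from collections import defaultdict


def group_identities(edges):
    """Group ids joined by edges into clusters (hypothetical helper)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Largest clusters first, each cluster sorted for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# (source, target) pairs taken from the printed matches above
sample_edges = [
    ('903880495', '903400115'),
    ('903910518', '903400115'),
    ('904200728', '903400115'),
    ('903710348', '903470162'),
    ('903750382', '903470162'),
]
for cluster in group_identities(sample_edges):
    print(len(cluster), cluster)
```

Each cluster size is then the number of letters attributed to one identity. With the data frame above, the same function could presumably be fed from the `source` and `target` columns of `sims_df`.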

### Other resources

[https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings](https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings)

[https://stackoverflow.com/a/55732255/754432](https://stackoverflow.com/a/55732255/754432) (How to draw heatmap)