# Generate Training Set (Part 2)

These training sets compare long forms (LF) that share the same normalized short form (NSF).

## Setup

In [2]:
import pandas as pd
import numpy as np
import itertools
from fuzzywuzzy import fuzz

#### Load Dataset

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/modules/Step2Output_Clinical_Abbreviation_Acronym_Crosswalk.csv',
                 sep='|',
                 header=0,
                 index_col=False,
                 na_filter=False,
                 dtype=object)

In [4]:
df.sample(3, random_state=0)

Unnamed: 0,GroupID,EntryID,SF,SFUI,NormSF,NSFUI,PrefSF,LF,LFUI,NormLF,PrefLF,Source,SFEUI,LFEUI,Type,Score,Count,Frequency,UMLS.CUI
33582,,E033583,α-GPDH,S103760,α_gpdh,N072157,,alpha-glycerophosphate dehydrogenase,L041419,alpha glycerophosphate dehydrogenase,,UMLS,E0412935,E0769306,acronym,,,,
70982,,E070983,PRPPs,S056625,prpps,N248835,,phosphoribosyl pyrophosphate synthetase,L128319,phosphoribosylpyrophosphate synthetase,,UMLS,E0571504,E0047486,acronym,,,,
192802,,E192803,IPH,S036667,iph,N193753,,intra-peritoneal haemorrhage,L097757,,,UMLS,E0703883,E0703882,acronym,,,,


## Initial Filtering

For each long form, identify the **most similar string** within the same **NSF** that is **not equivalent** to it.

Similarity is measured with the standard Levenshtein distance similarity ratio from [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy).
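The filtering idea can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual implementation: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio` (scores may differ slightly from fuzzywuzzy), it treats "not equivalent" as plain string inequality, and the function names, NSF identifiers, and records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import groupby


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (difflib stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def most_similar_within_groups(records):
    """Map (nsf, long_form) to its closest non-identical long form in the same NSF.

    `records` is an iterable of (nsf_id, long_form) pairs. Long forms with no
    other candidate in their group map to None.
    """
    results = {}
    # groupby only merges adjacent items, so sort by the NSF identifier first.
    ordered = sorted(records, key=lambda r: r[0])
    for nsf, group in groupby(ordered, key=lambda r: r[0]):
        long_forms = sorted({lf for _, lf in group})
        for lf in long_forms:
            # Exclude the string itself: we want similar, not identical.
            candidates = [other for other in long_forms if other != lf]
            if not candidates:
                results[(nsf, lf)] = None
                continue
            best = max(candidates, key=lambda other: similarity(lf, other))
            results[(nsf, lf)] = (best, similarity(lf, best))
    return results


# Toy records with made-up NSF identifiers.
toy_records = [
    ("N1", "intra-peritoneal haemorrhage"),
    ("N1", "intraperitoneal hemorrhage"),
    ("N1", "idiopathic pulmonary hemosiderosis"),
    ("N2", "alpha glycerophosphate dehydrogenase"),
]
matches = most_similar_within_groups(toy_records)
for key, value in matches.items():
    print(key, "->", value)
```

Sorting before grouping matters here because `itertools.groupby` only merges consecutive records with the same key.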

#### Use Normalized Long Form Where Possible

In [5]:
# Fall back to the raw long form when no normalized long form is available
df['TrainLF'] = np.where(df['NormLF'] == '', df['LF'], df['NormLF'])
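The comparison against `''` works because the dataset was loaded with `na_filter=False`, which keeps empty fields as empty strings instead of converting them to `NaN`. A minimal sketch of the fallback on a two-row toy frame (the values are taken from the sample output; the variable name `toy` is illustrative):

```python
import numpy as np
import pandas as pd

# Two rows mirroring the sample: one with a normalized long form, one without.
toy = pd.DataFrame({
    "LF": ["alpha-glycerophosphate dehydrogenase", "intra-peritoneal haemorrhage"],
    "NormLF": ["alpha glycerophosphate dehydrogenase", ""],
})

# Use NormLF when present, otherwise fall back to LF.
toy["TrainLF"] = np.where(toy["NormLF"] == "", toy["LF"], toy["NormLF"])
print(toy["TrainLF"].tolist())
```

Had the empty fields been read as `NaN`, the equality test would never match and the fallback would need `isna()` instead.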

In [6]:
df.sample(3, random_state=0)

Unnamed: 0,GroupID,EntryID,SF,SFUI,NormSF,NSFUI,PrefSF,LF,LFUI,NormLF,PrefLF,Source,SFEUI,LFEUI,Type,Score,Count,Frequency,UMLS.CUI,TrainLF
33582,,E033583,α-GPDH,S103760,α_gpdh,N072157,,alpha-glycerophosphate dehydrogenase,L041419,alpha glycerophosphate dehydrogenase,,UMLS,E0412935,E0769306,acronym,,,,,alpha glycerophosphate dehydrogenase
70982,,E070983,PRPPs,S056625,prpps,N248835,,phosphoribosyl pyrophosphate synthetase,L128319,phosphoribosylpyrophosphate synthetase,,UMLS,E0571504,E0047486,acronym,,,,,phosphoribosylpyrophosphate synthetase
192802,,E192803,IPH,S036667,iph,N193753,,intra-peritoneal haemorrhage,L097757,,,UMLS,E0703883,E0703882,acronym,,,,,intra-peritoneal haemorrhage


In [7]:
# Keep only the first 100 rows
df = df.head(100)
df.shape

(100, 20)

#### Compute Similarities

In [None]:
df = df.sort_values(by=['NSFUI'])