## spaCy

Open the Gold Standard NACC file to see how the labeled anglicisms are POS- and NE-tagged.

In [4]:
import pandas as pd
import spacy
from collections import Counter
nlp_sp = spacy.load('es', parse=True, tag=True, entity=True)
nlp_en = spacy.load('en', parse=True, tag=True, entity=True)
print("Spanish:", len(nlp_sp.vocab), "English", len(nlp_en.vocab))

Spanish: 37940 English 57852


In [6]:
nlp_en.vocab["d3fasd"]

<spacy.lexeme.Lexeme at 0x122714948>

Create a function to combine punctuation with the preceding word

In [None]:
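def join_punctuation(seq, characters='.,:;?!'):
    """Yield tokens from seq with punctuation attached to the preceding token."""
    characters = set(characters)
    seq = iter(seq)
    try:
        current = next(seq)
    except StopIteration:
        # Empty input: nothing to yield (a bare next() would raise RuntimeError here)
        return

    for nxt in seq:
        if nxt in characters:
            # Punctuation mark: glue it onto the current token
            current += nxt
        else:
            yield current
            current = nxt

    yield current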

In [None]:
NACC_df = pd.read_csv('Data/NACC-GoldStandard.tsv', delimiter='\t', encoding='utf-8', header=0)
NACC_tokens = NACC_df["Token"].tolist()
NACC_punct_joined = join_punctuation(NACC_tokens)
NACC_text = " ".join(NACC_punct_joined)
NACC_spacy = nlp_sp(NACC_text)
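
Because spaCy re-tokenizes the joined text, the number of tokens in `NACC_spacy` is not guaranteed to equal the number of rows in `NACC_df`, and assigning a list of a different length to a dataframe column raises an error. A minimal sketch of a sanity check follows. The helper `first_mismatch` is hypothetical (it is not part of spaCy or pandas), and the token lists are made up for illustration, including the assumed split of "e-mail".

```python
from itertools import zip_longest

def first_mismatch(expected, actual):
    """Return (index, expected_token, actual_token) for the first difference, or None."""
    for i, (exp, act) in enumerate(zip_longest(expected, actual)):
        if exp != act:
            return i, exp, act
    return None

# Toy data (made up). With the real data one would compare
# NACC_tokens against [token.text for token in NACC_spacy].
gold_tokens = ["un", "e-mail", "urgente", "."]
retokenized = ["un", "e", "-", "mail", "urgente", "."]  # hypothetical re-tokenization

print(len(gold_tokens) == len(retokenized))
print(first_mismatch(gold_tokens, retokenized))
```

If the check reports a mismatch, the new columns cannot be assigned row by row until the two tokenizations are reconciled.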

Add new columns to the pandas dataframe for POS, NE, and Lemma

In [None]:
NACC_df["POS"] = [token.pos_ for token in NACC_spacy]
NACC_df["NE"] = [token.ent_iob_ for token in NACC_spacy]
NACC_df["Lemma"] = [token.lemma_ for token in NACC_spacy]

Filter the dataset to analyze only those tokens labeled as anglicisms

In [None]:
is_anglicism = NACC_df['Anglicism'] == "yes"
for value, count in Counter(NACC_df[is_anglicism]["POS"]).most_common():
    print(value, count)


In [None]:
is_openclass = NACC_df['POS'].isin(["NOUN", "VERB", "PROPN", "ADJ"])
# Percentage of all tokens that carry an open-class POS tag
print(round(100 * len(NACC_df[is_openclass]) / len(NACC_df), 2))

All anglicisms in the gold standard are tagged as {'NOUN': 62, 'VERB': 8, 'PROPN': 7, 'ADJ': 4}, which are the open-class POS tags. This is consistent with the borrowability scale, so filtering out all other POS tags appears to be both theoretically and practically sound.
Most anglicisms are not labeled as named entities, but a few are unexpectedly tagged as one even though they are not capitalized, so a simple capitalization test may work better. I suspect that anglicisms near the beginning of a sentence get mistaken for named entities because they are out-of-vocabulary (OOV).
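
The capitalization test suggested above could be sketched as follows. This is only an illustration: the helper name `capitalized_mid_sentence` and the set of sentence-final punctuation are assumptions, and the token list is made up. A token is flagged as a possible named entity only if it starts with an uppercase letter and does not open a sentence, since sentence-initial words are capitalized regardless.

```python
SENTENCE_END = {".", "?", "!"}  # assumed sentence-final punctuation

def capitalized_mid_sentence(tokens):
    """Return one boolean per token: True if capitalized and not sentence-initial."""
    flags = []
    sentence_start = True
    for tok in tokens:
        flags.append(tok[:1].isupper() and not sentence_start)
        if tok in SENTENCE_END:
            # The next word opens a new sentence
            sentence_start = True
        elif tok[:1].isalnum():
            # A word or number has been seen, so we are now mid-sentence;
            # other punctuation (e.g. an opening question mark) leaves the state unchanged
            sentence_start = False
    return flags

# Toy tokens (made up for illustration)
tokens = ["El", "show", "de", "Netflix", ".", "Hashtag", "del", "día"]
for tok, flag in zip(tokens, capitalized_mid_sentence(tokens)):
    print(tok, flag)
```

In this toy input only "Netflix" is flagged. A capitalized anglicism that opens a sentence, like "Hashtag" here, is deliberately not flagged, which addresses the sentence-initial confusion described above.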

In [None]:
for value, count in Counter(NACC_df[is_anglicism]["NE"]).most_common():
    print(value, count)
# Show the anglicisms that spaCy tagged as part of a named entity
anglicisms = NACC_df[is_anglicism]
is_NE = anglicisms['NE'].isin(["I", "B"])
print(anglicisms[is_NE])

Around 40% of the data carries an open-class POS tag ("NOUN", "VERB", "PROPN", "ADJ"), so filtering out the closed-class remainder eliminates the need to test about 60% of the data.
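
The filtering step described above could look roughly like this. The sketch uses plain Python with made-up (token, POS) pairs so that it runs without the dataset. `OPEN_CLASS` and `keep_open_class` are illustrative names, not part of the notebook.

```python
OPEN_CLASS = {"NOUN", "VERB", "PROPN", "ADJ"}

def keep_open_class(tagged_tokens):
    """Keep only (token, pos) pairs whose POS tag is open class."""
    return [(tok, pos) for tok, pos in tagged_tokens if pos in OPEN_CLASS]

# Toy data (made up for illustration)
tagged = [("el", "DET"), ("show", "NOUN"), ("fue", "AUX"), ("muy", "ADV"), ("cool", "ADJ")]

candidates = keep_open_class(tagged)
share = round(100 * len(candidates) / len(tagged), 2)
print(candidates, share)
```

On the dataframe itself, the `is_openclass` mask defined earlier selects the same kind of rows via `NACC_df[is_openclass]`.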

In [None]:
# to_csv returns None when given a path, so the result is not assigned
NACC_df.to_csv('Data/NACC-Spacy-annotated.csv', index=False, header=True)

In [None]:
len(nlp_sp.vocab)