## spaCy

Open the gold-standard NACC file to see how the labeled anglicisms are POS-tagged.

In [54]:
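import pandas as pd
import spacy
from collections import Counter

# 'es' is a spaCy v2 shortcut link; spaCy v3 needs the full package name (e.g. 'es_core_news_sm').
# The tagger, parser, and NER run by default, so the extra keyword flags are not needed.
nlp_sp = spacy.load('es')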

In [33]:
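# Load the gold-standard annotations (tab-separated, with a header row).
NACC_df = pd.read_csv('Data/NACC-annotated.tsv', sep='\t', encoding='utf-8', header=0)
# Join the tokens with spaces so spaCy can process the whole corpus in one pass.
NACC_text = " ".join(NACC_df["Token"].tolist())
NACC_spacy = nlp_sp(NACC_text)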

Add new columns to the pandas DataFrame for the POS tag, the named-entity IOB tag (NE), and the lemma.

In [42]:
NACC_df["POS"] = [token.pos_ for token in NACC_spacy]
NACC_df["NE"] = [token.ent_iob_ for token in NACC_spacy]
NACC_df["Lemma"] = [token.lemma_ for token in NACC_spacy]
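The column assignments above only line up if spaCy's tokenizer returns exactly one token per row of the DataFrame. A minimal sketch of a sanity check for that assumption follows; `first_misalignment` is a hypothetical helper (not part of the notebook or of spaCy), and the toy token lists stand in for `NACC_df["Token"]` and the texts of `NACC_spacy`.

```python
def first_misalignment(gold_tokens, parsed_tokens):
    """Return the index of the first position where the two token
    sequences disagree, or None if they match one-to-one."""
    for i, (gold, parsed) in enumerate(zip(gold_tokens, parsed_tokens)):
        if gold != parsed:
            return i
    if len(gold_tokens) != len(parsed_tokens):
        # One sequence is a strict prefix of the other.
        return min(len(gold_tokens), len(parsed_tokens))
    return None


# Toy data (illustrative only): a tokenizer that splits the final
# abbreviation differently from the gold file.
gold = ["El", "show", "de", "EE.UU."]
same = ["El", "show", "de", "EE.UU."]
split = ["El", "show", "de", "EE.UU", "."]

print(first_misalignment(gold, same))
print(first_misalignment(gold, split))
```

In the notebook this could be called as `first_misalignment(NACC_df["Token"].tolist(), [t.text for t in NACC_spacy])` before the new columns are assigned. If it reports a mismatch, one option is to build the `Doc` directly from the gold tokens (spaCy's `Doc` constructor accepts a `words` list) so that tokenization cannot drift.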

Filter the dataset to analyze only the tokens labeled as anglicisms.

In [58]:
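# Boolean mask selecting the rows annotated as anglicisms.
is_anglicism = NACC_df['Anglicism'] == "yes"
# POS tag distribution among anglicisms, most frequent first.
for value, count in Counter(NACC_df[is_anglicism]["POS"]).most_common():
    print(value, count)
# Named-entity IOB tag distribution among anglicisms.
for value, count in Counter(NACC_df[is_anglicism]["NE"]).most_common():
    print(value, count)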

NOUN 62
VERB 8
PROPN 7
ADJ 4
O 68
I 7
B 6


All anglicisms in the gold standard are tagged as NOUN (62), VERB (8), PROPN (7), or ADJ (4), all of which are open-class POS tags. This is consistent with the borrowability scale, so filtering out all other POS tags appears to be both theoretically and practically sound.
Most anglicisms fall outside named entities (68 tagged O), but a few fall inside one (7 tagged I and 6 tagged B).
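The claim that no anglicism falls outside the four tags can be checked mechanically rather than by eye. The sketch below re-enters the counts printed above by hand (in the notebook they would come from the two `Counter` objects); the names `OPEN_CLASS`, `outside`, and `entity_share` are illustrative, not part of the original code.

```python
from collections import Counter

# The four tags the notebook treats as open class.
OPEN_CLASS = {"NOUN", "VERB", "PROPN", "ADJ"}

# Counts copied from the cell output above.
pos_counts = Counter({"NOUN": 62, "VERB": 8, "PROPN": 7, "ADJ": 4})
ne_counts = Counter({"O": 68, "I": 7, "B": 6})

total = sum(pos_counts.values())

# Any anglicism tag outside the open-class set would show up here.
outside = {tag: n for tag, n in pos_counts.items() if tag not in OPEN_CLASS}

# Percentage of anglicisms per POS tag, and the share inside a named entity.
pos_share = {tag: round(100 * n / total, 1) for tag, n in pos_counts.items()}
entity_share = round(100 * (ne_counts["I"] + ne_counts["B"]) / total, 1)

print("anglicism tokens:", total)
print("tags outside the open-class set:", outside)
print("POS share (%):", pos_share)
print("inside a named entity (%):", entity_share)
```

An empty `outside` dictionary means the POS filter would not discard any gold-standard anglicism.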

In [72]:
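# Boolean mask selecting tokens whose POS tag is one of the four observed for anglicisms.
is_openclass = NACC_df['POS'].isin(["NOUN", "VERB", "PROPN", "ADJ"])
# Percentage of all tokens that survive the filter.
print(round(100 * len(NACC_df[is_openclass]) / len(NACC_df), 2))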

40.41


Around 40% of the tokens carry one of the open-class POS tags observed for anglicisms (NOUN, VERB, PROPN, or ADJ), so discarding every token with any other tag eliminates the need to test roughly 60% of the data.
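As a rough illustration of how this filter could be applied downstream, the sketch below keeps only the tokens whose tag is in the four-tag set and reports how much of the input is dropped. The function name `candidate_tokens` and the hand-tagged toy sentence are assumptions made for the example; only the 40.41 figure comes from the notebook.

```python
OPEN_CLASS = {"NOUN", "VERB", "PROPN", "ADJ"}


def candidate_tokens(tagged_tokens):
    """Keep only (token, pos) pairs whose POS tag is in OPEN_CLASS."""
    return [(tok, pos) for tok, pos in tagged_tokens if pos in OPEN_CLASS]


# Toy sentence with hand-assigned tags (illustrative, not spaCy output).
sample = [
    ("El", "DET"), ("show", "NOUN"), ("fue", "AUX"),
    ("un", "DET"), ("éxito", "NOUN"), ("total", "ADJ"),
]

kept = candidate_tokens(sample)
kept_share = round(100 * len(kept) / len(sample), 2)
print(kept)
print("kept (%):", kept_share)

# The same arithmetic with the share measured on the NACC data.
nacc_kept_share = 40.41
nacc_dropped_share = round(100 - nacc_kept_share, 2)
print("NACC tokens dropped (%):", nacc_dropped_share)
```

On a DataFrame the equivalent selection is the boolean mask already computed, i.e. `NACC_df[is_openclass]`.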

In [73]:
# Write the enriched annotations to disk without the row index (to_csv returns None when given a path).
NACC_df.to_csv('Data/NACC-Spacy-annotated.csv', index=False, header=True)