## Spacy

Open the Gold Standard NACC file to see how the labeled anglicisms are POS- and NE-tagged

In [16]:
import pandas as pd
import spacy
import re
from collections import Counter
nlp_sp = spacy.load('es', parse=True, tag=True, entity=True)
nlp_en = spacy.load('en', parse=True, tag=True, entity=True)
print("Spanish:", len(nlp_sp.vocab), "English", len(nlp_en.vocab))

Spanish: 37940 English 57852


Create a function to combine punctuation with the preceding word

In [11]:
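def join_punctuation(seq, characters=r'.,:;?!)/'):
    """Yield the tokens of seq with trailing punctuation attached to the preceding token."""
    characters = set(characters)
    seq = iter(seq)
    try:
        current = next(seq)
    except StopIteration:
        return  # empty input: nothing to yield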

    for nxt in seq:
        if nxt in characters:
            current += nxt
        else:
            yield current
            current = nxt

    yield current
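
A quick check of the behavior on a toy token list may help here. The tokens below are invented for illustration and are not taken from NACC; the function is repeated in condensed form only so the snippet runs on its own. It also shows why an opening parenthesis needs separate handling: `join_punctuation` only attaches punctuation to the word before it, so "(" stays a separate item.

```python
import re

def join_punctuation(seq, characters=r'.,:;?!)/'):
    characters = set(characters)
    seq = iter(seq)
    current = next(seq)
    for nxt in seq:
        if nxt in characters:
            current += nxt
        else:
            yield current
            current = nxt
    yield current

# Toy tokens, invented for illustration (not from the gold standard)
tokens = ["Hola", ",", "mundo", "(", "web", ")", "."]
joined = list(join_punctuation(tokens))
print(joined)  # ['Hola,', 'mundo', '(', 'web).']

# "(" is not attached to anything, so the space after it is removed afterwards
text = re.sub(r"\( ", "(", " ".join(joined))
print(text)  # Hola, mundo (web).
```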

In [20]:
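# Load the gold standard (one token per row) and rebuild running text for spaCy
NACC_df = pd.read_csv('Data/NACC-GoldStandard.tsv', delimiter='\t', encoding='utf-8', header=0)
NACC_tokens = NACC_df["Token"].tolist()
NACC_punct_joined = join_punctuation(NACC_tokens)
NACC_text = " ".join(NACC_punct_joined)
# join_punctuation only attaches trailing punctuation, so drop the space after "("
NACC_text = re.sub(r"\( ", r"(", NACC_text)
NACC_spacy = nlp_sp(NACC_text)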

In [21]:
with open('Data/NACC-GoldStandard-Text.txt', "w", encoding='utf-8') as output:
    output.write(NACC_text)

Add new columns to the pandas dataframe for POS, NE, and Lemma

In [None]:
NACC_df["POS"] = [token.pos_ for token in NACC_spacy]
NACC_df["NE"] = [token.ent_iob_ for token in NACC_spacy]
NACC_df["Lemma"] = [token.lemma_ for token in NACC_spacy]
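
The column assignments above only work if spaCy produces exactly one token per row of the gold standard, in the same order. A minimal sketch of a sanity check is given below; `first_mismatch` is a hypothetical helper, not part of spaCy or pandas, and the token lists are toy data.

```python
def first_mismatch(gold_tokens, parsed_tokens):
    """Return the index where the two sequences first differ, or None if they match."""
    for i, (gold, parsed) in enumerate(zip(gold_tokens, parsed_tokens)):
        if gold != parsed:
            return i
    if len(gold_tokens) != len(parsed_tokens):
        # One sequence is a prefix of the other
        return min(len(gold_tokens), len(parsed_tokens))
    return None

# Toy data: the second parse splits "show" into two tokens
gold = ["El", "show", "empieza", "."]
parsed_ok = ["El", "show", "empieza", "."]
parsed_split = ["El", "sh", "ow", "empieza", "."]

print(first_mismatch(gold, parsed_ok))     # None
print(first_mismatch(gold, parsed_split))  # 1
```

In this notebook the call would presumably be `first_mismatch(NACC_tokens, [token.text for token in NACC_spacy])`, run before the new columns are added.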

Filter the dataset to analyze only those tokens labeled as anglicisms

In [None]:
is_anglicism = NACC_df['Anglicism'] == "yes"
for value, count in Counter(NACC_df[is_anglicism]["POS"]).most_common():
    print(value, count)


In [None]:
is_openclass = NACC_df['POS'].isin(["NOUN", "VERB", "PROPN", "ADJ"])
print(round(100*len(NACC_df[is_openclass]) / len(NACC_df), 2))

All anglicisms in the gold standard are tagged with one of the open-class POS tags: {'NOUN': 62, 'VERB': 8, 'PROPN': 7, 'ADJ': 4}. This is in keeping with the borrowability scale, so filtering out all other POS tags appears to be both theoretically and practically sound.
Most anglicisms are not labeled as named entities, but a few are unexpectedly tagged as NEs (i.e., they are not capitalized), so a simple capitalization test may be better. I think anglicisms toward the beginning of a sentence get mistaken for NEs because they are OOV.
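
One way the capitalization test mentioned above could look is sketched below. This is an assumption about what such a test might be, not the method used in this notebook: a token counts as a named-entity candidate only if it starts with an uppercase letter and is not the first word of a sentence. The helper name, the set of sentence-final marks, and the sample tokens are all hypothetical.

```python
SENTENCE_END = {".", "?", "!"}  # assumed sentence-final tokens

def capitalized_mid_sentence(tokens):
    """Flag tokens that start with an uppercase letter and are not sentence-initial."""
    flags = []
    sentence_initial = True
    for tok in tokens:
        flags.append(tok[:1].isupper() and not sentence_initial)
        if tok in SENTENCE_END:
            sentence_initial = True
        elif any(ch.isalnum() for ch in tok):
            # A real word was seen; opening marks such as "¿" leave the state unchanged
            sentence_initial = False
    return flags

# Toy sentence pair, invented for illustration
tokens = ["¿", "Viste", "el", "show", "de", "Madonna", "?", "Marketing", "digital"]
for tok, flag in zip(tokens, capitalized_mid_sentence(tokens)):
    print(tok, flag)
```

Here only "Madonna" is flagged. "Viste" and "Marketing" are capitalized but sentence-initial, so they are not treated as candidates, which is the case the paragraph above suspects of confusing the tagger.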

In [None]:
for value, count in Counter(NACC_df[is_anglicism]["NE"]).most_common():
    print(value, count)
a = NACC_df[is_anglicism]
is_NE = a['NE'].isin(["I", "B"])
print(a[is_NE])

Around 40% of the data carries an open-class POS tag ("NOUN", "VERB", "PROPN", "ADJ"), so filtering out everything else (the closed classes) eliminates the need to test about 60% of the data.

In [None]:
# to_csv returns None when given a path, so there is nothing useful to assign
NACC_df.to_csv('Data/NACC-Spacy-annotated.csv', index=False, header=True)

In [None]:
len(nlp_sp.vocab)

In [22]:
spn_5k_file = "/Users/jacquelineserigos/Google Drive/X-Reference/My_Data/Dictionaries/Originals/HFSpn.txt"
eng_5k_file = "/Users/jacquelineserigos/Google Drive/X-Reference/My_Data/Dictionaries/Originals/HFEng.txt"
spnDict_file = "/Users/jacquelineserigos/Google Drive/X-Reference/My_Data/Dictionaries/Originals/lemario-20101017.txt"

spn5k = set(open(spn_5k_file, encoding = 'cp1252').readlines())
eng5k = set(open(eng_5k_file).readlines())
spnDict = set(open(spnDict_file).readlines())

overlap = spn5k & eng5k
print(len(overlap))
#print(*overlap)
Eng = eng5k - overlap
#print(*Eng)
a = spnDict & Eng
print(*a)




136
eligible
 supervisor
 informal
 open
 acre
 sheriff
 far
 mutual
 stand
 tour
 gentleman
 echo
 spot
 flash
 display
 grand
 cute
 lay
 input
 film
 look
 to
 crucial
 lady
 western
 lunch
 green
 detective
 data
 home
 virtual
 golf
 agenda
 sexy
 chance
 slip
 bacteria
 marine
 single
 audio
 ten
 son
 tape
 campus
 ratio
 bit
 temple
 guitar
 suite
 mentor
 show
 car
 instructor
 uh
 logical
 light
 miss
 plus
 do
 corps
 fan
 escape
 ad
 exclusive
 cap
 jazz
 fame
 put
 controversial
 dancing
 mere
 set
 render
 gap
 video
 rain
 ton
 digital
 cave
 sauce
 sponsor
 pope
 rock
 box
 speech
 neutral
 naval
 dance
 body
 ring
 sport
 pop
 senior
 relax
 chef
 in
 be
 her
 trace
 rape
 force
 web
 nuclear
 prior
 tumor
 as
 lobby
 short
 jet
 vegetable
 diagnosis
 radar
 cross
 casino
 software
 invite
 banana
 diabetes
 pizza
 junior
 virus
 folk
 kit
 panel
 funeral
 chin
 debut
 gang
 calendar
 monitor
 deal
 man
 standing
 ruin
 romance
 lord
 grant
 cope
 cancel
 memorial
 cra
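
The leading spaces in the output above come from the newline that `readlines()` keeps at the end of every entry. This also means that a word on a final line without a trailing newline would not match its counterpart in another file. Whether that affects the real word lists is not known here. A minimal sketch of a loader that strips whitespace follows; `load_wordlist` is a hypothetical helper and the two files are temporary stand-ins, not the actual dictionaries.

```python
import os
import tempfile

def load_wordlist(path, encoding="utf-8"):
    """Read one word per line into a set, dropping surrounding whitespace and blank lines."""
    with open(path, encoding=encoding) as handle:
        return {line.strip() for line in handle if line.strip()}

# Toy files standing in for the real word lists
with tempfile.TemporaryDirectory() as tmp:
    eng_path = os.path.join(tmp, "eng.txt")
    spn_path = os.path.join(tmp, "spn.txt")
    with open(eng_path, "w", encoding="utf-8") as f:
        f.write("show\nhouse\nvirus")    # last line has no trailing newline
    with open(spn_path, "w", encoding="utf-8") as f:
        f.write("show\ncasa\nvirus\n")
    shared = load_wordlist(eng_path) & load_wordlist(spn_path)

print(sorted(shared))  # ['show', 'virus']
```

With raw `readlines()` sets, "virus" and "virus\n" would be treated as different entries and dropped from the intersection. The `with` block also closes each file once it has been read.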

In [8]:
'''with open(spn_5k_file, encoding='cp1252') as spn_f, \
    open(eng_5k_file) as eng_f:
    print(set(spn_f.readlines()) & set(eng_f.readlines()))'''

['a\n', 'abajo\n', 'abandonado\n', 'abandonar\n', 'abandono\n', 'abanico\n', 'abarcar\n', 'abeja\n', 'abiertamente\n', 'abierto\n', 'abismo\n', 'abogado\n', 'abordar\n', 'abrazar\n', 'abrazo\n', 'abrigo\n', 'abril\n', 'abrir\n', 'absolutamente\n', 'absoluto\n', 'absorber\n', 'abstracción\n', 'abstracto\n', 'absurdo\n', 'absurdo\n', 'abuela\n', 'abuelo\n', 'abundancia\n', 'abundante\n', 'abundar\n', 'aburrido\n', 'aburrimiento\n', 'aburrir\n', 'abusar\n', 'abuso\n', 'acá\n', 'acabar\n', 'academia\n', 'académico\n', 'acariciar\n', 'acarrear\n', 'acaso\n', 'acatar\n', 'acceder\n', 'accesible\n', 'acceso\n', 'accidente\n', 'acción\n', 'aceite\n', 'acelerado\n', 'acelerar\n', 'acento\n', 'acentuar\n', 'aceptable\n', 'aceptación\n', 'aceptar\n', 'acera\n', 'acerca\n', 'acercamiento\n', 'acercar\n', 'acero\n', 'acertado\n', 'acertar\n', 'ácido\n', 'acierto\n', 'aclarar\n', 'acoger\n', 'acomodado\n', 'acomodar\n', 'acompañar\n', 'aconsejar\n', 'acontecer\n', 'acontecimiento\n', 'acordar\n', 'a