**Prototype for finding anglicisms**
>This notebook explores the first digitized issue of Estadio magazine, looking for words borrowed from English. It traces how the language of sports changed throughout the second half of the 20th century.

**Loading Libraries**

In [1]:
import PyPDF2

In [2]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hernanadasme/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hernanadasme/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

**Import pdf file from directory**

>These cells open the PDF file from the directory in binary mode, wrap it in a `PyPDF2.PdfReader` object called `pdf_reader`, and check the number of pages.

In [7]:
pdf_file = open(r'/Users/hernanadasme/Projects/estadio_1940_1980/estadio_pdfs/estadio_43_59_7_05_17_12_1943.pdf', 'rb')

In [8]:
pdf_reader = PyPDF2.PdfReader(pdf_file)

In [9]:
len(pdf_reader.pages)

565

**From pdf to txt**

>This loop extracts the text of every page in `pdf_reader` and concatenates it into a single string called `estadio_43_59_1943`.

In [10]:
estadio_43_59_1943 = ""

for page in pdf_reader.pages:
    estadio_43_59_1943 += page.extract_text()


Checking the extracted text.

In [11]:
estadio_43_59_1943

'^Sfeñ*~*-\n>**\n.%Á\n^■w\'\'*<»í^S\n\\,$%\'\'■S***\n*^j .í*■•\';\'$¿fr&:\n»->/:."\'\'">&?«*í\\,#w•\'■■.rTV\n-<á-**L\nJ^ál**ap\n>f"X"\n■t"Jp\'rí>T-\'\n■M;\'.-¿\nr\'""-.\'&\n\'*»■■•/\'/;k\\>;\ni■•rJ^MllfSSiPsis----\'- :\'-W\'\nBIBLIOTECA NACIONAL\nDECHILE\nVolúmenes deestaobra\nSalaenqueseencuentra\nTabla enquesehalla\nOrden queenellatiene....\n4lj*(l*1)~^)BBIBLIOTECA NACIONAL\nDECHILE\nSección :Hemeroteca\nVolúmenes delaobra\n1253\nUbicaciónCV°V~4)\n3ÍBLÍ0TECA\n/<—\n.\n¿\n^¿>\nNUMERO-DEDICADO ALXII\nCAMPEONATO SUDAMERICANO\nDEATLETISMOM.R.\nN.°43\nPRECIO: $4.—\nJORGEEHLERS, jovenyex\\\ntraordinario atletachileno,\nqueconsiguióunadeJas]\nmejoresmarcas ene!Sud\namericano deAtletismo.\n<■;■■*..•\n*\'í*\nb**fEDERlCO HORN,elprimer\'\'arte-\nmi..chileño quepasalavarilla\ncolocada aloscuatrometros, en\nelsaltocongarrocha. Ganóesa\nprueba enelSudamericano, de-\npostrandoeltemple ylaperso-\n«alidad de«losverdaderos cam\npeones. Sueátilo esbastante\nSerfecto, comolodemuestra la\njpto.Sujuventud 

**Basic preprocessing**

>These two functions, `preprocess_text` and `tokenize`, do basic preprocessing. The first one removes the newline characters (`\n`), the special characters, and the digits, and it lowercases all the text. Because the pattern keeps only ASCII letters, accented characters are dropped too (`sección` becomes `seccin`).

>The `tokenize` function does two things: 1) it splits the long string into tokens and 2) it drops empty tokens. Stopwords (words that aren't relevant to our analysis) are not removed yet.

In [12]:
def preprocess_text(text):
    # Remove the newline characters
    text = text.replace('\n', '')
    pattern_1 = r'[^a-zA-Z0-9\s]'
    # Replace the special characters with an empty string
    text = re.sub(pattern_1, '', text)
    pattern_2 = r'\d+'
    # Replace the digits with an empty string
    text = re.sub(pattern_2, '', text)
    # Lowercase the text
    text = text.lower()
    
    return text
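
The ASCII-only pattern above also strips accented letters, and deleting `\n` outright can glue together words from adjacent lines. Here is a minimal sketch of an alternative, assuming that keeping Spanish diacritics is desirable. `preprocess_text_unicode` is a hypothetical name, not a function from this notebook.

```python
import re

def preprocess_text_unicode(text):
    # Hypothetical variant of preprocess_text (not the notebook's function)
    # Replace newlines with a space so adjacent lines are not fused
    text = text.replace('\n', ' ')
    # Drop punctuation and symbols, plus digits and underscores;
    # in Python 3, \w matches Unicode letters such as ñ and ó
    text = re.sub(r'[^\w\s]|[\d_]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Lowercase the text
    return text.lower()

sample = 'Sección: Hemeroteca\nPRECIO: $4'
print(preprocess_text_unicode(sample))
```

The counts later in the notebook were produced with the original `preprocess_text`, so switching to a variant like this one would change them.
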

In [13]:
def tokenize(text):
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if len(token) > 0]
    #tokens = [get_lemma(token) for token in tokens]
    return tokens
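
Stopword removal could be added as a separate filtering step. The sketch below uses a tiny hand-written stopword set so that it runs on its own; in the notebook, the set would more likely come from `stopwords.words('spanish')`, since `stopwords` is already imported from `nltk.corpus`. `remove_stopwords`, `sample_stopwords`, and `sample_tokens` are hypothetical names.

```python
def remove_stopwords(tokens, stopword_set):
    # Keep only the tokens that are not in the stopword set
    return [token for token in tokens if token not in stopword_set]

# Hand-written sample for illustration only, not the NLTK list
sample_stopwords = {'el', 'la', 'de', 'en', 'que', 'y'}
sample_tokens = ['el', 'gol', 'de', 'la', 'cancha']

filtered_tokens = remove_stopwords(sample_tokens, sample_stopwords)
print(filtered_tokens)
```

Because many tokens in this scan are several words fused together, a filter like this only catches stopwords that survived as separate tokens.
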

In [14]:
cleaned_estadio_43_59_1943 = preprocess_text(estadio_43_59_1943)

In [15]:
tokenized_estadio_43_59_1943 = tokenize(cleaned_estadio_43_59_1943)

In [16]:
cleaned_estadio_43_59_1943

'sfewssj frwrtvljlapfxtjprtmrkirjmllfssipsis wbiblioteca nacionaldechilevolmenes deestaobrasalaenqueseencuentratabla enquesehallaorden queenellatieneljlbbiblioteca nacionaldechileseccin hemerotecavolmenes delaobraubicacincvvbltecanumerodedicado alxiicampeonato sudamericanodeatletismomrnprecio jorgeehlers jovenyextraordinario atletachilenoqueconsiguiunadejasmejoresmarcas enesudamericano deatletismobfederlco hornelprimerartemichileo quepasalavarillacolocada aloscuatrometros enelsaltocongarrocha ganesaprueba enelsudamericano depostrandoeltemple ylapersoalidad delosverdaderos campeones suetilo esbastanteserfecto comolodemuestra lajptosujuventud ysuextraordinarioentusiasmo porsupruebajjavoritahandepermitirle muyftontbalcanzar nosloelrecordbtttinentalsinoquecolocarlo ennivelfrancamente superior alasemsmarcassudamericanasvvaoiin publicacin quincenalsantiago dechiledemayoderedaccin yadministracincompaa opisocasillateloadirector alejandro jaramillo nestarevistaladistribuye entodoelpasyelextran 

In [280]:
tokenized_estadio_43_59_1943

['sfewssj',
 'frwrtvljlapfxtjprtmrkirjmllfssipsis',
 'wbiblioteca',
 'nacionaldechilevolmenes',
 'deestaobrasalaenqueseencuentratabla',
 'enquesehallaorden',
 'queenellatieneljlbbiblioteca',
 'nacionaldechileseccin',
 'hemerotecavolmenes',
 'delaobraubicacincvvbltecanumerodedicado',
 'alxiicampeonato',
 'sudamericanodeatletismomrnprecio',
 'jorgeehlers',
 'jovenyextraordinario',
 'atletachilenoqueconsiguiunadejasmejoresmarcas',
 'enesudamericano',
 'deatletismobfederlco',
 'hornelprimerartemichileo',
 'quepasalavarillacolocada',
 'aloscuatrometros',
 'enelsaltocongarrocha',
 'ganesaprueba',
 'enelsudamericano',
 'depostrandoeltemple',
 'ylapersoalidad',
 'delosverdaderos',
 'campeones',
 'suetilo',
 'esbastanteserfecto',
 'comolodemuestra',
 'lajptosujuventud',
 'ysuextraordinarioentusiasmo',
 'porsupruebajjavoritahandepermitirle',
 'muyftontbalcanzar',
 'nosloelrecordbtttinentalsinoquecolocarlo',
 'ennivelfrancamente',
 'superior',
 'alasemsmarcassudamericanasvvaoiin',
 'publicacin',


**Saving the file**

In [53]:

path = r'/Users/hernanadasme/Projects/estadio_1940_1980/estadio_43_59_7_05_17_12_1943.txt'

with open(path, 'w', encoding='utf-8') as txt_file:
    txt_file.write(estadio_43_59_1943)

**Trial #1: a small sample of anglicisms**

>To try out the code, I use a short list of only five terms that are widely used in Revista Estadio.

In [300]:
tokens_english = [
'goal', 
'match', 
'forward', 
'field', 'player']

#counting ENG words in the cleaned txt and creating a dictionary with words and counts
counts_english = {}
for token in tokens_english:
    count = cleaned_estadio_43_59_1943.count(token)
    counts_english[token] = count

In [301]:
#checking the dictionary 
counts_english

{'goal': 39, 'match': 347, 'forward': 36, 'field': 44, 'player': 80}

In [283]:
#creating the list of Spanish counterparts
tokens_spanish = [
'gol', 
'partido', 
'delantero', 
'cancha', 'jugador']

#counting ESP words in the cleaned txt and creating a dictionary with words and counts
counts_spanish = {}
for token in tokens_spanish:
    count = cleaned_estadio_43_59_1943.count(token)
    counts_spanish[token] = count
#checking the dictionary
counts_spanish

{'gol': 706, 'partido': 504, 'delantero': 165, 'cancha': 485, 'jugador': 655}
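
`str.count` counts substring matches, so a short term such as `gol` is also counted inside longer words like `goleador` or `golpe`. The sketch below contrasts that with a whole-word count built on `re`. The helper names and the sample sentence are made up for illustration, and a whole-word count would only be reliable once the fused words in the scan have been separated.

```python
import re

def count_substring(text, term):
    # Same behaviour as the notebook cells: every occurrence of the substring
    return text.count(term)

def count_whole_word(text, term):
    # Count the term only when it stands alone between word boundaries
    pattern = r'\b' + re.escape(term) + r'\b'
    return len(re.findall(pattern, text))

sample = 'el gol del goleador fue un golpe de suerte'
print(count_substring(sample, 'gol'))
print(count_whole_word(sample, 'gol'))
```
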

**Calculating the Ratios**

>These cells create a new dictionary that stores each English word as the key and its ratio as the value. The ratio is the frequency of the English word divided by the frequency of its Spanish counterpart.

In [285]:
# Initialize a new dictionary with the keys from the original dictionary
ratios_dict = {key: None for key in counts_english}

# Print the new_dict
print(ratios_dict)

{'goal': None, 'match': None, 'forward': None, 'field': None, 'player': None}


In [286]:
#creating lists with the values from the dictionaries
english_values = list(counts_english.values())
spanish_values = list(counts_spanish.values())

In [287]:
#checking values
print(english_values)
print(spanish_values)

[39, 347, 36, 44, 80]
[706, 504, 165, 485, 655]


In [288]:
#calculating the ratio
ratio = [eng / esp for eng, esp in zip(english_values, spanish_values)]

In [289]:
#checking the ratios
ratio

[0.05524079320113314,
 0.6884920634920635,
 0.21818181818181817,
 0.09072164948453608,
 0.12213740458015267]

**Dictionary with English terms and ratios**
>I iterate over the dictionary keys with `enumerate` and assign each ratio from the list to its English term, so the dictionary ends up with the English terms as keys and the ratios as values.

In [290]:
# Iterate over the keys in the dictionary and assign values from the list to the corresponding keys
for i, key in enumerate(ratios_dict.keys()):
    ratios_dict[key] = ratio[i]

# Print the updated dictionary
print(ratios_dict)


{'goal': 0.05524079320113314, 'match': 0.6884920634920635, 'forward': 0.21818181818181817, 'field': 0.09072164948453608, 'player': 0.12213740458015267}
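
The same result can be built in one pass by pairing each English term with its Spanish counterpart explicitly, which avoids relying on two dictionaries sharing the same order. Here is a minimal sketch using three of the counts reported above. `term_pairs` and `compute_ratios` are hypothetical names, and returning `None` for a zero Spanish count is an assumption about how to handle division by zero.

```python
def compute_ratios(term_pairs, counts_eng, counts_esp):
    ratios = {}
    for eng, esp in term_pairs:
        esp_count = counts_esp.get(esp, 0)
        # Avoid ZeroDivisionError when the Spanish term never appears
        ratios[eng] = counts_eng.get(eng, 0) / esp_count if esp_count else None
    return ratios

# Counts copied from the outputs above
term_pairs = [('goal', 'gol'), ('match', 'partido'), ('forward', 'delantero')]
counts_eng = {'goal': 39, 'match': 347, 'forward': 36}
counts_esp = {'gol': 706, 'partido': 504, 'delantero': 165}

ratios = compute_ratios(term_pairs, counts_eng, counts_esp)
print(ratios)
```
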


**Trial #2: a longer sample of anglicisms**

>I will try the same process but with a list of 29 English terms and their Spanish counterparts. 

In [383]:
#creating the list of english terms 
#ADD THE WORD RANKING
tokens_eng = ['goal','match','Forward','Field','Back','Pitchers','Wing','Shot','shoot','Player','Handicap','Kick','Second','Referee','Insider','Crack','Standard','Jersey','Foul','Knockout','Out','Record','Score','Single','Sport','Shortstop','Training','Centroforward','sprinter']

In [384]:
#lowercasing the terms
tokens_englow = [w.lower() for w in tokens_eng]

In [385]:
#counting ENG words in the cleaned txt and creating a dictionary with words and counts
counts_english_long = {}
for token in tokens_englow:
    count = cleaned_estadio_43_59_1943.count(token)
    counts_english_long[token] = count

In [386]:
#checking dictionary
counts_english_long

{'goal': 39,
 'match': 347,
 'forward': 36,
 'field': 44,
 'back': 41,
 'pitchers': 1,
 'wing': 60,
 'shot': 34,
 'shoot': 1,
 'player': 80,
 'handicap': 3,
 'kick': 3,
 'second': 13,
 'referee': 28,
 'insider': 36,
 'crack': 488,
 'standard': 9,
 'jersey': 1,
 'foul': 21,
 'knockout': 6,
 'out': 28,
 'record': 258,
 'score': 102,
 'single': 25,
 'sport': 100,
 'shortstop': 1,
 'training': 1,
 'centroforward': 15,
 'sprinter': 23}

In [387]:
len(tokens_englow)

29

In [388]:
#creating the list of spanish terms 
tokens_esplow = ['gol', 'partido','delantero','cancha','defensa','lanzador','lateral','disparo','disparar','jugador','desventaja','patear','segundo','juez','interior','estrella','estandar','camiseta','falta','nocaut','fuera','registro','marcador','individual','deporte','parada','entrenamiento','centrodelantero','velocista']

In [389]:
#counting ESP words in the cleaned txt and creating a dictionary with words and counts
counts_spanish_long = {}
for token in tokens_esplow:
    count = cleaned_estadio_43_59_1943.count(token)
    counts_spanish_long[token] = count

In [390]:
#checking dictionary
counts_spanish_long

{'gol': 706,
 'partido': 504,
 'delantero': 165,
 'cancha': 485,
 'defensa': 304,
 'lanzador': 13,
 'lateral': 8,
 'disparo': 3,
 'disparar': 2,
 'jugador': 655,
 'desventaja': 11,
 'patear': 9,
 'segundo': 361,
 'juez': 10,
 'interior': 73,
 'estrella': 17,
 'estandar': 2,
 'camiseta': 38,
 'falta': 165,
 'nocaut': 3,
 'fuera': 181,
 'registro': 3,
 'marcador': 28,
 'individual': 56,
 'deporte': 561,
 'parada': 56,
 'entrenamiento': 79,
 'centrodelantero': 49,
 'velocista': 13}

In [391]:
len(counts_spanish_long)

29

In [393]:
# Initialize a new dictionary with the keys from the original dictionary
ratios_dict_long = {key: None for key in counts_english_long}

# Print the new_dict
print(ratios_dict_long)

english_values_long = list(counts_english_long.values())
spanish_values_long = list(counts_spanish_long.values())

ratio_long = [eng / esp for eng, esp in zip(english_values_long, spanish_values_long)]


# Iterate over the keys in the dictionary and assign values from the list to the corresponding keys
for i, key in enumerate(ratios_dict_long.keys()):
    ratios_dict_long[key] = ratio_long[i]

# Print the updated dictionary
print(ratios_dict_long)

{'goal': None, 'match': None, 'forward': None, 'field': None, 'back': None, 'pitchers': None, 'wing': None, 'shot': None, 'shoot': None, 'player': None, 'handicap': None, 'kick': None, 'second': None, 'referee': None, 'insider': None, 'crack': None, 'standard': None, 'jersey': None, 'foul': None, 'knockout': None, 'out': None, 'record': None, 'score': None, 'single': None, 'sport': None, 'shortstop': None, 'training': None, 'centroforward': None, 'sprinter': None}
{'goal': 0.05524079320113314, 'match': 0.6884920634920635, 'forward': 0.21818181818181817, 'field': 0.09072164948453608, 'back': 0.13486842105263158, 'pitchers': 0.07692307692307693, 'wing': 7.5, 'shot': 11.333333333333334, 'shoot': 0.5, 'player': 0.12213740458015267, 'handicap': 0.2727272727272727, 'kick': 0.3333333333333333, 'second': 0.036011080332409975, 'referee': 2.8, 'insider': 0.4931506849315068, 'crack': 28.705882352941178, 'standard': 4.5, 'jersey': 0.02631578947368421, 'foul': 0.12727272727272726, 'knockout': 2.0, 

In [394]:
len(ratios_dict_long)

29

In [411]:
df = pd.DataFrame(ratios_dict_long, index=[0])

In [413]:
df.head()

Unnamed: 0,goal,match,forward,field,back,pitchers,wing,shot,shoot,player,...,knockout,out,record,score,single,sport,shortstop,training,centroforward,sprinter
0,0.055241,0.688492,0.218182,0.090722,0.134868,0.076923,7.5,11.333333,0.5,0.122137,...,2.0,0.154696,86.0,3.642857,0.446429,0.178253,0.017857,0.012658,0.306122,1.769231
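
A single-row frame with 29 columns is hard to read. One way to get the word ranking mentioned in the TODO comment is to sort the terms by ratio. The sketch below uses plain Python and four of the ratios printed above. `rank_terms` and `sample_ratios` are hypothetical names.

```python
def rank_terms(ratios):
    # Sort (term, ratio) pairs from the highest ratio to the lowest
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Four ratios copied from the outputs above
sample_ratios = {
    'goal': 0.05524079320113314,
    'wing': 7.5,
    'record': 86.0,
    'shoot': 0.5,
}

ranking = rank_terms(sample_ratios)
for position, (term, value) in enumerate(ranking, start=1):
    print(position, term, round(value, 3))
```

With pandas, `pd.Series(ratios_dict_long).sort_values(ascending=False)` should give a comparable one-column view of all 29 terms.
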


**Regular expressions**

This section uses regular expressions to address the problem of words squished together and the letter-spacing issues.
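
As an illustration of the letter-spacing case only, the sketch below collapses runs of single characters separated by single spaces (for example, a heading typeset as `C A M P E O N A T O`). The function name and the sample string are hypothetical. The pattern would also merge genuine runs of one-letter words, so its output needs checking. Splitting fused words is a harder problem that this pattern does not address.

```python
import re

# Three or more single word characters, each separated by exactly one space
LETTER_SPACED = re.compile(r'\b(?:\w ){2,}\w\b')

def collapse_letter_spacing(text):
    # Remove the spaces inside each letter-spaced run
    return LETTER_SPACED.sub(lambda m: m.group(0).replace(' ', ''), text)

sample = 'C A M P E O N A T O sudamericano de atletismo'
print(collapse_letter_spacing(sample))
```
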
