# Welcome to Wrangling Linguistic Data with Python 
Let's start off with a quick survey to see what everyone's programming background is. https://forms.gle/DvzQnyWsPcuE4jQC6

### 0. Practice with Objects and Functions
A variable is the name attached to a particular object. We will look at three types of objects here: strings, integers, and lists. To create a variable, you just assign it a value with the equals sign (=) and then start using it.

In [26]:
a = "This is a string" # let's assign this text to a variable called "a"
b = 4 # let's assign this number to a variable called "b"
c = [a, b] # let's assign this list of objects to a variable called "c"

In [27]:
print(a) # use the print function to see what variable "a" is
print(b) # use the print function to see what variable "b" is
print(c) # use the print function to see what variable "c" is

This is a string
4
['This is a string', 4]


In [23]:
type(c) # use the type function to see what type of variable "c" is

list

#### Your turn!
Create new variables of different types, print them, and verify their types using the `print` and `type` functions. Play around with decimal numbers. What is that variable type called?

### 1. Reading in Data
We first need to read our file so we can access our data. The type of file your data is in determines what function you should use to read in your data. We will read in several sample files from the Davies Corpus that are readily accessible at https://www.corpusdata.org/formats.asp

#### Basic Reading in Data

In [9]:
# read in a single plain text file, ending in .txt
with open("Data/spanish-sample-text/text.txt") as f:  # "with" closes the file when the block ends
    data = f.read()
data[:10000]  # preview the first 10,000 characters

#### Advanced Reading in Data

In [10]:
# read in a directory of files
import os

#with os.scandir('my_directory/') as entries:
#    for entry in entries:
#        print(entry.name)
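
The commented-out loop above only prints file names. A minimal sketch of going one step further and reading every `.txt` file in a folder into a dictionary might look like the following. The helper name `read_text_files` and the temporary demo folder are made up for illustration, and UTF-8 encoding is an assumption about your files.

```python
import os
import tempfile

def read_text_files(directory, encoding="utf-8"):
    """Return {file name: file contents} for every .txt file in directory.

    Hypothetical helper; the default encoding is an assumption.
    """
    texts = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            # skip subdirectories and anything that is not a .txt file
            if entry.is_file() and entry.name.endswith(".txt"):
                with open(entry.path, encoding=encoding) as f:
                    texts[entry.name] = f.read()
    return texts

# demo on a temporary folder containing two small made-up files
with tempfile.TemporaryDirectory() as demo_dir:
    for name, content in [("a.txt", "Hola mundo"), ("b.txt", "Hello world")]:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write(content)
    texts = read_text_files(demo_dir)

print(sorted(texts))  # the file names that were read
```

Keeping the file name as the dictionary key makes it easy to trace each text back to its source file later on.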

#### Your turn
Choose either the basic or advanced method to read in the following three files: 

### 2. Cleaning Data

After reading in a data file, it is important to look at the data to see if there are any encoding issues or other textual issues that need to be addressed. The Spanish sample text above contains random insertions of the symbol "@". This is something we need to remove before we process the data.

In [6]:
import re
data_clean = re.sub("@+", "", data)
data_clean[:1000]

'textID\ttext\n----\t----\n\n124 Gran convocatoria para el concurso docente que se realiza en la Escuela Normal Con una inmensa convocatoria de docentes , convocada desde la 7.30 de este lunes en el salón de actos de la Escuela Normal Mariano Moreno , se realizó la primera jornada de el concurso para titularización de los cargos . El día comenzó con las palabras de bienvenidas de las autoridades , quienes hablaron de cientos de docentes que presentaron sus documentos que deben ser analizados por las autoridades . Los cargos fueron 138 , pero esa suma se incrementó debido a que muchos realizaron cambio de escuelas , abriendo otras oportunidades . Las jornadas continuaran este martes debido , precisamente , a el enorme número de docentes presentes . Estuvieron , la presidenta de el CGE , Graciela Bar , el profesor Héctor de la Fuente , vocal de presidencia , el secretario general de Agmer Fabián Peccín y la directora Departamental de Escuela , María del Carmen Tourfini de Córdoba . Más a

Did this solve the problem? If not, how can it be resolved?
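
One possible way to check, sketched on a made-up string rather than the corpus file: if the "@" symbols are separated by spaces (an assumption about how they appear in the data), deleting only the symbols leaves runs of extra spaces behind. The variable names below are illustrative.

```python
import re

# made-up sample; assumes the "@" symbols are separated by spaces
sample = "el concurso @ @ @ @ docente que se realiza"

# same substitution as in the cell above
step1 = re.sub("@+", "", sample)
print(repr(step1))  # the spaces that surrounded the "@" symbols remain

# replace each run of "@" symbols, plus the whitespace around it, with one space
step2 = re.sub(r"\s*(?:@\s*)+", " ", sample).strip()
print(repr(step2))

# quick check that no "@" is left
print("@" in step2)
```

Note that this pattern removes every "@", so it would also alter e-mail addresses or social media handles if your data contains them.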

### 3. Processing Data
We will use the spaCy package to create annotated linguistic data. If you have not used this package before, you will first need to install it via the console. From a notebook cell, you can send a command to the console by prefixing it with "!"; alternatively, open your console and type the command there. You only need to do this once on a given computer.

In [24]:
#! conda install -c conda-forge spacy
! python -m spacy download en_core_web_sm
! python -m spacy download pt_core_news_sm
! python -m spacy download es_core_news_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Download and installation successful
You can now load the model via spacy.load('pt_core_news_sm')
✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')


Import the spaCy module and create an object for each language you will use (Spanish, English, etc.).

In [7]:
import spacy
nlp_en = spacy.load("en_core_web_sm")
nlp_sp = spacy.load("es_core_news_sm")

In [8]:
data_spacy = nlp_sp(data_clean)
data_spacy[:100]

ValueError: [E088] Text of length 11986988 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
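
The error message itself suggests one fix: raising `nlp.max_length`, if you are not using the parser or NER. Another option is to split the text into smaller pieces before processing. The sketch below is one way to do that in plain Python. The function name `split_into_chunks` and the default of 100,000 characters are illustrative choices, not part of spaCy.

```python
def split_into_chunks(text, max_chars=100_000):
    """Split text at line breaks into chunks of at most max_chars characters.

    Hypothetical helper. A single line longer than max_chars is cut into
    fixed-size pieces, which may split a word in two.
    """
    chunks, current, current_len = [], [], 0
    for line in text.splitlines(keepends=True):
        for start in range(0, len(line), max_chars):
            piece = line[start:start + max_chars]
            # start a new chunk when adding this piece would exceed the limit
            if current and current_len + len(piece) > max_chars:
                chunks.append("".join(current))
                current, current_len = [], 0
            current.append(piece)
            current_len += len(piece)
    if current:
        chunks.append("".join(current))
    return chunks

# demo on a small made-up text with a tiny limit
sample = "uno dos tres\n" * 5
chunks = split_into_chunks(sample, max_chars=30)
print([len(c) for c in chunks])
```

With the objects from this notebook, the chunks could then be processed with `nlp_sp.pipe(split_into_chunks(data_clean))`, which yields one spaCy `Doc` per chunk instead of a single `Doc` for the whole file.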

#### Advanced 
Let's look at custom tokenization. Sometimes spaCy won't tokenize text the way you would like, but you can customize the tokenizer to fix this.
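
As a minimal sketch of one kind of customization, the example below adds a special-case rule to the tokenizer so that a particular string is always split in a chosen way. It uses a blank English pipeline called `nlp` (a name chosen for this example, separate from `nlp_en` above) so that no model download is needed. The word "gimme" is just an illustration.

```python
import spacy
from spacy.symbols import ORTH

# a blank English pipeline contains only a tokenizer
nlp = spacy.blank("en")

# tokenization before adding the rule
print([t.text for t in nlp("gimme that")])

# always split "gimme" into "gim" + "me";
# the pieces must join back to exactly the original string
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

# tokenization after adding the rule
tokens = [t.text for t in nlp("gimme that")]
print(tokens)
```

Special cases only cover exact strings. For broader changes, such as how punctuation or hyphens are handled, the tokenizer's prefix, suffix, and infix rules would need to be adjusted instead.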

In [None]:
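# Alice_spacy is not defined earlier in this notebook; it is assumed to be a spaCy Doc, e.g. the result of calling nlp_en on an English text
[obj.text for obj in Alice_spacy.sents] # at the sentence level
[token for token in Alice_spacy] # at the word level
[(token, token.pos_) for token in Alice_spacy] # each word with its part-of-speech tag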

In [None]:
for ent in Alice_spacy.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
[(token, token.ent_iob_) for token in Alice_spacy]

In [None]:
[(token, token.is_stop) for token in Alice_spacy]

#### Your turn

Build a function that converts a plain text into a dataframe, using all the necessary functions to build your NLP pipeline. Test your function on the following documents.


Explore these documents. How many words are in each? What is the distribution of POS tags? How accurate are the tokenization and the POS tagging for CS_interview and CS_novel?