# Welcome to Wrangling Linguistic Data with Python 
This workshop introduces the programming language Python and walks you through a typical workflow for converting raw text into an annotated linguistic dataset. We will cover several computational tasks, including reading in raw text files, segmenting text into sentences and tokens, and annotating tokens with various levels of metadata. We will explore the types of linguistic annotation available in the NLP package spaCy, such as part-of-speech, lemma, and syntactic function. After annotating texts, we will cover techniques for searching and filtering data and use regular expressions to look for word patterns. This workshop is designed to be accessible both to those who are new to programming and to those who have programming experience.

## 0. Practicing with Objects and Functions
We will look at three types of objects here: 
* string - sequence of characters, marked by single (`'`) or double (`"`) quotes
* integer - a whole number (ex. 1, 2, 3, 4, or 5)
* list - group of several objects together, marked by square brackets `[]`

In order to reference these objects within our code, we assign each one to a variable using the equals sign (`=`); after that, we can use the variable name in place of the object. To manipulate and use these objects, we apply functions and methods to them. In essence, functions and methods transform something into something else.
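
As a minimal sketch of the difference between a function and a method: a function takes the object inside its parentheses, while a method is attached to the object with a dot. The variable names `greeting`, `kind`, and `lowered` are made up for illustration.

```python
greeting = "Hello World"  # hypothetical example string

# A function: the object goes inside the parentheses
kind = type(greeting)

# A method: it is attached to the object with a dot
lowered = greeting.lower()

print(kind)     # the type of the object
print(lowered)  # the same text in lowercase letters
```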

In [None]:
a = "This is a string" # let's assign this string to a variable called "a"
b = 4 # let's assign this integer to a variable called "b"
c = [a, b] # let's assign this list of objects to a variable called "c"

In [None]:
print(a) # use the print function to see what variable "a" is
print(b) # use the print function to see what variable "b" is
print(c) # use the print function to see what variable "c" is

In [None]:
type(c) # use the type function to see what type of variable "c" is

####  <font color=red>Your turn! (4 minutes) </font> 
<font color=red> **Beginner:** Create new variables of different types, print them, and verify their types using the `print` and `type` functions. Create a list. Play around with numbers with decimals. What is that variable type called?</font> 

<font color=dark red> **Intermediate:** Play around with the function `len()` on lists and strings. What does this function do? Try adding integers together and strings together with `+`. What happens? Can you add integers and strings together? 
Try using [string methods](http://www.python-ds.com/python-3-string-methods), which are similar to functions in Python except that they are attached to an object.
</font> 

In [None]:
a + b

<font color=dark red> **Advanced:** Try using [string methods](http://www.python-ds.com/python-3-string-methods), such as `.upper()` and `.capitalize()`. Methods are similar to functions in Python except that they are attached to an object. 
</font> 

## 1. Reading in Data
We first need to read in our file so we can access our data. We will read in a sample file from the Davies Corpus; such samples are readily accessible at https://www.corpusdata.org/formats.asp. All the data files are available in this repository within the Data folder.

#### Basic Reading in Data
First we create a file object with the `open` function, and then we read its contents into a variable with the `.read()` method. 

In [None]:
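f = open("Data/sp_short_text.txt") # create a file object
data = f.read() # read the whole file into a single variable
f.close() # close the file once we are done with it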
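
A common alternative, sketched here under assumptions, is the `with` statement, which closes the file automatically once the indented block ends. The temporary file and its contents are invented so that the example runs on its own, and the `utf-8` encoding is also an assumption that should be matched to your actual file.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained (hypothetical content)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as out_file:
    out_file.write("Hola mundo.\nEsto es una prueba.")

# `with` closes the file automatically when the indented block ends
with open(sample_path, encoding="utf-8") as in_file:
    sample_data = in_file.read()

print(type(sample_data))
print(sample_data)
```

With this pattern there is no separate closing step to remember.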

####  <font color=red>Your turn! (2 minutes) </font> 
<font color=red> **Beginner:** What type of variable is `data`? Do you notice anything strange about the text?</font>  

<font color=dark red> **Intermediate:** Practice reading in another file using the `.readlines()` method. What does this do? What type of object does it return? Hint: Don't forget to open the file first.</font> 
 

## 2. Cleaning Data

After reading in a data file, it is important to look at the data to see if there are any encoding issues or other textual issues that need to be addressed. The Spanish sample text above contains random insertions of the symbol "@". This is something we need to remove before we process the data. 
We will use a new method, `.replace()`, to substitute the symbols we wish to remove with nothing (an empty string). 

In [None]:
data_clean = data.replace("@", "")
data_clean

####  <font color=red>Your turn! (4 minutes) </font> 
<font color=red> **Beginner**: Did this solve the problem completely? If not, how might you fix it? What other elements might you want to remove? 
Practice using the `.replace()` method on other string objects. </font>  

<font color= dark red> **Advanced:** What if we also wanted to remove the numbers that follow @@? We can use a new function, `re.sub` from the `re` package, which works with [regular expressions](https://www.w3schools.com/python/python_regex.asp). First we need to import the package using the `import` statement. 
    
What regular expression is needed to identify the sequences @@124, @@1124, etc.? </font>  

In [None]:
import re
data_clean = re.sub("@ ", "", data) # remove every "@" that is followed by a space
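
One possible answer to the question above, as a sketch: the pattern `@@\d+\s*` matches two `@` symbols, one or more digits, and any whitespace that follows. The sample string is invented for illustration and only imitates the markers described above, so check the pattern against your own file.

```python
import re

# Hypothetical snippet imitating the markers in the corpus file
sample = "@@124 Primera frase . @@1124 Segunda frase ."

# @@ matches the two symbols literally; \d+ matches one or more digits;
# \s* also removes any whitespace that follows the number
sample_clean = re.sub(r"@@\d+\s*", "", sample)

print(sample_clean)
```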

## 3. Annotating Data
We will use the spaCy package to create annotated linguistic data. If you have not used this package before, you will first need to install it via the console. You can send a command to the console from the notebook by prefixing it with "!", or you can open your console and type the command there. You only need to do this once on a given computer. 

In [None]:
! conda install -c conda-forge spacy

In [None]:
! python -m spacy download en_core_web_sm
! python -m spacy download pt_core_news_sm
! python -m spacy download es_core_news_sm

Import the spaCy module and create an object for each language you will use (Spanish, English, etc.). 

In [None]:
import spacy
nlp_en = spacy.load("en_core_web_sm")
nlp_sp = spacy.load("es_core_news_sm")

Use the `nlp_sp` function to create a spaCy object from the data. Which variable should we use, `data` or `data_clean`?

In [None]:
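#data_spacy = nlp_sp(data)
data_spacy = nlp_sp(data_clean) # annotate the cleaned text
print(type(data_spacy)) # print the type so it shows even though it is not the last line of the cell
print(data_spacy)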

### Tokenization
Let's look at how the `data_spacy` object divides a text into sentences and tokens. But first we need to learn about loops.

#### Loops
If we want to look at every element of a list and perform a repeated task, we use a loop.  

The general structure is:
```
for <element> in <list>:
    <do something>
    ...
```
This `for`-loop tells Python to take each element from a `list`, call it `element`, and then do the things that follow to it. Note that the words **`for`** and **`in`** are reserved words, meaning that you cannot use them as variable names. The colon at the end of the line and the indentation of the following lines are important parts of the syntax. Together they tell Python that the indented lines are what you want to do with each element.

In [None]:
fruits = ["apples", "bananas", "pears"]

for item in fruits:
    print(item)

In [None]:
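cap_fruits = [] # start with an empty list
for item in fruits:
    cap_item = item.capitalize() # capitalize the first letter
    cap_fruits.append(cap_item) # add the result to the new list

print(cap_fruits)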
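
The same result can be produced with a list comprehension, a compact way of writing a loop that builds a list. This sketch redefines `fruits` so that it runs on its own; `cap_fruits_short` is a name chosen for illustration.

```python
fruits = ["apples", "bananas", "pears"]

# Loop version: start with an empty list and append to it
cap_fruits = []
for item in fruits:
    cap_fruits.append(item.capitalize())

# List comprehension version: the same steps in a single line
cap_fruits_short = [item.capitalize() for item in fruits]

print(cap_fruits_short)
```

List comprehensions appear again in the later sections on annotation and dataframes, so it helps to be able to read them even if you prefer writing loops.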

####  <font color=red>Your turn! (4 minutes) </font> 
<font color=red> **Beginner:** Create a loop that takes each element, deletes the 's' and prints the result. Play around with the `.upper()` and `.swapcase()` methods with the loop to see what they do </font>  

<font color= dark red> **Intermediate:** Make a new list of integers and another empty list. Create a loop that adds 5 to each element and adds the result to the new list. </font>  

Let's go back to tokenization. The data_spacy object is like a list in that you can loop over it to access each token.

In [None]:
# at the word level
for token in data_spacy: 
    print(token.text)

In [None]:
for index, token in enumerate(data_spacy):
    print(index, token.text)

In [None]:
# at the sentence level
for index, sent in enumerate(data_spacy.sents): # Loops
    print(index, sent.text)
    
#[sent.text for sent in data_spacy.sents] # List Comprehension - ADVANCED

len(data_spacy) # number of tokens in the text

### POS Tags
Spacy has two types of POS tags. They can be accessed for each token via attributes.
* POS: The simple [UPOS](https://universaldependencies.org/docs/u/pos/) part-of-speech tag.
* Tag: The detailed part-of-speech tag.

In [None]:
for index, token in enumerate(data_spacy): # Loop
    print(index, token.text, token.pos_, token.tag_)

#### Filtering data
"If" statements can be a useful tool within loops to filter data. The general Python syntax for a simple if statement is. 
`if <condition>:` opens the statement, and the indented block of code on the following lines runs only when the condition is true.

`if` statements may use a variety of logical conditions, such as:
* Equals: a == b
* Not Equals: a != b
* Less than: a < b
* Less than or equal to: a <= b
* Greater than: a > b
* Greater than or equal to: a >= b

In [1]:
a = 33
b = 200
if b > a:
  print("b is greater than a")

b is greater than a


We can now use an `if` statement within a `for` loop to identify all the verbs within our text.

In [None]:
for token in data_spacy:
    if token.pos_ == "VERB":
        print(token.text)

####  <font color=red>Your turn! (4 minutes) </font> 
<font color=red> **Beginner:** Create an `if` statement, using another logical condition. Try testing both integers and string objects.  </font>  

<font color=red> **Advanced:** Create a `for` loop with an  `if` statement to identify all the nouns in our text. Can you write a line of code to figure out how many nouns there are total?  </font>  

### Named Entities

In [None]:
for token in data_spacy: # Loop
    print(token.text, token.ent_iob_)


In [None]:
for ent in data_spacy.ents:
    print(ent.text, ent.label_)
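
The `ent_iob_` attribute marks each token as `B` (begins an entity), `I` (inside an entity), or `O` (outside any entity). The sketch below shows how such tags group tokens into entities. The tokens and tags are written by hand to imitate spaCy output and are not real results from the model. spaCy's own `.ents` already does this grouping for you, so the code is only meant to illustrate what the tags mean.

```python
# Hypothetical tokens and IOB tags, written by hand to imitate token.ent_iob_
tokens = ["Gabriel", "García", "Márquez", "nació", "en", "Colombia", "."]
iob_tags = ["B", "I", "I", "O", "O", "B", "O"]

entities = []  # finished entity strings
current = []   # tokens of the entity currently being built

# zip pairs each token with its tag
for word, tag in zip(tokens, iob_tags):
    if tag == "B":                # a new entity begins
        if current:
            entities.append(" ".join(current))
        current = [word]
    elif tag == "I" and current:  # continue the current entity
        current.append(word)
    else:                         # "O": outside any entity
        if current:
            entities.append(" ".join(current))
        current = []

if current:  # add an entity that ends at the last token
    entities.append(" ".join(current))

print(entities)
```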

### Other Annotations

In [None]:
[(token, token.is_stop) for token in data_spacy] # Stop words
[(token, token.lemma_) for token in data_spacy] # Lemmas
[(token, token.dep_) for token in data_spacy] # dependencies

In [None]:
# Analyze syntax
[chunk.text for chunk in data_spacy.noun_chunks] # noun phrases in the text

In [None]:
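from spacy import displacy
doc = nlp_en("This is a sentence.")
# render draws the dependency tree inside the notebook; serve would start a separate web server instead
displacy.render(doc, style="dep", jupyter=True)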

####  <font color=red>Your turn! (6 minutes) </font> 
<font color=red> **Beginner:** Read in a new document. Explore the various tags by looping over the data. Can you view various tags at once? </font>



## 4. Creating a dataframe
After exploring the annotation levels available in spaCy, we can create a dataframe that contains the information we are interested in. We will use the `pandas` package to create a dataframe and then output the dataframe to a csv file that can be read into R for statistical analysis. First we will import the `pandas` package using the `import` statement.

In [None]:
import pandas as pd

df = pd.DataFrame()
df['Token'] = [token.text for token in data_spacy]
df['POS'] = [token.pos_ for token in data_spacy]
df['NE'] = [token.ent_iob_ for token in data_spacy]
df['Lemma'] = [token.lemma_ for token in data_spacy]
df['Tag'] = [token.tag_ for token in data_spacy]

df


#### <font color=red>Your turn!</font> 

<font color=red>Create a dataframe with a column for POS, NE, and Lemma</font> 

<font color=dark red> **Advanced:** Create a dataframe containing only the nouns that appear in the text. Include columns for the lemma, NE, and `is_stop` annotations. Can you create a column containing the sentence each noun comes from?</font>  
 

## 5. Outputting data

In [None]:
df.to_csv(r'Data/SpacyDF.csv', index=False, header=True) # write the dataframe to a csv file without the row index