# Introduction to Python for Natural Language Processing

In this notebook, we're going to go over some of the basics of Python. This is so that in later sessions we can focus on the big ideas behind the methods, rather than the implementation details.

[Data Types & Operations](#section 1)<br>

[A few tricks up your sleeve](#section 2)<br>

### Time
- Teaching: 20 minutes
- Exercises: 15 minutes

### Python code

In the data directory, you will find a text file of an English dictionary. We can use this to count how many English words end in "ing".

In [None]:
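dictionary_file = 'data/dictionary.txt'
total = 0
# Open the file with 'with' so it is closed automatically when we are done
with open(dictionary_file) as f:
    for line in f:
        word = line.strip()  # remove the trailing newline
        if word.endswith('ing'):
            total = total + 1
print(total)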

## Data Types & Operations

### Arithmetic

In [None]:
5+2

In [None]:
print(5+2)
print(5-2)
print(5*2)
print(5/2)

In [None]:
5>2

### Variable assignment

Assigning variables is something that we do all the time in programming. These aren't quite like the variables from high school algebra, where <i>x</i> represents an unknown to solve for. Instead these are like notes to ourselves that we want to save some value(s) for later use.

Note that the equals sign is directional, like an arrow, telling the computer to give a certain value to a certain label.
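
To make that arrow-like direction concrete, here is a minimal sketch (the name `count` is made up for illustration). Python first works out the value on the right-hand side, then attaches it to the label on the left, which is why the same variable can appear on both sides.

```python
# 'count' is a hypothetical name used only for this illustration
count = 2
# The right-hand side is evaluated first (2 + 1), then stored back in 'count'
count = count + 1
print(count)
```

In algebra, `count = count + 1` would have no solution; here it simply updates the note we left for ourselves.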

In [None]:
# 'a' is being given the value 2; 'b' is given 5
a = 2
b = 5

In [None]:
# Let's perform an operation on the variables
a+b

In [None]:
# Variables can have many different kinds of names
this_number = 2
b/this_number

### Strings

In Python, human language text gets represented as a <i>string</i>. A string is a sequence of characters set off by quotation marks, either double (") or single (').

We will explore different kinds of operations in Python that are specific to human language objects, but it is useful to start by trying to see them as the computer does, as numerical representations.
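
As a rough illustration of that numerical view, the built-in functions `ord` and `chr` convert between a single character and its Unicode code point, and `str.encode` shows the bytes produced when a string is encoded as UTF-8. The sample values below are chosen only for demonstration.

```python
# Each character corresponds to a number (its Unicode code point)
code_point = ord('H')
print(code_point)

# 'chr' goes the other way, from a number back to a character
letter = chr(code_point)
print(letter)

# Encoding a string shows the underlying bytes, one number per ASCII letter
encoded = list("Hi".encode("utf-8"))
print(encoded)
```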

In [None]:
# The iconic string
print("Hello, World!")

In [None]:
# Assign these strings to variables
a = "Hello"
b = 'World'

In [None]:
# Try out arithmetic operations.
# When we add strings we call it 'concatenation'
print(a+b)
print(a*5)

In [None]:
# Unlike a number that consists of a single value, a string is an ordered
# sequence of characters. We can find out the length of that sequence.
len("Hello, World!")

### Lists

The _numbers_ and _strings_ we have just looked at are the two basic data types that we will focus our attention on in this workshop. When we are working with just a few numbers or strings, it is easy to keep track of them, but as we collect more we will want a system to organize them.

One such organizational system is a _list_. A list holds values of any type in a fixed order, and we can apply operations to it much as we did with numbers and strings.
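
Because a list keeps its values in order, two lists with the same items in a different order are not equal. A small sketch, using made-up variable names:

```python
# Two hypothetical lists holding the same words in a different order
opening = ['Call', 'me', 'Ishmael']
shuffled = ['Ishmael', 'me', 'Call']

# Lists are compared item by item, so order matters
same_order = (opening == shuffled)
print(same_order)

# A list can also mix types; here a string sits next to two numbers
mixed = ['chapter', 1, 2.5]
print(len(mixed))
```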

In [None]:
# A list in which each element is a string
['Call', 'me', 'Ishmael']

In [None]:
# Let's assign a couple of lists to variables
list1 = ['Call', 'me', 'Ishmael']
list2 = ['In', 'the', 'beginning']

### Challenge

What will happen when we run the following cell?

In [None]:
print(list1+list2)
print(list1*5)

In [None]:
# As with a string, we can find out the length of a list
len(list1)

In [None]:
# Sometimes we just want a single value from the list at a time
print(list1[0])
print(list1[1])
print(list1[2])

In [None]:
# Or maybe we want the first few
print(list1[0:2])
print(list1[:2])

In [None]:
# Of course, lists can contain numbers or even a mix of numbers and strings
list3 = [7,8,9]
list4 = [7,'ate',9]

In [None]:
# And Python is smart with numbers, so we can add them easily!
sum(list3)

### Challenge

- Concatenate 'list1' and 'list2' into a single list.
- Retrieve the third element from the combined list.
- Retrieve the fourth through sixth elements from the combined list.

## A few tricks up your sleeve

### String Methods

The creators of Python recognize that human language has many important yet idiosyncratic features, so they have tried to make it easy for us to identify and manipulate them. For example, in the demonstration at the very beginning of the workshop, we referred to the idea of the suffix: the final letters of a word tell us something about its grammatical role and potentially the author's argument.

We can analyze or manipulate certain features of a string using its <i>methods</i>. These are basically internal functions that every string automatically possesses. Note that even though a method may appear to transform the string at hand, it returns a new value and leaves the original string unchanged!
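
As a sketch of the suffix idea, `endswith` also accepts a tuple of candidate suffixes, so several endings can be tested at once. The word `walking` is an arbitrary example, not taken from the workshop data.

```python
# 'word' is an arbitrary example chosen for this illustration
word = "walking"

# 'endswith' can test several suffixes at once when given a tuple
has_verb_suffix = word.endswith(('ing', 'ed'))
print(has_verb_suffix)

# Calling a method returns a new value; 'word' itself is untouched
shouted = word.upper()
print(shouted)
print(word)
```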

In [None]:
# Let's assign a variable to perform methods upon
greeting = "Hello, World!"

In [None]:
# We saw the 'endswith' method at the very beginning
# Note the type of output that gets printed
greeting.startswith('H'), greeting.endswith('d')

In [None]:
# We can check whether the string is a letter or a number
this_string = 'f'
this_string.isalpha()

In [None]:
# When there are multiple characters, it checks whether *all*
# of the characters belong to that category
greeting.isalpha(), greeting.isdigit()

In [None]:
# Similarly, we can check whether the string is lower or upper case
greeting.islower(), greeting.isupper(), greeting.istitle()

In [None]:
# Sometimes we want not just to check, but to change the string
greeting.lower(), greeting.upper()

In [None]:
# The case of the string hasn't changed!
greeting

In [None]:
# But if we want to permanently make it lower case we re-assign it
greeting = greeting.lower()
greeting

In [None]:
# Oh hey. And strings are kind of like lists, so we can slice them similarly
greeting[:3]

In [None]:
# Strings may be like lists of characters, but as humans we often treat them as
# lists of words. We can tell the computer to perform that conversion.
greeting.split()

### Challenge

- Return the second through eighth characters in 'greeting'

### Challenge

Split the string below into a list of words and assign this to a new variable.

_NB: A backslash at the end of a line allows a string to continue unbroken onto the next._

In [None]:
new_string = "It is a truth universally acknowledged, that a single \
man in possession of a good fortune must be in want of a wife."
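
A small sketch of how a backslash at the end of a line behaves, using a made-up sentence: the line break after the backslash is dropped, so the result is one ordinary string with no newline in it.

```python
# A hypothetical two-line literal; the backslash joins the lines
opening_line = "Call me \
Ishmael."
print(opening_line)

# The stored string contains no newline character
print('\n' in opening_line)
```

Note that any spaces at the start of the continued line would become part of the string, which is why the second line starts at the left margin.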

### List Comprehension

You can think of list comprehensions as list filters. Often, we don't need every value in a list, just the few that fulfill certain criteria.

In [None]:
# 'list1' contains three words, two of which are in title case.
# We can automatically return those words using a list comprehension
[word for word in list1 if word.istitle()]

In [None]:
# Or we can include all the words in the list but just take their first letters
[word[0] for word in list1]
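
One way to read a comprehension is as shorthand for a loop that builds up a list. The sketch below writes the same filter both ways; `words`, `title_case_loop`, and `title_case_comp` are names invented for this example.

```python
words = ['Call', 'me', 'Ishmael']

# Long form: start with an empty list and add each word that passes the test
title_case_loop = []
for word in words:
    if word.istitle():
        title_case_loop = title_case_loop + [word]

# Short form: the same filter written as a list comprehension
title_case_comp = [word for word in words if word.istitle()]

print(title_case_loop)
print(title_case_comp)
```

Both versions produce the same list; the comprehension just states the loop, the test, and the kept value on a single line.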

### Challenge

Using the list of words you produced by splitting 'new_string', create a new list that contains only the words whose last letter is "e".

### Challenge

Create a new list that contains the first letter of each word.

### Challenge

Create a new list that contains only words longer than two letters.

### BONUS: Exploratory Natural Language Processing Tasks

Now that we have some of Python's basics in our toolkit, we can immediately perform the kinds of tasks that are the digital humanist's bread and butter. When we first meet a text in the wild, we often wish to find out a little about it before digging in deeply, so we start with simple questions like "How many words are in this text?" or "How long is the average word?"

### Challenge

Run the cell below to read in the text of "Pride and Prejudice" and answer the following questions:

- How many words are in the novel?
- How many words in the novel appear in title case?
- Approximately how long is the average word in the novel?

In [None]:
austen_file = 'data/pride-and-prejudice.txt'
with open(austen_file) as f:
    contents = f.read()