# Day 1: Python basics, text processing

Na-Rae Han (`naraehan@pitt.edu`) and David J. Birnbaum (`djbpitt@pitt.edu`) 

June 25-29, [NASSLLI 2018 at CMU](https://www.cmu.edu/nasslli2018/) 

This tutorial is available at https://github.com/naraehan/NASSLLI2018-Corpus-Linguistics. 
- Jump to: [Day 1](day1.ipynb), [Day 2](day2.ipynb), [Day 3](day3.ipynb), [Day 4](day4.ipynb), [Day 5](day5.ipynb)

## Getting around in Jupyter Notebook
- When you launch Jupyter Notebook, you "start" in your personal directory. 
- Move into "Desktop". Create a new folder there and rename it "corpusling", and then move into it. You should store all your Notebook files in here. 
- Create a new Python 3 notebook file and give it a name. 


- Click `+` to create a new cell and ► to run it (also: `Ctrl+ENTER`).
- `Alt+ENTER` runs the cell and creates a new cell below.
- `Shift+ENTER` runs the cell and moves to the next cell.
- More on [this page](https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/)

## The very basics

### First code

* Printing a string, using `print()`. 

In [1]:
print("hello, world!")

hello, world!


### Strings

* String type objects are enclosed in quotation marks (" or ').
* `+` is the concatenation operator.
* Below, `greet` is a variable name assigned to a string value. 
* Here we are not explicitly printing out; instead, a string value is *returned*. 

In [2]:
greet = "Hello, world!"
greet = greet + " I come in peace." + " I'm called merklar."
greet

"Hello, world! I come in peace. I'm called merklar."

* String methods such as `.upper()`, `.lower()` transform a string. 
* Rather than changing the original variable, the commands *return* a *new* string value. 

In [3]:
greet2 = greet.upper().lower() 
greet2

"hello, world! i come in peace. i'm called merklar."

* Some string methods return a boolean value (True/False) 

In [4]:
# try .isupper(), .isalnum(), .startswith('he')
'hello123'.isalnum()

True
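
The other methods suggested in the comment above can be tried the same way. The following is a small sketch; the variable names and the sample strings are made up for illustration.

```python
# Hypothetical sample string, chosen only for illustration.
word = "hello123"

is_upper = word.isupper()             # False: the letters are lowercase
is_alnum = word.isalnum()             # True: only letters and digits
starts_he = word.startswith("he")     # True: the string begins with 'he'
shout_is_upper = "HELLO".isupper()    # True: every letter is uppercase

print(is_upper, is_alnum, starts_he, shout_is_upper)
```

Each call returns `True` or `False` and leaves the original string unchanged.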

* `len()` returns the length of a string, counted in characters. 

In [5]:
len(greet)

50

* `in` tests substring-hood between two strings. 

In [6]:
'he' in 'hello'

True
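
A few more substring tests, as a minimal sketch with made-up sample strings. Note that `in` looks for a contiguous run of characters, and `not in` gives the opposite answer.

```python
# Hypothetical sample string, chosen only for illustration.
text = "hello"

has_he = "he" in text         # True: 'he' occurs at the start
has_lo = "lo" in text         # True: 'lo' occurs at the end
has_hl = "hl" in text         # False: 'h' and 'l' are not adjacent
lacks_z = "z" not in text     # True: 'z' does not occur at all

print(has_he, has_lo, has_hl, lacks_z)
```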

### Numbers

* Integers and floats are written without quotes. 
* You can use algebraic operations such as `+`, `-`, `*` and `/` with numbers. 

In [7]:
num1 = 5678
num2 = 3.141592
result = num1 / num2
print(num1, "divided by", num2, "is", result)  # can print multiple things! 

5678 divided by 3.141592 is 1807.3639097629482
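
Beyond the four operators listed above, Python 3 also has floor division (`//`), remainder (`%`), and the built-in `round()`. The sketch below uses made-up values (`total`, `parts`) purely for illustration.

```python
# Hypothetical values, chosen only for illustration.
total = 17
parts = 5

quotient = total / parts       # true division always returns a float
whole = total // parts         # floor division drops the fractional part
remainder = total % parts      # remainder after floor division

# round() trims a long float to a given number of decimal places
rounded = round(5678 / 3.141592, 2)

print(quotient, whole, remainder, rounded)
```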


### Lists
* Lists are enclosed in `[ ]`, with elements separated with commas. Lists can contain strings, numbers, and more. 
* As with strings, you can use `len()` to get the size of a list. 
* As with strings, you can use `in` to see whether an element is in a list. 

In [8]:
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)

6

In [9]:
# Try logical operators not, and, or
'mauve' in li and 'teal' in li   # evaluates to False; not displayed, since a cell echoes only its last expression
li.append('mauve')
print(li)

['red', 'blue', 'green', 'black', 'white', 'pink', 'mauve']


* A list can be indexed through `li[i]`. Python indexing starts at 0. 
* A list can be sliced: `li[3:5]` returns a sub-list beginning with index 3, up to but not including index 5. 

In [10]:
# Try [0], [2], [-1], [3:5], [3:], [:5]
li[2]

'green'
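
The index and slice expressions suggested in the comment above behave as follows. This is a self-contained sketch: `colors` is a stand-in list defined here so the block runs on its own.

```python
# Stand-in list with the same seven colors, defined here so the block is self-contained.
colors = ['red', 'blue', 'green', 'black', 'white', 'pink', 'mauve']

first = colors[0]        # index 0 is the first element
last = colors[-1]        # negative indexes count from the end
middle = colors[3:5]     # indexes 3 and 4; index 5 is excluded
tail = colors[3:]        # from index 3 to the end
head = colors[:5]        # from the start up to, not including, index 5

print(first, last)
print(middle)
print(tail)
print(head)
```

Slicing returns a new list and leaves `colors` itself unchanged.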

### `for` loop
* Using a `for` loop, you can loop through a list of items, applying the same set of operations to each element. 
* The embedded code block is marked with indentation. 

In [11]:
for x in li :
    print('"'+x.capitalize()+'" is', len(x), "characters long.")
    print('--')
print("Done!")

"Red" is 3 characters long.
--
"Blue" is 4 characters long.
--
"Green" is 5 characters long.
--
"Black" is 5 characters long.
--
"White" is 5 characters long.
--
"Pink" is 4 characters long.
--
"Mauve" is 5 characters long.
--
Done!
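
A loop can also build up a result across iterations instead of printing on each pass. The sketch below, using a short made-up list, adds up the lengths of all the strings.

```python
# Hypothetical short list, chosen only for illustration.
colors = ['red', 'blue', 'green']

total_chars = 0
for c in colors:
    # indented lines run once per element
    total_chars += len(c)

# this unindented line runs once, after the loop has finished
print("Total characters:", total_chars)
```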


### List comprehension
* List comprehension builds a new list from an existing list. 
* You can filter to include only certain elements, and you can apply transformations in the process.
* Try: `.upper()`, `len()`, `+'ish'`

In [12]:
# filter
[x for x in li if len(x)==4]

['blue', 'pink']

In [13]:
# transform
[x.upper() for x in li]

['RED', 'BLUE', 'GREEN', 'BLACK', 'WHITE', 'PINK', 'MAUVE']

In [14]:
# filter and transform
[x.upper() for x in li if len(x)>=5]

['GREEN', 'BLACK', 'WHITE', 'MAUVE']
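
The other transformations suggested above (`len()` and `+'ish'`) fit the same pattern. A minimal sketch with a made-up three-item list:

```python
# Hypothetical short list, chosen only for illustration.
colors = ['red', 'blue', 'green']

# transform: map each string to its length
lengths = [len(c) for c in colors]

# filter and transform: add a suffix to the short words only
ish = [c + 'ish' for c in colors if len(c) <= 4]

print(lengths)
print(ish)
```

The transformation goes before `for`, and the optional filter goes after `if`. The spelling of the results is not checked, of course: the suffix is attached mechanically.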

### Dictionaries
- Dictionaries hold **key:value** mappings. 
- `len()` on a dictionary returns the number of keys. 
- Looping over a dictionary means looping over its keys. 

In [15]:
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Lisa']

8

In [16]:
# 20 years old or younger. x is bound to keys. 
[x for x in di if di[x] <= 20]

['Bart', 'Lisa']

In [17]:
len(di)

4
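
When both the key and the value are needed inside a loop, `.items()` yields them as pairs, and `.values()` gives the values alone. The sketch below redefines the dictionary under a different name (`ages`) so that it runs on its own.

```python
# Same mapping as above, redefined here so the block is self-contained.
ages = {'Homer': 35, 'Marge': 35, 'Bart': 10, 'Lisa': 8}

# looping over the dictionary itself visits the keys
# (in insertion order on Python 3.7 and later)
names = [name for name in ages]

# .items() yields (key, value) pairs, unpacked here into two variables
adults = [name for name, age in ages.items() if age > 20]

# .values() gives the values without the keys
oldest_age = max(ages.values())

print(names)
print(adults)
print(oldest_age)
```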

### Processing a piece of text
- [Visit this page](http://www.pitt.edu/~naraehan/python3/text-samples.txt) and copy-paste the first passage of Moby Dick.  
- Triple quotes (`"""`) let a string literal span multiple lines. 


In [18]:
moby = """Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me."""

In [19]:
# What is '\n'? 
moby

"Call me Ishmael. Some years ago--never mind how long precisely--having\nlittle or no money in my purse, and nothing particular to interest me on\nshore, I thought I would sail about a little and see the watery part of\nthe world. It is a way I have of driving off the spleen and regulating\nthe circulation. Whenever I find myself growing grim about the mouth;\nwhenever it is a damp, drizzly November in my soul; whenever I find\nmyself involuntarily pausing before coffin warehouses, and bringing up\nthe rear of every funeral I meet; and especially whenever my hypos get\nsuch an upper hand of me, that it requires a strong moral principle to\nprevent me from deliberately stepping into the street, and methodically\nknocking people's hats off--then, I account it high time to get to sea\nas soon as I can. This is my substitute for pistol and ball. With a\nphilosophical flourish Cato throws himself upon his sword; I quietly\ntake to the ship. There is nothing surprising in this. If they but k

In [20]:
print(moby)

Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost
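
To answer the question in the comment above: `'\n'` is the newline character. Echoing a string shows its representation, with the newline written as the escape `\n`, while `print()` renders it as an actual line break. A small sketch with a made-up two-line string:

```python
# Hypothetical two-line string, chosen only for illustration.
two_lines = """first line
second line"""

as_repr = repr(two_lines)            # what a notebook cell echoes
has_newline = "\n" in two_lines      # the line break is a single character
pieces = two_lines.split("\n")       # splitting on it recovers the lines

print(as_repr)       # shows the \n escape inside quotes
print(two_lines)     # shows two separate lines
print(len(two_lines), pieces)
```

The newline counts toward `len()` like any other character.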

In [21]:
len(moby)
# But how many _words_?

1110

In [22]:
# .split() is a "poor man's tokenizer". What problem do you see? 
%pprint
moby.split()

Pretty printing has been turned OFF


['Call', 'me', 'Ishmael.', 'Some', 'years', 'ago--never', 'mind', 'how', 'long', 'precisely--having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse,', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore,', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world.', 'It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation.', 'Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth;', 'whenever', 'it', 'is', 'a', 'damp,', 'drizzly', 'November', 'in', 'my', 'soul;', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses,', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'I', 'meet;', 'and', 'especially', 'whenever', 'my', 'hypos', 'get', 'such', 'an', 'upper', 'hand', 'of', 'me,', 'that', 'it', 'requires', 'a', 'strong', 'moral', 'principle', 'to', 'prevent', 'me', 'f
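
The problem is that `.split()` separates on whitespace only, so punctuation stays glued to the words. One stopgap, sketched below on a shortened sample, is to strip punctuation from both ends of each piece with `str.strip()` and `string.punctuation`. This is an illustrative workaround, not the approach this tutorial goes on to use, and it still leaves `ago--never` as a single token.

```python
import string

# Shortened sample, adapted from the passage above for illustration.
sample = "Call me Ishmael. Some years ago--never mind how long precisely."

# whitespace splitting keeps 'Ishmael.' and 'precisely.' with their periods
rough = sample.split()

# strip punctuation from both ends of each piece; internal punctuation remains
stripped = [w.strip(string.punctuation) for w in rough]

print(rough)
print(stripped)
```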

### Using regular expressions for tokenization
* `re` is Python's regular expression module. Start by importing it. 
* `re.findall` finds all substrings that match a pattern.
* For regular expression strings, use the `r'...'` (raw string) prefix so that backslashes reach `re` unchanged. 

In [23]:
import re

In [24]:
sent = "You haven't seen Star Wars...?"
re.findall(r'\w+', sent)

['You', 'haven', 't', 'seen', 'Star', 'Wars']
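
Notice that `\w+` splits `haven't` into `haven` and `t`, because the apostrophe is not a word character. If contractions should stay whole, one possible variation is to allow an apostrophe followed by more word characters. The pattern below is an assumption made for illustration, not the tokenizer used in the rest of this tutorial.

```python
import re

sent = "You haven't seen Star Wars...?"

# plain word-character runs: the apostrophe breaks the contraction
plain = re.findall(r"\w+", sent)

# a word, optionally followed by apostrophe-plus-word groups;
# (?:...) groups without capturing, so findall returns whole matches
with_apos = re.findall(r"\w+(?:'\w+)*", sent)

print(plain)
print(with_apos)
```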

In [25]:
re.findall(r'\w+', moby)

['Call', 'me', 'Ishmael', 'Some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', 'It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', 'Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', 'whenever', 'it', 'is', 'a', 'damp', 'drizzly', 'November', 'in', 'my', 'soul', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'I', 'meet', 'and', 'especially', 'whenever', 'my', 'hypos', 'get', 'such', 'an', 'upper', 'hand', 'of', 'me', 'that', 'it', 'requires', 'a', 'strong', 'moral', 'principle', 'to', 'prevent', 'me', 'from', '

In [26]:
moby_toks = re.findall(r'\w+', moby)

In [27]:
len(moby_toks)

202

#### Type vs. token
- *Tokens* are individual instances of linguistic units. 
- *Types* are unique classes found in the tokens. 

In [28]:
moby_types = set(moby_toks)
moby_types

{'grim', 'thought', 'mouth', 'warehouses', 'methodically', 'same', 'pausing', 'sea', 'There', 'part', 'drizzly', 'regulating', 'requires', 'hand', 'way', 'stepping', 'very', 'nearly', 'principle', 'take', 'my', 'his', 'sword', 'rear', 'ocean', 'years', 'long', 'in', 'to', 'it', 'every', 'upon', 'nothing', 'account', 'about', 'up', 'this', 'ship', 'for', 'ago', 'how', 'can', 'soul', 'If', 'prevent', 'all', 'November', 'substitute', 'driving', 'on', 'money', 'soon', 'of', 's', 'find', 'sail', 'whenever', 'or', 'shore', 'interest', 'precisely', 'high', 'spleen', 'I', 'never', 'especially', 'Ishmael', 'is', 'world', 'me', 'deliberately', 'watery', 'Some', 'before', 'moral', 'but', 'funeral', 'myself', 'their', 'cherish', 'men', 'himself', 'see', 'damp', 'coffin', 'Call', 'some', 'no', 'hypos', 'throws', 'meet', 'Cato', 'the', 'into', 'then', 'purse', 'as', 'This', 'off', 'other', 'and', 'almost', 'strong', 'ball', 'surprising', 'With', 'quietly', 'street', 'get', 'having', 'knew', 'upper',

In [29]:
len(moby_types)

140
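
Dividing the number of types by the number of tokens gives the type-token ratio, a common rough measure of lexical variety. The sketch below computes it on a tiny made-up sentence; the names `sample` and `ttr` are illustrative.

```python
import re

# Hypothetical sample sentence, chosen only for illustration.
sample = "the cat saw the other cat"

tokens = re.findall(r'\w+', sample)   # every word instance
types = set(tokens)                   # distinct words only

# type-token ratio: distinct words divided by total words
ttr = len(types) / len(tokens)

print(len(tokens), len(types), round(ttr, 2))
```

The ratio tends to fall as a text gets longer, so it is best compared across samples of similar size.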

In [30]:
[w for w in moby_types if len(w)>=10]

['warehouses', 'methodically', 'regulating', 'substitute', 'especially', 'deliberately', 'surprising', 'particular', 'involuntarily', 'circulation', 'philosophical']

In [31]:
# lowercased version
moby_ltoks = [t.lower() for t in moby_toks]
moby_ltoks

['call', 'me', 'ishmael', 'some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'i', 'thought', 'i', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', 'it', 'is', 'a', 'way', 'i', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', 'whenever', 'i', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', 'whenever', 'it', 'is', 'a', 'damp', 'drizzly', 'november', 'in', 'my', 'soul', 'whenever', 'i', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'i', 'meet', 'and', 'especially', 'whenever', 'my', 'hypos', 'get', 'such', 'an', 'upper', 'hand', 'of', 'me', 'that', 'it', 'requires', 'a', 'strong', 'moral', 'principle', 'to', 'prevent', 'me', 'from', '

In [32]:
moby_ltypes = set(moby_ltoks)

In [33]:
# sorted() takes a list/set/... and returns a sorted list
sorted(moby_ltypes)

['a', 'about', 'account', 'ago', 'all', 'almost', 'an', 'and', 'as', 'ball', 'before', 'bringing', 'but', 'call', 'can', 'cato', 'cherish', 'circulation', 'coffin', 'damp', 'degree', 'deliberately', 'driving', 'drizzly', 'especially', 'every', 'feelings', 'find', 'flourish', 'for', 'from', 'funeral', 'get', 'grim', 'growing', 'hand', 'hats', 'have', 'having', 'high', 'himself', 'his', 'how', 'hypos', 'i', 'if', 'in', 'interest', 'into', 'involuntarily', 'is', 'ishmael', 'it', 'knew', 'knocking', 'little', 'long', 'me', 'meet', 'men', 'methodically', 'mind', 'money', 'moral', 'mouth', 'my', 'myself', 'nearly', 'never', 'no', 'nothing', 'november', 'ocean', 'of', 'off', 'on', 'or', 'other', 'part', 'particular', 'pausing', 'people', 'philosophical', 'pistol', 'precisely', 'prevent', 'principle', 'purse', 'quietly', 'rear', 'regulating', 'requires', 's', 'sail', 'same', 'sea', 'see', 'ship', 'shore', 'some', 'soon', 'soul', 'spleen', 'stepping', 'street', 'strong', 'substitute', 'such', '

In [34]:
len(moby_ltypes)

135
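
Lowercasing lowers the type count because forms that differ only in capitalization collapse into one type. A small sketch on a made-up sentence shows which types merge; all names here are illustrative.

```python
import re

# Hypothetical sample sentence, chosen only for illustration.
sample = "Whenever I find it. It is whenever I can."

toks = re.findall(r'\w+', sample)
types = set(toks)                         # case-sensitive types
ltypes = set(t.lower() for t in toks)     # case-folded types

# capitalized types whose lowercase form also occurs as a type
merged = sorted(t for t in types if t != t.lower() and t.lower() in types)

print(len(types), len(ltypes))
print(merged)
```

Here `I` does not merge with anything, since the lowercase `i` never occurs in the sample on its own.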

## More tomorrow
- NLTK
- Opening and processing a text file
- How long are George Washington’s sentences on average? 
- Which long words did he use, and how frequent were they? 

All answered on [Day 2 (Tuesday)](day2.ipynb)

## Bring your own corpus
Is there any particular corpus you are looking to work with? Please suggest it for our very last class, when we will take a look at a couple of them together. Ideal candidates are: 
- Shareable with the class (you should either own the corpus or it should be publicly available)
- Moderate in size (100MB or less)

__Please email both Na-Rae and David with your suggestions by TOMORROW NOON__. Please include a web link or attach a zipped archive (if you own the rights) along with a brief description of your end goals. 