# Cultural and Literary Text Mining
## Lab Session A: Python Fundamentals [0]

**Date**: June 21

**Author**: Kent K. Chang

**Accompanying slides**: [link](http://caltmig.kentchang.com/lab/slides/python-fundamentals.html#/)

* * *

## The first program (as per tradition)

In [None]:
print('Hello World')

**Question 1**: What is the difference between 
```python
print 'Hello World'
```
and 
```python
print('Hello World')
```

Questions are for you to review at home and maybe make this session more interactive. :)

## The second program

### Task 1: load `.txt` files remotely

Focus only on the use of `import`; ignore bits as instructed.

In [None]:
import urllib.request

# ignore headers = {...} for now
url = 'http://caltmig.kentchang.com/lab/resources/de_profundis.txt'
req = urllib.request.Request(url, headers = {'User-Agent': 'Mozilla/5.0'})
print(req)
# pay attention to this particular language feature (slides)
raw = urllib.request.urlopen(req).read().decode('utf8')

# ignore [:15] for now
print(raw[:15]) 

**Question 2-1**: What happens on the first line? 

**Question 2-2**: What is `utf8`? Why do we need to decode?

**Question 2-3**: Bonus question: Who is the author of the content in the text file?

* * *

`urllib.request.Request()` is too damn long. Can we make it shorter?

In [None]:
from urllib.request import Request, urlopen

req = Request('http://caltmig.kentchang.com/lab/resources/de_profundis.txt', headers={'User-Agent': 'Mozilla/5.0'})
raw = urlopen(req).read().decode('utf8')

# ignore [:15] for now
print(raw[:15])

Keep in mind, however, that `from` is most useful when you only need one or two specific names (functions or classes) from a module. If you're using several modules, spelling things out can make your code more readable, because each call shows where it comes from.
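
As a rough illustration of that trade-off, the sketch below calls the same standard-library function both ways. It uses `urllib.parse.urlparse`, which is not otherwise part of this session and was picked only because it needs no network access; the URL is the one from Task 1.

```python
import urllib.parse
from urllib.parse import urlparse

url = 'http://caltmig.kentchang.com/lab/resources/de_profundis.txt'

# fully qualified: longer, but each call shows which module it comes from
long_form = urllib.parse.urlparse(url)

# imported with `from`: shorter, but the origin is only visible in the import line
short_form = urlparse(url)

print(long_form.netloc)          # the host part of the URL
print(long_form == short_form)   # both names refer to the same function
```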
* * *

### Fun with NLTK
#### Task 2: Tokenize the loaded `.txt`

The following snippet runs NLTK's default word tokenizer. There are a few other built-in ones, but let's stick with the default for now.

In [None]:
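import nltk

# if this raises a LookupError, follow the error message and download the tokenizer data once,
# e.g. nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab' instead)
tokens = nltk.word_tokenize(raw)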
print(tokens[:10])

# this will give you the number of tokens (word-ish) in *De Profundis*
# len(): return the length of an object
print(len(tokens))

The output `['a', 'b', . . .]` shows that the tokenizer turns one long string into a list of smaller units, where each element (or item) in the list is a token. A list is a data type in Python, which is our next topic.
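
To see the string-to-list idea without NLTK, here is a minimal sketch using Python's built-in `str.split()`, which splits on whitespace only. It is a much cruder stand-in for `nltk.word_tokenize()`, and the sample sentence is just for illustration.

```python
sample = 'The tokenizer turns a long string into a list.'
rough_tokens = sample.split()   # split on whitespace only

print(type(sample))         # one single string
print(type(rough_tokens))   # a list of shorter strings
print(rough_tokens)
print(len(rough_tokens))    # number of whitespace-separated pieces
```

Notice that the last piece still has the full stop glued to it, which is one reason to prefer a real tokenizer.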

**Question 3:** What is wrong with the first ten tokens?
* * *

## Python data types
### List

Let's create a list of Wilde's works in chronological order:

In [None]:
wildes_works = ['The Happy Prince and Other Tales (1888)', 'Lady Windermere\'s Fan (1892)', 'A Woman of No Importance (1893)']
print(wildes_works)

Note that strings are enclosed in single or double quotes. Numbers aren't:

In [None]:
first_three_int = [1, 2, 3]
print(first_three_int)

**Question 4:** What does `\` do in `wildes_works[1]`?
* * *
Let's add Kent's favorite play to the list, using the `append()` method.

In [None]:
wildes_works.append('The Importance of Being Ernest (1895)')
print(wildes_works)

O how stupid of me – there's a typo in the last element. Since lists are mutable, we can correct it:

In [None]:
wildes_works[-1] = 'The Importance of Being Earnest (1895)'
print(wildes_works)

I very smoothly introduced the magical index of `-1`, which refers to the last element in a list. Similar to `append()` is `extend()`:

In [None]:
wildes_works.extend(['De Profundis (1905)', 'The Ballad of Reading Gaol (1898)'])
print(wildes_works)

**Question 5:** What is the difference between `append()` and `extend()`?
* * *
Life is difficult; sometimes we need only comedies. Let's get rid of the sad stories and the very sad poem in the list. You can delete an element in three ways (see slides):

In [None]:
del wildes_works[0]
# +: for string concatenation, like in JavaScript (or . in PHP)
print(wildes_works.pop() + ' has been deleted.') # does four things at the same time
wildes_works.remove('De Profundis (1905)')
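
The three approaches differ in how you point at the element and in what you get back. A small sketch on a throwaway list (the name `letters` is only for illustration):

```python
letters = ['a', 'b', 'c', 'd', 'e']

del letters[0]          # by index; a statement, so it gives nothing back
last = letters.pop()    # by index (the last one by default); returns the removed element
letters.remove('c')     # by value; removes the first match and returns None

print(last)
print(letters)
```

`remove()` raises a `ValueError` if the value is not in the list, while `del` and `pop()` raise an `IndexError` for an index that does not exist.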

Now we need to add *An Ideal Husband* (1895), the other comedy Wilde wrote, to the list using `insert()`. That gives us two plays produced in the same year, so we should probably order those two alphabetically, which puts *An Ideal Husband* before *Earnest*. `insert()` lets us specify the index at which the new element is inserted:

In [None]:
wildes_works.insert(2, 'An Ideal Husband (1895)')
print(wildes_works)
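
If it helps, here is a small sketch of how `insert()` treats different indices, using a made-up list called `order` so that `wildes_works` stays untouched:

```python
order = ['first', 'second', 'fourth']

order.insert(2, 'third')     # lands at index 2; 'fourth' shifts to index 3
print(order)

order.insert(0, 'zeroth')    # index 0 puts the new element at the front
print(order)

order.insert(100, 'last')    # an index past the end simply appends
print(order)
```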

#### List slicing

A cool thing about lists is that you can slice them using `:`. Watch:

In [None]:
# the whole list
print(wildes_works)
# the sliced list
print(wildes_works[0:2])

**Question 6:** Play around with the following:

* `wildes_works[1:3]`
* `wildes_works[1:]`
* `wildes_works[:4]`

And generalize what `list[x:y]` means.

* * *

### String

Suppose we have a string `fruit = 'banana'`. Guess if this will work:
```python
print(fruit[0:2])
```
The output will either be a type error or something else. 

In [None]:
# named `fruit` rather than `str`, which would shadow Python's built-in `str` type
fruit = 'banana'
print(fruit[0:2])

Why did we get what we get? See slides.
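
In short, a string behaves like a sequence of characters. The sketch below pokes at that idea with a separate variable, `word`, and also shows the one big difference from lists: strings cannot be changed in place.

```python
word = 'banana'

print(word[0])      # indexing works as with lists
print(word[-1])     # so do negative indices
print(word[2:4])    # and slices
print(len(word))

# unlike lists, strings are immutable: item assignment raises a TypeError
try:
    word[0] = 'B'
except TypeError as error:
    print('TypeError:', error)

# build a new string instead
capitalized = 'B' + word[1:]
print(capitalized)
```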
* * *

### List (reprise)
#### Accessing nested sequences . . . and slicing them

Since a string is technically a sequence of characters, a string in a list is technically a nested sequence. Let's try accessing elements in this nested sequence:

In [None]:
# third character in the string 'A Woman of No Importance (1893)'
print(wildes_works[1][2])

Being a sequence, it can be sliced too:

In [None]:
print(wildes_works[1][:-7])

This gets rid of the last seven characters (including the space), here ` (1893)`.
* * *

#### List concatenation

`+`, too, works with lists. 

In [None]:
happy_prince_tales_1 = ['The Happy Prince', 'The Nightingale and the Rose']
print(happy_prince_tales_1)
happy_prince_tales_2 = ['The Selfish Giant', 'The Devoted Friend']
print(happy_prince_tales_2)
happy_prince_tales = happy_prince_tales_1 + happy_prince_tales_2
print(happy_prince_tales)

**Question 7-1:** We have left one story, “The Remarkable Rocket”, out. Can we do this:
```python
happy_prince_tales_1 + happy_prince_tales_2 + 'The Remarkable Rocket'
```
If we can, why? If not, how can we fix it?

**Question 7-2:** Can we do this:
```python
happy_prince_tales_1[0] + happy_prince_tales_2[0] + 'The Remarkable Rocket'
```
If we can, why? If not, how can we fix it?

* * *

#### Traversing lists

At this point, you have
```python
happy_prince_tales = ['The Happy Prince', 'The Nightingale and the Rose', 
                      'The Selfish Giant', 'The Devoted Friend', 
                      'The Remarkable Rocket']
```
Note that inside brackets you can break a statement across several lines and add extra spaces to improve readability.

Let's print each element in our list, and concatenate string elements with other strings to form a natural sentence.

Intuitively, we need to loop through the list. As in other programming languages, there are two statements for iteration: `while` and `for`. We'll learn about the `for`-loop here. (See slides.)

In [None]:
for happy_prince_tale in happy_prince_tales:
    print(happy_prince_tale)
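
For comparison, the same traversal with `while` might look like the sketch below. It uses its own short list, `tales`, and an extra list, `visited`, that exists only to record what the loop has seen.

```python
tales = ['The Happy Prince', 'The Nightingale and the Rose', 'The Selfish Giant']

i = 0                      # we have to manage the index ourselves
visited = []
while i < len(tales):      # stop once the index runs past the last element
    print(tales[i])
    visited.append(tales[i])
    i += 1                 # forgetting this line gives an infinite loop
```

The `for`-loop does the same job without the index bookkeeping, which is why it is usually the better fit for walking through a list.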

Let's give this list some context:

In [None]:
print('Oscar Wilde\'s \"The Happy Prince and Other Tales\" includes the following stories: ')
for happy_prince_tale in happy_prince_tales:
    print('\"' + happy_prince_tale + '\"') # insisting on correct punctuation

What if we want to write it in one sentence:

In [None]:
# print is a function that accepts multiple arguments, one being `end` 
print('Oscar Wilde\'s \"The Happy Prince and Other Tales\" includes the following stories: ', end = "")
for happy_prince_tale in happy_prince_tales:
    print('\"' + happy_prince_tale + '\", ', end = "")

We can use a simple `if`-statement to do something else for the last element:

In [None]:
print('Oscar Wilde\'s \"The Happy Prince and Other Tales\" includes the following stories: ', end = "")
for i in range(len(happy_prince_tales)):
    if i == (len(happy_prince_tales)-1):
        print('\"' + happy_prince_tales[i] + '\". ', end = "")
    else:
        print('\"' + happy_prince_tales[i] + '\", ', end = "")

The `if`-statement is pretty self-explanatory. But in fact, there's a cool string method, `join()`, that gets us the same sentence:

In [None]:
print('Oscar Wilde\'s \"The Happy Prince and Other Tales\" includes the following stories: \"' 
      + '\", \"'.join(happy_prince_tales) + '".')
# slides
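
Stripped of all the quote marks, `join()` boils down to the sketch below: you call it on the separator and pass it the list. The lists `parts` and `years` are just illustrative.

```python
parts = ['a', 'b', 'c']

with_commas = ', '.join(parts)   # the separator goes between elements, never after the last
no_separator = ''.join(parts)    # an empty separator simply glues the elements together
print(with_commas)
print(no_separator)

# join() only accepts strings; a list of numbers raises a TypeError
years = [1888, 1892, 1893]
try:
    ', '.join(years)
except TypeError as error:
    print('TypeError:', error)

# so turn each number into a string first
year_strings = []
for year in years:
    year_strings.append('{}'.format(year))
joined_years = ', '.join(year_strings)
print(joined_years)
```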

Alternatively, we can add quotes first and use `.join()` to add commas only.

In [None]:
happy_prince_tales_with_quotes = []
for happy_prince_tale in happy_prince_tales:
    happy_prince_tales_with_quotes.append('\"' + happy_prince_tale + '\"')
    
print(happy_prince_tales_with_quotes)
print('Oscar Wilde\'s \"The Happy Prince and Other Tales\" includes the following stories: ', end = "")
print(', '.join(happy_prince_tales_with_quotes) + '.')

#### List comprehension

See slides.

In [None]:
happy_prince_tales = ['\"' + happy_prince_tale + '\"' for happy_prince_tale in happy_prince_tales]
print('Oscar Wilde\'s \"The Happy Prince and Other Tales\" includes the following stories: ', end = "")
print(', '.join(happy_prince_tales) + '.')
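
A side-by-side sketch may help here: the comprehension is a compact way of writing the "start with an empty list, loop, append" pattern. The list `titles` and the length threshold are made up for the example.

```python
titles = ['The Happy Prince', 'The Selfish Giant', 'The Devoted Friend']

# the loop version: empty list, loop, append
upper_loop = []
for title in titles:
    upper_loop.append(title.upper())

# the same thing as a list comprehension
upper_comp = [title.upper() for title in titles]
print(upper_loop == upper_comp)

# a comprehension can also filter with a trailing `if`
long_titles = [title for title in titles if len(title) > 16]
print(long_titles)
```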

* * *

### Dictionary

Technically, dictionaries represent a mapping between keys (which play the role of indices) and their associated values. In a sense, that's also how actual dictionaries work. For example, if you want to know the German word for "day", you look it up in an English-German dictionary: you search for the entry "day" (the `key` in Python) and you get "Tag" (the corresponding `value` of that `key`).

```python
en2de_dict = {'day': 'Tag'}
```

This gives you the German equivalent of "day" in this dictionary of `en2de_dict`:

```python
en2de_dict['day']
```

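Note that curly braces `{}` are used for both dictionaries and sets. Neither is indexed by position; a dictionary is indexed by its keys (and, since Python 3.7, it remembers the order in which keys were inserted).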

In [None]:
en2de_dict = {'day': 'Tag'}
de = en2de_dict['day']
print(de)
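
A quick way to check how the braces behave, assuming Python 3.7 or later (where dictionaries keep insertion order). All names here are throwaway examples:

```python
empty = {}                  # an empty dictionary, not an empty set
print(type(empty))

vowels = {'a', 'e', 'i'}    # braces around bare values make a set
print(type(vowels))

numbers_de = {'one': 'eins', 'two': 'zwei', 'three': 'drei'}
print(list(numbers_de))     # the keys, in the order they were inserted
print(numbers_de['two'])    # lookup is by key, not by position
```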

**Question 8**: Would this work:

```python
print(en2de_dict[1])
```

Why or why not?

In [None]:
en2de_dict['Good'] = 'Gute'
en2de_dict.update({'I': 'ich'})
en2de_dict.update({'“': '„', 
                   '”': '“'})
print(en2de_dict)

Dictionary objects come with handy methods like `get()` and `items()`:

In [None]:
# get()
if en2de_dict.get('I'):
    print('The word \"I\" is in the dictionary')
else:
    print('The word \"I\" is not in the dictionary')

print('-------------------------------------------')

if en2de_dict.get('Oscar'):
    print('The word \"Oscar\" is in the dictionary')
else:
    print('The word \"Oscar\" is not in the dictionary')
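
Two details about `get()` are worth a small sketch: it accepts a fallback value as a second argument, and testing its result with `if` can mislead you when a stored value is itself "empty". The dictionary `lookup` and its odd `'nothing'` entry are invented for the demonstration.

```python
lookup = {'day': 'Tag', 'nothing': ''}

print(lookup.get('day'))              # the stored value
print(lookup.get('Oscar'))            # None, because the key is missing
print(lookup.get('Oscar', 'Oscar'))   # a second argument is returned instead of None

# the key 'nothing' exists, but its value is an empty string, which counts as false
print(bool(lookup.get('nothing')))
# `in` checks for the key itself, whatever the value is
print('nothing' in lookup)
```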

In [None]:
# items() - note it's plural
for key, value in en2de_dict.items():
    print('{0} is {1} in German.'.format(key, value))

I subtly introduced `format()`.
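
In case that was too subtle: `format()` fills the `{}` placeholders in a string with its arguments. The sketch below shows the numbered form used above, a named form, and the f-string shorthand available from Python 3.6 on.

```python
key, value = 'day', 'Tag'

by_position = '{0} is {1} in German.'.format(key, value)      # numbers pick arguments by position
by_name = '{en} is {de} in German.'.format(en=key, de=value)  # names pick keyword arguments
f_string = f'{key} is {value} in German.'                     # variables are read directly

print(by_position)
print(by_position == by_name == f_string)
```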

* * *

If you are comfortable with the above material you may attempt this before we finish off the meeting.

## Wrap-it-up task: a simple translator

Let's practice everything we've learned today. You're given a string (the quotes inside aren't escaped because they are curly quotes, which Python does not treat as string delimiters):

```python
en = 'Kent says: “Good night!”'
```

You have to write code such that when you run `print(de)`, it prints

```
Kent sagt: „Gute Nacht!“
```

Some steps:

* add new items to the dictionary as required
* tokenize the sentence `en`
* loop through the list and check if a word is in the dictionary through `get()`
* if it is, replace the word in `en` with the corresponding value in the dictionary (use **list comprehension** when you can)
* try to concatenate elements in the updated list into a sentence
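
If you get stuck, here is one possible sketch of those steps, not the only way to do it. It makes two assumptions that are not part of the task: a regular expression stands in for the NLTK tokenizer, and whitespace is kept as tokens so that the sentence can be glued back together without extra spaces around punctuation.

```python
import re

# the dictionary from the cells above, plus the two entries this sentence needs
en2de_dict = {'day': 'Tag', 'Good': 'Gute', 'I': 'ich', '“': '„', '”': '“',
              'says': 'sagt', 'night': 'Nacht'}

en = 'Kent says: “Good night!”'

# assumption: regex tokenizer instead of nltk.word_tokenize();
# tokens are words, runs of whitespace, or single punctuation marks
tokens = re.findall(r'\w+|\s+|[^\w\s]', en)

# keep a token unchanged when the dictionary has no entry for it
translated = [en2de_dict.get(token, token) for token in tokens]

# whitespace was kept as tokens, so joining on the empty string restores the spacing
de = ''.join(translated)
print(de)
```

With `nltk.word_tokenize()` the whitespace is dropped, so the last step needs more care: joining with `' '` would put spaces before the punctuation marks.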