# 2015 Computational Social Science Workshop

# Day 1 - Introduction to `python` - Part 2 / 3
with **Jongbin Jung** (jongbin at stanford.edu)
- PhD student. Decision analysis, MS&E


All material for days 1 (intro to `python`) and 2 (web scraping with `python`) is publicly available at https://github.com/jongbinjung/css-python-workshop

## 2. `python` development (with `spyder`)

Lines of `python` code can be saved to a plain text file, conventionally named with a `.py` extension, which can be read by the interpreter to run a `python` program. A common setup for `python` development is to have
1. a text editor of choice
1. a command line/terminal open to run the `.py` file
1. (optionally) an interpreter for testing small pieces of code
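
As a minimal sketch of that workflow, assume a file named `hello.py` (a hypothetical name, used only for illustration) that contains the lines below. The single-argument `print(...)` form is used so that the file runs under both Python 2 and Python 3.

```python
# hello.py: hypothetical file name, used only for illustration
greeting = 'Hello, workshop!'   # an ordinary string variable
print(greeting)                 # the interpreter runs the file from top to bottom
```

Saving the file and typing `python hello.py` in a terminal opened in the same folder should print the greeting.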

Having your own setup can be great if you have a favorite text editor and love pushing commands around different windows. Some text editors have pretty good support for `python` development, too. 

Another option is to use an IDE (**I**ntegrated **D**evelopment **E**nvironment), specifically catered to your `python` development needs. [PyCharm](https://www.jetbrains.com/pycharm/) is a pretty good one, and will be familiar if you've worked with IDEA, Android Studio, WebStorm, PhpStorm, etc. (They're all based on the same platform.) Today, we'll be working with `spyder` because

1. it's free
1. it comes included with Anaconda
1. it loads faster (compared to PyCharm)

### Launching `spyder`
`Spyder` is best launched from Anaconda's launcher. 

- #### launch the `launcher`
 - **Windows**
   - <kbd>Win</kbd> + <kbd>r</kbd>, type `launcher` and hit <kbd>OK</kbd> (or <kbd>Enter</kbd>)
  <img src=img/ss_win_run_launcher.png alt="Run command on Windows" width=300>
 - **OS X / \*nix**
   - Open a terminal
   - type `launcher` and hit <kbd>Enter</kbd>
  <img src=img/ss_osx_term_launcher.png alt="Run launcher from a terminal">
 
<img src=img/ss_launcher.png alt="Anaconda launcher" width=600>

- #### hit <kbd>Launch</kbd> for `spyder-app`

<img src=img/ss_spyder.png alt="Spyder main window" width=600>

By default, `spyder` has a text editor on the left pane, an interactive console (to which you can also send selected commands from the text editor pane with <kbd>Shift</kbd>+<kbd>Enter</kbd>) and object/variable/file browsers on the right side.

Feel free to explore and get used to the `spyder` environment before we move on.

(One feature I find particularly useful is the <kbd>Ctrl</kbd>+<kbd>i</kbd> shortcut, which displays documentation for the object at my current cursor, whether in the editor or console.)

### `if` statements
An example should suffice:

In [92]:
x = int(raw_input('Give me a BIG number: '))
if x < 0:
    print 'You\'re joking, right?'
elif x < 1e3:
    print 'Try harder ... '
else:
    print 'Nice.'

Give me a BIG number: 1000
Nice.


Some notes on the above code:
- the `raw_input()` function (as you've now seen) prompts the user for an input
- the `int()` function (tries to) convert string values to integers (`raw_input()` will always return the user's input as a string)
- `elif` is short for 'else if', and there can be zero or more `elif` clauses
- the `else` clause is optional
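
To make the "(tries to)" part concrete, here is a small sketch of what `int()` accepts and what it rejects. It uses a `try`/`except` block, which isn't covered in this session, only to catch the error, and the variable names are made up for the illustration.

```python
# int() accepts strings that look like whole numbers
big = int('1000')        # digits only
negative = int(' -42 ')  # a sign and surrounding whitespace are tolerated

# anything else raises a ValueError, which can be caught
try:
    int('1e3')           # scientific notation is not an integer literal
    converted = True
except ValueError:
    converted = False

print(big)
print(negative)
print(converted)
```

So typing `1e3` at the prompt in the example above would stop the program with an error rather than being read as 1000.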

One more thing that's implicit but *__extremely__* important: **Indents.**

- `python`, unlike many other languages out there, doesn't use curly brackets {}
- instead, blocks of grouped code are identified by the level of indents (this is something to get used to, if you've never seen it before)
- word of caution: NEVER USE <kbd>Tab</kbd> (don't worry, `spyder` changes all your <kbd>Tab</kbd>s to four spaces by default, which is the [PEP 8 spec for indentation][pep8] in `python`)

[pep8]: https://www.python.org/dev/peps/pep-0008/#indentation
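
A small sketch of how the indent level alone decides which lines belong to a block (the `log` list is just a made-up way to record which lines ran):

```python
log = []
x = 5

if x > 10:
    log.append('inside the if block')        # indented: runs only when x > 10
    log.append('still inside the if block')  # same indent level, same block
log.append('outside the if block')           # back at the margin: always runs

print(log)
```

Since `x` is 5, only the last `append` runs. Indenting that last line by four spaces would make it part of the `if` block, and the list would stay empty.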

### `for` statements
The `for` statement in `python` iterates over the items of any sequence (e.g., lists and even strings!), in the order that they appear in the sequence.

In [93]:
names = ['Jamie', 'Cersei', 'Jon', 'Sansa']

for name in names:
    print name, 'has', len(name), 'characters and starts with a', name[0]

Jamie has 5 characters and starts with a J
Cersei has 6 characters and starts with a C
Jon has 3 characters and starts with a J
Sansa has 5 characters and starts with a S


The example above introduces a few new concepts:
- the variable `name` is defined along with the declaration of the `for` statement. It doesn't need to exist beforehand
- the `print` statement can take multiple expressions of different types, (try to) change them to a string, and insert a space between each item (separated by commas)
- it's good practice to use plurals for collections (`names` for the list) and singulars for individual items (`name` for each name)

You can also loop over a string, one character at a time.

In [94]:
vowels = ['a', 'e', 'i', 'o', 'u']  # make a list of vowels
for name in names:
    vowel_count = 0  # initialize the vowel count
    for char in name:
        if char in vowels:
            vowel_count += 1
    print name, 'has', vowel_count, 'vowel(s)'

Jamie has 3 vowel(s)
Cersei has 3 vowel(s)
Jon has 1 vowel(s)
Sansa has 2 vowel(s)


You can use the built-in `range()` function to do a more 'classic' `for` loop over a sequence of numbers.

In [95]:
range(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

`range(n)` generates the legal indices (starting from 0) for a sequence of length `n`. You can also use `range(start, stop[, step])` to specify the start, the stop (which is excluded), and (optionally) the step to take.

(The `[, step]` notation in the function signature shows that the `step` argument is optional. It's useful to know such conventions when referring to the docs.)

In [96]:
range(4,20)

[4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [97]:
range(4,20,7)

[4, 11, 18]

In [98]:
range(20,4,-3)

[20, 17, 14, 11, 8, 5]

You can combine `range()` with `len()` to iterate over the indices of a sequence.

In [99]:
for i in range(len(names)):
    print 'Name', i, 'is', names[i]

Name 0 is Jamie
Name 1 is Cersei
Name 2 is Jon
Name 3 is Sansa


But in such cases, the `enumerate()` function is usually more convenient.

In [100]:
for i, name in enumerate(names):
    print 'Name', i, 'is', name

Name 0 is Jamie
Name 1 is Cersei
Name 2 is Jon
Name 3 is Sansa


As you might have guessed, the `enumerate()` function takes a sequence and returns an (index, value) pair for each item. (The 'pairs' are actually called `tuple`s, but more on that later...) In the `for` statement, you can assign each item of a `tuple` to its own variable.

In [101]:
print list(enumerate(names))

[(0, 'Jamie'), (1, 'Cersei'), (2, 'Jon'), (3, 'Sansa')]


Occasionally, you might want to loop over two or more sequences at a time. You can pair the entries with the `zip()` function.

In [102]:
title = 'Game of Thrones'
houses = ['Lannister', 'Lannister', 'Snow', 'Stark']
for char, house, name in zip(title, houses, names):
    print char, '-', name, house

G - Jamie Lannister
a - Cersei Lannister
m - Jon Snow
e - Sansa Stark


Note how `zip()` gracefully fits the iterator to the length of the shortest sequence, i.e., only the first four characters of the string 'Game of Thrones' were iterated.
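
The same truncation, as a stripped-down sketch with two made-up sequences of different lengths. Wrapping the result in `list()` makes it behave the same on Python 2 and Python 3.

```python
letters = 'abcdef'    # 6 items
numbers = [1, 2, 3]   # 3 items

pairs = list(zip(letters, numbers))  # stops as soon as the shorter one runs out

print(pairs)          # the unmatched 'd', 'e', 'f' are silently dropped
print(len(pairs) == min(len(letters), len(numbers)))
```

No error or warning is raised for the leftover items, so it's worth checking lengths yourself if dropping data would be a problem.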

### `break` and `continue` statements
You can manage your loops in more detail using `break` and `continue` statements. 

A `break` statement, as the name implies, will break you out of the smallest enclosing loop.

In [103]:
for name, house in zip(names, houses):
    if house == 'Snow':
        break
    else:
        print name, house

Jamie Lannister
Cersei Lannister


A `continue` statement will simply skip over to the next item in the iterator, instead of breaking out of the loop.

In [104]:
for name, house in zip(names, houses):
    if house == 'Snow':
        continue  # compare to the previous example where we stopped the loop at Snow, now we simply skip it
    else:
        print name, house

Jamie Lannister
Cersei Lannister
Sansa Stark


### Some more data structures
Before we move on, now might be a good time to cover a few more data structures.
#### `dict` (dictionary)
The most useful data structure in `python` (my very personal opinion)! Also known as *associative arrays* or *hash tables* in other languages, a `python` dictionary maps *hashable* values to *arbitrary* objects. Dictionaries can be created by placing a comma-separated list of `key:value` pairs within curly braces. Just remember that the `key` must be hashable, which in practice means an immutable value (like a string).

In [106]:
me = {'name':'Jongbin', 'email':'jongbin@stanford.edu'}
print me

{'name': 'Jongbin', 'email': 'jongbin@stanford.edu'}


You can assign new keys to existing dictionaries.

In [107]:
me['cel'] = '650-123-4567'
print me

{'cel': '650-123-4567', 'name': 'Jongbin', 'email': 'jongbin@stanford.edu'}


Or delete existing `key:value` pairs with the `del` statement.

In [108]:
del me['email']
print me

{'cel': '650-123-4567', 'name': 'Jongbin'}


The `key` of a dictionary can't be a list (because lists are mutable), but the `value` sure can!

In [109]:
me['siblings'] = ['Hanbyul', 'Hansol']
print me

{'siblings': ['Hanbyul', 'Hansol'], 'cel': '650-123-4567', 'name': 'Jongbin'}


Use the `keys()` method of dictionary objects to get a list of the keys used in the dictionary.

In [110]:
me.keys()

['siblings', 'cel', 'name']

And use the `in` keyword (which works on any list) to see if a certain key exists in the dictionary. (Writing `'name' in me` directly works too, without building the list of keys first.)

In [111]:
'name' in me.keys()

True

In [112]:
'email' in me.keys()

False

When the keys are simple strings, it is sometimes easier to specify pairs using the `dict` constructor.

In [113]:
me = dict(name='Jongbin', email='jongbin@stanford.edu', siblings=['Hanbyul', 'Hansol'])
print me

{'siblings': ['Hanbyul', 'Hansol'], 'email': 'jongbin@stanford.edu', 'name': 'Jongbin'}


The `iteritems()` method lets you loop over each `key:value` pair.

In [114]:
for key, value in me.iteritems():
    print key, ':', value

siblings : ['Hanbyul', 'Hansol']
email : jongbin@stanford.edu
name : Jongbin
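
Note that the output above is not in the order the keys were written: a plain `dict` in Python 2 makes no promise about ordering. When a predictable order matters, one option is to loop over the sorted keys, as in the sketch below. It also uses `items()`, which (unlike `iteritems()`) exists in both Python 2 and Python 3.

```python
me = dict(name='Jongbin', email='jongbin@stanford.edu',
          siblings=['Hanbyul', 'Hansol'])

# sorted(me) returns the keys in alphabetical order
for key in sorted(me):
    print('%s : %s' % (key, me[key]))

# items() gives (key, value) pairs; sorting them orders the pairs by key
pairs = sorted(me.items())
```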


#### `tuple`s
`Tuple`s are pretty similar to lists, except for the fact that they are immutable. They consist of a number of values separated by commas (not necessarily, but often, enclosed in parentheses).

In [115]:
description = 'male', 'dark hair'
print description

('male', 'dark hair')


In [116]:
description[0]  # tuples are also sequences, and can be indexed

'male'

In [117]:
description[1:]  # or sliced

('dark hair',)

In [118]:
description[0] = 'female'  # but NOT changed, because they are immutable

TypeError: 'tuple' object does not support item assignment

While being immutable may seem like a minor difference from lists, the implications are quite big, and tuples are generally used for very different purposes compared to lists. For example, tuples can be used as the `key` for dictionaries (think sparse matrices). 

In [119]:
super_sparse_matrix = {(0, 0):1, (1000, 1000):1}  # a 1001*1001 matrix (indices 0 to 1000) with only two non-zero elements?
print super_sparse_matrix

{(0, 0): 1, (1000, 1000): 1}


In [120]:
word_matrix = {('apples', 'bananas'):1, ('apples', 'pears'):1}  # a matrix indexed by words
print word_matrix

{('apples', 'pears'): 1, ('apples', 'bananas'): 1}
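
Reading from such a dictionary needs a little care, since indexing a position that was never stored raises a `KeyError`. A minimal sketch, using the dictionary method `get()` (not covered above) to fall back to zero for missing positions:

```python
super_sparse_matrix = {(0, 0): 1, (1000, 1000): 1}

stored = super_sparse_matrix.get((0, 0), 0)    # key exists, so its value is returned
missing = super_sparse_matrix.get((3, 7), 0)   # key absent, so the default 0 is returned

print(stored)
print(missing)
print(len(super_sparse_matrix))  # only the non-zero entries take up space
```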


There are many more data structures commonly used in `python`, but lists, dictionaries, and tuples pretty much cover the basics (not to mention that these three constitute enough to fully represent the [JSON](http://json.org/) format in `python`).
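
As a rough sketch of the JSON connection, the standard library's `json` module can turn a nested structure into a JSON string and back. The `record` dictionary here is made up for the illustration.

```python
import json

record = {'name': 'Jongbin', 'siblings': ['Hanbyul', 'Hansol'], 'point': (1, 2)}

text = json.dumps(record, sort_keys=True)  # python object -> JSON string
restored = json.loads(text)                # JSON string -> python object

print(text)
print(restored['siblings'])
print(restored['point'])  # JSON has no tuple type, so the tuple comes back as a list
```

One thing to watch for: the round trip is not exact for tuples, which come back as lists.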

### List comprehension
List comprehension is `python`'s way of creating lists (and also other data structures) in a concise manner. One way to create a list of squares would be:

In [121]:
squares = []  # make an empty list
for x in range(10):
    squares.append(x**2)
    
print squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


However, the more 'pythonic' way to do this is to use list comprehension:

In [122]:
[x**2 for x in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

The command reads: 
> build a list out of the square of x (x\*\*2), for the values of x in `range(10)`

List comprehension can be used to build a list of tuples too.

In [123]:
[(x, y) for x in range(10) for y in range(10) if x*y == 21]

[(3, 7), (7, 3)]

This is equivalent to the nested `for` loop:

In [124]:
twenty_one = []
for x in range(10):
    for y in range(10):
        if x*y == 21:
            twenty_one.append((x, y))
            
print twenty_one

[(3, 7), (7, 3)]


Just be aware that if the item of the list is a tuple, it must be parenthesized.

In [125]:
[x, y for x in range(10) for y in range(10) if x*y == 21]  # this won't work

SyntaxError: invalid syntax (<ipython-input-125-c0af109a84cc>, line 1)

Let's enhance our list of vowels from the previous exercises, by appending the uppercase letters as well.

In [126]:
vowels = vowels + [V.upper() for V in vowels]
print vowels

['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U']


List comprehension can also be used to build dictionaries.

Try building a sparse matrix (represented by a dictionary), where the position is indexed by words $x$ and $y$, and
$$(x, y) = \begin{cases}
1 & \text{if} \quad y \quad \text{is longer than} \quad x \\
0 & \text{otherwise}
\end{cases}$$

In [127]:
words = ['anti', 'happy', 'evening', 'eagles', 'interior', 'zebra']
{(x,y):1 for x in words for y in words if len(y) > len(x)}

{('anti', 'eagles'): 1,
 ('anti', 'evening'): 1,
 ('anti', 'happy'): 1,
 ('anti', 'interior'): 1,
 ('anti', 'zebra'): 1,
 ('eagles', 'evening'): 1,
 ('eagles', 'interior'): 1,
 ('evening', 'interior'): 1,
 ('happy', 'eagles'): 1,
 ('happy', 'evening'): 1,
 ('happy', 'interior'): 1,
 ('zebra', 'eagles'): 1,
 ('zebra', 'evening'): 1,
 ('zebra', 'interior'): 1}

### Writing functions
Let's create a function to count the number of vowels in a given string:

In [128]:
def count_vowels(s):
    """Count the number of vowels in a string."""
    vowels = 'aeiouAEIOU'
    nvowels = [s.count(v) for v in vowels]  # count the number of each vowel in s
    return sum(nvowels)  # return the sum of elements in nvowels

# use the new function
count_vowels('Eels are delicious animals')

12

- the `def` keyword declares a function **def**inition, followed by a function name and the parenthesized list of formal parameters
- the statements that form the body of the function start at the next line, and must be indented
- the first statement of the function body can optionally be a string, also known as the [docstring](https://docs.python.org/2/tutorial/controlflow.html#tut-docstrings)
- many tools (such as `spyder`) use the docstring to give users meaningful information - so help yourself, make a habit of writing meaningful docstrings
- functions that don't finish with a `return` statement return `None` (a special `python` object)
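
A quick sketch of that last point, using a made-up function `shout` that prints something but has no `return` statement:

```python
def shout(s):
    """Print s in uppercase, but return nothing."""
    print(s.upper())  # no return statement anywhere in the body

result = shout('hello')  # prints HELLO
print(result is None)    # the call itself evaluated to None
```

This is a common source of confusion: printing a value inside a function is not the same as returning it.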

Functions can also return a tuple of values. For example, let's modify our `count_vowels` function to return the number of vowels along with a list specifying the count of *each* vowel.

In [131]:
def count_vowels(s):
    """
    Count the number of vowels in a string.
    
    returns: number of vowels, list containing number of appearance for each vowel 
    """
    vowels = 'aeiouAEIOU'
    nvowels = [s.count(v) for v in vowels]  # count the number of each vowel in s
    return sum(nvowels), list(zip(vowels, nvowels))  # return the sum and a zipped list
                              
count_vowels('Eels are delicious animals')

(12,
 [('a', 3),
  ('e', 3),
  ('i', 3),
  ('o', 1),
  ('u', 1),
  ('A', 0),
  ('E', 1),
  ('I', 0),
  ('O', 0),
  ('U', 0)])

A returned tuple can also be 'unpacked' into multiple variables.

In [136]:
total_count, individual_count = count_vowels('Eels are delicious animals')
print 'Found total', total_count, 'vowels, each vowel as follows:'
print individual_count

Found total 12 vowels, each vowel as follows:
[('a', 3), ('e', 3), ('i', 3), ('o', 1), ('u', 1), ('A', 0), ('E', 1), ('I', 0), ('O', 0), ('U', 0)]


### Functions with optional arguments
Let's further enhance the `count_vowels` function by letting the user specify
- which vowels to count ('aeiouAEIOU' by default)
- whether to return a single sum or a tuple of the sum and list (single sum by default)

This can be achieved by specifying default values in the function declaration.

In [137]:
def count_vowels(s, vowels = 'aeiouAEIOU', returnAll = False):
    """
    Count the number of vowels in a string.
    
    returns: number of vowels, list containing number of appearance for each vowel 
    """
    nvowels = [s.count(v) for v in vowels]  # count the number of each vowel in s
    if returnAll:
        return sum(nvowels), list(zip(vowels, nvowels))  # return the sum and a zipped list
    else:
        return sum(nvowels)  # return just the sum
                              
count_vowels('Eels are delicious animals')

12

In [138]:
count_vowels('Eels are delicious animals', vowels = 'aeiou')  # no caps

11

In [139]:
count_vowels('Eels are delicious animals', returnAll = True)  # give me EVERYTHING

(12,
 [('a', 3),
  ('e', 3),
  ('i', 3),
  ('o', 1),
  ('u', 1),
  ('A', 0),
  ('E', 1),
  ('I', 0),
  ('O', 0),
  ('U', 0)])
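
Optional arguments can be passed by keyword, as above, or by position. The sketch below repeats the function definition so that it runs on its own, and checks that the two calling styles give the same result.

```python
def count_vowels(s, vowels='aeiouAEIOU', returnAll=False):
    """Count the number of vowels in a string (same logic as above)."""
    nvowels = [s.count(v) for v in vowels]
    if returnAll:
        return sum(nvowels), list(zip(vowels, nvowels))
    return sum(nvowels)

text = 'Eels are delicious animals'

by_keyword = count_vowels(text, returnAll=True, vowels='ae')  # keywords can come in any order
by_position = count_vowels(text, 'ae', True)                  # positions must follow the definition

print(by_keyword)
print(by_keyword == by_position)
```

Keywords tend to be easier to read at the call site, especially for flags like `returnAll`.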

Be careful with having mutable defaults, though. The default values of a function's arguments are shared between subsequent calls, and this might cause problems if you're manipulating an argument's value within the function. For example,

In [140]:
def fun(n, stuff=[]):
    """Issues with mutable defaults."""
    stuff.append(n)
    return stuff

print fun(1)  # stuff is empty by default
print fun(2)  # stuff was manipulated, and is now [1] from the previous call!
print fun(3)  # even worse, stuff is now [1, 2] !!!

[1]
[1, 2]
[1, 2, 3]
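
One way to see what is going on: the default list is created once, when the `def` statement runs, and is stored on the function object. The sketch below peeks at it through the `__defaults__` attribute (available in Python 2.6 and later, and in Python 3).

```python
def fun(n, stuff=[]):
    """Same function as above, with a mutable default."""
    stuff.append(n)
    return stuff

before = fun.__defaults__[0][:]  # copy of the default list before any call
fun(1)
fun(2)
after = fun.__defaults__[0]      # the very same list object, now modified

print(before)
print(after)
```

The stored default starts out empty and ends up holding every value appended so far.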


This behavior isn't necessarily a problem, and it might even make sense in some contexts. However, it's definitely worth keeping in mind to avoid being surprised. If you want to prevent such behavior, one simple work-around is to set the default to `None`, and check if it is indeed `None`, before assigning the 'true' default, such as:

In [141]:
def fun(n, stuff=None):
    """Fix for mutable defaults."""
    if stuff is None:
        stuff = []
    stuff.append(n)
    return stuff

print fun(1)  # unspecified argument stuff is None, then set to []
print fun(2)  # unspecified argument stuff is None, then set to []
print fun(3)  # unspecified argument stuff is None, then set to []
print fun(3, [1,2])  # and we can always specify stuff if we need to!

[1]
[2]
[3]
[1, 2, 3]


You can definitely do more with function arguments, but that might be beyond the scope of today's workshop, and I've found that most of my function-writing needs are satisfied by roughly this much detail. Take a look at [The Python Tutorial](https://docs.python.org/2/tutorial/controlflow.html#more-on-defining-functions) if you're interested in more advanced argument handling.

### Lambda expressions
Anonymous one-liner functions can be created with the `lambda` keyword wherever function objects are required, but you don't want or need to define a full function. 

In [149]:
# sort the list of words by the number of vowels in each word,
# but without defining a separate count_vowels function
words.sort(key=lambda word: sum([word.count(v) for v in 'aeiouAEIOU']))
print words

['happy', 'anti', 'zebra', 'evening', 'eagles', 'interior']
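
To see that a `lambda` is just a function without a name, here is a sketch that sorts the same words by length twice: once with a named helper (`word_length`, made up for this example) and once with an equivalent `lambda`. It uses the built-in `sorted()`, which returns a new list instead of sorting in place.

```python
words = ['anti', 'happy', 'evening', 'eagles', 'interior', 'zebra']

def word_length(word):
    """Named function: return the length of a word."""
    return len(word)

by_def = sorted(words, key=word_length)                # pass the named function
by_lambda = sorted(words, key=lambda word: len(word))  # same logic, defined inline

print(by_lambda)
print(by_def == by_lambda)
```

When the logic is this short and used only once, the `lambda` saves a definition. For anything longer, a named function with a docstring is usually easier to read.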


## Mid-session exercise
Hopefully, we've reached this point by the end of the morning session, and now would be a perfect time for an exercise.

Problem: define a function to make word count? / N-Gram?
TODO: add link to data chunk