# Python for text analysis: Topic 1

Welcome to the course! You are now in the Jupyter Notebook environment, running the notebook for Topic 1.
Notebooks are pretty straightforward. Some tips:

* Cells in a notebook contain code or text. If you run a cell, it will either run the code or render the text.
* There are five ways to run a cell:
    1. Click the 'play' button next to the 'stop' and 'refresh' buttons in the toolbar.
    2. Alt + Enter runs the current cell and creates a new cell.
    3. Ctrl + Enter runs the current cell without creating a new cell. (Cmd + Enter on a Mac.)
    4. Shift + Enter runs the current cell and moves to the next one.
    5. Use the menu and select Cell/Run all.
* The instructions are written in Markdown. [Here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) is a nice Markdown cheatsheet if you want to write some more.
* Explore the menus for more options! You can even create a presentation using Notebooks.

Hint when you're writing Python code: press Tab to auto-complete your variable names!

**At the end of this week, you will be able to:**

- Assign values to variables.
- Find and use object methods.
- Use different built-in classes, such as *strings* and *lists*.
- Use Python's *slicing* and *indexing* notation.

**We'll also touch upon:**

- Normalization: making sure that a text is suitable for further processing.
- Tokenization: splitting up a text into sentences and words.

We'll cover these in more detail later on in the course, but there are some links to the literature if you're already keen to explore these topics.

**Notes**

You'll find that this week is a bit theoretical. That's because we need to agree on a common vocabulary before we can move on to more practical matters. It might be a bit overwhelming at first, but you'll get used to it pretty quickly!

There are no prerequisites for this week. That means that if you do find something you don't understand, it's our fault. Please contact us, and we'll help you out!

**If you want to learn more about these topics, you might find the following links useful:**

- Documentation: [The Python 3 documentation](https://docs.python.org/3/)
- Glossary: [Glossary](https://docs.python.org/3/glossary.html)
- Free e-book: [How to think like a computer scientist](http://www.openbookproject.net/thinkcs/python/english3e/)
- Free e-book: [A Byte of Python](https://python.swaroopch.com/)
- Useful tool: [Pythontutor](http://www.pythontutor.com/) -- Shows you line-by-line what your code does
- Community: [Learnpython](https://www.reddit.com/r/learnpython/) -- Reddit community for learners of Python
- Video: [Python names and values](http://nedbatchelder.com/text/names1.html) -- Note: this might be a bit too technical at this stage
- Reading: Chapter 1 of the Python-chapters in this repository.

## Getting started together

This notebook provides you with an overview of the basics of Python. You don't need to remember everything; we just want to give you a sense of what's available in the language. Also recall the 15-minute rule: if you're stuck for longer than 15 minutes, contact us and we'll help you. (In class, of course, you can ask us immediately.)

Let's first start with something really simple. Every programming language is traditionally introduced with a "Hello world" example. Please run the following cell.

In [1]:
print("Hello, world!")

Hello, world!


What happened here? Well, Python has a large set of built-in functions, and `print()` is one of them. When you use this function, `print()` outputs its *argument* to the screen. 'Argument' is a fancy word for "object you put in a function". In this case, the argument is the string "Hello, world!". And 'string' just means "a sequence of characters".

Instead of providing the string directly as an argument to the print function, we can also create a variable that refers to the string value "Hello, world!". When you pass this variable to the print function, you get the same result as before:

In [4]:
text = "Hello, world!"
print(text)

Hello, world!


Note that the variable name `text` is not part of Python. You could use any name you like, and the example would still work. Even if you change the name `text` to something silly like `pikachu` or `sniffles`. But it's standard practice to use clear variable names, so that your scripts will remain understandable (especially when they grow larger).

Variables are nice because you can re-use them as many times as you want:

In [None]:
# Let's print the variable again!
print(text)

Since it's kind of boring to always use the same string, we'll make use of another built-in function: `input()`. This takes user input and returns it as a string. Try it below:

In [None]:
text = input('Please enter some text.') # If this doesn't work, you may have Python 2 installed.
print(text)                             # Please install Python 3, or you'll be unable to use these notebooks.

Please note that this *overwrites* the original value that `text` was referring to!

Until we learn how to use functions and files, we will use the following setup to explore Python:

1. We assign some input value to a variable,
2. do something with that value,
3. and print the result.

The `print` function is nice because you can use it to see what's going on behind the scenes:

In [None]:
# Change the text
text = "I like apples"

# Print the text
print(text)

# Change the text
text = "I like oranges"

# And print it again
print(text)

And it's very flexible in that it can print as many things as you want at the same time, on one line. Just separate each of the things you want to print with a comma.

In [None]:
# Make the variable 'number' refer to the number 5.
number = 5

# Print a string and the value of number.
print("Here's the value of 'number':", number)

## Objects

Python is an object-oriented programming language. This means that it treats every piece of data like some kind of object that can be manipulated and passed around. Python has the following basic *types* of objects:

* **String**: for representing text.
* **Integer**: for representing whole numbers.
* **Float**: for representing numbers with decimals.
* **Tuple**: for representing immutable combinations of values.
* **List**: for representing ordered sequences of objects.
* **Set**: for representing unordered sets of objects.
* **Dictionary**: to represent mappings between objects.
* ..and **functions**: to manipulate objects, or to produce new objects given some input.

You can also read about the types in Python in the documentation [here](https://docs.python.org/3/library/stdtypes.html). Basically, what you need to know now is that each type has particular *affordances*: things that you can do with objects of that type. When something is of a numeric type (integer, float), the Python interpreter knows that you can perform mathematical operations with that object. You cannot use those operations with strings, because it doesn't make sense to take the square root of the string "Hello, world!".

It's the same with vehicles and food in real life. Anything that is of the type *vehicle* can be used to get around and possibly transport goods. But you cannot eat a vehicle. Anything that is of the type *food* is edible, but it's pretty difficult to use food for transportation. (Try biking home on a carrot.)
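
To make the idea of affordances concrete, here is a minimal sketch. The names `number`, `word`, and `outcome` are just illustrations, and `try`/`except` (which we haven't covered yet) is only used to catch the error so that the rest of the code keeps running.

```python
number = 16
word = "Hello, world!"

# Numbers afford mathematical operations. Raising a number to the
# power 0.5 is one way to take its square root.
print(number ** 0.5)  # 4.0

# Strings do not afford this: Python raises a TypeError.
# (try/except catches the error so that the cell keeps running.)
try:
    word ** 0.5
    outcome = "it worked"
except TypeError:
    outcome = "TypeError"
print(outcome)  # TypeError

# Strings have their own affordances instead, such as uppercasing.
print(word.upper())  # HELLO, WORLD!
```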

Here's an example of each type:

In [None]:
a_string       = 'test'
an_integer     = 4
a_float        = 3.14
a_tuple        = (2,5)
a_list         = [1,2,3,1,2,3,'a','b','c']
a_set          = {1,2,3,4,'apple'}
a_dict         = {'milk':2, 'cheese':1, 'pickles':45}
a_function     = print

We can use the `type` function to check object types. Let's use it for a selection of our newly defined objects:

In [None]:
type(a_function)

In [None]:
type(an_integer)

In [None]:
type(a_string)

### Strings

We'll now take a look at the different object types in Python, starting with strings. Let's define a few of them:

In [None]:
# Here are some strings:
string_1 = 'Hello, world!'
string_2 = 'I ❤️ cheese'      # If you are using Python 2, your computer will not like this.
string_3 = '1,2,3,4,5,6,7,8,9'
# Strings that span multiple lines must start and finish with three single or double quotes.
string_4 = """This string covers
multiple lines!"""
# You can also use double quotes:
string_5 = "This one\n does too!"

Strings can contain any character you can think of, including emoji! The cell above also shows different ways to enter a string in Python: using single/double quotes, or three single/double quotes. In addition, the 5th string also shows a hidden character. Here is what that line looks like when you print it:

In [None]:
print(string_5)

`\n` stands for 'new line', and produces a line break. Another common hidden character is `\t`, which produces a tab.
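
As a quick illustration, the sketch below combines both hidden characters in one made-up string (the variable name `table` is just an example). It also shows that each of them counts as a single character.

```python
# '\t' inserts a tab; '\n' starts a new line.
table = "word\tcount\napple\t3\npie\t1"
print(table)

# Each hidden character counts as one character, not two.
print(len("a\tb"))  # 3
```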

Let's explore some properties of strings.


**Strings are sequences of characters**

Python provides very useful functions to work with sequences. Since strings are represented as sequences of characters, these functions work for strings as well. Important for now are:

* Length
* Containment
* Indexing
* Iterating

**Length**: Python has a built-in function called `len()` that lets you compute the length of a sequence. It works like this:

In [None]:
number_of_characters = len(string_1)
print(number_of_characters) # Note that spaces count as characters too!

What happened above is that the `len` function was called with `string_1` as its argument. The Python interpreter then counted all characters in `string_1` and *returned* the result. (Programmers say 'return' to mean that the function produces some kind of result.) This result (the number of characters) was assigned to the variable `number_of_characters`, which then got printed by `print` (a function that doesn't return anything useful, but rather *displays* its argument on the screen).
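
The difference between returning and displaying can be checked directly. In this small sketch (the variable names are only examples), the value that `len` returns can be reused in a calculation, while the only thing `print` hands back is the special value `None`.

```python
# len() returns a value, so we can store it and keep working with it.
length = len("apple")
print(length + 1)  # 6

# print() displays its argument on the screen. What it returns is None,
# Python's way of saying 'nothing useful'.
result = print("apple")
print(result)  # None
```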

**Containment**: The Python keyword `in` allows you to check whether a string contains a particular substring. It returns `True` if the string contains the relevant substring, and `False` if it doesn't. These two values (`True` and `False`) are called *boolean values*, or *booleans* for short. We'll talk about them in more detail later. Here are some examples to try:

In [None]:
"fun" in "function"

In [None]:
"I" in "Team"

In [None]:
"App" in "apple" # Capitals are not the same as lowercase characters!

**Indexing**: Python provides access to the characters in each string through indexing. The table below shows all indexes for the string "Sandwiches are yummy". 

| Positive index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|----------------|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|----|----|
| Characters     | S | a | n | d | w | i | c | h | e | s |    | a  |  r | e  |    |  y |  u |  m |  m |  y |
| Negative index |-20|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10|-9|-8|-7|-6|-5|-4|-3|-2|-1|

You can access the characters of a string using the following notation:

```python
my_string = "Sandwiches are yummy"
print(my_string[1])   # This will print 'a'.
print(my_string[1:4]) # This will print 'and'
print(my_string[1:4:1]) # This will also print 'and', but is more explicit about what's happening.
print(my_string[-1]) # This will print 'y'
```

So how does this notation work?

```python
my_string[i] # Get the character at index i.
my_string[start:end] # Get the substring starting at 'start' and ending *before* 'end'.
my_string[start:end:stepsize] # Get all characters starting from 'start', ending before 'end', 
                              # with a specific step size.
```

You can also leave parts out:

```python
my_string[:i] # Get the substring starting at index 0 and ending just before i.
my_string[i:] # Get the substring starting at i and running all the way to the end.
my_string[::i] # Get a string going from start to end with step size i.
```

You can also use a negative step size. `my_string[::-1]` is the idiomatic way to reverse a string.
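
Here is a short sketch of negative steps and negative indexes at work, using a different example word so that the test below stays a surprise.

```python
word = "stressed"

# A step size of -1 walks through the string backwards.
print(word[::-1])  # desserts

# Negative indexes count from the end, so this gets the last three characters.
print(word[-3:])   # sed

# Every second character, starting from the end.
print(word[::-2])  # dset
```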

Let's have a small test. Do you know what the following statements will print?

In [None]:
my_string = "Sandwiches are yummy"
print(my_string[0])

In [None]:
print(my_string[11:14])

In [None]:
print(my_string[15:])

In [None]:
print(my_string[:9])

In [None]:
print('cow'[::2])

**Iterating**: we'll cover this in much more detail later on, but all sequences are *iterable*, which means you can loop over them. In other words, you can do something like this:

In [None]:
for char in "word": # For every character in the string "word"..
    print(char)     # Print that character.

That is: 

1. Take all the elements (in this case: characters) in the sequence, 
2. assign them one-by-one to a variable (in this case: `char`, but you can use any name), 
3. do something with that variable (in this case: print it), 
4. move to the next element and go to (2.) until there are no more elements left.

There are some other things you can do as well, like:

In [None]:
for char in reversed("word"): # For every character in the string "word" (but then reversed)..
    print(char)               # Print that character.

In [None]:
for char in sorted("word"): # For every character in the string "word" (but then sorted)..
    print(char)             # Print that character.

**Strings have useful methods**

A method is a function that is associated with an object. For example, the string method `lower()` turns a string into all lowercase characters, and `upper()` makes strings uppercase. You can call these methods using dot notation:

In [None]:
print(string_1)         # The original string.
print(string_1.lower()) # Lowercased.
print(string_1.upper()) # Uppercased.

So how do you find out what kind of methods an object has? There are two options:

1. Read the documentation. See [here](https://docs.python.org/3.5/library/stdtypes.html#string-methods) for the string methods.
2. Use the `dir()`-function, which returns a list of method names (as well as attributes of the object). If you want to know what a specific method does, use the `help()`-function.

Run the code below to see what the output of `dir()` looks like. 

The method names that start and end with double underscores ('dunder methods') are Python-internal. They are what makes general methods like `len` work (`len` internally calls the `string.__len__()` function), and cause Python to know what to do when you, for example, use a for-loop with a string.
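
The link between `len` and `__len__` can be verified with a small sketch. The names `greeting` and `public_names` are just illustrations; the loop skips every name that starts with a double underscore, so that only the everyday methods remain.

```python
greeting = "Hello, world!"

# len() asks the object for its length through its __len__ method,
# so both lines print the same number.
print(len(greeting))       # 13
print(greeting.__len__())  # 13

# Collect only the names that do not start with a double underscore.
public_names = []
for name in dir(greeting):
    if not name.startswith('__'):
        public_names.append(name)

# Show the first five of them.
print(public_names[:5])
```

In everyday code you would always write `len(greeting)`; calling the dunder method directly is only done here to show what happens behind the scenes.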

The other method names indicate common and useful methods. 

In [None]:
# Run this cell to see all methods for string_1
dir(string_1)

If you'd like to know what one of these methods does, you can just use `help` (or look it up online):

In [None]:
help(string_1.upper)

In [None]:
help(string_1.split)

It's important to note that string methods only *return* the result. They do not change the string itself, because strings are immutable.

In [None]:
x = 'test'    # Defining x.
y = x.upper() # Using x.upper(), calling the result y.
print(y)      # Print y.
print(x)      # Print x. It is unchanged.

### Lists

Lists are very useful to store an ordered sequence of objects. You can define a list like this:

In [None]:
my_list = [1,2,3,'a','b','c']

And lists share many properties with strings. **Check**: can you guess what the following cells will print?

In [None]:
# Length:
print(len(my_list))

In [None]:
# Containment:
print(2 in my_list)
print(4 in my_list)

In [None]:
# Indexing:
print(my_list[2])

In [None]:
# Iterating:
print("Let's loop over the list!")
for item in my_list:
    print(item)

#### List methods

Lists also have several useful methods. The three most commonly used methods are `append`, `extend`, and `count`. Here is how they work.

In [None]:
my_list = []
# Add the number 1 to the list.
my_list.append(1)
print(my_list)

In [None]:
second_list = [2,3,4]
# Extend my_list with the items in the second list.
my_list.extend(second_list)
print(my_list)

In [None]:
# Count the number of times 'a' occurs in the list.
my_words = ['a', 'dog', 'and', 'a', 'panda']
my_words.count('a')
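
One thing to watch out for: unlike string methods, `append` and `extend` change the list itself and return `None`. The sketch below (with made-up variable names) shows what that means in practice.

```python
letters = ['a', 'b']

# append() and extend() modify the list in place...
letters.append('c')
letters.extend(['d', 'e'])
print(letters)  # ['a', 'b', 'c', 'd', 'e']

# ...and they return None, so don't assign their result back to the variable.
result = letters.append('f')
print(result)   # None
print(letters)  # ['a', 'b', 'c', 'd', 'e', 'f']
```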

**Multiple assignment**

All sequences support multiple assignment, which means that you can do stuff like this:

In [None]:
# Number of variables is equal to the number of items in the list.
word_1, word_2 = ['apple', 'pie']
print(word_1)
print(word_2)

In [None]:
# Number of items in the list is bigger than the number of variables.
first_word, *rest = ['apple','pie','is','delicious']
print(first_word)
print(rest)

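You can do this with any number of variables, as long as the number of variables *without* a star is not larger than the number of items in the list. (If none of the variables has a star, the two numbers have to match exactly.)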
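
A few more cases, as a sketch with made-up variable names: the starred name collects whatever is left over, which may be nothing at all, and multiple assignment also works for other sequences, such as strings.

```python
# The starred name collects whatever is left over, as a list.
first, second, *rest = ['apple', 'pie', 'is', 'delicious']
print(first, second, rest)  # apple pie ['is', 'delicious']

# If nothing is left over, the starred name becomes an empty list.
only, *nothing = ['apple']
print(only, nothing)        # apple []

# Multiple assignment works for other sequences too, such as strings.
a, b, c = 'cow'
print(a, b, c)              # c o w
```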

### Lists ❤️ strings (and vice versa)

Lists and strings are best friends. During this course, you will find yourself moving between strings and lists quite often.

One of the first steps in any Natural Language Processing task is usually to *tokenize* the text you are working with. Tokenization is the act of splitting text into separate *tokens* (words or punctuation). If you tokenize a sentence, you are basically turning a string into a list of strings.

The most naive way to tokenize a sentence is to just use the `split()`-method that is built into Python:

In [None]:
sentence = "I like tokenization"
tokens = sentence.split()
print(tokens)

This works very well for short bits of text. It's useful to represent texts like this because we can start computing statistics about them. E.g. how many times each word occurs in the text:

In [None]:
for token in tokens:
    print(token,'\t',tokens.count(token))

This is nice because we now know what the actual words in the text are. Let's illustrate the difference between strings and lists with a small example.

In [None]:
# How many times does the sequence 'token' occur in the sentence?
print(sentence.count('token'))
# How many times does the string 'token' occur in the list of tokens?
print(tokens.count('token'))

You can turn a list of tokens back into a string by using the `join()`-method.

In [None]:
# Use a space to join the tokens.
joined_tokens = ' '.join(tokens)
print(joined_tokens)

This method joins the list with whatever string precedes it. So we could also do this:

In [None]:
print('hamster'.join(tokens))

The split-method doesn't work too well for longer pieces of text. Let's take the Wikipedia definition for "Language" and try to tokenize it:

In [None]:
language = "Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system. The scientific study of language is called linguistics."
tokens = language.split()
print(tokens)

How many times does the word 'communication' occur in the text?

In [None]:
# How many times does the sequence 'communication' occur in the sentence?
print(language.count('communication'))
# How many times does the string 'communication' occur in the list of tokens?
print(tokens.count('communication'))

..whoops! The word 'communication' doesn't seem to be in the list of tokens! Why?
 
.
 
. 
 
.
  
.
 
.

.

.

.

.

Punctuation 😰 

One way to combat this is to *preprocess* the data. Common preprocessing steps are:

1. Casefolding/lowercasing: removes distinction between 'Language' and 'language'.
2. Punctuation removal: removes distinction between 'communication,' and 'communication'.
3. Stemming/Lemmatization: removes distinction between 'is' and 'was'.

This is a destructive process: you lose some information that might be useful in further analysis. Later on we'll use specialized NLP modules/toolkits that keep all the information from the input. 

In [None]:
# Replace commas with 0-length strings. I.e. remove them.
language_nocommas = language.replace(',','')
print(language_nocommas)

In [None]:
# Replace periods with 0-length strings. I.e. remove them.
language_noperiods = language_nocommas.replace('.','')
print(language_noperiods)

In [None]:
# Lowercase the sentence:
language_lowercased = language_noperiods.lower()
print(language_lowercased)

In [None]:
# Tokenize the text by splitting it:
tokenized_normalized = language_lowercased.split()
print(tokenized_normalized)

**Reflection**: can you think of a non-destructive way to deal with commas and periods?

When we look at the words in the text, they don't contain punctuation anymore, and it's much easier to count them.

In [None]:
# Here's the result:
print(tokenized_normalized)
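
To see the effect of these steps side by side, here is a compact sketch on a shortened version of the text. The variable names are only examples, and the `replace` and `lower` calls are chained on one line, which does the same thing as the separate steps above.

```python
text = "Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so."

# Without normalization, the comma sticks to the token 'communication,'.
raw_tokens = text.split()
print(raw_tokens.count('communication'))    # 0

# Remove commas and periods, lowercase the text, and split it.
clean_text = text.replace(',', '').replace('.', '').lower()
clean_tokens = clean_text.split()
print(clean_tokens.count('communication'))  # 1
print(clean_tokens.count('ability'))        # 2
```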

### Further reading

These papers discuss preprocessing and normalization. 

* [Assessing the Consequences of Text Preprocessing Decisions](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2849145) (Denny & Spirling 2016). This paper is a bit long, but it provides a nice discussion of common preprocessing steps and their potential effects.
* [What to do about bad language on the internet](http://www.cc.gatech.edu/~jeisenst/papers/naacl2013-badlanguage.pdf) (Eisenstein 2013). This is a quick read that I recommend everyone to at least look through.

### Building a vocabulary

So now we have a list of all the words in the document. That is very useful if you want to find out how many times words occur in the text. But what if we just want to have a vocabulary, containing all the words in the document? There are two things we can do here:

1. Write a small algorithm.
2. Use sets.

Let's do both! Here is a small algorithm in Python:

In [None]:
# This list will contain all the words in the text, just once.
vocabulary = []
# Loop over the words in the text. For each word..
for word in tokenized_normalized:
    # If the word is not yet in the vocabulary..
    if word not in vocabulary:
        # Add it.
        vocabulary.append(word)

# Print the results:
print(vocabulary)

But we can also just turn the tokenized sentence into a *set*. Sets are unordered collections of unique objects. Any duplicates are instantly removed. Below is an example showing how to turn lists into sets.

In [None]:
vocabulary = set(tokenized_normalized)
print(vocabulary)

If you wanted to, you could also do the reverse. Note that you don't get the duplicates back by doing this.

In [None]:
vocabulary = list(vocabulary)
print(vocabulary)

This form of explicit type conversion is sometimes called *type casting*. In later assignments, you will often find yourself converting between strings and integers.
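
As a small preview of converting between strings and integers, here is a sketch using the built-in functions `int()` and `str()` (the variable names are made up).

```python
# Text that looks like a number is still a string...
year_text = '1984'
print(type(year_text))  # <class 'str'>

# ...until we convert it explicitly with int().
year = int(year_text)
print(year + 1)         # 1985

# And back again: str() turns a number into text.
print('The year is ' + str(year))  # The year is 1984
```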

### Sets

Sets are unordered collections of unique objects. They might already be familiar to you if you've ever had a course on logic, set theory, or formal semantics. Or perhaps you've seen one of these [Venn diagrams](https://en.wikipedia.org/wiki/Venn_diagram):

![venn diagram](./images/venn.png)

This one, from Wikipedia, shows the "uppercase letter glyphs [that] are shared by the Greek, Latin and Russian alphabets." The parts of the circles that overlap are called *intersections*. The parts that don't overlap constitute the *difference* between the circles. If two circles don't intersect at all, we say that they are *disjoint*.

In [None]:
# Two ways to define a set:
abc = {'a','b','c'}
cde = set(['c','d','e'])

In [None]:
# Intersection of the two:
intersection = abc & cde
print(intersection)

In [None]:
# differences:
print(abc - cde)
print(cde - abc)

In [None]:
# Everything together
union = abc | cde
print(union)

In [None]:
# Alternatively:
intersection = abc.intersection(cde)
difference = abc.difference(cde)
union = abc.union(cde)

print(intersection)
print(difference)
print(union)
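
We also mentioned *disjoint* sets. Here is a minimal sketch of how to check for that, using the set method `isdisjoint()` and some made-up example sets.

```python
vowels = {'a', 'e', 'i', 'o', 'u'}
digits = {'1', '2', '3'}
word_letters = set('audio')

# Two sets are disjoint if they have no elements in common.
print(vowels.isdisjoint(digits))        # True
print(vowels.isdisjoint(word_letters))  # False

# An empty intersection tells you the same thing.
print(vowels & digits)                  # set()
```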

**Finding text-specific words**

Now that we know all this, we can write a bit of code to see for two pieces of text which words are text-specific. We'll do this by tokenizing both texts, and looking at the difference between the vocabularies of the texts. Later on in the course, we will take a more advanced approach by doing a frequency analysis of the words.

To make our lives a bit easier, we will define a function that tokenizes the texts for us. The tokenization code is just copied from the section on lists. The code is wrapped in a function called `tokenize_text` so that we don't have to write almost the same code twice. Rather, we give the whole procedure a name (`tokenize_text`) and then use that name to execute the code for the relevant text.

You don't have to understand how functions work just yet. We'll start practicing with functions next week. This is just a little preview!

In [None]:
# This is a function that takes one argument: a text-string.*
# That variable can have any name, but within the function it will be referred to as 'text'.
def tokenize_text(text):
    """
    Function that takes some text as its input, and tokenizes it 
    such that the result is a list of lowercase words.
    """
    # Replace commas with 0-length strings. I.e. remove them.
    text_nocommas = text.replace(',','')

    # Replace periods with 0-length strings. I.e. remove them.
    text_noperiods = text_nocommas.replace('.','')

    # Lowercase the sentence:
    text_lowercased = text_noperiods.lower()

    # Tokenize the text by splitting it:
    tokenized_normalized = text_lowercased.split()

    # Return the result, so that we can keep using it outside the function.
    # If you forget this statement, the function will give 'None' back as a result.
    return tokenized_normalized

# * In Python, people leave the type of the function argument implicit, but if you enter anything
# other than a string, the function will give an error.

Before we do anything else, let's make sure that we understand what this specific function does. 

In [None]:
language_text = "Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system. The scientific study of language is called linguistics."

tokenized_language_text = tokenize_text(language_text)
print(tokenized_language_text)

OK, it does the same thing as before. Let's call the same function again on a different piece of text (thanks, Wikipedia!).

In [None]:
numeracy_text = "Numeracy is the ability to reason and to apply simple numerical concepts.[1] Basic numeracy skills consist of comprehending fundamental arithmetics like addition, subtraction, multiplication, and division. For example, if one can understand simple mathematical equations such as, 2 + 2 = 4, then one would be considered possessing at least basic numeric knowledge. Substantial aspects of numeracy also include number sense, operation sense, computation, measurement, geometry, probability and statistics. A numerically literate person can manage and respond to the mathematical demands of life.[2]"

tokenized_numeracy_text = tokenize_text(numeracy_text)
print(tokenized_numeracy_text)

Now what are the words that are specific to the language text?

In [None]:
# Let's turn the lists of tokens into sets.
LT = set(tokenized_language_text)
NT = set(tokenized_numeracy_text)

In [None]:
# CLASS TEST: what do we write here to see the language-specific words?


In [None]:
# CLASS TEST: what do we write here to see the numeracy-specific words?


**Reflection**: Is there anything surprising/unexpected about these words? How do you think we can fix this?
(Hint: Think about the steps we took to preprocess the data.)

In [None]:
# CLASS TEST: what do we write here to see the overlap?


### Numbers

We've postponed numbers for a while, but it's time to tackle them as well. Let's first start with [mathematical operators](https://docs.python.org/3.5/library/stdtypes.html#numeric-types-int-float-complex). These are defined for both **integers** and **floats** (numeric types):

| Operation | Result |
|-----------|--------|
| `x + y` | sum of x and y |
| `x - y` | difference of x and y |
| `x * y` | product of x and y |
| `x / y` | quotient of x and y |
| `x // y` | floored quotient of x and y |
| `x % y` | remainder of x / y |
| `-x` | x negated |
| `+x` | x unchanged |
| `x ** y` | x to the power y |
Here's an example:

In [None]:
# Prompt the user for two numbers:
first_number = input('Please enter any number you like\n')
second_number = input('Please enter another number\n')

# Any input is automatically considered a string.
# Let's change the type of the numbers to integer.
first_number = int(first_number)
second_number = int(second_number)

# Compute the sum of the two numbers.
total = first_number + second_number

# And print the result.
print("The sum of these numbers is", total)

**Question:** Can you predict what the following cells will print?

In [None]:
# difference: 
print(7-2)

In [None]:
# quotient:
print(7/2)

In [None]:
# floored quotient:
print(7//2)

In [None]:
# remainder of dividing 7 by 2:
print(7 % 2)
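
The floored quotient and the remainder belong together, as this sketch with two other numbers shows. `divmod()` is a built-in function that computes both at once; the variable names are just examples.

```python
x = 17
y = 5

quotient = x // y   # How many times y fits into x completely.
remainder = x % y   # What is left over after that.
print(quotient, remainder)            # 3 2

# Together they reconstruct the original number.
print(quotient * y + remainder == x)  # True

# divmod() returns both values at once.
print(divmod(x, y))                   # (3, 2)
```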

Two of the number operators (`+` and `*`) also work for other types of objects. Can you predict what these do?

In [None]:
# Addition with lists:
[1] + [2]

In [None]:
# Multiplication:
[1] * 8

In [None]:
# Combined:
[1] * 4 + [2] * 4

In [None]:
# Addition with strings:
'a' + 'b'

In [None]:
# Multiplication:
'a' * 4

In [None]:
# Combined:
'a' * 4 + 'b' * 4

#### Useful facts about numbers

In [None]:
# These are equivalent (and the same holds for *, -, **):
x = 4
x = x + 4    # Repetition of x.
print(x)

x = 4
x += 4       # No repetition of x.
print(x)

In [None]:
# If you divide with /, the result is always a float, even if both numbers are integers!
# This is known as implicit type conversion, or 'coercion'.
x = 5
print(type(x))
x = x/2
print(type(x))

In [None]:
# You can use an integer as an index, but you cannot use a float.

# Indexing with an integer:
['a','b','c'][1]

In [None]:
# Indexing with a float (this raises a TypeError):
['a','b','c'][1.5]

In [None]:
# You cannot coerce an integer into a string:
# Python will not do this for you, so the last line raises a TypeError.
num_apples = 6*5
x = 'I have ' + num_apples + ' apples.'

In [None]:
# Instead, use str():
x = 'I have ' + str(num_apples) + ' apples.'
print(x)

In [None]:
# If you want to do something five times, you can use a loop for that:
for i in range(5):
    print('repeat')

### Dictionaries

The dictionary is one of the most powerful classes in Python. You can use dictionaries to keep your data organized. They look like this:

In [None]:
shopping_list = {'milk': 3, 'eggs': 6, 'spam': 327}
print(shopping_list)

This shopping list has three **keys**: *milk*, *eggs*, and *spam*. The integers are called the **values** of the dictionary. You can access both through the dictionary methods `keys()` and `values()`:

In [None]:
shopping_list.keys()

In [None]:
shopping_list.values()

Together, the keys and values make up the *items* in a dictionary:

In [None]:
shopping_list.items()

The items are represented as *tuples* of keys and values. We won't discuss tuples in a lot of detail now; basically they are like lists, but they are *immutable*, meaning that their contents cannot be changed. This is a very useful property, as we will see later in the course.
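
Because each item is a tuple of two elements, you can combine `items()` with multiple assignment in a loop. A minimal sketch, where the names `product`, `amount`, and `item` are just examples:

```python
shopping_list = {'milk': 3, 'eggs': 6, 'spam': 327}

# Each item is a (key, value) tuple, which we can unpack into two names.
for product, amount in shopping_list.items():
    print(product, amount)

# You can index a tuple just like a list, but you cannot change its contents.
item = ('milk', 3)
print(item[0])  # milk
print(item[1])  # 3
```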

In [None]:
# Get data from the dictionary:
print(shopping_list['milk'])

In [None]:
# Equivalent ways to initialize a dictionary:
empty_dictionary = {}
empty_dictionary = dict()

In [None]:
# Add entry to the dictionary:
empty_dictionary['milk'] = 3

In [None]:
print(empty_dictionary)

Generally speaking, people tend to use dictionaries in two ways:

1. As an index, to quickly look things up. (See the `shopping_list` example)
2. To stand for some kind of data point. For example:

In [None]:
# This is another (very readable!) way to initialize a dictionary:
human = dict(name='Bob', 
             age=24, 
             occupation='Michael Jackson impersonator')

In the second case, it is very common to have a list of dictionaries. Such a list corresponds more or less to a spreadsheet or a database, where each dictionary constitutes one row. You can think of the keys as the column headers, and the values as the cell values. We will explore this metaphor further next week.
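
As a small preview of that metaphor, here is a sketch of a list of dictionaries. The second person is made up for the sake of the example, as are the variable names.

```python
# Each dictionary is one 'row'; the keys act as the column headers.
people = [
    dict(name='Bob', age=24, occupation='Michael Jackson impersonator'),
    dict(name='Alice', age=31, occupation='linguist'),
]

# Loop over the rows and look up the 'cells' by their column name.
for person in people:
    print(person['name'], 'is', person['age'], 'years old')

# Collect a single 'column' in a list.
names = []
for person in people:
    names.append(person['name'])
print(names)  # ['Bob', 'Alice']
```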

## Exercises

Phew! That was a lot of theory to take in! Below are some assignments to play around with Python and to get a feeling for what you can do with all the objects in Python. There will be a little more theory here and there, but now you can work at your own pace, rather than mostly listening to us talk about Python. If you get stuck, please remember the 15-minute rule. If you are stuck for longer than 15 minutes: *contact us*. We're there to help.

Tips:

* Try to do the exercises together with someone. This becomes more relevant as the course progresses.
* Remember the `dir` and `help` functions. They tell you which methods an object has, and how to use those methods.
* The [library reference](https://docs.python.org/3/library/index.html) provides a good overview of the language. You can also find all built-in functions and methods there.
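As a quick reminder of how these two functions can be used, here is a small sketch. The variable names `example` and `methods` are only chosen for illustration.

```python
example = 'spam'

# dir() returns a list with the names of all attributes and methods.
methods = dir(example)
print('upper' in methods)  # True

# help() prints the documentation for a specific method.
help(example.upper)
```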

### String formatting and `repr`

As we've seen above, it's possible to make strings that span multiple lines. Here are two ways to do so:

In [5]:
multiline_text_1 = """This is a multiline text, so it is enclosed by triple quotes.
Pretty cool stuff!
I always wanted to type more than one line, so today is my lucky day!"""
multiline_text_2 = "This is a multiline text, so it is enclosed by triple quotes.\nPretty cool stuff!\nI always wanted to type more than one line, so today is my lucky day!"

In [6]:
# With the double equals sign, we can show that these are equivalent:
print(multiline_text_1 == multiline_text_2)

True


So from this we can conclude that `multiline_text_1` has the same hidden characters (`\n`, which stands for 'new line') as `multiline_text_2`. You can show that this is indeed true by using the built-in `repr` function, which gives you a printable *repr*esentation of an object. For strings, this representation shows hidden characters such as `\n` explicitly.

In [7]:
# Show the internal representation of multiline_text_1.
print(repr(multiline_text_1))

'This is a multiline text, so it is enclosed by triple quotes.\nPretty cool stuff!\nI always wanted to type more than one line, so today is my lucky day!'


**Question** Which string methods could you use to turn a multiline string into a paragraph (without a newline after each line)?

In [9]:
print(multiline_text_1)

This is a multiline text, so it is enclosed by triple quotes.
Pretty cool stuff!
I always wanted to type more than one line, so today is my lucky day!


**Question** Which string method should you use to turn `multiline_text_1` into a list of sentences? You can use the code box below to experiment.

### Variable assignment & using pythontutor.com

The website http://pythontutor.com/ is an excellent tool for showing you how Python works 'under the hood'. We'll look at a couple of code samples to make sense of how Python behaves.

**Code snippet 1**

Here's a small snippet where we first assign some values to the variable `x`, and then also to `y`. Take a look at the interactive visualization [here](http://pythontutor.com/visualize.html#code=x%20%3D%201%0Ax%20%3D%202%0Ax%20%3D%203%0Ay%20%3D%203&cumulative=false&curInstr=0&heapPrimitives=false&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) to see what happens.

```python
x = 1
x = 2
x = 3
y = 3
```

**Code snippet 2**

Here's a small snippet where `x` refers to a list, and `y` is made to *refer to the same list*. Click [here](http://pythontutor.com/visualize.html#code=x%20%3D%20%5B1,2,3%5D%0Ax.append(4%29%0Ay%20%3D%20x%0Ay.append(5%29%0Aprint(x%29&cumulative=false&curInstr=5&heapPrimitives=false&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) to see the visualization of this bit of code.

```python
x = [1,2,3]
x.append(4)
y = x
y.append(5)
print(x)
```

Do you understand why this code works the way it does?

**Code snippet 3**

If you do the same with integers, Python behaves differently. Click [here](http://pythontutor.com/visualize.html#code=x%20%3D%201%0Ax%20%2B%3D%201%0Ay%20%3D%20x%0Ay%20%2B%3D%201%0Aprint(x%29%0Aprint(y%29&cumulative=false&curInstr=5&heapPrimitives=false&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) to see the visualization of the code below.

```python
x = 1
x += 1
y = x
y += 1
print(x)
print(y)
```

The difference between snippet 2 and snippet 3 is that lists are *mutable objects*. They are containers whose contents can be changed, but changing their contents doesn't change their identity. Integers are *immutable*: they never change. Instead, `x += 1` makes `x` refer to a *different number*, one that is exactly 1 greater than the old one.
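One way to check this yourself is the `is` operator, which tests whether two names refer to the very same object. The sketch below is only an illustration; the names `same_list` and `same_number` are made up for this example.

```python
# Lists: y refers to the same list as x, so a change via y is visible via x.
x = [1, 2, 3]
y = x
y.append(4)
same_list = x is y
print(x)          # [1, 2, 3, 4]
print(same_list)  # True

# Integers: b += 1 makes b refer to a different number; a is untouched.
a = 1
b = a
b += 1
same_number = a is b
print(a, b)         # 1 2
print(same_number)  # False
```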

**Code snippet 4**

The way to prevent the behavior in the second code snippet is to copy the list. Try to visualize this code [here](http://pythontutor.com/visualize.html#mode=edit).

```python
x = [1,2,3,4]
y = x.copy() # or, equivalently, y = x[:]
y.append(5)
```
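If you prefer to check the result in the notebook itself, the following sketch prints both lists after the copy. It is the same snippet with a few added `print` calls, and the name `same_object` is only there for illustration.

```python
x = [1, 2, 3, 4]
y = x.copy()
y.append(5)

# x and y are now two separate lists, so only y has changed.
same_object = x is y
print(x)            # [1, 2, 3, 4]
print(y)            # [1, 2, 3, 4, 5]
print(same_object)  # False
```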

**Exercise** Play around with the different operations that you've seen in this notebook. Can you predict what happens?


**(Advanced) Code snippet 5**

For future reference: the solution above only works for simple lists. The `copy` method makes a *shallow copy*: it creates a new outer list, but any lists nested inside it are still shared with the original. That isn't enough for nested lists. See the issue [here](http://pythontutor.com/visualize.html#code=x%20%3D%20%5B%5B1,2,3%5D,%5B4,5,6%5D%5D%0Ay%20%3D%20x.copy(%29%0Ay%5B1%5D.append(7%29&cumulative=false&curInstr=3&heapPrimitives=false&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false).

```python
x = [[1,2,3],[4,5,6]]
y = x.copy()
# This still modifies x.
y[1].append(7) 
```

For lists containing other lists, you need the `deepcopy` function from the `copy` module. See it in action [here](http://pythontutor.com/visualize.html#code=import%20copy%0A%0Ax%20%3D%20%5B%5B1,2,3%5D,%5B4,5,6%5D%5D%0Ay%20%3D%20copy.deepcopy(x%29%0Ay%5B1%5D.append(7%29&cumulative=false&curInstr=3&heapPrimitives=false&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false).

```python
import copy

x = [[1,2,3],[4,5,6]]
y = copy.deepcopy(x)
# Problem solved!
y[1].append(7)
```
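To compare the two kinds of copy side by side, here is a small sketch that uses the `is` operator to check whether the inner lists are shared. The names `shallow` and `deep` are chosen for this illustration only.

```python
import copy

x = [[1, 2, 3], [4, 5, 6]]
shallow = x.copy()       # new outer list, same inner lists
deep = copy.deepcopy(x)  # new outer list and new inner lists

print(shallow[1] is x[1])  # True: the inner list is shared
print(deep[1] is x[1])     # False: the inner list was copied too

# Changing the deep copy leaves x alone.
deep[1].append(7)
print(x)     # [[1, 2, 3], [4, 5, 6]]
print(deep)  # [[1, 2, 3], [4, 5, 6, 7]]
```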



### Get creative

Here's some space to do whatever you want! Have fun :)