# Introduction to Python, Jupyter Notebooks and Working with Text

---
---

## The Programming Language Python
Python is a **general-purpose programming language** that can be used in many different ways; for example, it can be used to analyse data, create websites and automate boring stuff. 

Python is **excellent for beginners** because its basic syntax is simple and uncluttered with punctuation. Coding in Python often feels like writing in natural language. Python is a very popular language and has a large, friendly community of users around the world, so there are many tutorials and helpful experts out there to help you get started.

![Logo of the Python programming language](https://www.python.org/static/img/python-logo.png "Logo of the Python programming language")

![Two women coding together and looking at a laptop](assets/two-women-computer.png "Two women coding together and looking at a laptop")

---
---

## Jupyter Notebooks

This 'document' you are reading right now is a Jupyter Notebook. It allows you to combine explanatory **text** and Python **code** that executes to produce the results you can see on the same page. You can also create and display visualisations from your data in the same document.

Notebooks are particularly useful for *exploring* your data at an early stage and *documenting* exactly what steps you have taken (and why) to get to your results. This documentation is extremely important to record what you did so that others can reproduce your work... and because otherwise you are guaranteed to forget what you did in the future.

![Logo of Jupyter Notebooks](https://jupyter.org/assets/logos/rectanglelogo-greytext-orangebody-greymoons.svg "Logo of Jupyter Notebooks")

For a more in-depth tutorial on getting started with Jupyter Notebooks try this [Jupyter Notebook for Beginners Tutorial](https://towardsdatascience.com/jupyter-notebook-for-beginners-a-tutorial-f55b57c23ada).

### Notebook Basics

#### Text cells

The box this text is written in is called a *cell*. It is a *text cell* marked up in a very simple language called 'Markdown'. Here is a useful [Markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). You can edit and then run cells to produce a result. Running this text cell produces formatted text.

---
> **EXERCISE**: Double-click on this cell now to edit it. Run the cell with the keyboard shortcut `Ctrl+Enter`, or by clicking the Run button in the toolbar at the top.

![Click the Run button to run a cell](https://problemsolvingwithpython.com/02-Jupyter-Notebooks/images/run_cell.png "Click the Run button to run a cell")

---

#### Code cells

The other main kind of cell is a *code cell*. The cell immediately below this one is a code cell. Running a code cell runs the code in the cell (marked by the **`In [1]:`**) and produces a result (marked by the **`Out [1]:`**). We say the code is **evaluated**.

---
> **EXERCISE**: Try running the code cell below to evaluate the Python code. Then change the sum to try and get a different result.

---

In [None]:
3 + 4

From now on, always try running every code cell you see.

In [None]:
# This is a comment in a code cell. Comments start with a # symbol. They are ignored and do not do anything.

**Important!**

When running code cells you need to run them in order, from top to bottom of the notebook. This is because cells rely on the results of other cells. Without those earlier results being available you will get an error.

To run all the cells in a notebook at once, and in order, choose Cell > Run All from the menu above. To clear all the results from all the cells, so you can start again, choose Cell > All Output > Clear.

---
---

## Introduction to the Python Exercises
This is a long notebook that covers many basic topics in Python:

* Strings, lists, dictionaries and tuples
* String methods
* Indexing and slicing
* Loops and comprehensions
* Imports
* Functions

It is not really intended to teach you these things from scratch as your only resource if you are a beginner. Run through the cells and do the brief exercises as a reminder or recap. 

Most of what you need to understand about Python to understand the examples in the later notebooks is here (with the exception of objects, which I gloss over). Refer back to this notebook if you find something confusing in a later notebook, but do not be afraid to move on and come back another time. You can still get a lot out of this course without understanding everything.

---
---

## How to Join In with Coding

* **Edit** any cell and try changing the code, or delete it and write your own.

* Before running a cell, try to **guess** what the output will be by thinking through what will happen.

* If you encounter an **error**, realise this is normal. Errors happen all the time and by reading the error message you will learn something new.

* Remember: you cannot 'break' your computer by editing this code, so **don't be afraid to experiment**.

---
---

## Working with Strings in Python

When we want to store and manipulate books, archives, records or other textual data in a computer, we have to do so with **strings**. Strings are the way that Python (and most programming languages) deals with text.

A string is a simple *sequence of characters*, for example, the string `coffee` is a sequence of the individual characters `c` `o` `f` `f` `e` `e`.

![A cup of coffee](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Cup-o-coffee-simple.svg/240px-Cup-o-coffee-simple.svg.png "A cup of coffee")

This section introduces some basic things you can do in Python to create and manipulate strings. What you learn here forms the basis of any and all text-mining you may do with Python in the future. It's worth getting the fundamentals right.

### Create and Store Strings with Names
Strings are simple to create in Python. You can simply write some characters in quote marks (either single `'` or double `"` is fine in general).

When you run the cell below, Python will evaluate the code, recognise that it is a new string, and then print it out. Try this now.

In [None]:
'Butterflies are important as pollinators.'

In order to do something useful with this string, other than print it out, we need to store it by using the **assignment operator** `=` (equals sign). Whatever is on the right-hand side of the `=` is stored with the **name** on the left-hand side. In other words, the name is assigned to the value. The name is like a label that sticks to the value and can be used to refer to it at a later point.

The pattern is as follows:

`name = value`

(In other programming languages this is called creating a *variable*.)

Run the code cell below.

In [None]:
my_sentence = 'Butterflies are important as pollinators.'

Notice that nothing is printed to the screen.

That's because the string is stored with the name `my_sentence` rather than being printed out. In order to see what is 'inside' `my_sentence` we can simply write `my_sentence` in a code cell, run it, and the interpreter will print it out for us.

Remember to run every code cell you come across in the notebook from now on.

In [None]:
my_sentence
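
As mentioned earlier, either single or double quote marks can be used to create a string. Here is a small illustrative sketch (the names `single_quoted`, `double_quoted` and `with_apostrophe` are made up for this example) showing that both styles give the same string, and that double quotes are handy when the text itself contains an apostrophe:

```python
# Single and double quote marks create exactly the same kind of string
single_quoted = 'Bees are pollinators too.'
double_quoted = "Bees are pollinators too."
print(single_quoted == double_quoted)  # True: the two strings are identical

# Double quotes are convenient when the text contains an apostrophe
with_apostrophe = "Darwin's letters"
print(with_apostrophe)
```

The `==` used here tests whether two values are equal; it is different from the single `=` used for assignment.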

---
> **EXERCISE**: Try creating your own string in the code cell below and assign it the name `my_string`. Then add a second line of code that will cause the string to be printed out. You should have two lines of code before you run the code cell.

---

If you are in need of inspiration, copy and paste a string from the [Cambridge Digital Library Darwin-Hooker letters](https://cudl.lib.cam.ac.uk/collections/darwinhooker/1).

In [None]:
# Write code here. NB: if you don't do this, some of the code below will give an error!

### Concatenate Strings
If you add two numbers together with a plus sign (`+`) you get addition. With strings, if you add two (or more) of them together with `+` they are concatenated, that is, the sequences of characters are joined.

In [None]:
another_sentence = my_sentence + ' Bees are too.'
another_sentence
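
To make the difference between addition and concatenation concrete, here is a minimal sketch (the names `number_sum`, `string_join` and `no_space` are only for illustration). Note that Python does not add spaces for you when joining strings:

```python
# With numbers, + performs addition
number_sum = 3 + 4
print(number_sum)   # 7

# With strings, + joins the sequences of characters together
string_join = '3' + '4'
print(string_join)  # 34

# No space is added automatically between the joined strings
no_space = 'Hello' + 'world'
print(no_space)     # Helloworld
```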

In [None]:
# Write code here to create and print a new string that concatenates 3 strings together of your choice

### Manipulate Whole Strings with Methods
Python strings have some built-in **methods** that allow you to manipulate a whole string at once. You apply the method using what is called **dot notation**, with the name of the string first, followed by a dot (`.`), then the name of the method, and finally a pair of parentheses (`()`), which tells Python to run the method.

The pattern is as follows:

`name_of_my_string.method_name()`

You can change all characters to lowercase or uppercase:

In [None]:
my_string.lower()

In [None]:
my_string.upper()

Note that these methods do not change the original string but instead create a new one. The original string is still the same as it was before:

In [None]:
my_string

In order to 'save' the newly manipulated string, you have to assign it a new name:

In [None]:
new_string = my_string.lower()
new_string

### Test Strings with Methods

You can also test a string to see if it passes some test, e.g. is the string made up of alphabetic characters only?

In [None]:
my_sentence.isalpha()
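
The result for the whole sentence may be surprising, so here is a short sketch of why it happens (the names `sentence` and `word` are just for this example). The method `isalpha()` only gives `True` when *every* character is a letter:

```python
sentence = 'Butterflies are important as pollinators.'
word = 'Butterflies'

sentence_is_alpha = sentence.isalpha()
word_is_alpha = word.isalpha()

print(sentence_is_alpha)  # False: the spaces and the full stop are not letters
print(word_is_alpha)      # True: every character is a letter
```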

There are many different **string methods** to try.



The [full documentation on string methods](https://docs.python.org/3.8/library/stdtypes.html#string-methods) gives all the technical details. It is a skill and art to read code documentation, and you should start to learn it as soon as you can on your code journey. But it's not necessary for the session today.
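
As a taster, here is a sketch of a few string methods taken from that documentation. The choice of methods and the names on the left-hand side are only for illustration; `split()` gives back a list, which is covered later in this notebook.

```python
sentence = 'Butterflies are important as pollinators.'

# Swap one piece of text for another
replaced = sentence.replace('Butterflies', 'Bees')
print(replaced)

# Test how the string begins
starts_with_b = sentence.startswith('Butterflies')
print(starts_with_b)

# Count how many times a character appears
t_count = sentence.count('t')
print(t_count)

# Break the string into a list of words (splitting on spaces)
words = sentence.split()
print(words)
```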

---
> **EXERCISE**: Try three of the string methods listed in the documentation above on your own string `my_string`.

---


In [None]:
# Write code here

### Access Individual Characters with an Index
A string is just a sequence of characters, much like a list. You can access individual characters in a string by specifying which ones you want. To do this we use what is called an **index** number in square brackets (`[]`). Like this:

In [None]:
my_sentence[1]

Here the index is `1` and we expect to get the first character of the string... or do we? Did you notice something unexpected?

It gave us `u` instead of `B`.

In programming, things are often counted from 0 rather than 1, which is called **zero-based numbering**. Thus, in the example above, `1` gives us the *second* character in the string, not the first like you might expect.

If you want the first character in the string, you need to specify the index `0`.

In [None]:
# Write code here to get the first character in the string
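
To see zero-based numbering on a shorter string, here is a minimal sketch using the word `coffee` from earlier (the name `word` is just for this example). The built-in function `len()` gives the number of characters in a string:

```python
word = 'coffee'

print(word[0])    # c  (index 0 is the first character)
print(word[1])    # o  (index 1 is the second character)
print(word[5])    # e  (index 5 is the sixth and last character)

# There are 6 characters, so the valid indexes run from 0 to 5
print(len(word))  # 6
```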

### Access a Range of Characters with Slicing

You can also pick out a range of characters from within a string, by giving a start index, followed by an end index, with a colon (`:`) in between. This is called **slice notation**.

The example below gives us the character at index `0` all the way up to, but **not** including, the character at index `20`.

In [None]:
my_sentence[0:20]

In [None]:
# Write code here to access the word "important" from the string

We can also slice in jumps through a string. To do this, we add what is known as the **step**.

The pattern is as follows:

`my_sentence[start:stop:step]`

So, to go in jumps of 2 characters:

In [None]:
my_sentence[0:20:2]
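
Slices can be easier to follow on a short string. Here is a small sketch with the word `coffee` (the names are only for illustration), showing that the stop index is never included and how the step picks out every second character:

```python
word = 'coffee'

first_three = word[0:3]
last_three = word[3:6]
every_second = word[0:6:2]

print(first_three)   # cof  (indexes 0, 1 and 2; index 3 is not included)
print(last_three)    # fee  (indexes 3, 4 and 5)
print(every_second)  # cfe  (indexes 0, 2 and 4)
```

A handy consequence of leaving out the stop index is that the length of a slice is simply the stop minus the start (when the step is 1).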

### Lists of Strings
Another important and useful thing we can do with strings is storing them in a **list**. For example, you could have each sentence in a document stored as a list of strings.

To create a new list of strings we list them inside square brackets `[]` on the right-hand side of the assignment operator (`=`):

In [None]:
my_list = ['Butterflies are important as pollinators',
          'Butterflies feed primarily on nectar from flowers',
          'Butterflies are widely used in objects of art']

In [None]:
# Write code here to print out the list

Yes, we have used square brackets (`[]`) before for indexing individual characters (above), but this is a different use of square brackets to create lists. If you are unfamiliar with Python, I can reassure you that eventually you get used to these different uses of square brackets.

### Access Individual Strings in a List with an Index
Just like with strings, we can access individual items inside a list by index number:

In [None]:
my_list[0]
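
Because each item in the list is itself a string, the two kinds of index can be combined. A minimal sketch, using a made-up list called `insects`:

```python
insects = ['butterfly', 'bee', 'moth']

second_insect = insects[1]       # index into the list: 'bee'
first_letter = second_insect[0]  # index into that string: 'b'
print(second_insect, first_letter)

# The two steps can also be chained together on one line
chained = insects[1][0]
print(chained)                   # b
```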

### Access a Range of Strings in a List with Slicing
Likewise, we can access a range of items inside a list by using slice notation:

In [None]:
my_list[0:2]

Just like with strings, we can also slice in steps.

In [None]:
# Write code here to slice every other item in `my_list` (i.e. the first and third item)

To access the whole list we can use the shorthand:

In [None]:
my_list[:]

Why would we want to do this? Well, combine this trick with a negative step, and we can go *backwards* through the whole list!

In [None]:
my_list[::-1]
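
The same negative step works on strings too, and in both cases the original is left untouched. A short sketch with made-up names `word` and `numbers`:

```python
word = 'coffee'
reversed_word = word[::-1]
print(reversed_word)     # eeffoc

numbers = ['one', 'two', 'three']
reversed_numbers = numbers[::-1]
print(reversed_numbers)  # ['three', 'two', 'one']

# Slicing creates a new list; the original list is unchanged
print(numbers)           # ['one', 'two', 'three']
```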

---
---

## Iterate with Loops

A `for` **loop** goes over every item in a list in turn — and runs some code for every item in that list. It makes sure that every item is visited, and then it stops when it gets to the end. We call this **iteration**; the loop _iterates_ over the list.

Loops also work for many things other than lists, like strings, but here we stick to lists as an example.

First, let's create a list to work with:

In [None]:
game = ['rock', 'paper', 'scissors']
game

Now we will loop over every item in the list `game` and print it out:

In [None]:
for move in game:
    print(move)

The pattern is as follows:

```
for item in list:
    # do something here
```

Let's look at some of the details:

* `for` is a **keyword** that starts the loop.
* `item` could be any name you give to each item in the list. Name it something that makes sense, e.g. if it's a list of fruit, name it 'fruit', or if it's a list of words, name it 'word'.
* `in` is another keyword that goes before the name of the list.
* `list` could be any name for the list. If your list is a list of novels, for example, it might make sense to name your list 'library'.
* `:` is a colon that starts the **block** of code that you want to run.
* `# do something here` represents whatever code you want to run for every item in the list.

Note that a block of code in Python is indicated by **indenting** the code by several spaces (typically four spaces). If you don't indent code blocks correctly you'll get an error.
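
Here is a small sketch of how indentation decides what belongs to the loop. It loops over the characters of a string rather than a list, and the string `'tea'` and the name `letter` are just for illustration:

```python
for letter in 'tea':
    # These indented lines are inside the block, so they run once per character
    print(letter)
    print('---')

# This line is not indented, so it runs only once, after the loop has finished
print('done')
```

If you removed the indentation from the lines inside the loop, Python would report an `IndentationError`.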

In [None]:
# Write code here to print out each string in `my_list` with each string lowercase

You can also create a loop inside another loop, to access all the items in _nested_ lists (i.e. lists inside of lists):

In [None]:
# This list has 3 lists in it
list_of_lists = [
    [0, 1, 2],
    [True, False, True],
    ['straw', 'twigs', 'bricks']
]

# First we loop over every list
for list_item in list_of_lists:
    # Then we loop over each item in each list
    for item in list_item:
        print(item)

---
---

## List Comprehensions

We can **create new lists** in a quick and elegant way by using **list comprehensions**. Essentially, a list comprehension loops over each item in a list, one by one, returns something each time, and collects the results in a new list.

For example, here is a list of strings:

`['banana', 'apple', 'orange', 'kiwi']`

We could use a list comprehension to loop over this list and create a new list with every item made UPPERCASE. The resulting list would look like this:

`['BANANA', 'APPLE', 'ORANGE', 'KIWI']`

The code for doing this is below:

In [None]:
fruit = ['banana', 'apple', 'orange', 'kiwi']
fruit_u = [item.upper() for item in fruit]
fruit_u

If the `for` and `in` seems familiar, this is because list comprehensions are a special type of loop.

We have already seen [above](0-introduction-to-python-and-text.ipynb#Iterate-with-Loops) how `for move in game` is a `for` loop that loops over every `move` in `game`. 

In list comprehensions the `for` loop is simply on one line, without any indentation.
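
To make the connection with ordinary loops explicit, here is a sketch that builds the same uppercase list in two ways. The list method `append()`, which adds an item to the end of a list, is not covered elsewhere in this notebook and is used here only for comparison; the names `fruit_loop` and `fruit_comp` are made up.

```python
fruit = ['banana', 'apple', 'orange', 'kiwi']

# The long way: start with an empty list and add to it inside a for loop
fruit_loop = []
for item in fruit:
    fruit_loop.append(item.upper())

# The short way: a list comprehension on one line
fruit_comp = [item.upper() for item in fruit]

print(fruit_loop == fruit_comp)  # True: both approaches build the same list
```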

The pattern is as follows:

`[return_something for each_item in list]`

![List comprehensions diagram](assets/list-comprehension.png)

Let's look at some of the details:

* A list comprehension goes inside square brackets (`[]`), which tells Python to create a new list.
* `list` is the name of your list. It has to be a list you have already created in a previous step.
* The `for each_item in list` part is the loop.
* `each_item` is the name you assign to each item as it is selected by the loop. The name you choose should be something descriptive that helps you remember what it is.
* The `return_something` part is what happens each time it loops over an item. The `return_something` could just be the original item, or it could be something fairly complicated.

The most basic example is to return each item unchanged as the loop visits it, which produces a new list containing all the items of the original.

Here is an example where we have taken our original list `my_list` and created a new list `new_list` with the exact same items unchanged:

In [None]:
new_list = [item for item in my_list]
new_list

Why do this? There does not seem to be much point in creating the same list again.

### Manipulate Lists with String Methods

By adding a string method to a list comprehension we have a powerful way to manipulate a list.

We have already seen this in the `fruit` example above. Here's another example of the same thing with the 'butterflies' list we've been working with. Every time Python loops over an item it transforms it to uppercase before adding it to the new list:

In [None]:
new_list_upper = [item.upper() for item in my_list]
new_list_upper

In [None]:
# Write code to transform every item in the list with a string method (of your choice)

Hint: see the [full documentation on string methods](https://docs.python.org/3.8/library/stdtypes.html#string-methods).

### Filter Lists with a Condition

We can filter a list by adding a **condition** so that only certain items are included in the new list:

In [None]:
new_list_p = [item for item in my_list if 'p' in item]
new_list_p

The pattern is as follows:

`[return_something for each_item in list if some_condition]`

![List comprehensions with condition diagram](assets/list-comprehension-with-condition.png)

Essentially, what we are saying here is that **if** the character "p" is **in** the item when Python loops over it, keep it and add it to the new list, otherwise ignore it and throw it away.

Thus, the new list has only two of the strings in it. The first string has a "p" in "important" and "pollinators"; the second has a "p" in "primarily". The third string contains no "p", so it is left out.
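
The condition does not have to be an `in` test. Here is a sketch with two other possibilities, reusing the `fruit` list from earlier (the names `long_fruit` and `a_fruit_upper` are made up for this example):

```python
fruit = ['banana', 'apple', 'orange', 'kiwi']

# Keep only the items that are longer than five characters
long_fruit = [item for item in fruit if len(item) > 5]
print(long_fruit)     # ['banana', 'orange']

# A condition and a string method can be combined in one comprehension
a_fruit_upper = [item.upper() for item in fruit if 'a' in item]
print(a_fruit_upper)  # ['BANANA', 'APPLE', 'ORANGE']
```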

In [None]:
moon_list = ['The Moon formed 4.51 billion years ago',
           "The Moon is Earth's only permanent natural satellite",
          'The Moon was first reached in September 1959']

# Write code to filter the list `moon_list` for items that include a number (of your choice)

---
---
## Add New Capabilities with Imports

Python has a lot of amazing capabilities built into the language itself, like being able to manipulate strings. However, in any Python project you are likely to want to use Python code written by someone else to go beyond the built-in capabilities. Code 'written by someone else' comes in the form of a file (or files) separate to the one you are currently working on.

An external Python file (or sometimes a **package** of files) is called a **module** and in order to use it in your code, you need to **import** it.

This is a simple process using the keyword `import` and the name of the module. Just make sure that you `import` something _before_ you want to use it!

The pattern is as follows:

`import module_name`

Here is a series of examples. See if you can guess what each one is doing before running it.

In [None]:
import math
math.pi

In [None]:
import random
random.random()

In [None]:
import locale
locale.getlocale()

The answers are: the value of the mathematical constant *pi*, a random number (different every time you run it), and the current locale that the computer thinks it is working in.
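
As one more sketch of the same pattern, the standard-library module `string` holds some ready-made strings that can be useful when working with text. It is not used elsewhere in this notebook and is shown here only as an illustration:

```python
import string

# The lowercase letters a to z, stored as one string
print(string.ascii_lowercase)

# Test whether the full stop counts as a punctuation character
is_punctuation = '.' in string.punctuation
print(is_punctuation)  # True
```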

---
---
## Reuse Code with Functions

A function is a **reusable block of code** that has been wrapped up and given a name. The function might have been written by someone else, or it could have been written by you. We don't cover how to write functions in this course; just how to run functions written by someone else.

In order to run the code of a function, we use the name followed by parentheses `()`.

The pattern is as follows:

`name_of_function()`

We have already seen this earlier. Here is a selection of functions (or methods) we have run so far:

In [None]:
# 'lower()' is the function (aka method)
my_sentence = 'Butterflies are important as pollinators.'
my_sentence.lower()

In [None]:
# 'isalpha()' is the function (aka method)
my_sentence.isalpha()

In [None]:
# 'random()' is the function
random.random()

---
#### Functions and Methods
There is a technical difference between functions and methods. You don't need to worry about the distinction for our course. We will treat all functions and methods as the same.

If you are interested in learning more about functions and methods try this [Datacamp Python Functions Tutorial](https://www.datacamp.com/community/tutorials/functions-python-tutorial).

---

### Functions that Take Arguments
If we need to pass particular information to a function, we put that information _in between_ the `()`. Like this:

In [None]:
math.sqrt(25)

The `25` is the value we want to pass to the `sqrt()` function so it can do its work. This value is called an **argument** to the function. Functions may take any number of arguments, depending on what the function needs.

Here is another function with an argument:

In [None]:
import calendar
calendar.isleap(2020)
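
Functions can also take more than one argument, or none at all. A minimal sketch using the built-in function `round()`, the `math.gcd()` function and a string method (the names on the left-hand side are made up for this example):

```python
import math

# Two arguments: the number to round and how many decimal places to keep
rounded = round(3.14159, 2)
print(rounded)  # 3.14

# Two arguments: find the greatest common divisor of 12 and 18
divisor = math.gcd(12, 18)
print(divisor)  # 6

# No arguments at all: the parentheses are still required
shout = 'hello'.upper()
print(shout)    # HELLO
```

When there are several arguments they are separated by commas, and their order usually matters.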

Essentially, you can think of a function as a box.

![Function black box diagram](assets/function-black-box.png)

You put an input into the box (the input may be nothing), the box does something with the input, and then the box gives you back an output. You generally don't need to worry _how_ the function does what it does (unless you really want to, in which case you can look at its code). You just know that it works.

> ***Functions are the basis of how we 'get stuff done' in Python.***

For example, we can use the `requests` module to get the text of a Web page:

In [None]:
import requests
response = requests.get('https://www.wikipedia.org/')
response.text[136:266]

The string `'https://www.wikipedia.org/'` is the argument we pass to the `get()` function for it to open the Web page and read it for us.

> **EXERCISE**: Try getting the text of a different URL of your choice. What happens if you print the whole of `response.text` instead of slicing out some of the characters?

---
---
## Dictionaries
Dictionaries are a form of _mapping_. They map **keys** to **values**. You can think of it like the index at the back of a book, where the key is a word and its value is the page number where you can find that word in the book. To find the page number of a word, you look through the index and find the word you want (the key) and then look at the number (the value).

```
agriculture, 228
air freight, 46
airplane food, 19
alcohol, 165
alfalfa, 242
```

_etc._


The Python dictionary is called a `dict` and it can hold (almost) any type of key and value: strings, numbers, Booleans (`True`, `False`) and more.

To create a new `dict` we use curly braces `{}` and inside put each key and value separated by a colon `:`

In [None]:
my_dict = {
    'agriculture': 228,
    'air freight': 46,
    'airplane food': 19,
    'alcohol': 165,
    'alfalfa': 242
}
my_dict

So now we can find the page number (value) of any of these words (keys) by putting the key in square brackets `[]`:

In [None]:
my_dict['agriculture']

To add a new key-value pair to the dictionary we can use the key in square brackets `[]` and assign the new value to it with the assignment operator `=`. In this example, the new key is `'allergies'` and the new value is `210`.

In [None]:
my_dict['allergies'] = 210
my_dict
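
The same square-bracket assignment also replaces the value of a key that already exists. Here is a small sketch with a made-up dictionary called `word_info`, which also shows values of different types (a string, a number and a Boolean); the data is invented purely for illustration:

```python
word_info = {
    'word': 'butterfly',
    'count': 3,
    'is_noun': True
}

print(word_info['count'])  # 3

# Assigning to a key that already exists replaces its value
word_info['count'] = 4
print(word_info['count'])  # 4

# Assigning to a new key adds a new key-value pair
word_info['language'] = 'English'
print(len(word_info))      # 4 key-value pairs in total
```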

---
---
## Tuples
A tuple is a bit like a list, except unlike a list, tuples cannot be changed. You cannot add or remove items from a tuple once you have created it. Tuples are known as **immutable**.

NB: Tuple is often pronounced 'toople' if you are from the UK, or 'tupple' if you are from the US, but it doesn't really matter.

To create a new tuple we use parentheses `()`:

In [None]:
my_tuple = (1, 5.0, 'ten-thousand')
my_tuple

You might be a little confused as we also use parentheses `()` to call a function. However, you can recognise a tuple because the parentheses don't have a function name immediately before them. The use of `()` in tuples is totally unrelated to the use of `()` to call functions. They are merely using the same sort of brackets.

Like a list, you can use an _index_ to access the items of a tuple:

In [None]:
my_tuple[2]
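
Tuples support indexing, slicing and looping in the same way as lists. A minimal sketch, using a tuple with the made-up name `numbers`:

```python
numbers = (1, 5.0, 'ten-thousand')

first_item = numbers[0]
first_two = numbers[0:2]

print(first_item)  # 1  (indexing gives a single item)
print(first_two)   # (1, 5.0)  (slicing gives a new, shorter tuple)

# A for loop visits each item in the tuple in turn
for item in numbers:
    print(item)
```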

But unlike a list, you cannot assign a new value to any of its items:

In [None]:
my_tuple[2] = 'rainbows and unicorns'

You should get an error above that says `TypeError: 'tuple' object does not support item assignment`. This means you cannot assign a new value to any of the items in a tuple.


---
---
## Summary

Let's take a moment to summarise what we have covered so far:

* Python is a general-purpose programming language that is good for beginners.
* Jupyter notebooks have two main types of **cell** to run: **code** and **text** (Markdown).
* **Strings** are the basis of how Python handles text.
* Things we can do with strings:
 * Create and store a string with a **name** (_aka_ variables).
 * **Concatenate** two or more strings together.
 * Access an individual character with an **index**.
 * Access a range of characters with a **slice**.
 * Manipulate whole strings or test strings with **string methods**.
* **Lists** are useful for holding a collection of strings.
* Things we can do with lists:
 * Create and store a list of strings with a name.
 * Access strings in a list with an index and with slicing.
 * **Loop** over each item in a list.
 * Create, change and **filter** a new list with a **list comprehension**.
* Add new capabilities by **importing** code someone else has written in a **module**.
* Run a **function** with and without **arguments** between the parentheses.
* Map keys to values in a **dictionary**.
* Store a collection of items in a **tuple** that is **immutable**.

This is a lot! I will just reiterate what I said at the very beginning: this notebook is designed as a reminder or a recap. It is not really intended to teach you these things from scratch as your only resource if you are a beginner.

In the [next notebook](1-basic-text-mining-concepts.ipynb) we will look at some basic text-mining concepts with Python examples.