# Introduction to Python, Jupyter Notebooks and Working with Text

---
---

## The Programming Language Python
Python is a **general-purpose programming language** that can be used in many different ways; for example, it can be used to analyse data, create websites, automate boring stuff and write software with a graphical interface. Python is **excellent for beginners** because its basic syntax is simple and uncluttered with punctuation. Coding in Python often feels like writing in natural language. Python is a very popular language and has a large, friendly community of users around the world, so there are many tutorials and helpful experts out there to help you get started.

### Why Do Text-Mining By Programming?

You could do a lot of text mining manually, by searching and counting through text by eye, or using the 'find and replace' function in a text editor. You could count and input numbers on a spreadsheet and do your analysis with formulae. However, with a large corpus of many thousands or millions of words, these tasks are error-prone, boring and mind-boggling. They may even be impossible in the time available.

You could use some of the specialist software tools available for cleaning and exploring a corpus. These are definitely an option, and they can be used in combination with manual and programming techniques, but there are several advantages to programming your text-mining:

* Automation: coding automates boring and difficult tasks that are hard for humans to do.
* Reproducibility: code both executes and unambiguously documents the steps to your results.
* Clarity: coding forces you to understand exactly what you are doing with your text, promoting deep knowledge of the techniques you are using.
* Bespoke: coding your own solution means you can design it to meet exactly what your research questions demand.
* Advanced: coding may be the only way to do certain advanced analysis techniques or analyse extremely large datasets.

Can you think of any other advantages or disadvantages?

### Alternatives to Python

R is another language well suited to text mining; in fact, the R language was designed specifically for data analysis and is widely used in many fields such as finance, medical research and social sciences. You *could* learn to do text analysis in R instead, but I recommend you choose either Python or R and stick to it. They are very different languages, and for most people it is better to have deeper expertise in one programming language than to spread yourself too thin.

If you are interested in using R instead of Python then you should look at the book [Text Mining with R](https://www.tidytextmining.com/), which uses the R package [tidytext](https://github.com/juliasilge/tidytext).

Of course, we have chosen Python on your behalf! In this workshop we will be using the Python library [Natural Language Toolkit](http://www.nltk.org/), usually shortened to NLTK (don't worry, this need not make any sense to you yet).

---
---

## Jupyter Notebooks

This 'document' you are reading right now is a Jupyter notebook. It allows you to combine explanatory **text** and **code** that executes to produce the results you can see on the same page. You can also create and display graphs from your data in the same document.

Notebooks are particularly useful for *exploring* your data at an early stage in your research and *documenting* exactly what steps you have taken (and why) to get to your results. This documentation is extremely important for the reproducibility of your research, and as a record of what you did, because you are guaranteed to forget the details weeks or months down the line!

For more on getting started with Jupyter Notebooks for your research try this [Jupyter Notebook for Beginners Tutorial](https://towardsdatascience.com/jupyter-notebook-for-beginners-a-tutorial-f55b57c23ada).

### Notebook Basics

#### Text cells

The box this text is written in is called a *cell*. It is a *text cell* marked up in a very simple language called 'Markdown'. Here is a useful [Markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). You can edit and then run cells to produce a result. Running this text cell produces formatted text.

#### Code cells

The other main kind of cell is a *code cell*. The cell immediately below this one is a code cell. Running a code cell runs the code in the cell (marked by the **In**) and produces a result (marked by the **Out**). We say the code is **evaluated**.

Try it now.

In [2]:
3 + 4

7

In [3]:
# This is a comment in a code cell. Comments start with a # symbol. They are ignored and do not do anything.

> **Important!**

> **When running code cells you need to run them in order, from top to bottom of the notebook. This is because cells rely on the results of other cells. Without those earlier results being available you will get an error.**

> **To run all the cells in a notebook at once, in order, choose Cell > Run All from the menu above. To clear all the results from all the cells, so you can start again, choose Cell > All Output > Clear.**

---
---

## How to Join In with Coding

* **Edit** any cell and try changing the code, or delete it and write your own.

* Before running a cell, try to **guess** what the output will be by thinking through what will happen.

* If you encounter an **error**, realise this is normal. Errors happen all the time and by reading the error message you will learn something new.

* Remember: you cannot break the notebook or your computer, so **don't be afraid to experiment**.

**Let's get coding!**



---
---

## Simple String Manipulation in Python

This section introduces some basic things you can do in Python to create and manipulate strings. A string is a simple *sequence of characters*. For example, the string `coffee` is a sequence of the individual characters `c` `o` `f` `f` `e` `e`. Strings are how Python (and most other programming languages) represents text.

### Creating and Storing Strings with Names
Strings are simple to create in Python. You can simply write some characters in quote marks (either single `'` or double `"` is fine in general).

In [4]:
'Butterflies are important as pollinators.'

'Butterflies are important as pollinators.'

In order to do something useful with this string, other than print it out, we need to store it by using the assignment operator `=` (equals sign). Whatever is on the right-hand side of the `=` is stored with the _name_ on the left-hand side.

In [5]:
my_sentence = 'Butterflies are important as pollinators.'

*Notice that nothing is printed to the screen.*

That's because the string is stored with the name `my_sentence` rather than being printed out. In order to see what is 'inside' `my_sentence` we can simply write `my_sentence` in a code cell, run it, and the interpreter will print it out for us.

In [6]:
my_sentence

'Butterflies are important as pollinators.'

### Slicing Bits of Strings

#### Accessing Individual Characters
A string is just a sequence of characters. You can access **individual characters** in a string by giving the position (the *index*) of the one you want in square brackets after the name.

In [7]:
my_sentence[1]

'u'

**Hang on a minute!** Did you notice something unexpected?

Why did it give us `u` instead of `B`?

In programming, most things are *zero indexed*, which means that they are counted from 0 rather than 1. Thus, in the example above, `1` gives us the *second* character in the string, not the first as you might expect.

If you want the first character in the string, you need to specify the index `0`! 

In [8]:
my_sentence[0]

'B'
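A consequence of counting from 0 is that the last character sits at an index one less than the length of the string. The short sketch below illustrates this using the built-in `len()` function, which has not been introduced above; the names `length`, `last_character` and `message` are made up for this example.

```python
# Same example sentence as in the cells above.
my_sentence = 'Butterflies are important as pollinators.'

# len() gives the number of characters in the string.
length = len(my_sentence)

# Because counting starts at 0, the last valid index is length - 1.
last_character = my_sentence[length - 1]

# Asking for an index beyond the end raises an IndexError.
try:
    my_sentence[length]
except IndexError as error:
    message = str(error)

print(length)
print(last_character)
print(message)
```

Try changing the index inside the `try` block to a smaller number and see whether the error still appears.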

#### Accessing a Range of Characters

You can also pick out a **range of characters** from within a string, by giving the *start index* followed by the *end index* with a colon (`:`) in between.

The example below gives us the character at index `0` all the way up to, *but not including*, the character at index `20`.

In [9]:
my_sentence[0:20]

'Butterflies are impo'
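Slices do not always need both indexes. As a rough sketch of some shortcuts Python allows (the names `first_word`, `rest` and `last_word` are invented here for illustration):

```python
# Same example sentence as in the cells above.
my_sentence = 'Butterflies are important as pollinators.'

# Leaving out the start index means "from the beginning".
first_word = my_sentence[:11]

# Leaving out the end index means "to the end of the string".
rest = my_sentence[12:]

# Negative indexes count backwards from the end of the string.
last_word = my_sentence[-12:]

print(first_word)
print(rest)
print(last_word)
```

Before running it, try to guess what each of the three lines will print.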

### Changing Whole Strings with Methods
Python strings have some built-in *methods* that allow you to change a whole string at once. You can change all characters to lowercase or uppercase:

In [10]:
my_sentence.lower()

'butterflies are important as pollinators.'

In [11]:
my_sentence.upper()

'BUTTERFLIES ARE IMPORTANT AS POLLINATORS.'

NB: These methods do not change the original string; they create a new one. Our original string is still the same as it was before:

In [12]:
my_sentence

'Butterflies are important as pollinators.'
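If you want to keep the changed string, one option is to store it under a name of its own with `=`, just as we did with `my_sentence`. A minimal sketch, where `lower_sentence` is simply a name chosen for this example:

```python
# Same example sentence as in the cells above.
my_sentence = 'Butterflies are important as pollinators.'

# Calling the method on its own leaves my_sentence untouched.
my_sentence.lower()

# To keep the result, store the new string under a name.
lower_sentence = my_sentence.lower()

print(my_sentence)
print(lower_sentence)
```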

### Testing Strings with Methods

You can also test a string to see whether it passes some test, for example: does the string contain alphabetic characters only?

In [13]:
my_sentence.isalpha()

False

Why does this produce this particular result?

Here's another. Does the string have the letter `p` in it?

In [14]:
'p' in my_sentence

True
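One thing worth knowing about the `in` test is that it distinguishes between uppercase and lowercase letters. The sketch below shows one possible way to work around this by combining `in` with `.lower()`; the variable names are only illustrative.

```python
# Same example sentence as in the cells above.
my_sentence = 'Butterflies are important as pollinators.'

# The test is case-sensitive: the sentence has 'B' but no lowercase 'b'.
has_lower_b = 'b' in my_sentence

# Lowercasing the string first makes the test ignore case.
has_b_any_case = 'b' in my_sentence.lower()

# `in` also works with longer pieces of text, not only single characters.
has_word = 'pollinators' in my_sentence

print(has_lower_b)
print(has_b_any_case)
print(has_word)
```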

---

#### Going Further with Python Documentation

Everything you can do in Python is well-documented online. Reading code documentation is both a skill and an art, and you should start to learn it as early as you can on your coding journey.

Here is a link to all the methods you can use with strings: 
https://docs.python.org/3.6/library/stdtypes.html#string-methods

Why not try a method we have not used here so far?
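If you are not sure where to start, here is a small sample of other string methods from that documentation page. It is only a sketch of four of them (`split`, `replace`, `count` and `startswith`), and the variable names and the replacement word `'Bees'` are made up for this example.

```python
# Same example sentence as in the cells above.
my_sentence = 'Butterflies are important as pollinators.'

# split() breaks the string into a list of words wherever there is whitespace.
words = my_sentence.split()

# replace() swaps every occurrence of one piece of text for another.
swapped = my_sentence.replace('Butterflies', 'Bees')

# count() counts how many times a piece of text occurs.
number_of_t = my_sentence.count('t')

# startswith() tests how the string begins.
begins_with_butterflies = my_sentence.startswith('Butterflies')

print(words)
print(swapped)
print(number_of_t)
print(begins_with_butterflies)
```

As with `.lower()` and `.upper()`, none of these change `my_sentence` itself.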

---


### Lists of Strings
Another important thing we can do with strings is to create a *list of strings* by writing them, separated by commas, inside square brackets `[]`:

In [15]:
my_list = ['Butterflies are important as pollinators',
          'Butterflies feed primarily on nectar from flowers',
          'Butterflies are widely used in objects of art']
my_list

['Butterflies are important as pollinators',
 'Butterflies feed primarily on nectar from flowers',
 'Butterflies are widely used in objects of art']

### Slicing Lists of Strings
Just like with strings, we can access individual items inside a list by index number:

In [16]:
my_list[0]

'Butterflies are important as pollinators'

And we can access a range of items inside a list by *slicing*:

In [17]:
my_list[0:2]

['Butterflies are important as pollinators',
 'Butterflies feed primarily on nectar from flowers']
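Because each item in the list is itself a string, indexing and slicing can be chained: first pick an item from the list, then slice that string. The following is a small sketch of the idea, with variable names invented for illustration.

```python
# Same example list as in the cells above.
my_list = ['Butterflies are important as pollinators',
           'Butterflies feed primarily on nectar from flowers',
           'Butterflies are widely used in objects of art']

# First [] picks a string from the list, second [] slices that string.
first_word = my_list[0][0:11]

# A negative index counts from the end of the list.
last_item = my_list[-1]

# A slice always gives back a list, even when it holds only one item.
one_item_list = my_list[0:1]

print(first_word)
print(last_item)
print(one_item_list)
```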

### Creating Lists of Strings with List Comprehensions
We can create new lists in an elegant way by combining some of the things we have covered above. Here is an example where we have taken our original list `my_list` and created a new list `new_list` by going over each string in the list:

In [18]:
new_list = [string for string in my_list]
new_list

['Butterflies are important as pollinators',
 'Butterflies feed primarily on nectar from flowers',
 'Butterflies are widely used in objects of art']

Why do this? If we combine it with a test, we can create a filtered list that only contains strings with the letter `p` in them:

In [19]:
new_list_p = [string for string in my_list if 'p' in string]
new_list_p

['Butterflies are important as pollinators',
 'Butterflies feed primarily on nectar from flowers']

This is a very powerful way to quickly create lists. We can even change all the strings to uppercase at the same time!

In [20]:
new_list_p_upper = [string.upper() for string in my_list if 'p' in string]
new_list_p_upper

['BUTTERFLIES ARE IMPORTANT AS POLLINATORS',
 'BUTTERFLIES FEED PRIMARILY ON NECTAR FROM FLOWERS']
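It may help to see what the list comprehension is doing step by step. The sketch below builds the same list with a `for` loop and the list method `append()`, neither of which has been introduced above, so treat it as a preview rather than something you need to understand fully yet. The name `loop_result` is made up for this example.

```python
# Same example list as in the cells above.
my_list = ['Butterflies are important as pollinators',
           'Butterflies feed primarily on nectar from flowers',
           'Butterflies are widely used in objects of art']

# Start with an empty list and add to it one item at a time.
loop_result = []
for string in my_list:
    # Keep only the strings that pass the test...
    if 'p' in string:
        # ...and store an uppercase copy of each one.
        loop_result.append(string.upper())

print(loop_result)
```

The list comprehension expresses the same three ideas (go over each string, test it, change it) in a single line.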

---
---
## Summary

Let's take a moment to summarise what we have covered so far.

We have: 

* Learnt that Python is a programming language
* Discussed why programming is useful for text-mining
* Covered the basics of Jupyter notebooks
* Created and manipulated strings by:
    * Slicing them into bits
    * Making them lowercase and uppercase
    * Testing them
* Created and sliced lists of strings
* Created new lists by filtering lists with tests

👌👌👌

The next notebook `2-collecting-and-preparing.ipynb` will show how we can collect, prepare and explore a text.
