# 3 Reading texts at scale: Introduction


This notebook will teach you how to:
- read a collection of text files
- process the text
- describe the dataset
- query the corpus for word patterns

In [11]:
from IPython.display import HTML
#HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/S_f2qV2_U00?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')


In this notebook you will learn to work with large collections of text documents. We first show you how to read one document, process its contents and perform some simple queries. Subsequently, we focus on iteration (repeating exactly the same steps on different files), which allows you to easily scale up your research. Once you understand how to work with one text, Python lets you extend your analysis to a very large corpus.

But please be patient: we have to start with a few very basic elements of the Python language before getting stuck in. We start very slowly and speed up considerably in the next section.

## 3.1 Variables and data types

Python makes a distinction between text and numbers, which it considers values belonging to different data types. You can enter any number in a code cell and, when you run the cell, it will be printed below.

In [2]:
8

8

Try it yourself in the cell below, after removing the comment.

Comments are marked by hashtags. Python ignores everything that follows a `#`, which is handy when we want to say something about the code without Python interpreting it as code.

Run the code cell below: you will see that nothing happens! Now remove the hashtag and run it again: this should raise a `SyntaxError`. We tell you more about errors later, but here Python wants to draw your attention to the fact that your code is syntactically incorrect. Finally, remove the whole phrase `Enter your favorite number here` and type a number. This time the error should disappear and the number you entered should be printed below the cell.

In [14]:
# Enter your favorite number here

SyntaxError: invalid syntax (<ipython-input-14-39de8aafdbe0>, line 1)
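
As a small sketch of how comments behave (the numbers are arbitrary examples, not part of the exercise), a `#` can start a whole-line comment or follow code on the same line:

```python
# A full-line comment: Python skips this entire line.
print(8)  # an inline comment: only the code to the left of the hash runs

# The next line is "commented out", so Python never executes it.
# print(99)
```

Running this cell prints only `8`, because the second `print` is hidden behind a hashtag.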

You can ask Python what data type you are working with. For this you use the `type()` function. 

Functions are an essential component of almost any program and we will discuss them more extensively later. For now, only pay attention to the form of the expression (the name `type` followed by **8** enclosed in parentheses) and to what it returns (`int`).


Basically, the expression `type(8)` tells you the data type of the value. More formally, it shows that **8** is an instance of the `int` (short for integer) class.



In [15]:
type(8)

int

Now look at the cell below. Here the number is surrounded by single quotation marks, which indicates that the value is a string and not a number.

In [3]:
'8'

'8'

In [4]:
# check this using the type function
type('8')

str
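
The phrase "is an instance of the int class" can also be checked directly. The sketch below uses the built-in `isinstance()` function, which this notebook has not introduced; it is shown here only as an illustration of the same idea in yes/no form:

```python
# isinstance(value, class) answers True or False.
print(isinstance(8, int))     # True: 8 is an integer
print(isinstance('8', int))   # False: the quoted '8' is text
print(isinstance('8', str))   # True: the quoted '8' is a string
```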

The textual data we will work with are usually represented by Python strings. For example, below I enter my first name as a string.

In [16]:
'Kaspar'

'Kaspar'

If I remove the quotation marks, I get a `NameError`. This is because Python now thinks `Kaspar` refers to a variable and not a string. We will discuss this distinction soon, don't worry.

In [17]:
Kaspar

NameError: name 'Kaspar' is not defined

-- Exercise -- 

Create a new code cell below and print your own name 

### `Intermezzo`

Strings and quotation marks
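
A minimal sketch of how quotation marks work, assuming this is the kind of material the intermezzo is meant to cover: single and double quotes both create strings, and the example text is made up for illustration.

```python
# Single and double quotation marks create exactly the same string.
print('Kaspar' == "Kaspar")   # True

# Double quotes are convenient when the text itself contains an apostrophe.
print("Python's strings")     # Python's strings

# The quotation marks are not part of the text: this string has 6 characters.
print(len('Kaspar'))          # 6
```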

Decimal numbers are expressed with a dot in the middle, for example `1.0`.

What is the data type of 3.1415?

In [None]:
# enter answer here

## 3.2 Operations

Just returning values you enter manually isn't very useful. In Python you can manipulate these values by performing operations on them, such as addition and subtraction.

Operations work differently for each data type (e.g. integers or strings).

For example, you can use Python as a simple calculator: adding 2 and 3 returns 5.


In [18]:
2+3

5

You can also add strings together, but this returns a different result.

In [19]:
'2'+'3'

'23'

-- Exercise: can you explain what happened and why the results are different?

-- Exercise: what happens when you subtract 3 from 2 (first as integers, then as strings)?

Please note that these operations return a value of the same data type as their inputs.

In [22]:
type(2+3)

int

In [23]:
type('2'+'3')

str
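
Because the data type determines what `+` does, it can help to convert a value before operating on it. The sketch below uses the built-in `int()` and `str()` functions, which this section does not otherwise cover, so treat it as an optional illustration:

```python
# int() turns a string of digits into a number, so + performs addition.
print(int('2') + 3)     # 5

# str() turns a number into a string, so + glues the two strings together.
print(str(2) + '3')     # 23

# Mixing the two types without converting does not work:
# 2 + '3'               # raises a TypeError
```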

### `Intermezzo`

- Operations for integers, strings and float
- Precedence
- Data types and operations
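
The topics in this list can be sketched in a few lines. The values are arbitrary and the selection of operators is only a sample, not a complete overview:

```python
# Integers and floats support the usual arithmetic operators.
print(7 - 2)         # 5
print(7 * 2)         # 14
print(7 / 2)         # 3.5 (division returns a float)

# Strings support + (concatenation) and * (repetition).
print('ab' + 'cd')   # abcd
print('ab' * 3)      # ababab

# Precedence: multiplication is evaluated before addition,
# unless parentheses say otherwise.
print(2 + 3 * 4)     # 14
print((2 + 3) * 4)   # 20
```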

## 3.3 Variables

One of the most powerful features of a programming language is the ability to **store and manipulate variables**. A variable is a **name** that refers to a value. The **assignment statement** creates new variables and relates them to concrete values. Instead of passing a value directly as an argument to the `print()` function, we can **store** it by creating a variable that refers to it, for example to the string `'Hello World.'`.

In [None]:
# declare a variable
x = 'Hello World.'
# print what is in the box
print(x)

In [None]:
# declare a variable
y = 22
# print what is in the box
print(y)

If you vaguely remember your math classes in school, this should look familiar. It is basically the same notation, with the name of **the variable on the left, the value on the right**, and the `=` sign in the middle.

In the code block above, two things happen. **First**, we fill `y` with a value, in our case `22`. The variable `y` behaves pretty much like a **box** on which we write a `y` with a thick, black marker so we can find it again later. **Second**, we print the contents of this box, using the `print()` function.
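
To extend the box metaphor a little, here is a small sketch (reusing the `y = 22` example from above) showing that the contents of a box can be used in an operation and can be replaced by a new value:

```python
y = 22
print(y + 1)   # 23: we used the contents of the box, but y still holds 22
print(y)       # 22

# Assigning again replaces what is in the box.
y = y + 1
print(y)       # 23
```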

In [None]:
text = 'Hello, Worlds!'
print(type(text))
number = 10
print(type(number))
number_string = '10'
print(type(number_string))

#### --Exercise--
Create and print two variables, one containing your name (string) and another containing your year of birth (integer).

## `Intermezzo`

- variable names
- common errors
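
A minimal sketch of the two topics in this list, assuming they refer to Python's naming rules and the errors beginners meet most often. The variable names and the year are invented for the example:

```python
# Names may contain letters, digits and underscores, but cannot start with a digit.
year_of_birth = 1990        # valid (1990 is a made-up value)
# 2nd_name = 'Kaspar'       # invalid: running this line raises a SyntaxError

# Names are case-sensitive: name and Name are two different variables.
name = 'Kaspar'
# print(Name)               # raises a NameError, because only name exists

# Quotation marks decide whether Python sees text or a variable name.
print('name')   # prints the text: name
print(name)     # prints the contents of the variable: Kaspar
```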

## `Intermezzo++`
- multiple assignment
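
Multiple assignment could look like the sketch below. The names and values are placeholders chosen for this illustration:

```python
# Assign two values to two names in a single statement.
first_name, year_of_birth = 'Kaspar', 1990   # 1990 is a made-up value
print(first_name)      # Kaspar
print(year_of_birth)   # 1990

# Give several names the same value at once.
word_count = line_count = 0
print(word_count, line_count)   # 0 0

# Swap two variables without a temporary third one.
left, right = 'a', 'b'
left, right = right, left
print(left, right)   # b a
```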

## 3.4 Processing a text document

Intermezzo
- other packages
- libraries, documentation
- other types of text processing

## 3.5 Scaling up: processing a collection of text

## 3.6 Improving code: functions

## 3.7 Case study 1: Trends over time 

## 3.8 Case study 2: Work with semi-structured data