# Computational Text Analysis for Historians
## A gentle introduction to working with textual data in Python

# 1. Introduction 

Welcome to "Computational Text Analysis for Historians", a course that will teach you to process, analyse and "read" texts using a programming language (as opposed to a Graphical User Interface such as AntConc). In this notebook we give an overview of the course, set out its "philosophy" (or pedagogical principles) where we diverge from the usual coding courses, and explain why we think basic code literacy is an important skill for future historians.

## 1.1 Philosophy and Goals of the Course

### 1.1.2. Why this course?

#### Reading, writing and adapting code

In this course you will slowly become familiar with a programming language called Python (a reference to the comedy collective Monty Python, not the snake, which was beloved by the language's creator, Guido van Rossum). We say more about "Why Python?" below, but here we wish to explain how we intend to make the first explorations of a totally new field more digestible, even pleasurable, for historians.

First we have to address the common perception that coding is difficult. **It is not**, but it is a **different** type of intellectual activity, and historians generally possess less familiarity with it, or less of the prerequisite training needed to master this skill. In this course, we made the decision to show historians how learning Python could be useful to them, focussing in the first instance on applications before delving into the more intricate matter of writing code that is syntactically and semantically correct (we explain later what this means).

The first thing you will learn is to **read** and **adapt** code, before we turn our attention to independently **writing** code. This implies that you shouldn't try to understand everything completely at once, but should focus on how changing certain parts of the code affects the outcome, i.e. what happens if you add or remove lines in a given program. We hope that reading and adapting code will make you feel comfortable with programming languages, as the perceived difficulty, even anxiety, that history students often feel when confronted with digital tools is largely a cultural artefact, and artificial. If you can read Foucault, you can read Python.
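
To make "reading and adapting" concrete, here is a minimal sketch. The sentence and the names `sentence`, `words`, `counts` and `top_words` are our own illustration and not part of the course materials. Read it first, then try changing the number passed to `most_common`, or removing the `.lower()` call, and observe how the output changes.

```python
from collections import Counter

# A made-up example sentence (illustrative only)
sentence = "The history of the archive is the history of power"

# Lowercase the text and split it into words on whitespace
words = sentence.lower().split()

# Count how often each word occurs
counts = Counter(words)

# Keep the two most frequent words; try changing 2 to 3
top_words = counts.most_common(2)
print(top_words)
```

You do not need to understand every line yet: the point is to see that a small change in the code produces a visible change in the result.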

The reasons why we prioritize adapting code before we emphasise the skill of writing code are the following:
- Traditionally, programming courses (on which we lean heavily)
- We want to demonstrate realistic research scenarios that showcase how code contributes to a rigorous and reproducible way of doing digital history. By initially skipping some of the technical details, we hope that this course allows you to assess fairly quickly how coding and textual analysis could be useful to your research (maybe you will conclude that it isn't, and that is fine of course, but by going through these examples you will have saved a lot of time). The goal is to help you find your way into new territory and gain a basic understanding that helps you to teach yourself more effectively in the future. Within the scope of two lessons, one cannot become a proficient programmer and digital historian, but one can get a sense of direction, and we hope to provide you with the initial tools to grow in future.

The course is built around a few practical examples, which go from basic (reading and querying corpora) to more advanced topics (classifying documents, vector spaces and topic models).


In [1]:
## Why is code literacy important

In [None]:
## Why Python



Learn in cycles: it's OK to skim and go through the notebooks in multiple loops.

Not a Python course, but a course on analysing language with Python.


"What do I have to know about X"

targeted skills


Some basics, but 

Active learning

Targeted Learning

You can only learn to code by **doing**. Therefore this course consists of a series of **workshops** that teach you to tackle data-related problems of increasing complexity.

very focussed on text analysis, given less attention to

History





## source

- Party Manifesto
- Hansard
- Newspapers


The preceding 

# analyzing one document

# analyzing thousands of documents

# semi-structured data

# Advanced topic


This course is not an extensive introduction to programming, but it shows you a few rather common scenarios that allow you to hit the ground running when it comes to processing and analysing data.


## Why coding?



### How to get through this course

- Coding is **not** difficult, but obtaining basic programming skills requires a **sustained effort**.
- With only a few basic skills you can go a long way (writing scripts vs. developing tools).
- The full course, with all the details, is available [here](https://github.com/kasparvonbeelen/CTH2019) (but still under construction).
- It takes a while before you can do some more fancy stuff (you have to go through kindergarten again before you become a rocket scientist).

### 1.2 The Language of Choice: Python

#### **What** is Python?

[From Wikipedia](https://en.wikipedia.org/wiki/Python_(programming_language)): Python is a widely used **high-level** programming language for **general-purpose** programming.
- **High-level programming language**: In computer science, a high-level programming language is a programming language with **strong abstraction from the details of the computer**. In comparison to low-level programming languages, it may use **natural language elements**, be easier to use, or may **automate** (or even **hide** entirely) significant areas of computing systems (e.g. memory management), making the process of developing a program simpler and more understandable relative to a lower-level language. The amount of **abstraction** provided defines how "high-level" a programming language is.


#### **Why** Python?

In general, Python is **easier to learn and to read**. Let's look at a very simple example. 
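
The Python example this passage refers to is not shown at this point in the notebook. Presumably it is the one-line "Hello, world." program, sketched here under that assumption:

```python
# The entire program is one line: print a greeting to the screen
print("Hello, world.")
```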

Compare this to the C++ version of "Hello, world.", which looks like this:

```cpp
#include <iostream>

int main()
{
    // Write the greeting to standard output, followed by a newline
    std::cout << "Hello, world." << std::endl;
    return 0;
}
```


So, in general, the reasons why I teach **Python** are:

- Software **Quality**: Python code is designed to be **readable**, and hence reusable and maintainable. 
- Developer **Productivity**: Python code is typically one-third to one-fifth the size of C++ or Java code. 
- **Portability**: Python code typically runs unchanged on all major computer platforms (Windows, Linux, macOS).
- **General-purpose**: data analysis, web development etc.
- **Support Libraries**: Standard, homegrown and third-party libraries.
- **Widely used by the academic and scientific community!**
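
As a small illustration of the "support libraries" point, the sketch below uses only the `re` module from Python's standard library to pull four-digit years out of a sentence. The sample sentence and the names `text`, `year_pattern` and `years` are our own and purely illustrative; the pattern is deliberately naive and would also match any other four-digit number.

```python
import re

# An invented sample sentence (illustrative only)
text = "The Reform Acts of 1832, 1867 and 1884 extended the franchise."

# A naive pattern: exactly four digits standing on their own
year_pattern = r"\b\d{4}\b"

# findall returns every non-overlapping match as a list of strings
years = re.findall(year_pattern, text)
print(years)  # ['1832', '1867', '1884']
```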

## Your very first steps

Getting to know the notebook environment

# 3. Baby Python

For practising your coding skills, you can use the many **'code blocks'** in this Notebook, such as the grey cell below. Place your cursor inside the cell and press ``ctrl+enter`` to "run" or execute the code. Let's begin right away: run your first little program!

In [None]:
print('Hello, World!')

You've just executed your first program!

#### --Exercise--
- Can you describe what the program just did?
- Can you adapt it to print your name (with a greeting, i.e. "Hello, ...")?

Use the code block **below**.

In [None]:
# Insert your own code here!
# Print your own name ... or whatever you want, and press ctrl + enter

Besides printing words to your screen, you can use Python as a **calculator**. 

In [1]:
print(10)
print(5+9)
print(3*8)

10
14
24


> Please note that a string is always enclosed in **quotation** marks (which can be single *`'`* or double *`"`*), while a number (integer or float) is not.
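
The difference between a string and a number can be checked with the built-in `type` function. The sketch below is our own illustration (the names `as_string` and `as_integer` are hypothetical); it also shows that the `+` operator behaves differently for the two types.

```python
# The same digits written as a string and as an integer
as_string = '1066'
as_integer = 1066

print(type(as_string))   # <class 'str'>
print(type(as_integer))  # <class 'int'>

# + joins strings together, but adds numbers
print(as_string + as_string)    # 10661066
print(as_integer + as_integer)  # 2132
```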

#### --Exercise--

- print the number 5 as a string (i.e. with quotation marks)
- print the number 5 as an integer (i.e. without quotation marks)


In [None]:
# Insert code here

#### --Exercise--
Use the code block below to calculate (and print) how many minutes there are in one week.

**HINT**: use the multiplication operator **`*`** (e.g. `5*4*4`)
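
The same pattern applied to a different question, so as not to give the answer away: a minimal sketch that calculates the number of seconds in one day. The name `seconds_per_day` is our own choice.

```python
# 60 seconds per minute * 60 minutes per hour * 24 hours per day
seconds_per_day = 60 * 60 * 24
print(seconds_per_day)  # 86400
```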

In [None]:
# Insert code here

How many minutes have passed since your birth? Approximately, of course: just use your age (for example, how many minutes are there in 21 years?).

In [None]:
# Insert code here

# -1. Disclaimer
Many of the materials in this Notebook are gently borrowed from the following courses and books:

- **[Humanities Data Analysis: Case Studies with Python](https://press.princeton.edu/books/hardcover/9780691172361/humanities-data-analysis)** by Folgert Karsdorp, Mike Kestemont, and Allen Riddell
- **["How to Think Like a Computer Scientist"](http://www.greenteapress.com/thinkpython/thinkCSpy.pdf)** by Allen Downey, Jeffrey Elkner, Chris Meyers
- **[Coding the Humanities]()**
- **["A Python Course for the Humanities"](https://www.karsdorp.io/python-course/)**, a course designed by Folgert Karsdorp and Maarten van Gompel, and later modified by Mike Kestemont and Lars Wieneke for the course **["Programming for Linguistics and Literature"](https://github.com/mikekestemont/prog1617)**
- **["Python for text analysis"](https://github.com/cltl/python-for-text-analysis)** designed by H.D. van der Vliet and taught at the Vrije Universiteit


If things remain unclear, please go through the [NLTK Book](https://www.nltk.org/book/), Chapter 1, ["Language Processing and Python"](https://www.nltk.org/book/ch01.html).