[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kasparvonbeelen/ghi_python/blob/main/1%20-%20Introduction.ipynb)

# Text Mining for Historians with Python
## A gentle introduction to working with textual data in Python

### Created by Kaspar Beelen and Luke Blaxill

### Created for the German Historical Institute, London

<img align="left" src="https://www.ghil.ac.uk/typo3conf/ext/wacon_ghil/Resources/Public/Images/institute_icon_small.png">


In this notebook, we
- introduce the principles and aims of this course
- ensure you are set up properly and ready to explore!

# 1. Introduction 

Welcome to "Text Mining for Historians", a course on processing and analysing texts using a programming language. In this part of the course, you will learn the basics of coding in Python and basic techniques for processing text documents. We end the course by inspecting trends over time. In the second part of this course, we focus more on the analytical and statistical aspects of data mining.

At the end of this course, you should feel more comfortable with programming and have some basic knowledge of working with textual data.

This notebook provides an overview of the course and details its "philosophy" (or pedagogical principles), since we diverge from the usual coding courses. It also explains why we think code literacy is a crucial skill for future historians.

## 1.1 Philosophy and Goals of the Course

In this course, you will become acquainted with a programming language called Python. The name is a reference to the comedy group Monty Python (and hence not the snake), beloved by the creator of Python, Guido van Rossum. In the section "Why Python" we motivate this choice; here we wish to explain how we intend to make your (probably first) encounter with a programming language a more pleasurable (or less frustrating) and even fruitful experience.

### 1.1.1 Coding is not difficult 

(... or at least not much harder than reading 17th-century handwriting, to give just one example)

Historians tend to perceive coding as a difficult, if not alien and intimidating, skill that has no direct relevance to their discipline. In this course, we want to change this perception. Coding is not hard, but it is a **different** type of intellectual activity, and historians generally possess less of the prerequisite training and knowledge.

In this course (both Part I and II) we hope to make your first exploration of Python a good experience by prioritizing applications and spending less time on the intricacies and technicalities of programming. We assume you are not very interested in the art of coding, but in what it can do for your research. However, this does not mean we can skip all the fundamental and technical aspects. Compared to the AntConc course, the learning process will be slower, but you'll end up with a new and more fundamental skill set that provides endless new research opportunities. Therefore, we recommend developing both your AntConc and Python skills: the former to start text mining straight away, the latter to invest in new skills that allow you to go beyond the limitations of existing programs and design your own research more freely.
   
One important aim of the course is to make the first steps as easy as possible and avoid problems related to setup and installation. For this reason, the course is designed for Colab Notebooks, which run in the browser (not the scary terminal) and allow you to start coding straight away!



### 1.1.2 Learning to code is not a linear process

The first thing you learn is to **read** and **adapt** code, before we turn our attention to independently **writing** code. This implies that you shouldn't try to understand everything at once, but focus on certain parts of the code and observe what happens if you change specific variables or lines.
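As a small, hypothetical illustration of what "reading and adapting" can look like, consider the snippet below. The sentence and the variable names are our own invention for this sketch and are not taken from the later Notebooks. Try to guess what it prints before you run it.

```python
# A made-up example sentence; replace it with any text you like.
sentence = "The history of the city is the history of its people."

# The word we are looking for; try changing this to "history".
word = "the"

# Lowercase the sentence and split it into words on whitespace.
tokens = sentence.lower().split()

# Count how often the word occurs in the list of tokens.
count = tokens.count(word)

print(word, "occurs", count, "times")
```

You do not need to understand every line yet. Change the value of `word` or `sentence`, run the cell again, and observe how the output changes.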

We hope that reading and adapting code will make you feel comfortable with programming languages. The anxiety that historians often feel when confronted with digital tools is largely a cultural artefact. If you can read Foucault, you can read Python.

As opposed to AntConc, where you learn a handful of operations and then apply them to your own data and research, learning to code is not such a straightforward process. While we discuss some fundamental aspects of the Python syntax, just observing these aspects won't immediately make you a proficient coder. However, by practising you do develop an intuition and fluency over time. The important part here is to keep immersing yourself, looking at (parts of) the materials we provide here, but also at the links to more specialized discussions.

Lastly, please be aware that there is much we can't explain. We can give you a push in the right direction, but there is a lot to learn, and you will have to do that by yourself. Luckily, once you have a basic knowledge of Python, there are many great tutorials out there that will help you develop your skills.

If we enable you to learn, we've succeeded!



### 1.1.3 Learning to code is an iterative process

Traditional programming courses (on which we lean heavily) start with an overview of the fundamental elements of the Python language, such as variables, lists and dictionaries. However, a common problem when teaching historians (and other students in the Humanities) is that these intricate, technical details mostly come across as very abstract and (to be honest) irrelevant.


We want to demonstrate research applications and useful tools, showcasing how code contributes to a rigorous and reproducible way of doing digital history. By initially skipping some of the technical details, we hope that this course allows you to assess fairly quickly the relevance of coding and textual analysis to your research (maybe you conclude that it isn't relevant, and that is fine).

Instead of perceiving learning to code as a linear process, we propose a more **iterative** approach and advise you to go through the material multiple times. First, just go through the materials, look at the code, and maybe do some simple exercises to get a feeling for what's on offer. In the following iterations, you can start adapting the code yourself and visit some of the **`break out`** Notebooks, which focus on specific, often more technical topics.

The course is therefore set up as a **"Web of Notebooks"** that aims to enable a layered learning process, exploiting the affordances of this format as opposed to traditional books. For more information, please consult the section **"Learning Process"**.




### 1.1.4 Learning outcomes

At the end of Part I of the course you'll be able to:
- read and adapt Python code
- understand basic Python syntax
- manipulate texts
- load and write texts
- apply external libraries to text for more advanced processing
- write short programs that operate on a collection of texts

But more importantly, the goal of this course is to help you find your way into new territory and gain a basic understanding that enables you to teach yourself more effectively. Within the scope of two lessons, one cannot become a proficient programmer and digital historian, but one can get a sense of which skills to develop.
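To give a rough preview of what some of the outcomes listed above might look like in practice, here is a minimal sketch that writes, loads and processes a tiny collection of texts. The file names and sentences are invented for illustration, and the sketch only uses Python's standard library (no external libraries yet).

```python
import tempfile
from pathlib import Path

# Two tiny invented "documents" standing in for a real collection.
documents = {
    "speech_1.txt": "Free trade is the policy of peace.",
    "speech_2.txt": "Peace and retrenchment and reform!",
}

# Work in a temporary folder so that nothing is left behind on disk.
with tempfile.TemporaryDirectory() as folder:
    # Write each text to its own file.
    for name, text in documents.items():
        Path(folder, name).write_text(text, encoding="utf-8")

    # Load every .txt file again and count its words.
    word_counts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        words = text.lower().split()
        word_counts[path.name] = len(words)

print(word_counts)
```

Don't worry if most of this is still unclear at this point; it is only meant to show the kind of short program you will be able to read and adapt.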

Please hop on and enjoy the ride!

## 1.2 Learning Process

As already hinted at in the preceding section, this course teaches you practical coding skills in an iterative fashion. When going through the content for the first time, we invite you to just read the examples, run the code and inspect the outcomes.

At this stage we provide you with a few simple exercises, in which you mostly just have to reproduce or tweak existing code. This will (hopefully) slowly familiarize you with Python and take away the initial fear of interacting with computers via a programming language.

However, we leave a lot unexplained; don't worry about that initially. Once you have gone through the main Notebooks, you should have a rough understanding of how to work with (large collections of) texts programmatically.

If you remain intrigued and see some potential applications, we invite you to have a closer look. In the following iterations, look at the syntax more closely and focus on the **"Technical Break outs"**, which link to a more detailed discussion of Python syntax and data types.

## 1.3 Why Python

Python is a relatively simple programming language that is widely used in the academic research community, especially in the digital humanities.


In more technical terms, "simple" means "high-level".

[From Wikipedia](https://en.wikipedia.org/wiki/Python_(programming_language)): Python is a widely used **high-level** programming language for **general-purpose** programming.
- **high-level programming language**: In computer science, a high-level programming language is a programming language with **strong abstraction from the details of the computer**. In comparison to low-level programming languages, it may use **natural language elements**, be easier to use, or may **automate** (or even **hide** entirely) significant areas of computing systems (e.g. memory management), making the process of developing a program simpler and more understandable relative to a lower-level language. The amount of **abstraction** provided defines how "high-level" a programming language is.


What you have to remember from this excerpt is that, in general, Python is **easier to learn and to read**. Let's look at a very simple example. Let's assume we want the computer to print (on the screen) a simple message: "Hello world." In Python, the code for doing this is short and rather intuitive.

```python
print("Hello world.")
```

Compare this to the C++ version of "Hello, world.", which looks like this:

**C++ code below**

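```c++
// Standard header that provides std::cout and std::endl.
#include <iostream>

// Every C++ program starts in main(), which returns an integer.
int main()
{
    std::cout << "Hello, world." << std::endl;
    return 0;
}
```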

**End of C++ code**

Not only is Python simpler and more readable, it is also useful for many different types of tasks, i.e. it is a general-purpose language. This is in contrast to, for example, `R`, which is largely used for statistical analysis (even though, admittedly, it provides more and more text mining tools). We are initially interested in working with text data, but we'll also use Python for statistical analysis. In other words, if you have a basic knowledge of Python you can go in many directions: processing images, doing web development or focusing on data visualisation.
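As a minimal sketch of how text processing and simple statistics can sit side by side in Python, the example below counts words and computes an average word length using only the standard library. The sentence is invented, and the modules chosen here (`collections` and `statistics`) are just one possible option, not necessarily the tools used later in the course.

```python
from collections import Counter
from statistics import mean

# An invented sentence; any text would do.
text = "the cat sat on the mat and the dog sat on the rug"

# Text processing: split the text into words on whitespace.
words = text.split()

# Counting: how often does each word occur?
frequencies = Counter(words)
print(frequencies.most_common(2))

# A simple statistic: the average word length in characters.
average_length = mean(len(word) for word in words)
print(round(average_length, 2))
```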

Lastly, Python is widely used by the academic and scientific community! It is currently best practice to release code with papers, and the programming language of choice (as you can guess by now) is most likely Python. This means that if you read an interesting paper, you will often be able to reproduce its results, and if you encounter a new type of tool for data mining, there will likely be a Python implementation.

## 1.4 Using Colab Notebooks

Colab provides an online environment for writing executable code. It allows you to run Python code but also to document your program and process with text. A notebook consists of two types of cells:

- Code cells: these usually appear with a light-grey background. In these cells you write and execute Python code (which we'll do in a second).

- Text cells: these allow you to write and document your code with Markdown, a lightweight markup language. In fact, the cell you are reading right now is written in Markdown.

### --Exercise--
Double click on this cell to reveal the Markdown text.



Working in Markdown is in many ways similar to working in Word. You can format text in different ways.

With `Markdown` you can produce text in different

# Sizes
## From big
##### to small.

Print text in **bold**

... or *italics*!

Add [links](https://www.wikidata.org/)

... or insert images, like the GHI's logo!

![GHI Logo](https://www.ghil.ac.uk/typo3conf/ext/wacon_ghil/Resources/Public/Images/institute_icon_small.png)

| even | tables |
|------|------- |
|work | ! |



More information about Markdown syntax is available [here](https://www.markdownguide.org/basic-syntax/)

### -- Exercise --

1. Add a new `text cell`.
2. Print your name in bold
3. Add a link 

## 1.5 Baby Python

For practising your coding skills, you can use the many **'code blocks'** in this Notebook, such as the grey cell below. Place your cursor inside the cell and press ``ctrl+enter`` or click the "run" button next to the cell to execute the code. Let's begin right away: run your first little program!

![run_button](imgs/run.png)

```python
print('Hello, World!')
```

You've just executed your first program!

### --Exercise--
- Can you describe what the program just did?
- Can you adapt it to print your name (with a greeting, i.e. "Hello, ...")?

Use the code block **below**, remove the **comments** (the lines starting with a hash sign `#`), write your code and run it!

```python
# Insert your own code here!
# Print your own name ... or whatever you want, and press ctrl + enter
```

```python
print('Well done! You can rest now... or go to the second Notebook :-) .')
```

# Disclaimer

Many of the materials in this Notebook are gently borrowed from the following courses and books:

- **[Humanities Data Analysis: Case Studies with Python](https://press.princeton.edu/books/hardcover/9780691172361/humanities-data-analysis)** Folgert Karsdorp, Mike Kestemont, and Allen Riddell
- **["How to Think Like a Computer Scientist"](http://www.greenteapress.com/thinkpython/thinkCSpy.pdf)** by Allen Downey, Jeffrey Elkner, Chris Meyers
- **[Coding the Humanities]()**
- **["A Python Course for the Humanities"](https://www.karsdorp.io/python-course/)**, a course designed by Folgert Karsdorp and Maarten van Gompel, and later modified by Mike Kestemont and Lars Wieneke for the course **["Programming for Linguistics and Literature"](https://github.com/mikekestemont/prog1617)**
- **["Python for text analysis"](https://github.com/cltl/python-for-text-analysis)** designed by H.D. van der Vliet and taught at the Vrije Universiteit


If things remain unclear, please go through the [NLTK Book](https://www.nltk.org/book/), Chapter 1 ["Language Processing and Python"](https://www.nltk.org/book/ch01.html).