Before you start
- Download Anaconda: https://www.anaconda.com/download
        Choose Python 3.6 Version
        Follow the installation instructions
- Download the Notebook and data: https://github.com/kasparvonbeelen/Coding-the-Humanities
        Open Anaconda Navigator
        Launch Jupyter Notebook
        This should open a tab in your browser
        Go to the location where you cloned/unzipped the material downloaded from GitHub

# Humanities Computing with Python

Coding is not a gift but a skill acquired through practice. Coding is not the preserve of computer scientists. It has value for almost any type of research, **including the Humanities**. But for historians, linguists and philosophers, learning how to make efficient use of computers takes more time; programming may, therefore, seem a frustrating and pointless exercise at the beginning.

Practice and persistence are the key to success here. As with learning a natural language, you only become proficient in coding through **exercise** (by doing it). Code, sleep, repeat: this is the mantra of this course, which is very hands-on. You will have to write a lot of code yourself from the very beginning.

Theory is secondary; it is more important that you get a feel for coding.

Today
- Setting up the Anaconda Notebook Environment.
- Reading and analysing Parliamentary data with Pandas (a Python library).

### Hands-on? Run your own code

To practise your coding skills, use the many 'code blocks' you will encounter, such as the grey cell below. Place your cursor inside the cell and press ``ctrl+enter`` to "run" or execute the code. Let's begin right away: run your first little program!

In [None]:
print('Hello, World!')

### Questions:
- Can you describe what the programme just did?
- Can you adapt it to print your own name? (code block below)

In [None]:
# Insert your own code here!
# Use Python as a calculator
print(??*??)

## Teaching Method: Learning at Different Speeds

At this point, you may think that, if we continue at this speed, we won't be getting very far. A problem with learning how to code is the **distance between obtaining the basic skills (the tedious part) and applying them to real-world problems (the fun part)**.

Therefore, we will make some bigger jumps (higher-level functions from external libraries). I will explain later in more detail what this means.

Instead of following the classic sequence of 'variables', 'conditions', 'iterations' (and only introducing practical applications later), we will jump to more practical applications, such as performing emotion mining on a large set of JSON files.

So remain **Zen**, and don't worry if not everything is immediately clear.

In [None]:
# The Zen of Python. General coding guidelines.
import this

## The IPython Notebook Environment

The document you opened in your browser is an IPython Notebook, an interactive coding environment in your browser. It broadly consists of two different types of cells: ``Code`` and ``Markdown``.
- **``Code``** cells are reserved for running Python scripts.
- **``Markdown``** cells can be used for adding notes to your Notebook document.

The text you are reading now is written using Markdown.


**Click here**: the box should now be marked by a black rectangle. If you **double click**, the original [**Markdown**](https://en.wikipedia.org/wiki/Markdown) syntax appears, and you can add your own text.


**Exercise**: Let's try. Enter your name below, surrounded by ``**`` (two asterisks), to print it in bold type.

Hello, my name is [your name here]

Then press ``run cell``, or press ``ctrl+enter``

# Python and Support Libraries

## Python

#### **What** is Python?

Python is a widely used **high-level** programming language for **general-purpose** programming.
- **high level programming language**: In computer science, a high-level programming language is a programming language with **strong abstraction from the details of the computer**. In comparison to low-level programming languages, it may use natural language elements, be easier to use, or may automate (or even hide entirely) significant areas of computing systems (e.g. memory management), making the process of developing a program simpler and more understandable relative to a lower-level language. The amount of abstraction provided defines how "high-level" a programming language is.


#### **Why** Python? 

In general, Python is easier to learn and to read than lower-level languages such as C++. The first example in this Notebook illustrates this point. In C++, the hello world program looks like this:

C++ code below:
``
#include <iostream>

int main()
{
    std::cout << "Hello, world." << std::endl;
    return 0;
}

``

End of C++ code.

while the Python version was simply:

``
print("Hello, world.")
``


So, why **Python**:

- Software Quality: Python code is designed to be readable, and hence reusable and maintainable. 
- Developer Productivity: Python code is typically one-third to one-fifth the size of C++ or Java code. 
- Portability: Python code runs unchanged on all major computer platforms. 
- **General-purpose**: data analysis, web development etc.
- **Support Libraries**: Standard, homegrown and third-party libraries.
- **Widely used by the academic and scientific community!**

## Pandas

An example of such a third-party library that makes Python so powerful is [**Pandas**](https://pandas.pydata.org/), a.k.a. "Excel on steroids". Pandas facilitates data analysis, and we will use some of its functionality regularly in this lecture. Please check whether you installed Pandas properly by running the cell below.

In [None]:
# Check if the Pandas library is properly installed
import pandas as pd
print('It works!')

## Analyzing Parliamentary Speech with Python

The previous section covered the most basic elements of Python. You now know how to assign values to variables. A variable is a **box** that can contain almost anything. Below we will take some bigger steps: instead of strings and integers, we scrutinize **a whole corpus of parliamentary speeches**. Don't worry if the code seems difficult: it is hard the first time (and we did not cover all the Python basics). The point of this sudden acceleration is to demonstrate the power of coding, to show you that with relatively few lines of code you can accomplish a lot.

As an example, we use UK parliamentary speeches that mention immigration. We obtained these via the export function of [Political Mashup](http://search.politicalmashup.nl/).

The database is a [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file in which each item is a speech. The cell below shows the first speech of the collection.

Okay, let's have a closer look at the corpus. Pandas is a very useful library to load and interrogate data. Simply run the code below (and relax, you are not supposed to really understand everything, except maybe line 4).

In Python the `=` notation means assigning a value (an object Python can manipulate) to a variable (a "box") in which we save the value.

In [None]:
# assign the value 22 to a box with the name x
x = 22
# print the content of the box
print(x)

We can also assign a whole CSV Table to a variable. 

CSV stands for Comma Separated Values: a table in which, on each row, the cells are separated by commas.

Example of a CSV table:

``
A,B,C
D,E,F
``
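To make the link between a CSV text and a table concrete, here is a minimal sketch that reads a tiny CSV from a string instead of a file. The column names and rows are invented for illustration; only `read_csv()` itself comes from the text above.

```python
import io
import pandas as pd

# a tiny CSV table held in a string; column names and rows are made up
csv_text = "party,text\nLabour,First speech\nConservative,Second speech\n"

# io.StringIO lets read_csv treat the string as if it were a file
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the first line of the CSV becomes the column names
```

Each comma marks the border between two cells, and each new line starts a new row.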

We can load a whole CSV table in just one line, using the `read_csv()` function. In this table, each row contains a mention of a query term with limited context and metadata.

In [None]:
# import the pandas library
import pandas as pd

# read the CSV file and assign the table to the variable df
df = pd.read_csv('data/immigration_uk.csv',
                 header=0, # specify where the header is located
                 sep=",", # specify the delimiter
                 escapechar='\\', # quotes inside the text are escaped
                 )

# convert the 'date' column from text to dates (format: year-month-day);
# this replaces pd.datetime and date_parser, which recent pandas versions no longer support
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')


With these few lines, you managed to load the whole corpus of parliamentary speeches.

In [None]:
df.head(3)

**Exercise 1**: What information does the table contain?

In [None]:
# write in Markdown

**Exercise 2**: print the first 10 rows

In [None]:
# your code here

**Exercise 3**: You can count the number of speeches by wrapping the `len()` function around the `df` variable. Try it!

In [None]:
# your code here

## Exploring Data

Pandas allows you to inspect the data with the help of some descriptive statistics and plots. Run the cell below, otherwise the plots won't appear in the Notebook.

In [None]:
# Run this cell to plot all figures in the Notebook
%matplotlib inline
df.describe()

**Exercise 4**: What do these summary statistics mean?

In Pandas we can easily plot a histogram to show us the distribution of the values. Below is a figure with the distribution of the paragraphs_count variable.

In [None]:
df['paragraphs_count'].plot(kind='hist',bins=100)

**Exercise 5**: Can you explain in simple terms what the distribution plot actually shows?

The variable `score` equals 1 for each observation of the word immigration (i.e. it is 1 in every row). You can check that this is true with:

In [None]:
sum(df.score) == len(df)

Now let's make a bar plot that shows how the speeches are distributed over the different parties.

In [None]:
df['score'].groupby(df.party).sum().plot(kind='bar',alpha=0.75, rot=90)

Obviously, Labour and Conservatives are overrepresented. Let's just discard the other parties. In Pandas, there are different ways of selecting a subset of the data.

In [None]:
print(df.shape)
df_red = df[df.party.isin(['Labour','Conservative'])].copy() # copy, so we can safely add columns later
print(df_red.shape)

The expression `df.party.isin(['Labour','Conservative'])` means: select all rows for which the value of the column party is either 'Labour' or 'Conservative'.

Alternatively, use the `or` notation with `|`:

In [None]:
print(df.shape)
df_red = df[(df.party=='Labour') | (df.party=='Conservative')].copy() # copy, so we can safely add columns later
print(df_red.shape)
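The two ways of selecting rows should give the same subset. The sketch below checks this on a small invented table (the party 'Other' and the values are placeholders, not taken from the corpus).

```python
import pandas as pd

# toy table; 'Other' is a placeholder party name
toy = pd.DataFrame({'party': ['Labour', 'Conservative', 'Other', 'Labour'],
                    'score': [1, 1, 1, 1]})

# selection with isin(): keep rows whose party is in the list
with_isin = toy[toy.party.isin(['Labour', 'Conservative'])]
# selection with |: keep rows where either comparison is True
with_or = toy[(toy.party == 'Labour') | (toy.party == 'Conservative')]

print(len(with_isin))             # number of rows that were kept
print(with_isin.equals(with_or))  # True if both selections are identical
```

`isin()` tends to be more convenient when the list of values grows; `|` makes the logic of each comparison explicit.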

In Pandas, we can easily plot how different subsections of the data compare to each other. For example, let's plot the distribution of `paragraphs_count` by party.

In [None]:
df_red.loc[df_red.party=='Labour'].paragraphs_count.plot(kind='hist',bins=100)
df_red.loc[df_red.party=='Conservative'].paragraphs_count.plot(kind='hist',bins=100)

Or compare the means and standard deviations (spread around the mean).

In [None]:
import numpy as np
m_con = np.mean(df_red[df_red.party=='Conservative'].paragraphs_count)
std_con = np.std(df_red[df_red.party=='Conservative'].paragraphs_count) # standard deviation, not the mean
print('Mean = ',m_con,'Standard Deviation = ', std_con)

**Exercise 6**: Print the mean and standard deviation of the variable `paragraphs_count` for the Labour party

For closer inspection, you can sort the table by a certain column. 

In [None]:
long_sp = df_red.sort_values('paragraphs_count',ascending=False)[:10]
long_sp

**Exercise 7**: Reverse the sorting (from low to high). Tip: In Python the opposite of `False` is `True`.

## Vader Sentiment Analyzer
The variable `paragraphs_count` is not the most interesting one. Let's have a look at the sentiment values of these mentions of immigration.

For this we use **VADER**.

[from Github](https://github.com/cjhutto/vaderSentiment): VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool.

VADER uses a lexicon (a mapping of words to sentiment values, e.g. bad=-1.0, good=+1.0) to compute the sentiment (positivity or negativity) of a text.
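To get a feel for what "lexicon-based" means, here is a deliberately simplified sketch. It is not VADER's actual algorithm: the mini-lexicon, its values and the function `toy_sentiment` are all invented for illustration, and the rule-based part of VADER (for example its handling of negation) is left out entirely.

```python
# hypothetical mini-lexicon: words and values are invented for illustration
toy_lexicon = {'good': 1.0, 'bad': -1.0, 'great': 2.0, 'awful': -2.0}

def toy_sentiment(text):
    """Sum the lexicon values of the words in text; unknown words count as 0."""
    words = text.lower().split()
    return sum(toy_lexicon.get(word, 0.0) for word in words)

print(toy_sentiment('A good day'))             # one positive word
print(toy_sentiment('An awful and bad idea'))  # two negative words
```

A plain sum like this cannot tell "good" from "not good", which is one reason a real tool adds rules on top of the lexicon.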

In [None]:
# we need to install the vader lexicon first
import nltk
nltk.download('vader_lexicon')

Now load the VADER Sentiment analyzer

In [None]:
from nltk.sentiment import vader
analyzer = vader.SentimentIntensityAnalyzer()

Below you can test VADER yourself by changing the value of the ``text`` variable, and running the code block. 

Can you trick the system? Not very easy, is it?

In [None]:
text = "Not interesting."
sentiments_analysis = analyzer.polarity_scores(text)
print(sentiments_analysis)

We are interested here in the compound score, the combination of positive and negative sentiments. We can select it by putting the string 'compound' between square brackets.

In [None]:
sentiments_analysis['compound']

**Exercise 8**: Select and print the `neg` and `pos` values for two snippets of text.

Now we can easily calculate the sentiment of each reference to immigration.

In [None]:
# this defines a function that calculates the sentiment and returns the compound sentiment
compound_sentiment = lambda x: analyzer.polarity_scores(x)['compound']
# here we apply this function to each cell in the 'text' column
df_red['compound_sentiment'] = df_red['text'].apply(compound_sentiment)

`df_red['text'].apply(compound_sentiment)` can almost be read as natural language: to all cells in the 'text' column, `apply` the `compound_sentiment` function. We assign these values to a new column with the name `compound_sentiment`.
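The same pattern works with any function that takes one cell and returns one value. The sketch below uses an invented table and a hypothetical stand-in function, `word_count`, in place of the sentiment calculation.

```python
import pandas as pd

# toy table with invented texts
toy = pd.DataFrame({'text': ['Short text', 'A slightly longer text']})

# hypothetical stand-in for the sentiment function: counts the words in a text
word_count = lambda x: len(x.split())

# apply the function to every cell of 'text' and store the results in a new column
toy['word_count'] = toy['text'].apply(word_count)
print(toy)
```

The new column has one value per row, in the same order as the original texts.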

In [None]:
df_red.head(3)

**Exercise 9**: Plot the distribution of the sentiment values (tip: copy the code for creating the histograms).

Now we can sort the table by the sentiment value in each row.

**Exercise 10**: Revisit the sorting of `paragraphs_count` above. Now create two variables, `negative_speeches` and `positive_speeches`.

You can print the party and the text with the `iterrows()` method:

In [None]:
for index,row in negative_speeches.iterrows():
    print('Party: ',row.party,'\n','Text: ',row.text)

Now we can compute if Conservatives are more negative about immigration than Labour MPs.

In [None]:
df_red.groupby('party')['compound_sentiment'].mean()

`groupby` simply groups all the rows by the distinct values of the specified column. Here it groups all rows by party. In other words, it temporarily splits the table into 'Labour' and 'Conservative' camps. Then we select the `compound_sentiment` column of each of these subgroups and compute the mean with `mean()`.
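The split-then-summarise idea is easier to follow on a table small enough to check by hand. The sentiment values below are invented, not taken from the corpus.

```python
import pandas as pd

# toy table with invented sentiment values
toy = pd.DataFrame({'party': ['Labour', 'Labour', 'Conservative', 'Conservative'],
                    'compound_sentiment': [0.5, 0.1, -0.2, 0.4]})

# split the rows by party, select one column, and average it within each group
means = toy.groupby('party')['compound_sentiment'].mean()
print(means)
```

The result has one row per party: the average of 0.5 and 0.1 for Labour, and the average of -0.2 and 0.4 for Conservative.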

**Exercise 11**: Is there a difference between the role a speaker takes up and the sentiment he or she expresses on immigration? 

We can plot the distribution of sentiment scores for the different parties, but this largely confirms the simple statistics generated above (this time we use a density plot to better compare the two groups):

In [None]:
df_red[(df_red.party=='Labour')].compound_sentiment.plot(kind='kde')
df_red[(df_red.party=='Conservative')].compound_sentiment.plot(kind='kde')

**Exercise 12**: Plot distributions for the different roles.

## Studying changes in content 

The code below allows you to search for specific words in the speeches corpus.

In [None]:
contains_word = lambda x,w: x.lower().find(w)

In [None]:
print('lala crime criminal'.find('crime'))
print('lala '.find('crime'))

`find()` returns the start position of the word you are searching for. If the word is not found, it returns -1. 
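Two details of `find()` are worth seeing in action: a match at the very beginning of a text gives position 0, and the search matches parts of words too. The example sentences are invented; `contains_word` is the same function as defined above.

```python
# same definition as above: lower-case the text, then look for the word
contains_word = lambda x, w: x.lower().find(w)

print(contains_word('Crime is rising', 'crime'))            # match at the very start
print(contains_word('Many crimes were reported', 'crime'))  # match inside 'crimes'
print(contains_word('Nothing to report', 'crime'))          # no match
```

Because 0 is a valid position, "found" has to be tested by comparing with -1 (or checking for a value of zero or more), not by checking whether the number is positive.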

We can now apply the `contains_word` function to all the speeches and append these results as a new column.

In [None]:
df_red['contains_crime'] = df_red['text'].apply(contains_word,w='crime')
df_red.head(3)

To select all speeches with the word crime, we take those rows whose value for 'contains_crime' is higher than or equal to zero.

In [None]:
about_crime = df_red[df_red.contains_crime >= 0]

**Exercise 13**: How many speeches about crime are there?

Instead of just matching one pattern, we can make a function that counts how often the words from a given list appear in a text.

Below is a function that counts how often the words in a list occur. A lot is happening here, so we dissect the code below.

In [None]:
from collections import Counter
from nltk import word_tokenize

def count_words_from_list(text,words2count):
    tokens = word_tokenize(text.lower()) # convert string to tokens
    wordfreq = Counter(tokens) # count the tokens, i.e. map tokens to their frequency
    counter = 0
    for w in words2count:
        counter+=wordfreq.get(w,0)
    
    return counter

Let's start with the example text: 'Crime is no fun kids. No, stay of the drugs you little criminals!' and print what the code does at each stage.

In [None]:
text = 'Crime is no fun kids. No, stay of the drugs you little criminals!'
text_lower = text.lower()
print(text_lower)
tokens = word_tokenize(text_lower)
print(tokens)
wordfreq = Counter(tokens)
print(wordfreq)

After calculating the word frequencies, we can count those we are interested in.

In [None]:
# create a counter variable that keeps track of the word frequencies
counter = 0

words2count = ['crime','criminals']
for w in words2count:
    # add the frequency of word w to the counter; if the word is not found, add zero
    counter += wordfreq.get(w,0)

print(counter)

Let's study how often migration is associated with crime over the years. We look at the words 'crime' and 'criminal' (and their plurals). We apply the `count_words_from_list` function we created earlier to the corpus.

In [None]:
words2count = ['crime','criminal','crimes','criminals']

In [None]:
df_red['contains_crime'] = df_red['text'].apply(count_words_from_list,
                                                            words2count=words2count)


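**Exercise 14**: Do Conservatives mention crime more than Labour MPs? You can answer this by using the `groupby` function. Revisit the sentiment example:

`df_red.groupby('party')['compound_sentiment'].mean()`

Instead of `compound_sentiment` select the `contains_crime` column, and instead of `mean()` use `sum()` (`count()` would only count the rows per party, not the crime mentions).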

**Exercise 15**: Sort the table by how often the fragments mention crime-related words. Select and inspect the 20 highest ranked.

We can now plot the mentions of crime over time:

In [None]:
df_red.groupby('date')['contains_crime'].sum().plot()

In [None]:
df_red[df_red.party=='Labour'].groupby('date')['contains_crime'].sum().plot()
df_red[df_red.party=='Conservative'].groupby('date')['contains_crime'].sum().plot()

To plot the result nicely by year:

In [None]:
df_red[df_red.party=='Labour'].groupby(df_red.date.map(lambda x: x.year))['contains_crime'].sum().plot(color='r')
df_red[df_red.party=='Conservative'].groupby(df_red.date.map(lambda x: x.year))['contains_crime'].sum().plot(color='b')
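Grouping by year means turning every date into its year and then summing within each year. The sketch below shows this on an invented table; it uses the `.dt.year` accessor, which should give the same years as `map(lambda x: x.year)` in the cell above.

```python
import pandas as pd

# toy table with invented dates and counts
toy = pd.DataFrame({'date': pd.to_datetime(['2001-03-01', '2001-07-15', '2002-01-10']),
                    'contains_crime': [1, 2, 4]})

# extract the year from each date and sum the counts per year
per_year = toy.groupby(toy.date.dt.year)['contains_crime'].sum()
print(per_year)
```

The two rows from 2001 are merged into one total, which is what smooths the jagged day-by-day line into a yearly trend.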

Instead of looking at the absolute frequencies, it is better to study the relative frequencies (crime words as a share of all words).

In [None]:
def token_length(text):
    return len(word_tokenize(text))

In [None]:
df_red['fragment_length'] = df_red['text'].apply(token_length)

In [None]:
crimes = df_red.groupby('date')['contains_crime'].sum()
speeches = df_red.groupby('date')['fragment_length'].sum()
(crimes / speeches).plot()
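Dividing one grouped result by another works because Pandas lines up the two results by date. A small invented table makes the arithmetic visible; the numbers are placeholders, not corpus values.

```python
import pandas as pd

# toy table: two fragments on the first date, one on the second (invented numbers)
toy = pd.DataFrame({'date': pd.to_datetime(['2001-03-01', '2001-03-01', '2002-01-10']),
                    'contains_crime': [1, 1, 1],
                    'fragment_length': [10, 30, 10]})

crimes = toy.groupby('date')['contains_crime'].sum()    # crime words per date
lengths = toy.groupby('date')['fragment_length'].sum()  # all words per date
relative = crimes / lengths                             # share of crime words per date
print(relative)
```

The first date has more crime words in absolute terms (2 against 1) but a lower share (2 out of 40 against 1 out of 10), which is exactly the distortion that normalising by length removes.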

Now split the timeline by party:

In [None]:
crimes_lab = df_red[df_red.party=='Labour'].groupby(df_red.date.map(lambda x: x.year))['contains_crime'].sum()
speeches_lab = df_red[df_red.party=='Labour'].groupby(df_red.date.map(lambda x: x.year))['fragment_length'].sum()

crimes_con = df_red[df_red.party=='Conservative'].groupby(df_red.date.map(lambda x: x.year))['contains_crime'].sum()
speeches_con = df_red[df_red.party=='Conservative'].groupby(df_red.date.map(lambda x: x.year))['fragment_length'].sum()
(crimes_con / speeches_con).plot(color='b')
(crimes_lab / speeches_lab).plot(color='r')

How to interpret this timeline?

In [None]:
((crimes_con / speeches_con) - (crimes_lab / speeches_lab)).plot()