# Overview
-- Inspired by **A Python Course for the Humanities**, a course designed by Folgert Karsdorp and Maarten van Gompel [PH], and a later version modified by Mike Kestemont and Lars Wieneke [MK]. Other material is taken from ``How to Think Like a Computer Scientist`` [CS].

# Coding the Humanities

Coding is not a gift but a skill acquired through practice. Nor is coding the preserve of computer scientists: it has value for almost any type of research, including the Humanities. But for historians, linguists, and philosophers, learning how to make efficient use of computers takes more time; programming may therefore seem a frustrating and pointless exercise at the beginning.

Practice and persistence are the key to success here. As with learning a natural language, you only become proficient in coding through exercise. Code, sleep, repeat: this is the mantra of this course, which is extremely hands-on. You will have to write a lot of programming code yourself from the very beginning.

Theory is secondary; what matters more is that you get a feel for coding.

To practise your coding skills, you can use the many 'code blocks' you will encounter below, such as the grey block that follows. Place your cursor inside the block and press ``ctrl+enter`` to "run" or execute the code. Let's begin right away: run your first little program! [MK]

In [None]:
print('Hello World!')

### Questions:
- Can you describe what the program just did?
- Can you adapt it to print your own name? (code block below)

In [None]:
# Insert your own code here!
# print your own name ... or whatever you want.

### What is a program?

``A program is a sequence of instructions that specifies how to perform a computation. The computation might be something mathematical, such as solving a system of equations or finding the roots of a polynomial, but it can also be a symbolic computation, such as searching and replacing text in a document or (strangely enough) compiling a program.

**input**: Get data from the keyboard, a file, or some other device.

**output**: Display data on the screen or send data to a file or other device.

**math**: Perform basic mathematical operations like addition and multiplication.

**conditional execution**: Check for certain conditions and execute the appropriate sequence of statements.

**repetition**: Perform some action repeatedly, usually with some variation.
`` [CS]
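As a rough illustration of these five basic instructions, the sketch below counts how often a word occurs in a sentence. The sentence and the variable names are made up for this example, and the hard-coded string merely stands in for real input from a keyboard or file.

```python
# input: a hard-coded sentence stands in for data from a keyboard or file
sentence = "to be or not to be"
words = sentence.split()

count = 0
for word in words:          # repetition: look at every word in turn
    if word == "be":        # conditional execution: only act on matching words
        count = count + 1   # math: add one to the running total

print(count)                # output: show the result on the screen
```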

**Question**: Why is ``print('Hello World!')`` a program?

[MK] Apart from printing words to your screen, you can also use Python to do calculations. 

* Use the code block below to calculate (and print) how many minutes there are in seven weeks. (Hint: multiplication is done using the `*` symbol in Python.)

In [None]:
# Insert your own code here!
# Use Python as a calculator

## Teaching Method: Learning at Different Speeds

At this point, you may think that, if we continue at this speed, we won't be getting very far. A problem with learning how to code is the distance between obtaining the basic skills and applying them to actual real-world problems.

Therefore, we focus on the basic elements of the programming language we discuss (Python), but also present you with some more advanced use-cases that show you where we are heading with this course. We combine taking small steps (elementary Python) with bigger jumps (higher level functions from external libraries).

Instead of following the classic sequence of 'variables', 'conditions', 'iterations' (and only introducing practical applications later), we will jump immediately to more practical applications, such as performing emotion mining on a large set of JSON files.

So remain **Zen**, and don't worry if not everything is immediately clear.

In [None]:
# The Zen of Python. General coding guidelines.
import this

## The IPython Notebook Environment

The document you opened in your browser is an IPython Notebook, an interactive coding environment in your browser. It broadly consists of two different types of cells: ``Code`` and ``Markdown``.
- **``Code``** cells are reserved for running Python scripts.
- **``Markdown``** cells can be used for adding notes to your Notebook document.


**Click here**: the box should now be marked by a black rectangle. If you **double click**, the original [**Markdown**](https://en.wikipedia.org/wiki/Markdown) syntax appears, and you can add your own text.


**Exercise**: Let's try. Enter your name below, surrounded by ``**`` (two asterisks), to print it in bold type.

Hello, my name is [your name here]

Then press ``run cell``, or press ``ctrl+enter``

You can add a cell by going to
``Insert >> Insert Cell Below``

An empty cell should appear.

**Exercise**: Add two cells below. In one, print your name with the Python ``print`` function. In the other, write your name in italic (single asterisk) and bold type.

You can always delete a cell by clicking on one and going to
``Edit >> Delete Cells``

# Python and Support Libraries

## Python

#### **What** is Python?

[FROM WIKIPEDIA] Python is a widely used **high-level** programming language for **general-purpose** programming.
- **high-level programming language**: In computer science, a high-level programming language is a programming language with **strong abstraction from the details of the computer**. In comparison to low-level programming languages, it may use natural language elements, be easier to use, or may automate (or even hide entirely) significant areas of computing systems (e.g. memory management), making the process of developing a program simpler and more understandable relative to a lower-level language. The amount of abstraction provided defines how "high-level" a programming language is.


#### **Why** Python? [LU+CS]

[CS] In general, Python is easier to learn and to read. The first example in this Notebook illustrates this point. In C++, the hello world program looks like:
``
#include <iostream.h>

void main()

{
    
    cout << "Hello, world." << endl;

}

``
while the Python version was simply:

``
print("Hello, world.")
``



- Software Quality: Python code is designed to be readable, and hence reusable and maintainable. 
- Developer Productivity: Python code is typically one-third to one-fifth the size of C++ or Java code. 
- Portability: Python code runs unchanged on all major computer platforms. 
- **Support Libraries**: Standard, homegrown and third-party libraries.
- **Widely used by the academic and scientific community!**

## Pandas

An example of a third-party library that makes Python so powerful is [**Pandas**](https://pandas.pydata.org/), a.k.a. Excel on steroids. Pandas facilitates many data-analysis tasks, and we will use it often below. Please check that you installed Pandas properly by running the cell below.

In [None]:
# Check if the Pandas library is properly installed
import pandas as pd

### Before we start, something about versions
Python 2.7 or 3.x?

In [None]:
# Check the version by making Python crash hard.
# Press ctrl+enter; under Python 3 this should raise a SyntaxError
print "Hello, World."
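If you prefer not to crash the interpreter, a gentler check is possible. The sketch below uses the standard ``sys`` module to report which major version is running; the variable name ``major`` is only chosen for illustration.

```python
import sys

# sys.version_info holds the version of the running interpreter
major = sys.version_info.major
print("You are running Python", major)
```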

### Errors
[CS] Programming is a complex process, and because it is done by human beings, 
it often leads to errors. For whimsical reasons, programming errors are called 
bugs and the process of tracking them down and correcting them is called
debugging.
Three kinds of errors can occur in a program: 

- Syntax errors
- Runtime errors
- Semantic errors

We won't go into details here. What matters more is how to solve these errors. Where to look for help? Generally, copy-pasting the error message into the Google search box will lead you to useful solutions. A very useful resource is [Stack Overflow](https://stackoverflow.com/). For example, typing

``
Stack Overflow SyntaxError: Missing parentheses in call to 'print'.
``

...will show you the [correct answer](https://stackoverflow.com/questions/25445439/what-does-syntaxerror-missing-parentheses-in-call-to-print-mean-in-python) to your question.
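To make the three kinds of errors a little more concrete, here is a small sketch. The snippets and variable names are invented for illustration; ``try``/``except`` is used only so that the cell keeps running after each error, and is not something you need to master yet.

```python
# 1. Syntax error: Python cannot even read the code (a Python 2 style print).
try:
    compile('print "Hello, World."', "<example>", "exec")
except SyntaxError as err:
    print("Syntax error:", err.msg)

# 2. Runtime error: the code is readable, but fails while running.
try:
    result = 1 / 0
except ZeroDivisionError as err:
    print("Runtime error:", err)

# 3. Semantic error: the code runs, but does not do what we meant.
# We wanted the average of 2 and 4, but forgot the parentheses.
wrong_average = 2 + 4 / 2
right_average = (2 + 4) / 2
print(wrong_average, right_average)
```

Semantic errors are the hardest to spot, because Python does not complain: only the result is wrong.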

# Ok, let's start!
... for real.

## Variables, expressions and statements
[MK] 
Below we will have a closer look at the Trump Twitter Archive. We use Twitter to study which topics his followers pay most attention to.
How can we inspect and process all this candy? To understand this, we first have to introduce some basic Python vocabulary, namely **values, variables, and statements**, the elementary building blocks of your eventual programs.

### Values

In [None]:
# run this cell
print(2)
print("Hello, World!")
print(0.5)

[CS] A value is one of the fundamental things, like a letter or a number, that a program manipulates. The values we have seen so far are 2, "Hello, World!", and 0.5.

### Variables
One of the most powerful features of a programming language is the ability to manipulate variables. A variable is a name that refers to a value. The assignment statement creates new variables and relates them to concrete values:

In [None]:
# declaring a variable
x = 'Hello World.'
# printing what is in the box
print(x)

In [None]:
# declaring a variable
x = 3
# printing what is in the box
print(x)

[MK] If you vaguely remember your math classes in school, this should look familiar. It is basically the same notation, with the name of the variable on the left, the value on the right, and the = sign in the middle.

In the code block above, two things happen. First, we fill `x` with a value, in our case `3`. This variable x behaves pretty much like a **box** on which we write an `x` with a thick, black marker to find it back later. We then print the contents of this box, using the `print()` command. 

* Now copy the outcome of your code calculating the number of minutes in seven weeks and assign this number to `x`. Run the code again.

The box metaphor for a variable goes a long way: in such a box you can put whatever value you want, e.g. the number of minutes in seven weeks. When you re-assign a variable, you remove the content of the box and  put something new in it. In Python, the term **'variable'** refers to such a box, whereas the term **'value'** refers to what is inside this box.
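The re-assignment described above can be sketched as follows. The variable name ``box`` and the values are only examples.

```python
box = 3          # put the value 3 in the box
print(box)
box = "three"    # re-assign: the old content is replaced by a new value
print(box)
```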

In [None]:
# Can you understand the difference between these assignments?
x = "x"
y = y
y = x
# Running this code will produce an error. Can you explain why?

### Statements and Operators

A **statement** is an instruction that the Python interpreter can execute. We have
seen two kinds of statements: ``print`` and assignment (``=``).

In [None]:
# write here two statements: a print and an assignment

A **script** usually contains a sequence of statements.

We can transform variables and values by applying operators to them. 
[CS] The symbols ``+``, ``-``, and ``/``, and the use of parentheses for grouping, mean in
Python what they mean in mathematics. The asterisk (``*``) is the symbol for
multiplication, and ``**`` is the symbol for exponentiation.

In [None]:
# Using Python as a calculator
x = 4
y = 2
print(x+y)
print(x-y)
print(x/y)
print(x*2)
print(y**2)
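Parentheses change the order in which operators are applied, just as in mathematics. The numbers and variable names below are arbitrary and only meant as a sketch.

```python
without_parentheses = 2 + 3 * 4    # multiplication happens first
with_parentheses = (2 + 3) * 4     # parentheses force the addition first
power = 2 ** 3                     # exponentiation: 2 * 2 * 2

print(without_parentheses)  # 14
print(with_parentheses)     # 20
print(power)                # 8
```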

### Types

You can see that ``x/y`` is not an integer. Numbers with a "dot" are called ``floats``. We won't go into details here but you should be aware that Python includes different "native" data types. Why is this important? Try to run the following block.

In [None]:
print(1+"1")

This should raise a TypeError. The error message points out that you cannot combine values of different types. To check whether values belong to the same ``type``, simply wrap the ``type`` function around the value. At this point you probably won't know what the previous sentence means, but (to give you the answer) try to run variations of the following in the cell below: ``print(type(put any value here))``

In [None]:
# print the type of a value

You can force a value to change its type. Run the cell below. Can you explain the difference between the first and the second print statement?

In [None]:
print(str(1)+str("1"))
print(int(1)+int("1"))
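The same idea works for other types. As a small sketch with made-up values, ``float`` and ``str`` convert in the same way as ``int``, and Python complains when a conversion makes no sense.

```python
half = float("0.5")      # the string "0.5" becomes the number 0.5
year = str(2017)         # the number 2017 becomes the string "2017"
print(half + 1)
print("Year: " + year)

# Not every string can be turned into a number
try:
    int("seven")
except ValueError as err:
    print("Cannot convert:", err)
```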

## Difference between strings and numbers

Let's have a closer look at the 'str' type (str stands for string).

In [None]:
first_name = "Kaspar"
last_name = "Beelen"
print(first_name+last_name)

This last operation is called string **concatenation**.

In [None]:
# print last name neatly

In [None]:
# try other operators on the first and last names, what works, what does not? can you explain why?

## One last thing before we really, really kick off
### Let me introduce to you: The List

Instead of assigning one item (an integer or a string) to a variable, you can assign a collection of items, a ``list``. There are more collections in Python, but for now we only look at lists.

A list is demarcated by square brackets.

In [None]:
# a variable can also refer to a list
# a list is defined by square brackets, items are separated by commas
l1 = [1,2,3]
l2 = ['my','name','is','kaspar']
l3 = ['kaspar','is','gr',8]

In [None]:
print(l2)
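Two things are worth trying with a list: asking for its length and adding an item. The sketch below uses the built-in ``len`` function and the list method ``append``; the list contents are just examples.

```python
words = ['my', 'name', 'is', 'kaspar']
print(len(words))         # number of items in the list

words.append('beelen')    # add one item to the end of the list
print(words)
print(len(words))
```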

In [None]:
# try to assign a list to a variable. Try to come up with something useful.

## What we learnt so far [MK]
To finish this section, here is an overview of what we have learnt. Go through the list and make sure you understand all the concepts.

- variable
- value
- assignment operator (=)
- difference between variables and values
- integers vs. floats
- operators for multiplication (*), subtraction (-), addition (+), division (/)
- string concatenation
- print()


# Loading JSON Data

The previous section covered the most basic elements of Python. You learned how to assign values to variables. A variable is a box that can contain almost anything. Below we will take some bigger steps: instead of strings and integers, we scrutinize a whole corpus of Tweets. Don't worry if the code seems difficult, because it is hard the first time. The point of this sudden acceleration is to demonstrate the power of coding: to show you how much you can do with relatively few lines of code.

As an example we used all tweets of the current American President. These we obtained via the [Trump Twitter Archive](http://www.trumptwitterarchive.com/archive).

The database is a [JSON](https://en.wikipedia.org/wiki/JSON) file in which each item is a tweet. The text below shows the first tweet of the collection. It may seem difficult to read JSON notation, but there are various tools to help you. Go for example to this [JSON viewer](http://jsonviewer.stack.hu/) and copy-paste the text shown below into it.

``{
    "source":"Twitter for iPhone",
    "text":"The Tax Cut Bill is coming along very well, great support. With just a few changes, some mathematical, the middle class and job producers can get even more in actual dollars and savings and the pass through provision becomes simpler and really works well!",
    "created_at":"Mon Nov 27 14:24:36 +0000 2017",
    "retweet_count":15663,
    "favorite_count":79868,
    "is_retweet":false,
    "id_str":"935152378747195392"}``

Inspect the JSON file. What information is in there, what is missing? What type of questions could one answer using these data? Just FYI, the information per tweet is actually larger. Inspect the "example.json" in the previously mentioned [viewer](http://jsonviewer.stack.hu/).
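If you want to see how Python itself reads such a record, the standard ``json`` module can turn the text into a Python dictionary. This is only a sketch: the tweet text is shortened here, and the variable names ``raw`` and ``tweet`` are chosen for illustration.

```python
import json

# A shortened copy of the tweet shown above (the "text" field is truncated)
raw = """{
    "source": "Twitter for iPhone",
    "text": "The Tax Cut Bill is coming along very well, great support.",
    "created_at": "Mon Nov 27 14:24:36 +0000 2017",
    "retweet_count": 15663,
    "favorite_count": 79868,
    "is_retweet": false,
    "id_str": "935152378747195392"
}"""

tweet = json.loads(raw)          # parse the JSON text into a dictionary
print(type(tweet))
print(tweet["source"])
print(tweet["retweet_count"])
print(tweet["is_retweet"])       # JSON false becomes Python False
```

Note that the counts arrive as integers while ``id_str`` stays a string, which matters as soon as you want to do calculations with them.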

Okay, let's have a closer look at the corpus, which includes all tweets after the inauguration. Pandas is a very useful library to load and interrogate data. Simply run the code below (and relax, you are not supposed to really understand everything, except maybe line 4).

In [None]:
# import the pandas library
import pandas as pd
# read the JSON corpus, or: put all tweets in a box called tweets
tweets = pd.read_json('data/trump2.json')
# ignore for now, this simply uses the moment of posting as an index
tweets.set_index('created_at',inplace=True)
# keep only the tweets posted by Trump himself (i.e. exclude retweets)
tweets = tweets[tweets.is_retweet==False]
# print the first five rows
tweets.head(5)

With these few lines, you managed to load the whole corpus of Trump tweets.

In [None]:
# Exercise 1: print the first 10 rows
# Exercise 2: What is the type of the tweets variable?

In [None]:
# Exercise 3: You can count the number of tweets
# by wrapping the "len()" function around the "tweets" variable. Try it!

## Exploring Data

Pandas allows you to inspect the data with the help of some descriptive statistics and plots. Make sure you also run the ``%matplotlib inline`` cell further down, otherwise the plots won't appear in the Notebook.

In [None]:
tweets.describe()

Does this table give you an overview of the whole dataset?

In [None]:
# Run this cell to plot all figures in the Notebook
%matplotlib inline

An easy way to study the popularity of Trump is to plot the number of retweets over time.

In [None]:
# Get the summary statistics for the retweet_count column
tweets.retweet_count.describe()

In [None]:
# Plot retweets over time
tweets['retweet_count'].plot()

In [None]:
from datetime import datetime
to_month = lambda x: datetime(x.year,x.month,1)
to_day = lambda x: datetime(x.year,x.month,x.day)
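The two ``lambda`` functions above are short, nameless functions. Each takes a timestamp and throws away the detail we do not need, so that tweets from the same month (or day) end up in the same group. A small sketch on a single, hand-picked timestamp (the variable ``posted`` is just an example):

```python
from datetime import datetime

to_month = lambda x: datetime(x.year, x.month, 1)
to_day = lambda x: datetime(x.year, x.month, x.day)

posted = datetime(2017, 11, 27, 14, 24, 36)   # example timestamp
print(to_month(posted))   # 2017-11-01 00:00:00
print(to_day(posted))     # 2017-11-27 00:00:00
```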

In [None]:
# plot by month or day by replacing the lambda functions
tweets['retweet_count'].groupby(to_day).mean().plot()

**Question**: Changing the unit of analysis (month or day) leads to different figures. Which, do you think, is most interesting?

A **histogram** gives an indication of the distribution of the values.

In [None]:
tweets['retweet_count'].plot(kind='hist',bins=100)

For closer inspection, you can sort the table by a certain column. 

In [None]:
tweets.sort_values('retweet_count',ascending=False)[:5]

#### Exercise
Make plots and sort the data, but this time for the **"favorite_count"** column.

In [None]:
# Add your code here

## Vader Sentiment Analyzer
[from Github](https://github.com/cjhutto/vaderSentiment): VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

VADER uses a lexicon (a mapping of words to sentiment values, e.g. bad=-1.0, good=+1.0) to compute the sentiment (positivity or negativity) of a text.
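To get a feel for how a lexicon-based approach works, here is a deliberately naive sketch. It is not VADER's actual algorithm: the four-word lexicon, the scores and the function ``toy_sentiment`` are all invented for illustration, and VADER applies many additional rules on top of its lexicon.

```python
toy_lexicon = {"good": 1.0, "happy": 1.0, "bad": -1.0, "sad": -1.0}

def toy_sentiment(text):
    # split the text into lower-cased words and strip simple punctuation
    words = [w.strip(".,!?") for w in text.lower().split()]
    # look up every word; words missing from the lexicon count as 0.0
    return sum(toy_lexicon.get(w, 0.0) for w in words)

print(toy_sentiment("I am very very very happy! Oh so happy."))  # 2.0
print(toy_sentiment("This is not bad at all."))                   # -1.0
```

The second sentence shows why simple word counting is easy to trick: "not bad" is scored as negative because the toy version ignores the word "not".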

In [None]:
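# VADER needs its lexicon: if this cell fails with a LookupError,
# run `import nltk; nltk.download('vader_lexicon')` once
from nltk.sentiment import vader
analyzer = vader.SentimentIntensityAnalyzer()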

Below you can test VADER yourself by changing the value of the ``text`` variable, and running the code block. 

Can you trick the system? Not very easy, is it?!

In [None]:
text = "I am very very very happy! Oh so happy."
vs = analyzer.polarity_scores(text)['compound']
print("{:_<65} {}".format(text, str(vs)))
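The format string in the cell above may look cryptic. ``{:_<65}`` means: left-align the text in a field of 65 characters and pad the remainder with underscores. A shorter sketch with a width of 10 (the values are made up):

```python
line = "{:_<10} {}".format("happy", 0.5)
print(line)   # happy_____ 0.5
```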

Now we can easily calculate the sentiment of Trump's tweets. 

In [None]:
compound_sentiment = lambda x: analyzer.polarity_scores(x)['compound']
tweets['compound_sentiment'] = tweets['text'].apply(compound_sentiment)

### Exercises

In [None]:
# print the first ten rows of the tweets table
# you should see a new column

In [None]:
# make a timeline and histogram for the compound sentiment column

## Indexing and slicing

You might have wondered what the square brackets in ``tweets.sort_values('retweet_count',ascending=False)``**[:5]** actually mean. Changing the number ``5`` changes, as you have undoubtedly noticed, the number of rows shown in the Notebook. What we did here was to **slice** the data, demarcated by the **indices** ``0`` and ``5``.

To properly understand these operations, let's have a look at how these operations work on simple strings.

In [None]:
song = "Naturkatastrophenkonzert"
print(song)

[MK, partly a recap] Such a piece of text ("Naturkatastrophenkonzert") is called a ``string`` in Python (cf. a string of characters). Strings in Python must always be enclosed with 'quotes' (either **single** or **double** quotes). *Without those quotes, Python will think it's dealing with the name of some variable that has been defined earlier, because variable names never take quotes.* The following distinction is confusing, but extremely important: variable names (without quotes) and string values (with quotes) look similar, but they serve a completely different purpose. Compare:

In [None]:
name = "Doris"
Doris = "name"
D = "D"
print(name)
print(Doris)
print(D)

Now that you know the difference between variables and string values, we can inspect these strings further. [MK] Strings are called strings because they consist of a series (or ``'string'``) of individual characters. We can access these individual characters in Python with the help of **``'indexing'``**, because each character in a string has a unique **``'index'``**. To print the first letter of the song title, you can type:

In [None]:
song_startswith = song[0]
print(song)
print(song_startswith)


### Or try:
#print("Starts with: ", song_startswith)

[MK] How does indexing work exactly?

![Indexes of the string Monty Python starting with 0](https://i.stack.imgur.com/vIKaD.png)
Take a look at the string "Monty Python". We use the index `0` to access the first character in the string. This might seem odd, but remember that all indexes in Python start at zero. Whenever you count in Python, you start at `0` instead of `1`. Note that the space character gets an index too, namely 5. This is something you will have to get used to!
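The picture can be checked in code. A small sketch using the same string (the variable name ``title`` is just an example):

```python
title = "Monty Python"
print(title[0])            # M
print(title[4])            # y
print(title[5] == " ")     # True: the space has its own index
print(title[6])            # P
print(len(title))          # 12 characters, so the last index is 11
```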

Because you know the length of the string, you can ask for the last letter of "Naturkatastrophenkonzert":

In [None]:
last_letter = song[# fill in the last index of "Naturkatastrophenkonzert" (tip indexes start at 0)]
print(last_letter)

[MK] It is rather inconvenient to have to know how long a string is if we want to find out what its last letter is. Python provides a simple way of accessing a string 'from the rear':

In [None]:
# Exercise print the last letter of your name

In [None]:
last_letter = song[-1]
print(last_letter)

[MK] Now can you write some code that defines a variable `but_last_letter` and assigns to this variable the *last but one* letter of your name?

In [None]:
name = # enter your name as a string

In [None]:
but_last_letter = name[# insert your code here]
print(but_last_letter)

[MK] You're starting to become a real expert in indexing strings. Now what if we would like to find out what the last two or three letters of our name are? In Python we can use so-called 'slice-indexes' or 'slices' for short. To find the first two letters of our name we type in:

In [None]:
first_two_letters = name[0:2]
print(first_two_letters)

The `0` index is optional, so we could just as well type in `name[:2]`. This says: take all characters of name until you reach index 2 (i.e. up to the third letter, but not including the third letter). We can also start at index 2 and leave the end index unspecified:

In [None]:
without_first_two_letters = name[2:]
print(without_first_two_letters)

Because we did not specify the end index, Python continues until it reaches the end of our string. If we would like to find out what the last two letters of our name are, we can type in:

In [None]:
last_two_letters = name[-2:]
print(last_two_letters)
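To summarise the slicing rules in one place, here is a sketch on a fixed string, so that the results do not depend on your own name:

```python
title = "Monty Python"
print(title[0:2])    # Mo
print(title[:2])     # Mo
print(title[2:])     # nty Python
print(title[-2:])    # on
print(title[6:9])    # Pyt
```

In every case the start index is included and the end index is excluded.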

Indexing and slicing also apply to data types other than strings, although the exact mechanics may differ.
First we inspect the data type of the Twitter corpus.

In [None]:
type(tweets)

...which, as you can see, is a pandas DataFrame object. However, we can still select items using a syntax similar to the one we used for strings.

In [None]:
# fetch a row with index 0
print(tweets.iloc[0])
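As a library-free illustration that indexing and slicing are not limited to strings, the same syntax also works on a list. The list of numbers below is made up for this example.

```python
retweets = [15663, 420, 98, 73021, 5]
print(retweets[0])      # first item
print(retweets[-1])     # last item
print(retweets[1:3])    # items at index 1 and 2
print(retweets[-2:])    # last two items
```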

In [None]:
# Exercise get the last ten rows

If we want to find the most popular tweet, we can sort the rows by ``retweet_count`` and take the first row.

In [None]:
sorted_by_retweet = tweets.sort_values('retweet_count',ascending=False)
print(sorted_by_retweet.iloc[0])

You can read the tweets using the following expressions:

In [None]:
print(sorted_by_retweet.iloc[0]['text'])

In [None]:
print(sorted_by_retweet.iloc[0].text)

Using the slicing technique we can retrieve the ten most popular tweets.

In [None]:
sorted_by_retweet.iloc[0:10]['text']

To read the full text, run the code below.

In [None]:
for r in sorted_by_retweet.iloc[0:10]['text']:
    print(r)
    print()

### Exercise
Print the ten most positive and ten most negative tweets

In [None]:
# Add your code here

### You're done for today!
If you have some energy left, play around a bit with the code below

The code below allows you to search for specific words in the Twitter corpus.

In [None]:
# True if the lower-cased tweet text contains the word w
contains_word = lambda x, w: w in x.lower()
tweets['contains_obama'] = tweets['text'].apply(contains_word, w='obama')
about_obama = tweets[tweets.contains_obama]
len(about_obama)
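A word of caution about ``str.find``: it returns the position of the first match, which is ``0`` when the text starts with the word and ``-1`` when the word is absent. The ``in`` operator gives a plain ``True`` or ``False`` instead. A sketch with invented example sentences:

```python
starts_with = "Obama said so".lower()
in_middle = "Thank you Obama".lower()
absent = "Make America Great Again".lower()

print(starts_with.find("obama"))   # 0
print(in_middle.find("obama"))     # 10
print(absent.find("obama"))        # -1

print("obama" in starts_with)      # True
print("obama" in absent)           # False
```

A test such as ``find(...) > 0`` would therefore overlook any tweet that begins with the search word.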

How does Trump use uppercase?

In [None]:
def count_upper(text):
    # share of uppercase characters in a tweet (0.0 for an empty text)
    if not text:
        return 0.0
    uppers = []
    for char in text:
        if char.isupper():
            uppers.append(char)
    return len(uppers)/len(text)

tweets = tweets[tweets.is_retweet==False]
tweets['uppers'] = tweets.text.apply(count_upper)
tweets.sort_values('uppers',ascending=False)[:10]
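To see what ``count_upper`` computes, you can try it on a few short strings. The function is repeated here in a more compact form so that the sketch runs on its own; the example strings are invented.

```python
def count_upper(text):
    # share of characters in the text that are uppercase
    uppers = [char for char in text if char.isupper()]
    return len(uppers) / len(text)

print(count_upper("SAD!"))        # 0.75
print(count_upper("Very sad"))    # 0.125
print(count_upper("so sad"))      # 0.0
```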

In [None]:
from IPython.display import HTML
def css_styling():
    # load the custom stylesheet that ships with this Notebook
    with open("styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()