# Python scripts - brown bag session 02

This Jupyter notebook contains the introductory presentation on Python scripts given to our colleagues at the Leibniz Institute of European History.

Authors:  
Jaap Geraerts   
Demi Vasques

# RUNNING SCRIPTS

## Introduction

![Image of code](https://dollars-bbs.org/suggestions/src/1445876922671.jpg)

## Content

**1. What are scripts and how can they assist our historical research?**

**2. Where to find scripts?**

**3. How to run and use scripts?**

## What is a script?

 1. Difference between **mark-up languages** (e.g. HTML, XML) and **programming languages**
    
    *HTML example*
    
    <HTML>

        <HEAD>

        <TITLE>Your Title Here</TITLE>

    </HEAD>

    <BODY BGCOLOR="FFFFFF">

    <CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"> </CENTER>

    <HR>

    <a href="http://somegreatsite.com">Link Name</a>

    is a link to another nifty site

    <H1>This is a Header</H1>

    <H2>This is a Medium Header</H2>

    Send me mail at <a href="mailto:support@yourcompany.com">

    support@yourcompany.com</a>.

    </BODY>

    </HTML>
    
    
    *XML example*
    
    <note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
    </note>
    

**Programming language**: 

* "A programming language is a *formal language*, which comprises a *set of instructions* that produce various kinds of output" (https://en.wikipedia.org/wiki/Programming_language)
* "A programming language is a set of commands, instructions, and other syntax use to create a *software program*" (https://techterms.com/definition/programming_language)

The *source code* of a program comprises all the commands, instructions, etc. of which the program consists.

In [None]:
print("This is a simple print instruction")  # example of a simple print instruction

2. Scripts

"A computer script is a *list of commands* that are executed by a certain program or scripting engine" (https://techterms.com/definition/script)

A script is a self-contained piece of code that can be executed to produce a certain result. The possibilities are seemingly endless: scripts can be created for pretty much anything. For example, a script can ask the user for a certain input, or it can be applied to existing files.
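
As a rough illustration of what such a self-contained script might look like, the sketch below counts how often each word occurs in a piece of text. The file name `word_count.py`, the function `count_words` and the sample sentence are hypothetical and were made up for this example; in practice the text would more likely be read from an existing file or typed in by the user.

```python
# word_count.py: a hypothetical, minimal script (not taken from the session)
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lower-cased word to its frequency."""
    words = text.lower().split()
    return Counter(words)


if __name__ == "__main__":
    # A short sample sentence stands in for data from a file or from user input.
    sample = "the letter was sent and the letter was received"
    for word, frequency in count_words(sample).most_common(3):
        print(word, frequency)
```

Saved as a file, a script like this is typically run from a terminal with `python word_count.py`.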

In order to understand scripts, one has to understand the programming language in which they are written. Learning a programming language can be very time-consuming and is partly comparable to learning a human language (albeit with other dimensions). However, **even without being an expert, one can learn to find, understand, and use scripts.** What matters is having some background knowledge of a programming language, so that you can understand what a script does (which also enables you, with a bit of effort, to modify a script). Knowledge of a programming language can be acquired through (online) tutorials and courses. For the programming language called **Python**, see e.g.

* https://www.tutorialspoint.com/python/
* https://www.w3schools.com/python/
* https://www.pythoncentral.io/
* https://stackify.com/learn-python-tutorials/ (an overview with Python tutorials)

NB: The quality and quantity of documentation differ per programming language, depending partly on how long a language has been around and on the extent to which it has been adopted by non-specialists.


Possible applications of scripts for historical research (there are a great many):

* ingest, gather, and transform information
* analyse and manipulate information

Useful to apply scripts when:

* working with large datasets
* working with clean and uniform data
* the outcome of the script is fairly straightforward

Perhaps better not to use scripts when:

* working with a small dataset
* working with unstructured and/or complicated data

## Where to find scripts?

1. How to acquire scripts?

    * Create a script ex nihilo
    * Use the work of others ('Beter goed gejat dan slecht bedacht', Dutch for 'It is better to steal something good than to invent something bad')
    * Modify or add to existing script


2. Where to find them?

This is a question that is more easily raised than answered, for there are several ways of finding scripts. One can check online resources which contain code examples and 'snippets', such as:

* https://www.pythonforbeginners.com/code-snippets-source-code/python-code-examples
* https://www.pythonforbeginners.com/code-snippets-source-code/python-code-snippets-2

Such an approach is rather hit-or-miss. Targeted searches in Google are much more efficient and return specific items on fora such as *Stack Overflow*, *GitHub* and other **community initiatives** (very helpful for open-source languages such as Python).

*Stack Overflow*

* Forum
* Ask questions and respond to questions about computer programming
* See https://stackoverflow.com/

*Example*

Finding a script which converts DDMMYYYY (e.g. 01072019) into DD/MM/YYYY. Searching Google for 'python convert date stack overflow' leads to https://stackoverflow.com/questions/1745042/convert-date-python/1745082. Here, one of the users provides a script that converts a date written as MMDDYY into MM/DD/YYYY (given here in Python 3 syntax):

    import datetime
    date = datetime.datetime.strptime("111609", "%m%d%y")
    print(date.strftime("%m/%d/%Y"))
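
The snippet above expects MMDDYY input. For a DDMMYYYY value such as 01072019, the format codes have to be adjusted. The sketch below shows one possible adaptation; the function name `convert_date` and its parameters are illustrative choices and not part of the Stack Overflow answer.

```python
import datetime


def convert_date(raw, in_format="%d%m%Y", out_format="%d/%m/%Y"):
    """Parse `raw` according to `in_format` and return it in `out_format`."""
    parsed = datetime.datetime.strptime(raw, in_format)
    return parsed.strftime(out_format)


# %d = day, %m = month, %Y = four-digit year
print(convert_date("01072019"))  # prints 01/07/2019
```

Because the formats are parameters, the same function also covers the original case: `convert_date("111609", "%m%d%y", "%m/%d/%Y")`.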

The question is: once we have found a script, how do we run and actually use it?

## How to run and use scripts?

There are several ways of running scripts, depending on the programming language and on one's preference.  

**Our focus**: 

* **Python** programming language (https://www.python.org/)
* **Jupyter Notebook**, a web application in which Python code is written and run interactively. "The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more." (https://jupyter.org/)
* Easy installation with **Anaconda**, a Python (and R) distribution that facilitates package management (https://www.anaconda.com/). Many packages for data analysis are already included in this distribution.

### Data Analysis with Python - Basics (adapted from http://swcarpentry.github.io/python-novice-inflammation/)

#### 1. Variables

A variable is a name under which a value is stored, so that the value can be reused later.

In [None]:
# It is possible to perform direct calculations using any Python interpreter. For instance:
13 + 8

In [None]:
# However, this is not very useful for data analysis, so we assign values to a variable,
# and we can use that variable at any time:
birth_year = 1769
death_year = 1821

In [None]:
# Then we can perform calculations, like:
print('Napoleon Bonaparte lived for', death_year - birth_year, 'years')

#### 2. Types of data

Python deals with various types of data. Three common ones are:

* integer numbers
* floating point numbers, and
* strings

In the previous example, **birth_year** and **death_year** were **integer numbers**. **Floating point numbers** include a decimal point, and calculations with this type of data are just as easy to perform. With **strings**, however, things get a little more complicated.

In [10]:
birth_date = '08/15/1769'
death_date = '05/05/1821'

In [11]:
# Now the dates have more information and are stored as STRINGS instead of INTEGERS. If we try the same, we will get an error:
print('Napoleon Bonaparte lived for', death_date - birth_date, 'years')

TypeError: unsupported operand type(s) for -: 'str' and 'str'

In [12]:
# We need to manipulate the data
import datetime as dt  # importing a module that deals with dates
date_format = "%m/%d/%Y" # defining the standard date format

# assuring our data is in the right date format
birth_date = dt.datetime.strptime(birth_date, date_format).date() 
death_date = dt.datetime.strptime(death_date, date_format).date()

# getting our result
print('Napoleon Bonaparte lived for', (death_date - birth_date).days, 'days')

Napoleon Bonaparte lived for 18890 days
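
To check which type a value has, Python offers the built-in `type()` function. The short sketch below applies it to the kinds of values used so far; the variable names `mid_year` and `birth_date_text` are introduced here purely for illustration.

```python
import datetime as dt

birth_year = 1769                          # an integer
death_year = 1821                          # an integer
mid_year = (birth_year + death_year) / 2   # the / operator yields a float
birth_date_text = '08/15/1769'             # a string
birth_date = dt.datetime.strptime(birth_date_text, '%m/%d/%Y').date()

# print each value together with the name of its type
for value in (birth_year, mid_year, birth_date_text, birth_date):
    print(repr(value), 'has type', type(value).__name__)
```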


#### 3. Storing data

Besides variables, there are other ways of storing data in Python. The two most common are:

* lists
* dictionaries

While lists store multiple values (but only values!), dictionaries store values that are associated with keys. Let's see an example:

In [13]:
# below we have two lists: one with historical actors and the other with their birth dates
historical_actors = ['Isabella I of Castile','Napoleon Bonaparte','Catherine the Great','Martin Luther','Queen Victoria']
birth_dates = ['04/22/1451','08/15/1769','05/02/1729','11/10/1483','05/24/1819']

In [14]:
# we can create a single dictionary with these two lists, with historical actors as keys and
# their respective birth dates as values
birth_actors = {'Isabella I of Castile':'04/22/1451','Napoleon Bonaparte':'08/15/1769','Catherine the Great':'05/02/1729',
                'Martin Luther':'11/10/1483','Queen Victoria':'05/24/1819'}
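
Typing out a dictionary by hand becomes tedious for longer lists. One possible shortcut, sketched below, pairs the two lists with the built-in `zip()` function and passes the pairs to `dict()`. This assumes that both lists are in the same order and have the same length.

```python
historical_actors = ['Isabella I of Castile', 'Napoleon Bonaparte', 'Catherine the Great',
                     'Martin Luther', 'Queen Victoria']
birth_dates = ['04/22/1451', '08/15/1769', '05/02/1729', '11/10/1483', '05/24/1819']

# zip() pairs the first name with the first date, the second with the second, etc.
birth_actors = dict(zip(historical_actors, birth_dates))

# a value is looked up through its key
print(birth_actors['Martin Luther'])  # prints 11/10/1483
```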

#### 4. Manipulating data

Two very basic and also very practical ways of manipulating data are:

* slicing
* for loops

##### Slicing

**Very, very important** - In Python the index (the position of a value) starts at 0, so when slicing we have to keep this in mind! The first value of a list, for instance, has index 0!

In [22]:
# the entire list (our dataset)
print(historical_actors)

['Isabella I of Castile', 'Napoleon Bonaparte', 'Catherine the Great', 'Martin Luther', 'Queen Victoria']


In [20]:
# one may be interested only in the first three values of the dataset
print(historical_actors[0:3]) # the first limit (the value before the ':') is included, but the second limit is not!
print(historical_actors[:3]) 

['Isabella I of Castile', 'Napoleon Bonaparte', 'Catherine the Great']
['Isabella I of Castile', 'Napoleon Bonaparte', 'Catherine the Great']


In [18]:
# or perhaps, in the last four values of the data
print(historical_actors[-4:])

['Napoleon Bonaparte', 'Catherine the Great', 'Martin Luther', 'Queen Victoria']


In [21]:
# or only in the values at specific positions
print(historical_actors[1:3])
print(historical_actors[2:4])

['Napoleon Bonaparte', 'Catherine the Great']
['Catherine the Great', 'Martin Luther']
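
Slicing is not limited to lists: strings can be sliced in the same way, character by character. As a small sketch, the parts of a date stored as a string can be picked out by position. The variable name `birth_date_text` is only illustrative, and the approach assumes that every date follows the MM/DD/YYYY layout exactly.

```python
birth_date_text = '08/15/1769'

month = birth_date_text[0:2]   # characters at index 0 and 1
day = birth_date_text[3:5]     # characters at index 3 and 4
year = birth_date_text[-4:]    # the last four characters

print(day, month, year)  # prints 15 08 1769
```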


##### For loops

A for loop is used when we want to repeat the same task several times, for instance once for every value in the dataset. There are two main ways of writing for loops:

In [25]:
# first, we can 'call' values directly
for actor in historical_actors:
    print(actor)

Isabella I of Castile
Napoleon Bonaparte
Catherine the Great
Martin Luther
Queen Victoria


In [26]:
# second, we can 'call' values using their indexes 
# this is particularly useful when we have more than one list, for example

for i in range(len(historical_actors)): 
    print('The birth of', historical_actors[i], 'was on', birth_dates[i])
    
# this reads as: for every index in the range of the length of the list containing the historical actors,
# print their name and their birthdate

The birth of Isabella I of Castile was on 04/22/1451
The birth of Napoleon Bonaparte was on 08/15/1769
The birth of Catherine the Great was on 05/02/1729
The birth of Martin Luther was on 11/10/1483
The birth of Queen Victoria was on 05/24/1819
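
As an alternative to looping over indexes, the built-in `zip()` function can walk through two lists in parallel. The sketch below prints the same sentences as the loop above, assuming both lists have the same length and order.

```python
historical_actors = ['Isabella I of Castile', 'Napoleon Bonaparte', 'Catherine the Great',
                     'Martin Luther', 'Queen Victoria']
birth_dates = ['04/22/1451', '08/15/1769', '05/02/1729', '11/10/1483', '05/24/1819']

# each round of the loop yields one actor together with the matching birth date
for actor, birth_date in zip(historical_actors, birth_dates):
    print('The birth of', actor, 'was on', birth_date)
```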


#### 5. Importing data

* CSV files
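
As a minimal sketch of importing CSV data, the example below uses the `csv` module from Python's standard library. To keep it self-contained, a short piece of text stands in for a real file; the column names `actor` and `birth_date` and the file name mentioned in the comment are hypothetical.

```python
import csv
import io

# An in-memory stand-in for a CSV file; the first line holds the column names.
csv_text = """actor,birth_date
Napoleon Bonaparte,08/15/1769
Martin Luther,11/10/1483
"""

# With a real file, io.StringIO(csv_text) would be replaced by something like
# open('actors.csv', newline='')  (hypothetical file name).
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)  # every row becomes a dictionary keyed by column name

for row in rows:
    print(row['actor'], 'was born on', row['birth_date'])
```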

#### 6. Scripts

* changing date format
* getting coordinates (or creating a network)

#### 7. Visualisation

* network of letters exchanged