# Teach the Teacher workshop 

*25 January 2021*


## A guessing game

The cell below contains the code for a simple game in which the user has to guess a number between 1 and 50.

The built-in function input() requests a value from the user. The function int() converts this input into an integer, if possible. The code also makes use of ‘if’ and of ‘while’: the loop keeps asking for a new guess until the number has been found. When the user enters a value which is too low or too high, this is communicated to the user via a print statement.

To run the code in the cell below, place the cursor in the cell and press [Shift] + [Enter].

In [None]:
import random

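# Upper bound of the range; the name 'max' is avoided because it would hide the built-in function
max_number = 50
# randint() includes both end points, so the range starts at 1 to match the prompt
number = random.randint(1, max_number)

guess = int( input( f"Guess a number between 1 and {max_number}: ") )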

while guess != number:

    if guess > number:
        print("Lower!")
    elif guess < number:
        print("Higher!")
    guess = int( input("Guess again ... \n") )

print("The correct number is indeed {}.".format(number) )
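
The game above stops with an error when the user types something that int() cannot convert. The sketch below shows one possible way to check the input first. The function name `parse_guess` and its parameters are assumptions made for this illustration and are not part of the original game.

```python
def parse_guess(raw, lowest=1, highest=50):
    """Return the guess as an integer, or None if it is not a whole number in range."""
    try:
        value = int(raw.strip())
    except ValueError:
        # int() raises ValueError for text such as 'forty-two'
        return None
    if lowest <= value <= highest:
        return value
    return None


print(parse_guess("42"))
print(parse_guess("forty-two"))
print(parse_guess("99"))
```

In the game, the result could be tested with `is None` before the guess is compared with the number.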

## Corpora and Open Data

* [Project Gutenberg](https://www.gutenberg.org/)
* [Distant Reading E-COST](https://github.com/distantreading/distantreading.github.io)
* [DBNL](https://dbnl.nl/)
* [Text Creation Partnership](https://github.com/textcreationpartnership/Texts)
* [WikiData](https://www.wikidata.org/)
* [Folger Shakespeare Digital Library](https://shakespeare.folger.edu/download/)

## Bulk downloads

Files can obviously be downloaded manually via their URL. 

See, for example, https://www.gutenberg.org/ebooks/98

When the number of files to acquire becomes very large, it can be more efficient to write a program which can download files in bulk. 

In [None]:
import requests
import re

gutenberg_files = {
    'http://www.gutenberg.org/files/158/158-0.txt':'Emma',
    'http://www.gutenberg.org/files/161/161-0.txt':'Sense and Sensibility',
    'http://www.gutenberg.org/files/1342/1342-0.txt':'Pride and Prejudice'
}


for url in gutenberg_files:
    print("Downloading " + gutenberg_files[url] + " ...")
    response = requests.get(url)
    title = re.sub( r'\s+' , '_' ,  gutenberg_files[url] )

    if response:
        response.encoding = 'utf-8'
        lines = re.split( r'\n' , response.text )
        flag = 0 
        full_text = ''
        
        for line in lines:
            # Stop copying at the END marker, so that the marker line itself is left out
            if re.search( r'\*{3,}\s+END\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK' ,  str(line) , re.IGNORECASE ):
                flag = 0

            if flag == 1:
                full_text += line + '\n'

            # Start copying on the line after the START marker
            if re.search( r'\*{3,}\s+START\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK' ,  str(line) , re.IGNORECASE ):
                flag = 1
        full_text = full_text.strip()
        if re.search( r'^Produced by' , full_text , re.IGNORECASE ):
            full_text = full_text[ full_text.index('\n') : len(full_text) ]

            
        # The 'with' statement closes the file automatically, even if writing fails
        with open( title , 'w' , encoding = 'utf-8') as out:
            out.write( full_text.strip() )

print('\nDone!')    
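
The step that removes the Project Gutenberg header and footer can also be tried without downloading anything. The sketch below applies the same START and END patterns to a short, made-up sample text. The function name `strip_gutenberg_boilerplate` and the sample are assumptions for illustration only.

```python
import re

START = re.compile(r'\*{3,}\s+START\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK', re.IGNORECASE)
END = re.compile(r'\*{3,}\s+END\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK', re.IGNORECASE)


def strip_gutenberg_boilerplate(text):
    """Return only the lines between the START and the END marker lines."""
    body = []
    inside = False
    for line in text.splitlines():
        if END.search(line):
            inside = False
        elif START.search(line):
            inside = True
        elif inside:
            body.append(line)
    return '\n'.join(body).strip()


# Made-up sample which imitates the layout of a Project Gutenberg text file
sample = """Title: Example
*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
Chapter 1
It was a dark night.
*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
Licence text"""

print(strip_gutenberg_boilerplate(sample))
```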

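The code below reads [a CSV file](https://raw.githubusercontent.com/peterverhaar/introduction_to_dh/main/gutenberg_metadata.csv) with metadata about Project Gutenberg texts and prints the author, the title and the URL of all the works by Dickens. The part of the code that has been commented out selects works on the basis of their subject instead.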

In [None]:
import pandas as pd
import re

github = 'https://raw.githubusercontent.com/peterverhaar/introduction_to_dh/main/'


md = pd.read_csv( github + 'gutenberg_metadata.csv')

for index,row in md.iterrows():
   
    if re.search( r'Dickens' , str( row['author'] ) , re.IGNORECASE ):
        print( f"{row['author']}\n{row['title']}\n{row['url']}\n\n" )
    
    '''
    if re.search( r'Gothic' , str( row['subject'] ) , re.IGNORECASE ):
        print( f"{row['author']}\n{row['title']}\n{row['url']}\n\n" )
    '''    
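
The selection of rows can also be tried on a small scale. The sketch below uses Python's built-in csv module instead of pandas, and a made-up sample of two rows instead of the real metadata file. The column names follow the code above; the function name `select_rows` and the way the authors are written are assumptions.

```python
import csv
import io
import re

# Made-up sample standing in for gutenberg_metadata.csv
sample_csv = """author,title,url
"Dickens, Charles",A Tale of Two Cities,https://www.gutenberg.org/ebooks/98
"Austen, Jane",Emma,http://www.gutenberg.org/files/158/158-0.txt
"""


def select_rows(csv_text, column, pattern):
    """Return the rows in which the given column matches the regular expression."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if re.search(pattern, row[column], re.IGNORECASE)]


for row in select_rows(sample_csv, 'author', r'dickens'):
    print(row['author'], '|', row['title'], '|', row['url'])
```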
        


## Wikipedia API

See, for example: https://en.wikipedia.org/w/api.php?action=opensearch&search=leiden&limit=30&format=json 


In [None]:
import re
import requests

baseURL = 'https://en.wikipedia.org/w/api.php?action=opensearch'
search_term = "leiden"
# Spaces are not allowed in a URL, so they are replaced by the code %20
search_term = re.sub( r'\s+' , '%20' , search_term )

limit = 30
# Named 'response_format' because 'format' is the name of a built-in function
response_format = 'json'

apiCall = '{}&search={}&limit={}&format={}'.format( baseURL, search_term , limit , response_format )
print( f'URL of the API call: { apiCall } \n')

responseData = requests.get( apiCall )

wikiResults = responseData.json()


for i in range( 0 , len(wikiResults[1]) ):
    print( 'Title: ' + wikiResults[1][i] )
    print( 'Tagline: ' + wikiResults[2][i] )
    print( 'Url: ' + wikiResults[3][i] + '\n')
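
The URL of an API call can also be assembled with `urlencode` from the standard library, which takes care of spaces and other special characters. The sketch below builds such a URL and reads a made-up response with the same four-part layout as the one processed above. The function names and the sample response are assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def build_opensearch_url(search_term, limit=30):
    """Build the URL of an opensearch call from its parameters."""
    params = {'action': 'opensearch', 'search': search_term, 'limit': limit, 'format': 'json'}
    return 'https://en.wikipedia.org/w/api.php?' + urlencode(params)


def pair_titles_and_urls(results):
    """Combine the list of titles (position 1) with the list of URLs (position 3)."""
    return list(zip(results[1], results[3]))


# Made-up response: search term, titles, taglines, URLs
sample_response = json.loads('''
["leiden",
 ["Leiden", "Leiden University"],
 ["", ""],
 ["https://en.wikipedia.org/wiki/Leiden", "https://en.wikipedia.org/wiki/Leiden_University"]]
''')

print(build_opensearch_url('leiden university', limit=5))
for title, url in pair_titles_and_urls(sample_response):
    print(title, url)
```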


## OpenStreetMap API

In [None]:

import requests
import xml.etree.ElementTree as ET
import re

addresses = '''
Arsenaalstraat 1, Leiden
Witte Singel 27, Leiden
Wassenaarseweg 52, Leiden
Leiden Nonnensteeg 1
'''

addressesList = re.split( r'\n' , addresses.strip() )



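# Nominatim's usage policy asks for a User-Agent that identifies the application;
# the value below is only a placeholder, so replace it with a name of your own
headers = { 'User-Agent' : 'teach-the-teacher-workshop' }

for a in addressesList:
    # 'params' lets requests take care of encoding spaces and other special characters
    response = requests.get( 'https://nominatim.openstreetmap.org/search' ,
                             params = { 'q' : a , 'format' : 'xml' } ,
                             headers = headers )
    root = ET.fromstring( response.text )

    # Only the first result is used; find() returns None if there are no results
    place = root.find('place')
    if place is not None:
        lat = place.attrib['lat']
        lon = place.attrib['lon']
        print( '{}: {},{}'.format( a, lat , lon ) )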
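
The way in which the coordinates are read from the XML can be tested without calling the API. The sketch below parses a made-up response which imitates the structure used above: a root element containing `place` elements with `lat` and `lon` attributes. The coordinates and the function name `first_coordinates` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up response; the real API returns more attributes per place
sample_xml = '''<searchresults>
  <place lat="52.1574" lon="4.4853" display_name="First match"/>
  <place lat="52.1600" lon="4.4900" display_name="Second match"/>
</searchresults>'''


def first_coordinates(xml_text):
    """Return (lat, lon) of the first place as floats, or None if there is no place."""
    root = ET.fromstring(xml_text)
    place = root.find('place')
    if place is None:
        return None
    return float(place.attrib['lat']), float(place.attrib['lon'])


print(first_coordinates(sample_xml))
print(first_coordinates('<searchresults/>'))
```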




## Goodreads API

In [None]:
import re
import requests
import xml.etree.ElementTree as ET


#isbn = '1841156736'
isbn = '9780140181067'

baseUrl = 'https://www.goodreads.com/book/isbn/'
key = 'yZUIiWVAZOHzCFlFwIOTXA'

apiCall = '{}{}?key={}'.format( baseUrl , isbn , key )

print(apiCall)

response = requests.get( apiCall )


root = ET.fromstring(response.text)

title = root.find( 'book/title' ).text
author = root.find( 'book/authors/author/name' ).text
date = root.find( 'book/publication_year' ).text
averageRating = root.find( 'book/average_rating' ).text

reviews = root.find( 'book/work/reviews_count' ).text
ratingsSum = int( root.find( 'book/work/ratings_sum' ).text )
ratingsCount = int( root.find( 'book/work/ratings_count' ).text )

print( f'Title: {title}' )
print( f'Author: {author}' )
print( f'Date: {date}' )
print( f'Average rating: {averageRating}' )
print( f'Number of reviews: {reviews}' )
print( f'Number of ratings: {ratingsCount}' )

In [None]:
import re
import requests
import xml.etree.ElementTree as ET


#isbn = '1841156736'
isbn = '9780140181067'

baseUrl = 'https://www.goodreads.com/book/isbn/'
key = 'yZUIiWVAZOHzCFlFwIOTXA'

apiCall = '{}{}?key={}'.format( baseUrl , isbn , key )

print(apiCall)

response = requests.get( apiCall )


root = ET.fromstring(response.text)


title = root.find( 'book/title' ).text
date = root.find( 'book/publication_year' ).text

reviews = root.find( 'book/work/reviews_count' ).text
ratingsSum = int( root.find( 'book/work/ratings_sum' ).text )
ratingsCount = int( root.find( 'book/work/ratings_count' ).text )

reviewsWidget = root.find( 'book/reviews_widget' ).text

#print(reviewsWidget)

from bs4 import BeautifulSoup
soup = BeautifulSoup( reviewsWidget ,"lxml")

links = soup.find_all("iframe")

out = open( f'reviews-{ isbn }.txt' , 'w' , encoding = 'utf-8' )

for l in links:
    url = l.get("src")
    print(f" {url}")
    for i in range(1,2):
        # Build the URL of each page afresh, so that '&page=' is not appended repeatedly
        pageUrl = url + '&page=' + str(i)
        response = requests.get( pageUrl )
        if response:
            response.encoding = 'utf-8'
            soup = BeautifulSoup( response.text ,"lxml")
            #No reviews found. Showing 0-0 in response.text
            reviewLinks = soup.find_all("link")
            for r in reviewLinks:
                reviewUrl = r.get("href")

                if re.search( 'goodreads.*review.*show' , reviewUrl ):
                    response = requests.get( reviewUrl )
                    if response:
                        response.encoding = 'utf-8'
                        soup = BeautifulSoup( response.text ,"lxml")
                        fullText = soup.find( 'div' , itemprop='reviewBody' )
                        # Skip pages on which no review text was found
                        if fullText is not None:
                            out.write(fullText.text.strip())
                            out.write('\n\n')


out.close()
print('Ready!')

## IIIF

Documentation: https://iiif.io/api/image/2.1/

The following images (from the collection of the National Gallery of Art in Washington) are made available via IIIF.

https://media.nga.gov/iiif/public/objects/1/0/6/3/8/2/106382-primary-0-nativeres.ptif/full/full/0/default.jpg

https://media.nga.gov/iiif/public/objects/4/6/3/0/3/46303-primary-0-nativeres.ptif/full/full/0/default.jpg

https://media.nga.gov/iiif/public/objects/1/1/3/8/1138-primary-0-nativeres.ptif/full/full/0/default.jpg


Basic structure of a URL according to the IIIF Image API:

{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}


You can request other manifestations of these images via the following parameters:

* Region: The value 'full' or four numbers, separated by commas. In the latter case, the first two numbers define the starting point and the last two numbers specify the width and the height.
* Size: The value 'full' or two numbers, separated by a comma, which specify the width and the height. The second number can be omitted.
* Rotation: A number between 0 and 359; at the NGA, the values 0, 90, 180 and 270 have been implemented.
* Quality and format: The quality 'default', 'color', 'gray' or 'bitonal', followed by one of the extensions 'jpg', 'png' or 'gif'.

Sample query: 
https://media.nga.gov/iiif/public/objects/1/1/3/8/1138-primary-0-nativeres.ptif/20500,2000,20000,5000/500,/90/default.jpg

Select one of the images listed above and try to request a version of the image with the following properties:
*	Bitonal, rotated 90 degrees
*	Image with a width of 150 pixels (default size for a thumbnail)
*	Zoom in on a specific detail of the image by specifying a region.

You can test the API query in the cell below.

![IIIF IMAGE](https://media.nga.gov/iiif/public/objects/1/1/3/8/1138-primary-0-nativeres.ptif/20500,2000,20000,5000/500,/0/default.jpg)
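
The URL structure described above can also be filled in with a small function. The sketch below is only an illustration: the function name `iiif_url` and its parameter names are assumptions, and the default values are the ones used in the full-size image URLs listed earlier.

```python
def iiif_url(base, region='full', size='full', rotation=0, quality='default', fmt='jpg'):
    """Combine the parts of a IIIF Image API request into one URL."""
    return f'{base}/{region}/{size}/{rotation}/{quality}.{fmt}'


# Everything up to and including the identifier of the image
base = 'https://media.nga.gov/iiif/public/objects/1/1/3/8/1138-primary-0-nativeres.ptif'

# The full image
print(iiif_url(base))

# The sample query: a region, scaled to a width of 500 pixels and rotated 90 degrees
print(iiif_url(base, region='20500,2000,20000,5000', size='500,', rotation=90))
```

The resulting URLs can be pasted into a browser or into a Markdown cell.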

## Web Scraping

In [None]:
from bs4 import BeautifulSoup
import requests
import re


soup = ""


url = 'https://www.imdb.com/chart/top?ref_=ft_250'
print(url)


response = requests.get( url )
soup = BeautifulSoup( response.text ,"lxml")


movies = soup.find_all('td', {'class': 'titleColumn'})

movie_urls = []

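for m in movies:
    # Only consider the links that are direct children of the table cell
    children = m.find_all("a" , recursive=False)
    for c in children:
        movieTitle = c.text
        # The href attribute holds a relative path, so the domain is added in front
        movieUrl = 'http://imdb.com' + c.get('href')
        movie_urls.append( movieUrl )
        print( '{}: {}'.format( movieTitle , movieUrl ) )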
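
The principle behind the code above, finding table cells with a given class and collecting the links inside them, can be tried out on a small piece of HTML. The sketch below swaps BeautifulSoup for Python's built-in html.parser module, so that it runs without extra packages. The HTML fragment, the paths in it and the class name `TitleLinkParser` are made up for this illustration.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (title, url) pairs for links inside <td class="titleColumn"> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.current_href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td' and attrs.get('class') == 'titleColumn':
            self.in_cell = True
        elif tag == 'a' and self.in_cell:
            self.current_href = attrs.get('href')

    def handle_data(self, data):
        # The text directly after an opening <a> tag is taken as the title
        if self.current_href is not None and data.strip():
            self.links.append((data.strip(), 'http://imdb.com' + self.current_href))
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        elif tag == 'a':
            self.current_href = None


# Made-up HTML fragment
sample_html = '''
<table>
  <tr><td class="titleColumn">1. <a href="/title/example-1/">First film</a> (1999)</td></tr>
  <tr><td class="other"><a href="/ignored/">Not a title</a></td></tr>
  <tr><td class="titleColumn">2. <a href="/title/example-2/">Second film</a> (2001)</td></tr>
</table>
'''

parser = TitleLinkParser()
parser.feed(sample_html)
for title, url in parser.links:
    print(f'{title}: {url}')
```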


## Enriching data using APIs

* First, download a CSV file from [the Epistolarium website](http://ckcc.huygens.knaw.nl/epistolarium/), describing all the letters that were received by René Descartes. Save the CSV file as 'descartes.csv' in the same directory as this notebook.
* Next, run the code below.
             

In [19]:
import pandas as pd
import xml.etree.ElementTree as ET
import re
import requests

def remove_brackets(text):
    # Remove square brackets from the place name; the raw string avoids invalid escape sequences
    text = re.sub( r'[\[\]]' , '' , text )
    return text

data = pd.read_csv( 'descartes.csv' , sep = ';' )

locations = []
locations_coord = dict()

for index , row in data.iterrows():
    place_sender = remove_brackets(row[3])
    place_recipient = remove_brackets(row[5])
    locations.append(place_sender)
    locations.append(place_recipient)
    
    
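# Nominatim's usage policy asks for a User-Agent that identifies the application;
# the value below is only a placeholder, so replace it with a name of your own
headers = { 'User-Agent' : 'teach-the-teacher-workshop' }

for loc in locations:

    # Look up every place name only once
    if loc not in locations_coord:
        # 'params' lets requests take care of encoding spaces and other special characters
        response = requests.get( 'https://nominatim.openstreetmap.org/search' ,
                                 params = { 'q' : loc , 'format' : 'xml' } ,
                                 headers = headers )
        root = ET.fromstring( response.text )

        # Keep the coordinates of the first result, if there is one
        place = root.find('place')
        if place is not None:
            locations_coord[ loc ] = ( place.attrib['lat'] , place.attrib['lon'] )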
     

    
out = open( 'descartes_enriched.csv' , 'w' , encoding = 'utf-8' )

out.write('id,date,sender,place,geo_sender,recipient,place,geo_recipient\n')

for index , row in data.iterrows():
    
    
    sender_coord = tuple()
    recipient_coord = tuple()
    
    
    if remove_brackets(row[3]) in locations_coord:
        sender_coord = locations_coord[remove_brackets(row[3])]
    if remove_brackets(row[5]) in locations_coord:
        recipient_coord = locations_coord[remove_brackets(row[5])]

    out.write( f'"{row[0]}","{row[1]}","{row[2]}","{row[3]}",' )
    
    if sender_coord and not( re.search( r'\?' , row[3] ) ):
        out.write( f'"{ sender_coord[0]}, {sender_coord[1] }",' )
    else:
        out.write( f',' )

    out.write( f'"{row[4]}","{row[5]}",' )
    
    # Only write coordinates if the place of the recipient was found and is not marked as uncertain
    if recipient_coord and not( re.search( r'\?' , row[5] ) ):
        out.write( f'"{ recipient_coord[0]}, {recipient_coord[1] }"' )  
    
    out.write('\n')     

out.close()
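
The enriched file above is assembled by writing quotation marks and commas by hand, which can go wrong when a value itself contains a quotation mark. As a possible alternative, the sketch below uses csv.writer from the standard library, which handles the quoting. The rows, the coordinates, the reduced set of columns and the function name `write_enriched` are all made up for this illustration.

```python
import csv
import io


def write_enriched(rows, coordinates, out):
    """Write (sender, place) rows to 'out', adding the coordinates of each known place."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(['sender', 'place', 'geo'])
    for sender, place in rows:
        lat_lon = coordinates.get(place)
        # Leave the field empty when no coordinates are known for the place
        geo = f'{lat_lon[0]}, {lat_lon[1]}' if lat_lon else ''
        writer.writerow([sender, place, geo])


# Made-up data: the second row contains quotation marks and an unknown place
rows = [('Descartes, René', 'Leiden'), ('Unknown "friend"', 'Nowhere')]
coordinates = {'Leiden': ('52.16', '4.49')}

buffer = io.StringIO()
write_enriched(rows, coordinates, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a StringIO object, the csv documentation recommends opening the file with `newline=''`.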

    

Run the cell below to generate a map displaying all the locations. 

In [8]:
out = open( 'map.html' , 'w' , encoding = 'utf-8')


out.write('''
<!DOCTYPE html>
<html>
<head>

                <title>Correspondence on a map</title>

                <meta charset="utf-8" />
                <meta name="viewport" content="width=device-width, initial-scale=1.0">

                <link rel="shortcut icon" type="image/x-icon" href="docs/images/favicon.ico" />

    <link rel="stylesheet" href="https://unpkg.com/leaflet@1.7.1/dist/leaflet.css" integrity="sha512-xodZBNTC5n17Xt2atTPuE1HxjVMSvLVW9ocqUKLsCC5CXdbqCmblAshOMAS6/keqq/sMZMZ19scR4PsZChSR7A==" crossorigin=""/>
    <script src="https://unpkg.com/leaflet@1.7.1/dist/leaflet.js" integrity="sha512-XQoYMqMTK8LvdxXYG3nZ448hOEQiglfqkJs1NOQV44cWnUrBc8PkAOcXy20w0vlaXaVUearIOBhiXZ5V3ynxwA==" crossorigin=""></script>



</head>
<body>





<div id="mapid" style="width: 600px; height: 400px;"></div>
<script>

                var mymap = L.map('mapid').setView([52.0799838, 4.3113461], 6);

                L.tileLayer('https://api.mapbox.com/styles/v1/{id}/tiles/{z}/{x}/{y}?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpejY4NXVycTA2emYycXBndHRqcmZ3N3gifQ.rJcFIG214AriISLbB6B5aw', {
                                maxZoom: 18,
                                attribution: 'Map data &copy; <a href="https://www.openstreetmap.org/">OpenStreetMap</a> contributors, ' +
                                                '<a href="https://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>, ' +
                                                'Imagery &copy; <a href="https://www.mapbox.com/">Mapbox</a>',
                                id: 'mapbox/streets-v11',
                                tileSize: 512,
                                zoomOffset: -1
                }).addTo(mymap); 
''')

for l in locations_coord:
     out.write( f' L.marker([ { locations_coord[l][0] }, { locations_coord[l][1] }  ]).addTo(mymap);  ')

out.write(
'''
</script>



</body>
</html>

''')

out.close()

