<a href="https://colab.research.google.com/github/maxodsbjerg/strik-og-kod/blob/main/notebooks/SK_handson_notesbook_eng.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/maxodsbjerg/strik-og-kod.git/main?labpath=notebooks%2FSK_handson_notesbook_eng.ipynb) 

# Knit & Code
author: "Mathias Johansson & Max Odsbjerg Pedersen"

date: "2025-04-06"

This document is the coding part of the workshop "Knit and Code" at the Joint Faculties of Humanities and Theology at Lund University, developed in collaboration with _AU Bibliotek at Det Kongelige Bibliotek_. The workshop is about drawing parallels between knitting and coding. "Coding" is understood here as code-based data processing, and it therefore falls within the field of computer science. Since the workshop takes place in a humanities context, the example is about _text mining_. In _text mining_ the primary interest is extracting information from large corpora, which is exactly the interest that many humanists have.

# <Todo>
No recipe is complete without a picture of the final product as one of the first items, and this one is no exception. The final result at the end of this document is the visualisation shown just below this paragraph. It shows the most frequent words in old newspaper articles about knitting after all stopwords have been removed (it, that, to, and, in: words which carry little meaning on their own).

![](https://raw.githubusercontent.com/maxodsbjerg/strik-og-kod/refs/heads/main/notebooks/graphics/strikke_wordcloud.png)

Knitting words and words which accompany them.

<br>


# </Todo>


## Download and install Python packages
We work in the programming language [Python](https://www.python.org/), a
free and _open-source_ programming language. Python gets most of its
functionality by importing 'libraries', and Python has a very broad
ecosystem of libraries offering almost any functionality you can think
of, including many options for processing text, doing statistics and
presenting results graphically.

In this workshop the relevant packages are:
- Pandas: A powerful library for data handling.
- Wordcloud: A Python library for generating word clouds from text.
- Matplotlib: A library for creating static, animated and interactive
  visualisations in Python.

We will install these packages with pip, Python's package manager.
Pip is a command-line tool that makes it easy to install and manage
Python packages.

We also use two modules from the _standard library_:
- re: A regular expression module, for matching patterns in text.
- Counter: A class (from the `collections` module) for counting occurrences of items.


In [None]:
print("install and load libraries")
!python -m pip install pandas wordcloud matplotlib

import re
import pandas as pd
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

## Data – articles about knitting

The first thing we need is some text data. Here we use data from the Danish newspaper collection, supplied by the Royal Danish Library's experimental Newspaper API. Interaction with the API builds on searches in Mediestream, which is the Royal Danish Library's platform for searching the newspaper collection and other material. Before using the API it is a good idea to get acquainted with Mediestream's advanced search codes. The Mediestream search tips are available here: <https://www2.statsbiblioteket.dk/mediestream/info/soegetips>

You can also see the actual search code which is used here:
<https://gist.github.com/maxodsbjerg/e2dd484d3c9dcaa9c422a861d6a93f6e>

When you feel confident with limiting your searches with search codes you can use this interface to make calls to the Newspaper API:
<http://labs.statsbiblioteket.dk/labsapi/api//api-docs?url=/labsapi/api/openapi.yaml> (Choose "aviser(newspapers)/export/fields")

For this workshop we have prepared an API call that runs the following search and returns the matches as data:

> strik\* AND py:[1845 TO 1850] NOT familyId:(stcroixavisdvi OR sanctthomaetidendedvi)

This search gives us articles from the collection in the period 1845 to 1850 that contain words beginning with “strik” (knit), with any possible ending. So we get matches such as “strikke” (knit), “strikning” (knitting), “striktøj” (knit cloth) and “strikketøj” (knitwear). But we also get unrelated words such as “strikt” (strict).
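
The effect of the trailing `*` can be sketched locally with Python's `fnmatch` module. This only mimics prefix matching on a hand-picked word list (the list is ours, for illustration); the real matching happens on the Mediestream server and may differ in detail.

```python
from fnmatch import fnmatchcase

# Hand-picked example words; "strømper" (stockings) does not start with "strik"
candidates = ["strikke", "strikning", "striktøj", "strikketøj", "strikt", "strømper"]

# Keep every word that matches the wildcard pattern "strik*"
matches = [w for w in candidates if fnmatchcase(w, "strik*")]
print(matches)
```

Note that “strikt” is kept as well: a prefix wildcard cannot tell knitting words from other words sharing the same first letters.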

Searches in Mediestream look like this: ![](https://raw.githubusercontent.com/maxodsbjerg/strik-og-kod/refs/heads/main/pics/mediestream_strik.png)

But the data that the API returns to us is available as files in CSV-format (Comma Separated Values). To gain access to the data the API returns a link. This link contains the file which is our data. For some the link will open the file in your browser and you will then be able to see something like this:

![](https://raw.githubusercontent.com/maxodsbjerg/strik-og-kod/refs/heads/main/pics/api_strik.png)

For others the link will download the CSV file to your computer. The most important thing is that the API returns a link to the raw data matching our search, without the colouring or the interactive interface of the Mediestream search above. The raw data can be loaded directly into Python, where we can then work with it. Let us get our articles on knitting into Python!

### Load data

First we use the URL we have gotten from the API to download the corpus-csv file to disk.

In [None]:
!wget -c -nc "http://labs.statsbiblioteket.dk/labsapi/api/aviser/export/fields?query=strik%2A%20AND%20py%3A%5B1845%20TO%201850%5D%20NOT%20familyId%3A%28stcroixavisdvi%20OR%20sanctthomaetidendedvi%29&fields=link&fields=timestamp&fields=fulltext_org&fields=familyId&fields=lplace&max=-1&structure=header&structure=content&format=CSV" -O corpus.csv
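
The long URL above is the search expression in percent-encoded form, followed by a handful of parameters. As a rough sketch, it could be assembled with the standard library like this (the names `BASE_URL`, `params` and `url` are ours, chosen for illustration):

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://labs.statsbiblioteket.dk/labsapi/api/aviser/export/fields"
query = "strik* AND py:[1845 TO 1850] NOT familyId:(stcroixavisdvi OR sanctthomaetidendedvi)"

# A list of pairs lets the same key ("fields", "structure") appear several times
params = [
    ("query", query),
    ("fields", "link"),
    ("fields", "timestamp"),
    ("fields", "fulltext_org"),
    ("fields", "familyId"),
    ("fields", "lplace"),
    ("max", "-1"),
    ("structure", "header"),
    ("structure", "content"),
    ("format", "CSV"),
]

# quote_via=quote encodes spaces as %20 rather than "+"
url = BASE_URL + "?" + urlencode(params, quote_via=quote)
print(url)
```

Changing the `query` string (for example the year range) and rebuilding the URL is one way to fetch a different corpus.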

Then we use the function `read_csv` from the pandas library to load the file's contents into a DataFrame, which we keep in memory under the variable name "strik".
A DataFrame is comparable to a spreadsheet: it is a large table that stores the data we work with.


In [None]:
strik = pd.read_csv("corpus.csv")

This gives us a new Pandas DataFrame named “strik” and containing 7810 rows and 16 columns.
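
The number of rows and columns can be checked directly on the DataFrame. The sketch below uses a made-up two-row CSV held in memory in place of `corpus.csv`, so the column names and texts are illustrative assumptions rather than real data:

```python
import io
import pandas as pd

# Made-up CSV content standing in for the downloaded corpus
toy_csv = io.StringIO(
    "familyId,lplace,fulltext_org\n"
    "avis_a,København,En Pige som kan strikke\n"
    "avis_b,Aarhus,Strikkede Strømper sælges\n"
)
toy = pd.read_csv(toy_csv)

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # the column names
```

Running `strik.shape` and `strik.columns.tolist()` in the same way shows what the real download contains.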

What is especially interesting for us is the column “fulltext_org” – This is
where the text from the articles is stored. At first the text will not be easy
on the eyes as it is filled with errors, and it is here where you meet the
first downside of working with old text: OCR-errors.

To understand why these errors occur we need to look at the digitization
process. The newspapers are scanned (either from microfilm or from the
original), and afterwards a computer algorithm runs through the pages of
the newspapers. The algorithm does two things:

  1. Segmenting the articles: in other words, the algorithm guesses which
  body text belongs to which headline.
  2. Doing text recognition so that the text becomes digital and
  searchable. This is also called OCR (Optical Character Recognition).

This algorithm has been developed with modern newspapers in mind and is
therefore fairly precise on more recent newspapers (from 1910 until
today). On older material the quality of the digitization drops. This is
partly because the layout of older newspapers differs from modern layouts,
but one of the biggest problems is poor text recognition. That is a result
of the typeface: old newspapers were printed in fraktur, which some will
recognize as gothic letters or curly letters.

![](https://raw.githubusercontent.com/maxodsbjerg/strik-og-kod/refs/heads/main/pics/fraktur.png)

Our hope here is that the dataset is large enough that we can still find
something interesting despite the OCR errors.



## The Text mining task


First we convert the text to lowercase and split it into words using a [regular expression](https://en.wikipedia.org/wiki/Regex).
We store these lists of lowercase words in the DataFrame column `word`, and then expand that column into a new DataFrame where each word has its own row.


In [None]:
strik['word'] = strik.fulltext_org.apply(lambda x: re.findall(r'\w+', x.lower()))
strik_tidy = strik.explode('word')
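
To see what these two steps do, here is the same recipe applied to a tiny made-up DataFrame (the two texts and the `familyId` values are invented for illustration):

```python
import re
import pandas as pd

toy = pd.DataFrame({
    "familyId": ["avis_a", "avis_b"],
    "fulltext_org": ["En Pige, som kan strikke", "Strikkede Strømper"],
})

# Step 1: lowercase each text and collect runs of word characters into a list
toy["word"] = toy.fulltext_org.apply(lambda x: re.findall(r"\w+", x.lower()))

# Step 2: give every list element its own row; the other columns are repeated
toy_tidy = toy.explode("word")
print(toy_tidy[["familyId", "word"]])
```

The two articles become seven rows, one per word, and punctuation such as the comma disappears because `\w+` only matches letters, digits and underscores.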

Let us print out the new DataFrame to see what the one-word-per-row ("tidy") format looks like in practice. This is done by writing the name of the DataFrame:

In [None]:
strik_tidy

If we look at the rightmost column of the output, it is now “word”, which contains only single words.

## Analysis

### Wordcloud

To get an overview of our dataset we begin by counting the most used words in the articles about knitting in the period 1845 to 1850:


In [None]:
Counter(strik_tidy.word.values).most_common()
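
On a small, made-up word list the behaviour of `Counter` is easy to follow. Passing a number to `most_common` limits the output to the top entries, which can be handy on a large corpus:

```python
from collections import Counter

# Invented word list for illustration
toy_words = ["og", "strikke", "i", "og", "strømper", "og", "strikke"]

toy_counts = Counter(toy_words)
print(toy_counts.most_common(2))  # → [('og', 3), ('strikke', 2)]
```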

<br> To no one’s surprise, the most frequent words in the dataset are grammatical function words. One way to get rid of these is a stopword list, which we can use to remove unwanted words:

In [None]:
!wget "https://gist.githubusercontent.com/maxodsbjerg/4d1e3b1081ebba53a8d2c3aae2a1a070/raw/e1f63b4c81c15bb58a54a2f94673c97d75fe6a74/stopord_18.csv" -O stopord.csv


In [None]:
stopord = pd.read_csv("stopord.csv")['word'].to_list()

<br>


Now we filter out all the stopwords:

In [None]:
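# Looking words up in a set is much faster than scanning a list
stopord_set = set(stopord)

def not_stopword(word):
  # True for every word that is not on the stopword list
  return word not in stopord_set

words = Counter(filter(not_stopword, strik_tidy.word.values)).most_common()
words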
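
The filtering step can be tried out on a small scale. In this sketch both the word list and the stopword set are invented, and a generator expression is used in place of `filter`, which is an equivalent way of writing the same thing:

```python
from collections import Counter

# Invented words and stopwords for illustration
toy_words = ["en", "pige", "som", "kan", "strikke", "og", "en", "pige", "som", "kan", "sye"]
toy_stopwords = {"en", "som", "kan", "og"}

# Count only the words that are not stopwords
toy_counts = Counter(w for w in toy_words if w not in toy_stopwords)
print(toy_counts.most_common())
```

Only the content words remain, so “pige” now tops the list.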

<br> We can already see quite a few interesting words. Several of them
point towards maids seeking a “condition”, which back in the day meant a
“service position” or a post of some sort. We can also see an OCR error,
“eondition”, and an alternative spelling of condition, “kondition”.
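
Spelling variants like these could in principle be grouped by string similarity. The sketch below is not part of the workshop's method; it only illustrates the idea with `difflib` from the standard library, an invented vocabulary and an arbitrarily chosen cutoff of 0.8:

```python
from difflib import get_close_matches

# Invented mini vocabulary; in practice this would be the words from the corpus
toy_vocabulary = ["condition", "eondition", "kondition", "strømper", "pige"]

# Find words that look similar to "condition" (similarity of at least 0.8)
variants = get_close_matches("condition", toy_vocabulary, n=5, cutoff=0.8)
print(variants)
```

A lower cutoff catches more OCR variants but also starts to merge words that are genuinely different, so any such grouping should be checked by hand.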

But a list is a little boring to look at. Could we perhaps create a
beautiful wordcloud? Of course we can!


In [None]:
wc = WordCloud()
wc.generate_from_frequencies(
  Counter(filter(not_stopword, strik_tidy.word.values))).to_file('wc.png')
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()