<a href="https://colab.research.google.com/github/mathjoha/strik-og-kod/blob/main/notebooks/KnC_handson_notesbook_swe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
[![Open In Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mathjoha/strik-og-kod.git/main?labpath=notebooks%2FKnC_handson_notesbook_swe.ipynb) 

# Knit & Code
author: "Mathias Johansson & Max Odsbjerg Pedersen"

date: "2025-04-06"

This document is the code part of the workshop "Knit and Code" at the Joint Faculties of Humanities and Theology at Lund University, developed in collaboration with _AU Bibliotek vid Det Kongelige Bibliotek_. The workshop is about drawing parallels between knitting and coding. "Coding" is understood here as code-based data processing, and therefore lies within the field of computer science. Since the workshop is set in the context of the humanities, the example concerns _text mining_. In _text mining_ the primary interest is extracting information from large corpora, which is exactly the interest many humanists have.

# <Todo>
No recipe is complete without a picture of the final product as one of the first items, and this one is no exception. The final result at the end of this document is the visualisation shown just under this paragraph. It shows the most frequently appearing words in old newspaper articles concerning knitting after all stopwords have been removed (it, that, to, and, in: words which bear no larger meaning).

![](https://github.com/mathjoha/strik-og-kod/blob/53a64caa55cebfea9e3d9e2db1f7305d1043d129/notebooks/swe_wc.png)

Knitting words and words which accompany them.

<br>


# </Todo>



## Download and install Python packages
We work in the programming language [Python](https://www.python.org/), a
free and _open-source_ programming language. Python gets most of its
functionality by importing "libraries", and it has a very broad ecosystem of
libraries offering almost any functionality you can think of, including many
options for text processing, statistics, and graphical presentation of
results.

In this workshop the relevant packages are:
- Pandas: A powerful library for data handling.
- Wordcloud: A Python library for generating word clouds from text.
- Matplotlib: A library for creating static, animated, and interactive
  visualisations in Python.

We will install these packages using pip, Python's package manager.
Pip is a command-line tool that makes it easy to install and manage
Python packages.

We also use two tools from the _standard library_:
- re: The regular expression module, for matching patterns in text.
- Counter: A class from the `collections` module for counting occurrences of items.
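
As a minimal sketch of what these two standard-library tools do, the snippet below tokenises a made-up sentence (not taken from the workshop data) and counts its words:

```python
import re
from collections import Counter

# A made-up example sentence, not taken from the corpus.
sentence = "Biblioteket och arkivet, och museet."

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
tokens = re.findall(r"\w+", sentence.lower())
print(tokens)  # ['biblioteket', 'och', 'arkivet', 'och', 'museet']

# Counter tallies how often each token occurs.
counts = Counter(tokens)
print(counts.most_common(1))  # [('och', 2)]
```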


In [None]:
print("install and load libraries")
!python -m pip install pandas wordcloud matplotlib

import re
import pandas as pd
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

## Data – utterances about ABM

The first thing we need is some text.
We will use _utterances_ from Swedish parliamentary debates from 1867-2022.
To find and access these _utterances_ we use [Riksdagsdebatter.se](https://riksdagsdebatter.se/public/index.html#/about), which offers a graphical interface for searching and accessing the debates.
To find all the utterances mentioning ABM (_arkiv_, _bibliotek_, _museer_: archives, libraries, and museums) we search for the following three keywords:
- `arkiv*`
- `bibliotek*`
- `muse*`

Here `*` is a truncation mark, expanding the search to *all words* starting with `arkiv`, `bibliotek`, or `muse` (1477 words according to the GUI).
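
The truncated search terms can be approximated with a regular expression. This is only a sketch, assuming that `stem*` means "any word starting with the stem"; how Riksdagsdebatter.se implements its search internally is not known here, and the example sentence is made up.

```python
import re

# Search terms as entered in the GUI; '*' is the truncation mark.
keywords = ["arkiv*", "bibliotek*", "muse*"]

# Assumption: 'stem*' means "any word starting with stem".
stems = [re.escape(k.rstrip("*")) for k in keywords]
pattern = re.compile(r"\b(?:" + "|".join(stems) + r")\w*", re.IGNORECASE)

# A made-up sentence, not taken from the corpus.
text = "Arkivet, biblioteken och museerna behöver mer resurser."
print(pattern.findall(text))  # ['Arkivet', 'biblioteken', 'museerna']
```

Because the pattern is anchored at the start of a word, compounds such as "folkbibliotek" would not match under this reading of the truncation mark.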

`Riksdagsdebatter.se` does not provide an API, and the ~9k results are paginated at 50 results per page.
However, the source material is available at [the-swedish-parliament-corpus](https://github.com/swerik-project/the-swedish-parliament-corpus),
where one can download the entire corpus and all speaker metadata.
In preparation for this workshop I have downloaded the records, filtered out all the utterances that do NOT mention at least one of our keywords, and restructured the result into a CSV file: [script](https://gist.github.com/mathjoha/edcdaf57c5c2d58d9f6b58a6350b811d) [corpus](https://raw.githubusercontent.com/mathjoha/strik-og-kod/refs/heads/main/the-swedish-parliament-corpus_ABM_.csv)


### Load data

In order to access the data for the workshop it needs to be downloaded, which can easily be done with the `wget` program:

In [None]:
# ?raw=true makes GitHub serve the CSV file itself rather than the HTML page around it
!wget -c -nc "https://github.com/mathjoha/strik-og-kod/blob/08f31ff60c9460efb232fecc1fbd07aace62dab4/data/the-swedish-parliament/the-swedish-parliament-corpus_ABM_.csv?raw=true" -O corpus.csv

Then we use the function `read_csv` from the pandas library to load the file's contents into a DataFrame and keep it in memory under the variable name `riksdag`.
A DataFrame is comparable to a spreadsheet: a table of rows and named columns that stores the data we work with.


In [None]:
riksdag = pd.read_csv("corpus.csv")

This gives us a new pandas DataFrame named `riksdag`, containing 13755 rows and 5 columns.
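
The size and column names of a DataFrame can be checked directly. The sketch below uses a tiny in-memory CSV as a stand-in for `corpus.csv`; only `content` is a column name taken from the text, while `speaker` and the two rows are made up for illustration.

```python
import io
import pandas as pd

# Toy CSV standing in for corpus.csv; "speaker" is a hypothetical column name.
toy_csv = io.StringIO(
    "speaker,content\n"
    "A,Arkivet är öppet\n"
    "B,Biblioteket är stängt\n"
)
toy = pd.read_csv(toy_csv)

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the column names
print(toy.head())         # the first rows
```

The same three calls on `riksdag` should report the row and column counts mentioned above.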

What is especially interesting for us is the column `content`. This is
where the transcribed utterances are stored. Some of this text will not be easy
on the eyes, as it is filled with errors, and it is here that you meet the
first downside of working with digitised text: OCR errors.

To understand why these errors occur it is necessary to look at the
digitisation process. In this process the protocols are scanned and processed
with an OCR engine. These engines tend to work in two steps:

  1. Segmenting the image into different blocks of text.
  2. Transcribing the blocks of text into machine-readable characters.

These engines are typically developed and tested on newer material, partly because the older material is more complex and partly because less of it is available.
The team behind `The Swedish Parliament Corpus` has dedicated a lot of effort to developing systems for transcribing the older debates, identifying speakers, and mapping them to metadata. Still, many [OCR errors remain](https://github.com/swerik-project/pyriksdagen/blob/5cdc0875b7ed9a46f6ec7039d439d6e22e6acf54/examples/corpus-walkthrough.ipynb) in the material.
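
A small sketch of why OCR errors matter for word counts: a misread word is counted as a different word. The tokens below are hypothetical, with "bibhotek" standing in for a misreading of "bibliotek", and the hand-written correction table is an assumption for illustration, not part of the corpus tooling.

```python
from collections import Counter

# Hypothetical tokens: "bibhotek" stands in for an OCR misreading of "bibliotek".
tokens = ["bibliotek", "bibhotek", "arkiv", "bibliotek"]
raw_counts = Counter(tokens)
print(raw_counts["bibliotek"])  # 2, the misread token is counted separately

# A hand-written correction table (an assumption made for this example).
corrections = {"bibhotek": "bibliotek"}
fixed_counts = Counter(corrections.get(t, t) for t in tokens)
print(fixed_counts["bibliotek"])  # 3
```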




## The Text mining task


First we will convert the text to lowercase and split it into words using a [regular expression](https://en.wikipedia.org/wiki/Regex).
We store these lists of lowercase words in the DataFrame in the column `word`, and then expand this column into a new DataFrame where each word has its own row.


In [None]:
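# Lowercase each utterance and extract its words (\w+ = runs of letters, digits, underscores)
riksdag['word'] = riksdag.content.apply(lambda x: re.findall(r'\w+', x.lower()))
# One row per word; the values in the other columns are repeated for every word
riksdag_tidy = riksdag.explode('word')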
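
To see what these two steps do, here is a toy version on a two-row DataFrame. The `speaker` column and the sentences are made up for illustration; only the `content` and `word` column names follow the workshop code.

```python
import re
import pandas as pd

# Two made-up utterances standing in for the real corpus.
toy = pd.DataFrame({
    "speaker": ["A", "B"],
    "content": ["Arkiv och bibliotek.", "Museet är öppet"],
})

# Step 1: each cell in "word" becomes a list of lowercase words.
toy["word"] = toy["content"].apply(lambda text: re.findall(r"\w+", text.lower()))

# Step 2: explode gives every list element its own row and repeats the index.
toy_tidy = toy.explode("word")
print(toy_tidy[["speaker", "word"]])
```

Each utterance contributes one row per word, so the two utterances become six rows.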

Let us print the new DataFrame to see what the tidy text format (one word per row) looks like in practice. This is done by writing the name of the DataFrame on the last line of a cell:

In [None]:
riksdag_tidy

If we look through the columns, the last column is now `word`, which contains only single words.

## Analysis

### Wordcloud

To get an overview of our dataset we will begin by counting the most used words in the utterances:


In [None]:
Counter(riksdag_tidy.word.values).most_common()

<br> To no one's surprise, the most frequent words in the dataset are grammatical function words. One way to deal with these words is a stopword list, which can be used to remove unwanted words. For this we will use a stopword list published by [@peterdalle](https://gist.github.com/peterdalle):

In [None]:
!wget "https://gist.githubusercontent.com/peterdalle/8865eb918a824a475b7ac5561f2f88e9/raw/cc1d05616e489576c1b934289711f041ff9b2281/swedish-stopwords.txt" -O stopord.txt

In [None]:
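with open('stopord.txt', 'r', encoding='utf-8') as f:
  # A set makes the membership test in the filter below fast
  stopord = set(f.read().splitlines())
# A few extra words to remove in addition to the downloaded list
stopord.update(['icke', 'af', 'herr', 'talman', 'år', 'ju', 't', '000'])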

<br>


Now we filter out all the stopwords:

In [None]:
def not_stopword(word):
  return word not in stopord

words = Counter(filter(not_stopword, riksdag_tidy.word.values)).most_common()
words
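
The same filtering pattern on a toy example, where the result is easy to check by hand. The stopword set and token list below are made up and only stand in for `stopord` and `riksdag_tidy.word`; passing a number to `most_common` limits the output to that many words.

```python
from collections import Counter

# Toy stand-ins for the stopword list and the column of words.
toy_stopwords = {"och", "att", "är"}
toy_tokens = ["biblioteket", "och", "arkivet", "är", "biblioteket", "att"]

def keep(word):
    # True for words that are not stopwords
    return word not in toy_stopwords

# Count only the words that pass the filter and keep the two most frequent.
top = Counter(filter(keep, toy_tokens)).most_common(2)
print(top)  # [('biblioteket', 2), ('arkivet', 1)]
```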

<br> We can already see quite a few interesting words. Something points
towards a connection between maids that are seeking “condition” which back
in the day meant a “service position” or a space of sorts. We can also see
an OCR-error “eondition” and another spelling of condition, “kondition”.

But a list is a little boring to look at. Could we perhaps create a
beautiful wordcloud? Of course we can!


In [None]:
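wc = WordCloud()
# Build the cloud from the stopword-filtered word counts and save it as an image
wc.generate_from_frequencies(
  Counter(filter(not_stopword, riksdag_tidy.word.values))).to_file('swe_wc.png')
# Show the cloud inline, without axes
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()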