**Note:** *This notebook is a work in progress and serves as an example of how users can programmatically analyse the National Library of Norway’s collection of web news texts. The notebook allows you to:*

- *build corpora and visualise the distribution of titles, languages and harvest dates,*  
- *retrieve Keywords in Context (concordances),*  
- *calculate the relative frequency of collocations for a keyword.*  

*Do you have any questions? Feel free to read the [general information about the online newspaper corpus](https://www.nb.no/en/collection/web-archive/research/web-news-corpus/) or send us an email at [nettarkivet@nb.no](mailto:nettarkivet@nb.no).*

___

# 0. Import `dhlab` for Python

Before you begin, import the necessary Python packages. If you have not installed `dhlab` yet, you need to uncomment the first line by removing `# ` in front of `!pip install ...`.

In [None]:
# !pip install -U dhlab
import dhlab as dh
import dhlab.nbtext as nb
from dhlab import Corpus, totals, Collocations, Ngram
import plotly.express as px
import numpy as np
import pandas as df

___

# 1. Corpus Analysis

In text analysis, a **[corpus](https://en.wikipedia.org/wiki/Text_corpus)** is a collection of texts. Below, you will be able to define corpora from the Web News Collection. In total, there are more than 1.5 million texts in the collection in various languages.

In `dhlab`, the class `Corpus` can be used to represent metadata for each document. This includes metadata such as publication title, language, harvesting date, domain name, and more. Each text also has a unique `dhlabid`, which is DHlab’s persistent URN.

If you want an overview of the various attributes of the texts as they are exposed through the API, you can read more about the [Web News Collection](https://www.nb.no/en/collection/web-archive/research/web-news-corpus/#which-schema-attributes-can-be-used-with-the-api?).

## 1.1 Build a Corpus

Let's build a corpus!

The code cell below creates a corpus with texts containing either **covid-19** or **korona**.

- `doctype="nettavis"` specifies that we are working with web news.  
- `fulltext="covid-19 OR korona"` limits the scope to texts containing the specified keyword(s).
- `limit=100000` caps the number of texts included in the corpus.

In [None]:
corpus = dh.Corpus(doctype="nettavis", fulltext="covid-19 OR korona", limit=100000)
corpus

## 1.2 Insight and Visualisation

Before proceeding with text analysis, we might want some insight into the corpus that was built. Let's visualise how the texts are distributed by *publication*, *harvesting date* and *language*.

### 1.2.1 Publication Titles (tree map)

The cell below counts the number of texts per publication and visualises the distribution as a tree map.

In [None]:
count_titles = corpus['title'].value_counts()

fig = px.treemap(
    path=[count_titles.index],
    values=count_titles.values,
    title='Distribution of texts per publication title',
    hover_data={'Number of texts': count_titles.values}
)

fig.show()
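
To see what `value_counts()` hands to the tree map, here is a minimal sketch on a hand-made table. The titles are invented for illustration, and pandas is imported under its conventional alias `pd` so the block runs on its own.

```python
import pandas as pd

# Invented stand-in for the corpus metadata; real titles will differ
toy = pd.DataFrame({"title": ["Avis A", "Avis B", "Avis A", "Avis C", "Avis A", "Avis B"]})

# One row per distinct title, sorted from most to least frequent
counts = toy["title"].value_counts()
print(counts)

# The same counts expressed as shares of the whole corpus
shares = toy["title"].value_counts(normalize=True)
print(shares.round(2))
```

The index of `counts` holds the titles and the values hold the number of texts, which is why the tree map cell passes `count_titles.index` and `count_titles.values` separately.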


### 1.2.2 Distribution over Time

The cell below displays the distribution of texts over time, based on harvest date. A small random offset (jitter) is applied to the x-axis (harvest date) to reduce overlap, while a random spread along the y-axis is introduced purely for visual separation.

In [None]:
harvestdate = df.to_datetime(corpus['timestamp'], format='%Y%m%d')

# Generate random values for spread along the y-axis
y_spread = np.random.uniform(0, 1, size=len(harvestdate))

# Add jitter (small random offset) to the harvest dates
x_jitter = harvestdate + df.to_timedelta(np.random.uniform(-2, 2, size=len(harvestdate)), unit='D')

# Create a scatter plot with jitter for better distribution
fig = px.scatter(
    x=x_jitter,
    y=y_spread,
    labels={'x': 'Harvest Date', 'y': 'Spread'},
    title='Distribution by Harvest Date',
    opacity=0.5,  # Make points semi-transparent
    size_max=5  # Upper bound on marker size; only applies when a size argument is given
)

# Display the chart
fig.show()
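
The jitter step can be checked in isolation. The sketch below uses three invented harvest dates and a seeded random generator (an assumption added here so that the result is reproducible) to confirm that no point moves more than two days from its true date.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so repeated runs give the same jitter

# Invented harvest dates in the same YYYYMMDD format as the corpus timestamps
dates = pd.to_datetime(pd.Series(["20200312", "20200312", "20200313"]), format="%Y%m%d")

# Random offsets between -2 and +2 days, one per text
offsets = pd.to_timedelta(rng.uniform(-2, 2, size=len(dates)), unit="D")
jittered = dates + offsets

# Largest absolute displacement introduced by the jitter
max_shift = (jittered - dates).abs().max()
print(max_shift <= pd.Timedelta(days=2))
```

Because the offsets are only cosmetic, the plot is suitable for spotting dense and sparse periods, not for reading off exact harvest dates.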


### 1.2.3 Distribution by Language

In [None]:
# Count number of texts per language code
language_distribution = corpus['langs'].value_counts()

# Create pie chart
fig = px.pie(
    values=language_distribution.values,
    names=language_distribution.index,
    title='Distribution of Languages',
    labels={'names': 'Language', 'values': 'Texts'}
)

# Display chart
fig.show()


## 1.3 Export Corpus Definition

If you are happy with the corpus you defined, you are almost ready to advance. But first, you probably want to export the corpus definition. This is convenient for research data management (RDM), making it possible for yourself and others to reproduce the dataset.

You can export the corpus definition as both Excel and JSONL. Set the filename below and run the cell to export.

In [None]:
# Define filename (without extension)
filename = 'corpus-covid19-OR-korona'

# Export Excel
corpus.frame.to_excel(f"{filename}.xlsx", index=False)

# Export JSONL
corpus.frame.to_json(f"{filename}.jsonl", orient='records', lines=True)
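
An exported JSONL file can be loaded again with pandas to reproduce the corpus definition. The following is a minimal sketch on an invented table that stands in for `corpus.frame`; the column values are made up, and the file is written to a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented stand-in for corpus.frame; real column names and values may differ
toy = pd.DataFrame({
    "dhlabid": [1001, 1002],
    "title": ["Avis A", "Avis B"],
    "timestamp": [20200312, 20200313],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy-corpus.jsonl"
    # Same export call as in the cell above: one JSON object per line
    toy.to_json(path, orient="records", lines=True)
    # convert_dates=False keeps the timestamp column as plain integers
    reloaded = pd.read_json(path, lines=True, convert_dates=False)

# Compare the reloaded rows with the original rows
print(reloaded.to_dict(orient="records") == toy.to_dict(orient="records"))
```

Passing `convert_dates=False` is a cautious choice: by default `read_json` may try to interpret columns with date-like names, and here the goal is to get back exactly what was written.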

___

# 2. Concordances

When you have defined your corpus, you can extract various types of information from it.

One approach is to retrieve small snippets of text with **[Keyword in Context (KWIC)](https://en.wikipedia.org/wiki/Key_Word_in_Context)**, also known as *concordances*.

The code cell below retrieves concordances with a text window of up to 12 words before and after your keyword. The keyword is highlighted in bold. (To prevent snippets from being used to reconstruct complete texts, a snippet never crosses a paragraph boundary.)

Let’s request concordances for words starting with **"dugnad"** (see "dugnad" explained in [Wikipedia](https://en.wikipedia.org/wiki/Communal_work#Norway)).

In [None]:
conc_dugnad = corpus.conc(words="dugnad*")
conc_dugnad.show()
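
To illustrate the idea behind a concordance, here is a small pure-Python sketch of keyword-in-context extraction. It is not how `dhlab` computes concordances; the function `kwic`, its parameters and the sample sentence are all hypothetical and exist only for this example. The prefix match mimics the trailing `*` wildcard in `dugnad*`.

```python
def kwic(text, prefix, window=3):
    """Return (left, match, right) tuples for tokens that start with `prefix`."""
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower().startswith(prefix.lower()):
            # Up to `window` tokens on each side of the matching token
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Invented sample sentence, not taken from the collection
sample = ("I mars ble alle bedt om å delta i en nasjonal dugnad "
          "mot smitten og dugnadsånden var sterk")

hits = kwic(sample, "dugnad")
for left, match, right in hits:
    print(f"{left} **{match}** {right}")
```

The sketch splits on whitespace only, so punctuation stays attached to words; a real tokenizer would handle that more carefully.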

### Export concordances

Once again, we can save the results in Excel and/or JSONL format. (Make sure to update the filename!)  


In [None]:
# Define filename (without extension)
filename = 'concordances-dugnad'

# Export Excel
conc_dugnad.frame.to_excel(f"{filename}.xlsx", index=False)

# Export JSONL
conc_dugnad.frame.to_json(f"{filename}.jsonl", orient='records', lines=True)


___

# 3. Collocations and Word Frequencies

**[Collocations](https://en.wikipedia.org/wiki/Collocation)** refer to word pairs that frequently occur together. By counting collocations for a given word, we can analyse how often or rarely different words co-occur.

Some words tend to appear frequently in any text, such as `and`, `he`, and `she`. To identify words that are significant within a specific context, we calculate the relative frequency (RF) for collocates, comparing the frequency in our specific corpus with the frequency in a general reference corpus.

The cell below lists the words with the highest relative frequency, given a keyword in the corpus.  


In [None]:
tot = totals(75000)  # Fetches reference corpus with common word frequency
coll = corpus.coll("dugnad").frame.sort_values(by="counts", ascending=False)  # Counts collocations for the keyword in the corpus
(coll.counts / tot.freq * 100).sort_values(ascending=False).head(20)  # Calculates relative frequency (RF) in the corpus compared to the reference corpus
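
The arithmetic in the last line can be followed on a toy example. The sketch below assumes, as the division above implies, that both the collocation counts and the reference frequencies are indexed by word; all numbers are invented.

```python
import pandas as pd

# Invented collocate counts for a keyword in the focus corpus
coll_counts = pd.Series({"og": 50, "nasjonal": 8, "smitte": 6, "koronadugnad": 2})

# Invented frequencies for (mostly) the same words in a large reference corpus
ref_freq = pd.Series({"og": 100000, "nasjonal": 400, "smitte": 120, "i": 90000})

# pandas aligns the two series on their word index before dividing
ratio = (coll_counts / ref_freq * 100).sort_values(ascending=False)
print(ratio)

# Words found in only one of the two series have no defined ratio
ranked = ratio.dropna()
print(list(ranked.index))
```

The very common word `og` has the highest raw count but the lowest ratio, which is the point of the comparison. Note also that a word missing from the reference list (here `koronadugnad`) ends up as `NaN` and drops out of the ranking, so rare or novel words may be overlooked if the reference list is too short.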
