# Download magazine articles per subject

This notebook guides you through the SRU and OAI services of the KB, National Library of the Netherlands, to collect magazine articles based on subject and time range.

### Install the necessary packages

It is preferred to install the packages from a command line, but installing them from within the Jupyter Notebook is also possible.

In [None]:
%pip install pandas
%pip install requests
%pip install BeautifulSoup4
%pip install lxml
%pip install html5lib

### Import the necessary packages

In [None]:
## Import the necessary packages 
import pandas as pd
from bs4 import BeautifulSoup
import requests
import xml
import re

### Defining the API key

An API key is needed to query and download material. 

In [None]:
apikey = "" #Insert the API key here
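
Pasting the key directly into the notebook makes it easy to share by accident. As an alternative, the sketch below reads the key from an environment variable. The variable name `KB_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something the KB prescribes.

```python
import os


def load_api_key(env_var="KB_API_KEY", default=""):
    """Return the API key stored in an environment variable, or the default."""
    # KB_API_KEY is a hypothetical variable name chosen for this example
    return os.environ.get(env_var, default).strip()


apikey = load_api_key()
```

If the variable is not set, `apikey` stays an empty string, just like the placeholder above.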

### Defining the search parameters

There are various parameters that can be used to search through the collection.
The code in this notebook is based on searching with a keyword.


In [None]:
keyword = 'Inhuldiging+and+koning' ## use '+or+' or '+and+' to search with multiple keywords, such as 'griep+and+ziekte'
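
Typing the `+and+` and `+or+` separators by hand is error-prone when there are many terms. A minimal sketch of a helper that assembles the keyword string from a list is shown below. The name `build_keyword` is hypothetical, and the percent-encoding of each term is an added precaution, not part of the original notebook.

```python
from urllib.parse import quote


def build_keyword(terms, operator="and"):
    """Join search terms with '+and+' or '+or+', percent-encoding each term."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    return f"+{operator}+".join(quote(term) for term in terms)


print(build_keyword(["griep", "ziekte"]))        # griep+and+ziekte
print(build_keyword(["koning", "keizer"], "or")) # koning+or+keizer
```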


### Retrieving the magazine article identifiers

Before we can download the actual content, we need a list of identifiers of the magazine articles that fit the selection criteria defined above. We put this list in a dataframe in which we store some additional metadata. This dataframe is used later on to access the content.

In [None]:
## Extract the identifiers
## This might take a while
identifierList = []
startRecord = 0
maximumRecord = 1000
recordCounter = 0
totalRecords = None

## Iterate through the query results to extract the metadata
while totalRecords is None or recordCounter < totalRecords:
    ## Assemble the query based on the parameters; maximumRecords is set to 1000 to prevent overloading the system
    query = (f"https://jsru.kb.nl/sru/sru/{apikey}?operation=searchRetrieve"
             f"&query={keyword}"
             f"&recordSchema=dc&startRecord={startRecord}&maximumRecords={maximumRecord}&x-collection=DTS_pagina")
    print(query)

    page = requests.get(query)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, 'xml')

    ## Read the total number of hits from the first response
    if totalRecords is None:
        numberOfRecords = soup.find('srw:numberOfRecords')
        totalRecords = int(numberOfRecords.text) if numberOfRecords is not None else 0

    ## The query returns an xml page with (in this example) at most 1000 records
    ## We extract the metadata per record
    recordsOnPage = soup.find_all('srw:recordData')
    if not recordsOnPage:
        ## Stop on an empty page, otherwise the loop would never end
        break
    for item in recordsOnPage:
        identifier = item.find('dcx:pageOcrUrl')
        oai = item.find('OaiPmhIdentifier')
        ## Skip records that lack one of the two fields
        if identifier is not None and oai is not None:
            identifierList.append([identifier.text, oai.text])
        recordCounter += 1

    ## If there are more than 1000 results,
    ## proceed to the next page to collect the remainder of the results
    startRecord = startRecord + maximumRecord
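
The paging step in the cell above boils down to computing a series of `startRecord` values, one per request. A small sketch of that arithmetic on its own is shown below. The helper `page_starts` is hypothetical. Its `first_record` default of 0 mirrors the starting value used in this notebook, and it is left adjustable in case the service counts records from 1.

```python
def page_starts(total_records, page_size=1000, first_record=0):
    """Yield the startRecord value of every page needed to cover total_records."""
    start = first_record
    while start < first_record + total_records:
        yield start
        start += page_size


print(list(page_starts(2500)))  # [0, 1000, 2000]
```

For 2500 hits this yields three requests. The last page holds the remaining 500 records.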

In [None]:
## Create the dataframe
dfIdentifiers = pd.DataFrame(identifierList, columns = ['identifier', 'oai'])
## Show the number of found identifiers
print('Found: ',len(dfIdentifiers))
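## Save the identifiers to a file (to_csv returns None when a path is given, so no assignment is needed)
dfIdentifiers.to_csv('identifierslist.csvlist', index=False)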
dfIdentifiers.head(40)

### Retrieving the magazine metadata

The article records do not directly include metadata on the magazines. Some additional code is therefore needed to retrieve the metadata (in this case title and date) of the magazine to which each article belongs.

In [None]:
# Iterate through each record in the DataFrame to get the title and date
for index, row in dfIdentifiers.iterrows():
    oai = row['oai']  # Extract the OaiPmhIdentifier
    identifier = row['identifier']
    
    # Only resolve records whose OAI identifier points to the magazine (DTS) collection
    if "DTS" in oai or "TIJDSCHRIFTEN" in oai:
        # Take the URN from the OCR link and keep its first three colon-separated parts,
        # so that the resolver URL addresses the magazine the page belongs to
        urn = identifier.split("urn=")[-1]
        urn = ":".join(urn.split(":")[:3])
        url = f"http://resolver.kb.nl/resolve?urn={urn}"
        
        # Try to access the URL and parse the title and extract date
        try:
            page = requests.get(url)
            soup = BeautifulSoup(page.content, 'html.parser')  
            title_tag = soup.find('title')  
            if title_tag:
                magazine_title = title_tag.text.strip() 
                dfIdentifiers.at[index, 'magazine title'] = magazine_title
                date_match = re.search(r'\b\d{4}\b', magazine_title)
                if date_match:
                    magazine_date = date_match.group(0)
                else:
                    magazine_date = "No Date Found"
                dfIdentifiers.at[index, 'publication date'] = magazine_date
            else:
                dfIdentifiers.at[index, 'magazine title'] = "No Title Found"
                dfIdentifiers.at[index, 'publication date'] = "No Date Found"
        except Exception as e:
            dfIdentifiers.at[index, 'magazine title'] = "Error Fetching Title"
            dfIdentifiers.at[index, 'publication date'] = "Error Fetching Date"
            print(f"Error for {url}: {e}")

# Show the updated DataFrame with the magazine title and publication date
print('Updated DataFrame with magazine titles and publication dates:')
print(dfIdentifiers.head(40))

# Save the updated DataFrame to the same CSV file, overwriting the original
dfIdentifiers.to_csv('identifierslist.csvlist', index=False)
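
The two string operations in the loop above, shortening the URN and picking a year from the title, can be tried in isolation. The sketch below does so with made-up input. The function names `parent_urn` and `first_year`, the example URN, and the example titles are all hypothetical and only illustrate the mechanics.

```python
import re


def parent_urn(ocr_url, parts=3):
    """Keep the first `parts` colon-separated pieces of the URN in an OCR link."""
    urn = ocr_url.split("urn=")[-1]
    return ":".join(urn.split(":")[:parts])


def first_year(title):
    """Return the first standalone four-digit number in the title."""
    match = re.search(r"\b\d{4}\b", title)
    return match.group(0) if match else "No Date Found"


# Made-up identifier, not a real KB record
example_url = "http://resolver.kb.nl/resolve?urn=EXAMPLE:000123:mpeg21:p001:ocr"
print(parent_urn(example_url))                      # EXAMPLE:000123:mpeg21
print(first_year("Example Magazine, 1898, no. 4"))  # 1898
print(first_year("De 1000 vragen, 1925"))           # 1000
```

The last call shows a limitation of the pattern: any four-digit number in the title is accepted, so the 'publication date' column is worth a spot check.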


### Retrieve the content of the articles

In [None]:
## Retrieve the content of the articles based on the identifiers
## If there are a lot of articles, this can take a while

contentList = []

for index, row in dfIdentifiers.iterrows():
    identifier = row['identifier']
    response = requests.get(identifier)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "xml")
        ## Join the text of all paragraphs, separated by a space so that words do not run together
        text = ' '.join(item.text for item in soup.find_all('p'))
        contentList.append([identifier, text])
    else:
        ## Any status other than 200 is recorded as an access problem
        contentList.append([identifier, "Not enough rights to view digital object"])
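
When the text of several paragraphs is combined into one string, the separator matters: without one, the last word of a paragraph fuses with the first word of the next. The sketch below shows the difference on a made-up XML snippet. It uses the standard library parser to stay dependency-free and assumes `<p>` elements without a namespace, which may not match the real OCR files. `paragraphs_to_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up OCR response, used only to illustrate the joining step
sample_xml = "<text><p>De koning werd</p><p>ingehuldigd in Amsterdam.</p></text>"


def paragraphs_to_text(xml_string, separator=" "):
    """Collect the text of every <p> element and join it with the separator."""
    root = ET.fromstring(xml_string)
    paragraphs = ("".join(p.itertext()).strip() for p in root.iter("p"))
    return separator.join(text for text in paragraphs if text)


print(paragraphs_to_text(sample_xml))                # De koning werd ingehuldigd in Amsterdam.
print(paragraphs_to_text(sample_xml, separator=""))  # De koning werdingehuldigd in Amsterdam.
```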

In [None]:
## Create a dataframe
dfText = pd.DataFrame(contentList, columns = ['identifier', 'content'])

In [None]:
len(dfText)

In [None]:
dfText.head(4)

In [None]:
## Show the rows whose content contains the string 'rel'
dfText[dfText['content'].str.contains('rel')]

### Merge the metadata with the content

This is an additional step to store everything in one dataframe. 

In [None]:
dfArticles = dfIdentifiers.merge(dfText, on = 'identifier', how = 'inner')

In [None]:
dfArticles.head(4)

In [None]:
## Show the merged rows whose content contains the string 'rel'
dfArticles[dfArticles['content'].str.contains('rel')]

In [None]:
# dfArticles = dfArticles.head(10)
print(dfArticles.head(10))

## Save the merged dataframe as an HTML table (to_html returns None when a path is given)
dfArticles.to_html('Articles.html')