## Purpose

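Ingest the list of newspapers available in the database at [data.kb.se](https://data.kb.se/). The titles, their search URLs, and their page counts are scraped from the title index at `https://tidningar.kb.se/titles`.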

In [3]:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import json

# Index page that lists every newspaper title
url = "https://tidningar.kb.se/titles"

# Get the data from the KB website (the timeout avoids hanging on a stalled connection)
response = requests.get(url, timeout=30)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all table rows containing newspaper information
    table_rows = soup.select('tr')
    
    # Create lists to store the newspaper names, their URLs, and page counts
    newspaper_names = []
    newspaper_urls = []
    newspaper_pages = []
    
    # Extract the names, URLs, and page counts
    for row in table_rows:
        # Look for a link in the first cell
        link = row.select_one('td a')
        if link:
            # Get the newspaper name from the link text
            name = link.text.strip()
            
            # Get the URL from the href attribute
            url_value = link.get('href', '')
            if url_value.startswith('/'):
                url_value = 'https://tidningar.kb.se' + url_value
            
            # Get the page count from the second cell
            page_count_cell = row.select_one('td.text-right span')
            if page_count_cell:
                # Remove non-breaking spaces and other whitespace
                page_count = page_count_cell.text.strip().replace('\xa0', '').replace(' ', '')
            else:
                page_count = "Unknown"
            
            newspaper_names.append(name)
            newspaper_urls.append(url_value)
            newspaper_pages.append(page_count)
    
    # Create a DataFrame to store the data
    newspapers_df = pd.DataFrame({
        'name': newspaper_names,
        'url': newspaper_urls,
        'pages': newspaper_pages
    })
    
    # Display the first few rows
    print(f"Found {len(newspapers_df)} newspapers")
    print(newspapers_df.head())
    
    # Save the data to a CSV file
    output_path = "../data/kb_newspapers.csv"
    newspapers_df.to_csv(output_path, index=False)
    print(f"Data saved to {output_path}")
    
    # Save as JSON as well for more structured data
    with open("../data/kb_newspapers.json", "w", encoding="utf-8") as json_file:
        json.dump(newspapers_df.to_dict(orient='records'), json_file, ensure_ascii=False, indent=4)
    print("Data also saved as JSON")
else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")

Found 1948 newspapers
                                                name  \
0                                            8 sidor   
1  Academiska och stifts tidningar utg. i Lund fö...   
2                              Adress- & varutidning   
3                    Adress- hyres- och annonsbladet   
4    Adress-kontor för Östergötland med Wadstena län   

                                                 url pages  
0  https://tidningar.kb.se/search?isPartOf.%40id=...  4692  
1  https://tidningar.kb.se/search?isPartOf.%40id=...   200  
2  https://tidningar.kb.se/search?isPartOf.%40id=...    54  
3  https://tidningar.kb.se/search?isPartOf.%40id=...    80  
4  https://tidningar.kb.se/search?isPartOf.%40id=...   174  
Data saved to ../data/kb_newspapers.csv
Data also saved as JSON
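
The cell above collects the `pages` values as strings, because rows without a count get the placeholder `"Unknown"`. If numeric page counts are needed later, a helper along the following lines could do the conversion. This is only a minimal sketch: `parse_page_count` is a hypothetical name that does not appear in the notebook, and it assumes the site separates thousands with non-breaking or regular spaces, as the cleaning step above suggests.

```python
def parse_page_count(text):
    """Return the page count as an int, or None if the text is not a number."""
    # Drop non-breaking spaces (\xa0) and regular spaces used as thousands separators
    cleaned = text.strip().replace("\xa0", "").replace(" ", "")
    # Anything that is not purely decimal digits (e.g. "Unknown") maps to None
    return int(cleaned) if cleaned.isdecimal() else None


print(parse_page_count("4\xa0692"))  # 4692
print(parse_page_count("Unknown"))   # None
```

Returning `None` instead of a string keeps the column easy to convert to a nullable numeric type afterwards.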


In the next step, we loop through the URLs and find out how many issues there are per year. Each newspaper's page shows this as a histogram with one bar per year. On each bar, `data-label` holds the year and `data-value` holds the number of issues:

```html

<rect data-v-4a5ad4a2="" data-label="1899" data-value="4" height="4.19047619047619" width="57.77777777777778" x="0" y="120.80952380952381" class="bar--filled"></rect>

```
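
To make the two attributes concrete, the sketch below reads the year and issue count from that single `<rect>` element. It uses only the standard library's `html.parser` so that it runs without extra dependencies; the notebook's own cells use BeautifulSoup instead. The class name `BarParser` is a hypothetical helper introduced for illustration.

```python
from html.parser import HTMLParser


class BarParser(HTMLParser):
    """Collect year -> issue count from <rect class="bar--filled"> elements."""

    def __init__(self):
        super().__init__()
        self.issues_by_year = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Only the filled bar is read; the transparent hover rect repeats the same data
        if tag == "rect" and "bar--filled" in classes:
            label = attrs.get("data-label")
            value = attrs.get("data-value")
            if label and value:
                self.issues_by_year[label] = int(value)


# The sample element shown above, copied verbatim
fragment = (
    '<rect data-v-4a5ad4a2="" data-label="1899" data-value="4" '
    'height="4.19047619047619" width="57.77777777777778" x="0" '
    'y="120.80952380952381" class="bar--filled"></rect>'
)

parser = BarParser()
parser.feed(fragment)
print(parser.issues_by_year)  # {'1899': 4}
```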


Each bar sits inside an `<svg>` element with the class `histogram`:

```html

<svg data-v-4a5ad4a2="" width="600" height="145" xmlns="http://www.w3.org/2000/svg" class="histogram"><g data-v-4a5ad4a2="" transform="translate(60, 0)" class="main"><g data-v-4a5ad4a2="" fill="none" transform="translate(0, 0)" class="y-axis" style="color: rgb(204, 204, 204);"><path data-v-4a5ad4a2="" stroke="currentColor" d="M0.5,125.0H0.5V0.5" class="domain"></path> <g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="end" transform="translate(0, 125.5)" class="tick"><line data-v-4a5ad4a2="" stroke="currentColor" x2="-6"></line> <text data-v-4a5ad4a2="" fill="#000" x="-9" dy="0.32em">0</text></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="end" transform="translate(0, 73.11904761904762)" class="tick"><line data-v-4a5ad4a2="" stroke="currentColor" x2="-6"></line> <text data-v-4a5ad4a2="" fill="#000" x="-9" dy="0.32em">50</text></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="end" transform="translate(0, 20.738095238095244)" class="tick"><line data-v-4a5ad4a2="" stroke="currentColor" x2="-6"></line> <text data-v-4a5ad4a2="" fill="#000" x="-9" dy="0.32em">100</text></g></g> <g data-v-4a5ad4a2="" fill="none" transform="translate(0, 125)" class="x-axis" style="color: rgb(204, 204, 204);"><path data-v-4a5ad4a2="" stroke="currentColor" d="M0,0.5H520.5" class="domain"></path> <g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="middle" transform="translate(35.30864197530863, 0)" class="tick"><g data-v-4a5ad4a2=""><line data-v-4a5ad4a2="" stroke="currentColor" y2="6"></line> <text data-v-4a5ad4a2="" fill="#000" y="9" dy="0.71em">1899</text></g></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="middle" transform="translate(99.50617283950616, 0)" class="tick"><g data-v-4a5ad4a2=""><line data-v-4a5ad4a2="" stroke="currentColor" y2="6"></line> <text data-v-4a5ad4a2="" fill="#000" y="9" 
dy="0.71em">1900</text></g></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="middle" transform="translate(163.7037037037037, 0)" class="tick"><g data-v-4a5ad4a2=""><line data-v-4a5ad4a2="" stroke="currentColor" y2="6"></line> <text data-v-4a5ad4a2="" fill="#000" y="9" dy="0.71em">1901</text></g></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="middle" transform="translate(227.90123456790124, 0)" class="tick"><g data-v-4a5ad4a2=""><line data-v-4a5ad4a2="" stroke="currentColor" y2="6"></line> <text data-v-4a5ad4a2="" fill="#000" y="9" dy="0.71em">1902</text></g></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="middle" transform="translate(292.0987654320988, 0)" class="tick"><g data-v-4a5ad4a2=""><line data-v-4a5ad4a2="" stroke="currentColor" y2="6"></line> <text data-v-4a5ad4a2="" fill="#000" y="9" dy="0.71em">1903</text></g></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="middle" transform="translate(356.2962962962963, 0)" class="tick"><g data-v-4a5ad4a2=""><line data-v-4a5ad4a2="" stroke="currentColor" y2="6"></line> <text data-v-4a5ad4a2="" fill="#000" y="9" dy="0.71em">1904</text></g></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="middle" transform="translate(420.4938271604939, 0)" class="tick"><g data-v-4a5ad4a2=""><line data-v-4a5ad4a2="" stroke="currentColor" y2="6"></line> <text data-v-4a5ad4a2="" fill="#000" y="9" dy="0.71em">1905</text></g></g><g data-v-4a5ad4a2="" opacity="1" font-size="12" font-family="sans-serif" text-anchor="middle" transform="translate(484.6913580246914, 0)" class="tick"><g data-v-4a5ad4a2=""><line data-v-4a5ad4a2="" stroke="currentColor" y2="6"></line> <text data-v-4a5ad4a2="" fill="#000" y="9" dy="0.71em">1906</text></g></g></g> <g data-v-4a5ad4a2="" fill="none" class="bars bars--wide" style="position: relative;"><rect 
data-v-4a5ad4a2="" fill="#000" fill-opacity="0" height="125" width="520" x="0" y="0"></rect> <g data-v-4a5ad4a2="" data-return-value="1899" transform="translate(6.419753086419746,0)" class="bar ds-selectable"><rect data-v-4a5ad4a2="" data-label="1899" data-value="4" fill="#000" fill-opacity="0" height="125" width="57.77777777777778" x="0" y="0"></rect> <rect data-v-4a5ad4a2="" data-label="1899" data-value="4" height="4.19047619047619" width="57.77777777777778" x="0" y="120.80952380952381" class="bar--filled"></rect></g><g data-v-4a5ad4a2="" data-return-value="1900" transform="translate(70.61728395061728,0)" class="bar ds-selectable"><rect data-v-4a5ad4a2="" data-label="1900" data-value="103" fill="#000" fill-opacity="0" height="125" width="57.77777777777778" x="0" y="0"></rect> <rect data-v-4a5ad4a2="" data-label="1900" data-value="103" height="107.9047619047619" width="57.77777777777778" x="0" y="17.0952380952381" class="bar--filled"></rect></g><g data-v-4a5ad4a2="" data-return-value="1901" transform="translate(134.8148148148148,0)" class="bar ds-selectable"><rect data-v-4a5ad4a2="" data-label="1901" data-value="104" fill="#000" fill-opacity="0" height="125" width="57.77777777777778" x="0" y="0"></rect> <rect data-v-4a5ad4a2="" data-label="1901" data-value="104" height="108.95238095238096" width="57.77777777777778" x="0" y="16.047619047619044" class="bar--filled"></rect></g><g data-v-4a5ad4a2="" data-return-value="1902" transform="translate(199.01234567901236,0)" class="bar ds-selectable"><rect data-v-4a5ad4a2="" data-label="1902" data-value="103" fill="#000" fill-opacity="0" height="125" width="57.77777777777778" x="0" y="0"></rect> <rect data-v-4a5ad4a2="" data-label="1902" data-value="103" height="107.9047619047619" width="57.77777777777778" x="0" y="17.0952380952381" class="bar--filled"></rect></g><g data-v-4a5ad4a2="" data-return-value="1903" transform="translate(263.2098765432099,0)" class="bar ds-selectable"><rect data-v-4a5ad4a2="" data-label="1903" 
data-value="105" fill="#000" fill-opacity="0" height="125" width="57.77777777777778" x="0" y="0"></rect> <rect data-v-4a5ad4a2="" data-label="1903" data-value="105" height="110" width="57.77777777777778" x="0" y="15" class="bar--filled"></rect></g><g data-v-4a5ad4a2="" data-return-value="1904" transform="translate(327.4074074074074,0)" class="bar ds-selectable"><rect data-v-4a5ad4a2="" data-label="1904" data-value="104" fill="#000" fill-opacity="0" height="125" width="57.77777777777778" x="0" y="0"></rect> <rect data-v-4a5ad4a2="" data-label="1904" data-value="104" height="108.95238095238096" width="57.77777777777778" x="0" y="16.047619047619044" class="bar--filled"></rect></g><g data-v-4a5ad4a2="" data-return-value="1905" transform="translate(391.60493827160496,0)" class="bar ds-selectable"><rect data-v-4a5ad4a2="" data-label="1905" data-value="104" fill="#000" fill-opacity="0" height="125" width="57.77777777777778" x="0" y="0"></rect> <rect data-v-4a5ad4a2="" data-label="1905" data-value="104" height="108.95238095238096" width="57.77777777777778" x="0" y="16.047619047619044" class="bar--filled"></rect></g><g data-v-4a5ad4a2="" data-return-value="1906" transform="translate(455.8024691358025,0)" class="bar ds-selectable"><rect data-v-4a5ad4a2="" data-label="1906" data-value="104" fill="#000" fill-opacity="0" height="125" width="57.77777777777778" x="0" y="0"></rect> <rect data-v-4a5ad4a2="" data-label="1906" data-value="104" height="108.95238095238096" width="57.77777777777778" x="0" y="16.047619047619044" class="bar--filled"></rect></g></g> <g data-v-4a5ad4a2=""><!----></g></g></svg>

```

In [8]:
import time
import json
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Load the previously saved newspaper data
newspapers_df = pd.read_csv("../data/kb_newspapers.csv")

def get_issues_by_year(url):
    print(f"Processing URL: {url}")
    try:
        # Set up Selenium with headless Chrome
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-gpu")
        
        # Automatically install/update ChromeDriver
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)

        # Visit the URL and wait for JavaScript to load the SVG
        try:
            driver.get(url)
            time.sleep(2)  # Increase the delay if needed
            html = driver.page_source
        finally:
            # Always close the browser, even if the page load fails
            driver.quit()
        
        # Parse the rendered HTML content with BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        
        # Find the histogram SVG
        histogram = soup.select_one('svg.histogram')
        if not histogram:
            print(f"No histogram found for {url}")
            return {}
        
        # Extract the year and issue count from the bars in the histogram
        issues_by_year = {}
        bar_groups = histogram.select('g.bar.ds-selectable')
        
        for bar_group in bar_groups:
            # Get the year from the data-return-value attribute
            year = bar_group.get('data-return-value')
            # Find the filled rect within the bar group
            filled_rect = bar_group.select_one('rect.bar--filled')
            if filled_rect:
                count = filled_rect.get('data-value')
                if year and count:
                    issues_by_year[year] = int(count)
        
        return issues_by_year
    
    except Exception as e:
        print(f"Error processing {url}: {str(e)}")
        return {}

# Process only the first 3 newspapers as a test
test_newspapers = newspapers_df.head(3)
results = []

for idx, row in test_newspapers.iterrows():
    newspaper_name = row['name']
    newspaper_url = row['url']
    total_pages = row['pages']
    
    print(f"\nProcessing #{idx+1}: {newspaper_name}")
    issues_by_year = get_issues_by_year(newspaper_url)
    
    result = {
        'name': newspaper_name,
        'url': newspaper_url,
        'total_pages': total_pages,
        'issues_by_year': issues_by_year
    }
    
    results.append(result)
    print(f"Found {len(issues_by_year)} years with issues")

# Save the detailed results to a JSON file
output_path = "../data/kb_newspapers_with_years_test.json"
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=4)
    
print(f"\nSaved detailed information for {len(results)} newspapers to {output_path}")

# Display a sample of the results
for result in results:
    print(f"\nNewspaper: {result['name']}")
    print(f"Total pages: {result['total_pages']}")
    print(f"Years with issues: {len(result['issues_by_year'])}")
    # Show up to 5 years as examples
    sample_years = list(result['issues_by_year'].items())[:5]
    if sample_years:
        print("Sample years (year: issue count):")
        for year, count in sample_years:
            print(f"  {year}: {count}")



Processing #1: 8 sidor
Processing URL: https://tidningar.kb.se/search?isPartOf.%40id=https%3A%2F%2Flibris.kb.se%2Fwf7fbxb71333dp5%23it
Found 12 years with issues

Processing #2: Academiska och stifts tidningar utg. i Lund för år 1773 af G.S.
Processing URL: https://tidningar.kb.se/search?isPartOf.%40id=https%3A%2F%2Flibris.kb.se%2F08667xhcx8zwx7mj%23it
Found 1 years with issues

Processing #3: Adress- & varutidning
Processing URL: https://tidningar.kb.se/search?isPartOf.%40id=https%3A%2F%2Flibris.kb.se%2F4fr1qj1x2ddfhktc%23it
Found 2 years with issues

Saved detailed information for 3 newspapers to ../data/kb_newspapers_with_years_test.json

Newspaper: 8 sidor
Total pages: 4692
Years with issues: 12
Sample years (year: issue count):
  2013: 49
  2014: 48
  2015: 49
  2016: 49
  2017: 49

Newspaper: Academiska och stifts tidningar utg. i Lund för år 1773 af G.S.
Total pages: 200
Years with issues: 1
Sample years (year: issue count):
  1773: 50

Newspaper: Adress- & varutidning
Total pa
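
The nested `issues_by_year` dictionaries are convenient for JSON, but per-year analysis is often easier on a flat table with one row per newspaper and year. The sketch below shows one possible reshaping. `flatten_results` is a hypothetical helper that is not part of the notebook, and the sample records contain only the fields it needs, with values taken from the test output above.

```python
def flatten_results(results):
    """Turn the nested results into one row per (newspaper, year)."""
    rows = []
    for result in results:
        # Years are stored as strings in the JSON, so convert them for sorting and analysis
        for year, count in sorted(result["issues_by_year"].items()):
            rows.append({"name": result["name"], "year": int(year), "issues": count})
    return rows


# Trimmed sample in the same shape as the saved results
sample = [
    {"name": "8 sidor", "issues_by_year": {"2013": 49, "2014": 48}},
    {
        "name": "Academiska och stifts tidningar utg. i Lund för år 1773 af G.S.",
        "issues_by_year": {"1773": 50},
    },
]

for row in flatten_results(sample):
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(...)` if a DataFrame is preferred.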

In [9]:
import time
import json
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Load the previously saved newspaper data
newspapers_df = pd.read_csv("../data/kb_newspapers.csv")

def get_issues_by_year(url):
    print(f"Processing URL: {url}")
    try:
        # Set up Selenium with headless Chrome
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-gpu")
        
        # Automatically install/update ChromeDriver
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)

        # Visit the URL and wait for JavaScript to load the SVG
        try:
            driver.get(url)
            time.sleep(2)  # Increase the delay if needed
            html = driver.page_source
        finally:
            # Always close the browser, even if the page load fails
            driver.quit()
        
        # Parse the rendered HTML content with BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        
        # Find the histogram SVG
        histogram = soup.select_one('svg.histogram')
        if not histogram:
            print(f"No histogram found for {url}")
            return {}
        
        # Extract the year and issue count from the bars in the histogram
        issues_by_year = {}
        bar_groups = histogram.select('g.bar.ds-selectable')
        
        for bar_group in bar_groups:
            # Get the year from the data-return-value attribute
            year = bar_group.get('data-return-value')
            # Find the filled rect within the bar group
            filled_rect = bar_group.select_one('rect.bar--filled')
            if filled_rect:
                count = filled_rect.get('data-value')
                if year and count:
                    issues_by_year[year] = int(count)
        
        return issues_by_year
    
    except Exception as e:
        print(f"Error processing {url}: {str(e)}")
        return {}

# Process all newspapers
results = []

total_newspapers = len(newspapers_df)
for idx, row in newspapers_df.iterrows():
    newspaper_name = row['name']
    newspaper_url = row['url']
    total_pages = row['pages']
    
    print(f"\nProcessing #{idx+1}/{total_newspapers}: {newspaper_name}")
    issues_by_year = get_issues_by_year(newspaper_url)
    
    result = {
        'name': newspaper_name,
        'url': newspaper_url,
        'total_pages': total_pages,
        'issues_by_year': issues_by_year
    }
    
    results.append(result)
    print(f"Found {len(issues_by_year)} years with issues")
    
    # Optional: Save intermediate results every 10 newspapers
    if (idx + 1) % 10 == 0:
        intermediate_path = f"../data/kb_newspapers_with_years_intermediate_{idx+1}.json"
        with open(intermediate_path, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False, indent=4)
        print(f"Saved intermediate results to {intermediate_path}")

# Save the detailed results to a JSON file
output_path = "../data/kb_newspapers_with_years_complete.json"
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=4)
    
print(f"\nSaved detailed information for {len(results)} newspapers to {output_path}")

# Display summary statistics
years_with_issues = sum(len(result['issues_by_year']) for result in results)
newspapers_with_data = sum(1 for result in results if result['issues_by_year'])

print("\nSummary:")
print(f"Total newspapers processed: {len(results)}")
print(f"Newspapers with year data: {newspapers_with_data}")
print(f"Total year-newspaper combinations: {years_with_issues}")

# Display a sample of the results (first 3 newspapers)
for result in results[:3]:
    print(f"\nNewspaper: {result['name']}")
    print(f"Total pages: {result['total_pages']}")
    print(f"Years with issues: {len(result['issues_by_year'])}")
    # Show up to 5 years as examples
    sample_years = list(result['issues_by_year'].items())[:5]
    if sample_years:
        print("Sample years (year: issue count):")
        for year, count in sample_years:
            print(f"  {year}: {count}")


Processing #1/1948: 8 sidor
Processing URL: https://tidningar.kb.se/search?isPartOf.%40id=https%3A%2F%2Flibris.kb.se%2Fwf7fbxb71333dp5%23it
Found 12 years with issues

Processing #2/1948: Academiska och stifts tidningar utg. i Lund för år 1773 af G.S.
Processing URL: https://tidningar.kb.se/search?isPartOf.%40id=https%3A%2F%2Flibris.kb.se%2F08667xhcx8zwx7mj%23it
Found 1 years with issues

Processing #3/1948: Adress- & varutidning
Processing URL: https://tidningar.kb.se/search?isPartOf.%40id=https%3A%2F%2Flibris.kb.se%2F4fr1qj1x2ddfhktc%23it
Found 2 years with issues

Processing #4/1948: Adress- hyres- och annonsbladet
Processing URL: https://tidningar.kb.se/search?isPartOf.%40id=https%3A%2F%2Flibris.kb.se%2Fcn08zv8s98fbzw73%23it
Found 1 years with issues

Processing #5/1948: Adress-kontor för Östergötland med Wadstena län
Processing URL: https://tidningar.kb.se/search?isPartOf.%40id=https%3A%2F%2Flibris.kb.se%2F7hcm0jr25pjw5dsc%23it
Found 1 years with issues

Processing #6/1948: Adres