# Plotting Newspaper Data
An example of how to use the eLuxemburgensia digital collection together with plotting libraries to display newspaper data visually.

This project is contained in a single Jupyter notebook. The notebook asks the user for a date range, uses those dates to select the newspapers published during that period, and then plots their publishing density over time.

## Requirements
* Python 3.12
* [requests](https://pypi.org/project/requests/): HTTP library used to query the eLuxemburgensia API
* [pandas](https://pandas.pydata.org/): formats the output into a tabular layout that seaborn can read to plot the data
* [matplotlib](https://matplotlib.org/): customises the plot (figure size, title, etc.)
* [seaborn](https://seaborn.pydata.org/index.html): plots the newspaper density

In [None]:
%pip install requests
%pip install pandas
%pip install matplotlib
%pip install seaborn

In [None]:
from datetime import datetime
import requests
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Request the start date from the user
while True:
    input_date = input("Enter the start date (dd/mm/yyyy):")
    try:
        start_date_value = datetime.strptime(input_date, '%d/%m/%Y')
        break
    except ValueError:
        # strptime raises ValueError when the input does not match the format
        print("Please enter a valid date in the format dd/mm/yyyy.")

In [None]:
# Request the end date from the user
while True:
    input_date = input("Enter the end date (dd/mm/yyyy):")
    try:
        end_date_value = datetime.strptime(input_date, '%d/%m/%Y')
        break
    except ValueError:
        # strptime raises ValueError when the input does not match the format
        print("Please enter a valid date in the format dd/mm/yyyy.")
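
The two input cells above accept any pair of valid dates, including an end date that falls before the start date. A minimal sketch of how the parsing and a range check could be factored out is shown below. The helper names `parse_date` and `is_valid_range` are hypothetical and not part of the notebook; only `datetime.strptime` with the notebook's `%d/%m/%Y` format is taken from the cells above.

```python
from datetime import datetime


def parse_date(text, fmt="%d/%m/%Y"):
    """Return a datetime for text in dd/mm/yyyy form, or None if it does not parse."""
    try:
        return datetime.strptime(text.strip(), fmt)
    except ValueError:
        return None


def is_valid_range(start, end):
    """True when both dates parsed and the start is not after the end."""
    return start is not None and end is not None and start <= end


start = parse_date("01/01/1900")
end = parse_date("31/12/1910")
print(is_valid_range(start, end))
print(parse_date("31/02/1900"))  # 31 February does not exist, so this is None
```

A check like `is_valid_range` could be run after both prompts so that the user is asked again when the range is reversed.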

In [None]:
# Get the BnL eLuxemburgensia collection
elux_response = requests.get("https://viewer.eluxemburgensia.lu/api/viewer2/cms/v2/digitalcollections")
elux_collection = elux_response.json()

In [None]:
filtered_newspapers = []
df_first = True
for newspaper in elux_collection["data"]:
    newspaper_dict = {}
    newspaper_paperid = newspaper["paperid"]
    newspaper_start_date = newspaper["startdate"]
    try:
        newspaper_end_date = newspaper["enddate"]
        newspaper_end_date_print = newspaper_end_date
    except KeyError:
        # if no end date, we set the end date to 9999-12-31 in order to be able to easily compare the dates
        newspaper_end_date = "9999-12-31"
        # but we don't want to print 9999-12-31 so we set the print end date to an empty string
        newspaper_end_date_print = ""
        
    if newspaper_start_date <= end_date_value.strftime("%Y-%m-%d") and newspaper_end_date >= start_date_value.strftime("%Y-%m-%d"):
        newspaper_countbyyear_result = requests.get("https://viewer.eluxemburgensia.lu/api/viewer2/collections/" + newspaper_paperid + "/countByYear")
        newspaper_countbyyear = newspaper_countbyyear_result.json()
        # if the api returned data for the given paperid, then store it in the list of chosen newspapers
        # and concatenate the count by year data to the dataframe that will be used for plotting
        if newspaper_countbyyear["status"] == "OK":
            # only keep data for the years we are interested in
            countbyyear_data = []
            for data_entry in newspaper_countbyyear["data"]:
                if data_entry["year"] >= start_date_value.year and data_entry["year"] <= end_date_value.year:
                    countbyyear_data.append(data_entry)

            # if nothing in the count by year data then skip this newspaper and move to the next one.
            # this shouldn't happen since we've already checked that the newspaper was published during our selected time frame.
            if len(countbyyear_data) == 0:
                continue
                
            countbyyear_df = pd.DataFrame(countbyyear_data, columns=["year", "n"])
            countbyyear_df.insert(0,"Title",newspaper["title"])
            newspaper_dict = {'Title': newspaper["title"], 'PaperId': newspaper_paperid, 'Start Date': newspaper_start_date, 'End Date': newspaper_end_date_print, 'CountByYear':  newspaper_countbyyear["data"]}                 
            filtered_newspapers.append(newspaper_dict)
            # if this is the first entry, simply create a new dataframe
            if df_first:
                final_df = pd.DataFrame(countbyyear_df)
                df_first = False
            else:
                # concatenate the new dataframe to the final dataframe that includes all chosen newspapers 
                final_df = pd.concat([final_df, countbyyear_df], ignore_index=True)
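
The selection step in the loop above keeps a newspaper when its publication period overlaps the requested range. The following is a minimal sketch of that test in isolation. It assumes, as the notebook does, that the dates are `YYYY-MM-DD` strings, which sort correctly when compared as plain strings. The function name `overlaps` and the sample dates are hypothetical and used only for illustration.

```python
def overlaps(paper_start, paper_end, range_start, range_end):
    """Return True when [paper_start, paper_end] intersects [range_start, range_end].

    All arguments are ISO 'YYYY-MM-DD' strings. A missing end date (None or "")
    is treated as open-ended, mirroring the notebook's 9999-12-31 sentinel.
    """
    paper_end = paper_end or "9999-12-31"
    return paper_start <= range_end and paper_end >= range_start


# Requested range: 1900-01-01 to 1910-12-31
print(overlaps("1848-03-23", "1950-12-31", "1900-01-01", "1910-12-31"))  # spans the range
print(overlaps("1848-03-23", "1860-01-01", "1900-01-01", "1910-12-31"))  # ended before it
print(overlaps("1905-01-01", None, "1900-01-01", "1910-12-31"))          # no end date
```

Two intervals overlap exactly when each one starts no later than the other ends, which is why two comparisons are enough.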

In [None]:
# As the violin plot counts the number of entries per year, 
# repeat years by number of issues to weight the distribution
# so if a newspaper had 50 issues in a year, that year is repeated 50 times
weighted_data = final_df.loc[final_df.index.repeat(final_df["n"])].reset_index(drop=True)
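
To make the weighting step above concrete, here is a small worked example on a toy dataframe. The titles and counts are invented for illustration and do not come from the eLuxemburgensia API; the repeat expression is the same one used in the cell above.

```python
import pandas as pd

# Toy data: "Paper A" has 2 issues in 1900 and 1 in 1901, "Paper B" has 3 in 1900
toy_df = pd.DataFrame({
    "Title": ["Paper A", "Paper A", "Paper B"],
    "year": [1900, 1901, 1900],
    "n": [2, 1, 3],
})

# Repeat each row n times so that every issue becomes one row
toy_weighted = toy_df.loc[toy_df.index.repeat(toy_df["n"])].reset_index(drop=True)

print(len(toy_weighted))               # 6 rows, one per issue
print(toy_weighted["year"].tolist())   # [1900, 1900, 1901, 1900, 1900, 1900]
```

Each row now stands for a single issue, so a density estimate over `year` reflects how many issues were published rather than how many years have data.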

In [None]:
plt.figure(figsize=(14, 8))
sns.violinplot(data=weighted_data,x="year", y="Title", inner=None, density_norm="count", cut=0)
plt.title("Newspaper Issue Density Over Time for Newspapers published between " + start_date_value.strftime('%d/%m/%Y') + " and " + end_date_value.strftime('%d/%m/%Y') )
plt.xlabel("Year")
plt.ylabel("Newspaper Title")
plt.tight_layout()
plt.grid(True,axis='x', linestyle='--', alpha=0.3)

In [None]:
plt.figure(figsize=(15, 10))
sns.barplot(data=final_df,x="n", y="Title", hue="year", errorbar=None)
plt.title("Number of Issues per Year for Newspapers published between " + start_date_value.strftime('%d/%m/%Y') + " and " + end_date_value.strftime('%d/%m/%Y'))
plt.xlabel("Number of Issues")
plt.ylabel("Newspaper Title")
plt.grid(True, axis='x', linestyle='--', alpha=0.3)