### *Requirements*
*To use this notebook, you need an export of your search result from SolrWayback. (See **[SolrWayback > Export](https://nlnwa.github.io/research-services/docs/solrwayback/solrwayback-5export.html)**.)*

*The exported data:*
- *must be in the JSONL format,*
- *must contain the **'url_norm'** and **'crawl_date'** fields.*
- *should also contain the **'warc_key_id'** field.*

*It is highly recommended that the number of exported results is below 20,000. If your data is based on a search result with vastly more hits, you should reduce the scope, e.g. by applying facets for specific domains or crawl years.*

# Analyse Query Results: Detect Unique URLs vs Multiple Versions

Working with big data often requires some initial analysis to get a sense of what the data contains.

This notebook lets you analyse how many of the resources in a SolrWayback query result have *unique URLs*, and how many are different *versions* of the same URL.

Different versions are not necessarily duplicates. The front page of a news site will often be very different from one day to the next. However, if you are using SolrWayback to define a corpus of resources you will examine, it is valuable to know whether some of these resources are potentially "overrepresented" due to a high number of versions.

This notebook lets you analyse the number of unique URLs in the corpus and detect which archived URLs appear several times.

## Import libraries

First, we need to import the necessary libraries.

In [None]:
import pandas as pd

## Load data from SolrWayback

Then, we load the JSONL data set you exported from SolrWayback.

In [None]:
df = pd.read_json('../data/solrwayback_JonasGahrStore.jsonl', lines=True)
display(df)
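
Before going further, it can help to confirm that the export contains the fields listed in the requirements. The following is a minimal sketch on a toy data frame; the helper name `missing_fields` and the sample values are hypothetical and only serve as illustration.

```python
import pandas as pd

# Field names taken from the requirements at the top of this notebook
REQUIRED_FIELDS = ['url_norm', 'crawl_date']
RECOMMENDED_FIELDS = ['warc_key_id']

def missing_fields(frame, fields):
    """Return the names in `fields` that are not columns of `frame` (hypothetical helper)."""
    return [name for name in fields if name not in frame.columns]

# Toy stand-in for the exported data (made-up values)
sample = pd.DataFrame({
    'url_norm': ['http://example.com/'],
    'crawl_date': [1609459200000],
})

missing_required = missing_fields(sample, REQUIRED_FIELDS)
missing_recommended = missing_fields(sample, RECOMMENDED_FIELDS)
print("Missing required fields:", missing_required)
print("Missing recommended fields:", missing_recommended)
```

In the notebook itself, you would pass `df` instead of `sample`. An empty list for the required fields means the rest of the notebook should be able to run.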

### 1. Convert timestamp

The timestamps in our JSONL file are in the machine-readable Unix format (the code below treats them as milliseconds since 1 January 1970). Before working with them, we convert them into the pandas datetime format.

In [None]:
df['crawl_date'] = pd.to_datetime(df['crawl_date'], unit='ms')
display(df)
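
To see what this conversion does, here is a small sketch on made-up values. It assumes, like the cell above, that the timestamps are milliseconds since the Unix epoch; if your export stores seconds, `unit='s'` would be the matching choice.

```python
import pandas as pd

# Made-up crawl dates as milliseconds since 1970-01-01 00:00:00 UTC
raw = pd.Series([1609459200000, 1609545600000], name='crawl_date')

# unit='ms' tells pandas how to interpret the integers
converted = pd.to_datetime(raw, unit='ms')
print(converted)
```

A quick look at the earliest and latest converted values (`converted.min()`, `converted.max()`) is a simple way to check that the unit is right: dates close to 1970 usually mean the wrong unit was chosen.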

## 2. Count resources (unique URLs and URLs with several versions)

Let us count how many times each URL appears.

In [None]:
url_counts = df['url_norm'].value_counts()
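
As an illustration of what `value_counts()` returns, the sketch below uses a toy data frame with made-up URLs and splits the counts into URLs that appear once and URLs that appear several times. Note that this counts rows per URL, not distinct crawl dates.

```python
import pandas as pd

# Toy data: one URL appears three times, two URLs appear once (made-up values)
toy = pd.DataFrame({'url_norm': [
    'http://a.example/',
    'http://a.example/',
    'http://b.example/',
    'http://a.example/',
    'http://c.example/',
]})

# One entry per URL, sorted from most to least frequent
toy_counts = toy['url_norm'].value_counts()

archived_once = int((toy_counts == 1).sum())
archived_repeatedly = int((toy_counts > 1).sum())
print(toy_counts)
print("URLs appearing once:", archived_once)
print("URLs appearing several times:", archived_repeatedly)
```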

Then, we want to keep only the URLs that were archived more than once, i.e. with more than one distinct crawl date.
To make it easier to read, we limit the dataframe to the URL, title and count columns.
Then, we display a table with these URLs, sorted by number of versions.

In [None]:
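def count_unique_vs_version(df):
    # Keep only rows whose URL was archived at more than one distinct crawl time
    urls_with_multiple_timestamps = df.groupby('url_norm').filter(lambda x: x['crawl_date'].nunique() > 1)
    # Count distinct crawl times per URL and title. This needs a 'title' column in the export;
    # dropna=False (pandas >= 1.1) keeps URLs whose title is missing
    urls_with_counts = urls_with_multiple_timestamps.groupby(['url_norm', 'title'], dropna=False)['crawl_date'].nunique().reset_index(name='versions')
    # Show the most frequently archived URLs first
    urls_with_counts = urls_with_counts.sort_values(by='versions', ascending=False)
    return urls_with_counts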

# Call the function to get urls_with_counts
urls_with_counts = count_unique_vs_version(df)

# Print summary message
total_rows = df.shape[0]
unique_urls = len(df['url_norm'].unique())
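# Count distinct URLs: a URL whose title changed between crawls spans several rows
multiple_versions = urls_with_counts['url_norm'].nunique()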
print(f"Total number of lines in the dataset: \033[1m{total_rows}\033[0m \nNumber of unique resources: \033[1m{unique_urls}\033[0m \nNumber of resources archived in several versions: \033[1m{multiple_versions}\033[0m.")

display(urls_with_counts)
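
To get a rough impression of how strongly a URL with many versions might be "overrepresented", one option is to express the number of versions per URL as a share of all archived versions. This is only a sketch on made-up data, not part of the original analysis; the variable names `versions` and `share` are introduced here for illustration.

```python
import pandas as pd

# Toy data: one URL archived on three days, another archived once (made-up values)
toy = pd.DataFrame({
    'url_norm': ['http://a.example/'] * 3 + ['http://b.example/'],
    'crawl_date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-01']),
})

# Number of distinct crawl dates per URL
versions = toy.groupby('url_norm')['crawl_date'].nunique()

# Each URL's share of all archived versions, largest first
share = (versions / versions.sum()).sort_values(ascending=False)
print(share)
```

In the notebook, the same two lines could be applied to `df` after the timestamp conversion. A URL with a very large share may be worth a closer look before the corpus is used for further analysis.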