# Analyse Query Results: Detect Versions vs Unique Resources

When you run a query in SolrWayback, it returns the archived resources that match your query.

However, you might want to know how many of them are unique, and how many are different versions of the same resource.

Different versions are not necessarily duplicates. The front page of a news site, for example, can differ greatly from one day to the next. However, when you use SolrWayback to define a corpus of resources for examination, it might be valuable to know whether some of the resources in your corpus are "overrepresented" due to a high number of versions.

This notebook lets you count the unique URLs in the corpus and detect which archived URLs appear several times.

#### JSONL data from SolrWayback
You will need a file in JSONL format, exported from your SolrWayback query result. The notebook uses an example from a query on "Jonas Gahr Støre", but you can replace it with data from your own queries.

For the operations in this notebook to work, the fields "warc_key_id", "url_norm" and "crawl_date" must be included in the export.

### Import libraries

First, we need to import the necessary libraries.

In [1]:
!pip install pandas
import pandas as pd



### Load data from SolrWayback

Then, we load the JSONL data set you exported from SolrWayback.

In [2]:
df = pd.read_json('data/solrwayback_JonasGahrStore.jsonl', lines=True)
display(df)

Unnamed: 0,warc_key_id,title,domain,content_type,crawl_date,crawl_year,url_norm
0,<urn:uuid:06add8d3-380a-4b3a-9cf5-2578f77d7f62>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1645812529000,2022,http://europabloggen.no/tag/jonas-gahr-støre
1,<urn:uuid:8de22051-ee5a-4d25-b7eb-0c1d48f1e007>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1646070193000,2022,http://europabloggen.no/tag/jonas-gahr-støre
2,<urn:uuid:db2fb7a5-0eb1-460e-b414-e77fa2d2b410>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1646330192000,2022,http://europabloggen.no/tag/jonas-gahr-støre
3,<urn:uuid:498cd143-f833-4c6c-aec6-fc721e25bc88>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1646765196000,2022,http://europabloggen.no/tag/jonas-gahr-støre
4,<urn:uuid:d65a4fb6-b316-480e-842c-d644d6c1ad49>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1646919904000,2022,http://europabloggen.no/tag/jonas-gahr-støre
...,...,...,...,...,...,...,...
6408,<urn:uuid:0647b30d-bcbe-4537-a95f-a6a0ce5aaffa>,,tv2.no,text/plain,1646061800000,2022,http://tv2.no:443/spesialer/api/open/articles/...
6409,<urn:uuid:fc1b7067-c8ba-434d-bb85-5184ac83914f>,,tv2.no,text/plain,1646323954000,2022,http://tv2.no:443/spesialer/api/open/articles/...
6410,<urn:uuid:f916850c-9474-4da1-8229-7bc507112d27>,,tv2.no,text/plain,1648824785000,2022,http://tv2.no:443/spesialer/api/open/articles/...
6411,<urn:uuid:ba5c7751-2470-4d77-acb9-c955b3da8f0d>,,vl.no,text/plain,1685060661000,2023,http://vl.no:443/pf/api/v3/content/fetch/conte...
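Before going further, it can be useful to confirm that the export really contains the three required fields. The following is a minimal sketch, not part of the original notebook: the two-line sample and the names `sample`, `sample_df`, `required_fields` and `missing_fields` are hypothetical and stand in for your own export.

```python
import io
import pandas as pd

# Hypothetical two-line sample standing in for a SolrWayback JSONL export.
sample = io.StringIO(
    '{"warc_key_id": "<urn:uuid:a>", "url_norm": "http://example.org/", "crawl_date": 1645812529000}\n'
    '{"warc_key_id": "<urn:uuid:b>", "url_norm": "http://example.org/", "crawl_date": 1646070193000}\n'
)
sample_df = pd.read_json(sample, lines=True)

# Fields the rest of the notebook relies on.
required_fields = ["warc_key_id", "url_norm", "crawl_date"]
missing_fields = [f for f in required_fields if f not in sample_df.columns]

if missing_fields:
    raise ValueError(f"Missing fields in the export: {missing_fields}")
print("All required fields are present.")
```

To run the same check on your own data, replace `sample_df` with the `df` loaded from your JSONL file.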


## 1. Convert timestamp

The timestamps in our JSONL file are in the machine-readable Unix format, given as milliseconds since 1 January 1970 (UTC). Before working with them, we convert them into the pandas datetime format.

In [4]:
df['crawl_date'] = pd.to_datetime(df['crawl_date'], unit='ms')
display(df)

Unnamed: 0,warc_key_id,title,domain,content_type,crawl_date,crawl_year,url_norm
0,<urn:uuid:06add8d3-380a-4b3a-9cf5-2578f77d7f62>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,2022-02-25 18:08:49,2022,http://europabloggen.no/tag/jonas-gahr-støre
1,<urn:uuid:8de22051-ee5a-4d25-b7eb-0c1d48f1e007>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,2022-02-28 17:43:13,2022,http://europabloggen.no/tag/jonas-gahr-støre
2,<urn:uuid:db2fb7a5-0eb1-460e-b414-e77fa2d2b410>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,2022-03-03 17:56:32,2022,http://europabloggen.no/tag/jonas-gahr-støre
3,<urn:uuid:498cd143-f833-4c6c-aec6-fc721e25bc88>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,2022-03-08 18:46:36,2022,http://europabloggen.no/tag/jonas-gahr-støre
4,<urn:uuid:d65a4fb6-b316-480e-842c-d644d6c1ad49>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,2022-03-10 13:45:04,2022,http://europabloggen.no/tag/jonas-gahr-støre
...,...,...,...,...,...,...,...
6408,<urn:uuid:0647b30d-bcbe-4537-a95f-a6a0ce5aaffa>,,tv2.no,text/plain,2022-02-28 15:23:20,2022,http://tv2.no:443/spesialer/api/open/articles/...
6409,<urn:uuid:fc1b7067-c8ba-434d-bb85-5184ac83914f>,,tv2.no,text/plain,2022-03-03 16:12:34,2022,http://tv2.no:443/spesialer/api/open/articles/...
6410,<urn:uuid:f916850c-9474-4da1-8229-7bc507112d27>,,tv2.no,text/plain,2022-04-01 14:53:05,2022,http://tv2.no:443/spesialer/api/open/articles/...
6411,<urn:uuid:ba5c7751-2470-4d77-acb9-c955b3da8f0d>,,vl.no,text/plain,2023-05-26 00:24:21,2023,http://vl.no:443/pf/api/v3/content/fetch/conte...
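To see what the conversion does to a single value, here is a small sketch using the `crawl_date` of the first row in the table above. The variable names are illustrative. Note that `pd.to_datetime` with `unit='ms'` returns a value without a time zone attached; since Unix time is counted from UTC, passing `utc=True` makes that explicit.

```python
import pandas as pd

# Value taken from the first row of the table above.
crawl_ms = 1645812529000

naive = pd.to_datetime(crawl_ms, unit="ms")            # no time zone attached
aware = pd.to_datetime(crawl_ms, unit="ms", utc=True)  # explicitly marked as UTC

print(naive)
print(aware)
```

Whether you need the time-zone-aware variant depends on your analysis; for counting versions, as in this notebook, the naive timestamps should be sufficient.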


## 2. Count resources (unique URLs and URLs with several versions)

Let us count the number of times each URL appears.

In [5]:
url_counts = df['url_norm'].value_counts()
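As an illustration of what `value_counts()` returns, here is a sketch on a tiny, made-up data set (the URLs and the names `toy`, `toy_counts`, `repeated` and `share` are hypothetical). It also shows how `normalize=True` gives each URL's share of all rows, which is one way to gauge whether a resource is overrepresented.

```python
import pandas as pd

# Hypothetical URLs, used only to illustrate value_counts().
toy = pd.DataFrame({"url_norm": [
    "http://example.org/a",
    "http://example.org/a",
    "http://example.org/a",
    "http://example.org/b",
]})

toy_counts = toy["url_norm"].value_counts()            # rows per URL, most frequent first
repeated = toy_counts[toy_counts > 1]                  # URLs that occur more than once
share = toy["url_norm"].value_counts(normalize=True)   # fraction of all rows per URL

print(repeated)
print(share)
```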

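Then, we filter for URLs that appear more than once in the data set, that is, URLs with more than one distinct crawl date.
To make the result easier to read, we limit the dataframe to the URL, title and version count columns.
Finally, we print a short summary and display a table of these URLs, sorted by number of versions.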

In [6]:
# Keep only rows whose URL was crawled at more than one distinct time
urls_with_multiple_timestamps = df.groupby('url_norm').filter(lambda x: x['crawl_date'].nunique() > 1)
# Count distinct crawl dates per (URL, title) pair; a URL whose title
# changed between crawls therefore appears on more than one row
urls_with_counts = urls_with_multiple_timestamps.groupby(['url_norm', 'title'])['crawl_date'].nunique().reset_index(name='versions')
urls_with_counts = urls_with_counts.sort_values(by='versions', ascending=False)

# Print summary message (\033[1m and \033[0m switch bold text on and off)
total_rows = df.shape[0]
unique_urls = len(df['url_norm'].unique())
multiple_versions = len(urls_with_counts)
print(f"Total number of lines in the dataset: \033[1m{total_rows}\033[0m \nNumber of unique resources: \033[1m{unique_urls}\033[0m \nNumber of resources archived in several versions: \033[1m{multiple_versions}\033[0m.")

display(urls_with_counts)

Total number of lines in the dataset: 6413
Number of unique resources: 3482
Number of resources archived in several versions: 979.


Unnamed: 0,url_norm,title,versions
898,http://sapmi.arbeiderpartiet.no:443/,Forsiden,25
189,http://friheten.no:443/,artikkel,10
901,http://sapmi.arbeiderpartiet.no:443/l/se/polit...,Válggaprogramma 2021-2025,9
907,http://sapmi.arbeiderpartiet.no:443/nyhetsbrev,Få nyhetsbrev fra Arbeiderpartiet!,8
0,http://abcnyheter.no:443/hvordan,Hvordan | ABC Nyheter,7
...,...,...,...
743,http://regjeringen.no:443/no/aktuelt/taler_art...,Taler og innlegg - regjeringen.no,2
237,http://nettavisen.no:443/norsk-debatt/tenk-om-...,"Norsk politikk, Ukraina | Tenk om Moxnes og Ly...",2
238,http://nettavisen.no:443/nyheter/blir-det-krig...,"Ukraina, Russland | Blir det krig i Norge? Det...",2
88,http://dagbladet.no:443/nyheter/tidslinje-russ...,Tidslinje: Russlands krig mot Ukraina,1
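The last row of the table lists a URL with only 1 version, even though the table is meant to show URLs with several versions. One likely explanation is that the count is grouped by both `url_norm` and `title`: a URL whose title changed between crawls is split into several rows, each with its own count. In addition, `groupby` drops rows with a missing key by default, so URLs without a title would not appear in the table at all. The following sketch, on hypothetical data, counts versions per URL only and attaches a title afterwards. It is one possible alternative, not the notebook's own method.

```python
import pandas as pd

# Hypothetical rows: one URL archived twice under two different titles,
# one URL archived twice without a title, and one URL archived once.
toy = pd.DataFrame({
    "url_norm": ["http://example.org/a", "http://example.org/a",
                 "http://example.org/b", "http://example.org/b",
                 "http://example.org/c"],
    "title": ["Old title", "New title", None, None, "Single"],
    "crawl_date": pd.to_datetime(["2022-01-01", "2022-01-02",
                                  "2022-01-01", "2022-01-03",
                                  "2022-01-01"]),
})

# Count distinct crawl dates per URL only, then keep URLs with more than one.
versions = toy.groupby("url_norm")["crawl_date"].nunique()
versions = versions[versions > 1].sort_values(ascending=False)

# Attach the most recent non-missing title for readability (may stay missing).
latest_title = toy.sort_values("crawl_date").groupby("url_norm")["title"].last()
per_url = pd.DataFrame({
    "versions": versions,
    "title": latest_title.reindex(versions.index),
})

print(per_url)
```

With this approach, `len(per_url)` counts URLs rather than URL and title pairs, so it may differ from the "several versions" figure printed above when run on the real data.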
