# Analyse Search Results: Domains, Content Type and Year

Working with big data usually starts with an exploratory analysis to get an overview of what the data contains.

This notebook lets you analyse the results of a search in SolrWayback by checking how the resources are distributed by domain, media type and crawl year.

If you are new to Python or programming - DON'T WORRY! The narrative between the code cells explains what the script is doing. To run a cell, simply press SHIFT+Enter or the "Run" button at the top.

NB!
You will need a file in JSONL format, exported from your SolrWayback query result (see [documentation]()). For the operations in this notebook to work, the fields "domain", "content_type" and "crawl_year" must be included in the export.
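
Before loading the export, it can help to confirm that the required fields are really there. The following is a minimal sketch using only the standard library; the helper name `find_missing_fields` and the two-record sample are hypothetical and only stand in for a real export.

```python
import json

# The three fields this notebook relies on.
REQUIRED_FIELDS = {"domain", "content_type", "crawl_year"}


def find_missing_fields(jsonl_lines, required=REQUIRED_FIELDS):
    """Return the required fields that are absent from at least one record."""
    missing = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)  # each JSONL line is one JSON object
        missing |= required.difference(record)
    return missing


# Hypothetical sample: the second record lacks "crawl_year".
sample = [
    '{"domain": "example.no", "content_type": "text/html", "crawl_year": 2022}',
    '{"domain": "example.no", "content_type": "text/plain"}',
]
print(find_missing_fields(sample))
```

To check a real export, open the file and pass the file object instead of the sample list, for example `with open(path, encoding="utf-8") as f: find_missing_fields(f)`, where `path` points to your own JSONL file. An empty set means all three fields are present in every record.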

### Import libraries

First, we need to import the necessary libraries.

In [2]:
import pandas as pd
import plotly.express as px

### Load data from SolrWayback

Load the JSONL data set you exported from SolrWayback. *(The display function on line 2 outputs a table with the first and last 5 rows of your data.)*

In [3]:
df = pd.read_json('data/solrwayback_JonasGahrStore.jsonl', lines=True)
display(df)

Unnamed: 0,warc_key_id,title,domain,content_type,crawl_date,crawl_year,url_norm
0,<urn:uuid:06add8d3-380a-4b3a-9cf5-2578f77d7f62>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1645812529000,2022,http://europabloggen.no/tag/jonas-gahr-støre
1,<urn:uuid:8de22051-ee5a-4d25-b7eb-0c1d48f1e007>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1646070193000,2022,http://europabloggen.no/tag/jonas-gahr-støre
2,<urn:uuid:db2fb7a5-0eb1-460e-b414-e77fa2d2b410>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1646330192000,2022,http://europabloggen.no/tag/jonas-gahr-støre
3,<urn:uuid:498cd143-f833-4c6c-aec6-fc721e25bc88>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1646765196000,2022,http://europabloggen.no/tag/jonas-gahr-støre
4,<urn:uuid:d65a4fb6-b316-480e-842c-d644d6c1ad49>,Europabloggen.no » Jonas Gahr Støre,europabloggen.no,application/xhtml+xml,1646919904000,2022,http://europabloggen.no/tag/jonas-gahr-støre
...,...,...,...,...,...,...,...
6408,<urn:uuid:0647b30d-bcbe-4537-a95f-a6a0ce5aaffa>,,tv2.no,text/plain,1646061800000,2022,http://tv2.no:443/spesialer/api/open/articles/...
6409,<urn:uuid:fc1b7067-c8ba-434d-bb85-5184ac83914f>,,tv2.no,text/plain,1646323954000,2022,http://tv2.no:443/spesialer/api/open/articles/...
6410,<urn:uuid:f916850c-9474-4da1-8229-7bc507112d27>,,tv2.no,text/plain,1648824785000,2022,http://tv2.no:443/spesialer/api/open/articles/...
6411,<urn:uuid:ba5c7751-2470-4d77-acb9-c955b3da8f0d>,,vl.no,text/plain,1685060661000,2023,http://vl.no:443/pf/api/v3/content/fetch/conte...
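
The `crawl_date` values in the table look like Unix timestamps in milliseconds. That is an assumption based on their size, not something stated in the export. Under that assumption, the sketch below converts two values copied from the table into readable timestamps and checks that their year agrees with `crawl_year`.

```python
import pandas as pd

# Two crawl_date / crawl_year pairs copied from the table above.
sample = pd.DataFrame({
    "crawl_date": [1645812529000, 1685060661000],
    "crawl_year": [2022, 2023],
})

# Assumption: crawl_date counts milliseconds since 1970-01-01 (UTC).
sample["crawl_datetime"] = pd.to_datetime(sample["crawl_date"], unit="ms")

# Cross-check: the year of the converted timestamp should equal crawl_year.
sample["year_matches"] = sample["crawl_datetime"].dt.year == sample["crawl_year"]
print(sample)
```

If `year_matches` is `True` for every row, the millisecond interpretation is at least consistent with the `crawl_year` column.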


## 1. Domain stats

### Count

Now, we can count the number of occurrences of each domain, and sort them in descending order.

In [4]:
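# Count how many rows (captures) each domain has
domain_counts = df['domain'].value_counts().reset_index()
# Name the columns explicitly, since the defaults differ between pandas versions
domain_counts.columns = ['domain', 'count']
# Sort with the most frequent domain first
domain_counts_sorted = domain_counts.sort_values('count', ascending=False)
display(domain_counts_sorted)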

Unnamed: 0,domain,count
0,regjeringen.no,2556
1,vg.no,653
2,lp4.io,533
3,dagbladet.no,263
4,europabloggen.no,240
...,...,...
88,iris-core-ap.schibsted.tech,1
87,derimot.no,1
86,svalbardposten.no,1
85,videostep.com,1
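
Note that these counts are per row, and the data preview shows the same `url_norm` captured several times. A domain can therefore rank high because a few pages were crawled often. The sketch below is one possible way to compare captures with distinct URLs per domain; the miniature data set and its URLs are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of a search result: one URL captured three times
# on the first domain, two different URLs captured once each on the second.
captures = pd.DataFrame({
    "domain": ["example.no", "example.no", "example.no",
               "example.org", "example.org"],
    "url_norm": ["http://example.no/a", "http://example.no/a",
                 "http://example.no/a",
                 "http://example.org/x", "http://example.org/y"],
})

# For each domain: number of rows (captures) and number of distinct URLs.
per_domain = (
    captures.groupby("domain")
    .agg(captures=("url_norm", "count"), unique_urls=("url_norm", "nunique"))
    .reset_index()
    .sort_values("captures", ascending=False)
)
print(per_domain)
```

Applied to the notebook's own `df`, the same `groupby` would show whether a domain's high count reflects many pages or repeated crawls of a few.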


### Visualisation

Since we have counted the number of resources per domain, it is also possible to visualise the distribution.

We will use the **plotly** graphing tool to generate an interactive pie chart.

*TIP: You can move your mouse over the pie chart to display the data for a slice, and activate/deactivate domains in the legend on the right.*

In [6]:
fig = px.pie(domain_counts_sorted, values='count', names='domain')
fig.update_traces(textposition='inside', textinfo='label')
fig.update_layout(title_text="Distribution of resources per domain", font_size=15)
fig.show()
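
With many domains that occur only once, the pie chart can become crowded. One option, sketched here as a suggestion rather than as part of the original workflow, is to keep the largest domains and merge the rest into a single slice. The helper `group_small_domains` and its `top_n` parameter are hypothetical; the sample rows are copied from the count table above.

```python
import pandas as pd


def group_small_domains(counts, top_n=10, other_label="Other"):
    """Keep the top_n largest domains and sum the remainder into one row."""
    ordered = counts.sort_values("count", ascending=False)
    top = ordered.head(top_n)
    rest_total = ordered["count"].iloc[top_n:].sum()
    if rest_total == 0:
        return top.reset_index(drop=True)  # nothing left to merge
    other = pd.DataFrame({"domain": [other_label], "count": [rest_total]})
    return pd.concat([top, other], ignore_index=True)


# A few rows copied from the domain count table.
sample_counts = pd.DataFrame({
    "domain": ["regjeringen.no", "vg.no", "lp4.io", "dagbladet.no",
               "europabloggen.no", "derimot.no", "svalbardposten.no"],
    "count": [2556, 653, 533, 263, 240, 1, 1],
})
grouped = group_small_domains(sample_counts, top_n=3)
print(grouped)
```

The resulting data frame has the same `domain` and `count` columns as `domain_counts_sorted`, so it can be passed to `px.pie` in the same way.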

## 2. Media type

### Count

As with the domains above, we can also count the occurrences of each media type/subtype.

*(Media type is synonymous with what is called "Content type" in WARC records.)*

In [8]:
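# Count how many rows each content type has
content_type_counts = df['content_type'].value_counts().reset_index()
# Name the columns explicitly, since the defaults differ between pandas versions
content_type_counts.columns = ['content_type', 'count']
# Sort with the most frequent content type first
content_type_sorted = content_type_counts.sort_values('count', ascending=False)
print(content_type_sorted)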

                       content_type  count
0                         text/html   4647
1                        text/plain    655
2             application/xhtml+xml    241
3          application/octet-stream    104
4                        image/jpeg     84
5                         image/gif     47
6                     text/calendar     12
7          text/html; charset=UTF-8     11
8                   application/pdf     11
9                  application/json      7
10  application/json; charset=utf-8      5
11                  text/javascript      4
12              application/x-empty      2
13   application/json;charset=UTF-8      2
14             application/atom+xml      1
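
The output lists some media types more than once because of parameters such as `charset` (for example `text/html` and `text/html; charset=UTF-8`). If you want those counted together, one approach is to strip everything after the first semicolon before counting. This is a minimal sketch on a small sample with values taken from the output above; the variable names are hypothetical.

```python
import pandas as pd

# Sample values taken from the output above.
content_types = pd.Series([
    "text/html",
    "text/html; charset=UTF-8",
    "application/json",
    "application/json; charset=utf-8",
    "application/json;charset=UTF-8",
])

# Keep only the type/subtype part: drop parameters, whitespace and case.
media_type = content_types.str.split(";").str[0].str.strip().str.lower()

media_type_counts = media_type.value_counts()
print(media_type_counts)
```

In the notebook, the same expression could be applied to `df['content_type']` before calling `value_counts()`.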


### Visualisation

Run the code block below to visualise the distribution of each content type/subtype.

In [9]:
fig = px.pie(content_type_sorted, values='count', names='content_type')
fig.update_traces(textposition='inside', textinfo='label')
fig.update_layout(title_text="Distribution of resources by content type", font_size=15)
fig.show()

## 3. Crawl year

### Count

The code block below will count the number of resources per crawl year, and print the result.

In [10]:
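# Count how many rows each crawl year has
crawl_year_counts = df['crawl_year'].value_counts().reset_index()
# Name the columns explicitly, since the defaults differ between pandas versions
crawl_year_counts.columns = ['crawl_year', 'count']
# Sort chronologically rather than by count
crawl_year_sorted = crawl_year_counts.sort_values('crawl_year')
print(crawl_year_sorted)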

   crawl_year  count
4        2001     13
2        2020     30
3        2021     17
0        2022   5825
1        2023    528
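
Absolute counts can be hard to compare when one year dominates. As an optional extra step, the sketch below adds each year's share of the total as a percentage. The data frame is rebuilt from the output above so that the example is self-contained; the column name `share_pct` is hypothetical.

```python
import pandas as pd

# Counts per crawl year, copied from the output above.
year_counts = pd.DataFrame({
    "crawl_year": [2001, 2020, 2021, 2022, 2023],
    "count": [13, 30, 17, 5825, 528],
})

# Each year's share of all resources, as a percentage with two decimals.
total = year_counts["count"].sum()
year_counts["share_pct"] = (year_counts["count"] / total * 100).round(2)
print(year_counts)
```

In the notebook, the same two lines can be applied directly to `crawl_year_sorted`.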


### Visualisation

Then, we want to visualise their distribution over time, using **plotly** to make a bar chart.

In [11]:
fig = px.bar(crawl_year_sorted, y='count', x='crawl_year', text_auto='.2s')
fig.show()