# Analyse Search Results: Domains, Content Type and Year

Working with big data usually starts with an exploratory analysis to get an overview of what the data contains.

This notebook will allow you to analyse results from a search in SolrWayback, by checking how the resources are distributed by domain, media type and crawl year.

If you are new to Python or programming, don't worry! The narrative between the code cells explains what each step does. To run a cell, press SHIFT+Enter or click the "Run" button at the top.

NB!
You will need a file in JSONL format, exported from your SolrWayback query result (see [documentation]()). For this notebook to work, the fields "domain", "content_type" and "crawl_year" must be included in the export.

### Import libraries

First, we need to install and import the necessary libraries.

In [None]:
!pip install pandas
!pip install plotly
import pandas as pd
import plotly.express as px

### Load data from SolrWayback

Then, we load the JSONL data set you exported from SolrWayback.

In [41]:
df = pd.read_json('data/solrwayback_JonasGahrStore.jsonl', lines=True)
print(df)

                domain           content_type  crawl_year
0     europabloggen.no  application/xhtml+xml        2022
1     europabloggen.no  application/xhtml+xml        2022
2     europabloggen.no  application/xhtml+xml        2022
3     europabloggen.no  application/xhtml+xml        2022
4     europabloggen.no  application/xhtml+xml        2022
...                ...                    ...         ...
7416             vl.no             text/plain        2023
7417            tv2.no             text/plain        2022
7418            tv2.no             text/plain        2022
7419            tv2.no             text/plain        2022
7420     dagsavisen.no             text/plain        2023

[7421 rows x 3 columns]
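
Because the rest of the notebook depends on the three exported fields, it can help to confirm they are present before continuing. The following is a minimal sketch; the helper `missing_fields` and the example column list are hypothetical and not part of SolrWayback or pandas.

```python
# Field names the notebook relies on (taken from the export shown above).
REQUIRED_FIELDS = ("domain", "content_type", "crawl_year")


def missing_fields(columns, required=REQUIRED_FIELDS):
    """Return the required field names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical stand-in for list(df.columns) from an export without crawl_year.
example_columns = ["domain", "content_type", "title"]
print(missing_fields(example_columns))
```

In the notebook you could call `missing_fields(df.columns)`; an empty list means all three fields were included in the export.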


## 1. Domain stats

### Count

The code block below will count the occurrences of each domain, and sort them in descending order.

In [42]:
domain_counts = df['domain'].value_counts().reset_index()
domain_counts.columns = ['domain', 'count']
domain_counts_sorted = domain_counts.sort_values('count', ascending=False)
print(domain_counts_sorted)

                          domain  count
0                 regjeringen.no   3504
1                          vg.no    674
2                         lp4.io    534
3                   dagbladet.no    263
4               europabloggen.no    240
..                           ...    ...
89   iris-core-ap.schibsted.tech      1
88              opinionstage.com      1
87                        auf.no      1
86                       spkt.io      1
106                     dagen.no      1

[107 rows x 2 columns]


### Visualisation

Since we have counted the number of resources per domain, it is possible to visualise their distribution.

We will use a graphing tool called **plotly** to generate an interactive pie chart.

In [43]:
fig = px.pie(domain_counts_sorted, values='count', names='domain')
fig.update_traces(textposition='inside', textinfo='label')
fig.update_layout(title_text="Distribution of resources per domain", font_size=15)
fig.show()
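
With more than a hundred domains, most slices of the pie are too thin to read. One possible remedy, sketched below, is to keep the largest domains and merge the rest into a single slice. The helper `group_small_slices`, its `top_n` parameter and the "other" label are assumptions made for this illustration, and the sample table reuses only a few counts from the output above.

```python
import pandas as pd


def group_small_slices(counts, label_col, top_n=10, other_label="other"):
    """Keep the `top_n` largest rows and sum the remainder into one extra row."""
    ordered = counts.sort_values("count", ascending=False)
    if len(ordered) <= top_n:
        # Nothing to merge: return the table as it is, with a clean index.
        return ordered.reset_index(drop=True)
    top = ordered.head(top_n)
    rest_total = int(ordered["count"].iloc[top_n:].sum())
    other = pd.DataFrame({label_col: [other_label], "count": [rest_total]})
    return pd.concat([top, other], ignore_index=True)


# Small illustrative table, not the full result set.
sample = pd.DataFrame({
    "domain": ["regjeringen.no", "vg.no", "lp4.io", "auf.no", "dagen.no"],
    "count": [3504, 674, 534, 1, 1],
})
print(group_small_slices(sample, "domain", top_n=3))
```

The returned table can be passed to `px.pie` in place of `domain_counts_sorted`, using the same `values='count'` and `names='domain'` arguments.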

## 2. Media type

### Count

The code block below will count the occurrences of each media type/subtype, and sort them in descending order.

*(Media type is synonymous with "content type" in WARC records.)*

In [44]:
content_type_counts = df['content_type'].value_counts().reset_index()
content_type_counts.columns = ['content_type', 'count']
content_type_sorted = content_type_counts.sort_values('count', ascending=False)
print(content_type_sorted)

                       content_type  count
0                         text/html   5642
1                        text/plain    661
2             application/xhtml+xml    241
3          application/octet-stream    104
4                        image/jpeg     84
5                         image/gif     47
6                     text/calendar     12
7                   application/pdf     12
8          text/html; charset=UTF-8     11
9                  application/json      7
10  application/json; charset=utf-8      5
11                  text/javascript      4
12              application/x-empty      2
13   application/json;charset=UTF-8      2
14             application/atom+xml      1
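
Note that the table counts `text/html` and `text/html; charset=UTF-8` separately, and that `application/json` appears in three spellings. If you want these merged, one option is to strip the parameters after the semicolon before counting. This is a minimal sketch; the function name `normalise_content_type` is hypothetical.

```python
def normalise_content_type(value):
    """Drop parameters such as charset and lower-case the type/subtype."""
    return value.split(";", 1)[0].strip().lower()


# Values taken from the table above.
examples = [
    "text/html",
    "text/html; charset=UTF-8",
    "application/json;charset=UTF-8",
]
print([normalise_content_type(v) for v in examples])
# -> ['text/html', 'text/html', 'application/json']
```

In the notebook, `df['content_type'].map(normalise_content_type).value_counts()` would then give the merged counts, assuming every row has a string value in that field.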


### Visualisation

Now that we have counted the number of resources per media type, we can visualise their distribution.

We will use a graphing package for Python called **plotly** to generate an interactive pie chart.

In [45]:
fig = px.pie(content_type_sorted, values='count', names='content_type')
fig.update_traces(textposition='inside', textinfo='label')
fig.update_layout(title_text="Distribution of resources by content type", font_size=15)
fig.show()

## 3. Crawl year

### Count

The code block below will count the number of resources per year, sort the result chronologically, and print it.

In [46]:
crawl_year_counts = df['crawl_year'].value_counts().reset_index()
crawl_year_counts.columns = ['crawl_year', 'count']
crawl_year_sorted = crawl_year_counts.sort_values('crawl_year')
print(crawl_year_sorted)

   crawl_year  count
4        2001     13
2        2020     30
3        2021     17
0        2022   6821
1        2023    540
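
The counts are heavily skewed towards 2022, so a percentage column can make the proportions easier to read. The sketch below rebuilds the table from the numbers printed above; the names `year_counts` and `share_percent` are chosen for this illustration.

```python
import pandas as pd

# Counts copied from the table above.
year_counts = pd.DataFrame({
    "crawl_year": [2001, 2020, 2021, 2022, 2023],
    "count": [13, 30, 17, 6821, 540],
})

# Each year's share of all resources, in percent, rounded to one decimal.
with_share = year_counts.assign(
    share_percent=lambda t: (100 * t["count"] / t["count"].sum()).round(1)
)
print(with_share)
```

In the notebook the same `assign` call can be applied directly to `crawl_year_sorted`.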


### Visualisation

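Finally, we visualise the distribution of resources over time, using **plotly** to make a bar chart. The `text_auto='.2s'` argument labels each bar with its count, abbreviated to two significant digits (for example, 6821 is shown as 6.8k).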

In [47]:
fig = px.bar(crawl_year_sorted, y='count', x='crawl_year', text_auto='.2s')
fig.show()