# Analyse Search Results: Domains, Content Type and Year

Working with big data usually starts with an exploratory analysis to get an overview of what the data contains.

This notebook will allow you to analyse results from a search in SolrWayback, by checking how the resources are distributed by domain, media type and crawl year.

If you are new to Python or programming, don't worry! The narrative between the code cells explains what each cell does. To run a cell, press SHIFT+Enter or click the "Run" button at the top.

NB!
You will need a file in JSONL format, exported from your SolrWayback query result (see [documentation]()). For this notebook's operations to work, the fields "domain", "content_type" and "crawl_year" must be included in the export.

### Import libraries

First, we need to import the necessary libraries.

In [None]:
import pandas as pd
import plotly.express as px

### Load data from SolrWayback

Load the JSONL data set you exported from SolrWayback. *(The `display` function on the second line of the cell outputs a table with the first and last 5 rows of your data.)*

In [None]:
df = pd.read_json('../data/solrwayback_JonasGahrStore.jsonl', lines=True)
display(df)
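
Because the rest of the notebook depends on the three fields mentioned above, it can help to confirm they are present before continuing. The cell below is a minimal sketch of such a check. It uses a small made-up sample (`sample_jsonl`, with hypothetical domains) instead of your real export, and the names `sample_df`, `required_fields` and `missing_fields` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical three-record sample standing in for a SolrWayback JSONL export
sample_jsonl = io.StringIO(
    '{"domain": "example.no", "content_type": "text/html", "crawl_year": 2015}\n'
    '{"domain": "example.no", "content_type": "image/jpeg", "crawl_year": 2016}\n'
    '{"domain": "example.org", "content_type": "text/html", "crawl_year": 2016}\n'
)
sample_df = pd.read_json(sample_jsonl, lines=True)

# The fields this notebook relies on
required_fields = ["domain", "content_type", "crawl_year"]

# Collect any required field that is not a column in the data frame
missing_fields = [field for field in required_fields if field not in sample_df.columns]

if missing_fields:
    print("Missing fields in export:", missing_fields)
else:
    print("All required fields are present.")
```

To run the same check on your own data, replace `sample_df` with `df` from the cell above.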

## 1. Domain stats

### Count

Now, we can count the number of occurrences of each domain, and sort them in descending order.

In [None]:
domain_counts = df['domain'].value_counts().reset_index()
domain_counts.columns = ['domain', 'count']
domain_counts_sorted = domain_counts.sort_values('count', ascending=False)
display(domain_counts_sorted)
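
Absolute counts can be hard to compare across searches of different sizes. As an optional extra, the sketch below expresses each domain as a percentage of all resources, using the `normalize` option of `value_counts`. The sample data and the names `sample_df` and `domain_share` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: four resources from two made-up domains
sample_df = pd.DataFrame({"domain": ["a.no", "a.no", "a.no", "b.no"]})

# normalize=True returns proportions (0-1) instead of raw counts;
# multiply by 100 and round to get percentages with one decimal
domain_share = sample_df["domain"].value_counts(normalize=True).mul(100).round(1)

print(domain_share)
```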

### Visualisation

Since we have counted the number of resources per domain, it is also possible to visualise the distribution.

We will use the **plotly** graphing tool to generate an interactive pie chart.

*TIP: Hover over the pie chart to display the data for a slice, and click domains in the legend on the right to activate or deactivate them.*

In [None]:
fig = px.pie(domain_counts_sorted, values='count', names='domain')
fig.update_traces(textposition='inside', textinfo='label')
fig.update_layout(title_text="Distribution of resources per domain", font_size=15)
fig.show()
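
If your search result spans many domains, the pie chart can end up with a large number of very thin slices. One possible way to handle this, sketched below under assumptions, is to keep only the largest domains and merge the rest into a single "Other" row before plotting. The sample counts, the cut-off `top_n` and the label "Other" are all illustrative choices, not part of the SolrWayback export.

```python
import pandas as pd

# Hypothetical domain counts, shaped like domain_counts_sorted above
counts = pd.DataFrame(
    {
        "domain": ["a.no", "b.no", "c.no", "d.no", "e.no"],
        "count": [50, 30, 3, 2, 1],
    }
)

top_n = 2  # illustrative cut-off: how many domains to keep as separate slices

# Keep the top_n rows with the highest counts
top = counts.nlargest(top_n, "count")

# Sum everything that did not make the cut
other_total = counts["count"].sum() - top["count"].sum()

# Append the remainder as a single "Other" row
other_row = pd.DataFrame([{"domain": "Other", "count": other_total}])
grouped = pd.concat([top, other_row], ignore_index=True)

print(grouped)
```

The resulting `grouped` data frame has the same two columns as `domain_counts_sorted`, so it can be passed to `px.pie` in the same way.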

## 2. Media type

### Count

As with the domains above, we can also count the occurrences of each media type/subtype.

*(Media type is synonymous with what is called "Content type" in WARC records.)*

In [None]:
content_type_counts = df['content_type'].value_counts().reset_index()
content_type_counts.columns = ['content_type', 'count']
content_type_sorted = content_type_counts.sort_values('count', ascending=False)
print(content_type_sorted)
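
A media type is written as `type/subtype`, for example `text/html`. If the full list is too detailed, a coarser view can be obtained by counting only the part before the slash. The cell below is a minimal sketch of that idea on made-up data. It assumes every value contains a slash, and the column name `media_type` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample of content types
sample_df = pd.DataFrame(
    {
        "content_type": [
            "text/html",
            "text/css",
            "image/jpeg",
            "application/pdf",
            "text/html",
        ]
    }
)

# Split "type/subtype" on the slash and keep the first part (the top-level type)
sample_df["media_type"] = sample_df["content_type"].str.split("/").str[0]

# Count resources per top-level type
media_type_counts = sample_df["media_type"].value_counts()

print(media_type_counts)
```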

### Visualisation

Run the code block below to visualise the distribution of each content type/subtype.

In [None]:
fig = px.pie(content_type_sorted, values='count', names='content_type')
fig.update_traces(textposition='inside', textinfo='label')
fig.update_layout(title_text="Distribution of resources by content type", font_size=15)
fig.show()

## 3. Crawl year

### Count

The code block below will count the number of resources per crawl year, and print the result.

In [None]:
crawl_year_counts = df['crawl_year'].value_counts().reset_index()
crawl_year_counts.columns = ['crawl_year', 'count']
crawl_year_sorted = crawl_year_counts.sort_values('crawl_year')
print(crawl_year_sorted)
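
Note that `value_counts` only lists years that occur in the data, so a year without any crawled resources is simply absent from the table. If you want such gaps to show up as zero, one option is to reindex the counts over the full range of years. The sketch below uses made-up data, and the names `year_counts`, `all_years` and `year_counts_filled` are illustrative.

```python
import pandas as pd

# Hypothetical sample: no resources were crawled in 2011
sample_df = pd.DataFrame({"crawl_year": [2010, 2010, 2012, 2013, 2013, 2013]})

# Count resources per year and order the result chronologically
year_counts = sample_df["crawl_year"].value_counts().sort_index()

# Build the full range from the first to the last year in the data
first_year = int(year_counts.index.min())
last_year = int(year_counts.index.max())
all_years = range(first_year, last_year + 1)

# Years missing from the data get a count of 0
year_counts_filled = year_counts.reindex(all_years, fill_value=0)

print(year_counts_filled)
```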

### Visualisation

Next, we visualise the distribution of resources over time, using **plotly** to make a bar chart.

In [None]:
fig = px.bar(crawl_year_sorted, y='count', x='crawl_year', text_auto='.2s')
fig.show()