# Search 2.0 Data Planning (Indexing & Service Info)

## Introduction

Related ticket: https://github.com/orgs/GSA-TTS/projects/60/views/7?pane=issue&itemId=97252640&issue=GSA-TTS%7Cjemison%7C123


In this notebook, I wrangle our sample data to check whether its output can answer our metric questions.

**The focus in this notebook will be on data related to indexing and the search developer statistics.**

## Data

### Libraries
Importing libraries to work with our JSON data:

In [1]:
# If pandas has not been installed yet:
# !pip install pandas

In [2]:
# for general data wrangling tasks
import pandas as pd

# for working with json
import json

### JSON sample data

We have 4 files currently (as of 2025-03-05):
* v2search_sample_data.json
* v2click_sample_data.json
* **v2indexing_sample_data.json** (focus for this notebook)
* **search_service_sample_data_cleaned.json** (focus for this notebook)

The 'v2' indicates that feedback from our [markdown file](https://github.com/GSA-TTS/jemison/blob/main/docs/architecture/services/json-data-structure-drafts.md) has been implemented.

These JSON structures (and as a result, the code below) will change and evolve as our developer teams iterate.

#### Guiding questions:
1. When did we last notice a change in the content?
2. For a domain, how many things have we indexed? (as of today, what is in the index?)
3. How many errors?
    * How many 500 errors?
         * Are we running into an anomalous number of 500s?
    * How many 400 errors?
4. Number of pages in the index over time?
5. Type of content in the index (html/etc)?
6. How many pages have rich metadata?
7. Insight into the indexing process, and any errors happening:
    * Is the system doing autothrottling to reduce load on the domain?
    * Has crawling/indexing of a domain been paused automatically?
    * How many domains are paused right now?
8. How old is the oldest URL on this domain in the index (e.g. we aim for this to be no more than 30 days)?

**Performance (on our customer sites)**
1. Payload size?
2. Time to first paint?
3. More...

### Indexing Sample Data

We would not keep historical data.
Each row's values would be overwritten whenever that URL is crawled or indexed again.

In [3]:
# Open and read the JSON file v2indexing_sample_data.json:
with open('./data/v2/v2indexing_sample_data.json', 'r') as file:
    v2indexing_sample_data = pd.json_normalize(json.load(file))

# Print the data
v2indexing_sample_data

Unnamed: 0,domain_indexed,url,canonical_url,status_code,status,index_status,index_status_reason,index_status_date,last_successful_crawl,last_successful_index,last_modified,sha1,last_change_detected,redirect_url,content_type,crawl_depth,first_contentful_paint,response_time,payload_size,autothrottle_enabled
0,www.usa.gov,https://www.usa.gov/passport,https://www.usa.gov/passport,200,OK,indexed,,2025-02-19 15:08:40,2025-02-19 15:08:25,2025-02-19 15:08:40,2234-23-23 123:123:12,34242353235,2025-02-19 1:36:34,,html,3.0,0.44,0.175,9.7,False
1,www.usa.gov,https://www.usa.gov/passport,https://www.usa.gov/passport,200,OK,indexed,,2025-02-19 15:08:40,2025-02-19 15:08:25,2025-02-19 15:08:40,2234-23-23 123:123:12,3a423b32e2342,2025-02-19 1:36:34,,html,3.0,0.44,0.175,9.7,False
2,www.usa.gov,https://www.usa.gov/test,https://www.usa.gov/test,200,OK,indexed,,2025-02-01 12:30:30,2025-02-19 15:10:25,2025-02-01 13:30:40,2025-02-19 15:09:50,3a423b32e1111,2025-02-19 1:40:34,,html,2.0,0.325,0.124,6.3,False
3,www.usa.gov,https://www.usa.gov/unclaimed-money,https://www.usa.gov/unclaimed-money,200,OK,indexed,,2025-02-01 12:31:33,2025-02-19 15:11:12,2025-02-01 13:31:20,2025-02-19 15:10:24,3a423b32e2222,2025-02-19 1:36:14,,html,1.0,0.242,0.135,11.5,False
4,www.usa.gov,https://www.usa.gov/tester,https://www.usa.gov/tester,404,Not Found,non-indexable,Not Found,,,,,,,,,,,,,
5,www.usa.gov,https://www.usa.gov/buying-home,https://www.usa.gov/buying-home,301,Moved Permanently,non-indexable,Redirected,2025-02-01 12:32:21,2025-02-19 15:11:59,2025-02-01 13:32:10,2025-02-01 13:32:10,3a423b32e5555,2025-02-19 15:11:59,https://www.usa.gov/buying-home-programs,html,1.0,0.325,0.342,,False
6,www.usa.gov,https://www.usa.gov/buying-home-programs,https://www.usa.gov/buying-home-programs,200,OK,indexed,,2025-02-01 12:32:31,2025-02-19 15:12:12,2025-02-01 13:32:20,2025-02-19 15:12:24,3a423b32e9999,2025-02-19 1:37:14,,html,3.0,0.325,0.135,8.9,False
7,www.va.gov,https://www.va.gov/health-care/about-va-health...,https://www.va.gov/health-care/about-va-health...,200,OK,indexed,,2025-02-01 13:31:32,2025-02-19 15:09:47,2025-02-01 13:31:41,2025-02-19 15:09:51,3a423b32e3333,2025-02-08 1:30:34,,html,3.0,0.3,0.175,9.7,True
8,www.example.gov,http://www.example.gov,http://www.example.gov,200,OK,indexed,,2025-02-01 13:56:32,2025-02-19 15:29:47,2025-02-01 13:41:41,2025-02-19 15:11:51,3a423b32e3322,2025-02-08 1:33:34,,html,1.0,0.322,0.244,7.5,False
9,www.example.gov,http://www.example.gov/next-page,http://www.example.gov/next-page,200,OK,indexed,,2025-02-01 13:57:32,2025-02-19 15:30:47,2025-02-01 13:43:41,2025-02-19 15:12:58,3a423b32e1122,2025-02-06 1:31:34,,html,2.0,0.241,0.222,7.3,False


In [4]:
v2indexing_sample_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   domain_indexed          12 non-null     object 
 1   url                     12 non-null     object 
 2   canonical_url           12 non-null     object 
 3   status_code             12 non-null     int64  
 4   status                  12 non-null     object 
 5   index_status            12 non-null     object 
 6   index_status_reason     4 non-null      object 
 7   index_status_date       11 non-null     object 
 8   last_successful_crawl   11 non-null     object 
 9   last_successful_index   11 non-null     object 
 10  last_modified           11 non-null     object 
 11  sha1                    11 non-null     object 
 12  last_change_detected    11 non-null     object 
 13  redirect_url            1 non-null      object 
 14  content_type            11 non-null     obje
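
The `info()` output shows the date fields stored as `object` (plain strings), and the sample contains at least one malformed timestamp (`2234-23-23 123:123:12`). Below is a minimal sketch of how these columns could be parsed and then used for the "how old is the oldest URL" question. The tiny `sample` frame, the `as_of` reference date, and the choice of `last_successful_index` as the freshness measure are all assumptions made for illustration, not decisions recorded in this notebook.

```python
import pandas as pd

# Hypothetical rows mirroring the sample data; one timestamp is malformed on purpose.
sample = pd.DataFrame({
    "domain_indexed": ["www.usa.gov", "www.usa.gov", "www.example.gov"],
    "last_successful_index": [
        "2025-02-19 15:08:40",
        "2025-02-01 13:30:40",
        "2025-02-01 13:41:41",
    ],
    "last_modified": [
        "2234-23-23 123:123:12",  # invalid month, hour, and minute
        "2025-02-19 15:09:50",
        "2025-02-19 15:11:51",
    ],
})

date_columns = ["last_successful_index", "last_modified"]
for column in date_columns:
    # errors="coerce" turns unparseable strings into NaT instead of raising
    sample[column] = pd.to_datetime(
        sample[column], format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )

# Assumed reference date standing in for "today"
as_of = pd.Timestamp("2025-03-05")

# Oldest successful index per domain, and its age in whole days
oldest_index = sample.groupby("domain_indexed")["last_successful_index"].min()
oldest_age_days = (as_of - oldest_index).dt.days
```

Coercing bad values to `NaT` keeps them visible as missing data (for example via `isna().sum()`) rather than stopping the notebook on the first bad row.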

### Indexing data fields

| Data field to collect | Data Type | Category | Description | Related Requested Metric | Related Metric Question | Priority | Notes / Questions |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| **domain\_indexed** | string | each URL indexed/crawled | In the guestbook |  |  | High | Unique identifier in the new system. |
| **url** | URL/string | each URL indexed/crawled | Would need to be tied to domain. In the guestbook | \# indexed items | Domain level: How many indexed items? | High |  |
| **canonical\_url** | URL/string | each URL indexed/crawled | A canonical URL is the URL of the best representative page from a group of duplicate pages. It could be logged, but may not always be accurate because it is user input. |  |  | High |  |
| **status\_code** | integer | each URL indexed/crawled | 404, etc. | \# of 500s? 400s? |  | High |  |
| **status** | string | each URL indexed/crawled | Descriptive text definition of the status\_code. |  |  | Medium |  |
| **index\_status** | string (indexed / non-indexable) | each URL indexed/crawled | Was the crawled page successfully indexed? | \# indexed items (failed & successful) | Domain level: How many successfully & failed indexed items? | High |  |
| **index\_status\_reason** | string (error / blocked / etc.) | each URL indexed/crawled | If not indexed, why? (due to error, blocked, or other) | \# indexed items (failed & why) |  | High |  |
| **index\_status\_date** | datetime | each URL indexed/crawled | The status might get updated in case of a failure, and would therefore be more recent than the last\_successful\_index date, which is only updated when indexing succeeds. |  |  | High |  |
| **last\_successful\_crawl** | datetime | each URL indexed/crawled | This is set by \`fetch\` in the guestbook. Added \_successful\_ for clarity. |  |  | High |  |
| **last\_successful\_index** | datetime | each URL indexed/crawled | \`extract\` should be recorded in guestbook when successful. Added \_successful\_ for clarity. |  |  | High |  |
| **last\_modified** | datetime | each URL indexed/crawled | From webserver. It is in the guestbook. |  |  | High |  |
| **sha1** | string | each URL indexed/crawled | Hash of fetched content | Metadata questions |  | High | A hash of the fetched content. Possibly include metadata, such as last\_modified date from the webserver? |
| **last\_change\_detected** | datetime | each URL indexed/crawled | When was a change detected in the content? Ideally, would like to find out for users how old their content is |  |  | Medium | Updated when hash changes |
| **redirect\_url** | string/URL | each URL indexed/crawled | URL that page redirects to (if applicable). |  |  | Low | Should we record? |
| **content\_type** | string (html / pdf / etc.) | each URL indexed/crawled | html / pdf / etc. |  |  | High |  |
| **crawl\_depth** | integer | each URL indexed/crawled | How deep on the website |  |  | Low | May not be necessary |
| **first\_contentful\_paint** | decimal | each URL indexed/crawled | how long it takes the browser to render the first piece of DOM content after a user navigates to your page (in seconds) |  |  | Low | Later: performance metrics for crawled sites for our customers |
| **response\_time** | decimal | each URL indexed/crawled | in seconds |  |  | Medium |  |
| **payload\_size** | decimal | each URL indexed/crawled | in KB |  |  | Low |  |
| **autothrottle\_enabled** | TRUE/FALSE | each URL indexed/crawled | From scrapy \- AutoThrottle extension |  |  | Medium |  |
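
The `status_code` field is what the "how many 400s / 500s" questions would draw on. As a rough sketch, status codes can be bucketed into classes (2xx, 3xx, 4xx, 5xx) and counted per domain. The `sample` frame and the helper name `count_status_classes` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Hypothetical miniature of the indexing data; only the columns needed here.
sample = pd.DataFrame({
    "domain_indexed": ["www.usa.gov", "www.usa.gov", "www.usa.gov", "www.va.gov"],
    "status_code": [200, 404, 301, 200],
})

def count_status_classes(df):
    """Count rows per domain and HTTP status class (2xx, 3xx, 4xx, 5xx)."""
    # Integer-divide by 100 so that 404 -> "4xx", 500 -> "5xx", and so on
    status_class = (df["status_code"] // 100).astype(str) + "xx"
    counts = (
        df.assign(status_class=status_class)
          .groupby(["domain_indexed", "status_class"])
          .size()
          .unstack(fill_value=0)
    )
    # Always show all four classes, even when a class has no rows
    return counts.reindex(columns=["2xx", "3xx", "4xx", "5xx"], fill_value=0)

status_counts = count_status_classes(sample)
```

Keeping a `5xx` column even when it is all zeros makes it easier to compare runs when looking for an anomalous number of 500s.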



### For a domain, how many things have we indexed? (as of today, what is in the index?)

If we were interested in "www.usa.gov":

In [5]:
# set our domain of interest variable:
domain_of_interest = 'www.usa.gov'

index_data_domain_subset = v2indexing_sample_data[v2indexing_sample_data['domain_indexed'] == domain_of_interest]

index_data_domain_subset

Unnamed: 0,domain_indexed,url,canonical_url,status_code,status,index_status,index_status_reason,index_status_date,last_successful_crawl,last_successful_index,last_modified,sha1,last_change_detected,redirect_url,content_type,crawl_depth,first_contentful_paint,response_time,payload_size,autothrottle_enabled
0,www.usa.gov,https://www.usa.gov/passport,https://www.usa.gov/passport,200,OK,indexed,,2025-02-19 15:08:40,2025-02-19 15:08:25,2025-02-19 15:08:40,2234-23-23 123:123:12,34242353235,2025-02-19 1:36:34,,html,3.0,0.44,0.175,9.7,False
1,www.usa.gov,https://www.usa.gov/passport,https://www.usa.gov/passport,200,OK,indexed,,2025-02-19 15:08:40,2025-02-19 15:08:25,2025-02-19 15:08:40,2234-23-23 123:123:12,3a423b32e2342,2025-02-19 1:36:34,,html,3.0,0.44,0.175,9.7,False
2,www.usa.gov,https://www.usa.gov/test,https://www.usa.gov/test,200,OK,indexed,,2025-02-01 12:30:30,2025-02-19 15:10:25,2025-02-01 13:30:40,2025-02-19 15:09:50,3a423b32e1111,2025-02-19 1:40:34,,html,2.0,0.325,0.124,6.3,False
3,www.usa.gov,https://www.usa.gov/unclaimed-money,https://www.usa.gov/unclaimed-money,200,OK,indexed,,2025-02-01 12:31:33,2025-02-19 15:11:12,2025-02-01 13:31:20,2025-02-19 15:10:24,3a423b32e2222,2025-02-19 1:36:14,,html,1.0,0.242,0.135,11.5,False
4,www.usa.gov,https://www.usa.gov/tester,https://www.usa.gov/tester,404,Not Found,non-indexable,Not Found,,,,,,,,,,,,,
5,www.usa.gov,https://www.usa.gov/buying-home,https://www.usa.gov/buying-home,301,Moved Permanently,non-indexable,Redirected,2025-02-01 12:32:21,2025-02-19 15:11:59,2025-02-01 13:32:10,2025-02-01 13:32:10,3a423b32e5555,2025-02-19 15:11:59,https://www.usa.gov/buying-home-programs,html,1.0,0.325,0.342,,False
6,www.usa.gov,https://www.usa.gov/buying-home-programs,https://www.usa.gov/buying-home-programs,200,OK,indexed,,2025-02-01 12:32:31,2025-02-19 15:12:12,2025-02-01 13:32:20,2025-02-19 15:12:24,3a423b32e9999,2025-02-19 1:37:14,,html,3.0,0.325,0.135,8.9,False


In [6]:
# count the unique canonical URLs for this domain
# (includes non-indexable URLs such as 404s and redirects):

index_data_domain_subset['canonical_url'].nunique()

6
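
The count above covers every canonical URL seen for the domain, including the 404 and the redirect. If the question is strictly "what is in the index", one possible approach is to filter on `index_status` first. The sketch below assumes that `index_status == 'indexed'` marks a page that is actually in the index; the `sample` frame is a hypothetical cut-down of the data shown above.

```python
import pandas as pd

# Hypothetical cut-down of the indexing sample data
sample = pd.DataFrame({
    "domain_indexed": ["www.usa.gov"] * 4 + ["www.va.gov"],
    "canonical_url": [
        "https://www.usa.gov/passport",
        "https://www.usa.gov/passport",   # duplicate row for the same page
        "https://www.usa.gov/tester",
        "https://www.usa.gov/buying-home",
        "https://www.va.gov/health-care",
    ],
    "index_status": ["indexed", "indexed", "non-indexable", "non-indexable", "indexed"],
})

# Keep only rows assumed to be in the index, then count unique pages per domain
indexed_only = sample[sample["index_status"] == "indexed"]
indexed_per_domain = indexed_only.groupby("domain_indexed")["canonical_url"].nunique()

# Row counts of indexed vs. non-indexable per domain, for the failed/successful question
status_breakdown = pd.crosstab(sample["domain_indexed"], sample["index_status"])
```

Note that `status_breakdown` counts rows rather than unique URLs, so duplicate rows for the same page are counted twice there.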

### Search Service Sample Data

Over time, we expect to produce a variation of this data with additional fields that developers may find helpful.

In [7]:
# Open and read the JSON file search_service_sample_data.json:
with open('./data/v2/search_service_sample_data.json', 'r') as file:
    search_service_sample_data = pd.json_normalize(json.load(file))

# Print the data
search_service_sample_data

Unnamed: 0,date,uptime,ram_usage,db_usage
0,2025-02-19,0.99,0.101,0.101
1,2025-02-20,0.991,0.099,0.099
2,2025-02-21,0.993,0.11,0.099
3,2025-02-22,0.988,0.0913,0.0932
4,2025-02-23,0.991,0.099,0.099
5,2025-02-24,0.991,0.0932,0.0874
6,2025-02-25,0.988,0.0732,0.11
7,2025-02-26,0.991,0.099,0.099
8,2025-02-27,0.988,0.0913,0.0932
9,2025-02-28,0.991,0.0874,0.099


In [8]:
search_service_sample_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       10 non-null     object 
 1   uptime     10 non-null     float64
 2   ram_usage  10 non-null     float64
 3   db_usage   10 non-null     float64
dtypes: float64(3), object(1)
memory usage: 452.0+ bytes


#### For a time range, what is our average uptime, RAM usage, or DB usage?

In [31]:
# converting column to datetime format:
search_service_sample_data['date'] = pd.to_datetime(search_service_sample_data['date'])

# subsetting our data to a time range:
start_date = '2025-02-26' # inclusive
end_date = '2025-02-28' # inclusive
search_service_sample_date_subset = search_service_sample_data[(search_service_sample_data['date'] >= start_date) & (search_service_sample_data['date'] <= end_date)]
search_service_sample_date_subset

Unnamed: 0,date,uptime,ram_usage,db_usage
7,2025-02-26,0.991,0.099,0.099
8,2025-02-27,0.988,0.0913,0.0932
9,2025-02-28,0.991,0.0874,0.099


In [32]:
# calculating the average of uptime, RAM usage, and DB usage:

averages = search_service_sample_date_subset.mean(numeric_only=True)

# output as a data frame:
pd.DataFrame(averages, columns = ['Averages over time range'])

Unnamed: 0,Averages over time range
uptime,0.99
ram_usage,0.092567
db_usage,0.097067
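
If this question comes up for many different time ranges, the two steps above could be wrapped in a small helper. This is only a sketch: the function name `average_over_range` and the `service` frame (a hypothetical subset of the sample rows) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical subset of the search service sample data
service = pd.DataFrame({
    "date": pd.to_datetime(["2025-02-25", "2025-02-26", "2025-02-27", "2025-02-28"]),
    "uptime": [0.988, 0.991, 0.988, 0.991],
    "ram_usage": [0.0732, 0.099, 0.0913, 0.0874],
    "db_usage": [0.11, 0.099, 0.0932, 0.099],
})

def average_over_range(df, start_date, end_date,
                       metrics=("uptime", "ram_usage", "db_usage")):
    """Mean of each metric column for rows with start_date <= date <= end_date."""
    # Series.between is inclusive on both ends by default
    in_range = df["date"].between(pd.Timestamp(start_date), pd.Timestamp(end_date))
    return df.loc[in_range, list(metrics)].mean()

averages = average_over_range(service, "2025-02-26", "2025-02-28")
```

Selecting the metric columns explicitly keeps the `date` column out of the averages and makes it easy to add further fields later.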
