# AWS JP Doc structure


The sitemap listed in robots.txt is the global one; there is no ja_jp-specific sitemap.

https://docs.aws.amazon.com/ja_jp/robots.txt
```
 Sitemap: https://docs.aws.amazon.com/sitemap_index.xml
```
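
The `Sitemap:` directive can also be read programmatically instead of by eye. The following is a minimal sketch; the function name `extract_sitemaps` is hypothetical, and it takes the robots.txt body as a string so that it does not depend on any network access.

```python
from typing import List


def extract_sitemaps(robots_txt: str) -> List[str]:
    """Return the URLs of all `Sitemap:` directives in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split only at the first colon so the "https://" in the value survives.
        key, sep, value = line.strip().partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

Passing the body of the robots.txt above should yield a single entry, the global `sitemap_index.xml`.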

## The sitemap.xml that indexes each service's sitemap

Even when fetched under ja_jp, its entries do not link to the ja_jp sitemap.xml files.

https://docs.aws.amazon.com/ja_jp/sitemap_index.xml
```xml
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://docs.aws.amazon.com/a4b/latest/ag/sitemap.xml</loc>
  </sitemap>
  ...
  <sitemap>
    <loc>https://docs.aws.amazon.com/xray-sdk-for-java/latest/javadoc/sitemap.xml</loc>
  </sitemap>
</sitemapindex>
```


## Each service's sitemap.xml

A ja_jp sitemap.xml does exist for each individual service.

https://docs.aws.amazon.com/ja_jp/redshift/latest/mgmt/sitemap.xml

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.aws.amazon.com/ja_jp/redshift/latest/mgmt/welcome.html</loc>
  </url>
  <url>
    <loc>https://docs.aws.amazon.com/ja_jp/redshift/latest/mgmt/overview.html</loc>
  </url>
  ...
</urlset>
```
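
Both the `<sitemapindex>` and the `<urlset>` documents keep their URLs in `<loc>` elements under the sitemaps.org namespace, so one helper can read either. This is a sketch under that assumption; the name `extract_locs` is hypothetical. Unlike indexing the first child of each entry, it keeps working if another element such as `<lastmod>` precedes `<loc>`.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by both <sitemapindex> and <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_bytes: bytes) -> List[str]:
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```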

# Implementation approach

To improve maintainability and simplify troubleshooting, crawling and scraping are split into separate stages.

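1. Document crawling
    1. Fetch the HTML pages listed in sitemap.xml
    1. Save each page to S3 as raw HTML, together with the URL, created_at, and similar metadata
1. Document scraping
    1. Fetch the raw HTML from S3
    1. Scrape the information needed for indexing
    1. Normalize the HTML body
    1. Upload the index to Algolia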
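
The hand-off between the two stages can be sketched as one JSON document per line. The field names below follow the keys seen in the crawled JSONL later in this notebook (`url`, `status`, `last_modified`, `crawled_at`, `html`); the function names `make_crawl_record` and `dump_jsonl` are hypothetical, and the S3 upload itself is left out.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Optional


def make_crawl_record(url: str, status: int, html: str,
                      last_modified: Optional[str] = None,
                      crawled_at: Optional[str] = None) -> dict:
    """Build the record that the crawling stage would store for one page."""
    if crawled_at is None:
        # Assumption: timestamps are stored as ISO 8601 strings in UTC.
        crawled_at = datetime.now(timezone.utc).isoformat()
    return {
        "url": url,
        "status": status,
        "last_modified": last_modified,
        "crawled_at": crawled_at,
        "html": html,
    }


def dump_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSONL: one JSON document per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Because the raw HTML is stored untouched, the scraping stage can be re-run after a parser change without crawling again.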

# Investigation


In [3]:
import requests
import xml.etree.ElementTree as ET

In [4]:
root_sitemap = requests.get('https://docs.aws.amazon.com/sitemap_index.xml')

In [5]:
root_sitemap.headers

{'Server': 'Server', 'Date': 'Wed, 22 Jul 2020 07:37:28 GMT', 'Content-Type': 'text/xml', 'Content-Length': '7781', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN, SAMEORIGIN', 'Cache-Control': 'max-age=86400', 'Last-Modified': 'Wed, 22 Jul 2020 07:30:49 GMT', 'ETag': '"1c115-5ab02b804b66f-gzip"', 'Accept-Ranges': 'bytes', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding,User-Agent,Content-Type,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent', 'x-amz-rid': 'JZVDAGKDWF8WY0BGQVH5'}

In [6]:
root_sitemap.encoding

'ISO-8859-1'

In [7]:
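# Parse the raw bytes: .text was decoded as ISO-8859-1 (see above), so
# re-encoding it as UTF-8 would corrupt any non-ASCII characters.
root = ET.fromstring(root_sitemap.content)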
service_sitemap_urls = [child[0].text for child in root]

## Fetch the sitemap.xml of each service

In [8]:
service_sitemap_urls[0:10]

['https://docs.aws.amazon.com/a4b/latest/ag/sitemap.xml',
 'https://docs.aws.amazon.com/a4b/latest/APIReference/sitemap.xml',
 'https://docs.aws.amazon.com/a4b/site_map/sitemap.xml',
 'https://docs.aws.amazon.com/access-analyzer/latest/APIReference/sitemap.xml',
 'https://docs.aws.amazon.com/account-billing/site_map/sitemap.xml',
 'https://docs.aws.amazon.com/acm/latest/APIReference/sitemap.xml',
 'https://docs.aws.amazon.com/acm/latest/userguide/sitemap.xml',
 'https://docs.aws.amazon.com/acm/site_map/sitemap.xml',
 'https://docs.aws.amazon.com/acm-pca/latest/APIReference/sitemap.xml',
 'https://docs.aws.amazon.com/acm-pca/latest/userguide/sitemap.xml']

## Build the list of Japanese sitemap.xml URLs for each service

In [9]:
service_sitemap_urls_ja = [url.replace('.com/','.com/ja_jp/') for url in service_sitemap_urls]
len(service_sitemap_urls_ja)

1023

In [10]:
service_sitemap_urls_ja[0:10]

['https://docs.aws.amazon.com/ja_jp/a4b/latest/ag/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/a4b/latest/APIReference/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/a4b/site_map/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/access-analyzer/latest/APIReference/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/account-billing/site_map/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/acm/latest/APIReference/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/acm/latest/userguide/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/acm/site_map/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/APIReference/sitemap.xml',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/sitemap.xml']
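
The string replacement above relies on `.com/` appearing exactly once in every URL. A slightly more defensive variant, sketched here with a hypothetical helper name `to_locale_url`, rewrites only the path component and leaves URLs that already carry the locale prefix untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def to_locale_url(url: str, locale: str = "ja_jp") -> str:
    """Insert the locale as the first path segment of a documentation URL."""
    parts = urlsplit(url)
    segments = parts.path.lstrip("/").split("/")
    if segments[0] == locale:
        # Already localized; return unchanged so the call is idempotent.
        return url
    new_path = "/" + "/".join([locale] + segments)
    return urlunsplit(parts._replace(path=new_path))
```

With this helper the list above could be built as `[to_locale_url(u) for u in service_sitemap_urls]`.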

## Sampling (ACM PCA)

In [11]:
acm_sitemap = requests.get(service_sitemap_urls_ja[9])
# Parse the raw bytes rather than re-encoding the ISO-8859-1-decoded text.
acm_root = ET.fromstring(acm_sitemap.content)
acm_urls = [child[0].text for child in acm_root]

In [12]:
acm_urls[0:10]

['https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/PcaWelcome.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/PcaRegions.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/PcaIntegratedServices.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/PcaLimits.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/RFC-compliance.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/PcaPricing.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/security.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/data-protection.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/security-iam.html',
 'https://docs.aws.amazon.com/ja_jp/acm-pca/latest/userguide/security-logging-and-monitoring.html']

In [13]:
r = requests.get(acm_urls[0])

### Fix encoding

In [14]:
r.encoding

'ISO-8859-1'

In [15]:
r.encoding = r.apparent_encoding

In [16]:
r.encoding

'utf-8'
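
Requests falls back to ISO-8859-1 for `text/*` responses whose `Content-Type` header carries no charset, which is presumably why the value above is wrong for Japanese pages. As an alternative to the detection behind `apparent_encoding`, the raw bytes can be decoded explicitly. This is a sketch that assumes the pages are UTF-8, as the detection result above suggests; `decode_body` is a hypothetical helper.

```python
def decode_body(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode a response body as UTF-8, falling back only if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: use the fallback charset instead of raising.
        return raw.decode(fallback)
```

It would be called as `decode_body(r.content)`, which avoids running charset detection on every page.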

# Algolia indexing
The indexing input is a JSONL file like the one below.

In [24]:
!wc -l 20200722032415/crawled-html-0.jsonl

77 20200722032415/crawled-html-0.jsonl


In [25]:
import json

Each line of the JSONL file corresponds to one document URL.

In [45]:
json_list = []

# Read explicitly as UTF-8 so the result does not depend on the platform default.
with open('./20200722032415/crawled-html-0.jsonl', 'r', encoding='utf-8') as jsonl_file:
    json_list = list(jsonl_file)

len(json_list)

78
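
`wc -l` reports 77 while `len(json_list)` is 78. One possible explanation, not verified against the actual file, is that the last line has no trailing newline: `wc -l` counts newline characters, whereas iterating the file also yields a final unterminated line. The sketch below, with a hypothetical `read_jsonl` helper, parses the records and skips blank lines, so the count reflects documents rather than newlines.

```python
import io
import json
from typing import List, TextIO


def read_jsonl(fp: TextIO) -> List[dict]:
    """Parse one JSON document per non-empty line of a text file object."""
    return [json.loads(line) for line in fp if line.strip()]


# Two records, the second without a trailing newline.
sample = '{"url": "a"}\n{"url": "b"}'
newline_count = sample.count("\n")          # what `wc -l` would count
records = read_jsonl(io.StringIO(sample))   # what the loader sees
```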

Each document has the following attributes.

In [47]:
d = json.loads(json_list[0])
d.keys()

dict_keys(['url', 'status', 'last_modified', 'crawled_at', 'html'])

As a sample, these are the document URLs contained in a4b.

In [50]:
for j in json_list:
    d = json.loads(j)
    print(d['url'])

https://docs.aws.amazon.com/a4b/latest/ag/cloudtrail.html
https://docs.aws.amazon.com/a4b/latest/ag/manage-address-books.html
https://docs.aws.amazon.com/a4b/latest/ag/manage-rooms.html
https://docs.aws.amazon.com/a4b/latest/ag/manage-contacts.html
https://docs.aws.amazon.com/a4b/latest/ag/manage-users.html
https://docs.aws.amazon.com/a4b/latest/ag/compliance.html
https://docs.aws.amazon.com/a4b/latest/ag/manage-profiles.html
https://docs.aws.amazon.com/a4b/latest/ag/enroll-users.html
https://docs.aws.amazon.com/a4b/latest/ag/disaster-recovery-resiliency.html
https://docs.aws.amazon.com/a4b/latest/ag/add-users.html
https://docs.aws.amazon.com/a4b/latest/ag/manage-devices.html
https://docs.aws.amazon.com/a4b/latest/ag/connect-exchange.html
https://docs.aws.amazon.com/a4b/latest/ag/infrastructure-security.html
https://docs.aws.amazon.com/a4b/latest/ag/voice-restrict.html
https://docs.aws.amazon.com/a4b/latest/ag/manage-network-profiles.html
https://docs.aws.amazon.com/a4b/latest/ag/sched

In [86]:
import lxml.html  # import the submodule explicitly; lxml.html.fromstring is used below
from lxml.html.clean import clean_html

Extract the attributes to index from the HTML.

In [None]:
# for j in json_list:
#     d = json.loads(j)
#     html = lxml.html.fromstring(d['html'])
#     print(html.cssselect('title')[0].text)

d = json.loads(json_list[0])
html = lxml.html.fromstring(d['html'])
# clean_html(html).text_content()
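
The attribute extraction explored in this section can also be packaged as a function that turns one crawled document into an index record. The sketch below uses the standard-library `html.parser` instead of lxml so that it runs without extra dependencies. The names `DocMetaParser` and `build_record` are hypothetical, and deriving `objectID` (the field Algolia uses to identify a record) from a SHA-1 hash of the URL is an assumption, not something this notebook specifies.

```python
import hashlib
from html.parser import HTMLParser


class DocMetaParser(HTMLParser):
    """Collect the <title> text and the "product" / "guide" <meta> values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("product", "guide"):
            self.meta[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_record(doc: dict) -> dict:
    """Turn one crawled document (with 'url' and 'html' keys) into an index record."""
    parser = DocMetaParser()
    parser.feed(doc["html"])
    parser.close()
    return {
        # Assumption: a hash of the URL serves as a stable record identifier.
        "objectID": hashlib.sha1(doc["url"].encode("utf-8")).hexdigest(),
        "url": doc["url"],
        "title": parser.title.strip(),
        "product": parser.meta.get("product"),
        "guide": parser.meta.get("guide"),
    }
```

A record for each line would then be obtained with `build_record(json.loads(line))`; pages without the expected `<meta>` tags get `None` for those fields.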

In [115]:
for j in json_list[0:10]:
    d = json.loads(j)
    print(d['url'])
    html = lxml.html.fromstring(d['html'])
    # Page title from the <title> element
    title = html.cssselect('title')[0].text
    print(title)

    # Print the "product" and "guide" <meta> values
    for meta in html.cssselect('meta'):
        # print(meta.attrib)
        if meta.get("name") == "product":
            print(meta.get("content"))
        if meta.get("name") == "guide":
            print(meta.get("content"))

https://docs.aws.amazon.com/a4b/latest/ag/cloudtrail.html
Logging and Monitoring in Alexa for Business - Alexa for Business
Alexa for Business
Administration Guide
https://docs.aws.amazon.com/a4b/latest/ag/manage-address-books.html
Managing Address Books - Alexa for Business
Alexa for Business
Administration Guide
https://docs.aws.amazon.com/a4b/latest/ag/manage-rooms.html
Managing Rooms - Alexa for Business
Alexa for Business
Administration Guide
https://docs.aws.amazon.com/a4b/latest/ag/manage-contacts.html
Managing Contacts - Alexa for Business
Alexa for Business
Administration Guide
https://docs.aws.amazon.com/a4b/latest/ag/manage-users.html
Managing Users - Alexa for Business
Alexa for Business
Administration Guide
https://docs.aws.amazon.com/a4b/latest/ag/compliance.html
Compliance Validation for Alexa for Business - Alexa for Business
Alexa for Business
Administration Guide
https://docs.aws.amazon.com/a4b/latest/ag/manage-profiles.html
Managing Room Profiles - Alexa for Business