# Checking the SEO Test Setup

I.e. checking that the right urls are set up, without going too crazy on testing everything.

In brief:

* Focus on the articles
* Check the test urls are accessible and show v2
* If AMP is enabled, test that control urls show v1 AMP and test urls show v2
* Check the canonicals
* Check the breadcrumbs
* That cloudflare looks to be working


Out of scope:

* That an AMP page is valid AMP (we have a separate test suite for this)
* Being robust to html that isn't what's expected during blog v2 launch!
* Testing non-article pages (home page, category page etc)

In [3]:
import requests
from datetime import datetime
from bs4 import BeautifulSoup
import csv
import json
import urllib
import json
import time
import matplotlib.pyplot as plt
from random import random, randint

## Settings


In [1]:
country_to_test = "tw"
test_staging = True

test_urls_csv_filepath = {  # These originally come from https://docs.google.com/spreadsheets/d/1UdrncwVdObaA6d0lnvbAQCHBpQMzEhWPz5OTR9Syobo/edit#gid=0
    "id":"./SEO Test Urls/id_seo_test_urls.csv",
    "tw":"./SEO Test Urls/tw_seo_test_urls.csv",
    "sg": None,
    "hk": None,
}

production_staging_urls = {
    "id": ["www.moneysmart.id", "www.msiddev.com"],
    "tw":["www.moneysmart.id", "www.msiddev.com"],
    "sg": ["blog.moneysmart.sg","blog.mssgdev.com"],
    "hk":["www.moneysmart.hk", "blog.mshkdev.com"],
}

post_sitemaps = { #yes, we could load some more dynamically but it saves some coding
    "id": ["https://www.moneysmart.id/post-sitemap%i.xml"%i for i in range(1,13)],
    "tw": ["https://www.moneysmart.tw/post-sitemap.xml"],
    "sg": ["https://blog.moneysmart.sg/post-sitemap%i.xml"%i for i in range(1,5)],
    "hk": ["https://blog.moneysmart.hk/post-sitemap%i.xml"%i for i in range(1,3)],
    
    #Aside, there seems to be public pages for the ads in the sitemap :( 
    
}


#These passwords aren't sensitive, they're just there to block search engine.
v1_staging_passwords = {
    "id":["admin", "moneysmart"],
    "tw":["admin", "moneysmart"],
    "sg":["admin", "duri@n"],
    "hk":["admin", "yu3ch@"],
}

v2_staging_passwords = {
    "id":["admin", "moneysmart"],
    "tw":["admin", "moneysmart"],
    "sg":["admin", "moneysmart"],
    "hk":["admin", "moneysmart"],
}


## Test URL Sets

In [4]:
def load_seo_test_urls(country_code):
    filepath = test_urls_csv_filepath[country_code]
    urls = []
    with open(filepath) as f:
        csv_reader = csv.reader(f, delimiter=',')
        for i, row in enumerate(csv_reader):
            if i==0:
                print("Skipping first line of url file: %s"%row)
                continue
            urls.append(row[0].strip())

    return urls

seo_test_urls = load_seo_test_urls(country_to_test)

print("Found %i test urls to check for" % len(seo_test_urls))


Skipping first line of url file: ['url']
Found 51 test urls to check for


In [5]:
for a in seo_test_urls:print(a)

https://www.moneysmart.tw/articles/各家銀行信用卡免付費客服電話/
https://www.moneysmart.tw/articles/黃金存摺-開戶-銀行/
https://www.moneysmart.tw/articles/如果你是旅遊常客，何不使用雙幣信用卡？/
https://www.moneysmart.tw/articles/line-points-中信-line-pay-聯邦賴點一卡通/
https://www.moneysmart.tw/articles/海外旅遊-現金回饋-信用卡-2019/
https://www.moneysmart.tw/articles/2019年悠遊聯名卡自動加值回饋與相關優惠/
https://www.moneysmart.tw/articles/善用結帳日，額外爽賺利息/
https://www.moneysmart.tw/articles/linepay卡-中國信託-hotelscom/
https://www.moneysmart.tw/articles/燃料費-信用卡/
https://www.moneysmart.tw/articles/2019-電影-信用卡-優惠/
https://www.moneysmart.tw/articles/出國旅遊用現金、信用卡、旅支還是金融卡提款？/
https://www.moneysmart.tw/articles/uupon-20190319-02/
https://www.moneysmart.tw/articles/costco-會員千元年費，商品真的有比較便宜嗎？/
https://www.moneysmart.tw/articles/數位帳戶-王道-richart-大戶-永豐-台新/
https://www.moneysmart.tw/articles/line-pay-回饋-信用卡/
https://www.moneysmart.tw/articles/花旗紅利點數貶值之後，卡友們應該何去何從？/
https://www.moneysmart.tw/articles/買賣外幣停看聽，牌告匯率先看懂/
https://www.moneysmart.tw/articles/mot-monthly-ticket-20190313-

## All Posts from Sitemaps (and test sitemap works!)

In [6]:
def get_urls_from_sitemap(sitemap_url):
    
    r = requests.get(sitemap_url)
    soup = BeautifulSoup(r.text, "html.parser") #html parser does xml fine
    urls = [element.text for element in soup.findAll('loc')]
    return urls

def get_all_post_urls_from_sitemap(country_code):
    sitemaps = post_sitemaps[country_code]
    all_urls = []
    for i, sitemap in enumerate(sitemaps):
        print("Loading urls from sitemap %i of %i %s" % (i, len(sitemaps), sitemap))
        sitemap_urls = get_urls_from_sitemap(sitemap)
        all_urls+=sitemap_urls
    return all_urls

all_post_urls = get_all_post_urls_from_sitemap(country_to_test)

print("Found %i posts in the sitemap"%len(all_post_urls))

Loading urls from sitemap 0 of 1 https://www.moneysmart.tw/post-sitemap.xml
Found 671 posts in the sitemap


In [7]:
non_seo_test_urls = list(set(all_post_urls) - set(seo_test_urls))
print("That leaves %i post urls that aren't in the test" % len(non_seo_test_urls))

That leaves 668 post urls that aren't in the test


## Utility Functions

In [126]:
def is_v2_url(url):
    splits = urllib.parse.urlparse(url)
    if "www3." in splits.netloc or "blog3." in splits.netloc:
        return True
    return False

def is_v1_url(url):
    return not is_v2_url(url)

def make_v2_url(url):
    #>>>
    
    return v2_url
    
def make_v2_staging_url(url):
    # not responsible for auth
    pass
    
    
def make_v1_staging_url(url):
    # not responsible for auth
    pass
                                   

def is_amp_page(url, page_cache):
    """
    Very basic test on whether the page is marked in the html tag as AMP
    
       # <html ⚡> or <html amp>
    rab"""
    soup = page_cache.get_soup(url)
    html_tag = soup.find('html')
    html_tag.attrs
    
    if "amp" in html_tag.attrs or "⚡" in html_tag.attrs:
        return True
    return False
    
    
 
    
def is_v2_page(url, page_cache):
    """
    Very basic test if it's a v2 page based on knowledge of it.
    """
    html = page_cache.get_html(url)
    if "nuxt" in html:
        return True
    return False
    
    
def is_v2_amp_page(url, page_cache):
    if is_v2_page(url, page_cache) and is_amp_page(url, page_cache):
        return True
    return False
    
def extract_link_of_type(url, page_cache, type_of_link):
    """
    Generally don't call this directly.
    
    Expects it to be unique in the page / no duplicates
    """
    
    soup = page_cache.get_soup(url)
    links = soup.find_all("link")
    suitable_link_tags = [z for z in links if "rel" in z.attrs and z.attrs["rel"].lower()==type_of_link.lower()]
    if len(suitable_link_tags)==0:
        return None
    if len(suitable_link_tags)>1:
        raise Exception()
    
    link_url = amp_link_tags[0].get("href")
    
    return link_url
    
    
    
    
    
def extract_link_to_amp_version(url, page_cache):
    """
    Returns none if the amp version can't be found
    
    Looking for <link rel="amphtml" href="..."
    
    Returns the url it links to
    """
    amp_url = extract_link_of_type(url, page_cache, "amphtml")
    
    return amp_url

def amp_and_canonical_urls_match(amp_url, canonical_url):
    if amp_url == canonical_url+"?amp": # better would be to do a join using url library
        return True
    return False
    
    

    
    
def extract_canonical_tag(url, page_cache):
    """
    
    Looks for <link rel="canonical" href="....">
    
    This is similar to amphtml
    """
    canonical_url = extract_link_of_type(url, page_cache, "canonical")
    return canonical_url

    
def canonical_tag_is_valid(url, canonical_url):
    """
    Returns bool, error_message
    
    """
    url_components = urllib.parse.urlparse(url)
    canonical_components = urllib.parse.urlparse(canonical_url)
    
    error_message = ""
    
    
    #Canonical uses https
    if canoncical_components.scheme != "https":
        error_message += "Canonical does't use https. "
    
    # no query params in canonical
    if canonical_components.query:
        error_message += "Canonical url contains query parameters. "
    
    
    # goes to base url
    expected_root_url = production_staging_urls[country_code]
    if not canonical_url.startswith(expected_root_url):
        error_message += "Canonical url doesn't appear to point to right site."
    
    
    # url and canonical point to the same article
    if url_components.path != canonical_components.path:
        error_message += "Canonical and url don't point to same url. "
    
    
    return error_message!="", error_message
    
    
def extract_ld_json_structured_data(url, page_cache, type_of_structured_data):
    """
    Returns as parsed JSON i.e. should be able to inspect as a dictionary etc.
    """
    
    #Note there's lots of valid ways of doing this, I'm just trying to match the way it's been implemented
    
    soup = page_cache.get_soup(url)
    
    script_tags = soup.find_all("script")
    sd_tags = [z for z in script_tags if "type" in z.attrs and z.attrs["type"]=="application/ld+json"]
    json_structured_datas = [json.loads(z.contents) for z in sd_tags]
    structured_datas_of_type = [z for z in structured_datas if z["@type"]==type_of_strucutred_data]
    if len(structured_datas_of_type) > 1:
        raise Exception()
    if len(structured_datas_of_type) == 0:
        raise Exception() # just for debugging
        return None
    return structured_datas_of_type[0]
    
def extract_blogposting_structured_data(url, page_cache):
    sd = extract_ld_json_structured_data(url, page_cache, "BlogPosting")
    return sd
    
def extract_breadcrumbs_as_list(url, page_cache):
    """
    Returns [[name, url]]
    
        {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
      "itemListElement": [{
        "@type": "ListItem",
        "position": 1,
        "name": "Books",
        "item": "https://example.com/books"
      },
      ...
      }
    
    """
    sd = extract_ld_json_structured_data(url, page_cache, "BreadcrumbList")
    
    breadcrumbs = [[z["position"], z["name"], z["item"]] \
                   for z in sd["itemListElement"]\
                      if z["@type"]=="ListItem"
                  ]
    breadcrumbs.sort(lambda x: x[0]) #sort by position
    
    
    return [z[1:] for z in breadcrumbs]

    

    
def article_breadcrumbs_are_valid(breadcrumb_ld_json):
    # separate function to download, extract and turn into list
    
    pass
    
    #Has category
    
    #Has at least 3 levels of depth (home, category, article)
    
    #All urls are canonical (i.e. they start the right way)
    
    
    


In [133]:
class PageCache(object):
    """
    This is an abstraction for getting a url, which understands
    translation between staging and production, authentication etc
    and uses a cache to allow efficient repeat requests.
    (and whether that cache is in memory, or not is academic)
    
    It also includes throttling.
    
    The result might be a redirect to say www3 or whatever.
    
    NB this does currently rely on some globals
    """
    
    def __init__(self):
        self._page_cache_dict = {} #url:[page_html, load_time, cloudflare_in_headers, cloudflare_hit, url_after_redirect]
        self._soup_cache_dict = {} #url: beautifulsoup
        self._page_load_time_dict = {} # a log of how long it takes to load the pages in seconds
        self.last_page_respone_time  = datetime.now()
        self.min_seconds_between_server_requests = 0.5 #currently after request completion, rather than start
    
    def get_html(self, url, refresh_cache = False):
        """
        This gets the url, but only requests it if it isn't in the page cache currently
        """
        if refresh_cache or url not in self._page_cache_dict:
            seconds_since_last_request = (datetime.now() - self.last_page_respone_time).total_seconds()
            wait_time = self.min_seconds_between_server_requests - seconds_since_last_request
            if wait_time>0:
                print("Throttling request for %.2f seconds"%wait_time)
                time.sleep(wait_time)
            
            self._download_url_and_cache(url) # this also does things like check caching and do timing.
            
        html = self._page_cache_dict[url][0]
        
        self.last_page_request_time = datetime.now()
        
        return html
        
    def get_soup(self, url):
        """
        Returns the beautifulsoup version of the page for tag extraction
        """
        if url not in self._soup_cache_dict:
            html = self.get_html(url)
            soup = BeautifulSoup(html, "html.parser")
            self._soup_cache_dict[url] = soup

        return self._soup_cache_dict[url]
        

        
    def _download_url_and_cache(self, url):
        user_pass = None
        
        
        if test_staging:
            #Need to deal with basic auth
            # TODO
            if is_v1_url(url):
                user_pass = v1_staging_passwords[country_to_test]
            else:
                user_pass = v2_staging_passwords[country_to_test]
        
        
        r = requests.get(url, auth= user_pass)
        if r.status_code == "401":
            
            print("Got a 401 on %s with credentials %s" % (url, user_pass))
            if test_staging:
                print("Going to try again with other creds as likely redirect with different creds.")
                if not is_v1_url(url): # flipping counter-intuitively
                    user_pass = v1_staging_passwords[country_to_test]
                else:
                    user_pass = v2_staging_passwords[country_to_test]
                r = requests.get(url, auth=user_pass)
                if r.status_code == "401":
                    print("failed with other auth")
                    raise Exception()
            else:
                raise Exception()
        
        r.raise_for_status() #Throw an exception if get e.g. a 404.
        
        load_time = r.elapsed.total_seconds()
        text = r.text
        headers = r.headers
        
        # Look at headers to judge if caching on the html is working
        is_cdn_caching = False
        is_cdn_cache_hit = False
        if "Server" in headers:
            if headers["Server"]=="cloudflare":
                is_cdn_caching = True
                if headers["cf-cache-status"]=="HIT":
                    is_cdn_cache_hit = True
            

        url_after_redirect = r.url
        #"cf-cache-status" "HIT"  /"EXPIRED"
        #"server:cloudflare"
        
        self._page_cache_dict[url] = text, load_time, is_cdn_caching, is_cdn_cache_hit, url_after_redirect

In [134]:
page_cache = PageCache()

In [135]:
soup = page_cache.get_soup("https://www3.moneysmart.id")

In [136]:
soup.html.attrs

{'lang': 'id-ID', 'prefix': 'og: http://ogp.me/ns#'}

In [63]:
urllib.parse.urlparse("https://www3.moneysmart.id/my_cat/my_article?amp")

ParseResult(scheme='https', netloc='www3.moneysmart.id', path='/my_cat/my_article', params='', query='amp', fragment='')

In [131]:
r = requests.get("https://www3.moneysmart.id/tas-hermes-mahal-ini-sebabnya/?forceTest=test")
r_amp = requests.get("https://www3.moneysmart.id/tas-hermes-mahal-ini-sebabnya/?amp")

In [132]:
r.status_code

200

In [20]:
r.text[:100]

'<!doctype html>\n<html data-n-head-ssr lang="id-ID" data-n-head="lang">\n<head data-n-head="">\n<title '

In [137]:
extract_blogposting_structured_data("https://www.moneysmart.id/ayah-nadiem-makarim", page_cache)

TypeError: the JSON object must be str, bytes or bytearray, not 'list'

# Tests

## Check the Homepage still loads!

In [None]:
homepage_url = production_staging_urls[country_to_test][(1 if test_staging else 0)]
html = page_cache.get_html(homepage_url)
if len(html)<1000:
    raise Exception()

## Check the AB split percentage (approximately)

In [111]:
# get url in the right format for the environment
requests_to_check_ab_split = 200
test_url = non_seo_test_urls[randint(0, len(non_seo_test_urls))]
hit_v2s = []
print("Testing %i requests outside of SEO test to see AB split with url %s" % (requests_to_check_ab_split, test_url))
for i in range(requests_to_check_ab_split):
    html=page_cache.get_html(test_url, refresh_cache = True)
    is_v2 = is_v2_page(test_url, page_cache)
    hit_v2s.append(is_v2)
    
num_v1s = len([z for z in hit_v2s if z==False])
num_v2s = len([z for z in hit_v2s if z==True])
percentage_v2 = 100 * float(num_v2s)/ (num_v1s + num_v2s)

labels = ["v1", "v2 - %.1f" % percentage_v2]
sizes = [num_v1s, num_v2s]
    
plt.pie(sizes, labels)
plt.show()

Testing 200 requests outside of SEO test to see AB split with url https://www.moneysmart.id/5-pertanyaan-mesti-dijawab-memutuskan-berhenti-kuliah/


TypeError: get_html() got an unexpected keyword argument 'refresh_cache'

## Check Test Articles are v2

In [138]:
for url in seo_test_urls:
    print(url, is_v2_page(url,page_cache))

https://www.moneysmart.id/klikbca-gini-lho-cara-cepat-daftar-internet-banking-bca/ True
https://www.moneysmart.id/klikbca-individual-ini-cara-daftar/ True
https://www.moneysmart.id/daftar-kode-bank-indonesia-terlengkap-lebih-dari-100-bank/ True
https://www.moneysmart.id/unipin-voucher-game-online/ True
https://www.moneysmart.id/djp-online-cara-cepat-bayar-pajak-lapor-spt/ True
https://www.moneysmart.id/harga-emas-hari-ini/ True
https://www.moneysmart.id/cara-ambil-antrian-online-bpjs-ketenagakerjaan/ True
https://www.moneysmart.id/vpn-tarumanagara-cara-log-in-klikbca-bisnis-tanpa-aplikasi/ True
https://www.moneysmart.id/daftar-internet-banking-bni-gak-ribet-kok-gini-cara-cepatnya/ True
https://www.moneysmart.id/cara-membuat-npwp-online/ True
https://www.moneysmart.id/daftar-cimb-clicks-gampang-kok/ True
https://www.moneysmart.id/nomor-call-center-bca-yang-bisa-dihubungi/ True
https://www.moneysmart.id/surat-lamaran-kerja-dan-cara-serta-contohnya/ True
https://www.moneysmart.id/bca-mobi

In [141]:
len(non_seo_test_urls)

11152

In [140]:
for url in non_seo_test_urls[:100]:
    print(url, is_v2_page(url,page_cache))

https://www.moneysmart.id/7-situs-survey-online-indonesia-bayar-kamu-pakai-dolar/ False
https://www.moneysmart.id/pengin-liburan-murah-meriah-yuk-manfaatkan-travel-fair/ False
https://www.moneysmart.id/lomba-blog-moneysmart-2018-cerdasdenganuangmu/ False
https://www.moneysmart.id/ingin-tembus-panggung-new-york-fashion-week-simak-kisah-inspiratif-3-desainer-ini/ False
https://www.moneysmart.id/begini-panduan-cara-memilih-kerja-di-perusahaan-besar-atau-kecil/ False
https://www.moneysmart.id/jam-tangan-chopard-maia-estianty-dan-fakta-menariknya/ False
https://www.moneysmart.id/mobil-ustadz-arifin-ilham/ False
https://www.moneysmart.id/ubah-gaya-belanja-jadi-begini-di-2017-biar-nabung-gak-cuma-wacana/ False
https://www.moneysmart.id/tarif-cukai-naik-ini-prediksiannya/ False
https://www.moneysmart.id/ternyata-restoran-artis-pantes-8-tempat-makan-ini-hits-banget/ False
https://www.moneysmart.id/wahai-para-suami-bedakan-ya-antara-uang-nafkah-dan-uang-belanja/ False
https://www.moneysmart.id/s

## Check Test Articles have v2 AMP

## Check Cloudflare is (to some extent) Working