# Checking the SEO Test Setup

This notebook checks that the right URLs are set up for the SEO test, without trying to test everything.

In brief:

* Focus on the articles
* Check the test urls are accessible and show v2
* If AMP is enabled, check that control urls show v1 AMP and test urls show v2 AMP
* Check the canonicals
* Check the breadcrumbs
* Check that Cloudflare appears to be working
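
As an illustration of the canonical check in the list above, here is a minimal sketch. It assumes the expectation is a single `<link rel="canonical">` pointing back at the article's own url, which the notebook does not confirm. `extract_canonicals` and `canonical_problem` are hypothetical names, and the sketch uses only the standard library so it can be run outside the notebook.

```python
from html.parser import HTMLParser


class _CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def extract_canonicals(page_html):
    """Return all canonical hrefs found in the page (normally exactly one)."""
    finder = _CanonicalFinder()
    finder.feed(page_html)
    return finder.hrefs


def canonical_problem(page_html, expected_url):
    """Return None if the page has one canonical equal to expected_url, else a message.

    Assumption: a trailing slash difference is not treated as a mismatch.
    """
    canonicals = extract_canonicals(page_html)
    if len(canonicals) != 1:
        return "expected 1 canonical tag, found %i" % len(canonicals)
    if canonicals[0].rstrip("/") != expected_url.rstrip("/"):
        return "canonical is %s, expected %s" % (canonicals[0], expected_url)
    return None
```

Returning a message rather than raising makes it easy to collect every failing url in one pass over the test set.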


Out of scope:

* Validating that an AMP page is valid AMP (we have a separate test suite for this)
* Robustness to unexpected HTML during the blog v2 launch
* Testing non-article pages (home page, category pages, etc.)

In [32]:
import requests
from datetime import datetime
from bs4 import BeautifulSoup
import csv

## Settings


In [43]:
country_to_test = "id"
test_staging = True

test_urls_csv_filepath = {  # These originally come from https://docs.google.com/spreadsheets/d/1UdrncwVdObaA6d0lnvbAQCHBpQMzEhWPz5OTR9Syobo/edit#gid=0
    "id":"./SEO Test Urls/id_seo_test_urls.csv",
    "tw":"./SEO Test Urls/tw_seo_test_urls.csv",
    "sg": None,
    "hk": None,
}

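staging_url_replace_map = {  # country: [production host, staging host]
    "id": ["www.moneysmart.id", "www.msiddev.com"],
    "tw": ["www.moneysmart.id", "www.msiddev.com"],  # FIXME: copy of the id hosts; the tw hosts still need filling in
    "sg": ["blog.moneysmart.sg", "blog.mssgdev.com"],
    "hk": ["www.moneysmart.hk", "blog.mshkdev.com"],  # NB: post_sitemaps uses blog.moneysmart.hk
}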

post_sitemaps = {  # Yes, some of these could be discovered dynamically, but hardcoding saves some coding
    "id": ["https://www.moneysmart.id/post-sitemap%i.xml" % i for i in range(1, 13)],
    "tw": ["https://www.moneysmart.tw/post-sitemap.xml"],
    "sg": ["https://blog.moneysmart.sg/post-sitemap%i.xml" % i for i in range(1, 5)],
    "hk": ["https://blog.moneysmart.hk/post-sitemap%i.xml" % i for i in range(1, 3)],

    # Aside: there seem to be public pages for the ads in the sitemap :(
}
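
The `test_staging` flag and `staging_url_replace_map` imply that production urls get rewritten to their staging host before being requested. A minimal sketch of that translation follows. `to_staging_url` is a hypothetical helper, not something defined in this notebook, and it assumes each map entry is a `[production host, staging host]` pair.

```python
from urllib.parse import urlsplit, urlunsplit

# Copied from the settings above (only two countries, to keep the sketch short)
staging_url_replace_map = {
    "id": ["www.moneysmart.id", "www.msiddev.com"],
    "sg": ["blog.moneysmart.sg", "blog.mssgdev.com"],
}


def to_staging_url(url, country_code, replace_map=staging_url_replace_map):
    """Swap the production host for the staging host; path and query are left untouched.

    Urls on any other host (e.g. a www3 redirect target) are returned unchanged.
    """
    production_host, staging_host = replace_map[country_code]
    parts = urlsplit(url)
    if parts.netloc != production_host:
        return url
    return urlunsplit(parts._replace(netloc=staging_host))


staging_url = to_staging_url("https://www.moneysmart.id/tas-hermes-mahal-ini-sebabnya/?amp", "id")
print(staging_url)
```

Replacing only the host part, rather than calling `str.replace` on the whole url, avoids accidentally rewriting a production hostname that happens to appear in a path or query string.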



## Test URL Sets

In [38]:
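def load_seo_test_urls(country_code):
    """Load the test urls for a country from its CSV file (first column, header row skipped)."""
    filepath = test_urls_csv_filepath[country_code]
    if filepath is None:
        raise ValueError("No test url file is configured for %s" % country_code)
    urls = []
    with open(filepath, newline="") as f:
        csv_reader = csv.reader(f, delimiter=',')
        header = next(csv_reader, None)
        print("Skipping first line of url file: %s" % header)
        for row in csv_reader:
            if row and row[0].strip():  # ignore blank lines
                urls.append(row[0].strip())

    return urls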

seo_test_urls = load_seo_test_urls(country_to_test)

print("Found %i test urls to check for" % len(seo_test_urls))


Skipping first line of url file: ['url']
Found 50 test urls to check for


## All Posts from Sitemaps (and test that the sitemap works!)

In [48]:
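def get_urls_from_sitemap(sitemap_url):
    """Return every <loc> url listed in one sitemap file."""
    r = requests.get(sitemap_url, timeout=30)
    r.raise_for_status()  # a missing sitemap should fail loudly rather than yield zero urls
    soup = BeautifulSoup(r.text, "html.parser")  # html.parser copes fine with this simple xml
    urls = [element.text.strip() for element in soup.find_all('loc')]
    return urls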

def get_all_post_urls_from_sitemap(country_code):
    sitemaps = post_sitemaps[country_code]
    all_urls = []
    for i, sitemap in enumerate(sitemaps):
        print("Loading urls from sitemap %i of %i %s" % (i, len(sitemaps), sitemap))
        sitemap_urls = get_urls_from_sitemap(sitemap)
        all_urls+=sitemap_urls
    return all_urls

all_post_urls = get_all_post_urls_from_sitemap(country_to_test)

print("Found %i posts in the sitemap"%len(all_post_urls))

Loading urls from sitemap 0 of 12 https://www.moneysmart.id/post-sitemap1.xml
Loading urls from sitemap 1 of 12 https://www.moneysmart.id/post-sitemap2.xml
Loading urls from sitemap 2 of 12 https://www.moneysmart.id/post-sitemap3.xml
Loading urls from sitemap 3 of 12 https://www.moneysmart.id/post-sitemap4.xml
Loading urls from sitemap 4 of 12 https://www.moneysmart.id/post-sitemap5.xml
Loading urls from sitemap 5 of 12 https://www.moneysmart.id/post-sitemap6.xml
Loading urls from sitemap 6 of 12 https://www.moneysmart.id/post-sitemap7.xml
Loading urls from sitemap 7 of 12 https://www.moneysmart.id/post-sitemap8.xml
Loading urls from sitemap 8 of 12 https://www.moneysmart.id/post-sitemap9.xml
Loading urls from sitemap 9 of 12 https://www.moneysmart.id/post-sitemap10.xml
Loading urls from sitemap 10 of 12 https://www.moneysmart.id/post-sitemap11.xml
Loading urls from sitemap 11 of 12 https://www.moneysmart.id/post-sitemap12.xml
Found 11164 posts in the sitemap


In [51]:
non_test_urls = list(set(all_post_urls) - set(seo_test_urls))
print("That leaves %i post urls that aren't in the test" % len(non_test_urls))

That leaves 11114 post urls that aren't in the test


## Utility Functions

In [12]:
import time

# get_all_post_urls_from_sitemap is defined in the sitemap section above.


def is_amp_page(url, page_cache):
    """
    Very basic test on whether the page is marked as AMP in its html tag,
    i.e. <html ⚡> or <html amp>.
    """
    soup = page_cache.get_soup(url)
    html_tag = soup.find('html')
    if html_tag is None:
        return False
    return "amp" in html_tag.attrs or "⚡" in html_tag.attrs


def is_v2_page(url, page_cache):
    raise NotImplementedError("is_v2_page is not written yet")


def is_v2_amp_page(url, page_cache):
    raise NotImplementedError("is_v2_amp_page is not written yet")


def extract_link_to_amp_version(url, page_cache):
    """
    Returns None if the amp version can't be found
    """
    raise NotImplementedError("extract_link_to_amp_version is not written yet")


def extract_breadcrumbs(url, page_cache):
    raise NotImplementedError("extract_breadcrumbs is not written yet")


def extract_canonical_tag(url, page_cache):
    raise NotImplementedError("extract_canonical_tag is not written yet")


def extract_blogposting_structured_data(url, page_cache):
    raise NotImplementedError("extract_blogposting_structured_data is not written yet")


class PageCache(object):
    """
    This is an abstraction for getting a url, which understands
    translation between staging and production, authentication etc
    and uses a cache to allow efficient repeat requests.
    (Whether that cache is in memory or not is academic.)

    It also includes throttling.

    The result might be a redirect to, say, www3 or whatever.
    """

    def __init__(self):
        self._page_cache_dict = {}  # url: (page_html, load_time, cloudflare_in_headers, cloudflare_hit, url_after_redirect)
        self._soup_cache_dict = {}  # url: beautifulsoup
        self._page_load_time_dict = {}  # url: how long the page took to load, in seconds
        self.last_page_request_time = datetime.now()
        self.min_seconds_between_server_requests = 0.5  # currently measured from request completion, rather than start

    def get_html(self, url):
        """
        Returns the html for the url, only requesting it if it isn't in the page cache already
        """
        if url not in self._page_cache_dict:
            # Throttle: leave a minimum gap since the previous request completed
            seconds_since_last_request = (datetime.now() - self.last_page_request_time).total_seconds()
            seconds_to_wait = self.min_seconds_between_server_requests - seconds_since_last_request
            if seconds_to_wait > 0:
                time.sleep(seconds_to_wait)

            self._download_url_and_cache(url)
            self.last_page_request_time = datetime.now()

        return self._page_cache_dict[url][0]

    def get_soup(self, url):
        """
        Returns the beautifulsoup version of the page for tag extraction
        """
        if url not in self._soup_cache_dict:
            html = self.get_html(url)
            soup = BeautifulSoup(html, "html.parser")
            self._soup_cache_dict[url] = soup

        return self._soup_cache_dict[url]

    def _download_url_and_cache(self, url):
        r = requests.get(url)

        r.raise_for_status()  # Throw an exception if we get e.g. a 404.

        load_time = r.elapsed.total_seconds()
        headers = r.headers  # case-insensitive, so "Server" also matches "server"

        # Look at the headers to judge whether CDN caching of the html is working,
        # e.g. "Server: cloudflare" together with "CF-Cache-Status: HIT" (or "EXPIRED")
        is_cdn_caching = headers.get("Server") == "cloudflare"
        is_cdn_cache_hit = is_cdn_caching and headers.get("CF-Cache-Status") == "HIT"

        url_after_redirect = r.url

        self._page_load_time_dict[url] = load_time
        self._page_cache_dict[url] = (r.text, load_time, is_cdn_caching, is_cdn_cache_hit, url_after_redirect)

In [3]:
datetime.now()

datetime.datetime(2019, 10, 26, 15, 15, 0, 956878)

In [27]:
r = requests.get("https://www3.moneysmart.id/tas-hermes-mahal-ini-sebabnya/?forceTest=test")
r_amp = requests.get("https://www3.moneysmart.id/tas-hermes-mahal-ini-sebabnya/?amp")

In [19]:
r.headers

{'Date': 'Sat, 26 Oct 2019 15:52:12 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=d08f2c1f5389267ebc10d143f9c84b2901572105132; expires=Sun, 25-Oct-20 15:52:12 GMT; path=/; domain=.moneysmart.id; HttpOnly; Secure, ms_experiment_wa0051={MS-046:test}; path=/; domain=moneysmart.id', 'CF-Cache-Status': 'HIT', 'Cache-Control': 'public, max-age=3600, s-maxage=3600', 'CF-Ray': '52bd98934edfaa1e-SIN', 'Age': '7', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'Content-Encoding': 'gzip'}

In [20]:
r.text[:100]

'<!doctype html>\n<html data-n-head-ssr lang="id-ID" data-n-head="lang">\n<head data-n-head="">\n<title '

In [28]:
soup = BeautifulSoup(r_amp.text, "html.parser")

In [42]:
html_tag = soup.find('html')
html_tag.attrs
list(range(1,13))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Tests

## Check the Homepage still loads!

## Check the AB split percentage (approximately)
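
The notebook does not say how the split should be measured. One reading is that it is the share of a random sample of post urls that serve v2. The sketch below follows that reading. `estimate_v2_share` is a hypothetical helper, and `is_v2` stands for any predicate that takes a url and returns True for a v2 page, so the sampling logic can be tried without network access.

```python
import random


def estimate_v2_share(urls, is_v2, sample_size=50, seed=0):
    """Return the fraction of a random sample of urls for which is_v2(url) is True.

    sample_size and seed are illustrative defaults, not values from the notebook.
    """
    urls = list(urls)
    if not urls:
        raise ValueError("no urls to sample from")
    rng = random.Random(seed)  # seeded so that a rerun checks the same pages
    sample = rng.sample(urls, min(sample_size, len(urls)))
    v2_count = sum(1 for url in sample if is_v2(url))
    return v2_count / len(sample)
```

In the notebook this could be called with `all_post_urls` and a predicate built on the page cache. Since the result is only an estimate, it is best compared against the intended split with a generous tolerance.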

## Check Test Articles are v2
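
One possible v2 check is sketched below, under a loud assumption: that v2 pages can be recognised by the `data-n-head-ssr` attribute on the `<html>` tag, as seen in the response fetched above with `?forceTest=test`, and that v1 pages lack it. This has not been confirmed against a v1 page. `html_tag_attrs` and `looks_like_v2` are hypothetical names, and the sketch works on an html string using only the standard library.

```python
from html.parser import HTMLParser

V2_MARKER_ATTR = "data-n-head-ssr"  # assumption: present on v2 pages only


class _HtmlTagAttrs(HTMLParser):
    """Records the attributes of the first <html> tag in a document."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.attrs is None:
            self.attrs = dict(attrs)  # valueless attributes map to None


def html_tag_attrs(page_html):
    """Return the <html> tag's attributes as a dict ({} if there is no <html> tag)."""
    parser = _HtmlTagAttrs()
    parser.feed(page_html)
    return parser.attrs or {}


def looks_like_v2(page_html):
    """True if the <html> tag carries the assumed v2 marker attribute."""
    return V2_MARKER_ATTR in html_tag_attrs(page_html)
```

Applied to the test set, this would mean fetching each url in `seo_test_urls` and listing those for which `looks_like_v2` is False.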

## Check Test Articles have v2 AMP
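
A sketch of the first half of this check: find the AMP version advertised by an article and confirm that the page it points to is marked as AMP. It relies on the standard AMP conventions (`<link rel="amphtml" href="...">` on the regular page, and `<html amp>` or `<html ⚡>` on the AMP page). How to tell v2 AMP from v1 AMP is not covered, because the notebook does not describe that difference. `amp_link` and `is_amp_html` are hypothetical names.

```python
from html.parser import HTMLParser


class _AmpInfo(HTMLParser):
    """Collects the two AMP signals from a page in a single pass."""

    def __init__(self):
        super().__init__()
        self.is_amp = False
        self.amphtml_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and ("amp" in attrs or "⚡" in attrs):
            self.is_amp = True
        if tag == "link" and self.amphtml_href is None:
            rel = (attrs.get("rel") or "").lower().split()
            if "amphtml" in rel:
                self.amphtml_href = attrs.get("href")


def _parse(page_html):
    info = _AmpInfo()
    info.feed(page_html)
    return info


def amp_link(page_html):
    """Return the href of <link rel="amphtml">, or None if the page has none."""
    return _parse(page_html).amphtml_href


def is_amp_html(page_html):
    """True if the <html> tag is marked with the amp attribute (or the lightning emoji)."""
    return _parse(page_html).is_amp
```

For each test article, the check would then be: `amp_link` is not None, and the html fetched from that link satisfies `is_amp_html`.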

## Check Cloudflare is (to some extent) Working
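
A minimal sketch of how the header inspection could be turned into a summary across many pages. It assumes that "working" means responses are served by Cloudflare and that a reasonable share of them report `CF-Cache-Status: HIT`. `cloudflare_status` and `summarise_cloudflare` are hypothetical names, and they accept any mapping of header names to values, including `r.headers` from `requests`.

```python
from collections import Counter


def cloudflare_status(headers):
    """Classify one response's headers.

    Returns "not-cloudflare" if the Server header is not cloudflare, otherwise
    the upper-cased CF-Cache-Status value, or "MISSING" if that header is absent.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("server", "").lower() != "cloudflare":
        return "not-cloudflare"
    return lowered.get("cf-cache-status", "missing").upper()


def summarise_cloudflare(header_sets):
    """Count how many responses fall into each status."""
    return Counter(cloudflare_status(headers) for headers in header_sets)


# Header values taken from the response inspected earlier in this notebook
example_headers = {"Server": "cloudflare", "CF-Cache-Status": "HIT", "Age": "7"}
print(summarise_cloudflare([example_headers, {"Server": "cloudflare"}]))
```

A first request to a page will often not be a cache hit, so requesting each sampled url twice and classifying only the second response gives a fairer picture.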