CloudZero Technical Challenge
---

### Technical interview introduction:

We find that interviews can be stressful, and at your job you generally aren’t solving problems under that kind of stress. To allow people to showcase their skills in a more natural environment, we’ve created a take-home problem that we ask candidates to solve. These problems should allow people to demonstrate how they solve problems with code, which is what anyone joining the engineering team would do day to day.

While we recognize that a take-home problem has its own downsides, chief among them that it’s a non-trivial request on time from people who are already busy (see below for other options), we believe that any interview process requires a significant time commitment, and this one gives us a better objective measure of how candidates would solve the types of problems necessary to thrive at CloudZero.

Please note, if you feel that you cannot find the time to work on the problem below, we will happily facilitate one of two other options.

First, if you have code that you can send us (an open source project, or something you’ve worked on for an employer or client that you can share), feel free to send us that. We ask that it show off your ability with Python and AWS and be something on which you were the primary decision maker and coder. Please also ensure that it is something we can execute ourselves.

The second option is that we will allow you to pair-program a problem during one hour of the interview process. The problem will be given to you at the interview, and we’ll provide a laptop, an IDE, and an engineer to pair with.

Please indicate your preference between these three options as early as possible so that we can plan for a good experience.

### The problem:
For the problem below, we’d like you to produce code that solves the problem and get it to us before we have you come in so that we can evaluate the solution. We will review the output with you when you come in. Sending us a link to a GitHub repo is probably easiest, but if you prefer getting us your solution in another way that is fine as well. Take the time you need and feel free to reach out if any part of the problem is unclear.

Using Python 3.6+, create a system that analyzes the Alexa top 1,000 sites.  The analysis should include:
 * Per Site
   * Word count of the first page 
   * Rank across all sites based on the word count
   * Duration of the scan
 * Across All Sites
   * Average word count of the first page
   * Top 20 HTTP headers and the percentage of sites they were seen in
   * Duration of the entire scan
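
The aggregate metrics in the list above can be sketched as three small functions. This is only one possible reading of the requirements: it assumes that rank 1 means the highest word count, that header names are compared case-insensitively, and that each header is counted once per site. The function names and the header sample are hypothetical; the word counts come from the sample run later in this document.

```python
from collections import Counter


def rank_by_word_count(word_counts):
    """Map site -> rank, assuming rank 1 is the site with the most words."""
    ordered = sorted(word_counts, key=word_counts.get, reverse=True)
    return {site: position for position, site in enumerate(ordered, start=1)}


def average_word_count(word_counts):
    """Mean word count across all scanned sites."""
    return sum(word_counts.values()) / len(word_counts)


def top_headers(headers_per_site, limit=20):
    """Return [(header, percent_of_sites), ...] for the most common headers."""
    seen = Counter()
    for headers in headers_per_site:
        # Lower-case the names and use a set so each header counts once per site.
        seen.update({name.lower() for name in headers})
    total = len(headers_per_site)
    return [(name, 100.0 * count / total) for name, count in seen.most_common(limit)]


# Word counts taken from the sample run later in this document.
word_counts = {"google.com": 133, "youtube.com": 3037, "facebook.com": 1051}
# Hypothetical response headers, one dict per scanned site.
headers_per_site = [
    {"Content-Type": "text/html", "Server": "gws"},
    {"content-type": "text/html"},
    {"Content-Type": "text/html", "Server": "nginx"},
    {"Content-Type": "text/html"},
]

ranks = rank_by_word_count(word_counts)
average = average_word_count(word_counts)
header_share = top_headers(headers_per_site)
print(ranks, average, header_share)
```

Keeping the aggregation separate from the fetching makes it possible to check these numbers without touching the network.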

You may use any library or AWS service that you desire as long as your system is installable by us.  Don't over-engineer the solution; keep it simple, elegant and functional.  Your solution should run to completion within 15 minutes.
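
As a rough budget, 1,000 sites in 15 minutes (900 seconds) leaves under one second per site if they are scanned one at a time, so some form of concurrency is likely to help. The sketch below shows one simple option using a thread pool from the standard library. The names `scan_all`, `scan_one`, and `fake_scan` are hypothetical, the worker count is an arbitrary assumption, and `time.sleep` stands in for the network wait of a real fetch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scan_all(sites, scan_one, max_workers=32):
    """Scan sites concurrently; return (results, total_duration_seconds)."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input.
        results = list(pool.map(scan_one, sites))
    return results, time.perf_counter() - started


def fake_scan(site):
    """Placeholder for a real per-site scan: wait, then return a value."""
    time.sleep(0.05)  # stands in for waiting on a network response
    return site.upper()


sites = ["site{}.example".format(i) for i in range(8)]
results, elapsed = scan_all(sites, fake_scan)
print("scanned {} sites in {:.2f} s".format(len(results), elapsed))
```

Threads suit this workload because most of the time is spent waiting on I/O rather than computing.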

Please be aware, the cost for calling some AWS services and APIs can add up quickly. You may want to watch that as you develop so that you don't incur an unexpected expense during development!

In [1]:
import boto3

In [6]:
session = boto3.Session(profile_name='personal')
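# Note: 'alexaforbusiness' is the Alexa for Business API, not Alexa Top Sites;
# this client is not used in the cells below.
client = session.client('alexaforbusiness')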

In [4]:
test_json = open("test.json", "r")

In [5]:
import json

In [6]:
hundred_sites = json.load(test_json)

In [7]:
hundred_sites

{'Ats': {'OperationRequest': {'RequestId': 'e314e05a-8302-11e9-92f2-6572acfe38de'},
  'Results': {'Result': {'Alexa': {'Request': {'Arguments': {'Argument': [{'Name': 'countrycode',
         'Value': 'US'},
        {'Name': 'count', 'Value': '100'},
        {'Name': 'responsegroup', 'Value': 'Country'}]}},
     'TopSites': {'Country': {'CountryName': 'United States',
       'CountryCode': 'US',
       'TotalSites': '392675',
       'Sites': {'Site': [{'DataUrl': 'google.com',
          'Country': {'Rank': '1',
           'Reach': {'PerMillion': '826100'},
           'PageViews': {'PerMillion': '239430', 'PerUser': '12.61'}},
          'Global': {'Rank': '1'}},
         {'DataUrl': 'youtube.com',
          'Country': {'Rank': '2',
           'Reach': {'PerMillion': '440200'},
           'PageViews': {'PerMillion': '46010', 'PerUser': '4.55'}},
          'Global': {'Rank': '2'}},
         {'DataUrl': 'facebook.com',
          'Country': {'Rank': '3',
           'Reach': {'PerMillion': '2

In [25]:
# Drill down to the list of per-site records.
hundred_sites = hundred_sites['Ats']['Results']['Result']['Alexa']['TopSites']['Country']['Sites']['Site']

In [26]:
test_json.close()

In [27]:
hundred_sites

[{'DataUrl': 'google.com',
  'Country': {'Rank': '1',
   'Reach': {'PerMillion': '826100'},
   'PageViews': {'PerMillion': '239430', 'PerUser': '12.61'}},
  'Global': {'Rank': '1'}},
 {'DataUrl': 'youtube.com',
  'Country': {'Rank': '2',
   'Reach': {'PerMillion': '440200'},
   'PageViews': {'PerMillion': '46010', 'PerUser': '4.55'}},
  'Global': {'Rank': '2'}},
 {'DataUrl': 'facebook.com',
  'Country': {'Rank': '3',
   'Reach': {'PerMillion': '235000'},
   'PageViews': {'PerMillion': '17750', 'PerUser': '3.29'}},
  'Global': {'Rank': '3'}},
 {'DataUrl': 'amazon.com',
  'Country': {'Rank': '4',
   'Reach': {'PerMillion': '174700'},
   'PageViews': {'PerMillion': '33870', 'PerUser': '8.44'}},
  'Global': {'Rank': '10'}},
 {'DataUrl': 'reddit.com',
  'Country': {'Rank': '5',
   'Reach': {'PerMillion': '107500'},
   'PageViews': {'PerMillion': '18050', 'PerUser': '7.3'}},
  'Global': {'Rank': '13'}},
 {'DataUrl': 'wikipedia.org',
  'Country': {'Rank': '6',
   'Reach': {'PerMillion': '1086
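
The nested lookup above can be wrapped in a small helper that returns just the rank and domain of each site. This is a sketch, not part of the original notebook: `extract_sites` is a hypothetical name, and the single-site branch is an assumption about how an XML-to-JSON conversion might represent a one-element list. The sample dict reuses the first two records from the output above.

```python
def extract_sites(response):
    """Return [(country_rank, domain), ...] sorted by rank."""
    sites = (
        response["Ats"]["Results"]["Result"]["Alexa"]
        ["TopSites"]["Country"]["Sites"]["Site"]
    )
    # Assumption: a lone site may arrive as a dict rather than a list.
    if isinstance(sites, dict):
        sites = [sites]
    # Ranks are strings in the response, so convert before sorting.
    return sorted((int(site["Country"]["Rank"]), site["DataUrl"]) for site in sites)


sample = {
    "Ats": {"Results": {"Result": {"Alexa": {"TopSites": {"Country": {"Sites": {"Site": [
        {"DataUrl": "youtube.com", "Country": {"Rank": "2"}, "Global": {"Rank": "2"}},
        {"DataUrl": "google.com", "Country": {"Rank": "1"}, "Global": {"Rank": "1"}},
    ]}}}}}}}
}

ranked = extract_sites(sample)
print(ranked)
```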

In [35]:
import requests as r

In [34]:
from bs4 import BeautifulSoup
import time

In [39]:
# Note: time.process_time_ns() requires Python 3.7+ and counts CPU time only, so time
# spent waiting on the network is excluded; time.perf_counter_ns() measures wall-clock time.
scan_start = time.process_time_ns()
for site in hundred_sites:
    site_scan_start = time.process_time_ns()
    site_url = site["DataUrl"]
    print("Started analyzing {} at {}".format(site_url, site_scan_start))
    # No timeout is passed, so requests will wait indefinitely on an unresponsive site.
    site_request = r.get("https://{}".format(site_url))
    site_text = site_request.text
    site_headers = site_request.headers
    site_soup = BeautifulSoup(site_text, "lxml")
    # Count whitespace-separated tokens in the page's extracted text.
    site_word_count = len(site_soup.text.split())
    print("Word count for {} is {}".format(site_url, site_word_count))
    site_scan_stop = time.process_time_ns()
    print("Stopped analyzing {} at {}, it took {} nanoseconds".format(site_url, site_scan_stop, site_scan_stop-site_scan_start))
scan_stop = time.process_time_ns()

Started analyzing google.com at 3567927000
Word count for google.com is 133
Stopped analyzing google.com at 3608171000, it took 40244000 nanoseconds
Started analyzing youtube.com at 3608249000
Word count for youtube.com is 3037
Stopped analyzing youtube.com at 3787561000, it took 179312000 nanoseconds
Started analyzing facebook.com at 3787619000
Word count for facebook.com is 1051
Stopped analyzing facebook.com at 3846983000, it took 59364000 nanoseconds
Started analyzing amazon.com at 3847036000
Word count for amazon.com is 131
Stopped analyzing amazon.com at 3888218000, it took 41182000 nanoseconds
Started analyzing reddit.com at 3888245000
Word count for reddit.com is 125
Stopped analyzing reddit.com at 3923745000, it took 35500000 nanoseconds
Started analyzing wikipedia.org at 3923774000
Word count for wikipedia.org is 1499
Stopped analyzing wikipedia.org at 3998198000, it took 74424000 nanoseconds
Started analyzing yahoo.com at 3998266000
Word count for yahoo.com is 10574
Stopped analyzi

KeyboardInterrupt: 
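
The loop above times each site with `time.process_time_ns()`, which counts CPU time and therefore leaves out time spent waiting on the network. The sketch below contrasts it with `time.perf_counter_ns()`, which measures elapsed wall-clock time. The `measure` helper is a hypothetical name, and `time.sleep` stands in for a slow network response.

```python
import time


def measure(fn, *args):
    """Run fn(*args) and return (result, wall_ns, cpu_ns)."""
    wall_start = time.perf_counter_ns()
    cpu_start = time.process_time_ns()
    result = fn(*args)
    cpu_ns = time.process_time_ns() - cpu_start
    wall_ns = time.perf_counter_ns() - wall_start
    return result, wall_ns, cpu_ns


# Sleeping uses almost no CPU, much like waiting on a socket.
_, wall_ns, cpu_ns = measure(time.sleep, 0.05)
print("wall: {:.1f} ms, cpu: {:.1f} ms".format(wall_ns / 1e6, cpu_ns / 1e6))
```

For a "duration of the scan" figure that reflects what a user would experience, the wall-clock value is probably the one to report.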

In [43]:
# Repeat the scan for a single site that appears to hang.
site_scan_start = time.process_time_ns()
# site_url = site["DataUrl"]
site_url = "microsoftonline.com"
print("Started analyzing {} at {}".format(site_url, site_scan_start))
# No timeout is passed, so this call can block until it is interrupted.
site_request = r.get("http://{}".format(site_url))
site_text = site_request.text
site_headers = site_request.headers
site_soup = BeautifulSoup(site_text, "lxml")
site_word_count = len(site_soup.text.split())
print("Word count for {} is {}".format(site_url, site_word_count))
site_scan_stop = time.process_time_ns()
print("Stopped analyzing {} at {}, it took {} nanoseconds".format(site_url, site_scan_stop, site_scan_stop-site_scan_start))

Started analyzing microsoftonline.com at 4563051000


KeyboardInterrupt: 
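
The run above had to be interrupted by hand because the request never returned. One way to keep a single site from stalling the whole scan is to pass a timeout and record failures instead of raising. The sketch below assumes this approach: `requests_fetch` and `scan_site` are hypothetical names and the 10-second default is an arbitrary choice. Note that the `timeout` argument of `requests.get` bounds the connect and read waits, not the total download time.

```python
import time


def requests_fetch(url, timeout_s):
    """Fetch url with the requests library, bounding connect and read waits."""
    import requests  # imported lazily so scan_site can be used with another fetcher

    response = requests.get(url, timeout=timeout_s)
    return response.text, dict(response.headers)


def scan_site(site_url, fetch=requests_fetch, timeout_s=10.0):
    """Scan one site; record a failure instead of hanging or raising."""
    started = time.perf_counter_ns()
    try:
        text, headers = fetch("https://{}".format(site_url), timeout_s)
        result = {"site": site_url, "ok": True, "text": text, "headers": headers}
    except Exception as exc:  # broad on purpose: one bad site must not stop the scan
        result = {"site": site_url, "ok": False, "error": repr(exc)}
    result["duration_ns"] = time.perf_counter_ns() - started
    return result
```

Calling `scan_site("microsoftonline.com")` would then return a record with `ok` set to `False` if the site does not respond in time, and the loop can move on to the next site.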