# 03 - Web scraping: headers, the Network tab and parsing an API URL
## Helpful links and resources
- [urllib.parse](https://docs.python.org/3/library/urllib.parse.html#) is a Python standard-library module for breaking URLs into their components
- [Session objects - requests library](https://docs.python-requests.org/en/master/user/advanced/#session-objects)

## Table of contents
1. The Network tab and advanced scraping
    1. Static data files
    1. "Secret" APIs
        1. Target's search API
        1. Target's aggregation API
        1. Target's client API
1. Using sessions to login
    1. Accessing password-protected pages

In [13]:
# import libraries
from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse, parse_qs
import json

## The Network tab and advanced scraping
### Static data files
[Covid cases in the US - New York Times](https://www.nytimes.com/interactive/2021/us/covid-cases.html)

In [180]:
# get static data file
covid_cases_r = requests.get('https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/data/pages/usa/data.json')

In [181]:
covid_cases = covid_cases_r.json()

In [183]:
# covid_cases
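The full payload is large, so it can help to look at its shape before printing it. The following is a minimal sketch that assumes nothing about the real file's layout; `describe_json` and the `sample` dictionary are hypothetical and only stand in for `covid_cases`.

```python
def describe_json(obj, max_depth=2, depth=0):
    """Return lines summarizing keys and value types, down to max_depth levels."""
    lines = []
    indent = "  " * depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth + 1 < max_depth:
                lines.extend(describe_json(value, max_depth, depth + 1))
    elif isinstance(obj, list) and obj:
        # Only the first element is inspected, assuming the list is homogeneous
        lines.append(f"{indent}[{len(obj)} items] first: {type(obj[0]).__name__}")
        if depth + 1 < max_depth:
            lines.extend(describe_json(obj[0], max_depth, depth + 1))
    return lines


# Made-up stand-in for the real response (not the actual NYT structure)
sample = {"updated": "2021-07-05", "regions": [{"name": "New York", "cases": [1, 2, 3]}]}
for line in describe_json(sample):
    print(line)
```

In the notebook, the same helper could be called as `describe_json(covid_cases)` to decide which keys are worth drilling into.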

### "Secret" APIs
Shopping websites are good candidates for "secret" (undocumented) APIs; [Target](https://www.target.com) is one example.

#### Target's Search API

In [2]:
# search for an item with the Network tab open to identify which API requests the page makes
# parse the URL so it's easier to read
parsed_url = urlparse('https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1?key=ff457966e64d5e877fdbad070f276d18ecec4a01&channel=WEB&count=24&default_purchasability_filter=true&include_sponsored=true&keyword=paper+plates&offset=0&page=%2Fs%2Fpaper+plates&platform=desktop&pricing_store_id=2850&scheduled_delivery_store_id=2850&store_ids=2850%2C1849%2C3284%2C3229%2C3249&useragent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36&visitor_id=017A71BED83F0201BCBD154FC5FC4C74')

In [3]:
# check the parsed URL
parsed_url

ParseResult(scheme='https', netloc='redsky.target.com', path='/redsky_aggregations/v1/web/plp_search_v1', params='', query='key=ff457966e64d5e877fdbad070f276d18ecec4a01&channel=WEB&count=24&default_purchasability_filter=true&include_sponsored=true&keyword=paper+plates&offset=0&page=%2Fs%2Fpaper+plates&platform=desktop&pricing_store_id=2850&scheduled_delivery_store_id=2850&store_ids=2850%2C1849%2C3284%2C3229%2C3249&useragent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36&visitor_id=017A71BED83F0201BCBD154FC5FC4C74', fragment='')
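The `ParseResult` shown above is a named tuple, so each part can be read by name as well as by position. The sketch below uses a shortened version of the search URL; the `key=EXAMPLE` value is a placeholder, not a real key.

```python
from urllib.parse import urlparse

# Shortened version of the search URL; the key value is a placeholder
url = (
    "https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1"
    "?key=EXAMPLE&keyword=paper+plates"
)
parts = urlparse(url)

# Named attributes are equivalent to parts[0], parts[1], parts[2] and parts[4]
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
print(endpoint)
print(parts.query)
```

Reading `parts.query` rather than `parts[4]` makes it harder to pick the wrong index when the code is revisited later.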

##### Formatting parameters

In [4]:
# format the endpoint and parameters
endpoint = parsed_url[0] + '://' + parsed_url[1] + parsed_url[2]
params = {}
for parameter in parsed_url[4].split('&'):
    key_value = parameter.split('=')
    params[key_value[0]] = key_value[1]
print(endpoint)
print(params)

https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1
{'key': 'ff457966e64d5e877fdbad070f276d18ecec4a01', 'channel': 'WEB', 'count': '24', 'default_purchasability_filter': 'true', 'include_sponsored': 'true', 'keyword': 'paper+plates', 'offset': '0', 'page': '%2Fs%2Fpaper+plates', 'platform': 'desktop', 'pricing_store_id': '2850', 'scheduled_delivery_store_id': '2850', 'store_ids': '2850%2C1849%2C3284%2C3229%2C3249', 'useragent': 'Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36', 'visitor_id': '017A71BED83F0201BCBD154FC5FC4C74'}

In [6]:
parsed_url[4]

'key=ff457966e64d5e877fdbad070f276d18ecec4a01&channel=WEB&count=24&default_purchasability_filter=true&include_sponsored=true&keyword=paper+plates&offset=0&page=%2Fs%2Fpaper+plates&platform=desktop&pricing_store_id=2850&scheduled_delivery_store_id=2850&store_ids=2850%2C1849%2C3284%2C3229%2C3249&useragent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36&visitor_id=017A71BED83F0201BCBD154FC5FC4C74'

In [7]:
parsed_url[4].split('&')

['key=ff457966e64d5e877fdbad070f276d18ecec4a01',
 'channel=WEB',
 'count=24',
 'default_purchasability_filter=true',
 'include_sponsored=true',
 'keyword=paper+plates',
 'offset=0',
 'page=%2Fs%2Fpaper+plates',
 'platform=desktop',
 'pricing_store_id=2850',
 'scheduled_delivery_store_id=2850',
 'store_ids=2850%2C1849%2C3284%2C3229%2C3249',
 'useragent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_15_7%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F91.0.4472.114+Safari%2F537.36',
 'visitor_id=017A71BED83F0201BCBD154FC5FC4C74']
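Splitting on `&` and `=` leaves the values percent-encoded (`paper+plates`, `%2Fs%2F...`). If those strings are encoded again when a request is built, the server may receive something different from what the browser sent. The sketch below uses the standard library's `urlencode` to illustrate the effect, on the assumption that `requests` encodes `params` values in a similar way; the shortened `query` string is taken from the URL above.

```python
from urllib.parse import parse_qsl, urlencode

query = "keyword=paper+plates&page=%2Fs%2Fpaper+plates&store_ids=2850%2C1849"

# Manual split: values stay percent-encoded
raw = dict(pair.split("=", 1) for pair in query.split("&"))

# parse_qsl decodes '+' and %XX escapes while splitting
decoded = dict(parse_qsl(query))

print(decoded)
print(urlencode(raw))      # already-encoded values get encoded a second time
print(urlencode(decoded))  # decoded values round-trip to the original query
```

With decoded values, a changed keyword would be written with a plain space (`'paper cups'`) and left for the library to encode.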

In [6]:
# change something in the parameters (like keyword)
params['keyword'] = 'paper+cups'

In [7]:
# get request with endpoint and params
r = requests.get(endpoint, params=params)
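Undocumented endpoints can change or start rejecting requests without notice, so it may be worth failing loudly rather than parsing an error page. The helper below is a hypothetical sketch: `fetch_json` is not part of `requests`, and it takes the HTTP function as an argument so that `requests.get` or a session's `get` can be passed in.

```python
def fetch_json(http_get, url, params=None, headers=None, timeout=10):
    """Call http_get (for example requests.get) and return the parsed JSON body.

    The response object is expected to offer raise_for_status() and json(),
    as requests responses do.
    """
    response = http_get(url, params=params, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx for requests responses
    return response.json()


# Usage in the notebook (not executed here, since it needs network access):
# data = fetch_json(requests.get, endpoint, params=params)
```

The `timeout` default of 10 seconds is an arbitrary choice for illustration; without any timeout, a stalled connection could hang the notebook cell.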

In [8]:
# drill down into the JSON response
len(r.json()['data']['search']['products'])

24
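The 24 products returned match the `count=24` parameter, which suggests the `offset` parameter could be used to page through results. This is a sketch under the unverified assumption that `offset` advances in steps of `count`; `page_params` is a hypothetical helper, not something the API documents.

```python
def page_params(base_params, page, count=24):
    """Return a copy of base_params for a zero-based page.

    Assumes offset = page * count, which is a guess about how the API pages.
    """
    updated = dict(base_params)  # copy so the original dict is left untouched
    updated["count"] = str(count)
    updated["offset"] = str(page * count)
    return updated


base = {"keyword": "paper cups", "count": "24", "offset": "0"}
for page in range(3):
    print(page_params(base, page)["offset"])
```

Each returned dictionary could then be passed as `params` in its own request, ideally with a pause between calls.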

In [9]:
# drill down some more
r.json()['data']['search']['products'][1]['parent']

{'__typename': 'ParentProductSummary',
 'tcin': '79620967',
 'item': {'relationship_type': 'Variation Parent',
  'relationship_type_code': 'VAP',
  'merchandise_classification': {'class_id': 5, 'department_id': 253},
  'eligibility_rules': {},
  'enrichment': {'buy_url': 'https://www.target.com/p/disposable-red-plastic-cups-18oz-up-up/-/A-79620967',
   'images': {'primary_image_url': 'https://target.scene7.com/is/image/Target/GUEST_2aad5f01-a64e-4bf2-b69b-07342f02326f',
    'alternate_image_urls': ['https://target.scene7.com/is/image/Target/GUEST_fe742705-1347-42ce-ae8a-3d7dc6488b2e']}},
  'has_extended_sizing': False,
  'cart_add_on_threshold': 35.0,
  'product_description': {'title': 'Disposable Red Plastic Cups - 18oz - up & up™',
   'bullet_descriptions': ['<B>Dimensions (Overall):</B> 4.73 Inches (H) x 3.88 Inches (W)',
    '<B>Includes:</B> Cups',
    '<B>Capacity (Volume):</B> 18  Ounces',
    '<B>Features:</B> Unprinted',
    '<B>Package Quantity:</B> 72',
    '<B>Food or drink
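Chained indexing like the lookup above raises an exception as soon as one key or list position is missing, which is common when different products carry different fields. The hypothetical `dig` helper below is one way to drill down defensively; the `result` dictionary is a trimmed copy of the structure shown in the output.

```python
def dig(data, *path, default=None):
    """Follow keys and list indexes in order; return default if any step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Trimmed copy of the response structure shown above
result = {"data": {"search": {"products": [
    {},
    {"parent": {
        "tcin": "79620967",
        "item": {"product_description": {
            "title": "Disposable Red Plastic Cups - 18oz - up & up™"}},
    }},
]}}}

print(dig(result, "data", "search", "products", 1, "parent", "tcin"))
print(dig(result, "data", "search", "products", 5, "parent", "tcin", default="missing"))
```

In the notebook, calling `r.json()` once and storing the result in a variable avoids re-parsing the body for every lookup.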

#### Target's aggregation API

In [10]:
# parse the URL so it's easier to read
target_list = urlparse('https://redsky.target.com/redsky_aggregations/v1/web/plp_fulfillment_v1?key=ff457966e64d5e877fdbad070f276d18ecec4a01&tcins=81107269%2C81068829%2C14135567%2C81068792%2C82079503%2C81829962%2C81068790%2C81506339%2C80935950%2C81107259%2C81068797%2C11069188%2C81506334%2C81107271%2C81068773%2C81180792%2C81107267%2C81068789%2C81068796%2C81506336%2C81107268%2C81068821%2C81564691%2C81953908%2C81068815%2C81068825%2C81068787%2C81564688&store_id=2850&zip=11201&state=NY&latitude=40.690&longitude=-74.000&scheduled_delivery_store_id=2850')

In [11]:
# check the parsed URL
target_list

ParseResult(scheme='https', netloc='redsky.target.com', path='/redsky_aggregations/v1/web/plp_fulfillment_v1', params='', query='key=ff457966e64d5e877fdbad070f276d18ecec4a01&tcins=81107269%2C81068829%2C14135567%2C81068792%2C82079503%2C81829962%2C81068790%2C81506339%2C80935950%2C81107259%2C81068797%2C11069188%2C81506334%2C81107271%2C81068773%2C81180792%2C81107267%2C81068789%2C81068796%2C81506336%2C81107268%2C81068821%2C81564691%2C81953908%2C81068815%2C81068825%2C81068787%2C81564688&store_id=2850&zip=11201&state=NY&latitude=40.690&longitude=-74.000&scheduled_delivery_store_id=2850', fragment='')

In [12]:
# format the endpoint and parameters
target_list_endpoint = target_list[0] + '://' + target_list[1] + target_list[2]
target_list_params = {}
for parameter in target_list[4].split('&'):
    key_value = parameter.split('=')
    target_list_params[key_value[0]] = key_value[1]

In [13]:
# change something in the parameters (like tcins)
target_list_params['tcins'] = '81107269'
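In the original URL the `tcins` value is a list of product ids joined by `%2C`, the percent-encoded comma. To request several products at once, a Python list can be joined with commas. The sketch below also splits a long list into batches; `tcin_batches` and its default `batch_size` are hypothetical, and no limit on the number of ids per request is known from the source.

```python
def tcin_batches(tcins, batch_size=20):
    """Yield comma-separated tcin strings with at most batch_size ids each."""
    for start in range(0, len(tcins), batch_size):
        yield ",".join(tcins[start:start + batch_size])


# A few ids taken from the URL above
tcins = ["81107269", "81068829", "14135567"]
for batch in tcin_batches(tcins, batch_size=2):
    print(batch)
```

Each yielded string could be assigned to `target_list_params['tcins']` before making a request.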

In [14]:
# get request with endpoint and params
target_list_r = requests.get(target_list_endpoint, params=target_list_params)

In [15]:
# drill down into the JSON response
target_list_r.json()['data']['product_summaries']

[{'__typename': 'ProductSummary',
  'tcin': '81107269',
  'fulfillment': {'product_id': '81107269',
   'is_out_of_stock_in_all_store_locations': False,
   'shipping_options': {'availability_status': 'IN_STOCK',
    'loyalty_availability_status': 'IN_STOCK',
    'available_to_promise_quantity': 399.0,
    'minimum_order_quantity': 1.0,
    'services': [{'shipping_method_id': 'STANDARD',
      'min_delivery_date': '2021-07-08',
      'max_delivery_date': '2021-07-08',
      'is_two_day_shipping': True,
      'is_base_shipping_method': True,
      'service_level_description': '2-day shipping',
      'shipping_method_short_description': 'Standard',
      'cutoff': '2021-07-05T16:00:00Z'}]},
   'store_options': [{'location_name': 'Brooklyn Fulton St',
     'location_address': '445 Albee Square West,BROOKLYN,NY,11201-3016',
     'location_id': '2850',
     'search_response_store_type': 'PRIMARY',
     'order_pickup': {'availability_status': 'UNAVAILABLE',
      'reason_code': 'IN_ELIGIBLE'},

In [16]:
# drill down some more
target_list_r.json()['data']['product_summaries'][0]

{'__typename': 'ProductSummary',
 'tcin': '81107269',
 'fulfillment': {'product_id': '81107269',
  'is_out_of_stock_in_all_store_locations': False,
  'shipping_options': {'availability_status': 'IN_STOCK',
   'loyalty_availability_status': 'IN_STOCK',
   'available_to_promise_quantity': 399.0,
   'minimum_order_quantity': 1.0,
   'services': [{'shipping_method_id': 'STANDARD',
     'min_delivery_date': '2021-07-08',
     'max_delivery_date': '2021-07-08',
     'is_two_day_shipping': True,
     'is_base_shipping_method': True,
     'service_level_description': '2-day shipping',
     'shipping_method_short_description': 'Standard',
     'cutoff': '2021-07-05T16:00:00Z'}]},
  'store_options': [{'location_name': 'Brooklyn Fulton St',
    'location_address': '445 Albee Square West,BROOKLYN,NY,11201-3016',
    'location_id': '2850',
    'search_response_store_type': 'PRIMARY',
    'order_pickup': {'availability_status': 'UNAVAILABLE',
     'reason_code': 'IN_ELIGIBLE'},
    'in_store_only': 
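Nested summaries like the one above are often easier to work with once flattened into one row per product. The following sketch picks out a handful of fields that appear in the output; `summarize_fulfillment` is a hypothetical helper, and the `sample` dictionary is a trimmed copy of the response.

```python
def summarize_fulfillment(summary):
    """Flatten one product summary into a small dictionary of selected fields."""
    fulfillment = summary.get("fulfillment", {})
    shipping = fulfillment.get("shipping_options", {})
    stores = fulfillment.get("store_options", [])
    first_store = stores[0] if stores else {}
    return {
        "tcin": summary.get("tcin"),
        "shipping_status": shipping.get("availability_status"),
        "ship_quantity": shipping.get("available_to_promise_quantity"),
        "store": first_store.get("location_name"),
        "pickup_status": first_store.get("order_pickup", {}).get("availability_status"),
    }


# Trimmed copy of the product summary shown above
sample = {
    "tcin": "81107269",
    "fulfillment": {
        "shipping_options": {"availability_status": "IN_STOCK",
                             "available_to_promise_quantity": 399.0},
        "store_options": [{"location_name": "Brooklyn Fulton St",
                           "order_pickup": {"availability_status": "UNAVAILABLE"}}],
    },
}
print(summarize_fulfillment(sample))
```

Applied across `product_summaries` in a list comprehension, this would give a list of flat rows that is easy to load into a table.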

#### Target's pdp_client_v1 endpoint

In [8]:
client_endpoint = 'https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=ff457966e64d5e877fdbad070f276d18ecec4a01&tcin=82795418&store_id=2850&has_store_id=true&pricing_store_id=2850&has_pricing_store_id=true&scheduled_delivery_store_id=2850&has_scheduled_delivery_store_id=true&has_financing_options=true&visitor_id=017A71BED83F0201BCBD154FC5FC4C74&has_size_context=true'

In [10]:
client_parse = urlparse(client_endpoint)

The method below achieves the same goal as the `for` loop used to build `params` in the search API section above.

In [14]:
# urllib has a way to parse queries (url parameters)
query_parse = parse_qs(client_parse[4])

In [15]:
query_parse

{'key': ['ff457966e64d5e877fdbad070f276d18ecec4a01'],
 'tcin': ['82795418'],
 'store_id': ['2850'],
 'has_store_id': ['true'],
 'pricing_store_id': ['2850'],
 'has_pricing_store_id': ['true'],
 'scheduled_delivery_store_id': ['2850'],
 'has_scheduled_delivery_store_id': ['true'],
 'has_financing_options': ['true'],
 'visitor_id': ['017A71BED83F0201BCBD154FC5FC4C74'],
 'has_size_context': ['true']}
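`parse_qs` wraps every value in a list because a query string may repeat a key. When each key appears only once, as in the output above, the lists can be collapsed to single values. A minimal sketch using a shortened query string from the same URL:

```python
from urllib.parse import parse_qs

# Shortened query string from the pdp_client_v1 URL
query = "tcin=82795418&store_id=2850&has_store_id=true"

parsed = parse_qs(query)

# Keep the first value for each key; this discards repeats if a key occurs twice
flat = {key: values[0] for key, values in parsed.items()}
print(flat)
```

Note that `parse_qs` skips parameters with blank values unless `keep_blank_values=True` is passed.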

## Using sessions to login
### Accessing password-protected pages
[Session objects - requests library](https://docs.python-requests.org/en/master/user/advanced/#session-objects)

In [17]:
# open up a session so that your login credentials are saved
session = requests.Session()

In [18]:
with open('../config/config.json') as json_file:
    config = json.load(json_file)
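The config file keeps the password out of the notebook itself. The sketch below shows one possible variation that falls back to an environment variable when the file is absent. `load_secret` is a hypothetical helper, and the environment variable name (the upper-cased key, for example `ATOM_PASSWORD`) is an assumption rather than something the original setup defines.

```python
import json
import os


def load_secret(name, config_path, env_var=None):
    """Return config[name] from a JSON file, else fall back to an environment variable."""
    try:
        with open(config_path) as json_file:
            config = json.load(json_file)
        if name in config:
            return config[name]
    except FileNotFoundError:
        pass  # no config file, so try the environment instead
    value = os.environ.get(env_var or name.upper())
    if value is None:
        raise KeyError(f"{name} not found in {config_path} or the environment")
    return value


# Usage in the notebook (assumes the file holds {"atom_password": "..."}):
# password = load_secret("atom_password", "../config/config.json")
```

Whichever source is used, keeping the config file out of version control avoids publishing the credential along with the notebook.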

In [19]:
payload = {
    'username':'katiemarriner',
    'password': config['atom_password'],
}

In [20]:
# post the payload to the site to log in
s = session.post("https://atom.finance/session/signin", data=payload)

In [21]:
s.text

'{"success":true,"userId":"70920d34-5ae8-4bd3-81d0-f0ce721a1095","email":"kemarriner@gmail.com","username":"katiemarriner","firstName":"Katie","lastName":"Marriner","name":"Katie Marriner","inviteCode":null}'
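The response body above is JSON with a `success` field, so the login can be checked in code rather than by eye. A minimal sketch, assuming a failed login would either return `"success": false` or a non-JSON page; `login_succeeded` is a hypothetical helper and the sample strings are shortened stand-ins.

```python
import json


def login_succeeded(response_text):
    """Return True only if the body is a JSON object with a truthy "success" field."""
    try:
        body = json.loads(response_text)
    except json.JSONDecodeError:
        return False  # for example, an HTML error or sign-in page
    return isinstance(body, dict) and bool(body.get("success"))


print(login_succeeded('{"success":true,"inviteCode":null}'))
print(login_succeeded('<html>Sign in</html>'))
```

In the notebook this could be called as `login_succeeded(s.text)` before any further requests are made with the session.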

In [22]:
payload = {
    "variables":{"symbol":"SPY"},
    "query": "query getETFProfile($symbol: String!) {\n  etfProfile(symbol: $symbol) {\n    id\n    issuer\n    description\n    }\n}\n"
}

In [23]:
# post a GraphQL query with the logged-in session and fetch the data
s = session.post('https://atom.finance/graphql', json=payload)

In [24]:
s.text

'{"data":{"etfProfile":{"id":"e40f8558-a387-4fdf-89ea-934d2f776de0","issuer":"SSgA","description":"SPDR S&P 500 ETF Trust"}}}\n'
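The GraphQL payload takes the symbol as a variable, so the same query can be reused for other tickers. The sketch below wraps the payload and the response handling in two hypothetical helpers, `etf_payload` and `parse_etf_profile`. Checking for an `errors` key follows the general GraphQL response convention; whether this particular server uses it has not been verified here.

```python
import json

# Same query as above, written as a multi-line string
ETF_PROFILE_QUERY = """query getETFProfile($symbol: String!) {
  etfProfile(symbol: $symbol) {
    id
    issuer
    description
  }
}
"""


def etf_payload(symbol):
    """Build the JSON payload for one ticker symbol."""
    return {"variables": {"symbol": symbol}, "query": ETF_PROFILE_QUERY}


def parse_etf_profile(response_text):
    """Return the etfProfile object, raising if the response reports errors."""
    body = json.loads(response_text)
    if body.get("errors"):
        raise ValueError(f"GraphQL errors: {body['errors']}")
    return body["data"]["etfProfile"]


# Response text copied from the output above
sample = '{"data":{"etfProfile":{"id":"e40f8558-a387-4fdf-89ea-934d2f776de0","issuer":"SSgA","description":"SPDR S&P 500 ETF Trust"}}}\n'
print(parse_etf_profile(sample)["description"])
```

In the notebook, the payload would be sent with `session.post('https://atom.finance/graphql', json=etf_payload('SPY'))` and the result passed to `parse_etf_profile` as text.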