# Tutorial
In this tutorial, we'll see how Amazon.com autocomplete search suggestions work.

We will collect Amazon's autocomplete suggestions by reverse-engineering how the page requests them, and then reuse that request to collect structured data at scale.

### 1. First open the developer console.

See how to open it in [Chrome](https://developer.chrome.com/docs/devtools/open/) or [Firefox](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools).

One way to get to the dev tools is to right-click and “Inspect” an element on the page.

### 2. Click the “Network” tab.

This section of the dev tools is used to monitor network requests.

*Background*

Everything on a page is retrieved from some outside source, likely a server. This includes things like images embedded on the page, JavaScript code running in the background, and all the bits of “content” that populate the page before us.

Using the `Network` tab, we can find out how this information is requested from a server, and intercept the response before it is rendered on the page.

### 3. Filter requests by fetch/XHR

This will reveal only API calls made to servers. This includes internal servers that are hosted by the website we’re inspecting, as well as external servers. The latter often includes [third-party trackers](https://themarkup.org/blacklight) used in adtech, and verification services to authenticate user behavior.

You might see quite a few network requests that were loaded onto the page. Look at "Domain" and "File" to narrow down where requests were sent, and whether the names are telling of the purpose of the request. In this example, notice that a request was sent to the "Domain" `completion.amazon.com`, using an API endpoint (in the "File" column) named `suggestions`. This is likely the API being called to populate autocompleted search suggestions on the Amazon marketplace. Reading "File" names can help determine each API's function.

Look at the request's "Response" attributes.
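
To make the "Domain" and "File" columns concrete, here is a minimal sketch that splits a request URL into those parts using Python's standard library. The query string is shortened for illustration (an assumption; the real request carries more parameters).

```python
from urllib.parse import urlsplit, parse_qs

# Shortened example URL; the real request carries more query parameters.
url = "https://completion.amazon.com/api/2017/suggestions?prefix=spicy&limit=11"

parts = urlsplit(url)
domain = parts.netloc                      # roughly what the Network tab lists as "Domain"
endpoint = parts.path.rsplit("/", 1)[-1]   # last path segment, the endpoint name
query = parse_qs(parts.query)              # query parameters as a dict of lists

print(domain)
print(endpoint)
print(query)
```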

### 4. Copy as cURL

If you find an HTTP request that returns a response with useful information, you can start to reverse-engineer it. To do that, isolate it by right-clicking the HTTP request and selecting “Copy as cURL”. ([cURL](https://developer.ibm.com/articles/what-is-curl-command/) stands for client URL, and is a tool used to transfer data across networks.)

<img src="https://github.com/yinleon/inspect-element/blob/main/assets/copy-curl.png?raw=1" width=90%>

### 5. Curl to requests
We can use a site like [curlconverter.com](https://curlconverter.com/) to convert the cURL we copied into a reusable API call. In this example, we use the default conversion to a Python `requests` script. You can do the same for any language and framework.
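
To give a rough idea of what such a conversion involves, here is a minimal sketch that pulls the URL and the `-H` headers out of a cURL command. The command below is an abbreviated, hypothetical stand-in for what "Copy as cURL" produces, and `parse_curl` is an illustrative helper that only understands `-H`/`--header`; curlconverter handles far more.

```python
import shlex

# Abbreviated, hypothetical stand-in for a copied cURL command.
curl_command = (
    "curl 'https://completion.amazon.com/api/2017/suggestions?prefix=spicy&alias=aps' "
    "-H 'Accept: application/json, text/javascript, */*; q=0.01' "
    "-H 'Referer: https://www.amazon.com/'"
)

def parse_curl(command):
    """Return (url, headers) from a cURL command that only uses -H flags."""
    tokens = shlex.split(command)[1:]  # drop the leading "curl"
    url, headers = None, {}
    i = 0
    while i < len(tokens):
        if tokens[i] in ("-H", "--header") and i + 1 < len(tokens):
            # Header arguments look like "Name: value".
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
            continue
        if url is None and not tokens[i].startswith("-"):
            url = tokens[i]  # first non-flag token is treated as the URL
        i += 1
    return url, headers

url, curl_headers = parse_curl(curl_command)
print(url)
print(curl_headers)
```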

Here is what the cURL command looks like after being converted to a Python request:

In [1]:
import requests

cookies = {
    'aws-ubid-main': '836-8365128-6734270',
    'session-id-time': '2082787201l',
    'ubid-main': '135-7086948-2591317',
    'aws-priv': 'eyJ2IjoxLCJldSI6MCwic3QiOjB9',
    'aws-target-static-id': '1593060129944-225088',
    'lc-main': 'en_US',
    'x-main': 'Oz3Tb5n2p0ic7OhF3cU5dc9B4ZR2gFjhKEsP4zikHHD3Gk2O7NpSmuShBxLFrhpZ',
    'at-main': 'Atza|IwEBILB5ARQ_IgTCiBLam_XE2pyT76jXTbAXHOm2AJomLPmDgoJUJIIlUmyFeh_gChLHCycKjNlys-5CqqMabKieAzqSf607ChJsNevw-V06e7VKgcWjvoMaZRWlGiZ-c5wSJ-e4QzIWzAxTS1EI6sRUaRZRv-a0ZpOJQ-sHHB99006ytcrHhubdrXYPJRqEP5Q-_30JtESMpAkASoOs4vETSFp5BDBJfSWWETeotpIVXwA4NoC8E59bZb_5wHTW9cRBSWYGi1XL7CRl2xGbJaO2Gv3unuhGMB1tiq9iwxodSPBBTw',
    'sess-at-main': '"PUq9PW1TbO9CTYhGMo7l1Dz+wedh40Ki8Z9rPC+1TSI="',
    'sst-main': 'Sst1|PQHsbeSFCMSY0X0_WgvTo5NUCaZkG2J9RPqWWy0fCpyWopJXgu6_drU_LstOdJB2cDmaVCXwkNpsF5yNPrBDj3Wtx-TC-AaYZn6WUdp8vNRPb6iYqxPAjRDnfK3pCnHqt19I0GoG7Bd1wnOxkAvnH0992IUq14kH6Ojm0J8noVPwMez0lltD-jxBwtDQ_EZYUkZG741RDVEojfziawJY9iKc-cLCnKmhi-ca1PPJnsimPV4lXRtMAGFbf9nMkKq4CbpkaRMdVtlPr20vF9eqg_V_-LY_V7S44WlO-_t_bFBnK8Q',
    'i18n-prefs': 'USD',
    'session-token': 'ptze73uznXExrMCSV9AklvNOKa1ND9F0rlQH2ioSM26Vr6hSheH8O4v4P8Lg3zuv7oDM+HZ+8f2TlyoPXUmPShprMXdvEpAQieXUw7+83PZOJvkkg1jwP0NiG0ZqksIYOr3Zuwt3omMcfCKRReWKxl5rGaDEM6AISpwI5aMDDCnA7fWbVO/QQYNxUZMifc599EZ5Fg3uGjCAhBlb6I7UO8ewRbXJ1bo9',
    'session-id': '139-9925917-2023535',
    'aws-userInfo-signed': 'eyJ0eXAiOiJKV1MiLCJrZXlSZWdpb24iOiJ1cy1lYXN0LTEiLCJhbGciOiJFUzM4NCIsImtpZCI6ImFhNDFkZjRjLTMxMzgtNGVkOC04YmU5LWYyMzUzYzNkOTEzYiJ9..LWFZOJMDcYdu6od6Nk8TmhAFMGA9O98O4tIOsVlR7w5vAS_JgVixL8j75u6jTgjfWkdddhKqa5kgsXDmGNbjhzLIsD48ch1BUodlzxqeQfn0r8onIwLbUIHEnk6X-AJE',
    'skin': 'noskin',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'Origin': 'https://www.amazon.com',
    'Connection': 'keep-alive',
    'Referer': 'https://www.amazon.com/',

    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
}

params = {
    'limit': '11',
    'prefix': 'spicy',
    'suggestion-type': [
        'WIDGET',
        'KEYWORD',
    ],
    'page-type': 'Gateway',
    'alias': 'aps',
    'site-variant': 'desktop',
    'version': '3',
    'event': 'onKeyPress',
    'wc': '',
    'lop': 'en_US',
    'last-prefix': '\0',
    'avg-ks-time': '2486',
    'fb': '1',
    'session-id': '139-9925917-2023535',
    'request-id': 'SVMTJXRDBQ9T8M7BRGNJ',
    'mid': 'ATVPDKIKX0DER',
    'plain-mid': '1',
    'client-info': 'amazon-search-ui',
}

response = requests.get('https://completion.amazon.com/api/2017/suggestions',
                        params=params, cookies=cookies, headers=headers)
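
The `params` dictionary ends up in the URL's query string. Here is a minimal sketch of that encoding using only the standard library and a subset of the parameters above. With `doseq=True`, `urlencode` repeats a key once per list item, which is also how `requests` sends list values such as `suggestion-type`.

```python
from urllib.parse import urlencode

base_url = "https://completion.amazon.com/api/2017/suggestions"

# A subset of the parameters from the converted request.
params_subset = {
    "prefix": "spicy",
    "suggestion-type": ["WIDGET", "KEYWORD"],
    "alias": "aps",
}

# doseq=True expands list values into repeated key=value pairs.
query_string = urlencode(params_subset, doseq=True)
full_url = f"{base_url}?{query_string}"
print(full_url)
```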

### 6. Run this Python code, as-is, and it should work.

In [None]:
# to see the response, run this cell:
response.json()

{'alias': 'aps',
 'prefix': 'spicy',
 'suffix': '',
 'suggestions': [{'suggType': 'KeywordSuggestion',
   'type': 'KEYWORD',
   'value': 'spicy chips',
   'refTag': 'nb_sb_ss_p13n-pd-dpltr-ranker_1_5',
   'candidateSources': 'local',
   'strategyId': 'p13n-pd-dpltr-ranker',
   'strategyApiType': 'RANK',
   'prior': 0.0,
   'ghost': False,
   'help': False},
  {'suggType': 'KeywordSuggestion',
   'type': 'KEYWORD',
   'value': 'spicy ramen',
   'refTag': 'nb_sb_ss_p13n-pd-dpltr-ranker_2_5',
   'candidateSources': 'local',
   'strategyId': 'p13n-pd-dpltr-ranker',
   'strategyApiType': 'RANK',
   'prior': 0.0,
   'ghost': False,
   'help': False},
  {'suggType': 'KeywordSuggestion',
   'type': 'KEYWORD',
   'value': 'spicy cubes',
   'refTag': 'nb_sb_ss_p13n-pd-dpltr-ranker_3_5',
   'candidateSources': 'local',
   'strategyId': 'p13n-pd-dpltr-ranker',
   'strategyApiType': 'RANK',
   'prior': 0.0,
   'ghost': False,
   'help': False},
  {'suggType': 'KeywordSuggestion',
   'type': 'KEYWOR
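
Most of this response is metadata; the suggested phrases themselves live under `suggestions` in each item's `value` field. Here is a minimal sketch that pulls them out, using a trimmed copy of the response above instead of a live call (`suggestion_values` is an illustrative helper name).

```python
# Trimmed copy of the response shown above (only two fields per suggestion kept).
sample_response = {
    "alias": "aps",
    "prefix": "spicy",
    "suggestions": [
        {"type": "KEYWORD", "value": "spicy chips"},
        {"type": "KEYWORD", "value": "spicy ramen"},
        {"type": "KEYWORD", "value": "spicy cubes"},
    ],
}

def suggestion_values(response_json):
    """Return the suggested phrases in the order the API returned them."""
    return [s["value"] for s in response_json.get("suggestions", [])]

values = suggestion_values(sample_response)
print(values)
```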

### 7. Port to a function
Try submitting a few requests (say, 10 or 20) with new parameters of your own.

For convenience, we write the API call as a function that takes any `keyword` as input.

In [2]:
import pandas as pd
import time

def search_suggestions(keyword):
    """
    Get autocompleted search suggestions for a `keyword` search on Amazon.com.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    params = {
        'prefix': keyword,
        'suggestion-type': [
            'WIDGET',
            'KEYWORD',
        ],
        'alias': 'aps',
        'plain-mid': '1',
    }

    response = requests.get('https://completion.amazon.com/api/2017/suggestions',
                            params=params, headers=headers)
    return response.json()

In [5]:
search_suggestions('maga')

{'alias': 'aps',
 'prefix': 'maga',
 'suffix': '',
 'suggestions': [{'suggType': 'KeywordSuggestion',
   'type': 'KEYWORD',
   'value': 'magazine holder',
   'refTag': 'nb_sb_ss_i_1_4',
   'candidateSources': 'local',
   'strategyId': 'organic',
   'prior': 0.0,
   'ghost': False,
   'help': False,
   'queryUnderstandingFeatures': [{'source': 'QU_TOOL', 'annotations': []}]},
  {'suggType': 'KeywordSuggestion',
   'type': 'KEYWORD',
   'value': 'maga hat',
   'refTag': 'nb_sb_ss_i_2_4',
   'candidateSources': 'local',
   'strategyId': 'organic',
   'prior': 0.0,
   'ghost': False,
   'help': False,
   'queryUnderstandingFeatures': [{'source': 'QU_TOOL', 'annotations': []}]},
  {'suggType': 'KeywordSuggestion',
   'type': 'KEYWORD',
   'value': 'magazine file holder',
   'refTag': 'nb_sb_ss_i_3_4',
   'candidateSources': 'local',
   'strategyId': 'organic',
   'prior': 0.0,
   'ghost': False,
   'help': False,
   'queryUnderstandingFeatures': [{'source': 'QU_TOOL', 'annotations': []}]},


In this step the code gets refactored to make it repeatable.

### 8. Iterate through different searches, save to DF

Here we can set new input parameters in `keywords`, and make an API call using each keyword. Try changing some of the code (e.g., the keywords) and rerunning it to check your understanding.

In [6]:
# Here are our inputs (what searches we'll get autocompleted)
keywords = [
    'a', 'b', 'cookie', 'sock', 'zelda', '12'
]

# Here we'll go through each input, get the suggestions, and then add the `suggestions` to a list.
data = []
for keyword in keywords:
    suggestions = search_suggestions(keyword)
    suggestions['search_word'] = keyword # keep track of the seed keyword
    time.sleep(1) # best practice to put some time between API calls.
    data.extend(suggestions['suggestions'])
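
One caveat about the loop above: `suggestions['search_word'] = keyword` sets the key on the response dictionary, but only the inner `suggestions` list is added to `data`, so the seed keyword never reaches the individual rows. Here is a minimal sketch of one way to keep it, using a hypothetical helper `tag_suggestions` and a trimmed sample response rather than a live API call.

```python
def tag_suggestions(response_json, keyword):
    """Copy each suggestion and record which seed keyword produced it."""
    return [
        {**suggestion, "search_word": keyword}
        for suggestion in response_json.get("suggestions", [])
    ]

# Trimmed, illustrative response for the keyword 'sock'.
sample_response = {
    "prefix": "sock",
    "suggestions": [{"value": "socks"}, {"value": "nike socks"}],
}

rows = []
rows.extend(tag_suggestions(sample_response, "sock"))
print(rows)
```

In the loop, `data.extend(tag_suggestions(suggestions, keyword))` could stand in for the two lines that set `search_word` and extend `data`.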

In [15]:
len(data)

60

In [9]:
for i in data:
  print(i['value'])

amazon gift card
apple watch
airpods
air fryer
apple watch bands for women
aa batteries
airpods pro 2
air purifier
car accessories
airtag
women blouses
bikini sets for women
baby registry search
body wash
bathing suit for women
birkenstock sandals women
bluetooth speaker
backpack
beach vacation clothes for women
blackout curtains
cookies
cookie bags
chocolate chip cookies
cookie sheets for baking
biscoff cookies
oreo cookies
cookie dough
cookie cutters
cookie jar
lactation cookies
compression socks for women
socks
nike socks
socks for men
socks for women
compression socks men
compression socks
socket set
socket organizer
socks for men 9-12
zelda
legend of zelda
zelda switch
zelda amiibo
zelda tears of the kingdom switch game
zelda breath of the wild switch
zelda lego
zelda merch
legend of zelda merch
legend of zelda tears of the kingdom
ipad pro 12.9 case
12v battery
12x16 frame
calcium 1200 mg with vitamin d3
12x18 frame
12x20 pillow insert
128gb micro sd card
12 inch subwoofers
12v p

We saved the API responses in a list called `data`, and put them into a [Pandas](https://pandas.pydata.org/) DataFrame to analyze.

In [16]:
df = pd.DataFrame(data)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   suggType                    60 non-null     object 
 1   type                        60 non-null     object 
 2   value                       60 non-null     object 
 3   refTag                      60 non-null     object 
 4   candidateSources            60 non-null     object 
 5   strategyId                  60 non-null     object 
 6   prior                       60 non-null     float64
 7   ghost                       60 non-null     bool   
 8   help                        60 non-null     bool   
 9   queryUnderstandingFeatures  60 non-null     object 
dtypes: bool(2), float64(1), object(7)
memory usage: 4.0+ KB




Unfortunately, because this API is undocumented, figuring out what all of the metadata here represents is difficult.

But we can take a peek at the data anyway:

In [14]:
# show 5 random auto suggestions
df.sample(5)

| | suggType | type | value | refTag | candidateSources | strategyId | prior | ghost | help | queryUnderstandingFeatures |
|---|---|---|---|---|---|---|---|---|---|---|
| 46 | KeywordSuggestion | KEYWORD | zelda lego | `nb_sb_ss_i_7_5` | local | organic | 0.0 | False | False | `[{'source': 'QU_TOOL', 'annotations': []}]` |
| 19 | KeywordSuggestion | KEYWORD | blackout curtains | `nb_sb_ss_i_10_1` | local | organic | 0.0 | False | False | `[{'source': 'QU_TOOL', 'annotations': []}]` |
| 5 | KeywordSuggestion | KEYWORD | aa batteries | `nb_sb_ss_i_6_1` | local | organic | 0.0 | False | False | `[{'source': 'QU_TOOL', 'annotations': []}]` |
| 10 | KeywordSuggestion | KEYWORD | women blouses | `nb_sb_ss_i_1_1` | local | organic | 0.0 | False | False | `[{'source': 'QU_TOOL', 'annotations': []}]` |
| 37 | KeywordSuggestion | KEYWORD | socket set | `nb_sb_ss_i_8_4` | local | organic | 0.0 | False | False | `[{'source': 'QU_TOOL', 'annotations': []}]` |
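
Some fields can still be probed empirically. In the rows shown in this tutorial, the last two numbers of `refTag` appear to match the suggestion's position in its list and the length of the typed prefix (for example, `zelda lego` is the 7th suggestion for the 5-letter prefix `zelda` and carries `nb_sb_ss_i_7_5`). This is an observation from the sample, not documented behavior. Here is a minimal sketch with a hypothetical helper, `parse_ref_tag`.

```python
def parse_ref_tag(ref_tag):
    """Split a refTag into (apparent rank, apparent prefix length).

    Assumes the tag ends in "_<number>_<number>", as in the sample rows.
    """
    _, rank, prefix_length = ref_tag.rsplit("_", 2)
    return int(rank), int(prefix_length)

# Tags copied from the outputs shown in this tutorial.
sample_tags = [
    "nb_sb_ss_i_7_5",
    "nb_sb_ss_i_10_1",
    "nb_sb_ss_p13n-pd-dpltr-ranker_1_5",
]
parsed = [parse_ref_tag(tag) for tag in sample_tags]
print(parsed)
```

With the DataFrame above, `df['refTag'].map(parse_ref_tag)` would apply the same parsing to every row.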


# Related readings
More tutorials on the same subject:

- ["Scraping XHR"](https://scrapism.lav.io/scraping-xhr/) - Sam Lavigne
- ["Web Scraping 201: finding the API"](https://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/) - Greg Reda
- ["How to use undocumented web APIs"](https://jvns.ca/blog/2022/03/10/how-to-use-undocumented-web-apis/) - Julia Evans

Topical and timeless:

- ["Computational research in the post-API age"](https://dfreelon.org/publications/2018_Computational_research_in_the_postAPI_age.pdf) - Deen Freelon

Notable investigations and audits using undocumented APIs:

- ["Ring’s Hidden Data Let Us Map Amazon's Sprawling Home Surveillance Network"](https://gizmodo.com/ring-s-hidden-data-let-us-map-amazons-sprawling-home-su-1840312279) - Dell Cameron and Dhruv Mehrotra
- ["Porch piracy: are we overreacting to package thefts from doorsteps?"](https://www.theguardian.com/us-news/2022/aug/25/porch-piracy-package-thefts-doorstep-delivery) - Lam Thuy Vo
- ["The Cop in Your Neighbor's Doorbell"](https://site.dcalacci.net/papers/ring-cscw-2021.pdf) - Dan Calacci et al.
- ["Analyzing gender inequality through large-scale Facebook advertising data"](https://www.pnas.org/doi/full/10.1073/pnas.1717781115) - David Garcia et al.
- ["Freeing the Plum Book"](https://source.opennews.org/articles/freeing-plum-book/) - Derek Willis

Please reach out with more examples to add.

## Artifacts
Slides from workshops can be found here:

[2023-02-24 @ Tow Center Columbia](https://docs.google.com/presentation/d/1e1QoSNXv2m90lhhyUMSzUlMxXJD_Ar5DtC43PcpTcWU)<br>
[2023-03-04 @ NICAR](https://docs.google.com/presentation/d/1hWMqcBNfs9BbaVywMGJPf_BcR9PpVLlHPCGBsAzz-No/)<br>
[2023-06-12 @ FAccT](https://docs.google.com/presentation/d/1-9rODLyxJawasNIn_rp9E_oHUOB1SUGCpAb8TbmyxKU/edit?usp=sharing)<br>
[2023-06-14 @ Journocoders](https://paper.dropbox.com/doc/Journocoders-June-2023-1t7nrjYNWoiPhK0sx8rH0)<br>
[2023-06-22 @ C+J DATAJ](https://docs.google.com/presentation/d/10_mWNwr_fsrX0r8e6xWhFruhdUZUbLEA9HqLI3HtEFg)