# <center>Web Scraping by API </center>

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
import json
import pandas as pd

Packages used in this notebook:

- snscrape: for scraping tweets
- tika: for parsing PDF files

## 1. Scrape data through APIs 
- Online content providers usually provide APIs for accessing their data. There are two common types of APIs:
   * Python packages: e.g. the tweepy package for the Twitter API
   * REST APIs: e.g. the OMDB API (http://www.omdbapi.com) or the TMDB API (https://developers.themoviedb.org/3/getting-started)
- You need to read an API's documentation to figure out how to access its data

## 2. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses `HTTP` requests to `GET`, `PUT`, `POST` and `DELETE` data
- Example:
    - https://groceries.asda.com/api/items/search<font color="blue"><b>?</b></font><font color='green'><b>keyword</b></font>=<font color='red'><b>yogurt</b></font><font color='purple'><b>&</b></font><font color='green'><b>r</b></font>=<font color='red'><b>json</b></font>, where
        - `?`: separates the API endpoint `https://groceries.asda.com/api/items/search` from the parameters
        - `keyword=yogurt`: sets the parameter `keyword` to the search term `yogurt`
        - `&`: separates one parameter from the next
        - `r=json`: requests the result in JSON format
    - You can paste the above URL directly into your browser
    - Or issue the API call with the `requests` package
- You need to read API documentation to understand how to specify parameters
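
The URL anatomy above can be reproduced offline with Python's standard `urllib.parse` module. The sketch below is only an illustration and does not call the API: it assembles the query string from a dictionary, then splits the finished URL back into its endpoint path and parameters. The keyword `greek yogurt` is a made-up search term, used to show why encoding matters when a value contains a space.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "https://groceries.asda.com/api/items/search"

# urlencode joins name=value pairs with "&"
query = urlencode({"keyword": "yogurt", "r": "json"})
url = endpoint + "?" + query
print(url)

# splitting the URL recovers the endpoint path and the parameters
parts = urlsplit(url)
params = parse_qs(parts.query)
print(parts.path)
print(params)

# hypothetical keyword containing a space: urlencode escapes it as "+"
encoded = urlencode({"keyword": "greek yogurt", "r": "json"})
print(encoded)
```

Plain string concatenation would leave the space unescaped, which is one reason to let a library build the query string.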

In [6]:
import requests
import json

keyword = 'yogurt'

# build the request URL: endpoint + "?" + parameters joined by "&"
url = "https://groceries.asda.com/api/items/search?keyword=" + keyword + "&r=json"

print(url)

# invoke the API
r = requests.get(url)

# if the API call returns a successful response
if r.status_code == 200:

    # this API call returns a JSON object;
    # r.json() parses it into a Python dictionary
    result = r.json()
    print(json.dumps(result, indent=4))



https://groceries.asda.com/api/items/search?keyword=yogurt&r=json
{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "618",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "11",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_63f8853977b0be530a1a46c9^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "910000976085",
            "shelfName": "Kids Yogurts & Fromage Frais",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "6203603",
    

In [3]:
#result

In [5]:
# Exercise 2.2.  Another way to pass parameters

parameters = {'keyword': 'yogurt',
              'r': 'json'}

# requests encodes the dictionary into the query string for you
r = requests.get('https://groceries.asda.com/api/items/search',
                 params=parameters)

# in case authentication is needed, use
# r = requests.get('https://api.github.com/user',
#                  auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code == 200:

    # this API call returns a JSON object;
    # r.json() parses it into a Python dictionary
    print(json.dumps(r.json(), indent=4))



{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "624",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "11",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_63f8853977b0be530a1a46c9^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corner Yogurts",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a34.50::true",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "7368400",
            "promoDetailFull": "10 for \u00a34.50",
      

## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "self-describing" and easy to understand
- the JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in **name/value** pairs separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### A JSON document, once loaded in Python, is typically:
- **a dictionary** or
- a **list of dictionaries**
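
To make the syntax rules and the dictionary/list mapping concrete, here is a small sketch that parses a hand-written JSON string. The text is not a real API response: the first item borrows two fields from the output shown earlier, and the second item is invented for illustration.

```python
import json

# a hand-written JSON text: curly braces hold an object,
# square brackets hold an array, commas separate name/value pairs
text = '''
{
    "keyword": "yogurt",
    "qusApplied": false,
    "errors": [],
    "items": [
        {"itemName": "Corner Vanilla Yogurt with Chocolate Balls", "brandName": "Muller"},
        {"itemName": "Example item", "brandName": "Example brand"}
    ]
}
'''

data = json.loads(text)

print(type(data).__name__)            # JSON object -> Python dict
print(type(data["items"]).__name__)   # JSON array  -> Python list
print(data["qusApplied"])             # JSON false  -> Python False
print(data["items"][0]["brandName"])
```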

### Useful JSON functions
- `dumps`: serialize a Python object to a JSON string
- `dump`: serialize a Python object to a JSON file
- `loads`: parse a JSON string into a Python object
- `load`: parse a JSON file into a Python object
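
The four functions pair up as string and file round trips. The following sketch, which assumes a small sample record and a temporary file name (`record.json`) chosen only for illustration, runs each pair and checks that the original data comes back unchanged.

```python
import json
import os
import tempfile

record = {"itemName": "Corner Vanilla Yogurt with Chocolate Balls", "price": "£0.90"}

# dumps / loads: Python object <-> JSON string
s = json.dumps(record)
print(s)  # non-ASCII characters such as the pound sign are escaped by default
from_string = json.loads(s)

# dump / load: Python object <-> JSON file
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "w") as f:
    json.dump(record, f)
with open(path, "r") as f:
    from_file = json.load(f)

print(from_string == record, from_file == record)
```

The default escaping is why the pound sign appears as `\u00a3` in printed JSON while the loaded Python objects show `£`.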

In [6]:
# Exercise 3.1 API returns a JSON object 

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# if the API call returns a successful response
if r.status_code==200:
    result = r.json()
    #print(result)
    df = pd.DataFrame(result["items"])
    df.head()
    

Unnamed: 0,shelfId,shelfName,deptId,deptName,isBundle,meatStickerDetails,extraLargeImageURL,bundledItemCount,scene7Host,cin,...,avgWeight,iconDetails,maxQty,pricePerWt,productURL,pricePerUOM,searchTuningScore,onSale,salePrice,positionChngByMargin
0,1215286383583,Corner Yogurts,1215341888021,Yogurts & Desserts,False,10::for::£4.50::true,,0,https://ui.assets-asda.com:443/dm/,7368400,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,8999574.0,False,,0
1,910000976085,Kids Yogurts & Fromage Frais,1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,6203603,...,,"{'promotionalIcons': ['59600051'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,8261075.5,False,,0
2,1215286439451,Natural & Greek Yogurts,1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,1944643,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,6422927.5,False,,0
3,910000977060,"Diet, Low Fat & No Added Sugar Yogurts",1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,2872628,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,2446765.5,False,,0
4,1215286383583,Corner Yogurts,1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,7368412,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,2266376.2,False,,0


In [7]:
# Exercise 3.2. Parse JSON object (a dictionary)

# convert the first 2 items to a string
#result["items"][0:2]

s = json.dumps(result["items"][0:2], indent=4)
print(s)

# load from a string
items = json.loads(s)
items

# save to file (the with-block closes the file so the data is flushed)
with open("items.json", "w") as f:
    json.dump(result["items"], f)

# load from file
with open("items.json", "r") as f:
    items = json.load(f)
print("test loaded data\n")
len(items)
#items[0]

[
    {
        "shelfId": "1215286383583",
        "shelfName": "Corner Yogurts",
        "deptId": "1215341888021",
        "deptName": "Yogurts & Desserts",
        "isBundle": "false",
        "meatStickerDetails": "10::for::\u00a34.50::true",
        "extraLargeImageURL": "",
        "bundledItemCount": "0",
        "scene7Host": "https://ui.assets-asda.com:443/dm/",
        "cin": "7368400",
        "promoDetailFull": "10 for \u00a34.50",
        "availability": "A",
        "totalReviewCount": "38",
        "asdaSuggest": "",
        "itemName": "Corner Vanilla Yogurt with Chocolate Balls",
        "price": "\u00a30.90",
        "imageURL": "",
        "aisleName": "Yogurts & Fromage Frais",
        "id": "1000377031656",
        "promoId": "ls93054",
        "isFavourite": "false",
        "hasAlternates": "false",
        "wasPrice": "",
        "brandName": "Muller",
        "promoType": "No Promo",
        "weight": "124g      ",
        "promoOfferTypeCode": "15",
        "

[{'shelfId': '1215286383583',
  'shelfName': 'Corner Yogurts',
  'deptId': '1215341888021',
  'deptName': 'Yogurts & Desserts',
  'isBundle': 'false',
  'meatStickerDetails': '10::for::£4.50::true',
  'extraLargeImageURL': '',
  'bundledItemCount': '0',
  'scene7Host': 'https://ui.assets-asda.com:443/dm/',
  'cin': '7368400',
  'promoDetailFull': '10 for £4.50',
  'availability': 'A',
  'totalReviewCount': '38',
  'asdaSuggest': '',
  'itemName': 'Corner Vanilla Yogurt with Chocolate Balls',
  'price': '£0.90',
  'imageURL': '',
  'aisleName': 'Yogurts & Fromage Frais',
  'id': '1000377031656',
  'promoId': 'ls93054',
  'isFavourite': 'false',
  'hasAlternates': 'false',
  'wasPrice': '',
  'brandName': 'Muller',
  'promoType': 'No Promo',
  'weight': '124g      ',
  'promoOfferTypeCode': '15',
  'promoQty': '10',
  'promoValue': '£4.50',
  'productAttribute': '',
  'scene7AssetId': '4025500277031',
  'promoDetail': '10 for £4.50',
  'bundleDiscount': '0.00',
  'avgStarRating': '4.7105

test loaded data



60
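
Note that the API returns most values as strings, including counts such as `totalReviewCount` and prices such as `£0.90`. A minimal sketch of converting a few of these fields before analysis follows. The helper name `clean_item` is hypothetical, and the sample dictionary copies a handful of fields from the output above rather than calling the API.

```python
def clean_item(item):
    """Return selected fields with numeric types (hypothetical helper)."""
    return {
        "itemName": item["itemName"],
        # drop the currency symbol before converting to float
        "price": float(item["price"].lstrip("£")),
        # strip the trailing padding spaces seen in the raw value
        "weight": item["weight"].strip(),
        "totalReviewCount": int(item["totalReviewCount"]),
    }


sample = {
    "itemName": "Corner Vanilla Yogurt with Chocolate Balls",
    "price": "£0.90",
    "weight": "124g      ",
    "totalReviewCount": "38",
}

cleaned = clean_item(sample)
print(cleaned)
```

Fields that can be empty in the real response (for example `wasPrice`) would need an extra check before conversion.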

## 4. Parse PDF Files
- Many Python packages are available for parsing PDF files:
  - PDFMiner: a tool for extracting information from PDF documents. It can report the exact location of text on a page, as well as other information such as fonts or lines.
  - PyPDF2: a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
  - Tabula-py: reads tables from a PDF and converts them into pandas DataFrames.
  - py_pdf_parser: reads PDF files and extracts tables and text from figures. For details, see https://py-pdf-parser.readthedocs.io/en/latest/examples/index.html

In [2]:
import py_pdf_parser
from py_pdf_parser.loaders import load_file

# change to any pdf document you have

document = load_file('Assignment_Python.pdf')

In [9]:
# show the text content of each element
for e in document.elements:
    print(e.text())

Assignment_Python
http://localhost:8888/nbconvert/html/course/BIA-660/2019Fall/As...
Assignment 1: Python Basics
Q1. Document Term Matrix
1. Deﬁne a function called compute_dtm as follows:
as a parameter:
docs
docs
(i.e. document-term matrix), which stores a 2-dimensional array created from the
Take a list of documents, say 
Tokenize each document into lower-cased words without any leading and trailing punctuations (Hint: you
can refer to the solution to the Review Exercise at then end of Python_II lecture notes)
Let 
 denote the list of unique words in 
words
Compute 
dtm
documents as follows:
i
j
is the count of word   in document
(i, j)
j
words
i
return
and
Each row (say   ) represents a document
Each column (say  ) represents a unique word in 
Each cell 
dtm
words
Q2. Performance Analysis
1. Suppose your machine learning model returns a one-dimensional array of probabilities as the output. Write a
function "performance_analysis" to do the following:
th
, the prediction is positive;

## 5. Other powerful scraping tools

- Playwright: https://github.com/oxylabs/playwright-web-scraping
- Scrapy