# <center>Web Scraping by API </center>

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
import json
import pandas as pd

Packages used in this notebook:

- snscrape: for scrape tweets
- tika: parse PDF files

## 1. Scrape data through APIs 
- Online content providers usually provide APIs for you to access data. Two types of APIs:
   * Python packages: e.g. tweepy package from Twitter
   * REST APIs: e.g. OMDB APIs (http://www.omdbapi.com), or TMDB (https://developers.themoviedb.org/3/getting-started)
- You need to read documentation of APIs to figure out how to access data

## 2. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses `HTTP` requests to `GET`, `PUT`, `POST` and `DELETE` data
- Example:
    - https://groceries.asda.com/api/items/search<font color="blue"><b>?</b></font><font color='green'><b>keyword</b></font>=<font color='red'><b>yogurt<b></font><front color='purple'><b>&</b></font><font color='green'><b>r</b></font>=<font color='red'><b>json<b></font>, where
        - `?`: separate API endpoint  `https://groceries.asda.com/api/items/search` from parameters
        - `keyword=yogurt`: search `yogurt` on parameter `keyword`
        - `&`: combine multiple search criteria
        - `r=json`: result is in json format 
    - You can directly paste the above API to your browser
    - Or issue API calls using requests
- You need to read API documentation to understand how to specify parameters

In [2]:
import requests
import json

keyword = 'yogurt'


url="https://groceries.asda.com/api/items/search?keyword=" + keyword + "&r=json"

print(url)

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    result = r.json()
    print (json.dumps(result, indent=4))



https://groceries.asda.com/api/items/search?keyword=yogurt&r=json
{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "421",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "8",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_5f8046bf0931946b86fb4387^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a33.5::true",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "6362225",
 

In [3]:
# Exercise 2.2.  Another way to pass parameters

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# in case authentication is needed, use
# r = requests.get('https://api.github.com/user', \
# auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    print (json.dumps(r.json(), indent=4))



{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "417",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "7",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_5f8046bf0931946b86fb4387^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a33.5::true",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "6362225",
            "promoDetailFull": "10 for \u00a33.5",
            "ava

## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "self-describing" and easy to understand
- the JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in **name/value** pairs separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### A JSON object is:
- **a dictionary** or 
- a **list of dictionaries**

### Useful JSON functions
- dumps: save json object to string
- dump: save json object to file
- loads: load from a string in json format
- load: load from a file in json format

In [4]:
# Exercise 3.1 API returns a JSON object 

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# if the API call returns a successful response
if r.status_code==200:
    result = r.json()
    #print(result)
    df = pd.DataFrame(result["items"])
    df.head()
    

Unnamed: 0,shelfId,shelfName,deptId,deptName,isBundle,meatStickerDetails,extraLargeImageURL,bundledItemCount,scene7Host,cin,...,avgWeight,iconDetails,maxQty,pricePerWt,productURL,pricePerUOM,searchTuningScore,onSale,salePrice,positionChngByMargin
0,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£3.5::true,,0,https://ui.assets-asda.com:443/dm/,6362225,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,19777406.0,False,,0
1,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£3.5::true,,0,https://ui.assets-asda.com:443/dm/,6362239,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,15463644.0,False,,0
2,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£3.5::true,,0,https://ui.assets-asda.com:443/dm/,6362227,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,12781616.0,False,,0
3,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£3.5::true,,0,https://ui.assets-asda.com:443/dm/,6362229,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,8983064.0,False,,0
4,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£3.5::true,,0,https://ui.assets-asda.com:443/dm/,6362233,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,8785418.0,False,,0


In [None]:
# Exercise 3.2. Parse JSON object (a dictionary)

# convert the first 3 items to string
#result["items"][0:2]

s = json.dumps(result["items"][0:2], indent=4)
print(s)

# load from a string
items = json.loads(s)
items

# save to file
json.dump(result["items"], open("items.json","w"))

# load from file
items = json.load(open("items.json","r"))
print("test loaded data\n")
len(items)
items[0]

## 4. Parse PDF Files
- Many python packages are available to parse pdf files
  - PDFMiner: A tool for extracting information from PDF documents. It can show exact location of text in a page, as well as other information such as fonts or lines. 
  - PyPDF2: A pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. 
  - Tabula-py:  It can read the table of PDF. You can read tables from PDF and convert into pandas’ DataFrame. 
  - Tika: A Python port of the Apache Tika library (https://github.com/chrismattmann/tika-python)
- For detailed analysis, see https://towardsdatascience.com/python-for-pdf-ef0fac2808b0

In [5]:
! pip install tika

Collecting tika
  Downloading tika-1.24.tar.gz (28 kB)
Building wheels for collected packages: tika
  Building wheel for tika (setup.py): started
  Building wheel for tika (setup.py): finished with status 'done'
  Created wheel for tika: filename=tika-1.24-py3-none-any.whl size=32888 sha256=c124212df7514734b6af8f3fb147e58bd79d4f9fea1b2db64b8f548a00bee3fc
  Stored in directory: c:\users\matthewxzt\appdata\local\pip\cache\wheels\75\66\8b\d1acbac7d49f3d98ade76c51ae5d72cec1866131a3b1ad9f82
Successfully built tika
Installing collected packages: tika
Successfully installed tika-1.24


In [6]:
# 4.1. Parse PDF file using Tika

from tika import parser

In [8]:
# Parse a local pdf file

# replace 'Assignment_Python.pdf' by any pdf file you can find

parsed = parser.from_file('Lecture 5 Regression_annotated.pdf')

# Print meta data of the pdf file
print(parsed["metadata"])

2021-09-28 14:21:53,172 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to C:\Users\MATTHE~1\AppData\Local\Temp\tika-server.jar.
2021-09-28 14:22:22,545 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to C:\Users\MATTHE~1\AppData\Local\Temp\tika-server.jar.md5.
2021-09-28 14:22:23,162 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


{'Content-Type': 'application/pdf', 'Creation-Date': '2019-10-02T02:05:44Z', 'Keywords': '', 'Last-Modified': '2021-03-31T13:58:02Z', 'Last-Save-Date': '2021-03-31T13:58:02Z', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '1653', 'access_permission:assemble_document': 'true', 'access_permission:can_modify': 'true', 'access_permission:can_print': 'true', 'access_permission:can_print_degraded': 'true', 'access_permission:extract_content': 'true', 'access_permission:extract_for_accessibility': 'true', 'access_permission:fill_in_form': 'true', 'access_permission:modify_annotations': 'true', 'created': '2019-10-02T02:05:44Z', 'date': '2021-03-31T13:58:02Z', 'dc:format': 'application/pdf; version=1.3', 'dc:subject': '', 'dc:title': 'Lecture 5', 'dcterms:created': '2019-10-02T02:05:44Z', 'dcterms:modified': '2021-03-31T13:58:02Z', 'meta

In [9]:
#Print the text of the pdf file
print(parsed["content"])













































Lecture 5


A Cost Model

Total cost in $Million

Output in Million KWH

N = 123 American electric utilities

Model:  Cost  =  α +  βKWH  +  ε

FM

FM



Cost Relationship

Output

Co
st

80000700006000050000400003000020000100000

500

400

300

200

100

0

Scatterplot of Cost vs Output



Sample Regression

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM

FM



Interpreting the Model
• Cost = 2.44 + 0.00529 Output + e
• Cost is $Million, Output is Million KWH.
• Fixed Cost = Cost when output = 0 

Fixed Cost = $2.44Million
• Marginal cost 

= Change in cost/change in output
= .00529 * $Million/Million KWH
= .00529 $/KWH = 0.529 cents/KWH.

FM

FM

FM

FM

FM



Using the Residuals

• How do you know the model is “good?”
• Various diagnostics to be developed over the 

semester.
• But, the first place to look is at the residuals.

FM



Residuals C

In [None]:
# Parse the file to XHTML
# Sometimes, it's better to see the document structure through XHTML

parsed = parser.from_file('Assignment_Python.pdf', xmlContent=True)
print(parsed["content"])

## 5. Get Tweets

Reference: 
- https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af
- https://github.com/scalto/snscrape-by-location/blob/main/snscrape_by_location_tutorial.ipynb
- https://medium.com/swlh/how-to-scrape-tweets-by-location-in-python-using-snscrape-8c870fa6ec25

Note: 

- User object is not exposed by TwitterSearchScraper any more
- snscrape does not work with Python 3.9. See https://github.com/JustAnotherArchivist/snscrape/issues/111 to fix the bug. If your Python is earlier than 3.9, it should be OK



In [18]:
！pip install snscrape
import pandas as pd

import snscrape.modules.twitter as sntwitter
import itertools


SyntaxError: invalid character in identifier (<ipython-input-18-2bb5bc40840a>, line 1)

In [19]:
#  search by keywords + time

    
df = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
    '"blockchain + since:2020-10-31 until:2020-11-03"').get_items(), 100))

print(len(df))
df.head()

NameError: name 'sntwitter' is not defined

In [20]:
# search by user

df = pd.DataFrame(itertools.islice(sntwitter.TwitterUserScraper(
    '"zawphyowai199"').get_items(), 100))

print(len(df))
df.head()

NameError: name 'sntwitter' is not defined