 # STEP 1: SCRAPE RESPONSES

<h4>This notebook will collect responses containing article metadata from RESTful web-facing search APIs.</h4>
 
It supports the WordPress REST API (reference: https://developer.wordpress.org/rest-api/) and the Google Custom Search Engine API (reference: https://developers.google.com/custom-search/v1/) out of the box, though other APIs can be added. For more information and further instructions, consult the Chomp documentation at https://github.com/kwgws/we1s_chomp.
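
To give a sense of what these two APIs expect, here is a minimal sketch of the search URLs involved. It is not Chomp's own request code, which may build its requests differently. The helper names `wordpress_search_url` and `google_cse_url`, the `example.org` site, and the placeholder credentials are all made up for illustration.

```python
from urllib.parse import urlencode


def wordpress_search_url(base_url, endpoint, term, page=1, per_page=100):
    """Build a WordPress REST API search URL for one endpoint (e.g. "posts")."""
    params = urlencode({"search": term, "page": page, "per_page": per_page})
    return f"{base_url.rstrip('/')}/wp-json/wp/v2/{endpoint}?{params}"


def google_cse_url(term, cx, key, start=1):
    """Build a Google Custom Search JSON API URL; `start` is the 1-based result index."""
    params = urlencode({"key": key, "cx": cx, "q": term, "start": start})
    return f"https://www.googleapis.com/customsearch/v1?{params}"


# Hypothetical site and placeholder credentials, for illustration only.
wp_url = wordpress_search_url("https://example.org", "posts", "humanities")
cse_url = google_cse_url("humanities", cx="YOUR_CX", key="YOUR_KEY")
print(wp_url)
print(cse_url)
```

Both APIs return results one page at a time, which is why each collected response below corresponds to a single page of results.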

 ## INFO

__authors__    = 'Catherine Gilleran'  
__copyright__  = 'copyright 2019, The WE1S Project'  
__license__    = 'MIT'  
__version__    = '0.1.0'  


 ## SETTINGS

In [None]:
import json
from os import getenv
from pathlib import Path

from we1s_chomp import clean, db, google, wordpress
from we1s_chomp.model import Response
from we1s_chomp.web import Browser


project_dir = Path.home() / "write" / "dev" / "we1s_chomp"
url_stopwords_file = project_dir / "notebooks" / "url_stopwords.txt"

grid_url = getenv("CHOMP_SELENIUM_GRID_URL")
google_cx = getenv("CHOMP_GOOGLE_CX")
google_key = getenv("CHOMP_GOOGLE_KEY")

wp_endpoints = ["pages", "posts"]
url_stops = set()

# Get stopwords.
url_stopwords = set()
with open(url_stopwords_file, encoding="utf-8") as txtfile:
    for line in txtfile:
        stopword = line.strip()
        # Skip blank lines.
        if stopword:
            url_stopwords.add(stopword)
            print(f'Added URL stopword: "{stopword}".')
print("\n")

 ## DATA DIRECTORIES

 Chomp will import and export JSON files to these directories, using them not
 only to set up and store the results of a collection run but also to keep
 track of metadata, duplicate URLs, page numbers, etc. as it goes.

 <p style="color:red;">Because collection is a time- and resource-intensive
 task, it is preferable to keep these metadata files in a single location and
 allow Chomp to manage them internally rather than modifying or deleting them
 between each job.</p>

In [None]:
query_dir = project_dir / "data" / "json" / "queries"
source_dir = project_dir / "data" / "json" / "sources"
response_dir = project_dir / "data" / "json" / "responses"

# Make response directory if it does not already exist.
response_dir.mkdir(parents=True, exist_ok=True)

print(f"Loading sources from {source_dir}.")
print(f"Loading queries from {query_dir}.")
print(f"Saving responses to {response_dir}.\n\n")

 ## BROWSE: Search queries for keywords

Set `search_text` to filter the available query files. To search for a specific word or phrase, enter it WITHIN single quotes below. Note that the search runs against the filenames of the JSON files stored in the `query_dir` path (usually `data/json/queries`), not their contents. To list all of the available queries, set `search_text` to `None` WITHOUT quotes (so the line should read `search_text = None`).
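
The filter described above is a plain substring match on each filename. The sketch below shows it on throwaway files in a temporary directory. The helper name `matching_query_files` and the sample filenames are made up for illustration and are not part of Chomp.

```python
import tempfile
from pathlib import Path


def matching_query_files(query_dir, search_text=None):
    """Return the JSON files in query_dir whose filename contains search_text."""
    return sorted(
        path
        for path in Path(query_dir).glob("*.json")
        if search_text is None or search_text in path.name
    )


with tempfile.TemporaryDirectory() as tmp:
    # Create three empty, hypothetical query files.
    for stem in ("we1s_humanities", "libcom-org_humanities", "we1s_liberal-arts"):
        (Path(tmp) / f"{stem}.json").write_text("{}", encoding="utf-8")
    all_names = [p.name for p in matching_query_files(tmp)]
    we1s_names = [p.name for p in matching_query_files(tmp, "we1s")]

print(all_names)   # every file, because search_text is None
print(we1s_names)  # only the files whose name contains "we1s"
```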

In [None]:
search_text = None

Run the cell and review the results. The default is to search through the `data/json/queries` directory. If your data is in a different location on harbor, change the `query_dir` variable above to the directory you want to search.

In [None]:
print("query_list = [")
for filename in query_dir.glob("*.json"):
    # Compare against the filename string; `in` does not work on a Path object.
    if search_text is not None and search_text not in filename.name:
        continue
    with open(filename, encoding="utf-8") as jsonfile:
        name = json.load(jsonfile).get("name", "")
    print('    "' + name + '",')
print("]\n\n")

## LIST: Define which queries will be scraped

Copy the entire cell output above and use it to replace the `query_list` array in the following cell. Each name should be surrounded by quotes and followed by a comma (the comma after the last name is optional).

Don't forget to run the cell!

In [None]:
query_list = [
    "we1s_humanities_01-01-2014_12-31-2019",
    "libcom-org_humanities_01-01-2000_12-31-2019",
]

 ## IMPORT & TEST QUERIES

 Let's take a second here to make sure our queries are in good shape--that the
 dates are all correct and that they all connect to a proper source. That way
 there won't be any surprises later.

 Run this cell and check for any errors in the output. Make sure that the
 number of queries imported is equal to the number of queries you intended
 to import.
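
One quick check you can do by hand is to confirm that each query's start date falls on or before its end date. The sketch below assumes query names end in `_MM-DD-YYYY_MM-DD-YYYY`, as the examples in `query_list` do. That naming pattern is inferred from those examples rather than guaranteed by Chomp, and `date_range_from_name` is a hypothetical helper.

```python
from datetime import datetime


def date_range_from_name(query_name):
    """Parse the trailing start and end dates from a query name."""
    # Assumed pattern: <source>_<query>_<MM-DD-YYYY>_<MM-DD-YYYY>
    _, start_str, end_str = query_name.rsplit("_", 2)
    start = datetime.strptime(start_str, "%m-%d-%Y").date()
    end = datetime.strptime(end_str, "%m-%d-%Y").date()
    return start, end


for name in [
    "we1s_humanities_01-01-2014_12-31-2019",
    "libcom-org_humanities_01-01-2000_12-31-2019",
]:
    start, end = date_range_from_name(name)
    status = "ok" if start <= end else "start date is after end date"
    print(f"{name}: {start} to {end} ({status})")
```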

In [None]:
queries = []
for query_name in query_list:
    query = db.load_query(query_name, query_dir)
    if not db.load_source(query.source, source_dir):
        print(f'WRN: "{query.source}" not found, skipping "{query.name}"')
        continue
    queries.append(query)
    print(f'Imported "{query.name}".')
print(f"{len(queries)} queries imported out of {len(query_list)} total.")
if len(queries) == len(query_list):
    print("Everything looks good so far!\n\n")
else:
    print("Hmm, does that seem right to you? Double-check!\n\n")

 ## LOAD URL STOPS (Optional)

 Load previously collected URLs. Skipping this step will force all responses to be re-collected.
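
As a rough illustration of how the two lists could work together, the sketch below skips a URL if it has already been collected or if it contains any stopword. This is an assumption about the filtering, not Chomp's actual logic. `should_skip` and the sample URLs and stopwords are hypothetical.

```python
def should_skip(url, url_stops, url_stopwords):
    """Return True if url was already collected or contains a stopword."""
    if url in url_stops:
        return True
    return any(word in url for word in url_stopwords)


# Hypothetical data, for illustration only.
seen = {"https://example.org/2019/01/humanities-funding/"}
stopwords = {"/tag/", "/author/"}
candidates = [
    "https://example.org/2019/01/humanities-funding/",  # already collected
    "https://example.org/tag/humanities/",              # contains a stopword
    "https://example.org/2019/02/new-article/",         # new, should be kept
]

to_collect = [u for u in candidates if not should_skip(u, seen, stopwords)]
print(to_collect)
```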

In [None]:
for query in queries:
    for response_name in query.responses:
        response = db.load_response(response_name, response_dir)
        url_stops.add(response.url)
print(f"Added {len(url_stops)} URLs to URL stop list.\n\n")

 ## GET RESPONSES

 Use the queries to start scraping responses.
 
 <h3 style="color:red;font-weight:bold">Pay close attention to errors here--many things can go wrong when dealing with the web!</h3>

In [None]:
%%capture output
%%time
total = 0

# Start browser connection.
browser = Browser(grid_url)

for query in queries:

    # Load associated source.
    source = db.load_source(query.source, source_dir)
    base_url = source.webpage

    # Select scraping API.
    api = "wordpress" if wordpress.is_api_available(base_url, browser) else "google"
    responses = None

    # Wordpress API ##########################################################
    if api == "wordpress":
        print(f'\nCollecting "{query.name}" via Wordpress API...')
        responses = wordpress.get_responses(
            query_str=query.query_str,
            base_url=base_url,
            endpoints=wp_endpoints,
            url_stops=url_stops,
            url_stopwords=url_stopwords,
            browser=browser,
        )

    # Google API #############################################################
    else:
        print(f'\nCollecting "{query.name}" via Google API...')
        responses = google.get_responses(
            query_str=query.query_str,
            base_url=base_url,
            google_cx=google_cx,
            google_key=google_key,
            url_stops=url_stops,
            url_stopwords=url_stopwords,
            browser=browser,
        )

    if not responses:
        print("ERR: No results or connection error!")
        continue

    # Loop over each page we get back and save the raw response JSON.
    count = 0
    for res in responses:
        # Check the result before unpacking it; a failed page may be empty.
        if not res or not res[1]:
            print("WRN: No response, skipping.")
            continue
        url, content = res

        # Add Chomp metadata.
        response = Response(
            name="_".join(
                [
                    "chomp-response",
                    source.name,
                    query.query_str,
                    clean.date_to_str(query.start_date),
                    clean.date_to_str(query.end_date),
                    str(count),
                ]
            ),
            url=url,
            content=content,
            chompApi=api,
            source=source.name,
            query=query.name,
        )

        # Save result.
        count += 1
        total += 1
        db.save_response(response, response_dir)

        # Update query.
        query.responses.add(response.name)
        db.save_query(query, query_dir)

        # Update source.
        source.responses.add(response.name)
        db.save_source(source, source_dir)

        print(f"- {response.url}")
    print(f"Done! Got {count} responses from this query.\n\n")
print(f"\nAll queries complete! Got a total of {total} responses.\n\n")
print("\n\n----------Time----------")


In [None]:
output.show()


 ## NEXT NOTEBOOK

In [None]:
# TODO: next notebook code
# Go to 02_articles.ipynb
