 # STEP 2: SCRAPE DOCUMENTS

 <h4>This notebook will collect article metadata from scraped responses.</h4>

It supports the WordPress REST API (reference: https://developer.wordpress.org/rest-api/) and the Google Custom Search Engine API (reference: https://developers.google.com/custom-search/v1/) out of the box, though other APIs can be added. For more information and further instructions, consult the Chomp documentation at https://github.com/kwgws/we1s_chomp.

 ## INFO
 
__authors__    = 'Catherine Gilleran'  
__copyright__  = 'copyright 2019, The WE1S Project'  
__license__    = 'MIT'  
__version__    = '0.1.0'  


 ## SETTINGS

In [None]:
from os import getenv
from pathlib import Path

from we1s_chomp import google, wordpress
from we1s_chomp.model import Article
from we1s_chomp.web import Browser

project_dir = Path.home() / "write" / "dev" / "we1s_chomp"
url_stopwords_file = project_dir / "notebooks" / "url_stopwords.txt"

grid_url = getenv("CHOMP_SELENIUM_GRID_URL")

url_stops = set()

# Get stopwords.
url_stopwords = set()
with open(url_stopwords_file, encoding="utf-8") as txtfile:
    for line in txtfile.readlines():
        stopword = line.strip()
        if stopword != "":
            url_stopwords.add(stopword)
            print(f'Added URL stopword: "{stopword}".')
print("\n")

 ## DATA DIRECTORIES

 Chomp will import and export JSON files to these directories, using them not
 only to set up and store the results of a collection run but also to keep
 track of metadata, duplicate URLs, page numbers, etc. as it goes.

 <p style="color:red;">Because collection is a time- and resource-intensive
 task, it is preferable to keep these metadata files in a single location and
 allow Chomp to manage them internally rather than modifying or deleting them
 between each job.</p>

In [None]:
query_dir = project_dir / "data" / "json" / "queries"
source_dir = project_dir / "data" / "json" / "sources"
response_dir = project_dir / "data" / "json" / "responses"
article_dir = project_dir / "data" / "json" / "articles"

# Make article directory if it does not already exist.
if not article_dir.exists():
    article_dir.mkdir(parents=True)

print(f"Loading sources from {source_dir}.")
print(f"Loading queries from {query_dir}.")
print(f"Loading responses from {response_dir}.")
print(f"Saving articles to {article_dir}.\n\n")
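
Before starting a long run, it can help to confirm how many JSON files each data directory holds. The sketch below is a minimal, standalone illustration: `count_json_files` is a hypothetical helper (not part of Chomp), and the demo builds a throwaway directory tree so it can run anywhere.

```python
import tempfile
from pathlib import Path


def count_json_files(directory: Path) -> int:
    """Return the number of *.json files directly inside `directory`.

    Hypothetical helper for illustration; a missing directory counts as 0.
    """
    if not directory.is_dir():
        return 0
    return sum(1 for _ in directory.glob("*.json"))


# Demonstrate on a temporary tree that mimics the data/json/ layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "data" / "json"
    for sub in ("queries", "sources", "responses", "articles"):
        (demo_root / sub).mkdir(parents=True)
    (demo_root / "responses" / "example_0.json").write_text("{}", encoding="utf-8")
    (demo_root / "responses" / "example_1.json").write_text("{}", encoding="utf-8")

    demo_counts = {
        sub.name: count_json_files(sub) for sub in sorted(demo_root.iterdir())
    }
    missing_count = count_json_files(demo_root / "does_not_exist")
```

In this notebook the same helper could be pointed at `query_dir`, `source_dir`, `response_dir`, and `article_dir`.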

 ## BROWSE: Search responses for keywords

Choose `search_text` to filter available response files. If you are searching for a specific word or phrase, enter it WITHIN the single quotes below. Note that you will be searching the filenames of the JSON files stored in the `response_dir` path (usually `data/json/responses`). If you simply want to list all of the available responses, change the value of the `search_text` variable below to `None` WITHOUT single quotes (so the line should read `search_text = None`).

In [None]:
search_text = None

Run the cell and review the results. The default is to search through the `data/json/responses` directory. If your data is in a different location on harbor, change the `response_dir` variable above to the directory you want to search.

In [None]:
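import json

print("response_list = [")
for filename in sorted(response_dir.glob("*.json")):
    # Apply the optional keyword filter to the filename, as described above.
    if search_text and search_text not in filename.name:
        continue
    with open(filename, encoding="utf-8") as jsonfile:
        name = json.load(jsonfile).get("name", "")
    # Skip files that are not Chomp responses.
    if not name or "chomp-response" not in name:
        continue
    print('    "' + name + '",')
print("]\n\n")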
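
The keyword filter described above can be sketched as a small helper. This is an illustration only: `filter_names` is a hypothetical function, and the example names are taken from the response list used later in this notebook.

```python
from typing import Iterable, List, Optional


def filter_names(names: Iterable[str], search_text: Optional[str]) -> List[str]:
    """Keep names containing `search_text`; keep everything if it is None or empty."""
    if not search_text:
        return sorted(names)
    return sorted(name for name in names if search_text in name)


example_names = [
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_0",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_0",
]

# A specific phrase narrows the list; None returns every name.
only_libcom = filter_names(example_names, "libcom-org")
everything = filter_names(example_names, None)
```

Treating `None` and the empty string the same way keeps the "list everything" case from depending on how the variable was cleared.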

## LIST: Define which responses will be scraped

Copy the entire cell output above and paste it over the `response_list` array in the following cell. Each response name should be surrounded by quotes and followed by a comma (the comma after the last name in the list is optional).

Don't forget to run the cell!

In [None]:
response_list = [
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_0",
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_1",
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_2",
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_3",
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_4",
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_5",
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_6",
    "chomp-response_we1s_humanities_2014-01-01_2019-12-31_7",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_0",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_1",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_2",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_3",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_4",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_5",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_6",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_7",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_8",
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_9",
]
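
The response names above appear to follow the pattern `chomp-response_<source>_<query>_<start date>_<end date>_<page>`. That pattern is inferred from the examples in this notebook rather than taken from the Chomp documentation, so treat the parser below as a sketch under that assumption; `ResponseName` and `parse_response_name` are hypothetical names.

```python
from datetime import date
from typing import NamedTuple


class ResponseName(NamedTuple):
    source: str
    query: str
    start_date: date
    end_date: date
    page: int


def parse_response_name(name: str) -> ResponseName:
    """Split a response name into its parts (assumed naming convention)."""
    prefix, _, rest = name.partition("_")
    if prefix != "chomp-response":
        raise ValueError(f"Unexpected prefix in {name!r}")
    # The last three fields are fixed; whatever precedes them is source + query.
    head, start, end, page = rest.rsplit("_", 3)
    source, _, query = head.partition("_")
    return ResponseName(
        source=source,
        query=query,
        start_date=date.fromisoformat(start),
        end_date=date.fromisoformat(end),
        page=int(page),
    )


parsed = parse_response_name(
    "chomp-response_libcom-org_humanities_2000-01-01_2019-12-31_9"
)
```

Parsing each entry of `response_list` this way would surface typos in a pasted name as a `ValueError` before any files are loaded.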

 ## IMPORT & TEST RESPONSES

 Let's take a second here to make sure our responses are all in good shape--
 that the dates are all correct and that they all connect to a proper source.
 That way there won't be any surprises later.

 Run this cell and check for any errors in the output. Make sure that the
 number of responses imported is equal to the number of responses you intended
 to import.

In [None]:
from we1s_chomp import clean, db

responses = []
for response_name in response_list:
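    response = db.load_response(response_name, response_dir)
    # Assumption: like the other loaders, this returns a falsy value on failure.
    if not response:
        print(f'WRN: "{response_name}" could not be loaded, skipping.')
        continue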
    if not db.load_source(response.source, source_dir):
        print(f'WRN: "{response.source}" not found, skipping "{response.name}".')
        continue
    if not db.load_query(response.query, query_dir):
        print(f'WRN: "{response.query}" not found, skipping "{response.name}".')
        continue
    responses.append(response)
    print(f'Imported "{response.name}".')
print(f"{len(responses)} responses imported out of {len(response_list)} total.")
if len(responses) == len(response_list):
    print("Everything looks good so far!\n\n")
else:
    print("Hmm, does that seem right to you? Double-check!\n\n")
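
If the two totals differ, it helps to see exactly which names were skipped. The following is a minimal sketch with made-up names; `find_missing` is a hypothetical helper, not part of Chomp.

```python
from typing import Iterable, List


def find_missing(requested: Iterable[str], imported: Iterable[str]) -> List[str]:
    """Return requested names that were not imported, in their original order."""
    imported_set = set(imported)
    return [name for name in requested if name not in imported_set]


requested_names = ["response_a", "response_b", "response_c"]
imported_names = ["response_a", "response_c"]
skipped = find_missing(requested_names, imported_names)
```

In this notebook the call would look like `find_missing(response_list, [r.name for r in responses])`.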

 ## LOAD URL STOPS (Optional)

 Load previously collected URLs. Skipping this step will force all responses to be re-collected.

In [None]:
for response in responses:
    for article_name in response.articles:
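        article = db.load_article(article_name, article_dir)
        # Assumption: a falsy value means the article file could not be loaded.
        if not article:
            print(f'WRN: "{article_name}" could not be loaded, skipping.')
            continue
        url_stops.add(article.url)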
print(f"Added {len(url_stops)} URLs to URL stop list.\n\n")
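
How Chomp applies `url_stops` and `url_stopwords` internally is not shown in this notebook. One plausible reading, sketched below as an assumption, is that a URL is skipped when it has already been collected or when it contains any stopword as a substring. `should_skip` and the example URLs are hypothetical.

```python
from typing import Set


def should_skip(url: str, url_stops: Set[str], url_stopwords: Set[str]) -> bool:
    """Assumed rule: skip exact repeats and any URL containing a stopword."""
    if url in url_stops:
        return True
    return any(stopword in url for stopword in url_stopwords)


demo_stops = {"https://example.org/articles/1"}
demo_stopwords = {"/tag/", "/author/"}

already_seen = should_skip("https://example.org/articles/1", demo_stops, demo_stopwords)
blocked_by_word = should_skip("https://example.org/tag/news", demo_stops, demo_stopwords)
allowed = should_skip("https://example.org/articles/2", demo_stops, demo_stopwords)
```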

 # GET DOCUMENTS

 Use each imported response, together with its query and source, to collect the articles it lists.
 
 <h3 style="color:red;font-weight:bold">Pay close attention to errors here--many things can go wrong when dealing with the web!</h3>

In [None]:
%%capture output
%%time
total = 0

# Start browser connection. We need this for Google articles.
browser = Browser(grid_url)

# We need to restart article count whenever we get a new query; that means
# keeping track of the old query.
last_query = db.load_query(responses[0].query, query_dir)
count = no_exact_match_count = 0

for response in responses:

    # Load associated source & query.
    source = db.load_source(response.source, source_dir)
    query = db.load_query(response.query, query_dir)
    
    # Reset count on new query.
    if last_query.name != query.name:
        count = no_exact_match_count = 0

    # Select collection API.
    api = response.chompApi

    # Wordpress API ##########################################################
    if api == "wordpress":
        print(f'\nCollecting "{response.name}" via Wordpress API...')
        articles_raw = wordpress.get_metadata(
            response=response.content,
            query_str=query.query_str,
            start_date=query.start_date,
            end_date=query.end_date,
            url_stops=url_stops,
            url_stopwords=url_stopwords
        )

    # Google API #############################################################
    else:
        print(f'\nCollecting "{response.name}" via Google API...')
        articles_raw = google.get_metadata(
            response=response.content,
            query_str=query.query_str,
            start_date=query.start_date,
            end_date=query.end_date,
            url_stops=url_stops,
            url_stopwords=url_stopwords,
            browser=browser,
        )

    if not articles_raw:
        print("ERR: No results or connection error!")
        continue

    # Loop over each article we get back and save it.
    for doc in articles_raw:
        if not doc:
            print("WRN: No article found, skipping.")
            continue

        # Parse result.
        name = "_".join(
            [
                "chomp",
                query.name,
                str(count if not doc["no_exact_match"] else no_exact_match_count)
            ]
        )

        # Articles that do not explicitly contain the search term may require
        # special handling--set them aside.
        if doc["no_exact_match"]:
            name += "(no-exact-match)"

        article = Article(
            name=name,
            title=doc["title"],
            shortTitle=name,
            chompApi=api,
            url=doc["url"],
            pub_date=doc["pub_date"],
            content_html=doc["content_html"],
            content=doc["content"],
            pub=source.title,
            copyright=source.copyright,
            source=source.name,
            query=query.name,
            response=response.name,
        )

        if not doc["no_exact_match"]:
            count += 1
        else:
            no_exact_match_count += 1
        total += 1
        db.save_article(article, article_dir)

        # Update response.
        response.articles.add(article.name)
        db.save_response(response, response_dir)

        # Update query.
        query.articles.add(article.name)
        db.save_query(query, query_dir)

        # Update source.
        source.articles.add(article.name)
        db.save_source(source, source_dir)
        
        # Save previous query.
        last_query = query

        print(f"- {article.url}")
    print(f'Done! {count + no_exact_match_count} articles collected so far for query "{query.name}".\n\n')
print(f"\nAll responses complete! Got a total of {total} articles.\n\n")
print("\n\n----------Time----------")
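
The article naming used in the cell above can be isolated as a small function, which makes the two separate counters easier to follow. `make_article_name` is a hypothetical helper that mirrors the inline logic, and `"demo-query"` is a made-up query name.

```python
def make_article_name(query_name: str, index: int, no_exact_match: bool) -> str:
    """Build an article name the same way the collection loop does."""
    name = "_".join(["chomp", query_name, str(index)])
    if no_exact_match:
        name += "(no-exact-match)"
    return name


# Each flag marks whether a document lacked an exact match for the search term.
no_exact_match_flags = [False, True, False]

count = no_exact_match_count = 0
names = []
for flag in no_exact_match_flags:
    # Exact and inexact matches are numbered independently.
    index = no_exact_match_count if flag else count
    names.append(make_article_name("demo-query", index, flag))
    if flag:
        no_exact_match_count += 1
    else:
        count += 1
```

Because both counters restart at zero for each query, an exact match and an inexact match can share an index; the `(no-exact-match)` suffix is what keeps their names distinct.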

In [None]:
output.show()

 ## NEXT NOTEBOOK

In [None]:
# TODO: next notebook code
# Go to 03_export.ipynb