<a href="https://colab.research.google.com/github/lustraka/data-analyst-portfolio-project-2022/blob/main/cs01_cds_methods/20211202_Search_Web_Resources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Search Web Resources
## Define
### Outline requirements
In order to *search web resources*,
- design the structure of a web search database for storing and comparing automated web searches,
- design and test functions for
  - searching,
  - comparing,
  - reporting web resources,
- design and test an ontology for discovering (mining) and storing domain knowledge,
- populate the system in order to compile the 2021 annual report on CDS and their methods.

### Reflect on the previous code

- Data_Analysis_Workouts > [20211111_Scrape_WebPages_Root.ipynb](https://github.com/lustraka/Data_Analysis_Workouts/blob/main/Wrangle_Data/Scrape_Google_Search/20211111_Scrape_WebPages_Root.ipynb)
  - gather results of the Google search (`goo_urls`)
  - iterate over `goo_urls` to gather web resources (including the text of a web page if `status_code == 200`)
  - create a dataframe with `columns=['id', 'title', 'wp_url', 'status', 'status_ts', 'text', 'text_len']`
  - store dataframe to SQLite database (`SQLAlchemy`)
- suitecrm > [20211126-WebScrape-Pytude.ipynb](https://github.com/lustraka/suitecrm/blob/main/iim/20211126-WebScrape-Pytude.ipynb) >> [Scrape_Google_Sandbox.md](https://github.com/lustraka/suitecrm/blob/main/iim/Scrape_Google_Sandbox.md)
  - define `scrape_web(search, count)` function to return a dataframe with `columns=['url', 'site', 'title', 'rec_created']`
  ```python
  search = 'https://www.google.com/search?q=externí+hodnotitel'
  df = scrape_web(search, 100)
  ```
  - export search results to markdown

### Design an initial data structure of `url_master.csv`

Variable | Description
--- | ---
term | A unique identifier curated by hand.
title | The title of the web page retrieved from the Google search results.
url | The URL of the web page retrieved from the Google search results.
inn | An Internet node (hostname).
accessed | The date the web page was accessed.
search | The posted search string.
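
As a minimal sketch, the structure above can be expressed as a pandas dataframe. The sample record is hypothetical (its `term`, `title` and `url` values are invented for illustration), and parsing `accessed` as a date is an assumption, not something the design prescribes.

```python
import pandas as pd

# Column order taken from the table above
URL_MASTER_COLUMNS = ['term', 'title', 'url', 'inn', 'accessed', 'search']

# Hypothetical record, for illustration only
sample = pd.DataFrame(
    [['kitchin-2014', 'The Data Revolution', 'https://example.org/data-revolution',
      'example.org', '2021-12-02', 'kitchin+data+revolution']],
    columns=URL_MASTER_COLUMNS,
)

# Assumption: parse the access date so that records can be sorted by date
sample['accessed'] = pd.to_datetime(sample['accessed'])
print(sample.dtypes)
```

Keeping the column list in one constant would let the loader and the search function share the same schema.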

### Define the `search_web(master, search, count)` function
Use the Google search engine to find `count` responses to `search`.

Params:
- `master`: a master dataframe for identifying duplicates
- `search`: search terms connected with `+`
- `count`: required number of results, default is `40`; pages are retrieved in multiples of 10

Returns:
- a list of records with the same columns as `url_master`

If a web resource is already in `url_master`, `term` is copied from the matching record; otherwise it is set to `tbd`.
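
The two parameters `search` and `count` can be illustrated with a short sketch. The helpers `build_search` and `page_offsets` are hypothetical names introduced here; they only show how terms might be joined with `+` and how `count` maps to result pages of 10.

```python
from urllib.parse import quote_plus

def build_search(terms):
    """Join search terms with '+' after percent-encoding each term."""
    return '+'.join(quote_plus(term) for term in terms)

def page_offsets(count):
    """Return the values of the `start` parameter; one page holds 10 results."""
    return list(range(0, count, 10))

search = build_search(['kitchin', 'data', 'revolution'])
print(search)            # kitchin+data+revolution
print(page_offsets(40))  # [0, 10, 20, 30]
print(page_offsets(11))  # [0, 10]
```

Because pages are requested in steps of 10, `count=11` already triggers a second request.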

In [1]:
# Import dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
from datetime import date
import os

# Define an auxiliary function
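def get_url(link):
  """Extract URL from a google search <a> element"""
  # Drop the '/url?q=' prefix, then cut the href at the first '&' (if any);
  # an href without '&' is returned whole instead of raising an error
  if link.startswith('/url?q='):
    link = link[len('/url?q='):]
  return link.split('&', 1)[0]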

# Define load_url_master function
def load_url_master(name='url_master.csv'):
  """Load master file or initialize empty dataframe"""

  if os.path.isfile(name):
    df = pd.read_csv(name)
  else:
    df = pd.DataFrame(columns=['term', 'title', 'url', 'inn', 'accessed', 'search'])
  
  return df
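
An alternative to the regex-based `get_url()` is to parse the link with the standard library. The sketch below is not the notebook's method; `get_url_parsed` and `get_inn` are hypothetical names. Unlike the regex version, `parse_qs` percent-decodes the value, so `%3F` becomes `?`, and a link without the `/url` path is returned unchanged rather than cut at the first `&`.

```python
from urllib.parse import urlparse, parse_qs

def get_url_parsed(href):
    """Return the target URL of a Google result link, or None if absent."""
    parsed = urlparse(href)
    if parsed.path == '/url':
        # The target is stored in the `q` query parameter
        return parse_qs(parsed.query).get('q', [None])[0]
    return href or None

def get_inn(url):
    """Return the hostname part of a URL."""
    return urlparse(url).netloc

url = get_url_parsed('/url?q=https://example.org/%3Fpage_id%3D2&sa=U')
print(url)           # https://example.org/?page_id=2
print(get_inn(url))  # example.org
```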


In [2]:
# Define search_web() function
def search_web(master, search, count=40, debug=False):
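  """Use the Google search engine to find `count` responses to `search`.

  Params:
  - master: a master dataframe for identifying duplicates
  - search: search terms connected with `+`
  - count: required number of results; pages are retrieved in multiples of 10

  Returns a list of records with the same columns as `url_master`.
  """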

  data = []
  accessed = str(date.today())

  for i in range(0, count, 10):
    gsearch = 'https://www.google.com/search?q=' + search + '&start=' + str(i)
    if debug:
      print(f"gsearch = {gsearch}")
    # A timeout keeps the loop from hanging on an unresponsive connection
    gpage = requests.get(gsearch, timeout=10)
    if debug:
      print(f"gpage.status_code = {gpage.status_code}")
    gsoup = BeautifulSoup(gpage.content, 'html.parser')
    h3_list = gsoup.find_all('h3')
    if debug:
      print(f"len(h3_list) = {len(h3_list)}")

    # Continue only if there are some results
    if len(h3_list) == 0:
      break

    for h3 in h3_list:
      title = h3.text
      if debug:
        print(f"\th3.text = {h3.text} | h3.parent.attrs.keys() = {h3.parent.attrs.keys()}")

      # Skip the rest of loop if no url in h3.parent
      if 'href' not in h3.parent.attrs.keys():
        continue

      url = get_url(h3.parent['href'])
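      # Match the hostname even when no path follows it; skip relative links
      inn_match = re.search(r'//([^/?#]+)', url)
      if inn_match is None:
        continue
      inn = inn_match.group(1)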
      
      # Check the url in url_master
      check = master.loc[master.url == url]

      if check.shape[0] == 0:
        term = 'tbd'
      else:
        # Reuse the curated term; warn if the master holds the URL more than once
        term = check.term.values[0]
        if check.shape[0] > 1:
          print(f"Duplicated records for `{term}` in `url_master`!!")
      
      data.append([term, title, url, inn, accessed, search])

    print(gpage.status_code, gsearch, len(h3_list))

  return data
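
Since the function returns a plain list, a natural follow-up is to load it into a dataframe with the `url_master` columns. The sketch below uses hypothetical records in place of a live search, and dropping repeated URLs is an assumption about how overlapping result pages should be handled.

```python
import pandas as pd

columns = ['term', 'title', 'url', 'inn', 'accessed', 'search']

# Hypothetical records in the shape returned by search_web()
data = [
    ['tbd', 'Page A', 'https://example.org/a', 'example.org', '2021-12-02', 'a+b'],
    ['tbd', 'Page A', 'https://example.org/a', 'example.org', '2021-12-02', 'a+b'],
    ['tbd', 'Page B', 'https://example.com/b', 'example.com', '2021-12-02', 'a+b'],
]

results = pd.DataFrame(data, columns=columns)
# The same URL can appear on several result pages; keep the first occurrence
results = results.drop_duplicates(subset='url').reset_index(drop=True)
print(results[['inn', 'title']])
```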

In [3]:
# Test web search
master = load_url_master()
search = 'kitchin+data+revolution'
data = search_web(master, search, 11, debug=False)
data[-1:]

200 https://www.google.com/search?q=kitchin+data+revolution&start=0 12
200 https://www.google.com/search?q=kitchin+data+revolution&start=10 10


[['tbd',
  'Academic books | Rob Kitchin',
  'https://www.kitchin.org/%3Fpage_id%3D2',
  'www.kitchin.org',
  '2021-12-02',
  'kitchin+data+revolution']]

In [4]:
for row in data:
  print(row[3], row[1])

uk.sagepub.com The Data Revolution | SAGE Publications Ltd
methods.sagepub.com Big Data, Open Data, Data Infrastructures & Their Consequences
www.amazon.com The Data Revolution: Big Data, Open Data, Data Infrastructures and ...
www.theoryculturesociety.org Review: Rob Kitchin, 'The Data Revolution' - Theory, Culture & Society
books.google.com The Data Revolution: Big Data, Open Data, Data ... - Google Books
www.youtube.com Rob Kitchin talks about big data, open data and the 'data revolution'
www.researchgate.net The Data Revolution: Big Data, Open Data, Data Infrastructures and ...
thedatarevolutionbook.wordpress.com The Data Revolution | A book about big data, open data, data ...
arthistory2015.doingdh.org [PDF] data revolution - Building a Digital Portfolio
www.barnesandnoble.com A Critical Analysis of Big Data, Open Data and Data Infrastructures
books.google.com The Data Revolution - Rob Kitchin - Google Books
onlinelibrary.wiley.com The Data Revolution - Wiley Online Library
...

## Plan the next step
- save data to `url_review.csv` for manual review and for the Colab > local > GitHub > Colab cycle
- print data for the `url_review.md` document, which can be used to access pages during analysis
- load curated data and append them to `url_master`
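
The three planned steps could be sketched as follows. This is only an outline under stated assumptions: the records are hypothetical, the files are written to a temporary folder so that no real `url_master.csv` is overwritten, and a row counts as curated once its `term` is no longer `tbd`.

```python
import os
import tempfile

import pandas as pd

columns = ['term', 'title', 'url', 'inn', 'accessed', 'search']
workdir = tempfile.mkdtemp()  # scratch folder for this sketch
review_path = os.path.join(workdir, 'url_review.csv')
master_path = os.path.join(workdir, 'url_master.csv')

# Hypothetical master with one curated record
master = pd.DataFrame(
    [['page-a', 'Page A', 'https://example.org/a', 'example.org', '2021-12-01', 'a+b']],
    columns=columns)

# Hypothetical search results; the first term was already filled in by hand
review = pd.DataFrame([
    ['page-b', 'Page B', 'https://example.com/b', 'example.com', '2021-12-02', 'a+b'],
    ['tbd', 'Page C', 'https://example.net/c', 'example.net', '2021-12-02', 'a+b'],
], columns=columns)

# Step 1: save the results for manual review
review.to_csv(review_path, index=False)

# Step 2: print a markdown list of links for url_review.md
lines = [f"- [{row.title}]({row.url}) ({row.inn})" for row in review.itertuples()]
print('\n'.join(lines))

# Step 3: load the curated file and append rows whose term is no longer 'tbd'
curated = pd.read_csv(review_path)
curated = curated[curated['term'] != 'tbd']
master = pd.concat([master, curated], ignore_index=True).drop_duplicates(subset='url')
master.to_csv(master_path, index=False)
print(master[['term', 'url']])
```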