<a href="https://colab.research.google.com/github/lustraka/data-analyst-portfolio-project-2022/blob/main/cs01_cds_methods/20211202_Search_Web_Resources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Search Web Resources
## Define
### Outline requirments
In order to *search web resources*,
- design the structure of a web search database for storing and comparing automated web searches,
- design and test functions for
  - searching,
  - comparing,
  - reporting web resources,
- design and test an ontology for discovering (minining) and storing domain knowledge,
- populate system in order to compile 2021 annual report on CDS and their methods.

### Reflext the previous code

- Data_Analysis_Workouts > [20211111_Scrape_WebPages_Root.ipynb](https://github.com/lustraka/Data_Analysis_Workouts/blob/main/Wrangle_Data/Scrape_Google_Search/20211111_Scrape_WebPages_Root.ipynb)
  - gather results of the google search (`goo_urls`)
  - iterate over `goo_urls` to gather web resources (including the text of a web page if `status_code == 200`)
  - create a dataframe with `columns=['id' ,'title', 'wp_url', 'status', 'status_ts', 'text', 'text_len'])`
  - store dataframe to SQLite database (`SQLAlchemy`)
- suitecrm > [20211126-WebScrape-Pytude.ipynb](https://github.com/lustraka/suitecrm/blob/main/iim/20211126-WebScrape-Pytude.ipynb) >> [Scrape_Google_Sandbox.md](https://github.com/lustraka/suitecrm/blob/main/iim/Scrape_Google_Sandbox.md)
  - define `scrape_web(search, count)` function to return a dataframe with `columns=['url', 'site', 'title', 'rec_created']`
  ```python
  search = 'https://www.google.com/search?q=externí+hodnotitel'
df = scrape_web(search, 100)
  ```
  - export search results to markdown

### Design an initial data structure of `url_master.csv`

Variable | Description
- | -
term | An unique idenitifier curated by hand.
title | A title of the web page retrieved from the google search results.
url | A URL of the web page retrieved from the google search.
inn | An Internet node (hostname) extracted from the URL.
accessed | A date of the web page access.
search | The search string posted to the Google search page.

### Define the `search_web(master, search, count)` function
Use the Google search engine to find `count` responses to `search`.

Params:
- `master`: a master dataframe for identifying duplicates
- `search`: search terms connected with `+`
- `count`: required number of results, default is `40`; pages are retrieved in multiples of 10

Returns:
- a list with records with same columns as in `url_master`

If a web resource is already in `url_master`, then `term` is filled with that value, otherwise it has the `url ` value.

In [1]:
# Import dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
from datetime import date
from urllib.error import HTTPError

path = 'https://raw.githubusercontent.com/lustraka/data-analyst-portfolio-project-2022/main/data/web_searches/'

# Define an auxiliary function
def get_url(link):
  """Extract URL from a google search an <a> element"""
  if link[:7] == '/url?q=':
    url = re.search(r'q=(.*?)&', link).group(1)
  else:
    url = re.search(r'\A(.*?)&', link).group(1)
  return url

# Define load_url_master function
def load_url_master(name='url_master.csv'):
  """Load master file or initialize empty dataframe"""

  try:
        df = pd.read_csv(path+name)
  except HTTPError as error:
    print(f"{error} >> empty master initialized.")
    df = pd.DataFrame(columns=['term', 'title', 'url', 'inn', 'accessed', 'search'])
  
  return df

In [2]:
# Define search_web() function
def search_web(master, search, count=40, debug=False):
  """<fixme> from markdown cell"""

  data = []
  accessed = str(date.today())

  for i in range(0, count, 10):
    gsearch = 'https://www.google.com/search?q=' + search + '&start=' + str(i)
    if debug:
      print(f"gsearch = {gsearch}")
    gpage = requests.get(gsearch)
    if debug:
      print(f"gpage.status_code = {gpage.status_code}")
    gsoup = BeautifulSoup(gpage.content, 'html.parser')
    h3_list = gsoup.find_all('h3')
    if debug:
      print(f"len(h3_list) = {len(h3_list)}")

    # Continue only if there are some results
    if len(h3_list) == 0:
      break

    for h3 in h3_list:
      title = h3.text
      if debug:
        print(f"\th3.text = {h3.text} | h3.parent.attrs.keys() = {h3.parent.attrs.keys()}")

      # Skip the rest of loop if no url in h3.parent
      if 'href' not in h3.parent.attrs.keys():
        continue

      url = get_url(h3.parent['href'])
      inn = re.search('//(.*?)/', url).group(1)
      
      # Check the url in url_master
      check = master.loc[master.url == url]

      if check.shape[0] == 0:
        term = 'url '
      elif check.shape[0] == 1:
        term = check.term.values[0]
      else:
        print(f"Duplicted records for `{check.term.values[0]}` in `url_master`!!")
      
      data.append([term, title, url, inn, accessed, search])

    print(gpage.status_code, gsearch, len(h3_list))

  return data

## Search the web

In [None]:
# Load an url_master file
master = load_url_master()
print(f"master.shape = {master.shape}")

In [3]:
# Search the web resources
search = 'kitchin+data+revolution'
data = search_web(master, search, debug=False)
data[-1:]

master.shape = (19, 6)
200 https://www.google.com/search?q=kitchin+data+revolution&start=0 12
Duplicted records for `url Google Books search01` in `url_master`!!
200 https://www.google.com/search?q=kitchin+data+revolution&start=10 10
200 https://www.google.com/search?q=kitchin+data+revolution&start=20 10
200 https://www.google.com/search?q=kitchin+data+revolution&start=30 10


[['url ',
  'Rob Kitchin - Google Scholar',
  'https://scholar.google.com/citations%3Fuser%3DY_3-GBQAAAAJ%26hl%3Den',
  'scholar.google.com',
  '2021-12-02',
  'kitchin+data+revolution']]

In [4]:
# Print already known records
for row in [row for row in data if len(row[0]) > 4]:
  print(row[0])
  print("\t", row[3], " | ", row[1])

url Kitchin 2014 SagePub item
	 uk.sagepub.com  |  The Data Revolution | SAGE Publications Ltd
url Kitchin 2014 Amazon item
	 methods.sagepub.com  |  Big Data, Open Data, Data Infrastructures & Their Consequences
url Kitchin 2014 SagePub ResearchMethods item
	 www.amazon.com  |  The Data Revolution: Big Data, Open Data, Data Infrastructures and ...
url Kitchin 2014 TheCultureSociety bookreview
	 www.theoryculturesociety.org  |  Review: Rob Kitchin, 'The Data Revolution' - Theory, Culture & Society
url Google Books search00
	 books.google.com  |  The Data Revolution: Big Data, Open Data, Data ... - Google Books
url Rob Kitchin YouTube video00
	 www.youtube.com  |  Rob Kitchin talks about big data, open data and the 'data revolution'
url Kitchin 2014 bookwebsite
	 thedatarevolutionbook.wordpress.com  |  The Data Revolution | A book about big data, open data, data ...
url Kitchin 2021b GeographicalResearch bookreview
	 onlinelibrary.wiley.com  |  The Data Revolution - Wiley Online Library

## Review results

- select only the new data (`forreview`)
- save the new data to `url_review.csv` for a by-hand review and for the Colab > local > GitHub > Colab cycle
- print and copy the new data to the `url_review.md` document and use this document to access web resources during analysis


In [5]:
# Select only newly idenfied web resources
forreview = [row for row in data if len(row[0]) == 4]

In [6]:
# Save data as CSV file
pd.DataFrame(forreview, columns=master.columns).to_csv('url_review.csv', index=False)

In [7]:
def print_md(data):
  """Print data for markdown document"""
  print(f"## Searched terms: `{data[0][5]}`")
  for row in data:
    print(f"- [{row[1]}]({row[2]}) (***`{row[3]}`** | {row[0]})")

# Copy data to a markdown file in GitHub
print_md(forreview)

## Searched terms: `kitchin+data+revolution`
- [The Data Revolution: Big Data, Open Data, Data Infrastructures and ...](https://www.researchgate.net/publication/307894195_The_Data_Revolution_Big_Data_Open_Data_Data_Infrastructures_and_Their_Consequences_by_Rob_Kitchin_2014_Thousand_Oaks_California_Sage_Publications_222xvii_ISBN_978-1446287484_100) (***`www.researchgate.net`** | url )
- [[PDF] data revolution - Building a Digital Portfolio](http://arthistory2015.doingdh.org/wp-content/uploads/sites/6/2015/06/Kitchin-Chapter1.pdf) (***`arthistory2015.doingdh.org`** | url )
- [A Critical Analysis of Big Data, Open Data and Data Infrastructures](https://www.barnesandnoble.com/w/the-data-revolution-rob-kitchin/1139609906) (***`www.barnesandnoble.com`** | url )
- [The Data Revolution - Rob Kitchin - Google Books](https://books.google.com/books/about/The_Data_Revolution.html%3Fid%3DhqdqrgEACAAJ) (***`books.google.com`** | url )
- [The Data Revolution - Rob Kitchin - Google Books](https://book

## Review data by hand now
- fetch the local repository
- download `url_review.csv` to the local repository
- review the new web resources and fill their terms
- save `url_review.csv` locally
- push changes to the GitHub repository

In [12]:
# Check terms
master.loc[master.term.apply(lambda s: 'google' in s.lower()), ['term', 'url']]

Unnamed: 0,term,url
5,url Google Books search00,https://books.google.com/books/about/The_Data_...
6,url Google Books search01,https://books.google.com/books/about/The_Data_...
10,url Google Books search03,https://books.google.com/books/about/The_Data_...


## Append reviewed data to master
- read `url_review.csv` (from the GitHub repository)
- append rows to the master
- check the master file

In [13]:
# Read reviewed data
reviewed = pd.read_csv(path+'url_review.csv')
reviewed.shape

(23, 6)

In [14]:
# Append reviewed row to master
master = master.append(reviewed, ignore_index=True, verify_integrity=True)
master.shape

(42, 6)

In [15]:
master.duplicated(['term', 'url']).sum()

0

## Save updated master `url_master`
- save updated master `url_master`
- download `url_master` to the local repository
- push changes to the GitHub repository

In [16]:
# Save master as CSV file
master.to_csv('url_master.csv', index=False)

## Analyze the master file

In [18]:
master[['term','inn','title']].sort_values(by='term')

Unnamed: 0,term,inn,title
5,url Google Books search00,books.google.com,"The Data Revolution: Big Data, Open Data, Data..."
6,url Google Books search01,books.google.com,The Data Revolution - Rob Kitchin - Google Books
10,url Google Books search03,books.google.com,The Data Revolution - Rob Kitchin - Google Books
22,url Google Books search04,books.google.com,The Data Revolution - Rob Kitchin - Google Books
23,url Google Books search05,books.google.com,The Data Revolution - Rob Kitchin - Google Books
24,url Google Books search06,books.google.com,The Data Revolution: A Critical Analysis of Bi...
1,url Kitchin 2014 Amazon item,methods.sagepub.com,"Big Data, Open Data, Data Infrastructures & Th..."
40,url Kitchin 2014 AwesomeBooks item,www.awesomebooks.com,"A Critical Analysis of Big Data, Open Data and..."
30,url Kitchin 2014 Biblio items,www.biblio.com,9781446287484 - The Data Revolution - Biblio.com
29,url Kitchin 2014 Bokus item,www.bokus.com,The Data Revolution - Rob Kitchin - Häftad (97...


In [20]:
# Check duplicated terms
master.loc[master.term.duplicated()]

Unnamed: 0,term,title,url,inn,accessed,search
25,url Kitchin 2014 SurveillanceSociety bookreview,"Book review: Kitchin, R. 2014. 'The Data Revol...",https://openresearch.surrey.ac.uk/esploro/outp...,openresearch.surrey.ac.uk,2021-12-02,kitchin+data+revolution
