<a href="https://colab.research.google.com/github/lustraka/data-analyst-portfolio-project-2022/blob/main/cs01_cds_methods/20211202_Search_Web_Resources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Search Web Resources
## Define
### Outline requirements
In order to *search web resources*,
- design the structure of a web search database for storing and comparing automated web searches,
- design and test functions for
  - searching,
  - comparing,
  - reporting web resources,
- design and test an ontology for discovering (mining) and storing domain knowledge,
- populate the system in order to compile the 2021 annual report on CDS and their methods.

### Reflect on the previous code

- Data_Analysis_Workouts > [20211111_Scrape_WebPages_Root.ipynb](https://github.com/lustraka/Data_Analysis_Workouts/blob/main/Wrangle_Data/Scrape_Google_Search/20211111_Scrape_WebPages_Root.ipynb)
  - gather results of the google search (`goo_urls`)
  - iterate over `goo_urls` to gather web resources (including the text of a web page if `status_code == 200`)
  - create a dataframe with `columns=['id', 'title', 'wp_url', 'status', 'status_ts', 'text', 'text_len']`
  - store dataframe to SQLite database (`SQLAlchemy`)
- suitecrm > [20211126-WebScrape-Pytude.ipynb](https://github.com/lustraka/suitecrm/blob/main/iim/20211126-WebScrape-Pytude.ipynb) >> [Scrape_Google_Sandbox.md](https://github.com/lustraka/suitecrm/blob/main/iim/Scrape_Google_Sandbox.md)
  - define `scrape_web(search, count)` function to return a dataframe with `columns=['url', 'site', 'title', 'rec_created']`
  ```python
  search = 'https://www.google.com/search?q=externí+hodnotitel'
  df = scrape_web(search, 100)
  ```
  - export search results to markdown

### Design an initial data structure of `url_master.csv`

Variable | Description
- | -
term | A unique identifier curated by hand.
title | The title of the web page retrieved from the Google search results.
url | The URL of the web page retrieved from the Google search results.
inn | An Internet node (hostname) extracted from the URL.
accessed | The date the web page was accessed.
search | The search string posted to the Google search page.
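
The `inn` variable is derived from `url`. A minimal sketch of that derivation, assuming the standard library's `urllib.parse` is acceptable here; the helper name `extract_inn` is hypothetical and not part of the notebook, and the record values are copied from the search output shown later in this notebook:

```python
from urllib.parse import urlsplit

def extract_inn(url):
  """Return the hostname part of a URL, or an empty string if there is none."""
  # `extract_inn` is a hypothetical helper used only for this illustration
  return urlsplit(url).hostname or ''

# One record laid out like a row of url_master.csv
record = {
  'term': 'url ',
  'title': '3rd Digital Health Society Summit 2021 - Meetmaps',
  'url': 'https://event.meetmaps.com/dhssummit21/en/landing',
  'accessed': '2021-12-03',
  'search': 'data+university+society+digital+information+2021',
}
record['inn'] = extract_inn(record['url'])
print(record['inn'])
```

Unlike a pattern that expects a slash after the hostname, this also handles URLs that consist of a scheme and a host only.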

### Define the `search_web(master, search, count)` function
Use the Google search engine to find `count` responses to `search`.

Params:
- `master`: a master dataframe for identifying duplicates
- `search`: search terms connected with `+`
- `count`: required number of results, default is `40`; pages are retrieved in multiples of 10

Returns:
- a list of records with the same columns as `url_master`

If a web resource is already in `url_master`, then `term` is filled with the term stored there; otherwise it is set to the placeholder `'url '` (four characters, including the trailing space).
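
The two rules above (paging in steps of 10 and the term lookup) can be sketched in isolation. The names `page_offsets`, `resolve_term`, `PLACEHOLDER`, and `toy_master` are hypothetical and exist only for this illustration; the toy master holds made-up values:

```python
import pandas as pd

PLACEHOLDER = 'url '  # term given to resources that are not yet in the master

def page_offsets(count=40):
  """Return the `start` offsets needed to cover `count` results, 10 per page."""
  return list(range(0, count, 10))

def resolve_term(master, url):
  """Return the curated term for `url`, or the placeholder if the URL is new."""
  known = master.loc[master.url == url, 'term']
  return known.iloc[0] if len(known) else PLACEHOLDER

# Toy master with one curated record (made-up values)
toy_master = pd.DataFrame({
  'term': ['Example 2021 report'],
  'url': ['https://example.org/report'],
})

print(page_offsets(25))
print(resolve_term(toy_master, 'https://example.org/report'))
print(repr(resolve_term(toy_master, 'https://example.org/other')))
```

Because offsets advance by 10, a `count` of 25 still requests three result pages.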

In [1]:
# Import dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
from datetime import date
from urllib.error import HTTPError

path = 'https://raw.githubusercontent.com/lustraka/data-analyst-portfolio-project-2022/main/data/web_searches/'

# Define an auxiliary function
def get_url(link):
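  """Extract the target URL from the href of a Google search result <a> element"""
  if link[:7] == '/url?q=':
    match = re.search(r'q=(.*?)&', link)
  else:
    match = re.search(r'\A(.*?)&', link)
  # Fall back to the raw link if it contains no '&' to delimit the URL
  return match.group(1) if match else link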

# Define load_url_master function
def load_url_master(name='url_master.csv'):
  """Load master file or initialize empty dataframe"""

  try:
    df = pd.read_csv(path+name)
  except HTTPError as error:
    print(f"{error} >> empty master initialized.")
    df = pd.DataFrame(columns=['term', 'title', 'url', 'inn', 'accessed', 'search'])
  
  return df

In [2]:
# Define search_web() function
def search_web(master, search, count=40, debug=False):
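  """Use the Google search engine to find `count` responses to `search`.

  Params:
  - master: a master dataframe for identifying duplicates
  - search: search terms connected with `+`
  - count: required number of results, default is 40; pages are retrieved in multiples of 10
  - debug: print intermediate values if True

  Returns:
  - a list of records with the same columns as `url_master`
  """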

  data = []
  accessed = str(date.today())

  for i in range(0, count, 10):
    gsearch = 'https://www.google.com/search?q=' + search + '&start=' + str(i)
    if debug:
      print(f"gsearch = {gsearch}")
    gpage = requests.get(gsearch)
    if debug:
      print(f"gpage.status_code = {gpage.status_code}")
    gsoup = BeautifulSoup(gpage.content, 'html.parser')
    h3_list = gsoup.find_all('h3')
    if debug:
      print(f"len(h3_list) = {len(h3_list)}")

    # Continue only if there are some results
    if len(h3_list) == 0:
      break

    for h3 in h3_list:
      title = h3.text
      if debug:
        print(f"\th3.text = {h3.text} | h3.parent.attrs.keys() = {h3.parent.attrs.keys()}")

      # Skip the rest of loop if no url in h3.parent
      if 'href' not in h3.parent.attrs.keys():
        continue

      url = get_url(h3.parent['href'])
      # Take the hostname; this also matches URLs without a path after the host
      inn = re.search(r'//([^/?#]+)', url).group(1)
      
      # Check the url in url_master
      check = master.loc[master.url == url]

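      if check.shape[0] == 0:
        # Placeholder term for a resource not yet in the master
        term = 'url '
      else:
        if check.shape[0] > 1:
          print(f"Duplicated records for `{check.term.values[0]}` in `url_master`!!")
        # Reuse the curated term (the first one if the master holds duplicates)
        term = check.term.values[0]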
      
      data.append([term, title, url, inn, accessed, search])

    print(gpage.status_code, gsearch, len(h3_list))

  return data

## Search the web

In [3]:
# Load the url_master file
master = load_url_master()
print(f"master.shape = {master.shape}")

master.shape = (42, 6)


In [4]:
# Search the web resources
search = 'data+university+society+digital+information+2021'
data = search_web(master, search, debug=False)
data[-1:]

200 https://www.google.com/search?q=data+university+society+digital+information+2021&start=0 10
200 https://www.google.com/search?q=data+university+society+digital+information+2021&start=10 10
200 https://www.google.com/search?q=data+university+society+digital+information+2021&start=20 10
200 https://www.google.com/search?q=data+university+society+digital+information+2021&start=30 10


[['url ',
  '3rd Digital Health Society Summit 2021 - Meetmaps',
  'https://event.meetmaps.com/dhssummit21/en/landing',
  'event.meetmaps.com',
  '2021-12-03',
  'data+university+society+digital+information+2021']]

In [5]:
# Print already known records (their term differs from the 'url ' placeholder)
for row in [row for row in data if row[0] != 'url ']:
  print(row[0])
  print("\t", row[3], " | ", row[1])

## Set Up HTML Export

In [6]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
from datetime import date
import os

# Initialize the template for an export
html_template = """<!DOCTYPE html>
<html><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title></title>
<style>
body { background-color: #ffffff; color: #000000; margin:10px 10px 10px 10px; font-family: Arial, Helvetica, sans-serif; }
h1 { font-size: 1.75em; color: #003199; }
h2 { font-size: 1.25em; color: #003199; }
h3 { font-size: 1em; color: #003199; }
p, li { font-size: 1em; }
.header { font-size: 2em; color: #000099; }
.maincontent { }
.footer { text-align: center; font-size: 0.675em; }
table, th, td { border: 1px solid black; border-collapse: collapse; }
table { width: 18cm; }
th, td { padding: 2px; }
.highligth { background-color: #ffffe0;}
</style>
</head>
<body>
<div class="header"><span></span></div>
<hr/>
<div class="maincontent">
</div>
<hr/>
<div class="footer">Updated: <span class="update"></span>.</div>
</body></html>"""

def export_html(df, filename):
  """Export the dataframe df to a HTML file called `filename`."""

  soup = BeautifulSoup(html_template, 'html.parser')
  soup.find(class_='update').string = str(date.today())

  # Locate a div maincontent element
  maincontent = soup.find('div', class_='maincontent')

  for _, row in df.iterrows():
    div_row = soup.new_tag('div', attrs={'class': 'row'})

    # Prepare a <h3> element for term and title
    h3 = soup.new_tag('h3')
    div_row.append(h3)

    # Prepare an <a> element for url
    a = soup.new_tag('a')
    div_row.append(a)

    for col in df.columns:
      # Create each element with soup.new_tag() instead of parsing a snippet,
      # which avoids the "no parser was explicitly specified" warning
      span = soup.new_tag('span', attrs={'class': col})
      span.string = str(row[col])
      if col == 'term':
        h3.append(span)
        h3.append(soup.new_tag('br'))
      elif col == 'title':
        h3.append(span)
      elif col == 'url':
        a['href'] = row[col]
        # Open link in a new tab
        a['target'] = '_blank'
        # Prevent tabnabbing!!
        a['rel'] = "noopener noreferrer"
        a.append(span)
        div_row.append(soup.new_tag('br'))
      else:
        div_row.append(span)
        div_row.append(soup.new_tag('br'))
    maincontent.append(div_row)

  # Write as UTF-8 to match the charset declared in the template
  with open(filename, 'w', encoding='utf-8') as file:
    file.write(soup.prettify())
  
  return


## Review results

- select only the new data (`forreview`)
- save the new data to `url_review.csv` for a by-hand review and for the Colab > local > GitHub > Colab cycle
- print and copy the new data to the `url_review.md` document and use this document to access web resources during analysis
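
The selection step can be sketched on its own, assuming that new records are the ones that still carry the `'url '` placeholder term. The function `split_results` and the sample records are hypothetical; the CSV text is built in memory here rather than written to `url_review.csv`:

```python
import pandas as pd

COLUMNS = ['term', 'title', 'url', 'inn', 'accessed', 'search']
PLACEHOLDER = 'url '  # term of records that have not been curated yet

def split_results(data):
  """Split search records into (new, known) dataframes by their term."""
  df = pd.DataFrame(data, columns=COLUMNS)
  is_new = df['term'] == PLACEHOLDER
  return df[is_new].reset_index(drop=True), df[~is_new].reset_index(drop=True)

# Made-up records in the shape returned by search_web()
sample = [
  ['url ', 'New page', 'https://example.org/new', 'example.org', '2021-12-03', 'demo+search'],
  ['Example 2021 report', 'Known page', 'https://example.org/report', 'example.org', '2021-12-03', 'demo+search'],
]
new_rows, known_rows = split_results(sample)

# With no path argument, to_csv() returns the CSV text instead of writing a file
csv_text = new_rows.to_csv(index=False)
print(csv_text)
```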

**To use CSV for preprocessing** (as of 2021-12-02)
```python
# Save data as CSV file
pd.DataFrame(forreview, columns=master.columns).to_csv('url_review.csv', index=False)

def print_md(data):
  """Print data for markdown document"""
  print(f"## Searched terms: `{data[0][5]}`")
  for row in data:
    print(f"- [{row[1]}]({row[2]}) (**`{row[3]}`** | {row[0]})")

# Copy data to a markdown file in GitHub
print_md(forreview)
```

In [9]:
# Select only newly identified web resources (those with the 'url ' placeholder term)
forreview = pd.DataFrame([row for row in data if row[0] == 'url '], columns=master.columns)

# Export data to HTML
export_html(forreview, 'url_review_20211203.html')

## Review data by hand now
- fetch the local repository
- download `url_review.csv` to the local repository
- review the new web resources and fill their terms
- save `url_review.csv` locally
- push changes to the GitHub repository

In [None]:
# Check terms
master.loc[master.term.apply(lambda s: 'google' in s.lower()), ['term', 'url']]

index | term | url
- | - | -
5 | url Google Books search00 | https://books.google.com/books/about/The_Data_...
6 | url Google Books search01 | https://books.google.com/books/about/The_Data_...
10 | url Google Books search03 | https://books.google.com/books/about/The_Data_...


## Append reviewed data to master
- read `url_review.csv` (from the GitHub repository)
- append rows to the master
- check the master file
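
A minimal sketch of the append-and-check steps on toy data. Skipping URLs that the master already holds is an assumption added for this illustration, not something the notebook does; the helper `append_reviewed` and the toy frames are hypothetical:

```python
import pandas as pd

def append_reviewed(master, reviewed):
  """Append reviewed rows to master, skipping URLs the master already holds."""
  # The skip step is an assumption of this sketch; it keeps re-runs from adding duplicates
  fresh = reviewed.loc[~reviewed['url'].isin(master['url'])]
  return pd.concat([master, fresh], ignore_index=True)

# Made-up frames reduced to two columns for brevity
toy_master = pd.DataFrame({'term': ['A'], 'url': ['https://example.org/a']})
toy_reviewed = pd.DataFrame({
  'term': ['A', 'B'],
  'url': ['https://example.org/a', 'https://example.org/b'],
})

updated = append_reviewed(toy_master, toy_reviewed)
print(updated.shape)
print(updated.duplicated(['term', 'url']).sum())
```

With this filter, appending the same reviewed file twice leaves the master unchanged the second time.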

In [None]:
# Read reviewed data
reviewed = pd.read_csv(path+'url_review.csv')
reviewed.shape

(23, 6)

In [None]:
# Append reviewed rows to master
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
master = pd.concat([master, reviewed], ignore_index=True)
master.shape

(42, 6)

In [None]:
master.duplicated(['term', 'url']).sum()

0

## Save updated master `url_master`
- save updated master `url_master`
- download `url_master` to the local repository
- push changes to the GitHub repository

In [None]:
# Save master as CSV file
master.to_csv('url_master.csv', index=False)

## Analyze the master file

In [None]:
master[['term','inn','title']].sort_values(by='term')

index | term | inn | title
- | - | - | -
5 | url Google Books search00 | books.google.com | The Data Revolution: Big Data, Open Data, Data...
6 | url Google Books search01 | books.google.com | The Data Revolution - Rob Kitchin - Google Books
10 | url Google Books search03 | books.google.com | The Data Revolution - Rob Kitchin - Google Books
22 | url Google Books search04 | books.google.com | The Data Revolution - Rob Kitchin - Google Books
23 | url Google Books search05 | books.google.com | The Data Revolution - Rob Kitchin - Google Books
24 | url Google Books search06 | books.google.com | The Data Revolution: A Critical Analysis of Bi...
1 | url Kitchin 2014 Amazon item | methods.sagepub.com | Big Data, Open Data, Data Infrastructures & Th...
40 | url Kitchin 2014 AwesomeBooks item | www.awesomebooks.com | A Critical Analysis of Big Data, Open Data and...
30 | url Kitchin 2014 Biblio items | www.biblio.com | 9781446287484 - The Data Revolution - Biblio.com
29 | url Kitchin 2014 Bokus item | www.bokus.com | The Data Revolution - Rob Kitchin - Häftad (97...


In [None]:
# Check duplicated terms
master.loc[master.term.duplicated()]

index | term | title | url | inn | accessed | search
- | - | - | - | - | - | -
25 | url Kitchin 2014 SurveillanceSociety bookreview | Book review: Kitchin, R. 2014. 'The Data Revol... | https://openresearch.surrey.ac.uk/esploro/outp... | openresearch.surrey.ac.uk | 2021-12-02 | kitchin+data+revolution
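
`Series.duplicated()` with its default `keep='first'` flags only the later occurrences of a repeated term, so the first record that shares the term is not listed. A small sketch on made-up data shows how `keep=False` lists every member of a duplicate group, which may be handier when deciding by hand which record to rename:

```python
import pandas as pd

# Made-up frame in which the term 'A' appears twice
toy = pd.DataFrame({
  'term': ['A', 'B', 'A'],
  'url': ['https://example.org/1', 'https://example.org/2', 'https://example.org/3'],
})

# Default keep='first': only the second 'A' row is flagged
later_only = toy.loc[toy.term.duplicated()]

# keep=False: every row that shares its term with another row is flagged
all_members = toy.loc[toy.term.duplicated(keep=False)]

print(later_only.index.tolist())
print(all_members.index.tolist())
```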
