<a href="https://colab.research.google.com/github/lustraka/data-analyst-portfolio-project-2022/blob/main/code/20211111_Scrape_WebPages_Root.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape Web Pages From a Google Search
## Background
### Purpose
The purpose of this code is to scrape the results of a Google search and store them in an SQLite database.
### Input
Query: `data analytics portfolio projects`
### Output

Data set structure:

Variable | Description
-|-
id | A unique identifier starting with `WP`.
title | The title of the web page retrieved from the Google search results.
wp_url | The URL of the web page retrieved from the Google search results.
status | The status code of the `requests` response object.
status_ts | The timestamp of the web page scraping.
text | The raw text of the response (`body.text`) if `status == 200`, otherwise an empty string.
text_len | The length of the text extracted from the page.

The data set is named `wp_root` and is stored in the SQLite database `dapp2022.db`.


In [1]:
# Import dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

## Gather URLs Returned by the Google Search

In [2]:
# Gather results of the google search
goo_urls = [
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&oq=data&aqs=chrome.0.69i59j69i57j69i59l2j46i199i291i512j0i512j46i199i465i512l2j0i512j46i199i465i512.11687j0j15&sourceid=chrome&ie=UTF-8',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvIv9xMqnEX0ovFt11T6fuyM3M4Mfg:1636633175528&ei=VwqNYfGxH_6Oxc8P0p2iMA&start=10&sa=N&ved=2ahUKEwixubfYpZD0AhV-R_EDHdKOCAYQ8tMDegQIARA4&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvJxKspQaYOd2cHQYRiU1Sf8QhxyTg:1636633222254&ei=hgqNYczXDu2Fxc8P5q6tOA&start=20&sa=N&ved=2ahUKEwjMstvupZD0AhXtQvEDHWZXCwc4ChDy0wN6BAgBEDo&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvKSYC5n5JEBpCy1JS03A8TfWmEpDA:1636633242644&ei=mgqNYcnVJv-Jxc8Ph-yROA&start=30&sa=N&ved=2ahUKEwjJirj4pZD0AhX_RPEDHQd2BAc4FBDy0wN6BAgBEDw&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvJ3lAHBLk3KNhRouzNegD1l-pAc_A:1636633280596&ei=wAqNYf_fI9aGxc8P8cmyMA&start=40&sa=N&ved=2ahUKEwj_v8SKppD0AhVWQ_EDHfGkDAY4HhDy0wN6BAgBED0&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvIkIcBAQyog6AgIXogehS8vXyzFbQ:1636633296603&ei=0AqNYcuWJIuHxc8Pn7ipOA&start=50&sa=N&ved=2ahUKEwjLvpWSppD0AhWLQ_EDHR9cCgc4KBDy0wN6BAgBED8&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvIQJjKphf0kB1gq1GPaRGmIpt7pCQ:1636633311550&ei=3wqNYYGIIY6Sxc8P5IO0OA&start=60&sa=N&ved=2ahUKEwjB86WZppD0AhUOSfEDHeQBDQc4MhDy0wN6BAgBEEE&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvLmYcP9Aba0cs-KxtNJLa4gWnW0hw:1636633329250&ei=8QqNYYa-DumSxc8PjJ2WGA&start=70&sa=N&ved=2ahUKEwjG-t2hppD0AhVpSfEDHYyOBQM4PBDy0wN6BAgBEEQ&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvILi0x00qM4HGKBdbscmnlLzGWucw:1636633364526&ei=FAuNYffDH6aXxc8Pxv67OA&start=80&sa=N&ved=2ahUKEwj3nceyppD0AhWmS_EDHUb_Dgc4RhDy0wN6BAgBEEE&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvKCuhyPCC8b-XqKIughGMfRyUokpQ:1636633394668&ei=MguNYe6OKICXxc8PtuO0OA&start=90&sa=N&ved=2ahUKEwju7_bAppD0AhWAS_EDHbYxDQc4UBDy0wN6BAgBEEE&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvLmUEYA6FytS6bbDZuFqvnnm0eXKA:1636633424507&ei=UAuNYYqnHpOHxc8Pha26OA&start=100&sa=N&ved=2ahUKEwiKj5TPppD0AhWTQ_EDHYWWDgc4WhDy0wN6BAgBEEE&biw=1265&bih=1287&dpr=1',
            'https://www.google.com/search?q=data+analytics+portfolio+projects&rlz=1C1GCEA_enCZ869CZ869&sxsrf=AOaemvLySR9z6lLw5QkWZKZYUihdQ9rnng:1636633438638&ei=XguNYdSWHsGQxc8P-Mij8As&start=110&sa=N&ved=2ahUKEwjUverVppD0AhVBSPEDHXjkCL44ZBDy0wN6BAgBEEI&biw=1265&bih=1287&dpr=1',
            ]
print(f'Here are {len(goo_urls)} URLs with results of a google search.')

Here are 12 URLs with results of a google search.
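
The twelve URLs above differ mainly in the `start` offset (10, 20, up to 110). As a minimal sketch, the list could be generated rather than pasted. This assumes that Google accepts a request carrying only the `q` and `start` parameters and that the remaining parameters (`rlz`, `sxsrf`, `ei`, `ved`, and so on) are browser-session residue; that assumption is not verified here. `build_search_urls` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlencode

def build_search_urls(query, pages=12, page_size=10):
    """Return one search URL per result page (hypothetical helper)."""
    base = 'https://www.google.com/search'
    return [
        f"{base}?{urlencode({'q': query, 'start': page * page_size})}"
        for page in range(pages)
    ]

# Assumption: plain q/start parameters are enough to page through results
goo_urls_generated = build_search_urls('data analytics portfolio projects')
print(f'Here are {len(goo_urls_generated)} generated URLs.')
print(goo_urls_generated[1])
```

`urlencode` takes care of replacing the spaces in the query with `+`, so the query can be kept in readable form.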


## Scrape URLs Returned by the Google Search
**The algorithm in pseudocode**:
```
for each goo_url in goo_urls:
    extract title and wp_url of the search result
    for each wp_url extracted:
        get response and timestamp
        if response.status == 200:
            save body.text and its length
        else:
            save empty string and 0
        save gathered observation
identify observations and save the data set
```
**Algorithm details**:

To extract the title and URL of each web page found by the Google search:
- get the response of the `goo_url` and create a soup object
```python
gpage = requests.get(goo_url)
gsoup = BeautifulSoup(gpage.content, 'html.parser')
```
- find elements with search results within the `gsoup` (titles are in the **h3** tags)
```python
gsoup.find_all('h3')
```
- extract the title and URL related to each **h3** tag
```python
for h3 in gsoup.find_all('h3'):
    title = h3.text
    wp_url = re.search(r'q=(.*?)&', h3.parent['href']).group(1)
```
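
The regular expression above keeps the target URL in its percent-encoded form, which appears to be why a result such as WP5 is later requested as `.../watch%3Fv%3D...`. A minimal sketch of an alternative follows, assuming the `href` has the shape `/url?q=<target>&...` that the regular expression implies. `extract_target_url` is a hypothetical helper, and the `sa` and `ved` values in the sample `href` are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def extract_target_url(href):
    """Return the decoded value of the `q` parameter, or None if absent."""
    query = urlsplit(href).query
    values = parse_qs(query).get('q')
    return values[0] if values else None

# Sample hrefs (illustrative only)
print(extract_target_url('/url?q=https://www.youtube.com/watch%3Fv%3DqfyynHBFOsM&sa=U&ved=abc'))
print(extract_target_url('/search?hl=en'))
```

Returning `None` instead of raising lets the caller skip **h3** elements whose parent link does not carry a `q` parameter.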

To scrape the text of the web page on `wp_url`:
- get the response and its status code
```python
page = requests.get(wp_url)
status = page.status_code
status_ts = pd.Timestamp.today()
```
- extract the text and its length if the status code is 200 and the URL does not point to a PDF file; otherwise, impute empty values
```python
if status == 200 and wp_url[-3:] != "pdf":
  soup = BeautifulSoup(page.content, 'html.parser')
  if soup.body: # check that html.parser succeeded
    text = soup.body.text
    text_len = len(text)
  else:
    text, text_len = '', 0
else:
  text, text_len = '', 0
```
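
Checking the last three characters of the URL misses PDF files served from addresses without a `pdf` suffix, as well as other non-HTML content. A minimal sketch of a different technique follows: it inspects the `Content-Type` response header instead (available in `requests` as `page.headers.get('Content-Type')`). `is_html_response` is a hypothetical helper, not code from the notebook.

```python
def is_html_response(status, content_type):
    """Return True for a 200 response whose Content-Type is HTML."""
    if status != 200 or not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type in ('text/html', 'application/xhtml+xml')

print(is_html_response(200, 'text/html; charset=utf-8'))
print(is_html_response(200, 'application/pdf'))
print(is_html_response(404, 'text/html'))
```

A server can still send a misleading header, so the `soup.body` check remains a useful second guard.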

To store the results:
- append each observation to the list
- create a dataframe from the list
- initialize identifiers
- create an SQLAlchemy engine and an empty database
- store the dataframe in the database
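
The storage steps can be sketched end to end. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module with an in-memory database instead of pandas and SQLAlchemy, the two observations are made-up placeholders, and identifiers are assigned after collection with `enumerate`.

```python
import sqlite3
from datetime import datetime

# Placeholder observations: (title, wp_url, status, status_ts, text)
observations = [
    ('Example page A', 'https://example.com/a', 200, datetime.now().isoformat(), 'Hello'),
    ('Example page B', 'https://example.com/b', 404, datetime.now().isoformat(), ''),
]

# Initialize identifiers (WP1, WP2, ...) and compute the text length
rows = [
    (f'WP{n}', title, wp_url, status, status_ts, text, len(text))
    for n, (title, wp_url, status, status_ts, text) in enumerate(observations, start=1)
]

# Create an empty in-memory database (a stand-in for dapp2022.db)
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE wp_root (id TEXT PRIMARY KEY, title TEXT, wp_url TEXT, '
    'status INTEGER, status_ts TEXT, text TEXT, text_len INTEGER)'
)

# Store the rows and read back a summary
conn.executemany('INSERT INTO wp_root VALUES (?, ?, ?, ?, ?, ?, ?)', rows)
conn.commit()
stored = conn.execute('SELECT id, status, text_len FROM wp_root ORDER BY id').fetchall()
print(stored)
```

Reading the rows back right after the insert is a cheap way to confirm that the table has the expected columns and row count.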

In [9]:
# Initialize the list for observations
wp_root_list = []
id_offset = 1

for goo_url in goo_urls:
  # Get the response of the goo_url address
  gpage = requests.get(goo_url)
  gsoup = BeautifulSoup(gpage.content, 'html.parser')
  # Process all results of the search found in <h3> elements
  for h3 in gsoup.find_all('h3'):
    # Assign title and wp_url
    title = h3.text
    wp_url = re.search(r'q=(.*?)&', h3.parent['href']).group(1)
    # Get the response, its status, and the current time
    page = requests.get(wp_url)
    status = page.status_code
    status_ts = pd.Timestamp.today()
    # Process the response if it is an HTML page
    if status == 200 and wp_url[-3:] != "pdf":
      soup = BeautifulSoup(page.content, 'html.parser')
      if soup.body: # check that html.parser succeeded
        text = soup.body.text
        text_len = len(text)
      else:
        text, text_len = '', 0
    else:
      text, text_len = '', 0
    # Build the identifier (named wp_id to avoid shadowing the built-in id)
    wp_id = 'WP' + str(id_offset)
    id_offset += 1
    wp_root_list.append([wp_id, title, wp_url, status, status_ts, text, text_len])
    print(f'- [{wp_id}] [{title}]({wp_url}) : (status: {status} | len : {text_len})')
print(f'\n\nid_offset = {id_offset}.')

- [WP1] [9 Data Analytics Portfolio Examples [2021 Edition] - CareerFoundry](https://careerfoundry.com/en/blog/data-analytics/data-analytics-portfolio-examples/) : (status: 200 | len : 25376)
- [WP2] [How To Build A Data Analytics Portfolio [Complete Guide]](https://careerfoundry.com/en/blog/data-analytics/data-analyst-portfolio/) : (status: 200 | len : 17954)
- [WP3] [How to Build a Data Analyst Portfolio: Tips for Success | Coursera](https://www.coursera.org/articles/how-to-build-a-data-analyst-portfolio) : (status: 200 | len : 269057)
- [WP4] [How to Build an Impressive Data Analytics Portfolio | Springboard Blog](https://www.springboard.com/blog/data-analytics/data-analyst-portfolio/) : (status: 200 | len : 16842)
- [WP5] [Data Analyst Portfolio Project | SQL Data Exploration | Project 1/4](https://www.youtube.com/watch%3Fv%3DqfyynHBFOsM) : (status: 429 | len : 0)
- [WP6] [Guide to building a data analyst portfolio - Codecademy](https://www.codecademy.com/resources/blog/data-analys

In [4]:
# Check results
wp_root_list[-1]

['WP120',
 'Data Engineer, Group Portfolio Analytics & Reporting (1 year contract)',
 'https://www.efinancialcareers.fr/emploi-Singapour-Singapour-Data_Engineer_Group_Portfolio_Analytics__Reporting_1_year_contract.id12658463',
 200,
 Timestamp('2021-11-11 15:58:00.078262'),
 '\n\n\n\n            window.ssdl = window.ssdl || {};\n            window.ssdl.trackEvent = window.ssdl.trackEvent || function() {};\n        \n\n\n\n    .menu-item button.default {\n        width: 100%;\n        text-align: left;\n    }\n    .menu-item .js-msg-count:after { content: \'(\' attr(count) \')\'; }\n    .menu-item .js-msg-count[count=\'0\']:after { content: none; }\n\n    .handle .js-msg-count:after {\n        position: absolute;\n        top: 1rem;\n        left: 1.5rem;\n        background-color: #c20a0a;\n        background-color: var(--red);\n        color: #fff;\n        min-width: 1.1rem;\n        line-height: 1.1rem; \n        border-radius: 50%;\n        font-size: .8rem;\n        padding: 0.15r

In [5]:
# Create a dataframe
wp_root_df = pd.DataFrame(wp_root_list, columns=['id' ,'title', 'wp_url', 'status', 'status_ts', 'text', 'text_len'])
wp_root_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   id         120 non-null    object        
 1   title      120 non-null    object        
 2   wp_url     120 non-null    object        
 3   status     120 non-null    int64         
 4   status_ts  120 non-null    datetime64[ns]
 5   text       120 non-null    object        
 6   text_len   120 non-null    int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 6.7+ KB


In [8]:
# Store dataframe for further processing
from sqlalchemy import create_engine

# Create SQLAlchemy engine and empty database
engine = create_engine('sqlite:///dapp2022.db')

# Store dataframe in the database
wp_root_df.to_sql('wp_root', engine, index=True)

# Upload the file to GitHub !
# 2021-11-10 : DB has 14 MB
# 2021-11-11 : DB has  6 MB