# Config

This notebook provides examples for the module `config`, which contains functions for working with scrape configurations.


# Initialization

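The following code imports `scrape`, which makes `scrape.config` available. It adds the directory two levels above the notebook's working directory to `sys.path` and assumes that this directory contains the scrape package.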

In [3]:
import os
import sys
CURR_DIR = os.path.dirname(os.path.abspath('..'))
print('Current dir: ' + CURR_DIR)
sys.path.append(CURR_DIR)
import scrape

Current dir: D:\Projects\Python\projects\scrape
Initializing scrape ...


# Working with configurations
Scrape uses configurations to provide information for scraping a specific website. A configuration is a dictionary with the following keys:

+ url: Data on website URL
+ driver: Data on browser driver
+ website: Data on website HTML layout
+ dataset: Data on dataset output file handling
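
As a rough illustration of this layout, the sketch below builds a plain dictionary with the four top-level keys and checks that they are all present. The names `REQUIRED_KEYS` and `has_required_keys` are hypothetical and not part of scrape; the nested dictionaries are left empty here.

```python
# Hypothetical illustration of the top-level config layout (not scrape's own code).
REQUIRED_KEYS = ('url', 'driver', 'website', 'dataset')


def has_required_keys(config):
    """Return True if every expected top-level key is present."""
    return all(key in config for key in REQUIRED_KEYS)


example_config = {
    'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/',  # website URL
    'driver': {},   # browser driver data
    'website': {},  # HTML layout data
    'dataset': {},  # output file handling data
}

print(has_required_keys(example_config))
print(has_required_keys({'url': ''}))
```

The first call prints `True`, the second prints `False` because three of the four keys are missing.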


### Creating a configuration
The function `create` creates a config dictionary. The optional parameter `url` specifies the website URL.

In [4]:
aurl = r'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'
aconfig = scrape.config.create(aurl)

**url**

The `url` key holds a string with the URL of the website to scrape. It mainly serves as a placeholder and as an example of a URL that can be scraped with that configuration.

In [5]:
aconfig['url']

'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'

**driver**

Driver contains a dictionary with data on the browser driver. 

Name|Type|Description
:-|:-|:-
package|str|`requests` or `selenium`
headers|dict|HTTP request headers
filename|str|Dataset filename
log|str|Log filename
timeout|int|Timeout in seconds
sleep|dict|Sleep durations in seconds
headless|bool|Headless mode: true (default) or false

The folder separator in filenames can be represented by either a double backslash `\\` or a slash `/`.
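
For comparison, the sketch below shows that Python's standard library treats both spellings as the same Windows path. It uses `pathlib.PureWindowsPath` and a made-up filename; it illustrates the general idea only and does not show how scrape itself parses filenames.

```python
from pathlib import PureWindowsPath

# Two spellings of the same (made-up) Windows path.
with_backslashes = PureWindowsPath('data\\output\\scrape.log')
with_slashes = PureWindowsPath('data/output/scrape.log')

# Both are split into the same path components.
print(with_backslashes == with_slashes)
print(with_slashes.parts)
```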

In [6]:
#Driver key-values
for k,v in aconfig['driver'].items():
    if k !='headers':
        print(k + ':' + str(v))

package:requests
filename:
log:
timeout:10
sleep:{'1': 1, '200': 3600}
headless:True


In [7]:
#Driver headers dictionary
aconfig['driver']['headers']

{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'accept-language': 'nl,en-US;q=0.7,en;q=0.3',
 'accept-encoding': 'gzip, deflate, br'}
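
To give a sense of how such driver settings are commonly used, the sketch below attaches a headers dictionary to an HTTP request object without sending anything. It deliberately swaps in the standard library's `urllib.request` so that it runs without extra packages; scrape itself uses `requests` or `selenium`, and how it applies these values internally is not shown in this notebook. The shortened `driver` dictionary is an assumption based on the output above.

```python
from urllib.request import Request

# Shortened, hypothetical copy of the driver settings shown above.
driver = {
    'timeout': 10,
    'headers': {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
        'accept-language': 'nl,en-US;q=0.7,en;q=0.3',
    },
}
url = 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'

# Build the request object; nothing is sent over the network here.
request = Request(url, headers=driver['headers'])

# urllib stores header names capitalized, e.g. 'User-agent'.
print(request.get_header('User-agent'))

# To actually send it, one could call:
# urllib.request.urlopen(request, timeout=driver['timeout'])
```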

**website**

Website contains a dictionary with data on the website HTML layout.

Name|Type|Description
:-|:-|:-
parser|str | 'html.parser'
id| str | 
rows| dict  | 
columns| dict |  
page| dict | 

In [8]:
aconfig['website']

{'parser': 'html.parser',
 'id': '',
 'rows': {'elem': {}, 'class': {}},
 'columns': {},
 'page': {'id': {}, 'elem': {}, 'class': {}}}

**dataset**

Dataset contains a dictionary with data on dataset output file handling.

Name|Type|Description
:-|:-|:-
filename|str|Filename of the output file
date_format|str|Date format
overwrite|bool|Overwrite the existing file if true (default: false)

The folder separator in filenames can be represented by either a double backslash `\\` or a slash `/`.

In [9]:
aconfig['dataset']

{'filename': '', 'date_format': '%Y%m%d', 'overwrite': False}
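
The default `date_format` is a `strftime`-style pattern. The sketch below shows what that pattern produces for a fixed date. The filename scheme is hypothetical: this notebook does not show how scrape combines `date_format` with `filename`.

```python
from datetime import date

date_format = '%Y%m%d'         # default value shown above
run_date = date(2021, 11, 28)  # fixed date so the result is reproducible

# Format the date with the configured pattern.
stamp = run_date.strftime(date_format)

# Hypothetical naming scheme, for illustration only.
filename = 'dataset_' + stamp + '.csv'
print(filename)
```

This prints `dataset_20211128.csv`.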

## Config functions
The module contains the following functions: `read`, `write` and `process`.

In [10]:
#Write config to file
filename = 'config.csv'
scrape.config.write(filename, aconfig)
os.path.exists(filename)

#Read config from file
bconfig = scrape.config.read(filename)
bconfig

{'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/',
 'driver': {'package': 'requests',
  'headers': {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
   'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
   'accept-language': 'nl,en-US;q=0.7,en;q=0.3',
   'accept-encoding': 'gzip, deflate, br'},
  'filename': '',
  'log': '',
  'timeout': 10,
  'sleep': {'1': 1, '200': 3600},
  'headless': True},
 'website': {'parser': 'html.parser',
  'id': '',
  'rows': {'elem': {}, 'class': {}},
  'columns': {},
  'page': {'id': {}, 'elem': {}, 'class': {}}},
 'dataset': {'filename': '', 'date_format': '%Y%m%d', 'overwrite': False}}
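
The write and read round trip above can be mimicked with the standard library. The sketch below stores a shortened, hypothetical config as JSON in a temporary folder and reads it back. JSON is an assumption made for the example only; the file format that `scrape.config.write` actually uses is not shown in this notebook.

```python
import json
import os
import tempfile

# Shortened, hypothetical config based on the output above.
config = {
    'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/',
    'driver': {'package': 'requests', 'timeout': 10, 'sleep': {'1': 1, '200': 3600}},
    'dataset': {'filename': '', 'date_format': '%Y%m%d', 'overwrite': False},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'config.json')

    # Write the config to file.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(config, f, indent=2)

    # Read the config back from file.
    with open(path, encoding='utf-8') as f:
        restored = json.load(f)

# The restored dictionary equals the original one.
print(restored == config)
```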

A config can also be processed. Processing is a form of validation: unrecognized keys are removed and missing keys are added with their default values.

In [11]:
# Add an unrecognized key
aconfig['non-key'] = {}
# Remove the driver key
aconfig.pop('driver')

# Process the config (modifies aconfig in place)
scrape.config.process(aconfig)
aconfig

{'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/',
 'website': {'parser': 'html.parser',
  'id': '',
  'rows': {'elem': {}, 'class': {}},
  'columns': {},
  'page': {'id': {}, 'elem': {}, 'class': {}}},
 'dataset': {'filename': '', 'date_format': '%Y%m%d', 'overwrite': False},
 'driver': {'package': 'requests',
  'headers': {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
   'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
   'accept-language': 'nl,en-US;q=0.7,en;q=0.3',
   'accept-encoding': 'gzip, deflate, br'},
  'filename': '',
  'log': '',
  'timeout': 10,
  'sleep': {'1': 1, '200': 3600},
  'headless': True}}
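
The behaviour seen in the output above (the unrecognized key is gone and `driver` is restored with default values) can be approximated as follows. This is a minimal sketch under stated assumptions, not scrape's implementation: `DEFAULTS` and `process_sketch` are hypothetical names, the defaults are shortened, and only top-level keys are handled. Whether `scrape.config.process` also validates nested keys is not shown in this notebook.

```python
import copy

# Hypothetical defaults, shortened from the output shown above.
DEFAULTS = {
    'url': '',
    'driver': {'package': 'requests', 'timeout': 10, 'headless': True},
    'website': {'parser': 'html.parser'},
    'dataset': {'filename': '', 'date_format': '%Y%m%d', 'overwrite': False},
}


def process_sketch(config, defaults=DEFAULTS):
    """Remove unrecognized top-level keys and add missing ones, in place."""
    for key in list(config):  # iterate over a copy because keys are deleted
        if key not in defaults:
            del config[key]
    for key, value in defaults.items():
        if key not in config:
            config[key] = copy.deepcopy(value)  # do not share the defaults
    return config


# 'https://example.com/' is a placeholder URL.
config = {'url': 'https://example.com/', 'non-key': {}}
process_sketch(config)
print(sorted(config))
```

Copying the defaults with `copy.deepcopy` keeps later changes to one config from leaking into the shared defaults.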

# Versions

In [12]:
%reload_ext watermark
%watermark

Last updated: 2021-11-28T16:47:40.732052+01:00

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.19.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores   : 8
Architecture: 64bit

