### <font color="red">NOTE</font>

This was developed/tested in Google Colab, Python 3.10.12.

Also tested locally in VS Code, Python 3.9.16

# Project: Extracting BCCR CRC/USD exchange rates (from different Costa Rican entities) using web scraping

## Objective

The Central Bank of Costa Rica (BCCR) publishes the latest CRC/USD exchange rate for many official financial entities at:

https://gee.bccr.fi.cr/IndicadoresEconomicos/Cuadros/frmConsultaTCVentanilla.aspx


The page also offers the option of checking the same information for previous dates.

The objective of this project is to produce a file that contains the most up-to-date CRC/USD exchange rate for the different entities. The file can later be used as input data in other data science projects, for example as a new column merged into other tables/dataframes using the date index.
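
As an illustration of that intended use, here is a minimal sketch of joining the exchange rate onto another table by date. The `sales` table and all numeric values are made up for the example; only the column names `date`, `ent_name`, `dollar_buy` and `dollar_sale` match the dataframe built later in this notebook.

```python
import pandas as pd

# Made-up exchange-rate rows, shaped like the dataframe this project produces.
rates = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-05", "2024-08-05", "2024-08-06"]),
    "ent_name": ["Banco de Costa Rica", "Banco Nacional de Costa Rica", "Banco de Costa Rica"],
    "dollar_buy": [514.0, 514.0, 515.0],
    "dollar_sale": [528.0, 528.0, 529.0],
})

# Hypothetical table from another project, with amounts in USD.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-05", "2024-08-06"]),
    "amount_usd": [100.0, 250.0],
})

# Keep a single entity so each date appears once, then join on the date.
bcr = rates[rates["ent_name"] == "Banco de Costa Rica"][["date", "dollar_buy"]]
merged = sales.merge(bcr, on="date", how="left")
merged["amount_crc"] = merged["amount_usd"] * merged["dollar_buy"]
print(merged)
```

Filtering to one entity before merging avoids duplicating rows of the other table, since the exchange rate file holds one row per entity per date.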



## Methodology

The code follows these steps to fulfill the objective:


1. Download the historical exchange rate file. Selenium is needed because a click on an image is required
2. Extract the table using Pandas `pd.read_html`
3. Get the headers and data from the table
4. Store the items from the previous step in a Pandas dataframe
5. Back up the dataframe to a file.  In a real scenario, that file would probably be produced only once a month
6. Using BeautifulSoup, scrape the latest exchange rate data from the page.  It is updated daily by the bank (BCCR) and contains the last 30 days
7. Load the scraped data into a dataframe
8. Repeat steps 3-4 on the scraped dataframe
9. Load the exchange rate file produced in step 5 into another dataframe
10. Add the newest records found in the dataframe from step 7 to the dataframe from step 9
11. Overwrite the file from step 5 with the dataframe from the previous step
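
Steps 9 to 11 can be sketched as follows. This is only one possible approach, not necessarily what the notebook ends up doing: the helper name `append_new_records` is hypothetical, the sample rows are made up, and it assumes that one row per (`date`, `ent_name`) pair is wanted.

```python
import pandas as pd

def append_new_records(history: pd.DataFrame, fresh: pd.DataFrame) -> pd.DataFrame:
    """Return history plus the rows of fresh, with one row per (date, ent_name)."""
    combined = pd.concat([history, fresh], ignore_index=True)
    # If the same entity/date pair appears twice, keep the most recently scraped row.
    combined = combined.drop_duplicates(subset=["date", "ent_name"], keep="last")
    return combined.sort_values(["date", "ent_name"]).reset_index(drop=True)

# Made-up sample data: the 2024-08-11 row is present in both dataframes.
history = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-10", "2024-08-11"]),
    "ent_name": ["Banco de Costa Rica", "Banco de Costa Rica"],
    "dollar_buy": [518.0, 518.0],
})
fresh = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-11", "2024-08-12"]),
    "ent_name": ["Banco de Costa Rica", "Banco de Costa Rica"],
    "dollar_buy": [518.0, 518.0],
})

updated = append_new_records(history, fresh)
print(len(updated))  # 3 rows: the overlapping 2024-08-11 row is kept once
```

The result could then be written back with `updated.to_parquet(...)` to overwrite the file from step 5.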


---
## Description of packages and libraries used

os - file operations such as renaming or getting a file's creation time\
requests - web scraping of newer results\
shutil - high-level file operations, used to easily copy a file as a backup\
fastparquet - needed to produce a parquet file\
matplotlib - basic plotting\
pandas - dataframe operations\
numpy - required by Pandas\
datetime - timestamp operations\
python-magic / magic - needed to check/confirm file format\
io - needed to wrap the html string for the pandas html read\
pytz - timezone operations for datetime\
apt install -qq chromium-chromedriver - installed at OS level, as this is required by Selenium

---
## Findings/Lessons learned






### Initial page inspection

In the previous project, it was decided to download the historical data file of the official BCCR exchange rate.

The downloadable file that the bank provides here only contains data for one specific date.

However, browser inspection showed that each time a date is selected and applied, a POST request (with no query string) is sent back to the same URL with a distinctive parameter in the payload. That parameter turned out to be the number of days elapsed since 2000-01-01:

For example, for 2024-08-05 the parameter seen is:

**EVENTARGUMENT: 8983**

The assumption can be validated as follows:



```
from datetime import date
(date(2024, 8, 5) - date(2000, 1, 1)).days

8983
```
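
The same relationship can be wrapped in two small helpers that convert in both directions. This is a sketch: the names `BCCR_EPOCH`, `date_to_event_argument` and `event_argument_to_date` are made up here, and the 2000-01-01 reference date is the one inferred above, not something documented by the bank.

```python
from datetime import date, timedelta

# Reference date inferred from browser inspection (an assumption, not documented by the bank).
BCCR_EPOCH = date(2000, 1, 1)

def date_to_event_argument(day: date) -> int:
    """Number of days between the reference date and the requested day."""
    return (day - BCCR_EPOCH).days

def event_argument_to_date(arg: int) -> date:
    """Inverse conversion, handy for checking which day a payload points at."""
    return BCCR_EPOCH + timedelta(days=arg)

print(date_to_event_argument(date(2024, 8, 5)))  # 8983
print(event_argument_to_date(8983))              # 2024-08-05
```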

There is no need to download the XLS file for each day; we can just request the page for each day we need.   Therefore, Selenium is not needed here.

However, that parameter is not the only one in the payload, which is actually about 13 KB long.   For that reason, the payload, expressed as non-URL-encoded JSON, is stored in a text file, except for that specific parameter. This makes it easier to scrape multiple past days by just iterating on that value.
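
A minimal sketch of that idea: load the stored payload once as a template, and produce a per-date copy in which only `__EVENTARGUMENT` changes. The helper name `build_payload` is hypothetical, and the tiny `template` below is a stand-in for the real file contents, which are much longer.

```python
import copy
import json
from datetime import date

def build_payload(template: dict, day: date) -> dict:
    """Return a copy of the template with __EVENTARGUMENT set for the given day."""
    payload = copy.deepcopy(template)  # leave the template untouched between iterations
    payload["__EVENTARGUMENT"] = str((day - date(2000, 1, 1)).days)
    return payload

# Stand-in for the JSON text stored in the payload file (the real values are opaque and long).
template = json.loads('{"__EVENTTARGET": "Calendar1", "__VIEWSTATE": "opaque-value"}')

payload = build_payload(template, date(2024, 8, 5))
print(payload["__EVENTARGUMENT"])  # 8983
```

Working on a copy keeps the template reusable, so a loop over many dates never depends on the state left behind by a previous iteration.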


### Pandas reading from html object

The idea is NOT to download html files at all, but instead to work directly with the html content held in the response's `.text` attribute (`page_request.text`).

*   `io.StringIO` must be used, as in `pd.read_html(io.StringIO(str(page_request.text)))`, in order to avoid the following warning:

> FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.


*   By default, read_html interprets a comma as a thousands separator.  In this case we need to disable that (using the argument thousands=None, together with decimal=','), as the bank uses the comma as decimal separator.
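
Both points can be seen on a tiny hand-written table. This is a sketch: the HTML below and the entity names in it are made up, and it assumes `lxml` (or another parser supported by pandas) is installed, as in this notebook.

```python
import io
import pandas as pd

# Made-up table that uses a comma as the decimal separator, like the bank's page.
html = """
<table>
  <tr><th>Entidad</th><th>Compra</th><th>Venta</th></tr>
  <tr><td>Entidad A</td><td>518,00</td><td>532,50</td></tr>
  <tr><td>Entidad B</td><td>520,50</td><td>534,00</td></tr>
</table>
"""

# StringIO avoids the FutureWarning; thousands=None plus decimal=',' gives proper floats.
df = pd.read_html(io.StringIO(html), thousands=None, decimal=",", header=0)[0]
print(df.dtypes)
print(df)
```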



### Data Wrangling




### Data backup storage saving

Code was written to generate a backup of the generated file as soon as it is downloaded.

Testing backups in different file formats showed that, in terms of storage space, html is the worst choice.   Parquet is the format that takes the least space, so it is the one chosen for backups.
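
One way to take such a backup is to copy the file with `shutil` and put a timestamp in the copy's name. The helper name `backup_with_timestamp` and the timestamp layout are choices made for this sketch, not what the notebook currently does. The demonstration runs in a temporary directory with a placeholder file.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def backup_with_timestamp(src: Path, backup_dir: Path) -> Path:
    """Copy src into backup_dir, adding a timestamp to the file name."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = backup_dir / f"{src.stem}_{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also tries to preserve file metadata
    return dest

# Demonstration in a throwaway directory, using a placeholder instead of a real parquet file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "bccr_dol_exch_entities.parquet"
    src.write_bytes(b"placeholder content")
    dest = backup_with_timestamp(src, Path(tmp) / "backups")
    backup_created = dest.exists()
    backup_name = dest.name
    print(backup_name)
```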



---


## Enhancement opportunities


1.   File generation and download

By inspecting the extracted html, it was found that the file can actually be generated directly by modifying the query string values of the following URL

https://gee.bccr.fi.cr/indicadoreseconomicos/Cuadros/frmVerCatCuadro.aspx?CodCuadro=400&Idioma=1&FecInicial=2023/01/01&FecFinal=2024/07/19&Filtro=0&Exportar=True&Excel=True

The initial and end dates can be modified at will to get the necessary file.

This avoids the logic (with Selenium) that simulates clicking the button to generate the file, so the code is smaller and quicker.
\
2.   Add code to download the full historical file once a month.
\
3.   Add code to check whether the updated exchange file is missing more than just one record (today). If it is, complete the missing records from the freshly scraped data, which should be collected every day.
\
4.   Clean up the backup logic; the files at the end look a bit disorganized.  Some need a timestamp in the filename.
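
For the first item, the export URL can be assembled from a start and end date instead of being edited by hand. This is a sketch: the function name `build_export_url` is made up, and the parameter names and fixed values are simply copied from the URL shown in item 1. Their exact meaning has not been verified.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://gee.bccr.fi.cr/indicadoreseconomicos/Cuadros/frmVerCatCuadro.aspx"

def build_export_url(start: date, end: date) -> str:
    """Build the export URL for the given date range."""
    params = {
        "CodCuadro": 400,
        "Idioma": 1,
        "FecInicial": start.strftime("%Y/%m/%d"),
        "FecFinal": end.strftime("%Y/%m/%d"),
        "Filtro": 0,
        "Exportar": "True",
        "Excel": "True",
    }
    # safe="/" keeps the slashes inside the dates unescaped, as in the URL found by inspection.
    return f"{BASE_URL}?{urlencode(params, safe='/')}"

export_url = build_export_url(date(2023, 1, 1), date(2024, 7, 19))
print(export_url)
```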






---

# Project Code





## Packages, Libraries, Constants

Packages installation takes about 2 minutes

In [192]:
!pip -V

pip 24.2 from /home/milos/Documents/Proyectos/CienciaDeDatos/BCCR-tcdolar-entidades/venv/lib/python3.9/site-packages/pip (python 3.9)


In [193]:
!pip install --upgrade pip



In [250]:
!pip install --quiet matplotlib python-magic fastparquet requests pandas lxml
# !apt-get update
# !apt install chromium-chromedriver
import os ,requests, shutil, time, magic, io, lxml, json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, date, timedelta
from io import StringIO
import pytz


In [251]:
# Constants
DATASETS_PATH = './datasets'
DATASETS_TEMP_PATH = f'{DATASETS_PATH}/temp'
DATASETS_BACKUP_PATH = f'{DATASETS_PATH}/backups'
CURRENT_DATASET_BASEFILENAME = 'bccr_dol_exch_entities'
JSON_FILE_NAME = 'payload_json_unencoded.txt'

## Checking existence of current dataset file

If it does not exist, then create the directory where it will later be placed, with data from the past 365 days






In [252]:
# prompt: need an if statement, that if CURRENT_DATASET_BASEFILENAME exists, print OK, if not print NOK and also create the DATASETS_PATH path

if os.path.exists(f'{DATASETS_PATH}/{CURRENT_DATASET_BASEFILENAME}.parquet'):
  print ("OK")
else:
  print ("NOK")
  os.makedirs(DATASETS_PATH, exist_ok=True)


OK


In [253]:
if os.path.exists(f'{DATASETS_PATH}/{CURRENT_DATASET_BASEFILENAME}.parquet'):
  print(f"OK, {DATASETS_PATH}/{CURRENT_DATASET_BASEFILENAME}.parquet already exists")
else:
  print(f"There is no historical data file {DATASETS_PATH}/{CURRENT_DATASET_BASEFILENAME}.parquet")
  os.makedirs(DATASETS_PATH, exist_ok=True)


OK, ./datasets/bccr_dol_exch_entities.parquet already exists


## Scraping today's entities dollar exchange rate

In [254]:
# Scraping html of the most recent data from the bank webpage

# Configuration

host='gee.bccr.fi.cr'
urlpath='IndicadoresEconomicos/Cuadros/frmConsultaTCVentanilla.aspx'
# Sending some headers to try and hide the scraper default values
hdrs={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
      'Host' : f'{host}',
      'Accept-Language': 'en-US,en;q=0.9,es;q=0.8,es-CR;q=0.7,de;q=0.6',
      'Accept-Encoding': 'gzip, deflate, br, zstd'
      }

url_page = f'https://{host}/{urlpath}'

In [255]:
# Performing the actual request.  Today's data is done with a GET request
page_request = requests.get(url_page, headers=hdrs)

In [256]:
# some interesting options that could be used
page_request.status_code, page_request.reason ,  page_request.ok , page_request.url , \
page_request.headers['Content-Length'], page_request.headers['Date'] , page_request.encoding, \
 page_request.headers['Content-Type']

(200,
 'OK',
 True,
 'https://gee.bccr.fi.cr/IndicadoresEconomicos/Cuadros/frmConsultaTCVentanilla.aspx',
 '13258',
 'Mon, 12 Aug 2024 08:08:34 GMT',
 'utf-8',
 'text/html; charset=utf-8')

In [257]:
page_request.text[:300]

'\r\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\r\n<HTML>\r\n\t<HEAD>\r\n\t\t<title>Tipo de cambio anunciado en ventanilla</title>\r\n\t\t<script type="text/javascript">var _gaq = _gaq || [];_gaq.push([\'_setAccount\', \'UA-25040215-3\']);_gaq.push([\'_trackPageview\']);(function() {var ga = document.'

### Building the dataframe from scraped html

In [258]:
# test with basic cleaning
pd.read_html(io.StringIO(str(page_request.text)) , thousands=None,
                            decimal=',' , header=0 )[2].head(3)

Unnamed: 0,Tipo de Entidad,Entidad Autorizada,Compra,Venta,Diferencial Cambiario,Última Actualización
0,Bancos públicos,Banco de Costa Rica,518.0,532.0,14.0,10/08/2024 12:03 a.m.
1,,Banco Nacional de Costa Rica,518.0,532.0,14.0,09/08/2024 03:39 p.m.
2,,Banco Popular y de Desarrollo Comunal,520.0,534.0,14.0,07/08/2024 12:50 p.m.


In [259]:
[s for s in pytz.all_timezones if 'Costa' in s]

['America/Costa_Rica']

In [260]:
datetime.now(pytz.timezone('America/Costa_Rica')).strftime('%Y-%m-%d')

'2024-08-12'

In [261]:
# -setting the header as the resulting line with index 0
# -recognizing the original decimal char as the comma, so dataframe is shown as usual with it as a dot
# -dropping rows made of NaN at every column
# -name the columns properly
# -extend the ent_type for those with NaN
# -apply datetime format to previous_updt column
# -insert 'date' column with current date (in Costa Rica as the bank is in that country)
#  as for all rows.

if page_request.ok:
  print('Request OK')
  cols=['ent_type', 'ent_name', 'dollar_buy','dollar_sale','b_s_diff','previous_updt']
  df_dol_ent = pd.read_html(io.StringIO(str(page_request.text)) , thousands=None,
                            decimal=',' , header=0 )[2]

  df_dol_ent.dropna(axis = 0, how = 'all', inplace = True)
  df_dol_ent.columns = cols
  df_dol_ent.ffill( inplace=True)
  df_dol_ent['previous_updt'] = pd.to_datetime(df_dol_ent['previous_updt'] , format='mixed',dayfirst=True)
  # df_dol_ent.insert(loc=0, column = 'dateUTC', value =  datetime.today().strftime('%Y-%m-%d'))
  today_CostaRica = datetime.now(pytz.timezone('America/Costa_Rica')).strftime('%Y-%m-%d')
  df_dol_ent.insert(loc=0, column = 'date', value = today_CostaRica)
  df_dol_ent['date']=pd.to_datetime(df_dol_ent['date'])

else:
  print('Request was NOT OK, received status code', page_request.status_code)

df_dol_ent.tail(7)
# df_dol_ent.head(7)


Request OK


Unnamed: 0,date,ent_type,ent_name,dollar_buy,dollar_sale,b_s_diff,previous_updt
31,2024-08-12,Casas de Cambio,Casa de Cambio Global Exchange,441.6,617.05,175.45,2024-08-09 21:12:00
32,2024-08-12,Casas de Cambio,Casa de Cambio Teledolar S. A.,518.0,542.0,24.0,2024-08-12 00:19:00
33,2024-08-12,Puestos de Bolsa,"BCT Valores, Puesto De Bolsa, S.A.",519.0,537.0,18.0,2024-08-09 13:58:00
34,2024-08-12,Puestos de Bolsa,"BN Valores S.A., Puesto de Bolsa",520.0,534.0,14.0,2024-08-09 09:03:00
35,2024-08-12,Puestos de Bolsa,Mercado Valores de Costa Rica Puesto de Bolsa,518.0,536.0,18.0,2024-08-07 12:42:00
36,2024-08-12,Puestos de Bolsa,PB Inversiones SAMA,517.0,533.0,16.0,2024-08-09 16:30:00
37,2024-08-12,Puestos de Bolsa,"Popular Valores, Puesto de Bolsa",521.0,535.0,14.0,2024-08-08 10:34:00


In [262]:
df_dol_ent.ent_type.unique() , df_dol_ent.ent_type.nunique() , df_dol_ent.ent_type.value_counts()

(array(['Bancos públicos', 'Bancos privados', 'Financieras',
        'Mutuales de Vivienda', 'Cooperativas', 'Casas de Cambio',
        'Puestos de Bolsa'], dtype=object),
 7,
 ent_type
 Bancos privados         11
 Cooperativas             9
 Puestos de Bolsa         5
 Financieras              4
 Casas de Cambio          4
 Bancos públicos          3
 Mutuales de Vivienda     2
 Name: count, dtype: int64)

In [263]:
df_dol_ent.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38 entries, 0 to 37
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           38 non-null     datetime64[ns]
 1   ent_type       38 non-null     object        
 2   ent_name       38 non-null     object        
 3   dollar_buy     38 non-null     float64       
 4   dollar_sale    38 non-null     float64       
 5   b_s_diff       38 non-null     float64       
 6   previous_updt  38 non-null     datetime64[ns]
dtypes: datetime64[ns](2), float64(3), object(2)
memory usage: 2.4+ KB


## Scraping entities dollar exchange rate from a previous date

This section's purpose is to understand and confirm how the POST request for previous days' data must be made.

In [264]:
!ls pay*

payload_json_unencoded.txt


In [265]:
# this one did not work as data is already encoded and also NOT in json format
# with open('payload_json.txt','r') as f:
  # payld =  f'{f.read()}'
# payld
# type(payld)


In [266]:
# # get text dict from file, and then convert to a true dict object
# with open('payload_json_unencoded.txt','r') as f:
#   payld =  f.read()

# import json
# payld = json.loads(payld)
# payld
# # type(payld)

In [267]:
## Config is mostly the same as in the previous section. Adding the payload needed for the
## POST request that is done instead of the GET request done before

# url_page = f'https://{host}/{urlpath}'

# When inspecting the browser behaviour, this request header is sent
# Content-Type: application/x-www-form-urlencoded

hdrs={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
      'Host' : f'{host}',
      'Accept-Language': 'en-US,en;q=0.9,es;q=0.8,es-CR;q=0.7,de;q=0.6',
      'Accept-Encoding': 'gzip, deflate, br, zstd',
      # 'Content-Type': 'application/x-www-form-urlencoded'  #removed, requests library takes care of it
      }

# payload = { '__EVENTTARGET' : 'Calendar1' , '__EVENTARGUMENT' : '8983' }

# with open('payload_json_unencoded.txt','r') as f:
#   payld =  f.read()
# import json
# payld = json.loads(payld)

# taken from recursively scraping section
if os.path.exists(JSON_FILE_NAME):
  print(f'Payload {JSON_FILE_NAME} exists locally, loading it into a variable...')
  with open(JSON_FILE_NAME,'r') as f:
    payld =  f.read()
  payld = json.loads(payld)
else:
  print(f'Payload {JSON_FILE_NAME} does not exist locally, loading it into a variable from github rawfile...')
  jsonfilereq = requests.get('https://raw.githubusercontent.com/lemilosm/bccr_dol_exc_entities_rate_history_webscraping/main/payload_json_unencoded.txt')
  if jsonfilereq.ok:
    payld = json.loads(jsonfilereq.text)
    #storing the downloaded json data into a file for future runs to have it
    with open(JSON_FILE_NAME,'w') as f:
      json.dump(payld, f) # writes valid JSON (double quotes, proper escaping) for future loads
    print(f'{JSON_FILE_NAME} saved locally')
  else:
    print('Payload json file could not be read from github either.')

# argument that controls the date, as described at the project's intro
past_date_to_scrape = date(2024, 8, 5)
past_date_to_scrape_inDays = (past_date_to_scrape - date(2000, 1, 1)).days

payld['__EVENTARGUMENT'] = past_date_to_scrape_inDays
# "__EVENTARGUMENT": "8983"

past_date_to_scrape_inDays , payld['__EVENTTARGET'] ,payld['__VIEWSTATEGENERATOR'] , payld['__EVENTARGUMENT']

Payload payload_json_unencoded.txt exists locally, loading it into a variable...


(8983, 'Calendar1', '5CF5411C', 8983)

In [268]:
# Performing the actual request.  Previous day's data is done with a POST request
page_request_prev = requests.post(url_page, headers=hdrs, data= payld )

In [269]:
# req headers
page_request_prev.request.headers

{'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate, br, zstd', 'Accept': '*/*', 'Connection': 'keep-alive', 'Host': 'gee.bccr.fi.cr', 'Accept-Language': 'en-US,en;q=0.9,es;q=0.8,es-CR;q=0.7,de;q=0.6', 'Content-Length': '13457', 'Content-Type': 'application/x-www-form-urlencoded'}

In [270]:
# some interesting options that could be used, like response headers
page_request_prev.status_code, page_request_prev.reason ,  page_request_prev.ok , page_request_prev.url , \
page_request_prev.headers['Content-Length'], page_request_prev.headers['Date'] , page_request_prev.encoding, \
 page_request_prev.headers['Content-Type']

(200,
 'OK',
 True,
 'https://gee.bccr.fi.cr/IndicadoresEconomicos/Cuadros/frmConsultaTCVentanilla.aspx',
 '13008',
 'Mon, 12 Aug 2024 08:08:37 GMT',
 'utf-8',
 'text/html; charset=utf-8')

In [271]:
# It is confirmed it works, as the returned page reports that the table is for the date we selected manually
pd.read_html(io.StringIO(str(page_request_prev.text)) , thousands=None,
                          decimal=','  )[1][0][0]

'lunes, 5 de agosto de 2024'
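
That date string can also be checked programmatically, so that a response for the wrong day is not stored by mistake. This is a sketch: the helper name `parse_page_date` is made up, and the month table (including the `setiembre` spelling) is an assumption about what the page may return. Only the August example has actually been seen here.

```python
from datetime import date

# Spanish month names; both spellings of September are included as a precaution.
SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "setiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

def parse_page_date(text: str) -> date:
    """Parse a string like 'lunes, 5 de agosto de 2024' into a date."""
    # Drop the weekday, then split '5 de agosto de 2024' on ' de '.
    _, _, rest = text.partition(",")
    day, month_name, year = [part.strip() for part in rest.strip().split(" de ")]
    return date(int(year), SPANISH_MONTHS[month_name.lower()], int(day))

requested = date(2024, 8, 5)
reported = parse_page_date("lunes, 5 de agosto de 2024")
print(reported == requested)  # True
```

Comparing `reported` against the requested date before building the dataframe would catch cases where the server answers with status 200 but a different day.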

### Building the dataframe from scraped html

In [272]:
past_date_to_scrape.strftime('%Y-%m-%d')

'2024-08-05'

In [273]:
# reusing the code we already had for scraping today's exchange rate

if page_request_prev.ok:
  print('Request OK')
  cols=['ent_type', 'ent_name', 'dollar_buy','dollar_sale','b_s_diff','previous_updt']
  df_dol_ent_prev = pd.read_html(io.StringIO(str(page_request_prev.text)) , thousands=None,
                            decimal=',' , header=0 )[2]

  df_dol_ent_prev.dropna(axis = 0, how = 'all', inplace = True)
  df_dol_ent_prev.columns = cols
  df_dol_ent_prev.ffill( inplace=True)
  df_dol_ent_prev['previous_updt'] = pd.to_datetime(df_dol_ent_prev['previous_updt'] , format='mixed',dayfirst=True)
  # df_dol_ent_prev.insert(loc=0, column = 'dateUTC', value =  datetime.today().strftime('%Y-%m-%d'))
  today_CostaRica = datetime.now(pytz.timezone('America/Costa_Rica')).strftime('%Y-%m-%d')
  df_dol_ent_prev.insert(loc=0, column = 'date', value = past_date_to_scrape.strftime('%Y-%m-%d'))
  df_dol_ent_prev['date']=pd.to_datetime(df_dol_ent_prev['date'])

else:
  print('Request was NOT OK, received status code', page_request_prev.status_code)

df_dol_ent_prev.tail(7)

Request OK


Unnamed: 0,date,ent_type,ent_name,dollar_buy,dollar_sale,b_s_diff,previous_updt
31,2024-08-05,Casas de Cambio,Casa de Cambio Global Exchange,437.34,610.37,173.03,2024-08-01 23:13:00
32,2024-08-05,Casas de Cambio,Casa de Cambio Teledolar S. A.,514.0,536.0,22.0,2024-08-05 13:25:00
33,2024-08-05,Puestos de Bolsa,"BCT Valores, Puesto De Bolsa, S.A.",514.0,532.0,18.0,2024-08-05 08:20:00
34,2024-08-05,Puestos de Bolsa,"BN Valores S.A., Puesto de Bolsa",514.0,528.0,14.0,2024-08-05 08:21:00
35,2024-08-05,Puestos de Bolsa,Mercado Valores de Costa Rica Puesto de Bolsa,512.0,530.0,18.0,2024-08-05 08:51:00
36,2024-08-05,Puestos de Bolsa,PB Inversiones SAMA,513.0,529.0,16.0,2024-08-05 08:19:00
37,2024-08-05,Puestos de Bolsa,"Popular Valores, Puesto de Bolsa",514.0,528.0,14.0,2024-08-05 09:45:00


## Iteratively scraping entities dollar exchange rate from many different days before today

Now that it is known how scraping a past day works, it's time to scrape many different days in bulk

In [274]:
datetime.today().strftime('%Y-%m-%d') , datetime.today().date()  ,  date(2000, 1, 1)

('2024-08-12', datetime.date(2024, 8, 12), datetime.date(2000, 1, 1))

In [275]:
## Config is mostly the same as in the previous section. Adding the payload needed for the
## POST request that is done instead of the GET request done before

# url_page = f'https://{host}/{urlpath}'

# When inspecting the browser behaviour, this request header is sent
# Content-Type: application/x-www-form-urlencoded

hdrs={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
      'Host' : f'{host}',
      'Accept-Language': 'en-US,en;q=0.9,es;q=0.8,es-CR;q=0.7,de;q=0.6',
      'Accept-Encoding': 'gzip, deflate, br, zstd',
      # 'Content-Type': 'application/x-www-form-urlencoded'  #removed, requests library takes care of it
      }

# with open('payload_json_unencoded.txt','r') as f:
#   payld =  f.read()
# import json
# payld = json.loads(payld)

if os.path.exists(JSON_FILE_NAME):
  print(f'Payload {JSON_FILE_NAME} exists locally, loading it into a variable...')
  with open(JSON_FILE_NAME,'r') as f:
    payld =  f.read()
  payld = json.loads(payld)
else:
  print(f'Payload {JSON_FILE_NAME} does not exist locally, loading it into a variable from github rawfile...')
  jsonfilereq = requests.get('https://raw.githubusercontent.com/lemilosm/bccr_dol_exc_entities_rate_history_webscraping/main/payload_json_unencoded.txt')
  if jsonfilereq.ok:
    payld = json.loads(jsonfilereq.text)
    #storing the downloaded json data into a file for future runs to have it
    with open(JSON_FILE_NAME,'w') as f:
      json.dump(payld, f) # writes valid JSON (double quotes, proper escaping) for future loads
    print(f'{JSON_FILE_NAME} saved locally')
  else:
    print('Payload json file could not be read from github either.')


Payload payload_json_unencoded.txt exists locally, loading it into a variable...


In [276]:
type(payld), payld['__EVENTTARGET'] , payld['__VIEWSTATEGENERATOR']

(dict, 'Calendar1', '5CF5411C')

In [277]:
type(payld)

dict

In [278]:
# payld

In [279]:
# TEST:  today's EVENTARGUMENT days number.  We have to repeatedly subtract the number of days
# to scrape from this number
(datetime.today().date() - date(2000, 1, 1)).days

8990

In [280]:
# TEST  example of one previous date EVENTARGUMENT days number
(date(2024, 8, 5) - date(2000, 1, 1)).days

8983

In [281]:
# Performing the actual requests in a loop.  Previous days' data is fetched with a POST request

# Defining the number of previous days to scrape, and today's eventarg days number
PREV_DAYS_TO_SCRAPE = 18
todays_eventArgument = (datetime.today().date() - date(2000, 1, 1)).days
oldest_date_to_extract = (datetime.today().date() - timedelta(days=PREV_DAYS_TO_SCRAPE)).strftime('%Y-%m-%d')

# initializing dataframe to store all the data
df_dol_ent_prev_days = pd.DataFrame()

for d in range(1,(PREV_DAYS_TO_SCRAPE+1)):
# Constructing the full json dict with proper argument
  past_eventArgument = todays_eventArgument - d
  print(f'Payload __EVENTARGUMENT {past_eventArgument}' ,  end = ' -- '  )
  # adding current iteration _EVENTARGUMENT key and value to the payload data
  payld['__EVENTARGUMENT'] = past_eventArgument
  # date_evaluated in YYYY-mm-dd format
  date_evaluated = (datetime.today().date() - timedelta(days=d)).strftime('%Y-%m-%d')
  # print(payld) ,  print(type(payld))
# Executing the POST request
  page_request_prev = requests.post(url_page, headers=hdrs, data= payld )

# reusing the code we already had for scraping today's exchange rate

  if page_request_prev.ok:
    print(f'Request OK for day {date_evaluated}\n---------')
    # cols=['date', 'dollar_buy','dollar_sale']
    cols=['ent_type', 'ent_name', 'dollar_buy','dollar_sale','b_s_diff','previous_updt']
    df_dol_ent_prev_temp = pd.read_html(io.StringIO(str(page_request_prev.text)) , thousands=None,
                              decimal=',' , header=0 )[2]

    df_dol_ent_prev_temp.dropna(axis = 0, how = 'all', inplace = True)
    df_dol_ent_prev_temp.columns = cols
    df_dol_ent_prev_temp.ffill( inplace=True)
    df_dol_ent_prev_temp['previous_updt'] = pd.to_datetime(df_dol_ent_prev_temp['previous_updt'] , format='mixed',dayfirst=True)
    # df_dol_ent_prev_temp.insert(loc=0, column = 'dateUTC', value =  datetime.today().strftime('%Y-%m-%d'))
    today_CostaRica = datetime.now(pytz.timezone('America/Costa_Rica')).strftime('%Y-%m-%d')
    df_dol_ent_prev_temp.insert(loc=0, column = 'date', value = date_evaluated  )
    df_dol_ent_prev_temp['date']=pd.to_datetime(df_dol_ent_prev_temp['date'])
# Concatenating result to main dataframe  df_dol_ent_prev_days
    df_dol_ent_prev_days = pd.concat([df_dol_ent_prev_days,df_dol_ent_prev_temp])

  else:
    print(f'Request was NOT OK for {date_evaluated}, received status code', page_request_prev.status_code)

# Final report of the extraction loop
earliest_date_stored = df_dol_ent_prev_days['date'].unique().min().strftime('%Y-%m-%d')
# number of days successfully extracted
succ_days = len(df_dol_ent_prev_days['date'].unique())
print(f'\n>>>Done, dataframe stored from yesterday and back to {earliest_date_stored} \
with {df_dol_ent_prev_days.shape[0]} rows. {succ_days} days were successfully extracted')


Payload __EVENTARGUMENT 8989 -- Request OK for day 2024-08-11
---------
Payload __EVENTARGUMENT 8988 -- Request OK for day 2024-08-10
---------
Payload __EVENTARGUMENT 8987 -- Request OK for day 2024-08-09
---------
Payload __EVENTARGUMENT 8986 -- Request OK for day 2024-08-08
---------
Payload __EVENTARGUMENT 8985 -- Request OK for day 2024-08-07
---------
Payload __EVENTARGUMENT 8984 -- Request OK for day 2024-08-06
---------
Payload __EVENTARGUMENT 8983 -- Request OK for day 2024-08-05
---------
Payload __EVENTARGUMENT 8982 -- Request OK for day 2024-08-04
---------
Payload __EVENTARGUMENT 8981 -- Request OK for day 2024-08-03
---------
Payload __EVENTARGUMENT 8980 -- Request OK for day 2024-08-02
---------
Payload __EVENTARGUMENT 8979 -- Request OK for day 2024-08-01
---------
Payload __EVENTARGUMENT 8978 -- Request OK for day 2024-07-31
---------
Payload __EVENTARGUMENT 8977 -- Request OK for day 2024-07-30
---------
Payload __EVENTARGUMENT 8976 -- Request OK for day 2024-07-29
--

## NOT WORKING for dates more than 14 days ago

A way is needed to obtain the proper '__EVENTVALIDATION' and '__VIEWSTATE' keys for the date that needs to be extracted
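
One direction that could be explored is sketched below. ASP.NET Web Forms pages carry `__VIEWSTATE` and `__EVENTVALIDATION` as hidden `<input>` fields, so fresh values can be read from the html of a previous response instead of relying on the static payload file. Whether this actually lifts the 14-day limit has NOT been tested here. The class and function names are made up, and the sample html is a stand-in for `page_request.text`.

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""

def extract_hidden_fields(html: str) -> dict:
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields

# Stand-in for page_request.text; the real values are long opaque strings.
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="visible" value="ignored" />
</form>
"""
fields = extract_hidden_fields(sample_html)
print(sorted(fields))
```

The extracted values could then replace the matching keys in `payld` before the next POST. If the calendar only accepts dates near the month currently displayed, several chained requests may be needed to step further back, which is also an untested assumption.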

In [283]:
df_dol_ent_prev_days.info()

<class 'pandas.core.frame.DataFrame'>
Index: 532 entries, 0 to 37
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           532 non-null    datetime64[ns]
 1   ent_type       532 non-null    object        
 2   ent_name       532 non-null    object        
 3   dollar_buy     532 non-null    float64       
 4   dollar_sale    532 non-null    float64       
 5   b_s_diff       532 non-null    float64       
 6   previous_updt  532 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(3), object(2)
memory usage: 33.2+ KB


In [284]:
df_dol_ent_prev_days['date'].unique().max() , df_dol_ent_prev_days['date'].unique().min()

(Timestamp('2024-08-11 00:00:00'), Timestamp('2024-07-29 00:00:00'))

In [285]:
df_dol_ent_prev_days

Unnamed: 0,date,ent_type,ent_name,dollar_buy,dollar_sale,b_s_diff,previous_updt
0,2024-08-11,Bancos públicos,Banco de Costa Rica,518.0,532.0,14.0,2024-08-10 00:03:00
1,2024-08-11,Bancos públicos,Banco Nacional de Costa Rica,518.0,532.0,14.0,2024-08-09 15:39:00
2,2024-08-11,Bancos públicos,Banco Popular y de Desarrollo Comunal,520.0,534.0,14.0,2024-08-07 12:50:00
3,2024-08-11,Bancos privados,Banco BAC San José S.A.,522.0,536.0,14.0,2024-08-09 08:24:00
4,2024-08-11,Bancos privados,Banco BCT S.A.,519.0,537.0,18.0,2024-08-09 13:11:00
...,...,...,...,...,...,...,...
33,2024-07-29,Puestos de Bolsa,"BCT Valores, Puesto De Bolsa, S.A.",515.0,533.0,18.0,2024-07-23 15:25:00
34,2024-07-29,Puestos de Bolsa,"BN Valores S.A., Puesto de Bolsa",516.5,530.0,13.5,2024-07-29 08:24:00
35,2024-07-29,Puestos de Bolsa,Mercado Valores de Costa Rica Puesto de Bolsa,514.5,532.0,17.5,2024-07-24 09:12:00
36,2024-07-29,Puestos de Bolsa,PB Inversiones SAMA,515.0,531.0,16.0,2024-07-29 08:38:00


## Backing up entities dollar exchange rate values at dataset parquet file

In [286]:
# Now let's update the exchange data file, using the updated dataframe

df_dol_ent_prev_days.to_parquet(f'{DATASETS_PATH}/{CURRENT_DATASET_BASEFILENAME}.parquet')

current_dataset_file_creationTstamp = datetime.fromtimestamp (os.path.getctime(f'{DATASETS_PATH}/{CURRENT_DATASET_BASEFILENAME}.parquet')).strftime('%Y-%m-%d %H:%M')

print(f'Current dataset file has been updated: {DATASETS_PATH}/{CURRENT_DATASET_BASEFILENAME}.parquet \ncreation time: {current_dataset_file_creationTstamp} \nCurrent date-time is:', datetime.today().strftime('%Y-%m-%d %H:%M'))

Current dataset file has been updated: ./datasets/bccr_dol_exch_entities.parquet 
creation time: 2024-08-12 02:08 
Current date-time is: 2024-08-12 02:08
