# How to read data from CSV files and store it in an Oracle database

In [1]:
# Set up the location of the API relative to this notebook
import sys
sys.path.append('../../../')

In [2]:
# Import the module for connecting to an Oracle database
from esgmatching.db_engine.engines.connector_oracle import OracleConnector

In [3]:
# Import the modules for file management
from esgmatching.file_reader.file import File
from esgmatching.file_reader.file_reader_csv import FileReaderCsv

In [4]:
# Import the modules for the etl processing: reading, transformation and loading data to a database
from esgmatching.processing.etl_processing import EtlProcessing

## 1. Database setup

The database connector is represented by the class OracleConnector. The following properties must be set:
- client_driver_dir: Directory of the Oracle client library
- username: Name of a user with permission to access the database
- user_password: Password of that user
- host_url: URL of the Oracle database server
- port_number: Port number used to access the database server
- service_name: Database service name
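The log output further down in this notebook shows that the connector talks to Oracle through SQLAlchemy. As a rough illustration of how the properties above fit together, the sketch below assembles a SQLAlchemy-style URL from them. The helper `build_oracle_url` is hypothetical, and the use of the cx_Oracle dialect is an assumption; this is not the actual code of OracleConnector.

```python
from urllib.parse import quote_plus


def build_oracle_url(username, user_password, host_url, port_number, service_name):
    """Assemble a SQLAlchemy URL for Oracle (hypothetical helper, cx_Oracle dialect assumed)."""
    # Escape characters such as '@' or '/' that would otherwise break the URL
    user = quote_plus(username)
    password = quote_plus(user_password)
    return 'oracle+cx_oracle://{}:{}@{}:{}/?service_name={}'.format(
        user, password, host_url, port_number, service_name)


# Placeholder values, not the settings used in this notebook
print(build_oracle_url('scott', 'p@ss/word', 'db.example.com', '1521', 'ORCL'))
```

Escaping the username and password matters because a raw `@` or `/` in either would be read as a URL delimiter.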

In [5]:
# The database connector is represented by the class OracleConnector 
db_conn = OracleConnector()

In [6]:
# Set up the properties
# The driver directory is a raw string so the backslashes in the Windows path are kept literally
db_conn.client_driver_dir = r'C:\oracle\instantclient_21_3'
db_conn.username = 'admin'
db_conn.user_password = 'oraclebnp'
db_conn.host_url = 'esgmatching.ctqjxnfdw57h.eu-central-1.rds.amazonaws.com'
db_conn.port_number = '1521'
db_conn.service_name = 'DATABASE'
db_conn.show_sql_statement = True
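Outside of a demo, credentials are usually kept out of the notebook itself. The following is a minimal sketch of one way to do that with environment variables. The variable names (`ORACLE_USER` and so on) and the helper `read_oracle_settings` are assumptions made for illustration; the library does not define them.

```python
import os


def read_oracle_settings(env=None):
    """Collect connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    required = ['ORACLE_USER', 'ORACLE_PASSWORD', 'ORACLE_HOST', 'ORACLE_SERVICE']
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError('Missing environment variables: {}'.format(', '.join(missing)))
    return {
        'username': env['ORACLE_USER'],
        'user_password': env['ORACLE_PASSWORD'],
        'host_url': env['ORACLE_HOST'],
        # 1521 is the default port of the Oracle listener
        'port_number': env.get('ORACLE_PORT', '1521'),
        'service_name': env['ORACLE_SERVICE'],
    }
```

The keys of the returned dictionary match the connector's property names, so they could be applied with `setattr(db_conn, name, value)` in a loop.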

In [7]:
# The connect() method of the OracleConnector establishes a connection with the database.
db_conn.connect()

In [8]:
# Check whether the connection was established
db_conn.is_connected()

True

## 2. File setup

In [9]:
# Settings for Referential 1
file1_settings = '../../../tests/data/oracle/test_referential1_oracle.json'
file1_settings

'../../../tests/data/oracle/test_referential1_oracle.json'

In [10]:
# Create a file object
file_obj = File(file1_settings)

In [11]:
# Checking some properties of the File object
print('Filename:{}, Json Settings:{}'.format(file_obj.filename, file_obj.filename_settings))

Filename:../../../tests/data/test_referential1.csv, Json Settings:../../../tests/data/oracle/test_referential1_oracle.json


## 3. Read a csv file and load its content to the database

The Esg-Entity-Matching library provides a FileReaderCsv that parses the content of CSV files.
It also provides an EtlProcessing object that combines the file, the connector and the reader to run the complete pipeline: reading the data, transforming it and loading it into a database.
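To picture the read, transform and load steps without an Oracle server, here is a small stand-alone sketch that uses only the standard library and an in-memory SQLite database. It is an analogy, not the implementation of EtlProcessing: the function `load_csv_to_db` and the two transformations (upper-casing column names, turning empty strings into NULL) are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3


def load_csv_to_db(csv_text, conn, table_name):
    """Read CSV text, normalise it lightly and load it into a new table."""
    # Extract: parse the header and the data rows
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    # Transform: upper-case the column names, turn empty strings into NULLs
    columns = [name.strip().upper() for name in header]
    cleaned = [[value if value != '' else None for value in row] for row in rows]
    # Load: create the table and insert every row
    column_sql = ', '.join('"{}"'.format(c) for c in columns)
    placeholders = ', '.join('?' for _ in columns)
    conn.execute('CREATE TABLE "{}" ({})'.format(table_name, column_sql))
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table_name, placeholders), cleaned)
    conn.commit()
    return len(cleaned)


sample = ('unique_id,isin,company,country\n'
          '1,SK1120005824,CENTRAL PERK,SK\n'
          '2,,HONEYDUKES,UNITED STATES OF AMERICA\n')
conn = sqlite3.connect(':memory:')
print(load_csv_to_db(sample, conn, 'ESG_MATCH_REF'))  # number of rows loaded
```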

In [12]:
# Create a file reader object for CSV files
csv_reader_obj = FileReaderCsv()  

In [13]:
# Create an ETL process object
etl_proc_obj = EtlProcessing(db_conn)

In [14]:
# Call the load_file_to_db() method, passing the File and FileReaderCsv objects
# (the OracleConnector was already given to EtlProcessing above)
# The ETL process returns a database source object
db_source = etl_proc_obj.load_file_to_db(file_obj, csv_reader_obj)

2022-01-26 12:02:44,880 INFO sqlalchemy.engine.Engine select sys_context( 'userenv', 'current_schema' ) from dual
2022-01-26 12:02:44,881 INFO sqlalchemy.engine.Engine [raw sql] {}
2022-01-26 12:02:45,001 INFO sqlalchemy.engine.Engine SELECT value FROM v$parameter WHERE name = 'compatible'
2022-01-26 12:02:45,005 INFO sqlalchemy.engine.Engine [raw sql] {}
2022-01-26 12:02:45,053 INFO sqlalchemy.engine.Engine select value from nls_session_parameters where parameter = 'NLS_NUMERIC_CHARACTERS'
2022-01-26 12:02:45,055 INFO sqlalchemy.engine.Engine [raw sql] {}
2022-01-26 12:02:45,173 INFO sqlalchemy.engine.Engine SELECT table_name FROM all_tables WHERE table_name = :name AND owner = :schema_name
2022-01-26 12:02:45,174 INFO sqlalchemy.engine.Engine [generated in 0.00128s] {'name': 'ESG_MATCH_REF', 'schema_name': 'ADMIN'}
2022-01-26 12:02:45,254 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2022-01-26 12:02:45,256 INFO sqlalchemy.engine.Engine 
CREATE TABLE "ESG_MATCH_REF" (
	"UNIQUE_ID" N

## 4. Report on the ETL process

In [15]:
# Print the ETL processing report
etl_proc_obj.print_report()

-------------------------------------- ETL PROCESSING REPORT ---------------------------------------
Description: Details of the ETL process performed on [DS_REF] data source.
Datetime:2022-01-26 12:02:46
----------------------------------------------------------------------------------------------------
File Name: ../../../tests/data/test_referential1.csv
Columns in the File: 4
Columns read from File: 4
Lines Extracted from File: 9


## 5. Checking the attribute names of DbDataSource

The DbDataSource object offers three methods to check its column (attribute) names:
1. Use get_original_attribute_names() to retrieve the original attribute names of the columns in the CSV file
2. Use get_attribute_names() to retrieve the attribute names of the database table
3. Use get_primary_keys() to retrieve the attribute names of the primary keys in the database table
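When the names in the file and in the table differ, it can help to list the pairs side by side. The helper below is a hypothetical sketch that is not part of the library; it assumes both lists are returned in the same column order.

```python
def renamed_attributes(original_names, table_names):
    """Return (original, table) pairs whose names differ (hypothetical helper)."""
    # Assumes both lists describe the same columns in the same order
    return [(orig, new) for orig, new in zip(original_names, table_names) if orig != new]


# Made-up example: the first column was renamed while loading
print(renamed_attributes(['Unique Id', 'ISIN'], ['UNIQUE_ID', 'ISIN']))
```

An empty list means every column kept its original name.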

In [16]:
# Retrieve the original attribute names (read from the csv file)
db_source.get_original_attribute_names()

['UNIQUE_ID', 'ISIN', 'COMPANY', 'COUNTRY']

In [17]:
# Retrieve the attribute names of the database table
db_source.get_attribute_names()

['UNIQUE_ID', 'ISIN', 'COMPANY', 'COUNTRY']

In [18]:
# Retrieve the attribute names of the primary keys in the database table
db_source.get_primary_keys()

['UNIQUE_ID']

## 6. Checking the Data Source

In [19]:
print('Data Source Name: {}, Table name: {}'.format(db_source.name, db_source.table_name))

Data Source Name: DS_REF, Table name: ESG_MATCH_REF


In [20]:
# Total entries of the table
result = db_source.get_total_entries()
print('Total entries in table {} = {}'.format(db_source.table_name, result))

2022-01-26 12:02:54,014 INFO sqlalchemy.engine.Engine SELECT count(*) AS count_1 
FROM "ESG_MATCH_REF"
2022-01-26 12:02:54,015 INFO sqlalchemy.engine.Engine [generated in 0.00133s] {}
Total entries in table ESG_MATCH_REF = 9


In [21]:
# Total entries of the table for a given column (NULL values are not counted)
result = db_source.get_total_entries_by_column('ISIN')
print('Total entries by ISIN = {}'.format(result))

2022-01-26 12:02:54,495 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2022-01-26 12:02:54,501 INFO sqlalchemy.engine.Engine SELECT anon_1.count_1 
FROM (SELECT count("ESG_MATCH_REF"."ISIN") AS count_1 
FROM "ESG_MATCH_REF") anon_1 
WHERE ROWNUM <= 1
2022-01-26 12:02:54,502 INFO sqlalchemy.engine.Engine [generated in 0.00202s] {}
2022-01-26 12:02:54,548 INFO sqlalchemy.engine.Engine ROLLBACK
Total entries by ISIN = 7


In [22]:
# Total entries of the table by a column name with distinct values
result = db_source.get_total_entries_by_column('ISIN', distinct_values=True)
print('Total entries by ISIN with distinct values = {}'.format(result))

2022-01-26 12:02:56,466 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2022-01-26 12:02:56,471 INFO sqlalchemy.engine.Engine SELECT anon_1.count_1 
FROM (SELECT count(DISTINCT "ESG_MATCH_REF"."ISIN") AS count_1 
FROM "ESG_MATCH_REF") anon_1 
WHERE ROWNUM <= 1
2022-01-26 12:02:56,472 INFO sqlalchemy.engine.Engine [generated in 0.00114s] {}
2022-01-26 12:02:56,516 INFO sqlalchemy.engine.Engine ROLLBACK
Total entries by ISIN with distinct values = 6
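The three totals above (9, 7 and 6) differ because `COUNT(*)` counts every row, `COUNT(column)` skips NULL values, and `COUNT(DISTINCT column)` additionally collapses duplicates. The sketch below reproduces this with the ISIN values of this table in an in-memory SQLite database; SQLite is used here only so the example runs without an Oracle server, and the table name `ref` is made up.

```python
import sqlite3

# ISIN values of the nine rows in ESG_MATCH_REF (two are missing, one is repeated)
isins = ['SK1120005824', None, None, 'GB00B1YW4409', 'CH0012221716',
         'US0200021014', 'US0231351067', 'US0126531013', 'US0126531013']

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ref (unique_id INTEGER PRIMARY KEY, isin TEXT)')
conn.executemany('INSERT INTO ref (isin) VALUES (?)', [(value,) for value in isins])

total, by_isin, distinct_isin = conn.execute(
    'SELECT COUNT(*), COUNT(isin), COUNT(DISTINCT isin) FROM ref').fetchone()
print(total, by_isin, distinct_isin)  # → 9 7 6
```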


## 7. Checking the content of the DbDataSource

The get_data() method of the DbDataSource object performs a full select on the table and returns a list of tuples. Each item of the list is a row of the table, and each element of a tuple is the value of one column.

In [23]:
# Query all the values of the table
# Equivalent to SELECT * FROM TABLE_NAME
lst_result = db_source.get_data()
lst_result

2022-01-26 12:02:57,779 INFO sqlalchemy.engine.Engine SELECT "ESG_MATCH_REF"."UNIQUE_ID", "ESG_MATCH_REF"."ISIN", "ESG_MATCH_REF"."COMPANY", "ESG_MATCH_REF"."COUNTRY" 
FROM "ESG_MATCH_REF"
2022-01-26 12:02:57,780 INFO sqlalchemy.engine.Engine [generated in 0.00148s] {}


[(1.0, 'SK1120005824', 'CENTRAL PERK', 'SK'),
 (2.0, None, 'HONEYDUKES', 'UNITED STATES OF AMERICA'),
 (3.0, None, 'STARCOURT MALL', 'AUSTRIA'),
 (4.0, 'GB00B1YW4409', 'STERLING COOPER', 'GBR'),
 (5.0, 'CH0012221716', 'Bluth company', 'CHE'),
 (6.0, 'US0200021014', 'InGen', 'usa'),
 (7.0, 'US0231351067', 'Stark Industries', 'us'),
 (8.0, 'US0126531013', 'SPECTRE', 'USA'),
 (9.0, 'US0126531013', 'SPECTRE 33 SUBSIDIARY', 'USA')]
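Plain tuples are compact but rely on column position. A minimal sketch for pairing each value with its attribute name is shown below. The helper `rows_to_dicts` is hypothetical, and it assumes the tuples follow the same column order as the attribute names, which matches the SELECT statement logged above.

```python
def rows_to_dicts(attribute_names, rows):
    """Turn a list of row tuples into a list of dicts keyed by attribute name."""
    return [dict(zip(attribute_names, row)) for row in rows]


names = ['UNIQUE_ID', 'ISIN', 'COMPANY', 'COUNTRY']
rows = [(1.0, 'SK1120005824', 'CENTRAL PERK', 'SK'),
        (2.0, None, 'HONEYDUKES', 'UNITED STATES OF AMERICA')]
records = rows_to_dicts(names, rows)
print(records[1]['COMPANY'])  # → HONEYDUKES
```

With the objects of this notebook, the call would presumably be `rows_to_dicts(db_source.get_attribute_names(), db_source.get_data())`.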

The get_data_as_df() method of the DbDataSource object also performs a select on the table, but returns the result as a pandas DataFrame.

In [24]:
# Query the table
df_result = db_source.get_data_as_df()
df_result

2022-01-26 12:03:00,678 INFO sqlalchemy.engine.Engine SELECT "ESG_MATCH_REF"."UNIQUE_ID", "ESG_MATCH_REF"."ISIN", "ESG_MATCH_REF"."COMPANY", "ESG_MATCH_REF"."COUNTRY" 
FROM "ESG_MATCH_REF"
2022-01-26 12:03:00,679 INFO sqlalchemy.engine.Engine [cached since 2.901s ago] {}


| | UNIQUE_ID | ISIN | COMPANY | COUNTRY |
|---|---|---|---|---|
| 0 | 1.0 | SK1120005824 | CENTRAL PERK | SK |
| 1 | 2.0 | | HONEYDUKES | UNITED STATES OF AMERICA |
| 2 | 3.0 | | STARCOURT MALL | AUSTRIA |
| 3 | 4.0 | GB00B1YW4409 | STERLING COOPER | GBR |
| 4 | 5.0 | CH0012221716 | Bluth company | CHE |
| 5 | 6.0 | US0200021014 | InGen | usa |
| 6 | 7.0 | US0231351067 | Stark Industries | us |
| 7 | 8.0 | US0126531013 | SPECTRE | USA |
| 8 | 9.0 | US0126531013 | SPECTRE 33 SUBSIDIARY | USA |


## 8. Drop the table using DbDataSource object

In [None]:
db_source.drop_table()

## 9. Close database connection

In [25]:
db_conn.disconnect()

In [26]:
db_conn.is_connected()

False
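
If a cell fails midway, the connection may stay open. One common way to guard against that is a context manager that always disconnects. The sketch below is an assumption about how this could be wrapped around a connector that exposes connect(), disconnect() and is_connected(), as used in this notebook. `open_connection` and `FakeConnector` are hypothetical names, and the fake class only stands in for OracleConnector so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def open_connection(connector):
    """Connect on entry and always disconnect on exit, even after an error."""
    connector.connect()
    try:
        yield connector
    finally:
        connector.disconnect()


class FakeConnector:
    """Stand-in exposing the three connector methods used in this notebook."""

    def __init__(self):
        self._connected = False

    def connect(self):
        self._connected = True

    def disconnect(self):
        self._connected = False

    def is_connected(self):
        return self._connected


conn = FakeConnector()
with open_connection(conn) as active:
    print(active.is_connected())  # → True
print(conn.is_connected())  # → False
```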