# How to... read data from csv files into a database

In [1]:
# Sets up the location of the api relative to this notebook 
import sys
sys.path.append('../../../')

In [2]:
# Import the module for accessing a database
from esgmatching.dbmanager.SqlEngine import SqlEngine

In [3]:
# Import the module for reading csv files
from esgmatching.reader.FileReaderCsvToDB import FileReaderCsvToDB

## 1. Database setup

In [4]:
# Location of the database to be created, relative to this Jupyter notebook
# The database will be created in the /data/database folder, under the project main folder (EntityMatching)
path_db = '../../../data/database/'

In [5]:
# Connection string used for SQLite. Other databases might require different information.
# In this example the connection string is a combination of [sqlite prefix] + [database path] + [database name]
str_connection = 'sqlite:///' + path_db + 'entitymatching.db'
str_connection

'sqlite:///../../../data/database/entitymatching.db'
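The same pattern extends to absolute paths. The sketch below assumes the SQLAlchemy URL convention for SQLite (the `sqlite:///` prefix followed by the file path, so an absolute POSIX path yields four slashes in total). `build_sqlite_url` is a hypothetical helper, not part of esgmatching, and the `/tmp/...` folder is only a placeholder.

```python
import posixpath


def build_sqlite_url(folder: str, db_name: str) -> str:
    """Hypothetical helper: join a folder and a file name into a SQLite URL."""
    # posixpath.join keeps forward slashes regardless of the operating system
    return 'sqlite:///' + posixpath.join(folder, db_name)


# Relative path, as used in this notebook
relative_url = build_sqlite_url('../../../data/database/', 'entitymatching.db')

# Absolute path (placeholder folder): the leading '/' adds a fourth slash
absolute_url = build_sqlite_url('/tmp/data/database', 'entitymatching.db')

print(relative_url)
print(absolute_url)
```

A relative path is resolved against the current working directory of the process, so the notebook must be run from its own folder for `'../../../'` to point at the project root.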

In [6]:
# The database engine object is created by passing the connection string
sqlengine_obj = SqlEngine(str_connection)

In [7]:
# The connect() method of the SqlEngine establishes a connection with the database if it exists,
# or creates a new one otherwise. The parameter show_echo is False by default and controls whether the SQL statements
# are echoed (printed) to the default output channel. Here, show_echo=True is set so that the SQL statements are visible.
sqlengine_obj.connect(show_echo=True)

In [8]:
# Check if the connection was established
sqlengine_obj.is_connected()

True
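With SQLite, the create-if-missing behaviour comes from the database driver itself. The following is a minimal sketch using Python's built-in sqlite3 module rather than SqlEngine, whose internals are not shown in this notebook. The temporary folder and the table name `probe` are placeholders.

```python
import os
import sqlite3
import tempfile

# Work in a temporary folder so the sketch does not touch the project data
with tempfile.TemporaryDirectory() as base_dir:
    db_dir = os.path.join(base_dir, 'data', 'database')
    db_file = os.path.join(db_dir, 'entitymatching.db')

    # SQLite creates the database file, but not the missing parent folders
    os.makedirs(db_dir, exist_ok=True)

    existed_before = os.path.exists(db_file)

    conn = sqlite3.connect(db_file)
    conn.execute('CREATE TABLE probe (idx INTEGER PRIMARY KEY)')
    conn.commit()  # after the commit the file is certainly on disk
    existed_after = os.path.exists(db_file)
    conn.close()   # close before the temporary folder is removed

print(existed_before, existed_after)
```

If the target folder does not exist, the connection attempt fails, which is why the folder is created first.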

## 2. FileReader setup

In [9]:
# Location of the test files: the csv files and their mapping files
path_test = '../../../data/test/'

In [10]:
# CSV file for Data source 1 
file1_path = path_test + 'test_data_source1.csv'
file1_path

'../../../data/test/test_data_source1.csv'

In [11]:
# Data mapping for Data source 1
file1_map = path_test + 'test_data_source1.json'
file1_map

'../../../data/test/test_data_source1.json'

In [12]:
# Initialize the FileReaderCsvToDB
csvreader_obj = FileReaderCsvToDB()  

The database engine object must be set in the FileReader to provide connectivity with a database.
Make sure that a connection has been established through the sqlengine_obj.connect() method.  

In [13]:
# Set the database engine into the FileReader. Define use_session=True to indicate that a session object must be used. 
csvreader_obj.set_database_engine(sqlengine_obj, use_session=True)

## 3. Read a file and load its content to the database

The read_file() method reads the csv file passed as its first parameter, using the mapping file passed as its second. It returns a data source object that can be used later to query the database or to perform other tasks, such as matching.
The following parameters give some control over the reading process:
- delimiter='\t': the character used to separate values in the csv file. The tab character is used by default.
- chunk_size=1: the number of lines to read and persist to a database table at a time. Tuning it can speed up the process when reading long files.
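To make the role of the two parameters concrete, here is a minimal sketch of chunked loading built on the standard csv and sqlite3 modules. It is not the FileReaderCsvToDB implementation: `load_csv_in_chunks` and the table `demo` are hypothetical, and the in-memory csv text stands in for a file on disk.

```python
import csv
import io
import sqlite3
from itertools import islice

# Stand-in for a csv file on disk (three rows in the style of the test data)
csv_text = (
    "ISIN,COMPANY,COUNTRY\n"
    "SK1120005824,CENTRAL PERK,SK\n"
    ",HONEYDUKES,UNITED STATES OF AMERICA\n"
    "GB00B1YW4409,STERLING COOPER,GBR\n"
)


def load_csv_in_chunks(handle, conn, delimiter='\t', chunk_size=1):
    """Insert csv rows chunk_size rows at a time; return the number of rows loaded."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the header line
    total = 0
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        # Empty strings become NULL so that missing values stay missing
        rows = [tuple(value or None for value in row) for row in chunk]
        conn.executemany(
            'INSERT INTO demo (isin, company, country) VALUES (?, ?, ?)', rows
        )
        conn.commit()  # one commit per chunk
        total += len(rows)
    return total


conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE demo (idx INTEGER PRIMARY KEY, isin TEXT, company TEXT, country TEXT)'
)
loaded = load_csv_in_chunks(io.StringIO(csv_text), conn, delimiter=',', chunk_size=2)
print(loaded)
```

With `chunk_size=2` the three rows are written in two batches. A larger chunk means fewer round trips to the database at the cost of holding more rows in memory.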

In [14]:
# Read 'test_data_source1.csv'
data_source_obj = csvreader_obj.read_file(file1_path, file1_map, delimiter=',')

2021-09-07 19:55:51,267 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2021-09-07 19:55:51,270 INFO sqlalchemy.engine.Engine 
CREATE TABLE data_source_ref (
	idx INTEGER NOT NULL, 
	data_source VARCHAR, 
	isin VARCHAR, 
	isvalid_isin BOOLEAN, 
	company_name VARCHAR, 
	original_company_name VARCHAR, 
	country VARCHAR, 
	country_alpha2 VARCHAR, 
	country_alpha3 VARCHAR, 
	original_country VARCHAR, 
	PRIMARY KEY (idx)
)


2021-09-07 19:55:51,271 INFO sqlalchemy.engine.Engine [no key 0.00101s] ()
2021-09-07 19:55:51,295 INFO sqlalchemy.engine.Engine COMMIT
2021-09-07 19:55:51,483 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2021-09-07 19:55:51,485 INFO sqlalchemy.engine.Engine INSERT INTO data_source_ref (data_source, isin, isvalid_isin, company_name, original_company_name, country, country_alpha2, country_alpha3, original_country) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
2021-09-07 19:55:51,486 INFO sqlalchemy.engine.Engine [no key 0.00094s] ('data_source_ref', 'SK1120005824', 1, 'central per

## 4. Reports

In [15]:
# Show the ETL report for the latest file read by the FileReader
report_obj = csvreader_obj.report_obj
report_obj.print_report()

---------------------------------------- FILE READER REPORT ----------------------------------------
Description: Details of the ETL process performed on [data_source_ref] data source.
Datetime:2021-09-07 19:55:51
----------------------------------------------------------------------------------------------------
File Name: ../../../data/test/test_data_source1.csv
Columns in File: 3
Rows in File: 8
Columns Read from File: 3
Columns in Database: 10
Rows in Database: 8


## 5. Checking the attribute names of EntityDB

There are multiple methods to check the column or attribute names of the EntityDB object:
1. Use get_original_attribute_names(): to retrieve the original attribute names of the columns in the csv file
2. Use get_renamed_attribute_names(): to retrieve the renamed attribute names of the columns in the csv file
3. Use get_table_column_names(): to retrieve the equivalent attribute names in the database

In [16]:
# Retrieve the original attribute names (read from the csv file)
data_source_obj.get_original_attribute_names()

['ISIN', 'COMPANY', 'COUNTRY']

In [17]:
# Retrieve the renamed attribute names
data_source_obj.get_renamed_attribute_names()

['isin', 'company_name', 'country']
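The relation between the two lists can be pictured as a simple lookup. The dictionary below is hypothetical: the real mapping lives in test_data_source1.json, whose layout is not shown in this notebook.

```python
# Hypothetical mapping from csv header names to normalized attribute names
rename_map = {'ISIN': 'isin', 'COMPANY': 'company_name', 'COUNTRY': 'country'}

original_names = ['ISIN', 'COMPANY', 'COUNTRY']

# Names without an entry in the mapping are kept unchanged
renamed_names = [rename_map.get(name, name) for name in original_names]
print(renamed_names)
```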

In [18]:
# Retrieve the column names of the database table
data_source_obj.get_table_column_names()

['idx',
 'data_source',
 'isin',
 'isvalid_isin',
 'company_name',
 'original_company_name',
 'country',
 'country_alpha2',
 'country_alpha3',
 'original_country']

## 6. Checking the total entries in the EntityDB

In [19]:
# Total entries using the default index column
result = data_source_obj.get_total_entries_by_idx()
result

2021-09-07 19:55:51,675 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2021-09-07 19:55:51,679 INFO sqlalchemy.engine.Engine SELECT count(DISTINCT data_source_ref.idx) AS count_1 
FROM data_source_ref
 LIMIT ? OFFSET ?
2021-09-07 19:55:51,680 INFO sqlalchemy.engine.Engine [cached since 0.1252s ago] (1, 0)
2021-09-07 19:55:51,684 INFO sqlalchemy.engine.Engine ROLLBACK


8

In [20]:
# Total entries using other column names
# The query counts distinct values and ignores NULLs, so it can reveal columns with missing entries
result = data_source_obj.get_total_entries_by_column_name('isin')
result

2021-09-07 19:55:51,702 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2021-09-07 19:55:51,705 INFO sqlalchemy.engine.Engine SELECT count(DISTINCT data_source_ref.isin) AS count_1 
FROM data_source_ref
 LIMIT ? OFFSET ?
2021-09-07 19:55:51,706 INFO sqlalchemy.engine.Engine [generated in 0.00114s] (1, 0)
2021-09-07 19:55:51,708 INFO sqlalchemy.engine.Engine ROLLBACK


6
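The difference between 8 and 6 follows from SQL semantics: count(DISTINCT column) skips NULL values. A minimal sketch using Python's sqlite3 module directly (not the EntityDB API), with a hypothetical table named `demo` holding the isin values of the eight rows in data_source_ref:

```python
import sqlite3

# None marks the two rows without an ISIN
isins = ['SK1120005824', None, None, 'GB00B1YW4409', 'CH0012221716',
         'US0200021014', 'US0231351067', 'US0126531013']

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo (idx INTEGER PRIMARY KEY, isin TEXT)')
conn.executemany('INSERT INTO demo (isin) VALUES (?)', [(value,) for value in isins])

# The primary key is never NULL, so every row is counted
total_by_idx = conn.execute('SELECT count(DISTINCT idx) FROM demo').fetchone()[0]

# NULL values are left out of the count
total_by_isin = conn.execute('SELECT count(DISTINCT isin) FROM demo').fetchone()[0]
conn.close()

print(total_by_idx, total_by_isin)  # 8 6
```

Because of DISTINCT, repeated values would also be counted only once, so a lower count can mean either missing or duplicated entries.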

## 7. Checking the content of the EntityDB

The get_data() method of the EntityDB object performs a full select on the table, returning a list of tuples. Each item of the list is a row in the table and each element of a tuple is the value of one column.

In [21]:
# Query the table 
lst_result = data_source_obj.get_data()
lst_result

2021-09-07 19:55:51,738 INFO sqlalchemy.engine.Engine SELECT data_source_ref.idx, data_source_ref.data_source, data_source_ref.isin, data_source_ref.isvalid_isin, data_source_ref.company_name, data_source_ref.original_company_name, data_source_ref.country, data_source_ref.country_alpha2, data_source_ref.country_alpha3, data_source_ref.original_country 
FROM data_source_ref
2021-09-07 19:55:51,741 INFO sqlalchemy.engine.Engine [generated in 0.00376s] ()


[(1, 'data_source_ref', 'SK1120005824', True, 'central perk', 'CENTRAL PERK', 'slovakia', 'sk', 'svk', 'SK'),
 (2, 'data_source_ref', None, None, 'honeydukes', 'HONEYDUKES', 'united states of america', 'us', 'usa', 'UNITED STATES OF AMERICA'),
 (3, 'data_source_ref', None, None, 'starcourt mall', 'STARCOURT MALL', 'austria', 'at', 'aut', 'AUSTRIA'),
 (4, 'data_source_ref', 'GB00B1YW4409', True, 'sterling cooper', 'STERLING COOPER', 'united kingdom of great britain and northern ireland', 'gb', 'gbr', 'GBR'),
 (5, 'data_source_ref', 'CH0012221716', True, 'bluth company', 'Bluth company', 'switzerland', 'ch', 'che', 'CHE'),
 (6, 'data_source_ref', 'US0200021014', True, 'ingen', 'InGen', 'united states of america', 'us', 'usa', 'usa'),
 (7, 'data_source_ref', 'US0231351067', True, 'stark industries', 'Stark Industries', 'united states of america', 'us', 'usa', 'us'),
 (8, 'data_source_ref', 'US0126531013', True, 'spectre', 'SPECTRE', 'united states of america', 'us', 'usa', 'USA')]
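Positional tuples are easier to work with once they are paired with the column names. A small sketch in plain Python, using the names returned by get_table_column_names() and two rows copied from the output above:

```python
column_names = ['idx', 'data_source', 'isin', 'isvalid_isin', 'company_name',
                'original_company_name', 'country', 'country_alpha2',
                'country_alpha3', 'original_country']

rows = [
    (1, 'data_source_ref', 'SK1120005824', True, 'central perk', 'CENTRAL PERK',
     'slovakia', 'sk', 'svk', 'SK'),
    (2, 'data_source_ref', None, None, 'honeydukes', 'HONEYDUKES',
     'united states of america', 'us', 'usa', 'UNITED STATES OF AMERICA'),
]

# Pair each value with its column name to get one dict per row
records = [dict(zip(column_names, row)) for row in rows]

# Example: list the companies that have no ISIN
missing_isin = [record['company_name'] for record in records if record['isin'] is None]
print(missing_isin)
```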

The get_data_as_df() method of the EntityDB also performs a select on the table, but returns a pandas DataFrame as the result.

In [22]:
# Query the table
df_result = data_source_obj.get_data_as_df()
df_result

2021-09-07 19:55:51,767 INFO sqlalchemy.engine.Engine SELECT data_source_ref.idx, data_source_ref.data_source, data_source_ref.isin, data_source_ref.isvalid_isin, data_source_ref.company_name, data_source_ref.original_company_name, data_source_ref.country, data_source_ref.country_alpha2, data_source_ref.country_alpha3, data_source_ref.original_country 
FROM data_source_ref
2021-09-07 19:55:51,768 INFO sqlalchemy.engine.Engine [cached since 0.03049s ago] ()


| | idx | data_source | isin | isvalid_isin | company_name | original_company_name | country | country_alpha2 | country_alpha3 | original_country |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | data_source_ref | SK1120005824 | True | central perk | CENTRAL PERK | slovakia | sk | svk | SK |
| 1 | 2 | data_source_ref | | | honeydukes | HONEYDUKES | united states of america | us | usa | UNITED STATES OF AMERICA |
| 2 | 3 | data_source_ref | | | starcourt mall | STARCOURT MALL | austria | at | aut | AUSTRIA |
| 3 | 4 | data_source_ref | GB00B1YW4409 | True | sterling cooper | STERLING COOPER | united kingdom of great britain and northern i... | gb | gbr | GBR |
| 4 | 5 | data_source_ref | CH0012221716 | True | bluth company | Bluth company | switzerland | ch | che | CHE |
| 5 | 6 | data_source_ref | US0200021014 | True | ingen | InGen | united states of america | us | usa | usa |
| 6 | 7 | data_source_ref | US0231351067 | True | stark industries | Stark Industries | united states of america | us | usa | us |
| 7 | 8 | data_source_ref | US0126531013 | True | spectre | SPECTRE | united states of america | us | usa | USA |


## 8. Drop the table

In [23]:
sqlengine_obj.drop_table(data_source_obj.table_obj)

2021-09-07 19:55:51,820 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2021-09-07 19:55:51,822 INFO sqlalchemy.engine.Engine 
DROP TABLE data_source_ref
2021-09-07 19:55:51,824 INFO sqlalchemy.engine.Engine [no key 0.00142s] ()
2021-09-07 19:55:51,852 INFO sqlalchemy.engine.Engine COMMIT
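Dropping a table that no longer exists normally raises an error, which matters when a notebook is re-run cell by cell. The sketch below shows a defensive variant in plain sqlite3. It does not use SqlEngine.drop_table(), and `table_exists` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data_source_ref (idx INTEGER PRIMARY KEY, isin TEXT)')


def table_exists(conn, name):
    """Return True if a table with this name is listed in sqlite_master."""
    query = "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?"
    return conn.execute(query, (name,)).fetchone() is not None


before = table_exists(conn, 'data_source_ref')

# IF EXISTS makes the statement harmless when it is run a second time
conn.execute('DROP TABLE IF EXISTS data_source_ref')
conn.execute('DROP TABLE IF EXISTS data_source_ref')

after = table_exists(conn, 'data_source_ref')
conn.close()
print(before, after)
```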


## 9. Close database connection

In [24]:
sqlengine_obj.disconnect()
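If an earlier cell fails, an explicit disconnect() at the end may never run. A general Python pattern is to tie the close to a with block. The sketch uses sqlite3 with contextlib.closing; whether SqlEngine itself can be used as a context manager is not shown in this notebook.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block ends, even if an error occurs
with closing(sqlite3.connect(':memory:')) as conn:
    value = conn.execute('SELECT 1').fetchone()[0]

# Using the connection after it is closed raises ProgrammingError
try:
    conn.execute('SELECT 1')
    closed = False
except sqlite3.ProgrammingError:
    closed = True

print(value, closed)
```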