# Loading a Dataset from different file types

## Abstract

This notebook compares how long it takes to load the same dataset from different file formats.<br>
The dataset contains stroke records for people together with various health measurements. It was downloaded from Kaggle: https://www.kaggle.com/datasets/lirilkumaramal/heart-stroke/download. It has 43,400 rows and 12 variables and, stored as a CSV file, takes up 2.6 MB.<br>
The notebook loads the data from a CSV (text) file, writes it to Excel (xls) and pickle (pkl, binary) files, and then measures the loading time from each of these formats and from a SQLite database.

### Libraries

In [1]:
import pandas as pd
import datetime
import pickle
import sqlite3

## 1. Loading from a CSV (text) file

In [2]:
# Loading time from a CSV file
init0=datetime.datetime.now()
df=pd.read_csv('datasets/train_strokes.csv')
end0=datetime.datetime.now()
delta0=end0-init0
print('Loading time of a Pandas Dataframe from a text CSV file: ',delta0)


Loading time of a Pandas Dataframe from a text CSV file:  0:00:00.133967
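
The measurement above takes a single wall-clock reading with `datetime.datetime.now()`. As a minimal sketch of a more repeatable alternative, the helper below uses `time.perf_counter` and keeps the best of several runs. The name `time_call` and the `repeat` parameter are hypothetical and not part of the original notebook.

```python
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Call func(*args, **kwargs) `repeat` times; return (last result, best time in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return result, best


# Stand-in workload so the sketch runs without the dataset file.
total, seconds = time_call(sum, range(100_000))
print(f"sum took {seconds:.6f} s")
```

In the notebook this could be called as `df, seconds = time_call(pd.read_csv, 'datasets/train_strokes.csv')`. Runs after the first one may benefit from the operating system's file cache, so the best time tends to reflect a warm read.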


### Storing dataset into Excel and binary format

In [3]:
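# Writing the legacy .xls format needs the xlwt engine, which pandas deprecated
# in 1.2 and removed in 2.0; on newer versions an .xlsx path is required.
df.to_pickle('datasets/train_strokes.pkl')
df.to_excel('datasets/train_strokes.xls')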



## 2. Loading dataset from a Database

In [12]:
# Loading time from a database
init=datetime.datetime.now()
# Open the connection (pandas runs the query itself, so no cursor is needed)
nombreDB='datasets/train_strokes.db'
conexion = sqlite3.connect(nombreDB)
# SELECT
sentence="SELECT * FROM strokes"
df2 = pd.read_sql_query(sentence, conexion)
end=datetime.datetime.now()
delta=end-init
conexion.close()
print('Loading time of a Pandas Dataframe from a DataBase: ',delta)

Loading time of a Pandas Dataframe from a DataBase:  0:00:00.460661
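
The notebook does not show how `datasets/train_strokes.db` was created. The sketch below is one assumed way to build a `strokes` table from CSV text using only the standard library. The three columns and the sample rows are made up for illustration, and an in-memory database stands in for the real file.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for datasets/train_strokes.csv.
csv_text = """id,age,stroke
1,67,1
2,45,0
3,80,0
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

conexion = sqlite3.connect(":memory:")  # a file path would persist the data
conexion.execute("CREATE TABLE strokes (id INTEGER, age REAL, stroke INTEGER)")
conexion.executemany(
    "INSERT INTO strokes (id, age, stroke) VALUES (:id, :age, :stroke)", rows
)
conexion.commit()

# Same query as in the notebook; column affinity converts the CSV strings to numbers.
loaded = conexion.execute("SELECT * FROM strokes").fetchall()
conexion.close()
print(loaded)
```

With pandas already loaded, `df.to_sql('strokes', conexion, index=False)` would likely be a shorter route to the same table.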


## 3. Loading dataset from binary file

In [13]:
# Loading time from a PKL file
init1=datetime.datetime.now()
df=pd.read_pickle('datasets/train_strokes.pkl')
end1=datetime.datetime.now()
delta1=end1-init1
print('Loading time of a Pandas Dataframe from a binary PKL file: ',delta1)

Loading time of a Pandas Dataframe from a binary PKL file:  0:00:00.018615
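
One reason a binary format can load quickly is that values are stored together with their types, while a text format has to be parsed back from strings. The following sketch illustrates this with the standard `pickle` and `csv` modules on a made-up record (the field names are hypothetical). It is an illustration of the type round trip, not a benchmark.

```python
import csv
import io
import pickle

record = {"id": 1, "age": 67.0, "stroke": True}  # made-up example row

# Pickle round trip: types come back unchanged.
restored = pickle.loads(pickle.dumps(record))

# CSV round trip: every value comes back as a string and must be parsed again.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(restored)
print(from_csv)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.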


## 4. Loading dataset from an Excel file

In [14]:
# Loading time from an XLS file
init2=datetime.datetime.now()
df=pd.read_excel('datasets/train_strokes.xls')
end2=datetime.datetime.now()
delta2=end2-init2
print('Loading time of a Pandas Dataframe from an Excel XLS file: ',delta2)

Loading time of a Pandas Dataframe from an Excel XLS file:  0:00:05.239574


## 5. Results Obtained

In [15]:
print('Binary PKL File: ',delta1)
print('From a Database: ',delta)
print('Text CSV File:   ',delta0)
print('Excel XLS File:  ',delta2)

Binary PKL File:  0:00:00.018615
From a Database:  0:00:00.460661
Text CSV File:    0:00:00.133967
Excel XLS File:   0:00:05.239574
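
To make the comparison easier to read, the timings printed above can be expressed relative to the fastest format. The sketch below hard-codes the four measured values as `timedelta` objects; the dictionary name `timings` is an assumption for illustration.

```python
from datetime import timedelta

# Timings copied from the output above.
timings = {
    "Binary PKL File": timedelta(seconds=0, microseconds=18615),
    "From a Database": timedelta(seconds=0, microseconds=460661),
    "Text CSV File": timedelta(seconds=0, microseconds=133967),
    "Excel XLS File": timedelta(seconds=5, microseconds=239574),
}

fastest = min(timings.values())
# Dividing one timedelta by another yields a plain float ratio.
ratios = {name: delta / fastest for name, delta in timings.items()}

# Print the formats from fastest to slowest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<16} {ratio:7.1f}x")
```

In the notebook itself, `delta1`, `delta`, `delta0` and `delta2` could be used directly in place of the hard-coded values.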


## 6. Conclusions

The fastest way to load the dataset is from a binary pickle (PKL) file. The disadvantage is that such a file cannot be browsed or edited with a text editor. Loading from a CSV text file takes about 7 times longer than from the binary file (0.134 s versus 0.019 s), but CSV files are very flexible. The slowest option is the Excel (XLS) file: it takes more than 280 times longer than the binary file and about 39 times longer than the CSV file. It is also worth noting that loading from the PKL and CSV files is faster than loading from the database, which takes about 3.4 times longer than CSV and about 25 times longer than PKL. Each format was timed only once, so these ratios are approximate.