df_utils

A set of data engineering functions that I use a lot in ETL data processing.

It all started with print(df), then print(df.values.tolist()), then prettifying the output with tabulate... After some time and work I noticed which pieces of code I reuse in every script, so I turned them into functions. Fast forward: with this many functions, the logical next step is a Python module on PyPI (soon).

I use them as a local module:

import sys
sys.path.append(r"/home/igorp/Documents/CODE")
from df_utils import db, du, fi

I separate them into 3 different files:

  • db - database connectors (see the wiki)

    • mssqlsrv(sqlserver, database, **kwargs)
    • oracle(hostname, database, **kwargs)
    • postgresql(hostname, database, **kwargs)
  • du - DataFrame utils

    • df_2_mssqlsrv(df, engine_name, schema_name, table_name, ifexist)
    • df_2_sqlite(df, db_path, table_name)
    • get_xlsx_data(file, sheet)
    • df_merged_headers(cl, delimiter)
    • 🔥 get_xlsx(fn, **kwargs)
    • from_excel_ordinal(ordinal, _epoch0=datetime(1899, 12, 31))
    • df_append_2_xlsx(df, file_name, sheet_name)
    • 🔥 df_2_xlsx(df, fn, sn, ac=1, m=0, s=0, sr=0)
    • print_df(df, **kwargs)
    • df_dtypes(df, mode)
    • split_df(df, lines = 1000)
    • df_unique(df, col, id=0)
    • clean_df(res, fillna='')
  • fi - file and list helpers (a sketch of two of these follows the list)

    • transpose_list(list_in, na=None)
    • rotate_list(l, n)
    • flatten_list(in_list)
    • remove_sublist(main_list, unwanted_list)
    • check_sublist(main_list, sub_list, exception=0)
    • list_2_pickle(src_list, fn)
    • pickle_2_list(fn)
    • list_slice(list, chunks)
    • replace_many(text, dic)
    • fuzzy_compare_lists(source_list, match_list, limit_level, fast=0)
    • list_2_str(source_list)
    • num_2_str(data)
    • new_folder(folder, mode=0)
    • get_file_list(path, extension, mode=0)
    • split_by_size(path, file_list, size)
    • cp_multi_2_one(file_list, dst_path)
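To give a feel for the fi helpers, here is a minimal sketch of two of them. Only the signatures come from the list above; the bodies are illustrative and may differ from the actual code in the repo.

from itertools import zip_longest

def flatten_list(in_list):
    # Flatten one level of nesting: [[1, 2], [3]] -> [1, 2, 3]
    return [item for sub in in_list for item in sub]

def transpose_list(list_in, na=None):
    # Rows become columns; ragged rows are padded with `na`:
    # [[1, 2, 3], [4, 5]] -> [[1, 4], [2, 5], [3, None]]
    return [list(col) for col in zip_longest(*list_in, fillvalue=na)]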

💡 Example 1 - Read xlsx

Let's say I want to read an xlsx (MS Excel) file and print the contents of a worksheet.

import sys
sys.path.append('/home/igorp/Documents/CODE/GitHub') # place your path to the df_utils folder
from df_utils import du

from pathlib import Path

fn = Path(r'/home/igorp/Documents/CODE/MISC/Python/Proba.xlsx') # Place your path to the source xlsx file

# Now, just read xlsx
df = du.get_xlsx_data(fn, 'Milka')

# print raw DataFrame
du.print_df(df, vt=2, e=0)

# print cleaned DataFrame
res = du.clean_df(df)
du.print_df(res, vt=3)

Source data:

[screenshot_20220405-231144]

Result:

Notice that the first DataFrame is printed as-is, without cleaning; then, for comparison, there is a second print with the cleaned DataFrame (a sketch of clean_df follows the screenshot).

[screenshot_20220709-122407]
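Judging from its signature, clean_df(res, fillna='') most likely just fills in missing values; a minimal sketch of that behaviour (an assumption, not the packaged code):

def clean_df(res, fillna=''):
    # Replace NaN/None with the given filler so the printed table
    # shows blanks instead of 'nan' (illustrative body only)
    return res.fillna(fillna)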

💡 Example 2 - Connect to databases

In this example I am going to establish connections to 2 databases: one local and one remote (Heroku). I will query some data from tables in both databases and left-join the results into a final DataFrame.

[screenshot_20220410-131902]

Code analysis

# 🧰 My toolbox
import sys
sys.path.append(r"/home/igorp/Documents/CODE/GitHub")
from df_utils import db, du, up

import pandas as pd
from timeit import default_timer as timer
start = timer()

import sqlite3
# 🔑 My password vault...
conn = sqlite3.connect('/home/igorp/Documents/CODE/Vault/secure_box.sqlite')

The code above is the header and declares the modules I need in this script. Notice that I use a local SQLite database for storing my passwords, so they are not visible in my scripts (like this one).
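up.get_password itself is not shown here; a plausible sketch, assuming the vault is a plain table (the vault table and its column names are hypothetical):

def get_password(conn, row_id):
    # Hypothetical vault layout: a 'vault' table with (id, password) columns
    cur = conn.execute("SELECT password FROM vault WHERE id = ?", (row_id,))
    row = cur.fetchone()
    return row[0] if row else None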

# Local database (PostgreSQL)
host_name = '127.0.0.1'
database  = 'data'
user_name = 'root'
password  = up.get_password(conn, 1)

# Heroku database (PostgreSQL)
host_name2 = "ec2-52-208-254-158.eu-west-1.compute.amazonaws.com"
database2  = "da2g6qll5cjhkd"
user_name2 = "qkrfbcuatvnhha"
password2  = up.get_password(conn, 2)

engine = db.postgresql(host_name, database,
                     a_username = user_name,
                     a_password = password
                     )

engine2 = db.postgresql(host_name2, database2,
                     a_username = user_name2,
                     a_password = password2
                     )


Connection strings for both the local and the remote database, and engine creation for each of them.
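db.postgresql is most likely a thin wrapper around SQLAlchemy's create_engine; a rough sketch under that assumption (the real connector may handle ports, drivers and escaping differently):

from sqlalchemy import create_engine

def postgresql(hostname, database, **kwargs):
    # Assumed keyword arguments: a_username, a_password, optional a_port
    user = kwargs.get('a_username')
    pwd  = kwargs.get('a_password')   # a real version should URL-escape this
    port = kwargs.get('a_port', 5432)
    return create_engine(f'postgresql+psycopg2://{user}:{pwd}@{hostname}:{port}/{database}')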

try:
    df  = pd.read_sql("select id as professorid, first_name, last_name from dbo.professor_csv", con=engine)
    du.print_df(df, c=0, d=0, vt=2, e=0)

    df2 = pd.read_sql("select * from dbo.course_csv", con=engine2)
    du.print_df(df2, c=0, d=0, vt=2, e=0)
except Exception as ex:
    print(ex)

Here I run a query against each database and collect the results into DataFrames. Notice that in du.print_df(df, c=0, d=0, vt=2, e=0) I turned off columns (c=0), dtypes (d=0) and exit-after-print (e=0); I only left the tabular values view on (vt=2).
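Since the whole project started with tabulate, print_df plausibly wraps it; a rough sketch with the flag meanings inferred from the paragraph above (an assumption, not the packaged code):

from tabulate import tabulate

def print_df(df, c=1, d=1, vt=0, e=1, **kwargs):
    # Inferred flags: c = column names, d = dtypes, e = exit after printing.
    # The real vt levels (1/2/3) probably select different table styles;
    # this sketch collapses them into a single psql-style table.
    if c:
        print(list(df.columns))
    if d:
        print(df.dtypes)
    if vt:
        print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))
    if e:
        raise SystemExit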

df_join = pd.merge(df2,df, how='left', on='professorid', indicator=True)
du.print_df(df_join, c=0, d=0, vt=2, e=0)

A left join between the 2 DataFrames, the pandas equivalent of left-joining tables in SQL. The indicator is turned on (True), so I get an additional column "_merge" in the result DataFrame that tells me whether the joined data exists in both tables or only in the left one (left_only).
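A tiny self-contained illustration of what the indicator column reports, on toy data rather than the query results above:

import pandas as pd

courses = pd.DataFrame({'course': ['SQL', 'ETL'], 'professorid': [1, 3]})
profs   = pd.DataFrame({'professorid': [1, 2], 'last_name': ['Novak', 'Horvat']})

print(pd.merge(courses, profs, how='left', on='professorid', indicator=True))
#   course  professorid last_name     _merge
# 0    SQL            1     Novak       both
# 1    ETL            3       NaN  left_only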

proc_time = round(timer() - start,2)
print(f'\n⚡⚡⚡ ⏱ Done in: {proc_time} s ⚡⚡⚡')

And finally, the processing-time counter.
