df_utils

A set of data engineering functions that I use a lot in ETL data processing.

It all started with print(df), then print(df.values.tolist()), then prettifying the output with tabulate... After some time and work I noticed which pieces of code I reuse in every script, so I turned them into functions. Fast forward: with this many functions, the logical next step is a Python module on PyPI (soon).

I use them as a local module:

import sys
sys.path.append(r"/home/igorp/Documents/CODE")
from df_utils import db, du, fi

I separate them into 3 different files:

  • db - database connectors (see the wiki)

    • mssqlsrv(sqlserver, database, **kwargs)
    • oracle(hostname, database, **kwargs)
    • postgresql(hostname, database, **kwargs)
  • du - DataFrame utils

    • df_2_mssqlsrv(df, engine_name, schema_name, table_name, ifexist)
    • df_2_sqlite(df, db_path, table_name)
    • get_xlsx_data(file, sheet)
    • df_merged_headers(cl, delimiter)
    • 🔥 get_xlsx(fn, **kwargs)
    • from_excel_ordinal(ordinal, _epoch0=datetime(1899, 12, 31))
    • df_append_2_xlsx(df, file_name, sheet_name)
    • 🔥 df_2_xlsx(df, fn, sn, ac=1, m=0, s=0, sr=0)
    • print_df(df, **kwargs)
    • df_dtypes(df, mode)
    • split_df(df, lines = 1000)
    • df_unique(df, col, id=0)
    • clean_df(res, fillna='')
  • fi - file and list helpers (a sketch of two of these follows the list)

    • transpose_list(list_in, na=None)
    • rotate_list(l, n)
    • flatten_list(in_list)
    • remove_sublist(main_list, unwanted_list)
    • check_sublist(main_list, sub_list, exception=0)
    • list_2_pickle(src_list, fn)
    • pickle_2_list(fn)
    • list_slice(list, chunks)
    • replace_many(text, dic)
    • fuzzy_compare_lists(source_list, match_list, limit_level, fast=0)
    • list_2_str(source_list)
    • num_2_str(data)
    • new_folder(folder, mode=0)
    • get_file_list(path, extension, mode=0)
    • split_by_size(path, file_list, size)
    • cp_multi_2_one(file_list, dst_path)
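To give a feel for the fi helpers, here is a minimal sketch of two of them. Only the signatures come from the list above; the bodies are illustrative and may differ from the actual code in the repo.

from itertools import zip_longest

def flatten_list(in_list):
    # Flatten one level of nesting: [[1, 2], [3]] -> [1, 2, 3]
    return [item for sub in in_list for item in sub]

def transpose_list(list_in, na=None):
    # Rows become columns; ragged rows are padded with `na`:
    # [[1, 2, 3], [4, 5]] -> [[1, 4], [2, 5], [3, None]]
    return [list(col) for col in zip_longest(*list_in, fillvalue=na)]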

💡 Example 1 - Read xlsx

Let's say I want to read an xlsx (MS Excel) file and print the contents of a worksheet.

import sys
sys.path.append('/home/igorp/Documents/CODE/GitHub') # place your path to the df_utils folder
from df_utils import du

from pathlib import Path

fn = Path(r'/home/igorp/Documents/CODE/MISC/Python/Proba.xlsx') # Place your path to the source xlsx file

# Now, just read xlsx
df = du.get_xlsx_data(fn, 'Milka')

# print raw DataFrame
du.print_df(df, vt=2, e=0)

# print cleaned DataFrame
res = du.clean_df(df)
du.print_df(res, vt=3)

Source data:

[screenshot_20220405-231144]

Result:

Notice that the first DataFrame is printed as-is, without cleaning; then, for comparison, there is a second print with the cleaned DataFrame (a sketch of clean_df follows the screenshot).

[screenshot_20220709-122407]
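Judging from its signature, clean_df(res, fillna='') most likely just fills in missing values; a minimal sketch of that behaviour (an assumption, not the packaged code):

def clean_df(res, fillna=''):
    # Replace NaN/None with the given filler so the printed table
    # shows blanks instead of 'nan' (illustrative body only)
    return res.fillna(fillna)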

💡 Example 2 - Connect to databases

In this example I am going to establish connections to 2 databases: one local and one remote (Heroku). I will query some data from tables in both databases and left-join the results into a final DataFrame.

[screenshot_20220410-131902]

Code analysis

# 🧰 My toolbox
import sys
sys.path.append(r"/home/igorp/Documents/CODE/GitHub")
from df_utils import db, du, up

import pandas as pd
from timeit import default_timer as timer
start = timer()

import sqlite3
# 🔑 My password vault...
conn = sqlite3.connect('/home/igorp/Documents/CODE/Vault/secure_box.sqlite')

The code above is the header and declares the modules I need in this script. Notice that I use a local SQLite database for storing my passwords, so they are not visible in my scripts (like this one).
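up.get_password itself is not shown here; a plausible sketch, assuming the vault is a plain table (the vault table and its column names are hypothetical):

def get_password(conn, row_id):
    # Hypothetical vault layout: a 'vault' table with (id, password) columns
    cur = conn.execute("SELECT password FROM vault WHERE id = ?", (row_id,))
    row = cur.fetchone()
    return row[0] if row else None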

# Local database (PostgreSQL)
host_name = '127.0.0.1'
database  = 'data'
user_name = 'root'
password  = up.get_password(conn, 1)

# Heroku database (PostgreSQL)
host_name2 = "ec2-52-208-254-158.eu-west-1.compute.amazonaws.com"
database2  = "da2g6qll5cjhkd"
user_name2 = "qkrfbcuatvnhha"
password2  = up.get_password(conn, 2)

engine = db.postgresql(host_name, database,
                     a_username = user_name,
                     a_password = password
                     )

engine2 = db.postgresql(host_name2, database2,
                     a_username = user_name2,
                     a_password = password2
                     )


Connection strings for both the local and the remote database, and engine creation for each of them.
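db.postgresql is most likely a thin wrapper around SQLAlchemy's create_engine; a rough sketch under that assumption (the real connector may handle ports, drivers and escaping differently):

from sqlalchemy import create_engine

def postgresql(hostname, database, **kwargs):
    # Assumed keyword arguments: a_username, a_password, optional a_port
    user = kwargs.get('a_username')
    pwd  = kwargs.get('a_password')   # a real version should URL-escape this
    port = kwargs.get('a_port', 5432)
    return create_engine(f'postgresql+psycopg2://{user}:{pwd}@{hostname}:{port}/{database}')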

try:
    df  = pd.read_sql("select id as professorid, first_name, last_name from dbo.professor_csv", con=engine)
    du.print_df(df, c=0, d=0, vt=2, e=0)

    df2 = pd.read_sql("select * from dbo.course_csv", con=engine2)
    du.print_df(df2, c=0, d=0, vt=2, e=0)
except Exception as ex:
    print(ex)

Here I run a query against each database and collect the results into DataFrames. Notice that in du.print_df(df, c=0, d=0, vt=2, e=0) I turned off columns (c=0), dtypes (d=0) and exit-after-print (e=0); I only left the tabular values view on (vt=2).
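Since the whole project started with tabulate, print_df plausibly wraps it; a rough sketch with the flag meanings inferred from the paragraph above (an assumption, not the packaged code):

from tabulate import tabulate

def print_df(df, c=1, d=1, vt=0, e=1, **kwargs):
    # Inferred flags: c = column names, d = dtypes, e = exit after printing.
    # The real vt levels (1/2/3) probably select different table styles;
    # this sketch collapses them into a single psql-style table.
    if c:
        print(list(df.columns))
    if d:
        print(df.dtypes)
    if vt:
        print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))
    if e:
        raise SystemExit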

df_join = pd.merge(df2,df, how='left', on='professorid', indicator=True)
du.print_df(df_join, c=0, d=0, vt=2, e=0)

A left join between the 2 DataFrames, the pandas equivalent of left-joining tables in SQL. The indicator is turned on (True), so I get an additional column "_merge" in the result DataFrame that tells me whether the joined data exists in both tables or only in the left one (left_only).
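A tiny self-contained illustration of what the indicator column reports, on toy data rather than the query results above:

import pandas as pd

courses = pd.DataFrame({'course': ['SQL', 'ETL'], 'professorid': [1, 3]})
profs   = pd.DataFrame({'professorid': [1, 2], 'last_name': ['Novak', 'Horvat']})

print(pd.merge(courses, profs, how='left', on='professorid', indicator=True))
#   course  professorid last_name     _merge
# 0    SQL            1     Novak       both
# 1    ETL            3       NaN  left_only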

proc_time = round(timer() - start,2)
print(f'\n⚡⚡⚡ ⏱ Done in: {proc_time} s ⚡⚡⚡')

And finally, the processing-time counter.
