The glob module in Python is used to find all pathnames matching a specified pattern according to the rules used by the Unix shell. It's a powerful tool for finding files and directories on your filesystem using wildcards. 📂

## How It Works: Wildcard Patterns
glob works by using special characters, called wildcards, to create a search pattern. The most common wildcards are:

| Wildcard | Description | Example |
|---|---|---|
| `*` | Matches zero or more characters. | `*.csv` matches `data.csv` and `sales.csv`. |
| `?` | Matches exactly one character. | `photo_?.jpg` matches `photo_1.jpg` but not `photo_10.jpg`. |
| `[]` | Matches a single character from a set. | `report_[123].txt` matches `report_1.txt`, `report_2.txt`, or `report_3.txt`. |
| `**` | With `recursive=True`, matches any files and zero or more directories and subdirectories. | `**/*.csv` finds all CSV files in the current directory and all subdirectories. |
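
To see the single-character wildcards in action, here is a minimal sketch that creates a few empty files in a temporary directory and matches them. The file names are hypothetical and chosen only to mirror the examples in the table; `glob.escape` is used so that the temporary directory name itself is never treated as a pattern.

```python
import glob
import os
import tempfile

# Hypothetical file names, created in a throwaway directory for the demo.
names = ["photo_1.jpg", "photo_2.jpg", "photo_10.jpg",
         "report_1.txt", "report_2.txt", "report_4.txt"]

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        open(os.path.join(tmp, name), "w").close()

    def match(pattern):
        # glob returns full paths in arbitrary order; keep base names, sorted.
        hits = glob.glob(os.path.join(glob.escape(tmp), pattern))
        return sorted(os.path.basename(p) for p in hits)

    single_char = match("photo_?.jpg")    # '?' matches exactly one character
    char_set = match("report_[123].txt")  # '[123]' matches one of 1, 2 or 3
    star = match("photo_*.jpg")           # '*' matches any run of characters

print(single_char)  # ['photo_1.jpg', 'photo_2.jpg']
print(char_set)     # ['report_1.txt', 'report_2.txt']
print(star)         # ['photo_1.jpg', 'photo_10.jpg', 'photo_2.jpg']
```

Note that `photo_10.jpg` is matched only by `*`, and `report_4.txt` is skipped because `4` is not in the set.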

## Practical Examples
To use glob, you first need to import it. Let's assume you have the following directory structure:

/my_project/
├── data.csv
├── notes.txt
├── report.txt
└── /archive/
    ├── old_notes.txt
    └── backup.zip
1. Finding All .txt Files in the Current Directory
The glob.glob() function returns a list of matching file paths.

Python

import glob

# Assuming the current working directory is /my_project/
txt_files = glob.glob('*.txt')
print(txt_files)
Output:

['notes.txt', 'report.txt']
2. Recursive Search with **
To search in all subdirectories, use the ** wildcard and set the recursive=True flag.

Python

import glob

# Assuming the current working directory is /my_project/
all_txt_files = glob.glob('**/*.txt', recursive=True)
print(all_txt_files)
Output:

['notes.txt', 'report.txt', 'archive/old_notes.txt']
👉 Notice how it found the text file inside the archive folder.

## Key Functions: glob vs. iglob
The glob module has two primary functions:

glob.glob(): Returns a list containing all the matching pathnames. This is easy to use but can consume a lot of memory if you are matching a very large number of files.

glob.iglob(): Returns an iterator instead of a list. This is much more memory-efficient because it yields one pathname at a time in a for loop, rather than loading them all into memory at once. You should prefer iglob when you expect a large number of results.

Python

import glob

# Using iglob is more memory-efficient for many files
for filename in glob.iglob('**/*.txt', recursive=True):
    print(f"Found file: {filename}")
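
One practical consequence of getting an iterator is that you can stop early. The sketch below, which assumes five hypothetical files in a temporary directory, uses `itertools.islice` to take only the first two matches from `iglob` without building a list of all of them.

```python
import glob
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical demo files.
    for i in range(5):
        open(os.path.join(tmp, f"file_{i}.txt"), "w").close()

    pattern = os.path.join(glob.escape(tmp), "*.txt")

    as_list = glob.glob(pattern)    # a fully built list
    as_iter = glob.iglob(pattern)   # an iterator; matches are yielded one at a time

    is_list = isinstance(as_list, list)
    iter_is_list = isinstance(as_iter, list)

    # Take only the first two matches and ignore the rest.
    first_two = list(itertools.islice(as_iter, 2))

print(is_list, iter_is_list, len(as_list), len(first_two))  # True False 5 2
```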

In [159]:
import glob

In [160]:
for i in glob.iglob('*.ipynb'):
    print(f"Found file: {i}")


Found file: DE.ipynb
Found file: File_Format_convertor.ipynb
Found file: pyprog.ipynb


In [161]:
glob.glob('retail_db/**', recursive=True)

['retail_db\\',
 'retail_db\\categories',
 'retail_db\\categories\\categories.txt',
 'retail_db\\customers',
 'retail_db\\customers\\customers.txt',
 'retail_db\\departments',
 'retail_db\\departments\\departments.txt',
 'retail_db\\departments\\test1.csv',
 'retail_db\\departments\\test2.csv',
 'retail_db\\orders',
 'retail_db\\orders\\orders.txt',
 'retail_db\\order_items',
 'retail_db\\order_items\\order_items.txt',
 'retail_db\\products',
 'retail_db\\products\\products.txt',
 'retail_db\\schemas.json',
 'retail_db\\simple.json']

In [162]:
# recursive=True only changes how '**' is interpreted; with '*/*.txt' it has no effect
src_file_name = glob.glob('retail_db/*/*.txt', recursive=True)
src_file_name

['retail_db\\categories\\categories.txt',
 'retail_db\\customers\\customers.txt',
 'retail_db\\departments\\departments.txt',
 'retail_db\\orders\\orders.txt',
 'retail_db\\order_items\\order_items.txt',
 'retail_db\\products\\products.txt']
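
The `recursive=True` flag only changes how `**` is interpreted, so a pattern such as `*/*.txt` searches exactly one level down with or without it. A small sketch with a hypothetical two-level layout shows the difference:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: one file one level down, one file two levels down.
    os.makedirs(os.path.join(tmp, "orders", "2024"))
    open(os.path.join(tmp, "orders", "orders.txt"), "w").close()
    open(os.path.join(tmp, "orders", "2024", "jan.txt"), "w").close()

    root = glob.escape(tmp)

    def rel(paths):
        # Report paths relative to the demo root, with '/' separators, sorted.
        return sorted(os.path.relpath(p, tmp).replace(os.sep, "/") for p in paths)

    one_level = rel(glob.glob(os.path.join(root, "*", "*.txt")))
    one_level_flag = rel(glob.glob(os.path.join(root, "*", "*.txt"), recursive=True))
    any_depth = rel(glob.glob(os.path.join(root, "**", "*.txt"), recursive=True))

print(one_level)       # ['orders/orders.txt']
print(one_level_flag)  # ['orders/orders.txt']
print(any_depth)       # ['orders/2024/jan.txt', 'orders/orders.txt']
```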

In [163]:
import re

In [164]:
re.split(r'[\s,;]+', 'foo,bar;baz  qux')

['foo', 'bar', 'baz', 'qux']

In [165]:
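for file in src_file_name:
    # This pattern splits on whitespace, commas and semicolons. None of them
    # occur in these paths, so each path comes back as a one-element list.
    file_path_list = re.split(r'[\s,;]+', file)
    print(file_path_list)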

['retail_db\\categories\\categories.txt']
['retail_db\\customers\\customers.txt']
['retail_db\\departments\\departments.txt']
['retail_db\\orders\\orders.txt']
['retail_db\\order_items\\order_items.txt']
['retail_db\\products\\products.txt']


In [166]:
import pprint as pp
import pandas as pd
import json
import requests
import os
import time
import datetime
import shutil
print('-----------------', file_path_list, "--------------------")
pp.pprint(src_file_name)

----------------- ['retail_db\\products\\products.txt'] --------------------
['retail_db\\categories\\categories.txt',
 'retail_db\\customers\\customers.txt',
 'retail_db\\departments\\departments.txt',
 'retail_db\\orders\\orders.txt',
 'retail_db\\order_items\\order_items.txt',
 'retail_db\\products\\products.txt']


In [167]:
for file in src_file_name:
    print(file,':')
    df = pd.read_csv(file, header=None)
    print(f'Shape of  {file} is {df.shape}')

retail_db\categories\categories.txt :
Shape of  retail_db\categories\categories.txt is (58, 3)
retail_db\customers\customers.txt :
Shape of  retail_db\customers\customers.txt is (12435, 9)
retail_db\departments\departments.txt :
Shape of  retail_db\departments\departments.txt is (6, 2)
retail_db\orders\orders.txt :
Shape of  retail_db\orders\orders.txt is (68883, 4)
retail_db\order_items\order_items.txt :
Shape of  retail_db\order_items\order_items.txt is (172198, 6)
retail_db\products\products.txt :
Shape of  retail_db\products\products.txt is (1345, 6)


In [168]:

# Use a context manager so the file handle is closed after loading
with open('retail_db/schemas.json', 'r') as f:
    json_schema = json.load(f)
pp.pprint(json_schema)

{'categories': [{'column_name': 'category_id',
                 'column_position': 1,
                 'data_type': 'integer'},
                {'column_name': 'category_department_id',
                 'column_position': 2,
                 'data_type': 'integer'},
                {'column_name': 'category_name',
                 'column_position': 3,
                 'data_type': 'string'}],
 'customers': [{'column_name': 'customer_id',
                'column_position': 1,
                'data_type': 'integer'},
               {'column_name': 'customer_fname',
                'column_position': 2,
                'data_type': 'string'},
               {'column_name': 'customer_lname',
                'column_position': 3,
                'data_type': 'string'},
               {'column_name': 'customer_email',
                'column_position': 4,
                'data_type': 'string'},
               {'column_name': 'customer_password',
                'column_position': 5,
       

In [169]:
def get_column_name(schemas, db_name, sorting_key='column_position'):
    # Look up the column definitions for one table and order them by position
    columns = schemas.get(db_name)
    columns_list = sorted(columns, key=lambda x: x[sorting_key])
    # pp.pprint(columns_list)
    column_name = [col['column_name'] for col in columns_list]
    return column_name


In [None]:
#order_col_name = get_column_name(json_schema, 'orders')
category_col_name = get_column_name(json_schema, 'categories')
print(category_col_name)
department_col_name = get_column_name(json_schema, 'departments')
print(department_col_name)

['category_id', 'category_department_id', 'category_name']
['department_id', 'department_name']
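
The helper relies on sorting the column definitions by `column_position` before extracting the names. The self-contained sketch below shows why that matters, using a hypothetical schema with the same shape as `schemas.json` but with its entries deliberately out of order. The function name `get_column_names` and the `KeyError` for a missing table are assumptions added for illustration, not part of the notebook.

```python
def get_column_names(schemas, table_name, sorting_key="column_position"):
    """Return the column names of one table, ordered by `sorting_key`."""
    columns = schemas.get(table_name)
    if columns is None:
        # Assumption: fail loudly instead of passing None to sorted().
        raise KeyError(f"table {table_name!r} not found in schema")
    ordered = sorted(columns, key=lambda col: col[sorting_key])
    return [col["column_name"] for col in ordered]


# Hypothetical schema: position 2 is listed before position 1.
sample_schema = {
    "departments": [
        {"column_name": "department_name", "data_type": "string", "column_position": 2},
        {"column_name": "department_id", "data_type": "integer", "column_position": 1},
    ]
}

names = get_column_names(sample_schema, "departments")
print(names)  # ['department_id', 'department_name']
```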


In [171]:
import pandas as pd
categories = pd.read_csv(r"C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db\categories\categories.txt",names=category_col_name)
print('***************************************************************')
print(categories.head(),end='\n\n')
print('****************************************************************')

***************************************************************
   category_id  category_department_id        category_name
0            1                       2             Football
1            2                       2               Soccer
2            3                       2  Baseball & Softball
3            4                       2           Basketball
4            5                       2             Lacrosse

****************************************************************


In [182]:
for file in src_file_name:
    # Split on backslashes (the Windows path separator) to get the path components
    file_path_list = re.split(r'[\\]+', file)
    print(file_path_list)
    

['retail_db', 'categories', 'categories.txt']
['retail_db', 'customers', 'customers.txt']
['retail_db', 'departments', 'departments.txt']
['retail_db', 'orders', 'orders.txt']
['retail_db', 'order_items', 'order_items.txt']
['retail_db', 'products', 'products.txt']
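
Splitting on a backslash with a regular expression works for the Windows-style paths shown here, but it would not split paths that use `/`. As an alternative sketch, `pathlib.PureWindowsPath` accepts both separators and exposes the components directly. The path below is copied from the output above.

```python
from pathlib import PureWindowsPath

# PureWindowsPath parses Windows-style paths on any operating system
# and treats both '\\' and '/' as separators.
path = PureWindowsPath("retail_db\\categories\\categories.txt")

parts = path.parts          # all path components
folder = path.parent.name   # name of the containing folder
stem = path.stem            # file name without its extension

print(parts)   # ('retail_db', 'categories', 'categories.txt')
print(folder)  # categories
print(stem)    # categories

# The same components are recovered from a forward-slash path.
same = PureWindowsPath("retail_db/categories/categories.txt").parts == parts
print(same)    # True
```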


In [192]:
file = src_file_name[0]
file

'retail_db\\categories\\categories.txt'

In [262]:
json_file_path = r'C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db_json'
json_full_file_path = []
for file in src_file_name:
    file_name = re.split(r'[\\]',file)
    # print(file_name)
    file_folder_name = file_name[-2]
    # print(file_folder_name)
    json_full_file_path.append(f'{json_file_path}/{file_folder_name}')

print(json_full_file_path)



['C:\\Users\\punitkumar.more\\Documents\\Elisa\\gcp_de\\GCP_DE\\AUTOMATION\\retail_db_json/categories', 'C:\\Users\\punitkumar.more\\Documents\\Elisa\\gcp_de\\GCP_DE\\AUTOMATION\\retail_db_json/customers', 'C:\\Users\\punitkumar.more\\Documents\\Elisa\\gcp_de\\GCP_DE\\AUTOMATION\\retail_db_json/departments', 'C:\\Users\\punitkumar.more\\Documents\\Elisa\\gcp_de\\GCP_DE\\AUTOMATION\\retail_db_json/orders', 'C:\\Users\\punitkumar.more\\Documents\\Elisa\\gcp_de\\GCP_DE\\AUTOMATION\\retail_db_json/order_items', 'C:\\Users\\punitkumar.more\\Documents\\Elisa\\gcp_de\\GCP_DE\\AUTOMATION\\retail_db_json/products']
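
Joining with an f-string and a literal `/` is what produces the mixed `\` and `/` separators in the output above. Windows generally tolerates that, but a path object keeps the separators consistent. The sketch below uses a hypothetical base directory (`C:\data\retail_db_json`) and two of the source paths to illustrate the idea.

```python
from pathlib import PureWindowsPath

# Hypothetical base directory, used only for this illustration.
base = PureWindowsPath(r"C:\data\retail_db_json")

src = [
    "retail_db\\categories\\categories.txt",
    "retail_db\\orders\\orders.txt",
]

# The '/' operator joins path components using the flavour's own separator.
targets = [str(base / PureWindowsPath(s).parent.name) for s in src]

for t in targets:
    print(t)
# C:\data\retail_db_json\categories
# C:\data\retail_db_json\orders
```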


In [275]:
import os
for f in json_full_file_path:
    if os.path.exists(f):
        print(f'path already exists {f}')
    else:
        print("creating the path")
        try:
            os.makedirs(f, exist_ok=True)
            print(f'path created successfully {f}')
        except OSError as err:
            # Catch only filesystem errors and report the reason
            print(f"Error in path creation {f}: {err}")


creating the path
path created successfully C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db_json/categories
creating the path
path created successfully C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db_json/customers
creating the path
path created successfully C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db_json/departments
creating the path
path created successfully C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db_json/orders
creating the path
path created successfully C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db_json/order_items
creating the path
path created successfully C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db_json/products
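
Because `os.makedirs` is called with `exist_ok=True`, the explicit `os.path.exists` check is only needed for the log message, not for correctness. A minimal sketch in a temporary directory (the folder names are hypothetical) shows that the call creates missing parent folders and can be repeated safely:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical nested target; neither folder exists yet.
    target = os.path.join(tmp, "retail_db_json", "categories")

    existed_before = os.path.isdir(target)
    os.makedirs(target, exist_ok=True)  # creates the parent and the leaf folder
    os.makedirs(target, exist_ok=True)  # second call does nothing and raises no error
    exists_after = os.path.isdir(target)

print(existed_before, exists_after)  # False True
```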


In [2]:
import pandas as pd
import json
import os
import pprint as pp
import glob


In [4]:
with open(r'C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\db_schemas.json', 'r') as f:
    schemas = json.load(f)
retail_db_json = r'C:\Users\punitkumar.more\Documents\Elisa\gcp_de\GCP_DE\AUTOMATION\retail_db_json'
# pp.pprint(schemas)
# recursive=True is only needed for '**' patterns, so it is omitted here
file_path_list = sorted(glob.iglob('retail_db/*/*.txt'))
print('Extracted the File path : ')
pp.pprint(file_path_list)
print("****"*25)

def get_column_name(schemas, *db_name_list, sorting_key='column_position'):
    # Files and table names are paired by position, so both lists must be
    # sorted the same way and describe the same tables.
    for db_name, file in zip(db_name_list, file_path_list):
        print('DB_NAME : ', db_name)
        column_name_list = schemas.get(db_name)
        print(f'Columns name list of {db_name} : ')
        print(column_name_list)
        column_names = [col['column_name'] for col in sorted(column_name_list, key=lambda x: x[sorting_key])]
        print(f'{db_name} Column name : ', column_names)
        print('----------'*25)

        try:
            df = pd.read_csv(file, names=column_names)
        except FileNotFoundError:
            # Skip this table; df would be undefined below otherwise
            print("no such file")
            continue
        base_file_name = os.path.splitext(os.path.basename(file))[0]
        print(f'base_file_name: {base_file_name}')
        retail_db_json_path = os.path.join(retail_db_json, db_name)
        retail_db_json_full_path = os.path.join(retail_db_json_path, f'{base_file_name}.json')
        print(f'retail_db_json_full_path : {retail_db_json_full_path}')
        if os.path.exists(retail_db_json_path):
            print(f"loading json for {db_name}")
        # exist_ok=True makes this safe whether or not the folder already exists
        os.makedirs(retail_db_json_path, exist_ok=True)
        # Write one JSON object per line (JSON Lines)
        df.to_json(retail_db_json_full_path, orient='records', lines=True)


db_name_list = sorted(schemas.keys())
print("Extracted the table name from JSON : ")
pp.pprint(db_name_list)
print('****'*25)

get_column_name(schemas, *db_name_list)



Extracted the File path : 
['retail_db\\categories\\categories.txt',
 'retail_db\\customers\\customers.txt',
 'retail_db\\departments\\departments.txt',
 'retail_db\\order_items\\order_items.txt',
 'retail_db\\orders\\orders.txt',
 'retail_db\\products\\products.txt']
****************************************************************************************************
Extracted the table name from JSON : 
['categories', 'customers', 'departments', 'order_items', 'orders', 'products']
****************************************************************************************************
DB_NAME :  categories
Columns name list of categories : 
[{'column_name': 'category_id', 'data_type': 'integer', 'column_position': 1}, {'column_name': 'category_department_id', 'data_type': 'integer', 'column_position': 2}, {'column_name': 'category_name', 'data_type': 'string', 'column_position': 3}]
categories Column name :  ['category_id', 'category_department_id', 'category_name']
----------------------
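
The conversion cell pairs each file with a table name by position, which works only while the sorted file list and the sorted schema keys line up one to one. One possible alternative, sketched here under the assumption that each file sits in a folder named after its table, is to derive the table name from the path itself. The function name `pair_files_with_tables` and the sample data are hypothetical.

```python
from pathlib import PureWindowsPath


def pair_files_with_tables(file_paths, schemas):
    """Map table name -> file path, using the parent folder as the table name."""
    pairs = {}
    for path in file_paths:
        # PureWindowsPath handles both '\\' and '/' separators.
        table = PureWindowsPath(path).parent.name
        if table in schemas:
            pairs[table] = path
    return pairs


# Hypothetical inputs: one extra folder with no schema entry,
# and one schema entry with no matching file.
files = [
    "retail_db\\orders\\orders.txt",
    "retail_db\\order_items\\order_items.txt",
    "retail_db\\scratch\\notes.txt",
]
sample_schemas = {"orders": [], "order_items": [], "products": []}

pairs = pair_files_with_tables(files, sample_schemas)
print(pairs)
# {'orders': 'retail_db\\orders\\orders.txt',
#  'order_items': 'retail_db\\order_items\\order_items.txt'}
```

With this mapping, an unexpected folder or a missing file no longer shifts every later pairing by one position.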