# Using Python to inventory files

This notebook uses pandas (a data analysis library) to inventory the files in the UMMAA collection, which provide the content for the final site project.

In [1]:
import glob
from pathlib import Path
import os
from collections import Counter

In [2]:
# if you need to install... pip!
#!pip install pandas
#!pip install openpyxl

In [3]:
import pandas as pd

In [4]:
loc = os.getcwd()
print(loc)

/Users/jajohnst/Desktop/si676-2025-data/examples


In [5]:
museum_file_path = Path('/','Users','jajohnst','Documents','teaching','2025-4-SI-676-NetworkedLAMS','ummaa_project_data_samples')

os.path.isdir(museum_file_path)

True

Set up some file counting functions

In [6]:
def count_files_by_extension(directory):
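    '''
    Takes a directory path (a string or a Path object).
    Returns a Counter mapping each lowercased file extension to the number
    of files that have it, searching all subdirectories.
    '''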
    path = Path(directory)
    extensions = Counter(
        p.suffix.lower() if p.suffix else '[no extension]'
        for p in path.rglob('*')
        if p.is_file()
    )
    return extensions

def count_files_by_directory(directory):
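    '''
    Takes a directory path (a string or a Path object).
    Returns a dict mapping each immediate subdirectory name to the number of
    files under it (searched recursively). Files sitting directly in
    `directory` are not counted.
    '''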
    path = Path(directory)
    dir_counts = {}
    for subdir in path.iterdir():
        if subdir.is_dir():
            file_count = sum(1 for p in subdir.rglob('*') if p.is_file())
            dir_counts[subdir.name] = file_count
    return dir_counts
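
The two functions walk the folder tree differently, which is easy to see on a throwaway directory. The sketch below repeats their logic inline on a made-up layout (the folder and file names are hypothetical, not the museum data) so that it runs on its own:

```python
import tempfile
from collections import Counter
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: two category folders plus one loose file at the top level
    (root / 'Baskets').mkdir()
    (root / 'Pipes' / 'detail').mkdir(parents=True)
    for rel in ['Baskets/a.jpg', 'Baskets/b.JPG', 'Pipes/c.jpg',
                'Pipes/detail/d.jpg', 'notes.docx']:
        (root / rel).touch()

    # Same logic as count_files_by_extension(): every file anywhere under root
    by_ext = Counter(
        p.suffix.lower() if p.suffix else '[no extension]'
        for p in root.rglob('*') if p.is_file()
    )

    # Same logic as count_files_by_directory(): only files inside subdirectories
    by_dir = {
        sub.name: sum(1 for p in sub.rglob('*') if p.is_file())
        for sub in root.iterdir() if sub.is_dir()
    }

print(dict(by_ext))
print(by_dir)
```

The extension count includes the loose `notes.docx`, while the directory count only sees files that live inside a subfolder, so the two totals can differ.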

In [7]:
extension_counts = count_files_by_extension(museum_file_path)
file_counts = count_files_by_directory(museum_file_path)

print('Files listed by extension:',extension_counts,'\n')
print('Files by directory:', file_counts,'\n')

Files listed by extension: Counter({'.jpg': 2508, '.zip': 8, '.xlsx': 2, '.docx': 1, '[no extension]': 1}) 

Files by directory: {'Baskets': 99, 'Other Ethnographic': 83, 'Weapons & Armor': 258, 'Pipes': 33, 'Textiles': 31, 'Herbarium Sheets': 715, 'Containers': 840, 'Dress and Personal Adornment': 449} 



In [8]:
for k, v in extension_counts.items():
    print(k, v)

.zip 8
.docx 1
[no extension] 1
.xlsx 2
.jpg 2508
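
The `[no extension]` entry comes from how `pathlib` defines a suffix. A minimal sketch of the labeling rule used above, wrapped in a hypothetical helper named `extension_label` (some of the file names are invented for illustration):

```python
from pathlib import PurePosixPath

def extension_label(name):
    """Apply the same rule as the counting functions to a bare file name."""
    suffix = PurePosixPath(name).suffix
    return suffix.lower() if suffix else '[no extension]'

# '17585_1.jpg', '.DS_Store' and 'Textiles.zip' appear in this collection;
# the other names are made up to show edge cases
for name in ['17585_1.jpg', 'SCAN.JPG', '.DS_Store', 'Textiles.zip',
             'backup.tar.gz', 'README']:
    print(f'{name:<14} -> {extension_label(name)}')
```

A name that starts with a dot and has no other dot, such as `.DS_Store`, has an empty suffix, and a compound name such as `backup.tar.gz` reports only its last suffix.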


In [9]:
# function for getting more file details
def get_file_details(directory):
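    '''
    Takes a directory path (a string or a Path object).
    Returns a pandas dataframe with one row per file found under it,
    including files in subdirectories.
    Note that 'full_path' holds the path of the folder containing the file,
    not the path of the file itself.
    '''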
    path = Path(directory)
    file_data = []

    for file in path.rglob('*'):
        if file.is_file():
            file_data.append({
                'directory': file.parent.name,
                'full_path': str(file.parent),
                'filename': file.name,
                'extension': file.suffix.lower() if file.suffix else '[no extension]',
                'size_bytes': file.stat().st_size
            })

    return pd.DataFrame(file_data)

Another function gets the same file information but records the top-level subdirectory each file belongs to, so the results can be summarized by directory:

In [10]:
def get_directory_file_summary(directory):
    '''
    Takes a directory path (a string or a Path object).
    Returns a pandas dataframe with one row per file found inside each
    immediate subdirectory. Sizes are given in bytes, in binary units
    (KiB, MiB), and in decimal units (kB, MB).
    '''
    path = Path(directory)
    data = []

    for subdir in path.iterdir():
        if subdir.is_dir():
            for file in subdir.rglob('*'):
                if file.is_file():
                    size = file.stat().st_size  # query the file system once per file
                    data.append({
                        'directory': subdir.name,
                        'filename': file.name,
                        'file_ext': file.suffix.lower().strip('.') if file.suffix else '[no extension]',
                        'size_bytes': size,
                        'size_kib': round(size / 1024, 1),
                        'size_kb' : round(size / 1000),
                        'size_mib': round(size / 1024**2, 1),
                        'size_mb' : round(size / 1000**2),
                        'rel_path': str(file.parent.name) + '/' + str(file.name),
                        'full_path': str(file.parent),  # folder that contains the file
                    })
    return pd.DataFrame(data)
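
The size columns mix binary units (KiB and MiB, powers of 1024) with decimal units (kB and MB, powers of 1000). As a rough illustration, the hypothetical helper `size_columns` below applies the same arithmetic to a single byte count, here the 4,918,008 bytes of `17585_1.jpg` from the Baskets folder:

```python
def size_columns(size_bytes):
    """Convert a byte count using the same rounding as get_directory_file_summary()."""
    return {
        'size_bytes': size_bytes,
        'size_kib': round(size_bytes / 1024, 1),     # kibibytes: 1 KiB = 1024 bytes
        'size_kb': round(size_bytes / 1000),         # kilobytes: 1 kB = 1000 bytes
        'size_mib': round(size_bytes / 1024**2, 1),  # mebibytes: 1 MiB = 1024**2 bytes
        'size_mb': round(size_bytes / 1000**2),      # megabytes: 1 MB = 1000**2 bytes
    }

sizes = size_columns(4918008)
for unit, value in sizes.items():
    print(f'{unit:<10} {value}')
```

The same file is about 4.7 MiB but rounds to 5 MB, which is why the two families of columns never quite agree.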


## Making dataframes

First, a basic approach that uses the file count function to list directories and record the number of files:

In [11]:
file_counts = count_files_by_directory(museum_file_path)

ummaa_phil_digitized_item_info = pd.DataFrame(list(file_counts.items()),
                  columns=['directory','file_count'])

In [12]:
ummaa_phil_digitized_item_info

Unnamed: 0,directory,file_count
0,Baskets,99
1,Other Ethnographic,83
2,Weapons & Armor,258
3,Pipes,33
4,Textiles,31
5,Herbarium Sheets,715
6,Containers,840
7,Dress and Personal Adornment,449


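Next, another approach uses the `get_file_details()` function to get information about every file in the directory and its subdirectories. Unlike `get_directory_file_summary()`, this also picks up files that sit directly in the top-level folder (such as the .zip archives, the .docx index, and .DS_Store). The result could then be filtered by parent directory.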

In [13]:
all_ummaa_digitized_phil_files_df = get_file_details(museum_file_path)

In [14]:
all_ummaa_digitized_phil_files_df.head()

Unnamed: 0,directory,full_path,filename,extension,size_bytes
0,ummaa_project_data_samples,/Users/jajohnst/Documents/teaching/2025-4-SI-6...,Textiles.zip,.zip,209515208
1,ummaa_project_data_samples,/Users/jajohnst/Documents/teaching/2025-4-SI-6...,Dress and Personal Adornment.zip,.zip,1386456944
2,ummaa_project_data_samples,/Users/jajohnst/Documents/teaching/2025-4-SI-6...,UMMAA Philippine Dropbox Index.docx,.docx,19156
3,ummaa_project_data_samples,/Users/jajohnst/Documents/teaching/2025-4-SI-6...,.DS_Store,[no extension],14340
4,ummaa_project_data_samples,/Users/jajohnst/Documents/teaching/2025-4-SI-6...,Weapons & Armor.zip,.zip,10965264
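
Filtering by parent directory could look something like the following sketch. It assumes pandas is installed and rebuilds a tiny dataframe from a few rows that appear in this notebook's outputs, so it runs without the museum files; the variable names are illustrative only:

```python
import pandas as pd

# A handful of rows copied from outputs in this notebook, using the
# column names that get_file_details() produces (full_path omitted)
sample = pd.DataFrame([
    {'directory': 'ummaa_project_data_samples', 'filename': 'Textiles.zip',
     'extension': '.zip', 'size_bytes': 209515208},
    {'directory': 'ummaa_project_data_samples', 'filename': '.DS_Store',
     'extension': '[no extension]', 'size_bytes': 14340},
    {'directory': 'Baskets', 'filename': '17585_1.jpg',
     'extension': '.jpg', 'size_bytes': 4918008},
    {'directory': 'Baskets', 'filename': '8130_1.jpg',
     'extension': '.jpg', 'size_bytes': 3745167},
    {'directory': 'Textiles', 'filename': '2002-3-4.jpg',
     'extension': '.jpg', 'size_bytes': 14038771},
])

# Boolean masks select rows; ~ negates a mask
baskets_only = sample[sample['directory'] == 'Baskets']
images_only = sample[sample['extension'] == '.jpg']
no_hidden = sample[~sample['filename'].str.startswith('.')]

print(baskets_only[['filename', 'size_bytes']])
print(len(images_only), 'image rows;', len(no_hidden), 'rows after dropping hidden files')
```

The same masks would apply unchanged to the full dataframe returned by `get_file_details()`.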


Now, grouping files by directory:

In [15]:
ummaa_phil_digitized_files_by_dir = get_directory_file_summary(museum_file_path)

ummaa_phil_digitized_files_by_dir.head()

Unnamed: 0,directory,filename,file_ext,size_bytes,size_kib,size_kb,size_mib,size_mb,rel_path,full_path
0,Baskets,17585_1.jpg,jpg,4918008,4802.7,4918,4.7,5,Baskets/17585_1.jpg,/Users/jajohnst/Documents/teaching/2025-4-SI-6...
1,Baskets,1989-74-79_1.jpg,jpg,4305781,4204.9,4306,4.1,4,Baskets/1989-74-79_1.jpg,/Users/jajohnst/Documents/teaching/2025-4-SI-6...
2,Baskets,8130_1.jpg,jpg,3745167,3657.4,3745,3.6,4,Baskets/8130_1.jpg,/Users/jajohnst/Documents/teaching/2025-4-SI-6...
3,Baskets,17601_1.jpg,jpg,4219945,4121.0,4220,4.0,4,Baskets/17601_1.jpg,/Users/jajohnst/Documents/teaching/2025-4-SI-6...
4,Baskets,23835_1.jpg,jpg,4792470,4680.1,4792,4.6,5,Baskets/23835_1.jpg,/Users/jajohnst/Documents/teaching/2025-4-SI-6...


In [16]:
ummaa_phil_digitized_files_by_dir.shape

(2508, 10)
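
The 2,508 rows can be cross-checked against the counts printed earlier. This sketch copies those counts into plain dictionaries (rather than recomputing them from disk) and compares the totals:

```python
# Counts copied from the outputs shown earlier in this notebook
file_counts = {
    'Baskets': 99, 'Other Ethnographic': 83, 'Weapons & Armor': 258, 'Pipes': 33,
    'Textiles': 31, 'Herbarium Sheets': 715, 'Containers': 840,
    'Dress and Personal Adornment': 449,
}
extension_counts = {'.jpg': 2508, '.zip': 8, '.xlsx': 2, '.docx': 1, '[no extension]': 1}

files_in_subdirs = sum(file_counts.values())      # what the by-directory walk sees
files_anywhere = sum(extension_counts.values())   # what the whole-tree walk sees
loose_files = files_anywhere - files_in_subdirs   # files directly in the root folder

print('files inside category folders:', files_in_subdirs)
print('files anywhere under the root:', files_anywhere)
print('files directly in the root folder:', loose_files)
```

The per-directory counts add up to the dataframe's row count, and the remaining files are the archives, documents, and hidden file that sit beside the category folders.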

Using the above, generate some summary information:

In [17]:
# total file size
print('Total file size by directory:\n\t')
print(ummaa_phil_digitized_files_by_dir.groupby('directory')['size_bytes'].sum().sort_values(ascending=False))

by_directory_size = pd.DataFrame(ummaa_phil_digitized_files_by_dir.groupby('directory')['size_bytes'].sum().sort_values(ascending=False))
by_directory_size

Total file size by directory:
	
directory
Containers                      3063730176
Dress and Personal Adornment    1386396276
Baskets                          415215527
Other Ethnographic               302461453
Textiles                         209510890
Pipes                            118946151
Herbarium Sheets                  54875689
Weapons & Armor                   10931358
Name: size_bytes, dtype: int64


directory,size_bytes
Containers,3063730176
Dress and Personal Adornment,1386396276
Baskets,415215527
Other Ethnographic,302461453
Textiles,209510890
Pipes,118946151
Herbarium Sheets,54875689
Weapons & Armor,10931358


In [18]:
# size by file types
print("Total size by file type:\n\t")
print(ummaa_phil_digitized_files_by_dir.groupby('file_ext')['size_bytes'].sum().sort_values(ascending=False))

by_file_type = ummaa_phil_digitized_files_by_dir.groupby('file_ext').agg({
    'size_bytes': 'sum',
    'filename': 'count'
}).rename(columns={'filename':'File count', 'size_bytes':'Total bytes'}).sort_values('Total bytes', ascending=False)

by_file_type['size_mib'] = round(by_file_type['Total bytes'] / (1024**2), 2)
by_file_type['size_mb'] = round(by_file_type['Total bytes'] / (1000**2))

by_file_type = by_file_type[['File count','Total bytes','size_mib','size_mb']]

by_file_type

Total size by file type:
	
file_ext
jpg    5562067520
Name: size_bytes, dtype: int64


file_ext,File count,Total bytes,size_mib,size_mb
jpg,2508,5562067520,5304.4,5562.0
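
As a sanity check, the per-directory byte totals should add up to the single total reported by file type. A small sketch using the figures from the outputs above, which also prints each directory's share of the whole:

```python
# Byte totals copied from the "Total file size by directory" output above
size_by_directory = {
    'Containers': 3063730176,
    'Dress and Personal Adornment': 1386396276,
    'Baskets': 415215527,
    'Other Ethnographic': 302461453,
    'Textiles': 209510890,
    'Pipes': 118946151,
    'Herbarium Sheets': 54875689,
    'Weapons & Armor': 10931358,
}

total_bytes = sum(size_by_directory.values())
total_mib = round(total_bytes / 1024**2, 2)
print('total bytes:', total_bytes)
print('total MiB:', total_mib)

# Share of the collection's bytes held by each directory, largest first
for name, size in sorted(size_by_directory.items(), key=lambda item: item[1], reverse=True):
    print(f'{name:<30} {size / total_bytes:6.1%}')
```

The sum matches the jpg total, which is expected because every file inside the category folders is a .jpg.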


File count and total size by directory

In [19]:
print("\nFile count and total size by directory:")
files_by_directory_summary = ummaa_phil_digitized_files_by_dir.groupby('directory').agg({
    'filename': 'count',
    'size_bytes': 'sum'
}).rename(columns={'filename':'File count'})

# readable file sizes
files_by_directory_summary['KiB'] = round(files_by_directory_summary['size_bytes'] / 1024, 1)
files_by_directory_summary['kB'] = round(files_by_directory_summary['size_bytes'] / 1000, 1)
files_by_directory_summary['MiB'] = round(files_by_directory_summary['size_bytes'] / 1024**2, 1)
files_by_directory_summary['MB'] = round(files_by_directory_summary['size_bytes'] / 1000**2, 0)
# rename() returns a new dataframe, so assign the result back
files_by_directory_summary = files_by_directory_summary.rename(columns={'size_bytes': 'Total bytes'})

files_by_directory_summary


File count and total size by directory:


directory,File count,Total bytes,KiB,kB,MiB,MB
Baskets,99,415215527,405483.9,415215.5,396.0,415.0
Containers,840,3063730176,2991924.0,3063730.2,2921.8,3064.0
Dress and Personal Adornment,449,1386396276,1353902.6,1386396.3,1322.2,1386.0
Herbarium Sheets,715,54875689,53589.5,54875.7,52.3,55.0
Other Ethnographic,83,302461453,295372.5,302461.5,288.4,302.0
Pipes,33,118946151,116158.4,118946.2,113.4,119.0
Textiles,31,209510890,204600.5,209510.9,199.8,210.0
Weapons & Armor,258,10931358,10675.2,10931.4,10.4,11.0


In [20]:
print("Largest files:\n\t")
ten_largest = (ummaa_phil_digitized_files_by_dir
               .nlargest(10, 'size_bytes')[['directory', 'filename', 'size_bytes', 'size_mb', 'size_mib']]
               .rename(columns={'filename':'File', 'size_bytes':'Total bytes', 'size_mb':'MB', 'size_mib':'MiB'})
)

ten_largest

Largest files:
	


Unnamed: 0,directory,File,Total bytes,MB,MiB
476,Textiles,2002-3-4.jpg,14038771,14,13.4
485,Textiles,2002-3-24.jpg,13057970,13,12.5
492,Textiles,2002-3-22.jpg,11126876,11,10.6
489,Textiles,2002-3-33.jpg,10719075,11,10.2
486,Textiles,2002-3-32.jpg,10343317,10,9.9
477,Textiles,2002-3-40.jpg,9869808,10,9.4
484,Textiles,2002-3-30.jpg,9354311,9,8.9
498,Textiles,2002-3-39.jpg,8723641,9,8.3
479,Textiles,2002-3-3.jpg,7721780,8,7.4
474,Textiles,2002-3-7.jpg,6995894,7,6.7


In [None]:
inventory_basket_files = (ummaa_phil_digitized_files_by_dir
                          .loc[ummaa_phil_digitized_files_by_dir['directory'] == 'Baskets',
                            ['directory','filename','file_ext','size_bytes','size_kb','size_mb','rel_path']]
                          .rename(columns={'directory':'Directory Grouping','filename':'File','file_ext':'Extension'})
)

inventory_basket_files

Unnamed: 0,Directory Grouping,File,Extension,size_bytes,size_kb,size_mb,rel_path
0,Baskets,17585_1.jpg,jpg,4918008,4918,5,Baskets/17585_1.jpg
1,Baskets,1989-74-79_1.jpg,jpg,4305781,4306,4,Baskets/1989-74-79_1.jpg
2,Baskets,8130_1.jpg,jpg,3745167,3745,4,Baskets/8130_1.jpg
3,Baskets,17601_1.jpg,jpg,4219945,4220,4,Baskets/17601_1.jpg
4,Baskets,23835_1.jpg,jpg,4792470,4792,5,Baskets/23835_1.jpg
...,...,...,...,...,...,...,...
94,Baskets,92126_1.jpg,jpg,4545584,4546,5,Baskets/92126_1.jpg
95,Baskets,8654_1.jpg,jpg,4545323,4545,5,Baskets/8654_1.jpg
96,Baskets,1989-74-30_1.jpg,jpg,4272009,4272,4,Baskets/1989-74-30_1.jpg
97,Baskets,17574_1.jpg,jpg,5517381,5517,6,Baskets/17574_1.jpg


In [None]:
# a function to create the above for other named categories
def create_subinventory_df(category, full_df):
    '''
    category: a string naming a directory (a value in the 'directory' column)
    full_df: the pandas dataframe returned by get_directory_file_summary()
    Returns a dataframe listing only the files in that directory.
    '''
    columns = ['directory','filename','file_ext','size_bytes','size_kb','size_mb','rel_path']
    return (full_df
            .loc[full_df['directory'] == category, columns]
            .rename(columns={'directory':'Directory Grouping','filename':'File','file_ext':'Extension'})
    )

## Write to a file

Create an Excel workbook with each of the desired dataframes on its own sheet:

- full inventory (all files with paths, sizes, etc.)
- file counts and sizes by directory
- files by type
- files in Baskets
- files in Other Ethnographic
- files in Textiles
- 10 largest files

In [25]:
with pd.ExcelWriter('ummaa_phil_digitized_files_inventories.xlsx', engine='openpyxl') as writer:
    ummaa_phil_digitized_files_by_dir.to_excel(writer, sheet_name='All files', index=False)
    files_by_directory_summary.to_excel(writer, sheet_name='File counts by directory')
    by_file_type.to_excel(writer, sheet_name='File type count')
    inventory_basket_files.to_excel(writer, sheet_name='Inventory by baskets', index=False)
    # ethnographic
    create_subinventory_df('Other Ethnographic',ummaa_phil_digitized_files_by_dir).to_excel(writer, sheet_name='Inventory by Other ethnographic', index=False)
    # textiles
    create_subinventory_df('Textiles',ummaa_phil_digitized_files_by_dir).to_excel(writer, sheet_name='Inventory by Textiles', index=False)
    ten_largest.to_excel(writer, sheet_name='Ten largest files',index=False)

print('wrote your inventory!')

wrote your inventory!
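
Excel limits worksheet names to 31 characters and does not allow the characters `\ / ? * [ ] :` in them, which matters if sheet names are built from directory names. The helper below, `safe_sheet_name`, is a hypothetical addition (it is not part of pandas or openpyxl) that sketches one way to build names that stay within those rules:

```python
import re

INVALID_SHEET_CHARS = re.compile(r'[\\/*?:\[\]]')  # characters Excel rejects in sheet names
MAX_SHEET_NAME_LENGTH = 31                         # Excel's limit on sheet name length

def safe_sheet_name(name, prefix='Inventory by '):
    """Replace disallowed characters and cut the result to Excel's length limit."""
    cleaned = INVALID_SHEET_CHARS.sub('-', prefix + name).strip()
    return cleaned[:MAX_SHEET_NAME_LENGTH]

for directory in ['Baskets', 'Other Ethnographic', 'Weapons & Armor',
                  'Dress and Personal Adornment']:
    sheet = safe_sheet_name(directory)
    print(f'{len(sheet):>2}  {sheet}')
```

Truncation can make two long names collide, so it may be worth checking that the generated names are unique before writing the workbook.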
