# Extracting Metadata from Filenames in Python Using the `glob2` Package

## Installing the Package

GitHub Page with Docs: https://github.com/miracle2k/python-glob2/

In [13]:
%pip install glob2

Note: you may need to restart the kernel to use updated packages.


## Our Problem

Below are the filenames in our directory. We want to:

1. Get only the CSV files
2. Get metadata from each filename:
   - The file's ID (the number after the "t")
   - The town the file came from (Southampton, Queenstown, or Cherbourg)
   - Whether the data comes from survivors or non-survivors (1 and 0, respectively)





In [14]:
example_filenames = [
    't604697_sout_1.csv',
    't82533_quee_1.csv',
    't88553_quee_0.csv',
    't244431_sout_0.csv',
    '.gitpod.yml',
    'aa.py',
    't61137_cher_1.csv',
    't13387_cher_0.csv',
    '.git'
]

#### Set Up the Example

In the code below, I'm simply overriding glob2's `Globber.listdir` method so that it returns our example filenames instead of the actual files on the computer. This just makes the example easier to try out.

Note: This wouldn't be done in a real-world situation.

In [15]:
from unittest.mock import Mock
from glob2 import Globber

Globber.listdir = Mock()
Globber.listdir.return_value = example_filenames
Globber.listdir()

['t604697_sout_1.csv',
 't82533_quee_1.csv',
 't88553_quee_0.csv',
 't244431_sout_0.csv',
 '.gitpod.yml',
 'aa.py',
 't61137_cher_1.csv',
 't13387_cher_0.csv',
 '.git']
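
If mocking feels too artificial, another option is to create empty files with the same names in a temporary directory and glob them for real. The sketch below uses only the standard library (`tempfile` and `pathlib`) rather than `glob2`, and it treats `.git` as an empty file instead of a directory for simplicity.

```python
import tempfile
from pathlib import Path

example_filenames = [
    't604697_sout_1.csv',
    't82533_quee_1.csv',
    't88553_quee_0.csv',
    't244431_sout_0.csv',
    '.gitpod.yml',
    'aa.py',
    't61137_cher_1.csv',
    't13387_cher_0.csv',
    '.git',
]

# Create empty placeholder files, glob them, then let the directory clean itself up
with tempfile.TemporaryDirectory() as tmp:
    for name in example_filenames:
        (Path(tmp) / name).touch()
    csv_names = sorted(p.name for p in Path(tmp).glob('*.csv'))
```

After the `with` block finishes, the temporary directory and its files are removed, so nothing is left behind on disk.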

## What Can We Do?

All we need is the `glob()` function from the `glob2` package. It works like the built-in `glob.glob()` but accepts extra keyword arguments, such as `with_matches`.

In [16]:
from glob2 import glob

#### Find All Files, just like the built-in glob.glob()

In [17]:
glob('*')

['t604697_sout_1.csv',
 't82533_quee_1.csv',
 't88553_quee_0.csv',
 't244431_sout_0.csv',
 'aa.py',
 't61137_cher_1.csv',
 't13387_cher_0.csv']

Note that the hidden entries (`.gitpod.yml` and `.git`) are missing from the result: like the built-in `glob.glob()`, `glob2` skips names that start with a dot by default.

#### Get all Files that match a wildcard pattern:

In [18]:
glob('*.csv')

['t604697_sout_1.csv',
 't82533_quee_1.csv',
 't88553_quee_0.csv',
 't244431_sout_0.csv',
 't61137_cher_1.csv',
 't13387_cher_0.csv']
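
Shell-style wildcards can also be tested against a plain list of names, without touching the file system, using the standard library's `fnmatch` module. This is a stdlib sketch rather than a `glob2` feature; the `example_filenames` list is repeated here so the block runs on its own.

```python
from fnmatch import filter as fnmatch_filter

example_filenames = [
    't604697_sout_1.csv',
    't82533_quee_1.csv',
    't88553_quee_0.csv',
    't244431_sout_0.csv',
    '.gitpod.yml',
    'aa.py',
    't61137_cher_1.csv',
    't13387_cher_0.csv',
    '.git',
]

# Keep only the names that match the wildcard pattern, in their original order
csv_only = fnmatch_filter(example_filenames, '*.csv')
```

This can be a quick way to check a pattern before pointing it at a real directory.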

#### Retain the data that matched each wildcard! 

Notice below that, by placing a wildcard between each of the separators in the filename (here, the underscores), we get each filename *and* the text that each wildcard matched.

In [19]:
glob("t*_*_*.csv", with_matches=True)

[('t604697_sout_1.csv', ('604697', 'sout', '1')),
 ('t82533_quee_1.csv', ('82533', 'quee', '1')),
 ('t88553_quee_0.csv', ('88553', 'quee', '0')),
 ('t244431_sout_0.csv', ('244431', 'sout', '0')),
 ('t61137_cher_1.csv', ('61137', 'cher', '1')),
 ('t13387_cher_0.csv', ('13387', 'cher', '0'))]
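
To build some intuition for what `with_matches=True` returns, the same tuples can be reproduced with the standard library's `re` module by writing each `*` as a capture group. This is only an illustration of the idea, not a description of how `glob2` is implemented, and the regular expression is hand-written for this one pattern.

```python
import re

example_filenames = [
    't604697_sout_1.csv',
    't82533_quee_1.csv',
    't88553_quee_0.csv',
    't244431_sout_0.csv',
    '.gitpod.yml',
    'aa.py',
    't61137_cher_1.csv',
    't13387_cher_0.csv',
    '.git',
]

# Hand-written regex standing in for the glob pattern "t*_*_*.csv";
# each (.*) plays the role of one "*" wildcard
pattern = re.compile(r"t(.*)_(.*)_(.*)\.csv")

results = []
for fname in example_filenames:
    match = pattern.fullmatch(fname)
    if match:  # skip names that do not fit the pattern
        results.append((fname, match.groups()))
```

Here `results` holds `(filename, captures)` pairs in the same shape as the `glob2` output, which is why the loop in the next section can unpack them directly.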

#### Useful Pattern: Organize filenames and metadata into a Pandas DataFrame

In [20]:
import pandas as pd
 
records = []
for fname, (id_number, city, survived) in glob("t*_*_*.csv", with_matches=True):
    record = {'filename': fname, 'id': id_number, 'city': city, 'survived': survived}
    records.append(record)
    
pd.DataFrame(records)
    

|   | filename | id | city | survived |
|---|---|---|---|---|
| 0 | t604697_sout_1.csv | 604697 | sout | 1 |
| 1 | t82533_quee_1.csv | 82533 | quee | 1 |
| 2 | t88553_quee_0.csv | 88553 | quee | 0 |
| 3 | t244431_sout_0.csv | 244431 | sout | 0 |
| 4 | t61137_cher_1.csv | 61137 | cher | 1 |
| 5 | t13387_cher_0.csv | 13387 | cher | 0 |
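
One thing to keep in mind is that every captured value is still a string. A minimal sketch of converting them to more convenient types before building the DataFrame follows; the helper name `to_typed_record` is made up for illustration, and the `matches` list is copied from the `with_matches=True` output shown earlier rather than produced by a live `glob()` call.

```python
# Copied from the glob("t*_*_*.csv", with_matches=True) output shown earlier
matches = [
    ('t604697_sout_1.csv', ('604697', 'sout', '1')),
    ('t82533_quee_1.csv', ('82533', 'quee', '1')),
    ('t88553_quee_0.csv', ('88553', 'quee', '0')),
    ('t244431_sout_0.csv', ('244431', 'sout', '0')),
    ('t61137_cher_1.csv', ('61137', 'cher', '1')),
    ('t13387_cher_0.csv', ('13387', 'cher', '0')),
]


def to_typed_record(fname, groups):
    """Hypothetical helper: turn one (filename, captures) pair into a typed record."""
    id_number, city, survived = groups
    return {
        'filename': fname,
        'id': int(id_number),         # numeric string -> int
        'city': city,                 # left as the four-letter code
        'survived': survived == '1',  # '1' -> True, '0' -> False
    }


typed_records = [to_typed_record(fname, groups) for fname, groups in matches]
```

Passing `typed_records` to `pd.DataFrame()` should then give integer and boolean columns instead of plain strings.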


That's it!  Pretty neat, right?