# Navigating the Filesystem

This notebook covers the essential skills of navigating and managing files and directories, a fundamental part of handling experimental data in neuroscience research. We will work through commands and techniques for organizing and accessing experimental data so that it fits cleanly into your analysis workflow.

Run the following code to create the data for this notebook:

In [75]:
from pathlib import Path
paths = [
    "data/exp1/joey_2021-05-01_001/spikes.npy", 
    "data/exp1/joey_2021-05-02_001/spikes.npy", 
    "data/exp1/joey_2021-05-02_001/lfps.h5", 
    "data/exp1/phoebe_2021-05-02_001/spikes.npy",
    "data/exp1/phoebe_2021-05-03_001/spikes.npy", 
    "data/exp1/phoebe_2021-05-03_001/lfps.h5", 
    "data/exp1/phoebe_2021-05-04_001/spikes.npy",
]

for path in paths:
    path = Path(path)
    path.parent.mkdir(exist_ok=True, parents=True)
    path.touch()

## Using the pathlib library

The `pathlib` module in Python provides an object-oriented approach to file system paths. This section introduces the library so that you can handle file paths and directories more flexibly and intuitively. We'll cover basic operations like listing directories and globbing for pattern matching.

| Command | Purpose |
| :-- | :-- |
| `from pathlib import Path` | Import the `Path` class. |
| `Path.cwd()` | Gets the current working directory. |
| `Path('.').resolve()` | Also gets the current working directory. |
| `path = Path('./data')` | Make a `Path` object located in the data folder of the working directory. |
| `list(path.iterdir())` | List all the files and folders in the specified path. |
| `new_path = path.joinpath("raw")` | Append the "/raw" folder to the current path. |
| `new_path = path / "raw"` | Also append the "/raw" folder to the current path. |
| `list(path.glob('*.h5'))` | Search for files that end in ".h5" in the given path. |
| `list(path.glob('data*'))` | Search for files that start with "data" in the given path. |
| `list(path.glob('**/data*'))` | Search for files that start with "data" in the given path and all of its subfolders. |
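
The commands in the table can be tried out safely on a throwaway folder. The following is a minimal sketch, assuming a temporary directory and made-up file names (`raw/session1.h5`, `data_summary.csv`, and so on are illustrative only, not part of the notebook's dataset):

```python
import tempfile
from pathlib import Path

# Build a small throwaway tree so the example never touches real data.
# All file names below are hypothetical.
root = Path(tempfile.mkdtemp())
for rel in ["raw/session1.h5", "raw/session2.h5", "raw/notes.txt", "data_summary.csv"]:
    file = root / rel
    file.parent.mkdir(parents=True, exist_ok=True)
    file.touch()

# The "/" operator and joinpath() build the same path.
raw = root / "raw"
same_path = raw == root.joinpath("raw")

# iterdir() lists one level; glob() filters by pattern.
top_level = sorted(p.name for p in root.iterdir())
h5_files = sorted(p.name for p in raw.glob("*.h5"))
data_files = sorted(p.name for p in root.glob("data*"))

print(same_path)
print(top_level)
print(h5_files)
print(data_files)
```

Both `iterdir()` and `glob()` return generators and make no promise about order, which is why the results are sorted here before printing.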


In [45]:
from pathlib import Path

What is the current working directory?

In [49]:
Path.cwd()

WindowsPath('c:/Users/NickDG/Projects/remoteDuckDB/draft3')

In [52]:
Path('.').resolve()

WindowsPath('C:/Users/NickDG/Projects/remoteDuckDB/draft3')

In [53]:
Path().resolve()

WindowsPath('C:/Users/NickDG/Projects/remoteDuckDB/draft3')

What files and folders are inside the current working directory?

In [56]:
list(Path().iterdir())

[WindowsPath('1_navigating_filesystems_os_fsspec_objects.ipynb'),
 WindowsPath('2_parsing_metadata_from_filenames_str_glob.ipynb'),
 WindowsPath('3_metadata_in_json_arrays_dict.ipynb'),
 WindowsPath('4_sql_across_json_files_with_duckdb_and_hive.ipynb'),
 WindowsPath('5_storing_arrays_flat_npy.ipynb'),
 WindowsPath('6_hdf5.ipynb'),
 WindowsPath('7_sql_schemas_sql_joins_with_duckdb.ipynb'),
 WindowsPath('8_pipelines_finalizing_data_into_parquet_files.ipynb'),
 WindowsPath('data')]

What files and folders are inside the "data" directory?

In [57]:
list(Path("data").iterdir())

[WindowsPath('data/exp1')]

What files and folders are inside the "exp1" directory, inside the "data" directory?

In [58]:
list(Path("data/exp1").iterdir())

[WindowsPath('data/exp1/joey_2021-05-01_001'),
 WindowsPath('data/exp1/joey_2021-05-02_001'),
 WindowsPath('data/exp1/phoebe_2021-05-02_001'),
 WindowsPath('data/exp1/phoebe_2021-05-03_001'),
 WindowsPath('data/exp1/phoebe_2021-05-04_001')]

In [59]:
list(Path().joinpath("data").joinpath("exp1").iterdir())

[WindowsPath('data/exp1/joey_2021-05-01_001'),
 WindowsPath('data/exp1/joey_2021-05-02_001'),
 WindowsPath('data/exp1/phoebe_2021-05-02_001'),
 WindowsPath('data/exp1/phoebe_2021-05-03_001'),
 WindowsPath('data/exp1/phoebe_2021-05-04_001')]

What folders in exp1 start with the subject "phoebe" (Hint: use Path().glob())?

In [61]:
list(Path("data/exp1").glob("phoebe*"))

[WindowsPath('data/exp1/phoebe_2021-05-02_001'),
 WindowsPath('data/exp1/phoebe_2021-05-03_001'),
 WindowsPath('data/exp1/phoebe_2021-05-04_001')]

What folders in exp1 start with the subject "joey"?

In [62]:
list(Path("data/exp1").glob("joey*"))

[WindowsPath('data/exp1/joey_2021-05-01_001'),
 WindowsPath('data/exp1/joey_2021-05-02_001')]

What folders in exp1 were recorded on the 2nd of May (Hint: glob on the date part of the folder name)?

In [63]:
list(Path("data/exp1").glob("*2021-05-02*"))

[WindowsPath('data/exp1/joey_2021-05-02_001'),
 WindowsPath('data/exp1/phoebe_2021-05-02_001')]
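
A glob pattern matches files and folders alike, so a question about "folders" is only answered correctly if nothing else shares the pattern. The sketch below, which assumes a temporary copy of part of the layout plus a hypothetical stray file (`notes_2021-05-02.txt`), shows one way to keep only directories with `is_dir()`:

```python
import tempfile
from pathlib import Path

# Recreate part of the exp1 layout in a temporary folder.
exp1 = Path(tempfile.mkdtemp()) / "exp1"
for rel in [
    "joey_2021-05-02_001/spikes.npy",
    "phoebe_2021-05-02_001/spikes.npy",
    "phoebe_2021-05-03_001/spikes.npy",
    "notes_2021-05-02.txt",  # hypothetical stray file that also matches the date
]:
    file = exp1 / rel
    file.parent.mkdir(parents=True, exist_ok=True)
    file.touch()

# The pattern alone matches both session folders and the stray file.
matches = sorted(exp1.glob("*2021-05-02*"))

# Keep only the directories.
session_dirs = [p.name for p in matches if p.is_dir()]
print(session_dirs)
```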

What files have the ".h5" file extension (include all files in any subfolders of exp1)?

In [67]:
list(Path("data/exp1").glob("**/*.h5"))

[WindowsPath('data/exp1/joey_2021-05-02_001/lfps.h5'),
 WindowsPath('data/exp1/phoebe_2021-05-03_001/lfps.h5')]

What files have the ".npy" file extension (include all files in any subfolders of exp1)?

In [66]:
list(Path("data/exp1").glob("**/*.npy"))

[WindowsPath('data/exp1/joey_2021-05-01_001/spikes.npy'),
 WindowsPath('data/exp1/joey_2021-05-02_001/spikes.npy'),
 WindowsPath('data/exp1/phoebe_2021-05-02_001/spikes.npy'),
 WindowsPath('data/exp1/phoebe_2021-05-03_001/spikes.npy'),
 WindowsPath('data/exp1/phoebe_2021-05-04_001/spikes.npy')]

Which of phoebe's files contain lfp data?

In [69]:
list(Path("data/exp1").glob("phoebe*/**/lfps*"))

[WindowsPath('data/exp1/phoebe_2021-05-03_001/lfps.h5')]
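
For recursive searches, `Path.rglob(pattern)` is a shorthand for `Path.glob("**/" + pattern)`. The following sketch, assuming a temporary copy of part of the exp1 layout, checks that the two give the same result and then reads the session folder off each match with `.parent.name`:

```python
import tempfile
from pathlib import Path

# Recreate part of the exp1 layout in a temporary folder.
exp1 = Path(tempfile.mkdtemp()) / "exp1"
for rel in [
    "joey_2021-05-02_001/spikes.npy",
    "joey_2021-05-02_001/lfps.h5",
    "phoebe_2021-05-03_001/spikes.npy",
    "phoebe_2021-05-03_001/lfps.h5",
    "phoebe_2021-05-04_001/spikes.npy",
]:
    file = exp1 / rel
    file.parent.mkdir(parents=True, exist_ok=True)
    file.touch()

# Two spellings of the same recursive search.
via_glob = sorted(exp1.glob("**/lfps*"))
via_rglob = sorted(exp1.rglob("lfps*"))
same_result = via_glob == via_rglob

# The parent folder name identifies the session each file belongs to.
phoebe_lfp_sessions = [
    p.parent.name for p in via_rglob if p.parent.name.startswith("phoebe")
]
print(same_result)
print(phoebe_lfp_sessions)
```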

## Accessing Remote File Systems using `fsspec`

In neuroscience research, data is increasingly stored on remote file systems. This section introduces `fsspec`, a library that provides a common interface to many file systems, including remote and cloud-based storage. We'll explore how to list, search, and manage files on different remote systems, a valuable skill in a data-intensive field like neuroscience.


| Code | Description |
| :-- | :-- |
| `fs.ls()` | Lists the files and directories at a given path of the filesystem. |
| `fs.glob('*.h5')` | Searches for files matching a specified pattern (in this case, all files ending with '.h5' at the top level; use `'**/*.h5'` to include subdirectories). |
| `fs.makedirs()` | Creates a new directory at the specified path, including any necessary intermediate directories. |
| `fs.rmdir()` | Removes a directory, if it is empty. |
| `fs.rm()` | Removes (deletes) a file or directory. |
| `fs.read_text()`| Reads the contents of a file and returns it as a string. |
| `fs.read_bytes()` | Reads the contents of a file and returns it as bytes. |
| `fs.download()`| Downloads a file from the remote filesystem to the local filesystem. |
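
Several of these methods can be tried without a network connection. The sketch below assumes `fsspec`'s built-in in-memory filesystem as a stand-in for a remote one; the folder and file names (`/fsspec_demo`, `notes.txt`, and so on) are made up for illustration. The same method calls are intended to work on other `fsspec` filesystems, although the exact form of the returned paths can differ between backends and versions.

```python
import fsspec

# The in-memory filesystem stands in for a remote system here.
fs = fsspec.filesystem("memory")
root = "/fsspec_demo"  # hypothetical folder name

# Create a folder and a few small files (pipe() writes bytes to a path).
fs.makedirs(f"{root}/raw", exist_ok=True)
fs.pipe(f"{root}/raw/notes.txt", b"session 1: ok")
fs.pipe(f"{root}/raw/lfps.h5", b"\x00")
fs.pipe(f"{root}/summary.h5", b"\x00")

listing = fs.ls(root, detail=False)      # entries directly inside root
top_h5 = fs.glob(f"{root}/*.h5")         # top level only
deep_h5 = fs.glob(f"{root}/**/*.h5")     # also descends into subfolders
notes = fs.read_text(f"{root}/raw/notes.txt")

print(listing)
print(top_h5)
print(deep_h5)
print(notes)

# Clean up: removing a folder and its contents needs recursive=True.
fs.rm(root, recursive=True)
```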


### GitHub Repos as a Remote Filesystem

GitHub, a platform widely used for code sharing and collaboration, can also serve as a remote filesystem for data storage and retrieval. This section guides you through using GitHub repositories for accessing and managing data files, leveraging the `GithubFileSystem` class in `fsspec`. 

```python
from fsspec.implementations.github import GithubFileSystem
fs = GithubFileSystem(org="ibehave-ibots", repo="iBOTS-Tools")
```



**Exercises**: Practice navigating remote GitHub filesystems using `fsspec`'s `GithubFileSystem` class.

In [3]:
import fsspec
from fsspec.implementations.github import GithubFileSystem

List all the files in the root directory of https://github.com/mwaskom/seaborn-data

In [82]:
fs = GithubFileSystem(org="mwaskom", repo="seaborn-data")
fs.ls("/")

['README.md',
 'anagrams.csv',
 'anscombe.csv',
 'attention.csv',
 'brain_networks.csv',
 'car_crashes.csv',
 'dataset_names.txt',
 'diamonds.csv',
 'dots.csv',
 'dowjones.csv',
 'exercise.csv',
 'flights.csv',
 'fmri.csv',
 'geyser.csv',
 'glue.csv',
 'healthexp.csv',
 'iris.csv',
 'mpg.csv',
 'penguins.csv',
 'planets.csv',
 'png',
 'process',
 'raw',
 'seaice.csv',
 'taxis.csv',
 'tips.csv',
 'titanic.csv']

List all the files whose filenames start with the letter "p" (i.e. "glob" the files)

In [84]:
fs.glob("p*")

['penguins.csv', 'planets.csv', 'png', 'process']

List all the files whose filenames end in the ".csv" extension.

In [85]:
fs.glob("*.csv")

['anagrams.csv',
 'anscombe.csv',
 'attention.csv',
 'brain_networks.csv',
 'car_crashes.csv',
 'diamonds.csv',
 'dots.csv',
 'dowjones.csv',
 'exercise.csv',
 'flights.csv',
 'fmri.csv',
 'geyser.csv',
 'glue.csv',
 'healthexp.csv',
 'iris.csv',
 'mpg.csv',
 'penguins.csv',
 'planets.csv',
 'seaice.csv',
 'taxis.csv',
 'tips.csv',
 'titanic.csv']

List all the PNG image files in the "png" folder.

In [86]:
fs.ls("png")

['png/img1.png',
 'png/img2.png',
 'png/img3.png',
 'png/img4.png',
 'png/img5.png',
 'png/img6.png']

Download all the PNG image files in the "png" folder.

In [91]:
fs.download("png/*", "data/seaborn-images")  # note: needs a glob pattern (apparently it has to be given the files themselves to download)

List all the files in the root directory of the repo, with `detail=True` (i.e. `fs.ls("/", detail=True)`).  What information does it give us about these files?

In [99]:
import pandas as pd
pd.DataFrame(fs.ls("/", detail=True))

| | name | mode | type | size | sha |
| :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | README.md | 100644 | file | 3101 | 453ab596a15d1f38f2514770783bda43d97ed755 |
| 1 | anagrams.csv | 100644 | file | 361 | 1d88d051b7fff295350bc2ed509b1946d41190b4 |
| 2 | anscombe.csv | 100644 | file | 556 | 62792b68fa5eed40eb75fe00e8daeaaf700f4f82 |
| 3 | attention.csv | 100644 | file | 1198 | 8d1f684e36f36aea05b10408c055eb4b30a3fcef |
| 4 | brain_networks.csv | 100644 | file | 1075911 | 1ca1f474fa81aa8ee01654da5d6c9fd90c96fa27 |
| 5 | car_crashes.csv | 100644 | file | 3301 | 2248a441bfbbfb1d5c9fa7dbc9dae641c34829a1 |
| 6 | dataset_names.txt | 100644 | file | 174 | 2a27f085940eba05b41e87bbcc2d8c075c000831 |
| 7 | diamonds.csv | 100644 | file | 2772143 | 92259b40dbeea3165759a8f2cb576896612828ac |
| 8 | dots.csv | 100644 | file | 25742 | 9b7eebf50146fd573b055b3b9f8d2caa57879723 |
| 9 | dowjones.csv | 100644 | file | 11349 | 8c35bf1355e823bd2aa119d2f4979c812e898df1 |
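
With `detail=True`, `ls()` returns a list of dictionaries, one per entry, so the listing can be summarized with plain Python as well as with pandas. The sketch below uses four records copied by hand from the output of this exercise (the `mode` and `sha` fields are left out for brevity) rather than calling GitHub again:

```python
# Records copied from the fs.ls("/", detail=True) output of this exercise;
# the "mode" and "sha" fields are omitted to keep the example short.
records = [
    {"name": "README.md", "type": "file", "size": 3101},
    {"name": "anagrams.csv", "type": "file", "size": 361},
    {"name": "brain_networks.csv", "type": "file", "size": 1075911},
    {"name": "diamonds.csv", "type": "file", "size": 2772143},
]

# Keep only regular files, then summarize their sizes (in bytes).
files = [r for r in records if r["type"] == "file"]
total_bytes = sum(r["size"] for r in files)
largest = max(files, key=lambda r: r["size"])

print(f"{len(files)} files, {total_bytes / 1e6:.2f} MB in total")
print("largest:", largest["name"])
```

Checking sizes this way before calling `fs.download()` is a cheap way to avoid pulling down more data than expected.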


Read and print the text contents of the "anscombe.csv" file. What data is inside this file?

In [102]:
print(fs.read_text("/anscombe.csv").replace(',', '\t'))

dataset	x	y
I	10.0	8.04
I	8.0	6.95
I	13.0	7.58
I	9.0	8.81
I	11.0	8.33
I	14.0	9.96
I	6.0	7.24
I	4.0	4.26
I	12.0	10.84
I	7.0	4.82
I	5.0	5.68
II	10.0	9.14
II	8.0	8.14
II	13.0	8.74
II	9.0	8.77
II	11.0	9.26
II	14.0	8.1
II	6.0	6.13
II	4.0	3.1
II	12.0	9.13
II	7.0	7.26
II	5.0	4.74
III	10.0	7.46
III	8.0	6.77
III	13.0	12.74
III	9.0	7.11
III	11.0	7.81
III	14.0	8.84
III	6.0	6.08
III	4.0	5.39
III	12.0	8.15
III	7.0	6.42
III	5.0	5.73
IV	8.0	6.58
IV	8.0	5.76
IV	8.0	7.71
IV	8.0	8.84
IV	8.0	8.47
IV	8.0	7.04
IV	8.0	5.25
IV	19.0	12.5
IV	8.0	5.56
IV	8.0	7.91
IV	8.0	6.89



**DeepLabCut**: Answer the following questions about the DeepLabCut GitHub repo: https://github.com/DeepLabCut/DeepLabCut

What files are in the root directory of the DeepLabCut repo?

In [4]:
fs = GithubFileSystem(org="DeepLabCut", repo="DeepLabCut")
fs.ls("/")

['.circleci',
 '.codespellrc',
 '.github',
 '.gitignore',
 'AUTHORS',
 'CODE_OF_CONDUCT.md',
 'CONTRIBUTING.md',
 'LICENSE',
 'NOTICE.yml',
 'README.md',
 '_config.yml',
 '_toc.yml',
 'conda-environments',
 'deeplabcut',
 'dlc.py',
 'docker',
 'docs',
 'examples',
 'reinstall.sh',
 'requirements.txt',
 'setup.py',
 'tests',
 'testscript_cli.py',
 'tools']

How many files or folders are in the "openfield-Pranav-2018-10-30" folder, which is in the "examples" folder?  (Tip: the `len()` function can be helpful here.)

In [5]:
fs.glob("examples/open*/*")

['examples/openfield-Pranav-2018-10-30/config.yaml',
 'examples/openfield-Pranav-2018-10-30/labeled-data',
 'examples/openfield-Pranav-2018-10-30/videos']

How many files and folders are there in total, if you include everything in all the subfolders of the openfield example?

In [6]:
len(fs.glob("examples/open*/**"))

124

Download all the "labeled-data" files in the openfield example (Hint: use `fs.download(recursive=True)`).

In [137]:
fs.download("examples/open*/labeled-data", "deeplabcut/pranav/labeled-data", recursive=True)
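
A recursive download can be rehearsed offline before pointing it at a real repository. The following is a minimal sketch, assuming `fsspec`'s in-memory filesystem as a stand-in for the remote side and made-up folder and file names; the exact layout under the destination folder may depend on the `fsspec` version and on whether the destination already exists, so the check below searches it recursively instead of assuming fixed paths.

```python
import tempfile
from pathlib import Path

import fsspec

# The in-memory filesystem plays the role of the remote repository.
fs = fsspec.filesystem("memory")
remote_dir = "/download_demo/labeled-data"  # hypothetical remote folder
fs.pipe(f"{remote_dir}/m1/img001.png", b"fake image bytes")
fs.pipe(f"{remote_dir}/m2/img002.png", b"fake image bytes")

# Download the whole folder tree into a fresh local directory.
local_dir = Path(tempfile.mkdtemp()) / "labeled-data"
fs.download(remote_dir, str(local_dir), recursive=True)

# Search the destination recursively for the downloaded files.
downloaded = sorted(p.name for p in local_dir.rglob("*.png"))
print(downloaded)

# Remove the demo files from the in-memory filesystem.
fs.rm("/download_demo", recursive=True)
```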
