# Lecture 3.3 - Working with files
Experimental data is often spread across many files, typically one file per animal or experiment. To analyse the data with Python, you need to:
1. Discover and list these data files, so you can process each of them automatically in a for loop.
2. Read the data in the files into Python variables for later manipulation.

Say we have experimental data stored in the following directory structure:
```
experiments
├── experiment_1
│   ├── final_behavioral_scores.txt
│   └── initial_behavioral_scores.txt
├── experiment_2
│   ├── final_behavioral_scores.txt
│   └── initial_behavioral_scores.txt
├── information.txt
└── mouse_names.txt
```

A bit of nomenclature: Take the path `experiments/experiment_1/final_behavioral_scores.txt`:
- `final_behavioral_scores.txt` is the file name, `final_behavioral_scores` is the file stem, and `.txt` is the suffix or extension.
- `experiments` and `experiment_1` are directories or folders. `experiments` is the parent directory of `experiment_1`, and `experiment_1` is a subdirectory of `experiments`.
- `/` is the path separator. On Windows it can also be `\`. The system-specific path separator can be obtained through `os.sep`:

In [1]:
import os
os.sep

'/'

## Discovering files

The [glob](https://docs.python.org/3/library/glob.html) module allows you to list files in a directory.

_Wild card_ characters allow you to find file names matching a specific pattern:
- `?` matches any single character
- `*` matches any string of characters, including an empty one
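
To get a feel for the difference between the two wildcards without touching the disk, here is a minimal sketch using the standard-library `fnmatch` module, which applies the same wildcard rules to plain strings. The list `names` is simply typed in from the directory tree above.

```python
from fnmatch import fnmatch

# File and directory names taken from the directory tree above
names = ['experiment_1', 'experiment_2', 'information.txt', 'mouse_names.txt']

# '?' stands for exactly one character, '*' for any number of characters
single = [name for name in names if fnmatch(name, 'experiment_?')]
any_txt = [name for name in names if fnmatch(name, '*.txt')]
# a single '?' cannot stand in for the two characters '_1'
no_match = [name for name in names if fnmatch(name, 'experiment?')]

print(single)
print(any_txt)
print(no_match)
```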

In [2]:
from glob import glob
print(f"{glob('experiments/*')=}")  # find all files and directories in the experiments directory
print(f"{glob('experiments/*.txt')=}")  # find files ending in '.txt'
print(f"{glob('experiments/i*.txt')=}")  # find files and directories in 'experiments', starting with 'i', and ending in '.txt'

print(f"{glob('experiments/*/')=}")  # find all subdirectories
print(f"{glob('experiments/*/*.txt')=}")  # find 'txt' files in all subdirectories

glob('experiments/*')=['experiments/experiment_1', 'experiments/information.txt', 'experiments/mouse_names.txt', 'experiments/experiment_2']
glob('experiments/*.txt')=['experiments/information.txt', 'experiments/mouse_names.txt']
glob('experiments/i*.txt')=['experiments/information.txt']
glob('experiments/*/')=['experiments/experiment_1/', 'experiments/experiment_2/']
glob('experiments/*/*.txt')=['experiments/experiment_1/initial_behavioral_scores.txt', 'experiments/experiment_1/final_behavioral_scores.txt', 'experiments/experiment_2/initial_behavioral_scores.txt', 'experiments/experiment_2/final_behavioral_scores.txt']
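
The list returned by `glob` can be fed straight into a for loop. The sketch below first builds a throwaway copy of the directory tree in a temporary folder (an assumption made only so the example runs anywhere) and then loops over all text files in the subdirectories. Note that `glob` returns files in arbitrary order, as the output above shows, so sorting the list makes the processing order reproducible.

```python
import os
import tempfile
from glob import glob

# Build a small throwaway copy of the directory tree (for illustration only)
root = tempfile.mkdtemp()
for experiment in ['experiment_1', 'experiment_2']:
    os.makedirs(os.path.join(root, 'experiments', experiment))
    for name in ['initial_behavioral_scores.txt', 'final_behavioral_scores.txt']:
        with open(os.path.join(root, 'experiments', experiment, name), 'w') as f:
            f.write('Responses\n0.5\n')

# Find the text files in all subdirectories and process them one by one
pattern = os.path.join(root, 'experiments', '*', '*.txt')
found = sorted(glob(pattern))
for path in found:
    print(f"processing {path}")
```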


## Manipulating paths
We often need to manipulate path names.

Say we want to process the data in `initial_behavioral_scores.txt` and `final_behavioral_scores.txt` for each experiment in `experiments`, and want to save the results in a folder called `results` that mimics the structure of the data folder: `results/experiment_1/behavior.xls`, `results/experiment_2/behavior.xls`

We want to generate the paths for the new results files automatically from the paths of the data files. That means we need to manipulate the path names. You will do just that in one of the exercises.

There are two ways of working with paths in Python:
- [os.path](https://docs.python.org/3/library/os.path.html) (old)
- [pathlib](https://docs.python.org/3/library/pathlib.html) (new but more complicated - we won't cover it here)


In [1]:
import os.path
path = 'updir/subdir/name.txt'
print(f"{os.path.splitext(path)=}")
path_parts = os.path.splitext(path)
trunk = path_parts[0]
extension = path_parts[1]
print(f"{path=}, {trunk=}, {extension=}")

# this will split off the file name from the rest of the path:
print(f"{os.path.split(path)=}")

# we can treat path as a string and use the "split" method to split the string at a specified character - in this case the path separator
print(f"{path.split('/')=}")

print(f"{os.path.basename(path)=}")

print(f"{os.path.dirname(path)=}")

os.path.splitext(path)=('updir/subdir/name', '.txt')
path='updir/subdir/name.txt', trunk='updir/subdir/name', extension='.txt'
os.path.split(path)=('updir/subdir', 'name.txt')
path.split('/')=['updir', 'subdir', 'name.txt']
os.path.basename(path)='name.txt'
os.path.dirname(path)='updir/subdir'


We now know how to split a path in various ways. How can we assemble the parts into something new?

If the parts are strings, we can use the `+` operator to concatenate them. This will create a file name with a new suffix:

In [26]:
print(f"old path: {path}")
path_parts = os.path.splitext(path)
trunk = path_parts[0]
extension = path_parts[1]

new_extension = '.mp3'
new_path = trunk + new_extension
print(f"new path: {new_path}")


old path: updir/subdir/name.txt
new path: updir/subdir/name.mp3
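
Concatenating with `+` works, but you have to insert the path separators yourself. As an alternative, `os.path.join` assembles parts with the system-specific separator. The following is only a sketch of how a results path could be derived from a data path. The variable names are made up for this example and it is not the only way to solve the exercise.

```python
import os.path

data_path = 'experiments/experiment_1/final_behavioral_scores.txt'

# The directory directly containing the file is the experiment folder
experiment_dir = os.path.dirname(data_path)         # 'experiments/experiment_1'
experiment_name = os.path.basename(experiment_dir)  # 'experiment_1'

# os.path.join inserts the system-specific separator between the parts
result_path = os.path.join('results', experiment_name, 'behavior.xls')
print(result_path)
```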


## Creating directories
To save files to new directories, we can create them directly with python:

`os.makedirs('tmp/sub1/sub2', exist_ok=True)`

`exist_ok=True` prevents an error if the directory already exists.
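
A small sketch of this behaviour, run inside a temporary directory so your project folder stays clean (the use of `tempfile` is an assumption of this example, not part of the lecture material):

```python
import os
import tempfile

base = tempfile.mkdtemp()
target = os.path.join(base, 'tmp', 'sub1', 'sub2')

os.makedirs(target, exist_ok=True)  # creates tmp, sub1 and sub2 in one go
os.makedirs(target, exist_ok=True)  # calling it again is harmless

try:
    os.makedirs(target)  # without exist_ok=True this raises an error
    raised = False
except FileExistsError:
    raised = True

print(os.path.isdir(target), raised)
```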

## Saving and loading data from files

- Text files: `.txt` or `.csv`
- Excel files ending in `.xls` or `.xlsx`
- Matlab files ending in `.mat`
- Numpy files ending in `.npy`, `.npz`

### Loading data from text files
Data in text files is typically saved as a single column of data - with one value per row/line and a column label:
```csv
Responses
0.561
0.342
0.23
0.144
```
Tabular data with multiple columns is saved with a specific character separating the individual columns in a row: the _delimiter_. Common delimiters are `,` (csv - comma-separated values), `;`, or the tab character (`\t`).
```csv
Time,Responses
1,0.561
2,0.342
3,0.23
4,0.144
```

We can use numpy or the [pandas library](https://pandas.pydata.org) for loading and saving data. Numpy and pandas are numerical computation packages, and come with functions for data IO.

In [5]:
import numpy as np
data = np.loadtxt('data/mouse1.txt')

ValueError: could not convert string 'Responses' to float64 at row 0, column 1.

Open the file and inspect it! Can you guess the cause of the error?

This is how to solve it. Check the documentation to understand what the `skiprows` argument does. A good google query is the package name followed by the function name: "numpy loadtxt" or straight "np.loadtxt".



In [6]:
np_data = np.loadtxt('data/mouse1.txt', skiprows=1)
print(np_data, type(np_data))

[ 1.  2.  3.  4.  5. 56.  6.] <class 'numpy.ndarray'>
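
For the multi-column format shown earlier, `np.loadtxt` also needs to know the delimiter. A minimal sketch, assuming the two-column example table from above. The text is held in memory with `io.StringIO` instead of being read from a file, purely to keep the example self-contained.

```python
import io
import numpy as np

# The two-column example from above, held in memory instead of in a file
text = "Time,Responses\n1,0.561\n2,0.342\n3,0.23\n4,0.144\n"

# delimiter=',' splits each line at the commas, skiprows=1 skips the header line
table = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1)
time = table[:, 0]       # first column
responses = table[:, 1]  # second column
print(table.shape, time, responses)
```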


### Saving data to text files
We can use numpy to save any numpy array to a text file:

In [22]:
data = [1.3, 2.2, 3.5, 4.8]
np.savetxt('test.txt', data, header='Responses')
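
It is worth looking at what `header` actually writes. The sketch below saves the same data to a temporary file (the temporary location is an assumption of this example), prints the first line of the raw text, and loads the file again. The header is written as a comment line starting with `#`, which `np.loadtxt` skips by default.

```python
import os
import tempfile
import numpy as np

data = [1.3, 2.2, 3.5, 4.8]
filename = os.path.join(tempfile.mkdtemp(), 'test.txt')
np.savetxt(filename, data, header='Responses')

# Look at the raw text: the header line is written with a leading '# '
with open(filename) as f:
    first_line = f.readline().strip()
print(first_line)

# np.loadtxt treats lines starting with '#' as comments, so skiprows is not needed here
loaded = np.loadtxt(filename)
print(loaded)
```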

### Loading data from excel files
We can use the pandas library to load `xls` or `xlsx` files. Pandas loads data not into a list or a dictionary but has its own data type - the `DataFrame`. We will learn next week how to work with `DataFrames`. For now, we can easily convert the data from the `DataFrame` to a list or a numpy array.

In [None]:
import pandas as pd
df = pd.read_excel('data/mouse_data.xls', sheet_name="Mouse1")
df

What's the type of df?

In [30]:
print(f"The type of df is {type(df)}.")

The type of df is <class 'pandas.core.frame.DataFrame'>.


We don't really know yet what to do with a `DataFrame` - so let's convert it to a numpy array:

In [31]:
df.to_numpy()  # to a numpy array

array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15]])
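
Note that the result is a 2D array with one column. Here is a sketch of how to get a flat array or a plain list instead, using a small stand-in `DataFrame` built in memory. The column name `Responses` is made up for this example, so check the column names in your own file.

```python
import pandas as pd

# A small stand-in DataFrame (the column name 'Responses' is hypothetical)
df = pd.DataFrame({'Responses': [1, 2, 3]})

as_array = df.to_numpy()            # 2D array with one row per table row
as_flat_array = as_array.flatten()  # 1D array
as_list = df['Responses'].tolist()  # plain Python list of one column
print(as_array.shape, as_flat_array, as_list)
```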

### Loading data from matlab files
- old format (before MATLAB v7.3): `scipy.io.loadmat(filename)` ([docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html))
- new format (MATLAB v7.3 and later): open as an HDF5 file using h5py ([docs](https://docs.h5py.org/en/stable/quick.html))

Both give dictionary-like access to the file, with variable names as keys and the data as values.
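
A minimal sketch for the old format, assuming scipy is installed. Since no `.mat` file ships with these notes, the example first writes one with `scipy.io.savemat`. The variable name `responses` and the file name are made up.

```python
import os
import tempfile
import numpy as np
from scipy.io import loadmat, savemat

# Write a small file in the old MATLAB format so there is something to load
filename = os.path.join(tempfile.mkdtemp(), 'dummy.mat')
savemat(filename, {'responses': np.array([0.561, 0.342, 0.23])})

contents = loadmat(filename)
# Keys starting with '__' hold metadata; the rest are the saved variables
variable_names = [key for key in contents if not key.startswith('__')]
print(variable_names)
print(contents['responses'].shape)  # MATLAB has no 1D arrays, so the data comes back 2D
```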

### Loading data from numpy files

In [12]:
import numpy as np

data = np.load('dummy.npz')
data

NpzFile 'dummy.npz' with keys: data, names

## Saving and loading data to/from a text file

- Text files: `.txt` or `.csv`
- Excel files ending in `.xls` or `.xlsx`
- Numpy files ending in `.npy`, `.npz`

### Saving data to text files
`np.loadtxt` loads data from a text file - guess what `np.savetxt` does! `np.savetxt` uses a single space as the default delimiter, but you can change that with the `delimiter` argument.

In [3]:
import numpy as np
data = [1,2,3]
filename = 'dummy.txt'
np.savetxt(filename, data)
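
The delimiter only matters for data with more than one column. The following sketch writes the same small table twice, once with the default settings and once with `delimiter=','` and an integer format, and prints the raw text of both files so you can compare them. The file names and the temporary folder are assumptions of this example.

```python
import os
import tempfile
import numpy as np

table = np.array([[1, 2], [3, 4]])
folder = tempfile.mkdtemp()

# Default settings
default_file = os.path.join(folder, 'default.txt')
np.savetxt(default_file, table)

# Comma as delimiter, integers instead of the default scientific notation
csv_file = os.path.join(folder, 'table.csv')
np.savetxt(csv_file, table, delimiter=',', fmt='%d')

with open(default_file) as f:
    default_text = f.read()
with open(csv_file) as f:
    csv_text = f.read()
print(default_text)
print(csv_text)
```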

### Saving and loading data to/from non-text files
For large datasets - think hours of multi-channel electrophysiology recordings - text files become very big and slow to work with. In that case, we can save data directly as binary, non-text data:

In [1]:
import numpy as np
data = [1, 2, 3]
filename = 'dummy.npy'
np.save(filename, data)
np.load(filename)

array([1, 2, 3])

We can also save multiple variables to the same file:

In [11]:
filename = 'dummy.npz'
data_new = ['alpha', 'beta', 'gamma', 'epsilon']
np.savez(filename, data=data, names=data_new)

file_contents = np.load(filename)
print(f"What is this? {file_contents}. Check the docs for np.load!")
print(f"as a list: {list(file_contents.keys())=}")
print(f"{file_contents['names']=}")

What is this? NpzFile 'dummy.npz' with keys: data, names. Check the docs for np.load!
as a list: list(file_contents.keys())=['data', 'names']
file_contents['names']=array(['alpha', 'beta', 'gamma', 'epsilon'], dtype='<U7')
