# File handling

Data are often stored in multiple files or folders that need to be read. Python has several modules and functions that help you handle files, and especially file paths, which will save you a lot of manual work. We review some of this functionality here.

## Pathlib

Whenever you want to read a file, you need to specify its location on your computer. You can usually do that in a relative or absolute manner:

- relative: indicate the location of the file *relative* to your current location (usually the location of the notebook)
- absolute: indicate the full path of your file on your system

You can often specify a path using a simple string, but this can be tedious, as you will, for example, have to construct the paths of subfolders "manually". We highly recommend using the ```pathlib``` module, which provides a lot of very useful tools to handle path names, extensions etc.

First we use the ```Path``` object to define the path of the folder containing data. For example we might want to specify that the location is "right here" using a dot:

In [2]:
from pathlib import Path

In [3]:
folder = Path('.')

```folder``` is not just a string containing '.' but an actual object that is much more useful. For example we can ask for the absolute path:

In [15]:
folder = folder.absolute()

or we can ask if the defined path is a folder:

In [16]:
folder.is_dir()

True

In [17]:
folder.is_file()

False
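To connect this with the relative and absolute paths described at the start of this section, here is a minimal sketch. The file name `data/measurements.csv` is a hypothetical example and does not need to exist: a `Path` only describes a location.

```python
from pathlib import Path

# Hypothetical relative path, used only for illustration (the file need not exist)
relative_path = Path('data/measurements.csv')

# absolute() prepends the current working directory to a relative path
absolute_path = relative_path.absolute()

print(relative_path.is_absolute())  # False
print(absolute_path.is_absolute())  # True
```

The relative version is shorter and moves with your project, while the absolute version is unambiguous but only valid on your own computer.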

There are also many useful functions to handle the path itself. For example if you have a folder and a file name, you can simply join them with:

In [19]:
folder.joinpath('myfile.txt')

PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/myfile.txt')

This spares you the hassle of adding slashes, making sure your code will work on another OS etc.
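As a side note, `pathlib` also lets you join path parts with the `/` operator, which many people find more readable than `joinpath`. The short sketch below compares the two; the name `subfolder` is a hypothetical example.

```python
from pathlib import Path

folder = Path('.').absolute()

# The / operator builds the same path as joinpath
path_a = folder.joinpath('myfile.txt')
path_b = folder / 'myfile.txt'
print(path_a == path_b)  # True

# Several parts can be chained; 'subfolder' is a hypothetical name
nested = folder / 'subfolder' / 'myfile.txt'
print(nested.parts[-2:])  # ('subfolder', 'myfile.txt')
```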

### Listing files

Now we can use methods attached to this path object to explore its contents. For example we can check the folder contents with ```iterdir```:

In [20]:
files_in_folder = folder.iterdir()

The returned object is not a list but a *generator*. We haven't seen this type of object yet. It behaves like a list whose elements are produced one after the other, and you can request the next element using the ```next``` function:

In [21]:
next(files_in_folder)

PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/06-Flow_control.ipynb')
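One consequence worth knowing is that such an object can only be traversed once: every element you request is consumed. The sketch below illustrates this in a temporary folder, so it does not depend on your own files. The file names `a.txt` and `b.txt` are hypothetical.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so the example does not touch your files
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    for file_name in ['a.txt', 'b.txt']:  # hypothetical file names
        (demo_folder / file_name).touch()

    entries = demo_folder.iterdir()
    first = next(entries)      # one of the two files (the order is not guaranteed)
    remaining = list(entries)  # only the elements not consumed yet
    leftover = list(entries)   # nothing is left: the generator is exhausted

print(len(remaining), len(leftover))  # 1 0
```

This is why the folder has to be queried again with `iterdir` if you want to start from the first element.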

For the moment we just transform this generator into a regular list:

In [22]:
files_in_folder = folder.iterdir()

files_in_folder = list(files_in_folder)

files_in_folder

[PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/06-Flow_control.ipynb'),
 PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/environment.yml'),
 PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/03-Functions_packages.ipynb'),
 PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/README.md'),
 PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/05-File_handling.ipynb'),
 PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/02-Variables.ipynb'),
 PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/01-Notebooks.ipynb'),
 PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1/04-Data_structures.ipynb')]

### Investigating files

Our goal will be to analyze all the notebook files in that folder. However, we will need to do some clean-up first, as some of the files should be discarded.

Again, each of the elements of ```files_in_folder``` is a ```Path``` object and we can access several of its properties, such as:
- the folder the file belongs to:

In [23]:
files_in_folder[0].parent

PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1')

- the name of the file:

In [24]:
files_in_folder[0].name

'06-Flow_control.ipynb'

- the two parts of the file name: stem and extension:

In [25]:
files_in_folder[0].stem

'06-Flow_control'

In [26]:
files_in_folder[0].suffix

'.ipynb'

While all these elements could be recovered from a path represented as a simple string, the ```Path``` object makes this much easier, so we definitely recommend using it!
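These properties are enough to do the clean-up mentioned above. As a minimal sketch, the example below keeps only the notebook files by testing the suffix. It uses a list comprehension, which is simply a compact way of writing a loop, and a short hand-written list of names taken from the listing above instead of the real folder contents.

```python
from pathlib import Path

# A few names from the listing above, standing in for files_in_folder
files_in_folder = [
    Path('06-Flow_control.ipynb'),
    Path('environment.yml'),
    Path('README.md'),
    Path('01-Notebooks.ipynb'),
]

# Keep only the entries whose extension is .ipynb
notebooks = [f for f in files_in_folder if f.suffix == '.ipynb']

# Collect the names without extension, in alphabetical order
notebook_names = sorted(f.stem for f in notebooks)
print(notebook_names)  # ['01-Notebooks', '06-Flow_control']
```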

## Other functions and modules

A few other functionalities are useful to know. For example you can directly find files whose names match a given pattern using the `glob` method:

In [28]:
folder.glob('*.ipynb')

<generator object Path.glob at 0x105766790>
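Like `iterdir`, `glob` returns a generator, so you typically turn the result into a list before using it. The sketch below does this in a temporary folder containing three hypothetical files; `sorted` is used because the order in which files are returned is not guaranteed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    # Hypothetical files created only for this example
    for file_name in ['01-Notebooks.ipynb', '02-Variables.ipynb', 'README.md']:
        (demo_folder / file_name).touch()

    # sorted() consumes the generator and returns an ordered list
    notebook_files = sorted(demo_folder.glob('*.ipynb'))
    notebook_names = [f.name for f in notebook_files]

print(notebook_names)  # ['01-Notebooks.ipynb', '02-Variables.ipynb']
```

In the pattern, `*` stands for any sequence of characters, so `'*.ipynb'` matches every name ending in `.ipynb`.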

The `os` module can also be very useful. It gives you a lot of information about your system, including your current location. For example:

In [29]:
import os

os.getcwd()

'/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1'

Here we can see that the current location is the folder where this notebook is located. Naturally, we can transform the returned path into a `Path` object to further manipulate it:

In [31]:
Path(os.getcwd())

PosixPath('/Users/gw18g940/GoogleDrive/DSL/Trainings/Crash_Course_DataSciPy/Day1')
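If all you need is the current location as a `Path`, `pathlib` offers a shortcut, `Path.cwd()`, which avoids the detour through `os`. A brief check that both approaches give the same result:

```python
import os
from pathlib import Path

# Path.cwd() returns the current working directory directly as a Path object
current_folder = Path.cwd()

same_location = current_folder == Path(os.getcwd())
print(same_location)  # True
```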

We can also create new directories using the `mkdir` method:

In [33]:
newfolder = folder.joinpath('newfolder')
newfolder.mkdir(parents=True, exist_ok=True)
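The two arguments used above deserve a short explanation: `exist_ok=True` avoids an error if the folder already exists, and `parents=True` also creates any missing intermediate folders. The sketch below demonstrates both in a temporary location; the folder names `a` and `b` are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    newfolder = Path(tmp) / 'newfolder'
    newfolder.mkdir()               # creates the folder
    newfolder.mkdir(exist_ok=True)  # already exists, but no error is raised

    # Without exist_ok=True, creating an existing folder raises an error
    try:
        newfolder.mkdir()
        raised = False
    except FileExistsError:
        raised = True

    # parents=True creates the missing intermediate folder 'a' as well
    nested = Path(tmp) / 'a' / 'b'
    nested.mkdir(parents=True)
    nested_created = nested.is_dir()

print(raised, nested_created)  # True True
```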

## Exercise

1. Create a Path object that points to the main course repository on your computer (the one containing the Day1, Day2 etc. folders).
2. Create a list of the contents of that directory.
3. For a few of the files, check if they are files or directories.
4. Create a new folder in that directory called `myfolder`.