# Practical Python – Learning useful Python skills

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Files" data-toc-modified-id="Files-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Files</a></span><ul class="toc-item"><li><span><a href="#Paths" data-toc-modified-id="Paths-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Paths</a></span><ul class="toc-item"><li><span><a href="#Absolute-paths" data-toc-modified-id="Absolute-paths-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Absolute paths</a></span></li><li><span><a href="#The-pathlib-module" data-toc-modified-id="The-pathlib-module-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>The <code>pathlib</code> module</a></span><ul class="toc-item"><li><span><a href="#Path-objects'-special-syntax" data-toc-modified-id="Path-objects'-special-syntax-1.1.2.1"><span class="toc-item-num">1.1.2.1&nbsp;&nbsp;</span>Path objects' special syntax</a></span></li></ul></li><li><span><a href="#Relative-paths" data-toc-modified-id="Relative-paths-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Relative paths</a></span></li></ul></li><li><span><a href="#Opening-files" data-toc-modified-id="Opening-files-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Opening files</a></span><ul class="toc-item"><li><span><a href="#Reading-files" data-toc-modified-id="Reading-files-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Reading files</a></span><ul class="toc-item"><li><span><a href="#Reading-binary-files" data-toc-modified-id="Reading-binary-files-1.2.1.1"><span class="toc-item-num">1.2.1.1&nbsp;&nbsp;</span>Reading binary files</a></span></li></ul></li><li><span><a href="#Creating-files-and-directories" data-toc-modified-id="Creating-files-and-directories-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Creating files and directories</a></span></li></ul></li></ul></li></ul></div>

**Introduction**

If you've made it this far, congratulations! You have learned most of the basics of the Python programming language! The first two notebooks in this course focus mainly on understanding the foundations of Python. This notebook is where we start learning how to use Python to do useful things.

So what's useful? Consider the following: you work at some company. You have just been handed a USB stick with ten thousand PDF files. These files have a date in their names, such as `rec20200414.pdf` (i.e. April 14, 2020). The files are photocopies of receipts. On each receipt is the name of the employee at your company who created it.

Your task is to find out:

- Which employees created the receipts?
- How many receipts did each employee create?
- How many receipts are there per employee, per month?

Now, we _could_ do this by hand, reading all the files ourselves and typing the information into an Excel spreadsheet. But this would, firstly, take weeks. Secondly, it would be incredibly boring. And, thirdly, the result would probably be riddled with errors (because the assignment is so tedious and boring).

However, this is actually a perfect example of a task that is easily solved with Python! Here's a task list of what we could do: 
- Create a list of all the files
- Create a script that extracts the name from a PDF receipt
- Loop over all files 

And for each file in our loop:
1. Run our script on the PDF file
2. Extract date from file name
3. Save the name and the date into a data structure
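
Step 2 in that loop can already be sketched with Python's built-in `datetime` module. This is just a minimal sketch, assuming every file name ends with an eight-digit date right before the extension, as in `rec20200414.pdf`. The helper name `date_from_filename` is made up for illustration:

```python
from datetime import datetime


def date_from_filename(filename):
    """Return the date hidden in a name like 'rec20200414.pdf' (hypothetical helper)."""
    # Drop the file extension: 'rec20200414.pdf' -> 'rec20200414'
    stem = filename.rsplit('.', 1)[0]
    # Assumption: the last eight characters are the date, written as YYYYMMDD
    return datetime.strptime(stem[-8:], '%Y%m%d').date()


receipt_date = date_from_filename('rec20200414.pdf')
print(receipt_date)        # 2020-04-14
print(receipt_date.month)  # 4
```

The month number is what we would later use to count receipts per month.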

In this course, you will learn to do every task in the list above! But let's take it step by step and start by learning about file paths. Then we'll learn how to create, remove and alter files on our computer using Python. We will also learn the basics of regular expressions, a way of recognising patterns in text. And finally, some web scraping basics.

## Files

All files on your computer have these three components: a name (this notebook, for instance, is called `continuation_course`), a file extension (Jupyter Notebooks have the extension `.ipynb`, text files `.txt`, Microsoft Word documents `.doc`, etc.), and a path – the file's location on your hard drive.

### Paths

As mentioned above, all files have a path. This lets us know where it is located on the computer's hard drive. All folders – also called directories – also have paths. There are two kinds of paths: **relative** and **absolute**. We'll start with the absolute, and continue with the relative further down. 

#### Absolute paths

An absolute path is a file's location in relation to your hard drive's root folder. The root folder – or **root directory** –  is the top level of your hard drive. All files and folders are inside this directory. 

If you use Windows, you probably recognise your root directory when you see it. It's the `C:\` when you check your hard drive folders. See this picture:

![image](course_material/windows_root.png)

On Mac/Linux, the root directory is just named `/`. Pretty lame, but there you go.

Another important difference is the **directory separator**, the slash. On Windows, this is a backslash `\`, while on Mac/Linux, it's just a regular slash `/`.

This notebook that you're running is located somewhere on your computer's hard drive. We can find this location by checking the file's absolute path. Or to be more precise: we can check where this notebook "lives" within your root directory! We can use the `os` module to get the absolute path of the current working directory, which is the folder this notebook runs from:

In [52]:
import os

In [53]:
os.getcwd()

'/Users/johekm/Documents/lectures/learning_python'

So for me, this notebook has this path: `'/Users/johekm/Documents/lectures/learning_python'`. It lives in the folder `learning_python`, which is in the folder `lectures`, which is in `Documents`, which is in my user profile `johekm`, and so forth all the way to the root directory. 

This notebook's name is `continuation_course.ipynb`, which means that its absolute path is:

`'/Users/johekm/Documents/lectures/learning_python/continuation_course.ipynb'`

If this was a Windows laptop, it would look something like:

`'C:\Users\johekm\Documents\lectures\learning_python\continuation_course.ipynb'`



All absolute paths have the root directory to the left. That is, **absolute paths always start with the root directory**. The directory or the file we're looking for is furthest to the right in the path. So in the examples above, we point towards the file `continuation_course.ipynb`, since it is furthest to the right in the path.

We can also use the `.isabs()` method on our path. It takes a string value and checks whether it's written as an absolute path (it only looks at the text of the path, not at whether the file actually exists):

In [54]:
os.path.isabs('/Users/johekm/Documents/lectures/learning_python/continuation_course.ipynb')

True

As you can see, the `os` module takes string values as arguments. It also returns paths as string values:

In [55]:
type(os.getcwd())

str

This means that we can use the `.join()` method on our directory separator character (`/` on Mac/Linux, `\` on Windows) to build path strings! So for my current absolute path I could do this:

In [58]:
my_path = ['','Users','johekm','Documents','lectures','learning_python','continuation_course.ipynb']

In [59]:
my_path

['',
 'Users',
 'johekm',
 'Documents',
 'lectures',
 'learning_python',
 'continuation_course.ipynb']

In [60]:
my_path = '/'.join(my_path) # see section 9.9.2 if you want a refresher!

In [61]:
my_path

'/Users/johekm/Documents/lectures/learning_python/continuation_course.ipynb'

In [62]:
os.path.isabs(my_path)

True
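
Typing the separator by hand ties the code to one operating system. As a hedged alternative, the `os` module has `os.sep` (the separator of the computer running the code) and `os.path.join()`, which glues the parts together with that separator. A minimal sketch, using the same folder names as above:

```python
import os

parts = ['Users', 'johekm', 'Documents', 'lectures', 'learning_python']

# os.sep is '/' on Mac/Linux and '\' on Windows.
# Starting with os.sep makes the joined path begin at the root directory.
my_path = os.path.join(os.sep, *parts)
print(my_path)
```

On Mac/Linux this prints `/Users/johekm/Documents/lectures/learning_python`. On Windows the same parts are joined with backslashes instead.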

We can use the `.listdir()` method to get a list of all files in a directory. We just pass a path to the method as an argument, and it returns a list. Let's try it on the `course_material` folder:

In [90]:
my_path = '/Users/johekm/Documents/lectures/learning_python/course_material'

In [91]:
os.listdir(my_path)

['mutable_scope.png',
 '.DS_Store',
 'while_loop.png',
 'scopes.png',
 'readme',
 'windows_root.png',
 'immutable_1.png',
 'interrupt.png',
 'immutable_3.png',
 'speach.txt',
 'immutable_2.png',
 'if_statement.png',
 'mutable.png',
 'mutable_2.png']
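
A list like this is easy to filter with tools we already know. A small sketch, assuming we only care about the `.png` files. It lists the current working directory rather than my `course_material` folder, so the result depends on where you run it:

```python
import os

# List everything in the current working directory
names = os.listdir(os.getcwd())

# Keep only the names that end with '.png', sorted alphabetically
png_files = sorted(name for name in names if name.endswith('.png'))
print(png_files)
```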

#### The `pathlib` module

**CAUTION!** It is common practice to use string values when working with file paths. But as you can see from the example above, this code wouldn't carry over to Windows, since Windows paths start with a drive letter such as `C:\` and use the backslash `\` as separator.

So instead, we're gonna use the `pathlib` module. This works on all operating systems since the Python interpreter converts all paths into whatever syntax your computer uses! Let's import the `Path` class from the `pathlib` module:

In [63]:
from pathlib import Path

The `Path` class has a method called `.cwd()` ("current working directory") that returns the absolute path of your current "position" on your hard drive. Let's have a look at the current working directory using this method:

In [64]:
Path.cwd()

PosixPath('/Users/johekm/Documents/lectures/learning_python')

As you can see, the value returned isn't a string. It is a path object:

In [41]:
type(Path.cwd())

pathlib.PosixPath

This path object actually differs depending on what operating system you're using. For me, on a Mac, it is a `PosixPath` object. If you're using Windows, it should be a `WindowsPath` object. But the name is not important. Just know that a path object helps us construct paths in a very convenient way!
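
To see the difference without switching computers, `pathlib` also has "pure" path classes that only handle the text of a path and never touch the hard drive. A minimal sketch, with folder names borrowed from the examples above:

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same three parts, written in Mac/Linux syntax and in Windows syntax
posix_path = PurePosixPath('Users', 'johekm', 'Documents')
windows_path = PureWindowsPath('Users', 'johekm', 'Documents')

print(posix_path)    # Users/johekm/Documents
print(windows_path)  # Users\johekm\Documents
```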

We can pass any string to the Path class to convert it into a path object:

In [68]:
Path("Johan")

PosixPath('Johan')

The `os` module accepts path objects too, so we can check whether this path object is an absolute path:

In [50]:
os.path.isabs(Path("Johan"))

False

"Johan" isn't an absolute path, but our current working directory is:

In [73]:
os.path.isabs(Path.cwd())

True
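
Path objects can also answer this question themselves, through their `.is_absolute()` method. A short sketch of the same two checks without the `os` module:

```python
from pathlib import Path

relative_check = Path('Johan').is_absolute()
absolute_check = Path.cwd().is_absolute()

print(relative_check)  # False
print(absolute_check)  # True
```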

The `.home()` method returns the home directory on the computer. For me, this is `/Users/johekm`:

In [69]:
Path.home()

PosixPath('/Users/johekm')

##### Path objects' special syntax

Path objects can use operators as their own syntax. This means that I can use the `.home()` method and then construct a path that I know will work on whatever operating system you're running:

In [181]:
path = Path.home() / "Documents" / "lectures" / "learning_python"
path

PosixPath('/Users/johekm/Documents/lectures/learning_python')

Hang on, what the hell happened?? Why did we just use the division operator together with strings and somehow magically create a path??

If one side of the `/` operator is a path object, Python doesn't treat `/` as division. Instead, the path object treats it as a path separator and returns a new path object with the two parts joined! Since the line is evaluated from left to right, the whole chain ends up as one path object. The code above is the same as typing:

In [178]:
path = Path.home() / Path("Documents") / Path("lectures") / Path("learning_python")
path

PosixPath('/Users/johekm/Documents/lectures/learning_python')

...or:

In [179]:
path = Path.home() / Path("Documents/lectures/learning_python")
path

PosixPath('/Users/johekm/Documents/lectures/learning_python')

...or just:

In [180]:
path = Path.home() / "Documents/lectures/learning_python"
path

PosixPath('/Users/johekm/Documents/lectures/learning_python')
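
Under the hood this is ordinary Python: path objects define what `/` means for them, so at least one side of each `/` has to be a path object. A minimal sketch using `PurePosixPath` (a `pathlib` class that always uses Mac/Linux syntax, picked here so the printed result is the same on every computer):

```python
from pathlib import PurePosixPath

home = PurePosixPath('/Users/johekm')

# Path object on the left, string on the right
left_result = home / 'Documents'
print(left_result)   # /Users/johekm/Documents

# String on the left, path object on the right works too
right_result = 'Documents' / PurePosixPath('lectures')
print(right_result)  # Documents/lectures

# Two plain strings have no idea what '/' should mean
try:
    'Documents' / 'lectures'
    strings_failed = False
except TypeError:
    strings_failed = True
print(strings_failed)  # True
```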

#### Relative paths

A relative path always starts in the current working directory. We use relative paths to find files and directories in relation to where we are currently situated – where our program currently runs – on our hard drive.

Above, we listed all files and directories in the folder `course_material`, using the `.listdir()` method. Let's do so again, but with a relative path instead of an absolute one:

In [101]:
os.listdir('course_material')

['mutable_scope.png',
 '.DS_Store',
 'while_loop.png',
 'scopes.png',
 'readme',
 'windows_root.png',
 'immutable_1.png',
 'interrupt.png',
 'immutable_3.png',
 'speach.txt',
 'immutable_2.png',
 'if_statement.png',
 'mutable.png',
 'mutable_2.png']

This works since this notebook lives in the same directory as `course_material`, which means the relative path is simply `'course_material'`. Let's go a bit deeper: there is a folder within `course_material` named `readme`. Let's list its content using a relative path:

In [102]:
os.listdir('course_material/readme/')

['material.png',
 '.DS_Store',
 'navigator.png',
 'duplicate.png',
 'searchbar.png',
 'course_start.png',
 'jupyter.png',
 'create_nb.png',
 'documents.png',
 'launchpad.png']

Let's check our absolute path once more, using the `.cwd()` method:

In [103]:
Path.cwd()

PosixPath('/Users/johekm/Documents/lectures/learning_python')

What if we wanted to use a relative path to see what is within the "Documents" folder? This is (on my computer) two "levels" above our working directory. We can type the `..` folder! This isn't a real folder, just a specially named folder to indicate "check one directory level above" – **the parent directory**. 

Uncomment the following code cell and run it, then check to see if the listed files are what you expected. If you placed this course folder in the "Documents" folder on your computer, you should see the contents of your "Documents" folder:

In [113]:
#os.listdir('..')

We can continue using `..` with directory separators to go even further up the directory tree:

In [114]:
#os.listdir("../..")

We can also type a single dot `.`, which indicates _this_ folder. The one we're in. Uncomment to check if it's what you expect it to be on your computer:

In [116]:
#os.listdir(".")
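
If you'd like to see which folders `.` and `..` actually stand for, a path object can translate a relative path into an absolute one with its `.resolve()` method. A minimal sketch (the output depends on where you run it, so none is shown):

```python
from pathlib import Path

here = Path('.').resolve()     # absolute path of the folder we're in
parent = Path('..').resolve()  # absolute path of its parent directory

print(here)
print(parent)

# '..' points at the same folder as the .parent attribute of our location
same_folder = parent == here.parent
print(same_folder)
```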

### Opening files

Now that we've had a look at paths, we can use them to open files! 

Files can be binary files or plaintext files. Binary files consist of a complicated soup of code patterns that is unreadable for humans. Most files you use at your office are probably binary files: Excel files, PDF documents, etc.

Here, we're going to start with a plaintext file. Plaintext means that there is nothing but raw text in the file. When you write a Python script (using the file extension `.py`), this is a plaintext file. There isn't any other information than the actual text characters within the file. Text files (with the extension `.txt`) are also plaintext.

Let's look for a plaintext file! If we check the file contents in the `course_material` folder, we can see that there are two plaintext files therein. Let's use the `.listdir()` method of the `os` module:

In [201]:
os.listdir('course_material/')

['mutable_scope.png',
 '.DS_Store',
 'while_loop.png',
 'scopes.png',
 'readme',
 'windows_root.png',
 'immutable_1.png',
 'interrupt.png',
 'immutable_3.png',
 'speach.txt',
 'immutable_2.png',
 'if_statement.png',
 'hello.txt',
 'mutable.png',
 'mutable_2.png']

Here we see two text files! "speach.txt" and "hello.txt". Let's start with the latter and read its content.

#### Reading files

We can open files with the built-in `open()` function. It has two crucial arguments (it has way more that we will ignore at the moment). First, a _filepath_ that points to the file we want to open (including the filename). 

Second, we pass a string that determines _how_ to open the file. The default is "read" mode, which opens the file but prevents us from changing its content. Let's open the file "hello.txt" in the `course_material` directory:

In [159]:
file = open("course_material/hello.txt","r")

The `open()` function returns a file object, so we save that to a `file` variable! Let's have a look at our file object:

In [160]:
file

<_io.TextIOWrapper name='course_material/hello.txt' mode='r' encoding='UTF-8'>

Here, we can see that the file is opened in read mode, and that it's encoded in UTF-8, a Unicode encoding (not important at the moment). We can use the `.read()` method to have a look at the file content:

In [161]:
file.read()

'Hello world!\n\nSo happy to see that you guys made it to the continuation course.\nThis is where we start having fun!'

The `.read()` method returns all the file's text as one string. As you can see, the file includes newline characters `\n`. The method `.readlines()` also reads the file's contents, but returns the lines as items in a list. Since `.read()` has already moved us to the end of the file, we open the file again first:

In [165]:
file = open("course_material/hello.txt","r")
file.readlines()

['Hello world!\n',
 '\n',
 'So happy to see that you guys made it to the continuation course.\n',
 'This is where we start having fun!']
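
Why open the file again before calling `.readlines()`? A file object remembers how far it has read, and after `.read()` it sits at the end of the file. Here's a minimal sketch of that behaviour. It writes its own small stand-in file to a temporary folder, so the file name and text are just assumptions for the example. The `.seek(0)` method moves the position back to the start:

```python
import os
import tempfile

# Create a small stand-in text file in a temporary folder
folder = tempfile.mkdtemp()
example_path = os.path.join(folder, 'example.txt')
file = open(example_path, 'w')
file.write('Hello world!\nSecond line')
file.close()

file = open(example_path, 'r')
first = file.read()    # reads everything, the position is now at the end
second = file.read()   # nothing left to read
file.seek(0)           # move back to the start of the file
third = file.read()    # reads everything again
file.close()

print(repr(second))    # ''
print(first == third)  # True

# Tidy up the temporary file and folder
os.remove(example_path)
os.rmdir(folder)
```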

When we're done with the file and want to close it, we use the `.close()` method:

In [167]:
file.close()

This means we can't read from the file object any longer:

In [168]:
file.read()

ValueError: I/O operation on closed file.
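
It's easy to forget `.close()`. A common Python idiom is the `with` statement, which closes the file automatically when the indented block ends. A minimal sketch: it creates its own stand-in file in a temporary folder, so the file name and text are assumptions for the example:

```python
import os
import tempfile

# The temporary folder is removed automatically when this block ends
with tempfile.TemporaryDirectory() as folder:
    example_path = os.path.join(folder, 'hello.txt')

    # Create the stand-in file ('w' means write mode)
    with open(example_path, 'w') as file:
        file.write('Hello world!\nSecond line')

    # Read it back, the file is closed as soon as the block ends
    with open(example_path, 'r') as file:
        lines = file.readlines()

    print(lines)        # ['Hello world!\n', 'Second line']
    print(file.closed)  # True
```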

##### Reading binary files

We can also read binary files, but binary content will look like gibberish to the human eye. To read a binary file, we need to pass the argument `"rb"` ("read binary") instead of `"r"` as the second argument of the `open()` function. Let's have a look at an Excel file:

In [213]:
excelFile = open('course_material/test.xlsx',"rb")

Let's not read the entire file, just the first 200 bytes:

In [214]:
excelFile.read(200)

b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00\x0c\xeb\xe3\xff[\x01\x00\x00\x88\x04\x00\x00\x13\x00\x08\x02[Content_Types].xml \xa2\x04\x02(\xa0\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
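
Even this gibberish carries information. The output starts with `PK`, which is the signature of a ZIP archive (an `.xlsx` file is a ZIP archive with XML files inside, hence the `[Content_Types].xml` in the output). A minimal sketch that builds a tiny ZIP archive in memory, so no Excel file is needed, and reads its first four bytes:

```python
import io
import zipfile

# Build a small ZIP archive in memory instead of on the hard drive
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('hello.txt', 'Hello world!')

# Go back to the start and read only the first four bytes
buffer.seek(0)
signature = buffer.read(4)
print(signature)  # b'PK\x03\x04'
```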

#### Creating files and directories

We can pass the string `"w"` as an argument to open in write mode, which lets us change the file's content. If the file our path points to doesn't exist, and we open in write mode, _we will create a file_. Let's try it!

In [191]:
file = open("test.txt","w")

Since we didn't give the `open()` function an absolute path, it treated the path as relative and looked for a file named `test.txt` in the current working directory. Since no such file existed, it created one. Have a look in the course folder, there should now be a new text file named "test"!

So, if we open a file in write mode, and no such file exists, we will create a new file. But what happens if we try to open a file that doesn't exist in read mode?

In [192]:
open("xyz.txt","r")

FileNotFoundError: [Errno 2] No such file or directory: 'xyz.txt'

We get an error!
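
If a missing file shouldn't stop the whole program, we can catch the error. A minimal sketch, assuming no file named `xyz.txt` exists in the working directory:

```python
try:
    file = open('xyz.txt', 'r')
except FileNotFoundError:
    # We end up here when the file doesn't exist
    message = 'xyz.txt does not exist'
else:
    # We end up here when the file could be opened
    message = 'xyz.txt was opened'
    file.close()

print(message)
```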

Ok, so we have created a file, and simultaneously opened this file in write mode. Let's check the file object:

In [193]:
file

<_io.TextIOWrapper name='test.txt' mode='w' encoding='UTF-8'>

Since it is in write mode, we can add content to the file! Let's start by creating a string that we'd like to add to our new file:

In [195]:
text = "This is some very exciting and new content going on here!"

Our file object has a method called `.write()` that takes whatever content we want to add to the file as an argument. This will then be written to the file object:

In [196]:
file.write(text)

57

The method returns an integer, in this case 57. This is the number of characters that were written, which equals the length of our string:

In [197]:
len(text)

57

We've now added our content, let's close the file:

In [200]:
file.close()

If you now check the text file, you'll see that our text string was added! Yay!

**CAUTION!** If you now open the file in write mode again, you'll see that its content has been erased.

In [216]:
file = open("test.txt","w")
file.close()

In [217]:
file = open("test.txt","r")
file.read()

''

If we want to add content to our file without erasing what's already there, we can open it in "append mode", using the argument `"a"`. Let's write our text again, and then open the file in append mode:

In [222]:
file = open("test.txt","w")

In [223]:
file.write(text)
file.close()

In [224]:
file = open("test.txt","a") # append mode

In [225]:
new_text = "\nSome new exciting text that we've added!"

In [226]:
file.write(new_text)

41

In [227]:
file.close()

Let's check the file's content to see if our new text was added:

In [231]:
file = open('test.txt', "r")

In [232]:
print(file.read())

This is some very exciting and new content going on here!
Some new exciting text that we've added!


It worked!
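
To sum up the difference between the two modes, here is a compact sketch that runs the same experiment in a temporary folder (so it leaves your `test.txt` alone). The file name `modes.txt` and the strings are just examples:

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'modes.txt')

file = open(path, 'w')   # write mode: creates the file
file.write('first')
file.close()

file = open(path, 'a')   # append mode: keeps the content, adds to the end
file.write('\nsecond')
file.close()

file = open(path, 'r')
after_append = file.read()
file.close()

file = open(path, 'w')   # write mode again: erases the content
file.close()

file = open(path, 'r')
after_write = file.read()
file.close()

print(repr(after_append))  # 'first\nsecond'
print(repr(after_write))   # ''

# Tidy up the temporary file and folder
os.remove(path)
os.rmdir(folder)
```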

Using the `os` module, we can also create new directories!
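
A minimal sketch of what that can look like, using `os.mkdir()` for a single new folder and `os.makedirs()` for a whole chain of nested folders. The folder names are made up, and everything happens inside a temporary folder so nothing on your computer gets cluttered:

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()

# os.mkdir() creates one new folder inside a folder that already exists
single = os.path.join(base, 'receipts')
os.mkdir(single)

# os.makedirs() also creates any missing folders along the way
nested = os.path.join(base, 'reports', '2020', 'april')
os.makedirs(nested)

single_exists = os.path.isdir(single)
nested_exists = os.path.isdir(nested)
print(single_exists)  # True
print(nested_exists)  # True

# Remove the temporary folder and everything in it
shutil.rmtree(base)
```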