# UNIT 3: Reuse, Modularity, and External Resources

## Table of Contents

1. [Functions](#1.-Functions)  
    1.1. [Declaring and Calling Functions](#1.1.-Declaring-and-Calling-Functions)  
    1.2. [Passing Arguments](#1.2.-Passing-Arguments)  
    1.3. [Documenting Functions](#1.3.-Documenting-Functions)  
    1.4. [Nesting and Recursion](#1.4.-Nesting-and-Recursion)  
    1.5. [Lambda Functions](#1.5.-Lambda-Functions)  
2. [Modularity](#2.-Modularity)  
    2.1. [Classes and Objects](#2.1.-Classes-and-Objects)  
    2.2. [Modules and Packages](#2.2.-Modules-and-Packages)  
    2.3. [The `import` Statement](#2.3.-The-import-Statement)  
3. [External Modules and Packages](#3.-External-Modules-and-Packages)  
    3.1. [Dependencies and Virtual Environments](#3.1.-Dependencies-and-Virtual-Environments)  
    3.2. [Environment and Package Management](#3.2.-Environment-and-Package-Management)  
4. [External Data](#4.-External-Data)  
    4.1. [Reading and Writing Files](#4.1.-Reading-and-Writing-Files)  
    4.2. [Working with Files and Directories](#4.2.-Working-with-Files-and-Directories)  
    4.3. [Using Common Exchange Formats](#4.3.-Using-Common-Exchange-Formats)  

# 1. Functions

A function is a **self-contained code block that runs only when explicitly called**. Functions encapsulate common tasks into a reusable format that helps avoid repetition. We have previously encountered a number of Python's built-in functions, such as `range()` and `enumerate()`, but it is also possible for us to define our own.

## 1.1. *Declaring* and *Calling* Functions

The keyword `def` is used to **declare a new function**. It should be followed by the **name** we wish to give it, a **set of parentheses**, and a colon. The code to be executed by the function should be **indented** below the definition:

<div class="alert alert-block alert-warning">
<b>Warning:</b> There <b>must</b> be a code block under a function definition, otherwise Python will raise an error. Like in the case of <code>if</code> statements, however, one can use <code>pass</code> as a placeholder.
</div>

In [1]:
def print_something():
    print('something')

In the cell above, we *declare* a function called `print_something` that prints the text `something` to the console. 
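As a minimal sketch of the placeholder mentioned in the warning above, the cell below declares a function whose entire body is `pass`. The name `do_nothing_yet` is invented for this illustration. The declaration is valid, even though the function does not do anything useful yet:

```python
def do_nothing_yet():
    pass  # placeholder body: does nothing, but keeps the declaration valid
```

Removing the `pass` line would leave the declaration without a body, which Python rejects.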

Once a function has been declared, we can **call it** (*i.e.* invoke it, or use it) from anywhere in the current programme. To call a function one simply uses its **name followed by parentheses**:

In [2]:
print_something()

something


The most common usage of functions involves **passing them data**, having them **perform certain operations** on the data, and **return the results**. The last step, *i.e.* returning the results, is accomplished via the `return` statement. Let's declare a **new function that returns *something* instead of printing**:

In [3]:
def return_something():
    return 'something'

The `return` statement gives back control to the line of code where the function is called while also providing whatever follows the `return` keyword (the text `something` in the example above) as the **value of the function call**. We can, for example, **store the return value in a variable** for later use:

In [4]:
result = return_something() # result now contains the return value of return_something()
print(result)

something
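Because `return` hands control back to the caller, any code placed after it in the same block never runs. The sketch below illustrates this with two hypothetical functions, `return_early` and `return_nothing` (both names are made up for this example). It also shows that calling a function with no `return` statement yields the special value `None`:

```python
def return_early():
    return 'something'
    print('this line is never reached')  # skipped: return has already ended the call

def return_nothing():
    pass  # no return statement at all

print(return_early())    # prints: something
print(return_nothing())  # prints: None
```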


A function can **return multiple values** at the same time: simply separate them with commas:

In [5]:
def return_something():
    return 'something', 'something else'

Multiple values are returned as an **n-tuple** by default, where *n* equals the number of values to return. This means that multiple return values can be *unpacked* just like any tuple:

In [6]:
full_result = return_something() # full_result now contains the 2-tuple
result_1, result_2 = return_something() # here the 2-tuple is unpacked into two variables
print(full_result)
print(result_1)
print(result_2)

('something', 'something else')
something
something else


## 1.2. Passing Arguments

If we intend to supply data to a function, we need to specify how when declaring it. The data **received** by a function are called **parameters** and they are specified in the declaration as variables inside the parentheses. Multiple *parameters* can be added by separating them with commas:

In [8]:
def concatenate_us(word_1, word_2):
    return f'{word_1} and {word_2}'

In the example above, we define the function `concatenate_us` with two parameters: `word_1` and `word_2` (*i.e.* the words to be concatenated).

Data **supplied** to functions are called **arguments**, and must be included inside the parentheses when calling the function:

In [9]:
result = concatenate_us('salt', 'pepper')
print(result)

salt and pepper


### Positional Arguments

In the example above, `salt` and `pepper` are **positional** arguments: the first argument in the function call (`salt`) is used as the value for the first parameter in the function declaration (`word_1`) and the second argument (`pepper`) is used as the value of the second parameter (`word_2`). **Positional arguments must be passed in the order in which the corresponding parameters were declared**. Compare the results of the following cell with those from the cell above:

In [10]:
result = concatenate_us('pepper', 'salt')
print(result)

pepper and salt


In addition, when using *positional* arguments, a function **must be called with the correct number of arguments**, otherwise Python will raise an error:

In [11]:
result = concatenate_us('salt')
print(result)

TypeError: concatenate_us() missing 1 required positional argument: 'word_2'

### Default Arguments

Function *parameters* can be **assigned default values** so that, if the function is called without a corresponding argument, the parameter still gets some data. **Default** arguments need to be indicated when declaring the function using the `parameter=value` syntax. Let's rewrite `concatenate_us()` and assign a default to the second parameter:

In [12]:
def concatenate_us(word_1, word_2='pepper'):
    return f'{word_1} and {word_2}'

Let's try calling it again with only one argument:

In [13]:
result = concatenate_us('salt')
print(result)

salt and pepper


The *default value* takes the place of the *missing argument*. If we call the function with **both arguments**, however, the **default value is ignored**:

In [14]:
result = concatenate_us('salt', 'vinegar')
print(result)

salt and vinegar


### Keyword Arguments

When using **keyword** arguments, in addition to providing argument values to a function, we **also supply the name of the corresponding parameter**. *Keyword* arguments are passed using the syntax `parameter_name=parameter_value`, more commonly stated as `keyword=value`. 

Unlike positional arguments, then, *keyword* arguments are matched to parameters based on their names, thus eliminating the need to worry about the order in which arguments are passed:

In [15]:
result = concatenate_us('salt', 'pepper') # positional arguments
print(result)

salt and pepper


In [16]:
result = concatenate_us('pepper', 'salt') # reversed positional arguments
print(result)

pepper and salt


In [17]:
result = concatenate_us(word_1='salt', word_2='pepper') # keyword arguments
print(result)

salt and pepper


In [18]:
result = concatenate_us(word_2='pepper', word_1='salt') # reversed keyword arguments
print(result)

salt and pepper
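Positional and keyword arguments can also be combined in a single call, as long as the positional ones come first. The sketch below repeats the declaration of `concatenate_us()` so that it can run on its own, and also shows what happens if the same parameter is supplied twice:

```python
def concatenate_us(word_1, word_2='pepper'):
    return f'{word_1} and {word_2}'

# positional argument first, keyword argument second: allowed
mixed = concatenate_us('salt', word_2='vinegar')
print(mixed)  # prints: salt and vinegar

# supplying word_1 both positionally and by keyword raises a TypeError
try:
    concatenate_us('salt', word_1='sugar')
except TypeError as error:
    print(f'TypeError: {error}')
```

Writing a positional argument after a keyword one, as in `concatenate_us(word_1='salt', 'pepper')`, is rejected by Python as a syntax error before the code even runs.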


So far, **we have known in advance the number of arguments a function will receive**, but what if we didn't?

### Arbitrary Arguments

*Arbitrary* arguments allow us to pass a **variable number** of arguments to a function. For this to work, we **must indicate which parameter will take arbitrary arguments** when the function is declared. This is accomplished by prefixing the parameter's name with an asterisk (`*`).

<div class="alert alert-block alert-info">
<b>Note:</b> Any parameter name can be assigned to an arbitrary argument but, by convention, the name <code>args</code> is normally used for this role. 
</div>

Let's declare a new function that will **take any number of integers** and **return their sum**:

In [19]:
def get_sum(*args):
    result = 0 # set a variable to hold the results
    for arg in args: # this loop iterates over the arguments - regardless of their number
        result = result + arg # and adds each to the ongoing sum results
    return result

The reason we can use a `for` loop, as above, is because the parameter taking arbitrary arguments becomes an iterable (technically a *tuple*). Each item in the tuple is one of the arguments passed to the function. 

Let's test it:

In [20]:
result = get_sum(2, 2, 4, 6)
print(result)

14
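To see that the starred parameter really is a tuple, we can write a small helper that simply reports what it received. The function `inspect_args` is a hypothetical name used only for this sketch:

```python
def inspect_args(*args):
    # return the type name of args together with its contents
    return type(args).__name__, args

print(inspect_args(2, 2, 4, 6))  # prints: ('tuple', (2, 2, 4, 6))
print(inspect_args())            # prints: ('tuple', ())
```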


*Arbitrary* arguments can be **used alongside other types**. Let's rewrite `get_sum()` so that it also **takes the name of a unit of measure** to be included in the return value:

In [21]:
def get_sum(unit, *args):
    result = 0
    for arg in args:
        result = result + arg
    return f'{result}{unit}'

Let's test it:

In [22]:
result = get_sum('m', 2, 2, 4, 6) # we'll assume the numbers represent distances in metres
print(result)

14m


`*args` **is always collected into a tuple, even if it is an empty one**:

In [24]:
result = get_sum('m')
print(result)

0m


Because of this, adding `*args` as the **last** positional parameter when declaring a function is sometimes used as a defensive measure: it ensures that the function will not raise an error when it is called with more positional arguments than expected. Any extra arguments are collected in `args`; if there are none, `args` is simply empty. Note that leaving out a *required* argument will still raise an error.

Let's add a boolean parameter to `get_sum()`, to indicate whether the unit name should be used as a **suffix** (*e.g.* SI measures, as above) or **prefix** (*e.g.* currency):

In [26]:
def get_sum(unit, is_prefix=False, *args):
    result = 0
    for arg in args:
        result = result + arg

    if is_prefix: # test if the unit should be prefixed or not
        output = f'{unit}{result}' # and format the output accordingly
    else:
        output = f'{result}{unit}'
    
    return output

Let's test it:

In [28]:
result = get_sum('m', False, 2, 2, 4, 6)
print(result)

14m


In [29]:
result = get_sum('$', True, 2, 2, 4, 6)
print(result)

$14


It appears to work as intended, but, **is it really necessary to supply `False` in the first example?** After all, we assigned a *default* value for `is_prefix`, we should be able to get the same result without providing a value:

In [30]:
result = get_sum('m', 2, 2, 4, 6)
print(result)

m12


Alas, we do not: **arguments will be assigned to positional parameters until those are filled, then, and only then, will any remaining ones be added to** `*args`. In this case, the value `2` is being matched with the `is_prefix` parameter. Since `2` will return `True` if tested as a boolean, the *unit* is treated as a prefix:

In [31]:
print(bool(2))

True


<div class="alert alert-block alert-success">
<b>Exercise:</b> There <b>is</b> a way to trigger the <i>default</i> value for <code>is_prefix</code>. Can you find it?
</div>

In [None]:
result = get_sum() # add your arguments here to try!
print(result)

Arbitrary arguments are added to the `*args` iterable in the order in which they are supplied. Strictly speaking, they are **arbitrary *positional* arguments**. The distinction is important, because Python also supports:

### Arbitrary *Keyword* Arguments

These work much like their *positional* counterparts: it is necessary to indicate which parameter is to receive them when the function is declared. Instead of using a single asterisk as a prefix, however, for *arbitrary keyword arguments* we use **two asterisks** (`**`).

<div class="alert alert-block alert-info">
<b>Note:</b> As before, any parameter name can be assigned to receive arbitrary <i>keyword</i> arguments, but <code>kwargs</code> is used by convention. 
</div>

Let's rewrite our `get_sum()` function using `**kwargs`:

In [32]:
def get_sum(unit, *args, **kwargs):
    result = 0
    for arg in args:
        result = result + arg
    
    # since kwargs could be empty, we must test that 'is_prefix' exists
    # AND that it is True
    if 'is_prefix' in kwargs and kwargs['is_prefix'] is True:
        output = f'{unit}{result}'
    else:
        output = f'{result}{unit}'
    
    return output

When calling a function that accepts *arbitrary keyword* arguments, we can pass any number of `keyword=value` pairs. The parameter taking these arguments (*i.e.* `**kwargs`) **becomes a dictionary** that maps each *keyword* to the *value* that was passed alongside it. This means that we can, for example, **test whether a specific keyword is present** using the `in` operator, and **extract its value** using the square brackets syntax of dictionaries, as in the example above.
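The dictionary behaviour can be checked with a small sketch. The function `show_kwargs` and the `separator` keyword are hypothetical names chosen for illustration. The last two lines use the standard dictionary method `get()`, which returns a fallback value when the key is missing and could be used as a shorter alternative to testing with `in` first:

```python
def show_kwargs(**kwargs):
    # hand back the collected keyword arguments unchanged
    return kwargs

collected = show_kwargs(is_prefix=True, separator=' ')
print(collected)                 # prints: {'is_prefix': True, 'separator': ' '}
print(type(collected).__name__)  # prints: dict

# dict.get() returns the second argument when the key is absent
print(collected.get('is_prefix', False))      # prints: True
print(show_kwargs().get('is_prefix', False))  # prints: False
```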

Let's see how the new version of `get_sum()` works:

In [33]:
result = get_sum('m', 2, 2, 4, 6)
print(result)

14m


In [34]:
result = get_sum('$', 2, 2, 4, 6, is_prefix=True)
print(result)

$14


As you can see in the first example, unlike with the previous version, we can now **omit the `is_prefix` argument** if we just want the default behaviour.
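Python offers another route to the same behaviour, which is not the approach taken in this unit: a parameter declared *after* `*args` can only be passed by keyword, so it can carry a default value without intercepting positional arguments. The sketch below assumes that alternative; the name `get_sum_kw` is invented so that it does not clash with `get_sum()`:

```python
def get_sum_kw(unit, *args, is_prefix=False):
    result = 0
    for arg in args:
        result = result + arg

    # is_prefix can only be set by keyword, never by position
    if is_prefix:
        return f'{unit}{result}'
    return f'{result}{unit}'

print(get_sum_kw('m', 2, 2, 4, 6))                  # prints: 14m
print(get_sum_kw('$', 2, 2, 4, 6, is_prefix=True))  # prints: $14
```

Compared with `**kwargs`, this version makes the accepted keyword visible in the function declaration, at the cost of not accepting any other keywords.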

## 1.3. Documenting Functions

It is considered good practice to **document what a function does**. The recommended way is to include a *document string*, or ***docstring*** for short, **between the function declaration and the function code**. **Docstrings** are enclosed in triple-quotes and should explain **what arguments the function takes**, **what it does with them**, and **what it returns**. 

Let's rewrite `get_sum()` to include a *docstring*:

In [35]:
def get_sum(unit, *args, **kwargs):
    '''Takes a unit name, any number of integers, and an optional 
    boolean indicating whether the unit name should be prefixed, 
    and returns their sum concatenated with the unit name'''
    result = 0
    for arg in args:
        result = result + arg
        
    if 'is_prefix' in kwargs and kwargs['is_prefix'] is True:
        output = f'{unit}{result}'
    else:
        output = f'{result}{unit}'
    
    return output

One of the key advantages of *docstrings* is that they can be **retrieved programmatically**, which helps enormously when documenting large projects:

In [36]:
print(get_sum.__doc__)

Takes a unit name, any number of integers, and an optional 
    boolean indicating whether the unit name should be prefixed, 
    and returns their sum concatenated with the unit name
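Notice that the raw `__doc__` attribute keeps the indentation of the second and third lines. As a sketch of one way around this, the standard library's `inspect.getdoc()` returns the same text with the indentation cleaned up. The function `add_unit` below is a hypothetical, shortened stand-in for `get_sum()`:

```python
import inspect

def add_unit(value, unit):
    '''Takes a number and a unit name,
    and returns them concatenated as a string'''
    return f'{value}{unit}'

print(add_unit.__doc__)          # raw docstring, second line still indented
print(inspect.getdoc(add_unit))  # same text with the indentation removed
```

In an interactive session, calling the built-in `help(add_unit)` displays the docstring alongside the function's signature.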


## 1.4. Nesting and Recursion

Python allows what are called **inner** or **nested** functions: namely ones **declared inside another function**. 
*Nested* functions are part of the scope of their parent (or *outer*) function. This means that a *nested* function can only be executed if the parent is called. Because of this, *nested* functions are often used as a means to restrict the use of certain code to very specific execution contexts (*i.e.* that of the parent function). 

Let's rewrite `get_sum()` using a *nested* function:

In [37]:
def get_sum(unit, *args, **kwargs):
    '''Takes a unit name, any number of integers, and an optional 
    boolean indicating whether the unit name should be prefixed, 
    and returns their sum concatenated with the unit name'''
    result = 0
    
    def get_output():
        '''Formats the result based on the value of is_prefix'''
        if 'is_prefix' in kwargs and kwargs['is_prefix'] is True:
            output = f'{unit}{result}'
        else:
            output = f'{result}{unit}'
        
        return output
    
    for arg in args:
        result = result + arg

    return get_output()

In the cell above, the function `get_output()`, declared within `get_sum()`, formats the output string according to the value of the `is_prefix` parameter. 

Note how in order to do its job, `get_output()` **needs** `kwargs` (to determine the value of `is_prefix`), `unit`, **and** `result` (to build the output string), but we did not include any of them as parameters when declaring it, nor do we supply them as arguments when we call `get_output()` in line `19`. And yet, it works:

In [38]:
result = get_sum('$', 2, 2, 4, 6, is_prefix=True)
print(result)

$14


The reason is that *nested* functions have **access to the variables of the enclosing scope**, in this case those of `get_sum()` (such as `kwargs` and `result`). Like the fact that *nested* functions can only be executed when the parent is called, this is a consequence of them being part of the scope of their parent:

In [40]:
get_output() # will raise error, as the parent get_sum() is not being called

NameError: name 'get_output' is not defined

Another property of functions in Python is that they support **recursion**. This means that **a function can call itself**. In certain situations, recursion provides a concise and elegant solution to common programming problems. One such case is iteration over nested data structures: let's say that we have a list of integers, with some of them arranged in sub-groups (*i.e.* nested lists), like this:

<div style="border: 2px solid #abd4ff; padding:3px; display: flex; width: 332px; border-radius: 4px; margin-top: 20px">
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-right: auto">12</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-right: auto">8</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-right: auto">93</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-right: auto">9</div>
<div style="border: 2px solid #ffb979; padding: 3px; margin-top: -2px; margin-bottom: -2px; display: flex; border-radius: 4px; margin-right: auto">
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-left: 2px">2</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-left: 2px; margin-right: 2px">4</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-right: 2px">9</div>
</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-right: auto">89</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-right: auto">7</div>
<div style="border: 2px solid #ffb979; padding: 3px; margin-top: -2px; margin-bottom: -2px; display: flex; border-radius: 4px; margin-right: auto">
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-left: 2px">2</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-left: 2px; margin-right: 2px">34</div>
</div>
<div class="alert-success" style="display: inline; padding: 5px; border-radius: 4px; margin-right: 2px">90</div>
</div>

We want to write a function that takes the above structure and returns a flat list of all the integers, like this:

<div style="border: 2px solid #abd4ff; padding: 3px; display: flex; width: 300px; border-radius: 4px; margin-top: 20px; margin-bottom: 20px">
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">12</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">8</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">93</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">9</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">2</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">4</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">9</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">89</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">7</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">2</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">34</div>
<div class="alert-success" style="display: flex; padding: 5px; border-radius: 4px; margin-right: auto">90</div>
</div>

We could do it like this:

In [41]:
nested_data = [12, 8, 93, 9, [2, 4, 9], 89, 7, [2, 34], 90]

def flatten(data):
    output = []
    
    for item in data:
        if type(item) is list:
            output = output + flatten(item)
        else:
            output.append(item)
    
    return output

print(flatten(nested_data))


[12, 8, 93, 9, 2, 4, 9, 89, 7, 2, 34, 90]


The function checks the `type` of the item being processed. If it is a list, then it calls itself to process it. Because of this recursive behaviour, the function is able to deal with **any level of nesting**:

In [42]:
triple_nested_data = [12, 8, [2, [4, [5, 5.5], 6], 9], 7, [2, 34], 90]
print(flatten(triple_nested_data))

[12, 8, 2, 4, 5, 5.5, 6, 9, 7, 2, 34, 90]
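The same pattern applies to other tasks on nested data. The sketch below uses a hypothetical function, `nested_sum`, that adds up all the numbers instead of collecting them. It makes the two halves of a recursive function explicit: the *recursive case*, where the function calls itself, and the *base case*, where it does not. It also uses the built-in `isinstance()` as an alternative to comparing `type()` directly; unlike the latter, `isinstance()` also accepts subclasses of `list`:

```python
def nested_sum(data):
    total = 0
    for item in data:
        if isinstance(item, list):
            total = total + nested_sum(item)  # recursive case: descend into the sub-list
        else:
            total = total + item  # base case: a plain number, no further calls
    return total

nested_data = [12, 8, 93, 9, [2, 4, 9], 89, 7, [2, 34], 90]
print(nested_sum(nested_data))  # prints: 359
```

Without a base case, a recursive function would keep calling itself until Python stops it with a `RecursionError`.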


## 1.5. Lambda Functions

A **lambda** function is an anonymous function that can take any number of arguments but, unlike normal functions, **evaluates and returns only one expression**. *Lambda* functions are declared using the keyword `lambda` instead of `def`. They can **take both positional and keyword arguments**, but, unlike a normal function, the **parameters are not specified in parentheses**. Their body, *i.e.* the code to execute, should be a **single expression** (ideally a single line). The general signature then looks like this:

`lambda` *`parameters`*: *`expression`*

Let's write a *lambda* function that takes three integers as arguments and returns their sum:

In [43]:
simple_sum = lambda x, y, z: x + y + z

print(simple_sum(4, 6, 20))

30


In the cell above, we assigned the *lambda* function to a variable so that we could call it easily, but this is not necessary. 
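As a quick sketch of both points made so far, a *lambda* function can be called on the spot by wrapping it in parentheses, and its parameters can have default values and be passed by keyword just like those of a regular function:

```python
# call the lambda immediately, without giving it a name
print((lambda x, y, z: x + y + z)(4, 6, 20))  # prints: 30

# parameters can have defaults and be passed by keyword
print((lambda x, y=10: x * y)(3))       # prints: 30
print((lambda x, y=10: x * y)(3, y=2))  # prints: 6
```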

Unlike regular functions, ***lambda* functions are not meant to be reused**; they are a means to **evaluate an expression in a single context**. They are usually **passed as arguments** to *higher-order* functions (*i.e.* those that take other functions as arguments).

A good example of the latter is the `sort()` method of lists. As a reminder, in **Unit 1** we saw how this method can be used to sort a list alphanumerically, in either *ascending* (default) or *descending* order:

In [46]:
beatles = ['John Lennon', 'Paul McCartney', 'George Harrison', 'Ringo Starr', 'Pete Best']

beatles.sort()
print(beatles)

['George Harrison', 'John Lennon', 'Paul McCartney', 'Pete Best', 'Ringo Starr']


In [47]:
beatles.sort(reverse=True)
print(beatles)

['Ringo Starr', 'Pete Best', 'Paul McCartney', 'John Lennon', 'George Harrison']


In the example, the names in the list are compared as whole strings, so they end up **ordered by their first letters** (*i.e.* by the given name). **What if we wanted to sort the list by surname?** 

`sort()` can take a `key` argument that determines what values are used for sorting, and that argument can be a *lambda* function. For our purposes, the function in question will have to take the full name and return just the surname. We can write it like this:

```python
lambda name: name.split()[1]
```

Let's break it down: first, we **declare the function** using the keyword `lambda` and **assign a single parameter**, `name`, which, for each item in the list, will contain the full string (*e.g.* `George Harrison`). After the colon, we **include the expression to be evaluated**, which in this case **splits the string** (with the default space separator) and **returns the second item** in the resulting list:

```python
name => 'George Harrison'
name.split() => ['George', 'Harrison']
name.split()[1] => 'Harrison'
```

The `sort()` method will then arrange the list in alphabetical order of surname (either *ascending* or *descending*, depending on whether we supply the `reverse` parameter or not):

In [48]:
beatles.sort(key=lambda name: name.split()[1])
print(beatles)

['Pete Best', 'George Harrison', 'John Lennon', 'Paul McCartney', 'Ringo Starr']


In [49]:

beatles.sort(key=lambda name: name.split()[1], reverse=True)
print(beatles)

['Ringo Starr', 'Paul McCartney', 'John Lennon', 'George Harrison', 'Pete Best']
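`sort()` is not the only place where a `key` function is accepted. As a sketch, the built-in functions `sorted()` and `max()` take the same kind of argument. Here the *lambda* returns the length of each name; the variable names `by_length` and `longest` are introduced just for this example:

```python
beatles = ['John Lennon', 'Paul McCartney', 'George Harrison', 'Ringo Starr', 'Pete Best']

# sorted() returns a new list and leaves the original untouched
by_length = sorted(beatles, key=lambda name: len(name))
print(by_length)
# prints: ['Pete Best', 'John Lennon', 'Ringo Starr', 'Paul McCartney', 'George Harrison']

# max() uses the key to decide which item is the "largest"
longest = max(beatles, key=lambda name: len(name))
print(longest)  # prints: George Harrison
```

Names of equal length, such as `John Lennon` and `Ringo Starr`, keep their original relative order because Python's sorting is stable.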


# 2. Modularity

*Modularity* refers to the practice of **breaking up a large, unwieldy code base into separate, smaller, and more manageable modules**, which can then be chained together as necessary to accomplish a task. *Simplicity*, *maintainability*, and *reusability* are some of the key advantages to modularizing code. In this section we'll review some of the key constructs Python uses to facilitate modular programming.

## 2.1. Classes and Objects

Python is an [object-oriented programming](https://www.educative.io/blog/object-oriented-programming) (**OOP**) language. The OOP paradigm is built around the concept of **objects**, which are structures that contain **data**, in the form of **attributes**, and **code**, in the form of **methods**, to perform operations on those attributes. 

Python's approach to OOP is *class-based*, meaning that objects are considered **instances of classes**. A **class** is a sort of **generic blueprint or prototype** for an object: it determines its type and characteristics. **Objects**, then, are *instances*, or copies, of the class with actual values.

Let's look at an example: `date` is a [class from Python's standard library](https://docs.python.org/3/library/datetime.html?highlight=datetime#date-objects) that represents a date in an idealized calendar (the Gregorian calendar extended indefinitely into both past and future). The class itself is just a generic representation of all possible dates; if we want to work with an **actual date**, we need to *instantiate* it into a *date object*, *i.e.* create a specific instance of the class. 

*Date objects* are created by calling `date` with three integer arguments representing a `year`, a `month`, and a `day`:

In [50]:
from datetime import date  # we need to import the class first, this is covered in the next section

# now we can instantiate a date object
lincoln_dob = date(1809, 2, 12) # yes, this is Abraham Lincoln's date of birth

We can check that `lincoln_dob` is indeed a *date* object by using the `type` function we learned in **Unit 1**:

In [51]:
print(type(lincoln_dob))

<class 'datetime.date'>


We can also check the actual value by printing `lincoln_dob`. *Date objects* are printed by default in the `year-month-day` format:

In [52]:
print(lincoln_dob)

1809-02-12


As discussed above, **an object contains data in the form of *attributes***. The key attributes of a *date object* are the values for `day`, `month`, and `year`. Object attributes can be retrieved by **appending a dot and their name** to the object (no parentheses!):

In [53]:
print(lincoln_dob.day)
print(lincoln_dob.month)
print(lincoln_dob.year)

12
2
1809


Objects also have **methods, functions that perform certain useful operations**, usually on their *attributes*. Object methods are called by **appending a dot and their name followed by parentheses** to the object (methods are functions, hence the parentheses).

For example, `weekday()`, one of the methods of *date objects*, returns the day of the week as an integer (Monday=0, Sunday=6):

In [55]:
print(lincoln_dob.weekday())

6


So, Lincoln was born on a Sunday. **What if we wanted to print the name of the day programmatically?** We could always build a dictionary to map *day integers* to their *corresponding English names*, but this sounds like the kind of common task that a class should provide a method for, and `date` does! 

*Date objects* have a method called `strftime()` that returns a string representing the date in whatever format is requested by supplying a [format string](https://docs.python.org/3/library/datetime.html?highlight=datetime#strftime-strptime-behavior) in the parentheses:

In [56]:
print(lincoln_dob.strftime('%d/%m/%Y')) # output as day/month/year
print(lincoln_dob.strftime('%d/%m/%y')) # same but use year without century (i.e. just two digits)
print(lincoln_dob.strftime('%d-%b-%Y')) # with dashes as separators and the name of the month (abbreviated)

12/02/1809
12/02/09
12-Feb-1809


In [57]:
# let's go for broke:
full_date = lincoln_dob.strftime('a %A, on the %dth day of %B, in the year %Y')
print(f'Abe Lincoln was born on {full_date}.')

Abe Lincoln was born on a Sunday, on the 12th day of February, in the year 1809.
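For comparison, the manual approach mentioned earlier, a dictionary mapping *day integers* to English names, could be sketched as follows. The dictionary name `day_names` is an assumption made for this example:

```python
from datetime import date

# map the integers returned by weekday() to English day names
day_names = {
    0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
    4: 'Friday', 5: 'Saturday', 6: 'Sunday',
}

lincoln_dob = date(1809, 2, 12)
print(day_names[lincoln_dob.weekday()])  # prints: Sunday
```

One practical difference is that the names produced by `strftime()` directives such as `%A` and `%B` depend on the locale the programme runs under, whereas the dictionary always yields the English names.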


And those are the basics of *classes* and *objects*.

<div class="alert alert-block alert-info">
<b>Note:</b> We'll learn how to write our own classes in <b>Unit 4</b>. 
</div>

## 2.2. Modules and Packages

These represent the two key ways to group and organize code in Python. **Modules** are files containing Python *classes* and *functions*, usually grouped on the basis of some similarity, such as their focus on a common task or set of tasks. **Packages** are *collections* of *modules*, organized as directories and subdirectories containing *sub-packages* and *modules*.

### Dot Notation

*Modules* and *packages* organize their constituent elements, such as *functions* and *classes* with their *methods* and *attributes*, in a hierarchical manner. A generic version of such a structure would look something like this:

<img src="https://mermaid.ink/img/pako:eNp9kt9rgzAQgP8VyV46aCkKwghFWOu630_bmyklJpc2NJoSI2OI__tirKPYUp-S-75c7s40iGkOCCOh9A_bU2OD75SUgfuYolWVgggKzWsFgZBK4Tv2ALEQI0PUJbNSlydH5Dxm4chhqqpOHCIhIBpxau1wXEAM8bgIsHvNr1zw2BC0yJMjZQe6g8U8Twhqg9ksCZZZ1pe-DTcbjHG_OZ3yxurfiC6MlTees6E3l8QZw-7ceWkaX-Y2bFundI32eOlxejNF7zzdTrEecDTCa49fJ_10tuF914Rfn_O3gUfX-fskczOUifsFRua1dQNbzKWb46bzXfRc_riQo0sZTVEBpqCSu6fVdBGC7B4KIAi7JafmQBApW-fR2uqv35IhbE0NU1QfObWQSroztEBYUFW5KHBptfns36p_su0fh-zcVg" width=700>

Python uses **dot notation** as a means to reference the different elements in the hierarchy. For example, to reference `attribute_2` of `class_2` in the chart above, we would use:

<div style="margin: 20px">
<div style="display: inline; padding: 5px; border-radius: 4px; background-color: #c8e5ff; font-family: monospace">module_1</div><b>.</b>
<div style="display: inline; padding: 5px; border-radius: 4px; background-color: #e2ffe2; font-family: monospace">class_2</div><b>.</b>
<div style="display: inline; padding: 5px; border-radius: 4px; background-color: #ffe5e5; font-family: monospace">attribute_2</div>
</div>

Basically, **dots are used to separate elements in the tree**, which are arranged in **decreasing hierarchical order** down to the desired element. The starting point, `module_1` in the example above, is determined by the way in which a *module* or *package* is made accessible to your code.
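As a concrete sketch of the same idea, the `datetime` module from the standard library follows this layout: it contains the `date` class, which in turn has a class attribute called `max` (the latest date it can represent) and methods such as `weekday()`:

```python
import datetime

# module . class . attribute
print(datetime.date.max)       # prints: 9999-12-31

# the chain can continue: the attribute is itself a date object with a year
print(datetime.date.max.year)  # prints: 9999

# module . class(...) creates an object, and . method() is called on it
print(datetime.date(1809, 2, 12).weekday())  # prints: 6
```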

## 2.3. The `import` Statement

The `import` statement is the way in which the contents of a module are made available to a programme. There are a number of ways in which `import` can be used. Let's look at them with some examples. We'll focus on the module called `datetime`, which contains, amongst others, the `date` class we used earlier.

At its most basic `import` can be called with just the name of the *module* we wish to make available:

In [1]:
import datetime

**Multiple modules can be imported in the same statement** by separating them with a comma. If, in addition to `datetime`, we wished to also import the `math` module (which provides access to [mathematical functions](https://docs.python.org/3/library/math.html#module-math)), we could do it like this:

In [3]:
import datetime, math # datetime is already imported, so Python simply reuses the loaded module

Importing a module, however, **does not make the module's functions and classes *directly* accessible to us**: even after importing `datetime` we can't, for instance, invoke the class `time` (peer of the `date` class from the earlier example):

In [4]:
lincoln_tob = time(8, 44) # no, this is NOT Lincoln's time of birth!
print(lincoln_tob)

NameError: name 'time' is not defined

When we `import` a module, what we are doing is **creating a new *namespace***, a sort of **entrypoint** for all the objects it contains. The individual objects **need to be accessed using *dot notation***, as explained above, **using the module's name as the starting point**.

The `date` and `time` classes are organized like this within the `datetime` module:

<img src="https://mermaid.ink/img/pako:eNqdk1trwyAUgP9KcC8ttJQVAkNCYG13Z0_boy9Gj4s0icMcKaX0v0-TLAvpHtrmSfR837kYD0QYCYQSVZidyLnF6HPFqsh_ouB1vQEVlUa6AiKli4LeiDuIlRpFKFcJ1KbqYlQmY3E7ihFFXXfnsFQKlqNzjviLK4ghHhcBmBv5T4L7CSNJlkqOgLqEZJGlSWYXaaLTSVv5NFnolJEppbRrZT6fR6vAhebTFmuWPdnkHYCh-i5fQ697OiQ-n1439GZA78ewH4TVmcNh3X5vyD_0fGkqzK8wPPaGPXB7heCpF-wAtr6NyXRsaa9sOP1mY2h57i01WhUu4kLNqtG89JrcuMu6aQWvfwPVlTu90DMUb9e3QmakBFtyLf1DPAQpI5hDCYxQv5Tcbhlh1dHHcYfmY18JQtE6mBH3Hf6_jeZflpeEKl7UfhekRmPf25fdPPDjD4c4NhQ" width=900>

If we import `datetime` like this (as we did above):

```python
import datetime
```

We need to use the following syntax to call the `time` class:

In [5]:
lincoln_tob = datetime.time(8, 44)
print(lincoln_tob)

08:44:00


The `import` statement allows **importing only a subset of the objects in a module**. For example, if we know we will only need the `time` class, we could import it like this:

In [6]:
from datetime import time

That statement makes the `time` class, and all its methods and attributes, *directly* available to us: 

In [8]:
lincoln_tob = time(8, 44) # note we use just 'time'!
print(lincoln_tob)

08:44:00


The rest of the `datetime` module, however, **is not directly available**. If we try, for example, to use the `date` class by its bare name:

In [9]:
lincoln_dob = date(1809, 2, 12)
print(lincoln_dob)

NameError: name 'date' is not defined

<div class="alert alert-block alert-info">
<b>Note:</b> Because this usage of <code>import</code> places the object names directly into the local symbol table, any objects that already exist with the same name will be overwritten.
</div>
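
The effect described in the note can be seen in a minimal sketch. The variable `time` and the value assigned to it are made up for the illustration:

```python
time = 'a quarter to nine'           # an ordinary string stored under the name 'time'
kind_before = type(time).__name__    # the name currently refers to a str

from datetime import time            # the import rebinds the name 'time' to the class

kind_after = type(time).__name__     # the name now refers to a class
print(kind_before, '->', kind_after)
```

The original string is no longer reachable through the name `time` once the import has run.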

It is also possible, **albeit strongly discouraged**, to **import everything from a module at once**. The following statement will make all objects in the `datetime` module—with the exception of any that begin with underscores (`_`), which are considered private—accessible to us:

In [11]:
from datetime import *

It is essential that **module namespaces remain unique** in the context of your programme for this modular approach to work. To work around potential issues, Python allows **aliases** for modules to be defined at import time through the use of the `as` keyword:

In [1]:
import datetime as my_own_name_for_datetime # creates alias for entire 'datetime' module

lincoln_tob = my_own_name_for_datetime.time(8, 44)
print(lincoln_tob)

08:44:00


In [2]:
from datetime import date as gabes_date_module # creates alias for the 'date' class only

lincoln_dob = gabes_date_module(1809, 2, 12)
print(lincoln_dob)

1809-02-12


Lastly, we can use the `dir()` built-in function to see what names are currently defined in our programme:

In [3]:
dir()

['In',
 'Out',
 '_',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i2',
 '_i3',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'exit',
 'gabes_date_module',
 'get_ipython',
 'lincoln_dob',
 'lincoln_tob',
 'my_own_name_for_datetime',
 'open',
 'quit']

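`dir()` can also be called with an object as argument, such as a module or a class, to list the names it defines. Note that the output below lists the attributes of the `datetime` *class*, not those of the `datetime` module: after the earlier `from datetime import *`, the name `datetime` refers to the class of the same name contained in the module: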

In [14]:
dir(datetime)

['__add__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rsub__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 'astimezone',
 'combine',
 'ctime',
 'date',
 'day',
 'dst',
 'fold',
 'fromisocalendar',
 'fromisoformat',
 'fromordinal',
 'fromtimestamp',
 'hour',
 'isocalendar',
 'isoformat',
 'isoweekday',
 'max',
 'microsecond',
 'min',
 'minute',
 'month',
 'now',
 'replace',
 'resolution',
 'second',
 'strftime',
 'strptime',
 'time',
 'timestamp',
 'timetuple',
 'timetz',
 'today',
 'toordinal',
 'tzinfo',
 'tzname',
 'utcfromtimestamp',
 'utcnow',
 'utcoffset',
 'utctimetuple',
 'weekday',
 'year']
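
When exploring a module, it is often handy to hide the names that start with an underscore. A minimal sketch, which imports the module under the hypothetical alias `dt_module` so that the result does not depend on what the name `datetime` currently refers to:

```python
import datetime as dt_module

# keep only the names that do not start with an underscore
public_names = [name for name in dir(dt_module) if not name.startswith('_')]
print(public_names)
```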

# 3. External Modules and Packages

So far, we have only used *built-in* functions and modules, *i.e.* those that are included in the [Python Standard Library](https://docs.python.org/3.11/library/index.html), a collection of utilities providing common functionality that is installed by default alongside the core Python elements. There are, in addition, **hundreds of thousands of Python projects**, from small utility libraries to entire development frameworks, **available as external modules and packages**. 

The official repository for these external resources is the [Python Package Index (**PyPI**)](https://pypi.org/). Before spending time developing your own solution to a problem, it is a good idea to search the **PyPI** (and maybe [**GitHub**](https://github.com/search?l=Python&q=python&type=Repositories) as well), particularly if the problem seems common enough for someone to have had to solve it before.

External packages and modules **are used in the exact same way as built-in ones**, via the `import` statement, but before they can be imported, they **have to be installed in your system**. Before we look at how to install packages, however, it is important to understand the concept of *dependencies*.

## 3.1. Dependencies and Virtual Environments

Most developers tend to take the aforementioned bit of advice to heart: **the vast majority of Python packages rely on functionality from other packages and libraries**. These external pieces of code that a programme requires to work as intended are called **dependencies**. 

Naturally, over the lifetime of a piece of software, **functionality is added, changed, and removed**. This means that **dependencies are often version-dependent**: a programme may not only depend upon a package, but upon *a specific version* of that package, *i.e.* one that contains the specific functionality it requires. 

In addition, one **can only have a single version of a package installed at a time** in any given environment, a fact that imposes some potentially problematic constraints. Let's say that your programme depends on an external module called `package_1`. `package_1`, in turn, depends upon version 1.0 of `package_2`. If you also want to use `package_2` in your own code, you could only use the same version (1.0), because that is the only one that would satisfy the requirements of `package_1`. If the functionality you require is only present in another version, then you are out of luck.
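
Because versions matter, it is often useful to check which version of a package, if any, is installed in the current environment. The sketch below relies on the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` and the package names are assumptions made for the example:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version of a package as a string, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

for name in ['pip', 'a-package-that-does-not-exist']:
    print(name, '->', installed_version(name))
```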

All this creates all kinds of conflicts and makes *dependency management* a key concern when developing projects at any scale. Attempting to manage the dependencies of multiple projects in a single environment, say your laptop, is generally considered <span style="color: red; font-weight: bold">a bad idea</span>. The more than likely result is called *dependency hell*, and is conveniently illustrated in the following graphic:

<a href="https://xkcd.com/1987/" target="_blank"><img src="https://imgs.xkcd.com/comics/python_environment.png" width=492></a>

The best way to avoid conflicts is to isolate dependencies to specific projects, and the most common isolation method is the **virtual environment**. A *virtual environment* is a semi-isolated Python environment where packages are installed and used by a particular project instead of being installed system-wide. 
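
From within Python, one rough way to tell whether the code is running inside a virtual environment is to compare two attributes of the standard `sys` module. This is only a sketch: it works for environments created with `venv`, and other tools (such as *conda*) may not be detected this way. The variable name is made up for the example:

```python
import sys

# inside a virtual environment created with venv, sys.prefix points at the
# environment, while sys.base_prefix still points at the base installation
in_virtual_environment = sys.prefix != sys.base_prefix

print('Interpreter in use:', sys.executable)
print('Running inside a virtual environment:', in_virtual_environment)
```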

## 3.2. Environment and Package Management

There are a number of ways in which one can manage virtual environments and packages. The choice will largely be determined by the manner in which Python was installed in your system. In all cases, however, these tasks will be accomplished through the use of highly specialized programmes called *virtual environment managers* and *package managers*, respectively.

While these tools make both processes largely automated, it should be noted that, occasionally, a package will have special requirements that fall outside the purview of package managers, such as some atypical hardware configuration, or a special, OS-dependent low-level library. In those cases manual intervention may be required. The landing pages for most packages in **PyPI** and **GitHub** will include special installation instructions, as well as links to the relevant documentation, when necessary. 

### Command Line

If you installed Python directly from the command line, the preferred programmes are `venv`, Python's module for [creating virtual environments](https://docs.python.org/3/library/venv.html) and `pip`, the official Python [package manager](https://packaging.python.org/en/latest/guides/tool-recommendations/). Both are used directly from the command line by invoking their respective keywords.

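`venv` is run as a module through the Python interpreter, and the environment it creates is activated with the shell's native `source` command:

```bash
python -m venv [/path/environment_name] # create a new virtual environment
source [/path/environment_name]/bin/activate # activate a virtual environment for use (macOS/Linux)
# on Windows, run [\path\environment_name]\Scripts\activate instead of using source
```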

`pip` must be followed by an action keyword and then whatever parameters are pertinent to it. The keyword to install packages is `install`:

```bash
pip install [package name] # installs the latest version
pip install "[package name]==1.0.4" # installs version 1.0.4
pip install "[package name]>=1.0.4" # installs at least version 1.0.4 (the quotes stop the shell from treating > as a redirect)
pip install "[package name]~=1.0.4" # installs a version compatible with 1.0.4
pip install --upgrade [package name] # upgrades package to the latest version
```

`install` will download and set up a package and all its dependencies. You can read more about version specifiers [here](https://peps.python.org/pep-0440/#compatible-release). Package upgrades **must be requested explicitly** using the `--upgrade` flag, since attempting to install an already present module will have no effect.

The keyword `list` can be used to see which packages are installed in the current environment (or current system, if not using virtual environments):

```bash
pip list # lists all packages installed and their versions
pip list --outdated # lists all packages with newer versions available
```

By default, `pip` only installs packages included in the **PyPI**, but it can also use other sources if configured to do so. See the [full documentation](https://pip.pypa.io/en/stable/) for details.

### Anaconda 

If you installed Python using the [Anaconda Distribution](https://www.anaconda.com/products/distribution), both tasks can be accomplished through the [Navigator](https://docs.anaconda.com/navigator/getting-started/), a graphical user interface for *conda*, Anaconda's own powerful [environment manager](https://docs.anaconda.com/navigator/tutorials/manage-environments/) and [package manager](https://docs.anaconda.com/navigator/tutorials/manage-packages/).

<div class="alert alert-block alert-info">
<b>Note:</b> If you are so inclined, <i>conda</i> can also be used directly from the command line. See the <a href="https://conda.io/projects/conda/en/latest/user-guide/getting-started.html" target="_blank">documentation</a> for the details.
</div>

# 4. External Data

Our experience with Python so far has been very self-contained: we've been working entirely within a sandbox. Naturally, in real life you would want to work with external sources of data. In this section we'll learn how to exchange information via your file system.

## 4.1. Reading and Writing Files

The key built-in function for creating, writing, and reading files is called `open()`. In its simplest form it takes two arguments: a `filename` and an *optional* string indicating which `access mode` to use. The latter determines the type of file and the type of operations that are possible with it. The six basic mode designators are each denoted by a single letter, as follows:

| Code | Value | Description |
|:----:|:------|:------------|
|`t`|text mode|**Default** - Handles file as text, each line is terminated with a special character called *End of Line* (EOL), by default the new line character (`\n`).|
|`b`|binary mode|Handles file as binary, there is no line terminator and the file is stored as raw binary data (*e.g.* images)|
|`r`|read|**Default** - Opens a file for reading, throws an **error if the file does not exist**|
|`a`|append|Opens a file for appending, **creates the file if it does not exist**|
|`w`|write|Opens a file for writing, **creates the file if it does not exist** and **erases its contents if it does**|
|`x`|create|Creates the specified file, raises an **error if the file exists**|

The *text* and *binary* mode designators can be combined with the other modes into a single string. Files are assumed to be *text* by default, so in principle this is only necessary if we want to work with *binary* files.
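
To make the combinations concrete, here is a minimal sketch that writes a short string in text mode and reads it back in both binary and text mode. The file name, its contents, and the use of the standard library's `tempfile` module to get a scratch folder are assumptions of this example; the `write()`, `read()`, and `close()` methods are covered later in this section:

```python
import os
import tempfile

scratch_folder = tempfile.mkdtemp()                      # temporary folder for the demo
sample_path = os.path.join(scratch_folder, 'sample.txt')

text_file = open(sample_path, 'wt', encoding='utf-8')    # write + text
text_file.write('café')
text_file.close()

binary_file = open(sample_path, 'rb')                    # read + binary
raw_bytes = binary_file.read()
binary_file.close()

text_file = open(sample_path, 'rt', encoding='utf-8')    # read + text
decoded_text = text_file.read()
text_file.close()

print(raw_bytes)       # the raw bytes stored on disk
print(decoded_text)    # the same data decoded as text

os.remove(sample_path)                                   # tidy up
os.rmdir(scratch_folder)
```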

Let's start by opening a file and assigning it to a variable:

In [6]:
jabberwocky = open('carroll.txt', 'r') # the 'r' here is optional, as read mode is the default
print(jabberwocky) # once open, let's print the file

<_io.TextIOWrapper name='carroll.txt' mode='r' encoding='UTF-8'>


As you can see from the printed result, the `open()` function returns a **file object**, not the actual contents of the file. 

Following the OOP paradigm, we must use *methods* to interact with the *file object's* attributes (*i.e.* the data contained in the file). In the specific case of text files, *file objects* include a built-in iterator as a convenience, meaning that we can use a loop to read the contents line by line:

In [7]:
for line in jabberwocky:
    print(line)

’Twas brillig, and the slithy toves

Did gyre and gimble in the wabe:

All mimsy were the borogoves,

And the mome raths outgrabe.

“Beware the Jabberwock, my son!

The jaws that bite, the claws that catch!

Beware the Jubjub bird, and shun

The frumious Bandersnatch!”

He took his vorpal sword in hand;

Long time the manxome foe he sought—

So rested he by the Tumtum tree

And stood awhile in thought.
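
The blank lines between verses in the output above are not in the file: each line returned by the iterator keeps its trailing newline character, and `print()` adds one of its own. A minimal sketch of one way around this, using a small temporary file (the file name and contents are made up for the example) rather than `carroll.txt`:

```python
import os
import tempfile

scratch_folder = tempfile.mkdtemp()
poem_path = os.path.join(scratch_folder, 'poem.txt')

poem = open(poem_path, 'w')
poem.write('first line\nsecond line\n')
poem.close()

clean_lines = []
poem = open(poem_path)
for line in poem:
    clean_lines.append(line.rstrip('\n'))   # drop the trailing newline
poem.close()

for line in clean_lines:
    print(line)                             # no extra blank lines this time

os.remove(poem_path)                        # tidy up
os.rmdir(scratch_folder)
```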


The iterator reads the file lazily, one line at a time. The related `readlines()` method instead reads the whole file at once, splits its contents at the newline characters (which are kept at the end of each item), and returns them as a **list**. Let's open another file and try `readlines()`:

In [9]:
epigram = open('martial.txt') # access mode will be the default `rt`
all_lines = epigram.readlines()
print(type(all_lines))
print(all_lines)

<class 'list'>
['Miraris veteres, Vacerra, solos\n', 'nec laudas nisi mortuos poetas.\n', 'Ignoscas petimus, Vacerra: tanti\n', 'non est, ut placeam tibi, perire.']


Another method, `readline()` (note the singular), also works line by line, but instead of a list it returns **a single line per call**, starting with the first:

In [11]:
richard_iii = open('shakespeare.txt', 'r')
print(richard_iii.readline()) # reads a line from the file

Now is the winter of our discontent



Yet another method, `read()`, allows us to treat the contents of a file **as a single block of data**. Unlike the previous methods, it works with both *text* and *binary* files. 

`read()` takes a **single numeric argument**, `size`, which indicates **how much of that block to read**. If it is omitted (or negative), the entire contents of the file will be read and returned. If supplied, `size` is interpreted depending on the mode the file was opened in: as **characters in text mode**, or as **bytes in binary mode**:

In [12]:
alphabet = open('alphabet.txt', 'r')
print(alphabet.read(5)) # reads 5 characters from the file

abcde


It is important to keep in mind that these are **methods**, *i.e.* functions operating on the data stored in the file object, hence their return values are determined by their internal logic, which may not line up with our expectations. For example, you could quite rightly expect that running the above commands again would return the same values, however:

In [13]:
print(richard_iii.readline()) # reads a line from the file
print(alphabet.read(5)) # reads 5 characters from the file

Made glorious summer by this sun of York;

fghij


*File objects* keep track of something we will call the **file handle** (more formally, the *stream position*), which works similarly to the *cursor* in a document: it **moves position as operations are performed on the contents** and, consequently, it also determines where most methods operate (*i.e.* where the *handle* is currently located). The position of the *handle* is set to `0` when the file is first opened, except in append mode, where it starts at the end of the file.

Each call to `readline()` moves the *handle* **past the line it has just read**, so the next call starts from the following line. Each call to `read()` moves the *handle* forward by **the number of characters or bytes read** (depending on the file mode). 

Let's re-open the `alphabet.txt` file and read a few times from it to illustrate the point:

In [14]:
alphabet_2 = open('alphabet.txt', 'r')
print(alphabet_2.read(5)) # read 5 characters (positions 0 to 4) - handle ends at position 5
print(alphabet_2.read(10)) # read 10 characters - starts at 5, handle ends at position 15

abcde
fghijklmno
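
File objects also have a `tell()` method that reports where the *handle* currently is. The sketch below uses binary mode, where positions are plain byte counts, and a temporary stand-in for `alphabet.txt` (assumed here to contain just the 26 lowercase letters):

```python
import os
import tempfile

scratch_folder = tempfile.mkdtemp()
alphabet_path = os.path.join(scratch_folder, 'alphabet.txt')

alphabet_out = open(alphabet_path, 'wb')
alphabet_out.write(b'abcdefghijklmnopqrstuvwxyz')
alphabet_out.close()

alphabet_in = open(alphabet_path, 'rb')
position_at_start = alphabet_in.tell()      # handle starts at 0
first_chunk = alphabet_in.read(5)           # reads positions 0 to 4
position_after_first = alphabet_in.tell()   # handle is now at 5
second_chunk = alphabet_in.read(10)         # reads positions 5 to 14
position_after_second = alphabet_in.tell()  # handle is now at 15
alphabet_in.close()

print(position_at_start, first_chunk, position_after_first)
print(second_chunk, position_after_second)

os.remove(alphabet_path)                    # tidy up
os.rmdir(scratch_folder)
```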


A method called `seek()` allows us to re-position the *handle*. It uses two arguments to compute the new position: `offset` indicates how many characters or bytes to move, and `whence` determines the reference point, *i.e.* *from where* to start moving. Only three values are valid for `whence`: `0` (the beginning of the file), `1` (the current position of the *handle*), and `2` (the end of the file). Of those, `1` and `2` only work with *binary* files, or if `offset` is set to `0`:

In [15]:
alphabet_2.seek(0) # move handle to the beginning of the file
print(alphabet_2.read(5))

alphabet_2.seek(0, 2) # move handle to the end of the file
print(alphabet_2.read(5))

alphabet_2.seek(10) # move handle 10 characters ahead from the beginning of the file
print(alphabet_2.read(5))

abcde

klmno
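
In binary mode, all three reference points accept non-zero offsets. The following sketch uses the named constants `os.SEEK_SET`, `os.SEEK_CUR`, and `os.SEEK_END` (equal to `0`, `1`, and `2`) on a temporary stand-in for `alphabet.txt`, assumed to contain just the 26 lowercase letters:

```python
import os
import tempfile

scratch_folder = tempfile.mkdtemp()
alphabet_path = os.path.join(scratch_folder, 'alphabet.txt')

letters = open(alphabet_path, 'wb')
letters.write(b'abcdefghijklmnopqrstuvwxyz')
letters.close()

letters = open(alphabet_path, 'rb')

letters.seek(-5, os.SEEK_END)    # 5 bytes before the end of the file
last_five = letters.read()

letters.seek(2, os.SEEK_SET)     # 2 bytes from the beginning
three_from_start = letters.read(3)

letters.seek(4, os.SEEK_CUR)     # skip 4 bytes from the current position
three_after_skip = letters.read(3)

letters.close()

print(last_five, three_from_start, three_after_skip)

os.remove(alphabet_path)         # tidy up
os.rmdir(scratch_folder)
```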


**Writing to a file** works in a similar manner: **we need to open or create it** before we can write anything to it. For this we use `open()` as well, and indicate our intent by using a **writeable access mode**:

- `a` will open an existing file and append content to it, *i.e.* it will add whatever we write to the end of the file.
- `w` will overwrite a file if it already exists, or create it if it doesn't. 
- `x` provides a way to ensure we won't accidentally overwrite something: it attempts to create a file, but raises an error if it already exists.

Some examples:

```python
i_can_improve_shakespeare = open('shakespeare.txt', 'a') # will open shakespeare.txt and append at the end
i_am_better_than_shakespeare = open('shakespeare.txt', 'w') # will overwrite shakespeare.txt
my_file_1 = open('my_text_1.txt', 'x') # will create my_text_1.txt or return an error if it exists
my_file_2 = open('my_text_2.txt', 'w') # will overwrite my_text_2.txt if it exists or create it if not
```

As for the writing itself, the relevant methods are `write()` and `writelines()`. Let's start by creating some content to write to a file:

<div class="alert alert-block alert-info">
<b>Note:</b> Since we're going to be writing files to the hard drive, starting here, <b>you'll have to run the cells</b> to see the results and follow along.
</div>

In [17]:
advice_bit = 'Take time to know yourself.'

advice_list = [
    advice_bit,
    'A narrow focus brings big results.',
    'Show up fully.',
    'Do not make assumptions.',
    'Be patient and persistent.',
]

More stupid advice [here](https://www.inc.com/lolly-daskal/25-excellent-pieces-of-advice-that-most-people-ignore.html), if you're interested (and, yes, I'm only adding this because I can't bring myself to not include the source). 

`write()` takes a single argument, a string in *text* mode or a bytes object in *binary* mode, and **writes it to the file without modification**. 

`writelines()` takes an *iterable* as argument, such as a list of strings, and **writes each item to the file in sequence**.

Let's go ahead and create a couple of files:

In [18]:
# let's use 'write()' first:
great_advice_1 = open('great_advice_1.txt', 'w') # open the file
great_advice_1.write(advice_bit) # write the contents of 'advice_bit' to it

# next let's use 'writelines()':
great_advice_2 = open('great_advice_2.txt', 'w') # open a new file
great_advice_2.writelines(advice_list) # write the contents of 'advice_list' to it

**Have a look at the files**: they are probably empty. 

<div class="alert alert-block alert-info">
<b>Note:</b> The files are in the same folder as the notebook. <b>Don't know where the notebook is?</b> No problem, in Jupyter you can use the <a href="https://ipython.readthedocs.io/en/stable/interactive/magics.html" target="_blank">magic command</a> <code>%pwd</code> to find out. Just run the next cell!
</div>

In [None]:
%pwd

The reason the files are empty is that Python uses buffering to interact with the file system, so changes may not show until a **file is closed**.
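
If a file needs to stay open, its buffer can also be emptied on demand with the file object's `flush()` method. A minimal sketch, using a temporary folder and a made-up file name:

```python
import os
import tempfile

scratch_folder = tempfile.mkdtemp()
note_path = os.path.join(scratch_folder, 'note.txt')

note = open(note_path, 'w')
note.write('Take time to know yourself.')
note.flush()                                    # push the buffered text to the operating system
size_after_flush = os.path.getsize(note_path)   # size on disk, in bytes
still_open = not note.closed                    # flushing does not close the file
note.close()

print(size_after_flush, still_open)

os.remove(note_path)                            # tidy up
os.rmdir(scratch_folder)
```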

<div class="alert alert-block alert-danger">
<b>Warning:</b> In addition to the problem described above, and many others, forgetting to close files also means that whatever resources they are using, such as memory, will continue to be tied up and thus unavailable for other purposes.
</div>

To **close a file we use the** `close()` method. It is also possible to **check whether a file is open**: *file objects* have a boolean attribute called `closed` that will be `False` if the file is open.

Let's check and close the files we've opened so far:

In [19]:
print(jabberwocky.closed)
jabberwocky.close()

print(epigram.closed)
epigram.close()

print(richard_iii.closed)
richard_iii.close()

print(alphabet.closed)
alphabet.close()

# if multiple instances of the same file are open, 
# each has to be closed individually
print(alphabet_2.closed)
alphabet_2.close()

# last, let's do the files we just created
print(great_advice_1.closed)
great_advice_1.close()

print(great_advice_2.closed)
great_advice_2.close()

False
False
False
False
False
False
False


**Have another look at the files we created earlier**: the text we wrote to them should now be there.

<div class="alert alert-block alert-info">
<b>Note:</b> If you look carefully at <code>great_advice_2.txt</code>, you'll notice that the lines are not in fact lines (<i>i.e.</i> there are no new line characters). This is because <code>writelines()</code> does not add them automatically.
</div>
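
One possible way to get proper lines out of `writelines()` is to append the newline character to each item as it is passed in. A minimal sketch, using a shortened copy of the advice list and a temporary file (the file name is made up for the example):

```python
import os
import tempfile

advice_list = [
    'Show up fully.',
    'Do not make assumptions.',
    'Be patient and persistent.',
]

scratch_folder = tempfile.mkdtemp()
advice_path = os.path.join(scratch_folder, 'advice.txt')

advice_file = open(advice_path, 'w')
advice_file.writelines(f'{line}\n' for line in advice_list)   # add the newline ourselves
advice_file.close()

advice_file = open(advice_path)
lines_read_back = advice_file.readlines()
advice_file.close()

print(lines_read_back)

os.remove(advice_path)    # tidy up
os.rmdir(scratch_folder)
```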

Forgetting to close a file is such a common occurrence that **this way of working with files is generally discouraged**. A much **better approach** is to take advantage of the `with` statement, which is used to [wrap the execution of a block of code](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement). The general syntax looks like this:

```python
with [EXPRESSION] as [TARGET]:
    [CODE BLOCK]
```

For working with files, the syntax would look like this:

```python
with open([filename], [access mode]) as [variable]:
    [CODE BLOCK]
```

The key advantage of this approach is that **the file is automatically closed after** `[CODE BLOCK]` **finishes executing**, even if an exception is raised at some point:

In [20]:
with open('great_advice_2.txt', 'w') as great_advice_2: # now we can access the file object in the indented code
    # here's a way to add line breaks to the file:
    for line in advice_list:
        great_advice_2.write(f'{line}\n')

print(great_advice_2.closed) # let's verify that the file was automatically closed

True
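
The same holds when something goes wrong inside the block. In the sketch below an error is raised on purpose halfway through; the file name and the simulated failure are assumptions made for the example:

```python
import os
import tempfile

scratch_folder = tempfile.mkdtemp()
draft_path = os.path.join(scratch_folder, 'draft.txt')

try:
    with open(draft_path, 'w') as draft:
        draft.write('Show up fully.')
        raise ValueError('something went wrong mid-write')   # simulated failure
except ValueError as error:
    print('Caught:', error)

closed_after_error = draft.closed    # the file was closed despite the error
print(closed_after_error)

with open(draft_path) as draft:      # what was written before the error is on disk
    saved_text = draft.read()
print(saved_text)

os.remove(draft_path)                # tidy up
os.rmdir(scratch_folder)
```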


## 4.2. Working with Files and Directories

The examples in the previous section relied on **files present in the same directory as the notebook**, but what if that is not the case? The `filename` argument supplied to the `open()` function can **include a *path* to help locate the file**. That, however, is a bit more complicated than it sounds.

Let's assume we want to open a file called `martial.txt` which is located in a folder called `samples` located in the same folder as the notebook. The **format of the path to that file will depend on the operating system your computer is running**: on Unix-like systems, such as macOS and Linux, it is `./samples/martial.txt`, whereas on Windows it would look something like `samples\martial.txt`.

Let's try both:

In [21]:
filepaths = [
    {'filename': './samples/martial.txt', 'type': 'Unix'},
    {'filename': r'samples\martial.txt', 'type': 'Windows'}
]

for fp in filepaths:
    try:
        with open(fp['filename'], 'r') as epigram:
            for line in epigram:
                print(line)
    except OSError: # raised when the file cannot be opened
        print(f"\nOops, '{fp['filename']}' didn't work because your computer is not running {fp['type']}.\n")


Miraris veteres, Vacerra, solos

nec laudas nisi mortuos poetas.

Ignoscas petimus, Vacerra: tanti

non est, ut placeam tibi, perire.

Oops, 'samples\martial.txt' didn't work because your computer is not running Windows.



Writing paths by hand, while okay for your own short-term use, is **not a good idea if you intend to share your code or use it across multiple platforms**. 

Thankfully, the built-in `os` module **provides the tools to specify paths in a cross-platform way**. In order to do that for the case outlined above, we'll need:

- the `curdir` attribute, which **holds the string used to refer to the current directory** (*i.e.* the one where the code is being executed, usually `.`), and 
- the `join()` function of the `os.path` submodule, which **takes individual path elements** as arguments and **returns a path string built with the separator appropriate for the operating system running the code**.

Let's go ahead and do it:

In [22]:
import os

filepath = os.path.join(os.curdir, 'samples', 'martial.txt')
print(filepath)

./samples/martial.txt
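
The `os.path` submodule can also take a path apart again. A minimal sketch using `os.path.split()` and `os.path.splitext()` on the same path (the variable names are made up for the example; the file does not need to exist, since these functions only manipulate the string):

```python
import os

filepath = os.path.join(os.curdir, 'samples', 'martial.txt')

folder, filename = os.path.split(filepath)      # separate the last element from the rest
stem, extension = os.path.splitext(filename)    # separate the extension from the name

print(folder)
print(filename)
print(stem, extension)
```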


We can test whether the path is valid using the `exists()` function (returns `True` if the path points to an existing file or directory):

In [23]:
print(os.path.exists(filepath))

True


Now we can open the file:

In [24]:
with open(filepath, 'r') as epigram:
    for line in epigram:
        print(line)

Miraris veteres, Vacerra, solos

nec laudas nisi mortuos poetas.

Ignoscas petimus, Vacerra: tanti

non est, ut placeam tibi, perire.


The `os` module provides **many other functions**. Here are a few particularly useful examples:

In [37]:
path_to_samples = os.path.join(os.curdir, 'samples') # path to the 'samples' folder

# the os.listdir() function returns a list of all files and folders in a location
print(os.listdir(path_to_samples)) # list all files and folders in the 'samples' folder

['death_causes.json', 'death_causes.csv', '.DS_Store', 'martial.txt']


In [38]:
print(os.listdir()) # if no location is provided, it defaults to the current directory

['.DS_Store', 'carroll.txt', 'samples', 'Unit 3 - Modules, Functions, and External Data.ipynb', '.ipynb_checkpoints', 'great_advice_1.txt', 'martial.txt', 'great_advice_2.txt', 'alphabet.txt', 'shakespeare.txt']


In [28]:
# the items in the list are just strings so they can be easily filtered
# print only text files
for item in os.listdir():
    if item.endswith('.txt'):
        print(item)

carroll.txt
great_advice_1.txt
martial.txt
great_advice_2.txt
alphabet.txt
shakespeare.txt


In [29]:
# print only files, not directories
for item in os.listdir():
    item_path = os.path.join(os.curdir, item) # we need to build the path, 'item' is just a string
    if os.path.isfile(item_path): # isfile() tests whether a path is a file [fyi: isdir() is its counterpart]
        print(item)

.DS_Store
carroll.txt
Unit 3 - Modules, Functions, and External Data.ipynb
great_advice_1.txt
martial.txt
great_advice_2.txt
alphabet.txt
shakespeare.txt


In [30]:
# os.scandir() is similar to os.listdir() but returns an 
# iterator of os.DirEntry objects instead of a list of strings

# same as above - print only files but with scandir()
for item in os.scandir():
    if os.path.isfile(item): # 'item' is path-like so we can test it directly
        print(item.name) # the 'name' attribute returns the name of the entry

.DS_Store
carroll.txt
Unit 3 - Modules, Functions, and External Data.ipynb
great_advice_1.txt
martial.txt
great_advice_2.txt
alphabet.txt
shakespeare.txt


<div class="alert alert-block alert-success">
<b>Exercise:</b> Print a list of all <b>files</b> in the current directory, <b>sorted by their size</b>.
</div>

In [None]:
# Tip: the function os.path.getsize(path) will return the size of a file in bytes
# your code goes here:



While running the next few cells, you'll want to **keep a window open to the current directory** to watch what happens:

In [31]:
# create a new folder called 'test' in the current directory
new_folder = os.path.join(os.curdir, 'test') # create the path
os.mkdir(new_folder) # create the folder

In [32]:
# move the 'martial.txt' file in 'samples' to 'test'
# we stored the path to the file earlier in 'filepath'
target_file = os.path.join(new_folder, 'martial.txt') # we create the destination path
os.rename(filepath, target_file) # and move it

In [33]:
# Let's move it back and delete the directory we created
os.rename(target_file, filepath)
os.rmdir(new_folder)
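
Two details are worth keeping in mind when running cells like these more than once: `os.mkdir()` raises an error if the folder already exists, and `os.rmdir()` only removes *empty* folders. The sketch below repeats the same steps in a safer form; it works inside a temporary folder so that nothing in the notebook's directory is touched, and the file and folder names are made up for the example:

```python
import os
import tempfile

base_folder = tempfile.mkdtemp()                   # scratch area for the demo
source_file = os.path.join(base_folder, 'note.txt')
with open(source_file, 'w') as note:
    note.write('hello')

new_folder = os.path.join(base_folder, 'test')
if not os.path.exists(new_folder):                 # avoid an error if it is already there
    os.mkdir(new_folder)

target_file = os.path.join(new_folder, 'note.txt')
os.rename(source_file, target_file)                # move the file into the new folder
moved = os.path.exists(target_file) and not os.path.exists(source_file)

os.remove(target_file)                             # the folder must be empty before rmdir()
os.rmdir(new_folder)
os.rmdir(base_folder)
cleaned_up = not os.path.exists(base_folder)

print(moved, cleaned_up)
```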

## 4.3. Using Common Exchange Formats

Python's *Standard Library* includes modules designed to facilitate working with many file types. In this section, we'll go over the basic aspects of working with the two most common **data exchange formats**: [Comma-separated values (CSV)](https://datatracker.ietf.org/doc/html/rfc4180) and [JavaScript Object Notation (JSON)](https://www.json.org/json-en.html).

### CSV

The *CSV* format is used to store tabular data (*i.e.* data arranged in rows and columns, like in a spreadsheet). The following examples will use some familiar csv-formatted data (left column, rendered as a table on the right).

<div style="margin-top: 20px"><div style="width: 49%; float: left;">

```csv
State,Quarter ending,Population,Deaths,Cause
AZ,6/30/1907,"122,931",1,diphtheria
AZ,6/30/1907,"122,931",4,enteric fever
AZ,6/30/1907,"122,931",11,scarlet fever
AZ,6/30/1907,"122,931",2,smallpox
CA,6/30/1907,"2,054,000",19,diphtheria
CA,6/30/1907,"2,054,000",18,enteric fever
CA,6/30/1907,"2,054,000",1,scarlet fever
CA,6/30/1907,"2,054,000",0,smallpox
```

</div><div style="width: 49%; float: right;">

State | Quarter ending | Population | Deaths | Cause 
:----:|:--------------:|-----------:|-------:|:------
AZ | 6/30/1907 | 122,931 | 1 | diphtheria
AZ | 6/30/1907 | 122,931 | 4 | enteric fever
AZ | 6/30/1907 | 122,931 | 11 | scarlet fever
AZ | 6/30/1907 | 122,931 | 2 | smallpox
CA | 6/30/1907 | 2,054,000 | 19 | diphtheria
CA | 6/30/1907 | 2,054,000 | 18 | enteric fever
CA | 6/30/1907 | 2,054,000 | 1 | scarlet fever
CA | 6/30/1907 | 2,054,000 | 0 | smallpox

</div></div>

The data has been saved to a file called `death_causes.csv` located in the `samples` folder.

The built-in [`csv` module](https://docs.python.org/3/library/csv.html) includes a function called `reader()` that will **take a file object** and **return an iterator over its rows**. The iterator can only be traversed once, and only while the file is open, but it can be easily **converted to a list** using the `list()` constructor:

In [34]:
import os, csv # import os and csv modules

csv_file_in = os.path.join(os.curdir, 'samples', 'death_causes.csv') # define the file

with open(csv_file_in, 'r', newline='') as file: # load it (newline='' is recommended for csv files)
    csv_content = csv.reader(file) # and provide the file object to csv.reader()
    rows_from_reader = list(csv_content) # convert the reader() object to a list

for row in rows_from_reader:
    print(row)

['State', 'Quarter ending', 'Population', 'Deaths', 'Cause']
['AZ', '6/30/1907', '122,931', '1', 'diphtheria']
['AZ', '6/30/1907', '122,931', '4', 'enteric fever']
['AZ', '6/30/1907', '122,931', '11', 'scarlet fever']
['AZ', '6/30/1907', '122,931', '2', 'smallpox']
['CA', '6/30/1907', '2,054,000', '19', 'diphtheria']
['CA', '6/30/1907', '2,054,000', '18', 'enteric fever']
['CA', '6/30/1907', '2,054,000', '1', 'scarlet fever']
['CA', '6/30/1907', '2,054,000', '0', 'smallpox']


As we can see from the printout, `rows_from_reader` is a *list of lists*: **rows in the csv file have been converted to lists in which the order of the columns is preserved**. Every field is read as a string, including the numbers. The column *headers*, if the csv file has them, are contained in the first list, but are otherwise treated as a regular row.
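Because the header row is treated like any other, it is often convenient to split it from the data. The following is a minimal sketch, assuming a made-up in-memory sample (`sample_csv` is hypothetical and stands in for a real file) and using the built-in `next()` function to consume the first row:

```python
import csv
import io

# hypothetical in-memory sample standing in for a real csv file
sample_csv = 'State,Deaths,Cause\nAZ,1,diphtheria\nAZ,4,enteric fever\n'

with io.StringIO(sample_csv) as file:
    csv_content = csv.reader(file)
    headers = next(csv_content)    # consume the first row (the headers)
    data_rows = list(csv_content)  # the remaining rows are the data

print(headers)    # ['State', 'Deaths', 'Cause']
print(data_rows)  # [['AZ', '1', 'diphtheria'], ['AZ', '4', 'enteric fever']]
```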

The `csv` module provides an **alternative** called `DictReader()` that converts **rows to dictionaries** instead:

In [35]:
with open(csv_file_in, 'r', newline='') as file:
    csv_content = csv.DictReader(file) # this time we use csv.DictReader()
    rows_from_dictreader = list(csv_content)
    
for row in rows_from_dictreader:
    print(row)

{'State': 'AZ', 'Quarter ending': '6/30/1907', 'Population': '122,931', 'Deaths': '1', 'Cause': 'diphtheria'}
{'State': 'AZ', 'Quarter ending': '6/30/1907', 'Population': '122,931', 'Deaths': '4', 'Cause': 'enteric fever'}
{'State': 'AZ', 'Quarter ending': '6/30/1907', 'Population': '122,931', 'Deaths': '11', 'Cause': 'scarlet fever'}
{'State': 'AZ', 'Quarter ending': '6/30/1907', 'Population': '122,931', 'Deaths': '2', 'Cause': 'smallpox'}
{'State': 'CA', 'Quarter ending': '6/30/1907', 'Population': '2,054,000', 'Deaths': '19', 'Cause': 'diphtheria'}
{'State': 'CA', 'Quarter ending': '6/30/1907', 'Population': '2,054,000', 'Deaths': '18', 'Cause': 'enteric fever'}
{'State': 'CA', 'Quarter ending': '6/30/1907', 'Population': '2,054,000', 'Deaths': '1', 'Cause': 'scarlet fever'}
{'State': 'CA', 'Quarter ending': '6/30/1907', 'Population': '2,054,000', 'Deaths': '0', 'Cause': 'smallpox'}


`DictReader()` **converts the rows in the csv file into a dictionary where the column *headers* are used as keys**. The first row of the csv file is **assumed to contain *headers*** by default, but the keyword argument `fieldnames` can be used to **pass them in a list** if necessary.

If a row has **more fields than there are *headers***, the remaining data is put in a list under the key name supplied by the `restkey` keyword argument (or the default `None`). Similarly, if a non-blank row has **fewer fields than there are *headers***, the missing values are filled in with the value supplied by the `restval` keyword argument (or the default `None`).
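The behaviour of `restkey` and `restval` can be sketched with a small in-memory sample. The `sample_csv` string below is made up for illustration: its second row has one field too many and its third row has one too few.

```python
import csv
import io

# hypothetical sample: the second row has an extra field, the third is missing one
sample_csv = 'a,b,c\n1,2,3,4\n5,6\n'

with io.StringIO(sample_csv) as file:
    rows = list(csv.DictReader(file, restkey='extra', restval='N/A'))

print(rows[0])  # {'a': '1', 'b': '2', 'c': '3', 'extra': ['4']}
print(rows[1])  # {'a': '5', 'b': '6', 'c': 'N/A'}
```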

Both `reader()` and `DictReader()` also take a number of other keyword arguments that provide [information about the format of the file](https://docs.python.org/3/library/csv.html#csv-fmt-params). While rarely necessary, these can be used to fine-tune the conversion process. Here's a list of the most useful:

Keyword Arg | Description
:------|:------
`dialect` | String indicating one of several predefined settings, *e.g.* `excel` or `excel-tab`. The full list can be obtained by running the `csv.list_dialects()` function.
`delimiter` | A one-character string indicating what character is used to separate fields, *e.g.* `\t` for tabs. Defaults to `,`.
`quotechar` | A one-character string indicating which character is used to quote fields containing special characters, such as the delimiter or the new-line marker. Defaults to `"`.

A few examples:

```python
# assume that the file is in Excel tab-separated format
csv.reader(file_object, dialect='excel-tab')

# assume tab-separated format with pipes enclosing special fields
csv.reader(file_object, delimiter='\t', quotechar='|')

# assume tab-separated format, store data without headers under the `extra` key, fill blanks with 'N/A'
csv.DictReader(file_object, delimiter='\t', restkey='extra', restval='N/A')

# assume first row is data and use supplied headers, fill blanks with 'DUCK'
csv.DictReader(file_object, fieldnames=['column_1', 'column_2', 'column_3'], restval='DUCK')
```
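To see the formatting arguments in action, here is a minimal runnable sketch. It assumes a made-up tab-separated sample (`sample_tsv` is hypothetical) in which pipes quote a field that itself contains a tab:

```python
import csv
import io

# hypothetical tab-separated sample; the pipes quote a field containing a tab
sample_tsv = 'name\tnote\nAZ\t|two\twords|\n'

with io.StringIO(sample_tsv) as file:
    rows = list(csv.reader(file, delimiter='\t', quotechar='|'))

print(rows)  # [['name', 'note'], ['AZ', 'two\twords']]

# the predefined dialects available on this installation
print(csv.list_dialects())  # e.g. ['excel', 'excel-tab', 'unix']
```

Without `quotechar='|'`, the tab inside the quoted field would be read as a delimiter and the row would be split into three fields.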

The `csv` module provides two counterparts for writing data, `writer()` and `DictWriter()`, that **have to be instantiated before they can be used**. The syntax mirrors that of the *reader objects* discussed above: they take a *file object* representing the destination csv file and the **same formatting arguments**, with one exception: the `fieldnames` argument **has to be passed to** `DictWriter()`; it is not optional as it is with `DictReader()`.

Once the *writer object* is instantiated, **individual rows can be written using the** `writerow()` **method**. `writerow()` takes row data **formatted as a list in the case of** `writer()` and **as a dictionary in the case of** `DictWriter()`. The latter supports an additional method called `writeheader()` (no arguments) that will write the supplied `fieldnames` to the file:

In [40]:
csv_file_out = os.path.join(os.curdir, 'samples', 'death_causes_out_1.csv') # destination csv file

with open(csv_file_out, 'w', newline='') as file: # open file in write mode (newline='' avoids extra blank lines on some platforms)
    csv_writer = csv.writer(file, delimiter=',') # instantiate writer object (the delimiter keyword argument is optional)
    # now we write all the rows in the `rows_from_reader` list (which are formatted as lists)
    for row in rows_from_reader:
        csv_writer.writerow(row) # the first will be the headers

Go ahead and check the results in the `samples` folder. The resulting file, `death_causes_out_1.csv`, should contain the same data as the original.

Now let's write the contents of `rows_from_dictreader` to a new file:

In [39]:
csv_file_out = os.path.join(os.curdir, 'samples', 'death_causes_out_2.csv') # new destination csv file

with open(csv_file_out, 'w', newline='') as file: # open file in write mode (newline='' avoids extra blank lines on some platforms)
    headers = ['State', 'Quarter ending', 'Population', 'Deaths', 'Cause'] # let's create a list with our headers
    csv_writer = csv.DictWriter(file, fieldnames=headers) # now we instantiate the writer
    # next we write the headers to the file
    csv_writer.writeheader()
    # then we write the rows (which are formatted as dictionaries)
    for row in rows_from_dictreader:
        csv_writer.writerow(row)

**Check the** `samples` **folder!** `death_causes_out_2.csv` should be identical to the other two csv files.
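The writer objects also provide a `writerows()` method that writes a whole list of rows in one call. The sketch below uses an in-memory `io.StringIO` buffer as a stand-in for a real file (an assumption made so the example leaves no files behind). It also shows that fields containing the delimiter are quoted automatically:

```python
import csv
import io

rows = [['State', 'Population'], ['AZ', '122,931'], ['CA', '2,054,000']]

buffer = io.StringIO(newline='')  # in-memory stand-in for a file opened with newline=''
csv_writer = csv.writer(buffer)
csv_writer.writerows(rows)        # writerows() writes every row in one call

written = buffer.getvalue()
print(written)

# reading the text back recovers the original rows
rows_read_back = list(csv.reader(io.StringIO(written, newline='')))
print(rows_read_back == rows)  # True
```

The populations are enclosed in double quotes in the written text because they contain commas, which would otherwise be read as field separators.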

### JSON

The JSON format can store **serializable data structures** as **key-value pairs** (*e.g.* dictionaries) and **arrays** (*e.g.* lists). We'll continue using the same data, but it has been rearranged into a structure better fitting the JSON format:

```json
[
	{
		"State": "AZ",
		"Population": 122931,
		"Deaths per quarter": [
			{
				"Quater ending": "6/30/1907",
				"Deaths": {
					"diphtheria": 1,
					"enteric fever": 4,
					"scarlet fever": 11,
					"smallpox": 2
				}
			}
		]
	},
	{
		"State": "CA",
		"Population": 2054000,
		"Deaths per quarter": [
			{
				"Quater ending": "6/30/1907",
				"Deaths": {
					"diphtheria": 19,
					"enteric fever": 18,
					"scarlet fever": 1,
					"smallpox": 0
				}
			}
		]
	}
]
```

The data is saved in a file called `death_causes.json` located in the `samples` folder. 

The built-in [`json` module](https://docs.python.org/3/library/json.html) includes two functions that will **take JSON-formatted content and deserialize it into a Python data structure**: `load()` and `loads()`. The latter is not a plural: the `s` stands for *string*. `json.load()` takes a file object as input, while `json.loads()` takes a string.

Let's try loading the data from the file first:

In [41]:
import json # load the module

json_file_in = os.path.join(os.curdir, 'samples', 'death_causes.json') # define the file

with open(json_file_in, 'r') as file: # load it
    json_content_from_file = json.load(file)

print(type(json_content_from_file))
print(len(json_content_from_file))
for item in json_content_from_file: # iterate over the items directly
    print(type(item))

<class 'list'>
2
<class 'dict'>
<class 'dict'>


As you can see from the printed output, **the resulting object is a list containing two dictionaries**, just as we would expect. Converting between Python and JavaScript/JSON objects is fairly straightforward: JSON data types are mapped to Python objects as follows:

JSON | Python
:-----|:----
object | `dict`
array | `list`
string | `str`
number (int) | `int`
number (real) | `float`
true | `True`
false | `False`
null | `None`
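The mapping in the table can be checked with a short sketch. The `sample_json` string is made up for illustration and touches most of the types listed:

```python
import json

# hypothetical JSON document touching the types in the table above
sample_json = '{"name": "AZ", "count": 4, "rate": 0.5, "flags": [true, false], "note": null}'

parsed = json.loads(sample_json)

for key, value in parsed.items():
    print(key, type(value).__name__)

# name str
# count int
# rate float
# flags list
# note NoneType
```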

The full structure of `json_content_from_file` looks like this:

In [42]:
print(json_content_from_file)

[{'State': 'AZ', 'Population': 122931, 'Deaths per quarter': [{'Quater ending': '6/30/1907', 'Deaths': {'diphtheria': 1, 'enteric fever': 4, 'scarlet fever': 11, 'smallpox': 2}}]}, {'State': 'CA', 'Population': 2054000, 'Deaths per quarter': [{'Quater ending': '6/30/1907', 'Deaths': {'diphtheria': 19, 'enteric fever': 18, 'scarlet fever': 1, 'smallpox': 0}}]}]


The opposite process, **serializing a Python object to JSON**, is accomplished through the `dump()` and `dumps()` functions. These are the counterparts to `load()` and `loads()`: `dump()` will **write the JSON output directly to a *file object***, while `dumps()` will **return it as a string**.
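The difference between the two can be sketched as follows, using an in-memory `io.StringIO` buffer as a stand-in for an open file (an assumption made to keep the example self-contained; the `data` dictionary is also made up):

```python
import io
import json

data = {'State': 'AZ', 'Deaths': [1, 4, 11, 2]}

as_string = json.dumps(data)            # dumps() returns the JSON text as a str

buffer = io.StringIO()                  # in-memory stand-in for an open file
return_value = json.dump(data, buffer)  # dump() writes to the file object instead

print(as_string)                        # {"State": "AZ", "Deaths": [1, 4, 11, 2]}
print(return_value)                     # None
print(buffer.getvalue() == as_string)   # True
```

Note that `json.dump()` itself returns `None`: the JSON text ends up in the file object, not in the return value.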

Let's take the list of dictionaries in `json_content_from_file` and convert it into a JSON-formatted string:

In [43]:
json_content_as_string = json.dumps(json_content_from_file)
print(type(json_content_as_string))
print(json_content_as_string)

<class 'str'>
[{"State": "AZ", "Population": 122931, "Deaths per quarter": [{"Quater ending": "6/30/1907", "Deaths": {"diphtheria": 1, "enteric fever": 4, "scarlet fever": 11, "smallpox": 2}}]}, {"State": "CA", "Population": 2054000, "Deaths per quarter": [{"Quater ending": "6/30/1907", "Deaths": {"diphtheria": 19, "enteric fever": 18, "scarlet fever": 1, "smallpox": 0}}]}]


The output is *valid* JSON, even if it's not *nicely* formatted. Besides checking the `type`, we can tell the output is a JSON string rather than a Python list because it uses **double quotes** to enclose strings (as JSON requires) as opposed to the **single quotes** Python uses by default when printing.

As for making the output look nicer, `json.dumps()` **takes a keyword argument called** `indent` that, if provided, will **split and indent the lines properly**:

In [44]:
json_content_as_string = json.dumps(json_content_from_file, indent=4) # let's use the standard 4 spaces
print(json_content_as_string)

[
    {
        "State": "AZ",
        "Population": 122931,
        "Deaths per quarter": [
            {
                "Quater ending": "6/30/1907",
                "Deaths": {
                    "diphtheria": 1,
                    "enteric fever": 4,
                    "scarlet fever": 11,
                    "smallpox": 2
                }
            }
        ]
    },
    {
        "State": "CA",
        "Population": 2054000,
        "Deaths per quarter": [
            {
                "Quater ending": "6/30/1907",
                "Deaths": {
                    "diphtheria": 19,
                    "enteric fever": 18,
                    "scarlet fever": 1,
                    "smallpox": 0
                }
            }
        ]
    }
]


**Much better!** For reference, Python data types are mapped to JSON as follows:

Python | JSON
:-----|:-----
`dict` | object
`list`, `tuple` | array
`str` | string
`int`, `float` | number
`True` | true
`False` | false
`None` | null

<div class="alert alert-block alert-warning">
<b>Warning:</b> The data type of <i>keys</i> in JSON objects is always <code>string</code>, whereas Python supports other types (<i>e.g.</i> integers). When a Python dictionary is serialized into JSON, all the keys are coerced into strings. Because of this, <b>if a dictionary is serialized into JSON and then back into Python, the resulting dictionary may not equal the original</b>.
</div>
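A short sketch of this effect, assuming a made-up dictionary with integer keys and a tuple value:

```python
import json

# hypothetical dictionary with integer keys and a tuple value
original = {1: 'one', 2: 'two', 'pair': (3, 4)}

round_tripped = json.loads(json.dumps(original))

print(round_tripped)              # {'1': 'one', '2': 'two', 'pair': [3, 4]}
print(round_tripped == original)  # False
```

The integer keys come back as strings and the tuple comes back as a list, so the two dictionaries no longer compare equal.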

Now that we have a string formatted as JSON, we can demonstrate the workings of `json.loads()`:

In [45]:
json_content_from_string = json.loads(json_content_as_string)
print(type(json_content_from_string))
print(json_content_from_string)

<class 'list'>
[{'State': 'AZ', 'Population': 122931, 'Deaths per quarter': [{'Quater ending': '6/30/1907', 'Deaths': {'diphtheria': 1, 'enteric fever': 4, 'scarlet fever': 11, 'smallpox': 2}}]}, {'State': 'CA', 'Population': 2054000, 'Deaths per quarter': [{'Quater ending': '6/30/1907', 'Deaths': {'diphtheria': 19, 'enteric fever': 18, 'scarlet fever': 1, 'smallpox': 0}}]}]


In [46]:
print(json_content_from_file) # let's print the data we imported directly from the file for comparison

[{'State': 'AZ', 'Population': 122931, 'Deaths per quarter': [{'Quater ending': '6/30/1907', 'Deaths': {'diphtheria': 1, 'enteric fever': 4, 'scarlet fever': 11, 'smallpox': 2}}]}, {'State': 'CA', 'Population': 2054000, 'Deaths per quarter': [{'Quater ending': '6/30/1907', 'Deaths': {'diphtheria': 19, 'enteric fever': 18, 'scarlet fever': 1, 'smallpox': 0}}]}]


`json_content_from_file` and `json_content_from_string` are **equal Python objects** (lists of dictionaries with the same content).

Finally, let's write our data to a JSON-formatted file. `json.dump()` **takes a Python object** to serialize, **a *file object*** as destination, and **a number of keyword arguments**, of which only `indent` (same as above) is immediately relevant:

In [47]:
json_file_out_1 = os.path.join(os.curdir, 'samples', 'death_causes_out_1.json') # create destination path

with open(json_file_out_1, 'w') as file: # open file in write mode
    json.dump(json_content_from_file, file, indent=4)

**Go ahead and have a look at the file.** It should contain the same data as the original JSON source.
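The round trip can also be verified programmatically. The following sketch writes a made-up list of dictionaries to a file in a temporary folder (the file name `round_trip.json` is hypothetical), loads it back, and compares the result with the original:

```python
import json
import os
import tempfile

data = [{'State': 'AZ', 'Population': 122931}, {'State': 'CA', 'Population': 2054000}]

with tempfile.TemporaryDirectory() as tmp_dir:  # temporary folder, removed on exit
    json_path = os.path.join(tmp_dir, 'round_trip.json')  # hypothetical file name

    with open(json_path, 'w') as file:  # serialize to the file
        json.dump(data, file, indent=4)

    with open(json_path, 'r') as file:  # deserialize from the file
        reloaded = json.load(file)

print(reloaded == data)  # True
```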