## Introducing docstrings

In this mission, we'll cover some best practices that will make your code much easier to use, read, and maintain, including:

- How to document your code so that others can easily understand it.
- How to create functions that are easier to test, debug, and change.
- How to set up default arguments in functions so that your code doesn't behave unexpectedly.

Let's start by looking at this split_and_stack() function:

In [3]:
import numpy as np
import pandas as pd

def split_and_stack(df, new_names):
    half = int(len(df.columns) / 2)
    left = df.iloc[:, :half]
    right = df.iloc[:, half:]
    return pd.DataFrame(data=np.vstack([left.values, right.values]), columns=new_names)

If we wanted to understand what the function does, what the arguments are supposed to be, and what it returns, we would have to spend some time deciphering the code.

With a **docstring**, though, it is much easier to tell what the expected inputs and outputs should be, as well as what the function does. A docstring is a string literal written as the first statement in a function's body. Because docstrings usually span multiple lines, they are enclosed in triple quotes, Python's way of writing multi-line strings:

In [4]:
def split_and_stack(df, new_names):
    """Split a DataFrame's columns into two halves and then stack
    them vertically, returning a new DataFrame with `new_names` as the
    column names.

    Args:
      df (DataFrame): The DataFrame to split.
      new_names (iterable of str): The column names for the new DataFrame.

    Returns:
      DataFrame
    """
    half = int(len(df.columns) / 2)
    left = df.iloc[:, :half]
    right = df.iloc[:, half:]
    return pd.DataFrame(
      data=np.vstack([left.values, right.values]),
      columns=new_names
    )

Every docstring has some (although usually not all) of these five key pieces of information:

- Description of what the function does.
- Description of the arguments, if any.
- Description of the return value(s), if any.
- Description of errors raised, if any.
- Optional extra notes or examples of usage.

Docstrings make it easier for you and other data scientists or engineers to use, read, and maintain your code in the future. Remember that even though computers execute it, code is actually written for humans to read (otherwise you'd just be writing the 1s and 0s that the computer operates on).

### Retrieving docstrings

Every function in Python comes with a `__doc__` attribute that holds the contents of the function's docstring.

In [5]:
def the_answer():
    """Returns the answer to life, 
    the universe, and everything.

    Returns:
        int
    """
    return 42

In [6]:
print(the_answer.__doc__)

Returns the answer to life, 
    the universe, and everything.

    Returns:
        int
    


Notice that the `__doc__` attribute contains the raw docstring, including any tabs or spaces that were added to make the words visually line up.

To get a cleaner version, with those leading spaces removed, we can use the getdoc() function from the inspect module.

In [7]:
import inspect

print(inspect.getdoc(the_answer))

Returns the answer to life, 
the universe, and everything.

Returns:
    int


The inspect module contains a lot of useful functions for gathering information about functions, so we recommend you take some time at the end of this mission to read through its documentation.
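As a small taste of what else the inspect module offers, the sketch below pairs `inspect.getdoc()` with `inspect.signature()`, which reports a function's parameters and their default values. The `scale()` function is a hypothetical example made up for this illustration.

```python
import inspect


def scale(values, factor=2):
    """Multiply every value by a constant factor.

    Args:
      values (list of float): The numbers to scale.
      factor (int, optional): The multiplier.

    Returns:
      list of float
    """
    return [value * factor for value in values]


# The signature lists the parameter names along with any default values.
signature = inspect.signature(scale)
print(signature)

# Each parameter's default can also be read individually.
factor_default = signature.parameters["factor"].default
print(factor_default)

# getdoc() returns the docstring with the leading indentation removed.
first_line = inspect.getdoc(scale).splitlines()[0]
print(first_line)
```

Reading the signature and the docstring together is often enough to use a function without opening its source.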

In a Jupyter notebook, there's also a keyboard shortcut we can use to access docstrings: press `Shift` + `Tab` while the cursor is within the parentheses of a function call.

### Google style docstrings

Now that we know how to retrieve a function's docstring, let's learn how to write our own.

Consistent style makes a project easier to read, and the Python community has evolved several standards for how to format docstrings. Google style and Numpydoc are the most popular formats. However, since Numpydoc takes up more vertical space, we'll focus on Google style in this mission to keep the examples compact and legible.

#### Description of what the function does

In Google style, the docstring starts with a concise description of what the function does. This should be in **imperative language**. For instance, we would write "Split the data frame and stack the columns" instead of "This function will split the data frame and stack the columns."

```
def function(arg_1, arg_2=42):
    """Description of what the function does.
    """
```

#### Description of the arguments, if any

Next comes the "Args" section where you list each argument name, followed by its expected type in parentheses, and then its role in the function. If you need extra space, break to the next line and indent, like below. If an argument has a default value, mark it as "optional" when describing the type. If the function does not take any parameters, leave this section out.

```
def function(arg_1, arg_2=42):
    """Description of what the function does.

    Args:
      arg_1 (str): Description of arg_1 that can break onto the next line
        if needed.
      arg_2 (int, optional): Write optional when an argument has a default
        value.
    """
```
  
#### Description of the return value(s), if any

The next section is the "Returns" section, where you list the expected type or types of what gets returned. You can also provide some comment about what gets returned, but often the name of the function and the description will make this clear. Additional lines should not be indented.

```
def function(arg_1, arg_2=42):
    """Description of what the function does.
    
    Args:
      arg_1 (str): Description of arg_1 that can break onto the next line
        if needed.
      arg_2 (int, optional): Write optional when an argument has a default
        value.
        
    Returns:
      bool: Optional description of the return value
      Extra lines are not indented.
    """
```


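#### Description of errors raised, if any

If the function intentionally raises any errors, list them in a "Raises" section.

#### Optional extra notes or examples of usage

Any extra notes or usage examples go in a "Notes" section at the end of the docstring.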

```
def function(arg_1, arg_2=42):
    """Description of what the function does.

    Args:
      arg_1 (str): Description of arg_1 that can break onto the next line
        if needed.
      arg_2 (int, optional): Write optional when an argument has a default
        value.

    Returns:
      bool: Optional description of the return value
      Extra lines are not indented.

    Raises:
      ValueError: Include any error types that the function intentionally
        raises.

    Notes:
      See https://www.dataquest.io for more info.  
    """
```
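To see the template filled in for a concrete case, here is a minimal sketch. The `count_letter()` function is hypothetical and exists only to show the "Args", "Returns", and "Raises" sections describing real behavior.

```python
def count_letter(content, letter):
    """Count the number of times `letter` appears in `content`.

    Args:
      content (str): The string to search.
      letter (str): The letter to search for.

    Returns:
      int: The number of occurrences of `letter`.

    Raises:
      ValueError: If `letter` is not a one-character string.
    """
    if not isinstance(letter, str) or len(letter) != 1:
        raise ValueError("`letter` must be a single character string.")
    return content.count(letter)


print(count_letter("banana", "a"))

# The error documented under "Raises" is triggered by invalid input.
try:
    count_letter("banana", "an")
except ValueError as error:
    print(f"ValueError: {error}")
```

The "Raises" entry tells callers which invalid inputs to guard against without making them read the function body.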

## Don't repeat yourself

Now that we know how to make our functions easier to understand, let's look at how we can also make them easier to test, debug, and change. The **Don't repeat yourself** principle, also known as **DRY**, and the **Do One Thing** principle are good ways to ensure that our functions are well designed and easy to test. Let's see how, starting with DRY.

When we write code to look for answers to a research question, it is totally normal to copy and paste a bit of code, tweak it slightly, and re-run it. However, this kind of repeated code can lead to real problems.

In this code snippet, we load our train, validation, and test data, and plot the first two principal components of each dataset. Suppose we wrote the code for the train dataset, then copied it and pasted it into the next two blocks, updating the paths and the variable names:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

train = pd.read_csv('train.csv')
train_y = train['labels'].values
train_X = train[[col for col in train.columns if col != 'labels']].values
train_pca = PCA(n_components=2).fit_transform(train_X)
plt.scatter(train_pca[:,0], train_pca[:,1])

In [None]:
val = pd.read_csv('validation.csv')
val_y = val['labels'].values
val_X = val[[col for col in val.columns if col != 'labels']].values
val_pca = PCA(n_components=2).fit_transform(val_X)
plt.scatter(val_pca[:,0], val_pca[:,1])

In [None]:
test = pd.read_csv('test.csv')
test_y = test['labels'].values
test_X = test[[col for col in test.columns if col != 'labels']].values
test_pca = PCA(n_components=2).fit_transform(train_X)
plt.scatter(test_pca[:,0], test_pca[:,1])

But one of the problems with copying and pasting is that it is easy to accidentally introduce errors that are hard to spot. Notice in the last block, we accidentally took the principal components of the train data instead of the test data. Yikes!

In [None]:
test = pd.read_csv('test.csv')
test_y = test['labels'].values
test_X = test[[col for col in test.columns if col != 'labels']].values
test_pca = PCA(n_components=2).fit_transform(train_X)  ### yikes! ###
plt.scatter(test_pca[:,0], test_pca[:,1])

Another problem with repeated code is that if we want to change something, we often have to do it in multiple places. For instance, if we realized that our CSVs used the column name "label" instead of "labels," we would have to change our code in six places. Repeated code like this is a good sign that we should write a function.

Wrapping the repeated logic in a function and then calling that function several times makes it much easier to avoid the kind of errors introduced by copying and pasting.

In [None]:
def load_and_plot(path):
    """Load a data set and plot the first two principal components.

    Args:
      path (str): The location of a CSV file.

    Returns:
      tuple of ndarray: (features, labels)
    """
    data = pd.read_csv(path)
    y = data['label'].values
    X = data[[col for col in data.columns if col != 'label']].values
    pca = PCA(n_components=2).fit_transform(X)
    plt.scatter(pca[:,0], pca[:,1])
    return X, y

In [None]:
train_X, train_y = load_and_plot('train.csv')

In [None]:
val_X, val_y = load_and_plot('validation.csv')

In [None]:
test_X, test_y = load_and_plot('test.csv')
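The same idea applies to details that might change later, such as the name of the label column: if that name lives in one place, a change means one edit instead of six. The sketch below uses small in-memory rows in place of the CSV files so that it runs without pandas. `split_features_labels()`, its `label_key` parameter, and the sample rows are all hypothetical.

```python
def split_features_labels(rows, label_key="label"):
    """Separate the feature values from the label in each row.

    Args:
      rows (list of dict): One dictionary per observation.
      label_key (str, optional): The key that holds the label.

    Returns:
      tuple of list: (features, labels)
    """
    features = [
        [value for key, value in row.items() if key != label_key]
        for row in rows
    ]
    labels = [row[label_key] for row in rows]
    return features, labels


# Hypothetical stand-ins for the train, validation, and test files.
datasets = {
    "train": [
        {"x1": 1.0, "x2": 2.0, "label": 0},
        {"x1": 3.0, "x2": 4.0, "label": 1},
    ],
    "validation": [{"x1": 5.0, "x2": 6.0, "label": 1}],
    "test": [{"x1": 7.0, "x2": 8.0, "label": 0}],
}

# The shared logic is written once and applied to every dataset.
splits = {name: split_features_labels(rows) for name, rows in datasets.items()}

test_X, test_y = splits["test"]
print(test_X, test_y)
```

If the label key were renamed, only the `label_key` argument would need to change.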

### Do One Thing

On the last screen, we wrapped repeated logic from our code in the following function.

In [None]:
def load_and_plot(path):
    """Load a data set and plot the first two principal components.

    Args:
      path (str): The location of a CSV file.

    Returns:
      tuple of ndarray: (features, labels)
    """
    # load the data
    data = pd.read_csv(path)
    y = data['label'].values
    X = data[[col for col in data.columns if col != 'label']].values

    # plot the first two principal components
    pca = PCA(n_components=2).fit_transform(X)
    plt.scatter(pca[:,0], pca[:,1])

    return X, y

However, there is still a big problem with this function. First, it loads the data. Then, it plots the data. And then it returns the loaded data.

This function violates another software engineering principle: **Do One Thing**. Every function should have a single responsibility. Let's look at how we could split this one up.

In [None]:
def load_data(path):
    """Load a data set.

    Args:
      path (str): The location of a CSV file.

    Returns:
      tuple of ndarray: (features, labels)
    """
    data = pd.read_csv(path)
    y = data['labels'].values
    X = data[[col for col in data.columns if col != 'labels']].values
    return X, y

In [None]:
def plot_data(X):
    """Plot the first two principal components of a matrix.

    Args:
      X (numpy.ndarray): The data to plot.
    """
    pca = PCA(n_components=2).fit_transform(X)
    plt.scatter(pca[:,0], pca[:,1])

Instead of one big function, we could have a more nimble function that just loads the data and a second one for plotting.

We get several advantages from splitting the load_and_plot() function into two smaller functions. Our code becomes:

- More flexible
- More easily understood
- Simpler to test
- Simpler to debug
- Easier to change

First of all, our code has become more flexible. Imagine that later on in our script, we just want to load the data and not plot it. That's easy now with the load_data() function.

Likewise, if we wanted to do some transformation to the data before plotting, we can do the transformation and then call the plot_data() function. We have decoupled the loading functionality from the plotting functionality.

The code will also be easier for other developers to understand, and it will be more pleasant to test and debug.

Finally, if we ever need to update our code, functions that each have a single responsibility make it easier to predict how changes in one place will affect the rest of the code.
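To illustrate the "simpler to test" point, here is a minimal sketch: a function that only parses data can be checked with a few plain assertions, with no plotting involved. `parse_labeled_csv()` and the sample text are hypothetical. The function reads CSV text with the standard library rather than pandas so that the example is self-contained.

```python
import csv
import io


def parse_labeled_csv(text, label_column="label"):
    """Parse CSV text into feature rows and labels.

    Args:
      text (str): The CSV content, including a header row.
      label_column (str, optional): The name of the label column.

    Returns:
      tuple of list: (features, labels)
    """
    reader = csv.DictReader(io.StringIO(text))
    features, labels = [], []
    for row in reader:
        labels.append(row[label_column])
        features.append(
            [float(value) for name, value in row.items() if name != label_column]
        )
    return features, labels


def test_parse_labeled_csv():
    """Check the parser on a tiny, hand-written input."""
    sample = "x1,x2,label\n1,2,a\n3,4,b\n"
    features, labels = parse_labeled_csv(sample)
    assert features == [[1.0, 2.0], [3.0, 4.0]]
    assert labels == ["a", "b"]


test_parse_labeled_csv()
print("all checks passed")
```

Because the function has a single responsibility, a failing assertion points directly at the parsing logic.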

### Pass by Assignment

Another important thing to keep in mind when writing functions is that the way that Python passes information to functions is different from many other languages. It is referred to as pass by assignment.

Let's say we have a function foo() that takes a list and sets the first value of the list to 99:

In [10]:
def foo(x):
    x[0] = 99

Then we set my_list to the value [1, 2, 3] and pass it to foo(). What do you expect the value of my_list to be after calling foo()?

In [11]:
my_list = [1, 2, 3]

foo(my_list)

my_list

[99, 2, 3]

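If you said [99, 2, 3], then you are right. Lists in Python are mutable objects, meaning that they can be changed. Inside foo(), the parameter x refers to the same list object as my_list, so the assignment x[0] = 99 modifies that shared list in place.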
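One way to check this is to give the list a second name and confirm that both names see the change. The sketch below also shows a common way to protect the original, which is to pass a copy. The variable names `alias` and `original` are introduced only for this illustration.

```python
def foo(x):
    x[0] = 99


my_list = [1, 2, 3]
alias = my_list  # a second name for the same list object
print(alias is my_list)

foo(my_list)
print(alias)  # the change is visible through both names

# Passing a shallow copy leaves the original list untouched.
original = [1, 2, 3]
foo(original.copy())
print(original)
```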

Now let's say we have another function bar() that takes an argument and adds 90 to it:

In [12]:
def bar(x):
    x = x + 90

Then we assign the value 3 to the variable my_var and call bar() with my_var as the argument. What do you expect the value of my_var to be after we call bar()?

In [13]:
my_var = 3

bar(my_var)
print(my_var)

3


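If you said 3, you're right. In Python, integers are immutable, meaning they can't be changed. The statement x = x + 90 creates a new integer and binds the local name x to it; my_var still refers to the original 3.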
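A short sketch of the practical consequences follows. To get the new value out of a function, return it. Plain assignment to a parameter never affects the caller, even when the argument is a mutable list. Both `bar_with_return()` and `replace_list()` are hypothetical variants written for this illustration.

```python
def bar_with_return(x):
    """Add 90 to `x` and return the result."""
    x = x + 90  # binds the local name x to a new integer object
    return x


def replace_list(x):
    """Rebind the local name `x` without touching the caller's list."""
    x = [99]  # plain assignment: the caller's list is not modified


my_var = 3
result = bar_with_return(my_var)
print(my_var)  # still 3
print(result)  # the new value comes back through the return statement

my_list = [1, 2, 3]
replace_list(my_list)
print(my_list)  # unchanged, because the function only rebound its local name
```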