# Chapter #1: Best Practices

## 1. Docstrings

1. Docstrings

> You've probably spent a lot of time using functions that someone else wrote. In this course, you'll learn how to write functions that others can use. Docstrings are a Python best practice that will make your code much easier to use, read, and maintain.

2. A complex function

> Look at this split_and_stack() function. If you wanted to understand what the function does, what the arguments are supposed to be, and what it returns, you would have to spend some time deciphering the code.

>> ![image.png](attachment:9cd8ee50-1fad-4fca-b1f5-02906da52886.png)

3. A complex function with a docstring

> With a docstring though, it is much easier to tell what the expected inputs and outputs should be, as well as what the function does. This makes it easier for you and other engineers to use your code in the future.

>> ![image.png](attachment:dad80de6-acd4-4646-925d-226f1d9c3788.png)

4. Anatomy of a docstring

> A docstring is a string literal written as the first statement of a function body. Because docstrings usually span multiple lines, they are enclosed in triple quotes, Python's way of writing multi-line strings. Every docstring has some (although usually not all) of these five key pieces of information:

>> 1. What the function does,

>> 2. What the arguments are,

>> 3. What the return value or values should be,

>> 4. Info about any errors raised,

>> 5. Anything else you'd like to say about the function.

>> ![image.png](attachment:a94b37cc-8f2f-4f36-9023-30a29f735fe3.png)
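
> As a rough sketch of that layout, the placeholder function below carries a docstring with one paragraph for each of the five pieces. The names `function_name` and `arguments` are hypothetical and used only for illustration.

```python
def function_name(arguments):
    """
    Description of what the function does.

    Description of the arguments, if any.

    Description of the return value(s), if any.

    Description of errors raised, if any.

    Optional extra notes or examples of usage.
    """
    # Placeholder body: a real function would do its work here.
    return None


# The docstring is stored on the function object itself.
print(function_name.__doc__)
```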

5. Docstring formats

> Consistent style makes a project easier to read, and the Python community has evolved several standards for how to format your docstrings:

>> - Google Style,

>> - Numpydoc,

>> - reStructuredText,

>> - EpyText.

> Google-style and Numpydoc are the most popular formats, so we'll focus on those.

6. Google Style - description

> In Google style, the docstring starts with a concise description of what the function does. This should be in imperative language. For instance: "Split the data frame and stack the columns" instead of "This function will split the data frame and stack the columns".

>> ![image.png](attachment:9730cb43-c5de-4cd0-93e6-beee53c0b033.png)

7. Google style - arguments

> Next comes the "Args" section where you list each argument name, followed by its expected type in parentheses, and then what its role is in the function. If you need extra space, you can break to the next line and indent as I've done here. If an argument has a default value, mark it as "optional" when describing the type. If the function does not take any parameters, feel free to leave this section out.

>> ![image.png](attachment:23195815-f7b8-4627-84ee-b8cf429efcc2.png)

8. Google style - return value(s)

> The next section is the "Returns" section, where you list the expected type or types of what gets returned. You can also provide some comment about what gets returned, but often the name of the function and the description will make this clear. Additional lines should not be indented.

>> ![image.png](attachment:7400cfda-909d-4a60-b9e0-20398cbf4f29.png)

9. Google-style - errors raised and extra notes

> Finally, if your function intentionally raises any errors, you should add a "Raises" section. You can also include any additional notes or examples of usage in free form text at the end.

>> ![image.png](attachment:19c37f9e-f97c-40f6-a1da-c2c84317c2bb.png)
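
> Putting the description, "Args", "Returns", and "Raises" sections together, a complete Google-style docstring might look like the sketch below. The `clip_values()` function and its parameters are hypothetical; they are not part of the course material and exist only to give the docstring something to describe.

```python
def clip_values(values, lower=0.0, upper=1.0):
    """Clip every number in a list to a closed interval.

    Args:
        values (list of float): The numbers to clip.
        lower (float, optional): The smallest value allowed in the
            result. Defaults to 0.0.
        upper (float, optional): The largest value allowed in the
            result. Defaults to 1.0.

    Returns:
        list of float: A new list with each value limited to
        [lower, upper].

    Raises:
        ValueError: If `lower` is greater than `upper`.

    Notes:
        The input list is not modified.
    """
    if lower > upper:
        raise ValueError('`lower` must not be greater than `upper`.')
    # Raise values below `lower`, then cap values above `upper`.
    return [min(max(value, lower), upper) for value in values]


print(clip_values([-0.5, 0.25, 3.0]))  # [0.0, 0.25, 1.0]
```

> Note the imperative first line, the "optional" marker on the two arguments with default values, and the indented continuation lines in "Args".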

10. Numpydoc

> The Numpydoc format is very similar and is the most common format in the scientific Python community. Personally, I think it looks better than the Google style. It takes up more vertical space though, so this course will either use Google-style or leave out the docstrings entirely to keep the examples compact and legible.

>> ![image.png](attachment:7750b4be-5bce-4e24-82b5-b1b7efa0f979.png)
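
> For a sense of how Numpydoc uses vertical space, here is a sketch of a Numpydoc docstring on a hypothetical `clip_values()` helper. The function and its parameters are made up for illustration. Each section header is underlined with dashes, and each name or type sits on its own line above an indented description.

```python
def clip_values(values, lower=0.0, upper=1.0):
    """
    Clip every number in a list to a closed interval.

    Parameters
    ----------
    values : list of float
        The numbers to clip.
    lower : float, optional
        The smallest value allowed in the result. Default is 0.0.
    upper : float, optional
        The largest value allowed in the result. Default is 1.0.

    Returns
    -------
    list of float
        A new list with each value limited to [lower, upper].

    Raises
    ------
    ValueError
        If `lower` is greater than `upper`.
    """
    if lower > upper:
        raise ValueError('`lower` must not be greater than `upper`.')
    # Raise values below `lower`, then cap values above `upper`.
    return [min(max(value, lower), upper) for value in values]
```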

11. Retrieving docstrings

> Sometimes it is useful for your code to access the contents of a function's docstring. Every function in Python comes with a `__doc__` attribute that holds this information. Notice that the `__doc__` attribute contains the raw docstring, including any tabs or spaces that were added to make the words line up visually. To get a cleaner version, with those leading spaces removed, you can use the `getdoc()` function from the `inspect` module. The `inspect` module contains a lot of useful functions for gathering information about functions.

>> ![image.png](attachment:056a51e1-68bd-4ee4-abf8-82bd4c0f59b4.png)

12. Let's practice!

> Now it's your turn to practice writing and retrieving docstrings.

### 1.1. Crafting a docstring

> You've decided to write the world's greatest open-source natural language processing Python package. It will revolutionize working with free-form text, the way `numpy` did for arrays, `pandas` did for tabular data, and `scikit-learn` did for machine learning.

> The first function you write is `count_letter()`. It takes a string and a single letter and returns the number of times the letter appears in the string. You want the users of your open-source package to be able to understand how this function works easily, so you will need to give it a docstring. Build up a Google Style docstring for this function by following these steps.

>> - Copy the following string and add it as the docstring for the function:
>>> `Count the number of times letter appears in content`.

In [1]:
# Add a docstring to count_letter():
def count_letter(content, letter):
    
    """Count the number of times letter appears in content."""

    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    
    return len([char for char in content if char == letter])

>> - Now add the arguments section, using the Google style for docstrings. Use `str` to indicate a string.

In [2]:
# Add a docstring to count_letter():
def count_letter(content, letter):
    
    """Count the number of times letter appears in content.
    
    Args:
        content (str): The string to search.
        letter (str):  The letter to search for.    
    """

    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    
    return len([char for char in content if char == letter])

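>> - Add a returns section that informs the user the return value is an int, then add a raises section describing the `ValueError` raised when `letter` is not a one-character string.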

In [3]:
# Add a docstring to count_letter():
def count_letter(content, letter):
    
    """Count the number of times letter appears in content.
    
    Args:
        content (str): The string to search.
        letter (str):  The letter to search for.
        
    Returns:
        int: The number of times the letter appears in the string.
        
    Raises:
        ValueError: If `letter` is not a one-character string.
    """

    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    
    return len([char for char in content if char == letter])

### 1.2. Retrieving docstrings

> You and a group of friends are working on building an amazing new Python IDE (integrated development environment -- like PyCharm, Spyder, Eclipse, Visual Studio, etc.). The team wants to add a feature that displays a tooltip with a function's docstring whenever the user starts typing the function name. That way, the user doesn't have to go elsewhere to look up the documentation for the function they are trying to use. You've been asked to complete the `build_tooltip()` function that retrieves a docstring from an arbitrary function.

> You will be reusing the `count_letter()` function that you developed in the last exercise to show that we can properly extract its docstring.

>> - Begin by getting the docstring for the function `count_letter()`. Use an attribute of the `count_letter()` function.

In [4]:
# Get the "count_letter" docstring by using an attribute of the function
docstring = count_letter.__doc__

border = '#' * 28
print('{}\n{}\n{}'.format(border, docstring, border))

############################
Count the number of times letter appears in content.
    
    Args:
        content (str): The string to search.
        letter (str):  The letter to search for.
        
    Returns:
        int: The number of times the letter appears in the string.
        
    Raises:
        ValueError: If `letter` is not a one-character string.
    
############################


>> - Now use a function from the inspect module to get a better-formatted version of `count_letter()`'s docstring.

In [5]:
import inspect

# Inspect the count_letter() function to get its docstring
docstring = inspect.getdoc(count_letter)

border = '#' * 28
print('{}\n{}\n{}'.format(border, docstring, border))

############################
Count the number of times letter appears in content.

Args:
    content (str): The string to search.
    letter (str):  The letter to search for.
    
Returns:
    int: The number of times the letter appears in the string.
    
Raises:
    ValueError: If `letter` is not a one-character string.
############################


>> - Now create a `build_tooltip()` function that can extract the docstring from any function that we pass to it.

In [6]:
def build_tooltip(function):
    
    """Create a tooltip for any function that shows the function's
    docstring.

    Args:
        function (callable): The function we want a tooltip for.

    Returns:
        str: The docstring of the selected function.
    """
    # Import inspect module
    import inspect
    
    # Get the docstring for the "function" argument by using inspect
    docstr = inspect.getdoc(function)
    
    # Return the formatted docstr
    border = '#' * 28
    return f"{border}\n{docstr}\n{border}"

In [7]:
# Get the docstring for count_letter():
print(build_tooltip(count_letter))

############################
Count the number of times letter appears in content.

Args:
    content (str): The string to search.
    letter (str):  The letter to search for.
    
Returns:
    int: The number of times the letter appears in the string.
    
Raises:
    ValueError: If `letter` is not a one-character string.
############################


## 2. DRY and "Do One Thing"

1. DRY and "Do One Thing"

> DRY (also known as "don't repeat yourself") and the "Do One Thing" principle are good ways to ensure that your functions are well designed and easy to test. Let's see how.

2. Don't repeat yourself (DRY)

> When you are writing code to look for answers to a research question, it is totally normal to copy and paste a bit of code, tweak it slightly, and re-run it. However, this kind of repeated code can lead to real problems. In this code snippet, I load my train, validation, and test data, and plot the first two principal components of each dataset. I wrote the code for the train dataset, then copied it and pasted it into the next two blocks, updating the paths and the variable names.

>> ![image.png](attachment:f55f5275-2f96-4fbb-94b2-6145c785b57e.png)

3. The problem with repeating yourself

> But one of the problems with copying and pasting is that it is easy to accidentally introduce errors that are hard to spot. If you look closely at the last block, you'll notice that I accidentally took the principal components of the train data instead of the test data. Yikes!

>> ![image.png](attachment:e5d31f9c-2407-44cb-b9eb-cd603b7570a0.png)

4. Another problem with repeating yourself

> Another problem with repeated code is that if you want to change something, you often have to do it in multiple places. For instance, if we realized that our CSVs used the column name "label" instead of "labels", we would have to change our code in six places. Repeated code like this is a good sign that you should write a function. So let's do that.

>> ![image.png](attachment:986aee71-d7d8-4053-8952-a0ceec48e419.png)

5. Use functions to avoid repetition

> Wrapping the repeated logic in a function and then calling that function several times makes it much easier to avoid the kind of errors introduced by copying and pasting. And if you ever need to change the column "label" back to "labels", or you want to swap out PCA for some other dimensionality reduction technique, you only have to do it in one or two places.

>> ![image.png](attachment:f914a164-de01-4c63-873e-a920450dca4a.png)

6. Problem: it does multiple things

> However, there is still a big problem with this function.

7. Problem: it does multiple things

> First, it loads the data.

8. Problem: it does multiple things

> Then it plots the data.

9. Problem: it does multiple things

> And then it returns the loaded data. This function violates another software engineering principle: Do One Thing. Every function should have a single responsibility. Let's look at how we could split this one up.

>> ![image.png](attachment:22783ba3-a336-4ae7-8fbf-f5accc7ba6cb.png)

10. Do One Thing

> Instead of one big function, we could have a more nimble function that just loads the data and a second one for plotting. We get several advantages from splitting the load_and_plot() function into two smaller functions. First of all, our code has become more flexible. Imagine that later on in your script, you just want to load the data and not plot it. That's easy now with the load_data() function. Likewise, if you wanted to do some transformation to the data before plotting, you can do the transformation and then call the plot_data() function. We have decoupled the loading functionality from the plotting functionality.

>> ![image.png](attachment:c8358b11-da85-4723-a493-cfeb75cd12dd.png)
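
> A minimal sketch of that split might look like the following. It is not the code from the slide. The `label` column name, the use of the standard-library `csv` module for loading, and the choice to import the plotting libraries inside `plot_data()` are all assumptions made to keep the example small. `plot_data()` also assumes that scikit-learn and matplotlib are installed.

```python
import csv


def load_data(path):
    """Load features and labels from a CSV file.

    Args:
        path (str): Location of a CSV file with numeric feature
            columns and a 'label' column (assumed name).

    Returns:
        tuple of (list of list of float, list of str): The feature
        rows and their labels.
    """
    with open(path, newline='') as csv_file:
        rows = list(csv.DictReader(csv_file))
    # Separate the label column from the numeric feature columns.
    y = [row['label'] for row in rows]
    X = [[float(value) for name, value in row.items() if name != 'label']
         for row in rows]
    return X, y


def plot_data(X):
    """Plot the first two principal components of a feature matrix.

    Args:
        X (list of list of float): The feature rows to plot.
    """
    # Imported here so that load_data() works without the plotting stack.
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    pcs = PCA(n_components=2).fit_transform(X)
    plt.scatter(pcs[:, 0], pcs[:, 1])


# Typical usage (the file name is hypothetical):
# X_train, y_train = load_data('train.csv')
# plot_data(X_train)
```

> Because each function has a single responsibility, a script can call `load_data()` on its own, or transform `X` before handing it to `plot_data()`.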

11. Advantages of doing one thing

> The code will also be easier for other developers to understand, and it will be more pleasant to test and debug. Finally, if you ever need to update your code, functions that each have a single responsibility make it easier to predict how changes in one place will affect the rest of the code.

> The code becomes:

>> - More flexible,

>> - More easily understood,

>> - Simpler to test,

>> - Simpler to debug,

>> - Easier to change.

12. Code smells and refactoring

> Repeated code and functions that do more than one thing are examples of "code smells", which are indications that you may need to refactor. Refactoring is the process of improving code by changing it a little bit at a time. This process is well described in Martin Fowler's book, "Refactoring", which is a good read for any aspiring software engineer.

13. Let's practice!

> Now you can do some refactoring of your own in the exercises!

### 2.1. Extract a function

> While you were developing a model to predict the likelihood of a student graduating from college, you wrote this bit of code to get the z-scores of students' yearly GPAs. Now you're ready to turn it into a production-quality system, so you need to do something about the repetition. Writing a function to calculate the z-scores would improve this code.

>> ![download.png](attachment:0a6b0b2f-5120-44c1-949d-6b3e52ccac2f.png)

>> ![image.png](attachment:2d5d3632-fc55-44fb-bbfc-46ecf74a592a.png)

>> Note: `df` is a pandas DataFrame where each row is a student with 4 columns of yearly student GPAs: `y1_gpa`, `y2_gpa`, `y3_gpa`, `y4_gpa`

>> - Finish the function so that it returns the z-scores of a column.

In [8]:
# Create the gpa_df (toy data)
import pandas as pd

gpa = [[2.786, 2.053, 2.171, 0.066],
       [1.145, 2.666, 0.267, 2.885],
       [0.907, 0.424, 2.613, 0.031],
       [2.205, 0.524, 3.984, 0.339],
       [2.878, 1.288, 3.078, 0.902]]

columns=['y1_gpa', 'y2_gpa', 'y3_gpa', 'y4_gpa']

gpa_df = pd.DataFrame(data=gpa, columns=columns)
gpa_df

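|   | y1_gpa | y2_gpa | y3_gpa | y4_gpa |
|---|--------|--------|--------|--------|
| 0 | 2.786 | 2.053 | 2.171 | 0.066 |
| 1 | 1.145 | 2.666 | 0.267 | 2.885 |
| 2 | 0.907 | 0.424 | 2.613 | 0.031 |
| 3 | 2.205 | 0.524 | 3.984 | 0.339 |
| 4 | 2.878 | 1.288 | 3.078 | 0.902 |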


In [9]:
def standardize(column):
    """Standardize the values in a column.

    Args:
        column (pandas Series): The data to standardize.

    Returns:
        pandas Series: the values as z-scores
    """
    
    # Finish the function so that it returns the z-scores
    z_score = (column - column.mean()) / column.std()
    return z_score

In [10]:
# Use the standardize() function to calculate the z-scores
gpa_df['y1_z'] = standardize(gpa_df['y1_gpa'])
gpa_df['y2_z'] = standardize(gpa_df['y2_gpa'])
gpa_df['y3_z'] = standardize(gpa_df['y3_gpa'])
gpa_df['y4_z'] = standardize(gpa_df['y4_gpa'])

In [11]:
# Explore gpa_df after standardization:
gpa_df.head()

|   | y1_gpa | y2_gpa | y3_gpa | y4_gpa | y1_z | y2_z | y3_z | y4_z |
|---|--------|--------|--------|--------|------|------|------|------|
| 0 | 2.786 | 2.053 | 2.171 | 0.066 | 0.87547 | 0.682687 | -0.182366 | -0.652794 |
| 1 | 1.145 | 2.666 | 0.267 | 2.885 | -0.916306 | 1.314843 | -1.562431 | 1.710712 |
| 2 | 0.907 | 0.424 | 2.613 | 0.031 | -1.176174 | -0.997218 | 0.138006 | -0.682138 |
| 3 | 2.205 | 0.524 | 3.984 | 0.339 | 0.241087 | -0.894093 | 1.13174 | -0.423905 |
| 4 | 2.878 | 1.288 | 3.078 | 0.902 | 0.975923 | -0.106219 | 0.47505 | 0.048125 |


### 2.2. Split up a function

> Another engineer on your team has written this function to calculate the mean and median of a sorted list. You want to show them how to split it into two simpler functions: `mean()` and `median()`.

>> ![image.png](attachment:97bbd9e8-806c-46c2-a912-0eaafe5f4f4a.png)

>> - Write the mean() function.

In [12]:
def mean(values):
    """Get the mean of a sorted list of values

    Args:
        values (iterable of float): A list of numbers

    Returns:
        float
    """
    # Write the mean() function
    mean = sum(values) / len(values)
    return mean

>> - Write the median() function.

In [13]:
def median(values):
    """Get the median of a sorted list of values

    Args:
        values (iterable of float): A list of numbers

    Returns:
        float
    """
    # Write the median() function
    length = len(values)
    midpoint = length // 2
    
    if length % 2 == 0:
        # Even length: average the two middle values
        median = (values[midpoint - 1] + values[midpoint]) / 2
    else:
        # Odd length: take the single middle value
        median = values[midpoint]
    return median

## 3. Pass by assignment

1. Pass by assignment

> The way that Python passes information to functions is different from many other languages. It is referred to as "pass by assignment", which I will explain in this lesson.

2. A surprising example

> Let's say we have a function foo() that takes a list and sets the first value of the list to 99. Then we set "my_list" to the value [1, 2, 3] and pass it to foo(). What do you expect the value of "my_list" to be after calling foo()? If you said "[99, 2, 3]", then you are right. Lists in Python are mutable objects, meaning that they can be changed. Now let's say we have another function bar() that takes an argument and adds ninety to it. Then we assign the value 3 to the variable "my_var" and call bar() with "my_var" as the argument. What do you expect the value of "my_var" to be after we've called bar()? If you said "3", you're right. In Python, integers are immutable, meaning they can't be changed.

>> ![image.png](attachment:0ce2d4a4-782d-4725-81b1-e1ec27eadc55.png)
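
> Written out as code, the two cases look like this. This is a sketch reconstructed from the description; `foo()`, `bar()`, `my_list`, and `my_var` are the names used in the text, and the function bodies are assumed from what the text says they do.

```python
def foo(x):
    # Mutates the list object that the caller passed in.
    x[0] = 99


my_list = [1, 2, 3]
foo(my_list)
print(my_list)  # [99, 2, 3]


def bar(x):
    # Rebinds the local name x to a new int; the caller's object is untouched.
    x = x + 90


my_var = 3
bar(my_var)
print(my_var)  # 3
```

> Inside `bar()`, the assignment `x = x + 90` points the local name `x` at a new integer object, so `my_var` still refers to the original `3`.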