#### Michael feedback: HUGE lesson here. If you want to create well-built tools, projects, or libraries, this is the course to start with. It covers an extreme amount of content that could probably fill six courses on its own if he went into more detail. I got a lot of ideas and side projects from the content in this lesson. 5/5 stars.

# Part 1 - Python Programming Principles

### Chapter: Functions and Iterations

In [None]:
from pathlib import Path
from pprint import pprint

In [None]:
def print_files(filenames):
    # Set up the loop iteration instructions
    for name in filenames:
        # Use pathlib.Path to print out each file
        print(Path(name).read_text())
        
def list_files(filenames):
    # Use pathlib.Path to read the contents of each file
    return [Path(name).read_text()
            # Obtain each name from the list of filenames
            for name in filenames]

filenames = "diabetes.txt", "boston.txt", "digits.txt", "iris.txt", "wine.txt"
print_files(filenames)
pprint(list_files(filenames))

In [None]:
def flatten(nested_list):
    return (item 
            # Obtain each list from the list of lists
            for sublist in nested_list
            # Obtain each element from each individual list
            for item in sublist)

number_generator = (int(substring) for string in flatten(matches)
                    for substring in string.split() if substring.isdigit())
pprint(dict(zip(filenames, zip(number_generator, number_generator))))

Fantastic flattening! In this third coding exercise, we practiced using generator expressions instead of list comprehensions. Next, we'll continue on to learn about a new set of principles related to Python workflows.
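The pairing step in the cell above, `zip(number_generator, number_generator)`, works because both arguments are the same generator, so each tuple consumes two consecutive values. A minimal sketch of that behavior follows; the `matches` data from the exercise is not shown in these notes, so `sample_matches` is a made-up stand-in:

```python
def flatten(nested_list):
    # Lazily yield every item from every sublist
    return (item for sublist in nested_list for item in sublist)

# Hypothetical stand-in for the exercise's `matches` variable
sample_matches = [["442 rows and 10 columns"], ["150 rows and 4 columns"]]

numbers = (int(substring)
           for string in flatten(sample_matches)
           for substring in string.split()
           if substring.isdigit())

# zip() pulls from the same generator twice per tuple,
# so consecutive numbers end up paired
pairs = list(zip(numbers, numbers))
print(pairs)  # [(442, 10), (150, 4)]
```

Because generators are consumed as they are read, `numbers` is empty after this call; a list comprehension would have kept the values around at the cost of holding them all in memory.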


In [None]:
def obtain_words(string):
    # Replace non-alphabetic characters with spaces
    return "".join(char if char.isalpha() else " " for char in string).split()

def filter_words(words, minimum_length=3):
    # Remove words shorter than minimum_length characters (3 by default)
    return [word for word in words if len(word) >= minimum_length]

words = obtain_words(Path("diabetes.txt").read_text().lower())
filtered_words = filter_words(words)
pprint(filtered_words)

In [None]:
def count_words(word_list):
    # Count the words in the input list
    return {word: word_list.count(word) for word in word_list}

# Create the dictionary of words and word counts
word_count_dictionary = count_words(filtered_words)

(pd.DataFrame(word_count_dictionary.items())
 .sort_values(by=1, ascending=False)
 .head()
 .plot(x=0, kind="barh", xticks=range(5), legend=False)
 .set_ylabel("")
)
plt.show()

Wonderful word-ranking! We now have several independent, reusable functions at our disposal. In this coding exercise, we used a dictionary comprehension to link words with their counts and then sorted the words we obtained by their frequency. In the next video, we will learn about the last principle covered in this chapter.
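One thing to keep in mind is that `count_words()` calls `list.count()` once per word, which rescans the whole list each time. As a possible alternative (not part of the original exercise), `collections.Counter` from the standard library builds the same mapping in a single pass. The word list below is made up for illustration:

```python
from collections import Counter

# Made-up sample words standing in for filtered_words
sample_words = ["glucose", "age", "glucose", "bmi", "age", "glucose"]

def count_words(word_list):
    # Dictionary comprehension version from the exercise
    return {word: word_list.count(word) for word in word_list}

counter = Counter(sample_words)
# Both approaches produce the same word-to-count mapping
assert dict(counter) == count_words(sample_words)

# most_common() returns (word, count) pairs sorted by frequency
top_two = counter.most_common(2)
print(top_two)  # [('glucose', 3), ('age', 2)]
```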

In [None]:
# Fill in the first parameter in the pair_plot() definition
def pair_plot(self, vars=range(3), hue=None):
    return pairplot(pd.DataFrame(self.data), vars=vars, hue=hue, kind="reg")

ScikitData.pair_plot = pair_plot

# Create the diabetes instance of the ScikitData class
diabetes = ScikitData("diabetes")
diabetes.pair_plot(vars=range(2, 6), hue=1)._legend.remove()
plt.show()

In this exercise, we saw how classes and methods can be very useful! The choice to use seaborn for plotting is an implementation detail. We could switch to a different visualization library without changing how our ScikitData class is used. Next, we'll add a class method to ScikitData that will allow us to simultaneously create more than one instance.

In [None]:
# Fill in the decorator for the get_generator() definition
@classmethod
# Add the first parameter to the get_generator() definition
def get_generator(cls, dataset_names):
    return map(cls, dataset_names)

ScikitData.get_generator = get_generator
dataset_generator = ScikitData.get_generator(["diabetes", "iris"])
for dataset in dataset_generator:
    dataset.pair_plot()
    plt.show()

In this chapter, we have learned about abstracting away implementation details and providing interfaces to facilitate code use. For example, the details of how the ScikitData class works are hidden inside the module that contains the ScikitData class definition. We can use the ScikitData class to quickly obtain and explore scikit-learn datasets without looking at the ScikitData source code. In the next chapter, we will learn to describe how our code works with documentation and to ensure that the code works properly with testing.
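To make the class-method pattern concrete without scikit-learn, here is a small sketch that uses a hypothetical `Dataset` class (not the course's `ScikitData`). `map(cls, dataset_names)` calls the constructor lazily, one name at a time, as the loop consumes the iterator:

```python
class Dataset:
    """Hypothetical stand-in for ScikitData; it only stores a name."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        # Callers rely on this interface, not on how it is implemented
        return f"dataset: {self.name}"

    @classmethod
    def get_generator(cls, dataset_names):
        # map() returns a lazy iterator of cls(name) instances
        return map(cls, dataset_names)

descriptions = [d.describe() for d in Dataset.get_generator(["diabetes", "iris"])]
print(descriptions)  # ['dataset: diabetes', 'dataset: iris']
```

The body of `describe()` could change completely without affecting any code that calls it, which is the point of hiding implementation details behind an interface.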



# Part 2 - Documentation and Tests

#### Type Hints

In [1]:
from typing import Optional
from typing import List

In [None]:
class TextFile:
    # Add type hints to TextFile's __init__() method
    def __init__(self, name: str) -> None:
        self.text = Path(name).read_text()

    # Type annotate TextFile's get_lines() method
    def get_lines(self) -> List[str]:
        return self.text.split("\n")

help(TextFile)

In this case, we can be certain that the result of calling get_lines() will be a list of strings. While it is good to be as specific as possible when annotating return values, we might want to be a little more flexible with argument type hints. For example, we could design the TextFile class to allow users to pass a single filename or a list or tuple of filenames. Being smart about type hinting requires finding a balance between flexibility and specificity.
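One way to sketch that flexibility is with `Union` and `Sequence` from `typing`. The `normalize_names()` helper below is hypothetical and not part of the course code; it only shows how a flexible argument hint can coexist with a specific return hint:

```python
from typing import List, Sequence, Union

def normalize_names(names: Union[str, Sequence[str]]) -> List[str]:
    """Accept one filename or several and always return a list."""
    # A lone string is wrapped in a list; lists and tuples are copied into a list
    return [names] if isinstance(names, str) else list(names)

single = normalize_names("diabetes.txt")
several = normalize_names(("iris.txt", "wine.txt"))
print(single)   # ['diabetes.txt']
print(several)  # ['iris.txt', 'wine.txt']
```

The `isinstance` check matters because a string is itself a sequence of characters, so calling `list()` on it directly would split the filename into single letters.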

In [None]:
class MatchFinder:
    # Add type hints to __init__()'s strings argument
    def __init__(self, strings: List[str]) -> None:
        self.strings = strings

    # Type annotate get_matches()'s query argument
    def get_matches(self, query: Optional[str] = None) -> List[str]:
        return [s for s in self.strings if query in s] if query else self.strings

help(MatchFinder)

Type hints help us to understand important aspects of how functions and methods work, without having to look at the source code in modules. We will continue to use type annotation throughout the course, but in the next lesson we'll shift our attention to another way we can include information in the code that we write.

In [None]:
# pytest / doctest

In [None]:
def get_matches(word_list: List[str], query: str) -> List[str]:
    ("Find lines containing the query string.\nExamples:\n\t"
     # Complete the docstring example below
     ">>> get_matches(['a', 'list', 'of', 'words'], 's')\n\t"
     # Fill in the expected result of the function call
     "['list', 'words']")
    return [line for line in word_list if query in line]

help(get_matches)

A simple doctest example is a great way to convey information on how an object can be used. In the next exercise, we'll look at another function definition that includes an example in its docstring.
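Docstring examples can also be executed as tests. The sketch below rewrites `get_matches()` with a plain triple-quoted docstring (an assumption made for readability, rather than the concatenated string used in the exercise) and runs its example with the standard library's `doctest` module:

```python
import doctest
from typing import List

def get_matches(word_list: List[str], query: str) -> List[str]:
    """Find lines containing the query string.

    Examples:
        >>> get_matches(['a', 'list', 'of', 'words'], 's')
        ['list', 'words']
    """
    return [line for line in word_list if query in line]

# Parse the docstring into a DocTest object and run its examples
parser = doctest.DocTestParser()
test = parser.get_doctest(
    get_matches.__doc__, {"get_matches": get_matches}, "get_matches", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed, results.attempted)  # 0 1
```

In a regular script, it is usually simpler to call `doctest.testmod()` or to run `python -m doctest` on the file; the explicit parser and runner are used here only so the result can be inspected as a value.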

In [None]:
def obtain_words(string: str) -> List[str]:
    ("Split a string into a list of alphabetic words.\nExamples:\n\t"
     ">>> from this import s\n\t>>> from codecs import decode\n\t"
     # Use obtain_words() in the docstring example below
     ">>> obtain_words(decode(s, encoding='rot13'))[:4]\n\t"
     # Fill in the expected result of the function call
     "['The', 'Zen', 'of', 'Python']") 
    return ''.join(char if char.isalpha() else ' ' for char in string).split()
  
help(obtain_words)

Doctest examples are great additions to docstrings, but later in the chapter we will learn more about testing so that we do not have to include all of our tests in docstrings. Next, we will learn to generate reports using Python code.

#### Build Jupyter Notebooks

In [None]:
import nbformat
from nbformat.v4 import new_code_cell, new_markdown_cell, new_notebook

def nbuild(filenames: List[str]) -> nbformat.notebooknode.NotebookNode:
    """Create a Jupyter notebook from text files and Python scripts."""
    nb = new_notebook()
    nb.cells = [
        # Create new code cells from files that end in .py
        new_code_cell(Path(name).read_text()) 
        if name.endswith(".py")
        # Create new markdown cells from all other files
        else new_markdown_cell(Path(name).read_text()) 
        for name in filenames
    ]
    return nb
    
pprint(nbuild(["intro.md", "plot.py", "discussion.md"]))

Wow, the output is a little bit difficult to read. Jupyter notebooks, JSON files, and NotebookNode objects are all similar to Python dictionaries. The differences between JSON files and Python dictionaries are minimal, so becoming familiar with one will help you understand the other. Next, we will save the output of nbuild() as a notebook and write a function that will allow us to convert notebooks to (almost) any format!
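Since a notebook file is JSON, its structure can be sketched with plain dictionaries. The example below builds a stripped-down, notebook-like dictionary by hand and round-trips it through the `json` module. The keys mirror the nbformat version 4 layout, but this sketch is not validated against the schema and the cell contents are placeholders:

```python
import json

# Hand-written dictionary shaped like a minimal version 4 notebook
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Intro"},
        {"cell_type": "code", "metadata": {}, "source": "print('plot')",
         "execution_count": None, "outputs": []},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Serialize to a JSON string and load it back into a dictionary
text = json.dumps(notebook, indent=1)
restored = json.loads(text)

cell_types = [cell["cell_type"] for cell in restored["cells"]]
print(cell_types)  # ['markdown', 'code']
```

The round trip returns an equal dictionary; the main visible difference is that Python's `None` is written as `null` in the JSON text.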

In [None]:
from nbconvert.exporters import get_exporter

def nbconv(nb_name: str, exporter: str = "script") -> str:
    """Convert a notebook into various formats using different exporters."""
    # Instantiate the specified exporter class
    exp = get_exporter(exporter)()
    # Return the converted file's contents string
    return exp.from_filename(nb_name)[0]
    
pprint(nbconv(nb_name="mynotebook.ipynb", exporter="html"))

Congratulations on successfully converting a notebook to an HTML report! We could just as easily create a PDF report by passing 'pdf' to nbconv() as its exporter argument. The nbconvert library provides the means to create many different types of output documents from Jupyter notebooks. In the next video, we will learn to write pytest tests for the nbuild() function from the previous exercise. Furthermore, we will learn how to squeeze even more functionality from our docstrings by using a documentation generator called Sphinx.

#### Pytest

In [None]:
# Fill in the decorator for the test_nbuild() function 
@pytest.mark.parametrize("inputs", ["intro.md", "plot.py", "discussion.md"])
# Pass the argument set to the test_nbuild() function
def test_nbuild(inputs):
    assert nbuild([inputs]).cells[0].source == Path(inputs).read_text()

show_test_output(test_nbuild)

In [None]:
@pytest.mark.parametrize("not_exporters", ["htm", "ipython", "markup"])
# Pass the argument set to the test_nbconv() function
def test_nbconv(not_exporters):
    # Use pytest to confirm that a ValueError is raised
    with pytest.raises(ValueError):
        nbconv(nb_name="mynotebook.ipynb", exporter=not_exporters)

show_test_output(test_nbconv)

We've progressed way past doctest examples, but we've only scratched the surface of what pytest is capable of! The techniques we learned in this chapter prevent modules from turning into black boxes. While type hints, docstrings, documentation and tests do not change how modules work, users will certainly appreciate well-documented and bug-free code! In the next chapter, we will learn how Python can synergize with the shell.

# Part 3 - Shell superpowers

#### CLIs

#### Argparse nbuild()

In [None]:
import argparse
from typing import Callable

def argparse_cli(func: Callable) -> None:
    # Instantiate the parser object
    parser = argparse.ArgumentParser()
    # Add an argument called in_files to the parser object
    parser.add_argument("in_files", nargs="*")
    args = parser.parse_args()
    print(func(args.in_files))

if __name__ == "__main__":
    argparse_cli(nbuild)

#### docopt

In [None]:
# Add the section title in the docstring below
"""Usage: docopt_cli.py [IN_FILES...]"""

from typing import Callable

from docopt import docopt

def docopt_cli(func: Callable) -> None:
    # Assign the shell arguments to "args"
    args = docopt(__doc__)
    print(func(args["IN_FILES"]))

if __name__ == "__main__":
    docopt_cli(nbuild)

#### GitPython

GitPython gives us building blocks that we can use to build Python scripts that make our use of version control faster, easier, and more efficient.

Version-controlled projects usually start with initializing or cloning repositories.

After a repository is set up, the standard cycle of commands is to add and commit changes.

In this exercise, we will focus on the first two steps: adding changes to the index and committing them to version control history.

The commit message is created by an f-string, which evaluates the code inside curly braces ({}).
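As a quick illustration of that f-string, independent of GitPython, the expression inside the braces is evaluated first and its result is inserted into the string. The file names here are placeholders:

```python
# Placeholder list standing in for repo.untracked_files
untracked = ["intro.md", "plot.py"]

# The join() call inside the braces runs before the string is built
message = f"Added {', '.join(untracked)}"
print(message)  # Added intro.md, plot.py
```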

With GitPython, we can initialize a new repository and instantiate the Repo class in one line of code.

We can then check for untracked files, add files to the index, commit changes, and list all of the newly tracked files.

In [None]:
import git

# Initialize a new repo in the current folder
repo = git.Repo.init()

# Obtain a list of untracked files
untracked = repo.untracked_files

# Add all untracked files to the index
repo.index.add(untracked)

# Commit newly added files to version control history
repo.index.commit(f"Added {', '.join(untracked)}")
print(repo.head.commit.message)

In [None]:
changed_files = [file.b_path
                 # Iterate over items in the diff object
                 for file in repo.index.diff(None)
                 # Include only modified files
                 .iter_change_type("M")]

repo.index.add(changed_files)
repo.index.commit(f"Modified {', '.join(changed_files)}")
for number, commit in enumerate(repo.iter_commits()):
    print(number, commit.message)

Adding and committing changes are the most common tasks in git workflows. Diffs let us inspect the tracked and untracked changes since the latest commit. Version control is an important aspect of every workflow, so even though you've finished the last GitPython exercises, we will certainly return to git later in the course. In the next video, we will learn how to set up and track the libraries that our code needs to run.

#### Environment managers

We can check for installed libraries by running shell commands through subprocess. The code below prints the `pip list` entry for the library we are looking for (here, pandas).

In [None]:
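import subprocess
import venv

# Create a virtual environment (with_pip=True installs pip into it,
# which the pip commands below rely on)
venv.create(".venv", with_pip=True)

# Run pip list and obtain a CompletedProcess instance
cp = subprocess.run([".venv/bin/python", "-m", "pip", "list"], stdout=-1)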

for line in cp.stdout.decode().split("\n"):
    if "pandas" in line:
        print(line)
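
The cells in this section pass `stdout=-1`, which works because `-1` is the value of `subprocess.PIPE`. Below is a sketch of the same capture-and-filter pattern that runs without a `.venv` folder: it launches the current interpreter (`sys.executable`) and has the child process print two made-up lines instead of calling `pip`. Passing `text=True` returns a string, so no `decode()` call is needed:

```python
import subprocess
import sys

# -1 is simply the integer behind the named constant
assert subprocess.PIPE == -1

cp = subprocess.run(
    # The child process prints two made-up lines for illustration
    [sys.executable, "-c", "print('pandas placeholder'); print('numpy placeholder')"],
    stdout=subprocess.PIPE,
    text=True,
)

# Keep only the lines that mention the library we are looking for
matching = [line for line in cp.stdout.splitlines() if "pandas" in line]
print(matching)  # ['pandas placeholder']
```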

In [None]:
from subprocess import run

print(run(
    # Install project dependencies
    [".venv/bin/python", "-m", "pip", "install", "-r", "requirements.txt"],
    stdout=-1
).stdout.decode())

print(run(
    # Show information on the aardvark package
    [".venv/bin/python", "-m", "pip", "show", "aardvark"], stdout=-1
).stdout.decode())

#### Persistence and packaging

Pickle and unpickle a dataframe (and plot it) with pandas

In [None]:
pd.DataFrame(
    np.c_[(diabetes.data, diabetes.target)],
    columns="age sex bmi map tc ldl hdl tch ltg glu target".split()
    # Pickle the diabetes dataframe with zip compression
    ).to_pickle("diabetes.pkl.zip")
                  
# Unpickle the diabetes dataframe
df = pd.read_pickle("diabetes.pkl.zip")
df.plot.scatter(x="ltg", y="target", c="age", colormap="viridis")
plt.show()

Pickle and unpickle a trained model (and plot its predictions) with joblib and scikit-learn

In [None]:
# Train and pickle a linear model
joblib.dump(LinearRegression().fit(x_train, y_train), "linear.pkl")

# Unpickle the linear model
linear = joblib.load("linear.pkl")
predictions = linear.predict(x_test)
plt.scatter(y_test, predictions, edgecolors=(0, 0, 0))
min_max = [y_test.min(), y_test.max()]
plt.plot(min_max, min_max, "--", lw=3)
plt.xlabel("Measured")
plt.ylabel("Predicted")
plt.show()

# Part 4 - Projects, pipelines, and parallelism

Cookiecutter for project templates | Sphinx for documentation

In [None]:
# Set up the JSON cookiecutter template file

json_path.write_text(json.dumps({
    "project": "Creating Robust Python Workflows",
    # Convert the project name into snake_case
    "package": "{{ cookiecutter.project.lower().replace(' ', '_') }}",
    # Fill in the default license value
    "license": ["MIT", "BSD", "GPL3"]
}))

pprint(json.loads(json_path.read_text()))

#### json and pathlib modules to set up template

In [None]:
# Obtain keys from the local template's cookiecutter.json
keys = [*json.load(json_path.open())]
vals = "Your name here", "My Amazing Python Project"

# Create a cookiecutter project without prompting for input
main.cookiecutter(template_root.as_posix(), no_input=True,
                  extra_context=dict(zip(keys, vals)))

for path in pathlib.Path.cwd().glob("**"):
    print(path)

In this lesson, we used the pathlib and json modules from the standard library to learn more about a local Cookiecutter template and then used the template to set up a project with a custom name. Project templates are a powerful automation tool that can be combined with Make and Sphinx to save tremendous amounts of time and standardize project structure and contents.
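The `package` value written to `cookiecutter.json` earlier is a Jinja expression that Cookiecutter evaluates when it renders the template. The method calls inside it are ordinary Python string methods, so their effect can be checked directly in plain Python:

```python
project = "Creating Robust Python Workflows"

# Same calls as in the template expression: lowercase, then swap spaces for underscores
package = project.lower().replace(" ", "_")
print(package)  # creating_robust_python_workflows
```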

#### Zipapp - Create a zipped project

In [None]:
import subprocess
import zipapp

zipapp.create_archive(
    # Zip up a project called "myproject"
    "myproject",                    
    interpreter="/usr/bin/env python",
    # Generate a __main__.py file
    main="mypackage.mymodule:print_name_and_file")

print(subprocess.run([".venv/bin/python", "myproject.pyz"],
                     stdout=-1).stdout.decode())

In [None]:
def main():
    parser = argparse.ArgumentParser(description="Scikit datasets only!")
    # Set the default for the dataset argument
    parser.add_argument("dataset", nargs="?", default="diabetes")
    parser.add_argument("model", nargs="?", default="linear_model.Ridge")
    args = parser.parse_args()
    # Create a dictionary of the shell arguments
    kwargs = dict(dataset=args.dataset, model=args.model)
    return (classify(**kwargs) if args.dataset in ("digits", "iris", "wine")
            else regress(**kwargs) if args.dataset in ("boston", "diabetes")
            else print(f"{args.dataset} is not a supported dataset!"))

In this exercise, we learned how to set up a command-line interface for an executable project. Even if you do not plan to run projects in a shell very often, being able to pass command-line arguments to a zipped project is pretty cool! This is a great example of enjoying the convenience of a single file without sacrificing modularity! In the next video, we will learn how to pass parameters to Jupyter notebooks using the papermill library.
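To see the whole round trip in one place, the sketch below builds a throwaway project in a temporary folder, zips it with `zipapp`, and runs the archive with one command-line argument. The package, module, and function names are hypothetical placeholders, and the current interpreter (`sys.executable`) is used instead of a `.venv` one:

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Lay out a tiny project: myproject/mypackage/mymodule.py
    project = Path(tmp) / "myproject"
    package = project / "mypackage"
    package.mkdir(parents=True)
    (package / "__init__.py").write_text("")
    (package / "mymodule.py").write_text(
        "import sys\n\n"
        "def main():\n"
        "    print(__name__, sys.argv[1:])\n"
    )

    # main= makes zipapp generate a __main__.py that calls mymodule.main()
    archive = Path(tmp) / "myproject.pyz"
    zipapp.create_archive(project, target=archive, main="mypackage.mymodule:main")

    # Run the single-file archive and pass it a shell argument
    result = subprocess.run(
        [sys.executable, str(archive), "iris"], capture_output=True, text=True
    )
    output = result.stdout.strip()

print(output)  # mypackage.mymodule ['iris']
```

The argument arrives in `sys.argv` just as it would for an ordinary script, which is why an `argparse` parser inside the package keeps working after the project is zipped.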

#### Parametrize notebooks

In [None]:
# Read in the notebook to find the default parameter names
pprint(nbformat.read("sklearn.ipynb", as_version=4).cells[0].source)
keys = ["dataset_name", "model_type", "model_name", "hyperparameters"]
vals = ["diabetes", "ensemble", "RandomForestRegressor",
        dict(max_depth=3, n_estimators=100, random_state=0)]
parameter_dictionary = dict(zip(keys, vals))

# Execute the notebook with custom parameters
pprint(pm.execute_notebook(
    "sklearn.ipynb", "rf_diabetes.ipynb", 
    kernel_name="python3", parameters=parameter_dictionary
))

In this exercise, we accessed a notebook cell and executed a notebook programmatically using papermill. In the next exercise, we will use scrapbook to programmatically access notebook data.

#### Summarize notebooks

In [None]:
import scrapbook as sb

# Assign the scrapbook notebook object to nb
nb = sb.read_notebook("rf_diabetes.ipynb")

# Create a dataframe of scraps (recorded values)
scrap_df = nb.scrap_dataframe
print(scrap_df)

#### Parallel computing with Dask

In [None]:
import dask.dataframe as dd

# Read in a csv file using a dask.dataframe method
df = dd.read_csv("diabetes.csv")

df["bin_age"] = (df.age > 0).astype(int)

# Compute the column means in the two age groups
print(df.groupby("bin_age").mean().compute())

In the penultimate exercise of this course, we created and persisted a Dask dataframe. We also showed how we can end a method chain with compute() to obtain a result.

#### joblib

In the last exercise of this course, we will use the grid search technique to find the optimal hyperparameters for an elastic net model.

In [None]:
import joblib
from dask.distributed import Client

# Set up a Dask client with 4 threads and 1 worker
Client(processes=False, threads_per_worker=4, n_workers=1)

# Run grid search using joblib and a Dask backend
with joblib.parallel_backend("dask"):
    engrid.fit(x_train, y_train)

plot_enet(*enet_path(x_test, y_test, eps=5e-5, fit_intercept=False,
                    l1_ratio=engrid.best_params_["l1_ratio"])[:2])