## writing-functions-in-python/best-practices



## Best Practices
The goal of this course is to transform you into a Python expert, and so the first chapter starts off with best practices when writing functions. You'll cover docstrings and why they matter and how to know when you need to turn a chunk of code into a function. You will also learn the details of how Python passes arguments to functions, as well as some common gotchas that can cause debugging headaches when calling functions.

Docstrings       50 xp
Crafting a docstring       100 xp
Retrieving docstrings       100 xp
Docstrings to the rescue!       50 xp
DRY and "Do One Thing"       50 xp
Extract a function       100 xp
Split up a function       100 xp
Pass by assignment       50 xp
Mutable or immutable?       50 xp
Best practice for default arguments       100 xp

## Context Managers
If you've ever seen the "with" keyword in Python and wondered what its deal was, then this is the chapter for you! Context managers are a convenient way to provide connections in Python and guarantee that those connections get cleaned up when you are done using them. This chapter will show you how to use context managers, as well as how to write your own.

Using context managers       50 xp
The number of cats       100 xp
The speed of cats       100 xp
Writing context managers       50 xp
The timer() context manager       100 xp
A read-only open() context manager       100 xp
Advanced topics       50 xp
Context manager use cases       50 xp
Scraping the NASDAQ       100 xp
Changing the working directory       100 xp

## Decorators
Decorators are an extremely powerful concept in Python. They allow you to modify the behavior of a function without changing the code of the function itself. This chapter will lay the foundational concepts needed to thoroughly understand decorators (functions as objects, scope, and closures), and give you a good introduction into how decorators are used and defined. This deep dive into Python internals will set you up to be a superstar Pythonista.

Functions are objects       50 xp
Building a command line data app       100 xp
Reviewing your co-worker's code       100 xp
Returning functions for a math game       100 xp
Scope       50 xp
Understanding scope       50 xp
Modifying variables outside local scope       100 xp
Closures       50 xp
Checking for closure       100 xp
Closures keep your values safe       100 xp
Decorators       50 xp
Using decorator syntax       100 xp
Defining a decorator       100 xp

## More on Decorators
Now that you understand how decorators work under the hood, this chapter gives you a bunch of real-world examples of when and how you would write decorators in your own code. You will also learn advanced decorator concepts like how to preserve the metadata of your decorated functions and how to write decorators that take arguments.

Real-world examples       50 xp
Print the return type       100 xp
Counter       100 xp
Decorators and metadata       50 xp
Preserving docstrings when decorating functions       100 xp
Measuring decorator overhead       100 xp
Decorators that take arguments       50 xp
Run_n_times()       100 xp
HTML Generator       100 xp
Timeout(): a real world example       50 xp
Tag your functions       100 xp
Check the return type       100 xp
Great job!       50 xp

## Docstrings



A docstring should use imperative language:
    for example: "Split the DataFrame and stack the columns."
    instead of: "This function will split the DataFrame and stack the columns."

'''
1. Description of what the function does.
2. Description of the arguments, if any.
3. Description of the return values, if any.
4. Description of the errors raised, if any.
5. Optional extra notes or examples of usage.
'''


In [None]:
import numpy as np
import pandas as pd


def split_and_stack(df, new_names):
    '''Split a DataFrame's columns into two halves and then stack
    them vertically, returning a new DataFrame with `new_names` as the
    column names.

    Args:
      df (DataFrame): The DataFrame to split.
      new_names (iterable of str): The column names for the new DataFrame.

    Returns:
      DataFrame
    '''
    # Integer division gives the index of the first column of the right half.
    half = len(df.columns) // 2
    left = df.iloc[:, :half]
    right = df.iloc[:, half:]
    return pd.DataFrame(
        data=np.vstack([left.values, right.values]), columns=new_names)




In [None]:
# Google style:


def function(arg_1, arg_2=42):
    '''Description of what the function does.

    Args:
      arg_1 (str): Description of arg_1 that can break onto the next line
        if needed.
      arg_2 (int, optional): Write optional when an argument has a default value.

    Returns:
      bool: Optional description of the return value.
      Extra lines are not indented.

    Raises:
      ValueError: Include any error types that the function intentionally raises.

    Notes:
      See https://www.datacamp.com/community/tutorials/docstrings-python
      for more info.
    '''

In [None]:
# Numpydoc style:


def function(arg_1, arg_2=42):
    '''
    Description of what the function does.

    Parameters
    ----------
    arg_1 : expected type of arg_1
      Description of arg_1.
    arg_2 : int, optional
      Write optional when an argument has a default value.
      Default=42

    Returns
    -------
    The type of the return value
      Can include a description of the return value.
      Replace "Returns" with "Yields" if the function is a generator.
    '''
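
The note about generators can be sketched with a small example. The function `count_up` and its parameters are hypothetical, made up only to show where the "Yields" section goes in a Numpydoc-style docstring.

```python
import inspect


def count_up(start, stop=3):
    '''Yield consecutive integers from start up to, but not including, stop.

    Parameters
    ----------
    start : int
      First value to yield.
    stop : int, optional
      Value at which to stop (exclusive).
      Default=3

    Yields
    ------
    int
      The next integer in the sequence.
    '''
    current = start
    while current < stop:
        yield current
        current += 1


# Consume the generator, then show its docstring.
print(list(count_up(0)))
print(inspect.getdoc(count_up))
```

Everything else in the docstring stays the same as for a regular function; only the section name changes.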

In [5]:
# Retrieving docstrings


def the_answer():
    '''Return the answer to life, the universe, and everything
    
    
    Returns:
      int
      
    '''
    return 42



print(the_answer.__doc__)


import inspect
print(inspect.getdoc(the_answer))

Return the answer to life, the universe, and everything
    
    
    Returns:
      int
      
    
Return the answer to life, the universe, and everything


Returns:
  int
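
The two printouts above differ only in indentation: `__doc__` is the stored string, while `inspect.getdoc()` returns a cleaned-up copy. A minimal sketch of that difference follows; the function `greet` is a hypothetical stand-in. One caveat: as far as I know, Python 3.13 and later already strip the common indentation at compile time, so on those versions both lines print without leading spaces.

```python
import inspect


def greet():
    '''Return a friendly greeting.

    Returns:
      str
    '''
    return 'hello'


raw = greet.__doc__
clean = inspect.getdoc(greet)

# Third line of the stored docstring. On Python versions before 3.13 it
# still carries the four spaces of source indentation.
print(repr(raw.splitlines()[2]))

# getdoc() always returns the text with the shared indentation removed.
print(repr(clean.splitlines()[2]))
```

For display purposes (tooltips, help text), `inspect.getdoc()` is usually the more convenient of the two.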
  


## Crafting a docstring

You've decided to write the world's greatest open-source natural language processing Python package. It will revolutionize working with free-form text, the way numpy did for arrays, pandas did for tabular data, and scikit-learn did for machine learning.

The first function you write is count_letter(). It takes a string and a single letter and returns the number of times the letter appears in the string. You want the users of your open-source package to be able to understand how this function works easily, so you will need to give it a docstring. Build up a Google Style docstring for this function by following these steps.


Copy the following string and add it as the docstring for the function: Count the number of times `letter` appears in `content`.

In [13]:
# Add a docstring to count_letter()
def count_letter(content, letter):
    '''Count the number of times 'letter' appears in 'content'
    
    Args:
      content (str): The paragraph content of a doc or article
      letter (str): Any one of the 26 letters in the alphabet
  
    Returns:
      int: The number of times letter appeared in the whole content
  
    Raises:
      ValueError: Letter should be a one-character string
  
    Note:
      see https://campus.datacamp.com/courses/writing-functions-in-python/best-practices?ex=2
      for more info
    '''
    
    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    return len([char for char in content if char == letter])




a = 'A static method in Java does not translate to a Python classmethod. Oh sure, \
it results in more or less the same effect, but the goal of a classmethod is actually \
to do something that’s usually not even possible in Java (like inheriting a non-default \
constructor). The idiomatic translation of a Java static method is usually a module-level \
function, not a classmethod or staticmethod. (And static final fields should translate to \
module-level constants.)'

print(count_letter(a,letter = 'f'))


print(count_letter.__doc__)

8
Count the number of times 'letter' appears in 'content'
    
    Args:
      content (str): The paragraph content of a doc or article
      letter (str): Any one of the 26 letters in the alphabet
  
    Returns:
      int: The number of times letter appeared in the whole content
  
    Raises:
      ValueError: Letter should be a one-character string
  
    Note:
      see https://campus.datacamp.com/courses/writing-functions-in-python/best-practices?ex=2
      for more info
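
The documented `ValueError` is easy to check by hand. The sketch below repeats `count_letter()` with a shortened docstring so that it runs on its own; the sample text and the bad inputs are made up for illustration. It also compares the result with the built-in `str.count()`, which gives the same number for a single character.

```python
def count_letter(content, letter):
    '''Count the number of times `letter` appears in `content`.'''
    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    return len([char for char in content if char == letter])


text = 'writing functions in python'

# A valid single-character argument returns a count.
print(count_letter(text, 'n'))

# For a single character, str.count() agrees with count_letter().
print(count_letter(text, 'n') == text.count('n'))

# Anything that is not a one-character string raises ValueError.
for bad in ('ni', '', 7):
    try:
        count_letter(text, bad)
    except ValueError as err:
        print(repr(bad), '->', err)
```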
    


## Retrieving docstrings


You and a group of friends are working on building an amazing new Python IDE (integrated development environment -- like PyCharm, Spyder, Eclipse, Visual Studio, etc.). The team wants to add a feature that displays a tooltip with a function's docstring whenever the user starts typing the function name. That way, the user doesn't have to go elsewhere to look up the documentation for the function they are trying to use. You've been asked to complete the build_tooltip() function that retrieves a docstring from an arbitrary function.

You will be reusing the count_letter() function that you developed in the last exercise to show that we can properly extract its docstring.




Begin by getting the docstring for the function count_letter(). Use an attribute of the count_letter() function.


Hint

    We don't want to call the function (e.g. count_letter()). Instead, treat the function as an object (e.g. count_letter.<attribute_name>).
    Try running dir(count_letter) in the shell to see a list of all of the attributes that the function has.





Now use a function from the inspect module to get a better-formatted version of count_letter()'s docstring.

Hint

    Try running dir(inspect) in the shell to see the names of all of the available functions in the inspect module.





Now create a build_tooltip() function that can extract the docstring from any function that we pass to it.



In [27]:
# Add a docstring to count_letter()
def count_letter(content, letter):
    '''Count the number of times 'letter' appears in 'content'
    
    Args:
      content (str): The paragraph content of a doc or article
      letter (str): Any one of the 26 letters in the alphabet
  
    Returns:
      int: The number of times letter appeared in the whole content
  
    Raises:
      ValueError: Letter should be a one-character string
  
    Note:
      see https://campus.datacamp.com/courses/writing-functions-in-python/best-practices?ex=2
      for more info
    '''
    
    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    return len([char for char in content if char == letter])




a = 'A static method in Java does not translate to a Python classmethod. Oh sure, \
it results in more or less the same effect, but the goal of a classmethod is actually \
to do something that’s usually not even possible in Java (like inheriting a non-default \
constructor). The idiomatic translation of a Java static method is usually a module-level \
function, not a classmethod or staticmethod. (And static final fields should translate to \
module-level constants.)'

print(count_letter(a,letter = 's'))

dir(count_letter)
#?count_letter

import inspect
dir(inspect)
print(inspect.getdoc(the_answer))

33
Return the answer to life, the universe, and everything


Returns:
  int
  


In [31]:
import inspect

def build_tooltip(function):
    """Create a tooltip for any function that shows the
    function's docstring.
    
    Args:
      function (callable): The function we want a tooltip for.

    Returns:
      str
    """
    
    # Get the docstring for the "function" argument by using inspect
    docstring = inspect.getdoc(function)
    border = '#' * 28
    return '{} \n{} \n{}'.format(border, docstring, border)

print(build_tooltip(count_letter))
#print(build_tooltip(range))
#print(build_tooltip(print))

############################ 
Count the number of times 'letter' appears in 'content'

Args:
  content (str): The paragraph content of a doc or article
  letter (str): Any one of the 26 letters in the alphabet

Returns:
  int: The number of times letter appeared in the whole content

Raises:
  ValueError: Letter should be a one-character string

Note:
  see https://campus.datacamp.com/courses/writing-functions-in-python/best-practices?ex=2
  for more info 
############################
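
One edge case worth noting: `inspect.getdoc()` returns `None` when a function has no docstring, so `build_tooltip()` would then print the word "None". The variant below is a sketch of one way to handle that. The name `build_tooltip_safe` and the `placeholder` parameter are hypothetical additions, not part of the exercise.

```python
import inspect


def build_tooltip_safe(function, placeholder='(no docstring available)'):
    """Create a tooltip for a function, even if it has no docstring.

    Args:
      function (callable): The function we want a tooltip for.
      placeholder (str, optional): Text to show when there is no docstring.

    Returns:
      str
    """
    # inspect.getdoc() returns None when no docstring can be found.
    docstring = inspect.getdoc(function) or placeholder
    border = '#' * 28
    return '{}\n{}\n{}'.format(border, docstring, border)


def undocumented():
    return None


print(build_tooltip_safe(len))
print(build_tooltip_safe(undocumented))
```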


## Docstrings to the rescue!

Some maniac has corrupted your installation of numpy! All of the functions still exist, but they've been given random names. You desperately need to call the numpy.histogram() function and you don't have time to reinstall the package. Fortunately for you, the maniac didn't think to alter the docstrings, and you know how to access them. numpy has a lot of functions in it, so we've narrowed it down to four possible functions that could be numpy.histogram() in disguise: numpy.leyud(), numpy.uqka(), numpy.fywdkxa() or numpy.jinzyxq().

Examine each of these functions' docstrings in the IPython shell to determine which of them is actually numpy.histogram().


Possible Answers

    numpy.leyud()
    numpy.uqka()
    numpy.fywdkxa()
    numpy.jinzyxq()
    

Hint

    To view a function's docstring, you can either use print(function_name.__doc__) or print(inspect.getdoc(function_name)).
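
Instead of printing each candidate's docstring by hand, the search could also be automated. The sketch below uses the standard-library `math` module as a stand-in for the scrambled numpy, and the helper `find_by_docstring` is a hypothetical name invented for this example.

```python
import inspect
import math


def find_by_docstring(module, phrase):
    """Return the names of callables in `module` whose docstring mentions `phrase`.

    Args:
      module (module): The module to search.
      phrase (str): Text to look for, compared case-insensitively.

    Returns:
      list of str
    """
    matches = []
    for name in dir(module):
        obj = getattr(module, name)
        if not callable(obj):
            continue
        # getdoc() may return None, so fall back to an empty string.
        doc = inspect.getdoc(obj) or ''
        if phrase.lower() in doc.lower():
            matches.append(name)
    return matches


print(find_by_docstring(math, 'square root'))
```

The same idea would apply to the exercise by passing the numpy module and a phrase such as "histogram of a set of data".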


In [42]:
import numpy as numpy

#numpy.leyud.__doc__

#numpy.uqka.__doc__

#numpy.fywdkxa.__doc__

#print(numpy.jinzyxq.__doc__)

#print(numpy.array.__doc__)


print(numpy.histogram.__doc__)


    Compute the histogram of a set of data.

    Parameters
    ----------
    a : array_like
        Input data. The histogram is computed over the flattened array.
    bins : int or sequence of scalars or str, optional
        If `bins` is an int, it defines the number of equal-width
        bins in the given range (10, by default). If `bins` is a
        sequence, it defines a monotonically increasing array of bin edges,
        including the rightmost edge, allowing for non-uniform bin widths.

        .. versionadded:: 1.11.0

        If `bins` is a string, it defines the method used to calculate the
        optimal bin width, as defined by `histogram_bin_edges`.

    range : (float, float), optional
        The lower and upper range of the bins.  If not provided, range
        is simply ``(a.min(), a.max())``.  Values outside the range are
        ignored. The first element of the range must be less than or
        equal to the second. `range` affects the automatic bin
        c

In [2]:
print(numpy.leyud.__doc__)

    Gives a new shape to an array without changing its data.

    Parameters
    ----------
    a : array_like
        Array to be reshaped.
    newshape : int or tuple of ints
        The new shape should be compatible with the original shape. If
        an integer, then the result will be a 1-D array of that length.
        One shape dimension can be -1. In this case, the value is
        inferred from the length of the array and remaining dimensions.
    order : {'C', 'F', 'A'}, optional
        Read the elements of `a` using this index order, and place the
        elements into the reshaped array using this index order.  'C'
        means to read / write the elements using C-like index order,
        with the last axis index changing fastest, back to the first
        axis index changing slowest. 'F' means to read / write the
        elements using Fortran-like index order, with the first index
        changing fastest, and the last index changing slowest. Note that
        the 'C' and 'F' options take no account of the memory layout of
        the underlying array, and only refer to the order of indexing.
        'A' means to read / write the elements in Fortran-like index
        order if `a` is Fortran *contiguous* in memory, C-like order
        otherwise.

    Returns
    -------
    reshaped_array : ndarray
        This will be a new view object if possible; otherwise, it will
        be a copy.  Note there is no guarantee of the *memory layout* (C- or
        Fortran- contiguous) of the returned array.

    See Also
    --------
    ndarray.reshape : Equivalent method.

    Notes
    -----
    It is not always possible to change the shape of an array without
    copying the data. If you want an error to be raised when the data is copied,
    you should assign the new shape to the shape attribute of the array::

     >>> a = np.zeros((10, 2))
     # A transpose makes the array non-contiguous
     >>> b = a.T
     # Taking a view makes it possible to modify the shape without modifying
     # the initial object.
     >>> c = b.view()
     >>> c.shape = (20)
     AttributeError: incompatible shape for a non-contiguous array

    The `order` keyword gives the index ordering both for *fetching* the values
    from `a`, and then *placing* the values into the output array.
    For example, let's say you have an array:

    >>> a = np.arange(6).reshape((3, 2))
    >>> a
    array([[0, 1],
           [2, 3],
           [4, 5]])

    You can think of reshaping as first raveling the array (using the given
    index order), then inserting the elements from the raveled array into the
    new array using the same kind of index ordering as was used for the
    raveling.

    >>> np.reshape(a, (2, 3)) # C-like index ordering
    array([[0, 1, 2],
           [3, 4, 5]])
    >>> np.reshape(np.ravel(a), (2, 3)) # equivalent to C ravel then C reshape
    array([[0, 1, 2],
           [3, 4, 5]])
    >>> np.reshape(a, (2, 3), order='F') # Fortran-like index ordering
    array([[0, 4, 3],
           [2, 1, 5]])
    >>> np.reshape(np.ravel(a, order='F'), (2, 3), order='F')
    array([[0, 4, 3],
           [2, 1, 5]])

    Examples
    --------
    >>> a = np.array([[1,2,3], [4,5,6]])
    >>> np.reshape(a, 6)
    array([1, 2, 3, 4, 5, 6])
    >>> np.reshape(a, 6, order='F')
    array([1, 4, 2, 5, 3, 6])

    >>> np.reshape(a, (3,-1))       # the unspecified value is inferred to be 2
    array([[1, 2],
           [3, 4],
           [5, 6]])
    
In [3]:
print(numpy.uqka.__doc__)

    Returns the indices that would sort an array.

    Perform an indirect sort along the given axis using the algorithm specified
    by the `kind` keyword. It returns an array of indices of the same shape as
    `a` that index data along the given axis in sorted order.

    Parameters
    ----------
    a : array_like
        Array to sort.
    axis : int or None, optional
        Axis along which to sort.  The default is -1 (the last axis). If None,
        the flattened array is used.
    kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, optional
        Sorting algorithm.
    order : str or list of str, optional
        When `a` is an array with fields defined, this argument specifies
        which fields to compare first, second, etc.  A single field can
        be specified as a string, and not all fields need be specified,
        but unspecified fields will still be used, in the order in which
        they come up in the dtype, to break ties.

    Returns
    -------
    index_array : ndarray, int
        Array of indices that sort `a` along the specified axis.
        If `a` is one-dimensional, ``a[index_array]`` yields a sorted `a`.
        More generally, ``np.take_along_axis(a, index_array, axis=a)`` always
        yields the sorted `a`, irrespective of dimensionality.

    See Also
    --------
    sort : Describes sorting algorithms used.
    lexsort : Indirect stable sort with multiple keys.
    ndarray.sort : Inplace sort.
    argpartition : Indirect partial sort.

    Notes
    -----
    See `sort` for notes on the different sorting algorithms.

    As of NumPy 1.4.0 `argsort` works with real/complex arrays containing
    nan values. The enhanced sort order is documented in `sort`.

    Examples
    --------
    One dimensional array:

    >>> x = np.array([3, 1, 2])
    >>> np.argsort(x)
    array([1, 2, 0])

    Two-dimensional array:

    >>> x = np.array([[0, 3], [2, 2]])
    >>> x
    array([[0, 3],
           [2, 2]])

    >>> np.argsort(x, axis=0)  # sorts along first axis (down)
    array([[0, 1],
           [1, 0]])

    >>> np.argsort(x, axis=1)  # sorts along last axis (across)
    array([[0, 1],
           [0, 1]])

    Indices of the sorted elements of a N-dimensional array:

    >>> ind = np.unravel_index(np.argsort(x, axis=None), x.shape)
    >>> ind
    (array([0, 1, 1, 0]), array([0, 0, 1, 1]))
    >>> x[ind]  # same as np.sort(x, axis=None)
    array([0, 2, 2, 3])

    Sorting with keys:

    >>> x = np.array([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])
    >>> x
    array([(1, 0), (0, 1)],
          dtype=[('x', '<i4'), ('y', '<i4')])

    >>> np.argsort(x, order=('x','y'))
    array([1, 0])

    >>> np.argsort(x, order=('y','x'))
    array([0, 1])

    
In [4]:
print(numpy.fywdkxa.__doc__)

    Compute the histogram of a set of data.

    Parameters
    ----------
    a : array_like
        Input data. The histogram is computed over the flattened array.
    bins : int or sequence of scalars or str, optional
        If `bins` is an int, it defines the number of equal-width
        bins in the given range (10, by default). If `bins` is a
        sequence, it defines the bin edges, including the rightmost
        edge, allowing for non-uniform bin widths.

        .. versionadded:: 1.11.0

        If `bins` is a string, it defines the method used to calculate the
        optimal bin width, as defined by `histogram_bin_edges`.

    range : (float, float), optional
        The lower and upper range of the bins.  If not provided, range
        is simply ``(a.min(), a.max())``.  Values outside the range are
        ignored. The first element of the range must be less than or
        equal to the second. `range` affects the automatic bin
        computation as well. While bin width is computed to be optimal
        based on the actual data within `range`, the bin count will fill
        the entire range including portions containing no data.
    normed : bool, optional

        .. deprecated:: 1.6.0

        This is equivalent to the `density` argument, but produces incorrect
        results for unequal bin widths. It should not be used.

        .. versionchanged:: 1.15.0
            DeprecationWarnings are actually emitted.

    weights : array_like, optional
        An array of weights, of the same shape as `a`.  Each value in
        `a` only contributes its associated weight towards the bin count
        (instead of 1). If `density` is True, the weights are
        normalized, so that the integral of the density over the range
        remains 1.
    density : bool, optional
        If ``False``, the result will contain the number of samples in
        each bin. If ``True``, the result is the value of the
        probability *density* function at the bin, normalized such that
        the *integral* over the range is 1. Note that the sum of the
        histogram values will not be equal to 1 unless bins of unity
        width are chosen; it is not a probability *mass* function.

        Overrides the ``normed`` keyword if given.

    Returns
    -------
    hist : array
        The values of the histogram. See `density` and `weights` for a
        description of the possible semantics.
    bin_edges : array of dtype float
        Return the bin edges ``(length(hist)+1)``.


    See Also
    --------
    histogramdd, bincount, searchsorted, digitize, histogram_bin_edges

    Notes
    -----
    All but the last (righthand-most) bin is half-open.  In other words,
    if `bins` is::

      [1, 2, 3, 4]

    then the first bin is ``[1, 2)`` (including 1, but excluding 2) and
    the second ``[2, 3)``.  The last bin, however, is ``[3, 4]``, which
    *includes* 4.


    Examples
    --------
    >>> np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
    (array([0, 2, 1]), array([0, 1, 2, 3]))
    >>> np.histogram(np.arange(4), bins=np.arange(5), density=True)
    (array([ 0.25,  0.25,  0.25,  0.25]), array([0, 1, 2, 3, 4]))
    >>> np.histogram([[1, 2, 1], [1, 0, 1]], bins=[0,1,2,3])
    (array([1, 4, 1]), array([0, 1, 2, 3]))

    >>> a = np.arange(5)
    >>> hist, bin_edges = np.histogram(a, density=True)
    >>> hist
    array([ 0.5,  0. ,  0.5,  0. ,  0. ,  0.5,  0. ,  0.5,  0. ,  0.5])
    >>> hist.sum()
    2.4999999999999996
    >>> np.sum(hist * np.diff(bin_edges))
    1.0

    .. versionadded:: 1.11.0

    Automated Bin Selection Methods example, using 2 peak random data
    with 2000 points:

    >>> import matplotlib.pyplot as plt
    >>> rng = np.random.RandomState(10)  # deterministic random data
    >>> a = np.hstack((rng.normal(size=1000),
    ...                rng.normal(loc=5, scale=2, size=1000)))
    >>> plt.hist(a, bins='auto')  # arguments are passed to np.histogram
    >>> plt.title("Histogram with 'auto' bins")
    >>> plt.show()

    
In [5]:
print(numpy.jinzyxq.__doc__)

    Return an array of zeros with the same shape and type as a given array.

    Parameters
    ----------
    a : array_like
        The shape and data-type of `a` define these same attributes of
        the returned array.
    dtype : data-type, optional
        Overrides the data type of the result.

        .. versionadded:: 1.6.0
    order : {'C', 'F', 'A', or 'K'}, optional
        Overrides the memory layout of the result. 'C' means C-order,
        'F' means F-order, 'A' means 'F' if `a` is Fortran contiguous,
        'C' otherwise. 'K' means match the layout of `a` as closely
        as possible.

        .. versionadded:: 1.6.0
    subok : bool, optional.
        If True, then the newly created array will use the sub-class
        type of 'a', otherwise it will be a base-class array. Defaults
        to True.

    Returns
    -------
    out : ndarray
        Array of zeros with the same shape and type as `a`.

    See Also
    --------
    empty_like : Return an empty array with shape and type of input.
    ones_like : Return an array of ones with shape and type of input.
    full_like : Return a new array with shape of input filled with value.
    zeros : Return a new array setting values to zero.

    Examples
    --------
    >>> x = np.arange(6)
    >>> x = x.reshape((2, 3))
    >>> x
    array([[0, 1, 2],
           [3, 4, 5]])
    >>> np.zeros_like(x)
    array([[0, 0, 0],
           [0, 0, 0]])

    >>> y = np.arange(3, dtype=float)
    >>> y
    array([ 0.,  1.,  2.])
    >>> np.zeros_like(y)
    array([ 0.,  0.,  0.])

    

## DRY and "Do One Thing"



Copied-and-pasted code causes problems and mistakes.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

train = pd.read_csv('train.csv')
train_y = train['labels'].values
train_x = train[[col for col in train.columns if col != 'labels']].values
train_pca = PCA(n_components=2).fit_transform(train_x)
plt.scatter(train_pca[:,0], train_pca[:,1])


val = pd.read_csv('val.csv')
val_y = val['labels'].values
val_x = val[[col for col in val.columns if col != 'labels']].values
val_pca = PCA(n_components=2).fit_transform(val_x)
plt.scatter(val_pca[:,0], val_pca[:,1])


test = pd.read_csv('test.csv')
test_y = test['labels'].values
test_x = test[[col for col in test.columns if col != 'labels']].values
test_pca = PCA(n_components=2).fit_transform(test_x)
plt.scatter(test_pca[:,0], test_pca[:,1])



# Repeated code like this is a good sign that you should write a function. Let's do that.

In [None]:
def load_and_plot(path):
    '''Load a data set and plot the first two principal components.
    
    Args:
      path (str): The location of a CSV file.
      
    Returns:
      tuple of ndarray: (features, labels)
    
    '''
    
    data = pd.read_csv(path)
    Y = data['labels'].values
    # Keep every column except the label column as features.
    X = data[[col for col in data.columns if col != 'labels']].values
    pca = PCA(n_components=2).fit_transform(X)
    plt.scatter(pca[:,0], pca[:,1])
    
    return X, Y



train_X, train_y = load_and_plot('train.csv')
val_X, val_y = load_and_plot('val.csv')
test_X, test_y = load_and_plot('test.csv')


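## Wrap the repeated logic in a function, then call that function several times.

## Every function should have one responsibility: the "Do One Thing" principle

load_and_plot() does two things (it loads the data and it plots it), so the cells below split it into load_data() and plot_data().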




************************** THINK

In [45]:
import pandas as pd

def load_data(path):
    '''Load a data set
    
    Args:
      path (str): The location of the CSV file
      
    Returns:
      Tuple of ndarray: (features, labels)
      
    '''
    
    df = pd.read_csv(path)
    y = df['labels'].values
    X = df[[i for i in list(df.columns) if i != 'labels']].values
    #X = df[[i for i in df.columns if i != 'labels']].values
    #df[[i for i in list(df.columns) if i not in [list_of_columns_to_exclude]]]
    #df = pd.DataFrame([[i] for i in range(10)], columns=['num'])

    
    return X,  y


load_data('train.csv')

(array([[1, 'Cliff', 'DataScientist', 800000],
        [2, 'Frank', 'DataEngineer', 900008],
        [3, 'Steve', 'PythonDeveloper', 900001],
        [4, 'Coco', 'DataEngineer', 900002],
        [5, 'John', 'DataScientist', 900003]], dtype=object),
 array([1, 0, 0, 3, 1]))

In [64]:
import pandas as pd

def load_data(path):
    
    df = pd.read_csv(path)
    #y = df['labels'].values
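    y = df.loc[df.columns == 'labels']
    # <= Caution: df.columns == 'labels' is a boolean mask over the COLUMNS, but .loc
    #    with a single argument applies it to the ROWS. It only runs here because this
    #    frame happens to have as many rows as columns (5), and it selects row 4
    #    rather than the 'labels' column (see the output below).

    X = df.loc[:, df.columns != 'labels']
    # <= Boolean mask on the column axis: keeps every column except 'labels'
    #    and returns a DataFrame.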

        
    #X = pd.DataFrame([i] for i in df.columns, if i != 'labels')
    #df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
    #df.loc[:, df.columns != 'b']

    
    return X,  y


load_data('train.csv')

(   id   name              job  salary
 0   1  Cliff    DataScientist  800000
 1   2  Frank     DataEngineer  900008
 2   3  Steve  PythonDeveloper  900001
 3   4   Coco     DataEngineer  900002
 4   5   John    DataScientist  900003,
    id  name            job  salary  labels
 4   5  John  DataScientist  900003       1)

In [68]:
import pandas as pd

def load_data(path):
    
    df = pd.read_csv(path)
    
    y = df['labels']
    # <= Selecting a single column with the index operator [] and a column name
    #    (no list) returns a pandas Series, not a DataFrame. For example,
    #    df['A'] selects column 'A' as a Series.
    
    X = df[['id', 'name', 'job', 'salary']]
    # <= Selecting with a list of column names, [[...]], returns a DataFrame.
    
    return X,  y


load_data('train.csv')

(   id   name              job  salary
 0   1  Cliff    DataScientist  800000
 1   2  Frank     DataEngineer  900008
 2   3  Steve  PythonDeveloper  900001
 3   4   Coco     DataEngineer  900002
 4   5   John    DataScientist  900003,
 0    1
 1    0
 2    0
 3    3
 4    1
 Name: labels, dtype: int64)

In [67]:
import pandas as pd

def load_data(path):
    
    df = pd.read_csv(path)
    #y = df['labels'].values
    X = df[['id', 'name', 'job', 'salary']]
    # <= Selecting with a list of column names, [[...]], returns a DataFrame.
    
    return X


load_data('train.csv')

   id   name              job  salary
0   1  Cliff    DataScientist  800000
1   2  Frank     DataEngineer  900008
2   3  Steve  PythonDeveloper  900001
3   4   Coco     DataEngineer  900002
4   5   John    DataScientist  900003


In [69]:
import pandas as pd

def load_data(path):
    
    df = pd.read_csv(path)
    y = df[['labels']]
    #X = df[['id', 'name', 'job', 'salary']]
    # <= Selecting with a list of column names, [[...]], returns a DataFrame.
    
    return y


load_data('train.csv')

   labels
0       1
1       0
2       0
3       3
4       1


In [47]:
import pandas as pd

df = pd.read_csv('train.csv')
df

   id   name              job  salary  labels
0   1  Cliff    DataScientist  800000       1
1   2  Frank     DataEngineer  900008       0
2   3  Steve  PythonDeveloper  900001       0
3   4   Coco     DataEngineer  900002       3
4   5   John    DataScientist  900003       1


In [62]:
import pandas as pd

df = pd.read_csv('train.csv')
df['labels']    # <= returns a pandas Series
                # <=     A pandas Series is like a single column in a table.
                # <=     It is a one-dimensional labeled array that can hold data of any type.


# ---------------------------------------------------------------------------------------- #
# go read this: 
# https://cmdlinetips.com/2020/04/3-ways-to-select-one-or-more-columns-with-pandas/

0    1
1    0
2    0
3    3
4    1
Name: labels, dtype: int64

In [51]:
import pandas as pd

df = pd.read_csv('train.csv')
df[['labels']]

   labels
0       1
1       0
2       0
3       3
4       1


In [32]:
import pandas as pd

df = pd.read_csv('train.csv')
df['labels'].values

array([1, 0, 0, 3, 1])

In [50]:
import pandas as pd

df = pd.read_csv('train.csv')
df[['labels']].values

array([[1],
       [0],
       [0],
       [3],
       [1]])
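
The four selections above differ only in the type and shape of what comes back. A minimal sketch that puts them side by side, assuming an in-memory DataFrame built from the rows shown above as a stand-in for train.csv:

```python
import pandas as pd

# Stand-in for train.csv, built from the rows shown above (an assumption for this sketch)
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Cliff', 'Frank', 'Steve', 'Coco', 'John'],
    'job': ['DataScientist', 'DataEngineer', 'PythonDeveloper',
            'DataEngineer', 'DataScientist'],
    'salary': [800000, 900008, 900001, 900002, 900003],
    'labels': [1, 0, 0, 3, 1],
})

series = df['labels']            # single brackets -> pandas Series, shape (5,)
frame = df[['labels']]           # double brackets -> pandas DataFrame, shape (5, 1)
flat = df['labels'].values       # NumPy array, shape (5,)
column = df[['labels']].values   # NumPy array, shape (5, 1)

for result in (series, frame, flat, column):
    print(type(result).__name__, result.shape)
```

Which form is appropriate depends on the consumer: some functions expect a one-dimensional array of labels, while others expect a two-dimensional table even when it has a single column.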

In [34]:
#print(i*i for i in range(10) if i%2==0)

#print(sum(i*i for i in range(4) if i%2 != 0),sum(i*i for i in range(7) if i%2 == 1))

print(i*i for i in range(4) if i%2 != 0)

print(list(i*i for i in range(4) if i%2 != 0))

<generator object <genexpr> at 0x7f9eedb9b7b0>
[1, 9]
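
Passing a bare generator expression to print() shows the generator object rather than its values, because nothing has iterated over it yet. A short sketch of how such a generator gets consumed (the variable names are illustrative):

```python
# A generator expression is lazy: it produces values only when something iterates over it
odd_squares = (i * i for i in range(4) if i % 2 != 0)

first_pass = list(odd_squares)    # [1, 9]
second_pass = list(odd_squares)   # [] because the generator is already exhausted

# Functions that accept an iterable can consume a generator expression directly
total = sum(i * i for i in range(4) if i % 2 != 0)   # 1 + 9 = 10

print(first_pass, second_pass, total)
```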


In [None]:
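from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_data(X):
    '''Plot the first two principal components of a matrix
    
    Args:
      X (numpy.ndarray): The data to plot
    
    '''
    
    # Project X onto its first two principal components, then scatter-plot them
    pca = PCA(n_components=2).fit_transform(X)
    plt.scatter(pca[:,0], pca[:,1])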
    
    

In [85]:
import pandas as pd

data = pd.read_csv('train.csv')
print(data.columns)
list(data.columns)
#X = data[col for col in data.columns if col != 'labels'].values
#X

Index(['id', ' name', ' job', ' salary', ' labels'], dtype='object')


['id', ' name', ' job', ' salary', ' labels']

In [97]:
import pandas as pd

data = pd.read_csv('train.csv')
data.columns
print(col for col in data.columns)
X = data[col for col in list(data.columns) if col != 'labels'].values
X

SyntaxError: invalid syntax (4257226009.py, line 6)
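
The SyntaxError above comes from placing a bare generator expression inside the indexing brackets. One possible fix, sketched under the assumption that the goal is to select every column except the label column, is to wrap the expression in a list comprehension. The column names printed earlier carry a leading space (' labels'), so this sketch also strips them first. The inline CSV is a stand-in for train.csv.

```python
import io
import pandas as pd

# Stand-in for train.csv with a space after each comma, as the printed column names suggest
csv_text = """id, name, job, salary, labels
1, Cliff, DataScientist, 800000, 1
2, Frank, DataEngineer, 900008, 0
"""
data = pd.read_csv(io.StringIO(csv_text))

# Remove the leading spaces from the column names: ' labels' -> 'labels'
data.columns = data.columns.str.strip()

# A list comprehension (note the inner brackets) is valid inside the indexing operator
feature_cols = [col for col in data.columns if col != 'labels']
X = data[feature_cols].values

print(feature_cols)
print(X.shape)
```

Without the strip step, the comparison `col != 'labels'` would never match ' labels', and the label column would silently stay in X.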

In [92]:
for col in list(data.columns): print(col)

id
 name
 job
 salary
 labels


In [93]:
print([col] for col in list(data.columns))

<generator object <genexpr> at 0x7fb14169d740>


## Extract a function

While you were developing a model to predict the likelihood of a student graduating from college, you wrote this bit of code to get the z-scores of students' yearly GPAs. Now you're ready to turn it into a production-quality system, so you need to do something about the repetition. Writing a function to calculate the z-scores would improve this code.

# Standardize the GPAs for each year
df['y1_z'] = (df.y1_gpa - df.y1_gpa.mean()) / df.y1_gpa.std()
df['y2_z'] = (df.y2_gpa - df.y2_gpa.mean()) / df.y2_gpa.std()
df['y3_z'] = (df.y3_gpa - df.y3_gpa.mean()) / df.y3_gpa.std()
df['y4_z'] = (df.y4_gpa - df.y4_gpa.mean()) / df.y4_gpa.std()

Note: df is a pandas DataFrame where each row is a student with 4 columns of yearly student GPAs: y1_gpa, y2_gpa, y3_gpa, y4_gpa




    Finish the function so that it returns the z-scores of a column.
    Use the function to calculate the z-scores for each year (df['y1_z'], df['y2_z'], etc.) from the raw GPA scores (df.y1_gpa, df.y2_gpa, etc.).

Hint

    Notice how (df.y1_gpa - df.y1_gpa.mean()) / df.y1_gpa.std() is only performing operations on df.y1_gpa. So you should be able to pass df.y1_gpa as the column argument to the standardize() function.


In [None]:
def standardize(column):
    """Standardize the values in a column.

    Args:
      column (pandas Series): The data to standardize.

    Returns:
      pandas Series: the values as z-scores
    """
    # Finish the function so that it returns the z-scores
    z_score = (column - column.mean()) / column.std()
    return z_score

# Use the standardize() function to calculate the z-scores
df['y1_z'] = standardize(df.y1_gpa)
df['y2_z'] = standardize(df.y2_gpa)
df['y3_z'] = standardize(df.y3_gpa)
df['y4_z'] = standardize(df.y4_gpa)
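
The four calls above still follow one pattern, so a loop over the year numbers is one way to shorten them further. The sketch below assumes a small made-up DataFrame that uses the column names from the exercise; the GPA values are illustrative only.

```python
import pandas as pd

def standardize(column):
    """Return the z-scores of a pandas Series."""
    return (column - column.mean()) / column.std()

# Made-up GPAs for three students (illustrative values, not the exercise data)
df = pd.DataFrame({
    'y1_gpa': [2.0, 3.0, 4.0],
    'y2_gpa': [2.5, 3.0, 3.5],
    'y3_gpa': [3.0, 3.5, 4.0],
    'y4_gpa': [2.0, 2.5, 3.0],
})

# Build each 'yN_z' column from the matching 'yN_gpa' column
for year in range(1, 5):
    df[f'y{year}_z'] = standardize(df[f'y{year}_gpa'])

print(df[['y1_gpa', 'y1_z']])
```

Note that pandas' `Series.std()` uses the sample standard deviation (ddof=1) by default, so the z-scores here are scaled by that estimate.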

## Split up a function

Another engineer on your team has written this function to calculate the mean and median of a sorted list. You want to show them how to split it into two simpler functions: mean() and median().

def mean_and_median(values):
  """Get the mean and median of a sorted list of `values`

  Args:
    values (iterable of float): A list of numbers

  Returns:
    tuple (float, float): The mean and median
  """
  mean = sum(values) / len(values)
  midpoint = int(len(values) / 2)
  if len(values) % 2 == 0:
    median = (values[midpoint - 1] + values[midpoint]) / 2
  else:
    median = values[midpoint]

  return mean, median



Write the mean() function.

In [71]:
def mean_and_median(values):
    """Get the mean and median of a sorted list of `values`

    Args:
      values (iterable of float): A list of numbers

    Returns:
      tuple (float, float): The mean and median
    """
    mean = sum(values) / len(values)
    midpoint = int(len(values) / 2)
    if len(values) % 2 == 0:
        median = (values[midpoint - 1] + values[midpoint]) / 2
    else:
        median = values[midpoint]

    return mean, median



values = [1,3,4,7,2,8,9,12,13]
mean_and_median(values)

(6.555555555555555, 2)
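
The median reported above, 2, is simply the middle element of the list as given, because `values` was not sorted before the call. The function only indexes into the list, so it relies on the caller passing sorted input. A sketch of the difference, using the standard library's statistics.median as a cross-check:

```python
import statistics

def mean_and_median(values):
    """Get the mean and median of a sorted list of `values`."""
    mean = sum(values) / len(values)
    midpoint = len(values) // 2
    if len(values) % 2 == 0:
        median = (values[midpoint - 1] + values[midpoint]) / 2
    else:
        median = values[midpoint]
    return mean, median

values = [1, 3, 4, 7, 2, 8, 9, 12, 13]

_, unsorted_median = mean_and_median(values)         # middle element of the raw list: 2
_, sorted_median = mean_and_median(sorted(values))   # middle element after sorting: 7

print(unsorted_median, sorted_median, statistics.median(values))
```

The mean is unaffected by ordering, which is one more reason the two calculations are easier to reason about as separate functions.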

In [None]:
def mean(values):
    """Get the mean of a sorted list of values
  
    Args:
      values (iterable of float): A list of numbers
  
    Returns:
      float
    """
    # Write the mean() function
    mean = sum(values)/len(values)
    return mean

In [None]:
def median(values):
    """Get the median of a sorted list of values
  
    Args:
      values (iterable of float): A list of numbers
  
    Returns:
      float
    """
    # Write the median() function
    midpoint = int(len(values)/2)
    if len(values)%2 == 0:
        median = (values[midpoint-1] + values[midpoint])/2
    else:
        median = values[midpoint]
    return median

## Pass by assignment




