# Programming in Python for Data Science 

# Assignment 7: Importing Files and the Coding Style Guide

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to two videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).       

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Describe what Python libraries are, as well as explain when and why they are useful.
- Identify where code can be improved concerning variable names, magic numbers, comments and whitespace.
- Write code that is human readable and follows the black style guide.
- Import files from other directories.
- Use [`pytest`](https://docs.pytest.org/en/stable/) to check a function's tests.
- When running [`pytest`](https://docs.pytest.org/en/stable/), explain how pytest finds the associated test functions.
- Explain how the Python debugger can help rectify your code.

This assignment covers [Module 7](https://prog-learn.mds.ubc.ca/en/module7) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Replace the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` with your completed code and answers, then run the cell.

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [1]:
# Import libraries needed for this lab
import test_assignment7 as t
from hashlib import sha1
import numpy as np

## 1.   Importing libraries   

**Question 1(a)** <br> {points: 1}  

Import the `pandas` library and name it `pd` in the worksheet environment. 

In [2]:
import pandas as pd

In [3]:
t.test_1a(dir())

'Success'

**Question 1(b)** <br> {points: 1}  

Import the Altair library into the worksheet environment. 

In [4]:
import altair as alt

In [5]:
t.test_1b(dir())

'Success'

**Question 1(c)** <br> {points: 1}  

From the `numpy` library, import only the `arange()` function using the keyword `from`. 

In [6]:
from numpy import arange

In [7]:
t.test_1c()

'Success'
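
The three questions above use three different import styles. As a minimal sketch of how they differ, the example below applies each style to the standard library's `statistics` module (chosen only so the snippet runs without third-party packages; it is not part of the assignment). The variable names are illustrative.

```python
# Style 1: import the whole module under its own name.
import statistics

# Style 2: import the module under a shorter alias (like `import pandas as pd`).
import statistics as st

# Style 3: import a single function directly (like `from numpy import arange`).
from statistics import mean

scores = [2, 4, 6]

# All three names reach the same function, only the spelling at the call site changes.
via_module = statistics.mean(scores)
via_alias = st.mean(scores)
via_name = mean(scores)

print(via_module, via_alias, via_name)
```

An alias keeps calls short while still showing where a function comes from, whereas `from ... import ...` drops the prefix entirely, which is convenient but makes the origin of a name less obvious in longer files.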

## 2. Working with other files  

**Question 2(a)** <br> {points: 1}  

Load in the `chopped.csv` file from the data folder and save it as an object named `chopped`.

In [8]:
chopped = pd.read_csv('data/chopped.csv')

In [9]:
t.test_2a(chopped)

'Success'

**Question 2(b)** <br> {points: 1}  

Import the function `sample_dataframe()` (that we created in Assignment 6) from `sampling.py`.

In [10]:
from sampling import sample_dataframe

In [11]:
t.test_2b(dir())

'Success'
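
The import above works because `sampling.py` sits in the same folder as the notebook, which Python searches by default. When a module lives in another directory, one common approach is to add that directory to `sys.path` first. The sketch below demonstrates this with a throwaway module; the file name `greetings_demo.py` and the function `greet()` are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a temporary directory containing a tiny module (hypothetical name).
module_dir = Path(tempfile.mkdtemp())
(module_dir / "greetings_demo.py").write_text(
    "def greet(name):\n"
    "    return f'Hello, {name}!'\n"
)

# Python only imports modules it can find on sys.path, so register the folder.
sys.path.append(str(module_dir))
importlib.invalidate_caches()

# The module can now be imported as if it were in the working directory.
from greetings_demo import greet

print(greet("Chopped"))
```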

**Question 2(c)** <br> {points: 2}  

To refresh yourself on what the function `sample_dataframe()` does, inspect the function docstring.  

Which of the following is the correct way to inspect the docstring of the function `sample_dataframe()`?     
*Hint: Try it out yourself*

A) `?sample.sample_dataframe`

B) `?sample.sample_dataframe()` 

C) `?sample_dataframe`

D) `?sample_dataframe()`


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks and assign it to an object called `answer2_c`.*


In [12]:
answer2_c = 'C'

In [13]:
# Note that this test has been hidden intentionally.
# It will provide no feedback as to the correctness of your answers.
# Thus, it is up to you to decide if your answer is sufficiently correct.

In [14]:
?sample_dataframe

Signature: sample_dataframe(data, grouping_col, N=1)
Docstring:
Given a dataframe, return a smaller sample of the dataframe
sampling N rows from each specified group

Parameters
----------
data : pandas.core.frame.DataFrame
    The dataframe to sample from
grouping_col : str
    The column to filter our condition on
N : int, optional
    The number of rows to sample from each group (The default value is 1
    which implies a single observation)

Returns
-------
pandas.core.frame.DataFrame
    The new sampled dataframe

Examples
--------
>>> sample_dataframe(pokemon, 'legendary')
    name     deck_no  attack  defense  type    gen  legendary
411 Burmy     412     29        45      bug     4      0
640 Tornadus  641     100       80      flying  5      1
File:      ~/prog-python-data-science-students/release/assignment7/sampling.py


**Question 2(d)** <br> {points: 1}  

Based on the docstring, which parameter is optional?       
Answer the parameter name as a `str` in the object `answer2_d`. 

In [15]:
answer2_d = 'N'

In [16]:
t.test_2d(answer2_d)

'Success'

**Question 2(e)** <br> {points: 1}  

Based on the docstring, which parameter accepts data types of `str`?      

Answer the parameter name as a `str` in the object `answer2_e`. 

In [17]:
answer2_e = 'grouping_col'

In [18]:
t.test_2e(answer2_e)

'Success'
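
The `?name` syntax used above is specific to IPython and Jupyter. In a plain Python script or interpreter, the same information can be reached through the standard `inspect` module (or the built-in `help()`). The sketch below uses a hypothetical function `scale()` purely as a stand-in; it is not the assignment's `sample_dataframe()`.

```python
import inspect


def scale(values, factor=2):
    """Multiply every number in values by factor.

    Parameters
    ----------
    values : list
        The numbers to scale
    factor : int, optional
        The multiplier (the default value is 2)
    """
    return [value * factor for value in values]


# The signature lists the parameters and any default values.
signature = inspect.signature(scale)
# getdoc() returns the docstring with its indentation cleaned up.
docstring = inspect.getdoc(scale)

# Parameters that have a default value are the optional ones.
optional_params = [
    name
    for name, parameter in signature.parameters.items()
    if parameter.default is not inspect.Parameter.empty
]

print(f"scale{signature}")
print(docstring)
print(optional_params)
```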

**Question 2(f)** <br> {points: 1}  

Sample two rows from each season from the `chopped` dataframe using your function `sample_dataframe`.     

Save this in an object named `chopped_sample`.

In [19]:
chopped_sample = sample_dataframe(chopped, 'season', 2)

In [20]:
t.test_2f(chopped_sample)

'Success'

## 3. Using Pytest

We have provided you with another file called `test_sampling.py` which contains multiple functions that test if our `sample_dataframe()` function is working properly. 

**Question 3(a)** <br> {points: 1}  

The tests for `sample_dataframe()` are located in a different file from the function, which means we need to import the function from our `sampling.py` file at the top of `test_sampling.py`. 

Open `test_sampling.py` and on line 2, write code to import the `sample_dataframe()` function. 

In [21]:
t.test_3a()

'Success'
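
As a rough sketch of what a test file looks like: by default, pytest collects files named `test_*.py` or `*_test.py` and, inside them, runs functions whose names start with `test`. A test passes when none of its `assert` statements fail. The function `first_n()` and both test names below are hypothetical. In a real project the function would live in its own module and be imported at the top of the test file, as in Question 3(a); here everything sits in one block so the example is self-contained.

```python
def first_n(items, n=1):
    """Return the first n elements of items as a list."""
    return list(items)[:n]


# pytest would discover these functions because their names start with "test".
def test_first_n_default_returns_one_item():
    assert first_n(["a", "b", "c"]) == ["a"]


def test_first_n_respects_n():
    assert first_n(["a", "b", "c"], n=2) == ["a", "b"]


# pytest calls the tests for you; calling them by hand mimics a passing run.
test_first_n_default_returns_one_item()
test_first_n_respects_n()
print("2 hand-run tests passed")
```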

**Question 3(b)** <br>

We are going to do things a little differently than in the lesson here. 
In a Jupyter notebook, we can check whether all the tests in `test_sampling.py` pass by running `!pytest test_sampling.py` in a code cell. 


Try it out in the cell below and answer the following multiple choice questions regarding the results. 

In [22]:
!pytest test_sampling.py

platform linux -- Python 3.8.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/jupyter/prog-python-data-science-students/release/assignment7
plugins: anyio-3.2.1, dash-1.20.0
collected 6 items

test_sampling.py .....F                                                  [100%]

________________________________ test_sd_cherry ________________________________

    def test_sd_cherry():
        raw = {'id': [1873, 4913, 4801, 4540, 3581,
                       4534, 1934, 4944, 1983, 1266],
               'name': ['English Oak

**Question 3(b-i)** <br> {points: 1}  

How many of the tests from `test_sampling.py` passed?      
*Assign the correct answer to an object called `tests_passed`.*

In [23]:
tests_passed = 5

In [24]:
t.test_3bi(tests_passed)

'Success'

**Question 3(b-ii)** <br> {points: 2}  

How many of the tests from `test_sampling.py` failed?      
*Assign the correct answer to an object called `tests_failed`.*

In [25]:
tests_failed = 1

In [26]:
# Note that this test has been hidden intentionally.
# It will provide no feedback as to the correctness of your answers.
# Thus, it is up to you to decide if your answer is sufficiently correct.

**Question 3(b-iii)** <br> {points: 1}  

Name a test that did not pass.   
*Assign the correct answer to an object called `failed_name`.*

In [27]:
failed_name = 'test_sd_cherry'

In [28]:
t.test_3biii(failed_name)

'Success'

## 4. Black and Flake8 Formatting

**Question 4(a)** <br>

Run Flake8 on our `sampling.py` file in the cell below or in the terminal and answer the questions that follow.


In [29]:
!flake8 sampling.py

**Question 4(a-i)** <br> {points: 1}  

How many formatting issues did flake8 recognize in the `sampling.py` file?      
*Assign the correct answer to an object called `answer4_ai`.*

In [30]:
answer4_ai = 12

In [31]:
t.test_4ai(answer4_ai)

'Success'

**Question 4(a-ii)** <br> {points: 1}  

How many `W291 trailing whitespace` issues are there? (We will talk a little bit about trailing and leading white space in Module 8)       
*Assign the correct answer to an object called `answer4_aii`.*

In [32]:
answer4_aii = 1

In [33]:
t.test_4aii(answer4_aii)

'Success'

**Question 4(a-iii)** <br> {points: 1}  


Which of the following is the formatting issue that occurs on line 36?     


A) `E222 multiple spaces after operator`

B) `W293 blank line contains whitespace`

C) `W291 trailing whitespace`

D) `E251 unexpected spaces around keyword / parameter equals` 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks and assign it to an object called `answer4_aiii`.*


In [34]:
answer4_aiii = 'C'

In [35]:
t.test_4aiii(answer4_aiii)

'Success'
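
To make the difference between `W291` and `W293` concrete, here is a toy checker that flags the same two situations. It is only an illustration of what the codes mean, not how flake8 is implemented, and the function name `find_whitespace_issues()` is hypothetical.

```python
def find_whitespace_issues(source):
    """Return (line_number, code) pairs for whitespace problems in source."""
    issues = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if line == line.rstrip():
            continue  # No trailing whitespace on this line.
        if line.strip() == "":
            # The line holds nothing but spaces or tabs.
            issues.append((line_number, "W293"))
        else:
            # The line has code followed by extra spaces or tabs.
            issues.append((line_number, "W291"))
    return issues


# Line 1 ends with a space, line 2 contains only spaces, line 3 is clean.
sample = "total = 1 \n   \nprint(total)\n"
print(find_whitespace_issues(sample))
```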

**Question 4(b)**  {points: 1}  

Run `black` on our `sampling.py` file in the cell below or in the terminal and answer the questions that follow.


In [36]:
!black sampling.py

All done! ✨ 🍰 ✨
1 file left unchanged.


Which code would you use in a Jupyter code cell to run Black?

A) `black sampling.py`

B) `!black sampling.py` 

C) `sampling.black()`

D) `black.sampling()`


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks and assign it to an object called `answer4_b`.*

In [37]:
answer4_b = 'B'

In [38]:
t.test_4b(answer4_b)

'Success'

**Question 4(c)** <br> {points: 2}  

Now that we have reformatted our `sampling.py` file, let's rerun flake8 just as we did before, see how many of our formatting issues have been fixed, and answer the question below. 

In [39]:
!flake8 sampling.py

How many formatting issues are left when we rerun flake8 after formatting `sampling.py` with `black`?

*Assign the correct answer to an object called `answer4_c`.*

In [40]:
answer4_c = 0

In [41]:
# Note that this test has been hidden intentionally.
# It will provide no feedback as to the correctness of your answers.
# Thus, it is up to you to decide if your answer is sufficiently correct.

## 5. Style Guide - Comments and Variable Names

**Question 5(a)** <br> {points: 1}  

Which of the following names is most fitting for an object that contains a list of column names from a dataframe named `metals`? 

A) `metal_columns`

B) `columnsfrommetaldataframe`

C) `list`

D) `c_metals` 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks and assign it to an object called `answer5_a`.*


In [42]:
answer5_a = 'A'

In [43]:
t.test_5a(answer5_a)

'Success'

**Question 5(b)** <br> {points: 1}  

Which of the following names is the best fit for an object that holds a dataframe of different lightbulb types?

A) `LIGHTBULBS`

B) `dataframe_where_lightbulbs_data_stored`

C) `data`

D) `lightbulb_df` 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks and assign it to an object called `answer5_b`.*


In [44]:
answer5_b = 'D'

In [45]:
t.test_5b(answer5_b)

'Success'

**Question 5(c)** <br> {points: 2}  

Which of the following is NOT a reasonable comment to include in your code?

A) `# Keep this line of code in, or the function will break mysteriously`

B) `# Rename columns to shorter column names`

C) `# This assigns all the values greater than 100 a value of 100.`

D) `# TODO: Fix this next part so it's more readable and doesn't include magic numbers` 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks and assign it to an object called `answer5_c`.*

In [46]:
answer5_c = 'A'

In [47]:
# Note that this test has been hidden intentionally.
# It will provide no feedback as to the correctness of your answers.
# Thus, it is up to you to decide if your answer is sufficiently correct.
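
Two of the comment options above touch on magic numbers, so a short illustration may help. The sketch below contrasts a hard-coded limit with a named constant; the names `MAX_SCORE` and `cap_scores()` are hypothetical and chosen only for this example.

```python
# Harder to read: what does 100 mean here, and where else is it used?
def cap_scores_unclear(scores):
    return [min(score, 100) for score in scores]


# Clearer: the limit has a name, is defined once, and can be overridden.
MAX_SCORE = 100


def cap_scores(scores, upper_limit=MAX_SCORE):
    """Return scores with anything above upper_limit replaced by upper_limit."""
    return [min(score, upper_limit) for score in scores]


print(cap_scores([50, 150]))
print(cap_scores([50, 150], upper_limit=120))
```

With a named constant, a comment such as "assign all values greater than 100 a value of 100" becomes largely unnecessary, because the code already says what it does.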

**Question 5(d)** <br> {points: 2}  

Below is a function that plots a histogram of a specified quantitative column.
We want you to identify the 4 poorly designed elements within this function, and rewrite/rename them to something that is more appropriate. 

Copy and paste the function into the cell that follows it and then make your desired changes.

*Hint: The function name does not need to be changed* 

In [48]:
import altair as alt


def column_histogram(data, column_name):
    """
    
    Given a dataframe, this function creates a histogram
    of the values from a specified column
    
    Parameters
    ----------
    data : pandas.core.frame.DataFrame
        The dataframe to filter
    column_name : str
        The column values to plot
        
    Returns
    -------
    altair.vegalite.v4.api.Chart 
        the plotted histogram
        
    Examples
    --------
    >>> column_histogram(chopped, "season")
    altair.vegalite.v4.api.Chart 
    """
    
    # This checks if the data variable is of type pd.dataframe
    if not isinstance(data, pd.DataFrame): 
        raise TypeError("The data argument is not of type DataFrame")   
    
    # This checks if the column dtype of column_name 
    cs = column_name + ":Q"
    
    # This makes a histogram and plots the values of column_name frequency, it could be useful. 
    histogram_plot_of_column_name = alt.Chart(data).mark_bar().encode(
                                        alt.X( cs, bin=True),
                                              y='count()',
                                    )
    
    # This function now returns a histogram 
    return histogram_plot_of_column_name

In [63]:
def column_histogram(data, column_name: str):
    """
    
    Given a dataframe, this function creates a histogram
    of the values from a specified column
    
    Parameters
    ----------
    data : pandas.core.frame.DataFrame
        The dataframe to filter
    column_name : str
        The column values to plot
        
    Returns
    -------
    altair.vegalite.v4.api.Chart 
        the plotted histogram
        
    Examples
    --------
    >>> column_histogram(chopped, "season")
    altair.vegalite.v4.api.Chart 
    """
    
    # Check that data is a pandas DataFrame
    if not isinstance(data, pd.DataFrame):
        raise TypeError("The data argument is not of type DataFrame")

    # Append Altair's ":Q" shorthand so the column is treated as quantitative
    coltype = column_name + ":Q"

    # Plot a histogram of the binned column values against their counts
    hist_plot = (
        alt.Chart(data)
        .mark_bar()
        .encode(
            alt.X(coltype, bin=True),
            y="count()",
        )
    )
    return hist_plot

In [64]:
t.test_5d(column_histogram)

'Success'

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions
- UBC's original STAT545 - [Stat545 by Jenny Bryan](https://stat545.com/)
- MDS DSCI 523 - Data Wrangling course - [MDS's GitHub website](https://ubc-mds.github.io/) 
- Chopped Dataset - [Kaggle](https://www.kaggle.com/jeffreybraun/chopped-10-years-of-episode-data)

## Module Debriefing

If this video is not showing up below, click on the cell and click the ▶ button in the toolbar above.

In [51]:
from IPython.display import YouTubeVideo
YouTubeVideo('hBGFNWtYoYw', width=854, height=480)