In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab01.ipynb")

# Lab 01: Introduction to Data Science Review

Welcome to the first lab of Advanced Data Science! This lab is meant to help you familiarize yourself with JupyterHub, review Python and NumPy, and introduce you to matplotlib, a Python visualization library.

To receive credit for a lab, answer all questions correctly and submit before the deadline.

**Due Date:** Thursday, February 4, 2021 at 7:00 p.m.

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

## 1. Importing Libraries and Magic Commands

In Advanced Data Science, we will be using common Python libraries to help us process data. By convention, we import all libraries at the very top of the notebook. There is also a set of standard aliases used to shorten the library names. Below are some of the libraries that you may encounter throughout the course, along with their respective aliases.

### 1.1. Importing Libraries

Run the cell below, but please **don't** change it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns

%matplotlib inline

import otter
grader = otter.Notebook()

### 1.2. Magic Commands

`%matplotlib inline` is a Jupyter magic command that configures the notebook so that Matplotlib displays any plots that you draw directly in the notebook rather than to a file, allowing you to view the plots upon executing your code.

Another useful magic command is `%%time`, which times the execution of that cell. You can use this by writing it as the first line of a cell.

**Note:** `%%` is used for cell magic commands that apply to the entire cell, whereas `%` is used for line magic commands that only apply to a single line.

In [2]:
%%time
numbers = []  # avoid the name `list`, which would shadow the built-in type
for i in range(100):
    numbers.append(i)
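
Magic commands are IPython features rather than Python syntax, so `%%time` does not work in a plain `.py` script. As a rough stand-in, here is a minimal sketch that times the same loop by hand with the standard library's `time.perf_counter`. The names `numbers`, `start`, and `elapsed` are illustrative and not part of the lab.

```python
import time

start = time.perf_counter()            # clock reading before the work starts
numbers = []
for i in range(100):
    numbers.append(i)
elapsed = time.perf_counter() - start  # wall-clock seconds spent in the loop

print(f"Wall time: {elapsed:.6f} s")
```

Unlike `%%time`, this sketch reports only wall-clock time, not CPU time.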

## 2. Keyboard Shortcuts

Even if you are familiar with Jupyter, we strongly encourage you to become proficient with keyboard shortcuts (this will save you time in the future). To learn about keyboard shortcuts, go to **Help --> Keyboard Shortcuts** in the menu above.

Here are a few that we like:

1. Ctrl + Return: Evaluate the current cell
2. Shift + Return: Evaluate the current cell and move to the next
3. ESC: command mode (may need to press before using any of the commands below)
   - a: create a cell above
   - b: create a cell below
   - dd: delete a cell
   - z: undo the last cell operation
   - m: convert a cell to markdown
   - y: convert a cell to code
 
 

## 3. Prerequisites

It's time to answer some review questions. Each question has a response cell directly below it. Most response cells are followed by a test cell that runs automated tests to check your work. Please don't delete questions, response cells, or test cells. You won't get credit for your work if you do.

If you have extra content in a response cell, such as an example call to a function you're implementing, that's fine.

To receive full credit on this assignment, you must pass all test cases by the deadline. All test cases for labs are public.

### 3.1. Python

Python is the main programming language we'll use in the course. We expect that you've taken Introduction to Data Science or an equivalent class, so we will not be covering general Python syntax. If any of the below exercises are challenging (or if you would like to refresh your Python knowledge), please review one or more of the following materials.

- [Python Tutorial:](https://docs.python.org/3.5/tutorial/) Introduction to Python from the creators of Python.
- [Composing Programs Chapter 1:](http://composingprograms.com/pages/11-getting-started.html) This is more of an introduction to programming with Python.
- [Advanced Crash Course:](http://cs231n.github.io/python-numpy-tutorial/) A fast crash course which assumes some programming background.

**Question 3.1.1.** Write a function `summation` that evaluates the following sum

$$\sum_{i=1}^n \left(i^3+3i^2\right)$$

for $n \geq 1$.
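
As a quick sanity check, here is the sum worked by hand for $n = 2$, assuming both terms sit inside the sum, as the docstring of `summation` indicates:

$$\sum_{i=1}^{2} \left(i^3 + 3i^2\right) = \left(1^3 + 3 \cdot 1^2\right) + \left(2^3 + 3 \cdot 2^2\right) = 4 + 20 = 24$$

Under that reading, `summation(2)` should return 24.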

<!--
BEGIN QUESTION
name: q3_1_1
manual: false
-->

In [3]:
def summation(n):
    """Compute the summation i^3 + 3 * i^2 for 1 <= i <= n."""
    total = ...
    return total

In [None]:
grader.check("q3_1_1")

**Question 3.1.2.** Write a function `list_sum` that computes the square of each value in `list_1`, the cube of each value in `list_2`, then returns a list containing the element-wise sum of these results.

**Note:** Assume that `list_1` and `list_2` have the same number of elements.
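
To make the pairing concrete, here is a small example worked by hand. The inputs `list_1 = [1, 2]` and `list_2 = [3, 4]` are illustrative values, not taken from the autograder tests:

$$\left[\,1^2 + 3^3,\; 2^2 + 4^3\,\right] = \left[\,1 + 27,\; 4 + 64\,\right] = \left[\,28,\; 68\,\right]$$

Each output element combines the values found at the same position in the two input lists.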

<!--
BEGIN QUESTION
name: q3_1_2
manual: false
-->

In [5]:
def list_sum(list_1, list_2):
    """Compute x^2 + y^3 for each x in list_1, and each y in list_2. 
       Assume list_1 and list_2 have the same length.
    """
    
    # Both arguments (lists) must have the same number of elements
    assert len(list_1) == len(list_2)
    result = ...
    
    for ...
        result = ...
    return result

In [None]:
grader.check("q3_1_2")

**Question 3.1.3.** Write a function named `average_of_all` that takes a non-negative integer `n` and returns the average of the integers from 1 through `n`, returning 0.0 when `n` is 0. For example, `average_of_all(3)` would return 2.0 because 1 + 2 + 3 = 6 and 6/3 = 2.

<!--
BEGIN QUESTION
name: q3_1_3
manual: false
-->

In [7]:
def average_of_all(n):
    """Return the average of the integers from 1 through n (0.0 if n is 0).
    >>> average_of_all(1)
    1.0
    >>> average_of_all(3)
    2.0
    >>> average_of_all(8)
    4.5
    >>> average_of_all(0)
    0.0
    """
    return ...

In [None]:
grader.check("q3_1_3")

### 3.2. NumPy

NumPy is the numerical computing module introduced in Introduction to Data Science, which is a prerequisite for this course. Here's a quick recap of NumPy. For more review, read the following [NumPy Quick Start Tutorial](https://docs.scipy.org/doc/numpy-1.15.4/user/quickstart.html).

The core of NumPy is the array. Like Python lists, arrays store data; however, they store data in a more efficient manner. In many cases, this allows for faster computation and data manipulation.

In Introduction to Data Science, we used `make_array` from the `datascience` module, but that's not the most typical way. Instead, use `np.array` to create an array. It takes a sequence, such as a list or range.
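
As a minimal sketch of the two input types mentioned above, the following builds one array from a list and one from a range. The variable names and values are illustrative only.

```python
import numpy as np

from_list = np.array([10, 20, 30])     # built from a Python list
from_range = np.array(range(0, 6, 2))  # built from a range: 0, 2, 4

print(from_list)
print(from_range)
```

Both calls return a one-dimensional `ndarray`; the sequence only supplies the initial values.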

**Question 3.2.1.** Below, create an array named `arr` containing the values 1, 2, 3, 4, and 5 (in that order).

<!--
BEGIN QUESTION
name: q3_2_1
manual: false
-->

In [9]:
arr = ...

In [None]:
grader.check("q3_2_1")

In addition to values in the array, we can access attributes such as shape and data type. A full list of attributes can be found [here](https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.ndarray.html#array-attributes).

In [12]:
arr[3]

In [13]:
arr[2:4]

In [14]:
arr.shape

In [15]:
arr.dtype

Arrays, unlike Python lists, cannot store items of different data types.

In [16]:
# A regular Python list can store items of different data types
[1, '3']

In [17]:
# Arrays will convert everything to the same data type
np.array([1, '3'])

In [18]:
# Another example of array type conversion
np.array([5, 8.3])
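
One way to confirm which type NumPy settled on is to inspect the `dtype` of the result. The sketch below repeats the two conversions above under illustrative variable names. The exact string width NumPy reports can vary by platform, so only the dtype kind is checked for the string case.

```python
import numpy as np

mixed_text = np.array([1, '3'])  # the integer is converted to a string
mixed_nums = np.array([5, 8.3])  # the integer is converted to a float

print(mixed_text.dtype.kind)  # 'U' stands for a Unicode string dtype
print(mixed_nums.dtype)       # float64
```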

Arrays are also useful in performing **vectorized** operations. Given two or more arrays of equal length, arithmetic will perform element-wise computations across the arrays.

For example, observe the following:

In [19]:
# Python list addition will concatenate the two lists
[1, 2, 3] + [4, 5, 6]

In [20]:
# NumPy array addition will add them element-wise
np.array([1, 2, 3]) + np.array([4, 5, 6])
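
Comparison operators are vectorized in the same way as arithmetic. The sketch below, which uses a made-up array named `temps`, shows that comparing an array to a number yields a boolean array, and that a boolean array can be used as an index to keep only the matching elements.

```python
import numpy as np

temps = np.array([12, 25, 31, 18, 27])  # hypothetical sample data

is_warm = temps > 20         # element-wise comparison gives a boolean array
warm_temps = temps[is_warm]  # keep only the entries where the mask is True

print(is_warm)     # [False  True  True False  True]
print(warm_temps)  # [25 31 27]
```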

**Question 3.2.2.** Given the array `random_arr`, assign `valid_values` to an array containing all values $x$ in `random_arr` such that $2x^4 > 1$.

<!--
BEGIN QUESTION
name: q3_2_2
manual: false
-->

In [4]:
np.random.seed(42)
random_arr = np.random.rand(60)
valid_values = ...
valid_values

In [None]:
grader.check("q3_2_2")

**Question 3.2.3.** Use NumPy to recreate your answer to **Question 3.1.2**. The input parameters will both be lists, so you will need to convert the lists into arrays before performing your operations.

**Hint:** If you're stuck, [click](https://docs.scipy.org/doc/numpy-1.15.1/reference/index.html) to read the NumPy documentation.

<!--
BEGIN QUESTION
name: q3_2_3
manual: false
-->

In [24]:
def array_sum(list_1, list_2):
    """Compute x^2 + y^3 for each x, y in list_1, list_2.
       Assume list_1 and list_2 have the same length.
       Return a NumPy array.
    """
    
    # Both arguments (lists) must have the same number of elements
    assert len(list_1) == len(list_2)
    return ...

In [None]:
grader.check("q3_2_3")

You might have been told that Python is slow, but array arithmetic is carried out very fast, even for large arrays. For ten numbers, `list_sum` and `array_sum` both take a similar amount of time.

In [27]:
sample_list_1 = [x for x in range(10)]
sample_array_1 = np.arange(10)
sample_list_1

In [28]:
%%time
list_sum(sample_list_1, sample_list_1)

In [29]:
%%time
array_sum(sample_array_1, sample_array_1)

The time difference seems negligible for a list/array of size 10; depending on your setup, you may even observe that `list_sum` executes faster than `array_sum`. However, we will commonly be working with much larger datasets:

In [30]:
sample_list_2 = [x for x in range(100000)]
sample_array_2 = np.arange(100000)

In [31]:
%%time
# The trailing semicolon hides the output
list_sum(sample_list_2, sample_list_2);

In [32]:
%%time
array_sum(sample_array_2, sample_array_2);

With the larger dataset, we see that using NumPy results in code that executes over 50 times faster. Throughout this course (and in the real world), you will find that writing efficient code will be important; arrays and vectorized operations are the most common way of making Python programs run quickly.
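
The same kind of comparison can be run outside a notebook with the standard library's `timeit` module. The sketch below uses a deliberately different operation (doubling every element) and two hypothetical helper functions, `double_with_loop` and `double_vectorized`, so it does not depend on your `list_sum` or `array_sum`. The measured times will vary from machine to machine.

```python
import timeit
import numpy as np

values_list = list(range(100_000))
values_array = np.arange(100_000)

def double_with_loop(values):
    """Double each element one at a time with a list comprehension."""
    return [2 * v for v in values]

def double_vectorized(values):
    """Double every element with a single vectorized NumPy expression."""
    return 2 * values

# Run each version 20 times and report the total elapsed seconds.
loop_seconds = timeit.timeit(lambda: double_with_loop(values_list), number=20)
vector_seconds = timeit.timeit(lambda: double_vectorized(values_array), number=20)
print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
```

Repeating each call several times smooths out the noise that a single `%%time` measurement can show.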

## 4. Matplotlib

We're going to start by going through the official pyplot tutorial. Please go through the tutorial notebook and familiarize yourself with the basics of pyplot. This should take roughly 25 minutes.

**Note:** The tutorial uses `np.arange`, which returns an array that steps from $a$ to $b$ with a fixed step size $s$. While this is fine in some cases, we sometimes prefer to use `np.linspace(a, b, N)`, which divides the interval $[a, b]$ into $N$ equally spaced points.

By default, `np.linspace` includes both endpoints, while `np.arange` excludes the second endpoint $b$. For this reason, when we are plotting ranges of values we tend to prefer `np.linspace`.

Notice how the following two statements have different parameters but return the same result.

In [33]:
np.arange(-5, 6, 1.0)

In [34]:
np.linspace(-5, 5, 11)
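
The endpoint difference is easier to see when both functions receive the same start and stop values. This is a minimal sketch with illustrative variable names:

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)  # the stop value 1 is excluded
spaced = np.linspace(0, 1, 5)    # both 0 and 1 are included

print(stepped)  # [0.   0.25 0.5  0.75]
print(spaced)   # [0.   0.25 0.5  0.75 1.  ]
```

With `np.arange` you choose the step and the number of points follows from it; with `np.linspace` you choose the number of points and the step follows.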

Now that you're familiar with the basics of pyplot, let's practice with a plotting question.

<!-- BEGIN QUESTION -->

**Question 4.1.** Let's visualize the function $f(t) = 3\sin(2\pi t)$.

- Set the $x$ limit of all figures to $[0, \pi]$ and the $y$ limit to $[-10, 10]$. 
- Plot the sine function using `plt.plot` with 30 red plus signs. 
- Make sure the $x$ ticks are labeled $[0, \frac{\pi}{2}, \pi]$, and that your axes are labeled as well. 

Click [here](https://matplotlib.org/api/pyplot_api.html) to use the matplotlib documentation for reference.

Your plot should look like the following:

<center><img src="graph1.png"></center>

**Hint 1:** You can set axis bounds with `plt.axis`.

**Hint 2:** You can set xticks and labels with `plt.xticks`.

**Hint 3:** Make sure you add `plt.xlabel`, `plt.ylabel`, and `plt.title`.
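
To show how the calls named in the hints fit together, here is a sketch that plots a different, made-up function, $g(t) = 2\cos(t)$, with blue circles. The function, limits, tick positions, and labels are all illustrative and will need to be adapted for this question.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 2 * np.pi, 20)  # 20 evenly spaced sample points
plt.plot(t, 2 * np.cos(t), 'bo')   # 'bo' draws blue circle markers with no line
plt.axis([0, 2 * np.pi, -3, 3])    # [xmin, xmax, ymin, ymax]
# First argument: tick positions. Second argument: the label shown at each one.
plt.xticks([0, np.pi, 2 * np.pi], ['0', r'$\pi$', r'$2\pi$'])
plt.xlabel('t')
plt.ylabel('g(t)')
plt.title('g(t) = 2cos(t)')
```

The raw-string prefix `r` keeps the backslash in `\pi` from being treated as a Python escape sequence.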

<!--
BEGIN QUESTION
name: q4_1
manual: true
-->

In [35]:
...
plt.xlabel('t')
plt.ylabel('f(t)')
plt.title(r'f(t) = 3sin(2$\pi$t)')  # raw string so the backslash in \pi is not an escape

<!-- END QUESTION -->

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Alternatively, find the .zip file on the left side of the screen, right-click it, and select **Download**. You'll submit this .zip file to Gradescope, via the assignment in Canvas, for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)