# Week 1 Workshop [Student]

***
## Getting to Know Jupyter and Dataframes
Follow along to learn how to use Jupyter notebooks and the Pandas library.

For a quickstart for Jupyter, check out [this tutorial](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) (you can skip the installation instructions) and [these shortcuts](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/), or Google around to find your own tips. A tutorial for Markdown text in Jupyter notebooks can be found [here](https://gtribello.github.io/mathNET/assets/notebook-writing.html).

The most important thing to know is that Jupyter has 2 modes: an editing mode when you're editing a cell (green cell outline), and a command mode (blue cell outline). In editing mode, typing directly edits the cell's contents, but in command mode, keys execute commands, like adding/moving/deleting cells. Enter enters editing mode, Ctrl+Enter or Shift+Enter runs the cell and exits it, and Esc exits it without running the cell.

**Why Jupyter?** Jupyter notebooks are becoming a standard for data science because they allow you to save not only your code, but also your output (results, visualizations, etc.), and documentation through [markdown](https://www.markdownguide.org/cheat-sheet/).

## 1.1 Loading Data (Follow) 

In [None]:
# These libraries will be used on most assignments
# Pandas helps us manage data in a tabular dataframe
import pandas as pd
# Numpy helps with math and stats functions
import numpy as np
# Matplotlib helps with plotting
import matplotlib.pyplot as plt

# Remember you have to run this cell block before continuing!

In [None]:
# We'll also use sklearn for a lot of ML functions.
# It includes a copy of the Iris dataset in the sklearn.datasets library
from sklearn import datasets
# Here, though, we use pd.read_csv to load the dataframe from a file
iris = pd.read_csv('./iris.csv')
# Remember, if a Jupyter cell ends with an expression (not an assignment), it will print it.
iris

**Tip**: In practice, you'll be loading data from .csv files. You can do this in Pandas with the following code.
Note that `/etc/` is a public, read-only directory on this server and may not exist if you work on your own computer. That's why we'll often use sklearn's datasets.

In [None]:
iris_from_file = pd.read_csv('./iris.csv')

# the head() function prints the first [n=5] rows of the dataset
iris_from_file.head()
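
If you don't have `iris.csv` on hand, `pd.read_csv` can also be tried out on an in-memory file-like object. Here is a minimal sketch using a made-up three-row table; the values and the `species` column name are illustrative assumptions, not taken from the course file.

```python
import io

import pandas as pd

# A tiny, made-up CSV held in memory (a stand-in for a real file on disk)
csv_text = """sepal length (cm),petal length (cm),species
5.1,1.4,setosa
7.0,4.7,versicolor
6.3,6.0,virginica
"""

# read_csv accepts a path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (n defaults to 5)
print(toy.head(2))
# dtypes shows how Pandas interpreted each column
print(toy.dtypes)
```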

### 1.11 Subsetting data

In this section, you'll do some practice problems to manipulate data. I recommend reading up on the Pandas library, and practicing Googling key terms. Seriously, using these libraries involves a lot of searching - even for your professor :)

**Tip**: It might help to create a new cell and experiment with function calls before trying to write the answer. This can be done in command mode with the A (above) or B (below) keys.

### 1.12 Data Shape
The way data is arranged is called its "shape". Right now, since we're viewing a table of data, the current shape of the data is two-dimensional.

In [None]:
iris.shape

The first dimension is the number of rows (objects)

In [None]:
iris.shape[0]

The second dimension is the number of columns (attributes)

In [None]:
iris.shape[1]
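
To see the same idea on something small enough to count by eye, here is a sketch with a hypothetical toy dataframe (the column names and values are made up for illustration).

```python
import pandas as pd

# A made-up dataframe with 3 rows and 2 columns
toy = pd.DataFrame({
    "height": [1.0, 2.0, 3.0],
    "width": [4.0, 5.0, 6.0],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() of a dataframe is also its number of rows
print(len(toy))

# A single column is one-dimensional, so its shape has one entry
print(toy["height"].shape)
```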

### 1.13 Getting Columns (Attributes) and Rows
Now check out [this documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) 
on how to get rows and columns of dataframes in Pandas.

And check out these examples

In [None]:
# Get the 4th row of the data (it's 0-indexed)
iris.iloc[3,]

In [None]:
# Get the 2nd column (use : to indicate all rows)
iris.iloc[:,1]

In [None]:
# Get the "sepal length (cm)" column
# notice the use of "loc" and not "iloc" for string keys
iris.loc[:,"sepal length (cm)"]
# For columns, you can use this shorter notation
iris["sepal length (cm)"]

In [None]:
# You can subset rows and columns at the same time
iris.loc[1:5,"sepal length (cm)"]
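
One detail that is easy to miss: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end, like a normal Python slice. A small sketch on a made-up dataframe (the `value` column is hypothetical):

```python
import pandas as pd

# Made-up data with the default 0..4 row labels
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# .loc uses labels, and the end label IS included: labels 1, 2, 3
by_label = toy.loc[1:3, "value"]

# .iloc uses positions, and the end position is NOT included: positions 1, 2
by_position = toy.iloc[1:3, 0]

print(list(by_label))
print(list(by_position))
```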

### 1.14 Basic Aggregation
A dataframe column behaves much like a list, so you can perform all your favorite list operations on it.

In [None]:
sum(iris['sepal length (cm)'])

In [None]:
np.prod(iris['sepal width (cm)'])
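
Columns also come with their own aggregation methods, which usually read more cleanly than the built-in functions. A short sketch on a made-up column of widths (the numbers are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# A made-up column of values
widths = pd.Series([3.5, 3.0, 3.2, 3.1])

# Built-in Python sum and the pandas method give the same total
total_builtin = sum(widths)
total_method = widths.sum()
print(total_builtin, total_method)

# Other common aggregations are available as methods too
print(widths.mean(), widths.max(), widths.min())

# numpy functions accept a pandas column directly
product = np.prod(widths)
print(product)
```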

### 1.15 Other ways to subset data

In [None]:
# Get the first 20 rows of the sepal length column
# (.loc slices by label and includes both endpoints, so 0:19 is 20 rows)
iris.loc[0:19, 'sepal length (cm)']

In [None]:
# You can use a boolean vector to subset the rows you want
# This gets only the rows of iris with sepal length > 5
iris.loc[iris['sepal length (cm)'] > 5,]

In [None]:
# For subsetting operations like that above, you can also use this shorthand
# It ditches the '.loc' and the need to use a comma or ':' to specify all columns
iris[iris['sepal length (cm)']>5]

In [None]:
# Another example, finding flowers where the petal length is 5 (don't forget the double ==)!
iris[iris['petal length (cm)'] == 5.0]
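
Boolean vectors can also be combined. The sketch below uses a made-up four-row dataframe with the same column names as above; note that Pandas uses `&`, `|`, and `~` (not `and`, `or`, `not`), and each comparison needs its own parentheses if written inline.

```python
import pandas as pd

# Made-up values, reusing the column names from the iris dataframe
toy = pd.DataFrame({
    "sepal length (cm)": [4.9, 5.8, 6.7, 5.1],
    "petal length (cm)": [3.0, 4.0, 5.0, 1.5],
})

long_sepal = toy["sepal length (cm)"] > 5
short_petal = toy["petal length (cm)"] < 2

# & means "and": rows where both conditions hold
both = toy[long_sepal & short_petal]

# | means "or": rows where at least one condition holds
either = toy[long_sepal | short_petal]

# ~ means "not": rows where the condition does not hold
not_long = toy[~long_sepal]

print(both)
print(either)
print(not_long)
```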

### 1.16 Counting 
Count the number of rows in the iris dataset where the petal length is greater than 4

**Hint**: When you perform a *sum* operation on a list of boolean variables, it treats *True* as 1 and *False* as 0.

In [None]:
# Example of summing over a boolean list
sum([True, False, True, True])

In [None]:
# Write code here
petal_length_count = sum(iris['petal length (cm)'] > 4)
print(petal_length_count)

In [None]:
assert(petal_length_count == 84)
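
The same counting trick works with the column's own methods. A sketch on a made-up list of lengths (not the iris data): `.sum()` on a boolean vector counts the `True` values, and `.mean()` gives the fraction of rows that match.

```python
import pandas as pd

# Made-up lengths for illustration
lengths = pd.Series([1.4, 4.5, 5.1, 3.9, 4.0])

# Comparison produces a boolean vector: False, True, True, False, False
mask = lengths > 4

# Summing counts the True values
count = mask.sum()

# The mean of a boolean vector is the fraction of True values
fraction = mask.mean()

print(count, fraction)
```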

### 1.17 Plotting Data
We can also plot data from the iris dataframe using matplotlib.

In [None]:
# Here's a scatter plot of the sepal and petal length attributes
plt.scatter(iris['sepal length (cm)'], iris['petal length (cm)'])
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")

In [None]:
# Here is a histogram
plt.hist(iris["petal length (cm)"])
plt.xlabel("petal length (cm)")
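
For more control over a plot, matplotlib also offers an object-based style built around `plt.subplots()`. The sketch below is one possible way to write the histogram with labelled axes and a title; the data values, the bin count, and the title text are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values, reusing a column name from the iris dataframe
toy = pd.DataFrame({"petal length (cm)": [1.4, 1.5, 4.0, 4.7, 5.0, 6.0]})

# subplots() returns a figure and an axes object to draw on
fig, ax = plt.subplots()

# hist() returns the bin counts and bin edges along with the drawn bars
counts, bin_edges, _ = ax.hist(toy["petal length (cm)"], bins=3)

# Label the axes so the plot can be read on its own
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("count")
ax.set_title("Toy histogram")
```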

## 1.2 Dataframes (Group)
For this question, you'll be manipulating some data on your own from the iris dataset.

In [None]:
# Don't forget your imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Remember you have to run this cell block before continuing!

### Q1.21
Load ```./iris.csv``` into a dataframe called ```iris```

In [None]:
# Load in here
iris = None # Replace "None" with your answer

# If the last line of a cell is an expression (like `iris`),
# Jupyter prints it out below!
# Now inspect the dataframe you loaded
iris

### Q1.22
Create a subset of the **first 20 *Virginica* flowers** in the dataset. Call this subset `first_20`

In [None]:
# Put Solution Here (Remember to replace the "None" with your answer)
first_20 = None
first_20

In [None]:
# If this passes without fail, you (probably) selected the data right!
# It asserts that the number of rows with petal length > 5.3 is 12
# (Here the sum of a true/false statement "length > 5.3" is a count)
assert(sum(first_20["petal length (cm)"] > 5.3)==12)

Now plot a **scatterplot** of this *subset* that shows **"Sepal Width"** on the x-axis and **"Petal Width"** on the y-axis

In [None]:
# Plot the scatterplot here using first_20!


### Q1.23
Plot a **histogram** that shows the distribution of **Petal Length** for the **last 10 Setosa flowers** in the dataset. You should also write **test cases** to check your work!

Similar to the last question, call the subset of data `last_10`

In [None]:
#Put Solution here

last_10 = None

In [None]:
# If this passes without fail, you (probably) selected the data right!
assert(sum(last_10["petal width (cm)"] > 0.2)==5)

In [None]:
# Write your own test cases here using "assert" that test
# that 1) the number of rows in the dataset is equal to 10
# and 2) all of them have the class "Setosa"



In [None]:
# Finally, plot the histogram here!



***
## 1.3 Numpy Matrix-Vector Refresher (Follow)

Here's a recap of some basic numpy math operations. The instructor can also go over some examples by hand on the board if needed.

In [None]:
# You've already used numpy extensively in your previous assignments.
# Here, we provide a few examples to show how easily you can 
# perform matrix operations in numpy without needing to use loops.
# imagine you're given two arrays toy_x and toy_y
toy_x = np.array([
    [1, 2], 
    [3, 4]
])
toy_y = np.array([
    [5, 6], 
    [7, 8]
])

print(f'toy_x looks like this \n{toy_x}\ntoy_y looks like this \n{toy_y}')

In [None]:
# Element-wise addition
# Imagine you wish to sum each element of toy_x and toy_y 
# In traditional programming, you'd write two nested loops to add each
# element in toy_x to the corresponding element in toy_y.
# Using numpy, you can simply use np.add()
element_wise_sum = np.add(toy_x, toy_y)
print(f'Element wise sum of toy_x and toy_y is \n{element_wise_sum}')

In [None]:
# Note that using '+' will produce the same result
element_wise_sum = toy_x + toy_y
print(f'Element wise sum of toy_x and toy_y is \n{element_wise_sum}')

In [None]:
# Similarly element-wise multiplication
element_wise_multiplication = np.multiply(toy_x, toy_y)
print(f'Element wise multiplication of toy_x and toy_y is \n{element_wise_multiplication}')

In [None]:
# Note that using '*' will produce the same result
element_wise_multiplication = toy_x * toy_y
print(f'Element wise multiplication of toy_x and toy_y is \n{element_wise_multiplication}')

In [None]:
# You can do this with other functions too! 
# Play around with methods such as np.sqrt, np.square, np.exp, etc. 

# The dot product is another useful function, shown here on two 1-D arrays (vectors)
# https://www.mathsisfun.com/algebra/vectors-dot-product.html
dot_product = np.dot(toy_x[0], toy_y[0])
print(f'Dot product of the first rows of toy_x and toy_y is \n{dot_product}')
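
As a small sketch of the element-wise functions mentioned above, and of what the dot product computes, here are two made-up vectors `v` and `w` (hypothetical names, not used elsewhere in this notebook):

```python
import numpy as np

# Two made-up vectors
v = np.array([1.0, 4.0, 9.0])
w = np.array([2.0, 0.0, 1.0])

# Element-wise functions apply to every entry without a loop
roots = np.sqrt(v)
squares = np.square(v)
print(roots)
print(squares)

# The dot product multiplies matching entries and adds them up:
# 1*2 + 4*0 + 9*1
dot = np.dot(v, w)
print(dot)

# It is the same as an element-wise product followed by a sum
same_thing = np.sum(v * w)
print(same_thing)
```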

In [None]:
# IMPORTANT
# Note the difference between element-wise multiplication and 
# matrix multiplication
# To perform matrix multiplication, you can use np.matmul
# https://mathsisfun.com/algebra/matrix-multiplying.html
matrix_multiplication = np.matmul(toy_x, toy_y)
print(f'Matrix multiplication between toy_x and toy_y is \n{matrix_multiplication}')
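
Python's `@` operator is a shorthand for matrix multiplication on numpy arrays. The sketch below redefines `toy_x` and `toy_y` so it runs on its own, and checks two things: `@` matches `np.matmul`, and the order of the operands matters.

```python
import numpy as np

toy_x = np.array([[1, 2], [3, 4]])
toy_y = np.array([[5, 6], [7, 8]])

with_matmul = np.matmul(toy_x, toy_y)
with_operator = toy_x @ toy_y   # same operation, shorter to write
element_wise = toy_x * toy_y    # NOT matrix multiplication

# True: @ and np.matmul agree
print(np.array_equal(with_matmul, with_operator))

# False: element-wise and matrix multiplication differ
print(np.array_equal(with_matmul, element_wise))

# False: matrix multiplication is not commutative in general
reversed_order = toy_y @ toy_x
print(np.array_equal(with_operator, reversed_order))
```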

***
## 1.4: Numpy, Vectors, and Matrices (Group)
In this question, you'll be working with vectors and matrices.

### Q1.41
When we talk about neural networks, you'll often see them represented as a vector multiplied by a matrix, in the form:

$\mathbf{y} = W \mathbf{a}$

Where $\mathbf{a}$ is an input vector of attributes, and $W$ is a matrix of "weights" (i.e. numbers).

(You don't need to know about neural nets right now, we'll talk more about them later)

You can go [here for a quick refresher on Matrix multiplication](https://www.khanacademy.org/math/precalculus/x9e81a4f98389efdf:matrices/x9e81a4f98389efdf:multiplying-matrices-by-matrices/a/multiplying-matrices).

As an example, given the problem:

$y = \begin{bmatrix} 3 & 2 \\ 1 & -2 \end{bmatrix} \begin{bmatrix} 5 \\ 6 \end{bmatrix}$

The answer would be:

$y = \begin{bmatrix} 3 * 5 + 2 * 6  \\ 1 * 5 + (-2) * 6  \end{bmatrix} = \begin{bmatrix} 27  \\ -7  \end{bmatrix}$
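
The worked example above can be checked in numpy with a short sketch like this one, where `W` and `a` hold the matrix and vector from the example:

```python
import numpy as np

# The matrix and vector from the worked example
W = np.array([
    [3, 2],
    [1, -2],
])
a = np.array([5, 6])

# Matrix-vector multiplication: each row of W is dotted with a
y = W @ a
print(y)
```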

**Now, consider the problem:**

$y = \begin{bmatrix} 1 & 2 \\ 4 & 5 \\ 9 & 3 \end{bmatrix} \begin{bmatrix} 7 \\ 3 \end{bmatrix}$

First, **compute the correct solution by hand**, and then **verify it is correct by using numpy**.

In [None]:
# Put Solution Here


### Q1.42

Consider the matrices

$L = \begin{bmatrix} 1 & 11 \\ 11 & 12 \end{bmatrix}$, $U=\begin{bmatrix} 16 & 71 \\ 22 & 5 \end{bmatrix}$

Calculate:

1) Element-wise multiplication of $L$ and $U$


2) Element-wise addition of $L$ and $U$

3) $U^2$

4) $UL$ (matrix multiplication, **NOT** element wise multiplication)

5) $LU$ (matrix multiplication, **NOT** element wise multiplication)


In [None]:
# Define the matrices



In [None]:
# Problem 1



In [None]:
# Problem 2



In [None]:
# Problem 3



In [None]:
# Problem 4



In [None]:
# Problem 5



***
## 1.5 Z-scores (If time allows)
In this question, you'll be implementing a function yourself to calculate the z-scores of a list of numbers.

Remember the z-score $z_i$ for a given attribute value $x_i$ is:


$$ z_i = \frac{x_i-\mu}{\sigma}$$

Where:

$\mu =$ sample mean

$x_i =$ observed value

$\sigma =$ sample standard deviation


**Helpful functions**:

`np.mean(X)`: the mean of a list of numbers

`np.std(X)`: the standard deviation of a list of numbers (by default it divides by $n$, i.e. `ddof=0`)
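
To see what these two helpers return, here is a sketch on a made-up list (not the `nums` used in this question). It also shows the optional `ddof` argument: `np.std` divides by $n$ unless you pass `ddof=1`, which divides by $n-1$ instead.

```python
import numpy as np

# Made-up data chosen so the results are round numbers
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = np.mean(data)
print(mean)

# Default: divide the summed squared deviations by n
std_default = np.std(data)
print(std_default)

# ddof=1: divide by n - 1 instead, which gives a slightly larger value
std_ddof1 = np.std(data, ddof=1)
print(std_ddof1)
```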

In [None]:
nums = [12,3,9,7,14,2]

In [None]:
def z_score(data):
    # Input: A list of numbers
    # Output: The list of numbers, but each one transformed to its z value
    
    #BEGIN SOLUTION
        
    return None

In [None]:
final_scores = z_score(nums)

In [None]:
# Check if you got it right
# (np.isclose allows for tiny floating-point rounding differences)
assert(np.isclose(final_scores[0], 0.9524241471993242))
assert(np.isclose(final_scores[1], -1.1048120107512158))

1. (13 points) Classify the following attributes as nominal, ordinal, interval, or ratio. Also
classify them as binary, discrete, or continuous. Some cases may have more than one
interpretation, so briefly justify your answer if you think there may be some ambiguity.
(a) (1 point) The IP address of your laptop.
(b) (1 point) Years of work experience
(c) (1 point) Density of ocean water.
(d) (1 point) Temperature as measured in Fahrenheit.
(e) (1 point) 12 hour clock.
(f) (1 point) Categorization of clothing (hat, shirt, pants, shoes)
(g) (1 point) Brightness in terms of lumens
(h) (1 point) 24 hour clock.
(i) (1 point) Sugar content in grape juice in grams.
(j) (1 point) Calendar dates
(k) (1 point) Income earned in a week
(l) (1 point) Barcodes
(m) (1 point) Glasgow Coma Scale (GCS) used to describe the level of consciousness in
a person following a traumatic brain injury

Answer:
(a) nominal
(b) ratio
(c) ratio
(d) interval
(e) ratio
(f) nominal
(g) ratio
(h) ratio
(i) ratio
(j) interval
(k) ratio
(l) nominal
(m) ordinal
