# Instructions

1. Add your name and HW Group Number below.
2. Complete each question. Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", and delete any `raise NotImplementedError()` lines.
3. Where applicable, run the test cases *below* each question to check your work. **Note**: In addition to the test cases you can see, the instructor may run additional test cases, including using *other datasets* to validate your code.
4. Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). You can also use the **Validate** button to run all test cases.
5. Turn in your homework by going to the main screen in JupyterHub, clicking the Assignments menu, and submitting.



In [None]:
"""
Name: 
HW Group Number: 
"""

# Homework 0 - Problem 1
Complete each of the following to learn how to use Jupyter notebooks and the Pandas library.

For a quickstart for Jupyter, check out [this tutorial](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) (you can skip the installation instructions) and [these shortcuts](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/), or Google around to find your own tips.

The most important thing to know is that Jupyter has 2 modes: an edit mode, for editing a cell's contents (green cell outline), and a command mode (blue cell outline). In edit mode, typing changes the cell's contents; in command mode, keys execute commands, like adding/moving/deleting cells. Enter enters edit mode; Esc exits it, as does running the cell with Ctrl+Enter or Shift+Enter.

**Why Jupyter?** Jupyter notebooks are becoming a standard for data science because they allow you to save not only your code, but also your output (results, visualizations, etc.), and documentation through [markdown](https://www.markdownguide.org/cheat-sheet/).

## Loading Data

In [None]:
# These libraries will be used on most assignments
# Pandas helps us manage data in a tabular dataframe
import pandas as pd
# Numpy helps with math and stats functions
import numpy as np
# Matplotlib helps with plotting
import matplotlib.pyplot as plt

# Remember you have to run this cell block before continuing!

In [None]:
# We'll also use sklearn for a lot of ML functions.
# In this case, we're loading the Iris dataset from the sklearn.datasets library
from sklearn import datasets
iris_sk = datasets.load_iris(as_frame=True) # Load the dataset
# We convert it to a Pandas dataframe, which will be easier to work with
iris = pd.DataFrame(iris_sk.data, columns=iris_sk.feature_names)
# Remember, if a Jupyter cell ends with an expression (not an assignment), it will display its value.
iris

**Tip**: In practice, you'll be loading data from .csv files. You can do this in Pandas with the following code.
Note that `/etc/` is a public, read-only directory on this server and may not exist if you work on your own computer. That's why we'll often use sklearn's datasets.

In [None]:
iris_from_file = pd.read_csv('/etc/data/iris.csv')
# the head() function prints the first [n=5] rows of the dataset
iris_from_file.head()
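
If the server file is not available (for example, on your own computer), here is a minimal sketch of the same `read_csv` and `head` calls on a small in-memory CSV. The column names and values are made up for illustration; they are not the contents of `iris.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV contents, standing in for a file on disk
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
4.9,3.0,setosa
7.0,3.2,versicolor
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (number of rows, number of columns)
print(toy.head(2))  # only the first two rows
```

With a real file, you would pass its path as a string instead of the `StringIO` object.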

## Subsetting data

In this section, you'll do some practice problems to manipulate data. I recommend reading up on the Pandas library, and practicing Googling key terms. Seriously, using these libraries involves a lot of searching - even for your professor :)

**Tip**: It might help to create a new cell and experiment with function calls before trying to write the answer. This can be done in command mode with the A (above) or B (below) keys.

In [None]:
# Problem 1
def get_n_rows(df):
    """Compute the number of rows in the given dataframe
    Hint: check out the shape property
    """
    # YOUR CODE HERE
    raise NotImplementedError()

You are given test cases to check that your code is correct.
If there's no output when you run the tests below, that means your answer passed them.
However, remember there may be hidden test cases. For example, if you simply wrote `return 150`, that would pass this test case, but the answer would be wrong.
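
To see why a hard-coded answer fails hidden tests, here is a small sketch using a different, made-up task (counting columns). The function names and dataframes are hypothetical and are not part of the assignment.

```python
import pandas as pd

def count_columns_hardcoded(df):
    # Correct only when the dataframe happens to have 2 columns
    return 2

def count_columns(df):
    # Reads the answer from the dataframe itself
    return len(df.columns)

small = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
wide = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Both versions pass the visible test...
assert count_columns_hardcoded(small) == 2
assert count_columns(small) == 2

# ...but only the general version is right on a different dataset
print(count_columns_hardcoded(wide) == 3)  # False
print(count_columns(wide) == 3)            # True
```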

In [None]:
assert(get_n_rows(iris) == 150)

### Getting Columns (Attributes) and Rows
Now check out [this documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) 
on how to get rows and columns of dataframes in Pandas.

And check out these examples:

In [None]:
# Get the 4th row of the data (it's 0-indexed)
iris.iloc[3,]

In [None]:
# Get the 2nd column (use : to indicate all rows)
iris.iloc[:,1]

In [None]:
# Get the "sepal length (cm)" column
# notice the use of "loc" and not "iloc" for string keys
iris.loc[:,"sepal length (cm)"]
# For columns, you can use this shorter notation
iris["sepal length (cm)"]

In [None]:
# You can subset rows and columns at the same time
iris.loc[1:5, "sepal length (cm)"]
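
One detail that is easy to miss: `loc` slices by label and includes both endpoints, while `iloc` slices by position and excludes the stop, like ordinary Python slicing. A minimal sketch on a made-up dataframe (the column names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"height": [10, 20, 30, 40, 50], "width": [1, 2, 3, 4, 5]}
)

# loc slices by label and includes both endpoints: labels 1, 2, 3
by_label = toy.loc[1:3, "height"]

# iloc slices by position and excludes the stop: positions 1, 2
by_position = toy.iloc[1:3, 0]

print(len(by_label))     # 3
print(len(by_position))  # 2
```

The labels and positions coincide here only because the dataframe uses the default 0-based integer index.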

In [None]:
# Problem 2
def get_attr_sum(df, column_name):
    """Compute the sum of the values of the column in the given 
    dataframe df with the given column_name
    Hint: check out https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
np.testing.assert_almost_equal(get_attr_sum(iris, 'sepal length (cm)'), 876.5)
np.testing.assert_almost_equal(get_attr_sum(iris, 'sepal width (cm)'), 458.6)

### Other ways to subset data

In [None]:
# Get the first 20 rows of the sepal length column
iris.loc[0:19, 'sepal length (cm)']

In [None]:
# You can apply operations to entire Series (vectors)
# This creates a vector of boolean values, indicating 
# which rows have sepal length > 5
iris['sepal length (cm)'] > 5

In [None]:
# You can use this boolean vector to subset the rows you want
# This gets only the rows of iris with sepal length > 5
iris.loc[iris['sepal length (cm)'] > 5,]
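
Boolean vectors can also be combined before subsetting. The sketch below uses a made-up dataframe with hypothetical column names, and assumes you want rows that satisfy two conditions at once.

```python
import pandas as pd

toy = pd.DataFrame({"height": [10, 20, 30, 40], "width": [4, 3, 2, 1]})

# Each comparison yields a boolean Series aligned with the rows
tall = toy["height"] > 15
narrow = toy["width"] < 3

# Combine masks element-wise with & (and), | (or), ~ (not)
both = toy.loc[tall & narrow]

print(both)  # the rows with index 2 and 3
```

If you write the comparisons inline, wrap each one in parentheses, e.g. `(toy["height"] > 15) & (toy["width"] < 3)`, because `&` binds more tightly than `>`.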

In [None]:
# Problem 3
def count_with_petal_length(df, min_petal_length):
    """Compute the number of rows in df whose petal length is
    strictly greater than the given min_petal_length.
    Hint: You can sum a boolean vector with sum()
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert(count_with_petal_length(iris, 3) == 99)
assert(count_with_petal_length(iris, 4) == 84)
assert(count_with_petal_length(iris, 6) == 9)

## Plotting Data
We can also plot data from the iris dataframe using matplotlib.

In [None]:
# Here's a scatter plot of the sepal and petal length attributes
plt.scatter(iris['sepal length (cm)'], iris['petal length (cm)'])
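
Plots are easier to read with axis labels. Here is a minimal sketch using matplotlib's axes interface; the measurements are made up and simply stand in for two dataframe columns.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements, standing in for two dataframe columns
x_values = [4.9, 5.1, 6.3, 7.0]
y_values = [1.4, 1.5, 4.9, 4.7]

# Create a figure and axes, then draw and label the scatter plot
fig, ax = plt.subplots()
ax.scatter(x_values, y_values)
ax.set_xlabel("sepal length (cm)")
ax.set_ylabel("petal length (cm)")
ax.set_title("Petal length vs. sepal length")
```

The same labels can be set with `plt.xlabel`, `plt.ylabel`, and `plt.title` when you use the `plt.scatter` style shown above.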

In [None]:
# Problem 4
def plot_histogram(df, attribute):
    """This function should plot a histogram of the given attribute
    in the given dataframe (df) and return the result of the hist call.
    Hint: Look up the documentation for the hist function:
    https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.hist.html
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Test your function!
plot_histogram(iris, 'petal length (cm)')

In [None]:
x = plot_histogram(iris, 'petal length (cm)')
assert type(x) == tuple
assert x[0][0] == 37

In [None]:
x[1]
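
To interpret these values: matplotlib's `hist` is documented as returning the bin counts, the bin edges, and the drawn patches, so `x[0]` and `x[1]` above should be the counts and the edges. The sketch below shows the same counts-and-edges idea with `np.histogram` on made-up numbers, using 2 bins (an arbitrary choice for illustration).

```python
import numpy as np

values = [1, 2, 2, 3, 3, 3, 9, 10]

# Two equal-width bins spanning min..max: [1, 5.5) and [5.5, 10]
counts, edges = np.histogram(values, bins=2)

print(counts)  # [6 2]
print(edges)   # three edges: 1.0, 5.5 and 10.0
```

Note that there is always one more edge than there are bins.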

**Remember**: Make sure to complete all problems (.ipynb files) in this assignment. When you finish, double-check the submission instructions at the top of this file, and submit on JupyterHub.