# Run the cell below

To run a code cell (i.e., execute the Python code inside a Jupyter notebook), you can click the play button on the ribbon underneath the name of the notebook (it looks like ▶|) or hold down `Shift` and press `Return`.

Before you begin, run the code cell below.

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("dsc295_003_005_a6.ipynb")

# Assignment 06

## Due: See Date in Moodle

In this assignment we will get familiar with Jupyter notebooks and some basic Python, including the print functions and variable assignment.

I would like you to attempt each question in the assignment. To get **full credit** on this assignment, you must complete **Level I**, **Level II**, and **Level III**. **Level IV** and **Level V** are **optional** as they give you a chance to work with advanced features in Python.

## This Week's Assignment
In this week's assignment, you'll learn how to:

- call built-in Python functions.

- import libraries.

- work with `NumPy` arrays.

- import data from a `.csv` file and access information from a dataframe.

**Name:** 

**Section:** 

**Date:**

Let's get started!

## Level I

### Python Expressions

The two building blocks of Python code are *expressions* and *statements*. An **expression** is a piece of code that

* is self-contained, meaning it would make sense to write it on a line by itself, and
* usually evaluates to a value.


Here are two expressions that both evaluate to 3:

* `3`
* `5-2`
    
One important type of expression is the **call expression**. A call expression begins with the name of a function and is followed by the argument(s) of that function in parentheses. The function returns some value, based on its arguments. Some important mathematical functions are listed below.

| Function | Description                                                   |
|----------|---------------------------------------------------------------|
| `abs`    | Returns the absolute value of its argument                    |
| `max`    | Returns the maximum of all its arguments                      |
| `min`    | Returns the minimum of all its arguments                      |
| `pow`    | Raises its first argument to the power of its second argument |
| `round`  | Rounds its argument to the nearest integer                    |

Here are two call expressions that both evaluate to 3.

* `abs(2-5)`

* `max(round(2.8), min(pow(2, 10), -1*pow(2, 10)))`

The expression `5-2` and the two call expressions given above are examples of **compound expressions**, meaning that they are actually combinations of several smaller expressions. `5-2` combines the expressions `5` and `2` by subtraction.  In this case, `5` and `2` are called **subexpressions** because they're expressions that are part of a larger expression.
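
As a quick illustration of the table above, the sketch below calls each listed function on small values (chosen only for illustration) and then evaluates the nested call expression one subexpression at a time. The variable names are hypothetical and are not used elsewhere in the assignment.

```python
# Each built-in function from the table, applied to small example values.
abs_value = abs(-4)          # absolute value of -4
largest = max(2, 9, 5)       # largest of its arguments
smallest = min(2, 9, 5)      # smallest of its arguments
power = pow(2, 3)            # 2 raised to the power 3
rounded = round(2.8)         # nearest integer to 2.8

# The nested call expression from the text, broken into subexpressions.
inner_min = min(pow(2, 10), -1 * pow(2, 10))   # min(1024, -1024)
nested = max(round(2.8), inner_min)            # max(3, -1024)

print(abs_value, largest, smallest, power, rounded, nested)
```

Naming each intermediate value makes it easier to see which subexpression feeds into which function call.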

### Python Statements

A **statement** is a whole line of code.  Some statements are just expressions.  The expressions listed above are examples.

Other statements *make something happen* rather than *having a value*. For example, an **assignment statement** assigns a value to a name. In R we learned that assignment could be done using `<-` or `=`. In Python we will only be using `=`.

As a reminder, think about it this way: we're **evaluating the right-hand side** of the equals sign and **assigning it to the left-hand side**. Here are some assignment statements:
    
* `height = 65`

* `absolute_height_difference = abs(height-72)`

In [None]:
height = 65
height

In [None]:
absolute_height_difference = abs(height-72)
absolute_height_difference

An important idea in programming is that large, interesting things can be built by combining many simple, uninteresting things. The key to understanding a complicated piece of code is breaking it down into its simple components.

For example, a lot is going on in the last statement above, but it's really just a combination of a few things.


* `height` is a name expression assigned in the first line

* `72` is a number expression whose value is `72`

* `height-72` is an arithmetic expression whose value is `-7`

* `abs` is a name expression that refers to a function; calling it returns the absolute value of the arithmetic expression `height-72`, which is `7`

* `absolute_height_difference` is the name on the left-hand side of the assignment statement, which assigns the value `7` to that name

Run the cell below.

In [None]:
height = 65
absolute_height_difference = abs(height-72)
absolute_height_difference

Now let's look at some built-in Python functions.

Exponents are fundamental in computing and appear widely (especially in Base 2 and Base 16) in physics and electronics formulas. Powers of 2 are the basis of binary (Base 2) arithmetic, which underlies all computing:


$$2^0=1, \text{ } 2^1=2, \text{ } 2^2=4, \text{ } 2^3=8, \text{ } 2^4=16, \text{ } 2^5=32, \text{ } 2^6=64, \text{ } \ldots$$
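
The powers listed above can be reproduced with the built-in `pow` function. This is a minimal sketch; the names `powers_of_two` and `one_kb` are hypothetical, and `one_kb` assumes the binary convention in which 1 KB is $2^{10}$ bytes.

```python
# The first few powers of 2, computed with the built-in pow function.
powers_of_two = [pow(2, exponent) for exponent in range(7)]
print(powers_of_two)   # [1, 2, 4, 8, 16, 32, 64]

# pow(base, exponent) gives the same result as the ** operator.
one_kb = pow(2, 10)    # bytes in 1 KB under the binary convention
print(one_kb, one_kb == 2 ** 10)
```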

**Question 1.** Use the `pow` function to compute the number of bytes in 1 GB (Gigabyte) as 2 to the power of 30. Assign this value to `one_gb`. To earn all the points for this question you **must** use the `pow` function. 

**Note:** 1GB $=$ 1073741824 bytes. 

**Hint:** Documentation for the `pow` function can be found [here](https://www.w3schools.com/python/ref_func_pow.asp).

In [None]:
one_gb = ...
one_gb

In [None]:
grader.check("q1")

### Nested Expressions

Function calls and arithmetic expressions can themselves contain expressions.  You saw an example in a previous cell: `abs(height-72)` has a name expression and a number expression, inside a subtraction expression, inside a function call expression.

Suppose we are interested in heights that are very unusual. We'll say that a height is unusual to the extent that it's far away from the average human height. [An estimate of the average adult human height](http://www.wecare4eyes.com/averageemployeeheights.htm) (averaging, we hope, over all humans on Earth today) is 1.688 meters. 

Most NBA basketball players are *far* away from the average human height. From 1988 to 1997, Muggsy Bogues played for the Charlotte Hornets. He is the shortest player in NBA history at 1.6002 meters tall; his height is therefore $|1.6002-1.688|$, or $0.0878$ meters, away from the average.

Here's how we'd write that in one line of Python code:

In [None]:
round(abs(1.6002-1.688), 4)

What's going on here?  `round` takes two arguments, and `abs` takes one argument. The stuff inside the parentheses after `abs` is all part of that **single argument**.  Specifically, the argument is the value of the expression `1.6002-1.688`.  The value of that expression is `-0.08779999999999988`.  That value is the argument to `abs`.  The absolute value of that is `0.08779999999999988`. So `0.08779999999999988` is the value of the full expression `abs(1.6002-1.688)`. 

The value `0.08779999999999988` is the first argument of the `round` function and `4` is the second argument. The value `0.08779999999999988` is rounded to four decimal places, giving `0.0878`.

Picture simplifying the expression in several steps:

1. `round(abs(1.6002-1.688), 4)`

2. `round(abs(-0.08779999999999988), 4)`

3. `round(0.08779999999999988, 4)`

4. `0.0878`

In fact, that's basically what Python does to compute the value of the expression. Run the code cells below to see for yourself.

In [None]:
# Subtract 
1.6002-1.688

In [None]:
# The output from 1.6002-1.688
# is used as input for the abs function
abs(-0.08779999999999988)

In [None]:
# The output from abs(-0.08779999999999988)
# is used as input for the round function
# and is rounded to 4 decimal places
round(0.08779999999999988, 4)

**Question 2.** Given the heights of three players from the [NC State Women's Basketball team](https://gopack.com/sports/womens-basketball/roster), write an expression that computes the smallest difference between any two of the three heights, rounded to 3 decimal places. Apart from the number of decimal places passed to `round`, your expression shouldn't have any numbers in it, only function calls and the names `rbaldwin`, `djohnson`, and `jboyd`. Give the value of your expression the name `min_height_difference`.

In [None]:
# The three players' heights, in meters:
rbaldwin = 1.9558 # River Baldwin is 6'5"
djohnson = 1.6509 # Diamond Johnson is 5'5"
jboyd = 1.8796    # Jada Boyd is 6'2"
             
# We'd like to look at all 3 pairs of heights, 
# compute the absolute difference between each pair, 
# and then find the smallest of those 3 absolute differences.  

# This is left to you. 
# If you're stuck, try computing the value for each step of the process 
# (like the difference between River's height and Jada's height) 
# on a separate line and giving it a name (like river_jada_height_diff)
min_height_difference = ...

# Again, we've written this here so that
# the distance you compute will get printed 
# when you run this cell.
min_height_difference

In [None]:
grader.check("q2")

## Level II

### Importing code

Most programming involves work that is very similar to work that has been done before. Since writing code is time-consuming, it's good to rely on others' published code when you can. Rather than copy-pasting, Python allows us to **import modules**. A module is a file with Python code that has defined variables and functions. By importing a module, we are able to use its code in our own notebook.

In [None]:
# Import the math module
import math

radius = 5

# Use the value of pi from the math module
# math.pi is the value pi from the math module
area_of_circle = radius**2*math.pi
area_of_circle

In the code above, the line `import math` imports the math module. This statement loads the module and then assigns the name `math` to it. We are now able to access any variables or functions defined within `math` by typing the name of the module followed by a dot, then followed by the name of the variable or function we want. 

For example, in the previous code cell we used `math.pi` to calculate the area of a circle.
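
The same dot notation works for functions in the module. The sketch below uses `math.sqrt` and `math.floor` on values chosen only for illustration; the variable names are hypothetical.

```python
import math

# Names defined in the math module are reached with dot notation.
side = 3
diagonal = math.sqrt(side**2 + 4**2)   # square root of 9 + 16
rounded_down = math.floor(2.8)         # largest integer <= 2.8

print(diagonal, rounded_down)
```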

**Question 3.** The module `math` also provides the name `e` for the base of the natural logarithm, which is roughly 2.718.  Compute 

$$\large e^{\pi}-\pi$$ 

giving it the name `near_twenty`. Do not round this value.

**Note:** If you're curious as to why this quantity is *almost* 20 you can read this [article on Mathematics Stackexchange](https://math.stackexchange.com/questions/724872/why-is-e-pi-pi-so-close-to-20).

In [None]:
near_twenty = ...
near_twenty

In [None]:
grader.check("q3")

### `NumPy`

Very often in data science, we want to work with collections of numbers. Sometimes it will be useful to create a set of numbers that are evenly spaced within some range. For example, in class we used the `arange` function from the `NumPy` module. If you don't remember the syntax, click [here](https://www.geeksforgeeks.org/numpy-arrange-in-python/).
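
As a reminder of how the arguments behave, here is a minimal sketch using multiples of 5 (values chosen only for illustration; the array names are hypothetical). Note that the `stop` value itself is never included in the result.

```python
import numpy as np

# np.arange(start, stop, step) includes start but stops before stop.
up_to_45 = np.arange(5, 50, 5)   # 5, 10, ..., 45 (50 is excluded)
up_to_50 = np.arange(5, 51, 5)   # 5, 10, ..., 50 (stop moved past 50)

print(up_to_45)
print(up_to_50)
```

To include the last value you want, pick a `stop` that is slightly larger than it.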

**Question 4.** Import `numpy` as `np` and then use `np.arange` to create an array named `multiples_of_99` that contains the multiples of 99 from 99 up to **and including** 9999. So its items will be 99, 198, 297, $\dots$, 9999.

In [None]:
# Import numpy
import ... as np

# Create the numpy array
multiples_of_99 = ...
multiples_of_99

In [None]:
grader.check("q4")

### Working with Array Elements

Let's work with a more interesting array of values.  The next cell creates an array called `population_array` that includes estimated world populations in every year from **1960** to roughly the present. The estimates come from the US Census Bureau website.

Rather than type in the data manually, we've loaded a file for you called `world_population.csv`. 

Run the cell below to load the data.

In [None]:
population_array = np.loadtxt('data/world_population.csv', delimiter=',', skiprows=1) 
population_array

Run the cell below to review how Python assigns index values to array items.

In [None]:
# The 1st item in the array is the population in 1960 (which is 1960 + 0).
population_array[0]

The value of that expression is the number 3032156070 (around 3 billion), because that's the first item in the array `population_array`.
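
The same indexing rule can be seen on a smaller array. The sketch below uses made-up values and a hypothetical starting year of 2000, so the index of any year is that year minus the starting year.

```python
import numpy as np

# Hypothetical yearly measurements starting in 2000 (made-up values).
start_year = 2000
values = np.array([10.0, 12.5, 15.0, 17.5, 20.0])

first_item = values[0]                   # item for 2000
item_2003 = values[2003 - start_year]    # index 3, the 4th item
last_item = values[-1]                   # negative indices count from the end

print(first_item, item_2003, last_item)
```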

In [None]:
# The 13th item in the array is the population in 1972 (which is 1960 + 12).
population_1972 = population_array[12]
population_1972

**Question 5.** Set `population_1999` to the world population in 1999, by accessing the appropriate item from `population_array`.

In [None]:
population_1999 = ...
population_1999

In [None]:
grader.check("q5")

### Doing Something to Every Element of an Array

Arrays are primarily useful for doing the same operation many times. Arithmetic works elementwise on arrays, meaning that if you perform an arithmetic operation (like subtraction, division, etc.) on an array, Python will do the operation to every item of the array individually and return an array of all of the results. For example, you can divide all the population numbers by 1 billion to get numbers in billions.
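
Here is a minimal sketch of elementwise arithmetic on a small array of made-up distances; the array names are hypothetical.

```python
import numpy as np

# Made-up distances in meters.
distances_m = np.array([1500.0, 2500.0, 10000.0])

# One arithmetic expression is applied to every item.
distances_km = distances_m / 1000
shifted = distances_m - 500

print(distances_km)
print(shifted)
```

The original array is left unchanged; each operation returns a new array of results.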

**Question 6.** Divide all the population numbers by 1 billion to get numbers in billions. Save the result to `population_in_billions`. 

In [None]:
population_in_billions = ...
population_in_billions

In [None]:
grader.check("q6")

In the previous question we changed the values of array elements to get the population in billions. This left us with quite a few numbers after the decimal place.

**Question 7.** Round the values in the `population_in_billions` array to one decimal place. Save this result to `population_rounded`.

**Hint:** To perform an operation on every item in an array you need to use array functions from the `NumPy` module. 
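
As an illustration of an array function, the sketch below applies `np.round` to a small array of made-up values (the array names are hypothetical). The second argument is the number of decimal places to keep.

```python
import numpy as np

# Made-up measurements with many digits after the decimal point.
measurements = np.array([1.23456, 2.71828, 3.14159])

# np.round(array, decimals) rounds every item of the array.
rounded_measurements = np.round(measurements, 2)
print(rounded_measurements)
```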

In [None]:
population_rounded = ...
population_rounded

In [None]:
grader.check("q7")

## Level III

### `pandas`

`pandas` is an open-source Python package that is widely used for data science, data analysis, and machine learning tasks. It is built on top of another library named `NumPy`, which provides support for arrays. Since we know how to perform operations on `NumPy` arrays, we can operate on columns in a `pandas` dataframe. 

#### Indeed

Indeed calculated the percentage change in seasonally-adjusted job postings starting from February 1, 2020, using a 7-day trailing average. February 1, 2020, is the pre-pandemic baseline. Indeed seasonally adjusts each series based on historical patterns in 2017, 2018, and 2019. Each series, including the national trend, occupational sectors, and sub-national geographies, is seasonally adjusted separately. Indeed switched to this new methodology in January 2021 and now reports all historical data using this new methodology. Historical numbers have been revised and may differ significantly from originally reported values. The new methodology applies a detrended seasonal adjustment factor to the percentage change in job postings. For more information, see [Indeed Hiring Lab Data Repository](https://github.com/hiring-lab/job_postings_tracker).

The data file `indeed_data.csv` contains information for Raleigh-Cary (`ral_cary`), Charlotte-Concord-Gastonia (`char_con_gast`), and Greensboro-High Point (`gso_hp`).

Let's import `pandas` as `pd`, then use `pd.read_csv` to load the `indeed_data.csv` file into a `pandas` `DataFrame` named `indeed`.

**Note:** The `indeed_data.csv` file is located in the data folder. This is why `'data/indeed_data.csv'` is inside the parentheses of the `pd.read_csv` function.

In [None]:
# Import the pandas module as pd
import pandas as pd

# Read the .csv file
indeed = pd.read_csv('data/indeed_data.csv')

# Show the first 10 rows in the dataframe
indeed.head(10)

The `.head(10)` method is used to display the first 10 observations (rows) in the dataframe.

To access the items in a column of a dataframe we can use the column label (name) in two ways:

- `<dataframe name>['<column name>']`
    
- `<dataframe name>.<column name>`

For example, to access the values in the column labeled `ral_cary` we could enter

```
indeed['ral_cary']
```
or 

```
indeed.ral_cary
```

In [None]:
indeed['ral_cary']

In [None]:
indeed.ral_cary

We can use functions from the `NumPy` module to perform calculations on numerical columns in a dataframe. For example, some of the functions we can use are

- `np.average`

- `np.min`

- `np.max`

- `np.sort`
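
The sketch below applies each of these functions to a column of a small, made-up dataframe. The dataframe `toy` and its column `change` are hypothetical and unrelated to the Indeed data.

```python
import numpy as np
import pandas as pd

# A small made-up dataframe; the column name is hypothetical.
toy = pd.DataFrame({'change': [4.0, -2.5, 10.0, 0.5]})

column = toy['change']
average_change = np.average(column)   # mean of the column
smallest_change = np.min(column)      # smallest value
largest_change = np.max(column)       # largest value
ordered = np.sort(column)             # values in ascending order

print(average_change, smallest_change, largest_change)
print(ordered)
```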

**Question 8.** Use `np.max` to find the largest positive percentage change in job postings from the Raleigh-Cary region. Save this to `max_pos_ral_cary`.

In [None]:
max_pos_ral_cary = ...
max_pos_ral_cary

In [None]:
grader.check("q8")

**Question 9.** Find the largest negative percentage change in job postings from the Raleigh-Cary region. Save this to `max_neg_ral_cary`.

In [None]:
max_neg_ral_cary = ...
max_neg_ral_cary 

In [None]:
grader.check("q9")

For **Question 10** and **Question 11** let's look at a dataset about something that we're very familiar with, **coffee**. Run the cell below to load the `coffee.csv` dataset.

**Note:** There are no autograded checks for **Question 10** and **Question 11**. If you aren't sure of your answer, contact a classmate (maybe someone from your group project) or post to the Moodle discussion forum.

In [None]:
coffee = pd.read_csv('data/coffee.csv')
coffee.head(5)

Let's get some basic information about the dataframe.

<!-- BEGIN QUESTION -->

**Question 10.** Choose a numeric column. Then find the median value in the column.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 11.** Choose a year. Then subset the `coffee` dataframe to find all the observations that match your choice for year.

In [None]:
...

<!-- END QUESTION -->

## Level IV

While we were studying R, we learned how to use the `group_by()` function. `pandas` has a similar method named `.groupby()`. To complete the last two questions, you'll need to use the `.groupby()` method. Even though we have not covered this during class, you have the internet. Look [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to read the documentation provided by pydata. Also, a link to an example has been provided in each question. 
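
To give a feel for grouping before you read the documentation, here is a minimal sketch on a small, made-up dataframe. The dataframe `pets` and its columns are hypothetical and unrelated to the coffee data.

```python
import pandas as pd

# A small made-up dataframe; column names and values are hypothetical.
pets = pd.DataFrame({
    'species': ['cat', 'dog', 'cat', 'dog', 'bird'],
    'weight_kg': [4.0, 20.0, 5.0, 30.0, 0.1],
})

# Group the rows by the values in the species column.
by_species = pets.groupby('species')

# get_group returns the rows that belong to a single group.
cats = by_species.get_group('cat')
print(cats)

# Aggregations are computed once per group.
mean_weights = by_species['weight_kg'].mean()
print(mean_weights)
```

Grouping by itself does not compute anything visible; you see results once you ask for a specific group or apply an aggregation.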

**Question 12.** Use the `.groupby()` method on the categorical column `Location.Country` in the `coffee` dataframe. Then click [here](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/) to follow the steps for **Example #1** from the website to access the dataframe for China.

In [None]:
chi = ...
...

In [None]:
grader.check("q12")

## Level V


**Question 13.** You can group on multiple columns. This time add the column `Data.Type.Processing.method` to your code from **Question 12**. To see an example on how to group using multiple columns click [here](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/) and follow **Example #2**.

In [None]:
col = ...

# The line of code below will get the group that has Location.Country = 'Colombia'
# and Data.Type.Processing.method = 'Natural / Dry'. 
# To get a group with more than one key you must pass the get_group method a tuple
# that contains the keys for the group.
col.get_group(('Colombia', 'Natural / Dry'))

In [None]:
grader.check("q13")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Or, find the .zip file in the left side of the screen and right-click and select **Download**. You'll submit this .zip file for the assignment in Moodle to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)