In [62]:
# Initialize Otter
import otter
grader = otter.Notebook()

## Lab 12: Pandas Overview

To receive credit for a lab, answer all questions correctly and submit before the deadline.

**This lab is due Monday, May 1st at 11:59 PM.**

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing data frames (i.e. selecting rows and columns)
* Filtering data (using boolean arrays)

In this lab you are going to use several pandas methods, such as `drop` and `loc`. You may press `shift+tab` on the method parameters to see the documentation for that method. If you are familiar with the `datascience` library used in DSCI 101/102, the conversion reference notebook included with the assignment may be useful.



**Note**: The Pandas interface is notoriously confusing, and the documentation is not consistently great. Throughout the semester, you will have to search through Pandas documentation and experiment, but remember it is part of the learning experience and will help shape you as a data scientist!

In [63]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a table in which each column has a type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).
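
As a minimal sketch of that definition, the snippet below builds a tiny table and inspects its column index, row index, and per-column types. The table `toy` and its values are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Hypothetical two-row table, used only to illustrate the definition above.
toy = pd.DataFrame({"fruit": ["apple", "kiwi"], "weight_g": [182, 75]})

column_labels = list(toy.columns)  # index over the columns (string labels)
row_labels = list(toy.index)       # index over the rows (0, 1, ... by default)
column_types = toy.dtypes          # one dtype per column

print(column_labels)
print(row_labels)
print(column_types)
```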

The [docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for the pandas `DataFrame` class  provide at least two syntaxes to create a data frame.

**Syntax 1:** You can create a data frame by specifying the columns and values using a dictionary as shown below.

The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.

In [64]:
fruit_info = pd.DataFrame(
    data={'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink']
          })
fruit_info

Unnamed: 0,fruit,color
0,apple,red
1,orange,orange
2,banana,yellow
3,raspberry,pink


**Syntax 2:** You can also define a dataframe by specifying the rows like below.

Each row corresponds to a distinct tuple, and the columns are specified separately.

In [65]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

Unnamed: 0,color,fruit
0,red,apple
1,orange,orange
2,yellow,banana
3,pink,raspberry


You can obtain the dimensions of a dataframe by using the shape attribute `dataframe.shape`.

In [66]:
fruit_info.shape

(4, 2)

You can also convert the entire dataframe into a two-dimensional numpy array.

In [67]:
fruit_info.values

array([['apple', 'red'],
       ['orange', 'orange'],
       ['banana', 'yellow'],
       ['raspberry', 'pink']], dtype=object)
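
The pandas documentation also offers `DataFrame.to_numpy()` for the same conversion and recommends it over `.values`. A small sketch, using a made-up two-row table rather than `fruit_info`:

```python
import numpy as np
import pandas as pd

# Hypothetical table standing in for fruit_info.
toy = pd.DataFrame({"fruit": ["apple", "orange"], "color": ["red", "orange"]})

as_array = toy.to_numpy()  # two-dimensional NumPy array, one row per record
print(as_array.shape)
print(as_array[0, 1])
```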

### Question 1a

For a DataFrame `d`, you can add a column with `d['new column name'] = ...` and assign a list or array of values to the column. Add a column of integers containing 1, 2, 3, and 4 called `rank1` to the `fruit_info` table which expresses your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty). 

<!--
BEGIN QUESTION
name: q1a
-->

In [68]:
fruit_info['rank1'] = [3, 1, 2, 4]
fruit_info

Unnamed: 0,fruit,color,rank1
0,apple,red,3
1,orange,orange,1
2,banana,yellow,2
3,raspberry,pink,4


In [69]:
grader.check("q1a")

### Question 1b

You can also add a column to `d` with `d.loc[:, 'new column name'] = ...`. As discussed in lecture, the first parameter is for the rows and second is for columns. The `:` means change all rows and the `new column name` indicates the column you are modifying (or in this case, adding). 

Add a column called `rank2` to the `fruit_info` table which contains the same values in the same order as the `rank1` column.

<!--
BEGIN QUESTION
name: q1b
-->

In [70]:
fruit_info.loc[:, 'rank2'] = [3, 1, 2, 4]
fruit_info

Unnamed: 0,fruit,color,rank1,rank2
0,apple,red,3,3
1,orange,orange,1,1
2,banana,yellow,2,2
3,raspberry,pink,4,4


In [71]:
grader.check("q1b")

### Question 2

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) both the `rank1` and `rank2` columns you created. (Make sure to use the `axis` parameter correctly.) Note that `drop` does not change a table, but instead returns a new table with fewer columns or rows unless you set the optional `inplace` parameter.

*Hint*: Look through the documentation to see how you can drop multiple columns of a Pandas dataframe at once using a list of column names.

<!--
BEGIN QUESTION
name: q2
-->

In [72]:
fruit_info_original = fruit_info.drop(['rank1', 'rank2'], axis=1)
fruit_info_original

Unnamed: 0,fruit,color
0,apple,red
1,orange,orange
2,banana,yellow
3,raspberry,pink


In [73]:
grader.check("q2")
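
To illustrate the point that `drop` returns a new table, here is a short sketch on a made-up frame. It uses the `columns=` keyword, which is equivalent to passing the labels with `axis=1`; the frame `toy` is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical frame with two throwaway rank columns.
toy = pd.DataFrame({"fruit": ["apple", "orange"],
                    "rank1": [2, 1],
                    "rank2": [2, 1]})

# Equivalent to toy.drop(["rank1", "rank2"], axis=1)
trimmed = toy.drop(columns=["rank1", "rank2"])

print(list(trimmed.columns))  # only the remaining column
print(list(toy.columns))      # the original frame is unchanged
```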

### Question 3

Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_original` so they begin with capital letters. Set this new dataframe to `fruit_info_caps`.
<!--
BEGIN QUESTION
name: q3
-->

In [74]:
fruit_info_caps = fruit_info_original.rename(columns={'fruit': 'Fruit', 'color': 'Color'})
fruit_info_caps

Unnamed: 0,Fruit,Color
0,apple,red
1,orange,orange
2,banana,yellow
3,raspberry,pink


In [75]:
grader.check("q3")
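
Besides a dictionary, `rename` also accepts a function that is applied to every label. The sketch below passes Python's built-in `str.capitalize` to a made-up frame. Note that `str.capitalize` lowercases everything after the first character, so it would not suit labels such as `TEAM_NAME`.

```python
import pandas as pd

# Hypothetical frame with lowercase column names.
toy = pd.DataFrame({"fruit": ["apple"], "color": ["red"]})

# The callable is applied to each column label in turn.
renamed = toy.rename(columns=str.capitalize)
print(list(renamed.columns))
```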

### NBA dataset
Now that we have learned the basics, let's move on to the NBA dataset. This dataset contains records of NBA matchups in the 2018 season along with multiple stats. 

In [76]:
nba = pd.read_csv("nba.csv")

In [77]:
len(nba)

2460

In [78]:
nba.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
0,22018,1610612744,GSW,Golden State Warriors,21800002,2018-10-16,GSW vs. OKC,W,240,42,...,41,58,28,7,7,21,29,108,8,1
1,22018,1610612760,OKC,Oklahoma City Thunder,21800002,2018-10-16,OKC @ GSW,L,240,33,...,29,45,21,12,6,15,21,100,-8,1
2,22018,1610612755,PHI,Philadelphia 76ers,21800001,2018-10-16,PHI @ BOS,L,240,34,...,41,47,18,8,5,16,20,87,-18,1
3,22018,1610612738,BOS,Boston Celtics,21800001,2018-10-16,BOS vs. PHI,W,240,42,...,43,55,21,7,5,15,20,105,18,1
4,22018,1610612750,MIN,Minnesota Timberwolves,21800010,2018-10-17,MIN @ SAS,L,240,39,...,32,46,20,9,2,11,27,108,-4,1


## Slicing Data Frames - selecting rows and columns


### Selection Using Label/Index (using loc)

**Column Selection** 

To select a column of a `DataFrame` by column label, a safe and explicit way is to use the `.loc` [indexer](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage of `.loc` looks like `df.loc[rowname, colname]`. (Reminder that the colon `:` means "everything.") For example, if we want the `color` column of the `fruit_info` data frame, we would use: `fruit_info.loc[:, 'color']`

- You can also slice across columns. For example, `nba.loc[:, 'TEAM_NAME':]` would select the column `TEAM_NAME` and all columns after `TEAM_NAME`.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `df['colname']`.
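
The two bullets above can be sketched on a small made-up frame (the column names `a`, `b`, `c` are placeholders):

```python
import pandas as pd

# Hypothetical three-column frame.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

from_b = toy.loc[:, "b":]  # column "b" and every column after it
one_col = toy["a"]         # shorthand for toy.loc[:, "a"]; returns a Series

print(list(from_b.columns))
print(type(one_col).__name__)
```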

**Row Selection**

Similarly, if we want to select a row by its label, we can use the same `.loc` indexer. In this case, the "label" of each row refers to the index (i.e. primary key) of the dataframe.

In [79]:
#Example:
nba.loc[2:5, 'TEAM_NAME']

2        Philadelphia 76ers
3            Boston Celtics
4    Minnesota Timberwolves
5         San Antonio Spurs
Name: TEAM_NAME, dtype: object

In [80]:
#Example: Notice the difference between these two selections
#Just passing in 'TEAM_NAME' returns a Series while ['TEAM_NAME'] returns a DataFrame
nba.loc[2:5, ['TEAM_NAME']]

Unnamed: 0,TEAM_NAME
2,Philadelphia 76ers
3,Boston Celtics
4,Minnesota Timberwolves
5,San Antonio Spurs


`.loc` uses the Pandas row index (the labels) rather than the position of rows in the dataframe to perform the selection. Also, notice that if you write `2:5` with `loc[]`, contrary to normal Python slicing, the end index is included, so you get the row with index 5.
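
A quick side-by-side of the inclusive end point, using a made-up frame with the default 0, 1, 2, ... index:

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e", "f"]
toy = pd.DataFrame({"letter": letters})  # hypothetical data, default integer index

by_label = toy.loc[2:5, "letter"]  # labels 2, 3, 4 and 5: end label included
plain = letters[2:5]               # positions 2, 3 and 4: end excluded

print(list(by_label))
print(plain)
```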


### Selection using Integer location (using iloc)

In lecture we discussed another pandas feature `iloc[]` which lets you slice the dataframe by row position and column position instead of by row index and column label (which is the case for `loc[]`). This is really the main difference between the 2 functions and it is **important** that you remember the difference and why you might want to use one over the other. In addition, with `iloc[]`, the end index is NOT included, like with normal Python slicing.

As a mnemonic, remember that the i in `iloc` means "integer". 

Below, we have sorted the `nba` dataframe. Notice how the *position* of a row is not necessarily equal to the *index* of a row. For example, the first row is not necessarily the row associated with index 0. This distinction is important in understanding the difference between `loc[]` and `iloc[]`.

In [81]:
sorted_nba_teams = nba.sort_values(by=['TEAM_NAME'])
sorted_nba_teams.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
1461,22018,1610612737,ATL,Atlanta Hawks,21800734,2019-01-26,ATL @ POR,L,240,42,...,27,36,24,9,7,11,27,111,-9,1
367,22018,1610612737,ATL,Atlanta Hawks,21800189,2018-11-11,ATL @ LAL,L,240,41,...,34,44,25,11,4,19,23,106,-1,1
546,22018,1610612737,ATL,Atlanta Hawks,21800268,2018-11-23,ATL vs. BOS,L,240,35,...,35,40,22,9,7,21,26,96,-18,1
156,22018,1610612737,ATL,Atlanta Hawks,21800076,2018-10-27,ATL vs. CHI,L,240,27,...,37,48,20,9,8,22,13,85,-12,1
1307,22018,1610612737,ATL,Atlanta Hawks,21800653,2019-01-15,ATL vs. OKC,W,240,56,...,34,44,36,8,4,19,28,142,16,1


Here is an example of how we would get the 2nd, 3rd, and 4th rows with only the `TEAM_NAME` column of the `nba` dataframe using both `iloc[]` and `loc[]`. Observe the difference, especially after sorting `nba` by name.

In [82]:
sorted_nba_teams.iloc[1:4, 3]

367    Atlanta Hawks
546    Atlanta Hawks
156    Atlanta Hawks
Name: TEAM_NAME, dtype: object

Notice that using `loc[]` with 1:4 gives different results, since it selects using the *index*.

In [83]:
sorted_nba_teams.loc[1:4, "TEAM_NAME"]

Series([], Name: TEAM_NAME, dtype: object)
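
The empty result can be reproduced on a small frame. As far as I understand pandas' behavior, when the index is unique but not sorted, a label slice runs from the position of the start label to the position of the end label, so it comes back empty if the start label sits after the end label. The frame below is hypothetical and its index is shuffled by hand.

```python
import pandas as pd

# Hypothetical frame whose index labels are out of order, as after a sort.
toy = pd.DataFrame({"team": ["A", "B", "C", "D", "E"]}, index=[4, 2, 0, 3, 1])

by_position = toy.iloc[1:4, 0]    # positions 1, 2, 3 regardless of labels
by_label = toy.loc[0:1, "team"]   # from label 0 (position 2) through label 1 (position 4)
empty = toy.loc[1:4, "team"]      # label 1 sits after label 4, so nothing is selected

print(list(by_position))
print(list(by_label))
print(len(empty))
```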

Lastly, we can change the index of a dataframe using the `set_index` method.

In [84]:
#Example: We change the index from 0,1,2... to the TEAM_NAME column
df = nba[:5].set_index("TEAM_NAME") 
df

TEAM_NAME,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
Golden State Warriors,22018,1610612744,GSW,21800002,2018-10-16,GSW vs. OKC,W,240,42,95,...,41,58,28,7,7,21,29,108,8,1
Oklahoma City Thunder,22018,1610612760,OKC,21800002,2018-10-16,OKC @ GSW,L,240,33,91,...,29,45,21,12,6,15,21,100,-8,1
Philadelphia 76ers,22018,1610612755,PHI,21800001,2018-10-16,PHI @ BOS,L,240,34,87,...,41,47,18,8,5,16,20,87,-18,1
Boston Celtics,22018,1610612738,BOS,21800001,2018-10-16,BOS vs. PHI,W,240,42,97,...,43,55,21,7,5,15,20,105,18,1
Minnesota Timberwolves,22018,1610612750,MIN,21800010,2018-10-17,MIN @ SAS,L,240,39,91,...,32,46,20,9,2,11,27,108,-4,1


We can now look up rows by name directly:

In [85]:
df.loc[['Boston Celtics'], :]

TEAM_NAME,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
Boston Celtics,22018,1610612738,BOS,21800001,2018-10-16,BOS vs. PHI,W,240,42,97,...,43,55,21,7,5,15,20,105,18,1


However, if we still want to access rows by location we will need to use the integer loc (`iloc`) accessor:

In [86]:
#Example: 
#df.loc[2:5,"GAME_ID"] no longer works: the index now holds team names, not integers
df.iloc[1:4, 2:3]

TEAM_NAME,TEAM_ABBREVIATION
Oklahoma City Thunder,OKC
Philadelphia 76ers,PHI
Boston Celtics,BOS


### Question 4

Selecting multiple columns is easy. You just need to supply a list of column names. Select the `TEAM_NAME` and `MATCHUP` columns **in that order** from the `nba` DataFrame.

<!--
BEGIN QUESTION
name: q4
-->

In [87]:
matchup = nba.loc[:, ['TEAM_NAME', 'MATCHUP']]
matchup[:5]

Unnamed: 0,TEAM_NAME,MATCHUP
0,Golden State Warriors,GSW vs. OKC
1,Oklahoma City Thunder,OKC @ GSW
2,Philadelphia 76ers,PHI @ BOS
3,Boston Celtics,BOS vs. PHI
4,Minnesota Timberwolves,MIN @ SAS


In [88]:
grader.check("q4")

Note that `.loc[]` can be used to re-order the columns within a dataframe.
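
For instance, passing the column labels in a new order returns a frame with the columns rearranged. The frame and labels below are placeholders.

```python
import pandas as pd

# Hypothetical frame with columns in the order a, b, c.
toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# The list controls both which columns are kept and their order.
reordered = toy.loc[:, ["c", "a", "b"]]
print(list(reordered.columns))
```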

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point, whether to clear up cases with missing values, to cull fishy outliers, or to analyze subgroups of your data set. Example usage looks like `df[df['column name'] < 5]`. Note that compound expressions have to be grouped with parentheses.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
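
A short sketch of how these operators combine into boolean arrays, on a made-up table of teams and points. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `==` and `>` in Python.

```python
import pandas as pd

# Hypothetical scores, not taken from nba.csv.
toy = pd.DataFrame({"team": ["X", "Y", "X", "Z"],
                    "pts": [101, 95, 120, 88]})

is_x = toy["team"] == "X"   # boolean Series, one entry per row
high = toy["pts"] > 100

both = toy[is_x & high]                                  # AND
either = toy[(toy["team"] == "Z") | (toy["pts"] > 100)]  # OR, each condition parenthesized
neither = toy[~(is_x | high)]                            # negation of the OR

print(list(both.index), list(either.index), list(neither.index))
```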

In the following we construct the DataFrame containing only the games played by the Portland Trail Blazers (both home and away).

In [89]:
ptb = nba[nba['TEAM_NAME'] == 'Portland Trail Blazers']
ptb.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
28,22018,1610612757,POR,Portland Trail Blazers,21800016,2018-10-18,POR vs. LAL,W,240,44,...,40,54,21,9,10,17,26,128,9,1
61,22018,1610612757,POR,Portland Trail Blazers,21800034,2018-10-20,POR vs. SAS,W,240,45,...,38,46,24,5,3,12,14,121,13,1
85,22018,1610612757,POR,Portland Trail Blazers,21800046,2018-10-22,POR vs. WAS,L,265,43,...,53,70,22,6,4,20,22,124,-1,1
126,22018,1610612757,POR,Portland Trail Blazers,21800064,2018-10-25,POR @ ORL,W,240,48,...,32,44,24,6,6,11,22,128,14,1
154,22018,1610612757,POR,Portland Trail Blazers,21800079,2018-10-27,POR @ MIA,L,240,37,...,34,42,14,7,8,9,25,111,-9,1


### Question 5
Using a boolean array, select the games that had a field goal percentage `FG_PCT` greater than 50% that also resulted in a win `WL`. Keep all columns from the original `nba` dataframe.

Note: Any time you use `p & q` to filter the dataframe, make sure to use `df[(df[p]) & (df[q])]` or `df.loc[(df[p]) & (df[q])]`. That is, make sure to wrap conditions with parentheses.

**Remember** that both `[]` indexing and `loc` achieve the same result here; `loc` is more explicit and is often preferred in production code. You are free to use whichever one you would like.

<!--
BEGIN QUESTION
name: q5
-->

In [90]:
result = nba[(nba['FG_PCT'] > 0.5) & (nba['WL'] == 'W')]
result.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
9,22018,1610612756,PHX,Phoenix Suns,21800013,2018-10-17,PHX vs. DAL,W,240,44,...,38,44,35,5,2,12,22,121,21,1
10,22018,1610612740,NOP,New Orleans Pelicans,21800009,2018-10-17,NOP @ HOU,W,240,52,...,40,54,36,8,3,12,25,131,19,1
19,22018,1610612754,IND,Indiana Pacers,21800005,2018-10-17,IND vs. MEM,W,240,47,...,44,57,29,2,7,20,24,111,28,1
21,22018,1610612762,UTA,Utah Jazz,21800011,2018-10-17,UTA @ SAC,W,240,41,...,39,44,21,8,4,17,19,123,6,1
38,22018,1610612744,GSW,Golden State Warriors,21800024,2018-10-19,GSW @ UTA,W,240,49,...,35,43,27,10,6,16,23,124,1,1


In [91]:
grader.check("q5")

## Array Arithmetic

**Question 1.** Multiply the numbers **42**, **4224**, **42422424**, and **-250** by **157**. Assign each variable below such that `first_product` is assigned to the result of $42 * 157$, `second_product` is assigned to the result of $4224 * 157$, and so on. 

For this question, **don't** use arrays.

In [92]:
first_product = 42*157
second_product = 4224*157
third_product = 42422424*157
fourth_product = -250*157
print(first_product, second_product, third_product, fourth_product)

6594 663168 6660320568 -39250


In [93]:
grader.check("q6_1")

**Question 2.** Now, do the same calculation, but using a numpy array called `numbers` and only a single multiplication (`*`) operator. Store the 4 results in an array named `products`.

In [108]:
# use 64-bit integers so the third product does not overflow a 32-bit integer
numbers = np.array([42, 4224, 42422424, -250], dtype=np.int64)
products = numbers * 157
products

array([      6594,     663168, 6660320568,     -39250])

In [109]:
grader.check("q6_2")
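
One thing to watch with array arithmetic: NumPy integers have a fixed width, so a product outside the dtype's range wraps around silently instead of growing like a Python `int`. The default integer width depends on the platform and NumPy version, so the sketch below sets the dtype explicitly to show both outcomes.

```python
import numpy as np

# Same value and multiplier as in the question, with explicit integer widths.
narrow = np.array([42422424], dtype=np.int32) * 157  # exceeds the int32 range and wraps
wide = np.array([42422424], dtype=np.int64) * 157    # fits comfortably in int64

print(narrow[0])
print(wide[0])
```

If an array result disagrees with the plain Python product, checking `numbers.dtype` is a reasonable first step.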

**Question 3.** We've loaded a DataFrame  of temperatures in the next cell.  

The "Daily Max Temperature" column records the highest temperature observed on a day at a climate observation station, mostly from the US.  

Since they're from the US government agency [NOAA](https://www.noaa.gov/), all the temperatures are in Fahrenheit.  

Convert the `temperatures["Daily Max Temperature"]` column of Fahrenheit temperatures to Celsius by first subtracting 32 from them, then multiplying the results by $\frac{5}{9}$. Make sure to **ROUND** the final result to the nearest integer using the `np.round` function after converting to Celsius.

Finally, cast the result to an `np.array`: `celsius_max_temperatures = np.array(...)`

In [112]:
temperatures = pd.read_csv("https://github.com/oregon-data-science/DSCI101/raw/main/data/temperatures.csv")
temperatures.head(6)

Unnamed: 0,Daily Max Temperature,Daily Min Temperature
0,25,13
1,87,69
2,89,67
3,96,72
4,79,75
5,91,81


In [115]:
# convert from Fahrenheit to Celsius (pass the Series directly so the array stays 1-D)
celsius_max_temperatures = np.array((temperatures['Daily Max Temperature'] - 32) * 5/9)
# round the result to the nearest integer
celsius_max_temperatures = np.round(celsius_max_temperatures)
celsius_max_temperatures

array([-4., 31., 32., ..., 17., 23., 16.])

In [116]:
grader.check("q6_3")
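
The conversion can be checked by hand on a few values. This sketch uses a made-up three-element Series instead of the downloaded file. Passing the Series itself to `np.array`, rather than a list containing it, keeps the result one-dimensional.

```python
import numpy as np
import pandas as pd

# Hypothetical Fahrenheit readings for illustration.
fahrenheit = pd.Series([25, 87, 89], name="Daily Max Temperature")

# Subtract 32, scale by 5/9, then round to the nearest integer.
celsius = np.round((fahrenheit - 32) * 5 / 9)
celsius_array = np.array(celsius)  # 1-D array with one entry per reading

print(celsius_array)
print(celsius_array.shape)
```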

## Submission

Make sure you have run all cells in your notebook in order. Then execute the following two commands from the File menu:

 - Save and Checkpoint

 - Close and Halt

Then upload your .ipynb file to the Canvas assignment Lab 12.