# Discussion 2

### Due Friday April 10, 11:59:59PM


---

## Lecture Review



In [1]:
import pandas as pd
import numpy as np
import os

### `DataFrames` in `Pandas` module

We can make a dataframe in `pandas` using the class constructor `pandas.DataFrame`.

For example, suppose we know that individual number one has the following attributes:

* Favorite letter: `a`
* Number of games played: `9`
* Points accumulated: `1`

and we know the same attributes for individuals two, three, and four. To construct a dataframe this way, we can use a list of lists (`numpy ndarrays` also work) where each entry corresponds to a row in the dataframe:

In [3]:
data = [
    ['a', 9, 1], # row 1
    ['b', 3, 2], # row 2
    ['c', 3, 2], # row 3
    ['z', 1, 10] # row 4
]

In [4]:
df1 = pd.DataFrame(data,                                     # rows of dataframe
                   columns = ['letter', 'count', 'points'])  # column names 
df1

|    | letter | count | points |
|----|--------|-------|--------|
| 0  | a      | 9     | 1      |
| 1  | b      | 3     | 2      |
| 2  | c      | 3     | 2      |
| 3  | z      | 1     | 10     |


Equivalently, we can define the same dataframe by specifying the columns instead of the rows:
* This way of defining a dataframe closely resembles how the dataframe stores its underlying data.
* Each column is homogeneous (represents the same type of quantity).

Here, we also specify an index:

In [5]:
dictionary = {'letter' : ['a', 'b', 'c', 'z'],  # {column name : values}
              'count'  : [ 9,   3,   3,    1],  # {column name : values}
              'points' : [ 1,   2,   2,   10]}  # {column name : values}
 
df2 = pd.DataFrame(data=dictionary, index='i0 i1 i2 i3'.split())
df2

|    | letter | count | points |
|----|--------|-------|--------|
| i0 | a      | 9     | 1      |
| i1 | b      | 3     | 2      |
| i2 | c      | 3     | 2      |
| i3 | z      | 1     | 10     |


### Summary: `DataFrame` Constructor

* `pd.DataFrame` creates a dataframe from:
    * A dictionary of columns (`df2` above)
    * A list of rows (`df1` above)
* Optional (default) arguments include:
    * `index`: can be array-like if your dataframe requires something other than a range from 0 to n-1
    * `columns`: labels may be provided for column names (similar to `'letter'`, `'count'`, and `'points'` above) 
    * `dtype`: `None` is the default, `pandas` will infer based on the content of your columns.
* Accepts any 'array-like' container (`list`, `np.ndarray`, `pd.Series`)
    * Note the difference [here](https://stackoverflow.com/questions/15879315/what-is-the-difference-between-ndarray-and-array-in-numpy) between `np.ndarray` and `np.array`!
    * The former is an actual data type, while the latter is a function to make arrays from other data structures.
* Create small DataFrames to debug and understand your code!
* DataFrame column labels:
    * Accessed using the `columns` attribute
    * Columns default to column number (0-indexed)

### Select an Index or Column From a Pandas DataFrame

In [6]:
# recall df2
df2

|    | letter | count | points |
|----|--------|-------|--------|
| i0 | a      | 9     | 1      |
| i1 | b      | 3     | 2      |
| i2 | c      | 3     | 2      |
| i3 | z      | 1     | 10     |


Suppose you want to access the value in the first row (position `0`, label `i0`) of column `count`. We saw in lecture a number of different ways to get the value `9` back.

In [7]:
df2.iloc[0]

letter    a
count     9
points    1
Name: i0, dtype: object

In [8]:
df2.iloc[0].loc['count']

9

In [9]:
df2.loc['i0'].loc['count']

9

The most important ones to remember are, without a doubt, `.loc[]` and `.iloc[]`.
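The chained lookups above can also be written as a single call that passes both the row and the column selection. The sketch below rebuilds the same small frame under the hypothetical name `df` so that it runs on its own; `.at` and `.iat` are the scalar-only counterparts of `.loc` and `.iloc`.

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                   'count':  [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]},
                  index=['i0', 'i1', 'i2', 'i3'])

by_label = df.loc['i0', 'count']      # row label, column label
by_position = df.iloc[0, 1]           # row position, column position
scalar_label = df.at['i0', 'count']   # label-based, single value only
scalar_pos = df.iat[0, 1]             # position-based, single value only

print(by_label, by_position, scalar_label, scalar_pos)  # 9 9 9 9
```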

#### `iloc`

* The `iloc` indexer for `Pandas Dataframe` is used for integer-location based indexing / selection by position.

* The `iloc` indexer syntax is `data.iloc[<row selection>, <column selection>]`. `iloc` in pandas is used to select rows and columns by number, in the order that they appear in the data frame. 

* You can imagine that each row has a row number from 0 to the total number of rows minus one (`data.shape[0] - 1`) and `iloc[]` allows selections based on these numbers. The same applies for columns (ranging from 0 to `data.shape[1] - 1`).
    * Note that `.iloc` returns a `Pandas Series` when a single integer selects **one** row or **one** column, and a `Pandas DataFrame` when both the row and column selections are slices or lists.
    * To get `DataFrame` output for a single row or column, pass a single-valued list.

In [10]:
print(type(df2.iloc[1]))        # result of type series because only one row selected

print(type(df2.iloc[[1]]))      # result of type dataframe because list selection used

print(type(df2.iloc[0:2]))      # result of type dataframe because a slice of rows is selected

print(type(df2.iloc[0:2, 1]))   # result of type series because only one column is selected

print(type(df2.iloc[0:2, [1]])) # result of type dataframe with only one column because list selection used

print(type(df2.iloc[0:2, 0:2])) # result of type dataframe because multiple rows and columns selected

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [19]:
df2.iloc[0:2, 0:2]

|    | letter | count |
|----|--------|-------|
| i0 | a      | 9     |
| i1 | b      | 3     |


####  `loc`

The `Pandas` `loc` indexer can be used with DataFrames for two different main use cases:

* Selecting rows by label/index
* Selecting rows with a boolean/conditional lookup

The `loc` indexer is used with the same syntax as `iloc`: `data.loc[<row selection>, <column selection>]`.

In [None]:
df2

In [11]:
# label/index lookup
df2.loc['i0']           # select first row

df2.loc['i0', 'count']        # select element in count column from first row

df2.loc[:, ['letter', 'count']]   # select letter and count columns; all rows

|    | letter | count |
|----|--------|-------|
| i0 | a      | 9     |
| i1 | b      | 3     |
| i2 | c      | 3     |
| i3 | z      | 1     |
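Although `loc` and `iloc` share the same syntax, they treat slices differently: a label slice with `loc` includes both endpoints, while a position slice with `iloc` excludes the stop position. A minimal sketch, rebuilding the frame under the hypothetical name `df`:

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                   'count':  [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]},
                  index=['i0', 'i1', 'i2', 'i3'])

label_slice = df.loc['i0':'i2']   # includes the row labeled 'i2'
position_slice = df.iloc[0:2]     # excludes position 2

print(list(label_slice.index))     # ['i0', 'i1', 'i2']
print(list(position_slice.index))  # ['i0', 'i1']
```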


### Boolean conditional selection with `loc`

Recall that arrays can be compared using comparison operators (`<`,`>`,`==`,...), producing boolean arrays. These boolean arrays can be used to select rows according to those comparison conditions.

In [20]:
# boolean conditional lookup. What is the output of each of these (in plain English)?
df2.loc[df2['letter'] == 'a']

df2.loc[df2['count'] == 3, ['letter']]

df2.loc[:, df2.loc['i1'].apply(type) == np.int64]

|    | count | points |
|----|-------|--------|
| i0 | 9     | 1      |
| i1 | 3     | 2      |
| i2 | 3     | 2      |
| i3 | 1     | 10     |


We can combine boolean expressions using the NOT, AND, OR, and XOR operators (`~`, `&`, `|`, `^`) to create compound expressions for selecting rows of dataframes. The table below lists the operators that can be used to create boolean arrays:

![](bool_arr.png)

For example, if you want to select all rows where `count` is 3 or `points` is 10, but not both:

In [21]:
count3 = df2['count'] == 3
points10 = df2['points'] == 10
bool_arr = count3 ^ points10   # XOR: exactly one of the two conditions holds
df2.loc[bool_arr]

|    | letter | count | points |
|----|--------|-------|--------|
| i1 | b      | 3     | 2      |
| i2 | c      | 3     | 2      |
| i3 | z      | 1     | 10     |
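The other boolean operators work the same way as XOR. The sketch below rebuilds the frame under the hypothetical name `df` and applies OR, AND, and NOT to the same two conditions. When comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                   'count':  [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]},
                  index=['i0', 'i1', 'i2', 'i3'])

count3 = df['count'] == 3
points10 = df['points'] == 10

either = df.loc[count3 | points10]       # OR: at least one condition holds
both = df.loc[count3 & points10]         # AND: both hold (no such row here)
neither = df.loc[~(count3 | points10)]   # NOT: neither condition holds

# Inline comparisons must be parenthesized.
inline = df.loc[(df['count'] == 3) & (df['points'] == 2)]

print(list(either.index))    # ['i1', 'i2', 'i3']
print(len(both))             # 0
print(list(neither.index))   # ['i0']
print(list(inline.index))    # ['i1', 'i2']
```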


## Modifying a Pandas `DataFrame`

### Adding an Index, Row, or Column to a Pandas DataFrame

#### Adding an Index to a Dataframe

* When you create a DataFrame, you have the option to add input to the `index` argument to make sure that you have the index that you desire. 
* When you don’t specify this, your `DataFrame` will have, by default, a numerically valued index that starts with 0 and continues until the last row of your `DataFrame`.
* However, even when your index is specified for you automatically, you still have the power to re-use one of your columns and make it your index. You can easily do this by calling `set_index()` on your DataFrame.

In [None]:
# let's make the 'letter' column our index
df2.set_index('letter')
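Note that `set_index` returns a new DataFrame rather than changing the one it is called on. A minimal sketch of what this looks like, using the hypothetical names `df` and `by_letter`:

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                   'count':  [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]},
                  index=['i0', 'i1', 'i2', 'i3'])

by_letter = df.set_index('letter')

print(list(by_letter.index))           # ['a', 'b', 'c', 'z']
print('letter' in by_letter.columns)   # False: the column moved into the index
print('letter' in df.columns)          # True: the original frame is unchanged

# The new index can now be used for label lookups.
print(by_letter.loc['z', 'points'])    # 10
```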

#### Resetting the Index of Your DataFrame

* When your index doesn’t look entirely the way you want it to, you can opt to reset it. 
* You can easily do this with `.reset_index()`. 
* Keep an eye on the `drop` argument: with `drop=False` (the default) the old index is kept as a new column, while `drop=True` discards it.

In [None]:
# Use `reset_index()` to reset the values. 
df2_reset = df2.reset_index(drop=False)

# Print `df2_reset`
df2_reset
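To see how the `drop` argument changes the result of `reset_index`, the sketch below compares both settings on a rebuilt copy of the frame (the names `df`, `kept`, and `dropped` are made up for this example). When the index has no name, the preserved column is called `index`.

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                   'count':  [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]},
                  index=['i0', 'i1', 'i2', 'i3'])

kept = df.reset_index(drop=False)    # old index becomes a regular column
dropped = df.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))     # ['index', 'letter', 'count', 'points']
print(list(dropped.columns))  # ['letter', 'count', 'points']
print(list(dropped.index))    # [0, 1, 2, 3]
```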

#### Deleting a Column from Your DataFrame

To get rid of (a selection of) columns from your DataFrame, you can use the `drop()` method:

In [None]:
df2.drop(['points'], axis=1)

In [None]:
# note: pandas methods return copies! Must reassign to change the dataframe
df2_dropped = df2.drop(['points'], axis=1)
df2_dropped

The `drop()` method takes a few extra arguments worth knowing:

* The `axis` argument is `0` to drop rows and `1` to drop columns.
* While Pandas has an `inplace` keyword to delete the column without having to reassign the DataFrame, **you should never use it**. Pandas code should always be written to return copies, and this keyword may be deprecated in a future release.
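As a small illustration of these arguments, `drop` also accepts a `columns` keyword that can be used instead of `axis=1`. The sketch below (with the hypothetical names `df`, `without_points`, and `same`) checks that both spellings give the same result and that the original frame is left untouched.

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                   'count':  [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]},
                  index=['i0', 'i1', 'i2', 'i3'])

without_points = df.drop(columns=['points'])   # keyword form
same = df.drop(['points'], axis=1)             # axis form

print(without_points.equals(same))   # True
print(list(without_points.columns))  # ['letter', 'count']
print('points' in df.columns)        # True: df itself was not modified
```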

You can also use `loc` to filter columns using boolean arrays:

In [None]:
df2.loc[:, ~(df2.columns == 'points')]

#### Removing a Row from Your DataFrame

You can remove rows most easily using the `loc` selector and creating appropriate conditions. There are also methods that drop rows based on common needs (`drop_duplicates`, `dropna`).

Below are methods to drop the row corresponding to index `c`:

In [None]:
df2

In [None]:
# rarely used, but works

# let's make the 'letter' column our index
df2_c = df2.set_index('letter')

df2_c.drop('c', axis=0)

In [None]:
# better: conditioning using boolean arrays
# '~' means 'not'

df2_c.loc[~(df2_c.index == 'c')]
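The convenience methods mentioned above, `drop_duplicates` and `dropna`, can be sketched on a small frame built for this purpose. The frame `df` and its values are invented for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'b', 'c'],
                   'count':  [9, 3, 3, np.nan]})

deduped = df.drop_duplicates()   # row 2 repeats row 1, so it is removed
complete = df.dropna()           # row 3 has a missing value, so it is removed

print(list(deduped.index))    # [0, 1, 3]
print(list(complete.index))   # [0, 1, 2]
```

Like `drop`, both methods return a new DataFrame, so the result must be assigned to keep it.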

## Tutorial: DataFrame Manipulation in Pandas

**Question 1**: Construct a DataFrame with columns `names`, `scores`, `attempts`, and `qualify`, with values given in the lists below, and with index `labels`. 

Once you have done this in the notebook, write a function `question01` in the `py` file that takes data (a collection of columns) and labels and that outputs this dataframe.

In [23]:
names = ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas']
scores = [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19]
attempts = [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]
qualify = ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [24]:
# work here...
data = ... # a collection of columns
samp_df = ... # the dataframe

In [28]:
dic = {'names':names,'scores':scores,'attempts':attempts,'qualify':qualify}

In [30]:
samp_df = pd.DataFrame(data=dic, index=labels)
samp_df

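|   | names     | scores | attempts | qualify |
|---|-----------|--------|----------|---------|
| a | Anastasia | 12.5   | 1        | yes     |
| b | Dima      | 9.0    | 3        | no      |
| c | Katherine | 16.5   | 2        | yes     |
| d | James     | NaN    | 3        | no      |
| e | Emily     | 9.0    | 2        | no      |
| f | Michael   | 20.0   | 3        | yes     |
| g | Matthew   | 14.5   | 1        | yes     |
| h | Laura     | NaN    | 1        | no      |
| i | Kevin     | 8.0    | 2        | no      |
| j | Jonas     | 19.0   | 1        | yes     |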
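One possible shape for the `question01` function is sketched below. It assumes that `data` is a dictionary mapping column names to lists of values, which is one reading of "a collection of columns"; the exact signature expected by the `py` file is an assumption, and the shortened inputs are for illustration only.

```python
import numpy as np
import pandas as pd

def question01(data, labels):
    """Build a DataFrame from a dict of columns, indexed by `labels`.

    Assumes `data` maps column names to equal-length lists.
    """
    return pd.DataFrame(data=data, index=labels)

# Shortened version of the lists above.
small_data = {'names': ['Anastasia', 'Dima', 'James'],
              'scores': [12.5, 9, np.nan],
              'attempts': [1, 3, 3],
              'qualify': ['yes', 'no', 'no']}
small_df = question01(small_data, ['a', 'b', 'd'])

print(list(small_df.columns))       # ['names', 'scores', 'attempts', 'qualify']
print(small_df.loc['b', 'names'])   # Dima
```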


**Question 2**: Find the index labels of the values in `ser` that are multiples of 3. In the notebook, put these index labels in a list entitled `multiples`. Once finished, put your work in `question02` of the `py` file. 

That is, create a function `question02` that takes in a series like `ser` and outputs the index-labels that correspond to values that are multiples of 3.

In [41]:
ser = pd.Series(np.random.randint(1, 10, 7))

multiples = ser[ser%3==0].index

In [42]:
ser

0    5
1    8
2    3
3    7
4    2
5    6
6    7
dtype: int32

In [44]:
multiples

Int64Index([2, 5], dtype='int64')
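The same selection can be wrapped in a function. The sketch below is one possible shape for `question02`; returning a plain Python list (via `.tolist()`) rather than an `Index` object is an assumption based on the wording of the question. The fixed series `example` reuses the values printed above so the result is reproducible.

```python
import pandas as pd

def question02(ser):
    """Return the index labels of the values in `ser` that are multiples of 3."""
    return ser[ser % 3 == 0].index.tolist()

example = pd.Series([5, 8, 3, 7, 2, 6, 7])
print(question02(example))   # [2, 5]

# The function returns labels, not positions, so a custom index works too.
labeled = pd.Series([3, 4, 9], index=['x', 'y', 'z'])
print(question02(labeled))   # ['x', 'z']
```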