# Lab 2: Pandas Overview

## Due on 09/05/2017 at 11:59pm (Graded on Accuracy)

Pandas is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing data frames (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)
* Data Aggregation/Grouping dataframes
* Joining tables
* Handling NA/Null values

## Setup

In [None]:
import pandas as pd
import numpy as np

# These lines load the tests.
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab02.ok')

## Creating DataFrames & Basic Manipulations

A dataframe is a two-dimensional labeled data structure with columns of potentially different types.

**Method 1:** You can create a data frame by specifying the columns and values as shown below.

In [None]:
fruit_info = pd.DataFrame(
    data={'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink']
          })
fruit_info

**Method 2:** You can also define a dataframe by specifying the rows as shown below.

In [None]:
fruit_info2 = pd.DataFrame([("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
                            ("pink", "raspberry")], columns = ["color", "fruit"])
fruit_info2

### Question 1

You can add a column by `dataframe['new column name'] = [data]`. Please add a column called `rank` to the `fruit_info` table which contains a 1,2,3, or 4 based on your personal preference ordering for each fruit.


In [None]:
...

In [None]:
_ = ok.grade('q01')
_ = ok.backup()

### Question 2

You can obtain the dimensions of a dataframe by using the shape attribute `dataframe.shape`, which returns a `(rows, columns)` tuple. How many rows and columns are in the dataframe you modified above?

In [None]:
num_rows = ...
num_columns = ...

In [None]:
_ = ok.grade('q02')
_ = ok.backup()

### Question 3

Use the `.drop()` method to drop the `rank` column you created.

In [None]:
fruit_info_original = ...

In [None]:
_ = ok.grade('q03')
_ = ok.backup()

### Question 4 

Use the `.drop()` method to drop the last row of the `fruit_info_original` table. (Hint: pay attention to the `axis` argument!)
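To make the role of `axis` concrete, here is a minimal sketch on a made-up `table` data frame (hypothetical, not one of the lab tables). It assumes the default integer index.

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
table = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, columns=['x', 'y'])

# axis=0 (the default): the label refers to the row index.
without_row = table.drop(0, axis=0)

# axis=1: the label refers to a column name.
without_col = table.drop('y', axis=1)

# .drop() returns a new data frame; `table` itself still has 3 rows and 2 columns.
```

Because `.drop()` does not modify the data frame in place by default, the result needs to be assigned to a name if you want to keep it.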

In [None]:
...

In [None]:
_ = ok.grade('q04')
_ = ok.backup()

### Question 5

Use the `.rename()` method to rename the columns of `fruit_info_original` so they begin with a capital letter.
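As a rough sketch of how `.rename()` is typically called, the example below renames the columns of a made-up `measurements` data frame (hypothetical names and values) using a dictionary that maps old names to new ones.

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
measurements = pd.DataFrame({'ht': [150, 160], 'wt': [50, 60]},
                            columns=['ht', 'wt'])

# Map each old column name to its new name; unlisted columns are left alone.
renamed = measurements.rename(columns={'ht': 'height_cm', 'wt': 'weight_kg'})

# Like .drop(), .rename() returns a new data frame and leaves the original as is.
```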

In [None]:
...

In [None]:
_ = ok.grade('q05')
_ = ok.backup()

Now that we have learned the basics, we create two dataframes below. We will be cleaning and wrangling these data frames for the remainder of the lab.

In [None]:
popular_songs = pd.DataFrame(
    data={'song name': ['Thinking Out Loud', 'One Dance', 'Sorry', 
                    'Closer', 'Despacito', 'Lean On'],
          'number of streams': [770, 1011, 828, 678, 500, 909],
         'artist': ["Ed Sheeran", "Drake", "Justin Bieber", "Chainsmokers", "Justin Bieber", "Major Lazer"]
         }
)

top_2017_albums = pd.DataFrame(
    data={'album name': ['Starboy', 'Divide', 'More Life',
                  '24k Magic', 'A Head Full of Dreams', 
                  'A Head Full of Dreams'],
          'artist': ['The Weeknd', 'Ed Sheeran', 'Drake', 'Bruno Mars',
                        'Coldplay', 'Coldplay']}
)

In [None]:
popular_songs

In [None]:
top_2017_albums

## Slicing Data Frames - selecting rows and columns


### Selection Using Label

**Column Selection** 
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` indexer. General usage looks like `frame.loc[rowname, colname]`. (Recall that the colon `:` means "everything".) For example, to get the `color` column of the `fruit_info` data frame, we would use `fruit_info.loc[:, 'color']`.

- You can also slice across columns. For example, `popular_songs.loc[:, 'number of streams':]` selects the `number of streams` column and all the columns after it.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` method, which takes on the form `frame['colname']`.

**Row Selection**
Similarly, if we want to select a row by its label, we can use the same `.loc` indexer. In this case, the "label" of each row refers to the index (i.e., the primary key) of the dataframe.

In [None]:
#Example:
top_2017_albums.loc[:, 'album name']
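The `.loc` patterns described above can be sketched on a small made-up `pets` data frame (hypothetical names and values, not part of the lab data) that uses names as its index.

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
pets = pd.DataFrame(
    data={'species': ['cat', 'dog', 'fish', 'bird'],
          'age': [3, 5, 1, 2],
          'weight': [4.0, 20.5, 0.1, 0.3]},
    columns=['species', 'age', 'weight'],
    index=['Tom', 'Rex', 'Nemo', 'Tweety'])

# One column by label (returns a Series).
ages = pets.loc[:, 'age']

# Column slice by label: 'age' and every column after it.
age_onward = pets.loc[:, 'age':]

# One row by its index label (returns a Series).
rex = pets.loc['Rex']

# Row slice by label: unlike ordinary Python slices, both endpoints are included.
rex_to_nemo = pets.loc['Rex':'Nemo']
```

Note that label slices include the end label, which differs from position-based slicing.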

### Question 6a

Selecting multiple columns is easy.  You just need to supply a list of column names.  Select the `song name` and `number of streams` from the `popular_songs` table.

In [None]:
song_and_streams = ...

In [None]:
_ = ok.grade('q6a')
_ = ok.backup()

As you may have noticed above, `.loc` is also a way to re-order the columns within a dataframe: the columns come back in the order in which you list them.

### Question 6b

One of the important components of a dataframe is the **index**. An index uniquely defines each row of a dataframe. Notice that the index of the `popular_songs` table is numerical. Since the granularity of the `popular_songs` dataframe is one row per song, use the `set_index()` method to make `song name` the index of the dataframe. (This will be useful for row selection in the next problem.)

In [None]:
...

In [None]:
_ = ok.grade('q6b')
_ = ok.backup()

In [None]:
popular_songs

**Note:** Now try selecting the `song name` index from the table above. Although it looks like a column, it cannot be accessed in the same way as columns. If you would like to turn `song name` back into a column, you can call the `reset_index()` method.
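The difference between an index and a column can be illustrated with a short sketch on a made-up `items` data frame (hypothetical names and values).

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
items = pd.DataFrame(
    data={'item': ['pen', 'cup', 'bag'],
          'price': [1.5, 4.0, 12.0]},
    columns=['item', 'price'])

# set_index returns a new data frame; `items` is unchanged unless reassigned.
by_item = items.set_index('item')

# The index is reached through .index, not through column selection.
labels = list(by_item.index)
is_column = 'item' in by_item.columns  # False: 'item' is now the index

# reset_index moves the index back into an ordinary column.
restored = by_item.reset_index()
```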

### Question 6c

Using the `.loc` slicing technique, select the middle 4 rows (and all of the columns) of the `popular_songs` table using the index defined above.

In [None]:
popular_songs_small = ...

In [None]:
_ = ok.grade('q6c')
_ = ok.backup()

### Selection using position/location

If you want to select rows and columns by position, data frames have an analogous `.iloc` indexer for integer indexing. General usage looks like `frame.iloc[row position, column position]`. Remember that Python indexing starts at 0. Also remember that you can use `:` to slice across rows and columns, as in the previous question.
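A minimal sketch of position-based selection, using a made-up `grid` data frame (hypothetical, not part of the lab data):

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
grid = pd.DataFrame(
    data={'a': [10, 11, 12, 13],
          'b': [20, 21, 22, 23],
          'c': [30, 31, 32, 33]},
    columns=['a', 'b', 'c'])

# Single cell: row position 1, column position 2.
cell = grid.iloc[1, 2]

# Last two rows, all columns.
last_rows = grid.iloc[2:, :]

# Rows at positions 0 and 1, first column only (kept as a data frame).
# Position slices exclude the end point, as in ordinary Python slicing.
corner = grid.iloc[0:2, 0:1]
```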

### Question 7

Select the first 4 rows and first 2 columns of the `popular_songs` table.

In [None]:
selected_popular_songs = ...

In [None]:
_ = ok.grade('q07')
_ = ok.backup()

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material.  In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, culling out fishy outliers, or analyzing subgroups of your data set.  Note that compound expressions have to be grouped with parentheses. Example usage looks like `df[df['column name'] < 5]`.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
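The operators in the table can be combined as in the following sketch, which uses a made-up `scores` data frame (hypothetical names and values).

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
scores = pd.DataFrame(
    data={'student': ['Ana', 'Ben', 'Cy', 'Dee'],
          'section': ['A', 'B', 'A', 'B'],
          'score': [91, 78, 64, 85]},
    columns=['student', 'section', 'score'])

# A comparison on a column yields a boolean Series, one value per row.
high = scores['score'] > 80

# Indexing with that Series keeps only the rows where it is True.
passed = scores[high]

# Compound condition: each comparison is wrapped in parentheses
# because & binds more tightly than == and >.
section_a_high = scores[(scores['section'] == 'A') & (scores['score'] > 80)]

# Negation with ~ keeps the rows where the condition is False.
not_b = scores[~(scores['section'] == 'B')]
```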

### Question 8
Select the Justin Bieber songs that have over 600 streams. 

In [None]:
filtered_songs = ...

In [None]:
_ = ok.grade('q08')
_ = ok.backup()

### Question 9

An often-used operation missing from the above table is a test-of-membership.  The `Series.isin(values)` method returns a boolean array denoting whether each element of `Series` is in `values`.  We can then use the array to subset our data frame. For example, if we wanted to see which rows of `number of streams` had values in $\{500,1011\}$, we would use : 

`popular_songs[popular_songs['number of streams'].isin([500,1011])]`

Select only the rows in `popular_songs` where the artist also appears in the `top_2017_albums` dataframe.

In [None]:
top_2017_songs = ...

In [None]:
_ = ok.grade('q09')
_ = ok.backup()

## Data Aggregation (Grouping Data Frames)

### Question 10
To count the number of instances of a value in a `Series`, we can use the `value_counts()` method. Count the number of instances of each artist in `popular_songs`.
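For illustration, here is how `value_counts()` behaves on a small made-up Series (hypothetical data):

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# The result is a Series indexed by the distinct values,
# sorted from the most frequent to the least frequent.
counts = colors.value_counts()
```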

In [None]:
song_counts = ...

In [None]:
_ = ok.grade('q10')
_ = ok.backup()

### Question 11

A more versatile way to aggregate data is to use the `.groupby()` method. Find the total number of streams for each artist in the `popular_songs` table.
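As a sketch of the general split-then-aggregate pattern, the example below groups a made-up `sales` data frame (hypothetical names and values) by one column and aggregates another.

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
sales = pd.DataFrame(
    data={'store': ['north', 'south', 'north', 'east', 'south'],
          'units': [5, 3, 7, 2, 4]},
    columns=['store', 'units'])

# Split the rows by store, then sum 'units' within each group.
# The result is a Series indexed by the group keys (sorted by default).
units_per_store = sales.groupby('store')['units'].sum()

# Several aggregates can be computed at once with .agg().
summary = sales.groupby('store')['units'].agg(['sum', 'mean', 'count'])
```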

In [None]:
grouped_songs = ...

In [None]:
_ = ok.grade('q11')
_ = ok.backup()

## Joining Tables


**Inner Join:** returns only the rows whose key (here, the artist) appears in both data frames.

**Outer Join:** returns all rows found in either the left or the right data frame. Any missing values are filled in with NaN.

**Left Join:** returns all records from the left table and the matched records from the right table.

**Right Join:** returns all records from the right table and the matched records from the left table.
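The four join types can be compared side by side with `pd.merge` and its `how` argument. The sketch below uses two made-up tables, `left` and `right` (hypothetical names and values), that share a `key` column.

```python
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Inner: only keys present in both tables ('b' and 'c').
inner = pd.merge(left, right, on='key', how='inner')

# Outer: every key from either table; gaps are filled with NaN.
outer = pd.merge(left, right, on='key', how='outer')

# Left: every key from `left`; 'a' has no match, so its right_val is NaN.
left_join = pd.merge(left, right, on='key', how='left')

# Right: every key from `right`; 'd' has no match, so its left_val is NaN.
right_join = pd.merge(left, right, on='key', how='right')
```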

### Question 12
Create a new data frame that contains the artist, number of streams, and album name only if the artist is in both the `popular_songs` table and the `top_2017_albums` table.

In [None]:
merged_artists = ...

In [None]:
_ = ok.grade('q12')
_ = ok.backup()

### Question 13
Create a new data frame that contains the artist, number of streams, and album name. Include a row if the artist is in either the `popular_songs` table or the `top_2017_albums` table.

In [None]:
merged_artists_all = ...

In [None]:
_ = ok.grade('q13')
_ = ok.backup()

## Handling Null/NaN Values

To check if a value is null, we use the `isnull()` method for series and data frames.  Alternatively, there is a `pd.isnull()` function as well. To replace null values in a dataframe, we can use the `fillna()` method, which replaces the NaNs with a value of your choosing. Feel free to experiment with these functions below! (This concept will be important in the upcoming homework.)
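A minimal sketch of detecting and replacing missing values, using a made-up `readings` data frame (hypothetical names and values):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not part of the lab tables.
readings = pd.DataFrame(
    data={'sensor': ['s1', 's2', 's3'],
          'value': [1.5, np.nan, 3.0],
          'note': ['ok', None, 'ok']},
    columns=['sensor', 'value', 'note'])

# Element-wise null check on the whole data frame (True where a value is missing).
null_mask = readings.isnull()

# Count the missing entries in one column.
n_missing = readings['value'].isnull().sum()

# The top-level function gives the same answer as the method.
same = pd.isnull(readings['value'])

# fillna returns a new object; `readings` keeps its missing values.
filled_value = readings['value'].fillna(0.0)
filled_note = readings['note'].fillna('missing')
```

To keep a filled column, assign the result back, for example to a new data frame or to the column itself.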

### Question 14

In the table you created in the previous question, replace the NaN values in the `album name` column with "None".

In [None]:
merged_artists_cleaned = ...

In [None]:
_ = ok.grade('q14')
_ = ok.backup()

## Submission
Run the cell below to submit the lab.  You may resubmit as many times as you want.

In [None]:
_ = ok.submit()