In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab00.ipynb")

# `datascience` to Pandas

Welcome! This notebook is an unofficial resource that serves as an introduction to working with Python's widely used Pandas library for students who have taken Introduction to Data Science. The functions introduced will be analogous to those in Berkeley's `datascience` module, with examples provided for each.

We will cover the following topics in this notebook:
1. [Basics of Pandas](#basics)
    - [Importing and Loading Packages](#import)
<br>
<br>
2. [Dataframes: Working with Tabular Data](#dataframes)
    - [Creating a Dataframe](#creating)
    - [Accessing Values in Dataframe](#accessing)
    - [Manipulating Data](#manipulating)
<br>
<br>
3. [Visualizing Data](#visualizing)
    - [Histograms](#histograms)
    - [Line Plots](#line)
    - [Scatter Plots](#scatter)
    - [Bar Plots](#bar)

## 1. Basics <a id='basics'></a>

This notebook assumes familiarity with Python concepts, syntax and data structures at the level of Data 8. For a brief refresher on some Python concepts, refer to this [Python Basics Guide on Github](https://github.com/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb).

Python has a great ecosystem of data-centric packages, which makes it excellent for data analysis. Pandas is one of those packages, and it makes importing and analyzing data much easier. Pandas builds on packages like NumPy and matplotlib to give us a single, convenient place to do most of our data analysis and visualization work.

### 1.1 Importing and Loading Packages <a id='import'></a>

Let's import the `datascience`, `numpy` and `pandas` packages using the `import` keyword. Since we import Pandas as `pd`, we need to prefix all Pandas functions with `pd`, similar to how we prefix all numpy functions with `np` (e.g. `np.append()`).

Run the cell below.

In [25]:
# import the datascience package, the numpy package, and the pandas library as pd
from datascience import * 
import numpy as np
import pandas as pd
import otter
grader = otter.Notebook()

## 2. Dataframes: Working with Tabular Data <a id='dataframes'></a>

In Python's `datascience` module, we used `Table` to build our dataframes and used commands such as `select()`, `where()`, `group()`, `column()` etc. In this section, we will go over some basic commands to work with tabular data in Pandas.

### 2.1 Creating a Dataframe <a id='creating'> </a>

Pandas introduces a data structure (i.e. dataframe) that represents data as a table with columns and rows. 

In Python's `datascience` module that is used in Introduction to Data Science, this is how we created tables from scratch by extending an empty table:

In [None]:
t = Table().with_columns([
     'letter', ['a', 'b', 'c', 'z'],
     'count',  [  9,   3,   3,   1],
     'points', [  1,   2,   2,  10],
 ])
t

In Pandas, we can use the function `pd.DataFrame` to initialize a dataframe from a dictionary or a list-like object. Refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for more information.

In [None]:
# Create a dataframe from a dictionary
df_from_dict = pd.DataFrame({'letter' : ['a', 'b', 'c', 'z'],
                             'count'  : [  9,   3,   3,   1],
                             'points' : [  1,   2,   2,  10]
                            })
df_from_dict
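The "list-like object" route can be sketched as follows. This is a minimal illustration, assuming the same toy data as above; the name `df_from_rows` is made up for this example. Each inner list is one row, and the `columns` argument supplies the labels.

```python
import pandas as pd

# Each inner list is one row of the table.
rows = [
    ['a', 9, 1],
    ['b', 3, 2],
    ['c', 3, 2],
    ['z', 1, 10],
]

# `columns` gives the column labels, in the same order as the row values.
df_from_rows = pd.DataFrame(rows, columns=['letter', 'count', 'points'])

print(df_from_rows.shape)
print(list(df_from_rows.columns))
```

The dictionary form is organized column by column, while this form is organized row by row; both produce the same kind of dataframe.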

More often, we will need to create a dataframe by importing data from a .csv file. In `datascience`, this is how we read data from a csv:

In [3]:
datascience_baby = Table.read_table('baby.csv')
datascience_baby

In Pandas, we use `pd.read_csv()` to read data from a csv file. Sometimes, depending on the data file, we may need to specify the parameters `sep`, `header` or `encoding` as well. For a full list of parameters, refer to [this guide](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).

**Note:** The `.head()` method displays the first few rows of a dataframe (five by default).

**Reminder:** A Python **function** is a sequence of statements, given a name, that execute in a specific order. In Introduction to Data Science, we talked about built-in and user-defined functions. A Python **method** is like a function, except it is attached to an object. We call a method on an object, and it may or may not make changes to that object. A method, then, belongs to a class.

In [2]:
baby = pd.read_csv('baby.csv')
baby.head()
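To see what `sep` and `header` do, here is a small sketch that reads a semicolon-separated file with no header row. The file contents are made up and held in memory with `io.StringIO` so the example does not depend on any file on disk; the `names` parameter, which is not mentioned above, is used here to supply column names.

```python
import io
import pandas as pd

# A made-up semicolon-separated "file" with no header line.
raw_text = "a;9;1\nb;3;2\nc;3;2\nz;1;10\n"

letters = pd.read_csv(
    io.StringIO(raw_text),                # read_csv accepts paths or file-like objects
    sep=';',                              # values are separated by semicolons
    header=None,                          # the first line is data, not column names
    names=['letter', 'count', 'points'],  # supply the column names ourselves
)

print(letters.head())
```

Without `header=None`, Pandas would treat the first data line as the column names.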

**Question 2.1.1.** Create a dataframe named `baby_smoker` that contains only the observations where `Maternal Smoker` is True.

<!--
BEGIN QUESTION
name: q2_1_1
manual: false
-->

In [39]:
baby_smoker = ...

In [None]:
grader.check("q2_1_1")

**Question 2.1.2.** Create a dataframe named `baby_over40` that contains only the observations where `Maternal Age` is at least 40.

<!--
BEGIN QUESTION
name: q2_1_2
manual: false
-->

In [41]:
baby_over40 = ...

In [42]:
len(baby_over40)

The `.describe()` method generates a set of summary statistics for the Series or dataframe it is called on.

In [19]:
# view summary of data
baby.describe()

In [36]:
# load file
sat = pd.read_csv('sat2014.csv', index_col = 0)
sat.head()

In [37]:
# view information about dataframe
# view dimensions (rows, cols)
print(sat.shape)

# view column names
print(sat.columns.values) 

**Question 2.1.3.** Rename the `Combined` column to `Total Score`.

<!--
BEGIN QUESTION
name: q2_1_3
manual: false
-->

In [46]:
...

In [None]:
grader.check("q2_1_3")

**Question 2.1.4.** Create a dataframe named `sat_part5` that only contains states where the `Participation Rate` is more than 80.

<!--
BEGIN QUESTION
name: q2_1_4
manual: false
-->

In [58]:
sat_part5 = ...

In [None]:
grader.check("q2_1_4")

### 2.2 Accessing Values in Dataframe <a id='accessing'> </a>

In `datascience`, we can use `.column()` to access values in a particular column as follows:

In [None]:
# access column 'letter' and return an array
t.column('letter')

In Pandas, columns are also known as Series. We can access a Pandas series by using the square bracket notation.

In [55]:
# returns series object
sat['State']

If we want a numpy array of column values, we can use the `.values` attribute of a Series object:

In [52]:
sat['State'].values
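The difference between the two results can be checked by looking at their types. This is a small sketch on a made-up dataframe (the values are placeholders, not real SAT data). It also uses `.to_numpy()`, a Series method that returns a numpy array as well.

```python
import numpy as np
import pandas as pd

# Placeholder data for illustration only.
scores = pd.DataFrame({'State': ['X', 'Y', 'Z'],
                       'Combined': [1500, 1600, 1700]})

state_series = scores['State']             # single brackets give a Series
state_array = scores['State'].values       # .values gives a numpy array
same_array = scores['State'].to_numpy()    # .to_numpy() also gives a numpy array

print(type(state_series).__name__)  # Series
print(type(state_array).__name__)   # ndarray
```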

In `datascience`, we can use `.take()` to access a row in the Table.

In [None]:
# select the first two rows using python's slicing notation
t.take[0:2]

In Pandas, we can access rows and columns by their position using the `iloc` indexer. Note that `iloc` uses square brackets rather than parentheses. We specify the rows and columns we want with the following syntax: `df.iloc[<rows>, <columns>]`. For more information on indexing, refer to [this guide](https://pandas.pydata.org/pandas-docs/stable/indexing.html).

**Question 2.2.1.** Select the first two rows of the `baby` dataframe using `iloc`.

<!--
BEGIN QUESTION
name: q2_2_1
manual: false
-->

In [60]:
...

In [None]:
grader.check("q2_2_1")

In [61]:
# specify row indices
baby.iloc[[1, 4, 6], :]
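The `df.iloc[<rows>, <columns>]` syntax accepts integers, lists and slices in either slot. Here is a minimal sketch on a toy dataframe (the name `toy` is made up for this example). Positions start at 0, and the end of a slice is excluded.

```python
import pandas as pd

toy = pd.DataFrame({
    'letter': ['a', 'b', 'c', 'z'],
    'count':  [9, 3, 3, 1],
    'points': [1, 2, 2, 10],
})

# A slice of row positions and a list of column positions give a smaller dataframe.
last_two_rows = toy.iloc[2:4, [0, 2]]

# Two integers give a single value: row position 3, column position 2.
single_value = toy.iloc[3, 2]

print(last_two_rows)
print(single_value)
```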

**Question 2.2.2.** Select the value in the tenth row and fourth column of the `baby` dataframe by passing in the row and column indices.

<!--
BEGIN QUESTION
name: q2_2_2
manual: false
-->

In [65]:
...

In [None]:
grader.check("q2_2_2")

### 2.3. Manipulating Data <a id='manipulating'></a>

#### 2.3.1. Adding Columns

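Adding a new column in `datascience` is done with the `.with_column()` method, which returns a new table and leaves the original table unchanged.

In [None]:
# display the new table; t itself is not modified
t.with_column('vowel', ['yes', 'no', 'no', 'no'])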

In Pandas, we can use the bracket notation and assign a list to add to the dataframe as follows:

In [None]:
# add a new column
df_from_dict['newcol'] = [5, 6, 7, 8]
df_from_dict

We can also create a new column by performing an operation on an existing column.

In [None]:
# add count * 2 to the dataframe
df_from_dict['doublecount'] = df_from_dict['count'] * 2
df_from_dict

#### 2.3.2. Selecting Columns

In `datascience`, we used `.select()` to subset the dataframe by selecting columns.

In [None]:
t.select(['letter', 'points'])

In Pandas, we use a double bracket notation to select columns. This returns a dataframe, whereas single bracket notation returns a Series object.

In [None]:
# double bracket notation for new dataframe
df_from_dict[['count', 'doublecount']]

#### 2.3.3. Filtering Rows Conditionally

In `datascience`, we used `.where()` to select rows according to a given condition.

In [None]:
# rows where points == 2
t.where('points', 2)

In [None]:
# rows where count < 8
t.where(t['count'] < 8)

In Pandas, we can use the bracket notation to subset the dataframe based on a condition. We first specify a condition and then subset using the bracket notation.

In [None]:
# Series of booleans
baby['Maternal Smoker'] == True

In [None]:
# filter rows by condition Maternal Smoker == True
baby[baby['Maternal Smoker'] == True]

In [None]:
# filter with multiple conditions
df_from_dict[(df_from_dict['count'] < 8) & (df_from_dict['points'] > 5)]
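It can help to build each condition separately before combining them. The sketch below uses a toy dataframe (the variable names are made up for this example) and shows the three element-wise operators: `&` for "and", `|` for "or", and `~` for "not". Python's own `and`, `or` and `not` keywords do not work on a Series, and each comparison needs its own parentheses when written inline.

```python
import pandas as pd

toy = pd.DataFrame({
    'letter': ['a', 'b', 'c', 'z'],
    'count':  [9, 3, 3, 1],
    'points': [1, 2, 2, 10],
})

low_count = toy['count'] < 8      # False, True, True, True
high_points = toy['points'] > 5   # False, False, False, True

both = toy[low_count & high_points]     # rows that satisfy both conditions
either = toy[low_count | high_points]   # rows that satisfy at least one condition
not_low = toy[~low_count]               # rows that do not satisfy the first condition

print(both)
print(either)
print(not_low)
```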

#### 2.3.4. Renaming Columns

In `datascience`, we used `.relabeled()` to rename columns.

In [None]:
# rename 'points' to 'other name'
t.relabeled('points', 'other name')

Pandas uses `rename()`, which takes a `columns` parameter set to a dictionary that maps each old name to its replacement. It returns a new dataframe and leaves the original unchanged.

In [None]:
# rename 'points' to 'other name'
df_from_dict.rename(columns = {"points" : "other name"})

#### 2.3.5. Sorting Dataframe by Column

In `datascience` we used `.sort()` to sort a Table according to the values in a column.

In [None]:
# sort by count
t.sort('count')

In Pandas, we use `sort_values()` to sort by column. We use the `by` parameter to specify the column we want to sort by and the optional parameter `ascending = False` if we want to sort in descending order:

In [None]:
# sort by count, descending
df_from_dict.sort_values(by = ['count'], ascending = False)
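Since `by` accepts a list, we can also break ties with a second column. The following is a sketch on a toy dataframe with made-up values; passing a list to `ascending` sets the direction for each column separately. Sorted rows keep their original index labels, so `reset_index(drop = True)` is used here to renumber them.

```python
import pandas as pd

toy = pd.DataFrame({
    'letter': ['a', 'b', 'c', 'z'],
    'count':  [9, 3, 3, 1],
    'points': [1, 5, 2, 10],
})

# Sort by count (descending); break ties by points (ascending).
ranked = toy.sort_values(by=['count', 'points'], ascending=[False, True])

# The original index labels travel with their rows.
print(ranked.index.tolist())

# Renumber the rows 0, 1, 2, ... after sorting.
renumbered = ranked.reset_index(drop=True)
print(renumbered)
```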

#### 2.3.6. Grouping and Aggregating

In `datascience`, we used `group()` and the `collect` argument to group a Table by a column and aggregate values in another column.

In [None]:
# group by count and aggregate by sum
t.select(['count', 'points']).group('count', collect = sum)

In Pandas, we use `groupby()` to group the dataframe. This function returns a groupby object, on which we can then call an aggregation function to return a dataframe with aggregated values for other columns. For more information, refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html).

In [None]:
# select two columns
df_subset = df_from_dict[['count', 'points']]
df_subset

In [None]:
count_sums_df = df_subset.groupby(['count']).sum()
count_sums_df
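One difference from `datascience` is that the grouping column becomes the index of the result rather than an ordinary column. The sketch below, on a toy dataframe with made-up variable names, shows two common follow-ups: selecting a single column from the groupby object, and calling `reset_index()` to turn the index back into a column.

```python
import pandas as pd

toy = pd.DataFrame({
    'letter': ['a', 'b', 'c', 'z'],
    'count':  [9, 3, 3, 1],
    'points': [1, 2, 2, 10],
})

# Aggregate one column only: the result is a Series indexed by the group keys.
points_by_count = toy.groupby('count')['points'].sum()
print(points_by_count)

# Turn the group keys back into a regular column.
as_columns = toy[['count', 'points']].groupby('count').sum().reset_index()
print(as_columns)
```

By default the groups appear in sorted order of the grouping column.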

### 2.4. Pivot Tables

In `datascience`, we used the `pivot()` function to build contingency tables:

In [None]:
cones_tbl = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'),
    'Color', make_array('pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'),
    'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)

cones_tbl

In [None]:
# pivot on color and flavor
cones_tbl.pivot("Flavor", "Color")

We can also pass in the parameters `values` to specify the values in the table and `collect` to specify the aggregation function.

In [None]:
# set parameter values and collect
cones_tbl.pivot("Flavor", "Color", values = "Price", collect = np.sum)

In Pandas, we use `pd.pivot_table()` to create a contingency table. The argument `columns` is similar to the first argument in `datascience`'s `pivot` function and sets the column names of the pivot table. The argument `index` is similar to the second argument in `datascience`'s `pivot` function and sets the first column of the pivot table or the keys to group on. For more information, refer to the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html).

In [None]:
# create new dataframe
cones_df = pd.DataFrame({"Flavor" : ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'],
                         "Color"  : ['pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'],
                         "Price"  : [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})
cones_df

In [None]:
# create the pivot table
pd.pivot_table(cones_df, columns = ["Flavor"], index = ["Color"])

If a combination of groups has no data, Pandas fills the corresponding cell with `NaN`. `NaN`, short for "not a number", is a special floating-point value used to represent results that are undefined or unrepresentable; Pandas also uses it to mark missing values. Because `NaN` is specifically a floating-point value, there is no equivalent NaN value for integers, strings, or other types.

We can also specify parameters such as `values` (equivalent to `values` in `datascience`'s `pivot`) and `aggfunc` (equivalent to `collect` in `datascience`'s `pivot`).

In [None]:
# additional arguments; the string "sum" names the aggregation and avoids warnings in newer Pandas versions
pd.pivot_table(cones_df, columns = ["Flavor"], index = ["Color"], values = "Price", aggfunc = "sum")
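If the `NaN` cells are not wanted, `pd.pivot_table()` also has a `fill_value` parameter (not covered above) that replaces them. This is a sketch using the same cones data; the variable names are made up for this example.

```python
import pandas as pd

cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'],
    'Color':  ['pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'],
    'Price':  [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

# Empty Color/Flavor combinations show up as NaN.
with_nan = pd.pivot_table(cones, index='Color', columns='Flavor',
                          values='Price', aggfunc='sum')

# fill_value replaces those empty cells with 0 instead.
with_zeros = pd.pivot_table(cones, index='Color', columns='Flavor',
                            values='Price', aggfunc='sum', fill_value=0)

print(with_nan)
print(with_zeros)
```

Whether 0 is a sensible replacement depends on the aggregation: it is natural for a sum or a count, but it could be misleading for a mean.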

#### 2.4.1. Joining and Merging

In `datascience`, we used `join()` to join two tables based on shared values in columns. We specify the column name in the first table to match on, the name of the second table and the column name in the second table to match on.

In [None]:
ratings_tbl = Table().with_columns(
    'Kind', make_array('strawberry', 'chocolate', 'vanilla'),
    'Stars', make_array(2.5, 3.5, 4)
)

ratings_tbl

In [None]:
# join cones and ratings
cones_tbl.join("Flavor", ratings_tbl, "Kind")

In Pandas, we can use the `merge()` function to join two tables together. The first parameter is the second table to join with. The parameters `left_on` and `right_on` specify the columns to use in the left and right tables respectively. There are more parameters, such as `how`, which specifies what kind of join to perform (inner (the default), outer, left or right). For more information, refer to this [Kaggle Tutorial](https://www.kaggle.com/crawford/python-merge-tutorial/notebook).

In [None]:
# create new ratings df
ratings_df = pd.DataFrame({"Kind" : ['strawberry', 'chocolate', 'vanilla'],
                           "Stars" : [2.5, 3.5, 4]})
ratings_df

In [None]:
# merge cones and ratings
cones_df.merge(ratings_df, left_on = "Flavor", right_on = "Kind")
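The effect of `how` is easiest to see by counting rows. The sketch below uses shortened, made-up versions of the cones and ratings tables in which `bubblegum` has no rating and `vanilla` has no cone.

```python
import pandas as pd

cones = pd.DataFrame({'Flavor': ['strawberry', 'chocolate', 'bubblegum'],
                      'Price':  [3.55, 4.75, 4.75]})
ratings = pd.DataFrame({'Kind':  ['strawberry', 'chocolate', 'vanilla'],
                        'Stars': [2.5, 3.5, 4.0]})

# Inner (default): only flavors present in both tables.
inner = cones.merge(ratings, left_on='Flavor', right_on='Kind')

# Left: every row of cones; missing ratings become NaN.
left = cones.merge(ratings, left_on='Flavor', right_on='Kind', how='left')

# Outer: every row from both tables.
outer = cones.merge(ratings, left_on='Flavor', right_on='Kind', how='outer')

print(len(inner), len(left), len(outer))
```

The default inner join matches the behavior of `datascience`'s `join()`, which drops rows without a match.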

## 3. Visualizing Data <a id='visualizing'> </a>

In `datascience`, we learned to plot data using histograms, line plots, scatter plots and bar plots. The corresponding functions were `hist()`, `plot()`, `scatter()` and `barh()`. Plotting methods in Pandas are nearly identical to those in `datascience`, since both build on the `matplotlib` library.

In this section we will go through examples of such plots in Pandas.

<a id='histograms'></a>**3.1. Histograms**

In `datascience`, we used `hist()` to create a histogram. In this example, we will be using data from `baby.csv`. Recall that the baby data set contains data on a random sample of 1,174 mothers and their newborn babies. The column `Birth Weight` contains the birth weight of the baby, in ounces; `Gestational Days` is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.

In [11]:
import matplotlib
%matplotlib inline

In [12]:
datascience_baby = Table.read_table('baby.csv')
datascience_baby

In [15]:
datascience_baby.hist('Birth Weight');

In Pandas, we use `hist()` to create histograms, just like `datascience`. Refer to the [documentation](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.hist.html) for a full list of parameters.

In [16]:
baby.hist('Birth Weight');

<a id='line'></a>**3.2. Line Plots**

In `datascience`, we used `plot()` to create a line plot of numerical values. In this example, we will be using census data and plot variables such as Age in a line plot.

In [18]:
# line plot in datascience
census_tbl = Table.read_table('census.csv').select(['SEX', 'AGE', 'POPESTIMATE2014'])
children_tbl = census_tbl.where('SEX', are.equal_to(0)).where('AGE', are.below(19)).drop('SEX')
children_tbl.plot('AGE')

In Pandas, we can use `plot.line()` to create line plots. For a full list of parameters, refer to the [documentation](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.plot.line.html).

In [17]:
# pandas
census_df = pd.read_csv('census.csv')[["SEX", "AGE", "POPESTIMATE2014"]]
children_df = census_df[(census_df.SEX == 0) & (census_df.AGE < 19)].drop("SEX", axis = 1)
children_df.plot.line(x = "AGE", y = "POPESTIMATE2014")

<a id='scatter'></a>**3.3. Scatter Plots**

In `datascience`, we used `scatter()` to create a scatter plot of two numerical columns.

In [7]:
football_tbl = Table.read_table('deflategate.csv')
football_tbl

In [8]:
football_tbl.scatter('Blakeman', 'Prioleau')

In Pandas, we use `plot.scatter()` to create a scatter plot. For a full list of parameters, refer to the [documentation](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.plot.scatter.html).

In [9]:
football_df = pd.read_csv('deflategate.csv')
football_df.plot.scatter(x = "Blakeman", y = "Prioleau");

<a id='bar'></a>**3.4. Bar Plots**

In `datascience`, we used `barh()` to create a horizontal bar plot.

In [None]:
t.barh("letter", "points")

In Pandas, we use `plot.barh()` to create a horizontal bar chart. For a full list of parameters, refer to the [documentation](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.plot.barh.html).

In [None]:
df_from_dict.plot.barh(x = 'letter', y = 'points');

---

## Further Reading

Here is a list of useful Pandas resources.

- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [Dataquest Pandas Tutorial](https://www.dataquest.io/blog/pandas-python-tutorial/)
- [Pandas Cookbook](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/tree/master/cookbook/)

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Or, find the .zip file in the left side of the screen and right-click and select **Download**. You'll submit this .zip file for the assignment in Canvas to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)