In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = FALSE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Pandas Data Frames
Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div> 

---

# Data Frames

- Pandas data frames are a 2D data structure

    + Each column is a `Series` object
    + Columns can have different types (just like most common data sets!)


In [None]:
import pandas as pd
import numpy as np
name = ['Alice', 'Bob','Charlie','Dave','Eve','Francesca','Greg']
age = [20, 21, 22, 23, 22, 21, 22]
major = ['Statistics', 'History', 'Chemistry', 'English', 'Math', 'Civil Engineering','Statistics']
my_df = pd.DataFrame(zip(name, age, major), columns = ["name", "age", "major"])
my_df
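
A minimal sketch of the two bullets above, using a smaller, hypothetical `demo_df` rather than the slide's `my_df`: it pulls out one column to confirm it is a `Series` and prints the type stored in each column.

```python
import pandas as pd

# Hypothetical three-row example (not the slide's data)
name = ['Alice', 'Bob', 'Charlie']
age = [20, 21, 22]
demo_df = pd.DataFrame(list(zip(name, age)), columns=["name", "age"])

# Selecting a single column gives back a pandas Series
age_col = demo_df["age"]
print(type(age_col))

# Each column carries its own type
print(demo_df.dtypes)
```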

---

# Creating a Data Frame from a Dictionary

- The `pd.DataFrame()` function can create data frames from many objects

In [None]:
people = {'Name': ['Alice', 'Bob','Charlie','Dave','Eve','Francesca','Greg'],
          'Age': [20, 21, 22, 23, 22, 21, 22],
          'Major': ['Statistics', 'History', 'Chemistry', 'English', 'Math', 'Civil Engineering','Statistics'],
         }
people
my_df = pd.DataFrame(people)
my_df
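
As one more illustration of "many objects", `pd.DataFrame()` also accepts a list of dictionaries, one per row. The `records` list below is a made-up example, not part of the slide's data.

```python
import pandas as pd

# Hypothetical row-wise records: each dictionary is one observation
records = [
    {"Name": "Alice", "Age": 20},
    {"Name": "Bob", "Age": 21},
]

# The dictionary keys become the column names
records_df = pd.DataFrame(records)
print(records_df)
```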

---

# Creating a Data Frame from a NumPy Array

- The `pd.DataFrame()` function can create data frames from many objects

In [None]:
my_array = np.random.random((5,3))
my_array

my_df2 = pd.DataFrame(my_array, columns=["1st", "2nd", "3rd"], index=["a", "b", "c", "d", "e"])
my_df2
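
Because `np.random.random()` gives different numbers on every run, a seeded generator can make the example reproducible. This is a sketch under that assumption; the seed value and the names `rng`, `seeded_array`, and `seeded_df` are illustrative only.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same values each time the cell is run
rng = np.random.default_rng(42)
seeded_array = rng.random((5, 3))

seeded_df = pd.DataFrame(
    seeded_array,
    columns=["1st", "2nd", "3rd"],
    index=["a", "b", "c", "d", "e"],
)

# The row and column labels are stored as attributes
print(seeded_df.index)
print(seeded_df.columns)
```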

---

# Indexing a Data Frame's Columns

- Access the columns using the column names

.left45[

In [None]:
my_df2.columns
my_df2["1st"]
type(my_df2["1st"])

]
.left25[
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
]
.left45[

In [None]:
my_df.columns
my_df.Major
type(my_df.Major)

]

---

# Indexing a Data Frame's Columns

- Return more than one column by passing a list of column names

In [None]:
my_df[['Name', 'Age']]
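
The number of brackets changes what comes back. The sketch below uses a small, hypothetical `demo_df` to contrast a single column name (a `Series`) with a list of names (a `DataFrame`).

```python
import pandas as pd

# Hypothetical two-row example
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [20, 21],
    "Major": ["Statistics", "History"],
})

one_col = demo_df["Name"]            # single name -> Series
one_col_df = demo_df[["Name"]]       # list with one name -> DataFrame
two_cols = demo_df[["Name", "Age"]]  # list of names -> DataFrame

print(type(one_col))
print(type(one_col_df))
print(two_cols)
```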

---

# Indexing a Data Frame's Rows

- Can access rows by their integer location using `.iloc[]`

.left45[

In [None]:
my_df.iloc[0]
my_df.iloc[1]
type(my_df.iloc[1])

]
.left25[
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
]
.left45[

In [None]:
my_df.iloc[[0,1]]
my_df.iloc[range(0,3)]

]


---

# Indexing a Data Frame's Rows & Columns

- `.iloc[]` allows for subsetting of columns by location too!

In [None]:
my_df.iloc[[0,1], [0, 2]]
my_df.iloc[3:6, 0:2]

---

# Indexing a Data Frame's Rows & Columns

- `.loc[]` subsets by label: index labels for the rows (the default labels are the integers 0, 1, 2, ...) and column names for the columns

.left45[

In [None]:
my_df.loc[0]
my_df.loc[[0, 3], ["Name", "Major"]]

]
.left25[
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
]
.left45[

In [None]:
my_df2
my_df2.loc["b"]
my_df2.loc[["b"], ["1st"]]

]

---

# Indexing a Data Frame's Rows & Columns

- Slicing can also be used, but `.loc[]` includes the upper end of the slice (`.iloc[]` still does not)

In [None]:
my_df.iloc[:3, [0,2]]
my_df.loc[:3, ['Name', "Major"]]
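
The difference in endpoints can be checked by counting rows. This sketch uses hypothetical data frames (`demo_df` with the default integer labels, `labeled_df` with string labels).

```python
import pandas as pd

# Default integer index: labels 0 through 4
demo_df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "Dave", "Eve"]})

by_position = demo_df.iloc[:3]  # positions 0, 1, 2 (end excluded)
by_label = demo_df.loc[:3]      # labels 0, 1, 2, 3 (end included)
print(len(by_position), len(by_label))

# The same rule applies to string labels
labeled_df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])
label_slice = labeled_df.loc["a":"c"]  # includes the row labeled "c"
print(label_slice)
```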

---

# Indexing a Data Frame's Rows & Columns

- Rows are often subset with a Boolean object (rows marked `True` are returned, rows marked `False` are not)

- Note: `my_df[...]` can return either columns or rows. Given a slice of integers (such as `my_df[0:3]`), it returns the corresponding rows. Given a sequence of `True`/`False` values with one entry per row, it returns the rows where `True` occurs. Given a column name or a list of column names, it returns those columns.

In [None]:
my_df['Name'] == 'Alice'
my_df[my_df['Name'] == 'Alice']
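
The three behaviors of `[...]` described in the note can be seen side by side. The sketch uses a hypothetical three-row `demo_df`; the last line is an added illustration of passing a Boolean mask to `.loc[]` together with column names.

```python
import pandas as pd

# Hypothetical example data
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [20, 21, 22],
    "Major": ["Statistics", "History", "Chemistry"],
})

rows_by_slice = demo_df[0:2]                 # integer slice -> first two rows
rows_by_mask = demo_df[demo_df["Age"] > 20]  # Boolean Series -> rows where True
one_column = demo_df["Name"]                 # column name -> that column

# .loc[] takes a Boolean mask for the rows plus column names
older_names = demo_df.loc[demo_df["Age"] > 20, ["Name", "Major"]]
print(older_names)
```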

---

# Indexing a Data Frame's Rows & Columns

- Combine Boolean conditions element-wise with `&` (and), `|` (or), `~` (not), `^` (xor); wrap each condition in parentheses

.left35[

In [None]:
(my_df['Name'] == 'Alice')

]
.left25[
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
]
.left35[

In [None]:
(my_df['Name'] == 'Greg')

]

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

In [None]:
my_df[(my_df['Name'] == 'Alice') | (my_df['Name'] == 'Greg')]

---

# Indexing a Data Frame's Rows & Columns

- Combine Boolean conditions element-wise with `&` (and), `|` (or), `~` (not), `^` (xor); wrap each condition in parentheses

.left35[

In [None]:
my_df['Major'] == 'Statistics'

]
.left35[

In [None]:
my_df['Major'] == 'History'

]
.left30[

In [None]:
my_df['Age'] < 22

]

<br>
<br>

In [None]:
my_df[((my_df['Major'] == 'Statistics') | (my_df['Major'] == 'History')) & (my_df['Age'] < 22)]
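
When several values of the same column are joined with `|`, the `.isin()` method is a possible shorthand. The sketch below rebuilds the slide's data under the hypothetical name `demo_df` and checks that both approaches give the same mask. The parentheses matter because `&` and `|` bind more tightly than `==` and `<` in Python.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave", "Eve", "Francesca", "Greg"],
    "Age": [20, 21, 22, 23, 22, 21, 22],
    "Major": ["Statistics", "History", "Chemistry", "English", "Math",
              "Civil Engineering", "Statistics"],
})

# Two equivalent ways to flag Statistics or History majors
mask_or = (demo_df["Major"] == "Statistics") | (demo_df["Major"] == "History")
mask_isin = demo_df["Major"].isin(["Statistics", "History"])
print(mask_or.equals(mask_isin))

# Combine with a second condition, each wrapped in parentheses
result = demo_df[mask_isin & (demo_df["Age"] < 22)]
print(result)
```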

---

# Operations on Data Frames

- The `.head()` and `.tail()` methods give the first few and last few rows, respectively (five by default)

In [None]:
my_df.head()
my_df.tail()
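
Both methods take an optional number of rows. A short sketch with a hypothetical seven-row `demo_df`:

```python
import pandas as pd

# Hypothetical data frame with seven rows
demo_df = pd.DataFrame({"Age": [20, 21, 22, 23, 22, 21, 22]})

default_head = demo_df.head()  # first five rows by default
first_two = demo_df.head(2)    # first two rows
last_three = demo_df.tail(3)   # last three rows

print(first_two)
print(last_three)
```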

---

# Operations on Data Frames

- `shape` attribute contains the dimensions of the data frame

.left45[

In [None]:
my_df
my_df.shape

]
.left25[
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
]
.left45[

In [None]:
my_df2
my_df2.shape

]

---

# Operations on Data Frames

- Obtain a quick one-way contingency table (a count of each distinct value) with `.value_counts()` on a column

In [None]:
my_df["Major"].value_counts()
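
If proportions are wanted instead of counts, `.value_counts()` accepts `normalize=True`. The sketch below uses the majors from the slide in a hypothetical stand-alone `Series` called `majors`.

```python
import pandas as pd

majors = pd.Series(["Statistics", "History", "Chemistry", "English", "Math",
                    "Civil Engineering", "Statistics"])

counts = majors.value_counts()                     # count of each distinct value
proportions = majors.value_counts(normalize=True)  # share of each distinct value

print(counts)
print(proportions)
```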

---

# Operations on Data Frames

- The `.info()` method summarizes the data frame: index, column names, non-null counts, data types, and memory usage

In [None]:
my_df.info()
my_df2.info()

---

# To JupyterLab!

- Create a data frame

    + Add a new column 
    + Reorder the rows/columns with `.sort_values()`
    
- See a few other methods such as

    + `.dropna()`: removes rows with missing values (returns a new data frame; add `inplace=True` to modify the original instead)
    + `.fillna()`: replaces missing values with a value you specify
    + `my_df.describe()` and other basic stats
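
A preview sketch of the methods listed above, assuming a small, hypothetical `demo_df` with one missing age. The column name `Age_next_year` and the fill value `0` are illustrative choices only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [20.0, np.nan, 22.0],
})

# Add a new column computed from an existing one
demo_df["Age_next_year"] = demo_df["Age"] + 1

# Sort rows by a column (missing values go last by default)
sorted_df = demo_df.sort_values("Age", ascending=False)

# Drop rows containing missing values; demo_df itself is unchanged
dropped = demo_df.dropna()

# Fill missing values in the Age column only
filled = demo_df.fillna({"Age": 0})

# Basic summary statistics for one column
summary = demo_df["Age"].describe()
print(summary)
```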


---

# Recap

- Data Frames are great for storing a data set 

    + Rows = observations, Columns = variables
    
    + Many ways to create them (from a dictionary, list, array, etc.)
    
    + Index columns with `[col_name]` or the `.col_name` attribute
    
    + Index rows with `.iloc[]` or `.loc[]`
    
        * Or index rows and columns together (ex: `.loc[rows, columns]`)

    + `.info()` and `.head()` are useful for a quick look at a data frame!
