# Module 2: Data wrangling using `pandas`

## Overview: Intro to pandas
This notebook is graciously provided by Wendy Fisher, who teaches DSCI 403, the first course in the four-part course series for the "Earth Resource Data Science" online graduate certificate at Mines. [Learn more about the certificate here](https://online.mines.edu/er/).

It will help you learn the basics of `pandas`, a fast, powerful, flexible, and easy-to-use data analysis and manipulation tool.

If you have questions about this notebook, ask them on the [GEOL 557 Slack](https://join.slack.com/t/minesgeo/shared_invite/zt-cqawm4lu-Zcfpf4mBLwjnksY6_umlKA). <a href="https://join.slack.com/t/minesgeo/shared_invite/zt-cqawm4lu-Zcfpf4mBLwjnksY6_umlKA">
<img src="https://cdn.brandfolder.io/5H442O3W/as/pl546j-7le8zk-ex8w65/Slack_RGB.svg" alt="Go to the GEOL 557 Slack" width="100">
</a>

## Instructions

Work through this notebook. There are several places where you need to fill in a blank or write some code in an open cell. When you are finished, pat yourself on the back!

--- 

## Course
**GEOL 557 Earth Resource Data Science I: Fundamentals**. GEOL 557 forms part 2 of the four-part course series for the "Earth Resource Data Science" online graduate certificate at Mines - [learn more about the certificate here](https://online.mines.edu/er/)


##### CSCI 303
# Introduction to Data Science
### 9 - pandas basics


## This Lecture
---
- Learn pandas basics

Modified by Thomas Martin, Sept 2023

The obligatory setup code:

In [None]:
# import helpful libraries to set up your environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.datasets

from palmerpenguins import load_penguins

%matplotlib inline

## pandas
---
Python toolkit for data analysis

- provides Series and DataFrame data structures, which you can think of as very flexible tables for manipulating data!
- DataFrame type inspired by R
- designed to interact with the whole Python data science stack
- eases many of the data science tasks, particularly data "wrangling"

## Series
---
A one-dimensional array-like object (essentially a single column in an Excel spreadsheet):

- contains a sequence of values of any type; mixed types are allowed, in which case pandas stores them with the generic `object` dtype
- has an associated array of *index* labels
  - labels do not have to be integers
  - labels do not have to be unique
  - labels do not have to be sequential

Like a NumPy array, a Series can be constructed from any iterable:

In [None]:
from pandas import Series

s = Series([42, 17, 99])
s

The *index* is shown on the left

- default: RangeIndex (representing sequential integers)
- access index via `index` property of the Series object

In [None]:
s.index

There is also a `values` property:

In [None]:
s.values
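
As a minimal sketch of the two properties just mentioned, the cell below rebuilds the same three-value Series under a different, illustrative name (`s_demo`) and inspects its index and values. It uses `to_numpy()`, which returns the same underlying data as `values`.

```python
import pandas as pd

# same data as the Series above, under an illustrative name
s_demo = pd.Series([42, 17, 99])

# with no index given, pandas builds a RangeIndex of sequential integers
print(type(s_demo.index).__name__)
print(list(s_demo.index))

# to_numpy() returns the data as a NumPy array, like the values property
arr = s_demo.to_numpy()
print(arr)
```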

Things get interesting when you use *labels* for the index:

In [None]:
s = Series([42, 17, 99], index=['apple', 'pear', 'orange'])
s

Like a dictionary:

- associate values with labels
- retrieve values via [ ] operator

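Unlike a dictionary:

- labels can duplicate
- a list of labels retrieves several values at once

Both retain insertion order (built-in dictionaries have guaranteed this since Python 3.7).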

In [None]:
s2 = Series([42, 17, 99, 3.1415], index=['apple', 'pear', 'orange', 'apple'])
s2

In [None]:
s2['orange']

In [None]:
s2[['orange', 'pear']] # note the double brackets: the inner list selects both the orange and pear labels

In [None]:
s2['apple']

In [None]:
# labels can mix types: here a string, an integer, and a Boolean
test = Series([1,2,3],index=['foo',17,True])
test

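Note that the lookups `s2[['orange', 'pear']]` and `s2['apple']` both returned Series objects: the first because a list of labels was passed, the second because the label `'apple'` appears twice.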
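
The return type of a lookup depends on what you ask for. The sketch below repeats the fruit data under an illustrative name (`fruit`) and compares a unique label, a duplicated label, and a list of labels.

```python
import pandas as pd

fruit = pd.Series([42, 17, 99, 3.1415],
                  index=['apple', 'pear', 'orange', 'apple'])

unique_hit = fruit['orange']          # unique label: a single value
dup_hit = fruit['apple']              # duplicated label: a Series with both rows
list_hit = fruit[['orange', 'pear']]  # list of labels: always a Series

print(type(unique_hit).__name__)
print(type(dup_hit).__name__, len(dup_hit))
print(type(list_hit).__name__, len(list_hit))
```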

You can apply math and other NumPy-like operations:

In [None]:
s2 * 2

In [None]:
np.cos(s2)
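
A short sketch of what "NumPy-like" means in practice: operations apply element by element, and the labels travel with the results. The Series name `w` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

w = pd.Series([42, 17, 99], index=['apple', 'pear', 'orange'])

doubled = w * 2      # element-wise arithmetic; labels are carried along
heavy = w[w > 20]    # a Boolean mask keeps only the matching labels
roots = np.sqrt(w)   # NumPy ufuncs return a Series, not a bare array

print(doubled)
print(heavy)
print(roots)
```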

Data aligns by label in arithmetic operations:

In [None]:
s3 = Series([1, 2, 3, 4], ['a', 'b', 'c', 'd'])
s4 = Series([5, 6, 7, 8, 9], ['d', 'b', 'a', 'e', 'd'])
s3 + s4

In [None]:
# NaN can also be stored explicitly; use np.nan (the np.NaN alias was removed in NumPy 2.0)
s5 = Series(['hello', 'goodbye', np.nan], index=['a','b','c'])
s5

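Note that in `s3 + s4` the labels present in only one Series (`c` and `e`) turned into NaNs, the pandas notation for missing data. The duplicated label `d` produces one result row per match.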
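
If NaN is not what you want for unmatched labels, one option is the `add` method with `fill_value`. The sketch below uses two small illustrative Series (`left`, `right`) without duplicate labels to keep the result easy to read.

```python
import pandas as pd

left = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
right = pd.Series([5, 6, 7], index=['d', 'b', 'a'])

plain = left + right                     # 'c' has no partner, so it becomes NaN
filled = left.add(right, fill_value=0)   # treat the missing partner as 0

print(plain)
print(filled)
print(plain.isna())  # isna() flags the missing entries
```

Whether 0 is a sensible stand-in depends on the data; for measurements, leaving the NaN in place is often the safer choice.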

Series objects can also be *named*, via the `name` property:

In [None]:
s2.name = 'tonnes'
s2

The index can also be named:

In [None]:
s2.index.name = 'fruit'
s2

In [None]:
s2['orange']
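
One place where these names pay off is when a Series is turned into a table. As a sketch (with an illustrative Series called `named`), `reset_index()` uses the index name and the Series name as column headers.

```python
import pandas as pd

named = pd.Series([42, 17, 99],
                  index=['apple', 'pear', 'orange'],
                  name='tonnes')
named.index.name = 'fruit'

# the index becomes a 'fruit' column and the values a 'tonnes' column
table = named.reset_index()
print(table)
```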

## DataFrame
---
A data structure which functions much like a database table

- ordered collection of column series
- column index labels the columns, similar to attribute names
- row index labels rows, similar to a primary key

However, a DataFrame is more complex than a database table (and more powerful!).

You can make a DataFrame object from a dictionary object:

In [None]:
from pandas import DataFrame

df = DataFrame(
    {'fruit' : ['apple', 'orange', 'peach', 'apple'],
     'tonnes' : [42, 17, 99, 3.1415],
     'type' : ['pome', 'citrus', 'drupe', 'pome']})

print(df)  # plain-text rendering of the table
# first two rows, with the columns in a different order
df[:2][['fruit', 'type', 'tonnes']]

...although mostly we'll be getting DataFrames in other ways, such as from external sources.
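
As a sketch of what "external sources" can look like, the cell below reads CSV text with `pd.read_csv`. To stay self-contained it reads from an in-memory string (`csv_text`, made up for this example) rather than a real file; with a file you would pass its path instead.

```python
import io
import pandas as pd

# made-up CSV content standing in for a file on disk
csv_text = "fruit,tonnes,type\napple,42,pome\norange,17,citrus\n"

# read_csv accepts a path or any file-like object
from_csv = pd.read_csv(io.StringIO(csv_text))
print(from_csv)
```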

DataFrame objects have much of the same extensible naming/indexing as Series objects:

In [None]:
df.index = ['crate 1', 'crate 2', 'crate 16', 'crate 11']
df

In [None]:
df.index.name = 'location'
df

You access columns by name, using either [ ] or the . operator:

In [None]:
df['fruit']  # or df.fruit

When you access a single column of a DataFrame, pandas returns it as a Series.

Write a line of code below to find out the type of df['fruit']

In [None]:
# your code here

In [None]:
df[['tonnes', 'fruit']] # Notice the double bracket notation used to select two columns of a DataFrame


In [None]:
# What is the type of df[['tonnes', 'fruit']]?
# Your code here

However, note that slicing notation applies to rows:

In [None]:
df[1:3]

You can more precisely access rows by label or by position using the `loc` and `iloc` indexers, respectively. They are used with square brackets (*they are not methods!*):

In [None]:
df.loc['crate 16', ['fruit','tonnes']]

In [None]:
df.loc[:'crate 16', ['type', 'tonnes']]

In [None]:
df.iloc[1:3,1:2] # when using .iloc[] to slice rows and columns, 
# the rows to slice are specified first (1:3) and then the columns (1:2), separated by a comma

In [None]:
df.iloc[3]
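
One detail that is easy to miss: slices with `loc` include both endpoints, while slices with `iloc` follow the usual Python rule and exclude the stop position. A minimal sketch, using a reduced copy of the crate table under the illustrative name `crates`:

```python
import pandas as pd

crates = pd.DataFrame(
    {'fruit': ['apple', 'orange', 'peach', 'apple'],
     'tonnes': [42, 17, 99, 3.1415]},
    index=['crate 1', 'crate 2', 'crate 16', 'crate 11'])

by_label = crates.loc['crate 2':'crate 16']  # label slice: both endpoints included
by_position = crates.iloc[1:2]               # position slice: stop is excluded

print(by_label)
print(by_position)
```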

There's also Boolean indexing:

In [None]:
df[df['fruit']=='apple']

In [None]:
df[df.tonnes > 20]
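
Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The sketch below uses an illustrative copy of the crate table named `crates`.

```python
import pandas as pd

crates = pd.DataFrame(
    {'fruit': ['apple', 'orange', 'peach', 'apple'],
     'tonnes': [42, 17, 99, 3.1415]},
    index=['crate 1', 'crate 2', 'crate 16', 'crate 11'])

# apples AND more than 20 tonnes
big_apples = crates[(crates['fruit'] == 'apple') & (crates['tonnes'] > 20)]

# everything that is NOT an apple
not_apples = crates[~(crates['fruit'] == 'apple')]

print(big_apples)
print(not_apples)
```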

Confused yet?

We'll explore these further as needed.  Don't forget the pandas documentation under the Help menu in your notebook!

Also, here's a ["cheat sheet"](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).

## Palmer Penguins Dataset
---
A fun toy dataset covering penguin morphology. Used to test basic machine learning models and data analysis. 

In [None]:
penguins = load_penguins()
penguins.head()

In [None]:
penguins.keys()

In [None]:
print(penguins.species)

The cells above list the column labels with `keys()` and show a single column, `species`.

Adding or deleting a column is simple. Here we delete the `year` column:

In [None]:
penguins = penguins.drop(columns=['year'])
penguins
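
Adding a column works by assignment. The sketch below uses a tiny stand-in table (`birds`, with made-up values) instead of the real penguins data so that it runs on its own. The new column name `body_mass_kg` is also just an illustration.

```python
import pandas as pd

# tiny stand-in for the penguins table; the values are made up
birds = pd.DataFrame({'body_mass_g': [3750.0, 3800.0],
                      'year': [2007, 2008]})

# add a column by assigning to a new column name
birds['body_mass_kg'] = birds['body_mass_g'] / 1000

# drop returns a new DataFrame, so assign the result back
birds = birds.drop(columns=['year'])
print(birds)
```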

## Basic Statistics
---
pandas provides the `describe` method (similar to R's `summary`):

In [None]:
penguins.describe()
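
By default `describe` summarizes only the numeric columns. As a sketch on a small made-up table (`sample`), passing `include='all'` also reports counts and the most frequent value for text columns.

```python
import pandas as pd

# made-up values, for illustration only
sample = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Adelie'],
                       'mass_kg': [3.7, 5.0, 3.9]})

numeric_summary = sample.describe()             # numeric columns only
full_summary = sample.describe(include='all')   # text columns as well

print(numeric_summary)
print(full_summary)
```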

pandas has other convenience methods.  How about pairwise correlations in the data?

In [None]:
penguins[['bill_depth_mm', 'bill_length_mm', 'flipper_length_mm', 'body_mass_g']].corr()
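
The numeric columns are selected explicitly above because correlation is undefined for text columns such as `species`. An alternative, assuming a reasonably recent pandas (the `numeric_only` argument of `corr` is available from version 1.5), is to let pandas skip the non-numeric columns. The table `measurements` below is made up for this sketch.

```python
import pandas as pd

# made-up values; y is exactly 2 * x, so the correlation is 1
measurements = pd.DataFrame({'species': ['A', 'B', 'A', 'B'],
                             'x': [1.0, 2.0, 3.0, 4.0],
                             'y': [2.0, 4.0, 6.0, 8.0]})

# skip the text column instead of listing the numeric ones by hand
corr = measurements.corr(numeric_only=True)
print(corr)
```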

We can take sums, means, standard deviations, etc. by row or column:

In [None]:
penguins[['bill_depth_mm', 'bill_length_mm', 'flipper_length_mm', 'body_mass_g']].sum()

In [None]:
penguins[['bill_depth_mm', 'bill_length_mm', 'flipper_length_mm', 'body_mass_g']].sum(axis=1)[:10] 
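
The `axis` argument decides the direction of the calculation, which is easier to see on a tiny table. In this sketch (with an illustrative DataFrame named `grid`), the default `axis=0` gives one result per column and `axis=1` gives one result per row.

```python
import pandas as pd

grid = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_totals = grid.sum()          # axis=0 (default): one total per column
row_totals = grid.sum(axis=1)    # axis=1: one total per row
col_means = grid.mean()          # the same rule applies to mean, std, etc.

print(col_totals)
print(row_totals)
print(col_means)
```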