# Lecture 4b: Introduction to Pandas

(Summer 2023)

## Pandas

The Pandas package is built on top of NumPy. It gives us an efficient implementation of the `DataFrame`, which is essentially a two-dimensional array with attached row and column labels that can hold heterogeneous data and missing values.

The package is particularly good for data wrangling tasks such as grouping and pivot tables.
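To make "grouping and pivot tables" concrete, here is a minimal sketch. The table `df` and its numbers are made up for illustration; only `groupby` and `pivot_table` are the real pandas calls being shown.

```python
import pandas as pd

# Made-up yield values, for illustration only.
df = pd.DataFrame({
    "State": ["KANSAS", "KANSAS", "OHIO", "OHIO"],
    "Year": [2020, 2021, 2020, 2021],
    "Value": [40.0, 50.0, 70.0, 80.0],
})

# Grouping: mean of Value within each state.
mean_by_state = df.groupby("State")["Value"].mean()
print(mean_by_state)

# Pivot table: one row per year, one column per state.
wide = df.pivot_table(index="Year", columns="State", values="Value")
print(wide)
```

Grouping collapses rows that share a key into one summary value, while the pivot table reshapes the same data so that years run down the rows and states across the columns.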

There are three main Pandas data structures: `Series`, `DataFrame`, and `Index`.
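As a quick sketch of how the three structures fit together, the cell below builds a tiny table with made-up values (the labels `r1`, `r2`, `r3` and the data are assumptions for illustration) and shows where each structure appears.

```python
import numpy as np
import pandas as pd

# A small table with mixed column types and one missing value (made-up data).
df = pd.DataFrame(
    {"County": ["A", "B", "C"], "Value": [41.5, np.nan, 38.0]},
    index=["r1", "r2", "r3"],
)

print(type(df))            # DataFrame: the whole table
print(type(df["Value"]))   # Series: a single column
print(type(df.index))      # Index: the row labels
print(type(df.columns))    # Index: the column labels
print(df.dtypes)           # each column keeps its own dtype
print(df["Value"].isna())  # missing values are flagged, not dropped
```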

For documentation and tutorials see ...

Also see: <a href="https://pandas.pydata.org/docs/" target="_blank">Pandas Documentation</a>

In [1]:
# Bring in the packages we have used before.

import math
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import scipy.linalg as la

In [2]:
# What version of pandas are we running ... ?
pd.__version__

'1.2.4'

## Series

A Pandas series is a one-dimensional array of indexed data. There are a number of ways to create one ...

In [3]:
# From a list ...
L1 = [0.1, 0.2, 0.5, 0.7, 0.9]
type(L1)

list

In [4]:
Ldata = pd.Series(L1)
print(Ldata)
print(type(Ldata))

0    0.1
1    0.2
2    0.5
3    0.7
4    0.9
dtype: float64
<class 'pandas.core.series.Series'>


In [None]:
type(Ldata.values)

In [None]:
Ldata.index

### Data in a Pandas Series can be accessed analogously to a numpy array ...

In [None]:
Ldata[3]

In [None]:
type(Ldata[3])

In [None]:
Ldata[2:4]

In [None]:
type(Ldata[2:4])
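Plain square brackets work here because the default index is 0, 1, 2, ..., so labels and positions coincide. A hedged sketch of the more explicit spellings, `.loc` (by label) and `.iloc` (by position), using the same numbers as above:

```python
import pandas as pd

Ldata = pd.Series([0.1, 0.2, 0.5, 0.7, 0.9])

one = Ldata[3]       # a single label returns a scalar
part = Ldata[2:4]    # a slice returns a new Series
print(type(one))
print(type(part))

# Explicit spellings of the two kinds of lookup.
print(Ldata.loc[3])      # by label
print(Ldata.iloc[2:4])   # by position: end point excluded, as in numpy
print(Ldata.loc[2:4])    # by label: end point included
```

The one difference worth remembering is that a label slice with `.loc` includes its end point, while a positional slice with `.iloc` does not.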

### Numpy array vs. Pandas Series ...

A numpy array has an implicitly defined integer index, while a pandas series object has an explicitly defined index. And the index does not have to consist of integers.

In [None]:
Ldata = pd.Series(L1,index=['cat','dog',42, 'A','3'])

In [None]:
Ldata

In [None]:
type(Ldata)

In [None]:
Ldata['dog']

In [None]:
Ldata.values

In [None]:
Ldata.index
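With a mixed index like this one it pays to say whether you mean a label or a position. A small sketch using the same series, with `.loc` for labels and `.iloc` for positions:

```python
import pandas as pd

Ldata = pd.Series([0.1, 0.2, 0.5, 0.7, 0.9],
                  index=["cat", "dog", 42, "A", "3"])

print(Ldata.loc["dog"])    # label lookup
print(Ldata.loc[42])       # the integer 42 is a label here, not a position
print(Ldata.loc["3"])      # the string "3" is a label too, distinct from the integer 3
print(Ldata.iloc[0])       # position 0, whatever its label happens to be
print(Ldata.index.dtype)   # mixed labels are stored with dtype object
```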

### Pandas Series vs. Dictionary ...

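A pandas series is similar to a dictionary object, which maps arbitrary keys to arbitrary values. The difference is that a pandas series maps typed keys to typed values: the index and the values each have a dtype, which is what makes operations on a series efficient ...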

Recall ...

### <u>Dictionaries</u> are changeable, indexed by key, and (since Python 3.7) keep insertion order. Written with "{}" and made up of key-value pairs.

A **key-value pair** is a key and a value separated by a colon. Keys are often strings but can be any hashable object, and values can be any object. Different key-value pairs are separated by commas. It looks like `{"key1": "value1", "key2": "value2"}`.

In [None]:
# Make some dictionaries of farm equipment.
OldCombine = {"brand": "CASE", "model": "7130", "year": 2014}
NewCombine = {"brand": "CASE", "model": "8240", "year": 2016}
Tractor1 = {"brand": "CASE", "model": "290", "year": 2013}
Pickup = {"brand": "CHEVY", "model": "Silverado", "year": 2005}
FavoriteOldCombineEver = {"brand": "JD", "model": "7720", "year": 1978, "color": "green"}

# Create a dictionary of farm equipment from the dictionaries of
# individual machines.

FarmEquipment = {"C1": OldCombine, "C2": NewCombine, "T1": Tractor1, "P1": Pickup, "C3": FavoriteOldCombineEver}
print(FarmEquipment)

In [None]:
newLdata = pd.Series(FarmEquipment)

In [None]:
newLdata

In [None]:
newLdata['T1']

In [None]:
FarmEquipment.keys()

In [None]:
FarmEquipment.values()
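To see what a series adds beyond a dictionary, here is a sketch with a simpler dictionary that maps each machine to its model year (the years are taken from the dictionaries above; the names `years` and `s` are just for this example).

```python
import pandas as pd

# Model years taken from the equipment dictionaries above.
years = {"C1": 2014, "C2": 2016, "T1": 2013, "P1": 2005, "C3": 1978}
s = pd.Series(years)

print(s["T1"])            # lookup by key, like a dictionary
print("P1" in s)          # membership tests the index, like dictionary keys
print(s.dtype)            # one dtype shared by all the values
print(s.loc["C2":"P1"])   # label slicing, which a dictionary does not offer
print(2023 - s)           # arithmetic applied to every value at once
print(s[s > 2010])        # selection with a Boolean condition
```

The last three lines are the array-style operations that a plain dictionary does not support.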

## Example Data ...

I will want some data to illustrate Pandas concepts. For this I will use the wheat yields data set shared earlier. The code below was used before just to wrangle the raw file and save it as a nice csv.

In [None]:
# This cell prints the first few lines of the sample file so that I can
# examine the header categories. It is helpful to look at the
# header names along with a sample of their values.

csv_file_name = 'Data/WheatYields.csv'

# Open the file for reading only and print the first few lines.
# As written the first 10 lines will be printed.
# The with block closes the file automatically.

with open(csv_file_name, "r") as fin:
    for i in range(10):
        line = fin.readline()
        print(line, end='')
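Pandas can do the same kind of peek. The sketch below uses an in-memory stand-in for the file so that it runs anywhere: the column names are the ones used later in these notes, but the rows are made up and the real file may have more columns.

```python
import io
import pandas as pd

# Stand-in for the CSV file. The rows are made up for illustration.
csv_text = """Year,State,County,County ANSI,Value
2020,CALIFORNIA,FRESNO,19,85.0
2020,KANSAS,FORD,57,42.0
2021,CALIFORNIA,FRESNO,19,
"""

# Parse only the first 2 data rows.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview)

# Read everything, then look at the top of the table and a summary.
full = pd.read_csv(io.StringIO(csv_text))
print(full.head())   # first rows (5 by default)
full.info()          # column names, dtypes, and non-null counts
```

With the real file, pass `csv_file_name` to `pd.read_csv` in place of the `io.StringIO` object.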

## Pandas DataFrame and Series

In [None]:
WheatYields = pd.read_csv(csv_file_name)

In [None]:
type(WheatYields)

In [None]:
WheatYields.index

In [None]:
WheatYields.columns

In [None]:
Value = WheatYields['Value']

In [None]:
type(Value)

In [None]:
Value

In [None]:
print(WheatYields)

In [None]:
WheatYields

## Pandas Series

In [None]:
NewWheatYields = WheatYields[['Year','State','County','County ANSI','Value']]

In [None]:
NewWheatYields

In [None]:
type(NewWheatYields)

In [None]:
NewWheatYields['State'].unique()

In [None]:
JustCalif = (NewWheatYields['State'] == 'CALIFORNIA')

In [None]:
JustCalif

In [None]:
JustCalifYields = NewWheatYields.loc[JustCalif]

In [None]:
JustCalifYields
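The pattern above is Boolean masking: the comparison produces a series of True/False values, one per row, and `.loc` keeps the rows marked True. Here is a self-contained sketch on made-up data that also shows how to combine two conditions.

```python
import pandas as pd

# Made-up rows using the same column names as the wheat data.
df = pd.DataFrame({
    "Year": [2019, 2020, 2020, 2021],
    "State": ["CALIFORNIA", "KANSAS", "CALIFORNIA", "CALIFORNIA"],
    "Value": [80.0, 42.0, 85.0, 90.0],
})

mask = df["State"] == "CALIFORNIA"   # Boolean Series, one entry per row
print(mask.sum())                    # number of matching rows
calif = df.loc[mask]                 # keep only the rows where mask is True
print(calif)

# Combine conditions with & (and) or | (or); each condition needs parentheses.
recent = df.loc[(df["State"] == "CALIFORNIA") & (df["Year"] >= 2020)]
print(recent)
```

Note that the filtered table keeps the original row labels, so its index is no longer 0, 1, 2, ... without gaps.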

## Exercise: Create the Time Series of Wheat Yields for Some Favorite Counties, Plot over time and compare ...

## Solution ...
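One possible approach, sketched on made-up data rather than the real file: filter to the counties of interest, pivot so that each county becomes a column indexed by year, and plot. The county names and values below are assumptions for illustration. With the real data, start from `NewWheatYields` and filter on `State` as well, since county names can repeat across states.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for NewWheatYields.
data = pd.DataFrame({
    "Year": [2019, 2020, 2021, 2019, 2020, 2021],
    "State": ["CALIFORNIA"] * 6,
    "County": ["FRESNO"] * 3 + ["KINGS"] * 3,
    "Value": [80.0, 85.0, 90.0, 70.0, 72.0, 75.0],
})

# Keep only the favorite counties.
favorites = ["FRESNO", "KINGS"]
subset = data.loc[data["County"].isin(favorites)]

# One row per year, one column per county: a time series for each county.
series = subset.pivot_table(index="Year", columns="County", values="Value")
print(series)

# Plot every county on the same axes to compare them.
ax = series.plot(marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Yield")
ax.set_title("Wheat yields for selected counties")
```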