### Notes on Ch 5: Getting Started with pandas

pandas is a fundamental tool for data analysis in Python. It offers data structures and manipulation tools for efficient data cleaning and analysis. It incorporates NumPy's array-based computing style but focuses on working with tabular or heterogeneous data. In contrast, NumPy is best suited for homogeneous numerical arrays.

In [1]:
import numpy as np
import pandas as pd

#### Introduction to pandas Data Structures

##### Series

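A Series is a one-dimensional array-like object that holds a sequence of values of a single dtype (the same kinds of types NumPy uses), together with an associated array of labels called its index. If you do not specify an index, pandas creates a default one running from 0 to N - 1, where N is the length of the data.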

In [2]:
obj = pd.Series([4, 7, -5, 3])
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64
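
As a small sketch of the two parts described above, the index and the values can be inspected separately. The variable names `index` and `values` are only illustrative.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# With no index given, pandas builds a default RangeIndex from 0 to len(obj) - 1.
index = obj.index

# to_numpy() returns the stored values as a NumPy array.
values = obj.to_numpy()

print(index)
print(values)
```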


You can create a Series with a custom index to label each data point:

In [3]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)

d    4
b    7
a   -5
c    3
dtype: int64


You can use labels in the index to select values:

In [4]:
print(obj2['a'])

-5
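
Label-based selection is not limited to a single label. The sketch below, which reuses the same `obj2` data, shows selecting several labels at once, filtering with a boolean condition, and assigning through a label. The names `subset` and `positive` are made up for illustration.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# A list of labels selects several values, in the order requested.
subset = obj2[["c", "a", "d"]]

# A boolean mask keeps only the entries where the condition holds;
# the index labels stay attached to the surviving values.
positive = obj2[obj2 > 0]

# Assigning through a label updates that entry of obj2.
obj2["d"] = 6

print(subset)
print(positive)
print(obj2)
```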


A Series can be seen as a fixed-length, ordered dictionary, where index values map to data values. You can use it like a dictionary:

In [6]:
"b" in obj2

True
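
A few more dictionary-style operations, as a rough sketch. Note that membership tests look at the index labels, not the stored values. The variable names here are illustrative only.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# `in` checks the index labels, like checking the keys of a dict.
has_label_b = "b" in obj2
has_label_e = "e" in obj2

# 7 is a value in obj2, not a label, so this membership test is False.
has_label_7 = 7 in obj2

# get() returns a default instead of raising KeyError, like dict.get.
value_or_default = obj2.get("e", 0)

# items() yields (label, value) pairs, like dict.items.
pairs = list(obj2.items())

print(has_label_b, has_label_e, has_label_7, value_or_default)
print(pairs)
```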

You can create a Series from a Python dictionary:

In [8]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


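* You can convert a Series back to a dictionary using `obj3.to_dict()`.
* By default, the order of the keys in the dictionary determines the order of the entries in the resulting Series. To get a different order, pass the labels you want as the `index` argument. Any label in `index` that is not a key in the dictionary gets the value `NaN`.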
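
A minimal sketch of both points, using the same `sdata` dictionary. Sorting the keys alphabetically is just one arbitrary choice of explicit order, and the names `round_trip` and `by_name` are illustrative.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Converting back gives a plain Python dict with the same keys and values.
round_trip = obj3.to_dict()

# Passing index= overrides the dictionary's key order;
# here the labels are arranged alphabetically.
by_name = pd.Series(sdata, index=sorted(sdata))

print(round_trip)
print(by_name)
```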

<b>Handling Missing Data</b>

Missing data in pandas is represented as `NaN` (Not a Number). You can detect missing values with `pd.isna(obj)` or `pd.notna(obj)`, each of which returns a boolean Series with the same index. Series also provides the equivalent instance methods, `obj.isna()` and `obj.notna()`.
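
The following sketch shows the function and method forms side by side. The Series `s` is a made-up example with one missing entry, not data from the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
s = pd.Series([1.5, np.nan, 3.0], index=["a", "b", "c"])

# The top-level function and the instance method give the same boolean mask.
mask_from_function = pd.isna(s)
mask_from_method = s.isna()

# Summing a boolean mask counts the True entries, i.e. the missing values.
n_missing = int(s.isna().sum())

# notna() is the inverse mask; indexing with it keeps only the present values.
present = s[s.notna()]

print(mask_from_function)
print(n_missing)
print(present)
```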

<b>Automatic Alignment</b>

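Series automatically aligns data by index label in arithmetic operations. This alignment is similar to an outer join in a database: the result's index is the union of the two indexes, and any label that has no value in one of the operands (or whose value is `NaN`) produces `NaN` in the result. In the example below, `obj4` is built from `sdata` with an explicit index, so it has a `NaN` for "California" and no entry for "Utah".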

In [10]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

result = obj3 + obj4
print(result)

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
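
If the `NaN` for labels found in only one operand is not what you want, one option is the `add` method with `fill_value`. This is a sketch of an alternative, not something the chapter notes cover. The names `total` and `filled` are illustrative. A label that is missing on both sides (here "California") still yields `NaN`.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Plain addition: a label missing from either side gives NaN.
total = obj3 + obj4

# add() with fill_value=0 treats a value missing on one side as 0,
# so "Utah" keeps its value from obj3. "California" has no value
# on either side, so it remains NaN.
filled = obj3.add(obj4, fill_value=0)

print(total)
print(filled)
```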


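Both the Series object and its index have a `name` attribute, which you can set by assignment. You can also replace a Series index in place by assigning a new sequence of labels to `index`; the new sequence must have the same length as the Series.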

In [11]:
obj4.name = "population"
obj4.index.name = "state"
print(obj4)

obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
print(obj)

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
