### Notes on Ch 5: Getting Started with pandas

pandas is a fundamental tool for data analysis in Python. It offers data structures and manipulation tools for efficient data cleaning and analysis. It incorporates NumPy's array-based computing style but focuses on working with tabular or heterogeneous data. In contrast, NumPy is best suited for homogeneous numerical arrays.

In [2]:
import numpy as np
import pandas as pd

#### Introduction to pandas Data Structures

##### Series

A Series is a one-dimensional array-like object that holds a sequence of values (with dtypes similar to NumPy's) and an associated array of labels, called its index. When no index is given, pandas uses the default integers 0 through N - 1.

In [2]:
obj = pd.Series([4, 7, -5, 3])
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64


You can create a Series with a custom index to label each data point:

In [3]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)

d    4
b    7
a   -5
c    3
dtype: int64


You can use labels in the index to select values:

In [4]:
print(obj2['a'])

-5


A Series can be seen as a fixed-length, ordered dictionary, where index values map to data values. You can use it like a dictionary:

In [6]:
"b" in obj2

True
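
As a minimal sketch of the dictionary analogy, the snippet below rebuilds `obj2` so it runs on its own. The point to notice is that `in` tests index labels, not stored values. The variable names (`has_label`, `has_value`, `missing`, `pairs`) are only illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Membership checks look at the index labels, like dict keys
has_label = "b" in obj2
has_value = 7 in obj2      # 7 is a value, not a label

# Dict-style lookup with a fallback for a label that does not exist
missing = obj2.get("z", 0)

# Labels and values can be iterated together, like dict items
pairs = list(obj2.items())

print(has_label, has_value, missing)
print(pairs)
```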

You can create a Series from a Python dictionary:

In [8]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


* You can convert a Series back to a dictionary using `obj3.to_dict()`.
* The order of keys in the dictionary determines the order in the resulting Series. You can override this by passing an index explicitly.
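
Both points can be checked with a short sketch. It reuses the `sdata` dictionary from above; `round_trip` and `ordered` are names made up for this example.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Series -> dict gives back the original mapping
round_trip = obj3.to_dict()

# An explicit index sets the order and keeps only the listed keys
ordered = pd.Series(sdata, index=["Utah", "Texas", "Ohio"])

print(round_trip)
print(ordered)
```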

<b>Handling Missing Data</b>

Missing data in pandas is represented as `NaN` (Not a Number). You can detect missing data using `pd.isna(obj)` or `pd.notna(obj)`. Series also has the equivalent instance methods, `obj.isna()` and `obj.notna()`.
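
A small sketch of these checks, assuming a Series built from `sdata` with one label ("California") that has no matching key. The names `mask_fn`, `mask_method`, and `present` are illustrative only.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
# "California" has no entry in sdata, so its value becomes NaN
s = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

mask_fn = pd.isna(s)        # top-level function
mask_method = s.isna()      # equivalent instance method
present = s[s.notna()]      # keep only the labels that have data

print(mask_fn)
print(present)
```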

<b>Automatic Alignment</b>

Series automatically aligns data by index label in arithmetic operations. This alignment is similar to a database join operation: the result contains the union of the labels, and any label found in only one operand yields `NaN`.

In [10]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

result = obj3 + obj4
print(result)

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64


Both the Series object and its index can have names. You can also replace a Series's index in place by assigning to its `index` attribute.

In [11]:
obj4.name = "population"
obj4.index.name = "state"
print(obj4)

obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
print(obj)

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64


##### DataFrame

A DataFrame is a two-dimensional tabular data structure in pandas, similar to a spreadsheet or SQL table. It consists of rows and columns, where each column can have a different data type (e.g., numeric, string, Boolean). We can think of it as a collection of Series objects that share the same index, with each column being a Series.

In [3]:
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


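A DataFrame has both a row index and a column index. By default the rows get an integer index (0 through N - 1), and the columns are named after the keys in the dictionary. The `head()` and `tail()` methods display the first and last five rows, respectively. A column can be retrieved with dictionary-like notation (`frame["state"]`) or, when its name is a valid Python identifier that does not clash with an existing DataFrame attribute, with dot notation (`frame.year`). New columns are added by assignment using the bracket form.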

In [4]:
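# First and last 5 rows
print(frame.head())
print(frame.tail())

# Accessing columns
print(frame["state"])
print(frame.year)

# Creating new columns
# A scalar is broadcast to every row
frame["debt"] = 16.5
# An array must match the DataFrame's length; this overwrites the scalar above
frame["debt"] = np.arange(6.)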
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64


    state  year  pop  debt
0    Ohio  2000  1.5   0.0
1    Ohio  2001  1.7   1.0
2    Ohio  2002  3.6   2.0
3  Nevada  2001  2.4   3.0
4  Nevada  2002  2.9   4.0
5  Nevada  2003  3.2   5.0
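
Dot notation has a catch that this dataset happens to expose: the column named `pop` shares its name with the `DataFrame.pop` method, so attribute access returns the method rather than the column. The sketch below uses a shortened copy of the data; `by_key`, `by_attr`, and `year_attr` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Nevada"],
    "year": [2000, 2001],
    "pop": [1.5, 2.4],
})

by_key = frame["pop"]     # bracket form always returns the column
by_attr = frame.pop       # the DataFrame.pop method, not the column
year_attr = frame.year    # fine: "year" does not clash with any attribute

print(type(by_key).__name__)
print(callable(by_attr))
```

For that reason the bracket form is the safer default, and it is the only form that can create a new column.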


If you assign a Series to a DataFrame column, its labels are aligned with the DataFrame's index, and any index values missing from the Series are filled with `NaN`. Columns can be deleted using the `del` keyword:

In [5]:
del frame["debt"]
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
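
The index alignment that happens when a Series is assigned to a column is not exercised by the cell above, so here is a minimal sketch of it. The Series `val` and its values are made up for illustration: it covers only labels 0 and 2 and includes a label (5) that the small frame does not have.

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
})

# Labels 0 and 2 exist in frame; label 5 does not
val = pd.Series([-1.2, -1.5, -1.7], index=[0, 2, 5])
frame["debt"] = val

# Row 1 had no matching label, so it is NaN; label 5 is silently dropped
debt = frame["debt"]
print(frame)
```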


Rows can also be retrieved by label or by integer position with the special `loc` and `iloc` attributes, respectively:

In [9]:
print(frame.loc[2])

print(frame.iloc[1])

state    Ohio
year     2002
pop       3.6
Name: 2, dtype: object
state    Ohio
year     2001
pop       1.7
Name: 1, dtype: object
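
With the default integer index, labels and positions coincide, which hides the difference between the two attributes. The sketch below assumes a hypothetical string index (`"a"`, `"b"`, `"c"`) to separate them.

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]},
    index=["a", "b", "c"],  # hypothetical row labels
)

by_label = frame.loc["b"]      # the row whose label is "b"
by_position = frame.iloc[1]    # the second row, whatever its label

label_slice = frame.loc["a":"b"]   # label slices include the end point
pos_slice = frame.iloc[0:1]        # position slices exclude it

print(by_label)
print(len(label_slice), len(pos_slice))
```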


DataFrames can be transposed to swap rows and columns:

In [10]:
frame.T

          0     1     2       3       4       5
state  Ohio  Ohio  Ohio  Nevada  Nevada  Nevada
year   2000  2001  2002    2001    2002    2003
pop     1.5   1.7   3.6     2.4     2.9     3.2


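We can convert a DataFrame to a two-dimensional NumPy array using the `to_numpy()` method. Because the columns here hold different types, the resulting array has dtype `object`: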

In [11]:
frame.to_numpy()

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], dtype=object)
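
The `object` dtype above comes from mixing strings with numbers. As a rough sketch on a shortened copy of the data, selecting only the numeric columns first gives a regular floating-point array. The names `mixed` and `numeric` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Nevada"],
    "year": [2000, 2001],
    "pop": [1.5, 2.4],
})

mixed = frame.to_numpy()                      # strings + numbers -> object
numeric = frame[["year", "pop"]].to_numpy()   # int64 + float64 -> float64

print(mixed.dtype, numeric.dtype)
```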