# Introduction to Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.


In [17]:
import numpy as np
import pandas as pd

## Pandas Data Structures

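Pandas introduces two new data structures to Python, `Series` and `DataFrame`, both of which are built on top of NumPy. Because NumPy stores values in typed arrays and runs operations on them in compiled code, pandas operations are typically much faster than equivalent Python loops.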

You primarily interact with DataFrames, but it's useful to understand Series as well.

### NumPy arrays and ndarrays
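
As a minimal sketch of the structure pandas builds on: an `ndarray` is NumPy's fixed-type, n-dimensional array. The variable names and values below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A one-dimensional ndarray: every element shares one dtype
a = np.array([1, 3, 5, 6, 8])

# A two-dimensional ndarray with 2 rows and 3 columns
m = np.arange(6).reshape(2, 3)

# Operations are vectorized: they apply to every element without a Python loop
doubled = a * 2

# A Series wraps an array like this one; to_numpy() hands the values back
values = pd.Series(a).to_numpy()

print(a.ndim, m.shape)
print(doubled)
print(type(values).__name__)
```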

### Series

A `Series` is a one-dimensional labeled array, similar to a list or a single column in a table. Each item has an index label. By default the labels are the integers 0 to N - 1, where N is the length of the Series.


In [2]:
# Importing numpy
import numpy as np

# Creating a Series by passing a list of values
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s


0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
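
The index does not have to be the default integers. The following is a minimal sketch, with hypothetical labels and values, of a Series with custom labels and of the two ways to look up an item. It also shows why the Series above prints as `float64`: `NaN` is a floating-point value, so the integers next to it are stored as floats.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])

by_label = temps["tue"]        # look up by index label
by_position = temps.iloc[0]    # look up by integer position

# Integers mixed with NaN are stored as floats
s = pd.Series([1, 3, 5, np.nan, 6, 8])
n_missing = s.isna().sum()     # number of missing values

print(by_label, by_position)
print(s.dtype, n_missing)
```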

### DataFrame

A `DataFrame` is a two-dimensional table of data with columns that can be of different types (similar to a spreadsheet). It can be thought of as a dictionary of `Series` objects that share one index. The DataFrame is the most commonly used pandas object.


In [3]:
# Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df


                   A         B         C         D
2023-01-01 -1.970224 -0.052564  2.362557 -0.063333
2023-01-02  0.032486 -0.062967  2.810989  2.768199
2023-01-03 -1.272938 -0.040371  1.460796  0.869516
2023-01-04 -0.659098 -0.441819  1.586280 -0.223838
2023-01-05  0.758263 -1.228798  0.917047 -0.189668
2023-01-06 -0.056239  0.935787  1.001919  0.561253
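
The "dictionary of `Series`" view can be made concrete by building a DataFrame from a dict, where each key becomes a column. This is a minimal sketch; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical column names and values, chosen only for illustration
df_small = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": [1, 2, 3],
    "score": [0.5, 1.5, 2.5],
})

print(df_small.dtypes)       # each column keeps its own dtype

col = df_small["score"]      # pulling out one column gives a Series
print(type(col).__name__)
```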


## Viewing Data

Pandas provides several methods to view and inspect your data. Here are a few examples:


In [4]:
# Display the columns
print(df.columns)
# Viewing the top rows of the DataFrame
df.head()


Index(['A', 'B', 'C', 'D'], dtype='object')


                   A         B         C         D
2023-01-01 -1.970224 -0.052564  2.362557 -0.063333
2023-01-02  0.032486 -0.062967  2.810989  2.768199
2023-01-03 -1.272938 -0.040371  1.460796  0.869516
2023-01-04 -0.659098 -0.441819  1.586280 -0.223838
2023-01-05  0.758263 -1.228798  0.917047 -0.189668
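
A few other inspection calls are worth knowing alongside `columns` and `head()`. The sketch below rebuilds a frame of the same shape so it runs on its own; the seeded random generator is an assumption added for repeatability and does not reproduce the numbers shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so the example is repeatable
dates = pd.date_range("20230101", periods=6)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.tail(2))      # last two rows
print(df.index)        # the row labels (a DatetimeIndex here)
print(df.shape)        # (number of rows, number of columns)
print(df.dtypes)       # dtype of each column

summary = df.describe()   # count, mean, std, min, quartiles, max per column
print(summary)
```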


## Data Manipulation

Pandas provides a wide range of data manipulation methods. Let's explore some of them.


In [8]:
# Sorting by values.
# Note that this returns a new DataFrame rather than sorting in place.
# Returning a new DataFrame rather than modifying the original is the default behavior.
df_sorted = df.sort_values(by='B')
df_sorted

                   A         B         C         D
2023-01-05  0.758263 -1.228798  0.917047 -0.189668
2023-01-04 -0.659098 -0.441819  1.586280 -0.223838
2023-01-02  0.032486 -0.062967  2.810989  2.768199
2023-01-01 -1.970224 -0.052564  2.362557 -0.063333
2023-01-03 -1.272938 -0.040371  1.460796  0.869516
2023-01-06 -0.056239  0.935787  1.001919  0.561253
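
To see that sorting leaves the original untouched, here is a minimal sketch on a small hypothetical frame. It also shows descending order and `sort_index()`, which sorts by the row labels instead of by a column.

```python
import pandas as pd

# Small hypothetical frame, chosen only for illustration
df_demo = pd.DataFrame({"B": [0.3, -1.2, 0.9]}, index=["x", "y", "z"])

desc = df_demo.sort_values(by="B", ascending=False)   # largest value first
by_index = desc.sort_index()                          # back to label order

# The original keeps its order because sort_values returned a new object
print(df_demo.index.tolist())   # ['x', 'y', 'z']
print(desc.index.tolist())      # ['z', 'x', 'y']
print(by_index.index.tolist())  # ['x', 'y', 'z']
```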


In [13]:
# Selecting multiple columns
print(df[["A","B"]])
# Selecting a single column, which yields a Series
df['A']



                   A         B
2023-01-01 -1.970224 -0.052564
2023-01-02  0.032486 -0.062967
2023-01-03 -1.272938 -0.040371
2023-01-04 -0.659098 -0.441819
2023-01-05  0.758263 -1.228798
2023-01-06 -0.056239  0.935787


2023-01-01   -1.970224
2023-01-02    0.032486
2023-01-03   -1.272938
2023-01-04   -0.659098
2023-01-05    0.758263
2023-01-06   -0.056239
Freq: D, Name: A, dtype: float64

In [12]:
# Passing a slice to [] selects _rows_ by position (here rows 0 and 1)
df[0:2]

                   A         B         C         D
2023-01-01 -1.970224 -0.052564  2.362557 -0.063333
2023-01-02  0.032486 -0.062967  2.810989  2.768199
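
Rows can also be sliced by index label rather than by position. The sketch below assumes the same kind of date index and uses `.loc`, the label-based accessor in pandas. Note the difference at the boundaries: a positional slice excludes its end, while a label slice includes both ends. The integer values are placeholders so the result is easy to check.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20230101", periods=6)
# Placeholder values 0..23 instead of random numbers
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df[0:2]                          # rows 0 and 1; the end is excluded
by_label = df.loc["2023-01-02":"2023-01-04"]   # three rows; both ends are included

print(by_position)
print(by_label)
```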


## Selection by Position

For purely integer-based (positional) indexing, pandas provides the `iloc` accessor. Positions start at 0, and slices exclude the end position, as in Python and NumPy.


In [14]:
# Select a row by its integer position (position 3 is the fourth row)
df.iloc[3]


A   -0.659098
B   -0.441819
C    1.586280
D   -0.223838
Name: 2023-01-04 00:00:00, dtype: float64

In [15]:
# By integer slices, acting similar to numpy/python
df.iloc[3:5, 0:2]


                   A         B
2023-01-04 -0.659098 -0.441819
2023-01-05  0.758263 -1.228798


In [16]:
# By lists of integer position locations, similar to the numpy/python style
df.iloc[[1, 2, 4], [0, 2]]


                   A         C
2023-01-02  0.032486  2.810989
2023-01-03 -1.272938  1.460796
2023-01-05  0.758263  0.917047
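
A few more positional patterns follow the same rules. This is a minimal sketch with placeholder integer values (not the random data above) so each result is easy to verify by eye.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20230101", periods=6)
# Placeholder values 0..23: row i holds 4*i, 4*i + 1, 4*i + 2, 4*i + 3
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

rows = df.iloc[1:3, :]       # rows 1 and 2, all columns
cols = df.iloc[:, 1:3]       # all rows, columns B and C
value = df.iloc[1, 1]        # the single value at row 1, column 1
same_value = df.iat[1, 1]    # iat is a dedicated accessor for one scalar

print(rows)
print(cols)
print(value, same_value)     # 5 5
```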
