# Chapter 5

Pandas supports more than one underlying type system: the classic NumPy types and the newer PyArrow types. Both can coexist in the same data frame, but seemingly similar types can behave differently:

In [1]:
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], dtype="float64")
s2 = pd.Series([0.3, 1.3, 2.7], dtype="float64[pyarrow]")

df = pd.DataFrame({"first": s1, "second": s2})

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   first   3 non-null      float64        
 1   second  3 non-null      double[pyarrow]
dtypes: double[pyarrow](1), float64(1)
memory usage: 180.0 bytes


Properties of NumPy types:
- Automatic fallback to Python `int` objects (dtype `object`) when integers are too large for a 64-bit integer type
- Automatic type conversion to `float64` when `NaN` values appear
- Conversion to "smaller" types is possible; overflows will happen without warning
- NumPy doesn't have a string type; strings are treated as general `object`s

Properties of PyArrow types:
- No automatic type conversions for large integers; an error is raised instead
- No silent overflow when converting to smaller types; an error is raised if one would occur
- Integer types can handle `<NA>` values
- No direct conversion from strings to floating point types; you must go through NumPy
- PyArrow does have a dedicated string type, but you need to create it with `pd.ArrowDtype(pa.string())`
    - The situation has changed since the book was published, but it's still unclear what the best approach to using strings is
    - This will be clarified in Pandas 3.x
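
The overflow behavior described in the two lists above can be checked with a minimal sketch. The values are illustrative and not from the book, and the exact exception class raised by the PyArrow-backed cast is not assumed, so the code catches a generic `Exception`.

```python
import pandas as pd

values = [1, 200, 300]  # 200 and 300 do not fit in int8 (-128..127)

# NumPy-backed: the cast succeeds and the out-of-range values wrap around silently.
numpy_small = pd.Series(values, dtype="int64").astype("int8")
print(numpy_small.tolist())

# PyArrow-backed: the same cast is expected to raise instead of wrapping.
try:
    pd.Series(values, dtype="int64[pyarrow]").astype("int8[pyarrow]")
    arrow_raised = False
except Exception as exc:
    arrow_raised = True
    print(type(exc).__name__)
```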

## Integer data

In [3]:
small_values = [1, 99, 127]
small_ser = pd.Series(small_values, dtype="int8[pyarrow]")
small_ser

0      1
1     99
2    127
dtype: int8[pyarrow]

In [4]:
large_values = [2**31, 2**63, 2**100]
large_ser = pd.Series(large_values) # hard to do with PyArrow types
large_ser

0                         2147483648
1                9223372036854775808
2    1267650600228229401496703205376
dtype: object

It's possible to have missing values in integer arrays with PyArrow types:

In [5]:
missing_values = [None, 1, -45]
missing_ser = pd.Series(missing_values, dtype="int8[pyarrow]")
missing_ser

0    <NA>
1       1
2     -45
dtype: int8[pyarrow]

## Floating-point data

In [6]:
float_vals = [1.5, 3.7, 127.0]
pd.Series(float_vals, dtype="float64[pyarrow]") # or double[pyarrow]

0      1.5
1      3.7
2    127.0
dtype: double[pyarrow]

In [7]:
float_missing = [None, 1.5, -45.0]
pd.Series(float_missing, dtype="float64[pyarrow]")

0    <NA>
1     1.5
2   -45.0
dtype: double[pyarrow]

In [8]:
float_rain = [1.5, 2.7, 0.0, "T", 1.5, 0]
pd.Series(float_rain)

0    1.5
1    2.7
2    0.0
3      T
4    1.5
5      0
dtype: object

In [9]:
pd.Series(float_rain).replace("T", "0.0")

0    1.5
1    2.7
2    0.0
3    0.0
4    1.5
5      0
dtype: object

The series above looks fine, but the `0.0` at index 3 is actually a string, so converting it to `float64[pyarrow]` won't work.

We need to convert the `"T"` into a numeric `0.0`.

In [10]:
pd.Series(float_rain).replace("T", 0.0)

0    1.5
1    2.7
2    0.0
3    0.0
4    1.5
5    0.0
dtype: float64

In [11]:
pd.Series(float_rain).replace("T", 0.0).astype("float64[pyarrow]")

0    1.5
1    2.7
2    0.0
3    0.0
4    1.5
5    0.0
dtype: double[pyarrow]

## String data

In [12]:
import pyarrow as pa
string_pa = pd.ArrowDtype(pa.string())

In [13]:
text_freeform = ["My name is Jeff", "I like pandas", "I like programming"]
pd.Series(text_freeform, dtype=string_pa)

0       My name is Jeff
1         I like pandas
2    I like programming
dtype: string[pyarrow]

In [14]:
text_with_missing = ["My name is Jeff", None, "I like programming"]
pd.Series(text_with_missing, dtype=string_pa)

0       My name is Jeff
1                  <NA>
2    I like programming
dtype: string[pyarrow]

## Categorical data

Use the `category` type for categorical data with a *low* number of categories. With too many categories, it becomes less efficient than storing the values as strings. PyArrow has a `dictionary` type for this kind of data, but the book's author sets it aside in favor of the Pandas 1.x `category` type. Unlike other PyArrow data types, `dictionary` is not exposed directly in Pandas.

Pandas also supports ordered categories, which sort and compare according to the declared category order.
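
To make the efficiency claim concrete, here is a rough sketch comparing memory use for a low-cardinality column. The data is made up for illustration, and exact byte counts will vary by pandas version and platform, so none are shown.

```python
import pandas as pd

# 30,000 rows drawn from only three distinct labels (illustrative data).
labels = ["CA", "NY", "TX"] * 10_000

as_object = pd.Series(labels, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The categorical version stores each label once plus a small integer code per row, which is why it wins when the number of distinct values is small.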

In [15]:
states = ['CA', 'NY', 'TX']
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

pd.Series(states, dtype='category')

0    CA
1    NY
2    TX
dtype: category
Categories (3, object): ['CA', 'NY', 'TX']

In [16]:
month_cat = pd.CategoricalDtype(categories=months, ordered=True)
pd.Series(months, dtype=month_cat).sort_values()

0     Jan
1     Feb
2     Mar
3     Apr
4     May
5     Jun
6     Jul
7     Aug
8     Sep
9     Oct
10    Nov
11    Dec
dtype: category
Categories (12, object): ['Jan' < 'Feb' < 'Mar' < 'Apr' ... 'Sep' < 'Oct' < 'Nov' < 'Dec']
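
Because the list above was already in calendar order, sorting it changes nothing. A short sketch with a shuffled subset (chosen here for illustration) shows more clearly what the ordering does:

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_cat = pd.CategoricalDtype(categories=months, ordered=True)

shuffled = pd.Series(["Mar", "Jan", "Dec", "Feb"], dtype=month_cat)

# Sorting follows the category order, not alphabetical order.
in_calendar_order = shuffled.sort_values().tolist()

# Ordered categories also support comparisons against a category value.
before_march = (shuffled < "Mar").tolist()

print(in_calendar_order, before_march, shuffled.min(), shuffled.max())
```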

In [17]:
pd.Series(months, dtype=string_pa).astype(month_cat)

0     Jan
1     Feb
2     Mar
3     Apr
4     May
5     Jun
6     Jul
7     Aug
8     Sep
9     Oct
10    Nov
11    Dec
dtype: category
Categories (12, object): ['Jan' < 'Feb' < 'Mar' < 'Apr' ... 'Sep' < 'Oct' < 'Nov' < 'Dec']

In [18]:
# Not included in book, but this is how to use the 1.x categorical type with an
# underlying 2.x PyArrow data type.
month_cat = pd.CategoricalDtype(categories=pd.Series(months, dtype=string_pa), ordered=True)
pd.Series(months, dtype=string_pa).astype(month_cat)

0     Jan
1     Feb
2     Mar
3     Apr
4     May
5     Jun
6     Jul
7     Aug
8     Sep
9     Oct
10    Nov
11    Dec
dtype: category
Categories (12, string[pyarrow]): [Jan < Feb < Mar < Apr ... Sep < Oct < Nov < Dec]

## Dates and times

In [19]:
import datetime as dt
dt_list = [dt.datetime(2020, 1, 1, 4, 30), dt.datetime(2020, 1, 2), dt.datetime(2020, 1, 3)]
string_dates = ['2020-01-01 04:30:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00']
string_dates_missing = ['2020-01-01 4:30', None, '2020-01-03']
epoch_dates = [1577836800, 1577923200, 1578009600]

In [20]:
pd.Series(dt_list)

0   2020-01-01 04:30:00
1   2020-01-02 00:00:00
2   2020-01-03 00:00:00
dtype: datetime64[ns]

In [21]:
pd.Series(string_dates, dtype='datetime64[ns]')

0   2020-01-01 04:30:00
1   2020-01-02 00:00:00
2   2020-01-03 00:00:00
dtype: datetime64[ns]

In [22]:
pd.Series(string_dates_missing, dtype='datetime64[ns]')

0   2020-01-01 04:30:00
1                   NaT
2   2020-01-03 00:00:00
dtype: datetime64[ns]

Be careful with epoch times; make sure you know the units. Here, the times are in seconds:

In [23]:
pd.Series(epoch_dates, dtype='datetime64[s]') # contrast with dtype='datetime64[ns]'

0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[s]
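
To see why the unit matters, this sketch parses the same integers with `pd.to_datetime` under two different `unit` values. Using `pd.to_datetime` here is a substitute for the dtype-based construction above, chosen because it makes the unit explicit.

```python
import pandas as pd

epoch_dates = [1577836800, 1577923200, 1578009600]

# Interpreted as seconds since 1970-01-01, these are dates in January 2020.
as_seconds = pd.to_datetime(epoch_dates, unit="s")

# Interpreted as nanoseconds, the same numbers land within two seconds of the epoch.
as_nanoseconds = pd.to_datetime(epoch_dates, unit="ns")

print(as_seconds[0], as_nanoseconds[0])
```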

In [24]:
pd.Series(dt_list, dtype='timestamp[ns][pyarrow]')

0    2020-01-01 04:30:00
1    2020-01-02 00:00:00
2    2020-01-03 00:00:00
dtype: timestamp[ns][pyarrow]

In [25]:
pd.Series(string_dates, dtype='timestamp[ns][pyarrow]')

0    2020-01-01 04:30:00
1    2020-01-02 00:00:00
2    2020-01-03 00:00:00
dtype: timestamp[ns][pyarrow]

PyArrow timestamp conversions require all timestamps to share a common format, so the following conversion of the mixed-format `string_dates_missing` list fails:

`pd.Series(string_dates_missing, dtype='timestamp[ns][pyarrow]')`

In [26]:
string_dates_missing_2 = ['2020-01-01 4:30', None, '2020-01-03 0:00']
pd.Series(string_dates_missing_2, dtype="timestamp[ns][pyarrow]")

0    2020-01-01 04:30:00
1                   <NA>
2    2020-01-03 00:00:00
dtype: timestamp[ns][pyarrow]

In [27]:
pd.Series(epoch_dates, dtype='timestamp[s][pyarrow]')

0    2020-01-01 00:00:00
1    2020-01-02 00:00:00
2    2020-01-03 00:00:00
dtype: timestamp[s][pyarrow]

## Exercises

1. To represent the number of people in the US, I would use the `uint32[pyarrow]` type, which has a maximum value of about 4.29 billion. For the number of people worldwide, which exceeds that limit, I would use the `uint64[pyarrow]` type.

2. To describe a product, I would likely use the `category` type. For the name, I would use a PyArrow string. For the price, I might use a floating-point value like `float32[pyarrow]`, but PyArrow provides exact decimals with `decimal128(...)[pyarrow]`, so I might consider that for its precision.

3. For the date and time of a stock trade, I would use the `timestamp[ns][pyarrow]` type. For a date of birth of a person, I would likely use the `date32[pyarrow]` type.
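
The reasoning in exercise 1 can be checked with a quick sketch. It uses NumPy's `iinfo` only to read the limits of unsigned 32-bit and 64-bit integers, which are the same for the PyArrow types. The population figures are rough, illustrative assumptions and not taken from the source.

```python
import numpy as np

uint32_max = np.iinfo(np.uint32).max
uint64_max = np.iinfo(np.uint64).max

# Rough, illustrative population figures (assumptions for this sketch).
us_population = 335_000_000
world_population = 8_000_000_000

print(uint32_max)
print(us_population <= uint32_max)     # does the US figure fit in uint32?
print(world_population <= uint32_max)  # does the world figure fit in uint32?
print(world_population <= uint64_max)  # does the world figure fit in uint64?
```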