# Some of the weirdness in Pandas with solutions

Some of these points may catch you out; others are historic, so you'll find lots of talk about them in older posts when you Google.

What's _caught you out_? We can discuss problems you've found and they'll help the others on the course.

In [None]:
import pandas as pd

# Is there danger when using a `.` to access columns?

_Recommendation_: only use `[]` to access columns, not `.`

In [None]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df

In [None]:
df['a']

In [None]:
df.b * 10

In [None]:
(df['b'] * 10) == (df.b * 10)

In [None]:
df.c = df.b # what's happening here? is this sensible?

In [None]:
type(df.c) # is this a DataFrame column?

In [None]:
df.d = [7, 8, 9] # try this (expect a UserWarning) - is it a DataFrame column?

In [None]:
df.d

In [None]:
df._mgr # take a look at the Block Manager to see what's on the inside (private API; older Pandas exposed it as `_data`)

In [None]:
df.columns # this is what Pandas actually knows about
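
A minimal sketch of why `[]` is the safer habit. The column name `e` and the use of the `warnings` module are illustrative additions, not part of the cells above. Assigning a list to a new attribute name stores a plain Python attribute on the object, whereas `[]` creates a real column.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Attribute assignment to a new name: Pandas emits a UserWarning and stores
# a plain Python attribute rather than creating a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', UserWarning)
    df.d = [7, 8, 9]

# Bracket assignment creates a real column ('e' is a hypothetical name).
df['e'] = [7, 8, 9]

print(type(df.d))        # a plain list, not a Series
print(list(df.columns))  # 'e' is listed, 'd' is not
```

The silenced warning is the only hint Pandas gives that `df.d` did not do what it looks like, which is easy to miss in a long notebook.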

# Label-based slicing is inclusive in Pandas but not in Python

Slightly counter-intuitive. There seem to be arguments going both ways, especially for date handling, but it is an annoyance that you have to be aware of.

In [None]:
lst = list(range(1, 6))
print(f'{lst} is the whole list')

In [None]:
print(f'{lst[2:4]} gets 2 items exclusive of the end selector')

In [None]:
df = pd.DataFrame({'a': range(1, 6), 'b': range(1, 6)})
df = df.set_index('a')
df

In [None]:
# be cautious that the following feels a bit ambiguous
df[2:4] # this looks like what we'd expect _if_ we're thinking of integer locations

In [None]:
df.loc[2:4] # what does this give? # the same applies to datetime indexes being inclusive

In [None]:
df.iloc[2:4] # this at least is not ambiguous and follows the Python approach

In [None]:
#df.ix[2:4] # thankfully this has now gone - it tried to guess which of these jobs to do!
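
To make the difference concrete, here is a small sketch on the same frame as above. The variable names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 6), 'b': range(1, 6)}).set_index('a')

by_label = df.loc[2:4]      # labels 2, 3 and 4: the end label is included
by_position = df.iloc[2:4]  # positions 2 and 3: the end position is excluded

print(list(by_label.index))     # [2, 3, 4]
print(list(by_position.index))  # [3, 4]
```

The same slice `2:4` returns three rows with `.loc` and two rows with `.iloc`, and they do not even start at the same row because the index labels begin at 1.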

# Be cautious when reading datetimes as strings

In [None]:
# as a convenience we wrap the text in StringIO so it looks like a CSV file on disk
from io import StringIO
text = """2022-01-20, 1\n2022-01-21, 2\n2022-01-22, 3"""

str_as_file = StringIO(text)
df = pd.read_csv(str_as_file, names=['date', 'amount'])
df

In [None]:
df.info()

In [None]:
df = df.set_index('date')

In [None]:
df.info()

In [None]:
df.loc['2022-01-20'] # looks legit

In [None]:
# will the following work?
df.resample('W').mean() # expect a weekly mean aggregation
# this raises a TypeError as the index holds strings - we'll need ", parse_dates=['date']" in read_csv above

In [None]:
text2 = """2022-01--1, 1\n2022-01-02, 2\n2022-01-03, 3"""

str_as_file = StringIO(text2)
df = pd.read_csv(str_as_file, names=['date', 'amount'], parse_dates=['date'])
df.info() # remember that parse_dates is a soft request, so with bad data the column silently stays as object (strings)! Use Pandera...

In [None]:
#df['date'].dt.day # will this work?
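
One possible way to surface the bad rows rather than fail silently is to parse explicitly with `pd.to_datetime`. This is a sketch that assumes the dates are meant to be ISO `YYYY-MM-DD`. The names `parsed` and `bad_rows` are illustrative, and a schema tool such as Pandera would be the more thorough option.

```python
from io import StringIO
import pandas as pd

text2 = "2022-01--1, 1\n2022-01-02, 2\n2022-01-03, 3"
df = pd.read_csv(StringIO(text2), names=['date', 'amount'])

# errors='coerce' turns anything that does not match the format into NaT,
# so the unparseable rows can be inspected instead of going unnoticed.
parsed = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
bad_rows = df[parsed.isna()]

print(bad_rows)
```

Using `errors='raise'` (the default for `pd.to_datetime`) instead would stop the pipeline at the first bad value, which may be preferable in production code.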

In [None]:
text3 = """01/01/2022, 1\n01/02/2022, 2\n01/03/2022, 3"""
# what if this was written DD/MM/YYYY?

str_as_file = StringIO(text3)
df = pd.read_csv(str_as_file, names=['date', 'amount'], parse_dates=['date'])
df.info()

In [None]:
df['date'].dt.day # is this what we expect given the format above?

In [None]:
str_as_file = StringIO(text3)
df = pd.read_csv(str_as_file, names=['date', 'amount'], parse_dates=['date'], dayfirst=True)
df.info()

In [None]:
df['date'].dt.day
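
An alternative to `dayfirst=True` is to state the format outright, which removes the guessing altogether. This sketch assumes the file really is `DD/MM/YYYY` and does the conversion with `pd.to_datetime` after reading.

```python
from io import StringIO
import pandas as pd

text3 = "01/01/2022, 1\n01/02/2022, 2\n01/03/2022, 3"
df = pd.read_csv(StringIO(text3), names=['date', 'amount'])

# An explicit format leaves no room for a month-first interpretation.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

print(df['date'].dt.day.tolist())    # [1, 1, 1]
print(df['date'].dt.month.tolist())  # [1, 2, 3]
```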

# `NaN` and extension types

In [None]:
import numpy as np
ser = pd.Series([1, 2, 3, np.nan], name='a')
ser # what datatype is column 'a'? is it what you expect?

In [None]:
type(ser.to_numpy())

In [None]:
ser = pd.Series([1, 2, 3, pd.NA], name='b')
ser # what about now?

In [None]:
ser.dtype

In [None]:
ser = pd.Series([1, 2, 3, pd.NA], dtype='Int64')
ser # what's different here?

In [None]:
ser.dtype

In [None]:
ser.to_numpy() # can only work by promoting to an Object (not Pandas Extension) ndarray
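
If a numeric NumPy array is needed from an `Int64` column, one option is to tell `to_numpy` what to use for the missing value. This is a sketch; the names `as_object` and `as_float` are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3, pd.NA], dtype='Int64')

as_object = ser.to_numpy()  # object array, pd.NA is kept as-is
as_float = ser.to_numpy(dtype='float64', na_value=np.nan)  # NA becomes NaN

print(as_object.dtype)
print(as_float.dtype)
```

The cost is that the integers become floats again, which is the same trade-off the plain `float64` column makes.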

## Not a Number

Did you know that Ian's newsletter shares this name?

How do we compare to `NaN` in floating point arithmetic?

In [None]:
ser = pd.Series([1, 2, 3, np.nan], name='a')
ser

In [None]:
ser == 1

In [None]:
ser == None

In [None]:
ser == np.nan

In [None]:
# why are the previous lines False, but here we get a True?
ser.isna() # or .isnull() but .isna is shorter and is preferred by pandas-vet

In [None]:
# we can do the same thing if we have an Int64 (Extension type) column
ser = pd.Series([1, 2, 3, pd.NA], dtype='Int64')
ser.isna()
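
A short sketch of what sits behind the results above: `NaN` never compares equal to anything, including itself, so `==` cannot find it, whereas `.isna()` checks for missingness directly. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan_equals_itself = np.nan == np.nan  # False under IEEE 754 rules

float_ser = pd.Series([1, 2, 3, np.nan])
eq_mask = float_ser == np.nan  # element-wise ==, so every entry is False
na_mask = float_ser.isna()     # True only for the missing value

int_ser = pd.Series([1, 2, 3, pd.NA], dtype='Int64')
int_eq = int_ser == 1          # nullable boolean result: the NA propagates

print(nan_equals_itself, eq_mask.sum(), na_mask.sum())
print(int_eq)
```

Note the difference in the last case: comparing an `Int64` column gives `<NA>` for the missing entry rather than `False`.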

# TIPS

When reading with `df = pd.read_csv(str_as_file, names=['date', 'amount'], parse_dates=['date'])`, specify the column for `parse_dates` and double check that it has done the conversion you expect for non-US dates.


In [None]:
str_as_file = StringIO(text)
df = pd.read_csv(str_as_file, names=['date', 'amount'], parse_dates=['date'])
df = df.set_index('date')
df.resample('W').mean()

In [None]:
str_as_file = StringIO(text3)
df = pd.read_csv(str_as_file, names=['date', 'amount'], parse_dates=['date'], dayfirst=True)
df['date'].dt.day # expect 1,1,1 (not 1,2,3) with dayfirst=True for this date format

Whilst it reads a bit weirdly, sticking with `np.nan` and a `float64` numpy-style column is probably easier for the team than using the newer and less popular `Int64` nullable column (but your team may disagree and that'd be cool).

In [None]:
pd.Series([1, 2, 3, np.nan], name='a').isna()

For indexing be explicit: use `.iloc` for integer indexing (which follows the Python convention of being "[inclusive, exclusive)" in math bracket notation) or `.loc` for label-based indexing, which is inclusive at both ends and follows the form "[inclusive, inclusive]".

# Minimally Sufficient Pandas

* https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428 worth a read (mostly - avoid some methods, make the team agree to use a common subset so everyone has a better understanding of the defaults)
* https://www.dunderdata.com/blog/minimally-sufficient-pandas-cheat-sheet a cheat sheet for the above
* consider - is `join` a strict subset of `pd.merge` or does it do anything else? 3 ways for doing the same thing...
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html for `join` docs, click "source" and observe how it _actually_ works
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html (on DataFrame)
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html (top level - my choice)
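
As one data point for the `join` versus `merge` question, this sketch shows a default `join` reproduced with the top-level `pd.merge`. The frames `left` and `right` are made-up examples, and it only covers the simple index-on-index case, so it does not settle whether `join` is a strict subset.

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'y': [10, 20]}, index=['a', 'b'])

# DataFrame.join defaults to a left join on the index.
joined = left.join(right)

# The same operation spelled out with the top-level merge function.
merged = pd.merge(left, right, left_index=True, right_index=True, how='left')

print(joined.equals(merged))
```

Spelling out `how`, `left_index` and `right_index` is more typing, but it makes the defaults that `join` relies on visible to the whole team.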