# Lecture 5: Pandas Tips & Tricks

* Missing data in Pandas
* String accessor
* Date parsing & date accessor
* Resetting and setting an explicit index

## Imports

In [1]:
import pandas as pd, numpy as np

## Data

We use the slightly expanded student data from *Lecture 4*:

In [2]:
df = pd.read_csv('students.csv')
df

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
0,Bob,BIM,2008,01/09/2008,10.0
1,Jake,MiM,2012,01/10/2012,4.0
2,Lisa,IM,2004,15/09/2004,7.0
3,Sue,BIM,missing,,2.0
4,William,SCM,2008,01/01/2009,
5,James,BIM,2012,01/10/2012,
6,Harper,BIM,2004,30/08/2004,
7,Mason,IM,2009,10/09/2009,3.0
8,Evelyn,IM,missing,,4.0
9,Ella,SCM,2012,02/10/2012,5.0


We already saw how to use `pandas.to_numeric()` to convert `enrolment` into a numeric variable:

In [3]:
df['enrolment'] = pd.to_numeric(df['enrolment'], errors='coerce')
df

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
0,Bob,BIM,2008.0,01/09/2008,10.0
1,Jake,MiM,2012.0,01/10/2012,4.0
2,Lisa,IM,2004.0,15/09/2004,7.0
3,Sue,BIM,,,2.0
4,William,SCM,2008.0,01/01/2009,
5,James,BIM,2012.0,01/10/2012,
6,Harper,BIM,2004.0,30/08/2004,
7,Mason,IM,2009.0,10/09/2009,3.0
8,Evelyn,IM,,,4.0
9,Ella,SCM,2012.0,02/10/2012,5.0


Let's take a look at the `DataFrame` and its data types:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   student        12 non-null     object 
 1   programme      12 non-null     object 
 2   enrolment      10 non-null     float64
 3   first_class    10 non-null     object 
 4   classes_taken  8 non-null      float64
dtypes: float64(2), object(3)
memory usage: 608.0+ bytes


## Missing Data

We can see in the output of `info()` that some columns have missing data. For example, only 10 observations have a year of `enrolment`, and only 8 have information on the number of classes taken.

Missing data is a semi-tricky thing in Pandas, as you've seen on the "Data Types" slide in *Lecture 2*. Generally speaking, there are two types of missing data:
* `None` for `object`
* `numpy.nan` for `float64`

NumPy's integer types cannot represent a missing value, so whenever a numerical variable contains missing values, Pandas stores it as `float64`, which can hold `numpy.nan` and is much faster than `object` for numerical data. That is why `enrolment` became `float64` when we converted it to a number.

If you want to set a numerical variable to missing, you can do that with NumPy's `nan`:

```
import numpy as np
...
DF['col1'] = np.nan
```
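
The upcasting described above can be seen on a small, made-up series (the values are illustrative only). The sketch also shows Pandas' nullable `Int64` extension type as one possible alternative that keeps integers alongside missing values; the rest of this lecture continues with `float64`.

```python
import numpy as np
import pandas as pd

# Integers without missing values keep an integer dtype.
complete = pd.Series([2008, 2012, 2004])

# A single missing value forces an upcast to float64,
# because NumPy integers have no representation for NaN.
with_gap = pd.Series([2008, np.nan, 2004])

# Optional alternative: the nullable dtype "Int64" (capital I)
# stores integers and marks missing entries as <NA>.
nullable = with_gap.astype('Int64')

print(complete.dtype)
print(with_gap.dtype)   # float64
print(nullable.dtype)   # Int64
```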

### Selecting Missing Data

Let's have a look at the observations for which `classes_taken` is missing. A first approach might be to perform masking, for example by checking if any value in the column is equal to `np.nan`:

In [5]:
df[df['classes_taken'] == np.nan]

Unnamed: 0,student,programme,enrolment,first_class,classes_taken


However, this doesn't work, because missing values cannot be found with an equality test. In plain Python, `None == None` is `True`, but `np.nan == np.nan` is always `False`, so the mask above is `False` for every row and the result is empty.

Instead, the way to select missing data is with [`isna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) or its alias [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html), which return a *boolean mask* in which missing data is `True`:

In [6]:
df[df['classes_taken'].isna()]

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
4,William,SCM,2008.0,01/01/2009,
5,James,BIM,2012.0,01/10/2012,
6,Harper,BIM,2004.0,30/08/2004,
10,Jackson,MiM,2004.0,01/09/2004,
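
To see why the equality test fails while `isna()` works, here is a minimal standalone sketch. The series `classes` is a made-up stand-in for the `classes_taken` column.

```python
import numpy as np
import pandas as pd

classes = pd.Series([10.0, np.nan, 7.0, np.nan])  # illustrative values

# NaN is defined to be unequal to everything, including itself.
print(np.nan == np.nan)   # False

# So an equality mask never matches a missing value ...
eq_mask = classes == np.nan
print(eq_mask.sum())      # 0

# ... whereas isna() flags exactly the missing entries.
na_mask = classes.isna()
print(na_mask.sum())      # 2
```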


We can also get the opposite by negating the mask with the bitwise NOT operator `~`, which gives us all observations in `df` for which `classes_taken` is *not* missing:

In [7]:
df[~df['classes_taken'].isna()]

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
0,Bob,BIM,2008.0,01/09/2008,10.0
1,Jake,MiM,2012.0,01/10/2012,4.0
2,Lisa,IM,2004.0,15/09/2004,7.0
3,Sue,BIM,,,2.0
7,Mason,IM,2009.0,10/09/2009,3.0
8,Evelyn,IM,,,4.0
9,Ella,SCM,2012.0,02/10/2012,5.0
11,Avery,MiM,2005.0,05/02/2006,10.0
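
As a side note, Pandas also offers `notna()`, which returns the negated mask directly. The sketch below, on a small made-up `DataFrame` called `demo`, suggests that both spellings select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row excerpt, for illustration only.
demo = pd.DataFrame({
    'student': ['Bob', 'William', 'Lisa'],
    'classes_taken': [10.0, np.nan, 7.0],
})

negated = demo[~demo['classes_taken'].isna()]   # mask negated with ~
direct = demo[demo['classes_taken'].notna()]    # mask built by notna()

print(direct)
print(negated.equals(direct))  # True
```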


### Filling Missing Data

Sometimes, we want to replace missing data with a different value instead. For example, we may want to treat missing values as 0. For this we can use the aptly named [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) function.

For example, missing values in `classes_taken` may suggest that the student hasn't taken any so far. In that case, we can replace the missing values with 0:

In [8]:
df['classes_taken'].fillna(0)

0     10.0
1      4.0
2      7.0
3      2.0
4      0.0
5      0.0
6      0.0
7      3.0
8      4.0
9      5.0
10     0.0
11    10.0
Name: classes_taken, dtype: float64

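Like most `DataFrame` methods, `fillna()` returns a copy of the data. If we want to modify our original `DataFrame`, we can either replace the column or use `inplace=True`. Replacing the column is the more robust choice: in recent Pandas versions, calling `fillna(..., inplace=True)` on a selected column triggers a chained-assignment warning and may leave `df` unchanged.

In [9]:
df['classes_taken'] = df['classes_taken'].fillna(0)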
df

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
0,Bob,BIM,2008.0,01/09/2008,10.0
1,Jake,MiM,2012.0,01/10/2012,4.0
2,Lisa,IM,2004.0,15/09/2004,7.0
3,Sue,BIM,,,2.0
4,William,SCM,2008.0,01/01/2009,0.0
5,James,BIM,2012.0,01/10/2012,0.0
6,Harper,BIM,2004.0,30/08/2004,0.0
7,Mason,IM,2009.0,10/09/2009,3.0
8,Evelyn,IM,,,4.0
9,Ella,SCM,2012.0,02/10/2012,5.0
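
When only some columns should be filled, one option is to pass a dictionary that maps column names to fill values. The following is a small sketch on made-up data (`demo` is not the lecture's `df`); it fills `classes_taken` and leaves `enrolment` untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with gaps in two columns.
demo = pd.DataFrame({
    'student': ['Bob', 'William', 'Sue'],
    'enrolment': [2008.0, 2008.0, np.nan],
    'classes_taken': [10.0, np.nan, 2.0],
})

# Keys are column names, values are the replacements for missing entries.
# Columns that are not listed keep their missing values.
filled = demo.fillna({'classes_taken': 0})

print(filled)
```

Because `fillna()` returns a new object, `demo` itself still contains the original gaps.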


### Converting Numeric Types

`classes_taken` is still a floating point data type, even though it really should be an integer (students cannot take 4.13 classes). We can convert between numerical data types easily with [`astype()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html):

In [10]:
df['classes_taken'].astype('int')

0     10
1      4
2      7
3      2
4      0
5      0
6      0
7      3
8      4
9      5
10     0
11    10
Name: classes_taken, dtype: int32

`'int'` maps to NumPy's default integer, whose width depends on the platform and NumPy version (`int32` in the output above, `int64` on many other systems). We can also request a specific integer type (e.g., to save memory with large data sets). In this example, I'm choosing a 16-bit integer, which can hold values $\in [-32768, 32767]$:

In [11]:
df['classes_taken'].astype('int16')

0     10
1      4
2      7
3      2
4      0
5      0
6      0
7      3
8      4
9      5
10     0
11    10
Name: classes_taken, dtype: int16
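
Two details behind this conversion can be checked with a short sketch: the range of `int16`, and the reason we filled the missing values before converting. The series `classes` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# np.iinfo reports the representable range of an integer dtype.
limits = np.iinfo('int16')
print(limits.min, limits.max)   # -32768 32767

classes = pd.Series([10.0, np.nan, 7.0])  # illustrative values

# Casting to an integer dtype fails while NaN is still present ...
try:
    classes.astype('int16')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)              # True

# ... and succeeds once the gaps have been filled.
as_int = classes.fillna(0).astype('int16')
print(as_int.dtype)             # int16
```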

`astype()` does not have an `inplace` option, so we need to replace the column if we want the change to take effect in the original `DataFrame`:

In [12]:
df['classes_taken'] = df['classes_taken'].astype('int16')
df

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
0,Bob,BIM,2008.0,01/09/2008,10
1,Jake,MiM,2012.0,01/10/2012,4
2,Lisa,IM,2004.0,15/09/2004,7
3,Sue,BIM,,,2
4,William,SCM,2008.0,01/01/2009,0
5,James,BIM,2012.0,01/10/2012,0
6,Harper,BIM,2004.0,30/08/2004,0
7,Mason,IM,2009.0,10/09/2009,3
8,Evelyn,IM,,,4
9,Ella,SCM,2012.0,02/10/2012,5


### Dropping Missing Data

We can drop missing data with masking or with [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html):

In [13]:
df.dropna(subset=['enrolment', 'first_class'], how='all')

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
0,Bob,BIM,2008.0,01/09/2008,10
1,Jake,MiM,2012.0,01/10/2012,4
2,Lisa,IM,2004.0,15/09/2004,7
4,William,SCM,2008.0,01/01/2009,0
5,James,BIM,2012.0,01/10/2012,0
6,Harper,BIM,2004.0,30/08/2004,0
7,Mason,IM,2009.0,10/09/2009,3
9,Ella,SCM,2012.0,02/10/2012,5
10,Jackson,MiM,2004.0,01/09/2004,0
11,Avery,MiM,2005.0,05/02/2006,10


Here, I drop every observation for which both `enrolment` and `first_class` are missing. With `how='any'`, an observation would already be dropped if only one of the two columns were missing.
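
The difference between `how='all'` and `how='any'` is easiest to see side by side. This is a minimal sketch on made-up rows; the student `Zoe` is hypothetical and added only so that one row has exactly one missing value.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'student': ['Bob', 'Sue', 'Zoe'],
    'enrolment': [2008.0, np.nan, 2010.0],
    'first_class': ['01/09/2008', None, None],
})
cols = ['enrolment', 'first_class']

# how='all': drop a row only if every column in `subset` is missing.
drop_all = demo.dropna(subset=cols, how='all')

# how='any': drop a row as soon as one column in `subset` is missing.
drop_any = demo.dropna(subset=cols, how='any')

print(drop_all['student'].tolist())  # ['Bob', 'Zoe']
print(drop_any['student'].tolist())  # ['Bob']
```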

## [String Accessors](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str)

If we have text data in a `DataFrame`, we often want to use functions similar to the ones that are available to us for standard Python strings. For example, `startswith()` checks if a string starts with a certain sequence of characters:

In [14]:
'Avery'.startswith('A')

True

To do that in a `DataFrame`, we use the *string accessor*. It replicates almost all functions that we are used to from Python strings.

For example, let's say we want to find all students whose name starts with an `E`. For that, we can use `str.startswith()`:

In [15]:
df[df['student'].str.startswith('E')]

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
8,Evelyn,IM,,,4
9,Ella,SCM,2012.0,02/10/2012,5


`str` is the string accessor, and it exposes many useful string functions. See the above link to the API documentation.
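
A few more accessor methods, sketched on a made-up series of names. The `na=False` argument is worth knowing: without it, a missing name produces a missing value in the mask rather than `False`.

```python
import pandas as pd

names = pd.Series(['Evelyn', 'Ella', None, 'bob'])  # illustrative values

# Element-wise versions of the familiar Python string methods.
starts_e = names.str.startswith('E', na=False)
upper = names.str.upper()
lengths = names.str.len()

# Case-insensitive substring search; missing names count as "no match".
has_l = names.str.contains('l', case=False, na=False)

print(starts_e.tolist())  # [True, True, False, False]
print(has_l.tolist())     # [True, True, False, False]
```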

## Date Parsing & Date Accessor

### Date Parsing

By default, `read_csv()` tries to convert data to the correct data types. However, this does not work for dates and datetimes. We can see that `first_class` is an `object` rather than a `datetime` column:

In [16]:
df.dtypes

student           object
programme         object
enrolment        float64
first_class       object
classes_taken      int16
dtype: object

In effect, this means they're strings:

In [17]:
df.loc[0, 'first_class']

'01/09/2008'

To parse them into Pandas' `datetime` data type, we use [`to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html):

In [18]:
pd.to_datetime(df['first_class'], errors='coerce')

0    2008-01-09
1    2012-01-10
2    2004-09-15
3           NaT
4    2009-01-01
5    2012-01-10
6    2004-08-30
7    2009-10-09
8           NaT
9    2012-02-10
10   2004-01-09
11   2006-05-02
Name: first_class, dtype: datetime64[ns]

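I use `errors='coerce'` so that, if some conversions fail, the program continues and the failed values are replaced with missing values. Note that the source strings are written day-first (e.g., `15/09/2004`), yet the output above reads ambiguous values such as `01/09/2008` month-first (9 January instead of 1 September). Passing `dayfirst=True` or an explicit `format='%d/%m/%Y'` to `to_datetime()` makes the intended order unambiguous.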

In [19]:
df['first_class'] = pd.to_datetime(df['first_class'], errors='coerce')

The data type of the converted column is now `datetime64[ns]`:

In [20]:
df.dtypes

student                  object
programme                object
enrolment               float64
first_class      datetime64[ns]
classes_taken             int16
dtype: object

`NaT` ("not a time") is the missing value for `datetime` columns. It plays the same role that `np.nan` plays for floats and is also detected by `isna()`.
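
Because the date strings in this data set are written day-first, it can help to state the format explicitly. The sketch below uses a few made-up strings in the same `dd/mm/yyyy` layout and assumes that layout holds for the whole column; anything that does not match becomes `NaT`.

```python
import pandas as pd

raw = pd.Series(['01/09/2008', '15/09/2004', None, 'not a date'])

# An explicit format removes the day/month ambiguity;
# errors='coerce' turns unparseable entries into NaT.
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')

print(parsed)
print(parsed.isna().tolist())  # [False, False, True, True]
```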

### [Date Accessors](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-dt)

One advantage of having dates in the correct data type is that we can use `datetime` accessors now. These little suffixes allow us to use special functionality only available to `datetime` columns.

For example, let's say we want to extract the year in which each student took their first class:

In [21]:
df['first_class'].dt.year

0     2008.0
1     2012.0
2     2004.0
3        NaN
4     2009.0
5     2012.0
6     2004.0
7     2009.0
8        NaN
9     2012.0
10    2004.0
11    2006.0
Name: first_class, dtype: float64

We could now assign the extracted year to a new column. The result is `float64` because the missing dates (`NaT`) turn into `NaN`. While a raw date usually cannot be fed into ML models directly, extracted properties such as the year can.
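
The `dt` accessor exposes more than the year. Here is a brief sketch on made-up dates (not the lecture's `df`), extracting a few properties that could serve as features.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2008-09-01', None, '2004-08-30']))

years = dates.dt.year          # NaN where the date is missing
months = dates.dt.month
weekdays = dates.dt.day_name() # English weekday names by default

features = pd.DataFrame({
    'first_class': dates,
    'year': years,
    'month': months,
    'weekday': weekdays,
})
print(features)
```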

## More on Indices

### Reset Index

If we select data from a `DataFrame`, its index is usually preserved. For example, for this subset of BIM students:

In [22]:
df[df['programme'] == 'BIM']

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
0,Bob,BIM,2008.0,2008-01-09,10
3,Sue,BIM,,NaT,2
5,James,BIM,2012.0,2012-01-10,0
6,Harper,BIM,2004.0,2004-08-30,0


This is a good thing, because it means we never lose track of where our data came from. However, sometimes we want to reset an index. We can do that with [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html):

In [23]:
df[df['programme'] == 'BIM'].reset_index(drop=True)

Unnamed: 0,student,programme,enrolment,first_class,classes_taken
0,Bob,BIM,2008.0,2008-01-09,10
1,Sue,BIM,,NaT,2
2,James,BIM,2012.0,2012-01-10,0
3,Harper,BIM,2004.0,2004-08-30,0


I set `drop=True` because otherwise the old index would be inserted into the `DataFrame` as a new column named `index`.
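
The effect of `drop` can be compared directly. This is a small sketch on a made-up three-row frame called `demo`.

```python
import pandas as pd

demo = pd.DataFrame({
    'student': ['Bob', 'Jake', 'Sue'],
    'programme': ['BIM', 'MiM', 'BIM'],
})

bim = demo[demo['programme'] == 'BIM']   # keeps the original labels 0 and 2

kept = bim.reset_index()                 # old index becomes a column
dropped = bim.reset_index(drop=True)     # old index is discarded

print(kept.columns.tolist())     # ['index', 'student', 'programme']
print(dropped.index.tolist())    # [0, 1]
```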

### Setting a Column to the Explicit Index

We can also replace an existing explicit index with any column. The column doesn't even need to be unique, although uniqueness is advisable so that each label selects exactly one row. To do so, we use [`set_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html).

In our case, `student` might be a good explicit index. Let's check if it's unique. There are two ways.

1) We can compare the `DataFrame`'s shape with the number of unique values in `student`:

In [24]:
df.shape

(12, 5)

In [25]:
df['student'].nunique()

12

They're the same, so `student` must be unique. (`shape` is the easiest way to check a `DataFrame`'s row and column counts.)

2) Alternatively, we can use [`is_unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.is_unique.html):

In [26]:
df['student'].is_unique

True
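
If a candidate index column turns out not to be unique, `duplicated()` can help locate the offending rows. The sketch below uses a made-up series in which `Bob` appears twice on purpose.

```python
import pandas as pd

names = pd.Series(['Bob', 'Jake', 'Bob', 'Lisa'])  # deliberate duplicate

print(names.is_unique)   # False
print(names.nunique())   # 3

# keep=False marks every occurrence of a repeated value, not just the later ones.
dupes = names[names.duplicated(keep=False)]
print(dupes.index.tolist())  # [0, 2]
```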

`student` is indeed unique, so we can proceed and make it the explicit index:

In [27]:
df = df.set_index('student')
df

student,programme,enrolment,first_class,classes_taken
Bob,BIM,2008.0,2008-01-09,10
Jake,MiM,2012.0,2012-01-10,4
Lisa,IM,2004.0,2004-09-15,7
Sue,BIM,,NaT,2
William,SCM,2008.0,2009-01-01,0
James,BIM,2012.0,2012-01-10,0
Harper,BIM,2004.0,2004-08-30,0
Mason,IM,2009.0,2009-10-09,3
Evelyn,IM,,NaT,4
Ella,SCM,2012.0,2012-02-10,5


Now we can use a student's name to select their data:

In [28]:
df.loc['Bob']

programme                        BIM
enrolment                     2008.0
first_class      2008-01-09 00:00:00
classes_taken                     10
Name: Bob, dtype: object
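
Label-based selection with `loc` also accepts a column name or a list of labels. The following is a short sketch on a made-up frame (`demo`, not the lecture's `df`), including how to turn the index back into a regular column.

```python
import pandas as pd

demo = pd.DataFrame({
    'student': ['Bob', 'Jake', 'Lisa'],
    'programme': ['BIM', 'MiM', 'IM'],
    'classes_taken': [10, 4, 7],
}).set_index('student')

one_value = demo.loc['Bob', 'programme']   # single cell
several = demo.loc[['Bob', 'Lisa']]        # several rows by label

# reset_index() moves `student` back into an ordinary column.
restored = demo.reset_index()

print(one_value)                  # BIM
print(several.index.tolist())     # ['Bob', 'Lisa']
print(restored.columns.tolist())  # ['student', 'programme', 'classes_taken']
```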

© 2023 Philipp Cornelius