# Object, Categorical, and String Data Types

In this chapter, our focus turns to the data types that primarily contain strings. pandas currently has three such data types: object, string, and categorical. The string data type is new to pandas and still undergoing changes. The categorical data type has yet to be discussed, but it is a fantastic way to save memory and significantly increase performance for operations on columns containing strings.

## Object data types

Until the release of pandas 1.0, pandas did not have a string-only data type. Instead, it used the 'object' data type to hold strings. As mentioned previously, the object data type has no limitation as to what Python object can be within it. It is essentially a catch-all for any item that you desire to be in a DataFrame that doesn't belong to the other specific data types.

Unlike the integer and float data types, the object data type does not come in different sizes. There is no `'object64'`, only a single 'object' data type. Each item can be of a different type and therefore of a different size.

Though the object data type may contain any Python object, it is primarily used to hold strings. Let's begin by constructing a Series with a couple of strings as values.

In [1]:
import pandas as pd
import numpy as np
s_object = pd.Series(['some', 'strings'])
s_object

0       some
1    strings
dtype: object

As you can see from the bottom of the Series output, the data type is 'object'. When we verify this, you'll see the output `dtype('O')`. The object data type also comes directly from numpy, which uses 'O' to represent it instead of the full name.

In [2]:
s_object.dtype

dtype('O')

Because object is the most flexible type, any Series may be converted to it. Here we convert a Series of integers to object.

In [3]:
s = pd.Series([5, 10])
s.astype('object')

0     5
1    10
dtype: object

The underlying values are still integers. We verify this by finding the type of the first value of the converted Series, which is now a plain Python `int` rather than a numpy integer.

In [4]:
type(s.astype('object').loc[0])

int

Converting a Series of integers to object is rarely something you want to do, as numpy integer arrays are optimized for fast computation. By converting to an object array, you lose this benefit. In the following example, a numpy array is created with 100,000 random integers from 0 up to (but not including) 100. A second array is created by converting the data type to object. We then time how long it takes to sum each array. On my machine, the integer array is about 20 times as fast as the object array even though they both hold the exact same data. Note that a single `%time` run on an operation this short is noisy, so the wall times shown below may not reflect that gap; repeated measurements with `%timeit` are more reliable.

In [5]:
a_fast = np.random.randint(low=0, high=100, size=100_000)
a_slow = a_fast.astype('object')

In [6]:
%time a_fast.sum()

CPU times: total: 0 ns
Wall time: 1.03 ms


np.int64(4958574)

In [7]:
%time a_slow.sum()

CPU times: total: 0 ns
Wall time: 998 μs


4958574
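Because a single timed run of such a short operation can be dominated by timer resolution, a rough sketch using the standard library's `timeit` module may give a steadier comparison outside the notebook. The seeded generator, the repeat count of 100, and the variable names `t_fast` and `t_slow` are illustrative assumptions, and the measured ratio will vary by machine.

```python
import timeit

import numpy as np

# Seeded generator so the example is reproducible (an illustrative choice)
rng = np.random.default_rng(0)
a_fast = rng.integers(low=0, high=100, size=100_000)
a_slow = a_fast.astype('object')

# Call each sum 100 times to smooth out timer noise
t_fast = timeit.timeit(a_fast.sum, number=100)
t_slow = timeit.timeit(a_slow.sum, number=100)

print(f'int64 sum:  {t_fast / 100 * 1e6:.1f} microseconds per call')
print(f'object sum: {t_slow / 100 * 1e6:.1f} microseconds per call')
```

Both arrays produce the same total; only the time taken to compute it differs.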

### Many different types in a Series

There is no restriction on what can be placed in a Series with the object data type. The following Series contains a list, a boolean, a string, a float, and a dictionary.

In [8]:
pd.Series([[1,2], True, 'some string', 4.5, {'key': 'value'}])

0              [1, 2]
1                True
2         some string
3                 4.5
4    {'key': 'value'}
dtype: object

### Object Series usually contain strings

As we mentioned in previous chapters, when you encounter a Series or column of a DataFrame that has object as its data type, it usually contains nothing but strings.

### Poor practice to store complex data types within Series

Even though you are allowed to place any Python object within a Series, it's generally considered poor practice to do so. Series with object data types are designed to be filled with strings as the `str` accessor is available for this data type. This shouldn't be seen as an absolute statement since it might be necessary to use the flexibility of these object columns for special situations.
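When you are unsure whether an object Series really holds only strings, one possible check is to map each value to its Python type. This is a minimal sketch; the names `s_mixed`, `type_counts`, and `all_strings` are made up for illustration.

```python
import pandas as pd

s_mixed = pd.Series([[1, 2], True, 'some string', 4.5, {'key': 'value'}])

# Map each element to the name of its Python type and count the occurrences
type_counts = s_mixed.map(lambda x: type(x).__name__).value_counts()
print(type_counts)

# True only if every element is a str
all_strings = s_mixed.map(lambda x: isinstance(x, str)).all()
print(all_strings)
```

A result of `False` for `all_strings` signals that the `str` accessor will not behave uniformly across the values.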

## Categorical data type

We now introduce the categorical data type, which is unique to pandas and does not exist within numpy. The categorical data type is often used whenever a column of data has known, limited, and discrete values. This is often the case for string columns and is easiest to understand with an example. Let's read in the City of Houston employee dataset.

In [9]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

| | dept | title | hire_date | salary | sex | race |
|---|---|---|---|---|---|---|
| 0 | Police | POLICE SERGEANT | 2001-12-03 | 87545.38 | Male | White |
| 1 | Other | ASSISTANT CITY ATTORNEY II | 2010-11-15 | 82182.0 | Male | Hispanic |
| 2 | Houston Public Works | SENIOR SLUDGE PROCESSOR | 2006-01-09 | 49275.0 | Male | Black |


In [23]:
emp.dtypes

dept          object
title         object
hire_date     object
salary       float64
sex           object
race          object
dtype: object

### Changing data types to categorical

Any column may be converted to the categorical data type, but most often it is a string (and occasionally an integer) column that is chosen. In this dataset, each string column is a good candidate to be categorical. Let's select the `dept` column and count the occurrences of each unique value.

In [10]:
dept = emp['dept']
dept.value_counts()

dept
Police                     7573
Fire                       4376
Houston Public Works       4190
Other                      3373
Health & Human Services    1353
Houston Airport System     1216
Parks & Recreation         1152
Library                     563
Solid Waste Management      512
Name: count, dtype: int64

### Values are known, limited, and discrete

In this section, we discuss the properties that make a particular column a good candidate to convert to categorical. The total possible number of values in the column should be **known**. There shouldn't be any mysterious values that will appear when future data of the same kind is collected. With our department column, it's likely that we will be aware of each individual department and that new departments will not be created often.

The unique values in the column should be **limited** and far fewer than the total number of values. There are only 9 unique department values, which is substantially fewer than the total number of values (24,000+).

The values should be **discrete**, meaning that there are no partial values. Each value in the column must be one of the known categories. With our data, each employee works in exactly one department. There is no partial department. You either work in one department or another, but not both. You must be one of those 9 categories. 

If the values are limited and discrete then they will repeat, often with high counts. Due to all these properties, the `dept` column can be and should be converted to the categorical data type. 
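One rough way to judge the "limited" property is to compare the number of unique values with the total number of values. The sketch below uses a small made-up Series, `dept_toy`, as a stand-in for a real column; what ratio counts as "small enough" is a judgment call rather than a pandas rule.

```python
import pandas as pd

# Hypothetical stand-in for a department column
dept_toy = pd.Series(['Police', 'Fire', 'Police', 'Library', 'Fire', 'Police'] * 1000)

n_unique = dept_toy.nunique()   # number of distinct values
n_total = len(dept_toy)         # total number of values

print(n_unique, n_total, n_unique / n_total)
```

A ratio close to zero means the values repeat heavily, which is where the categorical data type tends to help most.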

### Converting to categorical with the `astype` method

The simplest way to convert a Series to categorical is to pass the string `'category'` to the `astype` method. We assign this new Series to the `dept_cat` variable name.

In [11]:
dept_cat = dept.astype('category')
dept_cat.head()

0                  Police
1                   Other
2    Houston Public Works
3                  Police
4                  Police
Name: dept, dtype: category
Categories (9, object): ['Fire', 'Health & Human Services', 'Houston Airport System', 'Houston Public Works', ..., 'Other', 'Parks & Recreation', 'Police', 'Solid Waste Management']

### Visual display of a categorical Series

The output looks a bit different than a normal Series. Let's verify that we do indeed have a Series.

In [24]:
type(dept_cat)

pandas.core.series.Series

The index and the values of our new categorical Series appear identical to the output of a Series that has the object data type. The difference is the appearance of the unique categories below the Series.

### The formal data type

Let's verify that the data type is categorical by accessing the `dtype` attribute.

In [25]:
dept_cat.dtype

CategoricalDtype(categories=['Fire', 'Health & Human Services', 'Houston Airport System',
                  'Houston Public Works', 'Library', 'Other',
                  'Parks & Recreation', 'Police', 'Solid Waste Management'],
, ordered=False, categories_dtype=object)

### What is CategoricalDtype?

The formal pandas object for categorical data is `CategoricalDtype`. This can be confusing since pandas also uses the string 'category' in the Series output. But this is no different than what we saw with the other pandas-only data types. For instance, nullable integers can use either the string 'Int64' or the pandas object `pd.Int64Dtype()`.
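As a small sketch of that equivalence, the string `'category'` and a default `pd.CategoricalDtype()` object should produce the same data type when passed to `astype`. The Series `s` and the names `by_string` and `by_object` are illustrative.

```python
import pandas as pd

s = pd.Series(['Police', 'Fire', 'Police'])

by_string = s.astype('category')              # using the string alias
by_object = s.astype(pd.CategoricalDtype())   # using the dtype object

# The two resulting data types compare equal
print(by_string.dtype == by_object.dtype)

# The dtype object also compares equal to the string alias
print(by_string.dtype == 'category')
print(isinstance(by_string.dtype, pd.CategoricalDtype))
```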

## Why the categorical data type is useful

On the surface, our categorical Series looks very similar to its object Series counterpart, but there are some significant differences.

### Internal storage of categorical data

Categorical data is stored much more efficiently than object data. Each unique value in a column of categorical data is stored **once** regardless of how many times it repeats in the Series and each of the unique values has an integer code that refers to it. It is these integers that are stored in memory to represent the data. 

Object columns store each value in a unique location in memory. For instance, the string 'Police' appears over 7,000 times in the `dept` Series. Each one of these strings is stored in a unique location in memory. Using integers to represent categories can save a tremendous amount of memory.

### Example of categorical storage

Let's use Python lists to build a simplified model of how pandas stores categorical data internally. In this example, we'll have three unique departments. They are stored exactly once in the list `cats` below. The actual data is stored in the `vals` list, which contains only the integer codes 0, 1, and 2.

In [14]:
cats = ['Police', 'Fire', 'Library']
vals = [1, 1, 0, 2, 0, 1, 2, 2, 1, 2, 1]

The `cats` list acts as a mapping from integer location to string value. The integer 0 corresponds with 'Police', 1 with 'Fire', and 2 with 'Library'. We can convert each value in the `vals` list to its corresponding category using the list comprehension below. 

In [15]:
[cats[val] for val in vals]

['Fire',
 'Fire',
 'Police',
 'Library',
 'Police',
 'Fire',
 'Library',
 'Library',
 'Fire',
 'Library',
 'Fire']
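The same pair of lists can be handed to pandas directly. The sketch below uses `pd.Categorical.from_codes` to build a categorical from the codes and categories; the names `cat_values` and `s_cat` are illustrative.

```python
import pandas as pd

cats = ['Police', 'Fire', 'Library']
vals = [1, 1, 0, 2, 0, 1, 2, 2, 1, 2, 1]

# Build a categorical from integer codes plus the list of categories
cat_values = pd.Categorical.from_codes(vals, categories=cats)
s_cat = pd.Series(cat_values)

print(s_cat.cat.categories.tolist())
print(s_cat.cat.codes.tolist())
print(s_cat.tolist())
```

The codes and categories stored by pandas match the `vals` and `cats` lists, and the values match the result of the list comprehension.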

## The `cat` accessor

We previously covered the `str` and `dt` accessors which provide us special access to string-only and datetime-only attributes and methods. The `cat` accessor provides us with special attributes and methods for categorical Series. Let's take a look at some important attributes and methods it provides.

### Get the categories

The unique sequence of categories can be retrieved with the `categories` attribute.

In [26]:
dept_cat.cat.categories

Index(['Fire', 'Health & Human Services', 'Houston Airport System',
       'Houston Public Works', 'Library', 'Other', 'Parks & Recreation',
       'Police', 'Solid Waste Management'],
      dtype='object')

### Get the integer codes

The underlying integer codes may be retrieved with the `codes` attribute which returns a Series the same length as the original. Notice that it uses 8-bit integers to store the data.

In [27]:
dept_cat.cat.codes.head()

0    7
1    5
2    3
3    7
4    7
dtype: int8

### Verify codes correspond with categories

The first three values of the `dept_cat` Series are 'Police', 'Other', and 'Houston Public Works'. Let's verify that the first three codes correspond with the categories.

In [28]:
dept_cat.cat.categories[7]

'Police'

In [29]:
dept_cat.cat.categories[5]

'Other'

In [30]:
dept_cat.cat.categories[3]

'Houston Public Works'
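Rather than checking one code at a time, the whole Series can be rebuilt by indexing the categories with the array of codes. This is a sketch on a small made-up Series, and it assumes there are no missing values (a code of -1 would otherwise select the last category).

```python
import pandas as pd

s_cat = pd.Series(['Police', 'Other', 'Police', 'Fire'], dtype='category')

codes = s_cat.cat.codes
categories = s_cat.cat.categories

# Index the categories with the integer codes to rebuild the values
rebuilt = categories[codes.to_numpy()]
print(list(rebuilt) == list(s_cat))
```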

## Modifying categories

When you create a column of categorical data, the categories take some additional work to modify. In this section, you'll learn how to modify, add, and remove categories.

### All categories remain after subset selection

The number of categories remains the same after subset selection, even when some of the categorical values do not appear in the result. Let's select the first five values of the `dept_cat` Series, assign it to the variable `dept_cat_5` and output the result.

In [31]:
dept_cat_5 = dept_cat.head()
dept_cat_5

0                  Police
1                   Other
2    Houston Public Works
3                  Police
4                  Police
Name: dept, dtype: category
Categories (9, object): ['Fire', 'Health & Human Services', 'Houston Airport System', 'Houston Public Works', ..., 'Other', 'Parks & Recreation', 'Police', 'Solid Waste Management']

Notice that there are still 9 categories, even though there are only three unique values in the result. The categories will not change unless you explicitly run a command to do so. If you'd like to clean up your new column of data, you can run the `remove_unused_categories` method from the `cat` accessor. This does not work in-place, so you'll need to assign the result to a variable to keep the changes.

In [33]:
dept_cat_5.cat.remove_unused_categories()

0                  Police
1                   Other
2    Houston Public Works
3                  Police
4                  Police
Name: dept, dtype: category
Categories (3, object): ['Houston Public Works', 'Other', 'Police']
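Since the method returns a new Series, a typical pattern is to reassign the result. The sketch below shows this on a small made-up Series; `s_cat` and `subset` are illustrative names.

```python
import pandas as pd

s_cat = pd.Series(['Police', 'Other', 'Fire', 'Police'], dtype='category')

subset = s_cat.head(2)
n_before = len(subset.cat.categories)   # categories carried over from s_cat

# Reassign to keep the cleaned-up categories
subset = subset.cat.remove_unused_categories()
n_after = len(subset.cat.categories)

print(n_before, n_after)
print(subset.cat.categories.tolist())
```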

### Assigning a new category

Attempting to assign a new value that isn't one of the current categories will result in an error. Here, we attempt to assign the first value to 'Information Technology', which is not a current category.

In [34]:
dept_cat.loc[0] = 'Information Technology'

TypeError: Cannot setitem on a Categorical with a new category (Information Technology), set the categories first

The error instructs us to add a new category. This is done by passing a single category or list of categories to the `add_categories` method from the `cat` accessor. Let's complete the operation and output the number of categories.

In [37]:
dept_cat = dept_cat.cat.add_categories('Information Technology')
len(dept_cat.cat.categories)

10

Let's output the Series and then select the last category to verify it is the one we added.

In [39]:
dept_cat

0                         Police
1                          Other
2           Houston Public Works
3                         Police
4                         Police
                  ...           
24303                     Police
24304                      Other
24305       Houston Public Works
24306    Health & Human Services
24307                     Police
Name: dept, Length: 24308, dtype: category
Categories (10, object): ['Fire', 'Health & Human Services', 'Houston Airport System', 'Houston Public Works', ..., 'Parks & Recreation', 'Police', 'Solid Waste Management', 'Information Technology']

In [38]:
dept_cat.cat.categories[-1]

'Information Technology'

We can now successfully make the assignment and output the first few rows to verify it worked.

In [40]:
dept_cat.loc[0] = 'Information Technology'
dept_cat.head(3)

0    Information Technology
1                     Other
2      Houston Public Works
Name: dept, dtype: category
Categories (10, object): ['Fire', 'Health & Human Services', 'Houston Airport System', 'Houston Public Works', ..., 'Parks & Recreation', 'Police', 'Solid Waste Management', 'Information Technology']
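The full sequence can be sketched on a small made-up Series. The example also passes a list to `add_categories` to add more than one category at once; the names `s_cat` and `raised` are illustrative, and the exact exception type may differ between pandas versions, so both `TypeError` and `ValueError` are caught.

```python
import pandas as pd

s_cat = pd.Series(['Police', 'Fire', 'Police'], dtype='category')

raised = False
try:
    s_cat.loc[0] = 'Library'   # 'Library' is not yet a category
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__)

# Add two categories at once, then repeat the assignment
s_cat = s_cat.cat.add_categories(['Library', 'Other'])
s_cat.loc[0] = 'Library'

print(s_cat.tolist())
print(s_cat.cat.categories.tolist())
```

New categories are appended to the end of the existing ones, which is why the added category was the last one in the output above.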

### Removing categories that exist

In the rare event that you'd like to remove categories that exist as values in your column, do so with the `cat` accessor's `remove_categories` method. They will be replaced with missing values.

In [41]:
dept_cat.cat.remove_categories('Police').head()

0    Information Technology
1                     Other
2      Houston Public Works
3                       NaN
4                       NaN
Name: dept, dtype: category
Categories (9, object): ['Fire', 'Health & Human Services', 'Houston Airport System', 'Houston Public Works', ..., 'Library', 'Other', 'Parks & Recreation', 'Solid Waste Management']

### Missing values are not categories

There is no separate category for missing values. The categorical data type uses the numpy `NaN` as its missing value representation. All missing value operations will work as normal. Here we calculate the total number of missing values after removing the 'Police' category.

In [42]:
dept_cat.cat.remove_categories('Police').isna().sum()

np.int64(7572)
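Internally, missing values are represented by the integer code -1 rather than by a category of their own. The following sketch on a small made-up Series shows the codes after a category has been removed.

```python
import pandas as pd

s_cat = pd.Series(['Police', 'Fire', 'Police', 'Other'], dtype='category')
removed = s_cat.cat.remove_categories('Police')

# Former 'Police' values now have the code -1 and count as missing
print(removed.cat.codes.tolist())
print(removed.isna().sum())
print(removed.cat.categories.tolist())
```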

## Massive reduction in memory used

One of the biggest benefits of using categorical columns is the amount of memory saved. Instead of using a string for every value, an integer code is used. Integers take up significantly less space than strings. pandas also uses the smallest integer size that can hold the codes. For instance, if there are fewer than about 128 categories, an int8 is used. pandas uses signed rather than unsigned integers for code storage (the code -1 is reserved for missing values), so only half the normal capacity is available.
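To see the integer size change with the number of categories, the sketch below builds two made-up Series, one with 10 categories and one with 200, and inspects the data type of their codes. The names `few` and `many` are illustrative.

```python
import pandas as pd

few = pd.Series([f'cat_{i}' for i in range(10)], dtype='category')
many = pd.Series([f'cat_{i}' for i in range(200)], dtype='category')

# 10 categories fit in 8-bit codes; 200 categories need 16-bit codes
print(few.cat.codes.dtype)
print(many.cat.codes.dtype)
```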

### The `memory_usage` method

pandas provides the `memory_usage` method to return the number of bytes used by the Series. To get the exact memory for string columns, you need to set the parameter `deep` to `True`. Let's get the original amount of memory used.

In [48]:
dept

0                         Police
1                          Other
2           Houston Public Works
3                         Police
4                         Police
                  ...           
24303                     Police
24304                      Other
24305       Houston Public Works
24306    Health & Human Services
24307                     Police
Name: dept, Length: 24308, dtype: object

In [43]:
orig_mem = dept.memory_usage(deep=True)
orig_mem

1643107

Use the method again to get the memory used on our categorical Series.

In [49]:
dept_cat

0         Information Technology
1                          Other
2           Houston Public Works
3                         Police
4                         Police
                  ...           
24303                     Police
24304                      Other
24305       Houston Public Works
24306    Health & Human Services
24307                     Police
Name: dept, Length: 24308, dtype: category
Categories (10, object): ['Fire', 'Health & Human Services', 'Houston Airport System', 'Houston Public Works', ..., 'Parks & Recreation', 'Police', 'Solid Waste Management', 'Information Technology']

In [46]:
cat_mem = dept_cat.memory_usage(deep=True)
cat_mem

25459

Let's find the percentage reduction in memory used by converting to categorical.

In [47]:
1 - cat_mem / orig_mem

0.9845055738914142

An astounding reduction of about 98.5% in memory takes place. Using 8-bit integers instead of the entire string made a huge difference.

## Speeding up operations

Another nice benefit of using the categorical data type is the performance improvement for most operations. Let's cover a few examples that show this performance improvement.

### Comparison operators

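The comparison operators should complete much faster. In the example below, we test each value for equality with the string 'Police'. Using the categorical Series, around a 4x performance improvement is seen on my machine. The single-run `%time` outputs below are limited by the timer's resolution (a wall time of 0 ns means the operation finished faster than the timer could measure), so treat them as rough indications only.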

In [50]:
%time _ = dept == 'Police'

CPU times: total: 0 ns
Wall time: 1.03 ms


In [51]:
%time _ = dept_cat == 'Police'

CPU times: total: 0 ns
Wall time: 0 ns


This is equivalent to checking whether the `codes` integer Series equals 7, the code for the 'Police' category.

In [52]:
%time _ = dept_cat.cat.codes == 7

CPU times: total: 0 ns
Wall time: 0 ns


### Sorting

Sorting also executes much faster as it's only necessary to sort the unique values.

In [53]:
%time _ = dept.sort_values()

CPU times: total: 0 ns
Wall time: 11.7 ms


In [54]:
%time _ = dept_cat.sort_values()

CPU times: total: 0 ns
Wall time: 2.52 ms


### Most other operations

Most other operations happen faster. Here, we see the difference when using the `value_counts` method.

In [55]:
%time _ = dept.value_counts()

CPU times: total: 0 ns
Wall time: 1.67 ms


In [56]:
%time _ = dept_cat.value_counts()

CPU times: total: 0 ns
Wall time: 0 ns


## The `str` accessor is still available

Even though we've converted to the categorical data type, the `str` accessor is still available to use as long as the original data contained strings. Here, we make all the values uppercase.

In [59]:
dept_cat.head()

0    Information Technology
1                     Other
2      Houston Public Works
3                    Police
4                    Police
Name: dept, dtype: category
Categories (10, object): ['Fire', 'Health & Human Services', 'Houston Airport System', 'Houston Public Works', ..., 'Parks & Recreation', 'Police', 'Solid Waste Management', 'Information Technology']

In [57]:
dept_cat.str.upper().head()

0    INFORMATION TECHNOLOGY
1                     OTHER
2      HOUSTON PUBLIC WORKS
3                    POLICE
4                    POLICE
Name: dept, dtype: object

Unfortunately, an object Series is returned for all of the `str` accessor methods that return strings, so you'll have to convert it again to categorical after the operation completes if you want to keep it as a categorical.
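Two possible ways to end up with a categorical result are sketched below on a small made-up Series. The first converts back with `astype`; the second uses the `cat` accessor's `rename_categories` method, which accepts a callable and transforms only the unique categories. The names `upper_1` and `upper_2` are illustrative.

```python
import pandas as pd

s_cat = pd.Series(['Police', 'Fire', 'Police'], dtype='category')

# Option 1: use the str accessor, then convert the result back to categorical
upper_1 = s_cat.str.upper().astype('category')

# Option 2: transform only the categories; the result stays categorical
upper_2 = s_cat.cat.rename_categories(str.upper)

print(upper_1.dtype == 'category', upper_2.dtype == 'category')
print(upper_2.tolist())
```

The second option only works cleanly when the transformation keeps the categories unique.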

## Ordered categories

Generally speaking, there are two types of categorical data, **nominal** and **ordinal**. For nominal categorical data, the values have no natural ordering. With ordinal data, the values do have a natural ordering. To help remember, both 'ordinal' and 'order' begin with 'ord'.

The departments column from above is a good example of nominal data, as no department has any natural precedence over the other. There is no good example in our employee dataset of ordinal data. Let's read in the diamonds dataset, which has information on the size, cut, color, clarity, and price for many diamonds.

In [60]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head(3)

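| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |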


In [61]:
diamonds.shape

(53940, 10)

If you are familiar with diamonds, then you know that cut, color, and clarity have a specific ordering corresponding to the quality of that property. Let's take a look at the clarity column's unique values.

In [62]:
clarity = diamonds['clarity']
clarity.value_counts()

clarity
SI1     13065
VS2     12258
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741
Name: count, dtype: int64

Before creating an ordered categorical, let's convert clarity to an unordered categorical like we did above. Passing the `astype` method the string 'category' always creates an unordered categorical data type.

In [63]:
clarity_cat = diamonds['clarity'].astype('category')
clarity_cat.head()

0    SI2
1    SI1
2    VS1
3    VS2
4    SI2
Name: clarity, dtype: category
Categories (8, object): ['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2']

We can confirm that it is unordered by inspecting the `dtype`, which shows `ordered=False`, or by using the `ordered` attribute available from the `cat` accessor.

In [68]:
clarity_cat.dtype

CategoricalDtype(categories=['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2'], ordered=False, categories_dtype=object)

In [64]:
clarity_cat.cat.ordered

False

The data dictionary contains information on the ordering of each categorical variable. Let's read it in, changing the display option for column width so that we can read the entirety of the description.

In [65]:
pd.set_option('display.max_colwidth', 100)
pd.read_csv('../data/dictionaries/diamonds_dictionary.csv')

| | Column Name | Description |
|---|---|---|
| 0 | carat | weight of the diamond (0.2--5.01) |
| 1 | clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
| 2 | color | diamond colour, from J (worst) to D (best) |
| 3 | cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| 4 | depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79) |
| 5 | price | price in US dollars ($326--$18,823) |
| 6 | table | width of top of diamond relative to widest point (43--95) |
| 7 | x | length in mm (0--10.74) |
| 8 | y | width in mm (0--58.9) |
| 9 | z | depth in mm (0--31.8) |


### Creating an ordered categorical

Creating an ordered categorical column takes a bit of work. Here are the three steps that you need to take:

1. Create a list of all of the unique categories in the order that you desire.
1. Use the `CategoricalDtype` constructor available directly from `pd`. Pass it the list of categories and set the `ordered` parameter to `True`. Assign this result to a variable name.
1. Pass the variable name in step 2 to the `astype` Series method.

Here, we complete the three steps and assign the new ordered categorical Series to `clarity_ordered_cat`.

In [66]:
cats = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
clarity_dtype = pd.CategoricalDtype(cats, ordered=True)
clarity_ordered_cat = clarity.astype(clarity_dtype)
clarity_ordered_cat.head()

0    SI2
1    SI1
2    VS1
3    VS2
4    SI2
Name: clarity, dtype: category
Categories (8, object): ['I1' < 'SI2' < 'SI1' < 'VS2' < 'VS1' < 'VVS2' < 'VVS1' < 'IF']

Notice that the categories appear at the bottom with less-than signs separating them, indicating the order. Let's verify the data type of our new Series.

In [67]:
clarity_ordered_cat.dtype

CategoricalDtype(categories=['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'], ordered=True, categories_dtype=object)

From the end of the output, you can see that it is ordered. Let's also verify this with the `ordered` attribute.

In [69]:
clarity_ordered_cat.cat.ordered

True

### Special properties of ordered categorical Series

Ordered categorical Series have some behavior that differs from their unordered counterparts. For instance, taking the maximum or minimum value returns the category ranked as the best or worst and is not based on alphabetical ordering.

In [70]:
clarity_ordered_cat.max()

'IF'

In [76]:
clarity_ordered_cat.min()

'I1'

In [77]:
clarity_ordered_cat.sort_values()

40286    I1
47112    I1
47123    I1
17956    I1
17960    I1
         ..
1395     IF
34430    IF
34429    IF
25731    IF
32494    IF
Name: clarity, Length: 53940, dtype: category
Categories (8, object): ['I1' < 'SI2' < 'SI1' < 'VS2' < 'VS1' < 'VVS2' < 'VVS1' < 'IF']

Attempting to call the `max` method on an unordered categorical raises an error.

In [71]:
clarity_cat.max()

TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one


The original object Series does work with the `max` method and returns the value that sorts last alphabetically, which is different from the best measurement for clarity.

In [72]:
clarity.max()

'VVS2'

Sorting ordered categoricals is done by their category order and not alphabetically. Here we call the `value_counts` method on both the ordered and unordered clarity Series and then sort the index.

In [81]:
clarity_ordered_cat.value_counts().sort_index()

clarity
I1        741
SI2      9194
SI1     13065
VS2     12258
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
Name: count, dtype: int64

Sorting the index of the unordered categorical arranges it alphabetically.

In [75]:
clarity_cat.value_counts().sort_index()

clarity
I1        741
IF       1790
SI1     13065
SI2      9194
VS1      8171
VS2     12258
VVS1     3655
VVS2     5066
Name: count, dtype: int64
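Ordered categoricals also allow the comparison operators to be used against one of the categories, with the result following the category order. The sketch below uses a small made-up Series `s` with the same clarity ordering as above; `better_than_vs2` is an illustrative name.

```python
import pandas as pd

order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
clarity_dtype = pd.CategoricalDtype(order, ordered=True)
s = pd.Series(['SI2', 'IF', 'VS1', 'I1'], dtype=clarity_dtype)

# True where the clarity ranks above 'VS2' in the category order
better_than_vs2 = s > 'VS2'
print(better_than_vs2.tolist())

# Sorting follows the category order as well
print(s.sort_values().tolist())
```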

## Integers can be categories

Any column, not just a string column, may be converted to categorical regardless of its data type. If the values are known, limited, and discrete, then they are good candidates for categorical data. Integers are the primary non-string data type that represents categorical data. Here are some examples of integer categorical data:

* Rating of a movie/hotel/restaurant given that the range is known (such as the integers 1-5)
* Zip codes for a particular city
* Hurricane strength category (1-5)

As with our string columns, integer categorical columns may be unordered (like zip codes) or ordered (like movie ratings). Let's read in just two columns of the housing dataset.

In [83]:
housing = pd.read_csv('../data/housing.csv', usecols=['MSSubClass', 'SalePrice'])
housing.head()

| | MSSubClass | SalePrice |
|---|---|---|
| 0 | 60 | 208500 |
| 1 | 20 | 181500 |
| 2 | 60 | 223500 |
| 3 | 70 | 140000 |
| 4 | 60 | 250000 |


By default, these columns are read in as integers. Looking at the data dictionary, 'MSSubClass' is a candidate for categorical as it identifies the type of dwelling. Here are a few of the integer codes and their corresponding description. There is no inherent order for these values.

* 20 - 1-STORY 1946 & NEWER ALL STYLES
* 30 - 1-STORY 1945 & OLDER
* 40 - 1-STORY W/FINISHED ATTIC ALL AGES
* 45 - 1-1/2 STORY - UNFINISHED ALL AGES
* 50 - 1-1/2 STORY FINISHED ALL AGES

Below, we make the conversion and note that there are 15 unique categories.

In [84]:
ms_class = housing['MSSubClass'].astype('category')
ms_class.head()

0    60
1    20
2    60
3    70
4    60
Name: MSSubClass, dtype: category
Categories (15, int64): [20, 30, 40, 45, ..., 120, 160, 180, 190]
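For an ordered integer example such as ratings, the same `CategoricalDtype` approach used for strings should apply. The sketch below uses a handful of made-up ratings; `ratings`, `rating_dtype`, and `ratings_cat` are illustrative names.

```python
import pandas as pd

ratings = pd.Series([5, 3, 4, 5, 1, 3])   # hypothetical movie ratings

# Declare every possible rating, even those that do not appear in the data
rating_dtype = pd.CategoricalDtype([1, 2, 3, 4, 5], ordered=True)
ratings_cat = ratings.astype(rating_dtype)

print(ratings_cat.cat.categories.tolist())
print(ratings_cat.max())

# Counts are reported for all categories, including the unused rating 2
counts = ratings_cat.value_counts().sort_index()
print(counts.tolist())
```

Listing all known categories up front is one way to express the "known" property: a rating of 2 is valid even though nobody has given it yet.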

## The new string data type

With the release of pandas 1.0, a new pandas-only data type called 'string' was made available. It can only contain strings and missing values.

Use the string `'string'` to create this data type in the Series constructor or with the `astype` method. You can also use the pandas object `pd.StringDtype` directly. Both Series below are identical.

In [85]:
s_string = pd.Series(['Police', 'Fire', 'Police', pd.NA], dtype='string')
s_string = pd.Series(['Police', 'Fire', 'Police', pd.NA], dtype=pd.StringDtype())
s_string

0    Police
1      Fire
2    Police
3      <NA>
dtype: string

### Similarities between string and object data types

The intended purpose of the string data type is to finally provide pandas users with a data type that is guaranteed to only contain strings (and missing values). This should reduce errors as the object data type is capable of containing anything. That said, the functionality between the two data types is going to be very similar. Here, we use the `str` accessor to uppercase the strings.

In [86]:
s_string.str.upper()

0    POLICE
1      FIRE
2    POLICE
3      <NA>
dtype: string
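One place where the two data types can differ is in the result type of `str` accessor methods that return numbers. The sketch below compares `str.len` on a string Series and on an explicitly object Series; the behavior shown is what I would expect from recent pandas versions, and the variable names are illustrative.

```python
import pandas as pd

s_string = pd.Series(['Police', 'Fire', pd.NA], dtype='string')
s_object = pd.Series(['Police', 'Fire', None], dtype='object')

# The string data type keeps integers by using a nullable integer result
len_string = s_string.str.len()

# The object data type falls back to floats so that NaN can mark the missing value
len_object = s_object.str.len()

print(len_string.dtype, len_object.dtype)
```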

## Converting strings to numeric

It is possible to convert strings consisting entirely of numerical characters to either integer or float. Let's construct a Series of strings that look just like floats. In the pandas version used here, object is the default data type for strings.

In [87]:
s = pd.Series(['4.5', '3.19'])
s

0     4.5
1    3.19
dtype: object

Notice that quotation marks are not shown for strings in the visual display of the Series, so the values appear to be floats. But you'll also notice that the decimal points are not aligned one on top of the other and each value has a different number of digits after the decimal point. This isn't the normal display for actual float columns. Let's make the conversion to float with the `astype` method so you can see the difference in the visual display. Notice that the decimal points now align.

In [88]:
s.astype('float64')

0    4.50
1    3.19
dtype: float64

## Force conversion with `pd.to_numeric` 

You may have a Series of string values where some can be converted to numeric and others cannot. In this situation, it is not possible to use the `astype` method to make the conversion, as you can see with the following error.

In [89]:
s = pd.Series(['4.5', '3.19', 'NOT AVAILABLE'])
s

0              4.5
1             3.19
2    NOT AVAILABLE
dtype: object

In [90]:
s.astype('float64')

ValueError: could not convert string to float: 'NOT AVAILABLE'

Instead, you must turn to the `to_numeric` function, which works similarly to `astype`, but has an option to force the conversion to happen. You do this by setting the `errors` parameter to the string 'coerce'. Any value that cannot be converted will be set as missing.

In [None]:
pd.to_numeric(s, errors='coerce')

0    4.50
1    3.19
2     NaN
dtype: float64
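After a forced conversion, it is often useful to see which original values could not be converted. One possible approach, sketched below, selects the values that were present in the original Series but are missing after coercion; `converted` and `failed` are illustrative names.

```python
import pandas as pd

s = pd.Series(['4.5', '3.19', 'NOT AVAILABLE'])
converted = pd.to_numeric(s, errors='coerce')

# Values that existed originally but became missing after coercion
failed = s[converted.isna() & s.notna()]
print(failed.tolist())
```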

Notice that `to_numeric` is a function and not a method. You must access it directly from `pd`. The `astype` method does have an `errors` parameter, but it does not have the option for 'coerce'. It would be quite nice if the developers implemented this option for `astype`, as we then wouldn't need to use `to_numeric`.

### Converting to strings

You can convert all the values to strings by passing either the string `'str'` or the built-in `str` class to `astype`. Let's create a Series of integers and then convert it to strings. The data type of the Series will be object.

In [92]:
s = pd.Series([10, 20, 99])
s.astype('str')

0    10
1    20
2    99
dtype: object

Let's verify that the underlying values are actually strings by accessing the `values` attribute. A numpy array is returned, and uses quote marks for its visual display of string data.

In [93]:
s.astype('str').values

array(['10', '20', '99'], dtype=object)

Use the string `'string'` to convert it to the new pandas-only string data type.

In [94]:
s.astype('string')

0    10
1    20
2    99
dtype: string
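
Whichever target you choose, the individual values end up as ordinary Python strings, and only the reported data type of the Series differs. The following is a small sketch that checks this; the loop and the names `target` and `converted` are illustrative, and the data type printed for `'str'` may depend on your pandas version.

```python
import pandas as pd

s = pd.Series([10, 20, 99])

for target in ['str', 'string']:
    converted = s.astype(target)
    # True when every element is a Python str
    all_strings = all(isinstance(value, str) for value in converted)
    print(target, converted.dtype, all_strings)
```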

## Object, String, and Categorical data type summary

![0]

[0]: images/obj_str_cat_dtypes.png

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Using its constructor, create a Series containing three two-item lists of integers. Then call the `sum` method on the Series. What is returned?</span>

In [108]:
s = pd.Series([[1,2],[3,4],[5,6]])

s.sum()

[1, 2, 3, 4, 5, 6]

### Exercise 2

<span style="color:green; font-size:16px">Use the constructor to create a Series of integers, floats, and booleans. Do not set the `dtype` parameter. What data type is your Series?</span>

In [100]:
s2 = pd.Series([2,9.4,True])

s2

0       2
1     9.4
2    True
dtype: object

### Exercise 3

<span style="color:green; font-size:16px">Construct a Series with the same values but force the data type to be a float. Does it work? What happens to the non-float values?</span>

In [101]:
s2.astype('float')

0    2.0
1    9.4
2    1.0
dtype: float64

In [112]:
pd.Series([2,9.4,True], dtype='float')

0    2.0
1    9.4
2    1.0
dtype: float64

### Exercise 4

<span style="color:green; font-size:16px">Construct a Series containing three strings and the four missing values `None`, `np.nan`, `pd.NA`, and `pd.NaT` assigning the result to a variable.</span>

In [104]:
s4 = pd.Series(['one','two','three',None,np.nan,pd.NA,pd.NaT])

s4

0      one
1      two
2    three
3     None
4      NaN
5     <NA>
6      NaT
dtype: object

### Exercise 5

<span style="color:green; font-size:16px">Using pandas, count the number of missing values in exercise 4.</span>

In [113]:
s4.isna().sum()

np.int64(4)

### Exercise 6

<span style="color:green; font-size:16px">Convert the Series from exercise 4 to the new string data type. Notice what happens to the missing values.</span>

In [106]:
s4.astype('string')

0      one
1      two
2    three
3     <NA>
4     <NA>
5     <NA>
6     <NA>
dtype: string

### Read in the movie dataset

Execute the cell below to read in the first 10 columns of the movie dataset, setting the index to be the title.

In [114]:
pd.set_option('display.max_columns', 100)
movie = pd.read_csv('../data/movie.csv', index_col='title', usecols=range(10))
movie.head(3)

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear


In [122]:
movie.shape

(4916, 9)

### Exercise 7

<span style="color:green; font-size:16px">Which of the columns above are good candidates for the categorical data type?</span>

In [128]:
movie['actor2'].value_counts()

actor2
Morgan Freeman     18
Charlize Theron    14
Brad Pitt          13
Meryl Streep       11
James Franco       10
                   ..
Jeremy Sande        1
Sonia Braga         1
Michael Cortez      1
Maxwell Moody       1
Valorie Curry       1
Name: count, Length: 3030, dtype: int64

In [124]:

def find_categorical_candidates(df, threshold=0.05):
    """Return the columns whose share of unique values is at most `threshold`."""
    total_rows = len(df)
    results = []

    for col in df.columns:
        unique_count = df[col].nunique(dropna=True)
        percent_unique = unique_count / total_rows

        if percent_unique <= threshold:  # good candidate
            results.append({
                'column': col,
                'unique_values': unique_count,
                '% unique': round(percent_unique * 100, 2)
            })

    # name the columns explicitly so an empty result can still be sorted
    columns = ['column', 'unique_values', '% unique']
    return pd.DataFrame(results, columns=columns).sort_values('% unique')


In [125]:
candidates = find_categorical_candidates(movie, threshold=0.1)  # 10% threshold
candidates

,column,unique_values,% unique
1,color,2,0.04
2,content_rating,18,0.37
0,year,91,1.85
3,duration,191,3.89
4,director_fb,435,8.85
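
A low share of unique values is only a proxy for what we actually care about, which is the memory saved by the conversion. As a rough sketch, using a synthetic Series as a stand-in for a column such as `content_rating` (the data and variable names below are made up for illustration), you can compare the two representations directly with `memory_usage`:

```python
import pandas as pd

# synthetic stand-in for a string column with few unique values
ratings = pd.Series(['PG-13', 'R', 'PG'] * 1000, dtype='object')
as_category = ratings.astype('category')

# deep=True includes the memory used by the strings themselves
object_bytes = ratings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The exact byte counts depend on your platform and pandas version, but the categorical version should be considerably smaller because each distinct string is stored only once.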


### Exercise 8

<span style="color:green; font-size:16px">Select the `content_rating` column as a Series and convert it to categorical. Assign the result to the variable `rating`.</span>

In [129]:
rating = movie['content_rating'].astype('category')

### Exercise 9

<span style="color:green; font-size:16px">Write an expression that returns the number of categories.</span>

In [131]:
len(rating.cat.categories)

18

### Exercise 10

<span style="color:green; font-size:16px">Prove that the `str` accessor still works with categorical columns by making the ratings lowercase.</span>

In [132]:
rating.str.lower()

title
Avatar                                        pg-13
Pirates of the Caribbean: At World's End      pg-13
Spectre                                       pg-13
The Dark Knight Rises                         pg-13
Star Wars: Episode VII - The Force Awakens      NaN
                                              ...  
Signed Sealed Delivered                         NaN
The Following                                 tv-14
A Plague So Pleasant                            NaN
Shanghai Calling                              pg-13
My Date with Drew                                pg
Name: content_rating, Length: 4916, dtype: object

### Exercise 11

<span style="color:green; font-size:16px">Assign the rating 'GGG' as the first value.</span>

In [136]:
rating = rating.cat.add_categories('GGG')

In [137]:
rating.iloc[0] = 'GGG'

In [138]:
rating

title
Avatar                                          GGG
Pirates of the Caribbean: At World's End      PG-13
Spectre                                       PG-13
The Dark Knight Rises                         PG-13
Star Wars: Episode VII - The Force Awakens      NaN
                                              ...  
Signed Sealed Delivered                         NaN
The Following                                 TV-14
A Plague So Pleasant                            NaN
Shanghai Calling                              PG-13
My Date with Drew                                PG
Name: content_rating, Length: 4916, dtype: category
Categories (19, object): ['Approved', 'G', 'GP', 'M', ..., 'TV-Y7', 'Unrated', 'X', 'GGG']
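
The call to `add_categories` is needed because a categorical Series only accepts values that are already among its categories. The sketch below uses a small made-up Series named `demo` to show the failure and the fix; it catches both `TypeError` and `ValueError` because the exception type is assumed to vary between pandas versions.

```python
import pandas as pd

demo = pd.Series(['PG', 'R', 'PG'], dtype='category')

try:
    demo.iloc[0] = 'GGG'  # 'GGG' is not yet one of the categories
except (TypeError, ValueError) as err:
    print(type(err).__name__, err)

# register the new category first, then the assignment is allowed
demo = demo.cat.add_categories('GGG')
demo.iloc[0] = 'GGG'
print(demo)
```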

### Exercise 12

<span style="color:green; font-size:16px">Convert the following Series to integer.</span>

In [139]:
s = pd.Series(['1', '2'])

In [140]:
s.astype('int')

0    1
1    2
dtype: int64

In [141]:
pd.to_numeric(s)

0    1
1    2
dtype: int64

### Exercise 13

<span style="color:green; font-size:16px">Convert the following Series to integer.</span>

In [142]:
s = pd.Series(['1', '2', 'BAD DATA'])

In [156]:
pd.to_numeric(s,errors='coerce').astype('Int8')

0       1
1       2
2    <NA>
dtype: Int8
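
The final `astype` uses a nullable integer type (capital "I") because the coerced Series contains a missing value, which the regular numpy integer types cannot hold. A brief sketch of what happens with each choice, using the illustrative variable name `coerced`:

```python
import pandas as pd

s = pd.Series(['1', '2', 'BAD DATA'])
coerced = pd.to_numeric(s, errors='coerce')
print(coerced)

# the numpy integer type cannot represent the missing value
try:
    coerced.astype('int64')
except ValueError as err:
    print('ValueError:', err)

# the nullable integer type keeps it as <NA>
print(coerced.astype('Int64'))
```

`'Int8'` works the same way as `'Int64'` but only holds values from -128 to 127, so it is a reasonable choice only when you know the data fits in that range.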

### Read in the diamonds dataset

Execute the next cell to read in the diamonds dataset.

In [144]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head(3)

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31


### Exercise 14

<span style="color:green; font-size:16px">Select the `cut` column as a Series and convert it to an ordered categorical. Use the data dictionary from above. Assign it to the variable `cut_cat`.</span>

In [162]:
cats = ['Fair','Good','Very Good','Premium','Ideal']
category = pd.CategoricalDtype(cats, ordered=True)

cut_cat = diamonds['cut'].astype(category)

cut_cat

0            Ideal
1          Premium
2             Good
3          Premium
4             Good
           ...    
53935        Ideal
53936         Good
53937    Very Good
53938      Premium
53939        Ideal
Name: cut, Length: 53940, dtype: category
Categories (5, object): ['Fair' < 'Good' < 'Very Good' < 'Premium' < 'Ideal']
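
Once the categories are ordered, comparisons, `min`, `max`, and sorting follow the category order rather than alphabetical order. Here is a small sketch with a made-up four-value Series (the names `cut_dtype` and `demo` are illustrative) so that it runs without the diamonds file:

```python
import pandas as pd

cats = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_dtype = pd.CategoricalDtype(cats, ordered=True)
demo = pd.Series(['Ideal', 'Premium', 'Good', 'Fair']).astype(cut_dtype)

# comparisons use the category order, not alphabetical order
print(demo > 'Good')

# min and max return the lowest and highest categories present
print(demo.min(), demo.max())

# sorting also respects the category order
print(demo.sort_values())
```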

### Exercise 15

<span style="color:green; font-size:16px">By only knowing that `cut_cat` is an ordered categorical, write an expression to get the percentage of diamonds that have the lowest category.</span>

In [166]:
(cut_cat == cut_cat.min()).mean()

np.float64(0.029847979236188357)

In [168]:
cut_cat.value_counts(normalize=True).sort_index().iloc[0]

np.float64(0.029847979236188357)