In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame()
df['id'] = [1,2,3,4,5]
df['rank'] = [3,5,1,7,8]
df['class'] = [10, 9, 9, 10, 9]
df['remarks'] = ['Good', 'OK', 'Excellent', 'Poor', 'Poor'] 
df['marks'] = ['87', '78', '100', '67', '55']
df

   id  rank  class    remarks  marks
0   1     3     10       Good     87
1   2     5      9         OK     78
2   3     1      9  Excellent    100
3   4     7     10       Poor     67
4   5     8      9       Poor     55


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       5 non-null      int64 
 1   rank     5 non-null      int64 
 2   class    5 non-null      int64 
 3   remarks  5 non-null      object
 4   marks    5 non-null      object
dtypes: int64(3), object(2)
memory usage: 328.0+ bytes


## Converting columns to categorical type

The categorical data type is useful in the following cases:

- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
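The memory point in the first bullet can be checked directly. The sketch below uses a made-up series (`s_obj`, not part of this notebook's dataframe) with many rows but only three distinct strings, and compares its footprint before and after conversion. The exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 30,000 rows, but only three distinct string values
s_obj = pd.Series(['Good', 'OK', 'Poor'] * 10_000)
s_cat = s_obj.astype('category')

# deep=True also counts the memory held by the Python string objects
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)

print('as strings :', obj_bytes, 'bytes')
print('as category:', cat_bytes, 'bytes')
```

The categorical version stores each distinct string once and keeps a small integer code per row, which is where the saving comes from.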

In [4]:
# convert the class column to category
df['class'] = df['class'].astype('category')

The rank column is numerical and is converted to an ordered categorical type.

In [5]:
from pandas.api.types import CategoricalDtype

Create an ordered categorical data type for the rank column. The categories are listed from lowest to highest, so 8 is the lowest and 1 is the highest.

In [6]:
cat_type = CategoricalDtype([8,7,6,5,4,3,2,1], ordered = True)

In [7]:
df['rank'] = df['rank'].astype(cat_type)

In [8]:
df['rank']

0    3
1    5
2    1
3    7
4    8
Name: rank, dtype: category
Categories (8, int64): [8 < 7 < 6 < 5 < 4 < 3 < 2 < 1]
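A minimal sketch of what the reversed order changes in practice: sorting, `min` and `max` follow the category order rather than the numeric order. It rebuilds the rank series on its own (the name `ranks` is only for this illustration).

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

rank_type = CategoricalDtype([8, 7, 6, 5, 4, 3, 2, 1], ordered=True)
ranks = pd.Series([3, 5, 1, 7, 8]).astype(rank_type)

# Ascending sort goes from the lowest category (8) to the highest (1)
sorted_ranks = ranks.sort_values().tolist()

# min/max use the category order, not the numeric value
lowest = ranks.min()
highest = ranks.max()

print(sorted_ranks, lowest, highest)
```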

Create an ordered categorical data type for the remarks column, with categories running from Poor (lowest) to Excellent (highest).

In [9]:
cat_type = CategoricalDtype(['Poor', 'OK', 'Good', 'Excellent'], ordered = True)
df['remarks'] = df['remarks'].astype(cat_type)
df['remarks']

0         Good
1           OK
2    Excellent
3         Poor
4         Poor
Name: remarks, dtype: category
Categories (4, object): [Poor < OK < Good < Excellent]
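With an ordered type, comparisons against a category label also work, which a plain string column would evaluate alphabetically. The following is a small sketch on a standalone copy of the remarks data; the variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

remarks_type = CategoricalDtype(['Poor', 'OK', 'Good', 'Excellent'], ordered=True)
remarks = pd.Series(['Good', 'OK', 'Excellent', 'Poor', 'Poor']).astype(remarks_type)

# Boolean mask: which rows are rated 'Good' or better in the logical order
at_least_good = (remarks >= 'Good').tolist()

# Integer codes give each value's position in the category order
codes = remarks.cat.codes.tolist()

print(at_least_good)
print(codes)
```

The scalar on the right-hand side of the comparison has to be one of the declared categories.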

## Converting string to int type


In [10]:
df['marks'] = df['marks'].astype(int)
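`astype(int)` works here because every string in marks is a clean integer. If a column might contain values that do not parse, `pd.to_numeric` with `errors='coerce'` is one alternative. The sketch below uses a hypothetical series with an invalid entry (`'absent'` is invented for illustration).

```python
import pandas as pd

# Hypothetical raw data with one entry that is not a number
raw = pd.Series(['87', '78', 'absent', '55'])

# Unparseable entries become NaN instead of raising an error
marks = pd.to_numeric(raw, errors='coerce')

n_invalid = marks.isna().sum()
print(marks.tolist(), n_invalid)
```

Because NaN is a floating-point value, the result is a float column rather than an int column whenever something failed to parse.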

## Checking the data types and size of the modified dataframe

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   id       5 non-null      int64   
 1   rank     5 non-null      category
 2   class    5 non-null      category
 3   remarks  5 non-null      category
 4   marks    5 non-null      int32   
dtypes: category(3), int32(1), int64(1)
memory usage: 875.0 bytes
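The two `info()` totals are not directly comparable: the `+` in the first one indicates that the memory of the object columns was not fully counted. One way to get a like-for-like view is `memory_usage(deep=True)` per column, sketched below on a rebuilt copy of the dataframe. On a frame this small, the fixed cost of storing the categories may outweigh any saving, so the converted frame is not necessarily smaller.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'rank': [3, 5, 1, 7, 8],
    'class': [10, 9, 9, 10, 9],
    'remarks': ['Good', 'OK', 'Excellent', 'Poor', 'Poor'],
    'marks': ['87', '78', '100', '67', '55'],
})
before = df.memory_usage(deep=True)

# Apply the same conversions as in the cells above, in one call
converted = df.astype({
    'rank': CategoricalDtype([8, 7, 6, 5, 4, 3, 2, 1], ordered=True),
    'class': 'category',
    'remarks': CategoricalDtype(['Poor', 'OK', 'Good', 'Excellent'], ordered=True),
    'marks': int,
})
after = converted.memory_usage(deep=True)

# Per-column bytes side by side, then the totals
comparison = pd.DataFrame({'before': before, 'after': after})
print(comparison)
print('total before:', before.sum(), 'total after:', after.sum())
```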


#### Reference:
1. [Pandas categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)