# **Exploring PyJanitor: A Data Cleaning Companion for Pandas**



---


<p align="justify">Data cleaning and preprocessing are fundamental steps in any data analysis or machine learning workflow. These tasks are often repetitive and time-consuming, requiring many lines of code. Libraries like PyJanitor streamline them and improve productivity.

<p align="justify">PyJanitor is a Python library built on top of pandas and inspired by the R package janitor. It provides a rich set of functions designed to simplify and accelerate common data cleaning operations. From standardizing column names and handling missing values to filtering data and chaining multiple transformations, PyJanitor aims to make your data cleaning workflow more efficient and readable.

<p align="justify">In this notebook, we will explore some of the core functionalities of PyJanitor and see how it can help us clean and prepare our data with ease. We'll cover topics such as cleaning column names, renaming columns, handling missing values, filtering and selecting data, and the powerful concept of method chaining. By the end, you'll have a better understanding of how PyJanitor can become an invaluable tool in your data cleaning arsenal.





---



## **Install PyJanitor**

In [1]:
!pip install pyjanitor



## **Import Libraries**

In [2]:
import janitor
import pandas as pd

## **1. Cleaning Column Names**

With PyJanitor's `clean_names()` function, we can standardise our column names with a single call, making them uniform and consistent.

What this function can do:

*   replaces spaces with underscores
*   converts all characters to lowercase
*   strips leading and trailing whitespace
*   replaces dots with underscores


### Example:

In [3]:
#Create a data frame with inconsistent column names
students = pd.DataFrame({
    'Student.ID': [1, 2, 3],
    'Student Name': ['Nurul', 'Cahaya', 'Sam'],
    'Student Gender': ['Male', 'Female', 'Male'],
    'Course': ['Calculus', 'Data Science', 'Database'],
    'Grade': ['A', 'B', 'C'],
})

#Clean the column names
cleaned_student = students.clean_names()
print(cleaned_student)

   student_id student_name student_gender        course grade
0           1        Nurul           Male      Calculus     A
1           2       Cahaya         Female  Data Science     B
2           3          Sam           Male      Database     C
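To make the listed transformations concrete, here is a rough pandas-only approximation of the same steps. It is a sketch for illustration, not PyJanitor's actual implementation, and the variable names `manual_columns` and `manually_cleaned` are made up for this example.

```python
import pandas as pd

# Same kind of inconsistent column names as in the example above
students = pd.DataFrame({
    'Student.ID': [1, 2, 3],
    'Student Name': ['Nurul', 'Cahaya', 'Sam'],
})

# Plain-pandas approximation of the listed steps (not PyJanitor's own code)
manual_columns = (
    students.columns
    .str.strip()                          # strip leading and trailing whitespace
    .str.lower()                          # convert all characters to lowercase
    .str.replace(' ', '_', regex=False)   # replace spaces with underscores
    .str.replace('.', '_', regex=False)   # replace dots with underscores
)

# Return a new data frame that carries the cleaned labels
manually_cleaned = students.set_axis(manual_columns, axis=1)
print(list(manually_cleaned.columns))
```

Writing these steps by hand for every dataset is exactly the repetition that `clean_names()` is meant to remove.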




---



## **2. Renaming Columns**

Renaming columns can significantly improve data clarity, readability, and consistency.

The `rename_column()` function makes this process straightforward.

### Example:

In [4]:
students = pd.DataFrame({
    'stu_id': [1, 2],
    'stu_name': ['Michael', 'Jackson'],
})

# Renaming the columns
students = students.rename_column('stu_id', 'Student_ID')
students = students.rename_column('stu_name', 'Student_Name')
print(students.columns)

Index(['Student_ID', 'Student_Name'], dtype='object')
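For comparison, the same two renames can be done in one call with pandas' built-in `rename()`. This is a sketch of the plain-pandas alternative, not a PyJanitor feature; the variable name `renamed` is chosen only for this example.

```python
import pandas as pd

students = pd.DataFrame({
    'stu_id': [1, 2],
    'stu_name': ['Michael', 'Jackson'],
})

# Map every old name to its new name in a single dictionary
renamed = students.rename(columns={
    'stu_id': 'Student_ID',
    'stu_name': 'Student_Name',
})

print(list(renamed.columns))
```

`rename_column()` reads well when a single column changes inside a longer chain, while the dictionary form is compact when several columns change at once.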






---



## **3. Handling Missing Values**

Missing values are a common challenge when working with datasets.

The `fill_empty()` function offers a simple way to handle them.


### Example:

In [5]:
# Create a data frame with missing values
employees = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': [None, 'Diego', 'Messi'],
    'department': ['Marketing', None, 'IT'],
    'salary': [60000, 55000, None],
})

PyJanitor can fill these missing values as follows:

In [6]:
# Fill missing values in 'department' and 'name' with 'Unknown' and 'salary' with the mean salary
employees = employees.fill_empty(column_names=['name', 'department'], value='Unknown')
employees = employees.fill_empty(column_names='salary', value=employees['salary'].mean())

print(employees)

   employee_id     name department   salary
0            1  Unknown  Marketing  60000.0
1            2    Diego    Unknown  55000.0
2            3    Messi         IT  57500.0




<p align="justify">In this example, the missing name in the first row and Diego's missing department are replaced with 'Unknown', while Messi's missing salary is filled with the mean of the two known salaries: (60000 + 55000) / 2 = 57500. Other strategies for handling missing values include forward fill, backward fill, and filling with a fixed value.
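As a small illustration of forward and backward fill, the sketch below applies pandas' own `ffill()` and `bfill()` to the salary column from the example. It uses plain pandas rather than `fill_empty()`, and the variable names are chosen only for this illustration.

```python
import pandas as pd

# Salary column from the employees example, with the last value missing
employees = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'salary': [60000, 55000, None],
})

# Forward fill copies the last known value downwards
forward_filled = employees['salary'].ffill()

# Backward fill copies the next known value upwards;
# the last row has no later value, so it stays missing
backward_filled = employees['salary'].bfill()

print(forward_filled.tolist())
print(backward_filled.isna().tolist())
```

Forward fill gives the third employee a salary of 55000, whereas backward fill cannot fill the gap here, so the right strategy depends on where the missing values sit.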



---



## **4. Filtering Rows & Selecting Columns**

<p align="justify">Filtering rows and selecting columns are essential steps in data analysis. The example below uses pandas' own <code>query()</code> and <code>.loc</code>, which work on the same data frames as PyJanitor's methods; PyJanitor's <code>select_columns()</code> appears in the chaining section.


### Example:

In [7]:
# Create a data frame with student data
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['Ali', 'Siti', 'Mei', 'Muthu', 'Sam'],
    'subject': ['Maths', 'Science', 'English', 'History','Biology'],
    'marks': [85, 58, 92, 45, 75],
    'grade': ['A', 'C', 'A', 'D', 'B'],
})

# Keep rows where marks are at least 60
filtered_students = students.query('marks >= 60')
print(filtered_students)

   student_id name  subject  marks grade
0           1  Ali    Maths     85     A
2           3  Mei  English     92     A
4           5  Sam  Biology     75     B


<p align="justify">Now, let's say you only want to display specific columns, such as just the name and ID, instead of the entire dataset.

With `.loc`, this takes a single line:


In [8]:
# Select specific columns
selected_columns_df = filtered_students.loc[:,['student_id', 'name']]
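The row filter and the column selection can also be combined in one step. The sketch below does this with a boolean mask inside `.loc`, using plain pandas; the name `passed` is made up for this example.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['Ali', 'Siti', 'Mei', 'Muthu', 'Sam'],
    'subject': ['Maths', 'Science', 'English', 'History', 'Biology'],
    'marks': [85, 58, 92, 45, 75],
    'grade': ['A', 'C', 'A', 'D', 'B'],
})

# Rows: marks of at least 60; columns: only the ID and the name
passed = students.loc[students['marks'] >= 60, ['student_id', 'name']]
print(passed)
```

The first argument to `.loc` selects rows and the second selects columns, so no intermediate data frame is needed.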



---



## **5. Chaining Methods**

<p align="justify">One of PyJanitor's standout features is how well it supports method chaining: because its functions are exposed as DataFrame methods, several operations can be combined in a single expression.


### Example:

In [11]:
# Create a data frame with sample car data
cars = pd.DataFrame({
    'Car ID': [101, None, 103, 104, 105],
    'Car Model': ['Proton', 'Perodua', 'Toyota', 'Honda', 'Tesla'],
    'Price': [25000, 30000, None, 40000, 45000],
    'Year': [2018, 2019, 2017, 2020, None],
})

print("Cars Data Before Applying Method Chaining:\n")
print(cars)

Cars Data Before Applying Method Chaining:

   Car ID Car Model    Price    Year
0   101.0    Proton  25000.0  2018.0
1     NaN   Perodua  30000.0  2019.0
2   103.0    Toyota      NaN  2017.0
3   104.0     Honda  40000.0  2020.0
4   105.0     Tesla  45000.0     NaN


The data frame currently contains missing values and inconsistent column names. We could address these issues step by step, calling functions like `clean_names()`, `rename_column()`, and `dropna()` in separate statements, but chaining the methods is more elegant. All operations then run in a single expression, which gives a more fluent workflow and more readable code.


In [12]:
# Chain methods to clean column names, drop rows with missing values, select specific columns, and rename columns
cleaned_cars = (
    cars
    .clean_names()  # Clean column names
    .dropna()  # Drop rows with missing values
    .select_columns(['car_id', 'car_model', 'price'])  # Select columns
    .rename_column('price', 'price_usd')  # Rename column
)

print("Cars Data After Applying Method Chaining:\n")
print(cleaned_cars)

Cars Data After Applying Method Chaining:

   car_id car_model  price_usd
0   101.0    Proton    25000.0
3   104.0     Honda    40000.0




In this pipeline, the following operations are carried out:

* `clean_names()` standardises and cleans the column names.
* `dropna()` removes any rows that contain missing values.
* `select_columns()` filters the dataset to include only the `car_id`, `car_model`, and `price` columns.
* `rename_column()` changes the column name `price` to `price_usd`.
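To show what the chain replaces, here is a rough pandas-only version of the same four steps. It is a sketch for comparison, assuming that lowercasing and replacing spaces is enough to clean these particular column names; `cleaned_cars_pandas` is a name chosen only for this example.

```python
import pandas as pd

cars = pd.DataFrame({
    'Car ID': [101, None, 103, 104, 105],
    'Car Model': ['Proton', 'Perodua', 'Toyota', 'Honda', 'Tesla'],
    'Price': [25000, 30000, None, 40000, 45000],
    'Year': [2018, 2019, 2017, 2020, None],
})

cleaned_cars_pandas = (
    cars
    # Approximate clean_names(): trim, lowercase, spaces to underscores
    .rename(columns=lambda name: name.strip().lower().replace(' ', '_'))
    # Drop rows with any missing value
    .dropna()
    # Keep only the three columns of interest
    .loc[:, ['car_id', 'car_model', 'price']]
    # Rename 'price' to 'price_usd'
    .rename(columns={'price': 'price_usd'})
)

print(cleaned_cars_pandas)
```

The result matches the PyJanitor chain for this dataset, but each step needs more hand-written detail, which is the gap methods like `clean_names()` and `select_columns()` are meant to close.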

## **Conclusion**

PyJanitor is a valuable library for anyone working with data. Beyond the features covered in this exploration, it offers a wide range of other tools, such as encoding categorical variables, extracting features and labels, detecting duplicate rows, and many more. These capabilities can be explored further in its [documentation](https://pyjanitor-devs.github.io/pyjanitor/). I believe the more we explore PyJanitor, the more we'll appreciate its functionality.
