# Pandas Data Manipulation Tutorial

This tutorial covers several aspects of data manipulation with the pandas library for Python. We will go through the following topics:

1. What are Missing Values and Why do They Occur?
2. How to Find Missing Values
3. Shape of Pandas DataFrames
4. Missing Values in Time Series Data
5. Creating a New Column
6. Pandas Profiling

Let's get started!


## 1. What are Missing Values and Why do They Occur?

Missing values are entries that are absent from a dataset. They occur for reasons such as data entry errors, equipment malfunctions, or data that was never collected.
Missing values can cause problems during data analysis and modeling, as they may skew the results or lead to incorrect conclusions.
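
To make the "skewed results" point concrete, here is a minimal sketch. The `readings` series and its values are made up for illustration. It compares three ways of averaging the same data when one entry is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing entry (values are made up)
readings = pd.Series([10.0, 20.0, np.nan, 40.0])

# NumPy propagates NaN: the mean of an array that contains NaN is NaN
numpy_mean = np.mean(readings.to_numpy())

# pandas skips NaN by default (skipna=True), so only three values are averaged
pandas_mean = readings.mean()

# Treating the gap as zero silently pulls the average down
zero_filled_mean = readings.fillna(0).mean()

print(numpy_mean, pandas_mean, zero_filled_mean)
```

The three results differ, so how missing values are treated should be a deliberate choice rather than a default.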


## 2. How to Find Missing Values

Pandas provides methods to find missing values in a DataFrame: `isnull()` flags each missing cell, and `any()` reduces those flags to one answer per column.

Let's demonstrate how to use these methods.


## 3. Shape of Pandas DataFrames

The shape of a DataFrame tells us the dimensions of the DataFrame, i.e., the number of rows and columns.

We'll show how to get the shape of a DataFrame.


## 4. Missing Values in Time Series Data

In time series data, missing values often come from equipment failure or network issues.
We can handle missing values in time series data using methods like forward fill, backward fill, or interpolation.

We'll demonstrate these methods with examples.


## 5. Creating a New Column

We can create a new column in a DataFrame by performing operations on existing columns.
Let's illustrate this with an example where we create a new column by taking the sum of two existing columns.


## 6. Pandas Profiling

Pandas Profiling is a package that generates profile reports from a DataFrame, providing quick insights into the data.
It generates a detailed report containing information about the data types, missing values, distributions, correlations, etc.

We'll show how to use Pandas Profiling and discuss its benefits.


## Dataset

Let's create a sample dataset that we'll use throughout this tutorial.
We'll create a DataFrame with columns `A`, `B`, and `C`.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

# Creating the sample dataset
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [np.nan, np.nan, np.nan, np.nan]
}

# Creating DataFrame
df = pd.DataFrame(data)

# Displaying the dataset
df


|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | NaN |
| 1 | 2.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | 4.0 | 8.0 | NaN |


## 2. How to Find Missing Values

Let's explore the dataset to identify missing values.


In [3]:
missing_values = df.isnull()
# we get a dataframe where everything is boolean
print(missing_values.dtypes)
missing_values

A    bool
B    bool
C    bool
dtype: object


|   | A | B | C |
|---|---|---|---|
| 0 | False | False | True |
| 1 | False | True | True |
| 2 | True | True | True |
| 3 | False | False | True |


In [20]:
any_missing = df.isnull().any()
# this is a series. Use any_missing.index and any_missing.values to access the index and values separately
any_missing


A    True
B    True
C    True
dtype: bool
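
A boolean flag per column says that something is missing, but not how much. As a small sketch building on the same `isnull()` call, the example below counts missing values per column, selects the rows that contain a gap, and lists the columns that are entirely empty. The DataFrame is rebuilt here only so the snippet runs on its own, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as the tutorial's `df`
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [np.nan, np.nan, np.nan, np.nan],
})

# Number of missing values per column (each True counts as 1)
missing_per_column = df.isnull().sum()

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

# Columns in which every value is missing
all_missing_columns = df.columns[df.isnull().all()].tolist()

print(missing_per_column)
print(len(rows_with_missing))
print(all_missing_columns)
```

Because column `C` is empty everywhere, every row in this dataset contains at least one missing value.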

## 3. Shape of Pandas DataFrames

Let's get the shape of the DataFrame.


In [21]:
# Getting the shape of DataFrame
df_shape = df.shape
df_shape

(4, 3)
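
Since `shape` is a plain tuple of `(rows, columns)`, it can be unpacked directly. The sketch below does that and, as one possible use, combines it with `size` to estimate the fraction of missing cells. The DataFrame is rebuilt so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

# Same sample data as the tutorial's `df`
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [np.nan, np.nan, np.nan, np.nan],
})

# shape is a tuple, so it unpacks into two integers
n_rows, n_cols = df.shape

# size is the total number of cells (rows * columns)
total_cells = df.size

# Fraction of all cells that are missing
missing_fraction = df.isnull().sum().sum() / total_cells

print(n_rows, n_cols, total_cells, missing_fraction)
```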

## 4. Missing Values in Time Series Data

Let's demonstrate how to handle missing values in time series data using forward fill, backward fill, and interpolation methods.

NOTE: this operation is often called "imputation". It is a big part of most data pipelines and is essentially manual: it is hard to automate because choosing a strategy requires context about the task. Sometimes the right choice is to apply no strategy at all and report the gap as an error instead.
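
The three strategies named above can be sketched as follows. The series `ts`, its dates, and its values are hypothetical and chosen only to make the differences visible. Which strategy is appropriate depends on the data, as the note says.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps (dates and values are made up)
index = pd.date_range('2025-01-01', periods=6, freq='D')
ts = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0], index=index)

# Forward fill: carry the last known value forward
forward_filled = ts.ffill()

# Backward fill: pull the next known value backward
backward_filled = ts.bfill()

# Interpolation: estimate gaps from neighbours, weighted by the time index
interpolated = ts.interpolate(method='time')

# Side-by-side comparison of the original and the three results
comparison = pd.DataFrame({
    'original': ts,
    'ffill': forward_filled,
    'bfill': backward_filled,
    'interpolate': interpolated,
})
print(comparison)
```

Forward fill never uses future information, which matters when the filled series feeds a forecasting model. Backward fill and interpolation both look ahead.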


## 5. Creating a New Column

Let's create a new column `D` by taking the sum of columns `A` and `B`.


In [22]:
# Creating a new column
df['D'] = df['A'] + df['B']

# Displaying the updated DataFrame
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 5.0 | NaN | 6.0 |
| 1 | 2.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | 4.0 | 8.0 | NaN | 12.0 |
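
Note that `+` yields a missing result whenever either operand is missing, which is why rows 1 and 2 of `D` are empty. If a one-sided gap should instead count as zero, one option is `Series.add` with `fill_value`. This is a sketch, and whether zero is a sensible stand-in is an assumption that depends on the data. The names `strict_sum` and `lenient_sum` are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as the tutorial's `df`
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [np.nan, np.nan, np.nan, np.nan],
})

# Plain addition: the result is missing if either operand is missing
strict_sum = df['A'] + df['B']

# add() with fill_value=0 treats a one-sided gap as 0;
# the result stays missing only where both operands are missing
lenient_sum = df['A'].add(df['B'], fill_value=0)

print(pd.DataFrame({'strict': strict_sum, 'lenient': lenient_sum}))
```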


## ~~6. Pandas Profiling~~

~~Let's generate a profile report using Pandas Profiling package.~~

NOTE: this does not work on Python 3.12 and 3.13 because it relies on the `cgi` module, which has been removed.
As of this notebook's upgrade (March 2025), the package appears to be no longer maintained. This cell is kept in case a substitute becomes available.

In [None]:
# Installing pandas-profiling package
# you can use this command, but be careful
# about updating the requirements.txt/pyproject.toml if this dependency is part of your codebase
# for quick experimentation it's fine

# !python -m pip install pandas-profiling

# Importing pandas_profiling
from pandas_profiling import ProfileReport

# Generating profile report
profile = ProfileReport(df)
profile.to_file('profile_report.html')
profile
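
While the cell above cannot run, part of what the report is described as covering (data types, missing values, distributions, correlations) can be approximated with built-in pandas calls. The sketch below is a rough stand-in under that assumption, not a replacement for a profile report. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as the tutorial's `df`
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [np.nan, np.nan, np.nan, np.nan],
})

# Data type of each column
column_types = df.dtypes

# Missing values per column
missing_counts = df.isnull().sum()

# Count, mean, standard deviation, min, quartiles, and max per numeric column
summary = df.describe()

# Pairwise correlations, computed on rows where both columns are present
correlations = df[['A', 'B']].corr()

print(column_types)
print(missing_counts)
print(summary)
print(correlations)
```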
