## Pandas: Data Manipulation & Handling Missing Data

In this lesson, we'll cover foundational techniques in Pandas for data manipulation, including creating and modifying DataFrames, handling missing data, and grouping data for aggregation. We’ll also dive deeper into strategies for imputing missing values (None or NaN) in datasets.

____

## Imports and Setup

First, let’s import the necessary libraries.

In [70]:
import numpy as np
import pandas as pd

import random

## Creating a DataFrame with Random Data

Let’s create a DataFrame with 10 rows and 10 columns, each filled with random integers between 0 and 50 (both ends included).

This simulated data will help us practice the various Pandas techniques.

In [None]:
# create a dictionary of lists where each column contains 10 random values
data_dict = {f'column {i}': [random.randint(0, 50) for _ in range(10)] for i in range(10)}

# convert it into a DataFrame
data_df = pd.DataFrame(data_dict)

data_df
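
Because the values are random, the DataFrame changes every time the cell runs. If you want the same numbers on every run, one option is to seed the generator first. The snippet below is a minimal sketch of that idea; the seed value 42 and the names `first` and `second` are arbitrary choices for illustration.

```python
import random

import pandas as pd

# Seed the generator so the "random" data is identical on every run
# (the seed value 42 is arbitrary)
random.seed(42)
first = pd.DataFrame(
    {f'column {i}': [random.randint(0, 50) for _ in range(10)] for i in range(10)}
)

# Re-seeding with the same value reproduces exactly the same DataFrame
random.seed(42)
second = pd.DataFrame(
    {f'column {i}': [random.randint(0, 50) for _ in range(10)] for i in range(10)}
)

print(first.shape)           # (10, 10)
print(first.equals(second))  # True
```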

## Modifying Data in a DataFrame

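**Using iloc and loc to alter data by position or label**

You can use *iloc* to modify values by integer position (row and column numbers, counted from 0) and *loc* to modify values by label (index and column names). Our DataFrame has the default index 0 to 9, so row labels and row positions happen to coincide here.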

In [None]:
# let's change the value at row position 2, column position 3 (both counted from 0) to -100, using iloc

data_df.iloc[2, 3] = -100
data_df

In [None]:
# let's change the value in the row labelled 7 and the column labelled 'column 8' to 999, using loc

data_df.loc[7, 'column 8'] = 999
data_df

In [None]:
# changing multiple values at once using iloc

data_df.iloc[0:3, 0:3] = 0
data_df
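
The difference between position and label is easier to see when the index is not 0, 1, 2, and so on. The following sketch uses a small made-up DataFrame (`demo_df` is a hypothetical name, not part of the lesson data) with row labels 10, 20 and 30.

```python
import pandas as pd

# A small frame whose row labels are NOT 0, 1, 2 (hypothetical example data)
demo_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=[10, 20, 30])

# iloc: first row, second column, counted from 0
demo_df.iloc[0, 1] = -100

# loc: the row labelled 30 and the column labelled 'a'
demo_df.loc[30, 'a'] = 999

# Slices differ too: iloc excludes the end position, loc includes the end label
by_position = demo_df.iloc[0:2]   # rows at positions 0 and 1
by_label = demo_df.loc[10:20]     # rows labelled 10 and 20

print(demo_df)
```

Here `demo_df.loc[0, 'a']` would raise a `KeyError`, because no row carries the label 0.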

**Using iloc and loc to modify values based on conditions**

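You can also use loc or iloc to change values in rows that satisfy a condition. loc accepts a boolean Series directly, while iloc needs a plain boolean array, which is why the iloc example calls .values on the condition.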

In [None]:
# Use iloc to modify values by position for rows that satisfy a condition

data_df.iloc[(data_df['column 0'] > 30).values, 5:7] = 1000
data_df

In [None]:
# Use loc to change values based on a condition

data_df.loc[data_df['column 0'] > 25, 'column 1'] = 123456789
data_df
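
Conditions can also be combined before they are passed to loc or iloc. The sketch below uses made-up data (`scores_df` and its columns are hypothetical) and shows one way to build a combined mask and reuse it with both indexers.

```python
import pandas as pd

# Hypothetical example data
scores_df = pd.DataFrame({'x': [5, 40, 35, 10], 'y': [1, 2, 3, 4], 'z': [0, 0, 0, 0]})

# A comparison returns a boolean Series with one True/False per row;
# combine conditions with & (and) or | (or), each wrapped in parentheses
mask = (scores_df['x'] > 30) & (scores_df['y'] < 3)

# loc accepts the boolean Series directly and selects columns by label
scores_df.loc[mask, 'z'] = 1

# iloc selects columns by position, so it needs the mask as a plain NumPy array
scores_df.iloc[mask.to_numpy(), 1] = -1

print(scores_df)
```

Only the second row satisfies both conditions, so it is the only row that changes.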

____

## Handling missing data in Pandas

Missing data is common in real-world datasets, and handling it appropriately is critical. Pandas provides several tools for identifying, filling, or removing missing data. In this section, we'll go over how to handle None, NaN, and pd.NA.

**Introducing missing data**

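Let’s introduce some missing data into our DataFrame by manually setting a few values to None, Python’s null object. In a numeric column, Pandas stores None as NaN and converts the column to a float dtype, which you can confirm with info() afterwards.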

In [None]:
# Introduce missing data
data_df.loc[2, 'column 6'] = None
data_df.loc[4, 'column 7'] = None

# Display the DataFrame with missing values
data_df


In [None]:
data_df.info()
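
To isolate what happened to the dtypes, here is a minimal sketch on a single made-up integer column (`ints_df` is a hypothetical name). It assumes the default NumPy-backed integer dtype; Pandas' nullable `Int64` dtype behaves differently and is not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical integer column
ints_df = pd.DataFrame({'n': [1, 2, 3]})
print(ints_df['n'].dtype)         # int64

# Assigning None to a numeric column stores NaN, and the column becomes float
ints_df.loc[1, 'n'] = None
print(ints_df['n'].dtype)         # float64
print(ints_df['n'].isna().sum())  # 1

# NaN never compares equal to itself, so use isna() rather than == to find it
print(np.nan == np.nan)           # False
```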

**Checking for missing data**

To check where missing values are present in the DataFrame, you can use the isna() or isnull() methods.

In [None]:
# check for missing values in the DataFrame

data_df.isnull()

# data_df.isna()  # equivalent method

In [None]:
# Get the total count of missing values in each column
# isnull() marks missing cells as True; sum() counts the True values per column

data_df.isnull().sum()

Note that the result above is a Series, so we can call .sum() on it again to get the total number of missing values in the DataFrame as a single integer.

In [None]:
data_df.isnull().sum().sum()

We can also use .notnull() to see which cells are not null.

In [None]:
data_df.notnull()
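
Counts are easier to judge relative to the size of the data. As a sketch on made-up data (`gaps_df` is a hypothetical name), the same boolean mask can give the share of missing values per column and the rows that contain a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps
gaps_df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan], 'b': [1.0, 2.0, 3.0, 4.0]})

# True counts as 1 and False as 0, so the mean of the mask is the missing fraction
missing_share = gaps_df.isna().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_gaps = gaps_df[gaps_df.isna().any(axis=1)]
print(rows_with_gaps)
```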

____

## Strategies for handling missing data

**Strategy 1: Dropping missing data**

In some cases, especially when the amount of missing data is minimal or irrelevant, it's common to drop rows or columns containing missing values.

In [None]:
# drop rows with missing values

data_df.dropna(inplace=False)  # inplace=False (the default) returns a new DataFrame; inplace=True modifies data_df itself

Alternatively, you can drop columns with missing values:

In [None]:
# drop columns with missing values

data_df.dropna(axis=1, inplace=False)

**Important**

Dropping rows or columns is suitable when the missing data represents a small fraction of your dataset and when removing those rows or columns won’t negatively impact any subsequent analysis.
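
Dropping does not have to be all-or-nothing. The sketch below, on made-up data (`sparse_df` is a hypothetical name), shows three `dropna` parameters that narrow down which rows are removed: `how`, `subset` and `thresh`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has one gap, row 2 is entirely empty
sparse_df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan],
    'b': [4.0, np.nan, np.nan],
    'c': [7.0, 8.0, np.nan],
})

# Drop a row only when every value in it is missing
dropped_all = sparse_df.dropna(how='all')

# Drop a row only when one of the listed columns is missing
dropped_subset = sparse_df.dropna(subset=['a', 'c'])

# Keep only rows that have at least 3 non-missing values
dropped_thresh = sparse_df.dropna(thresh=3)

print(len(dropped_all), len(dropped_subset), len(dropped_thresh))  # 2 2 1
```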

**Strategy 2: Filling missing data (imputation)**

Filling missing data is a more flexible strategy. There are various techniques to impute missing values depending on the type of data and its distribution.

**Filling with a constant value**

You can replace missing values with a constant, such as 0 for numerical data or 'Unknown' for categorical data:

In [None]:
# Fill every missing value with the constant 666 (an arbitrary placeholder)

data_df.fillna(666, inplace=False)
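
When columns need different constants, `fillna` also accepts a dictionary that maps column names to fill values. This is a minimal sketch on made-up data; `people_df` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one categorical column
people_df = pd.DataFrame({
    'visits': [3.0, np.nan, 5.0],
    'city': ['Oslo', None, 'Lima'],
})

# A dict maps each column name to its own fill value
filled_df = people_df.fillna({'visits': 0, 'city': 'Unknown'})

print(filled_df)
print(people_df.isna().sum().sum())  # 2, the original is left unchanged
```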

**Filling with statistical values (Mean, Median, Mode)**

A common approach is to replace missing values with the mean, median, or mode (most frequent value) of the column. This is useful when the missing data is numerical, and you don’t want to distort the distribution too much.

**Mean Imputation**

In [None]:
# Fill missing values with the mean of the column

data_df['column 6'].fillna(data_df['column 6'].mean(), inplace=False)

**Median Imputation**

In [None]:
# Fill missing values with the median of the column

data_df['column 6'].fillna(data_df['column 6'].median(), inplace=False)

**Mode Imputation**

In [None]:
# Fill missing values with the most frequent value

data_df['column 7'].fillna(data_df['column 7'].mode()[0], inplace=False)
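
The cells above return a new Series and leave `data_df` untouched. To keep an imputed column, assign the result back. The sketch below shows that pattern on made-up data (`sales_df` is a hypothetical name), and also fills several columns at once by passing a Series of per-column statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical data
sales_df = pd.DataFrame({
    'units': [1.0, np.nan, 3.0, 100.0],
    'price': [2.0, 4.0, np.nan, 6.0],
})

# Assign the result back to keep the imputed column
sales_df['units'] = sales_df['units'].fillna(sales_df['units'].median())

# Passing a Series of per-column statistics fills each column with its own value
imputed_df = sales_df.fillna(sales_df.mean(numeric_only=True))

print(imputed_df)
```

Assigning back is generally a safer habit than calling `fillna(..., inplace=True)` on a single selected column, which may end up modifying a temporary copy rather than the DataFrame.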

**Important**

This strategy is useful when your dataset contains a significant amount of missing data, and the missing values are likely to be similar to other values in the same column. Choose the imputation method based on the type of data and domain knowledge.

However, be very mindful about how this can potentially distort the distribution of your data, and how this can affect any subsequent analysis!

____

## Considerations when dealing with missing data

*Understand the Cause*: Why is the data missing? Is it random, or does it represent a particular pattern (e.g., not applicable)? Understanding the reason for missing data can guide your strategy.

*Domain Knowledge*: The best imputation technique often depends on the context of the data. For example, in a dataset of exam scores, you might assume that missing scores should be filled with 0. For income data, it might make more sense to fill missing values with the median.

*Imputation Bias*: Be mindful of introducing bias through imputation. For example, filling all missing values with the mean might underestimate variability in the data.

*Testing Sensitivity*: If your data analysis is sensitive to missing values, you might want to test different imputation methods and compare results.
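
The last two points can be checked directly. As a rough sketch on made-up numbers (the Series `values` is hypothetical), the code below fills the same gaps three different ways and compares the resulting summary statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two gaps
values = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 90.0])

mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())
zero_filled = values.fillna(0)

# Compare how each choice changes the summary statistics
comparison = pd.DataFrame({
    'original': values.describe(),
    'mean': mean_filled.describe(),
    'median': median_filled.describe(),
    'zero': zero_filled.describe(),
})
print(comparison.loc[['count', 'mean', 'std']])
```

Mean imputation keeps the mean unchanged but shrinks the standard deviation, while filling with 0 pulls the mean down. If the conclusions of an analysis change across these variants, the choice of imputation method deserves a closer look.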

____

## Grouping data with groupby

Grouping data allows you to aggregate values based on one or more categorical variables. This is useful for summary statistics and gaining insights.

**Grouping and aggregating by a single column**

Let’s use an example similar to the Titanic dataset, where you might want to group data by categories such as sex or pclass.

In [None]:
import seaborn as sns 

# Seaborn has some toy datasets for learning: https://github.com/mwaskom/seaborn-data
# load_dataset downloads the data, so it needs an internet connection

titanic_df = sns.load_dataset("titanic")

titanic_df.head()

In [None]:
titanic_df.info()

In [None]:
# Median age of passengers by sex

titanic_df.groupby('sex')['age'].median()
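
A single groupby call can also compute several statistics at once with `agg`. The sketch below uses a small made-up table rather than the real Titanic data, so it runs without a download; `passengers_df` and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical passenger-like data (not the real Titanic dataset)
passengers_df = pd.DataFrame({
    'sex': ['female', 'male', 'female', 'male', 'male'],
    'age': [30.0, 40.0, 20.0, None, 60.0],
    'fare': [80.0, 10.0, 60.0, 15.0, 20.0],
})

# Named aggregation: new_column=(source_column, function)
summary_df = passengers_df.groupby('sex').agg(
    median_age=('age', 'median'),
    mean_fare=('fare', 'mean'),
    passengers=('fare', 'size'),
)
print(summary_df)
```

Note that the median skips the missing age, so the male median is computed from the two known ages only.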

**Grouping and aggregating by multiple columns**

You can group data by more than one column to perform multi-level aggregations.

In [None]:
# Median age and fare by sex and survival status

titanic_df.groupby(['sex', 'survived'])[['age', 'fare']].median()
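
Grouping by several columns produces a result with a multi-level index. The sketch below, again on made-up data (`trips_df` is a hypothetical name), shows a few ways to work with that result: looking up one group, flattening the index, and pivoting one level into columns.

```python
import pandas as pd

# Hypothetical passenger-like data
trips_df = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male'],
    'survived': [1, 0, 0, 0, 1],
    'age': [30.0, 50.0, 40.0, 60.0, 20.0],
})

grouped = trips_df.groupby(['sex', 'survived'])['age'].median()

# The result is indexed by (sex, survived) pairs
print(grouped.loc[('male', 0)])   # 50.0

# reset_index() turns the group keys back into ordinary columns
flat_df = grouped.reset_index()

# unstack() pivots the inner level (survived) into columns
wide_df = grouped.unstack()
print(wide_df)
```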

In [None]:

# Histogram of passenger ages, with one colour per sex
sns.histplot(data=titanic_df, x="age", bins=100, hue="sex")