## Pandas: Data Manipulation & Handling Missing Data

In this lesson, we'll cover foundational techniques in Pandas for data manipulation: creating and modifying DataFrames, detecting and handling missing data (by dropping or filling values), and grouping data for aggregation.

_____

In [84]:
import numpy as np
import pandas as pd

import random

**Let's create a DataFrame of random data**

In [None]:
data_dict = {f'column {i}': [random.randint(0, 100) for _ in range(10)] for i in range(10)}

data_dict

In [None]:
data_df = pd.DataFrame(data_dict)

data_df
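Because `random` is not seeded above, every run produces different numbers. The sketch below shows one way to build a comparable frame reproducibly with NumPy's seeded generator. The names `rng`, `values` and `seeded_df` and the seed value 42 are assumptions chosen for illustration, not part of the lesson's code.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same numbers on every run
rng = np.random.default_rng(seed=42)

# integers() excludes the upper bound, so 101 gives values in 0..100,
# matching random.randint(0, 100), which includes both ends
values = rng.integers(0, 101, size=(10, 10))

columns = [f'column {i}' for i in range(10)]
seeded_df = pd.DataFrame(values, columns=columns)

print(seeded_df.head())
```

Passing a 2D array plus a list of column names is an alternative to the dictionary of lists used above; both give a 10 x 10 frame.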

## Modifying data in a DataFrame

In [None]:
data_df.info()

In [None]:
data_df.iloc[3,5] = 14

data_df

In [None]:
data_df.loc[2, 'column 7'] = -100

data_df

In [None]:
# changing multiple values at once

data_df.iloc[:3, :3] = 0

data_df

**Using iloc and loc to modify values based on conditions**

In [None]:
data_df['column 0'].values

In [None]:
data_df

In [93]:
data_df.iloc[(data_df['column 0'] > 50).values, [7, 8]] = 9999

In [None]:
data_df

In [None]:
data_df.loc[data_df['column 1'] > 80, ['column 4', 'column 6']] = 1234567890

data_df
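The difference between the two conditional assignments above is easier to see on a tiny frame with known values. This is a minimal sketch; `scores_df` and its columns are hypothetical and only serve the illustration.

```python
import pandas as pd

# Small hypothetical frame with known values
scores_df = pd.DataFrame({
    'math': [35, 72, 90, 48],
    'physics': [60, 55, 81, 20],
    'passed': [False, False, False, False],
})

# A comparison on a column yields a boolean Series, one value per row
high_math = scores_df['math'] > 50

# loc accepts the boolean Series directly, together with column labels
scores_df.loc[high_math, 'passed'] = True

# iloc is purely positional: it needs a plain boolean array
# (hence .to_numpy() or .values) and integer column positions
low_physics = (scores_df['physics'] < 30).to_numpy()
scores_df.iloc[low_physics, 1] = 0

print(scores_df)
```

In practice `loc` is usually the more readable choice for conditional updates, since it works with the mask and the column names as they are.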

_____

## Handling missing data in Pandas

In [None]:
data_df

Let's introduce some missing data

In [None]:
data_df.loc[2, 'column 6'] = None
data_df.loc[4, 'column 7'] = None

data_df

In [None]:
data_df.info()

In [None]:
data_df.isnull()

# data_df.isna() is an alias for isnull() and returns the same result

In [None]:
data_df.isnull().sum() # counts the True values per column (i.e. the missing values in each column)

In [None]:
data_df.isnull().sum().sum() # counts the total number of missing values

In [None]:
data_df.notnull()

# is exactly equivalent to
# ~data_df.isnull()
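The boolean frame returned by `isnull()` can be summarised in a few more ways than a plain count. The sketch below uses a small hypothetical frame (`demo_df`) with known gaps so the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one gap in 'a', two in 'b', none in 'c'
demo_df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 7.0, 8.0],
    'c': [9.0, 10.0, 11.0, 12.0],
})

# Number of missing values per column (True counts as 1)
missing_per_column = demo_df.isna().sum()

# Share of missing values per column: the mean of a boolean column
missing_share = demo_df.isna().mean()

# Rows that contain at least one missing value
rows_with_missing = demo_df[demo_df.isna().any(axis=1)]

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```

The share per column is often more useful than the raw count when deciding whether to drop or fill.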

______

## Strategies for handling missing data

## Strategy 1: Dropping missing data

When only a small share of the values is missing, or the affected rows or columns are not needed for the analysis, it's common to drop the rows (or, less often, the columns) that contain them.

In [None]:
# by default, drops the rows containing ANY missing values

data_df.dropna() # returns a new DataFrame; assign the result back (or pass inplace=True) for a persistent change

In [None]:
data_df

In [None]:
# drop columns with missing values

data_df.dropna(axis=1)
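Dropping every row with any missing value can be too aggressive. As a hedged illustration, the sketch below shows three `dropna()` parameters that narrow what gets dropped: `how`, `subset` and `thresh`. The frame `demo_df` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely missing, row 2 lacks only 'a'
demo_df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [2.0, np.nan, 6.0, 8.0],
    'c': [3.0, np.nan, 9.0, 12.0],
})

# Drop only the rows where ALL values are missing
all_missing_dropped = demo_df.dropna(how='all')

# Drop rows only if the listed columns contain missing values
subset_dropped = demo_df.dropna(subset=['b'])

# Keep rows that have at least 3 non-missing values
thresh_dropped = demo_df.dropna(thresh=3)

# None of the calls above changed demo_df; assign back to persist
cleaned_df = demo_df.dropna()

print(len(demo_df), len(all_missing_dropped), len(thresh_dropped))
```

Assigning the result to a new name, as with `cleaned_df`, keeps the original frame available for comparison.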

## Strategy 2: Filling missing data (imputation)

In other cases, it's more appropriate to fill the missing data with a value. This can be done with the `fillna()` method.

**Filling in with a constant value**

In [None]:
data_df

In [None]:
data_df.fillna(666)

**Filling in with statistical values (mean, median, mode)**

**mean**

In [None]:
data_df['column 7'].fillna(data_df['column 7'].mean()) # returns a new Series; assign it back to data_df['column 7'] for a persistent change

**median**

In [None]:
data_df['column 7'].fillna(data_df['column 7'].median())

**mode**

In [None]:
data_df['column 7'].fillna(data_df['column 7'].mode()[0])
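None of the `fillna()` calls above change `data_df`, since each returns a new object. A minimal sketch of making the fill persistent by assigning the result back is shown below; `demo_df` and its columns are hypothetical. It also shows why the mode is handy for non-numeric columns, where a mean or median is not defined.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column
demo_df = pd.DataFrame({
    'height': [150.0, np.nan, 170.0, 180.0],
    'city': ['Oslo', 'Oslo', None, 'Bergen'],
})

# Numeric column: fill with the median of the observed values
height_median = demo_df['height'].median()
demo_df['height'] = demo_df['height'].fillna(height_median)

# Text column: fill with the most frequent value.
# mode() returns a Series (there can be ties), so take the first entry.
most_common_city = demo_df['city'].mode()[0]
demo_df['city'] = demo_df['city'].fillna(most_common_city)

print(demo_df)
```

Assigning back to the column is generally preferable to `inplace=True` on a selected column, which may not update the parent frame.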

_____

## Grouping data with groupby

Grouping data allows us to perform aggregations on subsets of data. For example, we might want to calculate the average value of a column for each unique value in another column.
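To make the idea concrete before turning to a real dataset, here is a minimal sketch on a hypothetical frame (`sales_df`). It compares the grouped mean with the same number computed by hand for one group.

```python
import pandas as pd

# Hypothetical data: three sales in 'north', two in 'south'
sales_df = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south', 'north'],
    'amount': [10, 20, 30, 40, 20],
})

# Split the rows by region, select 'amount', then average each group
mean_by_region = sales_df.groupby('region')['amount'].mean()

# The same value for one group, computed manually with a boolean mask
north_mean = sales_df.loc[sales_df['region'] == 'north', 'amount'].mean()

print(mean_by_region)
print(north_mean)
```

The result of `groupby(...)[...].mean()` is a Series indexed by the group labels, so single groups can be looked up with `mean_by_region['north']`.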

In [None]:
import seaborn as sns

titanic_df = sns.load_dataset('titanic')

titanic_df.head()

In [None]:
titanic_df

In [None]:
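# 'who' takes the values 'man', 'woman' and 'child', so this mask selects women and children
not_man_mask = titanic_df['who'] != 'man'
male_mask = titanic_df['sex'] == 'male'

# combining both masks leaves the male children (boys)
titanic_df[not_man_mask & male_mask]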

In [None]:
titanic_df['survived'].value_counts()

In [None]:
# total average age

titanic_df['age'].mean()

In [None]:
sns.histplot(x='age', data=titanic_df, hue='sex')

In [None]:
titanic_df.groupby('sex')['age'].mean()

In [None]:
titanic_df.groupby('sex')['survived'].sum()

In [None]:
titanic_df['class'].value_counts()

In [None]:
titanic_df.groupby('class')['survived'].sum()

**Grouping by and aggregating by multiple columns**

In [None]:
titanic_df.groupby(['sex', 'survived'])['age'].mean()

In [None]:
titanic_df.groupby(['sex', 'survived'])[['age', 'fare']].mean()

In [None]:
titanic_df.groupby(['sex', 'class', 'survived'])[['age', 'fare']].mean()
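The examples above apply one aggregation function to all selected columns. When different columns need different functions, `agg()` with named aggregations is one option. The sketch below uses a small hypothetical frame (`passengers_df`) rather than the Titanic data, so it runs without downloading anything; the output column names are chosen for illustration.

```python
import pandas as pd

# Hypothetical passenger data, loosely modelled on the Titanic columns
passengers_df = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female', 'male', 'female'],
    'survived': [0, 1, 1, 1, 0, 0],
    'fare': [7.0, 70.0, 30.0, 50.0, 8.0, 12.0],
})

# Named aggregation: new_column=(source_column, function)
summary = passengers_df.groupby('sex').agg(
    passengers=('survived', 'size'),     # number of rows per group
    survival_rate=('survived', 'mean'),  # mean of 0/1 values is a rate
    mean_fare=('fare', 'mean'),
)

# reset_index() turns the group labels back into an ordinary column
summary_flat = summary.reset_index()

print(summary_flat)
```

Taking the mean of a 0/1 column such as `survived` gives the survival rate directly, which is often more informative than the sum when group sizes differ.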