## Pandas: Data Manipulation & Handling Missing Data

In this lesson, we'll cover foundational Pandas techniques for data manipulation: creating and modifying DataFrames, detecting missing data, and grouping data for aggregation. We'll also take a closer look at strategies for handling missing data.

---

In [None]:
import numpy as np
import pandas as pd

import random

**Let's create a DataFrame of random data**

In [None]:
data_dict = {f'column {i}': [random.randint(0, 100) for _ in range(10)]  for i in range(10)}

data_dict

In [None]:
data_df = pd.DataFrame(data_dict)

data_df

## Modify data in a DataFrame

In [None]:
data_df.info()

In [None]:
data_df.iloc[3, 5] = 180

data_df

In [None]:
data_df.loc[7, 'column 2'] = -220

data_df

In [None]:
# changing multiple values at once

data_df.iloc[:3, :3] = 0

data_df

**Using iloc and loc to modify values based on conditions**

In [None]:
(data_df['column 0'] > 50).values

In [None]:
data_df.iloc[(data_df['column 0'] > 50).values, [7,8]]

In [None]:
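# iloc is position-based, so pass the mask as a plain boolean array (.values), not a Series
data_df.iloc[(data_df['column 0'] > 50).values, [7, 8]] = -9999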

In [None]:
data_df

In [None]:
data_df.loc[data_df['column 1'] > 80, ['column 4', 'column 6']]

In [None]:
data_df.loc[data_df['column 1'] > 80, ['column 4', 'column 6']] = 123456789

data_df
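
A note on the two indexers used above: `.loc` accepts a boolean Series directly, while `.iloc` expects a plain boolean array. The sketch below illustrates the difference on a small hypothetical frame (`demo_df` and its columns are made up for illustration, not part of the lesson's random data).

```python
import pandas as pd

# Small hypothetical frame, used only for illustration
demo_df = pd.DataFrame({'a': [10, 60, 90], 'b': [1, 2, 3], 'c': [4, 5, 6]})

# Boolean Series, aligned on the index of demo_df
mask = demo_df['a'] > 50

# .loc is label-based and accepts the boolean Series as-is
demo_df.loc[mask, 'b'] = -1

# .iloc is position-based: convert the mask to a NumPy array first
demo_df.iloc[mask.values, 2] = -2

print(demo_df)
```

When the columns are easier to name than to count, `.loc` is usually the more readable choice.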

---

## Handling missing data in Pandas

In [None]:
data_df

In [None]:
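# cast to float first so the columns can hold NaN
# (newer pandas versions may warn when a missing value is set in an integer column)
data_df = data_df.astype({'column 6': float, 'column 7': float})

data_df.loc[2, 'column 6'] = None
data_df.loc[8, 'column 6'] = None
data_df.loc[4, 'column 7'] = None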

data_df

In [None]:
data_df.info()

In [None]:
data_df.isnull()

# data_df.isna()  # isna() is an alias for isnull()

In [None]:
data_df.isnull().sum()

# data_df.isnull().sum(axis=1)  # sums across each row instead, giving the number of missing values per row

In [None]:
data_df.isnull().sum().sum()

In [None]:
data_df.notnull()

# is exactly equivalent to
# ~data_df.isnull()



Note that a dataset can also contain dirty data that Pandas does not count as missing but that you still need to identify and fix. An empty string, for example, may be a placeholder rather than an intended value, and `isnull()` will not flag it.


In [None]:
data_df

In [None]:
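# 'column 0' holds integers; cast it to object first so it can store a string
# (newer pandas versions may warn about setting a value of an incompatible dtype otherwise)
data_df['column 0'] = data_df['column 0'].astype(object)
data_df.iloc[9, 0] = ''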

data_df

In [None]:
data_df.info()
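
One possible way to deal with such values is to convert them to real missing values so that the usual tools can see them. The sketch below assumes empty strings are the only placeholder in use; `demo_df` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where '' stands in for a missing value
demo_df = pd.DataFrame({'score': [10, '', 30], 'name': ['a', 'b', '']})

# Turn empty strings into NaN so isnull() can detect them
cleaned = demo_df.replace('', np.nan)

# The 'score' column was stored as generic objects; convert it back to numbers
cleaned['score'] = pd.to_numeric(cleaned['score'])

print(cleaned.isnull().sum())
print(cleaned.dtypes)
```

Whether an empty string really means "missing" depends on the dataset, so check the column's meaning before replacing anything.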

---

## Strategies for handling missing data

## Strategy 1: removing missing data

In some cases, especially when only a small share of the data is missing or the affected rows and columns are not needed for the analysis, it's acceptable to simply remove the rows or columns that contain missing values. This approach works best when you have plenty of data and can afford to lose some of it.


In [None]:
# by default, drops ALL rows containing at least one missing value

data_df.dropna()

In [None]:
# we can instead drop every COLUMN containing at least one missing value by passing axis=1 to dropna()

data_df.dropna(axis=1)

In [None]:
# data_df.dropna(inplace=True)   # inplace=True modifies data_df itself instead of returning a new DataFrame
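
Dropping every row with any missing value can be too aggressive. As a minimal sketch, the `subset`, `how`, and `thresh` parameters of `dropna()` give finer control; the frame `demo_df` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values spread over rows and columns
demo_df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, np.nan, 6.0, np.nan],
    'c': [9.0, np.nan, 11.0, 12.0],
})

# Only look at column 'b' when deciding which rows to drop
only_b = demo_df.dropna(subset=['b'])

# Drop a row only if ALL of its values are missing
all_missing = demo_df.dropna(how='all')

# Keep rows that have at least 2 non-missing values
at_least_two = demo_df.dropna(thresh=2)

print(only_b.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
```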

---

## Strategy 2: filling missing data (imputation)

In other cases, it's more appropriate to fill in the missing data with a chosen value. This can be done using the **fillna()** method, which returns a new object and leaves the original unchanged unless you assign the result back.

**Filling in with a constant value**

In [None]:
data_df

In [None]:
data_df.fillna(666)  # imputes 666 for missing values, WHEREVER they are in the dataset

**Limit imputation to a certain column**

In [None]:
data_df['column 7'].fillna(54321)

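# fillna() returns a new Series; assign it back to keep the result:
# data_df['column 7'] = data_df['column 7'].fillna(54321)
# (calling fillna(..., inplace=True) on a selected column is discouraged in recent pandas versions)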
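
When several columns need different fill values, `fillna()` also accepts a dictionary that maps column names to values. A minimal sketch on a hypothetical frame (`demo_df`, `x`, and `y` are made-up names):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns
demo_df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, np.nan]})

# Each key is a column name, each value the constant used for that column
filled = demo_df.fillna({'x': 0, 'y': -1})

print(filled)
```

The original `demo_df` is left untouched; only `filled` contains the imputed values.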

**Filling in with statistical values (mean, median, mode)**

**mean**

In [None]:
data_df['column 7'].fillna(data_df['column 7'].mean())

**median**

In [None]:
data_df['column 7'].fillna(data_df['column 7'].median())
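
**mode**

The mode works slightly differently from the mean and median: `mode()` returns a Series, since several values can be tied for most frequent. The sketch below, on a hypothetical Series `s`, simply takes the first one.

```python
import numpy as np
import pandas as pd

# Hypothetical Series where 7.0 is the most frequent value
s = pd.Series([3.0, 7.0, 7.0, np.nan, 1.0])

# mode() returns a Series (there may be ties); take the first entry
most_common = s.mode().iloc[0]

filled = s.fillna(most_common)

print(filled)
```

Mode imputation is typically most useful for categorical or discrete columns.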

---

## Grouping data with groupby

Grouping data allows us to perform aggregations on subsets of the data. For example, we might want to calculate the average value of one column for each unique value in another column.
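
To make the idea concrete before turning to a real dataset, here is a minimal sketch on a tiny hypothetical frame (`demo_df`, `team`, and `points` are made-up names): rows are split by the value in `team`, and the mean of `points` is computed within each group.

```python
import pandas as pd

# Hypothetical frame: two teams, two scores each
demo_df = pd.DataFrame({
    'team': ['red', 'blue', 'red', 'blue'],
    'points': [10, 20, 30, 40],
})

# Split by 'team', then average 'points' within each group
avg_points = demo_df.groupby('team')['points'].mean()

print(avg_points)
```

The result is a Series indexed by the group labels, here `blue` and `red`.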

In [None]:
import seaborn as sns

titanic_df = sns.load_dataset('titanic')

titanic_df.head()

In [None]:
titanic_df.info()

In [None]:
titanic_df['who'].unique()

In [None]:
child_mask = titanic_df['who'] == 'child'
male_mask = titanic_df['sex'] == 'male'


titanic_df[child_mask & male_mask].head(10)

In [None]:
titanic_df['survived'].value_counts()

In [None]:
# total average age

titanic_df['age'].mean()

In [None]:
sns.histplot(data=titanic_df, x='age', hue='sex')

In [None]:
titanic_df.groupby('sex')['age'].mean()

In [None]:
titanic_df.groupby('who')['age'].mean()

In [None]:
titanic_df.groupby('sex')['survived'].sum()

In [None]:
titanic_df['class'].value_counts()

In [None]:
titanic_df.groupby('class')['survived'].sum()

**Grouping by and aggregating by multiple columns**

In [None]:
titanic_df.groupby(['sex', 'survived'])['age'].mean()

In [None]:
titanic_df.groupby(['sex', 'survived'])[['age', 'fare']].mean()

In [None]:
titanic_df.groupby(['sex', 'class', 'survived'])[['age', 'fare']].mean()
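
The examples above apply a single aggregation at a time. As a sketch of one way to compute several at once, `agg()` supports named aggregations in the form `new_name=(column, function)`. The frame below is a small hypothetical stand-in for the Titanic data, so the numbers are made up.

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic-like rows
demo_df = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female'],
    'survived': [0, 1, 1, 1],
    'fare': [10.0, 30.0, 20.0, 50.0],
})

# One output column per named aggregation
summary = demo_df.groupby('sex').agg(
    survival_rate=('survived', 'mean'),   # mean of 0/1 values is the share that survived
    passengers=('survived', 'count'),
    mean_fare=('fare', 'mean'),
)

print(summary)
```

Because `survived` is coded as 0 or 1, its mean within a group is the group's survival rate, which is often easier to compare than the raw sum.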

In [None]:
titanic_df