# Data Exploration

## Setup

For our analysis, we shall require the following modules:

In [None]:
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression

from ipynb_utils.cfg import CFG
from ipynb_utils.utils import dump_df, load_data

We shall reload the retrieved data.

In [None]:
df = load_data("1--df_retrieved.pkl")

## Basic Inspection

As a preliminary inspection, let us invoke some common examination methods of the data frame.

In [None]:
# TODO: Remove this cell in the final version!

# Comprehensive Data Report
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Profiling Report")
profile

In [None]:
# Summary: non-null counts, types, memory
df.info()

In [None]:
# First 10 rows
df.head(10)

In [None]:
# Last 10 rows
df.tail(10)

In [None]:
# Random 10 rows
df.sample(10)

In [None]:
# Rows and columns (rows, cols)
df.shape

In [None]:
# Column names
df.columns

In [None]:
# Row index
df.index

In [None]:
# Data types of features
df.dtypes

In [None]:
# Descriptive statistics (numeric columns)
df.describe()

In [None]:
# Unique value count per column
df.nunique()

Sanitise the "outer" properties of the data frame, such as column names and data types, where required. Explain all operations that are either necessary or already obsolete.

In [None]:
# Parse the timestamp column into a proper datetime dtype
df["received_at"] = pd.to_datetime(df["received_at"])
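
As a hedged illustration of what such sanitising might look like, the following sketch normalises column names and casts data types. The frame `df_demo` and its columns `Page Views` and `Country` are hypothetical stand-ins, not taken from the retrieved data; only `received_at` appears in this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the retrieved data
df_demo = pd.DataFrame(
    {
        " Received At": ["2024-01-01 10:00:00", "2024-01-02 11:30:00"],
        "Page Views": ["3", "7"],
        "Country": ["DE", "FR"],
    }
)

# Normalise column names: trim, lower-case, whitespace runs -> underscore
df_demo.columns = (
    df_demo.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
)

# Cast each column to a more suitable dtype
df_demo = df_demo.astype({"page_views": "int64", "country": "category"})
df_demo["received_at"] = pd.to_datetime(df_demo["received_at"])

print(df_demo.dtypes)
```

Whether snake_case names or a `category` dtype are appropriate depends on the actual columns; the sketch only shows the mechanics.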

## Missing Values

Likewise, identify misencoded values.

In [None]:
# Count of missing values per column
df.isna().sum()

In [None]:
# Percentage of missing values per column
df.isna().mean() * 100
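
The count and the share of missing values may be easier to read side by side. The sketch below combines both into one table, sorted by the most affected columns; `df_demo` and `missing_summary` are hypothetical names used for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
df_demo = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": ["x", None, "z", "w"],
        "c": [1, 2, 3, 4],
    }
)

# One row per column: absolute count and percentage of missing values
missing_summary = pd.DataFrame(
    {
        "n_missing": df_demo.isna().sum(),
        "pct_missing": df_demo.isna().mean() * 100,
    }
).sort_values("pct_missing", ascending=False)

print(missing_summary)
```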

In [None]:
msno.matrix(df)
# plt.show()

In [None]:
msno.heatmap(df)
# plt.show()
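
Misencoded values, as mentioned at the start of this section, are often placeholders such as empty strings or `"n/a"` that `isna()` does not detect. The following is a minimal sketch of how they might be converted into proper missing values; the sentinel list and the frame `df_demo` are assumptions and would need to be adapted to the actual data.

```python
import pandas as pd

# Hypothetical toy frame in which missing values hide behind placeholders
df_demo = pd.DataFrame(
    {
        "age": ["34", "n/a", "51", "-"],
        "city": ["Berlin", "", "Paris", "unknown"],
    }
)

# Assumed placeholder strings; adapt to what the data actually contain
SENTINELS = ["", "-", "n/a", "unknown"]

# Strip surrounding whitespace, then blank out every sentinel
stripped = df_demo.apply(lambda s: s.str.strip())
df_clean = stripped.mask(stripped.isin(SENTINELS))

# With the placeholders gone, the column can be parsed as numeric
df_clean["age"] = pd.to_numeric(df_clean["age"])

print(df_clean.isna().sum())
```

Before the conversion, `df_demo.isna().sum()` reports no missing values at all, which is why such placeholders are worth checking for explicitly.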

Explain all operations that are either necessary or already obsolete.

## Final Overview of the Polished Data

As the data have now been fully processed, we shall examine the pairwise distributions.

In [None]:
# Columns to exclude from the overview plots (none by default)
cols_blacklist = []
cols = [col for col in df.columns if col not in cols_blacklist]
df_subset = df[cols]

sns.pairplot(df_subset, diag_kind="kde")
plt.suptitle("Pairplot", y=1.02)  # lift the title above the top row of panels
plt.show()

Furthermore, we shall survey the correlation matrix:

In [None]:
corr = df_subset.corr(numeric_only=True)

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f")
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()
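
The setup cell imports `mutual_info_regression`, although no cell in this notebook uses it. Should a measure of non-linear dependence be wanted alongside the Pearson correlation, a sketch along the following lines may serve. The synthetic frame `df_demo` and its column `target` are assumptions for illustration; the real data may have no obvious target column.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: "target" depends on "x" quadratically, "noise" is unrelated
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1, 1, n)
df_demo = pd.DataFrame(
    {
        "x": x,
        "noise": rng.normal(size=n),
        "target": x**2 + rng.normal(scale=0.05, size=n),
    }
)

features = df_demo[["x", "noise"]]

# Estimated mutual information between each feature and the target
mi = pd.Series(
    mutual_info_regression(features, df_demo["target"], random_state=0),
    index=features.columns,
).sort_values(ascending=False)

# Pearson correlation for comparison; it captures linear dependence only
pearson = features.corrwith(df_demo["target"])

print(pd.DataFrame({"mutual_info": mi, "pearson": pearson}))
```

In this constructed case the quadratic relationship is largely invisible to the Pearson correlation, whereas the mutual information estimate ranks `x` well above `noise`.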


---

As the final step in this notebook, we store the data frame, which now contains all our modifications, to disk.

In [None]:
dump_df(df, "df_retrieved", ["pkl", "csv"])