### Exercise 2: Data Exploration

In this exercise we will learn more advanced methods for exploring datasets. In particular, we will print different kinds of summaries of a DataFrame, the unique values of a column, the number of missing values, and the correlation between variables. In addition, we will plot visualizations of the variables.

Import the packages and data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/adc-jaimier/PythonTraining/main/Data/Exercise2.csv')

Print a concise summary of a DataFrame.

In [None]:
df.info()

Generate descriptive statistics and analyse them. Which interesting information can you get from them?

In [None]:
df.describe()
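
By default `describe()` summarizes only the numeric columns. The sketch below uses a small hypothetical DataFrame (not the exercise data; the `GROUP` column is made up for illustration) to show how `include="all"` also reports `count`, `unique`, `top` and `freq` for text columns.

```python
import pandas as pd

# Hypothetical toy data, not the exercise dataset.
toy = pd.DataFrame({
    "AGE": [25, 40, 31, 58],
    "GROUP": ["a", "b", "a", "a"],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe()

# All columns: text columns additionally get unique/top/freq.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Comparing `mean` with the `50%` row (the median) is a quick way to spot skewed columns, and a `count` lower than the number of rows points to missing values.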

In [None]:
# Return the number of rows for each distinct value of the dependent variable Y.
df['Y'].value_counts()

In [None]:
# Return an array with all distinct elements in the AGE column.
df['AGE'].unique()

In [None]:
# Return the number of distinct elements per column
df.nunique()
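
The methods `nunique()`, `unique()` and `value_counts()` are easy to mix up. A minimal sketch on a hypothetical toy column (the values are invented for illustration) shows what each one returns and how missing values are treated:

```python
import pandas as pd

# Hypothetical toy column with one missing value.
s = pd.Series([0, 1, 1, 0, 1, None], name="Y")

# Number of distinct values; NaN is ignored by default.
n_distinct = s.nunique()

# Array of the distinct values themselves; NaN is included.
distinct_values = s.unique()

# Number of rows per distinct value; NaN is excluded by default.
counts = s.value_counts()

print(n_distinct)
print(distinct_values)
print(counts)
```

On a DataFrame, `df.nunique()` gives one number per column, whereas `df.value_counts()` counts distinct whole rows.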

In [None]:
# Return the count of smokers per value of the dependent variable Y
# (counts every SMOKER value within each Y group)
# Hint: use groupby in combination with an aggregation function
df.groupby(by='Y')['SMOKER'].value_counts()

In [None]:
# Average age within each group (caries or non-caries) using groupby
df.groupby(by='Y')['AGE'].mean()
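
The groupby pattern is "split by the grouping column, select the column of interest, then aggregate". The following sketch illustrates it on hypothetical toy data. It assumes that `SMOKER` is coded as 0/1, which may not match the exercise dataset.

```python
import pandas as pd

# Hypothetical toy data; SMOKER is assumed to be coded 0/1 here.
toy = pd.DataFrame({
    "Y": [0, 0, 1, 1, 1],
    "AGE": [30, 40, 50, 60, 70],
    "SMOKER": [0, 1, 1, 1, 0],
})

# Mean of one column per group.
mean_age = toy.groupby("Y")["AGE"].mean()

# With a 0/1 coding, the sum per group equals the number of smokers.
smokers = toy.groupby("Y")["SMOKER"].sum()

# Several aggregations at once, using named aggregation.
summary = toy.groupby("Y").agg(
    mean_age=("AGE", "mean"),
    n_smokers=("SMOKER", "sum"),
    n_rows=("AGE", "size"),
)

print(summary)
```

The column passed to `groupby` defines the groups, so grouping by `AGE` would produce one row per distinct age rather than one row per value of `Y`.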

In [None]:
# Return the number of missing data points (NaN) per column
df.isna().sum()
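
Absolute counts are hard to judge without knowing the number of rows. As a small sketch on hypothetical toy data, taking the mean of the boolean mask gives the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
toy = pd.DataFrame({
    "AGE": [25, np.nan, 31, 58],
    "N_CHECKS": [1, 2, np.nan, np.nan],
})

# isna() yields True/False per cell; True counts as 1 when summed.
missing_count = toy.isna().sum()

# The mean of a boolean column is the fraction of True values.
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share)
```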

### Visualization of variables

In [None]:
# Plot a histogram of the column Y
df['Y'].hist()

In [None]:
# Plot a histogram of any of the input columns
df['AGE'].hist()

In [None]:
# Plot a histogram of the column N_CHECKS per value of Y (0/1)
no_caries = df[df['Y']==0]
caries = df[df['Y']==1]

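# Series.min()/max() skip NaN values
min_checks = int(df['N_CHECKS'].min())
max_checks = int(df['N_CHECKS'].max())
# One bin per integer value; +2 so that the maximum gets its own bin
bins = range(min_checks, max_checks + 2)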

fig, ax = plt.subplots()
ax.hist(x=no_caries['N_CHECKS'], bins=bins, alpha=0.5, label='no caries')
ax.hist(x=caries['N_CHECKS'], bins=bins, alpha=0.5, label='caries')
ax.legend()

ax.set_xlabel('Nr. of checks')
ax.set_ylabel('Nr. of patients')
ax.set_title('Distribution of N_CHECKS per target variable')
plt.show()
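
The choice of bin edges matters for integer-valued columns. `ax.hist` computes its counts with `numpy.histogram`, where every bin is half-open except the last one, which also includes its right edge. The sketch below, on hypothetical values, shows that stopping the edges at the maximum merges the two largest values into one bin:

```python
import numpy as np

# Hypothetical integer-valued observations.
values = np.array([1, 2, 2, 4, 5, 5, 5])

# Edges 1..6: one bin per integer, the maximum (5) gets its own bin.
edges_ok = np.arange(values.min(), values.max() + 2)
counts_ok, _ = np.histogram(values, bins=edges_ok)

# Edges 1..5: the last bin is [4, 5], so 4 and 5 are counted together.
edges_short = np.arange(values.min(), values.max() + 1)
counts_short, _ = np.histogram(values, bins=edges_short)

print(counts_ok)     # [1 2 0 1 3]
print(counts_short)  # [1 2 0 4]
```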


### Extra exercises

In [None]:
# Correlation heatmap of all input variables
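# numeric_only=True skips non-numeric columns (required in pandas >= 2.0 if any exist)
correlations = df.corr(numeric_only=True)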

fig, ax = plt.subplots(figsize=(8,8))
im = ax.imshow(correlations)
fig.colorbar(im, orientation = 'vertical')

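# Set the tick positions first so that every label lines up with its row/column
ax.set_xticks(range(len(correlations.columns)))
ax.set_yticks(range(len(correlations.columns)))
ax.set_xticklabels(correlations.columns, rotation=90)
ax.set_yticklabels(correlations.columns)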
ax.set_title('Correlation matrix')
plt.show()
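
To help read the heatmap, here is a minimal sketch of what `corr()` returns, using hypothetical toy columns. It assumes a pandas version that supports the `numeric_only` argument of `DataFrame.corr` (1.5 or later).

```python
import pandas as pd

# Hypothetical toy data: y rises with x, z falls with x, label is text.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": ["a", "b", "c", "d"],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(numeric_only=True)

print(corr)
```

The result is a symmetric matrix with ones on the diagonal. Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.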

In [None]:
# Boxplot per group (Y=0 vs. Y=1) for any of the continuous input variables
fig, ax = plt.subplots()
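# Pass both groups in one call so the boxes are drawn side by side;
# NaN values are dropped because boxplot cannot handle them
ax.boxplot([no_caries['N_CHECKS'].dropna(), caries['N_CHECKS'].dropna()])
ax.set_xticklabels(['no caries', 'caries'])

ax.set_xlabel('Target variable Y')
ax.set_ylabel('Nr. of checks')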
ax.set_title('Boxplot of N_CHECKS per target variable')
plt.show()