### Dataset Visualizations

Datasets used:
* [Wine quality dataset](https://archive.ics.uci.edu/ml/datasets/Wine+Quality)
* [Speeddating dataset](https://www.openml.org/d/40536)

In [44]:
# All imports needed
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [33]:
# Read data from file
speeddating = pd.read_csv("data/speeddating/speeddating.csv") 
white_wine = pd.read_csv("data/wine/winequality-white.csv", sep=';') 
red_wine = pd.read_csv("data/wine/winequality-red.csv", sep=';') 
# ignore_index avoids duplicate row labels from the two source files
wine = pd.concat([red_wine, white_wine], ignore_index=True)

# show all columns when a frame is displayed, then check the dataset dimensions
pd.set_option('display.max_columns', 500)
speeddating.shape

(8378, 123)

Missing values in the Speeddating dataset. The raw file marks them with `?`, so these are converted to NaN first.

In [None]:
speeddating = speeddating.replace("?", np.nan)

# all rows with at least one NaN value, 7330 in total
null_speeddating = speeddating[speeddating.isnull().values.any(axis=1)]

# how many values missing per column, only if > 0
values_missing = speeddating.isna().sum()
values_missing[values_missing > 0]

# percentage of missing values per column, only if > 0
percent_missing = speeddating.isnull().sum() * 100 / len(speeddating)
percent_missing[percent_missing > 0]

# how many values missing in total, 18372
speeddating.isnull().sum().sum()

# visualize as heatmap, missing values are the light cells
# (no annot/linewidths: with 8378 x 123 cells they make the plot very slow and unreadable)
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(speeddating.isnull(), cbar=False, ax=ax)
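The count and the percentage per column can also be collected in a single table, which is easier to scan than two separate Series. The following is a minimal sketch, not part of the original notebook: the helper name `missing_summary` and the toy frame are hypothetical and only stand in for the Speeddating data.

```python
import numpy as np
import pandas as pd


def missing_summary(df, placeholder="?"):
    """Per-column missing counts and percentages, only for columns with gaps.

    Hypothetical helper for illustration; `placeholder` is the marker
    that the raw file uses for missing entries.
    """
    cleaned = df.replace(placeholder, np.nan)
    n_missing = cleaned.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": n_missing * 100 / len(cleaned),
    })
    # keep only columns that actually have missing values, worst first
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for the Speeddating data (values are made up)
toy = pd.DataFrame({
    "age": ["21", "?", "25", "?"],
    "field": ["law", "physics", "?", "art"],
    "match": [0, 1, 0, 1],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(speeddating)`, assuming the frame still contains the raw `?` markers or has already been converted to NaN.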

Showing that the wine quality dataset has no missing values.

In [43]:
# all rows with nan values, 0 in total
null_wine = wine[wine.isnull().values.any(axis=1)]
null_wine.shape

(0, 12)
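The same row filter is applied to both datasets above, so it could be wrapped in a small function. This is only a sketch under assumptions: `rows_with_missing` is a hypothetical name, and the two toy frames replace the real files so the example runs on its own.

```python
import numpy as np
import pandas as pd


def rows_with_missing(df):
    """Return only the rows that contain at least one NaN (hypothetical helper)."""
    return df[df.isna().any(axis=1)]


# Toy frames standing in for the real datasets (values are made up)
complete = pd.DataFrame({"pH": [3.2, 3.5], "quality": [5, 6]})
gappy = pd.DataFrame({"pH": [3.2, np.nan, 3.1], "quality": [5, 6, np.nan]})

print(rows_with_missing(complete).shape)  # (0, 2): no row has a gap
print(rows_with_missing(gappy).shape)     # (2, 2): two rows have a gap
```

An empty result, as for `complete` here and for `wine` above, means the frame has no missing values at all.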