# 1.0 Data Exploration
This notebook visualizes the features of the raw dataset using several kinds of plots.

## Imports and loading
Import necessary packages and load the raw data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px

In [None]:
# define numeric datatypes
NUMERICS = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

In [None]:
# Load the CSV file: uncomment the next line and replace <your_data> with your file name
#df = pd.read_csv('../data/raw/<your_data>.csv')

## Display the Dataset
Displays the dataset. For large datasets, pandas shows only the first and last rows.

In [None]:
df

## Statistics about the Data
Shows common meta information and statistics of the dataset, such as its shape, data types, summary statistics, and the number of missing values per column.

In [None]:
df.shape
# (#rows, #columns)

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
missing_values = df.isnull().sum()
print(missing_values)
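
The checks above can also be combined into one overview table. The following is a minimal sketch, assuming a small hypothetical DataFrame named `demo` (it is not part of the raw data); the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data with a few missing entries
demo = pd.DataFrame({
    'height': [1.8, np.nan, 1.6, 1.7],
    'weight': [80.0, 72.0, np.nan, np.nan],
    'label': ['a', 'b', None, 'a'],
})

# One row per column: datatype, missing count, missing share, distinct values
summary = pd.DataFrame({
    'dtype': demo.dtypes.astype(str),
    'missing': demo.isnull().sum(),
    'missing_pct': (demo.isnull().mean() * 100).round(1),
    'unique': demo.nunique(),  # nunique() ignores missing values by default
})

# Columns with the most missing values first
print(summary.sort_values('missing', ascending=False))
```

Replacing `demo` with `df` should give the same overview for the loaded dataset.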

## Visualize the Data

### Line Plot
Line plots are commonly used to visualize the trend of a variable over time or along another continuous, ordered axis. The plot connects consecutive data points with straight lines, making the overall trend or pattern easy to see. Here, the row index serves as the x-axis.

In [None]:
# Plot the first column as a line plot
plt.plot(df.iloc[:, 0], label='<Column 1>')
#plt.plot(df.iloc[:, 1], label='<Column 2>')

# Add labels and title
plt.xlabel('Index')
plt.ylabel('Feature(s)')
plt.title('Line Plot')

# Add a legend
plt.legend()

# Show the plot
plt.show()

### Histogram
Histograms are used to represent the distribution of a single variable and show the frequency of different values or ranges. The plot consists of bars where the height of each bar corresponds to the frequency of data within a specified bin or range.

In [None]:
# Plot histogram for the first column
plt.hist(df.iloc[:, 0], bins=10, alpha=0.5, label='<Column 1>')
#plt.hist(df.iloc[:, 1], bins=10, alpha=0.5, label='<Column 2>')

# Add labels and title
plt.xlabel('Feature Values')
plt.ylabel('Frequency')
plt.title('Histogram')

# Add a legend
plt.legend()

# Show the plot
plt.show()
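
To make the binning concrete, the sketch below computes the bar heights of a histogram directly with `np.histogram`, which uses the same binning convention as `plt.hist`. The sample values are hypothetical and chosen only so that the counts are easy to verify by hand.

```python
import numpy as np

# Hypothetical sample values
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range [1, 9] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

# Each bin is half-open [left, right), except the last one, which includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:.1f} to {right:.1f}: {count}')
```

The counts always sum to the number of non-missing values, so changing `bins` redistributes the same data over wider or narrower ranges.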

### Violin Plot
Violin plots are useful for visualizing the distribution of a variable or comparing the distributions of multiple variables. The plot combines aspects of box plots and kernel density estimation, providing insights into the distribution, quartiles, and probability density.

In [None]:
# Create a Violin trace for each column
traces = []
for column in df.columns:
    trace = go.Violin(y=df[column], name=column)
    
    # Show only the first six traces by default; the others start hidden
    # and can be toggled by clicking their legend entries
    if len(traces) > 5:
        trace.visible = 'legendonly'
    
    traces.append(trace)

# Create the layout
layout = go.Layout(title='Violin Plot of Columns', xaxis=dict(title='Columns'), yaxis=dict(title='Values'))

# Create the figure
fig = go.Figure(data=traces, layout=layout)

# Show the plot
fig.show()
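
The two ingredients of a violin plot, quartiles and a kernel density estimate, can be computed by hand. The sketch below is illustrative only: the sample values are hypothetical, `gaussian_kde_1d` is a helper written for this example, and the bandwidth of 0.8 is an arbitrary choice that may differ from what Plotly selects internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels over all samples and rescale by the bandwidth
    return kernel.mean(axis=1) / bandwidth

# Hypothetical sample values
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 7.0])

# Quartiles: the box-plot part of the violin
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Density curve: the outline of the violin
grid = np.linspace(values.min() - 3, values.max() + 3, 500)
density = gaussian_kde_1d(values, grid, bandwidth=0.8)

# A density should integrate to roughly 1 over a wide enough grid
area = density.sum() * (grid[1] - grid[0])
print(q1, median, q3, round(area, 3))
```

A smaller bandwidth produces a bumpier outline, while a larger one smooths out details of the distribution.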

### Correlation Map
Correlation heatmaps are used to visualize the correlation structure between numeric variables in a dataset. Each cell in the heatmap represents the correlation coefficient between two variables, which ranges from -1 to 1. Heatmaps often use a diverging color scale that runs from cool colors (e.g., blue) for negative correlations to warm colors (e.g., red) for positive correlations. Note that `px.imshow` uses a sequential color scale by default, in which lighter colors mark higher (more positive) correlations and darker colors mark lower (more negative) ones. To get a diverging scale centered on zero, pass for example `color_continuous_scale='RdBu_r'` together with `zmin=-1` and `zmax=1`.

In [None]:
# Select only numeric columns, since correlations can only be computed for these
cols_num = list(df.select_dtypes(include=NUMERICS).columns)
# Select the columns that you need
df_num = df[cols_num]

corr = df_num.corr()
fig = px.imshow(corr)
fig.show()
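
As a check on what each cell of the heatmap contains, the sketch below computes the Pearson correlation coefficient by hand and compares it with `DataFrame.corr()`, which uses Pearson correlation by default. The DataFrame `demo` and the helper `pearson` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: 'b' rises with 'a', 'c' falls with 'a'
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 5.9, 8.2, 9.8],
    'c': [5.0, 3.9, 3.1, 2.0, 1.2],
})

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

manual_ab = pearson(demo['a'], demo['b'])
manual_ac = pearson(demo['a'], demo['c'])

# The same numbers appear as cells of the correlation matrix
corr_demo = demo.corr()
print(corr_demo.round(3))
```

The diagonal of the matrix is always 1, since every variable is perfectly correlated with itself.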

### Scatter Plot
Scatter plots with color encoding are useful for visualizing the relationship between two variables, where the color represents a third variable. Data points are represented as markers, and the color of each marker encodes information about a third variable, providing insights into multivariate relationships.

In [None]:
# Select two columns for the scatter plot
x_column = '<Column 1>'
y_column = '<Column 2>'
# Optionally select a column for color encoding, e.g. a numeric class label
#color_column = '<Class Column>'

# Create a scatter plot trace
scatter_trace = go.Scatter(
    x=df[x_column],
    y=df[y_column],
    mode='markers',
    marker=dict(
        size=10,
#        color=df[color_column],  # Use the values from the ColorColumn for color encoding
#        colorscale='Viridis',  # You can choose a different colorscale if needed
#        colorbar=dict(title=color_column)
    ),
    name=f'{x_column} vs {y_column}'
)

# Create the layout
layout = go.Layout(title=f'Scatter Plot of {x_column} vs {y_column}', xaxis=dict(title=x_column), yaxis=dict(title=y_column))

# Create the figure
fig = go.Figure(data=[scatter_trace], layout=layout)

# Show the plot
fig.show()