# Importing Required Libraries

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

In [None]:
# Step 1: Import the data
file_path = "cw1data.csv"
df = pd.read_csv(file_path)

df.head()

The `df.info()` output (cell 3) shows that the dataset has 135 rows and 14 columns. Every column contains 135 non-null values, so there are no missing values. Thirteen of the 14 columns are of type `float64`.

In [None]:
# Step 2: Display basic info
print("Dataset Info:")
df.info()

In [None]:
# Count missing values in each column
print("\nMissing Values:\n", df.isnull().sum())

# Summary statistics
df.describe()
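
The shape and missing-value observations above can also be checked programmatically rather than read off the printed output. The following is a minimal sketch, not part of the original notebook: `check_dataset` is a hypothetical helper, and the `demo` frame is synthetic stand-in data (13 float columns plus one integer column is an assumption made only for illustration).

```python
import numpy as np
import pandas as pd


def check_dataset(frame: pd.DataFrame, expected_shape: tuple) -> dict:
    """Return a small report on shape, missing values, and column dtypes."""
    return {
        "shape_ok": frame.shape == expected_shape,
        "missing_total": int(frame.isnull().sum().sum()),
        "dtype_counts": frame.dtypes.astype(str).value_counts().to_dict(),
    }


# Synthetic stand-in for the real data (assumed layout, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(135, 13)),
    columns=[f"x{i}" for i in range(1, 14)],
)
demo["y"] = rng.integers(0, 10, size=135)

report = check_dataset(demo, expected_shape=(135, 14))
print(report)
```

With the real data, the same call would be `check_dataset(df, expected_shape=(135, 14))`.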

# Distribution Analysis

The next cell plots a histogram for every numerical feature in the dataset. Inspecting these distributions reveals skewness, multiple modes, and possible outliers that may affect later analysis or model performance.

In [None]:
from plotly.subplots import make_subplots

# Select only numerical columns
numeric_cols = df.select_dtypes(include=['number']).columns

# Lay the histograms out on a grid with 3 columns per row
num_features = len(numeric_cols)
cols = 3  # Number of columns per row
rows = -(-num_features // cols)  # Ceiling division: enough rows for every feature

fig = make_subplots(rows=rows, cols=cols, subplot_titles=list(numeric_cols))

# Add histograms for each feature
for i, feature in enumerate(numeric_cols):
    row = (i // cols) + 1
    col = (i % cols) + 1
    fig.add_trace(
        go.Histogram(x=df[feature], nbinsx=20, name=feature, 
                     marker=dict(color=px.colors.qualitative.Prism[i % len(px.colors.qualitative.Prism)])), 
        row=row, col=col
    )

# Update layout
fig.update_layout(
    title="Feature Distributions",
    height=800, width=1200, 
    showlegend=False,
    template="plotly_white"
)

# Show the plot
fig.show()

The histogram grid shows the distribution of each feature, including its spread and central tendency. Some features, such as `x5` and `x6`, appear approximately normal, while others, such as `x1` and `x2`, are skewed. Features such as `y`, `x3`, and `x12` have multiple peaks, which may indicate distinct subgroups in the data. Knowing these distributions helps in detecting skewness and outliers and in deciding whether any features need transforming before modeling.
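
Visual impressions of skewness can be backed by a number. The sketch below is an illustration rather than part of the original analysis: `flag_skewed` is a hypothetical helper, the cutoff of 1.0 is a common rule of thumb and not a value taken from this notebook, and the `demo` data is synthetic.

```python
import numpy as np
import pandas as pd


def flag_skewed(frame: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """Compute sample skewness per numeric column and flag |skewness| > threshold."""
    skew = frame.select_dtypes(include="number").skew()
    out = pd.DataFrame({"skewness": skew, "skewed": skew.abs() > threshold})
    # Most strongly skewed features first
    return out.sort_values("skewness", key=np.abs, ascending=False)


# Synthetic example: one symmetric column and one right-skewed column
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.exponential(size=2000),
})

skew_report = flag_skewed(demo)
print(skew_report)
```

Applied to the notebook's data, `flag_skewed(df)` would give a ranked list to compare against the histograms.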

# Feature Correlations

The next cell computes the pairwise correlation matrix and displays it as an interactive heatmap, showing how the features relate to one another and to the target `y`.

In [None]:
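# Compute the Pearson correlation matrix (numeric columns only,
# so a non-numeric column cannot raise an error in recent pandas versions)
corr_matrix = df.corr(numeric_only=True)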

# Create an interactive correlation heatmap
fig_heatmap = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.index,
    colorscale='RdBu',  
    colorbar=dict(title="Correlation"),
    zmin=-1, zmax=1,
    text=corr_matrix.round(2).astype(str).values,  
    texttemplate="%{text}",  
    hoverinfo="z+text"
))

fig_heatmap.update_layout(
    title="Feature Correlation Heatmap",
    xaxis_title="Features",
    yaxis_title="Features",
    width=900,
    height=700
)

fig_heatmap.show()

The heatmap highlights several relationships in the dataset. `x4`, `x5`, `x6`, `x7`, and `x8` are strongly correlated with one another, which suggests redundancy and a risk of multicollinearity in linear models. `x3` through `x8` are negatively correlated with `y`: as these features increase, `y` tends to decrease. `x1` and `x2` are positively correlated with `y`, and `x2` shows the strongest positive correlation, making it a promising predictor. `x10`, `x11`, `x12`, and `x13` show weaker correlations, suggesting they behave more independently of the rest of the dataset. Note that correlation measures linear association only and does not establish influence. These observations inform feature selection and help avoid passing redundant information to the models.
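
Reading redundant pairs off a heatmap gets harder as the number of features grows, so it can help to list them explicitly. The following is a minimal sketch under stated assumptions: `high_correlation_pairs` and `target_correlations` are hypothetical helpers, the 0.8 threshold is an arbitrary illustrative cutoff, and the `demo` data is synthetic rather than the notebook's dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def high_correlation_pairs(frame: pd.DataFrame, target: str,
                           threshold: float = 0.8) -> pd.DataFrame:
    """List feature pairs (target excluded) with |Pearson r| >= threshold."""
    corr = frame.drop(columns=[target]).corr(numeric_only=True)
    rows = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) >= threshold
    ]
    rows.sort(key=lambda item: abs(item[2]), reverse=True)
    return pd.DataFrame(rows, columns=["feature_a", "feature_b", "r"])


def target_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of every feature with the target, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=np.abs, ascending=False)


# Synthetic example: b nearly duplicates a, c is independent of both
rng = np.random.default_rng(1)
a = rng.normal(size=500)
c = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": c,
})
demo["y"] = 2 * a - c + rng.normal(scale=0.5, size=500)

pairs = high_correlation_pairs(demo, target="y")
ranks = target_correlations(demo, target="y")
print(pairs)
print(ranks)
```

On the notebook's data the calls would be `high_correlation_pairs(df, target="y")` and `target_correlations(df, target="y")`. Pairs that exceed the threshold are candidates for dropping one member or combining them before fitting a linear model.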

# Data Splitting