# Exercise: Dataset Principles

Since models are nothing without data, it's important to make sure the fundamentals are strong when creating and shaping your datasets. Here we'll create a regression dataset and split it into the three core dataset types: train, validation, and test.

Your tasks for this exercise are:
1. Create a dataframe from the feature and target arrays returned by `make_regression`.
2. Create a 60% train / 20% validation / 20% test split using the `train_test_split` function.
3. Confirm the datasets are the correct size by outputting their shapes.
4. Save the three datasets to CSV.

In [2]:
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split



In [9]:
# Create a regression dataset with 1000 samples and 5 feature columns (2 of which are informative), plus 1 target column
regression_dataset = make_regression(
    n_samples=1000, n_features=5, n_informative=2, n_targets=1, random_state=0
)
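
To see which of the five features actually drive the target, `make_regression` can also return the ground-truth coefficients through its `coef=True` argument. The sketch below is an illustration rather than part of the exercise; the names `X`, `y`, `coef`, and `informative` are introduced here for the example.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same settings as the exercise, but also request the true linear coefficients
X, y, coef = make_regression(
    n_samples=1000, n_features=5, n_informative=2, n_targets=1,
    random_state=0, coef=True,
)

# Informative features have non-zero coefficients; the others are exactly zero
informative = np.flatnonzero(coef)
print(X.shape, y.shape, coef.shape)
print("Informative feature columns:", informative)
```

Knowing which columns are informative is handy later, for example when checking whether a fitted model has recovered them.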

In [10]:
df = pd.DataFrame(regression_dataset[0])
df["target"] = regression_dataset[1]
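
The dataframe above ends up with integer labels for the features and a string label for the target. One possible variation, sketched below, gives every column a string name. The `feature_0` ... `feature_4` names and the variables `features`, `target`, and `named_df` are hypothetical choices for this illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression

# make_regression returns a (features, target) tuple, so it can be unpacked
features, target = make_regression(
    n_samples=1000, n_features=5, n_informative=2, n_targets=1, random_state=0
)

# Hypothetical column names; string labels avoid mixing int and str labels
feature_names = [f"feature_{i}" for i in range(features.shape[1])]
named_df = pd.DataFrame(features, columns=feature_names)
named_df["target"] = target

print(named_df.columns.tolist())
```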

In [11]:
df.head()

|   | 0 | 1 | 2 | 3 | 4 | target |
|---|---|---|---|---|---|---|
| 0 | 0.236225 | -0.323289 | -0.018429 | -1.548471 | 1.311427 | 70.618083 |
| 1 | -0.801497 | 0.27117 | -0.525641 | -0.88778 | 0.936399 | 52.75787 |
| 2 | 0.687881 | 0.417044 | -1.203735 | 0.498727 | -0.737932 | -43.728456 |
| 3 | -0.679593 | -1.063433 | -1.797456 | 0.913202 | 2.211304 | 156.835125 |
| 4 | 0.096479 | -0.50706 | 0.522083 | 0.155794 | 1.520004 | 102.748706 |


In [12]:
# First split: hold out 20% of all rows as the test set (train: 0.8 | test: 0.2)
df_train, df_test = train_test_split(df,
                                    test_size=0.2,
                                    random_state=0)

# Second split: take 25% of the remaining 80% as validation,
# which is 0.25 * 0.8 = 0.2 of the original data (train: 0.6 | validation: 0.2)
df_train, df_val = train_test_split(df_train,
                                    test_size=0.25,
                                    random_state=0)

# Final dataset sizes: train: 0.6, validation: 0.2, test: 0.2
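
The two-step split can also be wrapped in a small helper. The version below is a minimal sketch, not the exercise's reference solution. `three_way_split` and its parameters are hypothetical names, and it passes absolute row counts to `train_test_split` (which accepts an integer `test_size`) instead of rescaling the second fraction by hand.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split


def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Hypothetical helper: split a DataFrame into train/validation/test parts."""
    # Convert fractions to absolute row counts so the second split does not
    # need a rescaled fraction such as 0.2 / 0.8 = 0.25
    n_val = round(len(data) * val_frac)
    n_test = round(len(data) * test_frac)
    rest, test = train_test_split(data, test_size=n_test, random_state=seed)
    train, val = train_test_split(rest, test_size=n_val, random_state=seed)
    return train, val, test


X, y = make_regression(n_samples=1000, n_features=5, n_informative=2, random_state=0)
demo = pd.DataFrame(X)
demo["target"] = y

train, val, test = three_way_split(demo)
print(len(train), len(val), len(test))  # 600 200 200

# The three index sets should not overlap and should cover every row
all_idx = train.index.union(val.index).union(test.index)
print(len(all_idx) == len(demo))  # True
```

Checking that the combined index covers every row exactly once is a cheap guard against leaking rows between the splits.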

In [13]:
# Output each shape to confirm the size of train/validation/test
print(f"Train: {df_train.shape}")
print(f"Validation: {df_val.shape}")
print(f"Test: {df_test.shape}")

Train: (600, 6)
Validation: (200, 6)
Test: (200, 6)

In [14]:
# Output all datasets to csv
df_train.to_csv("train_data.csv")
df_val.to_csv("val_data.csv")
df_test.to_csv("test_data.csv")
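
By default `to_csv` also writes the dataframe index as an unnamed first column. When loading a file back, passing `index_col=0` to `pd.read_csv` restores it as the index; otherwise it appears as an extra column labelled `Unnamed: 0`. The sketch below demonstrates this on a tiny stand-in dataframe written to a temporary directory. The `frame`, `restored`, and `naive` names are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for one of the splits, with a shuffled, non-contiguous index
frame = pd.DataFrame({"0": [0.1, 0.2], "target": [1.0, 2.0]}, index=[787, 683])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_data.csv"
    frame.to_csv(path)  # the index is written as the first, unnamed column

    restored = pd.read_csv(path, index_col=0)  # first column becomes the index
    naive = pd.read_csv(path)                  # first column becomes a data column

print(restored.shape)         # (2, 2)
print(naive.columns.tolist())  # ['Unnamed: 0', '0', 'target']
```

If the original row positions are not needed downstream, writing with `index=False` is an alternative that avoids the extra column altogether.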