# Ready for your first Kaggle competition?

Kaggle is a popular platform that hosts machine learning competitions.

The platform helps users to interact via forums and shared code, fostering both collaboration and competition.

1. Go to the Kaggle competition [website](https://www.kaggle.com/competitions).
2. Register for an account (it's free).
3. Find the __House Prices - Advanced Regression Techniques__ competition.
4. Go to the Data tab, read the description, download the data.

## Inspect the data

Use the pandas Python package to read the CSV files and inspect the data:
* How many examples are there?
* How many features are there?
* Are there non-numerical values? If so, how do you handle them?
* Are there NaNs? If so, how do you handle them?

In [34]:
import pandas as pd

In [35]:
# Load the dataset
df = pd.read_csv('../data/house-prices-advanced-regression-techniques/train.csv')
# Display how many examples and features are in the dataset
print(f"The dataset contains {df.shape[0]} examples and {df.shape[1]} features.")
# Non-numerical values in the dataset
non_numerical = df.select_dtypes(exclude=['number'])
print(f"The dataset contains {non_numerical.shape[1]} non-numerical features:")
print(non_numerical.columns.tolist())

The dataset contains 1460 examples and 81 features.
The dataset contains 43 non-numerical features:
['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
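
The cell above answers the first three questions. The question about NaNs can be checked in a similar way. The following is a minimal sketch on a toy frame; the frame `toy` and its values are made up for illustration and only mimic column names from the real dataset. On the real data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (hypothetical values)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Count missing values per column and keep only the columns that have any
nan_counts = toy.isna().sum()
nan_counts = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(nan_counts)
```

Columns with many missing values are candidates for a dedicated "missing" category (categorical features) or for imputation (numerical features).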


## Class to load the Training, Validation and Test sets.
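Before reading the class, it may help to see the preprocessing steps it relies on (mean imputation, standardization, one-hot encoding with a NaN indicator) on a toy example. This is only a sketch under assumptions: the frames `train` and `test` and their values are made up, and the real columns are far more numerous.

```python
import numpy as np
import pandas as pd

# Toy train and test frames with one numeric and one categorical column
train = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0],
                      "Alley": ["Grvl", np.nan, "Pave"]})
test = pd.DataFrame({"LotArea": [9550.0, 14260.0],
                     "Alley": [np.nan, "Grvl"]})

# Concatenate so that both sets share statistics and dummy columns
both = pd.concat([train, test], ignore_index=True)

# Numeric columns: fill NaN with the column mean, then standardize
num_cols = both.select_dtypes(include=["number"]).columns
both[num_cols] = both[num_cols].fillna(both[num_cols].mean())
both[num_cols] = (both[num_cols] - both[num_cols].mean()) / both[num_cols].std()

# Categorical columns: one-hot encode, with an extra indicator column for NaN
both = pd.get_dummies(both, dummy_na=True, dtype=float)

# Split back into the original train and test rows
train_x, test_x = both[:len(train)], both[len(train):]
print(both.columns.tolist())
```

Encoding the concatenated frame is what guarantees that the train and test sets end up with the same columns, even when a category appears in only one of them.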

In [36]:
import torch
from d2l import torch as d2l

class KaggleHouse(d2l.DataModule):
    def __init__(self, batch_size, train=None, val=None):
        super().__init__()
        self.save_hyperparameters()
        if self.train is None:
            # read the csv files:
            self.raw_train = pd.read_csv('../data/house-prices-advanced-regression-techniques/train.csv')
            self.raw_test = pd.read_csv('../data/house-prices-advanced-regression-techniques/test.csv')

    def preprocess(self, train_frac=0.8):
        """All the things you noticed about the data that need preprocessing
           can be addressed here.
        """
        label_col_name = "SalePrice"
        features_train = self.raw_train.drop(columns=['Id', label_col_name])
        features_test = self.raw_test.drop(columns=['Id'])
        # Keep the labels separately; they share the row index of the training features
        self.labels = self.raw_train[label_col_name]

        # Handle NaN in numerical variables
        all_features = pd.concat([features_train, features_test], ignore_index=True)
        numeric_features = all_features.select_dtypes(include=['number'])
        numeric_features = numeric_features.fillna(numeric_features.mean())
        # Standardize numerical variables
        numeric_features = (numeric_features - numeric_features.mean()) / numeric_features.std()
        all_features[numeric_features.columns] = numeric_features
        # Handle categorical features (dtype=float keeps every column numeric for torch)
        all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)
        # Inspect the dataset at every step
        n_train = features_train.shape[0]
        final_train = all_features[:n_train]
        final_test = all_features[n_train:]

        self.train = final_train
        self.val = final_train.sample(frac=1 - train_frac, random_state=42)
        # NOTE: the validation rows are sampled from the training rows, so the two sets overlap.
        # For a disjoint split, use self.train = final_train.drop(self.val.index).
        self.test = final_test
        print('Train shape:', self.train.shape)
        print('Val shape:', self.val.shape)
        print('Test shape:', self.test.shape)

        # Sanity check: train and test must have the same number of features.
        assert self.train.shape[1] == self.test.shape[1]

    def get_dataloader(self, train):
        """Return a data loader over (features tensor, labels tensor reshaped to (-1, 1)).
           Note: all the examples need to be tensors, so pass the numpy arrays to torch.tensor.
           Note: the labels are the base-10 logarithm of the prices."""
        data = self.train if train else self.val
        if data is None:
            raise ValueError("Call preprocess() before requesting a data loader.")

        features = torch.tensor(data.values, dtype=torch.float32)
        # Look up the prices of the selected rows and take their log10
        prices = torch.tensor(self.labels.loc[data.index].values, dtype=torch.float32)
        labels = torch.log10(prices).reshape(-1, 1)
        return self.get_tensorloader((features, labels), train)

In [37]:
data = KaggleHouse(batch_size=64)

In [38]:
# Insert some prints in the preprocess function so that you can verify everything is as expected
data.preprocess()

Train shape: (1460, 330)
Val shape: (292, 330)
Test shape: (1459, 330)


In [40]:
# Ensure the data is preprocessed before testing the data loader
data.preprocess()

# Test the data loader: check features and labels dimensions.
data.get_dataloader(train=True)

Train shape: (1460, 330)
Val shape: (292, 330)
Test shape: (1459, 330)

## Training

In [None]:
# Here you could define your own regression model. 
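
As one possible starting point for your own model, here is a minimal sketch in plain PyTorch, independent of the d2l helpers used elsewhere in this notebook. Everything in it is an assumption made for illustration: the synthetic data stands in for the preprocessed features and log-price labels, and the layer sizes, optimizer and learning rate are arbitrary choices.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in for the preprocessed features and the log-price labels
X = torch.randn(256, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(256, 1)

# Hypothetical model: a small network with one hidden layer
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Full-batch training loop that records the loss at every step
losses = []
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

To plug a model like this into `d2l.Trainer`, it would need to be wrapped in a `d2l.Module` subclass that defines the loss and the optimizer.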

In [None]:
# This function is complete: if you have done everything correctly, it should work without modifying anything
def your_training(trainer, data, lr=0.01):
    # Get the training dataloader
    train_loader = data.get_dataloader(train=True)

    model = d2l.LinearRegression(lr) # Initialize the model
    model.board.yscale='log'         # plot the loss on a log scale

    trainer.fit(model, data)         # fit model to data

    return model                     # return the model

In [None]:
# define the trainer (we can use the built in d2l.Trainer)
trainer = d2l.Trainer(max_epochs=20)
your_model = your_training(trainer, data, lr=0.01)




## Evaluate your model on the Test set

In [None]:
# "data" also contains the preprocessed test set
testset = data.test.values.astype('float32')
your_predictions = your_model(torch.tensor(testset, dtype=torch.float32))
# NOTE: we trained the model to predict the log of the labels.
preds_exp = 10**your_predictions.detach().numpy()
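
The reason for working with log prices is that, as far as the competition's evaluation page describes it, submissions are scored by the root-mean-squared error between the logarithm of the predicted price and the logarithm of the observed price, so relative errors count equally for cheap and expensive houses. The sketch below uses made-up prices to show that `10 **` inverts a base-10 logarithm and how such a score can be computed; the helper `log_rmse` is a hypothetical name, not part of any library.

```python
import numpy as np

# Toy prices (hypothetical values)
prices = np.array([100_000.0, 200_000.0, 400_000.0])

# Labels are log10(price); 10 ** label recovers the price
log_prices = np.log10(prices)
recovered = 10 ** log_prices

# Hypothetical predictions, each 10% too high
preds = prices * 1.1

def log_rmse(pred, true):
    """Root-mean-squared error between log predictions and log targets."""
    return np.sqrt(np.mean((np.log(pred) - np.log(true)) ** 2))

print(log_rmse(preds, prices))
```

Because every prediction is off by the same factor, the score equals the logarithm of that factor regardless of the price level.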

## Now save your predictions in a csv file

Read the required submission format carefully and create the CSV file accordingly.

The file needs two comma-separated columns, 'Id' and 'SalePrice'.

In [None]:
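# Create the predictions dataset: one row per test example, with its Id and predicted SalePrice
submission = pd.DataFrame({'Id': data.raw_test['Id'], 'SalePrice': preds_exp.reshape(-1)})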

submission.to_csv('./solutions/my_submission_solution.csv', index=False)
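
To check the file format before submitting, you can build a small frame and inspect the CSV text. This is a sketch with made-up ids and prices; writing to an in-memory buffer is only a convenience for looking at the result without creating a file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical test ids and predicted prices
ids = pd.Series([1461, 1462, 1463], name="Id")
preds = np.array([[120000.0], [155000.0], [180000.0]])  # shape (n, 1), as a model would return

# Flatten the predictions so that each row holds a single price
submission_demo = pd.DataFrame({"Id": ids, "SalePrice": preds.reshape(-1)})

# Write to an in-memory buffer instead of a file to inspect the result
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` matters: without it pandas writes the row index as an extra unnamed column, which does not match the two-column format.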

## Submit your prediction to the Kaggle competition and see your score!