# (0) Naming Conventions.
- `full_data` refers to a dataframe containing the full training set, i.e. the result of something like `full_data = pd.read_csv('/blah/train.csv')`.
- `X, y, valid_X, valid_y` refer to the training and validation inputs and outputs.

# (1) Combining Training and Test data sets.

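<font color=red> Warning: This is probably bad form because it strongly encourages train-test contamination: any statistic computed on the joined frame (a mean used to fill missing values, for example) carries information from the test rows into training. </font>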

A standard scenario is the following.
In the Titanic problem, the training data contains the feature 'Survived', which the test data does not. This should be the only difference in the columns of the two data frames. Rather than working with each frame separately (to fill missing values etc.), it's much cleaner to join them, tidy in one sweep, and then separate them again before fitting the model.

In [2]:
import pandas as pd

train = pd.read_csv('./input/titanic/train.csv')
test  = pd.read_csv('./input/titanic/test.csv')

print("Before joining:")
print('train shape:', train.shape)
print('test shape: ', test.shape, '\n')

# Save this column for later.
survived = train['Survived']
train = train.drop('Survived', axis=1)

# Join frames.
titanic = pd.concat([train, test], axis=0, sort=False)

print("After joining:")
print("titanic shape: ", titanic.shape, '\n')

# Go and tidy the data...
# ...

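# Split it back to train and test.
# The saved target has one entry per training row, so its length marks the boundary.
n_train = len(survived)
train = titanic.iloc[:n_train]
test  = titanic.iloc[n_train:]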

print("After Splitting:")
print('train shape:', train.shape)
print('test shape: ', test.shape, '\n')

# Prepare for loading into model
x_train = train
y_train = survived
x_test  = test

# model = SomeModel()
# model.fit(x_train, y_train)
# model.score(x_train, y_train)
# y_prediction = model.predict(x_test) 
# etc
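
The contamination warning in this section can be made concrete. The sketch below is one possible way to keep the tidy step leak-free: the fill values are learned from the training rows only and then applied to both frames. The two small frames are made up for illustration and only borrow the `Age` and `Fare` column names from the Titanic data.

```python
import pandas as pd

# Tiny stand-in frames (assumed values, not the real Titanic files).
train = pd.DataFrame({"Age": [22.0, None, 30.0], "Fare": [7.25, 71.28, 8.05]})
test = pd.DataFrame({"Age": [None, 40.0], "Fare": [7.90, None]})

# Learn the fill values from the training rows only.
fill_values = train.mean()

# Apply the same training-derived values to both frames.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```

The test frame is still tidied in the same way as the training frame, but nothing measured on the test rows influences the values used.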

# (2) Conforming to Submission Format.
An easy way to conform to the submission format is to load the sample submission and replace the target feature.

In [4]:
# Load sample submission.
# submission = pd.read_csv('./output/sample_submission.csv')

# Replace target with the model's predictions.
# submission['target'] = model.predict(test_vectors)

# Save submission.
# submission.to_csv('./output/submission.csv', index=False)
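
The same steps can be tried without any files on disk. This is a minimal sketch in which an in-memory string stands in for the sample submission; the `id` and `target` column names and the prediction values are assumptions, so check them against the actual competition files.

```python
import io
import pandas as pd

# Stand-in for sample_submission.csv ('id' and 'target' are assumed column names).
sample_csv = io.StringIO("id,target\n0,0\n1,0\n2,0\n")
submission = pd.read_csv(sample_csv)

# Hypothetical predictions, one per row of the sample submission.
predictions = [1, 0, 1]
assert len(predictions) == len(submission)

# Replace the target column and leave everything else untouched.
submission["target"] = predictions

# index=False keeps the pandas row index out of the output.
output = submission.to_csv(index=False)
print(output)
```

Checking the length before assigning catches the common mistake of predicting on the wrong frame.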

# (3) Models Relying on Random State.
According to the *Intro to Machine Learning* tutorial on Kaggle, it is best practice to declare the seed when instantiating a model, if applicable. This makes the results repeatable.

In [1]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=1)
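
As a rough check of what the seed buys, the sketch below fits the same model twice on a small made-up data set and compares the predictions. The `max_depth=1` and `max_features=1` settings are assumptions added here so that the random choice of feature actually affects the tree.

```python
from sklearn.tree import DecisionTreeRegressor

# Small made-up data set, purely for illustration.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0]

# max_features=1 makes each split consider a randomly chosen feature,
# so the fitted tree depends on the random state.
first = DecisionTreeRegressor(max_depth=1, max_features=1, random_state=1).fit(X, y)
second = DecisionTreeRegressor(max_depth=1, max_features=1, random_state=1).fit(X, y)

# Same seed and same data give the same predictions.
same = bool((first.predict(X) == second.predict(X)).all())
print(same)
```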

# (4) Random Utility Functions

In [4]:
import pandas as pd

# Add a suffix to every column name (returns a new frame; df is unchanged).
df.add_suffix("_normalized")

# One-hot encoding for all categorical variables (also returns a new frame).
pd.get_dummies(df)

# Train-test split.
# Rows are shuffled by default (shuffle=True); pass shuffle=False to keep their order.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Count frequencies of a category
df['feature'].value_counts()
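
The effect of the `shuffle` parameter is easiest to see on ordered data. A small sketch with made-up lists follows; `test_size=3` and `random_state=1` are arbitrary choices for the example.

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [value * 2 for value in X]

# shuffle=False keeps the original order: the last 3 rows become the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, shuffle=False
)
print(X_train, X_test)

# With the default shuffle=True the rows are permuted first;
# random_state makes that permutation repeatable.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=3, random_state=1
)
print(X_train_s, X_test_s)
```

Turning shuffling off matters for time-ordered data, where a shuffled split would let the model train on rows that come after the ones it is tested on.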

# Glossary
*Imputation* 'The assignment of a value to something by inference from the value of the products or processes to which it contributes.' - Oxford.  E.g. replacing a missing value by the mean of that value's column.

*One-hot encoding* The creation of a boolean feature for each possible value of a nominal categorical feature.
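
A quick illustration of this definition, using `pd.get_dummies` on a made-up frame (the `Embarked` column name is borrowed from the Titanic data; the values are invented):

```python
import pandas as pd

# One nominal feature with three possible values.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new column per distinct value; exactly one of them is set in each row.
encoded = pd.get_dummies(df, columns=["Embarked"])
print(encoded.columns.tolist())
print(encoded)
```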

*Fold* A subset of the training data used to cross-validate.
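
To make the idea concrete, here is a minimal pure-Python sketch that splits row indices into folds and lets each fold take one turn as the validation set. The helper name `k_fold_indices` is hypothetical, and the folds are contiguous (no shuffling) to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds get one additional row.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_indices(n_samples=7, k=3)
for i, validation in enumerate(folds):
    # Every row outside the current fold is used for training.
    training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print(f"fold {i}: train={training} validate={validation}")
```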

A *dense layer* in a neural network is one in which each neuron receives input from every neuron in the previous layer.
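
In code, "input from every neuron" means each output is a weighted sum over all inputs plus a bias. The sketch below is a plain-Python forward pass with no activation function; the function name `dense_forward` and all numbers are made up for illustration.

```python
def dense_forward(inputs, weights, biases):
    """Forward pass of a dense layer without an activation function."""
    # weights[j][i] connects input neuron i to output neuron j,
    # so every output neuron sees every input.
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


inputs = [1.0, 2.0, 3.0]                       # 3 neurons in the previous layer
weights = [[0.5, 0.0, -1.0], [1.0, 1.0, 1.0]]  # 2 neurons in this layer
biases = [0.0, 0.5]

outputs = dense_forward(inputs, weights, biases)
print(outputs)  # [-2.5, 6.5]
```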