# Recruitment exercise

Thank you for your interest in our open positions. We prepared a short exercise to assess your Machine Learning and Python coding skills. The easiest questions come first and the more advanced ones follow. For most of us, completing it takes more than 60 minutes, so feel free to select the tasks that interest you (e.g. jump straight to the more advanced ones) and answer only those. Just remember that the data is read only once, at the beginning, and that the data preparation tasks are prerequisites for some of the later tasks.

Also, since there are quite a lot of questions, please do not spend your time on beautifying the output or writing long answers - as long as an answer is there and it's correct, it's good enough.

Important: some tasks contain questions marked with "⚠️". Write your answers (in English or Polish) below those questions, in the area enclosed by big red balls "🔴".

Also, please remember that Internet search engines are your friends. We use them in our everyday work too, so feel free to use them as much as you need.

The entirety of this exercise can be solved with the data provided in the same email and the following libraries:
- pandas
- sklearn
- matplotlib
- xgboost
- optuna
- plotly
- tensorflow

The data comes from the famous Titanic dataset, so feel free to take advantage of any previous experience with it. Don't worry if it's new to you; here's the data dictionary:

| Variable | Definition                                 | Key                                            |
|----------|--------------------------------------------|------------------------------------------------|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class (proxy for socio-economic status) | 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower) |
| sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv("train.csv")
data.head()

#### Task: Pandas warm-up 1/6

Print unique values for the Cabin column

#### Task: Pandas warm-up 2/6
Return the most crowded cabins (the Cabin values shared by the largest number of passengers)

#### Task: Pandas warm-up 3/6
Using pandas methods, count the number of people for each combination of Survived and Pclass levels

#### Task: Pandas warm-up 4/6
Create a bar plot showing survival rate (%) for each combination of Sex and Pclass levels

#### Task: Pandas warm-up 5/6
Create a scatter plot of Age (x-axis) and logarithmized Fare (y-axis)
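
A side note on the log transform: the natural logarithm is undefined at zero, so a plain log fails on any value equal to 0. One common workaround, sketched below, is to use log(1 + x) instead. The `fares` list is made up for illustration and is not taken from the dataset; whether the workaround is needed depends on the actual data.

```python
from math import log1p

# Hypothetical fares, including a zero, purely for illustration.
fares = [0.0, 7.25, 71.28, 512.33]

# log1p(x) computes log(1 + x), which is well defined at x = 0.
log_fares = [log1p(fare) for fare in fares]

for fare, log_fare in zip(fares, log_fares):
    print(f"{fare:>8.2f} -> {log_fare:.3f}")
```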

#### Task: Pandas warm-up 6/6
Compute a correlation matrix for ["Survived", "Pclass", "Age", "Fare"]

#### Task: Data preparation 1/2
Create a new variable "Is_Cabin" equal to 0 if "Cabin" is NaN and 1 otherwise. Then drop the original Cabin variable

#### Task: Data preparation 2/2
For the remaining NaNs, remove all rows containing any NaN and print the number of remaining rows

#### Task: Unsupervised learning: Clustering
Standardize the variables and run K-Means on the subset: ["Pclass", "Age", "Fare", "SibSp"]. The number of clusters doesn't matter much; let's say we ask you for 4 clusters.
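
As a reminder of what "standardize" usually means here, the sketch below computes z-scores by hand: each value minus the column mean, divided by the column standard deviation. It uses the population standard deviation, which is also the convention of sklearn's `StandardScaler`. The `ages` and `fares` lists and the `standardize` helper are hypothetical and only serve the illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(value - mu) / sigma for value in values]

# Hypothetical columns, made up for this illustration.
ages = [22.0, 38.0, 26.0, 35.0]
fares = [7.25, 71.28, 7.92, 53.10]

z_ages = standardize(ages)
z_fares = standardize(fares)

# After standardization each column has mean 0 and standard deviation 1.
print([round(z, 2) for z in z_ages])
print([round(z, 2) for z in z_fares])
```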

⚠️Question: Should one always standardize data used for K-Means clustering? Why so?⚠️

🔴Answer: ...🔴

#### Task: Unsupervised learning: Dimensionality reduction 1/2
Create a 2-dimensional t-SNE visualization of the subset: ["Pclass", "Age", "Fare", "SibSp"]

#### Task: Unsupervised learning: Dimensionality reduction 2/2
Create a 2-dimensional PCA visualization of the subset: ["Pclass", "Age", "Fare", "SibSp"]

#### Task: Unsupervised learning: Combining results 1/2
Use labels provided by K-Means to enrich your t-SNE scatterplot with colors indicating assigned cluster

#### Task: Unsupervised learning: Combining results 2/2
Use labels provided by K-Means to enrich your PCA scatterplot with colors indicating assigned cluster

#### Task: Unsupervised learning: Probing your understanding

⚠️Question: Which visualization (t-SNE or PCA) would you show to a client interested in "seeing" the data, and why?⚠️

🔴Answer: ...🔴

***

#### Task: Data split
Split your data into train (80% of data) and test (20% of data) sets with Survived being the y (dependent variable) and ["Pclass", "Age", "Fare", "SibSp", "Is_Cabin"] being the X (independent variables)

#### Task: Supervised learning: Probing your understanding

⚠️Question: Why do we split our data into train and test (and sometimes further into validation as well)?⚠️

🔴Answer: ...🔴

#### Task: Decision Tree: Training
Using your train set, train a decision tree classifier

#### Task: Decision Tree: Performance evaluation
Calculate and print accuracy, precision, recall, F1 and the confusion matrix (protip: in sklearn, there's one function for the confusion matrix and another one for all the rest) for your decision tree using the test set
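
For reference, the metrics above can all be derived from the four confusion counts. The sketch below does this by hand in plain Python on made-up labels; `binary_metrics`, `y_true` and `y_pred` are hypothetical names for this illustration, and the exercise itself is meant to be solved with sklearn.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    return {
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

result = binary_metrics(y_true, y_pred)
print(result)
```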

#### Task: Logistic Regression: Training and performance evaluation
Using your train set, train a logistic regression and check its accuracy, precision, recall, F1 and confusion matrix

#### Task: Understanding logistic regression 1/2

⚠️Question: Why do statisticians (and others) care about multicollinearity when estimating linear/logistic regression models?⚠️

🔴Answer: ...🔴

#### Task: Understanding logistic regression 2/2

⚠️Question: What is a p-value? How is it used in logistic regression?⚠️

🔴Answer: ...🔴

#### Task: Compare models

⚠️Question: Which model (decision tree vs logistic regression) did a better job (if any) and why?⚠️

🔴Answer: ...🔴

***

#### Task: eXtreme Gradient Boosting and Hyperparameter Optimization

An intern was asked to write code that optimizes XGB hyperparameters. They did a solid job, but... **there's a critical error in their work - can you spot and fix it?**

Protip: visualizing the optimization history can help you spot it

Also, please **adjust the number of trees to 2e3** - our data scientist said that the current number is not enough.

⚠️Question: Why does one use hyperparameter optimization?⚠️

🔴Answer: ...🔴

In [None]:
import xgboost as xgb
import optuna
from sklearn.metrics import f1_score

def objective(trial, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test):
    param = {
        'tree_method':'hist',
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
        'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4, 0.5, 0.6, 0.7, 0.8, 1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008, 0.009, 0.01, 0.012, 0.014, 0.016, 0.018, 0.02]),
        'n_estimators': 100,
        'max_depth': trial.suggest_categorical('max_depth', [5, 7, 9, 11, 13, 15, 17, 20]),
        'random_state': trial.suggest_categorical('random_state', [1, 11, 131]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
    }
    model = xgb.XGBRegressor(**param)

    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=100, verbose=False)

    predictions = [1 if x>0.5 else 0 for x in model.predict(X_test)]

    f1 = f1_score(y_test, predictions)

    return f1

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
optuna.visualization.plot_optimization_history(study)

#### Task: Finding a minimum of a function the gradient way

Most Machine Learning algorithms use some sort of gradient technique to optimize their parameters. Gradient techniques are also the very foundation of neural networks. Do you remember the basics? Please **complete the "gradient" function (keeping it as simple as a gradient can be) and see if it can find the global minimum of the "our_function" function for x_0 = 5.**

Protip: before you answer the question, be smart and check the plot of "our_function" for x in [-10, 10] to see how it behaves (there are plenty of nice tools for plotting equations on the Internet).
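
If you prefer to stay offline, a minimal sketch like the one below samples the function on a grid so it can be plotted locally. The step of 0.01 and the names `xs` and `ys` are arbitrary choices for this illustration; `our_function` is repeated here only to keep the snippet self-contained.

```python
from math import sin

def our_function(x):
    # Same definition as in the exercise cell.
    return sin(8 * x) + x / 2 + x ** 2

# Sample x from -10 to 10 in steps of 0.01 (2001 points).
xs = [-10 + 0.01 * i for i in range(2001)]
ys = [our_function(x) for x in xs]

# Print every 250th sample as a quick sanity check of the values.
for x, y in list(zip(xs, ys))[::250]:
    print(f"x = {x:6.2f}   f(x) = {y:8.3f}")
```

The `xs` and `ys` lists can then be passed to any plotting tool, for example matplotlib's `plot` function, to view the curve.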

⚠️Question: Was it able to find the global minimum of "our_function" and why?⚠️

🔴Answer: ...🔴

In [None]:
from math import sin

def gradient(func, x_0=5):
    step = 0.001
    alpha = 0.1
    x_new = x_0
    for i in range(10000):
        x_old = x_new
        gradient = COMPLETE_THIS_LINE
        x_new = x_old - alpha * gradient
    local_minimum_pretending_to_be_global_minimum = func(x_new)
    return local_minimum_pretending_to_be_global_minimum

def our_function(x):
    return sin(8*x)+x/2+x**2

print(gradient(our_function))

#### Task: The Very Basics of Deep Learning

1. Usually, we like our final activation function to provide values ranging from 0 to 1. This code uses a different one - please fix it
2. Please add a dropout layer
3. Please change the optimizer to Adam
4. Please provide the F1 score for the test set

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='relu'))

model.compile(optimizer='SGD', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=0)

loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f'Test Accuracy: {acc:.3f}')

#### And that's it! :)