### Salary prediction, episode II: make it actually work (4 points)

Your main task is to apply some of the tricks you've learned to the network and analyze whether they improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points.

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)

You can use either __pytorch__ or __tensorflow__ or any other framework (e.g. pure __keras__). Feel free to adapt the seminar code for your needs. For tensorflow version, consider `seminar_tf2.ipynb` as a starting point.


In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator
from sklearn.metrics import mean_absolute_error as MAE
import gutil.ml

In [2]:
data = pd.read_csv("./Train_rev1.csv", index_col=None)


In [3]:
data['salary_log'] = np.log1p(data['SalaryNormalized'])

In [4]:
y_col = "salary_log"


In [5]:
train, test = train_test_split(data, random_state=0)

In [6]:
y_train = train[y_col]
y_test = test[y_col]

# preprocess categorical

In [7]:
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]


In [8]:
class ReplaceUncommon(BaseEstimator):
    """Replace categories seen fewer than `min_occurence` times with "UNK"."""
    def __init__(self, min_occurence=50):
        self.min_occurence = min_occurence

    def fit(self, X, y=None):
        self.valid = {}  # mapping from column name to its frequent-enough values
        for col in X.columns:
            s = X[col].value_counts() >= self.min_occurence
            self.valid[col] = s[s].index
        return self

    def transform(self, X):
        out = X.copy()
        for col in out.columns:
            out.loc[~out[col].isin(self.valid[col]), col] = "UNK"
        return out


class ToDict(BaseEstimator):
    """Turn each DataFrame row into a dict, the input format DictVectorizer expects."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.apply(dict, axis=1)


In [9]:
precat = Pipeline([('replace_uncommon', ReplaceUncommon()),
                   ("to_dict", ToDict()),
                   ("dict_vectorizer", DictVectorizer())
                  ])

In [10]:
precat.fit(train[categorical_columns])

Pipeline(steps=[('replace_uncommon', ReplaceUncommon()), ('to_dict', ToDict()),
                ('dict_vectorizer', DictVectorizer())])

In [11]:
cat_features_train = precat.transform(train[categorical_columns])
cat_features_test = precat.transform(test[categorical_columns])

# hyperparameter tuning

In [74]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
import itertools
from sklearn.linear_model import Ridge
import tqdm

In [92]:
%%time
X = cat_features_train.toarray()[:500]
Y = y_train.iloc[:500]
num_splits=5
return_search_space = False



CPU times: user 0 ns, sys: 306 ms, total: 306 ms
Wall time: 303 ms


Ridge(alpha=0.0, solver='lsqr')

[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished


-0.06499715796971728

In [42]:
# all_combos[0] is a dict of hyperparameters, so print it directly
# (unpacking it with ** passes its keys to print() as keyword arguments).
print(all_combos[0])

# try to predict just using categorical data

In [147]:
from sklearn.linear_model import LinearRegression

In [145]:
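class BaselineMean:
    """Baseline regressor that always predicts the mean of the training target."""
    def fit(self, X, y=None):
        self.mean = y.mean()
        return self  # follow the sklearn convention so calls can be chained
    def predict(self, X):
        return np.repeat(self.mean, X.shape[0])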

In [148]:
predictor = LinearRegression()
predictor.fit(cat_features_train, y_train)
preds = predictor.predict(cat_features_test)
MAE(y_test, preds)

0.3075822684245251

Summary (test MAE on `salary_log`):

BaselineMean: 0.39

LinearRegression: 0.307





### A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, i guess...

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several `nn.Conv1d` layers to the same embeddings and concatenate their output channels.
* More layers, more neurons, ya know...
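A minimal PyTorch sketch of the parallel-convolution idea, combined with batch norm and dropout. The class name `ParallelConvEncoder`, the kernel sizes and all layer sizes are illustrative assumptions, not part of the seminar code. Since different kernel sizes can produce different output lengths, each branch is max-pooled over time first and the pooled vectors are concatenated.

```python
import torch
import torch.nn as nn


class ParallelConvEncoder(nn.Module):
    """Hypothetical encoder: embeddings -> several Conv1d branches -> max over time -> concat."""

    def __init__(self, n_tokens, emb_dim=64, n_filters=32, kernel_sizes=(2, 3, 5), dropout=0.3):
        super().__init__()
        # Assumption: token id 0 is the PAD token.
        self.emb = nn.Embedding(n_tokens, emb_dim, padding_idx=0)
        # One branch per kernel size, all reading the same embeddings.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, n_filters, kernel_size=k, padding=k // 2),
                nn.BatchNorm1d(n_filters),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.out_dim = n_filters * len(kernel_sizes)

    def forward(self, token_ids):
        # token_ids: [batch, time]; Conv1d expects [batch, channels, time]
        x = self.emb(token_ids).transpose(1, 2)
        # each branch gives [batch, n_filters, time']; max over time -> [batch, n_filters]
        pooled = [branch(x).max(dim=-1).values for branch in self.branches]
        # concatenate the channels of all branches -> [batch, out_dim]
        return self.dropout(torch.cat(pooled, dim=-1))
```

The output of size `out_dim` can be fed into the dense layers that predict the salary; one such encoder per text field (title, description) is one possible layout.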


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive poolings with different $NN_{attn}$ (aka multi-headed attention).

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)
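To make the four poolings concrete, here is a framework-agnostic NumPy sketch. It assumes `h` has shape `[batch, time, features]`, `mask` is a boolean `[batch, time]` array that is `True` for real tokens, and every sequence has at least one real token. The function names and the single dense layer `(W, b)` standing in for $NN_{attn}$ are illustrative; in a real model the same operations would be written with torch or tensorflow ops so that `W` and `b` get trained.

```python
import numpy as np


def _masked_softmax(scores, mask, axis):
    """Softmax along `axis` with zero weight wherever `mask` is False."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=axis, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=axis, keepdims=True)


def max_pool(h, mask):
    """Max over time, independently for each feature, ignoring PAD positions."""
    return np.where(mask[..., None], h, -np.inf).max(axis=1)


def mean_pool(h, mask):
    """Average over time, excluding PAD positions."""
    m = mask[..., None].astype(h.dtype)
    return (h * m).sum(axis=1) / m.sum(axis=1)


def softmax_pool(h, mask):
    """Each feature is averaged with softmax weights computed from its own values."""
    weights = _masked_softmax(h, mask[..., None], axis=1)
    return (h * weights).sum(axis=1)


def attentive_pool(h, mask, W, b):
    """Weights come from a dense layer: W has shape [features, 1], b is a scalar."""
    scores = h @ W + b  # [batch, time, 1]
    weights = _masked_softmax(scores, mask[..., None], axis=1)
    return (h * weights).sum(axis=1)
```

Concatenating the outputs of several of these functions along the feature axis gives the combined pooling mentioned above; several `attentive_pool` calls with different `(W, b)` pairs would play the role of multiple attention heads.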

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here's a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix for the title and description vectorizers
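One way to turn pre-trained vectors into an initial embedding matrix, sketched with NumPy. The helper name `build_embedding_matrix` is hypothetical, and the sketch assumes that `token_to_id` maps every vocabulary token to a row index in `0..len-1` and that `pretrained` supports `token in pretrained` and `pretrained[token]` (a plain dict does; the object returned by `gensim.downloader.load` is assumed to behave the same way).

```python
import numpy as np


def build_embedding_matrix(token_to_id, pretrained, emb_dim, seed=0):
    """Return ([vocab_size, emb_dim] float32 matrix, number of tokens found in `pretrained`)."""
    rng = np.random.default_rng(seed)
    # Tokens missing from the pre-trained vocabulary keep a small random vector.
    matrix = rng.normal(scale=0.1, size=(len(token_to_id), emb_dim)).astype(np.float32)
    n_found = 0
    for token, idx in token_to_id.items():
        if token in pretrained:
            matrix[idx] = pretrained[token]
            n_found += 1
    return matrix, n_found
```

In PyTorch the result could be loaded with `nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)`, where `freeze=False` keeps the weights trainable for fine-tuning. Passing the same embedding layer to both the title and the description encoder shares the matrix between them. Checking `n_found` against the vocabulary size is a cheap sanity test that tokenization matches the pre-trained vocabulary.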


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification either. With some tricks of course...

* Like convolutional layers, LSTM outputs should be pooled into a fixed-size vector with one of the poolings above.
* Since you know all the text in advance, use a bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
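A minimal PyTorch sketch of the bidirectional recipe above, using `nn.LSTM(..., bidirectional=True)`, which runs both directions and concatenates their outputs along the last axis. The class name `BiLSTMEncoder`, the sizes, and the convention that token id 0 is PAD are assumptions made for illustration. Max-over-time pooling is used here, but any of the poolings from section B would fit.

```python
import torch
import torch.nn as nn


class BiLSTMEncoder(nn.Module):
    """Hypothetical encoder: embeddings -> bidirectional LSTM -> masked max over time."""

    def __init__(self, n_tokens, emb_dim=64, hid_size=64, pad_ix=0):
        super().__init__()
        self.pad_ix = pad_ix
        self.emb = nn.Embedding(n_tokens, emb_dim, padding_idx=pad_ix)
        self.lstm = nn.LSTM(emb_dim, hid_size, batch_first=True, bidirectional=True)
        self.out_dim = 2 * hid_size  # forward and backward states are concatenated

    def forward(self, token_ids):
        # states: [batch, time, 2 * hid_size]
        states, _ = self.lstm(self.emb(token_ids))
        # exclude PAD positions from the max
        is_pad = (token_ids == self.pad_ix).unsqueeze(-1)  # [batch, time, 1]
        states = states.masked_fill(is_pad, float("-inf"))
        return states.max(dim=1).values  # [batch, 2 * hid_size]
```

One caveat of this sketch: with right-padded batches the backward LSTM reads the PAD tokens before the real text. Packing the batch with `torch.nn.utils.rnn.pack_padded_sequence` is the usual way around that, at the cost of some extra bookkeeping.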


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at the [early stopping callback (keras)](https://keras.io/callbacks/#earlystopping) or its counterpart in [pytorch (lightning)](https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html).
  * In short, train until you notice that validation MAE has stopped improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
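If you would rather not depend on a callback, the early-stopping logic is small enough to write by hand. Below is a framework-independent sketch; `EarlyStopper`, `patience` and `min_delta` are illustrative names, and the validation MAE values are made up purely to exercise the logic (they are not results from this notebook).

```python
import copy


class EarlyStopper:
    """Track the best validation MAE and signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("inf")
        self.best_epoch = -1
        self.best_state = None
        self.bad_epochs = 0

    def step(self, epoch, val_mae, state):
        """Record one epoch; return True when training should stop."""
        if val_mae < self.best_score - self.min_delta:
            self.best_score = val_mae
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state)  # best-on-validation snapshot
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run: made-up validation MAE values stand in for a real training loop.
fake_val_mae = [0.40, 0.33, 0.30, 0.31, 0.32, 0.305, 0.35]
stopper = EarlyStopper(patience=3)
history = []  # keep this list around for plotting the learning curve
for epoch, val_mae in enumerate(fake_val_mae):
    history.append(val_mae)
    if stopper.step(epoch, val_mae, state={"epoch": epoch}):
        break
```

In a real loop `state` could be `model.state_dict()` in PyTorch, or the snapshot could be written to disk with `model.save(file_name)` as suggested above. Models should then be compared on `best_score`, not on the last epoch.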
  
Good luck! And may the force be with you!