
# O'Reilly Machine Learning Engineer Takehome

Welcome to the evaluation project for the Machine Learning Engineer position at O'Reilly Media. In this project you will evaluate an academic search dataset built from common learning-to-rank features, build a ranking model on that dataset, and discuss how additional features could be used and how they would affect the model's performance.

Steps:
1. Make a copy of this notebook.
2. Download the dataset to the notebook.
3. Preprocess and evaluate the dataset
4. Build a **ranking** model
5. Evaluate your ranking model using a metric of your choice
6. Discuss the performance of your model and why you chose the model you chose.
7. Answer discussion questions
8. Submit your notebook


## Notes

 

*   Throughout the notebook you should include notes explaining your choices and what you are doing. Your thought process is more important than the actual performance of your model.
*   Create as many cells as you want. The existing cells are only there to provide some initial organization.
* You may use any choice of libraries or frameworks.
* If ranking models are new to you, consider starting here: https://arxiv.org/abs/1812.00073

In [2]:
# Import dependencies here
!pip install pyltr
import pyltr

Collecting pyltr
  Downloading https://files.pythonhosted.org/packages/29/01/e7120dffc8bb40307002a51b85810ef714ee3faed260c416e9ae38feb282/pyltr-0.2.6-py3-none-any.whl
Installing collected packages: pyltr
Successfully installed pyltr-0.2.6


### 2) Download Dataset

In [3]:
# Download the dataset located at https://storage.googleapis.com/personalization-takehome/MSLR-WEB10K.zip
# You can read about the features included in the dataset here: https://www.microsoft.com/en-us/research/project/mslr/
!wget https://storage.googleapis.com/personalization-takehome/MSLR-WEB10K.zip

--2020-11-22 15:11:02--  https://storage.googleapis.com/personalization-takehome/MSLR-WEB10K.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.26.128, 172.217.193.128, 172.217.204.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.26.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1234144912 (1.1G) [application/zip]
Saving to: ‘MSLR-WEB10K.zip’


2020-11-22 15:11:13 (113 MB/s) - ‘MSLR-WEB10K.zip’ saved [1234144912/1234144912]



In [4]:
!unzip MSLR-WEB10K.zip

Archive:  MSLR-WEB10K.zip
   creating: Fold1/
  inflating: Fold1/test.txt          
  inflating: Fold1/train.txt         
  inflating: Fold1/vali.txt          
   creating: Fold2/
  inflating: Fold2/test.txt          
  inflating: Fold2/train.txt         
  inflating: Fold2/vali.txt          
   creating: Fold3/
  inflating: Fold3/test.txt          
  inflating: Fold3/train.txt         
  inflating: Fold3/vali.txt          
   creating: Fold4/
  inflating: Fold4/test.txt          
  inflating: Fold4/train.txt         
  inflating: Fold4/vali.txt          
   creating: Fold5/
  inflating: Fold5/test.txt          
  inflating: Fold5/train.txt         
  inflating: Fold5/vali.txt          


### 3) Preprocess and evaluate the dataset

In [5]:
# Preprocess and evaluate the dataset
with open('Fold1/train.txt') as trainfile, \
        open('Fold1/vali.txt') as valifile, \
        open('Fold1/test.txt') as evalfile:
    TX, Ty, Tqids, _ = pyltr.data.letor.read_dataset(trainfile)
    VX, Vy, Vqids, _ = pyltr.data.letor.read_dataset(valifile)
    EX, Ey, Eqids, _ = pyltr.data.letor.read_dataset(evalfile)
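
Before training, it can help to check how many queries the fold contains, how many documents each query has, and how the relevance labels are distributed. The following is a minimal sketch, not part of the original notebook: `summarize_letor` is a hypothetical helper, and the toy arrays stand in for `Ty` and `Tqids` so the snippet runs without the dataset.

```python
from collections import Counter

import numpy as np


def summarize_letor(y, qids):
    """Return basic statistics for a LETOR-style dataset (hypothetical helper)."""
    y = np.asarray(y)
    qids = np.asarray(qids)
    # Number of documents attached to each query id.
    docs_per_query = Counter(qids.tolist())
    sizes = np.array(list(docs_per_query.values()))
    # How often each relevance label occurs.
    labels, counts = np.unique(y, return_counts=True)
    return {
        "n_docs": int(y.shape[0]),
        "n_queries": len(docs_per_query),
        "mean_docs_per_query": float(sizes.mean()),
        "label_counts": {int(l): int(c) for l, c in zip(labels, counts)},
    }


# Toy arrays standing in for Ty and Tqids (assumption: two queries, three docs each).
toy_y = np.array([0, 2, 1, 0, 4, 0])
toy_qids = np.array([1, 1, 1, 2, 2, 2])
stats = summarize_letor(toy_y, toy_qids)
print(stats)
```

With the real data loaded, the same call would be `summarize_letor(Ty, Tqids)`; a heavily skewed label distribution is one reason to prefer a graded ranking metric over plain accuracy.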

### 4) Build ranking model

In [None]:
# Build ranking model
metric = pyltr.metrics.NDCG(k=10)
# highest_score is the maximum relevance label; MSLR-WEB10K labels run from 0 to 4.
metric2 = pyltr.metrics.ERR(highest_score=4, k=10)

# Only needed if you want to perform validation (early stopping & trimming)
monitor = pyltr.models.monitors.ValidationMonitor(
    VX, Vy, Vqids, metric=metric, stop_after=400)

model = pyltr.models.LambdaMART(
    metric=metric,
    n_estimators=800,
    learning_rate=0.02,
    max_features=0.5,
    query_subsample=0.5,
    max_leaf_nodes=10,
    min_samples_leaf=64,
    verbose=1,
)

model.fit(TX, Ty, Tqids, monitor=monitor)
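
The ERR metric defined above depends on the top of the label scale. In the usual definition, a label g is mapped to a stop probability (2^g - 1) / 2^g_max, where g_max is the highest possible label (4 for MSLR-WEB10K's 0 to 4 scale). The sketch below illustrates that definition for a single query. `err_at_k` is a hypothetical helper written for illustration and is not claimed to match pyltr's implementation line for line.

```python
import numpy as np


def err_at_k(labels_in_ranked_order, highest_label, k=10):
    """Expected Reciprocal Rank for one query, given labels in ranked order."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)[:k]
    # Probability that the user is satisfied by the document at each position.
    stop_prob = (2.0 ** labels - 1.0) / (2.0 ** highest_label)
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current position
    for rank, r in enumerate(stop_prob, start=1):
        err += p_reach * r / rank
        p_reach *= 1.0 - r
    return err


# A perfect document first versus the same document in second place.
best_first = err_at_k([4, 0], highest_label=4)
best_second = err_at_k([0, 4], highest_label=4)
print(best_first, best_second)
```

Because the stop probability is normalised by 2^g_max, passing a value that is not the top label distorts every term, which is why the argument should reflect the dataset's label scale.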

### 5) Evaluate model performance

In [35]:
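# Score the test-set predictions against the test-set labels and query ids.
predictions = model.predict(EX)
print('Random ranking:', metric.calc_mean_random(Eqids, Ey))
print('Our model:', metric.calc_mean(Eqids, Ey, predictions))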
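
As a cross-check on the evaluation step, mean NDCG@k can be computed by hand. The sketch below assumes the common formulation with gain 2^label - 1 and discount log2(rank + 1). The helper names `dcg_at_k` and `mean_ndcg_at_k` are hypothetical, and scoring queries with no relevant documents as 0 is a convention chosen here, not something taken from the notebook. The length check makes explicit that predictions, labels, and query ids must all come from the same split.

```python
import numpy as np


def dcg_at_k(labels_in_ranked_order, k=10):
    """Discounted cumulative gain over the top k labels."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)[:k]
    gains = 2.0 ** labels - 1.0
    discounts = np.log2(np.arange(2, labels.size + 2))
    return float(np.sum(gains / discounts))


def mean_ndcg_at_k(qids, y_true, y_score, k=10):
    """Average per-query NDCG@k; all three arrays must describe the same rows."""
    qids, y_true, y_score = map(np.asarray, (qids, y_true, y_score))
    if not (len(qids) == len(y_true) == len(y_score)):
        raise ValueError("qids, y_true and y_score must have the same length")
    per_query = []
    for q in np.unique(qids):
        mask = qids == q
        labels = y_true[mask]
        # Rank documents by descending model score.
        order = np.argsort(-y_score[mask], kind="stable")
        ideal = dcg_at_k(np.sort(labels)[::-1], k)
        # Assumption: queries with no relevant documents contribute 0.
        per_query.append(dcg_at_k(labels[order], k) / ideal if ideal > 0 else 0.0)
    return float(np.mean(per_query))


# Toy data: query 1 is ranked perfectly, query 2 is ranked in reverse.
toy_qids = np.array([1, 1, 1, 2, 2, 2])
toy_true = np.array([3, 2, 0, 0, 2, 3])
toy_score = np.array([0.9, 0.5, 0.1, 0.9, 0.5, 0.1])
toy_ndcg = mean_ndcg_at_k(toy_qids, toy_true, toy_score, k=10)
print(toy_ndcg)
```

On the real test fold the equivalent call would be `mean_ndcg_at_k(Eqids, Ey, model.predict(EX))`, which should land close to the library's own mean NDCG if the two formulations agree.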

### 6) Please answer the following questions about your choices:
1. Why did you choose your metric to evaluate the model?
2. How well would you say your model performed?
3. If you had more time what else would you want to try?

### 7) Please answer the following questions about how you would use additional features:

1. If you had an additional feature for each row of the dataset that was a unique identifier for the user performing the query, e.g. `user_id`, how could you use it to improve the performance of the model?
2. If you had the additional features of: `query_text` or the actual textual query itself, as well as document text features like `title_text`, `body_text`, `anchor_text`, `url` for the document, how would you include them in your model (or any model) to improve its performance?




### 8) Please submit your colab by sharing it with: cmaon@oreilly.com
