# Data Challenge: H-index Prediction

Michael Fotso Fotso, Tristan François and Christian Kotait

## Data Generation

### First preprocessing

This step creates a file `processed_data.csv` in the folder `../tmp/` with a few graph-related features and two features derived from a **fastText model**.

It took about 15 minutes on our machines.

Once this step has finished, you can already run the code in the last section to test the performance locally. You should get an MSE of about 52 with 1000 iterations in just a few seconds.

In [None]:
from preprocess_utils import store_full_dataset_with_features
from utils import write_train_data_json

write_train_data_json()
store_full_dataset_with_features(from_scratch=True, vectorize=True)

### Add authority feature

It took about 10 minutes on our machines.

In [None]:
from preprocess_utils import PROCESSED_DATA_PATH, add_authority, get_processed_data

data = get_processed_data(split=False)
data = add_authority(data)
data.to_csv(PROCESSED_DATA_PATH)
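
The body of `add_authority` is not shown in this notebook. Assuming the feature is the authority score of the HITS algorithm (an assumption, not something the code above confirms), the following pure-Python sketch illustrates what such a score measures. The function `hits_authority` and the toy edge list are hypothetical and exist only for illustration; for an undirected graph, each edge would be listed in both directions.

```python
def hits_authority(edges, iterations=50):
    """Authority scores from HITS power iteration on a directed edge list.

    Hypothetical helper for illustration; not the notebook's `add_authority`.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of the hub scores of nodes pointing to it.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score of a node: sum of the authority scores it points to.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalise both vectors so that they sum to 1 and stay bounded.
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(hub.values()) or 1.0
        auth = {node: value / auth_total for node, value in auth.items()}
        hub = {node: value / hub_total for node, value in hub.items()}
    return auth


# Toy graph: a -> c, b -> c, c -> d
toy_edges = [("a", "c"), ("b", "c"), ("c", "d")]
toy_authority = hits_authority(toy_edges)
print(toy_authority)
```

In this toy graph, node `c` ends up with the highest authority because it is pointed to by the two strongest hubs.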

### Add a closeness centrality feature

It took about 2 hours on our machines.

In [None]:
from utils import get_closeness
from preprocess_utils import PROCESSED_DATA_PATH, add_features, get_processed_data

data = get_processed_data(split=False)
closeness = get_closeness()
data = add_features(data, closeness)
data.to_csv(PROCESSED_DATA_PATH)
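
The long running time is consistent with how closeness centrality is defined: it needs the shortest-path distances from every node to all the others, which means one graph traversal per node. The sketch below shows this on an unweighted, undirected toy graph. It is not the code behind `get_closeness`, which is not shown here; the function `closeness_centrality` and the toy adjacency dictionary are hypothetical.

```python
from collections import deque


def closeness_centrality(adjacency):
    """Closeness of every node of an unweighted, undirected graph.

    `adjacency` maps each node to a list of its neighbours.
    A node that reaches `r` other nodes at total distance `d` gets
    (r / d) * (r / (n - 1)), so nodes in small components are scaled down.
    """
    n = len(adjacency)
    scores = {}
    for source in adjacency:
        # One breadth-first search per node: this is the expensive part.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        reached = len(distance) - 1
        total = sum(distance.values())
        if reached == 0 or n <= 1:
            scores[source] = 0.0  # isolated node
        else:
            scores[source] = (reached / total) * (reached / (n - 1))
    return scores


# Toy path graph: a - b - c
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
toy_closeness = closeness_centrality(toy_graph)
print(toy_closeness)
```

With one traversal per node, the cost grows roughly with the number of nodes times the size of the graph, which is why this feature is much slower to compute than the previous ones.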

### Add doc2vec features

This step adds 20 features from a doc2vec model.

It took about 5 hours on our machines.

In [None]:
from d2vec import add_do2vec_to_whole_dataset
from preprocess_utils import PROCESSED_DATA_PATH, get_processed_data

data = get_processed_data(split=False)
data = add_do2vec_to_whole_dataset(data)
data.to_csv(PROCESSED_DATA_PATH)

### Add tf-idf features

This step adds `n_features` features from a tf-idf vectorizer.

It took a few minutes on our machines, but it requires a lot of RAM.

We were able to generate up to 5000 features.

In [None]:
from preprocess_utils import PROCESSED_DATA_PATH, get_processed_data, add_tf_idf

data = get_processed_data(split=False)
data = add_tf_idf(data, n_features=1000)
data.to_csv(PROCESSED_DATA_PATH)

## Model training and submission

The variables `iterations` and `task_type` define, respectively, the number of training iterations and the device the model is trained on. If you do not have a GPU, set `task_type` to `"CPU"`.

The execution time of the model depends strongly on the number of tf-idf features selected.

We were able to train for 150,000 iterations with 3000 tf-idf features in a few hours on our machines.

Even so, the model was still far from converging. Increasing the number of iterations and/or features would most likely have improved the results significantly, but we had neither the time nor the necessary equipment.

In [None]:
iterations = 1000
task_type = "GPU"  # "GPU" or "CPU"

### Run for submission

In [None]:
from catboost import CatBoostRegressor
from preprocess_utils import get_submission_data
from read_data import get_test_data

X_train, y_train, X_test, y_test = get_submission_data()

model_cat = CatBoostRegressor(
    verbose=False,
    random_state=1,
    iterations=iterations,
    task_type=task_type,
    depth=8
)

model_cat.fit(X_train, y_train)

y_pred = model_cat.predict(X_test)

test, _ = get_test_data()
test["hindex"] = y_pred
submission = test[["author", "hindex"]]
submission.to_csv("../tmp/submission.csv", index=False)

### Run for local test

In [None]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from preprocess_utils import get_split_train_data

X_train, y_train, X_test, y_test = get_split_train_data()

model_cat = CatBoostRegressor(
    verbose=False,
    iterations=iterations,
    task_type=task_type,
    depth=8
)

model_cat.fit(X_train, y_train)

y_pred = model_cat.predict(X_test)

# Mean squared error on the held-out part of the training data
print(mean_squared_error(y_test, y_pred))
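
To help read the printed value: the MSE is the mean of the squared differences between true and predicted h-indices, so its square root is expressed in h-index points. The sketch below recomputes it by hand on made-up numbers; `mean_squared_error_manual` and the toy values are hypothetical and are not results from our model.

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up h-index values, for illustration only.
toy_true = [3, 10, 25]
toy_pred = [5, 10, 19]
toy_mse = mean_squared_error_manual(toy_true, toy_pred)  # (4 + 0 + 36) / 3

# An MSE of about 52 corresponds to a root mean squared error of about 7.2.
rmse_for_52 = math.sqrt(52)

print(toy_mse, rmse_for_52)
```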