The system provides brand safety functionality: it answers queries that take an HTML document as an argument. The answer is a vector of compatibility scores, one for each predefined advertisement type. These values can be thresholded to reject incompatible ads or used to select the best ads for the given website.
Python libraries:
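As a rough illustration of how such a vector might be consumed, the sketch below filters and ranks ad types by score. The ad type names, the score values, the dictionary representation, and both helper functions are hypothetical; the actual response format of the service is not described here.

```python
from typing import Dict, List

def incompatible_ads(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the ad types whose compatibility is below the threshold."""
    return [ad for ad, score in scores.items() if score < threshold]

def best_ads(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k ad types with the highest compatibility, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical scores for one website (not produced by the real system).
example_scores = {"alcohol": 0.05, "sports": 0.81, "travel": 0.62, "gambling": 0.10}

rejected = incompatible_ads(example_scores, threshold=0.5)
selected = best_ads(example_scores, k=2)
```

The threshold of 0.5 is arbitrary; a suitable value would have to be chosen from validation results.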
- scikit-learn
- gensim
- nltk
- numpy
- flask
- BeautifulSoup
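A quick way to check that the libraries above are importable is sketched below. The helper name `missing_packages` is hypothetical, and the mapping assumes BeautifulSoup 4, which is imported as `bs4`; an older BeautifulSoup release would use a different module name.

```python
import importlib.util

# Package name as listed above -> module name used in "import".
# "bs4" is an assumption (BeautifulSoup 4).
REQUIRED_MODULES = {
    "scikit-learn": "sklearn",
    "gensim": "gensim",
    "nltk": "nltk",
    "numpy": "numpy",
    "flask": "flask",
    "BeautifulSoup": "bs4",
}

def missing_packages(required=REQUIRED_MODULES):
    """Return the listed packages whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]

if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```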
Website classification requires word2vec embeddings.
They can be downloaded
here.
./train_text_classifier.py
This script generates an output file containing the trained model, classifier_data.
./train_text_classifier.py validate
This script requires the classifier_data file.
./test.py
This script uses test data from the file data/test_data.json
and requires the classifier_data file.
Before running the agents, the website classifier
needs to be trained with ./train_text_classifier.py.
./start_front_agent.py
In the current version, only one such agent can run in the system.
./start_worker.py <front agent ip address>
Any number of these agents can be deployed on different machines (note that each such agent loads word vectors that take more than 3 GB of RAM).
The following example commands return the vector of compatibility measurements for the given website code:
curl -X POST localhost:5000/compat -d "<html>Some website code</html>"
curl -X POST localhost:5000/compat -d "Or just plain text"
curl -X POST localhost:5000/compat -F "data=@index.html"
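The same query can be sent from Python. The sketch below mirrors the `curl -d` form using only the standard library; the function names are hypothetical, and the response is returned as raw text because its exact format is not specified here.

```python
import urllib.request

def build_compat_request(html: str, base_url: str = "http://localhost:5000"):
    """Build a POST request for /compat with the document as the raw body."""
    return urllib.request.Request(
        url=base_url + "/compat",
        data=html.encode("utf-8"),
        # curl -d sends this content type by default.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

def query_compat(html: str, base_url: str = "http://localhost:5000") -> str:
    """Send the document to a running front agent and return the reply text."""
    with urllib.request.urlopen(build_compat_request(html, base_url)) as response:
        return response.read().decode("utf-8")
```

With the front agent running, `query_compat("<html>Some website code</html>")` should return the same body as the first curl command.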