(from the project root dir)
docker build -t kennissessie-text-classification docker/
(from the project root dir)
docker run -it --rm --name kennissessie-text-classification -v "$PWD":/code kennissessie-text-classification python3 script.py
Hint: the TF-IDF vectorizer is very similar to the word count vectorizer.
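As a rough illustration of that hint, here is a minimal sketch assuming scikit-learn's CountVectorizer and TfidfVectorizer (the vectorizers actually used in script.py and common.py may differ). The toy corpus is made up for this example. Both classes share the same fit_transform interface, so one can usually be swapped for the other.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration only (not one of the workshop data sets).
docs = [
    "oil prices rise",
    "oil exports fall",
    "wheat harvest grows",
]

count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

# Same call on both: learn the vocabulary and return a sparse document-term matrix.
count_matrix = count_vectorizer.fit_transform(docs)
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)

# Both learn the same vocabulary and produce matrices of the same shape.
print(count_matrix.shape, tfidf_matrix.shape)
print(count_vectorizer.vocabulary_ == tfidf_vectorizer.vocabulary_)

# The difference is in the values: raw counts versus weights that
# down-weight words occurring in many documents (here "oil").
vocab = tfidf_vectorizer.vocabulary_
first_doc = tfidf_matrix[0].toarray()[0]
print("oil:", first_doc[vocab["oil"]], "prices:", first_doc[vocab["prices"]])
```

In the first document, "oil" gets a lower TF-IDF weight than "prices" because it also appears in the second document, while the count vectorizer gives both words a count of 1.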
The doc2vec library exposes the top 10 most similar results based on file id. See: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Doc2VecKeyedVectors.most_similar
In any of the data sets, pick a random document and find the most similar documents using doc2vec. If you read these texts, do you agree that they are similar?
Hint: you can access the similar docvecs for a document using:
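document_id = 'training/10335' # (stored in document['file_id'])
document_vector = doc2vec.docvecs[document_id]
# A raw vector is passed here, so the query document itself will normally
# show up as the first (most similar) result.
print(doc2vec.docvecs.most_similar(positive=[document_vector]))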
Scikit-learn has lots of off-the-shelf classification algorithms besides the one currently used in the "classify" function in common.py.
Most of these classification algorithms have lots of parameters; you can try to tweak them and see if you get better results.
For an overview see: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
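A minimal sketch of what trying other classifiers and parameters could look like. It does not reproduce the "classify" function in common.py, whose contents are not shown here. The texts, labels, and the choice of LogisticRegression and LinearSVC are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, made up for illustration; replace with the workshop data sets.
train_texts = [
    "oil prices rise", "crude oil exports fall", "oil barrel output",
    "wheat harvest grows", "grain and wheat exports", "wheat crop yield",
]
train_labels = ["oil", "oil", "oil", "wheat", "wheat", "wheat"]
test_texts = ["oil output falls", "wheat crop fails"]
test_labels = ["oil", "wheat"]

# Candidate classifiers, each with one tweaked parameter.
# C is the inverse regularization strength in both model families.
candidates = {
    "logistic regression, C=1": LogisticRegression(C=1.0),
    "logistic regression, C=10": LogisticRegression(C=10.0),
    "linear SVM, C=1": LinearSVC(C=1.0),
}

scores = {}
for name, classifier in candidates.items():
    # Vectorize and classify in one pipeline so every candidate sees the same features.
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    scores[name] = model.score(test_texts, test_labels)  # mean accuracy
    print(name, scores[name])
```

On real data, compare the candidates on a held-out test set (or with cross-validation) rather than on the training data, otherwise a tweak that only overfits will look like an improvement.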