
An automated text processing tool, that exports a text/article to an abstract (with keywords)


Auto categorizer

This project aims to create a self-learning application that can detect the subject of a text. The algorithm consists of three steps:

  • Preprocess the text by filtering out non-words, stripping HTML, identifying nouns and verbs, and replacing entities with tag names
  • Train a doc2vec model that converts the input text into a document vector used in the categorization step
  • Train an LSTM categorizer by feeding the processed training data through the doc2vec model into the LSTM model
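The preprocessing step above can be sketched in plain Python. This is a minimal illustration of the HTML-stripping and non-word filtering only; the noun/verb identification and entity replacement require an NLP library and are omitted here, and the regexes are assumptions rather than the project's actual implementation:

```python
import re

def preprocess(text):
    """Strip HTML and keep only word tokens, lowercased."""
    # Strip HTML tags with a simple regex (a real pipeline may use a proper parser).
    text = re.sub(r"<[^>]+>", " ", text)
    # Keep only alphabetic tokens, filtering out numbers and punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t.lower() for t in tokens]

print(preprocess("<p>Hello, world 42!</p>"))
```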

This work is a clean version of the project described here

To install

Navigate to the folder where you want the project:

$ git clone https://github.com/jolin1337/Text2Abstract auto-categorizer
$ cd auto-categorizer
$ python -m venv ./venv
$ ln -s venv/lib/python3.5/site-packages/learning learning
$ python -m pip install -r requirements.txt
$ cp .env.sample .env
$ export PYTHONPATH=`pwd`
# Add SOLDR_TOKEN to the .env file, then run:
$ source .env
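The .env file only needs the token; the value below is a placeholder:

```
SOLDR_TOKEN=<your-soldr-token>
```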

To fetch data, run

$ python learning/mm_services/fetch_data.py learning/data/new_metadata_articles.json

Create a list of top categories, one category per line, and name it new_top_categories.txt
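For example (the category names here are purely illustrative), new_top_categories.txt might contain:

```
sports
politics
economy
culture
```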

To train

All training parameters are defined in the configuration file in the config folder

$ python learning/model.py

To evaluate models

# creates a result.csv with all articles in the specified article source evaluated by the models, with the actual categories and the predicted categories
$ python evaluation/evaluate_models.py

# takes the result.csv and calculates different accuracies depending on the probability used for filtering out predictions.
$ ruby evaluation/calculate_accuracy.rb evaluation/results_20191125-114435-stopwords.csv
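The threshold-based accuracy computation can be sketched in Python; the tuple layout below is an assumption about the result.csv columns, and the repository's Ruby script remains authoritative:

```python
def accuracy_at_threshold(rows, threshold):
    """Accuracy over predictions whose probability is at least `threshold`.

    rows: iterable of (actual_category, predicted_category, probability).
    Predictions below the threshold are discarded before measuring accuracy,
    mirroring how low-confidence predictions are filtered out.
    """
    kept = [(actual, pred) for actual, pred, prob in rows if prob >= threshold]
    if not kept:
        return 0.0
    return sum(actual == pred for actual, pred in kept) / len(kept)

rows = [
    ("sports", "sports", 0.92),    # confident, correct
    ("economy", "sports", 0.41),   # low confidence, wrong
    ("culture", "culture", 0.85),  # confident, correct
]
print(accuracy_at_threshold(rows, 0.5))  # only the two confident rows remain
```

Raising the threshold typically trades coverage (fewer articles receive a prediction) for precision on the predictions that remain.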

To run a local webservice

$ python web/app.py

To run sagemaker docker container

$ ./build_and_push.sh sagemaker-auto-categorization
# Then open a notebook to train the model or create an endpoint, by starting Jupyter:
$ jupyter notebook

Test endpoint

https://localhost:8080/invocations
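The exact request schema is not documented here; assuming the endpoint accepts a JSON body carrying the article text, a payload might look like:

```json
{"text": "Full text of the article to categorize"}
```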

Deploy

To deploy the project stack, package the CloudFormation template and then deploy it:

$ aws cloudformation package --template serverless-sagemaker-orchestration/cloudformation/continuous_sagemaker.serverless.yaml --s3-bucket sagemaker-autocategorization > serverless-sagemaker-orchestration/cloudformation/continuous_sagemaker.serverless.yaml.package
$ aws cloudformation deploy --stack-name autocategorizationtest --template-file serverless-sagemaker-orchestration/cloudformation/continuous_sagemaker.serverless.yaml.package --capabilities  CAPABILITY_NAMED_IAM

Then build and push the ECR container that will serve the model, which is needed to create the endpoint:

$ ./_sagemaker/build_and_push.sh autocategorization

For more guides, see the notebooks in the notebooks folder
