
An automated text processing tool, that exports a text/article to an abstract (with keywords)


Auto categorizer

This project aims to create a self-learning application that can detect the subject of a text. The algorithm consists of three steps:

  • Preprocess the text by filtering out non-words, stripping HTML, identifying nouns and verbs, and replacing entities with tag names
  • Train a doc2vec model that converts the input text into a document vector used in the categorization step
  • Train an LSTM categorizer by feeding the processed training data through the doc2vec model into the LSTM model
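The preprocessing step above can be sketched in plain Python. This is a minimal illustration of the HTML-stripping and non-word filtering only; the noun/verb identification and entity replacement require an NLP library and are omitted here, and the regexes are assumptions rather than the project's actual implementation:

```python
import re

def preprocess(text):
    """Strip HTML and keep only word tokens, lowercased."""
    # Strip HTML tags with a simple regex (a real pipeline may use a proper parser).
    text = re.sub(r"<[^>]+>", " ", text)
    # Keep only alphabetic tokens, filtering out numbers and punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t.lower() for t in tokens]

print(preprocess("<p>Hello, world 42!</p>"))
```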

This work is a clean version of the project described here

To install

Navigate to the folder where you want the project:

$ git clone https://github.com/jolin1337/Text2Abstract auto-categorizer
$ cd auto-categorizer
$ python -m venv ./venv
$ ln -s venv/lib/python3.5/site-packages/learning learning
$ python -m pip install -r requirements.txt
$ cp .env.sample .env
$ export PYTHONPATH=`pwd`
# Add SOLDR_TOKEN to the .env file, then run:
$ source .env
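The .env file only needs the token; the value below is a placeholder:

```
SOLDR_TOKEN=<your-soldr-token>
```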

To fetch data, run

$ python learning/mm_services/fetch_data.py learning/data/new_metadata_articles.json

Create a list of top categories, one category per line, and name it new_top_categories.txt
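For example (the category names here are purely illustrative), new_top_categories.txt might contain:

```
sports
politics
economy
culture
```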

To train

All training parameters are defined in the configuration file in the config folder

$ python learning/model.py

To evaluate models

# creates a result.csv with all articles in the specified article source evaluated by the models, with the actual categories and the predicted categories
$ python evaluation/evaluate_models.py

# takes the result.csv and calculates different accuracies depending on the probability used for filtering out predictions.
$ ruby evaluation/calculate_accuracy.rb evaluation/results_20191125-114435-stopwords.csv
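The threshold-based accuracy computation can be sketched in Python; the tuple layout below is an assumption about the result.csv columns, and the repository's Ruby script remains authoritative:

```python
def accuracy_at_threshold(rows, threshold):
    """Accuracy over predictions whose probability is at least `threshold`.

    rows: iterable of (actual_category, predicted_category, probability).
    Predictions below the threshold are discarded before measuring accuracy,
    mirroring how low-confidence predictions are filtered out.
    """
    kept = [(actual, pred) for actual, pred, prob in rows if prob >= threshold]
    if not kept:
        return 0.0
    return sum(actual == pred for actual, pred in kept) / len(kept)

rows = [
    ("sports", "sports", 0.92),    # confident, correct
    ("economy", "sports", 0.41),   # low confidence, wrong
    ("culture", "culture", 0.85),  # confident, correct
]
print(accuracy_at_threshold(rows, 0.5))  # only the two confident rows remain
```

Raising the threshold typically trades coverage (fewer articles receive a prediction) for precision on the predictions that remain.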

To run a local webservice

$ python web/app.py

To run sagemaker docker container

$ ./build_and_push.sh sagemaker-auto-categorization
# Then open a notebook to train the model or create an endpoint, by starting Jupyter:
$ jupyter notebook

Test endpoint

https://localhost:8080/invocations
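The exact request schema is not documented here; assuming the endpoint accepts a JSON body carrying the article text, a payload might look like:

```json
{"text": "Full text of the article to categorize"}
```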

Deploy

To deploy the project stack, package the CloudFormation template and then deploy it:

$ aws cloudformation package --template serverless-sagemaker-orchestration/cloudformation/continuous_sagemaker.serverless.yaml --s3-bucket sagemaker-autocategorization > serverless-sagemaker-orchestration/cloudformation/continuous_sagemaker.serverless.yaml.package
$ aws cloudformation deploy --stack-name autocategorizationtest --template-file serverless-sagemaker-orchestration/cloudformation/continuous_sagemaker.serverless.yaml.package --capabilities  CAPABILITY_NAMED_IAM

Then build and push the ECR container that will serve the model, which is needed to create the endpoint:

$ ./_sagemaker/build_and_push.sh autocategorization

For more guides, see the notebooks in the notebooks folder
