Skip to content

koaning/classycookie

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🪐 spaCy Project: classycookie

Text classification benchmarks with scikit-learn and spaCy v3.

This project assumes that there is a data.csv file in the assets folder that contains data for a basic text classification task. This file needs to have one column named "text" and another one named "label". It can also contain an optional column called "set" which can contain either the value "valid" and "train". If this column is missing this project will make a random 50% split.

This data will be modelled in four different ways.

  1. a spaCy bag of words model.
  2. a spaCy CNN model.
  3. a scikit-learn countvectorizer logistic classifier
  4. a scikit-learn tf/idf logistic classifier

Setup

To set up this project, you only need to clone the spaCy project.

python -m spacy project clone --repo https://github.com/koaning/classycookie --branch main classification <folder-name>

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

Command Description
prepare Prepare the .csv for the next steps
convert Convert the data to spaCy's binary format
train-spacy Train the spaCy models
train-sklearn Train the spaCy models
evaluate Evaluate the model and export metrics

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

Workflow Steps
all prepareconverttrain-spacytrain-sklearnevaluate

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

File Source Description
assets/data.csv Local Original .csv file.

This project comes with a copy of the clinc dataset that you can replace at will.

End Result

The final result of the project will be a summary table like this one:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Model                      ┃ Train Score ┃ Valid Score ┃ Time Taken          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ sklearn-trained/pipe_tfid… │ 0.9262      │ 0.8582      │ 0.15783333778381348 │
│ sklearn-trained/pipe_cv.j… │ 0.9866      │ 0.8955      │ 0.14619827270507812 │
│ training/cnn/model-best    │ 0.9867      │ 0.8723      │ 3.962278127670288   │
│ training/bow/model-best    │ 0.9867      │ 0.8723      │ 4.037751197814941   │
└────────────────────────────┴─────────────┴─────────────┴─────────────────────┘

About

cookiecutter to run standard text classifiers

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published