@unifed/backend-ml

This package currently holds Unifed's machine learning code.

Unifed currently utilises two forms of machine learning:

A spam detection filter.
A text toxicity classifier.

The spam detection models are created and trained in this package, whereas the text toxicity classifier utilises the pre-trained @tensorflow-models/toxicity model.

Spam Detection

The majority of the code in this package is for training a spam detection model.

Training data located in the data directory is converted into a common form, using the parsers located in src/parsers.
The training data is then tokenized, using src/tokenizer.ts.
A tensor is created using this data with the code in src/tensor.ts.
The models used in src/models are trained with the data.
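
The steps above can be sketched roughly as follows. This is a minimal illustration only: `tokenize`, `buildVocabulary` and `encode` are hypothetical helpers and are not taken from src/tokenizer.ts or src/tensor.ts, and the reserved ids (0 for padding, 1 for unknown words) are an assumption.

```typescript
// Hypothetical helpers for illustration; not the package's actual code.

/** Lowercases the text and splits it into simple word tokens. */
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

/** Assigns each distinct token an integer id. Ids 0 and 1 are reserved
 *  (assumed convention): 0 for padding, 1 for unknown words. */
function buildVocabulary(samples: string[][]): Map<string, number> {
  const vocabulary = new Map<string, number>();
  for (const tokens of samples) {
    for (const token of tokens) {
      if (!vocabulary.has(token)) {
        vocabulary.set(token, vocabulary.size + 2);
      }
    }
  }
  return vocabulary;
}

/** Converts tokens to a fixed-length row of ids, truncating or zero-padding. */
function encode(
  tokens: string[],
  vocabulary: Map<string, number>,
  length: number
): number[] {
  const ids = tokens.slice(0, length).map((t) => vocabulary.get(t) ?? 1);
  while (ids.length < length) {
    ids.push(0);
  }
  return ids;
}

const samples = ["Win a FREE prize now", "Are we meeting at noon"].map(tokenize);
const vocabulary = buildVocabulary(samples);

// One fixed-length row per message, ready to be wrapped in a tensor.
const matrix = samples.map((tokens) => encode(tokens, vocabulary, 6));
console.log(matrix);
```

A rectangular array like `matrix` is the kind of input that a call such as `tf.tensor2d` from TensorFlow.js accepts. Whether src/tensor.ts builds its tensor this way is not stated here.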

src/train.ts provides a command line utility for training the models, whereas src/test-model.ts provides a command line utility for assessing the performance of models.
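
The metrics reported by src/test-model.ts are not listed here. As a rough sketch, the usual measures for a binary spam classifier can be computed as below. `evaluateSpamPredictions` and `SpamMetrics` are hypothetical names, not part of this package.

```typescript
// Hypothetical evaluation helper; not taken from src/test-model.ts.

interface SpamMetrics {
  accuracy: number;
  precision: number;
  recall: number;
  f1: number;
}

/** Compares predicted spam flags against the true labels. */
function evaluateSpamPredictions(
  predicted: boolean[],
  actual: boolean[]
): SpamMetrics {
  if (predicted.length !== actual.length) {
    throw new Error("predicted and actual must have the same length");
  }

  let tp = 0; // spam correctly flagged
  let fp = 0; // ham wrongly flagged as spam
  let fn = 0; // spam that was missed
  let tn = 0; // ham correctly passed

  predicted.forEach((isSpam, i) => {
    if (isSpam && actual[i]) tp++;
    else if (isSpam) fp++;
    else if (actual[i]) fn++;
    else tn++;
  });

  // Avoid dividing by zero when a class never occurs.
  const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);

  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  return {
    accuracy: ratio(tp + tn, predicted.length),
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
  };
}

const report = evaluateSpamPredictions(
  [true, true, false, false],
  [true, false, true, false]
);
console.log(report);
```

For a spam filter, precision is often the number to watch, since a false positive hides a legitimate post.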

An API to utilise the models is exposed in src/index.ts, which can be used by other packages.

Training Data

Training data is located in the data directory. The sources for the training data are as follows:

enron.zip - Source
sms.zip - Source
spam-assasin.zip - Source
testing.zip - Source

Models

The models used have been taken from the following sources:

dense (trained) - Source
dense-pooling (trained) - Source
twilio-dense (trained) - Source
lstm (not trained) - Source
bi-directional-lstm (not trained) - Source

Some models have not been trained, as we did not have the computing resources to do so in a reasonable amount of time. Training and evaluating these would be an interesting project extension.

Artifacts

The models directory contains the trained models, along with all of their configuration information. Because these models take time to train, they are checked into the repository.

The meta directory contains statistics about the training data, which are used in the report. This directory is not committed, as it contains hundreds of thousands of lines.

Development and Evaluation

A detailed report outlining the development and evaluation of the spam detection filter is available in both the 3rd and 4th deliverables.

Text Toxicity

The text toxicity classifier utilises the pre-trained @tensorflow-models/toxicity model.

This package provides a simple API around the model in order to classify single pieces of text.
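
A wrapper of this kind might look like the sketch below. It is an assumption about the general shape, not this package's actual API: `matchedToxicityLabels` and the two interfaces are hypothetical, although the interfaces mirror the prediction structure returned by the `classify` method of @tensorflow-models/toxicity. A stub stands in for the real model so that the example runs without downloading anything.

```typescript
// Mirrors the prediction structure returned by the toxicity model's classify().
interface ToxicityPrediction {
  label: string;
  results: Array<{ probabilities: ArrayLike<number>; match: boolean | null }>;
}

interface ToxicityClassifier {
  classify(sentences: string[]): Promise<ToxicityPrediction[]>;
}

/** Hypothetical helper: returns the labels that matched for one piece of text. */
async function matchedToxicityLabels(
  model: ToxicityClassifier,
  text: string
): Promise<string[]> {
  const predictions = await model.classify([text]);
  return predictions
    .filter((prediction) => prediction.results[0]?.match === true)
    .map((prediction) => prediction.label);
}

// Stub classifier used only for this example. In real use the model would
// come from the library's load function, e.g. `await toxicity.load(0.9, [])`.
const stubModel: ToxicityClassifier = {
  async classify(sentences) {
    return [
      {
        label: "insult",
        results: sentences.map(() => ({ probabilities: [0.1, 0.9], match: true })),
      },
      {
        label: "threat",
        results: sentences.map(() => ({ probabilities: [0.95, 0.05], match: false })),
      },
    ];
  },
};

matchedToxicityLabels(stubModel, "example text").then((labels) => {
  console.log(labels);
});
```

Accepting the model as a parameter keeps the helper easy to test, and lets the caller load the model once and reuse it across requests.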
