# idNet

Neural net language identification for many languages on short texts, plus construction-based dialectometry.

This package provides a generalized architecture for language identification (LID) and dialect identification (DID) using a multi-layer perceptron built with Keras. DID also supports a linear SVM classifier using scikit-learn.

To load a model:

```python
from idNet import idNet_Enrich

lid = idNet_Enrich("Path to model file", s3_bucket)
did = idNet_Enrich("Path to model file", s3_bucket)
```

`s3_bucket` takes a string naming an optional s3 bucket to load the model from. The model filename must contain the necessary prefixes.
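For example, loading one model locally and one from s3 might look like the sketch below. The filenames and bucket name are hypothetical (they only follow the ".LID"/".DID" and ".MLP"/".SVM" naming convention noted further down), and passing an empty string when no bucket is used is an assumption:

```python
from idNet import idNet_Enrich

# Hypothetical filenames: only the ".LID"/".DID" and ".MLP"/".SVM" markers
# are required by the package; the rest of each name is illustrative.
lid = idNet_Enrich("Language.LID.MLP.hdf", "")            # local file, no s3 bucket (assumed)
did = idNet_Enrich("eng.DID.SVM.p", "my-model-bucket")    # load from an s3 bucket
```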

Once a LID model is loaded, it has the following properties:

| Property | Description |
| --- | --- |
| `lid.n_features` | Number of features in the model (i.e., hashing bins) |
| `lid.n_classes` | Number of languages in the model |
| `lid.lang_mappings` | Dictionary of `{"iso_code": "language_name"}` mappings for all ISO 639-3 codes |
| `lid.langs` | List of ISO 639-3 codes for the languages present in the current model |
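As a quick sketch of how these properties can be inspected (assuming `lid` was loaded as above):

```python
# Report the model's size and list its languages by name.
print(lid.n_features, "hashing bins,", lid.n_classes, "languages")

for iso_code in lid.langs:
    print(iso_code, "->", lid.lang_mappings[iso_code])
```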

Once a DID model is loaded, it has the following properties:

| Property | Description |
| --- | --- |
| `did.n_features` | Number of features in the grammar used to learn the model |
| `did.n_classes` | Number of countries in the model |
| `did.country_mappings` | Dictionary of `{"iso_code": "country_name"}` mappings for all country codes used |
| `did.countries` | List of country codes for the regional (country-level) dialects present in the current model |
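The same kind of inspection works for a DID model (again a sketch, assuming `did` was loaded as above):

```python
# List the country-level dialect classes the model distinguishes.
print(did.n_classes, "countries,", did.n_features, "grammar features")

for code in did.countries:
    print(code, "->", did.country_mappings[code])
```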

Loaded models perform the following tasks:

| Method | Description |
| --- | --- |
| `lid.predict(data)` | Takes an array of strings or a single string; returns an array of predicted language codes |
| `did.predict(data)` | Takes an array of strings or a single string; returns an array of predicted country codes |

Note: Model filenames need to include ".DID"/".LID" and ".MLP"/".SVM" because this information is used to determine the model type!
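A minimal prediction sketch, assuming the `lid` and `did` models loaded above; the sample strings and the outputs shown in the comments are illustrative only:

```python
samples = ["This is a short English sentence.",
           "Ceci est une phrase en français."]

print(lid.predict(samples))     # e.g., an array of ISO 639-3 codes such as ["eng", "fra"]

# A single string can also be passed on its own.
print(did.predict("One sample passed as an individual string."))
```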

## Training New Models

To train new models, the training data needs to be prepared. This process is automated; see the Data_DID and Data_LID directories for directions and scripts.

```python
from idNet import idNet_Train

id = idNet_Train()
```

The `idNet_Train` constructor takes the following arguments (a usage sketch follows the table):

| Argument | Type | Description |
| --- | --- | --- |
| `type` | str | Whether to work with language or dialect identification |
| `input` | str | Path to the input folder |
| `output` | str | Path to the output folder |
| `s3 = False` | boolean | If True, use boto3 to interact with an s3 bucket |
| `s3_bucket = ""` | str | s3 bucket name as a string |
| `nickname = "Language"` | str | The nickname for saving / loading models |
| `divide_data = True` | boolean | If True, crawl for the dataset; if False, just load it |
| `test_samples = 20` | int | The number of files per class to use for testing |
| `threshold = 100` | int | Number of files required before a language/country is included in the model |
| `samples_per_epoch = 5` | int | Number of samples to use per training epoch |
| `language = ""` | str | For DID, specifies the language of the current model |
| `lid_sample_size = 200` | int | For LID, the number of characters to allow per sample |
| `did_sample_size = 1` | int | For DID, the number of 100-word samples to combine |
| `preannotate_cxg = False` | boolean | For DID, if True, enrich and save all CxG vectors |
| `preannotated_cxg = False` | boolean | For DID, if True, just load pre-enriched CxG vectors |
| `cxg_workers = 1` | int | For DID, the number of workers to use when pre-enriching the dataset |
| `class_constraints = []` | list of strs | Option to constrain which classes are included in the model |
| `merge_dict = {}` | dict | Dictionary of original-name : new-name mappings for merging classes |
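For instance, a DID training setup might look like the following sketch. The folder paths, nickname, and language code are placeholders, and it assumes the arguments above can be passed as keywords; only a few are shown and the rest keep their defaults:

```python
from idNet import idNet_Train

# Hypothetical paths and names; remaining arguments keep their defaults.
id = idNet_Train(type="DID",
                 input="./Data_DID",
                 output="./Models",
                 nickname="English",
                 language="eng",
                 test_samples=20,
                 threshold=100)
```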
```python
id.train()
```

The `train()` method takes the following arguments (a usage sketch follows the table):

| Argument | Type | Description |
| --- | --- | --- |
| `model_type = "MLP"` | str | "MLP" or "SVM" |
| `lid_features = 524288` | int | Number of character n-gram features to allow (hashing only) |
| `lid_ngrams = (1,3)` | tuple of ints | Range of n-grams to hash |
| `did_grammar = ".Grammar.p"` | str | Name of the C2xG grammar to use for annotation |
| `c2xg_workers = 1` | int | For DID, the number of workers for c2xg enrichment |
| `mlp_sizes = (300, 300, 300)` | tuple of ints | Size and number of layers; e.g., 3 layers at 300 neurons each |
| `cross_val = False` | boolean | Whether to use cross-validation rather than a held-out test set |
| `dropout = 0.25` | float | The amount of dropout to apply to each layer |
| `activation = "relu"` | str | The activation type; the name is passed directly to Keras |
| `optimizer = "sgd"` | str | The optimization algorithm; the name is passed directly to Keras |