arekit-0.22.0
nicolay-r released this 17 Mar 11:46
Release Notes 🎉
- Pipelines integration!
- Pipelines are now used in text processing, which can be divided into tokenization, entity assignment, and frame assignment stages.
- Repositories for opinions and network input samples!
- Support for custom storage kernels for opinions and samples! Pandas is used by default.
- Opinion-related services turned into providers: pairs, opinions, text-opinions, etc.
NOTE: issue #232 has been moved to the next release.
This version does not support RuAttitudes collection news parsing!
This will be fixed in the upcoming project.
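As a rough illustration of the staged text processing mentioned in the notes (tokenization, then entities, then frames), here is a minimal sketch. It does not use the AREkit API: the function names `tokenize`, `annotate_entities`, `annotate_frames`, and `run_pipeline`, as well as the tiny entity and frame vocabularies, are assumptions made up for this example.

```python
# Hypothetical sketch of a staged text-processing pipeline.
# None of these names come from AREkit; they only illustrate the idea
# of chaining independent stages over a list of terms.

def tokenize(terms):
    """Split every input string into whitespace-separated tokens."""
    tokens = []
    for term in terms:
        tokens.extend(term.split())
    return tokens


def annotate_entities(terms, known_entities=frozenset({"Moscow", "Washington"})):
    """Wrap terms found in an (assumed) entity vocabulary."""
    return ["<ENTITY:{}>".format(t) if t in known_entities else t for t in terms]


def annotate_frames(terms, known_frames=frozenset({"criticized", "supports"})):
    """Wrap terms found in an (assumed) frame vocabulary."""
    return ["<FRAME:{}>".format(t) if t in known_frames else t for t in terms]


def run_pipeline(text, stages):
    """Pass the data through every stage in order."""
    data = [text]
    for stage in stages:
        data = stage(data)
    return data


result = run_pipeline(
    "Moscow criticized Washington",
    [tokenize, annotate_entities, annotate_frames],
)
print(result)
```

Because each stage only consumes and produces a list of terms, a stage can be dropped or reordered without touching the others, which is the main appeal of splitting processing this way.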
Changelog
v0.22.0-rc (2022-03-17)
Changes
Implemented enhancements:
- `create_term_embedding` -- Embedding algorithm based on parts requires useless check #298
- UnitTests -- BertOntoNotes is no longer below the core processing #293
- SingleLabelScaler -- provide [QUICK] #291
- BRAT visualization -- support processing in case of multiple documents. #286
- Entity -- IDs Refactoring #280
- BaseSampleRowProvider -- provide sentence id #279
- BRAT tool -- adopt ui as a callback for the predict pipeline #275
- ExperimentIterationHandler -- add Labeled Output Samples conversion to OpinionCollection #270
- InferenceContext -- split bags and samples extraction from a single method [Quick] #268
- DataFolding -- organize united data folding. #267
- BaseDataFolding -- iter_index is not related to the base implementation #266
- DataFolding -- move into experiment context #264
- DataIO (exp_data var) -- rename it to `ExperimentContext` #263
- ExperimentIterationHandler (Callback before) -- organize ExperimentEvaluationCallback #262
- NetworkCallback -- this callback should not inherit experiment base Callback #261
- Neural Network Hidden states writers and providers refactoring #260
- TrainingCallback -- separate onto `TrainingTerminationCallback` and `HiddenWriterCallback` classes. #259
- BaseTensorflowModel -- simplify `fit` and `predict` operations. #258
- LabeledCollection -- remove `is_empty` and `reset_labels` api #257
- NetworkCallback -- move train/predict notification info into callback #256
- Tensorflow saver -- move the related logic outside of the model implementation #255
- DefaultSingleLabelAnnotationAlgorithm -- single label is not a part of the algo #244
- `ThreeScaleTaskAnnotator` -- rename and move into core. #243
- Data/output -- create pipelines directory with the related output processing #240
- Examples -- document parsing executes twice #239
- Might be utilized pipeline implementation #238
- OpinionsProvider -- performs two actions, including ids assignation #236
- entity_to_group_func -- `BaseExperiment` should not provide this method. #235
- TextOpinionHelper -- to news/parsed/providers (implement the latter as a provider) #233
- DefaultSingleLabelAnnotationAlgorithm -- iter_opinion duplicates the generalized pair opinion pair creation approach #231
- Common `languages` dir -- move its contents into processing contrib. #229
- Linked Text Opinions Refactoring. #228
- Lemmatization should be a part of the frames processing pipeline stage #226
- DefaultTextParser -- this class is actually a Tokenizer #225
- News -- text-opinions provider and entities access API might be a part of a `ParsedNews` by means of `NewsParser` (new class) #224
- StringLabelsFormatter -- switch to label_types instead of label instances. #223
- AnnotationAlgorithm -- iter_opinions requires EntitiesCollection while the latter utilized for entities iteration #222
- TextParseOptions -- add `keep_tokens` #221
- FrameVariantsParser -- return modified terms only #218
- FramesAnnotation -- `is_inverted` flag and processing should be a pipeline item #217
- FramesCollection -- use `FrameConnotationProvider` instead #216
- FrameVariantsParser -- move into processing subfolder. #215
- OpinionOperations -- remove `try_read_annotated_opinion_collection` #213
- DocumentOperation -- unify iter_doc_ids operation into one with `tag` parameter. #212
- OpinionOperations -- move readers* into IO. #211
- OpinionCollectionsProvider -- serialization should not be a part of this class #210
- data -- separate data-related information from the experiment #209
- BaseInputReader -- class stores `_df`, however it should be replaced with `BaseRowsStorage` #207
- Repositories -- fill method should be a part of a `storage` rather than provider. #204
- BaseStorage -- exclude `save` method into separated class BaseRowsWriter #202
- Experiments -- rename `formats` to `api` (QUICK) #201
- Embedding and Vocabulary -- organize Storage/Repository with `serialize`/`load` operations. #200
- Sample -- remove dependency from DefaultNetworkConfig. #199
- BaseOutputFormatter -- both provider and formatter mixes `df` usage #198
- OpinionProvider -- remove dependency from Opinion and Document Operation instances. #197
- Repositories -- add this class which unite all the providers for data writing #195
- Add column providers #194
- NetworkSampleFormatter -- switch to provider #193
- BaseSampleStorage -- use `store_labels` instead of `data_type` passing (QUICK) #192
- NetworkOutputEncoder -- separate formatting from serialization. #191
- BaseSampleFormatter -- `__create_row` is not related to the Formatter, should be moved. #190
- BaseDocumentStatGenerator -- provider depends on IO files. #189
- OpinonFormatter -- use the latter in experiment io. #188
- News -- remove `return_text` parameter from iter_sentences method (QUICK) #187
- BaseRowsFormatter -- move `format` method in another class #185
- BaseSampleFormatter -- `_iter_sentence_terms` should not be a part of this class. (QUICK) #184
- BaseSampleFormatter -- `_provide_rows` behavior depends on row_ids_provider instance type. #182
- BaseSampleFormatter -- remove `data_type` parameter from ctor #181
- BaseObjectParser -- `parse` method should return object of the same type as `sentence` #179
- News -- remove `entities_parser` instance from News class. #178
- BaseEntitiesParser -- generalize to BaseObjectsParser. #177
- Provide SHA checksums utilization for downloaded resources. #176
- OpinionCollectionsFormatter -- use it as instance, created within `with` block #175
- BaseOutput -- move `_csv_to_dataframe` out of this class. #174
- DataIO -- remove `Stemmer` instance #172
- BaseRowsFormatter -- `formatter_type_log_name` method should be removed. #171
- BaseOpinionsFormatter -- leave `save` method implementation for inheritor classes. #170
- BaseSampleFormatter -- leave `save` method implementation for inheritor classes. #169
- BaseIOUtils -- remove dependencies from file/(path) based data storage format #168
- BaseIOUtils -- `get_input_sample_filepath` and `get_input_opinions_filepath` limit possible storage abilities. #166
- perform_reading_and_initialization -- provide samples reader. #165
- perform_reading_and_initialization -- remove dependency from `doc_ops` #164
- NetworkInputSampleReader -- remove inheritance from TSV-based reader. #163
- OpinionCollectionsFormatter -- use `save_to` and `load_from` notation for method names with source provider (file/archive/storage, etc.). #142
- RuSentRelOpinionCollectionFormatter -- move all the opinion iteration during saving/loading into base class #141
- news_id or doc_id -- normalize class and field names #133
- embeddings subdir -- considered to be a part of networks contrib #132
- Sentiment frame polarity (A0->A1) considered to be a part of the related experiment. #118
- EnumServices -- provide a base class with string to Enum conversion functionality #117
- EntityFormaters -- Move formaters into the particular experiment implementation #116
- _create_parse_options -- remove this method from DocumentOperations across all the experiments. #112
- NewsParseOptions -- provide this options for the particular `DefaultParser` derived from `TextParser` #111
- TextParser -- Provide a separated class with a text processing algorithm implementation API #75
- Providing all the logging information into log_utils.py #30
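For the SHA checksum item in the list above (#176), the general technique can be sketched as follows. This is not the AREkit implementation: the helper names `sha256_of_file` and `verify_download`, the choice of SHA-256, and the temporary file used for the demonstration are all assumptions for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True when the file digest matches the expected checksum."""
    return sha256_of_file(path) == expected_hex.lower()


# Demonstration with a temporary file standing in for a downloaded resource.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example resource")
    tmp_path = tmp.name

expected = hashlib.sha256(b"example resource").hexdigest()
ok = verify_download(tmp_path, expected)   # matching checksum
bad = verify_download(tmp_path, "0" * 64)  # deliberately wrong checksum
os.remove(tmp_path)
```

Reading in chunks keeps memory use flat for large resources such as embedding files.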
Fixed bugs:
- ModuleNotFoundError: No module named 'arekit.common.data.input.providers.instances' #301
- UnitTests -- Discard RuAttitudes-v1.2 support due to `index out of range` exception on reading #295
- text_opinions_iter_pipeline -- ids assignments vary after multiple calls #278
- EntitiesParser -- provide doc_level ids #277
- DeepPavlovNER -- BertOntoNotes entities annotation [Treating string and list-based text representation simultaneously] #274
- Examples -- get_index_by_term of Vocabulary failed #271
- Annotator Performance -- keeps all possible pairs between entities. #253
- Network SampleID -- has type `unicode`, but expected to be integer type #248
- Example -- given two sentences results in samples of only last of them. #246
- UnitTests -- Incorrect labels formatter (QUICK) #186
- test_samples_iter.py -- incorrect API usage in Tensorflow contrib. #158
Closed issues:
- Transfer examples folder into separated project [ARElight] #300
- RuSentRel Experiment -- Text is lemmatized irrespect of the save_lemmas parameter in parser [OK] #297
- Experiment -- refactor inference pipeline implementation #290
- Example -- reorganize infer folder (experiment) #289
- Experiment -- Organize pipeline stages as items of the BasePipeline #285
- BaseSampleRowProvider -- provide entity values and entity types. [QUICK] #283
- DeepPavlov NER -- adopt BERTontonotes. #272
- NeuralNetworks -- graph and tf session should be initialized before the `predict` method call. #247
- NewsServiceCollection -- implement #245
- numpy 1.19.5 -- returns int64 by default #242
- Organize unit tests for Output to Opinion conversion pipeline #241
- Iter_opinions_collection -- complicated, considering pipeline processing instead #237
- EntitiesCollection -- provide `value_to_group` function instead of SynonymsCollection. #230
- BaseTextParser -- `parse_news` is not related to the text parsing concepts and should be a part of the another class #220
- DocumentOperations -- `_get_text_parser` should not be a part of this API #219
- Create simple parser for text with mentioned [entities] #214
- NetworkInputHelper -- performing `serialize_missed_collections` during writing process #208
- RowIDs -- should be common for input and output #206
- SampleRowBalancerHelper -- simplify by using `pandas` group sampling #203
- convert_output_to_opinion_collections -- pass opinion reader into parameters. #167
- Experiment -- Separate TSV-based formater from based one for samples and opinions #162
- Switch to Python3.6 #160
- RuSentRel Experiment Contrib -- update description #153
- Provide Cache for data sources #151
- SynonymsCollection considered in ReadOnly mode only #5
Merged pull requests: