GitHub - lailaglez/named-entity-recognition: Collection of scripts for Named Entity Recognition in Short Messages such as Tweets

This is a collection of scripts used to test a methodology for Named Entity Recognition (NER) in short texts such as tweets. The methodology was developed as part of my bachelor's thesis, which can be found (in Spanish) in the Documentation folder.

Thesis abstract

Short messages (like tweets and SMS) are a potentially rich source of continuously and instantly updated information. The lack of context and the informality of such messages are challenges for traditional Named Entity Recognition systems. Most efforts in this direction rely on supervised machine learning techniques, which are expensive in terms of data collection and training. In this thesis, we present a semi-supervised approach to Named Entity Recognition using self-training. We use wikis as external knowledge and unsupervised features to improve portability. We also evaluate whether case is used correctly in each tweet. This avoids one of the most common problems of traditional named entity recognition systems when facing noisy environments like Twitter: excessive dependence on title case as an indicator of the presence of an entity. The results obtained when applying this methodology are similar to those achieved by the most popular and efficient named entity recognition systems. These results validate the effectiveness of the methodology.

Features

The features extracted from each tweet can be divided into four groups.

Classic features

We call the first group of features "classic" because they are the ones most commonly used in NER. This group is made up of orthographic, syntactic and grammatical features. It includes the token's sequence of characters, suffixes, prefixes, part of speech and neighbors.
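
As an illustration of the feature types listed above, here is a minimal sketch and not the code in features_classic.py. The function name `classic_features`, the affix length, the `<START>`/`<END>` padding markers and the feature names are all assumptions made for this example. It expects a tweet that has already been POS-tagged as (token, tag) pairs.

```python
from typing import Dict, List, Tuple


def classic_features(tagged: List[Tuple[str, str]], i: int,
                     affix_len: int = 3) -> Dict[str, str]:
    """Build a feature dict for token i of a POS-tagged tweet (hypothetical helper)."""
    token, pos = tagged[i]
    feats = {"token": token, "pos": pos}

    # Prefixes and suffixes of length 1..affix_len (the length is an assumption).
    for n in range(1, affix_len + 1):
        if len(token) >= n:
            feats[f"prefix_{n}"] = token[:n]
            feats[f"suffix_{n}"] = token[-n:]

    # Left neighbor, or a padding marker at the start of the tweet.
    if i > 0:
        feats["prev_token"], feats["prev_pos"] = tagged[i - 1]
    else:
        feats["prev_token"] = "<START>"

    # Right neighbor, or a padding marker at the end of the tweet.
    if i < len(tagged) - 1:
        feats["next_token"], feats["next_pos"] = tagged[i + 1]
    else:
        feats["next_token"] = "<END>"

    return feats
```

Calling `classic_features(tagged, i)` for every position in a tweet gives one feature dict per token, which is the shape most sequence classifiers expect.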

Unsupervised features

These were extracted by joining our 7000 annotated tweets with unannotated tweets. Features are obtained as a result of three clustering algorithms: Brown clusters, Clark clusters, and word2vec. These clustering algorithms tend to assign named entities that belong to the same class to the same cluster.
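
One way to turn cluster assignments into features is a plain lookup, sketched below. This is an assumption about the general shape, not the code in features_unsupervised.py: the function `cluster_features`, the lowercasing of tokens, the `UNK` fallback and the toy cluster ids are all hypothetical.

```python
from typing import Dict, Mapping


def cluster_features(token: str,
                     cluster_maps: Mapping[str, Mapping[str, str]]) -> Dict[str, str]:
    """Look up a token's cluster id in each word -> cluster map (hypothetical helper)."""
    key = token.lower()  # assumption: the cluster maps were built on lowercased text
    return {
        f"{name}_cluster": mapping.get(key, "UNK")  # "UNK" marks unseen words
        for name, mapping in cluster_maps.items()
    }


# Toy maps with made-up cluster ids, only to show the expected input shape.
toy_maps = {
    "brown": {"havana": "0110", "madrid": "0110"},
    "clark": {"havana": "17"},
}
features = cluster_features("Havana", toy_maps)
```

If two city names land in the same cluster, as in the toy map, the classifier sees the same feature value for both even when one of them never appears in the annotated tweets.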

Proper case use

Due to the importance of capital letters for most entity detection systems, our corpus of annotated tweets was compared to a more formal and well-formatted text. This allowed us to determine for each tweet whether or not case was properly used.
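
The README does not say how the comparison is carried out, so the following is only one possible sketch. It assumes that a casing reference is built from the formal text (the most frequent surface form of each word, ignoring sentence-initial tokens) and that a tweet counts as properly cased when enough of its known tokens match that reference. The function names and the 0.8 threshold are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def build_case_reference(formal_sentences: Iterable[List[str]]) -> Dict[str, str]:
    """Map each lowercased word to its most frequent surface form in formal text."""
    counts = defaultdict(Counter)
    for sentence in formal_sentences:
        # Skip the first token: a sentence-initial capital says little
        # about how the word is normally cased.
        for token in sentence[1:]:
            counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def uses_proper_case(tweet_tokens: List[str], reference: Dict[str, str],
                     threshold: float = 0.8) -> bool:
    """Return True if enough known tokens are cased as in the reference."""
    known = [t for t in tweet_tokens if t.lower() in reference]
    if not known:
        return False  # no evidence either way: treat the casing as unreliable
    matches = sum(1 for t in known if reference[t.lower()] == t)
    return matches / len(known) >= threshold
```

The resulting boolean can be attached to every token of the tweet, so that a classifier can learn to trust capitalization only in tweets where it is used consistently.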

Global features

It has been shown that knowledge bases and dictionaries improve NER. However, building and maintaining these knowledge bases is a costly process. We propose the use of an existing collection of knowledge: Wikipedia.

The first attribute obtained using global knowledge indicates whether there exists a Wikipedia article whose title matches the token or a collocation that contains it.
The second attribute is explored only if a Wikipedia article is found. We obtain the article's category list and remove from it the categories containing the words "article", "Wikipedia" or "page", since these categories have no semantic value. We then select the category that is most common in the tweet corpus.
The third attribute is also obtained only if the article is found. A Wikipedia article's first sentence usually describes the named entity associated with it. We use as the token's description the first noun that follows the verb "to be". For example: "Bartolomé Maximiliano Moré (24 August 1919 – 19 February 1963), known as Benny Moré and Beny Moré (in Spanish), was a Cuban singer, bandleader and songwriter." In this sentence, the first noun after "was" is "singer".
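
The second and third attributes can be sketched as follows. This is a simplified illustration, not the code in features_global.py. It assumes that the category list and a POS-tagged first sentence (Penn Treebank tags) are already available, that "most common in the tweet corpus" means a count of how often each category was seen for tokens in the corpus, and that substring matching is enough to detect the three filtered words. The names `select_category`, `describe`, `NOISE_WORDS` and `TO_BE_FORMS` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

NOISE_WORDS = ("article", "wikipedia", "page")
TO_BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}


def select_category(categories: Sequence[str],
                    corpus_counts: Counter) -> Optional[str]:
    """Drop maintenance categories, then pick the one seen most often in the corpus."""
    kept = [c for c in categories
            if not any(word in c.lower() for word in NOISE_WORDS)]
    if not kept:
        return None
    # Counter returns 0 for categories never seen in the corpus.
    return max(kept, key=lambda c: corpus_counts[c])


def describe(tagged_sentence: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Return the first noun that follows a form of the verb 'to be'."""
    seen_to_be = False
    for word, pos in tagged_sentence:
        if seen_to_be and pos.startswith("NN"):
            return word
        if word.lower() in TO_BE_FORMS:
            seen_to_be = True
    return None
```

For the Benny Moré sentence, a tagger that labels "Cuban" as an adjective and "singer" as a noun would make `describe` return "singer". Both functions return `None` when nothing usable is found, which can be encoded as a separate feature value.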

Classifiers

Many classifiers were tested, both traditional and sequence-based. The traditional classifiers include ID3, SVM, Naive Bayes, Random Forest, Stochastic Gradient Descent, and AdaBoost. The sequence-based ones include Hidden Markov Models, Maximum Entropy Models, and Conditional Random Fields. Conditional Random Fields proved to be the most suitable for our problem.

Other tests

Dimensionality reduction and data balancing algorithms were also tested.

Corpus

The corpus used for these tests is the xLime Twitter Corpus, which can be found here. Rei, Luis, Dunja Mladenić and Simon Krek: A Multilingual Social Media Linguistic Corpus. 2014.

Data

The vectors resulting from the feature extraction can be found here.

Folders and files

Documentation in Spanish
_utils
clusters
collocations
vocabulary
README.md
classic_classifier.py
classic_classifier_balancing.py
dimension_reduction.py
english-bidirectional-distsim.tagger
feature_vectors.py
features.py
features_case.py
features_classic.py
features_global.py
features_unsupervised.py
filter_stopwords.py
hiddenmarkov.py
hiddenmarkov2.py
imbalanced.py
maxentropy.py
plotting.py
probabilities.txt
self_training.py
self_training_more_entities.py
stanford-postagger.jar
structured_classifier.py
subset_result
