Text preprocessing functions.
def get_ngram_module(word_list, prev_ngram=None, min_count=20, threshold=1)

Create an n-gram model, used to generate bigrams or higher-order n-grams.

Parameters

word_list : pandas.Series or list
    list of tokens
prev_ngram : gensim.models.Phrases, optional
    a lower-order Phrases model; if given, the function returns an n-gram model one order higher. Defaults to None.
min_count : int, optional
    Ignore all words and bigrams with a total collected count lower than this value. Defaults to 20.
threshold : int, optional
    Score threshold for forming the n-grams (higher means fewer n-grams). Defaults to 1.

Returns

gensim.models.Phrases
    n-gram model
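The documentation does not show the implementation, but a minimal sketch of how such a helper is typically built on gensim.models.Phrases could look like this (the function body and the toy corpus below are assumptions, not the actual source; the toy corpus needs a lower threshold than the default to trigger phrase detection):

    from gensim.models.phrases import Phrases

    def get_ngram_module(word_list, prev_ngram=None, min_count=20, threshold=1):
        # Hypothetical body. If a lower-order model is supplied, run the
        # documents through it first so the new model learns n-grams one
        # order higher (e.g. bigrams -> trigrams).
        if prev_ngram is not None:
            word_list = [prev_ngram[doc] for doc in word_list]
        return Phrases(word_list, min_count=min_count, threshold=threshold)

    # Usage: build bigrams first, then trigrams on top of them.
    docs = [["new", "york", "city"], ["new", "york", "times"]] * 20
    bigram = get_ngram_module(docs, min_count=2, threshold=0.05)
    trigram = get_ngram_module(docs, prev_ngram=bigram, min_count=2, threshold=0.05)
    print(trigram[bigram[["new", "york", "city"]]])  # e.g. ['new_york_city']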
def make_dictionary(series_words, large_list=False)

Make a dictionary out of a series of documents' token lists.

Parameters

series_words : pandas.Series
    documents' tokens; each element is a list of tokens for a document
large_list : bool, optional
    if True, series_words is a hydrated list of tokens in comma-separated format. Defaults to False.

Returns

gensim.corpora.Dictionary
    gensim dictionary
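As a sketch, the helper presumably wraps gensim.corpora.Dictionary; the large_list branch below is an assumption about how the comma-separated format is handled:

    from gensim.corpora import Dictionary

    def make_dictionary(series_words, large_list=False):
        # Hypothetical body: when large_list is True, each element is assumed
        # to be a comma-separated string of tokens that must be split first.
        if large_list:
            series_words = [doc.split(",") for doc in series_words]
        return Dictionary(series_words)

    docs = [["cat", "sat", "mat"], ["dog", "sat", "log"]]
    dictionary = make_dictionary(docs)
    print(dictionary.token2id)          # token -> integer id mapping
    print(dictionary.doc2bow(docs[0]))  # bag-of-words vector for the first document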
def most_common(words, n=10)

Returns the most common words in a document.

Parameters

words : list
    list of words in a document
n : int, optional
    number of top common words to return. Defaults to 10.

Returns

list
    list of the top n common terms
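This is typically a thin wrapper over collections.Counter; a sketch, assuming the returned list holds (term, count) pairs, which the docstring leaves open:

    from collections import Counter

    def most_common(words, n=10):
        # Counter.most_common returns the n highest-frequency (term, count) pairs.
        return Counter(words).most_common(n)

    print(most_common(["data", "text", "data", "nlp", "data", "text"], n=2))
    # [('data', 3), ('text', 2)]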
def tokenize_sentence(text)

Wrapper for nltk sent_tokenize. Returns a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer.

Parameters

text : string
    text to split into sentences

Returns

list
    list of sentence tokens
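Since the docstring states this is a wrapper for nltk's sent_tokenize, the behaviour can be reproduced directly (the punkt download line is an assumption about required setup):

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt", quiet=True)  # sent_tokenize needs the punkt sentence models

    def tokenize_sentence(text):
        return sent_tokenize(text)

    print(tokenize_sentence("Dr. Smith arrived. He was late."))
    # ['Dr. Smith arrived.', 'He was late.']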
def tokenize_word(text, normalise_case=True, keep_numerics=False, shortwords=3, remove_stopwords=True, lemmatize=True, gaps=False, addl_stopwords=[])

Tokenisation function. Accepts a text to be tokenised and applies the following steps in order:

1. normalise case to lower
2. tokenise using nltk RegexpTokenizer with pattern: r'\w+|$[0-9]+|\S+&[^.<>]'
3. remove numerics from the list of tokens
4. remove short words
5. remove English-language stop words and custom stop words
6. lemmatise tokens to base form

Parameters

text : string
    text to split into word tokens
normalise_case : bool, optional
    enable/disable case normalisation. Defaults to True.
keep_numerics : bool, optional
    whether to keep or remove numerics. Defaults to False.
shortwords : int, optional
    length of short words to be excluded. Defaults to 3.
remove_stopwords : bool, optional
    enable or disable removal of stop words. Defaults to True.
lemmatize : bool, optional
    enable or disable lemmatisation. Defaults to True.
gaps : bool, optional
    True to return the gaps between matches instead of the word tokens. Defaults to False.
addl_stopwords : list, optional
    additional stop words to be excluded during tokenisation. Defaults to [].

Returns

list
    list of tokens
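A sketch of the documented pipeline follows. The regex shown in step 2 above appears damaged in extraction, so the sketch substitutes the common r'\w+|\$[0-9]+|\S+' pattern; the step order and flags follow the documentation, but the body is a reconstruction, not the actual source:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import RegexpTokenizer

    for pkg in ("stopwords", "wordnet"):
        nltk.download(pkg, quiet=True)

    def tokenize_word(text, normalise_case=True, keep_numerics=False, shortwords=3,
                      remove_stopwords=True, lemmatize=True, gaps=False,
                      addl_stopwords=[]):  # mutable default kept to mirror the docs
        if normalise_case:
            text = text.lower()                                    # 1. case normalisation
        # 2. tokenise; simplified stand-in for the (garbled) documented pattern
        tokens = RegexpTokenizer(r"\w+|\$[0-9]+|\S+", gaps=gaps).tokenize(text)
        if not keep_numerics:
            tokens = [t for t in tokens if not t.isnumeric()]      # 3. drop numerics
        tokens = [t for t in tokens if len(t) > shortwords]        # 4. drop short words
        if remove_stopwords:                                       # 5. stop-word removal
            stops = set(stopwords.words("english")) | set(addl_stopwords)
            tokens = [t for t in tokens if t not in stops]
        if lemmatize:                                              # 6. lemmatise
            lemmatizer = WordNetLemmatizer()
            tokens = [lemmatizer.lemmatize(t) for t in tokens]
        return tokens

    print(tokenize_word("The cats were running across 3 large gardens."))
    # e.g. ['cat', 'running', 'across', 'large', 'garden']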