
Module helpers.process

Text preprocessing functions.

Functions

def get_ngram_module(word_list, prev_ngram=None, min_count=20, threshold=1)

Create an n-gram module, used to generate bigrams or higher-order n-grams

Args

word_list : pandas.Series or list

list of tokens

prev_ngram : gensim.models.Phrases, optional

a lower-order Phrases module; if provided, the function returns an n-gram module one order higher. Defaults to None.

min_count : int, optional

Ignore all words and bigrams with a total collected count lower than this value. Defaults to 20.

threshold : int, optional

Score threshold for forming the n-grams (higher means fewer n-grams). Defaults to 1.

Returns

gensim.models.Phrases

n-gram module
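
A minimal usage sketch, not part of the module itself: the toy corpus, the relaxed min_count/threshold values, and the chaining of the returned model via prev_ngram are assumptions based on the signature above, and word_list is assumed to be an iterable of tokenised documents, as gensim Phrases expects.

```python
# Hypothetical usage sketch for get_ngram_module; assumes helpers.process
# is importable and word_list is an iterable of tokenised documents.
from helpers.process import get_ngram_module

docs = [
    ["new", "york", "city", "is", "large"],
    ["new", "york", "has", "many", "boroughs"],
]

# Build a bigram model; min_count/threshold are lowered only because the
# toy corpus is tiny.
bigram = get_ngram_module(docs, min_count=1, threshold=1)

# Passing the bigram model as prev_ngram yields the next-higher order
# (trigram) model.
trigram = get_ngram_module(docs, prev_ngram=bigram, min_count=1, threshold=1)

# Applying the returned Phrases model to a document merges detected
# phrases, e.g. ["new_york", "city", "is", "large"].
print(bigram[docs[0]])
```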

def make_dictionary(series_words, large_list=False)

Make a dictionary from a series of documents' word lists

Args

series_words : pandas.Series

documents' tokens; each element is a list of tokens for a document

large_list : bool, optional

if True, series_words is a hydrated list of tokens (comma-separated format). Defaults to False.

Returns

gensim.corpora.Dictionary

gensim dictionary
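
An illustrative sketch, assuming helpers.process is importable and that the returned object behaves like a standard gensim.corpora.Dictionary; the sample documents are invented for the example.

```python
# Hypothetical usage sketch for make_dictionary.
import pandas as pd

from helpers.process import make_dictionary

series_words = pd.Series([
    ["text", "mining", "is", "fun"],
    ["topic", "models", "need", "a", "dictionary"],
])

dictionary = make_dictionary(series_words)

# A gensim Dictionary maps each token to an integer id and can build
# bag-of-words vectors for downstream models such as LDA.
print(dictionary.token2id)
bow = [dictionary.doc2bow(doc) for doc in series_words]
```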

def most_common(words, n=10)

Returns the most common words in a document

Args

words : list

list of words in a document

n : int, optional

Number of top common words to return. Defaults to 10.

Returns

list

list of the top n common terms
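
A short illustrative call; the word list is invented for the example, and the exact element format of the returned list (terms alone vs. (term, count) pairs) is not specified above, so it is only assumed here.

```python
# Hypothetical usage sketch for most_common.
from helpers.process import most_common

words = ["data", "science", "data", "model", "data", "model", "text"]

# Top 2 most frequent terms; if the helper follows
# collections.Counter.most_common conventions (an assumption), this would
# be [("data", 3), ("model", 2)].
print(most_common(words, n=2))
```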

def tokenize_sentence(text)

Wrapper for nltk sent_tokenize. Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer.

Args

text : string

text to split into sentences

Returns

list

list of sentence tokens
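
A simple sketch; because this wraps nltk sent_tokenize, the NLTK punkt sentence model must be available (e.g. via nltk.download("punkt")). The sample text is invented for the example.

```python
# Hypothetical usage sketch for tokenize_sentence.
from helpers.process import tokenize_sentence

text = "Tokenisation splits text. It is the first step. Models need it."

# Returns one string per sentence, e.g.
# ["Tokenisation splits text.", "It is the first step.", "Models need it."]
print(tokenize_sentence(text))
```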

def tokenize_word(text, normalise_case=True, keep_numerics=False, shortwords=3, remove_stopwords=True, lemmatize=True, gaps=False, addl_stopwords=[])

Tokenisation function. Accepts a text to be tokenised and applies the following steps in order (a usage sketch follows the argument list below):

1. normalise case to lower
2. tokenise using nltk RegexpTokenizer with pattern: r'\w+|$[0-9]+|\S+&[^.<>]'
3. remove numerics from the list of tokens
4. remove short words
5. remove English-language stop words and custom stop words
6. lemmatise tokens to base form

Args

text : string

text to split into word tokens

normalise_case : bool, optional

enable/disable case normalisation. Defaults to True.

keep_numerics : bool, optional

to keep or remove numerics. Defaults to False.

shortwords : int, optional

length of short words to be excluded. Defaults to 3.

remove_stopwords : bool, optional

enable or disable removing of stop words. Defaults to True.

lemmatize : bool, optional

enable or disable lemmatisation. Defaults to True.

gaps : bool, optional

True to find the gaps instead of the word tokens. Defaults to False.

addl_stopwords : list, optional

additional stop words to be excluded from tokenisation. Defaults to [].

Returns

list

list of tokens
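
An illustrative call showing the default pipeline plus custom stop words; the sample text, the extra stop word "acme", and the expected output are assumptions inferred from the argument descriptions above, not output produced by the module.

```python
# Hypothetical usage sketch for tokenize_word.
from helpers.process import tokenize_word

text = "The 3 quick analysts reviewed 42 reports for ACME Corp in 2024."

tokens = tokenize_word(
    text,
    normalise_case=True,       # lower-case before tokenising
    keep_numerics=False,       # drop "3", "42", "2024"
    shortwords=3,              # drop short tokens (cut-off length of 3)
    remove_stopwords=True,     # drop English stop words such as "the", "for"
    lemmatize=True,            # reduce tokens to their base form
    addl_stopwords=["acme"],   # project-specific stop words
)

# Expected shape of the result: a flat list of cleaned tokens,
# e.g. something like ["quick", "analyst", "review", "report", "corp"],
# depending on the lemmatiser and the short-word cut-off.
print(tokens)
```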

Index

  • Super-module

    • [helpers](index.html "helpers")
  • Functions

    • get_ngram_module
    • make_dictionary
    • most_common
    • tokenize_sentence
    • tokenize_word

Generated by pdoc 0.9.1.