mremad/SpokenInputTopicDetection

--------------------------------------------------
|    INCLUDED PROJECTS                           |
--------------------------------------------------

*punctuation_detector

A Keras implementation of a period punctuation detector. The system inserts periods into a stream of unstructured text.


*topic_based_segmentation

A Keras implementation of a system that divides a stream of sentences into segments, where
adjacent segments differ in topic.


*segment_classification

A kNN implementation (based on word embeddings and TF-IDF features) that classifies a given segment into one of the available topics, given their summaries.
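
As a rough illustration of the idea (not the project's actual code), the sketch
below represents a segment and each topic summary as a TF-IDF-weighted average
of word vectors and picks the topic whose summary is nearest by cosine
similarity; the vectors, IDF values, and texts are toy placeholders:

import numpy as np

def embed(tokens, vectors, idf):
    # TF-IDF-weighted average of the word vectors of `tokens`
    vecs, weights = [], []
    for tok in set(tokens):
        if tok in vectors:
            vecs.append(vectors[tok])
            weights.append(tokens.count(tok) * idf.get(tok, 1.0))
    if not vecs:
        return np.zeros(3)            # toy dimension, matches the vectors below
    return np.average(vecs, axis=0, weights=weights)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

def classify(segment, summaries, vectors, idf):
    # 1-NN over the topic summaries
    seg = embed(segment, vectors, idf)
    return max(summaries, key=lambda t: cosine(seg, embed(summaries[t], vectors, idf)))

# toy stand-ins for real word2vec vectors and corpus IDF values
vectors = {"election": np.array([1.0, 0.0, 0.0]), "vote": np.array([0.9, 0.1, 0.0]),
           "storm":    np.array([0.0, 1.0, 0.0]), "rain": np.array([0.0, 0.9, 0.1])}
idf = {"election": 2.0, "vote": 1.5, "storm": 2.0, "rain": 1.5}
summaries = {"politics": ["election", "vote"], "weather": ["storm", "rain"]}

print(classify(["vote", "election", "vote"], summaries, vectors, idf))   # politics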

--------------------------------------------------
|    DATASET USED                                |
--------------------------------------------------
TDT-2 Corpus

--------------------------------------------------
|    DEPENDENCIES                                |
--------------------------------------------------
keras 1.2.0
gensim
word2vec
numpy
nltk
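
All of the above are available from PyPI, so an environment can likely be set
up along these lines (note that keras 1.2.0 is old and may require a
correspondingly old Python and Theano/TensorFlow backend):

pip install keras==1.2.0 gensim word2vec numpy nltk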


--------------------------------------------------
|    CORPUS                                      |              
--------------------------------------------------
For the experiments, a pickled TDTCorpus object (*.pckl) is loaded. It has two attributes: "text_corpus_bnds" and "sent_boundaries". Any object exposing these two attributes can be used as input.
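
A minimal loading sketch (assuming the TDTCorpus class definition is importable
so that pickle can reconstruct the object; the filename is hypothetical):

import pickle

with open("tdt_corpus.pckl", "rb") as f:
    corpus = pickle.load(f)

print(list(corpus.text_corpus_bnds.keys())[:3])   # first few document ids
print(corpus.sent_boundaries["document_id_1"])    # topic boundaries of one document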

Explaining "text_corpus_bnds":
A dict() object having as keys the ids of the documents, and values the list of tokens (words) of each document.
Tokens (words) that mark an end of sentence contain a marker "<bnd>"

EXAMPLE:

text_corpus_bnds = {
        "document_id_1": ["This", "is", "a", "sentence.<bnd>", "this", "is", "another", "sentence<bnd>"],
        "document_id_2": ["This", "is", "A", "non-pre-processed", "sentence.<bnd>", "this", "is", "another", "sentence<bnd>"],
        "document_id_3": ["This", "is", "a", "sentence.<bnd>", "this", "is", "another", "sentence<bnd>"]
}
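
Given this representation, sentences can be recovered by splitting the token
stream on the "<bnd>" marker; a small illustrative sketch:

def split_sentences(tokens):
    sentences, current = [], []
    for tok in tokens:
        if tok.endswith("<bnd>"):
            current.append(tok[:-len("<bnd>")])   # strip the marker
            sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:                                   # trailing tokens without a marker
        sentences.append(current)
    return sentences

print(split_sentences(["This", "is", "a", "sentence.<bnd>", "this", "is", "another", "sentence<bnd>"]))
# [['This', 'is', 'a', 'sentence.'], ['this', 'is', 'another', 'sentence']]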




Explaining "sent_boundaries":

A dict() object having as keys the ids of the documents, and values the list of sentence-based TOPIC boundaries. So each "i"th index in the list
represents whether there is a topic change after "i"th sentence in the document or not.

EXAMPLE:

sent_boundaries = {
        "document_id_1": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
        # document id 1 has two story segments: sentences 1-4 and sentences 5-10
        "document_id_2": [0, 1, 0, 0, 1],
        # document id 2 has two story segments: sentences 1-2 and sentences 3-5
        "document_id_3": [0, 0, 1, 0, 0, 1, 1]
        # document id 3 has three story segments: sentences 1-3, sentences 4-6,
        # and a segment containing only sentence 7
}
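
To make the encoding concrete, here is an illustrative sketch that turns a
boundary vector into (start, end) sentence ranges (1-indexed, inclusive),
matching the comments above:

def boundary_ranges(bounds):
    ranges, start = [], 1
    for i, b in enumerate(bounds, start=1):
        if b == 1:
            ranges.append((start, i))
            start = i + 1
    return ranges

print(boundary_ranges([0, 0, 0, 1, 0, 0, 0, 0, 0, 1]))   # [(1, 4), (5, 10)]
print(boundary_ranges([0, 0, 1, 0, 0, 1, 1]))            # [(1, 3), (4, 6), (7, 7)]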


--------------------------------------------------
|    VOCAB                                       |              
--------------------------------------------------
Another input used in the experiments is the vocabulary of the training set.
A vocab is a dict stored in a pickle object (*.pckl).
The dict() has words as keys and a unique id for each word as value.
Word ids should start from 1.
A special word "<bnd>" is added to the dict with id len(vocab) + 1, i.e. the size of the vocabulary before it is added, plus one.
This <bnd> word is reserved for unknown words.
Words should be lowercased.

EXAMPLE:


vocab = {
        "word"  : 1,
        "usa"   : 2,
        "<bnd>" : 3
}
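
An illustrative sketch of building such a vocab from a corpus in the format
above and mapping tokens to ids (stripping the "<bnd>" marker and lowercasing
before lookup are assumptions based on the descriptions above):

def build_vocab(documents):
    vocab = {}
    for tokens in documents.values():
        for tok in tokens:
            word = tok.replace("<bnd>", "").lower()   # strip marker, lowercase
            if word and word not in vocab:
                vocab[word] = len(vocab) + 1          # ids start from 1
    vocab["<bnd>"] = len(vocab) + 1                   # reserved / unknown-word id
    return vocab

def tokens_to_ids(tokens, vocab):
    unk = vocab["<bnd>"]
    return [vocab.get(tok.replace("<bnd>", "").lower(), unk) for tok in tokens]

vocab = build_vocab({"doc1": ["Word", "usa<bnd>"]})
print(vocab)                                    # {'word': 1, 'usa': 2, '<bnd>': 3}
print(tokens_to_ids(["usa", "unseen"], vocab))  # [2, 3]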