hyy369/verb-pattern: TribeHacks II NLP Challenge Winner. A Python program that parses texts and generates a table of verb patterns.

This project, originally named "verb-pattern-plus", is the 2016 Tribe Hackathon winner in Natural Language Processing. The challenge prompt was created by LOGAPPS and is posted at https://github.com/ACMWM/Logapps-TribeHacks-Challenge-2016.

To run this program, users need NLTK 3.0+ and its corresponding packages (http://www.nltk.org), the Stanford Parsers (https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software), Java 8 (JDK 1.8+), and the default encoding set to "UTF-8". Please feel free to report any compatibility issues.

Part I sentence.py

The Sentence class is defined in this source file. A Sentence object will have the following fields:

str: the string representation of the sentence's content
sbj: the subject part of the sentence
obj: the list of the object parts of the sentence
paraID: the number of the paragraph in which the sentence appears
sentID: the index of the sentence within the paragraph
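
The fields above could be held in a class along the following lines. This is a minimal sketch written for Python 3, not the actual contents of sentence.py: the constructor signature and the `__repr__` helper are assumptions, and only the five field names come from the list above.

```python
class Sentence:
    """Hypothetical container mirroring the fields listed above."""

    def __init__(self, text, para_id, sent_id):
        self.str = text          # string representation of the sentence
        self.sbj = None          # subject part, to be filled in by a parser
        self.obj = []            # list of object parts, also filled in later
        self.paraID = para_id    # paragraph in which the sentence appears
        self.sentID = sent_id    # index of the sentence within that paragraph

    def __repr__(self):
        # Assumed convenience helper for debugging; not part of the description.
        return "Sentence(%d.%d: %r)" % (self.paraID, self.sentID, self.str)


example = Sentence("The dog chased the cat.", 0, 0)
```

Leaving sbj and obj empty at construction time reflects the split described in this README, where the subjects and objects are found later by decompose.py.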

The make_sentence_list(filename) method reads through a text file, identifies the complete sentences in it, and initializes each one as a Sentence object. It returns a list of Sentence objects representing all complete sentences in the text file.

This script will generate a temporary text file in temp/.
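
A rough idea of how make_sentence_list could work is sketched below for Python 3. It is not the project's implementation: sentence boundaries are found with a naive regular expression instead of NLTK, paragraphs are assumed to be separated by blank lines, numbering is assumed to start at zero, the temporary file in temp/ is skipped, and the helper sentences_from_text is hypothetical. A small stand-in Sentence class is included so the block runs on its own.

```python
import re


class Sentence:
    # Minimal stand-in so this block is self-contained.
    def __init__(self, text, para_id, sent_id):
        self.str, self.sbj, self.obj = text, None, []
        self.paraID, self.sentID = para_id, sent_id


# Naive boundary: '.', '!' or '?' followed by whitespace (an assumption).
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_text(text):
    """Split text into paragraphs, then keep only the complete sentences."""
    result = []
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for para_id, para in enumerate(paragraphs):
        sent_id = 0
        # Collapse line breaks inside the paragraph before splitting.
        for chunk in _SENT_END.split(" ".join(para.split())):
            if chunk and chunk[-1] in ".!?":  # drop fragments such as headings
                result.append(Sentence(chunk, para_id, sent_id))
                sent_id += 1
    return result


def make_sentence_list(filename):
    """Read a UTF-8 text file and return its sentences as Sentence objects."""
    with open(filename, encoding="utf-8") as handle:
        return sentences_from_text(handle.read())
```

A regular expression like this one will mis-split abbreviations such as "Dr. Smith", which is one reason a trained sentence tokenizer is usually preferred for real texts.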

Part II decompose.py

This source file iterates through each Sentence object and:

parses the sentence with the Stanford Dependency Parser to find subjects and objects
tokenizes and tags each word to identify real verbs within the sentence
creates a csv file with the subjects, objects and verbs of each sentence

The user has to input the source text file name (without directory or extension). The result file is result/xxx_decomposed.csv.
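
The extraction step can be illustrated without installing the parser. The sketch below (Python 3) works on dependency triples of the shape ((head, tag), relation, (dependent, tag)), which is the shape NLTK dependency graphs yield from triples(). The triples here are written by hand, and the relation names, the "tag starts with VB" rule for verbs, and the csv column layout are all assumptions rather than the behavior of decompose.py.

```python
import csv
import io

# Stanford-style relation labels assumed to mark subjects and objects.
SUBJECT_RELS = {"nsubj", "nsubjpass"}
OBJECT_RELS = {"dobj", "iobj"}


def decompose(triples):
    """Collect subjects, objects and verbs from dependency triples."""
    subjects, objects, verbs = [], [], []
    for (head, head_tag), rel, (dep, dep_tag) in triples:
        if rel in SUBJECT_RELS:
            subjects.append(dep)
        elif rel in OBJECT_RELS:
            objects.append(dep)
        # Treat any token with a Penn Treebank verb tag (VB, VBD, ...) as a verb.
        for word, tag in ((head, head_tag), (dep, dep_tag)):
            if tag.startswith("VB") and word not in verbs:
                verbs.append(word)
    return subjects, objects, verbs


# Hand-written triples for "The dog chased the cat."
triples = [
    (("chased", "VBD"), "nsubj", ("dog", "NN")),
    (("dog", "NN"), "det", ("The", "DT")),
    (("chased", "VBD"), "dobj", ("cat", "NN")),
    (("cat", "NN"), "det", ("the", "DT")),
]

subjects, objects, verbs = decompose(triples)

# Write one row per sentence; multiple values are joined with ';'.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["paraID", "sentID", "subjects", "verbs", "objects"])
writer.writerow([0, 0, ";".join(subjects), ";".join(verbs), ";".join(objects)])
```

Writing to an in-memory buffer keeps the example free of side effects; a real run would open a file under result/ instead.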

Part III expand.py

This source file extends the csv file created in Part II and counts verbs in the different categories provided by table1.csv. The user has to input the source text file name (without directory or extension). The result file is result/xxx_expanded.csv.
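
The counting step might look like the following Python 3 sketch. The layout of table1.csv is not described in this README, so the verb-to-category mapping below is entirely made up for illustration, as is the function name count_categories. Verbs are matched by lowercase form only, with no lemmatization.

```python
from collections import Counter

# Hypothetical stand-in for the categories supplied by table1.csv.
VERB_CATEGORIES = {
    "run": "motion",
    "walk": "motion",
    "say": "speech",
    "tell": "speech",
}


def count_categories(verbs, table=VERB_CATEGORIES):
    """Count how many of the given verbs fall into each category."""
    counts = Counter()
    for verb in verbs:
        category = table.get(verb.lower())
        if category is not None:  # verbs missing from the table are ignored
            counts[category] += 1
    return counts


counts = count_categories(["run", "say", "Walk", "eat"])
```

In this example "eat" is skipped because it has no entry in the table, so only three of the four verbs contribute to the counts.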

Directories

./: the root directory contains the source code, readme docs, the challenge prompt, the presentation, and table1.csv, which is provided by the challenge.

en: necessary local nltk libraries for English.

result: result tables in csv files.

temp: cache space for tidying texts.

texts: text files that need to be processed.

Future applications

The score table of some key verbs provided by table1.csv is randomly generated. However, we can apply VerbPattern to a large number of text files and relate the pattern of the output score table to each text's author or other information. Each text file will have a unique score-table matrix, which can be regarded as its fingerprint. We could use this database of text fingerprints to predict information about a given anonymous text, detect plagiarism, etc. Our Sentence objects include not only the verbs but also information about the subjects and objects of a sentence. With a simple extension of the project, we can create score tables for subjects and objects too, so that we can analyze the headlines people talk about most on social networks, etc.
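
One possible way to compare two such fingerprints is cosine similarity over the per-category scores. This is only an illustration of the idea in Python 3: the project does not implement it, cosine similarity is a technique chosen here for the example, and the category names and scores are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {category: score} fingerprints."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty fingerprint is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Invented fingerprints for two texts.
text_a = {"motion": 2, "speech": 1}
text_b = {"motion": 4, "speech": 2}
similarity = cosine_similarity(text_a, text_b)
```

Because cosine similarity ignores overall magnitude, a long text and a short text with the same proportions of verb categories score as nearly identical, which suits comparing documents of different lengths.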
