GitHub - harveyxia/semantic_classification: Use WordNet and word2vec to semantically classify documents

Setup

Our use of NLTK depends on several corpora. To install them, run the following in a Python environment:

import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Running the script

1. Open an IPython terminal.
2. run semantic_classifier.py
3. output = run('filename.txt', min_size, max_size, max_dist)
4. output.hypernyms

Replace 'filename.txt' with the input file of your choice, and set min_size, max_size, and max_dist. The last step outputs the ordered list of hypernyms.

Algorithm

semantic_classifier.py

1. Extract all nouns from the document (noun_extractor.py):
   a. Strip all punctuation and non-ASCII characters from each line.
   b. Tokenize the line.
   c. Tag the part of speech (POS) of each token.
   d. Filter out all non-noun tokens.
   e. Add all noun tokens to a Python dict and track the occurrences of each noun.
2. Convert nouns to synsets, and remove nouns for which no synset exists.
3. Generate the 2D matrix of pairwise similarity values.
4. Perform hierarchical clustering.
5. Get clusters based on the min_size, max_size, and max_dist parameters.
6. Sort clusters by noun occurrence, most frequent first.
7. Find the lowest common ancestor (hypernym) of each cluster of synsets.
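
The similarity-matrix, clustering, selection, and sorting steps above can be sketched in plain Python. This is a minimal illustration and not the repository's code, which this README does not show. The function names, the toy similarity values, and the noun counts are all hypothetical. It also makes three assumptions: distance is taken to be 1 - similarity, max_dist is taken to be the cut-off at which merging stops, and min_size and max_size are taken to bound the number of nouns in a kept cluster. Complete linkage is used here as one possible clustering mode.

```python
from itertools import combinations

# Hypothetical pairwise similarities in [0, 1]; unlisted pairs default to 0.2.
SIMILARITY = {
    frozenset(("dog", "wolf")): 0.9,
    frozenset(("dog", "puppy")): 0.85,
    frozenset(("wolf", "puppy")): 0.8,
    frozenset(("camera", "lens")): 0.75,
}
# Hypothetical noun occurrence counts, as produced by the extraction step.
COUNTS = {"dog": 5, "wolf": 3, "puppy": 1, "camera": 2, "lens": 2}


def distance(a, b):
    """Turn a similarity into a distance (assumed here: 1 - similarity)."""
    return 1.0 - SIMILARITY.get(frozenset((a, b)), 0.2)


def complete_linkage_clusters(items, max_dist):
    """Greedily merge the two closest clusters until none is within max_dist."""
    clusters = [[item] for item in items]

    def cluster_distance(a, b):
        # Complete linkage: the distance between two clusters is the
        # largest distance between any pair of their members.
        return max(distance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best > max_dist:
            break  # every remaining pair of clusters is too far apart
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


def select_clusters(clusters, min_size, max_size):
    """Keep clusters within the size bounds, most frequent nouns first."""
    kept = [c for c in clusters if min_size <= len(c) <= max_size]
    # Assumption: a cluster's frequency is the sum of its nouns' counts.
    return sorted(kept, key=lambda c: sum(COUNTS[n] for n in c), reverse=True)


result = select_clusters(
    complete_linkage_clusters(list(COUNTS), max_dist=0.3),
    min_size=2,
    max_size=3,
)
print(result)
```

With real data, the similarity lookup would be replaced by a WordNet similarity between synsets, and the merge loop by a library routine such as SciPy's scipy.cluster.hierarchy.linkage followed by fcluster.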

Notes/Ideas/Issues

1. Semantic classification vs. what the article is about.
2. Large clusters vs. iterative clusters of pairs.
   The larger the cluster, the more abstract and often less accurate the hypernym. Smaller clusters, especially pairs, yield the most accurate hypernyms, but with less semantic synthesis.
3. Hypernym vs. content.
   E.g. "photograph" is not clustered with "photography": their wup_similarity is only 0.1176, whereas the wup_similarity of "photograph" and "painting" is 0.705.
4. Incorporate noun counts for assigning 'salience scores' to each hypernym.
5. NLTK's POS tagger sometimes mis-tags words as nouns. For instance, it tags "tamer" as a noun in the following sentence: "Scientists once thought that some visionary hunter-gatherer nabbed a wolf puppy from its den one day and started raising tamer and tamer wolves".
6. Currently, the algorithm takes only the first synset and the first common hypernym.
   a. The first synset is the most frequently occurring sense, but it might be the incorrect sense of the noun in context.
   b. A set of synsets might have multiple lowest common hypernyms, some of which may be more accurate than others.
7. How to do evaluation?
8. Morphology: collapse 'photography' and 'photograph'?
9. Methodological limitations:
   a. Only accounts for nouns.
   b. A hypernym is not equivalent to a 'semantic class' or 'content'.
   c. A document's complete semantic meaning cannot be fully captured by a set of nouns.
10. Discuss clustering mode, i.e. median vs. complete.
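
To make the wup_similarity figures and the lowest-common-hypernym caveat in the notes above more concrete, here is a small sketch of Wu-Palmer similarity on a toy taxonomy. Wu-Palmer similarity is commonly defined as 2 * depth(lch) / (depth(a) + depth(b)), where lch is the lowest common hypernym of a and b. The taxonomy below is hypothetical and does not reproduce real WordNet paths, so its scores differ from the values quoted above. It only shows the mechanism: two words that share a deep ancestor score high, while two words that meet only at the root score low.

```python
# Hypothetical child -> parent links; these are NOT real WordNet paths.
PARENT = {
    "artifact": "entity",
    "creation": "artifact",
    "representation": "creation",
    "picture": "representation",
    "photograph": "picture",
    "painting": "picture",
    "abstraction": "entity",
    "activity": "abstraction",
    "occupation": "activity",
    "photography": "occupation",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_hypernym(a, b):
    """Deepest ancestor shared by a and b (a node counts as its own ancestor)."""
    ancestors_of_b = set(path_to_root(b))
    # path_to_root(a) runs from deepest to shallowest, so the first hit is the lowest.
    return next(n for n in path_to_root(a) if n in ancestors_of_b)


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lch) / (depth(a) + depth(b))."""
    lch = lowest_common_hypernym(a, b)
    return 2 * depth(lch) / (depth(a) + depth(b))


for pair in [("photograph", "painting"), ("photograph", "photography")]:
    print(pair, lowest_common_hypernym(*pair), round(wu_palmer(*pair), 4))
```

In this toy tree every node has one parent, so the lowest common hypernym is unique. WordNet allows a synset to have several hypernyms, which is why NLTK's Synset.lowest_common_hypernyms returns a list and why taking only the first entry can be a lossy choice. As a side observation, the quoted 0.1176 equals 2/17 to four decimal places, which is what the formula yields when two senses share only a depth-1 root and their depths sum to 17. That reading is an inference from the number and is not stated in these notes.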
