Cluster the TED corpus; visualize with word clouds; reduce dimensionality with t-SNE.

iamlxb3/TedClustering

TedClustering

An attempt to use clustering algorithms to explore the TED corpus. The data is from https://www.kaggle.com/rounakbanik/ted-talks.

Features used: term frequency (TF), TF-IDF, and latent semantic analysis (LSA)
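
As a rough illustration of the three feature types, here is a minimal sketch using scikit-learn. The toy documents, the choice of scikit-learn, and all parameter values are assumptions for illustration; the repository's actual preprocessing is not described here.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for talk transcripts (hypothetical data, not from the dataset).
docs = [
    "coral reefs are dying as the ocean warms",
    "baby corals can rebuild damaged reefs",
    "nuclear energy and the future of power",
    "fear of nuclear power hurts the environment",
]

# TF: raw term counts per document.
tf = CountVectorizer(stop_words="english").fit_transform(docs)

# TF-IDF: counts reweighted so terms shared by many documents matter less.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LSA: truncated SVD of the TF-IDF matrix gives dense low-dimensional vectors.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

print(tf.shape, tfidf.shape, lsa.shape)
```

TF and TF-IDF share the same vocabulary and therefore the same matrix shape; LSA reduces each document to `n_components` dense values.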

Clustering algorithms: K-means, MiniBatchKMeans, hierarchical clustering, DBSCAN, and Isolation Forest (iforest, used for anomaly detection)
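
The clustering algorithms listed above can be compared side by side along these lines. This is a sketch assuming scikit-learn; the synthetic data, the number of clusters, and the DBSCAN parameters are placeholders rather than the settings used in this repository.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for document vectors (an assumption).
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0),
    "average link": AgglomerativeClustering(n_clusters=3, linkage="average"),
    "complete link": AgglomerativeClustering(n_clusters=3, linkage="complete"),
    "ward link": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),  # eps/min_samples are guesses
}

# Every estimator here supports fit_predict, returning one label per sample.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, lab in labels.items():
    # DBSCAN marks noise points with the label -1, so exclude it from the count.
    n_clusters = len(set(lab) - {-1})
    print(name, n_clusters)
```

Unlike the other algorithms, DBSCAN does not take a target number of clusters; it infers one from the density parameters.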

| Algorithm | Entropy |
| --- | --- |
| MiniBatchKMeans | 5.06 |
| KMeans | 4.82 |
| Hierarchical clustering, average link | 5.28 |
| Hierarchical clustering, complete link | 5.17 |
| Hierarchical clustering, ward link | 4.85 |
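
The README does not say how the entropy scores are computed. One common definition, shown here purely as an assumption, is the size-weighted average of the Shannon entropy of reference labels within each cluster, where lower values mean purer clusters. The function name `cluster_entropy` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids, reference_labels):
    """Size-weighted mean Shannon entropy (bits) of reference labels per cluster."""
    n = len(cluster_ids)
    by_cluster = {}
    for c, y in zip(cluster_ids, reference_labels):
        by_cluster.setdefault(c, []).append(y)

    total = 0.0
    for members in by_cluster.values():
        size = len(members)
        counts = Counter(members)
        # Entropy of the label distribution inside this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        # Weight each cluster by the fraction of samples it holds.
        total += (size / n) * h
    return total


# Perfectly pure clusters versus clusters that mix both labels evenly.
pure = cluster_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = cluster_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
print(pure, mixed)
```

Under this definition the pure assignment scores lower than the mixed one, which matches the usual reading that a lower entropy indicates a better clustering.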

The figure below shows how LSA affects the result

[Figure: effect of LSA on the result]

Word clouds for clusters 0-9

[Figures: ten word clouds, one per cluster, for clusters 0-9]
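
A word cloud is driven by per-cluster term frequencies. The sketch below shows one way such frequencies might be gathered with the standard library; the `clustered` pairs are hypothetical stand-ins, and real transcripts would need proper tokenization and stop-word removal.

```python
from collections import Counter

# Hypothetical (cluster id, text) pairs standing in for clustered talks.
clustered = [
    (0, "coral reefs ocean coral"),
    (0, "reefs ocean warming"),
    (1, "nuclear energy power nuclear"),
]

# Accumulate one word counter per cluster.
freqs = {}
for cluster_id, text in clustered:
    freqs.setdefault(cluster_id, Counter()).update(text.split())

for cluster_id, counter in sorted(freqs.items()):
    print(cluster_id, counter.most_common(3))
```

A word-to-count mapping like `freqs[0]` is the kind of input a word-cloud renderer typically accepts.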

t-SNE projection of clusters 0-9

[Figure: t-SNE projection of clusters 0-9]
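
A 2-D projection of this kind can be produced roughly as follows. This sketch assumes scikit-learn's `TSNE`; the random input vectors and the perplexity value are placeholders, not the repository's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Random 20-dimensional vectors standing in for LSA document vectors.
X = rng.normal(size=(40, 20))

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
coords = tsne.fit_transform(X)

print(coords.shape)
```

Each row of `coords` is a point that can be plotted and colored by its cluster label.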

Anomaly detection with iforest (the most distinctive TED talks), using TF-IDF + LSA features

| Score | Title |
| --- | --- |
| -0.039 | An 8-dimensional model of the universe |
| -0.038 | Debate: Does the world need nuclear energy? |
| -0.025 | Does democracy stifle economic growth? |
| -0.021 | Why bees are disappearing |
| -0.018 | How we're growing baby corals to rebuild reefs |
| -0.015 | Our refugee system is failing. Here's how we can fix it |
| -0.012 | The laws that sex workers really want |
| -0.009 | How fear of nuclear power is hurting the environment |
| -0.008 | The refugee crisis is a test of our character |
| -0.007 | Why I still have hope for coral reefs |
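
A ranking like the one above could be obtained by sorting isolation-forest scores in ascending order. The sketch below assumes scikit-learn's `IsolationForest` and its `decision_function`, where lower values are more anomalous; whether the repository uses this exact call is an assumption, and the data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 100 ordinary points plus one obvious outlier, standing in for document vectors.
X = np.vstack([rng.normal(size=(100, 5)), np.full((1, 5), 8.0)])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.decision_function(X)  # lower means more anomalous

# Indices of the ten lowest-scoring (most distinctive) rows.
most_distinctive = np.argsort(scores)[:10]
print(most_distinctive[0], scores[most_distinctive[0]])
```

With real data, the indices in `most_distinctive` would be mapped back to talk titles to build the table.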
