
🔍 📈 Using techniques of knowledge discovery and text mining the goal is to explain the structure and the evolution of the EGC community


pchampio/EGC-2020


EGC_2020

EGC 2020 Challenge: 20 years of history for which future?

The goal of this challenge is to take stock of the evolution of the EGC community over the past 20 years and to try to predict its future. The principle is to apply knowledge discovery and data mining techniques to explain the community's structure and evolution.

Dataset

The data set consists of 1200 titles and abstracts from the articles published at the EGC conference between 2004 and 2018.
Fields:

  • years
  • title
  • abstract
  • authors

Pipeline

  • filter_extreme
  • tf-idf
  • LDA (Coherence Score)
  • K-Means (Silhouette scores)
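The first two steps can be illustrated with a minimal, standard-library sketch. It assumes that `filter_extreme` means dropping tokens that are too rare or too common across documents (as gensim's `Dictionary.filter_extremes` does) and that tf-idf is the plain `tf * log(N / df)` variant. The thresholds, the toy corpus, and the function names are hypothetical and are not taken from the notebook.

```python
import math
from collections import Counter


def filter_extremes(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical thresholds)."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    keep = {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[token for token in doc if token in keep] for doc in docs]


def tf_idf(docs):
    """Return one {token: weight} dict per document, with
    weight = (term frequency) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {
                token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
                for token, count in counts.items()
            }
        )
    return vectors


# Toy corpus standing in for tokenized abstracts (not the EGC data).
docs = [
    "social network analysis of the community".split(),
    "rule based mining of the data".split(),
    "social graph mining of the web".split(),
    "analysis of the association rule".split(),
]

filtered = filter_extremes(docs, no_below=2, no_above=0.5)
weights = tf_idf(filtered)
print(filtered)
print(weights[0])
```

Here the words "of" and "the" are removed because they occur in every document, and words seen in a single document are removed as too rare. The resulting weight vectors are the kind of input that LDA or K-Means would then consume.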

Cluster (Topics) evolution / Time

  ScreenShot

Our system detected a sharp increase in articles related to social network analysis over the past 20 years (Label 1).
On the other hand, articles on rule-based algorithms seem to have declined drastically (Label 6).
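A trend like this can be read from the share of each cluster label per publication year. The following is a minimal sketch of that computation, assuming each article carries a year and a cluster label. The function name and the toy values are hypothetical and do not reproduce the project's results.

```python
from collections import Counter, defaultdict


def cluster_share_by_year(years, labels):
    """Return {year: {label: share}} where share is the fraction of that
    year's articles assigned to the given cluster label."""
    counts = defaultdict(Counter)
    for year, label in zip(years, labels):
        counts[year][label] += 1

    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        shares[year] = {label: n / total for label, n in counts[year].items()}
    return shares


# Toy data (made up): label 6 dominates the early year, label 1 the late year.
years = [2004, 2004, 2004, 2004, 2018, 2018, 2018, 2018]
labels = [6, 6, 6, 1, 1, 1, 1, 6]

shares = cluster_share_by_year(years, labels)
for year, dist in shares.items():
    print(year, dist)
```

Using shares rather than raw counts keeps the comparison fair when the number of published articles differs from year to year.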

Evaluation (hyper-parameters are defined in the Jupyter notebook)

The pipeline used in this project does not find much structure in one cluster (Label 9). Unfortunately, this cluster represents ~30% of our training data (see the silhouette plot below).

Silhouette plot for 10 clusters

  ScreenShot
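For readers unfamiliar with the plot, the per-cluster silhouette can be sketched as follows. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. This is a plain-Python illustration on made-up 2-D points with Euclidean distance, not the notebook's code; in practice a library routine such as scikit-learn's `silhouette_samples` would typically be used on the document vectors.

```python
import math
from collections import defaultdict


def silhouette_by_cluster(points, labels):
    """Return {label: mean silhouette score of that cluster}.
    Requires at least two distinct labels."""
    clusters = defaultdict(list)
    for i, label in enumerate(labels):
        clusters[label].append(i)

    def mean_dist(i, members):
        # Mean Euclidean distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = defaultdict(list)
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            s = 0.0  # usual convention for singleton clusters
        else:
            a = mean_dist(i, own)
            b = min(
                mean_dist(i, members)
                for other, members in clusters.items()
                if other != label
            )
            s = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
        scores[label].append(s)

    return {label: sum(v) / len(v) for label, v in scores.items()}


# Toy points (made up): clusters 0 and 1 are compact, cluster 2 is scattered.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5), (0, 10), (10, 0)]
labels = [0, 0, 0, 1, 1, 2, 2, 2]

per_cluster = silhouette_by_cluster(points, labels)
for label, score in sorted(per_cluster.items()):
    print(label, round(score, 3))
```

A compact, well-separated cluster scores close to 1, while a scattered cluster scores near zero or below, which is the pattern described above for Label 9.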

There is still room for improvement.
