This repository stores materials for the "Text mining for learning content analysis" workshop and tutorial, organized as part of the Learning Analytics Summer Institute 2018 (LASI'18), held at Teachers College, Columbia University, New York City, NY, on June 12-13, 2018.
The R scripts provide examples for four Text Mining (TM) tasks: classification, clustering, topic modelling, and keyword extraction. Each script walks through the overall TM workflow for its task, starting with text preprocessing and ending with the examination and evaluation of the results.
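As a rough illustration of that workflow, here is a minimal sketch in base R. It is not taken from the workshop scripts: the four toy documents, the stop-word list, the train/test split, and the 1-nearest-neighbour classifier are all assumptions made for the example, chosen so that it runs without any extra packages.

```r
# Toy corpus standing in for newsgroup posts (hypothetical data, not from 20 Newsgroups)
docs <- c("The rocket launch was delayed by NASA.",
          "NASA plans a new orbit mission!",
          "The goalie stopped every shot in the hockey game.",
          "Hockey playoffs: the team won the game.")
labels <- c("space", "space", "hockey", "hockey")

# 1. Preprocessing: lowercase, strip non-letters, tokenise, drop stop words
stop_words <- c("the", "a", "was", "by", "in", "every")  # illustrative list only
preprocess <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z ]", " ", text)
  tokens <- strsplit(text, " +")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}
tokens <- lapply(docs, preprocess)

# 2. Feature creation: document-term matrix of raw term counts
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# 3. Modelling: 1-nearest-neighbour classification by cosine similarity
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
train <- c(1, 3)  # one labelled document per class
test <- c(2, 4)   # held-out documents
predicted <- sapply(test, function(i) {
  sims <- sapply(train, function(j) cosine(dtm[i, ], dtm[j, ]))
  labels[train][which.max(sims)]
})

# 4. Evaluation: confusion matrix and accuracy on the held-out documents
confusion <- table(actual = labels[test], predicted = predicted)
accuracy <- mean(predicted == labels[test])
print(confusion)
print(accuracy)
```

The scripts in this repository follow the same four stages, but on the real dataset and with methods suited to each task.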
The R scripts covering the four TM tasks are listed below (WS stands for workshop):
- Text classification:
  - TM_Intro_Newsgroup_Classifier.R (for WS)
  - NewsGroup_GloVe_Classifier.R (for WS)
  - TM_Tutorial_Newsgroup_Classifier.R (for tutorial)
  - UtilityFunctions.R (for WS and tutorial)
- Text clustering:
  - NewsGroup_AP_Clustering.R (for WS)
  - ClustEvalUtil.R (for WS)
- Topic modelling with the LDA method:
  - TopicModelingUseNetGroups.R (for WS)
  - UtilityFunctions.R (for WS)
- Keyword extraction using the TextRank method:
  - TextRankUsenetGroups.R (for WS)
Some prebuilt models are also available in the 'models' folder. They are grouped into subfolders by the TM task (that is, the R scripts) they relate to, and are provided so that we do not need to wait for models to build during the WS/tutorial.
All examples are based on the 20 Newsgroups dataset. This dataset is a collection of approximately 20,000 newsgroup documents (forum posts), partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. In case the term "newsgroup" is new to you: a newsgroup is an online discussion forum accessible through Usenet (a decentralized computer network, like the Internet). Even though they are not 'mainstream' social networks, newsgroups are still in active use (see Newsgroup info).
Slides that introduce relevant concepts and methods can be downloaded from the links given below.
- Slides for the tutorial:
- Slides for the workshop: