Materials for the workshop on Text Mining for Learning Content Analysis at Learning Analytics Summer Institute (LASI'19)

LASI'19 workshop on Text Mining for Learning Content Analysis

This repository stores materials for the Text Mining for Learning Content Analysis workshop, held at the Learning Analytics Summer Institute 2019 (LASI'19) at the University of British Columbia, Vancouver, Canada, on June 17-19, 2019.

The R scripts in this repository cover three topics:

  • General text mining (TM) workflow, exemplified through a binary text classification task. It covers the overall TM process: text preprocessing, the creation of a few different classification models, and testing of the best model. Scripts covering this topic:

    • preprocess_20News_dataset.R
    • newsgroup_classifier.R
    • tm_utils.R
  • Introduction to word vectors (word embeddings). The aim is to become familiar with the notion of word vectors by exploring a pre-built word vector model, specifically a GloVe model with 300 dimensions. The t-SNE dimensionality reduction technique is used to visualize the word vectors in 2D space. Relevant scripts are:

    • exploring_word_vectors.R
    • word_vec_utils.R
  • Using word vectors for text classification. This includes two ways of using a pre-built word vector model to create input for a classification algorithm: i) using a weighted average of word vectors to form document vectors; ii) using Word Mover's Distance to compute the similarity of documents based on their word vectors. The pre-built GloVe model introduced in topic 2 is used in this topic as well. Scripts that cover this topic:

    • newsgroup_GloVe_classifier.R
    • tm_utils.R
    • word_vec_utils.R
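
To make option (i) of the third topic more concrete, here is a minimal base R sketch of forming a document vector as a weighted average of word vectors. It is an illustration only, not the code from newsgroup_GloVe_classifier.R: the 3-dimensional vectors, the word weights, and the helper names doc_vector and cosine are all made up for this example (a real GloVe model, as noted above, has 300 dimensions).

```r
# Toy 3-dimensional "word vectors" (hypothetical values, for illustration only).
# Each row is one word; row names act as the vocabulary.
word_vectors <- rbind(
  hockey = c(0.9, 0.1, 0.0),
  game   = c(0.7, 0.3, 0.1),
  space  = c(0.0, 0.2, 0.9),
  orbit  = c(0.1, 0.1, 0.8)
)

# Hypothetical helper: weighted average of the vectors of a document's words.
# `weights` is a named numeric vector (e.g. term frequencies or TF-IDF scores),
# with one entry per word in the document.
doc_vector <- function(weights, word_vectors) {
  # Words missing from the vocabulary are skipped
  known <- intersect(names(weights), rownames(word_vectors))
  w <- weights[known]
  if (length(known) == 0 || sum(w) == 0) {
    return(rep(0, ncol(word_vectors)))
  }
  # Scale each word's row by its weight, sum the rows, normalize by total weight
  colSums(word_vectors[known, , drop = FALSE] * w) / sum(w)
}

# Cosine similarity between two vectors
cosine <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Two toy documents; "unknown" is not in the vocabulary and is ignored
doc_sports <- doc_vector(c(hockey = 2, game = 1, unknown = 5), word_vectors)
doc_space  <- doc_vector(c(space = 1, orbit = 1), word_vectors)

print(doc_sports)
print(cosine(doc_sports, doc_space))
```

The resulting fixed-length document vectors can then serve as feature rows for a classifier, whatever the length of the original documents.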

Note also that some prebuilt models are available in the 'models' folder. They are made available so that we do not need to wait for models to build during the workshop.

The first and third topics are based on the 20 Newsgroups dataset. This dataset, widely used in text mining tasks and benchmarks, is a collection of approximately 20,000 newsgroup documents (forum posts), partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. The csv files in the data/20news folder are derived from this dataset (subsetted and pre-processed).

Slides that introduce relevant concepts and methods are available at the links given below. The slides also cover some recent research work in Learning Analytics that was either partially or fully based on TM methods and techniques.

If interested in learning more, you may want to check materials from the previous edition of this workshop, held at LASI'18.
