LASI'19 workshop on Text Mining for Learning Content Analysis

This repository stores materials for the Text Mining for Learning Content Analysis workshop, organized at the Learning Analytics Summer Institute 2019 (LASI'19) at the University of British Columbia, Vancouver, Canada, on June 17-19, 2019.

The R scripts in this repository cover three topics:

1. General text mining (TM) workflow, exemplified through a binary text classification task. It covers the overall TM process: text preprocessing, the creation of a few different classification models, and the testing of the best model. Scripts covering this topic:
   - preprocess_20News_dataset.R
   - newsgroup_classifier.R
   - tm_utils.R
2. Introduction to word vectors (word embeddings). The aim is to familiarize participants with the notion of word vectors by exploring a pre-built word vector model, specifically a GloVe model with 300 dimensions. The t-SNE dimensionality reduction technique is used to visualize word vectors in 2D space. Relevant scripts:
   - exploring_word_vectors.R
   - word_vec_utils.R
3. Using word vectors for text classification. This includes two ways of using a pre-built word vector model to create input for a classification algorithm: i) using a weighted average of word vectors to form document vectors; ii) using Word Mover's Distance to compute the similarity of documents based on their word vectors. The pre-built GloVe model introduced in topic 2 is used here as well. Scripts that cover this topic:
   - newsgroup_GloVe_classifier.R
   - tm_utils.R
   - word_vec_utils.R
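
To illustrate the weighted-average idea from the third topic, here is a minimal base R sketch. It is not the code from the workshop scripts: the word vectors are made-up 3-dimensional values rather than real GloVe vectors, and `doc_vector` and `cosine_sim` are hypothetical helper names introduced only for this example. The weights could be, for instance, TF-IDF scores.

```r
# Toy word vectors (made-up 3-dimensional values, NOT real GloVe vectors)
word_vecs <- rbind(
  car    = c(0.9, 0.1, 0.0),
  engine = c(0.8, 0.2, 0.1),
  game   = c(0.1, 0.9, 0.2),
  team   = c(0.0, 0.8, 0.3)
)

# Weighted average of the vectors of a document's tokens.
# Tokens that are not in the vocabulary are skipped; a document with no
# known tokens gets a zero vector.
doc_vector <- function(tokens, weights, vecs) {
  known <- tokens %in% rownames(vecs)
  if (!any(known)) return(rep(0, ncol(vecs)))
  w <- weights[known]
  # Each row (one token's vector) is scaled by that token's weight
  colSums(vecs[tokens[known], , drop = FALSE] * w) / sum(w)
}

# Cosine similarity between two vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

doc_a <- doc_vector(c("car", "engine", "unknown"), c(2, 1, 5), word_vecs)
doc_b <- doc_vector(c("game", "team"), c(1, 1), word_vecs)
doc_c <- doc_vector("engine", 1, word_vecs)

cosine_sim(doc_a, doc_c)  # document about cars vs. document about engines
cosine_sim(doc_a, doc_b)  # document about cars vs. document about sports
```

The resulting document vectors all have the same length as the word vectors, so they can be stacked into a feature matrix and passed to a classification algorithm.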

Note also that some prebuilt models are available in the 'models' folder. They are made available so that we do not need to wait for models to build during the workshop.
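
A common way to avoid rebuilding a model on every run is to cache it on disk. The sketch below shows that pattern in base R under stated assumptions: the RDS format, the file name `example_model.RDS`, and the helper `load_or_build` are illustrative choices, not a description of how the files in the 'models' folder were actually produced.

```r
# Return the object cached at `path` if it exists;
# otherwise build it, cache it, and return it
load_or_build <- function(path, build_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  model <- build_fn()
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(model, path)
  model
}

# Stand-in for an expensive training step
# (here, a simple linear model on a built-in dataset)
build_model <- function() lm(mpg ~ wt, data = mtcars)

# Hypothetical file name, written to a temporary directory for this example
model_path <- file.path(tempdir(), "models", "example_model.RDS")

model <- load_or_build(model_path, build_model)        # builds (or loads) the model
model_again <- load_or_build(model_path, build_model)  # loads from disk
```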

The first and third topics are based on the 20 Newsgroups dataset. This dataset, widely used in text mining tasks and benchmarks, is a collection of approximately 20,000 newsgroup documents (forum posts), partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. The CSV files in the data/20news folder are derived from this dataset (subsetted and pre-processed).
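
To give a feel for what text preprocessing can involve, here is a minimal base R sketch that turns raw posts into a document-term matrix of term counts. It does not reproduce the steps in preprocess_20News_dataset.R: the posts are made-up stand-ins for newsgroup documents, the stopword list is a tiny illustrative one, and `tokenize` is a hypothetical helper.

```r
# Toy posts standing in for newsgroup documents (made-up text)
posts <- c(
  "The new engine makes the car much faster!",
  "Our team won the game in overtime.",
  "Is this car engine worth repairing?"
)

# Tiny illustrative stopword list (real lists are much longer)
stopwords <- c("the", "in", "is", "this", "our", "much")

# Lowercase, replace non-letters with spaces, split on spaces, drop stopwords
tokenize <- function(text, stopwords) {
  text <- tolower(text)
  text <- gsub("[^a-z]+", " ", text)
  tokens <- strsplit(trimws(text), " +")[[1]]
  tokens[!tokens %in% stopwords]
}

tokens_per_doc <- lapply(posts, tokenize, stopwords = stopwords)

# Vocabulary: all distinct tokens, sorted alphabetically
vocab <- sort(unique(unlist(tokens_per_doc)))

# Document-term matrix: one row per post, one column per term, raw counts
dtm <- t(vapply(
  tokens_per_doc,
  function(tokens) tabulate(match(tokens, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab

dtm
```

A matrix like this, usually with some form of term weighting, is a typical input for the kind of classification models covered in the first topic.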

Slides that introduce relevant concepts and methods are available at the links given below. The slides also cover some recent research work in Learning Analytics that was either partially or fully based on TM methods and techniques.

If you are interested in learning more, you may want to check the materials from the previous edition of this workshop, held at LASI'18.
