Supporting Materials for Clustering and Visualising Documents using Word Embeddings

This GitHub repository provides supporting materials for the tutorial published on Programming Historian. The contents fall into three parts:

Setup

There are three ways to set up Python for this tutorial:

Direct Installation: if you are installing the required Python libraries directly (e.g. with pip install), you can do so using the requirements.txt file: pip install -r requirements.txt. This will install the latest version of each library named in the requirements file, so over longer periods of time changes to these libraries may eventually cause errors because the interfaces or functions used have changed.
Google Colab: Programming Historian has set this up so that the environment is created automatically for you, which makes it the easiest way to run this tutorial. It also draws on requirements.txt, but there is nothing for you to do because Google Colab will install the libraries automatically.
Docker: this approach ensures that you are always running the versions of Python and the required libraries that were used in developing this tutorial, so it should keep working for as long as Docker continues to exist as a (mostly free) software platform. Details for running the Docker image can be found here. If you already know what you're doing with Docker: docker pull jreades/ph-tutorial:latest.
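If you choose Direct Installation, it can help to record which library versions pip actually installed, so that a later breakage can be traced to a changed library. The following is a minimal sketch using only the Python standard library; it is not part of the tutorial code. The function names are hypothetical, and the parser is a simplification that assumes ordinary "name" or "name==version" style lines rather than URLs or editable installs.

```python
import re
from importlib import metadata


def parse_requirement_names(lines):
    """Extract bare package names from requirements-style lines.

    Simplified on purpose: handles 'name', 'name==1.0', 'name>=1.0' and
    'name[extra]' forms, and skips comments, blanks and pip options.
    """
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and options like -r
            continue
        # The name ends at the first version specifier, extra or marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report_from_file(path):
    """Read a requirements file and return {package name: installed version}."""
    with open(path, encoding="utf-8") as handle:
        return installed_versions(parse_requirement_names(handle))
```

Calling report_from_file("requirements.txt") from the repository folder returns a dictionary that you can save alongside your results. A value of None means the library is not installed in the current environment.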

Code

There are two Jupyter Notebooks that you can run interactively:

Clustering_Word_Embeddings.ipynb: the main tutorial. This takes you through downloading the data, installing the one library that is not available via pip, and then performing the analysis presented in the tutorial.
Comparison_to_PCA.ipynb: this separate notebook supports the brief section about the limitations of Principal Components Analysis (PCA) on 'non-linear' data used in another tutorial. This code should run independently of anything in the main tutorial because it works with a different data set (downloaded automatically for you in the tutorial).

Other

There is also a Supplementary Materials file. Technically, this relates to an earlier version of the tutorial in which different EThOS data was used; however, I've left it here because it provides useful additional context for the results reported in the newer tutorial.


jreades/ph-tutorial-code
