This repository contains an extended version of David McClure's `textplot` package, a tool for visualizing the structure of a text document. It uses kernel density estimation to build a network of terms based on their co-occurrence in the document.
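The idea, roughly: each term's positions in the document are smoothed into a density curve, and terms whose curves overlap are linked. Below is a toy sketch of that computation, not the package's internal API; the numbers and the smoothing factor are illustrative only.

```python
# Toy sketch of the kernel-density idea (not textplot's actual code):
# smooth each term's token offsets into a density curve over the document,
# then score term pairs by how much their curves overlap.
import numpy as np
from scipy.stats import gaussian_kde

DOC_LEN = 50_000  # length of the document in tokens (made up for this demo)
GRID = np.linspace(0, DOC_LEN, 1000)

def term_density(offsets):
    """Normalized density curve for one term's positions in the document."""
    kde = gaussian_kde(offsets, bw_method=0.2)  # smoothing factor; the CLI's
    density = kde(GRID)                         # --bandwidth plays this role
    return density / density.sum()

def overlap(d1, d2):
    """Overlap of two normalized curves (1.0 = identical distributions)."""
    return np.minimum(d1, d2).sum()

war = term_density([1200, 1350, 30400, 30800])
peace = term_density([29900, 31000, 31500, 48000])
print(f"similarity(war, peace) = {overlap(war, peace):.3f}")
```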
In this version, we've added the following features:

- Text preprocessing with SpaCy: the text is tokenized and lemmatized using SpaCy, which lets us benefit from its advanced NLP capabilities and support for multiple languages. Currently, the package supports English, German, French, and Italian, but you can easily add support for other languages by installing the appropriate SpaCy model and updating the code (see the sketch after this list).
- Phrase detection: the package now includes a Gensim-based phrase detection feature that lets you identify and visualize multi-word expressions in the text. This is particularly useful for analyzing texts with complex terminology or idiomatic expressions.
- Filtering by part-of-speech: the package now lets you filter the terms included in the network based on their UPOS tags. This can help you focus on specific types of words, such as nouns or verbs, and improve the quality of your analysis.
- Support for multiple input formats: the package now accepts a single plain-text file, a directory of files, or a pre-loaded list of strings.
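To make the preprocessing features above concrete, here is a rough sketch of the pipeline using the public SpaCy and Gensim APIs directly. The package wires these together internally, so the exact order of steps and the `scoring="npmi"` choice here are assumptions for illustration, not its actual code.

```python
import spacy
from gensim.models.phrases import Phrases

nlp = spacy.load("en_core_web_sm")  # swap in de_core_news_sm etc. for other languages
doc = nlp(open("data/corpora/human_rights/en/human_rights.txt").read())

# Tokenize and lemmatize, sentence by sentence.
sentences = [
    [tok.lemma_.lower() for tok in sent if tok.is_alpha]
    for sent in doc.sents
]

# Phrase detection: merge frequent collocations into single tokens.
# scoring="npmi" is assumed here so that a 0.6 threshold is meaningful.
phrases = Phrases(sentences, min_count=6, threshold=0.6, scoring="npmi")
phrased = [phrases[sent] for sent in sentences]

# UPOS filtering: e.g. keep only nouns (mirrors --allowed_upos NOUN),
# plus any detected multi-word phrases (joined with "_").
nouns = {tok.lemma_.lower() for tok in doc if tok.pos_ == "NOUN"}
tokens = [t for sent in phrased for t in sent if t in nouns or "_" in t]
```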
To install the package, clone the repository and install the required dependencies:

```bash
# Clone the repository
git clone git@github.com:liri-uzh/textplot.git
cd textplot

# Create a virtual environment (we recommend conda)
conda create -n textplot python=3.11
conda activate textplot

# Install textplot for command line usage
pip install .
```
By default, we install SpaCy's small English model. If you want to use a different language, you need to install the appropriate model. For example, for German, run:

```bash
python -m spacy download de_core_news_sm
```
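You can verify that the model is available from a Python shell:

```python
import spacy

# Raises OSError if the model is not installed.
nlp = spacy.load("de_core_news_sm")
print([(tok.lemma_, tok.pos_) for tok in nlp("Das ist ein kurzer Test.")])
```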
To use the package, you can either run the `textplot` command line tool or import the package in your Python code. The command line tool provides a simple interface for generating a `.gml` file from a text file:

```bash
textplot generate data/corpora/war-and-peace/war-and-peace.txt data/outputs/war-and-peace.gml
```
Alternatively, you can run it as a Python module, using the `textplot/helpers.py` script to process the text and compute the network:

```bash
python -m textplot.helpers \
    data/corpora/human_rights/en/human_rights.txt \
    --tokenizer spacy \
    --lang en \
    --allowed_upos NOUN \
    --custom_stopwords_file textplot/data/stopwords.txt \
    --custom_stopwords "article" \
    --phrase_min_count 6 --phrase_threshold 0.6 \
    --bandwidth 2000 --term_depth 200 --skim_depth 5 -d \
    --output_dir data/outputs/human_rights
```
This command processes the text file `data/corpora/human_rights/en/human_rights.txt` using SpaCy for tokenization and lemmatization, filters the terms based on their UPOS tags (in this case, keeping only nouns), and applies phrase detection with Gensim. By default, this creates a single output file in the output directory, named after the input file with the parameter settings appended and a `.gml` extension (e.g. `human_rights-td200-sd5-bw2000-dwFalse.gml`).
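You can also drive the same pipeline from Python. Here is a minimal sketch, assuming this fork keeps a `build_graph`-style entry point in `textplot/helpers.py` as in the original textplot, with keyword arguments mirroring the CLI flags; check `python -m textplot.helpers --help` for the authoritative names.

```python
from textplot.helpers import build_graph

# Keyword names below mirror the CLI flags and are assumptions about this
# fork's signature, not a documented API.
g = build_graph(
    "data/corpora/human_rights/en/human_rights.txt",
    term_depth=200,
    skim_depth=5,
    bandwidth=2000,
)

# The returned graph wrapper can write GML directly, as in the original textplot.
g.write_gml("data/outputs/human_rights/human_rights.gml")
```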
Once you have generated a `.gml` file, you can visualize the network using the `textplot/plotting.py` script or import it into a graph visualization tool like Gephi. For example, you can use the following command to visualize the network with the `plotting.py` script:

```bash
python -m textplot.plotting data/outputs/human_rights-td200-sd5-bw2000-dwFalse.gml
```
By default, the script runs trials over a series of layout hyperparameter settings and saves the output PNG and JSON files in the same directory as the input file.
This allows for quick exploration of potential layouts and visualizations for a given network.
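The `.gml` output is a plain networkx-readable graph, so you can also lay it out and render it by hand (networkx and matplotlib are already dependencies). A quick sketch using networkx's built-in spring layout as a stand-in for the script's ForceAtlas2 pass:

```python
import matplotlib.pyplot as plt
import networkx as nx

G = nx.read_gml("data/outputs/human_rights-td200-sd5-bw2000-dwFalse.gml")

# Spring (Fruchterman-Reingold) layout as a quick stand-in for ForceAtlas2.
pos = nx.spring_layout(G, iterations=200, seed=42)
nx.draw_networkx(G, pos, node_size=30, font_size=6, edge_color="lightgray")
plt.axis("off")
plt.savefig("human_rights_network.png", dpi=300, bbox_inches="tight")
```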
If you do not want to explore the layout hyperparameters, you can specify `--no_trials`, in which case the script generates a single plot with the layout parameters specified on the command line. Here, the `--iterations` parameter controls the number of iterations of the force-directed layout algorithm, and the `--layout_algorithm` parameter specifies which layout algorithm to use (ForceAtlas2 in this case). Run `python -m textplot.plotting --help` for more options.
An example of the resulting network rendered with `pyvis` is shown below:
For a full list of options, run `python -m textplot.helpers --help`.
If you have labelled data, you can use the `--labels` option to specify the labels used. This ensures that the labels are included in the network even if they are tagged as stopwords or filtered out by the part-of-speech filter. For example, the 8set dataset contains texts labelled with positive and negative sentiment (`sentinegative` and `sentipositive`):
```bash
python -m textplot.helpers \
    data/corpora/8set/8set_ALL.name_text_source_ASCII_cleaned_w_sentiment_h1k.txt \
    --tokenizer spacy \
    --lang en \
    --labels "sentinegative" "sentipositive" \
    --phrase_min_count 6 --phrase_threshold 0.6 \
    --bandwidth 20000 --term_depth 200 --skim_depth 5 -d \
    --output_dir data/outputs/8set
```
For effective visualisations, we recommend using Gephi.
Below are some useful resources for getting familiar with Gephi:

- https://www.youtube.com/watch?v=371n3Ye9vVo&list=PLk_jmmkw5S2BqnYBqF2VNPcszY93-ze49
- https://www.youtube.com/watch?v=WpFZmIJTjA8&t=731s
Remaining TODOs:

- pre vs. post filtering for POS, key terms, sudo words, etc.
- add support for sudo words (words that need to be kept for the analysis, but are ultimately visualised in the network)
- intermediate output of `.gml` files for networkx
- remove numbers from outputs
- improve visualisations with ForceAtlas or post-hoc processing with Gephi
- re-scoring and filtering with TF-IDF (needs document boundaries or a reference corpus)
Textplot is the work of David McClure, who created the original version of the package.
The extended version was developed by LiRI at UZH and is based largely on David's work. We would like to thank him for making his code available to the community.
Textplot uses numpy, scipy, scikit-learn, matplotlib, networkx, and clint.