Commit

Merge branch 'develop'
internaut committed May 3, 2023
2 parents af565c9 + 366371a commit d8fa537
Showing 86 changed files with 18,114 additions and 2,837 deletions.
30 changes: 30 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,30 @@
+name: publish new tmtoolkit release to PyPI
+on: push
+
+jobs:
+  build-and-publish-test:
+    runs-on: ubuntu-latest
+    if: startsWith(github.ref, 'refs/tags')
+    environment:
+      #name: pypi-test
+      name: pypi
+      url: https://pypi.org/p/tmtoolkit
+    permissions:
+      id-token: write
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.x'
+      - name: Install build dependencies
+        run: python -m pip install -U setuptools wheel build
+      - name: Build
+        run: python -m build .
+      # - name: Publish package distributions to TestPyPI
+      #   uses: pypa/gh-action-pypi-publish@release/v1
+      #   with:
+      #     repository-url: https://test.pypi.org/legacy/
+      - name: Publish package distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
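
Note on the release workflow: the job triggers on every push but is gated by ``startsWith(github.ref, 'refs/tags')``, so it should only run for tag pushes. The following is a minimal Python sketch of that condition, not part of the commit. The function name ``should_publish`` and the example refs are hypothetical; the case-insensitive comparison reflects how GitHub documents ``startsWith``.

```python
def should_publish(ref: str, prefix: str = "refs/tags") -> bool:
    """Mimic the workflow condition startsWith(github.ref, 'refs/tags')."""
    # GitHub's startsWith() expression is documented as not case sensitive,
    # so both strings are lowercased before comparing.
    return ref.lower().startswith(prefix.lower())


# Hypothetical refs for illustration only.
print(should_publish("refs/tags/v0.12.0"))   # a tag push
print(should_publish("refs/heads/develop"))  # a branch push
```

Because the prefix has no trailing slash, any ref that merely begins with ``refs/tags`` would pass the check; in practice Git refs pushed as tags always have the form ``refs/tags/<name>``.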
10 changes: 5 additions & 5 deletions .github/workflows/runtests.yml
@@ -1,8 +1,8 @@
 # GitHub actions workflow for testing tmtoolkit
-# Runs tests on Ubuntu, MacOS and Windows with Python versions 3.8, 3.9 and 3.10 each, which means 9 jobs are spawned.
+# Runs tests on Ubuntu, MacOS and Windows with Python versions 3.8, 3.9, 3.10, 3.11.
 # Tests are run using tox (https://tox.wiki/).
 #
-# author: Markus Konrad <markus.konrad@wzb.eu>
+# author: Markus Konrad <post@mkonrad.net>

 name: run tests

@@ -19,12 +19,12 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: ["3.8", "3.9", "3.10"]
+        python-version: ["3.8", "3.9", "3.10", "3.11"]
         testsuite: ["minimal", "full"]
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: set up python ${{ matrix.python-version }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ matrix.python-version }}
           cache: 'pip'
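
Note on the test matrix: a GitHub Actions matrix spawns one job per combination of its dimensions. The sketch below (not part of the commit) enumerates the combinations for the three dimensions shown in this hunk, assuming no ``include`` or ``exclude`` entries exist elsewhere in the workflow file.

```python
from itertools import product

# Matrix dimensions as they appear in the updated workflow.
matrix = {
    "os": ["ubuntu-latest", "macos-latest", "windows-latest"],
    "python-version": ["3.8", "3.9", "3.10", "3.11"],
    "testsuite": ["minimal", "full"],
}

# One job per element of the Cartesian product of all dimensions.
jobs = [dict(zip(matrix, combo)) for combo in product(*matrix.values())]

print(len(jobs))  # 3 * 4 * 2 = 24
print(jobs[0])
```

With Python 3.11 added, the matrix yields 24 jobs. The removed comment's figure of 9 jobs did not account for the ``testsuite`` dimension.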
5 changes: 4 additions & 1 deletion .gitignore
@@ -17,4 +17,7 @@ examples/data/*.pickle
 .tox/
 .Rhistory
 doc/source/data/corpus_norm.pickle
-.coverage
+.coverage*
+examples/data/aclImdb_v1.tar.gz
+venv
+examples/data/topicmod_evaluate_*.png
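
Note on the ``.coverage*`` pattern: the change from ``.coverage`` to ``.coverage*`` also ignores suffixed coverage data files. The sketch below (not part of the commit) approximates the pattern with Python's ``fnmatch`` module. Git's ignore rules are similar to, but not identical with, ``fnmatch`` globbing, and the first two file names are hypothetical.

```python
from fnmatch import fnmatchcase

pattern = ".coverage*"

# The first two names are hypothetical; coverage.svg is a file touched by this commit.
candidates = [".coverage", ".coverage.myhost.12345", "coverage.svg"]

# "*" also matches the empty string, so the bare ".coverage" file stays ignored.
ignored = [name for name in candidates if fnmatchcase(name, pattern)]
print(ignored)
```

The badge file ``coverage.svg`` lacks the leading dot, so the pattern does not match it and it remains tracked.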
2 changes: 1 addition & 1 deletion AUTHORS.md
@@ -2,7 +2,7 @@

 ## Maintainer / main developer

-[Markus Konrad](https://github.com/internaut) @ [WZB](https://github.com/WZBSocialScienceCenter/)
+[Markus Konrad](https://github.com/internaut)

 ## Contributors

37 changes: 18 additions & 19 deletions README.rst
@@ -4,29 +4,24 @@ tmtoolkit: Text mining and topic modeling toolkit
 |pypi| |pypi_downloads| |rtd| |runtests| |coverage| |zenodo|

 *tmtoolkit* is a set of tools for text mining and topic modeling with Python developed especially for the use in the
-social sciences, in journalism or related disciplines. It aims for easy installation, extensive documentation
+social sciences, linguistics, journalism or related disciplines. It aims for easy installation, extensive documentation
 and a clear programming interface while offering good performance on large datasets by the means of vectorized
 operations (via NumPy) and parallel computation (using Python's *multiprocessing* module and the
 `loky <https://loky.readthedocs.io/>`_ package). The basis of tmtoolkit's text mining capabilities are built around
-`SpaCy <https://spacy.io/>`_, which offers a `many language models <https://spacy.io/models>`_.
+`SpaCy <https://spacy.io/>`_, which offers `many language models <https://spacy.io/models>`_.

 The documentation for tmtoolkit is available on `tmtoolkit.readthedocs.org <https://tmtoolkit.readthedocs.org>`_ and
 the GitHub code repository is on
-`github.com/WZBSocialScienceCenter/tmtoolkit <https://github.com/WZBSocialScienceCenter/tmtoolkit>`_.
-
-**Upgrade note:**
-
-Since Feb 8 2022, the newest version 0.11.0 of tmtoolkit is available on PyPI. This version features a new API
-for text processing and mining which is incompatible with prior versions. It's advisable to first read the
-first three chapters of the `tutorial <https://tmtoolkit.readthedocs.io/en/latest/getting_started.html>`_
-to get used to the new API. You should also re-install tmtoolkit in a new virtual environment or completely
-remove the old version prior to upgrading. See the
-`installation instructions <https://tmtoolkit.readthedocs.io/en/latest/install.html>`_.
+`github.com/internaut/tmtoolkit <https://github.com/internaut/tmtoolkit>`_.

 Requirements and installation
 -----------------------------

-**tmtoolkit works with Python 3.8 or newer (tested up to Python 3.10).**
+**tmtoolkit works with Python 3.8 or newer (tested up to Python 3.11).**
+
+.. note:: There are two dependencies, that don't work with Python 3.11 so far: *lda* and *wordcloud*. If you want to
+    do topic modeling via LDA and/or want to use word cloud visualizations, you must use Python 3.8 to 3.10 or
+    wait until lda and wordcloud receive updates that make them work under Python 3.11.

 The tmtoolkit package is highly modular and tries to install as few dependencies as possible. For requirements and
 installation procedures, please have a look at the
@@ -66,8 +61,10 @@ The tmtoolkit package offers several text preprocessing and text mining methods,
   `document and token attributes as dataframes <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Accessing-tokens-and-token-attributes>`_
 - calculating and `visualizing corpus summary statistics <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Visualizing-corpus-summary-statistics>`_
 - finding out and joining `collocations <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Identifying-and-joining-token-collocations>`_
+- calculating `token cooccurrences <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Token-cooccurrence-matrices>`_
 - `splitting and sampling corpora <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Corpus-functions-for-document-management>`_
-- generating `n-grams <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-n-grams>`_
+- generating `n-grams <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-n-grams>`_ and using
+  `N-gram models <https://tmtoolkit.readthedocs.io/en/latest/api.html#module-tmtoolkit.ngrammodels>`_
 - generating `sparse document-term matrices <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-a-sparse-document-term-matrix-(DTM)>`_

 Wherever possible and useful, these methods can operate in parallel to speed up computations with large datasets.
@@ -110,6 +107,8 @@ Other features
   `text files, tabular files (CSV or Excel), ZIP files or folders <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Loading-text-data>`_
 - `splitting and joining documents <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Corpus-functions-for-document-management>`_
 - `common statistics and transformations for document-term matrices <https://tmtoolkit.readthedocs.io/en/latest/bow.html>`_ like word cooccurrence and *tf-idf*
+- `interoperability with R <https://tmtoolkit.readthedocs.io/en/latest/rinterop.html>`_
+

 Limits
 ------
@@ -129,7 +128,7 @@ License
 -------

 Code licensed under `Apache License 2.0 <https://www.apache.org/licenses/LICENSE-2.0>`_.
-See `LICENSE <https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/LICENSE>`_ file.
+See `LICENSE <https://github.com/internaut/tmtoolkit/blob/master/LICENSE>`_ file.

 .. |pypi| image:: https://badge.fury.io/py/tmtoolkit.svg
     :target: https://badge.fury.io/py/tmtoolkit
@@ -139,12 +138,12 @@ See `LICENSE <https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/LI
     :target: https://pypi.org/project/tmtoolkit/
     :alt: Downloads from PyPI

-.. |runtests| image:: https://github.com/WZBSocialScienceCenter/tmtoolkit/actions/workflows/runtests.yml/badge.svg
-    :target: https://github.com/WZBSocialScienceCenter/tmtoolkit/actions/workflows/runtests.yml
+.. |runtests| image:: https://github.com/internaut/tmtoolkit/actions/workflows/runtests.yml/badge.svg
+    :target: https://github.com/internaut/tmtoolkit/actions/workflows/runtests.yml
     :alt: GitHub Actions CI Build Status

-.. |coverage| image:: https://raw.githubusercontent.com/WZBSocialScienceCenter/tmtoolkit/master/coverage.svg?sanitize=true
-    :target: https://github.com/WZBSocialScienceCenter/tmtoolkit/tree/master/tests
+.. |coverage| image:: https://raw.githubusercontent.com/internaut/tmtoolkit/master/coverage.svg?sanitize=true
+    :target: https://github.com/internaut/tmtoolkit/tree/master/tests
     :alt: Coverage status

 .. |rtd| image:: https://readthedocs.org/projects/tmtoolkit/badge/?version=latest
2 changes: 1 addition & 1 deletion conftest.py
@@ -1,7 +1,7 @@
 """
 Configuration for tests with pytest
-.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>
+.. codeauthor:: Markus Konrad <post@mkonrad.net>
 """

 from hypothesis import settings, HealthCheck
4 changes: 2 additions & 2 deletions coverage.svg
15 changes: 15 additions & 0 deletions doc/source/api.rst
@@ -37,11 +37,26 @@ Functions to visualize corpus summary statistics
     :members:


+tmtoolkit.ngrammodels
+---------------------
+
+.. automodule:: tmtoolkit.ngrammodels
+    :members:
+
+
 tmtoolkit.strings
 -----------------

 .. automodule:: tmtoolkit.strings
     :members:


+tmtoolkit.tokenseq
+------------------
+
+.. automodule:: tmtoolkit.tokenseq
+    :members:
+    :imported-members:
+
+
 tmtoolkit.topicmod