Merge branch 'develop'

internaut · May 3, 2023 · d8fa537
2 parents af565c9 + 366371a
Showing 86 changed files with 18,114 additions and 2,837 deletions.
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -0,0 +1,30 @@
+name: publish new tmtoolkit release to PyPI
+on: push
+
+jobs:
+  build-and-publish-test:
+    runs-on: ubuntu-latest
+    if: startsWith(github.ref, 'refs/tags')
+    environment:
+      #name: pypi-test
+      name: pypi
+      url: https://pypi.org/p/tmtoolkit
+    permissions:
+      id-token: write
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v3
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: '3.x'
+    - name: Install build dependencies
+      run: python -m pip install -U setuptools wheel build
+    - name: Build
+      run: python -m build .
+#    - name: Publish package distributions to TestPyPI
+#      uses: pypa/gh-action-pypi-publish@release/v1
+#      with:
+#        repository-url: https://test.pypi.org/legacy/
+    - name: Publish package distributions to PyPI
+      uses: pypa/gh-action-pypi-publish@release/v1
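Note on the `if: startsWith(github.ref, 'refs/tags')` condition in the new workflow: the job is triggered on every push but only runs for tag refs. A minimal Python sketch of that check follows. The helper name `runs_on_ref` and the example refs are hypothetical, and the case-insensitive comparison reflects how GitHub documents `startsWith`; this is an illustration, not how the workflow engine is implemented.

```python
def runs_on_ref(ref: str, prefix: str = "refs/tags") -> bool:
    """Approximate startsWith(github.ref, 'refs/tags').

    Assumption: the comparison ignores case, as described in the
    GitHub Actions expression documentation.
    """
    return ref.lower().startswith(prefix.lower())


# Hypothetical refs: a tag push passes the filter, a branch push does not.
for ref in ("refs/tags/v1.2.3", "refs/heads/develop"):
    print(ref, runs_on_ref(ref))
```

Under this reading, a plain push to `develop` leaves the publish job skipped, while pushing any tag starts the build and upload.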
diff --git a/.github/workflows/runtests.yml b/.github/workflows/runtests.yml
@@ -1,8 +1,8 @@
 # GitHub actions workflow for testing tmtoolkit
-# Runs tests on Ubuntu, MacOS and Windows with Python versions 3.8, 3.9 and 3.10 each, which means 9 jobs are spawned.
+# Runs tests on Ubuntu, MacOS and Windows with Python versions 3.8, 3.9, 3.10, 3.11.
 # Tests are run using tox (https://tox.wiki/).
 #
-# author: Markus Konrad <markus.konrad@wzb.eu>
+# author: Markus Konrad <post@mkonrad.net>
 
 name: run tests
 
@@ -19,12 +19,12 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: ["3.8", "3.9", "3.10"]
+        python-version: ["3.8", "3.9", "3.10", "3.11"]
         testsuite: ["minimal", "full"]
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: set up python ${{ matrix.python-version }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ matrix.python-version }}
           cache: 'pip'
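Note on the matrix change: the updated header comment no longer states a job count. Assuming the workflow has no `include` or `exclude` entries outside the visible hunk, the matrix expands to the Cartesian product of its three axes, which can be sketched as:

```python
from itertools import product

# Axis values copied from the updated matrix in runtests.yml.
os_list = ["ubuntu-latest", "macos-latest", "windows-latest"]
python_versions = ["3.8", "3.9", "3.10", "3.11"]
testsuites = ["minimal", "full"]

# One job per combination of OS, Python version and test suite.
jobs = list(product(os_list, python_versions, testsuites))
print(len(jobs))  # 3 * 4 * 2 = 24
```

Adding Python 3.11 therefore raises the count from 18 to 24 jobs under the same assumption.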

diff --git a/.gitignore b/.gitignore
@@ -17,4 +17,7 @@ examples/data/*.pickle
 .tox/
 .Rhistory
 doc/source/data/corpus_norm.pickle
-.coverage
+.coverage*
+examples/data/aclImdb_v1.tar.gz
+venv
+examples/data/topicmod_evaluate_*.png
diff --git a/AUTHORS.md b/AUTHORS.md
@@ -2,7 +2,7 @@
 
 ## Maintainer / main developer
 
-[Markus Konrad](https://github.com/internaut) @ [WZB](https://github.com/WZBSocialScienceCenter/)
+[Markus Konrad](https://github.com/internaut)
 
 ## Contributors
 

diff --git a/README.rst b/README.rst
@@ -4,29 +4,24 @@ tmtoolkit: Text mining and topic modeling toolkit
 |pypi| |pypi_downloads| |rtd| |runtests| |coverage| |zenodo|
 
 *tmtoolkit* is a set of tools for text mining and topic modeling with Python developed especially for the use in the
-social sciences, in journalism or related disciplines. It aims for easy installation, extensive documentation
+social sciences, linguistics, journalism or related disciplines. It aims for easy installation, extensive documentation
 and a clear programming interface while offering good performance on large datasets by the means of vectorized
 operations (via NumPy) and parallel computation (using Python's *multiprocessing* module and the
 `loky <https://loky.readthedocs.io/>`_ package). The basis of tmtoolkit's text mining capabilities are built around
-`SpaCy <https://spacy.io/>`_, which offers a `many language models <https://spacy.io/models>`_.
+`SpaCy <https://spacy.io/>`_, which offers `many language models <https://spacy.io/models>`_.
 
 The documentation for tmtoolkit is available on `tmtoolkit.readthedocs.org <https://tmtoolkit.readthedocs.org>`_ and
 the GitHub code repository is on
-`github.com/WZBSocialScienceCenter/tmtoolkit <https://github.com/WZBSocialScienceCenter/tmtoolkit>`_.
-
-**Upgrade note:**
-
-Since Feb 8 2022, the newest version 0.11.0 of tmtoolkit is available on PyPI. This version features a new API
-for text processing and mining which is incompatible with prior versions. It's advisable to first read the
-first three chapters of the `tutorial <https://tmtoolkit.readthedocs.io/en/latest/getting_started.html>`_
-to get used to the new API. You should also re-install tmtoolkit in a new virtual environment or completely
-remove the old version prior to upgrading. See the
-`installation instructions <https://tmtoolkit.readthedocs.io/en/latest/install.html>`_.
+`github.com/internaut/tmtoolkit <https://github.com/internaut/tmtoolkit>`_.
 
 Requirements and installation
 -----------------------------
 
-**tmtoolkit works with Python 3.8 or newer (tested up to Python 3.10).**
+**tmtoolkit works with Python 3.8 or newer (tested up to Python 3.11).**
+
+.. note:: There are two dependencies that don't work with Python 3.11 so far: *lda* and *wordcloud*. If you want to
+          do topic modeling via LDA and/or want to use word cloud visualizations, you must use Python 3.8 to 3.10 or
+          wait until lda and wordcloud receive updates that make them work under Python 3.11.
 
 The tmtoolkit package is highly modular and tries to install as few dependencies as possible. For requirements and
 installation procedures, please have a look at the
@@ -66,8 +61,10 @@ The tmtoolkit package offers several text preprocessing and text mining methods,
   `document and token attributes as dataframes <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Accessing-tokens-and-token-attributes>`_
 - calculating and `visualizing corpus summary statistics <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Visualizing-corpus-summary-statistics>`_
 - finding out and joining `collocations <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Identifying-and-joining-token-collocations>`_
+- calculating `token cooccurrences <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Token-cooccurrence-matrices>`_
 - `splitting and sampling corpora <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Corpus-functions-for-document-management>`_
-- generating `n-grams <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-n-grams>`_
+- generating `n-grams <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-n-grams>`_ and using
+  `N-gram models <https://tmtoolkit.readthedocs.io/en/latest/api.html#module-tmtoolkit.ngrammodels>`_
 - generating `sparse document-term matrices <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-a-sparse-document-term-matrix-(DTM)>`_
 
 Wherever possible and useful, these methods can operate in parallel to speed up computations with large datasets.
@@ -110,6 +107,8 @@ Other features
   `text files, tabular files (CSV or Excel), ZIP files or folders <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Loading-text-data>`_
 - `splitting and joining documents <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Corpus-functions-for-document-management>`_
 - `common statistics and transformations for document-term matrices <https://tmtoolkit.readthedocs.io/en/latest/bow.html>`_ like word cooccurrence and *tf-idf*
+- `interoperability with R <https://tmtoolkit.readthedocs.io/en/latest/rinterop.html>`_
+
 
 Limits
 ------
@@ -129,7 +128,7 @@ License
 -------
 
 Code licensed under `Apache License 2.0 <https://www.apache.org/licenses/LICENSE-2.0>`_.
-See `LICENSE <https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/LICENSE>`_ file.
+See `LICENSE <https://github.com/internaut/tmtoolkit/blob/master/LICENSE>`_ file.
 
 .. |pypi| image:: https://badge.fury.io/py/tmtoolkit.svg
     :target: https://badge.fury.io/py/tmtoolkit
@@ -139,12 +138,12 @@ See `LICENSE <https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/LI
     :target: https://pypi.org/project/tmtoolkit/
     :alt: Downloads from PyPI
 
-.. |runtests| image:: https://github.com/WZBSocialScienceCenter/tmtoolkit/actions/workflows/runtests.yml/badge.svg
-    :target: https://github.com/WZBSocialScienceCenter/tmtoolkit/actions/workflows/runtests.yml
+.. |runtests| image:: https://github.com/internaut/tmtoolkit/actions/workflows/runtests.yml/badge.svg
+    :target: https://github.com/internaut/tmtoolkit/actions/workflows/runtests.yml
     :alt: GitHub Actions CI Build Status
 
-.. |coverage| image:: https://raw.githubusercontent.com/WZBSocialScienceCenter/tmtoolkit/master/coverage.svg?sanitize=true
-    :target: https://github.com/WZBSocialScienceCenter/tmtoolkit/tree/master/tests
+.. |coverage| image:: https://raw.githubusercontent.com/internaut/tmtoolkit/master/coverage.svg?sanitize=true
+    :target: https://github.com/internaut/tmtoolkit/tree/master/tests
     :alt: Coverage status
 
 .. |rtd| image:: https://readthedocs.org/projects/tmtoolkit/badge/?version=latest
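Note on the README's Python 3.11 remark: the stated range for *lda* and *wordcloud* (Python 3.8 to 3.10) can be checked at runtime before attempting the optional imports. The helper below is a hypothetical sketch, not part of tmtoolkit, and the upper bound only reflects the README note as of this commit.

```python
import sys


def lda_and_wordcloud_supported(version=None) -> bool:
    """Return True if the interpreter is within Python 3.8 to 3.10.

    Assumption: the bounds mirror the README note in this commit and
    would need updating once lda and wordcloud support Python 3.11.
    """
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return (3, 8) <= major_minor <= (3, 10)


if not lda_and_wordcloud_supported():
    print("LDA topic modeling and word clouds are unavailable on this Python version")
```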

diff --git a/conftest.py b/conftest.py
@@ -1,7 +1,7 @@
 """
 Configuration for tests with pytest
 
-.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>
+.. codeauthor:: Markus Konrad <post@mkonrad.net>
 """
 
 from hypothesis import settings, HealthCheck

diff --git a/coverage.svg b/coverage.svg
diff --git a/doc/source/api.rst b/doc/source/api.rst
@@ -37,11 +37,26 @@ Functions to visualize corpus summary statistics
     :members:
 
 
+tmtoolkit.ngrammodels
+---------------------
+
+.. automodule:: tmtoolkit.ngrammodels
+    :members:
+
+
+tmtoolkit.strings
+-----------------
+
+.. automodule:: tmtoolkit.strings
+    :members:
+
+
 tmtoolkit.tokenseq
 ------------------
 
 .. automodule:: tmtoolkit.tokenseq
     :members:
+    :imported-members:
 
 
 tmtoolkit.topicmod