pyTCTK for Python Text Cleaning ToolKit

Overview

The goal is to provide tools for preparing text data without having to install anything beyond Python and the packages listed in requirements.txt. Some text cleaning libraries cannot be used on professional computers because they download files from servers or URLs that are blocked by internet proxies. With pyTCTK, you only need Python and access to GitHub to clean your text data, so the library can be used anywhere, including on a professional computer.

Content

For the moment, three classes with several functions are available:

The TextNet class implements the general functions to clean up your text (removing punctuation, uppercase letters, email addresses, URLs, HTML tags, etc.);
The WordNet class implements the functions for more precise cleaning at the word level of your text (removing stopwords or applying lemmatization or stemming);
The Tokenize class implements two functions to tokenize and detokenize the words in your text.
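
To make the kind of cleaning described above concrete, here is a minimal sketch built only on the Python standard library. It is not pyTCTK's API: the names `clean_text`, `tokenize` and `detokenize`, as well as the regular expressions, are hypothetical and chosen for illustration. See the example notebook for the actual class methods.

```python
import html
import re
import string

# Hypothetical helpers for illustration only; these are NOT pyTCTK's API.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, email addresses and punctuation, then lowercase."""
    text = TAG_RE.sub(" ", text)         # drop HTML tags
    text = html.unescape(text)           # decode entities such as &amp;
    text = URL_RE.sub(" ", text)         # drop URLs
    text = EMAIL_RE.sub(" ", text)       # drop email addresses
    text = text.translate(PUNCT_TABLE)   # drop ASCII punctuation
    text = text.lower()                  # remove uppercase
    return " ".join(text.split())        # collapse repeated whitespace


def tokenize(text: str) -> list[str]:
    """Split a cleaned text into whitespace-separated tokens."""
    return text.split()


def detokenize(tokens: list[str]) -> str:
    """Join tokens back into a single string."""
    return " ".join(tokens)


if __name__ == "__main__":
    raw = "<p>Hello, World! Visit https://example.com or mail me@example.com.</p>"
    tokens = tokenize(clean_text(raw))
    print(tokens)
    print(detokenize(tokens))
```

URLs and email addresses are removed before punctuation in this sketch, because stripping the punctuation first would destroy the patterns that identify them.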

Requirements

Python version 3.9.7
Install the packages listed in requirements.txt:

$ pip install -r requirements.txt

Libraries used

import numpy as np
import os
import pandas as pd
import re
from urllib import request

File details

requirements
This folder contains a .txt file with all the packages and versions needed to run the project.
codefile
This folder contains a .py file with all the classes, functions and methods.
example
This folder contains an example notebook that shows how to use the different classes and functions, along with their outputs.
ressources
This folder contains several subfolders in which there are .txt vocabulary files for processing and cleaning the texts.

Here is the project structure:

- project
    > pyTCTK
        > requirements
            - requirements.txt
        > codefile
            - pyTCTK.py
        > example
            - pyTCTK.ipynb
        > ressources
            > stopwords
                - english.txt
                - french.txt
            > lemme
                - english.txt
                - french.txt
            > stemme
                - english.txt
                - french.txt
            > accents
                - accents.txt

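As an illustration of how the vocabulary files in this layout could be used, the sketch below loads a stopword list from `ressources/stopwords/<language>.txt` and filters a list of tokens. It makes two assumptions that the README does not confirm: that each file holds one word per line, and that it is encoded in UTF-8. The function names `load_stopwords` and `remove_stopwords` are hypothetical and are not part of pyTCTK.

```python
from pathlib import Path


def load_stopwords(resources_dir, language="english"):
    """Read ressources/stopwords/<language>.txt into a set of lowercase words.

    Assumption: one word per line, UTF-8 encoded (not confirmed by the README).
    """
    path = Path(resources_dir) / "stopwords" / f"{language}.txt"
    with path.open(encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]
```

A possible call, assuming the repository was cloned into the current directory, would be `remove_stopwords(tokens, load_stopwords("pyTCTK/ressources", "french"))`.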
Features

My profile • My GitHub

lprtk/pyTCTK
