
Python Text Cleaning ToolKit library (pyTCTK)

lprtk/pyTCTK


pyTCTK for Python Text Cleaning ToolKit



Overview

The objective is to provide tools for preparing your text data without having to install anything extra. Some text cleaning libraries cannot be used on professional computers because they need to download files from servers or URLs that are blocked by internet proxies. With pyTCTK, you only need Python and access to GitHub to clean your text data. The goal is a library that can be used anywhere, including on a work computer.

Content

For the moment, three classes with several functions are available:

  • The TextNet class implements the general functions for cleaning up your text (removing punctuation, uppercase, email addresses, URLs, HTML tags, etc.);

  • The WordNet class implements the functions for more precise, word-level cleaning of your text (removing stopwords, or applying lemmatization or stemming);

  • The Tokenize class implements the two functions that tokenize and detokenize the words in your text.
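
To make the kinds of operations listed above concrete, here is a minimal standalone sketch that uses only the standard `re` module. It is not pyTCTK's API: the names `clean_text`, `tokenize` and `detokenize`, and the regular expressions, are assumptions chosen for illustration, and the library's own classes may behave differently.

```python
import re

# Hypothetical patterns for illustration; not taken from pyTCTK.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PUNCT_RE = re.compile(r"[^\w\s]")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, email addresses and punctuation, then lowercase."""
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text: str) -> list[str]:
    """Split a cleaned string into whitespace-separated tokens."""
    return text.split()


def detokenize(tokens: list[str]) -> str:
    """Join tokens back into a single space-separated string."""
    return " ".join(tokens)


raw = "<p>Hello, World!</p> Mail me at john.doe@example.com or visit https://example.org/page."
cleaned = clean_text(raw)
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```

URLs and email addresses are removed before punctuation so that their dots and slashes are still available for matching.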

Requirements

  • Python version 3.9.7
  • Install requirements.txt
$ pip install -r requirements.txt 
  • Libraries used
import numpy as np
import os
import pandas as pd
import re
from urllib import request

File details

  • requirements: this folder contains a .txt file listing all the packages and versions needed to run the project.
  • pyTCTK: this folder contains a .py file with all the classes, functions and methods.
  • example: this folder contains an example notebook that shows how to use the different classes and functions, and what they output.
  • ressources: this folder contains several subfolders holding the .txt vocabulary files used to process and clean the texts.

Here is the project structure:

- project
    > pyTCTK
        > requirements
            - requirements.txt
        > codefile 
            - pyTCTK.py
        > example 
            - pyTCTK.ipynb
        > ressources
            > stopwords
                - english.txt
                - french.txt
            > lemme
                - english.txt
                - french.txt
            > stemme
                - english.txt
                - french.txt
            > accents
                - accents.txt
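
The vocabulary files in the layout above could be used along the following lines. This is a sketch under stated assumptions, not the library's implementation: it assumes a stopword file such as ressources/stopwords/english.txt holds one word per line (the actual file format is not documented here), and the helpers `load_vocabulary` and `remove_stopwords` are hypothetical names.

```python
from pathlib import Path


def load_vocabulary(path) -> set[str]:
    """Read a vocabulary file, assuming one entry per line (an assumption)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Ignore blank lines and normalize entries to lowercase.
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


if __name__ == "__main__":
    # Path taken from the project layout; adjust it to where the repository is cloned.
    vocab_file = Path("ressources/stopwords/english.txt")
    if vocab_file.exists():
        stopwords = load_vocabulary(vocab_file)
        print(remove_stopwords(["this", "is", "a", "clean", "text"], stopwords))
```

Because the vocabulary is read from a local file, no download is needed at run time once the repository has been cloned.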

Features
