Getting started: preprocessing #144

Open
jbesomi opened this issue Aug 6, 2020 · 1 comment
Labels
documentation Improvements or additions to documentation

Comments

Owner

jbesomi commented Aug 6, 2020

Task: write the "Getting started: preprocessing" doc page

Advice/Tips to the technical writer

Good to know:

  • This page appears after "1. Getting started". Readers of this page already have a global overview of Texthero and are motivated to learn more.
  • As this is the second page of the documentation (i.e. in theory the second page a user reads), we want to teach users how to get the most out of the library. This includes:
    1. Mention that the API is very complete and very easy to use, i.e. we want to teach users how to find information on their own (we achieve our goal of creating an awesome library when users can find answers on our website without searching Google or opening a new Stack Overflow question)
    2. ...

Concepts useful to have clearly in mind:

  • What the pandas.Series.pipe function is and how it works
  • Why text preprocessing is crucial in many NLP/text mining areas
  • Dealing with regex is often painful; with Texthero this can be avoided
  • With the exception of tokenize, all preprocessing functions receive a TextSeries as input and return a TextSeries.
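
A minimal sketch of the first and last points above, using plain pandas only. The two step functions are hypothetical stand-ins written for this illustration, not Texthero's implementation; they only show that a step takes a Series of text and returns a Series of text, which is what makes chaining with pipe possible.

```python
import pandas as pd


def lowercase(s: pd.Series) -> pd.Series:
    """Hypothetical stand-in for a preprocessing step: lowercase every cell."""
    return s.str.lower()


def remove_digits(s: pd.Series) -> pd.Series:
    """Hypothetical stand-in: strip digits; the regex stays hidden from the caller."""
    return s.str.replace(r"\d+", "", regex=True)


s = pd.Series(["Hello World 2020", "Texthero 123 ROCKS"])

# s.pipe(f) is equivalent to f(s), so steps read left to right when chained.
piped = s.pipe(lowercase).pipe(remove_digits)

# The same computation written as nested calls reads inside out.
nested = remove_digits(lowercase(s))

print(piped.tolist())
```

Because every step has the same Series-in, Series-out shape, steps can be reordered or dropped without touching the others.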

Things to keep in mind when writing:

  • In the future, Texthero will also allow preprocessing of non-Western languages

To stay in the technical discussion loop:

Page

aim: learn how to preprocess a text-based dataset with Texthero

content:

  • clean function: the default approach, for when no customization is required
  • custom pipeline: how to create a custom pipeline, basically "copy/paste" the clean code and edit the pipeline; see "Preprocessing: explain how to create a custom pipeline" #38
  • mention/explain that different preprocessing functions exist (cite a few, link them, ...). Not sure we want to explain them all; this is up to the writer ...
  • ...
  • (at the end) --> tokenize
    • explain what tokenization is about and why it is needed
    • output is a TokenSeries (fundamentally different; show an example: every cell is now a list of tokens)
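
The outline above could be illustrated roughly as follows. This is a sketch in plain pandas under stated assumptions: the step functions, the `pipeline` parameter, and the whitespace-based `tokenize` are all hypothetical stand-ins written for this example, not Texthero's actual code. It shows a default clean, a custom pipeline obtained by editing the list of steps, and how the output of tokenization differs in shape.

```python
import pandas as pd


# Hypothetical stand-in steps: each takes a Series of text and returns one.
def fillna(s: pd.Series) -> pd.Series:
    return s.fillna("")


def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()


def remove_punctuation(s: pd.Series) -> pd.Series:
    # Replace anything that is neither a word character nor whitespace.
    return s.str.replace(r"[^\w\s]", " ", regex=True)


def remove_whitespace(s: pd.Series) -> pd.Series:
    # Collapse runs of whitespace, then trim both ends.
    return s.str.replace(r"\s+", " ", regex=True).str.strip()


def clean(s: pd.Series, pipeline=None) -> pd.Series:
    """Apply each step in order; fall back to a default list of steps."""
    if pipeline is None:
        pipeline = [fillna, lowercase, remove_punctuation, remove_whitespace]
    for step in pipeline:
        s = s.pipe(step)
    return s


def tokenize(s: pd.Series) -> pd.Series:
    """Naive whitespace tokenizer (an assumption): each cell becomes a list."""
    return s.str.split()


s = pd.Series(["Hello, World!", None, "  Text   mining is FUN.  "])

cleaned = clean(s)                                        # default pipeline
custom = clean(s, pipeline=[fillna, remove_whitespace])   # edited pipeline
tokens = tokenize(cleaned)                                # cells are now lists

print(cleaned.tolist())
print(custom.tolist())
print(tokens.tolist())
```

The custom call keeps case and punctuation because those steps were left out of the list, while the tokenized Series no longer holds strings, so text-to-text steps cannot be applied after it.
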
@jbesomi jbesomi added the documentation Improvements or additions to documentation label Aug 6, 2020

Iota87 commented Aug 18, 2020

I am structuring this part as follows (based on a review of similar contexts, including the Texthero "Getting Started" structure):
**Overview/Intro**
Why pre-processing is crucial and what the benefits of a standardized/customizable pipeline are
**Clean**
What it does and how
**Custom Pipeline**
Why and how you should take control of the pre-processing steps
**More details**
Including pre-processing API functionalities

Please let me know if something is not clear or if you have any additional suggestions.
