Getting started: preprocessing #144

jbesomi · 2020-08-06T19:39:34Z

Task: write the "Getting started: preprocessing" doc page

Advice/Tips to the technical writer

Good to know:

This page comes right after "1. Getting started". Users reading it already have a global overview of Texthero and are motivated to learn more.
As this is the second page of the documentation (i.e. in theory the second page a user reads), we want to teach users how to get the most out of the library. This includes:
1. Mention that the API is very complete and very easy to use, i.e. we want to teach users how to find information on their own (we achieve our goal of creating an awesome library when users can find answers on our website without searching Google or opening a new Stack Overflow question)
2. ...

Concepts to have clearly in mind:

What pandas.Series.pipe is and how it works
Why text preprocessing is crucial in many NLP/text mining areas
Dealing with regex is often painful; with Texthero this can be avoided
With the exception of tokenize, all preprocessing functions take a TextSeries as input and return a TextSeries.
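
To illustrate the pipe and "Series in, Series out" points, here is a minimal sketch using plain pandas. The three step functions are stand-ins written for this example, not Texthero's implementations; on the real page the corresponding Texthero preprocessing functions would presumably take their place. The regex lives inside the steps, so the caller only chains function names.

```python
import pandas as pd

# Stand-in steps written for this sketch; they are NOT Texthero's implementations.
def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()

def remove_digits(s: pd.Series) -> pd.Series:
    # Replace each run of digits with a single space.
    return s.str.replace(r"\d+", " ", regex=True)

def remove_whitespace(s: pd.Series) -> pd.Series:
    # Collapse runs of whitespace and trim both ends.
    return s.str.replace(r"\s+", " ", regex=True).str.strip()

s = pd.Series(["Hello   World 123", "Text  Mining 4 Fun"])

# Nested calls read inside-out...
nested = remove_whitespace(remove_digits(lowercase(s)))

# ...while pipe reads left to right, in the order the steps run.
piped = s.pipe(lowercase).pipe(remove_digits).pipe(remove_whitespace)

print(piped.tolist())
```

Because every step takes a Series of strings and returns one, the steps can be chained in any order with pipe.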

Things to keep in mind when writing:

In the future, Texthero will also support preprocessing non-Western languages

To stay in the technical discussion loop:

What if every preprocessing function required an already tokenized Series as input? I.e. the first mandatory step would be tokenize, even before remove_punctuation or anything else. This would be useful when dealing with non-Western languages (see "All preprocessing functions to receive as input TokenSeries" #145).
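
To make the question concrete, here is a rough sketch of what a token-based step could look like. Both functions are hypothetical and written for this discussion only: the tokenizer is a naive regex one, and `remove_punctuation_tokens` is an invented name, not an existing Texthero function.

```python
import re
import string

import pandas as pd

def tokenize(s: pd.Series) -> pd.Series:
    # Naive tokenizer: runs of word characters, or single non-space symbols.
    return s.apply(lambda text: re.findall(r"\w+|[^\w\s]", text))

def remove_punctuation_tokens(tokens: pd.Series) -> pd.Series:
    # Works on lists of tokens: drops tokens that are punctuation marks.
    punctuation = set(string.punctuation)
    return tokens.apply(lambda toks: [t for t in toks if t not in punctuation])

s = pd.Series(["Hello, world!", "Tokenize first; clean later."])

tokens = s.pipe(tokenize)                         # every cell is a list of tokens
cleaned = tokens.pipe(remove_punctuation_tokens)  # still a list of tokens per cell

print(cleaned.tolist())
```

In this layout only the tokenizer needs to know how the language splits into words; the later steps just filter or transform tokens.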

Page

aim: learn how to preprocess a text-based dataset with Texthero

content:

clean function: the default way, the option to use when no customization is required
custom pipeline: how to create a custom pipeline -- basically "copy/paste" the clean code and edit the pipeline, see "Preprocessing: explain how to create a custom pipeline" #38
mention/explain that different preprocessing functions exist (cite a few, link them, ...). Not sure we want to explain all of them, this is up to the writer ...
...
(at the end) --> tokenize
- explain what tokenization is about and why it is needed
- output is a TokenSeries (fundamentally different, show an example: every cell is now a list of tokens)
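
As a starting point for the custom pipeline and tokenize sections, here is a minimal sketch of the pattern in plain pandas. `apply_pipeline` and the three steps are stand-ins invented for this example; Texthero's own clean function and default pipeline may differ in detail, and the final whitespace split is only a placeholder for tokenize.

```python
import pandas as pd

# Stand-in steps for this sketch; NOT Texthero's implementations.
def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()

def remove_punctuation(s: pd.Series) -> pd.Series:
    # Replace every character that is neither a word character nor whitespace.
    return s.str.replace(r"[^\w\s]", " ", regex=True)

def remove_whitespace(s: pd.Series) -> pd.Series:
    # Collapse runs of whitespace and trim both ends.
    return s.str.replace(r"\s+", " ", regex=True).str.strip()

def apply_pipeline(s: pd.Series, pipeline) -> pd.Series:
    # Hypothetical helper: apply each step in order, feeding the result forward.
    for step in pipeline:
        s = s.pipe(step)
    return s

# A custom pipeline is just an ordered list of functions; edit it freely.
custom_pipeline = [lowercase, remove_punctuation, remove_whitespace]

s = pd.Series(["Texthero is GREAT!!", "Pre-processing,  made simple."])
cleaned = apply_pipeline(s, custom_pipeline)

# Tokenizing last: each cell goes from a string to a list of tokens.
tokens = cleaned.str.split()

print(cleaned.tolist())
print(tokens.tolist())
```

The cleaned Series still holds one string per cell, while the tokenized one holds one list per cell, which is the difference the tokenize section should show.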


Iota87 · 2020-08-18T18:53:52Z

I am structuring this part as follows (based on a review of similar contexts, including the Texthero "Getting Started" structure):
**Overview/Intro**
Why pre-processing is crucial and what the benefits of a standardized/customizable pipeline are
**Clean**
What it does and how
**Custom Pipeline**
Why and how you should take control of the pre-processing steps
**More details**
Including pre-processing API functionalities

Please let me know if something is not clear or if you have any additional suggestions.

jbesomi added the "documentation" label (Improvements or additions to documentation) Aug 6, 2020

jbesomi mentioned this issue Aug 6, 2020

📝 Documentation next steps: checklist #94

