From a1b1cca1e765a35db21bd822e0adf8ddd76318ba Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Wed, 22 Jul 2020 22:05:02 +0200
Subject: [PATCH 01/10] Update getting_started.md

* removed spelling mistakes from the getting started
* added tokenisation to the document

Co-authored-by: Henri Froese
---
 website/docs/getting-started.md | 103 +++++++++++++++++++++-----------
 1 file changed, 69 insertions(+), 34 deletions(-)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index e2b9419c..284796b9 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -9,13 +9,13 @@ Texthero is a python package to let you work efficiently and quickly with text d

## Overview

-Given a dataset with structured data, it's easy to have a quick understanding of the underline data. Oppositely, given a dataset composed of text-only, it's harder to have a quick undertanding of the data. Texthero help you there, providing utility functions to quickly **clean the text data**, **map it into a vector space** and gather from it **primary insights**.
+Given a dataset with structured data, it's easy to have a quick understanding of the underlying data. Conversely, given a dataset composed only of text, it's harder to get a quick understanding of the data. Texthero helps you there, providing utility functions to quickly **clean the text data**, **tokenize it**, **map it into a vector space** and gather **primary insights** from it.

##### Pandas integration

One of the main pillar of texthero is that is designed from the ground-up to work with **Pandas Dataframe** and **Series**.

-Most of texthero methods, simply apply transformation to Pandas Series. As a rule of thumb, the first argument and the return ouputs of almost all texthero methods are either a Pandas Series or a Pandas DataFrame.
+Most of texthero's methods simply apply a transformation to a Pandas Series. As a rule of thumb, the first argument and the output of almost all texthero methods are either a Pandas Series or a Pandas DataFrame.

##### Pipeline

@@ -46,7 +46,7 @@ The five different areas are _athletics_, _cricket_, _football_, _rugby_ and _te

The original dataset comes as a zip files with five different folder containing the article as text data for each topic.

-For convenience, we createdThis script simply read all text data and store it into a Pandas Dataframe.
+For convenience, we created this script that simply reads all text data and stores it into a Pandas DataFrame.

Import texthero and pandas.

@@ -87,7 +87,7 @@ Recently, Pandas has introduced the pipe function. You can achieve the same resu

df['clean_text'] = df['text'].pipe(hero.clean)
```

-> Tips. When we need to define a new column returned from a function, we prepend the name of the function to the column name. Example: df['tsne_col'] = df['col'].pipe(hero.tsne). This keep the code simple to read and permit to construct complex pipeline.
+> Tip: When we need to define a new column returned from a function, we prepend the name of the function to the column name. Example: df['tsne_col'] = df['col'].pipe(hero.tsne). This keeps the code simple to read and allows us to construct complex pipelines.

The default pipeline for the `clean` method is the following:

@@ -120,46 +120,66 @@ or alternatively

df['clean_text'] = df['clean_text'].pipe(hero.clean, custom_pipeline)
```

+##### Tokenize
+
+Next, we usually want to tokenize the text (_tokenizing_ means splitting sentences/documents into separate words, the _tokens_). Of course, texthero provides an easy function for that!
+
+```python
+df['tokenized_text'] = hero.tokenize(df['clean_text'])
+```
+
+
##### Preprocessing API

-The complete preprocessing API can be found at the following address: [api preprocessing](/docs/api-preprocessing).
+The complete preprocessing API can be found here: [api preprocessing](/docs/api-preprocessing).

### Representation

-Once cleaned the data, the next natural is to map each document into a vector.
+Once the data is cleaned and tokenized, the next natural step is to map each document to a vector so we can compare documents with mathematical methods to derive insights.

##### TFIDF representation

+TFIDF is a formula to calculate the _relative importance_ of the words in a document, taking
+into account the words' occurrences in other documents.

```python
-df['tfidf_clean_text'] = hero.tfidf(df['clean_text'])
+df['tfidf'] = hero.tfidf(df['tokenized_text'])
```

+Now, we have calculated a vector for each document that tells us what words are characteristic for the document.
+Usually, documents about similar topics use similar terms, so their tfidf-vectors will be similar too.
+
##### Dimensionality reduction with PCA

-To visualize the data, we map each point to a two-dimensional representation with PCA. The principal component analysis algorithms returns the combination of attributes that better account the variance in the data.
+We now want to visualize the data. However, the tfidf-vectors are very high-dimensional (e.g. every
+document might have a tfidf-vector of length 100). Visualizing 100 dimensions is hard!
+
+Thus, we perform dimensionality reduction (generating vectors with fewer entries from vectors with
+many entries). For that, we can use PCA. PCA generates new vectors from the tfidf representation
+that showcase the differences among the documents most strongly in fewer dimensions, often 2 or 3.

```python
-df['pca_tfidf_clean_text'] = hero.pca(df['tfidf_clean_text'])
+df['pca'] = hero.pca(df['tfidf'])
```

##### All in one step

-We can achieve all the three steps show above, _cleaning_, _tf-idf representation_ and _dimensionality reduction_ in a single step. Isn't fabulous?
+We can achieve all the steps shown above, _cleaning_, _tokenizing_, _tf-idf representation_ and _dimensionality reduction_ in a single step. Isn't that fabulous?

```python
df['pca'] = (
-   df['text']
-   .pipe(hero.clean)
-   .pipe(hero.tfidf)
-   .pipe(hero.pca)
-   )
+   df['text']
+   .pipe(hero.clean)
+   .pipe(hero.tokenize)
+   .pipe(hero.tfidf)
+   .pipe(hero.pca)
+)
```

##### Representation API

-The complete representation module API can be found at the following address: [api representation](/docs/api-representation).
+The complete representation module API can be found here: [api representation](/docs/api-representation).

### Visualization

@@ -176,32 +196,43 @@ Also, we can "visualize" the most common words for each `topic` with `top_words`

```python
NUM_TOP_WORDS = 5

-df.groupby('topic')['text'].apply(lambda x: hero.top_words(x)[:NUM_TOP_WORDS])
+df.groupby('topic')['clean_text'].apply(lambda x: hero.top_words(x, normalize=True)[:NUM_TOP_WORDS])
```

```
topic
-athletics  said       0.010068
-           world      0.008900
-           year       0.008844
-cricket    test       0.008250
-           england    0.008001
-           first      0.007787
-football   said       0.009515
-           chelsea    0.006110
-           game       0.005950
-rugby      england    0.012602
-           said       0.008359
-           wales      0.007880
-tennis     6          0.021047
-           said       0.013012
-           open       0.009834
+athletics  said       0.010330
+           world      0.009132
+           year       0.009075
+           olympic    0.007819
+           race       0.006392
+cricket    test       0.008492
+           england    0.008235
+           first      0.008016
+           cricket    0.007906
+           one        0.007760
+football   said       0.009709
+           chelsea    0.006234
+           game       0.006071
+           would      0.005866
+           club       0.005601
+rugby      england    0.012833
+           said       0.008512
+           wales      0.008025
+           ireland    0.007440
+           rugby      0.007245
+tennis     said       0.013993
+           open       0.010575
+           first      0.009608
+           set        0.009028
+           year       0.008447
+Name: clean_text, dtype: float64
```

##### Visualization API

-The complete visualization module API can be found at the following address: [api visualization](/docs/api-visualization).
+The complete visualization module API can be found here: [api visualization](/docs/api-visualization).

## Summary

@@ -217,6 +248,7 @@ df = pd.read_csv(

df['pca'] = (
   df['text']
   .pipe(hero.clean)
+  .pipe(hero.tokenize)
   .pipe(hero.tfidf)
   .pipe(hero.pca)
)
@@ -224,8 +256,11 @@ df['pca'] = (

hero.scatterplot(df, col='pca', color='topic', title="PCA BBC Sport news")
```

+![](/img/scatterplot_bccsport.svg)
+
+
##### Next section

By now, you should have understood the main building blocks of texthero.

-In the next sections, we will review each module, see how we can tune the default settings and we will show other application where Texthero might come in handy.
+In the next sections, we will review each module, see how we can tune the default settings, and show other applications where Texthero might come in handy.

From bfbadb1ad3abb4fc7d239f8166c6e9ac9492aea2 Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Fri, 7 Aug 2020 20:24:45 +0200
Subject: [PATCH 02/10] updated getting started

included the Typing in Texthero

Co-authored-by: Henri Froese
---
 website/docs/getting-started.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index 284796b9..cd441053 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -234,6 +234,32 @@

The complete visualization module API can be found here: [api visualization](/docs/api-visualization).

+## Quick look into hero typing
+
+Texthero introduces some different Pandas Series types for its different categories of functions:
+1. __TextSeries__: Every cell is a text, i.e. a string. For example,
+`pd.Series(["test", "test"])` is a valid TextSeries. Those series are the input and output type of the preprocessing functions like `clean`.
+
+2. __TokenSeries__: Every cell is a list of words/tokens, i.e. a list
+of strings. For example, `pd.Series([["test"], ["token2", "token3"]])` is a valid TokenSeries. NLP functions like `tfidf` require a TokenSeries as an input. The function `tokenize` generates a TokenSeries.
+
+3. __VectorSeries__: Every cell is a vector representing text, i.e.
+a list of floats. For example, `pd.Series([[1.0, 2.0], [3.0]])` is a valid VectorSeries. Most dimensionality reduction functions, like `pca`, take a VectorSeries as an input and also return a VectorSeries.
+
+4. __RepresentationSeries__: Series is multiindexed with level one
+being the document, level two being the individual features and their values.
+For example,
+`pd.Series([1, 2, 3], index=pd.MultiIndex.from_tuples([("doc1", "word1"), ("doc1", "word2"), ("doc2", "word1")]))`
+is a valid RepresentationSeries. RepresentationSeries will be the output type from the NLP functions like `count` or `term frequency`
+
+To get more detailed insights into this topic, you can have a look at the typing tutorial. But in general, if you use texthero with the common pipeline:
+- cleaning the Series with functions from the preprocessing module
+- tokenising the Series and then performing NLP functions
+- calculating some clustering
+- reducing the dimensions to display the data
+
+you won't need to worry much about it, as the functions are built so that the corresponding input and output types match. A minimal sketch of this flow follows below.
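+
+For instance (a sketch using only functions introduced above: `clean` with its default pipeline, and `tokenize`):
+
+```python
+import pandas as pd
+import texthero as hero
+
+s = pd.Series(["Texthero is fun!", "It works on Pandas Series."])  # TextSeries
+s = hero.clean(s)     # TextSeries in, TextSeries out
+s = hero.tokenize(s)  # TextSeries in, TokenSeries out
+```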

## Summary

We saw how in just a couple of lines of code we can represent and visualize any text dataset. We went from knowing nothing regarding the dataset to see that there are 5 (quite) distinct areas representig each topic. We went _from zero to hero_.

From dad68b61c7c6460e6f602702e45a471b084ca3b6 Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Sat, 22 Aug 2020 14:33:54 +0200
Subject: [PATCH 03/10] updated getting started to introduce DocumentTermDF
---
 website/docs/getting-started.md | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index cd441053..963e15a1 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -144,12 +144,22 @@ TFIDF is a formula to calculate the _relative importance_ of the words in a docu
into account the words' occurrences in other documents.

```python
-df['tfidf'] = hero.tfidf(df['tokenized_text'])
+df = pd.concat([df, hero.tfidf(df['tokenized_text']])
```

Now, we have calculated a vector for each document that tells us what words are characteristic for the document.
Usually, documents about similar topics use similar terms, so their tfidf-vectors will be similar too.

+###### Usage of concat
+
+Here you will have probably noticed something very odd. We didn't used the
+assignement operator to insert the new created DataFrame. This is due to the
+reason, that for each word in every document we created a new DataFrame column. This makes the insertion operation very expensive and therefore we recommend to use concat instead. Currently just the functions `count`, `term_frequency` and `tfidf` return that kind of DocumentTermDF. To read more about the different PandasTypes, introduced in this library, have a look at this tutorial.
+
+##### Normalisation of the data
+
+It is very important to normalize your data before you start to analyse it. Normalisation helps you to minimise the variance of your dataset, which is necessary to analyse your data further in a meaningful way, as outliers and different ranges of numbers are then "handled". This is just a generalisation, as every clustering and dimensionality reduction algorithm works differently.
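+
+As a small sketch in code (`normalized` is just an illustrative variable name; `hero.normalize` is the function that the pipeline below pipes in after `tfidf`):
+
+```python
+# normalize the tfidf representation before clustering or dimensionality reduction
+normalized = hero.normalize(hero.tfidf(df['tokenized_text']))
+```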

##### Dimensionality reduction with PCA

We now want to visualize the data. However, the tfidf-vectors are very high-dimensional (e.g. every
@@ -173,6 +183,7 @@ df['pca'] = (
   .pipe(hero.clean)
   .pipe(hero.tokenize)
   .pipe(hero.tfidf)
+  .pipe(hero.normalize)
   .pipe(hero.pca)
)
@@ -246,11 +257,11 @@ of strings. For example, `pd.Series([["test"], ["token2", "token3"]])` is a vali

3. __VectorSeries__: Every cell is a vector representing text, i.e.
a list of floats. For example, `pd.Series([[1.0, 2.0], [3.0]])` is a valid VectorSeries. Most dimensionality reduction functions, like `pca`, take a VectorSeries as an input and also return a VectorSeries.

-4. __RepresentationSeries__: Series is multiindexed with level one
-being the document, level two being the individual features and their values.
-For example,
-`pd.Series([1, 2, 3], index=pd.MultiIndex.from_tuples([("doc1", "word1"), ("doc1", "word2"), ("doc2", "word1")]))`
-is a valid RepresentationSeries. RepresentationSeries will be the output type from the NLP functions like `count` or `term frequency`
+4. **DocumentTermDF**: A DataFrame where the rows are the documents and the columns are the words/terms in all the documents. The columns are multiindexed with level one
+   being the content name (e.g. "tfidf"), level two being the individual features and their values.
+   For example,
+   `pd.DataFrame([[1, 2, 3], [4,5,6]], columns=pd.MultiIndex.from_tuples([("count", "hi"), ("count", "servus"), ("count", "hola")]))`
+   is a valid DocumentTermDF.

To get more detailed insights into this topic, you can have a look at the typing tutorial. But in general, if you use texthero with the common pipeline:
- cleaning the Series with functions from the preprocessing module

@@ -262,7 +273,7 @@ you won't need to worry much about it, as the functions are build in the way, th

## Summary

-We saw how in just a couple of lines of code we can represent and visualize any text dataset. We went from knowing nothing regarding the dataset to see that there are 5 (quite) distinct areas representig each topic. We went _from zero to hero_.
+We saw how in just a couple of lines of code we can represent and visualize any text dataset. We went from knowing nothing regarding the dataset to see that there are 5 (quite) distinct areas, each representing a topic. We went _from zero to hero_.

```python
import texthero as hero

From fba085ebb86b016f18190e44502e07e2cf804793 Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Sat, 22 Aug 2020 14:54:44 +0200
Subject: [PATCH 04/10] fixed review
---
 website/docs/getting-started.md | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index 963e15a1..a31471c3 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -144,17 +144,25 @@ TFIDF is a formula to calculate the _relative importance_ of the words in a docu
into account the words' occurrences in other documents.

```python
-df = pd.concat([df, hero.tfidf(df['tokenized_text']])
+df_tfidf = hero.tfidf(df["tokenized_text"])
```

Now, we have calculated a vector for each document that tells us what words are characteristic for the document.
Usually, documents about similar topics use similar terms, so their tfidf-vectors will be similar too.

###### Integration of the calculation into an existing dataframe

-Here you will have probably noticed something very odd. We didn't used the
-assignement operator to insert the new created DataFrame. This is due to the
-reason, that for each word in every document we created a new DataFrame column. This makes the insertion operation very expensive and therefore we recommend to use concat instead. Currently just the functions `count`, `term_frequency` and `tfidf` return that kind of DocumentTermDF. To read more about the different PandasTypes, introduced in this library, have a look at this tutorial.
+The only thing you _can_ but _should not_ do is store a _DocumentTermDF_ in your dataframe, as the performance is really bad. If you really want to, here are the two options:
+ ```python
+ >>> data = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")
+ >>> data_count = data["text"].pipe(hero.count)
+
+ >>> # Option 1: recommended if you really want to put the DocumentTermDF into your DataFrame
+ >>> data = pd.concat([data, data_count], axis=1)
+
+ >>> # Option 2: not recommended as performance is not optimal
+ >>> data["count"] = data_count
+ ```

From 75f15fe75abd403a28cac5ce76f2761d98c08d74 Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Sat, 22 Aug 2020 17:16:54 +0200
Subject: [PATCH 05/10] incorporated suggested changes
---
 website/docs/getting-started.md | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index a31471c3..abbf9c9b 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -152,15 +152,12 @@

-The only thing you _can_ but _should not_ do is store a _DocumentTermDF_ in your dataframe, as the performance is really bad. If you really want to, here are the two options:
+The only thing you _can_ but _should not_ do is store a _DocumentTermDF_ in your dataframe, as the performance is really bad. If you really want to, you can do it like this:
 ```python
 >>> data = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")
 >>> data_count = data["text"].pipe(hero.count)

- >>> # Option 1: recommended if you really want to put the DocumentTermDF into your DataFrame
- >>> data = pd.concat([data, data_count], axis=1)
-
- >>> # Option 2: not recommended as performance is not optimal
+ >>> # NOT recommended as performance is not optimal
 >>> data["count"] = data_count
 ```

From 434dfaa7d5200b6ec8095401fd82720b8a41666b Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Sat, 5 Sep 2020 17:53:37 +0200
Subject: [PATCH 06/10] updated getting started
---
 website/docs/getting-started.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index abbf9c9b..ea7a65f7 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -152,7 +152,7 @@ Usually, documents about similar topics use similar terms, so their tfidf-vector

###### Integration of the calculation into an existing dataframe

-The only thing you _can_ but _should not_ do is store a _DocumentTermDF_ in your dataframe, as the performance is really bad. If you really want to, you can do it like this:
+The only thing you _can_ but _should not_ do is store a _DataFrame_ in your pandas dataframe, as the performance is really bad. If you really want to, you can do it like this:
 ```python
 >>> data = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")
 >>> data_count = data["text"].pipe(hero.count)

 >>> # NOT recommended as performance is not optimal
 >>> data["count"] = data_count
 ```

@@ -262,11 +262,11 @@ of strings. For example, `pd.Series([["test"], ["token2", "token3"]])` is a vali

3. __VectorSeries__: Every cell is a vector representing text, i.e.
a list of floats. For example, `pd.Series([[1.0, 2.0], [3.0]])` is a valid VectorSeries. Most dimensionality reduction functions, like `pca`, take a VectorSeries as an input and also return a VectorSeries.

-4. **DocumentTermDF**: A DataFrame where the rows are the documents and the columns are the words/terms in all the documents. The columns are multiindexed with level one
-   being the content name (e.g. "tfidf"), level two being the individual features and their values.
-   For example,
-   `pd.DataFrame([[1, 2, 3], [4,5,6]], columns=pd.MultiIndex.from_tuples([("count", "hi"), ("count", "servus"), ("count", "hola")]))`
-   is a valid DocumentTermDF.
+4. **DataFrame**: A Pandas DataFrame where the rows can be the documents and the columns can be the words/terms in all the documents.
+This is a return type of the representation functions (count, term_frequency, tfidf).
+For example,
+ `pd.DataFrame([[1, 2, 3], [4,5,6]], columns=["hi", "servus", "hola"])`
+ is a valid DataFrame.

To get more detailed insights into this topic, you can have a look at the typing tutorial. But in general, if you use texthero with the common pipeline:
- cleaning the Series with functions from the preprocessing module

From 1e49df26215d0fae916ef7696d2c798c13688c16 Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Sat, 5 Sep 2020 18:14:23 +0200
Subject: [PATCH 07/10] fix black issues
---
 .travis.yml | 2 +-
 setup.cfg   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.travis.yml b/.travis.yml
index f913f183..c76284b3 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -20,7 +20,7 @@ jobs:
      env: PATH=/c/Python38:/c/Python38/Scripts:$PATH
install:
  - pip3 install --upgrade pip # all three OSes agree about 'pip3'
-  - pip3 install black
+  - pip3 install black==19.10b0
  - pip3 install ".[dev]" .
# 'python' points to Python 2.7 on macOS but points to Python 3.8 on Linux and Windows
# 'python3' is a 'command not found' error on Windows but 'py' works on Windows only

diff --git a/setup.cfg b/setup.cfg
index d6103b02..3f86e7f3 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -41,7 +41,7 @@ install_requires = # TODO pick the correct version.

[options.extras_require]
dev =
-    black>=19.10b0
+    black==19.10b0
    pytest>=4.0.0
    Sphinx>=3.0.3
    sphinx-markdown-builder>=0.5.4

From 80a3b37c04bfadb83bd8625a5069255bbb4de983 Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Mon, 14 Sep 2020 18:06:01 +0200
Subject: [PATCH 08/10] updated getting-started.md to adjust to the newest type changes
---
 website/docs/getting-started.md | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index ea7a65f7..161373db 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -148,18 +148,10 @@ df_tfidf = hero.tfidf(df["tokenized_text"])
```

Now, we have calculated a vector for each document that tells us what words are characteristic for the document.
-Usually, documents about similar topics use similar terms, so their tfidf-vectors will be similar too.
-
-###### Integration of the calculation into an existing dataframe
-
-The only thing you _can_ but _should not_ do is store a _DataFrame_ in your pandas dataframe, as the performance is really bad. If you really want to, you can do it like this:
- ```python
- >>> data = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")
- >>> data_count = data["text"].pipe(hero.count)
-
- >>> # NOT recommended as performance is not optimal
- >>> data["count"] = data_count
- ```
+Usually, documents about similar topics use similar terms, so their tfidf-vectors will be similar too. You probably realised that we didn't
+save the output in our df with `df["tfidf"] = hero.tfidf(df["tokenized_text"])`. The reason is that tfidf returns a DataFrame, and the
+pandas API does not support inserting a DataFrame into an existing one. In the **Quick look into hero typing** section you can see
+which functions return a DataFrame.

##### Normalisation of the data

From 93e15d3c21b94b45d29ed69705cad41745ab46d7 Mon Sep 17 00:00:00 2001
From: Maximilian Krahn
Date: Mon, 14 Sep 2020 18:57:54 +0200
Subject: [PATCH 09/10] incorp suggested changes
---
 website/docs/getting-started.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index 161373db..3484ca18 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -254,7 +254,8 @@ of strings. For example, `pd.Series([["test"], ["token2", "token3"]])` is a vali

3. __VectorSeries__: Every cell is a vector representing text, i.e.
a list of floats. For example, `pd.Series([[1.0, 2.0], [3.0]])` is a valid VectorSeries. Most dimensionality reduction functions, like `pca`, take a VectorSeries as an input and also return a VectorSeries.

-We also return a **DataFrame**, which basically is a Pandas DataFrame. Those DataFrames are used to store some relations. For example `count` will return a DataFrame where the rows are the documents and the columns are be the words/terms in all the documents.
+Some functions also return a Pandas DataFrame. The DataFrame then represents a matrix. For example,
+`count` will return a DataFrame where the rows are the documents and the columns are the words/terms in all the documents.
This is a return type of the representation functions (count, term_frequency, tfidf).
For example,
 `pd.DataFrame([[1, 2, 3], [4,5,6]], columns=["hi", "servus", "hola"])`
 is a valid DataFrame.

From 7459b89bc0afc35518bafbe62355ae558784549e Mon Sep 17 00:00:00 2001
From: Henri Froese
Date: Tue, 22 Sep 2020 19:34:08 +0200
Subject: [PATCH 10/10] Go through getting-started again, fix minor stuff.
---
 website/docs/getting-started.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md
index 3484ca18..788baf27 100644
--- a/website/docs/getting-started.md
+++ b/website/docs/getting-started.md
@@ -111,13 +111,13 @@ from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]

-df['clean_text'] = hero.clean(df['text'], custom_pipeline)
+df['clean_text_custom_pipeline'] = hero.clean(df['text'], custom_pipeline)
```

or alternatively

```python
-df['clean_text'] = df['clean_text'].pipe(hero.clean, custom_pipeline)
+df['clean_text_custom_pipeline'] = df['text'].pipe(hero.clean, custom_pipeline)
```

##### Tokenize

@@ -167,7 +167,7 @@ many entries). For that, we can use PCA. PCA generates new vectors from the tfid
that showcase the differences among the documents most strongly in fewer dimensions, often 2 or 3.

```python
-df['pca'] = hero.pca(df['tfidf'])
+df['pca'] = hero.pca(df_tfidf)
```

##### All in one step
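
For reference, the pipeline this patch series arrives at, in one piece (a sketch; it only combines lines already shown in the patches above, and the dataset URL is the one used in the earlier examples):

```python
import pandas as pd
import texthero as hero

df = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")

df['pca'] = (
    df['text']
    .pipe(hero.clean)      # clean the raw text
    .pipe(hero.tokenize)   # split it into tokens
    .pipe(hero.tfidf)      # map tokens to a tfidf representation
    .pipe(hero.normalize)  # normalize before dimensionality reduction
    .pipe(hero.pca)        # reduce to a few dimensions for plotting
)

hero.scatterplot(df, col='pca', color='topic', title="PCA BBC Sport news")
```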