Texthero is a Python package that lets you work efficiently and quickly with text data.

## Overview

Given a dataset with structured data, it's easy to get a quick understanding of the underlying data. Conversely, given a dataset composed only of text, it's harder to get a quick understanding of the data. Texthero helps you there, providing utility functions to quickly **clean the text data**, **tokenize it**, **map it into a vector space** and gather from it **primary insights**.

##### Pandas integration

One of the main pillars of texthero is that it is designed from the ground up to work with **Pandas DataFrames** and **Series**.

Most of texthero's methods simply apply a transformation to a Pandas Series. As a rule of thumb, the first argument and the output of almost all texthero methods are either a Pandas Series or a Pandas DataFrame.


##### Pipeline
The five different areas are _athletics_, _cricket_, _football_, _rugby_ and _tennis_.

The original dataset comes as a zip file with five different folders containing the articles as text data, one folder for each topic.

For convenience, we created this script, which simply reads all the text data and stores it into a Pandas DataFrame.

Import texthero and pandas.
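
```python
import texthero as hero
import pandas as pd
```

Texthero functions apply directly to a Pandas Series; cleaning the text, for instance, is a single call (assuming `df` is the DataFrame produced by the script above):

```python
df['clean_text'] = hero.clean(df['text'])
```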

Recently, Pandas has introduced the `pipe` function. You can achieve the same result with:

```python
df['clean_text'] = df['text'].pipe(hero.clean)
```

> Tip: when we need to define a new column returned from a function, we prepend the name of the function to the column name. Example: `df['tsne_col'] = df['col'].pipe(hero.tsne)`. This keeps the code simple to read and allows us to construct complex pipelines.

The default pipeline for the `clean` method is the following:
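
1. `fillna` to replace missing values with empty strings
2. `lowercase` to lowercase all text
3. `remove_digits` to remove blocks of digits
4. `remove_punctuation` to remove punctuation symbols
5. `remove_diacritics` to remove accents
6. `remove_stopwords` to remove common stop words
7. `remove_whitespace` to normalize white space between words

(The preprocessing API below is the authoritative reference in case the default steps change.)

We can also pass a custom pipeline as an argument to `clean`. A minimal sketch, where `custom_pipeline` is just our own list of functions from the preprocessing module:

```python
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)
```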

or alternatively

```python
df['clean_text'] = df['clean_text'].pipe(hero.clean, custom_pipeline)
```

##### Tokenize

Next, we usually want to tokenize the text (_tokenizing_ means splitting sentences/documents into separate words, the _tokens_). Of course, texthero provides an easy function for that!

```python
df['tokenized_text'] = hero.tokenize(df['clean_text'])
```
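
For instance, the cell `"game of thrones"` becomes the token list `["game", "of", "thrones"]`.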


##### Preprocessing API

The complete preprocessing API can be found here: [api preprocessing](/docs/api-preprocessing).


### Representation

Once the data is cleaned and tokenized, the next natural step is to map each document to a vector so we can compare documents with mathematical methods to derive insights.

##### TFIDF representation

TFIDF is a formula to calculate the _relative importance_ of the words in a document, taking
into account the words' occurrences in other documents.
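
In its most common textbook form (actual implementations usually add smoothing and normalisation constants) it reads:

$$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}$$

where $\text{tf}(t, d)$ counts how often term $t$ occurs in document $d$, $\text{df}(t)$ counts the documents containing $t$, and $N$ is the total number of documents.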

```python
df = pd.concat([df, hero.tfidf(df['tokenized_text'])], axis=1)
```

Now, we have calculated a vector for each document that tells us what words are characteristic for the document.
Usually, documents about similar topics use similar terms, so their tfidf-vectors will be similar too.
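
We can verify this with a quick sketch, assuming scikit-learn is installed and using the `tfidf` columns created above:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Rows are documents; articles on the same topic should score close to 1.
similarities = cosine_similarity(df['tfidf'].fillna(0))
print(similarities[0, :5])  # the first document compared to the first five
```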

###### Usage of concat

Here you will probably have noticed something odd: we didn't use the assignment operator to insert the newly created DataFrame. The reason is that for each word in every document we created a new DataFrame column, which makes the insertion operation very expensive; we therefore recommend using `pd.concat` instead. Currently only the functions `count`, `term_frequency` and `tfidf` return that kind of DocumentTermDF. To read more about the different Pandas types introduced in this library, have a look at this tutorial.
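
Thanks to the multiindexed columns (see the typing section below), the whole representation can still be selected as a single sub-DataFrame:

```python
tfidf_df = df['tfidf']  # one column per term, one row per document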

##### Normalisation of the data

It is very important to normalise your data before you start to analyse it. Normalisation brings the values of your dataset onto a comparable scale, which is necessary to analyse the data further in a meaningful way, as outliers and different ranges of numbers are then "handled". This is just a generalisation, as every clustering and dimensionality reduction algorithm works differently.
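
For example, to normalise the tfidf representation from above before clustering or dimensionality reduction, a minimal sketch using the same `normalize` function that appears in the pipeline below:

```python
df_tfidf_normalized = df['tfidf'].pipe(hero.normalize)
```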

##### Dimensionality reduction with PCA

We now want to visualize the data. However, the tfidf-vectors are very high-dimensional (e.g. every
document might have a tfidf-vector of length 100). Visualizing 100 dimensions is hard!

Thus, we perform dimensionality reduction (generating vectors with fewer entries from vectors with
many entries). For that, we can use PCA. PCA generates new vectors from the tfidf representation
that showcase the differences among the documents most strongly in fewer dimensions, often 2 or 3.

```python
df['pca'] = hero.pca(df['tfidf'])
```

##### All in one step

We can achieve all the steps shown above, _cleaning_, _tokenizing_, _tf-idf representation_, _normalisation_ and _dimensionality reduction_, in a single step. Isn't that fabulous?

```python
df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tokenize)
    .pipe(hero.tfidf)
    .pipe(hero.normalize)
    .pipe(hero.pca)
)
```

##### Representation API

The complete representation module API can be found here: [api representation](/docs/api-representation).

### Visualization
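
We can plot every document as a point in the two-dimensional `pca` space, colour-coded by its topic, with `scatterplot` (the same call appears in the summary at the end of this page):

```python
hero.scatterplot(df, col='pca', color='topic', title="PCA BBC Sport news")
```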

Also, we can "visualize" the most common words for each `topic` with `top_words`:

```python
NUM_TOP_WORDS = 5
df.groupby('topic')['clean_text'].apply(lambda x: hero.top_words(x, normalize=True)[:NUM_TOP_WORDS])
```

```
topic
athletics  said       0.010330
           world      0.009132
           year       0.009075
           olympic    0.007819
           race       0.006392
cricket    test       0.008492
           england    0.008235
           first      0.008016
           cricket    0.007906
           one        0.007760
football   said       0.009709
           chelsea    0.006234
           game       0.006071
           would      0.005866
           club       0.005601
rugby      england    0.012833
           said       0.008512
           wales      0.008025
           ireland    0.007440
           rugby      0.007245
tennis     said       0.013993
           open       0.010575
           first      0.009608
           set        0.009028
           year       0.008447
Name: clean_text, dtype: float64
```


##### Visualization API

The complete visualization module API can be found here: [api visualization](/docs/api-visualization).

## Quick look into hero typing

Texthero introduces some different Pandas Series types for its different categories of functions:
1. __TextSeries__: Every cell is a text, i.e. a string. For example,
`pd.Series(["test", "test"])` is a valid TextSeries. These Series are the input and output type of the preprocessing functions like `clean`.

2. __TokenSeries__: Every cell is a list of words/tokens, i.e. a list
of strings. For example, `pd.Series([["test"], ["token2", "token3"]])` is a valid TokenSeries. The NLP functions like `tfidf` require a TokenSeries as input. The function `tokenize` generates a TokenSeries.

3. __VectorSeries__: Every cell is a vector representing text, i.e.
a list of floats. For example, `pd.Series([[1.0, 2.0], [3.0]])` is a valid VectorSeries. Most dimensionality reduction functions, like `pca`, take a VectorSeries as input and also return a VectorSeries.

4. **DocumentTermDF**: A DataFrame where the rows are the documents and the columns are the words/terms in all the documents. The columns are multiindexed: level one is the representation name (e.g. "tfidf"), level two the individual terms. For example,
`pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=pd.MultiIndex.from_tuples([("count", "hi"), ("count", "servus"), ("count", "hola")]))`
is a valid DocumentTermDF.

To get more detailed insights into this topic, you can have a look at the typing tutorial. But in general, if you use texthero with the common pipeline:
- cleaning the Series with functions from the preprocessing module
- tokenising the Series and then applying the NLP functions
- calculating some clustering
- reducing the dimensions to display the data

you won't need to worry much about it, as the functions are built so that the corresponding input and output types match.
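
As a rough sketch, this is how the types flow through the pipeline used on this page (the comments name the hero type at each step):

```python
s = df['text']         # TextSeries
s = hero.clean(s)      # TextSeries -> TextSeries
s = hero.tokenize(s)   # TextSeries -> TokenSeries
rep = hero.tfidf(s)    # TokenSeries -> DocumentTermDF
vec = hero.pca(rep)    # DocumentTermDF -> VectorSeries
```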

## Summary

We saw how in just a couple of lines of code we can represent and visualize any text dataset. We went from knowing nothing about the dataset to seeing that there are 5 (quite) distinct areas, each representing a topic. We went _from zero to hero_.

```python
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "bbcsport.csv"  # placeholder path: the BBC Sport dataset used above
)
df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tokenize)
    .pipe(hero.tfidf)
    .pipe(hero.normalize)
    .pipe(hero.pca)
)

hero.scatterplot(df, col='pca', color='topic', title="PCA BBC Sport news")
```

![](/img/scatterplot_bccsport.svg)


##### Next section

By now, you should have understood the main building blocks of texthero.

In the next sections, we will review each module, see how we can tune the default settings, and show other applications where Texthero might come in handy.