## Texthero
Under the hood, Texthero builds on several NLP and machine learning toolkits, including Gensim, NLTK, spaCy and scikit-learn. You don't need to install them separately; pip installs them as dependencies.

Texthero includes tools for:

* Text preprocessing: it offers out-of-the-box solutions and is also flexible enough for custom ones.
* Natural language processing: keyphrase and keyword extraction, and named entity recognition.
* Text representation: TF-IDF, term frequency, and custom word embeddings (wip).
* Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modelling (wip) and interpretation.
* Text visualization: vector space visualization and place localization on maps (wip).

Supported representation algorithms:

* Term frequency (count)
* Term frequency-inverse document frequency (tfidf)
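
To make the difference between the two representations concrete, here is a minimal pure-Python sketch. It is not Texthero's implementation: the function names `term_frequency` and `tfidf` below are local helpers written for illustration, the tokenizer is a plain whitespace split, and the smoothed IDF formula is one common variant chosen as an assumption.

```python
import math
from collections import Counter

def term_frequency(docs):
    """Return one Counter of raw term counts per document."""
    return [Counter(doc.lower().split()) for doc in docs]

def tfidf(docs):
    """Weight each term count by a smoothed inverse document frequency."""
    counts = term_frequency(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        weights.append({
            # Assumed smoothing: idf = ln((1 + n) / (1 + df)) + 1
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        })
    return weights

docs = ["the match was won", "the match was lost", "rain stopped play"]
counts = term_frequency(docs)
weights = tfidf(docs)
print(counts[0])   # raw counts for the first document
print(weights[0])  # terms shared across documents get lower weights
```

Terms that occur in fewer documents ("won") end up with a higher weight than terms shared across documents ("match"), which is the point of the inverse document frequency factor.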

Supported clustering algorithms:

* K-means (kmeans)
* Density-Based Spatial Clustering of Applications with Noise (dbscan)
* Meanshift (meanshift)
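
As a rough illustration of what K-means does, the sketch below implements the basic assign-and-update loop (Lloyd's algorithm) in plain Python. It is an illustrative assumption, not the code Texthero runs. The local `kmeans` function uses a deterministic, evenly spaced initialization to keep the example reproducible, whereas library implementations typically use random or k-means++ initialization.

```python
def kmeans(points, n_clusters, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    # Simplified deterministic initialization: evenly spaced input points.
    centroids = [points[i * len(points) // n_clusters] for i in range(n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(n_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, n_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```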

Supported dimensionality reduction algorithms:

* Principal component analysis (pca)
* t-distributed stochastic neighbor embedding (tsne)
* Non-negative matrix factorization (nmf)
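
To give an intuition for PCA, the following sketch estimates the first principal direction of a small dataset using power iteration on the covariance matrix, then projects the rows onto it. This is a teaching sketch under stated assumptions: `first_principal_component` is a hypothetical helper, and production libraries generally compute PCA with an SVD and return several components.

```python
def first_principal_component(rows, n_iter=100):
    """Estimate the top principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project every centered row onto the direction (1-D coordinates).
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Points lying close to the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
print(direction)  # roughly proportional to (1, 2)
print(scores)     # one coordinate per row along that direction
```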

## Steps

* Activate the conda environment: conda activate texthero
* Install the kernel package: pip install ipykernel
* Add the kernel to Jupyter: python -m ipykernel install --user --name=texthero --display-name "Python (texthero)"

In [1]:
import texthero
help(texthero)

ModuleNotFoundError: No module named 'texthero'
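
The ModuleNotFoundError above typically means that the kernel running the notebook uses a Python environment where texthero is not installed. A small standard-library check like the one below can help confirm which interpreter the kernel uses and whether it can find the package. The helper name `is_installed` is made up for this example.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if the running interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None

print(sys.executable)            # interpreter behind the current kernel
print(is_installed("texthero"))  # False is consistent with the error above
```

If the printed interpreter path does not belong to the texthero environment, selecting the "Python (texthero)" kernel in Jupyter is the likely fix.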

## Text Preprocessing

In [None]:
import pandas as pd
text = "It's a pleasant   day at Bangaloré; at / (10:30) am"
series = pd.Series(text)

In [None]:
series

In [None]:
import texthero as hero

hero.remove_digits(series)

In [None]:
# Remove punctuation
hero.remove_punctuation(series)

In [None]:
# Remove brackets
hero.remove_brackets(series)

In [None]:
hero.remove_diacritics(series)

In [None]:
hero.remove_whitespace(series)

In [None]:
# Remove stopwords
hero.remove_stopwords(series)

In [None]:
hero.clean(series)
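
For readers who want to see roughly what these preprocessing steps do to the sample sentence, here is a standard-library approximation. It is only a sketch: the functions below are local stand-ins that borrow Texthero's names and are not its implementation, stopword removal is left out, and the output of `hero.clean` may differ.

```python
import re
import string
import unicodedata

text = "It's a pleasant   day at Bangaloré; at / (10:30) am"

def remove_digits(s):
    # Replace runs of digits with a space.
    return re.sub(r"\d+", " ", s)

def remove_punctuation(s):
    # Replace every ASCII punctuation character with a space.
    return re.sub(f"[{re.escape(string.punctuation)}]", " ", s)

def remove_diacritics(s):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def remove_whitespace(s):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(s.split())

cleaned = text.lower()
for step in (remove_digits, remove_punctuation, remove_diacritics, remove_whitespace):
    cleaned = step(cleaned)
print(cleaned)  # it s a pleasant day at bangalore at am
```

Chaining small functions this way mirrors how the individual calls in the cells above can be combined into one cleaning pass.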

In [None]:
df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
df.head()

In [None]:
# PCA
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)  # vectorize the cleaned text
    .pipe(hero.pca)
)
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

In [None]:
df.head()

In [None]:
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)
# K-means

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

df['pca'] = df['tfidf'].pipe(hero.pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")