## Texthero

Under the hood, Texthero builds on several NLP and machine learning toolkits, including Gensim, NLTK, spaCy and scikit-learn. You don't need to install them separately; pip installs them as dependencies.

Texthero includes tools for:

- Text preprocessing: it offers out-of-the-box solutions and is also flexible enough for custom ones.
- Natural language processing: keyphrase and keyword extraction, and named entity recognition.
- Text representation: TF-IDF, term frequency, and custom word embeddings (WIP).
- Vector space analysis: clustering (K-means, Meanshift, DBSCAN and hierarchical), topic modelling (WIP) and interpretation.
- Text visualization: vector space visualization and place localization on maps (WIP).

Supported representation algorithms:

- Term frequency (count)
- Term frequency-inverse document frequency (tfidf)

Supported clustering algorithms:

- K-means (kmeans)
- Density-Based Spatial Clustering of Applications with Noise (dbscan)
- Meanshift (meanshift)

Supported dimensionality reduction algorithms:

- Principal component analysis (pca)
- t-distributed stochastic neighbor embedding (tsne)
- Non-negative matrix factorization (nmf)
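
Of the algorithms listed above, the two representation schemes are simple enough to sketch with the standard library alone. The block below is an illustration, not Texthero's implementation: `term_frequency` and `tfidf` are hypothetical helper names, tokenization is plain whitespace splitting, and the smoothed IDF formula is the one scikit-learn uses by default (other definitions exist).

```python
import math
from collections import Counter


def term_frequency(docs):
    """Count how often each whitespace-separated token occurs per document."""
    return [Counter(doc.split()) for doc in docs]


def tfidf(docs):
    """Weight raw counts by a smoothed inverse document frequency.

    Assumed formula: idf(t) = ln((1 + n) / (1 + df(t))) + 1,
    where n is the number of documents and df(t) the number of
    documents that contain token t.
    """
    counts = term_frequency(docs)
    n_docs = len(docs)
    # Iterating over a Counter yields each distinct token once per document.
    doc_freq = Counter(token for c in counts for token in c)
    return [
        {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
         for t, tf in c.items()}
        for c in counts
    ]


docs = ["goal goal match", "match record", "record sprint sprint"]
counts = term_frequency(docs)
weights = tfidf(docs)
```

In this toy corpus, "goal" appears twice in the first document and nowhere else, so it ends up with a higher weight there than "match", which is shared with the second document.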

In [45]:
!pip install texthero



In [46]:
import texthero
help(texthero)

Help on package texthero:

NAME
    texthero - Texthero: python toolkit for text preprocessing, representation and visualization.

PACKAGE CONTENTS
    extend_pandas
    nlp
    preprocessing
    representation
    stop_words
    stopwords
    visualization

DATA
    Callable = typing.Callable
    List = typing.List
    Optional = typing.Optional
    Set = typing.Set

FILE
    c:\users\win10\anaconda3\envs\myenv\lib\site-packages\texthero\__init__.py




#### Text Preprocessing

In [48]:
import pandas as pd
text = "It's a pleasant   day at Bangaloré; at / (10:30) am"
series = pd.Series(text)

In [49]:
series

0    It's a pleasant   day at Bangaloré; at / (10:3...
dtype: object

In [51]:
import texthero as hero

hero.remove_digits(series)

0    It's a pleasant   day at Bangaloré; at / ( : ) am
dtype: object
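
The output above suggests that each standalone block of digits is replaced by a single space. A rough standard-library sketch of that behavior follows; it is an approximation rather than Texthero's actual code, and `remove_digit_blocks` is a hypothetical helper name.

```python
import re


def remove_digit_blocks(text, replacement=" "):
    """Replace each standalone run of digits with `replacement`.

    The word boundaries (\\b) mean digits glued to letters, as in
    "room42", are left untouched.
    """
    return re.sub(r"\b\d+\b", replacement, text)


text = "It's a pleasant   day at Bangaloré; at / (10:30) am"
cleaned = remove_digit_blocks(text)
unchanged = remove_digit_blocks("room42")
```

Here `cleaned` ends with `( : ) am`, as in the cell output, while `unchanged` is still `"room42"`.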

In [53]:
# Remove punctuation
hero.remove_punctuation(series)

0    It s a pleasant   day at Bangaloré  at    10 3...
dtype: object

In [55]:
# Remove brackets and their contents
hero.remove_brackets(series)

0    It's a pleasant   day at Bangaloré; at /  am
dtype: object

In [56]:
hero.remove_diacritics(series)

0    It's a pleasant   day at Bangalore; at / (10:3...
dtype: object
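
Stripping diacritics is commonly done by decomposing each character into a base letter plus combining marks and then dropping the marks. The sketch below shows that general technique with `unicodedata`; whether Texthero does exactly this internally is an assumption, and `strip_diacritics` is a hypothetical name.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after Unicode compatibility decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_diacritics("Bangaloré")
other = strip_diacritics("São Paulo")
```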

In [57]:
hero.remove_whitespace(series)

0    It's a pleasant day at Bangaloré; at / (10:30) am
dtype: object

In [58]:
# Remove stop words
hero.remove_stopwords(series)

0    It'  pleasant   day  Bangaloré;  / (10:30) 
dtype: object

In [59]:
hero.clean(series)

0    pleasant day bangalore
dtype: object
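
The result above is what you would expect from chaining the individual steps shown earlier. The following standard-library sketch composes simplified stand-ins for those steps; the step order, the helper names and the tiny `STOP_WORDS` set are all assumptions made for illustration, not Texthero's real pipeline or word list.

```python
import re
import string
import unicodedata
from functools import reduce

# Hypothetical, deliberately tiny stop-word list; real lists are far larger.
STOP_WORDS = {"it", "s", "a", "at", "am"}


def lowercase(text):
    return text.lower()


def remove_digit_blocks(text):
    return re.sub(r"\b\d+\b", " ", text)


def remove_punctuation(text):
    # Replace every ASCII punctuation character with a space.
    return re.sub(f"[{re.escape(string.punctuation)}]", " ", text)


def remove_diacritics(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def collapse_whitespace(text):
    return " ".join(text.split())


PIPELINE = [
    lowercase,
    remove_digit_blocks,
    remove_punctuation,
    remove_diacritics,
    remove_stop_words,
    collapse_whitespace,
]


def clean(text, pipeline=PIPELINE):
    """Apply each step in order, feeding one step's output to the next."""
    return reduce(lambda acc, step: step(acc), pipeline, text)


result = clean("It's a pleasant   day at Bangaloré; at / (10:30) am")
```

Keeping the pipeline as a plain list of functions makes it easy to drop, reorder or add steps for a custom cleaning routine.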

In [60]:
df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
df.head()

Unnamed: 0,text,topic
0,Claxton hunting first major medal\n\nBritish h...,athletics
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics
2,Greene sets sights on world title\n\nMaurice G...,athletics
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics


In [62]:
# PCA on TF-IDF vectors
import texthero as hero
import pandas as pd

df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['pca'] = (
   df['text']
   .pipe(hero.clean)
   .pipe(hero.tfidf)  # vectorize with TF-IDF
   .pipe(hero.pca)
)
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")


The default value of regex will change from True to False in a future version.





In [63]:
df.head()

Unnamed: 0,text,topic,pca
0,Claxton hunting first major medal\n\nBritish h...,athletics,"[-0.09109912281144122, 0.10359351265238617]"
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,"[-0.00036132812221576415, 0.02478045501220412]"
2,Greene sets sights on world title\n\nMaurice G...,athletics,"[-0.11760496196780282, 0.12860286068425186]"
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,"[-0.09134845338902024, 0.15398002814497108]"
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,"[-0.0912957165291783, 0.13507109027104225]"


In [2]:
df.head()

Unnamed: 0,text,topic,pca
0,Claxton hunting first major medal\n\nBritish h...,athletics,"[-0.09107819053003914, 0.10357210282741101]"
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,"[-0.0002786547625436705, 0.02477621330944455]"
2,Greene sets sights on world title\n\nMaurice G...,athletics,"[-0.11765703516162962, 0.12865601739827767]"
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,"[-0.09131528756100005, 0.15397654273191513]"
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,"[-0.0912807640539468, 0.13510622701055774]"


In [65]:
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)
# K-means clustering on the TF-IDF vectors

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

df['pca'] = df['tfidf'].pipe(hero.pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")


The default value of regex will change from True to False in a future version.




'precompute_distances' was deprecated in version 0.23 and will be removed in 1.0 (renaming of 0.25). It has no effect


'n_jobs' was deprecated in version 0.23 and will be removed in 1.0 (renaming of 0.25).
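
To make the clustering step less of a black box, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not what `hero.kmeans` runs internally: the function name, the caller-supplied initial centroids and the fixed iteration count are simplifying assumptions chosen to keep the example deterministic.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = list(centroids)  # avoid mutating the caller's list
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
labels, centers = kmeans(points, centroids=[points[0], points[2]])
```

With two well-separated groups of points, the first two points share one label and the last two share the other.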

