<a href="https://colab.research.google.com/github/gulabpatel/NLP_Basics/blob/main/Part%208.1%3A%20TextHero_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TextHero

Under the hood, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, spaCy and scikit-learn. You don't need to install them separately; pip installs them as dependencies of Texthero.

Texthero includes tools for:

- Text preprocessing: out-of-the-box solutions that can also be customized.
- Natural Language Processing: keyphrase and keyword extraction, and named entity recognition.
- Text representation: TF-IDF, term frequency, and custom word embeddings (wip).
- Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modelling (wip) and interpretation.
- Text visualization: vector space visualization, place localization on maps (wip).

Supported representation algorithms:

- Term frequency (count)
- Term frequency-inverse document frequency (tfidf)

Supported clustering algorithms:

- K-means (kmeans)
- Density-Based Spatial Clustering of Applications with Noise (dbscan)
- Meanshift (meanshift)

Supported dimensionality reduction algorithms:

- Principal component analysis (pca)
- t-distributed stochastic neighbor embedding (tsne)
- Non-negative matrix factorization (nmf)
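
To make the two representation options concrete, here is a minimal standard-library sketch of term frequency and TF-IDF on a toy corpus. It is not Texthero's implementation: it uses the textbook formula idf = log(N / df), whereas library implementations typically add smoothing and normalization, so the numbers will differ. The corpus and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already lowercased; tokens are split on whitespace.
docs = [
    "chelsea win the match",
    "federer win the final",
    "the match was delayed",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(tokens):
    """Raw count of each term in one document."""
    return Counter(tokens)


def inverse_document_frequency(corpus):
    """Textbook IDF: log(N / df), with df the number of documents containing the term."""
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(corpus):
    """One {term: weight} dictionary per document."""
    idf = inverse_document_frequency(corpus)
    return [
        {term: count * idf[term] for term, count in term_frequency(tokens).items()}
        for tokens in corpus
    ]


weights = tfidf(tokenized)
for doc, w in zip(docs, weights):
    print(doc, "->", {term: round(value, 3) for term, value in w.items()})
```

A term that appears in every document, such as "the" in this corpus, gets weight 0, which is how TF-IDF downplays words that do not help tell documents apart.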

In [1]:
!pip install texthero



In [2]:
import texthero
help(texthero)

Help on package texthero:

NAME
    texthero - Texthero: python toolkit for text preprocessing, representation and visualization.

PACKAGE CONTENTS
    extend_pandas
    nlp
    preprocessing
    representation
    stop_words
    stopwords
    visualization

DATA
    Callable = typing.Callable
    List = typing.List
    Optional = typing.Optional
    Set = typing.Set

FILE
    /usr/local/lib/python3.7/dist-packages/texthero/__init__.py




#### Text Preprocessing

In [3]:
import pandas as pd
text="It's a pleasant   day at Bangaloré; at / (10:30) am"
series=pd.Series(text)
series

0    It's a pleasant   day at Bangaloré; at / (10:3...
dtype: object

In [4]:
import texthero as hero

In [5]:
#### Remove digits
hero.remove_digits(series)

0    It's a pleasant   day at Bangaloré; at / ( : ) am
dtype: object

In [6]:
#### Remove punctuations
hero.remove_punctuation(series)

0    It s a pleasant   day at Bangaloré  at    10 3...
dtype: object

In [7]:
#### Remove Brackets
hero.remove_brackets(series)

0    It's a pleasant   day at Bangaloré; at /  am
dtype: object

In [8]:
#### Remove diacritics
hero.remove_diacritics(series)

0    It's a pleasant   day at Bangalore; at / (10:3...
dtype: object

In [9]:
#### Remove whitespace
hero.remove_whitespace(series)

0    It's a pleasant day at Bangaloré; at / (10:30) am
dtype: object

In [10]:
#### Remove Stopwords
hero.remove_stopwords(series)

0    It'  pleasant   day  Bangaloré;  / (10:30) 
dtype: object

In [11]:
hero.clean(series)

0    pleasant day bangalore
dtype: object
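
The individual steps above are simple string transformations. The sketch below approximates four of them (digits, punctuation, diacritics, whitespace) with the standard library to show the idea. It is not Texthero's code, the function names are hypothetical, and Texthero's own output can differ in detail.

```python
import re
import string
import unicodedata


def strip_digits(text):
    # Replace each run of digits with a single space.
    return re.sub(r"\d+", " ", text)


def strip_punctuation(text):
    # Replace every ASCII punctuation character with a space.
    return re.sub(f"[{re.escape(string.punctuation)}]", " ", text)


def strip_diacritics(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collapse_whitespace(text):
    # Collapse runs of whitespace into one space and trim both ends.
    return " ".join(text.split())


text = "It's a pleasant   day at Bangaloré; at / (10:30) am"
cleaned = text.lower()
for step in (strip_digits, strip_punctuation, strip_diacritics, collapse_whitespace):
    cleaned = step(cleaned)

print(cleaned)  # it s a pleasant day at bangalore at am
```

The result is longer than the output of hero.clean shown above because this sketch does not remove stopwords. Each function takes a plain string, so it can be applied to a pandas Series with series.apply.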

In [12]:
df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
df.sample(10)

Unnamed: 0,text,topic
607,Marshall set for Leeds move\n\nAll Blacks scru...,rugby
306,Keegan hails comeback king Fowler\n\nMancheste...,football
299,FA decides not to punish Mourinho\n\nThe Footb...,football
135,West Indies opener Rae mourned\n\nFormer West ...,cricket
194,Sri Lankans cleared of misconduct\n\nTwo Sri L...,cricket
655,Venus stunned by Farina Elia\n\nVenus Williams...,tennis
391,McClaren eyes Uefa Cup top spot\n\nSteve McCla...,football
149,Wilson back in Kiwi cricket squad\n\nFormer Al...,cricket
477,Chelsea 3-0 Portsmouth\n\nDidier Drogba scored...,football
464,Eriksson warned on Cole comments\n\nPremier Le...,football


In [13]:
###PCA

df['pca'] = (
   df['text']
   .pipe(hero.clean)
   .pipe(hero.tfidf)  # vectorize with TF-IDF
   .pipe(hero.pca)
)
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

We can also list the most common words for each topic with top_words.

In [14]:
NUM_TOP_WORDS = 5
df.groupby('topic')['text'].apply(lambda x: hero.top_words(x)[:NUM_TOP_WORDS])

topic         
athletics  the    1731
           in      885
           to      882
           and     665
           a       613
cricket    the    2246
           to     1309
           a      1045
           in     1007
           and     988
football   the    4516
           to     2641
           a      2061
           and    1904
           in     1580
rugby      the    2770
           to     1393
           a      1210
           and    1129
           in     1093
tennis     the    1527
           to      826
           in      706
           a       587
           and     573
Name: text, dtype: int64
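
The counts above are dominated by stopwords because top_words was applied to the raw text column. The standard-library sketch below shows the effect of filtering stopwords before counting. The toy sentences, the tiny stopword set and the helper name are hypothetical; in the notebook, the same effect would come from cleaning the text column before counting.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"the", "to", "a", "in", "and", "of"}


def most_common_words(texts, n=3, stopwords=frozenset()):
    """Return the n most frequent lowercase tokens, skipping stopwords."""
    counts = Counter(
        token
        for text in texts
        for token in text.lower().split()
        if token not in stopwords
    )
    return counts.most_common(n)


texts = [
    "the striker scored in the final and the crowd cheered",
    "the striker moved to a new club in the summer",
]

print(most_common_words(texts))                       # "the" leads the raw counts
print(most_common_words(texts, stopwords=STOPWORDS))  # "striker" leads once stopwords are gone
```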

In [15]:
df.head()

Unnamed: 0,text,topic,pca
0,Claxton hunting first major medal\n\nBritish h...,athletics,"[-0.09104989057576593, 0.10347155373429032]"
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,"[-0.00045037197177923686, 0.02481619299726644]"
2,Greene sets sights on world title\n\nMaurice G...,athletics,"[-0.11761165386741548, 0.12877483603725004]"
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,"[-0.09137916008742447, 0.15409369238921933]"
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,"[-0.09130212553792584, 0.1350658747907763]"


In [19]:
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)
### Kmeans

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

df['pca'] = df['tfidf'].pipe(hero.pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")
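
Since the dataset already carries a topic label, one way to interpret the clusters is to cross-tabulate cluster labels against topics. The sketch below uses pandas.crosstab on a small hypothetical frame that mimics the topic and kmeans_labels columns; the values are invented and do not come from the BBC Sport run.

```python
import pandas as pd

# Hypothetical stand-in for df[['topic', 'kmeans_labels']]; the values are invented.
toy = pd.DataFrame(
    {
        "topic": ["cricket", "cricket", "football", "football", "tennis", "tennis"],
        "kmeans_labels": ["0", "0", "1", "1", "2", "1"],
    }
)

# Rows are the known topics, columns are cluster labels, cells are document counts.
table = pd.crosstab(toy["topic"], toy["kmeans_labels"])
print(table)
```

On the real data the equivalent call would be pd.crosstab(df['topic'], df['kmeans_labels']). A cluster whose column is concentrated in a single row corresponds closely to one topic, while a column spread over several rows indicates a cluster that mixes topics.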