<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/texthero_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [52]:
!pip install texthero -U -q

In [6]:
import pandas as pd

In [53]:
import texthero as hero

In [54]:
pd.options.plotting.backend = "plotly"

# Download a dataset

In [7]:
df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

In [56]:
df

Unnamed: 0,text,topic
0,Claxton hunting first major medal\n\nBritish h...,athletics
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics
2,Greene sets sights on world title\n\nMaurice G...,athletics
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics
...,...,...
732,Agassi into second round in Dubai\n\nFourth se...,tennis
733,Mauresmo fights back to win title\n\nWorld num...,tennis
734,Federer wins title in Rotterdam\n\nWorld numbe...,tennis
735,GB players warned over security\n\nBritain's D...,tennis


In [57]:
topic_counts = df['topic'].value_counts()
topic_counts.plot.bar()

> Feel free to change `TOPIC` to any of `football`, `rugby`, `cricket`, `athletics`, or `tennis`, and rerun the cell.

In [58]:
TOPIC = 'tennis'
filt = df['topic'] == TOPIC
sample_text = df[filt].sample(1).iloc[0, 0]
sample_text

'Federer claims Dubai crown\n\nWorld number one Roger Federer added the Dubai Championship trophy to his long list of successes - but not before he was given a test by Ivan Ljubicic.\n\nTop seed Federer looked to be on course for a easy victory when he thumped the eighth seed 6-1 in the first set. But Ljubicic, who beat Tim Henman in the last eight, dug deep to secure the second set after a tense tiebreak. Swiss star Federer was not about to lose his cool, though, turning on the style to win the deciding set 6-3. The match was a re-run of last week\'s final at the World Indoor Tournament in Rotterdam, where Federer triumphed, but not until Ljubicic had stretched him for five sets. "I really wanted to get off to a good start this time, and I did, and I could really play with confidence while he still looking for his rhythm," Federer said.\n\n"That took me all the way through to 6-1 3-1 0-30 on his serve and I almost ran away with it. But he came back, and that was a good effort on his s

# Clean texts

In [59]:
df['clean_text'] = hero.clean(df['text'])
df

Unnamed: 0,text,topic,clean_text
0,Claxton hunting first major medal\n\nBritish h...,athletics,claxton hunting first major medal british hurd...
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,sullivan could run worlds sonia sullivan indic...
2,Greene sets sights on world title\n\nMaurice G...,athletics,greene sets sights world title maurice greene ...
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,iaaf launches fight drugs iaaf athletics world...
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,dibaba breaks 000m world record ethiopia tirun...
...,...,...,...
732,Agassi into second round in Dubai\n\nFourth se...,tennis,agassi second round dubai fourth seed andre ag...
733,Mauresmo fights back to win title\n\nWorld num...,tennis,mauresmo fights back win title world number tw...
734,Federer wins title in Rotterdam\n\nWorld numbe...,tennis,federer wins title rotterdam world number one ...
735,GB players warned over security\n\nBritain's D...,tennis,gb players warned security britain davis cup p...


> Feel free to change `TOPIC` to any of `football`, `rugby`, `cricket`, `athletics`, or `tennis`, and rerun the cell.

In [60]:
TOPIC = 'tennis'
filt = df['topic'] == TOPIC
sample = df[filt].sample(1)
raw_text = sample.iloc[0, 0]
clean_text = sample.iloc[0, 2]
print('Before cleaning >>>')
print(raw_text)
print('After cleaning >>>')
print(clean_text)

Before cleaning >>>
Roche 'turns down Federer offer'

Australian tennis coach Tony Roche has turned down an approach from Roger Federer to be the world number one's new full-time coach, say reports.

Melbourne's Herald-Sun said Roche, troubled by a hip complaint, did not want to travel full-time again. However, Roche is happy to work with the Swiss star on a casual basis and is helping him prepare for next month's defence of his Australian Open crown. Federer has been without a coach since splitting with Peter Lundgren in 2003. Roche, a former Davis Cup player for Australia, won the French Open, reached the Wimbledon and US Open finals and won five Wimbledon doubles titles with John Newcombe.

He also coached former number one Ivan Lendl and Pat Rafter to Grand Slam victories and has worked with Australia's Lleyton Hewitt. Some reports claim Federer initially wanted Andre Agassi's Australian coach Darren Cahill, before Agassi confirmed he would play on in 2005. Federer was named Swiss 

## Preprocessing operations

- Lowercase all text
- Remove digits
- Remove punctuation
- Remove diacritics
- Remove extra whitespace
- Remove stopwords


- Examples of stopwords

![](https://www.computerhope.com/jargon/s/stop-words.png)
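The operations listed above can be sketched in plain Python. This is only an illustration using the standard library, not how `hero.clean` is implemented: the `simple_clean` function and the tiny `STOPWORDS` set are hypothetical, and this version removes every digit, which may differ from what `hero.clean` does.

```python
import re
import string
import unicodedata

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "is", "for"}

def simple_clean(text: str) -> str:
    """Apply the six preprocessing steps to a single string."""
    text = text.lower()                          # lowercase
    text = re.sub(r"\d+", " ", text)             # remove digits
    # Replace each punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Splitting on whitespace also collapses repeated spaces and newlines.
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

cleaned = simple_clean("The Café is OPEN in 2021!")
print(cleaned)
```

Running the steps in this order means stopwords are matched after lowercasing, so `The` and `the` are treated alike.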

# Top words per topic

> Feel free to change `NUM_TOP_WORDS` to other integers.

In [61]:
NUM_TOP_WORDS = 10
df.groupby('topic')['clean_text'].apply(lambda x: hero.top_words(x)[:NUM_TOP_WORDS])

topic               
athletics  said         181
           world        160
           year         159
           olympic      137
           race         112
           athens        99
           champion      99
           indoor        96
           european      94
           time          83
cricket    test         232
           england      225
           first        219
           cricket      216
           one          212
           said         203
           day          203
           series       169
           australia    144
           south        143
football   said         475
           chelsea      305
           game         297
           would        287
           club         274
           arsenal      247
           united       246
           players      240
           league       237
           time         220
rugby      england      395
           said         262
           wales        247
           ireland      229
           rugby        223

# Vectorize texts

![](https://miro.medium.com/max/1000/1*vWWmJlDykVRkjg9c38VbxQ.png)

- To convert documents into numbers, we need a Document-Term Matrix (DTM).
![](https://rlads2021.github.io/LabBook/assets/img/dtm.png)
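To make the DTM idea concrete, here is a minimal sketch that builds a count matrix for three toy documents and then reweights it with a textbook TF-IDF formula (term count times `log(N / document frequency)`). The toy documents and variable names are made up for illustration, and the exact weighting used by `hero.tfidf` may differ from this formula.

```python
import math
from collections import Counter

# Three toy "cleaned" documents (hypothetical examples).
docs = ["federer wins title", "england wins test", "federer beats agassi"]
tokenized = [d.split() for d in docs]

# The vocabulary defines the columns of the matrix.
vocab = sorted({term for doc in tokenized for term in doc})

# Document-Term Matrix: one row per document, one count per vocabulary term.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
doc_freq = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]

# TF-IDF: down-weight terms that appear in many documents.
tfidf = [
    [row[j] * math.log(n_docs / doc_freq[j]) for j in range(len(vocab))]
    for row in dtm
]

print(vocab)
print(dtm[0])
print([round(w, 3) for w in tfidf[0]])
```

In the first document, `title` appears in no other document, so it receives a higher weight than `federer` or `wins`, which each appear in two documents.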

In [62]:
df['tfidf'] = hero.tfidf(df['clean_text'])
df

Unnamed: 0,text,topic,clean_text,tfidf
0,Claxton hunting first major medal\n\nBritish h...,athletics,claxton hunting first major medal british hurd...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,sullivan could run worlds sonia sullivan indic...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,Greene sets sights on world title\n\nMaurice G...,athletics,greene sets sights world title maurice greene ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0533678197008..."
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,iaaf launches fight drugs iaaf athletics world...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,dibaba breaks 000m world record ethiopia tirun...,"[0.24734311047947527, 0.0, 0.0, 0.0, 0.0, 0.0,..."
...,...,...,...,...
732,Agassi into second round in Dubai\n\nFourth se...,tennis,agassi second round dubai fourth seed andre ag...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
733,Mauresmo fights back to win title\n\nWorld num...,tennis,mauresmo fights back win title world number tw...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
734,Federer wins title in Rotterdam\n\nWorld numbe...,tennis,federer wins title rotterdam world number one ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
735,GB players warned over security\n\nBritain's D...,tennis,gb players warned security britain davis cup p...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


# Train a clustering model

K-means is one of the most popular clustering algorithms. It stores $k$ centroids that define the clusters: a point belongs to a cluster if it is closer to that cluster's centroid than to any other centroid.

K-means searches for good centroids by alternating between

1. assigning data points to clusters based on the current centroids, and
2. choosing centroids (points at the center of a cluster) based on the current assignment of data points to clusters,

until the assignments stop changing (convergence).

![](https://i.imgur.com/42n9uvR.png)
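The two alternating steps can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the `k_means` function is hypothetical, and it naively uses the first `k` points as initial centroids, whereas scikit-learn's `KMeans` uses the smarter k-means++ initialization by default.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100):
    # Naive initialization (an assumption for this sketch): first k points.
    centroids = list(points[:k])
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Converged: no point changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Two well-separated groups of 2D points (toy data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

Even though both initial centroids start inside the same group, the second centroid is pulled toward the far group after the first update, and the assignments settle within a few iterations.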

In [63]:
from sklearn.cluster import KMeans

In [64]:
NUM_CLUSTERS = 5
doc_term_matrix = df['tfidf'].to_list()
km = KMeans(
    n_clusters=NUM_CLUSTERS, 
    max_iter=10000, 
    random_state=123,
    ).fit(doc_term_matrix)

In [None]:
from collections import Counter

In [65]:
Counter(km.labels_)

Counter({0: 97, 1: 114, 2: 279, 3: 152, 4: 95})

In [66]:
df['cluster'] = km.labels_
df

Unnamed: 0,text,topic,clean_text,tfidf,cluster
0,Claxton hunting first major medal\n\nBritish h...,athletics,claxton hunting first major medal british hurd...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,sullivan could run worlds sonia sullivan indic...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3
2,Greene sets sights on world title\n\nMaurice G...,athletics,greene sets sights world title maurice greene ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0533678197008...",0
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,iaaf launches fight drugs iaaf athletics world...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,dibaba breaks 000m world record ethiopia tirun...,"[0.24734311047947527, 0.0, 0.0, 0.0, 0.0, 0.0,...",0
...,...,...,...,...,...
732,Agassi into second round in Dubai\n\nFourth se...,tennis,agassi second round dubai fourth seed andre ag...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
733,Mauresmo fights back to win title\n\nWorld num...,tennis,mauresmo fights back win title world number tw...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
734,Federer wins title in Rotterdam\n\nWorld numbe...,tennis,federer wins title rotterdam world number one ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4
735,GB players warned over security\n\nBritain's D...,tennis,gb players warned security britain davis cup p...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2


# Reduce dimensions

- **Principal Component Analysis** (PCA) is a common technique for reducing dimensions.

![](https://miro.medium.com/max/1400/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg)
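As a rough illustration of what PCA does, the sketch below finds only the first principal component of a tiny 2D dataset using power iteration on the covariance matrix, then projects each row onto it. The `first_principal_component` function and the toy data are hypothetical; library implementations typically use a matrix decomposition instead and can return several components at once.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return the direction of greatest variance and each row's score along it."""
    n, d = len(rows), len(rows[0])
    # Center the data so every column has zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    vec = [1.0] * d
    for _ in range(n_iter):
        new = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    # Project each centered row onto the component to get a 1D representation.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores

# Toy data lying roughly along the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

Because the points lie almost on a line, a single number per row (its score) preserves nearly all of the variation in the original two columns.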

In [67]:
df['pca'] = hero.pca(df['tfidf'])
df

Unnamed: 0,text,topic,clean_text,tfidf,cluster,pca
0,Claxton hunting first major medal\n\nBritish h...,athletics,claxton hunting first major medal british hurd...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0,"[-0.09106000168473471, 0.10349560238655374]"
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,sullivan could run worlds sonia sullivan indic...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3,"[-0.0003837513189644131, 0.024830609265307845]"
2,Greene sets sights on world title\n\nMaurice G...,athletics,greene sets sights world title maurice greene ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0533678197008...",0,"[-0.11767252493749902, 0.12865305467843438]"
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,iaaf launches fight drugs iaaf athletics world...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0,"[-0.09133068203435618, 0.15403298214830907]"
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,dibaba breaks 000m world record ethiopia tirun...,"[0.24734311047947527, 0.0, 0.0, 0.0, 0.0, 0.0,...",0,"[-0.09129515194120927, 0.13499357178481047]"
...,...,...,...,...,...,...
732,Agassi into second round in Dubai\n\nFourth se...,tennis,agassi second round dubai fourth seed andre ag...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4,"[-0.06663426167725048, 0.1087933230927288]"
733,Mauresmo fights back to win title\n\nWorld num...,tennis,mauresmo fights back win title world number tw...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4,"[-0.048151757363650766, 0.05289538850721994]"
734,Federer wins title in Rotterdam\n\nWorld numbe...,tennis,federer wins title rotterdam world number one ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4,"[-0.049762751202150894, 0.06020561675666957]"
735,GB players warned over security\n\nBritain's D...,tennis,gb players warned security britain davis cup p...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2,"[-0.04870969774446013, 0.013944823139749788]"


# Visualize texts

In [73]:
import plotly.express as px

In [74]:
df['pca_0'] = df['pca'].apply(lambda x: x[0])
df['pca_1'] = df['pca'].apply(lambda x: x[1])

In [75]:
fig = px.scatter(df, x="pca_0", y="pca_1", 
                 color="topic",
                 title="PCA BBC Sport news labelled by topics",
                 hover_name=df.index,)
fig.show()

In [80]:
fig = px.scatter(df, x="pca_0", y="pca_1", 
                 color="cluster",
                 title="PCA BBC Sport news labelled by clusters",
                 hover_name=df.index,)
fig.show()

# Text similarity

In [None]:
!pip install -U -q pip setuptools wheel
!pip install -U -q spacy
!python -m spacy download en_core_web_md

In [2]:
!python -m spacy info

spaCy version    3.3.0                         
Location         /usr/local/lib/python3.7/dist-packages/spacy
Platform         Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
Python version   3.7.13                        
Pipelines        en_core_web_md (3.3.0)        



In [3]:
import spacy

In [4]:
nlp = spacy.load('en_core_web_md')

- Cosine similarity

![](https://www.tyrrell4innovation.ca/wp-content/uploads/2021/06/rsz_jenny_du_miword.png)
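Cosine similarity measures the angle between two vectors: the dot product divided by the product of their lengths. The minimal sketch below computes it for plain Python lists; the `cosine_similarity` helper is written here for illustration. spaCy's `similarity` method is documented as using, by default, cosine similarity over averaged word vectors, so the same idea applies to the document vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (both non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not length, which is why a vector and its doubled copy score as maximally similar.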

> Feel free to change `text_1` and `text_2` to other English texts.

In [12]:
text_1 = "I'm a big fan of foreign tongues."
text_2 = "I'm interested in learning languges."
doc_1 = nlp(text_1)
doc_2 = nlp(text_2)
sim_score = doc_1.similarity(doc_2)
sim_score

0.8357480426963773

# Dependency parsing

In [42]:
from spacy import displacy

In [78]:
Robot = "What would you like to order?"
Human = "I'd like to order three Cheeseburgers and one small fries."
doc = nlp(Human)
displacy.render(doc, style='dep',jupyter=True)

In [77]:
for token in doc:
    # The direct object (dobj) of the verb heads the phrase listing the ordered items.
    if token.dep_ == "dobj":
        # Take the span covering the object's whole subtree; unlike min/max over
        # the children, this also works when the object has no children.
        order = doc[token.left_edge.i : token.right_edge.i + 1].text
        print(f"Your order is: {order}")

Your order is: three Cheeseburgers and one small fries


# Named entities

> Feel free to change `DOCID` to any number between 0 and 736.

In [62]:
DOCID = 400
text = df.loc[DOCID, 'text']
doc = nlp(text)
displacy.render(doc, style='ent',jupyter=True)

In [52]:
#@title
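def extract_ent(ent_label):
    """Collect, for each document, the set of entity strings with the given label."""
    corpus_ents = []
    # `texts` is a global Series of raw article texts, defined in a cell below.
    for doc in nlp.pipe(texts):
        doc_ents = [ent.text for ent in doc.ents if ent.label_ == ent_label]
        # A set keeps each entity at most once per document.
        corpus_ents.append(set(doc_ents))
    return corpus_ents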

In [93]:
#@title
from collections import Counter

def show_top_entities(df, ent_label, top_k=10):
    corpus_ents = []
    for ents in df[ent_label]:
        doc_ents = list(ents)
        corpus_ents.extend(doc_ents)
    counter = Counter(corpus_ents)
    res = counter.most_common(top_k)
    return res

## Person

In [51]:
texts = df['text']

In [53]:
df['person'] = extract_ent('PERSON')
df

Unnamed: 0,text,topic,person
0,Claxton hunting first major medal\n\nBritish h...,athletics,"{Sarah Claxton, Irina Shevchenko}"
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,"{Maria McCambridge, Fionnualla Britton, Jolene..."
2,Greene sets sights on world title\n\nMaurice G...,athletics,"{Greene, Ato, Lewis-Francis, Francis Obikwelu,..."
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,"{Frankie Fredericks, Diack, Lamine Diack}"
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,"{Bekele, Kenenisa Bekele's, Alistair Cragg, Jo..."
...,...,...,...
732,Agassi into second round in Dubai\n\nFourth se...,tennis,"{Olivier Rochus, Tim Henman, Andre Agassi, Rad..."
733,Mauresmo fights back to win title\n\nWorld num...,tennis,"{Mauresmo, Amelie, Amelie Mauresmo, Williams, ..."
734,Federer wins title in Rotterdam\n\nWorld numbe...,tennis,"{Rotterdam, Federer, Roger Federer, Ivan Ljubi..."
735,GB players warned over security\n\nBritain's D...,tennis,"{Bates, Murray, Alex Bogdanovic, David Sherwoo..."


> Feel free to change `TOP_K` to other numbers and rerun the cell.

In [94]:
TOP_K = 5
show_top_entities(df, 'person', top_k=TOP_K)

[('Williams', 32),
 ('Alex Ferguson', 29),
 ('Jose Mourinho', 28),
 ('Andy Robinson', 27),
 ('Robinson', 26)]

## Geopolitical entities

In [54]:
df['gpe'] = extract_ent('GPE')
df

Unnamed: 0,text,topic,person,gpe
0,Claxton hunting first major medal\n\nBritish h...,athletics,"{Sarah Claxton, Irina Shevchenko}","{Scotland, Claxton, London, Colchester, Madrid}"
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,"{Maria McCambridge, Fionnualla Britton, Jolene...","{Dublin, Athletics Ireland, Australia, Santry,..."
2,Greene sets sights on world title\n\nMaurice G...,athletics,"{Greene, Ato, Lewis-Francis, Francis Obikwelu,...","{Kansas, Birmingham, Britain, Finland, Helsink..."
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,"{Frankie Fredericks, Diack, Lamine Diack}","{Qatar, Monaco}"
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,"{Bekele, Kenenisa Bekele's, Alistair Cragg, Jo...","{Carolina Kluft, Ethiopia, Stuttgart, Dibaba, ..."
...,...,...,...,...
732,Agassi into second round in Dubai\n\nFourth se...,tennis,"{Olivier Rochus, Tim Henman, Andre Agassi, Rad...","{Belgium, Dubai, Morocco, the United Arab Emir..."
733,Mauresmo fights back to win title\n\nWorld num...,tennis,"{Mauresmo, Amelie, Amelie Mauresmo, Williams, ...","{Dinara Safina, Antwerp}"
734,Federer wins title in Rotterdam\n\nWorld numbe...,tennis,"{Rotterdam, Federer, Roger Federer, Ivan Ljubi...","{Ljubicic, Doha, Rotterdam}"
735,GB players warned over security\n\nBritain's D...,tennis,"{Bates, Murray, Alex Bogdanovic, David Sherwoo...","{Israel, Britain, Tel Aviv}"


> Feel free to change `TOP_K` to other numbers and rerun the cell.

In [95]:
TOP_K = 5
show_top_entities(df, 'gpe', top_k=TOP_K)

[('England', 219),
 ('France', 110),
 ('Australia', 100),
 ('Ireland', 87),
 ('Scotland', 83)]