In [0]:
#@title <b><font color="red">▶</font><font color="black"> run this cell to prepare supplementary materials for the lesson</font></b>

!rm -rf harbour-space-text-mining-course
!git clone https://github.com/horoshenkih/harbour-space-text-mining-course.git
import sys
sys.path.append('harbour-space-text-mining-course')

from tmcourse.utils import (
    display_cv_results,
    display_token_importance,
)

import numpy as np
from collections import Counter
from math import exp
from tabulate import tabulate
from tqdm.notebook import tqdm
from IPython.display import HTML, display
!pip install fasttext

Cloning into 'harbour-space-text-mining-course'...

<!--@slideshow slide-->
<center><h1>Case study: Telegram News Aggregator </h1></center>

<!--@slideshow slide-->
# So far
Concepts:
1. TF-IDF
1. Language Models
1. Text Classification
1. Text Clustering
1. Topic Modeling
1. Word Vectors

Algorithms:
1. $n$-gram language models
1. Logistic Regression
1. $k$-means
1. LDA
1. Neural Networks

<!--@slideshow slide-->
# Telegram News Aggregator Contest

1. Isolate articles in English and Russian.
1. Isolate news articles.
1. Group news articles by category.
1. Group similar news into threads.
1. Sort threads by their relative importance.

[link](https://contest.com/docs/data_clustering)

<!--@slideshow slide-->
## 1. Isolate articles in English and Russian.

Your algorithm must sort articles by language, filtering English and Russian articles. Articles in other languages are not relevant for this stage of the contest and may be discarded.

**Q: how to do it?**


<!--@slideshow fragment-->
> Text Mining problem: Language Identification.

Language Identification can be solved using _subword information_: different languages tend to use different combinations of characters:
- $n$-gram or RNN language model for characters
- Pre-trained FastText model for language identification: [link](https://fasttext.cc/docs/en/language-identification.html)
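
As a toy illustration of the subword idea, the sketch below builds character trigram profiles from a few hand-written sentences and picks the language whose profile overlaps most with the input. The training sentences, the function names and the overlap score are assumptions made for this example; a real system would rely on a trained model such as the FastText one linked above.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased text padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# Tiny hand-written "training corpora" (illustrative only).
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is on the table",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку и кошка сидит на столе",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato está en la mesa",
}
PROFILES = {lang: char_ngrams(text) for lang, text in TRAIN.items()}

def identify_language(text):
    """Return the language whose trigram profile shares the most mass with the text."""
    grams = char_ngrams(text)

    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(identify_language("the dog is on the table"))
print(identify_language("кошка и собака"))
```

With corpora this small the profiles are very noisy; the point is only that character-level statistics already separate languages without any word-level vocabulary.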

<!--@slideshow slide-->
## 2. Isolate news articles.
Your algorithm must discard everything except for news articles.

**Q: how to do it?**


<!--@slideshow fragment-->
> Text Mining problem: Classification.

<!--@slideshow slide-->
## 3. Group news articles by category.
Your algorithm must place news articles into the following 7 categories:

  1. Society (includes Politics, Elections, Legislation, Incidents, Crime)
  1. Economy (includes Markets, Finance, Business)
  1. Technology (includes Gadgets, Auto, Apps, Internet services)
  1. Sports (includes E-Sports)
  1. Entertainment (includes Movies, Music, Games, Books, Arts)
  1. Science (includes Health, Biology, Physics, Genetics)
  1. Other (news articles that don't fall into any of the above categories)

**Q: how to do it?**



<!--@slideshow fragment-->
> Text Mining problem: Classification.

Instead of solving two classification problems (news vs. not news, then the category), we can train a single classifier with one extra category for articles that are not news.
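
A minimal sketch of how the output of such a single classifier could be post-processed: documents labelled with the extra class are discarded and the rest are grouped by category. The function name and the document ids are hypothetical; the label `NotNews` is the name used for the extra class later in this notebook.

```python
from collections import defaultdict

NOT_NEWS = "NotNews"  # the extra class for articles that are not news

def split_predictions(doc_ids, predicted_labels):
    """Group news documents by predicted category and collect discarded ones."""
    by_category = defaultdict(list)
    discarded = []
    for doc_id, label in zip(doc_ids, predicted_labels):
        if label == NOT_NEWS:
            discarded.append(doc_id)
        else:
            by_category[label].append(doc_id)
    return dict(by_category), discarded

# Hypothetical ids and predictions, for illustration only.
news, not_news = split_predictions(
    ["a.html", "b.html", "c.html"],
    ["Society", "NotNews", "Society"],
)
print(news, not_news)
```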


<!--@slideshow slide-->
## 4. Group similar news into threads.
Your algorithm must identify news articles about the same event and group them together into threads, selecting a relevant title for each thread. News articles inside each thread must be sorted according to their relevance (most relevant at the top).

**Q: how to do it?**

<!--@slideshow fragment-->
> Text Mining problem: Clustering.

Subproblems:
1. Texts to vectors
1. Vectors to clusters
1. Relevant title
1. Ranking


<!--@slideshow slide-->
## 5. Sort threads by their relative importance.
Your algorithm must sort news threads in each of the categories based on perceived importance (important at the top). In addition, the algorithm must build a global list of threads, independent of category, sorted by perceived importance (important at the top).

**Q: how to do it?**


<!--@slideshow fragment-->
> Not exactly a Text Mining problem, can be solved using Machine Learning (ranking).

Features:
- Number of documents in a thread.
- "Freshness" of documents in a thread.
- "Authority" of sources ([PageRank](https://en.wikipedia.org/wiki/PageRank)).
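
The three features above could be combined into a single score, for example as in the sketch below. The formula (log of thread size, times an exponential freshness decay, times the mean source authority), the half-life and the authority values are all assumptions chosen for illustration; a ranking model would normally learn how to weigh such features.

```python
from datetime import datetime, timedelta, timezone
from math import log

def thread_importance(published_times, source_authorities, now, half_life_hours=12.0):
    """Toy importance score built from thread size, freshness and source authority."""
    size_score = log(1 + len(published_times))
    # freshness is measured from the most recent document in the thread
    age_hours = (now - max(published_times)).total_seconds() / 3600
    freshness = 0.5 ** (age_hours / half_life_hours)
    authority = sum(source_authorities) / len(source_authorities)
    return size_score * freshness * authority

now = datetime(2019, 11, 7, 12, tzinfo=timezone.utc)
big_fresh = thread_importance([now - timedelta(hours=1)] * 20, [0.8] * 20, now)
small_old = thread_importance([now - timedelta(hours=48)], [0.8], now)
print(big_fresh > small_old)
```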

<!--@slideshow slide-->
# Colab demo: Get the data

In [0]:
# download one sample of data
!wget https://data-static.usercontent.dev/DataClusteringSample0107.tar.gz
!mkdir -p DataClustering
!tar -xvf DataClusteringSample0107.tar.gz -C DataClustering

Streaming output truncated to the last 5000 lines.
20191107/14/1445164493428702727.html
20191107/14/1445164493488521983.html
20191107/14/1445164494706691901.html
20191107/14/2601641153624841144.html
20191107/14/3181417700779135274.html
20191107/14/3181417702754635802.html
20191107/14/3181417702533222779.html
20191107/14/3181417701435724700.html
20191107/14/6930392589795973240.html
20191107/14/6930392589570691082.html
20191107/14/552235480115285670.html
20191107/14/552235479855156485.html
20191107/14/4500271768558100704.html
20191107/14/3262344773534256308.html
20191107/14/4396150893111206685.html
20191107/14/4396150893719692249.html
20191107/14/5644198864208279048.html
20191107/14/5644198863403458566.html
20191107/14/5644198863528907775.html
20191107/14/3874101093290551843.html
20191107/14/8213836170379281631.html
20191107/14/8213836168812185852.html
20191107/14/6389894490754232693.html
20191107/14/1175411578289781422.html
20191107/14/1175411577961154713.html
20191107/14/

In [0]:
# get list of files using glob
import glob
from IPython.display import HTML, display
files = list(sorted(glob.glob("DataClustering/*/*/*.html")))
len(files)

166551

In [0]:
# data contains html files
with open(files[359]) as f:
    print(f.read())

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8"/>
    <meta property="og:url" content="https://www.moneycontrol.com/news/business/toyota-kirloskar-sales-drop-5-in-october-at-12610-units-4594541.html"/>
    <meta property="og:site_name" content="Moneycontrol"/>
    <meta property="article:published_time" content="2019-11-01T00:00:00+00:00"/>
    <meta property="og:title" content="Toyota Kirloskar sales drop 5% in October at 12,610 units"/>
    <meta property="og:description" content="Domestic sales were down 6 percent at 11,866 units as compared to 12,606 units in the corresponding month last year, down 6 percent, the company added."/>
  </head>
  <body>
    <article>
      <h1>Toyota Kirloskar sales drop 5% in October at 12,610 units</h1>
      <h2>Domestic sales were down 6 percent at 11,866 units as compared to 12,606 units in the corresponding month last year, down 6 percent, the company added.</h2>
      <address><time datetime="2019-11-01T00:00:00+00:00">01 Nov 2019</tim

In [0]:
# there are samples in different languages
with open(files[10003]) as f:
    display(HTML(f.read()))

In [0]:
# some of the samples are news
with open(files[359]) as f:
    display(HTML(f.read()))

In [0]:
# some samples are not news
with open(files[24114]) as f:
    display(HTML(f.read()))

Parsing

In [0]:
from dateutil.parser import parse
from collections import namedtuple

# store structured information about the document
Sample = namedtuple("Sample", "text title published_time short_text")

def parse_html(html):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find("meta", property="og:title")
    if title_tag:
        title = title_tag["content"].strip()
    else:
        title = ""
    description_tag = soup.find("meta", property="og:description")
    if description_tag:
        description = description_tag["content"].strip()
    else:
        description = ""
    h1 = [_.text.strip() for _ in soup.find_all('h1')]
    h2 = [_.text.strip() for _ in soup.find_all('h2')]
    paragraphs = [_.text.strip() for _ in soup.find_all('p')]

    text = "\t".join([title, description] + h1 + h2 + paragraphs).replace("\n", " ")
    short_text = "\t".join([title, description])
    published_time_tag = soup.find("meta", property="article:published_time")
    published_time = parse(published_time_tag["content"].strip())
    return Sample(text=text, title=title, published_time=published_time, short_text=short_text)


Look at the parsed data

In [0]:
from pprint import pprint
with open(files[100]) as f:
    sample = parse_html(f.read())
    print(sample.published_time)
    print(sample.title)
    print(sample.short_text)
    print(sample.text)

2019-11-01 00:00:00+00:00
Qué consecuencias tiene el nuevo cepo al dólar para quienes viajan al exterior
Qué consecuencias tiene el nuevo cepo al dólar para quienes viajan al exterior	Los viajeros deberán tener en cuenta algunas cuestiones a partir de las medidas dispuestas por el BCRA. Límites, extracciones y pagos.
Qué consecuencias tiene el nuevo cepo al dólar para quienes viajan al exterior	Los viajeros deberán tener en cuenta algunas cuestiones a partir de las medidas dispuestas por el BCRA. Límites, extracciones y pagos.	Qué consecuencias tiene el nuevo cepo al dólar para quienes viajan al exterior	A partir de este 1 de noviembre, la operatoria con tarjetas en el extranjero sólo se puede realizar por un monto máximo de 50 dólares de adelanto de efectivo por operación. Esto, tras una nueva disposición del Banco Central, que ya había impuesto el límite mensual de compra de hasta 200 dólares por homebanking.	A continuación, las cinco claves a tener en cuenta quienes viajen al exteri

In [0]:
# parse all
from tqdm.notebook import tqdm

samples = []
for p in tqdm(files):
    with open(p) as f:
        samples.append(parse_html(f.read()))

HBox(children=(FloatProgress(value=0.0, max=166551.0), HTML(value='')))




<!--@slideshow slide-->
# Colab demo: Language identification

We will isolate only English articles.

In [0]:
# get FastText model for language identification
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

--2020-06-04 09:34:56--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 2606:4700:10::6816:4a8e, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 938013 (916K) [binary/octet-stream]
Saving to: ‘lid.176.ftz.1’


2020-06-04 09:34:57 (2.01 MB/s) - ‘lid.176.ftz.1’ saved [938013/938013]



In [0]:
from collections import Counter
from pprint import pprint
import fasttext

li_model = fasttext.load_model("lid.176.ftz")

samples_en = []
language_counter = Counter()
for sample in tqdm(samples):
    text = sample.text
    predicted_language = li_model.predict(text)[0][0][len("__label__"):]
    language_counter[predicted_language] += 1
    if predicted_language == "en":
        samples_en.append(sample)
pprint(language_counter, compact=True)



HBox(children=(FloatProgress(value=0.0, max=166551.0), HTML(value='')))


Counter({'en': 36417,
         'ru': 32864,
         'es': 15021,
         'ar': 11214,
         'uk': 6485,
         'fa': 6129,
         'fr': 6098,
         'de': 5997,
         'pt': 5449,
         'id': 5063,
         'it': 4574,
         'el': 3920,
         'bg': 3086,
         'ko': 2675,
         'vi': 2185,
         'tr': 2060,
         'zh': 1425,
         'nl': 1358,
         'no': 1330,
         'ml': 1211,
         'ro': 1136,
         'hi': 1033,
         'ja': 1022,
         'cs': 956,
         'hu': 939,
         'th': 893,
         'sl': 730,
         'ca': 635,
         'te': 632,
         'sr': 631,
         'mr': 562,
         'sv': 520,
         'he': 500,
         'sk': 442,
         'ta': 292,
         'pl': 267,
         'bn': 260,
         'lt': 132,
         'uz': 71,
         'tl': 49,
         'sw': 48,
         'km': 45,
         'sh': 28,
         'kk': 24,
         'be': 20,
         'tg': 19,
         'nn': 19,
         'ur': 19,
         'da': 19,
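
The long tail of rare languages in the counts above may partly consist of low-confidence guesses. For a single text, the model returns a pair of labels and probabilities, so one option is to fall back to an "unknown" bucket below some probability. The helper below is a sketch of that idea; its name, the 0.5 threshold and the hand-made inputs are assumptions, not part of the FastText API.

```python
def parse_prediction(prediction, min_confidence=0.5, fallback="unknown"):
    """Turn a FastText-style (labels, probabilities) pair into a language code."""
    labels, probabilities = prediction
    label, probability = labels[0], float(probabilities[0])
    if probability < min_confidence:
        return fallback
    return label[len("__label__"):]

# Hand-made stand-ins for what the model returns for one text.
print(parse_prediction((("__label__en",), [0.97])))
print(parse_prediction((("__label__sh",), [0.21])))
```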
   

<!--@slideshow slide-->
# Classification

**Problem**: In the contest, data is provided without labels. How do we train a classifier?


<!--@slideshow fragment-->
Possible solutions:
- Classify training data manually.
- Use crowdsourcing ([Yandex.Toloka](https://toloka.yandex.com) or [Amazon Mechanical Turk](https://www.mturk.com/)).
- Find a similar labelled dataset.
  - Use [Google Dataset Search](https://datasetsearch.research.google.com/)

<!--@slideshow slide-->
## Colab demo: [BBC articles dataset](https://www.kaggle.com/yufengdev/bbc-fulltext-and-category)

**Advantages**: full texts.

**Disadvantages**: small dataset, not enough categories.

In [0]:
import pandas as pd
df_bbc = pd.read_csv("harbour-space-text-mining-course/datasets/bbc/bbc-text.csv")
df_bbc.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


The external dataset has categories different from the contest, so we need to find the mapping between them.

In [0]:
df_bbc.category.unique()

array(['tech', 'business', 'sport', 'entertainment', 'politics'],
      dtype=object)

In [0]:
df_bbc["ContestCategory"] = df_bbc.category.map(
    {
        "tech": "Technology",
        "business": "Economy",
        "sport": "Sports",
        "entertainment": "Entertainment",
        "politics": "Society"
    }
)
df_bbc.head()

Unnamed: 0,category,text,ContestCategory
0,tech,tv future in the hands of viewers with home th...,Technology
1,business,worldcom boss left books alone former worldc...,Economy
2,sport,tigers wary of farrell gamble leicester say ...,Sports
3,sport,yeading face newcastle in fa cup premiership s...,Sports
4,entertainment,ocean s twelve raids box office ocean s twelve...,Entertainment
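
`Series.map` with a dict leaves `NaN` for any value that is missing from the dict, so it may be worth checking the coverage of a mapping before training. The sketch below does this with plain Python sets; the helper name is hypothetical, and the category lists are copied from the cells above and from the contest description.

```python
def check_mapping(source_categories, mapping, target_categories):
    """Return source categories without a mapping and target categories never produced."""
    missing_keys = sorted(set(source_categories) - set(mapping))
    unused_targets = sorted(set(target_categories) - set(mapping.values()))
    return missing_keys, unused_targets

bbc_categories = ["tech", "business", "sport", "entertainment", "politics"]
bbc_mapping = {
    "tech": "Technology",
    "business": "Economy",
    "sport": "Sports",
    "entertainment": "Entertainment",
    "politics": "Society",
}
contest_categories = [
    "Society", "Economy", "Technology", "Sports", "Entertainment", "Science", "Other",
]
print(check_mapping(bbc_categories, bbc_mapping, contest_categories))
```

The second list makes the "not enough categories" disadvantage concrete: the BBC data alone provides no examples for some contest categories.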


<!--@slideshow slide-->
## Colab demo: [News Category Dataset](https://www.kaggle.com/rmisra/news-category-dataset)

**Advantages**: large dataset, many categories.

**Disadvantages**: no full articles, only short descriptions.

In [0]:
df_news = pd.read_csv("harbour-space-text-mining-course/datasets/news_category_dataset/category_headline_description.csv")
df_news.head()

Unnamed: 0.1,Unnamed: 0,category,headline,short_description
0,0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,She left her husband. He killed their children...
1,1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Of course it has a song.
2,2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,The actor and his longtime girlfriend Anna Ebe...
3,3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,The actor gives Dems an ass-kicking for not fi...
4,4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,"The ""Dietland"" actress said using the bags is ..."


In [0]:
# get texts
def row2text(r):
    import re
    # the same format as for BBC: lowercase without punctuation
    text = str(r["headline"]) + " " + str(r["short_description"])
    return re.sub(r'[^\w\s]', '', text.lower())

df_news["text"] = df_news.fillna(".").apply(row2text, axis=1)
df_news.head()

Unnamed: 0.1,Unnamed: 0,category,headline,short_description,text
0,0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,She left her husband. He killed their children...,there were 2 mass shootings in texas last week...
1,1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Of course it has a song.,will smith joins diplo and nicky jam for the 2...
2,2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,The actor and his longtime girlfriend Anna Ebe...,hugh grant marries for the first time at age 5...
3,3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,The actor gives Dems an ass-kicking for not fi...,jim carrey blasts castrato adam schiff and dem...
4,4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,"The ""Dietland"" actress said using the bags is ...",julianna margulies uses donald trump poop bags...


In [0]:
df_news.category.unique()

array(['CRIME', 'ENTERTAINMENT', 'WORLD NEWS', 'IMPACT', 'POLITICS',
       'WEIRD NEWS', 'BLACK VOICES', 'WOMEN', 'COMEDY', 'QUEER VOICES',
       'SPORTS', 'BUSINESS', 'TRAVEL', 'MEDIA', 'TECH', 'RELIGION',
       'SCIENCE', 'LATINO VOICES', 'EDUCATION', 'COLLEGE', 'PARENTS',
       'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE', 'HEALTHY LIVING',
       'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST', 'FIFTY', 'ARTS',
       'WELLNESS', 'PARENTING', 'HOME & LIVING', 'STYLE & BEAUTY',
       'DIVORCE', 'WEDDINGS', 'FOOD & DRINK', 'MONEY', 'ENVIRONMENT',
       'CULTURE & ARTS'], dtype=object)

In [0]:
# convert categories
categories_mapping = {
    "CRIME": "Society",
    "ENTERTAINMENT": "Entertainment",
    "WORLD NEWS": "Society",
    "IMPACT": "NotNews",
    "POLITICS": "Society",
    "WEIRD NEWS": "Other",
    "BLACK VOICES": "Society",
    "WOMEN": "NotNews",
    "COMEDY": "Entertainment",
    "QUEER VOICES": "Society",
    "SPORTS": "Sports",
    "BUSINESS": "Economy",
    "TRAVEL": "NotNews",
    "MEDIA": "Society",
    "TECH": "Technology",
    "RELIGION": "Society",
    "SCIENCE": "Science",
    "LATINO VOICES": "Society",
    "EDUCATION": "Society",
    "COLLEGE": "Society",
    "PARENTS": "NotNews",
    "ARTS & CULTURE": "Society",
    "STYLE": "NotNews",
    "GREEN": "Society",
    "TASTE": "NotNews",
    "HEALTHY LIVING": "NotNews",
    "THE WORLDPOST": "Society",
    "GOOD NEWS": "Other",
    "WORLDPOST": "Society",
    "FIFTY": "NotNews",
    "ARTS": "NotNews",
    "WELLNESS": "NotNews",
    "PARENTING": "NotNews",
    "HOME & LIVING": "NotNews",
    "STYLE & BEAUTY": "NotNews",
    "DIVORCE": "NotNews",
    "WEDDINGS": "NotNews",
    "FOOD & DRINK": "NotNews",
    "MONEY": "NotNews",
    "ENVIRONMENT": "Other",
    "CULTURE & ARTS": "NotNews",
}
df_news["ContestCategory"] = df_news.category.map(categories_mapping)
df_news.head(10)

Unnamed: 0.1,Unnamed: 0,category,headline,short_description,text,ContestCategory
0,0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,She left her husband. He killed their children...,there were 2 mass shootings in texas last week...,Society
1,1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Of course it has a song.,will smith joins diplo and nicky jam for the 2...,Entertainment
2,2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,The actor and his longtime girlfriend Anna Ebe...,hugh grant marries for the first time at age 5...,Entertainment
3,3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,The actor gives Dems an ass-kicking for not fi...,jim carrey blasts castrato adam schiff and dem...,Entertainment
4,4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,"The ""Dietland"" actress said using the bags is ...",julianna margulies uses donald trump poop bags...,Entertainment
5,5,ENTERTAINMENT,Morgan Freeman 'Devastated' That Sexual Harass...,"""It is not right to equate horrific incidents ...",morgan freeman devastated that sexual harassme...,Entertainment
6,6,ENTERTAINMENT,Donald Trump Is Lovin' New McDonald's Jingle I...,"It's catchy, all right.",donald trump is lovin new mcdonalds jingle in ...,Entertainment
7,7,ENTERTAINMENT,What To Watch On Amazon Prime That’s New This ...,There's a great mini-series joining this week.,what to watch on amazon prime thats new this w...,Entertainment
8,8,ENTERTAINMENT,Mike Myers Reveals He'd 'Like To' Do A Fourth ...,"Myer's kids may be pushing for a new ""Powers"" ...",mike myers reveals hed like to do a fourth aus...,Entertainment
9,9,ENTERTAINMENT,What To Watch On Hulu That’s New This Week,You're getting a recent Academy Award-winning ...,what to watch on hulu thats new this week your...,Entertainment


<!--@slideshow slide-->
## Combine datasets

As we can see, the datasets differ considerably in size and in text length:

In [0]:
#@slideshow fragment
def describe_dataset(df):
    total_tokens = sum([len(s.split()) for s in df.text])
    print(f"\tsamples: {df.shape[0]}, avg tokens: {total_tokens / df.shape[0]}, total tokens: {total_tokens}")
print("BBC:")
describe_dataset(df_bbc)
print("News Category Dataset")
describe_dataset(df_news)

BBC:
	samples: 2225, avg tokens: 390.2952808988764, total tokens: 868407
News Category Dataset
	samples: 200853, avg tokens: 29.146888520460237, total tokens: 5854240


<!--@slideshow fragment-->
**Q: how to train a classifier with two datasets?**

<!--@slideshow slide-->
## Colab demo: Train the classifier

Try the following approach: assign a larger weight to samples with full texts.

In [0]:
df_bbc["weight"] = 1.0
df_news["weight"] = 0.1
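
To get a feeling for what these weights do, the sketch below computes the share of total weight that each dataset receives, first per sample and then per token. The counts are taken from the dataset description printed earlier; looking at token-weighted shares is only a rough heuristic assumed here, not something the classifier computes.

```python
# Sample and token counts from the dataset description printed earlier.
datasets = {
    "bbc": {"samples": 2225, "tokens": 868407, "weight": 1.0},
    "news": {"samples": 200853, "tokens": 5854240, "weight": 0.1},
}

def weighted_share(key):
    """Share of the total weighted count of `key` contributed by each dataset."""
    total = sum(d[key] * d["weight"] for d in datasets.values())
    return {name: d[key] * d["weight"] / total for name, d in datasets.items()}

print(weighted_share("samples"))
print(weighted_share("tokens"))
```

Even with a weight of 0.1, the short descriptions still carry most of the per-sample weight, while the full BBC texts carry more than half of the weighted tokens.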

Combine and shuffle

In [0]:
from sklearn.utils import shuffle
df = pd.concat(
    [
        df_bbc[["text", "ContestCategory", "weight"]],
        df_news[["text", "ContestCategory", "weight"]]
    ]
)
df = shuffle(df)
df.head()

Unnamed: 0,text,ContestCategory,weight
16862,hillary clinton calling trump supporters deplo...,Society,0.1
113139,how to do a lobster tail hair twist in 8 easy ...,NotNews,0.1
71153,1 killed after shooting at north carolina mall...,Society,0.1
163793,shoulda coulda woulda how to make a decision y...,NotNews,0.1
23971,the long lonely road of chelsea manning on a g...,Society,0.1


Train the classifier

In [0]:
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df.ContestCategory)

vec = TfidfVectorizer(stop_words="english")
# logistic regression trained with SGD; in scikit-learn >= 1.1 this loss is named "log_loss"
clf = SGDClassifier(max_iter=50, loss="log", random_state=0)
pipeline = Pipeline([
    ("vec", vec),
    ("clf", clf),
])

param_grid = {
    "clf__alpha": [1e-6, 1e-7, 1e-8],
}

pipeline_grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=3, refit=True)
pipeline_grid_search.fit(df.text, y, clf__sample_weight=df.weight)

display_cv_results(pipeline_grid_search)

Unnamed: 0,mean_test_score,param_clf__alpha
0,0.798181,1e-07
1,0.785314,1e-06
2,0.77288,1e-08


Apply to the contest data

In [0]:
texts = [s.text for s in samples_en]
predictions = le.inverse_transform(pipeline_grid_search.predict(texts))
for p, s in zip(predictions[:10], samples_en[:10]):
    print(p)
    print(s.title)
    print("---")

Economy
Varcoe: Encana's founding CEO worries Calgary destined to become 'branch office'
---
Society
Former PMs Harper and Chrétien discuss election, energy and division in Calgary
---
Society
Meng’s lawyers still say RCMP shared phone details with FBI despite affidavits
---
Society
Despite congressional chaos, Pelosi sends positive signals on ratifying USMCA
---
Sports
Rob Vanstone: The Saskatchewan Roughriders got a scare as Halloween loomed
---
Entertainment
Netflix plans to release its first scripted podcast
---
Entertainment
Ending the War on Artisan Cheese begins battle for Oddest book title prize
---
NotNews
---
Sports
Juventus and Manchester United among 20 clubs watching Erling Braut Haaland
---
Society
Arrest after one killed and 15 injured as two buses and a car collide
---


<!--@slideshow slide-->
# Clustering

**Problems**:
1. In the $k$-means algorithm, a suitable $k$ depends on the number of samples.
2. We cannot find centroids using one dataset and apply them to another dataset (news is time-dependent).
3. How do we understand if the number of clusters is too small?
4. How do we understand if the number of clusters is too large?

<!--@slideshow slide-->
## Agglomerative clustering

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/a/ad/Hierarchical_clustering_simple_diagram.svg">
</center>

<!--@slideshow slide-->
Complexity $O(n^2)$.
- OK for our dataset (cluster each category separately)
- In general, we can split the data: news is _time-dependent_, so split it into chunks of 10000 samples with an overlap of 2000
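
The splitting scheme can be sketched as index windows over a list of samples, assuming the samples are already sorted by publication time. The function name is hypothetical; the default sizes are the ones mentioned above.

```python
def overlapping_windows(n_items, window=10000, overlap=2000):
    """Yield (start, end) index pairs that cover range(n_items) with overlapping windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    if n_items <= 0:
        return
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, n_items)
        yield start, end
        if end >= n_items:
            break
        start += step

print(list(overlapping_windows(25000)))
```

Each window can then be clustered independently, and the overlap gives a chance to merge threads that span a window boundary.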


<!--@slideshow fragment-->
**Q: How to choose the distance threshold?**


<!--@slideshow fragment-->
- Measure the max in-cluster distance.
- Use the time of publishing! News should be close in time.
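
Both checks can be computed per cluster, as in the sketch below: the largest pairwise distance inside the cluster and the span between the earliest and the latest publication time. The function names and the limits of 1.0 and one day are assumptions for illustration; suitable limits depend on the vectors and on the data.

```python
from datetime import datetime, timedelta
from itertools import combinations
from math import dist

def cluster_diagnostics(vectors, published_times):
    """Return (max pairwise distance, publication time span) for one cluster."""
    max_distance = max((dist(a, b) for a, b in combinations(vectors, 2)), default=0.0)
    time_span = max(published_times) - min(published_times)
    return max_distance, time_span

def looks_too_loose(vectors, published_times,
                    max_distance=1.0, max_span=timedelta(days=1)):
    """Flag a cluster whose members are far apart in vector space or in time."""
    distance, span = cluster_diagnostics(vectors, published_times)
    return distance > max_distance or span > max_span

vectors = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
times = [datetime(2019, 11, 7, 10), datetime(2019, 11, 7, 14), datetime(2019, 11, 8, 9)]
print(cluster_diagnostics(vectors, times))
```

If many clusters are flagged, the distance threshold is probably too large; if almost all clusters contain a single document, it is probably too small.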

<!--@slideshow slide-->
## Colab demo: find clusters using agglomerative clustering

In [0]:
def find_clusters(samples, vectorizer, distance_threshold=1.):
    # vectors
    print("Find vectors")
    texts = [s.text for s in samples]
    tfidf_vectors = vectorizer.transform(texts)
    # sparse -> dense
    from sklearn.decomposition import TruncatedSVD
    tfidf_vectors_svd = TruncatedSVD(n_components=300).fit_transform(tfidf_vectors)

    # clusters
    print("Find clusters")
    from sklearn.cluster import AgglomerativeClustering
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=distance_threshold)
    clusters = clustering.fit_predict(tfidf_vectors_svd)

    from itertools import combinations
    from scipy.spatial.distance import euclidean
    from collections import defaultdict
    import numpy as np

    print("Group into clusters")
    cluster2samples = defaultdict(list)
    cluster2vectors = defaultdict(list)

    for c, s, v in zip(clusters, samples, tfidf_vectors_svd):
        cluster2samples[c].append(s)
        cluster2vectors[c].append(v)

    rv = []
    for c in sorted(cluster2samples.keys()):
        # the most relevant title is the closest to the average vector
        # titles are ranked by the distance to the average vector
        avg_vec = sum(cluster2vectors[c]) / len(cluster2vectors[c])
        distances = [euclidean(avg_vec, v) for v in cluster2vectors[c]]
        samples_order = np.argsort(distances)
        cluster_samples = [cluster2samples[c][i] for i in samples_order]
        # after sorting, the first sample is the closest to the average vector
        best_title = cluster_samples[0].title
        
        cluster_published_times = [s.published_time for s in cluster2samples[c]]
        # max difference between published time
        published_time_diff = max(cluster_published_times) - min(cluster_published_times)
        
        rv.append((cluster_samples, best_title, published_time_diff, avg_vec))

    return rv
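
The title selection rule used in the function (rank documents by the distance to the average vector of the cluster) can be checked on toy data. This is a standalone sketch in plain Python; the vectors and titles are made up.

```python
from math import dist

def rank_by_centrality(vectors, titles):
    """Order titles by the distance of their vectors to the cluster centroid."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    order = sorted(range(len(vectors)), key=lambda i: dist(centroid, vectors[i]))
    return [titles[i] for i in order]

# Made-up 2-d vectors: the centroid is (2, 0), so "B" is the most central title.
print(rank_by_centrality([(0, 0), (1, 0), (5, 0)], ["A", "B", "C"]))
```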

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english").fit([s.text for s in samples_en])

In [0]:
# select samples with "Society" category
samples_en_society = []
for s, p in zip(samples_en, predictions):
    if p == "Society":
        samples_en_society.append(s)
len(samples_en_society)

13393

In [0]:
clusters_en_society = find_clusters(samples_en_society, tfidf_vectorizer, distance_threshold=1)

Find vectors
Find clusters
Group into clusters


In [0]:
print(f"num samples: {len(samples_en_society)}, num clusters: {len(clusters_en_society)}")

In [0]:
# time publishing difference distribution
import matplotlib.pyplot as plt
import numpy as np
# total_seconds() includes whole days, unlike the .seconds attribute
time_publishing_difference_hours = [x[2].total_seconds() / 3600 for x in clusters_en_society]
print(f"avg time publishing difference: {np.mean(time_publishing_difference_hours)} hours")
plt.hist(time_publishing_difference_hours, bins=100)
plt.show()
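
When converting a time difference to hours, note that `timedelta.seconds` holds only the sub-day remainder, while `timedelta.total_seconds()` covers the whole span. A small check with an arbitrary example span:

```python
from datetime import timedelta

span = timedelta(days=2, hours=3)
hours_from_seconds = span.seconds / 3600        # ignores the 2 whole days
hours_from_total = span.total_seconds() / 3600  # covers the full span
print(hours_from_seconds, hours_from_total)
```

For clusters spanning more than a day, the first variant would report a misleadingly small difference.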

Find largest clusters

In [0]:
largest_cluster_idxs = list(sorted(
    range(len(clusters_en_society)),
    key=lambda i: len(set([_.title for _ in clusters_en_society[i][0]])),
    reverse=True
))
largest_cluster_idxs[:5]

Print the largest cluster

In [0]:
[s.title for s in clusters_en_society[largest_cluster_idxs[0]][0]]

Find similar clusters

In [0]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

cluster_vectors = np.vstack([x[3] for x in clusters_en_society])
nbrs = NearestNeighbors().fit(cluster_vectors)

from pprint import pprint
for cluster_idx in range(0, len(clusters_en_society), 1000):  # every 1000th cluster
# for cluster_idx in largest_cluster_idxs[:10]:  # largest clusters
    distances, indices = nbrs.kneighbors(cluster_vectors[cluster_idx, :].reshape(1, -1))
    nearest_cluster_idx = indices[0][1]
    cluster_samples_titles = [s.title for s in clusters_en_society[cluster_idx][0]]
    nearest_cluster_samples_titles = [s.title for s in clusters_en_society[nearest_cluster_idx][0]]
    pprint(cluster_samples_titles)
    pprint(nearest_cluster_samples_titles)
    print("-" * 50)

<!--@slideshow slide-->

# Recommended resources
- https://contest.com/docs/data_clustering
- https://contest.com/docs/data_clustering2
- https://github.com/IlyaGusev/tgcontest