In this post from the Knoyd blog series we focus on text mining techniques. The main goal is to explore articles on different topics. The articles were scraped from the websites www.sme.sk, www.pravda.sk and www.dennikn.sk. As you might have already noticed, the tricky part is that we will be dealing with articles in Slovak, which makes our job a little harder. We will compare two techniques for building a clustering model for the articles and see how well they perform.

## Data gathering

To get our data, we use the BeautifulSoup package to scrape the text as well as the category of each article. BeautifulSoup parses a webpage using its HTML tags, which makes it very simple to access separate parts of the page. The category of an article is extracted from its URL. For example, from the URL https://tech.sme.sk/c/20652556/vymena-baterii-namiesto-nabijania-elektroauta-sa-mozu-zmenit.html?ref=trz we can identify tech as the topic of the article.
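As a rough illustration of the URL step, the category can be read from the subdomain with the standard library alone. This is a minimal sketch in Python 3, not the scraper used for this post: the helper name `category_from_url` is hypothetical, and it assumes the category is always the first part of the host name, which may not hold for every site we scraped.

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Return the subdomain of an article URL as its category, or None."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # "tech.sme.sk" -> ["tech", "sme", "sk"]; "www.sme.sk" carries no category
    if len(parts) > 2 and parts[0] != "www":
        return parts[0]
    return None


url = ("https://tech.sme.sk/c/20652556/"
       "vymena-baterii-namiesto-nabijania-elektroauta-sa-mozu-zmenit.html?ref=trz")
print(category_from_url(url))
```

Sites that encode the category in the URL path rather than in the subdomain would need a different rule.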

Using our own scraper we were able to download the text and category of xxxxxx articles. We will use two unsupervised techniques to identify the clusters and topics of those articles. Afterwards we will use the extracted categories to evaluate the models. You can see a sample of our data below.

In [1]:
import glob
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [2]:
data_path = "/Users/jurajkapasny/Data/sk_text_for_api"

In [3]:
def load_data(path):
    # Collect every CSV file in the directory
    all_files = glob.glob(path + "/*.csv")
    print("Number of files to process: %d" % len(all_files))
    frames = []
    i = 0
    for file_ in all_files:
        # Keep only article files and skip those with "temo" in their name
        if "articles" in file_ and "temo" not in file_:
            i += 1
            if i % 50 == 0:
                print("Processing file %d" % i)
            temp = pd.read_csv(file_, sep="|", index_col=None, header=0, parse_dates=True, low_memory=False)
            frames.append(temp)
    # Stack all files into one DataFrame with a fresh index
    df = pd.concat(frames).reset_index(drop=True)
    return df

In [4]:
df = load_data(data_path)

Number of files to process: 189
Processing file 50
Processing file 100
Processing file 150


In [5]:
df.head()

Unnamed: 0,label,summary,text
0,,"Placebo má 20 rokov, oslavy boli aj v Bratisla...",\nPozrite si atmosféru z víkendového koncertu ...
1,,"Treba zakázať Uber len preto, že nezapadá do s...",\nV jednom svete najskôr všetko dovolí parlame...
2,techbox,Samsung už v minulosti skúšal zaujať SUHD tele...,"\nVerím, že mnohí z vás v tom majú jasno, no u..."
3,,Izrael opäť zaútočil na sýrske vojenské ciele ...,"\n po tom, čo na ním kontrolované Golanské výš..."
4,,Basketbalistky Španielska sa stali víťazkami m...,\n v Česku. Vo finálovom súboji triumfovali na...


In [6]:
df_clean = df[(df.label.notnull()) & (df.text.notnull())][["label","text"]]

In [7]:
df_clean.head()

Unnamed: 0,label,text
2,techbox,"\nVerím, že mnohí z vás v tom majú jasno, no u..."
5,techbox,\nGmail na prispôsobenie reklamy používa prehl...
9,techbox,"\nLen máloktorá firma zmenila svet internetu, ..."
116,techbox,\nEurópska únia v poslednom čase predstavuje j...
128,techbox,\nNemecká spoločnosť Loewe predstavila novú sé...


In [8]:
len(df_clean)

28545

In [18]:
print(df_clean.iloc[1,0])
print(df_clean.iloc[1,1])

techbox

Gmail na prispôsobenie reklamy používa prehliadanie správ, ktoré sú doručené do mailovej schránky. Tomu má do konca tohto roka definitívne odzvoniť.
Mailový klient od Google patrí medzi najpopulárnejšie produkty spoločnosti. Niektorí však môžu namietať, že celý vyhľadávací gigant je len veľký obchod s reklamou a zberom osobných údajov.
Pozrite siGmailify – funkcie Gmail aj bez @gmail.com
Gmail má v sebe zakomponovaný skener, ktorý hľadá kľúčové slová. Na základe nich vie veľký brat prispôsobiť reklamy každému osobne. A i napriek tomu, že toto prehliadanie má na starosti stroj a nie človek majú niektorí z 1,2 miliardy používateľov na tento zásah do súkromia ťažké srdce. To by sa však malo už o chvíľu zmeniť.
Po tom, ako spoločnosť zrušila túto aktivitu v business verzii Gmailu, ktorý spadá do skupiny G Suite sa tak zmeny dočkáme aj my, bežní používatelia. Nové pravidlo nadobudne účinnosť do konca tohto roka.
Pozrite si7 tipov, ktoré oceníte ak používate Gmail
To samozrejme nezn

In [8]:
import json

# Load the Slovak stop words (assumes the file is a JSON array of strings)
with open("/Users/jurajkapasny/nltk_data/corpora/stopwords-sk/stopwords-sk.json", "r") as text_file:
    stopwords_sk = json.load(text_file)

  story -> word2vec -> word mover distance as similarity metric. show 3 with highest similarity

In [10]:
counts = df_clean.label.value_counts()

In [12]:
df_clean = df_clean[df_clean.label.isin(list(counts[counts > 60].index))]

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
tf = TfidfVectorizer(max_features = 1000, min_df = 100, max_df = 1000)

In [15]:
df_transformed = tf.fit_transform(df_clean["text"].values)

In [16]:
# Wrap the sparse tf-idf matrix in a DataFrame with sparse columns
# (pd.SparseDataFrame and pd.SparseSeries were removed in pandas 1.0)
df_transformed_final = pd.DataFrame.sparse.from_spmatrix(df_transformed)

## KMeans

The first approach we use is KMeans clustering. Before that, we need to transform the text into numeric features. We use the tf-idf algorithm for this, which transforms the documents into numeric vectors. The name is short for term frequency and inverse document frequency: it counts how often each term occurs in an article and compares that with how common the term is across all documents. For more information about the algorithm you can visit: 
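To make the idea concrete, here is a minimal sketch of the textbook tf-idf weighting in plain Python 3. The toy documents and the helper name `tfidf` are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalises each row by default, so its numbers will differ from this simplified version.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # term frequency * inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical): "dnes" occurs in every document
docs = [
    "dnes gmail reklama google".split(),
    "dnes hokej zapas gol".split(),
    "dnes google telefon reklama google".split(),
]
weights = tfidf(docs)
print(weights[0]["dnes"], weights[1]["hokej"], weights[2]["google"])
```

A term that appears in every document gets weight zero, while a term concentrated in a single article gets a high weight. That is the property that makes these vectors useful as clustering input.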

In [18]:
from sklearn.cluster import KMeans

In [19]:
# Dense feature matrix (DataFrame.as_matrix() was removed in pandas 1.0)
X = df_transformed_final.values

cl_10 = KMeans(n_clusters = 10)
cl_11 = KMeans(n_clusters = 11)
cl_12 = KMeans(n_clusters = 12)
cl_13 = KMeans(n_clusters = 13)
cl_14 = KMeans(n_clusters = 14)
cl_15 = KMeans(n_clusters = 15)
cl_16 = KMeans(n_clusters = 16)
cl_17 = KMeans(n_clusters = 17)
cl_18 = KMeans(n_clusters = 18)
res_18 = cl_18.fit_predict(X)
res_17 = cl_17.fit_predict(X)
res_10 = cl_10.fit_predict(X)
res_11 = cl_11.fit_predict(X)
res_12 = cl_12.fit_predict(X)
res_13 = cl_13.fit_predict(X)
res_14 = cl_14.fit_predict(X)
res_15 = cl_15.fit_predict(X)
res_16 = cl_16.fit_predict(X)

In [20]:
cl_19 = KMeans(n_clusters = 19)
cl_20 = KMeans(n_clusters = 20)
# cl_21 = KMeans(n_clusters = 21)
# cl_22 = KMeans(n_clusters = 22)
res_19 = cl_19.fit_predict(df_transformed_final.values)
res_20 = cl_20.fit_predict(df_transformed_final.values)
# res_21 = cl_21.fit_predict(df_transformed_final.as_matrix())
# res_22 = cl_22.fit_predict(df_transformed_final.as_matrix())

In [21]:
print("Inertia for KMeans with 10 clusters =  %lf "%(cl_10.inertia_))
print("Inertia for KMeans with 11 clusters = %lf "%(cl_11.inertia_))
print("Inertia for KMeans with 12 clusters = %lf " %(cl_12.inertia_))
print("Inertia for KMeans with 13 clusters = %lf " %(cl_13.inertia_))
print("Inertia for KMeans with 14 clusters =  %lf "%(cl_14.inertia_))
print("Inertia for KMeans with 15 clusters = %lf "%(cl_15.inertia_))
print("Inertia for KMeans with 16 clusters = %lf "%(cl_16.inertia_))
print("Inertia for KMeans with 17 clusters = %lf " %(cl_17.inertia_))
print("Inertia for KMeans with 18 clusters = %lf " %(cl_18.inertia_))
print("Inertia for KMeans with 19 clusters = %lf "%(cl_19.inertia_))
print("Inertia for KMeans with 20 clusters = %lf "%(cl_20.inertia_))
# print("Inertia for KMeans with 21 clusters = %lf " %(cl_21.inertia_))
# print("Inertia for KMeans with 22 clusters = %lf " %(cl_22.inertia_))

Inertia for KMeans with 10 clusters =  23260.223711 
Inertia for KMeans with 11 clusters = 23175.115672 
Inertia for KMeans with 12 clusters = 23066.631996 
Inertia for KMeans with 13 clusters = 22997.639527 
Inertia for KMeans with 14 clusters =  22913.975833 
Inertia for KMeans with 15 clusters = 22847.225592 
Inertia for KMeans with 16 clusters = 22779.764281 
Inertia for KMeans with 17 clusters = 22693.061253 
Inertia for KMeans with 18 clusters = 22623.904487 
Inertia for KMeans with 19 clusters = 22544.504475 
Inertia for KMeans with 20 clusters = 22519.283904 


Inertia, the sum of squared distances from each article to its nearest cluster centre, gets smaller as the number of clusters increases. We expected this behavior: adding clusters generally lowers inertia, and there are many possible article topics, so it would be hard to fit them into 10 clusters.
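For readers who want to see what the inertia number measures, here is a small sketch that computes it by hand with NumPy. The four points and the centroids are invented for illustration. The centroids are picked by hand rather than fitted, so this shows only the formula, not the KMeans algorithm itself.

```python
import numpy as np


def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # Pairwise differences, shape (n_points, n_centroids, n_features)
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Each point contributes the distance to its closest centroid only
    return sq_dists.min(axis=1).sum()


# Two well-separated pairs of points (toy data)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
one_centroid = np.array([[5.0, 1.0]])
two_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

print(inertia(points, one_centroid))   # 104.0
print(inertia(points, two_centroids))  # 4.0
```

Going from one centroid to two cuts the inertia sharply here because the data really has two groups. In our article data the decrease per added cluster is small, and that matches the low silhouette scores.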

In [22]:
from sklearn.metrics import silhouette_score

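def get_silhouette_score(data, model):
    # Note: fit_predict re-fits the model on the data before scoring
    cluster_labels = model.fit_predict(data)
    # Mean silhouette coefficient over all samples (ranges from -1 to 1)
    score = silhouette_score(data, cluster_labels)
    return score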

In [23]:
# Silhouette score for each model, from 10 to 20 clusters
for model in [cl_10, cl_11, cl_12, cl_13, cl_14, cl_15, cl_16, cl_17, cl_18, cl_19, cl_20]:
    print(get_silhouette_score(df_transformed_final, model))

0.0210335557488
0.0214800705955
0.0218457408825
0.0237952614046
0.0244904783978
0.024998723227
0.0246250285108
0.0271721475043
0.0285557281953
0.0280178642222
0.0300612505969


In [25]:
# saving the results
import pickle

In [27]:
out_dir = "/Users/jurajkapasny/Code/GitHub/jurajkapasny/text_analytics"
models = {10: cl_10, 11: cl_11, 12: cl_12, 13: cl_13, 14: cl_14, 15: cl_15,
          16: cl_16, 17: cl_17, 18: cl_18, 19: cl_19, 20: cl_20}
for k, model in models.items():
    # Write each fitted model to cl_<k>.p and close the file afterwards
    with open("%s/cl_%d.p" % (out_dir, k), "wb") as f:
        pickle.dump(model, f)

In [79]:
cl_40 = KMeans(n_clusters = 40)
res_40 = cl_40.fit_predict(df_transformed_final.values)

In [80]:
print("Inertia for KMeans with 40 clusters = %lf "%(cl_40.inertia_))
print(get_silhouette_score(df_transformed_final, cl_40))

Inertia for KMeans with 40 clusters = 21476.973524 
0.0419544779021


## Evaluation

In [28]:
from copy import deepcopy

In [29]:
df_res = deepcopy(df_clean)
df_res["res15"] = res_15
df_res["res16"] = res_16
df_res["res17"] = res_17
df_res["res18"] = res_18
df_res["res19"] = res_19
df_res["res20"] = res_20

In [147]:
df_res[df_res.res20 == 14].label.value_counts()

svet                      1710
ekonomika                 1321
domace                    1285
regiony                    446
cestovny-ruch              265
mesta                      181
kniha                      160
vesmir                     143
hory                       139
komentare                  135
komentare-a-glosy          134
zamestnanie                127
nasaprievidza              121
tech                       121
zem                        113
nasazilina                 112
nasturiec                  110
magazin                    105
nasanitra                  102
nasekysuce                  99
aktuality                   94
film-a-televizia            94
nasapovazska                93
novinky                     90
spolocnost                  85
nasaorava                   84
nastrencin                  75
zdravy-zivot                72
nasabystrica                70
analyzy-a-postrehy          68
                          ... 
topolcanyinfo               52
domov   

In [23]:
import plotly.plotly as py  # in plotly >= 4 this module lives in chart_studio.plotly
import plotly.graph_objs as go
py.sign_in('jurajkapasny', 'YOUR_API_KEY')  # API key redacted

In [21]:
# Inertia

inertias = [23260.223711,23175.115672 ,23066.631996 ,22997.639527 ,22913.975833 ,22847.225592 ,22779.764281 
    ,22693.061253 ,22623.904487 ,22544.504475 ,22519.283904 ]
x_axis = range(10,21)

In [33]:
trace = go.Scatter(
    x = x_axis,
    y = inertias,
    mode = 'lines',
    name = 'lines',
    marker=dict(
                color="#00cedc")
)
layout = go.Layout(
    title='Inertia',
)
data = [trace]

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='styled-line')

In [30]:
# Silhouette

sillhoutte = [0.0210335557488,0.0214800705955,0.0218457408825,0.0237952614046,0.0244904783978
,0.024998723227,0.0246250285108,0.0271721475043,0.0285557281953,0.0280178642222,0.0300612505969]
x_axis = range(10,21)

In [32]:
trace = go.Scatter(
    x = x_axis,
    y = sillhoutte,
    mode = 'lines',
    name = 'lines',
    marker=dict(
                color="#00cedc")
)
layout = go.Layout(
    title='Silhouette Score',
)
data = [trace]

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='styled-line-silhouette')

In [149]:
data = [go.Bar(
            x=df_res[df_res.res20 == 0].label.value_counts().index[:5],
            y=df_res[df_res.res20 == 0].label.value_counts()[:5] / df_res[df_res.res20 == 0].label.value_counts().sum(),
            marker=dict(
                color="#00cedc")
    )]

# data = [trace0]
layout = go.Layout(
    title='Cluster 0 (Health and Food)',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar-0')

In [150]:
data = [go.Bar(
            x=df_res[df_res.res20 == 9].label.value_counts().index[:5],
            y=df_res[df_res.res20 == 9].label.value_counts()[:5] / df_res[df_res.res20 == 9].label.value_counts().sum(),
            marker=dict(
                color="#cb4b4a")
    )]

# data = [trace0]
layout = go.Layout(
    title='Cluster 9 (Sport)',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar-9')

In [151]:
data = [go.Bar(
            x=df_res[df_res.res20 == 18].label.value_counts().index[:5],
            y=df_res[df_res.res20 == 18].label.value_counts()[:5] / df_res[df_res.res20 == 18].label.value_counts().sum(),
            marker=dict(
                color="#cb4b4a")
    )]

# data = [trace0]
layout = go.Layout(
    title='Cluster 18',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar-18')

In [152]:
data = [go.Bar(
            x=df_res[df_res.res20 == 8].label.value_counts().index[:5],
            y=df_res[df_res.res20 == 8].label.value_counts()[:5] / df_res[df_res.res20 == 8].label.value_counts().sum(),
            marker=dict(
                color="#00cedc")
    )]

# data = [trace0]
layout = go.Layout(
    title='Cluster 8 (Relationships)',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar-8')

In [153]:
data = [go.Bar(
            x=df_res[df_res.res20 == 11].label.value_counts().index[:5],
            y=df_res[df_res.res20 == 11].label.value_counts()[:5] / df_res[df_res.res20 == 11].label.value_counts().sum(),
            marker=dict(
                color="#00cedc")
    )]

# data = [trace0]
layout = go.Layout(
    title='Cluster 11 (World)',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar-11')

In [154]:
data = [go.Bar(
            x=df_res[df_res.res20 == 10].label.value_counts().index[:5],
            y=df_res[df_res.res20 == 10].label.value_counts()[:5] / df_res[df_res.res20 == 10].label.value_counts().sum(),
            marker=dict(
                color="#cb4b4a")
    )]

# data = [trace0]
layout = go.Layout(
    title='Cluster 10 (Economics)',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar-10')

In [155]:
data = [go.Bar(
            x=df_res[df_res.res20 == 14].label.value_counts().index[:5],
            y=df_res[df_res.res20 == 14].label.value_counts()[:5] / df_res[df_res.res20 == 14].label.value_counts().sum(),
            marker=dict(
                color="#cb4b4a")
    )]

# data = [trace0]
layout = go.Layout(
    title='Cluster 14 (Mix)',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar-14')