## K means clustering

### Loading Data

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 150)

df = pd.read_csv("/home/gtrane/ClassData/group03/html_project.csv", sep="\t")
df

Unnamed: 0,file_name,article_title,article_author,company_author,date,body_text
0,Panama Reports Country's First Monkeypox Case.html,Panama Reports Country's First Monkeypox Case,Elida Moreno,Reuters Health Information,"July 06, 2022",PANAMA CITY (Reuters) - Panama registered its first case of monkeypox in a resident who was infected after being in contact with tourists from Eur...
1,How Doctors Can Manage Their Daily Work Stress.html,How Doctors Can Manage Their Daily Work Stress,Rachel Reiff Ellis,Medscape Medical News,"March 23, 2023","As a physician, you may ruminate over an interaction with a patient or worry about a complicated procedure that didn't go as expected. You work th..."
2,Climate Change Projected to Fuel Rise in Suicide Deaths.html,Climate Change Projected to Fuel Rise in Suicide Deaths,Megan Brooks,Medscape Medical News,"March 29, 2023","The warming of the planet may mean more suicides, new research suggests. New findings show a significant association between higher temperatures a..."
3,A Moral Compass Is Apparent Even in Infants.html,A Moral Compass Is Apparent Even in Infants,Medscape Staff,Quick Take,"June 13, 2022","Even babies as young as 8 months old can recognize bad behavior and punish it, according to researchers at Osaka University in Japan. What to kno..."
4,Ohio Measles Outbreak Sickens Nearly 60 Children.html,Ohio Measles Outbreak Sickens Nearly 60 Children,Lisa O'Mary,WebMD Health News,"December 07, 2022",Measles has sickened 59 children in an outbreak that began in November and now spans four Ohio counties. None of the children had been fully vacci...
...,...,...,...,...,...,...
8760,EMPA-Kidney Moves the Needle for SGLT2 Inhibitors in Kidney Disease.html,EMPA-Kidney Moves the Needle for SGLT2 Inhibitors in Kidney Disease,"Mitchel L. Zoler, PhD",Medscape Medical News,"November 04, 2022","Dr William Herrington ORLANDO, Florida — The sodium-glucose cotransporter 2 (SGLT2) inhibitor empagliflozin (Jardiance) significantly slowed prog..."
8761,Paxlovid Reduces Risk of COVID Death by 79% in Older Adults.html,Paxlovid Reduces Risk of COVID Death by 79% in Older Adults: Study,Carolyn Crist,WebMD Health News,"August 26, 2022",Editor's note: Find the latest COVID-19 news and guidance in Medscape's Coronavirus Resource Center. The antiviral drug Paxlovid appears to reduc...
8762,"'Stunning Variation' in CV Test, Procedure Cost at US Hospitals.html","'Stunning Variation' in CV Test, Procedure Costs Revealed at Top US Hospitals",Marilynn Larkin,Medscape Medical News,"July 22, 2022","Wide variation in the cost of common cardiovascular (CV) tests and procedures, from stress tests to coronary interventions, was revealed in a cros..."
8763,Patients With Schizophrenia Twice as Likely to Develop Dementia?.html,Patients With Schizophrenia Twice as Likely to Develop Dementia?,"Batya Swift Yasgur, MA, LSW",Medscape Medical News,"October 24, 2022",Patients with psychotic disorders such as schizophrenia are more than twice as likely as those without a psychotic disorder to eventually develop ...


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8765 entries, 0 to 8764
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   file_name       8765 non-null   object
 1   article_title   8765 non-null   object
 2   article_author  8581 non-null   object
 3   company_author  8757 non-null   object
 4   date            8763 non-null   object
 5   body_text       8763 non-null   object
dtypes: object(6)
memory usage: 411.0+ KB


## Preparing Data for Modeling

In [3]:
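# Cast to str so the vectorizer accepts every row. Note that the two missing
# bodies (NaN) become the literal string "nan" rather than an empty document.
df.body_text = df.body_text.astype(str)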

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, norm="l2", stop_words="english", max_df=0.7)
X = vectorizer.fit_transform(df.body_text)
X

<8765x77727 sparse matrix of type '<class 'numpy.float64'>'
	with 2317610 stored elements in Compressed Sparse Row format>

In [5]:
X.shape

(8765, 77727)

There are 8,765 documents, or samples, and 77,727 words, or features. 

## K-Means Clustering

### Step 1. Choose the number of clusters

In [6]:
k = 22
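
The notebook does not say how `k = 22` was chosen. One common heuristic, sketched here on synthetic 2-D data rather than the article corpus, is to fit several candidate values and compare silhouette scores. The data, the candidate range, and the variable names are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (an assumption for illustration)
points, _ = make_blobs(
    n_samples=150,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for candidate_k in range(2, 7):
    model = KMeans(n_clusters=candidate_k, n_init=10, random_state=0)
    labels = model.fit_predict(points)
    scores[candidate_k] = silhouette_score(points, labels)

# The candidate with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On a corpus of this size the pairwise distances are expensive, so passing `sample_size` to `silhouette_score` may be necessary. Text clusters also tend to score much lower than clean synthetic blobs, so the score is a guide rather than a verdict.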

### Step 2. Initialize a model object with initial parameters

In [7]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=k, random_state=0)
kmeans

KMeans(n_clusters=22, random_state=0)

### Step 3. Fit the model on the input data

In [8]:
%time kmeans.fit(X)

CPU times: user 2min 37s, sys: 1.11 s, total: 2min 38s
Wall time: 27.4 s


KMeans(n_clusters=22, random_state=0)

### Step 4. Get the clustering outcome

In [9]:
kmeans.predict(X)

array([16, 15,  8, ...,  4, 18, 12], dtype=int32)

In [10]:
df["label"] = kmeans.predict(X)
df

Unnamed: 0,file_name,article_title,article_author,company_author,date,body_text,label
0,Panama Reports Country's First Monkeypox Case.html,Panama Reports Country's First Monkeypox Case,Elida Moreno,Reuters Health Information,"July 06, 2022",PANAMA CITY (Reuters) - Panama registered its first case of monkeypox in a resident who was infected after being in contact with tourists from Eur...,16
1,How Doctors Can Manage Their Daily Work Stress.html,How Doctors Can Manage Their Daily Work Stress,Rachel Reiff Ellis,Medscape Medical News,"March 23, 2023","As a physician, you may ruminate over an interaction with a patient or worry about a complicated procedure that didn't go as expected. You work th...",15
2,Climate Change Projected to Fuel Rise in Suicide Deaths.html,Climate Change Projected to Fuel Rise in Suicide Deaths,Megan Brooks,Medscape Medical News,"March 29, 2023","The warming of the planet may mean more suicides, new research suggests. New findings show a significant association between higher temperatures a...",8
3,A Moral Compass Is Apparent Even in Infants.html,A Moral Compass Is Apparent Even in Infants,Medscape Staff,Quick Take,"June 13, 2022","Even babies as young as 8 months old can recognize bad behavior and punish it, according to researchers at Osaka University in Japan. What to kno...",8
4,Ohio Measles Outbreak Sickens Nearly 60 Children.html,Ohio Measles Outbreak Sickens Nearly 60 Children,Lisa O'Mary,WebMD Health News,"December 07, 2022",Measles has sickened 59 children in an outbreak that began in November and now spans four Ohio counties. None of the children had been fully vacci...,11
...,...,...,...,...,...,...,...
8760,EMPA-Kidney Moves the Needle for SGLT2 Inhibitors in Kidney Disease.html,EMPA-Kidney Moves the Needle for SGLT2 Inhibitors in Kidney Disease,"Mitchel L. Zoler, PhD",Medscape Medical News,"November 04, 2022","Dr William Herrington ORLANDO, Florida — The sodium-glucose cotransporter 2 (SGLT2) inhibitor empagliflozin (Jardiance) significantly slowed prog...",21
8761,Paxlovid Reduces Risk of COVID Death by 79% in Older Adults.html,Paxlovid Reduces Risk of COVID Death by 79% in Older Adults: Study,Carolyn Crist,WebMD Health News,"August 26, 2022",Editor's note: Find the latest COVID-19 news and guidance in Medscape's Coronavirus Resource Center. The antiviral drug Paxlovid appears to reduc...,2
8762,"'Stunning Variation' in CV Test, Procedure Cost at US Hospitals.html","'Stunning Variation' in CV Test, Procedure Costs Revealed at Top US Hospitals",Marilynn Larkin,Medscape Medical News,"July 22, 2022","Wide variation in the cost of common cardiovascular (CV) tests and procedures, from stress tests to coronary interventions, was revealed in a cros...",4
8763,Patients With Schizophrenia Twice as Likely to Develop Dementia?.html,Patients With Schizophrenia Twice as Likely to Develop Dementia?,"Batya Swift Yasgur, MA, LSW",Medscape Medical News,"October 24, 2022",Patients with psychotic disorders such as schizophrenia are more than twice as likely as those without a psychotic disorder to eventually develop ...,18


In [11]:
df.label.value_counts()                          # Count the number of values for each cluster label 

8     1444
12    1165
15     849
3      628
13     491
4      461
1      411
2      350
17     345
9      310
6      282
20     277
11     211
0      210
21     203
7      199
16     187
19     177
18     174
5      164
14     149
10      78
Name: label, dtype: int64

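The two largest clusters (labels 8 and 12, with 1,444 and 1,165 articles) are considerably larger than the others. The remaining clusters are not evenly sized either: they range from 849 articles down to 78.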

### Step 5. Evaluate the performance of the model

In [12]:
cluster_1st, cluster_2nd, cluster_3rd, cluster_4th, cluster_5th, cluster_6th,cluster_7th,cluster_8th,cluster_9th,cluster_10th,cluster_11th,cluster_12th,cluster_13th,cluster_14th,cluster_15th,cluster_16th,cluster_17th,cluster_18th,cluster_19th,cluster_20th,cluster_21st,cluster_22nd  = df.label.value_counts().index
cluster_1st, cluster_2nd, cluster_3rd, cluster_4th, cluster_5th, cluster_6th,cluster_7th,cluster_8th,cluster_9th,cluster_10th,cluster_11th,cluster_12th,cluster_13th,cluster_14th,cluster_15th,cluster_16th,cluster_17th,cluster_18th,cluster_19th,cluster_20th,cluster_21st,cluster_22nd

(8, 12, 15, 3, 13, 4, 1, 2, 17, 9, 6, 20, 11, 0, 21, 7, 16, 19, 18, 5, 14, 10)
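
Unpacking into 22 separate variables works, but it has to be rewritten whenever `k` changes. A possible alternative, sketched with made-up labels, keeps the ranking in one list and indexes into it. The names `toy_labels` and `ranked_labels` are assumptions.

```python
import pandas as pd

# Toy labels (made up); in the notebook this would be df.label
toy_labels = pd.Series([8, 8, 8, 12, 12, 3])

# Labels ordered from the largest cluster to the smallest
ranked_labels = toy_labels.value_counts().index.tolist()

largest = ranked_labels[0]     # plays the role of cluster_1st
smallest = ranked_labels[-1]   # plays the role of cluster_22nd
print(ranked_labels)
```

With this layout the n-th largest cluster is `ranked_labels[n - 1]`, and a loop over `ranked_labels` could replace the repeated cells that follow.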

In [13]:
#df[df.label == cluster_1st].sample(10, random_state=0)[["body_text", "label"]]     # the largest cluster

In [14]:
import nltk
df["words"] = df.body_text.apply(lambda x: nltk.word_tokenize(x))
df["tagged_words"] = df.words.apply(lambda x: nltk.pos_tag(x))

from collections import Counter

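def get_counter(dataframe, stopwords=()):
    """Count, for each word, the number of documents in which it appears."""
    stopwords = set(stopwords)          # set membership is faster than a list scan
    counter = Counter()

    for tagged in dataframe.tagged_words:
        # Build a set so that each word is counted at most once per document
        word_set = {word.lower() for word, _tag in tagged
                    if word.lower() not in stopwords}
        counter.update(word_set)

    return counter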

from nltk.corpus import stopwords
import string

global_stopwords = stopwords.words("english") 
local_stopwords = [c for c in string.punctuation] +\
                  ['``',"''","'s",'also','md','medscape','said','2022',"n't",'may','reuters','medical','would','like']
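
Because `get_counter` collects each document's words in a set before updating the counter, the numbers it reports are document frequencies, not raw occurrence counts. A small sketch with two made-up, already tokenized documents shows the difference.

```python
from collections import Counter

# Two made-up documents, already tokenized; "covid" is repeated in the first
toy_docs = [
    ["covid", "covid", "covid", "vaccine"],
    ["covid", "booster"],
]

term_frequency = Counter()       # counts every occurrence
document_frequency = Counter()   # counts each word once per document

for tokens in toy_docs:
    term_frequency.update(tokens)
    document_frequency.update(set(tokens))

print(term_frequency["covid"])       # 4
print(document_frequency["covid"])   # 2
```

So a count such as `('university', 884)` for a cluster of 1,444 articles means that 884 of those articles mention the word at least once.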

In [15]:
counter = get_counter(df[df.label == cluster_1st], global_stopwords+local_stopwords)
counter.most_common(30)

[('university', 884),
 ('health', 881),
 ('new', 869),
 ('one', 869),
 ('news', 843),
 ('study', 821),
 ('twitter', 811),
 ('facebook', 779),
 ('research', 767),
 ('published', 765),
 ('found', 758),
 ('years', 745),
 ('according', 737),
 ('people', 731),
 ('follow', 730),
 ('patients', 713),
 ('could', 707),
 ('time', 697),
 ('researchers', 668),
 ('youtube', 646),
 ('instagram', 645),
 ('use', 631),
 ('used', 624),
 ('many', 610),
 ('including', 605),
 ('first', 596),
 ('us', 580),
 ('two', 579),
 ('well', 574),
 ('risk', 560)]

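The largest cluster has no single medical theme. Its most frequent words are generic research and news vocabulary (university, study, research, published) plus social media names (twitter, facebook, youtube, instagram), so it looks like a catch-all for general health news.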
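
Words such as "twitter" and "health" rank high in many clusters, which makes the raw top-30 lists hard to tell apart. One way to surface what is specific to a cluster, sketched here with made-up counts, is to divide the share of cluster documents containing a word by the share of all documents containing it. The function `distinctive_terms` and its parameters are hypothetical helpers, not part of the notebook.

```python
from collections import Counter

def distinctive_terms(cluster_counts, corpus_counts, n_cluster_docs, n_corpus_docs, top_n=5):
    """Rank words by (share of cluster docs with the word) / (share of all docs with the word)."""
    lift = {}
    for word, count in cluster_counts.items():
        cluster_share = count / n_cluster_docs
        # Every cluster word also occurs in the corpus, so this is never zero
        corpus_share = corpus_counts[word] / n_corpus_docs
        lift[word] = cluster_share / corpus_share
    return sorted(lift, key=lift.get, reverse=True)[:top_n]

# Made-up document frequencies: "twitter" is common everywhere, "monkeypox" only in the cluster
corpus_counts = Counter({"twitter": 80, "health": 90, "monkeypox": 10})
cluster_counts = Counter({"twitter": 8, "health": 9, "monkeypox": 9})

ranked = distinctive_terms(cluster_counts, corpus_counts, n_cluster_docs=10, n_corpus_docs=100)
print(ranked)
```

In the notebook, `corpus_counts` could come from `get_counter(df, ...)` and `cluster_counts` from the per-cluster call, with the document totals taken from `len(df)` and the cluster size.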

In [16]:
#df[df.label == cluster_2nd].sample(10, random_state=0)[["body_text", "label"]]     # the second largest cluster

In [17]:
counter = get_counter(df[df.label == cluster_2nd], global_stopwords+local_stopwords)
counter.most_common(30)

[('patients', 1130),
 ('study', 1080),
 ('university', 892),
 ('treatment', 872),
 ('new', 825),
 ('research', 821),
 ('clinical', 811),
 ('news', 791),
 ('results', 788),
 ('years', 779),
 ('twitter', 765),
 ('data', 760),
 ('facebook', 757),
 ('one', 739),
 ('published', 719),
 ('compared', 700),
 ('disease', 697),
 ('patient', 690),
 ('reported', 687),
 ('researchers', 675),
 ('findings', 673),
 ('noted', 652),
 ('health', 645),
 ('studies', 643),
 ('financial', 631),
 ('including', 630),
 ('two', 630),
 ('relevant', 629),
 ('group', 622),
 ('however', 622)]

The second largest cluster seems to be about clinical research and findings.

In [18]:
#df[df.label == cluster_3rd].sample(10, random_state=0)[["body_text", "label"]]     # the third largest cluster

In [19]:
counter = get_counter(df[df.label == cluster_3rd], global_stopwords+local_stopwords)
counter.most_common(30)

[('health', 729),
 ('care', 707),
 ('patients', 650),
 ('new', 628),
 ('news', 616),
 ('one', 608),
 ('twitter', 592),
 ('facebook', 584),
 ('follow', 581),
 ('many', 556),
 ('instagram', 550),
 ('youtube', 546),
 ('people', 537),
 ('time', 527),
 ('years', 519),
 ('patient', 516),
 ('physicians', 508),
 ('healthcare', 501),
 ('according', 500),
 ('could', 499),
 ('need', 497),
 ('medicine', 467),
 ('physician', 465),
 ('work', 462),
 ('help', 452),
 ('—', 438),
 ('year', 436),
 ('university', 422),
 ('us', 420),
 ('including', 417)]

It seems like the third largest cluster is about physicians, medicine, and healthcare in general

In [20]:
#df[df.label == cluster_4th].sample(10, random_state=0)[["body_text", "label"]]     # the fourth largest cluster

In [21]:
counter = get_counter(df[df.label == cluster_4th], global_stopwords+local_stopwords)
counter.most_common(30)

[('health', 503),
 ('covid-19', 474),
 ('people', 357),
 ('new', 349),
 ('reporting', 339),
 ('pandemic', 318),
 ('editing', 303),
 ('cases', 301),
 ('last', 281),
 ('public', 275),
 ('first', 252),
 ('disease', 250),
 ('since', 248),
 ('world', 244),
 ('year', 243),
 ('covid', 242),
 ('virus', 239),
 ('including', 238),
 ('reported', 237),
 ('told', 236),
 ('according', 232),
 ('one', 231),
 ('country', 222),
 ('states', 216),
 ('coronavirus', 215),
 ('u.s.', 210),
 ('government', 206),
 ('could', 205),
 ('news', 204),
 ('two', 197)]

The 4th largest cluster is COVID reporting in the U.S. and around the world.

In [22]:
#df[df.label == cluster_5th].sample(10, random_state=0)[["body_text", "label"]]     # the fifth largest cluster

In [23]:
counter = get_counter(df[df.label == cluster_5th], global_stopwords+local_stopwords)
counter.most_common(30)

[('drug', 395),
 ('company', 307),
 ('reporting', 300),
 ('health', 285),
 ('editing', 269),
 ('u.s.', 266),
 ('administration', 253),
 ('food', 252),
 ('year', 243),
 ('new', 238),
 ('patients', 236),
 ('fda', 232),
 ('drugs', 216),
 ('according', 194),
 ('last', 191),
 ('use', 188),
 ('including', 187),
 ('market', 183),
 ('could', 181),
 ('states', 177),
 ('one', 177),
 ('data', 177),
 ('products', 176),
 ('companies', 176),
 ('treatment', 175),
 ('first', 174),
 ('people', 172),
 ('two', 159),
 ('approved', 159),
 ('used', 157)]

The 5th cluster is about the FDA: drugs, market, companies, reporting

In [24]:
#df[df.label == cluster_6th].sample(10, random_state=0)[["body_text", "label"]]

In [25]:
counter = get_counter(df[df.label == cluster_6th], global_stopwords+local_stopwords)
counter.most_common(30)

[('study', 428),
 ('patients', 421),
 ('risk', 371),
 ('university', 366),
 ('twitter', 358),
 ('new', 356),
 ('facebook', 354),
 ('published', 339),
 ('data', 334),
 ('cardiology', 332),
 ('results', 330),
 ('clinical', 329),
 ('years', 326),
 ('research', 321),
 ('us', 316),
 ('cardiovascular', 315),
 ('heart', 313),
 ('follow', 310),
 ('one', 298),
 ('disease', 293),
 ('group', 282),
 ('outcomes', 281),
 ('compared', 278),
 ('health', 275),
 ('theheart.org', 270),
 ('reported', 268),
 ('studies', 265),
 ('however', 263),
 ('added', 261),
 ('associated', 259)]

The 6th cluster seems to be about published data/clinical results in cardiology

In [26]:
#df[df.label == cluster_7th].sample(10, random_state=0)[["body_text", "label"]]

In [27]:
counter = get_counter(df[df.label == cluster_7th], global_stopwords+local_stopwords)
counter.most_common(30)

[('cancer', 411),
 ('study', 340),
 ('patients', 339),
 ('twitter', 310),
 ('facebook', 307),
 ('new', 302),
 ('research', 296),
 ('published', 289),
 ('news', 288),
 ('years', 285),
 ('health', 268),
 ('risk', 267),
 ('found', 266),
 ('university', 244),
 ('follow', 239),
 ('among', 238),
 ('data', 235),
 ('treatment', 234),
 ('one', 232),
 ('youtube', 229),
 ('results', 228),
 ('instagram', 228),
 ('authors', 226),
 ('oncology', 225),
 ('researchers', 222),
 ('disease', 216),
 ('us', 215),
 ('including', 212),
 ('care', 211),
 ('cancers', 210)]

The 7th cluster is about cancer studies: oncology, treatment

In [28]:
#df[df.label == cluster_8th].sample(10, random_state=0)[["body_text", "label"]]

In [29]:
counter = get_counter(df[df.label == cluster_8th], global_stopwords+local_stopwords)
counter.most_common(30)

[('covid-19', 345),
 ('covid', 274),
 ('patients', 273),
 ('study', 272),
 ('news', 272),
 ('health', 269),
 ('people', 265),
 ('new', 258),
 ('center', 249),
 ('infection', 249),
 ('find', 242),
 ('note', 239),
 ('latest', 236),
 ('guidance', 233),
 ('resource', 228),
 ('editor', 226),
 ('symptoms', 225),
 ('research', 215),
 ('one', 215),
 ('data', 213),
 ('researchers', 213),
 ('disease', 212),
 ('published', 211),
 ('risk', 208),
 ('university', 207),
 ('coronavirus', 207),
 ('found', 198),
 ('severe', 198),
 ('among', 185),
 ('time', 184)]

The 8th cluster is about general COVID news: health, guidance, resource, risk, center, infections

In [30]:
#df[df.label == cluster_9th].sample(10, random_state=0)[["body_text", "label"]]

In [31]:
counter = get_counter(df[df.label == cluster_9th], global_stopwords+local_stopwords)
counter.most_common(30)

[('patients', 344),
 ('study', 317),
 ('cancer', 313),
 ('treatment', 303),
 ('results', 275),
 ('survival', 274),
 ('twitter', 274),
 ('overall', 273),
 ('facebook', 272),
 ('news', 259),
 ('therapy', 249),
 ('received', 244),
 ('median', 236),
 ('new', 232),
 ('months', 231),
 ('research', 226),
 ('published', 224),
 ('clinical', 224),
 ('disease', 223),
 ('university', 220),
 ('oncology', 219),
 ('trial', 214),
 ('3', 210),
 ('follow', 210),
 ('data', 206),
 ('instagram', 205),
 ('youtube', 205),
 ('two', 194),
 ('reported', 194),
 ('group', 193)]

The 9th cluster is about clinical trials: oncology, therapy, group, median, research, treatment

In [32]:
#df[df.label == cluster_10th].sample(10, random_state=0)[["body_text", "label"]]

In [33]:
counter = get_counter(df[df.label == cluster_10th], global_stopwords+local_stopwords)
counter.most_common(30)

[('vaccine', 309),
 ('covid-19', 265),
 ('vaccines', 239),
 ('health', 183),
 ('new', 173),
 ('moderna', 171),
 ('doses', 164),
 ('people', 161),
 ('data', 156),
 ('pfizer', 153),
 ('shots', 147),
 ('years', 146),
 ('first', 146),
 ('shot', 144),
 ('covid', 144),
 ('one', 141),
 ('year', 140),
 ('two', 138),
 ('according', 136),
 ('reporting', 134),
 ('use', 131),
 ('disease', 130),
 ('vaccination', 129),
 ('booster', 127),
 ('news', 127),
 ('could', 126),
 ('mrna', 123),
 ('company', 123),
 ('dose', 122),
 ('months', 122)]

The 10th cluster is mostly about COVID vaccines: dose, pfizer, moderna, booster, mRNA

In [34]:
#df[df.label == cluster_11th].sample(10, random_state=0)[["body_text", "label"]]

In [35]:
counter = get_counter(df[df.label == cluster_11th], global_stopwords+local_stopwords)
counter.most_common(30)

[('women', 268),
 ('study', 242),
 ('health', 223),
 ('university', 221),
 ('data', 206),
 ('research', 205),
 ('risk', 203),
 ('new', 197),
 ('researchers', 180),
 ('published', 179),
 ('one', 177),
 ('years', 174),
 ('patients', 173),
 ('according', 172),
 ('found', 170),
 ('among', 169),
 ('higher', 164),
 ('pregnancy', 162),
 ('findings', 157),
 ('reported', 155),
 ('studies', 155),
 ('news', 154),
 ('age', 153),
 ('likely', 152),
 ('increased', 150),
 ('twitter', 148),
 ('financial', 147),
 ('need', 146),
 ('facebook', 146),
 ('compared', 145)]

Women's health and pregnancy, alongside general research vocabulary (study, research, university, published)

In [36]:
#df[df.label == cluster_12th].sample(10, random_state=0)[["body_text", "label"]]

In [37]:
counter = get_counter(df[df.label == cluster_12th], global_stopwords+local_stopwords)
counter.most_common(30)

[('china', 268),
 ('reporting', 255),
 ('editing', 252),
 ('covid-19', 243),
 ('beijing', 207),
 ('covid', 204),
 ('health', 199),
 ('new', 198),
 ('people', 191),
 ('infections', 185),
 ('chinese', 183),
 ('cases', 179),
 ('reported', 163),
 ('shanghai', 160),
 ('policy', 155),
 ('government', 154),
 ('city', 152),
 ('world', 144),
 ('last', 143),
 ('since', 142),
 ('three', 137),
 ('virus', 137),
 ('many', 137),
 ('one', 135),
 ('told', 132),
 ('authorities', 132),
 ('two', 130),
 ('zero-covid', 130),
 ('testing', 129),
 ('million', 127)]

China zero-tolerance covid policy

In [38]:
#df[df.label == cluster_13th].sample(10, random_state=0)[["body_text", "label"]]

In [39]:
counter = get_counter(df[df.label == cluster_13th], global_stopwords+local_stopwords)
counter.most_common(30)

[('children', 202),
 ('years', 159),
 ('health', 156),
 ('according', 150),
 ('new', 145),
 ('data', 142),
 ('reported', 139),
 ('one', 132),
 ('covid-19', 125),
 ('news', 125),
 ('disease', 123),
 ('among', 121),
 ('age', 120),
 ('hospital', 120),
 ('cases', 116),
 ('control', 110),
 ('pediatric', 106),
 ('child', 106),
 ('year', 105),
 ('states', 102),
 ('cdc', 101),
 ('study', 100),
 ('pediatrics', 98),
 ('prevention', 98),
 ('time', 98),
 ('two', 97),
 ('higher', 97),
 ('first', 95),
 ('compared', 95),
 ('report', 94)]

Pediatrics

In [40]:
#df[df.label == cluster_14th].sample(10, random_state=0)[["body_text", "label"]]

In [41]:
counter = get_counter(df[df.label == cluster_14th], global_stopwords+local_stopwords)
counter.most_common(30)

[('study', 180),
 ('health', 167),
 ('university', 166),
 ('research', 160),
 ('weight', 151),
 ('new', 149),
 ('news', 144),
 ('patients', 143),
 ('obesity', 142),
 ('body', 142),
 ('twitter', 135),
 ('years', 135),
 ('results', 134),
 ('data', 133),
 ('published', 131),
 ('facebook', 129),
 ('one', 129),
 ('studies', 129),
 ('people', 129),
 ('diabetes', 123),
 ('follow', 122),
 ('researchers', 122),
 ('age', 118),
 ('among', 117),
 ('time', 116),
 ('associated', 115),
 ('risk', 115),
 ('medicine', 114),
 ('findings', 114),
 ('compared', 114)]

weight and obesity, diabetes, body

In [42]:
#df[df.label == cluster_15th].sample(10, random_state=0)[["body_text", "label"]]

In [43]:
counter = get_counter(df[df.label == cluster_15th], global_stopwords+local_stopwords)
counter.most_common(30)

[('patients', 181),
 ('study', 170),
 ('university', 165),
 ('heart', 158),
 ('new', 153),
 ('risk', 151),
 ('failure', 149),
 ('published', 144),
 ('twitter', 144),
 ('facebook', 142),
 ('follow', 136),
 ('health', 133),
 ('years', 130),
 ('research', 129),
 ('clinical', 128),
 ('data', 128),
 ('disease', 124),
 ('compared', 122),
 ('us', 121),
 ('one', 120),
 ('reported', 114),
 ('results', 113),
 ('cardiology', 110),
 ('online', 109),
 ('death', 108),
 ('findings', 108),
 ('including', 108),
 ('among', 106),
 ('medicine', 106),
 ('patient', 105)]

Cardiology, heart failure

In [44]:
#df[df.label == cluster_16th].sample(10, random_state=0)[["body_text", "label"]]

In [45]:
counter = get_counter(df[df.label == cluster_16th], global_stopwords+local_stopwords)
counter.most_common(30)

[('abortion', 199),
 ('court', 185),
 ('supreme', 182),
 ('states', 179),
 ('abortions', 173),
 ('roe', 173),
 ('state', 172),
 ('wade', 169),
 ('women', 159),
 ('law', 155),
 ('v.', 155),
 ('u.s.', 152),
 ('health', 150),
 ('rights', 144),
 ('decision', 139),
 ('access', 135),
 ('laws', 134),
 ('could', 131),
 ('right', 129),
 ('new', 128),
 ('care', 125),
 ('legal', 125),
 ('ruling', 123),
 ('ban', 121),
 ('pregnancy', 120),
 ('weeks', 117),
 ('overturned', 112),
 ('one', 111),
 ('including', 110),
 ('patients', 109)]

Abortion, roe v wade, health rights, access, weeks of pregnancy

In [46]:
#df[df.label == cluster_17th].sample(10, random_state=0)[["body_text", "label"]]

In [47]:
counter = get_counter(df[df.label == cluster_17th], global_stopwords+local_stopwords)
counter.most_common(30)

[('monkeypox', 187),
 ('cases', 179),
 ('health', 177),
 ('disease', 153),
 ('reported', 142),
 ('outbreak', 138),
 ('virus', 136),
 ('people', 126),
 ('new', 117),
 ('states', 116),
 ('public', 115),
 ('according', 114),
 ('countries', 111),
 ('world', 109),
 ('spread', 108),
 ('contact', 106),
 ('vaccine', 105),
 ('men', 104),
 ('confirmed', 103),
 ('case', 102),
 ('united', 99),
 ('one', 98),
 ('risk', 97),
 ('organization', 95),
 ('symptoms', 93),
 ('sex', 92),
 ('control', 91),
 ('cdc', 91),
 ('vaccines', 90),
 ('first', 89)]

Monkeypox outbreak

In [48]:
#df[df.label == cluster_18th].sample(10, random_state=0)[["body_text", "label"]]

In [49]:
counter = get_counter(df[df.label == cluster_18th], global_stopwords+local_stopwords)
counter.most_common(30)

[('diabetes', 177),
 ('type', 162),
 ('study', 158),
 ('2', 149),
 ('years', 138),
 ('published', 137),
 ('data', 137),
 ('risk', 136),
 ('research', 135),
 ('people', 128),
 ('new', 126),
 ('patients', 125),
 ('follow', 125),
 ('news', 125),
 ('twitter', 122),
 ('health', 122),
 ('university', 121),
 ('facebook', 121),
 ('results', 111),
 ('disease', 110),
 ('findings', 109),
 ('among', 108),
 ('compared', 107),
 ('us', 107),
 ('reported', 105),
 ('researchers', 105),
 ('one', 105),
 ('association', 101),
 ('endocrinology', 100),
 ('based', 99)]

Type 2 diabetes ("type" and "2" rank near the top, while "1" is not among the top 30 words)

In [50]:
#df[df.label == cluster_19th].sample(10, random_state=0)[["body_text", "label"]]

In [51]:
counter = get_counter(df[df.label == cluster_19th], global_stopwords+local_stopwords)
counter.most_common(30)

[('study', 164),
 ('research', 153),
 ('twitter', 151),
 ('facebook', 151),
 ('disease', 150),
 ('news', 148),
 ('alzheimer', 145),
 ('published', 144),
 ('dementia', 143),
 ('health', 138),
 ('new', 138),
 ('risk', 138),
 ('us', 137),
 ('researchers', 136),
 ('university', 135),
 ('years', 135),
 ('results', 135),
 ('findings', 132),
 ('association', 129),
 ('relevant', 128),
 ('cognitive', 127),
 ('phd', 123),
 ('participants', 119),
 ('patients', 119),
 ('financial', 117),
 ('associated', 117),
 ('data', 116),
 ('added', 115),
 ('age', 114),
 ('studies', 114)]

Cognitive diseases: Alzheimer's, dementia

In [52]:
#df[df.label == cluster_20th].sample(10, random_state=0)[["body_text", "label"]]

In [53]:
counter = get_counter(df[df.label == cluster_20th], global_stopwords+local_stopwords)
counter.most_common(30)

[('omicron', 159),
 ('covid-19', 145),
 ('ba.5', 126),
 ('new', 125),
 ('booster', 120),
 ('vaccine', 114),
 ('people', 114),
 ('coronavirus', 110),
 ('ba.4', 108),
 ('health', 101),
 ('variant', 98),
 ('data', 97),
 ('vaccines', 97),
 ('subvariants', 93),
 ('two', 92),
 ('virus', 92),
 ('variants', 91),
 ('original', 90),
 ('last', 89),
 ('disease', 88),
 ('covid', 87),
 ('u.s.', 86),
 ('according', 85),
 ('states', 84),
 ('shots', 83),
 ('updated', 80),
 ('week', 79),
 ('protection', 76),
 ('reporting', 76),
 ('older', 75)]

COVID in the U.S.: omicron variant, booster, vaccines, protection

In [54]:
#df[df.label == cluster_21st].sample(10, random_state=0)[["body_text", "label"]]

In [55]:
counter = get_counter(df[df.label == cluster_21st], global_stopwords+local_stopwords)
counter.most_common(30)

[('use', 120),
 ('health', 118),
 ('drug', 104),
 ('opioid', 102),
 ('news', 102),
 ('new', 100),
 ('patients', 99),
 ('study', 98),
 ('opioids', 95),
 ('twitter', 94),
 ('facebook', 93),
 ('people', 90),
 ('university', 89),
 ('years', 88),
 ('data', 88),
 ('drugs', 84),
 ('research', 83),
 ('including', 81),
 ('published', 81),
 ('one', 80),
 ('according', 79),
 ('researchers', 77),
 ('treatment', 77),
 ('pain', 76),
 ('found', 74),
 ('overdose', 73),
 ('used', 72),
 ('medicine', 72),
 ('reported', 70),
 ('need', 69)]

Opioids, overdoses, drugs, health

In [56]:
#df[df.label == cluster_22nd].sample(10, random_state=0)[["body_text", "label"]]

In [57]:
counter = get_counter(df[df.label == cluster_22nd], global_stopwords+local_stopwords)
counter.most_common(30)

[('diabetes', 78),
 ('insulin', 74),
 ('patients', 68),
 ('type', 64),
 ('people', 60),
 ('1', 57),
 ('use', 56),
 ('years', 55),
 ('glucose', 55),
 ('follow', 55),
 ('twitter', 54),
 ('study', 54),
 ('facebook', 53),
 ('research', 53),
 ('news', 53),
 ('based', 53),
 ('new', 53),
 ('us', 50),
 ('data', 47),
 ('one', 47),
 ('health', 47),
 ('university', 47),
 ('endocrinology', 47),
 ('work', 46),
 ('two', 46),
 ('reported', 44),
 ('published', 44),
 ('2', 44),
 ('washington', 43),
 ('area', 43)]

The smallest cluster is about type 1 and 2 diabetes; glucose

## LDA Topic Modeling

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, norm="l2", stop_words="english", max_df=0.6)
X = vectorizer.fit_transform(df.body_text)
X

<8765x77723 sparse matrix of type '<class 'numpy.float64'>'
	with 2294393 stored elements in Compressed Sparse Row format>

In [59]:
num_topics = 22

In [60]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

lda = LDA(n_components=num_topics, random_state=0)     # LDA starts from a random initialization; random_state makes the fit reproducible
lda

LatentDirichletAllocation(n_components=22, random_state=0)

In [61]:
%time lda.fit(X)

CPU times: user 5min 49s, sys: 28min 41s, total: 34min 31s
Wall time: 4min 21s


LatentDirichletAllocation(n_components=22, random_state=0)

In [62]:
lda.components_

array([[0.04545455, 0.04545455, 0.04545455, ..., 0.04545455, 0.04545455,
        0.08478263],
       [0.04545455, 0.04545455, 0.04545455, ..., 0.04545455, 0.04545455,
        0.04545455],
       [0.04545455, 0.04545455, 0.04545455, ..., 0.04545455, 0.04545455,
        0.04545455],
       ...,
       [0.04545455, 0.04545455, 0.04545455, ..., 0.09610897, 0.09610897,
        0.04545455],
       [0.04545455, 0.04545455, 0.04545455, ..., 0.04545455, 0.04545455,
        0.04545455],
       [0.04545455, 0.04545455, 0.04545455, ..., 0.04545455, 0.04545455,
        0.04545455]])

In [63]:
lda.components_.shape

(22, 77723)
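
Each row of `components_` holds unnormalized word weights for one topic. The value 0.04545455 that fills most of the array equals 1/22, which is consistent with scikit-learn's default word prior of `1 / n_components` for words a topic received no weight for. A minimal sketch with a made-up 2 x 4 array shows how the rows can be normalized into word distributions. The array `toy_components` is an assumption standing in for `lda.components_`.

```python
import numpy as np

n_topics = 22
prior = 1 / n_topics
print(round(prior, 8))   # 0.04545455

# Made-up pseudo-counts standing in for lda.components_ (2 topics, 4 words)
toy_components = np.array([
    [prior, 3.0, prior, 1.0],
    [2.0, prior, prior, prior],
])

# Divide each row by its sum to obtain a word distribution per topic
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs.sum(axis=1))
```

After normalization the weights within a topic are comparable as probabilities, which the raw scores printed by `show_topics` are not.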

In [64]:
def show_topics(model, feature_names, num_top_words):
    for topic_idx, topic_scores in enumerate(model.components_):
        print(f"*** Topic {topic_idx}:")
        print(" + ".join(["{:.2f} * {}".format(topic_scores[i], feature_names[i]) for i in topic_scores.argsort()[::-1][:num_top_words]]))
        print()

In [65]:
show_topics(lda, vectorizer.get_feature_names_out(), 30)

*** Topic 0:
4.04 * thrombectomy + 3.75 * valneva + 3.06 * endovascular + 1.99 * bankruptcy + 1.93 * nihss + 1.90 * mrs + 1.66 * talc + 1.58 * amd + 1.53 * hummel + 1.51 * tassilo + 1.39 * ltl + 1.34 * fh + 1.20 * nafld + 1.12 * rett + 1.00 * sids + 0.99 * filgotinib + 0.99 * infarct + 0.98 * trofinetide + 0.97 * jovin + 0.96 * mamas + 0.94 * tattoos + 0.93 * belimumab + 0.93 * isc + 0.90 * kluge + 0.89 * pde5 + 0.86 * rankin + 0.85 * subbarow + 0.83 * agga + 0.82 * durvalumab + 0.82 * chmp

*** Topic 1:
4.07 * rankings + 1.65 * berdazimer + 1.39 * stiko + 1.27 * ec + 1.07 * valneva + 1.05 * vaxzevria + 0.93 * curevac + 0.89 * immersion + 0.87 * turtles + 0.85 * oon + 0.84 * scars + 0.84 * pylori + 0.84 * intolerance + 0.84 * venetoclax + 0.84 * gram + 0.83 * lada + 0.82 * ivig + 0.81 * sabizabulin + 0.81 * reese + 0.80 * ddh + 0.80 * molluscum + 0.79 * aapc + 0.79 * etminan + 0.79 * brewster + 0.78 * apremilast + 0.77 * ranking + 0.76 * deucravacitinib + 0.75 * mpox + 0.75 * croce + 0

4.27 * ppi + 4.15 * ppis + 4.12 * vedolizumab + 3.71 * upadacitinib + 2.68 * hcm + 2.09 * gy + 2.01 * sekeres + 1.76 * cd + 1.72 * cansino + 1.51 * hypofractionation + 1.46 * surmount + 1.42 * hypofractionated + 1.29 * pegloticase + 1.27 * nr + 1.26 * vt + 1.18 * ebv + 1.18 * buses + 1.05 * mizelle + 1.02 * eb + 1.01 * wine + 0.98 * hcv + 0.96 * rasi + 0.96 * cf + 0.93 * mcdermott + 0.92 * directory + 0.90 * nsaid + 0.90 * araújo + 0.88 * scribes + 0.88 * sa + 0.87 * stillbirths

*** Topic 20:
9.55 * fauci + 5.48 * satija + 5.48 * bhanvi + 5.45 * cholera + 5.20 * ganguli + 5.19 * shinjini + 5.00 * hcc + 4.88 * mishra + 4.79 * evusheld + 4.44 * manas + 4.04 * khushi + 3.93 * mpox + 3.93 * mandowara + 3.81 * dasgupta + 3.73 * shounak + 3.38 * bourla + 3.35 * krishna + 3.11 * chandra + 3.01 * eluri + 2.67 * cha + 2.54 * malawi + 2.42 * bq + 2.18 * dwivedi + 2.14 * bebtelovimab + 2.10 * xbb + 2.10 * huang + 2.09 * maju + 2.07 * nhc + 2.05 * zanubrutinib + 2.02 * indoors

*** Topic 21:
1.42

Some topics match the K-means clusters (COVID news, cancer studies, abortion news), but most are dominated by drug names and other rare tokens that do not form recognizable topics.
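
One possible reason for the rare-token topics is the input. LDA is formulated as a model of word counts, while this section feeds it TF-IDF weights, which boost rare words. A sketch of the count-based setup on a made-up corpus follows. Whether it gives cleaner topics on the article data is untested here, and all `toy_` names are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "vaccine booster dose vaccine",
    "vaccine dose shots booster",
    "heart failure cardiology heart",
    "heart cardiology risk failure",
]

# Raw term counts instead of TF-IDF weights
count_vectorizer = CountVectorizer(stop_words="english")
toy_counts = count_vectorizer.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)   # one row per document, rows sum to 1

print(toy_lda.components_.shape)
print(doc_topics.round(2))
```

On the real corpus, a `min_df` threshold on the `CountVectorizer` might also help by removing tokens that occur in only a handful of articles.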

### Topic Model Visualization

In [66]:
import pyLDAvis
import pyLDAvis.sklearn     # renamed to pyLDAvis.lda_model in pyLDAvis 3.4 and later

pyLDAvis.enable_notebook()

In [67]:
pyLDAvis.sklearn.prepare(lda, X, vectorizer)

