In [1]:
import pandas as pd

df = pd.read_csv("../data/interim/openalex_pruned.csv", dtype=str, low_memory=False)

## Small note

Some rows have no abstract, so we can either drop them or substitute other text. In this case, I will substitute other metadata (the title). Some rows are also missing a title or publication year, so I'll remove those entries entirely.

In [14]:
df = df.dropna(subset=["title", "publication_year"], how="any")

In [15]:
df["abstract"].isna().sum(), len(df)


(34515, 45532)

So roughly 34.5K of the 45.5K entries (about 76%) lack an abstract. I'll fill the missing entries with the title, since many titles are descriptive enough to convey the contents of the article.

In [16]:
df["text"] = df["abstract"].fillna(df["title"])
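To make the two cleaning steps concrete, here is a minimal sketch on a made-up three-row frame. The toy values are assumptions for illustration, not rows from the dataset: rows missing a title or year are dropped, then a missing abstract falls back to the title.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the OpenAlex file).
toy = pd.DataFrame({
    "title": ["Graph layout tools", None, "Genome browser"],
    "publication_year": ["2011.0", "2012.0", "2013.0"],
    "abstract": [None, "An abstract without a title.", "We present a browser."],
})

# Drop rows where the title or the publication year is missing.
toy = toy.dropna(subset=["title", "publication_year"], how="any")

# Use the abstract where present, otherwise fall back to the title.
toy["text"] = toy["abstract"].fillna(toy["title"])

print(toy["text"].tolist())
```

The first row keeps its title as text, the second row is dropped because it has no title, and the third row keeps its abstract.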

The cell below checks whether any rows still lack text. If the result is empty, then we're good.

In [17]:
df[df["text"].isna()]


Unnamed: 0,id,title,publication_year,authorships.author.display_name,concepts.display_name,topics.display_name,primary_location.source.host_organization_name,cited_by_count,abstract,text


## Vectorization

Below I create the document vectors using TF-IDF so that I can use them later for dimensionality reduction.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=50000,
    max_df=0.3,
    min_df=10
)

tfidf = vectorizer.fit_transform(df["text"])
tfidf

<45532x7455 sparse matrix of type '<class 'numpy.float64'>'
	with 927557 stored elements in Compressed Sparse Row format>
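The matrix above has 7455 columns, far below the `max_features=50000` cap, so the cap is not binding here. The vocabulary size is set by the stop-word list and the document-frequency bounds. As a rough illustration of those bounds, the sketch below re-implements the filter in plain Python on a hypothetical four-document corpus. The helper `prune_vocabulary` and the whitespace tokenization are assumptions for illustration, not scikit-learn code. As in scikit-learn, an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents.

```python
from collections import Counter

# Hypothetical mini-corpus; a plain whitespace split stands in for real tokenization.
docs = [
    "graph visualization software",
    "graph layout algorithm",
    "genome visualization tool",
    "graph genome browser",
]

def prune_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency is >= min_df (a count)
    and <= max_df * len(docs) (max_df is a proportion)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    max_count = max_df * len(docs)
    return sorted(t for t, n in doc_freq.items() if min_df <= n <= max_count)

vocab = prune_vocabulary(docs, min_df=2, max_df=0.5)
print(vocab)
```

Here "graph" is dropped for appearing in more than half of the documents, and the terms that occur only once are dropped by `min_df=2`, leaving "genome" and "visualization".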

## Dimensionality Reduction with UMAP

In [20]:
import umap

umap_3d = umap.UMAP(
    n_components=3,
    n_neighbors=30,
    min_dist=0.0,
    metric='cosine',
    random_state=42
)

embedding_3d = umap_3d.fit_transform(tfidf)




In [21]:
df["x"] = embedding_3d[:, 0]
df["y"] = embedding_3d[:, 1]
df["z"] = embedding_3d[:, 2]

## Clustering

We can now group the embedded points into clusters based on how close they are to one another.

In [22]:
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,
    metric='euclidean',
    cluster_selection_method='eom'
)

labels = clusterer.fit_predict(embedding_3d)
df["cluster"] = labels
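HDBSCAN assigns the label -1 to points it treats as noise, so those rows belong to no cluster. A quick way to see how much of the data was clustered is to compute the noise fraction and the cluster sizes. The sketch below does this on hypothetical labels (not the labels from this run).

```python
import pandas as pd

# Hypothetical labels; HDBSCAN marks points it treats as noise with -1.
labels = pd.Series([0, 0, 1, -1, 1, 0, -1, 2, 2, 2])

# Share of points left unclustered.
noise_fraction = (labels == -1).mean()

# Number of points per real cluster, noise excluded.
cluster_sizes = labels[labels != -1].value_counts().sort_index()

print(f"noise fraction: {noise_fraction:.2f}")
print(cluster_sizes.to_dict())
```

The same two expressions can be applied to the cluster column of the real frame.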




## Final inspection and export

Check the metadata we want and export it.

In [26]:
df.head()

Unnamed: 0,id,title,publication_year,authorships.author.display_name,concepts.display_name,topics.display_name,primary_location.source.host_organization_name,cited_by_count,abstract,text,x,y,z,cluster
0,https://openalex.org/W2028056984,<i>VESTA 3</i>for three-dimensional visualizat...,2011.0,Koichi Momma|Fujio Izumi,Undo|Visualization|Voronoi diagram|Computer sc...,Catalysis and Oxidation Reactions|X-ray Diffra...,Wiley,22839,,<i>VESTA 3</i>for three-dimensional visualizat...,9.508135,12.053864,-0.384366,110
1,https://openalex.org/W2056279562,phyloseq: An R Package for Reproducible Intera...,2013.0,Paul J. McMurdie|Susan Holmes,UniFrac|Computer science|Data science|Data min...,Species Distribution and Climate Change|Data A...,Public Library of Science,20085,The phyloseq project for R is a new open-sourc...,The phyloseq project for R is a new open-sourc...,11.787406,11.985596,-0.692306,-1
2,https://openalex.org/W2128880918,Geneious Basic: An integrated and extendable d...,2012.0,Matthew D. Kearse|Richard Moir|Amy Wilson|Stev...,Computer science|Software|Personalization|Leve...,Scientific Computing and Data Management|Genom...,Oxford University Press,19717,Abstract Summary: The two main functions of bi...,Abstract Summary: The two main functions of bi...,11.948407,12.137466,-0.855199,-1
3,https://openalex.org/W2114843025,Integrative Analysis of Complex Cancer Genomic...,2013.0,Jianjiong Gao|Bülent Arman Aksoy|Uğur Doğrusöz...,Genomics|Visualization|Interface (matter)|Comp...,Bioinformatics and Genomic Networks|Cancer Gen...,American Association for the Advancement of Sc...,15249,"The cBioPortal enables integration, visualizat...","The cBioPortal enables integration, visualizat...",11.768088,12.328234,-1.023483,47
4,https://openalex.org/W2047968138,Visualization and analysis of atomistic simula...,2009.0,Alexander Stukowski,Visualization|Python (programming language)|Sc...,Theoretical and Computational Physics|Machine ...,IOP Publishing,14576,,Visualization and analysis of atomistic simula...,9.105033,11.859065,-0.284749,-1


## Checking final dataset

The output below gives a quick rundown of what's in the dataset.


In [31]:
df_out.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45532 entries, 0 to 45553
Data columns (total 13 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   id                                              45532 non-null  object 
 1   title                                           45532 non-null  object 
 2   publication_year                                45532 non-null  object 
 3   authorships.author.display_name                 45075 non-null  object 
 4   concepts.display_name                           45532 non-null  object 
 5   topics.display_name                             45153 non-null  object 
 6   primary_location.source.host_organization_name  18207 non-null  object 
 7   cited_by_count                                  45532 non-null  object 
 8   text                                            45532 non-null  object 
 9   x                                           

I don't like the long dotted column names, so I'll rename them below.

In [33]:
new_column_names = {
    'id': 'id',
    'title': 'title',
    'publication_year': 'publication_year',
    'authorships.author.display_name': 'authorships',
    'concepts.display_name': 'concepts',
    'topics.display_name': 'topics',
    'primary_location.source.host_organization_name': 'host_organization',
    'cited_by_count': 'cited_by_count',
    'text': 'text',
}

df_out = df.rename(columns=new_column_names)
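One detail worth keeping in mind: `rename` returns a new frame and leaves the original untouched, and only columns that actually change need an entry in the mapping. A minimal sketch on a hypothetical one-row frame:

```python
import pandas as pd

# Toy frame with one of the dotted column names used above.
toy = pd.DataFrame({"id": ["W1"], "concepts.display_name": ["Genomics"]})

# Only columns that actually change need an entry in the mapping.
renamed = toy.rename(columns={"concepts.display_name": "concepts"})

print(list(toy.columns))      # the original frame is unchanged
print(list(renamed.columns))  # the new frame carries the short name
```

Any later selection by the short names therefore has to go through the renamed frame, not the original one.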

In [34]:
df_out.head()

Unnamed: 0,id,title,publication_year,authorships,concepts,topics,host_organization,cited_by_count,abstract,text,x,y,z,cluster
0,https://openalex.org/W2028056984,<i>VESTA 3</i>for three-dimensional visualizat...,2011.0,Koichi Momma|Fujio Izumi,Undo|Visualization|Voronoi diagram|Computer sc...,Catalysis and Oxidation Reactions|X-ray Diffra...,Wiley,22839,,<i>VESTA 3</i>for three-dimensional visualizat...,9.508135,12.053864,-0.384366,110
1,https://openalex.org/W2056279562,phyloseq: An R Package for Reproducible Intera...,2013.0,Paul J. McMurdie|Susan Holmes,UniFrac|Computer science|Data science|Data min...,Species Distribution and Climate Change|Data A...,Public Library of Science,20085,The phyloseq project for R is a new open-sourc...,The phyloseq project for R is a new open-sourc...,11.787406,11.985596,-0.692306,-1
2,https://openalex.org/W2128880918,Geneious Basic: An integrated and extendable d...,2012.0,Matthew D. Kearse|Richard Moir|Amy Wilson|Stev...,Computer science|Software|Personalization|Leve...,Scientific Computing and Data Management|Genom...,Oxford University Press,19717,Abstract Summary: The two main functions of bi...,Abstract Summary: The two main functions of bi...,11.948407,12.137466,-0.855199,-1
3,https://openalex.org/W2114843025,Integrative Analysis of Complex Cancer Genomic...,2013.0,Jianjiong Gao|Bülent Arman Aksoy|Uğur Doğrusöz...,Genomics|Visualization|Interface (matter)|Comp...,Bioinformatics and Genomic Networks|Cancer Gen...,American Association for the Advancement of Sc...,15249,"The cBioPortal enables integration, visualizat...","The cBioPortal enables integration, visualizat...",11.768088,12.328234,-1.023483,47
4,https://openalex.org/W2047968138,Visualization and analysis of atomistic simula...,2009.0,Alexander Stukowski,Visualization|Python (programming language)|Sc...,Theoretical and Computational Physics|Machine ...,IOP Publishing,14576,,Visualization and analysis of atomistic simula...,9.105033,11.859065,-0.284749,-1


## Export

In [37]:
# Select from df_out (the renamed frame); df still has the original dotted column names.
df_out = df_out[[
    "id", "title", "publication_year",
    "authorships", "concepts", "topics", "host_organization", "cited_by_count", "text",
    "x", "y", "z", "cluster"
]]

df_out.to_json("../data/processed/topics_3d.json",
               orient="records",
               force_ascii=False)
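As a quick check of what this export produces, the sketch below serializes a hypothetical two-row frame with the same options. With `orient="records"` each row becomes one JSON object, `force_ascii=False` keeps characters such as "ğ" as-is instead of escaping them, and missing values are written as `null`.

```python
import json
import pandas as pd

# Hypothetical rows; the real export has more columns.
toy = pd.DataFrame({
    "id": ["W1", "W2"],
    "authorships": ["Uğur Doğrusöz", "Susan Holmes"],
    "host_organization": ["Wiley", None],
    "cluster": [47, -1],
})

# With no path argument, to_json returns the JSON text instead of writing a file.
payload = toy.to_json(orient="records", force_ascii=False)

# Parse it back to inspect the structure: a list with one dict per row.
records = json.loads(payload)
print(records[1])
```

Whatever consumes the file should be prepared for `null` in the host organization field, since many rows have no value there.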
