<a href="https://colab.research.google.com/github/justinfmccarty/urban_bipv_annotated_bib/blob/main/data/bipv_lit_rev_clusters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Literature Clusters for BIPV

This notebook replicates an effort to cluster COVID-19 publications:

Eren, Maksim Ekin, Nick Solovyev, Edward Raff, Charles Nicholas, and Ben Johnson. “COVID-19 Kaggle Literature Organization.” In Proceedings of the ACM Symposium on Document Engineering 2020, 1–4. DocEng ’20. New York, NY, USA: Association for Computing Machinery, 2020. https://doi.org/10.1145/3395027.3419591.

Using the Lens database (lens.org), I extracted building-integrated photovoltaic (BIPV) literature to feed into Eren et al.'s clustering workflow.

I will repeat this effort once I have access to the Lens API and can gather a more comprehensive dataset.



### Section 1: Load and Process Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = r'https://github.com/justinfmccarty/urban_bipv_annotated_bib/blob/main/data/bipv_inclusive.csv?raw=true'
df_raw = pd.read_csv(url)
df_raw.sample(3)

Unnamed: 0,Lens ID,Title,Date Published,Publication Year,Publication Type,Source Title,ISSNs,Publisher,Source Country,Author/s,...,Funding,Source URLs,External URL,PMID,DOI,Microsoft Academic ID,PMCID,Citing Patents Count,References,Citing Works Count
775,007-253-641-397-424,Smart Sensor Systems for Wearable Electronic D...,2017-07-25,2017.0,journal article,Polymers,20734360.0,MDPI AG,Switzerland,Byeong Wan An; Jung Hwal Shin; So Yun Kim; Joo...,...,,https://www.scilit.net/article/fcb761f8caef8c3...,https://www.ncbi.nlm.nih.gov/pubmed/30970981,30970981.0,,2736319074,PMC6418677,0,000-440-836-672-886; 000-496-978-680-075; 000-...,92
10038,142-767-808-292-193,Environmental Impact of Modern Wind Power unde...,2010-06-01,2010.0,book chapter,Wind Power,,InTech,,Eduardo Martínez Cámara; Emilio Jiménez Macías...,...,,https://cdn.intechopen.com/pdfs/9565/InTech-En...,http://dx.doi.org/10.5772/8348,,10.5772/8348,1514479657,,0,006-192-308-331-998; 012-651-781-429-939; 016-...,2
8344,104-844-063-980-01X,Modelling and system analysis of new photovolt...,2016-02-23,2016.0,,,,,,M. Katiyar,...,,https://www.narcis.nl/publication/RecordID/oai...,https://www.narcis.nl/publication/RecordID/oai...,,,2808108790,,0,,0


In [None]:
# for this notebook I am only getting journal articles
df_raw_j = df_raw[df_raw['Publication Type']=='journal article']
print(f'Original shape was {df_raw.shape}')
print(f'New shape is {df_raw_j.shape}')

In [51]:
# remove publications without an abstract as this is key to the clustering effort
df_raw_j_clean = df_raw_j.dropna(subset=['Abstract'])
print(f'Original shape was {df_raw_j.shape}')
print(f'New shape is {df_raw_j_clean.shape}')
df_raw_j_clean.sample(2)

Original shape was (7391, 29)
New shape is (7132, 29)


Unnamed: 0,Lens ID,Title,Date Published,Publication Year,Publication Type,Source Title,ISSNs,Publisher,Source Country,Author/s,Abstract,Volume,Issue Number,Start Page,End Page,Fields of Study,Keywords,MeSH Terms,Chemicals,Funding,Source URLs,External URL,PMID,DOI,Microsoft Academic ID,PMCID,Citing Patents Count,References,Citing Works Count
7325,088-131-561-961-454,DC nanogrid for renewable sources with modular...,2017-02-16,2017.0,journal article,IET Power Electronics,17554543; 17554535,Institution of Engineering and Technology (IET),United Kingdom,Carlo Cecati; Hassan Abdullah Khalid; Mario Ti...,Centralised electricity systems are being inte...,10,5,536,544,Electrical engineering; Voltage source; Energy...,,,,,https://ieeexplore.ieee.org/document/7894943/ ...,http://dx.doi.org/10.1049/iet-pel.2016.0200,,10.1049/iet-pel.2016.0200,2548982234,,0,002-740-335-923-55X; 003-072-942-903-919; 009-...,24
2390,023-263-936-665-747,"Bifacial, color-tunable semitransparent perovs...",2019-12-20,2019.0,journal article,ACS applied materials & interfaces,19448252; 19448244,American Chemical Society,United States,Hao Wang; Herlina Arianita Dewi; Teck Ming Koh...,"Recently, semitransparent perovskite solar cel...",12,1,484,493,Building-integrated photovoltaics; Perovskite ...,BIPV; bifacial solar cell; colorful perovskite...,,,National Research Foundation Singapore; Nation...,https://pubs.acs.org/doi/10.1021/acsami.9b1548...,http://dx.doi.org/10.1021/acsami.9b15488,31814394.0,10.1021/acsami.9b15488,2996469781,,0,006-207-723-814-646; 009-870-531-460-150; 012-...,21


In [38]:
# there may be a need for publication year later on in visualization 
# so I will be dropping anything without a publication year

# let's see how many
print(f"Missing in Publication Year {len(df_raw_j_clean[df_raw_j_clean['Publication Year'].isnull()])}")
print(f"Missing in Date Published {len(df_raw_j_clean[df_raw_j_clean['Date Published'].isnull()])}")

df_raw_j_clean_years = df_raw_j_clean.dropna(subset=['Publication Year'])
print(f'Original shape was {df_raw_j_clean.shape}')
print(f'New shape is {df_raw_j_clean_years.shape}')

Missing in Publication Year 27
Missing in Date Published 4033
Original shape was (7132, 29)
New shape is (7105, 29)
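
The two filtering steps above follow the same pattern: drop rows with a missing value in one column, then report the change in size. A minimal sketch of that pattern as a reusable helper is shown here. The function name `drop_missing` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def drop_missing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows with a missing value in `column` and report how many were removed."""
    cleaned = df.dropna(subset=[column])
    print(f"{column}: dropped {len(df) - len(cleaned)} of {len(df)} rows")
    return cleaned

# Toy data standing in for the Lens export (values are made up)
toy = pd.DataFrame({
    "Abstract": ["a", None, "c", "d"],
    "Publication Year": [2017.0, 2019.0, None, 2020.0],
})

# Chain the two steps in the same order as the notebook
toy_clean = drop_missing(drop_missing(toy, "Abstract"), "Publication Year")
```

On the real data this would be called as `drop_missing(df_raw_j, 'Abstract')` followed by `drop_missing(..., 'Publication Year')`.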


In [52]:
# simplify dataframe for clustering workflow

columns = ['Lens ID',
           'Title',
           'Publication Year',
           'Abstract',
           'External URL',
           'DOI',
           'Author/s']

df_raw_j_clean_sub = df_raw_j_clean_years[columns].astype({'Publication Year':'int32'})
df_raw_j_clean_sub = df_raw_j_clean_sub.rename(columns={'Lens ID':'lens_id',
                                                        'Title':'title',
                                                        'Publication Year':'year',
                                                        'Abstract':'abstract',
                                                        'External URL':'external_url',
                                                        'DOI':'doi',
                                                        'Author/s':'authors'})

df_raw_j_clean_sub.sample(3)


Unnamed: 0,lens_id,title,year,abstract,external_url,doi,authors
4036,042-663-516-810-524,Where the wicked problems are: The case of men...,2010,Abstract Objective To use system ideas and the...,http://dx.doi.org/10.1016/j.healthpol.2010.11.002,10.1016/j.healthpol.2010.11.002,Ben Hannigan; Michael Coffey
4116,043-683-034-743-917,Modelling of solar micro gas turbine for parab...,2020,Dish-Stirling unit and photovoltaic panels are...,http://dx.doi.org/10.12928/telkomnika.v18i6.16676,10.12928/telkomnika.v18i6.16676,Syariffah Othman; Mohd Ruddin Ab Ghani; Zanari...
3779,039-449-078-784-489,Design and Implementation of Multi-Directional...,2013,This paper introduces the design of the all-ro...,http://dx.doi.org/10.4028/www.scientific.net/a...,10.4028/www.scientific.net/amm.380-384.889,null De Jun Li; Zhi Hu


In [53]:
# last check for any NaN values
for col in df_raw_j_clean_sub.columns:
    if df_raw_j_clean_sub[col].isnull().any():
        print(f'{col} contains NaN values.')

external_url contains NaN values.
doi contains NaN values.
authors contains NaN values.
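
The same check can also return counts rather than a yes/no per column, which saves a follow-up cell. This is a small sketch on made-up data using `DataFrame.isna().sum()`. The toy frame is an assumption, and the real call would be on `df_raw_j_clean_sub`.

```python
import pandas as pd

# Toy frame standing in for df_raw_j_clean_sub (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "doi": ["10.1000/x", None, None],
    "authors": ["X", "Y", None],
})

# Count missing values per column, then keep only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```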


In [60]:
# ok so we may want to link back to papers at the end of this so let's see
# where links are missing

print(f"External URL NaN length is {len(df_raw_j_clean_sub[df_raw_j_clean_sub['external_url'].isnull()])}")
print(f"DOI NaN length is {len(df_raw_j_clean_sub[df_raw_j_clean_sub['doi'].isnull()])}")

External URL NaN length is 14
DOI NaN length is 266
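
Since the goal is to link back to each paper, one option is to fall back to a DOI-based URL when the external URL is missing. The sketch below assumes that `https://doi.org/<doi>` resolves to the paper. The column name `link` and the toy rows are hypothetical and not part of the Lens export.

```python
import pandas as pd

# Toy rows standing in for df_raw_j_clean_sub (values are made up)
toy = pd.DataFrame({
    "external_url": ["http://dx.doi.org/10.1000/a", None, None],
    "doi": ["10.1000/a", "10.1000/b", None],
})

# Build a resolver URL from the DOI where one exists
doi_url = toy["doi"].map(lambda d: f"https://doi.org/{d}" if pd.notna(d) else None)

# Prefer the external URL and fall back to the DOI-based URL;
# `link` is a hypothetical column name used only in this sketch
toy["link"] = toy["external_url"].fillna(doi_url)
print(toy["link"].tolist())
```

Rows with neither an external URL nor a DOI stay missing and would still need the Lens or Crossref lookup.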


In [66]:
# only a small share of links are missing (14 external URLs, 266 DOIs), so I will
# leave them for now and come back with the API from Lens or Crossref
# the dataset is now ready for the clustering workflow

df_raw_j_clean_sub.sample(4)

Unnamed: 0,lens_id,title,year,abstract,external_url,doi,authors
481,004-316-714-454-503,Developing an optimal electricity generation m...,2016,The UK electricity sector is undergoing a tran...,http://dx.doi.org/10.1016/j.energy.2016.01.077,10.1016/j.energy.2016.01.077,H. Sithole; Tim Cockerill; Kevin J. Hughes; De...
8431,106-692-423-910-289,Technological Progress towards Sustainable Dev...,2002,"The purpose of this paper is twofold. First, t...",https://core.ac.uk/display/33898033,,Ger Klaassen; Asami Miketa; Keywan Riahi; Leo ...
7786,095-715-703-331-291,Economic analysis of BIPV systems as a buildin...,2020,Abstract The main purpose of this study is to ...,http://dx.doi.org/10.1016/j.energy.2020.117931,10.1016/j.energy.2020.117931,Hassan Gholami; Harald N. Røstvik
7084,084-567-728-454-801,Evaluation of Combined Solar Thermal Heat Pump...,2014,Abstract This paper presents results from the ...,http://dx.doi.org/10.1016/j.egypro.2014.02.070,10.1016/j.egypro.2014.02.070,Werner Lerch; Andreas Heinz; Richard Heimrath


### Section 2: Clustering Workflow (Eren et al., 2020)

source: https://github.com/MaksimEkin/arXiv-Literature-Clustering/blob/master/arxiv_clustering.ipynb

In [67]:
# so we can skip the first few cells of the original notebook as they 
# read their data into df from the source json files in the Kaggle data

# We begin with the NLP
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [68]:
# stop words are common words that may complicate the analysis so we set up 
# a system to remove them

import string

punctuations = string.punctuation
stopwords = list(STOP_WORDS)
stopwords[:10]

['now',
 'whose',
 'why',
 'just',
 '‘ve',
 'thereafter',
 'toward',
 'although',
 'if',
 'everywhere']

In [70]:
# also adding in a few more common stop words
# the tokenizer lowercases every token, so each entry is lowercased before it is added
custom_stop_words = [
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al',
    'author', 'figure', 'rights', 'reserved', 'permission', 'used', 'using',
    'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 'al.', 'Elsevier', 'PMC', 
    'CZI', 'www','abstract', 'Abstract'
]

for w in custom_stop_words:
    w = w.lower()
    if w not in stopwords:
        stopwords.append(w)
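
Because the tokenizer defined later in this notebook lowercases every token before comparing it to the stop-word list, an entry only has an effect in its lowercase form. The sketch below illustrates that with plain Python and no spaCy. The helper names `normalize_stop_words` and `drop_stop_words` are assumptions for illustration, and a set is used instead of a list only because membership checks are faster.

```python
import string

def normalize_stop_words(words):
    """Return a lowercased set so lookups match lowercased tokens."""
    return {w.lower() for w in words}

# A few entries copied from the custom list above, including capitalized ones
extra_stop_words = normalize_stop_words(["doi", "Elsevier", "PMC", "Abstract"])

def drop_stop_words(tokens, stop_words):
    """Lowercase tokens, then remove stop words and punctuation marks."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in stop_words and t not in string.punctuation]

tokens = ["Abstract", "The", "Elsevier", "BIPV", "facade", ",", "doi"]
print(drop_stop_words(tokens, extra_stop_words))
```

Here "the" survives only because the sketch does not include spaCy's built-in `STOP_WORDS`.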

In [None]:
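# Parser
parser = spacy.load("en_core_web_sm")
parser.max_length = 7000000  # raise spaCy's default limit of 1,000,000 characters per text

def spacy_tokenizer(sentence):
    """Lemmatize and lowercase a text, then drop stop words and punctuation."""
    mytokens = parser(sentence)
    # "-PRON-" is the pronoun placeholder lemma in spaCy v2; keep the lowercased word instead
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return " ".join(mytokens)

def call_tokenizer(df):
    """Add a processed_abstract column holding the cleaned abstract text."""
    df["processed_abstract"] = df["abstract"].apply(spacy_tokenizer)
    return df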