## Google Scholar text analysis

In this notebook, I read in Google Scholar data from a query URL and run clustering (KMeans) and topic modeling (LDA) to try to uncover trends in recently published research indexed by Google Scholar. Comments appear as `#` lines in the code cells and as text cells throughout the notebook.

In [1]:
#bring in scholarly, a package that scrapes Google Scholar, to get the papers, titles, and their abstracts
import scholarly

In [2]:
#go to Google Scholar, run the search, and paste the query URL here:
#everything after 'https://scholar.google.com'
results = scholarly.search_pubs_custom_url('/scholar?as_ylo=2019&q="supply+chain"&hl=en&as_sdt=1,1&as_vis=1')

In [3]:
#print an example
for result in results:
    print(result)
    break

{'_filled': False,
 'bib': {'abstract': 'We opened our 2010 paper in the Journal of Business '
                     'Logistics with a 6th century quote by Heraclitus–“The '
                     'only constant is change.” This immutable law certainly '
                     "holds in today's volatile business world, especially for "
                     'supply chain management, and has been the driving …',
         'author': 'TJ Pettit and KL Croxton and J Fiksel',
         'eprint': 'https://onlinelibrary.wiley.com/doi/pdf/10.1111/jbl.12202',
         'title': 'The Evolution of Resilience in Supply Chain Management: A '
                  'Retrospective on Ensuring Supply Chain Resilience',
         'url': 'https://onlinelibrary.wiley.com/doi/abs/10.1111/jbl.12202'},
 'citedby': 1,
 'id_scholarcitedby': '16408947525918857329',
 'source': 'scholar',
 'url_scholarbib': 'https://scholar.googleusercontent.com/scholar.bib?q=info:cRhrJNZKuOMJ:scholar.google.com/&output=citation&scisdr=CgUD3q0

In [4]:
#build a list of titles

num_scraped = 0
title_list = []

for result in results:
    num_scraped += 1
    
    title_list.append(result.bib['title'])
    
    if num_scraped == 5000:
        break
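
The counter-and-break loop above can also be written with `itertools.islice`, which stops pulling from a generator after a fixed number of items. A minimal sketch, assuming a hypothetical `fake_results` generator that stands in for the `scholarly` query (so it runs offline) and the same `bib['title']` layout as the printed example:

```python
from itertools import islice
from types import SimpleNamespace

def fake_results():
    """Hypothetical stand-in for the scholarly result generator."""
    n = 0
    while True:  # endless, like a long stream of search results
        yield SimpleNamespace(bib={'title': f'Paper {n}'})
        n += 1

MAX_TITLES = 5  # the notebook caps at 5000

# islice stops after MAX_TITLES items and leaves the rest unconsumed
sample_titles = [r.bib['title'] for r in islice(fake_results(), MAX_TITLES)]
print(sample_titles)
```

If the query returns fewer results than the cap, `islice` simply stops when the generator runs out, same as the loop.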

In [5]:
#example of these titles
title_list[2]

'An empirical analysis of supply chain finance adoption'

In [6]:
#get all the data science/NLP packages we'll need
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from gensim.models.phrases import Phrases
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords

from sklearn.metrics import pairwise_distances_argmin_min

In [7]:
example = title_list[0]

In [8]:
example

'A behavioral investigation of supply chain contracts for a newsvendor problem in a developing economy'

In [9]:
#break a title up into lowercase 'tokens' (words); note this step alone does not remove stopwords
simple_preprocess(example)

['behavioral',
 'investigation',
 'of',
 'supply',
 'chain',
 'contracts',
 'for',
 'newsvendor',
 'problem',
 'in',
 'developing',
 'economy']

In [10]:
#initialize the data table
corpus = pd.DataFrame(columns=['raw_text', 'processed_text'])

In [11]:
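#build the table: keep the raw title, then remove stopwords and tokenize
#note: stopwords are removed before lowercasing, so capitalized ones (like 'The') are not caught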

for i, title in enumerate(title_list):
    
    corpus.loc[i, 'raw_text'] = title
    
    no_stop_title = remove_stopwords(title)
    
    processed = simple_preprocess(no_stop_title)
    
    corpus.loc[i, 'processed_text'] = processed

In [12]:
corpus.loc[0, 'processed_text']

['behavioral',
 'investigation',
 'supply',
 'chain',
 'contracts',
 'newsvendor',
 'problem',
 'developing',
 'economy']
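
One detail worth checking in this step is the order of operations: stopword removal runs on the raw title, before `simple_preprocess` lowercases it, so a capitalized stopword can slip through (a few later rows do start with tokens such as `the` or `how`). A simplified pure-Python sketch of the idea, where `STOPWORDS` and `tokenize` are small hypothetical stand-ins rather than gensim's actual stopword list and tokenizer:

```python
import re

# Tiny hypothetical stopword list; gensim's real list is much longer
STOPWORDS = {'a', 'the', 'of', 'for', 'in', 'how', 'to', 'your'}

def tokenize(text):
    """Lowercase and keep alphabetic runs of 2+ characters (rough stand-in)."""
    return re.findall(r'[a-z]{2,}', text.lower())

def filter_then_tokenize(title):
    """Mirror the notebook's order: case-sensitive filter, then lowercase."""
    kept = ' '.join(w for w in title.split() if w not in STOPWORDS)
    return tokenize(kept)

def tokenize_then_filter(title):
    """Lowercase first so capitalized stopwords are removed too."""
    return [t for t in tokenize(title) if t not in STOPWORDS]

title = 'How to secure your supply chain'
print(filter_then_tokenize(title))  # 'How' survives the case-sensitive filter
print(tokenize_then_filter(title))
```

In the notebook, the equivalent change would presumably be lowercasing the title before passing it to `remove_stopwords`.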

In [13]:
#put bigrams (two-word phrases) together
bigrammer = Phrases(corpus['processed_text'], threshold=5)

In [14]:
#example of the bigrammer putting the bigrams together (supply chain turns into 1 token)
bigrammer[corpus.loc[7, 'processed_text']]

['blockchain_technology',
 'relationships',
 'sustainable',
 'supply_chain',
 'management']
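
For intuition about `threshold`: as far as I understand gensim's default scorer, a word pair gets merged when `(count(a, b) - min_count) / (count(a) * count(b)) * vocab_size` exceeds the threshold. A pure-Python sketch of that score on a toy corpus (the function name, the toy documents, the toy threshold of 1.0, and `min_count=1` are assumptions for illustration, not the notebook's settings):

```python
from collections import Counter

def bigram_scores(docs, min_count=1):
    """Score adjacent word pairs with the formula described above."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (n - min_count) / (unigrams[pair[0]] * unigrams[pair[1]]) * vocab_size
        for pair, n in bigrams.items()
    }

toy_docs = [
    ['supply', 'chain', 'risk'],
    ['supply', 'chain', 'finance'],
    ['green', 'supply', 'chain'],
    ['risk', 'management'],
]
scores = bigram_scores(toy_docs)

# Pairs scoring above the toy threshold would be joined with an underscore
merged = sorted(pair for pair, s in scores.items() if s > 1.0)
print(merged)
```

Pairs that co-occur often relative to how common each word is on its own score high, which is why `supply chain` ends up as a single token.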

In [15]:
#run the bigrammer on all the processed data
corpus['bigram_proc_text'] = [bigrammer[doc] for doc in corpus['processed_text']]

In [16]:
teststr = ' hi my name is jake'

In [17]:
teststr[1:]

'hi my name is jake'

In [18]:
#join the tokens back into one string per document, as that's what scikit-learn's vectorizer needs
new_corpus = [' '.join(doc) for doc in corpus['bigram_proc_text']]

In [19]:
#transform documents into vectors using tf-idf (term frequency-inverse document frequency)
#lowercase=False because the text was already lowercased during preprocessing
vectorizer = TfidfVectorizer(lowercase=False)

In [20]:
#run the vectorizer
vectorized = vectorizer.fit_transform(new_corpus)

In [22]:
vectorized[0]

<1x2036 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>
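
To make the sparse row above less opaque, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as I understand them: raw term count times a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc.split()))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        raw = {t: c * idf[t] for t, c in counts.items()}
        # fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy_corpus = [
    'supply_chain risk',
    'supply_chain finance',
    'supply_chain sustainability',
]
rows = tfidf_rows(toy_corpus)
print(rows[0])
```

Because `supply_chain` appears in every toy document, it gets a lower weight than the rarer term in each row, which is the behavior we want for a corpus where nearly every title mentions supply chains.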

In [69]:
#run kmeans clustering with 50 clusters (no random_state is set, so cluster numbering changes between runs)
kmeans = KMeans(n_clusters=50).fit(vectorized)

In [70]:
kmeans.cluster_centers_

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Often you'll want the documents closest to each cluster center, since they are the best "exemplars" of the cluster. It's all about going from the cluster to useful information about what it contains.

In [72]:
#this scikit-learn function gives, for each cluster center, the index of the nearest document vector. not sure how reliable this is
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectorized)

In [73]:
closest
#looks okay

array([581, 607, 627, 466, 235, 123, 587, 311, 541, 831, 634, 746, 122,
       395, 263, 309, 740, 147, 719, 138, 631, 683, 407, 832, 432, 149,
       480, 530, 742, 756, 625, 226, 167,  76, 682, 844, 916, 290, 103,
       228, 729,  23, 305, 315, 860, 566, 361,  39, 154, 246])
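
For reference, `pairwise_distances_argmin_min` returns, for each cluster center, the row index of the nearest document vector (plus that distance, which the cell above discards). A brute-force pure-Python sketch of the same idea with Euclidean distance; `nearest_rows` and the toy points are hypothetical:

```python
import math

def nearest_rows(centers, points):
    """For each center, return the index of the closest point."""
    closest_idx = []
    for c in centers:
        dists = [math.dist(c, p) for p in points]
        closest_idx.append(dists.index(min(dists)))
    return closest_idx

toy_points = [(0.0, 0.1), (0.9, 1.0), (0.2, 0.0), (1.0, 0.8)]
toy_centers = [(0.05, 0.1), (0.95, 0.95)]
print(nearest_rows(toy_centers, toy_points))
```

Each index can then be used to look up the raw title, as the next cells do with `corpus.loc[closest[i], 'raw_text']`.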

In [74]:
#let's check out a few
corpus.loc[closest[0], 'raw_text']

'Identifying trade-offs between sustainability dimensions in the supply chain of biodiesel in Colombia'

In [75]:
corpus.loc[closest[1], 'raw_text']

"Supply chain coordination to optimize manufacturer's capacity procurement decisions through a new commitment-based model with penalty and revenue …"

In [76]:
corpus.loc[closest[2], 'raw_text']

'Architectures for Green-Field Supply Chain Integration-Supply Chain Integration Design'

In [77]:
corpus.loc[closest[3], 'raw_text']

#not that impressive. let's predict the data we have and see where they get clustered.

'Optimal planning of municipal solid waste management systems in an integrated supply chain network'

In [78]:
#cluster the data we have
corpus['kmeans_pred'] = kmeans.predict(vectorized)

In [79]:
#show largest clusters
corpus['kmeans_pred'].value_counts()

45    75
25    46
6     43
34    31
22    31
24    28
29    27
35    27
7     26
19    26
46    26
20    25
16    24
23    23
4     22
3     22
14    22
2     21
49    21
15    20
37    20
40    20
32    19
48    18
8     18
31    17
11    17
21    17
13    17
44    16
27    15
42    15
17    15
30    14
28    14
26    13
43    13
1     13
39    12
18    12
41    11
5     11
47    11
38    10
9      9
33     9
12     8
10     8
36     7
0      3
Name: kmeans_pred, dtype: int64

In [83]:
#print a few clusters
corpus[corpus['kmeans_pred']==6].head(10)

#this one is all about sustainability

index,raw_text,processed_text,bigram_proc_text,kmeans_pred
5,Supply chain management in industrial marketin...,"[supply, chain, management, industrial, market...","[supply_chain, management, industrial, marketi...",6
10,Resilience of medium-sized firms to supply cha...,"[resilience, medium, sized, firms, supply, cha...","[resilience, medium, sized, firms, supply_chai...",6
28,Competition policy and antitrust law: implicat...,"[competition, policy, antitrust, law, implicat...","[competition, policy, antitrust, law, implicat...",6
51,The Promise: Signaling Sustainability in Suppl...,"[the, promise, signaling, sustainability, supp...","[the, promise, signaling, sustainability, supp...",6
98,Reporting on supply chain sustainability: Meas...,"[reporting, supply, chain, sustainability, mea...","[reporting, supply_chain, sustainability, meas...",6
134,Supply Chain Linked Sustainability Assessment ...,"[supply, chain, linked, sustainability, assess...","[supply_chain, linked, sustainability, assessm...",6
143,Does social capital matter for supply chain re...,"[does, social, capital, matter, supply, chain,...","[does, social, capital, matter, supply_chain, ...",6
144,Supply chain sustainability risk and assessment,"[supply, chain, sustainability, risk, assessment]","[supply_chain, sustainability, risk_assessment]",6
152,Information sharing and the impact of shutdown...,"[information, sharing, impact, shutdown, polic...","[information_sharing, impact, shutdown, policy...",6
222,"Exploring the Social, Economic and Environment...","[exploring, social, economic, environmental, f...","[exploring, social, economic, environmental, f...",6


In [85]:
corpus[corpus['kmeans_pred']==45].head(20)
#can't really tell what this one is about. probably needs more text (e.g. abstracts instead of titles)

index,raw_text,processed_text,bigram_proc_text,kmeans_pred
9,Personal relationships and loyalty in supply c...,"[personal, relationships, loyalty, supply, chain]","[personal, relationships, loyalty, supply_chain]",45
11,Does supply chain visibility affect operating ...,"[does, supply, chain, visibility, affect, oper...","[does, supply_chain, visibility, affect, opera...",45
20,Toward a Digitally Dominant Paradigm for twent...,"[toward, digitally, dominant, paradigm, twenty...","[toward, digitally, dominant, paradigm, twenty...",45
31,From consumer to prosumer: a supply chain revo...,"[from, consumer, prosumer, supply, chain, revo...","[from, consumer, prosumer, supply_chain, revol...",45
36,How to secure your supply chain,"[how, secure, supply, chain]","[how, secure, supply_chain]",45
69,Radio frequency identification (RFID) technolo...,"[radio, frequency, identification, rfid, techn...","[radio, frequency, identification, rfid, techn...",45
73,Performance Impact Analysis of Disruption Prop...,"[performance, impact, analysis, disruption, pr...","[performance, impact, analysis, disruption, pr...",45
84,Designing response supply chain against bioatt...,"[designing, response, supply, chain, bioattacks]","[designing, response, supply_chain, bioattacks]",45
104,Achieving sustainable performance in a data-dr...,"[achieving, sustainable, performance, data, dr...","[achieving, sustainable, performance, data_dri...",45
106,Broadening the perspective of supply chain fin...,"[broadening, perspective, supply, chain, finan...","[broadening, perspective, supply_chain, financ...",45


In [86]:
corpus[corpus['kmeans_pred']==37].head(20)
#service

index,raw_text,processed_text,bigram_proc_text,kmeans_pred
219,Robust gasoline closed loop supply chain desig...,"[robust, gasoline, closed, loop, supply, chain...","[robust, gasoline, closed_loop, supply_chain, ...",37
290,Service quality coordination contracts for onl...,"[service, quality, coordination, contracts, on...","[service, quality, coordination, contracts, on...",37
355,Moving sequence preference in coopetition outs...,"[moving, sequence, preference, coopetition, ou...","[moving, sequence, preference, coopetition, ou...",37
453,Developing the framework of sustainable servic...,"[developing, framework, sustainable, service, ...","[developing, framework, sustainable, service, ...",37
471,Supply chain-a service delivery enhancement or...,"[supply, chain, service, delivery, enhancement...","[supply_chain, service, delivery, enhancement,...",37
482,3D Printing for Supply Chain Service Companies,"[printing, supply, chain, service, companies]","[printing, supply_chain, service, companies]",37
495,Component Procurement for an Assembly Supply C...,"[component, procurement, assembly, supply, cha...","[component, procurement, assembly, supply_chai...",37
548,Coordination Effects of Market Power and Fairn...,"[coordination, effects, market, power, fairnes...","[coordination, effects, market, power, fairnes...",37
649,After-sale Service Deployment and Information ...,"[after, sale, service, deployment, information...","[after, sale, service, deployment, information...",37
650,Price and Service Competition in a Tourism Sup...,"[price, service, competition, tourism, supply,...","[price, service, competition, tourism, supply_...",37


In [87]:
corpus[corpus['kmeans_pred']==41].head(20)
#cyber and IoT

index,raw_text,processed_text,bigram_proc_text,kmeans_pred
0,A behavioral investigation of supply chain con...,"[behavioral, investigation, supply, chain, con...","[behavioral, investigation, supply_chain, cont...",41
23,Cyber risk from IoT technologies in the supply...,"[cyber, risk, iot, technologies, supply, chain...","[cyber, risk, iot, technologies, supply_chain,...",41
37,Towards Industry 4.0: Mapping digital technolo...,"[towards, industry, mapping, digital, technolo...","[towards, industry, mapping, digital, technolo...",41
350,"Risk, Trustworthiness, and Justice: Understand...","[risk, trustworthiness, justice, understanding...","[risk, trustworthiness, justice, understanding...",41
384,Towards a decision support framework for techn...,"[towards, decision, support, framework, techno...","[towards, decision_support, framework, technol...",41
437,Decision support system for light petroleum pr...,"[decision, support, light, petroleum, products...","[decision_support, light, petroleum, products,...",41
828,PERCEPTUS: Predictive complex event processing...,"[perceptus, predictive, complex, event, proces...","[perceptus, predictive, complex, event, proces...",41
939,The Review of Risk Identification of E-Commerc...,"[the, review, risk, identification, commerce, ...","[the, review, risk, identification, commerce, ...",41
942,How the Digital Economy is Impacting the Suppl...,"[how, digital, economy, impacting, supply, chain]","[how, digital, economy, impacting, supply_chain]",41
960,Defining granularity levels for supply chain t...,"[defining, granularity, levels, supply, chain,...","[defining, granularity, levels, supply_chain, ...",41


In [None]:
#sustainability/green/env friendly, supply disruption, cyber risks from IoT
#supplier evaluation and selection (optimized)

#all of the typical ways people deal with uncertainty & decision making (like fuzzy logic)
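
The labels above come from reading titles by eye. One possible shortcut, sketched here in pure Python, is to list the highest-weighted vocabulary terms in each cluster center. The `top_terms` helper, the toy vocabulary, and the toy center weights are all hypothetical; in the notebook the inputs would presumably be `kmeans.cluster_centers_` and the vectorizer's fitted vocabulary.

```python
def top_terms(center, vocab, k=3):
    """Return up to k terms with the largest positive weight in one center."""
    ranked = sorted(zip(center, vocab), reverse=True)
    return [term for weight, term in ranked[:k] if weight > 0]

toy_vocab = ['cyber', 'green', 'iot', 'risk', 'sustainability']
toy_centers = [
    [0.0, 0.6, 0.0, 0.1, 0.7],  # looks like a sustainability cluster
    [0.5, 0.0, 0.4, 0.3, 0.0],  # looks like a cyber/IoT cluster
]
for i, center in enumerate(toy_centers):
    print(i, top_terms(center, toy_vocab))
```

This will not replace reading the titles, but it gives a quick first guess at a label for all 50 clusters at once.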

## Do LDA with pyLDAvis for a little more intuition

LDA assumes that each "topic" (what a doc is about) can be described by a group of words associated with that topic.

Like, for a document about dogs, the words "dog", "bowl", "paw", and "woof" are the words for that topic.

When we see those words, we say, "oh, this topic is about dogs."
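
A toy illustration of that idea, not LDA itself (nothing is learned here): each topic is a hand-written word distribution, and a document is assigned to the topic under which its words are most probable. All names and probabilities below are invented.

```python
import math

# Hand-written topic -> word probabilities (invented for illustration)
toy_topics = {
    'dogs': {'dog': 0.4, 'bowl': 0.2, 'paw': 0.2, 'woof': 0.2},
    'supply_chain': {'supplier': 0.4, 'inventory': 0.3, 'risk': 0.3},
}
UNSEEN = 1e-6  # small probability for words a topic does not list

def best_topic(words):
    """Pick the topic with the highest log-likelihood for the words."""
    def loglik(dist):
        return sum(math.log(dist.get(w, UNSEEN)) for w in words)
    return max(toy_topics, key=lambda name: loglik(toy_topics[name]))

print(best_topic(['dog', 'paw', 'woof']))
print(best_topic(['inventory', 'risk']))
```

Real LDA goes further: it learns the word distributions from the corpus and lets each document be a mixture of several topics rather than belonging to exactly one.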

In [32]:
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [88]:
lda = LatentDirichletAllocation(n_components=5)

The tough part is that we don't know how many topics there are. So, to save time, I'll just start at five.

In [89]:
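#fit the topic model (note: LDA is defined over word counts; here it is fit on the tf-idf weights from above)
lda.fit(vectorized)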



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=15, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

Now let's visualize with pyLDAvis, which gives an interactive view of the topics and their most relevant words.

In [90]:
pyLDAvis.sklearn.prepare(lda, vectorized, vectorizer)