<a href="https://colab.research.google.com/github/patimus-prime/ML_notebooks/blob/master/sci_pub_db_builder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

"There are many powers in this world, for good or for evil. Some are greater than I am. Against some I have not yet been measured."

-- Gandalf

---

So, the most recent open-source way to access scientific publication data is OpenAlex. Google Scholar is also possible, but it isn't guaranteed to offer a graph-style query method, and the `scholarly` package isn't even an official client. So... we're going to try the OpenAlex API. Going this route also opens the possibility of storing all of this in a database and building out a whole app.

In [None]:
!pip install pyalex retrying openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.0-py3-none-any.whl (70 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m70.1/70.1 KB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: openai
Successfully installed openai-0.27.0


This next one validates our approach. The cursor method implemented below is shown in https://github.com/ourresearch/openalex-api-tutorials/blob/main/notebooks/getting-started/paging.ipynb

In [None]:
import requests
import json

# Set up the API endpoint and parameters
endpoint = "https://api.openalex.org/authors"
params = {
          # "filter": "last_known_institution.id:I138006243",
          "filter": "display_name:Sorooshian",
          "per-page": 50,
          "mailto": "pat@patrickfinnerty.com", # puts us in OpenAlex's polite pool
          "cursor": "*" # initialize cursor
}

# Send the initial request
response = requests.get(endpoint, params=params)
results = response.json()
print(results)


{'meta': {'count': 1, 'db_response_time_ms': 35, 'page': None, 'per_page': 50, 'next_cursor': 'IlsxLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvQTMxNTE4Mzc2NDgnXSI='}, 'results': [{'id': 'https://openalex.org/A3151837648', 'orcid': None, 'display_name': 'Sorooshian', 'display_name_alternatives': [], 'works_count': 1, 'cited_by_count': 0, 'ids': {'openalex': 'https://openalex.org/A3151837648', 'mag': '3151837648'}, 'last_known_institution': None, 'x_concepts': [{'id': 'https://openalex.org/C17744445', 'wikidata': 'https://www.wikidata.org/wiki/Q36442', 'display_name': 'Political science', 'level': 0, 'score': 100.0}, {'id': 'https://openalex.org/C18903297', 'wikidata': 'https://www.wikidata.org/wiki/Q7150', 'display_name': 'Ecology', 'level': 1, 'score': 100.0}, {'id': 'https://openalex.org/C39432304', 'wikidata': 'https://www.wikidata.org/wiki/Q188847', 'display_name': 'Environmental science', 'level': 0, 'score': 100.0}, {'id': 'https://openalex.org/C58640448', 'wikidata': 'https://www.wikidata.or

Now actually grab everything from U of A

In [None]:
# Set up the API endpoint and parameters
endpoint = "https://api.openalex.org/authors"
params = {
          "filter": "last_known_institution.id:I138006243", # get U of A authors
          "per-page": 1,
          "mailto": "pat@patrickfinnerty.com", # puts us in OpenAlex's polite pool
          "cursor": "*" # initialize cursor
}

results = [] # stores all responses/results

while params["cursor"]:
  response = requests.get(endpoint, params=params)
  currentResponse = response.json()
  results += currentResponse["results"] # append new objects to results
  params["cursor"] = currentResponse["meta"]["next_cursor"] # update to next page.
  print(currentResponse)
  if len(results) >= 2:
    break

# print(len(results)) # print the length of the results list
# unique_entries = len(set([json.dumps(result) for result in results])) # get the number of unique entries
# print(unique_entries) # print the number of unique entries


{'meta': {'count': 37215, 'db_response_time_ms': 80, 'page': None, 'per_page': 1, 'next_cursor': 'IlszNTU2LCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvQTIxNDc4NjU2OTMnXSI='}, 'results': [{'id': 'https://openalex.org/A2147865693', 'orcid': 'https://orcid.org/0000-0003-3310-0131', 'display_name': 'Xiaohui Fan', 'display_name_alternatives': [], 'works_count': 3556, 'cited_by_count': 86052, 'ids': {'openalex': 'https://openalex.org/A2147865693', 'orcid': 'https://orcid.org/0000-0003-3310-0131', 'mag': '2147865693'}, 'last_known_institution': {'id': 'https://openalex.org/I138006243', 'ror': 'https://ror.org/03m2x1q45', 'display_name': 'University of Arizona', 'country_code': 'US', 'type': 'education'}, 'x_concepts': [{'id': 'https://openalex.org/C121332964', 'wikidata': 'https://www.wikidata.org/wiki/Q413', 'display_name': 'Physics', 'level': 0, 'score': 355.9}, {'id': 'https://openalex.org/C185592680', 'wikidata': 'https://www.wikidata.org/wiki/Q2329', 'display_name': 'Chemistry', 'level': 0, 'score': 

YES! This works: an earlier run of this loop got us 200 peeps (the cell as shown stops after 2). Fantastic. OK, let's work with this as a test set.
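
One thing to watch in the loop above: the length check only runs after a whole page has been appended, so the stopping point depends on the page size. Below is a minimal sketch of the same cursor loop as a generator that stops at an exact item count. The `fetch_page` callable and the `iter_cursor_pages` name are hypothetical; the only assumption about the API is the response shape already printed above (`meta.next_cursor` plus `results`).

```python
from typing import Callable, Dict, Iterator, Optional


def iter_cursor_pages(
    fetch_page: Callable[[str], Dict],
    limit: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield items from a cursor-paged API, stopping after `limit` items.

    `fetch_page(cursor)` is a hypothetical callable returning one decoded
    JSON page shaped like {"meta": {"next_cursor": ...}, "results": [...]}.
    """
    cursor = "*"  # the cursor is initialized with "*", as in the cells above
    yielded = 0
    while cursor and (limit is None or yielded < limit):
        page = fetch_page(cursor)
        for item in page["results"]:
            if limit is not None and yielded >= limit:
                return  # stop mid-page once the exact limit is reached
            yield item
            yielded += 1
        cursor = page["meta"]["next_cursor"]  # falsy once paging is done


# Example wiring with requests (makes network calls, so left commented out):
# def fetch_page(cursor):
#     params = {"filter": "last_known_institution.id:I138006243",
#               "per-page": 50, "cursor": cursor}
#     return requests.get("https://api.openalex.org/authors", params=params).json()
# authors = list(iter_cursor_pages(fetch_page, limit=200))
```

Passing the fetch step in as a function keeps the paging logic testable without touching the network.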

In [None]:
print(results)

[{'id': 'https://openalex.org/A2147865693', 'orcid': 'https://orcid.org/0000-0003-3310-0131', 'display_name': 'Xiaohui Fan', 'display_name_alternatives': [], 'works_count': 3556, 'cited_by_count': 86052, 'ids': {'openalex': 'https://openalex.org/A2147865693', 'orcid': 'https://orcid.org/0000-0003-3310-0131', 'mag': '2147865693'}, 'last_known_institution': {'id': 'https://openalex.org/I138006243', 'ror': 'https://ror.org/03m2x1q45', 'display_name': 'University of Arizona', 'country_code': 'US', 'type': 'education'}, 'x_concepts': [{'id': 'https://openalex.org/C121332964', 'wikidata': 'https://www.wikidata.org/wiki/Q413', 'display_name': 'Physics', 'level': 0, 'score': 355.9}, {'id': 'https://openalex.org/C185592680', 'wikidata': 'https://www.wikidata.org/wiki/Q2329', 'display_name': 'Chemistry', 'level': 0, 'score': 294.0}, {'id': 'https://openalex.org/C62520636', 'wikidata': 'https://www.wikidata.org/wiki/Q944', 'display_name': 'Quantum mechanics', 'level': 1, 'score': 284.3}, {'id': '

In [None]:
# first let's be sure this ain't BS
print(json.dumps(results, indent = 2))
# I ctrl+F for last_known_inst to verify these are unique

[
  {
    "id": "https://openalex.org/A2147865693",
    "orcid": "https://orcid.org/0000-0003-3310-0131",
    "display_name": "Xiaohui Fan",
    "display_name_alternatives": [],
    "works_count": 3556,
    "cited_by_count": 86052,
    "ids": {
      "openalex": "https://openalex.org/A2147865693",
      "orcid": "https://orcid.org/0000-0003-3310-0131",
      "mag": "2147865693"
    },
    "last_known_institution": {
      "id": "https://openalex.org/I138006243",
      "ror": "https://ror.org/03m2x1q45",
      "display_name": "University of Arizona",
      "country_code": "US",
      "type": "education"
    },
    "x_concepts": [
      {
        "id": "https://openalex.org/C121332964",
        "wikidata": "https://www.wikidata.org/wiki/Q413",
        "display_name": "Physics",
        "level": 0,
        "score": 355.9
      },
      {
        "id": "https://openalex.org/C185592680",
        "wikidata": "https://www.wikidata.org/wiki/Q2329",
        "display_name": "Chemistry",
        

Let's now see if we can iterate over these peeps and get their recent abstracts...

Results is a list of dictionaries. Yes.

In [None]:
# Loop through the authors and retrieve their publications
for author in results:#["data"]:
    # print(author['id'])
    author_id = author["id"] 
    # returns https://openalex.org/A2147865693
    # but has redirect/equivalent API:
    # eg https://api.openalex.org/authors?filter=openalex_id:A2147865693
    # or https://api.openalex.org/people/A2147865693
    works_api_url = author["works_api_url"] 

A case study: this guy is PROLIFIC, like very. Never heard of him while I was at U of A. This is what the query looks like and what we'll need to work on below:
https://api.openalex.org/works?filter=author.id:A2147865693

In [None]:
import pandas as pd
df = pd.DataFrame(results)
df.head()

Unnamed: 0,id,orcid,display_name,display_name_alternatives,works_count,cited_by_count,ids,last_known_institution,x_concepts,counts_by_year,works_api_url,updated_date,created_date
0,https://openalex.org/A2147865693,https://orcid.org/0000-0003-3310-0131,Xiaohui Fan,[],3556,86052,{'openalex': 'https://openalex.org/A2147865693...,"{'id': 'https://openalex.org/I138006243', 'ror...","[{'id': 'https://openalex.org/C121332964', 'wi...","[{'year': 2023, 'works_count': 3, 'cited_by_co...",https://api.openalex.org/works?filter=author.i...,2023-02-26T23:20:49.332194,2016-06-24
1,https://openalex.org/A2279422341,https://orcid.org/0000-0002-7743-3491,Vishnu Reddy,[],3080,31200,{'openalex': 'https://openalex.org/A2279422341...,"{'id': 'https://openalex.org/I138006243', 'ror...","[{'id': 'https://openalex.org/C86803240', 'wik...","[{'year': 2023, 'works_count': 1, 'cited_by_co...",https://api.openalex.org/works?filter=author.i...,2023-02-27T03:06:11.901262,2016-06-24


In [None]:
raw = pd.DataFrame(results)

In [None]:
df = pd.DataFrame()
df['ProfName'] = raw['display_name']
# split by /, then grab last string chunk.
# so in https://openalex.org/A2147865693
# only keeps that last ID. We could use the endpoint returned by the api etc.
# but I think it looks a little uglier and harder to manipulate
df['api_id'] = raw['id'].str.split('/').str[-1]  
df.head() #YAS YAS YAS 

Unnamed: 0,ProfName,api_id
0,Xiaohui Fan,A2147865693
1,Vishnu Reddy,A2279422341


Let's see... on an author's works page, each work's id is a W## number. These can be fed into pyAlex to fetch the dang abstract for each W, and then we can order/sort them.
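
As a possible alternative to one pyAlex call per work: the works records also carry an `abstract_inverted_index` field (it shows up as a column in the works dataframe further down in this notebook), mapping each word to the positions where it appears. A minimal sketch of rebuilding the plain text from that field follows. The function name is made up for illustration, and this is not necessarily how pyAlex does it internally.

```python
from typing import Dict, List, Optional


def abstract_from_inverted_index(index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild plain text from an inverted index of the form {word: [positions]}.

    Example input: {"We": [0, 69], "present": [1], ...}.
    Returns "" when the work has no index at all.
    """
    if not index:
        return ""
    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (position, word)
        for word, positions in index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)
```

Since the index already comes back with each works page, this would avoid a second request per work, which matters given the rate limiting hit later on.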


# Class definition

This could be pretty dang useful! Defining a class so this can be wrapped up easier in future development. All methods below are to acquire authors of interest, their works, and grab all into a dataframe. Primarily interested in abstracts, 'concepts' we can search later, author, and I guess citations.

In [None]:
# Set up the API endpoint and parameters
# https://api.openalex.org/works?filter=author.id:A2147865693

from pyalex import Works
import pandas as pd 
import requests
import json
import concurrent.futures # to multithread requests, but requires...
from retrying import retry
import time
# because of: HTTPError: 429 Client Error: TOO MANY REQUESTS for url: https://api.openalex.org/works/W4285082036

class abstractGetter:
  # optional type declarations but otherwise I may get confused, uncharted territory
  def __init__(self, 
               # default values
               # look up codes on OpenAlex
               institutionCode: str = 'I138006243', # U of A
               # FIXME: perPage appears to override how many abstracts and authors are returned.
               # The paging loops append a whole page before checking the count, so a page
               # size above 1 fetches more than numAuthors/numAbstracts asks for.
               perPage: int = 1, # determines how many objects queried at once per page for authors/works
               numAuthors: int = 2, # how many authors to get from the institution. -1 -> all of em
               numAbstracts: int = 2, # we have to punch the abstracts into chatGPT, which does have token limit 
               ):
    # first prop here not referenced but here FOR reference
    self.code_UniversityofArizona = 'I138006243'
    self.institutionCode = institutionCode
    self.perPage = 1 # deliberately pinned to 1; the perPage argument is ignored for now (see FIXME above)
    self.numAuthors = numAuthors
    self.numAbstracts = numAbstracts
    

  def getAuthors(self,):# institutionCode: str): #= None):
    # I've written thus far for one institution. We'll want to I think,
    # have multiple getAuthorsByX functions in future, or conditions with args.
    # so... beware future me, for MORE WORK!

    # if institutionCode is None:
        # institutionCode = self.institutionOfInterest

    # Set up the API endpoint and parameters
    endpoint = "https://api.openalex.org/authors"
    params = {
              "filter": "last_known_institution.id:"+self.institutionCode, # get U of A authors
              # can use "," I think to attach more filters. need to double check
              # so ... +",concepts.id:XXX,citations: etc"
              "per-page": self.perPage,
              "mailto": "pat@patrickfinnerty.com", # for polite
              "cursor": "*" # initialize cursor
    }

    results = [] # stores all responses/results

    while params["cursor"]:
      response = requests.get(endpoint, params=params)
      response.raise_for_status() # fail loudly on 429s and other HTTP errors
      currentResponse = response.json()
      results += currentResponse["results"] # append new objects to results
      params["cursor"] = currentResponse["meta"]["next_cursor"] # update to next page.
      # print(currentResponse)
      # break early if numAuthors has been specified (-1 means "get everyone")
      if self.numAuthors != -1 and len(results) >= self.numAuthors:
        break
    # a page can overshoot the target, so trim to the requested count
    if self.numAuthors != -1:
      results = results[:self.numAuthors]
    return results

  def authorsToList(self):
    code = self.institutionCode
    authorsReturned = pd.DataFrame(self.getAuthors()) #update if mult. gets wonky #self.institutionCode))
    df = pd.DataFrame()
    df['Author_Name'] = authorsReturned['display_name']
    # split by /, then grab last string chunk.
    # so in https://openalex.org/A2147865693
    # only keeps that last ID. We could use the endpoint returned by the api etc.
    # but I think it looks a little uglier and harder to manipulate
    df['api_id'] = authorsReturned['id'].str.split('/').str[-1]  
    return df

  def getAuthorWorks(self, authorid):

    endpoint = "https://api.openalex.org/works"
    params = {
              "filter": "author.id:"+authorid, # get works by this author
              "sort": "publication-date:desc",
              "per-page": self.perPage,
              "mailto": "pat@patrickfinnerty.com", # puts us in OpenAlex's polite pool
              "cursor": "*" # initialize cursor
    }

    results = [] # stores all responses/results
    while params["cursor"]:
      response = requests.get(endpoint, params=params)
      response.raise_for_status() # fail loudly on 429s and other HTTP errors
      currentResponse = response.json()
      results += currentResponse["results"] # append new objects to results
      params["cursor"] = currentResponse["meta"]["next_cursor"] # update to next page.
      # print(currentResponse)
      if len(results) >= self.numAbstracts:
        break
    # trim in case a page overshot the requested number of works
    return results[:self.numAbstracts]

  # this function executed per author!
  # ... tried to speed it up, but OpenAlex refuses. Therefore this left here
  # in case we use the SnapShot method later to copy ALL scientific work... what
  # decorator function, specifies the behavior of the retry-er below
  # @retry(stop_max_attempt_number=5, wait_exponential_multiplier=1000, wait_exponential_max=10000)
  def getRecentAbstracts(self, authorid):
    # temp_df because otherwise this will get difficult to follow processing json, 
    # at least for my brain
    # convert results to data frame
    works = self.getAuthorWorks(authorid)
    if not works:
      return "" # no works on record for this author, nothing to summarize
    tdf = pd.DataFrame(works)
    # get new feature and then make it a list, of the worksIds of the author
    tdf['WorksID'] = tdf['id'].str.split('/').str[-1]
    worksList = list(tdf['WorksID'])
    # we'll be passing this into chatGPT to assess, so...
    strList = []
    for work in worksList:
      w = Works()[work] #pyAlex call, gets the abstract for the work
      abstract = w['abstract']
      if abstract: # skip works without an abstract instead of appending "None"
        strList.append(str(abstract))
    # join once, after the loop, so an empty list gives "" rather than an unbound variable
    return "\n\n".join(strList)

  def populateDF(self):
    # 3 minutes for 150 abstracts to process with this method
    df = self.authorsToList()
    df['Recent_Abstracts_Superstring'] = df['api_id'].apply(lambda x:
                                 self.getRecentAbstracts(x))
    return df
    


In [None]:
getter = abstractGetter()
df = getter.populateDF()
df.head()

Unnamed: 0,Author_Name,api_id,Recent_Abstracts_Superstring
0,Xiaohui Fan,A2147865693,"Abstract Direct observations of low-mass, low-..."
1,Vishnu Reddy,A2279422341,Abstract Large constellations of bright artifi...


In [None]:
df.describe()

Unnamed: 0,Author_Name,api_id,Recent_Abstracts_Superstring
count,25,25,25
unique,25,25,25
top,Xiaohui Fan,A2147865693,"Abstract Direct observations of low-mass, low-..."
freq,1,1,1


# NOW READY FOR THE MAGIC OF JUST POPULATING VIA CHATGPT...
1. Some summary of these. Maybe two features: one in jargon, and one written as for a 5th grader.
2. Match based on input against the abstracts, and then generate the features. Maybe easier. We can experiment with both. For now let's get more authors. So far the API seems to implicitly return the authors in order of citations.
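
A minimal sketch of option 1, building one prompt per audience so the same abstracts can produce both a jargon feature and a 5th-grader feature. The helper name, the `AUDIENCES` wording, and the prompt text are all made up for illustration. The return value is just the list-of-dicts `messages` shape that the `openai.ChatCompletion.create` call in this notebook takes.

```python
from typing import Dict, List

# Hypothetical audience descriptions; tweak the wording freely.
AUDIENCES = {
    "jargon": "a specialist in the field, keeping the technical terms",
    "lay": "a fifth grader, avoiding jargon",
}


def build_summary_messages(abstracts: List[str], audience: str) -> List[Dict[str, str]]:
    """Build a single-turn chat `messages` list asking for one summary per abstract."""
    if audience not in AUDIENCES:
        raise ValueError(f"unknown audience: {audience!r}")
    # Number the abstracts so the model can tell where each one starts.
    numbered = "\n\n".join(
        f"Abstract {i}:\n{text}" for i, text in enumerate(abstracts, start=1)
    )
    prompt = (
        f"Summarize each of the {len(abstracts)} abstracts below for "
        f"{AUDIENCES[audience]}. Give one short summary per abstract.\n\n"
        f"{numbered}"
    )
    return [{"role": "user", "content": prompt}]
```

Usage would be something like `messages=build_summary_messages(abstract_list, "lay")`, where `abstract_list` is a hypothetical list of abstract strings for one author.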

In [None]:
# check nothing wonky is happening and correct # abstracts returned per author. BEWARE PER-PAGE
print(str(df['Recent_Abstracts_Superstring'][0]))


Abstract Direct observations of low-mass, low-metallicity galaxies at z ≳ 4 provide an indispensable opportunity for detailed inspection of the ionization radiation, gas flow, and metal enrichment in sources similar to those that reionized the universe. Combining the James Webb Space Telescope (JWST), Very Large Telescope/MUSE, and Atacama Large Millimeter/submillimeter Array, we present detailed observations of a strongly lensed, low-mass (≈10 7.6 M ⊙ ) galaxy at z = 3.98 (also see Vanzella et al.). We identify strong narrow nebular emission, including C iv λ λ 1548, 1550, He ii λ 1640, O iii ] λ λ 1661, 1666, [Ne iii ] λ 3868, [O ii ] λ 3727, and the Balmer series of hydrogen from this galaxy, indicating a metal-poor H ii region (≲0.12 Z ⊙ ) powered by massive stars. Further, we detect a metal-enriched damped Ly α system (DLA) associated with the galaxy with the H i column density of N H I ≈ 10 21.8 cm −2 . The metallicity of the associated DLA may reach the supersolar metallicity (≳

In [None]:
import os
import openai
import json
# Load your API key from an environment variable or secret management service.
# For a quick Colab session you can paste it in instead, but REMOVE IT AFTER THE SESSION.
openai.api_key = os.getenv("OPENAI_API_KEY", "")

strTry = str(df['Recent_Abstracts_Superstring'][0])
# in case you ever wanna try: you cannot just hand ChatGPT the API URL for a W### id
# instead of the abstract text. It will not work.

completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo", 
  messages=[{"role": "user",
             "content":
             # blank line between the instruction and the abstracts so they don't run together
             "Given the two abstracts below, provide a lay summary for each abstract:\n\n"+strTry}]
)

# choices is an array with 1 element, then reference 'message' and 'content' within that.
# print(completion) if you want to see data shape
content = completion['choices'][0]['message']['content']
print(content)



The article discusses the observation of a low-mass, low-metallicity galaxy at z≳4. The galaxy was analyzed using the James Webb Space Telescope (JWST), Very Large Telescope/MUSE, and Atacama Large Millimeter/submillimeter Array. The study found strong narrow nebular emission, indicating a metal-poor H ii region (≲0.12 Z⊙) powered by massive stars. The study also detected a metal-enriched damped Lyα system associated with the galaxy, with the H i column density of NH I ≈ 1021.8 cm−2. The article concludes that low-mass, low-metallicity galaxies, which dominate reionization, could be surrounded by a high covering fraction of metal-enriched, neutral-gaseous clouds. This implies that the metal enrichment of low-mass galaxies is highly efficient, and further supports that in low-mass galaxies, only a small fraction of ionizing radiation can escape through the interstellar or circumgalactic channels with low-column-density neutral gas.

The article presents new constraints on the volume-a

In [None]:
# if you really wanna see!
print(completion)


{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "\n\n\"All we have to decide is what to do with the time that is given us.\" - Gandalf",
        "role": "assistant"
      }
    }
  ],
  "created": 1678099489,
  "id": "chatcmpl-6r2cjOAWyGavrlOBr9kTPmjTHmTWM",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 22,
    "prompt_tokens": 15,
    "total_tokens": 37
  }
}


---
---
# Code that didn't work or was of no benefit, plus stray thoughts on how to implement things, is below. Retained to show what not to do, future Pat.

In [None]:


#### STUFF I TRIED TO GO FAST, BUT WE'RE RATE LIMITED BY OPENALEX SO IT'S WORSE
"""
def populateDF(self):
    df = self.authorsToList()
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Create a list of futures for each author
        futures = [executor.submit(self.getRecentAbstracts, x) for x in df['api_id']]

        # Iterate through the futures and update the dataframe
        for i, future in enumerate(concurrent.futures.as_completed(futures)):
            df.at[i, 'Recent_Abstracts_Superstring'] = future.result()

    return df




@retry(stop_max_attempt_number=5, wait_exponential_multiplier=1000, wait_exponential_max=10000)
  def getRecentAbstracts(self, authorid):
    # temp_df because otherwise this will get difficult to follow processing json, 
    # at least for my brain
    # convert results to data frame
    tdf = pd.DataFrame(self.getAuthorWorks(authorid))
    # get new feature and then make it a list, of the worksIds of the author
    tdf['WorksID'] = tdf['id'].str.split('/').str[-1]
    worksList = list(tdf['WorksID'])
    # we'll be passing this into chatGPT to assess, so...
    combinedAbstracts = "" # init
    
    for work in worksList:
      try:
        w = Works()[work]
      # Define the retry decorator with the max number of retries and the backoff factor
      except requests.exceptions.HTTPError as error:
        if error.response.status_code == 429:
            print('Rate limited. Waiting and retrying...')
            time.sleep(10)  # wait for 10 seconds
            # work_data = get_works('W4285082036')
            w = Works()[work]
            # do something with the work data
        else:
            raise error  # re-raise the exception if it's not a 429 error code
      # otherwise if it goes well
      combinedAbstracts += (str(w['abstract'])+'\n\n')
      # and re-loop
"""
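
Since the threaded attempt above mostly ran into 429s, a sequential call with exponential backoff may be the more useful building block. This is a minimal sketch, not what the notebook actually runs: `call_with_backoff` and `is_retryable` are hypothetical names, and the delays simply mirror the 1 s to 10 s range from the commented-out `@retry` settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying with exponentially growing waits on retryable errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as error:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(error):
                raise  # give up: out of attempts, or not a rate-limit style error
            # wait 1x, 2x, 4x ... the base delay, capped at max_delay
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise ValueError("max_attempts must be at least 1")


# Possible wiring for the 429 case (assumes the error carries a `.response`,
# as the HTTPError seen earlier in this notebook does; not run here):
# def is_rate_limited(error):
#     response = getattr(error, "response", None)
#     return response is not None and response.status_code == 429
# w = call_with_backoff(lambda: Works()[work], is_rate_limited)
```

Unlike the `try`/`except` in the string above, which retries exactly once, this keeps retrying up to `max_attempts` and re-raises anything that is not flagged as retryable.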

In [None]:
# getAuthorWorks(df['api_id'])
authorid = "A2147865693"
tdf = pd.DataFrame(getAuthorWorks(authorid))

In [None]:
tdf.head()

Unnamed: 0,id,doi,title,display_name,publication_year,publication_date,ids,primary_location,host_venue,type,...,best_oa_location,alternate_host_venues,referenced_works,related_works,ngrams_url,abstract_inverted_index,cited_by_api_url,counts_by_year,updated_date,created_date
0,https://openalex.org/W4322776333,https://doi.org/10.3847/2041-8213/aca1c4,Metal-enriched Neutral Gas Reservoir around a ...,Metal-enriched Neutral Gas Reservoir around a ...,2023,2023-02-01,{'openalex': 'https://openalex.org/W4322776333...,"{'is_oa': False, 'landing_page_url': 'https://...","{'id': 'https://openalex.org/S4210175824', 'is...",journal-article,...,,"[{'id': 'https://openalex.org/S4210175824', 'd...","[https://openalex.org/W1766563480, https://ope...","[https://openalex.org/W2047108234, https://ope...",https://api.openalex.org/works/W4322776333/ngrams,"{'Abstract': [0], 'Direct': [1], 'observations...",https://api.openalex.org/works?filter=cites:W4...,[],2023-03-05T20:57:51.788130,2023-03-03
1,https://openalex.org/W4315644105,https://doi.org/10.3847/1538-4357/aca678,(Nearly) Model-independent Constraints on the ...,(Nearly) Model-independent Constraints on the ...,2023,2023-01-01,{'openalex': 'https://openalex.org/W4315644105...,"{'is_oa': True, 'landing_page_url': 'https://d...","{'id': 'https://openalex.org/S1980519', 'issn_...",journal-article,...,"{'is_oa': True, 'landing_page_url': 'https://d...","[{'id': 'https://openalex.org/S1980519', 'disp...","[https://openalex.org/W1525855014, https://ope...","[https://openalex.org/W707654, https://openale...",https://api.openalex.org/works/W4315644105/ngrams,"{'Cosmic': [0], 'reionization': [1], 'was': [2...",https://api.openalex.org/works?filter=cites:W4...,[],2023-02-13T02:04:48.899552,2023-01-12
2,https://openalex.org/W4317951747,https://doi.org/10.3847/1538-4357/aca7ca,The Pan-STARRS1 z &gt; 5.6 Quasar Survey. III....,The Pan-STARRS1 z &gt; 5.6 Quasar Survey. III....,2023,2023-01-01,{'openalex': 'https://openalex.org/W4317951747...,"{'is_oa': True, 'landing_page_url': 'https://d...","{'id': 'https://openalex.org/S1980519', 'issn_...",journal-article,...,"{'is_oa': True, 'landing_page_url': 'https://d...","[{'id': 'https://openalex.org/S1980519', 'disp...","[https://openalex.org/W168946373, https://open...","[https://openalex.org/W1598056052, https://ope...",https://api.openalex.org/works/W4317951747/ngrams,"{'We': [0, 69], 'present': [1], 'the': [2, 11,...",https://api.openalex.org/works?filter=cites:W4...,[],2023-02-16T21:23:47.436531,2023-01-25


In [None]:
tdf['WorksID'] = tdf['id'].str.split('/').str[-1]  

# df = pd.DataFrame()
# df['ProfName'] = raw['display_name']
# # split by /, then grab last string chunk.
# # so in https://openalex.org/A2147865693
# # only keeps that last ID. We could use the endpoint returned by the api etc.
# # but I think it looks a little uglier and harder to manipulate
# df['api_id'] = raw['id'].str.split('/').str[-1]  
# df.head() #YAS YAS YAS 

In [None]:
l = list(tdf['WorksID'])

In [None]:
# from pyalex import Works, Authors, Venues, Institutions, Concepts
from pyalex import Works

In [None]:
aS = ""
for work in l:
  # print(work)
  w = Works()[work]
  # print(w['abstract'])
  aS += str(w['abstract'])
print(aS)

Abstract Direct observations of low-mass, low-metallicity galaxies at z ≳ 4 provide an indispensable opportunity for detailed inspection of the ionization radiation, gas flow, and metal enrichment in sources similar to those that reionized the universe. Combining the James Webb Space Telescope (JWST), Very Large Telescope/MUSE, and Atacama Large Millimeter/submillimeter Array, we present detailed observations of a strongly lensed, low-mass (≈10 7.6 M ⊙ ) galaxy at z = 3.98 (also see Vanzella et al.). We identify strong narrow nebular emission, including C iv λ λ 1548, 1550, He ii λ 1640, O iii ] λ λ 1661, 1666, [Ne iii ] λ 3868, [O ii ] λ 3727, and the Balmer series of hydrogen from this galaxy, indicating a metal-poor H ii region (≲0.12 Z ⊙ ) powered by massive stars. Further, we detect a metal-enriched damped Ly α system (DLA) associated with the galaxy with the H i column density of N H I ≈ 10 21.8 cm −2 . The metallicity of the associated DLA may reach the supersolar metallicity (≳

In [None]:
# df['FpDensityMorgan1'] = df['ROMol'].apply(lambda x: D.FpDensityMorgan1(x))
df['works'] = df['api_id'].apply(lambda x:
                                 TURBOFN(x))

Unnamed: 0,ProfName,api_id,works
0,Xiaohui Fan,A2147865693,
1,Vishnu Reddy,A2279422341,


In [None]:
w = Works()["W3128349626"]
w["abstract"]

Let's stop and think for a sec. We could keep the 'results' as json. And to each item, append the 5 most recent abstracts and their summaries. This at the end can then be transformed into a dataframe or maintained as json, and kept in a database to pull from on demand or... hmmm. 
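
A minimal sketch of that idea: keep each author record as the dict OpenAlex returned and hang the recent abstracts and their summaries off it. The `recent_work` key and the function name are made up for illustration. Nothing in OpenAlex defines them.

```python
import copy
import json
from typing import Dict, List


def attach_recent_work(author: Dict, abstracts: List[str], summaries: List[str]) -> Dict:
    """Return a copy of an author record with abstracts and summaries attached.

    `recent_work` is a hypothetical key added by us, not an OpenAlex field.
    """
    if len(abstracts) != len(summaries):
        raise ValueError("need exactly one summary per abstract")
    enriched = copy.deepcopy(author)  # leave the original API record untouched
    enriched["recent_work"] = [
        {"abstract": abstract, "summary": summary}
        for abstract, summary in zip(abstracts, summaries)
    ]
    return enriched


# Tiny demonstration with a trimmed-down record (placeholder text, not real summaries)
record = {"id": "https://openalex.org/A2147865693", "display_name": "Xiaohui Fan"}
enriched = attach_recent_work(record, ["abstract text"], ["summary text"])
print(json.dumps(enriched, indent=2))
```

Because the result is still plain JSON, it can go straight into a document store or a file, or be flattened into a dataframe later if that turns out to be handier.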

We may find multiple benefits leaving this as JSON and ... hmmm... so. Could
1) Leave as JSON, append summaries to these authors, as the most recent work. This could then be independent of the school etc. and college etc. 
If we leave the current JSON structure and just append within our database... let's say I decide to add the stuff from Caltech, or each school I'm looking at for potential future study or collaboration. I query it with what I'm looking for at that school... or in that country. And then it spits out matches at that school; maybe I can integrate with a region or a list of schools. Then, bam: I'm interested in stuff between electrical engineering, biology, and aerospace at Caltech or in southern California. BING BANG BOOM. Yes, you can query ChatGPT for this list as well and then cross-reference or something... let's see if it could take NLP input... yeah, we could make it work to check different universities and get rated top matches. And have different functions for all that.

This is kind of close to job matching, but using published literature over the job description. There are tags provided in the openalex database but... let's see how they're computed...

2) This was to consider compressing into a dataframe and then SQL if necessary. We'd have author, publications, etc. provided, bing bang boom. But... yeah, it's not abundantly clear this would be useful if we were also going for the above as a mega project. It would be far more limited and annoying to use.

THEREFORE we stick with json formats. We build this all out. If we're going to build a matching algorithm and shit, dunno. This is clearly getting pretty big. We're so far working on building out the database we'll match to. Then the rest of the suggestions etc. will get me VERY familiar with all those algorithms.
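
To make the matching idea concrete, here is a deliberately crude sketch: score each author's combined abstracts against a free-text query with bag-of-words cosine similarity and sort by score. This is only a stand-in for whatever NLP model ends up being used, and all the names are hypothetical. It has no notion of synonyms or meaning, it just counts shared words.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def term_counts(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_authors(query: str, abstracts_by_author: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (author, score) pairs, best match first."""
    query_counts = term_counts(query)
    scored = [
        (author, cosine_similarity(query_counts, term_counts(text)))
        for author, text in abstracts_by_author.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the dataframe from `populateDF`, the input could be built with something like `dict(zip(df['Author_Name'], df['Recent_Abstracts_Superstring']))`.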

BUT! We can have it live query maybe? Or better prebuilt with chat I think, so it can look over all profs at the university. 

Ways this benefits ME: Then I can ask about different fields and specify locations, read into the stuff. Is this a significant benefit versus just looking into the labs that make the most publications and headlines?

Umm... I think so, to explore and consider. I'd hope I could build this out reasonably quick. Database would be quick after this publication and chatGPT is figured out. Then the NLP and stuff, use pre-built ones no training. And then like either javascript or Django front end to make things easy and grab their shit, make suggestions. boom. EZ. 

So... maybe a couple weeks? And then I empower myself to make far more queries.

---

A test. These are the last 3 abstracts I found from this guy on Google Scholar. I'll plug them into OpenAI and query, to see if there's a way we can maybe match peeps based on their interests to the text.

---
URLs:

1. https://arxiv.org/abs/2302.04312
2. https://iopscience.iop.org/article/10.3847/1538-4357/aca7ca/meta
3. https://arxiv.org/abs/2301.07688

---


Characterizing the physical conditions (density, temperature, ionization state, metallicity, etc) of the interstellar medium is critical to our understanding of the formation and evolution of galaxies. Here we present a multi-line study of the interstellar medium in the host galaxy of a quasar at z~6.4, i.e., when the universe was 840 Myr old. This galaxy is one of the most active and massive objects emerging from the dark ages, and therefore represents a benchmark for models of the early formation of massive galaxies. We used the Atacama Large Millimeter Array to target an ensemble of tracers of ionized, neutral, and molecular gas, namely the fine-structure lines: [OIII] 88μm, [NII] 122μm, [CII] 158μm, and [CI] 370μm and the rotational transitions of CO(7-6), CO(15-14), CO(16-15), and CO(19-18); OH 163.1μm and 163.4μm; and H2O 3(0,3)-2(1,2), 3(3,1)-4(0,4), 3(3,1)-3(2,2), 4(0,4)-3(1,3), 4(3,2)-4(2,3). All the targeted fine-structure lines are detected, as are half of the targeted molecular transitions. By combining the associated line luminosities, the constraints on the dust temperature from the underlying continuum emission, and predictions from photoionization models of the interstellar medium, we find that the ionized phase accounts for about one third of the total gaseous mass budget, and is responsible for half of the total [CII] emission. It is characterized by high density (n~180 cm−3), typical of HII regions. The spectral energy distribution of the photoionizing radiation is comparable to that emitted by B-type stars. Star formation also appears to drive the excitation of the molecular medium. We find marginal evidence for outflow-related shocks in the dense molecular phase, but not in other gas phases. This study showcases the power of multi-line investigations in unveiling the properties of the star-forming medium in galaxies at cosmic dawn.

We present the z ≈ 6 type-1 quasar luminosity function (QLF), based on the Pan-STARRS1 (PS1) quasar survey. The PS1 sample includes 125 quasars at z ≈ 5.7–6.2, with −28 ≲ $M_{1450}$ ≲ −25. With the addition of 48 fainter quasars from the SHELLQs survey, we evaluate the z ≈ 6 QLF over −28 ≲ $M_{1450}$ ≲ −22. Adopting a double power law with an exponential evolution of the quasar density ($\Phi(z) \propto {10}^{k(z-6)}$; k = −0.7), we use a maximum likelihood method to model our data. We find a break magnitude of ${M}^{* }=-{26.38}_{-0.60}^{+0.79}\,\mathrm{mag}$, a faint-end slope of $\alpha =-{1.70}_{-0.19}^{+0.29}$, and a steep bright-end slope of $\beta =-{3.84}_{-1.21}^{+0.63}$. Based on our new QLF model, we determine the quasar comoving spatial density at z ≈ 6 to be $n({M}_{1450}\lt -26)={1.16}_{-0.12}^{+0.13}\,{\mathrm{cGpc}}^{-3}$. In comparison with the literature, we find the quasar density to evolve with a constant value of k ≈ −0.7, from z ≈ 7 to z ≈ 4. Additionally, we derive an ionizing emissivity of ${\epsilon }_{912}(z=6)={7.23}_{-1.02}^{+1.65}\times {10}^{22}\,\mathrm{erg}\,{{\rm{s}}}^{-1}\,{\mathrm{Hz}}^{-1}\,{\mathrm{cMpc}}^{-3}$, based on the QLF measurement. Given standard assumptions, and the recent measurement of the mean free path by Becker et al. at z ≈ 6, we calculate an H i photoionizing rate of ${\Gamma }_{\mathrm{H\,I}}(z=6)\approx 6\times {10}^{-16}\,{\mathrm{s}}^{-1}$, strongly disfavoring a dominant role of quasars in hydrogen reionization.

The eighteenth data release of the Sloan Digital Sky Surveys (SDSS) is the first one for SDSS-V, the fifth generation of the survey. SDSS-V comprises three primary scientific programs, or "Mappers": Milky Way Mapper (MWM), Black Hole Mapper (BHM), and Local Volume Mapper (LVM). This data release contains extensive targeting information for the two multi-object spectroscopy programs (MWM and BHM), including input catalogs and selection functions for their numerous scientific objectives. We describe the production of the targeting databases and their calibration- and scientifically-focused components. DR18 also includes ~25,000 new SDSS spectra and supplemental information for X-ray sources identified by eROSITA in its eFEDS field. We present updates to some of the SDSS software pipelines and preview changes anticipated for DR19. We also describe three value-added catalogs (VACs) based on SDSS-IV data that have been published since DR17, and one VAC based on the SDSS-V data in the eFEDS field.

---
# A bunch of stuff that didn't work is below: what not to do.

In [None]:

# Loop through the authors and retrieve their publications
# NOTE: fails as written. Judging by the responses printed in this notebook,
# OpenAlex keeps records under "results" (there is no "data" key),
# author["id"] is a full https://openalex.org/A... URL rather than a bare ID,
# and author records carry "display_name", not "first_name"/"last_name".
for author in results["data"]:
    author_id = author["id"]
    endpoint = f"https://api.openalex.org/authors/{author_id}/publications"
    response = requests.get(endpoint)
    publications = response.json()

    # Extract the two most recent abstracts for each author
    abstracts = []
    for publication in publications["data"]:
        if "abstract" in publication:
            abstracts.append(publication["abstract"])
        if len(abstracts) == 2:
            break

    # Store the results in a data structure or write to a file
    author_name = author["first_name"] + " " + author["last_name"]
    print(f"{author_name}: {abstracts}")
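
For contrast, here is a minimal sketch of what the cell above was reaching for, written against the raw OpenAlex REST API with only the standard library. It is not the notebook's original method. It assumes that list responses keep records under `results` (as in the outputs printed in this notebook), that works can be filtered with `author.id` and sorted with `publication_date:desc`, and that abstracts arrive as an `abstract_inverted_index` mapping each word to its positions. The helper names `reconstruct_abstract` and `recent_abstracts` are hypothetical, and field names may have changed since this notebook was written.

```python
import json
import urllib.parse
import urllib.request

WORKS_ENDPOINT = "https://api.openalex.org/works"


def reconstruct_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if not inverted_index:
        return None
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    # Sorting the (position, word) pairs restores the original word order.
    return " ".join(word for _, word in sorted(positioned))


def recent_abstracts(author_id, n=2, mailto=None):
    """Return up to n abstracts from an author's most recent works.

    author_id may be a bare ID ("A2147865693") or the full
    "https://openalex.org/A..." URL that the API returns.
    """
    short_id = author_id.rsplit("/", 1)[-1]
    params = {
        "filter": f"author.id:{short_id}",  # assumed filter name
        "sort": "publication_date:desc",    # assumed sort key
        "per-page": 25,
    }
    if mailto:
        params["mailto"] = mailto  # polite pool, as elsewhere in this notebook
    url = f"{WORKS_ENDPOINT}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as response:
        page = json.load(response)

    abstracts = []
    for work in page["results"]:
        text = reconstruct_abstract(work.get("abstract_inverted_index"))
        if text:
            abstracts.append(text)
        if len(abstracts) == n:
            break
    return abstracts
```

Calling `recent_abstracts("https://openalex.org/A2147865693", mailto="you@example.com")` would make a single network request; works without an abstract are skipped, so fewer than `n` abstracts may come back.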


In [None]:
import scholarly  # NOTE: the scholarly docs use "from scholarly import scholarly"


# Define a function to get the two most recent publications for a professor
# ChatGPT implementation doesn't work out of the box, trying...
# https://scholarly.readthedocs.io/en/stable/AuthorParser.html
def get_publications(professor):
    search_query = scholarly.search_author(professor)
    author = next(search_query)
    author = scholarly.fill(author) #sections = []) # can specify just pubs if you want
    # author = next(search_query).fill(author)
    publications = author.publications
    publications.sort(key=lambda x: x.bib['year'], reverse=True)
    most_recent_publications = publications[:2] # change here for MORE of most recent pubs
    return most_recent_publications

# MAIN FN, EVENTUALLY WRAP OR MAKE THIS A CLASS
# Loop through each professor and get their two most recent publications
publications = []
for professor in armin_df['professor']: # change to feature name w/e has name
    try:
        pub_list = []
        for publication in get_publications(professor): # iterate over ea. prof's pubs
            pub_dict = {}
            pub_dict['title'] = publication.bib['title']
            pub_dict['abstract'] = publication.bib.get('abstract', None)
            pub_dict['url'] = publication.bib.get('url', None)
            pub_list.append(pub_dict)
        publications.append((professor, pub_list))
    except StopIteration:
        pass

# Print URLs or abstracts for each professor's two most recent publications
for professor, pub_list in publications:
    print(f"Professor {professor}:")
    for i, pub in enumerate(pub_list):
        print(f"\tPublication {i+1}:")
        if pub['url'] is not None:
            print(f"\t\tURL: {pub['url']}")
        if pub['abstract'] is not None:
            print(f"\t\tAbstract: {pub['abstract']}")

AttributeError: ignored

In [None]:
# alright, so there's no author parser anymore even though it still appears in the docs, wtf
# see: https://scholarly.readthedocs.io/en/stable/quickstart.html#search-for-articles-publications-and-return-generator-of-publication-objects


An example query to OpenAlex before I lose it:
https://api.openalex.org/authors?search=armin%20sorooshian
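
A small sketch of building that kind of URL programmatically, using only the standard library. The helper name `authors_url` is hypothetical; the `search`, `filter` and `per-page` parameters are the ones that appear in this notebook's own queries.

```python
from urllib.parse import quote, urlencode

AUTHORS_ENDPOINT = "https://api.openalex.org/authors"


def authors_url(params):
    """Build an /authors query URL from a dict of query parameters."""
    # quote_via=quote encodes spaces as %20 (not "+");
    # safe=":" keeps filter expressions like "field:value" readable.
    query = urlencode(params, quote_via=quote, safe=":")
    return f"{AUTHORS_ENDPOINT}?{query}"


print(authors_url({"search": "armin sorooshian"}))
# https://api.openalex.org/authors?search=armin%20sorooshian
```

Taking a dict rather than keyword arguments is deliberate: `per-page` contains a hyphen, so it can't be passed as a Python keyword.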

In [None]:
!pip install pyalex

In [None]:
# univ_of_arizona_authors = pa.Authors().search("affiliation: University of Arizona").get_all()
import pyalex
help(Authors())

NameError: ignored

In [None]:
import pyalex as pa
import pandas as pd

pyalex.config.email = "pat@patrickfinnerty.com"

ua_affiliation_id = "60019464" # affiliation ID for University of Arizona

# Search for authors affiliated with University of Arizona
# univ_of_arizona_authors = Authors().search_filter(last_known_institution.id == ua_affiliation_id).get()
ua = Authors().search_filter("University of Arizona")

TypeError: ignored

In [None]:
from pyalex import PyAlex

def get_publications(author_id):
    api = PyAlex()
    author_pubs = api.get_author_publications(author_id, max_results=10)
    publications = []
    for pub in author_pubs:
        pub_dict = {}
        pub_dict['title'] = pub['title']
        pub_dict['abstract'] = api.get_abstract(pub['id'])
        publications.append(pub_dict)
    return publications

pyalex.config.email = "pat@patrickfinnerty.com"

ua_affiliation_id = "60019464" # affiliation ID for University of Arizona
ua_authors = PyAlex().search_affiliation(ua_affiliation_id)["result"]

pub_list = []
for author in ua_authors:
    author_id = author["id"]
    publications = get_publications(author_id)
    pub_list.extend(publications)

print(pub_list)


In [None]:
# Set up the API endpoint and parameters
endpoint = "https://api.openalex.org/authors"
params = {
          "filter": "last_known_institution.id:I138006243", # get U of A authors
          "per-page": 2,
          "mailto": "pat@patrickfinnerty.com", # for polite
          "cursor": "*" # initialize cursor
}

results = [] # stores all responses/results

# Keep requesting pages until the API stops returning a cursor
# (or until we have the couple of results needed for this test).
while params["cursor"]:
    response = requests.get(endpoint, params=params)
    current_response = response.json()
    results += current_response["results"]  # append new objects to results
    params["cursor"] = current_response["meta"]["next_cursor"]  # update to next page
    print(current_response)
    if len(results) >= 2:
        break

# print(len(results)) # print the length of the results list
# unique_entries = len(set([json.dumps(result) for result in results])) # get the number of unique entries
# print(unique_entries) # print the number of unique entries


{'meta': {'count': 37062, 'db_response_time_ms': 55, 'page': None, 'per_page': 2, 'next_cursor': 'IlszMDgwLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvQTIyNzk0MjIzNDEnXSI='}, 'results': [{'id': 'https://openalex.org/A2147865693', 'orcid': 'https://orcid.org/0000-0003-3310-0131', 'display_name': 'Xiaohui Fan', 'display_name_alternatives': [], 'works_count': 3556, 'cited_by_count': 86052, 'ids': {'openalex': 'https://openalex.org/A2147865693', 'orcid': 'https://orcid.org/0000-0003-3310-0131', 'mag': '2147865693'}, 'last_known_institution': {'id': 'https://openalex.org/I138006243', 'ror': 'https://ror.org/03m2x1q45', 'display_name': 'University of Arizona', 'country_code': 'US', 'type': 'education'}, 'x_concepts': [{'id': 'https://openalex.org/C121332964', 'wikidata': 'https://www.wikidata.org/wiki/Q413', 'display_name': 'Physics', 'level': 0, 'score': 355.9}, {'id': 'https://openalex.org/C185592680', 'wikidata': 'https://www.wikidata.org/wiki/Q2329', 'display_name': 'Chemistry', 'level': 0, 'score': 

In [None]:
import pyalex
import pandas as pd

pyalex.config.email = "pat@patrickfinnerty.com"

# Search for authors affiliated with University of Arizona
query = pyalex.Query(author_affiliation='University of Arizona')
authors = query.search_authors()
author_names = [author.name for author in authors]

# Create a dataframe with one column for author names
df = pd.DataFrame({'professors': author_names})

# Define a function to get the two most recent abstracts for an author
def get_abstracts(author_name):
    works = pyalex.WorkSearch(author=author_name).search()
    works_sorted = sorted(works, key=lambda work: work.date, reverse=True)
    abstracts = []
    for work in works_sorted[:2]:
        if work.abstract:
            abstracts.append(work.abstract)
    return abstracts

# Apply the function to each row in the dataframe to get the abstracts
df['abstracts'] = df['professors'].apply(get_abstracts)

# Print the resulting dataframe
print(df)


AttributeError: ignored

In [None]:
from pyalex import Works, Authors, Venues, Institutions, Concepts

armin = Authors().search_filter(display_name="Armin Sorooshian").get()
w = Works().search_filter()

PyAlex isn't well documented enough for what I want homie. Therefore...

In [None]:
# THIS QUERY WORKS. WE CAN USE IT TO GET ALL OF EM
# https://api.openalex.org/authors?filter=last_known_institution.id:I138006243&per-page=50

In [None]:
import pyalex as pa
import pandas as pd

pyalex.config.email = "pat@patrickfinnerty.com"

ua_affiliation_id = "60019464" # affiliation ID for University of Arizona

# https://api.openalex.org/authors?filter=last_known_institution.id:I138006243&per-page=50
# Search for authors affiliated with University of Arizona
univ_of_arizona_authors = pa.Authors().search_filter("last_known_institution.id" == "I138006243").get()
                                                  # "University of Arizona").get()

# Create a dataframe with the list of authors
df = pd.DataFrame({"professors": [author.display_name for author in univ_of_arizona_authors]})

# Get the two most recent publication abstracts for each author
for index, row in df.iterrows():
    author_name = row['professors']
    author = pa.Authors().search_filter(display_name=author_name).get()
    works = author.works
    pub_list = []
    for work in works:
        pub_dict = {}
        pub_dict['title'] = work.title
        pub_dict['abstract'] = work.abstract
        pub_list.append(pub_dict)
    sorted_pub_list = sorted(pub_list, key=lambda x: x['title'], reverse=True)
    for i in range(min(len(sorted_pub_list), 2)):
        df.at[index, f"publication_{i+1}_title"] = sorted_pub_list[i]['title']
        df.at[index, f"publication_{i+1}_abstract"] = sorted_pub_list[i]['abstract']

print(df)


TypeError: ignored

In [None]:
import pyalex as pa
import pandas as pd

# Search for authors affiliated with University of Arizona
univ_of_arizona_authors = pa.Authors().search(query="affiliation: University of Arizona").get_all()

# Create a dataframe with the list of authors
professors_df = pd.DataFrame({"professors": [author.display_name for author in univ_of_arizona_authors]})

# Iterate over the list of professors
for professor in professors_df['professors']:
    try:
        pub_list = []
        for publication in pa.Works().search(query=f"au:{professor}"):
            pub_dict = {}
            pub_dict['title'] = publication.title
            pub_dict['abstract'] = publication.abstract
            pub_list.append(pub_dict)
            if len(pub_list) >= 2:
                break
        professors_df.loc[professors_df['professors'] == professor, 'publications'] = str(pub_list)
    except Exception as e:
        print(f"Error getting publications for {professor}: {e}")

print(professors_df)


TypeError: ignored