In [13]:
from openai import AsyncOpenAI
import json

# Load the API key from a local file; the context manager closes the handle
with open("secrets.json") as f:
    keys = json.load(f)

client = AsyncOpenAI(
    base_url="http://101.35.52.226:9090/v1",
    api_key=keys["api_key"],
    timeout=45,
)


async def chat(prompt, stream=False, temperature=0.0, n=1):
    """Send one user prompt; return the text (n == 1), the choices list (n > 1), or the raw stream."""
    response = await client.chat.completions.create(
        model="qwen-110b-chat",
        messages=[{"role": "user", "content": prompt}],
        stream=stream,
        max_tokens=512,
        temperature=temperature,
        n=n,
        stop=["<|endoftext|>", "<|im_end|>"],
    )
    if not stream:
        if n == 1:
            return response.choices[0].message.content.strip()
        return response.choices
    return response

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="talk-to-a-local-427009", location="us-central1")

gemini = GenerativeModel(model_name="gemini-1.5-flash-001")

### Goal
Gain insights about each professor, link all of the data together, and predict the quality of the professor.

We can use Reddit and RMP data to check whether the LLM's prediction is accurate.

There will probably be a lot of labeling to do, and the labels may span several colleges rather than just one. We therefore need a way to reshape the data into a consistent form, either with a good prompt or by some other method.
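
As a rough illustration of the validation idea, the sketch below compares LLM-predicted quality scores against RMP ratings for professors present in both sources. The function names (`pearson`, `mean_abs_error`, `validate`), the 1-5 scale, and all the numbers are assumptions made up for the example, not data from this project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def mean_abs_error(predicted, actual):
    """Average absolute gap between predicted and actual scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def validate(llm_scores, rmp_ratings):
    """Compare scores for professors that appear in both dicts (keyed by name)."""
    shared = sorted(set(llm_scores) & set(rmp_ratings))
    pred = [llm_scores[name] for name in shared]
    actual = [rmp_ratings[name] for name in shared]
    return {
        "n": len(shared),
        "pearson": pearson(pred, actual),
        "mae": mean_abs_error(pred, actual),
    }


# Hypothetical scores on an assumed 1-5 scale
llm_scores = {"prof_a": 4.5, "prof_b": 2.0, "prof_c": 3.5, "prof_d": 4.0}
rmp_ratings = {"prof_a": 4.0, "prof_b": 2.5, "prof_c": 3.0, "prof_e": 5.0}

report = validate(llm_scores, rmp_ratings)
print(report)
```

A high correlation with a low mean absolute error would suggest the LLM's predictions track the RMP ratings; only professors found in both sources are compared.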

In [39]:
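# What types of faculty does each college have, and what information do we have access to?
# The features are already mostly human-extracted; only the bios are long natural-language text.

# Sanity check in Chinese. Prompt: "Hello, who are you?"
# The reply below reads: "Hello, I am a super-large-scale language model from Alibaba Cloud, and my name is Tongyi Qianwen."
await chat("你好，你是谁？")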

'你好，我是来自阿里云的超大规模语言模型，我叫通义千问。'

##### Course Recommendation

This is different from the course recommendation module: it is driven by a professor's quality of teaching, not by what you can gain from a course. It requires an extensive review of their teaching capabilities.

##### Research Opportunities

When a user asks about potential research opportunities, the query may come from one of these perspectives:
- having an academic department
    - general inquiry: I want to do some biology research, what are some opportunities?
- having a research interest / direction
    - about a specific topic: I want to do some research on cancer, what are some opportunities?
- a specific professor: we can only pass along their contact details if the query is very specific, but we can recommend others who are similar to that professor
    - how to get involved? we can provide the courses and activities that the professor is involved in
    - how reputable are they? what have they published?

Extract each professor's areas of expertise from their bio and publications, which can then be turned into vectors.
Maybe also give an AI-based summary for each of these metrics while we're at it -> only if it is worthwhile.

For quicker search (once we have determined the domain and need a list of students), it should also output students or any faculty with similar work experience.
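
A minimal sketch of the vector idea and the similar-faculty lookup: here each person's expertise phrases become a sparse word-count vector, and candidates are ranked by cosine similarity. The word-count vectors are a stand-in chosen so the example runs without extra dependencies; a real embedding model could replace `to_vector`. The names `to_vector`, `cosine`, `most_similar`, and the `faculty` data are all hypothetical.

```python
from collections import Counter
from math import sqrt


def to_vector(phrases):
    """Turn a list of keyword phrases into a sparse word-count vector."""
    tokens = []
    for phrase in phrases:
        tokens.extend(phrase.lower().split())
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query_phrases, faculty, k=2):
    """Return the k faculty entries whose expertise is closest to the query."""
    query = to_vector(query_phrases)
    scored = [(cosine(query, to_vector(phrases)), name) for name, phrases in faculty.items()]
    scored.sort(reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:k]]


# Hypothetical expertise labels, e.g. the "subdomains" an LLM might extract
faculty = {
    "faculty_a": ["colloids and interfacial science", "liquid crystalline materials"],
    "faculty_b": ["reinforcement learning", "convolutional neural networks"],
    "faculty_c": ["nanoparticle synthesis", "interfacial science"],
}

results = most_similar(["interfacial science", "nanoparticle synthesis"], faculty)
print(results)
```

The same lookup works for students and professors alike, as long as each has a list of expertise phrases.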

Since some faculty are professors with actual content on their pages, while many are just students, we need to come up with a list of metrics (JSON fields) that the LLM must determine and output.

- whether it is worth giving a review
- how much experience they have
- all fields the faculty member is an expert in (keywords & phrases)
- any related fields, based on the bio, publications, and similar material
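
One possible way to pin down these fields is a small schema that every LLM reply is checked against. This is only a sketch: the class `FacultyLabels`, the helper `parse_labels`, and the field names `worth_reviewing`, `experience`, `expert_fields`, and `related_fields` are assumed names for the four items above, not something the notebook defines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FacultyLabels:
    worth_reviewing: bool                      # is there enough content to review?
    experience: str                            # free-text description of experience
    expert_fields: List[str] = field(default_factory=list)   # keywords & phrases
    related_fields: List[str] = field(default_factory=list)  # adjacent areas


def parse_labels(raw):
    """Validate a dict decoded from the LLM's JSON reply and build FacultyLabels."""
    if not isinstance(raw.get("worth_reviewing"), bool):
        raise ValueError("worth_reviewing must be a boolean")
    if not isinstance(raw.get("experience"), str):
        raise ValueError("experience must be a string")
    lists = {}
    for key in ("expert_fields", "related_fields"):
        value = raw.get(key, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a list of strings")
        lists[key] = value
    return FacultyLabels(raw["worth_reviewing"], raw["experience"], **lists)


# Hypothetical decoded reply
sample = {
    "worth_reviewing": True,
    "experience": "over three decades",
    "expert_fields": ["colloids"],
}
labels = parse_labels(sample)
print(labels)
```

Rejecting malformed replies early keeps bad labels out of later steps such as search and review generation.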

##### LLM based reviews
- research
- teaching
- reddit reviews
- cu reviews
- rmp reviews

These cover both the professor and the courses they teach.

In [64]:
EXTRACT_FACULTY_PROMPT = """
### FACULTY DETAILS
education: {education}
department: {department}
position: {position}

##### BIOGRAPHY
{bio}

##### RESEARCH INTERESTS
{research}

##### PUBLICATIONS
{publications}


### INSTRUCTIONS
From the above faculty details, please extract the following:
"subdomains": Which subdomains of academia does the faculty member have expertise in, based on the details above? (e.g. within Artificial Intelligence, we have reinforcement learning with human feedback, traditional ML algorithms & their proofs, convolutional neural networks in multi-object tracking, etc.). Subdomains are areas in which the faculty member has expertise. The output should be a list of strings, ideally of length 3-5.
"goals": What does the faculty member seem to be working towards, based on their research, publications, and biography? (e.g. "to develop a new algorithm for X", "to improve the efficiency of Y", etc.) THIS SHOULD BE A LIST OF STRINGS.
"experience": How experienced is the faculty member?
"summary": In one or two sentences, summarize the faculty member's expertise and research interests.

has_publication takes the value TRUE if you think the faculty member has publications in the ballpark of that subdomain, otherwise FALSE.

### OUTPUT REQUIREMENT
If there is not enough information to extract these fields, output "None".
The output must be JSON.

### EXTRACTED JSON DATA FROM FACULTY DETAILS:
"""

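Because the prompt asks for JSON but also allows the literal "None", and chat models often wrap JSON in a Markdown code fence, the reply usually needs a small parsing step before it can be used. The helper below is a sketch under those assumptions; `parse_llm_json` and `FENCE_RE` are hypothetical names, and returning `None` for unparseable or truncated replies is one design choice among several.

```python
import json
import re

# Matches a reply wrapped in a Markdown code fence, with or without a "json" tag
FENCE_RE = re.compile(r"^```(?:json)?\s*\n(.*?)\n?```\s*$", re.DOTALL)


def parse_llm_json(reply):
    """Return the decoded JSON value, or None if the reply is "None" or invalid."""
    text = reply.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1).strip()
    # The prompt tells the model to answer "None" when details are missing
    if text.strip('"') == "None":
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Covers truncated or otherwise malformed replies
        return None


fenced = parse_llm_json('```json\n{"subdomains": ["a", "b"]}\n```')
plain = parse_llm_json('{"subdomains": []}')
missing = parse_llm_json('"None"')
truncated = parse_llm_json('```json\n{"summary": "cut off')
print(fenced, plain, missing, truncated)
```

Callers can treat a `None` result as "skip or retry this faculty member" rather than letting a decoding error stop the whole run.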
In [63]:
import pandas as pd

eng_profs = pd.read_json("eng-prof-details.json")
eng_profs_list = pd.read_json("eng-prof-list.json")

# Use the first professor as a sample; this assumes both files list
# the professors in the same order
details = eng_profs.iloc[0]
listing = eng_profs_list.iloc[0]

prompt = EXTRACT_FACULTY_PROMPT.format(
    department=listing["department"],
    position=listing["position"],
    bio=details["bio"],
    research=details["research_interests"],
    education=details["education"],
    publications=details["selected_publications"],
)

response = await chat(prompt)

# response = gemini.generate_content(prompt).text
print(response)
# print(prompt)

```json
{
  "subdomains": ["Colloids and Interfacial Science", "Liquid Crystalline Materials", "Nanoparticle Synthesis"],
  "goals": ["Developing chemically tailored interfaces for advanced sensor technologies", "Exploring reversible control of surfactant properties for various applications", "Understanding and manipulating hydrophobic interactions at the nanoscale"],
  "experience": "Professor Abbott has over three decades of experience in chemical engineering, with positions at prestigious institutions including UC Davis, University of Wisconsin-Madison, and currently Cornell University. He has led departments and research centers, and is a Member of the US National Academy of Engineering.",
  "summary": "Nicholas Abbott is an expert in colloidal and interfacial phenomena, focusing on the design of surfactants with molecular triggers, colloidal forces in liquid crystals for sensor applications, and nanoscale hydrophobic interactions for biomolecular engineering. His work bridges fund

In [37]:
# for checking the generated content
i = 0

print(
    eng_profs.iloc[i]["prof_name"],
    eng_profs.iloc[i]["in_the_news"],
)
eng_profs.columns

Mohamed Abdelfattah []


Index(['prof_name', 'bio', 'research_interests', 'selected_publications',
       'awards', 'education', 'in_the_news', 'related_links',
       'teaching_interests', 'websites'],
      dtype='object')

In [77]:
EXTRACT_PROFESSOR_REVIEW_PROMPT = """
department: {department}
overall rating: {overall_rating}
overall difficulty: {overall_difficulty}

### PROFESSOR REVIEWS
{reviews}

### INSTRUCTIONS
From the above reviews, please extract the following:
"positive": What are some positive aspects of the professor based on the reviews? This should be a list of strings.
"negative": What are some negative aspects of the professor based on the reviews? This should be a list of strings.
"others": Other attributes that are neither positive nor negative. This should be a list of strings.
"summary": In one or two sentences, summarize the professor based on the reviews.

### OUTPUT REQUIREMENT
Do not make up any information; rely strictly on the professor reviews. Output fewer than 3 attributes if there is not enough relevant information to be classified into that category.
If there is not enough information to extract these fields, output "[]" as an empty JSON list.

### JSON DATA EXTRACTED FROM REVIEWS:
"""

In [78]:
ratings_df = pd.read_json('ratings.jsonl', lines=True)

# Use the first professor in the file as a sample
prof = ratings_df.iloc[0]
ratings_sample_list = [
    {
        "score": r["rating"],
        "difficulty": r["difficulty"],
        "review": r["comment"],
    }
    for r in prof["ratings"]
]
department = prof["department"]
overall_rating = prof["rating"]
overall_difficulty = prof["difficulty"]

prompt = EXTRACT_PROFESSOR_REVIEW_PROMPT.format(
    department=department,
    overall_rating=overall_rating,
    overall_difficulty=overall_difficulty,
    reviews=ratings_sample_list,
)

response = await chat(prompt)

print(response)

```json
{
"positive": ["easy grader", "helpful grader", "enjoyed the material", "really nice", "interesting course material"],
"negative": ["disorganized", "awkward", "incompetent", "frustrating", "boring", "WAYYYYYY too much work", "terrible grading system"],
"others": ["graded on participation, essays, and one group project", "assignments reminiscent of middle school", "nicely tries to be helpful"],
"summary": "Professor Edwards receives mixed reviews with some appreciating her easy grading and niceness, while others find her classes disorganized, with middle school-level assignments, and frustrating due to a lack of effective teaching and a heavy workload."
}
```
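
The cells above label a single professor (`iloc[0]`). Since there will be many rows to label, here is a minimal sketch of running prompts with bounded concurrency so the endpoint is not flooded. `label_all` and `fake_chat` are hypothetical names, `fake_chat` stands in for the notebook's `chat` so the example runs offline, and the concurrency limit is an arbitrary assumption.

```python
import asyncio


async def label_all(prompts, chat_fn, max_concurrency=4):
    """Run chat_fn over every prompt, at most max_concurrency at a time.

    Results come back in the same order as prompts; a failed call yields
    the exception object instead of stopping the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt):
        async with semaphore:
            try:
                return await chat_fn(prompt)
            except Exception as exc:  # record the failure and keep going
                return exc

    return await asyncio.gather(*(run_one(p) for p in prompts))


async def fake_chat(prompt):
    """Offline stand-in for the real chat() call."""
    await asyncio.sleep(0)
    if "bad" in prompt:
        raise RuntimeError("simulated failure")
    return f"reply to {prompt}"


results = asyncio.run(label_all(["p1", "bad", "p3"], fake_chat, max_concurrency=2))
print(results)
```

Inside the notebook, `await label_all(prompts, chat)` would replace the `asyncio.run(...)` call, since a notebook cell already runs inside an event loop.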


##### Career Orientations
    
When a user asks about potential career paths that a professor can help set them up for.
This requires the professor's research and previous track record. We may need some of their LinkedIn data on where they were before joining the university.

- having an academic department
- past experience

We need to be able to find a relevant professor who can set the student up for their career goals.