# TopicGPT Notebook
## Intro
This notebook runs TopicGPT's scripts to generate a list of level-1 and level-2 concepts (topics) from text and to assign those concepts to other texts. It also fits a simple logistic regression model as a baseline test of how well the assigned concepts predict each policy's label. Note that we have modified TopicGPT's scripts to use the open-source Llama-3 model instead of GPT-4.

## Setup
1. Set up topicGPT by following its README: https://github.com/chtmp223/topicGPT/tree/main. (Note that all scripts are already included in this topicGPT directory. Just install the requirements and read through the README to understand how data and prompts should be formatted in their respective folders.)
2. Install vLLM and set up its server: https://docs.vllm.ai/en/latest/getting_started/installation.html
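
Before running the generation scripts, it can help to confirm that the vLLM server is reachable. The sketch below assumes the server was started with vLLM's OpenAI-compatible API on `http://localhost:8000` (an assumption; adjust the address to your setup). The helper names `models_url` and `list_served_models` are hypothetical and are not part of topicGPT or vLLM.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the URL of the OpenAI-compatible model-listing endpoint."""
    return base_url.rstrip("/") + "/v1/models"


def list_served_models(base_url: str = "http://localhost:8000") -> list:
    """Return the model IDs reported by the server (raises if it is unreachable)."""
    with urllib.request.urlopen(models_url(base_url), timeout=5) as resp:
        payload = json.load(resp)
    # The OpenAI-style response lists models under the "data" key
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `list_served_models()` once the server is up should return the model name that is then passed to the scripts via `--deployment_name`.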

In [1]:
import os
import pandas as pd
import numpy as np
import sklearn
pd.set_option('display.max_rows', 500)

# LR model imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

## Generating Concepts
Below is the merged dataframe that Alex provided. If using concepts generated from other sources (e.g. ChatGPT, LLooM), you can skip to the `Assigning Topics` section below.

Otherwise, refer again to topicGPT's README for how to structure the input data as a JSONL file and how to run the generation scripts for level-1/2 concepts. Below is an example of generating these topics from 100 samples. Input data can be found in `./data/input` and scripts in `./script`.
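
As a quick illustration of the JSONL layout, the sketch below writes one JSON object per line with a `text` field (the column name this notebook renames to for the scripts) and reads it back. The rows are placeholders, and `to_jsonl` / `from_jsonl` are hypothetical helpers used only for this example; the notebook itself uses `DataFrame.to_json(..., orient='records', lines=True)`.

```python
import json

# Illustrative rows only; real rows come from the notebook's dataframes.
rows = [
    {"text": "Example policy text A"},
    {"text": "Example policy text B", "label": True},
]


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(payload):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


jsonl_text = to_jsonl(rows)
restored = from_jsonl(jsonl_text)
```

Extra keys such as `label` simply travel along with each record, which is convenient when the labels are needed again after assignment.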

In [2]:
city_council_data_df = pd.read_json('../data/full_newsworthiness_training_data.jsonl',  lines=True)
final_matching_df = pd.read_csv('../data/final-matching-articles-and-meetings.csv', index_col=0)
# full article information/ text
# the original final_matching_df doesn't have the full article text, 
# so you might want to look at the actual text

json_file = '../data/sfchron-fetched-articles.jsonl/sfchron-fetched-articles.jsonl'
import json
# Read one JSON record per line; the context manager closes the file afterwards
with open(json_file, encoding="utf8") as f:
    articles = [json.loads(line) for line in f]

sf_articles_df = pd.DataFrame(articles)
final_matching_df['key'] = (final_matching_df['article_url']
     .str.split(')')
     .str.get(-1)
     .str.replace('https://', 'http://', regex=False)
     .str.replace('www.', '', regex=False)  # literal match, so '.' is not treated as a regex wildcard
     .str.replace('http://sfchronicle.com', '', regex=False)
)
matching_df_with_full_text = (
    sf_articles_df
         .assign(key=lambda df: df['article_url'].str.split('sfchronicle.com').str.get(-1))
         [['key', 'article_text']]
         .merge(final_matching_df, on='key', how='right')
)

In [3]:
# Merging policy text with true/false label
renamed_article_matched_df = matching_df_with_full_text.rename(columns={
    'meeting text': 'policy text',
    'summary_text': 'article summary text',
    'article_text': 'article full text'
})
renamed_city_council_data_df = city_council_data_df.rename(columns={
    'text': 'policy text',
    'transcribed_text': 'meeting transcribed text'
})
full_merged_df = (
    renamed_article_matched_df[['File #', 'article full text', 'article summary text']]
         .merge(
             right=renamed_city_council_data_df[['proposal_number', 'policy text', 'meeting transcribed text', 'label']], 
             left_on='File #',
             right_on='proposal_number', 
             how='right'
         )
).drop(columns='File #')
full_merged_df['policy text'].iloc[15]

'121196 Ordinance authorizing the Department of the Environment to accept and expend a grant in the amount of $13,100,000 from the California Public Utilities Commission, through Pacific Gas and Electric Company, to implement an Energy Use and Demand Reduction through Energy Efficiency Program and amending Ordinance No. 165-12 (Annual Salary Ordinance, FYs 2012-2013 and 2013-2014) to reflect the addition of three grant funded positions (3 FTE) at the Department of the Environment, for a term from January 1, 2013, through December 31, 2014.'

In [4]:
# here's how you might combine the `policy text` and the `meeting transcribed text` columns:
full_merged_df_w_full_policy_text = (
    full_merged_df
     .assign(meeting_transcribed_text_col = lambda df:
             df.apply(lambda x: list(map(lambda y: y['text'], x['meeting transcribed text'])), axis=1)
            )
     .assign(full_policy_text=lambda df: 'policy text:\n\n' + df['policy text'] + '\n\n' + 'meeting text:\n\n' + df['meeting_transcribed_text_col'].str.join('\n'))
     .drop(columns=['meeting transcribed text', 'meeting_transcribed_text_col'])
)
print(full_merged_df_w_full_policy_text['full_policy_text'].iloc[1])

policy text:

121007 Ordinance authorizing, pursuant to Charter Section 9.118(a), a System Impact Mitigation Agreement with North Star Solar, LLC, requiring North Star Solar, LLC, to pay the Public Utilities Commission the costs necessary to mitigate the impacts to the City’s electric system caused by the interconnection of North Star Solar, LLC’s solar project to the electric grid; authorizing similar mitigation agreements with other projects in the future; appropriating funds from these agreements to pay the costs of mitigation work; and placing various mitigation funds on reserve with the Board of Supervisors.

meeting text:

Madam Clerk, could you please call item 12?
 Item 12 is an ordinance appropriating $843,000 of state reserves and approximately $1.4 million from school districts set aside funds for the San Francisco Unified School District for fiscal year 2012 through 2013.
Supervisor Kim.
 Thank you.
I realize that we are now finally coming to near end on discussion around t

### Topic Generation example begins here

In [11]:
df = pd.read_csv('../data/final-matching-articles-and-meetings.csv')
# Grab meeting text column (policies)
meeting_text_df = pd.DataFrame(df, columns =['meeting text'])
first_100_df = meeting_text_df.head(100).rename(columns={'meeting text': 'text'})
first_100_df

Unnamed: 0,text
0,Administrative Code - Short-Term Residential R...
1,Hearing - Update on the Municipal Transportati...
2,Affirming the Statutory Exemption From Environ...
3,Committee of the Whole - Urgency Ordinance - Z...
4,Concurring in Actions to Meet Local Emergency ...
...,...
95,Planning Code - Medical Cannabis Dispensaries ...
96,Initiative Ordinance - Business and Tax Regula...
97,Supporting California State Senate Bill 1045 (...
98,Redevelopment Plan Amendment - Transbay Redeve...


In [14]:
# Convert to jsonl format for topicGPT scripts
# Save this to ./data/input directory
# to_json returns None when given a path, so there is nothing to assign or display
first_100_df.to_json('./data/input/sf_meeting_text_first_100.jsonl', orient='records', lines=True)

In [4]:
# Create prompts/seeds with relevant example topics and documents
# Example ones for US bills can be found in the prompts directory

%OPENAI_API_KEY%


In [16]:
# Topic generation - top level
# Make sure to replace the command below with the correct prompt files, output files, and sample size
! python script/generation_1.py --deployment_name llama-3-70b \
                        --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                        --data data/input/sf_meeting_text_first_100.jsonl \
                        --prompt_file prompt/sf_policies/sf_policies_generation_1.txt \
                        --seed_file prompt/sf_policies/sf_policies_seed_1.md \
                        --out_file data/output/sf_policies/sf_policies_result_1.jsonl \
                        --topic_file data/output/sf_policies/sf_policies_result_1.md \
                        --verbose True

Document: 1
Topics: [1] Residential Policies: Mentions policies relating to residential rentals and housing administration.
--------------------
Document: 2
Topics: [1] Public Transportation: Mentions policies relating to public transportation and public transit.
--------------------
Document: 3
Topics: [1] Public Transportation: Mentions policies relating to public transportation and environmental considerations.
--------------------
Document: 4
Topics: [1] Residential Policies: The document discusses an urgency ordinance related to new residential uses and zoning in a specific area.
--------------------
Document: 5
Topics: [1] Drug Policies: Mentions policies relating to drugs and drug usage.
--------------------
Document: 6
Topics: [1] Residential Policies: The document discusses policies related to housing and residential planning, including inclusionary housing requirements and transferable development rights.
--------------------
Document: 7
Topics: [1] Public Utilities Policies:



In [18]:
# Topic generation - second level
# Make sure to replace the command below with the correct prompt files, output files, and sample size

! python script/generation_2.py --deployment_name llama-3-70b \
                --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                --data data/output/sf_policies/sf_policies_result_1.jsonl \
                --seed_file data/output/sf_policies/sf_policies_result_1.md \
                --prompt_file prompt/sf_policies/sf_policies_generation_2.txt \
                --out_file data/output/sf_policies/sf_policies_generation_2.jsonl \
                --topic_file data/output/sf_policies/sf_policies_generation_2.md \
                --verbose True

Number of remaining documents for prompting: 69
Current topic: [1] Residential Policies
Prompt length: 3932
Subtopics: [1] Residential Policies
    [2] Short-Term Residential Rentals (Document: 1): Discusses regulations and policies related to short-term residential rentals.
    [2] Zoning and Land Use (Document: 2, 11, 19, 23): Discusses zoning laws, land use, and related ordinances.
    [2] Inclusionary Housing (Document: 3, 4, 12, 13, 22): Discusses policies and regulations related to inclusionary housing and affordable housing requirements.
    [2] Mayoral Succession (Document: 5, 8): Discusses the process of nominating and appointing a successor mayor.
    [2] Environmental Review (Document: 6, 19): Discusses environmental review processes and policies.
    [2] Building Safety (Document: 7): Discusses policies related to building safety, specifically seismic safety.
    [2] Affordable Housing Needs (Document: 9, 17, 20, 21): Discusses the assessment of affordable housing needs and


100%|##########| 15/15 [01:51<00:00,  7.46s/it]


## Assigning Topics
The next step is to assign your list of concepts to other texts and eventually train/test a model. Your list of concepts should go in the `./data/output/sf_policies` directory and should be formatted like the example md file `./data/output/sf_policies/sf_policies_result_1.md`.
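
If your concepts come from another source, it can be useful to check that each line of the md file parses before running the assignment script. The sketch below assumes each topic line looks like `[level] Name: description`, optionally with a parenthetical such as `(Document: 1)` before the colon, as in the script output shown earlier. The exact format the scripts require is defined by topicGPT's README, so treat this pattern and the helper name `parse_topic_line` as assumptions.

```python
import re

# [level] name, optional "(...)" note, then ": description"
TOPIC_LINE = re.compile(r"^\s*\[(\d+)\]\s+(.+?)(?:\s+\(([^)]*)\))?\s*:\s*(.*)$")


def parse_topic_line(line):
    """Return a dict for a well-formed topic line, or None if it does not match."""
    match = TOPIC_LINE.match(line)
    if match is None:
        return None
    level, name, note, description = match.groups()
    return {
        "level": int(level),
        "name": name.strip(),
        "note": note,  # None when there is no parenthetical
        "description": description.strip(),
    }


first = parse_topic_line("[1] Drug Policies: Mentions policies relating to drugs and drug usage.")
second = parse_topic_line("[2] Zoning and Land Use (Document: 2, 11, 19, 23): Discusses zoning laws.")
```

Lines for which `parse_topic_line` returns `None` are worth inspecting by hand before they reach the assignment prompt.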

Then, get a sample from the `full_merged_df` for assigning those topics. We are downsampling rows with `False` labels since those represent the majority of the data. The example below shows a sample of 50 `True` policies and 50 `False` policies.
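
The balanced downsampling described above can also be written with `groupby(...).sample`, which draws the same number of rows from each label and accepts a seed so the sample can be reproduced. This is a sketch on a toy dataframe, not the notebook's data; `balanced_sample`, `n_per_class`, and `seed` are names introduced here for illustration, and `groupby(...).sample` requires pandas 1.1 or newer.

```python
import pandas as pd


def balanced_sample(df, label_col="label", n_per_class=50, seed=0):
    """Draw n_per_class rows from every label value (raises if a class is too small)."""
    return df.groupby(label_col).sample(n=n_per_class, random_state=seed)


# Toy data: 10 True rows and 100 False rows, mimicking the class imbalance
toy_df = pd.DataFrame({
    "text": [f"policy {i}" for i in range(110)],
    "label": [True] * 10 + [False] * 100,
})
toy_sample = balanced_sample(toy_df, n_per_class=5)
```

Fixing the seed matters here because each rerun of an unseeded `sample` produces a different set of policies, which in turn changes the downstream accuracy.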

Convert this `text_and_label_sample_df` to a JSONL file and place it in `./data/input`. Then run the assignment Python script (note where the different files are located in the command). The concept assignments are written to `./data/output/sf_policies/your_file_name`.

In [111]:
# Grab only policy text and label columns and rename policy text to text for topicGPT script formatting purposes
only_policy_and_label_df = full_merged_df[["policy text", "label"]].rename(columns={'policy text' : 'text'})

# Split data into only True and only False texts
only_true_policy_df = only_policy_and_label_df.loc[only_policy_and_label_df['label']==True]
only_false_policy_df = only_policy_and_label_df.loc[only_policy_and_label_df['label']==False]

# Get num_samples from each df and combine them
num_samples = 50
only_true_sample_df = only_true_policy_df.sample(num_samples)
only_false_sample_df = only_false_policy_df.sample(num_samples)
text_and_label_sample_df = pd.concat([only_true_sample_df, only_false_sample_df], axis=0)

# Save as jsonl file
# (to_json returns None when given a path, so the result is not assigned)
text_and_label_sample_df.to_json('./data/input/sf_text_and_label_sample.jsonl', orient='records', lines=True)
text_and_label_sample_df

Unnamed: 0,text,label
17888,221033 Hearing - Committee of the Whole - Draf...,True
17188,"220401 Street Name Change - From ""Hahn Street""...",True
3577,150191 Health Code - Wild or Exotic Animals fo...,True
11242,190319 Opposing California State Senate Bill N...,True
4849,160028 Urging an Independent Federal Investiga...,True
12506,191148 Administrative Code - Mental Health SF,True
15907,210966 Sublease Agreement - California State L...,True
7280,170507 Settlement of Lawsuit and Tolling Agree...,True
15926,210794 Preparation of Findings Related to Cond...,True
9509,180331 Affirming the Board of Supervisors Comm...,True


In [112]:
# Topic assignment command
! python script/assignment.py --deployment_name llama-3-70b \
                        --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                        --data data/input/sf_text_and_label_sample.jsonl \
                        --prompt_file prompt/sf_policies/sf_policies_assignment.txt \
                        --topic_file data/output/sf_policies/sf_policies_result_1.md \
                        --out_file data/output/sf_policies/sf_topic1_sample_assignment.jsonl \
                        --verbose True

^C


## Logistic Regression Model
Now that we have our concepts assigned to our sampled texts, we can fit an LR model and test its accuracy.

We begin by loading the assignment output file and adding a `concepts` column. We then pass `assignment_sample_df` through our function `get_concepts`, which extracts the assigned concept from the `responses` column and places it in the `concepts` column.
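
For reference, here is a minimal sketch of the extraction step on plain strings. It is a variant rather than the notebook's `get_concepts`: the pattern stops at either a colon or a newline, so responses such as `[1] Drug Policies\n\nAssignment Reasoning: ...` still yield a concept instead of being dropped. The sample strings are shortened from the output shown in this notebook, and `extract_concept` is a hypothetical helper name.

```python
import re

# Capture everything after "[1]" up to the first colon or newline
CONCEPT_RE = re.compile(r"\[1\]\s*([^:\n]+)")


def extract_concept(response):
    """Return the level-1 concept name in a response, or None if there is none."""
    match = CONCEPT_RE.search(response)
    return match.group(1).strip() if match else None


samples = [
    "[1] Business Policies: The document discusses",
    "[1] Drug Policies\n\nAssignment Reasoning: \nT",
    "No topic could be assigned.",
]
extracted = [extract_concept(s) for s in samples]
```

Whether to keep or drop the rows with the `Assignment Reasoning` layout is a judgment call; keeping them retains more of the labeled sample for training.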

Next, we vectorize the concepts to prepare them for the LR model. We use `CountVectorizer` from `scikit-learn` here, but other approaches such as `TfidfVectorizer` or `Word2Vec` can be used.

Finally, we fit the LR model and measure its accuracy on the held-out test set.
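
Because the test set is small, a single accuracy number is easier to interpret next to a trivial baseline. The sketch below, on synthetic data, uses a stratified and seeded split and compares logistic regression with a majority-class `DummyClassifier`. The function name `fit_and_compare`, the seed, and the synthetic features are assumptions for illustration and do not reproduce the notebook's run.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_compare(X, y, seed=0):
    """Return (model accuracy, majority-class baseline accuracy) on a 10% test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.9, stratify=y, random_state=seed
    )
    model = LogisticRegression().fit(X_train, y_train)
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    return model.score(X_test, y_test), baseline.score(X_test, y_test)


# Synthetic stand-in for the vectorized concepts: 100 rows, 5 binary features
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 5))
y_toy = X_toy[:, 0].astype(bool)  # label depends on the first feature only
model_acc, baseline_acc = fit_and_compare(X_toy, y_toy)
```

With a balanced sample the baseline sits near 0.5, so the gap between the two numbers is a rough indication of how much signal the concepts carry.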

In [7]:
# Get topics with texts and labels
assignment_sample_df = pd.read_json('./data/output/sf_policies/sf_topic1_sample_assignment.jsonl', lines=True)

# Copy responses column to new column called concepts
assignment_sample_df['concepts'] = assignment_sample_df['responses']
assignment_sample_df

Unnamed: 0,text,label,prompted_docs,responses,concepts
0,210492 Police Code - Third-Party Food Delivery...,True,210492 Police Code - Third-Party Food Delivery...,[1] Business Policies: The document discusses ...,[1] Business Policies: The document discusses ...
1,"180214 Transportation, Public Works Codes - Un...",True,"180214 Transportation, Public Works Codes - Un...",[1] Public Transportation: The document discus...,[1] Public Transportation: The document discus...
2,180681 Development Agreement - India Basin Inv...,True,180681 Development Agreement - India Basin Inv...,[1] Residential Policies: The document mention...,[1] Residential Policies: The document mention...
3,190093 Hearing - Appeal of Determination of Co...,True,190093 Hearing - Appeal of Determination of Co...,[1] Residential Policies: The document mention...,[1] Residential Policies: The document mention...
4,161065 Police Code - Rental Car Disclosure Req...,True,161065 Police Code - Rental Car Disclosure Req...,[1] Residential Policies: The document mention...,[1] Residential Policies: The document mention...
5,171041 Planning Code - Cannabis Regulation,True,171041 Planning Code - Cannabis Regulation,[1] Drug Policies\n\nAssignment Reasoning: \nT...,[1] Drug Policies\n\nAssignment Reasoning: \nT...
6,"180214 Transportation, Public Works Codes - Un...",True,"180214 Transportation, Public Works Codes - Un...",[1] Public Transportation: The document discus...,[1] Public Transportation: The document discus...
7,190984 Public Health Crisis on Drug Overdoses ...,True,190984 Public Health Crisis on Drug Overdoses ...,[1] Drug Policies: The document discusses issu...,[1] Drug Policies: The document discusses issu...
8,210537 Administrative Code - Extension Of Temp...,True,210537 Administrative Code - Extension Of Temp...,[1] Residential Policies: The document discuss...,[1] Residential Policies: The document discuss...
9,210921 Conditionally Reversing the Final Envir...,True,210921 Conditionally Reversing the Final Envir...,[1] Environmental Policies\n\nAssignment Reaso...,[1] Environmental Policies\n\nAssignment Reaso...


In [8]:
import re
# Get concepts
# Also removes any rows whose response contains no "[1] <concept>:" match
# (e.g. when a newline, rather than a colon, follows the concept name)
def get_concepts(df):
    drop_rows = []
    for index, row in df.iterrows():
        text = row['responses']
        # Raw string so that \[ and \] are passed to the regex engine unchanged
        concept = re.search(r"\[1\] (.*?):", text)
        if concept:
            concept = concept.group(1)
        else:
            drop_rows.append(index)
        df.at[index, 'concepts'] = concept
    return df.copy().drop(drop_rows)

In [9]:
# Get concepts
concepts_sample_df = get_concepts(assignment_sample_df)
concepts_sample_df

Unnamed: 0,text,label,prompted_docs,responses,concepts
0,210492 Police Code - Third-Party Food Delivery...,True,210492 Police Code - Third-Party Food Delivery...,[1] Business Policies: The document discusses ...,Business Policies
1,"180214 Transportation, Public Works Codes - Un...",True,"180214 Transportation, Public Works Codes - Un...",[1] Public Transportation: The document discus...,Public Transportation
2,180681 Development Agreement - India Basin Inv...,True,180681 Development Agreement - India Basin Inv...,[1] Residential Policies: The document mention...,Residential Policies
3,190093 Hearing - Appeal of Determination of Co...,True,190093 Hearing - Appeal of Determination of Co...,[1] Residential Policies: The document mention...,Residential Policies
4,161065 Police Code - Rental Car Disclosure Req...,True,161065 Police Code - Rental Car Disclosure Req...,[1] Residential Policies: The document mention...,Residential Policies
6,"180214 Transportation, Public Works Codes - Un...",True,"180214 Transportation, Public Works Codes - Un...",[1] Public Transportation: The document discus...,Public Transportation
7,190984 Public Health Crisis on Drug Overdoses ...,True,190984 Public Health Crisis on Drug Overdoses ...,[1] Drug Policies: The document discusses issu...,Drug Policies
8,210537 Administrative Code - Extension Of Temp...,True,210537 Administrative Code - Extension Of Temp...,[1] Residential Policies: The document discuss...,Residential Policies
10,180064 Confirming the Appointment of the Succe...,True,180064 Confirming the Appointment of the Succe...,[1] Election Policies: The document mentions t...,Election Policies
11,190224 Supporting California State Assembly Bi...,True,190224 Supporting California State Assembly Bi...,[1] Public Health Policies: The document discu...,Public Health Policies


In [107]:
# Vectorize concepts into numeric features for the logistic regression model
# (TfidfVectorizer from sklearn.feature_extraction.text could be swapped in here)
countVectorizer = CountVectorizer()
X = countVectorizer.fit_transform(concepts_sample_df['concepts'])
y = concepts_sample_df['label']

In [108]:
# Split the remaining policies (rows without a parsed concept were dropped) into 90/10 train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9)

In [109]:
# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)

In [110]:
# Test model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.7777777777777778
Confusion Matrix:
 [[3 2]
 [0 4]]
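
To read the confusion matrix above: scikit-learn puts true labels on the rows and predicted labels on the columns, with labels in sorted order, so for boolean labels the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision, and recall from the printed matrix; `binary_metrics` is a hypothetical helper written for this example.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, and recall from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is never predicted or present
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }


metrics = binary_metrics([[3, 2], [0, 4]])
```

For this run, accuracy is 7/9, precision for the `True` class is 4/6, and recall is 4/4. With only 9 test rows, each misclassified policy moves accuracy by about 11 percentage points, so these figures should be read as a rough baseline.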
