Issue 22 first exploration of questions #23

Open · wants to merge 8 commits into dev from issue-22-first_exploration_of_questions
Conversation

helloaidank (Contributor)


Description

This PR introduces two Python scripts that perform our first iteration of question analysis: BERTopic_first_analysis.py and questions_eda_analysis.py. The only outputs from these scripts are figures.

Instructions for Reviewer

Hi @sofiapinto, thanks a lot for reviewing my code!

Setup

  • Note: you will need to have run the question-extraction scripts (issue Extracting Questions from MSE & Buildhub #16) first, as this PR is based on their outputs. Please run those scripts over the extracted questions from the 119_air_source_heat_pumps_ashp forum on BuildHub.
  • clone this repo: git clone git@github.com:nestauk/asf_public_discourse_home_decarbonisation.git
  • checkout to the correct branch: git checkout issue-22-first_exploration_of_questions
  • Run: make install;
  • Run: direnv allow;
  • Activate the conda environment using: conda activate asf_public_discourse_home_decarbonisation

Review

  • Can you run python BERTopic_first_analysis.py?
  • Can you also run python questions_eda_analysis.py?
  • The subsequent outputs will be saved in asf_public_discourse_home_decarbonisation/outputs/figures/extracted_questions/

Scripts

If you could review the following scripts:

  • asf_public_discourse_home_decarbonisation/analysis/FAQ_analysis/BERTopic_first_analysis.py: Implements topic modeling using the BERTopic library, providing insights into the thematic structure of the question data.
  • asf_public_discourse_home_decarbonisation/analysis/FAQ_analysis/questions_eda_analysis.py: Facilitates EDA on question datasets, plotting most frequent questions from the extracted questions data.
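
For context, here is a minimal sketch of the kind of processing the two scripts perform (the input path is hypothetical; the "Question" column name comes from the code reviewed below). It is illustrative only, not the scripts themselves:

import pandas as pd
from bertopic import BERTopic

# Hypothetical input path; the real scripts read the extracted questions CSVs
questions = pd.read_csv("extracted_questions.csv")["Question"].tolist()

# Topic modelling step (as in BERTopic_first_analysis.py): fit BERTopic and inspect the topics found
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(questions)
print(topic_model.get_topic_info().head())

# EDA step (as in questions_eda_analysis.py): count the most frequent questions
question_counts = pd.Series(questions).value_counts()
print(question_counts.head(15))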

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@sofiapinto (Contributor) left a comment:

Hey @helloaidank,

  • bertopic==0.16.0 and kaleido are missing from the requirements. Not sure if you just forgot to commit the requirements file, but without these two requirements your BERTopic_first_analysis.py script doesn't work;
  • I think it would make more sense to apply the EDA to the whole of the BuildHub data instead of each sub-forum, as each sub-forum is quite small, which is why you don't get many repetitions.
  • What I learnt by looking at the data & investigating a bit further:
    • A lot of the questions we are getting are made up of 3 or fewer words (see the plot below), so perhaps something is not working well in the part of your code that should remove these (there is a filtering sketch after this list);
[plot: distribution of question word counts, showing many questions of 3 or fewer words]
  • The above also tells us that we're sometimes missing the context! (as we expected)
  • If we take a good look at the questions data, most of them are not questions. For the Green and Ethical Money Saving sub-forum, we get 184143 questions without a question mark and 40219 with a question mark. That's a lot of potential false positives! Before we go any further, we need to improve our method for identifying questions. A few thoughts:
    • either removing some of the starting keywords such as "is", OR just focusing on sentences ending with a question mark (see the sketch after this list);
    • only focusing on the original posts for now, not replies - unsure whether this will help;
    • we should definitely use the titles to identify questions, in addition to the text we are already using - I am guessing a lot of titles are posed as questions, so they should be a good source;
    • adding the sentences coming before the question, to have some context.
  • We don't seem to have a lot of questions posed as "don't knows" (or at least not many repeated ones, which is good) - only ~200 "don't knows" in the Green and Ethical Money Saving category.
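
A rough sketch of the two filters mentioned above (illustrative only; it assumes a DataFrame with a "Question" column, as in the snippets further down this review):

import pandas as pd

# Hypothetical input path; in practice this would be the extracted questions CSV for a sub-forum
extracted_questions_df = pd.read_csv("extracted_questions.csv")

# 1) Drop very short "questions" (3 words or fewer), which are unlikely to be real questions
word_counts = extracted_questions_df["Question"].str.split().str.len()
extracted_questions_df = extracted_questions_df[word_counts > 3]

# 2) Keep only sentences that actually end with a question mark
ends_with_qm = extracted_questions_df["Question"].str.strip().str.endswith("?")
extracted_questions_df = extracted_questions_df[ends_with_qm]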

Happy to discuss more in our stand up.

"""
python BERTopic_first_analysis.py

This script performs topic modeling on a set of questions using the BERTopic library.
Contributor:

Suggested change:
- This script performs topic modeling on a set of questions using the BERTopic library.
+ This script clusters questions together to identify groups of similar questions. To do that, we apply the BERTopic topic model to a set of questions.

Comment on lines 60 to 70
def load_data(file_path: str) -> List[str]:
    """
    Loads extracted questions from a CSV file into a list.

    Args:
        file_path (str): The path to the CSV file containing the questions.

    Returns:
        List[str]: A list of questions.
    """
    return pd.read_csv(file_path)["Question"].tolist()
Contributor:

We should create getters for questions' data in the getters folder, and then call the getter here instead of creating this function in this script. Doesn't need to happen now.
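
A hypothetical sketch of what such a getter could look like (the module location, function name and file name are illustrative, not the repo's actual getters):

# e.g. in asf_public_discourse_home_decarbonisation/getters/ (hypothetical module)
import os
import pandas as pd

PROJECT_DIR = os.environ.get("PROJECT_DIR", ".")  # placeholder for the repo's project directory


def get_extracted_questions(forum: str, category: str) -> pd.DataFrame:
    """Load the extracted questions for a given forum and sub-forum category."""
    # File name pattern is assumed, mirroring the idk_phrases_*_all.csv pattern used later in this review
    path = os.path.join(
        PROJECT_DIR,
        f"outputs/data/extracted_questions/{forum}/forum_{category}/questions_{category}_all.csv",
    )
    return pd.read_csv(path)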

Contributor (Author):

Agreed, this was very much a temporary measure.

Comment on lines 121 to 138
def plot_topic_distribution(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Plots and saves the distribution of the top topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
    """
    topic_counts = topic_model.get_topic_info()["Count"][1:17]
    topic_labels = topic_model.get_topic_info()["Name"][1:17].str.replace("_", " ")
    plt.figure(figsize=(14, 8))
    plt.barh(topic_labels, topic_counts, color=NESTA_COLOURS[0])
    plt.ylabel("Topics")
    plt.xlabel("Count")
    plt.title("Topic Distribution")
    plt.tight_layout()
    plt.savefig(
        figure_file_path + "topic_distribution.png", dpi=300, bbox_inches="tight"
    )
Contributor:

Once I merge my BERTopic code you can use the utils I created and call them here.

Contributor:

This script can be moved to the pipeline faqs_identification folder after a few future changes. Doesn't need to happen in this PR.

After we do the evaluation of different models on different datasets, we can set the params for each questions dataset and fix a random seed. These params can then be read by this file.
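
A rough sketch of what that could look like (the config file name and keys are hypothetical; passing a seeded UMAP model is the standard way to make BERTopic runs reproducible):

import yaml
from bertopic import BERTopic
from umap import UMAP

# Hypothetical per-dataset config, e.g. chosen after the model evaluation
with open("bertopic_params.yaml") as f:
    params = yaml.safe_load(f)["buildhub_ashp"]  # hypothetical dataset key

# Fixing UMAP's random_state makes the BERTopic pipeline deterministic
umap_model = UMAP(random_state=params.get("random_seed", 42))
topic_model = BERTopic(
    umap_model=umap_model,
    min_topic_size=params.get("min_topic_size", 10),
)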

Comment on lines 88 to 118
def visualise_topics(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a visualisation of topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
    """
    fig = topic_model.visualize_topics()
    fig.write_image(figure_file_path + "topic_visualisation.png")


def visualise_barchart(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a barchart visualisation of the top n topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
    """
    fig_barchart = topic_model.visualize_barchart(top_n_topics=16, n_words=10)
    fig_barchart.write_image(figure_file_path + "topic_visualisation_barchart.png")


def visualise_hierarchy(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a hierarchical visualisation of topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
    """
    fig_hierarchy = topic_model.visualize_hierarchy()
    fig_hierarchy.write_image(figure_file_path + "topic_visualisation_hierarchy.png")
Contributor:

If you find these helpful, let's move them to the topic modelling utils file I created; it's a better place for them to live. This needs to be done in a PR after we both merge our PRs.

Comment on lines 149 to 152
extracted_questions_df = load_data(input_data)
question_counts = get_question_counts(extracted_questions_df)
plot_question_counts(question_counts, figure_path=output_figures_path)

Contributor:

a few additional things I did/checked (the code is not necessarily the best, just initial thoughts from me!):

Suggested change:
- extracted_questions_df = load_data(input_data)
- question_counts = get_question_counts(extracted_questions_df)
- plot_question_counts(question_counts, figure_path=output_figures_path)
+ os.makedirs(output_figures_path, exist_ok=True)
+ extracted_questions_df = load_data(input_data)
+ extracted_questions_df["contains_qm"] = extracted_questions_df["Question"].apply(lambda x: True if "?" in x else False)
+ print("Contains question mark?\n", extracted_questions_df["contains_qm"].value_counts())
+ extracted_questions_df["Question"] = extracted_questions_df["Question"].str.lower()
+ question_counts = get_question_counts(extracted_questions_df)
+ plot_question_counts(question_counts, figure_path=output_figures_path, top_n=15)
+ dont_knows_path = os.path.join(
+     PROJECT_DIR,
+     f"outputs/data/extracted_questions/{forum}/forum_{category}/idk_phrases_{category}_all.csv",
+ )
+ dont_knows_df = load_data(dont_knows_path)
+ dont_knows_df["sentences_without_inclusion"] = dont_knows_df["sentences_without_inclusion"].str.lower()
+ dk_counts = get_question_counts(dont_knows_df, column_name="sentences_without_inclusion")
+ dk_counts = dk_counts[dk_counts > 1]
+ if len(dk_counts) > 0:
+     print(dk_counts)
+ else:
+     print('No frequent "dont know" expressions found')

@helloaidank (Contributor, Author):

Hi @sofiapinto, I have added some of the modifications you suggested and also included the LLM label titling. The code should still run fine, but please do check!
If you could have a look at some of the changes I have made, that would be great. There are still some outstanding actions to take after this PR (if I have understood the comments correctly), such as moving some of these functions to the "topic modelling utils file".
