EDA + initial text analysis for MSE #11

Merged: sofiapinto merged 72 commits into dev from 01_initial_analysis_mse on Jan 5, 2024


@sofiapinto sofiapinto commented Nov 14, 2023


Description

This PR adds scripts to perform EDA (exploratory data analysis) and initial text analysis (looking at top words and ngrams) for Money Saving Expert data.
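
For readers unfamiliar with the "top words and ngrams" step, here is a minimal sketch of the idea using only the standard library. It is not the code in this PR: the function names (`tokenise`, `ngrams`, `top_ngrams`), the regex tokeniser and the example posts are all made up for illustration.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(texts, n=1, k=10):
    """Count n-grams across all texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenise(text), n))
    return counts.most_common(k)


# Made-up example posts, not real Money Saving Expert data.
posts = [
    "Is a heat pump worth it?",
    "My heat pump is noisy",
    "Solar panels and a heat pump",
]
print(top_ngrams(posts, n=2, k=1))
```

Counting n-grams per post, rather than over one concatenated string, avoids creating n-grams that span two unrelated posts.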

Closes #1
Closes #5

Instructions for Reviewer

Setup

In order to test the code in this PR you need to:

  • Clone this repo: git clone git@github.com:nestauk/asf_public_discourse_home_decarbonisation.git
  • Check out the correct branch: git checkout 01_initial_analysis_mse
  • Run make install
  • Run direnv allow
  • Activate the conda environment: conda activate asf_public_discourse_web_scraping

Review

Hey @helloaidank and @lizgzil, thanks a lot for taking the time to review this PR. @crispy-wonton, I've also tagged you as you said you'd like to take a look - feel free to spend as much or as little time on this as you like; I really appreciate you taking the time.

Scripts to be reviewed:
There are six scripts to be reviewed in this PR:

  • Getters:
    • Useful getter utils: asf_public_discourse_home_decarbonisation/getters/getter_utils.py
    • MSE specific getters: asf_public_discourse_home_decarbonisation/getters/mse_getters.py
  • Utils:
    • Plotting utils: asf_public_discourse_home_decarbonisation/utils/plotting_utils.py
    • Text processing utils: asf_public_discourse_home_decarbonisation/utils/text_processing_utils.py
  • Analysis scripts:
    • EDA: asf_public_discourse_home_decarbonisation/analysis/mse/eda_mse_category_data.py
    • Initial text analysis: asf_public_discourse_home_decarbonisation/analysis/mse/initial_text_analysis_category_data.py

Note that the two files in notebooks/ do not need to be reviewed. They serve only as helpers: you can open and run them as notebooks if any step in the analysis scripts does not make sense and you want to inspect it. To open the files as notebooks, follow the instructions below (also present at the top of the notebook files):

  • Run jupytext --to notebook asf_public_discourse_home_decarbonisation/notebooks/mse/name_of_notebook.py
  • If the correct kernel (asf_public_discourse_home_decarbonisation) does not come up, run the following in your terminal: python -m ipykernel install --user --name=asf_public_discourse_home_decarbonisation

Code to run
Could you also please run:

  • python asf_public_discourse_home_decarbonisation/analysis/mse/eda_mse_category_data.py
  • python asf_public_discourse_home_decarbonisation/analysis/mse/initial_text_analysis_category_data.py
    and let me know if they run smoothly.

Things to pay special attention to:

  • The code runs well on one of the smallest categories, but it might break when run on the "energy" category. What can we do to optimise the steps, especially lemmatising and tokenising? Should I use spaCy instead of NLTK? The code has been much faster since I started using np.vectorize(), but is there anything else I can do? @lizgzil, if you don't have much time to look at the PR, it would be great if you could just take a look at this.
  • Does anything in the logic not make sense?
  • Should any part of the codebase live somewhere else in the cookiecutter structure? These small pieces of analysis are still the weirdest for me to organise/refactor.
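
One possible answer to the optimisation question above, offered as a sketch rather than a tested fix for this codebase: forum text repeats the same tokens many times, so caching the lemmatiser means each unique token is processed only once. The NumPy documentation notes that np.vectorize is provided mainly for convenience and is essentially a for loop, so caching may help regardless of how the function is applied. In the example below, `toy_lemmatise` is a hypothetical stand-in for the real (slow) lemmatiser, and the token lists are made up.

```python
from functools import lru_cache

# Counts how often the "slow" lemmatiser is actually invoked.
CALLS = {"count": 0}


def toy_lemmatise(token):
    """Stand-in for a slow lemmatiser (hypothetical; swap in the real one)."""
    CALLS["count"] += 1
    # Naive rule for illustration only: strip a trailing "s" from longer words.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


@lru_cache(maxsize=None)
def cached_lemmatise(token):
    """Memoised wrapper: each distinct token is lemmatised once."""
    return toy_lemmatise(token)


def lemmatise_posts(tokenised_posts):
    """Lemmatise every token in every post, reusing cached results."""
    return [[cached_lemmatise(t) for t in post] for post in tokenised_posts]


# Made-up tokenised posts: 6 tokens in total, 4 of them unique.
posts = [["heat", "pumps", "boilers"], ["boilers", "pumps", "costs"]]
lemmas = lemmatise_posts(posts)
print(lemmas, CALLS["count"])
```

The saving grows with the ratio of total tokens to unique tokens, which is typically large for a big forum category. If the real lemmatiser takes extra arguments (for example a part-of-speech tag), they would need to be part of the cached function's signature.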

Note that

  • The results I presented at the project sprint review contained an extended analysis, which will follow in a second PR.
  • I've not yet implemented Liz's suggestion to avoid lemmatising all words and to leave out low-frequency words.
  • I think the data needs some extra cleaning. If you have any tips or functions you have implemented in past projects, do let me know.
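
As a rough illustration of the suggestion to leave out low-frequency words (not yet implemented in this PR), the sketch below drops tokens that appear fewer than `min_count` times across the whole corpus before any further processing. The function name `drop_rare_tokens`, the threshold of 2 and the example tokens are assumptions made for this example.

```python
from collections import Counter


def drop_rare_tokens(tokenised_posts, min_count=2):
    """Remove tokens whose corpus-wide frequency is below min_count."""
    counts = Counter(t for post in tokenised_posts for t in post)
    keep = {t for t, c in counts.items() if c >= min_count}
    return [[t for t in post if t in keep] for post in tokenised_posts]


# Made-up tokenised posts for illustration.
posts = [["heat", "pump", "quote"], ["heat", "pump", "noise"], ["solar", "heat"]]
filtered = drop_rare_tokens(posts, min_count=2)
print(filtered)
```

Filtering before lemmatising reduces the number of tokens the lemmatiser has to handle, at the cost of counting inflected forms separately when deciding what is rare.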

Thanks a lot in advance!

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@sofiapinto sofiapinto linked an issue Nov 14, 2023 that may be closed by this pull request
3 tasks
@sofiapinto sofiapinto self-assigned this Nov 16, 2023
@sofiapinto sofiapinto marked this pull request as ready for review December 1, 2023 14:41
does not require review


sofiapinto and others added 27 commits December 22, 2023 13:04
…e_category_data.py

Removing EDA notebook as it's out of date
…l_text_analysis_category_data.py

Removing text analysis notebook as it's out of date
@sofiapinto sofiapinto merged commit b1201dc into dev Jan 5, 2024
@sofiapinto sofiapinto deleted the 01_initial_analysis_mse branch January 5, 2024 17:40
Development

Successfully merging this pull request may close these issues.

  • Initial/basic text analysis (top words & n grams) Money Saving Expert
  • EDA for Money Saving Expert
4 participants