EDA + initial text analysis for MSE #11

sofiapinto · 2023-11-14T11:36:59Z

Description

This PR adds scripts to perform EDA (exploratory data analysis) and initial text analysis (looking at top words and n-grams) for Money Saving Expert data.
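
As a rough illustration of what "top words and n-grams" means here, the sketch below counts the most frequent n-grams across a handful of posts. It is not the code in this PR: the function names `ngrams` and `top_ngrams` and the example posts are hypothetical, and tokenisation is assumed to be a plain whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences in `tokens` as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k):
    """Count n-grams across all texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenisation (assumption)
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Hypothetical example posts, not real MSE data
posts = [
    "heat pump running costs",
    "air source heat pump install",
    "heat pump grant",
]
print(top_ngrams(posts, n=2, k=1))
```

With `n=1` the same function gives the top single words.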

Closes #1
Closes #5

Instructions for Reviewer

Setup

To test the code in this PR you need to:

- Clone this repo: git clone git@github.com:nestauk/asf_public_discourse_home_decarbonisation.git
- Check out the correct branch: git checkout 01_initial_analysis_mse
- Run make install
- Run direnv allow
- Activate the conda environment: conda activate asf_public_discourse_web_scraping

Review

Hey @helloaidank and @lizgzil, thanks a lot for taking the time to review this PR. @crispy-wonton, I've also tagged you as you said you'd like to take a look - feel free to spend as much or as little time on this as you like; I really appreciate you taking the time.

Scripts to be reviewed:
There are six scripts to be reviewed in this PR, grouped as follows.

Getters:
- Useful getter utils: asf_public_discourse_home_decarbonisation/getters/getter_utils.py
- MSE specific getters: asf_public_discourse_home_decarbonisation/getters/mse_getters.py
Utils:
- Plotting utils: asf_public_discourse_home_decarbonisation/utils/plotting_utils.py
- Text processing utils: asf_public_discourse_home_decarbonisation/utils/text_processing_utils.py
Analysis scripts:
- EDA: asf_public_discourse_home_decarbonisation/analysis/mse/eda_mse_category_data.py
- Initial text analysis: asf_public_discourse_home_decarbonisation/analysis/mse/initial_text_analysis_category_data.py

Note that the two files in notebooks/ do not need to be reviewed. They serve only as helpers: you can open and run them as notebooks if any of the steps in the analysis scripts does not make sense and you want to take a closer look. To open the files as notebooks, follow the instructions below (also present at the top of the notebook files):
- Run jupytext --to notebook asf_public_discourse_home_decarbonisation/notebooks/mse/name_of_notebook.py
- If the correct kernel does not come up (asf_public_discourse_home_decarbonisation), please run the following in your terminal: python -m ipykernel install --user --name=asf_public_discourse_home_decarbonisation

Code to run
Could you also please run:

python asf_public_discourse_home_decarbonisation/analysis/mse/eda_mse_category_data.py
python asf_public_discourse_home_decarbonisation/analysis/mse/initial_text_analysis_category_data.py
and let me know if they run smoothly.

Things to pay special attention to:

The code runs well on one of the smallest categories, but it might break when run for the "energy" category. What can we do to optimise the steps, especially the lemmatising/tokenising steps? Shall I use spaCy instead of NLTK? The code is much faster since I started using np.vectorize() - but is there anything else I can do? @lizgzil, if you don't have much time to look at the PR, it would be great if you could just take a look at this.
Does anything in the logic not make sense?
Should any part of the codebase live somewhere else in the cookiecutter structure? These small pieces of analysis are still the weirdest for me to organise/refactor.
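
On the speed question above: NumPy's documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so it may be worth looking beyond it. One library-agnostic option is to lemmatise each distinct token only once and reuse the result. The sketch below illustrates that idea under stated assumptions: `slow_lemmatise` is a hypothetical stand-in for whichever NLTK or spaCy call is used, and `cached_lemmatise` and `lemmatise_corpus` are illustrative names, not functions in this repo.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive lemmatiser actually runs


def slow_lemmatise(token):
    """Hypothetical stand-in for an expensive NLTK/spaCy lemmatiser call."""
    global calls
    calls += 1
    # Toy rule for illustration only: strip a trailing "s" from longer words
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


@lru_cache(maxsize=None)
def cached_lemmatise(token):
    """Lemmatise a token, remembering the result for repeated tokens."""
    return slow_lemmatise(token)


def lemmatise_corpus(tokenised_texts):
    """Lemmatise every token in every text, hitting the cache for repeats."""
    return [[cached_lemmatise(t) for t in tokens] for tokens in tokenised_texts]


texts = [["heat", "pumps", "boilers"], ["boilers", "heat", "pumps", "pumps"]]
result = lemmatise_corpus(texts)
print(result)
print(calls)  # number of real lemmatiser calls, one per distinct token
```

This only helps when the lemma depends on the token alone (for example, a lemmatiser called with a fixed part of speech). A context-aware pipeline would need a different approach, such as processing texts in batches.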

Note that

The results I presented at the project sprint review contained an extended analysis, which will follow in a 2nd PR.
I've not yet implemented Liz's suggestion to not lemmatise all words and to leave out low-frequency words.
I think the data needs some extra cleaning - if you have any tips or functions you have implemented in past projects, do let me know.
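
For reference, leaving out low-frequency words could look roughly like the sketch below. It is not implemented in this PR; the function name `drop_rare_tokens`, the `min_count` threshold and the example tokens are all assumptions for illustration.

```python
from collections import Counter


def drop_rare_tokens(tokenised_texts, min_count=2):
    """Remove tokens that occur fewer than `min_count` times across the corpus."""
    counts = Counter(token for tokens in tokenised_texts for token in tokens)
    return [
        [token for token in tokens if counts[token] >= min_count]
        for tokens in tokenised_texts
    ]


# Hypothetical tokenised posts, not real MSE data
texts = [["boiler", "heat", "pump"], ["heat", "pump", "grant"], ["pump", "tariff"]]
filtered = drop_rare_tokens(texts, min_count=2)
print(filtered)
```

A sensible threshold would depend on the size of each category and would need checking against the data.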

Thanks a lot in advance!

Checklist:

sofiapinto · 2023-12-01T14:42:29Z

asf_public_discourse_home_decarbonisation/notebooks/mse/initial_text_analysis_category_data.py

does not require review

sofiapinto · 2023-12-01T14:42:38Z

asf_public_discourse_home_decarbonisation/notebooks/mse/eda_mse_category_data.py

does not require review

sofiapinto added 5 commits November 14, 2023 11:22

script with text processing utils

80de4cd

script with plotting utils

945f5ab

updating requirements

b8248c7

S3 bucket to config file

ede7344

getter for first sample of data collected

8a2dfa1

sofiapinto linked an issue Nov 14, 2023 that may be closed by this pull request

EDA for Money Saving Expert #1

Closed

3 tasks

sofiapinto added 9 commits November 15, 2023 15:19

updating requirements

ec2b3aa

update lemmatization

3391805

creating getter utils and updating getter functions

3f993d1

adding exploratory data analysis for MSE

94b34d7

updating plotting utils

c6678ff

changes to EDA before ASF away day

70e4a62

separating text analysis to a different file

ea0fa13

updating MSE getters

82640b2

updating create_wordcloud() arguments

777d9bf

sofiapinto self-assigned this Nov 16, 2023

sofiapinto added 8 commits November 30, 2023 16:10

splitting plotting utils into utils and configs

bcd418e

updating MSE getters

2061038

keeping a notebook with EDA in notebooks/ and adding a refactored script to analysis/

1214a1d

updating requirements

b3bf895

update text processing utils and plotting utils

9c95c60

adding documentation to EDA script

09995de

updating text analysis notebook

cbd8d5f

adding refactored text analysis script

c7ce76f

sofiapinto marked this pull request as ready for review December 1, 2023 14:41

sofiapinto requested review from helloaidank, lizgzil and crispy-wonton December 1, 2023 14:41

sofiapinto commented Dec 1, 2023

asf_public_discourse_home_decarbonisation/notebooks/mse/eda_mse_category_data.py Outdated

does not require review

helloaidank reviewed Dec 19, 2023

asf_public_discourse_home_decarbonisation/pipeline/data_processing_flows/flow_utils.py Outdated

sofiapinto and others added 27 commits December 22, 2023 13:04

updating text processing flow

1094547

update initial text analysis after changing text processing pipeline

87ebda5

update mse getters with raw vs processed versions + improving docs

89ee7dc

update plotting utils to clear figures at the end + add docs

b87cacc

update processing utils used in analysis after moving some of the utils to the flow utils

376b1a4

Adding info to download and install the Averta font

b7fd400

Moving S3 bucket definition

46e69b8

Improving MSE getters docs

7d75e13

Import S3 bucket from init

3aaba8c

small improvements to processing flow

eabd9f7

altering patterns function + adding function to remove emojis

2fef16f

adding documentation to EDA script

86be74b

removing if-else's in energy

c2f18ab

small fix in plotting utils

0ba4539

small fix in text analysis

fd0ab6a

lower case tokenised and lemmatised results from spacy

b66a5d5

add a retry decorator to one of the steps

6dc529d

Merge branch 'dev' into 01_initial_analysis_mse

54e151b

Delete asf_public_discourse_home_decarbonisation/notebooks/mse/eda_mse_category_data.py

ae85807

Removing EDA notebook as it's out of date

Delete asf_public_discourse_home_decarbonisation/notebooks/mse/initial_text_analysis_category_data.py

5b4da07

Removing text analysis notebook as it's out of date

Rename flow_utils.py to text_processing_utils.py

5174153

Renaming flow utils

Replacing flow_utils by text_processing_utils

3f31b55

Rename text_processing_utils.py to ngram_utils.py

72b2c1b

Replace text_processing_utils by ngram_utils

1c2a8dc

fixes to BuildHub after changes from dev incorporated

d048339

removing if else because of energy

20d3e09

change import after renaming

84684f4

sofiapinto merged commit b1201dc into dev Jan 5, 2024

sofiapinto deleted the 01_initial_analysis_mse branch January 5, 2024 17:40
