Topic modelling on financial news with Natural Language Processing
01_nlp_stopwords_punct_entity.ipynb
02_trim_cleaned_article_data.ipynb first commit Aug 24, 2017
03_nlp_lda_extra_stopwords.ipynb
04_nlp_stem_lemmatize_text.ipynb
05_nlp_viz_clusters.ipynb
06_nlp_snowball_dbscan_k_means_clustering.ipynb
07_explore_label_k_means_clusters.ipynb
08_nlp_chart_data.ipynb
README.md
articles_by_year.png
cb_regulation_chart.png
country_region_chart.png
economy_chart.png

Topic Modelling on Financial News Articles

Summary

This repo contains code for pre-processing and vectorizing raw text collected from 85,000 news articles downloaded from a variety of online broadsheet newspapers and newswires covering finance, business and the economy.

A detailed blog post can be found at http://mattmurray.net/topic-modelling-financial-news-with-natural-language-processing/

Figure: Article counts by year (articles_by_year.png)

The text was pre-processed by removing stop words, punctuation and numbers, and the remaining words were stemmed with the Snowball stemmer.
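
A minimal sketch of this kind of pre-processing is shown below. It assumes NLTK's `SnowballStemmer` and a tiny hand-written stop-word set; the `preprocess` helper, the stop-word list and the sample sentence are illustrative and not taken from the notebooks.

```python
import re

from nltk.stem.snowball import SnowballStemmer

# Tiny illustrative stop-word list (an assumption; the notebooks' actual list is not shown here).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "by", "its", "is", "was"}

stemmer = SnowballStemmer("english")


def preprocess(text):
    """Lowercase, drop punctuation, numbers and stop words, then stem each token."""
    # Keeping only alphabetic runs discards punctuation and digits in one step.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS]


# Hypothetical sample sentence, not from the article data.
sample = "The central bank raised interest rates by 25 basis points in 2017."
tokens = preprocess(sample)
print(tokens)
```

Stemming maps inflected forms such as "rates" and "rate" to a shared token, which keeps the vocabulary smaller before vectorization.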

The data was vectorized into a TF-IDF matrix, then Latent Semantic Analysis (LSA) was applied to reduce its dimensionality to a smaller number of latent features.
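
The vectorization and reduction step can be sketched as follows, assuming scikit-learn's `TfidfVectorizer` and `TruncatedSVD` (a common way to implement LSA). The toy documents and the choice of two components are assumptions for illustration only; the notebooks may use different settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already pre-processed documents (not from the article data).
docs = [
    "central bank rais interest rate",
    "bank regul tighten capit rule",
    "oil price fall suppli grow",
    "crude oil suppli cut price rise",
    "interest rate cut central bank",
    "regul fine bank capit breach",
]

# Rows are documents, columns are terms, values are TF-IDF weights.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Truncated SVD on the TF-IDF matrix gives the LSA projection.
# n_components=2 is only for this toy example.
svd = TruncatedSVD(n_components=2, random_state=0)
latent = svd.fit_transform(tfidf)

print(tfidf.shape, latent.shape)
```

Each row of `latent` is a dense, low-dimensional representation of one document, which is what a clustering step would consume.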

Finally, the articles were grouped into topic clusters based on their latent features, and the trends in those topics were visualized over time.
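
A rough sketch of clustering and trend counting is shown below. The notebook names mention both DBSCAN and k-means; this example assumes scikit-learn's `KMeans` and uses synthetic latent features and made-up publication years, so it illustrates the mechanics rather than the repo's actual pipeline.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for LSA features: two well-separated blobs of 20 "articles" each.
latent = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
# Made-up publication years, alternating between two values.
years = np.array([2015, 2016] * 20)

# Assign each article to one of two topic clusters (cluster count is an assumption).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(latent)

# Count articles per (year, cluster) pair; these counts are what a trend chart would plot.
counts = Counter(zip(years.tolist(), labels.tolist()))
for (year, cluster), n in sorted(counts.items()):
    print(year, cluster, n)
```

Plotting the per-year counts for each cluster gives a simple view of how a topic's coverage rises or falls over time.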

Outcome