michaelborisov/text-analysis-lab

Important comment

If you do not want to download the .ipynb files, or you run into problems displaying the graphs, please refer to the html folder.


Overview

Given a set of financial reports issued by publicly traded companies, apply machine learning techniques to find the topics of the paragraphs in each document.

  • Use Python 3.4 as the programming language;
  • Use the spacy and sklearn libraries;
  • Apply the TF/IDF and LDA methods to analyze the given texts.

Folder structure

  • html contains the HTML versions of the .ipynb files;
  • img contains the charts produced by the two methods, TF/IDF and LDA;
  • presentation contains the slides of our presentation;
  • src contains all the configuration we needed;
  • the LDA file applies the LDA method;
  • the TFIDF file applies the TF/IDF method.

Theory and Algorithm

Preprocessing

  • Remove all the unnecessary symbols;
  • Remove all the stop words;
  • Remove all the numbers;
  • Reduce every word to its lemma (lemmatization).
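
The four steps above can be sketched in a few lines. This is only an illustration, not the notebook code: the stop-word set and the lemma lookup table are toy stand-ins (assumptions) for what spacy provides, and the function name `preprocess` is hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list, standing in for spacy's.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "by", "were", "was", "is"}

# Assumption: a toy lemma lookup, standing in for spacy's lemmatizer.
TOY_LEMMAS = {
    "revenues": "revenue",
    "increased": "increase",
    "costs": "cost",
    "reduced": "reduce",
}


def preprocess(text, lemmas=TOY_LEMMAS, stop_words=STOP_WORDS):
    """Return the cleaned, lemmatized tokens of one paragraph."""
    text = text.lower()
    # Keep only ASCII letters and whitespace: drops symbols and numbers.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in stop_words]
    # Map each remaining word to its lemma; unknown words stay unchanged.
    return [lemmas.get(t, t) for t in tokens]


tokens = preprocess("Revenues increased by 12% in 2016, and the costs were reduced.")
print(tokens)  # ['revenue', 'increase', 'cost', 'reduce']
```

With spacy, the lookup table and stop-word set would typically be replaced by the `lemma_` and `is_stop` attributes of the tokens produced by a loaded language pipeline.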

TF/IDF

  • Stands for term frequency-inverse document frequency.
  • Our goals:
    1. Find the most important words in a given text;
    2. Track how the importance of these words changes over several years;
    3. Apply the LSI (latent semantic indexing) technique to the TF/IDF matrix to implement an information retrieval program.
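
A minimal sketch of goals 1 and 3 with sklearn follows. The four paragraphs, the query, and the helper names `top_terms` and `search` are made up for illustration; LSI is approximated here with `TruncatedSVD` applied to the TF/IDF matrix, which may differ from the setup used in the notebooks.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: toy paragraphs standing in for the preprocessed reports.
paragraphs = [
    "revenue grew because loan interest income increased",
    "loan losses increased credit risk for bank income",
    "company opened new retail stores in europe",
    "retail stores sales rose after new product launches",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(paragraphs)  # rows: paragraphs, columns: terms

# Invert the vocabulary so a column index maps back to its term.
terms = {index: term for term, index in vectorizer.vocabulary_.items()}


def top_terms(row, k=3):
    """Goal 1: the k terms with the highest TF/IDF weight in one paragraph."""
    weights = tfidf[row].toarray().ravel()
    best = weights.argsort()[::-1][:k]
    return [terms[i] for i in best if weights[i] > 0]


# Goal 3: project the TF/IDF matrix into a low-rank "latent" space (LSI).
lsi = TruncatedSVD(n_components=2, random_state=0)
paragraph_vectors = lsi.fit_transform(tfidf)


def search(query):
    """Rank paragraphs by cosine similarity to the query in the LSI space."""
    query_vector = lsi.transform(vectorizer.transform([query]))
    scores = cosine_similarity(query_vector, paragraph_vectors).ravel()
    return scores.argsort()[::-1], scores


ranking, scores = search("credit risk loan")
print(top_terms(0))
print(ranking, scores.round(2))
```

For goal 2, the same TF/IDF weights can be averaged per reporting year and plotted, provided each paragraph carries a year label.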

LDA

  • Stands for Latent Dirichlet Allocation.
  • Our goals:
    1. Find several topics in a given text;
    2. Find the paragraphs in the text that are most closely related to each topic;
    3. Track how the topics trend over several years.
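
The three goals can be sketched with sklearn as shown here. The paragraphs, the year labels, and the choice of two topics are assumptions made for illustration; the notebooks may use different data and settings. Note that `n_components` is the parameter name in recent sklearn releases (older ones called it `n_topics`).

```python
from collections import defaultdict

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: toy paragraphs and year labels standing in for the reports.
paragraphs = [
    "revenue grew because loan interest income increased",
    "loan losses increased credit risk for bank income",
    "company opened new retail stores in europe",
    "retail stores sales rose after new product launches",
]
years = [2015, 2015, 2016, 2016]

# LDA works on raw term counts rather than TF/IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(paragraphs)

# Goal 1: fit the topics; each row of doc_topic holds one paragraph's
# topic proportions and sums to 1.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)

# Top words per topic, read from the topic-word weights.
terms = {index: term for term, index in vectorizer.vocabulary_.items()}
top_words = [
    [terms[i] for i in topic.argsort()[::-1][:3]] for topic in lda.components_
]

# Goal 2: for each topic, the index of the paragraph most related to it.
best_paragraph = doc_topic.argmax(axis=0)

# Goal 3: average topic proportions per year to follow the trend.
by_year = defaultdict(list)
for year, row in zip(years, doc_topic):
    by_year[year].append(row)
trend = {year: np.mean(rows, axis=0) for year, rows in by_year.items()}

print(top_words)
print(best_paragraph)
print(trend)
```

On a corpus this small the fitted topics are not meaningful; the sketch only shows how the document-topic matrix feeds the paragraph lookup and the yearly trend.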
