# Text mining

To mine text you can use, for example:
* TF-IDF
* [TextRank](https://github.com/davidadamojr/TextRank)
* [RAKE](https://github.com/aneesha/RAKE)

Below I will be using the RAKE (Rapid Automatic Keyword Extraction) algorithm. In addition, we will use [NLTK](http://www.nltk.org/).

In this file we will analyse the text from [this](https://thenextweb.com/artificial-intelligence/2018/01/11/ai-learns-how-to-fool-text-to-speech-thats-bad-news-for-voice-assistants/) article.

## Before we start
Let's import the needed packages and create the `Rake` object.

In [None]:
from rake_nltk import Rake
from collections import OrderedDict
from operator import itemgetter 

In [None]:
r = Rake()

## How does it work?

RAKE is based on an observation made by its authors: keywords very often contain multiple words, but they rarely contain punctuation or stop words, such as the function words *and*, *the* and *of*, or other words with low lexical meaning.

### Math behind it
First, the text is split into an array of words at the word delimiters. This array is then split into sequences of contiguous words at the phrase delimiters and stop word positions. This is how the candidate keywords are made.
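
The splitting step can be sketched in plain Python. This is only an illustration, not the code used by `rake_nltk`: the tiny `STOP_WORDS` set, the punctuation pattern and the helper name `candidate_phrases` are all assumptions made for this example.

```python
import re

# A tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "that", "the", "to"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    # Phrase delimiters: cut the text at common punctuation first.
    for fragment in re.split(r"[.,;:!?()\[\]\"]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


phrases = candidate_phrases(
    "Voice assistants are a target for adversarial audio, and the attack is hard to detect."
)
print(phrases)
```

Each candidate is kept as a tuple of words, so multi-word phrases such as `("voice", "assistants")` stay together.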

#### Keyword scores
Next, we build a co-occurrence graph for the words in the candidate keywords:

![co-occurence_graph.png](https://github.com/konradbjk/Rule-Based-Engine-pyknow/graphics/co-occurence_graph.png)

Next we calculate the frequency of occurrence of each word. A word's score is its degree divided by its frequency, and the score of a phrase is the sum of the scores of the words in that phrase.

![img](http://bit.ly/2nAFeJV)

So we receive a table like this:

![score_calculating.png](https://github.com/konradbjk/Rule-Based-Engine-pyknow/graphics/score_calculating.png)
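
The scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `rake_scores` and the example phrases are made up, and a word's degree is taken to be the total length of the phrases it appears in (so a word co-occurs with itself).

```python
from collections import defaultdict


def rake_scores(phrases):
    """Return (word_score, phrase_score) for a list of phrases given as word tuples."""
    frequency = defaultdict(int)  # how often each word occurs
    degree = defaultdict(int)     # co-occurrences of each word, itself included
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)

    # Word score: degree divided by frequency.
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    # Phrase score: sum of the scores of its words.
    phrase_score = {p: sum(word_score[w] for w in p) for p in set(phrases)}
    return word_score, phrase_score


# Made-up candidate phrases, used only for illustration.
example = [("adversarial", "audio"), ("audio",), ("voice", "assistants")]
word_score, phrase_score = rake_scores(example)

for phrase, score in sorted(phrase_score.items(), key=lambda item: item[1], reverse=True):
    print(score, " ".join(phrase))
```

Here `audio` occurs twice with a total degree of 3, so it scores 1.5, while each word that occurs once in a two-word phrase scores 2.0.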

For people who want to know more details, I suggest reading [this article](https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents).

### Keyword extraction

First of all, let's extract keywords from the given text. Unfortunately, this package does not read files for us, but we can code that on our own.

In [None]:
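# Read the whole article into one string.
with open('sample_text.txt', 'r') as f:
    # Replace newlines with spaces so words on adjacent lines are not glued together.
    text = f.read().replace('\n', ' ')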

In [None]:
r.extract_keywords_from_text(text)

r.get_ranked_phrases()

But getting the phrases alone is not enough. Let's also get the score of each phrase. It will tell us more about them.

In [None]:
r.get_ranked_phrases_with_scores()

### Word degrees

Okay, we extracted the phrases, but in many cases this is not enough. That is why we can also use word degrees (co-occurrences of the words).

In [None]:
word_degrees = r.get_word_degrees()

sorted_word_degrees = OrderedDict(sorted(word_degrees.items(), key = itemgetter(1), reverse = True))

sorted_word_degrees

### Word occurrences

Now that we have the word degrees and the key phrases / words, we would also like to get the number of occurrences of each word in the text.

In [None]:
word_frequency = r.frequency_dist

sorted_word_frequency = OrderedDict(sorted(word_frequency.items(), key = itemgetter(1), reverse = True))

sorted_word_frequency
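
To see what such a frequency table contains without running the extractor, here is a minimal sketch in which a plain `collections.Counter` stands in for the frequency distribution. The example phrases are made up for illustration.

```python
from collections import Counter

# Made-up candidate phrases, used only for illustration.
phrases = [("adversarial", "audio"), ("audio",), ("voice", "assistants")]

# Count how many times each word appears across all candidate phrases.
word_counts = Counter(word for phrase in phrases for word in phrase)

# most_common() returns (word, count) pairs sorted by descending count.
top_words = word_counts.most_common(2)
print(top_words)
```

`most_common()` gives the same descending order as the `OrderedDict(sorted(...))` pattern used in the cells above, in a single call.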