# Count Vectorizer (AKA One-Hot Encoding)

1. One of the most basic ways to represent words numerically is one-hot encoding (sometimes loosely called count vectorizing; strictly, a count vectorizer records how often each word occurs in a document).
2. The idea is simple: create a vector with as many dimensions as your corpus has unique words. Each unique word gets its own dimension and is represented by a 1 in that dimension and 0s everywhere else. Unfortunately, this won’t provide us with any semantic or relational information, but that’s okay, since that’s not the point of this technique.
3. https://towardsdatascience.com/introduction-to-word-embeddings-4cf857b12edc
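
As a minimal sketch of the idea, the snippet below builds a vocabulary from a toy corpus and encodes words against it. The corpus, the whitespace tokenization, and the helper names `one_hot` and `count_vector` are assumptions made for illustration, not part of the referenced article.

```python
# Minimal one-hot encoding sketch (toy corpus and helper names are illustrative).
corpus = ["the cat sat", "the dog sat down"]

# Build the vocabulary: one dimension per unique word, in sorted order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector with a 1 in the word's dimension and 0s elsewhere."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def count_vector(doc):
    """Add up the one-hot vectors of a document's words."""
    vector = [0] * len(vocabulary)
    for word in doc.split():
        vector[word_to_index[word]] += 1
    return vector


print(vocabulary)                        # ['cat', 'dog', 'down', 'sat', 'the']
print(one_hot("dog"))                    # [0, 1, 0, 0, 0]
print(count_vector("the dog sat down"))  # [0, 1, 1, 1, 1]
```

Note that the vectors for "cat" and "dog" are no closer to each other than to any other word, which is the missing semantic information mentioned above.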

# Term Frequency — Inverse Document Frequency (TF-IDF)
1. TF-IDF is a statistic that measures how important a term is to a document, relative to a corpus (a collection of documents). The TF-IDF of a term is given by the equation:
2. TF-IDF(term) = TF(term in a document) * IDF(term)

    TF(term) = # of times the term appears in document / total # of terms in document
    IDF(term) = log(total # of documents / # of documents with term in it)
                                                                                

3. https://triton.ml/blog/tf-idf-from-scratch
4. https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/

5. The inverse document frequency (IDF) tells us how informative a term is across a collection of documents. A good example is the word “the.” Just about every document contains “the,” so the term isn’t special, and its IDF is very low. Contrast that with “Python” in a collection of blog posts: “Python” appears in few of the posts, so its IDF is high. A high IDF signals that the term is important to any document in which it does appear.

When we multiply TF and IDF, a larger value means the term is more important to that document. We can then compute the TF-IDF of each word in each document and collect the values into a vector.
6. TF-IDF vectors can also serve as features for machine learning, for example to power recommendations.
7. TF-IDF yields one score per term per document, based on frequency; collecting a document's scores gives that document's vector.
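
The formulas above can be checked on a tiny example. This is a sketch under stated assumptions: the three toy documents are made up, tokenization is a plain whitespace split, and the natural logarithm is used (the formula does not fix a base, and the notebook code further down uses base 10).

```python
import math

# Toy corpus (made up for illustration); each document is a list of tokens.
documents = [
    "the python tutorial covers the basics".split(),
    "the weather is nice today".split(),
    "the market closed higher today".split(),
]


def tf(term, doc):
    """Times the term appears in the document / total terms in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """log(total documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


first = documents[0]
# "the" is in every document, so IDF = log(3 / 3) = 0 and TF-IDF is 0.
print(tf_idf("the", first, documents))
# "python" is in one of three documents, so TF-IDF = (1 / 6) * log(3).
print(tf_idf("python", first, documents))
```

Even though "the" occurs twice in the first document and "python" only once, the IDF factor pushes "the" to zero and leaves "python" with the higher score.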

### Steps:
1. Tokenize the sentences
2. Create the Frequency matrix of the words in each sentence.
3. Calculate the term frequency (TF) and generate a matrix
4. Create a table of documents per word
5. Calculate IDF and generate a matrix
6. Calculate TF-IDF and generate a matrix
7. Score the sentences
8. Find the threshold
9. Generate the summary

In [0]:
text = "Those Who Are Resilient Stay In The Game Longer “On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.” — Friedrich Nietzsche Challenges and setbacks are not meant to defeat you, but promote you. However, I realise after many years of defeats, it can crush your spirit and it is easier to give up than risk further setbacks and disappointments. Have you experienced this before? To be honest, I don’t have the answers. I can’t tell you what the right course of action is; only you will know. However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people. To a person with a Fixed Mindset failure is a blow to their self-esteem, yet to a person with a Growth Mindset, it’s an opportunity to improve and find new ways to overcome their obstacles. Same failure, yet different responses. Who is right and who is wrong? Neither. Each person has a different mindset that decides their outcome. Those who are resilient stay in the game longer and draw on their inner means to succeed.I’ve coached many clients who gave up after many years toiling away at their respective goal or dream. It was at that point their biggest breakthrough came. Perhaps all those years of perseverance finally paid off. It was the 19th Century’s minister Henry Ward Beecher who once said: “One’s best success comes after their greatest disappointments.” No one knows what the future holds, so your only guide is whether you can endure repeated defeats and disappointments and still pursue your dream. Consider the advice from the American academic and psychologist Angela Duckworth who writes in Grit: The Power of Passion and Perseverance: “Many of us, it seems, quit what we start far too early and far too often. 
Even more than the effort a gritty person puts in on a single day, what matters is that they wake up the next day, and the next, ready to get on that treadmill and keep going.”I know one thing for certain: don’t settle for less than what you’re capable of, but strive for something bigger. Some of you reading this might identify with this message because it resonates with you on a deeper level. For others, at the end of their tether the message might be nothing more than a trivial pep talk. What I wish to convey irrespective of where you are in your journey is: NEVER settle for less. If you settle for less, you will receive less than you deserve and convince yourself you are justified to receive it.“Two people on a precipice over Yosemite Valley” by Nathan Shipps on Unsplash Develop A Powerful Vision Of What You Want “Your problem is to bridge the gap which exists between where you are now and the goal you intend to reach.” — Earl Nightingale I recall a passage my father often used growing up in 1990s: “Don’t tell me your problems unless you’ve spent weeks trying to solve them yourself.” That advice has echoed in my mind for decades and became my motivator. Don’t leave it to other people or outside circumstances to motivate you because you will be let down every time. It must come from within you. Gnaw away at your problems until you solve them or find a solution. Problems are not stop signs, they are advising you that more work is required to overcome them. Most times, problems help you gain a skill or develop the resources to succeed later. So embrace your challenges and develop the grit to push past them instead of retreat in resignation. Where are you settling in your life right now? Could you be you playing for bigger stakes than you are? Are you willing to play bigger even if it means repeated failures and setbacks? You should ask yourself these questions to decide whether you’re willing to put yourself on the line or settle for less. 
And that’s fine if you’re content to receive less, as long as you’re not regretful later.If you have not achieved the success you deserve and are considering giving up, will you regret it in a few years or decades from now? Only you can answer that, but you should carve out time to discover your motivation for pursuing your goals. It’s a fact, if you don’t know what you want you’ll get what life hands you and it may not be in your best interest, affirms author Larry Weidel: “Winners know that if you don’t figure out what you want, you’ll get whatever life hands you.” The key is to develop a powerful vision of what you want and hold that image in your mind. Nurture it daily and give it life by taking purposeful action towards it.Vision + desire + dedication + patience + daily action leads to astonishing success. Are you willing to commit to this way of life or jump ship at the first sign of failure? I’m amused when I read questions written by millennials on Quora who ask how they can become rich and famous or the next Elon Musk. Success is a fickle and long game with highs and lows. Similarly, there are no assurances even if you’re an overnight sensation, to sustain it for long, particularly if you don’t have the mental and emotional means to endure it. This means you must rely on the one true constant in your favour: your personal development. The more you grow, the more you gain in terms of financial resources, status, success — simple. If you leave it to outside conditions to dictate your circumstances, you are rolling the dice on your future.So become intentional on what you want out of life. Commit to it. Nurture your dreams. Focus on your development and if you want to give up, know what’s involved before you take the plunge. Because I assure you, someone out there right now is working harder than you, reading more books, sleeping less and sacrificing all they have to realise their dreams and it may contest with yours. Don’t leave your dreams to chance."

In [0]:
import nltk
# nltk.download('punkt')
import math

from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords    
    
'''
We already have a sentence tokenizer, so we just need 
to run the sent_tokenize() method to create the array of sentences.
'''
# 1 Sentence Tokenize
sentences = sent_tokenize(text)
for sent in sentences:
    print(sent)
total_documents = len(sentences)


Those Who Are Resilient Stay In The Game Longer “On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.” — Friedrich Nietzsche Challenges and setbacks are not meant to defeat you, but promote you.
However, I realise after many years of defeats, it can crush your spirit and it is easier to give up than risk further setbacks and disappointments.
Have you experienced this before?
To be honest, I don’t have the answers.
I can’t tell you what the right course of action is; only you will know.
However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people.
To a person with a Fixed Mindset failure is a blow to their self-esteem, yet to a person with a Growth Mindset, it’s an opportunity to improve and find new ways to overcome their obstacles.
Same failure, yet different 

In [0]:
total_documents

47

In [0]:
# 2 Create the Frequency matrix of the words in each sentence.
# nltk.download('stopwords')
def _create_frequency_matrix(sentences):
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()

    for sent in sentences:
        freq_table = {}
        words = word_tokenize(sent)
        for word in words:
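            word = word.lower()
            # Note: stemming runs before the stop-word check, so stop words whose
            # stem differs from the word itself (e.g. "this" -> "thi") are not filtered.
            word = ps.stem(word)
            if word in stopWords:
                continue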

            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1

        frequency_matrix[sent[:15]] = freq_table  # key: first 15 characters of the sentence (sentences sharing that prefix would overwrite each other)

    return frequency_matrix
freq_matrix = _create_frequency_matrix(sentences)
print('frequency matrix of words in each sentence: ', freq_matrix)

frequency matrix of words in each sentence:  {'Those Who Are R': {'resili': 1, 'stay': 1, 'game': 1, 'longer': 1, '“': 1, 'mountain': 1, 'truth': 1, 'never': 1, 'climb': 2, 'vain': 1, ':': 1, 'either': 1, 'reach': 1, 'point': 1, 'higher': 2, 'today': 1, ',': 2, 'train': 1, 'power': 1, 'abl': 1, 'tomorrow.': 1, '”': 1, '—': 1, 'friedrich': 1, 'nietzsch': 1, 'challeng': 1, 'setback': 1, 'meant': 1, 'defeat': 1, 'promot': 1, '.': 1}, 'However, I real': {'howev': 1, ',': 2, 'realis': 1, 'mani': 1, 'year': 1, 'defeat': 1, 'crush': 1, 'spirit': 1, 'easier': 1, 'give': 1, 'risk': 1, 'setback': 1, 'disappoint': 1, '.': 1}, 'Have you experi': {'experienc': 1, 'thi': 1, 'befor': 1, '?': 1}, 'To be honest, I': {'honest': 1, ',': 1, '’': 1, 'answer': 1, '.': 1}, 'I can’t tell yo': {'’': 1, 'tell': 1, 'right': 1, 'cours': 1, 'action': 1, ';': 1, 'onli': 1, 'know': 1, '.': 1}, 'However, it’s i': {'howev': 1, ',': 2, '’': 1, 'import': 1, 'discourag': 1, 'failur': 2, 'pursu': 1, 'goal': 1, 'dream': 1,

In [0]:
'''
Term frequency (TF) is how often a word appears in a document, divided by how many words there are in that document.
'''
def _create_tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, f_table in freq_matrix.items():
        tf_table = {}

        count_words_in_sentence = len(f_table)
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix
# 3 Calculate TermFrequency and generate a matrix
tf_matrix = _create_tf_matrix(freq_matrix)
print(tf_matrix)



{'Those Who Are R': {'resili': 0.03225806451612903, 'stay': 0.03225806451612903, 'game': 0.03225806451612903, 'longer': 0.03225806451612903, '“': 0.03225806451612903, 'mountain': 0.03225806451612903, 'truth': 0.03225806451612903, 'never': 0.03225806451612903, 'climb': 0.06451612903225806, 'vain': 0.03225806451612903, ':': 0.03225806451612903, 'either': 0.03225806451612903, 'reach': 0.03225806451612903, 'point': 0.03225806451612903, 'higher': 0.06451612903225806, 'today': 0.03225806451612903, ',': 0.06451612903225806, 'train': 0.03225806451612903, 'power': 0.03225806451612903, 'abl': 0.03225806451612903, 'tomorrow.': 0.03225806451612903, '”': 0.03225806451612903, '—': 0.03225806451612903, 'friedrich': 0.03225806451612903, 'nietzsch': 0.03225806451612903, 'challeng': 0.03225806451612903, 'setback': 0.03225806451612903, 'meant': 0.03225806451612903, 'defeat': 0.03225806451612903, 'promot': 0.03225806451612903, '.': 0.03225806451612903}, 'However, I real': {'howev': 0.07142857142857142, ',
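
One detail worth noticing in the output above: `_create_tf_matrix` divides by `len(f_table)`, which is the number of distinct terms in the sentence, while the TF definition divides by the total number of terms. The sketch below compares the two denominators on a small, hypothetical frequency table; the table and variable names are made up for illustration.

```python
# Hypothetical frequency table for one sentence (stemmed term -> count).
f_table = {"climb": 2, "higher": 2, "truth": 1, "vain": 1}

distinct_terms = len(f_table)        # 4: the denominator used by the function above
total_terms = sum(f_table.values())  # 6: total terms kept in the sentence

tf_by_distinct = {word: count / distinct_terms for word, count in f_table.items()}
tf_by_total = {word: count / total_terms for word, count in f_table.items()}

print(tf_by_distinct["climb"])  # 2 / 4
print(tf_by_total["climb"])     # 2 / 6
```

Dividing by the total makes the TF values of a sentence sum to 1; dividing by the distinct count does not, although the ranking of terms within a single sentence stays the same either way.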

In [0]:
# 4 Create a table of documents per word, i.e. the number of documents each word appears in
def _create_documents_per_words(freq_matrix):
    word_per_doc_table = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table

count_doc_per_words = _create_documents_per_words(freq_matrix)
print(count_doc_per_words)



{'resili': 2, 'stay': 2, 'game': 3, 'longer': 2, '“': 5, 'mountain': 1, 'truth': 1, 'never': 2, 'climb': 1, 'vain': 1, ':': 8, 'either': 1, 'reach': 1, 'point': 2, 'higher': 1, 'today': 1, ',': 21, 'train': 1, 'power': 4, 'abl': 1, 'tomorrow.': 1, '”': 5, '—': 3, 'friedrich': 1, 'nietzsch': 1, 'challeng': 2, 'setback': 2, 'meant': 1, 'defeat': 3, 'promot': 1, '.': 40, 'howev': 2, 'realis': 2, 'mani': 3, 'year': 4, 'crush': 1, 'spirit': 1, 'easier': 1, 'give': 4, 'risk': 1, 'disappoint': 2, 'experienc': 1, 'thi': 4, 'befor': 2, '?': 6, 'honest': 1, '’': 16, 'answer': 2, 'tell': 2, 'right': 4, 'cours': 1, 'action': 2, ';': 1, 'onli': 3, 'know': 5, 'import': 1, 'discourag': 1, 'failur': 4, 'pursu': 3, 'goal': 4, 'dream': 6, 'sinc': 1, 'mean': 4, 'differ': 3, 'thing': 2, 'peopl': 3, 'person': 4, 'fix': 1, 'mindset': 2, 'blow': 1, 'self-esteem': 1, 'yet': 2, 'growth': 1, 'opportun': 1, 'improv': 1, 'find': 2, 'new': 1, 'way': 2, 'overcom': 2, 'obstacl': 1, 'respons': 1, 'wrong': 1, 'neither
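
The same document counts can be produced more compactly with `collections.Counter` from the standard library. This is only an alternative sketch; the small frequency matrix below is hypothetical and stands in for `freq_matrix`.

```python
from collections import Counter

# Hypothetical frequency matrix: sentence key -> {term: count in that sentence}.
freq_matrix_example = {
    "sentence one": {"dream": 2, "goal": 1},
    "sentence two": {"dream": 1, "failur": 3},
}

# Each sentence contributes 1 per distinct term, however often the term repeats in it.
doc_counts = Counter(
    word for f_table in freq_matrix_example.values() for word in f_table
)

print(dict(doc_counts))  # {'dream': 2, 'goal': 1, 'failur': 1}
```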

In [0]:
'''
Inverse document frequency (IDF) is how unique or rare a word is.
'''
# 5 Calculate IDF and generate a matrix
def _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix
    
idf_matrix = _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)
print(idf_matrix)



{'Those Who Are R': {'resili': 1.3710678622717363, 'stay': 1.3710678622717363, 'game': 1.194976603216055, 'longer': 1.3710678622717363, '“': 0.9731278535996987, 'mountain': 1.6720978579357175, 'truth': 1.6720978579357175, 'never': 1.3710678622717363, 'climb': 1.6720978579357175, 'vain': 1.6720978579357175, ':': 0.7690078709437739, 'either': 1.6720978579357175, 'reach': 1.6720978579357175, 'point': 1.3710678622717363, 'higher': 1.6720978579357175, 'today': 1.6720978579357175, ',': 0.3498785632017982, 'train': 1.6720978579357175, 'power': 1.070037866607755, 'abl': 1.6720978579357175, 'tomorrow.': 1.6720978579357175, '”': 0.9731278535996987, '—': 1.194976603216055, 'friedrich': 1.6720978579357175, 'nietzsch': 1.6720978579357175, 'challeng': 1.3710678622717363, 'setback': 1.3710678622717363, 'meant': 1.6720978579357175, 'defeat': 1.194976603216055, 'promot': 1.6720978579357175, '.': 0.07003786660775509}, 'However, I real': {'howev': 1.3710678622717363, ',': 0.3498785632017982, 'realis': 1.

In [0]:
# 6 Calculate TF-IDF and generate a matrix
def _create_tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        for (word1, value1), (word2, value2) in zip(f_table1.items(),
                                                    f_table2.items()):  # here, keys are the same in both the table
            tf_idf_table[word1] = float(value1 * value2)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix

tf_idf_matrix = _create_tf_idf_matrix(tf_matrix, idf_matrix)
print(tf_idf_matrix)



{'Those Who Are R': {'resili': 0.04422799555715278, 'stay': 0.04422799555715278, 'game': 0.03854763236180822, 'longer': 0.04422799555715278, '“': 0.03139122108386125, 'mountain': 0.053938640578571534, 'truth': 0.053938640578571534, 'never': 0.04422799555715278, 'climb': 0.10787728115714307, 'vain': 0.053938640578571534, ':': 0.024806705514315287, 'either': 0.053938640578571534, 'reach': 0.053938640578571534, 'point': 0.04422799555715278, 'higher': 0.10787728115714307, 'today': 0.053938640578571534, ',': 0.022572810529148273, 'train': 0.053938640578571534, 'power': 0.03451735053573403, 'abl': 0.053938640578571534, 'tomorrow.': 0.053938640578571534, '”': 0.03139122108386125, '—': 0.03854763236180822, 'friedrich': 0.053938640578571534, 'nietzsch': 0.053938640578571534, 'challeng': 0.04422799555715278, 'setback': 0.04422799555715278, 'meant': 0.053938640578571534, 'defeat': 0.03854763236180822, 'promot': 0.053938640578571534, '.': 0.0022592860196050026}, 'However, I real': {'howev': 0.0979

In [0]:
# 7 Important Algorithm: score the sentences
def _score_sentences(tf_idf_matrix) -> dict:
    """
    Score a sentence by its words' TF-IDF values.
    Basic algorithm: sum the TF-IDF score of every term in the sentence, then divide by the number of distinct terms in the sentence.
    :rtype: dict
    """

    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score

        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue

sentence_scores = _score_sentences(tf_idf_matrix)
print(sentence_scores)



{'Those Who Are R': 0.04803659195444306, 'However, I real': 0.08914534088514688, 'Have you experi': 0.31294688714795516, 'To be honest, I': 0.15724240101187198, 'I can’t tell yo': 0.1217591297003205, 'However, it’s i': 0.0842231406821737, 'To a person wit': 0.08040455711490915, 'Same failure, y': 0.15911379499557823, 'Who is right an': 0.40400914801061627, 'Neither.': 0.43553393113586814, 'Each person has': 0.16572150573841818, 'Those who are r': 0.06772015017870933, 'It was at that ': 0.21745742124884385, 'Perhaps all tho': 0.2090954769248444, 'It was the 19th': 0.03969988420099157, 'Consider the ad': 0.06217128885552281, 'Even more than ': 0.04881311476003467, 'Some of you rea': 0.14030593962541718, 'For others, at ': 0.1319463930196733, 'What I wish to ': 0.13290838879898648, 'If you settle f': 0.02726319713624063, 'Don’t leave it ': 0.10412586940812174, 'It must come fr': 0.28026696556793407, 'Gnaw away at yo': 0.17545867420208477, 'Problems are no': 0.13110433897579837, 'Most time

In [0]:
# 8 Find the threshold
def _find_average_score(sentenceValue) -> float:
    """
    Find the average score from the sentence value dictionary
    :rtype: float
    """
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average score per sentence in the original text
    average = (sumValues / len(sentenceValue))

    return average
threshold = _find_average_score(sentence_scores)
print(threshold)



0.14822866456525816


In [0]:
# 9 Important Algorithm: Generate the summary
def _generate_summary(sentences, sentenceValue, threshold):
    summary = ''

    for sentence in sentences:
        # look up the score with the same 15-character key used in the frequency matrix
        key = sentence[:15]
        if key in sentenceValue and sentenceValue[key] >= threshold:
            summary += " " + sentence

    return summary
summary = _generate_summary(sentences, sentence_scores, 1.3 * threshold)
print(summary)

 Have you experienced this before? Who is right and who is wrong? Neither. It was at that point their biggest breakthrough came. Perhaps all those years of perseverance finally paid off. It must come from within you. Where are you settling in your life right now? Could you be you playing for bigger stakes than you are? Commit to it. Nurture your dreams.


### TASK:
1. Find the most important sentence in a document.
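
One possible way to approach this task, assuming "most important" means the highest average TF-IDF score, is to take the maximum over the sentence scores. The sketch below uses a few hypothetical sentences and scores (rounded from the printed output above); the function name `most_important_sentence` is made up for illustration.

```python
# Hypothetical inputs in the same shape as `sentences` and `sentence_scores`:
# scores are keyed by the first 15 characters of each sentence.
example_sentences = [
    "Have you experienced this before?",
    "Neither.",
    "It must come from within you.",
]
example_scores = {
    "Have you experi": 0.31,
    "Neither.": 0.44,
    "It must come fr": 0.28,
}


def most_important_sentence(sentences, scores, key_length=15):
    """Return the full sentence whose truncated key has the highest score."""
    return max(sentences, key=lambda s: scores.get(s[:key_length], 0.0))


print(most_important_sentence(example_sentences, example_scores))  # Neither.
```

Because the score is an average over the terms of a sentence, very short sentences can rank highest, so it may be worth requiring a minimum sentence length before picking the winner.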

# LDA (Latent Dirichlet Allocation)

1. https://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
2. https://www.youtube.com/watch?v=DDq3OVp9dNA
3. https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

Suppose you have the following set of sentences:

    I like to eat broccoli and bananas.
    I ate a banana and spinach smoothie for breakfast.
    Chinchillas and kittens are cute.
    My sister adopted a kitten yesterday.
    Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:

    Sentences 1 and 2: 100% Topic A
    Sentences 3 and 4: 100% Topic B
    Sentence 5: 60% Topic A, 40% Topic B
    Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
    Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

The question, of course, is: how does LDA perform this discovery?
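
One standard way to perform this discovery is collapsed Gibbs sampling: start with random topic assignments, then repeatedly reassign each word to a topic in proportion to how much its document already uses that topic and how much that topic already uses the word. The following is a toy sketch of that procedure, not a production implementation. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the hand-tokenized sentences are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables: document-topic, topic-word, and total words per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by assigning every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight of topic k given all other assignments:
                # (how much doc d uses k) * (how much k uses word w).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document.
    return [
        [
            (doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
            for k in range(num_topics)
        ]
        for d, doc in enumerate(docs)
    ]


# Hand-tokenized versions of the five sentences above
# (stop words dropped and plurals merged by hand).
docs = [
    ["eat", "broccoli", "banana"],
    ["ate", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopted", "kitten"],
    ["cute", "hamster", "munching", "broccoli"],
]

for proportions in lda_gibbs(docs, num_topics=2):
    print([round(p, 2) for p in proportions])
```

With a corpus this small, the result depends heavily on the seed and the hyperparameters, so the topics will not necessarily match the food and cute-animal split shown above.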