# Representing Text
    - Bag of Words Model
        - Represents arbitrary text as a fixed-length vector of word frequencies
        - This conversion is often called vectorization
        - Word order and grammar are lost
    - One Hot Encoding
        - Each vocabulary word gets 1 if it occurs in the text, else 0
        - Carries no information about how important the word is
    - TF-IDF Term Weighting
        - Term Frequency * Inverse Document Frequency
        - If a term appears frequently in a document, it is important to that document: give the term a high score
        - If a term appears in many documents, it is not a unique identifier: give the term a low score


## Bag of Words Model

In [1]:
documents = [
    "the cat sat",
    "the cat sat in the hat",
    "the cat with the hat",
]

In [19]:
vocabulary = list(set(' '.join(documents).split()))
vocabulary

['in', 'cat', 'sat', 'hat', 'the', 'with']

In [20]:
bow_vectors = []
for each_doc in documents:
    doc_words = each_doc.split()
    # One count per vocabulary word, in vocabulary order
    doc_vector = [doc_words.count(word) for word in vocabulary]
    bow_vectors.append(doc_vector)

print(vocabulary)
for i, x in enumerate(bow_vectors):
    print(documents[i], x)

['in', 'cat', 'sat', 'hat', 'the', 'with']
the cat sat [0, 1, 1, 0, 1, 0]
the cat sat in the hat [1, 1, 1, 1, 2, 0]
the cat with the hat [0, 1, 0, 1, 2, 1]
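
The column order above depends on `list(set(...))`, which can differ between runs. A minimal sketch of the same counting with a sorted vocabulary and `collections.Counter`, assuming plain whitespace tokenization; the function name `bag_of_words` is made up for this example:

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, vectors) using whitespace tokenization."""
    tokenized = [doc.split() for doc in docs]
    # Sorting gives a stable column order across runs, unlike list(set(...))
    vocab = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)  # words missing from the document count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


docs = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]
vocab, vectors = bag_of_words(docs)
print(vocab)  # ['cat', 'hat', 'in', 'sat', 'the', 'with']
for doc, vec in zip(docs, vectors):
    print(doc, vec)
# the cat sat [1, 0, 0, 1, 1, 0]
# the cat sat in the hat [1, 1, 1, 1, 2, 0]
# the cat with the hat [1, 1, 0, 0, 2, 1]
```

Every vector has the same length as the vocabulary, whatever the length of the document.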


## One Hot Encoding

In [18]:
ohc_vectors = []
for each_doc in documents:
    words = each_doc.split()
    doc_vector = []
    for word in vocabulary:
        # 1 if the vocabulary word occurs in the document, else 0
        doc_vector.append(1 if word in words else 0)
    ohc_vectors.append(doc_vector)
print(vocabulary)
for i, x in enumerate(ohc_vectors):
    print(documents[i], x)

['in', 'cat', 'sat', 'hat', 'the', 'with']
the cat sat [0, 1, 1, 0, 1, 0]
the cat sat in the hat [1, 1, 1, 1, 1, 0]
the cat with the hat [0, 1, 0, 1, 1, 1]
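
The vectors above mark presence per document. One-hot encoding is also commonly described per word, where each word becomes a vector with a single 1. A rough sketch of that view, assuming a sorted vocabulary; `one_hot_words` is a name invented here:

```python
def one_hot_words(sentence, vocab):
    """Return one vector per word, with a single 1 at the word's vocabulary index."""
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for word in sentence.split():
        vec = [0] * len(vocab)
        vec[index[word]] = 1  # raises KeyError for out-of-vocabulary words
        vectors.append(vec)
    return vectors


vocab = ['cat', 'hat', 'in', 'sat', 'the', 'with']
word_vectors = one_hot_words("the cat sat", vocab)
for word, vec in zip("the cat sat".split(), word_vectors):
    print(word, vec)
# the [0, 0, 0, 0, 1, 0]
# cat [1, 0, 0, 0, 0, 0]
# sat [0, 0, 0, 1, 0, 0]

# Taking the column-wise maximum collapses the word vectors
# into the per-document presence vector
doc_vector = [max(column) for column in zip(*word_vectors)]
print(doc_vector)  # [1, 0, 0, 1, 1, 0]
```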


# TF-IDF

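Number of documents: N
Number of documents containing the term: Nt

Normalized Term Frequency (TF):
    - TF = number of times the term appears in the document / total number of terms in the document
    - The more often a term occurs in a document, the higher its TF

Inverse Document Frequency (IDF):
    - IDF = log(N / (1 + Nt))
    - The 1 in the denominator avoids division by zero for a term that appears in no document

TF-IDF:
    - TF-IDF = TF * IDF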
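
To get a feel for the IDF formula, here is a small worked example for N = 3, the corpus size used in the cells that follow. The helper `idf` is made up for this example, and the natural logarithm is assumed, as in `math.log`:

```python
import math


def idf(n_docs, n_docs_with_term):
    """IDF as defined above: log(N / (1 + Nt)), natural log."""
    return math.log(n_docs / (1 + n_docs_with_term))


N = 3
for nt in (1, 2, 3):
    print(nt, round(idf(N, nt), 4))
# 1 0.4055
# 2 0.0
# 3 -0.2877
```

With only three documents, a term gets a positive weight only if it occurs in exactly one of them. A term found in two documents scores zero, and a term found in all three scores below zero.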


In [3]:
document1 = '''Most college students are not equipped with the skills necessary to project three-dimensional meaning onto a two-dimensional canvass of paper. And, unfortunately, they are not provided with more than a benevolent injunction: go and write. But how effective is this diktat? If you ever wriggled in front of a blank page, you know that its usefulness borders on zero. Even if you don’t struggle with writing, it does not necessarily mean you live up to your literary potential. Do your readers hold their breath when poring over your essay? No? Well, they could. If, of course, you learn how to attack them with a series of short, punchy sentences that are simple enough to get past their guard. Your readers would hold their breath until the last full stop if you also treat them to long, meaty sentences that have enough substance to nourish their hunger for quality. You got to learn how to do it. Here you can do that for free by learning from the best!
'''
document2 = '''A decent sample of essay writing could teach you that good content should be neither thunderously pretentious nor placidly banal. A mind shamelessly aggrandizing and frolicking on a page is rarely palatable; the same applies to muffled intelligence. It’s all about balance. A top-notch academic essay example could also inspire you to let your literary style shape the message. Your own voice should be clear, distinctive, and, above all, heard through the fuzz of text.
'''
document3 = '''It might happen that the topic is secondary to your needs while you have to get a better idea of how a particular type of paper is crafted. We've taken full care of that and systemized our essays writing examples according to their type. Thus, if you need, for instance, a great argumentative essay sample (be it on some legal, environmental or literary topic), getting to it is as simple as just clicking on the respective category in the table.
'''

In [6]:
import math
documents_list = [document1, document2, document3]

document1_words = document1.split()
document2_words = document2.split()
document3_words = document3.split()


N = len(documents_list)
overall_tfidf = []

for document in documents_list:
    doc_tf_idf = {}
    document_words = document.split()
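    for x in document_words:
        # TF: occurrences of x divided by the number of tokens in this document
        tf = document_words.count(x) / len(document_words)
        # Nt: number of documents that contain x.
        # Note: d is the raw string, so `x in d` is a substring test
        # ("are" also matches "care"), not a whole-word match.
        nwdt = 0
        for d in documents_list:
            if x in d:
                nwdt += 1

        # IDF = log(N / (1 + Nt)); zero or negative once x is in 2+ of the 3 documents
        idf = math.log(N/(1 + nwdt))

        doc_tf_idf[x] = tf*idf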

    overall_tfidf.append(doc_tf_idf)


In [11]:
# Show the three highest-scoring terms of each document
for doc_scores in overall_tfidf:
    top_terms = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(top_terms[:3])

[('with', 0.009653931145432485), ('they', 0.004826965572716242), ('readers', 0.004826965572716242)]
[('should', 0.010812402882884384), ('decent', 0.005406201441442192), ('teach', 0.005406201441442192)]
[('might', 0.005068313851352055), ('happen', 0.005068313851352055), ('topic', 0.005068313851352055)]
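
The document-frequency check above uses `x in d` on the raw string, so a word also counts as present when it is only part of a longer word. A possible variant that compares whole tokens instead, shown on a toy corpus. `tfidf_by_token` and `toy_docs` are names invented here, and the scores it produces for the three documents above may differ from the recorded output:

```python
import math


def tfidf_by_token(docs):
    """TF-IDF per document, counting document frequency on whole tokens."""
    tokenized = [doc.split() for doc in docs]
    token_sets = [set(words) for words in tokenized]
    n_docs = len(docs)
    scores = []
    for words in tokenized:
        doc_scores = {}
        for word in set(words):
            tf = words.count(word) / len(words)
            # Nt: documents whose token set contains the word exactly
            nt = sum(1 for tokens in token_sets if word in tokens)
            doc_scores[word] = tf * math.log(n_docs / (1 + nt))
        scores.append(doc_scores)
    return scores


toy_docs = ["we care", "they are here", "we are"]
scores = tfidf_by_token(toy_docs)
# "are" is a token in two documents, so its IDF is log(3 / 3) = 0;
# a substring test would also count "care" and report three documents
print(scores[1]["are"])  # 0.0
```

Punctuation still sticks to tokens with `split()`, so "essay?" and "essay" remain different terms in this sketch.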


# N-Grams

In [25]:
def create_ngrams(sentence, n=2, r=2):
    """
    Build word n-grams from `sentence` and return the `r` most
    frequent ones as (ngram, count) pairs.
    """
    words = sentence.split()

    # Slide a window of n words over the sentence; the range is
    # empty when the sentence has fewer than n words
    ngrams = []
    for i in range(len(words) - n + 1):
        ngrams.append(' '.join(words[i: i+n]))

    # Count each distinct n-gram
    final_dict = {}
    for x in set(ngrams):
        final_dict[x] = ngrams.count(x)

    # Most frequent first; ties come out in arbitrary (set) order
    sorted_dict = sorted(final_dict.items(), key=lambda x: x[1], reverse=True)

    return sorted_dict[:r]

In [28]:
sentence = ''' this is a testing module this is a testing module testing module'''
create_ngrams(sentence, 2)

[('testing module', 3), ('a testing', 2)]
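
Three bigrams share a count of 2 here ('this is', 'is a', 'a testing'), so which one appears second depends on set ordering. A sketch of the same idea with `collections.Counter`, whose `most_common` keeps tied items in first-seen order; `top_ngrams` is a name made up for this example:

```python
from collections import Counter


def top_ngrams(sentence, n=2, r=2):
    """Return the r most frequent word n-grams as (ngram, count) pairs."""
    words = sentence.split()
    # The range is empty when the sentence has fewer than n words
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams).most_common(r)


sentence = "this is a testing module this is a testing module testing module"
print(top_ngrams(sentence, 2))
# [('testing module', 3), ('this is', 2)]
print(top_ngrams("one", 2))
# []
```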