<a href="https://colab.research.google.com/github/navneeshkaur/ML-and-AI-Course/blob/master/TF_IDF_from_Scratch_Solution_Grader_Computation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NOTE:

1. Please implement the TF-IDF function such that each word in a sentence is assigned its corresponding TF-IDF value. The function should therefore return a 4 x 6 matrix, where the rows represent sentences and the columns represent the words of each sentence, in order. We wish to keep it simple in the beginning.

2. In reality, the TF-IDF function should return a matrix where the rows represent sentences and the columns represent words (i.e., features). Every sentence vector in this matrix will be 'd' dimensional, where d = number of unique words in the corpus (i.e., the vocabulary).
Every position/cell in a sentence vector corresponds to a particular word in the vocabulary. If the word is not present in the current sentence, we assign a value of 0 to that cell; otherwise we assign the TF-IDF value.
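
As a rough illustration of point 2, here is a minimal sketch of the vocabulary-based version. It is not part of the assignment and is not what the grader checks. The function name `tfidf_full_vocabulary` is hypothetical, and the sketch assumes the same plain `log(N / n)` IDF (no smoothing, no normalisation) and whitespace tokenisation used in the rest of this notebook.

```python
from collections import Counter
from math import log

import numpy as np


def tfidf_full_vocabulary(corpus):
    """Return (vocabulary, matrix) with one column per unique word, sorted alphabetically."""
    tokenized = [sentence.split() for sentence in corpus]
    n_docs = len(tokenized)

    # Document frequency: number of sentences that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vocabulary = sorted(doc_freq)
    column = {word: j for j, word in enumerate(vocabulary)}

    # Cells stay 0 for words that do not occur in a sentence.
    matrix = np.zeros((n_docs, len(vocabulary)))
    for i, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = log(n_docs / doc_freq[word])  # assumption: plain log(N / n)
            matrix[i, column[word]] = tf * idf
    return vocabulary, matrix


sample_corpus = [
    'this is the first document mostly',
    'this document is the second document',
    'and this is the third one',
    'is this the first document here',
]
vocabulary, matrix = tfidf_full_vocabulary(sample_corpus)
print(vocabulary)
print(matrix.shape)
```

With this corpus the vocabulary has 11 unique words, so each sentence vector is 11-dimensional rather than 6-dimensional.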

# **Implement TF-IDF from scratch**

In this assignment, you will implement TF-IDF vectorization of text from scratch using only Python and inbuilt data structures. You will then verify the correctness of your implementation using a "grader" function/cell (provided by us), which compares your output against the expected result.

The grader function will help you validate the correctness of your code.

Please submit the final Colab notebook in the classroom ONLY after you have verified your code using the grader function/cell.

**(FAQ) Why bother about implementing a function to compute TF-IDF when it is already available in major libraries?**

Ans.
1. It helps you improve your coding proficiency.
2. It helps you obtain a deeper understanding of the concepts and how they work internally. Knowledge of the internals will also help you debug problems better.
3. A lot of product-based startups and companies focus on this in their interviews to gauge your depth and clarity of understanding along with your programming skills. Hence, most top universities have implementations of some ML algorithms/concepts as mandatory assignments.

**NOTE: DO NOT change the "grader" functions or code snippets written by us. Please add your code in the suggested locations.**

Ethics Code:
1. You are welcome to read online resources to help you implement the code.
2. You can also discuss the implementation with your classmates over Slack.
3. However, the code you write and submit should be yours ONLY. Your code will be compared against other students' code and online code snippets to check for plagiarism. If your code is found to be plagiarised, you will be awarded zero marks for all assignments, which have a 10% weightage in the final marks for this course.

In [None]:
# Corpus to be used for this assignment

corpus = [
     'this is the first document mostly',
     'this document is the second document',
     'and this is the third one',
     'is this the first document here',
]

In [None]:
from collections import Counter
from math import log
import numpy as np


In [None]:
# Please implement this function and write your code wherever asked. Do NOT change the code snippets provided by us.
def computeTFIDF(corpus):
  """Given a list of sentences as "corpus", return the TF-IDF vectors for all the 
  sentences in the corpus as a NumPy 2D matrix. 
  
  Each row of the 2D matrix must correspond to one sentence 
  and each column corresponds to a word in the text corpus. 
  
  Please order the rows in the same order as the 
  sentences in the input "corpus". 
  
  Please order the words in the columns in 
  alphabetical order when you featurize the corpus. 
  
  Ignore punctuation symbols like comma, full stop, 
  exclamation mark, question mark etc. in the input corpus.
  
  For example, if the corpus contains sentences with these 
  9 distinct words, ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'], 
  then the first column of the 2D matrix will correspond to the word "and", the second column will 
  correspond to the word "document", and so on. 
  
  Write this function using only basic Python code, inbuilt data structures, and NumPy.

  Implement the code as efficiently as possible using the inbuilt data structures of Python.
  """

  ##############################################################
  ####   YOUR CODE BELOW  as per the above instructions #######
  ##############################################################

  # Calculating inverse document frequency values for entire corpus

  inv_doc_freq = {}
  unique_words_in_corpus = {}
  total_docs_in_corpus_count = len(corpus)

  # Calculating the number of documents in which a specific word occurred: "n"
  for document in corpus:
    document = list(document.split())
    unique_words_in_doc = set(document)
    for word in unique_words_in_doc:
      if word not in unique_words_in_corpus:
        unique_words_in_corpus[word] = 1
      else:
        unique_words_in_corpus[word] += 1
  print(unique_words_in_corpus)
  
  # Computing idf values using the above calculated "n"
  for word, n in unique_words_in_corpus.items():
    # n is the number of documents which contain the word
    inv_doc_freq[word] = log(total_docs_in_corpus_count / n)


  # Calculating term frequency and tf-idf values for entire corpus

  tf_idf = []
  
  for document in corpus:
    document = list(document.split())
    freq_of_words_in_doc = Counter(document)
    tf_idf_document = []

    for word in document:
      # retrieving idf value for the given word
      idf = inv_doc_freq[word]
      # calculating term frequency value for the given word
      tf = freq_of_words_in_doc[word]/len(document)
      # computing tfidf value for the given word and rounding it to 2 places after decimal.

      tf_idf_word = round((tf * idf),2)

      tf_idf_document.append(tf_idf_word)

    tf_idf.append(tf_idf_document)

  tf_idf = np.array(tf_idf)

  return tf_idf


In [None]:
computeTFIDF(corpus)

{'is': 4, 'first': 2, 'mostly': 1, 'document': 3, 'the': 4, 'this': 4, 'second': 1, 'one': 1, 'third': 1, 'and': 1, 'here': 1}


array([[0.  , 0.  , 0.  , 0.12, 0.05, 0.23],
       [0.  , 0.1 , 0.  , 0.  , 0.23, 0.1 ],
       [0.23, 0.  , 0.  , 0.  , 0.23, 0.23],
       [0.  , 0.  , 0.  , 0.12, 0.05, 0.23]])
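
To see where the numbers in the first row come from, the arithmetic can be reproduced by hand. The snippet below is a small sketch that assumes the document frequencies from the dictionary printed above and the same formula as `computeTFIDF` (term frequency times natural-log IDF, rounded to 2 decimal places). The variable names are illustrative only.

```python
from math import log

n_docs = 4           # sentences in the corpus
sentence_length = 6  # words in 'this is the first document mostly'

# Document frequencies copied from the dictionary printed above.
doc_freq = {"this": 4, "is": 4, "the": 4, "first": 2, "document": 3, "mostly": 1}

first_row = []
for word in "this is the first document mostly".split():
    tf = 1 / sentence_length            # each word appears once in this sentence
    idf = log(n_docs / doc_freq[word])  # natural log, no smoothing
    first_row.append(round(tf * idf, 2))

print(first_row)  # [0.0, 0.0, 0.0, 0.12, 0.05, 0.23]
```

Words that occur in all 4 sentences ("this", "is", "the") get an IDF of log(4/4) = 0, which is why the first three cells are 0.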

# Grader Cell
Please execute the following Grader cell to verify the correctness of your implementation above. This cell will print "Success" if your implementation of computeTFIDF() is correct; otherwise, it will print "Failed". Make sure you get a "Success" before you submit the code in the classroom.

In [None]:
###########################################
## GRADER CELL: Do NOT Change this.
# This cell will print "Success" if your implementation of the computeTFIDF() is correct.
# Else, it will print "Failed"
###########################################

# compute TF-IDF using the computeTFIDF() function
X_custom = computeTFIDF(corpus)
#print(X_custom)

X_grader = np.array(
    [[0, 0, 0, 0.12, 0.05, 0.23],
     [0, 0.1, 0, 0, 0.23, 0.1],
     [0.23, 0, 0, 0, 0.23, 0.23],
     [0, 0, 0, 0.12, 0.05, 0.23]]
     )

# compare X_grader and X_custom
comparison = ( X_grader == X_custom )
isEqual = comparison.all()

if isEqual:
  print("******** Success ********")
else:
  print("####### Failed #######")
  print("\nX_grader = \n\n", X_grader)
  print("\n","*"*50)
  print("\nX_custom = \n\n", X_custom)




{'is': 4, 'first': 2, 'mostly': 1, 'document': 3, 'the': 4, 'this': 4, 'second': 1, 'one': 1, 'third': 1, 'and': 1, 'here': 1}
******** Success ********
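
The grader compares the two matrices with `==`, which works here because both sides hold values rounded to 2 decimal places. If you want to sanity-check unrounded values while debugging, a tolerance-based comparison such as NumPy's `np.allclose` may be more forgiving. The sketch below is not a replacement for the grader cell, and the `computed` values are hypothetical unrounded numbers chosen for illustration.

```python
import numpy as np

expected = np.array([0.12, 0.05, 0.23])
computed = np.array([0.1155, 0.0479, 0.2310])  # hypothetical unrounded TF-IDF values

# Exact element-wise equality fails on unrounded values.
exact_match = bool((expected == computed).all())

# A tolerance of half a rounding step (0.005) accepts them.
close_match = bool(np.allclose(expected, computed, atol=0.005))

print(exact_match, close_match)
```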
