<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Term Frequency - Inverse Document Frequency (TF-IDF)
</p>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       TF-IDF creates a document-term matrix with one row per text message and one column per unique term. Instead of a raw count, each cell holds a weight that is meant to identify how important a word is to an individual text message. The TF-IDF value increases proportionally with the number of times a word appears in the message and is offset by the number of messages that contain the word, so terms that occur in almost every message receive low weights.
   </font>
</p>

$$w_{ij} = tf_{ij} \times \log\left(\frac{N}{df_{i}}\right)$$

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
      $tf_{ij}$: Number of times the term <i>i</i> occurs in text message <i>j</i>, divided by the total number of terms in that message<br>
$df_{i}$: Number of documents containing the term <i>i</i> <br>
$N$: Total number of documents 
   </font>
</p>
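
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used later in this notebook: the function name `tf_idf` and the toy corpus are hypothetical, and the base-10 logarithm is an assumption taken from the worked calculation in this section, where $log(20) = 1.301$.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = count / document length, idf = log10(N / df)
            term: (count / len(doc)) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized and lowercased
toy_docs = [["i", "like", "nlp"], ["i", "like", "pizza"], ["nlp", "is", "fun"]]
toy_weights = tf_idf(toy_docs)
print(toy_weights[1])
```

In this toy corpus "pizza" appears in a single document, so it gets a larger weight than "i" or "like", which appear in two. A term that occurs in every document gets a weight of exactly zero because $log(N/N) = 0$.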

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       <b>Calculation for example sentence</b>: I like NLP <br><br>
       $tf_{NLP, j} = \frac{\text{Number of times the term NLP occurs in the message}}{\text{Number of words in the message}}$<br><br>
       $tf_{NLP, j} = \frac{1}{3} = 0.333$<br><br>
       Let's assume the total number of messages (documents) is $N=20$<br><br>
       Number of documents containing the word NLP, $df_{NLP}=1$<br><br>
       Using the base-10 logarithm: $w_{NLP, j}=tf_{NLP, j}\times \log_{10}(\frac{N}{df_{NLP}}) = 0.333 \times \log_{10}(\frac{20}{1}) = 0.333 \times 1.301 \approx 0.43$
   </font>
</p>
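
The arithmetic of the worked example can be checked directly. The sketch below assumes a base-10 logarithm (which is what yields 1.301 for $N/df = 20$) and simple whitespace tokenization; the variable names are illustrative only.

```python
import math

tokens = "I like NLP".lower().split()          # ['i', 'like', 'nlp']
tf_nlp = tokens.count("nlp") / len(tokens)     # 1 occurrence / 3 words

n_docs = 20    # N: corpus size assumed in the example
df_nlp = 1     # number of documents containing "NLP"
idf_nlp = math.log10(n_docs / df_nlp)

w_nlp = tf_nlp * idf_nlp
print(round(tf_nlp, 3), round(idf_nlp, 3), round(w_nlp, 2))
# prints: 0.333 1.301 0.43
```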


### Importing Packages

In [2]:
import pandas as pd
import re
import string
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

### Loading the data

In [5]:
ps = nltk.PorterStemmer()

nltk.data.path.append(r"D:\Artificial_Intelligence\nat_lang_proc\nltk_data")
stopwords = nltk.corpus.stopwords.words("english")

data_df = pd.read_csv(r"D:/Artificial_Intelligence/nat_lang_proc/data/SMSSpamCollection.tsv", delimiter="\t", header=None)
data_df.columns = ["labels", "body_text"]
data_df.head()

Unnamed: 0,labels,body_text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


### Cleaning the Data

In [6]:
def clean_data(text):
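    # Lowercase the text and drop punctuation characters
    without_punc = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split into word tokens (the raw string avoids an invalid-escape warning for \w)
    tokenized_text = re.findall(r"\w+", without_punc)
    # Remove English stopwords and reduce each remaining token to its stem
    stemmed_tokens = [ps.stem(word) for word in tokenized_text if word not in stopwords]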
    return stemmed_tokens
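
To see what the cleaning steps do to a single message, here is a self-contained sketch that runs without NLTK. It is a simplification, not the function above: `demo_stopwords` is a hypothetical three-word stand-in for the NLTK English stopword list, and the stemming step is left out, so the tokens keep their full form.

```python
import re
import string

# Hypothetical stand-in for the NLTK English stopword list
demo_stopwords = {"in", "a", "to"}

def clean_data_sketch(text, stopwords=demo_stopwords):
    # Lowercase the text and strip punctuation characters
    without_punc = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split into word tokens
    tokens = re.findall(r"\w+", without_punc)
    # Drop stopwords (no stemming in this sketch)
    return [tok for tok in tokens if tok not in stopwords]

demo_tokens = clean_data_sketch("Free entry in 2 a wkly comp to win FA Cup final!")
print(demo_tokens)
# prints: ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final']
```

Because punctuation is removed before tokenizing, contractions collapse into a single token: "don't" becomes "dont", which is why such tokens appear among the feature names.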

### TF-IDF Implementation

In [8]:
tfidf_vect = TfidfVectorizer(analyzer=clean_data)
X_tfidf = tfidf_vect.fit_transform(data_df["body_text"])
print(X_tfidf.shape)
print(tfidf_vect.get_feature_names_out())

(5568, 8106)
['0' '008704050406' '0089mi' ... 'ü' 'üll' '〨ud']


#### Sample Data

In [10]:
data_sample = data_df[:20]

In [11]:
tfidf_vect_sample = TfidfVectorizer(analyzer=clean_data)
X_tfidf_sample = tfidf_vect_sample.fit_transform(data_sample["body_text"])
print(X_tfidf_sample.shape)
print(tfidf_vect_sample.get_feature_names_out())

(20, 201)
['08002986030' '08452810075over18' '09061701461' '1' '100' '100000' '11'
 '12' '150pday' '16' '2' '20000' '2005' '21st' '3' '4' '4403ldnw1a7rw18'
 '4txtú120' '6day' '81010' '87077' '87121' '87575' '9' '900' 'aft' 'aid'
 'alreadi' 'anymor' 'appli' 'ard' 'around' 'b' 'bless' 'breather'
 'brother' 'call' 'caller' 'callertun' 'camera' 'cash' 'chanc' 'claim'
 'click' 'co' 'code' 'colour' 'comin' 'comp' 'copi' 'cost' 'credit' 'cri'
 'csh11' 'cup' 'custom' 'da' 'date' 'dont' 'eg' 'eh' 'england' 'enough'
 'entitl' 'entri' 'even' 'fa' 'feel' 'final' 'fine' 'finish' 'first'
 'free' 'friend' 'fulfil' 'go' 'goalsteam' 'goe' 'gonna' 'gota' 'grant'
 'ha' 'help' 'hl' 'home' 'hour' 'httpwap' 'im' 'info' 'ive' 'jackpot'
 'joke' 'k' 'kim' 'kl341' 'lar' 'latest' 'lccltd' 'like' 'link' 'live'
 'lor' 'lunch' 'macedonia' 'make' 'may' 'mell' 'membership' 'messag'
 'minnaminungint' 'miss' 'mobil' 'month' 'nah' 'name' 'nation' 'naughti'
 'network' 'news' 'next' 'nurungu' 'oh' 'oru' 'patent' 'pay' 'pe

In [14]:
sample_data_df = pd.DataFrame(X_tfidf_sample.toarray())
sample_data_df.columns = tfidf_vect_sample.get_feature_names_out()
sample_data_df.head()

Unnamed: 0,08002986030,08452810075over18,09061701461,1,100,100000,11,12,150pday,16,...,winner,wkli,wonder,wont,word,wwwdbuknet,xxxmobilemovieclub,xxxmobilemovieclubcomnqjkgighjjgcbl,ye,ü
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.238737,0.238737,0.209853,0.0,0.0,0.0,0.0,0.0
1,0.0,0.198986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.198986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
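
Most cells in this matrix are zero, so it is often easier to read it one message at a time by listing the terms with the largest weights. The sketch below shows one way to do that on a hypothetical miniature of `sample_data_df`; the weights are made-up values for illustration, and `top_terms` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical miniature of sample_data_df: rows are messages, columns are terms
toy_weights_df = pd.DataFrame(
    [[0.00, 0.24, 0.21, 0.00],
     [0.20, 0.00, 0.00, 0.25]],
    columns=["wkli", "wonder", "word", "winner"],
)

def top_terms(row, k=2):
    # Keep only the non-zero weights, then take the k largest
    nonzero = row[row > 0]
    return nonzero.nlargest(k).index.tolist()

top_per_message = {idx: top_terms(row) for idx, row in toy_weights_df.iterrows()}
print(top_per_message)
# prints: {0: ['wonder', 'word'], 1: ['winner', 'wkli']}
```

The same helper could be applied to the rows of `sample_data_df` to inspect which stemmed tokens dominate each of the 20 sample messages.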
