**<h1 style="color:cyan" align="center">Natural Language Toolkit - NLTK</h1>**



**<h1 style="color:cyan" align="center">Email Spam Filtering</h1>**



Install NLTK from a terminal: pip install nltk

In [1]:
import nltk
import pandas as pd 
# nltk.download()  # opens the NLTK downloader; later cells use the "stopwords" and "wordnet" corpora

In [2]:
# Show up to 100 characters per column when pandas displays a DataFrame
pd.set_option("display.max_colwidth", 100)

**<h3 style="color:yellow" align="left">List the names available in the nltk package</h3>**

In [3]:
dir(nltk)

['ARLSTem',
 'ARLSTem2',
 'AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures',
 'BigramCollocationFinder',
 'BigramTagger',
 'BinaryMaxentFeatureEncoding',
 'BlanklineTokenizer',
 'BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'Boxer',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CFG',
 'CRFTagger',
 'CfgReadingCommand',
 'ChartParser',
 'ChunkParserI',
 'ChunkScore',
 'Cistem',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConcordanceIndex',
 'ConditionalExponentialClassifier',
 'ConditionalFreqDist',
 'ConditionalProbDist',
 'ConditionalProbDistI',
 'ConfusionMatrix',
 'ContextIndex',
 'ContextTagger',
 'ContingencyMeasures',
 'CoreNLPDependencyParser',
 'CoreNLPParser',
 'Counter',
 'CrossValidationProbDist',
 'DRS',
 'DecisionTreeClassifier',
 'DefaultTagger',
 'DependencyEvaluator',
 'DependencyG

**<h3 style="color:yellow" align="left">Tokenize a string (converting the sentences to words)</h3>**

In [4]:
# Example of tokenize
from nltk.tokenize import word_tokenize

input_text = "I am learning NLP with Mohamed and using NLTK!"
word_tokenized = word_tokenize(input_text)

print(input_text) # string
print(word_tokenized) # list of words ['I', 'am', 'learning', 'NLP', 'with', 'Mohamed', 'and', 'using', 'NLTK', '!']

I am learning NLP with Mohamed and using NLTK!
['I', 'am', 'learning', 'NLP', 'with', 'Mohamed', 'and', 'using', 'NLTK', '!']


In [5]:
# Use SMS Spam collection

# 1. Read file
with open("./datasets/01_NLTK/SMSSpamCollection") as f:
    raw_data = f.read()
# "\" is the escape character in Python strings, so the path is written with "/" as shown above
# real path C:\Users\Administrator\Desktop\All_Files_Together\Python\Gena_Codes\31_NLP\datasets\01_NLTK\SMSSpamCollection

# Read first 500 characters
print(raw_data[0:500]) # includes \t \n

ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it's been 3 week's now and no word bac


In [6]:
# Split the string by \t \n
parsed_data = raw_data.replace("\t", "\n").split("\n")
print(parsed_data[:10])

['ham', 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'ham', 'Ok lar... Joking wif u oni...', 'spam', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'ham', 'U dun say so early hor... U c already then say...', 'ham', "Nah I don't think he goes to usf, he lives around here though"]


In [7]:
# Split the big list into two lists (label list, message list)

label_list = parsed_data[::2] # Starting from 0 to end step 2
message_list = parsed_data[1::2] # starting from 1 to end step 2

# Show sample data
print(label_list[0:5])
print(message_list[0:5])

# Show lengths --> the two lists differ by one element
print(len(label_list)) # 5575
print(len(message_list)) # 5574

print(label_list[-3:])

# Delete the last element of the label list: it is an empty string left over from the file's trailing newline
label_list.pop()

# Confirm the length of the lists
print(len(label_list)) # 5574
print(len(message_list)) # 5574
print(label_list[-3])

['ham', 'ham', 'spam', 'ham', 'ham']
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...', "Nah I don't think he goes to usf, he lives around here though"]
5575
5574
['ham', 'ham', '']
5574
5574
ham
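
The slicing above works because every line of the file contains exactly one tab. As a more defensive sketch (the sample text and the names sample, labels and messages are made up for illustration), each line can instead be split once on its first tab, which also avoids the empty element caused by the trailing newline:

```python
# Made-up sample in the same "label<TAB>message" layout as the SMS file
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

labels = []
messages = []
for line in sample.splitlines():
    if not line:
        continue  # skip blank lines instead of popping them afterwards
    label, message = line.split("\t", 1)  # split on the first tab only
    labels.append(label)
    messages.append(message)

print(labels)    # ['ham', 'spam']
print(messages)  # ['Ok lar... Joking wif u oni...', 'Free entry in 2 a wkly comp']
```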


**<h3 style="color:yellow" align="left">Create a Pandas DataFrame</h3>**

In [8]:
# Create Data Frame from two lists
combined_df = pd.DataFrame(
    {
        "label": label_list,
        "sms": message_list
    }
)
combined_df.head()

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [9]:
# Alternative - Read file using Pandas
dataset = pd.read_csv("./datasets/01_NLTK/SMSSpamCollection", sep = "\t", header = None)
dataset.head()


Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [10]:
# Add columns header manually
dataset.columns = ["label", "sms"]
dataset.head()

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


**<h3 style="color:yellow" align="left">Exploring Raw Datasets</h3>**

- How many rows are in the dataset?
- How many hams and spams are there?
- Is there any missing data in the columns?

In [11]:
# Shape of the dataset
#############################

print(f"Size of the dataset rows: {len(dataset)} and columns: {len(dataset.columns)}" )

Size of the dataset rows: 5572 and columns: 2
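
The row count of 5572 is lower than the 5574 messages found with the manual split. One likely cause is quote handling: by default, a field that starts with a double quote runs until the closing quote, even across line breaks. The sketch below shows the effect with Python's csv module on made-up text. pd.read_csv accepts the same quoting=csv.QUOTE_NONE option, though whether that recovers every row of this file is an assumption to check.

```python
import csv
import io

# Made-up sample: the first message starts with a double quote that closes on the next line
sample = 'ham\t"Quote at the start\nspam\tWin a prize"\nham\tSee you later\n'

# Default quoting: the quoted field swallows the line break, so two lines merge into one row
default_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# QUOTE_NONE: quote characters are ordinary text, so every line stays its own row
plain_rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

print(len(default_rows))  # 2
print(len(plain_rows))    # 3
```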


In [12]:
# HAMs & SPAMs
#############################
ham_rows = dataset[dataset["label"] == "ham"]
spam_rows = dataset[dataset["label"] == "spam"]

print(f"HAM Rows: {len(ham_rows)}")
print(f"SPAM Rows: {len(spam_rows)}")

HAM Rows: 4825
SPAM Rows: 747


In [13]:
# Missing Data
##############################
sum_of_missing_labels = dataset["label"].isnull().sum()
sum_of_missing_sms = dataset["sms"].isnull().sum()

print(f"Missing labels: {sum_of_missing_labels}")
print(f"Missing SMS: {sum_of_missing_sms}")

Missing labels: 0
Missing SMS: 0


**<h3 style="color:yellow" align="left">Data Preparation</h3>**

* Punctuation removal
* Tokenizing and lowercasing
* Stop word removal (stop words are words that are not needed or carry little semantic value). **NLTK has a built-in stop word list!**


**<h4 style="color:green" align="left">Punctuation Removal</h4>**

In [14]:
import string
print(string.punctuation)

def remove_punctuation(txt):
    txt_nonpunct = "".join([char for char in txt if char not in string.punctuation])
    return txt_nonpunct

dataset["sms_nonpunct"] = dataset["sms"].apply(lambda x:remove_punctuation(x))
dataset.head()

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Unnamed: 0,label,sms,sms_nonpunct
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
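
As a sketch of an equivalent approach, str.translate with a table built by str.maketrans removes the same characters in a single pass. The names PUNCT_TABLE and remove_punctuation_translate are made up here:

```python
import string

# Translation table that maps every punctuation character to None (i.e. deletes it)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(txt):
    return txt.translate(PUNCT_TABLE)

cleaned = remove_punctuation_translate("Ok lar... Joking wif u oni...")
print(cleaned)  # Ok lar Joking wif u oni
```

Like the original function, it can be passed to apply on the sms column.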


**<h4 style="color:green" align="left">Tokenize (using REGEX) and Lowering</h4>**

In [15]:
import re

def tokenize(txt): 
    tokens = re.split(r"\W+", txt) # split on runs of non-word characters; "+" means one or more characters
    return tokens

dataset["sms_tokens"] = dataset["sms_nonpunct"].apply(lambda x: tokenize(x.lower()))
dataset.head()

Unnamed: 0,label,sms,sms_nonpunct,sms_tokens
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
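
One detail of re.split is worth knowing: when the text starts or ends with a non-word character, the result contains empty strings. A small sketch on a made-up string, with re.findall as one possible alternative that returns only the words:

```python
import re

text = "ok lar joking wif u oni "  # note the trailing space

split_tokens = re.split(r"\W+", text)
found_tokens = re.findall(r"\w+", text)

print(split_tokens)  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni', '']
print(found_tokens)  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```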


**<h4 style="color:green" align="left">Stop Words Removal</h4>**

In [16]:
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords[0:10])
print(len(stopwords)) # 179 English stop words here (the count can differ between NLTK data versions)

def remove_stopwords(txt):
    txt_clean = [word for word in txt if word not in stopwords]
    return txt_clean

dataset["sms_nonstopw"] = dataset["sms_tokens"].apply(lambda x: remove_stopwords(x))
dataset.head()

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
179


Unnamed: 0,label,sms,sms_nonpunct,sms_tokens,sms_nonstopw
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
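
Checking each word against a list scans the list from the start, so for larger datasets a set is a common choice. A minimal sketch, using a tiny made-up stop word set in place of NLTK's list:

```python
# Tiny stand-in for the NLTK stop word list (assumption for illustration only)
stop_set = {"i", "he", "to", "here", "so", "then"}

def remove_stopwords_set(tokens):
    # Set membership is checked in constant time on average
    return [word for word in tokens if word not in stop_set]

tokens = ["nah", "i", "dont", "think", "he", "goes", "to", "usf",
          "he", "lives", "around", "here", "though"]
kept = remove_stopwords_set(tokens)
print(kept)
# ['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though']
```

With the real data, the set could be built once with set(stopwords).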


In [17]:
# TODO: I need more explanation about lambda!!!
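
On the lambda question: a lambda is an anonymous function consisting of a single expression. A lambda that only forwards its argument to a function gives the same result as passing that function directly. The sketch below uses the built-in map, which, like apply, calls the function once per element; the names shout and words are made up:

```python
def shout(text):
    return text.upper()

words = ["ham", "spam"]

# A lambda that only forwards its argument...
with_lambda = list(map(lambda x: shout(x), words))
# ...is equivalent to passing the function directly
without_lambda = list(map(shout, words))
# A lambda earns its place when an extra step is needed around the call
with_extra_step = list(map(lambda x: shout(x) + "!", words))

print(with_lambda)      # ['HAM', 'SPAM']
print(without_lambda)   # ['HAM', 'SPAM']
print(with_extra_step)  # ['HAM!', 'SPAM!']
```

In the cells above, dataset["sms"].apply(remove_punctuation) would therefore behave like the lambda version, while the tokenize cell needs its lambda because it lowercases the text before calling the function.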

**<h1 style="color:yellow" align="left">Stemming</h1>**


Stemming is the final preprocessing step. It reduces each word to its stem, so that related forms such as "coders" and "coder" become the same token.


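**<h4 style="color:green" align="left">1. Option - Stemming</h4>**
A stemmer finds an approximate root of each word by applying fixed suffix rules.


It is quick but not very precise: the result is not always a real word (for example, "crazy" becomes "crazi").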

In [18]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

print(ps.stem("coder"))
print(ps.stem("coders"))

print(ps.stem("coding"))
print(ps.stem("code"))

coder
coder
code
code
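
To see why rule-based stemming is quick but imprecise, here is a deliberately crude suffix stripper. It is not the Porter algorithm, only a made-up illustration of the idea that fixed rules are applied without any dictionary lookup:

```python
# Made-up rules for illustration: (suffix, replacement), checked in order
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in RULES:
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["coders", "coding", "goes", "is"]:
    print(w, "->", toy_stem(w))
# coders -> coder
# coding -> cod
# goes -> goe
# is -> is
```

The non-words "cod" and "goe" show the trade-off: no lookup is needed, but nothing checks that the result is a real word.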


In [19]:
# On our dataset

def stemming(txt):
    text_stem = [ps.stem(word) for word in txt]
    return text_stem

dataset['stemed'] = dataset['sms_nonstopw'].apply(lambda x: stemming(x))
dataset.head()

Unnamed: 0,label,sms,sms_nonpunct,sms_tokens,sms_nonstopw,stemed
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazi, avail, bugi, n, great, world, la, e, buffet, cine, got, amor, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"


**<h4 style="color:green" align="left">2. Option - Lemmatisation</h4>**


- More accurate than stemming
- Takes more time than stemming


Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In [20]:
# Lemmatizer (needs the WordNet corpus; words are treated as nouns by default, so "coding" stays unchanged)
wn = nltk.WordNetLemmatizer()

print(wn.lemmatize("coding"))
print(wn.lemmatize("code"))
print(wn.lemmatize("coder"))
print(wn.lemmatize("coders"))

coding
code
coder
coder


In [21]:
def lemmatize(txt):
    txt_lemmas = [wn.lemmatize(word) for word in txt]
    return txt_lemmas

dataset["lemmatized"] = dataset["sms_nonstopw"].apply(lambda x: lemmatize(x))
dataset.head() 

Unnamed: 0,label,sms,sms_nonpunct,sms_tokens,sms_nonstopw,stemed,lemmatized
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazi, avail, bugi, n, great, world, la, e, buffet, cine, got, amor, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]","[nah, dont, think, go, usf, life, around, though]"


**<h3 style="color:yellow" align="left">Standardization</h3>**

Some texts contain abbreviations; standardization replaces them with their full form.






**Python String join() Method**

The join() method takes all items in an iterable and joins them into one string.

A string must be specified as the separator.


string.join(iterable)
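
A short example of the call shape, with the separator string on the left and the iterable as the argument:

```python
words = ["be", "right", "back"]

joined_with_space = " ".join(words)
joined_with_dash = "-".join(words)

print(joined_with_space)  # be right back
print(joined_with_dash)   # be-right-back
```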

In [22]:
import re

lookup_dict = {
    "nlp": "natural language processing", 
    "ur": "your",
    "brb": "be right back", 
    "asap": "as soon as possible", 
    "wbu": "what about you"
}

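def text_abbr(txt):
    # Collect only the expansions of known abbreviations; all other words are skipped
    new_words = []

    for word in txt.split():
        word = re.sub(r"[^\w\s]", "", word)  # strip punctuation such as "." or "?"
        if word.lower() in lookup_dict:
            new_words.append(lookup_dict[word.lower()])

    # Join once after the loop so new_text exists even when nothing matched
    new_text = " ".join(new_words)
    return new_text, new_words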

text = "I like nlp. This is ur choice, wbu?"
txt_str, new_word_list = text_abbr(text)

print(txt_str) # natural language processing your what about you
print(new_word_list)

natural language processing your what about you
['natural language processing', 'your', 'what about you']
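
The function above returns only the expanded abbreviations. If the goal is to keep the rest of the sentence as well, a variant could look like the sketch below; expand_abbreviations and its small lookup table are made-up names for illustration:

```python
import re

# Small lookup table, same idea as lookup_dict above
abbreviations = {
    "nlp": "natural language processing",
    "ur": "your",
    "wbu": "what about you",
}

def expand_abbreviations(txt):
    expanded = []
    for word in txt.split():
        bare = re.sub(r"[^\w\s]", "", word)  # drop punctuation attached to the word
        # Use the expansion when the word is a known abbreviation, else keep the word
        expanded.append(abbreviations.get(bare.lower(), bare))
    return " ".join(expanded)

result = expand_abbreviations("I like nlp. This is ur choice, wbu?")
print(result)
# I like natural language processing This is your choice what about you
```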


**<h1 style="color:pink" align="left">Vectorization (1.Option)</h1>**

Vectorization --> Converting words to numbers

- Count Vectorization
- Train the dictionary
- Get a list of words (dictionary)
- Get a DTM --> Document-Term Matrix

In [23]:
# pip install scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

# 3x rows --> or 3x documents 
# Document means Sentence
corpus = ["This is a sentence",
            "This is another sentence", 
            "Third document is there"]

# Learn the vocabulary of the corpus
x = cv.fit(corpus)

# Get the list of words (vocabulary); get_feature_names() was removed in scikit-learn 1.2
print(cv.get_feature_names_out().tolist())

# Get the words with their column index (key: word, value: index)
print(x.vocabulary_)
# vocabulary_ is a dictionary with the unique words as keys and their column indexes as values.
# Each document becomes one row vector; the value at a word's index is how often that word occurs in the document.

# Create a Document-Term Matrix (DTM)
x = cv.transform(corpus)

print(x.toarray())
print(x.shape) # (3, 7): 3 rows (documents), 7 columns (words); the one-letter word "a" is dropped by the default tokenizer

# Create a pandas DataFrame
df = pd.DataFrame(x.toarray(), columns = cv.get_feature_names_out())
df.head()


['another', 'document', 'is', 'sentence', 'there', 'third', 'this']
{'this': 6, 'is': 2, 'sentence': 3, 'another': 0, 'third': 5, 'document': 1, 'there': 4}
[[0 0 1 1 0 0 1]
 [1 0 1 1 0 0 1]
 [0 1 1 0 1 1 0]]
(3, 7)




Unnamed: 0,another,document,is,sentence,there,third,this
0,0,0,1,1,0,0,1
1,1,0,1,1,0,0,1
2,0,1,1,0,1,1,0
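
To make the steps concrete, the same matrix can be rebuilt with plain Python. This is a sketch of the idea, not scikit-learn's implementation; it assumes lowercasing and tokens of two or more word characters, which matches CountVectorizer's default settings:

```python
import re
from collections import Counter

corpus = ["This is a sentence",
          "This is another sentence",
          "Third document is there"]

def tokenize(doc):
    # Lowercase and keep tokens of at least two word characters (so "a" is dropped)
    return re.findall(r"\b\w\w+\b", doc.lower())

# 1. Learn the vocabulary: every unique word, sorted, mapped to a column index
words = sorted({word for doc in corpus for word in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(words)}

# 2. Build the Document-Term Matrix: one row per document, one count per word
dtm = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    dtm.append([counts[word] for word in words])

print(words)
for row in dtm:
    print(row)
```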


In [24]:
# Using our Dataset
corpus = dataset["sms_nonpunct"]

# learn the dictionary of corpus
x = cv.fit(corpus)

# Create a Document-Term Matrix (DTM)
x = cv.transform(corpus)

print(x.shape) # (5572, 9544): 5572 rows (documents), 9544 columns (words)

# Create a pandas DataFrame
df = pd.DataFrame(x.toarray(), columns = cv.get_feature_names_out())
df.head(10).T

(5572, 9544)




Unnamed: 0,0,1,2,3,4,5,6,7,8,9
008704050406,0,0,0,0,0,0,0,0,0,0
0089my,0,0,0,0,0,0,0,0,0,0
0121,0,0,0,0,0,0,0,0,0,0
01223585236,0,0,0,0,0,0,0,0,0,0
01223585334,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
zoom,0,0,0,0,0,0,0,0,0,0
zouk,0,0,0,0,0,0,0,0,0,0
zyada,0,0,0,0,0,0,0,0,0,0
üll,0,0,0,0,0,0,0,0,0,0


**<h1 style="color:pink" align="left">TF-IDF (2.Option)</h1>**


TF-IDF stands for “Term Frequency — Inverse Document Frequency”. 

This is a technique to quantify words in a set of documents. 

We generally compute a score for each word to signify its importance in the document and corpus. 

This method is a widely used technique in Information Retrieval and Text Mining.

*************************************************************

**What does DataFrame.T mean in pandas?**


The T property transposes the index and columns of a DataFrame.

It is a shorthand for the transpose() method.

It reflects the DataFrame over its main diagonal, turning rows into columns and vice versa.


Syntax: DataFrame.T

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What is vocabulary_ for?**

vocabulary_ is a dictionary with the unique words as keys and their column indexes as values.

Each document is turned into one vector. For each word in the document, the entry at that word's index holds its count (CountVectorizer) or its TF-IDF weight (TfidfVectorizer).


Term frequency builds on the "bag of words" model, which simply counts how often each word or phrase occurs in a document and ignores word order.

The term frequency is the share of a document's words taken up by a term. If a term shows up three times in a 500-word document, it is scored as 3/500 = 0.006.

Alternatively, a logarithm may be used to dampen the raw count.

The inverse document frequency measures how rare a term is across the corpus: the number of documents is divided by the number of documents that contain the term, and a logarithm is applied. Multiplying it by the term frequency gives the score of a term in a given document.

TfidfVectorizer normalizes each document vector to unit length by default, so the scores fall between 0 and 1.

In short, the lowest weights go to words that are common to all documents, and the highest weights go to words that appear in only one document of the corpus.
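
In symbols, one common textbook form of the score is given below. The notation is an assumption for illustration: $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}_t$ is the number of documents containing $t$. Libraries often use smoothed variants of the IDF term.

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

For the term that occurs 3 times in a 500-word document, $\mathrm{tf} = 3/500 = 0.006$.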

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfid_vect = TfidfVectorizer()

# 3x rows --> or 3x documents
# Document means Sentence
corpus = ["This is a sentence",
            "This is another sentence", 
            "Third document is there"]

# Learn the vocabulary and the IDF weights of the corpus
x = tfid_vect.fit(corpus)

# Get the list of words (vocabulary); get_feature_names() was removed in scikit-learn 1.2
print(tfid_vect.get_feature_names_out().tolist())

# Get list of words with index (key:word, value: index)
print(x.vocabulary_)

# Create a Document-Term Matrix (DTM)
x = tfid_vect.transform(corpus)

# Create a pandas DataFrame
df = pd.DataFrame(x.toarray(), columns = tfid_vect.get_feature_names_out())
df.head().T


['another', 'document', 'is', 'sentence', 'there', 'third', 'this']
{'this': 6, 'is': 2, 'sentence': 3, 'another': 0, 'third': 5, 'document': 1, 'there': 4}




Unnamed: 0,0,1,2
another,0.0,0.631745,0.0
document,0.0,0.0,0.546454
is,0.481334,0.373119,0.322745
sentence,0.619805,0.480458,0.0
there,0.0,0.0,0.546454
third,0.0,0.0,0.546454
this,0.619805,0.480458,0.0
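
The weights in the table can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for TfidfVectorizer: raw counts as term frequency, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each document vector. It is an illustration of those formulas, not the library's code:

```python
import math
import re
from collections import Counter

corpus = ["This is a sentence",
          "This is another sentence",
          "Third document is there"]

def tokenize(doc):
    # Lowercase and keep tokens of at least two word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(doc)) for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
df = Counter(word for counts in docs for word in counts)

# Smoothed inverse document frequency
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

weights = []
for counts in docs:
    raw = {word: tf * idf[word] for word, tf in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))  # L2 norm
    weights.append({word: value / norm for word, value in raw.items()})

print(round(weights[0]["is"], 6))        # 0.481334
print(round(weights[0]["sentence"], 6))  # 0.619805
print(round(weights[2]["third"], 6))     # 0.546454
```

The word "is" occurs in every document, so it gets the lowest weight in each column, while words unique to one document get the highest.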
