<a href="https://colab.research.google.com/github/joezerr/Project/blob/main/tfidf_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**IMPORT THE LIBRARIES**

*First, import the required libraries and download the NLTK resources used in the later steps:*

In [None]:
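from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.tokenize import word_tokenize
import pandas as pd
import nltk

# Download the tokenizer models and the stopword lists (only needed once per environment).
# Recent NLTK releases may additionally require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')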

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True



*   NLTK (Natural Language Toolkit) is a Python library for natural language processing.
*   punkt is the NLTK tokenizer model that word_tokenize relies on to split text.
*   stopwords is the NLTK corpus of stopword lists for several languages; it is used here to filter out common words.
*   word_tokenize (from nltk.tokenize) splits a string into a list of tokens.
*   pandas is a data analysis library; here it holds the corpus in a data frame and displays it as a table.
*   TreebankWordDetokenizer (from nltk.tokenize.treebank) joins tokens back into a sentence.
*   TfidfVectorizer (from sklearn.feature_extraction.text) converts a collection of documents into a matrix of TF-IDF features.







Then, we **define the corpus**: twelve short sentences in Bahasa Indonesia.

In [None]:
corpus = ['Saya suka dengan makanan laut',
          'Saya suka dengan minuman yang manis',
          'Kemarin saya makan makanan laut',
          'Hari ini saya makan makanan khas Sunda',
          'Besok saya berencana makan makanan khas Betawi',
          'Kemarin saya makan roti',
          'Hari ini saya makan Pizza',
          'Besok saya berencana makan Burger',
          'Hari ini saya minum teh tawar',
          'Kemarin saya minum Coca Cola',
          'Besok saya berencana minum kopi',
          'Lusa saya berencana minum teh manis'
          ]


In [None]:
df = pd.DataFrame(corpus, columns = ['text'])
df.head()

Unnamed: 0,text
0,Saya suka dengan makanan laut
1,Saya suka dengan minuman yang manis
2,Kemarin saya makan makanan laut
3,Hari ini saya makan makanan khas Sunda
4,Besok saya berencana makan makanan khas Betawi


**TOKENIZATION**

*Next, apply nltk.word_tokenize to each row to split the sentence into a list of words (tokens):*

In [None]:
df['text'] = df['text'].apply(nltk.word_tokenize)
df.head()

Unnamed: 0,text
0,"[Saya, suka, dengan, makanan, laut]"
1,"[Saya, suka, dengan, minuman, yang, manis]"
2,"[Kemarin, saya, makan, makanan, laut]"
3,"[Hari, ini, saya, makan, makanan, khas, Sunda]"
4,"[Besok, saya, berencana, makan, makanan, khas,..."
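
To make the tokenization step concrete, here is a minimal sketch that uses a regular expression as a simplified stand-in for nltk.word_tokenize. The helper name `simple_tokenize` and the second, punctuated sentence are hypothetical and not part of the notebook; NLTK's real tokenizer applies more rules than this.

```python
import re

def simple_tokenize(text):
    # Simplified stand-in for nltk.word_tokenize (an assumption, not NLTK's rules):
    # a token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

# A sentence from the corpus: no punctuation, so this behaves like a whitespace split.
print(simple_tokenize("Saya suka dengan makanan laut"))

# A hypothetical sentence with punctuation: the comma and full stop become separate tokens.
print(simple_tokenize("Besok, saya makan roti."))
```

For the punctuation-free sentences in this corpus, the sketch yields the same token lists as the output above.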


**REMOVE STOPWORDS**

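*The next step is to remove stopwords. Since the corpus is in Bahasa Indonesia, we call stopwords.words('indonesian') to get the Indonesian stopword list. The comparison is case-sensitive, so the capitalized "Saya" in rows 0 and 1 is kept while the lowercase "saya" in the other rows is removed.*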

In [None]:
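# Use a set so that each membership test is fast.
stop_words = set(stopwords.words('indonesian'))
# Keep only the tokens that are not in the stopword list (case-sensitive comparison).
df['text'] = df['text'].apply(lambda tokens: [t for t in tokens if t not in stop_words])
df.head()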

Unnamed: 0,text
0,"[Saya, suka, makanan, laut]"
1,"[Saya, suka, minuman, manis]"
2,"[Kemarin, makan, makanan, laut]"
3,"[Hari, makan, makanan, khas, Sunda]"
4,"[Besok, berencana, makan, makanan, khas, Betawi]"
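
The output above still contains "Saya" because the membership test is case-sensitive. The sketch below contrasts that behaviour with a case-insensitive filter. The four-word `stop_words` set is a hypothetical stand-in for the full NLTK list, chosen from words the notebook output shows being removed.

```python
# Hypothetical mini stopword list; in the notebook the real list comes from
# stopwords.words('indonesian').
stop_words = {"saya", "dengan", "yang", "ini"}

tokens = ["Saya", "suka", "dengan", "makanan", "laut"]

# Case-sensitive filter, as in the notebook: "Saya" does not match "saya" and is kept.
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive filter: lowercase each token before the membership test.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['Saya', 'suka', 'makanan', 'laut']
print(case_insensitive)  # ['suka', 'makanan', 'laut']
```

Whether to lowercase before filtering is a design choice. Doing so here would also remove "Saya" from rows 0 and 1 and change their TF-IDF vectors.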


**DETOKENIZE**

*With the stopwords removed, we call TreebankWordDetokenizer().detokenize to join each list of tokens back into a single string, which is the input format TfidfVectorizer expects by default:*

In [None]:
df['text'] = df['text'].apply(TreebankWordDetokenizer().detokenize)
df.head()

Unnamed: 0,text
0,Saya suka makanan laut
1,Saya suka minuman manis
2,Kemarin makan makanan laut
3,Hari makan makanan khas Sunda
4,Besok berencana makan makanan khas Betawi
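
For token lists without punctuation, as in this corpus, detokenizing amounts to joining the tokens with single spaces. The sketch below shows that simplified case only; it is not a replacement for TreebankWordDetokenizer, which also reattaches punctuation to neighbouring words.

```python
# Token list for row 2 after stopword removal (taken from the output above).
tokens = ["Kemarin", "makan", "makanan", "laut"]

# Simplified detokenization: valid here only because no token is a punctuation mark.
sentence = " ".join(tokens)

print(sentence)  # Kemarin makan makanan laut
```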


**TRANSFORM THE 'text' COLUMN INTO VECTORS**

*The next step is to turn the 'text' column into TF-IDF vectors with TfidfVectorizer.fit_transform. The result is a sparse matrix with one row per document and one column per term:*

In [None]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(df['text'])

*Convert the vectors back into a data frame, transposed so that each row is a term and each column is a document, and display it as a table:*




In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0.
df = pd.DataFrame(vectors.toarray().T,
                  index=vectorizer.get_feature_names_out(),
                  columns=range(len(df)))

df



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
berencana,0.0,0.0,0.0,0.0,0.360983,0.0,0.0,0.443529,0.0,0.0,0.430414,0.369155
besok,0.0,0.0,0.0,0.0,0.402174,0.0,0.0,0.49414,0.0,0.0,0.479528,0.0
betawi,0.0,0.0,0.0,0.0,0.530128,0.0,0.0,0.0,0.0,0.0,0.0,0.0
burger,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.651353,0.0,0.0,0.0,0.0
coca,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.573615,0.0,0.0
cola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.573615,0.0,0.0
hari,0.0,0.0,0.0,0.431253,0.0,0.0,0.551336,0.0,0.455266,0.0,0.0,0.0
kemarin,0.0,0.0,0.524184,0.0,0.0,0.551336,0.0,0.0,0.0,0.435165,0.0,0.0
khas,0.0,0.0,0.0,0.488198,0.45528,0.0,0.0,0.0,0.0,0.0,0.0,0.0
kopi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.632092,0.0


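The table above is the end result of TF-IDF: each row is a term, each column is a document (0 to 11), and each cell holds the TF-IDF weight of that term in that document.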
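
To show where these numbers come from, the following sketch recomputes the weights for one document in plain Python. It assumes TfidfVectorizer's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector). The `docs` list is the stopword-filtered corpus written out by hand in lowercase, and a whitespace split stands in for the vectorizer's own tokenizer; both are assumptions made for illustration.

```python
import math
from collections import Counter

# Stopword-filtered corpus, lowercased by hand (TfidfVectorizer lowercases by default).
docs = [
    "saya suka makanan laut",
    "saya suka minuman manis",
    "kemarin makan makanan laut",
    "hari makan makanan khas sunda",
    "besok berencana makan makanan khas betawi",
    "kemarin makan roti",
    "hari makan pizza",
    "besok berencana makan burger",
    "hari minum teh tawar",
    "kemarin minum coca cola",
    "besok berencana minum kopi",
    "lusa berencana minum teh manis",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Term count times IDF, then scale the vector to unit Euclidean length.
    counts = Counter(tokens)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[5])  # document 5: "kemarin makan roti"
print({t: round(w, 4) for t, w in weights.items()})
```

Under these assumptions the weight for "kemarin" in document 5 comes out at about 0.5513, which agrees with the entry in column 5 of the table.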