# Natural Language Processing

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI). Its main goal is to help machines understand human language more accurately.

NLP therefore draws on linguistics and computer science to “teach” machines how to analyze sequences of words and extract meaning from them.

In [31]:
pancasila = """

Ketuhanan Yang Maha Esa
Kemanusiaan Yang Adil dan Beradab
Persatuan Indonesia
Kerakyatan Yang Dipimpin oleh Hikmat Kebijaksanaan, Dalam Permusyawaratan dan Perwakilan
Keadilan Sosial bagi seluruh Rakyat Indonesia

"""

## Tokenization

Tokenization is the process of splitting a long text into smaller units called tokens; further processing happens once the sentences have been turned into tokens. The process is also known as text segmentation or lexical analysis. In other words, tokenization breaks a sentence into its constituent words. This sounds simple, but it is harder in practice than it looks: for example, punctuation has to be separated from the word it is attached to. Words can be separated using punctuation or whitespace, so two words joined by a space are split at that space.
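To make the difference between splitting on whitespace and splitting on punctuation concrete, here is a minimal sketch that uses only the standard library. The regular expression is an illustrative assumption and is not how `word_tokenize` works internally.

```python
import re

sentence = "Kerakyatan Yang Dipimpin oleh Hikmat Kebijaksanaan, Dalam Permusyawaratan"

# Splitting on whitespace keeps the comma glued to the preceding word.
whitespace_tokens = sentence.split()

# Illustrative pattern: runs of word characters, or a single punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `'Kebijaksanaan,'` stays a single token; with the regex, the word and the comma become separate tokens.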

In [32]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
pancasila_tokenize = word_tokenize(pancasila)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [33]:
pancasila_tokenize

['Ketuhanan',
 'Yang',
 'Maha',
 'Esa',
 'Kemanusiaan',
 'Yang',
 'Adil',
 'dan',
 'Beradab',
 'Persatuan',
 'Indonesia',
 'Kerakyatan',
 'Yang',
 'Dipimpin',
 'oleh',
 'Hikmat',
 'Kebijaksanaan',
 ',',
 'Dalam',
 'Permusyawaratan',
 'dan',
 'Perwakilan',
 'Keadilan',
 'Sosial',
 'bagi',
 'seluruh',
 'Rakyat',
 'Indonesia']

## Stopwords

In [34]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

In [35]:
from nltk.corpus import stopwords

ina_stops = stopwords.words("indonesian")

In [36]:
ina_stops

['ada',
 'adalah',
 'adanya',
 'adapun',
 'agak',
 'agaknya',
 'agar',
 'akan',
 'akankah',
 'akhir',
 'akhiri',
 'akhirnya',
 'aku',
 'akulah',
 'amat',
 'amatlah',
 'anda',
 'andalah',
 'antar',
 'antara',
 'antaranya',
 'apa',
 'apaan',
 'apabila',
 'apakah',
 'apalagi',
 'apatah',
 'artinya',
 'asal',
 'asalkan',
 'atas',
 'atau',
 'ataukah',
 'ataupun',
 'awal',
 'awalnya',
 'bagai',
 'bagaikan',
 'bagaimana',
 'bagaimanakah',
 'bagaimanapun',
 'bagi',
 'bagian',
 'bahkan',
 'bahwa',
 'bahwasanya',
 'baik',
 'bakal',
 'bakalan',
 'balik',
 'banyak',
 'bapak',
 'baru',
 'bawah',
 'beberapa',
 'begini',
 'beginian',
 'beginikah',
 'beginilah',
 'begitu',
 'begitukah',
 'begitulah',
 'begitupun',
 'bekerja',
 'belakang',
 'belakangan',
 'belum',
 'belumlah',
 'benar',
 'benarkah',
 'benarlah',
 'berada',
 'berakhir',
 'berakhirlah',
 'berakhirnya',
 'berapa',
 'berapakah',
 'berapalah',
 'berapapun',
 'berarti',
 'berawal',
 'berbagai',
 'berdatangan',
 'beri',
 'berikan',
 'berikut'

In [37]:
type(pancasila_tokenize)

list

In [38]:
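# Keep only the tokens that are not in the Indonesian stop word list.
# Note: the membership test is case-sensitive and the stop word list is
# lowercase, so capitalized tokens such as 'Yang' stay in the result.
pancasila_after_stops = [kata for kata in pancasila_tokenize if kata not in ina_stops]

pancasila_after_stops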

['Ketuhanan',
 'Yang',
 'Maha',
 'Esa',
 'Kemanusiaan',
 'Yang',
 'Adil',
 'Beradab',
 'Persatuan',
 'Indonesia',
 'Kerakyatan',
 'Yang',
 'Dipimpin',
 'Hikmat',
 'Kebijaksanaan',
 ',',
 'Dalam',
 'Permusyawaratan',
 'Perwakilan',
 'Keadilan',
 'Sosial',
 'Rakyat',
 'Indonesia']
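
The filtered list above still contains `'Yang'`. The NLTK stop word list is lowercase, the tokens are capitalized, and the `in` check is case-sensitive. Below is a minimal sketch of a case-insensitive filter. It uses a small hypothetical stop word set instead of the NLTK list so that it runs on its own.

```python
# Hypothetical mini stop word set, for illustration only (not the NLTK list).
stop_words = {"yang", "dan", "oleh", "bagi"}

tokens = ["Ketuhanan", "Yang", "Maha", "Esa", "Adil", "dan", "Beradab"]

# Lowercase each token only for the comparison; keep the original casing in the result.
filtered = [token for token in tokens if token.lower() not in stop_words]

print(filtered)  # ['Ketuhanan', 'Maha', 'Esa', 'Adil', 'Beradab']
```

Whether to lowercase before filtering depends on the task: it removes more function words, but it also treats capitalized proper nouns and common words alike.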

In [39]:
!mkdir -p ~/.kaggle

import json
import os

# Read the Kaggle credentials from environment variables instead of
# hard-coding the username and API key in the notebook.
api = {"username": os.environ["KAGGLE_USERNAME"], "key": os.environ["KAGGLE_KEY"]}

with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
    json.dump(api, file)

!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d infamouscoder/mental-health-social-media

mental-health-social-media.zip: Skipping, found more recently modified local copy (use --force to force download)


In [40]:
!unzip /content/mental-health-social-media.zip

Archive:  /content/mental-health-social-media.zip
replace Mental-Health-Twitter.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: Mental-Health-Twitter.csv  


In [41]:
import pandas as pd
import matplotlib.pyplot as plt
import collections
import re

In [42]:
df = pd.read_csv("/content/Mental-Health-Twitter.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,It's just over 2 years since I was diagnosed w...,1013187241,84,211,251,837,0,1
1,1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,"It's Sunday, I need a break, so I'm planning t...",1013187241,84,211,251,837,1,1
2,2,637749345908051968,Sat Aug 29 22:11:07 +0000 2015,Awake but tired. I need to sleep but my brain ...,1013187241,84,211,251,837,0,1
3,3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,RT @SewHQ: #Retro bears make perfect gifts and...,1013187241,84,211,251,837,2,1
4,4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,It’s hard to say whether packing lists are mak...,1013187241,84,211,251,837,1,1


In [43]:
df["post_text"][0]

"It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since."

In [44]:
doc1 = 'Game of Thrones is an amazing tv series!'
l_doc1 = re.sub(r"[^a-zA-Z0-9]", " ", doc1.lower()).split()
l_doc1

['game', 'of', 'thrones', 'is', 'an', 'amazing', 'tv', 'series']

In [45]:
text = df[["post_text"]].copy()  # explicit copy, so adding a column does not trigger SettingWithCopyWarning

In [46]:
# Lowercase, replace every non-alphanumeric character with a space,
# then collapse repeated whitespace by splitting and re-joining.
text["text_cleaned"] = df["post_text"].apply(
    lambda x: " ".join(re.sub(r"[^a-zA-Z0-9]", " ", x.lower()).split()))


In [47]:
text

Unnamed: 0,post_text,text_cleaned
0,It's just over 2 years since I was diagnosed w...,it s just over 2 years since i was diagnosed w...
1,"It's Sunday, I need a break, so I'm planning t...",it s sunday i need a break so i m planning to ...
2,Awake but tired. I need to sleep but my brain ...,awake but tired i need to sleep but my brain h...
3,RT @SewHQ: #Retro bears make perfect gifts and...,rt sewhq retro bears make perfect gifts and ar...
4,It’s hard to say whether packing lists are mak...,it s hard to say whether packing lists are mak...
...,...,...
19995,A day without sunshine is like night.,a day without sunshine is like night
19996,"Boren's Laws: (1) When in charge, ponder. (2) ...",boren s laws 1 when in charge ponder 2 when in...
19997,The flow chart is a most thoroughly oversold p...,the flow chart is a most thoroughly oversold p...
19998,"Ships are safe in harbor, but they were never ...",ships are safe in harbor but they were never m...


In [48]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [49]:
count_vectorizer = CountVectorizer(max_features=128)
bow = count_vectorizer.fit_transform(text["text_cleaned"])

In [50]:
bow.toarray().shape

(20000, 128)
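
To show what a bag-of-words matrix contains, here is a minimal standard-library sketch that builds one by hand for two short, made-up documents. It is an illustration of the idea, not a reimplementation of `CountVectorizer`: by default `CountVectorizer` ignores single-character tokens such as "i" and "a", and `max_features=128` keeps only the 128 most frequent terms in the corpus.

```python
from collections import Counter

# Two made-up documents, used only for illustration.
docs = ["i need a break", "i need to sleep but my brain"]

# Vocabulary: every distinct word across the documents, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Each row corresponds to a document and each column to a vocabulary word, which is why the shape above is (number of documents, number of features).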

## Classify BOW

In [51]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() ## Create object for lemmatizer
example_words = ['history','formality','changes']
for w in example_words:
    print(lemmatizer.lemmatize(w))

history
formality
change


In [52]:
text_sport = "Liverpool held off a late charge from Tottenham after Mohamed Salah struck twice to win 2-1 in north London and lift themselves back into the Premier League top-four race. Jurgen Klopp side had suffered shock defeats to relegation-threatened pair Nottingham Forest and Leeds in their last two league outings but started fast against Spurs, with a sharp touch and finish into the bottom corner from in-form Salah giving them the lead on 11 minutes. Ivan Perisic had a header deflected onto the post by Liverpool goalkeeper Alisson and Ryan Sessegnon saw a penalty shout from a challenge by Trent Alexander-Arnold waved away as Spurs came to life - but the Reds struck again just before the break thanks to a gift from Eric Dier. The centre-back miscued a header towards his own goal and Salah (40) raced through to chip in his ninth goal in eight games. Tottenham were sent out early for the second half and - not for the first time this season - were better after the break, with Alisson again pushing a Perisic effort onto the woodwork before Harry Kane (70) fired home a brilliant strike when played in by sub Dejan Kulusevski. Rodrigo Bentancur went close with a couple of headers and Kane glanced wide as Spurs desperately sought another late goal but Liverpool clung on for their first away win in the Premier League this season to move up to eighth and within seven points of fourth-placed Spurs, with a game in hand."
text_medical = "There is growing popularity of the Hospital Incident Command System (HICS) as an organizational tool for hospital management in the COVID-19 pandemic. We specifically describe implementation of HICS at the Isfahan province reference hospital (Isabn-e-Maryam) during the COVID-19 pandemic and try to explore performance of it. Methods: To document the actions taken during the COVID-19 pandemic, standard, open-ended interviews were conducted with individuals occupying activated HICS leadership positions during the event. A checklist based on the job action sheets of the HICS was used for performance assessment. Results: With the onset of the pandemic, hospital director revised ICS structure that adheres to span of better control of COVID-19. Methods of expanding hospital inpatient capacity to enable surge capacity were considered. The highest performance score was in the field of planning. Performance was intermediate in Financial/Administration section and good in other fields. Discussion: In the current COVID-19 pandemic, establishing HICS with some consideration about long-standing events can help improve communication, resource use, staff and patient protection, and maintenance of roles."
text_finance = "According to Fidelity’s Financial Resolutions survey, saving more money was the number one-resolution for respondents. Close to half (43%) said this was a goal they wanted to work toward in the new year. Building a nest egg can help you pay for big ticket items — like a house, a vacation, a wedding or even just an expensive item you really want — without taking on additional debt. Having savings can also come in handy if an emergency expense comes up. If you’re looking to increase your savings in the new year, it helps to start small — even if you’re only transferring $10 a week into your savings account. Starting small helps you build a muscle for saving. This way, when you receive salary bumps, bonuses and gift money, you’ve already gotten into the habit of saving, and you’ll be more likely to transfer that money to your savings account. You may also want to consider automating that process instead of just manually moving money into your savings. Relying on manual transfers leaves a lot of room for procrastination — and before we know it, we’ve spent the money we intended to save. But when you set up automatic transfers into your savings account, you take away the need to make that decision altogether. You can usually schedule automatic transfers through your bank’s mobile app. Lastly, if you want to see your savings grow just a little faster, you can opt for a high-yield savings account instead of a traditional savings account. High-yield savings accounts — like the Marcus by Goldman Sachs Online Savings Account or the Ally Online Savings Account — pay you more in interest each month compared to traditional savings accounts."

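texts = [text_sport, text_medical, text_finance]
english_stops = set(stopwords.words('english'))  # build the stop word set once
bow_keys = []
corpus_texts = []
for raw_text in texts:  # distinct names, so the loop does not overwrite `texts` or the DataFrame `text`
    words = word_tokenize(raw_text)
    # Drop English stop words, then lemmatize what is left.
    lemmas = [lemmatizer.lemmatize(word) for word in words if word not in english_stops]
    bow_keys += lemmas
    corpus_texts.append(' '.join(lemmas))
bow_keys = set(bow_keys)
print(bow_keys)       # vocabulary of the three reference texts
print(corpus_texts)   # cleaned reference texts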

{'giving', 'half', 'Fidelity', 'Incident', 'automatic', 'muscle', 'control', 'transferring', 'score', '43', 'Ryan', 'process', ')', 'top-four', 'really', 'bank', '$', 'pushing', 'enable', 'without', '2-1', 'take', 'way', 'Having', 'manually', 'action', 'like', 'shout', 'struck', '’', 'decision', 'handy', 'home', 'number', 'activated', 'based', 'app', 'Account', 'occupying', 'onset', 'Resolutions', 'survey', 'traditional', 'Tottenham', 'Goldman', 'week', 'game', 'north', 'miscued', 'saw', 'In', 'touch', 'move', 'hospital', 'Spurs', 'conducted', 'nest', '(', 'adheres', 'mobile', 'small', 'point', 'section', 'communication', 'bottom', 'transfer', 'span', 'maintenance', 'director', 'role', 'sent', 'sharp', 'onto', 'woodwork', ',', 'went', 'life', 'With', 'moving', 'Marcus', 'Salah', 'lead', 'Eric', 'fast', 'saving', 'one-resolution', 'fourth-placed', 'post', 'This', 'additional', 'Methods', 'bump', 'already', 'grow', 'build', 'specifically', 'wide', 'automating', 'field', 'lift', 'raced', 

In [53]:
len(bow_keys)

338

In [54]:
query_text = "Federal revenue for the period from January to September 2022 totalled approximately €256.7bn, up by 10.1% (about €23.6bn) on the year. Tax receipts (including EU own resources that are subtracted from the total) increased by 10.1% (about €22.0bn) on the year. Revenue from value added taxes rose by 22.2% (about €18.5bn), while receipts from income tax and corporation tax grew by 7.9% (about €8.9bn). Federal revenue fell as a result of a year-on-year increase of approximately €4.1bn in public transport subsidies to the Länder. These additional subsidies were used to offset revenue losses in the public transport sector and to finance the 9-euro ticket scheme (a temporary reduced-rate public transport ticket costing €9 per month in the months of June, July and August 2022)."
query_words = word_tokenize(query_text)
query_stops = set(stopwords.words('english'))  # build the stop word set once
query_words_clean = [lemmatizer.lemmatize(word) for word in query_words if word not in query_stops]
# Keep only the words that also occur in the reference vocabulary.
query_words_corpus = [word for word in query_words_clean if word in bow_keys]
query_text_corpus = ' '.join(query_words_corpus)
corpus_texts.append(query_text_corpus)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer() ## Creating Object for CountVectorizer
bow_vectors = cv.fit_transform(corpus_texts).toarray()
print(bow_vectors)
print(len(bow_vectors[0]))

[[0 1 0 ... 0 0 0]
 [0 0 5 ... 0 0 0]
 [1 0 0 ... 2 2 2]
 [0 0 0 ... 2 0 0]]
330


In [55]:
# Normalize the BoW vectors
bow_texts_norm = []
for bow in bow_vectors:
  length = (sum(i*i for i in bow)) ** 0.5
  bow_norm = bow / length
  bow_texts_norm.append(bow_norm)

# Compute similarity using dot product
similarity_vector = []
bow_norm_query = bow_texts_norm[3]
for bow in bow_texts_norm[:3]:
  similarity_vector.append(sum(i*j for i,j in zip(bow,bow_norm_query)))
print(similarity_vector)

# Find the highest similarity and map its index to a class label
labels = ["Sport", "Medical", "Finance"]
id_max_sim = similarity_vector.index(max(similarity_vector))
print("The query text is classified as:", labels[id_max_sim])

[0.0, 0.03214121732666125, 0.11158046066936295]
The query text is classified as: Finance
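
The cell above normalizes each vector to unit length and then takes dot products, which amounts to cosine similarity. As a sketch, the same idea can be written as a single helper. The toy vectors and labels below are made up for illustration and are not taken from the notebook's data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

# Toy class vectors and query (hypothetical values).
class_vectors = {"Sport": [2, 0, 1], "Finance": [0, 3, 1]}
query = [0, 1, 1]

scores = {label: cosine_similarity(vec, query) for label, vec in class_vectors.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

The zero-norm guard matters in practice: if a query shares no words with the vocabulary, its vector is all zeros and dividing by its length would fail.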


In [56]:
similarity_vector

[0.0, 0.03214121732666125, 0.11158046066936295]

In [57]:
max(similarity_vector)

0.11158046066936295

In [58]:
similarity_vector.index(0.11158046066936295)

2

In [59]:
var_max = max(similarity_vector)
similarity_vector.index(var_max)

2

In [60]:
# test class 1 = [0.3 0.7]  = 0.7
# test class 2 = [0.2 0.8]  = 0.8


