# **Part-of-Speech (POS) Tagging of AloDokter App Reviews**





**Name**: Anak Agung Istri Istadewanti

**NRP**: 5026211143

**Class**: PBA (A)







---

**POS tagging**, or part-of-speech tagging, is the process of labeling each word in a sentence with its grammatical category, such as noun, verb, adjective, preposition, and so on. In this notebook, every word in the review text is assigned its grammatical category to support further understanding and analysis.
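
As a minimal illustration of the output format only, the sketch below tags a short Indonesian sentence using a tiny hand-written lexicon. `TOY_LEXICON` and `toy_pos_tag` are hypothetical names invented for this example; the analysis in this notebook relies on Polyglot's trained tagger, not on a lookup table.

```python
# Hypothetical toy lexicon, for illustration only (not Polyglot's model).
TOY_LEXICON = {
    "dokter": "NOUN",
    "sangat": "ADV",
    "membantu": "VERB",
    "dan": "CONJ",
    "ramah": "ADJ",
}


def toy_pos_tag(sentence):
    """Return (word, tag) pairs; unknown words get the catch-all tag 'X'."""
    return [(word, TOY_LEXICON.get(word, "X")) for word in sentence.lower().split()]


tagged = toy_pos_tag("Dokter sangat membantu dan ramah")
print(tagged)
```

A real tagger produces the same kind of `(word, tag)` pairs, but chooses each tag from context rather than from a fixed dictionary.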

The dataset consists of 3,000 reviews of the AloDokter app collected from the Google Play Store.


---

# Set Up

## Install Dependencies

In [3]:
# Install Dependencies
!pip install polyglot
!pip install pyicu
!pip install pycld2
!pip install morfessor
!pip install wordcloud
!pip install seaborn



In [2]:
# Download the Polyglot POS tagging model and embeddings for Indonesian
!polyglot download pos2.id
!polyglot download embeddings2.id

[polyglot_data] Downloading package pos2.id to /root/polyglot_data...
[polyglot_data] Downloading package embeddings2.id to
[polyglot_data]     /root/polyglot_data...


## Import Libraries and Load Data

In [4]:
import pandas as pd
from polyglot.text import Text
from collections import Counter
import re

In [5]:
# Upload Dataset
from google.colab import files
uploaded = files.upload()

Saving 1. dataset_df_alodokter_3000_reviews.csv to 1. dataset_df_alodokter_3000_reviews.csv


In [6]:
# Read the uploaded dataset
file_name = list(uploaded.keys())[0]
df_alodokter = pd.read_csv(file_name)
df_alodokter.head()

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion
0,51a79c74-ad74-4651-a7dd-109225ce5804,ahmad gunara,https://play-lh.googleusercontent.com/a/ACg8oc...,informasi dan rekomendasi lengkap dan mudah di...,5,0,6.8.0,2024-09-22 14:06:34,,,6.8.0
1,0ab111a0-22f7-4201-9af0-16ea16bd742e,Ilham Ramdan,https://play-lh.googleusercontent.com/a-/ALV-U...,baik' untuk kesehatan,5,0,6.8.0,2024-09-22 13:00:56,,,6.8.0
2,5bdbf5fd-0c98-4391-b478-590b3cec7ca4,Rohman Kamarru,https://play-lh.googleusercontent.com/a/ACg8oc...,"Terimakasih dok, sehat sehat ya, terimakasih a...",5,0,6.8.0,2024-09-22 12:32:10,,,6.8.0
3,56ea6b55-33da-4e6a-b260-5bb404d8bdb9,Tresno Vivo,https://play-lh.googleusercontent.com/a/ACg8oc...,sangat baik dan memuaskan,5,0,,2024-09-22 12:00:45,,,
4,8876e606-3a46-41d7-a148-0e46f70350f2,Asna Izaati,https://play-lh.googleusercontent.com/a/ACg8oc...,sangat baik,5,0,6.8.0,2024-09-22 07:44:07,,,6.8.0


# Preprocessing Data

Before POS tagging, the text can be preprocessed further if needed, for example by removing stopwords and cleaning the text. This notebook applies text cleaning only: lowercasing, removing punctuation, and collapsing extra whitespace.

In [7]:
def clean_text(text):
    # Convert to lowercase, remove special characters, and extra spaces
    text = re.sub(r'[^\w\s]', '', text.lower())
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply cleaning to the 'content' column of Alodokter reviews
df_alodokter['cleaned_content'] = df_alodokter['content'].apply(clean_text)
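
The effect of the cleaning step can be checked on a couple of strings taken from the reviews shown earlier. The sketch below repeats `clean_text` so that it runs on its own; `clean_text_safe` is a hypothetical variant, not part of the original notebook, that assumes non-string values (such as missing reviews) should become empty strings.

```python
import re


def clean_text(text):
    # Lowercase, drop punctuation, and collapse runs of whitespace
    text = re.sub(r'[^\w\s]', '', text.lower())
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def clean_text_safe(text):
    # Hypothetical guard: treat non-string values (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''
    return clean_text(text)


samples = ["baik' untuk kesehatan", "Terimakasih dok,  sehat sehat ya"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
print(repr(clean_text_safe(float('nan'))))
```

Note that `\w` also matches digits and underscores, so numbers in a review survive the cleaning step.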


In [8]:
df_alodokter

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion,cleaned_content
0,51a79c74-ad74-4651-a7dd-109225ce5804,ahmad gunara,https://play-lh.googleusercontent.com/a/ACg8oc...,informasi dan rekomendasi lengkap dan mudah di...,5,0,6.8.0,2024-09-22 14:06:34,,,6.8.0,informasi dan rekomendasi lengkap dan mudah di...
1,0ab111a0-22f7-4201-9af0-16ea16bd742e,Ilham Ramdan,https://play-lh.googleusercontent.com/a-/ALV-U...,baik' untuk kesehatan,5,0,6.8.0,2024-09-22 13:00:56,,,6.8.0,baik untuk kesehatan
2,5bdbf5fd-0c98-4391-b478-590b3cec7ca4,Rohman Kamarru,https://play-lh.googleusercontent.com/a/ACg8oc...,"Terimakasih dok, sehat sehat ya, terimakasih a...",5,0,6.8.0,2024-09-22 12:32:10,,,6.8.0,terimakasih dok sehat sehat ya terimakasih ata...
3,56ea6b55-33da-4e6a-b260-5bb404d8bdb9,Tresno Vivo,https://play-lh.googleusercontent.com/a/ACg8oc...,sangat baik dan memuaskan,5,0,,2024-09-22 12:00:45,,,,sangat baik dan memuaskan
4,8876e606-3a46-41d7-a148-0e46f70350f2,Asna Izaati,https://play-lh.googleusercontent.com/a/ACg8oc...,sangat baik,5,0,6.8.0,2024-09-22 07:44:07,,,6.8.0,sangat baik
...,...,...,...,...,...,...,...,...,...,...,...,...
2995,fac23284-40a7-45d9-b9eb-9b6ec7262794,Puspita R,https://play-lh.googleusercontent.com/a-/ALV-U...,Udah bayar mahal. Dokter online tp tidak meres...,1,2,5.7.0,2023-08-06 18:23:33,"Alo, Puspita Rv! Untuk informasi lebih lanjut ...",2023-08-07 08:43:59,5.7.0,udah bayar mahal dokter online tp tidak meresp...
2996,9299e206-7e06-4291-bfd3-c86b6219198a,Namaku Almeer,https://play-lh.googleusercontent.com/a/ACg8oc...,Sangat membantu ketika butuh jawaban segera,5,0,,2023-08-06 17:01:38,,,,sangat membantu ketika butuh jawaban segera
2997,ef79659a-2c9f-46a1-b7f3-eb2e8d739903,Luna Bafagih,https://play-lh.googleusercontent.com/a-/ALV-U...,Good app with good price,5,0,5.7.0,2023-08-06 12:30:32,,,5.7.0,good app with good price
2998,5d63bbec-9b62-48d4-8d65-425504b9fdb4,Jihan Salwa,https://play-lh.googleusercontent.com/a/ACg8oc...,Alhamdulillah resep obat yg diberikan dokter b...,5,4,5.7.0,2023-08-06 12:09:10,,,5.7.0,alhamdulillah resep obat yg diberikan dokter b...


# POS Tagging

We can run POS tagging with Polyglot on every review in the dataset.

In [12]:
import pandas as pd
from collections import Counter

# Store all POS tags and words
all_words = []
all_pos_tags = []
word_pos_map = {}

# Loop through each review in 'cleaned_content' column
for content in df_alodokter['cleaned_content']:
    if content.strip():  # Check if content is not empty
        text = Text(content, hint_language_code='id')  # 'id' indicates Indonesian
        pos_tags = text.pos_tags

        # Collect words and their corresponding POS tags
        for word, pos in pos_tags:
            word_lower = word.lower()
            all_words.append(word_lower)
            all_pos_tags.append(pos)

            # Group each word under its POS tag
            word_pos_map.setdefault(pos, []).append(word_lower)

# Calculate count of POS tags
pos_counts = Counter(all_pos_tags)

# Calculate unique tokens for each POS tag
unique_tokens_count = {pos: len(set(words)) for pos, words in word_pos_map.items()}

# Prepare data for the DataFrame
tag_data = []
for pos, count in pos_counts.items():
    unique_tokens = unique_tokens_count.get(pos, 0)
    tag_data.append({'Tag': pos, 'Count': count, 'Unique Tokens': unique_tokens})

# Create DataFrame
pos_df = pd.DataFrame(tag_data)

# Display the DataFrame
print(pos_df)

      Tag  Count  Unique Tokens
0    NOUN   9797           2204
1    CONJ   1220             38
2     ADJ   2451            274
3    VERB   3776            945
4     ADP   1626             77
5     ADV   3601            165
6       X    527             24
7     NUM    957            557
8    PRON    813             18
9     DET    455             27
10  PROPN   2450           1096
11  PUNCT      6              3
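
The two columns in the table above measure different things: `Count` is the number of tokens that received a tag, while `Unique Tokens` is the number of distinct words under that tag. The sketch below reproduces the same aggregation on a handful of made-up pairs; `toy_pairs` is hypothetical data standing in for the `(word, tag)` pairs that Polyglot returns.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) pairs, for illustration only.
toy_pairs = [
    ("dokter", "NOUN"), ("sangat", "ADV"), ("membantu", "VERB"),
    ("dokter", "NOUN"), ("respon", "NOUN"), ("sangat", "ADV"),
]

# Number of tokens per tag
tag_counts = Counter(tag for _, tag in toy_pairs)

# Set of distinct words per tag
words_by_tag = defaultdict(set)
for word, tag in toy_pairs:
    words_by_tag[tag].add(word)

summary = {
    tag: {"Count": count, "Unique Tokens": len(words_by_tag[tag])}
    for tag, count in tag_counts.items()
}
print(summary)
```

Here `NOUN` has three tokens but only two distinct words, because `dokter` occurs twice.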


In [13]:
# Save the POS tag counts to a CSV file
pos_df.to_csv('pos_tag_counts.csv', index=False)

## Counting Word and POS Frequencies

After collecting the words and their POS tags, we can count how often each word and each POS tag occurs in the dataset.

In [14]:
# Count word frequency
word_counts = Counter(all_words)

# Count POS tag frequency
pos_counts = Counter(all_pos_tags)

# Display the most common words and POS tags
print("Top 10 most common words:", word_counts.most_common(10))
print("Top 10 most common POS tags:", pos_counts.most_common(10))

Top 10 most common words: [('sangat', 901), ('dan', 712), ('dokter', 642), ('membantu', 538), ('di', 449), ('saya', 424), ('nya', 348), ('baik', 341), ('respon', 295), ('cepat', 271)]
Top 10 most common POS tags: [('NOUN', 9797), ('VERB', 3776), ('ADV', 3601), ('ADJ', 2451), ('PROPN', 2450), ('ADP', 1626), ('CONJ', 1220), ('NUM', 957), ('PRON', 813), ('X', 527)]
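
Raw counts are easier to compare when expressed as a share of all tagged tokens. The sketch below is one way to compute that, using the tag counts copied from the outputs above; the names `reported_counts` and `pos_share` are introduced only for this example.

```python
# Tag counts copied from the POS tagging output of this notebook.
reported_counts = {
    "NOUN": 9797, "VERB": 3776, "ADV": 3601, "ADJ": 2451, "PROPN": 2450,
    "ADP": 1626, "CONJ": 1220, "NUM": 957, "PRON": 813, "X": 527,
    "DET": 455, "PUNCT": 6,
}

total_tokens = sum(reported_counts.values())

# Relative frequency of each tag among all tagged tokens
pos_share = {tag: count / total_tokens for tag, count in reported_counts.items()}

for tag, share in sorted(pos_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:<6}{share:6.1%}")
```

With these counts, `NOUN` accounts for roughly 35% of the 27,679 tagged tokens.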


## Building a DataFrame of Words, POS Tags, and Frequencies

Next, we build a DataFrame containing each word, its POS tag, and its frequency. Each word is paired with the first tag it was assigned in the dataset.

In [16]:
# Create a DataFrame to store words, POS tags, and their frequencies
# Reuse the (word, tag) pairs collected during tagging instead of re-running
# Polyglot over every review for each unique word. Each word keeps the first
# tag it was seen with.
first_pos_tag = {}
for word, pos in zip(all_words, all_pos_tags):
    first_pos_tag.setdefault(word, pos)

word_info = [
    (word, first_pos_tag.get(word), freq)
    for word, freq in word_counts.items()
]

# Create DataFrame
word_info_df = pd.DataFrame(word_info, columns=['Word', 'POS Tag', 'Frequency'])

# Display the DataFrame
word_info_df.head()

Unnamed: 0,Word,POS Tag,Frequency
0,informasi,NOUN,21
1,dan,CONJ,712
2,rekomendasi,NOUN,7
3,lengkap,ADJ,17
4,mudah,ADJ,92


In [17]:
# Get the top 10 most frequent words
top_10_words = word_info_df.nlargest(10, 'Frequency')

# Display the top 10 words
print(top_10_words)

        Word POS Tag  Frequency
16    sangat     ADV        901
1        dan    CONJ        712
68    dokter    NOUN        642
58  membantu    VERB        538
78        di     ADP        449
82      saya    PRON        424
84       nya    NOUN        348
6       baik     ADJ        341
18    respon    NOUN        295
19     cepat     ADJ        271


In [18]:
# Save df to CSV file
output_file = 'words_pos_freq_alodokter.csv'
word_info_df.to_csv(output_file, index=False)