# **Find Frequently Used Words from My Personal Blog**

---
It has been about a month since the last post on my personal blog, "Untitled Door". Now I am in the mood to write but have no idea what to write or where to start. At the same time, I wondered what the most frequent word that appears on my blog was, so I decided to find out. 

What is it, do you think?

---



First, let's import the `pandas` package because we will work with dataframes a lot.

In [43]:
import pandas as pd

I thought of doing web-scraping on my blog, but since I still don't know how, I collected the data manually. After all, there are only 42 published posts. It is faster to do it manually than to insist on applying the concept that I'm not sure of yet. I'll just keep web scraping as my next project to do for now. 

I keep the data in tabular form, consisting of the information about my post's title `"title"`, its category `"category"`, its publish date `"publish_date"`, and its content `"blog_post"`.

In [44]:
df = pd.read_csv('untitledoor_blog_post.csv')
df.head()

Unnamed: 0,title,category,publish_date,blog_post
0,Kamu tidak perlu khawatir tentang perasaanku,untitledoorseries,6/20/2022,"Sekarang aku sadar, aku tidak mencintaimu deng..."
1,Perasaan yang ingin kuungkapkan,untitledoorseries,6/16/2022,Duduk berdampingan denganmu sudah cukup untuk ...
2,"Malam itu, saat dia pulang",untitledoorseries,6/14/2022,"Malam itu, saat hatiku sedikit tergerak padamu..."
3,"Bahkan meskipun kamu bukan siapa-siapa, kamu t...",untuk: aku,6/9/2022,"“Kalau kamu masih terbangun esok hari, apa yan..."
4,Ketika wanita dalam dekapan pria yang mencinta...,favorit,9/10/2021,“Satu orang istimewa sudah cukup rasanya untuk...


Let's see how many posts are in each category. 

In [45]:
df[['category']].value_counts()

category         
untitledoorseries    24
untuk: aku           10
favorit               8
dtype: int64

It seems my favorite `untitledoorseries` category holds the most of my blog contents. I actually write far more than that. It's just that the few pieces of writing that I produce don't always live up to my standards. 

Uh-hm. Ok, fine. It does sound like an excuse. 

NEXT.

I want to start right away by showing you the first content of my latest post in the blog, which is in the first row of the dataframe. But you might wonder, how can I know? Well, I'm the author of the blog, and I collected the data from the latest to the older post. So, I know.


However, we can also check if this is true or not by using a few lines of code. 

In [46]:
# first let's check our dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         42 non-null     object
 1   category      42 non-null     object
 2   publish_date  42 non-null     object
 3   blog_post     42 non-null     object
dtypes: object(4)
memory usage: 1.4+ KB


In [47]:
# The 'publish date' column still has an object type. 
# I will change it to datetime first, so the result won't give us a wrong answer.
df['publish_date'] = pd.to_datetime(df['publish_date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   title         42 non-null     object        
 1   category      42 non-null     object        
 2   publish_date  42 non-null     datetime64[ns]
 3   blog_post     42 non-null     object        
dtypes: datetime64[ns](1), object(3)
memory usage: 1.4+ KB


This line of code below gives you the row of dataframe `df` where the value from the `publish_date` column matches the maximum date (the latest post). 

In [48]:
# Let's see if I'm right
df[df.publish_date == df.publish_date.max()]

Unnamed: 0,title,category,publish_date,blog_post
0,Kamu tidak perlu khawatir tentang perasaanku,untitledoorseries,2022-06-20,"Sekarang aku sadar, aku tidak mencintaimu deng..."


As I said before, the latest post is in the first row of the dataframe. 

Now, let me work on one line of string at a time to figure out how to do it for the overall rows of blog posts. 

In [49]:
sample_line = df.loc[0, 'blog_post']
sample_line

'Sekarang aku sadar, aku tidak mencintaimu dengan maksud memilikimu. Aku mencintaimu, dan masih akan tetap mencintai kamu, karena aku tau bahwa dengan cara itu aku terus memperbaiki diri. Karena aku tau bahwa aku bisa menjadi aku yang lebih baik dalam banyak hal setiap hari. Dan karena itu, kamu tidak perlu khawatir untuk balik mencintaiku. Kamu berhak untuk jatuh cinta pada siapa saja. Atau memilih untuk tidak jatuh cinta pada siapa-siapa. Kamu berhak untuk memilih kekasih yang akan kamu bawa hidup bersama. Atau memilih sendiri karena ingin fokus mengembangkan hidupmu dulu saja. Kamu berhak atas itu semua, dan aku masih akan baik-baik saja. Karena aku pun tidak pernah berdoa agar kamu adalah satu-satunya. Karena aku sepercaya itu bahwa Tuhan lebih tau yang terbaik untuk segalanya. Dan karena itu, orang itu tidak harus kamu. Aku mencintaimu. Tentu. Tapi tentang apakah kamu juga mencintaiku atau tidak, itu di luar kuasaku. Akan kunyatakan perasaan saat kesempatan yang tepat itu datang. 

In [50]:
word_counts = dict()
word_list = sample_line.split()
for word in word_list:
  word_counts[word] = word_counts.get(word, 0) + 1
  
print(word_counts)

{'Sekarang': 1, 'aku': 10, 'sadar,': 1, 'tidak': 5, 'mencintaimu': 1, 'dengan': 3, 'maksud': 1, 'memilikimu.': 1, 'Aku': 2, 'mencintaimu,': 1, 'dan': 2, 'masih': 2, 'akan': 4, 'tetap': 1, 'mencintai': 1, 'kamu,': 1, 'karena': 4, 'tau': 3, 'bahwa': 3, 'cara': 2, 'itu': 8, 'terus': 1, 'memperbaiki': 1, 'diri.': 1, 'Karena': 3, 'bisa': 1, 'menjadi': 1, 'yang': 5, 'lebih': 2, 'baik': 1, 'dalam': 1, 'banyak': 1, 'hal': 1, 'setiap': 1, 'hari.': 1, 'Dan': 2, 'itu,': 2, 'kamu': 5, 'perlu': 1, 'khawatir': 1, 'untuk': 6, 'balik': 1, 'mencintaiku.': 1, 'Kamu': 3, 'berhak': 3, 'jatuh': 2, 'cinta': 2, 'pada': 2, 'siapa': 1, 'saja.': 3, 'Atau': 2, 'memilih': 4, 'siapa-siapa.': 1, 'kekasih': 1, 'bawa': 1, 'hidup': 1, 'bersama.': 1, 'sendiri': 1, 'ingin': 1, 'fokus': 1, 'mengembangkan': 1, 'hidupmu': 1, 'dulu': 1, 'atas': 1, 'semua,': 1, 'baik-baik': 1, 'pun': 1, 'pernah': 2, 'berdoa': 1, 'agar': 1, 'adalah': 1, 'satu-satunya.': 1, 'sepercaya': 1, 'Tuhan': 2, 'terbaik': 1, 'segalanya.': 1, 'orang': 1,

I use the split method to split a string into a list of words. Then I utilize a dictionary to keep records about how many of every word that it encounters along the way of iterating the list. 



However, if you are someone like me, who understands Bahasa Indonesia, you will notice several flaws in those simple lines of code. Therefore:

*   I would treat `"Sekarang"` and `"sekarang"` words as the same. Hence, I need to lowercase all of the words. 
*   `"mencintaimu."` and `"mencintaimu"` would also be the same word. Hence, I need to ignore every punctuation mark on the string.
*   Furthermore, `"mencintaimu"` is basically just a word "mencintai" with the suffix "-mu", which has the same meaning as `"mencintai kamu"`. So I would rewrite every word that ends with "-mu" into "word" + "kamu". However, to be sure I am not changing a wrong word in this formula, I would do an additional check to see if there is any word that I need to ignore for this matter.
*   The same rule also applies to the suffix "-ku", which can be rewritten into "word" + "aku", except for a few words as an exception.
*   There is also the word `"kunyatakan"`, which can be rewritten as `"aku nyatakan"`. It means every word with the prefix "ku-" would be rewritten as "aku" + "word", except for a few words as an exception.

To resolve the latter three problems above, I collect every word that is counted after lowering case and removing punctuation.

In [51]:
# lowercase every blog post
df["blog_post"] = df["blog_post"].apply(str.lower)
# remove all punctuation marks
df["blog_post"] = df["blog_post"].replace('[^\w\s]', '', regex=True)
# split every string in the blog_post column into a list of words
df["blog_post"] = df["blog_post"].apply(str.split)
# see changes
df["blog_post"].head()

0    [sekarang, aku, sadar, aku, tidak, mencintaimu...
1    [duduk, berdampingan, denganmu, sudah, cukup, ...
2    [malam, itu, saat, hatiku, sedikit, tergerak, ...
3    [kalau, kamu, masih, terbangun, esok, hari, ap...
4    [satu, orang, istimewa, sudah, cukup, rasanya,...
Name: blog_post, dtype: object

In [52]:
word_counts = dict()
for index, row in df.iterrows():
  for word in row["blog_post"]:
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)

{'sekarang': 32, 'aku': 504, 'sadar': 13, 'tidak': 267, 'mencintaimu': 28, 'dengan': 86, 'maksud': 2, 'memilikimu': 1, 'dan': 184, 'masih': 56, 'akan': 98, 'tetap': 26, 'mencintai': 7, 'kamu': 344, 'karena': 88, 'tau': 20, 'bahwa': 76, 'cara': 13, 'itu': 159, 'terus': 38, 'memperbaiki': 2, 'diri': 20, 'bisa': 80, 'menjadi': 36, 'yang': 484, 'lebih': 46, 'baik': 27, 'dalam': 47, 'banyak': 34, 'hal': 30, 'setiap': 10, 'hari': 50, 'perlu': 25, 'khawatir': 8, 'untuk': 113, 'balik': 4, 'mencintaiku': 3, 'berhak': 9, 'jatuh': 35, 'cinta': 35, 'pada': 44, 'siapa': 9, 'saja': 82, 'atau': 28, 'memilih': 11, 'siapasiapa': 4, 'kekasih': 1, 'bawa': 5, 'hidup': 45, 'bersama': 18, 'sendiri': 28, 'ingin': 43, 'fokus': 2, 'mengembangkan': 1, 'hidupmu': 8, 'dulu': 16, 'atas': 22, 'semua': 40, 'baikbaik': 15, 'pun': 39, 'pernah': 65, 'berdoa': 2, 'agar': 22, 'adalah': 43, 'satusatunya': 5, 'sepercaya': 1, 'tuhan': 31, 'terbaik': 10, 'segalanya': 8, 'orang': 83, 'harus': 25, 'tentu': 10, 'tapi': 85, 'ten

Now I will check every word with the affix '-mu', '-ku', and 'ku-' that needs to be set as an exception.

In [53]:
# word + '-mu'
suffix_mu = [key for key, value in word_counts.items() if key.endswith('mu')]
print(suffix_mu)

# exception:
# kamu, bertemu, tamu, temu

['mencintaimu', 'memilikimu', 'kamu', 'hidupmu', 'denganmu', 'ramahmu', 'sosokmu', 'melihatmu', 'menyapamu', 'padamu', 'bertemu', 'untukmu', 'dirimu', 'hadirmu', 'matamu', 'membuatmu', 'hariharimu', 'usahamu', 'memikirkanmu', 'kesalahanmu', 'hatimu', 'sisimu', 'sikapmu', 'memberimu', 'tentangmu', 'perlakuanmu', 'memahamimu', 'membawamu', 'meninggalkanmu', 'mendampingimu', 'mengajarkanmu', 'perasaanmu', 'baikmu', 'darimu', 'memperhatikanmu', 'memilihmu', 'rambutmu', 'senyummu', 'pendampingmu', 'tanganmu', 'mengantarkanmu', 'berkatmu', 'tempatmu', 'membelikanmu', 'sayangmu', 'menyayangimu', 'memarahimu', 'bersamamu', 'menyukaimu', 'dekatmu', 'posisimu', 'ambisimu', 'kesibukanmu', 'mengenalmu', 'tawamu', 'karenamu', 'termanismu', 'memintamu', 'melupakanmu', 'menjadikanmu', 'mengagumimu', 'menemuimu', 'wajahmu', 'barumu', 'tulusmu', 'kenanganmu', 'salahmu', 'menolakmu', 'perkataanmu', 'bagimu', 'menyikapimu', 'bahagiamu', 'lukamu', 'buatmu', 'mediamu', 'bayanganmu', 'suaramu', 'menatapmu',

In [54]:
# word + '-ku'
suffix_ku = [key for key, value in word_counts.items() if key.endswith('ku')]
print(suffix_ku)

# exception:
# aku, berlaku, kaku, saku

['aku', 'mencintaiku', 'kuasaku', 'menolakku', 'membuatku', 'pikiranku', 'hatiku', 'menyapaku', 'protesku', 'raguku', 'hariku', 'duniaku', 'tulisku', 'untukku', 'diriku', 'membantuku', 'bagiku', 'penglihatanku', 'menunjukkanku', 'mengajarkanku', 'mengingatkanku', 'caraku', 'hidupku', 'hadapanku', 'kemampuanku', 'berlaku', 'kebahagiaanku', 'ingatanku', 'jantungku', 'menyukaiku', 'mengagumiku', 'sampingku', 'milikku', 'rasaku', 'mataku', 'tubuhku', 'melihatku', 'mencegatku', 'bentakku', 'mengikutiku', 'napasku', 'penerbanganku', 'menarikku', 'usahaku', 'meninggalkanku', 'memandangku', 'benakku', 'ponselku', 'alamatku', 'maksudku', 'menasihatiku', 'memperlakukanku', 'padaku', 'mukaku', 'kebaikanku', 'kebodohanku', 'ucapanku', 'sumpahku', 'kepalaku', 'telingaku', 'pikirku', 'tanyaku', 'mengejutkanku', 'makananku', 'menatapku', 'diamku', 'mengantarku', 'kaku', 'kataku', 'celanaku', 'menghindarkanku', 'sebelahku', 'telunjukku', 'badanku', 'otakku', 'langkahku', 'tanganku', 'pekerjaanku', 'me

In [55]:
# 'ku-' + word
prefix_ku = [key for key, value in word_counts.items() if key.startswith('ku')]
print(prefix_ku)

# exception:
# kuasaku, kuliah, kurang, kuat, kursi, kue, kufur, kunci, kulit, kuasa, kursimu
# kuasaku == kuasa + aku
# kursimu == kursi + kamu
# FINAL EXCEPTION ===>> kuasa, kuliah, kurang, kuat, kursi, kue, kufur, kunci, kulit

['kuasaku', 'kunyatakan', 'kuanggap', 'kulalui', 'kuliah', 'kuidamkan', 'kuharapkan', 'kubuat', 'kurasakan', 'kutinggalkan', 'kurang', 'kukatakan', 'kuakui', 'kuhapus', 'kupikir', 'kubilang', 'kutemui', 'kusesali', 'kusayangi', 'kumau', 'kudapat', 'kusayang', 'kutengok', 'kupikirpikir', 'kuputuskan', 'kulakukan', 'kutempuh', 'kutahan', 'kuat', 'kuungkapkan', 'kulanjutkan', 'kutemukan', 'kumohon', 'kudengar', 'kumiliki', 'kujaga', 'kulihat', 'kupunya', 'kuharap', 'kusambungkan', 'kupegang', 'kupasang', 'kursi', 'kuabaikan', 'kubawa', 'kuperhatikan', 'kuteruskan', 'kukenali', 'kusuguhkan', 'kusimpan', 'kurasa', 'kue', 'kuambil', 'kubuka', 'kuterima', 'kufur', 'kunci', 'kulit', 'kuberanikan', 'kulirik', 'kursimu', 'kupastikan', 'kuberikan', 'kuasa']


I would save those words on separate lists now that I am aware of this.

In [56]:
not_suffix_mu = ["kamu", "bertemu", "tamu", "temu"]
not_suffix_ku = ["aku", "berlaku", "kaku", "saku"]
not_prefix_ku = ["kuasa", "kuliah", "kurang", "kuat", "kursi", "kue", "kufur", "kunci", "kulit"]

The exception words from the initial affix list will then be removed. Following that, the new lists will be given the names `final_suffix_mu`, `final_suffix_ku`, and `final_prefix_ku`.

In [57]:
final_suffix_mu = [word for word in suffix_mu if word not in not_suffix_mu]
final_suffix_ku = [word for word in suffix_ku if word not in not_suffix_ku]
final_prefix_ku = [word for word in suffix_mu if word not in not_suffix_mu]

I suddenly realize that my inconsistent word usage is another issue I might run into. I use the formal language when writing sometimes, but not always. I can use the words `"enggak"`, `"nggak"`, `"tak"` and `"gak"`, which have the same meaning as `"tidak"`, for example. It is simply translated as `"No"` in English. Another case is the similar meanings of the words `"tahu"` and `"tau"` in my writing. It's just that I never write about "tahu"==tofu, only "tau"=="tahu". So, I'll start by seeing if there are any of these "tidak" and "tahu" modifications.

In [58]:
suspect_list = ['enggak', 'nggak', 'tak', 'gak', 'tidak', 'tahu', 'tau']

for suspect in suspect_list:
  try:
    print(f"\"{suspect}\" : {word_counts[suspect]} times")
  except:
    print(f"\"{suspect}\" : 0")

"enggak" : 0
"nggak" : 0
"tak" : 55 times
"gak" : 15 times
"tidak" : 267 times
"tahu" : 53 times
"tau" : 20 times


Okay, now let's define a function `find_most_used_word` that will allow me to find the word that has been used often from the beginning of my blog posts to the most recent post by iterating over the list of words in each row of the dataframe.

In [59]:
def find_most_used_word(df, column_name):
  """
  This function can be used to determine the word that appears 
  the most frequently in a dataframe by counting all of its word occurrences.

  Input:
  - df: a dataframe that we work on
  - column_name:  a column name where a list of words exists in each row

  Output:
  - word_count_record: a dictionary that contains every unique word 
            and its occurence overall dataframe column
  - mostUsedWord: the most frequently used word
  - maxCount: how many times the mostUsedWord appears
  """

  # initialize the dictionary of word count record
  word_count_record = dict()

  # iterate over rows in a dataframe
  for index, row in df.iterrows():
    # iterate over words in a list of words that exist in the row.
    for word in row[column_name]:
      # to do when the word exists within the lists
      if word in final_suffix_mu:
        word_count_record[word[:-2]] = word_count_record.get(word[:-2], 0) + 1
        word_count_record["kamu"] = word_count_record.get("kamu", 0) + 1
      elif word in final_suffix_ku:
        word_count_record[word[:-2]] = word_count_record.get(word[:-2], 0) + 1
        word_count_record["aku"] = word_count_record.get("aku", 0) + 1
      elif word in final_prefix_ku:
        word_count_record[word[2:]] = word_count_record.get(word[2:], 0) + 1
        word_count_record["aku"] = word_count_record.get("aku", 0) + 1
      # to do when 'tidak' or 'tahu' modifications are found
      elif word == 'gak' or word == 'tak':
        word_count_record["tidak"] = word_count_record.get("tidak", 0) + 1
      elif word == 'tau':
        word_count_record["tahu"] = word_count_record.get("tahu", 0) + 1
      # to do with the remaining words
      else:
        word_count_record[word] = word_count_record.get(word, 0) + 1
  
  # Finding the most frequently used word
  mostUsedWord = None
  maxCount = None
  for word, count in word_count_record.items():
    if maxCount==None or count > maxCount:
      mostUsedWord = word
      maxCount = count

  return word_count_record, mostUsedWord, maxCount

Now is the time to find out what the word is that I use so frequently.

In [60]:
word_count_dict, word, count = find_most_used_word(df, 'blog_post')
print(f"The most frequently used word is '{word}' with {count} occurrences.")

The most frequently used word is 'aku' with 711 occurrences.


Wow. This discovery actually kind of surprised me. Since I always have a thought to write about because of "you", I assumed it would be `"kamu"` rather than `"aku."`

Anyway, in English, `"aku"` means "I" or "me," while `"kamu"` means "you."

I suppose the `untitledoorseries` category might employ the word `"kamu"` more frequently. Let's investigate to find out.

In [61]:
catList = list(df['category'].unique())
for cat in catList:
  dfCat = df.loc[df['category']==cat]
  _, mostUsedWord, maxCount = find_most_used_word(dfCat, 'blog_post')
  print(f"The most frequently used word in {cat} category is '{mostUsedWord}' with {maxCount} occurrences.")

The most frequently used word in untitledoorseries category is 'aku' with 499 occurrences.
The most frequently used word in untuk: aku category is 'yang' with 152 occurrences.
The most frequently used word in favorit category is 'aku' with 95 occurrences.


Nope! Even in this particular category, I still mention `"aku"` a.k.a. myself, a lot. In addition, the word `"yang"`, which is merely an auxiliary, is even mentioned the most in `"untuk:aku"` category.

Lastly, I will look at the top 10 most frequently used words in my blog posts because I'm still curious about how the word `"you"` ranks across the board.

In [62]:
df_word_count = pd.DataFrame.from_dict(word_count_dict, orient='index', columns=['Frequency'])
df_word_count_sorted = df_word_count.sort_values('Frequency', ascending=False)
df_word_count_sorted.head(10)

Unnamed: 0,Frequency
aku,711
kamu,627
yang,484
tidak,337
dan,184
itu,159
di,154
kita,139
untuk,128
dia,119


Finally, there you are. The second most frequently used word turns out to be `"kamu"`. The words `"kita"` ("we" in English) and `"dia"` ("he/she" in English") appear in the top 10, which I find more intriguing. Amazing! 

I can now use this interesting insight to publish a new blog article on my website, probably another `untitledoorseries`. After all, that was the original goal of this project.