My Chrome extension, which retrieves webpage information ethically, collects and accumulates all news headlines every day. I import the latest JSON file each day and run this notebook.

In [1]:
import json

import pandas as pd

In [2]:
# Open the file in a context manager so the handle is closed after loading
with open("jan2020_to_28-12-21.json", encoding="utf8") as f:
    data = json.load(f)

In [3]:
data

[{'title': 'Court throws out Rosmah’s bid to disqualify Sri Ram, nullify corruption trial',
  'url': 'https://www.freemalaysiatoday.com/category/nation/2021/12/06/court-throws-out-rosmahs-bid-to-disqualify-sri-ram-nullify-corruption-trial/'},
 {'title': 'DAP urges govt to clarify liquor licensing rules',
  'url': 'https://www.freemalaysiatoday.com/category/nation/2021/12/06/dap-urges-govt-to-clarify-liquor-licensing-rules/'},
 {'title': 'Bersatu warns of disciplinary action against ‘independent’ Ali Biju',
  'url': 'https://www.freemalaysiatoday.com/category/nation/2021/12/06/bersatu-warns-of-disciplinary-action-against-independent-ali-biju/'},
 {'title': 'Liquor licences for coffee shops in line with global standards, says govt',
  'url': 'https://www.freemalaysiatoday.com/category/nation/2021/12/06/liquor-licences-for-coffee-shops-in-line-with-global-standards-says-govt/'},
 {'title': 'Cops throw cold water on wild pool partygoers',
  'url': 'https://www.freemalaysiatoday.com/categor

In [4]:
df = pd.DataFrame(data)
df.head()

Unnamed: 0,title,url
0,Court throws out Rosmah’s bid to disqualify Sr...,https://www.freemalaysiatoday.com/category/nat...
1,DAP urges govt to clarify liquor licensing rules,https://www.freemalaysiatoday.com/category/nat...
2,Bersatu warns of disciplinary action against ‘...,https://www.freemalaysiatoday.com/category/nat...
3,Liquor licences for coffee shops in line with ...,https://www.freemalaysiatoday.com/category/nat...
4,Cops throw cold water on wild pool partygoers,https://www.freemalaysiatoday.com/category/nat...


In [5]:
df.shape

(32142, 2)

In [6]:
import re
from datetime import datetime

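def split_date(url):
    """Return the first YYYY/MM/DD date found in url as a datetime, or None."""
    try:
        x = re.findall(r"(\d{4}/\d{2}/\d{2})", url)
        return datetime.strptime(x[0], '%Y/%m/%d')
    except (TypeError, IndexError, ValueError):
        # TypeError: url is not a string; IndexError: no date in url;
        # ValueError: the digits matched but are not a valid calendar date
        return None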

df['date'] = df['url'].apply(split_date)
df.head()

Unnamed: 0,title,url,date
0,Court throws out Rosmah’s bid to disqualify Sr...,https://www.freemalaysiatoday.com/category/nat...,2021-12-06
1,DAP urges govt to clarify liquor licensing rules,https://www.freemalaysiatoday.com/category/nat...,2021-12-06
2,Bersatu warns of disciplinary action against ‘...,https://www.freemalaysiatoday.com/category/nat...,2021-12-06
3,Liquor licences for coffee shops in line with ...,https://www.freemalaysiatoday.com/category/nat...,2021-12-06
4,Cops throw cold water on wild pool partygoers,https://www.freemalaysiatoday.com/category/nat...,2021-12-06
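The same date extraction can also be done without a row-by-row `apply`. The following is a minimal sketch, assuming the URLs keep the `YYYY/MM/DD` path segment seen above; the `urls` series is a made-up sample standing in for `df['url']`.

```python
import pandas as pd

# Hypothetical sample standing in for df['url']: one good URL and two missing ones
urls = pd.Series([
    "https://www.freemalaysiatoday.com/category/nation/2021/12/06/dap-urges-govt-to-clarify-liquor-licensing-rules/",
    "",
    None,
])

# Pull the first YYYY/MM/DD run out of each URL; rows without a match become NaN
date_part = urls.str.extract(r"(\d{4}/\d{2}/\d{2})", expand=False)

# errors="coerce" turns anything that cannot be parsed (including NaN) into NaT
dates = pd.to_datetime(date_part, format="%Y/%m/%d", errors="coerce")
print(dates)
```

Rows with an empty or missing URL end up as `NaT`, which is the same signal the notebook uses to find headlines that need a manual fix.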


Some URLs were not captured by my Chrome extension, so those rows have no date:

In [7]:
pd.set_option('display.max_colwidth', None)  # None shows full column text; -1 is deprecated
df[df['date'].isnull()]


Unnamed: 0,title,url,date
2777,"Special RM50 note fetches whopping RM708,000 at auction",,NaT
2778,Death for mother of nine: Where is the noise?,,NaT
2779,China’s smash-hit war movie raises hackles in Malaysia,,NaT
2896,"Drinking Timah is like drinking a Malay woman, says PH MP",,NaT
2897,"Siam, not Kedah, lost Penang to the British, says expert",,NaT
2898,Timah to change its name – and image,,NaT
4166,"Now, Sanusi wants RM100mil a year for ‘lease’ of Penang",,NaT
4167,Unmasked woman slammed for arguing over being barred by KLCC outlet,,NaT
4168,"They call me ‘Padayappa’ in Baling, says Azeez",,NaT
4204,"Ahmad Maslan pays RM1.1mil compound, cleared of charges",,NaT


Considering that I have more than 32,000 news headlines (32,142 rows above), there are not that many missing URLs. I can impute a few each day.
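To put a number on "not that many", the share of rows without a date can be computed directly. This is a minimal sketch on a toy frame; `df_demo` and its rows are made up for illustration, and in the notebook the same two lines would run on `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df: four headlines, one without a date
df_demo = pd.DataFrame({
    "title": ["headline a", "headline b", "headline c", "headline d"],
    "date": pd.to_datetime(["2021-12-06", None, "2021-12-06", "2021-12-06"]),
})

missing = int(df_demo["date"].isna().sum())  # rows whose URL produced no date
share = missing / len(df_demo)               # fraction of the whole frame

print(f"{missing} of {len(df_demo)} rows ({share:.1%}) have no date")
```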

In [8]:
df.loc[df.title == 'Man who fell from condo key witness in Guan Eng’s corruption case','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/05/man-who-fell-from-condo-key-witness-in-guan-eng-corruption-case/'
df.loc[df.title == 'I was misunderstood, speech was only for Muslims, says preacher','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/05/i-was-misunderstood-speech-was-only-for-muslims-says-preacher/'
df.loc[df.title == 'Melaka govt has fallen, says former CM after exodus','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/04/melaka-govt-has-fallen-says-former-cm-after-exodus/'
df.loc[df.title == 'Something is rotten in the state of Putrajaya','url'] = 'https://www.freemalaysiatoday.com/category/opinion/2021/10/03/something-is-rotten-in-the-state-of-putrajaya/'
df.loc[df.title == 'Minister condemns racist comments against national shuttler','url'] = 'https://www.freemalaysiatoday.com/category/highlight/2021/10/03/minister-condemns-racist-comments-against-national-shuttler/'
df.loc[df.title == 'Bersatu man quits after racial slur against Kisona','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/03/bersatu-man-quits-after-racial-slur-against-kisona/'
df.loc[df.title == 'PH blew it on citizenship matters, says Latheefa','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/02/ph-blew-it-on-citizenship-matters-says-latheefa/'
df.loc[df.title == 'Rusuhan di tokong Cheras, pengurusan mohon maaf','url'] = 'https://www.freemalaysiatoday.com/category/bahasa/tempatan/2021/10/02/rusuhan-di-tokong-cheras-pengurusan-mohon-maaf/'
df.loc[df.title == 'SOPs for cinemas don’t make sense, says former deputy minister','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/02/sops-for-cinemas-dont-make-sense-says-former-deputy-minister/'
df.loc[df.title == 'MIC in the soup over criticism of Umno president','url'] = 'https://www.freemalaysiatoday.com/category/opinion/2021/10/01/mic-in-the-soup-over-criticism-of-umno-president/'
df.loc[df.title == 'Ousted – top Communist Party man linked to monitoring of reporters probing 1MDB','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/01/ousted-top-communist-party-man-linked-to-monitoring-of-reporters-probing-1mdb/'
df.loc[df.title == 'Clinic assistant for 37 years gets RM165,000 for unfair dismissal','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/01/clinic-assistant-for-37-years-gets-rm165000-for-unfair-dismissal/'
df.loc[df.title == 'If furniture belonged to Najib, he had the right to take them, says Muhyiddin','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/09/30/if-furniture-belonged-to-najib-he-had-the-right-to-take-them-says-muhyiddin/'
df.loc[df.title == 'Dining-in allowed even for unvaccinated under Phase 3 of recovery plan','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/01/dining-in-allowed-even-for-unvaccinated-under-phase-3-of-recovery-plan/'
df.loc[df.title == 'Ahmad Maslan pays RM1.1mil compound, cleared of charges','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/09/29/ahmad-maslan-pays-rm1-1mil-compound-cleared-of-charges/'
df.loc[df.title == '3 reasons why DAP is alive and kicking despite the odds','url'] = 'https://www.freemalaysiatoday.com/category/opinion/2021/09/29/3-reasons-why-dap-is-alive-and-kicking-despite-the-odds/'
df.loc[df.title == 'Klang Valley, Melaka to move into recovery Phase 3','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/09/29/klang-valley-melaka-to-move-into-recovery-phase-3/'
df.loc[df.title == 'Anwar questions RM2bil for foreign consultants to draw up 12MP','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/09/28/anwar-questions-rm2bil-for-foreign-consultants-to-draw-up-12mp/'
df.loc[df.title == 'Deputy minister says sorry over dangerous overtaking','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/09/28/deputy-minister-says-sorry-over-dangerous-overtaking/'
df.loc[df.title == 'Zahid extended firm’s contract without consulting govt officers, court told','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/09/28/zahid-extended-firms-contract-without-consulting-govt-officers-court-told/'
df.loc[df.title == 'Umno ‘minta nyawa’ daripada DAP selamatkan kerajaan Melaka, dedah Rauf','url'] = 'https://www.freemalaysiatoday.com/category/bahasa/tempatan/2021/10/06/umno-minta-nyawa-daripada-dap-selamatkan-kerajaan-melaka-dedah-rauf/'
df.loc[df.title == 'Another PKR in the offing? New party may be called Parti Kuasa Rakyat','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/06/another-pkr-in-the-offing-new-party-may-be-called-parti-kuasa-rakyat/'
df.loc[df.title == 'Chaos in Dewan as deputy minister rushes to finish speech','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/06/chaos-in-dewan-as-deputy-minister-rushes-to-finish-speech/'
df.loc[df.title == 'Govt relaxes rules for existing MM2H holders','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/05/govt-relaxes-rules-for-existing-mm2h-holders/'
df.loc[df.title == 'Grand send-off for Penang developer who fell to his death','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/07/grand-send-off-for-penang-developer-who-fell-to-his-death/'
df.loc[df.title == 'I might never be PM, and that’s okay, says Anwar','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/08/i-might-never-be-pm-and-thats-okay-says-anwar/'
df.loc[df.title == 'Court upholds striking out of RM676mil suit against Najib, Rosmah','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/08/court-upholds-striking-out-of-rm676mil-suit-against-najib-rosmah/'
df.loc[df.title == 'Interstate borders open, Malaysians free to go overseas, Rosmah','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/10/interstate-borders-to-reopen-from-tomorrow/'
df.loc[df.title == 'Maryam, you’re a genius – now be careful','url'] = 'https://www.freemalaysiatoday.com/category/opinion/2021/10/10/maryam-youre-a-genius-now-be-careful/'
df.loc[df.title == '10 reasons why Bumiputeraism is bad policy','url'] = 'https://www.freemalaysiatoday.com/category/opinion/2021/10/10/10-reasons-for-why-bumiputerism-is-bad-policy/'
df.loc[df.title == 'What miracles have you done, Penang DAP reps asked','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/08/what-miracles-have-you-done-penang-dap-reps-asked/'
df.loc[df.title == 'Tengku Zafrul kongsi kisah makan & rentas negeri, warganet pula tanya i-Citra RM10K mana?','url'] = 'https://www.freemalaysiatoday.com/category/bahasa/fmt-ohsem/ohsem-apa-apa-aje/2021/10/11/tengku-zafrul-kongsi-kisah-makan-rentas-negeri-warganet-pula-tanya-i-citra-rm10k-mana/'
df.loc[df.title == 'Opposition leader Anwar gets ‘promotion letter’ to minister level','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/11/opposition-leader-anwar-gets-promotion-letter-to-minister-level/'
df.loc[df.title == 'Rare praise for Guan Eng from PAS MP over funds for tahfiz schools','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/11/rare-praise-for-guan-eng-from-pas-mp-over-funds-for-tahfiz-schools/'
df.loc[df.title == 'Interstate borders open, Malaysians free to go overseas','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/10/interstate-borders-to-reopen-from-tomorrow/'
df.loc[df.title == 'Tropicana wants back international school over rental arrears','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/08/tropicana-wants-back-international-school-over-rental-arrears/'
df.loc[df.title == 'Australia’s Asean farm worker visa hits snag – from Malaysia','url'] = 'https://www.freemalaysiatoday.com/category/opinion/2021/10/13/australias-asean-farm-worker-visa-hits-snag-from-malaysia/'
df.loc[df.title == 'LHDN chief executive resigning?','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/13/lhdn-chief-executive-resigning/'
df.loc[df.title == 'RM100,000 to buy rank of police superintendent?','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/12/rm100000-to-buy-rank-of-police-superintendent/'
df.loc[df.title == 'Batu Puteh: Adakah kedaulatan Johor tak penting, kalau Langkawi hilang apa rasa orang Kedah, soal Su','url'] = 'https://www.freemalaysiatoday.com/category/bahasa/tempatan/2021/10/14/batu-puteh-adakah-kedaulatan-johor-tak-penting-bagi-pentadbiran-dr-m-kalau-langkawi-hilang-kepada-thailand-apa-rasa-orang-kedah-soal-sultan-ibrahim/'
df.loc[df.title == 'Iconic Hameediyah nasi kandar shop slapped with RM10,000 fine','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/15/iconic-hameediyah-nasi-kandar-shop-slapped-with-rm10000-fine/'
df.loc[df.title == 'Knives come out as Penang DCM mulls move to Parliament','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/15/knives-come-out-as-penang-dcm-mulls-move-to-parliament/'
df.loc[df.title == 'Sex video shock for Form 2 students during virtual exam','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/15/sex-video-shock-for-form-2-students-during-virtual-exam/'
df.loc[df.title == '200 youths carried out pre-dawn raid at temple, says witness','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/14/200-youths-carried-out-pre-dawn-raid-at-temple-says-witness/'
df.loc[df.title == 'Not fair to deprive us of ripe pickings, says Malaysian Down Under','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/14/not-fair-to-deprive-us-of-ripe-pickings-says-malaysian-down-under/'
df.loc[df.title == 'Whatsapp terakhir, anggota polis hantar pesanan sebelum ditemui mati','url'] = 'https://www.freemalaysiatoday.com/category/bahasa/tempatan/2021/10/17/whatsapp-terakhir-anggota-polis-hantar-pesanan-sebelum-ditemui-mati/'
df.loc[df.title == 'Dilemmas facing the Indian Muslims','url'] = 'https://www.freemalaysiatoday.com/category/opinion/2021/10/16/dilemmas-facing-the-indian-muslims/'
df.loc[df.title == 'Now, Sanusi wants RM100mil a year for ‘lease’ of Penang','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/27/now-sanusi-wants-rm100mil-a-year-for-lease-of-penang/'
df.loc[df.title == 'Unmasked woman slammed for arguing over being barred by KLCC outlet','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/27/unmasked-woman-slammed-for-arguing-over-being-barred-by-klcc-outlet/'
df.loc[df.title == 'They call me ‘Padayappa’ in Baling, says Azeez','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/27/they-call-me-padayappa-in-baling-says-azeez/'
df.loc[df.title == 'Union rejects Amanah call for restrictions on Muslims in alcohol sale','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/25/union-rejects-amanah-call-for-restrictions-on-muslims-in-alcohol-sale/'
df.loc[df.title == 'Girl born to Chinese national gets to be Malaysian','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/25/girl-born-to-chinese-national-gets-to-be-malaysian/'
df.loc[df.title == '870,000 to get one-off RM500 payment tomorrow','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/25/870000-to-get-one-off-rm500-payment-tomorrow/'
df.loc[df.title == 'Singapore ‘entices’ Malaysian doctors to work in Covid-19 care centres','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/18/singapore-entices-malaysian-doctors-to-work-in-covid-19-care-centres/'
df.loc[df.title == 'Outspoken Kitingan leaves GRS ties at risk, says analyst','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/17/outspoken-kitingan-leaves-grs-ties-at-risk-says-analyst/'
df.loc[df.title == 'Sex ring found at car wash','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/16/sex-ring-found-at-car-wash/'
df.loc[df.title == 'Drinking Timah is like drinking a Malay woman, says PH MP','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/28/drinking-timah-is-like-drinking-a-malay-woman-says-ph-mp/'
df.loc[df.title == 'Siam, not Kedah, lost Penang to the British, says expert','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/29/siam-not-kedah-lost-penang-to-the-british-says-expert/'
df.loc[df.title == 'Timah to change its name – and image','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/28/timah-to-change-its-name-and-image/'
df.loc[df.title == 'Special RM50 note fetches whopping RM708,000 at auction','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/31/special-rm50-note-fetches-whopping-rm708000-at-auction/'
df.loc[df.title == 'Death for mother of nine: Where is the noise?','url'] = 'https://www.freemalaysiatoday.com/category/opinion/2021/10/31/death-for-mother-of-nine-where-is-the-noise/'
df.loc[df.title == 'China’s smash-hit war movie raises hackles in Malaysia','url'] = 'https://www.freemalaysiatoday.com/category/nation/2021/10/30/chinas-smash-hit-war-movie-raises-hackles-in-malaysia/'

df['date'] = df['url'].apply(split_date)
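As the list of manual fixes grows, one option is to keep them as data (a title-to-URL mapping) and apply them in a single step. This is only a sketch of that idea, not what the cell above does: `df_demo` and `url_fixes` are hypothetical names, and the two rows are a small sample in the same shape as the real data.

```python
import pandas as pd

# Hypothetical stand-in for df: one row with a URL, one without
df_demo = pd.DataFrame({
    "title": [
        "DAP urges govt to clarify liquor licensing rules",
        "Timah to change its name – and image",
    ],
    "url": [
        "https://www.freemalaysiatoday.com/category/nation/2021/12/06/dap-urges-govt-to-clarify-liquor-licensing-rules/",
        None,
    ],
})

# Manual fixes kept as a mapping instead of one statement per headline
url_fixes = {
    "Timah to change its name – and image":
        "https://www.freemalaysiatoday.com/category/nation/2021/10/28/timah-to-change-its-name-and-image/",
}

# Look each title up in the mapping; titles without a fix keep their current URL
df_demo["url"] = df_demo["title"].map(url_fixes).fillna(df_demo["url"])
print(df_demo)
```

Like the `df.loc` assignments, a title present in the mapping gets its URL overwritten whether or not it was missing. The mapping could also live in a separate JSON file that is extended each day.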

In [9]:
df[df.isna().any(axis=1)]

Unnamed: 0,title,url,date


Imputation done: no rows with missing values remain, so I save the result to CSV.

In [10]:
df.to_csv('imputed_fmt_28-12-21.csv', index=False)