# Get Metadata from a Blogger Blog
This notebook shows how to get metadata for all posts of a blog on Blogger. Please refer to the [official Blogger documentation](https://developers.google.com/blogger/docs/3.0/reference?hl=en) for more details.

To keep your credentials out of the code, create a text file named `.env` (not included in this repo) with two lines:
> BLOG_ID = [YOUR_BLOG_ID]  
> KEY = [YOUR_KEY]

You can find the blog ID by logging in to your Blogger account and selecting your blog; the ID appears in the URL. To obtain an API key for reading public data, refer to [this page](https://developers.google.com/blogger/docs/3.0/using).

In [1]:
import requests
import json
import pandas as pd
import os
from dotenv import load_dotenv

load_dotenv()
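# Variable names must match the .env file exactly (environment variables are case-sensitive on Linux and macOS)
blog_id = os.getenv('BLOG_ID')
key = os.getenv('KEY')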
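
If either variable is missing, `os.getenv` returns `None` and the requests further down fail with a less obvious API error. A minimal sketch of an early check, assuming the variable names `BLOG_ID` and `KEY` from the `.env` file described above; the helper name `require_env` is hypothetical and not part of `python-dotenv`:

```python
import os

def require_env(*names):
    """Return the values of the given environment variables, failing early if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(os.environ[name] for name in names)
```

After `load_dotenv()`, this could be used as `blog_id, key = require_env('BLOG_ID', 'KEY')`.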

I use my own blog https://ccmusichk.blogspot.com/ as an example. It now has close to 800 articles, so compiling a list of all article titles and published dates by hand would be very time-consuming. The API makes this easy.

In [2]:
request = requests.get(f'https://www.googleapis.com/blogger/v3/blogs/{blog_id}?key={key}')
blog_meta = request.json()
print(blog_meta)

{'kind': 'blogger#blog', 'id': '9563690', 'name': '當下音樂', 'description': '給聽得見歷代音樂的這個當下', 'published': '2006-11-18T21:31:20+08:00', 'updated': '2023-12-24T19:55:02+08:00', 'url': 'http://ccmusichk.blogspot.com/', 'selfLink': 'https://www.googleapis.com/blogger/v3/blogs/9563690', 'posts': {'totalItems': 774, 'selfLink': 'https://www.googleapis.com/blogger/v3/blogs/9563690/posts'}, 'pages': {'totalItems': 3, 'selfLink': 'https://www.googleapis.com/blogger/v3/blogs/9563690/pages'}, 'locale': {'language': 'zh', 'country': 'HK', 'variant': ''}}


In [3]:
post_count = blog_meta['posts']['totalItems']
last_date = blog_meta['updated']
print(post_count, last_date)

774 2023-12-24T19:55:02+08:00


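The following code block creates a dictionary with information about every post, including id, title, published date and url. Because the API limits how many posts a single request can return, I fetch 500 results at a time, using the published date of the last post fetched as the `endDate` of the next request, until all posts have been retrieved.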

In [4]:
remaining_post = post_count
data = {'id':[], 'title':[], 'published':[], 'url':[]}
while remaining_post > 0:
    # Pass the query as a dict so that requests URL-encodes the ':' and '+' in the timestamp
    params = {'maxResults': 500, 'fetchBodies': 'false', 'endDate': last_date, 'key': key}
    request = requests.get(f'https://www.googleapis.com/blogger/v3/blogs/{blog_id}/posts', params=params)
    post_meta = request.json()
    for post in post_meta['items']:
        data['id'].append(post['id'])
        data['title'].append(post['title'])
        data['published'].append(post['published'])
        data['url'].append(post['url'])
    # The oldest post fetched so far becomes the upper bound of the next request
    last_date = data['published'][-1]
    remaining_post -= 500
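
The Blogger API also returns a `nextPageToken` field when more results are available, and it can be passed back as the `pageToken` parameter. Below is a minimal sketch of that alternative, which is not the approach used in this notebook. Here `fetch_page` is a hypothetical callable that takes a page token (or `None` for the first page) and returns the parsed JSON of one `posts` response, so the loop can be tried without network access:

```python
def collect_posts(fetch_page):
    """Follow nextPageToken until the API stops returning one."""
    data = {'id': [], 'title': [], 'published': [], 'url': []}
    token = None
    while True:
        page = fetch_page(token)
        # A page without posts has no 'items' key, so fall back to an empty list
        for post in page.get('items', []):
            for field in data:
                data[field].append(post[field])
        token = page.get('nextPageToken')
        if not token:
            break
    return data
```

With `requests`, `fetch_page` could be something like `lambda token: requests.get(url, params={'maxResults': 500, 'fetchBodies': 'false', 'key': key, 'pageToken': token}).json()`, where `url` is the `posts` endpoint used above; `requests` leaves out parameters whose value is `None`, so the first call sends no token.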

Then, create a pandas dataframe from the dictionary.

In [5]:
df = pd.DataFrame.from_dict(data)
df.head(10)

| | id | title | published | url |
|---|---|---|---|---|
| 0 | 3906862311869223564 | 當下音樂2023第四季新歌總評 | 2023-12-24T19:42:00+08:00 | http://ccmusichk.blogspot.com/2023/12/2023.html |
| 1 | 277184505363206060 | 楊乃文《Flow》：寶島聽歌雜談 | 2023-11-01T06:38:00+08:00 | http://ccmusichk.blogspot.com/2023/11/flow.html |
| 2 | 1528486961508801536 | 當下音樂2023第三季新歌總評 | 2023-10-02T05:03:00+08:00 | http://ccmusichk.blogspot.com/2023/10/2023.html |
| 3 | 4629897493370226472 | 英倫聽歌雜談(七)：夏日音樂節之BBC Proms | 2023-08-19T05:21:00+08:00 | http://ccmusichk.blogspot.com/2023/08/bbc-prom... |
| 4 | 4916221107640258246 | 英倫聽歌雜談(六)：夏日音樂節之Glastonbury | 2023-08-05T16:15:00+08:00 | http://ccmusichk.blogspot.com/2023/08/glastonb... |
| 5 | 7986615748428267563 | 當下音樂2023第二季新歌總評 | 2023-07-02T00:11:00+08:00 | http://ccmusichk.blogspot.com/2023/07/2023.html |
| 6 | 4920993640690599734 | 由聽AI張國榮到訓練AI汪明荃得出的十點啟示 | 2023-06-21T06:22:00+08:00 | http://ccmusichk.blogspot.com/2023/06/aiai.html |
| 7 | 603399927818319101 | 當下音樂2023第一季新歌總評 | 2023-04-01T19:59:00+08:00 | http://ccmusichk.blogspot.com/2023/04/2023.html |
| 8 | 1696388303122897021 | 告五人《帶你飛》：打臉來得太快但不失為一件好事 | 2023-03-24T05:52:00+08:00 | http://ccmusichk.blogspot.com/2023/03/blog-pos... |
| 9 | 6431417474549410972 | 英倫聽歌雜談(五)：從一本一直沒有看完的書談起 | 2023-02-26T00:54:00+08:00 | http://ccmusichk.blogspot.com/2023/02/blog-pos... |
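
Since `published` holds ISO 8601 timestamps, the metadata lends itself to simple summaries such as the number of posts per year. A minimal sketch using only the standard library; the function name `posts_per_year` is hypothetical, and it expects a list of timestamp strings such as `data['published']`:

```python
from collections import Counter
from datetime import datetime

def posts_per_year(published):
    """Count posts by the year of their ISO 8601 `published` timestamp."""
    return Counter(datetime.fromisoformat(ts).year for ts in published)
```

Calling `posts_per_year(data['published'])` returns a `Counter` that maps each year to its post count.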


[Optional] You can save the dataframe to a CSV file for further analysis.

In [6]:
df.to_csv('ccmusichk_posts.csv', index=False)