<a href="https://colab.research.google.com/github/javieraespinosa/lifranum/blob/main/Blogger_Collector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Collecting Data from Blogger via the API

**Blogger** (previously Blogspot) is a Google service for hosting personal blogs. Blog data (e.g., _information about the blog, its authors, and its posts_) can be accessed directly through the [Blogger API](https://developers.google.com/blogger/docs/3.0/getting_started). This notebook shows how to use the API to collect and store **public data** from a blog for further analysis.

## Requirements

* MongoDB Python dependency: `dnspython`, which `pymongo` needs to resolve `mongodb+srv://` connection strings (**restart the runtime afterwards**)

In [1]:
# Restart the runtime once the installation finishes
!pip install dnspython

Collecting dnspython
  Downloading https://files.pythonhosted.org/packages/90/49/cb426577c28ca3e35332815b795a99e467523843fc83cc85ca0d6be2515a/dnspython-2.0.0-py3-none-any.whl (208kB)
Installing collected packages: dnspython
Successfully installed dnspython-2.0.0


* Blog URL

In [1]:
BLOG_URL  = 'https://poesiecls.blogspot.com/'

* Google [application key](https://developers.google.com/blogger/docs/3.0/using#APIKey)

In [2]:
# Replace with your own key
MY_APPLICATION_KEY = "AIzaSyC4MJYe3uOFJhfeGuNG7-1nEDLYlb54UfM"
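A key written directly in a shared notebook is visible to anyone who can open it. As a minimal sketch of one alternative, the key could be read from an environment variable instead. The variable name `BLOGGER_API_KEY` and the helper `load_api_key` are assumptions made for this example; the Blogger API does not prescribe either.

```python
import os

def load_api_key(env_var="BLOGGER_API_KEY", default=None):
    """Return the API key stored in `env_var`, or `default` if it is unset or empty."""
    # env_var is a hypothetical name chosen for this sketch
    key = os.environ.get(env_var, "").strip()
    return key or default

# Hypothetical usage: keep the value from the cell above as a fallback
# MY_APPLICATION_KEY = load_api_key(default=MY_APPLICATION_KEY)
```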

* [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) connection string

In [3]:
# Replace with your own cluster credentials
DB_USER = 'victorhugo'
DB_PWD  = 'HqPqY1oMmzLXVaWI'
DB_NAME = 'blogger'
DB_CONNECTION_STRING = "mongodb+srv://{}:{}@lifranum-cluster.ag22g.mongodb.net/{}".format(DB_USER, DB_PWD, DB_NAME)
DB_CONNECTION_STRING

'mongodb+srv://victorhugo:HqPqY1oMmzLXVaWI@lifranum-cluster.ag22g.mongodb.net/blogger'
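The plain `format` call above works because these credentials contain only letters and digits. If a user name or password contains characters such as `@`, `/` or `:`, it has to be percent-escaped before it goes into the URI. A minimal sketch using the standard library follows; the helper name `build_connection_string` and the sample values are placeholders invented for illustration.

```python
from urllib.parse import quote_plus

def build_connection_string(user, password, host, db_name):
    """Build a mongodb+srv URI with percent-escaped credentials."""
    return "mongodb+srv://{}:{}@{}/{}".format(
        quote_plus(user), quote_plus(password), host, db_name
    )

# Placeholder values, not real credentials
example_uri = build_connection_string(
    "some_user", "p@ss/word", "cluster.example.net", "blogger"
)
print(example_uri)
```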

* Authorize access to Google Drive


In [4]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


# Helper Functions

Functions required by the notebook.

In [3]:
import requests 
import json
import time

# Gets a blog's general information (id, name, post count, etc.) from its URL
def get_blog_info(blog_url):
    endpoint = "https://www.googleapis.com/blogger/v3/blogs/byurl"
    params = {
        'url': blog_url,
        'key': MY_APPLICATION_KEY
    }
    r = requests.get(url=endpoint, params=params)
    data = r.json()
    return data

# Gets a blog's posts given the blog's id. max_pages limits the number of pages
# to retrieve and posts_per_page sets the number of posts requested per page.
# With the default max_pages=0 there is no limit and all posts are collected.
def get_blog_posts(blog_id, max_pages=0, posts_per_page=50):
    posts = []
    data = None
    p = 1
    try:
        endpoint = "https://www.googleapis.com/blogger/v3/blogs/{}/posts".format(blog_id)
        params = {
            'key': MY_APPLICATION_KEY,
            'maxResults': posts_per_page
        }
        while True:
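            r = requests.get(url=endpoint, params=params)
            r.raise_for_status()  # surface HTTP errors (bad key, quota, ...)
            data = r.json()

            # A page may come back without an 'items' key (e.g., an empty blog)
            items = data.get('items', [])
            if not items:
                break
            posts.extend(items)

            print('last post:', items[-1]['id'], items[-1]['url'])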

            if max_pages > 0 and p >= max_pages:
                break
            
            # Retrieve until there are no more pages left
            if 'nextPageToken' not in data:
                break

            params['pageToken'] = data['nextPageToken']

            # sleep every 2 calls to avoid google rate limits
            if p % 2 == 0:
                time.sleep(2)  
            p+=1

    except Exception as e:
        print('error:', e)
        print('data:', data)
  
    return posts
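The core of `get_blog_posts` is token-based pagination: each response may carry a `nextPageToken`, which is sent back as `pageToken` to obtain the following page, until no token is returned. The sketch below isolates that loop so it can be tried without network access. The function `iter_pages`, the callable `fetch_page`, and the `FAKE_PAGES` data are all hypothetical names introduced for this example.

```python
def iter_pages(fetch_page, max_pages=0):
    """Yield successive result pages by following `nextPageToken`.

    `fetch_page` is a hypothetical callable that takes a page token
    (None for the first page) and returns the decoded JSON response.
    """
    token = None
    page_count = 0
    while True:
        data = fetch_page(token)
        yield data
        page_count += 1
        # Stop when the optional page limit is reached
        if max_pages > 0 and page_count >= max_pages:
            break
        # Stop when the response carries no token for a further page
        token = data.get('nextPageToken')
        if token is None:
            break

# Offline stand-in for the API: three made-up pages chained by tokens
FAKE_PAGES = {
    None: {'items': [{'id': '1'}, {'id': '2'}], 'nextPageToken': 'a'},
    'a':  {'items': [{'id': '3'}], 'nextPageToken': 'b'},
    'b':  {'items': [{'id': '4'}]},
}

all_items = [
    item
    for page in iter_pages(FAKE_PAGES.get)
    for item in page.get('items', [])
]
print(len(all_items))
```

Passing the fetch step in as a callable keeps the loop independent of `requests`, which makes the stopping conditions easy to check in isolation.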


# Examples

## Ex1: Collect blog info

In [4]:
blog_info = get_blog_info(BLOG_URL)
blog_info

{'description': 'Jean Coulombe, Alain Larose et Denis Samson ; trois poètes librement associés pour partager leur poésie sous toutes ses formes...                           \n                                                                                                                   \n\nFONDÉ EN JUIN 2009!',
 'id': '574023896617111007',
 'kind': 'blogger#blog',
 'locale': {'country': '', 'language': 'fr', 'variant': ''},
 'name': 'CLS Poésie',
 'pages': {'selfLink': 'https://www.googleapis.com/blogger/v3/blogs/574023896617111007/pages',
  'totalItems': 0},
 'posts': {'selfLink': 'https://www.googleapis.com/blogger/v3/blogs/574023896617111007/posts',
  'totalItems': 904},
 'published': '2009-06-13T21:06:58-04:00',
 'selfLink': 'https://www.googleapis.com/blogger/v3/blogs/574023896617111007',
 'updated': '2020-11-05T03:49:41-05:00',
 'url': 'http://poesiecls.blogspot.com/'}
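The response is a plain dictionary, so individual fields can be read directly. As a small sketch, the snippet below pulls out the post count and parses the ISO 8601 timestamps from a subset of the output shown above. The helper `summarize_blog` and its output keys are names chosen for this example only.

```python
from datetime import datetime

# Subset of the response shown above
sample_info = {
    'id': '574023896617111007',
    'name': 'CLS Poésie',
    'posts': {'totalItems': 904},
    'published': '2009-06-13T21:06:58-04:00',
    'updated': '2020-11-05T03:49:41-05:00',
}

def summarize_blog(info):
    """Return the blog name, its post count, and the days between creation and last update."""
    published = datetime.fromisoformat(info['published'])
    updated = datetime.fromisoformat(info['updated'])
    return {
        'name': info['name'],
        'total_posts': info['posts']['totalItems'],
        'active_days': (updated - published).days,
    }

summary = summarize_blog(sample_info)
print(summary)
```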

In [50]:
with open('blog_info.json', 'w') as file:
    json.dump(blog_info, file)
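By default `json.dump` escapes non-ASCII characters, so a name such as `CLS Poésie` is written as `CLS Po\u00e9sie`. The file still loads correctly, but it is harder to read by eye. A possible variant, sketched here with hypothetical helper names `save_json` and `load_json`, writes UTF-8 explicitly and keeps accented characters as they are.

```python
import json
import os
import tempfile

def save_json(obj, path):
    """Write `obj` as UTF-8 JSON, keeping accented characters readable."""
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(obj, file, ensure_ascii=False, indent=2)

def load_json(path):
    """Read a UTF-8 JSON file back into Python objects."""
    with open(path, encoding='utf-8') as file:
        return json.load(file)

# Demonstration on a temporary file so that no notebook output is overwritten
demo_path = os.path.join(tempfile.gettempdir(), 'blog_info_demo.json')
save_json({'name': 'CLS Poésie'}, demo_path)
with open(demo_path, encoding='utf-8') as file:
    raw_text = file.read()
```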

## Ex2: Collect all blog posts

In [38]:
blog_id = get_blog_info(BLOG_URL)['id']
posts   = get_blog_posts(blog_id)

last post: 984829205883248563 http://poesiecls.blogspot.com/2019/10/les-ages.html
last post: 747068710474653706 http://poesiecls.blogspot.com/2018/10/migration.html
last post: 7877748794540349364 http://poesiecls.blogspot.com/2017/11/eclaircies.html
last post: 9160598431079570118 http://poesiecls.blogspot.com/2016/11/pour-ca.html
last post: 7447388152092805161 http://poesiecls.blogspot.com/2016/02/corps-etrangers.html
last post: 5162741375807714660 http://poesiecls.blogspot.com/2015/05/chatoiement.html
last post: 8600333091241084249 http://poesiecls.blogspot.com/2014/10/demain.html
last post: 858928423808051893 http://poesiecls.blogspot.com/2014/02/conducteur-designe.html
last post: 8420449498733953109 http://poesiecls.blogspot.com/2013/06/ai-je-perdu-mon-ombre.html
last post: 1730619025202012008 http://poesiecls.blogspot.com/2012/08/le-sens-du-voyage.html
last post: 6806871601676746304 http://poesiecls.blogspot.com/2012/02/jamais-finie.html
last post: 7549738792386417206 http://poesie

In [39]:
len(posts)

904
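The count matches the `totalItems` value reported in the blog info (904), which suggests that every page was retrieved. For a quick look at how the posts are spread over time, the year can be read from each post URL, since the URLs printed above follow the pattern `/YYYY/MM/slug.html`. The sketch below relies on that pattern as an assumption, and `posts_per_year` is a name made up for this example.

```python
import re
from collections import Counter

def posts_per_year(posts):
    """Count posts by the year embedded in their URL (.../YYYY/MM/slug.html)."""
    counts = Counter()
    for post in posts:
        match = re.search(r'/(\d{4})/\d{2}/', post.get('url', ''))
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Three URLs taken from the output above
sample_posts = [
    {'url': 'http://poesiecls.blogspot.com/2019/10/les-ages.html'},
    {'url': 'http://poesiecls.blogspot.com/2016/11/pour-ca.html'},
    {'url': 'http://poesiecls.blogspot.com/2016/02/corps-etrangers.html'},
]
year_counts = posts_per_year(sample_posts)
print(sorted(year_counts.items()))
```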

In [40]:
posts[0]

{'author': {'displayName': 'Coulombe-Larose-Samson',
  'id': '13874257091559880912',
  'image': {'url': '//2.bp.blogspot.com/-ffR0DFxVSE8/XiaIfnWr2tI/AAAAAAAADoY/dXG-exUmt18g689e7z273ztjkZT4DLoQwCK4BGAYYCw/s35/CLS.png'},
  'url': 'https://www.blogger.com/profile/13874257091559880912'},
 'blog': {'id': '574023896617111007'},
 'content': '<p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-UMgq45zu0C0/X6BNVVT593I/AAAAAAAADyI/KMYtRs6MDBQ17hwoaLQgooLx-Vo-qvtGwCLcBGAsYHQ/s2048/Des%2Bfore%25CC%2582ts%2Ba%25CC%2580%2Bvif.JPG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-UMgq45zu0C0/X6BNVVT593I/AAAAAAAADyI/KMYtRs6MDBQ17hwoaLQgooLx-Vo-qvtGwCLcBGAsYHQ/w400-h300/Des%2Bfore%25CC%2582ts%2Ba%25CC%2580%2Bvif.JPG" width="400" /></a></div><br /><p><br /></p><p><br /></p><p><br /></p><p

In [51]:
with open('blog_posts.json', 'w') as file:
    json.dump(posts, file)


## Ex3: Extract text from posts (offline)

In [42]:
with open('blog_posts.json') as file:
    posts = json.load(file)
    

In [43]:
from bs4 import BeautifulSoup

for post in posts:
    # Collect every text node of the post's HTML content as a list of strings
    post['content-text'] = BeautifulSoup(post['content'], 'html.parser').find_all(string=True)


In [44]:
posts[0]['content-text']

["L'horizon au fond des yeux",
 'le feu du crépuscule se répand',
 'des forêts à vif',
 "jusqu'au coeur du poème.",
 'Sous la cendre des couleurs',
 "l'image crée un vide",
 'sous les mots.',
 'Denis Samson',
 ' © 2020']
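The extracted fragments can still carry leading spaces (as in `' © 2020'`) and, depending on the HTML, whitespace-only entries. A minimal clean-up sketch follows; the helper `clean_fragments` is hypothetical, and the blank entries in `sample` were added for illustration and do not come from the blog.

```python
def clean_fragments(fragments):
    """Strip each text fragment and drop the ones that end up empty."""
    # Non-breaking spaces are common in blog HTML, so normalize them first
    cleaned = (fragment.replace('\xa0', ' ').strip() for fragment in fragments)
    return [fragment for fragment in cleaned if fragment]

# The ' ' and '\n' entries are made up to show the filtering
sample = ["L'horizon au fond des yeux", ' ', 'Denis Samson', ' © 2020', '\n']
lines = clean_fragments(sample)
plain_text = '\n'.join(lines)
print(plain_text)
```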

In [45]:
with open('blog_posts.json', 'w') as file:
    json.dump(posts, file)

# Persisting Results

## Google Drive

In [52]:
%%sh
cp *.json gdrive/My\ Drive
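The same copy can be done from Python, which is convenient when the step has to run inside a function or a loop. This is a sketch using only the standard library; `copy_json_files` is a hypothetical helper, and the destination in the commented usage line assumes the mount point used in the Requirements section.

```python
import shutil
from pathlib import Path

def copy_json_files(src_dir, dst_dir):
    """Copy every *.json file from src_dir into dst_dir and return the copied names."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(src_dir).glob('*.json')):
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    return copied

# Hypothetical usage in Colab:
# copy_json_files('.', '/content/gdrive/My Drive')
```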

## MongoDB (ATLAS)

In [47]:
from pymongo import MongoClient
import json

client = MongoClient(DB_CONNECTION_STRING)
db = client[DB_NAME]

In [48]:
with open('blog_info.json') as file:
    blog_info = json.load(file)
    blog_info['_id'] = blog_info['id']

db['blog'].insert_one(blog_info)

<pymongo.results.InsertOneResult at 0x7fb53e045708>

In [49]:
with open('blog_posts.json') as file:
    blog_posts = json.load(file)
    for p in blog_posts:
        p['_id'] = p['id']

db['posts'].insert_many(blog_posts)

<pymongo.results.InsertManyResult at 0x7fb544ab1388>
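Because `_id` is set to the Blogger id, running these insert cells a second time fails with a duplicate key error. One way to make the step repeatable is to replace documents by `_id` with `upsert=True`, so that existing documents are overwritten and new ones are inserted. The sketch below assumes a `pymongo` collection object is passed in; `upsert_documents` is a name introduced for this example.

```python
def upsert_documents(collection, documents):
    """Insert or replace each document, keyed on its Blogger `id`."""
    for doc in documents:
        # Work on a copy so the caller's dictionaries are left untouched
        doc = dict(doc, _id=doc['id'])
        collection.replace_one({'_id': doc['_id']}, doc, upsert=True)
    return len(documents)

# Hypothetical usage with the objects defined above:
# upsert_documents(db['posts'], blog_posts)
```

This issues one request per document, which is simple but slow for large collections; batching the operations would be a natural refinement.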