# Import Instapaper Highlights from Google Sheets
### Bonus: Process and Export into Local Markdown Files

This notebook walks you through integrating with Google's Sheets API so you can pull data from Google Sheets into a pandas DataFrame and work with it from there.

We use a Google Sheets document that IFTTT creates for Instapaper highlights. The notebook exports those highlights to CSV and, in this particular case, saves them locally as one Markdown file per article.

-----

### Installation and Setup

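- Go to https://developers.google.com/sheets/api/quickstart/python.
- Click "ENABLE THE GOOGLE SHEETS API" and complete the additional steps. The authentication cell below expects the resulting `credentials.json` in the same directory as this notebook.
- Configure the sheet ID and sheet name below.
- Run the notebook.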

NOTE: The first run should prompt you to open a URL to confirm access with Google. The end result should be a new file called "token.pickle".

-----

### Libraries

In [1]:
from __future__ import print_function
import pickle
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

from datetime import date, datetime as dt, timedelta as td
import pandas as pd

In [2]:
# If modifying these scopes, delete the file token.pickle.
SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']

### Sheet ID and Target Range (i.e. Sheet Name)

* Copy the sheet ID from the URL of the target sheet.
* Add the ID to the configuration below, along with the sheet name and/or range.
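
If you prefer not to copy the ID by hand, the sketch below pulls it out of a full sheet URL. It assumes the usual `https://docs.google.com/spreadsheets/d/<ID>/edit` URL shape; `extract_sheet_id` is a hypothetical helper, not part of the Google client library, and the example ID is the sample one from the configuration cell.

```python
import re

def extract_sheet_id(url):
    """Return the spreadsheet ID from a Google Sheets URL, or None if absent."""
    # The ID is the path segment right after /spreadsheets/d/
    match = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", url)
    return match.group(1) if match else None

example_url = ("https://docs.google.com/spreadsheets/d/"
               "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/edit#gid=0")
print(extract_sheet_id(example_url))
```

The returned string can be assigned to `SPREADSHEET_ID` in the configuration cell.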

In [3]:
# Configure to Your Specific Sheet 

# SAMPLE_SPREADSHEET_ID = '1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms'
# SAMPLE_RANGE_NAME = 'Class Data!A2:E'

SPREADSHEET_ID = "ADD_SHEET_ID_HERE"
RANGE_NAME = "Sheet1"

In [4]:
# Authentication
creds = None
# The file token.pickle stores the user's access and refresh tokens, and is
# created automatically when the authorization flow completes for the first
# time.
if os.path.exists('token.pickle'):
    with open('token.pickle', 'rb') as token:
        creds = pickle.load(token)
# If there are no (valid) credentials available, let the user log in.
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            'credentials.json', SCOPES)
        creds = flow.run_local_server()
    # Save the credentials for the next run
    with open('token.pickle', 'wb') as token:
        pickle.dump(creds, token)

# build service
service = build('sheets', 'v4', credentials=creds)

In [5]:
# Get Data from Google Sheet and Return a List
def get_gsheet_data(spreadsheet_id, range_name):
    """Return the rows in `range_name` as a list of lists (empty if the range has no values)."""
    sheet = service.spreadsheets()
    result = sheet.values().get(spreadsheetId=spreadsheet_id,
                                range=range_name).execute()
    values = result.get('values', [])
    return values

In [6]:
# get data
highlights_list = get_gsheet_data(SPREADSHEET_ID, RANGE_NAME)

In [7]:
# highlights_list[0:1]

In [8]:
# convert to df, skipping the first row and naming the columns explicitly
insta_highlights = pd.DataFrame(highlights_list[1:], columns=['TimestampOrig', 'Article Title', 'Highlight', 'URL', 'Timestamp'])
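
One thing to watch for here: the Sheets API omits trailing empty cells, so the rows in `highlights_list` can come back with differing lengths, which may cause the column names not to line up. A minimal sketch of one way to normalize the rows before building the DataFrame follows; `pad_rows` is a hypothetical helper introduced only for illustration.

```python
def pad_rows(rows, width, fill=""):
    """Pad short rows with `fill` and truncate long ones so each has `width` cells."""
    # A negative repeat count yields an empty list, so long rows are only truncated
    return [list(row[:width]) + [fill] * (width - len(row)) for row in rows]

sample = [["August 17, 2018 at 11:07AM", "Some Title", "A highlight"],
          ["August 17, 2018 at 11:08AM"]]
for row in pad_rows(sample, 5):
    print(row)
```

In this notebook it could be used as `pad_rows(highlights_list[1:], 5)` ahead of the `pd.DataFrame(...)` call.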


In [9]:
# export to csv
insta_highlights.to_csv("data/instapaper_highlights.csv", index=False)

------

## BONUS: Export Highlights to Markdown Files

#### Configure Local Directory

In [10]:
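# export_directory = '/Users/my-user-name/Documents/clippings_and_quotes/'
export_directory = 'data/export_test/'
# Create the export directory if it does not exist yet
os.makedirs(export_directory, exist_ok=True)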

----

#### Data Import and Data Preparations

In [11]:
highlights = pd.read_csv("data/instapaper_highlights.csv")

In [12]:
highlights.head()

| | TimestampOrig | Article Title | Highlight | URL | Timestamp |
|---|---|---|---|---|---|
| 0 | August 17, 2018 at 11:07AM | The Making of a Corporate Athlete | A successful approach to sustained high perfor... | https://hbr.org/2001/01/the-making-of-a-corpor... | 8/17/2018 11:07:00 |
| 1 | August 17, 2018 at 11:08AM | The Making of a Corporate Athlete | Our efforts aim instead to help executives bui... | https://hbr.org/2001/01/the-making-of-a-corpor... | 8/17/2018 11:08:00 |
| 2 | August 17, 2018 at 09:32PM | We’re in a new age of obesity. How did it happ... | we ate more in 1976. According to government f... | http://www.theguardian.com/commentisfree/2018/... | 8/17/2018 9:32:00 |
| 3 | August 17, 2018 at 09:33PM | We’re in a new age of obesity. How did it happ... | According to a long-term study at Plymouth Uni... | http://www.theguardian.com/commentisfree/2018/... | 8/17/2018 9:33:00 |
| 4 | August 17, 2018 at 09:34PM | We’re in a new age of obesity. How did it happ... | we ate more in 1976, but differently. Today, w... | http://www.theguardian.com/commentisfree/2018/... | 8/17/2018 9:34:00 |


In [13]:
# date additions (vectorized .dt accessors instead of row-by-row apply)
highlights['Timestamp'] = pd.to_datetime(highlights['Timestamp'])
highlights['date'] = highlights['Timestamp'].dt.strftime('%Y-%m-%d')
highlights['year'] = highlights['Timestamp'].dt.year
highlights['month'] = highlights['Timestamp'].dt.month
highlights['mnth_yr'] = highlights['Timestamp'].dt.strftime('%Y-%m')
highlights['day'] = highlights['Timestamp'].dt.day
highlights['dow'] = highlights['Timestamp'].dt.weekday  # Monday=0, Sunday=6
highlights['hour'] = highlights['Timestamp'].dt.hour
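
Note that in the sample rows above, the `Timestamp` column shows `9:32:00` for a highlight whose `TimestampOrig` reads `09:32PM`, so the AM/PM information appears to be lost there. If that matters, one option is to parse `TimestampOrig` directly. The sketch below uses only the standard library and assumes every value follows the `August 17, 2018 at 09:32PM` pattern and that English month names are in effect; `parse_ifttt_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

# Format assumed from the sample rows: "<Month> <day>, <year> at <hh>:<mm><AM|PM>"
IFTTT_FORMAT = "%B %d, %Y at %I:%M%p"

def parse_ifttt_timestamp(text):
    """Parse an IFTTT-style timestamp string into a datetime, keeping AM/PM."""
    return datetime.strptime(text, IFTTT_FORMAT)

parsed = parse_ifttt_timestamp("August 17, 2018 at 09:32PM")
print(parsed.isoformat())  # 2018-08-17T21:32:00
```

The same format string should also work column-wide with `pd.to_datetime(highlights['TimestampOrig'], format=IFTTT_FORMAT)`.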


#### Highlights Markdown Exporter

In [14]:
highlights_titles = highlights['Article Title'].unique()
print('{:,} total articles with highlights.'.format(len(highlights_titles)))

280 total articles with highlights.


In [15]:
def generate_article_highlights_file(article_title, directory=export_directory):
    """Write all highlights for one article to a Markdown file in `directory`."""
    article_highlights = highlights[highlights['Article Title'] == article_title]
    first = article_highlights.iloc[0]
    title = first['Article Title'].rstrip()
    url = first['URL']
    date_read = first['date']
    # Build a filesystem-friendly name: spaces become underscores, some punctuation is dropped
    title_stripped = (title
                      .replace(" ", "_")
                      .replace(":", "")
                      .replace(",", "")
                      .replace("/", "")
                      .replace("(", "")
                      .replace(")", "")
                      .replace("?", "")
                      .lower())
    # Prefix with the first row's timestamp so files sort chronologically
    filename = first['Timestamp'].strftime('%Y%m%d%H%M') + "_" + title_stripped + ".md"
    filepath = os.path.join(directory, filename)

    print("Printing... " + filename)
    # The context manager closes the file even if a write fails;
    # UTF-8 keeps characters such as curly apostrophes intact
    with open(filepath, "w", encoding="utf-8") as file:
        file.write("# " + title + "\n")
        file.write("Source: [" + url + "](" + url + ") \n")
        file.write("Date Read: " + date_read + "\n")
        file.write("tags: #ArticleHighlights #ArticleRead \n")
        file.write(" \n")
        file.write("### Highlights \n")
        file.write(" \n")
        for index, row in article_highlights.iterrows():
            file.write(str(row['Highlight']) + " \n")
            file.write(" \n")
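
The chain of `.replace()` calls above handles a fixed set of characters, so titles containing other punctuation (quotes, `*`, `|`, and so on) could still produce awkward or invalid filenames. As an alternative, here is a minimal regex-based sketch; `slugify_title` is a hypothetical helper, and it is deliberately stricter than the exporter above, so it will not reproduce the same filenames (for example, it collapses ` - ` into a single underscore).

```python
import re

def slugify_title(title):
    """Turn an article title into a lowercase, underscore-separated slug."""
    slug = title.strip().lower()
    # Drop everything except word characters, whitespace and hyphens
    slug = re.sub(r"[^\w\s-]", "", slug)
    # Collapse runs of whitespace, hyphens and underscores into one underscore
    slug = re.sub(r"[\s_-]+", "_", slug)
    return slug.strip("_")

print(slugify_title("We’re in a new age of obesity. How did it happen?"))
print(slugify_title("Gender differences in suicide - Wikipedia"))
```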

In [16]:
# Get a Test Article Title
title_test = highlights_titles[-1]

In [17]:
# TEST Individual Article Export
generate_article_highlights_file(title_test)

Printing... 201904071150_gender_differences_in_suicide_-_wikipedia.md


In [18]:
# UNCOMMENT TO RUN
# Loop through all articles and generate an individual markdown file with highlights for each
#for i in highlights_titles:
#    generate_article_highlights_file(i)
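
With a few hundred articles, a single bad title or unwritable path would stop the loop partway through. A minimal sketch of a more forgiving runner follows; `export_all` is a hypothetical wrapper, and the exception types it catches are an assumption about what the exporter is likely to raise.

```python
def export_all(titles, export_fn):
    """Call export_fn(title) for each title and return the titles that failed."""
    failed = []
    for title in titles:
        try:
            export_fn(title)
        except (OSError, KeyError, IndexError) as err:
            # Report the problem and keep going with the remaining titles
            print("Skipped {!r}: {}".format(title, err))
            failed.append(title)
    return failed

# Stand-in exporter used only to demonstrate the wrapper
def fake_export(title):
    if "/" in title:
        raise OSError("invalid filename")

print(export_all(["Good title", "Bad/title"], fake_export))
```

In this notebook the call would be `export_all(highlights_titles, generate_article_highlights_file)`.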