<a href="https://colab.research.google.com/github/invisilico/ActivityExtractor/blob/master/HTML_extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the Android Activity TimeStamps Extractor!
### Data extraction, now with timezones!

### Built for use with the **HTML** file from takeout.google.com


Inspired by the de la Iglesia Lab project on e-mail timestamps, this notebook helps you extract, clean and visualise your Android phone's activity data, based on the timestamps of when apps were opened.

A pandas dataframe is created with the following structure and is saved as CSV at the end of the notebook.


---


```
# dataframe structure

   App      Year  Month  Date  Time  of24h      TimeZone
0  Appname  2020  7      10    1534  15.566667  IST
1  Appname  2020  7      10    1434  14.566667  IST
2  clock    2020  7      11    1334  13.566667  IST
3  Appname  2020  7      11    1234  12.566667  IST
```
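
The `Time` and `of24h` columns are both derived from the hour and minute of each timestamp. A minimal sketch of that arithmetic follows; the helper name `clock_columns` is hypothetical and does not appear in the notebook.

```python
def clock_columns(hour, minute):
    """Return (Time, of24h) for a given hour and minute."""
    time_hhmm = hour * 100 + minute   # 15:34 becomes the integer 1534
    of24h = hour + minute / 60        # 15:34 becomes 15.5667 decimal hours
    return time_hhmm, of24h

time_hhmm, of24h = clock_columns(15, 34)
print(time_hhmm, round(of24h, 6))
```

`of24h` is the column used for plotting, since it places every event on a continuous 0 to 24 axis.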


---

For this notebook, there are certain privacy measures you can take:

1. Make a copy of this notebook in your Drive using the "Copy to Drive" button, and delete it once the CSV is downloaded.

2. The files are removed explicitly at the data upload step, but some data may be left behind. To make sure it is gone, check the Files tab on the left when prompted, and terminate the session when leaving.

Instructions for using Colab: the [ ] on the left of each "block"/"cell" runs it. Run the cells in order and wait for each one to finish. If you run into errors, feel free to contact me at nishantjana5@gmail.com


---


Made by Nishant Jana during SRBR ChronoSchool 2020.

Twitter: @in_visilico, Github: @invisilico


---

In [7]:
#@title Simple Set-Up
#@markdown Click the ["play"] button on the left and wait for colab to allocate system resources. Once done, click play on each subsequent block.
from google.colab import files
import os
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from calendar import monthrange
from calendar import isleap

In [None]:
#@title Load Data from File
#@markdown This block will keep running until you select the file. Click the "Choose Files" tile when it appears and select the file to upload from your system. The file will be deleted once the data is loaded.
#@markdown
#@markdown Ensure the file is called 'My Activity.html'
file = files.upload()

os.rename(r'My Activity.html',r'actdata.txt')
with open('actdata.txt','r') as file:
  data = file.readlines()
print("Data loaded from file.")
!rm *.* #removes file, bash shell command
print("File deleted from Colab, verify in files on left panel.")
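
The next cell reads the activity markup from a fixed position, `data[32]`. If an export ever places the entries on a different line, that index would pick up the wrong text. A hedged sketch of locating the line by its content instead is shown below; `find_activity_text` is a hypothetical helper and the sample lines are invented for illustration.

```python
def find_activity_text(lines, marker='<p class="mdl-typography--title">'):
    """Return the first line that contains the marker, or '' if none does."""
    for line in lines:
        if marker in line:
            return line
    return ''

# Invented stand-in for the lines read from the uploaded file
sample = [
    '<html>\n',
    '<head></head>\n',
    '<div><p class="mdl-typography--title">Clock<br></p></div>\n',
]
actdat = find_activity_text(sample)
print(len(actdat))
```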

In [None]:
#@title Build DataFrame
#@markdown This block will extract the data from the HTML file and create a pandas dataframe. The CSV downloaded at the end can be read into Python as a pandas dataframe or into R as a data frame. Once done, it prints the first and last entries and the size of the dataframe.
#@markdown 
#@markdown Scroll to the end of the notebook to download CSV or continue to plot activity data.
#@markdown 
#@markdown Appnames are there for three reasons:
#@markdown 
#@markdown 1. Activity of apps like clock and calendar can inform you of the days on which you were woken up by alarms.
#@markdown 2. You can track activity of certain apps that are work related or others that are leisure related.
#@markdown 3. They help clean the data. Some system apps register sporadically even when the phone was not being used. Their activity changes phone to phone and needs manual sorting.
#@markdown 
#@markdown Note: Activity on the app is not read from the HTML file for privacy purposes.
actdat = data[32]
preapp = [app.end(0) for app in re.finditer('<p class="mdl-typography--title">', actdat)] 
postapp = [app.start(0) for app in re.finditer('<br></p></div><div class', actdat)] 
posttime = [app.start(0) for app in re.finditer('</div><div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1 mdl-typography--text-right">', actdat)]

appname = []
datetime = []

for i in range(len(preapp)):
    appname += [actdat[preapp[i]:postapp[i]]]
    datetime += [actdat[posttime[i]-25:posttime[i]]]
    
appname.reverse()
datetime.reverse()

months = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}

yy = []
mm = []
dd = []
time = []
of24 = []
timzn = []

for t in datetime:
  # Each t is the 25 characters before the closing tag. A single-digit day
  # leaves a stray '>' from the preceding tag in the first position.
  day = t[1] if t[0] == '>' else t[0:2]
  yy += [int(t[7:11])]
  mm += [months[(t[3:6])]]
  dd += [int(day)]
  time += [int(t[13:15])*100 + int(t[16:18])]
  of24 += [int(t[13:15])+(int(t[16:18])/60)]
  timzn += [t[-3:]]

dataframe = pd.DataFrame(list(zip(appname,yy,mm,dd,time,of24,timzn)),columns = ['App','Year','Month','Date','Time','of24h','TimeZone'])
print(dataframe)
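
The loop above depends on fixed character positions within each 25-character slice. As an alternative sketch, the same fields can be pulled out with a regular expression. This assumes the timestamps look like `10 Jul 2020, 15:34:12 IST` (a shape inferred from the slice indices above, not confirmed by the export format), and `parse_stamp` is a hypothetical helper.

```python
import re

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

# Assumed shape: day, month abbreviation, year, HH:MM:SS, timezone letters
STAMP = re.compile(r'(\d{1,2}) ([A-Z][a-z]{2}) (\d{4}), (\d{2}):(\d{2}):\d{2} ([A-Z]+)')

def parse_stamp(text):
    """Return one row of date and time fields, or None if no timestamp is found."""
    match = STAMP.search(text)
    if match is None:
        return None
    day, mon, year, hour, minute, zone = match.groups()
    hour, minute = int(hour), int(minute)
    return {'Year': int(year), 'Month': MONTHS[mon], 'Date': int(day),
            'Time': hour * 100 + minute, 'of24h': hour + minute / 60,
            'TimeZone': zone}

# A single-digit day with the stray '>' is handled without a special case
print(parse_stamp('>9 Jul 2020, 05:04:12 IST'))
```

Because the pattern searches for the timestamp rather than counting characters, the single-digit-day case needs no separate branch.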

In [None]:
#@title Privacy Filter
#@markdown Removes app names for privacy when sharing data.

dataframe.loc[dataframe['App'].str.contains('clock', case=False), 'App'] = 'clock'
dataframe.loc[dataframe['App'] != "clock", "App"] = "app"

print("appnames have been removed, replaced with "+str(dataframe.App.unique()))

In [None]:
#@title Cleaning Dataset
#@markdown Do not use unless required.
#@markdown 
#@markdown Run Once to see unique App names. Then type in one App name at a time to remove all instances of it from the dataframe.

Appname = "" #@param {type:"string"}

if Appname != "":
  # Boolean mask instead of a query string, so app names containing quotes do not break the filter
  dataframe = dataframe[dataframe['App'] != Appname]

Unique = dataframe.App.unique()
print(Unique)
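
The cell above removes one app name per run. When several system apps need to go, a list-based filter may be quicker. The sketch below uses an invented toy dataframe and a hypothetical `unwanted` list; neither is part of the notebook.

```python
import pandas as pd

# Toy data standing in for the real dataframe
frame = pd.DataFrame({
    'App': ['Clock', 'Maps', 'System UI', 'Maps'],
    'of24h': [7.0, 9.5, 3.2, 18.1],
})

unwanted = ['System UI', 'Maps']                 # names to drop in one pass
cleaned = frame[~frame['App'].isin(unwanted)]    # keep rows whose App is not listed
print(cleaned.App.unique())
```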

In [None]:
#@title Raster Plot Activity

#@markdown Specify a date, month or year, or set everything to 0 for all time.

Year  =   2020#@param {type:"integer"}
Month =  5#@param {type:"integer"}
Day   =  0#@param {type:"integer"}
#@markdown Day requires Month and Year to be set, and Month requires Year to be set.
#@markdown  
#@markdown Set a field to 0 to leave it unset (0 in all three fields plots all data).
with_XKCD = True #@param {type:"boolean"}

error = "No data for given timeframe."
monthnames  = {1:"January", 2:"February", 3:"March", 4:"April", 5:"May", 6:"June", 7:"July", 8:"August", 9:"September" ,10:"October", 11:"November", 12:"December"}
month = monthnames.keys()
def prepdata(dataframe, y, m, d):
#prepares data for plot

  if y > 0 and m > 0 and d > 0:
    # For a specific date
    frame = dataframe.query(('Year == '+str(y)+' and Month == '+str(m)+' and Date == '+str(d)))
    if frame.size == 0:
      print(error)
      return error
    
    plotdata = frame.of24h.to_numpy()
    #Done

  elif y > 0 and m > 0:
    # For a Month
    frame = dataframe.query(('Year == '+str(y)+' and Month == '+str(m)))
    if frame.size == 0:
      print(error)
      return error
    
    days = monthrange(y,m)[1]
    plotdata = np.empty(days, dtype = object)
    for d in range(days):
      daydata = frame.query(('Date == '+str(d+1)))
      plotdata[d] = daydata.of24h.to_numpy()
      #Done

  elif y > 0:
    # For a Year
    frame = dataframe.query(('Year == '+str(y)))
    if frame.size == 0:
      print(error)
      return error
    plotdata = np.empty(0, dtype = object)
    for m in range(1, 13):  # January through December inclusive
      days = monthrange(y,m)[1]
      monthdata = np.empty(days, dtype = object)
      for d in range(days):
        # Filter on the month as well, otherwise every month repeats the same days
        daydata = frame.query(('Month == '+str(m)+' and Date == '+str(d+1)))
        monthdata[d] = daydata.of24h.to_numpy()
      plotdata = np.concatenate((plotdata,monthdata))
    #Done
  else:
    # All time
    Years = dataframe.Year.unique()
    plotdata = np.empty(0, dtype = object)
    for y in Years:
      frame = dataframe.query(('Year == '+str(y)))
      yeardata = np.empty(0, dtype = object)
      for m in range(1, 13):  # January through December inclusive
        days = monthrange(y,m)[1]
        monthdata = np.empty(days, dtype = object)
        for d in range(days):
          # Filter on the month as well as the day of the month
          daydata = frame.query(('Month == '+str(m)+' and Date == '+str(d+1)))
          monthdata[d] = daydata.of24h.to_numpy()
        yeardata = np.concatenate((yeardata,monthdata))
      plotdata = np.concatenate((plotdata,yeardata))
        #Done
  return plotdata

def Raster(plotdata, y, m, d, xkcd):
  # prepdata returns the error string when the timeframe has no data; nothing to plot then
  if isinstance(plotdata, str):
    return

  # Pick the title that matches the requested timeframe
  if y > 0 and m > 0 and d > 0:
    title = str(d)+" "+monthnames[m]+" "+str(y)
  elif y > 0 and m > 0:
    title = monthnames[m]+" "+str(y)
  elif y > 0:
    title = "Year of "+str(y)
  else:
    title = "All Time"

  def draw():
    # One row of event ticks per day, with time of day on the x axis
    plt.eventplot(plotdata, color = "0.2")
    plt.xlabel("Time of Day")
    plt.title(title)
    plt.xlim(0,24)
    plt.yticks([])

  if xkcd:
    with plt.xkcd():
      draw()
  else:
    draw()

plotdata = prepdata(dataframe, Year, Month, Day)
#print(plotdata.shape)
Raster(plotdata,Year,Month,Day,with_XKCD)
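
The raster needs one sequence of `of24h` values per calendar day, including empty sequences for days without activity. A possible way to build those rows with a single `groupby` rather than one query per day is sketched below; `month_rows` and the toy dataframe are hypothetical.

```python
import pandas as pd
from calendar import monthrange

def month_rows(frame, year, month):
    """One list of decimal-hour events per calendar day of the given month."""
    sel = frame[(frame['Year'] == year) & (frame['Month'] == month)]
    by_day = sel.groupby('Date')['of24h'].apply(list).to_dict()
    days = monthrange(year, month)[1]
    # Days with no recorded activity get an empty list
    return [by_day.get(day, []) for day in range(1, days + 1)]

toy = pd.DataFrame({
    'Year': [2020, 2020, 2020],
    'Month': [5, 5, 6],
    'Date': [1, 3, 1],
    'of24h': [8.5, 22.0, 12.0],
})
rows = month_rows(toy, 2020, 5)
print(len(rows))
```

The resulting list has the same role as the object array built in `prepdata` for a single month, so it could be tried in its place when plotting.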



---

### For using the webapp:

---



In [None]:
#@title Download Parsed Data as CSV
#@markdown Select Files to be downloaded:
#@markdown 
#@markdown AllData.csv            - Data for all your (recorded) history
AllData = True #@param {type:"boolean"}
#@markdown PrePandemic.csv - Data from Apr 20 to May 20, 2019
PrePandemic = True #@param {type:"boolean"}
#@markdown Pandemic.csv       - Data from Apr 20 to May 20, 2020
Pandemic = True #@param {type:"boolean"}
# The webapp expects no timezone column, so drop it if it is still present
if 'TimeZone' in dataframe.columns:
  dataframe = dataframe.drop(columns='TimeZone')

# Each range runs from the first entry on the start date to the last entry on the end date.
# Only the files ticked above are written and downloaded.
if AllData:
  dataframe.to_csv("AllData.csv")
  files.download("AllData.csv")

if PrePandemic:
  pre1 = dataframe.query(('Year == 2019 and Month == 4 and Date == 20'))
  pre2 = dataframe.query(('Year == 2019 and Month == 5 and Date == 20'))
  prepandemic = dataframe.loc[int(pre1.index[0]):int(pre2.index[-1]),:]
  prepandemic.to_csv("PrePandemic.csv")
  files.download("PrePandemic.csv")

if Pandemic:
  pan1 = dataframe.query(('Year == 2020 and Month == 4 and Date == 20'))
  pan2 = dataframe.query(('Year == 2020 and Month == 5 and Date == 20'))
  pandemic = dataframe.loc[int(pan1.index[0]):int(pan2.index[-1]),:]
  pandemic.to_csv("Pandemic.csv")
  files.download("Pandemic.csv")

In [None]:
#@title Download CSV for specific timeframe
Title = "fromhtml" #@param {type:"string"}
Start_Date = "2020-04-20" #@param {type:"date"}
# int() handles leading zeros itself; stripping "0" would turn month "10" into 1
y1 = int(Start_Date[0:4])
m1 = int(Start_Date[5:7])
d1 = int(Start_Date[8:])
End_Date = "2020-05-20" #@param {type:"date"}
y2 = int(End_Date[0:4])
m2 = int(End_Date[5:7])
d2 = int(End_Date[8:])
 
name = Title + ".csv"
# The webapp expects no timezone column, so drop it if it is still present
if 'TimeZone' in dataframe.columns:
  dataframe = dataframe.drop(columns='TimeZone')
frame1 = dataframe.query(('Year == '+str(y1)+' and Month == '+str(m1)+' and Date == '+str(d1)))
frame2 = dataframe.query(('Year == '+str(y2)+' and Month == '+str(m2)+' and Date == '+str(d2)))
 
start_index = int(frame1.index[0])
end_index = int(frame2.index[-1])
 
timeframe = dataframe.loc[start_index:end_index,:]
 
timeframe.to_csv(name)
files.download(name)
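
The cell above looks up the first row on the start date and the last row on the end date, so it raises an `IndexError` when either date has no recorded activity. One possible alternative is to compare real dates instead. The sketch below assumes the column names used in this notebook; `select_range` and the toy dataframe are hypothetical.

```python
import pandas as pd

def select_range(frame, start, end):
    """Rows whose calendar date lies between start and end (ISO strings), inclusive."""
    parts = frame[['Year', 'Month', 'Date']].rename(
        columns={'Year': 'year', 'Month': 'month', 'Date': 'day'})
    dates = pd.to_datetime(parts)   # assembles one datetime per row from the three columns
    mask = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    return frame[mask]

toy = pd.DataFrame({
    'App': ['app', 'clock', 'app', 'app'],
    'Year': [2020, 2020, 2020, 2020],
    'Month': [4, 4, 5, 10],
    'Date': [19, 21, 20, 1],
    'of24h': [9.0, 7.5, 23.1, 12.0],
})
# No activity on 2020-04-20 itself, yet the selection still works
picked = select_range(toy, "2020-04-20", "2020-05-20")
print(picked)
```

The ISO strings produced by the Colab date fields can be passed straight to `pd.Timestamp`, so no manual slicing of year, month and day is needed.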

In [None]:
#@title Clean Slate
#@markdown Deletes all files from colab when run. 
#@markdown
#@markdown (run AFTER downloading CSVs)
!rm *.*
print("Done and Dusted!")

You're all done! Head over to the link below to visualize your data in all the cool ways of chronobiology!

https://circada.shinyapps.io/VisualizationDemo/



---

### For CSVs with a timezone column:

Re-run the Build DataFrame cell first if either of the webapp download cells above was run.



---



In [None]:
#@title Download Parsed Data as CSV
#@markdown Select Files to be downloaded:
#@markdown 
#@markdown AllData.csv            - Data for all your (recorded) history
AllData = True #@param {type:"boolean"}
#@markdown PrePandemic.csv - Data from Apr 20 to May 20, 2019
PrePandemic = True #@param {type:"boolean"}
#@markdown Pandemic.csv       - Data from Apr 20 to May 20, 2020
Pandemic = True #@param {type:"boolean"}
# Each range runs from the first entry on the start date to the last entry on the end date.
# Only the files ticked above are written and downloaded.
if AllData:
  dataframe.to_csv("AllData.csv")
  files.download("AllData.csv")

if PrePandemic:
  pre1 = dataframe.query(('Year == 2019 and Month == 4 and Date == 20'))
  pre2 = dataframe.query(('Year == 2019 and Month == 5 and Date == 20'))
  prepandemic = dataframe.loc[int(pre1.index[0]):int(pre2.index[-1]),:]
  prepandemic.to_csv("PrePandemic.csv")
  files.download("PrePandemic.csv")

if Pandemic:
  pan1 = dataframe.query(('Year == 2020 and Month == 4 and Date == 20'))
  pan2 = dataframe.query(('Year == 2020 and Month == 5 and Date == 20'))
  pandemic = dataframe.loc[int(pan1.index[0]):int(pan2.index[-1]),:]
  pandemic.to_csv("Pandemic.csv")
  files.download("Pandemic.csv")

In [None]:
#@title Download CSV for specific timeframe
Title = "fromhtml" #@param {type:"string"}
Start_Date = "2020-04-20" #@param {type:"date"}
# int() handles leading zeros itself; stripping "0" would turn month "10" into 1
y1 = int(Start_Date[0:4])
m1 = int(Start_Date[5:7])
d1 = int(Start_Date[8:])
End_Date = "2020-05-20" #@param {type:"date"}
y2 = int(End_Date[0:4])
m2 = int(End_Date[5:7])
d2 = int(End_Date[8:])
 
name = Title + ".csv"
 
frame1 = dataframe.query(('Year == '+str(y1)+' and Month == '+str(m1)+' and Date == '+str(d1)))
frame2 = dataframe.query(('Year == '+str(y2)+' and Month == '+str(m2)+' and Date == '+str(d2)))
 
start_index = int(frame1.index[0])
end_index = int(frame2.index[-1])
 
timeframe = dataframe.loc[start_index:end_index,:]
 
timeframe.to_csv(name)
files.download(name)

In [None]:
#@title Clean Slate
#@markdown Deletes all files from colab when run. 
#@markdown
#@markdown (run AFTER downloading CSVs)
!rm *.*
print("Done and Dusted!")



---

