In [1]:
import re
import json
import langid
import requests
import pandas as pd


from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
from morfeusz2 import Morfeusz
from advertools import url_to_df 
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

requests.packages.urllib3.disable_warnings()

BROWSING_HISTORY_JSON = 'data/BrowserHistory.json'
WORDBAGS_JSON = 'data/BrowsingHistoryWordbags.json'
CHROMEDRIVER_EXEC = r'C:\Users\Szymon\Documents\Kod\LocalPathVariables\Chromedriver_bin\chromedriver.exe'

# Data Preparation

The goal is to analise the timeseries arising from my web browser history in order to predict my usual daily routine. Along the way we'll make use of a clustering algorithm for the purpose of automatically labeling the visited sites as laisure-oriented (for example: youtube, facebook or coub), math- or programming- oriented (stack overflows, various documentations, wikipedia) etc. 

This posts a demand for two types of data: the site's contents as a document and a timeseries of visited site types with accordance to timestamps. Proper gathering and preparation of both of them will be challanging.

First off with the sites. As we'll feed it to a clustering alghoritm via some document vectorisation method, we need consider the following:

1. The less sparse the similarity matrix (made up of documents' feature vectors) the better the clustering algorithm will perform, and as each feature vector entrance is associated with one word (globally, since every document's feature vector has to "make a spot" for its own value for that word), we should choose only the most important of them. Our sense measure of importance will be dictated by:
    - The need to represent site's visible content, not the structure. We need to get rid of any code contained in downloaded .html file.
    - Our decision not to take context into account (for simplicity, as we feel confident that a general document clustering will do), so we don't need to include interpunction or special characters in general.
    - The need to reduce noise inside the document. We'll filter for stopwords.
    - Generality of a word. We should count different forms of one word as several occurences of one word. We will stem each word.

2. Downloading should be performed in a way optimising the end-goal timeseries, so:
    - We need to get rid of noise in the browsing history. We should omit sites that don't meet a visit-frequency threshold.
    - We should get rid of repetitions. We can do it in two ways:
        + We can create "domain buckets" that represent a domain with its various paths as one entrance (to be explained in details later) along with a general content profile of each. It serves not only to reduce the download, but to make it more stable in clustering process.
        + By skipping the lenghty lurking process on one domain bucket and leave just one representative, later to fill the gap between timestamps with that copies of it.
    - We need to drop records of sites that we cannot access.
    
After such processing, we should achieve a nice, dense similarity matrix.

Additionally we'll make a comparison with a clustering based on pages' titles. Their preparation will be same as for the sites the scope of the second point above, but word filtering and generalisation won't be so restrictive.

## Importing the browsing history

We'll import the data spanning from 19.10.2021 untill (but not exactly) 19.03.2022, which marks the beginning of this project.

After reading the data into a dataframe we're limiting it to chosen timespan and accessable URL's (beginning with "http" / "https"). We're also leaving the columns of interest, it is the title, url and time_usec.

We then reset the indexing to match the current record count. It's worth noting that we begin at 0.

In [2]:
data = json.load(open(BROWSING_HISTORY_JSON, encoding="utf-8"))
df = pd.json_normalize(data, record_path=['Browser History'])
df.drop(df[df.time_usec < 1634601600000000].index, inplace=True)
df.drop(df[~df.url.str.contains('http://') & ~df.url.str.contains('https://')].index, inplace=True)
df = df[['title', 'url', 'time_usec']]
df.index = [i for i in range(0, len(df))]

## Partitioning of URL's

As we want to separate the domain form the whole address we use advertools.url_to_df() function, which creates a dataframe object where each row gets an entry for a specific url seciton. We'll only use the netloc field, as it's the domain we look after.

In [3]:
df_add = url_to_df(df.url)[['netloc']]
df = df.join(df_add)

## Volume reduction

In this section we reduce the overall volume of our dataset for the purpose of noise cutting. We're implementing the solutions derived at the beginning.

### Limiting repetitions

As stated before, we're cutting short the lurking peroids, thus getting closer to representing the actual "entrances". We behave in line with an assumption that can't switch between multiple browser cards and we ignore situations where I could potentially change the sitetype by switching to a card opened some time ago. It will come into play later, while modeling the timeseries.

I decided to treat all google search queries as noise. I share the view that it may be a bold move to think that a great deal of potential information isn't lost this way, but I justify it by saing that even if we traet each query as a separate entrance (instead of bucketing under google.com domain) each included page may greately broaden the wordspace we'll have to choose features for.

Later we drop marked rows along with the temporary "del" marking column.

It's also worth to create a table consisting of browsing history share per domain before removing repetitions for the sake of later analysis.


In [4]:
df_amounts = df['netloc'].value_counts() #Zapisujemy ogólne ilości odwiedzin stron przed ich wywaleniem
df['del'] = [False if (df.netloc[i] == df.netloc[i+1] or df.netloc[i] == 'www.google.com') else True
             for i in range(0, len(df.values)-1)] + [True] #Tylko pojedyncze wpisy z bloku powtórek względem całej domeny i won z google, może nawet z translatem
df = df[df['del']][['time_usec', 'title', 'url', 'netloc']] #Wyrzucamy kolumnę del

### Reduction of outliers

I decided to drop all entries about sites that I've visited not more than 23 times during the last 6 months. I'd be hard to speak about a routine if I didn't visit a site at least 5 times per month, co it'd be reasonable to cut it off here.

In [5]:
df = df[df['netloc'].map(df['netloc'].value_counts()) > 23] #Redukcja stron gdzie witałem (i spędzałem posiedzenie) mniej niż 5 razy na miesiąc ~ 25

## Downloading

Buckets dataframe is created to keep just two types of information: the domain name and a wordbag field, to which we will append the trimmed and prepared contents of its subdirectiories that I've visited. In order to cut down on the amount of iterations we create a set called history which will store full URL's of sites whose content has been already downloaded.

To keep it even simpler we only allow ourselves to include sites whose response code is equal to 200 (meaning connection without exceptions). It later tourns out that few sites, ylilauta.org to give an example, is protected against webcrawling bots, so we'll have to label them manually.

The process of stripping the text has three steps. Firstly we strip it out of html tags, secondly we analise each character to keep only normal letters.
Secondly each word is getting checked for appearance of uppercase letters inside of it.
There's a large probability that those would be leftovers of variable names from code. Additionally we specify to keep them in the length range between 4 and 20 characters, lowercasing them all afterwards.
It's all finished by stemming with the use of two lemmatisation libraries. Morfeusz() to be used for polish and a PorterStemmer for english, usage of which is regulated by classify() function from langid library applied to first 20 words from the site.

In [6]:
df_buckets = pd.DataFrame({'netloc' : df['netloc'].unique(),  #Tworzymy tablicę wiader
                           'wordbag': ''})
history = set()

op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(executable_path=CHROMEDRIVER_EXEC, options=op)

for _, row in df.iterrows():
    
    if row['url'] not in history:
        
        sc = 0
        history.add(row['url'])

        try:
            page = requests.get(row['url'], verify=False)
            sc = page.status_code
        except: 
            pass
        if sc == 200:
            print(f'Conducting {row.url} ...')
            
            driver.get(row['url'])
            for _ in range(10):
                sleep(0.35)
                driver.execute_script('return scrollBy(0, 400);')

            soup =  BeautifulSoup(driver.page_source, 'html.parser')

            text = ''.join(i for i in soup.stripped_strings)

            words_lst = ''.join(e if (e.isalnum() or e == ' ') and not e.isdigit() else ' ' #Analiza znaków
                                for e in text).split() 
            words_lst = [e.lower() for e in words_lst #Analiza słów
                            if not any([bool(re.match(r'.\w*[A-Z]\w*', e)), len(e)>20, len(e)<4])] 

            if langid.classify(' '.join(words_lst[:20]))[0] == 'pl':
                stemmer = Morfeusz()
                text = ' '.join(next(iter(stemmer.analyse(e)))[2][1].split(':')[0] for e in words_lst if not e in stopwords.words('polish')) 
            else:
                stemmer = PorterStemmer()
                text = ' '.join(stemmer.stem(e) for e in words_lst if not e in stopwords.words('english')) #Stemming

            if row['netloc'] in df_buckets['netloc'].values:
                df_buckets.loc[df_buckets['netloc'] == row['netloc'], 'wordbag'] += f' {text} '
driver.quit()


  driver = webdriver.Chrome(executable_path=CHROMEDRIVER_EXEC, options=op)


Conducting https://www.youtube.com/watch?v=Aff2g5mVt1Q ...


2022-03-27 16:37:48,807 | INFO | langid.py:162 | load_model | initializing identifier


Conducting https://www.youtube.com/ ...
Conducting https://www.linkedin.com/notifications/ ...
Conducting https://www.linkedin.com/ ...
Conducting https://www.instagram.com/stories/venturebit/2796180259922359277/ ...
Conducting https://www.instagram.com/ ...
Conducting https://login.uj.edu.pl/login?service=https%3A%2F%2Fwww.usosweb.uj.edu.pl%2Fkontroler.php%3F_action%3Dlogowaniecas%2Findex&locale=pl ...
Conducting https://en.wikipedia.org/wiki/Data_mining ...
Conducting https://en.wikipedia.org/wiki/Statistical_learning_theory ...
Conducting https://nofluffjobs.com/pl/praca-it ...
Conducting https://www.kaggle.com/learn ...
Conducting https://en.wikipedia.org/wiki/Continuous_uniform_distribution ...
Conducting https://en.wikipedia.org/wiki/Iterated_integral ...
Conducting https://en.wikipedia.org/wiki/Central_limit_theorem ...
Conducting https://pl.wikipedia.org/wiki/Wariancja ...
Conducting https://en.wikipedia.org/wiki/Monte_Carlo_method ...
Conducting https://en.wikipedia.org/wiki/V

In [7]:
df_buckets

Unnamed: 0,netloc,wordbag
0,www.youtube.com,wojna trwać dzień Rosjanin walczyć mariupol c...
1,ylilauta.org,
2,www.linkedin.com,zarejestrować strona trzeci klient partner do...
3,www.instagram.com,relacj wyświetlić telefonu nazwa użytkownika ...
4,www.usosweb.uj.edu.pl,piast akademik uniwersytet jagielloński menu ...
5,login.uj.edu.pl,punkt logować uniwersytet jagielloński inform...
6,en.wikipedia.org,data mine wikipedia free extract discov patte...
7,nofluffjobs.com,praca praca programista nowy oferta praca flu...
8,www.kaggle.com,learn python data panda tutori notebookt data...
9,www.overleaf.com,overleaf overleaf onlin email password incorr...


### Saving

We save it in json format for use in separate script for analysis.

In [8]:
df_buckets.to_json(WORDBAGS_JSON)