# Project: “Identification Of Internet Users”

### The problem: for a sequence of web sites visited in a row by the same person, identify that person.

The data is taken from [the article](http://ceur-ws.org/Vol-1703/paper12.pdf) "A Tool for Classification of Sequential Data".

For each user, there is a csv file named user\*\*\*\*.csv (where the asterisks stand for the 4 digits of the user's ID). Site visits are recorded in the following format:

<center>*timestamp, visited website*</center>

Not all operations in this project can be performed in a reasonable amount of time (for example, cross-validating 100 combinations of random forest parameters on this data is slow), so we will use 2 samples in parallel: 10 users and 150 users. We will write and debug code on the 10-user sample and then run the working version on the 150-user sample.

The data is arranged as follows:

 - The 10users directory contains 10 csv files named "user[USER_ID].csv", where [USER_ID] is the user ID;
 - Similarly, the 150users directory contains 150 files.

# Part 1. Preparing data for analysis and building models

The first part of the project is devoted to preparing data for further descriptive analysis and building predictive models.

In [1]:
from scipy.sparse import csr_matrix
from glob import glob
import numpy as np
import pandas as pd
import re
import os
import pickle
import warnings
warnings.filterwarnings('ignore')

**Example of a file with data about web pages visited by the user:**

In [2]:
PATH_TO_DATA = 'capstone_user_identification'

In [3]:
user31_data = pd.read_csv(os.path.join(PATH_TO_DATA, '10users/user0031.csv'))

In [4]:
user31_data.head()

Unnamed: 0,timestamp,site
0,2013-11-15 08:12:07,fpdownload2.macromedia.com
1,2013-11-15 08:12:17,laposte.net
2,2013-11-15 08:12:17,www.laposte.net
3,2013-11-15 08:12:17,www.google.com
4,2013-11-15 08:12:18,www.laposte.net


**Let's formulate a classification problem: identify a user from a session of 10 consecutively visited sites. Each object is a session of 10 sites visited in sequence by the same user, and the features are the indexes of these 10 sites (later they will be replaced by a "bag" of sites, following the Bag of Words approach). The target class is the user ID.**
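To make the notion of a session concrete, here is a minimal sketch of cutting one user's visit list into fixed-length sessions. The helper name `split_into_sessions` and the toy site names are illustrative assumptions, not part of the project code; padding the last session with 0 mirrors what the preparation function in this notebook does.

```python
def split_into_sessions(sites, session_length=10, pad_value=0):
    """Split a flat list of visited sites into fixed-length sessions.

    The last session is padded with pad_value when the number of visits
    is not a multiple of session_length.
    """
    sessions = []
    for start in range(0, len(sites), session_length):
        session = list(sites[start:start + session_length])
        # Pad an incomplete final session up to the required length
        session += [pad_value] * (session_length - len(session))
        sessions.append(session)
    return sessions


# Hypothetical toy data: 5 visits, sessions of length 3
visits = ['a.com', 'b.com', 'a.com', 'c.com', 'a.com']
sessions = split_into_sessions(visits, session_length=3)
print(sessions)
```

With 5 visits and a session length of 3, the second session contains two real visits and one padding value.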

## 1. Preparing a training sample

In [5]:
pattern = r'\d{4}'

def prepare_train_set(path_to_csv_files, session_length=10):
    '''
    Build a table of sessions and a site frequency dictionary.
    
    Parameters:
    - path_to_csv_files - path to the directory with the per-user csv files.
    - session_length - number of sites in one session.
    
    Returns:
    - a DataFrame of user sessions (site ids plus user_id) and a frequency 
    dictionary of the form {'site_string': [site_id, site_freq]}.
    '''
    
    out_dict = dict()    
    out_df = pd.DataFrame()
    
    for usr in glob(os.path.join(path_to_csv_files, 'user*.csv')):        
        sites_list = pd.read_csv(usr).site.to_list() 
        
        # Creating a dictionary
        for site in sites_list:
            if site in out_dict:
                out_dict[site][1] += 1
            else:
                out_dict[site] = [len(out_dict) + 1, 1]        
        
        # Pad with zeros up to a multiple of session_length, then reshape 1D into 2D
        while len(sites_list) % session_length:
            sites_list.append(0)              
        
        sites_list = np.reshape(sites_list, (len(sites_list) // session_length, session_length)).tolist()        
        
        # Append the user ID (the 4 digits in the file name) to every session
        user_id = int(re.findall(pattern, os.path.basename(usr))[0])
        for session in sites_list:
            session.append(user_id)
        
        # Adding data about user sessions to the shared DataFrame
        out_df = pd.concat([out_df, pd.DataFrame(sites_list)], ignore_index=True)
        
    
    # Setting the smallest indexes for the most frequent sites    
    for i, site in enumerate(sorted(out_dict, key=lambda x: out_dict[x][1], reverse=True)):
        out_dict[site][0] = i + 1   
    
    # Replacing sites with indexes
    out_df.columns = ['site' + str(i + 1) for i in range(session_length)] + ['user_id'] 
    out_df = out_df.applymap(lambda x: out_dict[x][0] if x in out_dict else int(x))
    
    return out_df, out_dict     

Features of the function implementation:
* Site indexes are assigned by frequency: the more often a site occurs, the smaller its index.
* Site names are not normalized, i.e. http://www.google.com and www.google.com are considered different sites.
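The frequency-based indexing can be illustrated in isolation. The sketch below is not the notebook's implementation: it uses `collections.Counter` and a hypothetical helper `build_site_index`, but it produces a dictionary of the same shape, `{'site_string': [site_id, site_freq]}`.

```python
from collections import Counter


def build_site_index(sites):
    """Map each site to [site_id, site_freq], with id 1 for the most frequent site."""
    counts = Counter(sites)
    # most_common() sorts by descending frequency
    return {site: [rank, freq]
            for rank, (site, freq) in enumerate(counts.most_common(), start=1)}


# Hypothetical toy data; 'laposte.net' and 'www.laposte.net' stay separate keys
visits = ['www.google.com', 'laposte.net', 'www.google.com',
          'www.laposte.net', 'www.google.com', 'laposte.net']
site_index = build_site_index(visits)
print(site_index)
```

Because names are compared as plain strings, the two laposte.net variants receive different ids, just as described in the second bullet.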

In [6]:
train_data_10users, site_freq_10users = prepare_train_set(os.path.join(PATH_TO_DATA, '10users'), session_length=10)

In [7]:
train_data_10users.head()

Unnamed: 0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,user_id
0,192,574,133,3,133,133,3,133,203,133,31
1,415,193,674,254,133,31,393,3305,217,55,31
2,55,3,55,55,5,293,415,333,897,55,31
3,473,3306,473,55,55,55,55,937,199,123,31
4,342,55,5,3307,258,211,3308,2086,675,2086,31


In [8]:
site_freq_10users

{'fpdownload2.macromedia.com': [192, 88],
 'laposte.net': [574, 23],
 'www.laposte.net': [133, 119],
 'www.google.com': [3, 5441],
 'match.rtbidder.net': [203, 84],
 'x2.vindicosuite.com': [415, 37],
 'rp.gwallet.com': [193, 88],
 'pool-eu-ie.creative-serving.com': [674, 18],
 'dl.javafx.com': [254, 65],
 'ajax.googleapis.com': [31, 711],
 'api.dailymotion.com': [393, 40],
 'i1-js-14-3-01-11074-266576264-i.init.cedexis-radar.net': [3305, 1],
 'limelight.cedexis.com': [217, 78],
 'webmail.laposte.net': [55, 399],
 'www.facebook.com': [5, 4141],
 'rubicon-match.dotomi.com': [293, 55],
 'pr.ybp.yahoo.com': [333, 47],
 'dtm.ccs.com': [897, 12],
 'b12.myspace.com': [473, 31],
 'i1-js-14-3-01-11074-845302217-i.init.cedexis-radar.net': [3306, 1],
 ...}

In [9]:
print('Number of unique sessions from 10 sites in a sample with 10 users:', len(train_data_10users))
print('Number of unique sites in a web sample of 10 users:', len(site_freq_10users))

Number of unique sessions from 10 sites in a sample with 10 users: 14061
Number of unique sites in a web sample of 10 users: 4913


In [10]:
%%time
train_data_150users, site_freq_150users = prepare_train_set(os.path.join(PATH_TO_DATA, '150users'), session_length=10)

Wall time: 6.96 s


In [11]:
print('Number of unique sessions from 10 sites in a sample with 150 users:', len(train_data_150users))
print('Number of unique sites in a web sample of 150 users', len(site_freq_150users))

Number of unique sessions from 10 sites in a sample with 150 users: 137019
Number of unique sites in a web sample of 150 users 27797


In [12]:
# Top 10 most popular sites visited by 150 users
sorted(site_freq_150users, key=lambda x: site_freq_150users[x][1], reverse=True)[:10]

['www.google.fr',
 'www.google.com',
 'www.facebook.com',
 'apis.google.com',
 's.youtube.com',
 'clients1.google.com',
 'mail.google.com',
 'plus.google.com',
 'safebrowsing-cache.google.com',
 'www.youtube.com']

For further analysis, we will save the resulting DataFrame objects to csv files:

In [13]:
train_data_10users.to_csv(os.path.join(PATH_TO_DATA, 'train_data_10users.csv'), index_label='session_id', float_format='%d')
train_data_150users.to_csv(os.path.join(PATH_TO_DATA, 'train_data_150users.csv'), index_label='session_id', float_format='%d')
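As a quick check of the file layout, the following sketch writes a tiny table the same way (with `index_label='session_id'`) and reads it back. It uses an in-memory buffer instead of a file, and the `demo` frame is made-up illustration data rather than the real session table.

```python
import io

import pandas as pd

# Hypothetical miniature session table
demo = pd.DataFrame({'site1': [192, 415], 'site2': [574, 193], 'user_id': [31, 31]})

buffer = io.StringIO()
demo.to_csv(buffer, index_label='session_id')  # the index is written as a 'session_id' column
buffer.seek(0)

# Reading with index_col restores session_id as the index
restored = pd.read_csv(buffer, index_col='session_id')
print(restored)
```

Passing `index_col='session_id'` when reading keeps the session number out of the feature columns.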

## 2. Working with sparse data format

The features *site1*, ..., *site10* do not make sense as numeric features in the classification problem, because a site index is an identifier, not a quantity. The bag-of-words idea from text analysis is a better fit. We will create new matrices in which rows correspond to sessions of 10 sites and columns correspond to site indexes. The entry at the intersection of row $i$ and column $j$ is $n_{ij}$, the number of times site $j$ occurs in session $i$.
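A small worked example may help here. The sketch below builds the count matrix for two made-up sessions of length 3 (the `sessions` array is an assumption for illustration), using the incremental `(data, indices, indptr)` form of `scipy.sparse.csr_matrix`. Site id 0 is treated as padding and skipped.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical sessions: site 1 occurs twice in the first session,
# the second session has a single real visit followed by padding zeros
sessions = np.array([[1, 2, 1],
                     [3, 0, 0]])

indptr, indices, data = [0], [], []
for session in sessions:
    for site in session:
        if site == 0:
            continue                # padding, not a site
        indices.append(site - 1)    # site id 1 goes to column 0
        data.append(1)
    indptr.append(len(indices))     # marks the end of the current row

X = csr_matrix((data, indices, indptr))
dense = X.toarray()                 # duplicate entries are summed into counts
print(dense)
```

Row 0 of the dense view holds the counts for the first session: two visits to site 1 and one to site 2.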

In [14]:
def sparse_matr_gen(X):
    '''Convert a 2D array of site ids into a sparse session-by-site count matrix.'''
    indptr = [0]
    indices = []
    data = []

    for session in X:
        for site in session:
            if site == 0:
                continue  # id 0 is session padding, not a site
            indices.append(site - 1)  # shift ids so that site id 1 goes to column 0
            data.append(1)
        indptr.append(len(indices))
        
    # A site repeated within a session yields duplicate entries that add up to its count
    return csr_matrix((data, indices, indptr))

In [15]:
X_10users, y_10users = train_data_10users.iloc[:, :-1].values, train_data_10users.iloc[:, -1].values
X_150users, y_150users = train_data_150users.iloc[:, :-1].values, train_data_150users.iloc[:, -1].values

In [16]:
X_sparse_10users = sparse_matr_gen(X_10users)
X_sparse_150users = sparse_matr_gen(X_150users)

**We save these sparse matrices and the vectors *y_10users* and *y_150users* with the target values (user IDs) for the samples of 10 and 150 users using [pickle](https://docs.python.org/2/library/pickle.html). We also save the site frequency dictionaries for 10 and 150 users.**

In [17]:
with open(os.path.join(PATH_TO_DATA, 'X_sparse_10users.pkl'), 'wb') as X10_pkl:
    pickle.dump(X_sparse_10users, X10_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'y_10users.pkl'), 'wb') as y10_pkl:
    pickle.dump(y_10users, y10_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'X_sparse_150users.pkl'), 'wb') as X150_pkl:
    pickle.dump(X_sparse_150users, X150_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'y_150users.pkl'), 'wb') as y150_pkl:
    pickle.dump(y_150users, y150_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'site_freq_10users.pkl'), 'wb') as site_freq_10users_pkl:
    pickle.dump(site_freq_10users, site_freq_10users_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'site_freq_150users.pkl'), 'wb') as site_freq_150users_pkl:
    pickle.dump(site_freq_150users, site_freq_150users_pkl, protocol=2)
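The saved objects can be loaded back with `pickle.load`. The following round-trip sketch uses a temporary directory and small stand-in objects (`X_demo`, `y_demo` are assumptions, not the real matrices), so it can be run without the project data.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for the real sparse matrix and target vector
X_demo = csr_matrix(np.array([[2, 1, 0], [0, 0, 1]]))
y_demo = np.array([31, 33])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, 'X_demo.pkl')
    y_path = os.path.join(tmp_dir, 'y_demo.pkl')

    # Save with the same protocol as in the notebook
    with open(x_path, 'wb') as x_pkl:
        pickle.dump(X_demo, x_pkl, protocol=2)
    with open(y_path, 'wb') as y_pkl:
        pickle.dump(y_demo, y_pkl, protocol=2)

    # Load the objects back
    with open(x_path, 'rb') as x_pkl:
        X_loaded = pickle.load(x_pkl)
    with open(y_path, 'rb') as y_pkl:
        y_loaded = pickle.load(y_pkl)

same_matrix = np.array_equal(X_demo.toarray(), X_loaded.toarray())
same_labels = np.array_equal(y_demo, y_loaded)
print(same_matrix, same_labels)
```

A check like this is cheap insurance before later notebooks rely on the pickled files.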

## Result
* Created session tables for the samples of 10 and 150 users.
* Converted the original site-index features to count features that make sense for the classification task.

In the next part of the project (part2_analysis_hypotheses.ipynb), we will continue to prepare data, as well as test some hypotheses related to our observations.