# Mouse dynamics 
## User identification based on mouse activity fingerprint
### Author: Nejm Jaafar - Data scientist 

<h3> Introduction:</h3>


In this notebook, we will analyze mouse event data with the aim of identifying the user. <br>Mouse Dynamics is a behavioral biometrics technology used to validate a user’s identity by analyzing unique patterns - such as tiny hand motions - detected in the user’s interaction with their mouse or pointer. Mouse dynamics algorithms interpret the data gathered from a mouse or pointer to build a unique user profile. In authentication, this means validating someone’s identity by the way they use their mouse.<br>
<h4> Datasets:</h4>

• Train_Mouse.csv: In this dataset, the mouse events data from 20 users are collected.<br>
The mouse events are collected during user interaction with a demo banking application. Each user is asked to do 6 sessions and for each session, the data of mouse events are collected. Each row of the dataset represents a single mouse event and consists of uid (unique identifier for each mouse event), session_id (session identifier), user_id, timestamp, event_type (mouse movement type), screen_x and screen_y (coordinates of the mouse event).<br>
Every row of the mouse event data has an event_type column that shows what kind of mouse event the row represents. However, a single row does not always represent a whole user action; an action is often recorded as several consecutive rows. <br>For example, a drag action is an event_type 4 (drag) followed by an indeterminate number of event_type 2 (move) rows, ending with an event_type 1 (release).<br><br>
The recorded event_type values are: 
- 1 = release
- 2 = move 
- 3 = wheel 
- 4 = drag
- 5 = click

And the different actions that can occur are:
- click – one click event_type (5) followed by one release event_type (1)
- move – an indeterminate number of move event_type rows (2)
- drag – one drag event_type (4) followed by an indeterminate number of move event_type rows (2), ending with one release event_type (1)
- wheel – an indeterminate number of wheel event_type rows (3)
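The action definitions above can be turned into a small segmentation routine. The following is only a sketch under my own assumptions: the function name `segment_actions` and the `orphan_release` label are hypothetical, and it works on a plain list of event_type codes that is already sorted by timestamp within one session.

```python
from typing import List, Tuple

RELEASE, MOVE, WHEEL, DRAG, CLICK = 1, 2, 3, 4, 5


def segment_actions(event_types: List[int]) -> List[Tuple[str, int]]:
    """Group a session's event_type codes into (action, number_of_rows) pairs."""
    actions = []
    i, n = 0, len(event_types)
    while i < n:
        ev = event_types[i]
        start = i
        if ev == CLICK:
            # click: one click row, optionally closed by one release row
            i += 1
            if i < n and event_types[i] == RELEASE:
                i += 1
            actions.append(("click", i - start))
        elif ev == DRAG:
            # drag: one drag row, any number of move rows, then one release row
            i += 1
            while i < n and event_types[i] == MOVE:
                i += 1
            if i < n and event_types[i] == RELEASE:
                i += 1
            actions.append(("drag", i - start))
        elif ev in (MOVE, WHEEL):
            # move / wheel: a run of identical rows
            while i < n and event_types[i] == ev:
                i += 1
            actions.append(("move" if ev == MOVE else "wheel", i - start))
        else:
            # a release with no preceding click or drag (hypothetical label)
            i += 1
            actions.append(("orphan_release", 1))
    return actions


example = [2, 2, 5, 1, 4, 2, 2, 2, 1, 3, 3]
print(segment_actions(example))
```

Counting rows per action rather than per raw event is one possible way to build action-level features; the rest of this notebook works on raw rows instead.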

• Test_Mouse.csv: This dataset has the same structure as the Train dataset except that the user_id is not included.
<h4> Objective:</h4>

Determine UserID for each session_id in “Test_Mouse.csv” dataset based on collected mouse events data.

<h4> Content:</h4>

We will go through 6 major parts:
- EDA, which replays the mouse path on the screen (spatial analysis) and examines the timing of each user's sessions (temporal analysis).
- Feature engineering, during which I derive indicative features from the X-Y-T data, such as slope, tightness, centers, and event frequencies.
- Model training, which covers the dataset transformations and the fitting of a multilayer perceptron classifier (neural network).
- Evaluation of the model using the F1 score and a confusion matrix, followed by cross validation to try to reduce over-fitting.
- Generating the user IDs for the Test_Mouse.csv file.
- Conclusion and perspectives.

We will use <b>Spark</b> for this implementation.



##### import libraries

In [None]:
!pip install pyspark

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sn
import hashlib
import binascii
import math
from sklearn.decomposition import PCA as skPCA
from pyspark.ml.feature import PCA as spPCA
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, TrainValidationSplit, ParamGridBuilder, TrainValidationSplitModel
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml.classification import MultilayerPerceptronClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.types import StructType, StringType, FloatType, LongType, IntegerType, StructField
from pyspark.sql import HiveContext, SparkSession
from pyspark.sql.functions import countDistinct, array_distinct, col, isnan, when, count, lit, array

plt.rcParams["figure.autolayout"] = True


### EDA

In [None]:
# Initialize spark context
spark = SparkSession.builder.getOrCreate()

In [None]:
schemaMousePos = StructType([
    StructField('uid', StringType(), False,),
    StructField('session_id', StringType(), False),
    StructField('user_id', StringType(), True),
    StructField('timestamp', LongType(), False),
    StructField('event_type', IntegerType(), False),
    StructField('screen_x', FloatType(), False),
    StructField('screen_y', FloatType(), False)
])
trainDs = spark.read.csv('/kaggle/input/mouse-dynamics-for-user-authentication/Train_Mouse.csv',header=True, schema=schemaMousePos)
trainDs.printSchema()

In [None]:
# Make sure each session has only 1 user
trainDs.groupBy('session_id').agg(countDistinct('user_id').alias('distinct_uids_per_session')).agg({'distinct_uids_per_session':'max'}).show()


In [None]:
# let's check if the data is imbalanced
trainDs.groupBy('user_id').agg(countDistinct('session_id')).show()


In [None]:
# let's check for nones
trainDs.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in trainDs.columns]).show()

##### Ok perfect. The dataset looks balanced, free of missing values, and there is one user per session. Now let's draw some plots to help us understand the content of the dataset.

##### Here I will draw a color legend for the event transitions, which we'll use in the upcoming plots.

In [None]:

plt.rcParams["figure.figsize"] = [5, 5]
eventMap = {1:'release', 2:'move', 3 : 'wheel', 4:'drag', 5 : 'click'}
# set colormap
colorsRevDict = {'#'+hashlib.md5((('{}-{}'.format(i, j))*16).encode()).hexdigest()[:6] : '{} -> {}'.format(eventMap[i],eventMap[j])  for i in range(1,6) for j in range(1,6)} 
soa = np.array([[0,i,1,j-i] for i in range(1,6) for j in range(1,6)])
X, Y, U, V = zip(*soa)
plt.figure()
ax = plt.gca()
# generate unique color for each transition, I'll simply use hash because I'm too lazy to create a linear distribution for colors
colors = ['#'+hashlib.md5((('{}-{}'.format(i, j))*16).encode()).hexdigest()[:6] for i in range(1,6) for j in range(1,6)]
ax.quiver(X, Y, U, V, angles='xy', scale_units='xy', scale=1, color=colors, linewidth=0.3)
ax.set_xlim([-1,2])
ax.set_ylim([0,6])
plt.draw()
plt.show()
print(eventMap)
print(colorsRevDict)

##### Now let's look at the users' screen movements in depth

In [None]:

plt.rcParams["figure.figsize"] = [30, 15]
df = trainDs.toPandas().sort_values('timestamp')
# usersEncoder maps each user_id string to a small integer index
usersEncoder = {k:i for i,k in enumerate(trainDs.select('user_id').rdd.flatMap(lambda x: x).distinct().collect())}
screenDims = ((df['screen_x'].min(),df['screen_x'].max()), (df['screen_y'].min(),df['screen_y'].max()))
for userId in usersEncoder.keys(): # df['user_id'].unique():
    portionDf = df[df['user_id']==userId]
    print(usersEncoder[userId], userId)
    for session in portionDf['session_id'].unique():
        portionDfSession = portionDf[portionDf['session_id']==session]
        XYs = np.array([(k[1].screen_x, k[1].screen_y) for k in portionDfSession.iterrows()]).astype(float) # xy
        evs = [k[1].event_type for k in portionDfSession.iterrows()] # events
        tss = [int(k[1].timestamp) for k in portionDfSession.iterrows()] # timestamps
        soa = np.array([[XYs[i][0],XYs[i][1], XYs[i+1][0]-XYs[i][0],XYs[i+1][1]-XYs[i][1]] for i in range(len(XYs)-1)])
        tsd = np.array([tss[i+1]-tss[i] for i in range(len(tss)-1)]).astype(int)
        
        X, Y, U, V = zip(*soa)
        plt.figure()
        ax = plt.gca()
        colors = ['#'+hashlib.md5((('{}-{}'.format(evs[i], evs[i+1]))*16).encode()).hexdigest()[:6] for i in range(len(evs)-1)]
        q = ax.quiver(X, Y, U, V, angles='xy', scale_units='xy', scale=1, color=colors, width=0.001) #, label=colors) 
        ax.set_xlim([screenDims[0][0]-100,screenDims[0][1]+100])
        ax.set_ylim([screenDims[1][0]-100,screenDims[1][1]+100])
        custom_lines = [Line2D([0], [0], color=c, lw=4) for c in set(colors)]
        ax.legend(custom_lines, [colorsRevDict[c] for c in set(colors)])
        
        plt.draw()
        plt.show()

We notice that some users' paths follow a roughly linear shape (for example 2, user_id -2416201413375524068), while others spread over a more square-shaped area (4). 
<br> Placement on the screen differs too: user 8, for example, tends to stay on the right side. 
<br> We also notice that some users tend to scroll with the wheel (11), while others seem to ignore the wheel entirely (6) and prefer dragging instead. 
<br> Finally, since clicks usually happen at screen spots that matter to the user, click positions should be used as a feature too.
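The "linear versus square-shaped" observation can be quantified as the share of variance that lies along the main axis of the point cloud, which is the idea behind the `narrow` feature computed with PCA in the feature engineering part. Below is a minimal sketch that needs no external library: for 2-D points the largest eigenvalue of the covariance matrix has a closed form. The function name `first_axis_variance_ratio` is my own, and returning 1.0 for a degenerate cloud is an arbitrary convention.

```python
import math
from typing import Sequence


def first_axis_variance_ratio(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Share of total variance explained by the first principal axis of 2-D points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points and equally long lists")
    mx, my = sum(xs) / n, sum(ys) / n
    # entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    trace = sxx + syy
    if trace == 0:
        return 1.0  # all points coincide; arbitrary convention for this sketch
    # largest eigenvalue of a symmetric 2x2 matrix, in closed form
    lam_max = trace / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return lam_max / trace


line_x = [float(i) for i in range(10)]
line_y = [2 * x + 1 for x in line_x]          # points on a straight line
square_x = [0.0, 0.0, 1.0, 1.0]
square_y = [0.0, 1.0, 0.0, 1.0]               # corners of a square

print(first_axis_variance_ratio(line_x, line_y))
print(first_axis_variance_ratio(square_x, square_y))
```

A perfectly straight path gives a ratio of 1, while the four corners of a square give 0.5, so values close to 1 indicate slim, line-like sessions.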


##### Now let's analyse the time-series behaviour too.

In [None]:
# <<-- unscaled timeline || log-log timeline -->>
plt.rcParams["figure.figsize"] = [30, 15]
for userId in usersEncoder.keys():
    portionDf = df[df['user_id']==userId]
    print(usersEncoder[userId], userId)
    for session in portionDf['session_id'].unique():
        portionDfSession = portionDf[portionDf['session_id']==session]
        evs = [int(k[1].event_type) for k in portionDfSession.iterrows()] # events
        tss = [int(k[1].timestamp) for k in portionDfSession.iterrows()] 
        # let's plt also the LogLog since, well, some users enjoy taking long breaks..
        tss1 = [math.log(math.log(10000+int(k)-tss[0])) for k in tss]
        f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
        ax1.plot(tss,evs) # base
        ax2.plot(tss1,evs) # loglog
        plt.show()

Some users act (click) uniformly throughout the session (13), while others act only in the middle (12) or at the beginning/end of the session.
<br> Some actions are more concentrated in time than others: some users perform 'stress' actions such as series of fast clicks, while others are relaxed and click only when it is really necessary.
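One way to put numbers on this "stressed versus relaxed" behaviour is to look at the gaps between consecutive timestamps. The sketch below is illustrative only: `gap_summary` and `burst_threshold` are hypothetical names, and the threshold is expressed in whatever unit the timestamp column uses, which the dataset description does not state.

```python
from typing import Dict, List


def gap_summary(timestamps: List[int], burst_threshold: int) -> Dict[str, float]:
    """Summarise the gaps between consecutive events of one session."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        # fewer than two events: nothing to measure
        return {"min_gap": 0, "max_gap": 0, "burst_share": 0.0}
    # a gap at or below the threshold counts as part of a fast burst
    bursts = sum(1 for g in gaps if g <= burst_threshold)
    return {
        "min_gap": min(gaps),
        "max_gap": max(gaps),
        "burst_share": bursts / len(gaps),
    }


print(gap_summary([0, 10, 20, 1000, 1010], burst_threshold=50))
```

The smallest and largest gaps correspond to the `stress` and `chill` features built in the next section, while the share of short gaps is an extra candidate feature that is not used in this notebook.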


### FEATURE ENGINEERING

In [None]:
# first we define the schema for the training dataset
schemaFeatures = StructType([
    StructField('session_id', StringType(), False),
    StructField('user_id', StringType(), True),
    StructField('user_enc', FloatType(), True),
    StructField('center_x', FloatType(), False),
    StructField('center_y', FloatType(), False),
    StructField('center_click_x', FloatType(), False),
    StructField('center_click_y', FloatType(), False),
    StructField('first_x', FloatType(), False),
    StructField('first_y', FloatType(), False),
    StructField('radius', FloatType(), False),
    StructField('slope', FloatType(), False),
    StructField('narrow', FloatType(), False),
    StructField('ev1', FloatType(), False),
    StructField('ev2', FloatType(), False),
    StructField('ev3', FloatType(), False),
    StructField('ev4', FloatType(), False),
    StructField('ev5', FloatType(), False),
    StructField('stress', IntegerType(), False),
    StructField('chill', IntegerType(), False),
    StructField('nbpoints', IntegerType(), False),
    
])

In [None]:
def featurize(recordsIn): 
    session_id, user_id = recordsIn[0]
    records = recordsIn[1]
    
    center = (lambda axisList: sum(axisList)/len(axisList))
    maxRadius = (lambda xc,yc,xList,yList: max([math.sqrt((xi-xc)**2+(yi-yc)**2) for xi, yi in zip(xList,yList)])) 
    eventRatio = (lambda evKey, allEvents: len([1 for e in allEvents if e==evKey])/len(allEvents))
    
    # to be more precise, min/max duration of inter-events
    minSpeed = (lambda timestamps: min([timestamps[i+1]-timestamps[i] for i in range(len(timestamps)-1)]))
    maxSpeed = (lambda timestamps: max([timestamps[i+1]-timestamps[i] for i in range(len(timestamps)-1)]))
    
    def slope(xList, yList): # the overall curve direction
        x_avg = sum(xList)/len(xList)
        y_avg = sum(yList)/len(yList)
        u=sum([(xi-x_avg)*(yi-y_avg) for xi, yi in zip(xList,yList)])
        d=sum([(xi-x_avg)**2 for xi in xList])
        return u/d
    
    def narrow_spark(xList, yList): # not used here: featurize runs on a worker, where a nested Spark job is not available, so the sklearn version below is used instead. This one can still work in standalone mode
        spark = SparkSession.builder.getOrCreate()
        data = [(Vectors.dense([xi,yi]),) for xi, yi in zip(xList, yList)]
        df = spark.createDataFrame(data,["features"])
        pca = spPCA(k=1, inputCol="features")
        model = pca.fit(df)
        return model.explainedVariance[0]
    
    def narrow_sklearn(xList, yList): # determine how compact the curve is, i.e. whether it is line-shaped or square-shaped
        X = np.array([[xi,yi] for xi, yi in zip(xList, yList)])
        pca = skPCA(n_components=1)
        pca.fit(X)
        return pca.explained_variance_ratio_[0]
    
    xList = [record['screen_x'] for record in records]
    yList = [record['screen_y'] for record in records]
    # barycenter of all mouse registered positions
    centerX = center(xList)
    centerY = center(yList)
    
    # clicks come with interesting spots. let's use their 'barycenter'
    centerClickX = center((lambda x: x if x else [0])([record['screen_x'] for record in records if record['event_type']==5]))
    centerClickY = center((lambda x: x if x else [0])([record['screen_y'] for record in records if record['event_type']==5]))
    
    # The first move is always precious! it reflects the unconscious mind of the user once holds the mouse
    firstX = xList[0]
    firstY = yList[0]
    
    # how much space the user takes from the screen (as if we'll put all points inside an imaginary circle)  
    tangentCircleRadius = maxRadius(centerX,centerY,xList,yList)
    
    # curve curvature
    slop = slope(xList, yList)
    nar = float(narrow_sklearn(xList, yList))

    allEvents = [record['event_type'] for record in records]
    # frequency of each event
    ev1,ev2,ev3,ev4,ev5 = [eventRatio(i, allEvents) for i in range(1,6)]
    # how relaxed is the user
    stress = minSpeed(sorted([record['timestamp'] for record in records if record['event_type']==2]))
    chill = maxSpeed(sorted([record['timestamp'] for record in records if record['event_type']==2])) # maybe we need to apply log here, since some users take long breaks..
    # some users use the mouse more often than others
    nbpoints = len(xList)
    
    # TODO: maybe we will need to add more temporal features later, like time center of actions, speed, acceleration, etc.
    
    if user_id:
        userEnc = float(usersEncoder[user_id]) 
    else:
        userEnc = None # will not be used since it's to be predicted
    return session_id, user_id, userEnc, centerX, centerY, centerClickX, centerClickY, firstX, firstY, tangentCircleRadius, slop, nar, ev1, ev2, ev3, ev4, ev5, stress, chill, nbpoints


In [None]:
featuresDataframe = spark.createDataFrame(
    trainDs.rdd.groupBy(lambda x: (x['session_id'], x['user_id'])).map(featurize), schema=schemaFeatures
)
featuresDataframe.show()

##### Let's have a look at some features

In [None]:
df_featued = featuresDataframe.toPandas()
for userId in usersEncoder.keys():
    portionDf = df_featued[df_featued['user_id']==userId]
    for session in portionDf['session_id'].unique():
        portionDfSession = portionDf[portionDf['session_id']==session]
        plt.scatter([1,2,3,4,5],portionDfSession[['ev1','ev2','ev3','ev4','ev5']])
    print(usersEncoder[userId], userId)
    plt.show()

##### We can distinguish the user 0 from 17 only by looking at event type ev3!


In [None]:
# TODO: visualize more features

### MODEL TRAINING

In [None]:
# we compose our features vector column
in_col = ['center_x', 'center_y', 'center_click_x', 'center_click_y', 'first_x', 'first_y', 'radius', 'slope', 'narrow', 'ev1', 'ev2', 'ev3', 'ev4', 'ev5', 'stress', 'chill', 'nbpoints']
nbusers = featuresDataframe.select('user_enc').distinct().count()
assemble = VectorAssembler(inputCols=in_col, outputCol='assembled_features', handleInvalid='error')
a_data = assemble.transform(featuresDataframe)
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol='assembled_features', outputCol='features')
fittedScaler = scaler.fit(a_data)
s_data = fittedScaler.transform(a_data)

In [None]:
# train-test split (seed 89 was chosen because it puts all 20 users in the test set despite the small 20% share)
train_df,test_df = s_data.select('user_enc','features').randomSplit([0.80,0.20],89)
print(train_df.select('user_enc').distinct().count())
print(test_df.select('user_enc').distinct().count())
mlpc=MultilayerPerceptronClassifier( featuresCol='features',labelCol='user_enc',layers = [len(in_col),40,nbusers],maxIter=30000,blockSize=8,seed=7,solver='gd')
ann = mlpc.fit(train_df)

In [None]:
# save the trained model for later
ann.save('/kaggle/working/mlp_71_model')

### PERFORMANCE EVALUATION

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='user_enc',predictionCol='prediction',metricName='f1') 


In [None]:
# confusion matrix
def pltConfusion(pred):
    array = np.zeros((nbusers,nbusers), int)
    for k in pred.collect():
        array[int(k['user_enc']),int(k['prediction'])] = array[int(k['user_enc']),int(k['prediction'])]+1
    df_cm = pd.DataFrame(array, range(nbusers), range(nbusers))
    plt.figure(figsize=(10,7))
    sn.set(font_scale=1.4)
    sn.heatmap(df_cm, annot=True, annot_kws={"size": 16})
    plt.show()

In [None]:
# Visualizing the train performance just for reference
pred = ann.transform(train_df)
ann_f1 = evaluator.evaluate(pred)
print('Train F1 =', ann_f1)
pltConfusion(pred)

In [None]:
# The more important part:
pred = ann.transform(test_df)
ann_f1 = evaluator.evaluate(pred)
print('Test F1 =', ann_f1)
pltConfusion(pred)

We notice that the model confuses users with similar patterns; for example, it predicts 2 as 16 because both have a slim, linear shape. 
<br> This might be improved by introducing more features or feeding in more training data.
<br> The model also appears to be over-fitting. Let's try to fix that.
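To make the link between the confusion matrix and the F1 score explicit, here is a small sketch that computes a support-weighted F1 from a confusion matrix. As far as I know this matches what the `f1` metric of Spark's `MulticlassClassificationEvaluator` reports (a weighted F-measure), but treat that as an assumption; the function name `weighted_f1` is my own.

```python
from typing import List


def weighted_f1(cm: List[List[int]]) -> float:
    """cm[i][j] = number of sessions of true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    if total == 0:
        return 0.0
    score = 0.0
    for c in range(n_classes):
        tp = cm[c][c]
        support = sum(cm[c])                                 # true sessions of class c
        predicted = sum(cm[r][c] for r in range(n_classes))  # sessions predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # each class contributes in proportion to its number of true sessions
        score += f1 * support / total
    return score


print(weighted_f1([[2, 0], [1, 1]]))
```

With a matrix like this, one misclassified session of class 1 lowers both the recall of class 1 and the precision of class 0, which is why a few confusions between similar users can pull the score down noticeably on a small test set.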

In [None]:
# We'll apply cross validation, with a grid search to see if it can improve
grid = ParamGridBuilder().addGrid(mlpc.layers, [[len(in_col),50,nbusers],[len(in_col),40,30,nbusers]]).build()
evaluator = MulticlassClassificationEvaluator(labelCol='user_enc',predictionCol='prediction',metricName='f1')
cv = CrossValidator(estimator=mlpc, estimatorParamMaps=grid, evaluator=evaluator, parallelism=2)
cvModel = cv.fit(train_df)

In [None]:
cvModel.getNumFolds()

In [None]:
cvModel.avgMetrics[0]

In [None]:
pred = cvModel.transform(train_df)
ann_f1 = evaluator.evaluate(pred)
print('Train F1 =', ann_f1)
pltConfusion(pred)

In [None]:
pred = cvModel.transform(test_df)
ann_f1 = evaluator.evaluate(pred)
print('Test F1 =', ann_f1)
pltConfusion(pred)

Well, the F1 score on the test data didn't improve after applying cross validation.
<br> In this case we can test other models like GBT, add more features, and increase the amount of data (either by collecting real data or by performing some data augmentation).
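One possible reason is simply the size of the data: with 20 users and 6 sessions each there are about 120 sessions, so an 80% training share leaves roughly 96 rows to be divided into folds. The sketch below shows plain k-fold splitting to make the fold sizes concrete. It is not Spark's internal implementation; `k_fold_indices` is a hypothetical helper, and 96 rows and 3 folds are illustrative values.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) pairs covering all samples."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # deal the shuffled indices into k folds, like dealing cards
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, validation))
    return splits


for train, validation in k_fold_indices(96, 3):
    print(len(train), len(validation))
```

Each candidate network is then fitted on about 64 sessions and validated on about 32, which is only a handful of sessions per user, so the validation metric is noisy.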

In [None]:
cvModel.bestModel.getLayers()

In [None]:
cvModel.save('/kaggle/working/cv_70_model')

In [None]:
pred.show(30)

### PREDICT TEST

In [None]:
schemaMousePosTs = StructType([
    StructField('uid', StringType(), False,),
    StructField('session_id', StringType(), False),
    StructField('timestamp', LongType(), False),
    StructField('event_type', IntegerType(), False),
    StructField('screen_x', FloatType(), False),
    StructField('screen_y', FloatType(), False),
    StructField('user_id', StringType(), True)
])
testDs = spark.read.csv('/kaggle/input/mouse-dynamics-for-user-authentication/Test_Mouse.csv',header=True, schema=schemaMousePosTs)
testDs.printSchema()

In [None]:
testDs.show()

In [None]:
testDs.select('session_id').distinct().count()

OK, so we have 40 sessions to predict.

In [None]:
featuresDataframeTs = spark.createDataFrame(
    testDs.rdd.groupBy(lambda x: (x['session_id'], x['user_id'])).map(featurize), schema=schemaFeatures 
)

In [None]:
a_data_ts = assemble.transform(featuresDataframeTs)
s_data_ts = fittedScaler.transform(a_data_ts)

In [None]:
pred = ann.transform(s_data_ts)

In [None]:
pred.select(['session_id', 'prediction']).show(50)

In [None]:
pred.select('session_id').distinct().count()

In [None]:
pred.select('prediction').distinct().count()

In [None]:
usersDecoder = {v:k for k,v in usersEncoder.items()}

In [None]:
predPd = pred.toPandas()

In [None]:
predPd['user_id'] = predPd['prediction'].apply(lambda x: usersDecoder[int(x)])

In [None]:
predPd[['session_id','user_id']].set_index('session_id')

In [None]:
testDsC = testDs.toPandas().copy()

In [None]:
# now let's save our predictions
testDsC.reset_index().set_index('session_id').drop(columns='user_id', inplace=False).\
join(predPd[['session_id','user_id']].set_index('session_id')).\
reset_index()[['index', 'uid', 'session_id', 'user_id', 'timestamp', 'event_type', 'screen_x', 'screen_y']].\
set_index('index').sort_index().to_csv('/kaggle/working/Predicted_mouse.csv', index=False)  

### CONCLUSION

##### During this machine learning test case we analysed human input data, trained a neural network classification model, and tried to guess the identity of the user behind unseen behaviour. 
Although there is a large gap (25%) between the test and training scores, this gap can probably be reduced by trying some of the following techniques:
- Try different combinations of neural network layers.
- Evaluate other classification models.
- Augment the data, either by collecting more real data or by creating simulations.
- Compose independent features that separate the categories well (here we should think about the human reflexes that guide the mouse, the type of bank forms, etc.).
- Approach the problem as a time-series use case and train recurrent models.
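As a starting point for the feature and time-series ideas above, the sketch below derives a per-step speed from consecutive (x, y, timestamp) rows. It is an illustration only: `step_speeds` is a hypothetical helper that is not part of this notebook, and the speed is expressed in pixels per timestamp unit, whatever that unit is in the dataset.

```python
import math
from typing import List, Tuple


def step_speeds(points: List[Tuple[float, float, int]]) -> List[float]:
    """points: (x, y, timestamp) rows of one session, sorted by timestamp."""
    speeds = []
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            # skip rows that share a timestamp to avoid dividing by zero
            continue
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return speeds


session = [(0.0, 0.0, 0), (3.0, 4.0, 10), (3.0, 4.0, 10), (6.0, 8.0, 15)]
print(step_speeds(session))
```

Summary statistics of these speeds (mean, maximum, spread) could be added as extra columns, or the raw sequence could be fed to a recurrent model.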
