We want to do dimension reduction, so let's install UMAP (the umap-learn package).

In [1]:
!pip install --quiet umap-learn

Now let's load up our data and do a little feature engineering to make all of our data numeric or Boolean.

In [2]:
import pandas as pd

DATA = '/kaggle/input/mobile-device-usage-and-user-behavior-dataset/user_behavior_dataset.csv'
TARGET = 'User Behavior Class'

df = pd.read_csv(filepath_or_buffer=DATA, index_col=['User ID'])
df = pd.get_dummies(data=df, columns=['Device Model', 'Operating System', 'Gender'])
df.head()

User ID,App Usage Time (min/day),Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,Data Usage (MB/day),Age,User Behavior Class,Device Model_Google Pixel 5,Device Model_OnePlus 9,Device Model_Samsung Galaxy S21,Device Model_Xiaomi Mi 11,Device Model_iPhone 12,Operating System_Android,Operating System_iOS,Gender_Female,Gender_Male
1,393,6.4,1872,67,1122,40,4,True,False,False,False,False,True,False,False,True
2,268,4.7,1331,42,944,47,3,False,True,False,False,False,True,False,True,False
3,154,4.0,761,32,322,42,2,False,False,False,True,False,True,False,False,True
4,239,4.8,1676,56,871,20,3,True,False,False,False,False,True,False,False,True
5,187,4.3,1367,58,988,31,3,False,False,False,False,True,False,True,True,False
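To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does to categorical columns. The toy frame and its values are made up for illustration; only the column names mirror the dataset. The `drop_first=True` variant is an optional alternative, not something the cell above uses.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame(
    {
        'Age': [40, 47, 42],
        'Gender': ['Male', 'Female', 'Male'],
        'Operating System': ['Android', 'Android', 'iOS'],
    }
)

# Each listed categorical column is replaced by one indicator column per category.
encoded = pd.get_dummies(data=toy, columns=['Gender', 'Operating System'])
print(encoded.columns.tolist())

# drop_first=True keeps k-1 indicators per column, dropping the redundant one.
compact = pd.get_dummies(data=toy, columns=['Gender', 'Operating System'], drop_first=True)
print(compact.columns.tolist())
```

With two-level columns such as Gender and Operating System, the second indicator is fully determined by the first, so the compact form carries the same information in fewer columns.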


Are our target classes balanced?

In [3]:
df['User Behavior Class'].value_counts().to_dict()

{2: 146, 3: 143, 4: 139, 5: 136, 1: 136}

Our target classes are essentially balanced: the counts range from 136 to 146 out of 700 rows.
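One way to put a number on "essentially balanced" is to look at each class's share and the ratio of the largest class to the smallest. This is only a sketch; the counts are copied from the output above, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {2: 146, 3: 143, 4: 139, 5: 136, 1: 136}

total = sum(counts.values())
shares = {label: n / total for label, n in sorted(counts.items())}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print('class {}: {:.1%}'.format(label, share))
print('largest/smallest class ratio: {:.3f}'.format(imbalance_ratio))
```

A ratio close to 1 means plain accuracy is a reasonable headline metric and no resampling or class weighting is needed.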

In [4]:
COLUMNS = ['App Usage Time (min/day)', 'Screen On Time (hours/day)',
       'Battery Drain (mAh/day)', 'Number of Apps Installed',
       'Data Usage (MB/day)', 'Age', 
       'Device Model_Google Pixel 5', 'Device Model_OnePlus 9',
       'Device Model_Samsung Galaxy S21', 'Device Model_Xiaomi Mi 11',
       'Device Model_iPhone 12', 'Operating System_Android',
       'Operating System_iOS', 'Gender_Female', 'Gender_Male']

Let's do some dimension reduction and visualize our data using a scatter plot.

In [5]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
df[['x', 'y']] = umap.fit_transform(X=df[COLUMNS])
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:09.910675


In [6]:
from plotly import express

express.scatter(data_frame=df, x='x', y='y', color=TARGET)

Our target variable aligns very well with our clusters, so we expect that just about any model can do well. Let's build a model.
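The visual impression that labels and clusters agree can also be checked numerically, for instance with a silhouette score computed on the 2-D coordinates using the class labels. The sketch below uses synthetic blobs as a stand-in for an embedding, so the data, centers, and names are assumptions and not the notebook's actual results.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a 2-D embedding: five well-separated groups.
centers = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]]
points, labels = make_blobs(n_samples=700, centers=centers, cluster_std=0.5, random_state=2024)

# Silhouette ranges from -1 to 1; values near 1 mean the labels match tight, separated groups.
score = silhouette_score(points, labels)
print('silhouette: {:.3f}'.format(score))
```

On the real data the same call would take `df[['x', 'y']]` and `df[TARGET]` as its two arguments.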

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[COLUMNS], df[TARGET], test_size=0.2, random_state=2024, stratify=df[TARGET])

logreg = LogisticRegression(max_iter=100000, tol=1e-3).fit(X_train, y_train)
y_pred = logreg.predict(X=X_test)  # predict once and reuse for every metric
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=y_pred)))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=y_pred)))
print(classification_report(zero_division=0.0, y_true=y_test, y_pred=y_pred))

model fit in 13578 iterations
accuracy: 0.9000
f1: 0.8996
              precision    recall  f1-score   support

           1       1.00      0.96      0.98        27
           2       0.97      0.97      0.97        29
           3       0.84      0.93      0.89        29
           4       0.81      0.75      0.78        28
           5       0.89      0.89      0.89        27

    accuracy                           0.90       140
   macro avg       0.90      0.90      0.90       140
weighted avg       0.90      0.90      0.90       140




lbfgs failed to converge (status=1):
STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
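The warning suggests scaling the data. A minimal sketch of that suggestion wraps a `StandardScaler` and the logistic regression in a pipeline, so the scaler is fit on the training split only. The data here is synthetic with deliberately mismatched feature scales, so the numbers it prints say nothing about this dataset; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the feature scales are exaggerated on purpose.
X, y = make_classification(n_samples=700, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=2024)
X = X * [1000.0, 1.0, 500.0, 10.0, 1.0, 50.0]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2024, stratify=y)

# The pipeline standardizes each feature before the solver sees it.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

n_iter = scaled_logreg.named_steps['logisticregression'].n_iter_[0]
print('model fit in {} iterations'.format(n_iter))
print('accuracy: {:5.4f}'.format(scaled_logreg.score(X_te, y_te)))
```

Applied to the notebook's data, the same pipeline would be fit on `X_train` and `y_train`. Whether it changes the accuracy there is something to measure, not assume.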



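That's odd: the classes looked well separated in the scatter plot, yet logistic regression reaches only 90% accuracy, and lbfgs stopped with a convergence warning after 13578 iterations. We may do better with KNN. Let's try that.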

In [8]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
knn_pred = knn.predict(X=X_test)  # predict once and reuse for every metric

print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=knn_pred)))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=knn_pred)))
print(classification_report(zero_division=0.0, y_true=y_test, y_pred=knn_pred))

accuracy: 1.0000
f1: 1.0000
              precision    recall  f1-score   support

           1       1.00      1.00      1.00        27
           2       1.00      1.00      1.00        29
           3       1.00      1.00      1.00        29
           4       1.00      1.00      1.00        28
           5       1.00      1.00      1.00        27

    accuracy                           1.00       140
   macro avg       1.00      1.00      1.00       140
weighted avg       1.00      1.00      1.00       140



Let's have a look at the model probabilities for logistic regression.

In [9]:
probability_df = pd.DataFrame(data=logreg.predict_proba(X=X_test).max(axis=1), columns=['probability'])
probability_df['true'] = y_test.tolist()
probability_df['pred'] = logreg.predict(X=X_test)
probability_df['correct'] = probability_df['true'] == probability_df['pred']
probability_df[['x', 'y']] = umap.transform(X=X_test)

probability_df.head()

,probability,true,pred,correct,x,y
0,0.999856,5,5,True,-2.291656,-5.380343
1,0.987673,1,1,True,9.696007,-2.103451
2,0.976413,2,2,True,12.582695,9.255136
3,0.996342,5,5,True,-2.172602,-3.936835
4,0.979685,1,1,True,8.440191,-2.951017


In [10]:
express.scatter(data_frame=probability_df, x='x', y='y', facet_col='correct', color='probability')

Surprisingly, our logistic regression model makes mistakes in all five classes, and occasionally produces high model probabilities for incorrect predictions.
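To go from the plot to specific rows, one option is to filter for incorrect predictions whose top-class probability exceeds a cut-off. The sketch below uses a small hypothetical frame in the same layout as `probability_df`; the values and the 0.9 threshold are invented for illustration.

```python
import pandas as pd

# Hypothetical rows in the same layout as probability_df; the values are invented.
toy_probs = pd.DataFrame(
    {
        'probability': [0.99, 0.62, 0.97, 0.55, 0.91],
        'true': [5, 4, 3, 4, 1],
        'pred': [5, 3, 4, 4, 1],
    }
)
toy_probs['correct'] = toy_probs['true'] == toy_probs['pred']

THRESHOLD = 0.9  # arbitrary cut-off for "high confidence"
confident_errors = toy_probs[~toy_probs['correct'] & (toy_probs['probability'] > THRESHOLD)]
print(confident_errors)

# Mean top-class probability for correct vs. incorrect predictions.
print(toy_probs.groupby('correct')['probability'].mean())
```

Running the same filter on `probability_df` would list the confidently wrong test rows along with their true and predicted classes.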