## Smartphone User Behavior

## Does the number of installed apps influence the time (screen-on time) users spend on their phones?

A predictive model that estimates phone usage from the number of installed apps.


#### Notebook Layout

1. Introduction
   - Formulation of the hypothesis.
   - Explanation of the methodology.
2. Data Exploration
   - Loading the dataset.
   - Inspecting and cleaning the data.
   - Data cleaning:
     - Remove NAs (missing, null)
     - Remove duplicates
   - Visualize
   - Correlation matrix
   - Classification
   - Descriptive statistics.
3. Feature Engineering
   - Calculate the number of apps per user.
   - Aggregate screen-on time per user.
4. Exploratory Data Analysis (EDA)
   - Visualize the relationship between the number of apps and screen-on time.
   - Analyze trends and correlations.
5. Statistical Analysis
   - Correlation analysis.
   - Hypothesis testing.
6. Modeling
   - Linear regression to quantify the relationship.
   - Model evaluation metrics.
7. Discussion and Conclusion
   - Interpret the results.
   - Limitations and future work.


#### Paper Structure

- Introduction
  - State of the art
  - Related work
  - Research question
- Methodologies
- Concept
- Realization/evaluation
- Conclusion/outlook
- Literature/references:
  - https://www.sciencedirect.com/science/article/pii/S0747563222002266
  - https://dl.acm.org/doi/fullHtml/10.1145/3178876.3186169#fn2
  - https://hal.science/hal-03156195/document

Links related to smartphone behavior:

- https://www.kaggle.com/code/harshitpathak18/smartphone-usage-analysis
- https://www.kaggle.com/datasets/valakhorasani/mobile-device-usage-and-user-behavior-dataset/code
- https://www.kaggle.com/code/pavankumar4757/predicting-user-behavior-battery-drain-100


### 01: Data & Packages


In [None]:
#00: Import all required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import time

sns.set_theme()

#### Dataset

##### Features

The dataset contains the following columns:

1. User ID: Identifier for each user.
2. Device Model: Model of the user's device.
3. Operating System: OS installed on the device.
4. App Usage Time (min/day): Time spent using apps, in minutes per day.
5. Screen On Time (hours/day): Screen-on time, in hours per day.
6. Battery Drain (mAh/day): Battery drain per day in mAh.
7. Number of Apps Installed: Total number of apps installed on the device.
8. Data Usage (MB/day): Data usage per day in megabytes.
9. Age: Age of the user.
10. Gender: Gender of the user.
11. User Behavior Class: Categorization of user behavior.
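
Before loading the real file, it can help to confirm that all of the columns listed above are present. The following is a minimal sketch of such a check; the helper name `missing_columns` and the toy frame are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Column names as listed in the feature description above
EXPECTED_COLUMNS = [
    "User ID", "Device Model", "Operating System",
    "App Usage Time (min/day)", "Screen On Time (hours/day)",
    "Battery Drain (mAh/day)", "Number of Apps Installed",
    "Data Usage (MB/day)", "Age", "Gender", "User Behavior Class",
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame` (hypothetical helper)."""
    return [name for name in expected if name not in frame.columns]


# Toy frame that deliberately lacks the last column
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
print(missing_columns(toy))
```

An empty result means every expected column was found, so the renaming step later in the notebook can rely on those names.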


In [None]:
#01: Load the Dataset
data_path = 'user_behavior_dataset.csv'
data = pd.read_csv(data_path)

### 02: Data Preparation


In [None]:
print("Dataset Overview:")
data.head()

In [None]:
# Inspect the dataset's data types and non-null values
print("\n Dataset Info:")
data.info()  # info() prints its summary directly and returns None

In [None]:
# Check whether there are any null values
print("\n Missing Values:")
print(data.isnull().sum())

In [None]:
# Remove rows with missing values (if seen above)
data = data.dropna()
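
The notebook layout also lists removing duplicates as a cleaning step. A minimal sketch of how both steps could be combined is shown below; the toy frame is a made-up stand-in for the real dataset, and chaining `drop_duplicates()` after `dropna()` is an assumption about the intended cleaning order.

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one row with a missing value
toy = pd.DataFrame({
    "User ID": [1, 2, 2, 3],
    "Number of Apps Installed": [10, 25, 25, 40],
    "Screen On Time (hours/day)": [1.5, 4.0, 4.0, None],
})

rows_before = len(toy)

# Drop rows with missing values first, then drop exact duplicate rows
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

n_removed = rows_before - len(cleaned)
print(f"Removed {n_removed} rows; {len(cleaned)} remain")
```

Reporting how many rows were removed makes it easy to see whether cleaning changed the dataset at all.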

In [None]:
# Renaming columns for consistency and brevity
data = data.rename(columns={
    'User ID': 'user_id',
    'Device Model': 'device',
    'Operating System': 'os',
    'App Usage Time (min/day)': 'app_usage',
    'Screen On Time (hours/day)': 'screen_time',
    'Battery Drain (mAh/day)': 'battery',
    'Number of Apps Installed': 'num_apps',
    'Data Usage (MB/day)': 'data_usage',  # avoid a column named like the DataFrame variable
    'Age': 'age',
    'Gender': 'gender',
    'User Behavior Class': 'behave_class'
})
data.head()

### 03: Feature Engineering


In [None]:
# Convert screen-on time from hours to minutes.
# Note: re-running this cell multiplies the column by 60 again, so run it only once.
data['screen_time'] = data['screen_time'] * 60
data['screen_time'].unique()

# Visualization: screen time per user
plt.figure(figsize=(8, 5))
sns.barplot(data=data, x='user_id', y='screen_time')
plt.title('Screen Time by User ID')
plt.xlabel('User ID')
plt.ylabel('Screen Time (minutes)')
plt.tight_layout()
plt.show()

In [None]:
# Number of apps per user: the dataset already provides an installed-app count
# ('num_apps') for each row, so no groupby over individual apps is needed.
apps_per_user = data[['user_id', 'num_apps']]

# Visualization: screen time per user (screen_time is in minutes after the conversion above)
plt.figure(figsize=(8, 5))
plt.bar(data['user_id'], data['screen_time'], color='red')
plt.title('Screen Time by User ID')
plt.xlabel('User ID')
plt.ylabel('Screen Time (minutes)')
plt.tight_layout()
plt.show()

In [None]:
# Set the figure size
plt.figure(figsize=(10, 6))

# Create a histogram
sns.histplot(data['num_apps'], bins=20, kde=True, color='blue')

# Set the title and labels
plt.title('Distribution of Apps per user', fontsize=16)
plt.xlabel('Apps per user', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

# Add grid lines
plt.grid(True, linestyle='--', alpha=0.7)

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
# Calculate the total screen-on time per user (in minutes)
screen_time_per_user = data.groupby('user_id')['screen_time'].sum().reset_index()
# Rename so the column matches the 'total_screen_time' name used in later cells
screen_time_per_user.columns = ['user_id', 'total_screen_time']

plt.figure(figsize=(10, 6))

# Create a histogram
sns.histplot(data['screen_time'], bins=20, kde=True, color='blue')

# Set the title and labels
plt.title('Distribution of Screen-on time', fontsize=16)
plt.xlabel('Screen-on time per user (minutes)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

# Add grid lines
plt.grid(True, linestyle='--', alpha=0.7)

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
# Merge the two datasets
user_data = pd.merge(apps_per_user, screen_time_per_user, on='user_id')
user_data.head()
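
A merge on `user_id` silently multiplies rows if either side contains a user more than once. One way to guard against this, sketched below on made-up toy frames, is the `validate` argument of `pd.merge`; whether the real dataset needs this check is an assumption.

```python
import pandas as pd

# Hypothetical per-user tables with the same shape as the two frames merged above
apps = pd.DataFrame({"user_id": [1, 2, 3], "num_apps": [10, 25, 40]})
screen = pd.DataFrame({"user_id": [1, 2, 3],
                       "total_screen_time": [90.0, 240.0, 420.0]})

# validate="one_to_one" raises MergeError if user_id is duplicated on either side
merged = pd.merge(apps, screen, on="user_id", how="inner", validate="one_to_one")
print(merged)

# Demonstrate the failure mode with a deliberately duplicated user
dup_screen = pd.concat([screen, screen.iloc[[0]]], ignore_index=True)
try:
    pd.merge(apps, dup_screen, on="user_id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print("Duplicate detected:", duplicate_detected)
```

Failing loudly at merge time is usually easier to debug than discovering inflated row counts during modeling.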

### 04: Exploratory Data Analysis


In [None]:
# Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=user_data, x='num_apps', y='total_screen_time')
plt.title('Number of Apps vs Total Screen Time')
plt.xlabel('Number of Apps')
plt.ylabel('Total Screen Time (minutes)')
plt.show()

In [None]:
# Investigate unique values of total_screen_time
print("\nUnique Total Screen Time Values:")
print(user_data['total_screen_time'].unique())

In [None]:
# Plot histogram to visualize the distribution of total_screen_time
plt.figure(figsize=(10, 6))
plt.hist(user_data['total_screen_time'], bins=20, edgecolor='black', alpha=0.7)
plt.title('Distribution of Total Screen Time')
plt.xlabel('Total Screen Time (minutes)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Add jitter to the data for better visualization
user_data['total_screen_time_jittered'] = user_data['total_screen_time'] + np.random.normal(0, 5, size=user_data.shape[0])

plt.figure(figsize=(10, 6))
sns.scatterplot(data=user_data, x='num_apps', y='total_screen_time_jittered', alpha=0.6)
plt.title('Number of Apps vs Total Screen Time (With Jitter)')
plt.xlabel('Number of Apps')
plt.ylabel('Total Screen Time (minutes, with jitter)')
plt.show()
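
Because the jitter is random, the scatter plot changes on every run. A small sketch of making it reproducible with a seeded generator follows; the seed value, the standard deviation of 5 minutes, and the toy values are illustrative assumptions.

```python
import numpy as np

# Hypothetical screen-time values (minutes) with repeated entries that would overlap in a plot
total_screen_time = np.array([60.0, 60.0, 120.0, 120.0, 300.0])
jitter_sd = 5.0  # same spread as used in the cell above

# A seeded Generator yields the same jitter on every run
rng = np.random.default_rng(42)
jittered = total_screen_time + rng.normal(0.0, jitter_sd, size=total_screen_time.shape[0])

# Re-creating the generator with the same seed reproduces the values exactly
rng_again = np.random.default_rng(42)
jittered_again = total_screen_time + rng_again.normal(0.0, jitter_sd, size=total_screen_time.shape[0])

print(np.allclose(jittered, jittered_again))
```

The jittered column should only be used for plotting; correlations and models should still use the unmodified values.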

### 05: Correlation Analysis


In [None]:
# Calculate correlation coefficients
pearson_corr, pearson_pval = pearsonr(user_data['num_apps'], user_data['total_screen_time'])
spearman_corr, spearman_pval = spearmanr(user_data['num_apps'], user_data['total_screen_time'])

print(f"Pearson Correlation: {pearson_corr:.2f} (p-value: {pearson_pval:.3f})")
print(f"Spearman Correlation: {spearman_corr:.2f} (p-value: {spearman_pval:.3f})")
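
The layout lists hypothesis testing alongside the correlation analysis. The sketch below shows how the p-value could be turned into a decision about the null hypothesis "there is no association between number of apps and screen time". The synthetic data and the significance level of 0.05 are assumptions chosen for illustration, not results from the dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic stand-in data: screen time grows roughly linearly with the number of apps
rng = np.random.default_rng(0)
num_apps = rng.integers(10, 100, size=200)
screen_minutes = 3.0 * num_apps + rng.normal(0.0, 20.0, size=200)

alpha = 0.05  # conventional significance level (an assumption)

r, p_pearson = pearsonr(num_apps, screen_minutes)
rho, p_spearman = spearmanr(num_apps, screen_minutes)

# Reject H0 (no association) when the p-value falls below alpha
reject_h0 = p_pearson < alpha

print(f"Pearson r = {r:.2f}, p = {p_pearson:.3g}")
print(f"Spearman rho = {rho:.2f}, p = {p_spearman:.3g}")
print("Reject H0 (no linear association):", bool(reject_h0))
```

Pearson measures linear association while Spearman measures monotonic association, so a large gap between the two can hint at a non-linear relationship.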

### 06: Linear Regression


In [None]:
# Linear Regression
# Prepare the data
X = user_data[['num_apps']]
Y = user_data['total_screen_time']

# Fit the model
model = LinearRegression()
model.fit(X, Y)

# Predictions and Evaluation
predictions = model.predict(X)
rmse = np.sqrt(mean_squared_error(Y, predictions))
r2 = r2_score(Y, predictions)

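print("\nLinear Regression Model:")
print(f"Coefficient: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2 Score: {r2:.2f}")

# RMSE (root mean squared error) is in the same units as Y, i.e. minutes of screen time.
# Note: these metrics are computed on the training data, so they describe in-sample fit only.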
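
To make the two metrics concrete, the sketch below computes them by hand from their definitions, RMSE = sqrt(mean((y - y_hat)^2)) and R^2 = 1 - SS_res / SS_tot. The four actual and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted screen times in minutes
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: share of the variance in y_true explained by the predictions
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.2f}")   # RMSE: 10.00
print(f"R^2: {r2:.3f}")      # R^2: 0.992
```

Every prediction here is off by exactly 10 minutes, which is why the RMSE is 10; the same numbers would come out of `mean_squared_error` and `r2_score`.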

In [None]:
# Regression Line
plt.figure(figsize=(10, 6))
sns.regplot(data=user_data, x='num_apps', y='total_screen_time', line_kws={'color': 'red'})
plt.title('Regression Line: Number of Apps vs Total Screen Time')
plt.xlabel('Number of Apps')
plt.ylabel('Total Screen Time (minutes)')
plt.show()

### 07: Predictive Modeling


In [None]:
# Prepare the data
X = user_data[['num_apps']]
Y = user_data['total_screen_time']

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train, Y_train)

# Predictions on test data
linear_predictions = linear_model.predict(X_test)

# Evaluate the model
linear_rmse = np.sqrt(mean_squared_error(Y_test, linear_predictions))
linear_r2 = r2_score(Y_test, linear_predictions)

print(f"\nLinear Regression Model:")
print(f"Coefficient: {linear_model.coef_[0]:.2f}")
print(f"Intercept: {linear_model.intercept_:.2f}")
print(f"RMSE: {linear_rmse:.2f}")
print(f"R^2 Score: {linear_r2:.2f}")

In [None]:
# Random Forest Regression
rf_model = RandomForestRegressor(random_state=42, n_estimators=100)
rf_model.fit(X_train, Y_train)

# Predictions
rf_predictions = rf_model.predict(X_test)

# Evaluate Random Forest
rf_rmse = np.sqrt(mean_squared_error(Y_test, rf_predictions))
rf_r2 = r2_score(Y_test, rf_predictions)

print(f"\nRandom Forest Regression Model:")
print(f"RMSE: {rf_rmse:.2f}")
print(f"R^2 Score: {rf_r2:.2f}")

# Plot predictions for both models
plt.figure(figsize=(10, 6))
plt.scatter(Y_test, linear_predictions, alpha=0.6, label='Linear Regression')
plt.scatter(Y_test, rf_predictions, alpha=0.6, label='Random Forest', color='orange')
plt.plot([min(Y_test), max(Y_test)], [min(Y_test), max(Y_test)], color='red', linestyle='--', label='Ideal Fit')
plt.title('Model Predictions vs Actual Total Screen Time')
plt.xlabel('Actual Total Screen Time (minutes)')
plt.ylabel('Predicted Total Screen Time (minutes)')
plt.legend()
plt.show()


In [None]:
# Benchmark: repeat the model comparison with an 80/20 split and record training time
# Prepare the data
X = user_data[['num_apps']]
Y = user_data['total_screen_time']

# Split into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

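# Linear Regression Benchmark
# perf_counter() is a monotonic, high-resolution clock suited to timing short intervals
start_time_linear = time.perf_counter()
linear_model = LinearRegression()
linear_model.fit(X_train, Y_train)
linear_end_time = time.perf_counter()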

# Predictions
linear_predictions = linear_model.predict(X_test)

# Evaluate Linear Regression
linear_rmse = np.sqrt(mean_squared_error(Y_test, linear_predictions))
linear_r2 = r2_score(Y_test, linear_predictions)
linear_time_taken = linear_end_time - start_time_linear

print(f"\nLinear Regression Model:")
print(f"Coefficient: {linear_model.coef_[0]:.2f}")
print(f"Intercept: {linear_model.intercept_:.2f}")
print(f"RMSE: {linear_rmse:.2f}")
print(f"R^2 Score: {linear_r2:.2f}")
print(f"Time Taken: {linear_time_taken:.4f} seconds")

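# Random Forest Benchmark
# perf_counter() is a monotonic, high-resolution clock suited to timing short intervals
start_time_rf = time.perf_counter()
rf_model = RandomForestRegressor(random_state=42, n_estimators=100)
rf_model.fit(X_train, Y_train)
rf_end_time = time.perf_counter()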

# Predictions
rf_predictions = rf_model.predict(X_test)

# Evaluate Random Forest
rf_rmse = np.sqrt(mean_squared_error(Y_test, rf_predictions))
rf_r2 = r2_score(Y_test, rf_predictions)
rf_time_taken = rf_end_time - start_time_rf

print(f"\nRandom Forest Regression Model:")
print(f"RMSE: {rf_rmse:.2f}")
print(f"R^2 Score: {rf_r2:.2f}")
print(f"Time Taken: {rf_time_taken:.4f} seconds")

# Benchmark Comparison
print("\nBenchmark Comparison:")
print(f"Linear Regression - RMSE: {linear_rmse:.2f}, R^2: {linear_r2:.2f}, Time: {linear_time_taken:.4f}s")
print(f"Random Forest - RMSE: {rf_rmse:.2f}, R^2: {rf_r2:.2f}, Time: {rf_time_taken:.4f}s")

# Plot predictions for both models
plt.figure(figsize=(10, 6))
plt.scatter(Y_test, linear_predictions, alpha=0.6, label='Linear Regression')
plt.scatter(Y_test, rf_predictions, alpha=0.6, label='Random Forest', color='orange')
plt.plot([min(Y_test), max(Y_test)], [min(Y_test), max(Y_test)], color='red', linestyle='--', label='Ideal Fit')
plt.title('Model Predictions vs Actual Total Screen Time')
plt.xlabel('Actual Total Screen Time (minutes)')
plt.ylabel('Predicted Total Screen Time (minutes)')
plt.legend()
plt.show()
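
The comparison above rests on a single train/test split, so the ranking of the two models can depend on the random seed. A possible refinement, sketched here on synthetic data, is k-fold cross-validation. The data, the five folds, and the reduced `n_estimators=50` are assumptions for illustration and do not reflect results on the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (num_apps, total_screen_time)
rng = np.random.default_rng(0)
num_apps = rng.integers(10, 100, size=200).reshape(-1, 1)
screen_minutes = 3.0 * num_apps[:, 0] + rng.normal(0.0, 20.0, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, estimator in models.items():
    # scikit-learn maximizes scores, so RMSE is reported as a negative number
    scores = cross_val_score(estimator, num_apps, screen_minutes,
                             cv=cv, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.2f} minutes")
```

Averaging over folds gives a steadier estimate of out-of-sample error than one split, at the cost of fitting each model several times.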

### 08: Conclusion


In [None]:
print("\nDiscussion:")
print("The analysis compared Linear Regression and Random Forest Regression for predicting screen time.")
print("Linear Regression is interpretable but assumes a linear relationship, while Random Forest captures potential non-linearities.")
print("The model with the lower test RMSE and the higher test R^2 generalizes better and should be preferred.")