# CSC17104 – Programming for Data Science

## Final Project


## 1. Programming environment

In [1]:
import sys
sys.executable

'C:\\Users\\ASUS\\miniconda3\\envs\\min_ds-env\\python.exe'

## 2. Data collection

The data is titled **"Mobile Device Usage and User Behavior Dataset"**, authored by Seyedvala Khorasani with collaboration from Vala Khorasani, and is available on [Kaggle](https://www.kaggle.com/datasets/valakhorasani/mobile-device-usage-and-user-behavior-dataset/data). The dataset is licensed under the ***Apache 2.0 License***, which permits free use, modification, and distribution, provided that the license text and existing copyright and attribution notices are retained and that modified files are marked as changed.

This dataset describes mobile device usage patterns and classifies user behavior. It contains 700 samples of user data, including metrics such as app usage time, screen-on time, battery drain, and data consumption. Each entry is assigned to one of five user behavior classes, ranging from light to extreme usage, which makes the data suitable for both exploratory analysis and modeling.

Key Features:
* User ID: Unique identifier for each user.
* Device Model: Model of the user's smartphone.
* Operating System: The OS of the device (iOS or Android).
* App Usage Time: Daily time spent on mobile applications, measured in minutes.
* Screen On Time: Average hours per day the screen is active.
* Battery Drain: Daily battery consumption in mAh.
* Number of Apps Installed: Total apps available on the device.
* Data Usage: Daily mobile data consumption in megabytes.
* Age: Age of the user.
* Gender: Gender of the user (Male or Female).
* User Behavior Class: Classification of user behavior based on usage patterns (1 to 5).
  
This dataset is aimed at researchers, data scientists, and analysts interested in understanding mobile user behavior and developing predictive models for mobile technology and applications. Note that it was primarily designed for practicing machine learning algorithms and is not a reliable source for a paper or article.

The dataset description suggests that the data was likely collected through a combination of:

- Mobile application analytics to gather metrics like app usage time, screen-on time, and battery consumption.
- Device monitoring tools to measure hardware-specific metrics like data usage and number of installed apps.
- User surveys or demographic data inputs to collect age, gender, and usage classifications.

## 2. Import necessary libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

## 3. Data exploring & Data preprocessing

### 3.0. Read data

Read the dataset and display its first few rows.

In [3]:
file_path = 'user_behavior_dataset.csv'
data = pd.read_csv(file_path)

data.head()

|   | User ID | Device Model | Operating System | App Usage Time (min/day) | Screen On Time (hours/day) | Battery Drain (mAh/day) | Number of Apps Installed | Data Usage (MB/day) | Age | Gender | User Behavior Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Google Pixel 5 | Android | 393 | 6.4 | 1872 | 67 | 1122 | 40 | Male | 4 |
| 1 | 2 | OnePlus 9 | Android | 268 | 4.7 | 1331 | 42 | 944 | 47 | Female | 3 |
| 2 | 3 | Xiaomi Mi 11 | Android | 154 | 4.0 | 761 | 32 | 322 | 42 | Male | 2 |
| 3 | 4 | Google Pixel 5 | Android | 239 | 4.8 | 1676 | 56 | 871 | 20 | Male | 3 |
| 4 | 5 | iPhone 12 | iOS | 187 | 4.3 | 1367 | 58 | 988 | 31 | Female | 3 |
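
As a small sketch of a more defensive load step, the columns can be checked right after reading the file. The in-memory CSV below is a hypothetical stand-in for `user_behavior_dataset.csv` built from the first rows shown above; the names `sample_csv`, `expected_columns`, `sample`, and `missing` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'user_behavior_dataset.csv' (first three rows only).
sample_csv = io.StringIO(
    "User ID,Device Model,Operating System,App Usage Time (min/day),"
    "Screen On Time (hours/day),Battery Drain (mAh/day),Number of Apps Installed,"
    "Data Usage (MB/day),Age,Gender,User Behavior Class\n"
    "1,Google Pixel 5,Android,393,6.4,1872,67,1122,40,Male,4\n"
    "2,OnePlus 9,Android,268,4.7,1331,42,944,47,Female,3\n"
    "3,Xiaomi Mi 11,Android,154,4.0,761,32,322,42,Male,2\n"
)

# Columns the rest of the analysis relies on.
expected_columns = [
    "User ID", "Device Model", "Operating System",
    "App Usage Time (min/day)", "Screen On Time (hours/day)",
    "Battery Drain (mAh/day)", "Number of Apps Installed",
    "Data Usage (MB/day)", "Age", "Gender", "User Behavior Class",
]

sample = pd.read_csv(sample_csv)

# Fail early if a column was renamed or dropped in the source file.
missing = [name for name in expected_columns if name not in sample.columns]
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

print(sample.shape)
```

With the real file, the same check would run on `data` immediately after `pd.read_csv(file_path)`.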


### 3.1. The meaning of each column/row


How many rows and how many columns?

In [4]:
rows, columns = data.shape
print(f"The dataset has {rows} rows and {columns} columns.")

The dataset has 700 rows and 11 columns.


### Rows

#### The meaning of each row:

Each row describes a single user: their mobile usage behavior, device characteristics, and demographic information, all of which can be used for analysis and modeling.

### Columns

#### The meaning of each column:

- User ID: Unique identifier for the user.
- Device Model: The specific model of the user’s smartphone.
- Operating System: The OS running on the user’s device (e.g., Android or iOS).
- App Usage Time (min/day): The total time (in minutes) the user spends on mobile apps daily.
- Screen On Time (hours/day): The total number of hours the screen remains active daily.
- Battery Drain (mAh/day): The daily battery consumption of the user’s device in milliampere-hours (mAh).
- Number of Apps Installed: Total number of applications installed on the user’s device.
- Data Usage (MB/day): The daily amount of mobile data consumed by the user in megabytes (MB).
- Age: The age of the user.
- Gender: The gender of the user (Male or Female).
- User Behavior Class: A classification (from 1 to 5) categorizing the user based on their mobile usage behavior (e.g., light, moderate, or extreme usage).
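
Note that the two time columns use different units (minutes versus hours). The sketch below converts them to a common unit before comparing them, using the five rows displayed earlier. Whether app usage time can legitimately exceed screen-on time depends on how the dataset defines these metrics, so the comparison is an assumed sanity check rather than a documented rule; `toy` and `exceeds` are illustrative names.

```python
import pandas as pd

# The five rows shown by data.head(), restricted to the two time columns.
toy = pd.DataFrame({
    "App Usage Time (min/day)": [393, 268, 154, 239, 187],
    "Screen On Time (hours/day)": [6.4, 4.7, 4.0, 4.8, 4.3],
})

# Convert minutes to hours so both columns share one unit.
toy["App Usage Time (hours/day)"] = toy["App Usage Time (min/day)"] / 60

# Assumed sanity check: flag rows where app usage exceeds screen-on time.
exceeds = toy["App Usage Time (hours/day)"] > toy["Screen On Time (hours/day)"]

print(toy[exceeds])
```

In this small sample only the first row is flagged (393 minutes is 6.55 hours, against 6.4 hours of screen-on time).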


### 3.2. Duplicated rows

Are there duplicated rows?

In [5]:
duplicated_rows = data.duplicated().sum()
print(f"The dataset has {duplicated_rows} duplicated rows.")

The dataset has 0 duplicated rows.
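
`data.duplicated()` only flags rows that are identical in every column. The following sketch, on a hypothetical toy frame, shows how the same method can also check a single key column such as `User ID`, and how duplicates could be dropped if any were found. The values in `toy` are made up for illustration.

```python
import pandas as pd

# Made-up data: the row at index 2 repeats the row at index 1 exactly.
toy = pd.DataFrame({
    "User ID": [1, 2, 2, 3],
    "Age": [40, 47, 47, 31],
    "Gender": ["Male", "Female", "Female", "Female"],
})

# Rows identical to an earlier row in every column.
full_row_duplicates = toy.duplicated().sum()

# All rows whose User ID occurs more than once (keep=False marks every copy).
duplicate_ids = toy["User ID"].duplicated(keep=False).sum()

# Keep the first occurrence of each fully duplicated row.
deduplicated = toy.drop_duplicates(keep="first")

print(full_row_duplicates, duplicate_ids, len(deduplicated))
```

Checking the identifier column separately is useful because two rows can share a `User ID` while differing elsewhere, which a full-row check would miss.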


### 3.3. Data conversion

In [6]:
data.dtypes

User ID                         int64
Device Model                   object
Operating System               object
App Usage Time (min/day)        int64
Screen On Time (hours/day)    float64
Battery Drain (mAh/day)         int64
Number of Apps Installed        int64
Data Usage (MB/day)             int64
Age                             int64
Gender                         object
User Behavior Class             int64
dtype: object

After reviewing the data types, we noticed that some columns, such as `Gender` and `User Behavior Class`, use suboptimal data types. Below are the reasons for the suggested changes:

Convert low-cardinality columns to `category`: `Gender` is stored as `object` but holds only two distinct values, and *User Behavior Class* is stored as `int64` but holds only five class labels (1 to 5). When the number of unique values is small, converting a column to `category` can provide the following benefits:
- The category data type reduces memory usage by storing each unique label once and representing rows as small integer codes.
- Pandas can handle categorical data faster, especially when performing operations like grouping or aggregation.

The cell below converts `Gender`; `User Behavior Class` stays `int64`, which keeps it usable in the numerical summary in section 3.4.

In [7]:
# Convert the two-valued Gender column from object to category
data['Gender'] = data['Gender'].astype('category')
# Check the updated data types
print(data.dtypes)

User ID                          int64
Device Model                    object
Operating System                object
App Usage Time (min/day)         int64
Screen On Time (hours/day)     float64
Battery Drain (mAh/day)          int64
Number of Apps Installed         int64
Data Usage (MB/day)              int64
Age                              int64
Gender                        category
User Behavior Class              int64
dtype: object
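
The memory claim can be checked directly with `Series.memory_usage(deep=True)`. The sketch below uses a synthetic 700-value series (matching the dataset's row count, but not its actual values) and also shows an ordered categorical for a 1 to 5 class label, which is one possible way to keep the class ordering if that column were converted; all names here are illustrative.

```python
import pandas as pd

# Synthetic stand-in for the Gender column: 700 values, two distinct labels.
genders = pd.Series(["Male", "Female"] * 350, name="Gender")

# Memory footprint before and after conversion (deep=True counts the strings).
as_object_bytes = genders.memory_usage(deep=True)
as_category = genders.astype("category")
as_category_bytes = as_category.memory_usage(deep=True)

print(f"before: {as_object_bytes} bytes, category: {as_category_bytes} bytes")

# An ordered categorical keeps the 1..5 ordering, so min/max and comparisons work.
class_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
behavior = pd.Series([4, 3, 2, 3, 3]).astype(class_dtype)

print(behavior.max())
```

Arithmetic summaries such as the mean are not defined for categorical columns, which is a trade-off to keep in mind before converting a numeric label.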


### 3.4. With each numerical column, how are values distributed?

In [8]:
numerical_columns = [
    'App Usage Time (min/day)',
    'Screen On Time (hours/day)',
    'Battery Drain (mAh/day)',
    'Number of Apps Installed',
    'Data Usage (MB/day)',
    'Age',
    'User Behavior Class'
]
summary = {}

for column in numerical_columns:
    column_data = data[column]
    summary[column] = {
        'Missing Percentage': column_data.isnull().mean() * 100,
        'Min': column_data.min(),
        'Max': column_data.max(),
        'Mean': column_data.mean(),
        'Std': column_data.std(),
        '25%': column_data.quantile(0.25),
        '50% (Median)': column_data.median(),
        '75%': column_data.quantile(0.75)
    }
# Display the summary results
for column, stats in summary.items():
    print(f"\nColumn: {column}")
    for stat, value in stats.items():
        print(f"  {stat}: {value}")


Column: App Usage Time (min/day)
  Missing Percentage: 0.0
  Min: 30
  Max: 598
  Mean: 271.12857142857143
  Std: 177.19948438266206
  25%: 113.25
  50% (Median): 227.5
  75%: 434.25

Column: Screen On Time (hours/day)
  Missing Percentage: 0.0
  Min: 1.0
  Max: 12.0
  Mean: 5.272714285714286
  Std: 3.0685839102732553
  25%: 2.5
  50% (Median): 4.9
  75%: 7.4

Column: Battery Drain (mAh/day)
  Missing Percentage: 0.0
  Min: 302
  Max: 2993
  Mean: 1525.1585714285713
  Std: 819.1364144757152
  25%: 722.25
  50% (Median): 1502.5
  75%: 2229.5

Column: Number of Apps Installed
  Missing Percentage: 0.0
  Min: 10
  Max: 99
  Mean: 50.68142857142857
  Std: 26.943324147645118
  25%: 26.0
  50% (Median): 49.0
  75%: 74.0

Column: Data Usage (MB/day)
  Missing Percentage: 0.0
  Min: 102
  Max: 2497
  Mean: 929.7428571428571
  Std: 640.4517291185034
  25%: 373.0
  50% (Median): 823.5
  75%: 1341.0

Column: Age
  Missing Percentage: 0.0
  Min: 18
  Max: 59
  Mean: 38.48285714285714
  Std: 12.01
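
The same statistics can also be collected in one table with `DataFrame.describe()`. This is an alternative sketch, not the method used above; it runs on the five rows shown by `data.head()`, so its numbers differ from the full-dataset summary.

```python
import pandas as pd

# Two numerical columns from the five rows displayed by data.head().
toy = pd.DataFrame({
    "App Usage Time (min/day)": [393, 268, 154, 239, 187],
    "Screen On Time (hours/day)": [6.4, 4.7, 4.0, 4.8, 4.3],
})

# describe() returns count, mean, std, min, quartiles, and max per column;
# transposing puts one column of the dataset on each row.
summary_table = toy.describe().T

# Add the missing-value percentage, aligned on the column names.
summary_table["missing_pct"] = toy.isnull().mean() * 100

print(summary_table[["min", "max", "mean", "std", "25%", "50%", "75%", "missing_pct"]])
```

On the full dataset, `data[numerical_columns].describe().T` would produce the equivalent table in a single call.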

### 3.5. With each categorical column, how are values distributed?

In [9]:
categorical_columns = data.select_dtypes(include=['category', 'object']).columns
categorical_summary = {}
for column in categorical_columns:
    column_data = data[column]
    categorical_summary[column] = {
        'Missing Percentage': column_data.isnull().mean() * 100,
        'Unique Values Count': column_data.nunique(),
        'Sample Values': column_data.unique()[:5]
    }

for column, stats in categorical_summary.items():
    print(f"\nColumn: {column}")
    for stat, value in stats.items():
        print(f"  {stat}: {value}")



Column: Device Model
  Missing Percentage: 0.0
  Unique Values Count: 5
  Sample Values: ['Google Pixel 5' 'OnePlus 9' 'Xiaomi Mi 11' 'iPhone 12'
 'Samsung Galaxy S21']

Column: Operating System
  Missing Percentage: 0.0
  Unique Values Count: 2
  Sample Values: ['Android' 'iOS']

Column: Gender
  Missing Percentage: 0.0
  Unique Values Count: 2
  Sample Values: ['Male', 'Female']
Categories (2, object): ['Female', 'Male']
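
Unique counts and sample values show which categories exist, but not how often each occurs. A minimal sketch of a frequency table using `value_counts` follows; the helper `category_distribution` is a hypothetical name, and the toy frame reuses the `Operating System` and `Gender` values from the five rows shown by `data.head()`.

```python
import pandas as pd


def category_distribution(series: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each distinct value."""
    counts = series.value_counts()
    shares = series.value_counts(normalize=True) * 100
    return pd.DataFrame({"count": counts, "percent": shares.round(1)})


# Values taken from the five rows displayed by data.head().
toy = pd.DataFrame({
    "Operating System": ["Android", "Android", "Android", "Android", "iOS"],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
})

for column in toy.columns:
    print(f"\nColumn: {column}")
    print(category_distribution(toy[column]))
```

Applied to the full dataset, the same helper could be called on each entry of `categorical_columns`.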
