Import Modules

In [1]:
import pandas as pd

I - Data Cleaning

In [30]:
df = pd.read_csv('user_behavior_dataset.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User ID                     700 non-null    int64  
 1   Device Model                700 non-null    object 
 2   Operating System            700 non-null    object 
 3   App Usage Time (min/day)    700 non-null    int64  
 4   Screen On Time (hours/day)  700 non-null    float64
 5   Battery Drain (mAh/day)     700 non-null    int64  
 6   Number of Apps Installed    700 non-null    int64  
 7   Data Usage (MB/day)         700 non-null    int64  
 8   Age                         700 non-null    int64  
 9   Gender                      700 non-null    object 
 10  User Behavior Class         700 non-null    int64  
dtypes: float64(1), int64(7), object(3)
memory usage: 60.3+ KB


Check whether any column in the data file contains missing values

In [31]:
print("Missing values in each column:")
print(df.isnull().sum())

Missing values in each column:
User ID                       0
Device Model                  0
Operating System              0
App Usage Time (min/day)      0
Screen On Time (hours/day)    0
Battery Drain (mAh/day)       0
Number of Apps Installed      0
Data Usage (MB/day)           0
Age                           0
Gender                        0
User Behavior Class           0
dtype: int64


Based on the output above, no column contains missing values, so no rows need to be dropped or filled in and the data can be treated as clean with respect to missing values
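A zero count from isnull() only rules out missing values. As a minimal sketch, assuming a small made-up DataFrame (not rows from user_behavior_dataset.csv), a few further checks that could be run before treating a dataset as clean are duplicated IDs, unexpected category labels, and out-of-range numbers. The column names mirror the ones above; the toy values and the expected_os set are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from the real dataset
toy = pd.DataFrame({
    'User ID': [1, 2, 2, 4],
    'Operating System': ['Android', 'iOS', 'iOS', 'android '],
    'Age': [25, 31, 31, -3],
})

# Assumed set of valid labels (hypothetical; adjust to the real data)
expected_os = {'Android', 'iOS'}

# Total number of missing cells across all columns
n_missing = int(toy.isnull().sum().sum())

# Rows whose User ID repeats an earlier row
n_duplicate_ids = int(toy['User ID'].duplicated().sum())

# Labels that differ from the expected spelling (case, stray spaces, typos)
unexpected_os = sorted(set(toy['Operating System']) - expected_os)

# Values that cannot be valid ages
n_negative_age = int((toy['Age'] < 0).sum())

print('missing values:', n_missing)
print('duplicate User IDs:', n_duplicate_ids)
print('unexpected OS labels:', unexpected_os)
print('negative ages:', n_negative_age)
```

On the toy data the missing-value count is zero even though one ID is duplicated, one label is misspelled, and one age is negative, which is why these checks are kept separate.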

To make the analysis easier, the two columns "App Usage Time" and "Screen On Time" should share a common unit. In the original data file, "App Usage Time" is in minutes per day (min/day) while "Screen On Time" is in hours per day (hours/day). Converting "Screen On Time" to minutes per day lets the two columns be compared directly with each other and with the rest of the data

In [32]:
df['App Usage Time (min/day)'].describe()

count    700.000000
mean     271.128571
std      177.199484
min       30.000000
25%      113.250000
50%      227.500000
75%      434.250000
max      598.000000
Name: App Usage Time (min/day), dtype: float64

In [33]:
df['Screen On Time (hours/day)'].describe()

count    700.000000
mean       5.272714
std        3.068584
min        1.000000
25%        2.500000
50%        4.900000
75%        7.400000
max       12.000000
Name: Screen On Time (hours/day), dtype: float64

In [34]:
df['Screen On Time (min/day)'] = df['Screen On Time (hours/day)'] * 60

In [35]:
df[['App Usage Time (min/day)', 'Screen On Time (min/day)']]

     App Usage Time (min/day)  Screen On Time (min/day)
0                         393                     384.0
1                         268                     282.0
2                         154                     240.0
3                         239                     288.0
4                         187                     258.0
...                       ...                       ...
695                        92                     234.0
696                       316                     408.0
697                        99                     186.0
698                        62                     102.0
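With both columns in minutes per day, they can be compared row by row. The following is a minimal sketch that uses only the first five rows shown above, hard-coded so the block runs on its own. The 'App/Screen ratio' column is a hypothetical name introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

# First five rows copied from the output above
sample = pd.DataFrame({
    'App Usage Time (min/day)': [393, 268, 154, 239, 187],
    'Screen On Time (min/day)': [384.0, 282.0, 240.0, 288.0, 258.0],
})

# Hypothetical derived column: app usage as a fraction of screen-on time
sample['App/Screen ratio'] = (
    sample['App Usage Time (min/day)'] / sample['Screen On Time (min/day)']
)

# Rows where the recorded app usage is larger than the screen-on time
exceeds = sample[sample['App/Screen ratio'] > 1]

print(sample.round(2))
print('rows where app usage exceeds screen-on time:', exceeds.index.tolist())
```

In this five-row sample only row 0 has app usage above screen-on time (393 vs. 384 minutes), a comparison that is harder to spot while one column is still in hours.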


In [36]:
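# Drop the original hours column now that the minutes version exists
df = df.drop(columns=['Screen On Time (hours/day)'])

# Move 'Screen On Time (min/day)' so it sits right after 'App Usage Time (min/day)'
cols = df.columns.tolist()
cols.remove('Screen On Time (min/day)')
cols.insert(cols.index('App Usage Time (min/day)') + 1, 'Screen On Time (min/day)')
df = df[cols]

# Save the cleaned data, then preview it (head() only displays as the last expression)
df.to_csv('cleaned_data_unified_time.csv', index=False)
df.head()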

For further use, I'll split the data into 4 additional files (male users, female users, iOS users, and Android users) so that each group can be analyzed separately and in more detail

In [38]:
# Filter data for Male Users
df_male = df[df['Gender'] == 'Male']

# Filter data for Female Users
df_female = df[df['Gender'] == 'Female']

# Filter data for iOS Users
df_ios = df[df['Operating System'] == 'iOS']

# Filter data for Android Users
df_android = df[df['Operating System'] == 'Android']

# Save each filtered dataset to a new CSV file
# Male User data
df_male.to_csv('cleaned_data_male.csv', index=False)

# Female User data
df_female.to_csv('cleaned_data_female.csv', index=False)

# iOS User data
df_ios.to_csv('cleaned_data_ios.csv', index=False)

# Android User data
df_android.to_csv('cleaned_data_android.csv', index=False)
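The four filter-and-save steps above follow the same pattern, so they could also be written as one loop. This is a minimal sketch under stated assumptions: the toy DataFrame is made up (not the real dataset), the splits list and out_dir are hypothetical names, and the files go to a temporary folder so the real CSV files are not overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data for illustration only; not rows from the real dataset
toy = pd.DataFrame({
    'User ID': [1, 2, 3, 4],
    'Operating System': ['Android', 'iOS', 'Android', 'iOS'],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
})

# (column, value, file suffix) for each subset, matching the four files above
splits = [
    ('Gender', 'Male', 'male'),
    ('Gender', 'Female', 'female'),
    ('Operating System', 'iOS', 'ios'),
    ('Operating System', 'Android', 'android'),
]

# Temporary folder, so the sketch does not overwrite the real CSV files
out_dir = Path(tempfile.mkdtemp())

written = {}
for column, value, suffix in splits:
    # Keep only the rows that match this group, then save them
    subset = toy[toy[column] == value]
    subset.to_csv(out_dir / f'cleaned_data_{suffix}.csv', index=False)
    written[suffix] = len(subset)

print(written)
```

Keeping the groups in a list means adding another split later only takes one more entry instead of another filter and another to_csv call.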