Let's tidy up this "wide" dataset of dummy devices!

In [4]:
import pandas as pd

# Read the CSV file into a pandas DF
df_wide = pd.read_csv('/home/kobi/is362/Project 2/dummy_devices.csv')

# Print the df
print("Dataset pre tidying:")
display(df_wide)

Dataset pre tidying:


Unnamed: 0,Device Name,Device Type,Operating Systems,Owner,Username,MDM Compliance,Registration Time,Device Model
0,Laptop_05,Laptop,Windows 10.0.19044,Ruby Brewer,rb.wr@icloud.org,True,3/3/2022 15:40,Lenovo ThinkPad E16 Gen 2
1,Desktop_66,Desktop,Windows 10.0.19045,Fritz Spence,velit@yahoo.couk,True,6/13/2022 17:07,HP Z2 Mini G9
2,Griffin’s iPhones,Smartphone,IPhone 17.5.1,Griffin Lambert,ac.feugiat@hotmail.com,True,9/23/2022 15:05,iPhone 11 Pro
3,Jade’s iPhone,Smartphone,IPhone 17.6.1,Jade Rowe,eu.metus@protonmail.couk,True,9/23/2022 19:06,iPhone 12 Pro Max
4,Desktop_32,Desktop,Windows 11.0.26100,Salvador Nash,dignissim@yahoo.couk,True,12/8/2022 20:40,HP Z2 Mini G9
5,Desktop_01,Desktop,Windows 10.0.19045,,,,12/11/2022 18:25,Lenovo ThinkCentre M710
6,Desktop_87,Desktop,Windows 11.0.22631,Simon English,sed.turpis@google.ca,True,12/12/2022 12:45,Dell OptiPlex 7000 Micro PC
7,Laptop_19,Laptop,Windows 10.0.19045,Mia Ayala,aenean@hotmail.couk,False,12/13/2022 13:15,Lenovo ThinkPad X1 Yoga Gen 8
8,Laptop_34,Laptop,Windows 10.0.19044,,,,12/18/2022 0:01,
9,Teagan’s iPhone,Smartphone,IPhone 17.5.1,Teagan Serrano,facilisi.sed.neque@yahoo.edu,True,1/31/2023 16:24,iPhone 11


In [8]:
# Clean up rows with missing owner or username
df_tidy = df_wide.dropna(subset=['Owner', 'Username']).copy()

# Make sure Operating Systems is a string, replacing missing values with ''
df_tidy['Operating Systems'] = df_tidy['Operating Systems'].fillna('').astype(str)

# Split OS type and OS Version
df_tidy[['OS_Type', 'OS_Version']] = df_tidy['Operating Systems'].apply(lambda x: x.split(' ', 1) if x else ['', '']).apply(pd.Series)

# Make sure Device Model is a string, replacing missing values with ''
df_tidy['Device Model'] = df_tidy['Device Model'].fillna('').astype(str)

# Split device brand and model
df_tidy[['Brand', 'Model']] = df_tidy['Device Model'].apply(lambda x: x.split(' ', 1) if x else ['', '']).apply(pd.Series)

# Drop original columns
df_tidy.drop(columns=['Operating Systems', 'Device Model'], inplace=True)

# Display dataset post tidying
print("Dataset post tidying:")
display(df_tidy)


Dataset post tidying:


Unnamed: 0,Device Name,Device Type,Owner,Username,MDM Compliance,Registration Time,OS_Type,OS_Version,Brand,Model
0,Laptop_05,Laptop,Ruby Brewer,rb.wr@icloud.org,True,3/3/2022 15:40,Windows,10.0.19044,Lenovo,ThinkPad E16 Gen 2
1,Desktop_66,Desktop,Fritz Spence,velit@yahoo.couk,True,6/13/2022 17:07,Windows,10.0.19045,HP,Z2 Mini G9
2,Griffin’s iPhones,Smartphone,Griffin Lambert,ac.feugiat@hotmail.com,True,9/23/2022 15:05,IPhone,17.5.1,iPhone,11 Pro
3,Jade’s iPhone,Smartphone,Jade Rowe,eu.metus@protonmail.couk,True,9/23/2022 19:06,IPhone,17.6.1,iPhone,12 Pro Max
4,Desktop_32,Desktop,Salvador Nash,dignissim@yahoo.couk,True,12/8/2022 20:40,Windows,11.0.26100,HP,Z2 Mini G9
6,Desktop_87,Desktop,Simon English,sed.turpis@google.ca,True,12/12/2022 12:45,Windows,11.0.22631,Dell,OptiPlex 7000 Micro PC
7,Laptop_19,Laptop,Mia Ayala,aenean@hotmail.couk,False,12/13/2022 13:15,Windows,10.0.19045,Lenovo,ThinkPad X1 Yoga Gen 8
9,Teagan’s iPhone,Smartphone,Teagan Serrano,facilisi.sed.neque@yahoo.edu,True,1/31/2023 16:24,IPhone,17.5.1,iPhone,11
10,Kermit’s iPhone,Smartphone,Kermit Knapp,a.nunc.in@google.net,True,2/23/2023 17:59,IPhone,17.6.1,iPhone,13 Pro Max
12,Ella’s iPad,Tablet,Ruby Brewer,rb.wr@icloud.org,False,3/2/2023 18:49,IPad,16.3.1,iPad,Mini 3rd
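A possible alternative to the apply/lambda split above is pandas' vectorized string splitting. This is only a sketch on made-up rows shaped like the "Operating Systems" column; the `sample` and `parts` names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample rows modeled on the "Operating Systems" column
sample = pd.DataFrame({
    "Operating Systems": ["Windows 10.0.19044", "IPhone 17.5.1", "Android 13", None],
})

# Split on the first space only; expand=True returns one column per piece
parts = sample["Operating Systems"].fillna("").str.split(" ", n=1, expand=True)

# Guarantee two columns even if no value contains a space, then blank out gaps
parts = parts.reindex(columns=[0, 1]).fillna("")

# Each piece stays on the same row as the device it came from
sample["OS_Type"] = parts[0]
sample["OS_Version"] = parts[1]
print(sample[["OS_Type", "OS_Version"]])
```

Splitting with `n=1` keeps everything after the first space together, which matters for values like "ThinkPad E16 Gen 2" in the Device Model column.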


1) I used a dataset from the discussion board, so it took only a few minutes to find; cleaning it took maybe 40 minutes. The annoying part was accounting for inconsistently formatted values.

2) Cleaning this dataset lets me run simple analyses, such as the prevalence of specific operating systems (see below). That is much easier when the operating system is in its own column!

3) It is important to preserve the relationships between variables while cleaning, for example when I split Operating Systems and Device Model into separate columns, so that each piece stays on the same row as its device.
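As a rough illustration of point 2, operating system prevalence can be read straight off a dedicated column with `value_counts`. The rows below are made up for the sketch, and the `toy` and `os_share` names are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only an OS_Type column like the one in the tidy table is needed
toy = pd.DataFrame({"OS_Type": ["Windows", "Windows", "IPhone", "Android", "Windows"]})

# normalize=True returns shares instead of raw counts; scale to percent and round
os_share = toy["OS_Type"].value_counts(normalize=True).mul(100).round(1)
print(os_share)
```

The same call on the combined "Operating Systems" column would count every version string separately, which is why the split makes this kind of summary easier.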

Now let's do a simple analysis: of the users in this dataset who are using a phone, what percent are on Android?

In [9]:
# Keep phone users: Brand contains 'Phone' (iPhones) or OS_Type contains 'Android'
phone_users = df_tidy[df_tidy['Brand'].str.contains('Phone', case=False, na=False) |
                      df_tidy['OS_Type'].str.contains('Android', case=False, na=False)]

display(phone_users)

# Count android users
android_users = phone_users[phone_users['OS_Type'].str.contains('Android', case=False, na=False)]

# Calculate % Android users
percentage_android = (len(android_users) / len(phone_users)) * 100

# Display Results
print(f"Of the users using a phone, here is the % of those users who are using android: {percentage_android:.2f}%")

Unnamed: 0,Device Name,Device Type,Owner,Username,MDM Compliance,Registration Time,OS_Type,OS_Version,Brand,Model
2,Griffin’s iPhones,Smartphone,Griffin Lambert,ac.feugiat@hotmail.com,True,9/23/2022 15:05,IPhone,17.5.1,iPhone,11 Pro
3,Jade’s iPhone,Smartphone,Jade Rowe,eu.metus@protonmail.couk,True,9/23/2022 19:06,IPhone,17.6.1,iPhone,12 Pro Max
9,Teagan’s iPhone,Smartphone,Teagan Serrano,facilisi.sed.neque@yahoo.edu,True,1/31/2023 16:24,IPhone,17.5.1,iPhone,11
10,Kermit’s iPhone,Smartphone,Kermit Knapp,a.nunc.in@google.net,True,2/23/2023 17:59,IPhone,17.6.1,iPhone,13 Pro Max
13,Alex’s iPhone,Smartphone,Ginger Alexander,facilisis@hotmail.edu,True,3/28/2023 19:08,IPhone,17.6.1,iPhone,SE
17,Linsay_workphone,Smartphone,Charlotte Lindsay,mauris@protonmail.org,False,8/14/2023 17:50,AndroidForWork,11,A600DL,
18,Brynn’s iPhone,Smartphone,Brynn Mclaughlin,arcu.ac@icloud.org,True,8/30/2023 18:33,IPhone,17.6.1,iPhone,12 Pro Max
20,Kelly’s iPhone,Smartphone,Kelly Strickland,sem.pellentesque@google.couk,True,9/25/2023 14:18,IPhone,17.5.1,iPhone,13
22,samsung62,Smartphone,Kermit Knapp,a.nunc.in@google.net,True,10/8/2023 2:45,Android,13,Samsung,Galaxy S22 Ultra
24,Wood_AndroidForWork,Smartphone,Lilah Wood,erat.in.consectetuer@yahoo.com,False,10/20/2023 4:56,AndroidForWork,13,Galaxy,S22 Ultra


Of the users using a phone, here is the % of those users who are using android: 47.37%
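
The filter above decides what counts as a phone from Brand and OS_Type. A possible alternative, assuming the Device Type column is filled in reliably, is to filter on it directly and take the mean of a boolean mask. This is a sketch on made-up rows, not a rerun of the analysis; `toy`, `phones`, `is_android`, and `pct_android` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows shaped like df_tidy (only the two columns used here)
toy = pd.DataFrame({
    "Device Type": ["Smartphone", "Smartphone", "Smartphone", "Tablet", "Laptop"],
    "OS_Type": ["IPhone", "Android", "AndroidForWork", "Android", "Windows"],
})

# Assumption: Device Type == 'Smartphone' identifies every phone
phones = toy[toy["Device Type"] == "Smartphone"]

# True for OS types beginning with 'Android' (covers 'AndroidForWork' too)
is_android = phones["OS_Type"].str.startswith("Android", na=False)

# The mean of a boolean Series is the share of True values; guard against no phones
pct_android = is_android.mean() * 100 if len(phones) else float("nan")
print(f"{pct_android:.2f}% of phones run Android")
```

With this approach an Android tablet would not be counted as a phone, whereas an OS_Type match alone would include it.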
