# An Overview on Data Cleaning

## Good practices & First steps

Dataset: laptops.csv

Source: [Dataquest](https://dsserver-prod-resources-1.s3.amazonaws.com/293/laptops.csv?versionId=6BQzmMWIQEutXrN29r_0fnVN7_Z0oSVB)

Info: Various details about laptops, such as manufacturer, model name, and category.

In [1]:
# import libraries
import pandas as pd

In [2]:
# import csv
# files can use different encodings; pandas assumes UTF-8 by default
# data = pd.read_csv("laptops.csv")  # fails: 'utf-8' codec can't decode

# read the file with Latin-1 encoding instead
data = pd.read_csv("laptops.csv", encoding="Latin-1")
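
When the encoding of a file is not known ahead of time, one option is to try a short list of candidates in order. The sketch below is only an illustration: `read_csv_with_fallback` is a hypothetical helper (not part of pandas), and the two-line input is made up so the example runs without `laptops.csv`. Note that Latin-1 can decode any byte sequence, so it works as a permissive last resort rather than as proof that the file really is Latin-1.

```python
import io

import pandas as pd


def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "Latin-1")):
    """Return (DataFrame, encoding) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            frame = pd.read_csv(io.BytesIO(raw_bytes), encoding=encoding)
            return frame, encoding
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes; try the next one
    raise ValueError(f"none of {encodings} could decode the input")


# Hypothetical file contents: byte 0xE9 is "é" in Latin-1 but invalid here as UTF-8
raw = b"Manufacturer,Model Name\nAcer,Caf\xe9 Book\n"
sample, used_encoding = read_csv_with_fallback(raw)
print(used_encoding)
```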

In [3]:
data.head()

Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360


In [4]:
# info() gives a preliminary summary: column labels, non-null counts, and dtypes
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB


## What do we know so far?

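1. Every column has dtype object, so the values are most likely strings, even in columns that look numeric (RAM, Weight, Price)
2. Operating System Version is the only column with null values (1133 non-null out of 1303)
3. Column labels are inconsistent: mixed casing, spaces, parentheses, and a leading space in " Storage"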
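
These observations can also be checked programmatically instead of by reading the `info()` output. The following is a minimal sketch on a made-up three-column frame (the values are placeholders, not the real dataset); it flags labels with stray whitespace and columns that contain nulls.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the issues listed above
sample = pd.DataFrame(
    {
        "Model Name": ["MacBook Pro", "250 G6"],
        " Storage": ["128GB SSD", "256GB SSD"],
        "Operating System Version": [None, "10"],
    }
)

# Labels that change when stripped have leading or trailing whitespace
padded_labels = [col for col in sample.columns if col != col.strip()]

# Columns that contain at least one null value
columns_with_nulls = sample.columns[sample.isnull().any()].tolist()

print(padded_labels)
print(columns_with_nulls)
```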

## Next Steps
1. Explore the data
2. Identify patterns and special cases


# Cleaning Columns

Standardize column labels by removing whitespace and changing letter casing.

In [5]:
# returns an index object
data.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

In [6]:
# holds our new labels
fixed_labels = []

# strips the whitespace
for col in data.columns:
    fixed_label = col.strip()
    fixed_labels.append(fixed_label)

# make a copy so the original is untouched
clean_data = data.copy()
clean_data.columns = fixed_labels

# display fixes (print the labels of the cleaned copy, not the original)
for col in clean_data.columns:
    print(col)

Manufacturer
Model Name
Category
Screen Size
Screen
CPU
RAM
Storage
GPU
Operating System
Operating System Version
Weight
Price (Euros)


In [7]:
# data cleaning through a function:
# mixed case -> single case
# remove parentheses
# spaces -> _

new_labels = []


def clean_column(col):
    col = col.replace("(", "")
    col = col.replace(")", "")
    col = col.replace(" ", "_")
    col = col.lower()

    return col


# loop through each column and clean
for c in clean_data.columns:
    cleaned_c = clean_column(c)
    new_labels.append(cleaned_c)

# assign back to our data
clean_data.columns = new_labels

# verify changes
clean_data.head()

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,gpu,operating_system,operating_system_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360
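
The same label cleanup can be written without explicit loops, because a pandas `Index` of strings exposes the vectorized `.str` methods. This is an alternative sketch, not the approach used in the cells above, and it runs on an empty frame with three of the original labels.

```python
import pandas as pd

# Empty frame carrying a few of the original labels
sample = pd.DataFrame(columns=["Manufacturer", " Storage", "Price (Euros)"])

# Chain the same steps as clean_column(), applied to every label at once
sample.columns = (
    sample.columns.str.strip()
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace(" ", "_", regex=False)
    .str.lower()
)

print(sample.columns.tolist())
```

Passing `regex=False` keeps the parentheses from being interpreted as regular-expression syntax.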


# Cleaning Rows/Entries

In [8]:
# Identify patterns
# every RAM value appears to be stored as <number>GB
print(clean_data.loc[0:4, "ram"], "\n")

# vectorized string methods
clean_data["ram"] = clean_data["ram"].str.replace("GB", "")
print(clean_data.loc[0:4, "ram"], "\n")

# changing type to numeric
clean_data["ram"] = clean_data["ram"].astype(int)
print(clean_data.loc[0:4, "ram"])

0     8GB
1     8GB
2     8GB
3    16GB
4     8GB
Name: ram, dtype: object 

0     8
1     8
2     8
3    16
4     8
Name: ram, dtype: object 

0     8
1     8
2     8
3    16
4     8
Name: ram, dtype: int64
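
`astype(int)` raises an error as soon as one value does not fit the expected pattern. A more defensive variant, sketched below on made-up values (the "unknown" entry is hypothetical and does not come from the dataset), extracts the digits with a regular expression and lists whatever failed to match before converting.

```python
import pandas as pd

# Hypothetical values; "unknown" stands in for an entry that breaks the pattern
ram = pd.Series(["8GB", "16GB", "unknown"], name="ram")

# Capture the digits only when the whole value looks like <number>GB
digits = ram.str.extract(r"^(\d+)GB$", expand=False)

# Values that did not match come back as NaN; inspect them before converting
unmatched = ram[digits.isnull()]
print(unmatched.tolist())

# to_numeric keeps NaN for the unmatched rows, so the result is float, not int
ram_gb = pd.to_numeric(digits)
print(ram_gb)
```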


In [9]:
# renaming cleaned columns to retain information
clean_data.rename({"ram": "ram_gb"}, axis=1, inplace=True)
clean_data.head(1)

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram_gb,storage,gpu,operating_system,operating_system_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969


## Remapping Values

In [10]:
# remapping values
print(clean_data["operating_system"].head(10))

# new values
os_names = {
    "Android": "Android",
    "Chrome OS": "Chrome OS",
    "Linux": "Linux",
    "Mac OS": "macOS",
    "No OS": "No OS",
    "Windows": "Windows",
    "macOS": "macOS",
}

clean_data["operating_system"] = clean_data["operating_system"].map(os_names)

0      macOS
1      macOS
2      No OS
3      macOS
4      macOS
5    Windows
6     Mac OS
7      macOS
8    Windows
9    Windows
Name: operating_system, dtype: object


In [11]:
# verify
clean_data["operating_system"].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          21
Android         2
Name: operating_system, dtype: int64
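
One thing to keep in mind with `Series.map()` and a dictionary: any value that is missing from the dictionary becomes NaN. The sketch below uses a made-up series ("FreeDOS" is a hypothetical label, not one found in this dataset) to show how to list unmapped values and, if desired, keep the original value for them.

```python
import pandas as pd

# Hypothetical values; "FreeDOS" is deliberately absent from the mapping
os_column = pd.Series(["macOS", "Mac OS", "Windows", "FreeDOS"])
os_names = {"Mac OS": "macOS", "macOS": "macOS", "Windows": "Windows"}

mapped = os_column.map(os_names)

# Values that were present before mapping but turned into NaN afterwards
unmapped = os_column[mapped.isnull() & os_column.notnull()].unique().tolist()
print(unmapped)

# Optional: fall back to the original value instead of NaN
safe = mapped.fillna(os_column)
print(safe.tolist())
```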

# Extracting and Restructuring Data 

In [12]:
# identify pattern
print(clean_data.loc[0:3, "cpu"], "\n")

# str accessor
# split then grab the first value
# create a new column
clean_data["cpu_manufacturer"] = clean_data["cpu"].str.split().str[0]
clean_data["gpu_manufacturer"] = clean_data["gpu"].str.split().str[0]

# verify
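# select several columns with a list (a set has no order and is not a valid indexer in recent pandas)
clean_data[["cpu_manufacturer", "gpu_manufacturer"]]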

0          Intel Core i5 2.3GHz
1          Intel Core i5 1.8GHz
2    Intel Core i5 7200U 2.5GHz
3          Intel Core i7 2.7GHz
Name: cpu, dtype: object 



Unnamed: 0,cpu_manufacturer,gpu_manufacturer
0,Intel,Intel
1,Intel,Intel
2,Intel,Intel
3,Intel,AMD
4,Intel,Intel
...,...,...
1298,Intel,Intel
1299,Intel,Intel
1300,Intel,Intel
1301,Intel,AMD
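
Splitting works when the wanted piece sits at a fixed position. When it does not, `str.extract()` with a regular expression is an alternative. As an illustration only, the sketch below pulls the clock speed out of the four CPU strings printed above; the column name `cpu_speed_ghz` is an assumption and is not used elsewhere in this notebook.

```python
import pandas as pd

# The four CPU strings shown in the output above
cpu = pd.Series(
    [
        "Intel Core i5 2.3GHz",
        "Intel Core i5 1.8GHz",
        "Intel Core i5 7200U 2.5GHz",
        "Intel Core i7 2.7GHz",
    ],
    name="cpu",
)

# First word is the manufacturer; n=1 stops splitting after the first space
cpu_manufacturer = cpu.str.split(n=1).str[0]

# Capture the number directly in front of "GHz" and convert it to float
cpu_speed_ghz = cpu.str.extract(r"(\d+(?:\.\d+)?)GHz", expand=False).astype(float)

print(cpu_manufacturer.tolist())
print(cpu_speed_ghz.tolist())
```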


# Finding Missing Data using isnull()

In [13]:
# sums up how many null values are found in each column
print(clean_data.isnull().sum())

manufacturer                  0
model_name                    0
category                      0
screen_size                   0
screen                        0
cpu                           0
ram_gb                        0
storage                       0
gpu                           0
operating_system              0
operating_system_version    170
weight                        0
price_euros                   0
cpu_manufacturer              0
gpu_manufacturer              0
dtype: int64


# Removing null values

We've identified that "operating_system_version" is the only column that contains null values.

Rather than dropping the entire column (or every row with a missing version), which would discard information and could skew our data, we can dig deeper and see whether the missing values can be replaced instead of removed.

In [14]:
# explore the data 1
# where are the null values?
print("operating_system_version")
print(clean_data["operating_system_version"].value_counts(dropna=False))
print("\n")

print("operating_system")
# explore the data 2
# why might there be null values?
null_os = clean_data.loc[
    clean_data["operating_system_version"].isnull(), "operating_system"
]
print(null_os.value_counts())

operating_system_version
10      1072
NaN      170
7         45
X          8
10 S       8
Name: operating_system_version, dtype: int64


operating_system
No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: operating_system, dtype: int64


# What did we find?

1. There are 66 laptops with no OS, and thus no version
2. 13 macOS laptops have no version, but we know Macs generally have the most recently released OS
3. Other rows with a missing version could be filled in with Version Unknown

## Next Steps
1. Fill in the missing OS versions
2. Take a look at the weight column

In [15]:
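# Replacing null macOS versions with our best guess
# (the column is named operating_system_version; assigning to a non-existent
# label such as "os_version" would silently create a new column)
clean_data.loc[
    clean_data["operating_system"] == "macOS", "operating_system_version"
] = "X"

# replacing No OS with Version Unknown
clean_data.loc[
    clean_data["operating_system"] == "No OS", "operating_system_version"
] = "Version Unknown"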
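
Selecting rows by operating system alone overwrites every matching row, including rows that already have a version. If only the missing entries should be touched, the condition can be combined with `isnull()`. The sketch below is a minimal illustration on a made-up four-row frame, not the notebook's own data.

```python
import pandas as pd

# Hypothetical rows: one macOS row already has a version, the others are missing
sample = pd.DataFrame(
    {
        "operating_system": ["macOS", "macOS", "No OS", "Linux"],
        "operating_system_version": ["X", None, None, None],
    }
)

# Boolean masks, computed once before any assignment
missing = sample["operating_system_version"].isnull()
is_mac = sample["operating_system"] == "macOS"
is_no_os = sample["operating_system"] == "No OS"

# Fill only the rows that match the OS and are actually missing
sample.loc[missing & is_mac, "operating_system_version"] = "X"
sample.loc[missing & is_no_os, "operating_system_version"] = "Version Unknown"

print(sample)
```

The Linux row stays null here, which makes it easy to see what is still left to decide.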

In [16]:
print(clean_data["weight"].head())

0    1.37kg
1    1.34kg
2    1.86kg
3    1.83kg
4    1.37kg
Name: weight, dtype: object


In [17]:
# the format seems to be #.##kg
# we know ahead of time that one entry is listed as kgs

# remove "kgs" first, then "kg"
clean_data["weight"] = clean_data["weight"].str.replace("kgs", "")
clean_data["weight"] = clean_data["weight"].str.replace("kg", "")

# change the type to float
clean_data["weight"] = clean_data["weight"].astype(float)

# add kg to the name, in order to retain information
clean_data.rename({"weight": "weight_kg"}, axis=1, inplace=True)

# verify
clean_data["weight_kg"]

0       1.37
1       1.34
2       1.86
3       1.83
4       1.37
        ... 
1298    1.80
1299    1.30
1300    1.50
1301    2.19
1302    2.20
Name: weight_kg, Length: 1303, dtype: float64
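
The two `replace` calls can also be collapsed into a single regular expression that strips either suffix. This is an alternative sketch on made-up values; the "4kgs" entry is hypothetical and only stands in for the one row known to use "kgs".

```python
import pandas as pd

# Hypothetical values; the last one uses the "kgs" suffix
weight = pd.Series(["1.37kg", "1.86kg", "4kgs"], name="weight")

# "kgs?$" matches "kg" or "kgs" at the end of the string only
weight_kg = weight.str.replace(r"kgs?$", "", regex=True).astype(float)
weight_kg = weight_kg.rename("weight_kg")

print(weight_kg.tolist())
```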

In [21]:
clean_data.to_csv("laptops_cv1")
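
By default `to_csv()` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. Whether that is wanted depends on the use case; the sketch below simply shows the difference using an in-memory buffer and a made-up two-row frame, so no file is written.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data
sample = pd.DataFrame({"manufacturer": ["Apple", "HP"], "ram_gb": [8, 8]})

# With no path argument, to_csv() returns the CSV text as a string
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```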