# An Overview on Data Cleaning

## Good practices & First steps

1. We need to first understand **what our data IS** and what we want to learn from it. Without this, we're just searching aimlessly.


2. Identify the **overall topic or theme** of your dataset. Are you noticing any **trends**? Let's say our dataset is about cars. Are all the cars from a certain country, a type of build, or maybe within a certain speed range?


3. Look for **common mistakes** such as whitespace, misspellings, numerical errors, language issues, and duplicates.
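
As a rough sketch of what checking for two of these mistakes might look like, the snippet below uses a small made-up car table (the `cars` frame and its values are assumptions for illustration, not part of our dataset) to flag stray whitespace and count duplicate rows.

```python
import pandas as pd

# Hypothetical sample data containing the kinds of mistakes listed above
cars = pd.DataFrame(
    {
        "make": ["Toyota", " Toyota", "Honda", "Honda"],
        "model": ["Corolla", "Corolla", "Civic", "Civic"],
    }
)

# Flag values that change when leading/trailing whitespace is stripped
has_whitespace = cars["make"] != cars["make"].str.strip()
print(has_whitespace.sum())

# Count rows that exactly repeat an earlier row
duplicate_count = cars.duplicated().sum()
print(duplicate_count)
```

Note that the row with `" Toyota"` is not counted as a duplicate of `"Toyota"` until the whitespace is fixed, which is one reason to clean before counting.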


> **"your analysis can only ever be as good as your data"**


Our dataset is a collection of laptops, their various attributes, and associated prices. Some columns include Manufacturer, Model Name, and Category.

Based on the provided features, we could analyse how a laptop's specifications influence its price. We could provide a general overview or break this down on a per-feature level.


Dataset: laptops.csv

Source: [Dataquest](https://dsserver-prod-resources-1.s3.amazonaws.com/293/laptops.csv?versionId=6BQzmMWIQEutXrN29r_0fnVN7_Z0oSVB)

#### Columns

* Manufacturer
* Model Name
* Category
* Screen Size
* Screen
* CPU
* RAM
* Storage
* GPU
* Operating System
* Operating System Version
* Weight
* Price (Euros)


In [1]:
# import libraries
import pandas as pd

In [2]:
# import csv
# Files can use different encodings; reading this one as UTF-8 (the default) fails
# data = pd.read_csv("laptops.csv")  # UnicodeDecodeError: 'utf-8' codec can't decode

# Latin-1 encoding reads the file successfully
data = pd.read_csv("laptops.csv", encoding="Latin-1")
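
To see why the encoding matters, here is a minimal sketch using only the standard library. The sample text is an assumption chosen for illustration (it is not taken from laptops.csv): a single accented character encoded as Latin-1 produces a byte sequence that UTF-8 cannot decode.

```python
# Sample text (not from laptops.csv) containing one accented character
raw = "caf\u00e9".encode("latin-1")  # the accent becomes the single byte 0xE9

# Decoding those bytes as UTF-8 fails, much like the read_csv call did
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the encoding the bytes were written in works
text = raw.decode("latin-1")
print(decoded_as_utf8, text)
```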

In [3]:
data.head()

Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360


In [4]:
data.tail()

Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
1298,Lenovo,Yoga 500-14ISK,2 in 1 Convertible,"14.0""",IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows,10,1.8kg,63800
1299,Lenovo,Yoga 900-13ISK,2 in 1 Convertible,"13.3""",IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows,10,1.3kg,149900
1300,Lenovo,IdeaPad 100S-14IBR,Notebook,"14.0""",1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows,10,1.5kg,22900
1301,HP,15-AC110nv (i7-6500U/6GB/1TB/Radeon,Notebook,"15.6""",1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows,10,2.19kg,76400
1302,Asus,X553SA-XX031T (N3050/4GB/500GB/W10),Notebook,"15.6""",1366x768,Intel Celeron Dual Core N3050 1.6GHz,4GB,500GB HDD,Intel HD Graphics,Windows,10,2.2kg,36900


In [5]:
# can be used to gather prelim info about our dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB


## What do we know so far?

1. Every column has the `object` dtype, meaning the values are stored as non-numeric data. Since every column shares the same type, whoever (or whatever program) made these entries most likely converted everything to strings. This happens.


2. Operating System Version is the only column with missing values: 1133 of 1303 entries are non-null, leaving 170 NaN. This can point to a data collection error. However, in the rows shown above, the missing versions belong to laptops listed as macOS or No OS.


3. Despite the shared data type, our columns hold values that are complex and varied in form. Some should be strings, some integers, and some floats.
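
One way to check which string columns are already numeric underneath is sketched below. The `sample` frame is a small stand-in built from values seen in the output above; `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into NaN, so a column that survives with no NaN can be converted as-is.

```python
import pandas as pd

# Small stand-in for the laptop data; every value starts out as a string
sample = pd.DataFrame(
    {
        "RAM": ["8GB", "16GB"],
        "Weight": ["1.37kg", "1.83kg"],
        "Price (Euros)": ["133969", "253745"],
    }
)

# A column is "numeric as-is" if every value survives pd.to_numeric
numeric_ready = {
    col: bool(pd.to_numeric(sample[col], errors="coerce").notna().all())
    for col in sample.columns
}
print(numeric_ready)
```

Columns that come back `False`, such as RAM and Weight here, need their unit text removed before they can be converted.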

## Next Steps
1. Identify patterns and special cases
2. Highlight areas for cleaning
3. Clean

# Cleaning Columns

Columns are one of the primary factors in gathering useful information from a dataset.

We need to standardize column labels in order to make efficient use of our tools.

Fixing whitespace and standardizing letter casing are two easy ways to achieve this.

### Examples
fruit products -> fruit_products

Docking -> docking

In [6]:
# returns an index object containing all our column names
data.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

In [7]:
# holds our new labels
fixed_labels = []

# strips the whitespace
for col in data.columns:
    fixed_label = col.strip()
    fixed_labels.append(fixed_label)

# make a copy so the original is untouched
clean_data = data.copy()
clean_data.columns = fixed_labels

# display our fixes in order to check them
for col in clean_data.columns:
    print(col)

Manufacturer
Model Name
Category
Screen Size
Screen
CPU
RAM
Storage
GPU
Operating System
Operating System Version
Weight
Price (Euros)


In [8]:
# data cleaning through a function
# mixed case -> single case
# removing parentheses
# spaces -> _

new_labels = []


def clean_column(col):
    col = col.replace("(", "")
    col = col.replace(")", "")
    col = col.replace(" ", "_")
    col = col.lower()

    return col


# loop through each column and clean
for c in clean_data.columns:
    cleaned_c = clean_column(c)
    new_labels.append(cleaned_c)

# assign back to our data
clean_data.columns = new_labels

# verify changes
clean_data.head()

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,gpu,operating_system,operating_system_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360
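
The loop and helper function above could also be written more compactly. As a sketch, the `.str` accessor works on a column `Index` as well as on a Series, so the same steps can be chained. The three labels below are copied from the `data.columns` output, including the stray space in `' Storage'`.

```python
import pandas as pd

# Labels copied from the data.columns output above
labels = pd.Index(["Model Name", " Storage", "Price (Euros)"])

# Same steps as clean_column, applied to every label at once
cleaned = (
    labels.str.strip()
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace(" ", "_", regex=False)
    .str.lower()
)
print(list(cleaned))
```

Passing `regex=False` treats `(` and `)` as literal characters rather than regular-expression syntax.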


# Cleaning Rows/Entries

Now that we have cleaned up our column names, we can begin to clean up our rows.

One pattern we've noticed is that every **ram** value is stored as **"#GB"**.

Since we know **all our values are in gigabytes**, we can remove this identifier and change the type to **numeric**.

Wouldn't this remove needed visual information? Yes. To retain it, we can add the removed GB identifier to our column name.

In [9]:
# all ram is stored as #GB
print(clean_data.loc[0:4, "ram"], "\n")

# vectorized string methods
clean_data["ram"] = clean_data["ram"].str.replace("GB", "")
print(clean_data.loc[0:4, "ram"], "\n")

# changing type to numeric
clean_data["ram"] = clean_data["ram"].astype(int)
print(clean_data.loc[0:4, "ram"])

0     8GB
1     8GB
2     8GB
3    16GB
4     8GB
Name: ram, dtype: object 

0     8
1     8
2     8
3    16
4     8
Name: ram, dtype: object 

0     8
1     8
2     8
3    16
4     8
Name: ram, dtype: int64
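
Before stripping the unit, it can help to confirm that the pattern really holds for every row, since `astype(int)` raises an error on anything unexpected. The sketch below uses a few sample values in the "#GB" format rather than the full column.

```python
import pandas as pd

ram = pd.Series(["8GB", "16GB", "4GB"])  # sample values in the "#GB" format

# Confirm every entry is digits followed by "GB" before removing the unit
matches_pattern = ram.str.match(r"^\d+GB$")
assert matches_pattern.all(), ram[~matches_pattern].unique()

# Safe to strip the unit and convert now
ram_gb = ram.str.replace("GB", "", regex=False).astype(int)
print(ram_gb.tolist())
```

If the check fails, the assertion message lists the offending values, which points directly at the special cases that need handling.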


In [10]:
# renaming cleaned columns to retain information
clean_data.rename({"ram": "ram_gb"}, axis=1, inplace=True)
clean_data.head(1)

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram_gb,storage,gpu,operating_system,operating_system_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969


## Remapping Values

By remapping values we can standardize our entries for easier categorization.

We can see that our operating system entries are inconsistent: the same system appears as both "macOS" and "Mac OS".

We can fix this with the `map` method and a dictionary of corrected names.

This way we can get an accurate count for analysis.

In [11]:
# remapping values
print(clean_data["operating_system"].head(10))

# new values
os_names = {
    "Android": "Android",
    "Chrome OS": "Chrome OS",
    "Linux": "Linux",
    "Mac OS": "macOS",
    "No OS": "No OS",
    "Windows": "Windows",
    "macOS": "macOS",
}

clean_data["operating_system"] = clean_data["operating_system"].map(os_names)

0      macOS
1      macOS
2      No OS
3      macOS
4      macOS
5    Windows
6     Mac OS
7      macOS
8    Windows
9    Windows
Name: operating_system, dtype: object


In [12]:
# verify
# expected total is 1303
clean_data["operating_system"].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          21
Android         2
Name: operating_system, dtype: int64

In [13]:
1125 + 66 + 62 + 27 + 21 + 2

1303
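
The total matters because `Series.map` with a dictionary returns NaN for any value that has no key, and `value_counts()` leaves NaN out by default. As a sketch, the snippet below shows how an unmapped value could be spotted directly. The shortened dictionary and the "Ubuntu" entry are made up for illustration.

```python
import pandas as pd

# Shortened mapping; "Ubuntu" below is a made-up value with no dictionary entry
os_names = {"Mac OS": "macOS", "macOS": "macOS", "Windows": "Windows"}
raw_os = pd.Series(["macOS", "Mac OS", "Windows", "Ubuntu"])

mapped = raw_os.map(os_names)

# Values that became NaN only because the dictionary did not cover them
unmapped = raw_os[mapped.isna() & raw_os.notna()].unique().tolist()
print(unmapped)
```

An empty list here means every original value was covered by the mapping.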

# Extracting and Restructuring Data 

Continuing with our pattern detection theme, we now look towards our cpu and gpu entries.

Here we can start to extract useful information.

In [14]:
# identify cpu pattern
print(clean_data.loc[0:3, "cpu"], "\n")

# str accessor
# split then grab the first value
# create a new column
clean_data["cpu_manufacturer"] = clean_data["cpu"].str.split().str[0]
clean_data["gpu_manufacturer"] = clean_data["gpu"].str.split().str[0]

# verify (select columns with a list; a set is unordered and rejected by recent pandas versions)
clean_data[["gpu_manufacturer", "cpu_manufacturer"]]

0          Intel Core i5 2.3GHz
1          Intel Core i5 1.8GHz
2    Intel Core i5 7200U 2.5GHz
3          Intel Core i7 2.7GHz
Name: cpu, dtype: object 



Unnamed: 0,gpu_manufacturer,cpu_manufacturer
0,Intel,Intel
1,Intel,Intel
2,Intel,Intel
3,AMD,Intel
4,Intel,Intel
...,...,...
1298,Intel,Intel
1299,Intel,Intel
1300,Intel,Intel
1301,AMD,Intel
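
To make the chained call easier to follow, here is a small sketch of each step on a few values copied from the output above: `.str.split()` turns each string into a list of words, and `.str[0]` picks the first word from each list.

```python
import pandas as pd

# Sample values copied from the cpu and gpu columns shown earlier
cpu = pd.Series(["Intel Core i5 2.3GHz", "Intel Core i7 2.7GHz"])
gpu = pd.Series(["Intel Iris Plus Graphics 640", "AMD Radeon Pro 455"])

# Step 1: split on whitespace, giving one list of words per row
print(cpu.str.split().tolist())

# Step 2: take the first word of each list
cpu_manufacturer = cpu.str.split().str[0]
gpu_manufacturer = gpu.str.split().str[0]
print(cpu_manufacturer.tolist(), gpu_manufacturer.tolist())
```

This only works because the manufacturer is always the first word. A value with a different layout would need its own handling.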


# Finding Missing Data using isnull()

In [15]:
# sums up how many null values are found in each column
print(clean_data.isnull().sum())

manufacturer                  0
model_name                    0
category                      0
screen_size                   0
screen                        0
cpu                           0
ram_gb                        0
storage                       0
gpu                           0
operating_system              0
operating_system_version    170
weight                        0
price_euros                   0
cpu_manufacturer              0
gpu_manufacturer              0
dtype: int64
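
Raw counts are easier to judge as a share of the rows (170 of 1303 is roughly 13%). A minimal sketch of that calculation, using a four-row stand-in frame rather than the real data:

```python
import pandas as pd

# Four-row stand-in for the two operating system columns
sample = pd.DataFrame(
    {
        "operating_system": ["macOS", "No OS", "Windows", "Windows"],
        "operating_system_version": [None, None, "10", "7"],
    }
)

# isnull() gives booleans, so the column mean is the fraction of missing values
missing_share = sample.isnull().mean()
print(missing_share)
```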


# Removing null values

We've identified that the "operating_system_version" is the only column that contains null values.

Rather than dropping an entire column, which would skew our data, we can dig deeper and see if we can replace, rather than remove.

In [16]:
# explore the data 1
# where are the null values?
print("operating_system_version")
print(clean_data["operating_system_version"].value_counts(dropna=False))
print("\n")

print("operating_system")
# explore the data 2
# why might there be null values?
null_os = clean_data.loc[
    clean_data["operating_system_version"].isnull(), "operating_system"
]
print(null_os.value_counts())

operating_system_version
10      1072
NaN      170
7         45
X          8
10 S       8
Name: operating_system_version, dtype: int64


operating_system
No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: operating_system, dtype: int64


# What did we find?

1. There are 66 laptops with no OS, and thus no version.
2. A few Macs (13) have no version, but we know Macs generally run the most recently released OS.
3. Other rows could be filled in with "Version Unknown".

## Next Steps
1. Fill in the missing versions where we can
2. Take a look at the weight column

In [17]:
# Replacing null macOS versions with our best guess
clean_data.loc[
    clean_data["operating_system"] == "macOS", "operating_system_version"
] = "X"

# replacing No OS versions with "Version Unknown"
clean_data.loc[
    clean_data["operating_system"] == "No OS", "operating_system_version"
] = "Version Unknown"
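
A slightly more defensive variant, offered as a sketch rather than the approach used above, combines a null mask with the OS condition so that only rows actually missing a version are changed. It assumes the column is named `operating_system_version`, as in the `isnull()` output, and uses a small stand-in frame.

```python
import pandas as pd

# Stand-in frame: one macOS row already has a version, one does not
sample = pd.DataFrame(
    {
        "operating_system": ["macOS", "macOS", "No OS", "Windows"],
        "operating_system_version": ["X", None, None, "10"],
    }
)

is_null = sample["operating_system_version"].isnull()
is_mac = sample["operating_system"] == "macOS"

# Fill only the macOS rows that are actually missing a version
sample.loc[is_null & is_mac, "operating_system_version"] = "X"

# Every remaining gap gets an explicit placeholder
sample["operating_system_version"] = sample["operating_system_version"].fillna(
    "Version Unknown"
)
versions = sample["operating_system_version"].tolist()
print(versions)
```

Existing versions are left untouched, and the final `fillna` also covers other systems with no recorded version.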

In [18]:
print(clean_data["weight"].head())

0    1.37kg
1    1.34kg
2    1.86kg
3    1.83kg
4    1.37kg
Name: weight, dtype: object


In [19]:
# the format seems to be #.##kg
# we know ahead of time that one is listed as kgs

# remove the kg/kgs
clean_data["weight"] = clean_data["weight"].str.replace("kgs", "")
clean_data["weight"] = clean_data["weight"].str.replace("kg", "")

# change the type to float
clean_data["weight"] = clean_data["weight"].astype(float)

# add kg to the name, in order to retain information
clean_data.rename({"weight": "weight_kg"}, axis=1, inplace=True)

# verify
clean_data["weight_kg"]

0       1.37
1       1.34
2       1.86
3       1.83
4       1.37
        ... 
1298    1.80
1299    1.30
1300    1.50
1301    2.19
1302    2.20
Name: weight_kg, Length: 1303, dtype: float64
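
The two `replace` calls could also be folded into one regular expression. As a sketch on sample values (one deliberately written with "kgs"), the pattern `kgs?$` matches either suffix at the end of the string:

```python
import pandas as pd

weight = pd.Series(["1.37kg", "4kgs", "2.2kg"])  # sample values, one with "kgs"

# "kgs?$" matches "kg" or "kgs" only at the end of each string
weight_kg = weight.str.replace(r"kgs?$", "", regex=True).astype(float)
print(weight_kg.tolist())
```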

In [20]:
clean_data.to_csv("laptops_cv1")
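
One thing to be aware of when saving: by default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference using in-memory buffers and a tiny stand-in frame. Whether to pass `index=False` is a judgment call, not something the steps above require.

```python
import io

import pandas as pd

df = pd.DataFrame({"ram_gb": [8, 16], "weight_kg": [1.37, 1.83]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(reloaded.columns.tolist())
```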