# *Data Cleaning Basics*

**Here, we work with a dataset about laptops. The file uses an encoding other than the usual UTF-8, so we read it with an explicit encoding and then perform a few data cleaning steps that make the raw data usable for analysis.**

### Reading CSV Files with Encodings

1. Import the pandas library
2. Use the pandas.read_csv() function to read the laptops.csv file into a dataframe laptops.
    - Specify the encoding using the string "Latin-1".
    - If that doesn't work, try using the string "Windows-1251".
3. Use the DataFrame.info() method to display information about the laptops dataframe.

In [1]:
import pandas as pd
laptops = pd.read_csv("laptops.csv", encoding="Latin-1")
print(laptops.info())
laptops.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB
None


| | Manufacturer | Model Name | Category | Screen Size | Screen | CPU | RAM | Storage | GPU | Operating System | Operating System Version | Weight | Price (Euros) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | | 1.37kg | 1339,69 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3" | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | | 1.34kg | 898,94 |
| 2 | HP | 250 G6 | Notebook | 15.6" | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | | 1.86kg | 575,00 |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4" | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16GB | 512GB SSD | AMD Radeon Pro 455 | macOS | | 1.83kg | 2537,45 |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8GB | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | | 1.37kg | 1803,60 |
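As a rough illustration of what "try another encoding" means, the sketch below decodes a small, made-up byte string with several candidate encodings and keeps the first one that does not raise an error. The sample bytes and the helper name `first_working_encoding` are assumptions for this example, not part of the exercise.

```python
# Hypothetical sample: one CSV row with a non-ASCII character, stored as Latin-1 bytes.
raw = 'Manufacturer,Price (Euros)\nAcmé,"1339,69"\n'.encode("latin-1")


def first_working_encoding(data, candidates):
    """Return (encoding, text) for the first candidate that decodes without error."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# UTF-8 rejects the lone 0xE9 byte, so the loop falls through to Latin-1.
encoding, text = first_working_encoding(raw, ["utf-8", "latin-1", "windows-1251"])
print(encoding)
print(text)
```

The encoding name found this way can be passed as the `encoding` argument of `pandas.read_csv()`. Note that Latin-1 assigns a character to every possible byte value, so decoding with it never raises an error; it is still worth inspecting the resulting text to confirm the characters look right.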


### Cleaning Column Names

1. Define a function, which accepts a string argument, and:
    - Removes any whitespace from the start and end of the string.
    - Replaces the substring Operating System with the abbreviation os.
    - Replaces all spaces with underscores.
    - Removes parentheses from the string.
    - Makes the entire string lowercase.
    - Returns the modified string.
2. Use a list comprehension to apply the function to each item in the DataFrame.columns attribute for the laptops dataframe, assigning the result back to the DataFrame.columns attribute.

In [2]:
laptops.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

In [3]:
def clean_col(col):
    col = col.strip()
    col = col.replace("Operating System", "os")
    col = col.replace(" ", "_")
    col = col.replace(")", "")
    col = col.replace("(", "")
    col = col.lower()
    return col

laptops.columns = [clean_col(c) for c in laptops.columns]

print(laptops.info())

laptops.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
manufacturer    1303 non-null object
model_name      1303 non-null object
category        1303 non-null object
screen_size     1303 non-null object
screen          1303 non-null object
cpu             1303 non-null object
ram             1303 non-null object
storage         1303 non-null object
gpu             1303 non-null object
os              1303 non-null object
os_version      1133 non-null object
weight          1303 non-null object
price_euros     1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB
None


Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
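The same cleaning steps can also be expressed with the vectorized string methods available on the column index, as an alternative to a helper function plus a list comprehension. This is a minimal sketch on a made-up, empty dataframe; the four column names are chosen only to exercise each rule.

```python
import pandas as pd

# Hypothetical frame whose column names show each problem the steps address.
df = pd.DataFrame(columns=["Manufacturer", " Storage",
                           "Operating System Version", "Price (Euros)"])

# regex=False treats every pattern as a literal string, so "(" and ")" are safe.
df.columns = (
    df.columns
    .str.strip()
    .str.replace("Operating System", "os", regex=False)
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.lower()
)

cleaned = list(df.columns)
print(cleaned)
```

Both approaches should give the same names; the function version is easier to extend with conditional logic, while the chained version keeps everything in one expression.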

### Converting String Columns to Numeric

1. Use the Series.str.replace() method to remove the substring GB from the ram column.
2. Use the Series.astype() method to change the ram column to an integer dtype.
3. Because the GB characters contained useful information about the units (gigabytes) of the laptop's ram, use the DataFrame.rename() method to rename the column from ram to ram_gb.
4. Use the DataFrame.dtypes attribute to get a list of the column names and types from the laptops dataframe, and assign the result to dtypes.
5. After you have run your code, use the variable inspector to view the dtypes variable to see the results of your code.

In [4]:
laptops["screen_size"] = laptops["screen_size"].str.replace('"','').astype(float)
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)

laptops["ram"] = laptops["ram"].str.replace("GB", "").astype(int)
laptops.rename({"ram":"ram_gb"}, axis='columns', inplace=True)

dtypes = laptops.dtypes
print(dtypes)

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object
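Series.astype() raises an error as soon as one value cannot be converted, so it can help to check first which values would fail. The sketch below uses `pandas.to_numeric()` with `errors="coerce"` for that check; the sample values, including the deliberately bad `"unknown"` entry, are invented for illustration.

```python
import pandas as pd

# Hypothetical RAM values; the last one cannot be turned into a number.
ram = pd.Series(["8GB", "16GB", "4GB", "unknown"])

# Remove the unit, then convert; values that fail to parse become NaN.
stripped = ram.str.replace("GB", "", regex=False)
as_number = pd.to_numeric(stripped, errors="coerce")

# The original strings that did not survive the conversion.
bad_rows = ram[as_number.isna()]
print(bad_rows)
```

If `bad_rows` is empty, the stricter `astype(int)` conversion used in the cell above should succeed.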


### Practicing Converting String Columns to Numeric

1. Clean the weight column by:
    - Removing the unit suffix (kg or kgs) so that only the numeric value remains.
    - Casting the column to the appropriate numeric type.
    - Renaming the column to weight_kg.
2. Clean the price_euros column by:
    - Replacing the decimal commas with decimal points.
    - Casting the column to the appropriate numeric type.
3. Use the Series.describe() method to generate some descriptive statistics for each column:
    - Assign the results for the weight_kg column to weight_describe.
    - Assign the results for the price_euros column to price_describe.

In [5]:
# Strip "kgs" first so the later "kg" replacement does not leave a stray "s".
laptops["weight"] = (laptops["weight"]
                     .str.replace("kgs", ""))
laptops["weight"] = (laptops["weight"]
                     .str.replace("kg","")
                     .astype("float"))
laptops.rename({"weight": "weight_kg"}, axis=1, inplace=True)

laptops['price_euros'] = (laptops['price_euros']
                          .str.replace(',','.')
                          .astype("float"))

weight_describe = laptops['weight_kg'].describe()
price_describe = laptops['price_euros'].describe()

weight_describe

count    1303.000000
mean        2.038734
std         0.665475
min         0.690000
25%         1.500000
50%         2.040000
75%         2.300000
max         4.700000
Name: weight_kg, dtype: float64
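The two unit suffixes can also be removed in a single pass with a regular expression. The sketch below is one possible variant, run on made-up values rather than the real columns: the pattern `kgs?$` matches either `kg` or `kgs` at the end of the string.

```python
import pandas as pd

# Hypothetical values covering both suffix spellings and a decimal comma.
weight = pd.Series(["1.37kg", "2.2kg", "4kgs"])
price = pd.Series(["1339,69", "575,00"])

# "s?" makes the trailing "s" optional; "$" anchors the match to the end.
weight_kg = weight.str.replace(r"kgs?$", "", regex=True).astype(float)

# A literal replacement is enough to turn decimal commas into decimal points.
price_euros = price.str.replace(",", ".", regex=False).astype(float)

print(weight_kg.tolist())
print(price_euros.tolist())
```

Passing `regex=True` or `regex=False` explicitly avoids relying on the default, which has differed between pandas versions.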

### Extracting Values from the Start of Strings 
**In the example code, we have extracted the manufacturer name from the gpu column, and assigned it to a new column gpu_manufacturer.**

1. Extract the manufacturer name from the cpu column, and assign it to a new column cpu_manufacturer.

In [6]:
laptops["gpu_manufacturer"] = (laptops["gpu"].str.split(n=1,expand=True).iloc[:,0])

laptops["cpu_manufacturer"] = (laptops["cpu"].str.split(n=1, expand=True).iloc[:,0])

laptops['gpu_manufacturer']

0        Intel
1        Intel
2        Intel
3          AMD
4        Intel
5          AMD
6        Intel
7        Intel
8       Nvidia
9        Intel
10       Intel
11       Intel
12         AMD
13         AMD
14       Intel
15       Intel
16         AMD
17         AMD
18      Nvidia
19       Intel
20       Intel
21      Nvidia
22         AMD
23       Intel
24       Intel
25       Intel
26       Intel
27         AMD
28       Intel
29      Nvidia
         ...  
1273       AMD
1274     Intel
1275     Intel
1276     Intel
1277     Intel
1278     Intel
1279    Nvidia
1280       AMD
1281       AMD
1282     Intel
1283    Nvidia
1284     Intel
1285     Intel
1286     Intel
1287       AMD
1288     Intel
1289     Intel
1290     Intel
1291     Intel
1292     Intel
1293    Nvidia
1294       AMD
1295       AMD
1296     Intel
1297    Nvidia
1298     Intel
1299     Intel
1300     Intel
1301       AMD
1302     Intel
Name: gpu_manufacturer, Length: 1303, dtype: object
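To see what `str.split(n=1, expand=True)` produces before `iloc` selects from it, here is a small sketch on three made-up GPU strings. It also shows an equivalent form, `str.split(n=1).str[0]`, which skips the intermediate dataframe.

```python
import pandas as pd

# Hypothetical GPU descriptions in the same "Manufacturer Model..." shape.
gpu = pd.Series(["Intel Iris Plus Graphics 640",
                 "AMD Radeon Pro 455",
                 "Nvidia GeForce MX150"])

# n=1 splits only at the first whitespace; expand=True returns two columns.
parts = gpu.str.split(n=1, expand=True)
manufacturer = parts.iloc[:, 0]

# Alternative: keep the lists and take their first element.
manufacturer_alt = gpu.str.split(n=1).str[0]

print(parts)
print(manufacturer.tolist())
```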

### Extracting Values from the End of Strings

1. From the cpu column, perform the following steps to extract and convert the processor speed, assigning the results to a new column cpu_speed_ghz:
    - Use Series.str.replace() to remove the substring "GHz" from each string.
    - Use Series.str.rsplit() and DataFrame.iloc[] to select the numeric characters from the end of the string.
    - Use Series.astype() to cast the values to the float dtype.

In [7]:
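# Split off the last whitespace-separated token of each screen description.
screen_res = laptops["screen"].str.rsplit(n=1, expand=True)
screen_res.columns = ["A", "B"]
# Strings without whitespace (e.g. "1440x900") leave "B" null, so copy "A" over.
screen_res.loc[screen_res["B"].isnull(), "B"] = screen_res["A"]
laptops["screen_resolution"] = (screen_res["B"].str.split(n=1,expand=True).iloc[:,0])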

whole_cpu = laptops["cpu"].str.replace("GHz","")

laptops["cpu_speed_ghz"] = whole_cpu.str.rsplit(n=1, expand = True).iloc[:,1].astype(float)
laptops['cpu_speed_ghz']

0       2.30
1       1.80
2       2.50
3       2.70
4       3.10
5       3.00
6       2.20
7       1.80
8       1.80
9       1.60
10      2.50
11      2.00
12      2.80
13      2.00
14      1.20
15      2.30
16      2.70
17      2.90
18      2.40
19      1.60
20      1.44
21      2.50
22      1.50
23      1.60
24      1.80
25      2.00
26      1.60
27      1.60
28      1.90
29      1.60
        ... 
1273    2.50
1274    1.60
1275    0.90
1276    2.40
1277    1.60
1278    1.60
1279    2.60
1280    2.90
1281    2.70
1282    1.60
1283    2.50
1284    2.50
1285    2.50
1286    1.60
1287    2.50
1288    1.60
1289    0.90
1290    2.40
1291    1.60
1292    1.60
1293    2.60
1294    2.90
1295    2.70
1296    1.60
1297    2.50
1298    2.50
1299    2.50
1300    1.60
1301    2.50
1302    1.60
Name: cpu_speed_ghz, Length: 1303, dtype: float64
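The sketch below walks through `str.rsplit()` on a few made-up strings. It shows why the screen column needs a fallback: a string with no whitespace produces a null in the second column, which `fillna()` can fill from the first. Using `fillna()` here is a substitute for the boolean-mask assignment in the cell above, not the same code.

```python
import pandas as pd

# Hypothetical CPU and screen descriptions.
cpu = pd.Series(["Intel Core i5 2.3GHz",
                 "Intel Core i5 7200U 2.5GHz",
                 "AMD A9-Series 9420 3GHz"])
screen = pd.Series(["IPS Panel Retina Display 2560x1600", "1440x900"])

# Drop the unit, split once from the right, keep the last piece as a float.
speed = (cpu.str.replace("GHz", "", regex=False)
            .str.rsplit(n=1, expand=True)
            .iloc[:, 1]
            .astype(float))

# "1440x900" has nothing to split on, so column 1 is null for that row.
parts = screen.str.rsplit(n=1, expand=True)
resolution = parts[1].fillna(parts[0])

print(speed.tolist())
print(resolution.tolist())
```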

### Correcting Bad Values
**We have created a dictionary for you to use with mapping. Note that we have included both the correct and incorrect spelling of macOS as keys, because Series.map() returns a null value for any entry that is not a key in the dictionary.**

1. Use the Series.map() method with the mapping_dict dictionary to correct the values in the os column.

In [8]:
mapping_dict = {
    'Android': 'Android',
    'Chrome OS': 'Chrome OS',
    'Linux': 'Linux',
    'Mac OS': 'macOS',
    'No OS': 'No OS',
    'Windows': 'Windows',
    'macOS': 'macOS'
}

laptops["os"] = laptops["os"].map(mapping_dict)

laptops['os']

0         macOS
1         macOS
2         No OS
3         macOS
4         macOS
5       Windows
6         macOS
7         macOS
8       Windows
9       Windows
10        No OS
11        No OS
12        macOS
13      Windows
14        macOS
15        macOS
16      Windows
17        macOS
18        No OS
19      Windows
20      Windows
21      Windows
22        No OS
23      Windows
24      Windows
25      Windows
26        macOS
27      Windows
28      Windows
29      Windows
         ...   
1273    Windows
1274    Windows
1275    Windows
1276    Windows
1277      Linux
1278    Windows
1279    Windows
1280    Windows
1281      Linux
1282    Windows
1283    Windows
1284    Windows
1285    Windows
1286    Windows
1287    Windows
1288    Windows
1289    Windows
1290    Windows
1291      Linux
1292    Windows
1293    Windows
1294    Windows
1295      Linux
1296    Windows
1297    Windows
1298    Windows
1299    Windows
1300    Windows
1301    Windows
1302    Windows
Name: os, Length: 1303, dtype: object
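The following sketch shows, on made-up values, what happens when the dictionary lists only the misspelled value: Series.map() turns every other entry into a null. Series.replace() is shown as an alternative that leaves unlisted values untouched.

```python
import pandas as pd

# Hypothetical os values with one misspelling.
os_col = pd.Series(["macOS", "Mac OS", "Windows", "Linux"])

# A dictionary that lists only the value to correct.
partial_map = {"Mac OS": "macOS"}

# map(): values that are not keys become NaN.
mapped = os_col.map(partial_map)

# replace(): values that are not keys are kept as they are.
replaced = os_col.replace(partial_map)

print(mapped.tolist())
print(replaced.tolist())
```

Listing every valid value as a key, as the exercise does, has the side effect that unexpected values show up as nulls instead of passing through silently.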

### Dropping Missing Values
1. Use DataFrame.dropna() to remove any rows from the laptops dataframe that have null values, assigning the result to laptops_no_null_rows.
2. Use DataFrame.dropna() to remove any columns from the laptops dataframe that have null values, assigning the result to laptops_no_null_cols.

In [9]:
laptops_no_null_rows = laptops.dropna()
laptops_no_null_cols = laptops.dropna(axis = 1)

laptops_no_null_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 16 columns):
manufacturer          1303 non-null object
model_name            1303 non-null object
category              1303 non-null object
screen_size_inches    1303 non-null float64
screen                1303 non-null object
cpu                   1303 non-null object
ram_gb                1303 non-null int64
storage               1303 non-null object
gpu                   1303 non-null object
os                    1303 non-null object
weight_kg             1303 non-null float64
price_euros           1303 non-null float64
gpu_manufacturer      1303 non-null object
cpu_manufacturer      1303 non-null object
screen_resolution     1303 non-null object
cpu_speed_ghz         1303 non-null float64
dtypes: float64(4), int64(1), object(11)
memory usage: 163.0+ KB
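To make the difference between the two calls concrete, here is a small sketch on a made-up three-row dataframe in which only the os_version column contains nulls.

```python
import pandas as pd

# Hypothetical frame: two of the three rows are missing os_version.
df = pd.DataFrame({
    "os": ["macOS", "Linux", "Windows"],
    "os_version": [None, None, "10"],
    "ram_gb": [8, 4, 16],
})

# Default axis=0: drop every row that contains at least one null.
no_null_rows = df.dropna()

# axis=1: drop every column that contains at least one null.
no_null_cols = df.dropna(axis=1)

print(no_null_rows.shape)
print(list(no_null_cols.columns))
```

Dropping rows keeps all columns but loses two of the three rows here, while dropping columns keeps all rows but loses os_version entirely.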


### Filling Missing Values
1. Use a boolean array to identify rows that have the value No OS for the os column, and then use assignment to assign the value Version Unknown to the os_version column for those rows.
2. Use the syntax below to create the value_counts_after variable:<br>
    *value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()*

3. After you have run your code, use the variable inspector to look at the difference between value_counts_before and value_counts_after.

In [10]:
value_counts_before = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()
laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"

laptops.loc[laptops["os"] == "No OS", "os_version"] = "Version Unknown"

value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()

print(value_counts_before)
print()
print(value_counts_after)

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64

Linux        62
Chrome OS    27
Android       2
Name: os, dtype: int64
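The boolean-array assignment can be tried in isolation on a small, made-up dataframe. The sketch below builds the mask as its own variable (the name `no_os` is an assumption for this example) so that each step is visible.

```python
import pandas as pd

# Hypothetical frame: three rows are missing os_version.
df = pd.DataFrame({
    "os": ["No OS", "macOS", "Linux", "Windows"],
    "os_version": [None, None, None, "10"],
})

before = df.loc[df["os_version"].isnull(), "os"].value_counts()

# Boolean array: True for the rows whose os is "No OS".
no_os = df["os"] == "No OS"

# Assign only to the os_version cells of the selected rows.
df.loc[no_os, "os_version"] = "Version Unknown"

after = df.loc[df["os_version"].isnull(), "os"].value_counts()

print(before)
print(after)
```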


### Challenge: Extracting Storage Information

1. Clean the storage column, creating four new columns:
    - storage_1_capacity_gb, with float dtype.
    - storage_1_type.
    - storage_2_capacity_gb, with float dtype. If there is only one drive, this column should be null.
    - storage_2_type. If there is only one drive, this column should be null.
    - If needed, don't forget to strip the columns of any extra whitespace.
2. Drop the original storage column and any temporary columns you made while completing the exercise.

In [11]:
laptops["storage"] = laptops["storage"].str.replace("TB", "000")

laptops["storage"] = laptops["storage"].str.replace("GB", "")

laptops["storage"] = laptops["storage"].str.strip()

t = laptops.loc[:, "storage"].str.split("+", expand=True)

laptops[["temporary_stor1", "temporary_stor2"]] = t.iloc[:,[0,1]]

laptops[["storage_1_capacity_gb", "storage_1_type"]] = laptops["temporary_stor1"].str.split(n=1, expand=True)

laptops[["storage_2_capacity_gb","storage_2_type"]] = laptops["temporary_stor2"].str.split(n=1, expand=True)

laptops[["storage_1_capacity_gb", "storage_2_capacity_gb"]] = laptops[["storage_1_capacity_gb", "storage_2_capacity_gb"]].astype(float)

laptops["storage_1_type"] = laptops["storage_1_type"].str.strip()

laptops["storage_2_type"] = laptops["storage_2_type"].str.strip()

laptops = laptops.drop(["temporary_stor1", "temporary_stor2", "storage"], axis=1)

laptops.head()

| | manufacturer | model_name | category | screen_size_inches | screen | cpu | ram_gb | gpu | os | os_version | weight_kg | price_euros | gpu_manufacturer | cpu_manufacturer | screen_resolution | cpu_speed_ghz | storage_1_capacity_gb | storage_1_type | storage_2_capacity_gb | storage_2_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8 | Intel Iris Plus Graphics 640 | macOS | X | 1.37 | 1339.69 | Intel | Intel | 2560x1600 | 2.3 | 128.0 | SSD | | |
| 1 | Apple | Macbook Air | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8 | Intel HD Graphics 6000 | macOS | X | 1.34 | 898.94 | Intel | Intel | 1440x900 | 1.8 | 128.0 | Flash Storage | | |
| 2 | HP | 250 G6 | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8 | Intel HD Graphics 620 | No OS | Version Unknown | 1.86 | 575.0 | Intel | Intel | 1920x1080 | 2.5 | 256.0 | SSD | | |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4 | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16 | AMD Radeon Pro 455 | macOS | X | 1.83 | 2537.45 | AMD | Intel | 2880x1800 | 2.7 | 512.0 | SSD | | |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8 | Intel Iris Plus Graphics 650 | macOS | X | 1.37 | 1803.6 | Intel | Intel | 2560x1600 | 3.1 | 256.0 | SSD | | |
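As an alternative sketch of the same challenge, the capacity, unit, and type can be captured with a regular expression and the unit converted numerically. The sample strings, the pattern, and the `to_gb` lookup are assumptions for this example. One reason to consider this variant is that replacing the text "TB" with "000" would turn a value written as "1.0TB" into "1.0000" if such a value were present.

```python
import pandas as pd

# Hypothetical storage descriptions, including a two-drive entry.
storage = pd.Series(["128GB SSD", "1TB HDD",
                     "256GB SSD +  1TB HDD", "1.0TB Hybrid"])

# Named groups become column names in the frame returned by str.extract().
pattern = r"(?P<capacity>[\d.]+)(?P<unit>GB|TB)\s+(?P<type>.+)"
to_gb = {"GB": 1, "TB": 1000}

# One column per drive; rows with a single drive get a null second column.
drives = storage.str.split("+", expand=True)

result = pd.DataFrame(index=storage.index)
for i in range(drives.shape[1]):
    parsed = drives[i].str.strip().str.extract(pattern)
    capacity = parsed["capacity"].astype(float) * parsed["unit"].map(to_gb)
    result[f"storage_{i + 1}_capacity_gb"] = capacity
    result[f"storage_{i + 1}_type"] = parsed["type"]

print(result)
```

Because the parsing works on separate series and writes straight into the final columns, this version needs no temporary columns to drop afterwards.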


### Reordering Columns and Exploring Cleaned Data

1. Using the cols list we have created for laptops, reorder the columns in the laptops dataframe.
2. Use the DataFrame.to_csv() method to save the laptops dataframe to a CSV file laptops_cleaned.csv without index labels.
3. Read laptops_cleaned.csv into a new dataframe, laptops_cleaned.
4. Use the DataFrame.dtypes attribute to assign the dtypes from laptops_cleaned to laptops_cleaned_dtypes.
5. After you have run your code, use the variable inspector to compare laptops_cleaned_dtypes with laptops_dtypes.

In [12]:
laptops_dtypes = laptops.dtypes
cols = ['manufacturer', 'model_name', 'category', 'screen_size_inches',
        'screen', 'cpu', 'cpu_manufacturer',  'cpu_speed_ghz', 'ram_gb',
        'storage_1_type', 'storage_1_capacity_gb', 'storage_2_type',
        'storage_2_capacity_gb', 'gpu', 'gpu_manufacturer', 'os',
        'os_version', 'weight_kg', 'price_euros']

laptops = laptops[cols]
#print(laptops.head())

laptops.to_csv("laptops_cleaned.csv", index = False)
laptops_cleaned = pd.read_csv("laptops_cleaned.csv")
laptops_cleaned_dtypes = laptops_cleaned.dtypes

print(laptops_cleaned_dtypes)
laptops_cleaned.head()

manufacturer              object
model_name                object
category                  object
screen_size_inches       float64
screen                    object
cpu                       object
cpu_manufacturer          object
cpu_speed_ghz            float64
ram_gb                     int64
storage_1_type            object
storage_1_capacity_gb    float64
storage_2_type            object
storage_2_capacity_gb    float64
gpu                       object
gpu_manufacturer          object
os                        object
os_version                object
weight_kg                float64
price_euros              float64
dtype: object


| | manufacturer | model_name | category | screen_size_inches | screen | cpu | cpu_manufacturer | cpu_speed_ghz | ram_gb | storage_1_type | storage_1_capacity_gb | storage_2_type | storage_2_capacity_gb | gpu | gpu_manufacturer | os | os_version | weight_kg | price_euros |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | Intel | 2.3 | 8 | SSD | 128.0 | | | Intel Iris Plus Graphics 640 | Intel | macOS | X | 1.37 | 1339.69 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | Intel | 1.8 | 8 | Flash Storage | 128.0 | | | Intel HD Graphics 6000 | Intel | macOS | X | 1.34 | 898.94 |
| 2 | HP | 250 G6 | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | Intel | 2.5 | 8 | SSD | 256.0 | | | Intel HD Graphics 620 | Intel | No OS | Version Unknown | 1.86 | 575.0 |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4 | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | Intel | 2.7 | 16 | SSD | 512.0 | | | AMD Radeon Pro 455 | AMD | macOS | X | 1.83 | 2537.45 |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | Intel | 3.1 | 8 | SSD | 256.0 | | | Intel Iris Plus Graphics 650 | Intel | macOS | X | 1.37 | 1803.6 |
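The reorder, save, and reload steps can be tried without touching the disk by writing to an in-memory buffer. The sketch below uses a made-up three-column dataframe and a hypothetical helper, `round_trip`, to show what `index=False` changes when the file is read back.

```python
import io

import pandas as pd

# Hypothetical frame with columns in an arbitrary order.
df = pd.DataFrame({
    "price_euros": [1339.69, 575.0],
    "manufacturer": ["Apple", "HP"],
    "ram_gb": [8, 8],
})

# Selecting with a list of names returns the columns in that order.
cols = ["manufacturer", "ram_gb", "price_euros"]
df = df[cols]


def round_trip(frame, **to_csv_kwargs):
    """Write the frame as CSV to an in-memory buffer and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


without_index = round_trip(df, index=False)
with_index = round_trip(df)  # index labels are written as an unnamed column

same_dtypes = bool((df.dtypes == without_index.dtypes).all())

print(list(without_index.columns))
print(list(with_index.columns))
print(same_dtypes)
```

When the index labels are written, reading the file back produces an extra first column named "Unnamed: 0", which is why the exercise saves the file without them.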
