# *Data Cleaning Basics*

**Here we work with a dataset about laptops. The file uses an encoding other than the usual UTF-8, so the encoding has to be specified when reading it. We will also perform a few data cleaning steps to make the raw data usable for analysis.**

### Reading CSV Files with Encodings

1. Import the pandas library
2. Use the pandas.read_csv() function to read the laptops.csv file into a dataframe laptops.
    - Specify the encoding using the string "Latin-1".
    - If that doesn't work, try using the string "Windows-1251".
3. Use the DataFrame.info() method to display information about the laptops dataframe.

In [6]:
import pandas as pd
laptops = pd.read_csv("laptops.csv", encoding="Latin-1")
print(laptops.info())  # info() prints the summary itself and returns None, hence the trailing "None" in the output
laptops.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB
None


| | Manufacturer | Model Name | Category | Screen Size | Screen | CPU | RAM | Storage | GPU | Operating System | Operating System Version | Weight | Price (Euros) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | | 1.37kg | 133969 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3" | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | | 1.34kg | 89894 |
| 2 | HP | 250 G6 | Notebook | 15.6" | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | | 1.86kg | 57500 |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4" | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16GB | 512GB SSD | AMD Radeon Pro 455 | macOS | | 1.83kg | 253745 |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8GB | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | | 1.37kg | 180360 |
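
The "try one encoding, then another" step above can also be automated. The following is a minimal sketch, not part of the original notebook: the helper name `read_csv_with_fallback`, the sample file, and its contents are all hypothetical. Note that Latin-1 maps every possible byte to a character, so decoding with it never raises an error; a wrong guess shows up as garbled characters instead. For that reason the sketch tries UTF-8 first.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("UTF-8", "Latin-1", "Windows-1251")):
    """Hypothetical helper: try each encoding in order, return (dataframe, encoding_used)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Build a tiny Latin-1 encoded file so the sketch runs on its own (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "wb") as f:
        f.write("Manufacturer,Model Name\nAcer,Aspire é\n".encode("latin-1"))
    sample, used = read_csv_with_fallback(sample_path)

print(used)
print(sample)
```

Because the byte for `é` in Latin-1 is not valid UTF-8 here, the first attempt fails and the second one succeeds. With the real file, the call would be `read_csv_with_fallback("laptops.csv")`.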


### Cleaning Column Names

1. Define a function, which accepts a string argument, and:
    - Removes any whitespace from the start and end of the string.
    - Replaces the substring Operating System with the abbreviation os.
    - Replaces all spaces with underscores.
    - Removes parentheses from the string.
    - Makes the entire string lowercase.
    - Returns the modified string.
2. Use a list comprehension to apply the function to each item in the DataFrame.columns attribute of the laptops dataframe, assigning the result back to the DataFrame.columns attribute.

In [7]:
laptops.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

In [8]:
def clean_col(col):
    col = col.strip()
    col = col.replace("Operating System", "os")
    col = col.replace(" ", "_")
    col = col.replace(")", "")
    col = col.replace("(", "")
    col = col.lower()
    return col

laptops.columns = [clean_col(c) for c in laptops.columns]

print(laptops.info())

laptops.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
manufacturer    1303 non-null object
model_name      1303 non-null object
category        1303 non-null object
screen_size     1303 non-null object
screen          1303 non-null object
cpu             1303 non-null object
ram             1303 non-null object
storage         1303 non-null object
gpu             1303 non-null object
os              1303 non-null object
os_version      1133 non-null object
weight          1303 non-null object
price_euros     1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB
None


Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
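
As an alternative to a helper function plus a list comprehension, the same cleanup can be sketched with the vectorized string methods available on the column Index. This is an illustration on a small made-up dataframe (`df` is hypothetical and holds only four of the original column names), not the notebook's own approach.

```python
import pandas as pd

# Hypothetical dataframe carrying a few of the raw column names from above.
df = pd.DataFrame(
    columns=["Manufacturer", " Storage", "Operating System Version", "Price (Euros)"]
)

# Apply the same steps as clean_col, but to every column name at once.
df.columns = (
    df.columns
    .str.strip()                                          # drop leading/trailing whitespace
    .str.replace("Operating System", "os", regex=False)   # abbreviate
    .str.replace(" ", "_", regex=False)                   # spaces to underscores
    .str.replace("(", "", regex=False)                    # remove parentheses
    .str.replace(")", "", regex=False)
    .str.lower()                                          # lowercase everything
)

print(list(df.columns))
```

Passing `regex=False` keeps the parentheses from being interpreted as regular-expression syntax.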

### Converting String Columns to Numeric

1. Use the Series.str.replace() method to remove the substring GB from the ram column.
2. Use the Series.astype() method to change the ram column to an integer dtype.
3. Because the GB characters contained useful information about the units (gigabytes) of the laptop's RAM, use the DataFrame.rename() method to rename the column from ram to ram_gb.
4. Use the DataFrame.dtypes attribute to get a list of the column names and types from the laptops dataframe, and assign the result to dtypes.
5. After you have run your code, print the dtypes variable to see the results.

In [9]:
laptops.head()

| | manufacturer | model_name | category | screen_size | screen | cpu | ram | storage | gpu | os | os_version | weight | price_euros |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | | 1.37kg | 133969 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3" | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | | 1.34kg | 89894 |
| 2 | HP | 250 G6 | Notebook | 15.6" | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | | 1.86kg | 57500 |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4" | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16GB | 512GB SSD | AMD Radeon Pro 455 | macOS | | 1.83kg | 253745 |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8GB | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | | 1.37kg | 180360 |


In [10]:
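# Remove the inch mark (") from screen_size, convert to float, and keep the unit in the column name.
laptops["screen_size"] = laptops["screen_size"].str.replace('"','').astype(float)
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)

# Remove the GB suffix from ram, convert to int, and keep the unit in the column name.
laptops["ram"] = laptops["ram"].str.replace("GB", "").astype(int)
laptops.rename({"ram":"ram_gb"}, axis='columns', inplace=True)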

dtypes = laptops.dtypes
print(dtypes)

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object
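
Before casting a cleaned string column with `astype()`, it can help to confirm that nothing except digits is left, since a single stray value makes the cast fail. The following is a minimal sketch of that check on a small made-up Series; the variable names `stripped` and `leftovers` are hypothetical.

```python
import pandas as pd

# Hypothetical sample mimicking the raw ram column.
ram = pd.Series(["8GB", "16GB", "4GB"], name="ram")

# Step 1: remove the unit suffix (literal match, not a regular expression).
stripped = ram.str.replace("GB", "", regex=False)

# Step 2: list any values that are not purely digits; astype(int) would fail on these.
leftovers = stripped[~stripped.str.fullmatch(r"\d+")]
print("values that still need cleaning:", leftovers.tolist())

# Step 3: cast, and record the unit in the name now that the suffix is gone.
ram_gb = stripped.astype(int).rename("ram_gb")
print(ram_gb)
```

If `leftovers` is not empty, its values show which additional patterns need to be removed before the cast.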


### Practicing Converting String Columns to Numeric

1. Clean the weight column by:
    - Removing the unit suffix (kg or kgs) so that only the numeric value remains.
    - Casting the column to the appropriate numeric type.
    - Renaming the column to weight_kg.
2. Clean the price_euros column by:
    - Replacing the decimal commas with decimal points.
    - Casting the column to the appropriate numeric type.
3. Use the Series.describe() method to generate some descriptive statistics for each column:
    - Assign the results for the weight_kg column to weight_describe.
    - Assign the results for the price_euros column to price_describe.

In [11]:
# Remove "kgs" before "kg": stripping "kg" first would leave a stray "s" behind.
laptops["weight"] = (laptops["weight"]
                     .str.replace("kgs", ""))
laptops["weight"] = (laptops["weight"]
                     .str.replace("kg","")
                     .astype("float"))
laptops["weight"].unique()

array([1.37 , 1.34 , 1.86 , 1.83 , 2.1  , 2.04 , 1.3  , 1.6  , 2.2  ,
       0.92 , 1.22 , 0.98 , 2.5  , 1.62 , 1.91 , 2.3  , 1.35 , 1.88 ,
       1.89 , 1.65 , 2.71 , 1.2  , 1.44 , 2.8  , 2.   , 2.65 , 2.77 ,
       3.2  , 0.69 , 1.49 , 2.4  , 2.13 , 2.43 , 1.7  , 1.4  , 1.8  ,
       1.9  , 3.   , 1.252, 2.7  , 2.02 , 1.63 , 1.96 , 1.21 , 2.45 ,
       1.25 , 1.5  , 2.62 , 1.38 , 1.58 , 1.85 , 1.23 , 1.26 , 2.16 ,
       2.36 , 2.05 , 1.32 , 1.75 , 0.97 , 2.9  , 2.56 , 1.48 , 1.74 ,
       1.1  , 1.56 , 2.03 , 1.05 , 4.4  , 1.29 , 1.95 , 2.06 , 1.12 ,
       1.42 , 3.49 , 3.35 , 2.23 , 4.42 , 2.69 , 2.37 , 4.7  , 3.6  ,
       2.08 , 4.3  , 1.68 , 1.41 , 4.14 , 2.18 , 2.24 , 2.67 , 2.14 ,
       1.36 , 2.25 , 2.15 , 2.19 , 2.54 , 3.42 , 1.28 , 2.33 , 1.45 ,
       2.79 , 1.84 , 2.6  , 2.26 , 3.25 , 1.59 , 1.13 , 1.78 , 1.15 ,
       1.27 , 1.43 , 2.31 , 1.16 , 1.64 , 2.17 , 1.47 , 3.78 , 1.79 ,
       0.91 , 1.99 , 4.33 , 1.93 , 1.87 , 2.63 , 3.4  , 3.14 , 1.94 ,
       1.24 , 4.6  ,
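
The cell above covers only the weight conversion. The remaining steps of the exercise (the rename, the price cleanup, and the descriptive statistics) could look roughly like the sketch below. It runs on a small made-up dataframe; the name `sample` and all of its values are hypothetical and are not taken from laptops.csv.

```python
import pandas as pd

# Hypothetical rows: weights with both unit spellings, prices with decimal commas.
sample = pd.DataFrame({
    "weight": ["1.37kg", "4kgs", "2.1kg"],
    "price_euros": ["1339,69", "898,94", "575,00"],
})

# Weight: strip the unit ("kgs" before "kg"), cast to float, then rename.
sample["weight"] = (
    sample["weight"]
    .str.replace("kgs", "", regex=False)
    .str.replace("kg", "", regex=False)
    .astype(float)
)
sample = sample.rename(columns={"weight": "weight_kg"})

# Price: turn the decimal comma into a decimal point, then cast to float.
sample["price_euros"] = (
    sample["price_euros"].str.replace(",", ".", regex=False).astype(float)
)

# Descriptive statistics for both cleaned columns.
weight_describe = sample["weight_kg"].describe()
price_describe = sample["price_euros"].describe()
print(weight_describe)
print(price_describe)
```

Checking the `min` and `max` entries of each result is a quick way to spot values that were converted incorrectly.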

### Extracting Values from the Start of Strings 
**In the example code, we have extracted the manufacturer name from the gpu column, and assigned it to a new column gpu_manufacturer.**

1. Extract the manufacturer name from the cpu column, and assign it to a new column cpu_manufacturer.

In [13]:
laptops["gpu_manufacturer"] = (laptops["gpu"].str.split(n=1,expand=True).iloc[:,0])

laptops["cpu_manufacturer"] = (laptops["cpu"].str.split(n=1, expand=True).iloc[:,0])

laptops['gpu_manufacturer']

0        Intel
1        Intel
2        Intel
3          AMD
4        Intel
5          AMD
6        Intel
7        Intel
8       Nvidia
9        Intel
10       Intel
11       Intel
12         AMD
13         AMD
14       Intel
15       Intel
16         AMD
17         AMD
18      Nvidia
19       Intel
20       Intel
21      Nvidia
22         AMD
23       Intel
24       Intel
25       Intel
26       Intel
27         AMD
28       Intel
29      Nvidia
         ...  
1273       AMD
1274     Intel
1275     Intel
1276     Intel
1277     Intel
1278     Intel
1279    Nvidia
1280       AMD
1281       AMD
1282     Intel
1283    Nvidia
1284     Intel
1285     Intel
1286     Intel
1287       AMD
1288     Intel
1289     Intel
1290     Intel
1291     Intel
1292     Intel
1293    Nvidia
1294       AMD
1295       AMD
1296     Intel
1297    Nvidia
1298     Intel
1299     Intel
1300     Intel
1301       AMD
1302     Intel
Name: gpu_manufacturer, Length: 1303, dtype: object
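
To see why `str.split(n=1, expand=True).iloc[:,0]` yields the manufacturer, it helps to look at the intermediate result. The sketch below uses a small made-up Series (the third value is hypothetical) and also shows an equivalent spelling with the `.str[0]` accessor.

```python
import pandas as pd

gpu = pd.Series([
    "Intel Iris Plus Graphics 640",
    "AMD Radeon Pro 455",
    "Nvidia GeForce MX150",  # hypothetical value for illustration
])

# n=1 splits on the first whitespace only; expand=True returns a two-column dataframe.
parts = gpu.str.split(n=1, expand=True)
print(parts)

# Column 0 holds the first word, i.e. the manufacturer.
first_word = parts.iloc[:, 0]

# Equivalent: keep the lists produced by split and take element 0 of each.
same_result = gpu.str.split(n=1).str[0]
print(same_result)
```

Both spellings give the same values; the `.str[0]` form avoids building the intermediate dataframe.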

laptops.head()