# Data Cleaning
Since data scientists spend more than half of their time cleaning data, in this project we will learn the following data cleaning tasks:
* Cleaning column names
* Extracting and converting numeric values from string values
* Extracting string data
* Handling missing values

We will work with laptops.csv, a dataset of about 1,300 laptop computers.

### 1. Reading a CSV file with encoding
Nowadays the predominant encoding is UTF-8, which is also the default encoding of Python. Other encodings exist, such as Latin-1 (also known as ISO-8859-1) and Windows-1251.
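
To make the encoding issue concrete, here is a minimal sketch using only built-in string methods. The sample text is made up for illustration and is not taken from laptops.csv. It shows that bytes written as Latin-1 can fail to decode as UTF-8, while decoding with the matching encoding recovers the text.

```python
# Made-up sample text; "é" is stored as the single byte 0xE9 in Latin-1.
raw = 'Café 15.6"'.encode("latin-1")

# Decoding the same bytes as UTF-8 fails, because 0xE9 followed by a space
# is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding that was used for writing recovers the text.
decoded = raw.decode("latin-1")
print(utf8_ok, decoded)
```
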
### Instructions
* Import the pandas library
* Use the pandas.read_csv() function to read the laptops.csv file into a dataframe laptops.
  * Specify the encoding using the string "Latin-1".
  * If that doesn't work, try using the string "Windows-1251".
* Use the DataFrame.info() method to display information about the laptops dataframe.
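
The "try one encoding, then another" step can be sketched as a small helper. This is only an illustration under stated assumptions: `read_csv_with_fallback` and `sample_bytes` are hypothetical names, and the bytes are a made-up stand-in for laptops.csv so the example runs without the file.

```python
import io

import pandas as pd

# Made-up bytes standing in for laptops.csv (assumption, not the real file).
# "é" is stored as the single Latin-1 byte 0xE9, which is invalid as UTF-8 here.
sample_bytes = "Manufacturer,Model Name\nAcer,Aspire é\n".encode("latin-1")


def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "Latin-1", "Windows-1251")):
    """Return (dataframe, encoding) for the first encoding that decodes without error."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


sample_df, used_encoding = read_csv_with_fallback(sample_bytes)
print(used_encoding)
```

Note that Latin-1 assigns a character to every possible byte, so it never raises a decoding error. A successful read therefore does not prove the encoding is the right one; it is still worth inspecting the values.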

In [2]:
import pandas as pd
laptops = pd.read_csv("laptops.csv", encoding="Latin-1")
laptops.head()

Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360


In [6]:
laptops.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB


In [5]:
laptops.describe()

Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
count,1303,1303,1303,1303,1303,1303,1303,1303,1303,1303,1133,1303,1303
unique,19,618,6,18,40,118,9,38,110,7,4,179,791
top,Lenovo,XPS 13,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,Windows,10,2.2kg,149900
freq,297,30,727,665,507,190,619,412,281,1125,1072,121,14


### 2. Cleaning column names
The column labels have a variety of upper and lowercase letters, as well as spaces and parentheses. Because we represent column labels as strings, this isn't inherently a problem, but you'll find that uniform column labels are easier to work with. We'll clean our column labels by:

* Removing any whitespace from the start and end of the labels: if you look closely you'll notice that the " Storage" column label has a space in front of it. These quirks with column labels are often hard to spot, so removing them at the start will save you pain in the long run.
* Replacing spaces with underscores and removing special characters: this makes things consistent, and also allows the use of dot accessors for those who prefer that.
* Making all labels lowercase: this is good for consistency, and means you'll never have to remember what is capitalized and how.
* Shortening any long column names: this keeps your code easier to read, especially when you are using method chaining.

### Instructions
* Define a function, which accepts a string argument, and:
  * Removes any whitespace from the start and end of the string.
  * Replaces the substring Operating System with the abbreviation os.
  * Replaces all spaces with underscores.
  * Removes parentheses from the string.
  * Makes the entire string lowercase.
  * Returns the modified string.
* Use a loop to apply the function to each item in the DataFrame.columns attribute for the laptops dataframe, assigning the result back to the DataFrame.columns attribute.

In [10]:
# clean the dataset's column labels: remove whitespace, convert labels to lowercase,
# and replace spaces with underscores so the labels work with dot accessors
columns = laptops.columns
print(columns)

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')


In [16]:
# clean columns with a function
def clean_column(colname):
    """
    Take a column label (string) and clean it: strip surrounding
    whitespace, abbreviate "Operating System" to "os", replace spaces
    with underscores, remove parentheses, and convert to lowercase.
    """
    colname = colname.strip().replace("Operating System", "os")
    colname = colname.replace(' ', '_').replace("(", "").replace(")", "")
    colname = colname.lower()
    return colname

# iterate over the DataFrame column names
cleaned_columns = []
for item in columns:
    cleaned_item = clean_column(item)
    cleaned_columns.append(cleaned_item)

# assign the cleaned labels back to the DataFrame columns
laptops.columns = cleaned_columns
print(laptops.columns)

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
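
The loop above can also be written as a list comprehension. The sketch below repeats the cleaning function and copies the original labels into a plain list (here called `original_labels`, a name chosen for this example) so that it runs without the dataframe.

```python
def clean_column(colname):
    """Strip whitespace, abbreviate, replace spaces, drop parentheses, lowercase."""
    colname = colname.strip().replace("Operating System", "os")
    colname = colname.replace(" ", "_").replace("(", "").replace(")", "")
    return colname.lower()


# The original labels, copied from the printed Index (note the space in " Storage").
original_labels = [
    "Manufacturer", "Model Name", "Category", "Screen Size", "Screen",
    "CPU", "RAM", " Storage", "GPU", "Operating System",
    "Operating System Version", "Weight", "Price (Euros)",
]

# One expression instead of an explicit loop with append().
cleaned_labels = [clean_column(label) for label in original_labels]
print(cleaned_labels)
```

With the dataframe in hand, the equivalent assignment would be `laptops.columns = [clean_column(c) for c in laptops.columns]`.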


In [17]:
laptops.info(verbose=True) # all columns have the object dtype, indicating that they are all stored as strings

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
manufacturer    1303 non-null object
model_name      1303 non-null object
category        1303 non-null object
screen_size     1303 non-null object
screen          1303 non-null object
cpu             1303 non-null object
ram             1303 non-null object
storage         1303 non-null object
gpu             1303 non-null object
os              1303 non-null object
os_version      1133 non-null object
weight          1303 non-null object
price_euros     1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB


### 3. Converting string columns to numeric

In [18]:
laptops.head()

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,gpu,os,os_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360


### Converting text to numeric workflow
Whenever we're converting text to numeric data, we can follow this data cleaning workflow:
<img src="clean_column.png"/>

In [19]:
# Explore the data in the column
# Use Series.unique() to explore unique values of the columns
print(laptops["screen_size"].dtype)
print(laptops["screen_size"].unique())

object
['13.3"' '15.6"' '15.4"' '14.0"' '12.0"' '11.6"' '17.3"' '10.1"' '13.5"'
 '12.5"' '13.0"' '18.4"' '13.9"' '12.3"' '17.0"' '15.0"' '14.1"' '11.3"']


Our next stage is to **identify patterns and special cases**. We can see that all values in this column follow the same pattern - series of digit and period characters, followed by a quote character. There are no special cases - every value matches the same pattern. We can also observe that we will need to convert the column to a float dtype, as the int dtype won't be able to store the decimal values.
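
One way to confirm that there are no special cases, instead of checking by eye, is to test every unique value against the expected pattern. This is a minimal sketch with the standard `re` module; the values are copied from the output above, and the pattern itself is an assumption about what "digits, a period, digits, then a quote" looks like as a regular expression.

```python
import re

# Unique values copied from the Series.unique() output above.
screen_sizes = [
    '13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"', '17.3"', '10.1"',
    '13.5"', '12.5"', '13.0"', '18.4"', '13.9"', '12.3"', '17.0"', '15.0"',
    '14.1"', '11.3"',
]

# One or more digits, a period, one or more digits, then a closing quote.
pattern = re.compile(r'\d+\.\d+"')

# Collect any value that does not match the whole pattern.
special_cases = [value for value in screen_sizes if not pattern.fullmatch(value)]
print(special_cases)
```

An empty list means every value follows the same pattern.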

The next stage is to **remove the non-digit characters**. The pandas library contains dozens of vectorized string methods, most of which are available using the Series.str accessor. In this case, we can use the Series.str.replace() method, a vectorized version of the Python str.replace() method we used earlier to clean the column labels, to remove all the quote characters:

In [20]:
laptops["screen_size"] = laptops["screen_size"].str.replace('"','') # replace " with nothing
print(laptops["screen_size"].unique())

['13.3' '15.6' '15.4' '14.0' '12.0' '11.6' '17.3' '10.1' '13.5' '12.5'
 '13.0' '18.4' '13.9' '12.3' '17.0' '15.0' '14.1' '11.3']
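
To illustrate what "vectorized" means here, the sketch below compares the element-by-element Python approach with Series.str.replace() on a small made-up Series. Passing `regex=False` is a choice made for this example: it asks pandas to treat the pattern as a literal string, which avoids depending on the default, since that default has differed between pandas versions.

```python
import pandas as pd

# Small made-up sample in the same format as the screen_size column.
sizes = pd.Series(['13.3"', '15.6"', '17.3"'])

# Plain Python: call str.replace() on each element in turn.
plain = [value.replace('"', '') for value in sizes]

# pandas: one call that applies the replacement to the whole Series.
vectorized = sizes.str.replace('"', '', regex=False)

print(plain)
print(vectorized.tolist())
```
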


Now that we've removed the non-digit characters, we can convert (or cast) the column to a numeric dtype. To do this, we use the [Series.astype() method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html). We can pass either int or float as the parameter to convert the column to the respective type:

In [21]:
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].dtype)
print(laptops["screen_size"].unique())

float64
[13.3 15.6 15.4 14.  12.  11.6 17.3 10.1 13.5 12.5 13.  18.4 13.9 12.3
 17.  15.  14.1 11.3]
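
The cast above works because every value is a clean number. As a hedged aside, the sketch below shows what happens when that is not the case, using a made-up Series with one non-numeric entry (the "unknown" value does not come from the dataset). Series.astype() raises an error, while pandas.to_numeric() with `errors="coerce"` turns the unparseable value into NaN.

```python
import pandas as pd

# Made-up sample with one value that is not a number.
raw = pd.Series(["13.3", "15.6", "unknown"])

# astype(float) refuses to convert a string that is not a valid number.
try:
    raw.astype(float)
    cast_failed = False
except ValueError:
    cast_failed = True

# to_numeric with errors="coerce" converts what it can and uses NaN for the rest.
coerced = pd.to_numeric(raw, errors="coerce")

print(cast_failed)
print(coerced)
```

Coercing silently hides the special cases, so it is best used after the exploration step has shown which values will be lost.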


Our **final step is to rename the column**. This step is optional, but it is useful when the non-digit characters carried information that helps us understand the data. In our case, the quote characters denoted that the screen size is in inches. We can use the DataFrame.rename() method to rename specific axis labels, passing a dictionary whose keys are the old label names and whose values are the new label names.

In [22]:
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)
print(laptops.dtypes)

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram                    object
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object
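
The steps of the workflow (remove the non-digit characters, convert the dtype, rename the column) can also be chained. This is a minimal sketch on a made-up three-row dataframe named `sample`, not the laptops data. It uses `rename(columns=...)`, which is equivalent to passing `axis=1`, and returns a new dataframe instead of modifying in place.

```python
import pandas as pd

# Made-up sample with the same format as the original screen_size column.
sample = pd.DataFrame({"screen_size": ['13.3"', '15.6"', '14.0"']})

sample = (
    sample
    # Remove the quote characters, then cast the cleaned strings to float.
    .assign(
        screen_size=lambda df: df["screen_size"]
        .str.replace('"', '', regex=False)
        .astype(float)
    )
    # Record the unit in the column name.
    .rename(columns={"screen_size": "screen_size_inches"})
)

print(sample.dtypes)
```
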

