# Data Cleaning Basics Course

We can start by reading the data into pandas. Let's look at what happens when we use the pandas.read_csv() function with only the filename argument:


In [2]:
import pandas as pd
import numpy as np

We get an error! (The error message has been shortened.) This error references UTF-8, which is a type of encoding. Computers, at their lowest levels, can only understand binary (0 and 1), and encodings are systems for representing characters in binary.

Something we can do if our file has an unknown encoding is to try the most common encodings:

* UTF-8
* Latin-1 (also known as ISO-8859-1)
* Windows-1251
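
Trying the common encodings in turn can be sketched as a loop. This is a minimal sketch, not the course's own code: the temporary file `sample.csv`, its contents, and the names `path` and `used_encoding` are assumptions made so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file: one data row containing "é", saved as Latin-1 bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("Manufacturer,Model Name\nAcer,Aspire é\n")

df = None
used_encoding = None
for encoding in ["UTF-8", "Latin-1", "Windows-1251"]:
    try:
        df = pd.read_csv(path, encoding=encoding)
    except UnicodeDecodeError:
        # This encoding cannot decode the file's bytes; try the next one.
        continue
    used_encoding = encoding
    break

print(used_encoding)
```

Keep in mind that Latin-1 assigns a character to every possible byte value, so reading with it never fails. A successful read is therefore not proof that the encoding is right; it is still worth checking that accented characters look sensible.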

Import the pandas library
1. Use the pandas.read_csv() function to read the laptops.csv file into a dataframe laptops.
2. Specify the encoding using the string "Latin-1".
3. Use the DataFrame.info() method to display information about the laptops dataframe.

In [8]:
laptops = pd.read_csv('laptops.csv', encoding='Latin-1')
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 132.5+ KB


We can see that every column is represented as the object type, indicating that they are represented by strings, not numbers. Also, one of the columns, Operating System Version, has null values.

The column labels have a variety of uppercase and lowercase letters, as well as spaces and parentheses, which will make them harder to work with and read. One noticeable issue is that the " Storage" column name has a space in front of it. These quirks with column labels can be hard to spot, so removing extra whitespace from all column names will save us work in the long run.

We can access the column axis of a dataframe using the DataFrame.columns attribute. This returns an Index object (an immutable, array-like sequence similar to a NumPy ndarray) with the labels of each column:
```python
print(laptops.columns)
Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')
```
Not only can we use the attribute to view the column labels, we can also assign new labels to the attribute:
```python
laptops_test = laptops.copy()
laptops_test.columns = ['A', 'B', 'C', 'D', 'E',
                        'F', 'G', 'H', 'I', 'J',
                        'K', 'L', 'M']
print(laptops_test.columns)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'], dtype='object')
```
Next, let's use the DataFrame.columns attribute to remove whitespace from the column names.

1. Remove any whitespace from the start and end of each column name.
    - Create an empty list named new_columns.
    - Use a for loop to iterate through each column name using the DataFrame.columns attribute. Inside the body of the for loop:
        - Use the str.strip() method to remove whitespace from the start and end of the string.
        - Append the updated column name to the new_columns list.
    - Assign the updated column names to the DataFrame.columns attribute.

In [9]:
new_columns = []
for column in laptops.columns:
    new_columns.append(column.strip())
laptops.columns = new_columns
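
The same cleanup can also be written without an explicit loop. The sketch below assumes a tiny made-up dataframe `df` rather than the laptops data, and uses the `str` accessor that pandas provides on Index objects.

```python
import pandas as pd

# Hypothetical two-column frame whose second label has a leading space.
df = pd.DataFrame({"RAM": ["8GB"], " Storage": ["128GB SSD"]})

# Index.str.strip() strips every label at once and returns a new Index.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

The loop version is easier to extend with extra cleaning steps, while this version is shorter when stripping is all we need.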


In the last exercise, we removed whitespace from the column names. Below is the result:
```python
Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', 'Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')
```
However, the column labels still have a variety of upper and lowercase letters, as well as parentheses, which will make them harder to work with and read. Let's finish cleaning our column labels by:

- Replacing spaces with underscores.
- Removing special characters.
- Making all labels lowercase.
- Shortening any long column names.

We can create a function that uses Python string methods to clean our column labels, and then again use a loop to apply that function to each label. Let's look at an example:
```python
def clean_col(col):
    col = col.strip()
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

new_columns = []
for c in laptops.columns:
    clean_c = clean_col(c)
    new_columns.append(clean_c)

laptops.columns = new_columns
print(laptops.columns)
Index(['manufacturer', 'model name', 'category', 'screen size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'operating system',
       'operating system version', 'weight', 'price euros'],
      dtype='object')
```
Our code:

- Defined a function, which:
    - Used the str.strip() method to remove whitespace from the start and end of the string.
    - Used the str.replace() method to remove parentheses from the string.
    - Used the str.lower() method to make the string lowercase.
    - Returns the modified string.
- Used a loop to apply the function to each item in the index object and assign it back to the DataFrame.columns attribute.
- Printed the new values for the DataFrame.columns attribute.

Let's use this technique to clean the column labels in our dataframe, adding a few extra cleaning 'chores' along the way.

###### Column labels still have a variety of upper and lowercase letters, as well as parentheses, which will make them harder to work with and read. Let's finish cleaning our column labels by:

- Replacing spaces with underscores.
- Removing special characters.
- Making all labels lowercase.
- Shortening any long column names

In [10]:
def clean_label(label):
    label = label.strip()
    label = label.replace('(','').replace(')','').replace('Operating System','os').replace(' ','_')
    label = label.lower()
    return label
new_columns = []
for item in laptops.columns:
    clean_c = clean_label(item)
    new_columns.append(clean_c)

laptops.columns = new_columns
print(laptops.columns)

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
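
As an alternative to building the list by hand, DataFrame.rename() accepts a function for its columns parameter and applies it to every label. The sketch below is an illustration under assumptions: `df` is an empty, made-up dataframe carrying three of the raw labels, and `clean_label` repeats the function from the cell above.

```python
import pandas as pd


def clean_label(label):
    """Strip, drop parentheses, shorten 'Operating System', snake_case, lowercase."""
    label = label.strip()
    label = label.replace("(", "").replace(")", "")
    label = label.replace("Operating System", "os")
    label = label.replace(" ", "_")
    return label.lower()


# Hypothetical frame with a few of the raw labels shown earlier.
df = pd.DataFrame(columns=[" Storage", "Operating System Version", "Price (Euros)"])

# rename() calls clean_label on each column label and returns a new dataframe.
df = df.rename(columns=clean_label)

print(list(df.columns))
```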


- Use the Series.unique() method to identify the unique values in the ram column of the laptops dataframe. 
- Assign the result to unique_ram.
- After running your code, use the variable inspector to view the unique values in the ram column and identify any patterns.

In [12]:
unique_ram = laptops['ram'].unique()
unique_ram

array(['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB'],
      dtype=object)

The pandas library contains dozens of vectorized string methods we can use to manipulate text data, many of which perform the same operations as Python string methods. Most vectorized string methods are available using the Series.str accessor, which means we can access them by adding str between the series name and the method name:
![title](cleaning_workflow.svg)
In this case, we can use the Series.str.replace() method, which is a vectorized version of the Python str.replace() method we used in the previous screen, to remove all the quote characters from every string in the screen_size column:
```python
laptops["screen_size"] = laptops["screen_size"].str.replace('"','')
print(laptops["screen_size"].unique())
['13.3', '15.6', '15.4', '14.0', '12.0', '11.6', '17.3',
 '10.1', '13.5', '12.5', '13.0', '18.4', '13.9', '12.3',
 '17.0', '15.0', '14.1', '11.3']
```
Let's remove the non-digit characters from the ram column next.

### Instructions

- Use the Series.str.replace() method to remove the substring GB from the ram column.
- Use the Series.unique() method to assign the unique values in the ram column to unique_ram.
- After running your code, use the variable inspector to verify your changes.



In [13]:
laptops.ram = laptops.ram.str.replace('GB', '')
unique_ram = laptops.ram.unique()

In the last screen, we used the Series.str.replace() method to remove the non-digit characters from the screen_size and ram columns. Now, we can convert (or cast) the columns to a numeric dtype.

[Figure: string to numeric cleaning workflow]

To do this, we use the Series.astype() method. To convert the column to a numeric dtype, we can use either int or float as the parameter for the method. Since the int dtype can't store decimal values, we'll convert the screen_size column to the float dtype:
```python
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].dtype)
print(laptops["screen_size"].unique())
float64

[13.3, 15.6, 15.4, 14. , 12. , 11.6, 17.3, 10.1, 13.5, 12.5,
 13. , 18.4, 13.9, 12.3, 17. , 15. , 14.1, 11.3]
```
Our screen_size column is now the float64 dtype. Let's convert the dtype of the ram column to numeric next.
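
The order of the workflow matters: the cast only works once the non-digit characters are gone. Here is a small sketch of that, using a made-up series `ram` instead of the real column; `cast_failed` and `ram_clean` are names introduced for the illustration.

```python
import pandas as pd

# Hypothetical RAM values that still carry their unit suffix.
ram = pd.Series(["8GB", "16GB", "4GB"])

try:
    ram.astype(int)
    cast_failed = False
except ValueError:
    # "8GB" is not a valid integer literal, so the cast is rejected.
    cast_failed = True

# After stripping the suffix, the same cast succeeds.
ram_clean = ram.str.replace("GB", "", regex=False).astype(int)

print(cast_failed, ram_clean.tolist())
```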

### Instructions

- Use the Series.astype() method to change the ram column to an integer dtype.
- Use the DataFrame.dtypes attribute to get a list of the column names and types from the laptops dataframe. Assign the result to dtypes.
- After running your code, use the variable inspector to view the dtypes variable to see the results of your code.

In [14]:
laptops["ram"] = laptops["ram"].str.replace('GB','')
laptops["ram"] = laptops.ram.astype(int)
dtypes = laptops.dtypes

Now that we've converted our columns to numeric dtypes, the final step is to rename the column. This is an optional step, and can be useful if the non-digit values contain information that helps us understand the data.

[Figure: string to numeric cleaning workflow]

In our case, the quote characters we removed from the screen_size column denoted that the screen size was in inches. As a reminder, here's what the original values looked like:

['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"',
 '17.3"', '10.1"', '13.5"', '12.5"', '13.0"', '18.4"',
 '13.9"', '12.3"', '17.0"', '15.0"', '14.1"',
 '11.3"']
To stop us from losing information that helps us understand the data, we can use the DataFrame.rename() method to rename the column from screen_size to screen_size_inches.

Below, we specify the axis=1 parameter so pandas knows that we want to rename labels in the column axis:
```python
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)
print(laptops.dtypes)
manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram                    object
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object
```
Note that we can either use inplace=True or assign the result back to the dataframe; both give us the same result.
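
To see that the two approaches agree, we can compare them on a small made-up dataframe. The names `df`, `renamed`, and `df_inplace` are assumptions for this sketch.

```python
import pandas as pd

# Hypothetical one-column frame standing in for laptops.
df = pd.DataFrame({"screen_size": [13.3, 15.6]})

# Option 1: assign the renamed copy to a variable.
renamed = df.rename({"screen_size": "screen_size_inches"}, axis=1)

# Option 2: modify a dataframe in place (rename() then returns None).
df_inplace = df.copy()
df_inplace.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)

print(renamed.equals(df_inplace))
```

One thing to watch for: without inplace=True and without assignment, the renamed dataframe is discarded and the original keeps its old label.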

Let's rename the ram column next and analyze the results.

### Instructions

- Because the GB characters contained useful information about the units (gigabytes) of the laptop's ram, use the DataFrame.rename() method to rename the column from ram to ram_gb.
- Use the Series.describe() method to return a series of descriptive statistics for the ram_gb column. Assign the result to ram_gb_desc.
- After you have run your code, use the variable inspector to see the results of your code.

In [17]:
laptops.ram.describe()

count    1303.000000
mean        8.382195
std         5.084665
min         2.000000
25%         4.000000
50%         8.000000
75%         8.000000
max        64.000000
Name: ram, dtype: float64

In [19]:

laptops.rename({'ram':'ram_gb'}, axis=1, inplace=True)
ram_gb_desc = laptops.ram_gb.describe()
ram_gb_desc


count    1303.000000
mean        8.382195
std         5.084665
min         2.000000
25%         4.000000
50%         8.000000
75%         8.000000
max        64.000000
Name: ram_gb, dtype: float64

Sometimes, it can be useful to extract non-numeric values from within strings. Let's look at the first five values from the gpu (graphics processing unit) column:
```python
print(laptops["gpu"].head())
0    Intel Iris Plus Graphics 640
1          Intel HD Graphics 6000
2           Intel HD Graphics 620
3              AMD Radeon Pro 455
4    Intel Iris Plus Graphics 650

Name: gpu, dtype: object
```
The information in this column seems to be a manufacturer (Intel, AMD) followed by a model name/number. Let's extract the manufacturer by itself so we can find the most common ones.

Because each manufacturer is followed by a whitespace character, we can use the Series.str.split() method to extract this data:

[Figure: extracting data from a string, step 2]

This method splits each string on the whitespace; the result is a series containing individual Python lists. Also note that we used parentheses to method chain over multiple lines, which makes our code easier to read.
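
The split described above might look like the sketch below. It is a reconstruction under assumptions, not the original example: `gpu` is a made-up series holding three of the values printed earlier, and `gpu_parts` is a name chosen for the illustration.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the gpu column shown above.
gpu = pd.Series(
    ["Intel Iris Plus Graphics 640", "Intel HD Graphics 6000", "AMD Radeon Pro 455"],
    name="gpu",
)

# Wrapping the expression in parentheses lets the chain span several lines.
gpu_parts = (gpu
             .head()
             .str.split()
            )

# Each element is now a list of the whitespace-separated words.
print(gpu_parts.iloc[0])
```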

Just like with lists and ndarrays, we can use bracket notation to access the elements in each list in the series. With series, however, we use the str accessor followed by [] (brackets):
```python
print(laptops["gpu"].head().str.split().str[0])
```
Above, we used 0 to select the first element in each list. Below is the result:
```python
0    Intel
1    Intel
2    Intel
3      AMD
4    Intel
Name: gpu, dtype: object
```
Let's use this technique to extract the manufacturer from the cpu column as well. Here are the first 5 rows of the cpu column:
```python
print(laptops["cpu"].head())

0          Intel Core i5 2.3GHz
1          Intel Core i5 1.8GHz
2    Intel Core i5 7200U 2.5GHz
3          Intel Core i7 2.7GHz
4          Intel Core i5 3.1GHz
Name: cpu, dtype: object
```
### Instructions

In the example code, we have extracted the manufacturer name from the gpu column, and assigned it to a new column gpu_manufacturer.

- Extract the manufacturer name from the cpu column. Assign it to a new column cpu_manufacturer.
- Use the Series.value_counts() method to find the counts of each manufacturer in cpu_manufacturer. Assign the result to cpu_manufacturer_counts.

In [22]:
laptops["gpu_manufacturer"] = laptops["gpu"].str.split().str[0]
laptops["cpu_manufacturer"] = laptops["cpu"].str.split().str[0]
cpu_manufacturer_counts = laptops["cpu_manufacturer"].value_counts()
cpu_manufacturer_counts

Intel      1240
AMD          62
Samsung       1
Name: cpu_manufacturer, dtype: int64

If your data has been scraped from a webpage or if there was manual data entry involved at some point, you may end up with inconsistent values. Let's look at an example from our os column:
```python
print(laptops["os"].value_counts())
Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          13
Mac OS          8
Android         2
Name: os, dtype: int64
```
We can see that there are two variations of the Apple operating system, macOS, in our data set: Mac OS and macOS. One way we can fix this is with the Series.map() method. The Series.map() method is ideal when we want to change multiple values in a column, but we'll use it now as an opportunity to learn how the method works.

The most common way to use Series.map() is with a dictionary. Let's look at an example using a series of misspelled fruit:
```python
s = pd.Series(["pair", "oranje", "bananna", "oranje", "oranje", "oranje"])
print(s)
0       pair
1     oranje
2    bananna
3     oranje
4     oranje
5     oranje
dtype: object
```
We'll create a dictionary called corrections and pass that dictionary as an argument to Series.map():
```python
corrections = {
    "pair": "pear",
    "oranje": "orange",
    "bananna": "banana"
}
s = s.map(corrections)
print(s)
0       pear
1     orange
2     banana
3     orange
4     orange
5     orange
dtype: object
```
We can see that each of our corrections was made across our series. One important thing to remember with Series.map() is that if a value from your series doesn't exist as a key in your dictionary, that value is converted to NaN. Let's see what happens when we run map one more time:
```python
s = s.map(corrections)
print(s)
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
dtype: object
```
Because none of the corrected values in our series existed as keys in our dictionary, all values became NaN! This is a very common occurrence, especially when working in a Jupyter notebook, where you can easily re-run cells.
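
Two ways to guard against this are sketched below. These are alternatives offered for illustration, not the approach the exercise asks for: the first fills the gaps left by Series.map() with the original values, and the second uses Series.replace(), which leaves unmatched values untouched. The series `s` here is a made-up, partly corrected example.

```python
import pandas as pd

s = pd.Series(["pear", "oranje", "banana"])
corrections = {"pair": "pear", "oranje": "orange", "bananna": "banana"}

# map() alone turns values without a matching key into NaN ...
mapped = s.map(corrections)

# ... so fill those gaps with the original values.
safe = s.map(corrections).fillna(s)

# Series.replace() with a dictionary leaves unmatched values untouched.
replaced = s.replace(corrections)

print(safe.tolist())
```

Both versions can be re-run any number of times without losing data, at the cost of silently keeping values we forgot to include in the dictionary.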

Let's use Series.map() to clean the values in the os column.

### Instructions

We have created a dictionary for you to use with mapping. Note that we have included both the correct and incorrect spelling of macOS as keys, otherwise we'll end up with null values.

- Use the Series.map() method with the mapping_dict dictionary to correct the values in the os column.

In [24]:
mapping_dict = {
    'Android': 'Android',
    'Chrome OS': 'Chrome OS',
    'Linux': 'Linux',
    'Mac OS': 'macOS',
    'No OS': 'No OS',
    'Windows': 'Windows',
    'macOS': 'macOS'
}
laptops.os = laptops.os.map(mapping_dict)

### -----------
In previous missions, we've talked briefly about missing values and how both NumPy and pandas represent these as null values. In pandas, null values will be indicated by either NaN or None.

Recall that we can use the DataFrame.isnull() method to identify missing values, which returns a boolean dataframe. We can then use the DataFrame.sum() method to give us a count of the True values for each column:
```python
print(laptops.isnull().sum())
manufacturer            0
model_name              0
category                0
screen_size_inches      0
screen                  0
cpu                     0
ram_gb                  0
storage                 0
gpu                     0
os                      0
os_version            170
weight_kg               0
price_euros             0
cpu_manufacturer        0
screen_resolution       0
cpu_speed               0
dtype: int64
```
It's now clear that we have only one column with null values, os_version, which has 170 missing values.
There are a few options for handling missing values:

- Remove any rows that have missing values.
- Remove any columns that have missing values.
- Fill the missing values with some other value.
- Leave the missing values as is.

The first two options are often used to prepare data for machine learning algorithms, many of which can't be used with data that includes null values. We can use the DataFrame.dropna() method to remove or drop rows and columns with null values.

The DataFrame.dropna() method accepts an axis parameter, which indicates whether we want to drop along the column or index axis. Let's look at an example:

<img src="dropna_1.svg" width=600 height=100>



The default value for the axis parameter is 0, so df.dropna() returns an identical result to df.dropna(axis=0):

<img src="dropna_2.svg" width=600 height=100>

The rows with labels x and z contain null values, so those rows are dropped. Let's look at what happens when we use axis=1 to specify the column axis:

<img src="dropna_3.svg" width=600 height=100>

Only the column with label C contains null values, so, in this case, just one column is removed.
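
The behavior described above can be reproduced with a small dataframe. The values below are made up; only the layout (null values in column C, rows x and z) follows the description.

```python
import numpy as np
import pandas as pd

# Hypothetical frame matching the description: nulls only in column C, rows x and z.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [np.nan, 7, np.nan]},
    index=["x", "y", "z"],
)

rows_dropped = df.dropna()        # same as df.dropna(axis=0)
cols_dropped = df.dropna(axis=1)  # drops column C

print(rows_dropped.index.tolist(), cols_dropped.columns.tolist())
```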

Let's practice using DataFrame.dropna() to remove rows and columns:

### Instructions

- Use DataFrame.dropna() to remove any rows from the laptops dataframe that have null values. Assign the result to laptops_no_null_rows.
- Use DataFrame.dropna() to remove any columns from the laptops dataframe that have null values. Assign the result to laptops_no_null_cols.

In [25]:
laptops_no_null_rows = laptops.dropna()
laptops_no_null_cols = laptops.dropna(axis=1)

### Instructions

- Use a boolean array to identify rows that have the value No OS for the os column. Then, use assignment to assign the value Version Unknown to the os_version column for those rows.
- Use the syntax below to create value_counts_after variable:
    value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()

- After running your code, use the variable inspector to look at the difference between value_counts_before and value_counts_after.

In [28]:
value_counts_before = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()
laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"
laptops.loc[laptops["os"] == "No OS", "os_version"] = "Version Unknown"
value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()
value_counts_after

Linux        62
Chrome OS    27
Android       2
Name: os, dtype: int64
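
The boolean-indexing assignment used in this exercise can be tried in isolation. The sketch below uses a made-up four-row dataframe `df`; `no_os` and `remaining` are names chosen for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the os and os_version columns.
df = pd.DataFrame({
    "os": ["Windows", "No OS", "Linux", "No OS"],
    "os_version": ["10", np.nan, np.nan, np.nan],
})

# Boolean series: True for every row whose os value is "No OS".
no_os = df["os"] == "No OS"

# .loc[rows, column] assigns only to the selected rows of os_version.
df.loc[no_os, "os_version"] = "Version Unknown"

# Count, per os, the rows whose os_version is still null.
remaining = df.loc[df["os_version"].isnull(), "os"].value_counts()
print(remaining.to_dict())
```

Naming the column in the second position of .loc is what keeps the assignment from overwriting the os column itself.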

While it appears that the weight column may just need the kg characters removed from the end of each string, there is one special case: one of the values ends with kgs, so you'll have to remove both kg and kgs.

In the last step of this challenge, we'll also ask you to use the DataFrame.to_csv() method to save the cleaned data to a CSV file. It's a good idea to save a CSV when you finish cleaning in case you wish to do analysis later.

We can use the following syntax to save a CSV:
```python
df.to_csv('filename.csv', index=False)
```
By default, pandas will save the index labels as a column in the CSV file. Our data set has integer labels that don't contain any data, so we don't need to save the index.
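
We can see the effect of index=False without writing a file, because DataFrame.to_csv() returns the CSV text when no path is given. The dataframe `df` and its values below are made up for this sketch.

```python
import io

import pandas as pd

df = pd.DataFrame({"ram_gb": [8, 16], "weight_kg": [1.37, 1.83]})

# With no path argument, to_csv() returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty index column
print(without_index.splitlines()[0])  # header contains only the real columns

# Reading the index-free text back yields the original columns only.
round_trip = pd.read_csv(io.StringIO(without_index))
```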

Don't be discouraged if this challenge takes a few attempts to get correct. Working iteratively is a great way to work, and this challenge is more difficult than exercises you have previously completed. We have included some extra hints, but we encourage you to try without the hints first; only use them if you need them!

### Instructions

- Convert the values in the weight column to numeric values.
- Rename the weight column to weight_kg.
- Use the DataFrame.to_csv() method to save the laptops dataframe to a CSV file laptops_cleaned.csv without index labels.

In [38]:
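laptops.weight = laptops.weight.str.replace('kgs','').str.replace('kg','').astype(float)
laptops.rename({'weight':'weight_kg'}, axis=1, inplace=True)
laptops.to_csv('laptops_cleaned.csv', index=False)

The .str accessor only works on string columns. Calling it on a weight column that has already been converted to float raises the following error:

AttributeError: Can only use .str accessor with string values!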
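
Both suffixes can also be removed in a single step with a regular expression. This is an alternative sketch, not the solution above: the series `weight` is made up, and the pattern `kgs?$` is an assumption about how the values end.

```python
import pandas as pd

# Hypothetical weights, including a value that ends in "kgs".
weight = pd.Series(["1.37kg", "1.83kg", "4kgs"], name="weight")

# r"kgs?$" matches "kg" or "kgs" only at the end of the string.
weight_kg = (weight
             .str.replace(r"kgs?$", "", regex=True)
             .astype(float)
             .rename("weight_kg")
            )

print(weight_kg.tolist())
```

Passing regex=True explicitly keeps the behavior the same across pandas versions, since the default for this parameter has changed over time.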