# Data Cleaning Basics

So far, we've learned how to select, assign, and analyze data with pandas using pre-cleaned data. In reality, data is rarely in the format needed to perform analysis. Data scientists commonly spend over half their time cleaning data, so knowing how to clean "messy" data is an extremely important skill.

In this lesson, we'll learn the basics of data cleaning with pandas as we work with `laptops.csv`, a CSV file containing information about 1,300 laptop computers.

We can start by reading the data into pandas. Let's look at what happens when we use the pandas.read_csv() function with only the filename argument:

laptops = pd.read_csv("laptops.csv")

---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 4: invalid continuation byte



We get an error! (The error message has been shortened.) This error references UTF-8, which is a type of encoding. Computers, at their lowest levels, can only understand binary (`0` and `1`), and encodings are systems for representing characters in binary.
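
To make the idea concrete, here is a small sketch (not part of the original lesson) showing how one string becomes different bytes under two encodings. The sample text is an assumption: we chose it because `é` encodes to the byte `0xe9` in Latin-1, the same byte named in the error above, but we can't confirm that this is the character that actually appears in `laptops.csv`.

```python
# Hypothetical sample text; not taken from laptops.csv.
text = "café"

utf8_bytes = text.encode("utf-8")      # "é" becomes two bytes: 0xc3 0xa9
latin1_bytes = text.encode("latin-1")  # "é" becomes one byte: 0xe9

print(utf8_bytes)
print(latin1_bytes)

# Decoding Latin-1 bytes as UTF-8 fails, much like the read_csv() call above.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print(err)

# Decoding with the encoding that produced the bytes recovers the text.
round_trip = latin1_bytes.decode("latin-1")
```

The same bytes are only meaningful once we know which encoding produced them, which is why pandas needs the right `encoding` value.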

Something we can do if our file has an unknown encoding is to try the most common encodings:

 - UTF-8
 - Latin-1 (also known as ISO-8859-1)
 - Windows-1251

The pandas.read_csv() function has an encoding argument we can use to specify an encoding:

`df = pd.read_csv("filename.csv", encoding="some_encoding")`

Since the `pandas.read_csv()` function already tried to read in the file with UTF-8 and failed, we know the file's not encoded with that format. Let's try the next most popular encoding in the exercise.
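
The "try the most common encodings" idea can also be sketched as a loop. This is only an illustration under stated assumptions: `read_with_fallback` is a hypothetical helper, and the sample bytes are a tiny made-up CSV rather than the real `laptops.csv`. Keep in mind that Latin-1 can decode any sequence of bytes, so a successful read tells us only that no error occurred, not that the characters are right.

```python
import io

import pandas as pd

# Hypothetical sample: a tiny CSV encoded as Latin-1 (not the real laptops.csv).
raw_bytes = "name,city\nRené,Orléans\n".encode("latin-1")


def read_with_fallback(data, encodings=("utf-8", "latin-1", "cp1251")):
    """Try each candidate encoding in order and return the first that decodes."""
    for enc in encodings:
        try:
            df = pd.read_csv(io.BytesIO(data), encoding=enc)
            return df, enc
        except UnicodeDecodeError:
            continue  # this encoding can't decode the bytes; try the next one
    raise ValueError("none of the candidate encodings could decode the data")


sample_df, used_encoding = read_with_fallback(raw_bytes)
print(used_encoding)
print(sample_df)
```

UTF-8 goes first because it is strict: it rejects bytes that aren't valid UTF-8, so a failure there is informative. Here `cp1251` is Python's codec name for Windows-1251.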

### Exercise


1. Import the pandas library
2. Use the `pandas.read_csv()` function to read the `laptops.csv` file into a dataframe named `laptops`.
  - Specify the encoding using the string `"Latin-1"`.
3. Use the `DataFrame.info()` method to display information about the `laptops` dataframe.


In [1]:
import pandas as pd
import numpy as np

In [2]:
laptops = pd.read_csv('laptops.csv', encoding='Latin-1')

In [3]:
laptops

Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,57500
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,253745
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,180360
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,Yoga 500-14ISK,2 in 1 Convertible,"14.0""",IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows,10,1.8kg,63800
1299,Lenovo,Yoga 900-13ISK,2 in 1 Convertible,"13.3""",IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows,10,1.3kg,149900
1300,Lenovo,IdeaPad 100S-14IBR,Notebook,"14.0""",1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows,10,1.5kg,22900
1301,HP,15-AC110nv (i7-6500U/6GB/1TB/Radeon,Notebook,"15.6""",1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows,10,2.19kg,76400


In [4]:
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB


We can see that every column is represented as the `object` type, indicating that they are represented by strings, not numbers. Also, one of the columns, `Operating System Version`, has null values.

The column labels have a variety of upper and lowercase letters, as well as spaces and parentheses, which will make them harder to work with and read. One noticeable issue is that the `" Storage"` column name has a space in front of it. These quirks with column labels can sometimes be hard to spot, so removing extra whitespaces from all column names will save us more work in the long run.

We can access the column axis of a dataframe using the `DataFrame.columns` [attribute](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html). This returns an Index object, an immutable array-like sequence, with the labels of each column:

In [5]:
laptops.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

Not only can we use the attribute to view the column labels, we can also assign new labels to the attribute:

In [6]:
laptops_test = laptops.copy()

laptops_test.columns = ['A', 'B', 'C', 'D', 'E',
                        'F', 'G', 'H', 'I', 'J',
                        'K', 'L', 'M']

laptops_test.columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'], dtype='object')
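
One detail worth knowing when assigning to `DataFrame.columns`: the new list needs exactly one label per column. The sketch below uses a hypothetical three-column dataframe (not the laptops data) to show what happens otherwise.

```python
import pandas as pd

# Hypothetical three-column dataframe used only for illustration.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

try:
    df.columns = ["x", "y"]  # two labels for three columns
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print(err)

# The failed assignment leaves the original labels in place.
print(list(df.columns))
```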

### Exercise

Next, let's use the `DataFrame.columns` attribute to remove whitespace from the column names.

1. Remove any whitespace from the start and end of each column name.
   - Create an empty list named `new_columns`.
   - Use a for loop to iterate through each column name using the `DataFrame.columns` attribute. 
     Inside the body of the for loop:
     - Use the `str.strip()` [method](https://docs.python.org/3.6/library/stdtypes.html#str.strip) to remove whitespace from the start and end of the string.
     - Append the updated column name to the `new_columns` list.
   - Assign the updated column names to the `DataFrame.columns` attribute.


In [7]:
new_columns = []

for column_name in laptops.columns:
    new_columns.append(column_name.strip())
    
laptops.columns = new_columns

laptops.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', 'Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')
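
The loop above is the approach the exercise asks for. As a side note, the same result can be reached more compactly. The sketch below uses a hypothetical empty dataframe with stray whitespace in two labels and shows a list comprehension next to the vectorized `Index.str.strip()` method.

```python
import pandas as pd

# Hypothetical dataframe with stray whitespace in two labels.
df = pd.DataFrame(columns=["RAM", " Storage", "GPU "])

# Option 1: a list comprehension over the labels.
stripped_list = [label.strip() for label in df.columns]

# Option 2: a vectorized string method on the Index itself.
df.columns = df.columns.str.strip()

print(stripped_list)
print(list(df.columns))
```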

In the last exercise, we removed whitespace from the column names. Below is the result:

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', 'Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

However, the column labels still have a variety of upper and lowercase letters, as well as spaces and parentheses, which make them harder to work with and read. Let's finish cleaning our column labels by:

- Replacing spaces with underscores.
- Removing special characters.
- Making all labels lowercase.
- Shortening any long column names.

We can create a function that uses Python string methods to clean our column labels, and then again use a loop to apply that function to each label. Let's look at an example:

def clean_col(col):
    col = col.strip()
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

new_columns = []
for c in laptops.columns:
    clean_c = clean_col(c)
    new_columns.append(clean_c)

laptops.columns = new_columns
print(laptops.columns)

Index(['manufacturer', 'model name', 'category', 'screen size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'operating system',
       'operating system version', 'weight', 'price euros'],
      dtype='object')

Our code:

- Defined a function, which:
  - Used the `str.strip()` method to remove whitespace from the start and end of the string.
  - Used the `str.replace()` method to remove parentheses from the string.
  - Used the `str.lower()` method to make the string lowercase.
  - Returned the modified string.
- Used a loop to apply the function to each item in the index object and assigned the result back to the `DataFrame.columns` attribute.
- Printed the new values for the `DataFrame.columns` attribute.

Let's use this technique to clean the column labels in our dataframe, adding a few extra cleaning 'chores' along the way.

### Exercise


1. Define a function, which accepts a string argument, and:
   - Removes any whitespace from the start and end of the string.
   - Replaces the substring `Operating System` with the abbreviation `os`.
   - Replaces all spaces with underscores.
   - Removes parentheses from the string.
   - Makes the entire string lowercase.
   - Returns the modified string.
2. Use a loop to apply the function to each item in the DataFrame.columns attribute for the laptops dataframe. 
3. Assign the result back to the DataFrame.columns attribute.


In [8]:
laptops.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', 'Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

In [9]:
def col_clean(column):
    column = column.strip()
    column = column.replace('Operating System', 'os')
    column = column.replace(' ', '_')
    column = column.replace('(', '')
    column = column.replace(')', '')
    column = column.lower()
    return column

In [10]:
cleaned_columns = []

for column in laptops.columns:
    cleaned_columns.append(col_clean(column))
    
laptops.columns = cleaned_columns

laptops.columns

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
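
The order of the steps matters here. The sketch below restates the cleaning steps in a hypothetical `clean_label` function (the name is ours, not from the lesson) and applies it to three labels with a list comprehension. The abbreviation has to happen before lowercasing, because `str.replace()` is case-sensitive and would no longer find `Operating System` in an all-lowercase string.

```python
def clean_label(label):
    """Apply the cleaning steps from the exercise to one column label."""
    label = label.strip()
    # Abbreviate before lowercasing, because str.replace() is case-sensitive.
    label = label.replace("Operating System", "os")
    label = label.replace(" ", "_")
    label = label.replace("(", "").replace(")", "")
    return label.lower()


# A few labels copied from the original dataframe.
raw_labels = [" Storage", "Operating System Version", "Price (Euros)"]
cleaned = [clean_label(label) for label in raw_labels]
print(cleaned)
```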

We observed earlier that all 13 columns have the object dtype, meaning they're stored as strings. Let's look at the first few rows of some of our columns:

In [11]:
laptops.iloc[:5, 2:5]

Unnamed: 0,category,screen_size,screen
0,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600
1,Ultrabook,"13.3""",1440x900
2,Notebook,"15.6""",Full HD 1920x1080
3,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800
4,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600


These three columns hold three different types of text data:

- `category`: Purely text data. There are no numeric values.
- `screen_size`: Numeric data stored as text data because of the `"` character.
- `screen`: A combination of pure text data with numeric data.

Because the values in the screen_size column are stored as text data, we can't sort them numerically. For instance, if we wanted to select laptops with screens 15" or larger, we'd be unable to do so.
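
A quick way to see the problem is to sort a few values as text. This sketch uses plain Python lists and hypothetical sizes (the `9.7"` value does not appear in `laptops.csv`) to show that strings are ordered character by character rather than by magnitude.

```python
# Hypothetical screen sizes; the 9.7" value does not appear in laptops.csv.
sizes_as_text = ['9.7"', '13.3"', '10.1"']

# Strings are compared character by character, so "1..." sorts before "9...".
text_order = sorted(sizes_as_text)

# Converting to float first gives the numeric order.
numeric_order = sorted(float(s.replace('"', "")) for s in sizes_as_text)

print(text_order)
print(numeric_order)
```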

Let's convert the screen_size column to numeric next. Whenever we convert text to numeric data, we can follow this data cleaning workflow:

(Figure: string-to-numeric cleaning workflow)

The first step is to explore the data. One of the best ways to do this is to use the Series.unique() method to view all of the unique values in the column:

In [12]:
laptops['screen_size'].unique()

array(['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"', '17.3"',
       '10.1"', '13.5"', '12.5"', '13.0"', '18.4"', '13.9"', '12.3"',
       '17.0"', '15.0"', '14.1"', '11.3"'], dtype=object)

In [13]:
print(laptops['screen_size'].dtype)

object




Our next step is to identify patterns and special cases. We can observe the following:

   - All values in this column follow the same pattern - a series of digit and period characters, followed by a quote character (").
   - There are no special cases. Every value matches the same pattern.
   - We'll need to convert the column to a float dtype, as the int dtype won't be able to store the decimal values.
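
With only 18 unique values, we can check the pattern by eye. For longer lists, one option is to test the pattern in code. The sketch below is an illustration, not part of the lesson: it uses a few of the values shown above plus one hypothetical special case (`15 in`) and checks each string against a regular expression with `Series.str.match()`.

```python
import pandas as pd

# Three values from the column plus one hypothetical special case ("15 in").
screen_sizes = pd.Series(['13.3"', '15.6"', '14.0"', '15 in'])

# One or more digits, a period, one or more digits, then a quote character.
pattern = r'^\d+\.\d+"$'
matches = screen_sizes.str.match(pattern)

# Any value listed here would be a special case needing separate handling.
special_cases = screen_sizes[~matches]
print(special_cases.tolist())
```

An empty `special_cases` series would confirm that every value follows the pattern.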

Let's identify any patterns and special cases in the ram column next.

### Exercise


1. Use the Series.unique() method to identify the unique values in the ram column of the laptops dataframe. 
   - Assign the result to unique_ram.


In [14]:
unique_ram = laptops['ram'].unique()
unique_ram

array(['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB'],
      dtype=object)

In the last exercise, we identified a clear pattern in the ram column - all values are integers and include the character GB at the end of the string:

['8GB' '16GB' '4GB' '2GB' '12GB' '6GB' '32GB' '24GB' '64GB']

To convert both the ram and screen_size columns to numeric dtypes, we'll have to first remove the non-digit characters.

The pandas library contains dozens of vectorized string methods we can use to manipulate text data, many of which perform the same operations as Python string methods. Most vectorized string methods are available using the Series.str accessor, which means we can access them by adding str between the series name and the method name:

(Figure: vectorized string methods)

In this case, we can use the Series.str.replace() method, which is a vectorized version of the Python str.replace() method we used in the previous screen, to remove all the quote characters from every string in the screen_size column:

laptops["screen_size"] = laptops["screen_size"].str.replace('"','')
print(laptops["screen_size"].unique())

['13.3', '15.6', '15.4', '14.0', '12.0', '11.6', '17.3',
 '10.1', '13.5', '12.5', '13.0', '18.4', '13.9', '12.3',
 '17.0', '15.0', '14.1', '11.3']
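
One caveat when using `Series.str.replace()`: it has a `regex` parameter whose default changed in pandas 2.0 (from `True` to `False`). The quote character has no special meaning in a regular expression, so the call above behaves the same either way, but passing `regex=False` explicitly makes the intent clear. The sketch below uses a small hypothetical series and also shows `Series.str.rstrip()` as an alternative when the character only appears at the end.

```python
import pandas as pd

# A few of the values from the screen_size column.
screen_sizes = pd.Series(['13.3"', '15.6"', '14.0"'])

# regex=False treats the first argument as a literal string.
no_quotes = screen_sizes.str.replace('"', "", regex=False)

# str.rstrip() is an alternative when the character only appears at the end.
also_no_quotes = screen_sizes.str.rstrip('"')

print(no_quotes.tolist())
print(also_no_quotes.tolist())
```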

Let's remove the non-digit characters from the ram column next.

### Exercise


1. Use the Series.str.replace() method to remove the substring GB from the ram column.
2. Use the Series.unique() method to assign the unique values in the ram column to unique_ram.


In [15]:
laptops['ram'].unique()

array(['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB'],
      dtype=object)

In [16]:
laptops['ram'] = laptops['ram'].str.replace('GB', '')
unique_ram = laptops['ram'].unique()
unique_ram

array(['8', '16', '4', '2', '12', '6', '32', '24', '64'], dtype=object)

In the last screen, we used the Series.str.replace() method to remove the non-digit characters from the screen_size and ram columns. Now, we can convert (or cast) the columns to a numeric dtype. 

To do this, we use the Series.astype() method. To convert the column to a numeric dtype, we can use either int or float as the parameter for the method. Since the int dtype can't store decimal values, we'll convert the screen_size column to the float dtype:

In [17]:
laptops['screen_size'].unique()

array(['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"', '17.3"',
       '10.1"', '13.5"', '12.5"', '13.0"', '18.4"', '13.9"', '12.3"',
       '17.0"', '15.0"', '14.1"', '11.3"'], dtype=object)

In [18]:
laptops['screen_size'] = laptops['screen_size'].str.replace('"', '')
laptops['screen_size'].unique()

array(['13.3', '15.6', '15.4', '14.0', '12.0', '11.6', '17.3', '10.1',
       '13.5', '12.5', '13.0', '18.4', '13.9', '12.3', '17.0', '15.0',
       '14.1', '11.3'], dtype=object)

In [19]:
laptops['screen_size'] = laptops['screen_size'].astype(float)
print(laptops['screen_size'].dtype)

float64


In [20]:
laptops['screen_size'].unique()

array([13.3, 15.6, 15.4, 14. , 12. , 11.6, 17.3, 10.1, 13.5, 12.5, 13. ,
       18.4, 13.9, 12.3, 17. , 15. , 14.1, 11.3])

Our screen_size column is now the float64 dtype. Let's convert the dtype of the ram column to numeric next.

### Exercise

1. Use the `Series.astype()` method to change the `ram` column to an integer dtype.
2. Use the `DataFrame.dtypes` attribute to get the dtype of each column in the `laptops` dataframe. Assign the result to `dtypes`.

In [21]:
laptops['ram'] = laptops['ram'].astype(int)
print(laptops['ram'].dtype)

int64


In [22]:
laptops['ram'].unique()

array([ 8, 16,  4,  2, 12,  6, 32, 24, 64])

In [23]:
laptops.dtypes

manufacturer     object
model_name       object
category         object
screen_size     float64
screen           object
cpu              object
ram               int64
storage          object
gpu              object
os               object
os_version       object
weight           object
price_euros      object
dtype: object
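
The conversions above worked because every value matched the pattern we identified. If a column contains a special case, `Series.astype()` raises an error instead. The sketch below uses hypothetical values (the `n/a` entry does not occur in the `ram` column) to show the failure, along with `pandas.to_numeric()` and `errors="coerce"` as one possible way to turn unparseable values into `NaN`.

```python
import pandas as pd

# Hypothetical values; "n/a" stands in for a special case not found in the ram column.
ram = pd.Series(["8", "16", "n/a"])

try:
    ram.astype(int)
    astype_failed = False
except ValueError as err:
    astype_failed = True
    print(err)

# errors="coerce" turns unparseable strings into NaN instead of raising.
ram_numeric = pd.to_numeric(ram, errors="coerce")
print(ram_numeric.tolist())
```

Because `NaN` is a float, the coerced result is a float column rather than an integer one.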

Now that we've converted our columns to numeric dtypes, the final step is to rename the column. This is an optional step, and can be useful if the non-digit values contain information that helps us understand the data. 

![Cleaning Workflow Image](cleaning_workflow.svg)

In our case, the quote characters we removed from the screen_size column denoted that the screen size was in inches. As a reminder, the original values looked like `13.3"` and `15.6"`.

To stop us from losing information that helps us understand the data, we can use the [DataFrame.rename()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) method to rename the column from screen_size to screen_size_inches.

Below, we specify the axis=1 parameter so pandas knows that we want to rename labels in the column axis:


In [24]:
laptops.dtypes

manufacturer     object
model_name       object
category         object
screen_size     float64
screen           object
cpu              object
ram               int64
storage          object
gpu              object
os               object
os_version       object
weight           object
price_euros      object
dtype: object

In [25]:
laptops.rename({'screen_size':'screen_size_inches'}, axis=1, inplace=True)

In [26]:
laptops.dtypes

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram                     int64
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object

Note that we can either use inplace=True or assign the result back to the dataframe - both will give us the same results.
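
Here is a small sketch of the assignment form on a hypothetical one-column dataframe. It also shows the `columns` keyword, which does the same job as passing `axis=1`.

```python
import pandas as pd

# Hypothetical one-column dataframe used only for illustration.
df = pd.DataFrame({"screen_size": [13.3, 15.6]})

# Option 1: assign the returned dataframe to a variable.
renamed = df.rename({"screen_size": "screen_size_inches"}, axis=1)

# Option 2: the columns keyword, which avoids the axis argument.
renamed_kw = df.rename(columns={"screen_size": "screen_size_inches"})

# Without inplace=True, the original dataframe keeps its old label.
print(list(df.columns), list(renamed.columns), list(renamed_kw.columns))
```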

Let's rename the ram column next and analyze the results.

### Exercise

1. Because the GB characters contained useful information about the units (gigabytes) of the laptop's RAM, use the DataFrame.rename() method to rename the column from ram to ram_gb.
2. Use the Series.describe() method to return a series of descriptive statistics for the ram_gb column. Assign the result to ram_gb_desc.

In [27]:
laptops.rename({'ram':'ram_gb'}, inplace=True, axis=1)

In [28]:
laptops.dtypes

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object

In [29]:
ram_gb_desc = laptops['ram_gb'].describe()
ram_gb_desc

count    1303.000000
mean        8.382195
std         5.084665
min         2.000000
25%         4.000000
50%         8.000000
75%         8.000000
max        64.000000
Name: ram_gb, dtype: float64

In [30]:
laptops.describe()

Unnamed: 0,screen_size_inches,ram_gb
count,1303.0,1303.0
mean,15.017191,8.382195
std,1.426304,5.084665
min,10.1,2.0
25%,14.0,4.0
50%,15.6,8.0
75%,15.6,8.0
max,18.4,64.0


In [31]:
type(ram_gb_desc)

pandas.core.series.Series

Sometimes, it can be useful to extract non-numeric values from within strings. Let's look at the first five values from the gpu (graphics processing unit) column:

In [32]:
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   manufacturer        1303 non-null   object 
 1   model_name          1303 non-null   object 
 2   category            1303 non-null   object 
 3   screen_size_inches  1303 non-null   float64
 4   screen              1303 non-null   object 
 5   cpu                 1303 non-null   object 
 6   ram_gb              1303 non-null   int64  
 7   storage             1303 non-null   object 
 8   gpu                 1303 non-null   object 
 9   os                  1303 non-null   object 
 10  os_version          1133 non-null   object 
 11  weight              1303 non-null   object 
 12  price_euros         1303 non-null   object 
dtypes: float64(1), int64(1), object(11)
memory usage: 132.5+ KB


In [33]:
laptops['gpu'].head(5)

0    Intel Iris Plus Graphics 640
1          Intel HD Graphics 6000
2           Intel HD Graphics 620
3              AMD Radeon Pro 455
4    Intel Iris Plus Graphics 650
Name: gpu, dtype: object

The information in this column seems to be a manufacturer (Intel, AMD) followed by a model name/number. Let's extract the manufacturer by itself so we can find the most common ones.

Because each manufacturer is followed by a whitespace character, we can use the [Series.str.split()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html) method to extract this data:

![String Split Image](str_split_2.svg)

This method splits each string on the whitespace; the result is a series containing individual Python lists. Also note that we used parentheses to method chain over multiple lines, which makes our code easier to read.

Just like with lists and ndarrays, we can use bracket notation to access the elements in each list in the series. With series, however, we use the str accessor followed by [] (brackets):

In [34]:
laptops['gpu'].head().str.split().str[0]

0    Intel
1    Intel
2    Intel
3      AMD
4    Intel
Name: gpu, dtype: object

In [35]:
type(laptops['gpu'].head().str.split())

pandas.core.series.Series
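
If we wanted to keep the rest of the string as well, `Series.str.split()` accepts `n` and `expand` parameters. The sketch below uses two values from the gpu column; the column names `gpu_manufacturer` and `gpu_model` are our own choice for illustration.

```python
import pandas as pd

# Two values taken from the gpu column shown above.
gpu = pd.Series(["Intel Iris Plus Graphics 640", "AMD Radeon Pro 455"])

# n=1 splits only at the first whitespace; expand=True returns a dataframe.
parts = gpu.str.split(n=1, expand=True)
parts.columns = ["gpu_manufacturer", "gpu_model"]  # hypothetical column names

print(parts)
```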

Let's use this technique to extract the manufacturer from the cpu column as well. Here are the first 5 rows of the cpu column:

In [36]:
laptops['cpu'].head()

0          Intel Core i5 2.3GHz
1          Intel Core i5 1.8GHz
2    Intel Core i5 7200U 2.5GHz
3          Intel Core i7 2.7GHz
4          Intel Core i5 3.1GHz
Name: cpu, dtype: object

### Exercise

In the example code, we have extracted the manufacturer name from the gpu column, and assigned it to a new column gpu_manufacturer.

1. Extract the manufacturer name from the cpu column. Assign it to a new column cpu_manufacturer.
2. Use the Series.value_counts() method to find the counts of each manufacturer in cpu_manufacturer. Assign the result to cpu_manufacturer_counts.


In [37]:
laptops['gpu_manufacturer'] = (laptops['gpu']
                                        .str.split()
                                        .str[0])

In [38]:
laptops.dtypes

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
gpu_manufacturer       object
dtype: object

In [39]:
laptops['gpu_manufacturer'].unique()

array(['Intel', 'AMD', 'Nvidia', 'ARM'], dtype=object)

In [40]:
laptops['gpu_manufacturer'].value_counts()

Intel     722
Nvidia    400
AMD       180
ARM         1
Name: gpu_manufacturer, dtype: int64

In [41]:
laptops['cpu'].head()

0          Intel Core i5 2.3GHz
1          Intel Core i5 1.8GHz
2    Intel Core i5 7200U 2.5GHz
3          Intel Core i7 2.7GHz
4          Intel Core i5 3.1GHz
Name: cpu, dtype: object

In [42]:
laptops['cpu'].str.split().str[0].head()

0    Intel
1    Intel
2    Intel
3    Intel
4    Intel
Name: cpu, dtype: object

In [43]:
laptops['cpu_manufacturer'] = (laptops['cpu']
                                       .str.split()
                                       .str[0])

In [44]:
cpu_manufacturer_counts = laptops['cpu_manufacturer'].value_counts()
cpu_manufacturer_counts

Intel      1240
AMD          62
Samsung       1
Name: cpu_manufacturer, dtype: int64

In [45]:
pd.DataFrame(cpu_manufacturer_counts)

Unnamed: 0,cpu_manufacturer
Intel,1240
AMD,62
Samsung,1


If your data has been scraped from a webpage or if there was manual data entry involved at some point, you may end up with inconsistent values. Let's look at an example from our os column:

In [46]:
laptops['os'].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          13
Mac OS          8
Android         2
Name: os, dtype: int64

We can see that there are two variations of the Apple operating system in our dataset: `Mac OS` and `macOS`. One way we can fix this is with the [Series.map()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) method. The `Series.map()` method is ideal when we want to change multiple values in a column, but we'll use it now as an opportunity to learn how the method works.

The most common way to use Series.map() is with a dictionary. Let's look at an example using a series of misspelled fruit:

In [47]:
d = ['pair', 'oranje', 'bananna', 'oranje', 'oranje', 'oranje']
s = pd.Series(data=d)
s

0       pair
1     oranje
2    bananna
3     oranje
4     oranje
5     oranje
dtype: object

We'll create a dictionary called corrections and pass that dictionary as an argument to Series.map():

In [48]:
corrections = {
    "pair": "pear",
    "oranje": "orange",
    "bananna": "banana"
}

s = s.map(corrections)
s

0      pear
1    orange
2    banana
3    orange
4    orange
5    orange
dtype: object

We can see that each of our corrections was made across our series. One important thing to remember with Series.map() is that if a value from your series doesn't exist as a key in your dictionary, that value is converted to NaN. Let's see what happens when we run map one more time:

In [49]:
s = s.map(corrections)
s

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
dtype: object

Because none of the corrected values in our series existed as keys in our dictionary, all values became NaN! This is a very common occurrence, especially when working in a Jupyter notebook, where you can easily re-run cells.
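
If we'd rather have unmatched values left alone, two alternatives are worth knowing. The sketch below is an aside rather than the lesson's method. It uses a hypothetical series that includes one correctly spelled value (`apple`) and compares `Series.replace()` with `Series.map()` followed by `Series.fillna()`.

```python
import pandas as pd

# Hypothetical series: three misspellings plus one value with no correction.
s = pd.Series(["pair", "oranje", "bananna", "apple"])
corrections = {"pair": "pear", "oranje": "orange", "bananna": "banana"}

# Series.replace() changes only the values found in the dictionary.
fixed_once = s.replace(corrections)
fixed_twice = fixed_once.replace(corrections)  # re-running changes nothing

# Series.map() followed by fillna() also keeps the unmatched values.
mapped = s.map(corrections).fillna(s)

print(fixed_twice.tolist())
print(mapped.tolist())
```

Both approaches can be re-run safely, which helps in a notebook where cells are often executed more than once.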

Let's use Series.map() to clean the values in the os column.

### Exercise

We have created a dictionary for you to use with mapping. Note that we have included both the correct and incorrect spelling of macOS as keys, otherwise we'll end up with null values.

1. Use the Series.map() method with the mapping_dict dictionary to correct the values in the os column.

In [50]:
mapping_dict = {
    'Android': 'Android',
    'Chrome OS': 'Chrome OS',
    'Linux': 'Linux',
    'Mac OS': 'macOS',
    'No OS': 'No OS',
    'Windows': 'Windows',
    'macOS': 'macOS'
}

In [51]:
laptops['os'] = laptops['os'].map(mapping_dict)

In [52]:
laptops['os'].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          21
Android         2
Name: os, dtype: int64
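
Before mapping a real column, it can help to check that every value has a key, so that nothing is silently turned into NaN. The sketch below is one way to do that, using a hypothetical sample of values and a deliberately incomplete mapping.

```python
import pandas as pd

# Hypothetical sample of os values and a mapping that is missing one key.
os_values = pd.Series(["Windows", "Mac OS", "macOS", "Linux"])
mapping = {"Windows": "Windows", "Mac OS": "macOS", "macOS": "macOS"}

# Values that map() would turn into NaN because they have no key.
unmapped = set(os_values.dropna().unique()) - set(mapping)
print(unmapped)
```

An empty set means the dictionary covers every value in the column.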

### Learn

In previous lessons, we've talked briefly about missing values and how both NumPy and pandas represent these as null values. In pandas, null values will be indicated by either NaN or None.

Recall that we can use the [DataFrame.isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) method to identify missing values, which returns a boolean dataframe. We can then use the DataFrame.sum() method to give us a count of the True values for each column:

In [53]:
laptops.isnull().sum()

manufacturer            0
model_name              0
category                0
screen_size_inches      0
screen                  0
cpu                     0
ram_gb                  0
storage                 0
gpu                     0
os                      0
os_version            170
weight                  0
price_euros             0
gpu_manufacturer        0
cpu_manufacturer        0
dtype: int64

It's now clear that we have only one column with null values, os_version, which has 170 missing values.

There are a few options for handling missing values:

- Remove any rows that have missing values.
- Remove any columns that have missing values.
- Fill the missing values with some other value.
- Leave the missing values as is.

The first two options are often used to prepare data for machine learning algorithms, many of which can't handle null values. We can use the DataFrame.dropna() method to remove or drop rows and columns with null values.

The [DataFrame.dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method accepts an axis parameter, which indicates whether we want to drop along the column or index axis. Let's look at an example:

![drop na image 1](dropna_1.svg)

The default value for the axis parameter is 0, so df.dropna() returns an identical result to df.dropna(axis=0):

![drop na image 2](dropna_2.svg)

The rows with labels x and z contain null values, so those rows are dropped. Let's look at what happens when we use axis=1 to specify the column axis:

![drop na image 3](dropna_3.svg)

Only the column with label C contains null values, so, in this case, just one column is removed.
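
The behavior described above can be reproduced with a small dataframe. In this sketch the labels follow the description (nulls in rows `x` and `z`, only in column `C`), but the values and the number of rows are assumptions, since they aren't given in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data arranged so that only column C, in rows x and z, has nulls.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [np.nan, 7.0, np.nan]},
    index=["x", "y", "z"],
)

rows_dropped = df.dropna()        # same as df.dropna(axis=0)
cols_dropped = df.dropna(axis=1)  # drops every column that contains a null

print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
```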

Let's practice using DataFrame.dropna() to remove rows and columns:

### Exercise

1. Use DataFrame.dropna() to remove any rows from the laptops dataframe that have null values. Assign the result to laptops_no_null_rows.
2. Use DataFrame.dropna() to remove any columns from the laptops dataframe that have null values. Assign the result to laptops_no_null_cols.


In [54]:
laptops.shape

(1303, 15)

In [55]:
laptops_no_null_rows = laptops.dropna()
laptops_no_null_rows.shape

(1133, 15)

In [56]:
laptops_no_null_cols = laptops.dropna(axis=1)
laptops_no_null_cols.shape

(1303, 14)

### Lesson

In the previous screen, we learned there are various ways to deal with missing values:

- Remove any rows that have missing values.
- Remove any columns that have missing values.
- Fill the missing values with some other value.
- Leave the missing values as is.

While dropping rows or columns is the easiest approach to deal with missing values, it may not always be the best approach. For example, removing a disproportionate amount of one manufacturer's laptops could change our analysis.

Because of this, it's a good idea to explore the missing values in the os_version column before making a decision. We can use [Series.value_counts()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to explore all of the values in the column, but we'll use a parameter we haven't seen before:

In [57]:
laptops['os_version'].value_counts(dropna=False)

10      1072
NaN      170
7         45
X          8
10 S       8
Name: os_version, dtype: int64

Because we set the dropna parameter to False, the result includes null values. We can see that the majority of values in the column are 10 and missing values are the next most common.
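To make the effect of the dropna parameter concrete, here is a minimal sketch on a hypothetical Series (the values are invented and are not taken from `laptops.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for a column such as os_version.
versions = pd.Series(["10", "10", np.nan, "7", np.nan, "10"])

counts_default = versions.value_counts()                 # nulls are ignored
counts_with_nulls = versions.value_counts(dropna=False)  # nulls get their own row

print(counts_default.sum())     # 4
print(counts_with_nulls.sum())  # 6
```

With the default setting the counts add up to the number of non-null values only, which is why missing values are easy to overlook.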

Let's also explore the os column, since it's closely related to the os_version column. We'll only look at rows in which the os_version is missing:

In [58]:
os_with_null_values = laptops.loc[laptops['os_version'].isnull(), 'os']
os_with_null_values.value_counts()

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64

Immediately, we can observe a few things:

- The most frequent value is "No OS". This is important to note because if there is no operating system, there shouldn't be a version defined in the os_version column.
- Thirteen of the laptops that come with macOS do not specify the version. We can use our knowledge of macOS to confirm that os_version should be equal to X.

In both of these cases, we can fill the missing values to make our data more correct. For the rest of the values, it's probably best to leave them as missing, since we have no reliable way to infer the version.

We can use assignment with a boolean comparison to perform this replacement, like below:

`laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"`
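Note that the assignment above sets os_version for every macOS row, including any row that already has a version. If only the missing entries should be filled, one option is to combine the comparison with an `isnull()` mask. The sketch below uses a hypothetical four-row dataframe (`toy`), not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real laptops dataframe has many more columns.
toy = pd.DataFrame(
    {
        "os": ["macOS", "macOS", "Windows", "No OS"],
        "os_version": [np.nan, "X", "10", np.nan],
    }
)

# True only where the os is macOS AND the version is missing.
mac_missing = (toy["os"] == "macOS") & toy["os_version"].isnull()
toy.loc[mac_missing, "os_version"] = "X"

print(toy["os_version"].isnull().sum())  # 1 (the "No OS" row is still missing)
```

The parentheses around the comparison matter, because `&` binds more tightly than `==` in Python.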

For rows with No OS values, let's replace the missing value in the os_version column with the value Version Unknown.

### Exercise


1. Use a boolean array to identify rows that have the value No OS for the os column. Then, use assignment to assign the value Version Unknown to the os_version column for those rows.

2. Use the syntax below to create the value_counts_after variable:

    `value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()`

3. After running your code, use the variable inspector to look at the difference between value_counts_before and value_counts_after.


In [59]:
value_counts_before = laptops.loc[laptops['os_version'].isnull(), 'os'].value_counts()
value_counts_before

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64

In [60]:
laptops.loc[laptops['os'] == 'macOS', 'os_version'] = 'X'
laptops.loc[laptops['os'] == 'No OS', 'os_version'] = 'Version Unknown'

value_counts_after = laptops.loc[laptops['os_version'].isnull(), 'os'].value_counts()
value_counts_after

Linux        62
Chrome OS    27
Android       2
Name: os, dtype: int64

### Learn
Now it's time to practice what we've learned so far! In this challenge, we'll clean the weight column. Let's look at a sample of the data in that column:

In [61]:
laptops.dtypes

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
gpu_manufacturer       object
cpu_manufacturer       object
dtype: object

In [62]:
laptops['weight'].head()

0    1.37kg
1    1.34kg
2    1.86kg
3    1.83kg
4    1.37kg
Name: weight, dtype: object

In [63]:
laptops['weight'].isnull().sum()

0

Your challenge is to convert the values in this column to numeric values. As a reminder, here's the data cleaning workflow you can use:

![Cleaning Workflow](cleaning_workflow.svg)

While it appears that the weight column may just need the kg characters removed from the end of each string, there is one special case - one of the values ends with kgs, so you'll have to remove both kg and kgs characters.
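As a hint, one possible way to handle both suffixes in a single step is a regular expression. This is an alternative sketch rather than the only approach, and the Series below is hypothetical:

```python
import pandas as pd

# Hypothetical weights, including one value with the "kgs" suffix.
weights = pd.Series(["1.37kg", "2.2kg", "4kgs"])

# "kgs?$" matches "kg" or "kgs", but only at the end of the string.
cleaned = weights.str.replace(r"kgs?$", "", regex=True).astype("float64")

print(cleaned.tolist())  # [1.37, 2.2, 4.0]
```

Passing `regex=True` explicitly keeps the behavior the same across pandas versions, since the default for that parameter has changed over time.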

In the last step of this challenge, we'll also ask you to use the [DataFrame.to_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) method to save the cleaned data to a CSV file. It's a good idea to save a CSV when you finish cleaning in case you wish to do analysis later.

We can use the following syntax to save a CSV:

`df.to_csv('filename.csv', index=False)`

By default, pandas will save the index labels as a column in the CSV file. Our dataset has integer labels that don't contain any data, so we don't need to save the index.
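The difference is easiest to see by writing the same dataframe both ways. The sketch below uses a hypothetical two-row dataframe and writes to in-memory buffers (`io.StringIO`) instead of files:

```python
import io
import pandas as pd

# Hypothetical two-row frame with the default integer index.
df = pd.DataFrame({"manufacturer": ["Apple", "HP"], "weight_kg": [1.37, 1.86]})

with_index = io.StringIO()
without_index = io.StringIO()
df.to_csv(with_index)                  # index labels written as a first, unnamed column
df.to_csv(without_index, index=False)  # only the real columns are written

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)     # ,manufacturer,weight_kg
print(header_without)  # manufacturer,weight_kg

# Reading the first version back shows the saved index as an extra column.
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(reloaded.columns.tolist())  # ['Unnamed: 0', 'manufacturer', 'weight_kg']
```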

Don't be discouraged if this challenge takes a few attempts to get right. Iterating on a solution is a normal way to work, and this challenge is more difficult than the exercises you have completed so far. We have included some extra hints, but we encourage you to try without them first and only use them if you need them!

### Exercise


1. Convert the values in the weight column to numeric values.
2. Rename the weight column to weight_kg.
3. Use the DataFrame.to_csv() method to save the laptops dataframe to a CSV file laptops_cleaned.csv without index labels.

In [64]:
laptops['weight'].value_counts()

2.2kg     121
2.1kg      58
2.4kg      44
2.3kg      41
2.5kg      38
         ... 
2.83kg      1
2.03kg      1
1.55kg      1
1.70kg      1
2.99kg      1
Name: weight, Length: 179, dtype: int64

In [65]:
laptops['weight'] = laptops['weight'].str.replace('kgs', '').str.replace('kg', '')
laptops['weight'].unique()

array(['1.37', '1.34', '1.86', '1.83', '2.1', '2.04', '1.3', '1.6', '2.2',
       '0.92', '1.22', '0.98', '2.5', '1.62', '1.91', '2.3', '1.35',
       '1.88', '1.89', '1.65', '2.71', '1.2', '1.44', '2.8', '2', '2.65',
       '2.77', '3.2', '0.69', '1.49', '2.4', '2.13', '2.43', '1.7', '1.4',
       '1.8', '1.9', '3', '1.252', '2.7', '2.02', '1.63', '1.96', '1.21',
       '2.45', '1.25', '1.5', '2.62', '1.38', '1.58', '1.85', '1.23',
       '1.26', '2.16', '2.36', '2.05', '1.32', '1.75', '0.97', '2.9',
       '2.56', '1.48', '1.74', '1.1', '1.56', '2.03', '1.05', '4.4',
       '1.90', '1.29', '2.0', '1.95', '2.06', '1.12', '1.42', '3.49',
       '3.35', '2.23', '4.42', '2.69', '2.37', '4.7', '3.6', '2.08',
       '4.3', '1.68', '1.41', '4.14', '2.18', '2.24', '2.67', '2.14',
       '1.36', '2.25', '2.15', '2.19', '2.54', '3.42', '1.28', '2.33',
       '1.45', '2.79', '1.84', '2.6', '2.26', '3.25', '1.59', '1.13',
       '1.78', '1.10', '1.15', '1.27', '1.43', '2.31', '1.16', '1.64',
   

In [66]:
laptops['weight'].value_counts()

2.2     121
2.1      58
2.4      44
2.3      41
2.5      38
       ... 
2.72      1
2.99      1
2.34      1
1.31      1
1.70      1
Name: weight, Length: 179, dtype: int64

In [67]:
print(laptops['weight'].dtypes)

object


In [68]:
laptops['weight'] = laptops['weight'].astype('float64')
laptops['weight'].dtypes

dtype('float64')

In [69]:
laptops.rename({'weight':'weight_kg'}, axis=1, inplace=True)

In [70]:
laptops.dtypes

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight_kg             float64
price_euros            object
gpu_manufacturer       object
cpu_manufacturer       object
dtype: object

In [71]:
laptops.to_csv('laptops_cleaned.csv', index=False)

Our dataset is ready for some analysis, but there are still some data cleaning tasks left! Here are your next steps:

- Convert the price_euros column to a numeric dtype.
- Extract the screen resolution from the screen column.
- Extract the processor speed from the cpu column.

Here are some questions you might like to answer in your own time by analyzing the cleaned data:

- Are laptops made by Apple more expensive than those made by other manufacturers?
- What is the best value laptop with a screen size of 15" or more?
- Which laptop has the most storage space?


### Convert the price_euros column to a numeric dtype.

In [72]:
laptops.dtypes

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight_kg             float64
price_euros            object
gpu_manufacturer       object
cpu_manufacturer       object
dtype: object

In [73]:
laptops['price_euros']

0       1339,69
1        898,94
2        575,00
3       2537,45
4       1803,60
         ...   
1298     638,00
1299    1499,00
1300     229,00
1301     764,00
1302     369,00
Name: price_euros, Length: 1303, dtype: object

In [74]:
laptops['price_euros'] = laptops['price_euros'].str.replace(',', '.')

In [75]:
laptops['price_euros'].unique()

array(['1339.69', '898.94', '575.00', '2537.45', '1803.60', '400.00',
       '2139.97', '1158.70', '1495.00', '770.00', '393.90', '344.99',
       '2439.97', '498.90', '1262.40', '1518.55', '745.00', '2858.00',
       '499.00', '979.00', '191.90', '999.00', '258.00', '819.00',
       '659.00', '418.64', '1099.00', '800.00', '1298.00', '896.00',
       '244.99', '199.00', '439.00', '1869.00', '998.00', '249.00',
       '367.00', '488.69', '879.00', '389.00', '1499.00', '522.99',
       '682.00', '1419.00', '369.00', '1299.00', '639.00', '466.00',
       '319.00', '841.00', '398.49', '1103.00', '384.00', '767.80',
       '586.19', '2449.00', '415.00', '599.00', '941.00', '690.00',
       '1983.00', '438.69', '229.00', '549.00', '949.00', '1089.00',
       '955.00', '870.00', '1095.00', '519.00', '855.00', '530.00',
       '977.00', '1096.16', '1510.00', '860.00', '399.00', '395.00',
       '1349.00', '699.00', '598.99', '1449.00', '1649.00', '689.00',
       '1197.00', '1195.00', '1049.0

In [76]:
laptops['price_euros'] = laptops['price_euros'].astype('float64')

In [77]:
laptops['price_euros'].dtypes

dtype('float64')
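`astype('float64')` raises an error if any value can't be parsed as a number. An alternative, sketched here on hypothetical data, is `pd.to_numeric()` with `errors="coerce"`, which turns unparseable values into NaN so they can be inspected afterwards:

```python
import pandas as pd

# Hypothetical prices written with a decimal comma, plus one unparseable entry.
prices = pd.Series(["1339,69", "898,94", "n/a"])

# Swap the decimal comma for a point, then coerce anything non-numeric to NaN.
numeric = pd.to_numeric(prices.str.replace(",", ".", regex=False), errors="coerce")

print(numeric.dtype)           # float64
print(numeric.isnull().sum())  # 1
```

Whether silently coercing is appropriate depends on the data; counting the resulting nulls, as above, is one way to check how many values were affected.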

### Extract the screen resolution from the screen column.

In [78]:
laptops['screen'].unique()

array(['IPS Panel Retina Display 2560x1600', '1440x900',
       'Full HD 1920x1080', 'IPS Panel Retina Display 2880x1800',
       '1366x768', 'IPS Panel Full HD 1920x1080',
       'IPS Panel Retina Display 2304x1440',
       'IPS Panel Full HD / Touchscreen 1920x1080',
       'Full HD / Touchscreen 1920x1080',
       'Touchscreen / Quad HD+ 3200x1800',
       'IPS Panel Touchscreen 1920x1200', 'Touchscreen 2256x1504',
       'Quad HD+ / Touchscreen 3200x1800', 'IPS Panel 1366x768',
       'IPS Panel 4K Ultra HD / Touchscreen 3840x2160',
       'IPS Panel Full HD 2160x1440',
       '4K Ultra HD / Touchscreen 3840x2160', 'Touchscreen 2560x1440',
       '1600x900', 'IPS Panel 4K Ultra HD 3840x2160',
       '4K Ultra HD 3840x2160', 'Touchscreen 1366x768',
       'IPS Panel Full HD 1366x768', 'IPS Panel 2560x1440',
       'IPS Panel Full HD 2560x1440',
       'IPS Panel Retina Display 2736x1824', 'Touchscreen 2400x1600',
       '2560x1440', 'IPS Panel Quad HD+ 2560x1440',
       'IPS Panel 

In [79]:
laptops['screen'].value_counts()

Full HD 1920x1080                                507
1366x768                                         281
IPS Panel Full HD 1920x1080                      230
IPS Panel Full HD / Touchscreen 1920x1080         53
Full HD / Touchscreen 1920x1080                   47
1600x900                                          23
Touchscreen 1366x768                              16
Quad HD+ / Touchscreen 3200x1800                  15
IPS Panel 4K Ultra HD 3840x2160                   12
IPS Panel 4K Ultra HD / Touchscreen 3840x2160     11
4K Ultra HD / Touchscreen 3840x2160               10
Touchscreen 2560x1440                              7
IPS Panel 1366x768                                 7
4K Ultra HD 3840x2160                              7
IPS Panel Retina Display 2304x1440                 6
IPS Panel Quad HD+ / Touchscreen 3200x1800         6
IPS Panel Retina Display 2560x1600                 6
Touchscreen 2256x1504                              6
IPS Panel Touchscreen 2560x1440               

In [80]:
laptops['screen_resolution'] = laptops['screen'].str.split().str[-1]
laptops['screen_resolution'].value_counts()

1920x1080    841
1366x768     308
3840x2160     43
3200x1800     27
1600x900      23
2560x1440     23
2304x1440      6
2256x1504      6
2560x1600      6
1920x1200      5
2880x1800      4
2400x1600      4
1440x900       4
2160x1440      2
2736x1824      1
Name: screen_resolution, dtype: int64
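The resolution is still a string at this point. If numeric width and height are needed, one option is `Series.str.extract()` with two capture groups. This is a sketch on hypothetical values, and the column names `width_px` and `height_px` are made up for the example:

```python
import pandas as pd

# Hypothetical screen descriptions in the same style as the screen column.
screens = pd.Series(
    ["IPS Panel Retina Display 2560x1600", "1440x900", "Full HD 1920x1080"]
)

# Two capture groups: the digits before and after the "x" at the end of the string.
resolution = screens.str.extract(r"(\d+)x(\d+)$").astype("int64")
resolution.columns = ["width_px", "height_px"]  # hypothetical column names

print(resolution.values.tolist())  # [[2560, 1600], [1440, 900], [1920, 1080]]
```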

### Extract the processor speed from the cpu column

In [81]:
cpu_speed = laptops['cpu'].str.split().str[-1]
laptops['cpu_speed'] = cpu_speed.str.replace('GHz','').astype('float64')

In [82]:
laptops.dtypes

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight_kg             float64
price_euros           float64
gpu_manufacturer       object
cpu_manufacturer       object
screen_resolution      object
cpu_speed             float64
dtype: object

In [83]:
laptops.to_csv('laptops_cleaned.csv', index=False)

## Here are some questions you might like to answer in your own time by analyzing the cleaned data:

- Are laptops made by Apple more expensive than those made by other manufacturers?
- What is the best value laptop with a screen size of 15" or more?
- Which laptop has the most storage space?

### Are laptops made by Apple more expensive than those made by other manufacturers?

In [84]:
laptops.head()

Unnamed: 0,manufacturer,model_name,category,screen_size_inches,screen,cpu,ram_gb,storage,gpu,os,os_version,weight_kg,price_euros,gpu_manufacturer,cpu_manufacturer,screen_resolution,cpu_speed
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,X,1.37,1339.69,Intel,Intel,2560x1600,2.3
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,X,1.34,898.94,Intel,Intel,1440x900,1.8
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,Version Unknown,1.86,575.0,Intel,Intel,1920x1080,2.5
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,X,1.83,2537.45,AMD,Intel,2880x1800,2.7
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,X,1.37,1803.6,Intel,Intel,2560x1600,3.1


### Two ways I found:

1. The long way - use a loop to save the mean values to a dictionary
2. Use the Pandas aggregate function

### The long way with a loop:

In [85]:
# Calculate the average laptop price for each manufacturer
manufacturers = laptops['manufacturer'].unique()
manufacturers

array(['Apple', 'HP', 'Acer', 'Asus', 'Dell', 'Lenovo', 'Chuwi', 'MSI',
       'Microsoft', 'Toshiba', 'Huawei', 'Xiaomi', 'Vero', 'Razer',
       'Mediacom', 'Samsung', 'Google', 'Fujitsu', 'LG'], dtype=object)

In [86]:
avg_price_by_manufacturer = {}

for manufacturer in manufacturers:
    mean = laptops.loc[laptops['manufacturer'] == manufacturer, 'price_euros'].mean()
    avg_price_by_manufacturer[manufacturer] = mean
    
avg_price_by_manufacturer

{'Apple': 1564.1985714285713,
 'HP': 1067.774854014598,
 'Acer': 626.7758252427185,
 'Asus': 1104.1693670886075,
 'Dell': 1186.0689898989892,
 'Lenovo': 1086.3844444444444,
 'Chuwi': 314.2966666666667,
 'MSI': 1728.9081481481485,
 'Microsoft': 1612.3083333333334,
 'Toshiba': 1267.8125,
 'Huawei': 1424.0,
 'Xiaomi': 1133.4625,
 'Vero': 217.425,
 'Razer': 3346.1428571428573,
 'Mediacom': 295.0,
 'Samsung': 1413.4444444444443,
 'Google': 1677.6666666666667,
 'Fujitsu': 729.0,
 'LG': 2099.0}

### Use the Pandas aggregate function

In [87]:
avg_price_by_man = laptops.groupby('manufacturer').price_euros.agg(['mean'])
avg_price_by_man

                     mean
manufacturer
Acer           626.775825
Apple         1564.198571
Asus          1104.169367
Chuwi          314.296667
Dell           1186.06899
Fujitsu             729.0
Google        1677.666667
HP            1067.774854
Huawei             1424.0
LG                 2099.0


In [88]:
type(avg_price_by_man)

pandas.core.frame.DataFrame

In [89]:
avg_price_by_man.sort_values(by=['mean'], ascending=False)

                     mean
manufacturer
Razer         3346.142857
LG                 2099.0
MSI           1728.908148
Google        1677.666667
Microsoft     1612.308333
Apple         1564.198571
Huawei             1424.0
Samsung       1413.444444
Toshiba         1267.8125
Dell           1186.06899
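To compare Apple against all other manufacturers combined, rather than against each one separately, the rows can be grouped into two labels first. The sketch below uses invented prices (not values from `laptops.csv`) and a hypothetical "Other" label:

```python
import pandas as pd

# Hypothetical prices; not taken from laptops.csv.
toy = pd.DataFrame(
    {
        "manufacturer": ["Apple", "Apple", "HP", "Acer", "Razer"],
        "price_euros": [1300.0, 1700.0, 600.0, 400.0, 3000.0],
    }
)

# Keep "Apple" where the condition holds; relabel every other row as "Other".
group = toy["manufacturer"].where(toy["manufacturer"] == "Apple", "Other")
mean_by_group = toy.groupby(group)["price_euros"].mean()

print(mean_by_group["Apple"])  # 1500.0
```

A mean can be pulled up by a few very expensive models, so comparing medians in the same way may also be worthwhile.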


### What is the best value laptop with a screen size of 15" or more?

I found two ways:
1. Long way: loop through all the prices of the laptops with screens of 15" or more, saving the index whenever a lower price is found in the loop

2. Use the [idxmin() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmin.html#pandas.DataFrame.idxmin)

### Long way: loop through all the prices of the laptops with screens of 15" or more, saving the index whenever a lower price is found in the loop

In [90]:
# laptops_over_15_5_inches = laptops.loc[laptops['screen_size_inches'] >= 15.0]
length_laptops = laptops.shape[0]
# start with no candidate, so that row 0 (a 13.3" laptop) can't be returned by default
index_of_minimum_price_over_15_inches = None

for index in range(length_laptops):
    # screen_size_inches is at column position 3 and price_euros is at position 12
    # keep the row if the screen is 15 inches or more and it is either the first
    # candidate or cheaper than the current candidate
    if laptops.iloc[index, 3] >= 15.0 and (
        index_of_minimum_price_over_15_inches is None or
        laptops.iloc[index, 12] < laptops.iloc[index_of_minimum_price_over_15_inches, 12]):
        index_of_minimum_price_over_15_inches = index
        
index_of_minimum_price_over_15_inches

290

In [91]:
laptops.iloc[index_of_minimum_price_over_15_inches]

manufacturer                                          Acer
model_name                            Chromebook C910-C2ST
category                                          Notebook
screen_size_inches                                    15.6
screen                                            1366x768
cpu                   Intel Celeron Dual Core 3205U 1.5GHz
ram_gb                                                   2
storage                                           16GB SSD
gpu                                      Intel HD Graphics
os                                               Chrome OS
os_version                                             NaN
weight_kg                                             2.19
price_euros                                          199.0
gpu_manufacturer                                     Intel
cpu_manufacturer                                     Intel
screen_resolution                                 1366x768
cpu_speed                                              1.5

### Use the [idxmin() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmin.html#pandas.DataFrame.idxmin)

In [92]:
all_prices_for_laptops_over_15_5_in = laptops.loc[laptops['screen_size_inches'] >= 15.0, 'price_euros']
row_index_of_lowest_cost_over_15_5_in = all_prices_for_laptops_over_15_5_in.idxmin()

laptops.loc[row_index_of_lowest_cost_over_15_5_in]  # idxmin() returns an index label, so use .loc

manufacturer                                          Acer
model_name                            Chromebook C910-C2ST
category                                          Notebook
screen_size_inches                                    15.6
screen                                            1366x768
cpu                   Intel Celeron Dual Core 3205U 1.5GHz
ram_gb                                                   2
storage                                           16GB SSD
gpu                                      Intel HD Graphics
os                                               Chrome OS
os_version                                             NaN
weight_kg                                             2.19
price_euros                                          199.0
gpu_manufacturer                                     Intel
cpu_manufacturer                                     Intel
screen_resolution                                 1366x768
cpu_speed                                              1.5

In [93]:
row_index_of_lowest_cost_over_15_5_in

290
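`idxmin()` returns an index label rather than a position. With the default integer index the two coincide, but they can differ after filtering or re-indexing. A small sketch with hypothetical data and deliberately non-sequential labels:

```python
import pandas as pd

# Hypothetical laptops with non-sequential index labels.
toy = pd.DataFrame(
    {
        "screen_size_inches": [13.3, 15.6, 17.3, 15.6],
        "price_euros": [150.0, 420.0, 310.0, 999.0],
    },
    index=[10, 20, 30, 40],
)

big_screens = toy.loc[toy["screen_size_inches"] >= 15.0, "price_euros"]
cheapest_label = big_screens.idxmin()  # an index label, not a position

print(cheapest_label)                          # 30
print(toy.loc[cheapest_label, "price_euros"])  # 310.0
```

Here `toy.iloc[30]` would fail because the dataframe has only four rows, which is why `.loc` is the safer choice for labels returned by `idxmin()`.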

### Which laptop has the most storage space?

In [94]:
laptops.head()

Unnamed: 0,manufacturer,model_name,category,screen_size_inches,screen,cpu,ram_gb,storage,gpu,os,os_version,weight_kg,price_euros,gpu_manufacturer,cpu_manufacturer,screen_resolution,cpu_speed
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,X,1.37,1339.69,Intel,Intel,2560x1600,2.3
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,X,1.34,898.94,Intel,Intel,1440x900,1.8
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,Version Unknown,1.86,575.0,Intel,Intel,1920x1080,2.5
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,X,1.83,2537.45,AMD,Intel,2880x1800,2.7
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,X,1.37,1803.6,Intel,Intel,2560x1600,3.1


In [95]:
# strip the storage type labels and units, replacing 'TB' with '000' so that 1TB becomes 1000 (GB)
storage_size_gb = (laptops['storage']
                        .str.replace('SSD', '')
                        .str.replace('HDD', '')
                        .str.replace('Flash Storage', '')
                        .str.replace('Hybrid', '')
                        .str.replace('TB', '000')
                        .str.replace('GB', '')
                        .str.strip()
                        .str.replace(' ', ''))

storage_size_gb

0        128
1        128
2        256
3        512
4        256
        ... 
1298     128
1299     512
1300      64
1301    1000
1302     500
Name: storage, Length: 1303, dtype: object

In [96]:
# Some of the storage values have two values that need to be added to get the total storage
def calculate_total_storage(item):
    if '+' in item:
        items = item.split('+')
        return str(int(items[0]) + int(items[1]))
    else:
        return item

In [97]:
storage_size_gb = (laptops['storage']
         .str.replace('SSD', '')
         .str.replace('HDD', '')
         .str.replace('Flash Storage', '')
         .str.replace('Hybrid', '')
         .str.replace('TB', '000')
         .str.replace('GB', '')
         .str.strip()
         .str.replace(' ', ''))

storage_size_gb = storage_size_gb.apply(calculate_total_storage) # apply the total storage function to all the cells 
storage_size_gb.unique()

array(['128', '256', '512', '500', '1000', '32', '1128', '64', '1256',
       '2256', '2000', '1512', '756', '2128', '1024', '16', '768', '2512',
       '1064', '1', '240', '8', '508'], dtype=object)
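The chained replacements and the helper above could also be expressed as a single parsing function. The following is an alternative sketch, not the approach used in this notebook: `storage_to_gb` is a hypothetical helper, the sample values are invented, and 1 TB is treated as 1000 GB to match the replacement of TB with 000 above.

```python
import re
import pandas as pd

def storage_to_gb(text):
    """Sum every '<number>GB' or '<number>TB' found in text, in GB (1 TB = 1000 GB)."""
    total = 0.0
    for amount, unit in re.findall(r"([\d.]+)\s*(GB|TB)", text):
        total += float(amount) * (1000 if unit == "TB" else 1)
    return int(total)

# Hypothetical values in the style of the storage column.
storage = pd.Series(["128GB SSD", "1TB HDD", "512GB SSD +  2TB HDD", "1.0TB Hybrid"])
sizes = storage.apply(storage_to_gb)

print(sizes.tolist())  # [128, 1000, 2512, 1000]
```

Because the function returns integers directly, no separate `astype('int64')` step is needed afterwards.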

In [98]:
storage_size_gb

0        128
1        128
2        256
3        512
4        256
        ... 
1298     128
1299     512
1300      64
1301    1000
1302     500
Name: storage, Length: 1303, dtype: object

In [99]:
laptops['storage_size_gb'] = storage_size_gb # create a new column for storage size in GB

In [100]:
laptops['storage_size_gb'].unique()

array(['128', '256', '512', '500', '1000', '32', '1128', '64', '1256',
       '2256', '2000', '1512', '756', '2128', '1024', '16', '768', '2512',
       '1064', '1', '240', '8', '508'], dtype=object)

In [101]:
laptops['storage_size_gb'] = laptops['storage_size_gb'].astype('int64')

In [102]:
laptops['storage_size_gb'].dtypes

dtype('int64')

In [103]:
laptops.dtypes

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight_kg             float64
price_euros           float64
gpu_manufacturer       object
cpu_manufacturer       object
screen_resolution      object
cpu_speed             float64
storage_size_gb         int64
dtype: object

In [104]:
# find the index label of the laptop with the largest storage using the idxmax() method

max_storage_index = laptops['storage_size_gb'].idxmax()
max_storage_index

370

In [105]:
laptops.loc[max_storage_index]  # idxmax() returns an index label, so use .loc

manufacturer                                     Asus
model_name                               ZenBook Flip
category                           2 in 1 Convertible
screen_size_inches                               15.6
screen                Full HD / Touchscreen 1920x1080
cpu                        Intel Core i7 7500U 2.7GHz
ram_gb                                             12
storage                          512GB SSD +  2TB HDD
gpu                           Nvidia GeForce GT 940MX
os                                            Windows
os_version                                         10
weight_kg                                        2.26
price_euros                                    1099.0
gpu_manufacturer                               Nvidia
cpu_manufacturer                                Intel
screen_resolution                           1920x1080
cpu_speed                                         2.7
storage_size_gb                                  2512
Name: 370, dtype: object

In [106]:
laptops.to_csv('laptops_cleaned.csv', index=False)