# Web Scraping flipkart.com
This notebook scrapes the Flipkart search results for laptops. Each results page lists 24 laptops. From each listing we extract the description (model name along with the specifications), the name of the processor, the type of processor, RAM, operating system, disk drive storage, display, warranty, ratings, and price. Scraping 7 pages gives a dataset with information on 168 distinct laptops.<br>
Link to my article: https://towardsdatascience.com/learn-web-scraping-in-15-minutes-27e5ebb1c28e

# Data Collection

In [1]:
from bs4 import BeautifulSoup # Importing the BeautifulSoup package
import requests # Importing requests library
import csv
import pandas as pd 

In [2]:
req = requests.get("https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=1") # Requesting the content of the URL
content = req.content # Getting the content
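
The request above fetches only page 1, while the introduction mentions seven pages. Below is a minimal sketch of how the remaining page URLs could be generated, assuming the `page` query parameter is the only part of the URL that changes. The helper names `build_page_urls` and `fetch_pages` and the one-second delay are illustrative assumptions, not part of the original notebook.

```python
import time

# Base URL copied from the cell above, with the page number left as a placeholder.
BASE_URL = ("https://www.flipkart.com/search?q=laptops&otracker=search"
            "&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page={page}")

def build_page_urls(n_pages):
    """Return the search URLs for pages 1..n_pages (assumes only `page` varies)."""
    return [BASE_URL.format(page=p) for p in range(1, n_pages + 1)]

def fetch_pages(urls, delay_seconds=1.0):
    """Download each URL and return the raw HTML bytes of every page."""
    import requests  # imported here so the URL helper works without the library
    pages = []
    for url in urls:
        response = requests.get(url)
        response.raise_for_status()   # stop early on HTTP errors
        pages.append(response.content)
        time.sleep(delay_seconds)     # assumed courtesy pause between requests
    return pages

urls = build_page_urls(7)
```

Each element returned by `fetch_pages(urls)` could then be parsed with `BeautifulSoup(content, 'html.parser')` exactly as in the next cell.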

In [3]:
soup = BeautifulSoup(content,'html.parser') # Here we need to specify the content variable and the parser which is the HTML parser
# So now soup is a variable of the BeautifulSoup object of our parsed HTML

In [4]:
print(soup.prettify()) # Displays the whole code of the website

<!DOCTYPE html>
<html lang="en">
 <head>
  <link href="https://rukminim1.flixcart.com" rel="dns-prefetch"/>
  <link href="https://img1a.flixcart.com" rel="dns-prefetch"/>
  <link href="//img1a.flixcart.com/www/linchpin/fk-cp-zion/css/app.chunk.21be2e.css" rel="stylesheet"/>
  <link as="image" href="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fk-logo_9fddff.png" rel="preload"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="102988293558" property="fb:page_id"/>
  <meta content="658873552,624500995,100000233612389" property="fb:admins"/>
  <meta content="noodp" name="robots"/>
  <link href="https://img1a.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon">
   <link href="/osdd.xml?v=2" rel="search" type="application/opensearchdescription+xml"/>
   <meta content="website" property="og:type"/>
   <meta content="Flipkart.com" name="og_site_name" property="og

In [None]:
desc = soup.find_all('div' , class_='_3wU53n') # Extracting the descriptions with the find_all method - grabbing every div tag which has the classname _3wU53n
# This returns all the div tags with the classname _3wU53n
# As class is a reserved keyword in Python, we have to use the class_ argument instead
descriptions = []
for i in range(len(desc)):
    descriptions.append(desc[i].text) # .text gives the text inside the tag
descriptions
len(descriptions)

24

In [None]:
commonclass = soup.find_all('li',class_='tVe95H') # All the specifications share one classname on their li tags, so we separate the features by checking the text of each tag
 
# Create empty lists for the features
processors=[]
ram=[]
os=[]
storage=[]
inches=[]
warranty=[]

for i in range(0,len(commonclass)):
    p=commonclass[i].text # Extracting the text from the tags
    if("Processor" in p): 
        processors.append(p)
    elif("RAM" in p): # If RAM is present in the text then append it to the ram list. Similarly do this for the other features as well
        ram.append(p)
    elif("HDD" in p or "SSD" in p):
        storage.append(p)
    elif("Operating" in p):
        os.append(p)
    elif("Display" in p):
        inches.append(p)
    elif("Warranty" in p):
        warranty.append(p)

In [None]:
print(len(processors))
print(len(warranty))
print(len(os))
print(len(ram))
print(len(storage))
print(len(inches)) # We observe that the length of all the features is same which is 24

24
24
24
24
24
24


In [None]:
price = soup.find_all('div',class_='_1vC4OE _2rQ-NK') # Extracting price of each laptop from the website
prices = []
for i in range(len(price)):
    prices.append(price[i].text)
prices
len(prices)

24

In [None]:
original_price = soup.find_all('div',class_='_3auQ3N _2GcJzG')
len(original_price) 

23

In [None]:
original_prices = []
for i in range(len(original_price)):
    original_prices.append(original_price[i].text)
original_prices
len(original_prices) # Here we get 23 since no original price is listed for 1 laptop - that laptop has no discount, so its original price is the same as its selling price

23

In [None]:
rating = soup.find_all('div',class_='hGSR34') # Extracting the ratings of all the laptops
ratings = []
for i in range(len(rating)):
    ratings.append(rating[i].text)
ratings
len(ratings) # The recommended laptops use the same classname as the listed laptops, so their ratings are extracted as well
# This is why there are more ratings than laptops

39
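
Page-wide `find_all` calls return one flat list per feature, which is why the ratings (39) and original prices (23) no longer line up with the 24 laptops. One way around this is to loop over each product card and look up the fields inside it, storing `None` when a field is missing. The sketch below runs on a small made-up HTML snippet: the class names `card`, `title`, `rating` and `old-price` are placeholders, not Flipkart's real class names, and `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are placeholders for illustration only.
SAMPLE_HTML = """
<div class="card"><div class="title">Laptop A</div><div class="rating">4.4</div><div class="old-price">₹60,000</div></div>
<div class="card"><div class="title">Laptop B</div><div class="old-price">₹50,000</div></div>
<div class="card"><div class="title">Laptop C</div><div class="rating">4.1</div></div>
<div class="sidebar"><div class="rating">5</div></div>
"""

def text_or_none(parent, class_name):
    """Return the stripped text of the first matching child div, or None if absent."""
    tag = parent.find("div", class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None

soup_sample = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for card in soup_sample.find_all("div", class_="card"):
    # Every field is looked up inside the card, so each row stays aligned.
    rows.append({
        "title": text_or_none(card, "title"),
        "rating": text_or_none(card, "rating"),
        "original_price": text_or_none(card, "old-price"),
    })
```

The rating in the `sidebar` block is ignored because it is not inside a card, and laptops without a rating or original price still produce a row.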

In [None]:
ratings

['4.4',
 '4.3',
 '4.5',
 '4.4',
 '4.5',
 '4.4',
 '4.4',
 '4.8',
 '4.3',
 '4.5',
 '4.4',
 '4.3',
 '4.3',
 '4.2',
 '4.5',
 '4.4',
 '4.3',
 '4.5',
 '4.7',
 '4.3',
 '4.6',
 '4.5',
 '4.4',
 '4.2',
 '4.3',
 '5',
 '4',
 '4.5',
 '4',
 '5',
 '4.3',
 '1',
 '5',
 '4.5',
 '5',
 '5',
 '4.3',
 '5',
 '5']

In [None]:
discount = soup.find_all('span')
discount # There are many span tags, but we need to find only the tags which contain discount 

[<span class="_2Ky4Ru">Plus</span>,
 <span>Cart</span>,
 <span class="_1QZ6fC _3Lgyp8">Electronics<svg class="_3ynUUz" height="8" viewbox="0 0 16 27" width="4.7" xmlns="http://www.w3.org/2000/svg"><path class="_3Der3h" d="M16 23.207L6.11 13.161 16 3.093 12.955 0 0 13.161l12.955 13.161z" fill="#fff"></path></svg></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="nsslWl"></span>,
 <span class="_1QZ6fC _3Lgyp8">TVs &amp; Appliances<svg class="_3ynUUz" height="8" viewbox="0 0 16 27" width="4.7" xmlns="http://www.w3.org/2000/svg"><path class="_3Der3h" d="M16 23.207L6.11 13.161 16 3.093 12.955

In [None]:
discounts = []
for i in range(0,len(discount)):
    p=discount[i].text
    if("% off" in p): # Keeping only the tags that contain a discount (every discount text contains "% off")
        discounts.append(p)
discounts
len(discounts) 

28

In [None]:
discounts

['11% off',
 '13% off',
 '28% off',
 '3% off',
 '3% off',
 '17% off',
 '16% off',
 '23% off',
 '20% off',
 '21% off',
 '1% off',
 '4% off',
 '2% off',
 '44% off',
 '12% off',
 '10% off',
 '2% off',
 '4% off',
 '17% off',
 '2% off',
 '1% off',
 '3% off',
 '21% off',
 '20% off',
 '4% off',
 '4% off',
 '28% off',
 '13% off']
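
The substring check keeps any span whose text contains "% off". A stricter variant, sketched here with the standard `re` module, accepts only text of the exact form `N% off` and converts it to an integer. The helper name `parse_discount` and the sample string "Extra 5% off on exchange" are assumptions made for illustration.

```python
import re

DISCOUNT_PATTERN = re.compile(r"(\d+)% off")

def parse_discount(text):
    """Return the discount percentage as an int, or None if text is not exactly 'N% off'."""
    match = DISCOUNT_PATTERN.fullmatch(text.strip())
    return int(match.group(1)) if match else None

# The last entry is a made-up example of text that merely contains "% off".
span_texts = ["Plus", "Cart", "11% off", "13% off", "Extra 5% off on exchange"]
discount_values = [d for d in map(parse_discount, span_texts) if d is not None]
```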

In [None]:
d = {'Description':descriptions,'Processor':processors,'RAM':ram,'Operating System':os,'Storage':storage,'Display':inches,'Warranty':warranty,'Price':prices}
dataset = pd.DataFrame(data = d) # Finally merging all the features into a single dataframe
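
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which is presumably why the ratings (39), original prices (23) and discounts (28) are left out of `d`. A small check like the following, where `check_equal_lengths` is a hypothetical helper, reports the lengths before the dataframe is built.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Toy data standing in for the scraped lists.
ok = check_equal_lengths({"Description": ["a", "b"], "Price": ["1", "2"]})

try:
    check_equal_lengths({"Description": ["a", "b"], "Rating": ["4.4"]})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```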

In [None]:
dataset

Unnamed: 0,Description,Processor,RAM,Operating System,Storage,Display,Warranty,Price
0,HP Spectre x360 Core i5 10th Gen - (8 GB/512 G...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,33.78 cm (13.3 inch) Touchscreen Display,1 Year Onsite Warranty,"₹1,02,990"
1,Asus VivoBook 14 Core i5 8th Gen - (8 GB/512 G...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹55,990"
2,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year International Travelers Warranty (ITW),"₹56,990"
3,Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...,AMD Ryzen 5 Quad Core Processor (2nd Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹42,990"
4,HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹52,990"
5,Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹59,990"
6,Asus VivoBook Gaming Core i5 9th Gen - (8 GB/1...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,39.62 cm (15.6 inch) Display,1 Year Onsite Warranty,"₹64,990"
7,Asus ROG Strix G15 Core i7 10th Gen - (8 GB/1 ...,Intel Core i7 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB SSD,39.62 cm (15.6 inch) Display,1 Year Onsite Warranty,"₹94,990"
8,Asus VivoBook Gaming Core i7 9th Gen - (16 GB ...,Intel Core i7 Processor (9th Gen),16 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year Limited International Hardware Warranty,"₹75,990"
9,Asus VivoBook S Series Core i5 8th Gen - (8 GB...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹53,990"
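
The `Price` column is still text such as `₹1,02,990`. A minimal sketch for turning it into an integer, assuming every price is a rupee amount made up of digits, commas and the currency symbol (`parse_price` is a hypothetical helper, not part of the notebook):

```python
import re

def parse_price(text):
    """Convert a price string such as '₹1,02,990' to an integer number of rupees."""
    digits = re.sub(r"[^\d]", "", text)  # drop the currency symbol and the commas
    return int(digits) if digits else None

sample_prices = ["₹1,02,990", "₹55,990", "₹42,990"]  # values from the table above
numeric_prices = [parse_price(p) for p in sample_prices]
```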


# Feature Engineering

We go through the features one by one and derive new ones from them. I made the following changes and created new variables: <br>
RAM - Made columns for the RAM capacity in GB and the DDR version <br>
Processor - Made columns for the name of the processor, the type of the processor, and the generation <br>
Operating System - Parsed the operating system from this column and made a new column <br>
Storage - Made new columns for the type of disk drive and the capacity of the disk drive <br>
Display - Made new columns for the size of the screen (in inches) and touchscreen <br>
Description - Made new columns for the company and graphic card <br>
The description also contains the RAM size, processor, storage, etc., but we already have separate features for those.

# RAM

### RAM IN GB

In [None]:
dataset['ram'] = dataset['RAM'].copy() # Creating a copy of RAM
dataset['ram'].unique()

array(['8 GB DDR4 RAM', '16 GB DDR4 RAM', '4 GB DDR4 RAM',
       '8 GB DDR3 RAM'], dtype=object)

In [None]:
dataset['ram']

0      8 GB DDR4 RAM
1      8 GB DDR4 RAM
2      8 GB DDR4 RAM
3      8 GB DDR4 RAM
4      8 GB DDR4 RAM
5      8 GB DDR4 RAM
6      8 GB DDR4 RAM
7      8 GB DDR4 RAM
8     16 GB DDR4 RAM
9      8 GB DDR4 RAM
10     4 GB DDR4 RAM
11     8 GB DDR4 RAM
12     8 GB DDR4 RAM
13     8 GB DDR3 RAM
14     8 GB DDR4 RAM
15     8 GB DDR4 RAM
16     8 GB DDR4 RAM
17     8 GB DDR4 RAM
18     8 GB DDR4 RAM
19     8 GB DDR4 RAM
20     8 GB DDR4 RAM
21     8 GB DDR4 RAM
22     8 GB DDR4 RAM
23     8 GB DDR4 RAM
Name: ram, dtype: object

In [None]:
ram_in_gb = []
for i in dataset['ram']:
  if(i[0] == '8'): # 8GB RAM
    ram_in_gb.append(i[0])
  elif(i[0] == '4'): # 4GB RAM
    ram_in_gb.append(i[0])
  elif(i[0] == '1'): # 16GB RAM
    ram_in_gb.append('16')
  elif(i[0] == '3'): # 32GB RAM
    ram_in_gb.append('32')
dataset['ram_in_gb'] = ram_in_gb
print(len(ram_in_gb))
dataset['ram_in_gb']

24


['8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '16',
 '8',
 '4',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8',
 '8']

### DDR VERSION

In [None]:
ddr_version = []
for i in dataset['ram']:
    if(i[0] == '8' or i[0] == '4'): # If 4GB or 8GB RAM then check the DDR version
      ddr_version.append(i[8])
    elif(i[0] == '1'):  # If 16GB RAM then check the DDR version
      ddr_version.append(i[9])
    elif(i[0] == '3'):  # If 32GB RAM then check the DDR version
      ddr_version.append(i[9])
dataset['ddr_version'] = ddr_version
print(len(ddr_version))
dataset['ddr_version']

24


['4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '3',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4',
 '4']
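
The two loops above read fixed character positions, so they depend on the exact layout of each string. As an alternative sketch, a single regular expression can pull out both numbers. It assumes the text always looks like `<number> GB DDR<digit> RAM`, as in the unique values shown earlier; `parse_ram` is a hypothetical helper.

```python
import re

RAM_PATTERN = re.compile(r"(\d+)\s*GB\s+DDR(\d)")

def parse_ram(text):
    """Return (capacity_in_gb, ddr_version) as ints, or (None, None) if no match."""
    match = RAM_PATTERN.search(text)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))

ram_values = ["8 GB DDR4 RAM", "16 GB DDR4 RAM", "4 GB DDR4 RAM", "8 GB DDR3 RAM"]
parsed_ram = [parse_ram(v) for v in ram_values]
```

Because unmatched strings give `(None, None)` instead of being skipped, the result always has one entry per row.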

# PROCESSOR

In [None]:
dataset['processor'] = dataset['Processor'].copy() # Creating a copy of Processor
gen = dataset['processor'].apply(lambda x:x.split('(')) # Splitting on '(' so the 2 elements form a list
dataset['processor'].unique()

array(['Intel Core i5 Processor (10th Gen)',
       'Intel Core i5 Processor (8th Gen)',
       'Intel Core i5 Processor (9th Gen)',
       'AMD Ryzen 5 Quad Core Processor (2nd Gen)',
       'Intel Core i7 Processor (10th Gen)',
       'Intel Core i7 Processor (9th Gen)',
       'Intel Core i3 Processor (10th Gen)',
       'AMD Ryzen 5 Quad Core Processor',
       'AMD Ryzen 5 Quad Core Processor (3rd Gen)'], dtype=object)

In [None]:
dataset['processor']

0            Intel Core i5 Processor (10th Gen)
1             Intel Core i5 Processor (8th Gen)
2             Intel Core i5 Processor (9th Gen)
3     AMD Ryzen 5 Quad Core Processor (2nd Gen)
4            Intel Core i5 Processor (10th Gen)
5            Intel Core i5 Processor (10th Gen)
6             Intel Core i5 Processor (9th Gen)
7            Intel Core i7 Processor (10th Gen)
8             Intel Core i7 Processor (9th Gen)
9             Intel Core i5 Processor (8th Gen)
10           Intel Core i3 Processor (10th Gen)
11            Intel Core i7 Processor (9th Gen)
12              AMD Ryzen 5 Quad Core Processor
13            Intel Core i5 Processor (8th Gen)
14            Intel Core i5 Processor (8th Gen)
15           Intel Core i5 Processor (10th Gen)
16           Intel Core i5 Processor (10th Gen)
17    AMD Ryzen 5 Quad Core Processor (3rd Gen)
18           Intel Core i5 Processor (10th Gen)
19           Intel Core i5 Processor (10th Gen)
20           Intel Core i5 Processor (10

In [None]:
gen

0            [Intel Core i5 Processor , 10th Gen)]
1             [Intel Core i5 Processor , 8th Gen)]
2             [Intel Core i5 Processor , 9th Gen)]
3     [AMD Ryzen 5 Quad Core Processor , 2nd Gen)]
4            [Intel Core i5 Processor , 10th Gen)]
5            [Intel Core i5 Processor , 10th Gen)]
6             [Intel Core i5 Processor , 9th Gen)]
7            [Intel Core i7 Processor , 10th Gen)]
8             [Intel Core i7 Processor , 9th Gen)]
9             [Intel Core i5 Processor , 8th Gen)]
10           [Intel Core i3 Processor , 10th Gen)]
11            [Intel Core i7 Processor , 9th Gen)]
12               [AMD Ryzen 5 Quad Core Processor]
13            [Intel Core i5 Processor , 8th Gen)]
14            [Intel Core i5 Processor , 8th Gen)]
15           [Intel Core i5 Processor , 10th Gen)]
16           [Intel Core i5 Processor , 10th Gen)]
17    [AMD Ryzen 5 Quad Core Processor , 3rd Gen)]
18           [Intel Core i5 Processor , 10th Gen)]
19           [Intel Core i5 Pro

### PROCESSOR TYPE

In [None]:
processor_type = []
for i in dataset['processor']:
  if(i[11] == 'i' or i[11] == 'm'):
    processor_type.append(i[11:13])
  elif(i[10] == '5'):
    processor_type.append('Ryzen 5')
  elif(i[10]  == '3'):
    processor_type.append('Ryzen 3')
  elif(i[10] == '7'):
    processor_type.append('Ryzen 7')
  elif('SQ1' in i):
    processor_type.append('SQ1')
  elif('Pentium' in i):
    processor_type.append('Pentium')
  elif('APU' in i):
    processor_type.append('APU')
  elif('Athlon' in i):
    processor_type.append('Athlon')
dataset['processor_type'] = processor_type
print(len(processor_type))
dataset['processor_type']

24


0          i5
1          i5
2          i5
3     Ryzen 5
4          i5
5          i5
6          i5
7          i7
8          i7
9          i5
10         i3
11         i7
12    Ryzen 5
13         i5
14         i5
15         i5
16         i5
17    Ryzen 5
18         i5
19         i5
20         i5
21         i5
22         i5
23         i5
Name: processor_type, dtype: object

### PROCESSOR NAME

In [None]:
processor_name = []
for i in dataset['processor']:
  if('Intel' in i): # If it's an Intel processor
    processor_name.append('Intel')
  elif('AMD' in i): # If it's an AMD processor
    processor_name.append('AMD')
  elif('Microsoft' in i): # If it's a Microsoft processor
    processor_name.append('Microsoft')
dataset['processor_name'] = processor_name
print(len(processor_name))
dataset['processor_name']

24


0     Intel
1     Intel
2     Intel
3       AMD
4     Intel
5     Intel
6     Intel
7     Intel
8     Intel
9     Intel
10    Intel
11    Intel
12      AMD
13    Intel
14    Intel
15    Intel
16    Intel
17      AMD
18    Intel
19    Intel
20    Intel
21    Intel
22    Intel
23    Intel
Name: processor_name, dtype: object

### GENERATION

In [None]:
gen_type = []
for i in gen: # After splitting on '(', each entry is a list with 2 elements when a generation is mentioned
  if(len(i) == 2):  # If there is generation mentioned in the processor 
    if(i[1][0] == '1'): # If 10th gen
      gen_type.append('10') 
    else:  # Other than 10th gen
      gen_type.append(i[1][0])
  else: # If the generation is not mentioned
    gen_type.append('Not Mentioned')
dataset['gen_type'] = gen_type
print(len(gen_type))
dataset['gen_type']

24
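
The processor loops rely on character positions, and the generation loop maps any generation starting with `1` to 10. A regex-based sketch that extracts brand, type and generation in one pass is shown below. It assumes the brand is the first word and that the generation appears as `(Nth Gen)`; `parse_processor` and the two patterns are illustrative, and only the Core i and Ryzen families are covered.

```python
import re

TYPE_PATTERN = re.compile(r"\b(i\d|Ryzen \d)\b")
GEN_PATTERN = re.compile(r"\((\d+)(?:st|nd|rd|th) Gen\)")

def parse_processor(text):
    """Split a processor string into brand, type and generation."""
    type_match = TYPE_PATTERN.search(text)
    gen_match = GEN_PATTERN.search(text)
    return {
        "brand": text.split()[0],  # assumes the brand is the first word
        "type": type_match.group(1) if type_match else "Other",
        "generation": gen_match.group(1) if gen_match else "Not Mentioned",
    }

intel = parse_processor("Intel Core i5 Processor (10th Gen)")
amd = parse_processor("AMD Ryzen 5 Quad Core Processor (2nd Gen)")
no_gen = parse_processor("AMD Ryzen 5 Quad Core Processor")
```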


# OPERATING SYSTEM

In [None]:
dataset['OS'] = dataset['Operating System'].copy() # Creating a copy of Operating System
dataset['OS'].unique()

array(['64 bit Windows 10 Operating System'], dtype=object)

In [None]:
dataset['OS']

0     64 bit Windows 10 Operating System
1     64 bit Windows 10 Operating System
2     64 bit Windows 10 Operating System
3     64 bit Windows 10 Operating System
4     64 bit Windows 10 Operating System
5     64 bit Windows 10 Operating System
6     64 bit Windows 10 Operating System
7     64 bit Windows 10 Operating System
8     64 bit Windows 10 Operating System
9     64 bit Windows 10 Operating System
10    64 bit Windows 10 Operating System
11    64 bit Windows 10 Operating System
12    64 bit Windows 10 Operating System
13    64 bit Windows 10 Operating System
14    64 bit Windows 10 Operating System
15    64 bit Windows 10 Operating System
16    64 bit Windows 10 Operating System
17    64 bit Windows 10 Operating System
18    64 bit Windows 10 Operating System
19    64 bit Windows 10 Operating System
20    64 bit Windows 10 Operating System
21    64 bit Windows 10 Operating System
22    64 bit Windows 10 Operating System
23    64 bit Windows 10 Operating System
Name: OS, dtype:

In [None]:
OS = []
for i in dataset['OS']:
  if('Windows' in i): # Windows Operating System
    OS.append('Windows')
  elif('Mac' in i):   # Mac Operating System
    OS.append('Mac')
  elif('Linux' in i): # Linux Operating System
    OS.append('Linux')
  elif('DOS' in i):   # DOS Operating System
    OS.append('DOS')
dataset['OS'] = OS
print(len(dataset['OS']))
dataset['OS']

24


0     Windows
1     Windows
2     Windows
3     Windows
4     Windows
5     Windows
6     Windows
7     Windows
8     Windows
9     Windows
10    Windows
11    Windows
12    Windows
13    Windows
14    Windows
15    Windows
16    Windows
17    Windows
18    Windows
19    Windows
20    Windows
21    Windows
22    Windows
23    Windows
Name: OS, dtype: object
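
An `if`/`elif` chain without a final `else` skips any row that matches none of the keywords, so the list ends up shorter than the dataframe and the column assignment fails with a length error. A small keyword lookup with a default value avoids this. The helper `classify`, the default label `"Other"` and the "Chrome" sample string are assumptions for illustration.

```python
OS_KEYWORDS = ["Windows", "Mac", "Linux", "DOS"]

def classify(text, keywords, default="Other"):
    """Return the first keyword found in text, or `default` if none is present."""
    for keyword in keywords:
        if keyword in text:
            return keyword
    return default

os_labels = [
    classify("64 bit Windows 10 Operating System", OS_KEYWORDS),
    classify("Chrome Operating System", OS_KEYWORDS),  # made-up unmatched example
]
```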

# STORAGE

### DISK DRIVE

In [None]:
dataset['storage'] = dataset['Storage'].copy() # Creating a copy of Storage
dataset['storage'].unique()

array(['512 GB SSD', '1 TB HDD|256 GB SSD', '1 TB SSD', '256 GB SSD',
       '1 TB HDD|512 GB SSD', '1 TB HDD', '1 TB HDD|128 GB SSD'],
      dtype=object)

In [None]:
dataset['storage']

0              512 GB SSD
1              512 GB SSD
2              512 GB SSD
3              512 GB SSD
4              512 GB SSD
5     1 TB HDD|256 GB SSD
6     1 TB HDD|256 GB SSD
7                1 TB SSD
8              512 GB SSD
9     1 TB HDD|256 GB SSD
10             256 GB SSD
11    1 TB HDD|512 GB SSD
12               1 TB HDD
13             256 GB SSD
14             512 GB SSD
15    1 TB HDD|128 GB SSD
16             512 GB SSD
17    1 TB HDD|256 GB SSD
18    1 TB HDD|256 GB SSD
19             512 GB SSD
20    1 TB HDD|256 GB SSD
21             512 GB SSD
22             512 GB SSD
23             512 GB SSD
Name: storage, dtype: object

In [None]:
disk_drive = []
for i in dataset['storage']:
  if('HDD' in i and 'SSD' not in i):   # If only HDD
    disk_drive.append('HDD')
  elif('SSD' in i and 'HDD' not in i): # If only SSD
    disk_drive.append('SSD')
  elif('HDD' in i and 'SSD' in i):     # If both HDD and SSD
    disk_drive.append('Both')
dataset['disk_drive'] = disk_drive
print(len(disk_drive))
dataset['disk_drive']

24


0      SSD
1      SSD
2      SSD
3      SSD
4      SSD
5     Both
6     Both
7      SSD
8      SSD
9     Both
10     SSD
11    Both
12     HDD
13     SSD
14     SSD
15    Both
16     SSD
17    Both
18    Both
19     SSD
20    Both
21     SSD
22     SSD
23     SSD
Name: disk_drive, dtype: object

### STORAGE IN GB

In [None]:
storage_in_gb = []
for i in dataset['storage']:
  if('512' in i and 'TB' not in i): # Only 512GB SSD or 512GB HDD
    storage_in_gb.append('512')
  elif('256' in i and 'TB' not in i): # Only 256GB SSD or 256GB HDD
    storage_in_gb.append('256')
  elif('128' in i and 'TB' not in i): # Only 128GB SSD or 128GB HDD
    storage_in_gb.append('128')
  elif('512' in i and 'TB' in i):     # If 1TB HDD + 512GB SSD
    storage_in_gb.append('1000+512')
  elif('256' in i and 'TB' in i):     # If 1TB HDD + 256GB SSD
    storage_in_gb.append('1000+256')
  elif('128' in i and 'TB' in i):     # If 1TB HDD + 128GB SSD
    storage_in_gb.append('1000+128')
  elif('1' in i and '256' not in i and '512' not in i): # Only 1TB HDD or 1TB SSD
    storage_in_gb.append('1000')
  elif('2' in i and '256' not in i and '512' not in i): # Only 2TB HDD or 2TB SSD
    storage_in_gb.append('2000')
dataset['storage_in_gb'] = storage_in_gb
print(len(storage_in_gb))
dataset['storage_in_gb']

24


0          512
1          512
2          512
3          512
4          512
5     1000+256
6     1000+256
7         1000
8          512
9     1000+256
10         256
11    1000+512
12        1000
13         256
14         512
15    1000+128
16         512
17    1000+256
18    1000+256
19         512
20    1000+256
21         512
22         512
23         512
Name: storage_in_gb, dtype: object
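
The chain of substring checks only covers the combinations listed in it. As a sketch of a more general approach, each `<number> GB|TB HDD|SSD` part can be parsed separately and converted to GB, using 1 TB = 1000 GB as the notebook does. `parse_storage` is a hypothetical helper and assumes the drives are written in exactly that form.

```python
import re

DRIVE_PATTERN = re.compile(r"(\d+)\s*(GB|TB)\s+(HDD|SSD)")

def parse_storage(text):
    """Return the capacity in GB per drive type, e.g. {'HDD': 1000, 'SSD': 256}."""
    capacities = {"HDD": 0, "SSD": 0}
    for amount, unit, kind in DRIVE_PATTERN.findall(text):
        size_gb = int(amount) * (1000 if unit == "TB" else 1)
        capacities[kind] += size_gb
    return capacities

both = parse_storage("1 TB HDD|256 GB SSD")
ssd_only = parse_storage("512 GB SSD")
total_gb = sum(both.values())
```

Keeping HDD and SSD capacity as separate numbers also makes the `disk_drive` label easy to derive: a laptop has both drive types when both values are non-zero.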

# DISPLAY

In [None]:
dataset['display'] = dataset['Display'].copy() # Creating a copy of Display
dataset['display'].unique()

array(['33.78 cm (13.3 inch) Touchscreen Display',
       '35.56 cm (14 inch) Display', '39.62 cm (15.6 inch) Display',
       '35.56 cm (14 inch) Touchscreen Display',
       '33.78 cm (13.3 inch) Display'], dtype=object)

In [None]:
dataset['display']

0     33.78 cm (13.3 inch) Touchscreen Display
1                   35.56 cm (14 inch) Display
2                 39.62 cm (15.6 inch) Display
3                   35.56 cm (14 inch) Display
4                   35.56 cm (14 inch) Display
5                   35.56 cm (14 inch) Display
6                 39.62 cm (15.6 inch) Display
7                 39.62 cm (15.6 inch) Display
8                 39.62 cm (15.6 inch) Display
9                   35.56 cm (14 inch) Display
10      35.56 cm (14 inch) Touchscreen Display
11                39.62 cm (15.6 inch) Display
12                39.62 cm (15.6 inch) Display
13                  35.56 cm (14 inch) Display
14                  35.56 cm (14 inch) Display
15                  35.56 cm (14 inch) Display
16                39.62 cm (15.6 inch) Display
17                  35.56 cm (14 inch) Display
18                  35.56 cm (14 inch) Display
19                39.62 cm (15.6 inch) Display
20                  35.56 cm (14 inch) Display
21           

In [None]:
size_in_inches = []
for i in dataset['display']:
  if('15.6' in i):    # If the screen size is 15.6 inches
    size_in_inches.append('15.6')
  elif('14' in i):    # If the screen size is 14 inches
    size_in_inches.append('14')
  elif('13.3' in i):  # If the screen size is 13.3 inches
    size_in_inches.append('13.3')
  elif('13.5' in i):  # If the screen size is 13.5 inches (checked before '13', which would also match it)
    size_in_inches.append('13.5')
  elif('13' in i):    # If the screen size is 13 inches
    size_in_inches.append('13')
  elif('17.3' in i):  # If the screen size is 17.3 inches
    size_in_inches.append('17.3')
  elif('16' in i):    # If the screen size is 16 inches
    size_in_inches.append('16')
  elif('10' in i):    # If the screen size is 10 inches
    size_in_inches.append('10')
  elif('15' in i):    # If the screen size is 15 inches
    size_in_inches.append('15')
dataset['size_in_inches'] = size_in_inches
print(len(size_in_inches))
dataset['size_in_inches']

24


0     13.3
1       14
2     15.6
3       14
4       14
5       14
6     15.6
7     15.6
8     15.6
9       14
10      14
11    15.6
12    15.6
13      14
14      14
15      14
16    15.6
17      14
18      14
19    15.6
20      14
21    15.6
22    15.6
23    13.3
Name: size_in_inches, dtype: object
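
Since every display string carries the size in the form `(<number> inch)`, the value can also be read directly instead of being matched against a list of known sizes. A minimal sketch, assuming that format always holds (`parse_size_in_inches` is a hypothetical helper):

```python
import re

INCH_PATTERN = re.compile(r"\(([\d.]+)\s*inch\)")

def parse_size_in_inches(text):
    """Return the screen size in inches as a float, or None if it is not present."""
    match = INCH_PATTERN.search(text)
    return float(match.group(1)) if match else None

display_values = [
    "33.78 cm (13.3 inch) Touchscreen Display",
    "35.56 cm (14 inch) Display",
    "39.62 cm (15.6 inch) Display",
]
sizes = [parse_size_in_inches(v) for v in display_values]
```

This reads only the number inside the parentheses, so the centimetre value earlier in the string cannot be matched by accident.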

### TOUCHSCREEN

In [None]:
touchscreen = []
for i in dataset['display']:
  if('Touchscreen' in i):     # If touchscreen is present
    touchscreen.append('Yes')
  else:
    touchscreen.append('No')  # If there's no touchscreen option
dataset['touchscreen'] = touchscreen
print(len(touchscreen))
dataset['touchscreen']

24


0     Yes
1      No
2      No
3      No
4      No
5      No
6      No
7      No
8      No
9      No
10    Yes
11     No
12     No
13     No
14     No
15     No
16     No
17     No
18     No
19     No
20     No
21     No
22     No
23     No
Name: touchscreen, dtype: object

# DESCRIPTION

In [None]:
dataset['description'] = dataset['Description'].copy() # Creating a copy of Description
dataset['description'] # Displaying the descriptions; all the laptops are unique

0     HP Spectre x360 Core i5 10th Gen - (8 GB/512 G...
1     Asus VivoBook 14 Core i5 8th Gen - (8 GB/512 G...
2     Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...
3     Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...
4     HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Win...
5     Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...
6     Asus VivoBook Gaming Core i5 9th Gen - (8 GB/1...
7     Asus ROG Strix G15 Core i7 10th Gen - (8 GB/1 ...
8     Asus VivoBook Gaming Core i7 9th Gen - (16 GB ...
9     Asus VivoBook S Series Core i5 8th Gen - (8 GB...
10    HP Pavilion x360 14 Core i3 10th Gen - (4 GB/2...
11    Dell G3 Core i7 9th Gen - (8 GB/1 TB HDD/512 G...
12    HP Pavilion Gaming Ryzen 5 Quad Core - (8 GB/1...
13    Asus ZenBook Core i5 8th Gen - (8 GB/256 GB SS...
14    Acer Swift 3 Core i5 8th Gen - (8 GB/512 GB SS...
15    Lenovo Thinkpad E14 Core i5 10th Gen - (8 GB/1...
16    Asus Core i5 10th Gen - (8 GB/512 GB SSD/Windo...
17    HP 14s Ryze

### COMPANY

In [None]:
company = []
for i in dataset['description']:
  if('Dell' in i):            # If it's a Dell laptop
    company.append('Dell')
  elif('Asus' in i):          # If it's an Asus laptop
    company.append('Asus')
  elif('Lenovo' in i):        # If it's a Lenovo laptop
    company.append('Lenovo')
  elif('Acer' in i):          # If it's an Acer laptop
    company.append('Acer')
  elif('HP' in i):            # If it's an HP laptop
    company.append('HP')
  elif('Apple' in i):         # If it's an Apple laptop
    company.append('Apple')
  elif('MSI' in i):           # If it's an MSI laptop
    company.append('MSI')
  elif('Avita' in i):         # If it's an Avita laptop
    company.append('Avita')
dataset['company'] = company
print(len(company))
dataset['company']

24


0         HP
1       Asus
2       Acer
3       Asus
4         HP
5       Asus
6       Asus
7       Asus
8       Asus
9       Asus
10        HP
11      Dell
12        HP
13      Asus
14      Acer
15    Lenovo
16      Asus
17        HP
18      Asus
19      Asus
20        HP
21       MSI
22      Asus
23      Acer
Name: company, dtype: object
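
In the descriptions shown above, the brand is always the first word. Under that assumption, the company can be read from the first token and compared against the known brands, which avoids matching a brand name that happens to appear later in the text. `parse_company`, the default label `"Other"` and the "Unknown Brand Laptop" sample are illustrative.

```python
KNOWN_BRANDS = {"Dell", "Asus", "Lenovo", "Acer", "HP", "Apple", "MSI", "Avita"}

def parse_company(description, default="Other"):
    """Return the first word if it is a known brand, otherwise `default`."""
    words = description.split()
    if words and words[0] in KNOWN_BRANDS:
        return words[0]
    return default

companies = [
    parse_company("HP Spectre x360 Core i5 10th Gen"),
    parse_company("Asus VivoBook 14 Core i5 8th Gen"),
    parse_company("Unknown Brand Laptop"),  # made-up unmatched example
]
```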

### GRAPHIC CARD

In [None]:
graphic_card = []
for i in dataset['description']:
  if('Graphics' in i): # If Graphic card is present
    graphic_card.append('Yes')
  else:                # If Graphic card is not present
    graphic_card.append('No')
dataset['graphic_card'] = graphic_card
print(len(graphic_card))
dataset['graphic_card']

24


0      No
1     Yes
2     Yes
3      No
4      No
5     Yes
6     Yes
7     Yes
8     Yes
9      No
10     No
11    Yes
12    Yes
13     No
14    Yes
15     No
16    Yes
17     No
18    Yes
19    Yes
20     No
21    Yes
22    Yes
23     No
Name: graphic_card, dtype: object

In [None]:
dataset

Unnamed: 0,Description,Processor,RAM,Operating System,Storage,Display,Warranty,Price,ram,ram_in_gb,ddr_version,processor,processor_type,processor_name,OS,gen_type,storage,disk_drive,storage_in_gb,display,size_in_inches,touchscreen,description,company,graphic_card
0,HP Spectre x360 Core i5 10th Gen - (8 GB/512 G...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,33.78 cm (13.3 inch) Touchscreen Display,1 Year Onsite Warranty,"₹1,02,990",8 GB DDR4 RAM,8,4,Intel Core i5 Processor (10th Gen),i5,Intel,64 bit Windows 10 Operating System,10,512 GB SSD,SSD,512,33.78 cm (13.3 inch) Touchscreen Display,13.3,Yes,HP Spectre x360 Core i5 10th Gen - (8 GB/512 G...,HP,No
1,Asus VivoBook 14 Core i5 8th Gen - (8 GB/512 G...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹55,990",8 GB DDR4 RAM,8,4,Intel Core i5 Processor (8th Gen),i5,Intel,64 bit Windows 10 Operating System,8,512 GB SSD,SSD,512,35.56 cm (14 inch) Display,14.0,No,Asus VivoBook 14 Core i5 8th Gen - (8 GB/512 G...,Asus,Yes
2,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year International Travelers Warranty (ITW),"₹56,990",8 GB DDR4 RAM,8,4,Intel Core i5 Processor (9th Gen),i5,Intel,64 bit Windows 10 Operating System,9,512 GB SSD,SSD,512,39.62 cm (15.6 inch) Display,15.6,No,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,Acer,Yes
3,Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...,AMD Ryzen 5 Quad Core Processor (2nd Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹42,990",8 GB DDR4 RAM,8,4,AMD Ryzen 5 Quad Core Processor (2nd Gen),Ryzen 5,AMD,64 bit Windows 10 Operating System,2,512 GB SSD,SSD,512,35.56 cm (14 inch) Display,14.0,No,Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...,Asus,No
4,HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹52,990",8 GB DDR4 RAM,8,4,Intel Core i5 Processor (10th Gen),i5,Intel,64 bit Windows 10 Operating System,10,512 GB SSD,SSD,512,35.56 cm (14 inch) Display,14.0,No,HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,HP,No
5,Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹59,990",8 GB DDR4 RAM,8,4,Intel Core i5 Processor (10th Gen),i5,Intel,64 bit Windows 10 Operating System,10,1 TB HDD|256 GB SSD,Both,1000+256,35.56 cm (14 inch) Display,14.0,No,Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...,Asus,Yes
6,Asus VivoBook Gaming Core i5 9th Gen - (8 GB/1...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,39.62 cm (15.6 inch) Display,1 Year Onsite Warranty,"₹64,990",8 GB DDR4 RAM,8,4,Intel Core i5 Processor (9th Gen),i5,Intel,64 bit Windows 10 Operating System,9,1 TB HDD|256 GB SSD,Both,1000+256,39.62 cm (15.6 inch) Display,15.6,No,Asus VivoBook Gaming Core i5 9th Gen - (8 GB/1...,Asus,Yes
7,Asus ROG Strix G15 Core i7 10th Gen - (8 GB/1 ...,Intel Core i7 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB SSD,39.62 cm (15.6 inch) Display,1 Year Onsite Warranty,"₹94,990",8 GB DDR4 RAM,8,4,Intel Core i7 Processor (10th Gen),i7,Intel,64 bit Windows 10 Operating System,10,1 TB SSD,SSD,1000,39.62 cm (15.6 inch) Display,15.6,No,Asus ROG Strix G15 Core i7 10th Gen - (8 GB/1 ...,Asus,Yes
8,Asus VivoBook Gaming Core i7 9th Gen - (16 GB ...,Intel Core i7 Processor (9th Gen),16 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year Limited International Hardware Warranty,"₹75,990",16 GB DDR4 RAM,16,4,Intel Core i7 Processor (9th Gen),i7,Intel,64 bit Windows 10 Operating System,9,512 GB SSD,SSD,512,39.62 cm (15.6 inch) Display,15.6,No,Asus VivoBook Gaming Core i7 9th Gen - (16 GB ...,Asus,Yes
9,Asus VivoBook S Series Core i5 8th Gen - (8 GB...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹53,990",8 GB DDR4 RAM,8,4,Intel Core i5 Processor (8th Gen),i5,Intel,64 bit Windows 10 Operating System,8,1 TB HDD|256 GB SSD,Both,1000+256,35.56 cm (14 inch) Display,14.0,No,Asus VivoBook S Series Core i5 8th Gen - (8 GB...,Asus,No


While deriving the new features we created working copies of several columns, so we now drop those copies.

In [None]:
dataset.drop(['ram','storage','processor','OS','display','description'],axis=1,inplace=True)

In [None]:
dataset # This is the final dataset

Unnamed: 0,Description,Processor,RAM,Operating System,Storage,Display,Warranty,Price,ram_in_gb,ddr_version,processor_type,processor_name,gen_type,disk_drive,storage_in_gb,size_in_inches,touchscreen,company,graphic_card
0,HP Spectre x360 Core i5 10th Gen - (8 GB/512 G...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,33.78 cm (13.3 inch) Touchscreen Display,1 Year Onsite Warranty,"₹1,02,990",8,4,i5,Intel,10,SSD,512,13.3,Yes,HP,No
1,Asus VivoBook 14 Core i5 8th Gen - (8 GB/512 G...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹55,990",8,4,i5,Intel,8,SSD,512,14.0,No,Asus,Yes
2,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year International Travelers Warranty (ITW),"₹56,990",8,4,i5,Intel,9,SSD,512,15.6,No,Acer,Yes
3,Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...,AMD Ryzen 5 Quad Core Processor (2nd Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹42,990",8,4,Ryzen 5,AMD,2,SSD,512,14.0,No,Asus,No
4,HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹52,990",8,4,i5,Intel,10,SSD,512,14.0,No,HP,No
5,Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹59,990",8,4,i5,Intel,10,Both,1000+256,14.0,No,Asus,Yes
6,Asus VivoBook Gaming Core i5 9th Gen - (8 GB/1...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,39.62 cm (15.6 inch) Display,1 Year Onsite Warranty,"₹64,990",8,4,i5,Intel,9,Both,1000+256,15.6,No,Asus,Yes
7,Asus ROG Strix G15 Core i7 10th Gen - (8 GB/1 ...,Intel Core i7 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB SSD,39.62 cm (15.6 inch) Display,1 Year Onsite Warranty,"₹94,990",8,4,i7,Intel,10,SSD,1000,15.6,No,Asus,Yes
8,Asus VivoBook Gaming Core i7 9th Gen - (16 GB ...,Intel Core i7 Processor (9th Gen),16 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year Limited International Hardware Warranty,"₹75,990",16,4,i7,Intel,9,SSD,512,15.6,No,Asus,Yes
9,Asus VivoBook S Series Core i5 8th Gen - (8 GB...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB HDD|256 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹53,990",8,4,i5,Intel,8,Both,1000+256,14.0,No,Asus,No


Export the dataset into a CSV file

In [None]:
dataset.to_csv('dataset.csv')

Since this is a dynamic website, its content keeps changing. If we run the notebook again later, the link may be the same but the scraped content will be different.

# Data Cleaning

In [92]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [93]:
df = pd.read_csv('final_merged.csv')

In [94]:
df.shape

(168, 21)

In [95]:
df.drop('Unnamed: 0',axis = 1, inplace = True)

In [96]:
df.shape

(168, 20)
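The 'Unnamed: 0' column dropped above is what a saved DataFrame index looks like when the file is read back. As a minimal sketch (the toy frame and its values are illustrative, not taken from the scraped data), it can be avoided either when writing or when reading:

```python
import io
import pandas as pd

# Toy frame standing in for the scraped dataset (illustrative values only)
toy = pd.DataFrame({"Company": ["HP", "Asus"], "RAM_GB": [8, 16]})

# Default to_csv() writes the index as an unnamed first column
with_index = toy.to_csv()
reread = pd.read_csv(io.StringIO(with_index))
print(reread.columns.tolist())  # the first column is named 'Unnamed: 0'

# Option 1: do not write the index at all
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Option 2: keep the file as it is and read the first column back as the index
as_index = pd.read_csv(io.StringIO(with_index), index_col=0)
print(no_index.columns.tolist(), as_index.columns.tolist())
```

Either option removes the need for the extra drop step after loading.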

In [97]:
df.head()

Unnamed: 0,Description,Processor,RAM,Operating System,Storage,Display,Warranty,Price,RAM_GB,DDR_Version,Processor Name,Processor Type,Generation,Operating System Type,Storage_GB,Disk Drive,Size(Inches),Company,Graphic Card,Touchscreen
0,HP Spectre x360 Core i5 10th Gen - (8 GB/512 G...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,33.78 cm (13.3 inch) Touchscreen Display,1 Year Onsite Warranty,"₹1,02,990",8,4,Intel,i5,10,Windows,512,SSD,13.3,HP,No,Yes
1,Asus VivoBook 14 Core i5 8th Gen - (8 GB/512 G...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹55,990",8,4,Intel,i5,8,Windows,512,SSD,14.0,Asus,Yes,No
2,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year International Travelers Warranty (ITW),"₹56,990",8,4,Intel,i5,9,Windows,512,SSD,15.6,Acer,Yes,No
3,Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...,AMD Ryzen 5 Quad Core Processor (2nd Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹42,990",8,4,AMD,Ryzen 5,2,Windows,512,SSD,14.0,Asus,No,No
4,HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹52,990",8,4,Intel,i5,10,Windows,512,SSD,14.0,HP,No,No


In [98]:
df.dtypes # Checking the datatypes

Description               object
Processor                 object
RAM                       object
Operating System          object
Storage                   object
Display                   object
Warranty                  object
Price                     object
RAM_GB                     int64
DDR_Version               object
Processor Name            object
Processor Type            object
Generation                object
Operating System Type     object
Storage_GB                object
Disk Drive                object
Size(Inches)             float64
Company                   object
Graphic Card              object
Touchscreen               object
dtype: object

A few columns here have the object (categorical) datatype even though they actually hold numerical values, so we need to convert them to numerical columns. These are DDR_Version, Generation, Storage_GB and Price.

In [99]:
df['Generation'].unique()

array(['10', '8', '9', '2', 'Not Mentioned', '3', '5', '7'], dtype=object)

In [100]:
df['DDR_Version'].unique()

array(['4', '3', 'Not Mentioned'], dtype=object)

In [101]:
df['Storage_GB'].unique()

array(['512', '1000+256', '1000', '256', '1000+128', '128', '1000+512',
       '2000', '2000+256'], dtype=object)
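Values such as '1000+256' cannot be converted to a number directly. One possible approach, sketched here on a small sample Series, is to add up the parts around the '+' sign; the helper name `total_gb` is hypothetical and not part of the notebook:

```python
import pandas as pd

# Sample of the values seen in Storage_GB above
storage_gb = pd.Series(['512', '1000+256', '2000'])

def total_gb(value):
    """Add up the '+'-separated parts, e.g. '1000+256' -> 1256."""
    return sum(int(part) for part in value.split('+'))

storage_total = storage_gb.apply(total_gb)
print(storage_total.tolist())
```

Note that the total loses the HDD/SSD distinction, so it is best kept alongside the Disk Drive column rather than in place of it.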

In [102]:
# Replacing 'Not Mentioned' with 0
df['Generation']=df['Generation'].apply(lambda x:x.replace('Not Mentioned','0'))
df['DDR_Version']=df['DDR_Version'].apply(lambda x:x.replace('Not Mentioned','0'))

In [103]:
df['Generation'].unique()

array(['10', '8', '9', '2', '0', '3', '5', '7'], dtype=object)

In [104]:
df['DDR_Version'].unique()

array(['4', '3', '0'], dtype=object)
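Replacing 'Not Mentioned' with 0 makes the column easy to convert, but a later model may treat 0 as a real generation or DDR version. As an alternative sketch (not what this notebook does), `pd.to_numeric` with `errors='coerce'` turns unparseable entries into missing values instead:

```python
import pandas as pd

# Sample of the raw Generation values
generation = pd.Series(['10', '8', 'Not Mentioned', '2'])

# Anything that is not a number becomes NaN
gen_num = pd.to_numeric(generation, errors='coerce')

# Optional: the nullable integer dtype keeps whole numbers and shows missing values as <NA>
gen_int = gen_num.astype('Int64')
print(gen_int)
```

Whether 0 or a missing value is the better placeholder depends on how the column is used later.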

In [105]:
df['RAM'].unique()

array(['8 GB DDR4 RAM', '16 GB DDR4 RAM', '4 GB DDR4 RAM',
       '8 GB DDR3 RAM', '16 GB LPDDR4X RAM', '8 GB LPDDR4X RAM',
       '8GB RAM & 256GB SSD for superfast computing', '32 GB DDR4 RAM',
       '4 GB DDR3 RAM', '16 GB DDR3 RAM'], dtype=object)

In [106]:
df['RAM']=df['RAM'].apply(lambda x:x.replace('8GB RAM & 256GB SSD for superfast computing','8 GB RAM'))
# The DDR version is not mentioned in this entry, so we replace it with just '8 GB RAM'

In [107]:
df['RAM'].unique()

array(['8 GB DDR4 RAM', '16 GB DDR4 RAM', '4 GB DDR4 RAM',
       '8 GB DDR3 RAM', '16 GB LPDDR4X RAM', '8 GB LPDDR4X RAM',
       '8 GB RAM', '32 GB DDR4 RAM', '4 GB DDR3 RAM', '16 GB DDR3 RAM'],
      dtype=object)
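The RAM size and DDR version could also be pulled straight out of these strings with a regular expression. This is only a sketch on sample values; the group names `ram_gb` and `ddr` are made up for the illustration:

```python
import pandas as pd

# Sample of the RAM strings listed above, including the noisy one
ram = pd.Series(['8 GB DDR4 RAM', '16 GB LPDDR4X RAM', '8 GB RAM',
                 '8GB RAM & 256GB SSD for superfast computing'])

# Leading number = size in GB; the DDR/LPDDR digit is optional
pattern = r'^(?P<ram_gb>\d+)\s*GB(?:\s+(?:LP)?DDR(?P<ddr>\d))?'
parts = ram.str.extract(pattern)
print(parts)
```

Rows without a DDR version come back as NaN in the `ddr` column, so the noisy entry does not need to be replaced by hand first.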

In [108]:
df['Storage_GB'].unique()

array(['512', '1000+256', '1000', '256', '1000+128', '128', '1000+512',
       '2000', '2000+256'], dtype=object)

In [109]:
df['Storage']

0               512 GB SSD
1               512 GB SSD
2               512 GB SSD
3               512 GB SSD
4               512 GB SSD
              ...         
163             128 GB SSD
164    2 TB HDD|256 GB SSD
165             512 GB SSD
166               1 TB SSD
167             512 GB SSD
Name: Storage, Length: 168, dtype: object

In [110]:
df['Storage'].unique()

array(['512 GB SSD', '1 TB HDD|256 GB SSD', '1 TB SSD', '256 GB SSD',
       '1 TB HDD|512 GB SSD', '1 TB HDD', '1 TB HDD|128 GB SSD',
       '128 GB SSD', '2 TB HDD|256 GB SSD',
       'HP Audio Switch, HP Support Assistant, HP Documentation, HP Jumpstart, HP BIOS Recovery, HP Connection Optimizer, HP 3D DriveGuard (HDD Only), Dropbox',
       '512 GB SSD for Reduced Boot Up Time and in Game Loading',
       '1 TB HDD|1 TB SSD', '2 TB HDD'], dtype=object)

In [111]:
df['Storage']=df['Storage'].apply(lambda x:x.replace('HP Audio Switch, HP Support Assistant, HP Documentation, HP Jumpstart, HP BIOS Recovery, HP Connection Optimizer, HP 3D DriveGuard (HDD Only), Dropbox','1 TB HDD'))
# This entry gives no storage capacity and only mentions 'HDD Only', so we assume the storage is 1 TB HDD

In [112]:
df['Storage']=df['Storage'].apply(lambda x:x.replace('512 GB SSD for Reduced Boot Up Time and in Game Loading','512 GB SSD'))
# Here it is clearly mentioned as 512 GB SSD, so we remove the extra text ('for Reduced Boot Up Time...')

In [113]:
df['Storage'].unique()

array(['512 GB SSD', '1 TB HDD|256 GB SSD', '1 TB SSD', '256 GB SSD',
       '1 TB HDD|512 GB SSD', '1 TB HDD', '1 TB HDD|128 GB SSD',
       '128 GB SSD', '2 TB HDD|256 GB SSD', '1 TB HDD|1 TB SSD',
       '2 TB HDD'], dtype=object)

In [89]:
# Splitting the Storage column into HDD and SSD
SSD = []
HDD = []
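def process(inp):
  c = inp.split('|')
  if(len(c) == 1):
    if('SSD' in inp):
      SSD.append(inp)
      HDD.append('0')
    else:
      # Caution: an HDD-only entry is also stored in the SSD list here,
      # so values like '1 TB HDD' appear in df['SSD'] while HDD stays '0'
      SSD.append(inp)
      HDD.append('0')
  else:
    # Combined entries are ordered 'HDD|SSD'
    SSD.append(c[1])
    HDD.append(c[0])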
      

In [90]:
for i in df['Storage']:
  process(i)
df['SSD'] = SSD
df['HDD'] = HDD

In [91]:
df['SSD'].unique() # HDD-only values still show up here, so this split needs to be changed

array(['512 GB SSD', '256 GB SSD', '1 TB SSD', '1 TB HDD', '128 GB SSD',
       '2 TB HDD'], dtype=object)

In [83]:
df['HDD'].unique()

array(['0', '1 TB HDD', '2 TB HDD'], dtype=object)
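The split above leaves HDD-only entries in the SSD column. A possible one-pass alternative is sketched below: it reads every "size unit kind" triple from the string and returns numeric GB values for both drives. The function name `parse_storage` and the column names `HDD_GB` and `SSD_GB` are hypothetical, and 1 TB is counted as 1000 GB to match the convention used in Storage_GB.

```python
import re
import pandas as pd

def parse_storage(text):
    """Return (hdd_gb, ssd_gb) for strings such as '1 TB HDD|256 GB SSD'."""
    sizes = {'HDD': 0, 'SSD': 0}
    for amount, unit, kind in re.findall(r'(\d+)\s*(GB|TB)\s*(HDD|SSD)', text):
        # 1 TB is treated as 1000 GB, following the Storage_GB column
        sizes[kind] += int(amount) * (1000 if unit == 'TB' else 1)
    return sizes['HDD'], sizes['SSD']

# Sample of the cleaned Storage values
storage = pd.Series(['512 GB SSD', '1 TB HDD|256 GB SSD', '1 TB HDD',
                     '1 TB SSD', '2 TB HDD'])

split = pd.DataFrame(storage.apply(parse_storage).tolist(),
                     columns=['HDD_GB', 'SSD_GB'], index=storage.index)
print(split)
```

Because the drive type is read from the text itself, the order of the parts around '|' does not matter and no separate TB-to-GB step is needed.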

In [84]:
# We now have separate HDD and SSD columns, so the values no longer need the 'HDD'/'SSD' suffix; strip it and convert TB to GB
df['HDD']=df['HDD'].apply(lambda x:x.replace('TB HDD',''))
hdd=[]
for i in df['HDD']:
    if(i=='0'):
        hdd.append(i)
    elif(i=='1 '):
        i=i.replace('1 ','1000')
        hdd.append(i)
    elif(i=='2 '):
        i=i.replace('2 ','2000')
        hdd.append(i)
df['HDD']=hdd
df['HDD'].unique()

array(['0', '1000', '2000'], dtype=object)

In [86]:
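df['SSD']=df['SSD'].apply(lambda x:x.replace('GB SSD',''))
df['SSD']=df['SSD'].apply(lambda x:x.replace('TB SSD',''))
ssd=[]
# Caution: the first two branches append to hdd instead of ssd, and the '2 ' branch
# appends nothing, so '1 TB SSD' rows are missing from ssd and HDD-only strings
# such as '1 TB HDD' pass through unchanged
for i in df['SSD']:
    if(i=='0'):
        hdd.append(i)
    elif(i=='1 '):
        i=i.replace('1 ','1000')
        hdd.append(i)
    elif(i=='2 '):
        i=i.replace('2 ','2000')
    else:
        ssd.append(i)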


In [88]:
ssd

['512 ',
 '512 ',
 '512 ',
 '512 ',
 '512 ',
 '256 ',
 '256 ',
 '512 ',
 '256 ',
 '256 ',
 '512 ',
 '1 TB HDD',
 '256 ',
 '512 ',
 '128 ',
 '512 ',
 '256 ',
 '256 ',
 '512 ',
 '256 ',
 '512 ',
 '512 ',
 '512 ',
 '512 ',
 '256 ',
 '512 ',
 '256 ',
 '256 ',
 '512 ',
 '128 ',
 '256 ',
 '256 ',
 '256 ',
 '512 ',
 '512 ',
 '256 ',
 '512 ',
 '256 ',
 '256 ',
 '256 ',
 '1 TB HDD',
 '512 ',
 '512 ',
 '1 TB HDD',
 '512 ',
 '1 TB HDD',
 '128 ',
 '512 ',
 '512 ',
 '256 ',
 '512 ',
 '128 ',
 '512 ',
 '512 ',
 '256 ',
 '256 ',
 '512 ',
 '1 TB HDD',
 '256 ',
 '256 ',
 '512 ',
 '256 ',
 '512 ',
 '512 ',
 '256 ',
 '128 ',
 '128 ',
 '1 TB HDD',
 '1 TB HDD',
 '256 ',
 '128 ',
 '256 ',
 '1 TB HDD',
 '1 TB HDD',
 '512 ',
 '1 TB HDD',
 '1 TB HDD',
 '1 TB HDD',
 '1 TB HDD',
 '1 TB HDD',
 '1 TB HDD',
 '256 ',
 '256 ',
 '128 ',
 '512 ',
 '512 ',
 '256 ',
 '512 ',
 '256 ',
 '256 ',
 '1 TB HDD',
 '512 ',
 '512 ',
 '1 TB HDD',
 '512 ',
 '512 ',
 '256 ',
 '1 TB HDD',
 '512 ',
 '512 ',
 '128 ',
 '128 ',
 '256 ',
 

Now we can remove the Storage_GB column.

In [114]:
df

Unnamed: 0,Description,Processor,RAM,Operating System,Storage,Display,Warranty,Price,RAM_GB,DDR_Version,Processor Name,Processor Type,Generation,Operating System Type,Storage_GB,Disk Drive,Size(Inches),Company,Graphic Card,Touchscreen
0,HP Spectre x360 Core i5 10th Gen - (8 GB/512 G...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,33.78 cm (13.3 inch) Touchscreen Display,1 Year Onsite Warranty,"₹1,02,990",8,4,Intel,i5,10,Windows,512,SSD,13.3,HP,No,Yes
1,Asus VivoBook 14 Core i5 8th Gen - (8 GB/512 G...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹55,990",8,4,Intel,i5,8,Windows,512,SSD,14.0,Asus,Yes,No
2,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year International Travelers Warranty (ITW),"₹56,990",8,4,Intel,i5,9,Windows,512,SSD,15.6,Acer,Yes,No
3,Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...,AMD Ryzen 5 Quad Core Processor (2nd Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,"₹42,990",8,4,AMD,Ryzen 5,2,Windows,512,SSD,14.0,Asus,No,No
4,HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹52,990",8,4,Intel,i5,10,Windows,512,SSD,14.0,HP,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
163,Lenovo Ideapad Slim APU Dual Core A4 - (4 GB/6...,AMD APU Dual Core A4 Processor,4 GB DDR4 RAM,64 bit Windows 10 Operating System,128 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,"₹20,990",4,4,AMD,APU,0,Windows,128,SSD,14.0,Lenovo,No,No
164,Acer Predator Triton 300 Core i7 9th Gen - (8 ...,Intel Core i7 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,2 TB HDD|256 GB SSD,39.62 cm (15.6 inch) Display,1 Year International Travelers Warranty (ITW),"₹97,990",8,4,Intel,i7,9,Windows,2000+256,Both,15.6,Acer,Yes,No
165,Lenovo Yoga 730 Core i5 8th Gen - (8 GB/512 GB...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,33.78 cm (13.3 inch) Touchscreen Display,1 Year Onsite Warranty,"₹94,990",8,4,Intel,i5,8,Windows,512,SSD,13.3,Lenovo,No,Yes
166,Asus ZenBook Core i7 10th Gen - (16 GB/1 TB SS...,Intel Core i7 Processor (10th Gen),16 GB DDR4 RAM,64 bit Windows 10 Operating System,1 TB SSD,33.78 cm (13.3 inch) Display,1 Year Onsite Warranty,"₹99,990",16,4,Intel,i7,10,Windows,1000,SSD,13.3,Asus,No,No


In [115]:
df.columns

Index(['Description', 'Processor', 'RAM', 'Operating System', 'Storage',
       'Display', 'Warranty', 'Price', 'RAM_GB', 'DDR_Version',
       'Processor Name', 'Processor Type', 'Generation',
       'Operating System Type', 'Storage_GB', 'Disk Drive', 'Size(Inches)',
       'Company', 'Graphic Card', 'Touchscreen'],
      dtype='object')

In [116]:
print(df['Processor Name'].unique())
print(df['Processor Type'].unique())
print(df['Operating System Type'].unique())
print(df['Disk Drive'].unique())

['Intel' 'AMD' 'Microsoft']
['i5' 'Ryzen 5' 'i7' 'i3' 'Ryzen 7' 'Ryzen 3' 'SQ1' 'APU' 'Pentium' 'm3'
 'i9' 'Athlon']
['Windows' 'Mac' 'DOS' 'Linux']
['SSD' 'Both' 'HDD']


In [117]:
print(df['Size(Inches)'].unique())
print(df['Company'].unique())

[13.3 14.  15.6 13.  17.3 16.  10.  15.  12.3 15.4]
['HP' 'Asus' 'Acer' 'Dell' 'Lenovo' 'MSI' 'Apple' 'Microsoft' 'MarQ'
 'Avita' 'Alienware' 'Nexstgo']


In [118]:
cat_to_num = ['Generation','DDR_Version'] 
df[cat_to_num] = df[cat_to_num].apply(pd.to_numeric,errors='coerce') # Converting the categorical variables to numerical variables


In [119]:
df.dtypes # We can see that the datatypes of these columns have changed to int64

Description               object
Processor                 object
RAM                       object
Operating System          object
Storage                   object
Display                   object
Warranty                  object
Price                     object
RAM_GB                     int64
DDR_Version                int64
Processor Name            object
Processor Type            object
Generation                 int64
Operating System Type     object
Storage_GB                object
Disk Drive                object
Size(Inches)             float64
Company                   object
Graphic Card              object
Touchscreen               object
dtype: object

So all of these columns are the independent variables, and Price is the dependent variable.

In [120]:
df['Price'].dtype

dtype('O')

In [121]:
df['Price'].isnull().sum()

0

In [122]:
df['Price']=df['Price'].apply(lambda x: x.replace('₹',''))

In [123]:
# for i in df['Price'].split(','):
#     i = float(i) #using float because you don't only have integers
#     print(i)

In [124]:
price = []
for i in df['Price']:
    i = i.replace(',','')
    i = float(i)
    price.append(i)

In [125]:
df['Price'] = price
df['Price']

0      102990.0
1       55990.0
2       56990.0
3       42990.0
4       52990.0
         ...   
163     20990.0
164     97990.0
165     94990.0
166     99990.0
167    229990.0
Name: Price, Length: 168, dtype: float64

In [126]:
df['Price'].dtype

dtype('float64')

For now, Price is stored as a float without commas. A way to show the prices with commas while keeping them as integers still needs to be found.
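One way to get both, sketched here under assumptions, is to store the price as an integer and add the commas only when displaying it. The helper `format_inr` is hypothetical; it applies Indian digit grouping (last three digits, then groups of two) and assumes non-negative whole numbers.

```python
import pandas as pd

# Sample of raw price strings as scraped
raw = pd.Series(['₹1,02,990', '₹55,990', '₹2,29,990'])

# Strip the currency sign and commas, then store as integers
price = (raw.str.replace('₹', '', regex=False)
            .str.replace(',', '', regex=False)
            .astype(int))

def format_inr(amount):
    """Format a non-negative integer with Indian grouping, e.g. 102990 -> '1,02,990'."""
    digits = str(amount)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ','.join(groups + [tail])

print(price.tolist())
print([format_inr(p) for p in price])
```

Keeping the column numeric and formatting only for display means it stays usable for sorting, statistics and modelling.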

In [127]:
df.head()

Unnamed: 0,Description,Processor,RAM,Operating System,Storage,Display,Warranty,Price,RAM_GB,DDR_Version,Processor Name,Processor Type,Generation,Operating System Type,Storage_GB,Disk Drive,Size(Inches),Company,Graphic Card,Touchscreen
0,HP Spectre x360 Core i5 10th Gen - (8 GB/512 G...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,33.78 cm (13.3 inch) Touchscreen Display,1 Year Onsite Warranty,102990.0,8,4,Intel,i5,10,Windows,512,SSD,13.3,HP,No,Yes
1,Asus VivoBook 14 Core i5 8th Gen - (8 GB/512 G...,Intel Core i5 Processor (8th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,55990.0,8,4,Intel,i5,8,Windows,512,SSD,14.0,Asus,Yes,No
2,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,Intel Core i5 Processor (9th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,39.62 cm (15.6 inch) Display,1 Year International Travelers Warranty (ITW),56990.0,8,4,Intel,i5,9,Windows,512,SSD,15.6,Acer,Yes,No
3,Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...,AMD Ryzen 5 Quad Core Processor (2nd Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Limited International Hardware Warranty,42990.0,8,4,AMD,Ryzen 5,2,Windows,512,SSD,14.0,Asus,No,No
4,HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,Intel Core i5 Processor (10th Gen),8 GB DDR4 RAM,64 bit Windows 10 Operating System,512 GB SSD,35.56 cm (14 inch) Display,1 Year Onsite Warranty,52990.0,8,4,Intel,i5,10,Windows,512,SSD,14.0,HP,No,No


In [128]:
df.columns

Index(['Description', 'Processor', 'RAM', 'Operating System', 'Storage',
       'Display', 'Warranty', 'Price', 'RAM_GB', 'DDR_Version',
       'Processor Name', 'Processor Type', 'Generation',
       'Operating System Type', 'Storage_GB', 'Disk Drive', 'Size(Inches)',
       'Company', 'Graphic Card', 'Touchscreen'],
      dtype='object')

In [129]:
df.dtypes

Description               object
Processor                 object
RAM                       object
Operating System          object
Storage                   object
Display                   object
Warranty                  object
Price                    float64
RAM_GB                     int64
DDR_Version                int64
Processor Name            object
Processor Type            object
Generation                 int64
Operating System Type     object
Storage_GB                object
Disk Drive                object
Size(Inches)             float64
Company                   object
Graphic Card              object
Touchscreen               object
dtype: object

In [130]:
df['Touchscreen'].unique()

array(['Yes', 'No'], dtype=object)
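Since Touchscreen (and Graphic Card) only hold 'Yes' and 'No', they could be turned into 1/0 flags if numeric inputs are needed later. This is an optional sketch on a toy frame, not a step the notebook performs:

```python
import pandas as pd

# Toy frame with the two Yes/No columns (illustrative values only)
flags = pd.DataFrame({'Touchscreen': ['Yes', 'No', 'No'],
                      'Graphic Card': ['No', 'Yes', 'Yes']})

# Map each Yes/No column to 1/0; any other value would become NaN
encoded = flags.apply(lambda col: col.map({'Yes': 1, 'No': 0}))
print(encoded)
```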

In [131]:
df.to_csv('changed.csv')