# Regular Expressions to Extract Features from Product Descriptions

Use regular expressions in Python/Pandas to mine useful features from the laptop product details data.

The data comes from a scraped e-commerce dataset: [Web Scraping - Flipkart](https://github.com/murali-munna/NLP/tree/master/Web%20Scraping%20-%20Flipkart)

In [1]:
import re
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/drive')

Mounted at /drive


In [96]:
df = pd.read_csv('/drive/My Drive/Colab Notebooks/NLP/1. Text Processing/laptop_details.csv')

In [97]:
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R..."
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ..."
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ..."
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ..."
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ..."


In [98]:
# Let's take a closer look at the details column
print(df.loc[0,'Details'])
print(df.loc[3,'Details'])

Intel Core i5 Processor (9th Gen), 8 GB DDR4 RAM, 64 bit Windows 10 Operating System, 512 GB SSD, 39.62 cm (15.6 inch) Display, Acer Collection, Acer Product Registration , Quick Access, Acer Care Center, 1 Year International Travelers Warranty (ITW)
Intel Core i5 Processor (10th Gen), 8 GB DDR4 RAM, 64 bit Windows 10 Operating System, 1 TB HDD|256 GB SSD, 35.56 cm (14 inch) Display, Microsoft Office Home and Student 2019, 1 Year Limited Hardware Warranty, In Home Service After Remote Diagnosis - Retail


## Extract - Processor, Generation, RAM, OS, Hard Disk, Display and Warranty

In [99]:
# Processor - should match i3, i5 etc. as a standalone word. I deliberately ignored AMD processors for now
df['processor'] = df['Details'].str.extract(r'([^a-zA-Z]i[0-9][^a-zA-Z])|([0-9]+\sQuad\s)', expand = True).apply(lambda x: ''.join(x.dropna()), axis=1)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3
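
One thing to watch in the processor pattern: the negated classes `[^a-zA-Z]` sit inside the capture group, so the neighbouring characters (here, spaces) are captured along with `i5`. A minimal sketch of an alternative that uses `\b` word boundaries instead, which match a position rather than a character. The sample strings are hypothetical rows modeled on the `Details` column, and the third row is invented to show the no-match case.

```python
import pandas as pd

# Hypothetical sample rows modeled on the Details strings shown above
details = pd.Series([
    "Intel Core i5 Processor (9th Gen), 8 GB DDR4 RAM",
    "AMD Ryzen 5 Quad Core Processor (2nd Gen), 8 GB DDR4 RAM",
    "Intel Celeron Dual Core Processor, 4 GB DDR4 RAM",
])

# \b asserts a word boundary without consuming a character, so the
# captured text is exactly "i5" rather than " i5 "
pattern = r'\b(i[0-9])\b|\b([0-9]+\sQuad)\b'
groups = details.str.extract(pattern, expand=True)

# Each row matches at most one alternative; join whichever group is not NaN
processor = groups.apply(lambda row: ''.join(row.dropna()), axis=1)
print(processor.tolist())  # ['i5', '5 Quad', '']
```

Rows with no match come back as an empty string here because joining an empty set of groups yields `''`.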


In [100]:
# Generation - should match 2nd Gen, 3rd Gen, 8th Gen etc.
df['generation'] = df['Details'].str.extract(r'([0-9]{1,2}[a-z]{1,2}\s+Gen)', expand = True)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen
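
For modelling, the generation is often more useful as a number than as the string "10th Gen". A small sketch, assuming the generation always appears as digits followed by an ordinal suffix and the word "Gen"; the sample rows are modeled on the data above.

```python
import pandas as pd

# Sample rows modeled on the Details strings shown above
details = pd.Series([
    "Intel Core i5 Processor (9th Gen), 8 GB DDR4 RAM",
    "AMD Ryzen 5 Quad Core Processor (2nd Gen), 8 GB DDR4 RAM",
    "Intel Core i3 Processor (10th Gen), 4 GB DDR4 RAM",
])

# Capture only the digits; the ordinal suffix is matched by a
# non-capturing group (?:...) so it does not become a column
gen_text = details.str.extract(r'\b([0-9]{1,2})(?:st|nd|rd|th)\s+Gen\b', expand=False)
gen_num = pd.to_numeric(gen_text)
print(gen_num.tolist())  # [9, 2, 10]
```

Restricting the suffix to `st|nd|rd|th` is slightly stricter than `[a-z]{1,2}`, which would accept any one or two lowercase letters.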


In [101]:
# RAM - should match a number followed by GB and any words up to and including RAM
df['ram'] = df['Details'].str.extract(r'([0-9]+\s?GB.+RAM)', expand = True)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation,ram
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,8 GB DDR4 RAM
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen,8 GB DDR4 RAM
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM
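
The `.+` in the RAM pattern is greedy, so it runs to the last "RAM" in the string. That is harmless when "RAM" appears once, but it over-matches otherwise. The sketch below compares the greedy pattern with a lazy `.+?` and with a variant that stays inside one comma-separated field. The input string is hypothetical, constructed to contain "RAM" twice.

```python
import pandas as pd

# Hypothetical string in which "RAM" appears twice
text = pd.Series(["8 GB DDR4 RAM, 512 GB SSD, Expandable RAM up to 16 GB"])

greedy = text.str.extract(r'([0-9]+\s?GB.+RAM)', expand=False)
lazy = text.str.extract(r'([0-9]+\s?GB.+?RAM)', expand=False)
# [^,]* cannot cross a comma, so the match stays within one field
bounded = text.str.extract(r'([0-9]+\s?GB[^,]*RAM)', expand=False)

print(greedy[0])   # 8 GB DDR4 RAM, 512 GB SSD, Expandable RAM
print(lazy[0])     # 8 GB DDR4 RAM
print(bounded[0])  # 8 GB DDR4 RAM
```

The bounded form is the safer of the two when a storage size such as "512 GB SSD" can precede the RAM entry, since a lazy match would still start at the first "GB" it finds.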


In [102]:
# OS - should match Windows/Linux etc. Operating System and their common short forms
df['os'] = df['Details'].str.extract(r'([Ww]indows|[Ll]inux|[Uu]buntu)(.+)(OS|[Oo]perating [Ss]ystem)', expand = True).agg(''.join, axis=1)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation,ram,os
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,8 GB DDR4 RAM,Windows 10 Operating System
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System
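
Joining three capture groups with `agg(''.join, axis=1)` relies on every row matching, because `''.join` cannot concatenate NaN values. A possible alternative, sketched here, wraps the whole expression in a single capture group and makes the inner groups non-capturing, so a row without a match simply yields NaN. The second sample row is hypothetical and has no OS text.

```python
import pandas as pd

details = pd.Series([
    "8 GB DDR4 RAM, 64 bit Windows 10 Operating System, 512 GB SSD",
    "8 GB LPDDR3 RAM, 256 GB SSD",  # hypothetical row with no OS mentioned
])

# One outer capture group; (?:...) groups alternatives without capturing them
pattern = r'((?:[Ww]indows|[Ll]inux|[Uu]buntu).+?(?:OS|[Oo]perating [Ss]ystem))'
os_col = details.str.extract(pattern, expand=False)

print(os_col[0])           # Windows 10 Operating System
print(pd.isna(os_col[1]))  # True
```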


In [103]:
# Storage (SSD) - should match 512 GB SSD, 1 TB SSD etc.
df['storage_ssd'] = df['Details'].str.extract(r'([0-9]+\s[GT]B\sSSD)', expand = True)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation,ram,os,storage_ssd
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD


In [104]:
# Storage (HDD) - should match 512 GB HDD, 1 TB HDD etc.
df['storage_hdd'] = df['Details'].str.extract(r'([0-9]+\s[GT]B\sHDD)', expand = True)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation,ram,os,storage_ssd,storage_hdd
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,,1 TB HDD
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD
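
Since many laptops have only an SSD or only an HDD, it can help to turn both columns into numbers that are easy to combine. A rough sketch using named capture groups; `storage_gb` is a hypothetical helper, and counting 1 TB as 1024 GB is an assumption (use 1000 if that fits the analysis better).

```python
import pandas as pd

# Sample rows modeled on the Details strings shown above
details = pd.Series([
    "64 bit Windows 10 Operating System, 512 GB SSD, 39.62 cm",
    "64 bit Windows 10 Operating System, 1 TB HDD|256 GB SSD, 35.56 cm",
])

def storage_gb(series, kind):
    """Capacity in GB for one storage kind ('SSD' or 'HDD'); NaN if absent."""
    # Named groups become the column names of the extracted DataFrame
    parts = series.str.extract(rf'(?P<size>[0-9]+)\s(?P<unit>[GT]B)\s{kind}')
    size = pd.to_numeric(parts['size'])
    # Assumption: 1 TB is counted as 1024 GB
    factor = parts['unit'].map({'GB': 1, 'TB': 1024})
    return size * factor

ssd_gb = storage_gb(details, 'SSD')
hdd_gb = storage_gb(details, 'HDD')

# Treat a missing drive as 0 GB when computing total storage
total_gb = ssd_gb.fillna(0) + hdd_gb.fillna(0)
print(total_gb.tolist())  # [512.0, 1280.0]
```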


In [105]:
# Display - should match 40 cm, 36.5 cm, 38.75 cm etc. (numbers with up to two decimal places)
df['display_cm'] = df['Details'].str.extract(r'([0-9]+)(\.[0-9]{1,2})?(\s?cm)', expand = False).agg(''.join, axis=1)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation,ram,os,storage_ssd,storage_hdd,display_cm
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,39.62 cm
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,,1 TB HDD,35.56 cm
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,39.62 cm
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,35.56 cm
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,39.62 cm


In [106]:
# Display - should match 13 inch, 14.5 inch, 15.75 inch etc. (numbers with up to two decimal places)
df['display_inch'] = df['Details'].str.extract(r'([0-9]+)(\.[0-9]{1,2})?(\s?inch)', expand = False).apply(lambda x: ''.join(x.dropna()), axis=1)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation,ram,os,storage_ssd,storage_hdd,display_cm,display_inch
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,39.62 cm,15.6 inch
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,,1 TB HDD,35.56 cm,14 inch
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,39.62 cm,15.6 inch
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,35.56 cm,14 inch
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,39.62 cm,15.6 inch
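
If only the numeric size is needed, the optional decimal part can live inside a single capture group as a non-capturing group, which avoids joining columns altogether. A minimal sketch on sample rows modeled on the data above:

```python
import pandas as pd

# Sample rows modeled on the Details strings shown above
details = pd.Series([
    "512 GB SSD, 39.62 cm (15.6 inch) Display",
    "1 TB HDD, 35.56 cm (14 inch) Display",
])

# (?:\.[0-9]{1,2})? makes the decimal part optional without creating
# a second capture group, so extract() returns one value per row
inch_text = details.str.extract(r'([0-9]+(?:\.[0-9]{1,2})?)\s?inch', expand=False)
display_inch = pd.to_numeric(inch_text)
print(display_inch.tolist())  # [15.6, 14.0]
```

The unit sits outside the capture group, so the result converts directly to a float column.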


In [107]:
# Warranty - should match the whole phrase, from the leading number and "Year" through to "Warranty"
df['warranty'] = df['Details'].str.extract(r'([0-9]{1,2}\s[Yy]ear.+[Ww]arranty)', expand = True)
df.head()

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation,ram,os,storage_ssd,storage_hdd,display_cm,display_inch,warranty
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,39.62 cm,15.6 inch,1 Year International Travelers Warranty
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,,1 TB HDD,35.56 cm,14 inch,1 Year Limited Hardware Warranty
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,39.62 cm,15.6 inch,1 Year Onsite Warranty
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,35.56 cm,14 inch,1 Year Limited Hardware Warranty
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,39.62 cm,15.6 inch,1 Year Onsite Warranty


In [108]:
# Let's look at the final df to check for any false positives and false negatives
df
# We can add the processor company to differentiate
# Both SSD and HDD info is not present for all laptops. We can merge them for future analysis

Unnamed: 0.1,Unnamed: 0,Product,Price,Details,processor,generation,ram,os,storage_ssd,storage_hdd,display_cm,display_inch,warranty
0,0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,39.62 cm,15.6 inch,1 Year International Travelers Warranty
1,1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ...",i3,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,,1 TB HDD,35.56 cm,14 inch,1 Year Limited Hardware Warranty
2,2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,39.62 cm,15.6 inch,1 Year Onsite Warranty
3,3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,35.56 cm,14 inch,1 Year Limited Hardware Warranty
4,4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,39.62 cm,15.6 inch,1 Year Onsite Warranty
5,5,Asus Core i3 10th Gen - (4 GB/1 TB HDD/Windows...,"₹32,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System,,1 TB HDD,39.62 cm,15.6 inch,1 Year Onsite Warranty
6,6,Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,35.56 cm,14 inch,1 Year Onsite Warranty
7,7,Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,8 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,1 TB HDD,35.56 cm,14 inch,1 Year Onsite Warranty
8,8,Asus VivoBook 14 Core i3 10th Gen - (4 GB/256 ...,"₹38,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,4 GB DDR4 RAM,Windows 10 Operating System,256 GB SSD,,35.56 cm,14 inch,1 Year Onsite Warranty
9,9,Asus VivoBook 14 Ryzen 5 Quad Core 2nd Gen - (...,"₹42,990","AMD Ryzen 5 Quad Core Processor (2nd Gen), 8 G...",5 Quad,2nd Gen,8 GB DDR4 RAM,Windows 10 Operating System,512 GB SSD,,35.56 cm,14 inch,1 Year Limited International Hardware Warranty
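
As a sketch of the first follow-up idea, the processor company could be pulled out with a simple alternation. This assumes the only brands of interest are Intel and AMD, which is all that appears in the rows shown; other brands would come back as NaN.

```python
import pandas as pd

# Sample rows modeled on the Details strings shown above
details = pd.Series([
    "Intel Core i5 Processor (10th Gen), 8 GB DDR4 RAM",
    "AMD Ryzen 5 Quad Core Processor (2nd Gen), 8 GB DDR4 RAM",
])

# Assumption: only Intel and AMD need to be distinguished
brand = details.str.extract(r'\b(Intel|AMD)\b', expand=False)
print(brand.tolist())  # ['Intel', 'AMD']
```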


## Closing Comments


*   The text data is well structured, which made pattern matching relatively easy. Even so, this should give a fair understanding of common regex patterns.
*   With more than one capture group, str.extract() in pandas returns a DataFrame with one column per capture group. Hence I used agg-join, and apply-lambda-join where an unmatched optional group yields NaN, to concatenate the capture groups into a single string.
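
The second point can be seen on a tiny example. The sketch below uses two hypothetical display strings, one with a decimal part and one without, and shows two ways of joining the groups that tolerate the NaN left by the unmatched optional group.

```python
import pandas as pd

# Hypothetical inputs: the second has no decimal part
s = pd.Series(["35.56 cm", "40 cm"])

# Three capture groups -> a DataFrame with three columns (0, 1, 2)
groups = s.str.extract(r'([0-9]+)(\.[0-9]{1,2})?(\s?cm)')
print(groups.shape)  # (2, 3)

# Option 1: drop NaN groups per row, then join what is left
joined = groups.apply(lambda row: ''.join(row.dropna()), axis=1)

# Option 2: replace NaN with '' first, after which a plain join is safe
filled = groups.fillna('').agg(''.join, axis=1)

print(joined.tolist())  # ['35.56 cm', '40 cm']
print(filled.tolist())  # ['35.56 cm', '40 cm']
```

Both options turn a row with no match at all into an empty string rather than NaN, which is worth keeping in mind when counting missing values later.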


