# Information Extraction Using Wrappers

In this notebook we demonstrate how we retrieved laptop data from both Amazon and Walmart. We built wrappers that crawl these websites and extract information for at least 3000 different laptops from each site.
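The wrappers themselves live in the `HW2` module, which is not shown in this notebook. To illustrate the general idea only, here is a minimal sketch of a wrapper built on the standard library's `html.parser`. The page markup, the `product`, `name`, and `price` class names, and the `ProductWrapper` class are all hypothetical and are not taken from the real Amazon or Walmart pages or from `HW2`.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> tags
    inside each <div class="product"> (hypothetical markup)."""

    FIELDS = {"name", "price"}  # assumed class names, for illustration only

    def __init__(self):
        super().__init__()
        self.records = []   # one dict per product block
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "product") in attrs:
            # A new product block starts: open a fresh record
            self.records.append({})
        elif tag == "span" and self.records:
            css_class = dict(attrs).get("class")
            if css_class in self.FIELDS:
                self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the tracked spans
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


# Made-up page fragment; the second product has no price
page = """
<div class="product"><span class="name">Example Laptop A</span><span class="price">299.00</span></div>
<div class="product"><span class="name">Example Laptop B</span></div>
"""

wrapper = ProductWrapper()
wrapper.feed(page)
records = wrapper.records
print(records)
```

A field that is missing from the page is simply absent from the record, which is roughly how empty cells end up in the extracted tables.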

In [1]:
import sys
sys.path.append('/u/p/m/pmartinkus/Documents/CS_838/Stage 2')

import HW2 as hw
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Retrieving the Walmart Data

First, we will show how we extracted the laptop data from Walmart. Unfortunately, the Walmart website will not load more than 1000 results for a single search, so instead of looking for all laptops at once, we split up the search by brand. We create a separate CSV file for each brand (each containing roughly 500 laptops) and then combine the per-brand results into a single table.

In [2]:
# The brands and the number of pages to get per brand (some brands have fewer than 15 pages available)
brands = ['HP', 'Dell', 'Lenovo', 'ASUS', 'Acer', 'Apple']
pages = [15, 15, 15, 7, 15, 15]

for brand, num_pages in zip(brands, pages):
    hw.create_walmart_csv(brand, num_pages)
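The notebook does not show how `hw.create_walmart_csv` walks the result pages for a brand. As a rough sketch of the idea, one URL per (brand, page) pair could be generated as below. The base URL and the `query`, `brand`, and `page` parameter names are placeholders chosen for illustration; they are not Walmart's real search endpoint, and `brand_page_urls` is a hypothetical helper rather than part of `HW2`.

```python
from urllib.parse import urlencode

# Placeholder endpoint, NOT the real Walmart search URL
BASE_URL = "https://www.example.com/search"


def brand_page_urls(brand, num_pages):
    """Return one search URL per result page for a single brand."""
    return [
        BASE_URL + "?" + urlencode({"query": "laptop", "brand": brand, "page": page})
        for page in range(1, num_pages + 1)
    ]


brands = ['HP', 'Dell', 'Lenovo', 'ASUS', 'Acer', 'Apple']
pages = [15, 15, 15, 7, 15, 15]

# Build the full crawl plan: every page of every brand
urls = [url for brand, n in zip(brands, pages) for url in brand_page_urls(brand, n)]
print(len(urls))
```

Splitting the crawl this way keeps each individual search below the per-search result limit described above.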

In [3]:
# Load the csv files as dataframes
data = []
path = '/u/p/m/pmartinkus/Documents/CS_838/Stage 2/Data/Walmart'
names = ['Name', 'Price', 'Brand', 'Screen Size', 'RAM', 'Hard Drive Capacity', 'Processor Type', 'Processor Speed', 
         'Operating System', 'Battery Life']
for brand in brands:
    brand_path = path + '_Brand_Data/Walmart_' + brand + '.csv'
    data.append(pd.read_csv(brand_path, names=names, quotechar="'"))
    
# Take a look at a dataframe
data[0].head()

|   | Name | Price | Brand | Screen Size | RAM | Hard Drive Capacity | Processor Type | Processor Speed | Operating System | Battery Life |
|---|------|-------|-------|-------------|-----|---------------------|----------------|-----------------|------------------|--------------|
| 0 | HP Flyer Red 15.6" 15-f272wm Laptop PC with In... | 299.0 | HP | 15.6 in | 4 GB | 500 GB | Intel Pentium | 2.16 GHz, with a Max Turbo Speed of 2.66 GHz | Windows 10 | 4.5 hours |
| 1 | HP Stream 14-ax030wm 14” Smoke Gray, Windows 1... | 219.0 | HP | 14 in | 4 KB | 32 KB | Intel Celeron Processor N3060 | 1.6 Hz | Windows 10 | 10.25 h |
| 2 | HP Stream 11.6" Laptop, Windows 10 Home, Offic... | 199.0 | HP | 11.6 in | 4 KB | 32 KB | Intel Celeron | 1.6 Hz | Windows 10 | 10 h |
| 3 | HP Stream 14" Jet Black Laptop, Windows 10 Hom... | 218.98 | HP | 14 in | NaN | NaN | Intel Celeron | NaN | Microsoft Windows | NaN |
| 4 | HP Black Licorice 15.6" 15-F387WM Laptop PC wi... | 329.0 | HP | 15.6 in | 4 GB | 500 GB | AMD A-Series | 2.20 GHz, with a Max Turbo Speed of 2.50 GHz | Windows 10 | NaN |


In [4]:
# Combine all the data frames
walmart = pd.concat(data)
len(walmart)

3038

In [5]:
# Save full dataset to csv file
walmart.to_csv(path + '.csv', quotechar="'", na_rep='NaN', index=False)
walmart.head()

|   | Name | Price | Brand | Screen Size | RAM | Hard Drive Capacity | Processor Type | Processor Speed | Operating System | Battery Life |
|---|------|-------|-------|-------------|-----|---------------------|----------------|-----------------|------------------|--------------|
| 0 | HP Flyer Red 15.6" 15-f272wm Laptop PC with In... | 299.0 | HP | 15.6 in | 4 GB | 500 GB | Intel Pentium | 2.16 GHz, with a Max Turbo Speed of 2.66 GHz | Windows 10 | 4.5 hours |
| 1 | HP Stream 14-ax030wm 14” Smoke Gray, Windows 1... | 219.0 | HP | 14 in | 4 KB | 32 KB | Intel Celeron Processor N3060 | 1.6 Hz | Windows 10 | 10.25 h |
| 2 | HP Stream 11.6" Laptop, Windows 10 Home, Offic... | 199.0 | HP | 11.6 in | 4 KB | 32 KB | Intel Celeron | 1.6 Hz | Windows 10 | 10 h |
| 3 | HP Stream 14" Jet Black Laptop, Windows 10 Hom... | 218.98 | HP | 14 in | NaN | NaN | Intel Celeron | NaN | Microsoft Windows | NaN |
| 4 | HP Black Licorice 15.6" 15-F387WM Laptop PC wi... | 329.0 | HP | 15.6 in | 4 GB | 500 GB | AMD A-Series | 2.20 GHz, with a Max Turbo Speed of 2.50 GHz | Windows 10 | NaN |
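The files are written and read with `quotechar="'"`. The notebook does not say why, but one visible effect is that the double-quote inch marks in product names need no escaping. The sketch below shows this with the standard library's `csv` module rather than pandas, using a single made-up row modeled on the names in the table above.

```python
import csv
import io

# Made-up row: the name contains both a double quote (inch mark) and commas
row = ['HP Stream 11.6" Laptop, Windows 10 Home', "199.0", "HP"]

# Write the row with a single quote as the quoting character
buffer = io.StringIO()
csv.writer(buffer, quotechar="'").writerow(row)
line = buffer.getvalue()
print(line)

# Reading back with the same quotechar recovers the original fields
parsed = next(csv.reader(io.StringIO(line), quotechar="'"))
print(parsed == row)
```

The trade-off is that any apostrophe inside a name would now be the character that needs escaping, so the same `quotechar` has to be passed every time the file is read.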


## Retrieving the Amazon Data

Here, things were considerably simpler because we were able to gather the information for all the laptops at once. We run our script, which saves all of the Amazon data into a single CSV file. We can then inspect the results by loading that file into a pandas dataframe.

In [6]:
# Create the CSV file for the Amazon data
hw.create_amazon_csv()

In [7]:
amazon_path = '/u/p/m/pmartinkus/Documents/CS_838/Stage 2/Data/Amazon.csv'
amazon = pd.read_csv(amazon_path, quotechar="'")
len(amazon)

3102

In [8]:
amazon.head()

|   | Name | Price | Brand | Screen Size | RAM | Hard Drive Capacity | Processor Type | Processor Speed | Operating System | Battery Life |
|---|------|-------|-------|-------------|-----|---------------------|----------------|-----------------|------------------|--------------|
| 0 | 2018 Newest HP Premium 15.6" Laptop, AMD A6-92... | 287.99 | HP | 15.6 in | 4 GB | 500 GB | NaN | 2.5 GHz | Windows 10 | NaN |
| 1 | 2018 Newest Premium Dell Inspiron 15.6" HD LED... | 309.0 | Dell | 15.6 in | 4 GB | 500 GB | NaN | 1.6 GHz | Windows 10 | NaN |
| 2 | Acer Aspire E 15 E5-575-33BM 15.6-Inch FHD Not... | 349.99 | Acer | 15.6 in | 4 GB | 1,000 GB | Intel | 2.4 GHz | Windows 10 | NaN |
| 3 | 2018 HP Business 15.6-inch HD Touchscreen Lapt... | 479.0 | HP | 15.6 in | 8 GB | 1 TB | NaN | 2.7 GHz | Windows 10 | NaN |
| 4 | 2018 HP Stream 14 Inch Laptop Computer, Intel ... | 194.99 | HP | 14 in | 4 GB | 32 GB | Intel | 1.6 GHz | Windows 10 | NaN |


In the end, we have two CSV files that share the same schema, each containing over 3000 tuples.
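A quick way to confirm that two CSV files share a schema is to compare their header rows. The sketch below does this with the standard library's `csv` module on small in-memory stand-ins, so it runs without the notebook's data files. The `read_header` helper and the sample rows are illustrative assumptions, not part of `HW2` or of the real datasets.

```python
import csv
import io


def read_header(handle, quotechar="'"):
    """Return the first row (the column names) of a CSV file-like object."""
    return next(csv.reader(handle, quotechar=quotechar))


# Tiny in-memory stand-ins for the two saved files (made-up rows)
walmart_csv = io.StringIO("Name,Price,Brand\n'Example Laptop A, 14 in',218.98,HP\n")
amazon_csv = io.StringIO("Name,Price,Brand\n'Example Laptop B',349.99,Acer\n")

walmart_header = read_header(walmart_csv)
amazon_header = read_header(amazon_csv)

# The schemas match when the column names agree, in the same order
schemas_match = walmart_header == amazon_header
print(schemas_match)
```

With the dataframes already loaded in this notebook, the equivalent check would be `list(walmart.columns) == list(amazon.columns)`.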