# Brand Analysis and Classification

In this notebook we will performing some exploration on laptop brands and then attempt to classify laptops based on the brand.

In [3]:
import pandas as pd
import numpy as np
from Features import add_numeric_features

### Some Additional Feature Generation

Before performing any analysis, we firt want to add a couple of features. The goal here is to change some of the string features that contain quantities into numeric features. For example, the RAM feature is mostly just quantities with units attached, so we plan to change this feature to a numeric feature.

In [14]:
# Load Merged Table
data = pd.read_csv('../Data/Merged_Table.csv')
data.head(3)

Unnamed: 0,ID,Name,Price,Brand,Screen Size,RAM,Hard Drive Capacity,Processor Type,Processor Speed,Operating System,Battery Life
0,0,"HP Flyer Red 15.6"" 15-f272wm Laptop PC with In...",299.0,HP,15.6 in,4 GB,500 GB,Intel Pentium,"2.16 GHz, with a Max Turbo Speed of 2.66 GHz",Windows 10,4.5 hours
1,2,"HP Stream 11.6"" Laptop, Windows 10 Home, Offic...",199.0,HP,11.6 in,4 GB,32 GB,Intel Celeron,1.6 Hz,Windows 10,10 h
2,4,"HP Black Licorice 15.6"" 15-F387WM Laptop PC wi...",329.0,HP,15.6 in,4 GB,500 GB,AMD A-Series,"2.20 GHz, with a Max Turbo Speed of 2.50 GHz",Windows 10,


In [34]:
import re


# Keep track of update data
result = []

# For each tuple, add the new features
cols = list(data.columns)
for row in data.itertuples():
    row = list(row)

    # Add an additioanl column with each of the following features changed to a numeric value
    features = ['Screen Size', 'RAM', 'Hard Drive Capacity', 'Battery Life']
    for feat in features:
       
        # Check for null values for this feature
        if pd.isnull(row[cols.index(feat) + 1]) or pd.isnull(row[cols.index(feat) + 1]):
            val = float('nan')
        
        # Otherwise change to a numeric value
        else:
            val = float(re.sub('[^0-9.]', '', row[cols.index(feat) + 1]))
            
        # Change TB to GB for hard drive capacity
        if feat = 'Hard Drive Capacity' and val == 1:
            val = 1000
            
        # Add new value to this row
        row.append(val)
    
    # Add the new row to the results
    result.append(row)

# Create a DataFrame
cols = ['ID', 'Name', 'Price', 'Brand', 'Screen Size', 'RAM', 'Hard Drive Capacity', 'Processor Type', 'Processor Speed',
        'Operating System', 'Battery Life', 'Screen Size (Numeric)', 'RAM (Numeric)', 'Hard Drive Capacity (Numeric)',
        'Battery Life (Numeric)']
df = pd.DataFrame(result)
df.head()  


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0,0,"HP Flyer Red 15.6"" 15-f272wm Laptop PC with In...",299.0,HP,15.6 in,4 GB,500 GB,Intel Pentium,"2.16 GHz, with a Max Turbo Speed of 2.66 GHz",Windows 10,4.5 hours,15.6,4.0,500.0,4.5
1,1,2,"HP Stream 11.6"" Laptop, Windows 10 Home, Offic...",199.0,HP,11.6 in,4 GB,32 GB,Intel Celeron,1.6 Hz,Windows 10,10 h,11.6,4.0,32.0,10.0
2,2,4,"HP Black Licorice 15.6"" 15-F387WM Laptop PC wi...",329.0,HP,15.6 in,4 GB,500 GB,AMD A-Series,"2.20 GHz, with a Max Turbo Speed of 2.50 GHz",Windows 10,,15.6,4.0,500.0,
3,3,5,"HP Silver Iridium Ci5 15-cc050wm 15.6"" Laptop,...",569.0,HP,15.6 in,12 GB,1 GB,7th Generation Intel Core i5-7200U Processor,2500 Hz,Windows 10,9.00 h,15.6,12.0,1.0,9.0
4,4,8,"HP 15-bw032WM 15.6"" Laptop Bundle, Windows 10,...",391.76,HP,15.6 in,8 GB,1 KB,,,"Microsoft Windows, @generated",,15.6,8.0,1.0,


### OLAP Exploration

First we will do a little bit of OLAP style exploration. The goal here is to learn a little bit more about each of the brands. Which brands are more expensive? Which brands tend to have more powerful processors?

In [13]:
# Roll up on Brand
data.groupby('Brand').agg({'Price': ['mean'], 'ID': ['count']})

Unnamed: 0_level_0,Price,ID
Unnamed: 0_level_1,mean,count
Brand,Unnamed: 1_level_2,Unnamed: 2_level_2
ASUS,882.672594,240
Acer,554.176981,699
Apple,659.563004,556
Asus,509.0,141
Dell,568.354168,559
HP,437.314723,1118
Lenovo,562.449867,979
