# Brand Analysis and Classification

In this notebook we will explore the laptop data by brand and then attempt to classify laptops by brand.

In [1]:
import pandas as pd
import numpy as np
from Features import add_numeric_features, impute_features
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree, svm
from sklearn.model_selection import cross_val_score
from sklearn import metrics

### Some Additional Feature Generation

Before performing any analysis, we first want to add a few features. The goal here is to convert string features that contain quantities into numeric features. For example, the RAM feature is mostly a quantity with a unit attached (such as `4 GB`), so we convert it to a numeric feature.

In [2]:
# Load Merged Table
data = pd.read_csv('../Data/Merged_Table.csv')
data.head(3)

Unnamed: 0,ID,Name,Price,Brand,Screen Size,RAM,Hard Drive Capacity,Processor Type,Processor Speed,Operating System,Battery Life
0,0,"HP Flyer Red 15.6"" 15-f272wm Laptop PC with In...",299.0,HP,15.6 in,4 GB,500 GB,Intel Pentium,2.16 GHz,Windows 10,4.5 hours
1,2,"HP Stream 11.6"" Laptop, Windows 10 Home, Offic...",199.0,HP,11.6 in,4 GB,32 GB,Intel Celeron,1.6 Hz,Windows 10,10 h
2,4,"HP Black Licorice 15.6"" 15-F387WM Laptop PC wi...",329.0,HP,15.6 in,4 GB,500 GB,AMD A-Series,2.20 GHz,Windows 10,


In [3]:
data = add_numeric_features(data)
data.head(3)

Unnamed: 0,ID,Name,Price,Brand,Screen Size,RAM,Hard Drive Capacity,Processor Type,Processor Speed,Operating System,Battery Life,Screen Size (Numeric),RAM (Numeric),Hard Drive Capacity (Numeric),Processor Speed (Numeric),Battery Life (Numeric)
0,0,"HP Flyer Red 15.6"" 15-f272wm Laptop PC with In...",299.0,HP,15.6 in,4 GB,500 GB,Intel Pentium,2.16 GHz,Windows 10,4.5 hours,15.6,4.0,500.0,2.16,4.5
1,2,"HP Stream 11.6"" Laptop, Windows 10 Home, Offic...",199.0,HP,11.6 in,4 GB,32 GB,Intel Celeron,1.6 Hz,Windows 10,10 h,11.6,4.0,32.0,1.6,10.0
2,4,"HP Black Licorice 15.6"" 15-F387WM Laptop PC wi...",329.0,HP,15.6 in,4 GB,500 GB,AMD A-Series,2.20 GHz,Windows 10,,15.6,4.0,500.0,2.2,
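
The `add_numeric_features` helper comes from the local `Features` module and is not shown in this notebook. As a rough illustration of the kind of conversion described above, the sketch below pulls the first number out of strings such as `4 GB` or `4.5 hours`. The function name `extract_number` and the toy table are assumptions made for this example, not the project's actual implementation.

```python
import re

import numpy as np
import pandas as pd


def extract_number(value):
    """Return the first number found in a string such as '4 GB', or NaN."""
    if not isinstance(value, str):
        return np.nan  # missing values (None/NaN) stay missing
    match = re.search(r"\d+(?:\.\d+)?", value)
    return float(match.group()) if match else np.nan


# Hypothetical toy rows that mimic the string columns in the merged table
sample = pd.DataFrame({
    "RAM": ["4 GB", "8 GB", None],
    "Battery Life": ["4.5 hours", "10 h", None],
})
for col in ["RAM", "Battery Life"]:
    sample[col + " (Numeric)"] = sample[col].map(extract_number)

print(sample)
```

Note that this sketch ignores units entirely, so a value such as `1 TB` would become `1.0` rather than `1000.0`; a real helper would likely need to normalise units as well.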


### OLAP Exploration

First we will do a little bit of OLAP style exploration. The goal here is to learn a little bit more about each of the brands. Which brands are more expensive? Which brands tend to have more powerful processors?

In [4]:
# Roll up on Brand
data.groupby('Brand').agg({'ID': ['count'], 'Price': ['mean'], 'Screen Size (Numeric)': ['mean'],
                           'RAM (Numeric)': ['mean'], 'Hard Drive Capacity (Numeric)': ['mean'],
                           'Processor Speed (Numeric)': ['mean'], 'Battery Life (Numeric)': ['mean']})

Brand,ID (count),Price (mean),Screen Size (Numeric) (mean),RAM (Numeric) (mean),Hard Drive Capacity (Numeric) (mean),Processor Speed (Numeric) (mean),Battery Life (Numeric) (mean)
ASUS,381,744.020395,15.090237,9.768293,729.685301,2.012891,7.163934
Acer,560,608.746,13.845962,7.56917,417.422164,2.129517,7.20339
Apple,559,657.938945,13.459546,6.081481,345.389791,2.092437,9.921656
Dell,559,568.354168,14.431223,8.565996,419.961945,2.430898,8.287794
HP,1120,436.927536,14.999099,6.748988,522.975207,2.192193,7.223881
Lenovo,983,563.939959,14.803406,9.247337,591.687124,2.179495,5.675904
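
Passing lists of functions to `agg` produces two-level column labels. If flat labels are preferred, pandas' named aggregation is one alternative. The sketch below uses a small made-up table and hypothetical output names (`count`, `mean_price`, `mean_ram`); it is meant only to show the pattern, not to reproduce the numbers above.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the merged table
toy = pd.DataFrame({
    "Brand": ["HP", "HP", "Apple", "Apple"],
    "ID": [0, 1, 2, 3],
    "Price": [299.0, 199.0, 999.0, 1299.0],
    "RAM (Numeric)": [4.0, 4.0, 8.0, 16.0],
})

# Named aggregation: output_name=(input_column, function)
summary = (
    toy.groupby("Brand")
    .agg(
        count=("ID", "count"),
        mean_price=("Price", "mean"),
        mean_ram=("RAM (Numeric)", "mean"),
    )
    .sort_values("mean_price", ascending=False)  # most expensive brand first
)
print(summary)
```

Sorting by one of the aggregated columns makes questions such as "which brand is most expensive?" readable directly from the row order.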


**Quantity:** HP and Lenovo are the most heavily represented brands in the table, with 1120 and 983 laptops respectively. In the middle, Acer, Apple, and Dell each have between five and six hundred laptops, and ASUS has the fewest data points at just under 400.

**Price:** The most expensive brand, on average, is ASUS at nearly \$750 per laptop with Apple in second at about \$660. The brand with the cheapest laptops is HP at under \$450 per laptop.

Next we will look at the features of each brand's laptops to see whether the more expensive brands justify their price with better products. If the more expensive brands have higher values for Screen Size, RAM, Hard Drive Capacity, Processor Speed, and/or Battery Life, then it would make sense that their laptops cost more.

**Screen Size:** ASUS, HP, and Lenovo have the largest screens, with a mean of 15, or nearly 15, inches. Apple, on the other hand, has the smallest average screen size at just under 13.5 inches.

**RAM:** ASUS again tops the list, along with Lenovo, at over 9 GB of RAM for their average laptop. Apple and HP are at the bottom with an average of about 6.1 GB and 6.7 GB respectively.

**Hard Drive Capacity:** ASUS has a substantial lead in average hard drive capacity at over 700 GB, with the closest brand, Lenovo, at just under 600 GB. Apple has the lowest average capacity at about 345 GB, followed by Acer and Dell at about 420 GB each.

**Processor Speed:** The fastest brand is Dell at over 2.4 GHz with the others all hovering between 2 and 2.2 GHz.

**Battery Life:** Lastly, Apple is the brand with the longest battery life at almost 10 hours with Lenovo by far the worst at under 6 hours of battery life.

### Brand Classification

Next we will see if we can classify laptops into their different brands using the other features (excluding Name because it often contains the brand). We will first impute the missing values in the data, then split the table into a training set and an evaluation set, use cross-validation to select a classifier, and finally evaluate that classifier on the evaluation set.

In [5]:
# First we need to impute all of the missing values in the table
data = impute_features(data)
data.head(3)

Unnamed: 0,ID,Name,Price,Brand,Screen Size,RAM,Hard Drive Capacity,Processor Type,Processor Speed,Operating System,Battery Life,Screen Size (Numeric),RAM (Numeric),Hard Drive Capacity (Numeric),Processor Speed (Numeric),Battery Life (Numeric)
0,0,"HP Flyer Red 15.6"" 15-f272wm Laptop PC with In...",299.0,HP,15.6 in,4 GB,500 GB,Intel Pentium,2.16 GHz,Windows 10,4.5 hours,15.6,4.0,500.0,2.16,4.5
1,2,"HP Stream 11.6"" Laptop, Windows 10 Home, Offic...",199.0,HP,11.6 in,4 GB,32 GB,Intel Celeron,1.6 Hz,Windows 10,10 h,11.6,4.0,32.0,1.6,10.0
2,4,"HP Black Licorice 15.6"" 15-F387WM Laptop PC wi...",329.0,HP,15.6 in,4 GB,500 GB,AMD A-Series,2.20 GHz,Windows 10,,15.6,4.0,500.0,2.2,7.562405
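
The `impute_features` helper is also defined in the local `Features` module and is not shown here. One common strategy, and only an assumption about what the helper does, is to fill each missing numeric value with its column mean. A minimal sketch on a made-up table:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with missing numeric values
toy = pd.DataFrame({
    "Battery Life (Numeric)": [4.5, 10.0, np.nan],
    "RAM (Numeric)": [4.0, np.nan, 8.0],
})

# Fill each numeric column's gaps with that column's mean
numeric_cols = toy.select_dtypes(include="number").columns
imputed = toy.copy()
imputed[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(imputed)
```

One design consideration: computing fill values on the whole table before splitting lets evaluation rows influence the values seen during training. Computing the means on the training split only would avoid that, at the cost of a slightly more involved workflow.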


In [6]:
# Now let's split the data into the inputs and targets
inputs = data.drop(['ID', 'Name', 'Brand', 'Screen Size', 'RAM', 'Hard Drive Capacity', 'Processor Type', 'Processor Speed', 'Operating System', 'Battery Life'], axis=1)
targets = data['Brand']
# Sanity check: print the ID of any row whose Brand is not a string (e.g. NaN)
for row in data.itertuples():
    if not isinstance(row.Brand, str):
        print(row.ID)

In [7]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.25, random_state=0)
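
Because the brands are unevenly represented (HP has roughly three times as many rows as ASUS), one option worth considering is a stratified split, which keeps the brand proportions the same in both sets. The sketch below uses synthetic data and the `stratify` parameter of `train_test_split`; it is an illustration of the option, not what the notebook above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 120 "HP" rows and 80 "ASUS" rows
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array(["HP"] * 120 + ["ASUS"] * 80)

# stratify=y preserves the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print((y_te == "HP").sum(), (y_te == "ASUS").sum())
```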

In [8]:
# Initialize classifiers
rf = RandomForestClassifier(random_state=0)
dt = tree.DecisionTreeClassifier()
sv = svm.SVC()
matchers = [rf, dt, sv]
matcher_names=['Random Forest', 'Decision Tree', 'SVM']
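
Support vector machines are sensitive to feature scale, and the inputs here mix columns measured in hundreds (price, hard drive capacity) with columns around 2 (processor speed). One possible refinement, shown here on synthetic data as an assumption rather than a tested improvement for this dataset, is to standardise the features inside a pipeline so the scaler is refit on each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the laptop features
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X[:, 0] *= 1000  # exaggerate one feature's scale, like a price column

# The scaler is fit only on the training portion of each fold
scaled_svm = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(scaled_svm, X, y, scoring="f1_weighted", cv=5)
print(scores.mean())
```

A pipeline like `scaled_svm` could be dropped into the `matchers` list in place of the plain `svm.SVC()` to compare the two.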

In [9]:
# Cross validation
scores = ['precision_weighted', 'recall_weighted', 'f1_weighted']
results = []
for i, matcher in enumerate(matchers):
    row = [matcher_names[i]]
    for score in scores:
        cv_scores = cross_val_score(matcher, X_train, y_train, scoring=score, cv=10)
        row.append(cv_scores.mean())
    results.append(row)
cols = ['Matcher', 'Weighted Precision', 'Weighted Recall', 'Weighted F1']
cross_val_results = pd.DataFrame(results, columns=cols)
cross_val_results.head()

Unnamed: 0,Matcher,Weighted Precision,Weighted Recall,Weighted F1
0,Random Forest,0.688366,0.68379,0.684263
1,Decision Tree,0.650491,0.650456,0.64569
2,SVM,0.620852,0.53256,0.52743
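
Calling `cross_val_score` once per metric refits every model three times. scikit-learn's `cross_validate` accepts a list of scorers and computes them all from a single set of fits. The sketch below shows the pattern on synthetic data; the dataset and the `n_estimators=50` setting are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic 3-class problem standing in for the laptop features
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

scoring = ["precision_weighted", "recall_weighted", "f1_weighted"]
cv_out = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, scoring=scoring, cv=5,
)

# Results are stored under "test_<scorer name>", one value per fold
summary = pd.Series({name: cv_out["test_" + name].mean() for name in scoring})
print(summary)
```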


The random forest classifier scores best on all three metrics (a weighted F1 of about 0.68, versus 0.65 for the decision tree and 0.53 for the SVM), so we will use it on the evaluation set.

In [10]:
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
# Work on a copy so that adding columns does not trigger pandas' SettingWithCopyWarning
results = X_test.copy()
results['Brand'] = y_test
results['Predictions'] = predictions
results.head()

Unnamed: 0,Price,Screen Size (Numeric),RAM (Numeric),Hard Drive Capacity (Numeric),Processor Speed (Numeric),Battery Life (Numeric),Brand,Predictions
381,242.98,14.1,6.0,500.0,2.4,7.562405,HP,Dell
2165,379.04,12.5,4.0,510.967361,0.8,7.562405,Acer,Acer
2877,1073.83,13.3,7.902245,256.0,2.6,7.562405,Apple,Apple
1366,480.99,15.6,8.0,1000.0,2.2,7.562405,Lenovo,Lenovo
3937,279.0,14.0,4.0,510.967361,2.177161,7.562405,HP,HP


In [11]:
print('Precision Score: ' + str(metrics.precision_score(y_test, predictions, average='weighted')))
print('Recall Score: ' + str(metrics.recall_score(y_test, predictions, average='weighted')))
print('F1 Score: ' + str(metrics.f1_score(y_test, predictions, average='weighted')))

Precision Score: 0.677709807418778
Recall Score: 0.6791546589817483
F1 Score: 0.6777767074410274
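
Weighted averages hide which brands are being confused with each other. A per-class view can be obtained with `metrics.confusion_matrix` and `metrics.classification_report`. The sketch below uses a handful of made-up labels purely to show the calls; in the notebook the arguments would be `y_test` and `predictions`.

```python
from sklearn import metrics

# Made-up true and predicted brands, for illustration only
y_true = ["HP", "HP", "Dell", "Apple", "Apple", "Dell"]
y_pred = ["HP", "Dell", "Dell", "Apple", "Apple", "HP"]
labels = ["Apple", "Dell", "HP"]

# Rows are true brands, columns are predicted brands, in `labels` order
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-brand precision, recall, and F1
print(metrics.classification_report(y_true, y_pred, labels=labels))
```

In this toy example the Apple row is entirely on the diagonal, while Dell and HP are each mistaken for the other once.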
