# MODELING PART 2: Supervised

Part 2 starts from the Part 1 dataframe, which now includes the cluster variance column ('var').
We will now run a classification model to identify new stores that would be good candidates for carrying and selling mezcal products.

### Steps:
1. Import libraries and review where we left off
2. Create the train/test split for classification
3. Standard-scale the new 'var' column
4. Random Forest regression, to see which ranking column (ordinal or total ranking) the model fits better
5. Create a classification target column
6. Classification models

### 1. Import libraries

In [3]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn import metrics

pd.set_option("display.max_rows", 999)

In [4]:
# load data
iowa = pd.read_csv('iowa_model_part_1.csv')

In [5]:
iowa.head(2)

Unnamed: 0,Store Name,Address,City,Zip Code,County,Category Name,Bottle Price Category,Category Name_BOURBON WHISKY,Category Name_BRANDY,Category Name_CANADIAN WHISKY,...,Sale (Dollars),Volume Sold (Liters),Profit / Item,Profit / Invoice,Profit / ml,Retail Price / ml,total_ranking,OE_BPC,OE_ranking,var
0,1st stop beverage shop,2839 East University Ave.,DES MOINES,50317,POLK,BOURBON WHISKY,Expensive,0.0,0.0,0.0,...,-0.162803,-0.161489,2.252836,-0.061991,0.44692,0.447108,1681.0,2.0,35.0,291.719966
1,1st stop beverage shop,2839 East University Ave.,DES MOINES,50317,POLK,BOURBON WHISKY,Expensive,0.0,0.0,0.0,...,-0.15689,-0.16095,2.4655,-0.046327,0.50736,0.507549,1681.0,2.0,35.0,291.719966


### 2. Train Test Split

In [6]:
# keep just the modeling columns
iowa_df = iowa.iloc[:,7:]

In [7]:
# create X and y for the two ranking targets

# dependent variables
total_y = iowa_df['total_ranking']
OE_y = iowa_df['OE_ranking']

# independent variables
iowa_X = iowa_df.drop(columns=['total_ranking', 'OE_ranking'])

#### Total ranking as the target

In [11]:
# Total ranking 
# split into a training and testing data sets.

total_X_train, total_X_test, total_y_train, total_y_test = train_test_split(iowa_X, total_y, test_size = 0.2, random_state=99)

#### Ordinal ranking as the target

In [12]:
# Ordinal Ranking
# split into a training and testing data sets.

OE_X_train, OE_X_test, OE_y_train, OE_y_test = train_test_split(iowa_X, OE_y, test_size = 0.2, random_state=99)

#### Create a reference X_test.

This lets us match the models' output back to the correct store names.

In [13]:
# create an original X_test to match results
# keep all settings the same, except use the original df with store names and rankings

_, reference_test, _, _ = train_test_split(iowa, OE_y, test_size = 0.2, random_state=99)
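
The reference split works because `train_test_split` shuffles by row position, so two calls with the same `test_size`, the same `random_state`, and inputs of equal length should pick the same rows. A minimal sketch of that check, using a small toy frame (the `full` and `features` names and their values are made up for illustration, standing in for `iowa` and `iowa_X`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-ins: `full` plays the role of `iowa`, `features` the role of `iowa_X`
full = pd.DataFrame({
    "store": list("abcdefghij"),
    "x": np.arange(10),
    "y": np.arange(10) % 2,
})
features = full[["x"]]

# same test_size and random_state, inputs of equal length
_, X_test_a, _, _ = train_test_split(features, full["y"], test_size=0.2, random_state=99)
_, X_test_b, _, _ = train_test_split(full, full["y"], test_size=0.2, random_state=99)

# the two test sets should contain the same rows in the same order
same_rows = X_test_a.index.equals(X_test_b.index)
print(same_rows)
```

If this check ever came back False on the real data, the store names in `reference_test` would no longer line up with the predictions.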

### 3. Standard Scaler

Most of the columns were scaled in Part 1 for clustering. We now need to scale our newest feature, 'var', which is the variance in store cluster labels.

In [14]:
# we need to scale the 'var' column

# assign the standard scaler to a variable
scaler = StandardScaler()

# fit the scaler ON ONLY THE TRAINING SET; the 'var' column is the same for both the OE and total splits
scaler.fit(total_X_train[['var']])

# run both the train and test data through the fitted scaler,
# overwriting the 'var' column in place

total_X_train['var'] = scaler.transform(total_X_train[['var']])

OE_X_train['var'] = scaler.transform(OE_X_train[['var']])

total_X_test['var'] = scaler.transform(total_X_test[['var']])

OE_X_test['var'] = scaler.transform(OE_X_test[['var']])
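
The scaling step can be sketched on toy data as follows. The `train` and `test` frames and their values are invented for illustration; the `.copy()` calls are an assumption on my part, added because assigning into a slice returned by `train_test_split` can raise a `SettingWithCopyWarning` in some pandas versions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy stand-ins for the train and test splits
train = pd.DataFrame({"var": [100.0, 200.0, 300.0, 400.0]})
test = pd.DataFrame({"var": [250.0, 500.0]})

# fit on the training data only, so no information from the test set leaks in
scaler = StandardScaler().fit(train[["var"]])

# work on explicit copies before overwriting the column
train = train.copy()
test = test.copy()
train["var"] = scaler.transform(train[["var"]]).ravel()
test["var"] = scaler.transform(test[["var"]]).ravel()

print(train["var"].mean(), test["var"].tolist())
```

The test column is scaled with the training mean and standard deviation, which is why its values are not centered on zero.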

### 4. Random Forest Regressor

In [15]:
# Random Forest Regressor for total ranking

clf = RandomForestRegressor(random_state=55)
clf.fit(total_X_train, total_y_train)

RandomForestRegressor(random_state=55)

In [16]:
y_pred = clf.predict(total_X_test)

In [17]:
# quick r2 metric, how good is the fit
clf.score(total_X_test, total_y_test)

0.9996813194241374

In [18]:
# store the results

# get the indexes to match
reference_test = reference_test.reset_index()

# create column to store the corresponding results
reference_test['total_pred'] = pd.DataFrame(y_pred)

In [19]:
reference_test[['Store Name', 'total_ranking', 'total_pred']]

Unnamed: 0,Store Name,total_ranking,total_pred
0,Costco wholesale #788 / wdm,66049.0,66049.0
1,Hy-vee wine and spirits / hubbell,66049.0,66049.0
2,Central city 2,6.5,6.5
3,Super quick mart / windsor heights,368.0,368.0
4,Family pantry,6228.0,6228.0
...,...,...,...
45006,Urban liquor,66049.0,66049.0
45007,Shop n save #2 / e 14th,66049.0,66049.0
45008,Fareway stores #167/johnston,66049.0,66049.0
45009,Kum & go #572 / urbandale,66049.0,66049.0


 - As expected, the r2 was almost 1, so the predictions closely match the actual rankings
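
With an r2 this close to 1, it can be worth looking at which columns the forest leans on. The sketch below is not part of the original analysis; it uses synthetic data (the `dominant`, `noise_a`, and `noise_b` columns are hypothetical) to show how `feature_importances_` could be inspected on a fitted regressor.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# synthetic data: the target depends on one column only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "dominant": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y = 3 * X["dominant"]

model = RandomForestRegressor(n_estimators=50, random_state=55).fit(X, y)

# rank the features by how much the forest relies on them
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the real data the same two lines would be applied to `clf` and `total_X_train.columns`.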

### Random Forest Regressor with OE (ordinal ranking)

In [20]:
# run the model with ordinal rankings
oe_clf = RandomForestRegressor(random_state=55)
oe_clf.fit(OE_X_train, OE_y_train)

RandomForestRegressor(random_state=55)

In [21]:
oe_y_pred = oe_clf.predict(OE_X_test)

In [22]:
# quick r2 metric, how good is the fit
oe_clf.score(OE_X_test, OE_y_test)

0.9994556783414396

- slightly lower r2 than with total_ranking

In [23]:
OE_test = pd.DataFrame(oe_y_pred)

In [24]:
reference_test['oe_pred'] = OE_test

In [25]:
reference_test

Unnamed: 0,index,Store Name,Address,City,Zip Code,County,Category Name,Bottle Price Category,Category Name_BOURBON WHISKY,Category Name_BRANDY,...,Profit / Item,Profit / Invoice,Profit / ml,Retail Price / ml,total_ranking,OE_BPC,OE_ranking,var,total_pred,oe_pred
0,31645,Costco wholesale #788 / wdm,7205 MILLS CIVIC PKWY,WEST DES MOINES,50266,DALLAS,SPECIAL PACKAGING,Normal,0.0,0.0,...,0.228513,10.085250,-0.178966,-0.178824,66049.0,1.0,67.0,541.867216,66049.0,67.0
1,105022,Hy-vee wine and spirits / hubbell,2310 HUBBELL AVE,DES MOINES,50317,POLK,SPECIAL PACKAGING,Inexpensive,0.0,1.0,...,-0.168181,-0.080368,-0.241143,-0.240990,66049.0,0.0,67.0,230.408432,66049.0,67.0
2,28056,Central city 2,1501 MICHIGAN AVE,DES MOINES,50314,POLK,SPECIAL PACKAGING,Normal,0.0,0.0,...,0.064743,0.138905,-0.171151,-0.170999,6.5,1.0,2.0,428.946708,6.5,2.0
3,188660,Super quick mart / windsor heights,7690 Hickman Rd,WINDSOR HEIGHTS,50324,POLK,VODKA,Normal,0.0,0.0,...,-0.078405,-0.175474,0.149682,0.149676,368.0,1.0,18.0,229.900051,368.0,18.0
4,43685,Family pantry,4538 LOWER BEAVER RD,DES MOINES,50310,POLK,GIN,Inexpensive,0.0,0.0,...,-0.336545,-0.194867,-0.261143,-0.261062,6228.0,0.0,66.0,181.283771,6228.0,66.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45006,199165,Urban liquor,6401 DOUGLAS AVE STE 1,URBANDALE,50322,POLK,TEQUILA,Normal,0.0,0.0,...,0.254777,-0.209156,-0.120937,-0.120781,66049.0,1.0,67.0,286.823802,66049.0,67.0
45007,176568,Shop n save #2 / e 14th,1372 E 14TH ST,DES MOINES,50316,POLK,BRANDY,Normal,0.0,0.0,...,0.178750,0.160676,-0.115625,-0.115520,66049.0,1.0,67.0,247.223484,66049.0,67.0
45008,50506,Fareway stores #167/johnston,6005 Merle Hay Road,JOHNSTON,50131,POLK,FLAVORED RUM,Normal,0.0,0.0,...,-0.093256,0.268970,0.016567,0.016525,66049.0,1.0,67.0,322.190172,66049.0,67.0
45009,137475,Kum & go #572 / urbandale,4860 NW URBANDALE DR,URBANDALE,50322,POLK,WHISKY,Inexpensive,0.0,0.0,...,-0.287463,-0.167740,-0.242007,-0.241814,66049.0,0.0,67.0,187.716864,66049.0,67.0


- there does not seem to be a meaningful difference in fit between total_ranking and the ordinal ranked version

### 5. Create a target classification column 

We will use each store's mezcal ranking to create classification labels

In [26]:
# determine bins
iowa['total_ranking'].value_counts().sort_index()

2.00          1773
6.00          1531
6.50          1653
21.00          819
25.00         1593
28.00         1343
44.00         1638
94.50         1676
153.00        1423
156.00         651
160.00        1776
161.50        1670
192.00        1546
220.50        1779
253.50         124
275.00         590
343.00        1612
367.50         103
368.00        2058
409.50        1632
418.50        1274
451.00        1750
500.50        1316
507.00         109
539.00        1498
653.25         179
656.00         343
697.00        1575
710.50        1947
820.00        1448
1107.00       2331
1250.50       1678
1373.50       1609
1558.00        775
1575.00       1282
1681.00       1429
1830.00         87
1860.50        486
1967.25        307
1968.00        702
2016.00       1150
2089.25         92
2214.00        475
2388.75       2457
2412.00       1073
2592.00       1575
2664.00       1048
2808.50        385
2835.00         62
2952.00       1414
3157.00        278
3281.25         53
3456.00     

In [27]:
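# create a classification for store types based on the rankings:
# 0 = no mezcal (ranking 66049), 1 = some mezcal (1500 < ranking < 66049), 2 = lots of mezcal / variety (everything else)
iowa['mezcal_sales'] = iowa['total_ranking'].apply(lambda x: 0 if x == 66049.0 else (1 if 1500 < x < 66049 else 2))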

In [28]:
# how does the split look?
iowa['mezcal_sales'].value_counts()

0    144843
2     44047
1     36163
Name: mezcal_sales, dtype: int64
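
The same labeling rule can be written without a row-by-row lambda. This is a sketch using `np.select` as a swap-in for `apply`, not the notebook's own code; the `NO_MEZCAL_RANK` name and the sample `rankings` are illustrative (1500.0 is included only to show the boundary).

```python
import numpy as np
import pandas as pd

NO_MEZCAL_RANK = 66049.0  # ranking value the notebook treats as "no mezcal"

# a few sample rankings, including the 1500 boundary
rankings = pd.Series([2.0, 1500.0, 1681.0, 3456.0, 66049.0])

# conditions are checked in order; anything unmatched falls through to class 2
labels = np.select(
    [rankings == NO_MEZCAL_RANK, (rankings > 1500) & (rankings < NO_MEZCAL_RANK)],
    [0, 1],
    default=2,
)
print(labels.tolist())  # [2, 2, 1, 1, 0]
```

Note that a ranking of exactly 1500 lands in class 2, matching the lambda's strict `> 1500` comparison.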

### 6. Random Forest Classification

Now that we have classification labels in the column 'mezcal_sales', we need to recreate the earlier splits using those labels as the target variable.

In [29]:
# create a y split with the new column; the same test_size and random_state keep the rows aligned with total_X_train / total_X_test

_, _, y_train, y_test = train_test_split(iowa_X, iowa['mezcal_sales'], test_size = 0.2, random_state=99)

In [30]:
# are the classes distributed similarly in train and test?
y_train.value_counts()

0    115856
2     35281
1     28905
Name: mezcal_sales, dtype: int64

In [31]:
y_test.value_counts()

0    28987
2     8766
1     7258
Name: mezcal_sales, dtype: int64

- the class proportions look fairly even between train and test
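
To put a number on "fairly even", the counts above can be turned into class shares. A small sketch using the printed counts (the variable names are mine):

```python
import pandas as pd

# class counts copied from the y_train and y_test outputs above
train_counts = pd.Series({0: 115856, 2: 35281, 1: 28905})
test_counts = pd.Series({0: 28987, 2: 8766, 1: 7258})

# convert counts to shares of each split
train_share = train_counts / train_counts.sum()
test_share = test_counts / test_counts.sum()

comparison = pd.DataFrame({"train": train_share, "test": test_share}).sort_index().round(4)
max_gap = (train_share - test_share).abs().max()
print(comparison)
print(max_gap)
```

If the shares had drifted apart, passing `stratify=` to `train_test_split` would be one way to force them to match.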

In [38]:
# create a model; the target variable is not balanced, with roughly a 4:1:1 ratio for classes 0:1:2
rfc = RandomForestClassifier(random_state=1, class_weight='balanced')

# check the model performance
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(rfc, total_X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

score_mean = np.mean(scores)
score_std = np.std(scores)
# cross val 
print(f'Average accuracy: {score_mean} Average STD: {score_std}')

Average accuracy: 0.9792252173323898 Average STD: 0.001042200508734727


In [39]:
# now that we are pleased with the cross-validated performance, fit on the training set and predict on the test set
rfc.fit(total_X_train, y_train)
mezcal_pred = rfc.predict(total_X_test)

In [40]:
rfc.score(total_X_test, y_test)

0.9820932660905112

In [41]:
print("=== Confusion Matrix ===")
print(metrics.confusion_matrix(y_test, mezcal_pred))

=== Confusion Matrix ===
[[28865    86    36]
 [  508  6674    76]
 [   91     9  8666]]


#### The top row contains the actual class 0 (no mezcal) rows, predicted as:
- 28865 were predicted correctly, 86 were predicted to sell some mezcal, and 36 were predicted to sell lots of mezcal
- the stores behind those 122 rows should be approached to sell mezcal

#### The middle row contains the actual class 1 (some mezcal) rows, predicted as:
- 508 were predicted to sell no mezcal, 6674 were correctly identified, and 76 were predicted to sell more or a greater variety
- the stores behind those 76 rows should be offered new products to sell

#### The bottom row contains the actual class 2 (lots of mezcal / variety) rows, predicted as:
- 91 were predicted to sell no mezcal, 9 were predicted to sell less, and 8666 were correct
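
The counts quoted above can be pulled straight from the matrix rather than read off by eye. A sketch using the printed matrix (rows are actual classes, columns are predicted classes; the variable names are mine):

```python
import numpy as np

# confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([
    [28865, 86, 36],
    [508, 6674, 76],
    [91, 9, 8666],
])

new_store_rows = int(cm[0, 1:].sum())  # actual 0, predicted 1 or 2
upsell_rows = int(cm[1, 2])            # actual 1, predicted 2
missed_rows = int(cm[1:, 0].sum())     # actual 1 or 2, predicted 0

print(new_store_rows, upsell_rows, missed_rows)  # 122 76 599
```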

In [42]:
print("=== Classification Report ===")
print(metrics.classification_report(y_test, mezcal_pred))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     28987
           1       0.99      0.92      0.95      7258
           2       0.99      0.99      0.99      8766

    accuracy                           0.98     45011
   macro avg       0.98      0.97      0.98     45011
weighted avg       0.98      0.98      0.98     45011



- Class 0 has a recall of 1.00 and a precision of 0.98, which tells me that class 0 is being over-selected.
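
The precision and recall figures in the report follow directly from the confusion matrix: recall divides each diagonal entry by its row total, and precision divides it by its column total. A short sketch with the matrix printed earlier:

```python
import numpy as np

# confusion matrix from the balanced model: rows = actual, columns = predicted
cm = np.array([
    [28865, 86, 36],
    [508, 6674, 76],
    [91, 9, 8666],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)     # share of each actual class that was found
precision = correct / cm.sum(axis=0)  # share of each predicted class that was right

print(np.round(recall, 2))     # [1.   0.92 0.99]
print(np.round(precision, 2))  # [0.98 0.99 0.99]
```

Class 0's column total (29464) is larger than its row total (28987), which is the over-selection described above.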

In [79]:
# a table for logging progress
# note: the 'r2' column stores rfc.score(), which for a classifier is mean accuracy rather than r2
columns = ['model', 'random state', 'class weight', 'r2', 'overall acc', '0: precision', '0: recall', '0: f1',
          '1: precision', '1: recall', '1: f1', '2: precision', '2: recall', '2: f1', 'wrong 0', 'wrong 1', 'wrong 2']

data = [['Random Forest Classifier', 1, 'balanced', 0.9821, 0.98,
        0.98, 1.00, 0.99,
        0.99, 0.92, 0.95,
        0.99, 0.99, 0.99,
        ('X', 86, 36), (508, 'X', 76),(91, 9, 'X')]]


### Task specific classification

Now that we have a model we can trust for its ability to understand the dataset, let's tune it to our needs. We are trying to find new customers for liquor products, specifically mezcal, so we would now like the model to over-select classes 1 and 2, the mezcal-selling categories. In other words, we are deliberately going to create false positives. These Type I errors represent stores that the model thinks should be selling mezcal but currently are not.

These stores are good candidates for future sales and market expansion. We want to make our model worse in order to lengthen the list of stores to approach with mezcal products.

We will attempt to make the model worse, but not much worse. We still need to be able to trust the results.
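
Class weighting is the route this notebook takes. As a point of comparison only, another common way to over-select classes 1 and 2 is to leave the model alone and require a high probability before predicting class 0. The sketch below uses synthetic data from `make_classification`; the `ZERO_THRESHOLD` value of 0.8 is an arbitrary assumption, not something tuned on the Iowa data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic 3-class data with class 0 as the majority
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.6, 0.2, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=99, stratify=y)

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
default_pred = model.predict(X_te)
proba = model.predict_proba(X_te)  # columns follow model.classes_, here 0, 1, 2

# only predict class 0 when the model is quite sure; otherwise take the better of 1 and 2
ZERO_THRESHOLD = 0.8  # hypothetical cut-off
best_nonzero = proba[:, 1:].argmax(axis=1) + 1
relaxed_pred = np.where(proba[:, 0] >= ZERO_THRESHOLD, 0, best_nonzero)

# actual class 0 rows flagged as 1 or 2, i.e. candidate stores
default_candidates = int(((default_pred != 0) & (y_te == 0)).sum())
relaxed_candidates = int(((relaxed_pred != 0) & (y_te == 0)).sum())
print(default_candidates, relaxed_candidates)
```

The relaxed rule can only add candidates, never remove them, because any row the default rule assigns to class 1 or 2 has a class 0 probability below 0.5.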

In [45]:
# now to make it worse!
# adjust the class weighting to try to favor class 1 and 2 selections,
# in order to capture more possible sales locations

rfc_sales = RandomForestClassifier(random_state=1, class_weight={0:4,1:2,2:1})
rfc_sales.fit(total_X_train, y_train)
mezcal_sales_pred = rfc_sales.predict(total_X_test)
rfc_sales.score(total_X_test, y_test)

0.9815822798871388

In [46]:
print("=== Confusion Matrix ===")
print(metrics.confusion_matrix(y_test, mezcal_sales_pred))

=== Confusion Matrix ===
[[28827    94    66]
 [  437  6699   122]
 [   97    13  8656]]


In [47]:
print("=== Classification Report ===")
print(metrics.classification_report(y_test, mezcal_sales_pred))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     28987
           1       0.98      0.92      0.95      7258
           2       0.98      0.99      0.98      8766

    accuracy                           0.98     45011
   macro avg       0.98      0.97      0.97     45011
weighted avg       0.98      0.98      0.98     45011



In [80]:
# log data
data.append(['Random Forest Classifier', 1, '0:4,1:2,2:1', 0.9815, 0.98, 0.98, 0.99, 0.99,
                   0.98, 0.92, 0.95,
                   0.98, 0.99, 0.98,
                    ('X', 94, 66), (437, 'X', 122), (97, 13, "X")])

- the '0:4,1:2,2:1' weighting gave us more stores to sell to, but it also increased the incorrect class 1 and 2 selections
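
Typing each log row by hand is easy to get wrong, so the row could be built from the predictions instead. A sketch, assuming a hypothetical helper called `log_row` (not part of the notebook) and a tiny made-up example; it records accuracy under its own name and keeps the ('X', n, n) layout for the wrong-prediction tuples.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

def log_row(model_name, random_state, class_weight, y_true, y_pred, labels=(0, 1, 2)):
    """Build one log entry from true and predicted labels."""
    labels = list(labels)
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0)
    row = {
        "model": model_name,
        "random state": random_state,
        "class weight": str(class_weight),
        "accuracy": round(accuracy_score(y_true, y_pred), 4),
    }
    for i, label in enumerate(labels):
        row[f"{label}: precision"] = round(precision[i], 2)
        row[f"{label}: recall"] = round(recall[i], 2)
        row[f"{label}: f1"] = round(f1[i], 2)
        # row i of the confusion matrix, with the correct count replaced by 'X'
        row[f"wrong {label}"] = tuple(
            "X" if j == i else int(cm[i, j]) for j in range(len(labels)))
    return row

# tiny made-up example
example = log_row("Random Forest Classifier", 1, "balanced",
                  [0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(example["wrong 0"], example["wrong 2"], example["accuracy"])
```

A list of these dictionaries can be passed to `pd.DataFrame` directly, without a separate column list.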

In [104]:
# second attempt
rfc_sales = RandomForestClassifier(random_state=1, class_weight={0:20,1:1,2:1})
rfc_sales.fit(total_X_train, y_train)
mezcal_sales_pred = rfc_sales.predict(total_X_test)
rfc_sales.score(total_X_test, y_test)

0.9828042034169425

In [105]:
print("=== Classification Report ===")
print(metrics.classification_report(y_test, mezcal_sales_pred))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     28987
           1       0.98      0.93      0.95      7258
           2       0.98      0.99      0.98      8766

    accuracy                           0.98     45011
   macro avg       0.98      0.97      0.98     45011
weighted avg       0.98      0.98      0.98     45011



In [106]:
print("=== Confusion Matrix ===")
print(metrics.confusion_matrix(y_test, mezcal_sales_pred))

=== Confusion Matrix ===
[[28808   120    59]
 [  386  6748   124]
 [   70    15  8681]]


In [81]:
# log data
data.append(['Random Forest Classifier', 1, '0:20,1:1,2:1', 0.9828, 0.98,
        0.98, 0.99, 0.99,
        0.98, 0.93, 0.95,
        0.98, 0.99, 0.98,
        ('X', 120, 59), (386, 'X', 124), (70, 15, 'X')])

- This model is interesting: accuracy (logged as r2) increased (good), incorrect 0 increased (good), incorrect 2 decreased (good), f1 decreased (bad)

In [75]:
# third attempt
rfc_sales = RandomForestClassifier(random_state=1, class_weight={0:40, 1:1, 2:1})
rfc_sales.fit(total_X_train, y_train)
mezcal_sales_pred = rfc_sales.predict(total_X_test)
rfc_sales.score(total_X_test, y_test)

0.9791384328275311

In [76]:
print("=== Classification Report ===")
print(metrics.classification_report(y_test, mezcal_sales_pred))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     28987
           1       0.98      0.92      0.95      7258
           2       0.97      0.98      0.98      8766

    accuracy                           0.98     45011
   macro avg       0.98      0.97      0.97     45011
weighted avg       0.98      0.98      0.98     45011



In [77]:
print("=== Confusion Matrix ===")
print(metrics.confusion_matrix(y_test, mezcal_sales_pred))

=== Confusion Matrix ===
[[28763   139    85]
 [  417  6681   160]
 [  121    17  8628]]


In [82]:
# log data
data.append(['Random Forest Classifier', 1, '(0:40, 1:1, 2:1)', 0.9791, 0.98,
        0.98, 0.99, 0.99,
        0.98, 0.92, 0.95,
        0.97, 0.98, 0.98,
        ('X', 139, 85), (417, 'X', 160), (121, 17, 'X')])

In [84]:
# fourth attempt
rfc_sales = RandomForestClassifier(random_state=1, class_weight={0:100, 1:2, 2:1})
rfc_sales.fit(total_X_train, y_train)
mezcal_sales_pred = rfc_sales.predict(total_X_test)
rfc_sales.score(total_X_test, y_test)

0.9810490768923152

In [85]:
print("=== Classification Report ===")
print(metrics.classification_report(y_test, mezcal_sales_pred))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     28987
           1       0.98      0.92      0.95      7258
           2       0.97      0.99      0.98      8766

    accuracy                           0.98     45011
   macro avg       0.98      0.97      0.97     45011
weighted avg       0.98      0.98      0.98     45011



In [86]:
print("=== Confusion Matrix ===")
print(metrics.confusion_matrix(y_test, mezcal_sales_pred))

=== Confusion Matrix ===
[[28794   116    77]
 [  396  6703   159]
 [   90    15  8661]]


In [87]:
# log data
data.append(['Random Forest Classifier', 1, '(0:100, 1:2, 2:1)', 0.9810, 0.98,
        0.98, 0.99, 0.99,
        0.98, 0.92, 0.95,
        0.97, 0.99, 0.98,
        ('X', 116, 77), (396, 'X', 159), (90, 15, 'X')])

In [88]:
# fifth attempt
rfc_sales = RandomForestClassifier(random_state=1, class_weight={0:200, 1:1, 2:1})
rfc_sales.fit(total_X_train, y_train)
mezcal_sales_pred = rfc_sales.predict(total_X_test)
rfc_sales.score(total_X_test, y_test)

0.97820532758659

In [89]:
print("=== Classification Report ===")
print(metrics.classification_report(y_test, mezcal_sales_pred))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     28987
           1       0.98      0.91      0.94      7258
           2       0.97      0.99      0.98      8766

    accuracy                           0.98     45011
   macro avg       0.98      0.96      0.97     45011
weighted avg       0.98      0.98      0.98     45011



In [90]:
print("=== Confusion Matrix ===")
print(metrics.confusion_matrix(y_test, mezcal_sales_pred))

=== Confusion Matrix ===
[[28768   123    96]
 [  477  6618   163]
 [  102    20  8644]]


In [91]:
# log data
data.append(['Random Forest Classifier', 1, '(0:200, 1:1, 2:1)', 0.9782, 0.98,
        0.98, 0.99, 0.99,
        0.98, 0.91, 0.94,
        0.97, 0.99, 0.98,
        ('X', 123, 96), (477, 'X', 163), (102, 20, 'X')])

In [123]:
# sixth attempt
rfc_sales = RandomForestClassifier(random_state=1, class_weight={0:400, 1:1, 2:1})
rfc_sales.fit(total_X_train, y_train)
mezcal_sales_pred = rfc_sales.predict(total_X_test)
rfc_sales.score(total_X_test, y_test)

0.9767612361422763

In [124]:
print("=== Classification Report ===")
print(metrics.classification_report(y_test, mezcal_sales_pred))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     28987
           1       0.98      0.91      0.94      7258
           2       0.97      0.98      0.98      8766

    accuracy                           0.98     45011
   macro avg       0.98      0.96      0.97     45011
weighted avg       0.98      0.98      0.98     45011



In [125]:
print("=== Confusion Matrix ===")
print(metrics.confusion_matrix(y_test, mezcal_sales_pred))

=== Confusion Matrix ===
[[28753   132   102]
 [  508  6595   155]
 [  129    20  8617]]


In [96]:
# log data
data.append(['Random Forest Classifier', 1, '(0:400, 1:1, 2:1)', 0.9767, 0.98,
        0.98, 0.99, 0.99,
        0.98, 0.91, 0.94,
        0.97, 0.98, 0.98,
        ('X', 132, 102), (508, 'X', 155), (129, 20, 'X')])

- Officially too broken: it is selecting 0 for too many class 1 and 2 rows
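
The repeated fit, predict, and log cells could also be run as one loop over the candidate weightings. The sketch below uses synthetic data so that it runs on its own; the weightings are a subset of those tried above, and the column names in `sweep` are my own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# synthetic 3-class data standing in for the Iowa features
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.6, 0.2, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=99, stratify=y)

weightings = ["balanced", {0: 4, 1: 2, 2: 1}, {0: 20, 1: 1, 2: 1}]
rows = []
for weights in weightings:
    model = RandomForestClassifier(n_estimators=50, random_state=1, class_weight=weights)
    model.fit(X_tr, y_tr)
    cm = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1, 2])
    rows.append({
        "class weight": str(weights),
        "accuracy": model.score(X_te, y_te),
        "0 predicted as 1 or 2": int(cm[0, 1:].sum()),  # candidate stores
        "1 or 2 predicted as 0": int(cm[1:, 0].sum()),  # missed sellers
    })

sweep = pd.DataFrame(rows)
print(sweep)
```

Keeping the two error counts side by side makes the trade-off between more candidates and more missed sellers easier to see.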

In [102]:
rfc_data = pd.DataFrame(data, columns=columns)

In [103]:
# the best so far is (0:20, 1:1, 2:1)
rfc_data

Unnamed: 0,model,random state,class weight,r2,overall acc,0: precision,0: recall,0: f1,1: precision,1: recall,1: f1,2: precision,2: recall,2: f1,wrong 0,wrong 1,wrong 2
0,Random Forest Classifier,1,balanced,0.9821,0.98,0.98,1.0,0.99,0.99,0.92,0.95,0.99,0.99,0.99,"(X, 86, 36)","(508, X, 76)","(91, 9, X)"
1,Random Forest Classifier,1,"0:4,1:2,2:1",0.9815,0.98,0.98,0.99,0.99,0.98,0.92,0.95,0.98,0.99,0.98,"(X, 94, 66)","(437, X, 122)","(97, 13, X)"
2,Random Forest Classifier,1,"0:20,1:1,2:1",0.9828,0.98,0.98,0.99,0.99,0.98,0.93,0.95,0.98,0.99,0.98,"(X, 120, 59)","(386, X, 124)","(70, 15, X)"
3,Random Forest Classifier,1,"(0:40, 1:1, 2:1)",0.9791,0.98,0.98,0.99,0.99,0.98,0.92,0.95,0.97,0.98,0.98,"(X, 139, 85)","(417, X, 160)","(121, 17, X)"
4,Random Forest Classifier,1,"(0:100, 1:2, 2:1)",0.981,0.98,0.98,0.99,0.99,0.98,0.92,0.95,0.97,0.99,0.98,"(X, 116, 77)","(396, X, 159)","(90, 15, X)"
5,Random Forest Classifier,1,"(0:200, 1:1, 2:1)",0.9782,0.98,0.98,0.99,0.99,0.98,0.91,0.94,0.97,0.99,0.98,"(X, 123, 96)","(477, X, 163)","(102, 20, X)"
6,Random Forest Classifier,1,"(0:400, 1:1, 2:1)",0.9767,0.98,0.98,0.99,0.99,0.98,0.91,0.94,0.97,0.98,0.98,"(X, 132, 102)","(508, X, 155)","(129, 20, X)"


In [107]:
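# the best results from rfc come from the 20:1:1 weighting: accuracy for classes 1 and 2 is still good, while incorrect 0 increased
# mezcal_sales_pred currently holds the 400:1:1 predictions, so refit the 20:1:1 model under its own name
best_rfc = RandomForestClassifier(random_state=1, class_weight={0:20, 1:1, 2:1}).fit(total_X_train, y_train)
reference_test['sales_pred'] = best_rfc.predict(total_X_test)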

In [108]:
# create a new df that is easier to read, and reflects prediction counts
# reference_test was the X_test with store name/ info

selling_stores = reference_test.groupby(['Store Name', 'City', 'total_ranking', 'sales_pred']).agg({'sales_pred':'size'})

In [109]:
# we groupby and agg the same column so we need to rename one in order to reset_index
selling_stores = selling_stores.rename(columns={'sales_pred':'count'}).reset_index()
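
The group-and-count step can also be written with `.size()` and `reset_index(name=...)`, which avoids the column name clash and the rename. A sketch on a toy frame (the store names and values are invented):

```python
import pandas as pd

# toy stand-in for reference_test
toy = pd.DataFrame({
    "Store Name": ["A", "A", "A", "B", "B"],
    "City": ["DES MOINES", "DES MOINES", "DES MOINES", "URBANDALE", "URBANDALE"],
    "total_ranking": [66049.0, 66049.0, 66049.0, 1681.0, 1681.0],
    "sales_pred": [0, 0, 1, 2, 2],
})

# one row per store / ranking / predicted class, with the number of rows in each group
counts = (
    toy.groupby(["Store Name", "City", "total_ranking", "sales_pred"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

Store A shows up twice here, once per predicted class, which is the same shape `selling_stores` has.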

In [110]:
selling_stores['total_ranking'] = selling_stores['total_ranking'].astype(int)

In [114]:
# filter for mezcal sales prediction possibilities
possibilities = selling_stores[(selling_stores['sales_pred'] != 0) & (selling_stores['total_ranking'] == 66049)]

- these stores do not sell mezcal but were deemed to have potential

In [119]:
possibilities2 = selling_stores[(selling_stores['sales_pred'] == 2) & (selling_stores['total_ranking'] > 1500 ) & (selling_stores['total_ranking'] < 66049 )]

- these stores sell mezcal but are considered good candidates to sell more

In [122]:
possibilities['Store Name'].nunique()

53

- 53 realistic clients

In [120]:
possibilities2['Store Name'].nunique()

16

- 16 stores to approach about increasing their variety of offerings 

In [126]:
# if we wanted to maximize possible stores with a slightly worse model (400:1:1 weighting)
# mezcal_sales_pred still holds the predictions from the last fitted model (0:400, 1:1, 2:1)

reference_test['sales_pred'] = pd.DataFrame(mezcal_sales_pred)

selling_stores = reference_test.groupby(['Store Name', 'City', 'total_ranking', 'sales_pred']).agg({'sales_pred':'size'})

selling_stores = selling_stores.rename(columns={'sales_pred':'count'}).reset_index()

selling_stores['total_ranking'] = selling_stores['total_ranking'].astype(int)

# filter for mezcal sales prediction possibilities
max_stores = selling_stores[(selling_stores['sales_pred'] != 0) & (selling_stores['total_ranking'] == 66049)]

max_stores2 = selling_stores[(selling_stores['sales_pred'] == 2) & (selling_stores['total_ranking'] > 1500 ) & (selling_stores['total_ranking'] < 66049 )]

In [131]:
zero_stores = pd.concat([possibilities['Store Name'], max_stores['Store Name']], ignore_index=True).reset_index()

In [134]:
zero_stores['Store Name'].nunique() - possibilities['Store Name'].nunique()

32

- There are 32 additional '0' stores to approach from the worse model 

In [135]:
one_stores = pd.concat([possibilities2['Store Name'], max_stores2['Store Name']], ignore_index=True).reset_index()

In [136]:
one_stores['Store Name'].nunique() - possibilities2['Store Name'].nunique()

12

- And there are 12 additional stores to approach about expanding their mezcal variety
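
The "additional stores" counts come from comparing unique names before and after concatenation. The same number falls out of a set difference, which also gives the names themselves. A sketch with made-up store names (`baseline` and `relaxed` stand in for `possibilities['Store Name']` and `max_stores['Store Name']`):

```python
import pandas as pd

# made-up store names
baseline = pd.Series(["A", "B", "B", "C"])
relaxed = pd.Series(["B", "C", "D", "E", "E"])

# stores flagged only by the relaxed model
additional = sorted(set(relaxed) - set(baseline))

# the notebook's approach: unique names in the combined list minus unique baseline names
combined_unique = pd.concat([baseline, relaxed], ignore_index=True).nunique()
extra_count = combined_unique - baseline.nunique()

print(additional, extra_count)  # ['D', 'E'] 2
```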