## Part II: Machine Learning and Business Use Case

Now your manager has a basic understanding of why customers returned orders. Next, he wants you to use machine learning to predict which orders are most likely to be returned. In this part, you will generate several features based on our previous findings and your manager's requirements.

### Problem 4: Feature Engineering
#### Step 1: Create the dependent variable
- First of all, we need to generate a categorical variable which indicates whether an order has been returned or not.
- ***Hint:*** the returned orders’ IDs are contained in the dataset “returns”


In [1]:
import pandas as pd

In [2]:
orders1 = pd.read_csv('../data/Orders.csv')
returns1 = pd.read_csv('../data/Returns.csv')
returns1.columns = ['Returns', 'Order.ID', 'Region']

In [3]:
new_orders = pd.merge(orders1, returns1, left_on='Order.ID', right_on='Order.ID', how='outer')
new_orders['Returns']=new_orders['Returns'].fillna('No')


#### Step 2:
- Your manager believes that **how long it took the order to ship** would affect whether the customer would return it or not. 
- He wants you to generate a feature which can measure how long it takes the company to process each order.
- ***Hint:*** Process.Time = Ship.Date - Order.Date


In [4]:
new_orders['Ship.Date']= pd.to_datetime(new_orders['Ship.Date'])
new_orders['Order.Date']= pd.to_datetime(new_orders['Order.Date'])
new_orders['Process.Time'] = new_orders['Ship.Date'].sub(new_orders['Order.Date'], axis=0)

#### Step 3:

- If a product has been returned before, it may be returned again. 
- Let us generate a feature that indicates how many times the product has been returned before.
- If it never got returned, we just impute using 0.
- ***Hint:*** Group by different Product.ID


In [5]:
# change settings to allow all the columns to be displayed
pd.set_option('display.max_columns', None)

In [6]:
# Temp table -> holds number counts of products returned
temp = pd.DataFrame(new_orders[new_orders['Returns']=='Yes'].groupby('Product.ID').size())

In [7]:
# Merge with temp table
new_orders = new_orders.merge(temp, on="Product.ID", how="outer" )

In [8]:
# Replace NaN with zero
new_orders[0].fillna(0, inplace=True)

In [9]:
# Change type of new column to Integer
new_orders[0] = new_orders[0].astype('int64')

In [10]:
# rename column "0" to "Times.Returned"
new_orders.rename(columns={0: "Times.Returned"}, inplace=True)

<br><br>
### Problem 5: Fitting Models

- You can use any binary classification method you have learned so far.
- Use 80/20 training and test splits to build your model. 
- Double check the column types before you fit the model.
- Only include useful features. i.e all the `ID`s should be excluded from your training set.
- Note that there are only less than 5% of the orders have been returned, so you should consider using the [createDataPartition](https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/createDataPartition) function from `caret` package and [StratifiedKfold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn-model-selection-stratifiedkfold) from sklearn when running cross-validation.
- Do forget to `set.seed()` before the spilt to make your result reproducible.
- **Note:** We are not looking for the best tuned model in the lab so don't spend too much time on grid search. Focus on model evaluation and the business use case of each model.


In [11]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [12]:
# Target Variable: replace Yes -> 1 and No -> 0
new_orders.Returns.replace({'Yes':1, 'No':0}, inplace=True)

In [13]:
# Create training set and test set
np.random.seed(0)
testIdxes = np.random.choice(range(new_orders.shape[0]), size= int(new_orders.shape[0] * .2), replace=False)
trainIdxes = list(set(range(new_orders.shape[0]))-set(testIdxes))
train_set = new_orders.iloc[trainIdxes]
test_set = new_orders.iloc[testIdxes]

In [14]:
train_set_encoded = pd.get_dummies(train_set[["Segment",'Sub.Category', 'Returns','Times.Returned']])
test_set_encoded = pd.get_dummies(test_set[["Segment",'Sub.Category', 'Returns', 'Times.Returned']])

In [15]:
y = train_set_encoded.Returns
X = train_set_encoded[train_set_encoded.columns[1:]]
y_test = test_set_encoded.Returns
X_test = test_set_encoded[test_set_encoded.columns[1:]]

In [16]:
# Logistic Regression
logistic = LogisticRegression(C=1e4, solver='lbfgs', multi_class='auto', class_weight="balanced")
logistic.fit(X,y)
print(logistic.score(X, y))
print(logistic.score(X_test, y_test))


0.711420354844999
0.7140768180931956


In [27]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
cclf = RandomForestClassifier(max_depth=21, random_state=0, class_weight="balanced")
cclf.fit(X, y)
print(cclf.score(X,y))
print(cclf.score(X_test,y_test))

0.5793283291089881
0.5742834860596607


### Problem 6: Evaluating Models
- What is the best metric to evaluate your model. Is accuracy good for this case?
- Now you have multiple models, which one would you pick? 
- Can you get any clue from the confusion matrix? What is the meaning of precision and recall in this case? Which one do you care the most? How will your model help the manager make decisions?
- **Note:** The last question is open-ended. Your answer could be completely different depending on your understanding of this business problem.

In [30]:
# AIC or BIC (best for descriptive models)
# ...
# confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, logistic.predict(X)))  # Logistic Regression
print( (28093+1098) / (28093+11189+652+1098) )
print(confusion_matrix(y, cclf.predict(X)))      # Random Forest
print( (22216+1555) / (22216+17066+195+1555) )
# initially the score was very high but that is because the training data was not balanced.
# Added the option 'balanced' to the LogisticRegression and that brought down the score to a more reasonable number.

[[28093 11189]
 [  652  1098]]
0.711420354844999
[[22216 17066]
 [  195  1555]]
0.5793283291089881






### Problem 7: Feature Engineering Revisit
- Is there anything wrong with the new feature we generated? How should we fix it?
- ***Hint***: For the real test set, we do not know it will get returned or not.

In [None]:
# ...