# Problem 1
Problem Statement:
A cloth manufacturing company wants to know which segments or attributes drive high sales.

Approach: build a Random Forest classifier with Sales as the target variable (first converted to a categorical variable) and all other variables as independent features.

## 1. Load necessary libraries

In [4]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix,  classification_report

from sklearn.feature_selection import SelectFromModel

In [5]:
%matplotlib notebook

## 2. Load data

In [6]:
company_df = pd.read_csv('Company_Data.csv')

In [7]:
company_df.head()

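|   | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.5 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
| 1 | 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
| 2 | 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
| 3 | 7.4 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes |
| 4 | 4.15 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No |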


## 3. EDA

### 3.1 Data understanding

In [8]:
company_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Sales        400 non-null    float64
 1   CompPrice    400 non-null    int64  
 2   Income       400 non-null    int64  
 3   Advertising  400 non-null    int64  
 4   Population   400 non-null    int64  
 5   Price        400 non-null    int64  
 6   ShelveLoc    400 non-null    object 
 7   Age          400 non-null    int64  
 8   Education    400 non-null    int64  
 9   Urban        400 non-null    object 
 10  US           400 non-null    object 
dtypes: float64(1), int64(7), object(3)
memory usage: 34.5+ KB


In [9]:
company_df['ShelveLoc'].unique() # Ordinal encode, custom cat

array(['Bad', 'Good', 'Medium'], dtype=object)

In [10]:
company_df['Education'].unique() 

array([17, 10, 12, 14, 13, 16, 15, 18, 11], dtype=int64)

In [11]:
company_df['Urban'].unique()  # Ordinal encode, custom cat

array(['Yes', 'No'], dtype=object)

In [12]:
company_df['US'].unique()  # Ordinal encode, custom cat

array(['Yes', 'No'], dtype=object)
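
The "Ordinal encode, custom cat" notes above suggest an explicit category order. A minimal sketch of how that could look, assuming the order Bad < Medium < Good and No < Yes (the toy frame and the chosen order are illustrative assumptions, not part of the original analysis):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame reusing the category labels seen above (rows are illustrative).
toy = pd.DataFrame({
    "ShelveLoc": ["Bad", "Good", "Medium"],
    "Urban": ["Yes", "No", "Yes"],
    "US": ["Yes", "No", "No"],
})

# Assumed ordering: Bad < Medium < Good, and No < Yes for the binary columns.
ordered_enc = OrdinalEncoder(categories=[
    ["Bad", "Medium", "Good"],
    ["No", "Yes"],
    ["No", "Yes"],
])
encoded = ordered_enc.fit_transform(toy)
print(encoded)

# For comparison: without `categories`, each column's labels are sorted,
# so ShelveLoc would be encoded as Bad=0, Good=1, Medium=2.
default_enc = OrdinalEncoder().fit(toy[["ShelveLoc"]])
print(default_enc.categories_)
```

Tree ensembles can usually cope with an arbitrary integer order, so this is mostly a matter of keeping the encoded values interpretable.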

### 3.2 Separating data into features and target

In [13]:
# Extracting column names and sorting them to appropriate categories.
def column_segregator(df, y_name=None):
    """ Returns  three lists of column headers for feature columns, 
    numeric columns and categorical columns
    Input
    ------
    df: Dataframe
    y_name: default None. Name(str) of target column if available
    
    Output
    ------
    features, numeric_cols, cat_cols"""   
    
    cols = df.columns # List of all columns in the input dataframe.
    numeric_cols = [col for col in cols if (df[col].dtypes != 'object') and col != y_name]
    cat_cols = [col for col in cols if (df[col].dtypes == 'object') and col != y_name]
    features = [col for col in cols if col != y_name]
    
    return features, numeric_cols, cat_cols 

In [14]:
def Xy_split(df, y_name=None, y_col=True):
    """Splits the input dataframe into features and target
    input
    -----
    df: Input dataframe
    y_name: default None. Name(str) of target column if available
    y_col: 'True' if y column is present in input dataframe, else 'False'.
    
    output
    ------
    X (features), y (target) if the y column is present, else only X"""
    
    target = y_name
    feature_col,_,_ = column_segregator(df, target)
    if y_col == True:
        # separating features and target.
        X = df.loc[:, feature_col]
        y = df.loc[:, target]
        return X,y
    else:
        X = df.loc[:, feature_col]
        return X

In [15]:
# Column segregation.
features, numeric_cols, cat_cols = column_segregator(company_df, y_name='Sales')

In [16]:
# Splitting features and target.
X, y = Xy_split(company_df, y_name='Sales', y_col=True)
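
As a possible alternative to the dtype comparisons in `column_segregator`, pandas' `select_dtypes` can produce the same three lists. This is only a sketch on a small made-up frame; it should also work when text columns use a dedicated string dtype instead of `object`:

```python
import pandas as pd

# Small stand-in for company_df (values are illustrative).
toy = pd.DataFrame({
    "Sales": [9.5, 11.22],
    "Price": [120, 83],
    "ShelveLoc": ["Bad", "Good"],
    "Urban": ["Yes", "Yes"],
})

# Drop the target, then split the remaining columns by dtype.
feature_frame = toy.drop(columns="Sales")
features = feature_frame.columns.tolist()
numeric_cols = feature_frame.select_dtypes(include="number").columns.tolist()
cat_cols = feature_frame.select_dtypes(exclude="number").columns.tolist()

print(features, numeric_cols, cat_cols)
```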

### 3.3 Summary statistics:
a) Numeric features

In [17]:
X[numeric_cols].describe()

|   | CompPrice | Income | Advertising | Population | Price | Age | Education |
|---|---|---|---|---|---|---|---|
| count | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 |
| mean | 124.975 | 68.6575 | 6.635 | 264.84 | 115.795 | 53.3225 | 13.9 |
| std | 15.334512 | 27.986037 | 6.650364 | 147.376436 | 23.676664 | 16.200297 | 2.620528 |
| min | 77.0 | 21.0 | 0.0 | 10.0 | 24.0 | 25.0 | 10.0 |
| 25% | 115.0 | 42.75 | 0.0 | 139.0 | 100.0 | 39.75 | 12.0 |
| 50% | 125.0 | 69.0 | 5.0 | 272.0 | 117.0 | 54.5 | 14.0 |
| 75% | 135.0 | 91.0 | 12.0 | 398.5 | 131.0 | 66.0 | 16.0 |
| max | 175.0 | 120.0 | 29.0 | 509.0 | 191.0 | 80.0 | 18.0 |


b) Categorical features

In [18]:
X[cat_cols].describe()

|   | ShelveLoc | Urban | US |
|---|---|---|---|
| count | 400 | 400 | 400 |
| unique | 3 | 2 | 2 |
| top | Medium | Yes | Yes |
| freq | 219 | 282 | 258 |


c) Target 

In [19]:
y.describe()

count    400.000000
mean       7.496325
std        2.824115
min        0.000000
25%        5.390000
50%        7.490000
75%        9.320000
max       16.270000
Name: Sales, dtype: float64

### 3.4 Visualizations:
### 3.4.1 Feature distribution - numeric features:

In [20]:
fig, axes = plt.subplots(2,4, figsize=(10,5))
fig.delaxes(axes[1,3]) 
axes = axes.flatten()

for idx, ax in enumerate(axes):
    if idx <7: #since there are only 7 numeric features
        sns.histplot(data=X, x=X[numeric_cols[idx]], ax=ax) 


fig.suptitle('Feature Distribution - numeric features', ha='center', fontweight='bold')
fig.tight_layout()
plt.show()

### 3.4.2 Feature distribution - categorical features:

In [21]:
fig, axes = plt.subplots(1,3, figsize=(10,4))
axes = axes.flatten()

for idx, ax in enumerate(axes):
    sns.countplot(data=X, x=X[cat_cols[idx]], ax=ax) 

fig.suptitle('Feature Distribution - categorical features', ha='center', fontweight='bold')
fig.tight_layout()
plt.show()

### 3.4.3 Target distribution

In [22]:
_, axes = plt.subplots(1,2,sharey=False, figsize=(10,4))
sns.histplot(data=y,ax=axes[0])
sns.boxplot(data=y, x=y, color='#b790ee', saturation=0.9, ax=axes[1])
sns.swarmplot(data=y, x=y, ax=axes[1],color='#1ec51e' )
plt.suptitle('Target distribution', ha='center', fontweight='bold')
plt.show()

# color ref: https://www.colorhexa.com/b790ee

### 3.4.4 Categorizing sales and the resulting distribution

In [24]:
#sales_class = y.apply(lambda x: 'high' if (x > 5.3) else 'low')
sales_class = pd.Series(y.apply(lambda x: 1 if (x > 7.49) else 0),index=X.index, name='sale_cat')

In [25]:
sales_class.unique()

array([1, 0], dtype=int64)

In [26]:
fig, ax = plt.subplots(figsize=(4,4))
sns.countplot(x=sales_class, ax=ax) 
ax.set_xlabel('sales category \n 0-low sales \n 1-high sales')
ax.set_title('Sales distribution based on high and low sales')
plt.show()

### 3.5 Observations:
- Product made by cloth manufacturing company: **car seats**
- 400 records, 10 features and 1 target ('Sales')
- No null values.
- All features recorded with correct datatypes, but object datatypes need to be converted to numeric for the analysis.
- Among the numeric features, Price and CompPrice are approximately normally distributed, Advertising is right-skewed, and the rest are roughly uniform (mostly platykurtic when represented with a KDE plot).
- All the categorical features have imbalanced class distributions.
- From the combined plots for the target distribution (histogram and boxplot) and the quantiles, we see that sales in the range of ~5K - 9K units occurred more than 40 times, while sales below and above this range occurred about 20 - 25 times.
- The sales distribution is symmetric.
- Thus we can categorize sales as follows:
    - below 7.49K units -> low sales (below the 50th percentile)
    - above 7.49K units -> high sales.
    
  This distribution is plotted as a count plot above. 
  The classification is a simple heuristic: anything above the median sales (7.49K units, which nearly coincides with the mean of 7.50K) is treated as high sales.
- Since the sales distribution appears approximately normal (see above), if we find the factors that contribute to high sales, we can suggest that the company work on those areas to shift the distribution to the right, thereby increasing sales.

    
**Notes:**
- shelving location - storage location.
- Since the product is car seats, a minimum age of 25 is plausible, since it is mostly adults who own a car or make car-related purchases.
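
The median-based categorization described in the observations can be sketched as follows. The sample here is just the first five Sales values shown by `head()`, so its median differs from the 7.49 used on the full data; the variable names are illustrative:

```python
import pandas as pd

# First five Sales values from the head() output (toy sample only).
sales = pd.Series([9.5, 11.22, 10.06, 7.4, 4.15], name="Sales")

# Use the sample median as the cut-off; values equal to it count as low sales.
threshold = sales.median()
sale_cat = (sales > threshold).astype(int).rename("sale_cat")

print(threshold)
print(sale_cat.value_counts().sort_index())
```

Deriving the cut-off from `median()` rather than hard-coding it keeps the two classes roughly balanced if the data changes.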


## 4. Model building

### 4.1 Constructing a new dataset based on the sale category defined above.

In [27]:
company_sales = pd.concat([X, sales_class], axis=1)

In [28]:
company_sales.head(5)

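|   | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US | sale_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes | 1 |
| 1 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes | 1 |
| 2 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes | 1 |
| 3 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes | 0 |
| 4 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No | 0 |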


In [29]:
features, numeric_cols, cat_cols= column_segregator(company_sales, y_name='sale_cat')

In [30]:
X_1, y_1 = Xy_split(company_sales, y_name='sale_cat', y_col=True)

In [31]:
X_1.head()

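|   | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
| 1 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
| 2 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
| 3 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes |
| 4 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No |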


### 4.2 Baseline model with cross validation

In [32]:
# Processing categorical variables
cat_transformer = Pipeline(steps=[
     ('enc', OrdinalEncoder())
])

In [33]:
preprocessor = ColumnTransformer(transformers=[
     ('cat_trf', cat_transformer, cat_cols)    
 ], remainder='passthrough')

In [34]:
rf_classifier = RandomForestClassifier(random_state=42)

In [35]:
# Random forest classifier.
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', rf_classifier)
])

In [36]:
# K fold cross validation.
crv_scores_acc = cross_val_score(estimator=clf, X=X_1, y=y_1, cv=5, scoring='accuracy')
crv_scores_prec = cross_val_score(estimator=clf, X=X_1, y=y_1, cv=5, scoring='precision')
crv_scores_rec = cross_val_score(estimator=clf, X=X_1, y=y_1, cv=5, scoring='recall')

In [37]:
# Summarizing scores:
print("mean accuracy  :{:.4f} ".format(crv_scores_acc.mean()))
print("mean precision :{:.4f} ".format(crv_scores_prec.mean()))
print("mean recall    :{:.4f} ".format(crv_scores_rec.mean()))

mean accuracy  :0.7975 
mean precision :0.8233 
mean recall    :0.7542 
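
The three `cross_val_score` calls above refit the model once per metric. A possible shortcut is `cross_validate`, which scores several metrics from a single set of fits. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_1` and `y_1`, so its scores will not match the ones reported above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in: 400 rows, 10 numeric features, binary target.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

# One pass of 5-fold CV, scored on three metrics at once.
cv_results = cross_validate(
    RandomForestClassifier(random_state=42),
    X_demo, y_demo, cv=5,
    scoring=["accuracy", "precision", "recall"],
)

for metric in ("accuracy", "precision", "recall"):
    fold_scores = cv_results["test_" + metric]
    print(f"mean {metric:<9}: {fold_scores.mean():.4f}")
```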


### 4.3 Hyper parameter optimization using grid search CV

In [38]:
clf.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'classifier', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__cat_trf', 'preprocessor__cat_trf__memory', 'preprocessor__cat_trf__steps', 'preprocessor__cat_trf__verbose', 'preprocessor__cat_trf__enc', 'preprocessor__cat_trf__enc__categories', 'preprocessor__cat_trf__enc__dtype', 'preprocessor__cat_trf__enc__handle_unknown', 'preprocessor__cat_trf__enc__unknown_value', 'classifier__bootstrap', 'classifier__ccp_alpha', 'classifier__class_weight', 'classifier__criterion', 'classifier__max_depth', 'classifier__max_features', 'classifier__max_leaf_nodes', 'classifier__max_samples', 'classifier__min_impurity_decrease', 'classifier__min_impurity_split', 'classifier__min_samples_leaf', 'classifier__min_samples_split', 'classifier__min_weight_fraction_leaf', 'classifier__n_estimators', 'classifier

In [39]:
# parameters = {'classifier__n_estimators':[128, 256],
#              'classifier__min_samples_split':[2, 3, 5],
#              'classifier__min_samples_leaf':[1, 2, 3],
#              'classifier__criterion':['gini', 'entropy'],
#              'classifier__max_features':['sqrt']}

In [40]:
#Grid search cv for finding out optimal parameters.
#grid_search = GridSearchCV(estimator=clf,param_grid =parameters, cv=5)

In [41]:
#grid_search.fit(X_1, y_1)

In [42]:
#grid_search.best_params_

In [43]:
#grid_search.best_score_

### 4.4 Observations:
- From grid search CV, we find the following parameters to be optimal for the random forest classifier built above.
    - 'criterion': 'gini',
    - 'max_features': 'sqrt',
    - 'min_samples_leaf': 3,
    - 'min_samples_split': 2,
    - 'n_estimators': 128
    
  We will use these to build the final model.
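
For reference, a grid search of this kind can be sketched on synthetic data as below. The grid is deliberately smaller than the one commented out above, and the data comes from `make_classification`, so the selected parameters are not expected to match the ones listed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Same step name as in the notebook, so parameters use the `classifier__` prefix.
pipe = Pipeline(steps=[("classifier", RandomForestClassifier(random_state=42))])

# Reduced, illustrative grid to keep the search quick.
param_grid = {
    "classifier__n_estimators": [32, 64],
    "classifier__min_samples_leaf": [1, 3],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

`best_score_` is the mean cross-validated score of the best parameter combination, which is a different quantity from the score on a single held-out test split.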

### 4.5 Final model

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X_1,y_1, test_size=0.2, random_state=42, stratify=y_1)

In [54]:
# Constructing another random forest from the parameters found above.
rf_clf_updated = RandomForestClassifier(criterion='gini',
                                        max_features='sqrt',
                                        min_samples_leaf=3,
                                        min_samples_split=2,
                                        n_estimators=128,
                                        random_state=42)

In [56]:
# # Constructing another decision tree from parameters found above.
# rf_clf_updated = RandomForestClassifier(criterion='entropy',
#                                         max_features='sqrt',
#                                         min_samples_leaf=1,
#                                         min_samples_split=4,
#                                         n_estimators=512,
#                                         random_state=42)#0k

In [57]:
# Random forest classifier pipeline.
clf_updated = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', rf_clf_updated)
])

In [58]:
clf_updated.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat_trf',
                                                  Pipeline(steps=[('enc',
                                                                   OrdinalEncoder())]),
                                                  ['ShelveLoc', 'Urban',
                                                   'US'])])),
                ('classifier',
                 RandomForestClassifier(max_features='sqrt', min_samples_leaf=3,
                                        n_estimators=128, random_state=42))])

## 5. Model testing

In [59]:
y_pred = clf_updated.predict(X_test)#clf_updated

## 6. Model Evaluation

In [60]:
def display_results(y_test, y_pred):
    """Displays model evaluation/performance report that includes
    accuracy_score, confusion_matrix, precision_score, and 
    recall_score.
    input
    -----
    y_test, y_pred
    
    output
    ------
    Model evaluation/performance report"""
    
    print(classification_report(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

In [61]:
display_results(y_test, y_pred)

              precision    recall  f1-score   support

           0       0.80      0.90      0.85        40
           1       0.89      0.78      0.83        40

    accuracy                           0.84        80
   macro avg       0.84      0.84      0.84        80
weighted avg       0.84      0.84      0.84        80

Confusion matrix:
 [[36  4]
 [ 9 31]]
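
The figures in the report can be reproduced by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1, high sales, as the positive class:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
(tn, fp), (fn, tp) = [[36, 4], [9, 31]]

precision_1 = tp / (tp + fp)   # 31 / 35
recall_1 = tp / (tp + fn)      # 31 / 40
precision_0 = tn / (tn + fn)   # 36 / 45
recall_0 = tn / (tn + fp)      # 36 / 40
accuracy = (tp + tn) / (tn + fp + fn + tp)   # 67 / 80

print(f"class 1: precision={precision_1:.4f}, recall={recall_1:.4f}")
print(f"class 0: precision={precision_0:.4f}, recall={recall_0:.4f}")
print(f"accuracy={accuracy:.4f}")
```

Read this way, the 4 in the top row counts false positives (low sales predicted as high) and the 9 in the bottom row counts false negatives (high sales predicted as low).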


### 6.1 Feature importance:
To find out the important features that contribute to the sales.

In [62]:
importance = clf_updated.named_steps['classifier'].feature_importances_

In [63]:
# summarize feature importance
feature_score = dict.fromkeys(cat_cols + numeric_cols) #After preprocessing, the columns will be reordered.
for i,v in enumerate(importance):
    #print('Feature:{} ,Score: {:.5f}'.format(features[i],v))
    feature_score[(cat_cols + numeric_cols)[i]] = round(v,5)

In [64]:
pd.Series(feature_score)

ShelveLoc      0.09292
Urban          0.01150
US             0.01248
CompPrice      0.13149
Income         0.09597
Advertising    0.09165
Population     0.07823
Price          0.31031
Age            0.13259
Education      0.04287
dtype: float64

In [65]:
names = list(feature_score.keys())

In [66]:
values = list(feature_score.values())

In [67]:
fig, ax = plt.subplots()
ax.barh(names, values)
ax.set_xlabel('contribution')
ax.set_ylabel('Features')
ax.set_title('Feature importance')
plt.show()

Important features (in descending order of importance) that contribute to sales are:

    1. Price
    2. Age
    3. CompPrice
    4. Income, ShelveLoc, Advertising
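
Rather than reading the ranking off the bar chart, the scores can be sorted directly. The sketch below retypes the values from the `pd.Series(feature_score)` output above; in the notebook the existing `feature_score` dictionary could be used instead:

```python
import pandas as pd

# Importance scores copied from the output above.
feature_score = pd.Series({
    "ShelveLoc": 0.09292, "Urban": 0.01150, "US": 0.01248,
    "CompPrice": 0.13149, "Income": 0.09597, "Advertising": 0.09165,
    "Population": 0.07823, "Price": 0.31031, "Age": 0.13259,
    "Education": 0.04287,
})

# Sort from most to least important and show the top six.
ranked = feature_score.sort_values(ascending=False)
print(ranked.head(6))
```

Several of these scores are very close to each other (for example Age and CompPrice), so the order among them may change with a different random seed or train/test split.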

## 7. Observations:
- From the above analysis, we get a reasonable accuracy of 0.8175 after grid search CV and 0.84 when we apply the hyperparameters to a simple train/test split. However, the precision and recall scores differ by ~10-11 percentage points within each class (0.80 vs 0.90 for class 0, 0.89 vs 0.78 for class 1). Further improvements are needed to bring these two scores closer together.
- In this case, it is **OK to have false negatives** (high sales classified as low sales) but **bad to have many false positives**, since that would **mean we identified a set of observations corresponding to low sales as high**. This is **bad for the company**, which wants to maximize its sales.
- The **model** is **doing a good job predicting the 0 class** (only 4 misclassifications in the test set, i.e. **few false positives**). Its recall score for class 0 is 0.90. This model could be tried out for predictions and checked for similar performance with new data.
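
If false positives are the costlier error, one option is to require a higher predicted probability before labelling an observation as high sales. The sketch below illustrates the idea on synthetic data from `make_classification`; the 0.7 threshold is an arbitrary illustrative choice, not a tuned value:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

model = RandomForestClassifier(n_estimators=128, min_samples_leaf=3, random_state=42)
model.fit(X_tr, y_tr)

# Predicted probability of class 1 (high sales) for each test row.
proba_high = model.predict_proba(X_te)[:, 1]

fp_by_threshold = {}
for threshold in (0.5, 0.7):
    y_hat = (proba_high >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
    fp_by_threshold[threshold] = fp
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

A stricter threshold can only keep or reduce the number of false positives, usually at the cost of more false negatives, which matches the trade-off described above.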

## 8. Conclusions:
- A random forest model was applied to a company's sales dataset and an attempt was made to infer factors contributing to high sales. **Since the recall score for the low-sales class is high (few false positives, i.e. little chance of predicting low sales as high)**, we can deploy the model, check its performance, and update it accordingly.
- If the company focuses on the features listed in section 6.1, it has a good chance of improving its sales.