# Alison's approach to regression

## Cross-Industry Standard Process for Data Mining (CRISP-DM)

Before we dig into the problem, let's refresh our memories on the steps in the CRISP-DM model.

<img src="img/new_crisp-dm.png" width="500">

### The Data

<img src="img/grocery-cart.jpg" width="500">

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities.

The data is located in the CSV file `big_mart.csv`.

### Step 1: Business Understanding

We previously explored the features of the big_mart dataset, but now BigMart wants us to answer the following question:

**The sales team at BigMart wants to know which products have high sales. They ask you to help them predict if sales of each product at a particular store will be high or low using the big_mart dataset.**

In this notebook we treat the question as a regression problem: we predict the continuous `Item_Outlet_Sales` value rather than a high/low label.


### Step 2: Data Understanding

We have already done a great deal of exploratory data analysis of this dataset. Let's refresh our memory on what the original data contained.

<img src="img/big_mart_data_variables.png" width="500">

### Step 3: Data Preparation

This step has already been done for you.  The following steps were taken and are reflected in the `big_mart_clean.csv` file.

- Imputed missing values for `Outlet_Size` (replaced missing with the mode)
- Imputed missing values for `Item_Weight` (replaced missing with the average)
- Cleaned typos in `Item_Fat_Content`
- Created new variable `Item_Type_Combined` (labels items as food, non-consumable, or drinks)
- Created new variable `Outlet_Years` (the years of operation of a store)
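
As a rough illustration of the first two bullets, here is a minimal sketch of mode and mean imputation with pandas. The toy values are invented for illustration, and this is not necessarily the exact code used to build `big_mart_clean.csv`.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) with the two columns that needed imputation
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", np.nan, "Small", "Medium"],
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
})

clean = toy.copy()
# Categorical column: fill missing entries with the most frequent value (the mode)
clean["Outlet_Size"] = clean["Outlet_Size"].fillna(clean["Outlet_Size"].mode()[0])
# Numeric column: fill missing entries with the column average
clean["Item_Weight"] = clean["Item_Weight"].fillna(clean["Item_Weight"].mean())

print(clean)
```

The mode is a natural choice for a categorical column because an average is not defined for labels, while the mean keeps the overall level of a numeric column unchanged.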


#### Before we begin modeling, let's do a quick exploration of our new dataset!

In [34]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (LabelEncoder, StandardScaler, OneHotEncoder,
                                   OrdinalEncoder, PolynomialFeatures)

%matplotlib inline

In [9]:
#explore the dataset
big_mart = pd.read_csv('big_mart_clean.csv')
big_mart_orig = pd.read_csv('big_mart.csv')

In [13]:
type(big_mart.drop(columns=['Item_Sales_Cat']))

pandas.core.frame.DataFrame

In [16]:
big_mart_reg = pd.concat([big_mart.drop(columns=['Item_Sales_Cat']), big_mart_orig['Item_Outlet_Sales']], axis=1)

In [17]:
big_mart_reg.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Size',
       'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined',
       'Outlet_Years', 'Item_Outlet_Sales'],
      dtype='object')

In [18]:
big_mart_reg.shape

(8523, 13)

In [20]:
#let's look at the target variable (Item_Outlet_Sales) a little more in depth
big_mart_reg.Item_Outlet_Sales


0       3735.1380
1        443.4228
2       2097.2700
3        732.3800
4        994.7052
          ...    
8518    2778.3834
8519     549.2850
8520    1193.1136
8521    1845.5976
8522     765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64

### Step 4: Modeling

Once we have clean data, we can begin modeling! Remember, modeling, as with any of these other steps, is an iterative process. During this stage, we'll try to build and tune models to get the highest performance possible on our task.

Consider the following questions during the modeling step:

- Is this a classification task? A regression task? Something else?
- What models will we try?
- How do we deal with overfitting?
- Do we need to use regularization or not?
- What sort of validation strategy will we be using to check that our model works well on unseen data?
- What loss functions will we use?
- What threshold of performance do we consider as successful?
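
One possible answer to the validation question is k-fold cross-validation. The sketch below uses synthetic data from `make_regression` as a stand-in for the real training set (an assumption made only for illustration) and compares a mean-only baseline with a linear model using 5-fold cross-validated R².

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the real training set
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# 5-fold cross-validated R^2 for a mean-only baseline and a linear model
baseline_scores = cross_val_score(DummyRegressor(strategy="mean"),
                                  X_demo, y_demo, cv=5, scoring="r2")
linear_scores = cross_val_score(LinearRegression(),
                                X_demo, y_demo, cv=5, scoring="r2")

print("baseline mean R^2:", baseline_scores.mean())
print("linear   mean R^2:", linear_scores.mean())
```

Comparing against the baseline on held-out folds gives a concrete reference point when deciding what threshold of performance counts as successful.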



#### Data preparation

Before we begin modeling let's split our data and then perform encoding/scaling steps.
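
The order matters because the scaler and encoders should learn their parameters from the training rows only and then be applied unchanged to the test rows. Here is a minimal sketch on an invented two-column toy frame (the column names `price` and `size` are hypothetical, not BigMart columns):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data: one numeric and one categorical column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "price": rng.normal(loc=100.0, scale=20.0, size=200),
    "size": rng.choice(["Small", "Medium", "High"], size=200),
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

toy_prep = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["price"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["size"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Learn means, standard deviations, and categories from the training rows only...
toy_train_t = toy_prep.fit_transform(toy_train)
# ...then reuse those learned parameters on the test rows
toy_test_t = toy_prep.transform(toy_test)

print(toy_train_t.shape, toy_test_t.shape)
```

Calling `transform` (not `fit_transform`) on the test set keeps both matrices in the same column layout and on the same scale, and avoids leaking test-set statistics into preprocessing.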

In [21]:
#set random state for our notebook
import numpy as np
np.random.seed(217)

y = big_mart_reg['Item_Outlet_Sales']
X = big_mart_reg.drop(columns=['Item_Outlet_Sales', 'Item_Identifier'])

#split data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25)


In [22]:
#get shape of the training and test sets
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((6392, 11), (6392,), (2131, 11), (2131,))

In [24]:
### Add code here
numeric_features = ['Item_Weight', 'Item_Visibility',
                     'Item_MRP', 'Outlet_Years']
ss=StandardScaler()

categorical_features = [ 'Item_Fat_Content',
       'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Type_Combined' ]

categorical_transformer = Pipeline(steps=[
    ("ordinal", OrdinalEncoder()),
    ('onehot', OneHotEncoder(categories='auto',drop='first'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', ss, numeric_features),
        ('cat', categorical_transformer, categorical_features)], remainder='passthrough')

#let's use it now!

X_train_tran = preprocessor.fit_transform(X_train)
X_train_tran

#and we can transform X_test too, reusing the preprocessor fit on the training data (no refitting)

X_test_tran = preprocessor.transform(X_test)
X_test_tran

<2131x39 sparse matrix of type '<class 'numpy.float64'>'
	with 20879 stored elements in Compressed Sparse Row format>

In [25]:
#get column names for categories
cat_names = preprocessor.named_transformers_['cat'].named_steps['ordinal'].categories_

cat_names = [val for sublist in cat_names for val in sublist[1:]]
cat_names

#full list of column names
column_names = numeric_features + cat_names

#apply column names to dataframe
X_train_trans = pd.DataFrame.sparse.from_spmatrix(X_train_tran, columns=column_names)
X_train_trans.head()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Years | Non-Edible | Regular | Breads | Breakfast | Canned | Dairy | ... | OUT049 | Medium | Small | Tier 2 | Tier 3 | Supermarket Type1 | Supermarket Type2 | Supermarket Type3 | Food | Non-Consumable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.036915 | -0.9727 | 0.82479 | -1.341284 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | -0.378661 | -0.223 | -0.76442 | -0.148401 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -0.01116 | 1.071408 | -1.309291 | 1.521635 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 1.933487 | -0.723034 | 0.085657 | 1.283059 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.89538 | -0.073697 | -1.289446 | -1.341284 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |


#### Dummy Model

First we start with a dummy model as a baseline. Because this is a regression task, the dummy model ignores the features and predicts the mean of the training target for every product; any useful model should beat it.
https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html

In [33]:
# DummyRegressor baseline: always predicts the mean of y_train
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train_tran, y_train)
y_hat_train = dummy.predict(X_train_tran)
# r2_score expects (y_true, y_pred); a mean-only model scores ~0 on its own training data
r2_score(y_train, y_hat_train)
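
A mean-only baseline should score an R² of about 0 on the data it was fit to, and `r2_score` takes the true values first and the predictions second. If the arguments are swapped, the constant predictions are treated as ground truth with essentially zero variance, so the score is no longer meaningful. The following sketch checks the expected behaviour on synthetic data (the arrays `X_toy` and `y_toy` are invented for illustration):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Synthetic features and target; the features are ignored by the dummy model
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.normal(loc=2000.0, scale=500.0, size=100)

baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

# Every prediction equals the training mean
print(np.allclose(y_hat, y_toy.mean()))
# True values first, predictions second: R^2 is 0 up to floating-point error
print(r2_score(y_toy, y_hat))
```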

#### Two approaches
##### Sklearn vs statsmodels

In [None]:
#your code here


### Step 5: Evaluation

During the evaluation step we want to evaluate the results of our models, and decide the next steps in selecting the "best" model.  During this step we should consider the following:

- Does our model solve the business problem?
- What metrics should we be using to evaluate the "success" of our model?
- Can we further improve our models?
- Do we need more data?  Or different data?
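
For a regression task, common candidate metrics are mean absolute error (MAE), root mean squared error (RMSE), and R². The sketch below computes all three on a few made-up sales values; the numbers are purely illustrative and do not come from the BigMart data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted sales values
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 320.0, 380.0])

# MAE: average size of the errors, in the same units as the target
mae = mean_absolute_error(y_true_demo, y_pred_demo)
# RMSE: like MAE but penalizes large errors more heavily
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
# R^2: share of the target's variance explained by the predictions
r2 = r2_score(y_true_demo, y_pred_demo)

print(mae, rmse, r2)
```

MAE and RMSE are expressed in sales units, which can make them easier to discuss with the sales team, while R² is unitless and convenient for comparing models.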

In [None]:
# some helpful code for me later on
# NOTE: `df` and `normal_rv` are placeholders; define them before running this cell

sns.pairplot(df)
plt.show()
fig = sm.qqplot(normal_rv, line = 'r')
plt.show()