![under_construction](figures/under_construction.gif)

The data used in this notebook come from the Analytics Vidhya competition [Practice Problem: Big Mart Sales III](https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/#data_dictionary).

# Exploratory analysis and data preprocessing

## Table of contents

In [None]:
import inspect
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%load_ext autoreload
%autoreload 2

# 1. Big Mart Sales

## 1.1 Competition description

## Problem Statement

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

## Data

|**Variable**                 | Description                                              |
|-----------------------------|----------------------------------------------------------|
|**Item_Identifier**          | Unique product ID                                        |
|**Item_Weight**              | Weight of product                                        |
|**Item_Fat_Content**         | Whether the product is low fat or not                    |
|**Item_Visibility**          | The % of total display area of all products in a store allocated<br/>to the particular product|
|**Item_Type**                |The category to which the product belongs                 |
|**Item_MRP**                 |Maximum Retail Price (list price) of the product          |
|**Outlet_Identifier**        |Unique store ID                                           |
|**Outlet_Establishment_Year**|The year in which store was established                   |
|**Outlet_Size**              |The size of the store in terms of ground area covered     |
|**Outlet_Location_Type**     |The type of city in which the store is located            |
|**Outlet_Type**              |Whether the outlet is just a grocery store or some sort of<br/>supermarket|
|**Item_Outlet_Sales**        |Sales of the product in the particular store. This is the outcome variable<br/>to be predicted|

### Evaluation Metric

Your model performance will be evaluated on the basis of your prediction of the sales for the test data (test.csv), which contains similar data-points as train except for the sales to be predicted. Your submission needs to be in the format as shown in "SampleSubmission.csv".

On our end, we have the actual sales for the test dataset, against which your predictions will be evaluated. We will use the Root Mean Square Error (RMSE) to judge your response.

$
\text{RMSE} = \sqrt{\frac{\sum_{i=1}^N(\text{Predicted}_i - \text{Actual}_i)^2}{N}}
$

where $N$ is the total number of observations, $\text{Predicted}_i$ is the response entered by the user for observation $i$, and $\text{Actual}_i$ is the actual sales value for that observation.
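
The formula above can be sketched in a few lines of NumPy. The function name `rmse` and the toy values are illustrative assumptions, not part of the competition material.

```python
import numpy as np


def rmse(predicted, actual):
    """Root mean square error between two equally sized sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Toy values, invented purely for illustration.
toy_predicted = [100.0, 200.0, 300.0]
toy_actual = [110.0, 190.0, 330.0]
toy_rmse = rmse(toy_predicted, toy_actual)
print(toy_rmse)
```

Because the errors are squared before averaging, a single large miss (here the 30-unit error) weighs more than several small ones.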

Also, note that the test data is further divided into Public (25%) and Private (75%) data. Your initial responses will be checked and scored on the Public data. However, the final rankings will be based on the score on the Private data set. Since this is a practice problem, we will keep declaring winners at specific time intervals and refresh the competition.

## 1.2 Reading the data and separating the response variable

### Reading the data

In [None]:
data = pd.read_csv("datasets/big_mart_sales/Train_UWu5bXk.csv")
print("Dataset size: {} x {}".format(*data.shape))
data.head()

### Splitting explanatory variables and response variable

In [None]:
risposta = "Item_Outlet_Sales"
esplicative = sorted(col for col in data.columns if col != risposta)

X, y = data[esplicative].copy(), data[risposta].copy()

# 2. Exploratory analysis: studying the explanatory variables

## 2.1 Splitting into quantitative and qualitative variables

### Checking the column types

In [None]:
X.dtypes

### Storing the column names in two separate lists

In [None]:
quantitative = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
qualitative = X.select_dtypes(include=["object"]).columns.tolist()
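
As a minimal sketch of how `select_dtypes` partitions the columns, the toy frame below (hypothetical values, column names borrowed from the data dictionary) uses `include="number"`, which matches every numeric width rather than only `int64` and `float64`. This is an alternative to the cell above, not a replacement for it.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3],
    "Outlet_Establishment_Year": [1999, 2009],
    "Item_Type": ["Dairy", "Soft Drinks"],
})

# "number" selects all numeric dtypes; everything else is treated as qualitative.
toy_quantitative = toy.select_dtypes(include="number").columns.tolist()
toy_qualitative = toy.select_dtypes(exclude="number").columns.tolist()
print(toy_quantitative)
print(toy_qualitative)
```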

## 2.2 Quantitative variables

In [None]:
X[quantitative].head()

### Missing value counts

In [None]:
X[quantitative].isnull().sum()
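
Absolute counts can be hard to judge without the dataset size in mind. A possible complement, sketched here on invented values, is the share of missing values per column, obtained by averaging the boolean mask instead of summing it.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; half of the weights are missing.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# The mean of a boolean column is the fraction of True values.
toy_missing_share = toy.isnull().mean()
print(toy_missing_share)
```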

### Summary statistics

In [None]:
X.describe()  # note: only the numeric columns are considered automatically

## 2.3 Qualitative variables

In [None]:
X[qualitative].head()

### Missing value counts

In [None]:
X[qualitative].isnull().sum()

### Number of distinct values

In [None]:
X[qualitative].nunique()

### Value counts

In [None]:
for col in qualitative:
    display(X[col].value_counts().head(16))

### Exercise

List what you discovered through the exploratory analysis.

### Exercise

Explore the data graphically (histograms, boxplots, ...).

> Hint: consider the [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) libraries or, for interactive plots, [Bokeh](https://bokeh.pydata.org/en/latest/).
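
One possible starting point for the graphical exploration is sketched below with pandas plotting on top of Matplotlib. The data are randomly generated stand-ins (the value ranges and level names are assumptions), so the same calls would need to be pointed at `X` to say anything about the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data; ranges and level names are assumptions.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, size=200),
    "Outlet_Size": rng.choice(["Small", "Medium", "High"], size=200),
})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of a quantitative variable.
toy["Item_MRP"].plot.hist(bins=20, ax=ax_hist, title="Item_MRP")

# Boxplot of the same variable, one box per level of a qualitative variable.
toy.boxplot(column="Item_MRP", by="Outlet_Size", ax=ax_box)
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." title

plt.show()
```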

# 3. Data preprocessing

## 3.1 Replacing missing values

### Relationship between *Item_Identifier* and *Item_Weight*

In [None]:
# Summary of the weights observed for each product, most frequent products first
weight_grby_id = (
    X.groupby("Item_Identifier")["Item_Weight"]
    .agg(["count", "min", "max", "mean"])
    .sort_values("count", ascending=False)
)

print("Item_Identifier values with no associated Item_Weight: {}".format((weight_grby_id["count"] == 0).sum()))
weight_grby_id.head()

### Replacing the missing values of *Item_Weight*

In [None]:
from msbd.preprocessamento import RiempireNAItemWeight

print(inspect.getsource(RiempireNAItemWeight))

In [None]:
print("Missing values of Item_Weight before the replacement: {}".format(X["Item_Weight"].isnull().sum()))

riempire_na_item_weight = RiempireNAItemWeight()

X = riempire_na_item_weight.fit_transform(X)

print("Missing values of Item_Weight after the replacement: {}".format(X["Item_Weight"].isnull().sum()))
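
The source of `RiempireNAItemWeight` is printed by the cell above rather than reproduced in the text, so the following is only a hypothetical sketch of one strategy consistent with the groupby study: fill each missing weight with the mean weight observed for the same `Item_Identifier`, falling back to the overall mean. The class name `FillItemWeightByIdentifier` and the fallback rule are assumptions, not the `msbd` implementation.

```python
import numpy as np
import pandas as pd


class FillItemWeightByIdentifier:
    """Hypothetical stand-in; NOT the msbd implementation."""

    def fit(self, X, y=None):
        # Mean weight observed for each product (NaN values are skipped).
        self.weight_by_id_ = X.groupby("Item_Identifier")["Item_Weight"].mean()
        # Assumed fallback for products whose weight is never observed.
        self.overall_mean_ = X["Item_Weight"].mean()
        return self

    def transform(self, X):
        X = X.copy()
        fill = X["Item_Identifier"].map(self.weight_by_id_).fillna(self.overall_mean_)
        X["Item_Weight"] = X["Item_Weight"].fillna(fill)
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Toy data with invented values: product "C" never has an observed weight.
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B", "C"],
    "Item_Weight": [9.3, np.nan, 5.0, 7.0, np.nan],
})
toy_filled = FillItemWeightByIdentifier().fit_transform(toy)
print(toy_filled)
```

Learning the per-product means in `fit` and applying them in `transform` keeps the same statistics available for data that were not seen during fitting.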

### Relationship between *Outlet_Location_Type* and *Outlet_Size*

In [None]:
size_grby_location = X.groupby("Outlet_Location_Type")["Outlet_Size"].value_counts().unstack().fillna(0)

size_grby_location

### Relationship between *Outlet_Type* and *Outlet_Size*

In [None]:
size_grby_type = X.groupby("Outlet_Type")["Outlet_Size"].value_counts().unstack().fillna(0)

size_grby_type

In [None]:
from msbd.preprocessamento import RiempireNAOutletSize

print(inspect.getsource(RiempireNAOutletSize))

In [None]:
print("Missing values of Outlet_Size before the replacement: {}".format(X["Outlet_Size"].isnull().sum()))

riempire_na_outlet_size = RiempireNAOutletSize()

X = riempire_na_outlet_size.fit_transform(X)

print("Missing values of Outlet_Size after the replacement: {}".format(X["Outlet_Size"].isnull().sum()))
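
The rule applied by `RiempireNAOutletSize` is not spelled out in the text. One plausible rule suggested by the cross-tabulations above, sketched here purely as an assumption, is to fill each missing `Outlet_Size` with the most frequent size among outlets of the same `Outlet_Type`. All values in the toy frame are invented.

```python
import numpy as np
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", "Small", np.nan, "Medium", "Medium", np.nan],
})

# Most frequent size within each outlet type; mode() ignores NaN by default.
# Note: this would fail for a type whose sizes are all missing.
mode_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])

toy_filled = toy.copy()
toy_filled["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(mode_by_type))
print(toy_filled)
```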

## 3.2 Merging similar levels of the qualitative variables

### Merging similar levels of *Item_Fat_Content*

In [None]:
from msbd.preprocessamento import Sostituire

print(inspect.getsource(Sostituire))

In [None]:
sostituire_item_fat_content = Sostituire({"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"})

X = sostituire_item_fat_content.fit_transform(X)
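
The effect of such a mapping can be illustrated with plain pandas. This is a sketch of the idea on invented rows and makes no claim about how `Sostituire` is implemented internally.

```python
import pandas as pd

# Toy column containing the inconsistent spellings handled by the mapping.
toy = pd.DataFrame({"Item_Fat_Content": ["Low Fat", "LF", "low fat", "Regular", "reg"]})
mapping = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}

# DataFrame.replace swaps every value that matches a key of the mapping.
toy_clean = toy.replace(mapping)
print(toy_clean["Item_Fat_Content"].value_counts())
```

After the replacement only two levels remain, which is what the value counts of the cleaned column should confirm on the real data as well.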

## 3.3 Dropping columns that will not be used

### Dropping *Item_Identifier*

In [None]:
X.drop(columns="Item_Identifier", inplace=True)
esplicative.remove("Item_Identifier")
qualitative.remove("Item_Identifier")

# 4. Splitting the data into *training*, *validation* and *test* sets

<div class="alert alert-danger fade in">
<strong>IMPORTANT</strong>: before carrying out any analysis that (also) involves the response variable, the <em>validation</em> and <em>test</em> sets must be separated from the <em>training</em> set. Skipping this step can compromise, to a greater or lesser extent, the conclusions drawn on them.
</div>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=1000)

print("Training set size: {}".format(len(y_train)))
print("Validation set size: {}".format(len(y_val)))
print("Test set size: {}".format(len(y_test)))
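
The split above is random, so each run of the notebook produces different sets. If repeatable results are wanted, one option (an assumption here, not something the notebook does) is to pass `random_state` to `train_test_split`, as in this sketch on toy data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with invented values.
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series(np.arange(100) * 2.0, name="target")

# random_state (an arbitrary 42 here) makes the shuffling repeatable.
X_rest, X_te, y_rest, y_te = train_test_split(toy_X, toy_y, test_size=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=20, random_state=42)

print(len(y_tr), len(y_va), len(y_te))
```

An integer `test_size` is read as an absolute number of rows, which is why the two calls leave 60, 20 and 20 observations respectively.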

# 5. Exploratory analysis: relationship between explanatory variables and response variable

## 5.1 Qualitative variables

In [None]:
from msbd.grafici import grafico_barre_qualitative_risposta

print(inspect.getsource(grafico_barre_qualitative_risposta))

In [None]:
plt.figure(figsize=(10, 10))

grafico_barre_qualitative_risposta(X_train, y_train, qualitative, 2)

plt.show()
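
The source of `grafico_barre_qualitative_risposta` is printed rather than shown in the text. Assuming the bars summarise the response by level of each qualitative variable, the underlying numbers could be obtained with a groupby such as the one sketched below on invented values.

```python
import pandas as pd

# Toy training data with invented values.
toy_X = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Grocery Store", "Supermarket Type1"],
})
toy_y = pd.Series([100.0, 2000.0, 300.0, 3000.0], name="Item_Outlet_Sales")

# Mean response for each level of the qualitative variable, largest first.
mean_by_level = toy_y.groupby(toy_X["Outlet_Type"]).mean().sort_values(ascending=False)
print(mean_by_level)
```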

# 6. Converting qualitative variables to dummy variables
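
No conversion cell appears in this section, so the following is only a sketch, on invented values, of how `pd.get_dummies` behaves with and without `drop_first=True`. Comparing the two sets of columns is a reasonable starting point for the exercise that follows.

```python
import pandas as pd

# Toy column with invented values.
toy = pd.DataFrame({"Outlet_Size": ["Small", "Medium", "High", "Small"]})

# One indicator column per level.
dummies_all = pd.get_dummies(toy, columns=["Outlet_Size"])

# Same encoding, but the first level (in sorted order) is dropped.
dummies_drop = pd.get_dummies(toy, columns=["Outlet_Size"], drop_first=True)

print(dummies_all.columns.tolist())
print(dummies_drop.columns.tolist())
```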

### Exercise

Why did we choose `drop_first=True`?