# Prediction of sales

### Problem Statement
[The dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) contains sales data for 1559 products across 10 stores in different cities. Attributes of each product and store are also available. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. These missing values will need to be treated accordingly.



### In the following weeks, we will explore the problem in the following stages:

1. **Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome**
2. **Data Exploration – looking at categorical & continuous feature summaries and making inferences about the data**
3. **Data Cleaning – imputing missing values in the data and checking for outliers**
4. **Feature Engineering – modifying existing variables and/or creating new ones for analysis**
5. **Model Building – making predictive models on the data**
---------

In [140]:
# copy paste code from dataprep exercise

import pandas as pd
import numpy as np

#Read files:
data = pd.read_csv("regression_exercise.csv", delimiter=',')

In [141]:
# select_dtypes is the public API; _get_numeric_data is a private pandas method
numeric = data.select_dtypes(include='number').columns

categorical = data.select_dtypes(include=['object']).columns


print(data.shape[1])
numeric.shape[0] + categorical.shape[0]

12


12

In [142]:
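# assign the result back instead of calling inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to `data`
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())

data['Outlet_Size'] = data['Outlet_Size'].fillna('Medium')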

data.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64
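
A minimal sketch, on made-up toy data, of deriving the fill values from the data itself: the mean for a numeric column, as in the cell above, and the most frequent value for a categorical column. Whether the mode is the right choice for `Outlet_Size` in the real data is an assumption to verify, not something the dataset description states.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 5.9],
    "Outlet_Size": ["Medium", None, "Small", "Medium"],
})

# Numeric column: fill gaps with the column mean (NaN is ignored by mean()).
weight_mean = toy["Item_Weight"].mean()
toy["Item_Weight"] = toy["Item_Weight"].fillna(weight_mean)

# Categorical column: fill gaps with the most frequent value (the mode).
size_mode = toy["Outlet_Size"].mode()[0]
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(size_mode)

# Total number of missing values left in the frame.
missing_after = int(toy.isnull().sum().sum())
print(toy)
```

Assigning the result of `fillna` back to the column keeps the operation explicit and works the same way across pandas versions.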

## 4. Feature Engineering

1. Resolve the issues in the data to make it ready for analysis.
2. Create some new variables from the existing ones.





### Create a broad category of Type of Item

The `Item_Type` variable has many categories, so grouping them into a few broad categories may prove useful in the analysis. Look at `Item_Identifier`, the unique ID of each item: it starts with FD, DR or NC. Compared against the item types, these prefixes appear to stand for Food, Drinks and Non-Consumables.

**Task:** Use the `Item_Identifier` variable to create a new column.

In [143]:
data['Item_Type']

0                       Dairy
1                 Soft Drinks
2                        Meat
3       Fruits and Vegetables
4                   Household
                ...          
8518              Snack Foods
8519             Baking Goods
8520       Health and Hygiene
8521              Snack Foods
8522              Soft Drinks
Name: Item_Type, Length: 8523, dtype: object

In [144]:
# create new column that contains the item categories from the item identifier
data['Item_Category'] = data['Item_Identifier'].str[:2]
data.head(3)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Category
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,FD
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,DR
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,FD


In [145]:
data['Item_Category'].unique()

array(['FD', 'DR', 'NC'], dtype=object)
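
If readable names are preferred over the two-letter codes, the prefixes can be mapped to labels. This is a small sketch on toy identifiers; the label names in `CATEGORY_LABELS` and the `Item_Category_Label` column are illustrative assumptions based on the guess above, not values defined by the dataset.

```python
import pandas as pd

# Hypothetical labels for the prefixes; their meaning is inferred, not documented.
CATEGORY_LABELS = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}

# Toy identifiers in the same format as the real Item_Identifier column.
toy = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01", "NCD19"]})

# The first two characters of the ID give the broad category code.
toy["Item_Category"] = toy["Item_Identifier"].str[:2]

# map() returns NaN for any prefix missing from the dictionary,
# which makes unexpected codes easy to spot.
toy["Item_Category_Label"] = toy["Item_Category"].map(CATEGORY_LABELS)
labels = toy["Item_Category_Label"].tolist()
print(labels)
```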

### Determine the years of operation of a store

**Task:** Make a new column depicting the years of operation of a store (i.e. how long the store has existed).

In [146]:
data['Years_Operation'] = 2022 - data['Outlet_Establishment_Year']

In [147]:
# can make this more detailed using pandas date time
data['Years_Operation']

0       23
1       13
2       23
3       24
4       35
        ..
8518    35
8519    20
8520    18
8521    13
8522    25
Name: Years_Operation, Length: 8523, dtype: int64
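
The reference year does not have to be typed in by hand. The sketch below uses a hypothetical helper, `years_in_operation`, that defaults to the current calendar year via `pd.Timestamp.today()` but still accepts a fixed year, so that results such as the ones above stay reproducible.

```python
import pandas as pd

def years_in_operation(establishment_year, reference_year=None):
    """Return reference_year minus establishment_year.

    If no reference_year is given, the current calendar year is used.
    """
    if reference_year is None:
        reference_year = pd.Timestamp.today().year
    return reference_year - establishment_year

# Toy establishment years in the same format as Outlet_Establishment_Year.
years = pd.Series([1999, 2009, 1987], name="Outlet_Establishment_Year")

# Fixed reference year: same result on every run.
fixed = years_in_operation(years, reference_year=2022)

# Default: depends on the date the code is run.
current = years_in_operation(years)
print(fixed.tolist())
```

Note that a date-dependent default changes the feature values over time, so a fixed reference year is usually the safer choice when a model is trained and reused later.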

### Modify categories of Item_Fat_Content

**Task:** The categories of the `Item_Fat_Content` variable are represented inconsistently (for example, 'Low Fat', 'low fat' and 'LF'). This should be corrected.

In [148]:
data['Item_Fat_Content'].unique()


array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

In [149]:
# change the low fat rows to 'Low Fat'
# (assign back rather than using inplace=True on a column selection)
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace('low fat', 'Low Fat')
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace('LF', 'Low Fat')

data['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'reg'], dtype=object)

In [150]:
# change the 'reg' rows to 'Regular'
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace('reg', 'Regular')

data['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)
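
The replacements above can also be written as a single pass with one dictionary. This is a sketch on a toy Series holding the five spellings seen in the data; the name `FAT_CONTENT_MAP` is only illustrative.

```python
import pandas as pd

# Every variant spelling mapped to its canonical category.
FAT_CONTENT_MAP = {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}

# Toy Series containing the five spellings found in Item_Fat_Content.
toy = pd.Series(["Low Fat", "Regular", "low fat", "LF", "reg"],
                name="Item_Fat_Content")

# Values not listed in the dictionary are left unchanged.
cleaned = toy.replace(FAT_CONTENT_MAP)
unique_values = sorted(cleaned.unique())
print(unique_values)
```

Keeping the mapping in one place makes it easier to review and to extend if further variants turn up.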

**Task:** Some items are non-consumables, and a fat content should not be specified for them. Create a separate category for these observations.

In [151]:
# use NumPy's where function: 1 for non-consumables (NC), 0 otherwise
data['Non_Consumable'] = np.where(data['Item_Category'] != 'NC', 0, 1)

In [152]:
data['Non_Consumable'].head(3)

0    0
1    0
2    0
Name: Non_Consumable, dtype: int32

In [153]:
# check that the counts match: the number of NC items should equal the number of 1s in Non_Consumable
data['Item_Category'].value_counts()


FD    6125
NC    1599
DR     799
Name: Item_Category, dtype: int64

In [154]:
data['Non_Consumable'].value_counts()

0    6924
1    1599
Name: Non_Consumable, dtype: int64

### Numerical and One-Hot Encoding of Categorical variables

Since scikit-learn algorithms accept only numerical variables, we need to **convert all categorical variables into numeric types.** 

- if the variable is Ordinal (its values have a natural order), we can simply map its values to numbers
- if the variable is Nominal (we cannot sort the values), we need to One-Hot Encode it --> create dummy variables
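
The two cases can be sketched on a toy frame as follows. The column names mirror the dataset, but the rows are made up, and treating `Outlet_Size` as ordinal and `Outlet_Type` as nominal is the working assumption here.

```python
import pandas as pd

# Toy frame with one ordinal and one nominal column (rows are made up).
toy = pd.DataFrame({
    "Outlet_Size": ["Small", "High", "Medium"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Grocery Store"],
})

# Ordinal: Small < Medium < High is a meaningful order, so map to integers.
size_order = {"Small": 0, "Medium": 1, "High": 2}
toy["Outlet_Size"] = toy["Outlet_Size"].map(size_order)

# Nominal: no natural order, so create one 0/1 column per category.
# dtype=int gives 0/1 integers regardless of the pandas version's default.
encoded = pd.get_dummies(toy, columns=["Outlet_Type"], dtype=int)
print(encoded)
```

Passing `columns=` restricts the one-hot encoding to the listed columns, which avoids accidentally expanding other string columns such as ID fields.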

In [155]:
# pd.get_dummies can be used instead of the scikit-learn encoders imported below
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
# figure out which columns are ordinal and which are nominal
data.head(3)
# ordinal: item_fat_content, outlet_identifier, outlet_size, outlet_location_type
# nominal: item_type, item_category, outlet_type

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Category,Years_Operation,Non_Consumable
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,FD,23,0
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,DR,13,0
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,FD,23,0


In [156]:
# change item_fat_content (0: Low Fat, 1: Regular), run only once
data.replace({'Item_Fat_Content': {'Low Fat': 0, 'Regular': 1}}, inplace=True)

# map each outlet_identifier to an integer, in ascending order of the ID
# (the IDs are labels, so this numeric order carries no real meaning)
outlet_mapper = {
    outlet: code
    for code, outlet in enumerate(sorted(data['Outlet_Identifier'].unique()))
}
data.replace({'Outlet_Identifier': outlet_mapper}, inplace=True)

# map the outlet_size
size_mapper = {'Small': 0, 'Medium': 1, 'High': 2}
data.replace({'Outlet_Size': size_mapper}, inplace=True)

# map the outlet location type
type_mapper = {'Tier 1': 0, 'Tier 2': 1, 'Tier 3': 2}
data.replace({'Outlet_Location_Type': type_mapper}, inplace=True)

# get dummies for all the remaining nominal (string) columns
# note: Item_Identifier is still a string column at this point, so it is
# one-hot encoded too (one dummy column per product ID)
data = pd.get_dummies(data)


In [157]:
data.head(3)


Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Item_Outlet_Sales,Years_Operation,...,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Item_Category_DR,Item_Category_FD,Item_Category_NC
0,9.3,0,0.016047,249.8092,9,1999,1,0,3735.138,23,...,0,0,0,0,1,0,0,0,1,0
1,5.92,1,0.019278,48.2692,3,2009,1,2,443.4228,13,...,0,1,0,0,0,1,0,1,0,0
2,17.5,0,0.01676,141.618,9,1999,1,0,2097.27,23,...,0,0,0,0,1,0,0,0,1,0


**All variables should now be numeric.**

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Item_Outlet_Sales,Years_Operation,...,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Item_Category_DR,Item_Category_FD,Item_Category_NC
count,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,...,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.352693,0.066132,140.992782,4.722281,1997.831867,0.829168,1.112871,2181.288914,24.168133,...,0.140795,0.052212,0.017365,0.127068,0.654347,0.108882,0.109703,0.093746,0.718644,0.18761
std,4.226124,0.477836,0.051598,62.275067,2.837201,8.37176,0.600327,0.812757,1706.499616,8.37176,...,0.347831,0.222467,0.130634,0.333069,0.475609,0.311509,0.312538,0.291493,0.449687,0.390423
min,4.555,0.0,0.0,31.29,0.0,1985.0,0.0,0.0,33.29,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,9.31,0.0,0.026989,93.8265,2.0,1987.0,0.0,0.0,834.2474,18.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,12.857645,0.0,0.053931,143.0128,5.0,1999.0,1.0,1.0,1794.331,23.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
75%,16.0,1.0,0.094585,185.6437,7.0,2004.0,1.0,2.0,3101.2964,35.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
max,21.35,1.0,0.328391,266.8884,9.0,2009.0,2.0,2.0,13086.9648,37.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


---------
### Exporting Data

**Task:** Save the processed data to your local machine as a CSV file.

In [158]:
data.to_csv('sales_data_cleaned.csv')
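
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back unless `index_col=0` is passed to `read_csv`. The sketch below, on a toy frame and an in-memory buffer rather than a real file, shows the alternative of omitting the index with `index=False`.

```python
import io
import pandas as pd

# Toy frame standing in for the processed data (values are illustrative).
toy = pd.DataFrame({"Item_MRP": [249.8092, 48.2692], "Outlet_Size": [1, 1]})

# Write to an in-memory buffer instead of disk so the sketch has no side effects.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False leaves out the row index column

# Read the CSV text back and confirm that no extra column appears.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

Either approach works; what matters is that the writing and reading sides agree on whether the index is stored in the file.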