# Prediction of sales

### Problem Statement
[The dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) represents sales data for 1559 products across 10 stores in different cities. Also, attributes of each product and store are available. The aim is to build a predictive model and determine the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.



### In the following weeks, we will explore the problem in the following stages:

1. **Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome**
2. **Data Exploration – looking at categorical & continuous feature summaries and making inferences about the data**
3. **Data Cleaning – imputing missing values in the data and checking for outliers**
4. **Feature Engineering – modifying existing variables and/or creating new ones for analysis**
5. **Model Building – making predictive models on the data**
---------

In [1]:
import pandas as pd
data = pd.read_csv("regression_exercise.csv")
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [2]:
data.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [3]:
data.isna().sum() # Let's clean Item_Weight and Outlet Size...

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [4]:
data.groupby("Item_Identifier") # Python: "Cool I grouped them, but what to do with the other columns?"

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb148460760>

#### let's clean Item_Weight

In [5]:
# Item weight cleanup
item_mean = data.groupby("Item_Identifier")["Item_Weight"].transform("mean")
data["Item_Weight"] = data["Item_Weight"].fillna(item_mean)
data.Item_Weight.isna().sum() # much better

4
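The 4 weights that are still missing belong to items whose weight is not recorded in any row, so their per-item mean is itself `NaN`. The sketch below reproduces this on toy data and then applies one possible fallback, the overall mean. The fallback is an assumption for illustration and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data: item "B" appears once and never has a recorded weight.
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B"],
    "Item_Weight": [10.0, np.nan, np.nan],
})

# Per-item mean; it stays NaN for items with no recorded weight at all.
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_mean)
remaining = int(toy["Item_Weight"].isna().sum())  # rows the per-item mean could not fill

# Assumed fallback (not in the original notebook): fill the rest with the overall mean.
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())
```

Dropping those rows instead would be another option; which one is better depends on the model you build later.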

#### let's clean the inconsistent Item_Fat_Content strings...

In [6]:
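# Map the variant spellings to the two canonical labels; assign the result back
# instead of calling replace(..., inplace=True) on a column selection.
fat_content_map = {"reg":"Regular","low fat":"Low Fat","LF":"Low Fat"}
data["Item_Fat_Content"] = data["Item_Fat_Content"].replace(fat_content_map)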
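To check that a label cleanup like this one caught every variant, it helps to count the labels afterwards. A minimal sketch on a toy Series (the values are made up for illustration):

```python
import pandas as pd

# Toy labels containing the same spelling variants as Item_Fat_Content.
labels = pd.Series(["Low Fat", "LF", "low fat", "reg", "Regular"])

# Map each variant to its canonical label; labels not in the dict are left unchanged.
mapping = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
cleaned = labels.replace(mapping)

# Only the two canonical labels should be left.
counts = cleaned.value_counts()
```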

#### outlet size cleanup

In [7]:
# Outlet Size cleanup (Make all na medium)
data["Outlet_Size"] = data["Outlet_Size"].fillna("Medium")
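Filling every missing size with "Medium" is the simplest choice. A hypothetical alternative, sketched below on toy data, is to use the most frequent size among stores of the same `Outlet_Type`. This is an assumption for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", "Small", np.nan, "High", np.nan],
})

# Most frequent size per outlet type (mode() ignores NaN by default).
mode_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])

# Look up each row's outlet type and use that group's mode where the size is missing.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(mode_by_type))
```

Note that this sketch would fail for an outlet type whose sizes are all missing, because `mode()` would then return an empty Series.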

In [8]:
data.isna().sum() # better!

Item_Identifier              0
Item_Weight                  4
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

# ^ BASIC CLEANUP-DONE ^

## 4. Feature Engineering

1. Resolving the issues in the data to make it ready for the analysis.
2. Create some new variables using the existing ones.





### Create a broad category of Type of Item

The `Item_Type` variable has many categories which might prove to be very useful in analysis. Look at the `Item_Identifier`, i.e. the unique ID of each item: it starts with either FD, DR or NC. Judging by the categories, these prefixes appear to stand for Food, Drinks and Non-Consumables.

**Task:** Use the Item_Identifier variable to create a new column

In [9]:
data["Item_Type"] = data["Item_Identifier"].str[:2] # copy Identifier's first 2 letters

In [10]:
data["Item_Type"].head(5)

0    FD
1    DR
2    FD
3    FD
4    NC
Name: Item_Type, dtype: object
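The cell above overwrites `Item_Type`. If you would rather keep the original categories and follow the task literally, the prefix can go into a separate column. A minimal sketch, where the column name `Item_Category` is a hypothetical choice:

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19"],
    "Item_Type": ["Dairy", "Soft Drinks", "Household"],
})

# First two letters of the ID, mapped to a readable broad category.
prefix_names = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumables"}
toy["Item_Category"] = toy["Item_Identifier"].str[:2].map(prefix_names)
```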

In [11]:
data["Item_Type"] = data["Item_Type"].replace({"FD":0,"DR":1,"NC":2}) # rename to digits

In [12]:
# data.rename(columns={'Type: FD0DR1NC2':'Item_Type'}, inplace=True)

In [13]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,0,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,1,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,0,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,2,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


### Determine the years of operation of a store

**Task:** Make a new column depicting the years of operation of a store (i.e. how long the store has existed).

In [14]:
from datetime import date

In [15]:
print(date.today())
print(date.today().year) # this.

2021-09-30
2021


In [16]:
data["Store_Age"] = date.today().year - data["Outlet_Establishment_Year"]

In [17]:
data.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Store_Age
0,FDA15,9.3,Low Fat,0.016047,0,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,22
1,DRC01,5.92,Regular,0.019278,1,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,12
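Because `date.today()` changes over time, re-running the notebook in a later year gives different `Store_Age` values. A small sketch with a fixed reference year (here 2021, the year shown in the output above; the constant name `REFERENCE_YEAR` is made up) keeps the result reproducible:

```python
import pandas as pd

REFERENCE_YEAR = 2021  # fixed instead of date.today().year, so results do not drift

years = pd.Series([1999, 2009, 1987], name="Outlet_Establishment_Year")

# Years of operation relative to the fixed reference year.
store_age = REFERENCE_YEAR - years
```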


### Modify categories of Item_Fat_Content

**Task:** The categories of the Item_Fat_Content variable are represented inconsistently. This should be corrected.

In [18]:
# Fixed this already above.

**Task:** There are some non-consumables as well, and fat content should not be specified for them. Create a separate category for these observations.

In [19]:
data.loc[data["Item_Type"] == 2, "Item_Fat_Content"] = "other"
# locate all rows where condition is True, 'column_name' then assign whatever
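The same `.loc` pattern also works with a mask built straight from the identifier prefix, which avoids depending on the numeric code `2`. A sketch on toy data, assuming non-consumable IDs always start with "NC":

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "NCD19", "DRC01"],
    "Item_Fat_Content": ["Low Fat", "Low Fat", "Regular"],
})

# Boolean mask: True for rows whose ID starts with "NC".
is_non_consumable = toy["Item_Identifier"].str.startswith("NC")

# Assign only to the masked rows of the chosen column.
toy.loc[is_non_consumable, "Item_Fat_Content"] = "other"
```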

In [20]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Store_Age
0,FDA15,9.3,Low Fat,0.016047,0,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,22
1,DRC01,5.92,Regular,0.019278,1,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,12
2,FDN15,17.5,Low Fat,0.01676,0,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,22
3,FDX07,19.2,Regular,0.0,0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38,23
4,NCD19,8.93,other,0.0,2,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,34


### Numerical and One-Hot Encoding of Categorical variables

Since scikit-learn algorithms accept only numerical variables, we need to **convert all categorical variables into numeric types.** 

- if the variable is ordinal, we can simply map its values to numbers
- if the variable is nominal (its values have no natural order), we need to one-hot encode it, i.e. create dummy variables
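The two cases can be sketched on toy data as follows. Which columns count as ordinal or nominal is a judgment call; treating `Outlet_Size` as ordinal and `Outlet_Type` as nominal here is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Size": ["Small", "High", "Medium"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Grocery Store"],
})

# Ordinal: the values have a natural order, so map them to increasing integers.
toy["Outlet_Size"] = toy["Outlet_Size"].map({"Small": 0, "Medium": 1, "High": 2})

# Nominal: one 0/1 dummy column per category.
encoded = pd.get_dummies(toy, columns=["Outlet_Type"], dtype=int)
```

`pd.get_dummies` names each new column `<original column>_<category>` and drops the original column.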

In [21]:
data2 = data.copy()

#### Outlet Identifier to numerical - done

In [22]:
data2["Outlet_Identifier"] = data2["Outlet_Identifier"].str[4:] # worked

In [23]:
data["Outlet_Identifier"] = data2["Outlet_Identifier"]

#### Item_Identifier to numerical?

In [24]:
data["Item_Identifier"].value_counts() # How do I change this to numerical

FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64
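One possible answer to the question in the cell above is `pd.factorize`, which assigns an integer code to each distinct ID. This is a sketch of one option, not necessarily the best encoding: the codes carry no meaningful order, and with 1559 distinct IDs, one-hot encoding would create 1559 columns.

```python
import pandas as pd

ids = pd.Series(["FDW13", "NCY18", "FDW13", "DRE49"])

# codes: one integer per row; uniques: the distinct IDs in order of first appearance.
codes, uniques = pd.factorize(ids)
```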

#### item fat content to numerical - done

In [25]:
data["Item_Fat_Content"].value_counts()

Low Fat    3918
Regular    3006
other      1599
Name: Item_Fat_Content, dtype: int64

In [26]:
data["Item_Fat_Content"] = data["Item_Fat_Content"].replace({"other":0,"Low Fat":1,"Regular":2})

#### outlet type as numerical - done

In [27]:
data["Outlet_Type"].value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

In [28]:
data["Outlet_Type"] = data["Outlet_Type"].replace({"Grocery Store":0,"Supermarket Type1":1,"Supermarket Type2":2,"Supermarket Type3":3})

In [29]:
data["Outlet_Type"].value_counts()

1    5577
0    1083
3     935
2     928
Name: Outlet_Type, dtype: int64

#### Outlet location type to numerical - done

In [30]:
data["Outlet_Location_Type"].value_counts()

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

In [31]:
data["Outlet_Location_Type"] = data["Outlet_Location_Type"].str[5:]

In [32]:
data["Outlet_Location_Type"]

0       1
1       3
2       1
3       3
4       3
       ..
8518    3
8519    2
8520    2
8521    3
8522    1
Name: Outlet_Location_Type, Length: 8523, dtype: object
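The output above still reports `dtype: object`: slicing a string column returns strings, so the digits are stored as text. A minimal sketch of the extra conversion step on toy data:

```python
import pandas as pd

tiers = pd.Series(["Tier 1", "Tier 3", "Tier 2"], name="Outlet_Location_Type")

# Slicing keeps the values as text ("1", "3", "2").
digits = tiers.str[5:]

# Convert the text digits to real integers.
as_int = digits.astype(int)
```

The same applies to any other column that was produced by string slicing.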

#### outlet size to numeric and more...

In [35]:
data = data.replace({"Outlet_Size" : {"Small":0,"Medium":1,"High":2}})

In [37]:
data.describe()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Type,Item_Outlet_Sales,Store_Age
count,8519.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0,8523.0
mean,12.87542,1.165083,0.066132,0.468966,140.992782,1997.831867,0.829168,1.20122,2181.288914,23.168133
std,4.646098,0.716317,0.051598,0.790146,62.275067,8.37176,0.600327,0.796459,1706.499616,8.37176
min,4.555,0.0,0.0,0.0,31.29,1985.0,0.0,0.0,33.29,12.0
25%,8.785,1.0,0.026989,0.0,93.8265,1987.0,0.0,1.0,834.2474,17.0
50%,12.65,1.0,0.053931,0.0,143.0128,1999.0,1.0,1.0,1794.331,22.0
75%,16.85,2.0,0.094585,1.0,185.6437,2004.0,1.0,1.0,3101.2964,34.0
max,21.35,2.0,0.328391,2.0,266.8884,2009.0,2.0,3.0,13086.9648,36.0


In [40]:
data.head() # looks better!

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Store_Age
0,FDA15,9.3,1,0.016047,0,249.8092,49,1999,1,1,1,3735.138,22
1,DRC01,5.92,2,0.019278,1,48.2692,18,2009,1,3,2,443.4228,12
2,FDN15,17.5,1,0.01676,0,141.618,49,1999,1,1,1,2097.27,22
3,FDX07,19.2,2,0.0,0,182.095,10,1998,1,3,0,732.38,23
4,NCD19,8.93,0,0.0,2,53.8614,13,1987,2,3,1,994.7052,34


In [38]:
# Will come back to creating dummy variables later. 

**By now, all variables should be numeric.**
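A quick way to verify this is to list the columns that are not numeric. The sketch below uses a toy frame; on the real data you would call the same method on `data`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],   # text
    "Outlet_Location_Type": ["1", "3"],      # digits stored as text
    "Item_MRP": [249.8092, 48.2692],         # numeric
})

# Columns whose dtype is not numeric; an empty list means everything is numeric.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
```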

---------
### Exporting Data

**Task:** You can save the processed data to your local machine as a csv file.
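A minimal sketch of the export step, using a toy frame and a temporary directory. The filename `processed_sales.csv` is a hypothetical choice; on the real data you would call `data.to_csv(...)` with a path of your own.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92],
    "Outlet_Size": [1, 1],
    "Item_Outlet_Sales": [3735.138, 443.4228],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "processed_sales.csv"  # hypothetical filename
    # index=False keeps the row index from being written as an extra column.
    toy.to_csv(csv_path, index=False)
    # Read the file back to confirm the round trip.
    restored = pd.read_csv(csv_path)
```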