





# Retail Revenue Prediction - Long notebook

##### **Group Information**
**Participants:**
- Mina Chen Glein Feragen - 544552
- Andrew Glover Marty - 557813
- Simen Tvete Aabol - 505174


**Kaggle Team name:** Group 8

## Content:
1. [Starting out](#starting-out)
2. [Exploratory Data Analysis](#exploratory-data-analysis)
    1. [Stores train](#stores_train)
    2. [Plaace hierarchy](#Plaace-hierarchy)
    3. [Grunnkrets Data](#Grunnkrets-Data)
    4. [Extra Stores](#Extra-Stores)
    5. [Household income](#Household-income)
    6. [Buss](#Buss)
    7. [Testing with mall_name](#Testing-with-mall_name)
    8. [Testing with chain_name](#Testing-with-chain_name)
    9. [Stores train](#Stores_train)
    10. [Population](#Population)
3. [Data Preprocessing](#Data-Preprocessing)
4. [Feature Engineering](#Feature-Engineering)
     1. [Population](#Population)
     2. [Busstops](#Busstops)
5. [Models](#models)
    1. [Gradient Boosting Machine](#gradient-boosting-machine)
6. [Results](#Results)
7. [Reflections](#Reflections)

## Starting out

When starting out with this project, we first spent some time getting to know the data set and the task in front of us. We all looked through the features of each table and discussed to what degree we thought each feature would affect the revenue of a given store. We also brainstormed possible algorithms to use for this project, for example random forest, light gradient boosting and extreme gradient boosting.

After this initial meeting, we started on the EDA.

## Exploratory Data Analysis

For the EDA, we wanted to go through all of the given files and look for null values, outliers and correlations between features. We also wanted to check whether there were errors or misleading information anywhere.

We started out by importing all necessary packages and data files, and then went through the files one by one.

**Files:**
- [stores_train.csv](#stores-train)
- [plaace_hierarchy.csv](#plaace-hierarchy)
- [grunnkrets_data.csv](#grunnkrets-data)
- [stores_extra.csv](#extra-stores)
- [grunnkrets_income_households.csv](#household-income)
- [busstops_norway.csv](#buss)

In [None]:
import pandas as pd
import numpy as np
import math
import copy
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go 
from sklearn.preprocessing import LabelEncoder
from plotly.offline import init_notebook_mode,iplot,plot
init_notebook_mode(connected=True)
import warnings
warnings.filterwarnings('ignore')
from shapely.geometry import Point, Polygon

import hashlib


### Stores Train

We started by getting an overview of the data set.

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
stores_train.head()

In [None]:
stores_train.tail()

In [None]:
stores_train.describe()

In [None]:
stores_train.shape

In [None]:
stores_train.nunique()

Here we can see that the 'year' column has only one unique value. A constant column carries no information about 'revenue', so we remove it from our data set.
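
The same check can be written so that it covers every column at once. The following is a minimal sketch; the helper name `constant_columns` and the toy values are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list[str]:
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts NaN as its own value, so a column that mixes NaN
    # with one real value is not reported as constant.
    counts = df.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

# Toy frame standing in for stores_train (values are made up).
toy = pd.DataFrame({
    "year": [2016, 2016, 2016],
    "revenue": [1.5, 0.0, 7.2],
})
print(constant_columns(toy))
```

On the toy frame only 'year' is reported, which mirrors the observation above.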

In [None]:
stores_train.pop("year")
stores_train.head()

In [None]:
stores_train.isnull().sum()

Here we can see that three columns are missing data: 'address', 'chain_name' and 'mall_name'.

It is uncertain whether the 'address' column is needed when we already have geographical data in the form of latitude and longitude.
'address' and 'lat' + 'lon' represent the same information, so we expect them to have the same influence on 'revenue'. We therefore choose to remove 'address', since the column is missing a lot of data. 'lat' and 'lon' also have the advantage of being numerical.

In [None]:
stores_train.isnull().sum()

In 'chain_name' and 'mall_name' we still have a problem with missing data.
These two columns can have an impact on 'revenue'. For example, there may be exclusive restaurants with a 'chain_name' that have very high revenue, or, vice versa, chains that struggle a lot.

Deleting the rows where data is missing is often the easiest option, but it is not appropriate here, because such a large proportion of the data set is missing these values. We also think these columns are important to keep.
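
To put a number on "a large proportion", the share of missing values per column can be computed directly. This is a small sketch on made-up data; the helper name `missing_share` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy frame standing in for stores_train (values are made up).
toy = pd.DataFrame({
    "chain_name": ["JOKER", np.nan, np.nan, "MIX"],
    "mall_name": [np.nan, np.nan, np.nan, "Some mall"],
    "revenue": [1.0, 2.0, 3.0, 4.0],
})
print(missing_share(toy))
```

Calling `missing_share(stores_train)` would give the corresponding shares for the real data.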

The function of these two columns is to group the stores together.
One could argue that the 'mall_name' column can be removed because we have the GPS locations of the stores, and let the algorithm(s) learn that many shops in one place can affect the 'revenue' column. But it is a lot easier for a model to understand categorical data than coordinates that are close to each other.

Looking closer at mall_name

In [None]:
absolute_frequencies = stores_train['mall_name'].value_counts()
print("The count of each frequencies for the mall_name\n", absolute_frequencies.value_counts())

In [None]:
print("Absolute frequencies: \n", absolute_frequencies)


In [None]:
print("The count of each frequencies\n", absolute_frequencies.value_counts(dropna=False))

In [None]:
absolute_frequencies.tail(25)

Here we can see that there are many malls that only have 1 store.

In [None]:
absolute_frequencies.head(25)

In [None]:
#  stores_train.pop("mall_name")

Early stage:
After looking up many of the shopping centers that, according to our data, consist of only one store, we see that the data is incomplete. We therefore conclude that it is better to rely only on the GPS location to capture the geographical clustering of the stores. Our first conclusion was therefore to drop the 'mall_name' column.

Later on:
Before removing the 'mall_name' column, we look further for a correlation with 'revenue'.

We have now gone back to 'mall_name' after becoming a little more familiar with cleaning data, and think that there is a benefit in including the column.

In [None]:
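# Create a subset of stores_train without the rows where 'mall_name' is NaN.
# .copy() gives an independent frame, so later edits do not touch stores_train.
mall_subset = stores_train[stores_train['mall_name'].notna()].copy()
# Use a separate name for the result so that matplotlib's `plt` is not overwritten.
mall_revenue = mall_subset.groupby(['mall_name'])['revenue'].mean()

mall_revenue.plot(kind='bar', title='Average revenue per mall', ylabel='Revenue', xlabel='Mall', figsize=(40, 5))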

Here we can see that several malls have a higher average revenue than the rest. Next, we test removing the malls that consist of 3 or fewer stores.

In [None]:
threshold = 3  # Malls that occur this many times or fewer are removed.
# Only 'mall_name' needs to change, so there is no need to loop over all columns.
value_counts = mall_subset['mall_name'].value_counts()
to_remove = value_counts[value_counts <= threshold].index
mall_subset['mall_name'] = mall_subset['mall_name'].mask(mall_subset['mall_name'].isin(to_remove), np.nan)

mall_revenue = mall_subset.groupby(['mall_name'])['revenue'].mean()
mall_revenue.plot(kind='bar', title='Average revenue per mall', ylabel='Revenue', xlabel='Mall', figsize=(40, 5))

Here we can see that several of the tall bars have disappeared. Some of the malls that had an average revenue of over 120 have now been removed. Removing these may be appropriate to avoid overfitting.



In [None]:
threshold = 4  # Malls that occur this many times or fewer are removed.
# Only 'mall_name' needs to change, so there is no need to loop over all columns.
value_counts = mall_subset['mall_name'].value_counts()
to_remove = value_counts[value_counts <= threshold].index
mall_subset['mall_name'] = mall_subset['mall_name'].mask(mall_subset['mall_name'].isin(to_remove), np.nan)

mall_revenue = mall_subset.groupby(['mall_name'])['revenue'].mean()
mall_revenue.plot(kind='bar', title='Average revenue per mall', ylabel='Revenue', xlabel='Mall', figsize=(40, 5))

Even with the threshold set to 4, meaning that we only include malls that occur at least 5 times, we see a large difference in revenue.
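
The thresholding step is repeated several times in this section, so it could be wrapped in a helper. The sketch below is one possible way to do it; the function name `collapse_rare` and the toy mall names are assumptions, and it uses a "keep at least `min_count`" rule instead of the `<=` comparison in the cells above.

```python
import pandas as pd

def collapse_rare(values: pd.Series, min_count: int, other: str) -> pd.Series:
    """Replace categories seen fewer than min_count times (and NaN) with `other`."""
    counts = values.value_counts()              # NaN is ignored by value_counts
    keep = counts[counts >= min_count].index    # categories that are common enough
    return values.where(values.isin(keep), other)

# Toy series standing in for stores_train['mall_name'] (names are made up).
malls = pd.Series(["A", "A", "A", "B", None, "C", "C"])
print(collapse_rare(malls, min_count=3, other="A-not a mall").tolist())
```

Because the cells above compare with `<=`, `threshold = 4` there corresponds to `min_count=5` in this sketch.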

We also test whether not belonging to a mall has an effect on revenue.

In [None]:
mall_subset['mall_name'] = mall_subset['mall_name'].replace(np.nan, "A-not a mall")
threshold = 6  # Malls that occur this many times or fewer are relabelled.
# Only 'mall_name' needs to change, so there is no need to loop over all columns.
value_counts = mall_subset['mall_name'].value_counts()
to_remove = value_counts[value_counts <= threshold].index
mall_subset['mall_name'] = mall_subset['mall_name'].mask(mall_subset['mall_name'].isin(to_remove), "A-not a mall")

mall_revenue = mall_subset.groupby(['mall_name'])['revenue'].mean()
mall_revenue.plot(kind='bar', title='Average revenue per mall', ylabel='Revenue', xlabel='Mall', figsize=(40, 5))

In [None]:
stores_train['mall_name'] = stores_train['mall_name'].replace(np.nan, "A-not a mall")

threshold = 4  # Malls that occur this many times or fewer are relabelled.
# Only 'mall_name' needs to change, so there is no need to loop over all columns.
value_counts = stores_train['mall_name'].value_counts()
to_remove = value_counts[value_counts <= threshold].index
stores_train['mall_name'] = stores_train['mall_name'].mask(stores_train['mall_name'].isin(to_remove), "A-not a mall")

mall_revenue = stores_train.groupby(['mall_name'])['revenue'].mean()
mall_revenue.plot(kind='bar', title='Average revenue per mall', ylabel='Revenue', xlabel='Mall', figsize=(30, 5))

In [None]:
stores_train['mall_name'].isna().sum()

Just to check that the malls with one store have been converted to "A-not a mall", the same as the NaN values, we check whether the mall "Telegrafen" still exists.


In [None]:
temp = stores_train.loc[stores_train['mall_name'] == "Telegrafen"]                       
print(temp) 

Early stage:

'store_id' and 'store_name' represent the same data, so we can remove one of them. Here it is worth considering that 'store_id' is numerical, while 'store_name' is not.
We can therefore remove 'store_name'.

The same reasoning applies to 'plaace_hierarchy_id' and 'sales_channel_name'. We can therefore remove 'sales_channel_name'.

Later on:
We used these columns for some further tests, so we did not remove them here but further down.


In [None]:
#  stores_train.pop("store_id")
#  stores_train.pop("store_name")

Looking a little closer at 'grunnkrets_id'.

In [None]:
absolute_frequencies = stores_train['grunnkrets_id'].value_counts()
print("The count of each frequencies for the grunnkrets_id\n", absolute_frequencies.tail(25))

Here we can see that several 'grunnkrets_id' values have only one store. On a small data set, this could lead to overfitting, so it may need special handling.
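
To see how large this issue is, one could measure the share of stores that are alone in their 'grunnkrets_id'. A minimal sketch on made-up IDs follows; the helper name `singleton_share` is an assumption.

```python
import pandas as pd

def singleton_share(ids: pd.Series) -> float:
    """Share of rows whose id occurs exactly once in the series."""
    # Map every row to the number of times its id appears.
    counts = ids.map(ids.value_counts())
    return float((counts == 1).mean())

# Toy ids standing in for stores_train['grunnkrets_id'] (values are made up).
toy_ids = pd.Series([101, 101, 102, 103, 103, 104])
print(singleton_share(toy_ids))
```

In the toy series two of six rows have a unique id, so the share is one third.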

It is also worth looking through the values in 'revenue' to find out whether there are any obvious outliers.

In [None]:
stores_train.sort_values(by=['revenue'])['revenue'].head(25)

In [None]:
stores_train.sort_values(by=['revenue'])['revenue'].tail(25)


None of the highest revenue values look like outliers.

In [None]:
# Use a name other than `sorted` so that the Python built-in is not shadowed.
sorted_stores = stores_train.sort_values(by=['revenue'])
absolute_frequencies = sorted_stores['revenue'].value_counts()
print(absolute_frequencies)

This seems very strange: 217 stores had no revenue in 2016. We take a closer look at these.

In [None]:
stores_train.sort_values(by=['revenue']).head(16)

There are suspiciously many stores with a revenue of '0.0'. It may be that stores without available data have been registered with a default value of '0.0'. We choose to look up the stores "REMA 1000 TOLLNES" and "YX KJOS" to confirm or reject this hypothesis.

REMA 1000 TOLLNES was founded in 2005 and is still in operation. We found that the store was active in 2016 through this article: "https://www.ta.no/det-gjor-meg-vondt-i-hjertet-a-kaste-mat/s/5-50-174104". Unfortunately, we could not find anything about the store on "www.proff.no".

YX KJOS was founded in 2010 and is still in operation. The website "https://www.regnskapstall.no/regnskapstall-for-kjos-servicesenter-as-103319245S1?view=full" clearly shows that the company had revenue in 2016, so the value in the data set is wrong. The company had a revenue of 7,026,000 and a profit of 623,000.

We can therefore probably assume that most of the rows with the value '0.0' for 'revenue' are incorrect. Even a company that goes bankrupt has some income from sales. It is probably safest to remove these rows.

We can also rule out that the column shows profit instead of revenue, since the company we looked up had a positive profit.

In [None]:
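# Keep only the stores with non-zero revenue; .copy() gives an independent frame.
stores_train = stores_train[stores_train.revenue != 0].copy()
# Use a name other than `sorted` so that the Python built-in is not shadowed.
sorted_stores = stores_train.sort_values(by=['revenue'])
absolute_frequencies = sorted_stores['revenue'].value_counts()
print(absolute_frequencies)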

Now we check whether all the coordinates are in Norway and look correct, by getting a quick overview and plotting all the coordinates on a map.

In [None]:
BBox = ( stores_train.lat.min(), stores_train.lat.max(), stores_train.lon.min(), stores_train.lon.max())
print(BBox)

After a bit of searching online, we find that these coordinates are within the borders of Norway.
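
The manual lookup can be complemented by a programmatic check against a rough bounding box. This is only a sketch: the limits below are approximate values for mainland Norway chosen for illustration (an assumption, not taken from the data set), and the helper name `outside_bbox` is hypothetical.

```python
import pandas as pd

# Approximate bounding box for mainland Norway (an assumption for this
# sketch, chosen loosely; not taken from the data set).
LAT_RANGE = (57.5, 71.5)
LON_RANGE = (4.0, 31.5)

def outside_bbox(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose lat/lon fall outside the rough bounding box."""
    inside = df["lat"].between(*LAT_RANGE) & df["lon"].between(*LON_RANGE)
    return df[~inside]

# Toy coordinates: roughly Oslo, Trondheim and Paris.
toy = pd.DataFrame({
    "lat": [59.91, 63.43, 48.86],
    "lon": [10.75, 10.40, 2.35],
})
print(outside_bbox(toy))
```

A rectangle this size also covers parts of neighbouring countries, so an empty result is a necessary but not a sufficient condition; the map below remains the better check.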

In [None]:
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

fig = px.density_mapbox(stores_train, lat='lat', lon='lon', radius=5,
                        center=dict(lat=0, lon=180), zoom=0,
                        mapbox_style='open-street-map')
# It is possible to use mapbox_style="stamen-terrain" 
fig.show()

Here we can see that all the coordinates are in Norway.
Now we can take a closer look at Norway to see whether the distribution of stores makes sense.

In [None]:
fig = px.density_mapbox(stores_train, lat='lat', lon='lon', radius=1,
                        center=dict(lat=65.5, lon=10), zoom=3,
                        mapbox_style='open-street-map')
fig.show()

Here you can interact with the map to explore more. As a quick overview, the map shows that there are many shops in the big cities and fewer in the countryside. The coordinates in the data set seem to make sense.

We remove the columns that we assume are not needed.

In [None]:
stores_train.pop("sales_channel_name")
stores_train.pop("address")
stores_train.head(5)

Now looking more closely at 'chain_name'.

In [None]:
stores_train['chain_name'].isna().sum()

In [None]:
stores_train['chain_name'].nunique()

In [None]:
stores_train['chain_name'].value_counts()

Here we suspect that some data is missing. We will therefore see if we can generate the data we are missing in 'chain_name' through the data located in 'store_name'.

In [None]:
store_name_rema = stores_train[stores_train['store_name'].str.match('REMA')]
store_name_rema

Here we can see that there are 268 stores with names matching 'REMA'. Something may be off here, since there are 269 stores registered under 'REMA FRANCHISE NORGE' in the 'chain_name' column.

In [None]:
chain_name_rema = stores_train[stores_train['chain_name'].str.match('REMA FRANCHISE NORGE',  na=False)]
chain_name_rema

In [None]:
store_name_rema = chain_name_rema[~chain_name_rema["store_name"].str.match('REMA 1000', na=False)]

store_name_rema

Here we can see that 'BRENDEN HANDEL' should not belong to 'REMA FRANCHISE NORGE'.


We could write a code snippet to fix errors like this automatically. For this to work, there must be a clear link between the names in 'store_name' and 'chain_name'.

After looking at the data set, this does not seem entirely feasible. For example, 'chain_name'='3T' has the store names '3 T ' and '3T'. Such variations make it difficult to automate this process.

It also gets complicated because several of the chain names have no connection to the store names. For example, there are many companies under the ALLIANCE OPTIKK chain, such as BRILLEHUSET HAMMERFEST, MIDT-TELEMARK SYNSSENTER, FRYDENLUND OPTIKK and OLLIS OPTIKK.
With more time, we could go through the data manually. But since this is a very time-consuming process, we choose not to do it.
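
Spelling variations such as '3 T ' and '3T' could be reduced by normalizing the names before comparing them. The sketch below shows one possible normalization; the function name `normalize_name` is an assumption, and it does nothing for cases like ALLIANCE OPTIKK, where the store name and the chain name share no text.

```python
import re

def normalize_name(name: str) -> str:
    """Upper-case a name and drop everything except letters and digits."""
    # [\W_] matches whitespace, punctuation and underscores; letters such as
    # Æ, Ø and Å are kept because \w is Unicode-aware for str patterns.
    return re.sub(r"[\W_]+", "", name.upper())

print(normalize_name("3 T "))
print(normalize_name("3T") == normalize_name("3 T "))
```

Both spellings from the example above collapse to the same key, '3T'.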

In [None]:
stores_train.sort_values(by=['chain_name']).head(25)

We choose to fix what we assume is wrong for Rema 1000, and then take a closer look at Kiwi, which we thought would be more widespread. Kiwi is not among the top 5 chains measured by number of stores (REMA FRANCHISE NORGE, JOKER, MIX, CIRCLE K DETALJIST, BUNNPRIS).


In [None]:
store_name_rema

In [None]:
store_name_rema.at[7435, 'chain_name'] = np.nan 
store_name_rema

We have now checked that this worked as desired, so we perform the same change on stores_train.

In [None]:
stores_train.at[7435, 'chain_name'] = np.nan 

stores_train.query('store_id=="915698204-915720811-778305"')

It worked :) 

In [None]:
# chain_name_rema = stores_train[stores_train['chain_name'].str.contains('REMA FRANCHISE NORGE',  na=False)]
# chain_name_rema
# store_name_rema = chain_name_rema[~chain_name_rema["store_name"].str.contains('REMA 1000', na=False)]
# store_name_rema
# name_rema = stores_train[stores_train['store_name'].str.contains('REMA 1000',  na=False)]

# We first tried str.contains, but found that str.match was better suited to this problem.


name_rema = stores_train[stores_train['store_name'].str.match('REMA')]
# name_rema

rema = name_rema[~name_rema["chain_name"].str.contains('REMA', na=False)]

rema

Here we can see that there are no stores with names containing 'REMA' that are not in the REMA chain.

Now we're going to take a closer look at Kiwi's stores.

In [None]:
chain_name_kiwi_norge = stores_train[stores_train['chain_name'].str.match('KIWI',  na=False)]
chain_name_kiwi_norge


In [None]:
chain_name_kiwi_norge = stores_train[stores_train['chain_name'].str.match('KIWI NORGE',  na=False)]
chain_name_kiwi_norge


We tested searching for both 'KIWI' and 'KIWI NORGE' and found exactly the same 64 rows. We therefore know that the stores in the Kiwi chain have not been registered under different chain names.

In [None]:
store_name_kiwi_norge = stores_train[stores_train['store_name'].str.contains('KIWI',  na=False)]
store_name_kiwi_norge


Here we can also see that at least one KIWI store is missing the chain value.

Looking further into this, using the same procedure as with REMA.

In [None]:
store_name_kiwi = store_name_kiwi_norge[~store_name_kiwi_norge["chain_name"].str.contains('KIWI', na=False)]
store_name_kiwi

Here we can see the one KIWI store that is not registered as a KIWI chain.

In [None]:
stores_train.at[8164, 'chain_name'] = "KIWI NORGE"

stores_train.query('store_id=="915526284-915802605-781854"')

Then we check whether any stores registered under the KIWI chain have a store name that does not contain 'KIWI'.

In [None]:
chain_name_kiwi_norge = stores_train[stores_train['chain_name'].str.contains('KIWI',  na=False)]
chain_name_kiwi_norge


In [None]:
store_name_kiwi = chain_name_kiwi_norge[~chain_name_kiwi_norge["store_name"].str.contains('KIWI', na=False)]
store_name_kiwi
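
The procedure used for REMA and KIWI could be wrapped in one helper that returns both kinds of mismatch for a given keyword. This is a sketch under assumptions: the function name `name_chain_mismatches` and the toy store names are hypothetical.

```python
import numpy as np
import pandas as pd

def name_chain_mismatches(df: pd.DataFrame, keyword: str):
    """Return (rows with keyword only in store_name, rows with keyword only in chain_name)."""
    in_store = df["store_name"].str.contains(keyword, na=False, regex=False)
    in_chain = df["chain_name"].str.contains(keyword, na=False, regex=False)
    return df[in_store & ~in_chain], df[in_chain & ~in_store]

# Toy frame standing in for stores_train (store names are made up).
toy = pd.DataFrame({
    "store_name": ["KIWI EXAMPLE A", "KIWI EXAMPLE B", "CORNER SHOP"],
    "chain_name": ["KIWI NORGE", np.nan, "KIWI NORGE"],
})
missing_chain, odd_name = name_chain_mismatches(toy, "KIWI")
print(missing_chain)
print(odd_name)
```

The first frame lists stores that look like they belong to the chain but are not registered under it; the second lists stores registered under the chain whose name does not mention it. Both still need a manual look before anything is changed.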


We no longer have any use for 'store_name', so we remove this column.

In [None]:
stores_train.pop("store_name")

We reset the index after some rows have been removed. This makes it easier to iterate through the data frame.

We also check that the data frame is not a copy after this operation.

In [None]:
stores_train = stores_train.reset_index(drop=True)
stores_train._is_copy

After looking through the data in 'chain_name', we can now look for a correlation with 'revenue'. We do this in the same way as we did with 'mall_name'.

In [None]:
# Note: this is a reference to stores_train, not a copy, so the changes made to
# chain_subset in the cells below are also applied to stores_train.
chain_subset = stores_train

chain_subset['chain_name'] = chain_subset['chain_name'].replace(np.nan, "A-not a chain")
threshold = 0  # Chains that occur this many times or fewer are relabelled (0 keeps all chains).
# Only 'chain_name' needs to change, so there is no need to loop over all columns.
value_counts = chain_subset['chain_name'].value_counts()
to_remove = value_counts[value_counts <= threshold].index
chain_subset['chain_name'] = chain_subset['chain_name'].mask(chain_subset['chain_name'].isin(to_remove), "A-not a chain")

# Use a separate name for the result so that matplotlib's `plt` is not overwritten.
chain_revenue = chain_subset.groupby(['chain_name'])['revenue'].mean()
chain_revenue.plot(kind='bar', title='Average revenue per chain_name', ylabel='Revenue', xlabel='Chain', figsize=(40, 5))

Here we can see that many chains do significantly better than others. As before, we set a minimum number of stores in a chain.

We can also see that the stores that do not belong to a chain have a rather low average revenue of around 5.


In [None]:
threshold = 3  # Chains that occur this many times or fewer are relabelled.
# Only 'chain_name' needs to change, so there is no need to loop over all columns.
value_counts = chain_subset['chain_name'].value_counts()
to_remove = value_counts[value_counts <= threshold].index
chain_subset['chain_name'] = chain_subset['chain_name'].mask(chain_subset['chain_name'].isin(to_remove), "A-not a chain")

chain_revenue = chain_subset.groupby(['chain_name'])['revenue'].mean()
chain_revenue.plot(kind='bar', title='Average revenue per chain_name', ylabel='Revenue', xlabel='Chain', figsize=(40, 5))

We tried several different values for the threshold to see whether any of the bars that stand out disappeared. A threshold of 3 seems to make sense in this case.

We therefore apply this to the data set we are working with.

In [None]:
stores_train['chain_name'] = stores_train['chain_name'].replace(np.nan, "A-not a chain")
threshold = 3  # The value chosen above; chains that occur this many times or fewer are relabelled.
# Only 'chain_name' needs to change, so there is no need to loop over all columns.
value_counts = stores_train['chain_name'].value_counts()
to_remove = value_counts[value_counts <= threshold].index
stores_train['chain_name'] = stores_train['chain_name'].mask(stores_train['chain_name'].isin(to_remove), "A-not a chain")

chain_revenue = stores_train.groupby(['chain_name'])['revenue'].mean()
chain_revenue.plot(kind='bar', title='Average revenue per chain_name', ylabel='Revenue', xlabel='Chain', figsize=(40, 5))

We convert the values in 'chain_name' and 'mall_name' to numeric values.
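
As a small illustration of what the label encoding does, the sketch below encodes a handful of chain names with scikit-learn's `LabelEncoder` and maps the codes back again. The input list is made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
chains = ["JOKER", "A-not a chain", "MIX", "JOKER"]
codes = encoder.fit_transform(chains)

print(list(encoder.classes_))                     # the distinct labels, sorted
print(codes.tolist())                             # one integer code per input value
print(encoder.inverse_transform(codes).tolist())  # codes mapped back to the labels
```

Keeping the fitted encoder in a variable, as here, makes it possible to apply the same mapping to other data later with `encoder.transform(...)`. Calling `LabelEncoder().fit_transform(...)` in a single expression discards the mapping.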


In [None]:
# Early stage:

# for i in range(len(stores_train) ):

#     # if pd.isna( stores_train['chain_name'][i]) == False:
#     sha = hashlib.sha3_256()
#     value = stores_train['chain_name'][i]
#     sha.update(value.encode('utf-8'))
#     hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

#     stores_train.loc[i, 'chain_name'] = int(hashedInt)
    
#     # if pd.isna( stores_train['mall_name'][i]) == False:
#     sha = hashlib.sha3_256()
#     value = stores_train['mall_name'][i]
#     sha.update(value.encode('utf-8'))
#     hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

#     stores_train.loc[i,'mall_name'] = int(hashedInt)


# Later on:

stores_train['chain_name'] = LabelEncoder().fit_transform(stores_train['chain_name'])
stores_train['mall_name'] = LabelEncoder().fit_transform(stores_train['mall_name'])

In [None]:
stores_train.head(5)

We convert the columns to numeric types.

In [None]:
type(stores_train.loc[5, 'chain_name'] )
type(stores_train.loc[5, 'mall_name'] )

In [None]:
stores_train['mall_name'] = pd.to_numeric(stores_train['mall_name'])
stores_train['chain_name'] = pd.to_numeric(stores_train['chain_name'])

In [None]:
type(stores_train.loc[5, 'chain_name'] )
type(stores_train.loc[5, 'mall_name'] )

In [None]:
stores_train.dtypes

In [None]:
stores_train['chain_name'].value_counts()

Here we can see that the counts of the most frequent values match the data we extracted earlier:

- REMA FRANCHISE NORGE: 269 (one removed)
- JOKER: 164
- MIX: 115
- CIRCLE K DETALJIST: 115
- BUNNPRIS: 112


### Plaace Hierarchy


First, we will see whether there is a correlation between "plaace_hierarchy_id" and "revenue". If we find one, we will look at the possibility of merging in the data set "plaace_hierarchy".

In [None]:
# Re-import matplotlib.pyplot: the name `plt` may have been reassigned to a
# pandas Series in an earlier cell, which makes this cell crash otherwise.
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(figsize=(20, 18), ncols=2)
stores_train.groupby(["plaace_hierarchy_id"])['plaace_hierarchy_id'].count().plot(kind="barh", ax=ax1)
ax1.set_title('Number in each category')

stores_train.groupby(["plaace_hierarchy_id"])['revenue'].mean().plot(kind="barh", ax=ax2)
ax2.set_title('Average revenue for each category')
plt.show()

On the left-hand side you can see the number of stores belonging to each category. On the right-hand side, we see the average revenue per store. The average revenue clearly differs between categories, so the category is related to revenue. For example, "3.3.3.0" stands out: this category has many stores and almost no revenue.
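
The two bar charts can also be summarized in a single table, which makes it easier to spot categories with many stores and low revenue. This is a minimal sketch on made-up values; the column names `store_count` and `mean_revenue` are chosen for the example.

```python
import pandas as pd

# Toy frame standing in for stores_train (values are made up).
toy = pd.DataFrame({
    "plaace_hierarchy_id": ["1.1.1.0", "1.1.1.0", "3.3.3.0", "3.3.3.0", "3.3.3.0"],
    "revenue": [10.0, 20.0, 0.5, 1.0, 1.5],
})

# Named aggregation: one output column per (name=function) pair.
summary = (
    toy.groupby("plaace_hierarchy_id")["revenue"]
       .agg(store_count="count", mean_revenue="mean")
       .sort_values("mean_revenue")
)
print(summary)
```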

In [None]:
# Read plaace_hierarchy data 
plaace_hierarchy = pd.read_csv('data/plaace_hierarchy.csv')

# Merge stores_train with information about the hierarchy
stores_with_hierarchy = stores_train.merge(plaace_hierarchy, how='left', on='plaace_hierarchy_id')


stores_with_hierarchy.head()

In [None]:
stores_with_hierarchy['plaace_hierarchy_id'].equals(stores_with_hierarchy['lv4'])


A quick check to validate that the merge went as desired, and that the data in 'lv4' is equal to the 'plaace_hierarchy_id'.
This is the case, since the check returns True.
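
pandas can also perform part of this validation during the merge itself, through the `validate` and `indicator` parameters of `DataFrame.merge`. The sketch below uses made-up frames; the store IDs and descriptions are assumptions for illustration.

```python
import pandas as pd

# Toy frames standing in for stores_train and plaace_hierarchy (values are made up).
stores = pd.DataFrame({
    "store_id": ["s1", "s2", "s3"],
    "plaace_hierarchy_id": ["1.1.1.0", "3.3.3.0", "9.9.9.0"],
})
hierarchy = pd.DataFrame({
    "plaace_hierarchy_id": ["1.1.1.0", "3.3.3.0"],
    "lv4_desc": ["Category A", "Category B"],
})

merged = stores.merge(
    hierarchy,
    how="left",
    on="plaace_hierarchy_id",
    validate="many_to_one",  # raises MergeError if the right side has duplicate keys
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged) == len(stores), len(unmatched))
```

Here the row count is unchanged by the merge and one store has no match in the hierarchy table, which is the kind of case the equality check above would also reveal.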

In [None]:
# Early stage:

# for i in range(len(stores_with_hierarchy) ):
    
#     sha = hashlib.sha3_256()
#     value = str(stores_with_hierarchy['lv1_desc'][i] + str(stores_with_hierarchy['lv1'][i]))
#     sha.update(   (value.encode('utf-8'))     )
#     hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'little')  # type: ignore
#     stores_with_hierarchy.loc[i, 'lv1_desc'] = hashedInt
    
           
#     sha = hashlib.sha3_256()
#     value = str(stores_with_hierarchy['lv2_desc'][i] + str(stores_with_hierarchy['lv2'][i]))
#     sha.update(   (value.encode('utf-8'))     )
#     hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'little')  # type: ignore
#     stores_with_hierarchy.loc[i, 'lv2_desc'] = hashedInt
    
#     sha = hashlib.sha3_256()
#     value = str(stores_with_hierarchy['lv3_desc'][i] + str(stores_with_hierarchy['lv3'][i]))
#     sha.update(   (value.encode('utf-8'))     )
#     hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'little')  # type: ignore
#     stores_with_hierarchy.loc[i, 'lv3_desc'] = hashedInt
    
#     sha = hashlib.sha3_256()
#     value = str(stores_with_hierarchy['lv4_desc'][i] + str(stores_with_hierarchy['lv4'][i]))
#     sha.update(   (value.encode('utf-8'))     )
#     hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'little')  # type: ignore
#     stores_with_hierarchy.loc[i, 'lv4_desc'] = hashedInt
    
# stores_with_hierarchy

# Later on:
stores_with_hierarchy['lv1_desc'] = LabelEncoder().fit_transform(stores_with_hierarchy['lv1_desc'])
stores_with_hierarchy['lv2_desc'] = LabelEncoder().fit_transform(stores_with_hierarchy['lv2_desc'])
stores_with_hierarchy['lv3_desc'] = LabelEncoder().fit_transform(stores_with_hierarchy['lv3_desc'])
stores_with_hierarchy['lv4_desc'] = LabelEncoder().fit_transform(stores_with_hierarchy['lv4_desc'])

 

In [None]:
stores_with_hierarchy.head()

Several of the columns here contain the same information, so we remove the 'lv1' to 'lv4' ID columns, which are string values, and keep the description columns, which are now numeric.


In [None]:
stores_with_hierarchy.pop('lv1')
stores_with_hierarchy.pop('lv2')
stores_with_hierarchy.pop('lv3')
stores_with_hierarchy.pop('lv4')
# stores_with_hierarchy.pop('lv1_desc')
# stores_with_hierarchy.pop('lv2_desc')
# stores_with_hierarchy.pop('lv3_desc')
# stores_with_hierarchy.pop('lv4_desc')

### Grunnkrets Data
We will now take a closer look at the data that deals with Norway's basic districts.

In [None]:
districts = pd.read_csv('data/grunnkrets_norway_stripped.csv')
districts.head()

In [None]:
districts.shape


In [None]:
districts.describe()

In [None]:
districts.nunique()

In [None]:
districts.isnull().sum()

In [None]:
absolute_frequencies = districts['year'].value_counts()
absolute_frequencies

Here it looks like there is a large number of duplicates. There are 26536 rows; 13270 of these are dated 2015 and 13266 are dated 2016. Note that these two add up to 26536, the total number of rows.

We can also see that we have duplicates from the fact that there are 13270 unique values in 'grunnkrets_id'.

In [None]:
sub_districts = districts.loc[districts['grunnkrets_id'] == 10020901]                       
print(sub_districts) 

In [None]:
# 3	10020901	2015	Tregde	Tregde-Skjernøy	Mandal  MULTIPOLYGON
absolute_frequencies = sub_districts['geometry'].value_counts()
# MULTIPOLYGON
absolute_frequencies

In [None]:
sub_districts = districts.loc[districts['grunnkrets_id'] == 10030210]
# POLYGON
print(sub_districts) 

In [None]:
# 4	10030210	2015	Bryneheia	Vanse/Åpta	Farsund	POLYGON
absolute_frequencies = sub_districts['geometry'].value_counts()
# POLYGON
absolute_frequencies

In [None]:
districts.pop("year")
districts.head(2)

In [None]:
districts.nunique()

In [None]:
districts = districts.drop_duplicates()
districts


Here we can see that we are left with 13,270 rows out of the original 26,536, which means that 13,266 rows have been removed.

In [None]:
districts = districts.reset_index(drop=True)

In [None]:
districts.loc[districts['grunnkrets_id'] == 10020901]   


In [None]:
districts.shape

We no longer have two entries for this ID.

Furthermore, we check whether there are still any duplicates of the ID, since "drop_duplicates()" only removes rows where the entire row is identical.
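
The difference between full-row duplicates and duplicated IDs can be shown on a small example. The frame below is made up; the district name 'Nord-ny' is hypothetical and stands for an ID whose name changed between two years.

```python
import pandas as pd

# Toy frame standing in for districts after 'year' has been removed.
toy = pd.DataFrame({
    "grunnkrets_id": [1, 1, 2, 2],
    "district_name": ["Sentrum", "Sentrum", "Nord", "Nord-ny"],
})

full_row = toy.drop_duplicates()                                  # identical rows only
by_id = toy.drop_duplicates(subset="grunnkrets_id", keep="last")  # one row per id
repeated_ids = full_row[full_row.duplicated(subset="grunnkrets_id", keep=False)]

print(len(full_row), len(by_id))
print(repeated_ids)
```

If `repeated_ids` is empty on the real data, dropping full-row duplicates is enough; otherwise one has to decide which of the rows to keep, for example the most recent one.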

In [None]:
districts['grunnkrets_id'].value_counts()

Here we can see that no 'grunnkrets_id' has been registered more than once.

In [None]:
districts.head()

'grunnkrets_id' and 'grunnkrets_name' represent the same data, so we can remove one of them. We keep the ID, as it is numerical.

In [None]:
districts.pop('grunnkrets_name')

Early stage:
We want to preserve the values in 'district_name' and 'municipality_name', so we hash them into numerical values. We do not use the built-in hash function in Python, as it can return different output for the same input on different machines and in different runs.
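
A deterministic alternative to the built-in `hash()` can be built on `hashlib`, in the same spirit as the commented-out cells in this notebook. The sketch below is one way to do it; the function name `stable_hash` and the truncation to 8 bytes are choices made for the example.

```python
import hashlib

def stable_hash(text: str, n_bytes: int = 8) -> int:
    """Deterministic integer derived from the SHA-256 digest of `text`."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Truncating to n_bytes keeps the value within an unsigned 64-bit range
    # for n_bytes=8; this shortening is a choice of the sketch.
    return int.from_bytes(digest[:n_bytes], "big")

print(stable_hash("Sentrum") == stable_hash("Sentrum"))
print(stable_hash("Sentrum") == stable_hash("Sandviken"))
```

The same input always gives the same integer, independent of the machine or the Python process, whereas `hash()` on strings is salted per process unless `PYTHONHASHSEED` is fixed.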

Before we start hashing these values, we check and analyze them in advance, so that we can see afterwards that everything went as desired.

Later on:
We changed to using LabelEncoder().fit_transform().

In [None]:
districts['district_name'].value_counts()

In [None]:
districts['municipality_name'].value_counts()

In [None]:
districts.head()

In [None]:
districts._is_copy

In [None]:
districts.head()

Early stage:

It is important that the dataframe is not a copy. That is, if 'districts._is_copy' returns '<weakref .....>', the code will crash.

In [None]:
# for i in range(len(districts) ):
#     sha = hashlib.sha3_256()
#     value = districts['district_name'][i]
#     sha.update(value.encode('utf-8'))
#     hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

#     districts.loc[i, 'district_name'] = hashedInt



districts['district_municipality_name'] = districts['district_name'] + '-'+ districts['municipality_name']
districts.pop('district_name')
districts['district_municipality_name'] = LabelEncoder().fit_transform(districts['district_municipality_name'])
districts['municipality_name'] = LabelEncoder().fit_transform(districts['municipality_name'])


In [None]:
districts.head()

In [None]:
districts['municipality_name'].value_counts()

Most frequent 'municipality_name' values before encoding, for comparison:

- Oslo: 554
- Bærum: 431
- Trondheim: 429
- Bergen: 361
- Stavanger: 216

In [None]:
districts['district_municipality_name'].value_counts()

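Most frequent 'district_name' values before combining with the municipality, for comparison:

- Sentrum: 123
- Bergen sentrum: 47
- Konnerud: 42
- Ås: 40
- Sandviken: 37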

Here we can see that 'Sentrum' is no longer the most frequent value. This is what we wanted to achieve. It shows that several districts in different municipalities shared the same district name, for example 'Sentrum' in Oslo and 'Sentrum' in Trondheim.
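
The effect of combining the two names can be reproduced on a toy table. The rows below are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy data: two municipalities that both have a district called 'Sentrum'.
toy = pd.DataFrame({
    "district_name": ["Sentrum", "Sentrum", "Sentrum", "Sandviken"],
    "municipality_name": ["Oslo", "Trondheim", "Oslo", "Bergen"],
})

# Counting on the district name alone merges the two different 'Sentrum' districts.
by_name = toy["district_name"].value_counts()

# Combining district and municipality keeps them apart.
toy["district_municipality_name"] = toy["district_name"] + "-" + toy["municipality_name"]
by_combined = toy["district_municipality_name"].value_counts()

print(by_name["Sentrum"])
print(by_combined["Sentrum-Oslo"])
```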

We now check whether there are any 'grunnkrets_id' values that are not in 'stores_with_hierarchy'.

In [None]:
list_from_districts = districts['grunnkrets_id']
list_from_districts
list_from_districts = list_from_districts.reset_index(drop=True)
list_from_districts.nunique()

In [None]:
list_from_stores_with_hierarchy = stores_with_hierarchy['grunnkrets_id']
list_from_stores_with_hierarchy.nunique()

This means that there exist 9481 'grunnkrets_id' values that are not in the data set 'stores_train'.
This is not necessarily a problem, as there are many small places in Norway that do not have shops. The number may decrease if we manage to bring in a few more from the data set 'stores_extra'.

The deciding factor here is how large a proportion of the shops in the test set lie within these grunnkrets.

We assume that 'stores_train' does not contain any 'grunnkrets_id' that is missing from 'grunnkrets_norway_stripped', but we have to check this.


We later found that this assumption was wrong. It meant a little more work for us, but we managed to solve it.

In [None]:
stores_id = list_from_stores_with_hierarchy
districts_id = list_from_districts



stores_id

In [None]:
missing_id = []
for i in range(len(stores_id) ):
    if stores_id[i] not in districts_id.values:
        missing_id.append(stores_id[i])

print( len(missing_id) )

This does not look optimal. We have to generate data here, or find some other smart solution.

There are therefore 30 instances of 'grunnkrets_id' that have a shop, for which we have no data available in the data set 'grunnkrets_norway_stripped'.

The same problem can arise in the test data. We therefore think it is appropriate to write a function that handles this automatically, so that we can use it both when training and when testing our model.
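
As a starting point for such a function, the lookup of missing ids can be written without an explicit loop. This is a minimal sketch and not the function we ended up using; `find_missing_ids` is a hypothetical helper and the ids are made up.

```python
import pandas as pd

def find_missing_ids(stores_ids: pd.Series, district_ids: pd.Series) -> pd.Series:
    """Return the entries of stores_ids that do not occur in district_ids."""
    return stores_ids[~stores_ids.isin(district_ids)]

# Toy example with made-up ids.
stores_ids = pd.Series([101, 102, 103, 104])
district_ids = pd.Series([101, 103, 999])

missing = find_missing_ids(stores_ids, district_ids)
print(missing.tolist())
```

Because the helper only takes two id columns, the same call could be applied to both the training and the test stores.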

### Extra Stores 
The extra stores dataset is a collection of stores for which we had no revenue data. Structurally, it is identical to the test set, but you are naturally not expected to submit any predictions for it. You can, however, use the additional data in your analysis, in unsupervised methods you might employ, or to provide a stronger data basis for missing value imputation.

In [None]:
stores_extra = pd.read_csv('data/stores_extra.csv')
stores_extra.head()

In [None]:
stores_extra.shape

In [None]:
stores_extra.pop("year")
stores_extra.pop("sales_channel_name")
stores_extra.isnull().sum()



In [None]:
stores_extra.nunique()

Maybe the data set contains many of the same stores as 'stores_train', but with data on 'chain_name' and 'mall_name'.

In [None]:
#  Do not remove the code - takes forever to run

# match_id = []
# stores_train_test = stores_train_test.reset_index(drop=True)

# for i in range(len(stores_train_test) ):
#     if stores_train_test['store_id'][i] in stores_extra.values:
#         match_id.append(stores_extra[i])
# print( 'Len: ', len(match_id) )
# print( "Id's ", (match_id) )
        
# this code returns 0

None of the 'store_id' values are repeated in the data set, so we can drop the idea of using it to add or update data.
We will therefore not do any more work with this data set.
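
The same overlap check can be done much faster with sets. The sketch below uses made-up id lists in place of the real 'store_id' columns.

```python
# Hypothetical id lists standing in for the 'store_id' columns of the two tables.
train_ids = ["a-1", "a-2", "a-3"]
extra_ids = ["b-1", "a-2", "b-2"]

# Set intersection avoids scanning the whole second table once per row.
shared_ids = set(train_ids) & set(extra_ids)
print(len(shared_ids))
```

With the real data, the two lists would be replaced by the 'store_id' column of each dataframe.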


In [None]:
# At this moment we don't need this code, but there is a possibility that we will need it later on.
# Do not remove the code


# for i in range(len(stores_with_hierarchy) ):
    
 
#     chain_int = int(stores_with_hierarchy['mall_name'][i])
#     lv1_desc_int = int(stores_with_hierarchy['lv1_desc'][i])
#     lv2_desc_int = int(stores_with_hierarchy['lv2_desc'][i])
#     lv3_desc_int = int(stores_with_hierarchy['lv3_desc'][i])
#     lv4_desc_int = int(stores_with_hierarchy['lv4_desc'][i])
   
#     stores_with_hierarchy.loc[i, 'lv1_desc'] = lv1_desc_int
#     stores_with_hierarchy.loc[i, 'lv2_desc'] = lv2_desc_int
#     stores_with_hierarchy.loc[i, 'lv3_desc'] = lv3_desc_int
#     stores_with_hierarchy.loc[i, 'lv4_desc'] = lv4_desc_int

In [None]:
# At this moment we don't need this code, but there is a possibility that we will need it later on.
# Do not remove the code


# stores_with_hierarchy['chain_name'] = pd.to_numeric(stores_with_hierarchy['chain_name'])
# stores_with_hierarchy['mall_name'] = pd.to_numeric(stores_with_hierarchy['mall_name'])
# stores_with_hierarchy['lv1_desc'] = pd.to_numeric(stores_with_hierarchy['lv1_desc'])
# stores_with_hierarchy['lv2_desc'] = pd.to_numeric(stores_with_hierarchy['lv2_desc'])
# stores_with_hierarchy['lv3_desc'] = pd.to_numeric(stores_with_hierarchy['lv3_desc'])
# stores_with_hierarchy['lv4_desc'] = pd.to_numeric(stores_with_hierarchy['lv4_desc'])

### Household income 

In [None]:
grunnkrets_household_income = pd.read_csv('data/grunnkrets_income_households.csv')
grunnkrets_household_income.head(25)

Here you can see that we have many zero values. We therefore chose not to work further with the affected columns, and removed them.

In hindsight, we should have spent more time going through this data. But with so many zero values, we assumed that our time was better spent on other things.

In [None]:
grunnkrets_household_income.pop("singles")
grunnkrets_household_income.pop("couple_without_children")
grunnkrets_household_income.pop("couple_with_children")
grunnkrets_household_income.pop("other_households")
grunnkrets_household_income.pop("single_parent_with_children")

In [None]:
grunnkrets_household_income.head()


In [None]:
absolute_frequencies = grunnkrets_household_income['year'].value_counts()
absolute_frequencies

In [None]:
print(grunnkrets_household_income.shape)
income_sub = grunnkrets_household_income
# income_sub.pop("year")
income_sub = income_sub.drop_duplicates()
print(income_sub.shape)
income_sub.head()


In [None]:
income_sub = income_sub[income_sub.year != 2015]
income_sub.head(2)

In [None]:
absolute_frequencies = income_sub['grunnkrets_id'].value_counts()
absolute_frequencies

In [None]:
districts.head(2)

Check some random samples that we know belong to the same district, to analyse the income:
11100
12427
6466

In [None]:
income_sub.loc[income_sub['grunnkrets_id'] == 6466]  

In [None]:
income_sub.loc[income_sub['grunnkrets_id'] == 12427]  

In [None]:
income_sub.loc[income_sub['grunnkrets_id'] == 11100]  

In [None]:
absolute_frequencies = income_sub['all_households'].value_counts()
absolute_frequencies


In [None]:
absolute_frequencies = stores_with_hierarchy['district_municipality_name'].value_counts()
absolute_frequencies




In [None]:
income_sub.head()

In [None]:
income_districts = pd.merge(income_sub,districts, how = 'right', on = 'grunnkrets_id')
income_districts.head(2)

In [None]:
income_districts['grunnkrets_id'].isna().sum()

In [None]:
income_districts.shape

In [None]:
income_districts.rename(columns = {'all_households':'households_grunnkrets'}, inplace = True)
income_districts.head()

In [None]:
income_districts.loc[income_districts['district_municipality_name'] == 64.0]  

We have so far not used 'geometry' and 'area_km2', so we remove these.

We later realized that we needed 'geometry' to generate values for the columns where we lacked data.

In [None]:
income_districts.pop("geometry")
income_districts.pop("area_km2")
income_districts.head(3)

In [None]:
stores_with_hierarchy = pd.merge(stores_with_hierarchy,income_districts, how = 'left', on = 'grunnkrets_id')
stores_with_hierarchy.head()

In [None]:
stores_with_hierarchy_and_income = stores_with_hierarchy
stores_with_hierarchy_and_income.head()

In [None]:
absolute_frequencies = stores_with_hierarchy_and_income['district_municipality_name'].value_counts()
absolute_frequencies

In [None]:
# income_districts['all_households_district'] = 
# income_districts_sub = income_districts
income_per_district = stores_with_hierarchy_and_income.groupby(['district_municipality_name'])['households_grunnkrets'].mean()
# income_districts = income_districts.reset_index(drop=True)
# income_districts_sub
# print( type(income_districts_sub) )
# income_districts_sub.DataFrame(data = income_districts_sub) 
income_per_district


In [None]:
# Check up on this id: 100041825566693872077185226859240897274106223553590114601070335315116289585880
stores_with_hierarchy_and_income.loc[stores_with_hierarchy_and_income['district_municipality_name'] == 1539.0]  
# income_districts.loc[income_districts['grunnkrets_id'] == "10010701"]  
# income_districts.head(5)

It seems very strange that all the grunnkrets within a district have the same income. We will have to look into this further.
We chose to take some random samples of several of the 'grunnkrets_id' values we see here, and implemented these checks further up in the code.

After this check, we had a strong suspicion that the data in this table is actually recorded per district and not per grunnkrets.

After reviewing the information we had about the data, we found that this was correct. It was thus not necessary to group by district and calculate the average income.
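
A systematic version of the spot checks could count the distinct income values per district. The sketch below uses a made-up toy table; the column names `district` and `income` are placeholders, not the real column names.

```python
import pandas as pd

# Toy table: made-up incomes for grunnkrets in two districts.
toy = pd.DataFrame({
    "grunnkrets_id": [1, 2, 3, 4, 5],
    "district": ["A", "A", "A", "B", "B"],
    "income": [500_000, 500_000, 500_000, 420_000, 420_000],
})

# Number of distinct income values per district.
distinct_incomes = toy.groupby("district")["income"].nunique()

# If every district has exactly one distinct value, the income is in
# practice a district-level quantity and averaging per district adds nothing.
is_district_level = bool((distinct_incomes == 1).all())
print(is_district_level)
```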

In [None]:
# income_districts.rename(columns = {'households':'households_grunnkrets'}, inplace = True)

# Merge the district income into the stores_train df
# income_districts_all = pd.merge(income_districts,income_districts_sub, how = 'inner', on = 'district_municipality_name')
# income_districts_all.head(25)




# stores_with_hierarchy = pd.merge(stores_with_hierarchy,income_districts, how = 'left', on = 'grunnkrets_id')

In [None]:
# No-op rename: the column is already called 'households_grunnkrets'
income_districts.rename(columns = {'households_grunnkrets':'households_grunnkrets'}, inplace = True)

In [None]:
#Checkpoint to save time

# SAVE
# stores_with_hierarchy.to_csv('data/stores_with_hierarchy_copy.csv')
# districts.to_csv('data/districts_copy.csv')
# household_income.to_csv('data/household_income_copy.csv')
# stores_with_hierarchy_and_income.to_csv('data/stores_with_hierarchy_and_income_copy.csv')


# LOAD
# stores_with_hierarchy = pd.read_csv('data/stores_with_hierarchy_copy.csv')
# stores_with_hierarchy.pop('Unnamed: 0')
# districts = pd.read_csv('data/districts_copy.csv')
# districts.pop('Unnamed: 0')
# household_income = pd.read_csv('data/household_income_copy.csv')
# household_income.pop('Unnamed: 0')
# stores_with_hierarchy_and_income = pd.read_csv('data/stores_with_hierarchy_and_income_copy.csv')
# stores_with_hierarchy_and_income.pop('Unnamed: 0')

### Buss

In [None]:
buss = pd.read_csv('data/busstops_norway.csv')
buss.head(5)

In [None]:
buss.describe()

In [None]:
buss.nunique()

In [None]:
buss.shape

In [None]:
buss.isnull().sum()

At first glance, 'side_placement' does not seem to be of much interest here.

In [None]:
absolute_frequencies = buss['stopplace_type'].value_counts()
absolute_frequencies

In [None]:
absolute_frequencies = buss['importance_level'].value_counts()
absolute_frequencies

'stopplace_type' doesn't look too interesting here either. 'importance_level', on the other hand, may be of interest. Unfortunately, 55514 rows have a 'Missing importance level' value, which can be considered missing data.

In [None]:
buss.pop('geometry')
buss.pop('stopplace_type')


One could probably look for a correlation between 'stopplace_type' and 'importance_level', and use it to fill in the missing data. However, we chose not to set aside time for this, as we did not think it was worth the effort for what we wanted to achieve.

### Testing with mall_name

In [None]:
stores = pd.read_csv('data/stores_train.csv')
# Take an explicit copy so that adding columns does not write into a view of 'stores'
mall_subset = stores[['store_id', 'mall_name','revenue']].copy()
mall_subset['mall_null'] = ''
mall_subset['mall_en'] = ''
mall_subset['mall_to'] = ''
mall_subset['mall_tre'] = ''
mall_subset['mall_fire'] = ''
mall_subset['mall_fem'] = ''
mall_subset['mall_seks'] = ''
mall_subset['hmall_null'] = ''
mall_subset['hmall_en'] = ''
mall_subset['hmall_to'] = ''
mall_subset['hmall_tre'] = ''
mall_subset['hmall_fire'] = ''
mall_subset['hmall_fem'] = ''
mall_subset['hmall_seks'] = ''
mall_subset.head(1)


In [None]:
# null
mall_subset['mall_name'] = stores['mall_name'].replace(np.nan, "A-not a mall")

threshold = 0 # Anything that occurs less than this will be removed.
for col in mall_subset.columns:
    value_counts = mall_subset['mall_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    mall_subset[col].replace(to_remove, "A-not a mall", inplace=True)
    
mall_subset['mall_null'] = LabelEncoder().fit_transform(mall_subset['mall_name'])

for i in range(len(mall_subset) ):
    value = mall_subset['mall_name'][i]
    hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

    # Store the SHA-256 hash of the mall name as an integer
    mall_subset.loc[i,'hmall_null'] = int(hashedInt)


# en
threshold = 1 # Anything that occurs less than this will be removed.
for col in mall_subset.columns:
    value_counts = mall_subset['mall_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    mall_subset[col].replace(to_remove, "A-not a mall", inplace=True)
    
mall_subset['mall_en'] = LabelEncoder().fit_transform(mall_subset['mall_name'])

for i in range(len(mall_subset) ):
    sha = hashlib.sha3_256()
    value = mall_subset['mall_name'][i]
    sha.update(value.encode('utf-8'))
    hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

    mall_subset.loc[i,'hmall_en'] = int(hashedInt)
    

#  to
threshold = 2 # Anything that occurs less than this will be removed.
for col in mall_subset.columns:
    value_counts = mall_subset['mall_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    mall_subset[col].replace(to_remove, "A-not a mall", inplace=True)
 
mall_subset['mall_to'] = LabelEncoder().fit_transform(mall_subset['mall_name'])

   
for i in range(len(mall_subset) ):
    sha = hashlib.sha3_256()
    value = mall_subset['mall_name'][i]
    sha.update(value.encode('utf-8'))
    hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

    mall_subset.loc[i,'hmall_to'] = int(hashedInt)
    
    
    
# tre
threshold = 3 # Anything that occurs less than this will be removed.
for col in mall_subset.columns:
    value_counts = mall_subset['mall_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    mall_subset[col].replace(to_remove, "A-not a mall", inplace=True)
 
mall_subset['mall_tre'] = LabelEncoder().fit_transform(mall_subset['mall_name'])

   
for i in range(len(mall_subset) ):
    sha = hashlib.sha3_256()
    value = mall_subset['mall_name'][i]
    sha.update(value.encode('utf-8'))
    hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

    mall_subset.loc[i,'hmall_tre'] = int(hashedInt)
    
    
#  Fire    
threshold = 4 # Anything that occurs less than this will be removed.
for col in mall_subset.columns:
    value_counts = mall_subset['mall_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    mall_subset[col].replace(to_remove, "A-not a mall", inplace=True)
 
mall_subset['mall_fire'] = LabelEncoder().fit_transform(mall_subset['mall_name'])

   
for i in range(len(mall_subset) ):
    sha = hashlib.sha3_256()
    value = mall_subset['mall_name'][i]
    sha.update(value.encode('utf-8'))
    hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

    mall_subset.loc[i,'hmall_fire'] = int(hashedInt)
    
    
    
# Fem
threshold = 5 # Anything that occurs less than this will be removed.
for col in mall_subset.columns:
    value_counts = mall_subset['mall_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    mall_subset[col].replace(to_remove, "A-not a mall", inplace=True)
 
mall_subset['mall_fem'] = LabelEncoder().fit_transform(mall_subset['mall_name'])

   
for i in range(len(mall_subset) ):
    sha = hashlib.sha3_256()
    value = mall_subset['mall_name'][i]
    sha.update(value.encode('utf-8'))
    hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

    mall_subset.loc[i,'hmall_fem'] = int(hashedInt)
    
    
# seks
threshold = 6 # Anything that occurs less than this will be removed.
for col in mall_subset.columns:
    value_counts = mall_subset['mall_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    mall_subset[col].replace(to_remove, "A-not a mall", inplace=True)
 
mall_subset['mall_seks'] = LabelEncoder().fit_transform(mall_subset['mall_name'])

   
for i in range(len(mall_subset) ):
    sha = hashlib.sha3_256()
    value = mall_subset['mall_name'][i]
    sha.update(value.encode('utf-8'))
    hashedInt = int.from_bytes(hashlib.sha256(value.encode('utf-8')).digest(), 'big')

    mall_subset.loc[i,'hmall_seks'] = int(hashedInt)
    
    
mall_subset['hmall_null'] = pd.to_numeric(mall_subset['hmall_null'])
mall_subset['hmall_en'] = pd.to_numeric(mall_subset['hmall_en'])
mall_subset['hmall_to'] = pd.to_numeric(mall_subset['hmall_to'])
mall_subset['hmall_tre'] = pd.to_numeric(mall_subset['hmall_tre'])
mall_subset['hmall_fire'] = pd.to_numeric(mall_subset['hmall_fire'])
mall_subset['hmall_fem'] = pd.to_numeric(mall_subset['hmall_fem'])
mall_subset['hmall_seks'] = pd.to_numeric(mall_subset['hmall_seks'])

In [None]:
data_only_numeric = mall_subset.drop(columns=["mall_name"],axis=1)

corr = data_only_numeric.corr()
g, ax = plt.subplots(figsize=(15,15))  
sns.heatmap(corr, color="k", annot=True, cmap="YlGnBu", ax=ax)

This shows that the linear correlation is actually stronger when all malls are included, even those consisting of only one store. The correlation is 0.025, which is obviously low, but still higher than the values we found previously.

Now that we have found that all malls should be included, we will test whether a Boolean value works better.

In [None]:
mall_subset['mall_bool'] = stores['mall_name']
mall_subset['mall_bool'] = mall_subset['mall_bool'].replace(np.nan, 0)

for i in range(len(mall_subset) ):
    if( mall_subset['mall_bool'][i] != 0):

        mall_subset.loc[i,'mall_bool'] = 1
        
mall_subset['mall_bool'] = pd.to_numeric(mall_subset['mall_bool'])
mall_subset.head(10)

In [None]:
data_only_numeric = mall_subset.drop(columns=["mall_name"],axis=1)

corr = data_only_numeric.corr()
g, ax = plt.subplots(figsize=(15,15))  
sns.heatmap(corr, color="k", annot=True, cmap="YlGnBu", ax=ax)

Here we can see that the Boolean value has a greater linear correlation with revenue.
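
The Boolean column can also be built without a row-by-row loop. This is a minimal sketch on a made-up column, where NaN stands for "not in a mall" as in 'stores_train'.

```python
import numpy as np
import pandas as pd

# Toy column where NaN means the store is not in a mall (names are made up).
mall_name = pd.Series(["Mall A", np.nan, "Mall B", np.nan])

# 1 if the store has a mall name, 0 otherwise.
mall_bool = mall_name.notna().astype(int)
print(mall_bool.tolist())
```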


### Testing with chain_name

#### Stores train

In [None]:
stores = pd.read_csv('data/stores_train.csv')

# Take an explicit copy so that adding columns does not write into a view of 'stores'
chain = stores[['store_id', 'chain_name','revenue']].copy()

chain['chain_null'] = ''
chain['chain_en'] = ''
chain['chain_to'] = ''
chain['chain_tre'] = ''
chain['chain_fire'] = ''
chain['chain_fem'] = ''
chain['chain_seks'] = ''
chain['chain_syv'] = ''
chain['chain_åtte'] = ''

# Added after the first iteration
chain['chain_null_bool'] = ''
chain['chain_to_bool'] = ''
chain['chain_fire_bool'] = ''
chain['chain_seks_bool'] = ''




chain.head()


In [None]:
# null
chain['chain_name'] = stores['chain_name'].replace(np.nan, "A-not a chain")
threshold = 0 # Anything that occurs less than this will be removed.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
       
chain['chain_null'] = LabelEncoder().fit_transform(chain['chain_name'])

chain['chain_null_bool'] = chain['chain_name']
chain['chain_null_bool'] = chain['chain_null_bool'].replace("A-not a chain", 0)

for i in range(len(chain) ):
    if( chain['chain_null_bool'][i] != 0):

        chain.loc[i,'chain_null_bool'] = 1
        
chain['chain_null_bool'] = pd.to_numeric(chain['chain_null_bool'])



# en
threshold = 1 # Anything that occurs less than this will be removed.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
    
chain['chain_en'] = LabelEncoder().fit_transform(chain['chain_name'])



# to
threshold = 2 # Anything that occurs less than this will be removed.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
    
chain['chain_to'] = LabelEncoder().fit_transform(chain['chain_name'])

chain['chain_to_bool'] = chain['chain_name']
chain['chain_to_bool'] = chain['chain_to_bool'].replace("A-not a chain", 0)

for i in range(len(chain) ):
    if( chain['chain_to_bool'][i] != 0):

        chain.loc[i,'chain_to_bool'] = 1     
chain['chain_to_bool'] = pd.to_numeric(chain['chain_to_bool'])



# tre
threshold = 3 # Anything that occurs less than this will be removed.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
    
chain['chain_tre'] = LabelEncoder().fit_transform(chain['chain_name'])



# fire
threshold = 4 # Anything that occurs less than this will be removed.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
    
chain['chain_fire'] = LabelEncoder().fit_transform(chain['chain_name'])   
           
chain['chain_fire_bool'] = chain['chain_name']
chain['chain_fire_bool'] = chain['chain_fire_bool'].replace("A-not a chain", 0)

for i in range(len(chain) ):
    if( chain['chain_fire_bool'][i] != 0):

        chain.loc[i,'chain_fire_bool'] = 1
        
chain['chain_fire_bool'] = pd.to_numeric(chain['chain_fire_bool'])


# fem
threshold = 5 # Anything that occurs less than this will be removed.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
    
chain['chain_fem'] = LabelEncoder().fit_transform(chain['chain_name'])




# seks
threshold = 6 # Anything that occurs less than this will be removed.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
    
chain['chain_seks'] = LabelEncoder().fit_transform(chain['chain_name'])



chain['chain_seks_bool'] = chain['chain_name']
chain['chain_seks_bool'] = chain['chain_seks_bool'].replace("A-not a chain", 0)

for i in range(len(chain) ):
    if( chain['chain_seks_bool'][i] != 0):

        chain.loc[i,'chain_seks_bool'] = 1
            
chain['chain_seks_bool'] = pd.to_numeric(chain['chain_seks_bool'])








# syv
threshold = 7 # Anything that occurs less than this will be removed.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
    
chain['chain_syv'] = LabelEncoder().fit_transform(chain['chain_name'])




# åtte
threshold = 8 # Anything that occurs this many times or fewer will be replaced.
for col in chain.columns:
    value_counts = chain['chain_name'].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    chain[col].replace(to_remove, "A-not a chain", inplace=True)
    
chain['chain_åtte'] = LabelEncoder().fit_transform(chain['chain_name'])

chain.head(7)

In [None]:
import matplotlib.pyplot as plt

data_only_numeric = chain.drop(columns=["chain_name"],axis=1)

corr = data_only_numeric.corr()
g, ax = plt.subplots(figsize=(15,15))  
sns.heatmap(corr, color="k", annot=True, cmap="YlGnBu", ax=ax)

#### Population


In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
grunnkrets = pd.read_csv('data/grunnkrets_norway_stripped.csv')
grunnkrets.head()

In [None]:
grunnkrets.nunique()

## Feature Engineering


### Population

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
grunnkrets = pd.read_csv('data/grunnkrets_norway_stripped.csv')
grunnkrets_ages = pd.read_csv('data/grunnkrets_age_distribution.csv')

In [None]:
# Concatenate population with other generated data
population = get_population(stores_train, grunnkrets, grunnkrets_ages)
population.head()

In [None]:
def get_population (stores_data, grunnkrets_data, grunnkrets_ages_data):
    """ Returns the population of the grunnkrets and district for each store.
        Adjust or remove the last line of code if all columns are needed.
        May require some more work for imputation of 'grunnkrets_population' using data from 'stores_extra' using 'lat' and 'lon'.
        Possibility of using unsupervised learning?
        Can be used for training only. Need to see how to make it work for test data as well,
        else a separate function is needed.
    """
    #Get data for every grunnkrets_id and drop duplicates. Prioritize the year '2016'
    grunnkrets = grunnkrets_data.sort_values('year', ascending=False).drop_duplicates('grunnkrets_id').sort_index()
    
    #Create District+Municipality
    grunnkrets_merged_district_municipality_name = grunnkrets
    grunnkrets_merged_district_municipality_name['district_name_pro'] = grunnkrets['district_name'] + ' '+ grunnkrets['municipality_name']
    
    #Drop columns except 'grunnkrets_id' and 'district_municipality_name'
    grunnkrets_with_district_municipality_names = grunnkrets_merged_district_municipality_name
    grunnkrets_with_district_municipality_names = grunnkrets_with_district_municipality_names.drop(grunnkrets_with_district_municipality_names.iloc[:,1:5].columns,axis =1)
    
    grunnkrets_with_district_municipality_names = grunnkrets_with_district_municipality_names.drop('area_km2', axis =1)
    
    #Get data for every grunnkrets_id and drop duplicates. Prioritize the year '2016'
    grunnkrets_ages_new = grunnkrets_ages_data.sort_values('year', ascending=False).drop_duplicates('grunnkrets_id').sort_index()
    
    #Sum all ages in grunnkrets
    grunnkrets_ages_new['population'] = grunnkrets_ages_new.iloc[:,2:].sum(axis =1)
    
    #Clean: Drop all age columns including year column
    grunnkrets_population = grunnkrets_ages_new
    grunnkrets_population = grunnkrets_population.drop(grunnkrets_population.iloc[:,1:93].columns,axis =1)
    
    grunnkrets_population_dist_muni = grunnkrets_population
    grunnkrets_population_dist_muni = pd.merge(grunnkrets_with_district_municipality_names,grunnkrets_population, how = 'left', on = 'grunnkrets_id')
    
    grunnkrets_dist_muni = grunnkrets_population_dist_muni
    grunnkrets_dist_muni = grunnkrets_dist_muni.drop(['population'], axis = 1)
    
    #Merge only the district names first, so that missing population can be found before merging
    merge_grunnkrets_populn_stores_train = pd.merge(stores_data,grunnkrets_dist_muni, how = 'left', on = 'grunnkrets_id')
    
    #Add population of district
    st_train_grunn_pp = pd.merge(merge_grunnkrets_populn_stores_train,grunnkrets_population, how = 'left', on = 'grunnkrets_id')
    
    #Get Population of District
    population_dist_muni = st_train_grunn_pp
    population_dist_muni = population_dist_muni.groupby('district_name_pro')['population'].sum()
    
    #Merge
    st_train_grunn_pp_dist_pp = pd.merge(st_train_grunn_pp,population_dist_muni, how = 'left', on = 'district_name_pro')
    
    st_train_grunn_pp_dist_pp.rename(columns = {'population_x':'grunnkrets_population','population_y': 'district_population'}, inplace = True)
    
    st_train_grunn_pp_dist_pp['geometry'] = gpd.GeoSeries.from_wkt(st_train_grunn_pp_dist_pp['geometry'])

    store_gdf = gpd.GeoDataFrame(st_train_grunn_pp_dist_pp, geometry='geometry')
    store_gdf = store_gdf.drop_duplicates()
    
    for index, row in st_train_grunn_pp_dist_pp.iterrows():
        # print("row", row)
    
        if pd.isnull(row['grunnkrets_population']):
            lat = row['lat']
            lon = row['lon']
            
            store_location = Point(lon, lat)
        
            polygon_indices = store_gdf.distance(store_location).sort_values().index[0:150] #lower values returns missing data for grunnkrets_population
            #cannot guarantee accuracy of imputed missing population
            nearest_grunnkretser = store_gdf.loc[polygon_indices]
            
            st_train_grunn_pp_dist_pp.loc[index, 'grunnkrets_population'] = np.floor(nearest_grunnkretser['grunnkrets_population'].mean())
        
    for index, row in st_train_grunn_pp_dist_pp.iterrows():        
        if pd.isnull(row['district_population']):
            lat = row['lat']
            lon = row['lon']
            
            store_location = Point(lon, lat)
        
            polygon_indices = store_gdf.distance(store_location).sort_values().index[0:4]
            nearest_grunnkretser = store_gdf.loc[polygon_indices]
            
            st_train_grunn_pp_dist_pp.loc[index, 'district_population'] = np.floor(nearest_grunnkretser['district_population'].mean())
    
    st_train_grunn_pp_dist_pp = st_train_grunn_pp_dist_pp.drop(st_train_grunn_pp_dist_pp.iloc[:,1:14].columns,axis =1) #Remove this line if all columns are needed
    return st_train_grunn_pp_dist_pp

### Busstops
##### Add column with distance from closest bus stop



For the bus stops we wanted to calculate the distance from each store to the closest bus stop, as we thought this would be an interesting feature. We also wanted to keep the importance levels of the bus stops, as an important bus stop nearby will be more valuable for a store than a rarely used one.

We started out by importing the file we were given, and dropped the columns we found less helpful during the EDA, i.e. stopplace_type and side_placement.

After that we replaced the categorical values with numerical ones, and inserted columns in the table for latitude and longitude, so that it would be easier to compute the distance (as opposed to using the "geometry" of the bus stops). We then dropped geometry, since we had the relevant information saved in other columns.

In [None]:
import math

busstops = pd.read_csv('./data/busstops_norway.csv')
busstops = busstops.drop(columns=["stopplace_type", "side_placement"])

# Map each importance level to an ordinal value (1 = level missing, 6 = national hub)
importance_levels = ["Mangler viktighetsnivå", "Standard holdeplass", "Lokalt knutepunkt",
                     "Regionalt knutepunkt", "Annen viktig holdeplass", "Nasjonalt knutepunkt"]
numerated_importance_levels = [1, 2, 4, 5, 3, 6]

busstops["importance_level"] = busstops["importance_level"].replace(importance_levels, numerated_importance_levels)

busstops.insert(3, "lat", -math.inf)
busstops.insert(4, "lon", -math.inf)

# The geometry string stores longitude first, then latitude
busstops_array = []
for row_index in range(len(busstops)):
    coordinates = busstops["geometry"][row_index][6:-2].split(' ')
    # Use .at to write into the DataFrame itself rather than a chained-indexing copy
    busstops.at[row_index, "lon"] = float(coordinates[0])
    busstops.at[row_index, "lat"] = float(coordinates[1])
    busstops_array.append([float(coordinates[1]), float(coordinates[0])])
busstops = busstops.drop(columns="geometry")
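
The coordinate extraction above can also be done without a Python loop. The following is a minimal sketch, assuming the geometry column holds WKT-style strings of the form `POINT(lon lat)`. The sample rows and the helper name `parse_point_column` are made up for illustration and are not part of our data or pipeline.

```python
import pandas as pd

def parse_point_column(geometry: pd.Series) -> pd.DataFrame:
    """Extract lon/lat floats from strings such as 'POINT(10.75 59.91)'."""
    # Two named capture groups: longitude first, then latitude (assumed order).
    pattern = (
        r"POINT\s*\(\s*(?P<lon>-?\d+(?:\.\d+)?)"
        r"\s+(?P<lat>-?\d+(?:\.\d+)?)\s*\)"
    )
    return geometry.str.extract(pattern).astype(float)

# Hypothetical sample rows, only to show the call.
sample = pd.DataFrame({"geometry": ["POINT(10.75 59.91)", "POINT(5.32 60.39)"]})
coords = parse_point_column(sample["geometry"])
sample = sample.join(coords)  # adds the 'lon' and 'lat' columns
```

Rows that do not match the pattern come back as NaN, which makes malformed geometry strings easy to spot.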


Next, we find the closest bus stop and return the distance to it together with its importance level. Note that the value returned is the squared Euclidean distance in degrees of latitude and longitude, not a distance in metres. This function can then be used to add columns to the store data for the distance to the closest bus stop and the importance of that bus stop.

In [None]:
def getDistFromBusStop(store_lat, store_lon):
    # Squared Euclidean distance (in degrees) from the store to every bus stop
    busstops_big_array = np.asarray(busstops_array)
    distances = np.sum((busstops_big_array-[store_lat, store_lon])**2, axis=1)
    # Pick the closest stop and return its distance and importance level
    index_busstop = np.argmin(distances)
    shortest_distance = distances[index_busstop]
    return shortest_distance, busstops.at[index_busstop, "importance_level"]
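
The function above compares squared differences in degrees, which treats a degree of longitude as equal to a degree of latitude. As a hedged alternative (not what we used for our submission), the sketch below finds the nearest stop by great-circle distance with the haversine formula. The function name `nearest_stop_haversine`, the Earth radius constant and the example coordinates are assumptions for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def nearest_stop_haversine(store_lat, store_lon, stops_lat_lon):
    """Return (distance_km, index) of the stop closest to the store.

    stops_lat_lon is an (n, 2) array-like of [lat, lon] pairs in degrees.
    """
    stops = np.radians(np.asarray(stops_lat_lon, dtype=float))
    lat1, lon1 = np.radians(store_lat), np.radians(store_lon)
    dlat = stops[:, 0] - lat1
    dlon = stops[:, 1] - lon1
    # Haversine formula for the great-circle distance on a sphere
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(stops[:, 0]) * np.sin(dlon / 2) ** 2
    distances_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    index = int(np.argmin(distances_km))
    return float(distances_km[index]), index

# Hypothetical stops: one roughly in Oslo, one roughly in Bergen.
stops = [[59.91, 10.75], [60.39, 5.32]]
distance_km, index = nearest_stop_haversine(59.92, 10.76, stops)
```

Because the result is in kilometres, it is easier to interpret than a value in squared degrees, although we have not tested whether it would change the model's performance.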

After this, we wanted a feature that combined these two features, which we ended up simply calling "buss". This is the log10 of the product of the distance and the importance level. We used the log instead of the raw value because we plotted the feature importance before and after applying log10 to the buss feature, and its importance improved with log10.

In [None]:
def addWeightedBusData(df):
    for index, row in df.iterrows():
        # Adding distance to closest busstop
        distance, importance_level = getDistFromBusStop(row['lat'], row['lon'])
        df.at[index, "distance_from_busstop"] = distance
        df.at[index, "busstop_importance_level"]= importance_level
        
    df.insert(11, 'buss', -math.inf)
    df['buss'] = np.log10(np.multiply(df['distance_from_busstop'], df['busstop_importance_level']))
    
    return df
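
As a small worked sketch of the buss feature, the snippet below computes log10(distance × importance level) for a couple of made-up values. The function name `buss_feature` and the small constant `eps` are assumptions added here: `eps` guards against log10(0) when a store sits exactly on a bus stop, and is not part of the function above.

```python
import numpy as np

def buss_feature(distance, importance_level, eps=1e-12):
    """log10 of distance times importance level, with a floor on the distance."""
    distance = np.asarray(distance, dtype=float)
    importance_level = np.asarray(importance_level, dtype=float)
    # Clip the distance from below so that a zero distance does not give -inf
    return np.log10(np.maximum(distance, eps) * importance_level)

# Hypothetical values: a far, unimportant stop and a near, important one.
values = buss_feature([0.01, 0.0001], [1, 6])
```

Since log10(a·b) = log10(a) + log10(b), the feature is the sum of the log-distance and the log-importance, which compresses the very wide range of the raw product.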

## Models

### Random forest

We started with random forest quite early on, as it was an algorithm we had heard about before. We thought it was important to familiarize ourselves with how to work with such algorithms, and since we expected random forest to be a simple one, it seemed like a good place to start. This turned out to be a good lesson for us.

In [None]:
sns.set_style('darkgrid')
stores_train = pd.read_csv('data/stores_train_preprocessed.csv')
stores_train.head()

In [None]:
stores_train.tail()

In [None]:
stores_train = stores_train[stores_train.revenue > 0.0]

In [None]:
stores_train = stores_train.drop(['store_id'], axis = 1) 
stores_train

In [None]:
stores_train['grunnkrets_population'] = (stores_train['grunnkrets_population'].fillna(stores_train.grunnkrets_population.mean()))
stores_train['district_population'] = (stores_train['district_population'].fillna(stores_train.district_population.mean()))
stores_train['area_km2'] = (stores_train['area_km2'].fillna(stores_train.area_km2.mean()))
stores_train.isnull().sum()

In [None]:
data = stores_train
Y= data[['revenue']]
Y=np.ravel(Y)
X= data.drop('revenue', axis =1)
X.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(random_state = 42,
                                  n_jobs =-1,
                                   n_estimators = 3000,
                                 )  

In [None]:
regressor.fit(X,Y)

In [None]:
score = regressor.score(X,Y)
score

In [None]:
pd.DataFrame({'Variable':X.columns,
              'Importance':regressor.feature_importances_}).sort_values('Importance', ascending=False)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X,Y, test_size = 0.20, random_state = 42)
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('y_train',y_train.shape)
print('y_test',y_test.shape)

In [None]:
regressor.fit(X_train,y_train)

In [None]:
score = regressor.score(X_train,y_train)
score    

In [None]:
from sklearn.metrics import mean_squared_error
y_pred = regressor.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse    

In [None]:
y_pred_non_log = 10**(y_pred)
y_test_non_log = 10**(y_test)

In [None]:
from sklearn.metrics import mean_squared_log_error
rmsle = mean_squared_log_error(y_test_non_log,y_pred_non_log)**0.5
rmsle  

In [None]:
x_ax = range(len(y_test_non_log))
f = plt.figure()
f.set_figwidth(30)
f.set_figheight(5)
plt.plot(x_ax, y_test_non_log, label="truth")
plt.plot(x_ax, y_pred_non_log, label="predicted")
plt.title("Truth vs predicted Revenue")
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend(loc='best',fancybox=True, shadow=True)
plt.grid(True)
plt.show()

### Catboost


We started working with this algorithm after random forest, in parallel with LightGBM. These two are more complex than random forest and were a natural next step. By this point we had become better acquainted with ML and with which algorithms exist. After some searching online for more information, we chose to test CatBoost and LightGBM.

In [None]:
stores_train = pd.read_csv('data/stores_train_preprocessed.csv')
# stores_train = pd.read_csv('data/stores_train.csv')
stores_train['grunnkrets_population'] = stores_train['grunnkrets_population'].fillna(stores_train.grunnkrets_population.mean())
stores_train['district_population'] = stores_train['district_population'].fillna(stores_train.district_population.mean())
stores_train['area_km2'] = stores_train['area_km2'].fillna(stores_train.area_km2.mean())
stores_train.isnull().sum()

In [None]:
#sns.boxplot(stores_train['revenue'])

In [None]:
stores_train = stores_train[stores_train.revenue > 0.0]

In [None]:
stores_train.tail()

In [None]:
data = stores_train
data = data.drop('store_id',
                   axis =1)
data.dtypes

In [None]:
data.shape

In [None]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
import math

In [None]:
Y= data.pop("revenue")
X= data 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X,Y, test_size = 0.3, random_state = 42)
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('y_train',y_train.shape)
print('y_test',y_test.shape)

In [None]:
cat_model = CatBoostRegressor( iterations= 5000, random_seed = 42)

In [None]:
# Fit model and validate
cat_model.fit( X_train, y_train,
               eval_set=(X_test, y_test),
               plot=True,
              verbose = False
              )

In [None]:
# Feature importance is only available after the model has been fitted
feature_importance = pd.DataFrame(cat_model.get_feature_importance(prettified=True))

plt.figure(figsize=(12, 6));
feature_plot= sns.barplot(x="Importances", y="Feature Id", data = feature_importance,palette="cool");
plt.title('feature importance');

In [None]:
y_predict_train = cat_model.predict(X_train)
y_predict= cat_model.predict(X_test)
#RMSE
Rmse_train = math.sqrt(mean_squared_error(y_train,y_predict_train ))
Rmse_test = math.sqrt(mean_squared_error(y_test,y_predict))

# R2
r2_train = cat_model.score(X_train,y_train)
r2_test = r2_score(y_test,y_predict)

# Adjusted R2 (n and p come from the test set, since r2_test is a test-set score)
n= X_test.shape[0] 
p= X_test.shape[1] 
adj_r2_test = 1-(1-r2_test)*(n-1)/(n-p-1)

print("Evaluation on train and test data")
print("RMSE train: {:.2f}".format(Rmse_train))
print("RMSE test: {:.2f}".format(Rmse_test))
print("R2 train: {:.2f}".format(r2_train))
print("R2 test: {:.2f}".format(r2_test))
print("Adjusted R2: {:.2f}".format(adj_r2_test)) 

In [None]:
x_ax = range(len(y_test))
f = plt.figure()
f.set_figwidth(30)
f.set_figheight(5)
plt.plot(x_ax, y_test, label="truth")
plt.plot(x_ax, y_predict, label="predicted")
plt.title("Truth vs predicted Revenue")
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend(loc='best',fancybox=True, shadow=True)
plt.grid(True)
plt.show()

In [None]:
y_pred_non_log = 10**(y_predict) 
y_test_non_log = 10**(y_test)

In [None]:
from sklearn.metrics import mean_squared_log_error
rmsle = mean_squared_log_error(y_test_non_log,y_pred_non_log)**0.5
rmsle  

In [None]:
x_ax = range(len(y_test_non_log))
f = plt.figure()
f.set_figwidth(30)
f.set_figheight(5)
plt.plot(x_ax, y_test_non_log, label="truth")
plt.plot(x_ax, y_pred_non_log, label="predicted")
plt.title("Truth vs predicted Revenue")
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend(loc='best',fancybox=True, shadow=True)
plt.grid(True)
plt.show()

### Gradient Boosting Machine
Another algorithm we tried out was gradient boosting, namely the GradientBoostingRegressor from scikit-learn. We chose this library because it was the one we had heard the most about, and it had been recommended by peers and mentors.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

#### Defining and testing base model 
We started by defining and testing a base model, a simple decision tree also from scikit-learn, to get an idea of roughly what performance we could expect. To make the scores comparable, we used the same evaluation metric as the Kaggle leaderboard.

In [None]:
X = pd.DataFrame(stores_train.drop(columns="revenue"))
y = pd.DataFrame(stores_train["revenue"])

In [None]:
for depth in range(1, 10):
    tree_regressor = tree.DecisionTreeRegressor(max_depth=depth, random_state=1)
    if tree_regressor.fit(X, y).tree_.max_depth < depth:
        break
    score_all=np.mean(cross_val_score(tree_regressor, X, y,
                                  scoring = 'neg_mean_squared_log_error'))
    score_all=math.sqrt(abs(score_all))
    print(depth, score_all)

The results we got were already very good, and so we continued on with hyperparameter tuning.
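
For reference, the metric used in the loop above is the root mean squared logarithmic error (RMSLE), which scikit-learn computes from log(1 + x) of the true and predicted values. A minimal pure-Python sketch of the same quantity follows. The function name `rmsle` and the example numbers are made up for illustration.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # log1p(x) = ln(1 + x), so zero values are handled without error
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

perfect = rmsle([1.0, 10.0], [1.0, 10.0])  # identical values give zero error
off = rmsle([0.0], [math.e - 1.0])         # ln(1 + (e - 1)) - ln(1) = 1
```

Because the error is measured on a log scale, it penalizes relative rather than absolute differences, so a large store and a small store that are both off by the same factor contribute about equally.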

#### Hyperparameter Tuning
For the hyperparameter tuning we used an actual Gradient Boosting Regressor (GBR), and tuned the model via grid search. We tried the following values:
- n_estimators: 5, 10, 20, 50, 100, 200
- learning_rate: 0.001, 0.01, 0.1
- max_depth: 1, 2, 4
- subsample: 0.5, 0.75, 1

Because of the running time, we did not run the entire grid at once, but rather in batches with different subsets of the n_estimators values. Either way, this is how we did it:

In [None]:
GBR=GradientBoostingRegressor()
search_grid = {'n_estimators':[50, 100, 200], 'learning_rate': [0.001, 0.01, 0.1],
               'max_depth': [1, 2, 4], 'subsample': [0.5, 0.75, 1], 'random_state': [1]}
search=GridSearchCV(estimator=GBR, param_grid=search_grid,
                    scoring='neg_mean_squared_log_error')
search.fit(X, y)
print(search.best_params_)
score = math.sqrt(abs(search.best_score_))
print("Score: ", score)

In addition to trying these different hyperparameters, we also tried different feature sets (i.e. dropping some of the features we found less relevant) and different treatments of the revenue: normalizing it, leaving it unchanged, and applying log10. We found that log10 performed best each time for this model. To save you from reading unnecessarily many versions of the same code, we keep only one in this notebook, but in short, we found that:
- Leaving the revenue unchanged performed better than normalizing it, and applying log10 performed better than leaving it unchanged.
- "Stripping" the DataFrame of certain columns was better than keeping all. We will talk more about this later, but we tried out multiple possibilities, based on correlation to revenue, and based on feature importance
- A max depth of 4 and learning rate of 0.1 seemed to be the best each time
- For "stripped" DataFrames, a subsample of 0.75 seemed to perform better than a subsample of 1, even though a subsample of 1 seemed to perform better on non-"stripped" DataFrames
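
One way to apply the log10 transform to the target and undo it at prediction time is scikit-learn's `TransformedTargetRegressor`. The sketch below is only an illustration on synthetic data, not the code we ran. The toy arrays `X_toy` and `y_toy`, the helper `inverse_log10` and the chosen hyperparameters are assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor

def inverse_log10(values):
    """Undo np.log10 so that predictions are back on the original scale."""
    return np.power(10.0, values)

# Synthetic, strictly positive target (log10 is undefined for zero revenue).
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 1, size=(200, 3))
y_toy = 10 ** (2 * X_toy[:, 0] + rng.normal(0, 0.05, size=200))

model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=1),
    func=np.log10,               # applied to y before fitting
    inverse_func=inverse_log10,  # applied to the predictions
)
model.fit(X_toy, y_toy)
predictions = model.predict(X_toy[:5])  # already on the original revenue scale
```

Wrapping the transform this way keeps the forward and inverse steps together, which avoids forgetting the `10**` step when writing a submission file.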

#### Creating actual models
After some trial and error, we used the parameters the grid search recommended and fitted the regressor to the data.

In [None]:
# Rebuild the regressor from the best parameters found by the grid search
GBR2 = GradientBoostingRegressor(n_estimators=search.best_params_['n_estimators'], learning_rate=search.best_params_['learning_rate'],
                                 subsample=search.best_params_['subsample'],max_depth=search.best_params_['max_depth'], random_state=1)
score=np.mean(cross_val_score(GBR2, X, y, scoring='neg_mean_squared_log_error', n_jobs=1))
score=math.sqrt(abs(score))
print(score)
GBR2.fit(X, y)

#### Making predictions
We then used the model to make predictions, one of which is our end submission (pred_new_stripped_150outliers_3feat). Here we chose to remove the top and bottom 150 stores based on revenue in order to make the data more representative.

In [None]:
prediction = GBR2.predict(stores_test)
data = {'id': stores_test['store_id'],
        'predicted': prediction}
prediction_submission = pd.DataFrame(data)
prediction_submission.to_csv("./predictions/pred.csv", index=False)
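
The outlier removal mentioned above (dropping the 150 stores with the lowest and the 150 with the highest revenue) can be sketched as follows. This is a minimal illustration on made-up data, not our exact code. The function name `drop_revenue_extremes` and the toy DataFrame are assumptions.

```python
import pandas as pd

def drop_revenue_extremes(df: pd.DataFrame, n: int = 150, column: str = "revenue") -> pd.DataFrame:
    """Drop the n rows with the lowest and the n rows with the highest `column`."""
    if len(df) <= 2 * n:
        raise ValueError("not enough rows to trim n from each end")
    ordered = df.sort_values(column)
    return ordered.iloc[n:len(ordered) - n]

# Toy data with n=1 so the effect is easy to see.
toy = pd.DataFrame({
    "store_id": list("abcdef"),
    "revenue": [5.0, 0.1, 9.0, 3.0, 100.0, 4.0],
})
trimmed = drop_revenue_extremes(toy, n=1)
```

The trimming should only be applied to the training data, since the test set has no revenue column to sort by.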

After the first predictions, we used their performance to improve our next predictions. We did this by plotting the feature importance, which indicated which features to keep and which to drop for the "stripped" DataFrames.

In [None]:
# Plot feature importance https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR
feat_importance = GBR2.feature_importances_
# Scale so that the most important feature has a value of 100
feat_importance = 100.0 * (feat_importance / feat_importance.max())
sorted_index = np.argsort(feat_importance)
pos = np.arange(sorted_index.shape[0]) + .5
plt.figure(figsize=(8, 18))
plt.barh(pos, feat_importance[sorted_index], align='center')
plt.yticks(pos, X.keys()[sorted_index])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

Here we can see that chain_name and store_id are very important, so we never tried to take them out. The lv1 and municipality name features, however, were not so important, so we tried making predictions without those and other features.

## Results

**The file for our prediction with score 0.69378 :**
- [short_nootebook_pred_new_stripped_150outliers_3feat.csv](#short_nootebook_pred_new_stripped_150outliers_3feat.csv)

**The file for our prediction with score____ :**
- [______.csv](#______.csv)

## Reflections

This has been a very educational project, and we have learned a lot. None of us had worked with machine learning to any great extent before, so there was a lot of theory we had to familiarize ourselves with.

This process went well, but it made progress rather slow at first. We consider this understandable, as a maturation period is often needed when working with something completely new.

We were a team whose members did not know each other beforehand, and we worked remotely. This has probably affected the work process a little, but all in all, we are satisfied with how it has gone.

At the start, much of the time was spent figuring out what to do, as we had no experience with ML. If we were to start the project anew with our current knowledge, we would probably have time to try out a few more things, such as:

- We could try to generate chain_name ourselves based on store_name. Alternatively, we could try to fill in the chain_name column with the data we had available.

- The column mall_name probably also had missing data. Here we could try to generate this data ourselves by looking at the coordinates. We could test assigning all stores within a certain distance of each other to the same mall_name.

- We did not use the dataset store_extra to any great extent. Here we could have spent more time on ML theory and perhaps found a use for this data.

- We could obviously have spent more time on the models, but this is very time-consuming in itself, and tuning parameters is a separate field of work.

But despite what we could have done, we are very satisfied with what we were able to do. We got good insight into the data through our EDA. We went through the stores_train dataset in particular very thoroughly. Among other things, this let us verify the data, to the extent that the data we found looked correct.

- For example, there were many grocery stores that had high incomes. This matches the picture given in the news.

- We went through the types of stores that existed, and the income for these categories.

- We got confirmation that the coordinates for the stores made sense and covered the whole of Norway. Here we could clearly see that the shops were concentrated in the big cities, which makes sense.

- We also went through and tested how well chain_name was set based on store_name. This was done surprisingly well. Here we spent a lot of time on the grocery stores Rema 1000 and Kiwi in particular.

- We also went in depth on income. There were 217 rows with an income of 0. We discussed whether the column might actually contain profit rather than income. But since no stores had negative values, and unfortunately not all stores run a surplus, this was hard to believe. We also checked the stores and found after a bit of Googling that they were active and had both positive income and profit in 2016. We therefore chose to remove these rows as part of our data cleaning.

- Furthermore, we are very satisfied with our feature engineering for the bus stops and the population.

- Another thing we are pleased with is how we filled in the data we were missing, since we did not have data for every grunnkrets. We did this by averaging over the geographically nearest grunnkrets (basic districts).

- Using model interpretation for the CatBoost and AutoML models helped us a lot in interpreting our weak results, and helped us find a bug in our preprocessing. We preprocessed stores_train and stores_test separately. This in itself is normal, but we made a big mistake: we created a new instance of 'LabelEncoder().fit_transform()' for each dataset. As a result, our categorical data was encoded differently in the two datasets. For example, the chain_name value 'REMA FRANCHISE NORGE' could be converted to '7' in stores_train and '9' in stores_test. Naturally, these should have been converted to the same value for a model to benefit from the chain_name column.  
Our model interpretation clearly showed that 'chain_name' was the most important feature, i.e. the one most strongly related to 'revenue'. This fits well with our previous findings using sns.heatmap, which showed a linear correlation of 0.39/0.40 with the target attribute 'revenue'.
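
The encoding bug described above can be avoided by fitting a single encoder on the labels from both datasets and reusing it for each. The following is a minimal sketch with made-up rows, not our actual preprocessing code. The toy DataFrames and the column name `chain_name_encoded` are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows standing in for stores_train and stores_test.
train = pd.DataFrame({"chain_name": ["KIWI", "REMA FRANCHISE NORGE", "KIWI"]})
test = pd.DataFrame({"chain_name": ["REMA FRANCHISE NORGE", "COOP"]})

# Fit ONE encoder on the union of labels, then reuse it for both datasets.
encoder = LabelEncoder()
encoder.fit(pd.concat([train["chain_name"], test["chain_name"]]))

train["chain_name_encoded"] = encoder.transform(train["chain_name"])
test["chain_name_encoded"] = encoder.transform(test["chain_name"])
```

With a shared encoder, 'REMA FRANCHISE NORGE' receives the same integer in both DataFrames. If the test labels are not available when fitting, scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"` is one alternative for handling unseen categories.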
