# Introduction: Retail Revenue Prediction

In this project, you will predict the revenues of retail stores across all of Norway.

We have been given a large dataset by [Plaace](https://plaace.co/en/), a Norwegian company that matches businesses to retail properties. They have provided us with revenue information for over 20,000 retail stores that will be used to train and evaluate your models. We have also been provided with a wide selection of supplementary data that can be used to engineer more powerful features, including geo-specific demographics, public transportation information, a hierarchical grouping of the stores, and information about over 28,000 additional stores.

**You may only use the data provided here for your models.** We have made this rule based on experience from previous years to ensure a fair and accurate grading.

This notebook aims to provide you with an introduction to the data. In addition, we give a short tutorial on how to generate and upload a Kaggle submission file. 

## Basics 

We begin by giving a quick overview of the data you've been given, presenting the training and testing set, and introducing the formula we will use to evaluate prediction performance.

### Directory Structure

You are given a collection of comma-separated values (CSV) files to work with. The two most important files are `stores_train.csv` and `stores_test.csv`, which will be covered shortly. The remaining files contain supplementary data that are likely to be useful for crafting a richer feature set. They will be covered later in this notebook.

In [None]:
!ls ./data | sort

### Train Stores

The `stores_train.csv` dataset is the basis for your training data and contains 12859 rows. 
Each store (row) is associated with the following information:
- `store_id`: unique ID for each store 
- `year`: The year the data is recorded for (should all be 2016)
- `store_name`: Human readable name for each store
- `plaace_hierarchy_id`: Group ID for the store type (see [Plaace Hierarchy](#Plaace-Hierarchy))
- `sales_channel_name`: Human readable name for the store type 
- `grunnkrets_id`: Geographical ID for the store's location (see [Grunnkrets Data](#Grunnkrets-Data))
- `address`: Street address of the store 
- `lat`: Latitude (north-south) coordinate for the store's location 
- `lon`: Longitude (west-east) coordinate for the store's location 
- `chain_name`: Name of the chain the store belongs to (if available)
- `mall_name`: Name of the mall a store is located in (if available)
- `revenue` (**target**): The store's revenue in 2016. This is what you will be predicting.

In [None]:
import pandas as pd
stores_train = pd.read_csv('data/stores_train.csv')
stores_train.head()

### Test Stores

The dataset with test stores consists of 8577 rows. Note that it contains the exact same columns as the train set stores, except that the `revenue` column is missing. Your grade in this project will mainly be based on how well you can predict these missing values.

In [None]:
stores_test = pd.read_csv('data/stores_test.csv')
stores_test.head()

### Objective 

Because we consider the revenue a continuous variable, we call this a _regression_ problem. It is common to evaluate regression problems according to some deviation measure of the error (difference) between the predictions and the ground truth values. Typical choices are Mean Squared Error (MSE) and its square root, the Root Mean Squared Error (RMSE).

However, both of these measures are quite sensitive to extreme values and work best when the typical scale of prediction errors is consistent across the dataset. That is unlikely to be the case here because revenue varies a lot between stores. As a result, a prediction error of, say, 10% would matter a lot more for one of the higher-earning stores than for one of the lower-earning ones. Consequently, we will use a variation that takes a log transform of the target variable before computing prediction errors.

**TL;DR**: submissions for this problem will be evaluated according to the `Root Mean Squared Log Error` (RMSLE):

- $\text{RMSLE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(1 + \hat{y}_i) - \log(1 + y_i))^2}$

In the equation above, $y_i$ corresponds to the ground truth value for datapoint $i$, $\hat{y}_i$ corresponds to the predicted value for datapoint $i$, and $n$ denotes the total number of datapoints (dimensionality of $y$, $\hat{y}$). See the cell below for an implementation.



In [None]:
import numpy as np 

def rmsle(y_true, y_pred):
    """
    Computes the Root Mean Squared Logarithmic Error 
    
    Args:
        y_true (np.array): n-dimensional vector of ground-truth values 
        y_pred (np.array): n-dimensional vector of predicted values 
    
    Returns:
        A scalar float with the rmsle value 
    
    Note: You can alternatively use sklearn and just do: 
        `sklearn.metrics.mean_squared_log_error(y_true, y_pred) ** 0.5`
    """
    assert (y_true >= 0).all(), 'Received negative y_true values'
    assert (y_pred >= 0).all(), 'Received negative y_pred values'
    assert y_true.shape == y_pred.shape, 'y_true and y_pred have different shapes'
    y_true_log1p = np.log1p(y_true)  # log(1 + y_true)
    y_pred_log1p = np.log1p(y_pred)  # log(1 + y_pred)
    return np.sqrt(np.mean(np.square(y_pred_log1p - y_true_log1p)))


# Calculate rmsle for a few example predictions 
y_true = stores_train.revenue.values
n = len(stores_train)
print('A couple of RMSLE scores computed over the train set')
print(f'Perfect prediction: {rmsle(y_true, y_true):.4f}')
print(f'All zeros prediction: {rmsle(y_true, np.zeros(n)):.4f}')
print(f'All ones prediction: {rmsle(y_true, np.ones(n)):.4f}')
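
To make the motivation for the log transform concrete, here is a minimal sketch comparing RMSE and RMSLE when two stores are over-predicted by the same relative amount. The revenue values are made up for illustration, and `rmse_demo` and `rmsle_demo` are hypothetical helper names that simply mirror the formula above.

```python
import numpy as np

def rmse_demo(y_true, y_pred):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean(np.square(y_pred - y_true)))

def rmsle_demo(y_true, y_pred):
    """Root Mean Squared Log Error, same formula as rmsle() above."""
    return np.sqrt(np.mean(np.square(np.log1p(y_pred) - np.log1p(y_true))))

# Two made-up stores: one with low revenue, one with high revenue
y_small = np.array([10.0])
y_large = np.array([1000.0])

# Over-predict both by 10% and compare the two metrics
for name, y in [('small', y_small), ('large', y_large)]:
    y_pred = 1.1 * y
    print(f'{name}: RMSE={rmse_demo(y, y_pred):.3f}, RMSLE={rmsle_demo(y, y_pred):.4f}')
```

The RMSE grows in proportion to the store's revenue, while the RMSLE stays roughly the same for both stores, which is the behaviour we want when revenues span several orders of magnitude.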

## Supplementary Data 

The following sections cover the remaining data at your disposal.

### Extra Stores 

The extra stores dataset is a collection of stores for which we had no revenue data. Structurally, it is identical to the test set, but you are naturally not expected to submit any predictions for it. You can, however, use the additional data in your analysis, in unsupervised methods you might employ, or to provide a stronger data basis for missing value imputation.


In [None]:
stores_extra = pd.read_csv('data/stores_extra.csv')
stores_extra.head()

### Plaace Hierarchy 

Plaace has provided us with their system for sorting stores into categories in a 4-level hierarchy. The top level of the hierarchy contains the most abstract groupings ("Dining and Experiences", "Retail", and "Services"). As you move further down in the hierarchy, you get more and more specific categories (109 distinct values at level 4).

Each store is associated with a dot-separated `plaace_hierarchy_id`, which gives its group membership at all four levels of the hierarchy. For instance, the id "2.8.11.2" gives groups "Retail" -> "Food and drinks" -> "Alcohol sales" -> "Beer and soda shop". 

In the cell below we use the `pd.DataFrame.merge` method to perform a left join between the train store data and the hierarchy data, using the `plaace_hierarchy_id` as join key. The result is that each train store row gets the corresponding Plaace hierarchy information appended to its side (we just visualize the table vertically to make it easier to see all the columns).

In [None]:
# Read plaace_hierarchy data 
plaace_hierarchy = pd.read_csv('data/plaace_hierarchy.csv')

# Augment stores_train with information about the hierarchy
stores_with_hierarchy = stores_train.merge(plaace_hierarchy, how='left', on='plaace_hierarchy_id')

# Show dataframe, but transposed so that we can more easily see all the resulting columns
stores_with_hierarchy.head().T
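
Because the `plaace_hierarchy_id` is dot-separated, the group id at each level can also be derived directly from the string. The sketch below does this on a toy dataframe; the `hierarchy_lv1` to `hierarchy_lv4` column names are hypothetical, and the ids other than "2.8.11.2" are made up. The hierarchy file may already contain per-level columns, in which case the merge above is all you need.

```python
import pandas as pd

# Toy frame; the ids follow the dot-separated format described above
toy = pd.DataFrame({'plaace_hierarchy_id': ['2.8.11.2', '1.1.1.0', '2.8.11.1']})

for level in range(1, 5):
    # Keep the first `level` components, e.g. '2.8' at level 2
    toy[f'hierarchy_lv{level}'] = toy['plaace_hierarchy_id'].apply(
        lambda s: '.'.join(s.split('.')[:level])
    )

print(toy)
```

Prefix ids like these can be handy for grouping stores at a coarser level when the most specific categories contain too few stores.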

### Grunnkrets Data 

Next follows information about different regions in Norway and their demography.
A "[Grunnkrets](https://no.wikipedia.org/wiki/Grunnkrets)" is a type of statistical unit used to describe a small geographic area. The corresponding term in English is "[Basic statistical unit](https://en.wikipedia.org/wiki/Basic_statistical_unit_(Norway))".

We have a total of four extra datasets that provide extra information about the different grunnkrets units in Norway: one describing the grunnkrets itself and three describing its demographics.

All the grunnkrets datasets have a column called `grunnkrets_id`. A corresponding column can be found in the `stores_*` datasets, meaning that you can augment each store with information about the grunnkrets it is located in using similar join logic as the `merge` operation used in the previous section.

Note that we have grunnkrets-related data for both 2015 and 2016, but keep in mind that the revenue predictions you will make are all for 2016. Moreover, the demography data is actually for the district, but has already been mapped to each grunnkrets for us.
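
Since the grunnkrets data covers two years, a plain join on `grunnkrets_id` could duplicate store rows. The following is a minimal sketch of one way to handle this, using toy dataframes. It assumes the grunnkrets files carry a `year` column; `area_km2` is a hypothetical feature column, and all values are made up.

```python
import pandas as pd

# Toy stand-ins for a stores frame and a grunnkrets frame
stores = pd.DataFrame({
    'store_id': ['a', 'b', 'c'],
    'grunnkrets_id': [101, 102, 103],
})
grunnkrets_toy = pd.DataFrame({
    'grunnkrets_id': [101, 101, 102],
    'year': [2015, 2016, 2015],      # assumed column name
    'area_km2': [0.5, 0.6, 1.2],     # hypothetical feature column
})

# Keep one row per grunnkrets, preferring the most recent year
latest = (
    grunnkrets_toy
    .sort_values('year')
    .drop_duplicates(subset='grunnkrets_id', keep='last')
)

# A left join keeps every store, even those without grunnkrets data.
# validate='many_to_one' raises if the right side still has duplicate keys.
merged = stores.merge(latest, how='left', on='grunnkrets_id', validate='many_to_one')
print(merged)
```

Stores whose grunnkrets is missing from the supplementary data end up with NaN values, which you will need to impute or otherwise handle.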

#### Geography

The first grunnkrets dataset we will look at describes the geography of each grunnkrets. In addition to the official name of the grunnkrets, we also have the district the grunnkrets is located in, as well as the municipality the district is located in. 

Finally, we also have a polygon describing the geographical area covered by the grunnkrets. If you want to process this data further, we recommend you check out the [geopandas](https://geopandas.org) extension. The area of the polygon (in square kilometers) is already computed and available for you to use.

In [None]:
grunnkrets = pd.read_csv('data/grunnkrets_norway_stripped.csv')
grunnkrets.head()

#### Age Distribution

Next, we have information about the age distribution in different grunnkrets units. Each column gives the number of people of a given age who live in the grunnkrets. For instance, the `12` in the first row of column `age_6` means that there are 12 six-year-olds in the corresponding grunnkrets.

In [None]:
grunnkrets_ages = pd.read_csv('data/grunnkrets_age_distribution.csv')
grunnkrets_ages.head()
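
One column per age is a fairly fine-grained representation, so it may help to aggregate the counts into broader groups. Here is a small sketch on a toy dataframe. It assumes all age columns follow the `age_<n>` naming pattern seen above; the bucket boundaries and the `n_*` and `frac_*` column names are hypothetical choices.

```python
import pandas as pd

# Toy frame with a handful of age columns (made-up counts)
ages = pd.DataFrame({
    'grunnkrets_id': [101, 102],
    'age_0': [3, 1], 'age_6': [12, 4], 'age_17': [5, 2],
    'age_30': [20, 9], 'age_70': [7, 15],
})

age_cols = [c for c in ages.columns if c.startswith('age_')]
age_of = {c: int(c.split('_')[1]) for c in age_cols}

# Hypothetical buckets given as [lower, upper) in years
buckets = {'children': (0, 18), 'adults': (18, 67), 'seniors': (67, 200)}

features = ages[['grunnkrets_id']].copy()
total = ages[age_cols].sum(axis=1)
for name, (lo, hi) in buckets.items():
    cols = [c for c in age_cols if lo <= age_of[c] < hi]
    features[f'n_{name}'] = ages[cols].sum(axis=1)
    # Fractions are NaN for units with zero population
    features[f'frac_{name}'] = features[f'n_{name}'] / total

print(features)
```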

#### Household Types 

The second demography dataset gives the household composition in each grunnkrets. The different types of households are partitioned into 8 categories. Each column gives the number of households in a given category.

In [None]:
grunnkrets_household_types = pd.read_csv('data/grunnkrets_households_num_persons.csv')
grunnkrets_household_types.head()

#### Household Income

The last demography dataset gives median incomes in each grunnkrets. The first income column is `all_households`, which denotes the median income aggregated across all households in the grunnkrets. The last three columns further break this figure down for three different types of households.

In [None]:
grunnkrets_household_income = pd.read_csv('data/grunnkrets_income_households.csv')
grunnkrets_household_income.head()

### Busstops

Lastly, we also have information about bus stops all over Norway. Unlike the previous datasets, none of the rows here are directly associated with any of the stores. However, the `geometry` column contains information about the location, which can be compared with the location of stores to generate features.

- `busstop_id`: unique ID for busstop. Not tied to anything else in this dataset 
- `stopplace_type`: what kind of stop it is (e.g. just a curbside stop or a proper bus pocket)
- `importance_level`: how important the stop is (e.g. just a regular stop or a regional hub)
- `side_placement`: position in the road 
- `geometry`: latitude and longitude of location (point geometry)

In [None]:
busstops = pd.read_csv('data/busstops_norway.csv')
busstops.head()
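
As an illustration of comparing bus stop locations with store locations, the sketch below computes the distance from each store to its nearest stop and the number of stops within 1 km. It assumes the `geometry` strings look like `POINT(lon lat)`; check the actual format in the data before relying on `parse_point`. All coordinates are made up, and the helper names are hypothetical.

```python
import re
import numpy as np

def parse_point(wkt):
    """Parse 'POINT(lon lat)' into (lat, lon). The string format is an assumption."""
    lon, lat = map(float, re.findall(r'-?\d+(?:\.\d+)?', wkt))
    return lat, lon

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Made-up store coordinates and bus stop geometries
store_lat = np.array([59.9139, 59.9500])
store_lon = np.array([10.7522, 10.8000])
stops = ['POINT(10.7500 59.9140)', 'POINT(10.9000 60.0000)']
stop_lat, stop_lon = np.array([parse_point(s) for s in stops]).T

# Pairwise distance matrix of shape (n_stores, n_stops) via broadcasting
dist = haversine_km(store_lat[:, None], store_lon[:, None],
                    stop_lat[None, :], stop_lon[None, :])

nearest_stop_km = dist.min(axis=1)            # distance to the closest stop
stops_within_1km = (dist < 1.0).sum(axis=1)   # number of stops within 1 km
print(nearest_stop_km, stops_within_1km)
```

A full pairwise matrix can get large for the real data. Processing the stores in chunks, or using a spatial index such as scikit-learn's `BallTree` with the haversine metric, are possible ways to keep memory use down.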

## Getting Started

In this final section, we will go through a sped-up and simplified version of the work we expect you to do; all the way from initial EDA, to "training" a very simple model, and finally using the model to make a submission with test set predictions.

### Analyzing the Data

The first thing you should do is start exploring the data. We often refer to this activity as Exploratory Data Analysis (EDA). In the cell below we make two plots. The first one shows the number of missing values in the `stores_train` dataframe. The second one visualizes the distribution of the target variable (revenue).

In your own work, you should go a lot further than this. You may for instance want to:

- Look for outliers and other parts of the data that should be cleaned up
- Investigate to what degree single variables correlate, both with each other and the target variable
- Visualize different variables spatially, perhaps with maps in the background



In [None]:
import matplotlib.pyplot as plt 

fig, (ax1, ax2) = plt.subplots(figsize=(12, 3), ncols=2)
stores_train.isna().mean().plot.bar(ax=ax1)
ax1.set_title('Fraction of rows with NaN values')
stores_train.revenue.plot.hist(bins=300, ax=ax2)
ax2.set_title('Distribution of Revenues');
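
Heavy-tailed targets are often easier to inspect after a log transform, which also matches the metric used in this project. The sketch below illustrates the effect on synthetic log-normal data, which is only a stand-in for the real revenue column; `skewness` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, heavy-tailed stand-in for the revenue column (not real data)
revenue = rng.lognormal(mean=1.0, sigma=1.5, size=10_000)

def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

print(f'skew(raw)   = {skewness(revenue):.2f}')
print(f'skew(log1p) = {skewness(np.log1p(revenue)):.2f}')
```

With the real data you could plot `np.log1p(stores_train.revenue)` in the same way as the histogram above to see whether the transformed distribution looks more symmetric.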



### Building a Model

Next, we design and train a model. To avoid giving away a solution that will beat the worst virtual teams, the cell below implements an exceptionally bad model. All it does is fit a uniform distribution based on the minimum and maximum y-values (revenues) in the training set. When making predictions, it simply samples the uniform distribution while completely disregarding any features.

In [None]:
class ReallyBadRandomGuesser:
    """
    Model that fits a uniform distribution to the minimum and 
    maximum observed y-values in the training set.
    
    Args:
        random_seed (int): Seed for the random distribution used to 
            sample predictions.
    """
    
    def __init__(self, random_seed=None):
        self.random = np.random.RandomState(random_seed)
    
    def fit(self, X, y):
        # Store min/max values of train set y values 
        self.y_min = y.min()
        self.y_max = y.max()
    
    def predict(self, X):
        n = len(X)
        return self.random.uniform(self.y_min, self.y_max, size=n)

# Partition into X (not really used here) and y values 
X_train = stores_train.drop(columns=['revenue'])
y_train = stores_train.revenue 

# Create and fit a model 
model = ReallyBadRandomGuesser(random_seed=123)
model.fit(X_train, y_train)

# Generate predictions over the training set 
y_train_pred = model.predict(X_train)

print(f'Train set RMSLE: {rmsle(y_train, y_train_pred) :.4f}')
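
For comparison with the random guesser, it can be useful to know how well a single constant prediction does. Since RMSLE is a squared error in `log1p` space, the constant $c$ that minimizes it satisfies $\log(1 + c) = \frac{1}{n}\sum_i \log(1 + y_i)$. The sketch below checks this on synthetic data (not the real revenues); `rmsle_demo` is a hypothetical helper mirroring the metric defined earlier.

```python
import numpy as np

def rmsle_demo(y_true, y_pred):
    """Root Mean Squared Log Error."""
    return np.sqrt(np.mean(np.square(np.log1p(y_pred) - np.log1p(y_true))))

rng = np.random.default_rng(0)
y = rng.lognormal(mean=1.0, sigma=1.5, size=5_000)  # synthetic targets

n = len(y)
candidates = {
    'mean': np.full(n, y.mean()),
    'median': np.full(n, np.median(y)),
    # Mean in log1p space, mapped back to the original scale
    'log-mean': np.full(n, np.expm1(np.log1p(y).mean())),
}
scores = {name: rmsle_demo(y, pred) for name, pred in candidates.items()}
for name, score in scores.items():
    print(f'{name:>8}: {score:.4f}')
```

On skewed data the arithmetic mean tends to be a noticeably worse constant than the log-space mean, which is worth keeping in mind when choosing what your models should actually be trained to predict.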

### Creating a Submission

Finally, we use the "trained" model to make predictions on the test set and turn them into a submission for Kaggle. The format for submissions is a simple csv file with two columns: one for the store id and one for the predicted revenue.
An example of what the start of the file should look like can be seen below:


```
id,predicted
914206820-914239427-717245,181.66162783399506
916789157-916823770-824309,206.81469433388355
913341082-977479363-2948,83.49386666841214
...
```

Keep in mind the following when generating predictions:
- Make sure that the csv has exactly the columns `id` and `predicted`.
- Make sure that the ID values correctly correspond to each prediction.
- Make sure there are no negative, NaN, or other non-numeric values in your submission.


In [None]:
# Predict on the test set 
X_test = stores_test  
y_test_pred = model.predict(X_test)

# Generate submission dataframe 
# NOTE: It is important that the ID and predicted values match
submission = pd.DataFrame()
submission['id'] = X_test.store_id 
submission['predicted'] = np.asarray(y_test_pred)

# Save it to disk (`index=False` means don't save the index in the csv)
submission.to_csv('sample_submission.csv', index=False)
submission
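
The checklist above can be automated before uploading. The following is a minimal sketch of such a check; `check_submission` is a hypothetical helper, not something Kaggle provides, and the ids in the example are made up.

```python
import numpy as np
import pandas as pd

def check_submission(submission, expected_ids):
    """Raise AssertionError if the frame violates the format rules listed above."""
    assert list(submission.columns) == ['id', 'predicted'], 'Unexpected columns'
    assert len(submission) == len(expected_ids), 'Wrong number of rows'
    assert submission['id'].is_unique, 'Duplicate ids'
    assert set(submission['id']) == set(expected_ids), 'Ids do not match the test set'
    assert pd.api.types.is_numeric_dtype(submission['predicted']), 'Non-numeric predictions'
    assert np.isfinite(submission['predicted']).all(), 'NaN or infinite predictions'
    assert (submission['predicted'] >= 0).all(), 'Negative predictions'

# Toy example with made-up ids
ids = pd.Series(['s1', 's2', 's3'])
good = pd.DataFrame({'id': ids, 'predicted': [1.0, 2.5, 0.0]})
check_submission(good, ids)  # passes silently
```

In the notebook, the call would be `check_submission(submission, stores_test.store_id)`. Note that this only checks the format; it cannot tell whether each prediction is paired with the right id.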

### Uploading to Kaggle


Once a submission csv has been created, it can be uploaded to Kaggle. If you haven't already, you can enroll in the Kaggle competition by following [this special link](https://www.kaggle.com/t/3affe88e40c44dde87d1ff836ded9e92) along with the rest of your teammates.

You can upload submissions manually through the competition web page [as explained here](https://www.kaggle.com/docs/competitions#submitting-predictions). 


Alternatively, you can use the Kaggle API ([see here for installation instructions](https://github.com/Kaggle/kaggle-api)) and do it from the terminal with the following command template:

```bash
kaggle competitions submit tdt4173-2022-project2 -f <filepath> -m "<message>"
```

Where `<filepath>` in this case would be `./sample_submission.csv` and `<message>` is your own comment for the submission.

Note that your prediction performance on the test set is broken down into two parts; a `public` and a `private` one. When you upload a submission, you will immediately be able to see your public score, which is computed over a subset of the test set rows. The private score is calculated over the remaining datapoints and will not be visible until the end of the project, but is what ultimately determines the score-based part of your grade.

There is a limit to the maximum number of submissions you can submit on Kaggle each day (5 at the time of writing). Keep this in mind when submitting, but don't hesitate to use all your submissions every day; they will provide you with feedback and renew each new day.

## Next Steps

- Plot, summarize, and get familiar with the dataset. If you have any questions, don't hesitate to ask the course staff. If we don't know the answer, we can try to reach out to the data scientists at Plaace. Don't be surprised if you find the dataset to be a bit noisy. Like most real-world datasets, it is aggregated from many imperfect (and sometimes contradictory) sources.

- Try to make some very simple models. How well does a simple mean estimate do? Can you improve on the mean by including information from a single feature? Do any variables have simple, linear relationships with the revenue? Can you make the relationships (more) linear? In addition to getting you more familiar with the dataset, making simple models establishes baselines that can be used to reason about more complex approaches later.

- Try to apply different machine learning algorithms. The [Scikit-Learn](https://scikit-learn.org/stable/) package is a good starting point that contains stable implementations of many popular algorithms. In addition, we recommend you look into gradient boosting algorithms such as [xgboost](https://xgboost.readthedocs.io/en/stable/), [catboost](https://catboost.ai/), and [lightgbm](https://lightgbm.readthedocs.io/en/v3.3.2/). Keep in mind that as you start applying more complex learning algorithms, you start running the risk of overfitting and may need to use validation data to tune hyperparameters. 

- As you start to get several decent models, start thinking about ensembling them. This will typically allow you to squeeze out a bit more performance. In general, an ensemble works better if the individual member models make uncorrelated errors. Different algorithms, features, and even hyperparameters tend to lead to uncorrelated errors.

- Feel free to draw inspiration from others online; just adapt the solutions you find to the problem at hand and make sure you understand them. The Kaggle platform is a goldmine in this regard and contains tutorial notebooks on everything from [basic data science](https://www.kaggle.com/code/kanncaa1/data-sciencetutorial-for-beginners) and [general machine learning](https://www.kaggle.com/code/kanncaa1/machine-learning-tutorial-for-beginners/notebook) to [boosting](https://www.kaggle.com/code/carlmcbrideellis/an-introduction-to-xgboost-regression/notebook). 
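
As a starting point for the simple baselines and validation mentioned above, here is a minimal sketch of a hold-out comparison between a global constant and a per-group constant. The data is synthetic, the `group` column is a made-up stand-in for a categorical feature, and `rmsle_demo` is a hypothetical helper mirroring the metric defined earlier.

```python
import numpy as np
import pandas as pd

def rmsle_demo(y_true, y_pred):
    """Root Mean Squared Log Error."""
    return np.sqrt(np.mean(np.square(np.log1p(y_pred) - np.log1p(y_true))))

rng = np.random.default_rng(0)
n = 2_000
# Synthetic data: groups with different typical revenue levels
group = rng.choice(['a', 'b', 'c'], size=n)
level = pd.Series(group).map({'a': 0.5, 'b': 1.5, 'c': 3.0}).to_numpy()
df = pd.DataFrame({'group': group, 'revenue': rng.lognormal(mean=level, sigma=0.5)})

# Random 80/20 hold-out split
idx = rng.permutation(n)
train, valid = df.iloc[idx[:1600]], df.iloc[idx[1600:]]
y_valid = valid['revenue'].to_numpy()

# Baseline 1: one constant for every store, fitted in log1p space
global_log_mean = np.log1p(train['revenue']).mean()
pred_global = np.full(len(valid), np.expm1(global_log_mean))

# Baseline 2: one constant per group; unseen groups fall back to the global value
group_log_mean = np.log1p(train['revenue']).groupby(train['group']).mean()
pred_group = np.expm1(
    valid['group'].map(group_log_mean).fillna(global_log_mean)
).to_numpy()

print(f'Global constant: {rmsle_demo(y_valid, pred_global):.4f}')
print(f'Per-group value: {rmsle_demo(y_valid, pred_group):.4f}')
```

Scores computed on held-out rows like this give a rough estimate of test performance without spending any of the daily Kaggle submissions.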