# Summary of findings

The [notebook](./location_analysis.ipynb) containing the analysis is quite long, and tries to show my thought process as I tackled this problem. This notebook summarizes my high-level findings.


**Note:**
All section references (i.e. $\S$X.X) are references to the [analysis notebook](./location_analysis.ipynb), not sections in this notebook.

## 1. Data issues

The data was mostly clean, with only a couple of issues:
- Direction data was missing for 58 rows
- There was one significant outlier in the response variable, which was dropped

The validation set was scanned to see if there were data points that would be obvious errors (which we would not want to penalize our model for) but the validation set looked clean.

## 2. Dependence of rows -- 25 stores with multiple measurements

The biggest issue that came up when dealing with this dataset was the time series nature of the data. Specifically, by looking at features that we would expect to remain constant for a store from day to day (the region, the number of seats, the number of parking spaces, the market type, and the shop type), I was able to determine that there were likely 25 distinct stores, with repeated measurements on different days.
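
A minimal sketch of how repeated measurements could be grouped into stores, assuming a combination of store-constant columns identifies a store. The toy frame and the `store_id` column name are illustrative and are not taken from the analysis notebook.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
toy = pd.DataFrame({
    "Parking_slots": [0, 0, 13, 13, 15],
    "region":        [1, 1, 1, 1, 1],
    "Seats":         [162, 162, 100, 100, 53],
    "deltaRevenue":  [1.2, -0.4, 3.1, 2.7, 0.5],
})

# Columns assumed to stay fixed for a given store.
store_cols = ["Parking_slots", "region", "Seats"]

# ngroup() labels each distinct combination with an integer id.
toy["store_id"] = toy.groupby(store_cols).ngroup()

n_stores = toy["store_id"].nunique()
rows_per_store = toy.groupby("store_id").size()
print(n_stores)
print(rows_per_store.to_dict())
```

The resulting `store_id` can then serve as the grouping key for any step that has to treat all rows of a store together.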

Given more time, I would have liked to see if I could make predictions about the store using the features that are constant on the store. One commonly used technique is mixed-effects modeling, available in packages such as `statsmodels` (although many different hierarchical models would also work). Hierarchical models are not part of the `sklearn` toolkit. With the small number of stores, we want to be careful that the extra features do not simply "encode" which store we are in, so that we can be confident our model will generalize out of sample.

Mixed-effects models, where the fixed effects were the store and the random effects were the different days, would also make a lot of sense for the application. When a new store is built, we probably don't care about the effect of the store on a particular day, but we are likely to care about the effect of the store averaged over many days (as the store will continue to exist once it is constructed). However, the prompt for the task was to predict the `deltaRevenue` on each day specifically, so this would not have satisfied the task.

There are variables such as `Parking_slots` and `Seats` that would tell us about the size of a store, which are plausibly very strong predictors. I would like to try making a simple model that uses some of these features on the individual stores to predict the average `deltaRevenue`, but this would need to be evaluated using a statistical criterion (e.g. AIC or WAIC) instead of cross-validation, because of the small number of stores.
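
As a rough sketch of what such a store-level comparison might look like, the snippet below collapses synthetic daily rows to one row per store and compares a few least-squares models by AIC. The data, the `store_id` and `mean_delta` names, and the helper `gaussian_aic` are all hypothetical. The AIC uses the Gaussian least-squares form $n\ln(\mathrm{RSS}/n) + 2k$, which is only defined up to an additive constant.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 25 stores, 20 days each (all numbers are made up).
n_stores, n_days = 25, 20
stores = pd.DataFrame({
    "store_id": np.arange(n_stores),
    "Seats": rng.integers(50, 220, n_stores),
    "Parking_slots": rng.integers(0, 20, n_stores),
})
daily = stores.loc[stores.index.repeat(n_days)].reset_index(drop=True)
daily["deltaRevenue"] = 0.02 * daily["Seats"] + rng.normal(0, 1, len(daily))

# Collapse to one row per store: the target is the store's mean deltaRevenue.
per_store = daily.groupby("store_id").agg(
    Seats=("Seats", "first"),
    Parking_slots=("Parking_slots", "first"),
    mean_delta=("deltaRevenue", "mean"),
)

def gaussian_aic(X, y):
    """AIC for least squares with Gaussian errors, up to an additive constant."""
    X = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

y = per_store["mean_delta"].to_numpy()
aic_by_model = {
    "intercept only": gaussian_aic(np.empty((len(y), 0)), y),
    "Seats": gaussian_aic(per_store[["Seats"]].to_numpy(), y),
    "Seats + Parking_slots": gaussian_aic(
        per_store[["Seats", "Parking_slots"]].to_numpy(), y
    ),
}
print(aic_by_model)
```

Lower AIC is preferred. With only 25 rows, each extra parameter is penalized without having to hold any stores out.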

As a very specific example, there is only one store with 13 parking slots, as we can see in the table below. There are multiple stores with `0` parking slots, but they are spread across different regions. There _is_ very likely a signal to extract here, but I did not have time to extract it, as many of the standard `sklearn` pipelines are designed with independent rows in mind. Using store-level features would mean using a hierarchical model package, _or_ putting a lot of time into writing my own custom cross-validation.

```python
import pandas as pd

train = pd.read_csv('data/train.csv')
# Distinct values of each column per (Parking_slots, region, Seats) combination
uniqueness = train.groupby(['Parking_slots', 'region', 'Seats']).nunique()

# Which properties are unique (i.e. only one value for a given (parking, region, seats) combo)?
uniqueness.loc[:, uniqueness.max() == 1]
```

| Parking_slots | region | Seats | directionCode | type_dtsf | mtype |
|---|---|---|---|---|---|
| 0 | 1 | 162 | 1 | 1 | 1 |
| 0 | 3 | 158 | 1 | 1 | 1 |
| 0 | 3 | 210 | 1 | 1 | 1 |
| 0 | 4 | 86 | 1 | 1 | 1 |
| 0 | 8 | 73 | 1 | 1 | 1 |
| 0 | 8 | 105 | 1 | 1 | 1 |
| 0 | 8 | 151 | 1 | 1 | 1 |
| 13 | 1 | 100 | 1 | 1 | 1 |
| 15 | 1 | 53 | 1 | 1 | 1 |
| 15 | 1 | 106 | 1 | 1 | 1 |


I did make one attempt to use a store-specific feature, `mtype` (market type), in $\S$3.4. I encoded this feature using _target encoding_. Because of the pipeline issues, I did this in a way that leaked data across the different stores. However, even with this leakage, adding the encoded feature did not improve the results.
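
For reference, one way to avoid this kind of leakage is to compute each row's encoding only from *other* stores. The sketch below is not what was done in $\S$3.4. It uses made-up data, and the `store_id` column and the helper `oof_target_encode` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy data: 6 hypothetical stores, 4 days each (values are made up).
toy = pd.DataFrame({
    "store_id": np.repeat(np.arange(6), 4),
    "mtype": np.repeat(["Urban", "Urban", "Rural", "Rural", "Suburban", "Urban"], 4),
    "deltaRevenue": np.arange(24, dtype=float),
})

def oof_target_encode(df, col, target, groups, n_splits=3):
    """Encode `col` by the target mean computed on stores in the other folds."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fit_idx, enc_idx in GroupKFold(n_splits=n_splits).split(df, groups=groups):
        fit_part = df.iloc[fit_idx]
        means = fit_part.groupby(col)[target].mean()
        fallback = fit_part[target].mean()  # for categories unseen in the fit part
        encoded.iloc[enc_idx] = (
            df.iloc[enc_idx][col].map(means).fillna(fallback).to_numpy()
        )
    return encoded

toy["mtype_encoded"] = oof_target_encode(
    toy, "mtype", "deltaRevenue", toy["store_id"]
)
print(toy.groupby("store_id")["mtype_encoded"].first())
```

In this toy data only one store is `Suburban`, so its encoding has to fall back to the overall mean of the other stores. With 25 stores and rare market types, the same thing would happen in practice, which limits how much this feature can carry.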

Because we want the model to generalize to _new_ stores, I used `GroupKFold` validation when setting up my grid search. This ensured that no "store" (i.e. parking, seat, region combo) would appear in both my training and hold-out folds.
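
A minimal sketch of a grouped grid search on synthetic data. The feature matrix, the store labels in `groups`, and the parameter grid are placeholders, not the ones used in the analysis notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 stores with 12 rows each and 4 numeric features.
n_stores, n_rows = 10, 12
groups = np.repeat(np.arange(n_stores), n_rows)
X = rng.normal(size=(n_stores * n_rows, 4))
y = X[:, 0] + rng.normal(scale=0.5, size=len(X))

cv = GroupKFold(n_splits=5)

# Sanity check: no store lands on both sides of any split.
for train_idx, test_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [1, 5]},
    cv=cv,
    scoring="r2",
)
search.fit(X, y, groups=groups)  # groups is passed on to cv.split
print(search.best_params_)
```

The key detail is passing `groups` to `fit`, so that every candidate is scored only on stores it never saw during training.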


### Summary

* There were 25 independent stores, which led to some features being very strongly correlated
* The cross-validation method was modified to make sure we could generalize, by ensuring that no store appeared in both the training and validation folds
* The set of features that were constant on a store were (`Parking_slots`, `Seats`, `region`, `mtype`, `type_dtsf`, `directionCode`)
* Building a hierarchical or mixed-effects model would probably be the right way of approaching this, and would allow us to measure the store-specific properties (fixed effects), while averaging over the daily effects. Assuming these are not "pop-up" shops, we are likely more interested in the long-term effects than the value of `deltaRevenue` on a specific day.
* One attempt to build a model using a _crude_ version of mixed effect modeling (making a categorical feature into one correlated with the target) was tried in $\S$3.4 and made model performance worse.
* I would expect there to be strong signal in `Parking_slots` and / or `region` (one relates to the size of the store, the other presumably encodes affluence). I tried working with `region` a little bit, but did not have time to investigate `Parking_slots`.

## 3. Feature and data leakage

In $\S$1.5 we look at how some of the features are related to the target, `deltaRevenue`.

In that section, I discuss some of the issues about what features we are able to use without worrying about data leakage. If the goal of this model is to make new predictions about a store that does not yet exist, we will not have access to the `totalRevenue` on a day before we have built the store! 

This problem exists for many of the features (e.g. the number of transactions has the same problem, and the number of activities might be set by external forces, or might not be available either). We don't know anything about how the features `r_*` are measured either, so we don't know if there is data leakage there.

Here is a copy of the relevant discussion (copied and pasted):

---
---

A few callouts here:

1. Both `transactCount` and `totalRevenue` are positively correlated with `deltaRevenue`. Basically, the more your store does in revenue, the larger we expect the `deltaRevenue` to be.
2. Depending on whether `transactCount` and `totalRevenue` are of the _existing_ store or the _new_ store, it may or may not be fair to include them. In both cases, we don't actually have the measurement of transactions or revenue until the day in question, so we wouldn't be able to put measured values into our model!
  - If it is for the transactions of the _existing store_, we probably should not include it. We have measurements of the typical revenue of the store _after_ the new store was put in, but we don't know what it was before. The thing that we are trying to measure is the effect of putting the store in!
    - We could do something more complicated: use `deltaRevenue` to reconstruct the revenue _before_ the new store moved in, via `train['counterFactualRevenue'] = train['totalRevenue']/(1 + train['deltaRevenue']/100)`. Then we could use estimates of the existing store performance for the feature `counterFactualRevenue`, and not have `totalRevenue` as a feature at all. We would also do this transformation on the validation set (it looks circular but isn't as long as we don't use `totalRevenue` -- the assumption is that we would use this by projecting the current revenue of the store in the absence of the change).
    - If we did this, we would need to investigate how the `deltaRevenue` was measured, to ensure that we didn't end up with a circular definition!
  - If it is for the transactions of the _new store_, we still don't have the data, but we could use the information strategically. For example, if we know that we need to do at least 100 transactions for the new store to be viable, it makes sense to ask "we are going to put a store here, and have at least 100 transactions in it" to assess the impact and check we do not cannibalise the existing store too much.
  - At the moment, I will include these features, as it isn't clear what they mean. I am just including notes on how we would modify these fields.
3. We should check for collinearity in general, but in particular between `transactCount` and `totalRevenue` (we would expect Revenue and Transactions to be strongly correlated!)
4. We have masked the effect of time -- a lot of the scatter can come about by looking at the same store on different days (especially for `region`, `Parking_slots` , and `Seats`)
5. `totalActivitiesRefcircle` and `totalCustomerRefcircle` also have roughly the same shape as `transactCount` and `totalRevenue` (at least just looking at it on this scale).

Let's look at possible correlations between `totalActivitiesRefcircle`, `totalCustomerRefcircle`, `transactCount`, and `totalRevenue`:

---
---


I did end up using these features, as I was not given any instruction that leakage was a concern, but I wanted to call it out.
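
The counterfactual revenue reconstruction described in callout 2 can be sketched as follows. This assumes `deltaRevenue` is a percentage change relative to the revenue without the new store, which would need to be confirmed against how the field was actually measured. The rows are made up.

```python
import pandas as pd

# Made-up rows; deltaRevenue is assumed to be a percentage change.
toy = pd.DataFrame({
    "totalRevenue": [110.0, 90.0, 200.0],
    "deltaRevenue": [10.0, -10.0, 0.0],
})

# Undo the percentage change to estimate revenue without the new store.
toy["counterFactualRevenue"] = toy["totalRevenue"] / (1 + toy["deltaRevenue"] / 100)

# Drop the observed revenue so it cannot be used as a feature.
features = toy.drop(columns=["totalRevenue"])
print(features)
```

For example, a store observed at 110 with a `deltaRevenue` of 10 maps back to a counterfactual revenue of roughly 100, and a row with `deltaRevenue` of 0 is left unchanged.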

## 4. Feature transformations

Features that were very large and right skewed (right tailed) were logged. This brought the range of the numbers back down to manageable values, eliminated the skew, and made sense for linear models in terms of the application (if we are modeling percentage change in revenue, this is an _additive_ effect on the log of revenue). These were:
- `totalRevenue`
- `totalActivitiesRefcircle`
- `transactCount`
- `totalCustomersRefcircle`


The total number of customers was found to be strongly correlated with activities, so it was dropped ($\S$ 1.5)
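
A small sketch of the log transform and the correlation check, on synthetic lognormal data built so that customers track activities closely. The numbers are made up, and plain `np.log` assumes strictly positive values (`np.log1p` would be an alternative if zeros occur).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed columns (lognormal draws; values are made up).
activities = rng.lognormal(mean=8, sigma=1, size=500)
toy = pd.DataFrame({
    "totalRevenue": rng.lognormal(mean=10, sigma=1, size=500),
    "totalActivitiesRefcircle": activities,
    # Customers constructed to follow activities, mimicking the correlation.
    "totalCustomersRefcircle": activities
    * rng.lognormal(mean=-1, sigma=0.1, size=500),
})

skew_before = toy.skew()
logged = np.log(toy)  # assumes strictly positive values
skew_after = logged.skew()

corr = logged["totalActivitiesRefcircle"].corr(logged["totalCustomersRefcircle"])
print(skew_before.round(2).to_dict())
print(skew_after.round(2).to_dict())
print(round(corr, 3))
```

A correlation this close to 1 between two logged features is the kind of evidence that would justify dropping one of them.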

In a sample of a couple of random stores, we saw that there was no overall trend in individual stores' sales against date. We saw in $\S$1.5 that some of the market types, like *Rural* and *Suburban*, seemed to have a different `deltaRevenue`, but were less well represented in our data.

In $\S$1.6 we looked at days of the week, and saw surprisingly little effect on `deltaRevenue`. Monday has a slightly wider interquartile range (IQR), Saturday has a slightly smaller IQR, and visually Tuesday seemed to have a longer tail of outliers (but I did not test this for significance). I did not see a strong day of week or weekend pattern, at least on the target.

There was a strong relationship between holidays, day of week, and `deltaRevenue`, although holiday rows account for approximately 6% of the data.

Other transformations, like whether to standard scale or not, or how to encode categorical variables, were done at the level of the individual models.

## 5. Modeling

I looked at a Lasso model and a variety of different RandomForest models. I was trying to pick models that would do a good job with feature selection, as I was still worried about the relatively small number of stores in our data set. Since the purpose of this model was to generalize to new stores, I wanted to be pretty aggressive about eliminating features -- especially those correlated directly with the store (I was less concerned about eliminating features that changed daily, like revenue or `r_*` features).

For the linear model (Lasso), I performed standard scaling on the features, so that regularization was applied on a "fair" footing. For tree-based models, I skipped this as it isn't necessary (the ordering of points along a particular feature matters, but the specific values do not).
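
A minimal sketch of the scaled Lasso setup on synthetic data, with grouped folds used to pick the regularization strength. The data and the store labels in `groups` are placeholders. The actual pipeline in the analysis notebook may differ.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

n_stores, n_rows = 10, 15
groups = np.repeat(np.arange(n_stores), n_rows)
# Features on very different scales; only the first one carries signal.
X = rng.normal(size=(n_stores * n_rows, 5)) * np.array([1, 10, 100, 1000, 0.1])
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=len(X))

# LassoCV accepts an iterable of (train, test) index pairs,
# so the grouped folds are materialised up front.
splits = list(GroupKFold(n_splits=5).split(X, y, groups))

model = make_pipeline(StandardScaler(), LassoCV(cv=splits))
model.fit(X, y)

coefs = model.named_steps["lassocv"].coef_
print(np.round(coefs, 3))
```

Without scaling, the L1 penalty would fall unevenly on features purely because of their units. Note that in this sketch the scaler is fitted once on all rows before the internal folds are formed. A stricter setup would refit the scaler inside each fold.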

We tried four models, all selected with the goal of being able to do feature reduction. The models were:


1. **LASSO**, a simple Lasso model, with cross-validation to determine the regularization strength
2. **Basic forest**, a random forest, with categorical variables one-hot encoded
3. **Reduced forest**, a random forest on a reduced set of features, to try to counteract the wide variance seen between folds in model 2
4. **Advanced forest**, a random forest, using feature selection (via 2) and trying to use the target to determine the encoding for the market type (done in a way that leaked data).

I found model 2 performed the best. Model 3 was similar, with slightly fewer parameters, but both models 2 and 3 had large variations between the folds. They were statistically indistinguishable; I kept model 2 because I had built out more of the pipeline with it in mind. The high degree of variability between folds did indicate that I was overfitting.

The extra information from encoding the market category in model 4 made the model worse. This suggests that we really were encoding information about the specific stores when training, and we cannot use this feature without doing a lot more modeling.

We move to validation with the **Basic forest**.

## 6. Results

* We used the **Basic forest** model
* When doing GroupKFold validation, we were getting $R^2$ values of approximately 0.23 (with large variation between folds)
* It was clear there was still overfitting, as the score of the best model on the full training data set was about 0.66 (compared to 0.23 out of sample).
* Performance on the validation set was a lot worse, with $R^2\approx 0.12$

It would seem that aiming for simpler models would still be better, although attempts to reduce the number of features didn't give better out-of-sample results (e.g. the **Reduced forest** performed worse than the original model!).

The biggest gains in this exercise would probably come from measuring the effect of the stores better. The crude way of doing this is to construct features of the stores based on `Parking_slots` and `Seats`, and this is probably the next thing I would try.

We may do even better (and perhaps it might be a more important question to answer) by fitting a mixed-effects model, where we are not trying to answer what the `deltaRevenue` is on a given day (as a non-popup store is a long-term commitment), but instead treat the different days as random effects, and focus on the fixed effects of a given store.