<a href="https://colab.research.google.com/github/nielsenguest/2020_ecd/blob/master/2020_ECD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



---


Getting started
---
This notebook will help you select variables, create models and validate their outcomes. To get started, run the first few code blocks to import the required modules and data. You can run a block of code by clicking on it and either hitting the Run button on the left or pressing *Ctrl + Enter*. If you have any questions, don't hesitate to ask us!

In [None]:
# Install modules that aren't available by default
!pip install -q statsmodels

In [None]:
# Import modules
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod import families
from statsmodels.tools import eval_measures
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from google.colab import files

In [None]:
# Run this block to upload the data (which is located in your Desktop folder)
uploaded = files.upload()

In [None]:
# Read the data in
data = pd.read_csv('data.csv')

# View the first 5 rows of the data
data.head(5)



---



About the data
---
In the code block above we imported the data. The goal is to build a model that gives insight into the effectiveness of media on the KPI *kpi_familiarity*. The data contains four kinds of variables, each identified by its prefix:


1.   *kpi_* : The KPI (key performance indicator) is the variable that we want to model
2.   *sd_*  : Socio-demographic variables, e.g. gender and age
3.   *ctrl_*  : Control variables, e.g. whether a respondent has cable TV
4.   *media_*   : Media variables, i.e. how often the respondent was reached by the campaign via a certain media channel


To make life easier, we already included a dummy-coded (0/1) version of each "raw" variable. For example, we say that the respondent is familiar when he or she answered "Know a lot about it", "Know everything about it" or "Know a little about it" (run the code block below to see a visual representation). A more detailed overview of all the variables in the data can be found in the Excel file or [here](https://github.com/nielsenguest/strategy_tour_2020/blob/master/variable_overview.csv).
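
To illustrate how such a recoding works, here is a minimal sketch on a toy series. The three "familiar" answers come from the text above; the answer "Never heard of it" is made up for the example and may not match the actual categories in the data.

```python
import pandas as pd

# Answers that count as "familiar" (taken from the description above)
familiar_answers = [
    'Know a lot about it',
    'Know everything about it',
    'Know a little about it',
]

# Toy raw answers; 'Never heard of it' is a hypothetical category
raw = pd.Series([
    'Know a lot about it',
    'Never heard of it',
    'Know a little about it',
])

# 1 if the answer is in the "familiar" list, 0 otherwise
recoded = raw.isin(familiar_answers).astype(int)
print(recoded.tolist())
```

You do not need to do this yourself for the existing variables, since the recoded versions are already in the data.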


In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(7, 7))
sns.countplot(x='kpi_familiarity_raw', data=data.sort_values('kpi_familiarity_raw'), ax=ax[0])
ax[0].tick_params(axis='x', labelrotation=90)  # Rotate the long answer labels
sns.countplot(x='kpi_familiarity', data=data, ax=ax[1])
ax[0].set_title('Raw KPI')
ax[1].set_title('Recoded KPI')
plt.tight_layout()



---


Let's start modeling
---
In the code below you can create your own model! We give you the option to choose between two models, being OLS and logistic regression, but feel free to code another model as well. In case you would like to refresh your memory about the model types: [OLS regression](https://en.wikipedia.org/wiki/Ordinary_least_squares) and [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression). 

Think carefully about your model choice and the limitations of that model. Here are some things to think about: 

*   What are the variables you want to include?
*   How can you test the performance of your model?
*   What is the interpretation of the estimated coefficients?
*   What is a good cut-off value?
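
As a starting point for the cut-off question, the sketch below compares a few cut-off values on toy data. The labels and probabilities are made up for illustration, and `confusion_counts` is a hypothetical helper written for this example, not part of any library.

```python
import numpy as np

def confusion_counts(y_true, y_prob, cut_off):
    """Return (tn, fp, fn, tp) for predictions thresholded at cut_off."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= cut_off).astype(int)
    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    tn = int(np.sum((y_true == 0) & (y_hat == 0)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))
    return tn, fp, fn, tp

# Toy labels and predicted probabilities (made up for illustration)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_prob = [0.10, 0.40, 0.55, 0.45, 0.60, 0.80, 0.90, 0.30]

for cut_off in (0.3, 0.5, 0.7):
    tn, fp, fn, tp = confusion_counts(y_true, y_prob, cut_off)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)  # share of actual 1s that are found
    specificity = tn / (tn + fp)  # share of actual 0s that are found
    print(f'cut-off {cut_off}: accuracy={accuracy:.2f}, '
          f'sensitivity={sensitivity:.2f}, specificity={specificity:.2f}')
```

A lower cut-off finds more of the familiar respondents but also flags more unfamiliar ones, so the "best" value depends on which kind of error matters more to you.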

**How to get started?**

You'll first have to select the variables you want to include in your model. Tip: run the code block below to quickly see all possible independent variables. Of course you are free to create new (interaction) variables too.

In [None]:
# Show all socio-demographic, control & media variables in the data
list(data.filter(regex='^sd_|^ctrl_|^media_'))
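
If you want to try an interaction variable, one simple option is to multiply two dummies. The sketch below uses toy data; `media_tv` and the name of the new column are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy data; 'media_tv' is a hypothetical column name used for illustration
toy = pd.DataFrame({
    'sd_age_group_18_26': [1, 0, 1, 0],
    'media_tv': [1, 1, 0, 0],
})

# The product of two 0/1 dummies is 1 only when both are 1
toy['int_age_18_26_x_media_tv'] = toy['sd_age_group_18_26'] * toy['media_tv']
print(toy)
```

In the notebook you would apply the same idea to `data` and then add the new column name to your list of X variables.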

After selecting the variables you want to include in your model, you will have to choose the model type. 

```
# Select OLS model
model_type = 'ols'
```

```
# Select logistic model
model_type = 'logistic'
```

That's all you need to know to start modeling. Good luck!


---




Run your model
---


In [None]:
# Select response variable
y_var = 'kpi_familiarity'  # Do not change this

# Specifying the name of the X variables you want to include in your model
# Note: the variables selected below are an example. Feel free to make your own selection  
x_vars = ['sd_age_group_18_26', 'sd_age_group_27_35']

# Create training and test data
X_train, X_test, y_train, y_test = train_test_split(data[x_vars], data[y_var],
                                                    test_size=0.33, random_state=42)


# Choose here whether you want an OLS or a logistic model
model_type = 'ols'  # logistic or ols

# For OLS models:
if model_type == 'ols':
  model = sm.OLS(endog=y_train, exog=X_train).fit()
  print(model.summary())

  # Get other metrics
  y_pred = model.predict(X_test)  # Predicted values; with a 0/1 KPI they act as probabilities but can fall outside [0, 1]
  print(f'MSE: {np.mean((y_test - y_pred)**2)}')
  
# For logistic regression:
if model_type == 'logistic':
  # Fit on the training data only, so the test set stays out of sample
  model = GLM(y_train, X_train,
              family=families.Binomial()).fit(attach_wls=True, atol=1e-10)
  print(model.summary())

  # Confusion matrix
  cut_off = 0.5  # Set a cut-off value
  y_pred = model.predict(X_test)  # These are the predicted probabilities
  y_pred = np.where(y_pred >= cut_off, 1, 0)
  cf_matrix = confusion_matrix(y_test, y_pred)
  print(f'Out of sample predictions: \n {cf_matrix}')
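
When you read the model summary, keep in mind that the coefficients mean different things per model type. In an OLS model on a 0/1 KPI, a coefficient is the change in the predicted value itself. In a logistic model, a coefficient is a change in log-odds, so it is often easier to look at the odds ratio. The sketch below shows the conversion with made-up numbers; it assumes a model with an intercept, and `odds_ratio` and `probability` are hypothetical helpers written for this example.

```python
import numpy as np

def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return np.exp(coefficient)

def probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1 / (1 + np.exp(-log_odds))

# Made-up numbers for illustration, not estimates from the data
intercept = -0.5
coef_dummy = 0.7

print(f'Odds ratio: {odds_ratio(coef_dummy):.2f}')
print(f'P(familiar | dummy = 0): {probability(intercept):.2f}')
print(f'P(familiar | dummy = 1): {probability(intercept + coef_dummy):.2f}')
```

For a fitted logistic model in this notebook, `np.exp(model.params)` gives the odds ratios of all variables at once.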




---


Save your model
---
You can save your model by running the code block below. Please make sure that you give your model a name. For example:
```
# Name your model model_1
model_name = 'model_1'
```


In [None]:
# Specify the model name here
model_name = '' 

# We check whether this directory exists; if not, we'll create it
if not os.path.exists('models/'):
  os.mkdir('models/')

# Check whether this model is already saved
if os.path.exists(f'models/{model_name}_{model_type}.pkl'):
  print('You already saved a model with the same name to this location. Are you sure you want to overwrite this model?')
  print('yes/no ?')

  x = input()
  if x == 'yes':
    model.save(f'models/{model_name}_{model_type}.pkl')
    print('Model is saved')
  else:
    print('Model not saved')
else:    
  model.save(f'models/{model_name}_{model_type}.pkl')
  print('Model is saved')




---


Load your model
---
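You can load a model by specifying its file name without the `.pkl` extension (see below for an example) and running the following blocks of code. Note that models are saved as `{model_name}_{model_type}.pkl`, so the name includes the model type.

```
# Load the OLS model that was saved as model_1
load_model = 'model_1_ols'
```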



In [None]:
# Check which models you saved
model_folder = 'models/'
print(f'Your saved models: {os.listdir(model_folder)}')

In [None]:
# Load your model
load_model = ''  # e.g. model_1_ols
model = sm.load(f'models/{load_model}.pkl')
print(model.summary())

In case you want to see a summary of all your saved models, run the following block of code.

In [None]:
# Show summary of all saved models
model_folder = 'models/'
for file_name in os.listdir(model_folder):
  try:
    # Use a separate name so the current `model` is not overwritten
    saved_model = sm.load(os.path.join(model_folder, file_name))
    print(f'Summary for model {file_name}:\n{saved_model.summary()}')
  except Exception as error:
    print(f'Something went wrong loading model {file_name}: {error}')



---

Additional analysis
---

When you are satisfied with your model(s), it would be good to think about doing some additional analysis on the data and/or results. Are there any things in the data that could be interesting information for the client? Or are there any visualizations that would help them gain more insight into the model (e.g. variable importance, correlation analysis, etc.)? Be creative! 
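
As one possible starting point, the sketch below runs a simple correlation analysis on randomly generated toy data. The column names `media_tv` and `media_online` are hypothetical; in the notebook you would select the relevant numeric columns from `data` instead.

```python
import numpy as np
import pandas as pd

# Random 0/1 toy data, generated only to make the example runnable
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'kpi_familiarity': rng.integers(0, 2, size=200),
    'media_tv': rng.integers(0, 2, size=200),      # hypothetical column
    'media_online': rng.integers(0, 2, size=200),  # hypothetical column
})

# Pairwise Pearson correlations between all columns
corr = toy.corr()

# Correlation of each variable with the KPI, strongest first
print(corr['kpi_familiarity'].drop('kpi_familiarity').sort_values(ascending=False))
```

Passing `corr` to `sns.heatmap` is one way to turn the table into a visualization. Keep in mind that a correlation alone does not show that a media channel caused the familiarity.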

In [None]:
# Code your additional analysis here
