## The challenge:

Your challenge now is to build a series of regression equations that predict advertising effectiveness (clicks). For this project, I want you to imagine that you're working as a digital advertising strategist. You're trying to learn what it is about the advertising campaigns you're looking at that drove clicks.

You've got the outcome performance data for campaigns, as well as some info about the ads themselves. Now, it's time to build a predictive algorithm that shows us what features most drove clicks.

## Background Info about the data:

Shortly after the 2016 election, Congress released over 2,600 Facebook ads, along with the actual [advertisements themselves as PDFs](https://democrats-intelligence.house.gov/social-media-content/social-media-advertisements.htm). I wrote a Python script to extract the data from the PDFs and convert it into a CSV. I also published a paper on the data, which you can [check out here](http://chrisjvargo.com/wp-content/uploads/2020/09/Vargo-C.-Hopp-T.-2020.pdf), but please don't distract yourself with that right now; you have an exam to finish.

Each row is an advertisement that ran. The columns correspond to the attributes of the ad: the targeting parameters, when it ran, the amount spent and so on. 

For this dataset, you're going to try to **predict the "Ad Clicks"**, that is, the number of clicks an ad got.

# Imports

In [None]:
import numpy as np
from numpy import array
import pandas as pd
import warnings
import sklearn
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import HuberRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import math

## Load the data

Get the rusdata_utf8.csv data file from the [files in Canvas](https://canvas.colorado.edu/files/46911073/download?download_frd=1) and upload it to your Google Drive.

In [None]:
# Import data
DATA_FILE = "drive/MyDrive/rusdata_utf8.csv"
data = pd.read_csv(DATA_FILE)

In [None]:
data.head()

Unnamed: 0,launched,e_day,days_elec_abs,days_elec_pminus,bin_before,days_elec_flip,medium,medium_bin,text,landpage,...,Mexico.Hispanicculture,Interests:.AppleMusic,People Who Match:.LawEnforcementLife,Interests:.Kemetism,Facebook access (mobile):.allmobiledevices,Gender:.Female,Interests:.AlJazeera,Age:.16-40,United States:.BaltimoreMaryland;Ferguson,on pages:.InstagramFeed
0,2/20/17,11/8/16,104,-104,0,104,Facebook,1,God Bless Dixie! The South will rise again!,https://www .facebook.com/South-United-1777037...,...,0,0,0,0,0,0,0,0,0,0
1,10/3/16,11/8/16,36,36,1,-36,Facebook,1,Stop Islamophobia,https://www.facebook.com/MuslimAmerica/,...,0,0,0,0,0,0,0,0,0,0
2,5/12/16,11/8/16,180,180,1,-180,Facebook,1,"Only for Chrome users! Any music via ""FaceMusi...",https:Hmusicfb.info/,...,0,0,0,0,0,0,0,0,0,0
3,5/12/16,11/8/16,180,180,1,-180,Facebook,1,"Only for Chrome users! Any music via ""FaceMusi...",https:Hmusicfb.info/,...,0,0,0,0,0,0,0,0,0,0
4,5/12/16,11/8/16,180,180,1,-180,Facebook,1,Free online player! Just add in ur browser and...,https://musicfb.info/,...,0,0,0,0,0,0,0,0,0,0


# Data Cleaning and prep

Let's take a look at the columns inside of the data

In [None]:
list(data)

There are almost 1,500 columns of data here. If we were to exhaustively look at each column, we'd have a thesis on our hands. For the sake of your sanity, we're going to look at a small subset of columns, specifically these:

In [None]:
good_columns = [
    'days_elec_pminus',
    'bin_before',
    'medium_bin',
    'impress',
    'clicks',
    'spend',
    'toxic',
    'sevtoxic',
    'idattack',
    'insult',
    'profane',
    'threat',
    'sexexp',
    'flirt',
    'a_author',
    'a_commentor',
    'incoh',
    'inflam',
    'obscene'
]

## Filter_by_columns

Create a function that filters the dataframe and returns a dataframe with only the provided columns. This function will then be used to filter our dataframe by `good_columns`.

In [None]:
def filter_by_columns(data, columns):
    """Returns a DataFrame that is DataFrame `data` filtered to include
    only the columns specified by the list `columns`.
    """
    new_df = data[columns]
    return new_df

In [None]:
filtered_data = filter_by_columns(data, good_columns)

In [None]:
#~~ grader-ignore:
filtered_data.describe()
#~~ /grader-ignore

Unnamed: 0,days_elec_pminus,bin_before,medium_bin,impress,clicks,spend,toxic,sevtoxic,idattack,insult,profane,threat,sexexp,flirt,a_author,a_commentor,incoh,inflam,obscene
count,2603.0,2603.0,2603.0,2603.0,2603.0,2603.0,2567.0,2567.0,2567.0,2567.0,2567.0,2567.0,2567.0,2567.0,2567.0,2567.0,2567.0,2567.0,2567.0
mean,84.54514,0.588167,0.963504,15587.05,1434.482904,2259.621978,0.282537,0.178731,0.363708,0.240508,0.169406,0.281331,0.144173,0.348318,0.085155,0.23388,0.621705,0.412045,0.139888
std,213.464054,0.49226,0.187558,52209.33,3856.227669,10412.677227,0.196168,0.155516,0.263367,0.18161,0.139665,0.188673,0.105656,0.157812,0.122024,0.244604,0.2018,0.242694,0.20183
min,-278.0,0.0,0.0,1.0,0.0,0.0,0.005326,0.002737,0.007891,0.005239,0.005024,0.010024,0.007683,0.035857,2e-06,1.3e-05,0.00236,6.7e-05,3.3e-05
25%,-125.0,0.0,1.0,608.5,40.0,127.485,0.106319,0.048998,0.11051,0.08143,0.056207,0.135015,0.071904,0.226852,0.011162,0.041756,0.474158,0.190638,0.027065
50%,47.0,1.0,1.0,3450.0,245.0,300.0,0.241943,0.137857,0.301972,0.219467,0.144725,0.233145,0.13093,0.356524,0.041313,0.145668,0.641103,0.426272,0.057572
75%,211.0,1.0,1.0,12653.5,1412.5,769.9,0.435198,0.26589,0.606927,0.348619,0.251765,0.379598,0.178907,0.44706,0.105646,0.332387,0.779085,0.628208,0.142816
max,518.0,1.0,1.0,1334544.0,73063.0,331675.75,0.968733,0.813341,0.958674,0.853236,0.979797,0.948494,0.989824,0.968972,0.884986,0.968173,0.988748,0.908876,0.993693


The above cell should show statistics for `filtered_data` which should now be a
DataFrame that only includes the columns of our list of `good_columns`.

## Get_missing_row_counts

Check the dataframe to see if there is missing data in any of the columns. Return the column names and the total number of missing rows for each column. Use this method: https://www.kite.com/python/answers/how-to-count-the-number-of-nan-values-in-a-pandas-dataframe-column-in-python, but apply it to the entire frame in order to get the counts for all columns.

The return result of this function will be a [Pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) with the axis labels being the column names and the values being a count for the respective column.
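
To make the expected return shape concrete, here is a minimal sketch on a made-up frame. The name `toy` and its values are hypothetical and are not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the assignment data
toy = pd.DataFrame({
    "spend": [10.0, 20.0, 30.0],
    "toxic": [0.1, np.nan, 0.3],
    "insult": [np.nan, np.nan, 0.2],
})

# isna() marks every missing cell as True; sum() then counts the Trues per column
toy_na_counts = toy.isna().sum()
print(toy_na_counts)
```

The result is a Series whose index holds the column names and whose values are the per-column missing counts.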

In [None]:
def get_missing_row_counts(df):
    """Returns a missing value summary of DataFrame df, which is a series of
    `na` counts by column.
    
    The return value should be a series of count values indexed by the column
    name.
    """
    na_counts = df.isna().sum()
    return na_counts

In [None]:
missing_row_counts = get_missing_row_counts(filtered_data)

In [None]:
missing_row_counts

days_elec_pminus     0
bin_before           0
medium_bin           0
impress              0
clicks               0
spend                0
toxic               36
sevtoxic            36
idattack            36
insult              36
profane             36
threat              36
sexexp              36
flirt               36
a_author            36
a_commentor         36
incoh               36
inflam              36
obscene             36
dtype: int64

## Non_zeros

Implement the function non_zeros and use it to create a list of column names that have missing rows.



In [None]:
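def non_zeros(series):
    """Returns a list of the index values in series for which
    the value is greater than 0.
    """
    col = []
    # Series.items() yields (index label, value) pairs;
    # the older iteritems() no longer exists in pandas 2.x
    for i, num in series.items():
        if num > 0:
            col.append(i)
    return col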

In [None]:
missing_cols = non_zeros(missing_row_counts)

In [None]:
missing_cols

['toxic',
 'sevtoxic',
 'idattack',
 'insult',
 'profane',
 'threat',
 'sexexp',
 'flirt',
 'a_author',
 'a_commentor',
 'incoh',
 'inflam',
 'obscene']

**Sanity check:** `missing_cols` printed above should be a list of column names. It will be some subset of `good_columns`. Compare this list with the Series `missing_row_counts` above and be sure the results make sense and are what you expect them to be.

## Fill_nas

Use pandas to fill in blank rows in the affected columns. **Replace the missing values with the mean value for each column.**



In [None]:
# Example of how to handle a single column. fill_nas should do this for all affected columns

#~~ grader-ignore:
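# Assign the filled column back instead of calling fillna(inplace=True) on a
# column selection, which newer pandas versions may not propagate to the frame
filtered_data['toxic'] = filtered_data['toxic'].fillna(filtered_data['toxic'].mean())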
#~~ /grader-ignore

In [None]:
def fill_nas(df, cols_with_nas):
    """For the columns specified by cols_with_nas, fill the na entries for that
    column in DataFrame df. The entered fill value should be the mean for the
    existing values in that column.

    Operates on the dataframe in place, but also returns the resulting dataframe.
    """
    for col in cols_with_nas:
        # Assign back so the fill is written into df itself
        df[col] = df[col].fillna(df[col].mean())
    return df

In [None]:
final_data = fill_nas(filtered_data, missing_cols)

**Sanity check:** There should be no more na data in the frame, so the following cell should report all zeros:

In [None]:
#~~ grader-ignore:
final_data.isna().sum()
#~~ /grader-ignore

days_elec_pminus    0
bin_before          0
medium_bin          0
impress             0
clicks              0
spend               0
toxic               0
sevtoxic            0
idattack            0
insult              0
profane             0
threat              0
sexexp              0
flirt               0
a_author            0
a_commentor         0
incoh               0
inflam              0
obscene             0
dtype: int64

We should see 0's across the board above.

## Extract_target

The data is finally clean. You will now do the work to extract the target from the remaining data so that you have a predictor frame and a target frame to work with.

Create a function that separates your final_data into two dataframes. One dataframe should contain all your predictors (X). The other should contain your target variable (y). Your target is 'clicks'.

In [None]:
def extract_target(df, target):
    """Separate DataFrame df into a target Series and a predictor DataFrame.

    Returns the y, X tuple of target and predictor. The returned X
    should no longer contain the column that has now been extracted as y.
    """
    y = df[target]
    # Every column except the target becomes a predictor
    X = df.drop(columns=[target])
    return y, X

In [None]:
#~~ grader-ignore:
y, X = extract_target(final_data, 'clicks')
#~~ /grader-ignore

**Sanity check:** Again, these checks should pass the assertions

In [None]:
#~~ grader-ignore:
assert type(y) == pd.core.series.Series
assert type(X) == pd.DataFrame
assert 'clicks' not in X.columns
#~~ /grader-ignore

# LASSO Model

## Split_data

Use [sklearn's train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) (already imported above) to separate the data into training and test dataframes. Train on 80% of the data, test on 20%. Use a random state of 123. Your function should accept y and X and return pred_train, pred_test, tar_train, tar_test.
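
As a quick illustration of what the split produces, here is a minimal sketch on made-up data. The names `toy_X` and `toy_y` are hypothetical. It shows the 80/20 row counts and that a fixed `random_state` gives the same split every time.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two predictors, one target
toy_X = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})
toy_y = pd.Series(np.arange(10) * 3, name="target")

# Predictors go in first, then the target; the outputs come back as
# X_train, X_test, y_train, y_test in that order
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123
)
# Repeating the call with the same seed selects the same rows
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123
)

print(len(Xa_train), len(Xa_test))
print(Xa_test.index.equals(Xb_test.index))
```

Note that `train_test_split` takes the predictors before the target, while the function in this section receives `y` first, so the argument order is easy to mix up.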

In [None]:
def split_data(y, X, n=0.2, random_state=123):
    """Split target and predictor y, X into train and test portions.

    Returns the 4-tuple of (X_train, X_test, y_train, y_test) created by
    a train_test_split that holds out the fraction n (0.2 = 20%) as test data.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n, random_state=random_state)
    return X_train, X_test, y_train, y_test

In [None]:
#~~ grader-ignore:
X_train, X_test, y_train, y_test = split_data(y, X)
#~~ /grader-ignore

## Build_and_fit_lasso_model

Fit a `LassoLarsCV` model on the training data. Fit the model in the function below (return the result of model.fit). Use the following parameters when creating your model:

 * cv = 10
 * precompute = False

In [None]:
def build_and_fit_lasso_model(X, y):
    """Creates and returns a LASSO model that is fitted to the values of the
    given predictor and target X, and y.
    """
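    # NOTE: `normalize` is not one of the parameters listed above and has been
    # removed from recent scikit-learn releases; drop it on those versions.
    model = LassoLarsCV(cv=10, precompute=False, normalize=True)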
    model_fit = model.fit(X, y)
    return model_fit

In [None]:
#~~ grader-ignore:
lasso_model = build_and_fit_lasso_model(X_train, y_train)
lasso_model
#~~ /grader-ignore

## Get_coefficients

Create a function that returns a DataFrame with the variable names and coefficients from the fitted LASSO model.



In [None]:
def get_coefficients(model, X):
    """Returns a DataFrame containing the columns `label` and `coeff` which are
    the coefficients by column name.
    """
    # Pair each predictor name with its fitted coefficient, in column order
    predictors_model = pd.DataFrame({'label': list(X.columns), 'coeff': model.coef_})
    return predictors_model

In [None]:
#~~ grader-ignore:
coefficients = get_coefficients(lasso_model, X)
coefficients
#~~ /grader-ignore

Unnamed: 0,label,coeff
0,days_elec_pminus,-1.225788
1,bin_before,-269.893518
2,medium_bin,941.651826
3,impress,0.056202
4,spend,0.066874
5,toxic,0.0
6,sevtoxic,0.0
7,idattack,-375.601403
8,insult,0.0
9,profane,0.0


## Filter_coefficients

Recall that LASSO models shrink coefficients all the way to 0 for predictors that contribute little to the fit. Return a filtered version of the coefficients DataFrame that contains no zero coefficients.
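
To see this shrinkage on data where the answer is known, here is a minimal sketch using synthetic data. Everything in it (`X_toy`, `y_toy`, the seed, `cv=5`) is an assumption made for illustration. Only the first of five columns actually drives the target.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV

# Synthetic data: five predictors, but only column 0 influences the target
rng = np.random.default_rng(0)
n_rows = 200
X_toy = rng.normal(size=(n_rows, 5))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=n_rows)

# Cross-validation picks the penalty strength; the L1 penalty then pulls
# uninformative coefficients toward (often exactly to) zero
toy_lasso = LassoLarsCV(cv=5).fit(X_toy, y_toy)
print(toy_lasso.coef_)

# Positions of the predictors the model kept
kept = np.flatnonzero(toy_lasso.coef_)
print(kept)
```

The coefficient for column 0 should land near 3, while the others are typically zero or very small. How many are exactly zero depends on the penalty that cross-validation selects.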

In [None]:
def filter_coefficients(coeff_frame):
    """Returns a filtered version of DataFrame coeff_frame which
    contains only non-zero rows.

    coeff_frame is a DataFrame containing a column named "coeff".
    """
    coef_fil = coeff_frame[coeff_frame['coeff'] != 0]
    return coef_fil

In [None]:
#~~ grader-ignore:
coefficients_no_zeros = filter_coefficients(coefficients)
coefficients_no_zeros
#~~ /grader-ignore

Unnamed: 0,label,coeff
0,days_elec_pminus,-1.225788
1,bin_before,-269.893518
2,medium_bin,941.651826
3,impress,0.056202
4,spend,0.066874
7,idattack,-375.601403
11,sexexp,140.941813
15,incoh,-657.511985
17,obscene,322.484203


**Sanity check:** You should now have a dataframe of label, coeff with all coefficient values being non-zero.

The following should be True:

In [None]:
#~~ grader-ignore:
len(coefficients_no_zeros) == 9
#~~ /grader-ignore

True

## Top_pos_coeffs

Sort `coefficients_no_zeros` and return the three largest positive coefficients as a DataFrame.



In [None]:
def top_pos_coeffs(coeff_frame, n):
    """Returns a DataFrame containing only the top n rows by coefficient.

    coeff_frame is a DataFrame containing the column "coeff".
    """
    top = coeff_frame.nlargest(n, 'coeff')
    return top

In [None]:
#~~ grader-ignore:
top_pos_coeffs(coefficients_no_zeros, 3)
#~~ /grader-ignore

Unnamed: 0,label,coeff
2,medium_bin,941.651826
17,obscene,322.484203
11,sexexp,140.941813


## Top_neg_coeffs

Implement a function that returns a frame of the top n **negative** coefficients, which is to say the **most negative** of the values.

This function will be used to sort coefficients_no_zeros and return the strongest negative coefficient as a dataframe
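
A minimal sketch of how pandas picks extreme rows, on a made-up frame (`toy_coeffs` is hypothetical). `nlargest` and `nsmallest` both sort by the named column and keep each row's original index label.

```python
import pandas as pd

# Hypothetical coefficients, not the fitted model's values
toy_coeffs = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "coeff": [2.5, -4.0, 0.7, -1.2],
})

# Two largest values: rows "a" (2.5) and "c" (0.7)
largest = toy_coeffs.nlargest(2, "coeff")
# Single most negative value: row "b" (-4.0)
smallest = toy_coeffs.nsmallest(1, "coeff")

print(largest)
print(smallest)
```

Because the index labels are preserved, the rows in the result carry their positions from the original frame rather than being renumbered from 0.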

In [None]:
def top_neg_coeffs(coeff_frame, n):
    """Returns a DataFrame containing only the smallest n rows by coefficient.

    coeff_frame is a DataFrame containing the column "coeff".
    """
    neg = coeff_frame.nsmallest(n, 'coeff')
    return neg

In [None]:
#~~ grader-ignore:
top_neg_coeffs(coefficients_no_zeros, 1)
#~~ /grader-ignore

Unnamed: 0,label,coeff
15,incoh,-657.511985


**Sanity check:** You should see a single row with the label and coefficient value of the coefficient that is the **most negative** of all the coefficients.

## Predict_target

Calculate predicted values for both the training and test set. Return a numpy array for each.

This function will be used to predict clicks on the data, since that is what
we trained our model to predict.

In [None]:
def predict_target(X_train, X_test, model):
    """Returns a tuple of the prediction arrays from the provided model
    for the training and test DataFrames provided.

    model should be a previously fit model with a `predict` method. This
    function should return a tuple containing the prediction on the training
    data, and the prediction on the test data.
    """
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    return y_train_pred, y_test_pred

In [None]:
#~~ grader-ignore:
y_pred_clicks_train, y_pred_clicks_test = predict_target(X_train, X_test, lasso_model)
#~~ /grader-ignore

## Get_error_metrics

Return mean squared error and r-squared metrics using `mean_squared_error` and `r2_score` that have already been imported from sklearn metrics.

This function will be used to report error metrics on both the training and the testing data.
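
To show what the two metrics measure, here is a minimal sketch that computes both by hand on made-up numbers and compares them with the sklearn functions. The arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: the average of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

Both sklearn functions take the true values first and the predictions second. The order does not matter for MSE, but it does for R-squared.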

In [None]:
def get_error_metrics(target, pred):
    """Return mean-squared and r-squared errors for the given data.

    Returns a tuple of mse, r2 errors between the target and pred datasets.
    """
    mse = mean_squared_error(target, pred)
    r2 = r2_score(target, pred)
    return mse, r2

In [None]:
#~~ grader-ignore:
train_mse, train_r2 = get_error_metrics(y_train, y_pred_clicks_train)
train_mse, train_r2
#~~ /grader-ignore

(2773619.0581228794, 0.8211896632138826)

In [None]:
#~~ grader-ignore:
test_mse, test_r2 = get_error_metrics(y_test, y_pred_clicks_test)
test_mse, test_r2
#~~ /grader-ignore

(1800890.9762832574, 0.8532346917715389)

Let's contextualize the mean squared error by comparing it with the spread of the clicks variable, measured by its standard deviation.

## Get_std_dev

Calculate the standard deviation for a column in a DataFrame. Create a function that returns the standard deviation, given a pandas dataframe and the column of interest. Use the [Pandas std method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)

This function will be used to calculate the standard deviation of the clicks variable from the final_data dataframe.
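
One detail worth knowing: pandas computes the sample standard deviation by default (`ddof=1`), while NumPy defaults to the population version (`ddof=0`). A minimal sketch on made-up values (`toy_clicks` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical click counts
toy_clicks = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# pandas default: divide by n - 1 (sample standard deviation)
sample_std = toy_clicks.std()
# NumPy default: divide by n (population standard deviation)
population_std = np.std(toy_clicks.to_numpy())

print(sample_std, population_std)
```

With a few thousand rows the two values are nearly identical, so the choice only matters when comparing against numbers computed elsewhere.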

In [None]:
def get_std_dev(df, column):
    """Return the standard deviation of the specified column in
    DataFrame df.
    """
    col_series = df.loc[:, column]
    return col_series.std()

In [None]:
#~~ grader-ignore:
clicks_std = get_std_dev(final_data, 'clicks')
clicks_std
#~~ /grader-ignore

3856.227668546531

## Error_over_std

Divide the root mean squared error of the test data by the standard deviation. Take the square root of `test_mse` when making your calculation (we want the error in the original units of clicks, not the squared error). Use `math` to calculate the square root.

In [1]:
def error_over_std(mse, std):
    """Returns the ratio of error to standard deviation where
    error is the square-root of mse.
    """
    error = math.sqrt(mse)
    ratio = error/std
    return ratio

In [None]:
#~~ grader-ignore:
error_over_std(test_mse, clicks_std)
#~~ /grader-ignore

0.3480014428665842

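As a rule of thumb, if the test prediction error (the square root of the MSE) is smaller than one standard deviation of the target, that is, if the ratio above is less than 1, the model does roughly better than always guessing the mean number of clicks, and we can treat it as fairly precise.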
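
The arithmetic behind that ratio can be checked directly. This sketch copies the test MSE and the clicks standard deviation from the outputs shown earlier in this notebook. The variable names are hypothetical, and your own run may produce different values.

```python
import math

# Values copied from the outputs earlier in this notebook
test_mse_reported = 1800890.9762832574
clicks_std_reported = 3856.227668546531

# The square root puts the error back in units of clicks
rmse = math.sqrt(test_mse_reported)
ratio = rmse / clicks_std_reported

print(round(rmse, 1), round(ratio, 3))
```

A typical test error of roughly 1,340 clicks against a spread of roughly 3,860 clicks is where the ratio of about 0.35 comes from.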

## Error_std_ratio_is_under_threshold

Build a function that performs the same calculation as your previous function, but instead of returning the ratio, have it return True if the ratio is below the threshold (by default, less than 1).

In [None]:
def error_std_ratio_is_under_threshold(mse, std, threshold=1):
    """Calculates the ratio of error to standard deviation where
    error is the square-root of mse.

    Returns True if the resulting calculation is less than the
    value of threshold, False otherwise.
    """
    # The comparison already evaluates to True or False
    return error_over_std(mse, std) < threshold

In [None]:
#~~ grader-ignore:
error_std_ratio_is_under_threshold(test_mse, clicks_std)
#~~ /grader-ignore

True

# Compare Ensemble models

Next, let's compare a host of ensemble regression models and see how they perform. 

## Ensemble_model_fit_and_eval

Create a function that goes through each of the models below and writes the model name, the mean squared error, and the R-squared for the **test dataset** to a pandas DataFrame.

In [None]:
models_to_eval = [
    RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor,
    GradientBoostingRegressor, AdaBoostRegressor, HuberRegressor
]

In [None]:
model_results = pd.DataFrame(columns=['model', 'test_mse', 'test_r2'])

Make sure that your pandas DataFrame is indexed by enumeration, and write each new row to the DataFrame using `df.loc[]`. For instance:


```
for i, model in enumerate(models):
      <INSERT LOGIC HERE TO BUILD MODELS>
      model_results.loc[i] = [model, test_mse, test_rsquared]
```
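
One thing to keep in mind when comparing scores: the five ensemble classes in the list above are randomized, so instantiating them with default arguments can give slightly different results on every run. The sketch below is an optional illustration, not a requirement of the assignment. It uses synthetic data and a hypothetical helper `holdout_r2` to show that fixing `random_state` makes a score repeatable. `HuberRegressor` has no `random_state` parameter, so it is left out.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data, split by hand into a fitting part and a held-out part
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.3, size=120)
X_fit, X_hold = X_demo[:90], X_demo[90:]
y_fit, y_hold = y_demo[:90], y_demo[90:]

def holdout_r2(seed):
    """Fit a small forest with the given seed and score it on the held-out rows."""
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    model.fit(X_fit, y_fit)
    return r2_score(y_hold, model.predict(X_hold))

# The same seed reproduces the same score
same_seed_a = holdout_r2(42)
same_seed_b = holdout_r2(42)
print(same_seed_a == same_seed_b)
```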




In [None]:
def ensemble_model_fit_and_eval(X_train, X_test, y_train, y_test, models, results):
    """Write evaluation data from the given models to the results frame.

    X_train, X_test, y_train, y_test are the prediction and target training
    and testing data provided as frames.

    models are the model classes to be instantiated and evaluated.

    results is a DataFrame containing columns: model, test_mse, test_r2. The
    results frame should be filled with the evaluation data and returned by
    this function.
    """ 
    for i, model_class in enumerate(models):
        # instantiate model and fit to training data
        model = model_class()
        model.fit(X_train, y_train)
        
        # predict on test data and calculate mse and r2
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        
        # add results to dataframe
        results.loc[i] = [model_class, mse, r2]
        
    return results

In [None]:
#~~ grader-ignore:
model_results = ensemble_model_fit_and_eval(X_train, X_test, y_train, y_test, models_to_eval, model_results)
model_results
#~~ /grader-ignore

Unnamed: 0,model,test_mse,test_r2
0,<class 'sklearn.ensemble._forest.RandomForestR...,1886033.0,0.846296
1,<class 'sklearn.ensemble._forest.ExtraTreesReg...,1403284.0,0.885638
2,<class 'sklearn.ensemble._bagging.BaggingRegre...,1725861.0,0.859349
3,<class 'sklearn.ensemble._gb.GradientBoostingR...,1653678.0,0.865232
4,<class 'sklearn.ensemble._weight_boosting.AdaB...,4976682.0,0.594421
5,<class 'sklearn.linear_model._huber.HuberRegre...,5025044.0,0.590479


## Top_results

Sort the model results and return a dataframe with the top `n` r2 values.

This function will be used to get a DataFrame with one row that contains the model with the top r2.



In [None]:
def top_results(df, column, n):
    """Return a DataFrame of the top n rows in df by the specified column."""
    top = df.nlargest(n, column)
    return top

In [None]:
#~~ grader-ignore:
top_results(model_results, 'test_r2', 1)
#~~ /grader-ignore

Unnamed: 0,model,test_mse,test_r2
1,ExtraTreesRegressor,1294446.0,0.894508


## Filter_by_column_threshold

Create a function to filter a dataframe by column values meeting a minimum
threshold criterion.

This function will be used to filter model_results so that only models that performed better than the LASSO rsquared are returned.
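
Filtering of this kind is usually done with a boolean mask. Here is a minimal sketch on a made-up results frame (`toy_results` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical results, not real model scores
toy_results = pd.DataFrame({
    "model": ["A", "B", "C"],
    "test_r2": [0.80, 0.90, 0.85],
})

# Comparing a column with a number yields one True/False per row
mask = toy_results["test_r2"] > 0.85
# Indexing with the mask keeps only the True rows
above = toy_results[mask]

print(above)
```

The comparison is strict, so a row exactly at the threshold is dropped. Also remember that "greater than" means better for R-squared but worse for MSE.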



In [None]:
def filter_by_column_threshold(df, column, threshold):
    """Return a subframe that is the rows of df for which the value of column
    is greater than threshold.
    """
    return df[df[column] > threshold]

In [None]:
#~~ grader-ignore:
filter_by_column_threshold(model_results, 'test_r2', test_r2)
#~~ /grader-ignore

Unnamed: 0,model,test_mse,test_r2
0,RandomForestRegressor,1723246.0,0.859562
1,ExtraTreesRegressor,1294446.0,0.894508
3,GradientBoostingRegressor,1707586.0,0.860839


In [None]:
#~~ grader-ignore:
filter_by_column_threshold(model_results, 'test_mse', 2000000)
#~~ /grader-ignore

Unnamed: 0,model,test_mse,test_r2
4,AdaBoostRegressor,3919397.0,0.680585
5,HuberRegressor,5025044.0,0.590479


In [None]:
#~~ grader-ignore:
filter_by_column_threshold(model_results, 'test_r2', test_r2).model.to_list()
#~~ /grader-ignore

['RandomForestRegressor', 'ExtraTreesRegressor', 'GradientBoostingRegressor']