# Predictive Analytics Challenge
***

Hathaway Zhang <br>
104369396 <br>
Nov.4, 2018
***

For this predictive analytics challenge, I utilized `sklearn` along with the [Brandzooka Advertising Performance Data Set](https://brandzooka.com/) to build a machine learning algorithm that predicts the advertising effectiveness --- `totalClicks` in my case. This predictive algorithm is able to show which feature will drive the clicks most and what I can do to improve my Google’s Online Marketing Challenge. The  R2  will be provided to analysis the performance of the machine learning algorithm. Moreover, significant impact factors will be analyzed according to the coefficient weights.

__<font color=red>! It may take times to run this script !</font>__

### Part One: Data Cleaning & Preparing
*** 
Import necessary package for this project <br>
All the models are selected from [`sklearn`](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning). And these models include:
   - LassoLarsCV
   - LarsCV
   - RidgeCV
   - ElasticNetCV
   - OrthogonalMatchingPursuitCV

In [212]:
import pandas as pd
import pandas
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.cross_validation import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LassoLarsCV
from sklearn.linear_model import LarsCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import OrthogonalMatchingPursuitCV

Read in the raw dataset then exclude invalid predictors from the raw data except `totalClicks`. Also goes over the columns to double check if I've selected proper predictors. To improve the performance of prediction. I will also remove duplicates, missing information, and extraneous features to clean the data set at the first stage.
Delete columns with missing information and duplicates values.

In [2]:
Sheet = pd.ExcelFile("campaign data_only matching videos.xlsx")
# print(Sheet.sheet_names)
dfRaw = Sheet.parse('dataforboone')

In [3]:
list(dfRaw)
## Exclude invalid predictors from the raw data except 'totalClicks'
dfC = dfRaw.drop(columns=['bidWinRate', 'campaignCompletedEmailSent', 'campaignStartedEmailSent',  
                           'clickThroughRate', 'completionRate',  'completionRate',  'couponCode',
                           'discount', 'emailSent', 'ourMarkupFee', 'spentBudget', 'status', 'tdCampaignId', 'total',
                           'totalImpressions', 'updatedAt',  'viewability', 'totalClicks'])

# Get an overview about all the columns
dfC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 53 columns):
id                           6 non-null object
adLandingPage                148 non-null object
adLanguage                   0 non-null float64
audienceGender               148 non-null object
audienceInterests            126 non-null object
audienceLocationCountry      22 non-null object
audienceLocationCountries    119 non-null object
audienceLocationRegions      56 non-null object
audienceLocationZipcodes     139 non-null object
audienceMaxAge               148 non-null int64
audienceMinAge               148 non-null int64
audiencePlacements           146 non-null object
budget                       148 non-null int64
createdAt                    148 non-null object
goal                         145 non-null object
hours                        75 non-null object
targetingCost                148 non-null int64
targetingPercent             148 non-null float64
placementCost         

In [4]:
## To improve the performance of prediction. I will remove duplicates, missing information, 
## and extraneous features to clean the data set at the first stage.
## Delete columns with missing information and duplicates values
dfClean = dfC.drop(columns=['id','adLanguage','audienceLocationRegions','audienceLocationZipcodes',
                            'hours'])

For a better performance, I will convert those object columns into dummy table. In this way, categorical variables are converted into integer data and the machine learning algorithm will do a better job in prediction. `create_dummy` function is aimed to converted categorical data. I haven't applied this function to `audienceLocationCountry` as this segement contains empty rows and inconsistent country name. 

In [5]:
dfClean['audienceLocationCountry'] = dfClean['audienceLocationCountry'].astype(str)
dfClean['audienceLocationCountries'] = dfClean['audienceLocationCountries'].astype(str)
dfClean['audienceLocationCountries'] = dfClean['audienceLocationCountries'].apply(lambda x: x.replace('\"',''))
dfClean['audienceLocationCountries'] = dfClean['audienceLocationCountries'].apply(lambda x: x.replace('[',''))
dfClean['audienceLocationCountries'] = dfClean['audienceLocationCountries'].apply(lambda x: x.replace(']',''))

new_col = dfClean[['audienceLocationCountry', 'audienceLocationCountries']].apply(lambda x: ','.join(x), axis=1)
new_col = new_col.apply(lambda x: x.replace('nan,',''))
new_col = new_col.apply(lambda x: x.replace(',nan',''))
new_col = new_col.apply(lambda x: x.replace('United States','USA'))

In [6]:
## Combine audienceLocationCountry and audienceLocationCountries into a new column, then drop
## these columns
dfClean.insert(loc=3, column='audienceLocation', value=new_col)
dfClean.insert(loc=0, column='totalClicks', value=dfRaw['totalClicks'])
dfClean = dfClean.drop(columns=['audienceLocationCountry','audienceLocationCountries'])
dfLC = dfClean['audienceLocation'].str.get_dummies(sep=',')
dfClean = pd.concat([dfClean, dfLC], axis=1)

In [7]:
def create_dummy(pre):
    global dfClean
    dfClean[pre] = dfClean[pre].astype(str)
    dfClean[pre] = dfClean[pre].apply(lambda x: x.replace('\"',''))
    dfClean[pre] = dfClean[pre].apply(lambda x: x.replace('[',''))
    dfClean[pre] = dfClean[pre].apply(lambda x: x.replace(']',''))
    # convert interests into dummy table format
    dfConvert = dfClean[pre].str.get_dummies(sep=',')
    dfClean = pd.concat([dfClean, dfConvert], axis=1)

In [8]:
list_pre = ['audienceGender','audienceInterests','audiencePlacements','goal','category']
for i in list_pre:
    create_dummy(i)

In [9]:
## RangeIndex: 148 entries, 0 to 147
## Columns: 362 entries, totalClicks to viewability
## dtypes: datetime64[ns](2), float64(2), int64(347), object(11)
dfClean = dfClean.dropna(axis=1, how='all')
dfClean = dfClean.drop(columns=['nan','audienceGender','audienceLocation','audienceInterests','audiencePlacements','goal',
                                'createdAt','scheduleDateFrom','scheduleDateTo','title','category','userId','videoId','adLandingPage'])
dfClean.info()
dfClean.to_csv("campaign.csv", sep=',')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Columns: 378 entries, totalClicks to ytv
dtypes: float64(2), int64(376)
memory usage: 437.1 KB


In [None]:
## Double check if null sets are still exist
# dfClean.select_dtypes(include=['object'])

### Part Two: Machine Learning Algorithm 

After cleaning the dataset to make sure it could have proper performance, I started training and predicting the data set through multiple machine learning algorithm. To start learning, I set `totalClicks` as target and put all the predictors I've cleaned into `predictors`. For all these model, I will train 70% of the data and test the rest 30% of data. Since every prediction is created randomly, I simulated the prediction for 10000 or 20000 times to obtain the optimal r-squred result.

In [10]:
# create a list of all the predictors you're going to feed into the LassoLarsCV model
allvariablenames = list(dfClean.columns.values)
listofallpredictors = allvariablenames[1:]
#load predictors into dataframe
predictors = dfClean[listofallpredictors]  
#load target into dataframe
target = dfClean['totalClicks']

#### LassoLarsCV
***
**LassoLars** is a lasso model implemented using the LARS algorithm, and unlike the implementation based on coordinate_descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.

In [240]:
def max_r_squared(n):
    rsquared_train = []
    rsquared_test = []
    coeff = []
    # find the maximum r-squared value and corresponding random_state value
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test))
        mx = rsquared_test.index(max(rsquared_test))
    
    # put the maximum random_state value back into the model and display the coefficients
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx)
    model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
    predictors_model=pd.DataFrame(listofallpredictors)
    predictors_model.columns = ['label']
    predictors_model['coeff'] = model.coef_
    for index, row in predictors_model.iterrows():
        if row['coeff'] > 0:
            coeff.append(row.values)
    
    return rsquared_train[mx], rsquared_test[mx], coeff, mx, model.intercept_

In [241]:
max_r_squared(10000)

(0.4065024567596378,
 0.9774456730304493,
 [array(['budget', 0.08631671489574987], dtype=object)],
 6763,
 76.65804019375742)

This model provides the best $r^2$ so far. Identified the top 3, 10, and 40 r-squared values to check if the coefficient would only contain budget.

In [136]:
def max_r_squared_top_3(n):
    rsquared_train = []
    rsquared_test = []
    coeff = []
    train = []
    test = []
    # find the top three r-squared values and their corresponding random_state values
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test)) 
        mx = sorted(range(len(rsquared_test)), key=lambda i: rsquared_test[i], reverse=True)[:3]
    
    for j in range(3):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx[j])
        model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        train.append(rsquared_train[mx[j]])
        test.append(rsquared_test[mx[j]])
        predictors_model=pd.DataFrame(listofallpredictors)
        predictors_model.columns = ['label']
        predictors_model['coeff'] = model.coef_
        for index, row in predictors_model.iterrows():
            if row['coeff'] > 0:
                coeff.append(row.values)
    
    return train, test, mx, coeff

In [138]:
max_r_squared_top_3(10000)

([0.4065024567596378, 0.4217922728245661, 0.42712435329085485],
 [0.9774456730304493, 0.9712468614497402, 0.9699994676190249],
 [6763, 8456, 3486],
 [array(['budget', 0.08631671489574987], dtype=object),
  array(['budget', 0.08860510821364571], dtype=object),
  array(['budget', 0.08018047648837662], dtype=object),
  array(['targetingCost', 0.01810776856736746], dtype=object)])

In [139]:
def max_r_squared_top_10(n):
    rsquared_train = []
    rsquared_test = []
    coeff = []
    train = []
    test = []
    # find the top ten r-squared values and their corresponding random_state values
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test)) 
        mx = sorted(range(len(rsquared_test)), key=lambda i: rsquared_test[i], reverse=True)[:10]
    
    for j in range(10):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx[j])
        model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        train.append(rsquared_train[mx[j]])
        test.append(rsquared_test[mx[j]])
        # load predictors into predictors_model
        predictors_model=pd.DataFrame(listofallpredictors)
        # rename header of predictors_model as 'label'
        predictors_model.columns = ['label']
        # create a column named coeff and add coeff. to dataframe
        predictors_model['coeff'] = model.coef_
        for index, row in predictors_model.iterrows():
        # since any coefficients that are non-zero are significant, find all the coeff, greater than 0
            if row['coeff'] > 0:
                # print out the significant values
                coeff.append(row.values)
    
    return train, test, mx, coeff

In [140]:
max_r_squared_top_10(10000)

([0.4065024567596378,
  0.4217922728245661,
  0.42712435329085485,
  0.40531647396192794,
  0.37631727886847677,
  0.3562131062112781,
  0.39502931234023436,
  0.4066399781506745,
  0.4214674186775904,
  0.4292206466363122],
 [0.9774456730304493,
  0.9712468614497402,
  0.9699994676190249,
  0.9676155045101624,
  0.9662396218010796,
  0.9634843534728885,
  0.963179830115318,
  0.9630004844677185,
  0.9626532756473132,
  0.9618035967390384],
 [6763, 8456, 3486, 8278, 95, 5377, 2661, 103, 6292, 3502],
 [array(['budget', 0.08631671489574987], dtype=object),
  array(['budget', 0.08860510821364571], dtype=object),
  array(['budget', 0.08018047648837662], dtype=object),
  array(['targetingCost', 0.01810776856736746], dtype=object),
  array(['budget', 0.08641743073508373], dtype=object),
  array(['targetingCost', 0.0018100487980428975], dtype=object),
  array(['budget', 0.07672520612523405], dtype=object),
  array(['placementCost', 0.02945518047734132], dtype=object),
  array(['budget', 0.085

In [173]:
def max_r_squared_top_40(n):
    rsquared_train = []
    rsquared_test = []
    coeff = []
    train = []
    test = []
    # find the top 40 r-squared values and their corresponding random_state values
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test)) 
        mx = sorted(range(len(rsquared_test)), key=lambda i: rsquared_test[i], reverse=True)[:40]
    
    for j in range(40):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx[j])
        model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        train.append(rsquared_train[mx[j]])
        test.append(rsquared_test[mx[j]])
        predictors_model=pd.DataFrame(listofallpredictors)
        predictors_model.columns = ['label']
        predictors_model['coeff'] = model.coef_
        for index, row in predictors_model.iterrows():
            if row['coeff'] > 0:
                coeff.append(row.values)
    
    return train, test, mx, coeff

In [174]:
max_r_squared_top_40(10000)

([0.4065024567596378,
  0.4217922728245661,
  0.42712435329085485,
  0.40531647396192794,
  0.37631727886847677,
  0.3562131062112781,
  0.39502931234023436,
  0.4066399781506745,
  0.4214674186775904,
  0.4292206466363122,
  0.3444616283190496,
  0.4047537222165852,
  0.42934089362202266,
  0.39476252610091,
  0.40152692283436175,
  0.4178805126782799,
  0.4323132718495563,
  0.4088220924712632,
  0.40839134921222237,
  0.38528303343888604,
  0.4507849399544204,
  0.42259284386473256,
  0.43399881197783136,
  0.4054898097318287,
  0.4513773697059265,
  0.3937651825484262,
  0.4516777378587613,
  0.43710335451783794,
  0.39232217701925776,
  0.44293401185824754,
  0.4348772122742998,
  0.42476356960879513,
  0.4442232973360697,
  0.4053162356824609,
  0.4139531884441301,
  0.43195170517421233,
  0.37416250364034076,
  0.43156448659260066,
  0.43615758420583794,
  0.4410652725715368],
 [0.9774456730304493,
  0.9712468614497402,
  0.9699994676190249,
  0.9676155045101624,
  0.96623962180

Instead of find the maximum r-squared value of testing data, finding the maximum sum of r-squared for testing and training data. 

In [336]:
def max_r_squared_sum(n):
    global dfSum
    rsquared_train = []
    rsquared_test = []
    dfSum = pd.DataFrame(allvariablenames[1:])
    coeff = []
    # find the maximum sum of r-squared values and its corresponding random_state value
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        dfSum["coefficient{0}".format(i)] = model.coef_
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test))
        rsquared= [x + y for x, y in zip(rsquared_train, rsquared_test)]
        mx = rsquared.index(max(rsquared))
        
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx)
    model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
    predictors_model=pd.DataFrame(listofallpredictors)
    predictors_model.columns = ['label']
    predictors_model['coeff'] = model.coef_
    for index, row in predictors_model.iterrows():
        if row['coeff'] != 0:
            coeff.append(row.values)
    
    return rsquared_train[mx], rsquared_test[mx], coeff, mx, model.intercept_

In [337]:
max_r_squared_sum(20000)

(0.9179901194905821,
 0.8301898639433808,
 [array(['budget', 0.07119603642235801], dtype=object),
  array(['placementCost', 0.09444803778204203], dtype=object),
  array(['callToAction', 9.626485225266828], dtype=object),
  array(['contact', -10.334857639464932], dtype=object),
  array(['beautifulFigure', 9.369398016382434], dtype=object),
  array(['celebrity', -22.988016259901347], dtype=object),
  array(['targetedStarring', 12.390903673160278], dtype=object),
  array(['bright_colorful', -7.898632531003881], dtype=object),
  array(['logoInvolved', 2.303066954475899], dtype=object),
  array(['mimicUI', 6.247707849309872], dtype=object),
  array(['bgmQuality', 4.836739779197172], dtype=object),
  array(['conversation', -8.421379160693306], dtype=object),
  array(['narratage', 15.967857234584145], dtype=object),
  array(['humor', -0.12216904336451291], dtype=object),
  array(['accent', -22.077896042300672], dtype=object),
  array(['diversity', -5.2524113016751715], dtype=object),
  array(

For LassoLars model, the best testing $r^2$ is around 0.83. Identifying the top 20 factors.

In [386]:
## Count the occurrence
dfSum['count'] = dfSum.astype(bool).sum(axis=1)
dfSum = dfSum.sort_values(by=['count'],ascending=False)
## Rank the factors
dfSum['rank'] = dfSum['count'].rank(ascending=False)
Result1 = dfSum[[0,'count','rank']]
Result1.head(20)

Unnamed: 0,0,count,rank
2,budget,16508,1.0
5,placementCost,14894,2.0
142,Charity & Philanthropy,7278,3.0
3,targetingCost,3478,4.0
168,DIY,2660,5.0
293,Real Estate,2655,6.0
74,000-$374,2631,7.0
359,doctor,2189,8.0
322,Undergraduate,2150,9.0
345,reach,2088,10.0


In [394]:
Result2 = dfSum[[0,'coefficient1984']].sort_values(by=['coefficient1984'],ascending=False)
Result2

Unnamed: 0,0,coefficient1984
359,doctor,146.447899
74,000-$374,103.536264
111,African American,103.143814
142,Charity & Philanthropy,84.221319
168,DIY,82.604693
371,sport,69.904018
55,United Kingdom,62.449572
293,Real Estate,59.610672
351,awareness,48.279354
275,Online Activity,24.079637


#### LarsCV
***
Cross-validated Least Angle Regression model

In [222]:
def max_r_squared_Lars(n):
    rsquared_train = []
    rsquared_test = []
    coeff = []
    # find the maximum r-squared value and corresponding random_state value
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = LarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
        
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test))
        mx = rsquared_test.index(max(rsquared_test))
        
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx)
    model = LassoLarsCV(cv=5, precompute=False).fit(pred_train, tar_train)
    # load predictors into predictors_model
    predictors_model=pd.DataFrame(listofallpredictors)
    # rename header of predictors_model as 'label'
    predictors_model.columns = ['label']
    # create a column named coeff and add coeff. to dataframe
    predictors_model['coeff'] = model.coef_
    for index, row in predictors_model.iterrows():
    # since any coefficients that are non-zero are significant, find all the coeff, greater than 0
        if row['coeff'] > 0:
            # print out the significant values
            coeff.append(row.values)
    
    return rsquared_train[mx], rsquared_test[mx], coeff, mx, model.intercept_

In [223]:
max_r_squared_Lars(10)

(0.4018562622516678,
 0.8620038126251974,
 [array(['budget', 0.1222922977225194], dtype=object),
  array(['wordy', 7.129278270811197], dtype=object),
  array(['narratage', 4.214814303269835], dtype=object),
  array(['000-$374', 41.186177671948414], dtype=object),
  array(['Charity & Philanthropy', 79.24999811876575], dtype=object),
  array(['Cosmetic Surgery', 5.255254515126166], dtype=object),
  array(['DIY', 49.71777960444509], dtype=object),
  array(['Online Activity', 12.577870634407152], dtype=object),
  array(['Pets & Animals', 15.482757794955857], dtype=object),
  array(['Real Estate', 52.342493698115376], dtype=object),
  array(['SEO', 0.26633038738390163], dtype=object),
  array(['awareness', 23.56650935878031], dtype=object),
  array(['doctor', 73.77608895204311], dtype=object),
  array(['sport', 3.954019704139708], dtype=object)],
 6)

#### RidgeCV
***
**Ridge regression** addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares.

In [186]:
def max_r_squared_ridge(n):
    rsquared_train = []
    rsquared_test = []
    coeff = []
    # find the maximum r-squared value and corresponding random_state value
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = RidgeCV(cv=5, fit_intercept=True).fit(pred_train, tar_train)
        
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test))
        mx = rsquared_test.index(max(rsquared_test))
        
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx)
    model = RidgeCV(cv=5, fit_intercept=True).fit(pred_train, tar_train)
    predictors_model=pd.DataFrame(listofallpredictors)
    predictors_model.columns = ['label']
    predictors_model['coeff'] = model.coef_
    for index, row in predictors_model.iterrows():
        if row['coeff'] != 0:
            coeff.append(row.values)
    
    return rsquared_train[mx], rsquared_test[mx], coeff, mx, model.intercept_

In [188]:
max_r_squared_ridge(20000)

(0.8451636872990215,
 0.8702712392775607,
 [array(['audienceMaxAge', 0.3726009725147037], dtype=object),
  array(['audienceMinAge', -0.31392821398146253], dtype=object),
  array(['budget', 0.2775549322923325], dtype=object),
  array(['targetingCost', -0.28624748505103526], dtype=object),
  array(['targetingPercent', -2.894002529856331], dtype=object),
  array(['placementCost', 0.3348485525148135], dtype=object),
  array(['ourTargetingPercent', -2.894002529856331], dtype=object),
  array(['projectedImpressions', -0.011920827892026864], dtype=object),
  array(['callToAction', 3.0969960702584114], dtype=object),
  array(['quality', 1.7034252429165346], dtype=object),
  array(['keywords', 4.035229445189808], dtype=object),
  array(['contact', -2.668295329389826], dtype=object),
  array(['CEO', -1.5194100603314573], dtype=object),
  array(['beautifulFigure', 5.903120340129352], dtype=object),
  array(['celebrity', -6.77673840265364], dtype=object),
  array(['targetedStarring', 3.94302226127

Change the values of alpha to see if anything improves.

In [329]:
def max_r_squared_ridge_alpha(n):
    global df
    rsquared_train = []
    rsquared_test = []
    df = pd.DataFrame(allvariablenames[1:])
    coeff = []
    # find the maximum r-squared value and corresponding random_state value
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = RidgeCV(alphas=[1e-4, 1e-3, 1e-2, 1e-1], cv=5, fit_intercept=True).fit(pred_train, tar_train)
        df["coefficient{0}".format(i)] = model.coef_
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test))   
        mx = rsquared_test.index(max(rsquared_test))
        
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx)
    model = RidgeCV(alphas=[1e-4, 1e-3, 1e-2, 1e-1], cv=5, fit_intercept=True).fit(pred_train, tar_train)
    predictors_model=pd.DataFrame(listofallpredictors)
    predictors_model.columns = ['label']
    predictors_model['coeff'] = model.coef_
    for index, row in predictors_model.iterrows():
        if row['coeff'] != 0:
            coeff.append(row.values)
    
    return rsquared_train[mx], rsquared_test[mx], mx, model.intercept_

In [331]:
max_r_squared_ridge_alpha(20000)

(0.9861514268630314, 0.9208814033140389, 18291, 31.726245672450915)

For RidgeCV model, the best testing $r^2$ is around 0.92. Identifying the top 20 factors.

In [395]:
## Count the occurrence and sort the factors
df['count'] = df.astype(bool).sum(axis=1)
df = df.sort_values(by=['count'],ascending=False)
df['rank'] = df['count'].rank(ascending=False)
R1 = dfSum[[0,'count','rank']]
R1.head(20)

Unnamed: 0,0,count,rank
2,budget,16508,1.0
5,placementCost,14894,2.0
142,Charity & Philanthropy,7278,3.0
3,targetingCost,3478,4.0
168,DIY,2660,5.0
293,Real Estate,2655,6.0
74,000-$374,2631,7.0
359,doctor,2189,8.0
322,Undergraduate,2150,9.0
345,reach,2088,10.0


In [399]:
R2 = df[[0,'coefficient18291']].sort_values(by=['coefficient18291'],ascending=False)
R2

Unnamed: 0,0,coefficient18291
368,school,82.796252
322,Undergraduate,81.742635
133,Business,66.602421
136,Business Education,61.873626
187,Fashion & Style,55.973452
188,Finance,39.456436
359,doctor,39.405445
11,contact,36.454750
275,Online Activity,35.473239
371,sport,34.403980


#### ElasticNetCV
***
**ElasticNet** is a linear regression model trained with L1 and L2 prior as regularizer. This combination allows for learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge. We control the convex combination of L1 and L2 using the l1_ratio parameter.
<br>
Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.
<br>
A practical advantage of trading-off between Lasso and Ridge is it allows Elastic-Net to inherit some of Ridge’s stability under rotation.

In [238]:
def max_r_squared_elastic(n):
    rsquared_train = []
    rsquared_test = []
    coeff = []
    # find the maximum r-squared value and corresponding random_state value
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = ElasticNetCV(cv=5, precompute=False).fit(pred_train, tar_train)
        
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test))
        mx = rsquared_test.index(max(rsquared_test))
        
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx)
    model = ElasticNetCV(cv=5, precompute=False).fit(pred_train, tar_train)
    predictors_model=pd.DataFrame(listofallpredictors)
    predictors_model.columns = ['label']
    predictors_model['coeff'] = model.coef_
    for index, row in predictors_model.iterrows():
        if row['coeff'] != 0:
            coeff.append(row.values)
    
    return rsquared_train[mx], rsquared_test[mx], coeff, mx, model.intercept_

In [239]:
max_r_squared_elastic(10000)

(0.42740918548753326,
 0.9187785455516372,
 [array(['budget', 0.04110180451316815], dtype=object),
  array(['projectedImpressions', 0.0045980868175371165], dtype=object)],
 3302,
 56.34847246718043)

#### OrthogonalMatchingPursuitCV
***
**OrthogonalMatchingPursuitCV** Cross-validated Orthogonal Matching Pursuit model (OMP)

In [235]:
def max_r_squared_elastic(n):
    rsquared_train = []
    rsquared_test = []
    coeff = []
    # find the maximum r-squared value and corresponding random_state value
    for i in range(n):
        pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=i)
        model = OrthogonalMatchingPursuitCV(cv=5).fit(pred_train, tar_train)
        
        rsquared_train.append(model.score(pred_train,tar_train))
        rsquared_test.append(model.score(pred_test,tar_test))
        mx = rsquared_test.index(max(rsquared_test))
        
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=mx)
    model = OrthogonalMatchingPursuitCV(cv=5).fit(pred_train, tar_train)
    predictors_model=pd.DataFrame(listofallpredictors)
    predictors_model.columns = ['label']
    predictors_model['coeff'] = model.coef_
    for index, row in predictors_model.iterrows():
        if row['coeff'] != 0:
            coeff.append(row.values)
    
    return rsquared_train[mx], rsquared_test[mx], coeff, mx, model.intercept_

In [227]:
max_r_squared_elastic(20000)

(0.660064927407843,
 0.8663663968730866,
 [array(['placementCost', 0.2873464093549272], dtype=object),
  array(['mimicUI', 30.610673469836343], dtype=object),
  array(['000-$374', 129.3121928313342], dtype=object),
  array(['Company Sales', -54.40228439204767], dtype=object),
  array(['Family & Parenting', 96.27686563976097], dtype=object),
  array(['Online Activity', 41.216759348320615], dtype=object),
  array(['Real Estate', 47.727384807464496], dtype=object),
  array(['reach', -54.10834679843333], dtype=object),
  array(['doctor', 132.66656983428442], dtype=object)],
 16230,
 28.443466139931346)

The optimal $r^2$ in this model is good, but the difference betweeen these two $r-squared$ makes it inconsistent.