# Directions

1. Walk through the entire notebook, section by section, to run the entire pipeline for all three kinds of prediction. There are six sections total, including one supplemental module.
    1. Set up
    2. Individual feature
    3. Umbrella features
    4. Combined features (individual)
    5. Combined features (umbrella)
    6. *(Supplemental) Informative features*
2. **Yellow cells**: Run cell *without* changing code 
3. **Red cells**: Complete TODOs and run cell

# 1. Set Up

**1.1 Import relevant packages**

In this section, import all relevant notebooks containing helper functions and packages (may take a few minutes).

In [1]:
# import once, to import functions from other notebooks
import import_ipynb

# general packages
import pandas as pd
import io
import numpy as np
import requests
import os
import pickle
import pyarrow.parquet as pq
import pyarrow as pa
import random
import statistics
import json
import copy

# import pipeline components
import pipeline_components.general as general
import pipeline_components.model_functions as model 
import pipeline_components.plot as plot 

general.set_background('#fff9e9')

importing Jupyter notebook from /Users/anjimenez/HCL/crt_project/pipeline/pipeline_components/general.ipynb
importing Jupyter notebook from /Users/anjimenez/HCL/crt_project/pipeline/pipeline_components/model_functions.ipynb
importing Jupyter notebook from /Users/anjimenez/HCL/crt_project/pipeline/pipeline_components/plot.ipynb


**1.2 Define global variables and paths**

In this section, we define the global variables and paths that will be used throughout the pipeline.

<font color='green'>**Variable Descriptions**</font>
1. **MY_TARGET** (string): name of target variable (i.e. CRT score)
    1. Use "CRT_ACC" if CRT contains both numeric and conceptual questions
    2. Use "CRT_numeric" if CRT contains just numeric questions
    3. Use "CRT_conceptual" if CRT contains just conceptual questions
2. **DATAFRAME_TYPE** (string): description of how NaN values are dealt with in the dataframe
    1. Use "Complete" if NaN values were dropped
    2. Use "Imputed" if NaN values were imputed
    3. Use "Full" if NaN values remain as NaNs
3. **MY_DATE** (string): month and day of run as MM_DD
4. **MY_CROSS_VAL** (integer): the number of folds for cross-validation during prediction
5. **MY_TEXT_FEATURES** (list of strings): list of text feature names

<font color='blue'>**Path Descriptions**</font>
1. **ROOT_DIR** (string): path to root directory, which is /pipeline
2. **RESULTS_FOLDER_NAME** (string): name of folder that stores pipeline results 
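
As a minimal sketch of how these variables fit together, the snippet below composes the relative results folder from the run settings, mirroring the `'results/{}/{}/{}/'` format string used in this section's set-up cell. The helpers `format_run_date` and `build_results_folder` are hypothetical names introduced for illustration only; they are not part of `pipeline_components`.

```python
import datetime

def format_run_date(day):
    """Format a date as MM_DD, the convention described for MY_DATE (hypothetical helper)."""
    return day.strftime("%m_%d")

def build_results_folder(date, dataframe_type, target):
    """Compose the relative results folder from the run settings (hypothetical helper)."""
    # Same layout as 'results/{}/{}/{}/'.format(MY_DATE, DATAFRAME_TYPE.lower(), MY_TARGET)
    return "results/{}/{}/{}/".format(date, dataframe_type.lower(), target)

run_date = format_run_date(datetime.date(2023, 4, 11))  # the year is arbitrary here
folder = build_results_folder(run_date, "Complete", "CRT_ACC")
print(folder)
```

Keeping the date, dataframe type, and target in the path means that runs with different settings are written to different folders rather than overwriting each other.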


In [3]:
# TODO: set variables
MY_TARGET = 'CRT_ACC' # {"CRT_ACC", "CRT_numeric", "CRT_conceptual"}
DATAFRAME_TYPE = 'Complete' # {"Complete", "Imputed", "Full"}
MY_DATE = '04_11'
MY_CROSS_VAL = 5

# TODO: set text features
MY_TEXT_FEATURES = ['text', 'domains', 'followees', 'mentions', 'hashtags', 'bio', 'follower_bios', 'followee_bios']

general.set_background('#efe1e1')

In [6]:
# run this cell to create results and plots folders if they do not exist
ROOT_DIR = os.getcwd() + "/"  # notebooks do not define __file__, so use the working directory
RESULTS_FOLDER_NAME = 'results/{}/{}/{}/'.format(MY_DATE, DATAFRAME_TYPE.lower(), MY_TARGET) 

PLOT_FOLDER_NAME = 'results/{}/plots/'.format(MY_DATE)

# exist_ok=True makes these calls no-ops if the folders already exist
os.makedirs(RESULTS_FOLDER_NAME, exist_ok=True)
os.makedirs(PLOT_FOLDER_NAME, exist_ok=True)
    
general.set_background('#fff9e9')

**1.3 Read data**

Data should already be pre-processed and stored in the **/data/master** subfolder. See *data_processing.ipynb* for instructions on how the dataframe should be formatted.

<font color='blue'>**Path Descriptions**</font>
1. **DATA_PATH** (string): path to file containing dataframe 

In [4]:
# TODO: set path to dataframe
DATA_PATH = 'data/master/data_full.parquet' # Full dataset
# DATA_PATH = 'data/master/data_complete.parquet' # Complete dataset

general.set_background('#efe1e1')

In [402]:
# run this cell to read the dataframe and print basic information about the data
data = pq.read_table(DATA_PATH).to_pandas()

target_feats = []
for i in data.columns:
    if i != 'screen_name' and i != 'age' and not i.startswith('CRT'):
        target_feats.append(i)

print("All feature names:\n")
for a,b,c in zip(data.columns.values[::3],data.columns.values[1::3],data.columns.values[2::3]):
    print('{:<30}{:<30}{:<}'.format(a,b,c))
print("\nNumber of participants:", data.shape[0], ", Number of features:", len(target_feats))

general.set_background('#fff9e9')

All feature names:

screen_name                   followees_count               CRT_ACC
political                     has_location                  is_protected
follower_followers_avg        bio                           insight
followee_tweets_avg           hashtags_count                hashtags_avg_length
neg_emotion                   age                           favorites_avg
domains_count_unique          domains                       follower_bios
follower_followees_avg        CRT_numeric                   mentions_count
inhibit                       url_exist                     followees
follower_media_avg            statuses_count                domains_count
pos_emotion                   followee_followers_avg        creation_time_of_day
follower_likes_avg            hashtags                      has_profile_image
screen_name_char              follower_verified_score       followee_bios
followee_verified_score       bio_char                      moral
CRT_conceptual           

**1.4 Define umbrella features**

Individual features can be combined to generate umbrella features: feature groups containing related features. For instance, hashtags, hashtags count, and hashtags average length all contain information related to hashtags and are grouped into one umbrella feature.

In this section, we define umbrella features to test. 
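
The prefix-based grouping described above can be sketched as follows. This is an illustrative helper under assumed names (`group_by_prefix` is hypothetical), using a few column names from the feature listing printed earlier.

```python
def group_by_prefix(feature_names, prefix, extra=()):
    """Collect feature names starting with `prefix`, after any hand-picked extras."""
    group = list(extra)
    group.extend(name for name in feature_names if name.startswith(prefix))
    return group

# A small subset of the column names shown in the feature listing
columns = ["hashtags", "hashtags_count", "hashtags_avg_length", "mentions_count", "bio"]

hashtag_features = group_by_prefix(columns, "hashtag")
print(hashtag_features)
```

Note that prefix matching is only as reliable as the naming scheme: a prefix that is too short could pull unrelated columns into a group.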

In [295]:
# TODO: set umbrella features
profile_features = [
    'creation_time_of_day',
    'url_exist',
    'screen_name_char', 
    'has_profile_image',
    'days_on_twitter',
    'has_location',
    'bio_char',
    'bio',
    'bot_score',
    'is_protected',
    'listed_count',
    'screen_name_digits',
    'favorites_avg',
    'favorites_given_count'
]

tweet_features = [
    'text', 
    'statuses_count'
]

emotional_features = [
    'insight',
    'inhibit', 
    'relig', 
    'pos_emotion',
    'neg_emotion',
    'moral',
    'political'
]

followee_features = []
for i in target_feats:
    if i.startswith('followee'):
        followee_features.append(i)

follower_features = []
for i in target_feats:
    if i.startswith('follower'):
        follower_features.append(i)
        
hashtag_features = []
for i in target_feats:
    if i.startswith('hashtag'):
        hashtag_features.append(i)

mention_features = []
for i in target_feats:
    if i.startswith('mention'):
        mention_features.append(i)

domain_features = ['outlet_score']
for i in target_feats:
    if i.startswith('domain'):
        domain_features.append(i)

# TODO: define combination features (combinations of umbrella features)
all_text = domain_features + mention_features + tweet_features + hashtag_features + emotional_features
all_text_words = ['domains', 'hashtags', 'text']
all_text_addendums = ['domains', 'hashtags', 'mentions']
all_friends = follower_features + followee_features

# TODO: create list of umbrella features and umbrella feature names
all_umbrella_features = [
    domain_features, 
    mention_features, 
    tweet_features, 
    follower_features, 
    followee_features, 
    hashtag_features, 
    profile_features, 
    emotional_features, 
    all_text, 
    all_friends, 
    all_text_words, 
    all_text_addendums
]

all_umbrella_features_names = [
    'domain_features', 
    'mention_features', 
    'text_features', 
    'follower_features', 
    'followee_features', 
    'hashtag_features', 
    'profile_features', 
    'emotional_features', 
    'all_text', 
    'all_friends', 
    'all_text_words', 
    'all_text_addendums'
]

general.set_background('#efe1e1')

In [None]:
# run this cell to generate a dictionary mapping umbrella feature name to features
combined_feats_dict = {}
for i in range(len(all_umbrella_features)):
    combined_feats_dict[all_umbrella_features_names[i]] = all_umbrella_features[i]
    
general.set_background('#fff9e9')

# 2. Individual Features

In this section, we determine which individual features are most predictive of CRT score. This module performs the following, in order: 
1. Transform features
    1. *For text features*: Transform feature using TF-IDF and TruncatedSVD
    2. *For quantitative features*: Log transform if skew is large and standardize
2. Split data into train and test set 
3. Perform feature selection with ElasticNetCV
4. Predict CRT on test set with model of choice and cross-validation
5. Average results over all splits
6. Plot and save results
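
To make the quantitative transform in step 1 concrete, here is a minimal standard-library sketch of "log transform if skew is large, then standardize". The actual logic lives in `model.transform_feature`, which is not shown here, so the skew threshold of 1.0, the use of `log1p`, and the function names are all assumptions made for illustration.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

def transform_quantitative(values, skew_threshold=1.0):
    """Log-transform strongly skewed data, then scale to zero mean and unit variance."""
    if abs(skewness(values)) > skew_threshold:
        # log1p keeps zeros finite; this assumes non-negative inputs such as counts
        values = [math.log1p(v) for v in values]
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

counts = [1, 2, 2, 3, 1000]  # made-up, heavily right-skewed counts
scaled = transform_quantitative(counts)
print(scaled)
```

The log step compresses long right tails (common in count features such as follower or status counts) so that a handful of extreme accounts do not dominate a linear model.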

<font color='green'>**Variable Descriptions**</font>
1. **MY_MODEL_INDV** (string): name of model of choice
    1. Use "ridge" for Ridge model
    2. Use "lasso" for LASSO model
    3. Use "rfr" for Random Forests Regressor model
2. **MY_PARAMS_INDV** (dictionary): dictionary of parameters for model of choice, following the formatting directions below:
    1. Keys are structured "X__Y" where X is either "model" or "poly" and Y is the parameter name, separated by a double underscore
        1. Use "model" as X to set parameters for the model of choice, which is either Ridge, LASSO, or Random Forests
        2. Use "poly" as X to set degree for PolynomialFeatures, which captures cross-feature interactions
        3. The parameter name Y must match the parameter name in the specs
            1. [Ridge specs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
            2. [LASSO specs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
            3. [Random Forests specs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
            4. [PolynomialFeatures specs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
    2. Values are a list of parameter values to try in order to find the optimal value during cross-validation
        1. List lengths must be equal to or greater than 1 
3. **MY_SPLITS_INDV** (list or integer): data splits
    1. List of integers: list of random states for train/test split
    2. Integer: Randomly generates a list of random states the size of integer
4. **MY_MINDF_INDV** (integer): min_df parameter in TF-IDF
5. **MY_MAXDF_INDV** (float): max_df parameter in TF-IDF
6. **MY_TEST_SIZE_INDV** (float): size of test set out of 1.0
7. **MY_N_COMPONENTS_INDV** (integer): n_components parameter in TruncatedSVD
8. **MY_TITLE_INDV** (string): title for saving and plotting results

<font color='blue'>**Path Descriptions**</font>
1. **RESULTS_FOLDER_INDIVIDUAL** (string): name of folder that stores individual feature results 
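
The "X__Y" key convention for the parameter dictionary can be checked before launching a long run. The sketch below is a hypothetical validator (`split_param_key` and `validate_params` are not pipeline functions). The double-underscore form is the same one scikit-learn uses for nested parameters in a `Pipeline`, though how the pipeline consumes these keys internally is not shown here.

```python
ALLOWED_STEPS = {"model", "poly"}

def split_param_key(key):
    """Split 'X__Y' into (step, parameter name), rejecting malformed keys."""
    step, sep, param = key.partition("__")
    if not sep or step not in ALLOWED_STEPS or not param:
        raise ValueError("expected 'model__<param>' or 'poly__<param>', got {!r}".format(key))
    return step, param

def validate_params(params):
    """Check every key is well formed and every value holds at least one candidate."""
    for key, values in params.items():
        split_param_key(key)
        if len(values) < 1:
            raise ValueError("no candidate values given for {!r}".format(key))
    return True

example_params = {"model__alpha": [0.1, 1.0, 10.0], "poly__degree": [1]}
print(validate_params(example_params))
print(split_param_key("model__alpha"))
```

A check like this catches a single-underscore typo such as `model_alpha` immediately, instead of partway through the first split.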

In [7]:
# TODO: set variables and paths
MY_MODEL_INDV = 'ridge'
MY_PARAMS_INDV = {'model__alpha': np.logspace(-5, 5, 100)}
MY_SPLITS_INDV = range(0, 50)
MY_MINDF_INDV = 10
MY_MAXDF_INDV = 0.99
MY_TEST_SIZE_INDV = 0.1
MY_N_COMPONENTS_INDV = 50
MY_TITLE_INDV = 'Individual Features'

RESULTS_FOLDER_INDIVIDUAL = ROOT_DIR + RESULTS_FOLDER_NAME + 'individual_{}/'.format(MY_MODEL_INDV)

general.set_background('#efe1e1')

In [8]:
# run this cell to store results
state_dict_individual = {
    'model': MY_MODEL_INDV, 
    'params': MY_PARAMS_INDV, 
    'splits': MY_SPLITS_INDV,
    'mindf': MY_MINDF_INDV,
    'maxdf': MY_MAXDF_INDV,
    'test_size': MY_TEST_SIZE_INDV,
    'n_components': MY_N_COMPONENTS_INDV,
}

r_dict = {}
p_dict = {}
score_dict = {}
individual_results_df = pd.DataFrame({})

general.set_background('#fff9e9')

In [6]:
# run this cell to find best individual features 

for feat in target_feats:

    print(feat)
    
    # determine splits 
    splits = model.get_splits(MY_SPLITS_INDV)
                          
    # create X and Y from data 
    X_original = data[feat].dropna()
    Y_original = data[MY_TARGET]
    
    # set r, p, score sums to 0
    total_r = []
    total_p = []
    total_score = []
    
    if feat in MY_TEXT_FEATURES:
        X, Y, na_count = model.transform_feature(data, feat, MY_TARGET, my_max_df=MY_MAXDF_INDV, 
                                                 my_min_df=MY_MINDF_INDV, 
                                 my_n_components=MY_N_COMPONENTS_INDV)
    else:
        X, Y, na_count = model.transform_feature(data, feat, MY_TARGET)
            
    # average over all splits 
    for s in splits:
        
        # split data into train test set
        X_train, X_test, Y_train, Y_test = model.split_data(X, Y, MY_TARGET, MY_TEST_SIZE_INDV, s)
        
        # feature selection with ElasticNet on train set 
        X_train, X_test, Y_train, Y_test = model.feature_selection(X_train, X_test, Y_train, Y_test, 'median')
        
        # predict on test set 
        results = model.predict(X_train, X_test, Y_train, Y_test, MY_MODEL_INDV, MY_PARAMS_INDV, MY_CROSS_VAL)
        
        print(results)
        
        # add to r, p totals
        total_r.append(results['r'])
        total_p.append(results['p'])
        total_score.append(results['r2'])
    
    # add individual feature results to results dataframe
    f_results = model.create_results(feat, na_count, total_r, total_p, total_score)
    individual_results_df = individual_results_df.append(f_results, ignore_index = True)
    
    r_dict[feat] = total_r
    p_dict[feat] = total_p
    score_dict[feat] = total_score

# create and show plots
# (named plotly_fig so the imported `plot` module is not shadowed)
plotly_fig, chart = model.create_plotly_individual(individual_results_df, MY_TARGET, 
                                                   DATAFRAME_TYPE, MY_MODEL_INDV, MY_TITLE_INDV)
chart.show()

# save results and plots 
model.save_results(RESULTS_FOLDER_INDIVIDUAL, MY_TITLE_INDV.lower(), individual_results_df, 
             plotly_fig, chart, state_dict_individual)
plot.individual_plot(individual_results_df, MY_TEXT_FEATURES, PLOT_FOLDER_NAME, MY_MODEL_INDV, DATAFRAME_TYPE)

general.set_background('#fff9e9')

# 3. Umbrella Features

*Variable and path definitions are similar to those for individual feature prediction in Section 2.*

In [9]:
# TODO: set variables and paths
MY_MODEL_UMB = 'ridge'
MY_PARAMS_UMB = {'model__alpha': np.logspace(-5, 5, 100)}
MY_SPLITS_UMB = range(0, 50)
MY_MINDF_UMB = 10
MY_MAXDF_UMB = 0.99
MY_TEST_SIZE_UMB = 0.1
MY_N_COMPONENTS_UMB = 50
MY_TITLE_UMB = 'Umbrella Features'

RESULTS_FOLDER_UMBRELLA = ROOT_DIR + RESULTS_FOLDER_NAME + 'umbrella_{}/'.format(MY_MODEL_UMB)

general.set_background('#efe1e1')

In [10]:
# run this cell to store results
state_dict_umbrella = {
    'model': MY_MODEL_UMB, 
    'params': MY_PARAMS_UMB, 
    'splits': MY_SPLITS_UMB,
    'mindf': MY_MINDF_UMB,
    'maxdf': MY_MAXDF_UMB,
    'test_size': MY_TEST_SIZE_UMB,
    'n_components': MY_N_COMPONENTS_UMB,
}

r_dict = {}
p_dict = {}
score_dict = {}
umbrella_results_df = pd.DataFrame({})

general.set_background('#fff9e9')

In [2]:
# run this cell to find best umbrella features 

for j in range(len(all_umbrella_features)):

    feature = all_umbrella_features[j]
    feature_name = all_umbrella_features_names[j]
    
    print(feature)
    
    # determine splits 
    splits = model.get_splits(MY_SPLITS_UMB)
                          
    # create X and Y from data 
    X_original = data[feature]
    Y_original = data[MY_TARGET]
    
    # set r, p, score sums to 0
    total_r = []
    total_p = []
    total_score = []
    
    # transform text and quantitative data 
    X = []
    for feat in feature:
        if feat in MY_TEXT_FEATURES:
            X_new, Y_new, na_count = model.transform_feature(data, feat, MY_TARGET, my_max_df=MY_MAXDF_UMB, 
                                                             my_min_df=MY_MINDF_UMB, 
                                     my_n_components=MY_N_COMPONENTS_UMB)
        else:
            X_new, Y_new, na_count = model.transform_feature(data, feat, MY_TARGET)
        
        # concatenate transformed individual feature to umbrella feature
        if len(X) == 0:  # safe for both the initial list and NumPy arrays
            X = X_new
        else:
            X = np.concatenate([X, X_new], axis=1)
    
    # average over all splits 
    for s in splits:
        
        # split data into train test set
        X_train, X_test, Y_train, Y_test = model.split_data(X, Y_original, MY_TARGET, MY_TEST_SIZE_UMB, s)
        
        # feature selection with ElasticNet on train set 
        X_train, X_test, Y_train, Y_test = model.feature_selection(X_train, X_test, Y_train, Y_test, 'median')
        
        # predict on test set 
        results = model.predict(X_train, X_test, Y_train, Y_test, MY_MODEL_UMB, MY_PARAMS_UMB, MY_CROSS_VAL)
        
        print(results)
        
        # add to r, p totals
        total_r.append(results['r'])
        total_p.append(results['p'])
        total_score.append(results['r2'])
    
    # add feature results to results dataframe 
    f_results = model.create_results(feature_name, na_count, total_r, total_p, total_score)
    umbrella_results_df = umbrella_results_df.append(f_results, ignore_index = True)
            
    r_dict[feature_name] = total_r
    p_dict[feature_name] = total_p
    score_dict[feature_name] = total_score

# create and show plots
# (named plotly_fig so the imported `plot` module is not shadowed)
plotly_fig, chart = model.create_plotly_individual(umbrella_results_df, MY_TARGET, 
                                                   DATAFRAME_TYPE, MY_MODEL_UMB, MY_TITLE_UMB)
chart.show()

# save results and plots 
model.save_results(RESULTS_FOLDER_UMBRELLA, MY_TITLE_UMB.lower(), umbrella_results_df, 
             plotly_fig, chart, state_dict_umbrella)
plot.umbrella_plot(umbrella_results_df, PLOT_FOLDER_NAME, MY_MODEL_UMB)

general.set_background('#fff9e9')

# 4. Combined Features (Individual)


While individual features hold some predictive power over CRT score, combining features can improve prediction accuracy. By combining features, we create the best model for CRT score prediction given the current dataset. The following protocol is used to combine individual features:
1. Predict CRT score using all features.
2. Remove the individual feature with the highest median $p$ value across splits.
3. Repeat Steps 1 and 2, removing one feature at a time, until one feature remains (i.e. feature with the lowest median $p$ value).

Umbrella features are also combined using a similar protocol as above, but we remove the umbrella feature with the highest median $p$ value with each iteration. 
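
The elimination protocol above can be sketched in a few lines. In this illustration the ordering is fixed once from the median $p$ values (as in the ordering cell later in this section, which sorts the Section 2 results) rather than re-estimated after each removal. The function name and the $p$ values are made up.

```python
def elimination_order(p_medians):
    """Yield shrinking feature sets, dropping the highest median p value each round."""
    # Ascending sort puts the lowest p value first, so pop() removes the highest
    remaining = sorted(p_medians, key=p_medians.get)
    while remaining:
        yield list(remaining)
        remaining.pop()

# Made-up median p values, for illustration only
p_medians = {"bio": 0.20, "followees": 0.01, "hashtags": 0.05}

for subset in elimination_order(p_medians):
    print(subset)
```

Each yielded subset corresponds to one round of prediction, so a dataset with $k$ features produces $k$ combined models, ending with the single feature that has the lowest median $p$ value.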

*Variable and path definitions are similar to those for individual feature prediction in Section 2.*


In [14]:
# TODO: set variables and paths
MY_MODEL_COMB_INDV = 'lasso'
MY_PARAMS_COMB_INDV = {'model__alpha': np.logspace(-5, 5, 100), 'poly__degree': [1]} # structure for ridge/lasso 
# MY_PARAMS_COMB_INDV = {
#              'model__max_depth': [10], # default is [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None]
#              'model__min_samples_leaf': [40], # default is [20, 40, 60]
#              'model__min_samples_split': [10], # default is [2, 5, 10]
#              'model__n_estimators': [600], # default is [200, 400, 600, 800, 1000]
#              'model__max_features': ['log2'],
#              'poly__degree': [1] # default is [1]
#         } # structure for random forests
MY_SPLITS_COMB_INDV = range(0, 50)
MY_MINDF_COMB_INDV = 10
MY_MAXDF_COMB_INDV = 0.99
MY_TEST_SIZE_COMB_INDV = 0.1
MY_N_COMPONENTS_COMB_INDV = 50
MY_TITLE_COMB_INDV = 'Combined Individual Features'

RESULTS_FOLDER_COMBINED_INDV = ROOT_DIR + RESULTS_FOLDER_NAME + 'combined_individual_{}/'.format(MY_MODEL_COMB_INDV)

general.set_background('#efe1e1')

In [12]:
# run this cell to generate the ordering of individual features to combine

to_combine = individual_results_df.sort_values('p_median', ascending=True)['feature'].values
to_combine = to_combine.tolist()

general.set_background('#fff9e9')

In [15]:
# run this cell to store results
state_dict_combined_individual = {
    'model': MY_MODEL_COMB_INDV, 
    'params': MY_PARAMS_COMB_INDV, 
    'splits': MY_SPLITS_COMB_INDV,
    'mindf': MY_MINDF_COMB_INDV,
    'maxdf': MY_MAXDF_COMB_INDV,
    'test_size': MY_TEST_SIZE_COMB_INDV,
    'n_components': MY_N_COMPONENTS_COMB_INDV,
}

r_dict = {}
p_dict = {}
score_dict = {}
combined_individual_results_df = pd.DataFrame({})

general.set_background('#fff9e9')

In [3]:
# run this cell to combine individual features 

for i in range(len(to_combine)):
    
    # determine splits 
    splits = model.get_splits(MY_SPLITS_COMB_INDV)
        
    # remove individual feature with highest median p value 
    if i != 0:
        to_combine.pop()
        
    # generate combined feature name
    combined_name = " ".join(to_combine)
    print(combined_name)
        
    X = []
    for feat in to_combine:
        if feat in MY_TEXT_FEATURES:
            X_new, Y_new, na_count = model.transform_feature(data, feat, MY_TARGET, my_max_df=MY_MAXDF_COMB_INDV, 
                                                             my_min_df=MY_MINDF_COMB_INDV, my_n_components=MY_N_COMPONENTS_COMB_INDV)
        else:
            X_new, Y_new, na_count = model.transform_feature(data, feat, MY_TARGET)
        
        # concatenate transformed individual feature to all features for prediction
        if len(X) == 0:  # safe for both the initial list and NumPy arrays
            X = X_new
        else:
            X = np.concatenate([X, X_new], axis=1)
    
    # set r, p, score sums to 0
    total_r = []
    total_p = []
    total_score = []
        
    for s in splits: 
        # split data into train test set (same split)
        X_train, X_test, Y_train, Y_test = model.split_data(X, data[MY_TARGET], MY_TARGET, MY_TEST_SIZE_COMB_INDV, s)
        
        # feature selection with ElasticNet on train set 
        X_train, X_test, Y_train, Y_test = model.feature_selection(X_train, X_test, Y_train, Y_test, 'median')
        
        # predict on test set 
        results = model.predict(X_train, X_test, Y_train, Y_test, MY_MODEL_COMB_INDV, MY_PARAMS_COMB_INDV, MY_CROSS_VAL)
        print(results)
        
        # add to r, p totals
        total_r.append(results['r'])
        total_p.append(results['p'])
        total_score.append(results['r2'])
    
    # add combined feature results to results dataframe
    f_results = model.create_results(combined_name, na_count, total_r, total_p, total_score)
    combined_individual_results_df = combined_individual_results_df.append(f_results, ignore_index = True)
                
    r_dict[combined_name] = total_r
    p_dict[combined_name] = total_p
    score_dict[combined_name] = total_score
    
# create and show plots
# (named plotly_fig so the imported `plot` module is not shadowed)
plotly_fig, chart = model.create_plotly_individual(combined_individual_results_df, MY_TARGET, 
                                                   DATAFRAME_TYPE, MY_MODEL_COMB_INDV, MY_TITLE_COMB_INDV)
chart.show()

# save results and plots 
model.save_results(RESULTS_FOLDER_COMBINED_INDV, MY_TITLE_COMB_INDV.lower(), combined_individual_results_df, 
             plotly_fig, chart, state_dict_combined_individual)
plot.combined_plot(combined_individual_results_df, PLOT_FOLDER_NAME, 'individual')

general.set_background('#fff9e9')

# 5. Combined Features (Umbrella) 

In this section, we combine umbrella features using a similar protocol to combined individual prediction in Section 4.

*Variable and path definitions are similar to those for individual feature prediction in Section 2.*

In [17]:
# TODO: set variables and paths
MY_MODEL_COMB_UMB = 'rfr'
# MY_PARAMS_COMB_UMB = {'model__alpha': np.logspace(-5, 5, 100), 
#                       'poly__degree': [2], 'model__tol': [0.01]} # structure for ridge/lasso
MY_PARAMS_COMB_UMB = {
             'model__max_depth': [10], # default is [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None]
             'model__min_samples_leaf': [20], # default is [20, 40, 60]
             'model__min_samples_split': [10], # default is [2, 5, 10]
             'model__n_estimators': [600], # default is [200, 400, 600, 800, 1000]
             'model__max_features': ['log2'],
             'poly__degree': [1] # default is [1]
        } # structure for random forests

MY_SPLITS_COMB_UMB = range(0, 50)
MY_MINDF_COMB_UMB = 10
MY_MAXDF_COMB_UMB = 0.99
MY_TEST_SIZE_COMB_UMB = 0.1
MY_N_COMPONENTS_COMB_UMB = 50
MY_TITLE_COMB_UMB = 'Combined Umbrella Features'

RESULTS_FOLDER_COMBINED_UMBRELLA = ROOT_DIR + RESULTS_FOLDER_NAME + 'combined_umbrella_{}/'.format(MY_MODEL_COMB_UMB)

general.set_background('#efe1e1')

In [18]:
# run this cell to store results
state_dict_combined_umbrella = {
    'model': MY_MODEL_COMB_UMB, 
    'params': MY_PARAMS_COMB_UMB, 
    'splits': MY_SPLITS_COMB_UMB,
    'mindf': MY_MINDF_COMB_UMB,
    'maxdf': MY_MAXDF_COMB_UMB,
    'test_size': MY_TEST_SIZE_COMB_UMB,
    'n_components': MY_N_COMPONENTS_COMB_UMB,
}

r_dict = {}
p_dict = {}
score_dict = {}
combined_umbrella_results_df = pd.DataFrame({})

general.set_background('#fff9e9')

In [32]:
# run this cell to generate the ordering of umbrella features to combine

to_combine = umbrella_results_df.sort_values('p_median', ascending=True)['feature'].values
to_combine = to_combine.tolist()  # a list is needed so that pop() works in the next cell

# remove combinations of umbrella features (names starting with 'all')
to_combine = [elem for elem in to_combine if not elem.startswith('all')]

general.set_background('#fff9e9')

In [4]:
# run this cell to combine umbrella features 

for i in range(len(to_combine)):
        
    # remove umbrella with highest median p value 
    if i != 0:
        to_combine.pop()

    # generate combined feature name
    combined_name = " ".join(to_combine)
    print(combined_name)
    
    # determine splits 
    splits = model.get_splits(MY_SPLITS_COMB_UMB)
    
    # get all individual features 
    all_feats = []
    for umbrella_feat in to_combine:
        for feat in combined_feats_dict[umbrella_feat]:
            all_feats.append(feat)
    
    X = []
    for feat in all_feats:
        if feat in MY_TEXT_FEATURES:
            X_new, Y_new, na_count = model.transform_feature(data, feat, MY_TARGET, my_max_df=MY_MAXDF_COMB_UMB, 
                                                             my_min_df=MY_MINDF_COMB_UMB, 
                                     my_n_components=MY_N_COMPONENTS_COMB_UMB)
        else:
            X_new, Y_new, na_count = model.transform_feature(data, feat, MY_TARGET)
        
        # concatenate transformed individual feature
        if len(X) == 0:  # safe for both the initial list and NumPy arrays
            X = X_new
        else:
            X = np.concatenate([X, X_new], axis=1)
    
    # set r, p, score sums to 0
    total_r = []
    total_p = []
    total_score = []
        
    for s in splits: 
        # split data into train test set (same split)
        X_train, X_test, Y_train, Y_test = model.split_data(X, data[MY_TARGET], MY_TARGET, MY_TEST_SIZE_COMB_UMB, s)
        
        # feature selection with ElasticNet on train set 
        X_train, X_test, Y_train, Y_test = model.feature_selection(X_train, X_test, Y_train, Y_test, 'median')
        
        # predict on test set 
        results = model.predict(X_train, X_test, Y_train, Y_test, MY_MODEL_COMB_UMB, MY_PARAMS_COMB_UMB, MY_CROSS_VAL)
        print(results)
        
        # add to r, p totals
        total_r.append(results['r'])
        total_p.append(results['p'])
        total_score.append(results['r2'])
    
    # add combined umbrella feature results to results dataframe
    f_results = model.create_results(combined_name, na_count, total_r, total_p, total_score)
    combined_umbrella_results_df = combined_umbrella_results_df.append(f_results, ignore_index = True)
                
    r_dict[combined_name] = total_r
    p_dict[combined_name] = total_p
    score_dict[combined_name] = total_score
    
# create and show plots
# (named plotly_fig so the imported `plot` module is not shadowed)
plotly_fig, chart = model.create_plotly_individual(combined_umbrella_results_df, MY_TARGET, 
                                                   DATAFRAME_TYPE, MY_MODEL_COMB_UMB, MY_TITLE_COMB_UMB)
chart.show()

# save results and plots 
model.save_results(RESULTS_FOLDER_COMBINED_UMBRELLA, MY_TITLE_COMB_UMB.lower(), combined_umbrella_results_df, 
             plotly_fig, chart, state_dict_combined_umbrella)
plot.combined_plot(combined_umbrella_results_df, PLOT_FOLDER_NAME, 'umbrella')

general.set_background('#fff9e9')

# 6. Informative Features

This is a separate module from the above pipeline. In this module, we generate the most informative words and phrases for domains, mentions, hashtags, followees, text (Tweets and Retweets), bios, follower bios, and followee bios. The most informative features are those that yield the largest and smallest coefficients after TF-IDF feature extraction and Ridge regression with cross-validation. 
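
Selecting the "largest and smallest coefficients" amounts to ranking terms by their fitted weight. The sketch below shows only that ranking step on made-up coefficients. It does not reproduce `model.get_informative_features`, and the helper name `most_informative` is hypothetical.

```python
def most_informative(coefficients, n=20):
    """Return the n largest and n smallest (term, coefficient) pairs."""
    ranked = sorted(coefficients.items(), key=lambda item: item[1])  # ascending by weight
    bottom = ranked[:n]      # most negative coefficients first
    top = ranked[::-1][:n]   # most positive coefficients first
    return top, bottom

# Made-up term weights, standing in for fitted regression coefficients
coefficients = {"a": 0.5, "b": -0.3, "c": 0.1, "d": -0.7}

top, bottom = most_informative(coefficients, n=2)
print(top)
print(bottom)
```

Terms in `top` are associated with higher predicted scores and terms in `bottom` with lower ones, which is why both ends of the ranking are reported.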

<font color='green'>**Variable Descriptions**</font>
1. **FEAT_INFORMATIVE** (string): name of target feature, as it appears in dataframe 
2. **N_INFORMATIVE** (integer): number of informative features displayed at each end (top N_INFORMATIVE and bottom N_INFORMATIVE), default is 20 
3. **MAXDF_INFORMATIVE** (float): max_df parameter for TF-IDF, default is 1.0 ([*TF-IDF specs*](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html))
4. **MINDF_INFORMATIVE** (integer): min_df parameter for TF-IDF, default is 10
5. **NGRAM_INFORMATIVE** ((integer, integer)): ngram_range parameter for TF-IDF, default is (1, 1)
6. **STATE_INFORMATIVE** (integer): random state for train/test split

<font color='blue'>**Path Descriptions**</font>
1. **RESULTS_FOLDER_INFORMATIVE** (string): name of folder to save informative feature results 

In [35]:
# TODO: set variables 
FEAT_INFORMATIVE = 'followees'
N_INFORMATIVE = 15
MAXDF_INFORMATIVE = 0.99
MINDF_INFORMATIVE = 10 
NGRAM_INFORMATIVE = (1, 1) 
STATE_INFORMATIVE = 17 

RESULTS_FOLDER_INFORMATIVE = ROOT_DIR + RESULTS_FOLDER_NAME + 'informative_features/{}/'.format(FEAT_INFORMATIVE)

general.set_background('#efe1e1')

In [5]:
# run this cell to generate most informative features
r, p, score, chart = model.get_informative_features(data, FEAT_INFORMATIVE, MY_TARGET, RESULTS_FOLDER_INFORMATIVE, 
                                                   n=N_INFORMATIVE, maxDF=MAXDF_INFORMATIVE, minDF=MINDF_INFORMATIVE, 
                                                   n_gram=NGRAM_INFORMATIVE, my_state=STATE_INFORMATIVE)

chart.show()

general.set_background('#fff9e9')