### Inferential Statistics

Objective: Identify variables in the data to answer a project question. Identify strong correlations between pairs of independent variables or between an independent and a dependent variable. Practice identifying the most appropriate tests to use to analyze relationships between variables.

Outline:
1. Compute pairwise correlation of features, return correlation statistic. Tool: Pandas correlation function corr().
2. Assess correlation of category features, which indicate whether a URL is benign, phishing or malware, with other individual features. Tool: Pandas correlation function corr().

#### Preliminary Steps

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats



In [3]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [4]:
df = pd.read_pickle('capstone2_final')

#### Pairwise Correlation of Features

The data set consists of 97 predictor variables. We'll first drop a few features that are no longer necessary for the project.

In [5]:
df_model = df.drop(['n_dots', 'pc_lowercase', 'n_domain_dots', 'n_netloc_dots', 'n_path_slashes', 'pc_path_lowercase'], axis = 1)

In [6]:
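# get correlations of each feature in the dataset
# (the string-valued 'category' column is excluded explicitly, since
# newer pandas versions raise an error on non-numeric columns)
corrmat = df_model.drop(columns='category').corr()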
top_corr_features = corrmat.index
#plt.figure(figsize=(45,45))
#plot heat map
#g=sns.heatmap(df_model[top_corr_features].corr(),annot=True,cmap="RdYlGn")

In [7]:
corrmat

Unnamed: 0,len_url,is_53,is_54_75,is_76,len_tokenized_url,avg_token_len,last_slashes,loc_last_slashes,n_let,n_num,n_spec,pc_num,pc_let,pc_spec,n_ats,n_semicol,num_underscores,num_question,pc_uppercase,entropy,n_masques,char_cont_rate,n_domain_suffix,len_domain,is_ip,n_domain_num,n_domain_let,n_domain_spec,pc_domain_num,pc_domain_let,pc_domain_spec,n_domain_tok,avg_domain_tok_len,n_domain_hyphens,n_domain_ats,n_domain_masques,domain_entropy,is_top500_domain,len_netloc,n_netloc_num,n_netloc_let,n_netloc_spec,pc_netloc_num,pc_netloc_let,pc_netloc_spec,n_netloc_tok,n_subdomains,avg_netloc_tok_len,n_netloc_masques,netloc_entropy,len_all_paths,n_path_pc20,pc_path_num,pc_path_let,pc_path_spec,n_path_masques,path_entropy,shortest_path_len,longest_path_len,n_single_char_path,n_path_items,avg_path_token_len,pc_path_uppercase,len_param,n_param_num,n_param_let,n_param_spec,pc_param_num,pc_param_let,pc_param_spec,n_params_masque,param_entropy,n_queries,len_query,n_query_num,n_query_let,n_query_spec,pc_query_num,pc_query_let,pc_query_spec,n_queries_masques,queries_entropy,len_frag,n_frag_num,n_frag_let,n_fraf_spec,pc_frag_num,pc_frag_let,pc_frag_spec,n_frag_masques,frag_entropy
len_url,1.0,-0.562073,0.002507,0.654146,0.763483,0.510728,0.229812,-0.415116,0.955055,0.716362,0.772241,0.032582,0.096773,-0.466836,0.137329,0.457459,0.42677,0.495376,0.326838,0.780899,0.69391,-0.52548,0.124328,0.102492,-0.172988,-0.167347,0.2028,-0.108806,-0.178611,0.187988,-0.181884,-0.158877,0.162806,0.043534,,-0.006258,-0.014,0.01207,0.165926,-0.130866,0.234839,0.017558,-0.15654,0.16645,-0.163838,0.002906,0.153078,0.18462,0.022862,0.0963,0.631402,0.002473,0.037806,0.134305,-0.108861,0.295528,0.522497,0.099732,0.547332,0.067597,0.419059,0.417215,0.062989,0.104267,0.100626,0.10169,0.101364,0.067992,0.074207,0.077166,0.09388,0.086512,0.531028,0.75417,0.626956,0.725957,0.580753,0.468051,0.440737,0.325664,0.624806,0.6634,0.058801,0.043306,0.056954,0.050268,0.010487,0.051952,0.02507,0.042667,0.0589
is_53,-0.562073,1.0,-0.558201,-0.665918,-0.483754,-0.333156,-0.058538,0.555295,-0.578934,-0.271726,-0.477471,0.062178,-0.19378,0.521478,-0.107714,-0.145438,-0.262544,-0.272688,-0.161416,-0.76194,-0.195031,0.502048,-0.242617,-0.08939,0.284204,0.275577,-0.260578,0.134991,0.288719,-0.281895,0.178498,0.230634,-0.18118,-0.09213,,0.008224,0.050629,-0.081527,-0.177951,0.246993,-0.311607,0.004718,0.279467,-0.280002,0.19261,0.026002,-0.186567,-0.199914,-0.01837,-0.0833,-0.591149,-0.007279,-0.133979,-0.125352,0.173371,-0.141868,-0.705513,-0.082189,-0.459838,-0.076254,-0.547021,-0.346505,-0.011543,-0.031097,-0.02461,-0.033619,-0.032362,-0.036132,-0.040705,-0.039897,-0.024709,-0.039443,-0.213666,-0.199878,-0.172386,-0.186816,-0.172313,-0.203724,-0.268842,-0.214241,-0.139308,-0.269495,-0.027605,-0.017827,-0.025712,-0.044975,-0.008484,-0.051248,-0.025704,-0.010299,-0.048241
is_54_75,0.002507,-0.558201,1.0,-0.247266,0.008098,0.049597,-0.014457,-0.202159,0.019259,-0.044559,0.007753,-0.053669,0.090707,-0.164422,0.009941,-0.051749,0.028735,-0.052513,-0.000929,0.152413,-0.045549,-0.08756,0.213453,0.042732,-0.156694,-0.145907,0.133545,-0.067717,-0.152203,0.146031,-0.080568,-0.071933,0.066446,-0.00245,,0.008921,-0.012084,0.055444,0.002797,-0.147438,0.089499,-0.061945,-0.15346,0.146075,-0.061054,-0.043478,0.016466,0.039282,0.00538,-0.038416,0.081374,0.005389,0.130217,0.039731,-0.101341,-0.004663,0.233603,0.036716,0.03125,0.011662,0.218317,0.026872,0.004073,-0.011391,-0.009032,-0.012321,-0.011759,-0.012072,-0.012015,-0.012066,-0.009175,-0.014452,-0.05994,-0.066846,-0.062116,-0.06103,-0.054316,-0.061667,-0.046043,-0.03438,-0.050415,-0.077641,-0.000113,-0.000143,-0.000237,0.002019,0.005532,0.005581,0.001116,-0.000919,0.004214
is_76,0.654146,-0.665918,-0.247266,1.0,0.557657,0.344471,0.08136,-0.466714,0.658775,0.357392,0.550629,-0.024357,0.144742,-0.461154,0.116852,0.216376,0.280768,0.365667,0.18934,0.752769,0.268716,-0.507573,0.091407,0.065969,-0.191009,-0.190633,0.184232,-0.096759,-0.200319,0.1979,-0.136011,-0.204661,0.151841,0.109794,,-0.017625,-0.048261,0.045357,0.2053,-0.155876,0.283429,0.050187,-0.188384,0.19565,-0.170037,0.008727,0.203071,0.198144,0.016615,0.131821,0.617189,0.003655,0.03938,0.110664,-0.111346,0.169869,0.61387,0.062969,0.50891,0.078565,0.442525,0.377224,0.009818,0.046558,0.036862,0.05034,0.048367,0.053051,0.05834,0.057442,0.037105,0.059057,0.303418,0.293526,0.257167,0.273042,0.250069,0.293361,0.355359,0.281107,0.208017,0.384532,0.032339,0.020947,0.03024,0.050707,0.004934,0.05483,0.029014,0.012854,0.052547
len_tokenized_url,0.763483,-0.483754,0.008098,0.557657,1.0,0.020545,0.272108,-0.323227,0.648367,0.546953,0.985545,0.094623,-0.056529,-0.095972,0.149782,0.419439,0.307474,0.334573,0.346587,0.600198,0.228479,-0.482057,0.065589,0.034878,-0.05503,-0.057172,0.065729,-0.013272,-0.061271,0.059431,-0.03582,-0.035723,0.045066,0.050039,,-0.009529,-0.02064,-0.004418,0.075357,-0.049129,0.095511,0.049099,-0.055407,0.053797,-0.028194,0.046001,0.087914,0.054786,-0.003224,0.050717,0.680043,0.000779,0.079143,0.080956,-0.047013,0.073084,0.479899,0.167647,0.607786,0.057458,0.377972,0.512541,0.154367,0.071101,0.068786,0.069155,0.069515,0.047076,0.054228,0.054491,0.062019,0.059027,0.43637,0.41758,0.279943,0.403755,0.500405,0.239952,0.275202,0.281854,0.220258,0.395592,0.020589,0.012051,0.018075,0.05344,0.008198,0.051085,0.027754,-0.001312,0.046833
avg_token_len,0.510728,-0.333156,0.049597,0.344471,0.020545,1.0,0.002228,-0.288387,0.574245,0.318506,0.061002,-0.211999,0.392391,-0.776696,0.0185,0.105459,0.271182,0.277266,0.199706,0.454065,0.587099,-0.126593,0.180481,0.222483,-0.369132,-0.348228,0.440019,-0.297584,-0.370453,0.40466,-0.454792,-0.387181,0.382943,-0.00376,,0.007927,-0.003812,0.061471,0.215748,-0.306286,0.405248,-0.17817,-0.349286,0.388305,-0.464026,-0.209291,0.117804,0.397069,0.060195,0.061975,0.253513,0.003771,0.001817,0.045361,-0.134866,0.390215,0.263003,0.087046,0.22732,0.017027,0.11775,0.237821,0.003968,0.02841,0.023205,0.03039,0.028668,0.026224,0.028247,0.030646,0.026614,0.033226,0.194176,0.408518,0.415931,0.385615,0.147455,0.330955,0.28367,0.105424,0.434758,0.388536,0.152969,0.119757,0.154693,0.013548,0.013119,0.030042,0.00221,0.164652,0.068416
last_slashes,0.229812,-0.058538,-0.014457,0.08136,0.272108,0.002228,1.0,0.454388,0.243815,0.020287,0.333693,-0.032435,0.026871,0.004933,0.138555,0.41753,0.300534,0.246194,-0.007635,0.10798,0.014858,-0.090547,-0.021971,0.008249,0.019281,0.019291,-0.005885,0.008689,0.019184,-0.017014,0.002838,0.01108,0.003382,0.001083,,0.003286,0.008834,0.000273,0.000188,0.006596,-0.003506,0.001458,0.017885,-0.016949,0.006678,0.011494,0.003123,-0.00126,0.003676,0.004945,-0.003745,-0.000217,-0.022632,0.023014,-0.001883,0.000146,0.006154,-0.010076,-0.006955,-0.007851,0.004896,-0.012092,-0.016144,-0.001391,-0.001228,-0.001434,-0.001358,-0.001339,-0.001485,-0.001454,-0.001107,-0.001612,0.38607,0.31539,0.031008,0.347279,0.616145,0.014376,0.144305,0.201571,0.017048,0.204094,0.00743,0.002843,0.00649,0.023016,0.000637,0.008018,0.004822,0.001241,0.01085
loc_last_slashes,-0.415116,0.555295,-0.202159,-0.466714,-0.323227,-0.288387,0.454388,1.0,-0.409828,-0.280224,-0.287777,-0.022536,-0.108948,0.479684,-0.02336,0.01229,-0.112318,-0.117839,-0.153941,-0.614292,-0.19876,0.393977,-0.189421,-0.117239,0.194716,0.186762,-0.223277,0.083465,0.201948,-0.204784,0.164837,0.143356,-0.165956,-0.054824,,0.000157,-0.019272,-0.029869,-0.163584,0.159114,-0.247055,-0.020286,0.19299,-0.198645,0.16378,-0.005051,-0.140989,-0.178339,-0.019906,-0.108104,-0.497712,-0.00525,-0.158916,-0.180277,0.157345,-0.12119,-0.58094,-0.108822,-0.399342,-0.078116,-0.466529,-0.306553,-0.050664,-0.034237,-0.02779,-0.036591,-0.03535,-0.03507,-0.040822,-0.041056,-0.02773,-0.041602,-0.048505,-0.087598,-0.182594,-0.058412,0.076458,-0.197467,-0.15377,-0.071778,-0.156013,-0.159436,-0.015365,-0.012326,-0.014917,-0.010436,-0.007598,-0.030315,-0.007792,-0.008307,-0.026233
n_let,0.955055,-0.578934,0.019259,0.658775,0.648367,0.574245,0.243815,-0.409828,1.0,0.49813,0.661929,-0.175999,0.305363,-0.568694,0.144393,0.418037,0.418793,0.483414,0.259745,0.795083,0.693045,-0.526406,0.191069,0.108407,-0.286337,-0.283159,0.290597,-0.200354,-0.295617,0.308716,-0.288313,-0.270827,0.2188,0.038648,,-0.011096,-0.039721,0.036188,0.169382,-0.245536,0.309016,-0.054304,-0.272665,0.286157,-0.263434,-0.06619,0.177017,0.224228,0.019267,0.07158,0.599247,0.00316,-0.064524,0.208797,-0.135535,0.307782,0.554047,0.07094,0.50519,0.055971,0.433612,0.369748,-0.010778,0.07672,0.069702,0.077716,0.074973,0.049006,0.060949,0.060234,0.073188,0.066287,0.499191,0.722338,0.519011,0.729669,0.565494,0.381312,0.453758,0.329184,0.617203,0.625761,0.06847,0.046655,0.067583,0.046847,0.008541,0.052321,0.022285,0.05457,0.060889
n_num,0.716362,-0.271726,-0.044559,0.357392,0.546953,0.318506,0.020287,-0.280224,0.49813,1.0,0.53585,0.527646,-0.415999,-0.159077,0.034098,0.322899,0.219964,0.341685,0.317492,0.436525,0.579754,-0.275473,-0.081132,0.064761,0.150693,0.162813,-0.062665,0.135178,0.157415,-0.157354,0.116552,0.152196,-0.011051,0.034371,,0.010254,0.064182,-0.048802,0.118397,0.193338,-0.007522,0.180395,0.17441,-0.173096,0.110607,0.163331,0.049318,0.049436,0.032867,0.137588,0.368909,-0.000188,0.262942,-0.093506,-0.022382,0.224482,0.199979,0.082864,0.345765,0.064646,0.194261,0.271197,0.172661,0.136589,0.145265,0.124483,0.130343,0.092507,0.076249,0.087973,0.112309,0.104673,0.369119,0.608559,0.760968,0.494856,0.332553,0.566734,0.256266,0.157499,0.53389,0.546986,0.023835,0.028556,0.020832,0.032051,0.011477,0.027016,0.017673,0.012167,0.031384
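The blank entries under n_domain_ats in the table above are NaN values. Pandas returns NaN for a Pearson correlation when one of the two columns has zero variance, so a constant column is one likely cause. The following is a minimal sketch for spotting such columns; the frame `toy` and its column names are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Toy data (hypothetical): 'const' never varies, so its correlations are undefined.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 5, 9],
    "const": [0, 0, 0, 0],
})

# Columns with at most one distinct value have zero variance.
constant_cols = toy.columns[toy.nunique() <= 1].tolist()
print(constant_cols)

# Every correlation involving a constant column comes out as NaN.
toy_corr = toy.corr()
print(toy_corr["const"].isna().all())
```

Running the same `nunique()` check on df_model would show whether n_domain_ats is constant in this dataset.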


List of feature-pair correlations, in descending order of absolute value:

In [8]:
# mask away the lower triangle and diagonal
mask = np.triu(np.ones_like(corrmat),1) == 1

# get the upper triangle (excluding diagonal) by masking and stack:
corr = corrmat.where(mask).stack()

# largest by absolute value (named top_corr to avoid shadowing the built-in max)
top_corr = corr.abs().nlargest(75)
print(top_corr)

len_frag            n_frag_let            0.993732
pc_netloc_num       pc_netloc_let         0.990982
pc_domain_num       pc_domain_let         0.990537
pc_domain_let       pc_netloc_let         0.986543
pc_domain_num       pc_netloc_num         0.986235
len_tokenized_url   n_spec                0.985545
len_param           n_param_let           0.981684
pc_domain_num       pc_netloc_let         0.980047
len_param           n_param_spec          0.979631
pc_domain_let       pc_netloc_num         0.977825
n_domain_num        pc_domain_num         0.976317
is_ip               pc_netloc_num         0.975041
n_param_let         n_param_spec          0.973842
len_query           n_query_let           0.973074
is_ip               pc_netloc_let         0.969996
                    pc_domain_num         0.969425
pc_num              pc_let                0.968539
n_domain_num        pc_netloc_num         0.968420
is_ip               pc_domain_let         0.966854
n_domain_num        pc_domain_l

Analysis:
Several pairs of predictor variables are highly correlated; every pair shown above has an absolute correlation greater than 0.96. All feature variables will nevertheless remain in the dataset for initial attempts at training a machine learning model. Changes to the feature set will be made during model building, if necessary.
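If the feature set does need trimming later, one common approach is to drop one member of each highly correlated pair. The sketch below is only an illustration of that idea, not the method used in this project: the helper `drop_correlated`, the 0.95 threshold, and the toy columns are all assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.95):
    """Drop one feature from every pair whose |correlation| exceeds threshold."""
    corr_abs = frame.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Hypothetical toy data: 'x_copy' is a linear function of 'x'.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + 1,
    "noise": rng.normal(size=200),
})

reduced, dropped = drop_correlated(toy, threshold=0.95)
print(dropped)
print(list(reduced.columns))
```

Which member of a pair is kept depends only on column order here; in practice that choice could instead be guided by each feature's correlation with the target.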

#### Correlation of Features with URL Categories

Next, we calculate the correlation between each predictor feature and the target feature, 'category'.

In [9]:
# make category values numeric
category = {'benign':1, 'phishing':2, 'malicious':3}
df_model['category'] = df_model['category'].map(category)

In [12]:
#get correlations of features with categories
#corrmat2 = df_model.corr()['category']
#corrmat2.sort_values(ascending=False)

# or use:
df_model.drop("category", axis=1).apply(lambda x: x.corr(df_model.category)).abs().sort_values(ascending=False)

is_ip                 0.604087
pc_netloc_num         0.603287
n_netloc_num          0.592038
n_domain_num          0.582885
pc_domain_num         0.581708
pc_netloc_let         0.564852
pc_domain_let         0.557707
n_domain_suffix       0.497599
n_domain_tok          0.493811
pc_num                0.445295
pc_let                0.438070
n_domain_spec         0.431180
domain_entropy        0.419934
netloc_entropy        0.384621
n_netloc_spec         0.368265
n_netloc_tok          0.316131
path_entropy          0.314085
pc_domain_spec        0.305760
pc_uppercase          0.289347
len_netloc            0.274093
is_top500_domain      0.264671
len_domain            0.264375
pc_path_uppercase     0.232758
avg_path_token_len    0.231629
longest_path_len      0.221292
pc_path_spec          0.220846
len_all_paths         0.208341
pc_spec               0.190418
n_domain_let          0.189295
pc_netloc_spec        0.185265
pc_path_num           0.173647
avg_netloc_tok_len    0.169445
n_num   
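Coding the classes as 1, 2, 3 imposes an ordering on what are nominal labels, so the Pearson values above depend on that coding. One alternative worth considering is a one-vs-rest view: correlate each feature with a 0/1 indicator per class, which is the point-biserial correlation. The sketch below uses made-up values; only the column names is_ip, len_url and category are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the project data.
toy = pd.DataFrame({
    "is_ip":   [0, 0, 0, 1, 1, 0, 1, 0],
    "len_url": [20, 25, 22, 60, 55, 40, 70, 30],
    "category": ["benign", "benign", "benign", "malicious",
                 "malicious", "phishing", "malicious", "phishing"],
})

# One 0/1 indicator column per class (one-vs-rest).
indicators = pd.get_dummies(toy["category"]).astype(float)
features = toy.drop(columns="category")

# Rows are features, columns are classes; each cell is a point-biserial correlation.
per_class = pd.DataFrame({
    label: features.corrwith(indicators[label]) for label in indicators.columns
})
print(per_class.round(3))
```

This view shows which class a feature separates from the rest, which a single correlation against the 1/2/3 coding cannot do.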

In [None]:
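# correlations of every feature with the target, strongest first
corrmat2 = df_model.corr()['category'].abs().sort_values(ascending=False)
top_features = corrmat2.index
plt.figure(figsize=(45,45))
# plot heat map
g = sns.heatmap(df_model[top_features].corr(), annot=True, cmap="RdYlGn")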

Analysis:
Numeric domain/netloc features are the most highly correlated with the target feature. The top eight features in this list may reflect the fact that more than 50% of the malicious URLs in our dataset use IP addresses instead of domain names. The domain and netloc features overlap. The number of features will be pared down before machine learning work.
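Because is_ip is binary and the target is categorical, a contingency table with a chi-square test of independence is arguably a better fit than a correlation coefficient for checking the IP-address observation. The following is a sketch on invented counts, chosen only so that 60% of the toy malicious rows use an IP address; it does not reproduce the project's figures.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: 50 rows per class with made-up IP usage.
toy = pd.DataFrame({
    "is_ip": [1] * 30 + [0] * 20 + [1] * 2 + [0] * 48 + [1] * 3 + [0] * 47,
    "category": ["malicious"] * 50 + ["benign"] * 50 + ["phishing"] * 50,
})

# Contingency table of counts: rows are classes, columns are is_ip values.
table = pd.crosstab(toy["category"], toy["is_ip"])

# Share of each class that uses an IP address.
ip_share = table[1] / table.sum(axis=1)
print(ip_share)

# Chi-square test of independence between class and IP usage.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```

On the real data, the same `pd.crosstab` call on df_model would give the actual share of malicious URLs that use an IP address.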