# Classify Using Relevant Features Found

**PROCESS**
- In notebook 2, the relevant features were identified for both the UCI Madelon dataset and the Cook Madelon dataset.
- Below, the same classification models used in the Benchmarking notebook (notebook 1) are applied to those relevant features.
- The resulting scores are then compared to those of the raw classifiers from the Benchmarking notebook.


**RELEVANT FEATURES FOUND FROM FEATURE EXTRACTION (NOTEBOOK 2)** (it is difficult to distinguish the true redundant features from the true informative features)
- *UCI Madelon* has a total of 20 relevant features: 28, 48, 64, 105, 128, 153, 241, 281, 318, 336, 338, 378, 433, 442, 451, 453, 455, 472, 475, and 493
- *Cook Madelon* has a total of 20 relevant features: 257, 269, 308, 315, 336, 341, 395, 504, 526, 639, 681, 701, 724, 736, 769, 808, 829, 867, 920, and 956

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
!pip install psycopg2 --quiet

In [3]:
import psycopg2 as pg2
from psycopg2.extras import RealDictCursor

In [4]:
import scipy.stats as stats

In [5]:
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler, StandardScaler

### Load the Data
- Here, the data is loaded using only the 20 relevant features found for each of the two datasets.
- UCI Madelon: All 2,000 samples from the train set are used, with only the 20 relevant features instead of the original 500 features.
- Cook Madelon: All 220,000 samples will be used, with only the 20 relevant features instead of the original 1,000 features.
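
As an alternative to selecting columns after loading, the relevant columns can be requested at read time. The following is a minimal sketch, not the approach used in the cells below: `sample_text` and `wanted_columns` are hypothetical stand-ins for a Madelon-style file and for the 20 feature indices listed above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a Madelon-style file: space-delimited integers,
# no header row, and a trailing space at the end of every line.
sample_text = "485 477 537 479 452 \n483 458 460 487 587 \n"

# Assumed column positions to keep (the notebook would pass its 20 indices).
wanted_columns = [1, 3]

subset_df = pd.read_csv(
    io.StringIO(sample_text),
    delimiter=" ",
    header=None,
    usecols=wanted_columns,  # read only these column positions
)
print(subset_df)
```

With `header=None`, the resulting columns keep their original integer positions as labels, so downstream code that refers to features by index should still work.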

##### UCI Madelon

In [6]:
madelon_all_train = '../assets/madelon_train.data'
madelon_label_train = '../assets/madelon_train.labels'

In [7]:
madelon_all_train_df = pd.read_csv(madelon_all_train, delimiter=' ', header=None)
madelon_all_train_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,491,492,493,494,495,496,497,498,499,500
0,485,477,537,479,452,471,491,476,475,473,...,481,477,485,511,485,481,479,475,496,
1,483,458,460,487,587,475,526,479,485,469,...,478,487,338,513,486,483,492,510,517,
2,487,542,499,468,448,471,442,478,480,477,...,481,492,650,506,501,480,489,499,498,
3,480,491,510,485,495,472,417,474,502,476,...,480,474,572,454,469,475,482,494,461,
4,484,502,528,489,466,481,402,478,487,468,...,479,452,435,486,508,481,504,495,511,


In [8]:
UCI_extracted_features = [28, 48, 64, 105, 128, 153, 241, 281, 318, 336, 338, 378, 433, 442, 451, 453, 455, 472, 475, 493]
madelon_train_relfeat_only = madelon_all_train_df.iloc[:, UCI_extracted_features]
madelon_train_relfeat_only.head()

Unnamed: 0,28,48,64,105,128,153,241,281,318,336,338,378,433,442,451,453,455,472,475,493
0,459,440,648,181,452,575,434,517,414,658,628,419,533,568,463,471,630,515,401,485
1,475,499,488,431,473,404,551,435,469,469,528,526,442,463,474,311,582,465,549,338
2,491,460,485,593,487,585,474,535,506,465,431,464,569,503,481,606,424,485,454,650
3,472,529,415,698,493,591,569,526,458,398,377,553,565,447,472,545,456,457,602,572
4,472,429,387,451,475,448,538,456,462,385,509,424,462,536,472,426,465,500,560,435


In [9]:
madelon_train_relfeat_only.shape

(2000, 20)

In [10]:
madelon_label_train_df = pd.read_csv(madelon_label_train, delimiter=' ', header=None)
madelon_label_train_df.rename(columns={0:'target'}, inplace=True)

In [11]:
to_concat = [madelon_train_relfeat_only, madelon_label_train_df]
madelon_rel_feat_train = pd.concat(to_concat, axis=1)
madelon_rel_feat_train.head()

Unnamed: 0,28,48,64,105,128,153,241,281,318,336,...,378,433,442,451,453,455,472,475,493,target
0,459,440,648,181,452,575,434,517,414,658,...,419,533,568,463,471,630,515,401,485,-1
1,475,499,488,431,473,404,551,435,469,469,...,526,442,463,474,311,582,465,549,338,-1
2,491,460,485,593,487,585,474,535,506,465,...,464,569,503,481,606,424,485,454,650,-1
3,472,529,415,698,493,591,569,526,458,398,...,553,565,447,472,545,456,457,602,572,1
4,472,429,387,451,475,448,538,456,462,385,...,424,462,536,472,426,465,500,560,435,1


In [12]:
madelon_rel_feat_train_only = madelon_rel_feat_train.drop('target', axis=1)
madelon_rel_feat_train_target = madelon_rel_feat_train['target']

In [13]:
madelon_valid = '../assets/madelon_valid.data'
madelon_valid_label = '../assets/madelon_valid.labels'

In [14]:
madelon_valid_df = pd.read_csv(madelon_valid, delimiter=' ', header=None)
madelon_valid_labels_df = pd.read_csv(madelon_valid_label, delimiter = ' ', header=None)

In [15]:
madelon_valid_labels_df.rename(columns={0:'target'}, inplace=True)

In [16]:
# Drop column 500, the empty column produced by the trailing delimiter on each line
madelon_valid_df.drop(500, axis=1, inplace=True)

In [17]:
madelon_valid_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
0,483,454,513,495,523,469,453,477,506,479,...,455,480,543,259,413,520,485,498,523,510
1,485,508,493,487,478,472,504,476,479,475,...,486,480,535,534,514,452,484,495,548,477
2,483,521,507,475,493,486,421,475,496,483,...,491,476,498,495,508,528,486,465,508,503
3,474,504,576,480,553,483,524,478,483,483,...,521,475,470,463,509,525,479,467,552,517
4,495,474,523,479,495,488,485,476,497,478,...,510,471,522,343,509,520,475,493,506,491


In [18]:
madelon_valid_labels_df.head()

Unnamed: 0,target
0,-1
1,-1
2,-1
3,1
4,-1


In [19]:
to_concat_2 = [madelon_valid_df, madelon_valid_labels_df]
madelon_valid_all_df = pd.concat(to_concat_2, axis=1)

In [20]:
madelon_valid_all_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,491,492,493,494,495,496,497,498,499,target
0,483,454,513,495,523,469,453,477,506,479,...,480,543,259,413,520,485,498,523,510,-1
1,485,508,493,487,478,472,504,476,479,475,...,480,535,534,514,452,484,495,548,477,-1
2,483,521,507,475,493,486,421,475,496,483,...,476,498,495,508,528,486,465,508,503,-1
3,474,504,576,480,553,483,524,478,483,483,...,475,470,463,509,525,479,467,552,517,1
4,495,474,523,479,495,488,485,476,497,478,...,471,522,343,509,520,475,493,506,491,-1


In [21]:
# choose only the relevant features from the valid dataset

In [22]:
madelon_valid_relevant_df = madelon_valid_all_df.iloc[:, UCI_extracted_features]
madelon_valid_relevant_target = madelon_valid_all_df['target']

In [23]:
madelon_valid_relevant_target.to_pickle('../assets/pickled_samples/madelon_valid_relevantfeat_targets.pkl')

In [24]:
madelon_valid_relevant_df.head()

Unnamed: 0,28,48,64,105,128,153,241,281,318,336,338,378,433,442,451,453,455,472,475,493
0,490,436,450,420,472,409,541,432,513,418,523,423,427,444,486,300,548,454,538,259
1,491,544,629,541,480,567,456,519,522,626,484,580,559,414,484,523,547,439,429,534
2,479,437,426,500,480,485,517,471,482,383,485,432,485,526,477,479,457,494,517,495
3,472,447,574,314,463,405,425,426,458,560,700,443,427,649,472,453,525,540,386,463
4,469,501,499,395,471,417,537,434,451,483,609,517,448,518,470,368,570,487,527,343


In [25]:
madelon_valid_relevant_df.to_pickle('../assets/pickled_samples/madelon_valid_relevantfeat_df.pkl')

##### Cook Madelon

In [26]:
# con = pg2.connect(host='34.211.227.227', dbname='postgres', user='postgres')
# cur = con.cursor(cursor_factory=RealDictCursor)

# cur.execute('SELECT feat_257, feat_269, feat_308, feat_315, feat_336, feat_341, feat_395, feat_504, feat_526, feat_639, feat_681, feat_701, feat_724, feat_736, feat_769, feat_808, feat_829, feat_867, feat_920, feat_956, target FROM madelon;')
# results1 = cur.fetchall()
# con.close()

Since Josh Cook closed the connection to his DB server, I will only build classification pipelines for the UCI Madelon dataset.

### Train/Test/Split the Data
- ***UCI Madelon*** is already provided on the website as a separate train set and validation set (the `madelon_valid` files). Both were loaded above with only the 20 relevant features, as described under the **Load the Data** section of this notebook. The cell below additionally splits the train set into train and test subsets with `train_test_split`, leaving the provided validation set untouched.

##### UCI Madelon

In [27]:
madelonXtrain, madelonXtest, madelonytrain, madelonytest = train_test_split(
    madelon_rel_feat_train_only,
    madelon_rel_feat_train_target,
)
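
The split above uses the default settings, so it differs from run to run. A reproducible, class-balanced variant might look like the sketch below. The demo data, `test_size=0.25`, and `random_state=42` are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical demo data: 200 samples, 20 integer features, balanced -1/1 labels.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(400, 600, size=(200, 20)))
y_demo = pd.Series(np.repeat([-1, 1], 100), name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,    # assumed hold-out fraction
    stratify=y_demo,   # keep the class ratio the same in both subsets
    random_state=42,   # fix the shuffle so results can be reproduced
)
print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```

Fixing `random_state` makes scores comparable across reruns, which matters when they are later compared against the benchmark scores.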

### Preprocessing - Scale and Deskew

Preprocessing would include:
- deskewing
- scaling/normalizing
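
The two steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own preprocessing: the demo arrays are synthetic, and `SKEW_THRESHOLD` is an arbitrary cutoff chosen for the example.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data standing in for the train and test feature matrices.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=480, scale=40, size=(300, 5))
X_test_demo = rng.normal(loc=480, scale=40, size=(100, 5))

# Step 1: measure skew per feature and flag features that exceed a cutoff.
skew_per_feature = stats.skew(X_train_demo, axis=0)
SKEW_THRESHOLD = 0.5  # assumed cutoff, for illustration only
needs_deskew = np.abs(skew_per_feature) > SKEW_THRESHOLD

# Step 2: fit the scaler on the train data only, then reuse it on the test data
# so that no information from the test set leaks into the preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(needs_deskew)
print(X_train_scaled.mean(axis=0).round(6))
```

`MinMaxScaler`, which is also imported at the top of this notebook, follows the same `fit_transform` / `transform` pattern.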

##### UCI Madelon

In [28]:
# Examine the skew of each relevant feature in the training split

uci_relfeat_orig_skew = [
    (feature, stats.skew(madelonXtrain[feature]))
    for feature in madelonXtrain.columns
]

In [29]:
uci_relfeat_orig_skew

[(28, 0.0424632300211934),
 (48, 0.053478345191775065),
 (64, -0.015466362557005892),
 (105, -0.04217710733288818),
 (128, -0.09496308333809365),
 (153, 0.08346008545882982),
 (241, -0.02641981784608411),
 (281, 0.04571709812600067),
 (318, 0.04969107205402998),
 (336, 0.0033761939608515984),
 (338, 0.0788975814476203),
 (378, 0.04737601971678685),
 (433, 0.06484703085187964),
 (442, -0.05404389532881141),
 (451, 0.05944609899334041),
 (453, -0.14686912484464604),
 (455, -0.002508127318888884),
 (472, -0.04839922386742785),
 (475, -0.017320003922152508),
 (493, -0.09250021996045521)]

All 20 skew values fall within about ±0.15, so these features are close to symmetric and deskewing is unlikely to change them much.

### Classify and Score the Data Using the Same 4 Classification Models in Benchmarking (Notebook 1)
- Calculate test scores (accuracy)
- Run classification reports
- Run confusion matrices
- Calculate log loss in each case
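
The four evaluation steps above could be computed along the lines of the sketch below. The four benchmark models are not named in this section, so `LogisticRegression` is used purely as a placeholder, and the data comes from `make_classification` rather than from the Madelon files.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    log_loss,
)
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with 20 features (not the Madelon data itself).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=15, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Placeholder model; substitute each benchmark classifier in turn.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_proba = model.predict_proba(X_te)

accuracy = accuracy_score(y_te, y_pred)          # test score
report = classification_report(y_te, y_pred)     # precision, recall, f1 per class
matrix = confusion_matrix(y_te, y_pred)          # rows: true class, cols: predicted
loss = log_loss(y_te, y_proba, labels=model.classes_)  # needs probabilities

print(f"accuracy: {accuracy:.3f}")
print(report)
print(matrix)
print(f"log loss: {loss:.3f}")
```

Note that log loss requires predicted probabilities, so any classifier used here needs a `predict_proba` method or an equivalent calibrated output.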