Notebook 1

Think about the goals of Project 3 in terms of supervised and unsupervised learning. Use this to start your report!

### Problem Statement

Your challenge here is to develop a series of models for two purposes:

1. identifying relevant features.
2. generating predictions from the model.

### Solution Statement

Your final product will consist of:

1. A prepared report
2. A series of Jupyter notebooks to be used to control your pipelines

### Tasks

#### Data Manipulation

You should do substantive work on at least six subsets of the data. 

- 3 sets of 10% of the data from the UCI Madelon set.
- 3 sets of 10% of the data from the Larger Madelon set.
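
One way to draw the three 10% subsets is pandas' `DataFrame.sample`, with a different seed per draw. The sketch below uses a synthetic frame as a stand-in for the real data. The helper name `draw_subsets` and the seeds are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd


def draw_subsets(df, n_subsets=3, frac=0.10, seeds=(0, 1, 2)):
    """Return n_subsets independent random samples, each frac of df's rows."""
    return [df.sample(frac=frac, random_state=seed) for seed in seeds[:n_subsets]]


# Synthetic stand-in with the UCI training shape (2000 rows x 500 features).
rng = np.random.RandomState(42)
demo = pd.DataFrame(rng.randint(400, 600, size=(2000, 500)))

subsets = draw_subsets(demo)
print([s.shape for s in subsets])
```

Each subset is sampled independently, so the three draws may share some rows. If disjoint subsets are needed, shuffling once and slicing would be an alternative.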

##### Prepared Report

Your report should:
1. present results from Step 1: Benchmarking
2. present results from Step 2: Identify Salient Features
3. present results from Step 3: Feature Importances
4. present results from Step 4: Build Model

##### Jupyter Notebook, EDA 

- perform EDA on each set as you see necessary

In [2]:
cd ..

/Users/johnphillips/Desktop/DSI-Class-Stuff/Project03_on_AWS/Project_03_on_AWS


In [3]:
# Standard Imports

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from scipy import stats
%matplotlib inline

# Reference Feature Selection 4.3 & 4.4 Lesson

In [4]:
# UCI Madelon Data Links
train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data'
train_data = pd.read_csv(train_url, delimiter=' ', header=None)

train_label_url ='https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels'
train_label_data = pd.read_csv(train_label_url, delimiter=' ', header=None)

test_url ='https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_test.data'
test_data = pd.read_csv(test_url, delimiter=' ', header=None)

validate_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_valid.data'


In [5]:
train_data.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,491,492,493,494,495,496,497,498,499,500
0,485,477,537,479,452,471,491,476,475,473,...,481,477,485,511,485,481,479,475,496,
1,483,458,460,487,587,475,526,479,485,469,...,478,487,338,513,486,483,492,510,517,
2,487,542,499,468,448,471,442,478,480,477,...,481,492,650,506,501,480,489,499,498,


In [6]:
test_data.tail(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,491,492,493,494,495,496,497,498,499,500
1797,479,499,538,462,481,476,472,474,465,475,...,480,492,458,503,590,472,474,482,446,
1798,481,499,493,490,495,482,478,478,469,473,...,483,509,571,523,514,478,491,534,494,
1799,481,439,543,488,476,470,474,475,467,464,...,488,463,607,523,495,486,467,502,479,


In [7]:
train_data[500].head(5)

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: 500, dtype: float64

In [8]:
test_data[500].head(5)

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: 500, dtype: float64

In [9]:
train_data.drop(500, axis=1, inplace=True)
test_data.drop(500, axis=1, inplace=True)
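
The all-NaN column 500 most likely comes from a trailing space at the end of each line in the UCI files, which `delimiter=' '` parses as one more empty field. A minimal sketch of an alternative, assuming that file layout: splitting on runs of whitespace with `sep=r'\s+'` avoids the extra column, so no `drop` is needed. The two-row string below is a made-up stand-in for the real file.

```python
import io

import pandas as pd

# Made-up sample mimicking the assumed layout: space-separated, trailing space.
raw = "485 477 537 \n483 458 460 \n"

# A single-space delimiter treats the trailing space as one more (empty) field.
with_space = pd.read_csv(io.StringIO(raw), delimiter=' ', header=None)

# A whitespace-run separator ignores the trailing space.
with_regex = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None)

print(with_space.shape)
print(with_regex.shape)
```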

In [10]:
train_label_data.head(4)

Unnamed: 0,0
0,-1
1,-1
2,-1
3,1


In [11]:
train_data['Label'] = train_label_data

In [12]:
train_data.head(4)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,491,492,493,494,495,496,497,498,499,Label
0,485,477,537,479,452,471,491,476,475,473,...,481,477,485,511,485,481,479,475,496,-1
1,483,458,460,487,587,475,526,479,485,469,...,478,487,338,513,486,483,492,510,517,-1
2,487,542,499,468,448,471,442,478,480,477,...,481,492,650,506,501,480,489,499,498,-1
3,480,491,510,485,495,472,417,474,502,476,...,480,474,572,454,469,475,482,494,461,1


### Basic Benchmarking:

In [13]:
# Check for nulls and counts:
train_data['Label'].isnull().value_counts()

False    2000
Name: Label, dtype: int64

In [14]:
# Find count of each value:
train_data['Label'].value_counts()

 1    1000
-1    1000
Name: Label, dtype: int64

In [15]:
print(1000.0 / 2000)  # Fraction of labels equal to 1
print(1000.0 / 2000)  # Fraction of labels equal to -1

0.5
0.5
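
The same benchmark can be read directly from the labels rather than typed in by hand. A small sketch, using a stand-in label column with the 1000/1000 split counted above: `value_counts(normalize=True)` gives the class fractions, and the largest one is the accuracy of always predicting the majority class.

```python
import pandas as pd

# Stand-in for train_data['Label']: 1000 labels of each class, as counted above.
labels = pd.Series([1] * 1000 + [-1] * 1000, name='Label')

class_fractions = labels.value_counts(normalize=True)
baseline_accuracy = class_fractions.max()

print(class_fractions)
print(baseline_accuracy)
```

With perfectly balanced classes, any model has to beat 0.5 accuracy to be better than guessing.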


In [16]:
# Now Pickle Time
train_data.to_pickle('data/train_data.p')
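
`to_pickle` does not create missing directories, so the `data/` folder has to exist first. A short sketch, using a temporary directory and a toy frame as stand-ins, that creates the folder, writes the pickle, and reads it back with `pd.read_pickle` to confirm the round trip.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for train_data: two feature columns plus the label.
toy = pd.DataFrame({0: [485, 483], 1: [477, 458], 'Label': [-1, -1]})

# Temporary directory used here so the example leaves no files behind.
with tempfile.TemporaryDirectory() as workdir:
    data_dir = os.path.join(workdir, 'data')
    os.makedirs(data_dir, exist_ok=True)  # to_pickle will not create it

    path = os.path.join(data_dir, 'train_data.p')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```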

Move to Notebook 01 from here...

I will go ahead and import the big data set now to use later.
My goal is to first import a data set with all the features,
narrow down the features, and then import a data set
with only the important features but many more rows.
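
The second step of that plan, pulling many rows but only a few columns, could look like the sketch below. It assumes the table has columns named `_id`, `feat_000` through `feat_999`, and `target`. The helper `build_feature_query` and the feature indices passed to it are hypothetical placeholders, since the important features have not been identified yet.

```python
def build_feature_query(feature_indices, limit, table='madelon'):
    """Build a SELECT that fetches only the chosen feature columns plus the target."""
    columns = ['_id'] + ['feat_{:03d}'.format(int(i)) for i in feature_indices] + ['target']
    return 'SELECT {} FROM {} LIMIT {};'.format(', '.join(columns), table, int(limit))


# Placeholder indices for illustration only; not the actual salient features.
query = build_feature_query([5, 42, 257], limit=20000)
print(query)
```

The resulting string can be passed to `curs.execute` in place of `SELECT *`, which keeps the transfer small even when the row limit is large.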

In [1]:
!conda install psycopg2 --yes


In [17]:
# Imports to connect to madelon data:

import psycopg2 as pg2
from psycopg2.extras import RealDictCursor
import pandas as pd

In [None]:
# Connect to the larger Madelon data set.

# Be careful when changing LIMIT: the full table has 200000 rows!

# Plan: run on a t2.medium instance (if I can get it to run)
# and then raise LIMIT to 6500 or more later.

connection = pg2.connect(host='34.211.227.227',
                         dbname='postgres',
                         user='postgres')
try:
    curs = connection.cursor(cursor_factory=RealDictCursor)
    curs.execute('SELECT * FROM madelon LIMIT 2300;')  # Adjust LIMIT to control how many rows are fetched
    results = curs.fetchall()
finally:
    connection.close()  # Always close the connection, even if the query fails

In [None]:
# Create a DataFrame from results
huge = pd.DataFrame(results)

In [None]:
huge.shape

In [16]:
huge.head(3)

Unnamed: 0,_id,feat_000,feat_001,feat_002,feat_003,feat_004,feat_005,feat_006,feat_007,feat_008,...,feat_991,feat_992,feat_993,feat_994,feat_995,feat_996,feat_997,feat_998,feat_999,target
0,81264,-0.619314,-0.980879,0.260013,0.109861,-1.09166,-2.345588,0.727887,0.189447,-0.400514,...,0.524966,1.865985,0.47681,-0.562234,0.295281,-0.128997,0.679676,0.085488,-0.375616,0
1,81265,-0.254716,-0.507283,0.586206,0.522276,0.689763,0.083975,1.165854,-0.269793,0.509566,...,-1.476176,0.742824,-0.388359,-0.536324,1.268221,0.015912,-1.016712,0.072405,1.152787,0
2,81266,-1.730033,-0.039938,-0.199574,0.592114,-0.016629,-0.352547,-1.269467,-0.962625,0.709714,...,-0.833891,1.954665,1.247889,-1.58503,0.694697,-1.9081,0.09307,-2.160079,-1.860555,1


In [18]:
# Now Pickle Time
huge.to_pickle('data/huge.p')