### Week 19 In Class Homework

#### Take one of the supervised learning models you have built recently and apply at least three dimensionality reduction techniques to it (separately). Be sure to create a short summary of each technique you use. Indicate how each changed the model performance. Reference: https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/

I used the multiple logistic regression model on the diabetes data set from homework 14.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#creating the dataframe

diabetes_df = pd.read_csv("../week_13/diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
diabetes_df.shape

In [None]:
diabetes_df.describe()

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [6]:
# Defining the feature matrix and the target.

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

In [7]:
# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 14, stratify = y)


In [8]:
# Standardizing

scale = StandardScaler()

# Fit the scaler on the training set only, then reuse the same
# fitted scaler on the test set so both are scaled identically.
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)

In [9]:
#Train the model on features training matrix

logit = LogisticRegression(random_state = 0).fit(X_train, y_train)
y_pred = logit.predict(X_test)

In [10]:
#Simplest logistic regression approach
#clf = LogisticRegression(random_state=0).fit(X_train, y_train)
#sample_pred=[[6,148,72,35,0,33.6,0.627,50]]
#y_predicted = clf.predict(X_test)

In [11]:
logit.score(X_test,y_test)

0.7552083333333334

In [12]:
logit.score(X_train, y_train)

0.7777777777777778

Additional performance metrics are summarized with the classification report:

In [14]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.87      0.82       125
           1       0.69      0.54      0.61        67

    accuracy                           0.76       192
   macro avg       0.74      0.70      0.71       192
weighted avg       0.75      0.76      0.75       192



### Dimensionality Reduction

The goal of dimensionality reduction is to simplify the model by reducing the number of explanatory variables to a smaller, more informative set. A simpler model is less prone to overfitting, which can (hopefully) improve performance on unseen data.

Different techniques perform the same fundamental task of "engineering" a new set of features from the original feature matrix.

It is important to apply the same fitted dimensionality reduction transform to ALL data sets (training, testing, validation, and novel) that interact with a specific model.
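
A minimal sketch of that rule, using synthetic data rather than the diabetes set: the scaler and PCA are fitted on the training split only, and the fitted objects are then reused on the test split. The array shapes, the variable names, and `n_components=3` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 200 samples, 6 features (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=0)

# Fit the transforms on the training split only.
scaler = StandardScaler().fit(X_tr)
pca = PCA(n_components=3).fit(scaler.transform(X_tr))

# Reuse the fitted transforms on every other split (test, validation, new data).
X_tr_reduced = pca.transform(scaler.transform(X_tr))
X_te_reduced = pca.transform(scaler.transform(X_te))

print(X_tr_reduced.shape, X_te_reduced.shape)
```

Wrapping the same steps in a scikit-learn `Pipeline` gives this behavior automatically inside cross-validation, because the pipeline is refitted on the training part of each fold.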

I followed the outline in the provided reference. I applied one projection method (PCA) and one technique rooted in manifolds (LLE), which basically helps work with non-linear behavior in the data. I haven't thought about manifolds in a very long time and didn't have time to read up on them. I also didn't have time to do a third technique.

I used Brownlee's evaluation scheme on my original model for a baseline set of metrics.

In [15]:
#Used for stratified K-fold cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

#Basic statistics reported across all folds.
from numpy import mean
from numpy import std


In [16]:
#model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(logit, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# print metrics
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.773 (0.041)


The classification accuracy given here is about 77%, close to the 76% test accuracy in the classification report.

I think I may have done something wrong in the implementation of my model, but I didn't have time to fix it.

#### Dimensionality reduction algorithm: PCA

I started with the classic, Principal Component Analysis (PCA). In a nutshell, PCA builds linear combinations of the features in the training matrix to yield a smaller, orthogonal (and thus uncorrelated) set of features that capture most of the variance in the dataset. PCA is great for dense matrices (few zeros).

Standardization is a necessary first step before applying PCA. The scaled `X_train` and `X_test` arrays from earlier are not what `cross_val_score` sees, since it is given the raw `X`, so the scaling has to happen inside the pipeline.
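
To make the orthogonality and variance claims concrete, here is a minimal sketch on synthetic data (not the diabetes set). Two of the three columns are deliberately built to be strongly correlated; the names `X_demo`, `pca_demo`, and `Z` are made up for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with two strongly correlated columns (illustrative only).
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(300, 1)),
    base + 0.1 * rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
])

X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_scaled)
Z = pca_demo.transform(X_scaled)

# Fraction of the total variance captured by each retained component.
print(pca_demo.explained_variance_ratio_)

# The component directions are orthonormal ...
gram = pca_demo.components_ @ pca_demo.components_.T
# ... and the projected features are uncorrelated with each other.
corr = np.corrcoef(Z, rowvar=False)
print(np.round(gram, 6))
print(np.round(corr, 6))
```

Because two columns carry nearly the same information, two components should be enough to retain almost all of the variance in this toy example.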

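In [26]:
# I used Brownlee's method of defining a pipeline to combine
# the scaler, the PCA transform, and the model into a single unit.
# This makes evaluation easier and refits the transforms in every fold.

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# n_components must not exceed the number of columns in X (fewer than 10 here);
# 5 is an arbitrary choice, not a tuned value.
steps_pca = [('scale', StandardScaler()), ('pca', PCA(n_components=5)), ('m', LogisticRegression())]
model_pca = Pipeline(steps=steps_pca)


In [27]:
#model evaluation
cv_PCA = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores_PCA = cross_val_score(model_pca, X, y, scoring='accuracy', cv=cv_PCA, n_jobs=-1)

In [28]:
# print metrics
print('Accuracy (PCA): %.3f (%.3f)' % (mean(n_scores_PCA), std(n_scores_PCA)))

Accuracy (PCA): 0.773 (0.041)


Hm, the recorded result is identical to the baseline, and that is a bug rather than a finding. When these cells were run, the pipeline was built from `steps` rather than `steps_pca`, `n_components=10` was larger than the number of features in `X`, and the print statement reported `n_scores` (the baseline scores) rather than `n_scores_PCA`. The 0.773 above is therefore the baseline accuracy again. The cells above correct those three problems; the recorded number predates the correction, so it says nothing yet about how PCA changes model performance. Brownlee's PCA results also matched his baseline, so a small effect would not be surprising, but that cannot be concluded from this output.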

#### Dimensionality reduction algorithm: LLE

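Locally Linear Embedding (LLE) is a manifold-learning technique: it describes each point as a weighted combination of its nearest neighbors and then looks for a lower-dimensional embedding that preserves those local weights.

In [29]:
from sklearn.manifold import LocallyLinearEmbedding

# define the pipeline
# n_components must not exceed the number of columns in X (fewer than 10 here);
# 5 is an arbitrary choice, not a tuned value.
steps_lle = [('scale', StandardScaler()), ('lle', LocallyLinearEmbedding(n_components=5)), ('m', LogisticRegression())]
model_lle = Pipeline(steps=steps_lle)

# evaluate model
cv_lle = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores_lle = cross_val_score(model_lle, X, y, scoring='accuracy', cv=cv_lle, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores_lle), std(n_scores_lle)))

Accuracy: 0.773 (0.041)


Okay, the recorded output is once more identical to the baseline, which means something went wrong. When this cell was run, the pipeline was built from `steps` rather than `steps_lle`, `n_components=10` exceeded the number of features in `X`, and the print statement reported `n_scores` (the baseline scores) rather than `n_scores_lle`. The cell above corrects those problems, but the recorded 0.773 predates the correction, so it does not show how LLE changes model performance. This was the best I could do in the time I had.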
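
As a small illustration of what LLE is designed for, separate from the diabetes model, the sketch below applies it to scikit-learn's synthetic "swiss roll": 3-D points that lie on a rolled-up 2-D sheet. The sample size, `n_neighbors=12`, and the variable names are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3-D points lying on a rolled-up 2-D sheet (a standard toy manifold).
X_roll, _ = make_swiss_roll(n_samples=500, random_state=0)

# n_neighbors and n_components are illustrative choices, not tuned values.
lle_demo = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_unrolled = lle_demo.fit_transform(X_roll)

print(X_roll.shape, '->', X_unrolled.shape)
print('reconstruction error:', lle_demo.reconstruction_error_)
```

A linear projection such as PCA would flatten the roll onto itself, whereas LLE only tries to preserve each point's relationship to its nearest neighbors, which is what lets it follow the curved sheet.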

#### 2) Write a function that will indicate if an inputted IPv4 address is accurate or not. IP addresses are valid if they have 4 values between 0 and 255 (inclusive), punctuated by periods.

#### Input 1:  2.33.245.5
#### Output 1:  True

#### Input 2:  12.345.67.89
#### Output 2:  False

A valid IP address, as defined in this problem, has two fundamental features: it contains exactly three periods, and the value on either side of each period is between 0 and 255, inclusive. Here I treat it as a period-delimited string, so the input must be passed in quotes.

One approach validates a candidate IP address in two steps:

1) Check to see if the string has exactly three periods. If not, return False. 

2) If so, check each "number" to see if it falls within the allowable range. If not, return False, else return True.


In [20]:
def IPvalid(address):
    
    #exactly three periods?
    
    if address.count('.') != 3:
        
        return False
    
    # if true, chunk the address into substrings according
    # to the delimiter, and store them in a list
    
    nums = address.split('.')
    
    # check if each 'number' is not out of bounds
    
    for n in nums:
        
        try:
            # evaluating the integer version of list elements
            value = int(n)
        except ValueError:
            # empty or non-numeric chunk (e.g. from a typo)
            return False
        
        if value < 0 or value > 255:
            return False
       
    return True

In [21]:
# Testing cases

#Input 1:   
IPvalid('2.33.245.5')


True

In [22]:
#Input 2:
IPvalid('12.345.67.89')

False

In [23]:
#Input 3:
IPvalid('2.33.2.45.5')

False

In [24]:
#Input 4:

# NOTE: special case (e.g. a typo). The empty chunks make int()
# raise ValueError, which the try/except in the function
# turns into False.

IPvalid('..30.40')

False
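
A stricter sketch of the same check is shown below. It tests that each field consists only of ASCII digits before converting it, so malformed input such as `'..30.40'` returns False instead of raising, and strings that `int()` would quietly accept (for example `' 5'` or `'+5'`) are rejected too. The function name `ipv4_is_valid_strict` is hypothetical and not part of the assignment.

```python
def ipv4_is_valid_strict(address):
    """Return True if address is four dot-separated decimal fields in 0..255.

    Hypothetical helper for illustration only.
    """
    # Anything that is not a string cannot be a dotted address.
    if not isinstance(address, str):
        return False

    fields = address.split('.')
    if len(fields) != 4:
        return False

    for field in fields:
        # Reject empty fields, signs, spaces, and non-ASCII digits
        # before converting, so int() is never given bad input.
        if not (field.isascii() and field.isdigit()):
            return False
        if int(field) > 255:
            return False

    return True


print(ipv4_is_valid_strict('2.33.245.5'))    # True
print(ipv4_is_valid_strict('12.345.67.89'))  # False
print(ipv4_is_valid_strict('..30.40'))       # False
```

Fields with leading zeros such as `'01'` are still accepted here; whether those should count as valid is not specified in the problem statement.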