In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap as ISO
from numpy.random import seed

import warnings
warnings.filterwarnings("ignore")

heart_df = pd.read_csv("df.csv")

X = heart_df.drop('class', axis=1).values
y = heart_df['class'].values

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

##### 1. Take one of the supervised learning models you have built recently and apply at least three dimensionality reduction techniques to it (separately). Be sure to create a short summary of each technique you use. Indicate how each changed the model performance. 

##### Reference: https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/


### Singular Value Decomposition, or SVD - 

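This method factors the input matrix into the product of three matrices.  Two of them contain orthonormal eigenvectors of the matrix times its transpose (in either order), and the third is a diagonal matrix of singular values, which are the square roots of the eigenvalues of the same products.  The decomposition also yields the pseudo-inverse, which is used to minimize least squared error.  For dimensionality reduction, only the largest singular values and their vectors are kept, which gives the lower-rank approximation with the smallest squared reconstruction error.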
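
As a small illustration of the factorization described above, the sketch below decomposes a toy matrix with NumPy and rebuilds it from its two largest singular values. The matrix `A` and the rank `k` are made-up values for demonstration and are not taken from the dataset.

```python
import numpy as np

# Toy 4x3 matrix (illustrative values only)
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0],
              [2.0, 0.0, 4.0],
              [1.0, 1.0, 1.0]])

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The singular values are the square roots of the eigenvalues of A.T @ A
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Keep only the k largest singular values for a rank-k approximation
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Projecting onto the top-k right singular vectors gives the reduced features
A_reduced = A @ Vt[:k, :].T  # shape (4, 2)

print("Singular values:", s)
print("Rank-2 reconstruction error:", np.linalg.norm(A - A_k))
```

The reduced matrix `A_reduced` plays the same role as the output of `fit_transform` in the cell below, only computed by hand on a tiny example.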

For the abalone dataset, we looped over candidate component counts (1 through 135), fit a logistic regression on each reduced training set, and kept the count with the highest test accuracy. The best result used 52 components, with an accuracy score of 79.4%.

In [2]:
svd = TruncatedSVD(n_components=1)
X_train_svd=svd.fit_transform(X_train)
X_test_svd=svd.transform(X_test)
logr=LogisticRegression(random_state=42)
model_1=logr.fit(X_train_svd, y_train)
best = model_1.score(X_test_svd, y_test)

num_comp = 1

# Try 2 through 135 components and keep the count with the highest test accuracy
for i in range(2, 136):
    svd = TruncatedSVD(n_components=i)
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)
    model_1 = logr.fit(X_train_svd, y_train)
    score = model_1.score(X_test_svd, y_test)  # score once per candidate
    if score > best:
        best = score
        num_comp = i

print('Optimal Number of Components: ', num_comp) 
print('Accuracy Score: ', best)

Optimal Number of Components:  52
Accuracy Score:  0.7941176470588235


### Principal Component Analysis, or PCA - 

This method of dimensionality reduction projects the original input matrix onto a smaller set of principal components.  Each principal component is a linear combination of the original features, so collinear features end up merged into shared components.  Like SVD, PCA relies on eigenvectors and eigenvalues (here, of the covariance matrix of the mean-centered data): the eigenvectors give the component directions and the eigenvalues give the variance along each one.  Components are ranked by variance, and those that carry very little new information are dropped, so the retained components preserve as much of the variance in the data as possible while reducing the size of the dataset.
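
To make the link between PCA and SVD concrete, here is a minimal sketch that computes principal components by hand: center the data, take its SVD, and read off the share of variance each component explains. The data are synthetic (`X_demo` is a hypothetical array built so that its third feature is almost a linear combination of the first two), so the numbers say nothing about the dataset used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, where the third feature is nearly
# a linear combination of the first two (highly collinear)
base = rng.normal(size=(200, 2))
third = base @ np.array([0.5, -1.0]) + 0.01 * rng.normal(size=200)
X_demo = np.column_stack([base, third])

# Step 1: center each feature
X_centered = X_demo - X_demo.mean(axis=0)

# Step 2: SVD of the centered data; the rows of Vt are the principal axes
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each component, as a fraction of the total
explained_variance = s ** 2 / (X_demo.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Keep the first two components; the third carries almost no variance
X_reduced = X_centered @ Vt[:2].T

print("Explained variance ratio:", explained_ratio)
```

Because the third feature adds almost nothing new, the last component explains a negligible share of the variance and can be dropped with little loss.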

Again, we searched over the number of components, this time for PCA. The best result used 51 components, and the resulting accuracy score was higher than that of SVD, at 82.4%.

In [3]:
# Principal Component Analysis

pca = PCA(n_components=1)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

model_2=logr.fit(X_train_pca, y_train)
best = model_2.score(X_test_pca, y_test)

num_comp = 1

# Try 2 through 135 components and keep the count with the highest test accuracy
for i in range(2, 136):
    pca = PCA(n_components=i)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    model_2 = logr.fit(X_train_pca, y_train)
    score = model_2.score(X_test_pca, y_test)  # score once per candidate
    if score > best:
        best = score
        num_comp = i

print('Optimal Number of Components: ', num_comp) 
print('Accuracy Score: ', best)

Optimal Number of Components:  51
Accuracy Score:  0.8235294117647058


### Isomap Embedding, or Isomap - 

Unlike the first two methods, Isomap Embedding takes a non-linear approach to dimensionality reduction.  The method begins with a k-nearest neighbors search and constructs a "neighborhood graph" by connecting each point to its nearest neighbors.  The geodesic distance between each pair of points (NOT the straight-line Euclidean distance used by linear approaches) is then approximated as the length of the shortest path between them through the neighborhood graph.  When the data lie on a curved surface, this describes the relationships between points more faithfully than Euclidean distance does.  Finally, the points are embedded in a lower-dimensional space that preserves these geodesic distances as closely as possible, so the non-linear structure of the data is retained.
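
The difference between geodesic and Euclidean distance can be seen on a toy example. The sketch below uses only the standard library, places illustrative points along a half circle, builds a nearest-neighbor graph, and finds shortest paths through it. It covers only the graph and shortest-path steps; the final embedding step is omitted, and the point layout and the value of `k` are assumptions chosen for the demonstration.

```python
import math

# Illustrative points spaced evenly along a half circle of radius 1
n_points = 21
points = [(math.cos(math.pi * i / (n_points - 1)),
           math.sin(math.pi * i / (n_points - 1))) for i in range(n_points)]

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Step 1: connect each point to its k nearest neighbors (undirected graph)
k = 2
INF = float("inf")
graph = [[INF] * n_points for _ in range(n_points)]
for i in range(n_points):
    graph[i][i] = 0.0
    nearest = sorted((j for j in range(n_points) if j != i),
                     key=lambda j: euclidean(points[i], points[j]))[:k]
    for j in nearest:
        d = euclidean(points[i], points[j])
        graph[i][j] = d
        graph[j][i] = d

# Step 2: geodesic distance = shortest path through the graph (Floyd-Warshall)
for m in range(n_points):
    for i in range(n_points):
        for j in range(n_points):
            if graph[i][m] + graph[m][j] < graph[i][j]:
                graph[i][j] = graph[i][m] + graph[m][j]

straight_line = euclidean(points[0], points[-1])  # about 2.0 (the diameter)
along_curve = graph[0][n_points - 1]              # slightly under pi (the arc)

print("Euclidean distance between the end points:", straight_line)
print("Geodesic distance between the end points: ", along_curve)
```

The two end points look close in a straight line, but they are far apart when the path has to follow the curve, which is the relationship Isomap tries to preserve.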

We ran the same search one last time for Isomap. The best result used 23 components, and its accuracy score was the lowest of the three methods at 73.5%.

In [11]:
# Isomap Embedding

iso = ISO(n_components=1)
X_train_iso = iso.fit_transform(X_train)
X_test_iso = iso.transform(X_test)

model_3 = logr.fit(X_train_iso, y_train)
best = model_3.score(X_test_iso, y_test)

num_comp = 1

# Try 2 through 135 components and keep the count with the highest test accuracy
for i in range(2, 136):
    iso = ISO(n_components=i)
    X_train_iso = iso.fit_transform(X_train)
    X_test_iso = iso.transform(X_test)
    model_3 = logr.fit(X_train_iso, y_train)
    score = model_3.score(X_test_iso, y_test)  # score once per candidate
    if score > best:
        best = score
        num_comp = i

print('Optimal Number of Components: ', num_comp) 
print('Accuracy Score: ', best)

Optimal Number of Components:  23
Accuracy Score:  0.7352941176470589
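
The three searches above pick the number of components by scoring on the test split, which tends to make the reported accuracy optimistic. One possible alternative, sketched here under stated assumptions, is to choose the component count by cross-validation on the training data and touch the test split only once at the end. The data come from `make_classification` rather than from `df.csv`, the names `X_demo`, `y_demo`, `pipe`, and `grid` are hypothetical, PCA stands in for any of the three reducers, and `max_iter=1000` and the candidate range 1 to 10 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (NOT the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Chain the reducer and the classifier so both are refit inside every fold
pipe = Pipeline([
    ("reduce", PCA()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Candidate component counts are compared by 5-fold cross-validation
grid = GridSearchCV(pipe, {"reduce__n_components": list(range(1, 11))}, cv=5)
grid.fit(X_tr, y_tr)

best_k = grid.best_params_["reduce__n_components"]
test_acc = grid.score(X_te, y_te)  # held-out accuracy, computed once

print("Selected number of components:", best_k)
print("Held-out accuracy:", test_acc)
```

Swapping `PCA()` for `TruncatedSVD()` or `Isomap()` in the `reduce` step would give the same kind of search for the other two methods.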


##### 2. Write a function that will indicate if an inputted IPv4 address is valid or not.  IP addresses are valid if they have 4 values between 0 and 255 (inclusive), separated by periods.

#### Input 1:
#### 2.33.245.5
#### Output 1:
#### True


#### Input 2:
#### 12.345.67.89
#### Output 2:
#### False

In [5]:
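def IPA(address):
    """Return True if address is four values from 0 to 255 separated by periods."""
    try:
        numbers = address.split('.')
    except (AttributeError, TypeError):
        return False  # input is not a string
    if len(numbers) != 4:
        return False
    for number in numbers:
        # Require plain digits: int() alone would also accept '+5', ' 5' or '1_0'
        if not (number.isascii() and number.isdigit()):
            return False
        if int(number) > 255:
            return False
    return True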

In [6]:
IPA('2.33.245.5')

True

In [7]:
IPA('12.345.67.89')

False

In [8]:
IPA('-32.53.208.33')

False

In [10]:
IPA('32a.53.208.33')

False
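
As a cross-check on the hand-written validator, the standard library's `ipaddress` module can answer the same question. The wrapper below is a minimal sketch; `is_valid_ipv4` is a hypothetical helper name, and the `isinstance` guard is there because `ipaddress.IPv4Address` also accepts integers and packed bytes, which are outside what this exercise asks for.

```python
import ipaddress

def is_valid_ipv4(address):
    """Hypothetical helper: True if the standard library parses the string as IPv4."""
    if not isinstance(address, str):
        return False  # IPv4Address would otherwise also accept ints and bytes
    try:
        ipaddress.IPv4Address(address)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False

# Same inputs as the cells above
for candidate in ['2.33.245.5', '12.345.67.89', '-32.53.208.33', '32a.53.208.33']:
    print(candidate, is_valid_ipv4(candidate))
```

The two approaches may differ on edge cases: depending on the Python version, `ipaddress` can reject octets with leading zeros (such as `01.2.3.4`), which the stated rules do not mention.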