# Question 1: 
Take one of the supervised learning models you have built recently and apply at least three dimensionality reduction techniques to it (separately). Be sure to create a short summary of each technique you use. Indicate how each changed the model performance. 
Reference:
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/

# 1. Truncated SVD 

In [31]:
import csv

#import the data file and write out each row into a csv file
#newline="" stops the csv module from writing blank lines between rows on Windows
with open("../WeeklyHomework/abalone.data") as infile, open("abalone.csv", "w", newline="") as outfile:
    csv_writer = csv.writer(outfile)
    #write the header row, since the raw data file has no column names
    csv_writer.writerow(['Sex', 'Length', 'Diameter', 'Height', 'Whole Weight', 'Shucked Weight', 'Viscera Weight', 'Shell Weight', 'Rings'])
    for line in infile:
        row = [field.strip() for field in line.split(',')]
        csv_writer.writerow(row)

In [32]:
import pandas as pd
import numpy as np

#load the abalone dataset from csv file and save as a pandas dataframe
abalone_df = pd.read_csv('./abalone.csv')
abalone_df.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [33]:
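#Remove outliers from the dataset using the three-standard-deviation rule
# calculate summary statistics for the target column
data_mean, data_std = np.mean(abalone_df['Rings']), np.std(abalone_df['Rings'])
# define the cut-off and the lower/upper bounds beyond which a value is an outlier
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off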

In [34]:
# identify outliers
outliers = [x for x in abalone_df['Rings'] if x < lower or x > upper]

In [35]:
# remove outliers
abalone_df = abalone_df[(abalone_df['Rings'] > lower) & (abalone_df['Rings'] < upper)]
abalone_df['Rings'].describe()

count    4115.000000
mean        9.758931
std         2.904193
min         1.000000
25%         8.000000
50%         9.000000
75%        11.000000
max        19.000000
Name: Rings, dtype: float64

In [36]:
#save the predictor variables into the dataframe X
X = abalone_df.drop('Rings', axis=1)
#save the target (dependent) variable y
y = abalone_df['Rings']

In [37]:
#One-hot encode only the 'Sex' column (index 0) to turn it from a categorical column into numerical columns.
#Drop the first category, since it is redundant once the other two are known.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([("Sex", OneHotEncoder(drop='first'), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)

In [38]:
from sklearn.preprocessing import StandardScaler

#Standardize the dataset
sc = StandardScaler()
X_res= sc.fit_transform(X)

In [39]:
#Rename X and y to predictors and target for convention
#(note: this uses the unscaled X, not the standardized X_res)
predictors=X
target=y

In [40]:
#import necessary modules
from keras.layers import Dense
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from keras.optimizers import SGD

In [41]:
from sklearn.decomposition import TruncatedSVD

#reduce the predictors to 5 components with truncated SVD
svd = TruncatedSVD(n_components=5)
predictors = svd.fit_transform(predictors)

In [42]:
#get the number of columns in the predictors array
n_cols = predictors.shape[1]
print(n_cols)

5


In [47]:
#instantiate the keras model
model=Sequential()

#add the layers 
model.add(Dense(200, activation='relu', input_shape=(n_cols,)))
model.add(Dense(200, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(1))

#compile the model
model.compile(optimizer=SGD(learning_rate=0.001), loss='mean_squared_error')

#set an early stopping monitor so that training stops if the validation loss
#has not improved after a specified number of epochs
early_stopping_monitor = EarlyStopping(patience=4)

#fit the model
model.fit(predictors, target, validation_split=0.3, epochs=40, callbacks=[early_stopping_monitor])

[Training output: Epoch 1/40 through Epoch 35/40; per-epoch metrics not preserved]


<tensorflow.python.keras.callbacks.History at 0x26b4e524910>

In [49]:
#convert the MSE loss value to RMSE
rmse = np.sqrt(3.6597)
print(rmse)

1.9130342391081243
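
The MSE in the cell above is typed in by hand. As a hedged alternative, the sketch below reads it from the dictionary that Keras exposes as `history.history` when the return value of `model.fit(...)` is kept. The helper name `best_rmse`, the choice of the lowest `val_loss`, and the example numbers other than 3.6597 are assumptions for illustration, not values from this notebook's training run.

```python
import math

def best_rmse(history_dict, key="val_loss"):
    """Return the RMSE that corresponds to the lowest MSE recorded under `key`."""
    losses = history_dict[key]
    return math.sqrt(min(losses))

# Hypothetical stand-in for `history.history`, where history = model.fit(...).
# Only 3.6597 comes from the notebook; the other numbers are made up.
example_history = {
    "loss": [5.10, 4.20, 3.90],
    "val_loss": [4.80, 3.6597, 3.70],
}

print(best_rmse(example_history))
```

With a real run this would be called as `best_rmse(history.history)`. Whether the lowest or the final epoch's loss is the right number to report is a judgment call.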


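Truncated SVD (singular value decomposition) factorizes the data matrix and keeps only the components with the largest singular values. Unlike PCA, it does not center the data first, so it can be applied directly to sparse matrices, which makes it a good reduction technique when you have a sparse dataset. Here it was applied to the unscaled predictors and reduced them from 9 columns to 5. Using this method, I got an RMSE of 1.91, which is slightly worse than when I used Keras without dimensionality reduction.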
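
As a rough illustration of what "truncating" an SVD means, the sketch below uses NumPy's exact SVD on synthetic data and keeps the top 5 components. It is not scikit-learn's `TruncatedSVD` implementation, and the helper name `truncated_svd` and the random matrix are assumptions made for the example.

```python
import numpy as np

def truncated_svd(X, k):
    """Project X onto its top-k right singular vectors, without centering."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_reduced = U[:, :k] * s[:k]                      # same as X @ Vt[:k].T
    kept_share = np.sum(s[:k] ** 2) / np.sum(s ** 2)  # share of squared singular values kept
    return X_reduced, kept_share

# Synthetic stand-in with 9 columns (not the abalone data)
rng = np.random.default_rng(0)
X_demo = rng.random((100, 9))

X_red, share = truncated_svd(X_demo, 5)
print(X_red.shape, share)
```

The second return value gives a quick sense of how much of the matrix is retained when going from 9 columns to 5.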

# 2. PCA

In [1]:
from sklearn.decomposition import PCA

In [2]:
#no n_components is given, so all components are kept
pca = PCA()

In [12]:
X_reduced = pca.fit_transform(X_res)

In [13]:
#Rename X and y to predictors and target for convention
predictors=X_reduced
target=y

In [15]:
#get the number of columns in the predictors array
n_cols = predictors.shape[1]
print(n_cols)

9


In [16]:
#instantiate the keras model
model=Sequential()

#add the layers 
model.add(Dense(200, activation='relu', input_shape=(n_cols,)))
model.add(Dense(200, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(1))

#compile the model
model.compile(optimizer=SGD(learning_rate=0.001), loss='mean_squared_error')

#set an early stopping monitor so that training stops if the validation loss
#has not improved after a specified number of epochs
early_stopping_monitor = EarlyStopping(patience=4)

#fit the model
model.fit(predictors, target, validation_split=0.3, epochs=40, callbacks=[early_stopping_monitor])

[Training output: Epoch 1/40 through Epoch 22/40; per-epoch metrics not preserved]


<tensorflow.python.keras.callbacks.History at 0x26b48ed88b0>

In [17]:
#convert the MSE loss value to RMSE
rmse = np.sqrt(3.5423)
print(rmse)

1.8820998910791107


Principal Component Analysis (PCA) is a reduction technique that works well with dense data, meaning most entries are non-zero. It finds the first principal component, which is the eigenvector of the data's covariance matrix along which the variance of X is greatest. Each subsequent principal component is perpendicular to the previous ones, so the resulting features are uncorrelated. Because PCA() was called without n_components, all 9 components were kept, so here PCA decorrelated the standardized features rather than reducing their number. This technique gave me the lowest RMSE that I have gotten for the abalone dataset so far, 1.88.
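
To make the eigenvector description concrete, here is a minimal NumPy sketch of PCA through the covariance matrix, plus a helper that counts how many leading components are needed to reach a chosen share of the variance. The function names, the 95% threshold, and the synthetic data are assumptions for illustration. With scikit-learn, the same ratios are available from the fitted object's `explained_variance_ratio_` attribute.

```python
import numpy as np

def pca_scores(X):
    """Return (explained-variance ratios, projected data), largest component first."""
    X_centered = X - X.mean(axis=0)            # PCA works on centered data
    cov = np.cov(X_centered, rowvar=False)     # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), X_centered @ eigvecs

def components_needed(ratios, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    count = int(np.searchsorted(np.cumsum(ratios), threshold)) + 1
    return min(count, len(ratios))

# Synthetic, deliberately correlated data (not the abalone features)
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
mixed = base @ rng.normal(size=(3, 4)) + 0.05 * rng.normal(size=(200, 4))
X_demo = np.hstack([base, mixed])

ratios, scores = pca_scores(X_demo)
print(components_needed(ratios), "of", X_demo.shape[1], "components reach the threshold")
```

Keeping only the first few columns of `scores` is what turns this rotation into an actual reduction.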

# 3. NMF (Non-Negative Matrix Factorization)

In [63]:
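from sklearn.decomposition import NMF

#NMF requires non-negative input, so it is fit on the unscaled X
#(named nmf so it is not confused with the keras model defined below)
nmf = NMF(n_components=5, init='random', random_state=42)
X_new = nmf.fit_transform(X)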



In [64]:
#Rename X and y to predictors and target for convention
predictors=X_new
target=y

In [65]:
#get the number of columns in the predictors array
n_cols = predictors.shape[1]
print(n_cols)

5


In [67]:
#instantiate the keras model
model=Sequential()

#add the layers 
model.add(Dense(200, activation='relu', input_shape=(n_cols,)))
model.add(Dense(200, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(1))

#compile the model
model.compile(optimizer=SGD(learning_rate=0.001), loss='mean_squared_error')

#set an early stopping monitor so that training stops if the validation loss
#has not improved after a specified number of epochs
early_stopping_monitor = EarlyStopping(patience=6)

#fit the model
model.fit(predictors, target, validation_split=0.3, epochs=70, callbacks=[early_stopping_monitor])

[Training output: Epoch 1/70 through Epoch 51/70; per-epoch metrics not preserved]


<tensorflow.python.keras.callbacks.History at 0x26b4fed2910>

In [68]:
#convert the MSE loss value to RMSE
rmse = np.sqrt(3.6351)
print(rmse)

1.9065938214522777


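Non-Negative Matrix Factorization (NMF) approximates the data matrix as the product of two smaller matrices whose entries are all non-negative, so it only works for non-negative datasets. For that reason it was fit on the unscaled predictors rather than the standardized ones. It gave me an RMSE of 1.91, similar to when I used Keras alone for the abalone dataset.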
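
The factorization idea can be sketched with the classic multiplicative-update rules in plain NumPy. This is a simple illustrative algorithm, not the solver scikit-learn's `NMF` uses by default, and the function name, iteration count, and synthetic data are all assumptions.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=300, eps=1e-9, seed=0):
    """Approximate non-negative X (n x m) as W (n x k) @ H (k x m) with W, H >= 0."""
    if np.any(X < 0):
        raise ValueError("NMF requires a non-negative input matrix")
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    for _ in range(n_iter):
        # multiplicative updates only rescale entries, so nothing becomes negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic non-negative data built from 3 hidden factors (not the abalone data)
rng = np.random.default_rng(1)
X_demo = rng.random((50, 3)) @ rng.random((3, 6))

W, H = nmf_multiplicative(X_demo, k=3)
rel_error = np.linalg.norm(X_demo - W @ H) / np.linalg.norm(X_demo)
print(W.shape, H.shape, rel_error)
```

The rows of `W` play the role of the reduced features, which is what `fit_transform` returns in the notebook cell.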

# Question 2.
Write a function that will indicate if an inputted IPv4 address is valid or not.
IP addresses are valid if they have 4 values between 0 and 255 (inclusive), punctuated by periods.
Input 1:
2.33.245.5
Output 1:
True
Input 2:
12.345.67.89
Output 2:
False

In [125]:
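def valid_ip(ip_address):
    try:
        #split on periods and convert every part to an integer
        values = [int(x) for x in ip_address.split('.')]
    except (AttributeError, TypeError, ValueError):
        #raised when the input is not a string or a part is not a whole number
        print("IP Address should be a string with values punctuated by periods.  This IP address doesn't have periods.")
        return False
    #valid only if there are exactly 4 values, each between 0 and 255 (inclusive)
    return len(values) == 4 and all(0 <= v <= 255 for v in values)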
    

In [126]:
valid_ip('2,33,255,5')

IP Address should be a string with values punctuated by periods.  This IP address doesn't have periods.


False

In [127]:
valid_ip('2.33.245.5')

True

In [128]:
valid_ip('12.345.67.89')

False
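
As a cross-check on the hand-written function, Python's standard library can do the same validation through the `ipaddress` module. The sketch below is one possible wrapper; the name `is_valid_ipv4` is an assumption. It is stricter than converting each part with `int()`: it rejects signs and surrounding whitespace, and recent Python versions also reject leading zeros.

```python
import ipaddress

def is_valid_ipv4(text):
    """Return True if `text` is a dotted-quad IPv4 address accepted by the standard library."""
    if not isinstance(text, str):
        # IPv4Address also accepts integers and packed bytes, which are not wanted here
        return False
    try:
        ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        return False
    return True

print(is_valid_ipv4('2.33.245.5'))
print(is_valid_ipv4('12.345.67.89'))
```

The two calls use Input 1 and Input 2 from the question.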