# Analysis of Online News Popularity

**Created by Phillip Efthimion, Scott Payne, Gino Varghese and John Blevins**

*MSDS 7331 Data Mining - Section 403 - Lab 2*

## Data Preparation Part 1
Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.

Phillip
* Copy data preparation steps
* Copy Dimension Reduction and Scaling from Minilab (PCA); this is picked up in Modeling and Evaluation 2

In [189]:
# Import and Configure Required Modules
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import warnings
import datetime
warnings.simplefilter('ignore', DeprecationWarning)
plt.rcParams['figure.figsize']=(15,10)

# Read Online News Data
df = pd.read_csv('data/OnlineNewsPopularity.csv')

# Correct Column Names by Removing Leading Space
df.columns = df.columns.str.replace(' ', '')

# Rename Columns for Ease of Display
df = df.rename(columns={'weekday_is_monday': 'monday', 'weekday_is_tuesday': 'tuesday', 'weekday_is_wednesday': 'wednesday', 'weekday_is_thursday': 'thursday', 'weekday_is_friday': 'friday', 'weekday_is_saturday': 'saturday', 'weekday_is_sunday': 'sunday', 'is_weekend': 'weekend'})
df = df.rename(columns={'data_channel_is_lifestyle':'lifestyle', 'data_channel_is_entertainment':'entertainment', 'data_channel_is_bus':'business', 'data_channel_is_socmed':'social_media', 'data_channel_is_tech':'technology', 'data_channel_is_world':'world'})

# Encode a new "popularity" column based on the number of shares:
# "popular" = 1 and "not popular" = 0.
df['popularity'] = pd.qcut(df['shares'].values, 2, labels=[0,1])
# np.int was removed in NumPy 1.24; the built-in int is equivalent
df.popularity = df.popularity.astype(int)
df.weekend = df.weekend.astype(int)


# Take a subset of the data related to Technology News Articles
dfsubset = df.loc[df['technology'] == 1]

# Copy to a New Variable and remove Columns which aren't needed
# (the copy avoids deleting columns from a slice of df)
df_imputed = dfsubset.copy()
del df_imputed['url']
del df_imputed['shares']
del df_imputed['timedelta']
del df_imputed['lifestyle']
del df_imputed['entertainment']
del df_imputed['business']
del df_imputed['social_media']
del df_imputed['technology']
del df_imputed['world']

# Display Dataframe Structure
df_imputed.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7346 entries, 4 to 39639
Data columns (total 53 columns):
n_tokens_title                  7346 non-null float64
n_tokens_content                7346 non-null float64
n_unique_tokens                 7346 non-null float64
n_non_stop_words                7346 non-null float64
n_non_stop_unique_tokens        7346 non-null float64
num_hrefs                       7346 non-null float64
num_self_hrefs                  7346 non-null float64
num_imgs                        7346 non-null float64
num_videos                      7346 non-null float64
average_token_length            7346 non-null float64
num_keywords                    7346 non-null float64
kw_min_min                      7346 non-null float64
kw_max_min                      7346 non-null float64
kw_avg_min                      7346 non-null float64
kw_min_max                      7346 non-null float64
kw_max_max                      7346 non-null float64
kw_avg_max                  

## Data Preparation Part 2
Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

Phillip
* Describe how we created Popularity
* Insert chart with attributes and meaning from lab 1

The online news popularity data set utilized in this analysis is publicly accessible from the UCI machine learning repository at https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity.  The data was originally collected by Mashable based on articles published through their website between 2013 and 2015.  The data set contains 39,644 data points with 61 attributes including 58 predictive attributes, 2 non-predictive attributes and 1 target attribute.

We created the 'popularity' class variable from the 'shares' variable. We measure how popular an article from the Mashable data set is by the number of shares it receives: an article that has been shared more than a threshold number of times is deemed popular. We created popularity with the 'qcut' function in pandas, which splits 'shares' at the median into two roughly equal-sized categories, 'popular' and 'not popular', coded as 1 and 0 respectively. 'popularity' is stored as a non-null integer, though most of the other variables are floats.
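
To make the encoding concrete, here is a minimal sketch of the same `qcut` call on a handful of made-up share counts. The values in `shares` are hypothetical and chosen only for illustration; they are not taken from the data set.

```python
import pandas as pd

# Hypothetical share counts, for illustration only (not taken from the data set).
shares = pd.Series([100, 450, 800, 1200, 1500, 2100, 3300, 9000])

# With 2 quantile bins the cut point is the median:
# values at or below it are labelled 0, values above it are labelled 1.
popularity = pd.qcut(shares, 2, labels=[0, 1]).astype(int)

median_shares = shares.median()
print("median shares:", median_shares)
print("popularity labels:", popularity.tolist())
```

Because the cut point is the median, the two classes come out close to balanced, which is why accuracy remains a meaningful metric for the popularity task.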

The 2 non-predictive attributes are as follows:
* URL of the news article
* Number of days between article publication and dataset acquisition

The explanatory attributes can be split into several different groupings as follows:
* Word Structure and Frequency
* Hyperlink References
* Digital Media References (Images and Videos)
* Channel Categorization and Keywords (content type such as lifestyle, entertainment, etc...)
* Publication Time
* Sentiment and Subjectivity Levels

The target attribute is the number of shares an article receives, which indicates the popularity of the article.  The complete list of attributes is shown in the following table:

|Attribute|Data Type|Description|
|---------|---------|-----------|
| url | Object | URL of the article (non-predictive) |
| timedelta | Float | Days between the article publication and the dataset acquisition (non-predictive) |
| n_tokens_title | Float | Number of words in the title |
| n_tokens_content | Float | Number of words in the content |
| n_unique_tokens | Float | Rate of unique words in the content |
| n_non_stop_words | Float | Rate of non-stop words in the content |
| n_non_stop_unique_tokens | Float | Rate of unique non-stop words in the content |
| num_hrefs | Float | Number of links |
| num_self_hrefs | Float | Number of links to other articles published by Mashable |
| num_imgs | Float | Number of images |
| num_videos | Float | Number of videos |
| average_token_length | Float | Average length of the words in the content |
| num_keywords | Float | Number of keywords in the metadata |
| data_channel_is_lifestyle | Float | Is data channel 'Lifestyle'? |
| data_channel_is_entertainment | Float | Is data channel 'Entertainment'? |
| data_channel_is_bus | Float | Is data channel 'Business'? |
| data_channel_is_socmed | Float | Is data channel 'Social Media'? |
| data_channel_is_tech | Float | Is data channel 'Tech'? |
| data_channel_is_world | Float | Is data channel 'World'? |
| kw_min_min | Float | Worst keyword (min) |
| kw_max_min | Float | Worst keyword (max) |
| kw_avg_min | Float | Worst keyword (avg) |
| kw_min_max | Float | Best keyword (min) |
| kw_max_max | Float | Best keyword (max) |
| kw_avg_max | Float | Best keyword (avg) |
| kw_min_avg | Float | Avg keyword (min) |
| kw_max_avg | Float | Avg keyword (max) |
| kw_avg_avg | Float | Avg keyword (avg) |
| self_reference_min_shares | Float | Min shares of referenced articles in Mashable |
| self_reference_max_shares | Float | Max shares of referenced articles in Mashable |
| self_reference_avg_sharess | Float | Avg shares of referenced articles in Mashable |
| weekday_is_monday | Float | Was the article published on a Monday? |
| weekday_is_tuesday | Float | Was the article published on a Tuesday? |
| weekday_is_wednesday | Float | Was the article published on a Wednesday? |
| weekday_is_thursday | Float | Was the article published on a Thursday? |
| weekday_is_friday | Float | Was the article published on a Friday? |
| weekday_is_saturday | Float | Was the article published on a Saturday? |
| weekday_is_sunday | Float | Was the article published on a Sunday? |
| is_weekend | Float | Was the article published on the weekend? |
| LDA_00 | Float | Closeness to LDA topic 0 |
| LDA_01 | Float | Closeness to LDA topic 1 |
| LDA_02 | Float | Closeness to LDA topic 2 |
| LDA_03 | Float | Closeness to LDA topic 3 |
| LDA_04 | Float | Closeness to LDA topic 4 |
| global_subjectivity | Float | Text subjectivity |
| global_sentiment_polarity | Float | Text sentiment polarity |
| global_rate_positive_words | Float | Rate of positive words in the content |
| global_rate_negative_words | Float | Rate of negative words in the content |
| rate_positive_words | Float | Rate of positive words among non-neutral tokens |
| rate_negative_words | Float | Rate of negative words among non-neutral tokens |
| avg_positive_polarity | Float | Avg polarity of positive words |
| min_positive_polarity | Float | Min polarity of positive words |
| max_positive_polarity | Float | Max polarity of positive words |
| avg_negative_polarity | Float | Avg polarity of negative words |
| min_negative_polarity | Float | Min polarity of negative words |
| max_negative_polarity | Float | Max polarity of negative words |
| title_subjectivity | Float | Title subjectivity |
| title_sentiment_polarity | Float | Title polarity |
| abs_title_subjectivity | Float | Absolute subjectivity level |
| abs_title_sentiment_polarity | Float | Absolute polarity level |
| shares | Integer | Number of article shares (tweets, shares, etc...); target attribute |

## Modeling and Evaluation 1
Choose and explain your evaluation metrics that you will use (i.e., accuracy,
precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.

Phillip
* Calculating Acc, Rec, F-measure, Negative Predictive Value, Specificity etc... from confusion matrix

Gino - Saturday Afternoon

Final attributes from dimension reduction:
['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_avg', 'kw_avg_avg', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words']

* 2 Tasks: Popularity (classification) and Share # (Regression)
* Regression - Linear, Lasso, Ridge
* Classification - Logistic Regression, K nearest Neighbors, Random Forest


The formulas below use the four cells of a 2x2 confusion matrix: a = true positives, b = false positives, c = false negatives and d = true negatives.

* Accuracy
Accuracy : the proportion of the total number of predictions that were correct.
(a + d) / (a + b + c + d)

* Precision
Positive Predictive Value or Precision : the proportion of predicted positive cases that were actually positive.
a / (a + b)
Negative Predictive Value : the proportion of predicted negative cases that were actually negative.
d / (c + d)

* Recall
Recall : the proportion of actual positive cases which are correctly identified.
a / (a + c)

Recall could be our primary go-to metric, because it measures how many of the truly popular articles the model identifies, including articles that sit on the border line between popularity classes.

* Specificity
Specificity : the proportion of actual negative cases which are correctly identified.
d / (b + d)

* F-Measure
F-Measure : the harmonic mean of precision and recall.
2 * Precision * Recall / (Precision + Recall)
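
As a quick check of these formulas, the sketch below computes each metric from the four confusion matrix cells using the standard definitions, with a = true positives, b = false positives, c = false negatives and d = true negatives. The function name `confusion_metrics` and the counts passed to it are hypothetical and used only for illustration.

```python
def confusion_metrics(a, b, c, d):
    """Metrics from a 2x2 confusion matrix.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": precision,
        "negative_predictive_value": d / (c + d),
        "recall": recall,
        "specificity": d / (b + d),
        # F-measure is the harmonic mean of precision and recall
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, for illustration only.
m = confusion_metrics(a=40, b=10, c=20, d=30)
for name, value in m.items():
    print("{0}: {1:.4f}".format(name, value))
```

The same quantities are available from `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`), which is what the modeling cells in this notebook call.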

In [190]:
#import the metrics module to collect metrics for each model
from sklearn import metrics as mt

#Separating data sets for each task

#task 1 Popularity Classification
#df_task1 = df_imputed[['popularity','n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_avg', 'kw_avg_avg', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words']].copy()
df_task1 = df_imputed.copy()  # copy so that later column deletions do not alter df_imputed

#task 2 isWeekend Classification
df_task2=df_imputed
df_task2 = df_task2.drop(['monday', 'tuesday','wednesday','thursday','friday','saturday','sunday'], axis=1)
 

#Task 1
#Create list to store datapoints for accuracy, precision, recall, F-measure from each of the model
accuracy_task1 = []
precision_task1 = []
recall_task1 = []
fmeasure_task1 = []

#Task 2
#Create list to store datapoints for accuracy, precision, recall, F-measure from each of the model
accuracy_task2 = []
precision_task2 = []
recall_task2 = []
fmeasure_task2 = []


## Modeling and Evaluation 2
Choose the method you will use for dividing your data into training and
testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why
your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.

Gino


For our approach we will be using Stratified 10-Fold Cross Validation for our analysis. 
- This approach is used to obtain samples that best represent the entire population being analysed. Its advantages include, but are not limited to, minimizing sample selection bias and ensuring that no segment of the population is overrepresented or underrepresented. The bottom line is that stratification rearranges the data so that each fold is a good representative of the whole. In our process the data is split into 10 folds, each of which serves once as the test set while the other nine are used for training, and this is the most appropriate choice for our analysis given the characteristics of our data set.
<br>
- The data set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. Because it covers such a diverse sample of articles, it is likely that some features are highly correlated with our target classes. The locality of our testing data should not influence the classification of individual test records, which is very important to the model we design. Performing Stratified 10-Fold Cross Validation not only provides a thorough method of splitting, training, and testing our models against our data set, but also helps ensure that the model maintains a similar level of accuracy on new data as it showed during development.
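
The effect of stratification can be illustrated with a small sketch. The labels below are synthetic (200 samples, 30% positive) and are an assumption made purely for the example. Each test fold reproduces the overall class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels, assumed for illustration: 200 samples, 30% positive.
y_demo = np.array([1] * 60 + [0] * 140)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect how folds are built

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Share of positive labels in each of the 10 test folds
fold_rates = [y_demo[test].mean() for _, test in skf.split(X_demo, y_demo)]

print("overall positive rate:", y_demo.mean())
print("per-fold positive rates:", fold_rates)
```

This property matters most for an imbalanced target such as `weekend`, where a plain random split could leave some folds with very few positive cases.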

## Modeling and Evaluation 3
Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!

Gino

* For Regression make sure popularity field is excluded and for classification make sure Share # is excluded!

## Task 1 : Popularity (classification) 

In [180]:
#stratified 10 folds
from sklearn.model_selection import StratifiedKFold
df_task1_temp = df_task1.copy()  # work on a copy so the class label stays in df_task1

# we want to predict the X and y data as follows:
if 'popularity' in df_task1_temp:
    y_1 = df_task1_temp['popularity'].values # get the labels we want
    del df_task1_temp['popularity'] # get rid of the class label
    X_1 = df_task1_temp.values # use everything else to predict!
    
yhat_1 = np.zeros(y_1.shape) # one prediction slot per row of y_1

cv_object = StratifiedKFold(n_splits=10, random_state=True, shuffle=True)
                         
print(cv_object)
cv_object.get_n_splits(X_1,y_1)

StratifiedKFold(n_splits=10, random_state=True, shuffle=True)


10

#### Logistic Regression

In [181]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from ipywidgets import widgets as wd

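# get a handle to the classifier object, which defines the type
# penalty is set to the default, which is l2
def lr_explor(cost):
    # publish the latest metrics as notebook-level variables so the append statements below can read them
    global acc, recall, precision, f
    lr_task1 = LogisticRegression(C=cost, class_weight=None)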

    # iterate through and get predictions for each row in yhat
    for train, test in cv_object.split(X_1,y_1):
        lr_task1.fit(X_1[train],y_1[train])
        yhat_1[test] = lr_task1.predict(X_1[test])

    #evaluation metrics   
    acc = mt.accuracy_score(y_1, yhat_1)
    recall = mt.recall_score(y_1, yhat_1)
    precision = mt.precision_score(y_1, yhat_1)
    f = mt.f1_score(y_1, yhat_1)

    #results in percentage
    print("Accuracy of the model: {0:.4f}%".format(acc*100))
    print("Recall of the model: {0:.4f}%".format(recall*100))
    print("Precision of the model: {0:.4f}%".format(precision*100))
    print("F-measure of the model: {0:.4f}%".format(f*100))
wd.interact(lr_explor,cost=(0.001,5.0,0.05),__manual=True)

#adding evaluation metrics to list for further analysis between models
accuracy_task1.append(['logistic_accuracy',acc])
recall_task1.append(['logistic_recall',recall])
precision_task1.append(['logistic_precision',precision])
fmeasure_task1.append(['logistic_F-Measure',f])


Accuracy of the model: 61.3940%
Recall of the model: 89.6306%
Precision of the model: 62.1046%
F-measure of the model: 73.3709%


#### K nearest Neighbors

In [182]:
# As a team we set up a KNN classifier loop to determine the best number of nearest neighbors,
# trying every k from 1 to 30 and keeping the one with the highest accuracy
from sklearn.neighbors import KNeighborsClassifier

# evaluate on one fixed stratified fold (the last one) instead of relying on
# train/test indices left over from an earlier cell
train, test = list(cv_object.split(X_1, y_1))[-1]

best_accuracy = 0.0
kVal = 1
for counter in range(1, 31):
    clf = KNeighborsClassifier(n_neighbors=counter)
    clf.fit(X_1[train], y_1[train])
    acc = clf.score(X_1[test], y_1[test])
    if acc > best_accuracy:
        best_accuracy = acc
        kVal = counter
neighbors = kVal
print("Best Accuracy returned by the classifier is: {0:.4f}%".format(best_accuracy*100),"with k value of:",kVal)

Best Accuracy returned by the classifier is: 58.2538% with k value of: 21


In [184]:
# Actual training and testing of the model begins
print("The best k value:", neighbors)
knn_task1 = KNeighborsClassifier(n_neighbors=neighbors)

# iterate through and get predictions for each row in yhat
for train, test in cv_object.split(X_1,y_1):
    knn_task1.fit(X_1[train],y_1[train])
    yhat_1[test] = knn_task1.predict(X_1[test])

#evaluation metrics   
acc = mt.accuracy_score(y_1, yhat_1)
recall = mt.recall_score(y_1, yhat_1)
precision = mt.precision_score(y_1, yhat_1)
f = mt.f1_score(y_1, yhat_1)

#adding evaluation metrics to list for further analysis between models
accuracy_task1.append(['KNN',acc])
recall_task1.append(['KNN',recall])
precision_task1.append(['KNN',precision])
fmeasure_task1.append(['KNN',f])



#results in percentage
print("Accuracy of the model: {0:.4f}%".format(acc*100))
print("Recall of the model: {0:.4f}%".format(recall*100))
print("Precision of the model: {0:.4f}%".format(precision*100))
print("F-measure of the model: {0:.4f}%".format(f*100))

The best k value: 21
Accuracy of the model: 58.0860%
Recall of the model: 75.6596%
Precision of the model: 62.0391%
F-measure of the model: 68.1757%


#### Random Forest

In [185]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from ipywidgets import widgets as wd

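def n_estimator(num):
    # publish the latest metrics as notebook-level variables so the append statements below can read them
    global acc, recall, precision, f
    # get a handle to the classifier object, which defines the type
    rf_task1 = RandomForestClassifier(n_estimators=num, n_jobs=-1)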

    # iterate through and get predictions for each row in yhat
    for train, test in cv_object.split(X_1,y_1):
        rf_task1.fit(X_1[train],y_1[train])
        yhat_1[test] = rf_task1.predict(X_1[test])

    #evaluation metrics   
    acc = mt.accuracy_score(y_1, yhat_1)
    recall = mt.recall_score(y_1, yhat_1)
    precision = mt.precision_score(y_1, yhat_1)
    f = mt.f1_score(y_1, yhat_1)

    #results in percentage
    print("Accuracy of the model: {0:.4f}%".format(acc*100))
    print("Recall of the model: {0:.4f}%".format(recall*100))
    print("Precision of the model: {0:.4f}%".format(precision*100))
    print("F-measure of the model: {0:.4f}%".format(f*100))
wd.interact(n_estimator,num=(100,150,10),__manual=True) 

#adding evaluation metrics to list for further analysis between models
accuracy_task1.append(['RandomForest',acc])
recall_task1.append(['RandomForest',recall])
precision_task1.append(['RandomForest',precision])
fmeasure_task1.append(['RandomForest',f])



Accuracy of the model: 65.0422%
Recall of the model: 81.5095%
Precision of the model: 66.8485%
F-measure of the model: 73.4546%


## Task 2 : Weekend (classification)

#### Logistic Regression

In [191]:
cv_object=None
df_task2_temp = df_task2.copy()  # work on a copy so the class label stays in df_task2
# we want to predict the X and y data as follows:
if 'weekend' in df_task2_temp:
    y_2 = df_task2_temp['weekend'].values # get the labels we want
    del df_task2_temp['weekend'] # get rid of the class label
    X_2 = df_task2_temp.values # use everything else to predict!
    
yhat_2 = np.zeros(y_2.shape) # one prediction slot per row of y_2

cv_object = StratifiedKFold(n_splits=10, random_state=True, shuffle=True)
                         
print(cv_object)
cv_object.get_n_splits(X_2,y_2)

StratifiedKFold(n_splits=10, random_state=True, shuffle=True)


10

In [193]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from ipywidgets import widgets as wd

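def lr_explor(cost_2):
    # get a handle to the classifier object, which defines the type
    # penalty is set to the default, which is l2
    # publish the latest metrics as notebook-level variables so the append statements below can read them
    global acc, recall, precision, f
    lr_task2 = LogisticRegression(C=cost_2, class_weight=None)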

    # iterate through and get predictions for each row in yhat
    for train, test in cv_object.split(X_2,y_2):
        lr_task2.fit(X_2[train],y_2[train])
        yhat_2[test] = lr_task2.predict(X_2[test])

    #evaluation metrics   
    acc = mt.accuracy_score(y_2, yhat_2)
    recall = mt.recall_score(y_2, yhat_2)
    precision = mt.precision_score(y_2, yhat_2)
    f = mt.f1_score(y_2, yhat_2)
    
    #results in percentage
    print("Accuracy of the model: {0:.4f}%".format(acc*100))
    print("Recall of the model: {0:.4f}%".format(recall*100))
    print("Precision of the model: {0:.4f}%".format(precision*100))
    print("F-measure of the model: {0:.4f}%".format(f*100))
    
wd.interact(lr_explor,cost_2=(0.001,5.0,0.05),__manual=True) 

#adding evaluation metrics to list for further analysis between models
accuracy_task2.append(['logistic_accuracy',acc])
recall_task2.append(['logistic_recall',recall])
precision_task2.append(['logistic_precision',precision])
fmeasure_task2.append(['logistic_F-Measure',f])



Accuracy of the model: 86.9589%
Recall of the model: 0.0000%
Precision of the model: 0.0000%
F-measure of the model: 0.0000%


#### K nearest Neighbors

In [194]:
# As a team we set up a KNN classifier loop to determine the best number of nearest neighbors,
# trying every k from 1 to 30 and keeping the one with the highest accuracy
from sklearn.neighbors import KNeighborsClassifier

# evaluate on one fixed stratified fold (the last one) instead of relying on
# train/test indices left over from an earlier cell
train, test = list(cv_object.split(X_2, y_2))[-1]

best_accuracy = 0.0
kVal = 1
for counter in range(1, 31):
    clf = KNeighborsClassifier(n_neighbors=counter)
    clf.fit(X_2[train], y_2[train])
    acc = clf.score(X_2[test], y_2[test])
    if acc > best_accuracy:
        best_accuracy = acc
        kVal = counter
neighbors = kVal
print("Best Accuracy returned by the classifier is: {0:.4f}%".format(best_accuracy*100),"with k value of:",kVal)

Best Accuracy returned by the classifier is: 87.4488% with k value of: 16


In [195]:
# Actual training and testing of the model begins
print("The best k value:", neighbors)
knn_task2 = KNeighborsClassifier(n_neighbors=neighbors)

# iterate through and get predictions for each row in yhat
for train, test in cv_object.split(X_2,y_2):
    knn_task2.fit(X_2[train],y_2[train])
    yhat_2[test] = knn_task2.predict(X_2[test])

#evaluation metrics   
acc = mt.accuracy_score(y_2, yhat_2)
recall = mt.recall_score(y_2, yhat_2)
precision = mt.precision_score(y_2, yhat_2)
f = mt.f1_score(y_2, yhat_2)

#adding evaluation metrics to list for further analysis between models
accuracy_task2.append(['KNN',acc])
recall_task2.append(['KNN',recall])
precision_task2.append(['KNN',precision])
fmeasure_task2.append(['KNN',f])


#results in percentage
print("Accuracy of the model: {0:.4f}%".format(acc*100))
print("Recall of the model: {0:.4f}%".format(recall*100))
print("Precision of the model: {0:.4f}%".format(precision*100))
print("F-measure of the model: {0:.4f}%".format(f*100))

The best k value: 16
Accuracy of the model: 87.4217%
Recall of the model: 0.5429%
Precision of the model: 38.4615%
F-measure of the model: 1.0707%


#### Random Forest

In [149]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from ipywidgets import widgets as wd

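def n_estimator(num):
    # publish the latest metrics as notebook-level variables so the append statements below can read them
    global acc, recall, precision, f
    # get a handle to the classifier object, which defines the type
    rf_task2 = RandomForestClassifier(n_estimators=num, n_jobs=-1)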

    # iterate through and get predictions for each row in yhat
    for train, test in cv_object.split(X_2,y_2):
        rf_task2.fit(X_2[train],y_2[train])
        yhat_2[test] = rf_task2.predict(X_2[test])

    #evaluation metrics   
    acc = mt.accuracy_score(y_2, yhat_2)
    recall = mt.recall_score(y_2, yhat_2)
    precision = mt.precision_score(y_2, yhat_2)
    f = mt.f1_score(y_2, yhat_2)

    #results in percentage
    print("Accuracy of the model: {0:.4f}%".format(acc*100))
    print("Recall of the model: {0:.4f}%".format(recall*100))
    print("Precision of the model: {0:.4f}%".format(precision*100))
    print("F-measure of the model: {0:.4f}%".format(f*100))
wd.interact(n_estimator,num=(100,150,10),__manual=True) 

#adding evaluation metrics to list for further analysis between models
accuracy_task2.append(['RandomForest',acc])
recall_task2.append(['RandomForest',recall])
precision_task2.append(['RandomForest',precision])
fmeasure_task2.append(['RandomForest',f])


Accuracy of the model: 88.8375%
Recall of the model: 12.0521%
Precision of the model: 91.7355%
F-measure of the model: 21.3052%


## Modeling and Evaluation 4
Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.

John
* For Regression just compare attributes selected which are significantly
* For Classification determine accuracy, etc...

## Modeling and Evaluation 5
Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.

John
Notebook - Grand Puba Notebook
Rsquared
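
One possible way to compare two classifiers at 95% confidence is a paired t-interval on their fold-wise accuracy differences, sketched below. This is an assumption about the comparison method rather than the procedure prescribed in the course, and the data comes from `make_classification`, not from the news data set. Fold scores from cross validation are not fully independent, so the interval should be read as approximate.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); in the notebook this would be X_1, y_1.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# A fixed random_state gives both models exactly the same 10 folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
acc_lr = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)
acc_knn = cross_val_score(KNeighborsClassifier(n_neighbors=21), X_demo, y_demo, cv=cv)

# Paired differences, one per fold
diff = acc_lr - acc_knn
k = len(diff)
mean_diff = diff.mean()
std_err = diff.std(ddof=1) / np.sqrt(k)

# Two-sided 95% interval from the t distribution with k - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=k - 1)
ci_low = mean_diff - t_crit * std_err
ci_high = mean_diff + t_crit * std_err

# The difference is called significant when the interval excludes zero.
significant = not (ci_low <= 0.0 <= ci_high)
print("mean difference: {0:.4f}".format(mean_diff))
print("95% interval: [{0:.4f}, {1:.4f}]".format(ci_low, ci_high))
print("significant at 95%:", significant)
```

If the interval contains zero, the data does not support claiming that either model is better at this confidence level.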


## Modeling and Evaluation 6
Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.

John
* Discuss weights in more detail
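
As a starting point for discussing weights, the sketch below ranks logistic regression coefficients by absolute value after standardizing the features, so that the weights are on a comparable scale. The data and the names in `feature_names` are synthetic and hypothetical; they only illustrate the mechanics and say nothing about which attributes matter in the news data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical features: the label depends strongly on column 0,
# weakly on column 1 and not at all on column 2.
feature_names = ["signal", "weak_signal", "noise"]
X_demo = rng.normal(size=(1000, 3))
score = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=1000)
y_demo = (score > 0).astype(int)

# Standardize so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)
lr_demo = LogisticRegression().fit(X_scaled, y_demo)

# Rank features by the absolute value of their weight
ranked = sorted(zip(feature_names, lr_demo.coef_[0]),
                key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranked:
    print("{0}: {1:.3f}".format(name, weight))
```

For the random forest models, the fitted estimator's `feature_importances_` attribute gives an analogous ranking without requiring scaling.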


## Deployment
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.? 

Scott

## Exceptional Work

You have free rein to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
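
A parallelized grid search could look like the sketch below, which uses scikit-learn's `GridSearchCV` with `n_jobs=-1`. The data is synthetic and the parameter grid is an assumption chosen for illustration, not a tuned recommendation for the news data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); in the notebook this would be X_1, y_1.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Hypothetical grid: 4 values of k times 2 weighting schemes = 8 combinations
param_grid = {"n_neighbors": [1, 5, 11, 21], "weights": ["uniform", "distance"]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# n_jobs=-1 asks scikit-learn to evaluate the combinations on all available cores.
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring="f1", cv=cv, n_jobs=-1)
search.fit(X_demo, y_demo)

# One mean cross-validated score per parameter combination
mean_scores = search.cv_results_["mean_test_score"]
print("best parameters:", search.best_params_)
print("best mean F-measure: {0:.4f}".format(search.best_score_))
```

The `cv_results_` dictionary holds the score for every combination, which is the raw material for visualizing how performance varies across parameter values.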

Scott

Possible Analysis:

Deep Learning

Neural Network