# Analysis of Online News Popularity

**Created by Phillip Efthimion, Scott Payne, Gino Varghese and John Blevins**

*MSDS 7331 Data Mining - Section 403 - Lab 2*

## Data Preparation Part 1	
Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.

Phillip
* Copy data preparation steps
* Copy Dimension Reduction and Scaling from Minilab (PCA) - this picks up in Modeling and Evaluation 2

In [5]:
# Import and Configure Required Modules
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import warnings
import datetime
warnings.simplefilter('ignore', DeprecationWarning)
plt.rcParams['figure.figsize']=(15,10)

# Read Online News Data
df = pd.read_csv('data/OnlineNewsPopularity.csv')

# Correct Column Names by Removing Leading Space
df.columns = df.columns.str.replace(' ', '')

# Rename Columns for Ease of Display
df = df.rename(columns={'weekday_is_monday': 'monday', 'weekday_is_tuesday': 'tuesday', 'weekday_is_wednesday': 'wednesday', 'weekday_is_thursday': 'thursday', 'weekday_is_friday': 'friday', 'weekday_is_saturday': 'saturday', 'weekday_is_sunday': 'sunday', 'is_weekend': 'weekend'})
df = df.rename(columns={'data_channel_is_lifestyle':'lifestyle', 'data_channel_is_entertainment':'entertainment', 'data_channel_is_bus':'business', 'data_channel_is_socmed':'social_media', 'data_channel_is_tech':'technology', 'data_channel_is_world':'world'})

# Encode a new "popularity" column based on the number of shares:
# "popular" = 1 and "not popular" = 0 (split at the median).
df['popularity'] = pd.qcut(df['shares'].values, 2, labels=[0,1])
df['popularity'] = df['popularity'].astype(int)  # np.int was removed in NumPy 1.24

# Take a subset of the data related to Technology News Articles
dfsubset = df.loc[df['technology'] == 1]

# Reassign to New Variable and remove Columns which aren't needed
df_imputed = dfsubset.copy()  # copy so the deletions below do not act on a slice of df
del df_imputed['url']
#del df_imputed['shares']
del df_imputed['timedelta']
del df_imputed['lifestyle']
del df_imputed['entertainment']
del df_imputed['business']
del df_imputed['social_media']
del df_imputed['technology']
del df_imputed['world']

# Display Dataframe Structure
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7346 entries, 4 to 39639
Data columns (total 54 columns):
n_tokens_title                  7346 non-null float64
n_tokens_content                7346 non-null float64
n_unique_tokens                 7346 non-null float64
n_non_stop_words                7346 non-null float64
n_non_stop_unique_tokens        7346 non-null float64
num_hrefs                       7346 non-null float64
num_self_hrefs                  7346 non-null float64
num_imgs                        7346 non-null float64
num_videos                      7346 non-null float64
average_token_length            7346 non-null float64
num_keywords                    7346 non-null float64
kw_min_min                      7346 non-null float64
kw_max_min                      7346 non-null float64
kw_avg_min                      7346 non-null float64
kw_min_max                      7346 non-null float64
kw_max_max                      7346 non-null float64
kw_avg_max                  

### including data points for evaluation section

### Parameter Adjustment for Improving Accuracy
Attributes that provide no explanatory value, and that would otherwise have affected the fitted model, have already been removed. In addition, the cost parameter can be adjusted to improve accuracy. The cost tells the optimization how heavily to penalize misclassified training points. With a large cost, the optimization chooses a smaller-margin hyperplane that classifies as many training points correctly as it can. With a small cost, the optimization looks for a larger-margin separating hyperplane and tolerates some misclassified points, even when the training data are linearly separable. For scikit-learn's `LogisticRegression`, used below, `C` is the inverse of the regularization strength, so a small `C` means stronger regularization.
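As a rough illustration of how the cost parameter behaves, the sketch below fits `LogisticRegression` at several values of `C` and reports the cross-validated accuracy together with the size of the fitted weight vector. The data from `make_classification` and the grid of `C` values are assumptions chosen for illustration, not the news data or tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the news features (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

coef_norms = {}
for cost in [0.001, 0.05, 1.0, 5.0]:
    # default L2 penalty; C is the inverse of the regularization strength
    clf = LogisticRegression(C=cost, max_iter=1000)
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv_demo)
    # refit on all rows only to inspect how strongly the weights are shrunk
    coef_norms[cost] = np.linalg.norm(clf.fit(X_demo, y_demo).coef_)
    print('C=%.3f  mean accuracy=%.3f  ||w||=%.3f'
          % (cost, scores.mean(), coef_norms[cost]))
```

Smaller values of `C` pull the weights toward zero, which is why the weight norm grows as the cost increases.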

## Resubmission change: 01
ipywidget has been fixed to be interactive 

In [6]:
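# and here is an even shorter way of getting the accuracies for each training and test set
# (X, y and cv_object are created in the Modeling and Evaluation 2 cell, which must be run first)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None) # define the classifier before scoring it
accuracies = cross_val_score(lr_clf, X, y=y, cv=cv_object) # this also can help with parallelism
print(accuracies)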

In [None]:
# here we can change some of the parameters interactively
from ipywidgets import widgets as wd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def lr_explor(cost):
    lr_clf = LogisticRegression(penalty='l2', C=cost, class_weight=None) # get object
    accuracies = cross_val_score(lr_clf,X,y=y,cv=cv_object) # this also can help with parallelism
    print(accuracies)

# interact_manual adds a button so the model is only refit on request
wd.interact_manual(lr_explor,cost=(0.001,5.0,0.05))

### Interpreting weights for Logistic Regression

The coefficient weights of a logistic regression are important for determining which attributes to include in the model. The coefficients are used to predict the probability of an outcome, in this case whether a news article is "popular" or "not popular". A positive weight indicates that an increase in the attribute increases the odds of the outcome being “popular”, while a negative weight indicates that an increase in the attribute decreases those odds.
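To make the sign interpretation concrete: in logistic regression, a one-unit increase in an attribute multiplies the odds of the positive class by e^w, where w is the weight of that attribute. The weights in this sketch are made-up values for illustration, not results from the fitted model.

```python
import math

# Hypothetical weights (assumed values, not taken from the fitted model)
example_weights = {'attribute_a': 0.25, 'attribute_b': -0.25, 'attribute_c': 0.0}

# exp(weight) is the multiplicative change in the odds per unit increase
odds_ratios = {name: math.exp(w) for name, w in example_weights.items()}
for name, ratio in odds_ratios.items():
    print('%s: odds multiplied by %.3f per unit increase' % (name, ratio))
```

A ratio above 1 raises the odds of "popular", a ratio below 1 lowers them, and a weight of zero leaves the odds unchanged.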

In [None]:
# interpret the weights

# iterate over the coefficients
weights = lr_clf.coef_.T # take transpose to make a column vector
variable_names = df_imputed.columns
for coef, name in zip(weights,variable_names):
    print(name, 'has weight of', coef[0])
    
# does this look correct? 

### Normalizing Features
Because the attributes do not all use the same measurement scale, the magnitude of the weights does not give meaningful insight into which attributes are most important for the model. The attributes need to be put on a common scale so that the weight magnitudes can be compared across all the features. The standard scaler centers each attribute on its mean and divides by its standard deviation. The logistic regression model is then fit to the scaled values, and the coefficient weights are calculated from that model.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

# we want to normalize the features based upon the mean and standard deviation of each column. 
# However, we do not want to accidentally use the testing data to find out the mean and std (this would be snooping)
# to Make things easier, let's start by just using whatever was last stored in the variables:
##    X_train , y_train , X_test, y_test (they were set in a for loop above)

# scale attributes by the training set
scl_obj = StandardScaler()
scl_obj.fit(X_train) # find scalings for each column that make this zero mean and unit std
# the line of code above only looks at training data to get mean and std and we can use it 
# to transform new feature data

X_train_scaled = scl_obj.transform(X_train) # apply to training
X_test_scaled = scl_obj.transform(X_test) # apply those means and std to the test set (without snooping at the test set values)

# train the model just as before
lr_clf = LogisticRegression(penalty='l2', C=0.05) # get object, the 'C' value is less (can you guess why??)
lr_clf.fit(X_train_scaled,y_train)  # train object

y_hat = lr_clf.predict(X_test_scaled) # get test set predictions

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf )

# sort these attributes and spit them out
zip_vars = zip(lr_clf.coef_.T,df_imputed.columns) # combine attributes
zip_vars = sorted(zip_vars)
for coef, name in zip_vars:
    print(name, 'has weight of', coef[0]) # now print them out

The weights of the regression coefficients are now scaled, so their magnitudes can be compared against each other. The coefficients with the greatest magnitude belong to features that describe the maximum, average, and minimum number of shares for the post’s keywords (kw_max_avg, kw_min_min, kw_avg_avg). It makes sense that keywords with a higher minimum and average popularity would increase the probability of the outcome being popular. The number of links (num_hrefs) found within the article has a large positive weight, while the number of links to other articles on the same website (num_self_hrefs) has a large negative weight. It is possible that there is a collinearity problem with these features that should be explored further. The rate of positive words and the rate of negative words have nearly identical magnitudes but opposite signs: the rate of positive words decreases the probability of popularity and the rate of negative words increases it.

The features that indicate whether the day is a Saturday or falls on a weekend both have large positive weights. Since these features are related, keeping only the weekend indicator should not hurt the model. The last three impactful features concern the number of tokens, unique tokens, and non-stop unique tokens in the article. A greater number of tokens (a longer article) and more non-stop unique tokens seem to increase the odds of popularity, while the weight on unique tokens overall is negative. This would indicate that longer articles with flourishes of colorful language can increase popularity, while a large vocabulary in general can decrease popularity.
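One quick way to start exploring the possible collinearity between the two link counts is to look at their pairwise correlation. The sketch below uses randomly generated stand-in columns (an assumption for illustration, not the article data), in which every self link is also counted as a link.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in data: self links are drawn as a subset of all links
num_hrefs = rng.poisson(10, size=500)
num_self_hrefs = rng.binomial(num_hrefs, 0.3)
demo = pd.DataFrame({'num_hrefs': num_hrefs, 'num_self_hrefs': num_self_hrefs})

# Pearson correlation between the two link counts
corr = demo['num_hrefs'].corr(demo['num_self_hrefs'])
print('correlation: %.3f' % corr)
```

On the real data the same check would be `df_imputed['num_hrefs'].corr(df_imputed['num_self_hrefs'])`.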

In [None]:
# now let's make a pandas Series with the names and values, and plot them
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize']=(15,10)

weights = pd.Series(lr_clf.coef_[0],index=df_imputed.columns)
weights.plot(kind='bar')
plt.show()

Based on the graph of the coefficient weights, a cutoff of > 0.1 or < -0.1 would keep the highest-magnitude features while removing some of the less impactful attributes. Using this threshold, the following features would be included in the model: n_tokens_content, n_unique_tokens, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, kw_min_min, kw_max_max, kw_max_avg, kw_avg_avg, weekend, global_rate_positive_words, global_rate_negative_words.
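The cutoff can also be applied programmatically rather than read off the chart. A minimal sketch, using a small hand-made Series with assumed values in place of the fitted coefficients:

```python
import pandas as pd

# Hypothetical coefficient weights (illustrative values only)
demo_weights = pd.Series({'num_hrefs': 0.21, 'num_self_hrefs': -0.15,
                          'num_imgs': 0.03, 'weekend': 0.12,
                          'num_videos': -0.02})

threshold = 0.1
# keep attributes whose weight magnitude exceeds the cutoff
selected = demo_weights[demo_weights.abs() > threshold].index.tolist()
print(selected)
```

With the notebook's own `weights` Series, the same filter would return the list of attribute names to carry forward.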

In [None]:
from sklearn.preprocessing import StandardScaler
# we want to normalize the features based upon the mean and standard deviation of each column. 
# However, we do not want to accidentally use the testing data to find out the mean and std (this would be snooping)

from sklearn.pipeline import Pipeline
# you can apply the StandardScaler function inside of the cross-validation loop 
#  but this requires the use of PipeLines in scikit. 
#  A pipeline can apply feature pre-processing and data fitting in one compact notation
#  Here is an example!

std_scl = StandardScaler()
lr_clf = LogisticRegression(penalty='l2', C=0.05) 

# create the pipeline
piped_object = Pipeline([('scale', std_scl),  # do this
                         ('logit_model', lr_clf)]) # and then do this

weights = []
# run the pipeline cross-validated
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    piped_object.fit(X[train_indices],y[train_indices])  # train object
    # it is a little odd getting trained objects from a  pipeline:
    weights.append(piped_object.named_steps['logit_model'].coef_[0])
    

weights = np.array(weights)

In [None]:
import plotly
from plotly import __version__
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import Bar
init_notebook_mode() # run at the start of every notebook

error_y=dict(
            type='data',
            array=np.std(weights,axis=0),
            visible=True
        )

graph1 = {'x': df_imputed.columns,
          'y': np.mean(weights,axis=0),
    'error_y':error_y,
       'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}

plotly.offline.iplot(fig)

If we use a 0.1 threshold from the weight plot, we see that n_tokens_content, n_unique_tokens, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, kw_min_min, kw_max_max, kw_max_avg, kw_avg_avg, self_reference_avg_sharess, saturday, weekend, global_rate_positive_words, global_rate_negative_words and rate_negative_words all have the most weight in predicting popularity.

In [None]:
Xnew = df_imputed[['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_max', 'kw_max_avg', 'kw_avg_avg', 'self_reference_avg_sharess', 'saturday', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words']].values

weights = []
# run the pipeline cross-validated
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(Xnew,y)):
    piped_object.fit(Xnew[train_indices],y[train_indices])  # train object
    weights.append(piped_object.named_steps['logit_model'].coef_[0])
    
weights = np.array(weights)

error_y=dict(
            type='data',
            array=np.std(weights,axis=0),
            visible=True
        )

graph1 = {'x': ['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_max', 'kw_max_avg', 'kw_avg_avg', 'self_reference_avg_sharess', 'saturday', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words'],
          'y': np.mean(weights,axis=0),
    'error_y':error_y,
       'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}

plotly.offline.iplot(fig)




If we use a 0.1 threshold from the weight plot, we see that n_tokens_content, n_unique_tokens, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, kw_min_min, kw_max_max, kw_max_avg, kw_avg_avg,  weekend, global_rate_positive_words, global_rate_negative_words and rate_negative_words all have the most weight in predicting popularity.

In [None]:
Xnew = df_imputed[['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_max', 'kw_max_avg', 'kw_avg_avg', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words']].values

weights = []
# run the pipeline cross-validated
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(Xnew,y)):
    piped_object.fit(Xnew[train_indices],y[train_indices])  # train object
    weights.append(piped_object.named_steps['logit_model'].coef_[0])
    
weights = np.array(weights)

error_y=dict(
            type='data',
            array=np.std(weights,axis=0),
            visible=True
        )

graph1 = {'x': ['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_max', 'kw_max_avg', 'kw_avg_avg', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words'],
          'y': np.mean(weights,axis=0),
    'error_y':error_y,
       'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}

plotly.offline.iplot(fig)

If we use a 0.1 threshold from the weight plot, we see that n_tokens_content, n_unique_tokens, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, kw_min_min, kw_max_avg, kw_avg_avg,  weekend, global_rate_positive_words, global_rate_negative_words and rate_negative_words all have the most weight in predicting popularity.

In [None]:
Xnew = df_imputed[['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_avg', 'kw_avg_avg', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words']].values

weights = []
# run the pipeline cross-validated
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(Xnew,y)):
    piped_object.fit(Xnew[train_indices],y[train_indices])  # train object
    weights.append(piped_object.named_steps['logit_model'].coef_[0])
    
weights = np.array(weights)

error_y=dict(
            type='data',
            array=np.std(weights,axis=0),
            visible=True
        )

graph1 = {'x': ['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_avg', 'kw_avg_avg', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words'],
          'y': np.mean(weights,axis=0),
    'error_y':error_y,
       'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}

plotly.offline.iplot(fig)

## Data Preparation Part 2
Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

Phillip
* Describe how we created Popularity
* Insert chart with attributes and meaning from lab 1

The online news popularity data set utilized in this analysis is publicly accessible from the UCI machine learning repository at https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity. The data was originally collected by Mashable based on articles published through their website between 2013 and 2015. The data set contains 39,644 data points with 61 attributes including 58 predictive attributes, 2 non-predictive attributes and 1 target attribute.

We created the ‘popularity’ class variable from the ‘shares’ variable. We measure how popular an article in the Mashable data set is by the number of shares it receives: an article shared more often than the median is deemed popular. We created popularity with the ‘qcut’ tool in pandas, splitting ‘shares’ into two quantile-based categories, ‘popular’ and ‘not popular’, coded as 1 and 0 respectively. Because the split was made on the full data set before subsetting to technology articles, the two classes are not necessarily the same size within the subset. ‘popularity’ is stored as a non-null integer, though most of the other variables are floats.
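As a small illustration of how `qcut` produces the two classes, the sketch below applies the same call to a handful of made-up share counts (the values are assumptions, not rows from the data set). With two quantile bins, the cut point is the median.

```python
import pandas as pd

# Made-up share counts for illustration
demo = pd.DataFrame({'shares': [100, 500, 900, 1400, 2000, 5000, 12000, 40000]})

# Two equal-sized bins split at the median: 0 = not popular, 1 = popular
demo['popularity'] = pd.qcut(demo['shares'], 2, labels=[0, 1]).astype(int)
print(demo)
print('median shares:', demo['shares'].median())
```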

The 2 non-predictive attributes are as follows:
* URL of the news article
* Number of days between article publication and dataset acquisition

The explanatory attributes can be split into several different groupings as follows:
* Word Structure and Frequency
* Hyperlink References
* Digital Media References (Images and Videos)
* Channel Categorization and Keywords (content type such as lifestyle, entertainment, etc...)
* Publication Time
* Sentiment and Subjectivity Levels

The target attribute is the number of shares an article receives, which indicates the popularity of the article. The complete list of attributes is shown in the following table:

|Attribute|Data Type|Description|
|---------|---------|-----------|
| url | Object | URL of the article (non-predictive) |
| timedelta | Float | Days between the article publication and the dataset acquisition (non-predictive) |
| n_tokens_title | Float | Number of words in the title |
| n_tokens_content | Float | Number of words in the content |
| n_unique_tokens | Float | Rate of unique words in the content |
| n_non_stop_words | Float | Rate of non-stop words in the content |
| n_non_stop_unique_tokens | Float | Rate of unique non-stop words in the content |
| num_hrefs | Float | Number of links |
| num_self_hrefs | Float | Number of links to other articles published by Mashable |
| num_imgs | Float | Number of images |
| num_videos | Float | Number of videos |
| average_token_length | Float | Average length of the words in the content |
| num_keywords | Float | Number of keywords in the metadata |
| data_channel_is_lifestyle | Float | Is data channel 'Lifestyle'? |
| data_channel_is_entertainment | Float | Is data channel 'Entertainment'? |
| data_channel_is_bus | Float | Is data channel 'Business'? |
| data_channel_is_socmed | Float | Is data channel 'Social Media'? |
| data_channel_is_tech | Float | Is data channel 'Tech'? |
| data_channel_is_world | Float | Is data channel 'World'? |
| kw_min_min | Float | Worst keyword (min shares) |
| kw_max_min | Float | Worst keyword (max shares) |
| kw_avg_min | Float | Worst keyword (avg shares) |
| kw_min_max | Float | Best keyword (min shares) |
| kw_max_max | Float | Best keyword (max shares) |
| kw_avg_max | Float | Best keyword (avg shares) |
| kw_min_avg | Float | Avg keyword (min shares) |
| kw_max_avg | Float | Avg keyword (max shares) |
| kw_avg_avg | Float | Avg keyword (avg shares) |
| self_reference_min_shares | Float | Min shares of referenced articles in Mashable |
| self_reference_max_shares | Float | Max shares of referenced articles in Mashable |
| self_reference_avg_sharess | Float | Avg shares of referenced articles in Mashable |
| weekday_is_monday | Float | Was the article published on a Monday? |
| weekday_is_tuesday | Float | Was the article published on a Tuesday? |
| weekday_is_wednesday | Float | Was the article published on a Wednesday? |
| weekday_is_thursday | Float | Was the article published on a Thursday? |
| weekday_is_friday | Float | Was the article published on a Friday? |
| weekday_is_saturday | Float | Was the article published on a Saturday? |
| weekday_is_sunday | Float | Was the article published on a Sunday? |
| is_weekend | Float | Was the article published on the weekend? |
| LDA_00 | Float | Closeness to LDA topic 0 |
| LDA_01 | Float | Closeness to LDA topic 1 |
| LDA_02 | Float | Closeness to LDA topic 2 |
| LDA_03 | Float | Closeness to LDA topic 3 |
| LDA_04 | Float | Closeness to LDA topic 4 |
| global_subjectivity | Float | Text subjectivity |
| global_sentiment_polarity | Float | Text sentiment polarity |
| global_rate_positive_words | Float | Rate of positive words in the content |
| global_rate_negative_words | Float | Rate of negative words in the content |
| rate_positive_words | Float | Rate of positive words among non-neutral tokens |
| rate_negative_words | Float | Rate of negative words among non-neutral tokens |
| avg_positive_polarity | Float | Avg polarity of positive words |
| min_positive_polarity | Float | Min polarity of positive words |
| max_positive_polarity | Float | Max polarity of positive words |
| avg_negative_polarity | Float | Avg polarity of negative words |
| min_negative_polarity | Float | Min polarity of negative words |
| max_negative_polarity | Float | Max polarity of negative words |
| title_subjectivity | Float | Title subjectivity |
| title_sentiment_polarity | Float | Title polarity |
| abs_title_subjectivity | Float | Absolute subjectivity level |
| abs_title_sentiment_polarity | Float | Absolute polarity level |
| shares | Integer | Number of article shares (target) |

## Modeling and Evaluation 1
Choose and explain your evaluation metrics that you will use (i.e., accuracy,
precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.

Phillip
* Calculating Acc, Rec, F-measure, Negative Predictive Value, Specificity etc... from confusion matrix

Gino - Saturday Afternoon

Final attributes from dimension reduction:
['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_avg', 'kw_avg_avg', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words']

* 2 Tasks: Popularity (classification) and Share # (Regression)
* Regression - Linear, Lasso, Ridge
* Classification - Logistic Regression, K nearest Neighbors, Random Forest


In the formulas below, a is the number of true positives, b the number of false positives, c the number of false negatives, and d the number of true negatives.

* Accuracy
Accuracy : the proportion of the total number of predictions that were correct.
(a + d) / (a + b + c + d)

* Precision
Positive Predictive Value or Precision : the proportion of predicted positive cases that are actually positive.
a / (a + b)
Negative Predictive Value : the proportion of predicted negative cases that are actually negative.
d / (c + d)

* Recall
Recall : the proportion of actual positive cases which are correctly identified.
a / (a + c)

Recall could be our primary go-to metric, since it measures how many of the truly popular articles the model identifies, including articles that sit on the borderline between popularity classes.

* Specificity
Specificity : the proportion of actual negative cases which are correctly identified.
d / (b + d)

* F-Measure
F-Measure : the harmonic mean of precision and recall.
2 * Precision * Recall / (Precision + Recall)
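The formulas above can be checked against a confusion matrix directly. In this sketch a, b, c and d are taken to be the true positives, false positives, false negatives and true negatives; the labels and predictions are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# scikit-learn orders the binary confusion matrix as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)                # negative predictive value
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, precision, npv, recall, specificity, f_measure)
```

The same numbers are available from `mt.accuracy_score`, `mt.precision_score`, `mt.recall_score` and `mt.f1_score`, which the modeling cells use.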

In [None]:
#import metrics to collect metrics for each model
from sklearn import metrics as mt

#Separating data sets for each task

#task 1 Popularity Classification
df_task1 = df_imputed[['popularity','n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'kw_min_min', 'kw_max_avg', 'kw_avg_avg', 'weekend', 'global_rate_positive_words', 'global_rate_negative_words', 'rate_negative_words']].copy()


#task 2 Shares Regression
df_task2 = df_imputed
df_task2 = df_task2.drop(['popularity'],axis=1)


#Create lists to store datapoints for accuracy, precision, recall, F-measure from each of the models
accuracy_dp = []
precision_dp = []
recall_dp = []
fmeasure_dp = []


## Modeling and Evaluation 2
Choose the method you will use for dividing your data into training and
testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why
your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.

Gino


For our analysis we will use stratified 10-fold cross validation.
Stratification keeps the proportion of popular and not popular articles roughly the same in every fold, so each of the ten test folds is representative of the full data set, and every article is used for testing exactly once. This makes the method appropriate for our classification task.

In [None]:
#stratified 10 folds
from sklearn.model_selection import StratifiedKFold

# we want to predict the X and y data as follows:
if 'popularity' in df_task1:
    y = df_task1['popularity'].values # get the labels we want
    del df_task1['popularity'] # get rid of the class label
    X = df_task1.values # use everything else to predict!
    

yhat = np.zeros(y.shape)

cv_object = StratifiedKFold(n_splits=10, random_state=1, shuffle=True) # integer seed (True was being treated as 1)
                         
print(cv_object)
cv_object.get_n_splits(X,y)
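To show what stratification buys, the sketch below builds a deliberately imbalanced set of made-up labels (30% positive, an assumption for illustration) and checks the share of positives in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 30 positive, 70 negative (illustration only)
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
# fraction of positive labels in each held-out fold
fold_rates = [y_demo[test].mean() for _, test in skf.split(X_demo, y_demo)]
print(fold_rates)
```

Every fold keeps the same class mix as the full label vector, which is the property relied on when the per-fold predictions are pooled into a single accuracy, recall, or precision score.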

## Modeling and Evaluation 3
Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!

Gino

* For Regression make sure popularity field is excluded and for classification make sure Share # is excluded!

## Task 1 : Popularity (classification) 

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# get a handle to the classifier object, which defines the type
# penalty is left at its default, which is 'l2'
lr_task1 = LogisticRegression(C=1.0, class_weight=None)

# iterate through and get predictions for each row in yhat
for train, test in cv_object.split(X,y):
    lr_task1.fit(X[train],y[train])
    yhat[test] = lr_task1.predict(X[test])

#evaluation metrics   
acc = mt.accuracy_score(y, yhat)
recall = mt.recall_score(y, yhat)
precision = mt.precision_score(y, yhat)
f = mt.f1_score(y, yhat)


#adding evaluation metrics to list for further analysis between models
accuracy_dp.append(acc)
recall_dp.append(recall)
precision_dp.append(precision)
fmeasure_dp.append(f)



#results in percentage
print("Accuracy of the model: {0:.4f}%".format(acc*100))
print("Recall of the model: {0:.4f}%".format(recall*100))
print("Precision of the model: {0:.4f}%".format(precision*100))
print("F-measure of the model: {0:.4f}%".format(f*100))


#### K Nearest Neighbors

In [None]:
# As a team we set up a KNN classifier loop to determine the best number of nearest neighbors.
# We try k = 1 through 30 and keep the k with the highest accuracy.
# Note: train and test hold the indices of the last fold from the loop above.
from sklearn.neighbors import KNeighborsClassifier
best_accuracy = 0.0
kVal = 1
for counter in range(1, 31):
    clf = KNeighborsClassifier(n_neighbors=counter)
    clf.fit(X[train],y[train])
    acc = clf.score(X[test],y[test])
    if acc > best_accuracy:
        best_accuracy = acc
        kVal = counter
neighbors = kVal
print("Best accuracy returned by the classifier is: {0:.4f}%".format(best_accuracy*100),"with k value of:",kVal)

In [None]:
# Actual training and testing of the model begins
print("The best k value:", neighbors)
knn_task1 = KNeighborsClassifier(n_neighbors=neighbors)

# iterate through and get predictions for each row in yhat
for train, test in cv_object.split(X,y):
    knn_task1.fit(X[train],y[train])
    yhat[test] = knn_task1.predict(X[test])

#evaluation metrics   
acc = mt.accuracy_score(y, yhat)
recall = mt.recall_score(y, yhat)
precision = mt.precision_score(y, yhat)
f = mt.f1_score(y, yhat)

#adding evaluation metrics to list for further analysis between models
accuracy_dp.append(acc)
recall_dp.append(recall)
precision_dp.append(precision)
fmeasure_dp.append(f)



#results in percentage
print("Accuracy of the model: {0:.4f}%".format(acc*100))
print("Recall of the model: {0:.4f}%".format(recall*100))
print("Precision of the model: {0:.4f}%".format(precision*100))
print("F-measure of the model: {0:.4f}%".format(f*100))

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# get a handle to the classifier object, which defines the type
rf_task1 = RandomForestClassifier(n_estimators=150, n_jobs=-1)

# iterate through and get predictions for each row in yhat
for train, test in cv_object.split(X,y):
    rf_task1.fit(X[train],y[train])
    yhat[test] = rf_task1.predict(X[test])

#evaluation metrics   
acc = mt.accuracy_score(y, yhat)
recall = mt.recall_score(y, yhat)
precision = mt.precision_score(y, yhat)
f = mt.f1_score(y, yhat)

#adding evaluation metrics to list for further analysis between models
accuracy_dp.append(acc)
recall_dp.append(recall)
precision_dp.append(precision)
fmeasure_dp.append(f)


#results in percentage
print("Accuracy of the model: {0:.4f}%".format(acc*100))
print("Recall of the model: {0:.4f}%".format(recall*100))
print("Precision of the model: {0:.4f}%".format(precision*100))
print("F-measure of the model: {0:.4f}%".format(f*100))

## Task 2 : Number of Shares (Regression)

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20


# we want to predict the X and y data as follows:
if 'shares' in df_task2:
    y = df_task2['shares'].values # get the target we want
    del df_task2['shares'] # get rid of the target column
    X = df_task2.values # use everything else to predict!

yhat = np.zeros(y.shape)

# shares is continuous, so use plain (unstratified) 10-fold cross validation.
# The splits are materialized once so the Lasso and Ridge cells reuse the same folds.
cv_object = list(KFold(n_splits=10).split(X))
print('Number of folds:', len(cv_object))

lr_task2 = LinearRegression()

xval_error = 0
for train, test in cv_object:
    lr_task2.fit(X[train], y[train])
    p = lr_task2.predict(X[test])
    yhat[test] = p # keep the out-of-fold predictions for the R^2 score
    e = p-y[test]
    xval_error += np.dot(e,e)

rmse_10cv = np.sqrt(xval_error/len(X))

# training error: fit on all of the data and predict the same data
lr_task2.fit(X, y)
e = lr_task2.predict(X)-y
rmse_train = np.sqrt(np.dot(e,e)/len(X))

method_name = 'Simple Linear Regression'
print('Method: %s' %method_name)
print('Root Mean Square Error on training: %.4f' %rmse_train)
print('Root Mean Square Error on 10-fold CV: %.4f' %rmse_10cv)
print('R^2 on 10-fold CV: %.4f' %mt.r2_score(y,yhat))

### Lasso

In [None]:
from sklearn.linear_model import Lasso


lr_task2 = Lasso(fit_intercept=True,alpha=0.5)

xval_error = 0
for train, test in cv_object:
    lr_task2.fit(X[train], y[train])
    p = lr_task2.predict(X[test])
    yhat[test] = p # keep the out-of-fold predictions for the R^2 score
    e = p-y[test]
    xval_error += np.dot(e,e)

rmse_10cv = np.sqrt(xval_error/len(X))

# training error: fit on all of the data and predict the same data
lr_task2.fit(X, y)
e = lr_task2.predict(X)-y
rmse_train = np.sqrt(np.dot(e,e)/len(X))

method_name = 'Lasso Regression'
print('Method: %s' %method_name)
print('Root Mean Square Error on training: %.4f' %rmse_train)
print('Root Mean Square Error on 10-fold CV: %.4f' %rmse_10cv)
print('R^2 on 10-fold CV: %.4f' %mt.r2_score(y,yhat))

### Ridge

In [None]:
from sklearn.linear_model import Ridge


lr_task2 = Ridge(fit_intercept=True,alpha=0.5)

xval_error = 0
for train, test in cv_object:
    lr_task2.fit(X[train], y[train])
    p = lr_task2.predict(X[test])
    yhat[test] = p # keep the out-of-fold predictions for the R^2 score
    e = p-y[test]
    xval_error += np.dot(e,e)

rmse_10cv = np.sqrt(xval_error/len(X))

# training error: fit on all of the data and predict the same data
lr_task2.fit(X, y)
e = lr_task2.predict(X)-y
rmse_train = np.sqrt(np.dot(e,e)/len(X))

method_name = 'Ridge Regression'
print('Method: %s' %method_name)
print('Root Mean Square Error on training: %.4f' %rmse_train)
print('Root Mean Square Error on 10-fold CV: %.4f' %rmse_10cv)
print('R^2 on 10-fold CV: %.4f' %mt.r2_score(y,yhat))

## Modeling and Evaluation 4
Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.

John
* For Regression just compare attributes selected which are significantly
* For Classification determine accuracy, etc...

## Modeling and Evaluation 5
Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.

John
Notebook - Grand Puba Notebook
Rsquared
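
One possible way to make the comparison asked for in this section is a paired comparison of per-fold accuracies, with a 95% confidence interval on the mean difference. The sketch below is an assumption about the approach, not the final method for this lab: it uses synthetic data from `make_classification`, two stand-in classifiers, and a t-based interval with k - 1 degrees of freedom. Because the folds share training data, the interval is approximate.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=600, n_features=12,
                                     n_informative=5, random_state=0)
# A fixed seed makes both models see exactly the same folds
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

acc_lr = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv_demo)
acc_knn = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_demo, y_demo, cv=cv_demo)

# Paired differences, fold by fold
diff = acc_lr - acc_knn
k = len(diff)
t_crit = stats.t.ppf(0.975, df=k - 1)            # two-sided 95% critical value
half_width = t_crit * diff.std(ddof=1) / np.sqrt(k)
ci = (diff.mean() - half_width, diff.mean() + half_width)

# The difference is called significant when the interval excludes zero
significant = not (ci[0] <= 0.0 <= ci[1])
print('mean difference: %.4f, 95%% CI: (%.4f, %.4f), significant: %s'
      % (diff.mean(), ci[0], ci[1], significant))
```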


## Modeling and Evaluation 6
Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.

John
* Discuss weights in more detail


## Deployment
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?

Scott

## Exceptional Work

You have free rein to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
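
A minimal sketch of the grid search idea, assuming scikit-learn's `GridSearchCV` with `n_jobs=-1` to spread the fits across the available cores. The synthetic data and the parameter grid are placeholders chosen for illustration, not tuned values from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=5, random_state=0)

# Hypothetical grid: 4 neighbor counts x 2 weighting schemes = 8 candidates
param_grid = {'n_neighbors': [1, 5, 15, 30],
              'weights': ['uniform', 'distance']}

search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='accuracy',
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
                      n_jobs=-1)  # run the candidate fits in parallel
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

The per-candidate scores in `search.cv_results_` can then be plotted to see which parameters matter most.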

Scott

Possible Analysis:

Deep Learning

Neural Network