Erasmia Kornelatou, f2821907

# Spark Assignment

In this assignment, you will use Spark to predict the popularity of online news.

---

> Panos Louridas, Associate Professor <br />
> Department of Management Science and Technology <br />
> Athens University of Economics and Business <br />
> louridas@aueb.gr

## The Problem

You will work with the [Online News Popularity Data Set](https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity), so go ahead and download it.

Your purpose is to predict the number of shares of an online article, based on a number of attributes. In the dataset you will see that the total number of attributes is 61:

  * 58 predictive attributes
  * 2 non-predictive
  * 1 target field
  
so you must use the 58 predictive attributes to predict the target field.

To do that you will use a [Random Forest Regressor](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression). It is similar to a Random Forest Classifier, but it predicts numerical values rather than classes (hence the name regressor, although it is not the same as statistical regression). Make sure to partition your data into training and testing datasets. At the end, you will print the Root Mean Square Error of your predictions *and* a table showing the basic statistics of the target variable, so that you can judge whether your predictions are worthwhile.
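To make the comparison between the RMSE and the basic statistics concrete, here is a minimal plain-Python sketch (not Spark code). The helper names `rmse` and `mean_baseline_rmse` and the toy share counts are hypothetical and only serve as an illustration. The idea is that always predicting the mean of the target yields an RMSE equal to the target's (population) standard deviation, so a useful model should achieve a clearly lower RMSE than that.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def mean_baseline_rmse(actual):
    """RMSE obtained by always predicting the mean of `actual`."""
    mean = sum(actual) / len(actual)
    return rmse(actual, [mean] * len(actual))


# Hypothetical share counts; the last article is an outlier.
shares = [500.0, 700.0, 900.0, 1100.0, 50000.0]
# Hypothetical model output for the same articles.
model_predictions = [600.0, 800.0, 1000.0, 1200.0, 3000.0]

model_rmse = rmse(shares, model_predictions)
baseline_rmse = mean_baseline_rmse(shares)
print("model RMSE:", model_rmse)
print("mean-baseline RMSE:", baseline_rmse)
```

If the model RMSE is not clearly below the baseline RMSE, the predictions add little over simply reporting the mean.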

In [49]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import desc

In [50]:
news_popularity = spark\
  .read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv("D:/Users/astar/Desktop/spark-5-5-20/OnlineNewsPopularity/OnlineNewsPopularity/OnlineNewsPopularity.csv")

Let's now have a look at the variables of the dataset. For a more readable output, we limit the DataFrame to 10 rows and call the `toPandas()` action, which collects the rows and returns a pandas DataFrame.

In [51]:
news_popularity.limit(10).toPandas()

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,0.1,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593.0
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,0.033333,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711.0
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,...,0.1,1.0,-0.466667,-0.8,-0.133333,0.0,0.0,0.5,0.0,1500.0
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,0.136364,0.8,-0.369697,-0.6,-0.166667,0.0,0.0,0.5,0.0,1200.0
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,...,0.033333,1.0,-0.220192,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,505.0
5,http://mashable.com/2013/01/07/beewi-smart-toys/,731.0,10.0,370.0,0.559889,1.0,0.698198,2.0,2.0,0.0,...,0.136364,0.6,-0.195,-0.4,-0.1,0.642857,0.214286,0.142857,0.214286,855.0
6,http://mashable.com/2013/01/07/bodymedia-armba...,731.0,8.0,960.0,0.418163,1.0,0.549834,21.0,20.0,20.0,...,0.1,1.0,-0.224479,-0.5,-0.05,0.0,0.0,0.5,0.0,556.0
7,http://mashable.com/2013/01/07/canon-poweshot-n/,731.0,12.0,989.0,0.433574,1.0,0.572108,20.0,20.0,20.0,...,0.1,1.0,-0.242778,-0.5,-0.05,1.0,0.5,0.5,0.5,891.0
8,http://mashable.com/2013/01/07/car-of-the-futu...,731.0,11.0,97.0,0.670103,1.0,0.836735,2.0,0.0,0.0,...,0.4,0.8,-0.125,-0.125,-0.125,0.125,0.0,0.375,0.0,3600.0
9,http://mashable.com/2013/01/07/chuck-hagel-web...,731.0,10.0,231.0,0.636364,1.0,0.797101,4.0,1.0,1.0,...,0.1,0.5,-0.238095,-0.5,-0.1,0.0,0.0,0.5,0.0,710.0


Let's have a look at the types of our variables.

In [52]:
news_popularity.printSchema()

root
 |-- url: string (nullable = true)
 |--  timedelta: double (nullable = true)
 |--  n_tokens_title: double (nullable = true)
 |--  n_tokens_content: double (nullable = true)
 |--  n_unique_tokens: double (nullable = true)
 |--  n_non_stop_words: double (nullable = true)
 |--  n_non_stop_unique_tokens: double (nullable = true)
 |--  num_hrefs: double (nullable = true)
 |--  num_self_hrefs: double (nullable = true)
 |--  num_imgs: double (nullable = true)
 |--  num_videos: double (nullable = true)
 |--  average_token_length: double (nullable = true)
 |--  num_keywords: double (nullable = true)
 |--  data_channel_is_lifestyle: double (nullable = true)
 |--  data_channel_is_entertainment: double (nullable = true)
 |--  data_channel_is_bus: double (nullable = true)
 |--  data_channel_is_socmed: double (nullable = true)
 |--  data_channel_is_tech: double (nullable = true)
 |--  data_channel_is_world: double (nullable = true)
 |--  kw_min_min: double (nullable = true)
 |--  kw_max_min: do

As described in the assignment, there are 58 predictive variables and 2 non-predictive variables. The non-predictive variables are:
* url: URL of the article
* timedelta: Days between the article publication and the dataset acquisition

We are going to remove those 2 variables.

In [53]:
# Column names in this CSV (except 'url') start with a space, hence ' timedelta'.
newsDropped = news_popularity.drop('url', ' timedelta')
newsDropped.limit(10).toPandas()

Unnamed: 0,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,average_token_length,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,0.0,4.680365,...,0.1,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593.0
1,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,0.0,4.913725,...,0.033333,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711.0
2,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,0.0,4.393365,...,0.1,1.0,-0.466667,-0.8,-0.133333,0.0,0.0,0.5,0.0,1500.0
3,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,0.0,4.404896,...,0.136364,0.8,-0.369697,-0.6,-0.166667,0.0,0.0,0.5,0.0,1200.0
4,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,0.0,4.682836,...,0.033333,1.0,-0.220192,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,505.0
5,10.0,370.0,0.559889,1.0,0.698198,2.0,2.0,0.0,0.0,4.359459,...,0.136364,0.6,-0.195,-0.4,-0.1,0.642857,0.214286,0.142857,0.214286,855.0
6,8.0,960.0,0.418163,1.0,0.549834,21.0,20.0,20.0,0.0,4.654167,...,0.1,1.0,-0.224479,-0.5,-0.05,0.0,0.0,0.5,0.0,556.0
7,12.0,989.0,0.433574,1.0,0.572108,20.0,20.0,20.0,0.0,4.617796,...,0.1,1.0,-0.242778,-0.5,-0.05,1.0,0.5,0.5,0.5,891.0
8,11.0,97.0,0.670103,1.0,0.836735,2.0,0.0,0.0,0.0,4.85567,...,0.4,0.8,-0.125,-0.125,-0.125,0.125,0.0,0.375,0.0,3600.0
9,10.0,231.0,0.636364,1.0,0.797101,4.0,1.0,1.0,1.0,5.090909,...,0.1,0.5,-0.238095,-0.5,-0.1,0.0,0.0,0.5,0.0,710.0
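The schema output above shows that every column name after `url` begins with a space, which is why the notebook refers to `' timedelta'` and `' shares'`. As an optional alternative, the names could be stripped once after loading. The sketch below is plain Python with a hypothetical helper `strip_column_names` and a shortened, illustrative list of columns. The rest of this notebook keeps the original names.

```python
def strip_column_names(columns):
    """Return the column names with surrounding whitespace removed."""
    return [name.strip() for name in columns]


# A shortened, illustrative subset of the dataset's column names.
raw_columns = ["url", " timedelta", " n_tokens_title", " shares"]
clean_columns = strip_column_names(raw_columns)
print(clean_columns)

# With a Spark DataFrame, the cleaned names could be applied with toDF, e.g.:
#   news_popularity = news_popularity.toDF(*strip_column_names(news_popularity.columns))
```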


* We need to collect all feature columns in one vector. 

* The feature columns are all columns except the last one, which is the target (` shares`).

* We will use a `VectorAssembler` to create the new column containing the features vector.

In [60]:
assembler = VectorAssembler(
    inputCols=newsDropped.columns[:-1],  # every column except the target
    outputCol='features')

ml_data = assembler.transform(newsDropped)
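As a rough analogy for what the assembling step produces for each row, the plain-Python sketch below gathers the values of the input columns into a single list. The function `assemble` is hypothetical and is not how Spark implements `VectorAssembler`. The row uses three values from the first record shown earlier.

```python
def assemble(row, input_cols):
    """Collect the values of `input_cols` from `row` into one feature list."""
    return [float(row[col]) for col in input_cols]


# Three columns of the first record; the last column is the target.
row = {" n_tokens_title": 12.0, " n_tokens_content": 219.0, " shares": 593.0}
feature_cols = list(row)[:-1]  # all columns except the last one
features = assemble(row, feature_cols)
print(feature_cols, features)
```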

* As in all Machine Learning tasks, we will split the dataset into two parts, one for training and one for testing.

In [66]:
(training_data, test_data) = ml_data.randomSplit([0.7, 0.3],seed = 80)

* We create a `RandomForestRegressor`, specifying the features and the target (label) column.

In [67]:
dt = RandomForestRegressor(labelCol=" shares", featuresCol="features", seed = 80)

* We create a model by fitting with the training data.

In [68]:
model = dt.fit(training_data)

* Having created a Random Forest, we make our predictions using `transform()`.

In [69]:
predictions = model.transform(test_data)
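For regression, a random forest predicts the average of the values predicted by its individual trees. The following plain-Python sketch illustrates only that averaging step. The three "trees" are hypothetical hand-written functions, not trees learned by Spark.

```python
def forest_predict(trees, features):
    """Average the predictions of the individual trees."""
    tree_predictions = [tree(features) for tree in trees]
    return sum(tree_predictions) / len(tree_predictions)


# Hypothetical "trees": each maps a feature list to a predicted value.
trees = [
    lambda x: 1000.0 if x[0] < 10 else 3000.0,
    lambda x: 1500.0 if x[1] < 300 else 2500.0,
    lambda x: 2000.0,
]

prediction = forest_predict(trees, [12.0, 219.0])
print(prediction)
```

Because the output is an average, extreme values tend to be smoothed out, which is worth keeping in mind when the target contains large outliers.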

In [70]:
evaluator_rf = RegressionEvaluator(
    labelCol=" shares",
    predictionCol="prediction", 
    metricName="rmse")

rmse = evaluator_rf.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 12762.5


Here are the 10 highest predictions, shown next to the actual number of shares:

In [44]:
# Sort in Spark and keep only 10 rows before collecting them into pandas.
predictions.select(' shares','prediction').sort(desc('prediction')).limit(10).toPandas()

| | shares | prediction |
|---|---|---|
| 0 | 678.0 | 49821.056392 |
| 1 | 1800.0 | 39835.039539 |
| 2 | 41800.0 | 39803.338588 |
| 3 | 3900.0 | 39538.184662 |
| 4 | 1300.0 | 39190.097318 |
| 5 | 3500.0 | 38891.275689 |
| 6 | 2200.0 | 38031.486547 |
| 7 | 1300.0 | 37851.061639 |
| 8 | 665.0 | 37247.108692 |
| 9 | 4200.0 | 36115.879824 |


The basic statistics of the target variable and of the predictions on the test data are:

In [42]:
predictions.select(' shares','prediction').describe().toPandas()

| | summary | shares | prediction |
|---|---|---|---|
| 0 | count | 11962.0 | 11962.0 |
| 1 | mean | 3520.413893997659 | 3392.371876502913 |
| 2 | stddev | 12674.289328700464 | 2179.396414703895 |
| 3 | min | 28.0 | 1707.4688436677673 |
| 4 | max | 663600.0 | 49821.05639155982 |


The basic statistics of the target variable over the whole dataset are:

In [43]:
newsDropped.select(' shares').describe().toPandas()

| | summary | shares |
|---|---|---|
| 0 | count | 39644.0 |
| 1 | mean | 3395.380183634345 |
| 2 | stddev | 11626.95074865171 |
| 3 | min | 1.0 |
| 4 | max | 843300.0 |
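To put the figures above side by side, the short sketch below simply redoes the arithmetic on the numbers reported in this notebook. The variable names are illustrative. It compares the RMSE with the spread of the actual shares in the test data, and the spread of the predictions with the spread of the actual values.

```python
# Values copied from the outputs reported in this notebook.
rmse = 12762.5
test_count = 11962
total_count = 39644
test_stddev = 12674.289328700464   # stddev of ' shares' in the test data
pred_stddev = 2179.396414703895    # stddev of 'prediction' in the test data

# Share of the rows that ended up in the test set (the split asked for 0.3).
test_fraction = test_count / total_count

# A ratio near 1 means the error is about as large as the target's own spread.
rmse_to_stddev = rmse / test_stddev

# How much of the target's spread the predictions reproduce.
spread_ratio = pred_stddev / test_stddev

print("test fraction: %.4f" % test_fraction)
print("RMSE / stddev of shares: %.4f" % rmse_to_stddev)
print("stddev of predictions / stddev of shares: %.4f" % spread_ratio)
```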


### Conclusion
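The RMSE on the test data (12762.5) is about as large as the standard deviation of the actual shares in the test data (12674.3), so the model does no better than always predicting the mean. Based on these statistics and the high RMSE, we conclude that the outliers in the number of shares severely affect the predictive ability of our Random Forest model and make our predictions practically worthless.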