# IS6493 - Data Science & Big Data
# Final Report
# Clickstream Analysis for Outbrain 



---


*Submitted by: Ann-Kathrin Breuning (u1158200), Srivatsa Gangadhara (u1087996), Evelyn Mietzner (u0573308), Himanshi Sharma (u1168592)*




# Introduction


---


## Background
Outbrain is an online advertiser that connects businesses with their target audience by presenting the businesses’ ads on third-party websites. Outbrain charges its clients on a pay-per-click basis; about half of the revenue goes to the third-party website that presented the ad. To present only the most relevant content to the third-party website's visitors, Outbrain uses a recommendation algorithm. This project aims to help Outbrain improve this algorithm so that it presents more relevant content to individuals.

## Goal

Analyzing clickstream data with web analytics procedures is a useful tool for enhancing a
B2B website, because it shows how visitors move through the website's conversion process. Determining which
content a user is likely to click can earn huge revenues for an ad aggregator like Outbrain: when relevant
content is inserted into a slot, that content generates more clicks and, in turn, more revenue for the company.

## Business Questions

- Which content category generates the most revenue?
- What are the most popular ads?
- What factors influence the click-through rate?
- Where are the visitors geographically located?
- Overall goal: Which pieces of content are more likely to be clicked on?

## Value proposition and Benefits
Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in
turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.
This can add value in the following ways:
1. Cost reduction - Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data and they can identify more efficient ways of
doing business

2. Faster, better decision making - With the ability to analyze new sources of data, businesses can analyze
information immediately – and make decisions based on what they’ve learned

3. New products and services - With the ability to gauge customer needs and satisfaction through analytics
comes the power to give customers what they need and want without them asking for it

4. For our project, the most beneficial findings include:
    - Highest number of visitors to a content
    - Popular content category based on clicks
    - Popular content category based on geographical area
    
Failing to achieve this risks fewer click conversions and hence decreased revenue.

## Data Mining Tasks
Our business focus is determining which ads work and which do not, as this is central for many
advertising firms that strive to find the best revenue-generating ads. In order to categorize ads into
profitable and non-profitable classes, we first performed exploratory data analysis to understand the
underlying relationships between a website, a campaign, and the set of recommendations Outbrain is providing.


![Graph showing the ads distribution in train data](https://lh4.googleusercontent.com/6O3pYphjiHoujGGcf8ahGqKYU9JNv4tbuiI7hLGmxcoNJJyjfTXcMYkaM3hcyG5qKiVwaiNpBsJg3cirx6Po=w1280-h694)

The graph above shows the distribution of the number of ads per display; the largest share of displays (~30%) contains 4 ads.
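
The distribution behind such a chart can be computed by counting the ads in each display and then counting how many displays share each size. The following is a minimal sketch in plain Python on made-up rows; the function name `ads_per_display_share` and the toy data are illustrative assumptions, not part of the project code.

```python
from collections import Counter

def ads_per_display_share(rows):
    """rows: iterable of (display_id, ad_id) pairs.
    Returns {number_of_ads: share_of_displays_with_that_many_ads}."""
    # how many ads does each display contain?
    ads_in_display = Counter(display_id for display_id, _ in rows)
    # how many displays have 2 ads, 3 ads, 4 ads, ...?
    size_counts = Counter(ads_in_display.values())
    total_displays = sum(size_counts.values())
    return {size: count / total_displays
            for size, count in sorted(size_counts.items())}

# hypothetical toy data: displays 1 and 3 show four ads, display 2 shows two
toy_rows = [(1, 10), (1, 11), (1, 12), (1, 13),
            (2, 10), (2, 14),
            (3, 11), (3, 12), (3, 15), (3, 16)]
shares = ads_per_display_share(toy_rows)
print(shares)
```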

The graph below shows the distribution of clicked versus non-clicked ads in the training set.

![Graph showing if ads were clicked or not](https://lh5.googleusercontent.com/88yePL8qh5MnfWjvu6swPdXVNq3Q1cWqL9nIbP-h_rsnCwFwQz-5WJqLF4EuhUJcrnc3ych-PyoRS473Xieg=w1280-h694)

 
The graph above shows the click percentage of ads in displays.
 

![Graph showing the clicking percentage of ads in ad groups](https://lh3.googleusercontent.com/w7gCFZY3Z1NEcDMf6NuAhTMN1jUQ3jeXUHipPiTM6HCxXsUgIXLVqRfczurOu1HfpEVVn43xwIikhQK6mVOk=w1280-h694)

Second, using the relationships explored above and clicked (1 if the ad is clicked and 0 otherwise) as the target
variable, we can use classification techniques such as Logistic Regression and Random Forest to estimate the
likelihood of a person clicking a recommended ad.

# Data


---



## Dataset
The data set for the project is from a data challenge by Outbrain on Kaggle.com
[Kaggle Outbrain](https://www.kaggle.com/c/outbrain-click-prediction).

## Metadata
The dataset for the competition comprises a sample of users’ page views and clicks from June 14, 2016 to
June 28, 2016, collected from various publisher sites in the US. The following fields are common among the
various files of the competition data:
- Unique id (uuid): Each user is represented by this unique id.
- Document id (document_id): Represents the document/webpage a user visits.
- Ad id (ad_id): Represents an ad that is displayed on the document.
- Campaign id (campaign_id): Represents the campaign of a particular ad on the document.
- Advertiser id (advertiser_id): Represents the advertiser running the campaigns.





# Prepared Data and Data Mining Results


---



## Data Preparation

Before we can start the analysis we have to prepare our data. The following section is dedicated to that. 

In [0]:
# Packages needed for data preparation (and some data exploration)
import csv
from pyspark.sql.functions import col, count, countDistinct

from pyspark.ml.feature import RFormula

In [0]:
clicks_train = spark.read.csv("clicks_train.csv", header = True, inferSchema="true")
clicks_train.printSchema()
clicks_train.show(5)


In [0]:
#decrease size
clicks_train = clicks_train.where(col("ad_id")<500)

In [0]:
events = spark.read.csv("events.csv", header = True, inferSchema = "true")

In [0]:
## Join by display_id
# Join expression
joinExpression = clicks_train["display_id"] == events["display_id"]
# Join type
joinType = "inner"
# Join command
data = clicks_train.join(events, joinExpression, joinType)
# Drop redundant columns
data = data.drop(events["display_id"])
data.printSchema()
data.show(5)

In [0]:
promoted_content = spark.read.csv("promoted_content.csv", header = True, inferSchema = "true")
promoted_content.printSchema()
promoted_content.show(5)
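# promoted_content has no display_id column (its columns are ad_id, document_id,
# campaign_id and advertiser_id), so documents are counted instead of displays
promoted_content.select(count("ad_id").alias("Number of ads"), 
                       count("document_id").alias("Number of documents"),
                       count("campaign_id").alias("Number of campaigns"), 
                       count("advertiser_id").alias("Number of advertiser")).show()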

In [0]:
clicks_test = spark.read.csv("clicks_test.csv", header = True, inferSchema = "true")
clicks_test.printSchema()
clicks_test.show(5)
clicks_test.select(count("ad_id").alias("Number of ads"), 
                   count("display_id").alias("Number of displays")).show()

In [0]:
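## Join by document_id
# Load the topics table first; it was not read in any cell above
# (file name assumed to be the competition's documents_topics.csv)
documents_topics = spark.read.csv("documents_topics.csv", header = True, inferSchema = "true")
# Join expression
joinExpression = data["document_id"] == documents_topics["document_id"]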
# Join type
joinType = "inner"
# Join command
data = data.join(documents_topics, joinExpression, joinType)
# Drop redundant columns
data = data.drop(documents_topics["document_id"])
data.printSchema()
data.show(5)

In [0]:
data.select(countDistinct("ad_id")).show()                  #many different categories so we chose not to
data.select(countDistinct("geo_location")).show()           #show the distribution because of lack of space


In [0]:
#distribution
data.groupBy("clicked").count().show()
data.groupBy("platform").count().show()



## Data Mining methods selected

1. Data Mining Methods: Simple Aggregation, Logistic Regression, Cost Sensitive Analysis, Random Forest Classification
These methods were used because we wanted to classify our target variable "clicked" into two classes - 0: not clicked and 1: clicked.

2. Goal: Predict future clicks
      + We follow the Kaggle instructions and train our Random Forest Classification (and Simple Aggregation) on training data (formed by joining clicks_train and promoted_content) to predict future clicks on the provided test dataset (formed by joining clicks_test and promoted_content).
      + The logistic model and the cost sensitive model don't use the given test set at all; instead they use a join of the clicks_train, events, and documents_topics tables to predict future clicks based on a broader range of independent variables. The given test set only contains display_id and ad_id.

## Data Mining Models
### Logistic Regression

In [0]:
# original selection
data1 = data.select('clicked', 'ad_id', 'geo_location', 'platform')

In [0]:
formula = RFormula(
    formula="clicked ~ . + ad_id:geo_location",   # all columns plus the ad_id x geo_location interaction
    featuresCol="features",
    labelCol="label")

df1 = formula.fit(data1).transform(data1)
df1.show(5)
train3, test3 = df1.select('label', 'features').randomSplit([0.7, 0.3], seed=2018)

### Cost Sensitive Model
The data is very unbalanced in terms of the positive class (click) versus the negative class (no click). In other words, the data contains far more cases in which the dependent variable *clicked* equals 0, meaning no click, than clicks. This has been an issue for classification models for quite some time. Cost sensitive models offer a solution to the problem by associating a cost with a false prediction - in the case of Outbrain, the false positive (predicting a click that actually isn't one) and the false negative (predicting no click although it is a click) are each linked to a cost. The higher the cost attached to a wrong decision, the more cautious the model becomes about making it.
 
PySpark currently doesn't support cost sensitive models. Therefore, the following code was written in R and is only executable in R.

In [0]:
#Cost Sensitive Model
library(RWeka)
data <- read.csv("data.csv")           #data is the join of clicks, events, and document_topic above
str(data)
data$clicked <- factor(data$clicked)
data$ad_id <- factor(data$ad_id)
data$platform <- factor(data$platform)
data <- data[,-c(1,7)]
data1 <- data[,c(2,3,6,7)]


# cost matrix 1

# Create a cost matrix 
matrix_dimensions <- list(c("True No-Click", "True Click"), c("Predicted No-Click", "Predicted Click"))

costMatrix2 <- matrix(c(0,3,1,0),nrow=2,dimnames = matrix_dimensions)

costMatrix2

#  Configure the model using this cost matrix 
j48_cs_model2 <- CostSensitiveClassifier(clicked ~ ., data=data1,control=Weka_control('cost-matrix'=costMatrix2,
                                                                                      W="weka.classifiers.trees.J48"))
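
To make the effect of the cost matrix above more concrete, the following plain-Python sketch applies the minimum-expected-cost rule to a predicted click probability, using the same costs (3 for a false negative, 1 for a false positive). It only illustrates the idea; it is not necessarily how Weka's CostSensitiveClassifier applies the matrix internally, and the function names are illustrative assumptions.

```python
# Costs taken from the cost matrix in the text: a missed click (false negative)
# costs 3, a wrongly predicted click (false positive) costs 1.
COST_FALSE_NEGATIVE = 3.0
COST_FALSE_POSITIVE = 1.0

def expected_costs(p_click):
    """Expected cost of each possible prediction, given P(click) = p_click."""
    cost_if_predict_click = (1.0 - p_click) * COST_FALSE_POSITIVE
    cost_if_predict_no_click = p_click * COST_FALSE_NEGATIVE
    return cost_if_predict_click, cost_if_predict_no_click

def min_cost_prediction(p_click):
    """Return 1 (click) if predicting a click is strictly cheaper, else 0."""
    cost_click, cost_no_click = expected_costs(p_click)
    return 1 if cost_click < cost_no_click else 0

# With these costs the decision threshold drops from 0.5 to FP / (FP + FN)
threshold = COST_FALSE_POSITIVE / (COST_FALSE_POSITIVE + COST_FALSE_NEGATIVE)
print(threshold, min_cost_prediction(0.3), min_cost_prediction(0.2))
```

Because a missed click is three times as expensive as a false alarm, an ad is labeled as a click as soon as its estimated click probability exceeds one quarter instead of one half.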

## Performance

### Logistic Regression

In [0]:
# Define a function to generate classification metrics
from pyspark.ml.classification import LogisticRegression   # needed for the model below
from pyspark.mllib.evaluation import MulticlassMetrics

# Binary Classification
def B_Classifier_evaluator (predictions):
    predictionAndLabels =predictions.select("prediction", "label").rdd
    metrics = MulticlassMetrics(predictionAndLabels)
    print("Precision(no):", round(metrics.precision(0.0),3))
    print("Recall(no):", round(metrics.recall(0.0),3))
    print("F-Score(no):", round(metrics.fMeasure(0.0, beta=1.0),3))
    print("Precision(yes):", round(metrics.precision(1.0),3))
    print("Recall(yes):", round(metrics.recall(1.0),3))
    print("F-Score(yes):", round(metrics.fMeasure(1.0, beta=1.0),3))
    print("Accuracy:", round(metrics.accuracy,3))   
    
# logistic regression with interaction
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features')

# training the model
fittedLR = lr.fit(train3)

# applying the model to the test set
lr_test_results = fittedLR.transform(test3)

# evaluating the models performance
B_Classifier_evaluator(lr_test_results)
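
The per-class measures printed by the evaluator can be derived from the four cells of a confusion matrix. The sketch below uses plain Python and hypothetical counts (1000 rows, 200 of them clicks, and a model that never predicts a click) to show why accuracy alone is misleading on unbalanced data; the helper name `class_metrics` and the counts are assumptions for illustration only.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts.
    Undefined ratios (zero denominator) are reported as 0.0 in this sketch."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# hypothetical counts: every row is predicted as "no click"
tn, fp, fn, tp = 800, 0, 200, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
metrics_yes = class_metrics(tp=tp, fp=fp, fn=fn)
# for the "no" class the roles are mirrored: tn are its hits,
# fn are rows wrongly labeled "no", fp are "no" rows that were missed
metrics_no = class_metrics(tp=tn, fp=fn, fn=fp)

print("Accuracy:", accuracy)
print("yes (precision, recall, F1):", metrics_yes)
print("no  (precision, recall, F1):", metrics_no)
```

In this toy case the accuracy equals the share of non-clicks, while precision, recall, and F-score for the click class are all zero.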

### Cost Sensitive Model

In [0]:
# Build and evaluate cost-sensitive models based on the configuration above via 10-fold cross validation using evaluate_Weka_classifier
eval_j48_cs2 <- evaluate_Weka_classifier(j48_cs_model2,data1,numFolds = 10, complexity = FALSE,seed = 1, class = TRUE)

eval_j48_cs2




### Code For Simple Aggregation Model

In [0]:
#loading the modules
import pandas as pd
import numpy as np

In [0]:
## defining the data types
dtypes = {'ad_id': np.float32, 'clicked': np.int8}

In [0]:
# loading train data
train = pd.read_csv("clicks_train.csv", usecols=['ad_id','clicked'], dtype=dtypes)

In [0]:
# calculating per-ad click statistics (count, sum, mean) and the overall click rate
likelihood = train.groupby('ad_id').clicked.agg(['count','sum','mean']).reset_index()
mean = train.clicked.mean()
del train

In [0]:
likelihood['likelihood'] = (likelihood['sum'] + 12*mean) / (12 + likelihood['count'])
test = pd.read_csv("clicks_test.csv")
test = test.merge(likelihood, how='left')
test['likelihood'] = test['likelihood'].fillna(mean)  # ads unseen in training get the overall click rate

In [0]:
test.sort_values(['display_id','likelihood'], inplace=True, ascending=False)
subm = test.groupby('display_id').ad_id.apply(lambda x: " ".join(map(str,x))).reset_index()
subm.to_csv("simple_submission.csv", index=False)
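
The likelihood formula in the cell above is a smoothed click rate: each ad's clicks are blended with 12 pseudo-views at the overall click rate, so ads with few views are pulled toward the average. A small worked example in plain Python follows; the overall click rate of 0.2 and the view counts are hypothetical values chosen for illustration.

```python
PRIOR_WEIGHT = 12  # the constant used in the likelihood formula above

def smoothed_likelihood(clicks, views, global_rate, prior_weight=PRIOR_WEIGHT):
    """(clicks + prior_weight * global_rate) / (prior_weight + views)"""
    return (clicks + prior_weight * global_rate) / (prior_weight + views)

global_rate = 0.2  # hypothetical overall click rate

# an ad seen once and clicked once: raw rate 1.0, strongly shrunk toward 0.2
rare_ad = smoothed_likelihood(clicks=1, views=1, global_rate=global_rate)
# an ad seen 1000 times with 500 clicks: raw rate 0.5, barely changed
popular_ad = smoothed_likelihood(clicks=500, views=1000, global_rate=global_rate)
# an ad never seen in training falls back to the overall rate
unseen_ad = smoothed_likelihood(clicks=0, views=0, global_rate=global_rate)

print(round(rare_ad, 4), round(popular_ad, 4), round(unseen_ad, 4))
```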

### Code For Logistic Regression

In [0]:
# baseline logistic regression on original selection
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features')

# training the model
base_fittedLR = lr.fit(train1)

# applying the model to the test set
base_lr_test_results = base_fittedLR.transform(test1)

# evaluating the models performance
B_Classifier_evaluator(base_lr_test_results)

In [0]:
# baseline logistic regression on final selection (same performance as original selection)
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features')

# training the model
base_fittedLR = lr.fit(train2)

# applying the model to the test set
base_lr_test_results = base_fittedLR.transform(test2)

# evaluating the models performance
B_Classifier_evaluator(base_lr_test_results)

In [0]:
# logistic regression with interaction term
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features')

# training the model
model1_fittedLR = lr.fit(train3)

# applying the model to the test set
model1_lr_test_results = model1_fittedLR.transform(test3)

# evaluating the models performance
B_Classifier_evaluator(model1_lr_test_results)


In [0]:
# logistic regression with interaction term and parameters different from the defaults
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features', maxIter=10, regParam=0.5, elasticNetParam=0.5)

# training the model
model2_fittedLR = lr.fit(train3)

# applying the model to the test set
model2_lr_test_results = model2_fittedLR.transform(test3)

# evaluating the models performance
B_Classifier_evaluator(model2_lr_test_results)

We changed the parameters to many different values, but all of them performed worse than the default. The above model is just one example and stands for all the models we tested. It is also one of the worst performers: it doesn't predict any clicks at all.

In [0]:
# baseline logistic regression modeled and evaluated on similar format as the competition's train and test set
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features')

# training the model
base_fittedLR = lr.fit(train4)

# applying the model to the test set
base_lr_test_results = base_fittedLR.transform(test4)

# evaluating the models performance
B_Classifier_evaluator(base_lr_test_results)

The baseline logistic regression is useless to Outbrain when clicked can only be explained by ad_id and display_id. It doesn't predict any clicks at all.

### Code For Random Forest

In [0]:
#importing modules
import numpy as np # linear algebra
import pandas as pd # data processing
from sklearn.ensemble import RandomForestClassifier

In [0]:
# Preparing the data

print ("Get tables to combine")
content = spark.read.csv("/FileStore/tables/promoted_content.csv", header = True)
train = spark.read.csv("/FileStore/tables/clicks_train.csv", header = True)
train.printSchema()
train = train.selectExpr("display_id as display_id", "ad_id as ad", "clicked as clicked")

# Building the model by iterating over chunks of training data
# Convert to pandas once; calling toPandas() inside the loop would reload everything per chunk
train_pd = train.toPandas()
content_pd = content.toPandas()
chunk_rows = 9000000
for start in range(0, len(train_pd), chunk_rows):
  # the last chunk is simply shorter, so no row is skipped or read twice
  chunk = train_pd[start:start + chunk_rows]
  print('Taking the chunk in to build the model')
  print(chunk.head(5))
  # left join with promoted_content on the ad id
  chunk = pd.merge(chunk, content_pd, how='left', left_on='ad', right_on='ad_id')
  # collecting the predictors' names
  predictors=[x for x in chunk.columns if x not in ['display_id','clicked', 'ad_id']]
  chunk=chunk.fillna(0.0)
  # building the model
  # note: a new classifier is created for every chunk, so the model kept at the end
  # has only been fitted on the last chunk
  alg = RandomForestClassifier(random_state=1, n_estimators=3, min_samples_split=4, min_samples_leaf=2, warm_start=True)
  alg.fit(chunk[predictors], chunk["clicked"])#Fit the Algorithm
  print('model fitted on the chunk')


In [0]:
alg
##Out[11]: 
#RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
#       max_depth=None, max_features='auto', max_leaf_nodes=None,
#           min_impurity_split=1e-07, min_samples_leaf=2,
#           min_samples_split=4, min_weight_fraction_leaf=0.0,
#            n_estimators=3, n_jobs=1, oob_score=False, random_state=1,
#            verbose=0, warm_start=True)

In [0]:
# predicting the clicked likelihood on clicks_test.csv
train=''
print('Testing')
test=spark.read.csv('/FileStore/tables/clicks_test.csv',header=True, inferSchema = "true") #Load data
test = test.selectExpr("display_id as display_id", "ad_id as ad")
predY=[]
print("Merging with content")
joinExpression = test["ad"] == content["ad_id"]
# Join type
joinType = "left"
# Join command
chunk = test.join(content, joinExpression, joinType)
chunk=chunk.fillna(0.0)
print("predicting")
chunk = chunk.toPandas()
#predicting
chunk_pred=list(alg.predict_proba(chunk[predictors]).astype(float)[:,1])
predY += chunk_pred
print("done predicting")

In [0]:
# Preparing the results in submission format for submitting to kaggle
## scores with this submission : 0.48569
print('Preparing for Submission')
filename = "submission"
test=''#We do not want the iterable version of test
test=spark.read.csv('/FileStore/tables/clicks_test.csv',header=True, inferSchema = "true") #Load data
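# Use the ids of the joined frame the predictions were computed on; the Spark join does not guarantee the row order of clicks_test
results=pd.concat((chunk[['display_id','ad']],pd.DataFrame(predY)) ,axis=1,ignore_index=True)#Combine the predicted values with the ids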
print(results.head(10))
results.rename(
  columns={
    0 : 'display_id',
    1 : 'ad_id',
    2 :  'clicked'
  },
  inplace=True
)
print(results.head(10))
#results=results[results['clicked'] > 0.0]
results = results.sort_values(by=['display_id','clicked'], ascending=[True, False])
results = results.reset_index(drop=True)
results=results[['display_id','ad_id']].groupby('display_id')['ad_id'].agg(lambda col: ' '.join(map(str,col))).reset_index() # back to a DataFrame with display_id and ad_id columns
print(results.head(5))
results=results[['display_id','ad_id']]
print(results.head(5))
print("creating csv")
results.to_csv('/dbfs/FileStore/tables/predict.csv', sep=',', header=True, index=False)



In [0]:
## Local evaluating the model by splitting the training data into two sets
print ("Get tables to combine")
train = spark.read.csv("/FileStore/tables/clicks_train.csv", header = True)
schema = train.printSchema()

training, testing = train.randomSplit([0.7, 0.3], seed=2018)

In [0]:
# training model on set of train data
# Convert to pandas once instead of once per chunk
training_pd = training.toPandas()
content_pd = content.toPandas()
chunk_rows = 9000000
for start in range(0, len(training_pd), chunk_rows):
  # the last chunk is simply shorter, so every row is used exactly once
  chunk = training_pd[start:start + chunk_rows]
  print('Taking the chunk in to build the model')
  print(chunk.head(5))
  chunk=pd.merge(chunk,content_pd,how='left',on='ad_id')
  predictors=[x for x in chunk.columns if x not in ['display_id','clicked', 'ad_id']]
  chunk=chunk.fillna(0.0)
  # note: a new classifier is created for every chunk, so the model kept at the end
  # has only been fitted on the last chunk
  alg_evaluating = RandomForestClassifier(random_state=1, n_estimators=3, min_samples_split=4, min_samples_leaf=2, warm_start=True)
  alg_evaluating.fit(chunk[predictors], chunk["clicked"])#Fit the Algorithm
  print('model fitted on the chunk')
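
The chunked training loops above depend on row ranges that cover the data exactly once. One way to make that bookkeeping explicit is a small generator of half-open ranges, sketched here in plain Python; `chunk_bounds` is a hypothetical helper and not part of the project code.

```python
def chunk_bounds(total_rows, chunk_rows):
    """Yield (start, stop) half-open row ranges that together cover
    range(total_rows) exactly once; the last range may be shorter."""
    for start in range(0, total_rows, chunk_rows):
        yield start, min(start + chunk_rows, total_rows)

# small example: 25 rows in chunks of 10
bounds = list(chunk_bounds(total_rows=25, chunk_rows=10))
print(bounds)  # [(0, 10), (10, 20), (20, 25)]
```

Each pair can be used directly as a slice, for example `frame[start:stop]` on a pandas DataFrame, without skipping the row at a chunk border.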

In [0]:
alg_evaluating

In [0]:
train=''
print('Predicting on the held-out part of the training set')
testing.printSchema()
train_predY=[]
print("Merging with content")
# testing still has the original ad_id column (it was not renamed to "ad" here),
# so it is merged the same way as the training chunks above
chunk = pd.merge(testing.toPandas(), content.toPandas(), how='left', on='ad_id')
chunk=chunk.fillna(0.0)
print("predicting")
# astype(int) truncates the probabilities, so only a probability of exactly 1.0 counts as a predicted click
chunk_pred_train=list(alg_evaluating.predict_proba(chunk[predictors]).astype(int)[:,1])
train_predY += chunk_pred_train
print("done predicting")

In [0]:
#Evaluating
print(testing.head(5))
trainResults=pd.concat((testing.toPandas(),pd.DataFrame(train_predY)) ,axis=1,ignore_index=True)#Combine the predicted values with the ids
print(trainResults.head(5))
trainResults.rename(
  columns={
    0 : 'display_id',
    1 : 'ad_id',
    2 : 'actual',
    3 : 'predicted'
  },
  inplace=True
)
#

In [0]:
#confusion matrix
print(pd.crosstab(trainResults['actual'], trainResults['predicted'], rownames=['Actual'], colnames=['Predicted']))

### Code for Random Forest using python pandas dataframe

In [0]:
#importing modules
import numpy as np # linear algebra
import pandas as pd # data processing
from sklearn.ensemble import RandomForestClassifier

In [0]:
testing=False
filename = 'submission'
print ("combine the data")
content = pd.read_csv('promoted_content.csv')
print('Done combining')

chunksize=50000# Out of 87141731.
train = pd.read_csv('clicks_train.csv',iterator=True,chunksize=chunksize) #Load data
print( 'Training')
for chunk in train:
    print('Taking the chunk in for modeling')
    chunk=pd.merge(chunk,content,how='left',on='ad_id')	
    predictors=[x for x in chunk.columns if x not in ['display_id','clicked']]
    chunk=chunk.fillna(0.0)
    alg = RandomForestClassifier(random_state=1, n_estimators=3, min_samples_split=4, min_samples_leaf=2, warm_start=True)
    alg.fit(chunk[predictors], chunk["clicked"])#Fit the Algorithm
    print("Model built with the chunk")
    if testing:
        break

In [0]:
train=''
print('Testing')
test= pd.read_csv('clicks_test.csv',iterator=True,chunksize=chunksize) #Load data
predY=[]
for chunk in test:
    init_chunk_size=len(chunk)
    chunk=pd.merge(chunk,content,how='left',on='ad_id')
    chunk=chunk.fillna(0.0)
    chunk_pred=list(alg.predict_proba(chunk[predictors]).astype(float)[:,1])
    predY += chunk_pred
    print("Prediction done for chunk")
    if testing:
        break
print('Done Testing')

In [0]:
print('Preparing results Submission')
test=''
test= pd.read_csv('clicks_test.csv')#But rather the full version
results=pd.concat((test,pd.DataFrame(predY)) ,axis=1,ignore_index=True)#Combine the predicted values with the ids
print(results.head(10))
#Renaming the columns
results.columns = ['display_id','ad_id','clicked']
results = results.sort_values(by=['display_id','clicked'], ascending=[True, False])
results = results.reset_index(drop=True)
results=results[['display_id','ad_id']].groupby('display_id')['ad_id'].agg(lambda col: ' '.join(map(str,col))).reset_index() # back to a DataFrame with display_id and ad_id columns
results.columns=['display_id','ad_id']
print("creating csv")
results.to_csv(filename+'.csv', index=False)

## Method parameter settings


1. Simple Aggregation: This model is based on simple aggregation and uses 'ad_id', 'clicked', and 'display_id'.
2. Logistic Regression: The best performing model uses PySpark's default parameters of the 'LogisticRegression' algorithm. It uses the columns 'ad_id', 'platform', 'geo_location', and an interaction term between 'ad_id' and 'geo_location'. Since some of the ads are probably location-related, we included an interaction between these two variables. The remaining settings were simply chosen because their results were the best.
3. Cost Sensitive Model: The best performing model uses a cost matrix that gives a false negative a cost of 3 and a false positive a cost of 1. True positives and true negatives are assigned zero cost. These cost values were simply chosen because they performed best.
4. Random Forest Classification: The best performing model uses random_state=1, n_estimators=3, min_samples_split=4, min_samples_leaf=2, warm_start=True as the parameters. These parameters were chosen because they performed best.
The model also uses 'display_id', 'ad_id', 'document_id', 'campaign_id', and 'advertiser_id' as the independent variables because an individual who comes to a website is motivated by only some specific content; for that reason, using the promoted_content data in conjunction with the training data helps predict the likelihood of clicks based on the content the individual is shown.



## Data Mining performance metrics and the evaluation approaches

1. Simple Aggregation: The model is evaluated by Kaggle; the public score for the approach is 0.63713.
2. Logistic Regression: To evaluate the model we chose a train-test split (0.7/0.3), meaning that we trained the model on 70% of the data and then evaluated the performance on the remaining 30%. The performance measures of our choice are precision, recall, and F-measure for both the positive and negative class. The metrics for the positive class show how well the model does in predicting a click. Besides the measures for each class, we were also interested in the model's overall performance - its accuracy.
The applied performance measures are standard for classification tasks and raise awareness of possible risks associated with the application of the model.
We tested other methods such as a support vector machine model using a linear kernel and a gradient boosting tree model. However, these methods performed slightly worse than the logistic regression in the base model. Due to time constraints, we decided to only fine-tune the logistic regression model. We hoped for an F-score of 0.5 and a recall of 0.3, but expected possibly low precision due to the size of the data set.
3. Random Forest Classification: The model was evaluated by Kaggle; the best public score is 0.53601. We also split the training data set into two sets to evaluate the model locally: precision(no) = 0.806388 and recall(no) = 0.99995.
4. Cost Sensitive Model: To evaluate the cost sensitive model we used 10-fold cross-validation to obtain the standard performance measures for classification - recall, precision, and F-score.
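
For the Kaggle scores quoted above, we assume the public score is the competition's mean average precision at 12 (MAP@12) and that every display has exactly one clicked ad. Under those assumptions, the score of a single display is simply 1 divided by the rank of the clicked ad in the submitted ordering. The sketch below uses plain Python with made-up displays; the function names and data are illustrative and not taken from Kaggle's scoring code.

```python
def average_precision_at_k(ranked_ad_ids, clicked_ad_id, k=12):
    """With a single relevant item, AP@k is 1 / rank of that item
    (0.0 if it does not appear in the first k positions)."""
    for rank, ad_id in enumerate(ranked_ad_ids[:k], start=1):
        if ad_id == clicked_ad_id:
            return 1.0 / rank
    return 0.0

def mean_average_precision(submission, truth, k=12):
    """submission: {display_id: [ad_id, ...] ordered by predicted likelihood}
    truth: {display_id: clicked ad_id}"""
    scores = [average_precision_at_k(submission[d], truth[d], k) for d in truth]
    return sum(scores) / len(scores)

# hypothetical example: the clicked ad is ranked 1st, 2nd and 4th
submission = {1: [10, 11, 12], 2: [20, 21], 3: [30, 31, 32, 33]}
truth = {1: 10, 2: 21, 3: 33}
score = mean_average_precision(submission, truth)
print(round(score, 4))  # (1 + 1/2 + 1/4) / 3
```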

# Implementation Effort and Risk Management


---


## Project Timeline


We originally estimated about 38 hours for the project as a whole. However, we had already spent at least 80 hours by the time of the interim report. By the end of the project we had spent about 100 hours, which is more than 2.5 times as much as we anticipated. 
Before we started the project, we didn't know how much time we would spend waiting for the code to run. Our computers aren't supercomputers, and thus working with large amounts of data requires a lot of time.

For future real-world projects we recommend planning for about 100 hours, a very powerful computer, and a subscription to a cloud IDE supporting PySpark. 


## Risks and Challenges

The major risk involved in any data project is that a bad model leaves many of the business questions unanswered. Given the huge dataset, we had to find the right and, most importantly, the best answer to our question. 
Another challenge for any big data project is performing an analysis on a big dataset. As mentioned earlier, most computers aren't suited for this kind of task, and we were really surprised by the amount of time we spent on running the code.
One risk that any team faces is uneven contribution. We had delegated tasks to each team member, and all of us but one worked in parallel on different analytical approaches. Robert did not communicate with us and didn't reply to any of our attempts to contact him. This was very unfortunate since the four of us had to juggle even more work than anticipated. 

To avoid these risks and challenges in the future, we recommend frequently validating the model to make sure that the team is on the right track. To receive accurate results, it is also important to use proper data preparation techniques. The challenge of big data can be addressed by reducing the data at the beginning and applying only the final model to the whole data, or by using a cloud IDE.
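
One possible way to reduce the data at the beginning is to keep a fixed share of displays chosen by a hash of the display id, so that all ads of a kept display stay together and the same sample is drawn on every run. This is a different technique from the `ad_id < 500` filter used in the data preparation; it is sketched here in plain Python, and the helper `keep_display` as well as the toy rows are assumptions for illustration.

```python
import zlib

def keep_display(display_id, keep_percent):
    """Deterministically keep roughly keep_percent % of all displays.
    The CRC32 hash of the id decides, so the choice is stable across runs."""
    bucket = zlib.crc32(str(display_id).encode("utf-8")) % 100
    return bucket < keep_percent

# hypothetical rows: 1000 displays with three ads each
rows = [(display_id, ad_id)
        for display_id in range(1, 1001)
        for ad_id in (1, 2, 3)]

sample = [row for row in rows if keep_display(row[0], keep_percent=10)]
kept_displays = {display_id for display_id, _ in sample}
print(len(kept_displays), "displays kept,", len(sample), "rows")
```

Because the filter works on the display id rather than on single rows, every kept display retains all of its ads, which matters for a task that ranks ads within a display.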


