# **JAY KAKKAD**

## ○ What features did you consider? 

Features considered for this problem statement:

1. **BOOKMARK, COMMENT CREATED, FOLLOW, LIKE, VIEW**: interaction counts used to calculate the virality score.
2. **Labels**: the virality score, used as the regression target.
3. **AuthorId**: one-hot encoded so that each author is treated as a separate category.
4. **Title**: the title is used rather than the full text because of the computational limits of my system.
5. **lang**: language (en or pt), one-hot encoded.
6. **url**: each domain has its own popularity, so the domain was extracted from each URL and one-hot encoded.
7. **Timestamp**: converted to the hour of the day, to relate the posting hour to the virality score.
8. **Author Region, Author Country**: both one-hot encoded because they are textual labels.
9. **eventType**: shared content was tagged +1 and removed content -1, on the assumption that removed content is provocative, misleading, or not gaining the desired attention.


---

## ○ What model did you use and why?

- I used **BERT** to generate numerical representations (embeddings) of the title, and an **artificial neural network (Keras Sequential model)** for the regression. I avoided simple linear or polynomial regression because of the number of input features (2103 per sample). The dataset was split into roughly 76% train, 19% validation, and 5% test (a 5% test hold-out, then 20% of the remainder for validation). These proportions were chosen because of the small size of the dataset.

- Worked with the RMSprop and Adam optimizers, and experimented with batch normalization and bias terms while tuning hyperparameters, to improve accuracy.


---

## ○ What was your evaluation metric for this?

- Since the output is a predicted virality score (a regression target), I chose Mean Squared Error and Mean Absolute Error.
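
For reference, both metrics can be written out directly. The sketch below is a plain-Python illustration with made-up numbers. The helper names are hypothetical, and the notebook itself relies on the built-in Keras `'mse'` and `'mae'` metrics rather than on these functions.

```python
def mean_squared_error(y_true, y_pred):
    # Average of squared differences; large misses dominate the result
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    # Average of absolute differences; in the same units as the score
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only, not taken from the dataset
y_true = [10.0, 50.0, 200.0]
y_pred = [12.0, 40.0, 260.0]
print(mean_squared_error(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
```

Because the virality score has a long tail (a few articles score in the hundreds), MSE is driven mostly by those articles, while MAE is easier to read as a typical error in score points.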


---

## ○ What features would you like to add to the model in the future if you had more time?

- One-hot encoding is useful when the number of distinct labels is small. I would have liked to process the data further to create a rating for each user and each domain based on their previous virality scores. This would give a better idea of the true influence of the user and the domain on the virality score, and replacing thousands of one-hot columns with a few numeric ones would reduce the model's time and memory requirements, saving time and cost.

- I would have liked to apply PCA to the data to reduce dimensionality and identify the most informative components.

- I was only able to compute the virality score from the title. I would like to extend this to the full text as well, but time and computational limitations prevented it.
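
The rating idea in the first point can be sketched as a smoothed mean of past scores per author or domain. This is only an illustration under stated assumptions: `fit_ratings`, `rate`, the `smoothing` parameter, and the toy data are all hypothetical and are not part of the notebook. In practice the ratings would have to be fitted on training rows only, so that test labels do not leak into the features.

```python
from collections import defaultdict

def fit_ratings(keys, scores, smoothing=1.0):
    """Smoothed mean score per key (an author ID or a domain name)."""
    global_mean = sum(scores) / len(scores)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, score in zip(keys, scores):
        totals[key] += score
        counts[key] += 1
    # Shrink each key's mean towards the global mean so that keys seen only
    # once or twice do not get extreme ratings
    ratings = {
        key: (totals[key] + smoothing * global_mean) / (counts[key] + smoothing)
        for key in totals
    }
    return ratings, global_mean

# Toy training data (hypothetical domains and scores)
train_domains = ["a.com", "a.com", "b.com", "c.com"]
train_scores = [100.0, 300.0, 20.0, 60.0]
ratings, fallback = fit_ratings(train_domains, train_scores)

def rate(domain):
    # Domains never seen in training fall back to the global mean
    return ratings.get(domain, fallback)

print(rate("a.com"), rate("b.com"), rate("unseen.org"))
```

Each author or domain then contributes a single numeric column instead of one column per distinct value.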


---

## ○ What other things would you want to try before deploying this model in production?

- I would have liked to try an LSTM for extracting numerical representations of the text, along with other neural architectures such as RNNs and machine learning models such as XGBoost.

- I would like to build a big data pipeline with Apache Spark to parallelize processing for better cost and time efficiency.

---




In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [None]:
df_sa = pd.read_csv('https://raw.githubusercontent.com/jay-kakkad/Neeva/master/shared_articles.csv')
print("Shared Articles")
print("Total Rows:" + str(len(df_sa)))
df_sa.head()

Shared Articles
Total Rows:3122


Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
0,1459192779,CONTENT REMOVED,-6451309518266745024,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en


In [None]:
df_ui = pd.read_csv('https://raw.githubusercontent.com/jay-kakkad/Neeva/master/users_interactions.csv')
print("User Interaction")
print("Total Rows:" + str(len(df_ui)))
df_ui.head()

User Interaction
Total Rows:72312


Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,


In [None]:
# Grouping all events for each contentId
df_ui['COUNTER'] = 1
group_user_data = df_ui.groupby(['contentId','eventType'])['COUNTER'].sum().reset_index()
group_user_data

Unnamed: 0,contentId,eventType,COUNTER
0,-9222795471790223670,BOOKMARK,1
1,-9222795471790223670,COMMENT CREATED,2
2,-9222795471790223670,FOLLOW,3
3,-9222795471790223670,LIKE,4
4,-9222795471790223670,VIEW,16
...,...,...,...
7335,9217155070834564627,COMMENT CREATED,2
7336,9217155070834564627,VIEW,14
7337,9220445660318725468,LIKE,2
7338,9220445660318725468,VIEW,50


In [None]:
user_events = group_user_data.pivot_table('COUNTER', ['contentId'], 'eventType')
user_events = user_events.fillna(0)

In [None]:
# Determining Y for the model, i.e. the virality score, from the user_events table
def virality_score(row):
    # Weighted sum of the per-content interaction counts
    return (1 * row['VIEW']) + (4 * row['LIKE']) + (10 * row['COMMENT CREATED']) + (25 * row['FOLLOW']) + (100 * row['BOOKMARK'])

user_events['labels'] = user_events.apply(virality_score, axis=1)
user_events

contentId,BOOKMARK,COMMENT CREATED,FOLLOW,LIKE,VIEW,labels
-9222795471790223670,1.0,2.0,3.0,4.0,16.0,227.0
-9216926795620865886,1.0,1.0,1.0,3.0,15.0,162.0
-9194572880052200111,2.0,1.0,1.0,4.0,21.0,272.0
-9192549002213406534,0.0,1.0,0.0,5.0,50.0,80.0
-9190737901804729417,0.0,0.0,0.0,1.0,8.0,12.0
...,...,...,...,...,...,...
9213260650272029784,0.0,0.0,0.0,0.0,11.0,11.0
9215261273565326920,3.0,0.0,0.0,3.0,24.0,336.0
9217155070834564627,0.0,2.0,0.0,0.0,14.0,34.0
9220445660318725468,0.0,0.0,0.0,2.0,50.0,58.0
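
As a worked check of the weighting, the first row of the table above can be recomputed by hand. The sketch below copies the weights from `virality_score`; the `WEIGHTS` dictionary and `weighted_score` helper are hypothetical names used only for this illustration.

```python
# Weights copied from virality_score in the cell above
WEIGHTS = {"VIEW": 1, "LIKE": 4, "COMMENT CREATED": 10, "FOLLOW": 25, "BOOKMARK": 100}

def weighted_score(counts):
    # Event types missing from the dict count as zero, mirroring fillna(0)
    return sum(weight * counts.get(event, 0) for event, weight in WEIGHTS.items())

# Counts for contentId -9222795471790223670, taken from the table above
first_row = {"BOOKMARK": 1, "COMMENT CREATED": 2, "FOLLOW": 3, "LIKE": 4, "VIEW": 16}
print(weighted_score(first_row))  # 16 + 16 + 20 + 75 + 100 = 227
```

A single bookmark therefore counts as much as 100 views, so articles with identical view counts can end up with very different labels.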


In [None]:
score = user_events.drop(columns=['BOOKMARK', 'COMMENT CREATED', 'FOLLOW', 'LIKE', 'VIEW'])

In [None]:
# Data preprocessing by filling NA values
content_info = df_sa.join(score, on='contentId')
content_info['eventType'] = content_info.apply(lambda row: 1 if row['eventType'] == 'CONTENT SHARED' else -1, axis=1)
content_info['labels'] = content_info['labels'].fillna(0)
content_info['authorRegion'] = content_info['authorRegion'].fillna('unknown')
content_info['authorCountry'] = content_info['authorCountry'].fillna('unknown')


In [None]:
from datetime import datetime
from urllib.parse import urlparse

def extract_domain(row):
    # Return the host part of the URL, e.g. 'www.nytimes.com'
    url = row['url']
    netloc = urlparse(url).netloc
    # URLs without a scheme have an empty netloc; fall back to the first path segment
    return netloc if netloc else url.split('/')[0]

def convert_time(row):
    # Hour of day (0-23) in the local timezone of the machine running the notebook
    return datetime.fromtimestamp(row['timestamp']).hour

content_info['url'] = content_info.apply(extract_domain, axis=1)
content_info['timestamp'] = content_info.apply(convert_time, axis=1)
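
One caveat on the hour feature: `datetime.fromtimestamp` without a `tz` argument converts to the local timezone of the machine, so the extracted hour depends on where the notebook runs. A minimal sketch of a timezone-independent variant, assuming UTC is an acceptable reference (the `hour_utc` helper is hypothetical and not part of the notebook):

```python
from datetime import datetime, timezone

def hour_utc(unix_seconds):
    # Passing tz explicitly makes the result independent of the host timezone
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).hour

# Timestamp of the first row of shared_articles.csv shown earlier
print(hour_utc(1459192779))  # 2016-03-28 19:19:39 UTC, so the hour is 19
```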

In [None]:
content_info = content_info.drop(columns=['authorUserAgent', 'authorSessionId'])


In [None]:
# One Hot encoding
df = pd.get_dummies(content_info, columns = ['authorPersonId', 'contentType', 'lang','authorCountry', 'authorRegion','url'])
# df = pd.get_dummies(content_info, columns = ['contentType', 'lang','authorCountry', 'authorRegion','url'])
dfTitle = df.drop(columns = ['text'])
dfTitle

Unnamed: 0,timestamp,eventType,contentId,title,labels,authorPersonId_-9120685872592674274,authorPersonId_-9047547311469006438,authorPersonId_-9016528795238256703,authorPersonId_-9009798162809551896,authorPersonId_-9001583565812478106,authorPersonId_-8860671864164757449,authorPersonId_-8845298781299428018,authorPersonId_-8830250090736356260,authorPersonId_-8781306637602263252,authorPersonId_-8694104221113176052,authorPersonId_-8606737560479590536,authorPersonId_-8606085472606356565,authorPersonId_-8550167523008133722,authorPersonId_-8424644554119645763,authorPersonId_-8420584158427265596,authorPersonId_-8336698712796133957,authorPersonId_-8241940599580729220,authorPersonId_-8205402408645015051,authorPersonId_-8132559109129514792,authorPersonId_-8123627990288459252,authorPersonId_-8020832670974472349,authorPersonId_-7711052404720939396,authorPersonId_-7611460419696903236,authorPersonId_-7606731662737258050,authorPersonId_-7531858294361854119,authorPersonId_-7496361692498935601,authorPersonId_-7456488753754080246,authorPersonId_-7421703586506797266,authorPersonId_-7410485589492665094,authorPersonId_-7377845572432324217,authorPersonId_-7299489921519728176,authorPersonId_-7148295538010878187,authorPersonId_-7118575739764684077,authorPersonId_-6946355789336786528,authorPersonId_-6944500707172804068,...,url_www.usatoday.com,url_www.userlike.com,url_www.usmagazine.com,url_www.valor.com.br,url_www.vanityfair.com,url_www.viajenaviagem.com,url_www.vice.com,url_www.viget.com,url_www.vogella.com,url_www.w3.org,url_www.wareable.com,url_www.washingtonpost.com,url_www.webmotors.com.br,url_www.webwash.net,url_www.weforum.org,url_www.wemblog.com,url_www.whitehouse.gov,url_www.widerfunnel.com,url_www.wildml.com,url_www.willegan.com,url_www.windowscentral.com,url_www.wired.co.uk,url_www.wired.com,url_www.wocintechchat.com,url_www.wsj.com,url_www.xmind.net,url_www.yahoo.com,url_www.youwilldobetter.com,url_www.zdnet.com,url_www.zeldman.com,url_www.zivtech.com,url_www.ztop.com.br,url_www1.folha.uol.com.br,url_www1.valor.com.br,url_www2.deloitte.com,url_www2.portalnovidade.com.br,url_xamarinbr.azurewebsites.net,url_xorcatt.wordpress.com,url_zeroturnaround.com,url_zoocha.com
0,19,-1,-6451309518266745024,"Ethereum, a Virtual Currency, Enables Transact...",3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,19,1,-4110354420726924665,"Ethereum, a Virtual Currency, Enables Transact...",1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,19,1,-7292285110016212249,Bitcoin Future: When GBPcoin of Branson Wins O...,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,19,1,-6151852268067518688,Google Data Center 360° Tour,22.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,19,1,2448026894306402386,"IBM Wants to ""Evolve the Internet"" With Blockc...",0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3117,14,1,9213260650272029784,"Conheça a Liga IoT, plataforma de inovação abe...",11.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3118,14,1,-3295913657316686039,Amazon takes on Skype and GoToMeeting with its...,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3119,19,1,3618271604906293310,Code.org 2016 Annual Report,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3120,16,1,6607431762270322325,JPMorgan Software Does in Seconds What Took La...,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
!pip install simpletransformers


In [None]:
# BERT model for generating text's numerical annotations

from simpletransformers.language_representation import RepresentationModel
model = RepresentationModel(
        model_type="bert",
        model_name="bert-base-uncased",
        use_cuda=False
    )



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTextRepresentation: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTextRepresentation from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForTextRepresentation from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
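encodedText = []
def encode_sentence(row):
  # encode_sentences expects a list of sentences, so wrap the single title in a list;
  # the mean over axis 0 then collapses the batch dimension to one 768-dimensional vector
  encoded = np.mean(model.encode_sentences([row['title']], combine_strategy="mean"), axis=0)
  encodedText.append(encoded)
for row in range(0, len(dfTitle)):
  if row % 200 == 0:
    print(row)  # progress indicator
  encode_sentence(dfTitle.iloc[row])
dfEncoded = pd.DataFrame(data=encodedText)
print(dfEncoded)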

0
200
400
600
800
1000
1200
1400
1600
1800
2000
2200
2400
2600
2800
3000
           0         1         2    ...       765       766       767
0    -0.039516 -0.077298 -0.002395  ...  0.090112 -0.049559  0.035127
1    -0.039516 -0.077298 -0.002395  ...  0.090112 -0.049559  0.035127
2    -0.119668 -0.036004 -0.039138  ...  0.066680 -0.050220  0.037964
3    -0.069113 -0.023199 -0.020311  ...  0.087319 -0.051461  0.045186
4    -0.062067 -0.065808 -0.020830  ...  0.082442 -0.047881  0.044261
...        ...       ...       ...  ...       ...       ...       ...
3117 -0.043575 -0.076468 -0.036323  ...  0.113552 -0.050621  0.012353
3118 -0.052005 -0.066198 -0.009247  ...  0.104132 -0.043776  0.070822
3119 -0.050306 -0.058000 -0.008510  ...  0.083342 -0.050039  0.061445
3120 -0.088415 -0.037325 -0.021644  ...  0.084511 -0.061697  0.064235
3121 -0.073989 -0.049651 -0.021638  ...  0.103965 -0.080269  0.046961

[3122 rows x 768 columns]


In [None]:
# Concatenating the title embeddings with the existing tabular features
dfTitle = pd.concat([dfTitle, dfEncoded], axis=1)

In [None]:
dfTitle

Unnamed: 0,timestamp,eventType,contentId,title,labels,authorPersonId_-9120685872592674274,authorPersonId_-9047547311469006438,authorPersonId_-9016528795238256703,authorPersonId_-9009798162809551896,authorPersonId_-9001583565812478106,authorPersonId_-8860671864164757449,authorPersonId_-8845298781299428018,authorPersonId_-8830250090736356260,authorPersonId_-8781306637602263252,authorPersonId_-8694104221113176052,authorPersonId_-8606737560479590536,authorPersonId_-8606085472606356565,authorPersonId_-8550167523008133722,authorPersonId_-8424644554119645763,authorPersonId_-8420584158427265596,authorPersonId_-8336698712796133957,authorPersonId_-8241940599580729220,authorPersonId_-8205402408645015051,authorPersonId_-8132559109129514792,authorPersonId_-8123627990288459252,authorPersonId_-8020832670974472349,authorPersonId_-7711052404720939396,authorPersonId_-7611460419696903236,authorPersonId_-7606731662737258050,authorPersonId_-7531858294361854119,authorPersonId_-7496361692498935601,authorPersonId_-7456488753754080246,authorPersonId_-7421703586506797266,authorPersonId_-7410485589492665094,authorPersonId_-7377845572432324217,authorPersonId_-7299489921519728176,authorPersonId_-7148295538010878187,authorPersonId_-7118575739764684077,authorPersonId_-6946355789336786528,authorPersonId_-6944500707172804068,...,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767
0,19,-1,-6451309518266745024,"Ethereum, a Virtual Currency, Enables Transact...",3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.083978,-0.259923,0.139390,0.099553,-0.335783,0.003261,-0.453525,0.372251,-0.046157,0.158701,0.132897,0.467020,0.018253,0.165468,0.307462,-0.394013,-0.084979,-0.170337,-0.295815,-0.319584,-0.093000,0.246695,0.143790,-0.023047,-2.384967,-0.017559,-0.069826,-0.176428,0.042679,-0.329104,0.432516,-0.063531,0.058508,-0.248117,-0.006832,-0.573886,0.278394,0.090112,-0.049559,0.035127
1,19,1,-4110354420726924665,"Ethereum, a Virtual Currency, Enables Transact...",1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.083978,-0.259923,0.139390,0.099553,-0.335783,0.003261,-0.453525,0.372251,-0.046157,0.158701,0.132897,0.467020,0.018253,0.165468,0.307462,-0.394013,-0.084979,-0.170337,-0.295815,-0.319584,-0.093000,0.246695,0.143790,-0.023047,-2.384967,-0.017559,-0.069826,-0.176428,0.042679,-0.329104,0.432516,-0.063531,0.058508,-0.248117,-0.006832,-0.573886,0.278394,0.090112,-0.049559,0.035127
2,19,1,-7292285110016212249,Bitcoin Future: When GBPcoin of Branson Wins O...,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.073105,-0.250710,0.144412,0.138788,-0.288513,-0.016962,-0.388132,0.363164,-0.017877,0.146029,0.097405,0.483159,0.049023,0.149755,0.277015,-0.362519,-0.071742,-0.141488,-0.293138,-0.351519,-0.044697,0.258752,0.183865,-0.022764,-2.378159,-0.055338,-0.091294,-0.187025,0.035351,-0.315890,0.419480,-0.111962,0.059834,-0.237388,-0.011094,-0.558227,0.273242,0.066680,-0.050220,0.037964
3,19,1,-6151852268067518688,Google Data Center 360° Tour,22.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.098539,-0.260516,0.132347,0.129348,-0.324905,0.007626,-0.416470,0.378005,-0.060273,0.186835,0.135672,0.424589,0.011806,0.183411,0.297726,-0.371318,-0.111850,-0.171337,-0.308281,-0.356278,-0.107532,0.282167,0.175323,-0.010617,-2.388668,-0.027423,-0.082584,-0.134449,0.028610,-0.270521,0.433293,-0.060609,0.034192,-0.205734,0.026570,-0.543180,0.248698,0.087319,-0.051461,0.045186
4,19,1,2448026894306402386,"IBM Wants to ""Evolve the Internet"" With Blockc...",0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.070913,-0.279069,0.113407,0.129772,-0.329954,0.001510,-0.438659,0.366639,-0.043227,0.195464,0.122556,0.496479,0.004402,0.158780,0.306705,-0.377385,-0.089376,-0.159704,-0.296015,-0.338278,-0.073185,0.273557,0.143222,0.029970,-2.341043,-0.011294,-0.105786,-0.172955,0.015195,-0.338567,0.426189,-0.072659,0.088273,-0.254805,-0.004180,-0.571559,0.284346,0.082442,-0.047881,0.044261
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3117,14,1,9213260650272029784,"Conheça a Liga IoT, plataforma de inovação abe...",11.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.089610,-0.271939,0.147562,0.083684,-0.340850,0.036086,-0.450647,0.384716,-0.026865,0.187980,0.135534,0.460584,0.003044,0.174371,0.312607,-0.376781,-0.113071,-0.198283,-0.318271,-0.322131,-0.120726,0.250430,0.142816,-0.032969,-2.424765,-0.025062,-0.076240,-0.136097,0.042990,-0.306371,0.437366,-0.063015,0.070094,-0.210509,0.019114,-0.558625,0.256834,0.113552,-0.050621,0.012353
3118,14,1,-3295913657316686039,Amazon takes on Skype and GoToMeeting with its...,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.059688,-0.280219,0.137674,0.121885,-0.321881,0.029690,-0.437809,0.378044,-0.010850,0.195355,0.125763,0.449060,-0.010326,0.159921,0.329332,-0.361570,-0.086034,-0.186336,-0.320984,-0.323719,-0.104398,0.250593,0.162664,0.011704,-2.404793,-0.022733,-0.089377,-0.157783,0.014221,-0.303969,0.433872,-0.084641,0.075870,-0.226174,0.029562,-0.559076,0.272037,0.104132,-0.043776,0.070822
3119,19,1,3618271604906293310,Code.org 2016 Annual Report,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.099979,-0.241374,0.127200,0.134039,-0.320886,-0.000175,-0.406751,0.350364,-0.020532,0.173584,0.133123,0.413073,0.021344,0.159278,0.323579,-0.369782,-0.099953,-0.179893,-0.346389,-0.357113,-0.105227,0.251802,0.212792,-0.000450,-2.410806,-0.052985,-0.072339,-0.156309,0.038620,-0.264578,0.434009,-0.055630,0.040958,-0.207559,0.040596,-0.550693,0.235698,0.083342,-0.050039,0.061445
3120,16,1,6607431762270322325,JPMorgan Software Does in Seconds What Took La...,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.097273,-0.256127,0.125721,0.122549,-0.325583,0.010568,-0.406618,0.374008,-0.068326,0.165533,0.117054,0.465354,0.007853,0.191864,0.271418,-0.362245,-0.099519,-0.178361,-0.304741,-0.349050,-0.064337,0.286012,0.172618,-0.014991,-2.387448,-0.027423,-0.078882,-0.157544,0.011774,-0.289854,0.415190,-0.071328,0.046348,-0.219499,0.012839,-0.526987,0.256877,0.084511,-0.061697,0.064235


In [None]:
# Generating TEST and TRAIN dataset
from sklearn.model_selection import train_test_split
dfTitleFloat = dfTitle.drop(columns=['title', 'contentId'])
train, test = train_test_split(dfTitleFloat, test_size=0.05)

In [None]:
train_X, train_Y = train.drop('labels',axis = 1), train['labels']
test_X, test_Y = test.drop('labels',axis = 1), test['labels']
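
The effective split proportions follow from combining `test_size=0.05` here with the `validation_split = 0.2` argument passed to `model.fit` further down. A short arithmetic sketch (the constant names are only for illustration):

```python
TEST_SIZE = 0.05          # train_test_split(..., test_size=0.05)
VALIDATION_SPLIT = 0.2    # model.fit(..., validation_split=0.2)

test_frac = TEST_SIZE
# The validation set is carved out of the rows left after the test hold-out
val_frac = (1 - TEST_SIZE) * VALIDATION_SPLIT
train_frac = (1 - TEST_SIZE) * (1 - VALIDATION_SPLIT)

print(f"train {train_frac:.0%}, validation {val_frac:.0%}, test {test_frac:.0%}")
```

Note that `train_test_split` is called without a `random_state`, so the exact rows in each part change from run to run.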



In [None]:
train_X

Unnamed: 0,timestamp,eventType,authorPersonId_-9120685872592674274,authorPersonId_-9047547311469006438,authorPersonId_-9016528795238256703,authorPersonId_-9009798162809551896,authorPersonId_-9001583565812478106,authorPersonId_-8860671864164757449,authorPersonId_-8845298781299428018,authorPersonId_-8830250090736356260,authorPersonId_-8781306637602263252,authorPersonId_-8694104221113176052,authorPersonId_-8606737560479590536,authorPersonId_-8606085472606356565,authorPersonId_-8550167523008133722,authorPersonId_-8424644554119645763,authorPersonId_-8420584158427265596,authorPersonId_-8336698712796133957,authorPersonId_-8241940599580729220,authorPersonId_-8205402408645015051,authorPersonId_-8132559109129514792,authorPersonId_-8123627990288459252,authorPersonId_-8020832670974472349,authorPersonId_-7711052404720939396,authorPersonId_-7611460419696903236,authorPersonId_-7606731662737258050,authorPersonId_-7531858294361854119,authorPersonId_-7496361692498935601,authorPersonId_-7456488753754080246,authorPersonId_-7421703586506797266,authorPersonId_-7410485589492665094,authorPersonId_-7377845572432324217,authorPersonId_-7299489921519728176,authorPersonId_-7148295538010878187,authorPersonId_-7118575739764684077,authorPersonId_-6946355789336786528,authorPersonId_-6944500707172804068,authorPersonId_-6895155480127642372,authorPersonId_-6786856227257648356,authorPersonId_-6730258785244938562,...,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767
1500,11,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.077130,-0.273617,0.131391,0.126100,-0.313527,0.008148,-0.454892,0.370231,-0.014584,0.188939,0.142274,0.469259,0.015782,0.153384,0.317755,-0.355636,-0.077836,-0.167733,-0.286897,-0.345212,-0.093678,0.251718,0.125659,-0.012348,-2.353914,-0.047283,-0.095028,-0.162566,0.004264,-0.311225,0.442809,-0.061783,0.057702,-0.237164,0.001288,-0.558538,0.251202,0.104448,-0.054787,0.032548
193,20,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.080428,-0.267470,0.158015,0.118752,-0.316776,0.070423,-0.469150,0.375380,-0.027766,0.186172,0.164862,0.433927,-0.042682,0.178671,0.347262,-0.366354,-0.138757,-0.199964,-0.350413,-0.345884,-0.136928,0.210491,0.169812,-0.004560,-2.446651,-0.044374,-0.055543,-0.143434,0.010940,-0.270479,0.412457,-0.065623,0.061063,-0.225243,0.020523,-0.550722,0.260758,0.098434,-0.062163,0.061656
3117,14,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.089610,-0.271939,0.147562,0.083684,-0.340850,0.036086,-0.450647,0.384716,-0.026865,0.187980,0.135534,0.460584,0.003044,0.174371,0.312607,-0.376781,-0.113071,-0.198283,-0.318271,-0.322131,-0.120726,0.250430,0.142816,-0.032969,-2.424765,-0.025062,-0.076240,-0.136097,0.042990,-0.306371,0.437366,-0.063015,0.070094,-0.210509,0.019114,-0.558625,0.256834,0.113552,-0.050621,0.012353
2836,19,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.089113,-0.279673,0.145779,0.108887,-0.349225,0.080296,-0.473132,0.419396,-0.051678,0.190014,0.173420,0.393661,-0.040035,0.179527,0.333627,-0.412754,-0.162385,-0.197550,-0.352063,-0.359719,-0.136142,0.250603,0.212638,0.055423,-2.408853,-0.017716,-0.070048,-0.147578,0.015597,-0.295391,0.450382,-0.063319,0.067721,-0.242100,0.011082,-0.531492,0.281215,0.076615,-0.051685,0.115984
480,12,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.065226,-0.272002,0.143419,0.108013,-0.311704,-0.015972,-0.449694,0.369738,0.021659,0.160848,0.127474,0.470507,0.013444,0.145814,0.317278,-0.359566,-0.080669,-0.154440,-0.282255,-0.334131,-0.102872,0.239202,0.156320,-0.025588,-2.417209,-0.045869,-0.062476,-0.168996,0.019182,-0.309410,0.426371,-0.089497,0.062424,-0.226302,-0.009770,-0.571835,0.258484,0.105823,-0.038975,0.041067
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,19,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.083978,-0.259923,0.139390,0.099553,-0.335783,0.003261,-0.453525,0.372251,-0.046157,0.158701,0.132897,0.467020,0.018253,0.165468,0.307462,-0.394013,-0.084979,-0.170337,-0.295815,-0.319584,-0.093000,0.246695,0.143790,-0.023047,-2.384967,-0.017559,-0.069826,-0.176428,0.042679,-0.329104,0.432516,-0.063531,0.058508,-0.248117,-0.006832,-0.573886,0.278394,0.090112,-0.049559,0.035127
1318,16,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.087564,-0.273711,0.142261,0.116421,-0.314439,-0.017132,-0.417919,0.383165,-0.021309,0.170650,0.113388,0.477903,0.022810,0.175813,0.317921,-0.371892,-0.091336,-0.159070,-0.320681,-0.347799,-0.097389,0.253396,0.167459,-0.003869,-2.373139,-0.041011,-0.115426,-0.171559,0.046354,-0.331783,0.421124,-0.084490,0.094542,-0.220539,-0.000563,-0.545252,0.257077,0.099013,-0.049131,0.048908
506,17,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.086356,-0.292293,0.140914,0.074946,-0.344132,0.079281,-0.493319,0.413142,-0.051801,0.201450,0.180247,0.426096,-0.026089,0.190012,0.354696,-0.403841,-0.193616,-0.203461,-0.352775,-0.352328,-0.126923,0.242232,0.167489,-0.008302,-2.390862,0.004590,-0.042849,-0.126754,0.008071,-0.283549,0.431989,-0.030777,0.053911,-0.249725,0.009910,-0.546180,0.266963,0.120626,-0.071644,0.062609
786,17,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,-0.096085,-0.264073,0.148415,0.073177,-0.316996,-0.006701,-0.435257,0.365066,0.036640,0.164241,0.126911,0.492885,0.006586,0.174518,0.298749,-0.336814,-0.113992,-0.166586,-0.312155,-0.345721,-0.126009,0.235810,0.135778,-0.031535,-2.486718,-0.059933,-0.047491,-0.132725,0.034820,-0.316040,0.380752,-0.083034,0.049665,-0.204684,-0.000106,-0.541984,0.241151,0.117464,-0.036536,0.022087


In [None]:
# Artificial Neural Network

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(len(train_X.keys()))  # number of input features
def build_model():
  # Fully connected regression network: 1024 -> 256 -> 16 -> 1 (linear output)
  model = keras.Sequential([
    layers.Dense(1024, activation='relu', kernel_initializer='random_normal', input_shape=[len(train_X.keys())]),
    layers.Dense(256, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)
  ])
  optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.01)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

model = build_model()
model.summary()

2103
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 1024)              2154496   
_________________________________________________________________
dense_5 (Dense)              (None, 256)               262400    
_________________________________________________________________
dense_6 (Dense)              (None, 16)                4112      
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 17        
Total params: 2,421,025
Trainable params: 2,421,025
Non-trainable params: 0
_________________________________________________________________
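
The parameter counts in the summary above can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below reproduces the table from the layer widths; `dense_params` and `layer_sizes` are names introduced here for illustration only.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the width of each Dense layer in the model.
layer_sizes = [2103, 1024, 256, 16, 1]

counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [2154496, 262400, 4112, 17]
print(total)   # 2421025
```

Almost 90% of the parameters sit in the first layer, because it connects all 2103 input features to 1024 units.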


In [None]:
!pip install git+https://github.com/tensorflow/docs

Collecting git+https://github.com/tensorflow/docs
  Cloning https://github.com/tensorflow/docs to /tmp/pip-req-build-o0huax5z
  Running command git clone -q https://github.com/tensorflow/docs /tmp/pip-req-build-o0huax5z
Building wheels for collected packages: tensorflow-docs
  Building wheel for tensorflow-docs (setup.py) ... done
  Created wheel for tensorflow-docs: filename=tensorflow_docs-0.0.09266de4a8b3781c75947b25790e2191c32f70292_-cp36-none-any.whl size=140358 sha256=beeda446311d8e973224d4634286f3849d410c685a2072182677169d097edb96
  Stored in directory: /tmp/pip-ephem-wheel-cache-ec717rvm/wheels/eb/1b/35/fce87697be00d2fc63e0b4b395b0d9c7e391a10e98d9a0d97f
Successfully built tensorflow-docs
Installing collected packages: tensorflow-docs
Successfully installed tensorflow-docs-0.0.09266de4a8b3781c75947b25790e2191c32f70292-


In [None]:
import pathlib

import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
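import tensorflow_docs as tfdocs
import tensorflow_docs.modeling as mod

EPOCHS = 1000

# Keras holds out the last 20% of the training data for validation.
# EpochDots prints one dot per epoch and a metrics summary every 100 epochs.
history = model.fit(
  train_X, train_Y,
  epochs=EPOCHS, validation_split=0.2, verbose=0,
  callbacks=[mod.EpochDots()])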


Epoch: 0, loss:77672.7969,  mae:135.4546,  mse:77672.7969,  val_loss:36900.1914,  val_mae:115.7739,  val_mse:36900.1914,  
....................................................................................................
Epoch: 100, loss:46083.1680,  mae:96.9403,  mse:46083.1680,  val_loss:40964.3945,  val_mae:118.1164,  val_mse:40964.3945,  
....................................................................................................
Epoch: 200, loss:29441.0352,  mae:82.6329,  mse:29441.0352,  val_loss:47170.2852,  val_mae:129.9509,  val_mse:47170.2852,  
....................................................................................................
Epoch: 300, loss:19864.6113,  mae:77.8599,  mse:19864.6113,  val_loss:41807.8320,  val_mae:119.0620,  val_mse:41807.8320,  
....................................................................................................
Epoch: 400, loss:16685.9277,  mae:73.0780,  mse:16685.9277,  val_loss:45845.2266,  val_mae:125.6242,
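
In the log above the training loss keeps falling while the validation loss never improves on its epoch 0 value, which is the usual sign of overfitting. One common response is patience-based early stopping; Keras provides this as `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=..., restore_best_weights=True)`. The sketch below is not the notebook's method. It only illustrates the stopping rule in plain Python, applied to the five `val_loss` values logged above; the function name `early_stopping` and the patience value are assumptions made for the example, and the log is sampled only every 100 epochs, so the result is coarse.

```python
def early_stopping(val_losses, patience):
    """Return (stop_index, best_index) under patience-based early stopping.

    Training stops once `patience` consecutive checks fail to improve
    on the best validation loss seen so far.
    """
    best_index = 0
    best_loss = float("inf")
    waited = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                return i, best_index
    return len(val_losses) - 1, best_index

# val_loss values copied from the training log (one per 100 epochs).
logged_epochs = [0, 100, 200, 300, 400]
logged_val_loss = [36900.1914, 40964.3945, 47170.2852, 41807.8320, 45845.2266]

stop, best = early_stopping(logged_val_loss, patience=2)
print(logged_epochs[stop], logged_epochs[best])  # 200 0
```

With this hypothetical patience of two checks, training would have stopped at the epoch 200 checkpoint and kept the weights from epoch 0, long before the 1000 epochs used here.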

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Unnamed: 0,loss,mae,mse,val_loss,val_mae,val_mse,epoch
995,11065.694336,63.77644,11065.694336,65286.382812,154.479874,65286.382812,995
996,11403.785156,65.068192,11403.785156,56258.636719,138.503464,56258.636719,996
997,11813.77832,64.402214,11813.77832,52461.285156,135.096695,52461.285156,997
998,11724.618164,64.846954,11724.618164,52076.457031,133.616714,52076.457031,998
999,11209.113281,64.260185,11209.113281,49060.964844,131.110443,49060.964844,999


In [None]:
# Evaluate once on the held-out test set; returns [loss, mae, mse]
loss, mae, mse = model.evaluate(test_X, test_Y, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} Virality".format(mae))

5/5 - 0s - loss: 60148.0781 - mae: 148.0712 - mse: 60148.0781
Testing set Mean Abs Error: 148.07 Virality


In [None]:
results = model.predict(test_X)

results

array([[  58.52546  ],
       [  58.52546  ],
       [  58.52546  ],
       [ 599.48175  ],
       [ 340.43613  ],
       [  58.52546  ],
       [  58.52546  ],
       [ 229.34633  ],
       [  58.52546  ],
       [  58.52546  ],
       [  58.52546  ],
       [ 378.35513  ],
       [  58.52546  ],
       [ 261.2078   ],
       [ 176.62741  ],
       [  58.52546  ],
       [ 264.67737  ],
       [  58.52546  ],
       [  58.52546  ],
       [ 229.55779  ],
       [  58.52546  ],
       [ 164.574    ],
       [  58.52546  ],
       [   7.1709404],
       [  58.52546  ],
       [  58.52546  ],
       [  58.52546  ],
       [  58.52546  ],
       [  58.52546  ],
       [  58.52546  ],
       [1025.2701   ],
       [ 166.33212  ],
       [  58.52546  ],
       [  58.52546  ],
       [  76.04188  ],
       [ 243.24463  ],
       [ 286.37106  ],
       [  58.52546  ],
       [  58.52546  ],
       [ 202.35037  ],
       [  62.91578  ],
       [  58.52546  ],
       [  58.52546  ],
       [  4
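
Many of the predictions above are the identical value 58.52546. A quick way to quantify this is to count the most frequent prediction. The sketch below uses only the 43 complete values visible in the truncated output, so it describes that excerpt and not the full test set; the list name `shown` is introduced here for illustration.

```python
from collections import Counter

# First 43 predictions copied from the output above (the rest is truncated).
shown = [
    58.52546, 58.52546, 58.52546, 599.48175, 340.43613, 58.52546,
    58.52546, 229.34633, 58.52546, 58.52546, 58.52546, 378.35513,
    58.52546, 261.2078, 176.62741, 58.52546, 264.67737, 58.52546,
    58.52546, 229.55779, 58.52546, 164.574, 58.52546, 7.1709404,
    58.52546, 58.52546, 58.52546, 58.52546, 58.52546, 58.52546,
    1025.2701, 166.33212, 58.52546, 58.52546, 76.04188, 243.24463,
    286.37106, 58.52546, 58.52546, 202.35037, 62.91578, 58.52546,
    58.52546,
]

value, count = Counter(shown).most_common(1)[0]
print(value, count, len(shown))  # 58.52546 26 43
```

In this excerpt 26 of 43 rows receive exactly the same score. One possible explanation, not verified here, is that the ReLU units are inactive for those inputs, so the output collapses to a constant driven by the biases.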

In [None]:
test_Y

18        5.0
641     135.0
2157     76.0
1648    290.0
2925      8.0
        ...  
316      13.0
72       24.0
1123     92.0
1727      2.0
2244     86.0
Name: labels, Length: 157, dtype: float64
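
As a spot check, the first five predictions can be compared with the first five true labels shown above. This assumes that `test_X` and `test_Y` share the same row order, which is the usual case but is not confirmed in this notebook; the helper `mean_absolute_error` is written here for illustration and is not taken from the original code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# First five rows of test_Y and of the model.predict output shown above.
true_first5 = [5.0, 135.0, 76.0, 290.0, 8.0]
pred_first5 = [58.52546, 58.52546, 58.52546, 599.48175, 340.43613]

mae_first5 = mean_absolute_error(true_first5, pred_first5)
print(round(mae_first5, 2))  # 157.88
```

Five rows are far too few to draw conclusions, but the value is of the same order as the 148.07 reported for the full test set of 157 rows.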