# Project Outline

## Background:

The marketing division of A Modern Bank has purchased data related to the number of reader comments on blog posts. We want to produce a quick prototype to understand the value of this data. You have been asked to present a business case to the Executive Manager of the marketing team.

## Data:

We will be using the ‘BlogFeedback’ dataset which can be downloaded here: https://archive.ics.uci.edu/ml/datasets/BlogFeedback

A description of how the data was constructed and a data dictionary are available on this page.

## Scenario Assumptions.

I will assume the data set is "relevant", i.e. that even though it is based on Hungarian websites scraped in the early 2010s, the features and descriptions of those websites are relevant to Australia today.

#### Imports and project Structure

In [1]:
%matplotlib inline
import requests
from pathlib import Path
from zipfile import ZipFile
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import seaborn as sns
import joblib

url_orig = r"https://archive.ics.uci.edu/ml/machine-learning-databases/00304/BlogFeedback.zip"
url = Path(url_orig)

image_folder = Path('images')
image_folder.mkdir(exist_ok=True)
test_folder = Path('test')
test_folder.mkdir(exist_ok=True)

models_folder = Path('models')
models_folder.mkdir(exist_ok=True)

train_path_name = Path('blogData_train.csv')
test_path_name = Path('test.csv')

#### Get Train data

In [2]:
# Download the archive once; `url.name` is the zip file name.
if not Path(url.name).exists():
    req = requests.get(url_orig)
    req.raise_for_status()  # Fail early on a bad download
    with open(url.name, 'wb') as zfile:
        zfile.write(req.content)

# Extract the training file if it is not already on disk.
if not train_path_name.exists():
    with ZipFile(url.name, 'r') as zfile:
        zfile.extract('blogData_train.csv')

#### Get test data

In [3]:
if not test_path_name.exists():
    # Extract every test file from the archive downloaded above.
    with ZipFile(url.name, 'r') as zfile:
        for info in zfile.infolist():
            if info.filename.startswith("blogData_test"):
                zfile.extract(info.filename, test_folder)

    # Combine the test files into one frame, named like the training columns.
    chunks = [pd.read_csv(chunk, header=None)
              for chunk in sorted(test_folder.glob('blogData_test*.csv'))]
    test_data = pd.concat(chunks, ignore_index=True)
    test_data.columns = ['Col_{}'.format(col+1) for col in test_data.columns]

    test_data.to_csv(test_path_name, index=False)
test_df = pd.read_csv(test_path_name)

In [4]:
test_df

Unnamed: 0,Col_1,Col_2,Col_3,Col_4,Col_5,Col_6,Col_7,Col_8,Col_9,Col_10,...,Col_272,Col_273,Col_274,Col_275,Col_276,Col_277,Col_278,Col_279,Col_280,Col_281
0,10.630660,17.882992,1.0,259.0,5.0,4.018276,10.396790,0.0,235.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
1,43.435825,75.590485,0.0,634.0,20.0,15.998589,44.560870,0.0,473.0,2.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.733333,3.043390,0.0,9.0,0.0,0.733333,1.526070,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,27.230215,45.970950,0.0,371.0,14.0,10.784173,24.209942,0.0,228.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
4,4.500000,6.677075,0.0,18.0,0.5,3.000000,4.000000,0.0,10.0,0.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7619,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7620,56.512093,77.442830,0.0,438.0,32.0,19.296530,49.221344,0.0,432.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7621,49.442368,112.620125,1.0,849.0,9.0,20.445482,62.619390,0.0,506.0,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7622,16.593575,19.671364,1.0,144.0,10.0,6.512450,11.051215,0.0,111.0,2.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0


In [5]:
df = pd.read_csv(train_path_name, header=None)

df.columns = ['Col_{}'.format(col+1) for col in df.columns]
df.head()

Unnamed: 0,Col_1,Col_2,Col_3,Col_4,Col_5,Col_6,Col_7,Col_8,Col_9,Col_10,...,Col_272,Col_273,Col_274,Col_275,Col_276,Col_277,Col_278,Col_279,Col_280,Col_281
0,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0


In [None]:
import dtale
dtale.show(df)

2022-05-05 13:25:15,853 - INFO     - Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-05-05 13:25:15,858 - INFO     - NumExpr defaulting to 8 threads.


#### Attribute Information:



1...50:
Average, standard deviation, min, max and median of the
Attributes 51...60 for the source of the current blog post.
With source we mean the blog on which the post appeared.
For example, myblog.blog.org would be the source of
the post myblog.blog.org/post_2010_09_10

51: Total number of comments before basetime

52: Number of comments in the last 24 hours before the
basetime

53: Let T1 denote the datetime 48 hours before basetime,
Let T2 denote the datetime 24 hours before basetime.
This attribute is the number of comments in the time period
between T1 and T2

54: Number of comments in the first 24 hours after the
publication of the blog post, but before basetime

55: The difference of Attribute 52 and Attribute 53

56...60:
The same features as the attributes 51...55, but
features 56...60 refer to the number of links (trackbacks),
while features 51...55 refer to the number of comments.

61: The length of time between the publication of the blog post
and basetime

62: The length of the blog post

63...262:
The 200 bag of words features for 200 frequent words of the
text of the blog post

263...269: binary indicator features (0 or 1) for the weekday
(Monday...Sunday) of the basetime

270...276: binary indicator features (0 or 1) for the weekday
(Monday...Sunday) of the date of publication of the blog
post

277: Number of parent pages: we consider a blog post P as a
parent of blog post B, if B is a reply (trackback) to
blog post P.

278...280:
Minimum, maximum, average number of comments that the
parents received

281: The target: the number of comments in the next 24 hours
(relative to basetime)
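
The attribute numbers above are 1-based, while `iloc` positions are zero-based, so it is easy to be off by one. A small lookup helper can make it quicker to check what a column such as `Col_18` refers to. This is only a sketch: `ATTRIBUTE_GROUPS`, `describe_attribute` and `describe_column` are hypothetical names, and the labels paraphrase the data dictionary.

```python
# Hypothetical helper: maps a 1-based attribute number to its group
# in the data dictionary. The group labels are paraphrased.
ATTRIBUTE_GROUPS = [
    (1, 50, "source statistics of attributes 51-60"),
    (51, 55, "comment counts relative to basetime"),
    (56, 60, "trackback (link) counts relative to basetime"),
    (61, 61, "time between publication and basetime"),
    (62, 62, "length of the blog post"),
    (63, 262, "bag-of-words features"),
    (263, 269, "weekday of basetime (Monday..Sunday)"),
    (270, 276, "weekday of publication (Monday..Sunday)"),
    (277, 277, "number of parent pages"),
    (278, 280, "min, max, average comments on parent pages"),
    (281, 281, "target: comments in the next 24 hours"),
]


def describe_attribute(number):
    """Return the group label for a 1-based attribute number."""
    for first, last, label in ATTRIBUTE_GROUPS:
        if first <= number <= last:
            return label
    raise ValueError(f"attribute {number} is outside 1..281")


def describe_column(name):
    """Accept a column name such as 'Col_281', as used in this notebook."""
    return describe_attribute(int(name.split("_")[1]))


print(describe_column("Col_18"))
print(describe_column("Col_270"))
```

For a positional index `i` (as used with `iloc` or `model.coef_`), the matching attribute number is `i + 1`.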
    

#### Note

Unfortunately, the data seems to have been stripped of some potentially useful features such as year and month. I suspect yearly and monthly effects (holidays, school years) might be washing out some of the weekly trends.

##### Notes: 
Four columns are all zero. Perhaps the values are missing, which would indicate a bad scrape or bad data.

In [None]:
(df==0).all(0).astype(int).sum()

The target variable is integer-valued, as expected.

In [None]:
(df['Col_281'] % 1 == 0).all()

In [None]:
df['Col_281'] = df['Col_281'].astype(int)

### Frequency distribution of the comments

In [None]:
df["Col_281"].describe()

In [None]:
from matplotlib.ticker import StrMethodFormatter

fig, ax = plt.subplots()

sns.histplot(data=df, binwidth=100,
             x="Col_281", ax=ax)
ax.set_yscale('log')
ax.set_xlabel('Comments')
ax.set_ylabel('Web Blogs')

# Show plain integers rather than powers of ten on the log axis.
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:.0f}'))

# Save into the `images` folder created during setup.
fig.savefig(image_folder / 'Distribution_of_Comment_Counts.svg')
plt.show()
fig

#### Note: this might suggest an exponential distribution; Poisson regression may be needed
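
One quick way to judge whether a Poisson model is plausible is to compare the variance of the counts with their mean, since a Poisson distribution has the two equal. The sketch below uses a hypothetical helper, `dispersion_index`, on made-up counts; a ratio far above 1 would point to overdispersion, in which case something like a negative binomial model might fit better.

```python
import numpy as np


def dispersion_index(counts):
    """Sample variance divided by the mean; about 1 for Poisson data."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()


# Toy counts (made up for illustration): mostly zeros with one large value.
toy_counts = [0, 0, 0, 10]
print(dispersion_index(toy_counts))  # 10.0
```

In this notebook the call would be `dispersion_index(df["Col_281"])`.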

In [None]:
from collections import Counter
count_bin = Counter(df["Col_281"].astype(int))

In [None]:
count_bin[0], count_bin[1], count_bin[2], count_bin[3]

#### Correlations in predictors

This might help us think about what a linear regressor would do.

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
correlations = df.corr()
# mask = np.triu(np.ones_like(correlations, dtype=bool))
sns.heatmap(correlations, ax=ax)
fig

#### Note

As expected, the statistical features (mean, median, etc.) are highly correlated with each other.

The "bag of words" features, 63...262, are also highly correlated with each other.
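
To back up the visual impression from the heatmap, the strongly correlated pairs can be listed explicitly. This is a minimal sketch on a made-up frame; `highly_correlated_pairs` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle so each pair appears only once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy frame (made up): 'b' is an exact multiple of 'a', 'c' is unrelated.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]})
print(highly_correlated_pairs(toy))
```

On the real data, `highly_correlated_pairs(df.iloc[:, :50])` would restrict the check to the source statistics.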

#### Look at which features correlate with the target, Col_281

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
sns.heatmap(correlations[-1:], ax=ax, cmap =sns.color_palette("viridis", as_cmap=True) )
fig

#### Note:

It looks like the strongest influences are the features describing the popularity of the web blog.
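
A single-row heatmap is hard to read with 280 features, so a ranked list can help confirm which ones stand out. The sketch below is an assumption-laden illustration on a made-up frame; `rank_by_target_correlation` is a hypothetical helper, not part of pandas.

```python
import pandas as pd


def rank_by_target_correlation(frame, target_col, top=5):
    """Return the `top` features with the largest absolute correlation to target_col."""
    corr = frame.corr()[target_col].drop(target_col)
    # Sort by magnitude but report the signed correlation.
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(top)


# Toy frame (made up): 'x1' tracks the target closely, 'x2' does not.
toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [1, -1, 1, -1, 1],
    "y": [2, 4, 7, 8, 10],
})
print(rank_by_target_correlation(toy, "y"))
```

In this notebook the call would be `rank_by_target_correlation(df, "Col_281", top=10)`. Columns that are all zero have an undefined correlation and sort to the end.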

#### Comment the m, n ranges in or out to choose which block to look at.

In [None]:
# m, n are zero-based positions, so attribute k sits at index k-1.
# m, n = 0, 50 # basetime stats, Average, standard deviation, min, max and median 
# m, n = 269, 276 # Publish dates
# m, n = 62, 262 # 'Bag of words' Features
m, n = 263, 280 # Seems empty.


cols = list(range(m, n)) + [280]
sub_corr = correlations.iloc[cols, cols]
#mask=None
mask = np.triu(np.ones_like(sub_corr, dtype=bool))
sns.heatmap(sub_corr, mask=mask, )


In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
sns.heatmap(sub_corr[-1:], ax=ax, cmap = sns.color_palette("viridis", as_cmap=True))
fig



### Most Common day of publication

In [None]:
publication_day_df = df.iloc[:, 269:276].astype(int)
publication_day_df.columns = ['Mon', "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


In [None]:
fig, ax = plt.subplots()
count_publication_df = publication_day_df.sum()
count_publication_df.plot(ax=ax)
ax.set_ylabel('Web Posts')
ax.set_xlabel('Publish Day')
fig

In [None]:
df.iloc[:, -1]

In [None]:
fig.savefig(image_folder/ 'Publishing Days.svg' )


In [None]:
count_cols = pd.DataFrame()
for col in publication_day_df:
    count_cols[col] = publication_day_df[col]*df.iloc[:, -1]
count_cols = count_cols.sum()


In [None]:
fig, ax = plt.subplots()
count_cols.plot(ax=ax)
ax.set_ylabel('Comments on Post')
ax.set_xlabel('Publish Day')

In [None]:
fig, ax = plt.subplots()
(count_cols/count_publication_df).plot(ax=ax)
ax.set_ylabel('Comments Per Web Blog Post')
ax.set_xlabel('Publish Day')

y_min, y_max = ax.get_ylim()
ax.set_ylim(y_min/2, y_max*1.2)

fig.savefig(image_folder/ 'Replies per post Days.svg')

In [None]:
list(models_folder.glob("*.joblib"))

In [None]:
# Note: 0:-2 drops Col_280 as well as the target Col_281; the test split uses the same slice.
data = df.iloc[:, 0:-2]
target = df.iloc[:, -1]

In [None]:
target

#### Standardise Data

First we're going to standardise.

Recall:

263...269: binary indicator features (0 or 1) for the weekday
(Monday...Sunday) of the basetime

270...276: binary indicator features (0 or 1) for the weekday
(Monday...Sunday) of the date of publication of the blog
post

We're not going to scale these, since they're binary (True/False) flags. We will also drop Monday, since we only need 6 indicators to represent 7 categories.

In [None]:
# Binary features are day of publish and Basetime
categorical_columns = ["Col_{}".format(n) for n in range(263, 277)]
print(categorical_columns)

In [None]:
raw_data = df.iloc[:, 0:-1]

# Binary features are day of publish and Basetime
def prepare_data(raw_data, categorical_columns):
    """Drop the Monday indicators and move the remaining weekday flags to the end."""
    # Monday (Col_263, Col_270) is the reference category for each weekday group.
    kept_flags = [col for col in categorical_columns if col not in ["Col_263", "Col_270"]]
    categorical_features = raw_data[kept_flags]
    features = raw_data[[col for col in raw_data.columns if col not in categorical_columns]]

    data = features.merge(categorical_features, left_index=True, right_index=True, validate='1:1')
    assert data.shape[0] == raw_data.shape[0]

    return data


prepared_data = prepare_data(raw_data, categorical_columns)

In [None]:
prepared_data

In [None]:
from sklearn import preprocessing

In [None]:
# Not applied yet: the models below are fitted on `data` as defined earlier,
# so their coefficient positions still match the original column order.
# data = prepared_data
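
The standardisation step itself could look roughly like the sketch below, which scales only the continuous columns and leaves the binary weekday flags untouched. It runs on a made-up frame; `scale_continuous` is a hypothetical helper, and using `StandardScaler` here is an assumption about how the scaling would be done.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_continuous(frame, binary_columns):
    """Standardise every column except the listed binary indicator columns."""
    continuous = [col for col in frame.columns if col not in binary_columns]
    scaled = frame.copy()
    # StandardScaler centres each column to mean 0 and scales it to unit variance.
    scaler = StandardScaler()
    scaled[continuous] = scaler.fit_transform(frame[continuous].astype(float))
    return scaled, scaler


# Toy frame (made up): one continuous feature and one weekday flag.
toy = pd.DataFrame({"length": [100.0, 200.0, 300.0], "is_tue": [0, 1, 0]})
scaled_toy, fitted_scaler = scale_continuous(toy, binary_columns=["is_tue"])
print(scaled_toy)
```

The fitted scaler is returned so that the test set can be transformed with the training mean and variance (`fitted_scaler.transform(...)`) rather than being refitted.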

# Linear model
Try to predict the number of comments received in 24 hours.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:

model_name = 'basic_linear_model'

model = LinearRegression().fit(data, target)

In [None]:
model.score(data, target)

...meh...

In [None]:
np.argmax(model.coef_)

Apparently the largest positive coefficient in the model is at column index 17, i.e. Col_18. Col_18 corresponds to the per-source minimum of "Number of comments in the first 24 hours after the publication of the blog post". This is consistent with the correlation plots. Note that the data isn't standardised, so coefficient sizes also reflect the scale of each feature.
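
Because the data isn't standardised, the largest raw coefficient need not belong to the most influential feature. The synthetic example below (all names and numbers are made up) fits ordinary least squares with NumPy and shows the ranking flipping once each coefficient is multiplied by its feature's standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): x_small lives on a tiny scale, x_large on a big one.
x_small = rng.uniform(0.0, 0.01, size=200)
x_large = rng.uniform(0.0, 1000.0, size=200)
y = 100.0 * x_small + 1.0 * x_large

# Ordinary least squares with an intercept column; drop the intercept afterwards.
design = np.column_stack([np.ones_like(y), x_small, x_large])
coef = np.linalg.lstsq(design, y, rcond=None)[0][1:]

# Coefficient times feature standard deviation: the change in y per one
# standard deviation of that feature.
per_std = coef * np.array([x_small.std(), x_large.std()])

print(np.argmax(coef))     # index of the largest raw coefficient
print(np.argmax(per_std))  # index of the largest per-standard-deviation effect
```

Applied to the notebook's model, the analogous quantity would be `model.coef_ * data.std().values`.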

In [None]:
fig, ax = plt.subplots()
sns.regplot(x=df['Col_18'], y=df['Col_281'], ax=ax)
ax

Note: Col_18 looks quantised because it is a per-blog statistic: at the tail end there must be only a few blogs, each contributing multiple posts with the same value.

In [None]:
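# Basetime weekday flags, Col_263...Col_269
model.coef_[262:269]

In [None]:
# Publication weekday flags, Tuesday...Sunday (Col_271...Col_276)
model.coef_[270:276]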

In [None]:
df.iloc[:, 270:276].describe()

#### Test Data

In [None]:
test = test_df.iloc[:, 0:-2]
test_target = test_df.iloc[:, -1]

In [None]:
model.score(test, test_target)

#### Note:

My PC runs out of memory if I try to generate interaction features from the data. Could try Dask.
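
A back-of-the-envelope estimate shows why. The sketch below counts the pairwise interaction terms for the 279 columns in `data` and sizes a dense float64 matrix for them; `interaction_memory_gb` is a hypothetical helper, and the 50,000-row figure is an illustrative row count rather than the exact training size.

```python
from math import comb


def interaction_memory_gb(n_rows, n_features, bytes_per_value=8):
    """Size in GB of a dense float64 matrix of all pairwise interaction terms."""
    n_pairs = comb(n_features, 2)
    return n_rows * n_pairs * bytes_per_value / 1e9


# 279 feature columns as in `data`; 50,000 rows is an illustrative row count.
print(comb(279, 2))                        # 38781 interaction columns
print(interaction_memory_gb(50_000, 279))  # roughly 15.5 GB
```

Restricting interactions to a handful of the most correlated features would keep the matrix small enough for memory.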

# Experiment 2 Logistic Model

Predict "significant" activity, i.e. web blog posts with at least N comments.

Use logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

N = 1
class_target = (target > N-1).astype(int) # I.e. Active Blogs.

In [None]:
logi_model = LogisticRegression(max_iter=30000, solver='lbfgs').fit(data, class_target)

In [None]:
logi_model.score(data, class_target)

In [None]:
confusion_matrix(class_target, logi_model.predict(data))

In [None]:
confusion_matrix(class_target, logi_model.predict(data)).ravel()

In [None]:
# ravel() on a binary confusion matrix returns [tn, fp, fn, tp]

In [None]:
print(classification_report(class_target, logi_model.predict(data)))

In [None]:
predict = logi_model.predict(data).astype(bool)
predict

In [None]:
actual = class_target.values.astype(bool)
actual

In [None]:
true_positive = predict & actual
print(true_positive)
true_positives = true_positive.sum()
print(true_positives)

In [None]:
false_positive = predict & ~actual
print(false_positive)
false_positives = false_positive.sum()
print(false_positives)

In [None]:
true_negative = ~predict & ~actual
print(true_negative)

#### Test data

In [None]:
test = test_df.iloc[:, 0:-2]
test_target = test_df.iloc[:, -1]

class_target_test = (test_target > N-1).astype(int)

In [None]:
logi_model.score(test, class_target_test)

#### Note

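When I tested N=100 I got an overall F1 near 0.99, which is likely spurious: probably very few posts reach 100 comments, so predicting the majority class already scores highly.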
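
A simple sanity check for a score like this is to compare it with the accuracy of always predicting the most common class. The sketch below uses made-up labels; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Toy labels (made up): 1 active post out of 100.
toy_labels = [0] * 99 + [1]
print(majority_baseline_accuracy(toy_labels))  # 0.99
```

In this notebook the comparison value would be `majority_baseline_accuracy(list(class_target_test))`; a model only adds value if it beats that number.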

In [None]:
print(classification_report(class_target_test, logi_model.predict(test)))

#### Will more data help?

Turn `max_iter` down if fitting takes too long.

In [None]:
max_iter = 10000
#logi_model_100 = LogisticRegression(max_iter=10000).fit(data[:100], class_target[:100])
logi_model_1000 = LogisticRegression(max_iter=max_iter).fit(data[:1000], class_target[:1000])
logi_model_5000 = LogisticRegression(max_iter=max_iter).fit(data[:5000], class_target[:5000])
logi_model_10k = LogisticRegression(max_iter=max_iter).fit(data[:10000], class_target[:10000])
logi_model_20k = LogisticRegression(max_iter=max_iter).fit(data[:20000], class_target[:20000])
logi_model_50k = LogisticRegression(max_iter=max_iter).fit(data[:50000], class_target[:50000])

#### Note:

Lots of warnings about convergence. The current data is a mix of binary flags and unscaled numerical features; this can be optimised later.

In [None]:
r_vals = [#logi_model_100.score(test, class_target_test),
          logi_model_1000.score(test, class_target_test),
          logi_model_5000.score(test, class_target_test),
          logi_model_10k.score(test, class_target_test),
          logi_model_20k.score(test, class_target_test),
          logi_model_50k.score(test, class_target_test),
         ]

In [None]:
r_vals

In [None]:
train_sizes = [1000, 5000, 10000, 20000, 50000]
lines = plt.plot(train_sizes, r_vals, marker='o')

#### Notes:

Increasing the training set size *seems* to help... 
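
One caveat: the first rows shown by `df.head()` share identical source statistics, which suggests the training file is grouped by blog, so `data[:1000]` may cover only a few blogs. A fairer comparison would draw each subset from a shuffled copy. The sketch below shows the idea on made-up data; `shuffled_subsets` is a hypothetical helper.

```python
import pandas as pd


def shuffled_subsets(features, labels, sizes, seed=0):
    """Yield (size, X, y) using a single random shuffle of the rows."""
    shuffled = features.sample(frac=1.0, random_state=seed)
    for size in sizes:
        subset = shuffled.iloc[:size]
        # Look the labels up by index so X and y stay aligned.
        yield size, subset, labels.loc[subset.index]


# Toy data (made up) just to show the shapes that come back.
toy_X = pd.DataFrame({"f": range(10)})
toy_y = pd.Series([0, 1] * 5)
for size, X_part, y_part in shuffled_subsets(toy_X, toy_y, sizes=[2, 5]):
    print(size, X_part.shape, y_part.shape)
```

Each smaller subset is a prefix of the larger ones, so the scores differ only because rows were added.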

# Experiment 3 Random Forest

Random forest regression