<a href="https://www.kaggle.com/jaganadhg/clapmediumclap?scriptVersionId=88611497" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction
Medium is one of the leading digital publishing platforms. People from all disciplines publish their content on this platform. If a reader is impressed by a post, they can engage by adding a comment or giving it a clap. The current dataset contains details extracted from 6008 Medium blog posts published under various publication banners. The task is to predict claps from the article metadata.

In [None]:
%matplotlib inline
import os
import unicodedata
import warnings
warnings.simplefilter(action='ignore')

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import sklearn as sl
import sklearn.metrics  # load the metrics submodule explicitly; it is used later as sl.metrics

# Data Exploration

In [None]:
medium_data = pd.read_csv("../input/medium-articles-dataset/medium_data.csv")

In [None]:
medium_data.head(2)

In [None]:
medium_data.info()

## Attribute Information

The data contains ten attributes, including 'id'. The attributes are:
* id: Unique id for each record 
* url: URL for the Medium post
* title: title of the Medium post 
* subtitle: subtitle of the Medium post
* image: image file name if available. The images are available in the image folder. 
* claps: total claps received for the post. This is the target variable for our task here. 
* response: count of comments for the post.
* reading_time: reading time estimated by Medium for the post. 
* publication: The publication name in Medium, such as 'Towards Data Science.' 
* date:  date of publication. This is just a date, not date and time. 

In [None]:
for idx in range(10):
    print(medium_data.title[idx], medium_data.title[idx].split(" "))

**Potential Data Quality Issue**

There is a non-breaking space visible in the titles. It can interfere with tokenization, because splitting on a regular space will not separate the words it joins. 
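
To make the issue concrete, here is a minimal sketch using a made-up title (not taken from the dataset) that contains a non-breaking space (U+00A0). It suggests that splitting on a regular space keeps the joined words together, while NFKD normalization replaces the non-breaking space with a plain one.

```python
import unicodedata

# Hypothetical title with a non-breaking space (U+00A0) after "A"
sample_title = "A\u00a0Beginner Guide"

# Splitting on a regular space does not see the non-breaking space
naive_tokens = sample_title.split(" ")

# NFKD normalization maps U+00A0 to a regular space before splitting
normalized_tokens = unicodedata.normalize("NFKD", sample_title).split(" ")

print(len(naive_tokens))       # 2
print(normalized_tokens)       # ['A', 'Beginner', 'Guide']
```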

In [None]:
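def normalize_text(text : str) -> str:
    """ Normalize the unicode string
        :param text: text data
        :returns clean_text: clean text
    """
    
    # NaN never compares equal to anything, so check the type instead
    if isinstance(text, str):
        # NFKD maps compatibility characters such as the
        # non-breaking space to their plain equivalents
        clean_text = unicodedata.normalize("NFKD",
                                           text)
    else:
        clean_text = text
    
    return clean_text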

In [None]:
medium_data['clean_title'] = medium_data.title.apply(lambda x: normalize_text(x) if isinstance(x, str) else x)

In [None]:
medium_data.title[0], medium_data.clean_title[0]

In [None]:
medium_data['clean_subtitle'] = medium_data.subtitle.apply(lambda x: normalize_text(x) if isinstance(x, str) else x)
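
A note on guarding against missing values: assuming missing subtitles are loaded as float NaN (the pandas default for empty CSV fields), an inequality test against NaN cannot detect them, because NaN compares unequal to everything, including itself. The sketch below, with a hypothetical helper `is_text`, illustrates why a type check is the safer guard.

```python
import math

# pandas represents a missing text value as a float NaN
missing = float("nan")

# NaN is unequal to everything, itself included, so this is always True
nan_not_equal_to_itself = missing != missing

def is_text(value) -> bool:
    """Hypothetical helper: True only for real strings."""
    return isinstance(value, str)

print(nan_not_equal_to_itself)          # True
print(is_text(missing), is_text("ok"))  # False True
print(math.isnan(missing))              # True
```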

**Creating New Features**

We will create two new features: the title word count (title_wc) and the subtitle word count (subtitle_wc). 

In [None]:
def create_wc(text : str) -> int:
    """ Count words in a text
        :param text: String to count words in
        :returns wc: Word count
    """
    
    # split() without arguments splits on any whitespace,
    # including the non-breaking space noted earlier
    wc = len(text.split())
    
    return wc

In [None]:
medium_data.title[0].lower()

In [None]:
medium_data['title_wc'] = medium_data.title.apply(lambda x: create_wc(x) if isinstance(x, str) else 0)

In [None]:
medium_data['subtitle_wc'] = medium_data.subtitle.apply(lambda x: create_wc(x) if isinstance(x, str) else 0)
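
As an alternative sketch, the same two features could be computed with pandas string methods, which skip missing values without an explicit guard. The toy frame and the helper name `word_count` are made up for illustration and are not part of the dataset or the notebook.

```python
import pandas as pd

# Hypothetical toy frame; None mimics a missing subtitle
toy = pd.DataFrame({
    "title": ["A Beginner Guide", "Hello World"],
    "subtitle": ["Short subtitle", None],
})

def word_count(column: pd.Series) -> pd.Series:
    """Count whitespace-separated words; missing values count as 0."""
    return column.str.split().str.len().fillna(0).astype(int)

toy["title_wc"] = word_count(toy["title"])
toy["subtitle_wc"] = word_count(toy["subtitle"])

print(toy[["title_wc", "subtitle_wc"]])
```

The vectorized form avoids a Python-level lambda per row, which tends to matter more as the data grows.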

In [None]:
medium_data.head()

In [None]:
count_pub_ax = medium_data.publication.value_counts().plot(kind='bar',
                                                         figsize=(10,6),
                                                         rot=35,
                                                         align='center',
                                                         title="Count of Articles by Publication")
count_pub_ax.set_xlabel("Publication")
count_pub_ax.set_ylabel("Count")

In [None]:
pub_clap_ax = medium_data.groupby('publication')['claps'].sum().plot(kind='bar',
                                                                   figsize=(10,6),
                                                                   rot=35,
                                                                   align='center',
                                                                   title="Total Claps by Publication")
pub_clap_ax.set_xlabel("Publication")
pub_clap_ax.set_ylabel("Total Claps")

In [None]:
medium_data.title_wc.plot(kind='hist',
                         figsize=(10,6),
                         title="Histogram of Title Word Count")

In [None]:
medium_data.subtitle_wc.plot(kind='hist',
                             figsize=(10,6),
                             title="Histogram of Subtitle Word Count")

In [None]:
medium_data.reading_time.plot(kind='hist',
                         figsize=(10,6),
                         title="Histogram of Reading Time")

## Baseline Model
Now let's create a baseline model, nothing fancy yet. We will use the following attributes from the data and build a RandomForest model.  

#### Attributes Selected 
 * publication  
 * title_wc 
 * subtitle_wc
 * reading_time
 * claps

In [None]:
# Take an explicit copy so later modifications do not touch medium_data
model_data = medium_data[['publication','title_wc','subtitle_wc','reading_time','claps']].copy()
model_data.head()

The publication attribute is categorical, so we are applying one-hot encoding here. 

In [None]:
publications_cat = pd.get_dummies(model_data.publication)
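
For readers new to one-hot encoding, here is a small sketch of what `pd.get_dummies` produces: one indicator column per distinct category, with exactly one column set per row. The three-row series is invented for illustration; 'The Startup' is a placeholder publication name.

```python
import pandas as pd

# Hypothetical publication values for three posts
toy_publication = pd.Series(
    ["Towards Data Science", "The Startup", "Towards Data Science"],
    name="publication",
)

# One indicator column per distinct publication
one_hot = pd.get_dummies(toy_publication)

print(list(one_hot.columns))
print(one_hot.astype(int))
```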

The reading time ranges from 1 to 50. From the histogram, it is evident that very few posts have a reading time longer than 15. Let's clip the values at 15 for uniformity. 

In [None]:
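# Assign the clipped values back instead of relying on an in-place change to a column view
model_data['reading_time'] = model_data['reading_time'].clip(lower=1, upper=15)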
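
As a quick illustration of what clipping does, the sketch below applies the same bounds to a handful of invented reading times. Values above the upper bound are replaced by the bound rather than dropped, so the number of rows stays the same.

```python
import pandas as pd

# Hypothetical reading times in minutes
reading_time = pd.Series([1, 7, 15, 32, 50])

# Values outside [1, 15] are pulled to the nearest bound
clipped = reading_time.clip(lower=1, upper=15)

print(clipped.tolist())  # [1, 7, 15, 15, 15]
```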

In [None]:
model_data.reading_time.plot(kind='hist',
                         figsize=(10,6),
                         title="Histogram of Reading Time after Clip")

Let's add the one-hot encoded values to the dataframe. The publication column stays in place for now, because it is still needed to stratify the train-test split. 

In [None]:
#model_data.drop('publication',
#                inplace=True,
#               axis=1)
#model_data.head(2)

In [None]:
model_data_treated = pd.concat([publications_cat,model_data],
                              axis=1,
                               sort=False)
model_data_treated.head(2)

In [None]:
from sklearn.model_selection import train_test_split

We are not building a classification model, but we would still like each publication to be represented proportionally in both sets. We therefore stratify on the publication attribute and create a 70-30 train-test split. 

In [None]:
train,test = train_test_split(model_data_treated,
                              test_size=0.3,
                             stratify=model_data_treated['publication'])
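
The effect of stratification can be checked on a toy frame. The sketch below uses invented data with a 70/30 mix of two placeholder publications, 'A' and 'B', and compares the mix in each split; the fixed `random_state` is an addition for reproducibility and is not part of the split above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 70 posts from publication A, 30 from B
toy = pd.DataFrame({
    "publication": ["A"] * 70 + ["B"] * 30,
    "claps": range(100),
})

train_toy, test_toy = train_test_split(toy,
                                       test_size=0.3,
                                       stratify=toy["publication"],
                                       random_state=0)

# Share of each publication in the two splits
train_share = train_toy["publication"].value_counts(normalize=True)
test_share = test_toy["publication"].value_counts(normalize=True)

print(train_share)
print(test_share)
```

Both splits should keep roughly the original 70/30 mix, which is the point of passing `stratify`.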

Now we will drop the publication attribute from the data (train and test). 

In [None]:
# Reassign instead of dropping in place on a slice of the original frame
train = train.drop(columns='publication')
train.head(2)

In [None]:
# Reassign instead of dropping in place on a slice of the original frame
test = test.drop(columns='publication')
test.head(2)

# Modelling Time!
Let's build our baseline model here. 

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf_model = RandomForestRegressor()

In [None]:
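# Features are every column except the target
train_x = train.drop(columns='claps')
# Use a 1-D Series for the target, as scikit-learn expects
train_y = train['claps']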

In [None]:
_ = rf_model.fit(train_x,
            train_y)

In [None]:
# Reuse the training feature columns so columns added to `test` later are not picked up
test_x = test[train_x.columns]
test_y = test['claps']

In [None]:
predictions = rf_model.predict(test_x)

In [None]:
test["prediction"] = predictions

In [None]:
sl.metrics.mean_squared_error(test.claps, test.prediction)
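
A raw MSE on clap counts is hard to judge in isolation. One hedged way to read it is to take the square root (RMSE, in units of claps) and to compare it with a naive baseline that always predicts the mean. The numbers below are invented and only illustrate the arithmetic; they are not results from this notebook.

```python
import numpy as np

# Hypothetical clap counts, including one viral post
actual = np.array([10, 20, 30, 40, 900])
# Hypothetical model output for the same posts
predicted = np.array([15, 25, 25, 45, 400])

# Mean squared error and its square root (same unit as claps)
mse = np.mean((actual - predicted) ** 2)
rmse = np.sqrt(mse)

# Baseline: always predict the mean clap count
baseline_mse = np.mean((actual - actual.mean()) ** 2)

print(mse, rmse, baseline_mse)
```

In this toy case a single viral post dominates both errors, which hints at why heavy-tailed targets such as claps can make MSE look alarming.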

In [None]:
y = test.claps.values
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(y, predictions)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')


Now we have a pretty awesome... bleh model :-(. Time to put on a strategy hat and do more work!

## AutoML

Let's see what AutoML can do here. We are using the TPOT library.

In [None]:
from tpot import TPOTRegressor

In [None]:
automl_reg = TPOTRegressor(generations=10,
                          population_size=100,
                          verbosity=2,
                          random_state=2020,
                          early_stop=3)

In [None]:
automl_reg.fit(train_x,
            train_y)

In [None]:
automl_predict = automl_reg.predict(test_x)

In [None]:
sl.metrics.mean_squared_error(test.claps, automl_predict)

In [None]:
y = test.claps.values
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(y, automl_predict)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')

In [None]:
test["automl_predict"] = automl_predict

In [None]:
test.head(10)

In [None]:
medium_data.image.nunique()