# How many subscribers should I expect?...
## Predicting new subscriptions with Machine Learning

*In this article we learn how to build a simple time series model in Python to predict the number of new newsletter subscribers in ````Mailchimp````. We then demonstrate how this process can be greatly simplified with ````Magicsheets````.*

*NOTE: This notebook can be used directly with your own Mailchimp dataset. Just save it on your computer alongside your own dataset, named "data.csv", and re-run the code.*

Regardless of whether you

* are a subscription-based business,
* manage a ````Slack```` channel,
* manage your start-up's newsletter,
* etc.

your key metric to think about is new subscriber numbers.

For a **community manager**, knowing how much new-user traffic to expect in the coming days is key for managing resources effectively and using your time efficiently.

Getting your data into nice charts for analysis is one thing, but how do we **predict** the new subscriber number?

One way is to use a *time series* model.

> Our model will learn to spot a pattern in the number of daily signups and extend this pattern over an extra prediction period of 7 days.

Let's look at an example to understand what this means.
<!-- One simple and often effective way to do so is simple *regression models*.
What are these? Regression models are just fitting a line to your data.
If on x-axis (horizontal) we put the subscription times and y-axis (vertical) we plot subscription numbers, we might end up with a plot like this: -->

### The Mailchimp subscriber forecasting

We will try to answer the question:
> **How many new newsletter subscribers can I expect to get in the next week?**

We will do so by applying Machine Learning to find a pattern in our ````Mailchimp```` subscriber dataset and extend this pattern to formulate a prediction.

#### The dataset
First, let's get the past subscriber dataset from our ````Mailchimp````. 

*(You can read about exporting your contacts [here](https://mailchimp.com/help/view-export-contacts/). Once you log into Mailchimp, go to Audiences $\rightarrow$ select your audience $\rightarrow$ view contacts $\rightarrow$ export $\rightarrow$ select export to .csv.)*

Export your Mailchimp dataset and save it in the same folder as this Jupyter notebook. In this tutorial, we are using our own [Magicsheets.io](www.magicsheets.io) ````Mailchimp```` newsletter subscriber dataset, anonymised to protect the subscribers' privacy.

Once we have the dataset saved down in the same folder, we load it using the ````Pandas```` package. (The email addresses here are randomly generated.)

In [2]:
# get pandas
import pandas as pd

# load the dataset as "data"
data = pd.read_csv('data.csv')

# view the first 5 rows of the dataset (the default for head())
data.head()

Unnamed: 0,Email Address,MEMBER_RATING,OPTIN_TIME,CONFIRM_TIME,GMTOFF,DSTOFF,TIMEZONE,CC,LAST_CHANGED,NOTES,TAGS
0,lawrence@magicsheets.io,2,2021/10/14 4:57,2021/10/14 4:57,8.0,8.0,europe/london,uk,2021/10/14 4:57,,
1,hello@magicsheets.io,2,2021/4/17 16:06,2021/4/17 16:06,1.0,2.0,europe/brussels,be,2021/9/26 12:47,,
2,gogrogane0@over-blog.com,2,2021/10/14 11:21,2021/10/14 11:21,,,,,2021/10/14 11:21,,"""Add-on Signup"""
3,ddockery1@dropbox.com,4,2021/5/4 14:30,2021/5/4 14:30,1.0,2.0,europe/berlin,de,2021/9/26 12:47,,
4,abickerdyke2@photobucket.com,2,2021/4/22 16:41,2021/4/22 16:41,1.0,2.0,Europe/Brussels,BE,2021/9/26 12:53,,


#### Cleaning the data

````Mailchimp```` provides us with tonnes of useful information, such as the IP address of the subscriber, the location (actual geographical location!) of the IP address, the subscriber's time zone, etc.

For us, however, the only useful piece of information for now will be the time when the person actually became a subscriber.

> The key column for us is the "CONFIRM_TIME" column, which tells us when the person became a subscriber.

In [3]:
# define "times" variable to hold the desired data input
times = data['CONFIRM_TIME']

# parse "times" as datetimes (we don't care about the exact hour when the person subscribed; the values are reduced to dates in the grouping step below)
times = pd.to_datetime(times)

Once we have isolated the desired "times" dataset, we want to group the subscription numbers by date.

That is, we want to know how many new subscribers there were on each day.

In [5]:
times_day_count = times.groupby(times.dt.floor('d')).size().reset_index(name='count')
print(times_day_count)

   CONFIRM_TIME  count
0    2021-04-16      1
1    2021-04-17      4
2    2021-04-22      4
3    2021-04-23      1
4    2021-04-25      1
..          ...    ...
72   2021-10-14      5
73   2021-10-15      1
74   2021-10-16      1
75   2021-10-18      1
76   2021-10-24      1

[77 rows x 2 columns]
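
The grouped table lists only the days that had at least one signup (for example, 2021-04-17 is followed directly by 2021-04-22). A model with a weekly seasonal period generally expects one observation per calendar day, so it may help to fill the missing days with zero before fitting. A minimal sketch, assuming the same column names as above; the small ````times_day_count```` frame here is a stand-in built from the first rows of the printed output, and ````daily_counts```` is a name introduced for illustration:

```python
import pandas as pd

# Stand-in for the grouped table above (first five rows of the printed output).
times_day_count = pd.DataFrame({
    "CONFIRM_TIME": pd.to_datetime(
        ["2021-04-16", "2021-04-17", "2021-04-22", "2021-04-23", "2021-04-25"]
    ),
    "count": [1, 4, 4, 1, 1],
})

# Index by date, then insert every missing calendar day with a count of 0.
daily_counts = (
    times_day_count.set_index("CONFIRM_TIME")["count"]
    .asfreq("D", fill_value=0)
)

print(daily_counts)
```

Whether to fit on the gap-filled series or on the compact one is a modelling choice; the rest of this notebook keeps using ````times_day_count```` as it is.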


#### Forecasting Horizon (prediction period)

We define the prediction period (how many days we want to make predictions for).

Here we are focusing on a one-week prediction, but you can use any number of days you wish.

**Note:** *The farther ahead you look, the less accurate the predictions will be. You should also be careful to pick a prediction period that is sensible relative to your dataset size. For example, if your dataset covers 3 months, a week's worth of predictions might make sense, but a 5-year forecast based on the same data is unlikely to be accurate!*

We will use a simple, pre-defined class, ForecastingHorizon, from the ````sktime```` library, which contains easy-to-deploy time series models, like so:

In [16]:
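# import the ForecastingHorizon class
from sktime.forecasting.base import ForecastingHorizon

# define a Forecasting Horizon over the existing row index; because is_relative=False,
# it points at the rows already in the dataset rather than at future days
# (the 7-step-ahead horizon is passed to predict() directly further down)
fh = ForecastingHorizon(times_day_count.index, is_relative=False)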

#### The model
We now build the time series model. This can also be loaded directly from the ````sktime```` library.

> The model we are going to use is called Exponential Smoothing. You can read more about the theory behind it [here](https://en.wikipedia.org/wiki/Exponential_smoothing).

In [27]:
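# import the pre-defined model class from the sktime library
from sktime.forecasting.exp_smoothing import ExponentialSmoothing

# construct the model with an additive trend and additive weekly (sp=7) seasonality,
# matching the fitted model printed below
model = ExponentialSmoothing(trend='add', seasonal='add', sp=7)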
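
To build some intuition for what exponential smoothing does, here is its simplest form written out by hand: each smoothed value is a weighted average of the newest observation and the previous smoothed value, $s_t = \alpha y_t + (1 - \alpha) s_{t-1}$. This is only an illustration and not what ````sktime```` runs internally; the library model can additionally account for trend and weekly seasonality. The function name ````simple_exp_smoothing```` and the value ````alpha=0.5```` are assumptions made for this example.

```python
def simple_exp_smoothing(values, alpha=0.5):
    """Return the smoothed series s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [values[0]]  # start from the first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


# Daily signup counts taken from the first rows of the grouped table above.
counts = [1, 4, 4, 1, 1]
smoothed = simple_exp_smoothing(counts, alpha=0.5)

# Without trend or seasonality, the forecast for every future step
# is simply the last smoothed value.
forecast = smoothed[-1]
print(smoothed, forecast)
```

A larger ````alpha```` makes the forecast react faster to recent signups, while a smaller one averages over a longer history.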

#### Training the model

> We train our model on the time series dataset by calling its *fit* method on the *count* column of the ````times_day_count```` variable.

In [28]:
model.fit(times_day_count['count'])

ExponentialSmoothing(seasonal='add', sp=7, trend='add')

#### Predicting the future ````Mailchimp```` subscribers!

Once we have trained our model, we can finally generate predictions for the new subscriber numbers in the coming week.

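> To create the predictions, we can directly apply the ````predict```` method of the trained model, passing the Forecasting Horizon (prediction period) as its argument. Here ````fh=[1,2,3,4,5,6,7]```` asks for the next seven steps after the end of the series.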

In [22]:
import numpy as np

In [29]:
np.floor(model.predict(fh=[1,2,3,4,5,6,7]))

77    2.0
78    1.0
79    2.0
80    1.0
81    1.0
82    1.0
83    1.0
dtype: float64
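
The original question asks for a weekly figure, so the seven daily predictions still need to be added up. A small sketch using the floored values printed above; ````daily_forecast```` and ````weekly_total```` are names introduced here for illustration:

```python
import pandas as pd

# The seven floored daily predictions printed above,
# indexed by their position in the series.
daily_forecast = pd.Series(
    [2.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0],
    index=range(77, 84),
)

# Expected number of new subscribers over the whole week.
weekly_total = daily_forecast.sum()
print(f"Expected new subscribers next week: {weekly_total:.0f}")
```

Because ````np.floor```` rounds every day down, this total is on the conservative side; summing the unrounded predictions first and rounding once at the end avoids compounding that rounding.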


First, let's recap what it takes to get an answer from the current subscriber dataset by hand in Python:


1. Log into your Mailchimp, find your Audience signup data and download the dataset to your computer.
2. Load the dataset into Python with Pandas.
3. Identify and select the relevant data columns.
4. Build the *time series* model.
5. Train and test the model.
6. The next day (or week, or month): rinse, wash, repeat!

vs. with Magicsheets:

1. Log into your Mailchimp and select the audience you want predictions for.
2. Re-run the model any time you want in the dedicated Slack channel with '/run-magicpipe'.



In [25]:

from sktime.performance_metrics.forecasting import mean_squared_error

from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.utils.plotting import plot_series
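
The imports above point at a hold-out evaluation: set the most recent observations aside, fit on the rest, and compare the predictions against what actually happened. The notebook does not show that step, so the following is only a library-free sketch of the idea. It uses made-up counts and a naive "repeat the training mean" forecast as a stand-in for the real model; ````holdout_mse```` and ````mean_forecast```` are hypothetical helpers, not part of ````sktime````.

```python
import numpy as np


def holdout_mse(series, test_size, forecast_fn):
    """Hold out the last `test_size` points, forecast them, return the mean squared error."""
    series = np.asarray(series, dtype=float)
    train, test = series[:-test_size], series[-test_size:]
    predictions = forecast_fn(train, test_size)
    return float(np.mean((test - predictions) ** 2))


def mean_forecast(train, horizon):
    """Naive baseline: predict the training mean for every future step."""
    return np.full(horizon, train.mean())


# Illustrative daily counts (not the real Mailchimp data).
counts = [1, 4, 4, 1, 1, 2, 1, 3, 1, 1]
mse = holdout_mse(counts, test_size=3, forecast_fn=mean_forecast)
print(f"Hold-out MSE of the naive baseline: {mse:.3f}")
```

A forecasting model is only worth keeping if its hold-out error beats a simple baseline like this one.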

In [36]:
fh

ForecastingHorizon([57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
            74, 75, 76],
           dtype='int64', is_relative=False)

ExponentialSmoothing()

In [11]:
pred = model.predict(fh)

In [12]:
pred

0     2.538197
1     2.398095
2     2.544000
3     2.676615
4     2.523906
        ...   
72    2.157529
73    2.416427
74    2.287416
75    2.170156
76    2.063576
Length: 77, dtype: float64
