
# Time Series Data Collection

---
**Source:** Aileen Nielsen. [Practical Time Series Analysis](https://www.oreilly.com/library/view/practical-time-series/9781492041641/). O'Reilly, 2019.

---
**Scenario:**
Imagine working for a large nonprofit organization. You have been tracking a variety of factors:
* An email recipient's reaction to emails over time: Did they open the emails or not?
* A membership history: Were there periods when a member let their membership lapse?
* Transaction history: When does an individual buy and can we predict this?



In [None]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Get the data source (from the book repo)
PATH = 'https://github.com/PracticalTimeSeriesAnalysis/BookRepo/raw/master/Ch02/data/'

# Read the datasets
year_joined = pd.read_csv(PATH + 'year_joined.csv')
emails = pd.read_csv(PATH + 'emails.csv')
donations = pd.read_csv(PATH + 'donations.csv')

As you can see, we have several related datasets available. We will need to line them up, possibly dealing with disparate timestamping conventions or different levels of granularity in the data.

- `year_joined`: The year each member joined and the current status of their membership.
- `emails`: The number of emails you sent out in a given week that were opened by the member.
- `donations`: The times at which a member donated to your organization.
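
As a minimal sketch of what "different levels of granularity" can mean in practice, the example below lines up weekly records with yearly records by coarsening the weekly timestamp to a year before merging. The toy frames `weekly` and `yearly`, their values, and the derived column `years_as_member` are assumptions made for illustration; they are not the book's data.

```python
import pandas as pd

# Hypothetical toy data standing in for a weekly table and a yearly table
weekly = pd.DataFrame({
    "user": [1, 1, 2],
    "week": pd.to_datetime(["2017-12-25", "2018-01-01", "2018-01-08"]),
    "emailsOpened": [2, 3, 1],
})
yearly = pd.DataFrame({"user": [1, 2], "yearJoined": [2016, 2018]})

# Coarsen the weekly timestamp to the year so both tables share a granularity
weekly["year"] = weekly["week"].dt.year

# Attach the yearly attribute to every weekly row for the same user
merged = weekly.merge(yearly, on="user", how="left")
merged["years_as_member"] = merged["year"] - merged["yearJoined"]

print(merged)
```

A left merge keeps every weekly row, so users without a yearly record would show up with missing values rather than silently disappearing.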

In [None]:
year_joined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user        1000 non-null   int64 
 1   userStats   1000 non-null   object
 2   yearJoined  1000 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 23.6+ KB


In [None]:
year_joined.groupby('user').count().groupby('userStats').count()

userStats,yearJoined
1,1000


The result shows that each of the 1,000 users has exactly one status. The recorded year is therefore likely to be the year the member actually joined, while the accompanying status may be either the member's current status or their status at the time of joining.
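
The same check can be phrased more directly by counting distinct statuses per user. The sketch below runs on a small made-up frame (the status labels and the duplicated user are hypothetical) so that the failing case is visible:

```python
import pandas as pd

# Hypothetical toy data: user 3 deliberately appears with two statuses
toy = pd.DataFrame({
    "user": [1, 2, 3, 3],
    "userStats": ["bronze", "silver", "inactive", "gold"],
    "yearJoined": [2015, 2016, 2017, 2017],
})

# Number of distinct statuses recorded for each user
statuses_per_user = toy.groupby("user")["userStats"].nunique()

# Users with more than one status would need a closer look
multi_status_users = statuses_per_user[statuses_per_user > 1].index.tolist()
print(multi_status_users)
```

On a table where every user has a single status, `multi_status_users` would be an empty list.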

In [None]:
emails.user.value_counts()

932.0    173
896.0    173
155.0    172
396.0    171
867.0    171
        ... 
66.0       1
949.0      1
946.0      1
365.0      1
504.0      1
Name: user, Length: 539, dtype: int64

In [None]:
emails[emails.user == 998]

Unnamed: 0,emailsOpened,user,week
25464,1.0,998.0,2017-12-04 00:00:00
25465,3.0,998.0,2017-12-11 00:00:00
25466,3.0,998.0,2017-12-18 00:00:00
25467,3.0,998.0,2018-01-01 00:00:00
25468,3.0,998.0,2018-01-08 00:00:00
25469,2.0,998.0,2018-01-15 00:00:00
25470,3.0,998.0,2018-01-22 00:00:00
25471,2.0,998.0,2018-01-29 00:00:00
25472,3.0,998.0,2018-02-05 00:00:00
25473,3.0,998.0,2018-02-12 00:00:00


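We can see that some weeks are missing: there are no email events between December 18, 2017 and January 1, 2018, so the week of December 25, 2017 has no row. The code below counts how many weeks this member's records should span.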

In [None]:
emails.week = pd.to_datetime(emails.week)

user_weeks = emails[emails.user == 998].week
max_ = user_weeks.max()
min_ = user_weeks.min()
# Add one so that the first week is included in the count
sub = (max_ - min_).days/7 + 1

print(f"from {min_.date()} to {max_.date()}. so it is {int(sub)} weeks")

from 2017-12-04 to 2018-05-28. so it is 26 weeks


In [None]:
emails[emails.user == 998].shape

(24, 3)
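
Comparing the expected week count with the row count shows that weeks are missing, but not which ones. One possible way to list them, sketched here on the first five weeks shown for user 998 above, is to build the full weekly range and take the set difference. The variable names are illustrative, and the 7-day frequency assumes every record falls on the same weekday.

```python
import pandas as pd

# The first five weeks displayed for user 998 in the output above
observed = pd.to_datetime([
    "2017-12-04", "2017-12-11", "2017-12-18", "2018-01-01", "2018-01-08",
])

# Every week we would expect between the first and last observation
expected = pd.date_range(observed.min(), observed.max(), freq="7D")

# Weeks that are expected but have no record
missing = expected.difference(observed)

print(len(expected), len(observed))
print(missing)
```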

It’s a lot easier to fill in all missing weeks for all members by exploiting Pandas’
indexing functionality, rather than writing our own solution. We can generate a
MultiIndex for a Pandas data frame, which will create all combinations of weeks and
members—that is, a Cartesian product:

In [None]:
complete_idx = pd.MultiIndex.from_product((set(emails.week), set(emails.user)))

We use this index to reindex the original table and fill in the missing values, in this case with 0, on the assumption that nothing recorded means there was nothing to record. We also reset the index to make the user and week information available as columns, and name those columns:

In [None]:
all_email = emails.set_index(['week', 'user']).reindex(complete_idx,
                                                       fill_value=0).reset_index()

all_email.columns = ['week', 'user', 'email_opened']

all_email[all_email.user == 998].sort_values('week')

Unnamed: 0,week,user,email_opened
30183,2015-02-09,998.0,0.0
36112,2015-02-16,998.0,0.0
91629,2015-02-23,998.0,0.0
45814,2015-03-02,998.0,0.0
59828,2015-03-09,998.0,0.0
...,...,...,...
16169,2018-04-30,998.0,3.0
23176,2018-05-07,998.0,3.0
3233,2018-05-14,998.0,3.0
54977,2018-05-21,998.0,3.0
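
To make the effect of the Cartesian-product reindex easier to see, here is a small self-contained sketch on made-up data (the frame `toy` and its values are hypothetical). It differs slightly from the cell above in that it passes `names=` to `from_product`, so the columns already carry their names after `reset_index`.

```python
import pandas as pd

# Hypothetical toy data: user 2 has no record for the first week
toy = pd.DataFrame({
    "week": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-08"]),
    "user": [1, 1, 2],
    "emailsOpened": [3, 2, 1],
})

# All distinct weeks and users, sorted for a predictable row order
weeks = toy["week"].drop_duplicates().sort_values()
users = toy["user"].drop_duplicates().sort_values()

# Every (week, user) combination, whether or not it was observed
full_idx = pd.MultiIndex.from_product([weeks, users], names=["week", "user"])

# Reindex onto the full grid; unobserved combinations are filled with 0
filled = (
    toy.set_index(["week", "user"])
       .reindex(full_idx, fill_value=0)
       .reset_index()
)
print(filled)
```

The result has one row per week and user, and the previously absent combination appears with `emailsOpened` equal to 0.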


If we had the precise date a user started receiving emails, we would have an objective cutoff. As it is, we will let the data guide us. For each member we determine the `start_date` and `end_date` cutoffs by grouping the email `DataFrame` per user and selecting the minimum and maximum week values:

In [None]:
cutoff_dates = emails.groupby('user').week.agg(['min', 'max']).reset_index()

In [None]:
cutoff_dates

Unnamed: 0,user,min,max
0,1.0,2015-06-29,2018-05-28
1,3.0,2018-03-05,2018-04-23
2,5.0,2017-06-05,2018-05-28
3,6.0,2016-12-05,2018-05-28
4,9.0,2016-07-18,2018-05-28
...,...,...,...
534,991.0,2016-10-24,2016-10-24
535,992.0,2015-02-09,2015-07-06
536,993.0,2017-09-11,2018-05-28
537,995.0,2016-09-05,2018-05-28
