<a href="https://colab.research.google.com/github/hatim1971/covid-19/blob/master/Covid19_Feature_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature creation notebook

This notebook is the second in the COVID-19 project. It covers the transformation and creation of features that will let us use machine learning and deep learning algorithms to predict the evolution of the infection in three countries: Italy, Iran and Spain.

Time series forecasting is an important area of machine learning. However, although the time component carries additional information, it cannot be used as is by machine learning or deep learning algorithms. We therefore have to transform the time component without losing the information it carries. We will also add at least one feature to help the models perform better.



Let's create a function that converts dates into the format used in the file paths.
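
As a rough sketch of what this conversion has to produce: the daily report files are named with the date in month-day-year order (for example `01-22-2020.csv`). The same file names can also be built with `strftime` from the standard library. The helper name `daily_report_urls` is hypothetical and is not part of the original notebook; the base URL is the one used in the cell below.

```python
from datetime import date, timedelta

# Base URL taken from the notebook; the helper name below is illustrative only.
CONST_URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
             "csse_covid_19_data/csse_covid_19_daily_reports/")

def daily_report_urls(start, end):
    """Return one URL per day from start to end (inclusive), named mm-dd-yyyy.csv."""
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    return [CONST_URL + d.strftime("%m-%d-%Y") + ".csv" for d in days]

urls = daily_report_urls(date(2020, 1, 22), date(2020, 3, 18))
print(urls[0])    # first daily file
print(len(urls))  # number of daily files in the range
```

This only builds the URLs; loading them with `pd.read_csv` works the same way as in the notebook's own loop.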

In [0]:
import pandas as pd
import datetime as dt
df_global, df_complete = pd.DataFrame(), pd.DataFrame()

# Data is available on GitHub, provided by JHU and updated daily in separate files (one file per day).
# Each URL can be split into two parts: const_url and part_url
const_url="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"

# Last update date
date_update= '2020-03-18'

# this function changes the date format from yyyy-mm-dd to mm-dd-yyyy
import re
def date_format(date_str):
        return re.sub(r'(\d{4})-(\d{1,2})-(\d{1,2})', '\\2-\\3-\\1', date_str)


# Generate the URL of each daily file and load it.
# Each file is read once; frames are collected in lists and concatenated at the end.
frames_global, frames_complete = [], []
for day in pd.date_range('2020-01-22', date_update):
  file_url = const_url + date_format(day.strftime('%Y-%m-%d')) + '.csv'
  dfic = pd.read_csv(file_url)
  frames_complete.append(dfic)
  frames_global.append(dfic.iloc[:, [1, 2, 3]])  # Country/Region, Last Update, Confirmed
df_global = pd.concat(frames_global, sort=False)
df_complete = pd.concat(frames_complete, sort=False)

# Some countries are reported under two or more different names. This script maps them to a single name.
# The column is selected by name: with usecols=[1,2,3], position 1 of df_global is "Last Update", not the country.
country_names = {'Iran (Islamic Republic of)': 'Iran', 'Korea, South': 'South Korea'}
df_global['Country/Region'] = df_global['Country/Region'].replace(country_names)
df_complete['Country/Region'] = df_complete['Country/Region'].replace(country_names)

# group the data by country
gc_global=df_global.groupby('Country/Region')

# rename() and reset_index() return new objects here, which avoids the
# SettingWithCopyWarning raised by inplace edits on a group slice.
# drop=True avoids adding the old index as a new column.

# create dataframe of Italy
df_it=gc_global.get_group('Italy').rename(columns={"Last Update": "date"}).reset_index(drop=True)

# create dataframe of Iran
df_ir=gc_global.get_group('Iran').rename(columns={"Last Update": "date"}).reset_index(drop=True)

# create dataframe of Spain
df_sp=gc_global.get_group('Spain').rename(columns={"Last Update": "date"}).reset_index(drop=True)



In [0]:
df_it.head()

Unnamed: 0,Country/Region,date,Confirmed
0,Italy,1/31/2020 23:59,2.0
1,Italy,1/31/2020 8:15,2.0
2,Italy,2020-01-31T08:15:53,2.0
3,Italy,2020-01-31T08:15:53,2.0
4,Italy,2020-01-31T08:15:53,2.0


In [0]:
df_ir.head()

Unnamed: 0,Country/Region,date,Confirmed
0,Iran,2020-02-19T23:43:02,2.0
1,Iran,2020-02-20T17:33:02,5.0
2,Iran,2020-02-21T18:53:02,18.0
3,Iran,2020-02-22T10:03:05,28.0
4,Iran,2020-02-23T15:13:15,43.0


In [0]:
df_sp.head()

Unnamed: 0,Country/Region,date,Confirmed
0,Spain,2/1/2020 2:13,1.0
1,Spain,2020-02-01T23:43:02,1.0
2,Spain,2020-02-01T23:43:02,1.0
3,Spain,2020-02-01T23:43:02,1.0
4,Spain,2020-02-01T23:43:02,1.0


# **Feature Transformation**
Time series forecasting is an important area of machine learning. However, although the time component carries additional information, it cannot be used as is by machine learning or deep learning algorithms. We therefore have to transform the time component without losing the information it carries.
## **Using the starting point**
The starting point for each country is the day on which that country reached roughly 20 confirmed cases.
This allows us to compare the trajectory of confirmed cases between countries.
The infection trajectory is expressed on a standardized time axis (an integer day count) that shows how the infection evolves over time.
Defining a starting point gives us a common reference for assessing the trajectory of the pandemic and seeing how a given country is flattening the curve.

https://www.visualcapitalist.com/infection-trajectory-flattening-the-covid19-curve/
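
The starting point can also be located from the data instead of by row position. The sketch below is only an illustration under stated assumptions: `from_starting_point` is a hypothetical helper, the toy values are made up, and it applies a strict `>= threshold` rule. That rule is stricter than the hand-picked rows in this notebook, where the first retained counts are only approximately 20.

```python
import pandas as pd

def from_starting_point(df, threshold=20, column="Confirmed"):
    """Keep the rows from the first one whose count reaches the threshold."""
    reached = df[column] >= threshold
    if not reached.any():
        return df.iloc[0:0].copy()  # threshold never reached: return an empty frame
    first_pos = reached.to_numpy().argmax()  # position of the first True
    return df.iloc[first_pos:].reset_index(drop=True)

# Toy data shaped like the country frames in this notebook (values are illustrative).
toy = pd.DataFrame({
    "Country/Region": ["Italy"] * 5,
    "Confirmed": [2.0, 3.0, 20.0, 62.0, 155.0],
})
trimmed = from_starting_point(toy)
print(trimmed)
```

Selecting by threshold keeps working when new daily files are appended, whereas fixed row positions have to be checked by hand after each update.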



In [0]:
# To implement the starting point we drop the rows recorded before each country reached roughly 20 confirmed cases
df_itt=df_it.copy()
df_irt=df_ir.copy()
df_spt=df_sp.copy()
df_itt.drop(df_it.index[0:21], inplace=True)
df_spt.drop(df_sp.index[0:26], inplace=True)
df_irt.drop(df_ir.index[0:2], inplace=True)


In [0]:
df_itt.head()

Unnamed: 0,Country/Region,date,Confirmed
21,Italy,2020-02-21T23:33:06,20.0
22,Italy,2020-02-22T23:43:02,62.0
23,Italy,2020-02-23T23:43:02,155.0
24,Italy,2020-02-24T23:43:01,229.0
25,Italy,2020-02-25T18:55:32,322.0


In [0]:
df_irt.head()

Unnamed: 0,Country/Region,date,Confirmed
2,Iran,2020-02-21T18:53:02,18.0
3,Iran,2020-02-22T10:03:05,28.0
4,Iran,2020-02-23T15:13:15,43.0
5,Iran,2020-02-24T11:13:10,61.0
6,Iran,2020-02-25T14:53:03,95.0


In [0]:
df_spt.head()

Unnamed: 0,Country/Region,date,Confirmed
26,Spain,2020-02-27T13:23:02,15.0
27,Spain,2020-02-28T15:33:03,32.0
28,Spain,2020-02-29T19:13:08,45.0
29,Spain,2020-03-01T23:33:03,84.0
30,Spain,2020-03-02T14:43:05,120.0


## **Transforming time series data**
A calendar date is not informative in itself, because the trajectory of the COVID-19 pandemic depends on parameters and strategies that differ between countries.
We therefore transform the time series relative to the starting day of each country.
Dates will be converted from the %Y-%m-%dT%H:%M:%S format to an integer that counts the days since the starting point (roughly the 20th confirmed case).
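
A minimal sketch of this conversion with pandas, assuming every timestamp already follows the `%Y-%m-%dT%H:%M:%S` format. The helper name `to_day_number` and its `day_one` parameter are hypothetical; the three timestamps are the first Iran rows kept after the starting point.

```python
import pandas as pd

def to_day_number(dates, day_one):
    """Map timestamps to 1-based day numbers, with day_one mapped to 1."""
    parsed = pd.to_datetime(dates, format="%Y-%m-%dT%H:%M:%S")
    # normalize() drops the time of day so that only calendar days are compared
    return (parsed.dt.normalize() - pd.Timestamp(day_one)).dt.days + 1

stamps = pd.Series(["2020-02-21T18:53:02", "2020-02-22T10:03:05", "2020-02-23T15:13:15"])
day_numbers = to_day_number(stamps, "2020-02-21")
print(day_numbers.tolist())
```

Unlike a plain row counter, this keeps gaps visible: if a day is missing from the data, the day numbers skip a value instead of silently closing the gap.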

In [0]:
from datetime import datetime

Format = '%Y-%m-%dT%H:%M:%S'

# number of days elapsed between a timestamp and a reference date
# (the reference is the day before the starting point, so the starting day maps to 1)
def days_since(date_str, reference):
  return (datetime.strptime(date_str, Format) - datetime.strptime(reference, Format)).days

# Scaling datetime for Iran, day1 = 2020-02-21
ev_ir=df_irt.copy()
ev_ir['date'] = ev_ir['date'].map(lambda x : days_since(x, "2020-02-20T00:00:00"))
ev_ir.reset_index(inplace=True, drop=True)

# Scaling datetime for Italy, day1 = 2020-02-21
ev_it=df_itt.copy()
ev_it['date'] = ev_it['date'].map(lambda x : days_since(x, "2020-02-20T00:00:00"))
ev_it.reset_index(inplace=True, drop=True)
ev_it.drop(ev_it.index[20], inplace=True) # drop the row at position 20

# Scaling datetime for Spain, day1 = 2020-02-27
ev_esp=df_spt.copy()
ev_esp['date'] = ev_esp['date'].map(lambda x : days_since(x, "2020-02-26T00:00:00"))
ev_esp.reset_index(inplace=True, drop=True)
ev_esp.drop(ev_esp.index[14], inplace=True) # drop the row at position 14

# Renumber the date column as consecutive day counters 1, 2, 3, ...
# (this overwrites the day differences computed above)
for ev in (ev_it, ev_ir, ev_esp):
  ev['date'] = range(1, ev.shape[0] + 1)

In [0]:
ev_it.head()

Unnamed: 0,Country/Region,date,Confirmed
0,Italy,1,20.0
1,Italy,2,62.0
2,Italy,3,155.0
3,Italy,4,229.0
4,Italy,5,322.0


In [0]:
ev_ir.head()

Unnamed: 0,Country/Region,date,Confirmed
0,Iran,1,18.0
1,Iran,2,28.0
2,Iran,3,43.0
3,Iran,4,61.0
4,Iran,5,95.0


In [0]:
ev_esp.head()

Unnamed: 0,Country/Region,date,Confirmed
0,Spain,1,15.0
1,Spain,2,32.0
2,Spain,3,45.0
3,Spain,4,84.0
4,Spain,5,120.0


# Feature Creation
We can phrase the problem as a regression problem.

That is, given the number of confirmed cases on a given day, what is the number of confirmed cases on the following day?

We can write a simple function to convert our two columns of data into a three-column dataset: the first column contains the day counted from the starting point, the second the confirmed case count, and the third the next day's confirmed case count, which is the value to predict.

Thus, we will have features and labeled data to build our model in a supervised learning context.

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. ... A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
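
As a small illustration of these input-output pairs, the sketch below pairs each day's count with the next day's count using `shift(-1)`. The helper name `add_next_day_label` and the column name `Next Day Confirmed` are hypothetical. It also makes one design choice that is an assumption rather than the notebook's method: the last row is dropped because its next-day value is not yet known.

```python
import pandas as pd

def add_next_day_label(df, column="Confirmed", label="Next Day Confirmed"):
    """Return a copy with the next row's count as label; the last row is dropped."""
    out = df.copy()
    out[label] = out[column].shift(-1)  # row i receives the value of row i+1
    return out.dropna(subset=[label]).reset_index(drop=True)

# Toy frame using the first Spain values shown above.
toy = pd.DataFrame({"date": [1, 2, 3, 4], "Confirmed": [15.0, 32.0, 45.0, 84.0]})
labeled = add_next_day_label(toy)
print(labeled)
```

Dropping the last row avoids training on a label that is a placeholder rather than an observation.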


### **Feature creation: using the t+1 output**

The set of recent time steps used as input to predict the next one is called a window, and the size of the window is a parameter that can be tuned for each problem.

For example, given the current time (t) we want to predict the value at the next time in the sequence (t+1), we can use the current time (t), as well as the two prior times (t-1 and t-2) as input variables.

When phrased as a regression problem, the input variables are t-2, t-1, t and the output variable is t+1.

The create_dataset() function we created in the previous section allows us to create this formulation of the time series problem by increasing the look_back argument from 1 to 3.
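
`create_dataset()` is not defined in this notebook, so the following is only a hypothetical reconstruction of such a function, written to show what a window of size `look_back` looks like. The sample values are the first Italy counts shown above.

```python
def create_dataset(series, look_back=1):
    """Split a sequence into (inputs, output) pairs with look_back inputs each."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(list(series[i:i + look_back]))  # the look_back values ending at time t
        y.append(series[i + look_back])          # the value at time t+1
    return X, y

confirmed = [20.0, 62.0, 155.0, 229.0, 322.0]
X, y = create_dataset(confirmed, look_back=3)
print(X)  # each row holds the values at t-2, t-1, t
print(y)  # the matching values at t+1
```

With `look_back=3`, five observations yield two training pairs, so larger windows trade training examples for more context per example.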

In [0]:
# add a new column holding the next day's confirmed count (the value to predict)
def create_new_feature(df):
  # shift(-1) moves each value up by one row, so row i receives the count of row i+1
  # the last row has no next day; it is filled with 0 as a placeholder
  df['New Cases']= df.iloc[:,2].shift(-1).fillna(0)
  

In [0]:
create_new_feature(ev_esp)
ev_esp.head()

Unnamed: 0,Country/Region,date,Confirmed,New Cases
0,Spain,1,15.0,32.0
1,Spain,2,32.0,45.0
2,Spain,3,45.0,84.0
3,Spain,4,84.0,120.0
4,Spain,5,120.0,165.0


In [0]:
create_new_feature(ev_it)
ev_it.head()

Unnamed: 0,Country/Region,date,Confirmed,New Cases
0,Italy,1,20.0,62.0
1,Italy,2,62.0,155.0
2,Italy,3,155.0,229.0
3,Italy,4,229.0,322.0
4,Italy,5,322.0,453.0


In [0]:
create_new_feature(ev_ir)
ev_ir.head()

Unnamed: 0,Country/Region,date,Confirmed,New Cases
0,Iran,1,18.0,28.0
1,Iran,2,28.0,43.0
2,Iran,3,43.0,61.0
3,Iran,4,61.0,95.0
4,Iran,5,95.0,139.0
