# Data Acquisition and Processing Systems (DaPS) (ELEC0136)    
### Final Assignment
---

<div class="alert alert-heading alert-info">

#### Task 1: Data Acquisition

You will first have to acquire the necessary data for conducting your study. One essential type of
data that you will need is the stock prices for each company from April 2017 to April 2021 as
described in Section 1. Since these companies are public, the data is made available online. The
first task is for you to search and collect this data, finding the best way to access and download
it. A good place to look is on platforms that provide free data relating to the stock market such as
Google Finance or Yahoo! Finance.

[Optional] Providing more than one method to acquire the very same or different data, e.g. from
a downloaded comma-separated-value file and a web API, will result in a higher score.

There are many valuable sources of information for analysing the stock market. In addition to time
series depicting the evolution of stock prices, acquire auxiliary data that is likely to be useful for
the forecast, such as:

- Social Media, e.g., Twitter: This can be used to uncover the public’s sentimental
response to the stock market
- Financial reports: This can help explain what kind of factors are likely to affect the stock
market the most
- News: This can be used to draw links between current affairs and the stock market
- Climate data: Sometimes weather data is directly correlated to some companies’ stock
prices and should therefore be taken into account in financial analysis
- Others: anything that can justifiably support your analysis.

Remember, you are looking for historical data, not live data.
   
    
</div>

In [1]:
def acquire():
    import os
    import sys

    import csv
    import urllib.request  # urlretrieve lives in the urllib.request submodule

    import pandas_datareader.data as web
    import pandas as pd
    import numpy as np
    import datetime
    def download_save_csv(url,filename):
        """Download the .csv file at `url` into ./Dataset, skipping the download if the file already exists"""
        data_dir='./Dataset'
        if not os.path.exists(data_dir):
            os.makedirs(data_dir)    
        file_path=data_dir+'/'+filename+'.csv'
        if not os.path.exists(file_path):
            file_path, headers = urllib.request.urlretrieve(url,file_path)   
        return file_path

    stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/AAL?period1=1491004800&period2=1619827200&interval=1d&events=history&includeAdjustedClose=true'
    stock_Path=download_save_csv(stock_url,'stock')
    stock=pd.read_csv(stock_Path,parse_dates=True) 

    COVID_url = "https://covid19.who.int/WHO-COVID-19-global-data.csv"
    Cases_Path= download_save_csv(COVID_url,'cases')
    Cases=pd.read_csv(Cases_Path,parse_dates=True)  
    caseUS=Cases.loc[Cases["Country_code"] == "US"]
    Cases.shape[0]
    Cases.dtypes
    Precipitation_url = "https://www.ncdc.noaa.gov/cag/national/time-series/110-pcp-all-12-2017-2021.csv?base_prd=true&begbaseyear=2017&endbaseyear=2021"
    Precip_Path = download_save_csv(Precipitation_url ,'Precip')
    # The Date column stays in YYYYMM form here; it is converted to datetimes later in process()
    Precip=pd.read_csv(Precip_Path,skiprows=[0,1,2,3])
    Precip.shape[0]
    Precip.dtypes
    
    startDate=datetime.datetime(2017,4,1)
    endDate=datetime.datetime(2021,5,31)
    Stock =web.DataReader("AAL","yahoo",startDate,endDate)
    Stock.to_csv('./Dataset/Stock.csv',index=True, header=True)

    startDate=datetime.datetime(2017,4,1)
    endDate=datetime.datetime(2021,5,31)
    Oil =web.DataReader("CL=F","yahoo",startDate,endDate)
    Oil.to_csv('./Dataset/Oil.csv',index=True, header=True) 

    startDate=datetime.datetime(2017,4,1)
    endDate=datetime.datetime(2021,5,31)
    NASDAQ =web.DataReader("^IXIC","yahoo",startDate,endDate)
    NASDAQ.to_csv('./Dataset/NASDAQ.csv',index=True, header=True) 
    print('task1 succeed')
    return Stock,Oil, NASDAQ,caseUS,Precip
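
The `period1` and `period2` query parameters in `stock_url` are Unix timestamps (seconds since 1970-01-01 UTC). The sketch below shows how they could be derived from calendar dates rather than hard-coded. `to_epoch` and `build_history_url` are hypothetical helper names, and the URL layout is simply copied from the `stock_url` used in `acquire()`; it is not taken from any documented API.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def to_epoch(year, month, day):
    """Return the Unix timestamp of midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

def build_history_url(ticker, start, end):
    """Assemble a download URL with the same layout as stock_url in acquire()."""
    query = urlencode({
        "period1": to_epoch(*start),
        "period2": to_epoch(*end),
        "interval": "1d",
        "events": "history",
        "includeAdjustedClose": "true",
    })
    return f"https://query1.finance.yahoo.com/v7/finance/download/{ticker}?{query}"

# 1 April 2017 and 1 May 2021 give the same period values as the hard-coded stock_url
url = build_history_url("AAL", (2017, 4, 1), (2021, 5, 1))
print(url)
```

Building the URL this way would make it easier to change the date range or the ticker without recomputing the timestamps by hand.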

<div class="alert alert-heading alert-info">
    
## Task 2: Data Storage

Once you have found a way to acquire the relevant data, you need to decide on how to store it.
You should choose a format that allows an efficient read access to allow training a parametric
model. Also, the data corpus should be such that it can be easily inspected. Data can be stored
locally, on your computer.
    
</div>

In [2]:
def store(Stock,Oil, NASDAQ,caseUS,Precip):
    import os
    import sys

    import csv
    import urllib
    import pandas_datareader.data as web
    import pandas as pd
    import numpy as np
    import datetime
    import matplotlib
    import pymongo
    def save_to_db(dataName,Name):
        """Transform a DataFrame into a list of dictionaries, then save them to MongoDB"""
        if not dataName.count_documents({}, limit = 1):
            dataName.insert_many(Name.to_dict(orient='records'))
        else:
            dataName.delete_many({})
            dataName.insert_many(Name.to_dict(orient='records'))

    client = pymongo.MongoClient("mongodb+srv://hcy:0523@cluster0.piobu.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
    db = client.Assign
    StockPrices = db['StockPrices']
    Infects = db['Infects']
    Precip_db = db['Precip_db']
    Oil_db = db['Oil_db']
    NASDAQ_db = db['NASDAQ_db']

    Stock.reset_index(inplace=True)
    caseUS.reset_index(inplace=True)
    Precip.reset_index(inplace=True)
    Oil.reset_index(inplace=True)
    NASDAQ.reset_index(inplace=True)

    save_to_db(StockPrices,Stock)
    save_to_db(Infects,caseUS)
    save_to_db(Precip_db,Precip)
    save_to_db(Oil_db,Oil)
    save_to_db(NASDAQ_db,NASDAQ)

    client.close()
    print('task2 succeed')
    return StockPrices,Oil_db,NASDAQ_db, Infects, Precip_db
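
The connection string in `store()` embeds a username and password directly in the source. A common alternative is to read it from an environment variable so that credentials stay out of the notebook. This is only a minimal sketch: the variable name `MONGODB_URI` and the helper `get_mongo_uri` are assumptions for illustration and do not appear in the original code.

```python
import os

def get_mongo_uri(env_var="MONGODB_URI"):
    """Read the MongoDB connection string from the environment (hypothetical variable name)."""
    uri = os.environ.get(env_var)
    if not uri:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook")
    return uri

# Placeholder value for demonstration only; a real URI would be exported in the shell instead.
os.environ.setdefault("MONGODB_URI", "mongodb://localhost:27017")
print(get_mongo_uri())
```

The returned string could then be passed to `pymongo.MongoClient(...)` wherever the notebook currently hard-codes the URI.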

<div class="alert alert-heading alert-warning">

[Optional] Create a simple API to allow Al to retrieve the compound of data you collected. It is enough to provide a single access point that returns all the data; you do not need to implement a query mechanism. The API must be accessible from the web. If you engage in this task, the data must be stored online.  
    
</div>
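
One way to expose a single access point is sketched below using only the Python standard library. The names `DATA`, `build_payload`, `DataHandler` and `make_server` are hypothetical, and the record in `DATA` is a placeholder rather than real data; in this notebook the payload would more plausibly be assembled from the DataFrames returned by `retrieve()`. The notebook itself does not implement such a server, and a real deployment would need a host that is reachable from the web.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder content; not real market data.
DATA = {"StockPrices": [{"Date": "YYYY-MM-DD", "Close": 0.0}]}

def build_payload(data):
    """Serialise the whole data compound as one JSON document."""
    return json.dumps(data).encode("utf-8")

class DataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Single access point: every GET request returns the full payload.
        body = build_payload(DATA)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port=8000):
    """Create (but do not start) an HTTP server bound to the given port."""
    return HTTPServer(("", port), DataHandler)
```

Calling `make_server().serve_forever()` would start serving; no query parameters are interpreted, which matches the "single access point" wording of the task.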

In [3]:
def retrieve(dataName):
    """retrieve dataset from MongoDB"""
    import pymongo
    import pandas as pd
    client = pymongo.MongoClient("mongodb+srv://hcy:0523@cluster0.piobu.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
    db = client.Assign
    if dataName.count_documents({}, limit = 1):
        table = dataName.find()
        Data=pd.DataFrame.from_records(table)
        Data.drop('_id',axis = 1,inplace = True)
        client.close()
        return Data

    else:
        client.close()
        

<div class="alert alert-heading alert-info">

## Task 3: Data Preprocessing

Now that you have the data stored, you can start preprocessing it. Think about what features to
keep, which ones to transform, combine or discard. Make sure your data is clean and consistent
(e.g., are there many outliers? any missing values?). You are expected to:

1. Clean the data from missing values and outliers, if any.
2. Provide useful visualisation of the data. Plots should be saved on disk, and not printed on
the Jupyter notebook.
3. Transform your data (e.g., using normalization, dimensionality reduction, etc.) to improve
the forecasting performance.

</div>
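
For step 1, one common way to handle outliers is to flag points whose z-score exceeds a threshold and then cap them at the most extreme non-outlier value. The `process()` cell in this notebook takes that approach with `scipy.stats.zscore` and a threshold of 3. The plain-Python sketch below illustrates the idea on a toy list; `cap_outliers` is a hypothetical name, and the population standard deviation is used because that is the default of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev

def cap_outliers(values, threshold=3.0):
    """Replace values whose |z-score| exceeds threshold with the nearest inlier extreme."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)  # all values identical, so nothing can be an outlier
    is_outlier = [abs((v - mu) / sigma) > threshold for v in values]
    inliers = [v for v, out in zip(values, is_outlier) if not out]
    lo, hi, centre = min(inliers), max(inliers), mean(inliers)
    # Outliers above the inlier mean are capped at the inlier maximum, the rest at the minimum
    return [(hi if v > centre else lo) if out else v
            for v, out in zip(values, is_outlier)]

toy = list(range(1, 20)) + [200]  # toy data with one obvious outlier
print(cap_outliers(toy))
```

Capping keeps the number of observations unchanged, which matters for time series where dropping a row would leave a gap in the dates.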

In [4]:
def process(StockPrices,Oil_db,NASDAQ_db, Infects, Precip_db):
    """
    Data Preprocess
    """
    import os
    import sys

    import csv
    import urllib

    import pandas_datareader.data as web
    import pandas as pd
    import numpy as np
    import datetime
    import matplotlib
    import pymongo
    import matplotlib.pyplot as plt
    from scipy import stats
    import seaborn as sn
    from sklearn.preprocessing import MinMaxScaler
    """retrieve data from MongoDB"""
    client = pymongo.MongoClient("mongodb+srv://hcy:0523@cluster0.piobu.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
    db = client.Assign
    Stock_data=retrieve(db.StockPrices)
    Cases_data=retrieve(db.Infects)
    Precip_data=retrieve(db.Precip_db)
    Oil_data=retrieve(db.Oil_db)
    NASDAQ_data=retrieve(db.NASDAQ_db)
    """
    Data selection: keep only the useful columns of each dataset
    """
    Stock_data.rename(columns={'Close':'Stock_Price'},inplace=True)
    Stock_data=Stock_data.loc[:,['Date','Stock_Price']]
    Stock_data.set_index('Date',drop=True,inplace=True)

    Oil_data.rename(columns={'Close':'Oil_Price'},inplace=True)
    Oil_data=Oil_data.loc[:,['Date','Oil_Price']]
    Oil_data.set_index('Date',drop=True,inplace=True)

    NASDAQ_data.rename(columns={'Close':'NASDAQ_Index'},inplace=True)
    NASDAQ_data=NASDAQ_data.loc[:,['Date','NASDAQ_Index']]
    NASDAQ_data.set_index('Date',drop=True,inplace=True)

    Precip_data.set_index('Date',inplace=True)
    Precip_data.rename(columns={'Value':'Precipitation'},inplace=True)
    Precip_data=Precip_data.loc[:,['Precipitation']]

    Cases_data=Cases_data.loc[Cases_data["Date_reported"] < "2021-12-31"]
    Cases_data=Cases_data.rename(columns={'Date_reported':'Date'})
    Cases_data=Cases_data[['Date','New_cases']]
    Cases_data.set_index('Date',drop=True,inplace=True)
    Cases_data['New_cases']=True
    image_dir='./image/Data Preprocessing/outliers'
    
    if not os.path.exists(image_dir):
        os.makedirs(image_dir)   

    def Z_score(Data,yTitle,filename):
        """
        Using Z-Score to figure out outliers
        Then use Cap method to alter outliers
        """
        day=Data.reset_index()
        data=np.array(Data).reshape(1,len(day))[0]
        day=np.array(day.iloc[:,0])
        day=day.reshape(1,len(day))[0]
        z = np.abs(stats.zscore(data))
        threshold = 3
        loc = np.where(z > threshold)
        outlier = data[loc]

        drop = np.array([remain for remain in data if remain not in outlier])
        drop_day = np.array([remain for remain in day if remain not in day[loc]])
        cap=np.copy(data)
        # cap the outliers
        Num=[]
        for i,element in enumerate(cap[loc]):
            if element>np.mean(drop):
                Num.append(np.max(drop))
            else:
                Num.append(np.min(drop))

        cap[loc]=Num
        Data.iloc[:,0]=cap 

        file_path=image_dir+'/'+filename+'.jpg'

        fig, (ax1, ax2) = plt.subplots(2, 1, sharey=True,sharex=True, figsize=(10, 16))
        ax1.set_title('Before Cap')
        ax1.scatter(day,data)
        ax1.scatter(day[loc], data[loc], c='r')
        ax1.set_xlabel('Date')
        ax1.set_ylabel(yTitle)
        ax1.grid(which='major',axis='y')

        ax2.set_title('After cap')
        ax2.scatter(day,cap)
        ax2.scatter(day[loc], cap[loc], c='r')
        ax2.set_xlabel('Date')
        ax2.set_ylabel(yTitle)
        ax2.grid(which='major',axis='y')
        plt.savefig(file_path)
        plt.close(fig)

    def  IQR_score(outlier_dataset):
        """
        Just a test of the IQR score; it is not used in the program."""
        outlier_dataset=np.array(outlier_dataset)
        Q1 = np.quantile(outlier_dataset,0.25)
        Q3 = np.quantile(outlier_dataset,0.75)
        IQR = Q3-Q1
        Minimum = Q1-1.5*IQR
        Maximum = Q3+1.5*IQR

        outlier_by_IQR_Score=outlier_dataset[(outlier_dataset<Minimum) | (outlier_dataset>Maximum)]   

    plt.boxplot(Stock_data)
    plt.savefig('./image/Data Preprocessing/outliers/boxStock')
    Z_score(Stock_data,"Price in USD","Stock")
    plt.close()
    
    plt.boxplot(Oil_data)
    plt.savefig('./image/Data Preprocessing/outliers/boxOil')
    Z_score(Oil_data,"prices in USD","Oil")
    plt.close()
    
    plt.boxplot(NASDAQ_data)
    plt.savefig('./image/Data Preprocessing/outliers/boxNASDAQ')
    Z_score(NASDAQ_data,"prices in USD",'NASDAQ')
    plt.close()
    
    plt.boxplot(Precip_data)
    plt.savefig('./image/Data Preprocessing/outliers/boxPrecip')
    Z_score(Precip_data,"inch",'Precipitation')
    plt.close()

    image_dir='./image/Data Preprocessing/missing'
    if not os.path.exists(image_dir):
        os.makedirs(image_dir)  
    def Resample(df):
        """resample and interpolate monthly precipitation
        """

        df_RE = df.resample('1D')

        df_RE = df_RE.interpolate(method='time')


        return df_RE
    Precip_data.reset_index(inplace=True)
    Precip_data['Date']=[datetime.datetime.strptime(str(time),'%Y%m') for time in Precip_data['Date']]
    Precip_data.set_index('Date',inplace=True,drop=True)
    Precip_data=Resample(Precip_data)
    Cases_data.reset_index(inplace=True)
    Cases_data.set_index('Date',inplace=True,drop=True)
    Cases_data.index=pd.DatetimeIndex(Cases_data.index)
#     Combine.isnull().sum() 
    Combine = pd.concat([Stock_data,Oil_data,NASDAQ_data,Precip_data,Cases_data],axis=1)
    Combine.dropna(axis=0,subset = ['Stock_Price'],inplace=True)
    # Assign the result back instead of calling fillna in place on a column selection
    Combine['New_cases']=Combine['New_cases'].fillna(False)
    print('Whether any missing values remain:')
    print(Combine.isnull().any())
    Combine.to_csv('./Dataset/Combine.csv', index=True, header=True)
#     print(Combine.columns)
#     print(Combine.index)
    Y = Combine.loc[:,'Stock_Price']
    Y1 = Combine.loc[:,'Oil_Price']
    Y2 = Combine.loc[:,'NASDAQ_Index']
    Y3 = Combine.loc[:,'New_cases']
    Y4 = Combine.loc[:,'Precipitation']
    """
    Plot the data after processing outliers and missing values
    """
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharey=True, figsize=(14, 14))
    ax1.hist(Y)
    ax1.set_ylabel('Stock_Price Count')
    ax2.hist(Y1)
    ax2.set_ylabel('Oil_Price Count')
    ax3.hist(Y2)
    ax3.set_ylabel('NASDAQ_Index Count')
    ax4.hist(Y4)
    ax4.set_ylabel('Precipitation Count')
    plt.savefig('./image/Data Preprocessing/missing/hist')
    plt.close()  
    fig, (ax1, ax2, ax3, ax4, ax5) =plt.subplots(5, sharex=True,figsize=(10, 16))
    ax1.plot(Y,label='Stock_Price', c='r')
    ax1.set_ylabel('Stock Price in USD')
    ax1.grid()
    ax1.legend()

    ax2.plot(Y1,label='Oil_Price', c='b')
    ax2.set_ylabel('Oil Price in USD')
    ax2.grid()
    ax2.legend()

    ax3.plot(Y2,label='NASDAQ_Index', c='y')
    ax3.set_ylabel('NASDAQ Index')
    ax3.grid( )
    ax3.legend()

    ax4.plot(Y3,label='New_cases', c='g')
    ax4.set_ylabel('New COVID cases in US')
    ax4.grid()
    ax4.legend()

    ax5.plot(Y4,label='Precipitation', c='black')
    ax5.set_xlabel('Date')
    ax5.set_ylabel('Precipitation in Inches')
    ax5.grid()
    ax5.legend()

    plt.savefig('./image/Data Preprocessing/missing/plot')
    plt.close()
    Precip_data['Precipitation']=Combine['Precipitation']
    Precip_data.dropna(axis=0,inplace=True)
    Combine_fit=Combine.copy()
    """Normalized data to [0,1] scale."""
    scaler0 = MinMaxScaler()
    Stock_fit=Stock_data.copy()
    Stock_fit['Stock_Price']=scaler0.fit_transform(Stock_data)

    scaler1 = MinMaxScaler()
    Oil_fit=Oil_data.copy()
    Oil_fit['Oil_Price']=scaler1.fit_transform(Oil_data)

    scaler2 = MinMaxScaler()
    NASDAQ_fit=NASDAQ_data.copy()
    NASDAQ_fit['NASDAQ_Index']=scaler2.fit_transform(NASDAQ_data)

    scaler3 = MinMaxScaler()
    Precip_fit=Precip_data.copy()
    Precip_fit['Precipitation']=scaler3.fit_transform(Precip_data)
    X = Combine_fit['Stock_Price']
    X1 = Combine_fit['Oil_Price']
    X2 = Combine_fit['NASDAQ_Index']
    X4 = Combine_fit['Precipitation']
    Combine_fit=Combine.copy()
    scaler = MinMaxScaler()
    Combine_fit=scaler.fit_transform(Combine)
    Combine_fit=pd.DataFrame(Combine_fit,index=Combine.index,columns=Combine.columns)
    Combine_fit.boxplot()
    plt.savefig('./image/Data Preprocessing/missing/box_fit')
    plt.close()
    print('task3 succeed')
    return Stock_data,Oil_data, NASDAQ_data,Precip_data,Combine,Stock_fit,Oil_fit,NASDAQ_fit,Precip_fit,Combine_fit,scaler0
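
`process()` returns `scaler0`, the scaler fitted on the stock price, presumably so that scaled forecasts can later be mapped back to USD. With the default range of [0, 1], min-max scaling computes x' = (x - min) / (max - min), and the inverse is x = x' * (max - min) + min. The sketch below reproduces that arithmetic in plain Python on toy numbers; the function names are hypothetical and it is not a replacement for `MinMaxScaler`.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)

def transform(values, lo, hi):
    """Scale values to [0, 1]; assumes hi > lo."""
    span = hi - lo
    return [(v - lo) / span for v in values]

def inverse_transform(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]

prices = [10.0, 15.0, 20.0]  # toy values, not real prices
lo, hi = fit_min_max(prices)
scaled = transform(prices, lo, hi)
print(scaled, inverse_transform(scaled, lo, hi))
```

Keeping the fitted minimum and maximum is what makes the inverse step possible, which is why a fitted scaler has to be passed along with the scaled data.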

<div class="alert alert-heading alert-info">
    
## Task 4: Data Exploration

After ensuring that the data is well preprocessed, it is time to start exploring the data to carry out
hypotheses and intuition about possible patterns that might be inferred. Depending on the data,
different EDA (exploratory data analysis) techniques can be applied, and a large amount of
information can be extracted.
For example, you could do the following analysis:

    
- Time series data is normally a combination of several components:
  - Trend represents the overall tendency of the data to increase or decrease over time.
  - Seasonality is related to the presence of recurrent patterns that appear after regular
intervals (like seasons).
  - Random noise is often hard to explain and represents all those changes in the data
that seem unexpected. Sometimes sudden changes are related to fixed or predictable
events (i.e., public holidays).
- Features correlation provides additional insight into the data structure. Scatter plots and
boxplots are useful tools to spot relevant information.
- Explain unusual behaviour.
- Explore the correlation between stock price data and other external data that you can
collect (as listed in Sec 2.1)
- Use hypothesis testing to better understand the composition of your dataset and its
representativeness.

    
At the end of this step, provide key insights on the data. This data exploration procedure should
inform the subsequent data analysis/inference procedure, allowing one to establish a predictive
relationship between variables.

</div>
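
The hypothesis-testing item can be made concrete with a chi-square test of independence on a contingency table, which is what the `explore()` cell does via `scipy.stats.chi2_contingency` after binning each variable into low, mid and high. The sketch below computes the statistic by hand so the steps are visible. The counts are invented for illustration, `chi_square_statistic` is a hypothetical name, and the critical value is the tabulated 0.90 quantile for 4 degrees of freedom.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom, expected) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((obs - exp) ** 2 / exp
               for obs_row, exp_row in zip(table, expected)
               for obs, exp in zip(obs_row, exp_row))
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof, expected

# Illustrative counts only (rows: low/mid/high price, columns: low/mid/high auxiliary value)
table = [[30, 5, 5], [5, 30, 5], [5, 5, 30]]
stat, dof, _ = chi_square_statistic(table)
CRITICAL_90_DOF4 = 7.779  # tabulated 0.90 quantile of chi-square with 4 degrees of freedom
print(stat, dof, "dependent" if stat >= CRITICAL_90_DOF4 else "independent")
```

A statistic above the critical value leads to rejecting the null hypothesis of independence; the sketch assumes every expected count is non-zero.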

In [5]:
def explore(Stock_data,Oil_data, NASDAQ_data,Precip_data,Combine,Stock_fit,Oil_fit,NASDAQ_fit,Precip_fit,Combine_fit):
    """
    Data Exploration
    """
    import os
    import sys
    import pandas as pd
    import numpy as np
    import datetime
    import matplotlib
    import pymongo
    import matplotlib.pyplot as plt
    import warnings
    warnings.filterwarnings('ignore')
    from scipy import stats
    import seaborn as sn
    from sklearn.preprocessing import MinMaxScaler
    from scipy.stats import pearsonr,spearmanr,chi2, chi2_contingency
    from calendar import day_abbr, month_abbr, mdays
    import holidays

    image_dir='./image/Data Exploration'
    if not os.path.exists(image_dir):
        os.makedirs(image_dir)

    Des=Combine.describe()
    Des.to_csv('./Dataset/Combine_describe.csv',index=True, header=True) 

    def seasonal_cycle(Dataset,title):
        """
        Explore the seasonality of Dataset
        """
        seasonal_cycle = Dataset.rolling(window=30, center=True).mean().groupby(Dataset.index.dayofyear).mean()
        q25 = Dataset.rolling(window=30, center=True).mean().groupby(Dataset.index.dayofyear).quantile(0.25)
        q75 = Dataset.rolling(window=30, center=True).mean().groupby(Dataset.index.dayofyear).quantile(0.75)
        Day_in_Mon = mdays.copy()
        Day_in_Mon[2] = 29
        Day_in_Mon = np.cumsum(Day_in_Mon)
        month_ticks = month_abbr[1:]
        Fre, axes = plt.subplots(figsize=(10,7)) 

        seasonal_cycle.plot(ax=axes, lw=2, color='b', legend=False)
        axes.fill_between(seasonal_cycle.index, q25.values.ravel(), q75.values.ravel(), color='b', alpha=0.3)
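        # Place one tick at the first day of each month so the month labels line up
        axes.set_xticks(Day_in_Mon[:-1])
        axes.set_xticklabels(month_ticks)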
        axes.grid(ls=':')
        axes.set_xlabel('Month', fontsize=15)
        axes.set_ylabel(title, fontsize=15);
        axes.set_xlim(0, 365)
        [l.set_fontsize(13) for l in axes.xaxis.get_ticklabels()]
        [l.set_fontsize(13) for l in axes.yaxis.get_ticklabels()]

        axes.set_title('30 days running average '+title, fontsize=15)
        plt.savefig('./image/Data Exploration/seasonal_cycle '+title)
        plt.close()

    seasonal_cycle(Stock_data,'Price of stock')
    seasonal_cycle(Oil_data,'Price of Oil')
    seasonal_cycle(NASDAQ_data,'NASDAQ Index')
    seasonal_cycle(Precip_data,'Precipitation')

    def year_and_month(Dataset,title):
        """
        Explore the year and month regularity of Dataset
        """
        month_year = Dataset.copy()
        month_year.loc[:,'year'] = month_year.index.year
        month_year.loc[:,'month'] = month_year.index.month
        month_year = month_year.groupby(['year','month']).mean().unstack()
        month_year.columns = month_year.columns.droplevel(0)

        f, ax = plt.subplots(figsize=(12,6))

        # No fixed colour-bar boundaries: the range is taken from the data being plotted
        sn.heatmap(month_year, ax=ax, cmap=plt.cm.viridis)

        cbax = f.axes[1]
        [l.set_fontsize(13) for l in cbax.yaxis.get_ticklabels()]
        cbax.set_ylabel(title, fontsize=13)

        [ax.axhline(x, ls=':', lw=0.5, color='0.8') for x in np.arange(1, 7)]
        [ax.axvline(x, ls=':', lw=0.5, color='0.8') for x in np.arange(1, 24)];

        ax.set_title(title+' per year and month', fontsize=16)

        [l.set_fontsize(13) for l in ax.xaxis.get_ticklabels()]
        [l.set_fontsize(13) for l in ax.yaxis.get_ticklabels()]

        ax.set_xlabel('Month', fontsize=15)
        ax.set_ylabel('Year', fontsize=15)
        ax.set_yticklabels(np.arange(2017, 2022, 1), rotation=0);
        plt.savefig('./image/Data Exploration/year_and_month '+title)
        plt.close()

    year_and_month(Stock_data,'Price of stock')
    year_and_month(Oil_data,'Price of Oil')
    year_and_month(NASDAQ_data,'NASDAQ Index')
    year_and_month(Precip_data,'Precipitation')

    def month_day(Dataset,title):
        """
        Explore the day of the week and month regularity of Dataset
        """
        month_day = Dataset.copy()
        month_day.loc[:,'day_of_week'] = month_day.index.dayofweek
        month_day.loc[:,'month'] = month_day.index.month
        month_day = month_day.groupby(['day_of_week','month']).mean().unstack()
        month_day.columns = month_day.columns.droplevel(0)

        f, ax = plt.subplots(figsize=(12,6))

        # No fixed colour-bar boundaries: the range is taken from the data being plotted
        sn.heatmap(month_day, ax = ax, cmap=plt.cm.viridis)

        cbax = f.axes[1]
        [l.set_fontsize(13) for l in cbax.yaxis.get_ticklabels()]
        cbax.set_ylabel(title, fontsize=13)

        [ax.axhline(x, ls=':', lw=0.5, color='0.8') for x in np.arange(1, 7)]
        [ax.axvline(x, ls=':', lw=0.5, color='0.8') for x in np.arange(1, 24)];

        ax.set_title(title+' per day of the week and month', fontsize=16)

        [l.set_fontsize(13) for l in ax.xaxis.get_ticklabels()]
        [l.set_fontsize(13) for l in ax.yaxis.get_ticklabels()]

        ax.set_xlabel('Month', fontsize=15)
        ax.set_ylabel('Day of the week', fontsize=15)
        ax.set_yticklabels(day_abbr[0:month_day.shape[0]]);
        plt.savefig('./image/Data Exploration/month_day '+title)
        plt.close()

    month_day(Stock_data,'Price of stock')
    month_day(Oil_data,'Price of Oil')
    month_day(NASDAQ_data,'NASDAQ Index')
    month_day(Precip_data,'Precipitation')

    holidays_df = pd.DataFrame([], columns = ['ds','holiday'])
    ldates = []
    lnames = []
    for date, name in sorted(holidays.UnitedStates(years=np.arange(2017, 2021 + 1)).items()):
        ldates.append(date)
        lnames.append(name)
    """
    Filter the dates in the dataset that are also holidays
    """
    ldates = np.array(ldates)
    lnames = np.array(lnames)
    holidays_df.loc[:,'ds'] = ldates
    holidays_df.loc[:,'holiday'] = lnames
    holidays_df.holiday.unique()
    holidays_df.loc[:,'holiday'] = holidays_df.loc[:,'holiday'].apply(lambda x : x.replace(' (Observed)',''))

    Dataset = Stock_data.copy()
    # Named holiday_rows rather than holidays so the imported `holidays` module is not shadowed here
    holiday_rows = Dataset.loc[Dataset.index.isin(holidays_df['ds'])]
    normaldays = Dataset.loc[~Dataset.index.isin(holidays_df['ds'])]
    summary_month_holidays = holiday_rows.groupby(holiday_rows.index.month).describe()
    summary_month_holidays.columns = summary_month_holidays.columns.droplevel(0)
    summary_month_holidays["mean"]
    print('Holidays on which the stock market was open: ')
    print(holiday_rows)
    def holidays(Dataset,title):
        """
  
        Explore the holiday regularity of Dataset
        """
        Dataset = Dataset.copy()
        holidays = Dataset.loc[Dataset.index.isin(holidays_df['ds'])]
        normaldays = Dataset.loc[~Dataset.index.isin(holidays_df['ds'])]
        summary_month_holidays = holidays.groupby(holidays.index.month).describe()
        summary_month_normaldays = normaldays.groupby(normaldays.index.month).describe()
        summary_month_holidays.columns = summary_month_holidays.columns.droplevel(0)
        summary_month_normaldays.columns = summary_month_normaldays.columns.droplevel(0)
        f, ax = plt.subplots(figsize=(10,7))

        ax.plot(summary_month_holidays.index, summary_month_holidays.loc[:,'mean'], color='y', label='Holidays', ls='--', lw=3)
        ax.fill_between(summary_month_holidays.index, summary_month_holidays.loc[:,'25%'], \
                        summary_month_holidays.loc[:,'75%'], facecolor='y', alpha=0.1)
        ax.plot(summary_month_normaldays.index, summary_month_normaldays.loc[:,'mean'], color='b', label='Weekdays', lw=3)
        ax.fill_between(summary_month_normaldays.index, summary_month_normaldays.loc[:,'25%'], \
                        summary_month_normaldays.loc[:,'75%'], facecolor='b', alpha=0.1)
        ax.legend(fontsize=15)
        ax.set_xticks(range(1,13));
        ax.grid(ls=':', color='0.8')
        ax.set_xlabel('Month', fontsize=15)
        ax.set_ylabel(title, fontsize=15);

        [l.set_fontsize(13) for l in ax.xaxis.get_ticklabels()]
        [l.set_fontsize(13) for l in ax.yaxis.get_ticklabels()]

        plt.savefig('./image/Data Exploration/holidays '+title)
        plt.close()

    holidays(Stock_data,'Price of stock')
    holidays(Oil_data,'Price of Oil')
    holidays(NASDAQ_data,'NASDAQ Index')
    holidays(Precip_data,'Precipitation')
    Y = Combine.loc[:,'Stock_Price']
    Y1 = Combine.loc[:,'Oil_Price']
    Y2 = Combine.loc[:,'NASDAQ_Index']
    Y3 = Combine.loc[:,'New_cases']
    Y4 = Combine.loc[:,'Precipitation']
    image_dir='./image/Data Exploration/Data Relationships'
    if not os.path.exists(image_dir):
        os.makedirs(image_dir)
    """
    Calculate the variance of the data, and the covariance and correlation between datasets
    """
    def DEmean(x):
        x_mean = np.mean(x)
        return [i - x_mean for i in x]

    def covariance(x, y):
        n = len(x)
        return np.dot(DEmean(x), DEmean(y)) / (n-1)
    var0=np.var(Y)
    cov1=covariance(Y, Y1)
    Cov1=np.cov(Y, Y1)
    sn.heatmap(Cov1, annot=True, fmt='g')
    plt.savefig('./image/Data Exploration/Data Relationships/Stock Oil')
    plt.close()
    var1=np.var(Y1)
    corr1, _ = pearsonr(Y, Y1)
    Corr1, _ = spearmanr(Y, Y1)
    plt.scatter(Y, Y1)
    plt.title('Stock Price  VS Crude Oil Price ')
    plt.ylabel('Crude Oil Price ($USD$)')
    plt.xlabel('Stock Price ($USD$)')
    plt.savefig('./image/Data Exploration/Data Relationships/Stock Oil scatter')
    plt.close()

    cov2=covariance(Y, Y2)
    Cov2=np.cov(Y, Y2)

    sn.heatmap(Cov2, annot=True, fmt='g')
    plt.savefig('./image/Data Exploration/Data Relationships/Stock NASDAQ')
    plt.close()

    var2=np.var(Y2)

    corr2, _ = pearsonr(Y, Y2)


    Corr2, _ = spearmanr(Y, Y2)


    plt.scatter(Y, Y2)
    plt.title('Stock Price  VS NASDAQ_Index ')
    plt.ylabel('NASDAQ_Index ($USD$)')
    plt.xlabel('Stock Price ($USD$)')
    plt.savefig('./image/Data Exploration/Data Relationships/Stock NASDAQ scatter')
    plt.close()

    cov4=covariance(Y, Y4)
    Cov4=np.cov(Y, Y4)

    sn.heatmap(Cov4, annot=True, fmt='g')
    plt.savefig('./image/Data Exploration/Data Relationships/Stock Precipitation')
    plt.close()

    var4=np.var(Y4)

    corr4, _ = pearsonr(Y, Y4)
    Corr4, _ = spearmanr(Y, Y4)
    plt.scatter(Y, Y4)
    plt.title('Stock Price  VS Precipitation ')
    plt.ylabel('Precipitation ($Inch$)')
    plt.xlabel('Stock Price ($USD$)')
    plt.savefig('./image/Data Exploration/Data Relationships/Stock Precipitation scatter')
    plt.close()
    columns=['Oil_Price', 'NASDAQ_Index','Precipitation']
    Oil_Price=[var1,cov1,corr1,Corr1]
    NASDAQ_Index=[var2,cov2,corr2,Corr2]
    Precipitation=[var4,cov4,corr4,Corr4]
    index=['Variance','Covariance','Pearsonr_correlation','Spearmanr_correlation']

    dic={'Oil_Price':Oil_Price, 'NASDAQ_Index':NASDAQ_Index, 
          'Precipitation':Precipitation}

    df=pd.DataFrame(data=dic,index=index)
    df['Stock_Price']=[var0,"-","-","-"]
    print(df)
    def Chi_Square_Stock_Price(tar1,tar2):
        """
        Chi-Square test for stock price and other auxiliary Dataset
        """
        thres1=1/3*(max(Combine[tar1])-min(Combine[tar1]))+min(Combine[tar1])
        thres2=2/3*(max(Combine[(tar1)])-min(Combine[tar1]))+min(Combine[tar1])

        stocklow = len(Combine[(Combine[tar1]<thres1)])
        stockmid = len(Combine[(Combine[tar1]<thres2)&(Combine[tar1]>=thres1)])
        stockhigh = len(Combine[(Combine[tar1]>=thres2)])

        Thres1=1/3*(max(Combine[tar2])-min(Combine[tar2]))+min(Combine[tar2])
        Thres2=2/3*(max(Combine[tar2])-min(Combine[tar2]))+min(Combine[tar2])
        auxiliaryLow_Day=len(Combine[(Combine[tar2]<Thres1)])
        auxiliaryMid_Day=len(Combine[(Combine[tar2]<Thres2)&(Combine[tar2]>=Thres1)])
        auxiliaryHigh_Day=len(Combine[(Combine[tar2]>=Thres2)])

        stocklow_auxiliaryLow = len(Combine[(Combine[tar2]<Thres1)&
                             (Combine[tar1]<thres1)])
        stocklow_auxiliaryMid = len(Combine[(Combine[tar2]<Thres2)&(Combine[tar2]>=Thres1)&
                                            (Combine[tar1]<thres1)])
        stocklow_auxiliaryHigh = len(Combine[(Combine[tar2]>=Thres2)&
                              (Combine[tar1]<thres1)])

        stockmid_auxiliaryLow = len(Combine[(Combine[tar2]<Thres1)&
                                            (Combine[tar1]<thres2)&(Combine[tar1]>=thres1)])
        stockmid_auxiliaryMid = len(Combine[(Combine[tar2]<Thres2)&(Combine[tar2]>=Thres1)&
                                            (Combine[tar1]<thres2)&(Combine[tar1]>=thres1)])
        stockmid_auxiliaryHigh = len(Combine[(Combine[tar2]>=Thres2)&
                              (Combine[tar1]<thres2)&(Combine[tar1]>=thres1)])

        stockhigh_auxiliaryLow = len(Combine[(Combine[tar2]<Thres1)&
                              (Combine[tar1]>=thres2)])
        stockhigh_auxiliaryMid = len(Combine[(Combine[tar2]<Thres2)&(Combine[tar2]>=Thres1)&
                              (Combine[tar1]>=thres2)])
        stockhigh_auxiliaryHigh = len(Combine[(Combine[tar2]>=Thres2)&
                               (Combine[tar1]>=thres2)])

        auxiliaryLow=[stocklow_auxiliaryLow,stockmid_auxiliaryLow,stockhigh_auxiliaryLow]  
        auxiliaryMid=[stocklow_auxiliaryMid,stockmid_auxiliaryMid,stockhigh_auxiliaryMid] 
        auxiliaryHigh=[stocklow_auxiliaryHigh,stockmid_auxiliaryHigh,stockhigh_auxiliaryHigh]                                      

        index=['stocklow','stockmid','stockhigh']
        D={'auxiliaryLow':auxiliaryLow,'auxiliaryMid':auxiliaryMid,'auxiliaryHigh':auxiliaryHigh}
        Table_Simple=pd.DataFrame(data=D,index=index,columns=None)   

        index=['stocklow','stockmid','stockhigh','Total']
        auxiliaryLow=[stocklow_auxiliaryLow,stockmid_auxiliaryLow,stockhigh_auxiliaryLow,auxiliaryLow_Day]  
        auxiliaryMid=[stocklow_auxiliaryMid,stockmid_auxiliaryMid,stockhigh_auxiliaryMid,auxiliaryMid_Day] 
        auxiliaryHigh=[stocklow_auxiliaryHigh,stockmid_auxiliaryHigh,stockhigh_auxiliaryHigh,auxiliaryHigh_Day]

        Total=[stocklow,stockmid,stockhigh,(stocklow+stockmid+stockhigh)]
        D2={'auxiliaryLow':auxiliaryLow,'auxiliaryMid':auxiliaryMid,'auxiliaryHigh':auxiliaryHigh, 'Total': Total}
        Table_with_Total = pd.DataFrame(data=D2,index=index,columns=None)
        return (Table_Simple,Table_with_Total)
    def Chi_Sqaure(CaliTable):
        """
        Use a Chi-Square test to judge dependence or independence
        """
        stat, p, dof, expected = chi2_contingency(CaliTable)
        print("statistic",stat)
        print("p-value",p)
        print("degrees of freedom: ",dof)
        print("table of expected frequencies\n",expected)
        prob = 0.90
        critical = chi2.ppf(prob, dof)
        if abs(stat) >= critical:
            print('Dependent (reject H0)')
        else:
            print('Independent (fail to reject H0)')

    OilTable, OilTable_with_Total=Chi_Square_Stock_Price('Stock_Price','Oil_Price')
    Chi_Sqaure(OilTable)

    NASDAQTable, NASDAQTable_with_Total=Chi_Square_Stock_Price('Stock_Price','NASDAQ_Index')
    Chi_Sqaure(NASDAQTable)

    PrecipitationTable, PrecipitationTable_with_Total=Chi_Square_Stock_Price('Stock_Price','Precipitation')
    Chi_Sqaure(PrecipitationTable)

    def Chi_Square_Stock_Price_for_Pandemic(tar1,tar2):
        """
        Chi-Square test for stock price, other auxiliary Dataset and pandemic conditions
        """
        thres1=1/3*(max(Combine[tar1])-min(Combine[tar1]))+min(Combine[tar1])
        thres2=2/3*(max(Combine[(tar1)])-min(Combine[tar1]))+min(Combine[tar1])

        stocklow = len(Combine[(Combine[tar1]<thres1)])
        stockmid = len(Combine[(Combine[tar1]<thres2)&(Combine[tar1]>=thres1)])
        stockhigh = len(Combine[(Combine[tar1]>=thres2)])

        Pandemic_Day=len(Combine[(Combine[tar2]==True)])
        NoPandemic_Day=len(Combine[(Combine[tar2]==False)])

        stocklow_Pandemic = len(Combine[(Combine[tar2]==True)&
                             (Combine[tar1]<thres1)])
        stocklow_NoPandemic = len(Combine[(Combine[tar2]==False)&
                                            (Combine[tar1]<thres1)])

        stockmid_Pandemic = len(Combine[(Combine[tar2]==True)&
                                            (Combine[tar1]<thres2)&(Combine[tar1]>=thres1)])
        stockmid_NoPandemic = len(Combine[(Combine[tar2]==False)&
                                            (Combine[tar1]<thres2)&(Combine[tar1]>=thres1)])

        stockhigh_Pandemic = len(Combine[(Combine[tar2]==True)&
                              (Combine[tar1]>=thres2)])
        stockhigh_NoPandemic = len(Combine[(Combine[tar2]==False)&
                              (Combine[tar1]>=thres2)])

        Pandemic=[stocklow_Pandemic,stockmid_Pandemic,stockhigh_Pandemic]  
        NoPandemic=[stocklow_NoPandemic,stockmid_NoPandemic,stockhigh_NoPandemic] 


        index=['stocklow','stockmid','stockhigh']
        D={'Pandemic':Pandemic,'NoPandemic':NoPandemic}
        Table_Simple=pd.DataFrame(data=D,index=index,columns=None)   

        index=['stocklow','stockmid','stockhigh','Total']
        Pandemic=[stocklow_Pandemic,stockmid_Pandemic,stockhigh_Pandemic,Pandemic_Day]  
        NoPandemic=[stocklow_NoPandemic,stockmid_NoPandemic,stockhigh_NoPandemic,NoPandemic_Day] 

        Total=[stocklow,stockmid,stockhigh,(stocklow+stockmid+stockhigh)]
        D2={'Pandemic':Pandemic,'NoPandemic':NoPandemic, 'Total': Total}
        Table_with_Total = pd.DataFrame(data=D2,index=index,columns=None)
        return (Table_Simple,Table_with_Total)
    casesTable, casesTable_with_Total=Chi_Square_Stock_Price_for_Pandemic('Stock_Price','New_cases')
    Chi_Sqaure(casesTable)
    print('task4 succeed')
    return holidays_df # holiday data is reused by the inference step
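
`Chi_Sqaure` compares the statistic with the critical value `chi2.ppf(prob, dof)`, which is the same decision as checking whether the p-value is at most `1 - prob`. As a rough illustration, the sketch below computes the upper-tail probability in pure Python, using the closed form that holds for even degrees of freedom (the 3x3 and 3x2 tables built above have 4 and 2). The helper name `chi2_sf_even_dof` is hypothetical and not part of SciPy; the statistic and degrees of freedom are the first chi-square result printed in the run log near the end of this notebook.

```python
import math

def chi2_sf_even_dof(stat: float, dof: int) -> float:
    """Upper-tail probability P(X >= stat) for a chi-square variable with even dof.

    Hypothetical helper for illustration only: for dof = 2k the tail equals
    exp(-x/2) * sum_{i<k} (x/2)**i / i!.
    """
    if dof <= 0 or dof % 2:
        raise ValueError("closed form only valid for positive even dof")
    half = stat / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(dof // 2))
    return math.exp(-half) * terms

prob = 0.90                          # same confidence level as in Chi_Sqaure
stat, dof = 214.16169905703848, 4    # first chi-square result in the run log
p_value = chi2_sf_even_dof(stat, dof)

# Rejecting H0 when stat >= chi2.ppf(prob, dof) is equivalent to p_value <= 1 - prob
if p_value <= 1 - prob:
    verdict = 'Dependent (reject H0)'
else:
    verdict = 'Independent (fail to reject H0)'
print(p_value, verdict)
```

Reporting the p-value alongside the verdict also shows how strong the evidence is, which the bare critical-value comparison hides.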

<div class="alert alert-heading alert-info">

## Task 5: Inference

Train a model to predict the closing stock price on each day for the data you have already
collected, stored, preprocessed and explored from previous steps. The data must be spanning
from April 2017 to April 2021.
You should develop two separate models:


1. A model for predicting the closing stock price on each day for a 1-month time window (until
    end of May 2021), using only time series of stock prices.
2. A model for predicting the closing stock price on each day for a 1-month time window (until
    end of May 2021), using the time series of stock prices and the auxiliary data you collected.

Which model is performing better? How do you measure performance and why? How could you
further improve the performance? Are the models capable of predicting the closing stock prices
far into the future?

[IMPORTANT NOTE] For these tasks, you are not expected to compare model architectures, but
examine and analyse the differences when training the same model with multiple data attributes
and information from sources. Therefore, you should decide a single model suitable for time series
data to solve the tasks described above. Please see the lecture slides for tips on model selection
and feel free to experiment before selecting one.

The following would help you evaluate your approach and highlight potential weaknesses in your
process:

1. Evaluate the performance of your model using different metrics, e.g. mean squared error,
    mean absolute error or R-squared.
2. Use ARIMA and Facebook Prophet to explore the uncertainty on your model’s predicted
    values by employing confidence bands.
3. Result visualization: create joint plots showing marginal distributions to understand the
    correlation between actual and predicted values.
4. Finding the mean, median and skewness of the residual distribution might provide
    additional insight into the predictive capability of the model.
</div>
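
The metrics in item 1 and the residual statistics in item 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the function names `regression_metrics` and `residual_summary` and the toy values are hypothetical, and the notebook itself relies on scikit-learn for MSE and R-squared.

```python
import statistics

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R-squared for two equal-length sequences."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R-squared is negative when the model does worse than predicting the mean
    r2 = 1.0 - (mse * n) / ss_tot
    return {'mse': mse, 'mae': mae, 'r2': r2}

def residual_summary(y_true, y_pred):
    """Mean, median and (biased, Fisher-Pearson) skewness of the residuals."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {'mean': mean, 'median': statistics.median(residuals), 'skewness': skew}

# Hypothetical toy values, not taken from the notebook's data
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]
print(regression_metrics(y_true, y_pred))
print(residual_summary(y_true, y_pred))
```

A residual mean far from zero points to a systematic offset, while a strong skew suggests the model errs more in one direction than the other.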

In [6]:
def train(Stock_fit,NASDAQ_fit,Combine,Combine_fit,holidays_df):
    """
    Train a Facebook Prophet model
    """
    import os
    import sys

    import pandas as pd
    import numpy as np
    import datetime
    import matplotlib
    import matplotlib.pyplot as plt

    import warnings
    warnings.filterwarnings('ignore')
    import seaborn as sn
    from sklearn.preprocessing import MinMaxScaler
    from scipy.stats import pearsonr,spearmanr,chi2, chi2_contingency
    from calendar import day_abbr, month_abbr, mdays
    import holidays
    from fbprophet import Prophet
    from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
    image_dir='./image/Inference/single'
    if not os.path.exists(image_dir):
        os.makedirs(image_dir)
    def train_test_split(Data: pd.DataFrame) -> tuple:
        """
        Split data into a train set (up to 2021-04-30) and a test set (from 2021-05-03)
        """
        data = Data.copy()
        data.reset_index(inplace=True)    
        df=data.rename(columns={'Date':'ds','Stock_Price':'y'})

        train = df.set_index('ds').loc[:'2021-04-30', :].reset_index()
        test = df.set_index('ds').loc['2021-05-03':, :].reset_index()
        return train,test

    def fit_model(train):   
        """Add condition and data in model."""
        m = Prophet(
            holidays=holidays_df,
    #         changepoint_prior_scale=0.05, 
            seasonality_mode='multiplicative',
            daily_seasonality=False,
            yearly_seasonality=True, 
            weekly_seasonality=True, 
    #        growth="logistic",
        )

    #     m.add_seasonality(name='monthly', period=30.5, fourier_order=5, prior_scale=0.1)
    #     m.add_country_holidays(country_name='US')   

        m.fit(train)
        return m


    import json
    from fbprophet.serialize import model_to_json, model_from_json

    def save_model(model, filename):
        """
        Function to save the given model to local disk
        """
        model_dir='./model'
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)    
        file_path=model_dir+'/'+filename+'.json'
        with open(file_path, 'w') as fout:
            # The model is passed in explicitly: a free variable `m` here would always
            # resolve to the single-stock model of the enclosing scope
            json.dump(model_to_json(model), fout)  # Save model

    def load_model(filename):
        """
        Function to load model from local disk
        """
        model_dir='./model'
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)    
        file_path=model_dir+'/'+filename+'.json'
        with open(file_path, 'r') as fin:
            m = model_from_json(json.load(fin))  # Load model
        return m

    train,test=train_test_split(Stock_fit)
    m = fit_model(train)
    save_model(m, 'Single_model')

    def Stock_with_auxiliary(data_fit,title):
        """
        Build a model using stock price and auxiliary data
        """
        
        image_dir='./image/Inference/'+title
        if not os.path.exists(image_dir):
            os.makedirs(image_dir)
        Stock_N=pd.concat([Stock_fit,data_fit],axis=1)
        train_N, test_N=train_test_split(Stock_N)

        m = Prophet(holidays=holidays_df, 
                    seasonality_mode='multiplicative',
                    yearly_seasonality=True, 
                    weekly_seasonality=True,
                    daily_seasonality=False)
        m.add_regressor(title, mode='multiplicative')

        m.fit(train_N)

        future = m.make_future_dataframe(periods=30, freq='1D')
        future=future.set_index('ds')
        futures = pd.concat([future, data_fit.loc[:, [title]]], axis=1)
        futures.reset_index(inplace=True)
        futures.rename(columns={'index':'ds'},inplace=True)
        futures =futures.loc[futures['ds'].dt.weekday.isin([0, 1, 2, 3, 4])]
        save_model(m, title+'_model')
        return m,train_N, test_N,futures

    def Stock_with_auxiliary_with_Pandemic(data_fit1,title):
        """
        Build a model using stock price, auxiliary data and pandemic condition.
        """
        image_dir='./image/Inference/with_Pandemic/'+title
        if not os.path.exists(image_dir):
            os.makedirs(image_dir)
        data_fit2=Combine['New_cases']
        Stock_N=pd.concat([Stock_fit,data_fit1,data_fit2],axis=1)
        Stock_N=Stock_N.rename(columns={'New_cases':'Pandemic'})
        Stock_N['NoPandemic']=~data_fit2
        train_N, test_N=train_test_split(Stock_N)

        m = Prophet(holidays=holidays_df, 
                    seasonality_mode='multiplicative',
                    yearly_seasonality=True, 
                    weekly_seasonality=True,
                    daily_seasonality=False)
        # Add the auxiliary regressor and the two conditional seasonalities
        # (pandemic / no pandemic) before fitting
        m.add_regressor(title, mode='multiplicative')
        m.add_seasonality(name='Pandemic', period=365, fourier_order=3, mode='multiplicative', condition_name='Pandemic')
        m.add_seasonality(name='NoPandemic', period=365, fourier_order=3, mode='multiplicative', condition_name='NoPandemic')
        m.fit(train_N)
        future = m.make_future_dataframe(periods=30, freq='1D')
        future=future.set_index('ds')
        futures = pd.concat([future, data_fit1.loc[:, [title]],Stock_N['Pandemic'],Stock_N['NoPandemic']], axis=1)
        futures.reset_index(inplace=True)
        futures.rename(columns={'index':'ds'},inplace=True)
        futures =futures.loc[futures['ds'].dt.weekday.isin([0, 1, 2, 3, 4])]
        save_model(m, title+'_Pandemic_model')
        return m,train_N, test_N,futures

    
#     model1,tra1,te1,fu1=Stock_with_auxiliary(Oil_fit,'Oil_Price')
#     Model1,Tra1, Te1,Fu1=Stock_with_auxiliary_with_Pandemic(Oil_fit,'Oil_Price')

    model2,tra2,te2,fu2=Stock_with_auxiliary(NASDAQ_fit,'NASDAQ_Index')
    Model2,Tra2, Te2,Fu2=Stock_with_auxiliary_with_Pandemic(NASDAQ_fit,'NASDAQ_Index')

#     model3,tra3,te3,fu3=Stock_with_auxiliary(Precip_fit,'Precipitation')
#     Model3,Tra3, Te3,Fu3=Stock_with_auxiliary_with_Pandemic(Precip_fit,'Precipitation')
#     modelA,traA,teA,fuA=Stock_with_all_auxiliary("All")
#     ModelA,TraA, TeA,FuA=Stock_with_all_auxiliary_with_Pandemic("All")


    print('task5 train succeed')
    return m,train,test,model2,tra2,te2,fu2,Model2,Tra2, Te2,Fu2
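
The forecast horizon is built with `make_future_dataframe(periods=30, freq='1D')` and then restricted to weekdays. The sketch below, which assumes the training data ends on 2021-04-30 as in the split above, reproduces that calendar arithmetic with the standard library to show how many trading-day rows the 30-day window actually yields. The variable names are illustrative only.

```python
from datetime import date, timedelta

last_train_day = date(2021, 4, 30)   # assumed last date of the training split
periods = 30                         # same horizon as make_future_dataframe(periods=30)

# Calendar days that a daily future frame appends after the training data
calendar_days = [last_train_day + timedelta(days=k) for k in range(1, periods + 1)]

# Keep Monday (0) to Friday (4), mirroring weekday.isin([0, 1, 2, 3, 4])
business_days = [d for d in calendar_days if d.weekday() < 5]

print(len(business_days), business_days[0], business_days[-1])
```

Under these assumptions the window covers 20 weekdays, from 2021-05-03 to 2021-05-28. Monday 31 May 2021 falls outside the 30 calendar days; it was a US market holiday (Memorial Day), so no trading day of May is lost.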

In [7]:
def evaluate(Stock_data,m,train,test,model2,tra2,te2,fu2,Model2,Tra2, Te2,Fu2,scaler0):
    """Test and evaluate the fitted models"""
    
    import os
    import sys

    import pandas as pd
    import numpy as np
    import datetime
    import matplotlib
    import matplotlib.pyplot as plt

    import warnings
    warnings.filterwarnings('ignore')
    import seaborn as sn
    from sklearn.preprocessing import MinMaxScaler
    from scipy.stats import pearsonr,spearmanr,chi2, chi2_contingency
    from calendar import day_abbr, month_abbr, mdays
    import holidays
    from fbprophet import Prophet
    from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
    
    def make_predict(m,periods):    
        """
        Build the future frame (weekdays only) and predict with the fitted model
        """
        future = m.make_future_dataframe(periods=periods, freq='1D')
        future=future.loc[future['ds'].dt.weekday.isin([0, 1, 2, 3, 4])]
        forecast = m.predict(future)   
        fig = m.plot_components(forecast, figsize=(12, 16))

        return forecast,future

    def make_predictions_df(forecast, data_train, data_test): 
        """
        Function to index the Prophet forecast by timestamp and append the actual target values as column y
        """
        forecast.index = pd.to_datetime(forecast.ds)
        data_train.index = pd.to_datetime(data_train.ds)
        data_test.index = pd.to_datetime(data_test.ds)
        data = pd.concat([data_train, data_test], axis=0)
        forecast.loc[:,'y'] = data.loc[:,'y']

        return forecast

    def plot_predictions(forecast, start_date):
        """
        Function to plot the predictions 
        """
        f, ax = plt.subplots(figsize=(14, 8))

        train = forecast.loc[start_date:'2021-04-30',:]
        ax.plot(train.index, train.y, 'ko', markersize=3)
        ax.plot(train.index, train.yhat, color='steelblue', lw=0.5)
        ax.fill_between(train.index, train.yhat_lower, train.yhat_upper, color='steelblue', alpha=0.3)

        test = forecast.loc['2021-05-03':,:]
        ax.plot(test.index, test.y, 'ro', markersize=3)
        ax.plot(test.index, test.yhat, color='coral', lw=0.5)
        ax.fill_between(test.index, test.yhat_lower, test.yhat_upper, color='coral', alpha=0.3)
        ax.axvline(forecast.loc['2021-05-03', 'ds'], color='k', ls='--', alpha=0.7)
        ax.grid(ls=':', lw=0.5)
        return f, ax
    
    def create_joint_plot(forecast, x='yhat', y='y', title=None): 
        """
        Function to plot a joint diagram of predictions against observations
        and return their Pearson correlation
        """

        g = sn.jointplot(x=x, y=y, data=forecast, kind="reg", color="b")
        g.fig.set_figwidth(8)
        g.fig.set_figheight(8)

        ax = g.fig.axes[1]
        if title is not None: 
            ax.set_title(title, fontsize=16)

        # Compute the correlation once and reuse it for the label and the return value
        R=forecast.loc[:,[y,x]].corr().iloc[0,1]
        ax = g.fig.axes[0]
        ax.text(0.2, 0.8, "R = {:+4.2f}".format(R), fontsize=16)
        ax.set_xlabel('Predictions', fontsize=15)
        ax.set_ylabel('Observations', fontsize=15)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.grid(ls=':')
        for l in ax.xaxis.get_ticklabels():
            l.set_fontsize(13)
        for l in ax.yaxis.get_ticklabels():
            l.set_fontsize(13)
        return R
    
    forecast,future=make_predict(m,30)
    plt.close()
    m.plot_components(forecast).savefig('./image/Inference/single/stock forecast')
    plt.close()
    result = make_predictions_df(forecast, train, test)
    result.loc[:,'yhat'] = result.yhat.clip(lower=0)
    result.loc[:,'yhat_lower'] = result.yhat_lower.clip(lower=0)
    result.loc[:, 'yhat_upper'] = result.yhat_upper.clip(lower=0)
    result.head()

    f, ax = plot_predictions(result, '2017-04-03')
    plt.savefig('./image/Inference/single/stock')
    plt.close()
    r0=create_joint_plot(result.loc[:'2021-4-30', :], title='Train set')
    plt.savefig('./image/Inference/single/stock Train set')
    plt.close()
    rt0=create_joint_plot(result.loc['2021-05-03':, :], title='Test set')
    plt.savefig('./image/Inference/single/stock Test set')
    plt.close()
    # Predicted values of the model that uses only the stock price
    predict0 = pd.DataFrame({
        'True_Value':Stock_data.loc['2021-05-03':,'Stock_Price'],
        'Predict': result.loc['2021-05-03':,'yhat']
        })
    predict0.loc[:,'Predict']=scaler0.inverse_transform(pd.DataFrame(predict0.loc['2021-05-03':,'Predict']))
    
    def pred1(m,train_N, test_N,futures,title):
        """
        Get the predicted values of the model using stock price and auxiliary data
        """
        forecast = m.predict(futures)
        f = m.plot_components(forecast, figsize=(12, 16))
        plt.close()
        m.plot_components(forecast).savefig('./image/Inference/'+title+'/forecast')
        plt.close()
        result = make_predictions_df(forecast, train_N, test_N)
        result.loc[:,'yhat'] = result.yhat.clip(lower=0)
        result.loc[:,'yhat_lower'] = result.yhat_lower.clip(lower=0)
        result.loc[:, 'yhat_upper'] = result.yhat_upper.clip(lower=0)
        result.head()

        f, ax = plot_predictions(result, '2017-04-03')
        plt.savefig('./image/Inference/'+title+'/predict')
        plt.close()
        R_train = create_joint_plot(result.loc[:'2021-4-30', :], title='Train set')
        plt.savefig('./image/Inference/'+title+'/Train set')
        plt.close()
        R_test = create_joint_plot(result.loc['2021-05-03':, :], title='Test set')
        plt.savefig('./image/Inference/'+title+'/Test set')
        plt.close()

        predict = pd.DataFrame({
            'True_Value':Stock_data.loc['2021-05-03':,'Stock_Price'],
            'Predict': result.loc['2021-05-03':,'yhat']
            })
        predict.loc[:,'Predict']=scaler0.inverse_transform(pd.DataFrame(predict.loc['2021-05-03':,'Predict']))
        return predict,R_train,R_test
    def pred2(m,train_N, test_N,futures,title):
        """
        Get the predicted values of the model using stock price, auxiliary data and the pandemic condition.
        """
        forecast = m.predict(futures)

        f = m.plot_components(forecast, figsize=(12, 16))
        plt.close()
        m.plot_components(forecast).savefig('./image/Inference/with_Pandemic/'+title+'/forecast')
        plt.close()
        result = make_predictions_df(forecast, train_N, test_N)
        result.loc[:,'yhat'] = result.yhat.clip(lower=0)
        result.loc[:,'yhat_lower'] = result.yhat_lower.clip(lower=0)
        result.loc[:, 'yhat_upper'] = result.yhat_upper.clip(lower=0)
        result.head()

        f, ax = plot_predictions(result, '2017-04-03')
        plt.savefig('./image/Inference/with_Pandemic/'+title+'/predict')
        plt.close()
        R_train=create_joint_plot(result.loc[:'2021-4-30', :], title='Train set')
        plt.savefig('./image/Inference/with_Pandemic/'+title+'/Train set')
        plt.close()
        R_test=create_joint_plot(result.loc['2021-05-03':, :], title='Test set')
        plt.savefig('./image/Inference/with_Pandemic/'+title+'/Test set')
        plt.close()

        predict = pd.DataFrame({
            'True_Value':Stock_data.loc['2021-05-03':,'Stock_Price'],
            'Predict': result.loc['2021-05-03':,'yhat']
            })
        predict.loc[:,'Predict']=scaler0.inverse_transform(pd.DataFrame(predict.loc['2021-05-03':,'Predict']))
        return predict,R_train,R_test


#     predict1,r1,rt1=pred1(model1,tra1,te1,fu1,'Oil_Price')
#     Predict1,R1,RT1=pred2(Model1,Tra1, Te1,Fu1,'Oil_Price')

    predict2,r2,rt2=pred1(model2,tra2,te2,fu2,'NASDAQ_Index')
    Predict2,R2,RT2=pred2(Model2,Tra2, Te2,Fu2,'NASDAQ_Index')


#     predict3,r3,rt3=pred1(model3,tra3,te3,fu3,'Precipitation')
#     Predict3,R3,RT3=pred2(Model3,Tra3, Te3,Fu3,'Precipitation')
    
#     predictA,rA,rtA=pred3(modelA,traA,teA,fuA,"All")
#     PredictA,RA,RTA=pred4(ModelA,TraA, TeA,FuA,"All")
    Predict_table=predict0.copy()
    Predict_table['N_predict']=predict2['Predict']
    Predict_table['N_predict_PAN']=Predict2['Predict']
    
    # Evaluate the predictions with mean squared error (MSE) and R-squared
    def MSE(predict):
        return mean_squared_error(predict['True_Value'],predict['Predict'])

    def R_sqaure(predict):
        return r2_score(predict['True_Value'],predict['Predict'])
    index=['stock','with NASDAQ','with NASDAQ_PAN']
    correlation_train=[r0,r2,R2]
    correlation_test=[rt0,rt2,RT2]
    MSE_result=[MSE(predict0),MSE(predict2),MSE(Predict2)]
    R_sqaure_result=[R_sqaure(predict0),R_sqaure(predict2),R_sqaure(Predict2)]
    Result_table={'correlation_train':correlation_train,'correlation_test':correlation_test,'MSE_result':MSE_result,'R_sqaure_result':R_sqaure_result}
    Df=pd.DataFrame(Result_table,index=index)
    print('task5 evaluate succeed')
    return Predict_table,Df
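
`evaluate` maps the forecasts back to price units with `scaler0.inverse_transform` before computing MSE and R-squared. The definition of `scaler0` lies outside this section; the sketch below assumes it is a min-max scaler with the default `[0, 1]` range and shows the round trip in plain Python. The function names and the price values are hypothetical.

```python
def minmax_scale(values, lo, hi):
    """Map values from [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def minmax_inverse(scaled, lo, hi):
    """Map scaled values from [0, 1] back onto [lo, hi]."""
    return [s * (hi - lo) + lo for s in scaled]

prices = [12.0, 21.5, 50.6]          # hypothetical prices, not the notebook's data
lo, hi = min(prices), max(prices)

scaled = minmax_scale(prices, lo, hi)
restored = minmax_inverse(scaled, lo, hi)

# An error of e on the scaled axis corresponds to e * (hi - lo) in price units
scaled_error = 0.01
price_error = scaled_error * (hi - lo)
print(scaled, restored, price_error)
```

Because the error is rescaled by `hi - lo`, an MSE computed on scaled values is smaller than the MSE in price units by a factor of `(hi - lo) ** 2`, so the two should not be compared directly.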

<div class="alert alert-heading alert-danger">

## Autorun

</div>

In [8]:
def main():
    Stock,Oil, NASDAQ,caseUS,Precip=acquire()
    
    StockPrices,Oil_db,NASDAQ_db, Infects, Precip_db=store(Stock,Oil, NASDAQ,caseUS,Precip)
    
    Stock_data,Oil_data, NASDAQ_data,Precip_data,Combine,Stock_fit,Oil_fit,NASDAQ_fit,Precip_fit,Combine_fit,scaler0=process(StockPrices,Oil_db,NASDAQ_db, Infects, Precip_db)
   
    holidays_df =explore(Stock_data,Oil_data, NASDAQ_data,Precip_data,Combine,Stock_fit,Oil_fit,NASDAQ_fit,Precip_fit,Combine_fit)
    
    m,Train,Test,model2,tra2,te2,fu2,Model2,Tra2, Te2,Fu2=train(Stock_fit,NASDAQ_fit,Combine,Combine_fit,holidays_df)
    
    Pre_table, result=evaluate(Stock_data, m,Train,Test,model2,tra2,te2,fu2,Model2,Tra2, Te2,Fu2,scaler0)
    
    return Pre_table,result

In [9]:
Pre_table,result=main()

task1 succeed
task2 succeed
Wheher the missing value exists:
Stock_Price      False
Oil_Price        False
NASDAQ_Index     False
Precipitation    False
New_cases        False
dtype: bool
task3 succeed
Holidays that stock markets opening: 
            Stock_Price
Date                   
2017-10-09    50.599998
2017-11-10    45.820000
2018-10-08    35.900002
2018-11-12    36.860001
2019-10-14    27.620001
2019-11-11    30.590000
2020-10-12    12.920000
2020-11-11    12.040000
                        Oil_Price  NASDAQ_Index  Precipitation Stock_Price
Variance               129.861042  4.902779e+06       0.209567   181.71048
Covariance              80.450813 -2.164478e+04       1.064298           -
Pearsonr_correlation     0.523222 -7.244811e-01       0.172305           -
Spearmanr_correlation    0.449956 -8.122050e-01       0.222894           -
statistic 214.16169905703848
p-value 3.3816346269019644e-45
degres of fredom:  4
table of expected frequencies
 [[ 24.36103152 165.21776504 137.4

Importing plotly failed. Interactive plots will not work.


task5 train succeed
task5 evaluate succeed


In [10]:
Pre_table

| | True_Value | Predict | N_predict | N_predict_PAN |
|---|---|---|---|---|
| 2021-05-03 | 21.950001 | 21.934342 | 22.077118 | 19.678055 |
| 2021-05-04 | 21.42 | 21.9404 | 21.785593 | 19.262819 |
| 2021-05-05 | 21.57 | 21.905665 | 21.743741 | 19.220425 |
| 2021-05-06 | 21.49 | 21.856712 | 21.816173 | 19.353246 |
| 2021-05-07 | 22.0 | 21.83385 | 21.969279 | 19.613476 |
| 2021-05-10 | 22.0 | 21.776751 | 21.63395 | 19.288033 |
| 2021-05-11 | 21.57 | 21.800617 | 21.644801 | 19.378144 |
| 2021-05-12 | 20.76 | 21.785726 | 21.244461 | 18.881194 |
| 2021-05-13 | 21.209999 | 21.758907 | 21.37969 | 19.17662 |
| 2021-05-14 | 22.4 | 21.760881 | 21.769614 | 19.856249 |


In [11]:
result

| | correlation_train | correlation_test | MSE_result | R_sqaure_result |
|---|---|---|---|---|
| stock | 0.991097 | 0.688185 | 1.246649 | -0.189704 |
| with NASDAQ | 0.991928 | 0.810072 | 0.821112 | 0.216395 |
| with NASDAQ_PAN | 0.992023 | 0.877136 | 4.0483 | -2.863381 |
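
The `with NASDAQ_PAN` row combines the highest test correlation with the worst R-squared. A constant offset explains how both can hold: correlation ignores it, while R-squared penalises it. The sketch below illustrates this on the ten rows of `Pre_table` displayed above only, so its numbers differ from the full test window summarised in `result`. The helper functions are hypothetical, and shifting predictions by a bias estimated on test rows is a diagnostic, not a valid forecasting step.

```python
import math

# True_Value and N_predict_PAN columns of the ten Pre_table rows displayed above
true_value = [21.950001, 21.42, 21.57, 21.49, 22.0,
              22.0, 21.57, 20.76, 21.209999, 22.4]
n_predict_pan = [19.678055, 19.262819, 19.220425, 19.353246, 19.613476,
                 19.288033, 19.378144, 18.881194, 19.17662, 19.856249]

def mean(values):
    return sum(values) / len(values)

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, the quantity that r2_score reports."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Average amount by which the pandemic model under-predicts on these rows
bias = mean([t - p for t, p in zip(true_value, n_predict_pan)])
r = pearson(true_value, n_predict_pan)
r2 = r_squared(true_value, n_predict_pan)

# Diagnostic only: remove the constant offset and recompute R-squared
shifted = [p + bias for p in n_predict_pan]
r2_shifted = r_squared(true_value, shifted)
print(bias, r, r2, r2_shifted)
```

On these rows the predictions track the day-to-day movement of the price but sit more than two units below it, which is enough to push R-squared below zero.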
