#### <h1 style="text-align: center;"> Model Pre-Processing </h1>

## Notebook Description

This notebook merges the time-series data used for SARIMAX modeling of oil stock performance, as part of the requirements of the RMDS 2021 Data Science Competition.

<a id='Table-of-Contents'></a>


## Table of Contents
1. [Required Libraries](#Required-Libraries)
2. [Load Data](#Load-Data)
3. [Format Column Names](#Format-Column-Names)
4. [Set Timestamp Range & Frequency](#Set-Timestamp-Range-&-Frequency)
5. [Merge Data](#Merge-Data)
6. [Manage Missing Data](#Missing-Data)
7. [Save Data to CSV](#Save-Data)

## Required Libraries

This notebook uses the following Python libraries: `datetime`, `numpy`, `pandas`, and `warnings`.

In [276]:
# Load required packages 
import datetime
from datetime import timedelta
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

<a id='Load-Data'></a>

---
## Load Data

[[ go back to the top ]](#Table-of-Contents)

In [277]:
# Load Data Function
def LOAD_DATA(filepath, filename):
    # Read CSV files
    if filename.endswith('.csv'):
        new_df = pd.read_csv(filepath+filename)
    # Read Excel files
    elif filename.endswith('.xlsx'):
        new_df = pd.read_excel(filepath+filename)
    # Fail early on unsupported file types instead of hitting an undefined new_df
    else:
        raise ValueError(f"Unsupported file type: '{filename}'")
    print(type(new_df.index))

    # Try to identify the date column
    if not isinstance(new_df.index, pd.DatetimeIndex):
        for col in new_df.columns:
            if 'date' in col.lower():
                print(f"TIMESTAMP FOUND! '{col}'")
                print()
                new_df['date'] = pd.to_datetime(new_df[col]) # format = '%Y/%m/%d'
                new_df.set_index('date', inplace = True)
                # If datetime col was already == 'date', no need to drop col after set_index, otherwise...
                if col != 'date':
                    new_df.drop(columns = col, inplace = True)
                # Stop at the first match so a later date-like column cannot overwrite the index
                break
    else:
        print('Index already in datetime')

    display(new_df.info())
    return new_df
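
To illustrate the timestamp detection inside `LOAD_DATA`, here is a minimal sketch on a small in-memory frame. The frame `raw` and its values are made up for illustration; only the pattern (find a column whose name contains 'date', convert it, and use it as the index) mirrors the function above.

```python
import pandas as pd

# Hypothetical toy data standing in for a freshly loaded CSV
raw = pd.DataFrame({
    "Date": ["2021-03-01", "2021-03-02", "2021-03-03"],
    "settle": [60.6, 59.8, 61.3],
})

# Find the first column whose name contains 'date' (case-insensitive)
date_col = next(col for col in raw.columns if "date" in col.lower())

# Convert it to datetime and make it the index, as LOAD_DATA does
indexed = raw.copy()
indexed["date"] = pd.to_datetime(indexed[date_col])
indexed = indexed.set_index("date")
if date_col != "date":
    indexed = indexed.drop(columns=date_col)

print(type(indexed.index).__name__)   # DatetimeIndex
print(list(indexed.columns))          # ['settle']
```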

In [278]:
# # Load International Sentiment Data
# 
# fpath = '../../data/News_AI_Sentiments/'
# fname = 'daily-news-sentiment-international.csv'
# 
# sentiment_int = LOAD_DATA(filepath = fpath, filename = fname)
# #sentiment_int

In [279]:
# # Load North American Sentiment Data
# 
# fpath = '../../data/News_AI_Sentiments/'
# fname = 'daily-news-sentiment-NA.csv'
# 
# sentiment_na = LOAD_DATA(filepath = fpath, filename = fname)
# #sentiment_na

In [280]:
# # Load Stock-Closing-Price by Company Data
# fpath = '../../'
# fname = 'closing_price_by_company.csv'
# closing_price = LOAD_DATA(filepath = fpath, filename = fname)
# #closing_price

In [281]:
# # Load Google Mobility Trends grouped by Indicator
# fpath = '../../data/Transportation/google/'
# fname = 'baseline_pct_change.csv'
# google_mobility = LOAD_DATA(filepath = fpath, filename = fname)
# #google_mobility

In [282]:
# # Load Dow Jones Indices grouped by Indicator
# fpath = '../../data/Financial_Market/'
# fname = 'dow_jones.csv'
# dow_jones = LOAD_DATA(filepath = fpath, filename = fname)
# #dow_jones

In [283]:
# # Load Commodity data: NYMEX Crude Oil prices, futures open interest, futures settle price
# fpath = './../data/Financial_Market/'
# fname = 'commodity_modeldata.csv'
# crude_futures = LOAD_DATA(filepath = fpath, filename = fname)
# #crude_futures

<class 'pandas.core.indexes.range.RangeIndex'>
TIMESTAMP FOUND! 'date'

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2017-03-08 to 2021-03-05
Data columns (total 3 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   settle                      1000 non-null   float64
 1   previous_day_open_interest  1000 non-null   float64
 2   price_in_usd                1000 non-null   float64
dtypes: float64(3)
memory usage: 31.2 KB


None

In [284]:
# # Nov 27, 2020 was missing from WorldData.AI Downloads
# crude_futures.loc[pd.to_datetime('2020-11-27')] = [45.50, 407847.0, 45.53]
# crude_futures.loc['2020-11-27']

| date | settle | previous_day_open_interest | price_in_usd |
|---|---|---|---|
| 2020-11-27 | 45.5 | 407847.0 | 45.53 |
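
One side effect of patching a missing day with `.loc` is that the new row is appended at the end, so the index is no longer in chronological order. The sketch below uses made-up neighbouring prices (only the 45.50 value comes from the cell above) and shows `sort_index()` as one way to restore the order before slicing by date.

```python
import pandas as pd

# Hypothetical prices with one trading day (2020-11-27) missing
idx = pd.to_datetime(["2020-11-25", "2020-11-30"])
prices = pd.DataFrame({"settle": [45.71, 45.34]}, index=idx)
prices.index.name = "date"

# Assigning to a new label with .loc appends the row at the end
prices.loc[pd.to_datetime("2020-11-27")] = [45.50]
print(prices.index.is_monotonic_increasing)   # False

# Restore chronological order before slicing by date
prices = prices.sort_index()
print(prices.index.is_monotonic_increasing)   # True
```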


In [330]:
%ls

 Volume in drive C is Windows
 Volume Serial Number is 72F8-2D4C

 Directory of C:\Users\bgrif\Downloads\github\portfolio\Hackathon_March2021\code

03/19/2021  12:21 PM    <DIR>          .
03/19/2021  12:21 PM    <DIR>          ..
03/19/2021  12:21 PM    <DIR>          .ipynb_checkpoints
03/02/2021  06:55 PM    <DIR>          __pycache__
03/19/2021  05:58 PM    <DIR>          EDA-Brandon
03/16/2021  02:16 PM    <DIR>          Model-Tuning-Brandon
03/19/2021  06:10 PM    <DIR>          Preprocessing-Brandon
               0 File(s)              0 bytes
               7 Dir(s)  190,652,223,488 bytes free


In [332]:
# Load OECD long-term and short-term interest rates
fpath = '../data/Financial_Market/'
fname = 'OECD_interest_rates_updated.csv'
OECD_int = LOAD_DATA(filepath = fpath, filename = fname)
OECD_int

<class 'pandas.core.indexes.range.RangeIndex'>
TIMESTAMP FOUND! 'date'

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 815 entries, 1953-04-30 to 2021-02-28
Data columns (total 2 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   long_term_interest_rates   815 non-null    float64
 1   short_term_interest_rates  680 non-null    float64
dtypes: float64(2)
memory usage: 19.1 KB


None

| date | long_term_interest_rates | short_term_interest_rates |
|---|---|---|
| 1953-04-30 | 2.83 | NaN |
| 1953-05-31 | 3.05 | NaN |
| 1953-06-30 | 3.11 | NaN |
| 1953-07-31 | 2.93 | NaN |
| 1953-08-31 | 2.95 | NaN |
| ... | ... | ... |
| 2020-10-31 | 0.79 | 0.12 |
| 2020-11-30 | 0.87 | 0.16 |
| 2020-12-31 | 0.93 | 0.17 |
| 2021-01-31 | 1.08 | 0.14 |


In [285]:
# # Load Target-Features MINUS Mobility
# fpath = './../'
# fname = 'targets_wout_mobility.csv'
# targets_wout_mobility = LOAD_DATA(filepath = fpath, filename = fname)
# #targets_wout_mobility

<class 'pandas.core.indexes.range.RangeIndex'>
TIMESTAMP FOUND! 'Date'

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 985 entries, 2017-03-21 to 2021-02-25
Data columns (total 51 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Value_PHILLIPS 66                        985 non-null    float64
 1   Value_BP P.L.C.                          985 non-null    float64
 2   Value_VALERO ENERGY CORPORATION          985 non-null    float64
 3   Value_CHEVRON CORPORATION                985 non-null    float64
 4   Value_OCCIDENTAL PETROLEUM CORPORATION   985 non-null    float64
 5   Value_MARATHON OIL CORPORATION           985 non-null    float64
 6   Value_PIONEER NATURAL RESOURCES COMPANY  985 non-null    float64
 7   Value_CONOCOPHILLIPS                     985 non-null    float64
 8   Value_EXXON MOBIL CORPORATION            985 non-null    float64
 9   Value_MARATHON PETROLEUM CORP

None

In [286]:
# # Load Target-Features ALL
# fpath = './../'
# fname = 'target_with_all_features_clean.csv'
# target_with_all_features_clean = LOAD_DATA(filepath = fpath, filename = fname)
# #target_with_all_features_clean

<a id='Format-Column-Names'></a>

---
## Format Column Names

[[ go back to the top ]](#Table-of-Contents)

In [287]:
# Function to enforce snake case (no spaces, no caps, no dots)
def FORMAT_TITLES(dataframe, start = 0, end = 0):
    df = dataframe
    # end == 0 means "keep everything through the end of the label";
    # otherwise the label is sliced as label[start:end]
    stop = end if end != 0 else None
    # Map each original label to its trimmed, snake-case version
    new_names = {label: label[start:stop].strip().lower().replace('.', '').replace(" ", "_")
                 for label in df.columns}
    # Rename in place, so the DataFrame passed in is modified as well
    df.rename(columns = new_names, inplace = True)
    return df
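
As a quick check of the renaming rule, the sketch below applies the same string operations to two column labels taken from the targets data in this notebook. The helper `to_snake_case` is a hypothetical stand-in written for this example, not part of the notebook.

```python
import pandas as pd

def to_snake_case(label, start=0):
    # Hypothetical helper: trim a prefix, then lower-case, drop dots, and replace spaces
    return label[start:].strip().lower().replace(".", "").replace(" ", "_")

# Column labels copied from the targets data shown in this notebook
demo = pd.DataFrame(columns=["Value_PHILLIPS 66", "Value_BP P.L.C."])

# start=6 skips the six-character 'Value_' prefix
demo = demo.rename(columns=lambda label: to_snake_case(label, start=6))
print(list(demo.columns))   # ['phillips_66', 'bp_plc']
```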

In [288]:
# sentiment_int = FORMAT_TITLES(sentiment_int)
# sentiment_int.rename(columns = {'daily_news_sentiment':'int_news_sentiment'}, inplace = True)
# #sentiment_int

In [289]:
# sentiment_na = FORMAT_TITLES(sentiment_na)
# sentiment_na.rename(columns = {'daily_news_sentiment':'na_news_sentiment'}, inplace = True)
# #sentiment_na

In [290]:
# closing_price = FORMAT_TITLES(closing_price, start = 6)
# #closing_price

In [291]:
# google_mobility = FORMAT_TITLES(google_mobility)
# #google_mobility

In [292]:
# dow_jones = FORMAT_TITLES(dow_jones)
# #dow_jones

In [293]:
# # Rename columns to be more descriptive
# crude_futures.rename(columns = {'settle':'futures_settle',
#                                 'previous_day_open_interest':'futures_prev_day_open_interest',
#                                 'price_in_usd':'crude_usd_per_barrel'},
#                      inplace = True)

In [294]:
%whos DataFrame

Variable                         Type         Data/Info
-------------------------------------------------------
crude_futures                    DataFrame                futures_settl<...>\n[1001 rows x 3 columns]
df_0                             DataFrame                Value_PHILLIP<...>n[1007 rows x 54 columns]
df_1                             DataFrame                futures_settl<...>\n[1001 rows x 3 columns]
df_i                             DataFrame                futures_settl<...>\n[1001 rows x 3 columns]
df_na                            DataFrame                Value_PHILLIP<...>\n[102 rows x 54 columns]
target_with_all_features_clean   DataFrame                phillips_66  <...>\n[244 rows x 75 columns]
targets_wout_mobility            DataFrame                Value_PHILLIP<...>\n[985 rows x 51 columns]


---
## Set Timestamp Range & Frequency

[[ go back to the top ]](#Table-of-Contents)

In [295]:
# Identify all DataFrames in this Notebook
%who DataFrame

crude_futures	 df_0	 df_1	 df_i	 df_na	 target_with_all_features_clean	 targets_wout_mobility	 


In [296]:
df_listed = [crude_futures, targets_wout_mobility]

In [297]:
# Store the latest of the earliest Timestamps across all the DataFrames
start = max(df.index.min() for df in df_listed)
start

Timestamp('2017-03-21 00:00:00')
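
The idea behind `start` is that the latest of the first timestamps is the earliest date on which every series has begun. A minimal sketch with made-up frames, whose first dates are chosen to match the two series used here:

```python
import pandas as pd

# Hypothetical frames whose first dates match the two series in this notebook
futures = pd.DataFrame({"settle": [1.0, 2.0, 3.0]},
                       index=pd.to_datetime(["2017-03-08", "2017-03-21", "2017-03-22"]))
targets = pd.DataFrame({"close": [4.0, 5.0]},
                       index=pd.to_datetime(["2017-03-21", "2017-03-22"]))

# The latest of the first timestamps is the earliest date covered by every frame
common_start = max(df.index.min() for df in (futures, targets))
print(common_start)   # 2017-03-21 00:00:00

# Label-based slicing keeps rows from common_start onward (inclusive)
trimmed = futures.loc[common_start:]
```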

<a id='Merge-Data'></a>

---
## Merge DataFrames

[[ go back to the top ]](#Table-of-Contents)

In [333]:
# Identify all DataFrames in this Notebook
%who DataFrame

OECD_int	 crude_futures	 df_0	 df_1	 df_i	 df_na	 target_with_all_features_clean	 targets_wout_mobility	 


In [299]:
targets_wout_mobility

date,Value_PHILLIPS 66,Value_BP P.L.C.,Value_VALERO ENERGY CORPORATION,Value_CHEVRON CORPORATION,Value_OCCIDENTAL PETROLEUM CORPORATION,Value_MARATHON OIL CORPORATION,Value_PIONEER NATURAL RESOURCES COMPANY,Value_CONOCOPHILLIPS,Value_EXXON MOBIL CORPORATION,Value_MARATHON PETROLEUM CORPORATION,...,i_g_carbon_footprint,i_g_emissions,i_g_epa,i_g_greenhouse,i_g_hurricane_storm,i_g_pollution,i_g_sanction,i_g_solar,i_g_turbine,i_g_vacation
2017-03-21,78.60,34.24,67.15,108.04,63.36,15.05,182.30,45.18,81.83,49.26,...,0.0,0.0,-0.400000,0.900000,-0.200000,0.000000,0.000000,0.200000,0.00,-0.475000
2017-03-22,78.83,33.99,67.57,108.39,63.51,15.02,182.33,44.65,81.76,49.57,...,0.0,0.0,0.000000,0.000000,0.000000,-0.066667,-0.200000,0.033333,0.10,0.300000
2017-03-23,78.48,34.06,67.10,107.87,63.04,14.68,181.55,44.48,81.86,49.30,...,0.0,0.0,0.000000,0.400000,0.000000,-0.200000,-0.700000,0.300000,0.30,0.000000
2017-03-24,77.27,33.78,65.78,107.99,62.83,14.61,180.91,44.10,81.23,48.87,...,0.0,0.0,-0.600000,0.600000,0.000000,-0.275000,-0.550000,0.000000,0.00,-0.450000
2017-03-27,77.25,33.75,66.23,106.28,62.87,14.75,180.40,44.29,81.25,48.83,...,0.0,0.0,-0.600000,0.000000,0.000000,-0.200000,-0.350000,0.200000,0.00,0.600000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-02-19,82.31,22.87,71.53,95.80,25.42,9.45,133.94,48.42,52.37,53.00,...,-0.2,-0.2,-0.400000,-0.283333,0.000000,-0.290909,-0.424138,-0.157143,-0.35,-0.288000
2021-02-22,83.96,23.63,74.25,98.39,26.47,10.20,139.47,50.88,54.30,54.85,...,0.0,0.0,-0.200000,0.500000,-0.245455,-0.266667,-0.490476,0.071429,-0.30,-0.117647
2021-02-23,85.53,24.23,75.81,99.63,26.06,11.16,145.24,52.10,55.05,55.77,...,-0.1,-0.1,-0.500000,-0.175000,-0.300000,-0.336364,-0.423611,-0.075000,-0.40,-0.258333
2021-02-24,87.25,25.30,78.16,103.31,28.16,11.84,150.06,54.67,56.70,56.65,...,0.1,0.1,-0.314286,-0.033333,-0.100000,-0.422222,-0.475610,-0.125000,-0.32,-0.194737


In [337]:
df_0.notnull().any()

Value_PHILLIPS 66                          True
Value_BP P.L.C.                            True
Value_VALERO ENERGY CORPORATION            True
Value_CHEVRON CORPORATION                  True
Value_OCCIDENTAL PETROLEUM CORPORATION     True
Value_MARATHON OIL CORPORATION             True
Value_PIONEER NATURAL RESOURCES COMPANY    True
Value_CONOCOPHILLIPS                       True
Value_EXXON MOBIL CORPORATION              True
Value_MARATHON PETROLEUM CORPORATION       True
d_f_chevron                                True
d_f_drilling                               True
d_f_exxon                                  True
d_f_fossil_fuel                            True
d_f_marathon_oil                           True
d_f_occidental_petroleum                   True
d_f_oil                                    True
d_f_oilfield                               True
d_f_phillips_66                            True
d_f_pipeline                               True
d_f_valero                              

In [334]:
df_listed = [targets_wout_mobility, OECD_int]

for count, df_i in enumerate(df_listed):
    if count == 0:
        # The first DF is the base that all the others are joined to
        # (weekend filter currently disabled: df_i[df_i.index.dayofweek < 5])
        df_0 = df_i
        # Drop dates before the common start date
        df_0 = df_0.loc[start:].copy()
    else:
        # (weekend filter currently disabled: df_i[df_i.index.dayofweek < 5])
        df_1 = df_i
        # Left-join all the others on the date index, keeping only the dates in df_0
        df_0 = df_0.join(df_1, how = 'left')
display(df_0.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 985 entries, 2017-03-21 to 2021-02-25
Data columns (total 53 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Value_PHILLIPS 66                        985 non-null    float64
 1   Value_BP P.L.C.                          985 non-null    float64
 2   Value_VALERO ENERGY CORPORATION          985 non-null    float64
 3   Value_CHEVRON CORPORATION                985 non-null    float64
 4   Value_OCCIDENTAL PETROLEUM CORPORATION   985 non-null    float64
 5   Value_MARATHON OIL CORPORATION           985 non-null    float64
 6   Value_PIONEER NATURAL RESOURCES COMPANY  985 non-null    float64
 7   Value_CONOCOPHILLIPS                     985 non-null    float64
 8   Value_EXXON MOBIL CORPORATION            985 non-null    float64
 9   Value_MARATHON PETROLEUM CORPORATION     985 non-null    float64
 10  d_f_chevron                    

None
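
Note that the OECD rates are monthly (stamped at month end) while the targets are daily, so a left join on the date index only fills rows whose date equals a month-end stamp. The sketch below uses a made-up daily series to show that effect, followed by one possible way to carry the latest monthly value forward onto each trading day. The forward fill is an assumption added for illustration and is not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical daily target series (business days) and month-end rates
daily = pd.DataFrame({"close": [0, 1, 2, 3, 4]},
                     index=pd.to_datetime(["2021-01-28", "2021-01-29", "2021-02-01",
                                           "2021-02-02", "2021-02-03"]))
monthly = pd.DataFrame({"long_term_interest_rates": [0.93, 1.08]},
                       index=pd.to_datetime(["2020-12-31", "2021-01-31"]))

# Exact-date left join: no daily date equals a month-end stamp here, so all NaN
joined = daily.join(monthly, how="left")
print(joined["long_term_interest_rates"].isna().sum())   # 5

# Assumed alternative: carry the most recent monthly value forward to each day
filled = daily.join(monthly.reindex(daily.index, method="ffill"))
print(filled["long_term_interest_rates"].tolist())   # [0.93, 0.93, 1.08, 1.08, 1.08]
```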

In [300]:
df_listed = [targets_wout_mobility, crude_futures]

for count, df_i in enumerate(df_listed):
    if count == 0:
        # The first DF is the base that all the others are concatenated to
        # (weekend filter currently disabled: df_i[df_i.index.dayofweek < 5])
        df_0 = df_i
        # Drop dates before the common start date
        df_0 = df_0.loc[start:].copy()
    else:
        # (weekend filter currently disabled: df_i[df_i.index.dayofweek < 5])
        df_1 = df_i
        # Concat all the others column-wise (outer join on the date index)
        df_0 = pd.concat([df_0, df_1.loc[start:]], axis =1)
display(df_0.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1007 entries, 2017-03-21 to 2021-03-05
Data columns (total 54 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Value_PHILLIPS 66                        985 non-null    float64
 1   Value_BP P.L.C.                          985 non-null    float64
 2   Value_VALERO ENERGY CORPORATION          985 non-null    float64
 3   Value_CHEVRON CORPORATION                985 non-null    float64
 4   Value_OCCIDENTAL PETROLEUM CORPORATION   985 non-null    float64
 5   Value_MARATHON OIL CORPORATION           985 non-null    float64
 6   Value_PIONEER NATURAL RESOURCES COMPANY  985 non-null    float64
 7   Value_CONOCOPHILLIPS                     985 non-null    float64
 8   Value_EXXON MOBIL CORPORATION            985 non-null    float64
 9   Value_MARATHON PETROLEUM CORPORATION     985 non-null    float64
 10  d_f_chevron                   

None
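
The two merge cells above differ in which dates they keep: `DataFrame.join(..., how='left')` keeps only the index of the left frame, whereas `pd.concat(..., axis=1)` defaults to an outer join and keeps the union of both indexes. That is consistent with the 985 rows reported for the join and the 1007 rows reported for the concat. A minimal sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({"close": [1.0, 2.0]},
                    index=pd.to_datetime(["2021-02-24", "2021-02-25"]))
right = pd.DataFrame({"settle": [10.0, 20.0, 30.0]},
                     index=pd.to_datetime(["2021-02-25", "2021-02-26", "2021-03-01"]))

# Left join: the result has exactly the rows of `left`
joined = left.join(right, how="left")
print(len(joined))   # 2

# Column-wise concat: outer join by default, so the row index is the union
stacked = pd.concat([left, right], axis=1)
print(len(stacked))  # 4
```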

<a id='Missing-Data'></a>

---
## Manage Missing Data

[[ go back to the top ]](#Table-of-Contents)

In [320]:
targets_wout_mobility.isnull().sum().sum()

1278

In [321]:
len(targets_wout_mobility)

985

In [322]:
df_0.isnull().sum().sum()

1323

In [323]:
len(df_0)

985

In [324]:
df_0.isnull().sum()

Value_PHILLIPS 66                            0
Value_BP P.L.C.                              0
Value_VALERO ENERGY CORPORATION              0
Value_CHEVRON CORPORATION                    0
Value_OCCIDENTAL PETROLEUM CORPORATION       0
Value_MARATHON OIL CORPORATION               0
Value_PIONEER NATURAL RESOURCES COMPANY      0
Value_CONOCOPHILLIPS                         0
Value_EXXON MOBIL CORPORATION                0
Value_MARATHON PETROLEUM CORPORATION         0
d_f_chevron                                  0
d_f_drilling                                 0
d_f_exxon                                    0
d_f_fossil_fuel                              0
d_f_marathon_oil                             0
d_f_occidental_petroleum                     0
d_f_oil                                      0
d_f_oilfield                                 0
d_f_phillips_66                              0
d_f_pipeline                               464
d_f_valero                                   0
d_g_chevron  

In [325]:
# Notice all the NaNs in the WTI crude oil price per barrel.
# These correspond to American holidays, which should also be reflected in the closing stock price data.
df_0[df_0.isnull().any(axis=1)].sort_index()

date,Value_PHILLIPS 66,Value_BP P.L.C.,Value_VALERO ENERGY CORPORATION,Value_CHEVRON CORPORATION,Value_OCCIDENTAL PETROLEUM CORPORATION,Value_MARATHON OIL CORPORATION,Value_PIONEER NATURAL RESOURCES COMPANY,Value_CONOCOPHILLIPS,Value_EXXON MOBIL CORPORATION,Value_MARATHON PETROLEUM CORPORATION,...,i_g_greenhouse,i_g_hurricane_storm,i_g_pollution,i_g_sanction,i_g_solar,i_g_turbine,i_g_vacation,futures_settle,futures_prev_day_open_interest,crude_usd_per_barrel
2017-03-21,78.60,34.24,67.15,108.04,63.36,15.05,182.30,45.18,81.83,49.26,...,0.900000,-0.2,0.000000,0.000000,0.200000,0.000000,-0.475000,47.34,20330.0,47.02
2017-03-22,78.83,33.99,67.57,108.39,63.51,15.02,182.33,44.65,81.76,49.57,...,0.000000,0.0,-0.066667,-0.200000,0.033333,0.100000,0.300000,48.04,625764.0,47.29
2017-03-23,78.48,34.06,67.10,107.87,63.04,14.68,181.55,44.48,81.86,49.30,...,0.400000,0.0,-0.200000,-0.700000,0.300000,0.300000,0.000000,47.70,625344.0,47.00
2017-03-24,77.27,33.78,65.78,107.99,62.83,14.61,180.91,44.10,81.23,48.87,...,0.600000,0.0,-0.275000,-0.550000,0.000000,0.000000,-0.450000,47.97,632705.0,47.30
2017-03-27,77.25,33.75,66.23,106.28,62.87,14.75,180.40,44.29,81.25,48.83,...,0.000000,0.0,-0.200000,-0.350000,0.200000,0.000000,0.600000,47.73,619275.0,47.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-02-18,80.05,22.46,67.83,95.00,25.28,9.12,133.01,47.69,52.02,51.22,...,-0.066667,0.0,-0.287500,-0.428571,-0.040000,-0.568421,-0.100000,60.52,84307.0,60.40
2021-02-19,82.31,22.87,71.53,95.80,25.42,9.45,133.94,48.42,52.37,53.00,...,-0.283333,0.0,-0.290909,-0.424138,-0.157143,-0.350000,-0.288000,59.24,65720.0,59.12
2021-02-23,85.53,24.23,75.81,99.63,26.06,11.16,145.24,52.10,55.05,55.77,...,-0.175000,-0.3,-0.336364,-0.423611,-0.075000,-0.400000,-0.258333,61.67,452697.0,61.66
2021-02-24,87.25,25.30,78.16,103.31,28.16,11.84,150.06,54.67,56.70,56.65,...,-0.033333,-0.1,-0.422222,-0.475610,-0.125000,-0.320000,-0.194737,63.22,444616.0,63.21


In [271]:
#df_na = df_0.fillna({'int_news_sentiment':0})
#df_na = df_0.dropna()
#df_na

date,Value_PHILLIPS 66,Value_BP P.L.C.,Value_VALERO ENERGY CORPORATION,Value_CHEVRON CORPORATION,Value_OCCIDENTAL PETROLEUM CORPORATION,Value_MARATHON OIL CORPORATION,Value_PIONEER NATURAL RESOURCES COMPANY,Value_CONOCOPHILLIPS,Value_EXXON MOBIL CORPORATION,Value_MARATHON PETROLEUM CORPORATION,...,i_g_greenhouse,i_g_hurricane_storm,i_g_pollution,i_g_sanction,i_g_solar,i_g_turbine,i_g_vacation,futures_settle,futures_prev_day_open_interest,crude_usd_per_barrel
2017-05-01,79.56,34.32,64.36,105.92,60.82,14.82,172.43,47.48,82.06,50.86,...,0.000000,0.000000,-0.150000,0.000000,0.200000,0.0,-0.133333,48.84,572455.0,48.83
2017-06-01,76.62,36.24,62.12,104.27,60.51,13.10,166.97,44.86,80.70,53.13,...,-0.400000,0.000000,-0.100000,0.000000,0.250000,0.0,0.120000,48.36,560137.0,48.32
2017-06-21,78.89,34.65,64.68,104.49,59.88,11.61,155.03,44.95,81.44,51.73,...,0.000000,-0.250000,-0.200000,-0.100000,0.000000,0.0,0.300000,42.53,538309.0,42.48
2017-08-01,85.63,36.27,68.82,110.78,61.53,11.91,163.27,44.75,80.17,56.53,...,0.000000,0.000000,0.000000,-0.650000,-0.020000,0.0,0.200000,49.16,592595.0,49.19
2017-08-28,83.65,34.47,68.42,107.76,59.13,10.92,128.46,43.05,76.47,52.52,...,0.200000,-0.270370,-0.666667,-0.550000,0.000000,-0.1,-0.400000,46.57,516590.0,46.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-12-28,69.33,20.99,55.09,84.90,17.40,6.63,110.65,39.30,41.74,40.78,...,-0.250000,0.000000,-0.442857,0.020000,-0.125000,0.0,-0.087500,47.62,372115.0,47.50
2021-01-15,71.74,24.26,58.30,92.09,22.39,8.25,127.38,45.12,47.89,43.98,...,-0.233333,0.100000,-0.391304,-0.500000,0.090909,0.0,-0.128571,52.36,96455.0,52.25
2021-01-28,71.45,22.86,58.33,89.02,20.95,7.56,122.60,41.11,46.06,44.77,...,-0.100000,0.000000,-0.271429,-0.308333,0.000000,0.0,-0.250000,52.34,450591.0,52.26
2021-02-01,67.38,22.21,56.24,86.41,20.39,7.34,120.26,40.71,44.92,43.45,...,-0.100000,0.000000,-0.314286,-0.633333,0.100000,-0.5,-0.800000,53.55,433690.0,53.55


In [326]:
df_0.isnull().sum().sum()

1323
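
The cells in this section use three views of the same missing-data information: the total count, the count per column, and the rows that contain at least one NaN. A minimal sketch on a made-up frame shows how they relate:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing values
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                     "b": [np.nan, np.nan, 6.0]})

total_missing = demo.isnull().sum().sum()        # all NaN cells in the frame
per_column = demo.isnull().sum()                 # NaN count for each column
rows_with_nan = demo[demo.isnull().any(axis=1)]  # rows containing at least one NaN

print(total_missing)        # 3
print(per_column["b"])      # 2
print(len(rows_with_nan))   # 2
```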

### Alternate Method

In [264]:
#df_0.fillna({'int_news_sentiment':0}, inplace=True)
#df_0.fillna({'na_news_sentiment':0}, inplace=True)
#df_0.dropna(inplace=True)
#df_0
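
The commented-out cell above combines a column-specific `fillna` with `dropna`. The sketch below shows what that combination does on a made-up frame: the column names come from this notebook, while the values are only partly borrowed from it and are otherwise invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'int_news_sentiment' is a column name from the cell above
demo = pd.DataFrame({"int_news_sentiment": [0.2, np.nan, -0.1],
                     "crude_usd_per_barrel": [47.02, 47.29, np.nan]})

# Fill NaNs only in the named column; other columns are left untouched
filled = demo.fillna({"int_news_sentiment": 0})

# Then drop any row that still contains a NaN
cleaned = filled.dropna()
print(len(filled), len(cleaned))   # 3 2
```

Filling first and dropping second keeps rows whose only gap was in the sentiment column, which would otherwise be discarded by `dropna`.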

<a id='Save-Data'></a>

---
## Save Data to CSV

[[ go back to the top ]](#Table-of-Contents)

In [327]:
#df_na.to_csv(r'../target_with_all_features_clean.csv', index = True)
df_0.to_csv(r'../targets_wout_mobility2.csv', index = True)
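
Because the date index is written out as an ordinary column, it has to be parsed again when the file is read back. The sketch below round-trips a made-up frame through an in-memory buffer (used here instead of a file path so the example is self-contained) and restores the `DatetimeIndex` with `index_col` and `parse_dates`.

```python
import io
import pandas as pd

# Hypothetical frame with a DatetimeIndex named 'date'
out = pd.DataFrame({"crude_usd_per_barrel": [47.02, 47.29]},
                   index=pd.to_datetime(["2017-03-21", "2017-03-22"]))
out.index.name = "date"

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
out.to_csv(buffer, index=True)
buffer.seek(0)

# Parse the 'date' column back into a DatetimeIndex on load
back = pd.read_csv(buffer, index_col="date", parse_dates=True)
print(type(back.index).__name__)
```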

In [328]:
len(df_0)

985

In [329]:
df_0

date,Value_PHILLIPS 66,Value_BP P.L.C.,Value_VALERO ENERGY CORPORATION,Value_CHEVRON CORPORATION,Value_OCCIDENTAL PETROLEUM CORPORATION,Value_MARATHON OIL CORPORATION,Value_PIONEER NATURAL RESOURCES COMPANY,Value_CONOCOPHILLIPS,Value_EXXON MOBIL CORPORATION,Value_MARATHON PETROLEUM CORPORATION,...,i_g_greenhouse,i_g_hurricane_storm,i_g_pollution,i_g_sanction,i_g_solar,i_g_turbine,i_g_vacation,futures_settle,futures_prev_day_open_interest,crude_usd_per_barrel
2017-03-21,78.60,34.24,67.15,108.04,63.36,15.05,182.30,45.18,81.83,49.26,...,0.900000,-0.200000,0.000000,0.000000,0.200000,0.00,-0.475000,47.34,20330.0,47.02
2017-03-22,78.83,33.99,67.57,108.39,63.51,15.02,182.33,44.65,81.76,49.57,...,0.000000,0.000000,-0.066667,-0.200000,0.033333,0.10,0.300000,48.04,625764.0,47.29
2017-03-23,78.48,34.06,67.10,107.87,63.04,14.68,181.55,44.48,81.86,49.30,...,0.400000,0.000000,-0.200000,-0.700000,0.300000,0.30,0.000000,47.70,625344.0,47.00
2017-03-24,77.27,33.78,65.78,107.99,62.83,14.61,180.91,44.10,81.23,48.87,...,0.600000,0.000000,-0.275000,-0.550000,0.000000,0.00,-0.450000,47.97,632705.0,47.30
2017-03-27,77.25,33.75,66.23,106.28,62.87,14.75,180.40,44.29,81.25,48.83,...,0.000000,0.000000,-0.200000,-0.350000,0.200000,0.00,0.600000,47.73,619275.0,47.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-02-19,82.31,22.87,71.53,95.80,25.42,9.45,133.94,48.42,52.37,53.00,...,-0.283333,0.000000,-0.290909,-0.424138,-0.157143,-0.35,-0.288000,59.24,65720.0,59.12
2021-02-22,83.96,23.63,74.25,98.39,26.47,10.20,139.47,50.88,54.30,54.85,...,0.500000,-0.245455,-0.266667,-0.490476,0.071429,-0.30,-0.117647,61.49,27854.0,61.67
2021-02-23,85.53,24.23,75.81,99.63,26.06,11.16,145.24,52.10,55.05,55.77,...,-0.175000,-0.300000,-0.336364,-0.423611,-0.075000,-0.40,-0.258333,61.67,452697.0,61.66
2021-02-24,87.25,25.30,78.16,103.31,28.16,11.84,150.06,54.67,56.70,56.65,...,-0.033333,-0.100000,-0.422222,-0.475610,-0.125000,-0.32,-0.194737,63.22,444616.0,63.21
