# Missing Values

This notebook takes data frames of daily statistics as input. It checks each one for missing values and fills any that are found, using linear interpolation for most columns and zeros for the Reddit activity columns.

In [1]:
import pandas as pd
import os

data_dir = "/mnt/Ivana/Data/Tezos/Final/"

First, every file is loaded and the number of rows containing at least one NA value is counted.

In [6]:
for file in os.listdir(data_dir):
    df = pd.read_csv(data_dir + file)
    # Keep only the rows that contain at least one missing value
    na_rows = df[df.isna().any(axis=1)]

    print(file, ", na values: ", na_rows.shape[0])

Voting.csv , na values:  0
OtherBlockchainPrices.csv , na values:  0
Accounts.csv , na values:  0
Supply.csv , na values:  0
Social.csv , na values:  194
Tzstats_transaction_daily.csv , na values:  0
MarketAndPrice.csv , na values:  1
Contracts.csv , na values:  0


Only the market and price data frame and the social media data frame contain missing values. The following sections look at where these values occur and fill them.
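The count above is per row, so it does not show which columns are affected. A minimal sketch of a per-column summary follows; the helper name `na_summary` and the toy frame are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def na_summary(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, for columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame with made-up values: one row is missing two of its columns
toy = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "open": [1.0, np.nan, 3.0],
    "close": [1.5, np.nan, 3.5],
    "volume": [10.0, 20.0, 30.0],
})

summary = na_summary(toy)
print(summary)
```

Applied to one of the loaded data frames, the same helper would list only the columns that need attention.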

## 1. Remove missing values from market and price data

In [8]:
df = pd.read_csv(data_dir + "MarketAndPrice.csv")

df[df.isna().any(axis=1)]

Unnamed: 0,date,current_price,market_cap,total_volume,close,high,low,open
1000,2021-03-29,4.226837,3227780000.0,237698700.0,,,,


In [13]:
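# Linear interpolation (the pandas default) on the numeric columns only;
# the string-typed date column is left untouched
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].interpolate()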
na_rows = df[df.isna().any(axis=1)]
print("Na rows after interpolation: ", na_rows.shape[0])

Na rows after interpolation:  0


In [15]:
dates = ["2021-03-28", "2021-03-29", "2021-03-30"]
df[df.date.isin(dates)]

Unnamed: 0,date,current_price,market_cap,total_volume,close,high,low,open
999,2021-03-28,4.104548,3145409000.0,178011700.0,4.191,4.32,4.073,4.108
1000,2021-03-29,4.226837,3227780000.0,237698700.0,4.3755,4.554,4.3105,4.349
1001,2021-03-30,4.570924,3490515000.0,347236100.0,4.56,4.788,4.548,4.59
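For a single missing day between two known days, linear interpolation reduces to the average of the two neighbours. The short check below reproduces the filled `close` value from the output above, using only the two neighbouring values shown there; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# close on 2021-03-28, 2021-03-29 (missing) and 2021-03-30, as shown above
close = pd.Series([4.191, np.nan, 4.56])

# The default method is "linear": equally spaced steps between valid neighbours
filled = close.interpolate()

midpoint = (4.191 + 4.56) / 2
print(round(filled.iloc[1], 4), round(midpoint, 4))
```

The same arithmetic holds for `high`, `low` and `open` in the interpolated row.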


In [16]:
# Overwrite the original file with the filled data
df.to_csv(data_dir + "MarketAndPrice.csv", index=False)

## 2. Remove missing values from social media data

In [49]:
df = pd.read_csv(data_dir + "Social.csv")

df[df.isna().any(axis=1)]

Unnamed: 0,Date,twitter_followers,reddit_average_posts_48h,reddit_average_comments_48h,reddit_subscribers,reddit_accounts_active_48h
31,2018-08-03,,1.960,31.760,9362.0,266.576923
32,2018-08-04,,1.920,28.440,9374.0,254.500000
57,2018-08-29,,1.136,19.136,9685.0,228.826087
59,2018-08-31,,0.783,15.391,9698.0,199.791667
71,2018-09-12,,0.870,14.957,9791.0,217.083333
...,...,...,...,...,...,...
1807,2023-06-14,463387.0,0.000,0.000,,
1808,2023-06-15,463084.0,0.000,0.000,,
1809,2023-06-16,463248.0,0.000,0.000,,
1810,2023-06-17,463069.0,0.000,0.000,,


In [50]:
# Fill missing Reddit activity values with 0
activity_cols = ["reddit_accounts_active_48h", "reddit_average_posts_48h", "reddit_average_comments_48h"]
df[activity_cols] = df[activity_cols].fillna(0)

df[df.isna().any(axis=1)]

Unnamed: 0,Date,twitter_followers,reddit_average_posts_48h,reddit_average_comments_48h,reddit_subscribers,reddit_accounts_active_48h
31,2018-08-03,,1.960,31.760,9362.0,266.576923
32,2018-08-04,,1.920,28.440,9374.0,254.500000
57,2018-08-29,,1.136,19.136,9685.0,228.826087
59,2018-08-31,,0.783,15.391,9698.0,199.791667
71,2018-09-12,,0.870,14.957,9791.0,217.083333
...,...,...,...,...,...,...
1807,2023-06-14,463387.0,0.000,0.000,,0.000000
1808,2023-06-15,463084.0,0.000,0.000,,0.000000
1809,2023-06-16,463248.0,0.000,0.000,,0.000000
1810,2023-06-17,463069.0,0.000,0.000,,0.000000


In [51]:
# Fill the subscriber and follower counts by linear interpolation
df.reddit_subscribers = df.reddit_subscribers.interpolate()
df.twitter_followers = df.twitter_followers.interpolate()
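One detail of `interpolate()` matters for `reddit_subscribers`, whose missing values in the rows shown above sit at the end of the date range (assuming those are indeed the last rows of the file). With the default settings in pandas 1.x and 2.x, gaps between two valid values are filled linearly, trailing gaps repeat the last valid value, and leading gaps stay NaN. The toy series below is made up to show all three cases.

```python
import numpy as np
import pandas as pd

# Made-up series with a leading gap, an interior gap and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan, np.nan])

filled = s.interpolate()  # defaults: method="linear", forward direction
print(filled.tolist())
```

If a column had missing values at its very start, they would survive this call and need separate handling, for example by trimming the date range.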

In [52]:
df.to_csv(data_dir + "Social.csv", index=False)

## 3. Adjust time periods according to technical indicators

After the technical indicators are calculated, the newly created MarketAndPriceWithTI data frame contains NA values, because some indicators are computed with a shift, which leaves their first rows undefined. Since a substantial amount of data is available, the time interval is simply shortened so that the dates containing NA values are dropped. To keep all data groups aligned, the same dates are removed from every file.

In [4]:
data_with_all_dates = "../../Data/Tezos/DataFullTimePeriod/"
dest_dir = "../../Data/Tezos/Final/"

new_start_date = pd.to_datetime("2018-08-02", format="%Y-%m-%d")

for file in os.listdir(data_with_all_dates):
    df = pd.read_csv(data_with_all_dates + file)
    
    date_col = df.columns[0]
    df[date_col] = pd.to_datetime(df[date_col])
    new_df = df[df[date_col] >= new_start_date]

    print(file, "removing rows ", df.shape[0]-new_df.shape[0])
    new_df.to_csv(dest_dir + file, index=False)

Voting.csv removing rows  30
OtherBlockchainPrices.csv removing rows  30
Accounts.csv removing rows  30
Supply.csv removing rows  30
Social.csv removing rows  30
Tzstats_transaction_daily.csv removing rows  30
Contracts.csv removing rows  30
MarketAndPriceWithTI.csv removing rows  30
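The start date above is hard-coded. As a sketch of how such a date could be derived from the data instead, the hypothetical helper below returns the first date after the last row that still contains an NA value. The function name, the toy frame and the 3-day moving average are assumptions made for illustration; they are not taken from the original notebook.

```python
import pandas as pd


def first_complete_date(df: pd.DataFrame, date_col: str) -> pd.Timestamp:
    """Return the earliest date after which no row contains a missing value."""
    dates = pd.to_datetime(df[date_col])
    has_na = df.isna().any(axis=1)
    if not has_na.any():
        return dates.min()
    later = dates[dates > dates[has_na].max()]
    if later.empty:
        raise ValueError("the last row still contains missing values")
    return later.min()


# Toy frame: a 3-day rolling mean leaves the first two rows undefined
toy = pd.DataFrame({
    "date": pd.date_range("2018-07-01", periods=5, freq="D"),
    "close": [1.0, 2.0, 3.0, 4.0, 5.0],
})
toy["sma_3"] = toy["close"].rolling(3).mean()

start = first_complete_date(toy, "date")
trimmed = toy[toy["date"] >= start]
print(start.date(), len(toy) - len(trimmed))
```

The resulting date could then play the role of `new_start_date` when the other files are trimmed.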


## Final check

In [53]:
for file in os.listdir(data_dir):
    df = pd.read_csv(data_dir + file)
    na_rows = df[df.isna().any(axis=1)]

    print(file, ", na values: ", na_rows.shape[0])

Voting.csv , na values:  0
OtherBlockchainPrices.csv , na values:  0
Accounts.csv , na values:  0
Supply.csv , na values:  0
Social.csv , na values:  0
Tzstats_transaction_daily.csv , na values:  0
MarketAndPrice.csv , na values:  0
Contracts.csv , na values:  0
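The printed counts have to be read by eye. If the check is to stop a pipeline automatically, it can be turned into an assertion. The sketch below works on in-memory frames with made-up names and values; `files_with_missing` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def files_with_missing(frames: dict) -> dict:
    """Map each frame name to its number of rows with at least one NA value (non-zero only)."""
    counts = {name: int(df.isna().any(axis=1).sum()) for name, df in frames.items()}
    return {name: n for name, n in counts.items() if n > 0}


# Made-up frames standing in for the loaded CSV files
frames = {
    "clean.csv": pd.DataFrame({"a": [1.0, 2.0]}),
    "gappy.csv": pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]}),
}

offenders = files_with_missing(frames)
print(offenders)

# Fails loudly if any frame in the checked set still has missing values
clean_only = {"clean.csv": frames["clean.csv"]}
assert not files_with_missing(clean_only), "missing values remain"
```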
