# Time Series Pandas DataFrame Concatenator
<i>This notebook combines multiple files of sliced data from a single directory into one large file. The combined data is then cleaned so that it is uniform (no gaps, no missing values, and no duplicate rows). The resulting file is exported to the working directory.</i>

In [1]:
# Importing Necessary Libraries
import os
import numpy as np
import pandas as pd

In [2]:
# Data Directory
path = 'archive/'
path_files = os.listdir(path)

# # Check
# print(len(path_files), 'Files in Directory')
# path_files

In [3]:
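# Loading CSV Files in Directory into Dataframes
frames = []
for file in sorted(path_files):
    # Skip anything that is not a CSV (e.g. hidden or checkpoint folders)
    if file.lower().endswith('.csv'):
        frames.append(pd.read_csv(os.path.join(path, file)))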

# # Check
# frames[0]

In [4]:
# Find Time-Related Column (its dtype is not yet pandas datetime)
# TODO: Extend to cover all time-related headings
t_col = None
for column in frames[0].columns:
    # Case-insensitive match so 'Time' and 'time' are both found
    if 'time' in column.lower():
        t_col = column

# # Check
# t_col
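
One possible way to cover more time-related headings is sketched below. The keyword list and the helper name `find_time_columns` are assumptions made for illustration, not part of the original notebook; adjust the keywords to the headings that actually appear in your files.

```python
import pandas as pd

# Hypothetical keyword list; extend it to match the headings in your files.
TIME_KEYWORDS = ('time', 'date')

def find_time_columns(frame, keywords=TIME_KEYWORDS):
    """Return the column names that contain any keyword, ignoring case."""
    return [
        col for col in frame.columns
        if any(key in str(col).lower() for key in keywords)
    ]

# Toy frame with made-up headings, used only to show the matching.
sample = pd.DataFrame(columns=['Date', 'Time', 'open', 'close'])
time_cols = find_time_columns(sample)
print(time_cols)
```

Returning a list rather than a single name makes it visible when a file has more than one candidate column, or none at all.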

In [5]:
# Convert Time_Related Column to pandas.datetime Object
# TODO: Combine separate date and time columns
for frame in frames:
    frame[t_col] = pd.to_datetime(frame[t_col])

# # Check
# frames[0].info()
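
If a file stores the date and the time in two separate columns, one way to combine them is to join the strings before parsing. This is a minimal sketch on toy data; the column names `date`, `time`, and `datetime` are assumptions, since the files loaded above appear to have a single `time` column.

```python
import pandas as pd

# Toy frame with hypothetical 'date' and 'time' columns stored as strings.
sample = pd.DataFrame({
    'date': ['2021-06-15', '2021-06-15'],
    'time': ['15:00:00', '16:00:00'],
    'close': [1.0, 2.0],  # placeholder values
})

# Join the two strings with a space, then parse the result in one pass.
sample['datetime'] = pd.to_datetime(sample['date'] + ' ' + sample['time'])

# Drop the original pieces and use the combined column as the index.
sample = sample.drop(columns=['date', 'time']).set_index('datetime')
print(sample)
```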

In [6]:
# Set Time-Related Column as Index
for frame in frames:
    frame.set_index(t_col, inplace=True)

# # Check
# frames[0].info()

In [7]:
# Combining Dataframes into One Large Dataframe
df = pd.concat(frames)

# # Check
# length = 0
# for frame in frames:
#     length = length + len(frame)
# len(df) == length

In [8]:
# Reordering Large Dataframe
df.sort_index(inplace=True)

In [9]:
# Checking Dataframe Content
print(df.head())
print(df.tail())
print(df.value_counts())
print(df.isna().sum())
print('length =', len(df))

            open  high   low  close  tick_volume
time                                            
2011-03-20  0.83   0.9  0.82   0.89        18828
2011-03-24  0.83   0.9  0.82   0.87        14009
2011-03-24  0.83   0.9  0.82   0.87        14009
2011-03-24  0.83   0.9  0.82   0.87        14009
2011-03-24  0.83   0.9  0.82   0.87        14009
                         open      high       low     close  tick_volume
time                                                                    
2021-06-15 16:00:00  40406.24  40419.24  40281.05  40291.96          672
2021-06-15 16:00:00  40406.24  40419.24  40150.14  40265.89         3997
2021-06-15 16:00:00  40406.24  40419.24  40206.39  40230.64         1981
2021-06-15 16:00:00  40406.24  40419.24  39704.99  40186.89        27306
2021-06-15 16:00:00  40406.24  40419.24  40150.14  40282.89         7118
open      high      low       close     tick_volume
7349.00   7349.00   7349.00   7349.00   1              198
7675.00   7675.00   7675.00   7675.

<i>Dataframe Content <b>Notes:</b></i>
1. Some days are skipped --> Find a way to look for skipped dates.
2. Duplicates present --> Find a way to look for duplicates.
3. Same day, different values --> Find a way to look for days with multiple timestamps or varied values.
Note: The unvaried data is due to how the data was collected. The cryptocurrency market also trades 24/7, so one day's close becomes the next day's open.
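
The three checks in the notes above could be sketched as follows. The example runs on a small toy frame (not the archive data) whose index is named `time`, as in the output shown earlier; the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy data: one exact duplicate row, one day with two different rows,
# and no rows at all on 2021-01-03.
idx = pd.to_datetime([
    '2021-01-01 00:00', '2021-01-01 00:00',
    '2021-01-02 00:00', '2021-01-02 16:00',
    '2021-01-04 00:00',
])
toy = pd.DataFrame({'close': [1.0, 1.0, 2.0, 2.5, 4.0]}, index=idx)
toy.index.name = 'time'

# 1. Skipped dates: calendar days inside the covered range with no rows.
days = toy.index.normalize()
all_days = pd.date_range(days.min(), days.max(), freq='D')
skipped = all_days.difference(days)

# 2. Exact duplicates: same timestamp and same values in every column.
flat = toy.reset_index()
n_duplicates = int(flat.duplicated().sum())

# 3. Days that still hold more than one distinct row after deduplication.
unique_rows = flat.drop_duplicates()
rows_per_day = unique_rows.groupby(unique_rows['time'].dt.normalize()).size()
varied_days = rows_per_day[rows_per_day > 1].index

print(list(skipped), n_duplicates, list(varied_days))
```

To run the same checks on the combined data, replace `toy` with `df`.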

In [10]:
# Resampling Dataframe to Daily Frequency
# Keeps one row per day, which collapses duplicates.
# Saves the first value of each day if applicable (taken @ 00:00:00 timestamp).
# Backward-fills missing days, then forward-fills any gaps that remain.
DF = df.resample('D').first().bfill().ffill()

# # Check
# # TODO: Edit this check after addressing Dataframe Content Notes
# print('length =', len(DF))
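
To make the effect of the resampling chain easier to see, here is a small sketch on toy data (not the archive data). It shows a day with two rows being reduced to its first row and a missing day being filled from the following day.

```python
import pandas as pd

idx = pd.to_datetime([
    '2021-01-01 00:00', '2021-01-01 16:00',  # two rows on the same day
    '2021-01-03 00:00',                      # 2021-01-02 has no rows
])
toy = pd.DataFrame({'close': [1.0, 1.5, 3.0]}, index=idx)

daily = toy.resample('D').first()  # one row per day; the gap becomes NaN
filled = daily.bfill().ffill()     # the gap takes the next day's value

print(daily)
print(filled)
```

Note that `first()` returns the first non-null value of each column within a day, so if a day's earliest row has missing values, the resulting row may mix values from different timestamps.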

In [11]:
# # Export Dataframe into working directory
# DF.to_csv('data.csv')