In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
df=pd.read_csv('coin_gecko_2022-03-17 (1).csv')

### Feature Engineering Report: Data Loading
- Loaded the raw cryptocurrency dataset for feature engineering.
- The dataset includes price, volume, market cap, and other features for each coin.
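A quick schema check right after loading can catch a renamed or missing column early. The sketch below is only illustrative: it reads a two-row in-memory CSV whose layout is assumed from the preview table, and `expected` and `missing_cols` are hypothetical names, not part of the original notebook.

```python
import io

import pandas as pd

# Two-row stand-in for the raw file; the real CSV layout is an assumption.
csv_text = """coin,symbol,price,1h,24h,7d,24h_volume,mkt_cap,date
Bitcoin,BTC,40851.38,0.001,0.000,-0.027,2.047612e+10,7.760774e+11,2022-03-17
Ethereum,ETH,2824.42,0.004,0.029,0.034,1.364041e+10,3.390772e+11,2022-03-17
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Columns the later cells rely on.
expected = ["coin", "symbol", "price", "1h", "24h", "7d",
            "24h_volume", "mkt_cap", "date"]

# Any expected column that is absent from the loaded frame.
missing_cols = [col for col in expected if col not in toy.columns]
print("Missing columns:", missing_cols)
print(toy.dtypes)
```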


In [3]:
df

Unnamed: 0,coin,symbol,price,1h,24h,7d,24h_volume,mkt_cap,date
0,Bitcoin,BTC,40851.380000,0.001,0.000,-0.027,2.047612e+10,7.760774e+11,2022-03-17
1,Ethereum,ETH,2824.420000,0.004,0.029,0.034,1.364041e+10,3.390772e+11,2022-03-17
2,Tether,USDT,1.000000,-0.000,0.000,0.000,4.413140e+10,8.020588e+10,2022-03-17
3,BNB,BNB,389.610000,0.002,0.016,-0.010,1.425354e+09,6.556116e+10,2022-03-17
4,USD Coin,USDC,0.999739,-0.001,0.000,-0.000,3.569816e+09,5.259607e+10,2022-03-17
...,...,...,...,...,...,...,...,...,...
495,IRISnet,IRIS,0.055426,0.016,-0.003,-0.088,2.976839e+06,6.809024e+07,2022-03-17
496,Circuits of Value,COVAL,0.037961,0.002,-0.012,-0.054,3.667870e+05,6.782627e+07,2022-03-17
497,ARPA Chain,ARPA,0.069003,-0.000,0.008,-0.037,1.363376e+07,6.776284e+07,2022-03-17
498,SuperRare,RARE,0.464613,-0.003,0.014,0.019,9.398219e+06,6.738822e+07,2022-03-17


### Feature Engineering Report: Data Preview
- Displayed the dataset to understand the structure and check for missing or anomalous values.


In [5]:
# Features that contain at least one NaN value
features_with_na = [feature for feature in df.columns if df[feature].isnull().sum() >= 1]
for feature in features_with_na:
    # Percentage of rows in which this feature is missing
    print(feature, np.round(df[feature].isnull().mean() * 100, 5), '% missing values')

1h 0.8 % missing values
24h 0.8 % missing values
7d 1.0 % missing values
24h_volume 0.8 % missing values


### Feature Engineering Report: Missing Value Analysis
- Identified features with missing values and calculated the percentage of missing data for each.
- This step is crucial for deciding how to handle missing values.
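The same per-column percentages can also be computed without an explicit loop. This is a minimal sketch on made-up toy values, not the CoinGecko file, and `missing_pct` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) standing in for the real dataset.
toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "1h": [0.001, np.nan, 0.002, 0.003],
    "7d": [np.nan, np.nan, 0.01, 0.02],
})

# isnull().mean() is the fraction of missing entries in each column.
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that actually contain missing values.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

Because the result is a Series indexed by column name, it can be sorted or thresholded directly when deciding between imputing a column and dropping it.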


In [7]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which does not update df under copy-on-write.
for col in ['1h', '24h', '7d', '24h_volume']:
    df[col] = df[col].fillna(df[col].median())


### Feature Engineering Report: Filling Missing Values
- Filled missing values in `1h`, `24h`, `7d`, and `24h_volume` with each column's median.
- With about 1% or less of the values missing per column, median imputation keeps every row and is less sensitive to extreme values than the mean.
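Median imputation can be sketched on a small frame as follows. The values are made up for illustration, with one deliberately extreme entry per column to show that the median is not pulled toward it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); each column has one NaN and one extreme value.
toy = pd.DataFrame({
    "1h": [0.001, np.nan, 0.003, 0.100],
    "24h_volume": [10.0, 20.0, np.nan, 1000.0],
})

for col in ["1h", "24h_volume"]:
    # median() skips NaN by default, so it is computed from observed values only.
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

Here the missing `24h_volume` entry is filled with 20.0, whereas the mean of the observed values would have been about 343.3.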


In [8]:
df.isnull().sum()

coin          0
symbol        0
price         0
1h            0
24h           0
7d            0
24h_volume    0
mkt_cap       0
date          0
dtype: int64

### Feature Engineering Report: Null Value Check
- Re-checked every column for missing values after imputation.
- All columns report zero nulls, so the dataset is complete.


In [9]:
df.duplicated().sum()

0

### Feature Engineering Report: Duplicate Check
- Checked for duplicate rows in the dataset; the count is 0, so no rows needed to be removed.
- Duplicate rows would otherwise over-weight the repeated coins during model training.
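For a dataset that does contain duplicates, the check and the removal could look like the sketch below. The toy frame and the names `n_duplicates` and `deduped` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): the third row repeats the first one exactly.
toy = pd.DataFrame({
    "coin": ["Bitcoin", "Ethereum", "Bitcoin"],
    "symbol": ["BTC", "ETH", "BTC"],
    "price": [40851.38, 2824.42, 40851.38],
})

# duplicated() flags every repeat of an earlier row (the first copy is not flagged).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```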


In [10]:
df

Unnamed: 0,coin,symbol,price,1h,24h,7d,24h_volume,mkt_cap,date
0,Bitcoin,BTC,40851.380000,0.001,0.000,-0.027,2.047612e+10,7.760774e+11,2022-03-17
1,Ethereum,ETH,2824.420000,0.004,0.029,0.034,1.364041e+10,3.390772e+11,2022-03-17
2,Tether,USDT,1.000000,-0.000,0.000,0.000,4.413140e+10,8.020588e+10,2022-03-17
3,BNB,BNB,389.610000,0.002,0.016,-0.010,1.425354e+09,6.556116e+10,2022-03-17
4,USD Coin,USDC,0.999739,-0.001,0.000,-0.000,3.569816e+09,5.259607e+10,2022-03-17
...,...,...,...,...,...,...,...,...,...
495,IRISnet,IRIS,0.055426,0.016,-0.003,-0.088,2.976839e+06,6.809024e+07,2022-03-17
496,Circuits of Value,COVAL,0.037961,0.002,-0.012,-0.054,3.667870e+05,6.782627e+07,2022-03-17
497,ARPA Chain,ARPA,0.069003,-0.000,0.008,-0.037,1.363376e+07,6.776284e+07,2022-03-17
498,SuperRare,RARE,0.464613,-0.003,0.014,0.019,9.398219e+06,6.738822e+07,2022-03-17


### Feature Engineering Report: Data After Cleaning
- Displayed the dataset again after imputation and the duplicate check to confirm it is intact.


In [11]:
df=df.drop('date', axis=1)

### Feature Engineering Report: Dropping Date Column
- Dropped the `date` column: every displayed row carries the same snapshot date (2022-03-17), so it adds no information for feature engineering or modeling.
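Before dropping a column as uninformative, it can be worth confirming that it really holds a single value. A minimal sketch, assuming a toy frame with the same `date` column name:

```python
import pandas as pd

# Toy data (hypothetical): both rows share one snapshot date.
toy = pd.DataFrame({
    "coin": ["Bitcoin", "Ethereum"],
    "date": ["2022-03-17", "2022-03-17"],
})

# Drop the column only if it holds a single distinct value.
if toy["date"].nunique() == 1:
    toy = toy.drop(columns="date")

print(list(toy.columns))
```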


In [12]:
df['abs_7d_change'] = df['price'] * df['7d']
df['abs_24h_change'] = df['price'] * df['24h']

### Feature Engineering Report: Creating Change Features
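- Created `abs_7d_change` and `abs_24h_change` by multiplying `price` by the `7d` and `24h` change columns, which expresses each change in price units rather than as a fraction.
- "Absolute" here means price units, not absolute value: a negative `7d` value yields a negative `abs_7d_change`, so the features capture the direction of the move as well as its size.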
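As a worked example, the sketch below recomputes the two features for the Bitcoin and Ethereum rows shown in the data preview. The extra column `abs_7d_magnitude` is hypothetical and not part of the notebook; it only illustrates how an unsigned magnitude would differ.

```python
import pandas as pd

# Two rows copied from the preview table (price and change columns only).
sample = pd.DataFrame({
    "coin": ["Bitcoin", "Ethereum"],
    "price": [40851.38, 2824.42],
    "24h": [0.000, 0.029],
    "7d": [-0.027, 0.034],
})

# Same formulas as the notebook cell: price times the change column.
sample["abs_7d_change"] = sample["price"] * sample["7d"]
sample["abs_24h_change"] = sample["price"] * sample["24h"]

# Hypothetical extra column: the unsigned size of the 7-day move.
sample["abs_7d_magnitude"] = sample["abs_7d_change"].abs()

print(sample)
```

The products keep the sign of the change column, so Bitcoin's 7-day value comes out negative while its magnitude column is positive.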


In [13]:
df['Liquidity'] = df['24h_volume'] / df['mkt_cap']

### Feature Engineering Report: Creating Liquidity Feature
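- Created the `Liquidity` feature as the ratio of 24-hour trading volume to market cap (`24h_volume / mkt_cap`); higher values mean a larger share of the coin's market value traded in the past day.
- This is the target variable for the prediction task.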
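A ratio feature like this produces `inf` if the denominator is ever zero. The notebook does not report any zero market caps, so the guard below is only a defensive sketch on made-up values; `safe_cap` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the last row has a zero market cap.
toy = pd.DataFrame({
    "24h_volume": [2.0e10, 5.0e6, 1.0e6],
    "mkt_cap": [8.0e11, 1.0e8, 0.0],
})

# Replace a zero market cap with NaN so the ratio becomes NaN rather than inf.
safe_cap = toy["mkt_cap"].replace(0, np.nan)
toy["Liquidity"] = toy["24h_volume"] / safe_cap

print(toy)
```

Rows that end up with a NaN target can then be inspected or dropped explicitly instead of silently carrying an infinite value into modeling.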


In [14]:
df

Unnamed: 0,coin,symbol,price,1h,24h,7d,24h_volume,mkt_cap,abs_7d_change,abs_24h_change,Liquidity
0,Bitcoin,BTC,40851.380000,0.001,0.000,-0.027,2.047612e+10,7.760774e+11,-1102.987260,0.000000,0.026384
1,Ethereum,ETH,2824.420000,0.004,0.029,0.034,1.364041e+10,3.390772e+11,96.030280,81.908180,0.040228
2,Tether,USDT,1.000000,-0.000,0.000,0.000,4.413140e+10,8.020588e+10,0.000000,0.000000,0.550227
3,BNB,BNB,389.610000,0.002,0.016,-0.010,1.425354e+09,6.556116e+10,-3.896100,6.233760,0.021741
4,USD Coin,USDC,0.999739,-0.001,0.000,-0.000,3.569816e+09,5.259607e+10,-0.000000,0.000000,0.067872
...,...,...,...,...,...,...,...,...,...,...,...
495,IRISnet,IRIS,0.055426,0.016,-0.003,-0.088,2.976839e+06,6.809024e+07,-0.004877,-0.000166,0.043719
496,Circuits of Value,COVAL,0.037961,0.002,-0.012,-0.054,3.667870e+05,6.782627e+07,-0.002050,-0.000456,0.005408
497,ARPA Chain,ARPA,0.069003,-0.000,0.008,-0.037,1.363376e+07,6.776284e+07,-0.002553,0.000552,0.201198
498,SuperRare,RARE,0.464613,-0.003,0.014,0.019,9.398219e+06,6.738822e+07,0.008828,0.006505,0.139464


In [15]:
import os
os.makedirs('data', exist_ok=True)
df.to_csv("./data/coin_gecko_2022-03-17_cleaned.csv", index=False)

### Feature Engineering Report: Saving Cleaned Data
- Saved the cleaned and feature-engineered dataset to `./data/coin_gecko_2022-03-17_cleaned.csv` for use in modeling and analysis.
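A quick round-trip read is one way to confirm that a saved file has the expected shape and columns. The sketch below writes a toy frame to a temporary directory rather than to `./data`, and the file name `toy_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy data (values taken from the first two rows of the engineered table).
toy = pd.DataFrame({
    "coin": ["Bitcoin", "Ethereum"],
    "Liquidity": [0.026384, 0.040228],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "data")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "toy_cleaned.csv")

    # index=False keeps the row index out of the file.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same columns and shape as the original.
same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)
```

Writing with `index=False` means no extra index column appears when the file is read back.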
