# 📊 Cryptocurrency Data Analysis Project

## Project Overview
This notebook performs comprehensive data cleaning and preprocessing on a cryptocurrency dataset containing information about 4,150 cryptocurrencies. The analysis includes handling missing values, data type conversions, and statistical imputation techniques to prepare the data for further analysis.

In [22]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Setup and Data Loading
Google Drive is mounted in the cell above so that the notebook can read the dataset. In this section, we import the Python libraries for data manipulation and visualization, then load the cryptocurrency dataset from a CSV file stored in Google Drive.

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [24]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/CryptocurrencyData.csv')

In [25]:
df

Unnamed: 0,Rank,Coin Name,Symbol,Price,1h,24h,7d,30d,24h Volume,Circulating Supply,Total Supply,Market Cap
0,1,Bitcoin,BTC,36456.94,0.40%,-1.70%,1.00%,18.40%,"$22,801,222,945.00",19549806,21 Million,"$712,726,163,003.00"
1,2,Ethereum,ETH,2027.60,0.50%,1.40%,1.00%,20.70%,"$26,845,710,464.00",120249015,120 Million,"$243,488,187,281.00"
2,3,Tether,USDT,1.00,0.10%,-0.30%,-0.10%,-0.10%,"$47,122,466,339.00",88308652879,88.3 Billion,"$88,027,617,310.00"
3,4,BNB,BNB,231.63,-0.10%,-12.60%,-8.00%,5.40%,"$3,715,265,116.00",153856150,154 Million,"$35,716,332,862.00"
4,5,XRP,XRP,0.59,0.10%,-1.90%,-6.90%,12.10%,"$1,339,890,506.00",53718306475,100 Billion,"$31,863,926,051.00"
...,...,...,...,...,...,...,...,...,...,...,...,...
4145,4146,Aave DAI v1,ADAI,1.00,-0.20%,-0.50%,-1.20%,-1.10%,$58.60,0,-,-
4146,4147,Hive Dollar,HBD,0.97,0.20%,-3.20%,-3.10%,-3.60%,"$140,416.00",0,-,-
4147,4148,OWN Token,OWN,$-,-0.20%,0.00%,-0.20%,-0.10%,$0.60,0,5 Billion,-
4148,4149,Kmushicoin,KTV,0.52,-,-,-,-,-,0,20 Million,-


## 2. Initial Data Exploration

We examine the dataset structure, data types, and check the first few rows to understand what information is available. The dataset contains 12 columns including rank, coin name, symbol, price, percentage changes (1h, 24h, 7d, 30d), trading volume, supply metrics, and market capitalization.

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4150 entries, 0 to 4149
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Rank                4150 non-null   int64 
 1   Coin Name           4150 non-null   object
 2   Symbol              4150 non-null   object
 3    Price              4150 non-null   object
 4   1h                  4055 non-null   object
 5   24h                 4150 non-null   object
 6   7d                  4148 non-null   object
 7   30d                 4093 non-null   object
 8    24h Volume         4150 non-null   object
 9   Circulating Supply  4150 non-null   object
 10  Total Supply        4150 non-null   object
 11   Market Cap         4150 non-null   object
dtypes: int64(1), object(11)
memory usage: 389.2+ KB


In [27]:
df.columns

Index(['Rank', 'Coin Name', 'Symbol', ' Price ', '1h', '24h', '7d', '30d',
       ' 24h Volume ', 'Circulating Supply', 'Total Supply', ' Market Cap '],
      dtype='object')
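
The `df.columns` output above shows that `' Price '`, `' 24h Volume '` and `' Market Cap '` carry leading and trailing spaces. As a minimal sketch (using a toy frame, not the real dataset), the padding could be removed from every label in one step before any renaming:

```python
import pandas as pd

# Toy frame reproducing the padded headers seen in the df.columns output
toy = pd.DataFrame(columns=['Rank', ' Price ', ' 24h Volume ', ' Market Cap '])

# Strip leading/trailing whitespace from every column label
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

Stripping first means later lookups do not depend on remembering the exact padding of each header.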

In [28]:
# Rename columns: drop the padded names and make the units explicit.
# 'Rank' and 'Symbol' are already clean; 'Coin Name' and 'Total Supply' are left unchanged.
df = df.rename(columns={
    ' Price ':'Price($)',

    '1h':'Change_1h(%)',
    '24h':'Change_24h(%)',
    '7d':'Change_7d(%)',
    '30d':'Change_30d(%)',

    ' 24h Volume ':'Volume_24h($)',
    'Circulating Supply':'Circulating_Supply',
    ' Market Cap ':'Market_Cap($)'
})

In [29]:
df.head()

Unnamed: 0,Rank,Coin Name,Symbol,Price($),Change_1h(%),Change_24h(%),Change_7d(%),Change_30d(%),Volume_24h($),Circulating_Supply,Total Supply,Market_Cap($)
0,1,Bitcoin,BTC,36456.94,0.40%,-1.70%,1.00%,18.40%,"$22,801,222,945.00",19549806,21 Million,"$712,726,163,003.00"
1,2,Ethereum,ETH,2027.6,0.50%,1.40%,1.00%,20.70%,"$26,845,710,464.00",120249015,120 Million,"$243,488,187,281.00"
2,3,Tether,USDT,1.0,0.10%,-0.30%,-0.10%,-0.10%,"$47,122,466,339.00",88308652879,88.3 Billion,"$88,027,617,310.00"
3,4,BNB,BNB,231.63,-0.10%,-12.60%,-8.00%,5.40%,"$3,715,265,116.00",153856150,154 Million,"$35,716,332,862.00"
4,5,XRP,XRP,0.59,0.10%,-1.90%,-6.90%,12.10%,"$1,339,890,506.00",53718306475,100 Billion,"$31,863,926,051.00"


In [30]:
# Print each column name with its exact representation
for col in df.columns:
    print(repr(col))

'Rank'
'Coin Name'
'Symbol'
'Price($)'
'Change_1h(%)'
'Change_24h(%)'
'Change_7d(%)'
'Change_30d(%)'
'Volume_24h($)'
'Circulating_Supply'
'Total Supply'
'Market_Cap($)'


In [31]:
df.isnull().sum()/len(df)*100

Unnamed: 0,0
Rank,0.0
Coin Name,0.0
Symbol,0.0
Price($),0.0
Change_1h(%),2.289157
Change_24h(%),0.0
Change_7d(%),0.048193
Change_30d(%),1.373494
Volume_24h($),0.0
Circulating_Supply,0.0
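
Note that `isnull()` only counts true NaN values, so text placeholders such as `-` and `$-` (visible in the last rows of the table shown earlier) are reported as 0% missing at this stage. The sketch below, on toy data, shows one way to estimate the share of such placeholders per column; the `placeholders` set is an assumption for illustration, not an exhaustive list taken from the dataset.

```python
import pandas as pd

# Toy data mimicking the raw text columns; values copied from rows shown earlier
toy = pd.DataFrame({
    'Price': ['36456.94', '$-', '0.52'],
    '1h': ['0.40%', '-0.20%', '-'],
    'Market Cap': ['$712,726,163,003.00', '-', '-'],
})

# Hypothetical set of strings treated as "missing" for this sketch
placeholders = {'-', '$-', ''}

# Share of placeholder entries per column, in percent
placeholder_pct = toy.apply(lambda s: s.str.strip().isin(placeholders).mean() * 100)
print(placeholder_pct.round(1))
```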


## Cleaning Symbols ($, %, -, whitespace) and Converting Data Types

In [32]:
# Helper function to clean and convert columns to numeric
def clean_and_convert(series):
    if series.dtype == 'object':
        # Convert to string to ensure .str accessor works
        cleaned_series = series.astype(str)

        # Remove common non-numeric characters like '$', ',', '%'
        cleaned_series = cleaned_series.str.replace('$', '', regex=False)
        cleaned_series = cleaned_series.str.replace(',', '', regex=False)
        cleaned_series = cleaned_series.str.replace('%', '', regex=False)

        # Treat placeholder entries as missing: once '$' is stripped, values such as
        # '-' and '$-' reduce to dashes and/or whitespace, or to an empty string
        cleaned_series = cleaned_series.replace(r'^[\s-]*$', np.nan, regex=True)

        # Convert to numeric, coercing errors will turn unconvertible values into NaN
        return pd.to_numeric(cleaned_series, errors='coerce')
    else:
        return series # If already numeric, return as is

# Columns identified for cleaning and conversion
columns_to_process = [
    'Price($)',
    'Change_1h(%)',
    'Change_24h(%)',
    'Change_7d(%)',
    'Change_30d(%)',
    'Volume_24h($)',
    'Circulating_Supply',
    'Market_Cap($)'
]

for col in columns_to_process:
    df[col] = clean_and_convert(df[col])

df.head(10)

Unnamed: 0,Rank,Coin Name,Symbol,Price($),Change_1h(%),Change_24h(%),Change_7d(%),Change_30d(%),Volume_24h($),Circulating_Supply,Total Supply,Market_Cap($)
0,1,Bitcoin,BTC,36456.94,0.4,-1.7,1.0,18.4,22801220000.0,19549810.0,21 Million,712726200000.0
1,2,Ethereum,ETH,2027.6,0.5,1.4,1.0,20.7,26845710000.0,120249000.0,120 Million,243488200000.0
2,3,Tether,USDT,1.0,0.1,-0.3,-0.1,-0.1,47122470000.0,88308650000.0,88.3 Billion,88027620000.0
3,4,BNB,BNB,231.63,-0.1,-12.6,-8.0,5.4,3715265000.0,153856200.0,154 Million,35716330000.0
4,5,XRP,XRP,0.59,0.1,-1.9,-6.9,12.1,1339891000.0,53718310000.0,100 Billion,31863930000.0
5,6,USDC,USDC,1.0,0.0,-0.4,0.0,0.0,16458240000.0,24264590000.0,24.3 Billion,24120080000.0
6,7,Solana,SOL,54.97,0.6,-0.6,-11.4,87.9,2374980000.0,423104300.0,563 Million,23258090000.0
7,8,Lido Staked Ether,STETH,2023.81,0.4,1.5,0.9,20.7,17625540.0,9107771.0,9.11 Million,18435320000.0
8,9,Cardano,ADA,0.37,0.1,-1.6,0.7,40.1,373277000.0,34964740000.0,45 Billion,12903960000.0
9,10,Dogecoin,DOGE,0.07,0.1,-2.3,0.1,14.9,957225700.0,141935600000.0,142 Billion,10532910000.0
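
The output above shows that `Total Supply` is still text such as `21 Million`, because it was not in `columns_to_process`. If a numeric version were needed, a helper along the following lines could be used. This is only a sketch: `parse_supply` and the `UNITS` table are hypothetical names, and the `Thousand` and `Trillion` entries are assumptions rather than units confirmed in the dataset. The parsed values are also only as precise as the rounded text (for example, `88.3 Billion`).

```python
import numpy as np
import pandas as pd

# Multipliers for unit words; 'Million' and 'Billion' appear in the sample rows,
# 'Thousand' and 'Trillion' are assumptions
UNITS = {'Thousand': 1e3, 'Million': 1e6, 'Billion': 1e9, 'Trillion': 1e12}

def parse_supply(value):
    """Convert text such as '88.3 Billion' to a float; return NaN if unparseable."""
    parts = str(value).replace(',', '').split()
    if not parts:
        return np.nan
    try:
        number = float(parts[0])
    except ValueError:
        return np.nan  # covers placeholders such as '-'
    if len(parts) == 1:
        return number
    multiplier = UNITS.get(parts[1].capitalize())
    return number * multiplier if multiplier else np.nan

sample = pd.Series(['21 Million', '88.3 Billion', '-', '5 Billion'])
parsed = sample.map(parse_supply)
print(parsed)
```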


In [33]:
(df['Market_Cap($)']== 0).sum()

np.int64(0)

In [34]:
(df['Price($)']== 0).sum()

np.int64(0)

In [35]:
print(df.shape)
df.isnull().sum()

(4150, 12)


Unnamed: 0,0
Rank,0
Coin Name,0
Symbol,0
Price($),1726
Change_1h(%),547
Change_24h(%),452
Change_7d(%),454
Change_30d(%),509
Volume_24h($),484
Circulating_Supply,0


## Filling NaN values

In [36]:
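# Fill missing prices using Price = Market Cap / Circulating Supply.
# The small 0.0001 offset avoids division by zero when the supply is 0.
df['Price($)'] = df['Price($)'].fillna(
    df['Market_Cap($)'] / (df['Circulating_Supply']+ (0.0001))
)

# Fill missing market caps using Market Cap = Price * Circulating Supply
df['Market_Cap($)'] = df['Market_Cap($)'].fillna(
    df['Price($)'] * df['Circulating_Supply']
)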
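
One side effect of adding `0.0001` to the supply is that a coin with a known market cap but a supply of 0 would receive a very large imputed price (market cap multiplied by 10,000). A possible variant, sketched here on toy data and not what the notebook itself does, treats a zero supply as unknown so that the price stays NaN instead:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Price($)': [np.nan, np.nan, 2.0],
    'Market_Cap($)': [1000.0, 500.0, np.nan],
    'Circulating_Supply': [100.0, 0.0, 50.0],
})

# Treat a zero supply as unknown so the division yields NaN
# rather than an inflated price
supply = toy['Circulating_Supply'].replace(0, np.nan)

toy['Price($)'] = toy['Price($)'].fillna(toy['Market_Cap($)'] / supply)
toy['Market_Cap($)'] = toy['Market_Cap($)'].fillna(
    toy['Price($)'] * toy['Circulating_Supply']
)
print(toy)
```

With this variant the second toy row keeps a missing price, which a later `dropna` would remove rather than keep with an artificial value.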

In [37]:
# Fill missing percentage changes with 0 (i.e. treat them as "no change")
cols = ['Change_1h(%)','Change_24h(%)','Change_7d(%)','Change_30d(%)']
for col in cols:
    df[col] = df[col].fillna(0)
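
Filling with 0 makes a missing change indistinguishable from a genuine 0.0% change. If that distinction matters for later analysis, one option is to record a flag before filling. The sketch below uses toy data, and the `_was_missing` suffix is a hypothetical naming choice:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Change_1h(%)': [0.4, np.nan, 0.0],
                    'Change_24h(%)': [np.nan, -1.7, 1.4]})

cols = ['Change_1h(%)', 'Change_24h(%)']
for col in cols:
    # Hypothetical flag column: True where the value was originally missing
    toy[col + '_was_missing'] = toy[col].isna()
    toy[col] = toy[col].fillna(0)
print(toy)
```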

In [38]:
df['Volume_24h($)'] = df['Volume_24h($)'].fillna(df['Volume_24h($)'].median())

In [39]:
df.isnull().sum()

Unnamed: 0,0
Rank,0
Coin Name,0
Symbol,0
Price($),248
Change_1h(%),0
Change_24h(%),0
Change_7d(%),0
Change_30d(%),0
Volume_24h($),0
Circulating_Supply,0


In [40]:
df.select_dtypes(include='number').skew()

Unnamed: 0,0
Rank,0.0
Price($),33.30681
Change_1h(%),25.111441
Change_24h(%),11.909291
Change_7d(%),45.448084
Change_30d(%),63.679131
Volume_24h($),38.796482
Circulating_Supply,60.4141
Market_Cap($),53.396721
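
The large positive skewness values above indicate long right tails, which is one reason a median can be a more representative fill value than a mean for a column like `Volume_24h($)`. The toy series below (made-up numbers, not taken from the dataset) illustrates how a single extreme value pulls the mean far away from the typical value while the median stays put:

```python
import pandas as pd

# Four small values and one extreme value, to mimic a right-skewed column
volume = pd.Series([10.0, 20.0, 30.0, 40.0, 1_000_000.0])

print("mean:  ", volume.mean())
print("median:", volume.median())
print("skew:  ", volume.skew())
```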


In [41]:
# Let's check a sample of rows where Price is NaN
nan_price_rows = df[df['Price($)'].isna()]
print(f"Total rows with NaN price: {len(nan_price_rows)}")
print("\nSample of rows with NaN price:")
nan_price_rows[['Coin Name', 'Price($)', 'Market_Cap($)', 'Circulating_Supply']].head(10)

Total rows with NaN price: 248

Sample of rows with NaN price:


Unnamed: 0,Coin Name,Price($),Market_Cap($),Circulating_Supply
3740,FrogeX,,,411.0
3741,Mainstream For The Underground,,,788686598.0
3743,Bean Cash,,,0.0
3744,Ratecoin,,,0.0
3746,Casinocoin,,,0.0
3747,Yocoin,,,0.0
3748,LBRY Credits,,,0.0
3751,Bitmark,,,0.0
3752,Element,,,0.0
3754,Geocoin,,,0.0


In [43]:
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

In [44]:
df.shape

(3902, 12)
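
The resulting shape is consistent with the earlier counts: 4,150 rows minus the 248 rows with a missing price gives 3,902, so `drop_duplicates` removed no rows here. Because `dropna()` without arguments drops a row when any column is NaN, a more explicit variant restricts the check to the columns of interest. A sketch on toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Coin Name': ['A', 'B', 'B', 'C'],
    'Price($)': [1.0, 2.0, 2.0, np.nan],
    'Market_Cap($)': [10.0, 20.0, 20.0, np.nan],
})

before = len(toy)
# Drop only rows whose price is missing, then remove exact duplicate rows
cleaned = toy.dropna(subset=['Price($)']).drop_duplicates()
print(f"rows before: {before}, after: {len(cleaned)}")
```

Returning a new frame instead of using `inplace=True` also keeps the original available for comparing row counts.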

In [45]:
df.isnull().sum()

Unnamed: 0,0
Rank,0
Coin Name,0
Symbol,0
Price($),0
Change_1h(%),0
Change_24h(%),0
Change_7d(%),0
Change_30d(%),0
Volume_24h($),0
Circulating_Supply,0


In [46]:
nan_price_rows = df[df['Price($)'].isna()]
print(f"Total rows with NaN price: {len(nan_price_rows)}")
print("\nSample of rows with NaN price:")
nan_price_rows[['Coin Name', 'Price($)', 'Market_Cap($)', 'Circulating_Supply']].head(10)

Total rows with NaN price: 0

Sample of rows with NaN price:


Unnamed: 0,Coin Name,Price($),Market_Cap($),Circulating_Supply


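# Data cleaning complete

After these steps, 3,902 of the original 4,150 rows remain and no column contains missing values. `Total Supply` is still stored as text (for example, `21 Million`) because it was not included in the conversion step.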