# Data Wrangling & Exploratory Data Analysis

## Cleaning of scraped data

This section cleans the raw data that was scraped from the following websites:
- [Yahoo Finance](yahoo.finance/cryptocurrencies)
- [CoinMarketCap](coinmarketcap.com)
- [Coindesk](coindesk.com)

For more context, see 👉 [main.py](https://github.com/mr-emreerturk/crypto_scraping_project/blob/main/main.py) and the [README](https://github.com/mr-emreerturk/crypto_scraping_project/blob/main/README.md).

### Imports & Constants


In [314]:
# Import all necessary modules
import pandas as pd
import numpy as np
import h5py

# Define the file path of the H5 file as a constant
PATH = "/Users/emre/Documents/GitHub/crypto_scraping_project/data.h5"

### Loading files into dataframes


In [315]:
# https://stackoverflow.com/questions/40472912/hdf5-file-to-pandas-dataframe
# Read each dataset into a dataframe by converting it to a NumPy array first.
# Opening the file once in a context manager ensures it is closed afterwards.
with h5py.File(PATH, "r") as h5_file:
    df_prices = pd.DataFrame(np.array(h5_file["yahoo_prices"]))
    df_news = pd.DataFrame(np.array(h5_file["news_data"]))
    df_dev = pd.DataFrame(np.array(h5_file["dev_data"]))

### Convert all entries to strings

In [316]:
df_prices = df_prices.astype("str")
df_dev = df_dev.astype("str")
df_news = df_news.astype("str")

### Remove the imported substrings `b'` and `'`

In [None]:
# Eliminate the imported byte-literal residue (the b' prefix and the quote characters)
df_list_mask = [df_prices, df_dev, df_news]  # list of all dataframes
for df in df_list_mask:  # loop through all dataframes
    for col in df.columns:  # loop through the columns of each dataframe
        # Vectorized replacement; avoids chained indexing such as df[col][row] = ...
        df[col] = (
            df[col].str.replace("b'", "", regex=False).str.replace("'", "", regex=False)
        )
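
The `b'...'` wrappers most likely appear because the HDF5 datasets hold byte strings, and casting bytes to `str` keeps the literal prefix and quotes. As a hedged alternative to removing those substrings afterwards, the bytes can be decoded before any cast. The sketch below uses a small hypothetical array (`raw`) in place of the real datasets and assumes UTF-8 encoded content.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an array read from the H5 file (byte strings)
raw = np.array(
    [[b"2022-11-01", b"Bitcoin USD"], [b"2022-11-01", b"Ethereum USD"]]
)

df = pd.DataFrame(raw)

# Decode every byte-string column into regular Python strings (UTF-8 assumed)
df = df.apply(lambda col: col.str.decode("utf-8"))

print(df.iloc[0, 1])
```

Decoding first also keeps apostrophes that are part of the data, which a blanket removal of `'` would delete.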

### Change column names

In [317]:
# Change column names
df_prices = df_prices.rename(
    {
        0: "date",
        1: "name",
        2: "price",
        3: "market_cap",
        4: "circulating_supply",
    },
    axis=1,
)

df_dev = df_dev.rename(
    {
        0: "date",
        1: "name",
        2: "github_commits",
        3: "github_stars",
        4: "github_forks",
        5: "github_contributors",
        6: "github_followers",
        7: "twitter_followers",
        8: "reddit_members",
    },
    axis=1,
)


df_news = df_news.rename(
    {
        0: "date",
        1: "name",
        2: "amount_news_articles",
    },
    axis=1,
)

### Standardize crypto names in `df_prices`

#### Filter to the cryptocurrencies used in the other dataframes

In [None]:
# Drop the trailing "USD" (last 3 characters), lowercase the name, strip leading and
# trailing whitespace, and replace the remaining spaces between words with hyphens
df_prices.name = df_prices.name.str[:-3].str.lower().str.strip().str.replace(" ", "-")

top_crypto = [  # Cryptocurrencies used in the other datasets
    "bitcoin",
    "ethereum",
    "tether",
    "usd-coin",
    "bnb",
    "xrp",
    "binance-usd",
    "cardano",
    "solana",
    "dogecoin",
    "polygon",
    "tron",
]

# Keep only those rows; .copy() avoids SettingWithCopyWarning in later column assignments
df_prices = df_prices[df_prices.name.isin(top_crypto)].copy()
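
To make the effect of this chain concrete, here is a minimal sketch on a handful of sample names. The input strings are assumptions about how the scraped names look (a display name followed by "USD") and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical raw names in the assumed "<Name> USD" format
names = pd.Series(["Bitcoin USD", "USD Coin USD", "Binance USD USD", "Shiba Inu USD"])

# Same steps as above: drop the last 3 characters, lowercase, strip, hyphenate
slugs = (
    names.str[:-3]
    .str.lower()
    .str.strip()
    .str.replace(" ", "-", regex=False)
)

# Keep only names that also occur in the other datasets (shortened list for the example)
top_crypto = ["bitcoin", "usd-coin", "binance-usd"]
kept = slugs[slugs.isin(top_crypto)]

print(slugs.tolist())  # ['bitcoin', 'usd-coin', 'binance-usd', 'shiba-inu']
print(kept.tolist())   # ['bitcoin', 'usd-coin', 'binance-usd']
```

Because only the last three characters are removed, names that themselves contain "USD" (such as "USD Coin") keep that part and end up as `usd-coin`.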

### Correction of dtypes

In [304]:
# Change dtypes of columns

# Change date dtype in df_prices
df_prices.date = pd.to_datetime(df_prices.date)

df_prices.price = df_prices.price.apply(
    lambda x: float(x.replace(",", ""))
)  # Remove the thousands separator and convert to float

df_prices.market_cap = df_prices.market_cap.apply(
    lambda x: float(x.strip("B")) * 1_000_000_000
)  # Convert the "B" (billion) suffix to a plain number in the market_cap column

df_prices.circulating_supply = df_prices.circulating_supply.apply(
    lambda x: float(x.strip("B")) * 1_000_000_000
    if (x[-1] == "B")
    else float(x.strip("M")) * 1_000_000
)  # Convert the "B" (billion) and "M" (million) suffixes to plain numbers in the circulating_supply column

# Change date in df_dev
df_dev.date = pd.to_datetime(df_dev.date)
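# Replace '--' placeholders with NaN
for x in df_dev.columns:
    df_dev[x] = df_dev[x].replace("--", np.nan, regex=True)
# Convert the metric columns to float and round to whole numbers.
# Assigning via df[cols] replaces the columns, so the float dtype is kept;
# in recent pandas versions a .loc[:, ...] assignment can leave them as object dtype.
metric_cols = df_dev.loc[:, "github_commits":"reddit_members"].columns
df_dev[metric_cols] = df_dev[metric_cols].astype(float).round(0)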

# Change date in df_news
df_news.date = pd.to_datetime(df_news.date)
# Change dtype of amount_news_articles to int
df_news.amount_news_articles = df_news.amount_news_articles.astype(int)
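
The suffix handling above covers only "B" and "M". As a hedged generalization, the sketch below parses abbreviated numbers with a lookup table. The helper name `parse_abbreviated`, the `MULTIPLIERS` table, and the additional "K" and "T" suffixes are assumptions for illustration and do not come from the scraped data.

```python
import math

# Hypothetical multiplier table; "K" and "T" are assumed, only "M" and "B" occur above
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_abbreviated(value: str) -> float:
    """Convert strings such as '1.5B', '250M' or '1,234.5' to a float."""
    text = value.strip().replace(",", "")
    if not text or text == "--":
        # Treat empty strings and the '--' placeholder as missing values
        return math.nan
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    return float(text)


print(parse_abbreviated("1.5B"))
print(parse_abbreviated("250M"))
print(parse_abbreviated("1,234.5"))
```

A helper like this could be passed to `Series.apply` for both `market_cap` and `circulating_supply`, so a value with an unexpected suffix raises a `ValueError` instead of being scaled incorrectly.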

## EDA