# Phase 1: Data Management

## Importing libraries

In [21]:
import pandas as pd

## Dataset description
### (1) Imported data:

Cryptocurrency web scraping involves extracting data related to digital currencies from various online sources such as cryptocurrency exchanges, news websites, forums, and social media platforms.  
This data can encompass a wide range of information, including real-time price data, trading volumes, market sentiment, blockchain statistics, ICO details, and more.

Cryptocurrency web scraping is utilized by traders, analysts, researchers, and developers to gather insights, conduct market research, develop trading strategies, build financial models, and create data-driven applications.  
By collecting and analyzing large volumes of cryptocurrency data, stakeholders can make informed decisions and stay up-to-date with the rapidly evolving crypto market landscape.

- Dataset URL: https://www.kaggle.com/datasets/bishop36/crypto-data

- Shape: 21347 rows × 12 columns

- Variables:

  * float64 (7 columns)
      * price: the current price of the cryptocurrency
      * volume24hrs: the trading volume of the cryptocurrency over the last 24 hours
      * circulatingsupply: the amount of the cryptocurrency currently circulating in the market
      * maxsupply: the maximum amount of the cryptocurrency that will ever be created or mined
      * totalsupply: the total supply of the cryptocurrency that currently exists
      * Unnamed: 10: empty (no non-null values)
      * Unnamed: 11: empty (no non-null values)

  * object (5 columns)
      * name: name of the cryptocurrency **(20807 unique values)**
      * abbr: abbreviation or symbol representing the cryptocurrency **(16117 unique values)**
      * crypturl: URL of the web page associated with the cryptocurrency **(21341 unique values)**
      * marketcap: the market capitalization of the cryptocurrency **(calculated as price multiplied by circulating supply)**
      * date_taken: date or timestamp at which the data was collected **(8114 unique values)**
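
The shape, per-dtype column counts, and unique-value counts listed above can be recomputed directly with pandas. The following is a minimal sketch on a small toy frame (the values and the `empty_col` name are made up for illustration, not taken from the dataset); the same calls would apply to the loaded DataFrame.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy_df = pd.DataFrame({
    "name": ["Bitcoin", "Ethereum", "Bitcoin"],
    "abbr": ["BTC", "ETH", "BTC"],
    "price": [62732.67, 3027.22, 62800.00],
    "empty_col": [float("nan")] * 3,
})

# Shape is always (rows, columns).
n_rows, n_cols = toy_df.shape

# Number of columns per dtype, with dtype names as plain strings.
dtype_counts = toy_df.dtypes.astype(str).value_counts()

# Distinct values in each non-numeric (text) column.
unique_counts = toy_df.select_dtypes(exclude="number").nunique()

print(n_rows, n_cols)
print(dtype_counts)
print(unique_counts)
```

Comparing `nunique()` with the row count is a quick way to spot repeated names or symbols, as the counts above suggest for `name` and `abbr`.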

In [22]:
input_df = pd.read_csv("../data/input/Crypto.csv")
input_df.info()
input_df.head(5) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21347 entries, 0 to 21346
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               21347 non-null  object 
 1   abbr               21347 non-null  object 
 2   crypturl           21347 non-null  object 
 3   price              21347 non-null  float64
 4   volume24hrs        21347 non-null  float64
 5   marketcap          21347 non-null  object 
 6   circulatingsupply  21347 non-null  float64
 7   maxsupply          21347 non-null  float64
 8   totalsupply        21347 non-null  float64
 9   date_taken         21347 non-null  object 
 10  Unnamed: 10        0 non-null      float64
 11  Unnamed: 11        0 non-null      float64
dtypes: float64(7), object(5)
memory usage: 2.0+ MB


Unnamed: 0,name,abbr,crypturl,price,volume24hrs,marketcap,circulatingsupply,maxsupply,totalsupply,date_taken,Unnamed: 10,Unnamed: 11
0,Bitcoin,BTC,https://crypto.com/price/bitcoin,62732.67,272076.9766,251418.65,3065439.0,876052496.2,437497786.1,09-08-2007,,
1,Ethereum,ETH,https://crypto.com/price/ethereum,3027.22,981408.7883,502924.62,2326224.0,444082823.2,705224728.4,13-12-2007,,
2,Tether,USDT,https://crypto.com/price/tether,0.9999,944792.0394,815637.23,2207000.0,604510133.4,716566396.6,03-03-2004,,
3,BNB,BNB,https://crypto.com/price/bnb,593.85,556334.533,323746.05,240014.7,150451882.8,685308596.6,06-06-2001,,
4,Solana,SOL,https://crypto.com/price/solana,152.59,866929.0575,805143.57,2957790.0,284280494.2,647877254.0,06-08-2019,,
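
The `info()` output lists `marketcap` as `object` although it holds a numeric quantity. The source does not say why; one possible cause (an assumption) is that some entries are not valid numbers. A minimal sketch, on made-up toy rows, of coercing the column to a number and checking it against the stated definition (price multiplied by circulating supply):

```python
import pandas as pd

# Toy rows (illustrative values); "marketcap" arrives as text.
toy_df = pd.DataFrame({
    "price": [2.0, 10.0, 0.5],
    "circulatingsupply": [100.0, 5.0, 40.0],
    "marketcap": ["200.0", "49.0", "n/a"],
})

# Coerce to float; anything that is not a number becomes NaN instead of raising.
toy_df["marketcap_num"] = pd.to_numeric(toy_df["marketcap"], errors="coerce")

# Compare with the definition: price * circulating supply.
expected = toy_df["price"] * toy_df["circulatingsupply"]
rel_diff = (toy_df["marketcap_num"] - expected).abs() / expected

# 1% tolerance is an arbitrary choice; rows with NaN compare as False.
toy_df["matches_definition"] = rel_diff < 0.01

print(toy_df)
```

The `marketcap_num` and `matches_definition` column names are hypothetical. Counting the rows where `matches_definition` is `False` gives a rough idea of how far the column can be trusted before it is used in any analysis.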


### (2) Data cleaning:
- Identifying null values:

In [23]:
empty_values = input_df.isna().sum()
print(empty_values)

name                     0
abbr                     0
crypturl                 0
price                    0
volume24hrs              0
marketcap                0
circulatingsupply        0
maxsupply                0
totalsupply              0
date_taken               0
Unnamed: 10          21347
Unnamed: 11          21347
dtype: int64
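
Rather than reading the column names off the printed counts, the entirely empty columns can also be detected programmatically. A minimal sketch on a toy frame (column names and values are illustrative only):

```python
import pandas as pd

# Toy frame: "price" is partially filled, "Unnamed: 2" is entirely empty.
toy_df = pd.DataFrame({
    "name": ["Bitcoin", "Ethereum"],
    "price": [62732.67, None],
    "Unnamed: 2": [None, None],
})

# Names of the columns in which every value is missing.
all_null_cols = toy_df.columns[toy_df.isna().all()].tolist()

# Drop only those columns; partially filled columns such as "price" are kept.
cleaned_df = toy_df.dropna(axis="columns", how="all")

print(all_null_cols)
print(cleaned_df.columns.tolist())
```

With `how="all"` a column is removed only when it has no values at all, so this approach does not depend on the exact names pandas assigns to unnamed columns.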


- Dropping the two empty columns and converting 'date_taken' from DD-MM-YYYY to ISO date strings (YYYY-MM-DD):

In [24]:
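# Drop the two columns that contain no data.
input_df = input_df.drop(columns=['Unnamed: 10', 'Unnamed: 11'])
# Parse day-month-year strings in one vectorized call, then store them as 'YYYY-MM-DD' strings.
input_df['date_taken'] = pd.to_datetime(input_df['date_taken'], format='%d-%m-%Y').dt.strftime('%Y-%m-%d')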
input_df.head(5)

Unnamed: 0,name,abbr,crypturl,price,volume24hrs,marketcap,circulatingsupply,maxsupply,totalsupply,date_taken
0,Bitcoin,BTC,https://crypto.com/price/bitcoin,62732.67,272076.9766,251418.65,3065439.0,876052496.2,437497786.1,2007-08-09
1,Ethereum,ETH,https://crypto.com/price/ethereum,3027.22,981408.7883,502924.62,2326224.0,444082823.2,705224728.4,2007-12-13
2,Tether,USDT,https://crypto.com/price/tether,0.9999,944792.0394,815637.23,2207000.0,604510133.4,716566396.6,2004-03-03
3,BNB,BNB,https://crypto.com/price/bnb,593.85,556334.533,323746.05,240014.7,150451882.8,685308596.6,2001-06-06
4,Solana,SOL,https://crypto.com/price/solana,152.59,866929.0575,805143.57,2957790.0,284280494.2,647877254.0,2019-08-06
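
After this step `date_taken` still holds strings, only in a different layout. If later steps need date arithmetic or the `.dt` accessor, an alternative is to keep the parsed datetime values. A minimal sketch, assuming some entries might not match the expected DD-MM-YYYY pattern (the third toy value below is deliberately invalid):

```python
import pandas as pd

# Two dates in the dataset's DD-MM-YYYY layout plus one invalid entry.
dates = pd.Series(["09-08-2007", "13-12-2007", "not a date"])

# Parse to a datetime column; unparseable entries become NaT instead of raising.
parsed = pd.to_datetime(dates, format="%d-%m-%Y", errors="coerce")

# Datetime values sort chronologically; missing values go last by default.
ordered = parsed.sort_values()

print(parsed)
print(parsed.isna().sum(), "unparseable value(s)")
```

Formatting with `strftime` can then be deferred to the moment the data is exported or displayed.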
