# Due date: 23/07/2022 (corresponds to class 15)

# Database -> BTC_final

# Dictionary

- **Date**: date in YYYY-MM-DD format (datetime64[ns])
- **Price**: BTC closing price on that date (USD) (float) (continuous numeric variable)
- **Open**: BTC opening price on that date (USD) (float) (continuous numeric variable)
- **High**: highest BTC price on that date (USD) (float) (continuous numeric variable)
- **Low**: lowest BTC price on that date (USD) (float) (continuous numeric variable)
- **Vol.**: BTC volume (number of trades) on that date (float) (continuous numeric variable)
- **Percentage_diff**: percentage change in the BTC closing price on that date relative to the previous day (float) (continuous numeric variable)
- **Target**: 1 indicates that the BTC price rose on that date; 0 indicates that it fell or did not change (float) (treated as a categorical variable)

# Loading the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Data acquisition

In [2]:
def gdriveColabPath(sharing_url):
  file_id=sharing_url.split('/')[-2]
  dwn_url='https://drive.google.com/uc?id=' + file_id
  return dwn_url

## Acquiring BTC data

In [3]:
sharing_url = "https://drive.google.com/file/d/1pnStUmNaW2CK2jKc_yCLaj2nsL7nLLOC/view?usp=sharing"

In [4]:
dwn_url=gdriveColabPath(sharing_url)
BTC_raw =pd.read_csv(dwn_url, sep=",", thousands=",", decimal=".")

In [5]:
BTC_raw.head()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Change %
0,"Jul 20, 2022",23149.1,23412.0,23429.9,22965.9,290.21K,-1.12%
1,"Jul 19, 2022",23410.2,22529.3,23757.3,21581.8,308.91K,3.93%
2,"Jul 18, 2022",22525.8,20785.6,22714.9,20770.6,279.72K,8.37%
3,"Jul 17, 2022",20785.6,21209.8,21654.4,20755.2,132.81K,-2.00%
4,"Jul 16, 2022",21209.9,20825.2,21561.3,20484.4,136.89K,1.85%


# Data wrangling

## Discovery stage

The basic structure of the raw data (BTC_raw) is analyzed.

In [6]:
# Apply the info() method
BTC_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4386 entries, 0 to 4385
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      4386 non-null   object 
 1   Price     4386 non-null   float64
 2   Open      4386 non-null   float64
 3   High      4386 non-null   float64
 4   Low       4386 non-null   float64
 5   Vol.      4386 non-null   object 
 6   Change %  4386 non-null   object 
dtypes: float64(4), object(3)
memory usage: 240.0+ KB


Observations:
- There are 4386 records [0:4385].
- The data set has 7 columns ['Date', 'Price', 'Open', 'High', 'Low', 'Vol.', 'Change %'].
- The float64 variables contain no null values.
- The object-type variables ('Date', 'Vol.', 'Change %') also report no nulls, but they still need to be checked for placeholder values that stand in for missing data.

In [7]:
# Look for fully duplicated rows
print ("Cantidad de duplicados en el data set es: ", BTC_raw.duplicated().sum())

Cantidad de duplicados en el data set es:  0


In [8]:
# Check for missing values in the [Date] column by counting the records of each year
Years = ["2011","2012","2013","2014","2015","2016","2017","2018","2019","2020","2021","2022"]

for year in Years:
    print ("Year"+year+": ", BTC_raw ["Date"].str.contains(year).sum())

Year2011:  365
Year2012:  366
Year2013:  365
Year2014:  365
Year2015:  365
Year2016:  366
Year2017:  365
Year2018:  365
Year2019:  365
Year2020:  366
Year2021:  365
Year2022:  201


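Observations:
- 2012, 2016 and 2020 were leap years, hence their 366 records.
- Every record counted above carries one of the listed years, and the counts match the number of days in each year (2022 covers 201 days, through Jul 20). The list does not include 2010, so the remaining 167 of the 4386 records are not covered by this check. The column will be checked for errors again when it is converted to datetime64.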
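
A complementary way to look for missing days is to compare the observed dates against a full daily calendar. The sketch below uses a small hypothetical sample (not the real data) in the same "Mon DD, YYYY" format, with one day left out on purpose, and also counts how many daily records would be expected between the first and last dates shown in this notebook.

```python
import pandas as pd

# Hypothetical sample in the same format as [Date]; Jul 18 is left out on purpose
sample = pd.Series(["Jul 20, 2022", "Jul 19, 2022", "Jul 17, 2022", "Jul 16, 2022"])
dates = pd.DatetimeIndex(pd.to_datetime(sample, format="%b %d, %Y"))

# Every calendar day between the first and last observed date
full_range = pd.date_range(dates.min(), dates.max(), freq="D")

# Days present in the calendar but absent from the data
missing = full_range.difference(dates)
print(list(missing.strftime("%Y-%m-%d")))

# Daily records expected between 2010-07-18 and 2022-07-20 (both inclusive)
expected_records = len(pd.date_range("2010-07-18", "2022-07-20", freq="D"))
print(expected_records)
```

If the second figure equals the number of entries reported by `info()` and there are no duplicated dates, no day is missing from the series.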

In [9]:
# Find the letter (K, M, etc.) attached to each record of the [Vol.] column for later processing.
# A DataFrame is built to make the data easier to handle.
diverse_letters_volume = [] 
for n in BTC_raw ["Vol."]:
    
    diverse_letters_volume.append (n [-1])
    
BTC_vol_letter = pd.DataFrame (diverse_letters_volume)

In [10]:
# Count the records by the final character attached to the value.
BTC_vol_letter.value_counts()

K    4115
M     249
B      16
-       6
dtype: int64
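
The loop and the count above can also be written with the vectorized string accessor. This is a minimal sketch on a hypothetical sample of [Vol.] strings, not on the real column.

```python
import pandas as pd

# Hypothetical sample of [Vol.] strings, including the "-" placeholder
vol = pd.Series(["290.21K", "1.05M", "1.2B", "-", "15.05K"])

# .str[-1] takes the last character of every string without an explicit loop
suffix_counts = vol.str[-1].value_counts()
print(suffix_counts)
```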

In [11]:
# Locate the records that have no letter attached
BTC_raw ["Vol."].loc [BTC_raw ["Vol."] == "-"]

4043    -
4044    -
4045    -
4046    -
4047    -
4048    -
Name: Vol., dtype: object

In [12]:
# Inspect the rows whose [Vol.] value is "-"
BTC_raw [4043:4049]

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Change %
4043,"Jun 25, 2011",17.5,17.5,17.5,17.5,-,0.00%
4044,"Jun 24, 2011",17.5,17.5,17.5,17.5,-,0.00%
4045,"Jun 23, 2011",17.5,17.5,17.5,17.5,-,0.00%
4046,"Jun 22, 2011",17.5,17.5,17.5,17.5,-,0.00%
4047,"Jun 21, 2011",17.5,17.5,17.5,17.5,-,0.00%
4048,"Jun 20, 2011",17.5,17.5,17.5,17.5,-,0.00%


In [13]:
# Show the records before and after the gap so that the missing values can be replaced with an average of their neighbors.
BTC_raw ["Vol."] [4036:4056]

4036     19.45K
4037     33.28K
4038     34.96K
4039     21.04K
4040     24.40K
4041     31.45K
4042     15.05K
4043          -
4044          -
4045          -
4046          -
4047          -
4048          -
4049     30.18K
4050     35.54K
4051    108.62K
4052     49.20K
4053     27.71K
4054     36.16K
4055     73.42K
Name: Vol., dtype: object

In [14]:
# Serie1 and Serie2 are created to average the values that will replace "-"
Serie1 = BTC_raw ["Vol."] [4041:4043]
Serie1 = Serie1.str.replace("K","")

In [15]:
Serie2 = BTC_raw ["Vol."] [4049:4051]
Serie2 = Serie2.str.replace("K","")

In [16]:
# Mean of the 2 values immediately before the "-" values (rows 4041 and 4042)
S1 = Serie1.astype(float).mean()
S1

23.25

In [17]:
# Mean of the 2 values immediately after the "-" values (rows 4049 and 4050)
S2 = Serie2.astype(float).mean()
S2

32.86

In [18]:
# Average of the two means (the 2 values before and the 2 values after the "-" run)
Promedio = (S1+S2)/2
str(Promedio)

'28.055'
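
The same fill value can be derived in one pass by turning the placeholder into NaN first. The sketch below is only an illustration under assumptions: it uses a shortened copy of the values around the gap, assumes every neighbor ends in "K", and uses a window of 2 values on each side.

```python
import pandas as pd

# Shortened copy of the [Vol.] values around the gap (two "-" instead of six)
vol = pd.Series(["31.45K", "15.05K", "-", "-", "30.18K", "35.54K"])

# Strip the trailing K and coerce: "-" cannot be parsed and becomes NaN
numeric = pd.to_numeric(vol.str.rstrip("K"), errors="coerce")

# First and last positions of the gap
gap_idx = numeric.index[numeric.isna()]
first_gap, last_gap = gap_idx[0], gap_idx[-1]

# Mean of the 2 values before and the 2 values after the gap (.loc slices are inclusive)
before = numeric.loc[first_gap - 2:first_gap - 1].mean()
after = numeric.loc[last_gap + 1:last_gap + 2].mean()
fill_value = (before + after) / 2

filled = numeric.fillna(fill_value)
print(round(fill_value, 3))
```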

In [19]:
# Create a copy of the original data set as a backup
BTC_raw_2 = BTC_raw.copy()

In [20]:
# Replace the "-" values with the computed mean "28.055" and append the letter "K", since the volumes used for the mean carried that letter
# .loc assigns on the DataFrame itself (label slices are inclusive), which avoids chained assignment on a copy
BTC_raw_2.loc[4043:4048, "Vol."] = BTC_raw_2.loc[4043:4048, "Vol."].str.replace("-", "28.055K", regex=False)


In [21]:
# Check the change (applied successfully)
BTC_raw_2 [4041:4051]

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Change %
4041,"Jun 27, 2011",16.8,16.5,18.0,15.0,31.45K,1.82%
4042,"Jun 26, 2011",16.5,17.5,17.5,14.0,15.05K,-6.05%
4043,"Jun 25, 2011",17.5,17.5,17.5,17.5,28.055K,0.00%
4044,"Jun 24, 2011",17.5,17.5,17.5,17.5,28.055K,0.00%
4045,"Jun 23, 2011",17.5,17.5,17.5,17.5,28.055K,0.00%
4046,"Jun 22, 2011",17.5,17.5,17.5,17.5,28.055K,0.00%
4047,"Jun 21, 2011",17.5,17.5,17.5,17.5,28.055K,0.00%
4048,"Jun 20, 2011",17.5,17.5,17.5,17.5,28.055K,0.00%
4049,"Jun 19, 2011",17.5,16.9,18.9,16.9,30.18K,3.67%
4050,"Jun 18, 2011",16.9,15.7,17.0,15.1,35.54K,7.72%


## Structuring stage

In [22]:
# Create another backup in case it is needed
BTC_raw_3 = BTC_raw_2.copy()

In [23]:
BTC_raw_3.head()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Change %
0,"Jul 20, 2022",23149.1,23412.0,23429.9,22965.9,290.21K,-1.12%
1,"Jul 19, 2022",23410.2,22529.3,23757.3,21581.8,308.91K,3.93%
2,"Jul 18, 2022",22525.8,20785.6,22714.9,20770.6,279.72K,8.37%
3,"Jul 17, 2022",20785.6,21209.8,21654.4,20755.2,132.81K,-2.00%
4,"Jul 16, 2022",21209.9,20825.2,21561.3,20484.4,136.89K,1.85%


Volume - Stage 1

The trailing letter of each record is removed and the number is scaled according to the value of that letter (the values are kept as str for now).

In [24]:
# Locate the records ending in the letter K
# Remove the letter K
# Convert the value to float
# Multiply by 1,000 (K = 1,000)
# Convert back to str so that the remaining values can still be processed
# .loc on the DataFrame avoids chained assignment on a copy
mask_K = BTC_raw_3["Vol."].str.contains("K")
BTC_raw_3.loc[mask_K, "Vol."] = (BTC_raw_3.loc[mask_K, "Vol."].str.replace("K", "").astype(float) * 1000).astype(str)


In [25]:
# Locate the records ending in the letter M
# Remove the letter M
# Convert the value to float
# Multiply by 1,000,000 (M = 1,000,000)
# Convert back to str so that the remaining values can still be processed
# .loc on the DataFrame avoids chained assignment on a copy
mask_M = BTC_raw_3["Vol."].str.contains("M")
BTC_raw_3.loc[mask_M, "Vol."] = (BTC_raw_3.loc[mask_M, "Vol."].str.replace("M", "").astype(float) * 1000000).astype(str)


In [26]:
# Locate the records ending in the letter B
# Remove the letter B
# Convert the value to float
# Multiply by 1,000,000,000 (B = 1,000,000,000)
# Convert back to str so that the remaining values can still be processed
# .loc on the DataFrame avoids chained assignment on a copy
mask_B = BTC_raw_3["Vol."].str.contains("B")
BTC_raw_3.loc[mask_B, "Vol."] = (BTC_raw_3.loc[mask_B, "Vol."].str.replace("B", "").astype(float) * 1000000000).astype(str)


In [27]:
# Inspect some records to confirm the change
BTC_raw_3["Vol."].value_counts().head(20)

28055.0      6
1050000.0    6
1040000.0    5
14730.0      5
1180000.0    5
1110000.0    4
5030.0       4
74360.0      4
9340000.0    4
71370.0      3
44460.0      3
15170.0      3
1260.0       3
27600.0      3
20120.0      3
40640.0      3
61720.0      3
4740.0       3
1450.0       3
49910.0      3
Name: Vol., dtype: int64
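
The three suffix conversions can also be expressed as one mapping from letter to multiplier. The sketch below is an alternative under assumptions, not the method used in this notebook: `SUFFIX_MULTIPLIER` and `parse_volume` are hypothetical names, the sample values are illustrative, and any value without a known suffix (such as "-") comes out as NaN.

```python
import pandas as pd

# Hypothetical mapping from the trailing letter to its multiplier
SUFFIX_MULTIPLIER = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_volume(series):
    """Convert strings such as '290.21K' to floats; unknown suffixes give NaN."""
    # Everything except the last character is the numeric part
    number = pd.to_numeric(series.str[:-1], errors="coerce")
    # The last character selects the multiplier (NaN when it is not in the mapping)
    factor = series.str[-1].map(SUFFIX_MULTIPLIER)
    return number * factor

vol = pd.Series(["290.21K", "1.05M", "1.2B", "-"])
parsed = parse_volume(vol)
print(parsed)
```

Because the result is already numeric, this variant would not need the intermediate conversion back to str.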

Percentage - Stage 1

The "%" signs are removed.

In [28]:
# Create a new backup
BTC_raw_4 = BTC_raw_3.copy()

In [29]:
# Locate and remove the "%" sign
BTC_raw_4 ["Change %"] = BTC_raw_4 ["Change %"].str.replace ("%","")

In [30]:
BTC_raw_4 ["Change %"].head()

0    -1.12
1     3.93
2     8.37
3    -2.00
4     1.85
Name: Change %, dtype: object
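
Stripping the sign and converting to float can be done in a single step. This is a minimal sketch on a few values copied from the output above; `rstrip` only removes a trailing "%", so the rest of the string is left untouched.

```python
import pandas as pd

# A few [Change %] values as they appear in the raw data
change = pd.Series(["-1.12%", "3.93%", "8.37%", "-2.00%"])

# Remove the trailing "%" and convert the remaining text to float
change_float = change.str.rstrip("%").astype(float)
print(change_float)
```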

Date - Stage 1

The [Date] column is converted to datetime64[ns].

In [31]:
# Create a new backup
BTC_raw_5 = BTC_raw_4.copy()

In [32]:
# Convert the column type to datetime64[ns]
BTC_raw_5 ['Date'] = pd.to_datetime(BTC_raw_5 ['Date'])

In [33]:
BTC_raw_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4386 entries, 0 to 4385
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      4386 non-null   datetime64[ns]
 1   Price     4386 non-null   float64       
 2   Open      4386 non-null   float64       
 3   High      4386 non-null   float64       
 4   Low       4386 non-null   float64       
 5   Vol.      4386 non-null   object        
 6   Change %  4386 non-null   object        
dtypes: datetime64[ns](1), float64(4), object(2)
memory usage: 240.0+ KB
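
To re-check the column for malformed dates during the conversion, one option is to pass an explicit format and coerce failures to NaT. The sketch below uses a hypothetical sample with one deliberately invalid entry and assumes the "Mon DD, YYYY" format seen in the raw data.

```python
import pandas as pd

# Hypothetical sample; the last entry is deliberately invalid
raw = pd.Series(["Jul 20, 2022", "Jun 25, 2011", "not a date"])

# With errors="coerce", entries that do not match the format become NaT
parsed = pd.to_datetime(raw, format="%b %d, %Y", errors="coerce")

# Number of entries that could not be parsed
n_bad = int(parsed.isna().sum())
print(n_bad)
```

A count of 0 on the real column would be consistent with the 4386 non-null datetime64[ns] values reported by `info()`.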


Volume - Stage 2

The [Vol.] column is converted to float (it could also be converted to int).

In [34]:
BTC_raw_5 ['Vol.'] = BTC_raw_5 ['Vol.'].astype(float)

In [35]:
BTC_raw_5 ['Vol.'].dtype

dtype('float64')

Percentage - Stage 2

The values of the [Change %] variable are converted to float.

In [36]:
BTC_raw_5 ['Change %'] = BTC_raw_5 ['Change %'].astype(float)

In [37]:
BTC_raw_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4386 entries, 0 to 4385
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      4386 non-null   datetime64[ns]
 1   Price     4386 non-null   float64       
 2   Open      4386 non-null   float64       
 3   High      4386 non-null   float64       
 4   Low       4386 non-null   float64       
 5   Vol.      4386 non-null   float64       
 6   Change %  4386 non-null   float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 240.0 KB


Date - Stage 2

The database is sorted in ascending order by the [Date] variable.

In [38]:
BTC_raw_6 = BTC_raw_5.sort_values("Date")
BTC_raw_6.head()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Change %
4385,2010-07-18,0.1,0.0,0.1,0.1,80.0,0.0
4384,2010-07-19,0.1,0.1,0.1,0.1,570.0,0.0
4383,2010-07-20,0.1,0.1,0.1,0.1,260.0,0.0
4382,2010-07-21,0.1,0.1,0.1,0.1,580.0,0.0
4381,2010-07-22,0.1,0.1,0.1,0.1,2160.0,0.0


In [39]:
# Reset the index of the data set so that it starts at 0
BTC_raw_7 = BTC_raw_6.reset_index()

In [40]:
# Drop the index column created by default
BTC_raw_8 = BTC_raw_7.drop (["index"], axis=1)

In [41]:
BTC_raw_8.head()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Change %
0,2010-07-18,0.1,0.0,0.1,0.1,80.0,0.0
1,2010-07-19,0.1,0.1,0.1,0.1,570.0,0.0
2,2010-07-20,0.1,0.1,0.1,0.1,260.0,0.0
3,2010-07-21,0.1,0.1,0.1,0.1,580.0,0.0
4,2010-07-22,0.1,0.1,0.1,0.1,2160.0,0.0
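
Sorting, resetting the index and discarding the old one can be chained in a single expression with `reset_index(drop=True)`. This is a minimal sketch on a hypothetical three-row frame that reuses prices from the output above.

```python
import pandas as pd

# Hypothetical frame in descending date order, as in the raw data
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-07-20", "2022-07-19", "2022-07-18"]),
    "Price": [23149.1, 23410.2, 22525.8],
})

# drop=True discards the old index instead of adding it as an "index" column
ordered = df.sort_values("Date").reset_index(drop=True)
print(ordered)
```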


Percentage - Stage 3

The [Change %] variable is renamed to something easier to handle (no space and no % symbol).

In [42]:
# Rename the variable and create the BTC database
BTC = BTC_raw_8.rename(columns={'Change %':'Percentage_diff'})

In [43]:
BTC.tail()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Percentage_diff
4381,2022-07-16,21209.9,20825.2,21561.3,20484.4,136890.0,1.85
4382,2022-07-17,20785.6,21209.8,21654.4,20755.2,132810.0,-2.0
4383,2022-07-18,22525.8,20785.6,22714.9,20770.6,279720.0,8.37
4384,2022-07-19,23410.2,22529.3,23757.3,21581.8,308910.0,3.93
4385,2022-07-20,23149.1,23412.0,23429.9,22965.9,290210.0,-1.12


The [Target] column, which will be used in the ML models, is created:
* Value 0 is assigned when Percentage_diff <= 0
* Value 1 is assigned when Percentage_diff > 0

In [44]:
# Create the [Target] column
BTC ["Target"] = BTC ["Percentage_diff"]

In [45]:
BTC.tail()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Percentage_diff,Target
4381,2022-07-16,21209.9,20825.2,21561.3,20484.4,136890.0,1.85,1.85
4382,2022-07-17,20785.6,21209.8,21654.4,20755.2,132810.0,-2.0,-2.0
4383,2022-07-18,22525.8,20785.6,22714.9,20770.6,279720.0,8.37,8.37
4384,2022-07-19,23410.2,22529.3,23757.3,21581.8,308910.0,3.93,3.93
4385,2022-07-20,23149.1,23412.0,23429.9,22965.9,290210.0,-1.12,-1.12


In [46]:
# Differences greater than 0 are set to 1
# Differences less than or equal to 0 are set to 0
BTC.loc[BTC.Percentage_diff>0, "Target"]=1
BTC.loc[BTC.Percentage_diff<=0, "Target"]=0

In [47]:
BTC.tail()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Percentage_diff,Target
4381,2022-07-16,21209.9,20825.2,21561.3,20484.4,136890.0,1.85,1.0
4382,2022-07-17,20785.6,21209.8,21654.4,20755.2,132810.0,-2.0,0.0
4383,2022-07-18,22525.8,20785.6,22714.9,20770.6,279720.0,8.37,1.0
4384,2022-07-19,23410.2,22529.3,23757.3,21581.8,308910.0,3.93,1.0
4385,2022-07-20,23149.1,23412.0,23429.9,22965.9,290210.0,-1.12,0.0
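
The same rule can be written as a single boolean comparison, without first copying the column. The sketch below uses a few hypothetical percentage values, including a 0.0 to show that an unchanged price maps to 0.

```python
import pandas as pd

# Hypothetical Percentage_diff values, including an unchanged day (0.0)
pct = pd.Series([1.85, -2.0, 8.37, 0.0, -1.12])

# True where the price rose; casting to float gives 1.0 / 0.0 as in [Target]
target = (pct > 0).astype(float)
print(target)
```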


### Exporting the database

In [49]:
# Raw string so that the backslashes in the Windows path are not read as escape sequences
BTC.to_csv (r"D:\Luciano\Programación\Data science\Trabajo final\Bases de datos para modelos\BTC_final.csv", index=False, sep=";")
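
Since the file is written with `sep=";"`, it has to be read back with the same separator. The sketch below illustrates the round trip on a hypothetical two-row frame written to a temporary directory rather than to the path above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame standing in for BTC
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-07-19", "2022-07-20"]),
    "Price": [23410.2, 23149.1],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "BTC_final.csv"
    # Same options as the export cell: no index column, ";" as separator
    df.to_csv(out_path, index=False, sep=";")
    # Reading back needs the same separator; parse_dates restores the datetime type
    restored = pd.read_csv(out_path, sep=";", parse_dates=["Date"])

print(restored.dtypes)
```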