<a href="https://colab.research.google.com/github/luiscmc10/luiscmc10-Consumer_Spending_Prediction/blob/main/Consumer_Spending_Prediction_Layout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**PROBLEMA DE NEGOCIO**


---




La necesidad de prever y optimizar el gasto de sus usuarios ha llevado a una empresa de comercio electrónico a buscar soluciones innovadoras. Como científicos de datos, hemos sido convocados para desarrollar un modelo de machine learning que pueda predecir con precisión cuánto gastará un usuario al visitar dicho sitio web.

### **Tus tareas principales serán:**

**1. Preprocesamiento de Datos:** Importar correctamente y analizar y comprender el conjunto de datos proporcionado, realizar limpieza de datos, eliminar atributos que no aportan valor y manejar valores faltantes.

**2. Exploración y Feature Engineering:** Realizar visualizaciones para entender las relaciones entre las variables y seleccionar las características relevantes, identificar variables llaves, codificación de variables categóricas y normalización/escalado de datos.

**3. Construcción de Modelos:** Experimentar con algunos algoritmos de machine learning como Linear Regression, Decision Tree Regressor, Random Forest Regressor, entre otros.

**4. Evaluación y Selección del Modelo:** Evaluar los modelos utilizando métricas como el error cuadrático medio (MSE), la raíz cuadrada del error cuadrático medio (RMSE) y el coeficiente de determinación (R²). Seleccionar el modelo con el mejor rendimiento para la predicción del gasto de los usuarios.

#**1. Configuración del Ambiente**


---




In [1]:
!pip install wget
import wget
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from scipy.stats import randint
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.preprocessing import FunctionTransformer
from joblib import dump, load
global df_traffic, resultados, modelo, modelo_clasificacion

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... [?25l[?25hdone
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=4b554f704eb59407717d89f96c7f9013db0cdfc800a3de7a275891b08c77556f
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


#**2. Preprocesamiento de Datos**


---


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/ElProfeAlejo/Bootcamp_Databases/main/traffic_site.csv')
df.head(10)

Unnamed: 0,channelGrouping,date,device,fullVisitorId,geoNetwork,sessionId,socialEngagementType,totals,trafficSource,visitId,visitNumber,visitStartTime
0,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",9674781571160116268,"{""continent"": ""Asia"", ""subContinent"": ""Southea...",9674781571160116268_1472804607,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472804607,1,1472804607
1,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",8590648239310839049,"{""continent"": ""Europe"", ""subContinent"": ""Easte...",8590648239310839049_1472835928,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472835928,1,1472835928
2,Affiliates,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",9772828344252850532,"{""continent"": ""Americas"", ""subContinent"": ""Sou...",9772828344252850532_1472856802,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""Data Share Promo"", ""source"": ""Pa...",1472856802,1,1472856802
3,Organic Search,20160902,"{""browser"": ""Safari"", ""browserVersion"": ""not a...",1350700416054916432,"{""continent"": ""Americas"", ""subContinent"": ""Nor...",1350700416054916432_1472879649,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472879649,2,1472879649
4,Organic Search,20160902,"{""browser"": ""Safari"", ""browserVersion"": ""not a...",1350700416054916432,"{""continent"": ""Americas"", ""subContinent"": ""Nor...",1350700416054916432_1472829671,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""5"", ""pageviews"": ""4"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472829671,1,1472829671
5,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",9874029322760071744,"{""continent"": ""Americas"", ""subContinent"": ""Sou...",9874029322760071744_1472825224,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""7"", ""pageviews"": ""5"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472825224,1,1472825224
6,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",9443838101670317185,"{""continent"": ""Americas"", ""subContinent"": ""Nor...",9443838101670317185_1472832130,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""19"", ""pageviews"": ""15""}","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472832130,2,1472832130
7,Direct,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",4720137602861399516,"{""continent"": ""Americas"", ""subContinent"": ""Nor...",4720137602861399516_1472826545,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""27"", ""pageviews"": ""19...","{""campaign"": ""(not set)"", ""source"": ""(direct)""...",1472826545,1,1472826545
8,Organic Search,20160902,"{""browser"": ""Safari"", ""browserVersion"": ""not a...",1350700416054916432,"{""continent"": ""Americas"", ""subContinent"": ""Nor...",1350700416054916432_1472882635,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""49"", ""pageviews"": ""32""}","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472882635,3,1472882635
9,Social,20160902,"{""browser"": ""Firefox"", ""browserVersion"": ""not ...",5990626540303882402,"{""continent"": ""Europe"", ""subContinent"": ""South...",5990626540303882402_1472812848,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""referralPath"": ""/yt/about/it/"", ""campaign"": ...",1472812848,1,1472812848


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12283 entries, 0 to 12282
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   channelGrouping       12283 non-null  object
 1   date                  12283 non-null  int64 
 2   device                12283 non-null  object
 3   fullVisitorId         12283 non-null  uint64
 4   geoNetwork            12283 non-null  object
 5   sessionId             12283 non-null  object
 6   socialEngagementType  12283 non-null  object
 7   totals                12283 non-null  object
 8   trafficSource         12283 non-null  object
 9   visitId               12283 non-null  int64 
 10  visitNumber           12283 non-null  int64 
 11  visitStartTime        12283 non-null  int64 
dtypes: int64(4), object(7), uint64(1)
memory usage: 1.1+ MB


In [None]:
# Necesitamos validar que tipos de datos tiene las columnas tipo json
df.device.value_counts().index


In [None]:
df.geoNetwork.value_counts().index

In [None]:
df.totals.value_counts().index

In [None]:
df.trafficSource.value_counts().index

In [None]:
df_traffic = pd.read_csv('https://raw.githubusercontent.com/ElProfeAlejo/Bootcamp_Databases/main/traffic_site.csv', dtype={'date':object,'fullVisitorId':object,'visitId':object})
import pandas as pd
import json

def preprocesamiento():
    global df_traffic

    # Extraer datos de diccionarios
    diccionarios = ['device','geoNetwork','trafficSource','totals']
    for columna in diccionarios:
        df_traffic = df_traffic.join(pd.DataFrame([json.loads(linea) for linea in df_traffic[columna]]))
    df_traffic = df_traffic.drop(columns=diccionarios)
    # o puedo ustilizar esta otra opción df_traffic.drop(columns=diccionarios, inplace=True)

    #Eliminar las columnas que no requerimos
    df_traffic= df_traffic.drop(columns=['socialEngagementType','browserVersion','browserSize','operatingSystemVersion','mobileDeviceBranding','mobileDeviceModel','mobileInputSelector','mobileDeviceInfo','mobileDeviceMarketingName','flashVersion','language','screenColors','screenResolution','continent','subContinent','country','region','cityId','latitude','longitude','networkLocation','keyword','adwordsClickInfo','isTrueDirect','adContent','campaignCode','networkDomain','visits'])

    # Eliminar duplicados
    df_traffic = df_traffic.drop_duplicates()

    # Reemplazar textos por números
    cuant = ['bounces', 'hits','newVisits','pageviews', 'transactionRevenue']
    for columna in cuant:
        df_traffic[columna] = pd.to_numeric(df_traffic[columna])
    df_traffic[cuant] = df_traffic[cuant].fillna(0)
    df_traffic['transactionRevenue'] = df_traffic['transactionRevenue'] / 1000000


   # Reemplazar números por fechas
    df_traffic['date'] = pd.to_datetime(df_traffic['date'], format='%Y%m%d')
    df_traffic['visitStartTime'] = pd.to_datetime(df_traffic['visitStartTime'], unit='s')
    df_traffic.bounces = df_traffic.bounces.astype(int)
    df_traffic.newVisits = df_traffic.newVisits.astype(int)



preprocesamiento()
df_traffic.sample(5)

In [None]:
#1 Valido la información de mi base de datos
df_traffic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12283 entries, 0 to 12282
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   channelGrouping     12283 non-null  object        
 1   date                12283 non-null  datetime64[ns]
 2   fullVisitorId       12283 non-null  object        
 3   sessionId           12283 non-null  object        
 4   visitId             12283 non-null  object        
 5   visitNumber         12283 non-null  int64         
 6   visitStartTime      12283 non-null  datetime64[ns]
 7   browser             12283 non-null  object        
 8   operatingSystem     12283 non-null  object        
 9   isMobile            12283 non-null  bool          
 10  deviceCategory      12283 non-null  object        
 11  metro               12283 non-null  object        
 12  city                12283 non-null  object        
 13  campaign            12283 non-null  object    

In [None]:
df_traffic['referralPath'].unique()
df_traffic.referralPath.value_counts()

/                                           1042
/yt/about/                                   931
/yt/about/tr/                                205
/analytics/web/                              196
/yt/about/vi/                                195
                                            ... 
/neo/rd                                        1
/How-do-I-get-a-Google-I-O-t-shirt             1
/yt/about/id/index.html                        1
/intl/id/permissions/using-the-logo.html       1
/intl/vi/yt/about/press/                       1
Name: referralPath, Length: 197, dtype: int64

In [None]:
diccionario_referralPath = {'/':1,'/yt/about/':1,'/yt/about/tr/':1,'/analytics/web/':1,'/yt/about/vi/':1,'/neo/rd':1,'/How-do-I-get-a-Google-I-O-t-shirt':1,'/yt/about/id/index.html':1,'/intl/id/permissions/using-the-logo.html':1,'/intl/vi/yt/about/press/':1}
df_traffic['referralPath'] = df_traffic['referralPath'].map(diccionario_referralPath)
df_traffic['referralPath'] = df_traffic['referralPath'].fillna(0)


In [None]:
#1 Valido la información de mi base de datos
df_traffic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12283 entries, 0 to 12282
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   channelGrouping     12283 non-null  object        
 1   date                12283 non-null  datetime64[ns]
 2   fullVisitorId       12283 non-null  object        
 3   sessionId           12283 non-null  object        
 4   visitId             12283 non-null  object        
 5   visitNumber         12283 non-null  int64         
 6   visitStartTime      12283 non-null  datetime64[ns]
 7   browser             12283 non-null  object        
 8   operatingSystem     12283 non-null  object        
 9   isMobile            12283 non-null  bool          
 10  deviceCategory      12283 non-null  object        
 11  metro               12283 non-null  object        
 12  city                12283 non-null  object        
 13  campaign            12283 non-null  object    

In [None]:
df_traffic.sample(5)

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,visitId,visitNumber,visitStartTime,browser,operatingSystem,isMobile,deviceCategory,metro,city,campaign,source,medium,referralPath,hits,pageviews,bounces,newVisits,transactionRevenue
1911,Social,2017-03-20,4473272788220267679,4473272788220267679_1490022998,1490022998,1,2017-03-20 15:16:38,Chrome,Windows,False,desktop,not available in demo dataset,not available in demo dataset,(not set),youtube.com,referral,0.0,3,3,0,1,0.0
5741,Social,2016-08-14,9976407758466356759,9976407758466356759_1471211782,1471211782,1,2016-08-14 21:56:22,Opera,Windows,False,desktop,not available in demo dataset,not available in demo dataset,(not set),youtube.com,referral,0.0,1,1,1,1,0.0
5662,Organic Search,2016-12-05,110975100919566950,110975100919566950_1480999011,1480999011,1,2016-12-06 04:36:51,Safari,iOS,True,mobile,not available in demo dataset,not available in demo dataset,(not set),google,organic,0.0,1,1,1,1,0.0
2527,Organic Search,2016-08-19,7076241691251193391,7076241691251193391_1471618237,1471618237,1,2016-08-19 14:50:37,Chrome,Windows,False,desktop,not available in demo dataset,not available in demo dataset,(not set),google,organic,0.0,1,1,1,1,0.0
8102,Organic Search,2016-10-12,5310912563447452038,5310912563447452038_1476302884,1476302884,15,2016-10-12 20:08:04,Chrome,Linux,False,desktop,San Francisco-Oakland-San Jose CA,Mountain View,(not set),google,organic,0.0,13,13,0,0,372.65


#**3. Exploración y Feature Engineering**


---


In [2]:
# Grafico de dispersión (antes)
plt.figure(figsize=(10, 6))
plt.scatter(range(len(df_traffic['transactionRevenue'])), df_traffic['transactionRevenue'], alpha=0.5)
plt.title('Gráfico de Dispersión de transactionRevenue')
plt.xlabel('Índice')
plt.ylabel('transactionRevenue')
plt.show()

NameError: name 'df_traffic' is not defined

<Figure size 1000x600 with 0 Axes>

In [None]:
sns.boxplot(df_traffic.transactionRevenue)

In [None]:
df_traffic.describe()

In [None]:
conteo = df_traffic['transactionRevenue'].value_counts().sort_values(ascending=False)
conteo

In [None]:
#Histograma (antes)
ax = sns.histplot(data=df_traffic, x='transactionRevenue', kde=False)
ax.set_title('Histograma de transactionRevenue')
ax.set_xlabel('transactionRevenue');

In [None]:
feature_engineering()
df_traffic.sample(5)

In [None]:
df_traffic.info()

In [None]:
#test

In [3]:
plt.figure(figsize=(30, 10))
heatmap = sns.heatmap(df_traffic.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
heatmap.tick_params(axis='both', which='major', labelsize=14)
plt.title('Mapa de Calor de Correlaciones', fontsize=18)
plt.show()

NameError: name 'df_traffic' is not defined

<Figure size 3000x1000 with 0 Axes>

#**4. Construcción de Modelos**


---


In [None]:
crea_modelos()
for i, model in enumerate(resultados['Model']):
    print('-------------------------------')
    print(f"Modelo: {model}")
    print(f"R-cuadrado (R²): {resultados['R2'][i]}")
    print(f"Error cuadrático medio (MSE): {resultados['MSE'][i]}")
    print(f"Raíz del error cuadrático medio (RMSE): {resultados['RMSE'][i]}")
    print('-------------------------------')

NameError: name 'crea_modelos' is not defined

#**5. Evaluación y Selección del Modelo**


---


In [None]:
visualiza_resultados()

NameError: name 'visualiza_resultados' is not defined

#**6. Producción**


---


In [None]:
#Cargar base de prueba y ejecutar normalización utilizada en nuestro modelo
df_traffic = pd.read_csv('https://raw.githubusercontent.com/ElProfeAlejo/Bootcamp_Databases/main/traffic_test.csv', dtype={'date':object,'fullVisitorId':object,'visitId':object})
preprocesamiento()
feature_engineering()

#Cargar el modelo entrenado
wget.download('https://raw.githubusercontent.com/ElProfeAlejo/Bootcamp_Databases/main/modelo.joblib', 'modelo.joblib')
modelo = load('modelo.joblib')
wget.download('https://raw.githubusercontent.com/ElProfeAlejo/Bootcamp_Databases/main/modelo_clasificacion.joblib', 'modelo_clasificacion.joblib')
modelo_clasificacion = load('modelo_clasificacion.joblib')

#Pronosticar con la nueva base
X = df_traffic.drop('transactionRevenue',axis=1)
X['clasificacion'] = modelo_clasificacion.predict(X)
y = df_traffic.transactionRevenue.copy()
predictions = modelo.predict(xgb.DMatrix(X))
predictions[predictions < 1] = 0

#Genera algunas métricas de evaluación de los pronosticos
r2 = r2_score(y, predictions)*100
mse = mean_squared_error(y, predictions)
rmse = np.sqrt(mse)
print(f"R-cuadrado (R²): {r2}")
print(f"Error cuadrático medio (MSE): {mse}")
print(f"Raíz del error cuadrático medio (RMSE): {rmse}")

KeyError: "['campaignCode'] not found in axis"

In [None]:
#Ejemplos de los pronósticos
df_resultados = pd.DataFrame({
    'transactionRevenue': y,
    'predictions': predictions
})
df_resultados[df_resultados.transactionRevenue>0].sample(10)

NameError: name 'y' is not defined