<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part I: Imports and Data Check</h2>

In [3]:
%cd 'C:\Users\Marcio Pineda\Documents\Archivos Python\Kaggle Files (Updated)'



[WinError 123] The filename, directory name, or volume label syntax is incorrect: "'C:\\Users\\Marcio Pineda\\Documents\\Archivos Python\\Kaggle Files (Updated)'"
C:\Users\Marcio Pineda\Documents\Archivos Python\Kaggle Files (Updated)


In [7]:
!git remote set-url origin https://github.com/mpinedae21/ClickPrediction.git


In [None]:
!git add B.ipynb


In [59]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

# Configure pandas display options (optional)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Helper to clean numeric columns stored as text (e.g. "$1,250.50")
def clean_numeric_column(column):
    # regex=False removes ',' and '$' literally; older pandas versions read the
    # pattern as a regex by default, where '$' only matches the end of the string
    column_as_str = (column.astype(str)
                     .str.replace(',', '', regex=False)
                     .str.replace('$', '', regex=False)
                     .str.strip())
    return pd.to_numeric(column_as_str, errors='coerce')


# Load the datasets
ruta_train = 'C:/Users/Marcio Pineda/Documents/Archivos Python/datasets/traincase.csv'
ruta_test = 'C:/Users/Marcio Pineda/Documents/Archivos Python/datasets/testcase.csv'
df_train = pd.read_csv(ruta_train)
df_test = pd.read_csv(ruta_test)

# Make sure 'Match Type' is present in both datasets
assert 'Match Type' in df_train.columns, "Column 'Match Type' is missing from the training set."
assert 'Match Type' in df_test.columns, "Column 'Match Type' is missing from the test set."

# Label each dataset so the rows can be told apart after concatenation
df_train['set'] = 'Not Kaggle'
df_test['set'] = 'Kaggle'

# Concatenate df_train and df_test into df_full
df_full = pd.concat([df_train, df_test], ignore_index=True)

# Apply the cleaning helper to the relevant numeric columns of df_full
columns_to_clean = ['Search Engine Bid', 'Impressions', 'Avg. Cost per Click', 'Avg. Pos.', 'Clicks']
for column in columns_to_clean:
    df_full[column] = clean_numeric_column(df_full[column])




<br><hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part II: Data Preparation</h2><br>
Complete the following steps to prepare for model building. Note that you may add or remove steps as you see fit. Please see the assignment description for details on what steps are required for this project.
<br><br>
<h3>Feature Engineering</h3>

### Feature Engineering and Transformations for NLP and Keywords Clusters 

In [None]:
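import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from scipy.stats import f_oneway
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

# The NLTK tokenizer, stopword, and WordNet resources must be downloaded once with nltk.download

# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Text preprocessing helper
def preprocess_text(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())
    # Drop stopwords and punctuation
    tokens = [token for token in tokens if token not in STOP_WORDS and token not in string.punctuation]
    # Lemmatize the remaining tokens
    tokens = [LEMMATIZER.lemmatize(token) for token in tokens]
    return ' '.join(tokens)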

# Preprocess the 'Keyword' column
df_train['Preprocessed Keyword'] = df_train['Keyword'].apply(preprocess_text)

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=600)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_train['Preprocessed Keyword'])

# LDA topic model
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_model.fit(tfidf_matrix)

# Assign each row its most probable LDA topic
df_train['Topic'] = lda_model.transform(tfidf_matrix).argmax(axis=1)

# One-hot encode the 'Topic' column
df_train = pd.get_dummies(df_train, columns=['Topic'], drop_first=True)

# Define the KMeans model
kmeans = KMeans(n_clusters=10, random_state=42)
# Fit KMeans on the TF-IDF matrix
kmeans.fit(tfidf_matrix)

def preprocess_df_general(df, kmeans):
    # Clean the numeric columns, except 'Clicks'
    numeric_cols = ['Search Engine Bid', 'Avg. Pos.', 'Avg. Cost per Click', 'Impressions']
    for col in numeric_cols:
        if df[col].dtype == object:
            # regex=False removes '$' literally instead of reading it as a regex anchor
            cleaned = df[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
            df[col] = pd.to_numeric(cleaned, errors='coerce')
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    df['Impressions'] = df['Impressions'].fillna(df['Impressions'].median())

    # Processing shared by the training and test sets
    # Note: the vectorizer was fitted on 'Preprocessed Keyword'; here the keywords are only lowercased
    keywords_tfidf = tfidf_vectorizer.transform(df['Keyword'].str.lower())
    keyword_clusters = kmeans.predict(keywords_tfidf)
    df['Keyword Cluster'] = keyword_clusters
    df['Interaction'] = df['Keyword'].astype(str) + '_' + df['Match Type'].astype(str)
    bin_edges = [0, 100, 1000, 10000, np.inf]
    bin_labels = [1, 2, 3, 4]
    df['Impressions Category'] = pd.cut(df['Impressions'], bins=bin_edges, labels=bin_labels, right=False).cat.add_categories([0]).fillna(0).astype(int)
    
    return df

def preprocess_df_train(df, kmeans):
    df = preprocess_df_general(df, kmeans)
    # Clean 'Clicks' and convert it to numeric (training set only)
    df['Clicks'] = pd.to_numeric(df['Clicks'].str.replace(',', ''), errors='coerce')
    return df

# Preprocess the data
df_train_cleaned = preprocess_df_train(df_train.copy(), kmeans)
df_test_cleaned = preprocess_df_general(df_test.copy(), kmeans)

# Log transformation
df_train_cleaned['Log_Impressions'] = np.log1p(df_train_cleaned['Impressions'])
# The same transformation can be applied to df_test_cleaned if needed

# Polynomial features
df_train_cleaned['Search_Engine_Bid_Squared'] = df_train_cleaned['Search Engine Bid'] ** 2
df_train_cleaned['Impressions_Cubed'] = df_train_cleaned['Impressions'] ** 3
# More polynomial features can be created as needed

# Update the selected features
selected_features = ['Search Engine Bid', 'Impressions Category', 'Avg. Pos.', 'Keyword Cluster',
                     'Log_Impressions', 'Search_Engine_Bid_Squared', 'Impressions_Cubed']

# fillna returns a new frame, which avoids modifying a slice of df_train_cleaned in place
X_train_cleaned = df_train_cleaned[selected_features].fillna(0)
y_train_cleaned = df_train_cleaned['Clicks'].astype(float)

# Fit the random forest model with the new features
model_cleaned = RandomForestRegressor(n_estimators=100, random_state=42)
model_cleaned.fit(X_train_cleaned, y_train_cleaned)

# Cross-validate the updated random forest model
cv_scores_cleaned = cross_val_score(model_cleaned, X_train_cleaned, y_train_cleaned, cv=5, scoring='neg_mean_squared_error')
cv_rmse_cleaned = np.sqrt(-cv_scores_cleaned)
cv_rmse_cleaned_mean = cv_rmse_cleaned.mean()

print("Mean CV RMSE of the cleaned model with log and polynomial features:", cv_rmse_cleaned_mean)

# Outlier analysis
# Inspect rows whose 'Clicks' value lies more than 3 standard deviations from the mean
outliers = df_train_cleaned[(np.abs(df_train_cleaned['Clicks'] - df_train_cleaned['Clicks'].mean()) > (3 * df_train_cleaned['Clicks'].std()))]
print("Outlier cases:")
print(outliers)

# Statistical evaluation
# ANOVA test of whether 'Clicks' differs significantly between topics (only topics 1 to 3 are compared)
anova_result = f_oneway(
    df_train_cleaned[df_train_cleaned['Topic_1'] == 1]['Clicks'],
    df_train_cleaned[df_train_cleaned['Topic_2'] == 1]['Clicks'],
    df_train_cleaned[df_train_cleaned['Topic_3'] == 1]['Clicks']
)

print("ANOVA results:", anova_result)

In the first attempt at feature engineering, we used NLP techniques to handle the large number of distinct values in the 'Keyword' variable. We preprocessed the keywords and then grouped them, so that the resulting group could be used as a predictor.

1. Text Preprocessing
The first step cleans and preprocesses the text in the 'Keyword' column:

- Tokenization: the text is lowercased for consistency and split into individual tokens (words).
- Removal of stopwords and punctuation: words that carry little meaning (stopwords) and punctuation marks are dropped, since they are not useful for text analysis.
- Lemmatization: words are converted to their base form (lemma), so that variants of the same word are counted together.

This preprocessing reduces noise in the text data and highlights the keywords that matter for the analysis.
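
The three preprocessing steps can be illustrated without any downloads. The following is a minimal sketch, not the notebook's NLTK pipeline: the stopword set is a small illustrative subset, the tokenizer is a plain regular expression, and lemmatization is left out. The name `simple_preprocess` is hypothetical.

```python
import re

# Illustrative stopword subset (assumption); the notebook uses NLTK's full English list.
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "the", "to"})


def simple_preprocess(text: str) -> str:
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    # Keeping only alphanumeric runs discards punctuation in the same step.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(token for token in tokens if token not in STOPWORDS)


example = simple_preprocess("Flights to Paris, the cheapest!")
print(example)  # flights paris cheapest
```

Building the stopword collection once, outside the function, matters in practice: looking it up again for every token makes the preprocessing of a large keyword column noticeably slower.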

2. TF-IDF Vectorization
The preprocessed text is transformed into a numerical representation using TF-IDF (Term Frequency-Inverse Document Frequency), which helps identify the importance of words within documents relative to the dataset. This step is crucial for converting textual data into a format that machine learning models can process.

3. LDA Modeling
Latent Dirichlet Allocation (LDA) is applied to identify topics within the dataset. The model groups keywords into 10 topics based on which words tend to occur together, and each keyword is assigned its most probable topic. The topics can help in understanding the search categories and how they relate to clicks.
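
The topic assignment can be sketched on toy data as follows. One assumption differs from the notebook: this sketch fits LDA on raw term counts from `CountVectorizer`, which is the usual input for LDA, whereas the notebook fits it on the TF-IDF matrix. The keywords are invented for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy keywords (assumption), not taken from the dataset.
keywords = [
    "paris flight", "paris flight ticket", "flight paris airline",
    "rome hotel", "rome hotel room", "hotel rome suite",
]

counts = CountVectorizer().fit_transform(keywords)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(counts)  # one row per keyword, one column per topic

# Same rule as in the notebook: keep the most probable topic for each keyword.
topics = doc_topic.argmax(axis=1)
print(doc_topic.round(2))
print(topics)
```

Each row of `doc_topic` is a probability distribution over topics, so taking the `argmax` discards how confident the assignment was.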

4. Clustering with KMeans
The KMeans algorithm is used to cluster keywords based on their TF-IDF characteristics. This step can reveal patterns in the keywords that might be associated with different levels of interaction, such as clicks.
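
A minimal sketch of the clustering step, using toy keywords rather than the real data. It also shows how an unseen keyword is assigned to a cluster: it must go through the same fitted vectorizer before `predict` is called.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy keywords (assumption) with two clearly separated vocabularies.
keywords = [
    "paris flight", "paris flight ticket", "flight paris airline",
    "rome hotel", "rome hotel room", "hotel rome suite",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(keywords)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(tfidf)

# A new keyword is transformed with the already fitted vectorizer, never refitted.
new_label = kmeans.predict(vectorizer.transform(["cheap flight paris"]))[0]
print(labels, new_label)
```

Words that the vectorizer did not see during fitting (here "cheap") are ignored, so the new keyword is placed using only its known words.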

5. Preprocessing Numerical and Categorical Features
The other features of the dataset are cleaned and processed:

- Cleaning numerical features: removing currency symbols and thousands separators, converting to a numeric type, and filling missing 'Impressions' values with the median.
- Creating new features: generating variables such as 'Keyword Cluster', the 'Keyword' and 'Match Type' interaction, and 'Impressions Category', which might help predict clicks.

6. Specific Preparation for the Training Set
For the training set, the 'Clicks' column is additionally cleaned and converted to a numeric type for analysis and modeling.
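
The numeric cleaning and the impressions binning can be checked on a toy column. This is a sketch under two assumptions: the sample values are invented, and `regex=False` is passed so that `'$'` is removed as a literal character (read as a regular expression, `'$'` only matches the end of the string and leaves the dollar sign in place).

```python
import numpy as np
import pandas as pd

# Invented raw values (assumption) mimicking text-formatted numbers.
raw = pd.Series(["$1,250.50", " 300 ", "n/a", "$0.75"])

cleaned = (raw.str.replace(",", "", regex=False)
              .str.replace("$", "", regex=False)
              .str.strip())
values = pd.to_numeric(cleaned, errors="coerce")  # unparseable text becomes NaN

# Same bin edges and labels as the notebook; right=False makes bins [0, 100), [100, 1000), ...
bin_edges = [0, 100, 1000, 10000, np.inf]
bin_labels = [1, 2, 3, 4]
category = (pd.cut(values, bins=bin_edges, labels=bin_labels, right=False)
              .cat.add_categories([0])
              .fillna(0)          # missing values get their own category 0
              .astype(int))

print(values.tolist())    # [1250.5, 300.0, nan, 0.75]
print(category.tolist())  # [3, 2, 0, 1]
```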



<br><hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part III: Data Partitioning</h2><br>
This is a very important step for your submission on Kaggle. Make sure to complete your data preparation before moving forward.
<br>
<br><h3>Separating the Kaggle Data</h3><br>

### Feature Engineering using transformations and polynomials on Impressions and Search Engine Bid

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

# Preprocess the numeric columns
def preprocess_numeric(df):
    for col in ['Search Engine Bid', 'Avg. Pos.', 'Impressions']:
        # Strip '$', ',' and whitespace (regex=False: '$' is a literal character), then convert
        df[col] = df[col].astype(str).str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.strip().replace('', np.nan)
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df

df_train = preprocess_numeric(df_train)
df_test = preprocess_numeric(df_test)

# Impute missing values after the conversion
imputer = SimpleImputer(strategy='median')
cols_to_impute = ['Impressions', 'Search Engine Bid', 'Avg. Pos.']

df_train[cols_to_impute] = imputer.fit_transform(df_train[cols_to_impute])
df_test[cols_to_impute] = imputer.transform(df_test[cols_to_impute])

# Create polynomial (interaction) features
poly_features = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly_features.fit_transform(df_train[['Search Engine Bid', 'Impressions', 'Avg. Pos.']])
X_test_poly = poly_features.transform(df_test[['Search Engine Bid', 'Impressions', 'Avg. Pos.']])

# Separate the target variable
y = df_train['Clicks'].str.replace(',', '').astype(float)  # Clean the 'Clicks' column and convert it to float

X_train, X_valid, y_train, y_valid = train_test_split(X_poly, y, test_size=0.2, random_state=42)



In the second attempt, we focused on the two most dominant numerical variables, Impressions and Search Engine Bid. We cleaned and imputed them and then created polynomial interaction features between them and Average Position to capture the relationships among these variables.

#### Preprocessing of Numerical Columns
A function named preprocess_numeric cleans specific columns and converts them to numeric types. It:

- converts each column to strings, removes the dollar signs and commas that would hinder numeric conversion, and strips surrounding whitespace;
- replaces empty strings with NaN;
- coerces any value that still cannot be represented as a number to NaN, so that every affected column ends up in a numeric format.

This function is applied to both the training and testing datasets.

#### Imputation of Missing Values
After the conversion, missing values can remain, either because a value could not be parsed or because it was missing in the original data. A SimpleImputer with the 'median' strategy fills these values in the specified columns, so that the models receive a complete dataset.

The fit_transform method is applied to the training dataset: it computes each column's median from the training data and fills the missing training values with it. Because the validation split is carved out of this dataset afterwards, its rows are filled with the same medians.

The testing data is then passed through the .transform method, which reuses the medians computed on the training data, so that both datasets are imputed consistently.
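
The fit-on-train, transform-on-test pattern can be seen on a toy column. The numbers below are invented for illustration; the point is that the test set is filled with the training median, not with its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented single-column data (assumption).
train = np.array([[1.0], [3.0], [np.nan], [100.0]])
test = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the median of [1, 3, 100]
test_filled = imputer.transform(test)        # reuses that median, learns nothing new

print(imputer.statistics_)   # [3.]
print(train_filled.ravel())  # [  1.   3.   3. 100.]
print(test_filled.ravel())   # [3. 5.]
```

The median is also robust to the extreme value 100, which a mean-based imputation would not be.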

#### Creation of Polynomial Features
Polynomial features are generated from the columns 'Search Engine Bid', 'Impressions', and 'Avg. Pos.' with scikit-learn's PolynomialFeatures, using degree=2, interaction_only=True, and include_bias=False. With these settings the output contains the three original columns plus their three pairwise products, with no squared terms and no constant column. The products let a linear model capture interactions between the variables, which it could not represent from the original columns alone. The transformer is fitted on the training data only and then applied to the test data, so both datasets contain the same features.

In [60]:
# One-hot encode 'Match Type' and any other categorical variable needed
df_full['Match Type'] = df_full['Match Type'].fillna('Unknown')
categorical_cols = ['Match Type']
df_full = pd.get_dummies(df_full, columns=categorical_cols)

# Fill the remaining NaNs in the numeric columns with forward fill
df_full.ffill(inplace=True)

# Separate features and target, then split into training and test sets
features_columns = ['Search Engine Bid', 'Impressions', 'Avg. Cost per Click', 'Avg. Pos.'] + \
                   [col for col in df_full.columns if col.startswith('Match Type_')]
features = df_full[df_full['set'] == 'Not Kaggle'][features_columns]
target = df_full[df_full['set'] == 'Not Kaggle']['Clicks']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train a decision tree
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)

# Evaluate the model
y_pred_tree = tree_model.predict(X_test)
rmse_tree = sqrt(mean_squared_error(y_test, y_pred_tree))
print(f"Decision Tree RMSE: {rmse_tree}")


Decision Tree RMSE: 1699.2864795797498


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part III: Candidate Modeling</h2><br>
Develop your candidate models below.

### Modeling using NLP and Clusters for Keywords 

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# RandomForestRegressor model
model_rf = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)

# Predictors and target variable
X = df_train_cleaned[['Search_Engine_Bid_Squared', 'Impressions_Cubed' ]]
y = df_train_cleaned['Clicks']

# Cross-validation with RandomForestRegressor
cv_scores_rf = cross_val_score(model_rf, X, y, cv=5, scoring='neg_mean_squared_error')
cv_rmse_rf = np.sqrt(-cv_scores_rf)
cv_rmse_rf_mean = cv_rmse_rf.mean()

print("Mean RMSE of the RandomForestRegressor model:", cv_rmse_rf_mean)

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the RandomForestRegressor on the training split
model_rf.fit(X_train, y_train)

# Compute the RMSE on the training set
train_predictions_rf = model_rf.predict(X_train)
train_rmse_rf = np.sqrt(mean_squared_error(y_train, train_predictions_rf))
print("Training RMSE of the RandomForestRegressor model:", train_rmse_rf)

# Compute the RMSE on the validation set
valid_predictions_rf = model_rf.predict(X_valid)
valid_rmse_rf = np.sqrt(mean_squared_error(y_valid, valid_predictions_rf))
print("Validation RMSE of the RandomForestRegressor model:", valid_rmse_rf)

# Check for overfitting by comparing training and validation RMSE
# (a small gap is expected; only a large gap is a warning sign)
if train_rmse_rf < valid_rmse_rf:
    print("The RandomForestRegressor model may be overfitting.") 
    
    
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Define the linear regression models
model_lr = LinearRegression()
model_ridge = Ridge(alpha=1.0)  # alpha can be tuned as needed
model_lasso = Lasso(alpha=1.0)  # alpha can be tuned as needed
model_elasticnet = ElasticNet(alpha=1.0, l1_ratio=0.5)  # alpha and l1_ratio can be tuned as needed

# Predictors and target variable
X_train = df_train_cleaned[['Search_Engine_Bid_Squared', 'Impressions_Cubed']]
y_train = df_train_cleaned['Clicks']

# Cross-validation with LinearRegression
cv_scores_lr = cross_val_score(model_lr, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse_lr = np.sqrt(-cv_scores_lr)
cv_rmse_lr_mean = cv_rmse_lr.mean()

print("Mean RMSE of the LinearRegression model:", cv_rmse_lr_mean)

# Cross-validation with Ridge Regression
cv_scores_ridge = cross_val_score(model_ridge, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse_ridge = np.sqrt(-cv_scores_ridge)
cv_rmse_ridge_mean = cv_rmse_ridge.mean()

print("Mean RMSE of the Ridge Regression model:", cv_rmse_ridge_mean)

# Cross-validation with Lasso Regression
cv_scores_lasso = cross_val_score(model_lasso, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse_lasso = np.sqrt(-cv_scores_lasso)
cv_rmse_lasso_mean = cv_rmse_lasso.mean()

print("Mean RMSE of the Lasso Regression model:", cv_rmse_lasso_mean)

# Cross-validation with ElasticNet Regression
cv_scores_elasticnet = cross_val_score(model_elasticnet, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse_elasticnet = np.sqrt(-cv_scores_elasticnet)
cv_rmse_elasticnet_mean = cv_rmse_elasticnet.mean()

print("Mean RMSE of the ElasticNet Regression model:", cv_rmse_elasticnet_mean)

#### RandomForestRegressor Performance
- Average RMSE: 742.66
- Training RMSE: 307.87
- Validation RMSE: 861.12

The training RMSE (307.87) is far below the validation RMSE (861.12). The model fits its training data well but generalizes much less well to new, previously unseen data. A gap this large suggests that the model has picked up patterns specific to the training split that are less common or different in the validation data.

#### Linear Models Performance
- LinearRegression Average RMSE: 908.90
- Ridge Regression Average RMSE: 887.63
- Lasso Regression Average RMSE: 887.63
- ElasticNet Regression Average RMSE: 887.62

The four linear models have average RMSE values between 887.62 and 908.90. The RandomForestRegressor's average RMSE (742.66) is lower, which indicates more accurate predictions under cross-validation.
The three regularized models (Ridge, Lasso, ElasticNet) score about 21 RMSE points below plain LinearRegression and are nearly indistinguishable from one another, so regularization brings only a modest benefit in this context.

#### Insights and Analysis

Overfitting in RandomForestRegressor: The large difference between the training and validation RMSE of the RandomForestRegressor is a classic sign of overfitting. The model fits the training data too closely, capturing noise and specific patterns that do not generalize to new data. Overfitting could be reduced through hyperparameter tuning (for example, increasing min_samples_leaf or decreasing max_depth) and checked with more extensive cross-validation.
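
The effect of min_samples_leaf and max_depth on the train/validation gap can be illustrated on synthetic data. This is a sketch under stated assumptions: the data is randomly generated (a noisy linear signal), not the click data, and the helper name `rmse_gap` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = 3 * X[:, 0] + rng.normal(0, 2, size=400)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=42)


def rmse_gap(model):
    """Fit the model and return (train RMSE, validation RMSE, gap)."""
    model.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    valid_rmse = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))
    return train_rmse, valid_rmse, valid_rmse - train_rmse


flexible = rmse_gap(RandomForestRegressor(n_estimators=50, random_state=42))
constrained = rmse_gap(RandomForestRegressor(
    n_estimators=50, min_samples_leaf=20, max_depth=5, random_state=42))

print("flexible    (train, valid, gap):", np.round(flexible, 2))
print("constrained (train, valid, gap):", np.round(constrained, 2))
```

The unconstrained forest reaches a lower training error by memorizing noise, which shows up as a larger gap; the constrained forest gives up some training accuracy in exchange for a smaller gap.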

Generalization of Linear Models: The RMSE values of LinearRegression, Ridge, Lasso, and ElasticNet are stable (about 888 to 909), but they are higher than the RandomForestRegressor's cross-validated RMSE of 742.66 and close to its validation RMSE of 861.12. The linear models probably do not capture all the complexities and patterns in the data.

Model selection: The choice between more complex models and simple linear models is a trade-off between prediction accuracy and generalization. A sensible next step is to consider other ensemble methods or to look deeper into regularization techniques that could narrow the gap between training performance and validation performance.

### Modeling using transformations and polynomials on Impressions and Search Engine Bid  

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge

# Define and train the models
models = {
    'Lasso': Lasso(alpha=0.1),
    'Ridge': Ridge(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    cv_score = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    print(f'{name} CV RMSE:', np.sqrt(-cv_score.mean()))
    
    # Evaluate on the validation set
    valid_preds = model.predict(X_valid)
    valid_rmse = np.sqrt(mean_squared_error(y_valid, valid_preds))
    print(f'{name} Validation RMSE:', valid_rmse)



#### Lasso and Ridge:
Both models had much higher RMSEs on the validation set (about 2363) than in cross-validation (about 1003). These models capture some general trends in the training data but do not generalize well to this validation split. This could indicate overfitting despite the regularization, although a gap of this size may also reflect a few extreme 'Clicks' values in the validation split.

#### Random Forest:
The model showed the best performance in both cross-validation and on the validation set, with RMSEs of approximately 744 and 994, respectively. The smaller gap between these two values suggests better generalization. The ensemble nature of Random Forest and its ability to handle complex interactions and nonlinearities between variables may explain why it outperforms linear models like Lasso and Ridge.

#### Gradient Boosting:
Like Random Forest, Gradient Boosting was consistent between cross-validation and validation, with a CV RMSE of about 939 and a validation RMSE of about 998. This model builds trees sequentially, each one correcting the errors of the previous trees, which reduces the error on the training set while still generalizing well here.

### General Model 

In [None]:
# Define and train the models
models = {
    'Lasso': Lasso(alpha=0.1),
    'Ridge': Ridge(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    cv_score = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    print(f'{name} CV RMSE:', np.sqrt(-cv_score.mean()))
    
    # Evaluate on the validation set
    valid_preds = model.predict(X_valid)
    valid_rmse = np.sqrt(mean_squared_error(y_valid, valid_preds))
    print(f'{name} Validation RMSE:', valid_rmse)

## Hyperparameter Tuning 

### Hyperparameter Tuning for Model only using Impressions and Search Engine Bid transformations as most dominant variables 

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import StackingRegressor

# Hyperparameter tuning for gradient boosting (XGBoost)
param_grid_xgb = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}
grid_search_xgb = GridSearchCV(XGBRegressor(n_estimators=100, random_state=42), param_grid=param_grid_xgb, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search_xgb.fit(X_train_pca, y_train)

# Hyperparameter tuning for LightGBM
# Note: with LightGBM's default subsample_freq=0, bagging is disabled and 'subsample' has no effect
param_grid_lgbm = {
    'num_leaves': [31, 50, 100],
    'min_data_in_leaf': [20, 50, 100],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}
grid_search_lgbm = GridSearchCV(LGBMRegressor(n_estimators=100, random_state=42), param_grid=param_grid_lgbm, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search_lgbm.fit(X_train_pca, y_train)

# Overfitting control: ensemble of models
best_xgb_model = grid_search_xgb.best_estimator_
best_lgbm_model = grid_search_lgbm.best_estimator_

estimators = [('xgb', best_xgb_model), ('lgbm', best_lgbm_model)]
stacked_model = StackingRegressor(estimators=estimators, final_estimator=RandomForestRegressor(n_estimators=100, random_state=42))
stacked_model.fit(X_train_pca, y_train)

# Evaluate with cross-validation
cv_score_stacked = cross_val_score(stacked_model, X_train_pca, y_train, cv=5, scoring='neg_mean_squared_error')
print("Stacked Model CV RMSE:", np.sqrt(-cv_score_stacked.mean()))

# Predict on the validation set
stacked_valid_preds = stacked_model.predict(X_valid_pca)
stacked_valid_rmse = np.sqrt(mean_squared_error(y_valid, stacked_valid_preds))
print("Stacked Model Validation RMSE:", stacked_valid_rmse)

# Difference between training and validation RMSE
xgb_train_preds = best_xgb_model.predict(X_train_pca)
xgb_valid_preds = best_xgb_model.predict(X_valid_pca)
xgb_train_rmse = np.sqrt(mean_squared_error(y_train, xgb_train_preds))
xgb_valid_rmse = np.sqrt(mean_squared_error(y_valid, xgb_valid_preds))
print("XGBoost Train RMSE:", xgb_train_rmse)
print("XGBoost Validation RMSE:", xgb_valid_rmse)
print("Difference between XGBoost Train and Validation RMSE:", abs(xgb_train_rmse - xgb_valid_rmse))

lgbm_train_preds = best_lgbm_model.predict(X_train_pca)
lgbm_valid_preds = best_lgbm_model.predict(X_valid_pca)
lgbm_train_rmse = np.sqrt(mean_squared_error(y_train, lgbm_train_preds))
lgbm_valid_rmse = np.sqrt(mean_squared_error(y_valid, lgbm_valid_preds))
print("LightGBM Train RMSE:", lgbm_train_rmse)
print("LightGBM Validation RMSE:", lgbm_valid_rmse)
print("Difference between LightGBM Train and Validation RMSE:", abs(lgbm_train_rmse - lgbm_valid_rmse))

stacked_train_rmse = np.sqrt(-cv_score_stacked.mean())
print("Difference between Stacked Model CV RMSE and Validation RMSE:", abs(stacked_train_rmse - stacked_valid_rmse))

#### Explanation

##### Hyperparameter Tuning for Gradient Boosting (XGBoost and LightGBM)
XGBoost Tuning: A GridSearchCV is used to find the best hyperparameters for the XGBoost regressor. The grid search evaluates every combination of learning rate, max depth, min child weight, subsample, and colsample_bytree with 5-fold cross-validation. It selects the combination in the grid with the lowest cross-validated mean squared error (MSE); scikit-learn expresses this as maximizing the negated MSE.

LightGBM Tuning: Similarly, a GridSearchCV is used for LightGBM, tuning num leaves, min data in leaf, learning rate, subsample, and colsample_bytree. The search selects the combination in the grid with the lowest cross-validated MSE.
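
The cost of an exhaustive grid search follows directly from the grid. The sketch below counts the candidates for the XGBoost grid used in the notebook; the LightGBM grid has the same shape (five parameters with three values each), so the count is identical.

```python
from math import prod

# Grid copied from the notebook's XGBoost search.
param_grid_xgb = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
}

cv_folds = 5
n_candidates = prod(len(values) for values in param_grid_xgb.values())  # 3 ** 5
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(n_candidates, n_fits)  # 243 1215
```

GridSearchCV additionally refits the best candidate once on all the data it was given, because `refit=True` is the default. Adding one more value to each parameter would raise the count to 4 ** 5 = 1024 candidates, which is why large grids are often replaced by a randomized search.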

##### Overfitting Control: Ensemble of Models
After fine-tuning, the best XGBoost and LightGBM models are combined into an ensemble with StackingRegressor. The stacking ensemble feeds the predictions of the individual models to a RandomForestRegressor, which acts as the final estimator and predicts the target variable. This method combines the strengths of the base models and aims to improve overall prediction accuracy and reduce overfitting.

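
The mechanics of stacking can be sketched with scikit-learn estimators only. Two assumptions apply: the base models here are a GradientBoostingRegressor and a Ridge regression standing in for the tuned XGBoost and LightGBM models (to keep the example free of extra dependencies), and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

# Stand-in base models (assumption); the notebook uses tuned XGBoost and LightGBM.
base_models = [
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=42)),
    ("ridge", Ridge(alpha=1.0)),
]
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    cv=5,  # the final estimator is trained on out-of-fold base predictions
)
stack.fit(X, y)

meta_features = stack.transform(X)  # one column of predictions per base model
preds = stack.predict(X)
print(meta_features.shape, preds.shape)
```

StackingRegressor clones and refits the base estimators it is given, so passing an already fitted model does not reuse that fit.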
##### Model Evaluation with Cross-Validation
The stacked model is evaluated with 5-fold cross-validation, and the root mean squared error (RMSE) is calculated to assess its performance. This step estimates the model's generalization error, although the estimate is somewhat optimistic because the base models were tuned on the same data.

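
The notebook summarizes cross-validation scores in two ways: some cells average the per-fold RMSE values (`np.sqrt(-scores).mean()`), while others take the square root of the mean MSE (`np.sqrt(-scores.mean())`). The two are not the same number, as this sketch with invented fold scores shows.

```python
import numpy as np

# Invented negated-MSE scores for three folds (assumption), as returned by
# cross_val_score(..., scoring='neg_mean_squared_error').
neg_mse = np.array([-100.0, -400.0, -900.0])

mean_of_rmse = np.sqrt(-neg_mse).mean()  # (10 + 20 + 30) / 3
rmse_of_mean = np.sqrt(-neg_mse.mean())  # sqrt(1400 / 3)

print(mean_of_rmse)            # 20.0
print(round(rmse_of_mean, 4))  # 21.6025
```

The square root of the mean is never smaller than the mean of the square roots, and the difference grows when the folds differ a lot. RMSE values are therefore only comparable across models when they were aggregated the same way.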
##### Prediction on the Validation Set
Predictions are made on the validation set using the stacked model, and the RMSE is calculated to measure the accuracy of these predictions. This step evaluates how well the model performs on unseen data.

##### Difference Between Training and Validation RMSE
The RMSE for both training and validation sets is calculated for XGBoost and LightGBM individually to assess the models' performance and detect any significant discrepancies, which could indicate overfitting.

The difference between the cross-validation RMSE of the stacked model and its validation RMSE is calculated to further evaluate the model's generalization capability.

#### Results

Stacked Model CV RMSE: The cross-validation RMSE for the stacked model is approximately 915.32. This value is the model's average error in predicting the target variable across different subsets of the training data and provides an estimate of its generalization error.

Stacked Model Validation RMSE: The RMSE on the validation set for the stacked model is about 951.37. This is slightly worse than its cross-validation performance, which suggests mild overfitting to the training data.

XGBoost Train vs. Validation RMSE: The XGBoost model shows a training RMSE of approximately 697.14 and a validation RMSE of about 1324.41. The large difference (627.27) between these two values indicates that the model performs well on the training data but poorly on unseen data, a clear sign of overfitting.

LightGBM Train vs. Validation RMSE: The LightGBM model has a training RMSE of approximately 884.64 and a validation RMSE of 731.55. The absolute difference (153.09) is smaller than for XGBoost, and the validation error is lower than the training error, so there is no sign of overfitting. A validation error below the training error may reflect the particular composition of the validation split.

Difference between Stacked Model CV RMSE and Validation RMSE: The small difference (36.05) between the stacked model's CV RMSE and its validation RMSE suggests that the ensemble approach has helped to mitigate overfitting, giving a more robust model.

In summary, the stacked model shows a promising balance between cross-validated performance and generalization to unseen data, with a much smaller gap than XGBoost. Among the three, the tuned LightGBM model has the lowest validation RMSE (731.55).
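
The gaps quoted above can be recomputed from the reported RMSE values. The sketch keeps the sign of each difference (validation minus reference), which the notebook's `abs(...)` discards; the sign shows that LightGBM is the one model whose validation error is lower than its training error.

```python
# RMSE values as reported in the results above: (reference, validation).
# The reference is the training RMSE, or the CV RMSE for the stacked model.
reported = {
    "XGBoost": (697.14, 1324.41),
    "LightGBM": (884.64, 731.55),
    "Stacked": (915.32, 951.37),
}

gaps = {name: round(valid - ref, 2) for name, (ref, valid) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
# XGBoost: +627.27
# LightGBM: -153.09
# Stacked: +36.05
```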

### Hyperparameter Tuning using NLP and Clusters for Keywords 

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Define the hyperparameter space to explore
random_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    # 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (1.0)
    'max_features': [1.0, 'sqrt']
}

# Initialize the RandomForest model
rf = RandomForestRegressor(random_state=42)

# Set up the randomized search with cross-validation
rf_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=random_grid,
    n_iter=100,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1,
    scoring='neg_mean_squared_error'
)

# Fit the randomized search
rf_random.fit(X_train, y_train)

# Best hyperparameters found
best_random_params = rf_random.best_params_
print("Best hyperparameters from the randomized search:", best_random_params)

# Best model; RandomizedSearchCV has already refit it on X_train (refit=True by default)
best_model_random = rf_random.best_estimator_

# Validation RMSE of the best model, fit on X_train only
valid_predictions_random = best_model_random.predict(X_valid)
valid_rmse_random = np.sqrt(mean_squared_error(y_valid, valid_predictions_random))

print("Validation RMSE with the best model from the randomized search:", valid_rmse_random)

# Refit the best model on the full training set
best_model_random.fit(df_train_cleaned[selected_features_extended], df_train_cleaned['Clicks'])

# Training RMSE
train_predictions_random = best_model_random.predict(df_train_cleaned[selected_features_extended])
train_rmse_random = np.sqrt(mean_squared_error(df_train_cleaned['Clicks'], train_predictions_random))
print("Training RMSE with the best model from the randomized search:", train_rmse_random)

# Validation RMSE after the refit. Caution: if X_valid was split from df_train_cleaned,
# the model has now seen these rows, so this second figure is optimistic.
valid_predictions_random = best_model_random.predict(X_valid)
valid_rmse_random = np.sqrt(mean_squared_error(y_valid, valid_predictions_random))
print("Validation RMSE with the best model from the randomized search:", valid_rmse_random)

# Check for overfitting by comparing training and validation RMSE
if train_rmse_random < valid_rmse_random:
    print("The model with the best hyperparameters may be overfitted.")

#### Results

Validation Set RMSE: The best model from the random search reaches an RMSE of 802.27 on the validation set. This figure is measured before the model is refit on the full training data, so it reflects the prediction error on data not used for fitting and indicates how the model generalizes.

Training Set RMSE: After the refit on the full training set, the RMSE on that same data is 474.17, which measures the model's accuracy on the data it was trained on.

Discrepancy in RMSE: The gap between the training RMSE (474.17) and the validation RMSE (802.27) signals potential overfitting, where the model performs much better on the training data than on unseen data. The second validation RMSE of 416.73 is computed after the model has been refit on `df_train_cleaned`. If the validation rows are part of that frame, which seems likely, the model has already seen them, so this figure is optimistic and should not be read as a generalization estimate.

Overfitting Concern: The initial comparison (474.17 versus 802.27) points to potential overfitting, as the model performs better on the training data than on the validation data.
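The effect of scoring a model on rows it has already been fit on can be illustrated in isolation. The sketch below is not the notebook's pipeline: it uses synthetic data and a single `DecisionTreeRegressor` in place of the random forest, both chosen as assumptions to keep the example small. It compares the validation RMSE before and after refitting on all rows, validation rows included.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the competition dataset)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(400, 3))
y = 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 30, size=400)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)


def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


model = DecisionTreeRegressor(random_state=42)

# Honest estimate: the model never sees the validation rows
model.fit(X_train, y_train)
honest_rmse = rmse(y_valid, model.predict(X_valid))

# Leaky estimate: refit on all rows, then score on rows that were part of the fit
model.fit(X, y)
leaky_rmse = rmse(y_valid, model.predict(X_valid))

print(f"Validation RMSE before the refit: {honest_rmse:.2f}")
print(f"Validation RMSE after refitting on all rows: {leaky_rmse:.2f}")
```

Because the second score is computed on rows used for fitting, it behaves like a training error. The validation RMSE should therefore be recorded before any refit on the full training data.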

### General Model 

In [43]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define the hyperparameter ranges to sample from
param_dist = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=tree_model, param_distributions=param_dist, n_iter=100, cv=5, scoring='neg_mean_squared_error', random_state=42)

# Run the randomized search
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params_random = random_search.best_params_
print("Best hyperparameters found by Randomized Search:", best_params_random)

# Get the best model
best_tree_model_random = random_search.best_estimator_

# Evaluate the best model
y_pred_best_random = best_tree_model_random.predict(X_test)
rmse_best_random = sqrt(mean_squared_error(y_test, y_pred_best_random))
print(f"RMSE of the best Decision Tree model from Randomized Search: {rmse_best_random}")


Best hyperparameters found by Randomized Search: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 9}
RMSE of the best Decision Tree model from Randomized Search: 1383.420307755057


In [61]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=DecisionTreeRegressor(random_state=42), param_grid=param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)
best_tree_model = grid_search.best_estimator_
y_pred_grid = best_tree_model.predict(X_test)
rmse_grid = sqrt(mean_squared_error(y_test, y_pred_grid))
print(f"Decision Tree RMSE after Grid Search: {rmse_grid}")
print("Best parameters found:", grid_search.best_params_)


Decision Tree RMSE after Grid Search: 1259.3739580324384
Best parameters found: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 10}


In [62]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR


models = {
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'Support Vector Regression': SVR()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    print(f"RMSE for {name}: {rmse}")

RMSE for Random Forest: 1220.4306512608287
RMSE for Gradient Boosting: 1428.87882421522
RMSE for Support Vector Regression: 923.0681355297471


In [63]:
# Define and fit the model with GridSearchCV

model = RandomForestRegressor(random_state=42)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                           scoring='neg_root_mean_squared_error', verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and the best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Prepare the test dataset for the final predictions
df_test_processed = df_full[df_full['set'] == 'Kaggle'][features_columns]

# Generate the final predictions for the test dataset
y_pred_test = best_model.predict(df_test_processed)

Fitting 5 folds for each of 81 candidates, totalling 405 fits


In [54]:
from sklearn.model_selection import cross_val_score

# Define the model with the best hyperparameters found
best_model = RandomForestRegressor(max_depth=10, min_samples_leaf=2, min_samples_split=10, n_estimators=100, random_state=42)

# Run cross-validation
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')

# Compute the mean RMSE and the standard deviation of the scores
avg_rmse = -cv_scores.mean()
std_rmse = cv_scores.std()

print(f"Mean cross-validation RMSE: {avg_rmse}")
print(f"Standard deviation of the cross-validation RMSE: {std_rmse}")


Mean cross-validation RMSE: 743.7941138905937
Standard deviation of the cross-validation RMSE: 314.72947669398854


#### Most important variables for predicting clicks

In [73]:
feature_importances = best_model.feature_importances_
important_features = pd.Series(feature_importances, index=X_train.columns).sort_values(ascending=False)
print(important_features)

Search Engine Bid      0.501392
Impressions            0.418091
Match Type_Exact       0.055221
Avg. Pos.              0.014070
Avg. Cost per Click    0.007077
Match Type_Broad       0.003370
Match Type_Advanced    0.000381
Match Type_Standard    0.000312
Match Type_Unknown     0.000087
dtype: float64


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define the regression model
model = RandomForestRegressor(random_state=42)

# Extend the hyperparameter ranges
param_dist_extended = {
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': randint(2, 15),
    'min_samples_leaf': randint(1, 10),
    'n_estimators': randint(100, 500)  # Wider range for n_estimators
}

# Initialize RandomizedSearchCV with a larger number of iterations
random_search_extended = RandomizedSearchCV(estimator=model, param_distributions=param_dist_extended,
                                             n_iter=200, cv=5, scoring='neg_mean_squared_error',
                                             random_state=42, verbose=2, n_jobs=-1)

random_search_extended.fit(X_train, y_train)

# Best model and parameters
best_model_extended = random_search_extended.best_estimator_
print("Best hyperparameters:", random_search_extended.best_params_)



Fitting 5 folds for each of 200 candidates, totalling 1000 fits


In [68]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define the hyperparameter ranges for the decision tree model
param_dist_tree = {
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

# Initialize RandomizedSearchCV for the decision tree
random_search_tree = RandomizedSearchCV(estimator=DecisionTreeRegressor(random_state=42),
                                        param_distributions=param_dist_tree, n_iter=200, cv=5,
                                        scoring='neg_mean_squared_error', random_state=42, verbose=2, n_jobs=-1)

# Run the randomized search for the decision tree
random_search_tree.fit(X_train, y_train)

# Get and print the best hyperparameters for the decision tree
print("Decision Tree - Best hyperparameters:", random_search_tree.best_params_)


Fitting 5 folds for each of 200 candidates, totalling 1000 fits
Decision Tree - Best hyperparameters: {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 11}


In [69]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

# Best hyperparameters obtained
best_hyperparams = {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 11}

# Train the Decision Tree with the best hyperparameters
best_decision_tree = DecisionTreeRegressor(max_depth=best_hyperparams['max_depth'],
                                           min_samples_leaf=best_hyperparams['min_samples_leaf'],
                                           min_samples_split=best_hyperparams['min_samples_split'],
                                           random_state=42)
best_decision_tree.fit(X_train, y_train)

# Make predictions and compute the RMSE
y_pred_decision_tree = best_decision_tree.predict(X_test)
rmse_decision_tree = sqrt(mean_squared_error(y_test, y_pred_decision_tree))
print(f"Decision Tree RMSE with the best hyperparameters: {rmse_decision_tree}")


Decision Tree RMSE with the best hyperparameters: 1259.396863451618


In [67]:
from sklearn.ensemble import StackingRegressor

# Define the base estimators
estimators = [
    ('random_forest', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('gradient_boosting', GradientBoostingRegressor(random_state=42))
]

# Build the stacking ensemble
stacking_ensemble = StackingRegressor(estimators=estimators, final_estimator=DecisionTreeRegressor(max_depth=10, random_state=42))

# Train the stacking ensemble
stacking_ensemble.fit(X_train, y_train)

# Evaluate the stacking ensemble
y_pred_stacking = stacking_ensemble.predict(X_test)
rmse_stacking = sqrt(mean_squared_error(y_test, y_pred_stacking))
print(f"RMSE of the stacking ensemble: {rmse_stacking}")



RMSE of the stacking ensemble: 1489.7344840712556


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part IV: Preparing Submission File for Kaggle</h2><br>
The code below will store the predicted values for each of the models above.

### Submission for General Model 

In [56]:
# Build the submission DataFrame
submission = pd.DataFrame({
    'entry_id': df_test['entry_id'],  # Make sure df_test has 'entry_id'
    'Clicks': y_pred_test
})

# Export the submission DataFrame to a CSV file
submission_filename = 'decision_tree_kaggle_submission_updated1.csv'
submission.to_csv(submission_filename, index=False)



### Submission for LightGBM Model using transformations and polynomials

In [None]:
# Preprocess the numeric columns in the test data
df_test = preprocess_numeric(df_test)

# Apply to the test set the same transformations that were applied to the training set
X_test_poly = poly_features.transform(df_test[['Search Engine Bid', 'Impressions', 'Avg. Pos.']])

# Apply PCA to the test set
X_test_pca = pca.transform(X_test_poly)  # Assumes 'pca' is already defined

# Predict on the test set with the trained model
y_pred_test_lgbm = best_lgbm_model.predict(X_test_pca)

# Build the submission DataFrame
submission_lgbm = pd.DataFrame({
    'entry_id': df_test['entry_id'],  # Make sure 'entry_id' is in the test set
    'Clicks': y_pred_test_lgbm
})

# Export the DataFrame to a CSV file for submission
submission_filename_lgbm = 'lgbm_submission.csv'
submission_lgbm.to_csv(submission_filename_lgbm, index=False)

print(f"Submission file created: {submission_filename_lgbm}")


### Submission for NLP and Clustering Keywords 

In [None]:
# Train the XGBoost model on the full training set
best_xgb_model.fit(X_train_pca, y_train)

# Apply the same transformations to the test set
X_test_pca = pca.transform(X_test_poly)

# Predict on the test set with the trained model
y_pred_test_xgb = best_xgb_model.predict(X_test_pca)

# Build the submission DataFrame
submission_xgb = pd.DataFrame({
    'entry_id': df_test['entry_id'],  # Make sure 'entry_id' is in the test set
    'Clicks': y_pred_test_xgb
})

# Export the DataFrame to a CSV file for submission
submission_filename_xgb = 'xgb_submission.csv'
submission_xgb.to_csv(submission_filename_xgb, index=False)

print(f"Submission file created: {submission_filename_xgb}")

### Conclusion and Recommendation  

This notebook brings together the three approaches applied during the business challenge. The original idea was to segment the keywords with NLP techniques and use the segments in decision-tree models to predict clicks, knowing that impressions and search engine bid were the strongest features across the models and therefore the dominant variables. In the second attempt, since the keywords did not lower the RMSE, only impressions and search engine bid were transformed to better suit this family of models. This improved the cross-validation results of models such as XGBoost and LightGBM, but the validation results were not reproduced on Kaggle. The final approach was therefore more general: no transformations were applied to the dominant variables, and a basic Random Forest model predicted clicks from the original test dataset. This was the approach that performed best when the file was submitted to Kaggle.


Predictive models based on neural networks, or models such as SVM, might have changed the result, because they may be better able to cover the full dimensionality of the keywords. Another option was to group the keywords into clusters based on campaigns or tourist destinations, but because resources were focused on modeling and tuning, this approach was not addressed. With such clusters, the keyword variable might have carried more weight in the models when predicting clicks; from a marketing perspective, we would expect this to hold in practice. Keyword strategy depends on proper planning, selection and adjustment of words to improve click-throughs, as this best explains the number of click-throughs that can be obtained from the target segment.
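As a rough illustration of the keyword grouping mentioned above, the sketch below clusters a handful of keywords by their word content. It is not what was done in the challenge: the keyword list is hypothetical, and the choice of TF-IDF vectors with `KMeans` and of two clusters are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical keywords for illustration; not taken from the competition data
keywords = [
    "cheap flights paris",
    "paris flight deals",
    "flights to rome",
    "rome airfare discount",
    "cheap tickets paris",
    "rome flight offers",
]

# Represent each keyword as a TF-IDF vector over its words
vectorizer = TfidfVectorizer()
X_keywords = vectorizer.fit_transform(keywords)

# Group the keywords into two clusters (the number of clusters is an assumption)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_keywords)

for keyword, label in zip(keywords, cluster_labels):
    print(label, keyword)
```

The resulting cluster label could then be one-hot encoded and added to the feature set, in the same way as `Match Type`. Whether it would lower the RMSE would have to be tested.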