## Predictive modeling

Import the libraries and functions needed for predictive modeling.

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

Load the dataset.

In [10]:
df = pd.read_csv("./merged_data.csv")

Split the data into features (x) and the target variable (y).

In [11]:
x = df.drop("sold_amount", axis=1)
y = df["sold_amount"]

Split the data into training and test sets, holding out 20% of the rows for testing.

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
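
As a quick sanity check on what `test_size=0.2` does, the sketch below applies the same call to a small made-up DataFrame. The `price` column and all values are hypothetical; only the `sold_amount` target name comes from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: "price" and every value here are invented for illustration.
toy = pd.DataFrame(
    {
        "price": [10.0, 12.5, 9.0, 11.0, 14.0, 8.5, 13.0, 10.5, 9.5, 12.0],
        "sold_amount": [1, 2, 1, 3, 2, 1, 4, 2, 1, 3],
    }
)
toy_x = toy.drop("sold_amount", axis=1)
toy_y = toy["sold_amount"]

# Same arguments as the notebook cell: 20% of rows go to the test set,
# and random_state makes the shuffle reproducible.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)

split_sizes = (len(toy_x_train), len(toy_x_test))
print(split_sizes)
```

With 10 rows, the split leaves 8 rows for training and 2 for testing, and no row appears in both sets.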

Define the preprocessing for numeric and categorical features.

In [13]:
numeric_features = x.select_dtypes(include=["int64", "float64"]).columns
numeric_transformer = Pipeline(
    steps=[
        # Fill missing numeric values with the column mean
        # (another strategy, such as "median", can be used instead).
        ("imputer", SimpleImputer(strategy="mean")),
    ]
)

In [14]:
categorical_features = x.select_dtypes(include=["object"]).columns
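categorical_transformer = Pipeline(
    steps=[
        # Fill missing categories with the most frequent value
        # (a different imputation strategy can be used instead).
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during fit are encoded as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)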

Combine the numeric and categorical preprocessing.

In [15]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
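
To make the effect of the combined preprocessing concrete, here is a minimal sketch on a made-up four-row DataFrame. The column names `price` and `category` and all values are hypothetical, and columns are selected by name here rather than by dtype as in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame(
    {
        "price": [10.0, np.nan, 30.0, 20.0],
        "category": pd.Series(["a", "a", np.nan, "b"], dtype=object),
    }
)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean"))]), ["price"]),
        (
            "cat",
            Pipeline(
                [
                    ("imputer", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            ["category"],
        ),
    ]
)


def to_dense(matrix):
    # ColumnTransformer may return a sparse matrix or a dense array.
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


# Output columns: imputed price, one-hot "a", one-hot "b".
dense = to_dense(toy_preprocessor.fit_transform(toy))
print(dense)

# A category never seen during fit ("z") becomes an all-zero one-hot row.
new_row = pd.DataFrame({"price": [5.0], "category": pd.Series(["z"], dtype=object)})
unseen = to_dense(toy_preprocessor.transform(new_row))
print(unseen)
```

The missing price is replaced by the mean of the observed prices (20.0), the missing category by the most frequent one (`"a"`), and the unknown category `"z"` is encoded as zeros instead of raising an error.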

Define the model.

In [16]:
model = RandomForestRegressor(n_estimators=100, random_state=42)

Create the pipeline, chaining the preprocessing and the model.

In [17]:
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])

Fit the model on the training set.

In [18]:
pipeline.fit(x_train, y_train)

Make predictions on the test set.

In [19]:
y_pred = pipeline.predict(x_test)

Evaluate the model using the mean squared error on the test set.

In [20]:
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 0.00014764677341096556
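
An MSE value is hard to judge without knowing the scale of the target. One common way to put it in context, sketched below on synthetic data (not the notebook's dataset), is to take the square root to get an RMSE in the target's own units and to compare it against a baseline that always predicts the training mean. The data-generating formula and all variable names here are assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: the target depends linearly on the first feature plus noise.
rng = np.random.default_rng(42)
features = rng.uniform(0, 1, size=(200, 2))
target = 3.0 * features[:, 0] + rng.normal(0, 0.1, size=200)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(f_train, t_train)
# Baseline that ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(f_train, t_train)

# RMSE is the square root of the MSE, expressed in the same units as the target.
forest_rmse = np.sqrt(mean_squared_error(t_test, forest.predict(f_test)))
baseline_rmse = np.sqrt(mean_squared_error(t_test, baseline.predict(f_test)))

print(f"Random forest RMSE: {forest_rmse:.4f}")
print(f"Mean baseline RMSE: {baseline_rmse:.4f}")
```

Applied to the notebook, the same comparison would use `pipeline`, `x_train`, `x_test`, `y_train`, and `y_test`; a model is only informative to the extent that its error is clearly below the baseline's.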
