# SpaceX Falcon 9 – Predictive Analysis (Classification)

Este notebook entrena **múltiples modelos de clasificación** para predecir el resultado de lanzamiento (éxito/fracaso) usando el dataset del Capstone de IBM.

**Repositorio del proyecto:** https://github.com/martumaida09-a11y/Files-with-proyect-IBM


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt


## 1) Cargar datos

Se intenta leer `dataset_part_2.csv` local. Si no existe, se descarga desde un repositorio público.


In [None]:
import os, urllib.request

local_path = "dataset_part_2.csv"
if not os.path.exists(local_path):
    url = "https://raw.githubusercontent.com/chuksoo/IBM-Data-Science-Capstone-SpaceX/main/dataset_part_2.csv"
    print(f"Descargando dataset desde: {url}")
    urllib.request.urlretrieve(url, local_path)

df = pd.read_csv(local_path)
df.head()


## 2) Selección de features + One-Hot Encoding

- Variables numéricas: `FlightNumber`, `PayloadMass`, `Flights`, `Block`, `ReusedCount`
- Variables binarias: `GridFins`, `Reused`, `Legs`
- Variables categóricas: `Orbit`, `LaunchSite`, `LandingPad`, `Serial`


In [None]:
features = [
    'FlightNumber','PayloadMass','Orbit','LaunchSite','Flights',
    'GridFins','Reused','Legs','LandingPad','Block','ReusedCount','Serial'
]
X = df[features]
y = df['Class'].astype(int)

X_encoded = pd.get_dummies(
    X,
    columns=['Orbit','LaunchSite','LandingPad','Serial'],
    drop_first=False
)

print("Shape X:", X_encoded.shape)
print("Shape y:", y.shape)


## 3) Split train/test + escalado

> Igual que en los labs: `test_size=0.2`, `random_state=2`.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=2
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print("Train:", X_train_s.shape, " Test:", X_test_s.shape)


## 4) Entrenar múltiples modelos + evaluación

Se compara:
- Logistic Regression
- SVM
- Decision Tree
- KNN

Con una búsqueda de hiperparámetros simple (`GridSearchCV`, `cv=5`).


In [None]:
param_logreg = {'C':[0.01,0.1,1,10], 'penalty':['l2'], 'solver':['lbfgs']}
param_svm = {'kernel':['linear','rbf'], 'C':[0.1,1,10], 'gamma':['scale','auto']}
param_tree={'criterion':['gini','entropy'], 'max_depth':[2,4,6,8,None], 'min_samples_split':[2,3,4]}
param_knn={'n_neighbors':[1,3,5,7,9], 'p':[1,2]}

models = {
    "Logistic Regression": (LogisticRegression(max_iter=5000), param_logreg, True),
    "SVM": (SVC(), param_svm, True),
    "Decision Tree": (DecisionTreeClassifier(random_state=2), param_tree, False),
    "KNN": (KNeighborsClassifier(), param_knn, True),
}

rows = []
best = {}

for name, (estimator, params, needs_scaling) in models.items():
    Xtr = X_train_s if needs_scaling else X_train
    Xte = X_test_s if needs_scaling else X_test

    gs = GridSearchCV(estimator, params, cv=5, scoring="accuracy")
    gs.fit(Xtr, y_train)

    pred = gs.predict(Xte)
    acc = accuracy_score(y_test, pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_test, pred, average='binary', pos_label=1)
    rows.append([name, gs.best_params_, acc, prec, rec, f1])

    best[name] = gs.best_estimator_

results = pd.DataFrame(rows, columns=["Modelo","Mejores parámetros","Accuracy","Precision","Recall","F1"])
results.sort_values("Accuracy", ascending=False)


## 5) Mejor modelo + Matriz de confusión

Se elige el modelo con mejor **Accuracy/F1** (si hay empate, se prioriza interpretabilidad).


In [None]:
# Elegimos Logistic Regression si está entre los mejores (interpretabilidad)
best_model_name = "Logistic Regression"
best_model = best[best_model_name]

pred = best_model.predict(X_test_s)
cm = confusion_matrix(y_test, pred)
cm


In [None]:
fig, ax = plt.subplots(figsize=(5,4))
im = ax.imshow(cm)
ax.set_xticks([0,1]); ax.set_yticks([0,1])
ax.set_xticklabels(['Pred 0','Pred 1']); ax.set_yticklabels(['True 0','True 1'])
for (i,j), val in np.ndenumerate(cm):
    ax.text(j, i, str(val), ha='center', va='center',
            color='white' if val>cm.max()/2 else 'black', fontsize=14)
ax.set_title('Confusion Matrix – Logistic Regression')
fig.tight_layout()
plt.show()


## 6) Conclusión

- El modelo **Logistic Regression** logra buen desempeño y es interpretable.
- Próximos pasos: incorporar variables externas (clima, viento), calibrar el umbral para reducir falsos positivos y evaluar validación temporal.
