In [6]:
import pandas as pd
import numpy as np

In [7]:
#Read the data
data = pd.read_csv("/content/train.csv")
xTest= pd.read_csv("/content/test.csv")

xTrain = data.drop(["Survived"], axis=1)
yTrain = np.array(data.Survived)

#Let's take a look at the data
data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [8]:
#First save the "PassengerId" column for the submission file
passengerId = xTest.PassengerId

#We can eliminate the column "name" and "PassengerId" columns because they have no mutual information.
xTrain.drop(columns=["Name", "PassengerId"], axis=1, inplace=True)
xTest.drop(columns=["Name", "PassengerId"], axis=1, inplace=True)

In [None]:
#Let's look how many unique values has the 'Ticket' column
print("Unique values in the ticket column from the data: ", len(xTrain["Ticket"].unique()))
print("Unique values in the ticket column from the data: ", len(xTest["Ticket"].unique()))

Unique values in the ticket column from the data:  681
Unique values in the ticket column from the data:  363


It seems too many values to do the *One Hot Encoding*, so I will just drop them


In [None]:
xTrain.drop(columns=["Ticket"], axis=1, inplace=True)
xTest.drop(columns=["Ticket"], axis=1, inplace=True)

In [None]:
#Now with the 'Cabin' column.
print("Unique values in the 'Cabin' column from the data: ", len(xTrain["Cabin"].unique()))
print("Unique values in the 'Cabin' column from the data: ", len(xTest["Cabin"].unique()))

Unique values in the 'Cabin' column from the data:  148
Unique values in the 'Cabin' column from the data:  77


In [None]:
#There are also too many unique values in those columns, let's drop them too
xTrain.drop(columns=["Cabin"], axis=1, inplace=True)
xTest.drop(columns=["Cabin"], axis=1, inplace=True)

In [None]:
#Now with the 'Embarked' column.
print("Unique values in the 'Cabin' column from the data: ", len(xTrain["Embarked"].unique()))
print("Unique values in the 'Cabin' column from the data: ", len(xTest["Embarked"].unique()))

Unique values in the 'Cabin' column from the data:  4
Unique values in the 'Cabin' column from the data:  3


We can do the *One Hot* with the 'Embarked' columns because they have only 4 and 3 unique values. It won't be any problem to handle them.

Now let's see how many *Null* values we have to treat them.

In [None]:
xTrain.isnull().sum()

Unnamed: 0,0
Pclass,0
Sex,0
Age,177
SibSp,0
Parch,0
Fare,0
Embarked,2


In [None]:
xTest.isnull().sum()

Unnamed: 0,0
Pclass,0
Sex,0
Age,86
SibSp,0
Parch,0
Fare,1
Embarked,0


In [None]:
#Starting with the "Age", "Embarked" and "Fare" columns let's impute the values with the mode and median from each of them.
xTrain["Age"] = xTrain["Age"].fillna(xTrain["Age"].median())
xTest["Age"] = xTest["Age"].fillna(xTest["Age"].median())
xTrain["Embarked"] = xTrain["Embarked"].fillna(xTrain["Embarked"].mode()[0])
xTest["Embarked"] = xTest["Embarked"].fillna(xTest["Embarked"].mode()[0])
xTrain["Fare"] = xTrain["Fare"].fillna(xTrain["Fare"].median())
xTest["Fare"] = xTest["Fare"].fillna(xTest["Fare"].median())

#Let's do the One Hot with the "Embarked" and "Sex" column
categoricalCols=["Sex", "Embarked"]
xTrain = pd.get_dummies(xTrain, columns=categoricalCols, drop_first=True, dtype=int)
xTest = pd.get_dummies(xTest, columns=categoricalCols, drop_first=True, dtype=int)

xTrain.astype('int64')
xTest.astype('int64')
print(xTrain)

KeyError: 'Embarked'

Now that we don't have null values, and we handled the Categorical columns. Let's start making the model.

In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt

modelo = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(),  # Capa de normalización
    tf.keras.layers.Dense(1, input_shape=[9]), #Capa de entrada
    tf.keras.layers.Dense(16, activation='relu'),  # Capa oculta
    tf.keras.layers.Dense(1, activation='sigmoid'),  # Capa oculta
])

modelo.compile(optimizer=tf.keras.optimizers.Adam(learning_rate= 0.004),
               loss='binary_crossentropy',
               metrics=["accuracy"])

history = modelo.fit(
    xTrain, yTrain,
    epochs=80,
    validation_split=0.2,  # Divide en train/validation
    batch_size=16,
    verbose=False
)

print("Modelo entrenado correctamente")

plt.xlabel("Number of epochs")
plt.ylabel("Loss")
plt.plot(history.history['loss'])

In [1]:
#Make the predictions
print(np.array(xTest).shape)
prediccion = modelo.predict(np.array(xTest).reshape(418, 8))

#Convert the probability (from 0 to 1) into binary result (0 or 1)
prediccion = np.array([1 if x > 0.5 else 0 for x in prediccion])

#Concat the passenger id and the prediction into a single dataframe
prediccionDf = pd.DataFrame({'PassengerId': passengerId, 'Survived':prediccion})

#Convert it into a csv file so we can submit it
prediccionDf.to_csv('gender_submission.csv', index="passengerId", sep=',')

#Let's print it
submission = pd.read_csv('/content/gender_submission.csv')
display(submission.head())

NameError: name 'np' is not defined

# Task
Explícame en qué consisten estos datasets

Here is all the data you need:
"test.csv"
"train.csv"

## Data loading

### Subtask:
Load the provided CSV files into pandas DataFrames.


**Reasoning**:
I need to import the pandas library and load the two CSV files into pandas DataFrames.



In [1]:
import pandas as pd

try:
    df_test = pd.read_csv('test.csv')
    df_train = pd.read_csv('train.csv')
    display(df_test.head())
    display(df_train.head())
except FileNotFoundError:
    print("Error: One or both of the CSV files were not found.")
except pd.errors.ParserError:
    print("Error: There was an issue parsing one or both of the CSV files.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Data exploration

### Subtask:
Examine the structure and data types of the loaded DataFrames.


**Reasoning**:
Examine the structure and data types of the DataFrames using .info(), .describe(), and check for missing values using .isnull().sum(). Also, analyze unique values in categorical columns using .unique() or .value_counts().



In [2]:
# Display information about df_test
print("DataFrame: df_test")
df_test.info()
display(df_test.describe())

# Analyze unique values in categorical columns for df_test
for col in ['Pclass', 'Sex', 'Embarked']:
    print(f"\nUnique values for {col}:")
    print(df_test[col].value_counts())

print("\nMissing values in df_test:")
print(df_test.isnull().sum())

# Display information about df_train
print("\n\nDataFrame: df_train")
df_train.info()
display(df_train.describe())

# Analyze unique values in categorical columns for df_train
for col in ['Survived', 'Pclass', 'Sex', 'Embarked']:
    print(f"\nUnique values for {col}:")
    print(df_train[col].value_counts())

print("\nMissing values in df_train:")
print(df_train.isnull().sum())

DataFrame: df_test
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292



Unique values for Pclass:
Pclass
3    218
1    107
2     93
Name: count, dtype: int64

Unique values for Sex:
Sex
male      266
female    152
Name: count, dtype: int64

Unique values for Embarked:
Embarked
S    270
C    102
Q     46
Name: count, dtype: int64

Missing values in df_test:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


DataFrame: df_train
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-n

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292



Unique values for Survived:
Survived
0    549
1    342
Name: count, dtype: int64

Unique values for Pclass:
Pclass
3    491
1    216
2    184
Name: count, dtype: int64

Unique values for Sex:
Sex
male      577
female    314
Name: count, dtype: int64

Unique values for Embarked:
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

Missing values in df_train:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


## Data exploration

### Subtask:
Determine the meaning of each column in the `df_test` and `df_train` DataFrames.


**Reasoning**:
Analyze each column in both dataframes to understand its meaning and potential relevance to a machine learning model. Compare the columns to identify differences between the datasets. Document the findings in a structured manner.



In [3]:
# Column analysis and comparison

column_analysis = {}

# Iterate through columns in df_train
for col in df_train.columns:
  column_analysis[col] = {}
  column_analysis[col]['description'] = ''
  column_analysis[col]['dtype'] = df_train[col].dtype
  column_analysis[col]['unique_values'] = df_train[col].unique()[:5]  # Show a few unique values
  column_analysis[col]['missing_values'] = df_train[col].isnull().sum()
  if col in df_test.columns:
    column_analysis[col]['present_in_test'] = True
    column_analysis[col]['dtype_test'] = df_test[col].dtype
    column_analysis[col]['unique_values_test'] = df_test[col].unique()[:5] #Show a few unique values
  else:
      column_analysis[col]['present_in_test'] = False

#Iterate through columns in df_test to find unique columns
for col in df_test.columns:
    if col not in df_train.columns:
        column_analysis[col] = {}
        column_analysis[col]['description'] = ''
        column_analysis[col]['dtype'] = df_test[col].dtype
        column_analysis[col]['unique_values'] = df_test[col].unique()[:5]
        column_analysis[col]['missing_values'] = df_test[col].isnull().sum()
        column_analysis[col]['present_in_test'] = True

#Display the analysis
for col, details in column_analysis.items():
    print(f"Column: {col}")
    print(f"  Data Type (train): {details['dtype']}")
    if 'dtype_test' in details:
        print(f"  Data Type (test): {details['dtype_test']}")
    print(f"  Unique Values (train): {details['unique_values']}")
    if 'unique_values_test' in details:
        print(f"  Unique Values (test): {details['unique_values_test']}")
    print(f"  Missing Values (train): {details['missing_values']}")
    print(f"  Present in Test set: {details['present_in_test']}")
    print("-"*20)

Column: PassengerId
  Data Type (train): int64
  Data Type (test): int64
  Unique Values (train): [1 2 3 4 5]
  Unique Values (test): [892 893 894 895 896]
  Missing Values (train): 0
  Present in Test set: True
--------------------
Column: Survived
  Data Type (train): int64
  Unique Values (train): [0 1]
  Missing Values (train): 0
  Present in Test set: False
--------------------
Column: Pclass
  Data Type (train): int64
  Data Type (test): int64
  Unique Values (train): [3 1 2]
  Unique Values (test): [3 2 1]
  Missing Values (train): 0
  Present in Test set: True
--------------------
Column: Name
  Data Type (train): object
  Data Type (test): object
  Unique Values (train): ['Braund, Mr. Owen Harris'
 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
 'Heikkinen, Miss. Laina' 'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
 'Allen, Mr. William Henry']
  Unique Values (test): ['Kelly, Mr. James' 'Wilkes, Mrs. James (Ellen Needs)'
 'Myles, Mr. Thomas Francis' 'Wirz, Mr. Albert'


## Data analysis

### Subtask:
Compare columns in `df_test` and `df_train` and identify discrepancies.


**Reasoning**:
Compare the columns of the two dataframes and identify discrepancies in columns, data types, and unique values for categorical columns.



In [4]:
# Identify columns present in one DataFrame but not the other
train_cols = set(df_train.columns)
test_cols = set(df_test.columns)

cols_in_train_not_test = train_cols - test_cols
cols_in_test_not_train = test_cols - train_cols

print("Columns in train but not in test:", cols_in_train_not_test)
print("Columns in test but not in train:", cols_in_test_not_train)

# Compare data types for common columns
common_cols = train_cols.intersection(test_cols)
for col in common_cols:
    if df_train[col].dtype != df_test[col].dtype:
        print(f"Data type mismatch for column '{col}':")
        print(f"  train: {df_train[col].dtype}")
        print(f"  test: {df_test[col].dtype}")

# Compare unique values for categorical columns
categorical_cols = ['Pclass', 'Sex', 'Embarked']  # Add other categorical columns as needed
for col in categorical_cols:
    if col in common_cols:
        train_unique = set(df_train[col].unique())
        test_unique = set(df_test[col].unique())
        if train_unique != test_unique:
            print(f"Unique value mismatch for column '{col}':")
            print(f"  train: {train_unique}")
            print(f"  test: {test_unique}")

Columns in train but not in test: {'Survived'}
Columns in test but not in train: set()
Unique value mismatch for column 'Embarked':
  train: {'C', nan, 'S', 'Q'}
  test: {'C', 'S', 'Q'}


## Data analysis

### Subtask:
Analyze each column's meaning and relevance to a machine learning task.


**Reasoning**:
Analyze each column's meaning and relevance to a machine learning task, such as predicting passenger survival, based on the provided dataframes and previous analysis.



In [5]:
column_analysis = {}

for df_name, df in [('train', df_train), ('test', df_test)]:
    for col in df.columns:
        if col not in column_analysis:
            column_analysis[col] = {}
        column_analysis[col][df_name] = {}
        column_analysis[col][df_name]['dtype'] = df[col].dtype
        column_analysis[col][df_name]['missing_values'] = df[col].isnull().sum()
        column_analysis[col][df_name]['unique_values'] = df[col].unique()[:5]

for col, details in column_analysis.items():
    print(f"Column: {col}")
    if 'train' in details:
        print(f"  Train Data Type: {details['train']['dtype']}")
        print(f"  Train Missing Values: {details['train']['missing_values']}")
        print(f"  Train Unique Values: {details['train']['unique_values']}")
    if 'test' in details:
        print(f"  Test Data Type: {details['test']['dtype']}")
        print(f"  Test Missing Values: {details['test']['missing_values']}")
        print(f"  Test Unique Values: {details['test']['unique_values']}")

    # Analysis and potential relevance to survival prediction
    if col == 'PassengerId':
        print("  Meaning: Unique identifier for each passenger.  Likely irrelevant to survival.")
    elif col == 'Survived':
        print("  Meaning: Survival status (0 = No, 1 = Yes). This is the target variable.")
    elif col == 'Pclass':
        print("  Meaning: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).  Potentially relevant, as higher class might correlate with better survival.")
    elif col == 'Name':
        print("  Meaning: Passenger name.  Potentially relevant for extracting titles (Mr., Mrs., Miss), which might correlate with survival.")
    elif col == 'Sex':
        print("  Meaning: Passenger gender. Likely relevant, as historical data suggests different survival rates for males and females.")
    elif col == 'Age':
        print("  Meaning: Passenger age.  Potentially relevant, as age groups might exhibit different survival rates.")
    elif col == 'SibSp':
        print("  Meaning: Number of siblings/spouses aboard.  Potentially relevant; larger families might have faced challenges during evacuation.")
    elif col == 'Parch':
        print("  Meaning: Number of parents/children aboard.  Similar to 'SibSp', family size could be relevant.")
    elif col == 'Ticket':
        print("  Meaning: Ticket number. Potentially relevant, but its encoding may require special consideration.")
    elif col == 'Fare':
        print("  Meaning: Passenger fare. Potentially relevant, as higher fares could imply better cabin location or class.")
    elif col == 'Cabin':
        print("  Meaning: Cabin number. Potentially relevant, as cabin location might have influenced survival. Many missing values.")
    elif col == 'Embarked':
        print("  Meaning: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). Potentially relevant.")
    print("-" * 20)

Column: PassengerId
  Train Data Type: int64
  Train Missing Values: 0
  Train Unique Values: [1 2 3 4 5]
  Test Data Type: int64
  Test Missing Values: 0
  Test Unique Values: [892 893 894 895 896]
  Meaning: Unique identifier for each passenger.  Likely irrelevant to survival.
--------------------
Column: Survived
  Train Data Type: int64
  Train Missing Values: 0
  Train Unique Values: [0 1]
  Meaning: Survival status (0 = No, 1 = Yes). This is the target variable.
--------------------
Column: Pclass
  Train Data Type: int64
  Train Missing Values: 0
  Train Unique Values: [3 1 2]
  Test Data Type: int64
  Test Missing Values: 0
  Test Unique Values: [3 2 1]
  Meaning: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).  Potentially relevant, as higher class might correlate with better survival.
--------------------
Column: Name
  Train Data Type: object
  Train Missing Values: 0
  Train Unique Values: ['Braund, Mr. Owen Harris'
 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
 'Heikkin

In [1]:
import pandas as pd
import numpy as np

# Load the data
try:
    data = pd.read_csv("/content/train.csv")
    xTest = pd.read_csv("/content/test.csv")
except FileNotFoundError:
    print("Error: One or both of the CSV files were not found.")
    exit() # Exit if files are not found

# Separate target variable from training data
xTrain = data.drop(["Survived"], axis=1)
yTrain = np.array(data.Survived)

# Save PassengerId for submission
passengerId = xTest.PassengerId

# --- Preprocessing Steps ---

# 1. Handle Missing Data

# Impute 'Age' with the median
xTrain["Age"].fillna(xTrain["Age"].median(), inplace=True)
xTest["Age"].fillna(xTest["Age"].median(), inplace=True)

# Impute 'Embarked' with the mode
xTrain["Embarked"].fillna(xTrain["Embarked"].mode()[0], inplace=True)
xTest["Embarked"].fillna(xTest["Embarked"].mode()[0], inplace=True)

# Impute 'Fare' with the median (only needed for test set as per initial analysis)
xTest["Fare"].fillna(xTest["Fare"].median(), inplace=True)


# 2. Feature Engineering: Extract Titles from Names

def extract_title(name):
    title_search = re.search(' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

import re
xTrain['Title'] = xTrain['Name'].apply(extract_title)
xTest['Title'] = xTest['Name'].apply(extract_title)

# Consolidate rare titles
xTrain['Title'] = xTrain['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
xTrain['Title'] = xTrain['Title'].replace('Mlle', 'Miss')
xTrain['Title'] = xTrain['Title'].replace('Ms', 'Miss')
xTrain['Title'] = xTrain['Title'].replace('Mme', 'Mrs')

xTest['Title'] = xTest['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
xTest['Title'] = xTest['Title'].replace('Mlle', 'Miss')
xTest['Title'] = xTest['Title'].replace('Ms', 'Miss')
xTest['Title'] = xTest['Title'].replace('Mme', 'Mrs')


# 3. Drop Irrelevant Columns

# Drop 'Name', 'PassengerId', 'Ticket', and 'Cabin' as decided in previous analysis
xTrain.drop(columns=["Name", "PassengerId", "Ticket", "Cabin"], axis=1, inplace=True)
xTest.drop(columns=["Name", "PassengerId", "Ticket", "Cabin"], axis=1, inplace=True)


# 4. One-Hot Encode Categorical Features

categoricalCols=["Sex", "Embarked", "Title"]
xTrain = pd.get_dummies(xTrain, columns=categoricalCols, drop_first=True, dtype=int)
xTest = pd.get_dummies(xTest, columns=categoricalCols, drop_first=True, dtype=int)

# Align columns - important for consistent feature sets between train and test
train_cols = list(xTrain.columns)
test_cols = list(xTest.columns)

for col in train_cols:
    if col not in test_cols:
        xTest[col] = 0  # Add missing columns to xTest with 0
for col in test_cols:
    if col not in train_cols:
        xTrain[col] = 0 # Add missing columns to xTrain with 0

xTest = xTest[train_cols] # Ensure columns are in the same order


# Convert data types to int64 after one-hot encoding
xTrain = xTrain.astype('int64')
xTest = xTest.astype('int64')

print("Preprocessing complete. Displaying first 5 rows of preprocessed xTrain:")
display(xTrain.head())
print("\nDisplaying first 5 rows of preprocessed xTest:")
display(xTest.head())

# Now you can proceed with defining and training your neural network model

Preprocessing complete. Displaying first 5 rows of preprocessed xTrain:


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  xTrain["Age"].fillna(xTrain["Age"].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  xTest["Age"].fillna(xTest["Age"].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Title_Miss,Title_Mr,Title_Mrs,Title_Rare
0,3,22,1,0,7,1,0,1,0,1,0,0
1,1,38,1,0,71,0,0,0,0,0,1,0
2,3,26,0,0,7,0,0,1,1,0,0,0
3,1,35,1,0,53,0,0,1,0,0,1,0
4,3,35,0,0,8,1,0,1,0,1,0,0



Displaying first 5 rows of preprocessed xTest:


Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Title_Miss,Title_Mr,Title_Mrs,Title_Rare
0,3,34,0,0,7,1,1,0,0,1,0,0
1,3,47,1,0,7,0,0,1,0,0,1,0
2,2,62,0,0,9,1,1,0,0,1,0,0
3,3,27,0,0,8,1,0,1,0,1,0,0
4,3,22,1,1,12,0,0,1,0,0,1,0


In [2]:
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

# Definir el modelo secuencial
# Puedes ajustar el número de capas y neuronas según tus necesidades
modelo = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(),  # Capa de normalización
    tf.keras.layers.Dense(64, activation='relu', input_shape=[xTrain.shape[1]]),  # Capa de entrada con 64 neuronas y activación ReLU
    tf.keras.layers.Dense(32, activation='relu'),  # Primera capa oculta con 32 neuronas y activación ReLU
    tf.keras.layers.Dense(16, activation='relu'),  # Segunda capa oculta con 16 neuronas y activación ReLU
    tf.keras.layers.Dense(1, activation='sigmoid')  # Capa de salida con 1 neurona y activación Sigmoid para clasificación binaria
])

# Configurar Early Stopping
# monitored: Métrica a monitorear (aquí 'val_loss' para el error en el conjunto de validación)
# patience: Número de épocas sin mejora después de las cuales se detiene el entrenamiento
# restore_best_weights: Si es True, restaura los pesos del modelo de la época con la mejor métrica monitoreada
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

# Compilar el modelo
modelo.compile(optimizer='adam',  # Optimizador Adam
               loss='binary_crossentropy',  # Función de pérdida para clasificación binaria
               metrics=['accuracy'])  # Métrica a evaluar (precisión)

# Para entrenar el modelo, usarías algo similar a esto, incluyendo el callback de Early Stopping:
history = modelo.fit(
     xTrain, yTrain,
     epochs=100, # Puedes poner un número alto, Early Stopping se encargará de detenerlo
     validation_split=0.2,  # Divide en train/validation
     batch_size=32, # Puedes ajustar el tamaño del batch
     callbacks=[early_stopping], # Incluye el callback de Early Stopping
     verbose=1 # Muestra el progreso del entrenamiento
 )

print("Modelo definido y compilado correctamente con Early Stopping")

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/100
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 20ms/step - accuracy: 0.5025 - loss: 0.6874 - val_accuracy: 0.8045 - val_loss: 0.6192
Epoch 2/100
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.7622 - loss: 0.5762 - val_accuracy: 0.7374 - val_loss: 0.5536
Epoch 3/100
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.7848 - loss: 0.5009 - val_accuracy: 0.7207 - val_loss: 0.5095
Epoch 4/100
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.7891 - loss: 0.4785 - val_accuracy: 0.7318 - val_loss: 0.4868
Epoch 5/100
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.8042 - loss: 0.4481 - val_accuracy: 0.7765 - val_loss: 0.4464
Epoch 6/100
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.8074 - loss: 0.4606 - val_accuracy: 0.7933 - val_loss: 0.4486
Epoch 7/100
[1m23/23[0m [32m━━

In [19]:
#Predecimos con el modelo previamente entrenado y guardamos las predicciones en un archivo .csv

print(np.array(xTest).shape)
predicciones = modelo.predict(np.array(xTest))

print("El tamaño de las predicciones", predicciones.shape)

#Convert the probability (from 0 to 1) into binary result (0 or 1)
predicciones = np.array([1 if x > 0.5 else 0 for x in predicciones])

#Concat the passenger id and the prediction into a single dataframe
prediccionDf = pd.DataFrame({'PassengerId': passengerId, 'Survived':predicciones})

#Convert it into a csv file so we can submit it
prediccionDf.to_csv('gender_submission.csv', index=False, sep=',')

#Let's print it
submission = pd.read_csv('/content/gender_submission.csv', index_col="PassengerId")
display(submission.value_counts())

(418, 12)
[1m14/14[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step 
El tamaño de las predicciones (418, 1)


Unnamed: 0_level_0,count
Survived,Unnamed: 1_level_1
0,279
1,139


## Summary:

### Q&A
The provided datasets, "test.csv" and "train.csv", appear to be related to passenger information, likely from a historical event such as the Titanic disaster.  The "train" dataset contains a "Survived" column, indicating a supervised learning task where the goal is to predict passenger survival based on other provided features. The "test" dataset lacks this column, suggesting its purpose is to evaluate the performance of a trained prediction model.


### Data Analysis Key Findings
* **Target Variable:** The `Survived` column in `df_train` is the target variable for a prediction task (likely survival prediction). It's absent from `df_test`, as expected.
* **Missing Data:** Significant missing values exist in the `Age` and `Cabin` columns in both datasets, and a smaller number in the `Embarked` column of the training dataset.  The `Cabin` column has a very high percentage of missing values. These missing values need to be addressed during preprocessing (e.g., imputation or removal).
* **Data Type Discrepancies:** No data type mismatches were found between the common columns in the two datasets.
* **Categorical Feature Discrepancies:** The `Embarked` column shows a discrepancy where missing values are present in the training set but not in the test set.  This inconsistency requires attention during data preprocessing.
* **Feature Relevance:** Features like `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, `Cabin`, and `Embarked` are potentially relevant predictors of survival. The `Name` column may also be useful for extracting titles, which could be indicative of social status.  `PassengerId` and `Ticket` are likely less relevant.


### Insights or Next Steps
* **Handle Missing Data:**  Develop a strategy for handling the missing values in `Age`, `Cabin` and `Embarked`.  Consider imputation, removal, or other appropriate techniques.  Address the inconsistency in missing values for `Embarked` between the training and test sets.
* **Feature Engineering:** Explore feature engineering possibilities, particularly with the `Name` and `Ticket` columns.  Extracting titles from names and potentially grouping or categorizing ticket numbers could improve model performance.
