# Explore here

It's recommended to use this notebook for exploration purposes.

For example: 

1. You could import the CSV generated by Python into your notebook and explore it.
2. You could connect to your database using `pandas.read_sql` from this notebook and explore it.
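
As a minimal sketch of the second option, the snippet below uses an in-memory SQLite database as a stand-in for a real project database. The table name `passengers` and its rows are hypothetical and exist only for this demo; with a real database you would pass your own connection and query to `pandas.read_sql`.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real project database (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table with a few Titanic-like rows, created only for this demo.
seed = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Survived": [0, 1, 1],
        "Fare": [7.25, 71.2833, 7.925],
    }
)
seed.to_sql("passengers", conn, index=False)

# Load the query result straight into a DataFrame.
df_sql = pd.read_sql("SELECT * FROM passengers WHERE Survived = 1", conn)
conn.close()

print(df_sql)
```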

In [1]:
# Libraries for data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [2]:
# Load the Titanic training CSV from the tutorial repository
url = 'https://raw.githubusercontent.com/4GeeksAcademy/random-forest-project-tutorial/main/titanic_train.csv'
df = pd.read_csv(url)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
df.sample(5)

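|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 607 | 608 | 1 | 1 | Daniel, Mr. Robert Williams | male | 27.0 | 0 | 0 | 113804 | 30.5 | NaN | S |
| 631 | 632 | 0 | 3 | Lundahl, Mr. Johan Svensson | male | 51.0 | 0 | 0 | 347743 | 7.0542 | NaN | S |
| 696 | 697 | 0 | 3 | Kelly, Mr. James | male | 44.0 | 0 | 0 | 363592 | 8.05 | NaN | S |
| 836 | 837 | 0 | 3 | Pasic, Mr. Jakob | male | 21.0 | 0 | 0 | 315097 | 8.6625 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |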


In [5]:
df.describe()

|       | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [6]:
# Fill missing ages with the mean age; a single assignment avoids pandas' chained-indexing warning
df['Age'] = df['Age'].fillna(df['Age'].mean())
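
Chained indexing of the form `df['Age'][mask] = value` may write into a temporary copy rather than into the DataFrame itself, which is what pandas warns about. The sketch below shows two single-step alternatives on a small hypothetical frame (`demo` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with one missing age.
demo = pd.DataFrame({"Age": [22.0, np.nan, 38.0]})

# Mean of the observed values; NaN is skipped by default.
mean_age = demo["Age"].mean()

# Option 1: a single .loc call selects rows and column together.
filled_loc = demo.copy()
filled_loc.loc[filled_loc["Age"].isna(), "Age"] = mean_age

# Option 2: fillna returns a new Series that is assigned back to the column.
filled_fillna = demo.copy()
filled_fillna["Age"] = filled_fillna["Age"].fillna(mean_age)

print(filled_loc["Age"].tolist())  # [22.0, 30.0, 38.0]
```

Both options produce the same column and leave the original `demo` frame untouched.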


In [7]:
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [8]:
df['Survived'].value_counts()/len(df['Survived'])

0    0.616162
1    0.383838
Name: Survived, dtype: float64

Note: the dataset is slightly imbalanced (about 62% of passengers did not survive), so evaluate the model with F1 rather than accuracy.
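
To illustrate why accuracy can mislead on imbalanced classes, here is a small sketch with hypothetical labels (6 negatives, 4 positives, roughly the ratio seen above) and a "model" that always predicts the majority class. The labels are invented for the demo and are not taken from the Titanic data.

```python
from sklearn import metrics

# Hypothetical labels with roughly the same imbalance as above.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# A "model" that always predicts the majority class.
y_majority = [0] * len(y_true)

acc = metrics.accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when no positives are predicted.
f1 = metrics.f1_score(y_true, y_majority, zero_division=0)

print(f"accuracy={acc:.2f}, f1={f1:.2f}")  # accuracy=0.60, f1=0.00
```

The baseline reaches 60% accuracy without ever finding a survivor, while its F1 for the positive class is zero.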

In [9]:
# Analysis of categorical features:
df['Name']
df['Name'].nunique()

891

The names are all distinct, so the column adds no information as it stands and is a candidate for removal.

In [10]:
df = df.drop(columns=['Name'])

In [11]:
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [12]:
# Convert the categorical Sex column to numeric with the dict {'male': 1, 'female': 0}
df['Sex'] = df['Sex'].map({'male':1, 'female':0})

In [13]:
df = df.drop(columns=['Ticket'])

In [14]:
df['Cabin'].value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: Cabin, Length: 147, dtype: int64

In [15]:
df = df.drop(columns=['Cabin'])

In [16]:
df['Embarked'].nunique()

3

In [17]:
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [18]:
df['Embarked'] = df['Embarked'].map({'S':2, 'C':1, 'Q':0})

In [19]:
# Fill the two missing ports with 2 ('S', the most frequent); a single assignment avoids the chained-indexing warning
df['Embarked'] = df['Embarked'].fillna(2)
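
The fill value 2 corresponds to `'S'`, the most frequent port in the counts above. As a hedged alternative sketch, the missing ports could be filled with the mode before encoding, so the code does not have to be hard-coded. The `ports` series below is hypothetical demo data; only the code table matches the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical port column with one missing entry.
ports = pd.Series(["S", "C", np.nan, "S", "Q"], name="Embarked")

# Same code table as the notebook.
codes = {"S": 2, "C": 1, "Q": 0}

# Fill missing values with the most frequent port, then encode.
most_common = ports.mode()[0]
encoded = ports.fillna(most_common).map(codes)

print(most_common, encoded.tolist())  # S [2, 1, 2, 2, 0]
```

Filling before mapping also means the encoded column contains no NaN at any point.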


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    int64  
 4   Age          891 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Embarked     891 non-null    float64
dtypes: float64(3), int64(6)
memory usage: 62.8 KB


In [21]:
X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y , random_state=13, test_size=0.2)

In [22]:
modelo = RandomForestClassifier(n_estimators=50, random_state=13)

In [23]:
modelo.fit(X_train, y_train)

In [24]:
y_train_pred = modelo.predict(X_train)
y_test_pred = modelo.predict(X_test)

In [25]:
print(metrics.classification_report(y_pred=y_train_pred, y_true=y_train))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       439
           1       1.00      1.00      1.00       273

    accuracy                           1.00       712
   macro avg       1.00      1.00      1.00       712
weighted avg       1.00      1.00      1.00       712



In [26]:
print(metrics.classification_report(y_pred=y_test_pred, y_true=y_test))

              precision    recall  f1-score   support

           0       0.82      0.91      0.86       110
           1       0.82      0.68      0.75        69

    accuracy                           0.82       179
   macro avg       0.82      0.80      0.80       179
weighted avg       0.82      0.82      0.82       179



The F1-score for class 1 is perfect on the training set (1.00) but noticeably lower on the test set (0.75): the model is memorizing the training data (overfitting). The next step is to try different hyperparameters.
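
One way to explore this is to sweep `max_depth` and compare train and test F1 side by side. The sketch below runs on synthetic data from `make_classification` so that it is self-contained. The data, the depth grid, and the names `X_demo`, `y_demo`, and `results` are assumptions for illustration; in this notebook you would reuse `X_train`, `X_test`, `y_train`, and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); in the notebook this would be X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13
)

results = {}
for depth in [2, 5, 10, None]:  # None places no limit on tree depth
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=13)
    clf.fit(X_tr, y_tr)
    # Store (train F1, test F1) for this depth.
    results[depth] = (
        f1_score(y_tr, clf.predict(X_tr)),
        f1_score(y_te, clf.predict(X_te)),
    )

for depth, (f1_train, f1_test) in results.items():
    print(f"max_depth={depth}: train F1={f1_train:.2f}, test F1={f1_test:.2f}")
```

A large gap between the two scores at a given depth suggests overfitting; a setting where they sit close together is usually a safer choice.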

In [27]:
modelo2 = RandomForestClassifier(n_estimators=10, random_state=13, max_depth=5)
modelo2.fit(X_train, y_train)
y_train_pred = modelo2.predict(X_train)
y_test_pred = modelo2.predict(X_test)

print(metrics.classification_report(y_pred=y_train_pred, y_true=y_train))
print(metrics.classification_report(y_pred=y_test_pred, y_true=y_test))


              precision    recall  f1-score   support

           0       0.84      0.95      0.89       439
           1       0.91      0.70      0.79       273

    accuracy                           0.86       712
   macro avg       0.87      0.83      0.84       712
weighted avg       0.86      0.86      0.85       712

              precision    recall  f1-score   support

           0       0.83      0.90      0.86       110
           1       0.82      0.71      0.76        69

    accuracy                           0.83       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.83      0.83      0.82       179

