<a href="https://colab.research.google.com/github/mikecorey/python101/blob/main/Pandas_3_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Day 3

Today we get a little closer to our target system by looking at a few common data science pipelines.

First up: machine learning.

Can we predict who lives and who dies on the Titanic?



In [None]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Clean our data
We want to handle empty values. We can't really train a model on rows that have them, because each missing value is an entire dimension we don't know.

We could remove those rows, or we could fill in the gaps. For now, let's just drop them.

In [None]:
# Removes the rows where age or fare is empty

def clean_data(df):
  df = df.copy()
  return df.dropna(subset=["Age", "Fare"])

df = clean_data(df)
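
Dropping is the simplest choice, but it throws passengers away. As a rough sketch of the "fill" alternative, the toy frame below (made-up values, not the Titanic data) compares dropping rows with filling the gaps. Using the column median as the fill value is an assumption made for illustration; it is not something this notebook does.

```python
import pandas as pd

# Toy data with made-up values; None becomes NaN in a numeric column
toy = pd.DataFrame({"Age": [22.0, None, 26.0, 35.0],
                    "Fare": [7.25, 71.28, None, 53.10]})

# Count missing values per column before deciding what to do
missing_counts = toy.isna().sum()

# Option 1: drop any row where Age or Fare is missing
dropped = toy.dropna(subset=["Age", "Fare"])

# Option 2 (assumption: the median is an acceptable stand-in): fill instead of drop
filled = toy.fillna({"Age": toy["Age"].median(), "Fare": toy["Fare"].median()})

print(missing_counts)
print(len(dropped), "rows after dropping;", len(filled), "rows after filling")
```

Dropping keeps only the rows that were complete to begin with, while filling keeps every row at the cost of inventing plausible values.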

## Enhance our data
We've removed rows that won't help train our model. Now we need to make some of the features more useful.

- Sex is a categorical variable (binary: male or female). The model we're training (Logistic Regression) only works on numeric data, so we need to convert Sex to a 1 or 0.

- We can also reduce SibSp and Parch to one variable called FamilySize. We lose some resolution, but that's OK, and it also helps us avoid overfitting our model.

- Finally, let's make an IsAlone variable, which will be 1 if the family size is zero and 0 otherwise.

In [None]:
def enhance_data(df):
  df = df.copy()
  df["SexEncoded"] = (df["Sex"] == "male").astype(int)
  df["FamilySize"] = df["SibSp"] + df["Parch"]
  df["IsAlone"] = (df["FamilySize"] == 0).astype(int)
  return df

df = enhance_data(df)

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SexEncoded,FamilySize,IsAlone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,0,1


## Define features and target label
We're going to use Pclass (passenger class), Age, Fare, SexEncoded and IsAlone as our features (X). The target label (y) will of course be Survived.

Note: by convention the features get a capital X because they form a matrix (one row per passenger, one column per feature), while the labels get a lowercase y because they form a vector.

In [None]:
features = ["Pclass", "Age", "Fare", "SexEncoded", "IsAlone"]
def select_features(df):
  df = df.copy()
  return df[features]


y = df["Survived"]
X = select_features(df)
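
A quick way to see the difference between the two objects is to check their shapes. The sketch below uses a toy frame with made-up values: selecting a list of columns gives a two-dimensional DataFrame, while selecting a single column gives a one-dimensional Series.

```python
import pandas as pd

# Toy data with made-up values, not the Titanic data
toy = pd.DataFrame({"Pclass": [3, 1, 3],
                    "Age": [22.0, 38.0, 26.0],
                    "Survived": [0, 1, 1]})

X = toy[["Pclass", "Age"]]   # list of column names -> DataFrame (2-D)
y = toy["Survived"]          # single column name   -> Series (1-D)

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```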


## Split data into train and test

We've defined our features and labels. Now it's time to split our data. scikit-learn includes a nice function called `train_test_split` to handle this. We're holding back 20% of the rows for testing. The function returns four things: training data, testing data, training labels and testing labels, in that order.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
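
To make the four return values concrete, here is a small sketch on a 10-row toy dataset (made-up values). With `test_size=0.2`, two rows are held back for testing and eight are used for training, and each row of features keeps the same index as its label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with made-up values: 10 rows of features and 10 labels
toy_X = pd.DataFrame({"Age": range(20, 30)})
toy_y = pd.Series([0, 1] * 5, name="Survived")

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))               # 8 2
print((X_tr.index == y_tr.index).all())   # True: features and labels stay aligned
```

Fixing `random_state` makes the shuffle repeatable, so rerunning the cell gives the same split each time.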


## Train (fit) classifier

We're going to use Logistic Regression as our classifier. The name suggests a regression model, but it works as a classifier: it predicts a probability between 0 and 1, and that probability is mapped to a class (1 if it is above 0.5, otherwise 0).

To train any model in scikit-learn, we use `.fit()`. This learns (fits) the weights of our model from the training data.

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
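
As a rough sketch of what happens inside a binary logistic regression at prediction time: the features are combined into a weighted sum, the sigmoid function squashes that sum into a probability, and the probability is compared with 0.5. The weights and bias below are hypothetical numbers chosen for illustration; they are not the values learned by `clf`.

```python
import math

def sigmoid(z):
  """Squash any real number into the range (0, 1)."""
  return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
  """Weighted sum -> probability -> class label (1 if probability > threshold)."""
  z = bias + sum(w * x for w, x in zip(weights, features))
  probability = sigmoid(z)
  return probability, int(probability > threshold)

# Hypothetical weights for two features; NOT taken from the fitted clf
weights = [-1.0, 2.0]
bias = 0.5

print(predict_label([1.0, 1.0], weights, bias))   # weighted sum is 1.5  -> label 1
print(predict_label([2.0, 0.0], weights, bias))   # weighted sum is -1.5 -> label 0
```

Fitting is the process of choosing the weights and bias so that these probabilities match the training labels as closely as possible.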


## Test performance
Now that our model is trained, we should test its performance on our held-out dataset. We'll use `.predict()` for this.

In [None]:
y_pred = clf.predict(X_test)


## Evaluate performance

We can evaluate performance using a few built-in functions. These let us understand both the accuracy and which kinds of misclassification are likely (false positives vs. false negatives).


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.7342657342657343

Confusion Matrix:
 [[68 19]
 [19 37]]

Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.78      0.78        87
           1       0.66      0.66      0.66        56

    accuracy                           0.73       143
   macro avg       0.72      0.72      0.72       143
weighted avg       0.73      0.73      0.73       143



OK, 73 percent. That's not bad. Let's call it done and set up a function that uses this trained model to make predictions.

In [None]:
def predict_survival(**kwargs):
  # Start from the model's features, then swap the derived columns
  # (SexEncoded, IsAlone) for the raw inputs that enhance_data builds them from
  required_features = set(features)
  required_features -= {'SexEncoded', 'IsAlone'}
  required_features.update({'Sex', 'SibSp', 'Parch'})
  if all(key in kwargs for key in required_features):
    df = pd.DataFrame([kwargs])
    df = clean_data(df)
    df = enhance_data(df)
    df = select_features(df)
    return clf.predict(df)[0]
  else:
    raise ValueError(f'make sure to include {required_features}')


# IsAlone is not passed in: enhance_data computes it from SibSp and Parch
print('rich woman', predict_survival(Pclass=1, Age=1, Fare=1000, Sex='female',
                                     SibSp=5, Parch=5))
print('poor dude', predict_survival(Pclass=3, Age=100, Fare=2, Sex='male',
                                     SibSp=0, Parch=0))

rich woman 1
poor dude 0
