# Week 4 - Linear and Logistic Machine Learning
#### Part 1
### Notebook created by Jonathan Penava
By the end of this lesson you should know the steps for building a machine learning model and have practiced them with both a linear and a logistic regression model.

## Overview
- Steps to Machine Learning
- Linear Regression
- <span style="color:red;">Logistic Regression</span>

Logistic regression is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome; something that can take two values such as true/false, yes/no, and so on. Source: (https://www.sciencedirect.com/topics/computer-science/logistic-regression)
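
To make "modeling the probability" a little more concrete, here is a minimal sketch of the logistic (sigmoid) function that logistic regression is built on. The intercept and coefficient below are hypothetical values picked only for illustration; they are not fitted to the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficient for a single input variable
intercept = -1.0
coef = 0.5

x = np.array([-4.0, 0.0, 2.0, 8.0])

# Linear combination first, then the sigmoid turns it into a probability
probabilities = sigmoid(intercept + coef * x)

# Probabilities of 0.5 or more are labelled 1, everything else 0
predicted_class = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_class)
```

With these made-up numbers the two smaller inputs fall below 0.5 and are labelled 0, while the two larger inputs are labelled 1.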

In this example we are going to continue with our Titanic example.  We want to model the probability of some discrete outcome.  In this case, would I survive?  We start by identifying the same steps from the previous example.

### Collecting Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset('titanic')

In [2]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


### Preparing the Data
We can see from the info that we have some columns that won't help us.  Before continuing on with the notebook, try to decide which columns you want to keep and which you want to get rid of.
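
One way to help with that decision is to look at how much of each column is missing. The sketch below uses a tiny made-up frame (the `sample` name and its values are assumptions for illustration, not the real Titanic data) so it runs on its own.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with made-up values
sample = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.93, 53.10],
})

# isna() marks missing cells; mean() turns that into a share per column
missing_share = sample.isna().mean()
print(missing_share)
```

A column that is mostly missing, like `deck` in this toy frame, is usually a candidate for dropping, while a column with a few gaps may only need its incomplete rows removed.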

In [4]:
#Drop columns that repeat other columns or are mostly missing:
#'class' repeats 'pclass', 'alive' repeats 'survived', and 'deck' has only 203 non-null values.
#The embarkation columns are not used in this example.
titanic.drop(columns=['embark_town', 'embarked', 'deck', 'class', 'alive'], inplace=True)

In [5]:
titanic.head(20)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,who,adult_male,alone
0,0,3,male,22.0,1,0,7.25,man,True,False
1,1,1,female,38.0,1,0,71.2833,woman,False,False
2,1,3,female,26.0,0,0,7.925,woman,False,True
3,1,1,female,35.0,1,0,53.1,woman,False,False
4,0,3,male,35.0,0,0,8.05,man,True,True
5,0,3,male,,0,0,8.4583,man,True,True
6,0,1,male,54.0,0,0,51.8625,man,True,True
7,0,3,male,2.0,3,1,21.075,child,False,False
8,1,3,female,27.0,0,2,11.1333,woman,False,False
9,1,2,female,14.0,1,0,30.0708,child,False,False


In [6]:
#Encode the text columns as numbers, one column at a time
#sex: 0 for male, 1 for female
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})
#who: 0 for man, 1 for woman, 2 for child
titanic['who'] = titanic['who'].map({'man': 0, 'woman': 1, 'child': 2})



In [7]:
titanic.dropna(axis=0, inplace=True)

In [16]:
titanic.min()

survived          0
pclass            1
sex               0
age            0.42
sibsp             0
parch             0
fare            0.0
who               0
adult_male    False
alone         False
dtype: object

In [17]:
titanic.max()

survived             1
pclass               3
sex                  1
age               80.0
sibsp                5
parch                6
fare          512.3292
who                  2
adult_male        True
alone             True
dtype: object

In [8]:
titanic.head(20)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,who,adult_male,alone
0,0,3,0,22.0,1,0,7.25,0,True,False
1,1,1,1,38.0,1,0,71.2833,1,False,False
2,1,3,1,26.0,0,0,7.925,1,False,True
3,1,1,1,35.0,1,0,53.1,1,False,False
4,0,3,0,35.0,0,0,8.05,0,True,True
6,0,1,0,54.0,0,0,51.8625,0,True,True
7,0,3,0,2.0,3,1,21.075,2,False,False
8,1,3,1,27.0,0,2,11.1333,1,False,False
9,1,2,1,14.0,1,0,30.0708,2,False,False
10,1,3,1,4.0,1,1,16.7,2,False,False


### Choosing a Model
In most cases we want to choose a model that best fits our data.  In the previous example we fit a trendline, but here the value we want to predict, survived, can only be 0 or 1, so we use logistic regression to model the probability of each outcome instead.  The idea is otherwise the same.  We have some set of x variables.  Based on this collection of x variables we want to make a prediction y.  As we change values of x, the predicted y should change.  Let's start by separating our x values from y.
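
A quick way to see whether one x variable moves y is to compare the average outcome for each of its values. The sketch below does this on a small made-up frame (the `toy` name and its values are assumptions, not the Titanic data).

```python
import pandas as pd

# Made-up data: sex is encoded 0 for male, 1 for female
toy = pd.DataFrame({
    'sex':      [0, 0, 0, 0, 1, 1, 1, 1],
    'survived': [0, 0, 0, 1, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
rate_by_sex = toy.groupby('sex')['survived'].mean()
print(rate_by_sex)
```

In this toy frame the rate is 0.25 for sex 0 and 0.75 for sex 1, so changing x clearly changes the chance of y being 1. The same `groupby` pattern can be tried on the real `titanic` frame.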

In [9]:
#Remove the dependent data (our y) from our independent data (our x values)
X, y = titanic.drop('survived', axis=1), titanic['survived'].values

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=87)

### Training the Model
Before we can train, our data has to be split into training and testing data, which is what the cell above does.  Sometimes we will have our training and testing data in 2 separate files.  In this case all of our data is in one file, so we had to split it ourselves.  A good split is usually about 70% for training and about 30% for testing.  These values might fluctuate up or down 5%-10% depending on your source.  We can now fit the model on the training data.
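
To see what a 70/30 split looks like in numbers, here is a small sketch on made-up arrays (the `X_demo` and `y_demo` names and values are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up records with 2 features each, plus a 0/1 label per record
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# test_size=0.3 holds back 30% of the records for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=87
)

print(len(X_tr), len(X_te))
```

With 10 records this leaves 7 for training and 3 for testing. Changing `random_state` changes which records land in each group, not how many.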

In [11]:
from sklearn.linear_model import LogisticRegression

#Note that max_iter is the maximum number of iterations the solver may take (the default is 100).
#It is not tied to the number of records.  If the solver does not converge within max_iter
#iterations you will receive a ConvergenceWarning, so we raise the limit here.
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train,y_train)
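
If you are curious how many iterations the solver actually used, a fitted `LogisticRegression` exposes this as `n_iter_`. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the Titanic data), so the exact count will differ from what you see with `logmodel`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver ran before stopping
print(demo_model.n_iter_)
```

The same attribute is available on `logmodel` after it has been fitted, which is a handy way to check whether the limit you chose was generous enough.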



In [12]:
predictions = logmodel.predict(X_test)

from sklearn.metrics import classification_report

#Print the classification report for the model
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.87      0.90      0.89       130
           1       0.84      0.80      0.82        85

    accuracy                           0.86       215
   macro avg       0.86      0.85      0.85       215
weighted avg       0.86      0.86      0.86       215



### Evaluating the Model
Now that we have trained the model, let's test it out.  How accurate is it?

In [13]:
#Predicting the results
predictions = logmodel.predict(X_test)
logmodel.score(X_test, y_test)

0.8604651162790697
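
The score above is the accuracy: the share of test records the model labelled correctly. As a rough sketch of where that number and the precision and recall columns in the classification report come from, the example below works through a small set of made-up labels (the `y_true` and `y_pred` values are assumptions, not our Titanic results).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for 8 records
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# Accuracy: correct predictions divided by all predictions
accuracy = accuracy_score(y_true, y_pred)

# Rows are the true class, columns are the predicted class
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # of those predicted 1, how many really were 1
recall_1 = tp / (tp + fn)     # of those really 1, how many were predicted 1

print(accuracy)
print(cm)
print(precision_1, recall_1)
```

Here 6 of the 8 predictions are correct, so the accuracy is 0.75, and both precision and recall for class 1 are 0.75 as well.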

### Parameter Tuning
So our model isn't perfect.  Can we do better?  What changes can we make to our data set to improve our accuracy?  Do we remove some columns?  Do we remove some outlier data?  Do we need a larger data set?  Try going back and changing the random_state to the last 2 digits of your ID.  Does it make the model more or less accurate?
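
Because a single split can be lucky or unlucky, one option is to score the model on several different splits and look at the spread. The sketch below uses `cross_val_score` on synthetic data (an assumption for illustration, not the Titanic data); it is one possible way to compare changes, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# cv=5 trains and scores the model on 5 different train/test splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)

print(scores)
print(scores.mean(), scores.std())
```

If a change to the data moves the mean score by less than the spread between splits, it is hard to say the change really helped.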

### Making Predictions
At this point we need to use our model to make a prediction.  Let's modify the values below to whatever you wish and see whether the model predicts that you would survive.

In [14]:
X.head()


Unnamed: 0,pclass,sex,age,sibsp,parch,fare,who,adult_male,alone
0,3,0,22.0,1,0,7.25,0,True,False
1,1,1,38.0,1,0,71.2833,1,False,False
2,3,1,26.0,0,0,7.925,1,False,True
3,1,1,35.0,1,0,53.1,1,False,False
4,3,0,35.0,0,0,8.05,0,True,True


In [20]:
myTest = {
        'pclass':[1],
        'sex':[1],  #0 for male, 1 for female
        'age':[38],  #Yeah I'm not telling
        'sibsp':[1],  #Number of siblings or spouse aboard
        'parch':[0],  #Number of parents or children aboard
        'fare':[71], 
        'who':[1],  #0 for man, 1 for woman, 2 for child
        'adult_male':[False],
        'alone':[False]
       }
df = pd.DataFrame(myTest)
print(logmodel.predict(df))

[1]


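Well, I made it.  A prediction of 1 means the model expects this passenger to survive; a 0 would mean they did not.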
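
Since logistic regression models a probability, you can also ask how confident the model is instead of only getting a 0 or 1. The sketch below fits a small model on made-up data (the `train`, `labels`, and `passenger` values are assumptions for illustration) and calls `predict_proba`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training data: sex (0 male, 1 female) and passenger class
train = pd.DataFrame({
    'sex':    [0, 0, 0, 0, 1, 1, 1, 1],
    'pclass': [3, 3, 2, 1, 3, 2, 1, 1],
})
labels = [0, 0, 0, 1, 0, 1, 1, 1]

demo_model = LogisticRegression(max_iter=1000).fit(train, labels)

# One hypothetical passenger, with the same columns in the same order
passenger = pd.DataFrame({'sex': [1], 'pclass': [1]})

# Columns follow demo_model.classes_: probability of 0, then probability of 1
proba = demo_model.predict_proba(passenger)
print(demo_model.classes_)
print(proba)
```

The same call works on the notebook's model, for example `logmodel.predict_proba(df)`, and the two numbers in each row always add up to 1.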