<a href="https://colab.research.google.com/github/nkaraffa/Intro-to-AI-Machine-Learning-and-Python-basics/blob/main/Classification_Model_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Classification Model**



*   Predict the class of an object
*   The number of classes is limited and predefined by the dataset

---





# Problem:  Predict survivors from the Titanic based on class


*   Data Source:  Kaggle (https://www.kaggle.com/c/titanic)

---

> Data Descriptions:

>  *   Survived:    '0' = died, '1' = survived
>  *   PClass (Passenger Class):  '1' = High,  '2' = Mid,  '3' = Low 



**Prepare Data and Import Tools**

In [4]:
# Importing libraries

import pandas as pd
import numpy as np

In [9]:
# Load the saved Kaggle data file

data = pd.read_csv('titanicData.csv')

data      # display Titanic data

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
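
Before selecting features, it can help to look at how survival relates to passenger class. The sketch below is illustrative only: it rebuilds a tiny frame from the first five rows displayed above (the name `sample` is made up here), so the rates it prints say nothing about the full dataset. On the real data the same `groupby` call would be applied to `data`.

```python
import pandas as pd

# Survived and Pclass values copied from the first five displayed rows
# (a hypothetical mini-sample, far too small to draw conclusions from).
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass':   [3, 1, 3, 1, 3],
})

# The mean of a 0/1 column is the fraction of survivors in each class.
rate_by_class = sample.groupby('Pclass')['Survived'].mean()
print(rate_by_class)
```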


**Select Classification Factors**

In [11]:
# Survival factors - select the data columns that could have affected the chance of survival

columns_target = ['Survived']                       # Target column (what we want to predict)

columns_train = ['Pclass', 'Sex', 'Age', 'Fare']    # Factors that could influence survival

In [12]:
# Assign column values to a variable

x = data[columns_train]

y = data[columns_target]

**Data Verification (Data Integrity Check)**

In [13]:
# Verify that the given data set does not have null values or empty cells

x['Pclass'].isnull().sum()

0

In [14]:
x['Sex'].isnull().sum()

0

In [15]:
x['Age'].isnull().sum()       # This column has null values

177

In [16]:
x['Fare'].isnull().sum()

0
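
The four per-column checks above can also be done in one call, since `DataFrame.isnull().sum()` returns the null count for every column at once. A minimal sketch, assuming a stand-in frame `x_demo` (a made-up name) built from a few of the rows shown earlier, one of which has no `Age`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for x; the last row mirrors row 888, whose Age is missing.
x_demo = pd.DataFrame({
    'Pclass': [3, 1, 3, 3],
    'Sex':    ['male', 'female', 'female', 'female'],
    'Age':    [22.0, 38.0, 26.0, np.nan],
    'Fare':   [7.25, 71.2833, 7.925, 23.45],
})

# One null count per column, returned as a Series indexed by column name.
null_counts = x_demo.isnull().sum()
print(null_counts)
```

On the notebook's own data the equivalent call would be `x.isnull().sum()`.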

**Data Cleaning (Fill Null Values & Encode Sex Variable)**

Fill Null Values

In [17]:
# The pandas fillna() method can replace these null values

x['Age'] = x['Age'].fillna(x['Age'].median())     # Replace nulls with the median age

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [18]:
# Verify that this column has no null cells

x['Age'].isnull().sum()       # Should now be 0: the nulls have been filled

0
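
The `SettingWithCopyWarning` printed after the `fillna` step appears because `x` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is also meant to change `data`. A common way to avoid it is to take an explicit copy when selecting the columns. This is a sketch on a made-up mini-frame (`data_demo` and `x_demo` are illustrative names), not the notebook's own code:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame with one missing Age.
data_demo = pd.DataFrame({
    'Pclass':   [3, 1, 3],
    'Age':      [22.0, np.nan, 26.0],
    'Survived': [0, 1, 1],
})

# .copy() makes x_demo an independent DataFrame, so assigning to it is unambiguous.
x_demo = data_demo[['Pclass', 'Age']].copy()
x_demo['Age'] = x_demo['Age'].fillna(x_demo['Age'].median())

print(x_demo['Age'].tolist())             # missing value replaced by the median
print(data_demo['Age'].isnull().sum())    # the original frame is left untouched
```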

Convert Sex Variable Values

In [19]:
# Convert Sex Variable from str to binary (0 or 1) with a Python dictionary

d = {'male': 0 , 'female': 1}     # Specify the values for each entry

In [20]:
x['Sex'] = x['Sex'].map(d)      # Replace each str with its 0/1 code from the dictionary d

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [21]:
x['Sex'].head()     # Verify that the change has been made in the dataset

0    0
1    1
2    1
3    1
4    0
Name: Sex, dtype: int64
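
One thing to watch when encoding with a dictionary is a value that the dictionary does not cover. Looking a value up with `d[value]` raises a `KeyError`, whereas `Series.map(d)` silently yields `NaN`, so a quick check for unmapped entries is worthwhile. The sketch below uses a made-up series containing an `'unknown'` entry that does not occur in the data shown here:

```python
import pandas as pd

d = {'male': 0, 'female': 1}

# 'unknown' is a hypothetical value added only to show the failure mode.
sex_demo = pd.Series(['male', 'female', 'unknown'])

encoded = sex_demo.map(d)                 # values missing from d become NaN
unmapped = sex_demo[encoded.isnull()]     # original values that were not encoded
print(unmapped.tolist())
```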

**Divide dataset into training & test samples**

In [24]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)   # Hold out 33% of the rows for testing; random_state makes the split reproducible
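
To see what `test_size=0.33` means in rows, the split can be tried on placeholder arrays. This sketch assumes 891 rows, the size of Kaggle's Titanic `train.csv`; the arrays `X_demo` and `y_demo` are dummies that carry no real passenger data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891                          # assumption: row count of Kaggle's train.csv
X_demo = np.zeros((n_rows, 4))        # placeholder features (4 columns, like x)
y_demo = np.arange(n_rows) % 2        # placeholder 0/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The test set receives ceil(0.33 * n_rows) rows; the rest go to training.
print(len(X_tr), len(X_te))
```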

**Import Support Vector Machine for Training**

In [25]:
# Import the support vector machine module

from sklearn import svm

predmodel = svm.LinearSVC()   # Create the (untrained) linear SVM classifier

**Train Model with our specified training sample**

In [26]:
predmodel.fit(x_train, y_train.values.ravel())     # Pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
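
Linear SVMs are sensitive to feature scale, and here `Fare` and `Age` span much larger ranges than `Pclass` and `Sex`, which can slow convergence within the default `max_iter=1000`. One common remedy, not part of the original notebook, is to standardize the features inside a pipeline. The sketch below runs on synthetic data from `make_classification`; the names `X_demo`, `scaled_model`, and `demo_accuracy` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the four Titanic features (not real passenger data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# StandardScaler is fitted on the training data only, then applied before the SVM.
scaled_model = make_pipeline(
    StandardScaler(),
    LinearSVC(random_state=0, max_iter=5000),
)
scaled_model.fit(X_tr, y_tr)

demo_accuracy = scaled_model.score(X_te, y_te)
print(demo_accuracy)
```

Whether scaling changes the accuracy on the Titanic data would need to be checked by running it there.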

**Model-based Predictions**

In [27]:
predmodel.predict(x_test[0:10])     # Predict survival for the first 10 rows of the test set

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

In [28]:
# Check model prediction accuracy

predmodel.score(x_test, y_test)

0.7932203389830509
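
For a classifier, `score` returns the accuracy: the fraction of test rows whose predicted label equals the true label. The sketch below computes that by hand and adds a confusion matrix to show which kind of error is being made. `y_pred` reuses the ten predictions printed above, while `y_true` is made up for illustration and is not the real test labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])   # the ten predictions shown above
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 1])   # hypothetical labels, for illustration

# Accuracy by hand: share of positions where prediction and label agree.
manual_accuracy = (y_true == y_pred).mean()

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

print(manual_accuracy, accuracy_score(y_true, y_pred))
print(cm)
```

With the notebook's own objects, the same matrix would come from `confusion_matrix(y_test, predmodel.predict(x_test))`.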