### Implemented different machine learning models on the same dataset and find accuracy.

You are given a dataset named passengers.csv with the follwoing columns
* Survived Indicator
* Passenger Class
* Name
* Sex
* Age
* Siblings Aboard
* Parents Aboard
* Fare paid in £s

Apply the following Machine Learnig algorithms with appropriate preprocessing techniques by splitting the given data into 80% for training and 20% for testing to predict the Survived Indicator of the passenger.
1. SVM with appropriate kernal
2. LogisticRegression
3. DecisionTreeClassifier
4. VotingClassifier
5. BaggingClassifier
6. RandomForestClassifier

#### Import the required libraries 

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier 

#### load the dataset

In [11]:
passenger = pd.read_csv('Passengers.csv')
passenger.shape

(887, 8)

#### Display the null values per each column 

In [12]:
null_columns=passenger.columns[passenger.isnull().any()]
passenger[null_columns].isnull().sum()

Series([], dtype: float64)

#### Drop the Name column from the dataset

In [13]:
passenger.drop(["Name"], axis = 1, inplace = True)


In [14]:
passenger.head

<bound method NDFrame.head of      Survived  Pclass     Sex   Age  Siblings/Spouses Aboard  \
0           0       3    male  22.0                        1   
1           1       1  female  38.0                        1   
2           1       3  female  26.0                        0   
3           1       1  female  35.0                        1   
4           0       3    male  35.0                        0   
..        ...     ...     ...   ...                      ...   
882         0       2    male  27.0                        0   
883         1       1  female  19.0                        0   
884         0       3  female   7.0                        1   
885         1       1    male  26.0                        0   
886         0       3    male  32.0                        0   

     Parents/Children Aboard     Fare  
0                          0   7.2500  
1                          0  71.2833  
2                          0   7.9250  
3                          0  53.1000  
4

#### Convert the non-numeric column into numeric by encoding

In [18]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
enc.fit(passenger['Sex'])
passenger['Sex'] = enc.transform(passenger['Sex'])

In [19]:
passenger.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,1,22.0,1,0,7.25
1,1,1,0,38.0,1,0,71.2833
2,1,3,0,26.0,0,0,7.925
3,1,1,0,35.0,1,0,53.1
4,0,3,1,35.0,0,0,8.05


#### Prepare dataset for training and testing with 80:20

In [21]:
from sklearn.model_selection import train_test_split
X = passenger.iloc[:,:-2]
y = passenger.iloc[:,-2]

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 123)

#### Apply Logistic Regression and print the accuracy of your model on test data

In [24]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [34]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))


Accuracy of logistic regression classifier on test set: 0.76


#### Apply DecisionTree and print the accuracy of your model on test data

In [26]:
clf = DecisionTreeClassifier(criterion = 'entropy')
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [36]:
y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train))) #validation
print('Accuracy Score on the test data: ', accuracy_score(y_true=y_test, y_pred=clf.predict(X_test)))

clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=50)
clf.fit(X_train, y_train)
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train))) #testing
print('Accuracy Score on the test data: ', accuracy_score(y_true=y_test, y_pred=clf.predict(X_test)))


Accuracy Score on train data:  0.8222849083215797
Accuracy Score on the test data:  0.8314606741573034
Accuracy Score on train data:  0.8222849083215797
Accuracy Score on the test data:  0.8314606741573034


#### Apply SVM and print the accuracy of your model on test data

In [30]:
clf = SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [37]:
from sklearn import metrics

# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))


Accuracy: 0.8314606741573034


#### Average the accuracy of the above three models

In [48]:
print(float(format(logreg.score(X_test, y_test)))+accuracy_score(y_true=y_test, y_pred=clf.predict(X_test))+metrics.accuracy_score(y_test, y_pred))


2.426966292134831


#### Apply Voting ensembler with above three models and print the accuracy of your model on test data

In [72]:
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection

estimators=[('SVC', SVC()), ('DTree', DecisionTreeClassifier()), ('LogReg', LogisticRegression())]
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, y)
print(results.mean())



0.8061861971927488




#### Apply Random Forest Classifier and print the accuracy of your model on test data

In [80]:
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7359550561797753


#### Apply Bagging with above all the three models and print the accuracy of your model on test data

In [108]:
svc = SVC()
dt = DecisionTreeClassifier()
lg = LogisticRegression()
clf_array = []
for clf in clf_array:
    vanilla_scores = cross_val_score(clf, X, y, cv=10, n_jobs=-1)
    bagging_clf = BaggingClassifier(clf, 
       max_samples=0.4, max_features=10, random_state=seed)
    bagging_scores = cross_val_score(bagging_clf, X, y, cv=10, 
       n_jobs=-1)
    
    print("Mean of: {1:.3f}, std: (+/-) {2:.3f [{0}]",format(clf.__class__.__name__,vanilla_scores.mean(), vanilla_scores.std()))  
    print("Mean of: {1:.3f}, std: (+/-) {2:.3f} [Bagging {0}]\n",format(clf.__class__.__name__,bagging_scores.mean(), bagging_scores.std()))

#### Apply Ada Boosting with 5, 25, 50, 75 and 100 estimators and print the accuracy of your model on test data

In [136]:
from sklearn.ensemble import AdaBoostClassifier
adaboost5 = AdaBoostClassifier(n_estimators=5, base_estimator= None,learning_rate=1, random_state = 1)
adaboost5.fit(X_train,y_train)


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
                   n_estimators=5, random_state=1)

In [137]:
adaboost25 = AdaBoostClassifier(n_estimators=25, base_estimator= None,learning_rate=1, random_state = 1)
adaboost25.fit(X_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
                   n_estimators=25, random_state=1)

In [142]:
adaboost50 = AdaBoostClassifier(n_estimators=50, base_estimator= None,learning_rate=1, random_state = 1)
adaboost50.fit(X_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
                   n_estimators=50, random_state=1)

In [146]:
adaboost75 = AdaBoostClassifier(n_estimators=75, base_estimator= None,learning_rate=1, random_state = 1)
adaboost75.fit(X_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
                   n_estimators=75, random_state=1)

In [149]:
adaboost100 = AdaBoostClassifier(n_estimators=100, base_estimator= None,learning_rate=1, random_state = 1)
adaboost100.fit(X_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
                   n_estimators=100, random_state=1)

In [138]:
Y_pred = adaboost5.predict(X_test)


In [139]:
Y1_pred = adaboost25.predict(X_test)

In [144]:
Y2_pred = adaboost50.predict(X_test)

In [147]:
Y3_pred = adaboost75.predict(X_test)

In [150]:
Y4_pred = adaboost100.predict(X_test)

In [140]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,Y_pred)
accuracy = float(cm.diagonal().sum())/len(y_test)
print("\nAccuracy Of AdaBoost when estimator is 5 For The Given Dataset :",accuracy)


Accuracy Of AdaBoost when estimator is 5 For The Given Dataset : 0.8033707865168539


In [141]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,Y1_pred)
accuracy1 = float(cm.diagonal().sum())/len(y_test)
print("\nAccuracy Of AdaBoost when estimator is 25 For The Given Dataset :",accuracy1)


Accuracy Of AdaBoost when estimator is 25 For The Given Dataset : 0.8033707865168539


In [154]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,Y2_pred)
accuracy2 = float(cm.diagonal().sum())/len(y_test)
print("\nAccuracy Of AdaBoost when estimator is 50 For The Given Dataset :",accuracy2)


Accuracy Of AdaBoost when estimator is 50 For The Given Dataset : 0.8033707865168539


In [153]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,Y3_pred)
accuracy3 = float(cm.diagonal().sum())/len(y_test)
print("\nAccuracy Of AdaBoost when estimator is 75 For The Given Dataset :",accuracy3)


Accuracy Of AdaBoost when estimator is 75 For The Given Dataset : 0.8033707865168539


In [152]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,Y4_pred)
accuracy4 = float(cm.diagonal().sum())/len(y_test)
print("\nAccuracy Of AdaBoost when estimator is 100 For The Given Dataset :",accuracy4)


Accuracy Of AdaBoost when estimator is 100 For The Given Dataset : 0.8033707865168539


#### What inferences you have drawn after executing all the above models.

#### we can infer that our dataset is showing same accuracy when using SVM and Decision Tree which is 80%(approx).
