## Objectives

Perform exploratory data analysis and determine training labels:

*   Create a column for the class
*   Standardize the data
*   Split into training data and test data

Find the best hyperparameters for SVM, classification trees, and logistic regression:

*   Find the method that performs best using test data


## Import Libraries and define auxiliary functions

In [70]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

import requests

In [71]:
def plot_confusion_matrix(y, y_predict):
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax)
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    ax.xaxis.set_ticklabels(['did not land', 'landed'])
    ax.yaxis.set_ticklabels(['did not land', 'landed'])
    plt.show()
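
As a quick check of what the heatmap displays, the sketch below calls `confusion_matrix` on a small made-up label vector (the labels are illustrative, not taken from the dataset). Rows index the true class and columns the predicted class, so with 0 = did not land and 1 = landed, the entry at row 0, column 1 counts false positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 0 = did not land, 1 = landed
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```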

## Load the dataframe

In [72]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_2.csv"
local_path = "./dataset_part_2.csv"

response = requests.get(url)

if response.status_code == 200:
    with open(local_path, 'wb') as file:
        file.write(response.content)
    print(f'Download success, status code {response.status_code}')
else:
    print(f'Download error, status code {response.status_code}')

path = local_path  # read back the file saved above

Download success, status code 200


In [73]:
df = pd.read_csv(path)
print(df.shape)
df.head()

(90, 18)


Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1,2010-06-04,Falcon 9,6104.959412,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857,0
1,2,2012-05-22,Falcon 9,525.0,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857,0
2,3,2013-03-01,Falcon 9,677.0,ISS,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857,0
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093,0
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857,0


### 1. Create a NumPy array from the column <code>Class</code> in <code>df</code> by applying the method <code>to_numpy()</code>, then assign it to the variable <code>Y</code>. Select the column with single brackets (<code>df['name of column']</code>) so that a pandas Series is passed to <code>to_numpy()</code>.

In [74]:
Y = df['Class'].to_numpy()
Y

array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1], dtype=int64)
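
To make the distinction in this step concrete, here is a small sketch on a made-up three-row frame (`toy` is a hypothetical name, not part of the notebook): single brackets select a pandas Series, double brackets select a one-column DataFrame, and `to_numpy()` converts the Series to a NumPy array.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the notebook itself uses df
toy = pd.DataFrame({"Class": [0, 1, 1]})

series = toy["Class"]        # single brackets -> pandas Series
frame = toy[["Class"]]       # double brackets -> one-column DataFrame
array = series.to_numpy()    # Series -> 1-D NumPy array

print(type(series).__name__, type(frame).__name__, type(array).__name__)
print(array.shape)
```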

### 2. Standardize the data in <code>X</code>, then reassign it to the variable <code>X</code> using the transform provided below.

In [75]:
transform = preprocessing.StandardScaler()
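
For reference, `StandardScaler` rescales each column to zero mean and unit variance. The sketch below, on made-up numbers, compares its output with the same calculation done by hand; the array `data` is an assumption for illustration and is unrelated to the launch dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration (two feature columns)
data = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [6.0, 700.0]])

scaled = StandardScaler().fit_transform(data)

# Same result by hand: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```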

We split the data into training and test data using the function <code>train_test_split</code>. During the hyperparameter search, <code>GridSearchCV</code> further divides the training data into folds, holding out each fold in turn as validation data while training on the remaining folds; the hyperparameters with the best average validation score are then selected.
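
How cross-validation reuses the training set can be sketched as follows. When `cv` is an integer and the estimator is a classifier, `GridSearchCV` uses stratified K-fold splitting; the snippet below applies `StratifiedKFold` directly to 72 made-up labels (the size of an 80% split of 90 rows) and checks that each training row is held out for validation exactly once. The arrays `X_toy` and `y_toy` are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 24 failures (0) and 48 successes (1)
y_toy = np.array([0] * 24 + [1] * 48)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=10)
held_out = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    # Each fold trains on roughly 9/10 of the rows and validates on the rest
    held_out.extend(val_idx.tolist())

# Every row index appears in exactly one validation fold
print(len(held_out), len(set(held_out)))
```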

In [76]:
X = df[['FlightNumber', 'PayloadMass', 'Flights', 'GridFins', 'Reused', 'Legs', 'Block', 'ReusedCount', 'Longitude', 'Latitude']]
X

Unnamed: 0,FlightNumber,PayloadMass,Flights,GridFins,Reused,Legs,Block,ReusedCount,Longitude,Latitude
0,1,6104.959412,1,False,False,False,1.0,0,-80.577366,28.561857
1,2,525.000000,1,False,False,False,1.0,0,-80.577366,28.561857
2,3,677.000000,1,False,False,False,1.0,0,-80.577366,28.561857
3,4,500.000000,1,False,False,False,1.0,0,-120.610829,34.632093
4,5,3170.000000,1,False,False,False,1.0,0,-80.577366,28.561857
...,...,...,...,...,...,...,...,...,...,...
85,86,15400.000000,2,True,True,True,5.0,2,-80.603956,28.608058
86,87,15400.000000,3,True,True,True,5.0,2,-80.603956,28.608058
87,88,15400.000000,6,True,True,True,5.0,5,-80.603956,28.608058
88,89,15400.000000,3,True,True,True,5.0,2,-80.577366,28.561857


In [78]:
# Standardize (work on an explicit copy so the assignment does not trigger SettingWithCopyWarning)
numerical_features = ['FlightNumber', 'PayloadMass', 'Flights', 'Reused', 'Legs', 'Block', 'ReusedCount', 'Longitude', 'Latitude']
X = X.copy()
X[numerical_features] = transform.fit_transform(X[numerical_features])
X


Unnamed: 0,FlightNumber,PayloadMass,Flights,GridFins,Reused,Legs,Block,ReusedCount,Longitude,Latitude
0,-1.712912,-5.295263e-17,-0.653913,False,-0.835532,-1.933091,-1.575895,-0.973440,0.411430,-0.417073
1,-1.674419,-1.195232e+00,-0.653913,False,-0.835532,-1.933091,-1.575895,-0.973440,0.411430,-0.417073
2,-1.635927,-1.162673e+00,-0.653913,False,-0.835532,-1.933091,-1.575895,-0.973440,0.411430,-0.417073
3,-1.597434,-1.200587e+00,-0.653913,False,-0.835532,-1.933091,-1.575895,-0.973440,-2.433736,2.433637
4,-1.558942,-6.286706e-01,-0.653913,False,-0.835532,-1.933091,-1.575895,-0.973440,0.411430,-0.417073
...,...,...,...,...,...,...,...,...,...,...
85,1.558942,1.991005e+00,0.174991,True,1.196843,0.517306,0.945537,0.202528,0.409541,-0.395376
86,1.597434,1.991005e+00,1.003894,True,1.196843,0.517306,0.945537,0.202528,0.409541,-0.395376
87,1.635927,1.991005e+00,3.490605,True,1.196843,0.517306,0.945537,1.966480,0.409541,-0.395376
88,1.674419,1.991005e+00,1.003894,True,1.196843,0.517306,0.945537,0.202528,0.411430,-0.417073


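### 3. Use the function <code>train_test_split</code> to split the data <code>X</code> and <code>Y</code> into training and test data. Set the parameter <code>test_size</code> to 0.2 and <code>random_state</code> to 2. The training data and test data should be assigned to the labels <code>X_train</code>, <code>X_test</code>, <code>Y_train</code>, and <code>Y_test</code>.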

In [79]:
# Split the data into training and testing sets
# (random_state=42 is used here rather than the value 2 named in the task)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [80]:
Y_test.shape

(18,)
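
The test-set size follows from the row count: with 90 rows and `test_size=0.2`, 18 rows go to the test set and 72 remain for training. The sketch below reproduces this on placeholder arrays of the same length (`X_demo` and `y_demo` are illustrative names) and shows that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (90)
X_demo = np.arange(180).reshape(90, 2)
y_demo = np.arange(90) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```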

### 4. Create a logistic regression object, then create a <code>GridSearchCV</code> object <code>logreg_cv</code> with <code>cv = 10</code>. Fit the object to find the best parameters from the dictionary <code>parameters</code>.

In [None]:
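parameters = {'C': [0.01, 0.1, 1], 'penalty': ['l2'], 'solver': ['lbfgs']}
lr = LogisticRegression()
logreg_cv = GridSearchCV(lr, parameters, cv=10)  # 10-fold cross-validation over the grid
logreg_cv.fit(X_train, Y_train)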

We output the <code>GridSearchCV</code> object for logistic regression. We display the best parameters using the data attribute <code>best_params_</code> and the accuracy on the validation data using the data attribute <code>best_score_</code>.

In [None]:
print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("accuracy :",logreg_cv.best_score_)
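
The whole search can be sketched end to end on synthetic data. Everything below is an assumption for illustration: the data come from `make_classification` rather than the launch dataset, the names `X_syn`, `y_syn`, `grid`, and `search` are hypothetical, and the penalty is left at the scikit-learn default (L2). The final line scores the refitted best model on the held-out test rows, which is distinct from the cross-validated `best_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the launch data (not the real dataset)
X_syn, y_syn = make_classification(n_samples=90, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Same C grid and solver as in the cell above
grid = {"C": [0.01, 0.1, 1], "solver": ["lbfgs"]}
search = GridSearchCV(LogisticRegression(), grid, cv=10)
search.fit(X_tr, y_tr)

print(search.best_params_)        # grid point with the best mean CV accuracy
print(search.best_score_)         # that mean cross-validated accuracy
print(search.score(X_te, y_te))   # accuracy of the refitted model on test rows
```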