The "Loading Data" section of this URL:

https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python

explains how to load a dataset:

1. Use that same approach to load the data.
2. Clean the data if necessary.
3. Draw with plotly whichever charts you think are needed to understand the data.
4. Use the classification methods seen so far to classify the target of the data. Which one gives the best results?
5. Try to beat your own score by changing the features of the algorithms.

In [66]:
# Import the necessary libraries
import pandas as pd 
import numpy as np 

from sklearn import metrics
from sklearn.model_selection import train_test_split

import plotly.express as px
import plotly.graph_objs as go 

In [7]:
# Import sklearn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()

### Exploring data

In [17]:
# Feature names
print('Features:\n\n', cancer.feature_names)
print('-----------')

# Label type of cancer ('malignant', 'benign') -- > Target
print('\nLabels:', cancer.target_names)
print('-----------')

# Check the shape of the dataset
print('\nDataset shape:', cancer.data.shape)
print('-----------')

# First 3 feature records
print('\nTop 3 records:\n\n', cancer.data[0:3])
print('-----------')

# Cancer labels (0: malignant, 1: benign)
print('\nCancer labels:', cancer.target)

Features:

 ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
-----------

Labels: ['malignant' 'benign']
-----------

Dataset shape: (569, 30)
-----------

Top 3 records:

 [[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+0

### Passing it to a pandas.DataFrame

In [63]:
cancer_df = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns= np.append(cancer['feature_names'], ['target'])) 
cancer_df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0.0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0.0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0.0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0.0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0.0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0.0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0.0
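
As an alternative to assembling the DataFrame by hand with `np.c_`, the loader can return pandas objects directly. This is a minimal sketch, assuming scikit-learn 0.23 or newer (where the `as_frame` parameter is available); the names `cancer_frame` and `alt_df` are introduced here only for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks the loader to return pandas objects instead of NumPy arrays
cancer_frame = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final 'target' column
alt_df = cancer_frame.frame

print(alt_df.shape)
print(list(alt_df.columns[-3:]))
```

One difference from the `np.c_` approach is that the target column is not forced to float here, because the features and the target are never stacked into a single NumPy array.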


In [64]:
def df_info(df):
    '''
    @leosanchezsoler
    The function provides all the relevant info of a pandas.DataFrame
        Arguments:
            - df: a pandas.DataFrame
        Prints:
            - df.shape[0]: number of rows
            - df.shape[1]: number of columns
            - df.columns: the names of the dataset columns
            - df.info(): basic info about the dataset
            - df.isna().sum(): NaN values per column
    '''
    print('####\nDATAFRAME INFO\n####')
    print('\nNumber of rows:', df.shape[0])
    print('Number of columns:', df.shape[1])
    print('\n#### DATAFRAME COLUMNS ####\n', df.columns, '\n')
    print('### DATAFRAME COLUMN TYPES ###\n')
    df.info()  # info() prints its own report and returns None, so wrapping it in print() would also print 'None'
    print('\n### TOTAL NaN VALUES ###\n', '\n', df.isna().sum())

df_info(cancer_df)

####
DATAFRAME INFO
####

Number of rows: 569
Number of columns: 31

#### DATAFRAME COLUMNS ####
 Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object') 

### DATAFRAME COLUMN TYPES ###

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   -

In [61]:
list(cancer_df.columns)

['mean radius',
 'mean texture',
 'mean perimeter',
 'mean area',
 'mean smoothness',
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'mean symmetry',
 'mean fractal dimension',
 'radius error',
 'texture error',
 'perimeter error',
 'area error',
 'smoothness error',
 'compactness error',
 'concavity error',
 'concave points error',
 'symmetry error',
 'fractal dimension error',
 'worst radius',
 'worst texture',
 'worst perimeter',
 'worst area',
 'worst smoothness',
 'worst compactness',
 'worst concavity',
 'worst concave points',
 'worst symmetry',
 'worst fractal dimension',
 'target']

### Splitting data

In [38]:
# X --> Features
X = cancer_df.iloc[:, :-1]

# y --> Target
y = cancer_df.target

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=155)
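
The split above is purely random, so the proportion of malignant and benign samples may differ slightly between the train and test sets. The sketch below shows the optional `stratify` argument of `train_test_split`, which keeps the class proportions similar in both parts. It reloads the data so that it runs on its own; the `_s` suffixed names are hypothetical and do not appear in the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the data so this sketch is self-contained
X_all, y_all = load_breast_cancer(return_X_y=True)

# stratify=y_all keeps the benign/malignant ratio similar in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_all, y_all, test_size=0.2, random_state=155, stratify=y_all
)

# The target is 0/1, so the mean is the share of benign samples
print('Benign share in train:', y_train_s.mean())
print('Benign share in test: ', y_test_s.mean())
```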

### Generating Model


In [92]:
# Let's build a SVM

# Import the svm module
from sklearn import svm

# Create a svm classifier (note: gamma has no effect with the linear kernel)
clf = svm.SVC(kernel='linear', C=100, gamma=10) # Linear Kernel

# Train the model
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

print('y_pred:\n', y_pred)

y_pred:
 [1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1.
 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1.
 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 1. 1.]


### Visualizing data

In [88]:
#Visualizing the data
fig = px.scatter_3d(cancer_df, x='mean radius', y='mean perimeter', z='mean texture', color='target')
fig

fig.update_traces(marker=dict(size=5))

### Evaluate the model

In [80]:
# Model Score: how did the model perform on the train set?
print('Score:', clf.score(X_train, y_train))

# Model Accuracy: how often is the Classifier correct?
print('\nAccuracy:', metrics.accuracy_score(y_test, y_pred))

# Further Evaluation
# -------------------

# Precision and recall both treat label 1 (benign) as the positive class
# -------------------

# Model Precision: of the samples predicted as positive, what fraction are actually positive?
print('\nPrecision:', metrics.precision_score(y_test, y_pred))

# Model Recall: of the samples that are actually positive, what fraction are predicted as positive?
print('\nRecall:', metrics.recall_score(y_test, y_pred))

Score: 1.0

Accuracy: 0.6929824561403509

Precision: 0.6929824561403509

Recall: 1.0


In [93]:
metrics.confusion_matrix(y_test, y_pred)

array([[29,  6],
       [ 2, 77]])
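
The metrics can be derived by hand from this confusion matrix, which helps to see how precision and recall differ. In scikit-learn the rows are the true labels and the columns are the predicted labels, so with label 1 (benign) as the positive class the layout is `[[TN, FP], [FN, TP]]`. The sketch below is plain arithmetic on the four numbers printed above.

```python
# Confusion matrix printed above: rows = true labels, columns = predicted labels
tn, fp = 29, 6   # true malignant (0): 29 predicted malignant, 6 predicted benign
fn, tp = 2, 77   # true benign (1): 2 predicted malignant, 77 predicted benign

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of the samples predicted benign, how many are benign
recall = tp / (tp + fn)        # of the truly benign samples, how many were found

print('Accuracy: ', round(accuracy, 4))   # 0.9298
print('Precision:', round(precision, 4))  # 0.9277
print('Recall:   ', round(recall, 4))     # 0.9747
```

These values differ from the 0.693 printed by the evaluation cell above. Judging by the cell numbers (`In [80]` ran before `In [92]`), that printed output may come from a different fit of `clf`; this is an inference from the execution order, not something the notebook states.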

## USING RBF KERNEL

In [100]:
# Create a svm classifier
clf = svm.SVC(kernel='rbf', C=10000, gamma=15) # Radial Basis Function Kernel

# Train the model
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
print('y_pred:\n\n', y_pred)

y_pred:

 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [101]:
# Model Score: how did the model perform on the train set?
print('Score:', clf.score(X_train, y_train))

# Model Accuracy: how often is the Classifier correct?
print('\nAccuracy:', metrics.accuracy_score(y_test, y_pred))

# Further Evaluation
# -------------------

# Precision and recall both treat label 1 (benign) as the positive class
# -------------------

# Model Precision: of the samples predicted as positive, what fraction are actually positive?
print('\nPrecision:', metrics.precision_score(y_test, y_pred))

# Model Recall: of the samples that are actually positive, what fraction are predicted as positive?
print('\nRecall:', metrics.recall_score(y_test, y_pred))

Score: 1.0

Accuracy: 0.6929824561403509

Precision: 0.6929824561403509

Recall: 1.0


In [102]:
metrics.confusion_matrix(y_test, y_pred)

array([[ 0, 35],
       [ 0, 79]])
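
Every test sample is predicted as class 1. One plausible explanation, offered here as an assumption rather than a verified diagnosis, is the combination of `gamma=15` with unscaled features. The RBF kernel is `exp(-gamma * ||x - x'||^2)`, and columns such as `mean area` take values in the hundreds or thousands, so the squared distance between two different samples is very large. The sketch below evaluates the kernel by hand for the first two records of the dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Raw, unscaled feature matrix (569 samples, 30 features)
X_raw = load_breast_cancer().data

gamma = 15  # same value as in the RBF model above

# Squared Euclidean distance between the first two samples
sq_dist = np.sum((X_raw[0] - X_raw[1]) ** 2)

# RBF kernel value for this pair
kernel_value = np.exp(-gamma * sq_dist)

print('Squared distance:', sq_dist)
print('Kernel value:', kernel_value)
```

With a squared distance this large the exponential underflows to zero. If the same happens between test samples and the support vectors, the decision function reduces to its constant term and every test sample receives the same label, while each training sample still matches itself perfectly (kernel value 1), which would be consistent with a train score of 1.0.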

## As we can see here, the RBF kernel model fits the train set perfectly, but it does not predict the test set successfully, unlike the linear SVC
- The confusion matrix shows that the RBF model assigns every test sample to class 1 (benign), so its test accuracy (79/114 ≈ 0.693) is simply the share of benign samples in the test set.
- The RBF kernel has a score of 1 on the train set but does not perform that way on the test set, which points to overfitting. A likely cause is the very large `gamma=15` applied to unscaled features.
    - After changing some hyperparameters in an attempt to increase the score, there was no relevant change in the score.
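
A possible next step, sketched here under assumptions that go beyond the original notebook, is to standardize the features before fitting and to compare several classifiers on the same split. The choice of models, the default hyperparameters (`gamma='scale'`, `C=1`), and the names `candidates` and `results` are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Reload the data so this sketch is self-contained
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=155
)

# Hypothetical set of candidate models, all with near-default settings
candidates = {
    'SVC (linear)': SVC(kernel='linear'),
    'SVC (rbf)': SVC(kernel='rbf'),  # gamma defaults to 'scale'
    'Logistic regression': LogisticRegression(max_iter=1000),
    'k-nearest neighbours': KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in candidates.items():
    # StandardScaler is fitted on the train set only, inside the pipeline
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    results[name] = pipeline.score(X_test, y_test)

# Print the test accuracies from best to worst
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.4f}')
```

Putting the scaler inside the pipeline keeps the test set out of the scaling statistics. A single train/test split gives a noisy ranking, so cross-validation would be a more reliable way to answer which method performs best.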