# Case Description


This project aims to build a classifier that supports the diagnosis of breast cancer patients, using the Breast Cancer Wisconsin (Diagnostic) dataset. The dataset contains features extracted from digitized images of fine needle aspirates (FNA) of breast masses. The classifier is built with Python's scikit-learn library and uses the k-nearest neighbors (KNN) model to classify samples as malignant or benign.


## Project Objectives


* **Dataset exploration:** Understand the structure and characteristics of the Breast Cancer Wisconsin (Diagnostic) dataset, including the number of features and their descriptions.
* **Data conversion:** Practice converting the scikit-learn dataset to a pandas DataFrame to make the data easier to handle and analyze.
* **Class distribution:** Analyze the distribution of classes in the dataset, determining how many samples are malignant and how many are benign.
* **Data preparation:** Separate the dataset into features (X) and labels (y), and split them into training and test sets for model validation.
* **Model training:** Train a KNN classifier by fitting it to the training set.
* **Model evaluation:** Evaluate the accuracy of the model on the test set to determine how well it classifies new samples.
* **Prediction and analysis:** Make predictions on the samples and analyze the results, including predicting the label for a sample made of the mean value of each feature.


This document details the steps needed to reach these objectives, from loading the data to evaluating the model, providing a framework for machine-assisted breast cancer diagnosis.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR) 

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - 

In [65]:
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
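The keys listed above can be inspected directly. As a minimal sketch (the variable name `dataset` is illustrative), the following prints the array shapes, the class names, and the feature names at positions 0, 10, and 20, which the dataset description identifies as the mean, standard-error, and worst radius.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Feature matrix (samples x features) and label vector
print(dataset.data.shape)
print(dataset.target.shape)

# Class names: label 0 maps to the first entry, label 1 to the second
print(list(dataset.target_names))

# Feature names come in three blocks of ten: mean, error, worst
for position in (0, 10, 20):
    print(position, dataset.feature_names[position])
```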

### Question 0 (Example)
How many features does the breast cancer dataset have?

In [68]:
def answer_zero():
    return cancer.data.shape[1]
print(answer_zero())

30


### Question 1

Scikit-learn works with lists, NumPy arrays, SciPy sparse matrices, and pandas DataFrames, so converting the dataset to a DataFrame is not necessary for training this model. A DataFrame does, however, make many tasks easier, such as data munging, so let's practice creating a classifier with a pandas DataFrame.



Convert the scikit-learn dataset `cancer` to a DataFrame.

*This function should return a `(569, 31)` DataFrame with*

*columns =*

    ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension',
    'radius error', 'texture error', 'perimeter error', 'area error',
    'smoothness error', 'compactness error', 'concavity error',
    'concave points error', 'symmetry error', 'fractal dimension error',
    'worst radius', 'worst texture', 'worst perimeter', 'worst area',
    'worst smoothness', 'worst compactness', 'worst concavity',
    'worst concave points', 'worst symmetry', 'worst fractal dimension',
    'target']

*and index =*

    RangeIndex(start=0, stop=569, step=1)

In [71]:
def answer_one():
    # Create a DataFrame from the data
    df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
    
    # Add the target to the DataFrame
    df['target'] = cancer.target
    
    return df
print(answer_one().shape)

(569, 31)
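As an alternative sketch, and assuming a scikit-learn version that supports the `as_frame` argument of `load_breast_cancer` (the `frame` key in the output of `cancer.keys()` suggests this one does), the loader can build the same 31-column DataFrame itself. The name `cancer_frame` is illustrative.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects
cancer_frame = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns followed by the target column
df = cancer_frame.frame
print(df.shape)
print(df.columns[-1])
```

Building the DataFrame by hand, as in `answer_one`, is still useful practice because it makes explicit where each column comes from.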


### Question 2
What is the class distribution? (i.e. how many instances of `malignant` and how many `benign`?)

*This function should return a Series named `target` of length 2 with integer values and index =* `['malignant', 'benign']`

In [74]:
def answer_two():
    # Count how many samples carry each numeric label (0 = malignant, 1 = benign)
    class_counts = pd.Series(cancer.target).value_counts()

    # Re-label the counts with the class names, in the requested order,
    # and name the Series `target` as the question asks
    target = pd.Series(
        [class_counts[0], class_counts[1]],
        index=['malignant', 'benign'],
        name='target',
    )

    return target

print(answer_two())


malignant    212
benign       357
Name: target, dtype: int64
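The same counts can be obtained by mapping the numeric labels to class names first. This is a minimal sketch, not the graded answer; it also prints class proportions, which the question does not ask for.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Indexing target_names with the 0/1 label array yields one class name per sample
labels = pd.Series(cancer.target_names[cancer.target], name='target')

counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)

print(counts)
print(proportions.round(3))
```

The benign share (357 of 569, roughly 0.627) is a useful baseline: a classifier that always predicts `benign` would reach that accuracy on the full dataset.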


### Question 3
Split the DataFrame into `X` (the data) and `y` (the labels).

*This function should return a tuple of length 2:* `(X, y)`*, where* 
* `X` *has shape* `(569, 30)`
* `y` *has shape* `(569,)`.

In [77]:
def answer_three():
    # Start from the DataFrame built in Question 1
    cancerdf = answer_one()

    # X holds the 30 feature columns (everything except the target)
    X = cancerdf.drop(columns='target')

    # y holds the target labels
    y = cancerdf['target']

    return (X, y)

X, y = answer_three()
print(X.shape)
print(y.shape)

(569, 30)
(569,)


### Question 4
Using `train_test_split`, split `X` and `y` into training and test sets `(X_train, X_test, y_train, and y_test)`.

**Set the random number generator state to 0 using `random_state=0` to make sure your results match the autograder!**

*This function should return a tuple of length 4:* `(X_train, X_test, y_train, y_test)`*, where* 
* `X_train` *has shape* `(426, 30)`
* `X_test` *has shape* `(143, 30)`
* `y_train` *has shape* `(426,)`
* `y_test` *has shape* `(143,)`

In [100]:
def answer_four():
    from sklearn.model_selection import train_test_split

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    return (X_train, X_test, y_train, y_test)


X_train, X_test, y_train, y_test = answer_four()
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(426, 30)
(143, 30)
(426,)
(143,)
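With the default `test_size` of 0.25, the 569 samples split into 426 for training and 143 for testing. The sketch below adds `stratify=y`, which is an assumption beyond what the exercise asks and produces a different split from the one above; it keeps the class proportions nearly equal in both partitions. The `_s` suffixes are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y is an addition not used in the notebook's answer
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train_s.shape, X_test_s.shape)

# Share of benign samples (label 1) in each partition
print(y_train_s.mean(), y_test_s.mean())
```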


### Question 5
Using KNeighborsClassifier, fit a k-nearest neighbors (knn) classifier with `X_train`, `y_train` and using one nearest neighbor (`n_neighbors = 1`).

*This function should return a `sklearn.neighbors.KNeighborsClassifier`.*

In [102]:
from sklearn.neighbors import KNeighborsClassifier


def answer_five():
    # Initialize KNeighborsClassifier with one nearest neighbor
    knn = KNeighborsClassifier(n_neighbors=1)

    # Fit the classifier with the training data
    knn.fit(X_train, y_train)

    return knn

In [104]:
knn_model = answer_five()
print(knn_model)

KNeighborsClassifier(n_neighbors=1)
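With `n_neighbors=1`, the classifier assigns each sample the label of the single closest training row. The following sketch reproduces that rule by hand with Euclidean distance (the default metric of `KNeighborsClassifier`) and compares it with the library's predictions. The helper `nearest_neighbor_label` is hypothetical and exists only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

def nearest_neighbor_label(sample):
    """Hypothetical helper: label of the closest training row (Euclidean distance)."""
    distances = np.linalg.norm(X_train - sample, axis=1)
    return y_train[np.argmin(distances)]

# Compare the manual rule with scikit-learn on the first ten test samples
manual = np.array([nearest_neighbor_label(row) for row in X_test[:10]])
print(manual)
print(knn.predict(X_test[:10]))
```

Because the distance is computed on raw feature values, features with large ranges (such as area) dominate the result; the notebook does not scale the features.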


### Question 6
Using your knn classifier, predict the class label using the mean value for each feature.

Hint: You can use `cancerdf.mean()[:-1].values.reshape(1, -1)`, which gets the mean value for each feature, ignores the target column, and reshapes the data from 1 dimension to 2 (necessary for the `predict` method of `KNeighborsClassifier`).

In [106]:
def answer_six(knn_model, X_train):
    # Mean value of each feature, computed here over the training set,
    # reshaped to a single row because predict() expects 2-D input
    mean_values = X_train.mean().values.reshape(1, -1)

    # Use the trained model to predict the class of this "average" sample
    prediction = knn_model.predict(mean_values)
    return prediction[0]


# Pass the trained model and the training set to the function
predicted_class = answer_six(knn_model, X_train)
print(predicted_class)

1
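The reshape step in the hint can be seen in isolation. This sketch rebuilds the model on NumPy arrays (an assumption made to keep the example self-contained), shows the shape before and after `reshape(1, -1)`, and translates the numeric label into its class name through `target_names`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Averaging over rows gives a 1-D array with one entry per feature
feature_means = X_train.mean(axis=0)
print(feature_means.shape)

# predict() expects a 2-D array of shape (n_samples, n_features)
single_sample = feature_means.reshape(1, -1)
print(single_sample.shape)

# Translate the numeric label into its class name
label = knn.predict(single_sample)[0]
print(label, cancer.target_names[label])
```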




### Question 7
Using your knn classifier, predict the class labels for the test set `X_test`.

*This function should return a numpy array with shape `(143,)` and values either `0` or `1`.*

In [108]:
def answer_seven():
    #Use the trained KNN model to predict the class labels for the test set
    predictions = knn_model.predict(X_test)
    return predictions

predicted_classes = answer_seven()
print(predicted_classes.shape)
print(predicted_classes)

(143,)
[1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 0 1
 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1 1
 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1
 0 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0]
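A raw array of labels is hard to read at a glance. One way to summarize it, sketched here with `sklearn.metrics.confusion_matrix` (not part of the original exercise), is to tabulate predictions against the true test labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
predictions = knn.predict(X_test)

# Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign)
matrix = confusion_matrix(y_test, predictions, labels=[0, 1])
print(matrix)
```

The off-diagonal entries separate the two kinds of error: malignant samples predicted as benign, and benign samples predicted as malignant.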


### Question 8
Find the score (mean accuracy) of your knn classifier using `X_test` and `y_test`.

*This function should return a float between 0 and 1*

In [110]:
def answer_eight():
    # score() returns the mean accuracy on the given test data and labels
    accuracy = knn_model.score(X_test, y_test)
    return accuracy

# Get the accuracy of the model
test_accuracy = answer_eight()
print(test_accuracy)

0.916083916083916
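For a classifier, `score` is the fraction of test samples whose predicted label matches the true label. The sketch below computes that fraction by hand and prints it next to the value from `score`; it rebuilds the model on NumPy arrays so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
predictions = knn.predict(X_test)

# Fraction of test samples whose prediction equals the true label
manual_accuracy = np.mean(predictions == y_test)
library_accuracy = knn.score(X_test, y_test)

print(manual_accuracy)
print(library_accuracy)
```

Both values should agree with each other, and with the same split as the notebook they would be expected to match the score printed above.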
