# IDENTIFYING THE PROBLEM AND INSPECTING THE DATA

Breast cancer is a disease in which abnormal breast cells multiply uncontrollably and form tumors that, if left untreated, can spread throughout the body and become fatal. Cancerous cells begin to develop within the milk ducts or the milk-producing lobules of the breast. The earliest form (stage 0, in situ) is not life-threatening. Cancer cells can then spread into nearby breast tissue (invasion), producing lumps or thickening. Invasive cancers can spread to nearby lymph nodes or other organs (metastasis), and metastasis can be fatal.

Survival rates began to improve in the 1990s, when countries introduced early detection programs linked to comprehensive treatment programs that included effective drug therapies. In 2020, breast cancer was diagnosed in 2.3 million women worldwide, and 685,000 died from the disease. By the end of the same year, 7.8 million women who had been diagnosed with breast cancer in the previous five years were still alive, making it the most prevalent cancer in the world.

**Expected Outcome**

The dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image. The classification goal is to predict whether the patient's diagnosis is malignant or benign.

**Objective**

Because the labels are discrete and take one of two values (the patient has a malignant tumor or the patient has a benign tumor), this is a binary classification problem.

**Identify the data sources**

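The dataset includes 569 records and 32 attributes (ID, diagnosis and 30 real-valued input features). Ten real-valued measurements are computed for each cell nucleus, and the mean, standard error and "worst" (largest) value of each measurement are recorded per image, which gives the 30 input features. The ten measurements are: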
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
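
As a quick check on the arithmetic, the 30 feature columns can be rebuilt from the ten measurements listed above. This is a minimal sketch: the suffixes `_mean`, `_se` and `_worst` and the spelling of each name are taken from the column listing printed later in the notebook, while the variable names are chosen here for illustration.

```python
# Ten base measurements, spelled as they appear in the column names.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three summary statistics are stored for every measurement.
statistics = ["mean", "se", "worst"]

# All "_mean" columns come first, then "_se", then "_worst".
feature_columns = [
    f"{feature}_{stat}" for stat in statistics for feature in base_features
]

print(len(feature_columns))   # 10 measurements x 3 statistics
print(feature_columns[:3])
```

Comparing such a list against `breast_cancer.columns` is one way to confirm that a downloaded copy of the file has the expected layout.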

**Loading the libraries**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
from sklearn.metrics import log_loss
from sklearn.utils import resample
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

We start by loading the dataset into a DataFrame.

In [2]:
breast_cancer = pd.read_csv(r"C:\Users\maria\Desktop\proyecto cancer de mama\breast-cancer-wisconsin-data_data.csv")
print(len(breast_cancer))
breast_cancer.head()

569


,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
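
The loading step depends on a path that exists only on one machine. The sketch below shows one possible way to wrap the read in a small check so that a wrong file fails early. The helper name `load_measurements` and its `required` parameter are hypothetical, and an in-memory CSV built from three rows of the preview above stands in for the real file.

```python
import io

import pandas as pd

# Stand-in for the CSV file: three rows and four columns copied from the preview above.
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean\n"
    "842302,M,17.99,10.38\n"
    "842517,M,20.57,17.77\n"
    "84300903,M,19.69,21.25\n"
)


def load_measurements(source, required=("id", "diagnosis")):
    """Hypothetical helper: read a CSV and fail early if key columns are missing."""
    frame = pd.read_csv(source)
    missing = [column for column in required if column not in frame.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return frame


sample = load_measurements(sample_csv)
print(sample.shape)
```

`pd.read_csv` accepts a file path as well as a file-like object, so the same helper could be called with the path used in cell `In [2]`.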


Let's examine the different columns:

In [3]:
breast_cancer.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

Let's take a look at the properties of the dataset.

In [4]:
breast_cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

The DataFrame contains 32 columns: an identifier ('id'), 30 input features, and one target variable called 'diagnosis', which tells us whether a tumor is malignant or benign:

- 'M' if the tumor is malignant.
- 'B' if the tumor is benign.
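
Before modeling, it usually helps to count the records in each class and to turn the letters into numbers. The sketch below uses a handful of made-up labels, not the real class counts, and a plain dictionary mapping; the choice of 0 for benign and 1 for malignant is an assumption made here for illustration. The `LabelEncoder` imported earlier could serve the same purpose.

```python
import pandas as pd

# Toy labels for illustration only; these are not the real class counts.
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# Count how many records fall in each class.
counts = diagnosis.value_counts()

# Map the letters to integers: 0 = benign, 1 = malignant (an assumed convention).
encoded = diagnosis.map({"B": 0, "M": 1})

print(counts.to_dict())
print(encoded.tolist())
```

On the real data, the same two calls would be applied to `breast_cancer["diagnosis"]`.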

Let's check for missing values:

In [5]:
breast_cancer.isnull().sum().sort_values()

id                         0
concave points_worst       0
concavity_worst            0
compactness_worst          0
smoothness_worst           0
area_worst                 0
perimeter_worst            0
texture_worst              0
radius_worst               0
fractal_dimension_se       0
symmetry_se                0
concave points_se          0
concavity_se               0
compactness_se             0
smoothness_se              0
area_se                    0
perimeter_se               0
texture_se                 0
radius_se                  0
fractal_dimension_mean     0
symmetry_mean              0
concave points_mean        0
concavity_mean             0
compactness_mean           0
smoothness_mean            0
area_mean                  0
perimeter_mean             0
texture_mean               0
radius_mean                0
diagnosis                  0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

No column has missing values, so we can proceed with the exploratory data analysis.
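
For comparison, here is how the same check behaves when a value is actually missing. This is a minimal sketch on a toy frame with one deliberately blanked entry; the frame and the variable names are illustrative and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value; not taken from the real dataset.
toy = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 19.69],
    "texture_mean": [10.38, 17.77, 21.25],
})

# Number of missing entries per column, as in the check above.
missing_per_column = toy.isnull().sum()

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().any().any())

print(missing_per_column.to_dict())
print(has_missing)
```

A single boolean like `has_missing` is convenient as a guard at the top of a notebook, while the per-column counts show where any cleaning would be needed.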