# Practical Assignment - Natural Computing
#### "Predict whether a mammogram mass is benign or malignant"

1. BI-RADS assessment: 1 to 5 (ordinal)  
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6. Severity: benign=0 or malignant=1 (binominal)
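
The integer codes listed above can be mapped back to readable labels when inspecting rows or labeling plots. A minimal sketch, assuming the codes are exactly those in the list; the dictionary names and the two-row frame are hypothetical and used only for illustration:

```python
import pandas as pd

# Code-to-label maps copied from the attribute list above.
SHAPE_LABELS = {1: 'round', 2: 'oval', 3: 'lobular', 4: 'irregular'}
MARGIN_LABELS = {1: 'circumscribed', 2: 'microlobulated', 3: 'obscured',
                 4: 'ill-defined', 5: 'spiculated'}
DENSITY_LABELS = {1: 'high', 2: 'iso', 3: 'low', 4: 'fat-containing'}
SEVERITY_LABELS = {0: 'benign', 1: 'malignant'}

# Hypothetical two-row frame, only to show the mapping.
toy = pd.DataFrame({'Shape': [1, 4], 'Margin': [1, 5],
                    'Density': [3, 3], 'Severity': [0, 1]})

# Replace each coded column with its text label.
readable = toy.assign(
    Shape=toy['Shape'].map(SHAPE_LABELS),
    Margin=toy['Margin'].map(MARGIN_LABELS),
    Density=toy['Density'].map(DENSITY_LABELS),
    Severity=toy['Severity'].map(SEVERITY_LABELS),
)
print(readable)
```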

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy import stats

## Get the Data

In [None]:
# Assuming the raw file has no header row, header=None keeps the first record as data
data = pd.read_csv('mammographic_masses.data.txt', header=None)
data

**Convert missing data (indicated by `?`) into NaN and add the appropriate column names (BI_RADS, Age, Shape, Margin, Density, and Severity)**

In [None]:
data = data.replace('?',np.nan)
data.columns = ['BI_RADS','Age','Shape','Margin','Density','Severity']
data
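
The missing-value conversion and the column naming can also be done while the file is read. A minimal sketch, assuming the file is comma-separated, has no header row, and marks missing values with `?`; the three inline rows are an illustrative sample, not the real file:

```python
import io
import pandas as pd

# Illustrative three-row sample in the same comma-separated layout as the data file.
sample = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n")

columns = ['BI_RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']

# header=None keeps the first row as data; na_values='?' turns '?' into NaN on load.
df = pd.read_csv(sample, header=None, names=columns, na_values='?')
print(df.isnull().sum())
```

Reading the file this way also lets pandas infer numeric dtypes directly, since `?` never reaches the columns as a string.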

**Drop the BI_RADS column, since it is not used as a predictor of severity**

In [None]:
data = data.drop(columns=['BI_RADS'])

**Convert datatype 'object' to 'float64'**

In [None]:
data.info()

In [None]:
data = data.astype(float)
data

In [None]:
data.info()

In [None]:
data.describe()

**Check for missing values**

In [None]:
print(data.isnull().sum(axis=0))
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

**The missing data seem to be randomly distributed, so we drop the rows that contain missing values**

In [None]:
data = data.dropna()
data.index = np.arange(1, len(data) + 1)
data

**TODO: alternatively, we could impute the missing values instead of dropping the rows; look into this**
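
One possible way to fill in the missing values instead of dropping rows is simple per-column imputation. A minimal sketch, assuming the median is acceptable for `Age` and the most frequent code is acceptable for the coded attributes; the toy frame and the `imputed` name are hypothetical, and this is not the approach the notebook actually uses:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same columns as the notebook's data.
toy = pd.DataFrame({
    'Age':      [67.0, 43.0, np.nan, 28.0],
    'Shape':    [3.0, 1.0, 4.0, np.nan],
    'Margin':   [5.0, 1.0, 5.0, 1.0],
    'Density':  [3.0, np.nan, 3.0, 3.0],
    'Severity': [1.0, 1.0, 1.0, 0.0],
})

imputed = toy.copy()

# Age is numeric: fill gaps with the column median.
imputed['Age'] = imputed['Age'].fillna(imputed['Age'].median())

# Coded attributes: fill gaps with the most frequent code (first one on ties).
for col in ['Shape', 'Margin', 'Density']:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed.isnull().sum())
```

Imputation keeps more rows for training, at the cost of introducing values that were never observed.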

In [None]:
data.describe()

In [None]:
print(data.isnull().sum(axis=0))
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

## Exploratory Data Analysis

**Countplot of Severity (benign = 0 vs malignant = 1)**

In [None]:
sns.countplot(x='Severity',data=data)

**Histogram of Age, colored by the Severity column**

In [None]:
sns.set_style('darkgrid')
# 'size' was renamed to 'height' in seaborn 0.9
g = sns.FacetGrid(data, hue="Severity", palette='coolwarm', height=6, aspect=2)
g = (g.map(plt.hist,'Age',bins=20,alpha=0.7)).add_legend()

##### Detect outliers: https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba

### Detect Outliers using Box plot (Uni-variate outlier)

In [None]:
sns.boxplot(x=data['Age'])

In [None]:
sns.boxplot(x=data['Shape'])

In [None]:
sns.boxplot(x=data['Margin'])

In [None]:
sns.boxplot(x=data['Density'])

### Detect Outliers using Scatter plot (Multi-variate outlier)

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(data['Age'], data['Shape'])
ax.set_xlabel('Age')
ax.set_ylabel('Shape')
#ax.set_ylabel('Margin')
#ax.set_ylabel('Density')
plt.show()

### Detect outliers using mathematical function Z-Score

In [None]:
z = np.abs(stats.zscore(data))
threshold = 3
print(np.where(z > threshold))
# The first array holds the row positions and the second array the corresponding column positions

All detected outliers fall in column index 3 (Density).
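
One plausible reason a coded column such as Density gets flagged is that it is nearly constant, so a rare code lies many standard deviations from the mean. A minimal sketch with made-up values, computing the Z-score by hand with the same definition `scipy.stats.zscore` uses by default (population standard deviation):

```python
import numpy as np

# Hypothetical column: code 3 nineteen times and a single code 1.
values = np.array([3.0] * 19 + [1.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = np.abs((values - values.mean()) / values.std())

# Positions whose absolute Z-score exceeds the threshold used in the notebook.
threshold = 3
outlier_rows = np.where(z > threshold)[0]
print(outlier_rows)
```

Only the single rare code exceeds the threshold here, even though it is a valid category rather than a measurement error.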

### Detect outliers using IQR Score
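The IQR score flags outliers in a similar way to the Z-score, but uses the interquartile range instead of the standard deviation.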

In [None]:
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
iqr = Q3 - Q1
print(iqr)

In [None]:
# Not a fan of this output...
# True marks values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
print((data < (Q1 - 1.5 * iqr)) | (data > (Q3 + 1.5 * iqr)))
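
A full boolean frame is hard to read, so the IQR rule can be summarized per column and then used to filter rows. A minimal sketch on a hypothetical toy frame; the names `mask` and `cleaned` are illustrative, and the notebook itself removes outliers with the Z-score instead:

```python
import pandas as pd

# Hypothetical toy frame with one obviously extreme Age value.
toy = pd.DataFrame({'Age': [40.0, 45.0, 50.0, 55.0, 60.0, 120.0],
                    'Margin': [1.0, 1.0, 4.0, 4.0, 5.0, 5.0]})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True where a value lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column.
mask = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# Number of flagged values per column.
print(mask.sum())

# Keep only the rows without any flagged value.
cleaned = toy[~mask.any(axis=1)]
```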

### Remove Outliers using Z-Score

##### + explanations: https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame

In [None]:
# Run this cell only once
data = data[(np.abs(stats.zscore(data)) < 3).all(axis=1)]
data.index = np.arange(1, len(data) + 1)
data

### Converting pandas dataframes to numpy arrays

In [None]:
X_train = data.drop('Severity',axis=1).to_numpy()
y_train = data['Severity'].to_numpy()

### Normalizing the attribute data using StandardScaler

In [None]:
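scaler = StandardScaler()
scaler.fit(X_train)
# Keep the scaled values; transform() returns a new array and does not modify X_train in place
X_train = scaler.transform(X_train)
X_train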
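
With its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. A minimal sketch of the same computation in plain NumPy, using a made-up two-column array (the variable names are illustrative):

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales.
X = np.array([[67.0, 3.0],
              [43.0, 1.0],
              [58.0, 4.0],
              [28.0, 1.0]])

# Per-column mean and population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: every column ends up with mean 0 and standard deviation 1.
X_scaled = (X - mean) / std
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Putting all attributes on a comparable scale keeps a wide-range feature such as Age from dominating the small-range coded attributes during training.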

## Neural Networks