# Project 1

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset.

This dataset contains 961 instances of masses detected in mammograms, each described by the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features we will build our model with, and "severity" is the classification we will attempt to predict from those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" attribute, for example, is ordered increasingly from round to irregular.
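
If the ordinal approximation ever proves too crude, one-hot encoding is a common alternative for nominal attributes. The sketch below is not part of this project's pipeline; it uses NumPy only, and the helper name `one_hot` and the sample codes are assumptions made for illustration.

```python
import numpy as np

def one_hot(codes, num_categories):
    """Turn 1-based integer category codes into one-hot rows (hypothetical helper)."""
    codes = np.asarray(codes, dtype=int)
    # Row i of the identity matrix is the one-hot vector for category i + 1
    return np.eye(num_categories, dtype=int)[codes - 1]

# Made-up "shape" codes: round=1, oval=2, lobular=3, irregular=4
shape_codes = [1, 4, 2]
encoded = one_hot(shape_codes, num_categories=4)
print(encoded)
```

Each row then contains a single 1 in the column of its category, so the network no longer sees an implied ordering between categories.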

A lot of unnecessary anguish and surgery arises from false positives in mammogram results.

Build a better way to interpret them through supervised machine learning.

## Your assignment

Apply a supervised artificial neural network (ANN) to this data set and validate the model with K-fold cross-validation (K=10).

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Optimization techniques such as genetic algorithms provide a means of tuning a model's "hyperparameters". Once you identify a promising approach, see if you can make it even better by tuning its hyperparameters.

The steps below outline the development of this project, with some guidance and hints. If you're up for a real challenge, try doing this project from scratch in a new, clean notebook!


## Let's begin: prepare your data

Start by importing the mammographic_masses.data.txt file into a Pandas dataframe (hint: use read_csv) and take a look at it.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# '?' marks a missing value in the raw file; read it as NaN
data = pd.read_csv("mammographic_masses.data.txt", sep=",", header=None, na_values='?').astype('float64')

Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):

In [None]:
data.columns = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]

Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [None]:
data.describe()
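
The min and max rows printed by describe() are a quick way to spot implausible values. As a complementary check, the interquartile-range (IQR) rule can flag candidates automatically. This is only a sketch on made-up numbers: the helper name `iqr_outlier_mask` and the threshold `k=1.5` are assumptions, not something this project prescribes.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged
    return (values < lower) | (values > upper)

# Made-up scores with one implausible entry and one missing value
sample = np.array([4, 5, 4, 3, 5, 4, 55, np.nan, 4, 5])
mask = iqr_outlier_mask(sample)
print(sample[mask])
```

A flagged value is only a candidate; whether it is a data-entry error or a genuine extreme still has to be judged against the attribute's documented range.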

There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in.

In [None]:
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

#sb.pairplot(data,hue='Severity')
print(data.isna().sum(axis=0))
sb.heatmap(data.isnull(), cbar=False)
data.info()
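
One rough way to check whether missing values are related to the outcome is to compare the malignancy rate of complete rows with that of rows that have at least one missing field. The sketch below runs on a made-up toy frame, not on the real data; the column values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows, not the real dataset
toy = pd.DataFrame({
    "Age": [45, np.nan, 60, 52, np.nan, 70],
    "Density": [3, 3, np.nan, 2, 3, 3],
    "Severity": [0, 0, 1, 0, 1, 1],
})

# True for every row that has at least one missing value
has_missing = toy.isna().any(axis=1)

# Share of malignant cases among complete rows vs. rows with gaps
rate_complete = toy.loc[~has_missing, "Severity"].mean()
rate_missing = toy.loc[has_missing, "Severity"].mean()
print("complete rows:", rate_complete, "rows with missing data:", rate_missing)
```

If the two rates differ substantially on the real data, dropping the incomplete rows would shift the class balance, which is the kind of bias the paragraph above warns about.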

If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().

In [None]:
data = data.dropna()

Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [None]:
X = data.drop(['BI-RADS','Severity'], axis=1)
Y = data['Severity']
X=X.reset_index(drop=True)
Y=Y.reset_index(drop=True)

In [None]:
names = data.columns.drop(['BI-RADS','Severity'])

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().

In [None]:
from sklearn.preprocessing import StandardScaler
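scaler = StandardScaler()
# fit_transform learns the column means/stds and scales X in one step,
# so a separate scaler.fit(X) call is not needed
scaled_features = scaler.fit_transform(X)
df_feat = pd.DataFrame(scaled_features,columns=names)
X=df_feat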
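
For intuition, standardization rescales each column to zero mean and unit variance via z = (x - mean) / std, where std is the population standard deviation. The following sketch reproduces that by hand with NumPy on made-up ages; it is an illustration of the formula, not a replacement for the cell above.

```python
import numpy as np

# Made-up ages, one feature column
ages = np.array([[30.0], [45.0], [60.0], [75.0]])

mean = ages.mean(axis=0)
std = ages.std(axis=0)  # population standard deviation (ddof=0)
scaled = (ages - mean) / std

print(scaled.ravel())
```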

## Neural Networks

You can use TensorFlow to set up a neural network with 1 binary output neuron and see how it performs. Don't be afraid to run a large number of epochs to train the model if necessary. As a bonus, try to optimize this model's hyperparameters using a genetic algorithm (GA).

In [None]:
import tensorflow as tf
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
data.columns

In [None]:
age = tf.feature_column.numeric_column("Age")
shape = tf.feature_column.numeric_column("Shape")
margin = tf.feature_column.numeric_column("Margin")
density = tf.feature_column.numeric_column("Density")

feat_cols = [age,shape,margin,density]

## Functions for hyperparameter tuning

In [None]:
#Function that creates the initial population
#parameters=[learning_rate,num_nodes,num_hidden_l,activation_fun]
def create_new_population ():
    
    population=[]
    
    for i in range(10):
        cromo=[]
        cromo.append(np.random.uniform(low=10**-2, high=10**-1))
        cromo.append(np.random.choice([1, 2, 4, 8, 16, 32, 64, 128, 256]))
        cromo.append(np.random.randint(low=1, high=20))
        # high is exclusive, so high=3 is needed for code 2 (leaky_relu) to be drawn
        cromo.append(np.random.randint(low=0, high=3))
        population.append(cromo)
        
    return np.array(population)
        


#Generic function that updates the classifier's arguments
#parameters=[learning_rate,num_nodes,num_hidden_l,activation_fun]
def update_classifier_parameters (parameters):
    h_u=[]
    for i in range(int(parameters[2])):
        h_u.append(parameters[1])
    
    if((parameters[3]) == 0): a_f = tf.nn.softmax
    if((parameters[3]) == 1): a_f = tf.nn.relu
    if((parameters[3]) == 2): a_f = tf.nn.leaky_relu
    
    classifier = tf.estimator.DNNClassifier(hidden_units=h_u,
                                           n_classes=2,
                                           feature_columns=feat_cols,
                                           #model_dir='C:\\Users\\jose\\Desktop\\RNAmodel',
                                           activation_fn=a_f,
                                           dropout=0.5,
                                           optimizer=tf.train.AdamOptimizer(
                                              learning_rate=parameters[0]
                                           ))
    print(h_u)
    print(type(a_f))
    return classifier

def select_mating_pool(pop, fitness, parents_fitness, num_parents):
    # Selecting the best individuals in the current generation as parents for producing the offspring of the next generation.
    parents = np.empty((num_parents, pop.shape[1]))
    # Do not reassign parents_fitness here: a new list would be local to this function,
    # so the caller's list is filled in place with append instead
    for parent_num in range(num_parents):
        #save fitness values of best parents
        parents_fitness.append(np.max(fitness))
        #save best parents
        max_fitness_idx = np.where(fitness == np.max(fitness))
        max_fitness_idx = max_fitness_idx[0][0]
        parents[parent_num, :] = pop[max_fitness_idx, :]
        fitness[max_fitness_idx] = -99999999999
        
    return parents


def crossover(parents, offspring_size):
    offspring = np.empty(offspring_size)
    # The point at which crossover takes place between two parents. Usually it is at the center.
    crossover_point = np.uint8(offspring_size[1]/2)

    for k in range(offspring_size[0]):
        # Index of the first parent to mate.
        parent1_idx = k%parents.shape[0]
        # Index of the second parent to mate.
        parent2_idx = (k+1)%parents.shape[0]
        # The new offspring will have its first half of its genes taken from the first parent.
        offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        # The new offspring will have its second half of its genes taken from the second parent.
        offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring


def mutation(offspring_crossover):
    # Mutation changes a single gene in each offspring randomly.
    for idx in range(offspring_crossover.shape[0]):
        
        # Select which gene to mutate
        select_gene = np.random.randint(low=0, high=4)
        
        if(select_gene == 0):
            #Learning rate mutation
            random_value = np.random.uniform(low=10**-2, high=10**-1)
            offspring_crossover[idx,0] = random_value
        if(select_gene == 1):
            #num_nodes_per_layer mutation
            random_value = np.random.choice([1, 2, 4, 8, 16, 32, 64, 128, 256])
            offspring_crossover[idx,1] = random_value
        if(select_gene == 2):
            #num_hidden_layers
            random_value = np.random.randint(low=1, high=20)
            offspring_crossover[idx,2] = random_value
        if(select_gene == 3):
            #activation function mutation
            # high is exclusive, so high=3 is needed for code 2 (leaky_relu) to be drawn
            random_value = np.random.randint(low=0, high=3)
            offspring_crossover[idx,3] = random_value
            
    return offspring_crossover
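
To see selection and single-point crossover at work on something small, here is a compact sketch on a made-up population of three chromosomes. It is a simplified stand-in for the functions above, not a copy of them: the helper names `pick_parents` and `single_point_crossover` are assumptions, and parents are ranked with `np.argsort` rather than by repeatedly taking the maximum.

```python
import numpy as np

def pick_parents(population, fitness, num_parents):
    """Return the num_parents rows of population with the highest fitness."""
    best_first = np.argsort(fitness)[::-1][:num_parents]
    return population[best_first]

def single_point_crossover(parent_a, parent_b):
    """Child takes the first half of its genes from parent_a, the rest from parent_b."""
    point = len(parent_a) // 2
    return np.concatenate([parent_a[:point], parent_b[point:]])

# Made-up chromosomes: [learning_rate, num_nodes, num_hidden_layers, activation]
population = np.array([
    [0.02, 16, 3, 0],
    [0.05, 64, 2, 1],
    [0.09, 8, 5, 1],
])
fitness = np.array([0.78, 0.81, 0.80])  # made-up accuracies

parents = pick_parents(population, fitness, num_parents=2)
child = single_point_crossover(parents[0], parents[1])
print(child)
```

Here the child inherits the learning rate and node count of the fittest chromosome and the layer count and activation code of the runner-up.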
              

## Data split using 10 folds

In [None]:
kf = KFold(n_splits=10)
kf.get_n_splits(X)

## Build and validate ANN for a given chromosome

In [None]:
def classify_create_folds(cromossoma):
    fold = 0
    all_test_samples=[]
    predicted_labels=[]
    scores=[]
    
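    for train_index, test_index in kf.split(X):
        fold+=1
        print("Fold#{}".format(fold))

        # Build a fresh classifier for each fold; reusing one estimator would continue
        # training from the previous folds and leak their test rows into the model
        classifier = update_classifier_parameters(cromossoma)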

        X_train = X.values[train_index]
        y_train = Y[train_index]
        X_test = X.values[test_index]
        y_test = Y[test_index]

        X_train_df = pd.DataFrame(X_train,columns=names,index=train_index)
        X_test_df = pd.DataFrame(X_test,columns=names,index=test_index)

        #defining input function to feed the classifier with training data
        input_func = tf.estimator.inputs.pandas_input_fn(x=X_train_df,
                                                         y=y_train,
                                                         batch_size=20,
                                                         shuffle=True)
   
        #train the model
        classifier.train(input_fn=input_func,steps=500)

        #defining input function to feed the classifier with testing data
        pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test_df,
                                                      batch_size=len(X_test_df),
                                                      shuffle=False)

        #make predictions based on testing data
        note_predictions = list(classifier.predict(input_fn=pred_fn))

        #extract the labels
        final_preds=[]
        for pred in note_predictions:
            final_preds.append(pred['class_ids'][0])

        #Adding all accuracy values to an array
        acc = accuracy_score(y_test,final_preds)
        
        scores.append(acc)
        print("Fold-{}".format(fold),"Accuracy#{}".format(acc))

        
   
    return (scores)

## Hyperparameter optimization using a Genetic Algorithm

In [None]:
new_population = create_new_population()
print(new_population)
num_parents_mating = 5
num_generations = 20
#number of genes for each chromosome
num_genes = 4 
#number of chromosomes for each population
num_chromosomes = 10 
pop_size=(num_chromosomes,num_genes)
#fitness values for each chromosome for the current generation
fitness_values = []
#fitness values for each chromosome of the last generation
last_fitness_values = []
gen = 0
cromo = 0
parents=[]
#Parents fitness so we do not repeat calculations on parents
parents_fitness = []

performances=[]
hiperparametros=[]


for generation in range(num_generations):
    gen+=1
    cromo = 0
    best_perf_per_gen = -1
    
    for cromossoma in new_population:
        cromo+=1
        score=-1
        parentNumber=0
        
        # If it's a known chromosome we don't need to train the ANN again
        # The first generation is skipped because no parents have been selected yet
        for savedCromo in parents:
            parentNumber+=1
            if (np.array_equal(cromossoma,savedCromo)):
                score = parents_fitness[parentNumber-1]
                print(savedCromo, "KNOWN CHROMOSOME")
                print(score, "KNOWN SCORE")

        
        #If it's a new chromosome we need to train the ANN in order to get the accuracy
        if (score < 0):
            scores = classify_create_folds(cromossoma)
            score = sum(scores)/len(scores)
            
        print("Generation-{}".format(gen),"Cromossoma-{}".format(cromo),"scored",score)
        #Keep the scores in fitness_values
        fitness_values.append(score)
        
        #getting the best hyperparameters per generation to check the evolution at the end
        if(best_perf_per_gen < score):
            best_perf_per_gen = score
            # Copy the row: new_population is overwritten in place at the end of each generation
            best_cromo_per_gen = cromossoma.copy()
           
        
        
        print(cromossoma)
        
    performances.append(best_perf_per_gen)
    hiperparametros.append(best_cromo_per_gen)
   
    
    print(performances,"BEST ACCURACY OF EACH GENERATION")
    print(hiperparametros,"BEST HYPERPARAMETERS OF EACH GENERATION")
    #We store the last generation in another array because fitness_values is changed by select_mating_pool
    if(gen == num_generations):
        for i in fitness_values:
            last_fitness_values.append(i)
            
    print(last_fitness_values,"LAST_FITNESS_VALUES")
    parents_fitness=[]
    parents = select_mating_pool(new_population,fitness_values,parents_fitness,num_parents_mating)
    #print(parents)
    # Generating next generation using crossover.
    offspring_crossover = crossover(parents,
                                        offspring_size=(pop_size[0]-parents.shape[0], num_genes))


    # Adding some variations to the offspring using mutation.
    offspring_mutation = mutation(offspring_crossover)

    # Creating the new population based on the parents and offspring.
    new_population[0:parents.shape[0], :] = parents
    new_population[parents.shape[0]:, :] = offspring_mutation
    
    #Reset fitness_values
    fitness_values=[]

#Getting the best solution
print(new_population)
# new_population already holds the next generation here, so its rows no longer line up
# with last_fitness_values; the best chromosome that was actually evaluated is parents[0]
best_solution = parents[0]
print("The best hyperparameters obtained are",best_solution,"with an accuracy of",parents_fitness[0])

    

    
    

## Debug stuff

In [None]:
new_population = create_new_population()
print(new_population)
fitness_values=[]
for i in range(10):
    fitness_values.append(np.random.uniform(low=0, high=1))

print(fitness_values.index(np.max(fitness_values)))

In [None]:
accs=[0.7951807228915662, 0.7963855421686746, 0.7963855421686746, 0.8012048192771084, 0.8012048192771084, 0.8012048192771084, 0.8084337349397591, 0.8084337349397591, 0.8084337349397591, 0.8096385542168674, 0.8120481927710843, 0.8120481927710843, 0.8120481927710843, 0.8120481927710843, 0.8120481927710843, 0.8120481927710843, 0.8120481927710843, 0.8120481927710843, 0.8120481927710843, 0.8132530120481928]
gen=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
plt.plot(gen,accs,color='g')
plt.xlabel('Generations')
plt.ylabel('Accuracy')
plt.title('Accuracy improvement through generations')
plt.show()