# Final Project

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

This data contains 961 instances of masses detected in mammograms, and contains the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assesment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

A lot of unnecessary anguish and surgery arises from false positives arising from mammogram results. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Your assignment

Build a Multi-Layer Perceptron and train it to classify masses as benign or malignant based on its features.

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Remember to normalize your data first! And experiment with different topologies, optimizers, and hyperparameters.

I was able to achieve over 80% accuracy - can you beat that?


## Let's begin: prepare your data

Start by importing the mammographic_masses.data.txt file into a Pandas dataframe (hint: use read_csv) and take a look at it.

In [25]:
import pandas as pd
import sqlite3 as sql
import numpy as np
import random
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier
import keras
import tensorflow as tf

conn = sql.connect('../spotify.db')
#chansons_data = pd.read_csv(r'C:\Users\Simplon 1\Documents\Projets\Spotify\Spotify(1)\chansons.csv' ,encoding="UTF8")



In [None]:
chansons_data['top50']=0
chansons_data.at[0:50,'top50']=1
chansons_data.head(60)


Make sure you use the optional parmaters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):

In [4]:
#chansons_data = pd.read_csv(r'C:\Users\Simplon 1\Documents\Projets\Spotify\Spotify(1)\chansons.csv',encoding="UTF8", na_values=['?'])
chansons_data = pd.read_sql_query("select * from Chansons ORDER BY popularity  DESC",conn)
top50_data = pd.read_sql_query("select * from best_chansons ",conn)
chansons_data['top50']=0
top50_data['top50']=1
chansons_data = pd.concat([top50_data,chansons_data])

chansons_data.head()


Unnamed: 0,level_0,index,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,top50
0,0,93802,4iJyoBOLtHqaGxP12qzhQI,Peaches (feat. Daniel Caesar & Giveon),100,198082,1,"['Justin Bieber', 'Daniel Caesar', 'Giveon']","['1uNFoZAHBGtllmzznpCI3s', '20wkVLutqVOYrc0kxF...",2021-03-19,...,-6.181,1,0.119,0.321,0.0,0.42,0.464,90.03,4,1
1,1,93803,7lPN2DXiMsVn7XUKtOW1CS,drivers license,99,242014,1,['Olivia Rodrigo'],['1McMsnEElThX1knmY4oliG'],2021-01-08,...,-8.761,1,0.0601,0.721,1.3e-05,0.105,0.132,143.874,4,1
2,2,93804,3Ofmpyhv5UAQ70mENzB277,Astronaut In The Ocean,98,132780,0,['Masked Wolf'],['1uU7g3DNSbsu0QjSEqZtEd'],2021-01-06,...,-6.865,0,0.0913,0.175,0.0,0.15,0.472,149.996,4,1
3,3,92811,6tDDoYIxWvMLTdKpjFkc1B,telepatía,97,160191,0,['Kali Uchis'],['1U1el3k54VvEUzo3ybLPlM'],2020-12-04,...,-9.016,0,0.0502,0.112,0.0,0.203,0.553,83.97,4,1
4,4,92810,5QO79kh1waicV47BqGRL3g,Save Your Tears,97,215627,1,['The Weeknd'],['1Xyo4u8uXC1ZmMpatF05PJ'],2020-03-20,...,-5.487,1,0.0309,0.0212,1.2e-05,0.543,0.644,118.051,4,1


Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [142]:
chansons_data.describe()

Unnamed: 0,level_0,index,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,top50
count,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0
mean,285259.327619,289552.775238,29.986667,229684.2,0.046667,0.56769,0.539367,5.153333,-10.120407,0.619048,0.099262,0.455269,0.115828,0.212344,0.543312,116.380131,3.874286,0.047619
std,174664.831496,171084.723019,22.410117,103145.3,0.211024,0.165592,0.246434,3.51577,4.903837,0.485852,0.171339,0.348864,0.270526,0.172839,0.257313,29.084873,0.459417,0.21306
min,0.0,154.0,0.0,33813.0,0.0,0.0661,0.00651,0.0,-32.253,0.0,0.0225,1e-06,0.0,0.0215,0.0308,46.282,1.0,0.0
25%,131867.5,128401.5,13.0,175056.8,0.0,0.461,0.35325,2.0,-12.8415,0.0,0.0339,0.11525,0.0,0.0992,0.33825,94.17775,4.0,0.0
50%,288254.5,286842.5,27.5,216478.0,0.0,0.5785,0.549,5.0,-9.3425,1.0,0.04305,0.431,2e-05,0.141,0.5535,114.2525,4.0,0.0
75%,434337.75,439225.5,42.0,264463.0,0.0,0.68775,0.732,8.0,-6.54175,1.0,0.072175,0.79225,0.010725,0.28,0.749,134.216,4.0,0.0
max,586231.0,585212.0,100.0,1305307.0,1.0,0.957,0.997,11.0,-0.826,1.0,0.959,0.996,0.992,0.985,0.995,206.119,5.0,1.0


In [5]:
chansons_data =chansons_data.sort_values('duration_ms')

There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in.

In [None]:
'''chansons_data.loc[(masses_data['popularity'].isnull()) |
              (masses_data['duration_ms'].isnull()) |
              (masses_data['explicit'].isnull()) |
              (masses_data['danceability'].isnull()) |
              (masses_data['energy'].isnull()) |
              (masses_data['key'].isnull()) |
              (masses_data['loudness'].isnull()) |
              (masses_data['artists'].isnull())]
'''


If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().

In [6]:
chansons_data.dropna(inplace=True)
chansons_data.describe()




Unnamed: 0,level_0,index,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,top50
count,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0,1050.0
mean,285259.327619,289552.775238,29.986667,229684.2,0.046667,0.56769,0.539367,5.153333,-10.120407,0.619048,0.099262,0.455269,0.115828,0.212344,0.543312,116.380131,3.874286,0.047619
std,174664.831496,171084.723019,22.410117,103145.3,0.211024,0.165592,0.246434,3.51577,4.903837,0.485852,0.171339,0.348864,0.270526,0.172839,0.257313,29.084873,0.459417,0.21306
min,0.0,154.0,0.0,33813.0,0.0,0.0661,0.00651,0.0,-32.253,0.0,0.0225,1e-06,0.0,0.0215,0.0308,46.282,1.0,0.0
25%,131867.5,128401.5,13.0,175056.8,0.0,0.461,0.35325,2.0,-12.8415,0.0,0.0339,0.11525,0.0,0.0992,0.33825,94.17775,4.0,0.0
50%,288254.5,286842.5,27.5,216478.0,0.0,0.5785,0.549,5.0,-9.3425,1.0,0.04305,0.431,2e-05,0.141,0.5535,114.2525,4.0,0.0
75%,434337.75,439225.5,42.0,264463.0,0.0,0.68775,0.732,8.0,-6.54175,1.0,0.072175,0.79225,0.010725,0.28,0.749,134.216,4.0,0.0
max,586231.0,585212.0,100.0,1305307.0,1.0,0.957,0.997,11.0,-0.826,1.0,0.959,0.996,0.992,0.985,0.995,206.119,5.0,1.0


Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit_learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [7]:
#all_features = masses_data[['explicit','speechiness','loudness','valence','duration_ms','danceability','energy']].values
all_features = chansons_data[['popularity']].values

all_classes = chansons_data[['top50']].values

#feature_names = ['age', 'shape', 'margin', 'density']

#all_features


#données= pd.DataFrame(data,columns=['Age','Shape','Margin','Density'])
#catégorie= pd.DataFrame(data,columns=['Severity'])

In [8]:

chansons_data.iloc[50:,:]
print(chansons_data)
sample = chansons_data.sample(1000)
print("_________________________________________________________________________________")
print(sample)


     level_0   index                      id  \
670   403564  484601  3dylMkIuNcV5qQW2s5vOT5   
947   552504   15104  0VPIRWFCiuAYx1NEGz1Vwl   
969   563560  116346  0xB7vTtNikRDNrWozR6ETB   
606   366165  577622  0etHcrwnc1vPsqvLrWJe7C   
946   580440  451583  4MdMxmyUMyv4NrxcDFPab1   
..       ...     ...                     ...   
845   495459  505677  44pcaKv3RGu5PRQK9hagoA   
987   550621   12238  6gfXtfyuTLU3C2fM4P00iu   
921   540332  493119  1coKh01XUmxwQCCylLMU3Y   
872   506916  219279  27lnVpIKy1eCAtKiGKoBId   
949   557972   23261  0Rq2Q4RrAG0f7CiaV9KBDs   

                                                  name  popularity  \
670                  De Globi planget uf de Geburtstag          17   
947      24 Préludes, Op. 28: Prélude No. 5 in D Major           0   
969  Verdi : La forza del destino : Act 2 "La cena ...           0   
606                           Tintin i Amerika, del 36          21   
946  St. Matthew Passion, BWV 244 - Part Two: No.55...           0   
.. 

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().

In [30]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

all_features_scaled = scaler.fit_transform(all_features)
all_features_scaled
print(all_features)

[[17]
 [ 0]
 [ 0]
 ...
 [ 1]
 [ 4]
 [ 0]]


Now set up an actual MLP model using Keras:

In [23]:
from keras.layers import Dense
from keras.models import Sequential

def create_model():
    model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_dim=1),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1)
    ])

    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                metrics=['mean_absolute_error', 'mean_squared_error'])
    #model = Sequential()
    #4 feature inputs going into an 6-unit layer (more does not seem to help - in fact you can go down to 4)
    #model.add(Dense(6, input_dim=1, kernel_initializer='normal', activation='relu'))
    # "Deep learning" turns out to be unnecessary - this additional hidden layer doesn't help either.
     #model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    # Output layer with a binary classification (benign or malignant)
     #model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model; rmsprop seemed to work best
     #model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
    return model

In [20]:
from sklearn.model_selection import cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier

# Wrap our Keras model in an estimator compatible with scikit_learn
estimator = KerasClassifier(build_fn=create_model, nb_epoch=100, verbose=0)
# Now we can use scikit_learn's cross_val_score to evaluate this model identically to the others
cv_scores = cross_val_score(estimator, all_features_scaled,all_classes, cv=10)
cv_scores.mean()


Traceback (most recent call last):
  File "C:\Users\Simplon 1\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Simplon 1\Anaconda3\lib\site-packages\keras\wrappers\scikit_learn.py", line 220, in fit
    return super(KerasClassifier, self).fit(x, y, **kwargs)
  File "C:\Users\Simplon 1\Anaconda3\lib\site-packages\keras\wrappers\scikit_learn.py", line 154, in fit
    self.model = self.build_fn(**self.filter_sk_params(self.build_fn))
  File "<ipython-input-19-e65f53a740af>", line 6, in create_model
    keras.layers.Dense(64, activation=tf.nn.relu, input_shape=[1]),
NameError: name 'tf' is not defined

Traceback (most recent call last):
  File "C:\Users\Simplon 1\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Simplon 1\Anaconda3\lib\site-packages\keras\wrappers\

nan

In [None]:
print(all_classes)

In [26]:

model = create_model()
model.fit(
    all_features_scaled, # all_features_scaled,
    all_classes,                        #all_classes,
    batch_size=1200,
    epochs= 1,
    verbose=2,
    class_weight={0: 1, 1: 20},
    validation_data=(sample['popularity'], sample['top50'])
)

#chansons_data[['popularity']].values





Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
1/1 - 18s - loss: 0.8978 - mean_absolute_error: 0.1020 - mean_squared_error: 0.0516 - val_loss: 19.4694 - val_mean_absolute_error: 3.5865 - val_mean_squared_error: 19.4694


<keras.callbacks.History at 0x2934726b400>

In [28]:
def checkAccuracy(df,weight):
    reussite = 0
    size  = 0
    for  itera,row in df.iterrows():  
        print(row[0],end='')
        print('/',end='')
        print(row[1],end='')
        if row[1]==1:
            size += weight
            if row[0]==row[1]:
                reussite+= weight
            else:
                print("bad")
        else:
            size += 1
            if row[0]==row[1]:
                reussite+= 1
               
        
    return reussite/size

In [29]:

exemple  =  chansons_data['popularity'].tail(1).to_numpy()

column_names = ['prediction', 'realite']



resultat = pd.DataFrame(columns = column_names,index=range(1050))

resultat['realite'].append(sample['top50'])

resultat['prediction']=pd.DataFrame(model.predict_classes(all_features))
resultat['realite'] = all_classes
print("resultat")
print(resultat)
print(checkAccuracy(resultat,2))




resultat
      prediction  realite
0              1        0
1              0        0
2              0        0
3              1        0
4              0        0
...          ...      ...
1045           1        0
1046           0        0
1047           0        0
1048           1        0
1049           0        0

[1050 rows x 2 columns]
1/00/00/01/00/01/00/01/00/01/01/01/01/01/01/01/01/01/01/01/01/01/01/01/01/01/00/01/01/01/00/01/01/01/01/01/01/00/01/01/01/01/01/01/01/01/01/01/01/01/01/01/01/01/00/00/00/00/01/01/01/01/01/01/01/01/01/01/01/11/01/01/01/01/01/01/00/01/01/00/01/01/01/01/01/01/01/01/01/01/01/01/01/01/11/00/01/01/01/01/01/00/00/00/01/01/01/01/01/01/00/01/01/01/01/01/01/01/01/11/01/01/01/01/01/01/01/01/00/01/00/01/01/01/01/01/00/01/01/01/01/01/01/01/10/01/01/01/01/01/01/01/01/01/01/01/00/01/01/01/01/01/11/01/11/01/01/00/01/00/01/01/11/01/10/01/01/00/01/11/01/01/01/01/01/01/01/01/01/00/01/01/01/01/10/01/01/11/00/01/01/01/01/11/01/01/11/01/01/00/01/01/01/01/01/01/01/01/1

## How did you do?

Which topology, and which choice of hyperparameters, performed the best? Feel free to share your results!