# Assignment

## Dataset
[Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult)

Training set: 32'561 records

Test set: 16'280 records

### Goal

Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.

### Features (14)

Note: features contain missing data!

| Name           | Type        |
|----------------|-------------|
| age            | continuous  |
| workclass      | categorical |
| fnlwgt         | continuous  |
| education      | categorical |
| education-num  | continuous  |
| marital-status | categorical |
| occupation     | categorical |
| relationship   | categorical |
| race           | categorical |
| sex            | categorical |
| capital-gain   | continuous  |
| capital-loss   | continuous  |
| hours-per-week | continuous  |
| native-country | categorical |

### Target (1)

| Name           | Type                                |
|----------------|-------------------------------------|
| income         | categorical<br>("<=50K" or ">50K")  |

### Evaluation Metrics

Accuracy and F1 Score

# Deadline

Sunday 29/05/2022 at 23:59

To submit:
- a zip file containing the code for the model (briefly commented) or a link to a Colab notebook (make sure the link is public)
- a csv/txt with one line for each test sample, containing the predictions
- note that we should be able to run the code and obtain the same predictions present in the csv/txt file


# Data Loading

In [None]:
# Data loading

import pandas as pd

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test

column_names=[
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income"
]

df_train = pd.read_csv("adult.data",index_col=False, names=column_names)
df_test = pd.read_csv("adult.test", index_col=False, skiprows=1, names=column_names)
real_predictions = df_test.income
real_predictions.to_csv("predictions.test.csv", index=False)

#Codifica delle categorie

Le categorie (attributi che sono in un range di valori predefiniti) vanno codificate con la tecnica One-Hot encode, mentre i campi numerici vanno lasciati inalterati.

In [20]:
from os import pathconf
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Elimina tutte le tuple contenenti valori anomali
def refine(data):
  out = []
  cols = len(data.columns)
  for i in range(len(data)):
    arr = []
    for l in range(cols):
      v = data.iat[i,l]
      if v == ' ?':
        arr = None
        break
      else:
        arr.append(v)
    if arr != None:
      out.append(arr)
    
  dataFrame = pd.DataFrame(out)
  dataFrame.columns = data.columns
  return dataFrame


# codifica le feature categoriche (con > 2 categorie) in una rappresentazione one hot.
# codifica le feature categoriche (con = 2 categorie) in una rappresentazione label.
def encode_data(data, keep_unknown=False):

  categories = {
           'workclass': [['Private'], ['Self-emp-not-inc'], ['Self-emp-inc'], ['Federal-gov'], ['Local-gov'], ['State-gov'], ['Without-pay'], ['Never-worked']],
           'education':[['Bachelors'], ['Some-college'],['11th'], ['HS-grad'], ['Prof-school'], ['Assoc-acdm'], ['Assoc-voc'], ['9th'], ['7th-8th'], ['12th'], ['Masters'], ['1st-4th'], ['10th'], ['Doctorate'], ['5th-6th'], ['Preschool']],
           'marital-status':[['Married-civ-spouse'], ['Divorced'], ['Never-married'], ['Separated'], ['Widowed'], ['Married-spouse-absent'], ['Married-AF-spouse']],
           'occupation':[['Tech-support'], ['Craft-repair'], ['Other-service'], ['Sales'], ['Exec-managerial'], ['Prof-specialty'], ['Handlers-cleaners'], ['Machine-op-inspct'], ['Adm-clerical'], ['Farming-fishing'], ['Transport-moving'], ['Priv-house-serv'], ['Protective-serv'], ['Armed-Forces']],
           'relationship':[['Wife'], ['Own-child'], ['Husband'], ['Not-in-family'], ['Other-relative'], ['Unmarried']],
           'race':[['White'], ['Asian-Pac-Islander'], ['Amer-Indian-Eskimo'], ['Other'], ['Black']],
           'sex':[['Male'],['Female']],
           'native-country':[['United-States'], ['Cambodia'], ['England'], ['Puerto-Rico'], ['Canada'], ['Germany'], ['Outlying-US(Guam-USVI-etc)'], ['India'], ['Japan'], ['Greece'], ['South'], ['China'], ['Cuba'], ['Iran'], ['Honduras'], ['Philippines'], ['Italy'], ['Poland'], ['Jamaica'], ['Vietnam'], ['Mexico'], ['Portugal'], ['Ireland'], ['France'], ['Dominican-Republic'], ['Laos'], ['Ecuador'], ['Taiwan'], ['Haiti'], ['Columbia'], ['Hungary'], ['Guatemala'], ['Nicaragua'], ['Scotland'], ['Thailand'], ['Yugoslavia'], ['El-Salvador'], ['Trinadad&Tobago'], ['Peru'], ['Hong'], ['Holand-Netherlands']]
  }
    
  for col in categories:
    enc = OneHotEncoder(handle_unknown='ignore', drop='if_binary')

    enc.fit(categories[col]) # genera la rappresentazione one-hot della feature singola
    enc_arr = pd.DataFrame(enc.transform([[x.strip()] for x in data[col]]).toarray()) # trasforma la colonna nella rappresentazione one-hot e genera un dataframe a partire da essa
    enc_arr.columns = [col + str(i) for i in range(1, enc_arr.shape[1]+1)] # nuova label per la colonna

    data = pd.concat([data.loc[:,:col].iloc[:,:-1], enc_arr, data.loc[:,col:].iloc[:,1:]], axis=1) # posiziona il dataframe con la rappresentazione one-hot nella posizione giusta all'interno del dataframe completo
    # data.loc[:,:col].iloc[:,:-1] -> tutte le colonne precedenti a quella che abbiamo trasformato
    # data.loc[:,col:].iloc[:,1:] -> tutte le colonne successive
    # loc e iloc vengono usati per prendere uno slice del dataframe, escludendo la colonna che stiamo trasformando 

  return data #dataframe codificato in One-Hot

# codifica il target (y)
def encode_target(data):
  return [0 if "<=50K" in x else 1 for x in data]

# Estrae le features
# data -> il dataset
def extractX(data):
  return data.iloc[:,:-1]


# Estrare la label
# data -> il dataset
def extractY(data):
  return data.iloc[:,-1]



#Creazione dei Training e Test set
Si creano dei dataset prendendo i dati in input e pulendo le tuple inutili.

In [None]:
# Raffina e salva il training set -> toglie ?
trainset = refine(df_train)
trainset.to_csv("Dataset.data", header=False, index=False)

# Raffina e salva il test set
#testset = refine(df_test) --> se vuoi raffinare anche il testSet
testset = df_test
testset.to_csv("Dataset.test", index=False)

# Estrae i dati usati nell'addestramento del modello 
X_train = extractX(trainset)
Y_train = extractY(trainset)

X_train = encode_data(X_train)
Y_train = encode_target(Y_train)


# Estrae i dati usati nell'addestramento del modello 
X_test = extractX(testset)
Y_test = extractY(testset)

# Codifica il test set

X_test = encode_data(X_test)
Y_test = encode_target(Y_test)


#Compute Metrics

In [74]:
# Compute Metrics

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

def compute_metrics(path_real, path_predicted):

    column_names=[
        "age",
        "workclass",
        "fnlwgt",
        "education",
        "education-num",
        "marital-status",
        "occupation",
        "relationship",
        "race",
        "sex",
        "capital-gain",
        "capital-loss",
        "hours-per-week",
        "native-country",
        "income"
    ]
    
    if "test" in path_real:
        df_real = pd.read_csv(path_real, index_col=False, skiprows=1, names=column_names)
    else:
        df_real = pd.read_csv(path_real, index_col=False, names=column_names)
    
    df_pred = pd.read_csv(path_predicted)
    y_real = df_real.income.tolist()    #si salva, a parte, l'ultima colonna come una lista
    y_pred = df_pred.income.tolist()    #si salva, a parte, l'ultima colonna come una lista

    y_real = [0 if "<=50K" in x else 1 for x in y_real]
    y_pred = [0 if "<=50K" in x else 1 for x in y_pred]
    
    accuracy = accuracy_score(y_real, y_pred)
    f1 = f1_score(y_real, y_pred)

    print(f"Metrics comparing '{path_real}' and '{path_predicted}'")
    print(f"Accuracy: {accuracy}")
    print(f"F1: {f1}")

#Rete Neurale

##Creazione della rete

In [239]:
from re import X
#RETE NEURALE
from tensorflow import keras
from keras import models
from keras import layers
from keras import metrics
import numpy

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(len(X_train.columns),))) #la dimensione di input va da il numero di feature (104) ad un massimo di +inf
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.0005),
              loss='binary_crossentropy',
              metrics=[metrics.accuracy])

#scalo i dati con MinMax
mM_scaler = MinMaxScaler()
X_train_mM_scaled = mM_scaler.fit_transform(X_train)
X_test_mM_scaled = mM_scaler.transform(X_test)

#creo il validation set (di 1000 elementi) e i restanti sono usati per il training set
X_Val = X_train_mM_scaled[:1000]
Y_Val = numpy.array(Y_train[:1000])

X_Train_Parziale = X_train_mM_scaled[1000:]
Y_Train_Parziale = numpy.array(Y_train[1000:])


##Allenamento della rete

In [None]:
history = model.fit(X_Train_Parziale,
                    Y_Train_Parziale,
                    epochs=20,
                    batch_size=512,
                    validation_data=(X_Val, Y_Val))

history_dict = history.history
history_dict.keys()

##Visualizzo l'andamento dell'allenamento della rete

Andamento della funzione di loss

In [None]:
import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

Andamento della accuracy

In [None]:
plt.clf()   # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

##Utilizzo la rete per predirre i dati

In [243]:
Risultati_Predetti = model.predict(X_Test)

##Calcolo dei risultati

In [244]:
predictions = model.predict(X_test)

##Salvataggio dei risultati

Funzione per riconvertire da one-hot a categoria l'ultima cella, infatti solo quella verrà considerata nel calcolo delle metriche.

In [133]:
# Decodifica le predizioni
def decode_predictions(predictions): #restituisce un dataframe
  return pd.DataFrame(["<=50K" if x == 0 else ">50K" for x in predictions], columns=['income'])


In [245]:
decoded_prediction = decode_predictions(predictions)
decoded_prediction.to_csv("neural.data", index=False)

## Compute Metrics della rete Neurale

In [246]:
compute_metrics("Dataset.test", "neural.data")

Metrics comparing 'Dataset.test' and 'neural.data'
Accuracy: 0.7934402063755298
F1: 0.2677988242978446


#Logistic Regression

Librerie da importare

In [70]:
# LogisticRegression normale
from sklearn.linear_model import LogisticRegression
#LogisticRegression con normalizzazione standard
from sklearn.preprocessing import StandardScaler
#LogisticRegression con normalizzazione minmax
from sklearn.preprocessing import MinMaxScaler
# LogisticRegression con cross-validation
from sklearn.linear_model import LogisticRegressionCV



##Logistic Regression base

Allenamento del modello

In [56]:
# fitting del modello
model = LogisticRegression(solver='saga', C=0.2, max_iter=1000)
model.fit(X_train, Y_train)



LogisticRegression(C=0.2, max_iter=1000, solver='saga')

Salvo le predizioni e computo le *metriche*

In [58]:
test_predictions = model.predict(X_test)
decode_predictions(test_predictions).to_csv("test_pred", index=False)

compute_metrics('Dataset.test', 'test_pred')

Metrics comparing 'Dataset.test' and 'test_pred'
Accuracy: 0.7871181938911023
F1: 0.34114262227702424


##Logistic Regression con Normalizzazione std

Allenamento del modello

In [61]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)

scaled_model = LogisticRegression(solver='saga', C=0.1, max_iter=500) # converge entro 500 iterazioni
scaled_model.fit(X_train_scaled, Y_train)

LogisticRegression(C=0.1, max_iter=500, solver='saga')

Salvo le predizioni e computo le *metriche*

In [63]:
test_predictions = scaled_model.predict(X_test_scaled)
decode_predictions(test_predictions).to_csv("test_pred", index=False)

compute_metrics('Dataset.test', 'test_pred')

Metrics comparing 'Dataset.test' and 'test_pred'
Accuracy: 0.847675962815405
F1: 0.660751257024549


##Logistic Regression con Normalizzazione MIN/MAX

Allenamento del modello

In [87]:
mM_scaler = MinMaxScaler()
X_train_mM_scaled = mM_scaler.fit_transform(X_train)

X_test_mM_scaled = mM_scaler.transform(X_test)

scaled_model = LogisticRegression(solver='saga') 
scaled_model.fit(X_train_mM_scaled, Y_train)

LogisticRegression(solver='saga')

Salvo le predizioni e computo le *metriche*

In [69]:
test_predictions = scaled_model.predict(X_test_mM_scaled)
decode_predictions(test_predictions).to_csv("test_pred", index=False)

compute_metrics('Dataset.test', 'test_pred')

Metrics comparing 'Dataset.test' and 'test_pred'
Accuracy: 0.8460159362549801
F1: 0.6555770087628101


##Logistic Regression con Cross-Validation

Allenamento del modello

In [89]:
mM_scaler = MinMaxScaler()
X_train_mM_scaled = mM_scaler.fit_transform(X_train)

X_test_mM_scaled = mM_scaler.transform(X_test)
C = [0.01, 0.1, 0.5, 1, 10]
#solver{‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}
cv_model = LogisticRegressionCV(cv=5, Cs=C, solver='saga', max_iter=500)
cv_model.fit(X_train_mM_scaled, Y_train)

LogisticRegressionCV(Cs=[0.01, 0.1, 0.5, 1, 10], cv=5, max_iter=500,
                     solver='saga')

Salvo le predizioni e computo le metriche

In [90]:
test_predictions = cv_model.predict(X_test_mM_scaled)
decode_predictions(test_predictions).to_csv("test_pred", index=False)

compute_metrics('Dataset.test', 'test_pred')

Metrics comparing 'Dataset.test' and 'test_pred'
Accuracy: 0.8525889073152755
F1: 0.6570448699628465
