# Model Comparison

This notebook uses two helper classes: `Dataset` for the data and `Model` for the classifier.
Running training, testing and prediction takes the following steps:

 1. Load datasets
 2. Apply transformations and feature engineering to the dataset (optional)
 3. Choose variables to be used for training the model (optional)
 4. Load model from scikit-learn
 5. Run the model
 
The example below uses the model I was assigned to test, a Support Vector Machine.
 
The preprocessed dataset has the following columns:
 
1. `'Family_Case_ID'`
2. `'Severity'`
3. `'Birthday_year'`
4. `'Parents or siblings infected'`
5. `'Wife/Husband or children infected'`
6. `'Medical_Expenses_Family'`
7. `'Medical_Tent_A'`
8. `'Medical_Tent_B'`
9. `'Medical_Tent_C'`
10. `'Medical_Tent_D'`
11. `'Medical_Tent_E'`
12. `'Medical_Tent_F'`
13. `'Medical_Tent_G'`
14. `'Medical_Tent_T'`
15. `'Medical_Tent_n/a'`
16. `'City_Albuquerque'`
17. `'City_Santa Fe'`
18. `'City_Taos'`
19. `'Gender_M'`
20. `'family_size'`
21. `'Sev_by_city'`: Average severity in the city of the patient.
22. `'Sev_by_tent'`: Average severity in the medical tent of the patient.
23. `'Sev_by_gender'`: Average severity within the gender of the patient.
24. `'Sev_family'`: Average severity in the family of the patient.
25. `'spending_vs_severity'`: Medical Expenses Family / Patient's Severity
26. `'spending_family_member'`: Medical Expenses Family / Number of cases in the family
27. `'severity_against_avg_city'`: Patient's Severity / Sev_by_city
28. `'severity_against_avg_tent'`: Patient's Severity / Sev_by_tent
29. `'severity_against_avg_gender'`: Patient's Severity / Sev_by_gender
30. `'spending_family_severity'`: spending_family_member / Sev_family
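
The engineered columns above are group averages and ratios. The preprocessing code itself is not part of this notebook, so the following is only a minimal sketch of how such features could be built with pandas. The raw `City` column and all values are made up for illustration, and it is an assumption that the averages are taken per group with `groupby(...).transform("mean")`.

```python
import pandas as pd

# Toy data: the values are invented, and the raw "City" column is an assumption
# (the preprocessed dataset only exposes one-hot City_* columns).
df = pd.DataFrame({
    "City": ["Santa Fe", "Santa Fe", "Taos", "Taos"],
    "Severity": [3, 1, 2, 2],
    "Medical_Expenses_Family": [225, 600, 300, 100],
})

# Group-level average, broadcast back onto every patient row of that group.
df["Sev_by_city"] = df.groupby("City")["Severity"].transform("mean")

# Ratio features derived from the columns above.
df["severity_against_avg_city"] = df["Severity"] / df["Sev_by_city"]
df["spending_vs_severity"] = df["Medical_Expenses_Family"] / df["Severity"]

print(df)
```

The same pattern would apply to the tent, gender and family averages by changing the grouping column.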


In [1]:
from dataset import Dataset
from model import Model

## First model - Support Vector Machine - Alejandro

### Step 1: Load datasets

In [2]:
dataset = Dataset()            # Loads the preprocessed dataset
train_set = dataset.train_data # Training set without labels (train.csv)
target = dataset.target        # Labels for training set     (train.csv[Deceased])
test_set = dataset.test_data   # Unlabeled test set          (test.csv)

train_set.head(10)

| Patient_ID | Family_Case_ID | Severity | Birthday_year | Parents or siblings infected | Wife/Husband or children infected | Medical_Expenses_Family | Sev_by_city | Sev_by_tent | Sev_by_gender | Sev_family | ... | City_Santa Fe | City_Taos | Gender_M | family_size | spending_vs_severity | spending_family_member | severity_against_avg_city | severity_against_avg_tent | severity_against_avg_gender | spending_family_severity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4696 | 3 | -1.0 | 0 | 0 | 225 | 2.354391 | 2.623932 | 2.169811 | 3.0 | ... | 1 | 0 | 0 | 1 | 75.0 | 225.0 | 1.274215 | 1.143322 | 1.382609 | 75.0 |
| 2 | 21436 | 1 | 1966.0 | 0 | 1 | 1663 | 1.893491 | 2.623932 | 2.169811 | 1.0 | ... | 0 | 0 | 0 | 1 | 1663.0 | 831.5 | 0.528125 | 0.381107 | 0.46087 | 831.5 |
| 3 | 7273 | 3 | 1982.0 | 0 | 0 | 221 | 2.354391 | 2.623932 | 2.391753 | 3.0 | ... | 1 | 0 | 1 | 1 | 73.666667 | 221.0 | 1.274215 | 1.143322 | 1.25431 | 73.666667 |
| 4 | 8226 | 3 | 1997.0 | 0 | 0 | 220 | 2.354391 | 2.623932 | 2.391753 | 3.0 | ... | 1 | 0 | 1 | 1 | 73.333333 | 220.0 | 1.274215 | 1.143322 | 1.25431 | 73.333333 |
| 5 | 19689 | 3 | 1994.0 | 0 | 0 | 222 | 2.354391 | 2.623932 | 2.169811 | 3.0 | ... | 1 | 0 | 0 | 1 | 74.0 | 222.0 | 1.274215 | 1.143322 | 1.382609 | 74.0 |
| 6 | 17598 | 2 | -1.0 | 0 | 0 | 0 | 2.354391 | 2.623932 | 2.391753 | 2.0 | ... | 1 | 0 | 1 | 3 | 0.0 | 0.0 | 0.849476 | 0.762215 | 0.836207 | 0.0 |
| 7 | 7563 | 3 | 1984.0 | 0 | 1 | 435 | 2.354391 | 2.623932 | 2.391753 | 3.0 | ... | 1 | 0 | 1 | 1 | 145.0 | 217.5 | 1.274215 | 1.143322 | 1.25431 | 72.5 |
| 8 | 9520 | 2 | 1989.0 | 0 | 0 | 364 | 2.354391 | 2.623932 | 2.391753 | 2.0 | ... | 1 | 0 | 1 | 1 | 182.0 | 364.0 | 0.849476 | 0.762215 | 0.836207 | 182.0 |
| 9 | 6314 | 3 | 2000.0 | 1 | 1 | 441 | 1.893491 | 2.623932 | 2.391753 | 3.0 | ... | 0 | 0 | 1 | 2 | 147.0 | 147.0 | 1.584375 | 1.143322 | 1.25431 | 49.0 |
| 10 | 14392 | 3 | -1.0 | 1 | 1 | 626 | 1.893491 | 2.384615 | 2.169811 | 3.0 | ... | 0 | 0 | 0 | 2 | 208.666667 | 208.666667 | 1.584375 | 1.258065 | 1.382609 | 69.555556 |


### Step 2: Apply transformations

In [3]:
from sklearn.preprocessing import RobustScaler

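# Fit the scaler on the training set only, then apply the same scaling to both sets
scaler = RobustScaler().fit(train_set)
train_set[train_set.columns] = scaler.transform(train_set)
# Select the training columns explicitly so the column order matches the fitted scaler
test_set[train_set.columns] = scaler.transform(test_set[train_set.columns])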
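
To illustrate what this step does to a single column, here is a small standard-library sketch of robust scaling. It assumes `RobustScaler` is used with its default settings, which subtract the median and divide by the interquartile range. The helper `robust_scale` is hypothetical and only mirrors that arithmetic; it is not the scikit-learn implementation.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Scale values as (x - median) / IQR (hypothetical helper)."""
    # method="inclusive" uses linear interpolation between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    center = median(values)
    return [(x - center) / iqr for x in values]

# Medical_Expenses_Family for the first five patients in the table above.
expenses = [225, 1663, 221, 220, 222]
scaled = robust_scale(expenses)
print(scaled)  # [0.75, 360.25, -0.25, -0.5, 0.0]
```

Because the median and quartiles barely move when one value is extreme, the outlier (1663) stays large after scaling while the typical values remain close to zero.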

### Step 3: Choose variables

In [4]:
selected_variables_SVC = [
    'Severity',
    'Gender_M',
    'City_Albuquerque',
    'City_Santa Fe',
    "severity_against_avg_gender",
    'Medical_Tent_n/a',
    "Sev_family",
    'spending_family_member',
    'family_size',
]
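
A typo in one of these names would only surface later, when the columns are looked up. The sketch below shows one way to check the selection beforehand. The helper `missing_columns` is hypothetical and not part of the `Model` class, and `available_columns` lists only a few of the dataset columns for brevity.

```python
def missing_columns(selected, available):
    """Return the selected names that are absent from the available columns."""
    available = set(available)
    return [name for name in selected if name not in available]

# A subset of the dataset columns; with a DataFrame this would be train_set.columns.
available_columns = [
    "Severity", "Gender_M", "City_Albuquerque", "City_Santa Fe",
    "Medical_Tent_n/a", "Sev_family", "family_size",
]

selected = ["Severity", "Gender_M", "City_Santa Fe", "Sev_family"]
print(missing_columns(selected, available_columns))

# A misspelled name is reported instead of failing later.
print(missing_columns(["Severity", "Gender_Male"], available_columns))
```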

### Step 4: Load model from scikit-learn

In [5]:
from sklearn import svm

# Create the classifier from scikit-learn
svm_model = svm.NuSVC(break_ties=False, 
                      cache_size=200, 
                      class_weight=None, 
                      coef0=0.0,
                      decision_function_shape='ovr', 
                      degree=3, 
                      gamma='scale', 
                      kernel='rbf',
                      max_iter=-1, 
                      nu=0.5, 
                      probability=False, 
                      random_state=None, 
                      shrinking=True,
                      tol=0.001, 
                      verbose=False
                     )
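
Every argument in the call above appears to match the library default, at least in recent scikit-learn releases (`break_ties` requires version 0.22 or later). The following sketch compares the explicit configuration with a bare `svm.NuSVC()` through `get_params()`. The result may differ on a version whose defaults have changed.

```python
from sklearn import svm

# Same arguments as in the cell above.
explicit = svm.NuSVC(break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape="ovr", degree=3,
                     gamma="scale", kernel="rbf", max_iter=-1, nu=0.5,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False)

# Classifier built with the library defaults only.
defaults = svm.NuSVC()

# True when every hyperparameter has the same value in both objects.
same = explicit.get_params() == defaults.get_params()
print(same)
```

Spelling out the arguments still has value as documentation: it records exactly which hyperparameters were used for the reported results.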

### Step 5: Run model

In [6]:
model = Model(model     = svm_model,              # Initialized classifier model from scikit-learn
              variables = selected_variables_SVC, # Subset of variables from data to be used for training
                                                  # If variables=None, then all variables in set are used
              
              train_set = train_set,              # Samples X for training and validating
              target    = target,                 # Samples Y for training and validating
              test_set  = test_set                # Unlabeled samples for creating prediction
              )                 

model.run_model(path="svc_solution.csv")

Model - NuSVC(break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
      max_iter=-1, nu=0.5, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False)
Average model accuracy: 80.39%
Highest model accuracy: 86.67%
Solution set saved as 'svc_solution.csv'.
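
The implementation of `Model.run_model` is not shown here, so how it arrives at the average and highest accuracy is not known. One common way to obtain such figures is to score the classifier over several random train/validation splits. The sketch below does that on synthetic data; the split scheme, the number of splits and the dataset are all assumptions, so its numbers will not match the output above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import NuSVC

# Synthetic stand-in for the training data (not the dataset used above).
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# Ten random 80/20 train/validation splits (assumed scheme).
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(NuSVC(), X, y, cv=splitter, scoring="accuracy")

print(f"Average model accuracy: {scores.mean():.2%}")
print(f"Highest model accuracy: {scores.max():.2%}")
```

The highest score comes from the single most favourable split, so the average is the more reliable estimate of how the model will perform on unseen data.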
