# Chapter 2 – End-to-end Machine Learning project**

*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts Using Java.*



# Setup

First, let's make sure this notebook works well in :

In [42]:
%maven commons-io:commons-io:jar:2.6
%maven io.vavr:vavr:jar:0.10.0
%maven org.apache.commons:commons-compress:1.18
%maven tech.tablesaw:tablesaw-core:jar:0.32.7
%maven tech.tablesaw:tablesaw-jsplot:jar:0.32.7   
%maven nz.ac.waikato.cms.weka:weka-stable:jar:3.8.3
%maven nz.ac.waikato.cms.weka:wekaDeeplearning4j:jar:1.5.13
    
    
import org.apache.commons.io.*;
import java.io.*;
import io.vavr.control.*;
import org.apache.commons.compress.archivers.tar.*;
import org.apache.commons.compress.compressors.gzip.*;

var DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/";
var HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz";
var PROJECT_ROOT_DIR = ".";
var CHAPTER_ID = "end_to_end_project";
var HOUSING_PATH = FilenameUtils.concat("datasets", "housing");
var BUFFER_SIZE = 1024;
void fetch_housing_data(String housingUrl, File housingPath){
   housingUrl = Objects.toString(housingUrl,HOUSING_URL);
   Objects.requireNonNull(housingPath);
   if(!housingPath.exists()){
       Try.run(() -> FileUtils.forceMkdir(housingPath));
   }
   var tgzPath = new File(FilenameUtils.concat(housingPath.getPath(), "housing.tgz"));
   var urlTemp = housingUrl;
   Try.run(() -> FileUtils.copyURLToFile(new URL(urlTemp), tgzPath )); 
   Try.run(() -> extractTarGZ(tgzPath, housingPath) );
}

void extractTarGZ(File in, File destDir) throws Exception {
    GzipCompressorInputStream gzipIn = new GzipCompressorInputStream(new FileInputStream(in));
    try (TarArchiveInputStream tarIn = new TarArchiveInputStream(gzipIn)) {
        TarArchiveEntry entry;

        while ((entry = (TarArchiveEntry) tarIn.getNextEntry()) != null) {
            /** If the entry is a directory, create the directory. **/
            if (entry.isDirectory()) {
                File f = new File(FilenameUtils.concat(destDir.getPath(),entry.getName()));
                boolean created = f.mkdir();
                if (!created) {
                    System.out.printf("Unable to create directory '%s', during extraction of archive contents.\n",
                            f.getAbsolutePath());
                }
            } else {
                int count;
                byte data[] = new byte[BUFFER_SIZE];
                FileOutputStream fos = new FileOutputStream(FilenameUtils.concat(destDir.getPath(),entry.getName()), false);
                try (BufferedOutputStream dest = new BufferedOutputStream(fos, BUFFER_SIZE)) {
                    while ((count = tarIn.read(data, 0, BUFFER_SIZE)) != -1) {
                        dest.write(data, 0, count);
                    }
                }
            }
        }

    }
}


In [43]:
//fetch_housing_data(HOUSING_URL, new File(HOUSING_PATH));

In [44]:
import tech.tablesaw.api.Table;

Table load_housing_data(File housingPathCsv){
 return Try.of(() ->Table.read().csv(housingPathCsv)).get();
}


In [45]:

var housing = load_housing_data(new File(FilenameUtils.concat(HOUSING_PATH,"housing.csv")));
housing.first(4);

                                                                                  housing.csv                                                                                  
 longitude  |  latitude  |  housing_median_age  |  total_rooms  |  total_bedrooms  |  population  |  households  |  median_income  |  median_house_value  |  ocean_proximity  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   -122.23  |     37.88  |                  41  |          880  |             129  |         322  |         126  |         8.3252  |              452600  |         NEAR BAY  |
   -122.22  |     37.86  |                  21  |         7099  |            1106  |        2401  |        1138  |         8.3014  |              358500  |         NEAR BAY  |
   -122.24  |     37.85  |                  52  |         1467  |             190  |         496  |         177  |      

In [46]:
//planning to replicate housing.info()
display(housing.structure());
display(housing.shape());
display(housing.summary());

            Structure of housing.csv            
 Index  |     Column Name      |  Column Type  |
------------------------------------------------
     0  |           longitude  |       DOUBLE  |
     1  |            latitude  |       DOUBLE  |
     2  |  housing_median_age  |      INTEGER  |
     3  |         total_rooms  |      INTEGER  |
     4  |      total_bedrooms  |      INTEGER  |
     5  |          population  |      INTEGER  |
     6  |          households  |      INTEGER  |
     7  |       median_income  |       DOUBLE  |
     8  |  median_house_value  |      INTEGER  |
     9  |     ocean_proximity  |       STRING  |

20640 rows X 10 cols


Table summary for: housing.csv
         Column: longitude          
 Measure   |         Value         |
------------------------------------
        n  |              20640.0  |
      sum  |   -2467918.699999941  |
     Mean  |  -119.56970445736455  |
      Min  |              -124.35  |
      Max  |              -114.31  |
    Range  |   10.039999999999992  |
 Variance  |    4.014139367081234  |
 Std. Dev  |    2.003531723502584  |
         Column: latitude         
 Measure   |        Value        |
----------------------------------
        n  |            20640.0  |
      sum  |  735441.6200000036  |
     Mean  |  35.63186143410859  |
      Min  |              32.54  |
      Max  |              41.95  |
    Range  |  9.410000000000004  |
 Variance  |  4.562292644202738  |
 Std. Dev  |  2.135952397457101  |
    Column: housing_median_age     
 Measure   |        Value         |
-----------------------------------
        n  |             20640.0  |
      sum  |            591119.0

1153b47a-cefb-4b12-bf57-ff2d8bf185aa

In [47]:
housing.xTabCounts("ocean_proximity")

Column: ocean_proximity 
  Category   |  Count  |
------------------------
     ISLAND  |      5  |
 NEAR OCEAN  |   2658  |
   NEAR BAY  |   2290  |
     INLAND  |   6551  |
  <1H OCEAN  |   9136  |

In [48]:


import tech.tablesaw.plotly.components.Figure;
import tech.tablesaw.plotly.components.Layout;
import tech.tablesaw.plotly.api.Histogram;  
import tech.tablesaw.plotly.Plot;
import javax.imageio.ImageIO;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;
import tech.tablesaw.plotly.components.Page;

void renderPlotly(Figure fig){
    Page page = Page.pageBuilder(fig, "target").build();
    display(page.asJavascript(),"text/html");
}

//renderPlotly(Histogram.create("Distribution of total_rooms", housing, "total_rooms"));

// housing.numericColumns().forEach(f ->{
//          HistogramTrace trace = HistogramTrace.builder(f.asDoubleArray()).build();
//          Plot.show(new Figure(Layout.builder(f.name()).build(), trace));
        
// });


# Preparing for stratified sampling
*I have skipped random sampling in the book*

In [49]:
import java.util.function.ToDoubleFunction;
var incomeCat = housing.doubleColumn("median_income").map((ToDoubleFunction<Double>) f -> Math.ceil(f/1.5)).map((ToDoubleFunction<Double>)cat -> cat > 5 ? 5: cat );
housing.addColumns(incomeCat.setName("income_cat"));
housing.first(2)

                                                                                         housing.csv                                                                                          
 longitude  |  latitude  |  housing_median_age  |  total_rooms  |  total_bedrooms  |  population  |  households  |  median_income  |  median_house_value  |  ocean_proximity  |  income_cat  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   -122.23  |     37.88  |                  41  |          880  |             129  |         322  |         126  |         8.3252  |              452600  |         NEAR BAY  |         5.0  |
   -122.22  |     37.86  |                  21  |         7099  |            1106  |        2401  |        1138  |         8.3014  |              358500  |         NEAR BAY  |         5.0  |

In [50]:
import tech.tablesaw.api.CategoricalColumn;
import tech.tablesaw.columns.Column;
import static tech.tablesaw.aggregate.AggregateFunctions.*;
Table[] stratifiedSampleSplit(Table table, String column, double table1Proportion){
    final Table first = table.emptyCopy();
    final Table second = table.emptyCopy();
    String categoricalColumn = column;
    Column<?> col = table.column(column);
    if(!CategoricalColumn.class.isInstance(col)){
        categoricalColumn += "_stringified";
        table.addColumns(col.asStringColumn().setName(categoricalColumn));
    }
    table.splitOn(categoricalColumn).asTableList().forEach(tab-> {
       Table[] splits = tab.sampleSplit(table1Proportion); 
        first.append(splits[0]);
        second.append(splits[1]);
    });
    if(!categoricalColumn.equals(column)){
        table.removeColumns(table.column(categoricalColumn));
    }
    return new Table[]{first, second};
}

var strats = stratifiedSampleSplit(housing,"income_cat", 0.2);
strats[1].removeColumns(strats[1].column("income_cat"));
strats[0].removeColumns(strats[0].column("income_cat"));
display(strats[1].shape());
display(strats[0].shape());
display(strats[0].summarize("longitude","median_income", mean, count).apply());
strats[1].summarize("longitude","median_income", mean, count).apply();


16513 rows X 10 cols

4127 rows X 10 cols

                                      housing.csv summary                                       
 Mean [median_income]  |  Count [median_income]  |   Mean [longitude]    |  Count [longitude]  |
------------------------------------------------------------------------------------------------
    3.879343397140783  |                 4127.0  |  -119.57728616428399  |             4127.0  |

                                      housing.csv summary                                       
 Mean [median_income]  |  Count [median_income]  |   Mean [longitude]    |  Count [longitude]  |
------------------------------------------------------------------------------------------------
   3.8685035608308733  |                16513.0  |  -119.56780960455399  |            16513.0  |

# Discover and visualize the data to gain insights

**skipped for now

In [51]:
housing = strats[1].copy();
housing.shape();

16513 rows X 10 cols

**Tablesaw  doesn handle missing values very well so we will set missing values to 0

In [52]:
housing.missingValueCounts();

                                                                                                                                                                   housing.csv summary                                                                                                                                                                   
 Missing Values [housing_median_age]  |  Missing Values [median_income]  |  Missing Values [latitude]  |  Missing Values [median_house_value]  |  Missing Values [ocean_proximity]  |  Missing Values [total_rooms]  |  Missing Values [households]  |  Missing Values [total_bedrooms]  |  Missing Values [longitude]  |  Missing Values [population]  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [53]:
var tBRoom = housing.intColumn("total_bedrooms");
housing.replaceColumn("total_bedrooms",tBRoom.set(tBRoom.isMissing(),0));
var summarizer = housing.summarize("total_bedrooms", mean, sum, count).apply();
summarizer

                             housing.csv summary                             
 Mean [total_bedrooms]  |  Sum [total_bedrooms]  |  Count [total_bedrooms]  |
-----------------------------------------------------------------------------
     532.2989160055724  |             8789852.0  |                 16513.0  |

In [54]:
housing.missingValueCounts();

                                                                                                                                                                   housing.csv summary                                                                                                                                                                   
 Missing Values [housing_median_age]  |  Missing Values [median_income]  |  Missing Values [latitude]  |  Missing Values [median_house_value]  |  Missing Values [ocean_proximity]  |  Missing Values [total_rooms]  |  Missing Values [households]  |  Missing Values [total_bedrooms]  |  Missing Values [longitude]  |  Missing Values [population]  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**lets save the mean because we will need it later

In [55]:
var totalBedroomsMean = summarizer.column(0).get(0);
totalBedroomsMean;

532.2989160055724

**Looking for correlation

In [56]:
import java.util.stream.*;
import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;
import io.vavr.Tuple;  

var medianVector  = housing.intColumn("median_house_value").asDoubleColumn();

var corr = new PearsonsCorrelation();
housing.numericColumns().stream()
    .map(i -> Tuple.of(i.name(),corr.correlation( i.asDoubleArray(), medianVector.asDoubleArray() )))
    .sorted((a, b) -> {
        int c = 0;
        if( a._2 == Double.NaN  && b._2 == Double.NaN ){
            c = 0;
        }
        else if(b._2 == Double.NaN || a._2 > b._2){
            c = 1;
        }
        else if(a._2 == Double.NaN ||a._2 < b._2){
            c = -1;
        }
        
        return c;
    })
    .collect(Collectors.toList());

[(latitude, -0.15057677836355435), (longitude, -0.038404506216789674), (population, -0.02288331529098258), (total_bedrooms, 0.05065901579270411), (households, 0.06801785534914913), (housing_median_age, 0.10503576944407318), (total_rooms, 0.13464004326853155), (median_income, 0.6863969572461165), (median_house_value, 1.0)]

**Plot Pandas correlation later

In [57]:
housing.addColumns( 
    housing.nCol("total_rooms").divide(housing.nCol("households")).setName("rooms_per_household"),
    housing.nCol("total_bedrooms").divide(housing.nCol("total_rooms")).setName("bedrooms_per_room"),
    housing.nCol("total_bedrooms").divide(housing.nCol("households")).setName("population_per_household")
);
housing.summary()


Table summary for: housing.csv
         Column: longitude          
 Measure   |         Value         |
------------------------------------
        n  |              16513.0  |
      sum  |  -1974423.2399999753  |
     Mean  |  -119.56780960455335  |
      Min  |               -124.3  |
      Max  |              -114.31  |
    Range  |    9.989999999999995  |
 Variance  |    4.019445002022454  |
 Std. Dev  |   2.0048553568829983  |
         Column: latitude          
 Measure   |        Value         |
-----------------------------------
        n  |             16513.0  |
      sum  |   588376.7100000009  |
     Mean  |  35.631121540604326  |
      Min  |               32.54  |
      Max  |               41.95  |
    Range  |   9.410000000000004  |
 Variance  |   4.591721828310808  |
 Std. Dev  |  2.1428303312000248  |
    Column: housing_median_age     
 Measure   |        Value         |
-----------------------------------
        n  |             16513.0  |
      sum  |         

In [58]:
var medianVector  = housing.intColumn("median_house_value").asDoubleColumn();

var corr = new PearsonsCorrelation();
housing.numericColumns().stream()
    .map(i -> Tuple.of(i.name(),corr.correlation( i.asDoubleArray(), medianVector.asDoubleArray() )))    
    .sorted((a, b) -> {
        int c = 0;
        if( a._2 == Double.NaN  && b._2 == Double.NaN ){
            c = 0;
        }
        else if(b._2 == Double.NaN || a._2 > b._2){
            c = 1;
        }
        else if(a._2 == Double.NaN ||a._2 < b._2){
            c = -1;
        }
        
        return c;
    })
    .collect(Collectors.toList());

[(bedrooms_per_room, -0.23133972663122307), (latitude, -0.15057677836355435), (population_per_household, -0.051073965333551734), (longitude, -0.038404506216789674), (population, -0.02288331529098258), (total_bedrooms, 0.05065901579270411), (households, 0.06801785534914913), (housing_median_age, 0.10503576944407318), (total_rooms, 0.13464004326853155), (rooms_per_household, 0.15012588849367983), (median_income, 0.6863969572461165), (median_house_value, 1.0)]

In [59]:
housing.missingValueCounts();

                                                                                                                                                                                                                                  housing.csv summary                                                                                                                                                                                                                                  
 Missing Values [housing_median_age]  |  Missing Values [median_income]  |  Missing Values [latitude]  |  Missing Values [ocean_proximity]  |  Missing Values [households]  |  Missing Values [total_bedrooms]  |  Missing Values [population]  |  Missing Values [median_house_value]  |  Missing Values [total_rooms]  |  Missing Values [population_per_household]  |  Missing Values [bedrooms_per_room]  |  Missing Values [longitude]  |  Missing Values [rooms_per_household]  |
--------------------------------------------------------

**I am going with option 1(Get rid of the corresponding districts.)

In [60]:
housing = strats[1].copy();
housing.addColumns( 
    housing.nCol("total_rooms").divide(housing.nCol("households")).setName("rooms_per_household"),
    housing.nCol("total_bedrooms").divide(housing.nCol("total_rooms")).setName("bedrooms_per_room"),
    housing.nCol("total_bedrooms").divide(housing.nCol("households")).setName("population_per_household")
);
//housing =housing.dropRowsWithMissingValues();
housing.missingValueCounts();

                                                                                                                                                                                                                                  housing.csv summary                                                                                                                                                                                                                                  
 Missing Values [housing_median_age]  |  Missing Values [median_income]  |  Missing Values [latitude]  |  Missing Values [ocean_proximity]  |  Missing Values [households]  |  Missing Values [total_bedrooms]  |  Missing Values [population]  |  Missing Values [median_house_value]  |  Missing Values [total_rooms]  |  Missing Values [population_per_household]  |  Missing Values [bedrooms_per_room]  |  Missing Values [longitude]  |  Missing Values [rooms_per_household]  |
--------------------------------------------------------

In [61]:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import weka.core.Attribute;
import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.NumericColumn;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.columns.Column;
import tech.tablesaw.table.Relation;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Utils;

/**
 *
 * @author James Akinniranye
 */
public class WekaConverter {

    private Relation table;
    private Instances structure;

    public WekaConverter() {
        
    }
    public WekaConverter(Relation table) {
        this.table = table;
    }
    
    public WekaConverter setRelation(Relation table) {
        this.table = table;
        return this;
    }

    /**
     * Returns a dataset where the response column is numeric. E.g. to be used
     * for a regression
     */
    public Instances numericDataset(String classColName) {
        return dataset(
                table.numberColumn(classColName),
                AttributeType.NUMERIC,
                table.numericColumns().stream().filter(c -> !c.name().equals(classColName)).collect(Collectors.toList()));
    }

    /**
     * Returns a dataset where the response column is numeric. E.g. to be used
     * for a regression
     */
    public Instances numericDataset(int classColIndex, int... variablesColIndices) {
        return dataset(table.numberColumn(classColIndex), AttributeType.NUMERIC, table.columns(variablesColIndices));
    }

    /**
     * Returns a dataset where the response column is numeric. E.g. to be used
     * for a regression
     */
    public Instances numericDataset(String classColName, String... variablesColNames) {
        return dataset(table.numberColumn(classColName), AttributeType.NUMERIC, table.columns(variablesColNames));
    }

    /**
     * Returns a dataset where the response column is nominal. E.g. to be used
     * for a classification
     */
    public Instances nominalDataset(String classColName) {
        return dataset(
                table.numberColumn(classColName),
                AttributeType.NOMINAL,
                table.numericColumns().stream().filter(c -> !c.name().equals(classColName)).collect(Collectors.toList()));
    }

    /**
     * Returns a dataset where the response column is nominal. E.g. to be used
     * for a classification
     */
    public Instances nominalDataset(int classColIndex, int... variablesColIndices) {
        return dataset(table.numberColumn(classColIndex), AttributeType.NOMINAL, table.columns(variablesColIndices));
    }

    /**
     * Returns a dataset where the response column is nominal. E.g. to be used
     * for a classification
     */
    public Instances nominalDataset(String classColName, String... variablesColNames) {
        return dataset(table.numberColumn(classColName), AttributeType.NOMINAL, table.columns(variablesColNames));
    }

    private Instances dataset(NumericColumn<?> classCol, AttributeType type, List<Column<?>> variableCols) {
        List<Column<?>> convertedVariableCols = variableCols.stream()
                .map(col -> col.type() == ColumnType.STRING ? col : table.nCol(col.name()))
                .collect(Collectors.toList());
       

       
        Instances dataset;
        if(structure == null){
             Attribute classAttribute = type == AttributeType.NOMINAL
                ? colAsNominalAttribute(classCol) : new Attribute(classCol.name());
            ArrayList<Attribute> attributes = new ArrayList<>(convertedVariableCols.stream().map(col -> colAsAttribute(col)).collect(Collectors.toList()));
            attributes.add(classAttribute);
            dataset = new Instances(table.name(), attributes,classCol.size());
            dataset.setClass(classAttribute);
        }
        else{
            dataset = new Instances(structure,classCol.size());
        }
        
        for (int i = 0; i < classCol.size(); i++) {
            Instance inst = new DenseInstance(dataset.numAttributes());
            inst.setDataset(dataset);
            final int r = i;
            IntStream.range(0, dataset.numAttributes()-1)
                    .forEach(c -> inst.setValue(c, getDouble(convertedVariableCols.get(c), dataset.attribute(c), r)));
            inst.setValue(dataset.numAttributes()-1, getDouble(classCol, dataset.classAttribute(), r));
            dataset.add(inst);
        }
        if(structure == null){
            structure  = dataset.stringFreeStructure();
        }
        dataset.compactify();
        return dataset;
    }

    private double getDouble(Column<?> col, Attribute attr, int r) {
        if (col.type() == ColumnType.STRING) {
            return attr.indexOfValue(Utils.unquote(((StringColumn) col).get(r)));
        }
        if (col instanceof NumericColumn) {
            return ((NumericColumn<?>) col).getDouble(r);
        }
        throw new IllegalStateException("Error converting " + col.type() + " column " + col.name() + " to Smile");
    }

    private Attribute colAsAttribute(Column<?> col) {
        return col.type() == ColumnType.STRING ? colAsNominalAttribute(col) : new Attribute(col.name());
    }

    private Attribute colAsNominalAttribute(Column<?> col) {
        Column<?> unique = col.unique().removeMissing();
        Attribute att = new Attribute(col.name(),
                unique.mapInto(o -> Utils.unquote(o.toString()), StringColumn.create(col.name(), unique.size())).asList());
        //att.setWeight(1.0);
        return att;
    }

    private static enum AttributeType {
        NUMERIC,
        NOMINAL
    }
}


In [62]:
var cols= housing.columnNames().stream().filter(c -> !c.equals("median_house_value")).toArray(String[]::new);
var wekaConverter = new WekaConverter(housing);
var housingMl = wekaConverter.numericDataset("median_house_value",cols);
housingMl.toSummaryString();

Relation Name:  housing.csv
Num Instances:  16513
Num Attributes: 13

     Name                      Type  Nom  Int Real     Missing      Unique  Dist
 1 longitude                  Num   0%   1%  99%     0 /  0%    82 /  0%   817 
 2 latitude                   Num   0%   1%  99%     0 /  0%   110 /  1%   844 
 3 housing_median_age         Num   0% 100%   0%     0 /  0%     0 /  0%    52 
 4 total_rooms                Num   0% 100%   0%     0 /  0%  2033 / 12%  5499 
 5 total_bedrooms             Num   0%  99%   0%   173 /  1%   456 /  3%  1823 
 6 population                 Num   0% 100%   0%     0 /  0%  1091 /  7%  3635 
 7 households                 Num   0% 100%   0%     0 /  0%   444 /  3%  1719 
 8 median_income              Num   0%   1%  99%     0 /  0%  8396 / 51% 10916 
 9 ocean_proximity            Nom 100%   0%   0%     0 /  0%     0 /  0%     5 
10 rooms_per_household        Num   0%   0% 100%     0 /  0% 15037 / 91% 15662 
11 bedrooms_per_room          Num   0%   0%  99% 

In [63]:
import weka.filters.supervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.Filter;


NominalToBinary nom = new NominalToBinary();

nom.setInputFormat(housingMl);
var h2 = Filter.useFilter(housingMl, nom);

h2.toSummaryString();

Relation Name:  housing.csv-weka.filters.supervised.attribute.NominalToBinary
Num Instances:  16513
Num Attributes: 16

     Name                      Type  Nom  Int Real     Missing      Unique  Dist
 1 longitude                  Num   0%   1%  99%     0 /  0%    82 /  0%   817 
 2 latitude                   Num   0%   1%  99%     0 /  0%   110 /  1%   844 
 3 housing_median_age         Num   0% 100%   0%     0 /  0%     0 /  0%    52 
 4 total_rooms                Num   0% 100%   0%     0 /  0%  2033 / 12%  5499 
 5 total_bedrooms             Num   0%  99%   0%   173 /  1%   456 /  3%  1823 
 6 population                 Num   0% 100%   0%     0 /  0%  1091 /  7%  3635 
 7 households                 Num   0% 100%   0%     0 /  0%   444 /  3%  1719 
 8 median_income              Num   0%   1%  99%     0 /  0%  8396 / 51% 10916 
 9 ocean_proximity=<1H OCEAN  Num   0% 100%   0%     0 /  0%     0 /  0%     2 
10 ocean_proximity=NEAR OCEA  Num   0% 100%   0%     0 /  0%     0 /  0%     2 

In [64]:
h2.firstInstance();

-119.6,36.56,36,738,168,737,186,1.4415,0,0,0,0,3.967742,0.227642,0.903226,54400

In [65]:
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
var rpl = new ReplaceMissingValues();
rpl.setInputFormat(h2);

h2 = Filter.useFilter(h2, rpl);

h2.toSummaryString();

Relation Name:  housing.csv-weka.filters.supervised.attribute.NominalToBinary-weka.filters.unsupervised.attribute.ReplaceMissingValues
Num Instances:  16513
Num Attributes: 16

     Name                      Type  Nom  Int Real     Missing      Unique  Dist
 1 longitude                  Num   0%   1%  99%     0 /  0%    82 /  0%   817 
 2 latitude                   Num   0%   1%  99%     0 /  0%   110 /  1%   844 
 3 housing_median_age         Num   0% 100%   0%     0 /  0%     0 /  0%    52 
 4 total_rooms                Num   0% 100%   0%     0 /  0%  2033 / 12%  5499 
 5 total_bedrooms             Num   0%  99%   1%     0 /  0%   456 /  3%  1824 
 6 population                 Num   0% 100%   0%     0 /  0%  1091 /  7%  3635 
 7 households                 Num   0% 100%   0%     0 /  0%   444 /  3%  1719 
 8 median_income              Num   0%   1%  99%     0 /  0%  8396 / 51% 10916 
 9 ocean_proximity=<1H OCEAN  Num   0% 100%   0%     0 /  0%     0 /  0%     2 
10 ocean_proximity=NEA

In [66]:
import weka.classifiers.functions.SimpleLinearRegression;
import weka.filters.unsupervised.instance.Resample;
import weka.classifiers.evaluation.EvaluationUtils;
import weka.filters.Filter;

var linerReg = new SimpleLinearRegression();

Resample resample = new Resample();
resample.setInputFormat(h2);
resample.setSampleSizePercent((double)5*100/h2.size());
var evalUtil = new EvaluationUtils();
var testH2 = Filter.useFilter(h2, resample);
evalUtil.getTrainTestPredictions(linerReg, h2, testH2 )
    .forEach(c -> display("test "+ c.actual() + " predicted = "+ c.predicted() ));

test 157300.0 predicted = 148527.23488371738

test 140600.0 predicted = 147033.71660345077

test 188500.0 predicted = 125421.62854782745

test 344600.0 predicted = 261528.4177080381

test 500001.0 predicted = 336735.6393448828

In [67]:
import weka.classifiers.functions.LinearRegression;

var linerReg = new LinearRegression();
evalUtil.getTrainTestPredictions(linerReg, h2, testH2)
    .forEach(c -> display("test "+ c.actual() + " predicted = "+ c.predicted() ));


test 157300.0 predicted = 138771.5823239796

test 140600.0 predicted = 203337.41764575848

test 188500.0 predicted = 188896.34623619076

test 344600.0 predicted = 282167.13860088587

test 500001.0 predicted = 364283.5373682962

In [68]:
import weka.filters.Filter;
import java.util.function.Function;
import java.util.HashMap;
class Pipeline{
    
    private final Filter[] filters;
    private final String[] attributeCols;
    private final String classAttribute;
    private WekaConverter converter;
    private Function<Relation, Relation> preProcessors;
    private final HashMap<Integer,Boolean> checks = new HashMap<>();
    public Pipeline( String[] attributeCols, String classAttribute, Filter... filters ){
        this.attributeCols = attributeCols;
        this.classAttribute = classAttribute;
        this.filters = filters;
    }
    
    public void setPreProcessing(Function<Relation, Relation> preProcessors){
        this.preProcessors = preProcessors;
    }
    
    public Instances fitTransom(Relation data){
        if(converter == null){
            converter = new WekaConverter();
        }
        if(preProcessors != null){
                data = preProcessors.apply(data);
        }
        Instances inst = converter.setRelation(data).numericDataset(classAttribute,attributeCols);
        Instances result = inst;
        for(Filter filter : filters){
            if(!checks.containsKey(filter.hashCode()) ){
               Try.run(() ->  filter.setInputFormat(inst));
               checks.put(filter.hashCode(), true);
            }
            Instances resultTemp = result;
            result = Try.of(() -> Filter.useFilter(resultTemp, filter)).get();
        }
        return result;
    }
}

In [69]:
import weka.filters.unsupervised.attribute.Standardize;

var cols= housing.columnNames().stream().filter(c -> !c.equals("median_house_value")).toArray(String[]::new);
var pipe = new Pipeline(cols, "median_house_value", 
    new ReplaceMissingValues(), new Standardize(), new NominalToBinary() );
Function<Relation,Relation> func = (hous) -> hous.addColumns( 
    hous.nCol("total_rooms").divide(hous.nCol("households")).setName("rooms_per_household"),
    hous.nCol("total_bedrooms").divide(hous.nCol("total_rooms")).setName("bedrooms_per_room"),
    hous.nCol("total_bedrooms").divide(hous.nCol("households")).setName("population_per_household")
);
pipe.setPreProcessing(func);
var d = pipe.fitTransom(strats[1].copy());
Resample resample = new Resample();
resample.setInputFormat(d);
resample.setSampleSizePercent((double)5*100/d.size());

var dtest = Filter.useFilter(d, resample);
dtest;

@relation housing.csv-weka.filters.supervised.attribute.NominalToBinary-weka.filters.unsupervised.instance.Resample-S1-Z100.0

@attribute longitude numeric
@attribute latitude numeric
@attribute housing_median_age numeric
@attribute total_rooms numeric
@attribute total_bedrooms numeric
@attribute population numeric
@attribute households numeric
@attribute median_income numeric
@attribute 'ocean_proximity=<1H OCEAN,NEAR OCEAN,NEAR BAY,ISLAND' numeric
@attribute 'ocean_proximity=NEAR OCEAN,NEAR BAY,ISLAND' numeric
@attribute 'ocean_proximity=NEAR BAY,ISLAND' numeric
@attribute ocean_proximity=ISLAND numeric
@attribute rooms_per_household numeric
@attribute bedrooms_per_room numeric
@attribute population_per_household numeric
@attribute median_house_value numeric

@data
1.136147,-1.134538,-1.000264,-0.22512,-0.18438,0.006762,-0.24249,-0.735225,1,1,0,0,-0.063777,0.024056,0.074733,157300
1.186026,-1.335207,-1.158734,0.43739,1.088206,0.201901,0.894145,-0.754066,1,1,0,0,-0.49273,1.112305,0.18

In [70]:
var linerReg = new LinearRegression();
linerReg.buildClassifier(d);
evalUtil.getTestPredictions(linerReg, dtest)
    .forEach(c -> display("test "+ c.actual() + " predicted = "+ c.predicted() ));

test 157300.0 predicted = 138771.58232458684

test 140600.0 predicted = 203337.41764592758

test 188500.0 predicted = 188896.34623607853

test 344600.0 predicted = 282167.13860010303

test 500001.0 predicted = 364283.5373685745

In [71]:
 import weka.classifiers.Evaluation;
 import java.util.Random;

var linerReg = new LinearRegression();
Evaluation eval = new Evaluation(d);
eval.crossValidateModel(linerReg, d, 10, new Random(1));
display("** Linear Regression Evaluation with Datasets **");
display(eval.toSummaryString(false));


** Linear Regression Evaluation with Datasets **

=== Summary ===

Correlation coefficient                  0.8066
Mean absolute error                  49195.8272
Root mean squared error              68259.954 
Relative absolute error                 53.912  %
Root relative squared error             59.1058 %
Total Number of Instances            16513     


60b5a9fe-ccca-489a-b0c6-97e23fe08b3f

In [72]:
import weka.classifiers.rules.DecisionTable;

var desc = new DecisionTable();
Evaluation eval = new Evaluation(d);
eval.crossValidateModel(desc, d, 10, new Random(1));
display("** DecisionTable Evaluation with Datasets **");
display(eval.toSummaryString(false));

** DecisionTable Evaluation with Datasets **

=== Summary ===

Correlation coefficient                  0.8091
Mean absolute error                  47342.078 
Root mean squared error              67879.3971
Relative absolute error                 51.8805 %
Root relative squared error             58.7763 %
Total Number of Instances            16513     


bcb71797-27ce-45a4-911e-78b36fe01f3a

In [73]:
desc.buildClassifier(d);
evalUtil.getTestPredictions(desc,  dtest)
    .forEach(c -> display("test "+ c.actual() + " predicted = "+ c.predicted() ));

test 157300.0 predicted = 164077.14920634922

test 140600.0 predicted = 164077.14920634922

test 188500.0 predicted = 204337.6588235294

test 344600.0 predicted = 280618.005

test 500001.0 predicted = 446745.0

In [74]:
import weka.classifiers.trees.RandomForest;

RandomForest forest=new RandomForest();
//increasing i to 100 makes the model better
forest.setOptions(new String[]{"-I", "10"});
Evaluation eval = new Evaluation(d);
eval.crossValidateModel(forest, d, 10, new Random(1));
display("** RandomForest Regression Evaluation with Datasets **");
display(eval.toSummaryString(false));

** RandomForest Regression Evaluation with Datasets **

=== Summary ===

Correlation coefficient                  0.8825
Mean absolute error                  36905.5292
Root mean squared error              54401.5729
Relative absolute error                 40.4435 %
Root relative squared error             47.106  %
Total Number of Instances            16513     


2a663121-e596-42e7-acc2-98cffe2c675a

In [75]:

//RandomForest forest=new RandomForest();
forest.buildClassifier(d);
forest.setOptions(new String[]{"-I", "20"});
evalUtil.getTestPredictions(forest,  dtest)
    .forEach(c -> display("test "+ c.actual() + " predicted = "+ c.predicted() ));


EvalException: null

In [76]:
import weka.classifiers.functions.SMOreg;

var svm = new SMOreg();
svm.setOptions(new String[]{"-N","0"});
Evaluation eval = new Evaluation(d);
eval.crossValidateModel(desc, d, 10, new Random(1));
display("** SMO SVM Evaluation with Datasets **");
display(eval.toSummaryString(false));

** SMO SVM Evaluation with Datasets **

=== Summary ===

Correlation coefficient                  0.8091
Mean absolute error                  47342.078 
Root mean squared error              67879.3971
Relative absolute error                 51.8805 %
Root relative squared error             58.7763 %
Total Number of Instances            16513     


4027d97d-d1d3-431c-9583-dbd34e33d5ad

In [77]:
svm.buildClassifier(d);
evalUtil.getTestPredictions(svm,  dtest)
    .forEach(c -> display("test "+ c.actual() + " predicted = "+ c.predicted() ));

test 157300.0 predicted = 129305.16623371604

test 140600.0 predicted = 191713.34683264172

test 188500.0 predicted = 165454.01601764496

test 344600.0 predicted = 263344.2995244953

test 500001.0 predicted = 349464.246875261

In [None]:
import weka.classifiers.meta.GridSearch;
var grid = new GridSearch();
grid.setOptions(new String[]{"-E","RMSE","-output-debug-info"});
grid.buildClassifier(d);
grid.enumerateMeasures();


weka.classifiers.meta.GridSearch
Options: -E RMSE -y-property kernel.gamma -y-min -3.0 -y-max 3.0 -y-step 1.0 -y-base 10.0 -y-expression pow(BASE,I) -x-property C -x-min -3.0 -x-max 3.0 -x-step 1.0 -x-base 10.0 -x-expression pow(BASE,I) -sample-size 100.0 -traversal ROW-WISE -log-file /home/will/Documents/Bet-Ml/learning/java-handson-ml -num-slots 1 -S 1 -W weka.classifiers.functions.SMOreg -output-debug-info -- -C 1.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"

Step 1:


=== Initial grid - Start ===
Shutting down thread pool...
Starting thread pool with 1 slots...
Determining best pair with 2-fold CV in Grid:
X: -3.0 - 3.0, Step 1.0 (weka.classifiers.functions.SMOreg, property C, expr. pow(BASE,I), base 10.0)
Y: -3.0 - 3.0, Step 1.0 (weka.classifiers.functions.SMOreg, property kernel.gamma, expr. pow(BASE,I), base 10.0)
Dimensions (Rows x Columns): 7 x

Progress: completed=1, failed=0, overall=49


In [79]:
grid.getBestClassifier();

CompilationException: 

In [80]:
grid.getBestFilter();

CompilationException: 

In [81]:
grid.getEvaluation();

CompilationException: 

In [82]:
evalUtil.getTestPredictions(grid.getBestClassifier(),  dtest)
    .forEach(c -> display("test "+ c.actual() + " predicted = "+ c.predicted() ));

CompilationException: 