# Chapter 2 – End-to-end Machine Learning project**

*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts Using Java.*



# Setup

First, let's make sure this notebook works well in :

In [32]:
%maven commons-io:commons-io:jar:2.6
%maven io.vavr:vavr:jar:0.10.0
%maven org.apache.commons:commons-compress:1.18
%maven tech.tablesaw:tablesaw-core:jar:0.32.7
%maven tech.tablesaw:tablesaw-jsplot:jar:0.32.7
%maven com.github.haifengl:smile-core:jar:1.5.2
%maven com.github.haifengl:smile-plot:jar:1.5.2   
    
import org.apache.commons.io.*;
import java.io.*;
import io.vavr.control.*;
import org.apache.commons.compress.archivers.tar.*;
import org.apache.commons.compress.compressors.gzip.*;

var DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/";
var HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz";
var PROJECT_ROOT_DIR = ".";
var CHAPTER_ID = "end_to_end_project";
var HOUSING_PATH = FilenameUtils.concat("datasets", "housing");
var BUFFER_SIZE = 1024;
void fetch_housing_data(String housingUrl, File housingPath){
   housingUrl = Objects.toString(housingUrl,HOUSING_URL);
   Objects.requireNonNull(housingPath);
   if(!housingPath.exists()){
       Try.run(() -> FileUtils.forceMkdir(housingPath));
   }
   var tgzPath = new File(FilenameUtils.concat(housingPath.getPath(), "housing.tgz"));
   var urlTemp = housingUrl;
   Try.run(() -> FileUtils.copyURLToFile(new URL(urlTemp), tgzPath )); 
   Try.run(() -> extractTarGZ(tgzPath, housingPath) );
}

void extractTarGZ(File in, File destDir) throws Exception {
    GzipCompressorInputStream gzipIn = new GzipCompressorInputStream(new FileInputStream(in));
    try (TarArchiveInputStream tarIn = new TarArchiveInputStream(gzipIn)) {
        TarArchiveEntry entry;

        while ((entry = (TarArchiveEntry) tarIn.getNextEntry()) != null) {
            /** If the entry is a directory, create the directory. **/
            if (entry.isDirectory()) {
                File f = new File(FilenameUtils.concat(destDir.getPath(),entry.getName()));
                boolean created = f.mkdir();
                if (!created) {
                    System.out.printf("Unable to create directory '%s', during extraction of archive contents.\n",
                            f.getAbsolutePath());
                }
            } else {
                int count;
                byte data[] = new byte[BUFFER_SIZE];
                FileOutputStream fos = new FileOutputStream(FilenameUtils.concat(destDir.getPath(),entry.getName()), false);
                try (BufferedOutputStream dest = new BufferedOutputStream(fos, BUFFER_SIZE)) {
                    while ((count = tarIn.read(data, 0, BUFFER_SIZE)) != -1) {
                        dest.write(data, 0, count);
                    }
                }
            }
        }

    }
}


In [33]:
//fetch_housing_data(HOUSING_URL, new File(HOUSING_PATH));

In [34]:

import tech.tablesaw.api.Table;

Table load_housing_data(File housingPathCsv){
 return Try.of(() ->Table.read().csv(housingPathCsv)).get();
}


In [35]:

var housing = load_housing_data(new File(FilenameUtils.concat(HOUSING_PATH,"housing.csv")));
housing.first(4);

                                                                                  housing.csv                                                                                  
 longitude  |  latitude  |  housing_median_age  |  total_rooms  |  total_bedrooms  |  population  |  households  |  median_income  |  median_house_value  |  ocean_proximity  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   -122.23  |     37.88  |                  41  |          880  |             129  |         322  |         126  |         8.3252  |              452600  |         NEAR BAY  |
   -122.22  |     37.86  |                  21  |         7099  |            1106  |        2401  |        1138  |         8.3014  |              358500  |         NEAR BAY  |
   -122.24  |     37.85  |                  52  |         1467  |             190  |         496  |         177  |      

In [36]:
//planning to replicate housing.info()
display(housing.structure());
display(housing.shape());
display(housing.summary());

            Structure of housing.csv            
 Index  |     Column Name      |  Column Type  |
------------------------------------------------
     0  |           longitude  |       DOUBLE  |
     1  |            latitude  |       DOUBLE  |
     2  |  housing_median_age  |      INTEGER  |
     3  |         total_rooms  |      INTEGER  |
     4  |      total_bedrooms  |      INTEGER  |
     5  |          population  |      INTEGER  |
     6  |          households  |      INTEGER  |
     7  |       median_income  |       DOUBLE  |
     8  |  median_house_value  |      INTEGER  |
     9  |     ocean_proximity  |       STRING  |

20640 rows X 10 cols


Table summary for: housing.csv
         Column: longitude          
 Measure   |         Value         |
------------------------------------
        n  |              20640.0  |
      sum  |   -2467918.699999941  |
     Mean  |  -119.56970445736455  |
      Min  |              -124.35  |
      Max  |              -114.31  |
    Range  |   10.039999999999992  |
 Variance  |    4.014139367081234  |
 Std. Dev  |    2.003531723502584  |
         Column: latitude         
 Measure   |        Value        |
----------------------------------
        n  |            20640.0  |
      sum  |  735441.6200000036  |
     Mean  |  35.63186143410859  |
      Min  |              32.54  |
      Max  |              41.95  |
    Range  |  9.410000000000004  |
 Variance  |  4.562292644202738  |
 Std. Dev  |  2.135952397457101  |
    Column: housing_median_age     
 Measure   |        Value         |
-----------------------------------
        n  |             20640.0  |
      sum  |            591119.0

b273c5d2-d322-469e-aded-428c6ca6013b

In [37]:
housing.xTabCounts("ocean_proximity")

Column: ocean_proximity 
  Category   |  Count  |
------------------------
     ISLAND  |      5  |
 NEAR OCEAN  |   2658  |
   NEAR BAY  |   2290  |
     INLAND  |   6551  |
  <1H OCEAN  |   9136  |

In [38]:

import smile.plot.Histogram;
import tech.tablesaw.plotly.components.Figure;
import tech.tablesaw.plotly.components.Layout;
import tech.tablesaw.plotly.api.Histogram;  
import tech.tablesaw.plotly.Plot;
import javax.imageio.ImageIO;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;
import tech.tablesaw.plotly.components.Page;

void renderPlotly(Figure fig){
    Page page = Page.pageBuilder(fig, "target").build();
    display(page.asJavascript(),"text/html");
}

//renderPlotly(Histogram.create("Distribution of total_rooms", housing, "total_rooms"));

// housing.numericColumns().forEach(f ->{
//          HistogramTrace trace = HistogramTrace.builder(f.asDoubleArray()).build();
//          Plot.show(new Figure(Layout.builder(f.name()).build(), trace));
        
// });


# Preparing for stratified sampling
*I have skipped random sampling in the book*

In [39]:
import java.util.function.ToDoubleFunction;
var incomeCat = housing.doubleColumn("median_income").map((ToDoubleFunction<Double>) f -> Math.ceil(f/1.5)).map((ToDoubleFunction<Double>)cat -> cat > 5 ? 5: cat );
housing.addColumns(incomeCat.setName("income_cat"));
housing.first(2)

                                                                                         housing.csv                                                                                          
 longitude  |  latitude  |  housing_median_age  |  total_rooms  |  total_bedrooms  |  population  |  households  |  median_income  |  median_house_value  |  ocean_proximity  |  income_cat  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   -122.23  |     37.88  |                  41  |          880  |             129  |         322  |         126  |         8.3252  |              452600  |         NEAR BAY  |         5.0  |
   -122.22  |     37.86  |                  21  |         7099  |            1106  |        2401  |        1138  |         8.3014  |              358500  |         NEAR BAY  |         5.0  |

In [40]:
import tech.tablesaw.api.CategoricalColumn;
import tech.tablesaw.columns.Column;
import static tech.tablesaw.aggregate.AggregateFunctions.*;
Table[] stratifiedSampleSplit(Table table, String column, double table1Proportion){
    final Table first = table.emptyCopy();
    final Table second = table.emptyCopy();
    String categoricalColumn = column;
    Column<?> col = table.column(column);
    if(!CategoricalColumn.class.isInstance(col)){
        categoricalColumn += "_stringified";
        table.addColumns(col.asStringColumn().setName(categoricalColumn));
    }
    table.splitOn(categoricalColumn).asTableList().forEach(tab-> {
       Table[] splits = tab.sampleSplit(table1Proportion); 
        first.append(splits[0]);
        second.append(splits[1]);
    });
    if(!categoricalColumn.equals(column)){
        table.removeColumns(table.column(categoricalColumn));
    }
    return new Table[]{first, second};
}

var strats = stratifiedSampleSplit(housing,"income_cat", 0.2);
strats[1].removeColumns(strats[1].column("income_cat"));
strats[0].removeColumns(strats[0].column("income_cat"));
display(strats[1].shape());
display(strats[0].shape());
display(strats[0].summarize("longitude","median_income", mean, count).apply());
strats[1].summarize("longitude","median_income", mean, count).apply();


16513 rows X 10 cols

4127 rows X 10 cols

                                      housing.csv summary                                       
 Mean [median_income]  |  Count [median_income]  |   Mean [longitude]    |  Count [longitude]  |
------------------------------------------------------------------------------------------------
   3.8697425006057675  |                 4127.0  |  -119.56478555851709  |             4127.0  |

                                      housing.csv summary                                       
 Mean [median_income]  |  Count [median_income]  |   Mean [longitude]    |  Count [longitude]  |
------------------------------------------------------------------------------------------------
    3.870903058196591  |                16513.0  |  -119.57093380972567  |            16513.0  |

# Discover and visualize the data to gain insights

**skipped for now

In [41]:
housing = strats[1].copy();
housing.shape();

16513 rows X 10 cols

**Tablesaw  doesn handle missing values very well so we will set missing values to 0

In [42]:
housing.missingValueCounts();

                                                                                                                                                                   housing.csv summary                                                                                                                                                                   
 Missing Values [housing_median_age]  |  Missing Values [median_income]  |  Missing Values [latitude]  |  Missing Values [median_house_value]  |  Missing Values [ocean_proximity]  |  Missing Values [total_rooms]  |  Missing Values [households]  |  Missing Values [total_bedrooms]  |  Missing Values [longitude]  |  Missing Values [population]  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [43]:
var tBRoom = housing.intColumn("total_bedrooms");
housing.replaceColumn("total_bedrooms",tBRoom.set(tBRoom.isMissing(),0));
var summarizer = housing.summarize("total_bedrooms", mean, sum, count).apply();
summarizer

                             housing.csv summary                             
 Mean [total_bedrooms]  |  Sum [total_bedrooms]  |  Count [total_bedrooms]  |
-----------------------------------------------------------------------------
     531.3648034881594  |             8774427.0  |                 16513.0  |

In [44]:
housing.missingValueCounts();

                                                                                                                                                                   housing.csv summary                                                                                                                                                                   
 Missing Values [housing_median_age]  |  Missing Values [median_income]  |  Missing Values [latitude]  |  Missing Values [median_house_value]  |  Missing Values [ocean_proximity]  |  Missing Values [total_rooms]  |  Missing Values [households]  |  Missing Values [total_bedrooms]  |  Missing Values [longitude]  |  Missing Values [population]  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**lets save the mean because we will need it later

In [45]:
var totalBedroomsMean = summarizer.column(0).get(0);
totalBedroomsMean;

531.3648034881594

**Looking for correlation

In [46]:
import java.util.stream.*;
import smile.math.distance.CorrelationDistance;
import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;
import io.vavr.Tuple;  

var medianVector  = housing.intColumn("median_house_value").asDoubleColumn();

var corr = new PearsonsCorrelation();
housing.numericColumns().stream()
    .map(i -> Tuple.of(i.name(),corr.correlation( i.asDoubleArray(), medianVector.asDoubleArray() )))
    .sorted((a, b) -> {
        int c = 0;
        if( a._2 == Double.NaN  && b._2 == Double.NaN ){
            c = 0;
        }
        else if(b._2 == Double.NaN || a._2 > b._2){
            c = 1;
        }
        else if(a._2 == Double.NaN ||a._2 < b._2){
            c = -1;
        }
        
        return c;
    })
    .collect(Collectors.toList());

[(latitude, -0.137640523720164), (longitude, -0.051933320209414054), (population, -0.024536335050061808), (total_bedrooms, 0.04991302623699379), (households, 0.06683809516116539), (housing_median_age, 0.1105272062254497), (total_rooms, 0.13486449898161207), (median_income, 0.6877984372879941), (median_house_value, 1.0)]

**Plot Pandas correlation later

In [47]:
housing.addColumns( 
    housing.nCol("total_rooms").divide(housing.nCol("households")).setName("rooms_per_household"),
    housing.nCol("total_bedrooms").divide(housing.nCol("total_rooms")).setName("bedrooms_per_room"),
    housing.nCol("total_bedrooms").divide(housing.nCol("households")).setName("population_per_household")
);
housing.summary()


Table summary for: housing.csv
         Column: longitude          
 Measure   |         Value         |
------------------------------------
        n  |              16513.0  |
      sum  |  -1974474.8299999777  |
     Mean  |  -119.57093380972564  |
      Min  |              -124.35  |
      Max  |              -114.31  |
    Range  |   10.039999999999992  |
 Variance  |    4.017495663799331  |
 Std. Dev  |   2.0043691435958926  |
         Column: latitude          
 Measure   |        Value         |
-----------------------------------
        n  |             16513.0  |
      sum  |   588369.8100000032  |
     Mean  |   35.63070368800327  |
      Min  |               32.54  |
      Max  |               41.92  |
    Range  |   9.380000000000003  |
 Variance  |   4.551459122041311  |
 Std. Dev  |  2.1334148968358946  |
    Column: housing_median_age     
 Measure   |        Value         |
-----------------------------------
        n  |             16513.0  |
      sum  |         

In [48]:
var medianVector  = housing.intColumn("median_house_value").asDoubleColumn();

var corr = new PearsonsCorrelation();
housing.numericColumns().stream()
    .map(i -> Tuple.of(i.name(),corr.correlation( i.asDoubleArray(), medianVector.asDoubleArray() )))    
    .sorted((a, b) -> {
        int c = 0;
        if( a._2 == Double.NaN  && b._2 == Double.NaN ){
            c = 0;
        }
        else if(b._2 == Double.NaN || a._2 > b._2){
            c = 1;
        }
        else if(a._2 == Double.NaN ||a._2 < b._2){
            c = -1;
        }
        
        return c;
    })
    .collect(Collectors.toList());

[(bedrooms_per_room, -0.23658451309666478), (latitude, -0.137640523720164), (longitude, -0.051933320209414054), (population_per_household, -0.04476566155335623), (population, -0.024536335050061808), (total_bedrooms, 0.04991302623699379), (households, 0.06683809516116539), (housing_median_age, 0.1105272062254497), (total_rooms, 0.13486449898161207), (rooms_per_household, 0.1461389816781181), (median_income, 0.6877984372879941), (median_house_value, 1.0)]

In [49]:
housing.missingValueCounts();

                                                                                                                                                                                                                                  housing.csv summary                                                                                                                                                                                                                                  
 Missing Values [housing_median_age]  |  Missing Values [median_income]  |  Missing Values [latitude]  |  Missing Values [ocean_proximity]  |  Missing Values [households]  |  Missing Values [total_bedrooms]  |  Missing Values [population]  |  Missing Values [median_house_value]  |  Missing Values [total_rooms]  |  Missing Values [population_per_household]  |  Missing Values [bedrooms_per_room]  |  Missing Values [longitude]  |  Missing Values [rooms_per_household]  |
--------------------------------------------------------

**I am going with option 1(Get rid of the corresponding districts.)

In [50]:
housing = strats[1].copy();
housing.addColumns( 
    housing.nCol("total_rooms").divide(housing.nCol("households")).setName("rooms_per_household"),
    housing.nCol("total_bedrooms").divide(housing.nCol("total_rooms")).setName("bedrooms_per_room"),
    housing.nCol("total_bedrooms").divide(housing.nCol("households")).setName("population_per_household")
);
housing.dropRowsWithMissingValues();
var cols= housing.columnNames();
cols.remove("median_house_value");

housing.csv Summary
		min	q1	median	mean	q3	max
longitude		-124.3500	-121.8000	-118.5000	-119.5709	-118.0100	-114.3100
latitude		32.5400	33.9300	34.2500	35.6307	37.7200	41.9200
housing_median_age		1.0000	18.0000	29.0000	28.7094	37.0000	52.0000
total_rooms		2.0000	1448.0000	2120.0000	2626.0374	3136.0000	39320.0000
total_bedrooms		1.0000	295.0000	434.0000	NaN	646.0000	6445.0000
population		3.0000	785.0000	1164.0000	1423.6361	1720.0000	35682.0000
households		1.0000	280.0000	409.0000	498.2729	601.0000	6082.0000
median_income		0.4999	2.5625	3.5368	3.8709	4.7422	15.0001
ocean_proximity		0.0000	0.0000	1.0000	0.9269	1.0000	4.0000
rooms_per_household		0.8889	4.4363	5.2332	5.4329	6.0560	141.9091
2 more rows...

In [64]:
import java.text.ParseException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import smile.data.Attribute;
import smile.data.AttributeDataset;
import smile.data.NominalAttribute;
import smile.data.NumericAttribute;
import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.NumericColumn;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.columns.Column;
import tech.tablesaw.table.Relation;

 class SmileConverter {

    private final Relation table;

    public SmileConverter(Relation table) {
        this.table = table;
    }

    /**
     * Returns a dataset where the response column is numeric. E.g. to be used for a regression
     */
    public AttributeDataset numericDataset(String responseColName) {
        return dataset(
            table.numberColumn(responseColName),
            AttributeType.NUMERIC,
            table.numericColumns().stream().filter(c -> !c.name().equals(responseColName)).collect(Collectors.toList()));
    }  

    /**
     * Returns a dataset where the response column is numeric. E.g. to be used for a regression
     */
    public AttributeDataset numericDataset(int responseColIndex, int... variablesColIndices) {
        return dataset(table.numberColumn(responseColIndex), AttributeType.NUMERIC, table.columns(variablesColIndices));
    }  

    /**
     * Returns a dataset where the response column is numeric. E.g. to be used for a regression
     */
    public AttributeDataset numericDataset(String responseColName, String... variablesColNames) {
        return dataset(table.numberColumn(responseColName), AttributeType.NUMERIC, table.columns(variablesColNames));
    }

    /**
     * Returns a dataset where the response column is nominal. E.g. to be used for a classification
     */
    public AttributeDataset nominalDataset(String responseColName) {
        return dataset(
            table.numberColumn(responseColName),
            AttributeType.NOMINAL,
            table.numericColumns().stream().filter(c -> !c.name().equals(responseColName)).collect(Collectors.toList()));
    }  

    /**
     * Returns a dataset where the response column is nominal. E.g. to be used for a classification
     */
    public AttributeDataset nominalDataset(int responseColIndex, int... variablesColIndices) {
        return dataset(table.numberColumn(responseColIndex), AttributeType.NOMINAL, table.columns(variablesColIndices));
    }  

    /**
     * Returns a dataset where the response column is nominal. E.g. to be used for a classification
     */
    public AttributeDataset nominalDataset(String responseColName, String... variablesColNames) {
        return dataset(table.numberColumn(responseColName), AttributeType.NOMINAL, table.columns(variablesColNames));
    }

    private AttributeDataset dataset(NumericColumn<?> responseCol, AttributeType type, List<Column<?>> variableCols) {
        List<Column<?>> convertedVariableCols = variableCols.stream()
            .map(col -> col.type() == ColumnType.STRING ? col : table.nCol(col.name()))
            .collect(Collectors.toList());
        Attribute responseAttribute = type == AttributeType.NOMINAL
            ? colAsNominalAttribute(responseCol) : new NumericAttribute(responseCol.name());
        AttributeDataset dataset = new AttributeDataset(table.name(),
            convertedVariableCols.stream().map(col -> colAsAttribute(col)).toArray(Attribute[]::new),
            responseAttribute);
        for (int i = 0; i < responseCol.size(); i++) {
            final int r = i;
            double[] x = IntStream.range(0, convertedVariableCols.size())
                .mapToDouble(c -> getDouble(convertedVariableCols.get(c), dataset.attributes()[c], r))
                .toArray();
            dataset.add(x, responseCol.getDouble(r));
        }
        return dataset;
    }
    
    private double getDouble(Column<?> col, Attribute attr, int r) {
        if (col.type() == ColumnType.STRING) {
            String value = ((StringColumn) col).get(r);
            try {
                return attr.valueOf(value);
            } catch (ParseException e) {
                throw new IllegalArgumentException("Error converting " + value + " to nominal", e);
            }
        }
        if (col instanceof NumericColumn) {
            return ((NumericColumn<?>) col).getDouble(r);
        }
        throw new IllegalStateException("Error converting " + col.type() + " column " + col.name() + " to Smile");
    }

    private Attribute colAsAttribute(Column<?> col) {
        return col.type() == ColumnType.STRING ? colAsNominalAttribute(col) : new NumericAttribute(col.name());
    }

    private NominalAttribute colAsNominalAttribute(Column<?> col) {
        Column<?> unique = col.unique();
        return new NominalAttribute(col.name(),
            unique.mapInto(o -> o.toString(), StringColumn.create(col.name(), unique.size())).asObjectArray());
    }

    private static enum AttributeType {
        NUMERIC,
        NOMINAL
    }

}

In [65]:

var housingMl = new SmileConverter(housing).numericDataset("median_house_value",cols.toArray(new String[cols.size()]));
housingMl.summary();

housing.csv Summary
		min	q1	median	mean	q3	max
longitude		-124.3500	-121.8000	-118.5000	-119.5709	-118.0100	-114.3100
latitude		32.5400	33.9300	34.2500	35.6307	37.7200	41.9200
housing_median_age		1.0000	18.0000	29.0000	28.7094	37.0000	52.0000
total_rooms		2.0000	1448.0000	2120.0000	2626.0374	3136.0000	39320.0000
total_bedrooms		1.0000	295.0000	434.0000	NaN	646.0000	6445.0000
population		3.0000	785.0000	1164.0000	1423.6361	1720.0000	35682.0000
households		1.0000	280.0000	409.0000	498.2729	601.0000	6082.0000
median_income		0.4999	2.5625	3.5368	3.8709	4.7422	15.0001
ocean_proximity		0.0000	0.0000	1.0000	0.9269	1.0000	4.0000
rooms_per_household		0.8889	4.4363	5.2332	5.4329	6.0560	141.9091
2 more rows...

In [66]:
import smile.feature.OneHotEncoder;
double[][] x = housingMl.toArray(new double[housingMl.size()][]);
double[][] result = new double[housingMl.size()][];
var oneHot = new OneHotEncoder(housingMl.attributes());

In [69]:
for (int i = 0; i < x.length; i++) {
    result[i] = oneHot.feature(x[i]);
}
display(result[0].length);
display(x[0].length);

17

12

a7b7a883-050b-45b0-9042-44f48cf41b33

In [67]:
import smile.data.*;

var attributes = housingMl.attributes();

for (int i = 0, j = 0; j < attributes.length; j++) {
            Attribute attribute = attributes[j];
            if (attribute instanceof NominalAttribute) {
                NominalAttribute nominal = (NominalAttribute) attribute;
                display(nominal.size());
            } 
}

6

In [74]:
Arrays.asList(result[0]);

[[D@f91e2925]