# Regression on yacht hydrodynamics data set
> When doing machine learning, most of the time is spent collecting and cleaning the data. I decided to practice on some existing open data from the Technical University of Delft. The data contain the residuary resistance of sailing yachts.

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [machine learning, regression]
- image: https://www.blur.se/images/df4.jpg
- hide: false
- search_exclude: true

In [1]:
#hide
import warnings
warnings.filterwarnings("ignore")

![](https://www.blur.se/images/df4.jpg)
I found this open [data](http://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics#) from the Technical University of Delft.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import altair as alt
from io import StringIO
import re
import urllib.request

## Load data

Columns:

| Column | Variable | Description |
|--------|----------|-------------|
| 1 | lcg    | Longitudinal position of the center of buoyancy, dimensionless |
| 2 | cp     | Prismatic coefficient, dimensionless |
| 3 | volume | Length-displacement ratio, dimensionless |
| 4 | b/d    | Beam-draught ratio, dimensionless |
| 5 | l/b    | Length-beam ratio, dimensionless |
| 6 | fn     | Froude number, dimensionless |
| 7 | r      | Residuary resistance per unit weight of displacement, dimensionless |


In [3]:
#collapse
columns = [
    'lcg',
    'cp',
    'volume',
    'b/d',
    'l/b',
    'fn',
    'r',
]

data_url = r'http://archive.ics.uci.edu/ml/machine-learning-databases/00243/yacht_hydrodynamics.data'
with urllib.request.urlopen(data_url) as file:
    s_raw=file.read().decode("utf-8")
    
# Remove some dirt: strip trailing spaces before newlines,
# then collapse runs of spaces into a single separator.
s1 = re.sub(r' +\n', '\n', s_raw)
s = re.sub(r' +', ' ', s1)

# Parse the cleaned text as a space-separated table without a header row
data = pd.read_csv(StringIO(s), sep=' ', names=columns)

label = 'r'
# Keep the column order stable (a set would not preserve it)
features = [column for column in columns if column != label]
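
The whitespace cleanup in the loading cell can be tried in isolation on a small string, without downloading anything. The sketch below is only an illustration: `raw_sample` and its values are made up to mimic the quirks being cleaned (trailing spaces and doubled separators), and `normalize_whitespace` is a hypothetical helper name. It uses `' +\n'` so that several trailing spaces are stripped as well.

```python
import re

# Made-up sample (illustrative values, not copied from the real file):
# one trailing space before each newline and a doubled space in row two.
raw_sample = (
    "-2.3 0.568 4.78 3.99 3.17 0.125 0.11 \n"
    "-2.3 0.568 4.78 3.99 3.17 0.150  0.27 \n"
)


def normalize_whitespace(text):
    """Strip spaces before newlines, then collapse runs of spaces into one."""
    text = re.sub(r' +\n', '\n', text)
    return re.sub(r' +', ' ', text)


clean_sample = normalize_whitespace(raw_sample)

# After cleaning, a plain split on a single space yields exactly 7 fields per row
rows = [line.split(' ') for line in clean_sample.splitlines()]
for row in rows:
    print(len(row), row)
```

Without the cleanup, splitting on a single space would produce empty fields, which is roughly what `pd.read_csv(..., sep=' ')` would see as extra columns.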

## Look at data

In [4]:
data.describe()

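|       | lcg       | cp       | volume   | b/d      | l/b      | fn       | r         |
|-------|-----------|----------|----------|----------|----------|----------|-----------|
| count | 308.0     | 308.0    | 308.0    | 308.0    | 308.0    | 308.0    | 308.0     |
| mean  | -2.381818 | 0.564136 | 4.788636 | 3.936818 | 3.206818 | 0.2875   | 10.495357 |
| std   | 1.513219  | 0.02329  | 0.253057 | 0.548193 | 0.247998 | 0.100942 | 15.16049  |
| min   | -5.0      | 0.53     | 4.34     | 2.81     | 2.73     | 0.125    | 0.01      |
| 25%   | -2.4      | 0.546    | 4.77     | 3.75     | 3.15     | 0.2      | 0.7775    |
| 50%   | -2.3      | 0.565    | 4.78     | 3.955    | 3.15     | 0.2875   | 3.065     |
| 75%   | -2.3      | 0.574    | 5.1      | 4.17     | 3.51     | 0.375    | 12.815    |
| max   | 0.0       | 0.6      | 5.14     | 5.35     | 3.64     | 0.45     | 62.42     |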
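
One thing the summary hints at is that `r` has a long right tail: its mean is several times its median. A minimal sketch of that check, using only the numbers copied from the summary above (the `r_stats` dict and the `mean_to_median` helper are illustrative names, not part of the notebook):

```python
# Summary statistics for r, copied from the describe() output above
r_stats = {"mean": 10.495357, "50%": 3.065, "75%": 12.815, "max": 62.42}


def mean_to_median(stats):
    """Ratio of mean to median; values well above 1 hint at a long right tail."""
    return stats["mean"] / stats["50%"]


ratio = mean_to_median(r_stats)
tail = r_stats["max"] / r_stats["75%"]

print(f"mean/median for r: {ratio:.2f}")
print(f"max/75% for r:     {tail:.2f}")
```

On the loaded frame itself, `data['r'].skew()` would give a more direct measure of the same thing.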


In [5]:
alt.Chart(data).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
).properties(
    width=100,
    height=150
).repeat(
    row=[label],
    column=features,
).interactive()