For SheffieldML's August session, we've written a simple "getting started" Kernel, to help those new to machine learning or Kaggle get up and running.

What follows demonstrates the process of working with Kaggle and its data, then building and visualising a model to predict house prices. It also gives you a baseline submission score of *1.16083*.

Your challenge is to make a submission to Kaggle that beats it!

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

The code above is the relevant part of the boilerplate that Kaggle adds when you create a new kernel.

First, I load the training set. I don't need the test set yet, so I'll load that later when I'm ready to use it.

In [None]:
train = pd.read_csv("../input/train.csv")


Before I wrote the next block, I checked out what Kaggle says about the [data](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). I could see descriptions of the columns, as well as histograms that gave me a rough idea of what the data in each column looked like. I've also got the advantage of remembering some of what we discovered in the last Sheffield ML session on this competition!

`OverallQual` looks like a good place to start with a linear regression. You'd expect higher-quality houses to sell for higher prices. The column is never missing, and it has numeric values with a meaningful order: a lower number represents a lower quality than a higher one. That means I can use it without having to do any cleaning up or other munging.
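
If you'd rather check those two properties than take my word for it, something like the sketch below would do it. The tiny `toy` frame and its values are made up so the snippet runs on its own; in the notebook you'd run the same calls on `train`.

```python
import pandas as pd

# Tiny made-up frame standing in for train.csv; the values are illustrative only.
toy = pd.DataFrame({
    "OverallQual": [5, 7, 6, 8],
    "LotFrontage": [65.0, None, 68.0, 60.0],
    "SalePrice": [150000, 210000, 180000, 250000],
})

# Count missing values per column; 0 means the column needs no filling in.
missing_counts = toy.isna().sum()
print(missing_counts)

# Confirm the column is numeric, so it can be fed to a regression as-is.
is_numeric = pd.api.types.is_numeric_dtype(toy["OverallQual"])
print(is_numeric)
```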

Let's see if there's any evidence that my guess about the relationship between `OverallQual` and `SalePrice` is correct. We can see how correlated they are really easily:

In [None]:
train[["OverallQual", "SalePrice"]].corr()

We're interested in the values of 0.79 on the off-diagonal.

Looks good! A correlation of 0 would mean that there's no linear relationship between the values in these columns, while a correlation of 1 would mean a perfect positive one. At 0.79, a higher quality *does* suggest a higher sale price.
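
To get a feel for what that number measures, here's a minimal sketch that computes a Pearson correlation by hand and compares it to what pandas reports. The data is made up for illustration, so the value it prints is not the 0.79 from the real training set.

```python
import numpy as np
import pandas as pd

# Made-up quality scores and prices, for illustration only.
toy = pd.DataFrame({
    "OverallQual": [3, 5, 6, 7, 9],
    "SalePrice": [90000, 140000, 160000, 210000, 320000],
})

x = toy["OverallQual"].to_numpy(dtype=float)
y = toy["SalePrice"].to_numpy(dtype=float)

# Pearson correlation: how x and y vary together, scaled by how much each varies alone.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same number read from the off-diagonal of the pandas correlation matrix.
r_pandas = toy.corr().loc["OverallQual", "SalePrice"]

print(r_manual, r_pandas)
```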

Let's visualise the relationship, by plotting the price against the quality and drawing a best fit line through the points. [Seaborn](https://seaborn.pydata.org/) is a neat visualisation library that solves that problem in one line!

In [None]:
import seaborn as sns
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally.
result = sns.regplot(x=train["OverallQual"], y=train["SalePrice"])

It's reasonably clear that there's a relationship now! Let's train a model that we can use to predict house prices. The go-to tool in Python is [Scikit-Learn](http://scikit-learn.org/stable/).

In [None]:
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(train["OverallQual"].values.reshape(-1, 1), train["SalePrice"].values)

The `.values` property returns our Pandas Series as a NumPy array, which is a data structure sklearn accepts. `.reshape(-1, 1)` just turns our 1D array (think one row) into a 2D matrix with one column instead, which again is just to conform to what sklearn expects. If you take out those calls, the error messages you see tell you what you need to do to add them back in.
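
Here's a small sketch of what those two calls do to the shape of the data. The `quality` Series is made up to stand in for `train["OverallQual"]`.

```python
import numpy as np
import pandas as pd

# A made-up Series standing in for train["OverallQual"].
quality = pd.Series([5, 7, 6, 8], name="OverallQual")

flat = quality.values          # 1D array of 4 values
column = flat.reshape(-1, 1)   # 2D array: 4 samples (rows), 1 feature (column)

print(flat.shape)    # (4,)
print(column.shape)  # (4, 1)

# Selecting with a list of column names keeps the data 2D from the start,
# which is another way to get the one-column layout.
frame = quality.to_frame()
print(frame[["OverallQual"]].shape)  # (4, 1)
```

The `-1` tells NumPy to work out the number of rows itself from the length of the array.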

So we have a model. Let's see what it does, by asking it for the house price predictions of `OverallQual` values 1, 5 and 10:

In [None]:
model.predict([[1], [5], [10]]) # argument same as np.array([1, 5, 10]).reshape(-1, 1)

Those are the prices that our model would predict for those qualities. It's a really basic model we've built, especially when qualities are integers in the range 1-10 - it'll only ever make one of 10 predictions!
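
To see where predictions like these come from, here's a sketch that fits the same kind of model on made-up data and rebuilds its predictions by hand from the fitted slope and intercept. The names `toy_model`, `quality` and `price` are just for this illustration.

```python
import numpy as np
from sklearn import linear_model

# Made-up quality scores and prices, for illustration only.
quality = np.array([3, 5, 6, 7, 9]).reshape(-1, 1)
price = np.array([90000, 140000, 160000, 210000, 320000])

toy_model = linear_model.LinearRegression()
toy_model.fit(quality, price)

# A fitted single-feature linear regression is just intercept + slope * quality.
slope = toy_model.coef_[0]
intercept = toy_model.intercept_

queries = np.array([1, 5, 10]).reshape(-1, 1)
by_model = toy_model.predict(queries)
by_hand = intercept + slope * queries.ravel()

print(by_model)
print(by_hand)
```

Both lines print the same numbers: each quality value maps to exactly one point on a straight line.
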
Still, it'll be a lot better than guessing randomly. Let's make our predictions from the supplied test data and submit:

In [None]:
test = pd.read_csv("../input/test.csv")

predicted = model.predict(test["OverallQual"].values.reshape(-1, 1))

my_submission = pd.DataFrame({'Id': test["Id"], 'SalePrice': predicted})

my_submission.to_csv('my_submission.csv', index=False)

That's it! When you commit this notebook, `my_submission.csv` will be available in the output on the page you see before you edit the notebook. You can submit by selecting it and hitting submit. [This post](https://www.kaggle.com/dansbecker/submitting-from-a-kernel) explains that part in more detail.
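
Before submitting, it can be worth a quick sanity check that the file looks the way you expect. This is only a sketch: the ids and prices are made up, and it writes to an in-memory buffer instead of disk so it runs on its own.

```python
import io
import pandas as pd

# Made-up ids and predictions, for illustration only.
toy_submission = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [150000.0, 210000.0]})

# Write to an in-memory buffer rather than a file, then inspect the text.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()

print(lines[0])        # header row
print(len(lines) - 1)  # number of prediction rows

# Checks worth running before submitting: right columns, no missing predictions.
assert list(toy_submission.columns) == ["Id", "SalePrice"]
assert not toy_submission["SalePrice"].isna().any()
```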

Submitting this results file means that Kaggle compares your predictions to the correct values and calculates your error. See [Evaluation](https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation) for more details. You're then assigned a score - this kernel scores 1.16083. Lower is better, and the [best kernels](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard) can score less than 0.1!
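
If you want a rough idea of your score before submitting, you can compute the same kind of error yourself on houses whose prices you know. The sketch below assumes the metric is the one described on the competition's evaluation page linked above: root-mean-squared error between the log of the predicted price and the log of the actual price. The function name `rmse_of_logs` and the prices are my own, for illustration.

```python
import numpy as np

def rmse_of_logs(actual, predicted):
    """Root-mean-squared error between log(actual) and log(predicted).

    Assumes every value is strictly positive; the log of zero or of a
    negative price is undefined.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((np.log(predicted) - np.log(actual)) ** 2)))

# Made-up prices, for illustration only.
actual = [100000, 200000, 300000]
perfect = rmse_of_logs(actual, actual)
doubled = rmse_of_logs(actual, [200000, 400000, 600000])

print(perfect)
print(doubled)
```

Because the error is measured on logs, predicting double the true price costs the same (about 0.69) whether the house is cheap or expensive.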

To get your own copy of this notebook as a starting point, just hit the "Fork Notebook" button at the top of the screen.

So the question now is - how are you going to beat 1.16083?