*This tutorial is part of the [Learn Machine Learning](https://www.kaggle.com/learn/machine-learning) educational track.*

# Starting Your Project

You are about to build a simple model and then continually improve it. It is easiest to keep one browser tab (or window) for the tutorials you are reading, and a separate browser window with the code you are writing. You will continue writing code in the same place even as you progress through the sequence of tutorials.

**The starting point for your project is at [THIS LINK](https://www.kaggle.com/dansbecker/my-model/). Open that link in a new tab. Then hit the "Fork Notebook" button towards the top of the screen.**

![Imgur](https://i.imgur.com/GRtMTWw.png)

**You will see examples predicting home prices using data from Melbourne, Australia. You will then write code to build a model predicting prices in the US state of Iowa. The Iowa data is pre-loaded in your coding notebook.**

### Working in Kaggle Notebooks
You will be coding in a "notebook" environment, which lets you see your code and its output in one place. A couple of tips on the Kaggle notebook environment:

1) It is composed of "cells."  You will write code in the cells. Add a new cell by clicking on a cell and then using the buttons that look like this: ![Imgur](https://i.imgur.com/Lscji3d.png) The arrows indicate whether the new cell goes above or below your current location. <br><br>
2) Execute the code in the current cell with the keyboard shortcut Control-Enter.


---
# Using Pandas to Get Familiar With Your Data

The first thing you'll want to do is familiarize yourself with the data.  You'll use the Pandas library for this.  Pandas is the primary tool that modern data scientists use for exploring and manipulating data.  Most people abbreviate pandas in their code as `pd`.  We do this with the command:

In [1]:
import pandas as pd

The most important part of the Pandas library is the DataFrame.  A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. The Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.  Let's start by looking at a basic data overview with our example data from Melbourne and the data you'll be working with from Iowa.
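To make the "table" idea concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The column names and values are invented for illustration and do not come from the Melbourne or Iowa files.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy_homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [500000, 750000, 1200000],
    "Suburb": ["A", "B", "C"],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = toy_homes.shape

# Selecting a single column by name returns a pandas Series
prices = toy_homes["Price"]
mean_price = prices.mean()

print(toy_homes)
print(n_rows, n_cols)
```

Each dictionary key becomes a column and each list position becomes a row, much like filling in a spreadsheet column by column.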

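The example will use data at the file path **`../input/melbourne-housing-snapshot/melb_data.csv`**.  Your data will be available in your notebook at `../input/train.csv` (which is already typed into the sample code for you). Note that the code cells on this page read from paths beginning with `./data/`; adjust the path to match where the file lives in your environment.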

We load and explore the data with the following:

In [2]:
# save filepath to variable for easier access
melbourne_file_path = './data/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

| | Unnamed: 0 | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 18396.0 | 18396.0 | 18396.0 | 18395.0 | 18395.0 | 14927.0 | 14925.0 | 14820.0 | 13603.0 | 7762.0 | 8958.0 | 15064.0 | 15064.0 | 18395.0 |
| mean | 11826.787073 | 2.93504 | 1056697.0 | 10.389986 | 3107.140147 | 2.913043 | 1.538492 | 1.61552 | 558.116371 | 151.220219 | 1965.879996 | -37.809849 | 144.996338 | 7517.975265 |
| std | 6800.710448 | 0.958202 | 641921.7 | 6.00905 | 95.000995 | 0.964641 | 0.689311 | 0.955916 | 3987.326586 | 519.188596 | 37.013261 | 0.081152 | 0.106375 | 4488.416599 |
| min | 1.0 | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 5936.75 | 2.0 | 633000.0 | 6.3 | 3046.0 | 2.0 | 1.0 | 1.0 | 176.5 | 93.0 | 1950.0 | -37.8581 | 144.931193 | 4294.0 |
| 50% | 11820.5 | 3.0 | 880000.0 | 9.7 | 3085.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.803625 | 145.00092 | 6567.0 |
| 75% | 17734.25 | 3.0 | 1302000.0 | 13.3 | 3149.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 2000.0 | -37.75627 | 145.06 | 10331.0 |
| max | 23546.0 | 12.0 | 9000000.0 | 48.1 | 3978.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


# Interpreting Data Description
The results show 8 numbers for each numeric column in your original dataset (by default, `describe()` leaves out non-numeric columns). The first number, the **count**, shows how many rows have non-missing values.

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.
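As a small sketch of how missing values affect the **count**, the example below uses an invented two-column table in which `np.nan` stands for a value that was never recorded. The column names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks a value that was never recorded
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 3],
    "BuildingArea": [np.nan, 80.0, 120.0, np.nan],
})

total_rows = len(toy)          # counts every row, missing or not
non_missing = toy.count()      # non-missing values per column
missing = toy.isnull().sum()   # missing values per column

print(total_rows)
print(non_missing)
print(missing)
```

The `count` row of `describe()` reports the same per-column number as `toy.count()`, which is why it can be smaller than the number of rows.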

The second value is the **mean**, which is the average.  Under that, **std** is the standard deviation, which measures how numerically spread out the values are.

To interpret the **min**, **25%**, **50%**, **75%** and **max** values, imagine sorting each column from lowest to highest value.  The first (smallest) value is the min.  If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.  That is the **25%** value (pronounced "25th percentile").  The 50th and 75th percentiles are defined analogously, and the **max** is the largest number.
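The sorting picture can be checked on a short, hand-made list. This is only a sketch with invented values; with five evenly spaced numbers the percentiles land exactly on list entries, whereas on real data pandas interpolates between neighbouring values by default.

```python
import pandas as pd

# Hypothetical values, already sorted from lowest to highest
values = pd.Series([10, 20, 30, 40, 50])

summary = values.describe()

# The same percentile can be requested directly
first_quartile = values.quantile(0.25)

print(summary)
```

Here the 25% value is 20, the 50% value is 30 and the 75% value is 40, matching a walk one quarter, one half and three quarters of the way through the sorted list.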

--- 
# Your Turn
**Remember, the notebook you want to "fork" is [here](https://www.kaggle.com/dansbecker/my-model/).**

Run the equivalent commands (to read the data and print the summary) in the code cell below.  The file path for your data is already shown in your coding notebook. Look at the mean, minimum and maximum values for the first few fields. Are any of the values so crazy that it makes you think you've misinterpreted the data?

There are a lot of fields in this data.  You don't need to look at it all quite yet.

When your code is correct, you'll see the size, in square feet, of the smallest lot in your dataset.  This is from the **min** value of **LotArea**, and you can see the **max** size too.  You should notice that it's a big range of lot sizes! 
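If you would rather pull out those two numbers than scan the printed table, one possible approach is sketched below. It uses only the five `LotArea` values visible in the `data_train.head()` output on this page, so the results describe those five rows and not the full dataset; the variable names are illustrative.

```python
import pandas as pd

# LotArea values copied from the five rows in the head() output on this page
sample = pd.DataFrame({"LotArea": [8450, 9600, 11250, 9550, 14260]})

summary = sample.describe()

# Rows of the summary are labelled 'min', 'max', etc., so .loc can select them
smallest_lot = summary.loc["min", "LotArea"]
largest_lot = summary.loc["max", "LotArea"]

print(smallest_lot, largest_lot)
```

Running the same two `.loc` lookups on the summary of your full Iowa data gives the range described above.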

You'll also see some columns filled with `...`.  That indicates that we had too many columns of data to print, so the middle ones were omitted from printing.

We'll take care of both issues in the next step.
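If you are curious about the hidden columns before moving on, the sketch below shows one option that is not part of the original tutorial: list the column names, or temporarily lift the pandas display limit. The wide table here is invented purely so the example runs on its own.

```python
import pandas as pd

# Hypothetical wide table: 30 numeric columns named col_0 ... col_29
wide = pd.DataFrame({f"col_{i}": [i, i + 1, i + 2] for i in range(30)})

# The full list of column names is available even when the printout is truncated
column_names = list(wide.columns)

# Lift the column display limit inside this block only
with pd.option_context("display.max_columns", None):
    full_text = repr(wide.describe())

print(column_names)
print(full_text)
```

Using `pd.option_context` keeps the change local, so later cells print with the usual truncation.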

# Continue
Move on to the next [page](https://www.kaggle.com/dansbecker/Selecting-And-Filtering-In-Pandas/) where you will focus in on the most relevant columns.

In [3]:
import pandas as pd

data_train = pd.read_csv('./data/house-prices-advanced-regression-techniques/train.csv')

In [4]:
data_train.head()

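| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |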


In [5]:
data_train.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |
