# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The tutorials use the Melbourne housing data, which is not available in this workspace.  You will need to apply the same concepts to the data in this notebook, the Iowa data.

Visit the [Learn Discussion](https://www.kaggle.com/learn-forum) forum with any questions or comments.

# Write Your Code Below



***Starting your ML Project***

In [1]:
import pandas as pd

main_file_path = '../input/train.csv'
data = pd.read_csv(main_file_path)
print(data.describe())

***Selecting and Filtering Data***

Print the list of columns.

In [2]:
print(data.columns)

From the list of columns, find the name of the column with the sale prices of the homes. Use dot notation to extract it to a variable (as the tutorial did to create melbourne_price_data).

Use the head command to print out the top few lines of the variable you just created.

In [3]:
salePrice_homes = data.SalePrice
print(salePrice_homes.head())
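
Dot notation only works when the column name is a valid Python identifier. Several Iowa columns, such as 1stFlrSF, start with a digit, so they need bracket notation instead. A minimal sketch on a made-up frame (toy is a hypothetical name and the values are invented for illustration):

```python
import pandas as pd

# Tiny made-up frame that borrows two Iowa column names; the values are invented.
toy = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500],
    "1stFlrSF": [856, 1262, 920],
})

# Dot notation works when the column name is a valid Python identifier.
prices_dot = toy.SalePrice

# Bracket notation works for every column, including names that start with a digit.
prices_bracket = toy["SalePrice"]
first_floor = toy["1stFlrSF"]  # toy.1stFlrSF would be a SyntaxError

print(prices_dot.equals(prices_bracket))
print(first_floor.head())
```

Both notations return the same Series, so bracket notation is a safe default when a column name might not be a valid identifier.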

Pick any two variables and store them in a new DataFrame (as the tutorial did to create two_columns_of_data).

Use the describe command with the DataFrame you just created to see summaries of those variables. 

In [4]:
myVariables = ['SaleType', 'SaleCondition']
twoColumns = data[myVariables]
twoColumns.describe()
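
SaleType and SaleCondition hold text rather than numbers, so describe summarises them with count, unique, top and freq instead of the mean and quartiles. The sketch below shows the difference on an invented frame (toy and its values are assumptions, not the real Iowa data):

```python
import pandas as pd

# Invented rows: one text column and one numeric column.
toy = pd.DataFrame({
    "SaleType": ["WD", "WD", "New", "COD"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Text columns are summarised by counts and the most frequent value.
text_summary = toy[["SaleType"]].describe()

# Numeric columns are summarised by mean, spread and quartiles.
numeric_summary = toy[["LotArea"]].describe()

print(text_summary)
print(numeric_summary)
```

Picking numeric columns such as LotArea and YearBuilt instead would give the mean, standard deviation and quartiles.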

***Your First Scikit-Learn Model***

Select the target variable you want to predict. You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable). Save this to a new variable called y.

Create a list of the names of the predictors we will use in the initial model. Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

- LotArea
- YearBuilt
- 1stFlrSF
- 2ndFlrSF
- FullBath
- BedroomAbvGr
- TotRmsAbvGrd

Using the list of variable names you just created, select a new DataFrame of the predictors data. Save this with the variable name X.

In [5]:
y = data.SalePrice  # prediction target
# names of the predictors used in the first model
iowa_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
                   'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = data[iowa_predictors]  # new DataFrame with only the predictor columns
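
Before fitting, it can help to confirm that the predictors and the target have the same number of rows and that the predictors contain no missing values. A quick check might look like this, sketched on invented data (toy, toy_X and toy_y are hypothetical names):

```python
import pandas as pd

# Invented rows that reuse a few Iowa column names.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "SalePrice": [208500, 181500, 223500],
})

toy_y = toy["SalePrice"]                # target: a Series
toy_X = toy[["LotArea", "YearBuilt"]]   # predictors: a DataFrame

# Same number of rows in both, and one column per predictor.
print(toy_X.shape, toy_y.shape)

# Count of missing values per predictor column.
print(toy_X.isnull().sum())
```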

Create a DecisionTreeRegressor model and save it to a variable (with a name like my_model or iowa_model). Ensure you've done the relevant import so you can run this command.

Fit the model you have created using the data in X and the target data you saved above.

In [6]:
from sklearn.tree import DecisionTreeRegressor
my_iowa_model = DecisionTreeRegressor()
my_iowa_model.fit(X, y)

Make a few predictions with the model's predict command and print out the predictions.

In [7]:
print("Making predictions for the first five houses:")
print(X.head())

In [8]:
print("The predictions are as follows: ")
print(my_iowa_model.predict(X.head()))
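
The predictions above are for houses the model was trained on. A decision tree with no size limit can memorise its training rows, so these predictions say little about accuracy on new houses. A small sketch on invented data (all names and values here are assumptions) shows the effect:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Five invented houses, each with a distinct combination of predictor values.
toy_X = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
})
toy_y = pd.Series([208500, 181500, 223500, 140000, 250000])

# No max_leaf_nodes or max_depth: the tree grows until every leaf is pure.
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(toy_X, toy_y)

# Predicting on the training rows reproduces the training targets.
in_sample = toy_model.predict(toy_X)
print(in_sample)
print((in_sample == toy_y.to_numpy()).all())
```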

***Model Validation***

Use the train_test_split command to split up your data.

In [9]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
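
By default, train_test_split shuffles the rows and holds out 25% of them for validation. Setting random_state fixes the shuffle so the split is the same on every run. The sketch below checks the sizes on an invented 20-row frame (the names and values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Twenty invented rows so the 75/25 split is easy to see.
toy_X = pd.DataFrame({"LotArea": np.arange(20) * 500 + 5000})
toy_y = pd.Series(np.arange(20) * 10000 + 100000, name="SalePrice")

tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

# Number of training rows and validation rows.
print(len(tr_X), len(va_X))

# The two parts share no rows, and X and y stay aligned by index.
print(set(tr_X.index).isdisjoint(va_X.index))
print(tr_X.index.equals(tr_y.index))
```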

Fit the model with the training data.

In [10]:
my_iowa_model.fit(train_X, train_y)

Make predictions with the validation predictors.

Calculate the mean absolute error between your predictions and the actual target values for the validation data.

In [11]:
from sklearn.metrics import mean_absolute_error
val_predictions = my_iowa_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
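
Mean absolute error is the average of the absolute differences between the actual and predicted prices, so it is expressed in the same units as SalePrice. A minimal sketch with three invented prices computes it by hand and compares the result with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented actual and predicted prices for three houses.
actual = np.array([200000, 150000, 300000])
predicted = np.array([210000, 140000, 330000])

# Absolute error per house, then the average across houses.
manual_mae = np.abs(actual - predicted).mean()

# The library function should give the same number.
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sklearn_mae)
```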

***Underfitting, Overfitting and Model Optimization***

In [12]:
def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
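
The loop above prints one validation error per tree size. Too few leaves tends to underfit and too many tends to overfit, so the size with the lowest validation error is the one to keep. One way to pick it programmatically is sketched below on synthetic data (the data, the candidate sizes and the helper toy_mae are all assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a simple signal plus noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3 * X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(0, 1, size=200)

tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)

def toy_mae(max_leaf_nodes):
    """Validation MAE of a tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

# Map each candidate size to its validation error, then take the smallest.
scores = {n: toy_mae(n) for n in [2, 5, 25, 100]}
best_size = min(scores, key=scores.get)

print(scores)
print("Best max_leaf_nodes:", best_size)
```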

***Random Forests***

In [13]:
from sklearn.ensemble import RandomForestRegressor

randomforest_model = RandomForestRegressor()
randomforest_model.fit(train_X, train_y)
iowa_preds = randomforest_model.predict(val_X)
print(mean_absolute_error(val_y, iowa_preds))
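
A random forest trains each tree on a random sample of the rows, so RandomForestRegressor() without a seed can print a slightly different error on each run. Passing random_state makes the result repeatable. A minimal sketch on synthetic data (the data and n_estimators=10 are assumptions chosen to keep it quick):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(100, 3))
toy_y = toy_X[:, 0] * 2 + toy_X[:, 1] + rng.normal(0, 0.5, size=100)

# Two forests with the same seed and the same data.
forest_a = RandomForestRegressor(n_estimators=10, random_state=0).fit(toy_X, toy_y)
forest_b = RandomForestRegressor(n_estimators=10, random_state=0).fit(toy_X, toy_y)

# With a fixed random_state the two forests make identical predictions.
same = np.array_equal(forest_a.predict(toy_X), forest_b.predict(toy_X))
print(same)
```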

***Submitting From A Kernel***

In [15]:
ker_train = pd.read_csv('../input/train.csv')  # read the training data
ker_train_y = ker_train.SalePrice  # target (y)
predictor_columns = ['LotArea', 'OverallQual', 'YearBuilt', 'TotRmsAbvGrd']

ker_train_X = ker_train[predictor_columns]  # training predictors (X)

my_sub_model = RandomForestRegressor()
my_sub_model.fit(ker_train_X, ker_train_y)

In [16]:
ker_test = pd.read_csv('../input/test.csv')  # read the test data
ker_test_X = ker_test[predictor_columns]  # treat the test data like the training data: pull the same columns
predicted_prices = my_sub_model.predict(ker_test_X)  # use the model to make predictions
print(predicted_prices)

In [18]:
my_submission = pd.DataFrame({'Id': ker_test.Id, 'SalePrice': predicted_prices})
# Any filename works; submission.csv is used here.
my_submission.to_csv('submission.csv', index=False)
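
Before submitting, it can be worth reading the file back to confirm that it has exactly the Id and SalePrice columns, one row per test house, and no missing predictions. The index=False argument matters here: without it, pandas writes the row index as an extra unnamed column. A sketch of such a check, using an invented three-row submission written to a temporary directory (the Ids and prices are made up):

```python
import os
import tempfile

import pandas as pd

# Invented submission with three rows.
toy_submission = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [120000.0, 155000.5, 180250.0],
})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)
    check = pd.read_csv(path)

# Column names, row count, and number of missing predictions.
print(list(check.columns))
print(len(check), check["SalePrice"].isnull().sum())
```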