**[Machine Learning Home Page](https://www.kaggle.com/learn/intro-to-machine-learning)**

---


# Introduction
Machine learning competitions are a great way to improve your data science skills and measure your progress. 

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this micro-course.

The steps in this notebook are:
1. Build a Random Forest model with all of your data (**X** and **y**)
2. Read in the "test" data, which doesn't include values for the target.  Predict home values in the test data with your Random Forest model.
3. Submit those predictions to the competition and see your score.
4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.

## Recap
Here's the code you've written so far. Start by running it again.

In [73]:
# Code you have previously used to load data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder

# from learntools.core import *

In [102]:
# read data
# Path of the file to read. We changed the directory structure to simplify submitting to a competition

home_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
home_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = home_data.SalePrice
home_data.drop(['SalePrice'], axis=1, inplace=True)

In [103]:
# Keep only the numeric columns (drop text/object columns)
X = home_data.select_dtypes(exclude=['object'])
X_test = test_data.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [105]:
home_data.shape, X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((1460, 80), (1168, 37), (292, 37), (1168,), (292,))

In [106]:
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64


In [138]:
# Drop every column that has a missing value in the training split
columns_with_missing_values = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(columns=columns_with_missing_values)
reduced_X_valid = X_valid.drop(columns=columns_with_missing_values)
reduced_X_test = X_test.drop(columns=columns_with_missing_values)
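
One thing to watch in the cell above: the list of columns to drop is computed from `X_train` alone, so the test data can still contain missing values in other columns. A minimal sketch with made-up toy frames (the column names `a`, `b`, `c` and all values are illustrative assumptions, not competition data):

```python
import numpy as np
import pandas as pd

# Toy frames; names and values are illustrative assumptions only.
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0], "c": [7.0, 8.0, 9.0]})
toy_test = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 5.0], "c": [7.0, 8.0]})

# Columns with missing values, judged from the training frame only
cols_missing_in_train = [col for col in toy_train.columns if toy_train[col].isnull().any()]

reduced_toy_train = toy_train.drop(columns=cols_missing_in_train)
reduced_toy_test = toy_test.drop(columns=cols_missing_in_train)

# The training frame is now complete, but the test frame still has a gap in "b"
print(cols_missing_in_train)
print(reduced_toy_test.isnull().sum())
```

This is why the later cells in this notebook treat the test data separately, first by dropping rows and then by imputing values.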

In [124]:
def score_dataset(X_train, X_valid, y_train, y_valid, n_estimators=100):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

In [130]:
for i in [50,100,150,200,250,300,350,400,450,500]:
    print(i, score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid, i))

50 17891.72801369863
100 17952.591404109586
150 17689.875502283103
200 17609.882808219176
250 17585.167068493152
300 17651.289098173515
350 17653.844011741683
400 17613.324871575343
450 17586.851491628615
500 17567.52362328767
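
In the sweep above, the gap between the worst score (100 trees) and the best (500 trees) is about 385, roughly 2% of the MAE, so part of the difference may be noise from the 292-row validation set. If you still want to pick the best candidate programmatically, one possible sketch is shown below. It uses synthetic data from `make_regression` as a stand-in, and `demo_score` is a hypothetical helper mirroring `score_dataset`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

def demo_score(n_estimators):
    """Validation MAE of a random forest with the given number of trees."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))

candidates = [10, 25, 50]
scores = {n: demo_score(n) for n in candidates}

# Pick the candidate with the lowest validation MAE
best_n = min(scores, key=scores.get)
print(best_n, scores[best_n])
```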


# Creating a Model For the Competition

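Build a Random Forest model. The exercise asks you to train it on all of **X** and **y**; the cell below fits it on the training split only (`reduced_X_train`, `y_train`) so that the validation MAE can still be reported.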

In [136]:
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(reduced_X_train, y_train)

# Get validation predictions and MAE
preds_valid = model.predict(reduced_X_valid)
print("MAE (Your approach):")
print(mean_absolute_error(y_valid, preds_valid))

MAE (Your approach):
17567.52362328767
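
Once the validation score looks acceptable, a common next step is to refit on every labelled row before predicting on the test set. The sketch below shows one way to recombine the two splits. It runs on synthetic data, and the names `X_all`, `X_full_demo` and `final_model` are illustrative assumptions rather than variables from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the housing data)
rng = np.random.default_rng(0)
X_all = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_all = pd.Series(3.0 * X_all["f1"] + rng.normal(scale=0.1, size=100))

X_tr, X_va, y_tr, y_va = train_test_split(X_all, y_all, train_size=0.8, random_state=0)

# Recombine the two splits so the final model sees every labelled row
X_full_demo = pd.concat([X_tr, X_va])
y_full_demo = pd.concat([y_tr, y_va])

final_model = RandomForestRegressor(n_estimators=50, random_state=0)
final_model.fit(X_full_demo, y_full_demo)
print(X_full_demo.shape, final_model.n_features_in_)
```

The trade-off is that no held-out rows remain to score this final model, so the validation MAE from the earlier fit is the only estimate you have.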


In [141]:
reduced_X_valid.head()

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
529,530,20,32668,6,3,1957,1975,1219,0,816,...,484,0,0,200,0,0,0,0,3,2007
491,492,50,9490,6,7,1941,1950,403,165,238,...,240,0,0,32,0,0,0,0,8,2006
459,460,50,7015,5,4,1950,1950,185,0,524,...,352,0,0,248,0,0,0,0,7,2009
279,280,60,10005,7,5,1977,1977,392,0,768,...,505,288,117,0,0,0,0,0,3,2008
655,656,160,1680,6,5,1971,1971,0,0,525,...,264,0,0,0,0,0,0,0,3,2010


In [144]:
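# Caution: dropping test rows yields fewer predictions than test rows; the imputation cells further down avoid this
reduced_X_test.dropna(axis=0, inplace=True)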

In [149]:
reduced_X_test

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,1461,20,11622,5,6,1961,1961,468.0,144.0,270.0,...,730.0,140,0,0,0,120,0,0,6,2010
1,1462,20,14267,6,6,1958,1958,923.0,0.0,406.0,...,312.0,393,36,0,0,0,0,12500,6,2010
2,1463,60,13830,5,5,1997,1998,791.0,0.0,137.0,...,482.0,212,34,0,0,0,0,0,3,2010
3,1464,60,9978,6,6,1998,1998,602.0,0.0,324.0,...,470.0,360,36,0,0,0,0,0,6,2010
4,1465,120,5005,8,5,1992,1992,263.0,0.0,1017.0,...,506.0,0,82,0,0,144,0,0,1,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,1936,4,7,1970,1970,0.0,0.0,546.0,...,0.0,0,0,0,0,0,0,0,6,2006
1455,2916,160,1894,4,5,1970,1970,252.0,0.0,294.0,...,286.0,0,24,0,0,0,0,0,4,2006
1456,2917,20,20000,5,7,1960,1996,1224.0,0.0,0.0,...,576.0,474,0,0,0,0,0,0,9,2006
1457,2918,85,10441,5,5,1992,1992,337.0,0.0,575.0,...,0.0,80,32,0,0,0,0,700,7,2006


In [146]:
preds_test = model.predict(reduced_X_test)
print(preds_test)

[126588.87  154983.48  184385.732 ... 157689.148 107592.612 234285.978]


In [190]:
test_data = pd.read_csv('data/test.csv')
X_test = test_data.select_dtypes(exclude=['object'])
print(X_test.shape[0])

1459


In [186]:
print(len(X_test.columns))

34


In [185]:
X_test = X_test[reduced_X_train.columns]

In [187]:
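from sklearn.impute import SimpleImputer
# Median-impute the test features, keeping column names and index
final_imputer = SimpleImputer(strategy='median')
final_X_test = pd.DataFrame(final_imputer.fit_transform(X_test),
                            columns=X_test.columns, index=X_test.index)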

In [188]:
final_X_test.shape[0]

1459
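
The imputation cell above learns its medians from the test data itself. A common alternative is to fit the imputer on the training features and reuse those medians on the test features, so that both sets are filled with the same values. A minimal sketch with toy frames (the column names `area` and `rooms` and all values are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames; names and values are illustrative assumptions only.
toy_train = pd.DataFrame({"area": [100.0, 200.0, 300.0], "rooms": [2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"area": [np.nan, 250.0], "rooms": [3.0, np.nan]})

# Learn the per-column medians from the training frame only
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)

# transform() returns a NumPy array, so restore the column names and index
imputed_toy_test = pd.DataFrame(
    imputer.transform(toy_test),
    columns=toy_test.columns,
    index=toy_test.index,
)
print(imputed_toy_test)
```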

In [189]:
# Make the predictions that we will submit
test_preds = model.predict(final_X_test)

# Save the predictions in the format used for competition scoring

output = pd.DataFrame({'Id': X_test.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)

In [11]:
output.head()
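
Before committing, it can help to sanity-check the submission frame. The helper below, `check_submission`, is a hypothetical function written for this sketch and is not part of pandas or Kaggle tooling. It assumes the submission should have exactly the `Id` and `SalePrice` columns written above, one row per test row, and no missing values:

```python
import pandas as pd

def check_submission(frame, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(frame.columns) != ["Id", "SalePrice"]:
        problems.append("columns should be exactly ['Id', 'SalePrice']")
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    if frame.isnull().any().any():
        problems.append("frame contains missing values")
    if "Id" in frame.columns and frame["Id"].duplicated().any():
        problems.append("duplicate Id values")
    return problems

# Toy submission (illustrative values only)
toy_output = pd.DataFrame({"Id": [1461, 1462, 1463],
                           "SalePrice": [120000.0, 150000.0, 180000.0]})
print(check_submission(toy_output, expected_rows=3))
```

For this notebook the call would be `check_submission(output, expected_rows=1459)`, where 1459 is the test row count printed earlier.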

# Test Your Work
After filling in the code above:
1. Click the **Commit** button.
2. After your code has finished running, click the **Open Version** button. This brings you into "viewer mode" for your notebook. You will need to scroll down to get back to these instructions.
3. Click the **Output** button on the left of your screen. This will bring you to a part of the screen that looks like this:

   ![](https://imgur.com/a/QRHL7Uv)

4. Select the button to submit and you will see your score. You have now successfully submitted to the competition.
5. If you want to keep working to improve your model, select the edit button. Then you can change your model and repeat the process to submit again. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.

# Continuing Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  Look at the list of columns and think about what might affect home prices.  Some features will cause errors because of issues like missing values or non-numeric data types. 
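
One way to check whether an added feature helps is to score the same model on two feature lists and compare the validation MAE. The sketch below does this on synthetic data. The columns `size` and `age`, the price formula, and the helper `mae_for` are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the housing data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"size": rng.uniform(50, 250, 300), "age": rng.uniform(0, 80, 300)})
demo_price = 1000 * demo["size"] - 500 * demo["age"] + rng.normal(0, 5000, 300)

def mae_for(features):
    """Validation MAE of a random forest trained on the given feature columns."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        demo[features], demo_price, train_size=0.8, random_state=0
    )
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))

# Score the baseline feature list, then the list with one feature added
results = {tuple(f): mae_for(f) for f in (["size"], ["size", "age"])}
for features, mae in results.items():
    print(features, mae)
```

With the real data, swap in the numeric columns you want to try and keep the split and `random_state` fixed so the scores stay comparable.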

The [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) micro-course will teach you how to handle these types of features. You will also learn to use **xgboost**, a gradient boosting library that often gives better accuracy than Random Forest on tabular data.


# Other Micro-Courses
The **[Pandas Micro-Course](https://kaggle.com/Learn/Pandas)** will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/Deep-Learning)** micro-course, where you will build models for computer vision tasks that can reach better-than-human performance.

---

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*