# **Housing Prices Competition**

***Start here if...***

You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. 

***Competition Description***


Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

***Practice Skills***

* Creative feature engineering
* Advanced regression techniques like random forest and gradient boosting

***Acknowledgments***

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
import matplotlib.pyplot as plt  # pandas and numpy are already imported in the first cell


In [3]:
train=pd.read_csv('/kaggle/input/home-data-for-ml-course/train.csv')
test=pd.read_csv('/kaggle/input/home-data-for-ml-course/test.csv')
train.head()


## Finding the Data-type of each column


In [None]:
train.info()


In [None]:
test.info()


## Finding the Percent of null values in each column


In [None]:
train.isnull().sum()/train.shape[0] * 100


In [None]:
test.isnull().sum()/test.shape[0] * 100
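
With 79 explanatory variables, the full percentage listing is long. A minimal sketch of one way to show only the columns that have missing values, largest share first. The `toy` frame and its values are made up for illustration and stand in for `train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; the values are invented for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() is the fraction of nulls per column; times 100 gives a percent.
null_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have missing values, largest share first.
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct)
```

`isnull().mean() * 100` gives the same numbers as `isnull().sum() / shape[0] * 100`.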

## Finding the columns in each dataset


In [None]:
train.columns


In [None]:
test.columns


## Saving the Id column and dropping it from the test set


In [None]:
ID_train=train['Id']
ID_test=test['Id']
test=test.drop(columns=['Id'])  # train keeps 'Id' until X is built
train.head()


## Finding Numerical & Categorical Features (to be treated separately later)
### This uses a list comprehension, which builds a list from the items that satisfy a condition


In [None]:
cat_train=[col for col in train.columns if train[col].dtype=='object']
num_train=[col for col in train.columns if train[col].dtype!='object']
# cat_train
num_train

In [None]:
cat_test=[col for col in test.columns if test[col].dtype=='object']
num_test=[col for col in test.columns if test[col].dtype!='object']
# cat_test
num_test
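
The same split can also be written with `DataFrame.select_dtypes`. This is a sketch of an alternative, not the notebook's method. It selects on "numeric or not" rather than on the `object` dtype, and the `toy` frame is invented for illustration.

```python
import pandas as pd

# Toy frame standing in for train; the values are invented for illustration.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
})

# Numeric columns (ints and floats) versus everything else.
num_alt = toy.select_dtypes(include="number").columns.tolist()
cat_alt = toy.select_dtypes(exclude="number").columns.tolist()

print(num_alt)
print(cat_alt)
```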

## Finding the following Features (to be treated separately later)
### Again using list comprehensions, the numerical columns are split into:
### * Continuous Features
### * Discrete Features
### * Year Features


In [None]:
con_train =[col for col in num_train if train[col].nunique()>25]
dis_train =[col for col in num_train if train[col].nunique()<=25]  # <= so a column with exactly 25 unique values is not skipped
yea_train =[col for col in train.columns if 'Yr' in col or 'Year' in  col or 'yr' in  col or 'YR' in  col]

# con_train
# dis_train
# yea_train

In [None]:
con_test =[col for col in num_test if test[col].nunique()>25]
dis_test =[col for col in num_test if test[col].nunique()<=25]  # <= so a column with exactly 25 unique values is not skipped
yea_test =[col for col in test.columns if 'Yr' in col or 'Year' in  col or 'yr' in  col or 'YR' in  col]
# con_test
# dis_test
yea_test
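
The cardinality split can be wrapped in a small function so the threshold is stated once and every column lands in exactly one list. A minimal sketch: `split_by_cardinality` is a hypothetical helper (not part of pandas), and the `toy` frame is invented for illustration.

```python
import pandas as pd

def split_by_cardinality(df, columns, threshold=25):
    """Hypothetical helper: split columns into (continuous, discrete) by unique count."""
    continuous = [c for c in columns if df[c].nunique() > threshold]
    discrete = [c for c in columns if df[c].nunique() <= threshold]
    return continuous, discrete

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "LotArea": list(range(100)),                  # 100 unique values
    "OverallQual": [i % 10 for i in range(100)],  # 10 unique values
    "Exactly25": [i % 25 for i in range(100)],    # 25 unique values, the boundary case
})

con, dis = split_by_cardinality(toy, toy.columns)
print(con)
print(dis)
```

Using `>` on one side and `<=` on the other means a column with exactly `threshold` unique values is still assigned to a list.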

## Imputing the missing values
### Missing values are one of the most common problems you encounter when preparing data for machine learning. They can come from human error, interruptions in the data flow, privacy concerns, and so on. Whatever the reason, missing values affect the performance of machine learning models.


In [None]:
from sklearn.impute import SimpleImputer
nsi = SimpleImputer(strategy='mean')  # For Numerical Features, will replace MISSING NUMERIC values with MEAN
csi = SimpleImputer(strategy='most_frequent')  # For Categorical Features, will replace MISSING CATEGORICAL values with MOST FREQUENT value

train[cat_train] = csi.fit_transform(train[cat_train])
train[con_train] = nsi.fit_transform(train[con_train])
train[dis_train] = nsi.fit_transform(train[dis_train])

train.head()

In [None]:
test[cat_test] = csi.fit_transform(test[cat_test])
test[con_test] = nsi.fit_transform(test[con_test])
test[dis_test] = nsi.fit_transform(test[dis_test])

test.head()
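
A common alternative is to learn the fill values from the training data only and reuse them on the test data, so that both sets are filled with the same statistics. A minimal sketch, assuming both frames share the same columns. The toy frames and values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train and test; the values are invented for illustration.
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                          "MasVnrArea": [0.0, 100.0, np.nan]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 70.0],
                         "MasVnrArea": [np.nan, 50.0]})

cols = ["LotFrontage", "MasVnrArea"]
imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training data and fills its gaps.
toy_train[cols] = imputer.fit_transform(toy_train[cols])

# transform reuses the training means; nothing is learned from the test data.
toy_test[cols] = imputer.transform(toy_test[cols])

print(imputer.statistics_)  # the per-column means learned from toy_train
print(toy_test)
```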

## Apply Log Transform on Continuous Data only (currently commented out)

In [None]:
# train[con_train]=np.log(train[con_train])
# test[con_test]= np.log(test[con_test])
train.head()
# test.head()
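
One thing to watch for if the log transform is enabled: `np.log` returns `-inf` for zeros, and many area columns in housing data can contain zeros. Whether that is why the lines above are commented out is an assumption. A minimal sketch of the difference, using `np.log1p` (log of 1 + x) on an invented `toy` frame:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({"LotArea": [8450.0, 9600.0], "MasVnrArea": [0.0, 196.0]})

# Plain log maps 0 to -inf (the warning is silenced here on purpose).
with np.errstate(divide="ignore"):
    plain_log = np.log(toy)

# log1p computes log(1 + x), so 0 maps to 0 instead of -inf.
safe_log = np.log1p(toy)

# expm1 is the inverse of log1p and recovers the original values.
restored = np.expm1(safe_log)

print(plain_log)
print(safe_log)
```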

## Transforming Dates
### Converting the year columns into ages (the number of years elapsed since each date) expresses the information in a form that machine learning algorithms can use more easily.


In [None]:
from datetime import date
train[yea_train]=date.today().year - train[yea_train]
test[yea_test]=date.today().year - test[yea_test]
train.head()
# test.head()
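
Because `date.today().year` changes over time, the resulting ages depend on when the notebook is run. A minimal sketch of the same subtraction against a fixed year. `REFERENCE_YEAR` is a hypothetical constant chosen only for illustration, and the `toy` frame is invented.

```python
import pandas as pd

REFERENCE_YEAR = 2011  # hypothetical fixed reference year, for illustration only

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({"YearBuilt": [2003, 1976], "YrSold": [2008, 2007]})

# Same column-name filter idea as above: pick the columns that hold years.
year_cols = [c for c in toy.columns if "Yr" in c or "Year" in c]

# Subtracting from a fixed year gives the same ages on every run.
ages = REFERENCE_YEAR - toy[year_cols]
print(ages)
```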

## Standardizing the Discrete Values.
### Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).


In [None]:
from sklearn.preprocessing import StandardScaler
ss= StandardScaler()
train[dis_train]= ss.fit_transform(train[dis_train])
test[dis_test]= ss.fit_transform(test[dis_test])
train.head()
# test.head()
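
Standardization replaces each value x with (x - mean) / std. A minimal sketch of fitting the scaler on training data and reusing the learned mean and standard deviation on test data, a common alternative to fitting on each set separately. The toy frames are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for train and test; the values are invented for illustration.
toy_train = pd.DataFrame({"OverallQual": [4.0, 6.0, 8.0]})
toy_test = pd.DataFrame({"OverallQual": [6.0, 10.0]})

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data.
train_scaled = scaler.fit_transform(toy_train)

# transform applies the training mean/std to the test data without refitting.
test_scaled = scaler.transform(toy_test)

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has mean 0 and unit variance. The test column generally does not, because it is expressed on the training scale.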

## Handling Categorical Data using Get_Dummies()
### Machine learning models require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.


In [None]:
train1= pd.get_dummies(train, columns=cat_train, drop_first= True)
test1= pd.get_dummies(test, columns=cat_test, drop_first= True)
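
When train and test are encoded separately, the two results can end up with different dummy columns if a category appears in only one of them. A minimal sketch of one way to line the test columns up with the training columns, using `DataFrame.reindex`. The toy frames are invented for illustration.

```python
import pandas as pd

# Toy frames; "FV" and "RM" appear only in train, "RH" only in test.
toy_train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV"], "LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"MSZoning": ["RL", "RH"], "LotArea": [7000, 12000]})

# Non-encoded columns such as LotArea are kept; only MSZoning is replaced by dummies.
train_enc = pd.get_dummies(toy_train, columns=["MSZoning"], drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=["MSZoning"], drop_first=True)

# Reindex test onto the training columns: columns missing in test are filled
# with 0, and columns that exist only in test are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_aligned)
```

Note that `drop_first=True` drops a different baseline category in each frame here (`FV` in train, `RH` in test), so after alignment a category unseen in training is indistinguishable from the training baseline.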


## Keeping the Encoded Datasets
### `pd.get_dummies(df, columns=...)` returns the whole DataFrame with the listed columns replaced by their dummies, so the numeric columns are already included. Concatenating it with the original would duplicate every numeric column, including `SalePrice`.


In [None]:
# train1/test1 already hold the numeric columns plus the dummies, so no concat or drop is needed
train=train1
test=test1
train.head()
# test.head()

In [None]:
train=train.dropna(axis=0,how='any') # Safety net: drop any row that still contains a null after imputation
test=test.dropna(axis=0,how='any') 
train.head()

## Splitting X & y


In [None]:
y=train['SalePrice']

X=train.drop(['Id','SalePrice'],axis=1)
y.head()

## Doing the Train_Test_Split


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
y_test.head()
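
Without a seed, `train_test_split` shuffles differently on every run, so the score changes from run to run. A minimal sketch showing that passing `random_state` makes the split repeatable. The toy arrays and the seed value 0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each; the values are arbitrary.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same random_state yields the same shuffle, hence identical splits.
split_a = train_test_split(X_toy, y_toy, test_size=0.20, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.20, random_state=0)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(len(split_a[0]), len(split_a[1]))  # sizes of the train and test parts
```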

## Using GBoost to fit the Data


In [None]:
from sklearn.ensemble import GradientBoostingRegressor
reg=GradientBoostingRegressor()
reg.fit(X_train,y_train)


## Using the Trained Model to Predict


In [None]:
predict= reg.predict(X_test)
# predict

## Scoring the Trained Model


In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, predict)  # signature is r2_score(y_true, y_pred); R^2 is not symmetric
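
R² is defined as 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the true values, so the argument order of `r2_score(y_true, y_pred)` matters. A minimal sketch that computes the score by hand and shows the effect of swapping the arguments. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented true values and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([150.0, 200.0, 300.0, 350.0])

# Residual sum of squares and total sum of squares around the mean of y_true.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual = 1 - ss_res / ss_tot

correct = r2_score(y_true, y_pred)   # matches the manual computation
swapped = r2_score(y_pred, y_true)   # treats the predictions as ground truth

print(manual, correct, swapped)
```

With swapped arguments, the total sum of squares is taken around the mean of the predictions, which gives a different number.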

# Some ways you can show your support
### Kaggle - Follow me on Kaggle
### Twitter - https://twitter.com/KumarPython
### LinkedIn - https://www.linkedin.com/in/kumarpython/


In [4]:
 ! pip install -q kaggle

In [None]:
from google.colab import files

files.upload()