# DATA DESCRIPTION

## Here's a brief version of what you'll find in the data description file.

* SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month sold
* YrSold: Year sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

# Importing libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import neighbors
from math import sqrt
%matplotlib inline

# Loading Train and Test datasets

In [None]:
df1= pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df1

In [None]:
valid = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
valid

# Exploratory Data Analysis (EDA)

In [None]:
df1.head()

In [None]:
df1.describe()

In [None]:
df1.info()

In [None]:
df1.isnull().sum()

In [None]:
df1.isnull().mean()

### Missing values percentage

In [None]:
def missing(df):
    """Return the count and percentage of missing values per column, highest first."""
    missing_number = df.isnull().sum().sort_values(ascending=False)
    # isnull().mean() is the fraction of missing rows in each column
    missing_percent = (df.isnull().mean() * 100).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
    return missing_values

In [None]:
missing(df1)

### Dropping the columns in which more than 40% of the values are null

In [None]:
for col in df1.columns:
    if df1[col].isnull().mean()*100>40:
        df1.drop(col,axis=1,inplace=True)
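
The same rule can also be written as a single boolean column mask instead of a loop. The sketch below uses a small made-up frame (the values and the `toy` name are illustrative only, not taken from the competition data) to show which columns a 40% cutoff keeps and which it drops.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: "A" is 75% missing, "B" 25%, "C" 0%.
toy = pd.DataFrame({
    "A": [np.nan, np.nan, np.nan, 1.0],
    "B": [1.0, np.nan, 3.0, 4.0],
    "C": [1, 2, 3, 4],
})

threshold = 40  # percent, the same cutoff as in the loop above
missing_pct = toy.isnull().mean() * 100

# Keep only the columns at or below the cutoff; the original frame is untouched.
kept = toy.loc[:, missing_pct <= threshold]
dropped = missing_pct[missing_pct > threshold].index.tolist()

print("dropped:", dropped)
print("kept:", kept.columns.tolist())
```

Keeping the list of dropped column names makes it possible to drop exactly the same columns from another frame later.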

In [None]:
df1

In [None]:
df1.columns

In [None]:
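# Pass the values as x explicitly; recent seaborn versions treat the first positional argument as data
sns.countplot(x=df1.dtypes.map(str))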
plt.show()

In [None]:
df1.dtypes.value_counts()

## Year-based filling of null values in the train dataset

In [None]:
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.mode().iloc[0]
df1 = df1.fillna(df1.groupby('YrSold').transform(f))
df1
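
To make the group-wise fill concrete, here is a minimal sketch on a made-up frame (the values are illustrative, and `fill_value` is a hypothetical helper name). Within each `YrSold` group, numeric gaps get the group median and text gaps get the group mode. It fills column by column rather than transforming the whole frame at once, which is a variation on the cell above, not the same call.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up sample: two sale years, one numeric and one text column with gaps.
toy = pd.DataFrame({
    "YrSold": [2008, 2008, 2008, 2009, 2009, 2009],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, 70.0, np.nan],
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA", None, "FuseA"],
})

def fill_value(col):
    # Median for numeric columns, most frequent value otherwise.
    return col.median() if is_numeric_dtype(col) else col.mode().iloc[0]

filled = toy.copy()
for name in toy.columns.drop("YrSold"):
    # transform broadcasts each group's fill value back to the group's rows
    per_group = toy.groupby("YrSold")[name].transform(fill_value)
    filled[name] = toy[name].fillna(per_group)

print(filled)
```

Note that `mode().iloc[0]` assumes every group has at least one non-null value in the column; a group that is entirely null would need separate handling.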

### Finding q1, q2, q3, mean, median, mode, skewness, kurtosis

In [None]:
for col in df1.columns:
    if df1[col].dtypes != object:
        q1 = df1[col].quantile(0.25)
        q2 = df1[col].quantile(0.50)
        q3 = df1[col].quantile(0.75)
        IQR = q3 - q1
        llp = q1-1.5*IQR
        ulp = q3+1.5*IQR
        print('column name',col)
        print('q1',q1)
        print('q2',q2)
        print('q3',q3)
        print('IQR',IQR)
        print('llp',llp)
        print('ulp',ulp)
        print('mean:',df1[col].mean())
        print('median:',df1[col].median())
        print('mode',df1[col].mode()[0])
        print('skewness:',df1[col].skew())
        print('kurtosis:',df1[col].kurtosis())
        print('std',df1[col].std())
        print('max',df1[col].max())
        print('min',df1[col].min())
        print('null_value count:',df1[col].isnull().sum())
        print('\n')

In [None]:
df1.dtypes

In [None]:
df1['MSZoning'].unique()

In [None]:
df1['RoofMatl'].unique()

In [None]:
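# Restrict to numeric columns: quantiles are not defined for the object columns
num = df1.select_dtypes(include='number')
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
print('outlier count of each column')
((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).sum()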
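
As a worked illustration of the 1.5 × IQR rule used above, the sketch below counts outliers in a made-up five-row frame (the values are invented for the example). For `[1, 2, 3, 4, 100]` the quartiles are 2 and 4, so the fences are -1 and 7 and only 100 falls outside.

```python
import pandas as pd

# Made-up data: one numeric column with an obvious outlier, one text column.
toy = pd.DataFrame({
    "LotArea": [1, 2, 3, 4, 100],
    "Street": ["Pave"] * 5,
})

# Only numeric columns take part in the quantile computation.
num = toy.select_dtypes(include="number")
q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower fence per column
upper = q3 + 1.5 * iqr   # upper fence per column

# Comparing the frame with the per-column fences aligns on column names.
outlier_counts = ((num < lower) | (num > upper)).sum()
print(outlier_counts)
```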

# Data visualizations

### Data visualisations using distribution plots and box plots, because they show how the data is distributed and whether there are any outliers

In [None]:
count = 1
plt.figure(figsize=(30, 25))
for i in df1.columns:
    if df1[i].dtypes != 'object':
        plt.subplot(6, 7, count)
        # histplot with a KDE overlay replaces distplot, which seaborn has deprecated
        sns.histplot(x=df1[i], kde=True)
        count += 1

plt.show()

In [None]:
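count = 1
plt.figure(figsize=(30, 25))
for i in df1.columns:
    if df1[i].dtypes != 'object':
        plt.subplot(6, 7, count)
        # Pass the column as x explicitly so the box is drawn horizontally in every seaborn version
        sns.boxplot(x=df1[i])
        count += 1

plt.show()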

In [None]:
df1.dtypes

In [None]:
%pip install autoviz

In [None]:
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df_av = AV.AutoViz('../input/house-prices-advanced-regression-techniques/train.csv')

# Label encoding the train dataset

In [None]:
le=LabelEncoder()
for col in df1.columns:
    if df1[col].dtypes == object:
        df1[col]= le.fit_transform(df1[col])

# Feature Selection

In [None]:
X=df1.drop('SalePrice',axis=1)
y=df1['SalePrice']

# Train-test split

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)

# Accuracies of different algorithms applied

In [None]:
def train_models(X_train, y_train, X_test, y_test):
    """Fit several regressors and print their scores on the held-out test split."""
    # Cap max_features at the number of available columns (the original setting was 75)
    n_feat = min(75, X_train.shape[1])

    models = {
        'Decision Tree': DecisionTreeRegressor(max_features=n_feat, max_depth=4, random_state=0),
        'RandomForestRegressor': RandomForestRegressor(n_estimators=100, max_features=n_feat, random_state=0),
        'Support Vector Regression (rbf)': SVR(kernel='rbf'),
        'Support Vector Regression (linear)': SVR(kernel='linear'),
        'KNN': neighbors.KNeighborsRegressor(),
    }

    for i, (name, model) in enumerate(models.items(), start=1):
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)

        # These are test-set scores: every metric is computed on X_test, not on the training data
        print(f'[{i}] {name} test R2 score:', r2_score(y_test, y_pred))
        print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
        print('Mean Square Error:', mse)
        print('Root Mean Square Error:', np.sqrt(mse))
        print()

In [None]:
train_models(X_train, y_train, X_test, y_test)
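
For reference, the four numbers printed for each model can be reproduced by hand. The sketch below uses three made-up target/prediction pairs (not values from this dataset) and checks the manual formulas against the scikit-learn functions imported at the top of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_pred                     # [1, 0, -2]
mae = np.mean(np.abs(residuals))                # (1 + 0 + 2) / 3
mse = np.mean(residuals ** 2)                   # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)

# R2 = 1 - SS_res / SS_tot, where SS_tot measures spread around the mean of y_true
ss_res = np.sum(residuals ** 2)                 # 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 4 + 0 + 4 = 8
r2 = 1 - ss_res / ss_tot                        # 0.375

print(mae, mse, rmse, r2)
print(mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```

R2 is a goodness-of-fit score rather than a classification accuracy: it is at most 1 and becomes negative when a model does worse than always predicting the mean.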

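### The user-defined function above is for comparison only: it prints scores but does not return the fitted models,
### so it cannot be used to predict on the validation dataset.
### The chosen algorithm therefore has to be fitted separately. Random forest gets good accuracy in the comparison above; note that the cells below fit a multiple linear regression model and use it for the validation predictions.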

# Multiple Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()  
mlr.fit(X_train, y_train)

In [None]:
y_pred_mlr= mlr.predict(X_test)
y_pred_mlr

In [None]:
r2_mlr =r2_score(y_test,y_pred_mlr)
print('r2_score:',r2_mlr*100)

# Validation dataset

### The validation dataset has to go through the same EDA and preprocessing steps as the train dataset

In [None]:
valid

In [None]:
missing(valid)

## Dropping columns with more than 40% null values from the validation dataset

In [None]:
for col in valid.columns:
    if valid[col].isnull().mean()*100>40:
        valid.drop(col,axis=1,inplace=True)

In [None]:
valid

In [None]:
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.mode().iloc[0]
valid = valid.fillna(valid.groupby('YrSold').transform(f))
valid

In [None]:
valid.columns

In [None]:
le=LabelEncoder()
for col in valid.columns:
    if valid[col].dtypes == 'object':
        valid[col]= le.fit_transform(valid[col])
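
One caveat with refitting `LabelEncoder` on the validation set: the same category can receive a different integer than it did in the train set, because the codes depend on which categories are present. A possible alternative, sketched below on made-up data, is to fit scikit-learn's `OrdinalEncoder` once on the train column and reuse it. This is a swapped-in technique, not what the cells above do, and the `unknown_value=-1` choice is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up train and validation columns; "FV" never appears in the train data.
train = pd.DataFrame({"MSZoning": ["RL", "RM", "RH", "RL"]})
valid_toy = pd.DataFrame({"MSZoning": ["RM", "FV", "RL"]})

# Fit on the train data only; categories unseen at fit time are mapped to -1.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train[["MSZoning"]]).ravel()

# Reuse the fitted encoder so each category keeps its train-time code.
valid_codes = enc.transform(valid_toy[["MSZoning"]]).ravel()

print(enc.categories_[0])   # sorted train categories
print(train_codes)
print(valid_codes)
```

With this approach "RL" and "RM" map to the same codes in both frames, and the unseen "FV" is flagged instead of silently shifting the other codes.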

In [None]:
valid['MSZoning'].value_counts()

In [None]:
valid

In [None]:
y_valid = mlr.predict(valid)

## Validation data prediction

In [None]:
y_valid

In [None]:
output = pd.DataFrame({"Id": valid['Id'],"SalePrice": y_valid})
output

In [None]:
# Save the output
output.to_csv("submission5.csv", index=False)
output.head(10)
