**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/machine-learning-competitions).**

---


# Introduction


We have a dataset with information about houses that have been sold in the past. The goal is to predict the prices of new houses from the many features that each house has. For that reason:

I will use the <b> sales price </b> as the target variable (*).

(*) "_This playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence._"

In [None]:
# Import the libraries
import pandas as pd
import seaborn as sns
import numpy as np
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt

In [None]:
# save filepath to variable for easier access
iowa_file_path = '../input/train.csv'

In [None]:
# read the data and store data in DataFrame titled iowa_data
iowa_data = pd.read_csv(iowa_file_path) 

In [None]:
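# separate the target from the features
target = iowa_data.SalePrice
# drop() returns a new DataFrame; with inplace=True it returns None and removes SalePrice from iowa_data
train = iowa_data.drop(['SalePrice'], axis=1)
print("iowa size is : {}".format(iowa_data.shape))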


In [None]:
# print first rows of the data in Iowa data
iowa_data.head()

In [None]:
iowa_data.info()

In [None]:
# let's see how the distribution of the target looks
iowa_data.SalePrice.hist()

In [None]:
# Improve the visualization of the target distribution
fig, ax = plt.subplots(figsize=(10,5))

# histplot replaces the deprecated distplot (seaborn 0.11 and later)
sns.histplot(iowa_data.SalePrice, bins=30, kde=True, ax=ax, color='blue')
plt.title('Housing Sale Price Histogram', fontsize=15)
plt.xlabel('Sale Price, $', fontsize=12);

<div>
    <section> 
        <article>
            <h2><b> Comments: </b></h2>
            <p> The target variable is not normally distributed.</p><p> We therefore can't reason in terms of standard-deviation distances from the mean, as we usually do with a Gaussian distribution. </p>
        </article>
    </section>
</div>
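
One way to put a number on "not normal" is the sample skewness, which is close to 0 for a symmetric distribution. The sketch below uses synthetic log-normal values as a stand-in for prices (not the competition data), and the `log1p` transform is an assumption that this notebook does not apply anywhere else.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (log-normal); these are NOT the competition data
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

skew_raw = prices.skew()            # sample skewness of the raw values
skew_log = np.log1p(prices).skew()  # skewness after a log(1 + x) transform

print("skewness (raw):  ", round(skew_raw, 3))
print("skewness (log1p):", round(skew_log, 3))
```

On the real data the same two calls could be applied to `iowa_data.SalePrice`.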

In [None]:
# Statistics summary of the data in Iowa data
iowa_data.describe()

<div>
    <section> 
        <article>
            <h2><b> Comments: </b></h2>
            <p> Let's use these summary statistics from iowa_data.describe() to look at Area, Age, and Price. </p>
        </article>
    </section>
</div>

In [None]:
# Area
print('|---------------------AREA (ft^2)-------|')
max_lot_size = iowa_data.LotArea.max()
print('\tmax_lot_size: ',max_lot_size)
min_lot_size = iowa_data.LotArea.min()
print('\tmin_lot_size: ',min_lot_size)
avg_lot_size = iowa_data.LotArea.mean()
avg_lot_size = round(avg_lot_size, 2)
print('\tavg_lot_size: ',avg_lot_size)

print('|---------------------AGE(year)---------|')
newest_home = iowa_data.YearBuilt.max()
print('\tnewest_home:  ', newest_home)
oldest_home = iowa_data.YearBuilt.min()
print('\toldest_home:  ', oldest_home)
range_year_built = newest_home - oldest_home
print('\trange_year_built:  ', range_year_built, 'years')

print('|---------------------PRICE(usd)--------|')
max_price_sale = iowa_data.SalePrice.max()
print('\tmax_price_sale: ', max_price_sale)
min_price_sale = iowa_data.SalePrice.min()
print('\tmin_price_sale: ', min_price_sale)
avg_price_sale = iowa_data.SalePrice.mean()
avg_price_sale = round(avg_price_sale, 2)
print('\tavg_price_sale: ', avg_price_sale)


<div>
    <section> 
        <article>
            <h2><b> Comments: </b></h2>
            <p> It seems logical that there is a relationship between Area and Price 
            (for now I treat Age as a constant). My hypothesis, based on a common-sense assumption, is that there may be a linear relationship
            between Area and Price.
            </p>
        </article>
    </section>
</div>
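
A quick numeric check of a linearity hypothesis is the Pearson correlation together with a straight-line fit. The sketch below runs on synthetic data; the variables `area` and `price` and the coefficients used to generate them are hypothetical and only illustrate the calls.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical lot areas (ft^2) and prices with a linear trend plus noise
area = rng.uniform(1_000, 20_000, size=200)
price = 50_000 + 8.0 * area + rng.normal(0, 20_000, size=200)

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship
r = np.corrcoef(area, price)[0, 1]

# Least-squares straight line: price ~ slope * area + intercept
slope, intercept = np.polyfit(area, price, deg=1)

print("Pearson r:", round(r, 3))
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```

A low correlation on the real columns would be consistent with the scatter plot not looking linear.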

In [None]:
# Plot a scatter visualization of the variables
fig, ax = plt.subplots(figsize=(10,5))

# recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=iowa_data.LotArea, y=iowa_data.SalePrice, ax=ax, color='blue')
plt.title('Housing Sale Price vs Lot Area', fontsize=15)
plt.ylabel('Sale Price, $', fontsize=12);
plt.xlabel('Lot Area, ft^2', fontsize=12);
plt.show()

<div>
    <section> 
        <article>
            <h2><b> Comments: </b></h2>
            <p> The plot suggests that the relationship is probably not linear. Let's do the same for Year Built and Price.
            </p>
        </article>
    </section>
</div>

In [None]:
# Plot a scatter visualization of the variables
fig, ax = plt.subplots(figsize=(10,5))

# recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=iowa_data.YearBuilt, y=iowa_data.SalePrice, ax=ax, color='blue')
plt.title('Housing Sale Price vs Year Built', fontsize=15)
plt.ylabel('Sale Price, $', fontsize=12);
plt.xlabel('Year Built', fontsize=12);
plt.show()

<div>
    <section> 
        <article>
            <h2><b> Comments: </b></h2>
            <p> Neither of my first two assumptions, a linear relationship of Price with area and with year built, was confirmed by this first graphical approach.
             Now I want to observe these three variables together in a 3D plot for a more consistent representation.
            </p>
        </article>
    </section>
</div>

In [None]:
# Creating dataset
x = iowa_data.LotArea
y = iowa_data.SalePrice
z = iowa_data.YearBuilt

# Creating color map
my_cmap = plt.get_cmap('hsv')

# Creating figure
fig = plt.figure(figsize = (20, 20))
ax = plt.axes(projection ="3d")

# Creating plot: pass c so that the color map and the colorbar reflect the sale price
sctt = ax.scatter3D(x, y, z, c = y, alpha = 0.8, cmap = my_cmap, marker ='^')

plt.title("3D scatter plot")
ax.set_xlabel('Lot Area', fontweight ='bold')
ax.set_ylabel('Price', fontweight ='bold')
ax.set_zlabel('Year Built', fontweight ='bold')
fig.colorbar(sctt, ax = ax, shrink = 0.5, aspect = 5, label = 'Sale Price')

# show plot
plt.show()

<h1>FEATURE ENGINEERING</h1>

In [None]:
# A quick view of the missing values per column (LotFrontage among them)
iowa_data.isnull().sum()

 <div>
    <section> 
        <article>
            <h2><b> Comments: </b></h2>
            <p> We have 259 missing values in the LotFrontage variable,
            and missing values in other variables as well. Common sense suggests
            that many of these features have little commercial relevance,
            which means they are probably not key variables.
            </p>
        </article>
    </section>
</div>
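
The idea of setting aside features that are mostly empty can be sketched with a missing-value percentage and a cutoff. The tiny DataFrame and the `threshold` of 50% below are hypothetical; the notebook itself does not fix a cutoff.

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as the housing data (values are illustrative)
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100

threshold = 50.0  # hypothetical cutoff, not taken from the notebook
sparse_cols = missing_pct[missing_pct > threshold].index.tolist()

print(missing_pct)
print("columns above the cutoff:", sparse_cols)
```

Whether a sparse column is dropped or filled is a separate decision; for columns such as `PoolQC` a missing value may simply mean that the house has no pool.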

In [None]:
iowa_data.SalePrice.isnull().sum()

<div>
    <section> 
        <article>
            <h2><b> Comments: </b></h2>
            <p> The target variable has no missing values.</p>
        </article>
    </section>
</div>

In [None]:
iowa_data_na = (iowa_data.isnull().sum() / len(iowa_data)) * 100
iowa_data_na = iowa_data_na.drop(iowa_data_na[iowa_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :iowa_data_na})
missing_data.head(20)

In [None]:
figure, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=iowa_data_na.index, y=iowa_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

In [None]:
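# Correlation map to see how the numeric features are correlated with SalePrice
# numeric_only=True skips the text columns (required by pandas 2.x, available since 1.5)
corrmat = iowa_data.corr(numeric_only=True)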
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)

<div>
    <section> 
        <article>
            <h2><b> Comments: </b></h2>
            <p> The heat map is useful and gave me an idea: build a new normalized scale for the year built, so that the data can be split into groups of years across the Year Built range.
            </p>
        </article>
        </section>
    </div>
        <!DOCTYPE html>
<html>
<head>
<style>
table {
  font-family: arial, sans-serif;
  border-collapse: collapse;
  width: 100%;
}

td, th {
  border: 1px solid #dddddd;
  text-align: left;
  padding: 8px;
}

tr:nth-child(even) {
  background-color: #dddddd;
}
</style>
</head>
<body>

<h2>SCORING AGE</h2>

<table>
  <tr>
    <th>Id</th>
    <th>Year Built</th>
    <th>normalized score</th>
  </tr>
  <tr>
    <td>0</td>
    <td>2003</td>
    <td>?</td>
  </tr>
  <tr>
    <td>1</td>
    <td>1976</td>
    <td>?</td>
  </tr>
  <tr>
    <td>2</td>
    <td>2001</td>
    <td>?</td>
  </tr>
  <tr>
    <td>3</td>
    <td>1915</td>
    <td>?</td>
  </tr>
     <tr>
    <td>4</td>
    <td>2000</td>
    <td>?</td>
  </tr>
</table>

</body>
</html>
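
As a rough sketch of how such a score could be computed, the block below applies two candidate scalings to the five years from the table above: plain min-max scaling, and a score based on the rank of each 20-year bin. Both are assumptions about what "normalized score" means here, and the minimum and maximum are taken from these five rows only, not from the full dataset.

```python
import pandas as pd

# The five Year Built values shown in the table above
years = pd.Series([2003, 1976, 2001, 1915, 2000], name="YearBuilt")

# Option 1: min-max scaling to [0, 1]
min_max = (years - years.min()) / (years.max() - years.min())

# Option 2: score by 20-year bin (year // 20), scaled to (0, 1]
bins = years // 20
first_bin, last_bin = bins.min(), bins.max()
n_bins = last_bin - first_bin + 1
bin_score = (bins - first_bin + 1) / n_bins

print(pd.DataFrame({"YearBuilt": years, "min_max": min_max, "bin_score": bin_score}))
```

With the full `YearBuilt` column the bounds, and therefore the scores, would differ.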


In [None]:
# Building a normalized scale for decades in Year Built data columns
normalized_score_bins = range_year_built * 0.25 
#norm_ = (data - data.min())/ (data.max() - data.min())
# sorted list of construction years
uns_YearBuilt = sorted(iowa_data.YearBuilt)
#print(uns_YearBuilt)

#iowadata_year_cols = iowa_data[['Id','YearBuilt', 'normalized_score']]
#iowadata_year_cols

In [None]:
from itertools import groupby
uns_YearBuilt2 = np.array(uns_YearBuilt)
#uns_YearBuilt2.reshape(-1)
print('sorted year array', uns_YearBuilt2)
# eliminate duplicates with the set function
year_set = set(uns_YearBuilt2)
# a set is unordered, so convert it into a sorted list: groupby only merges adjacent items
year_list = sorted(year_set)
year_set_array = np.array(year_list)
print('year set array', year_set_array)
# i // 20 groups the years into 20-year spans; dtype=object because the groups have different lengths
output = np.array([list(g) for k,g in groupby(year_set_array,lambda i: i//20)], dtype=object)
print('year set array grouped into 20-year spans', output)

In [None]:
# reshape(-1) returned a new array that was discarded, so just keep a reference to the grouped array
ex = output
ex

In [None]:
uns_SalePrice = sorted(iowa_data.SalePrice)
uns_salePrice2 = np.array(uns_SalePrice)
print('sorted price array', uns_salePrice2)
# eliminate duplicates with the set function
price_set = set(uns_salePrice2)
# convert the set into a sorted list (a set is unordered)
price_list = sorted(price_set)
price_set_array = np.array(price_list)
print('Price set array', price_set_array)
# group prices into bands of 100,000 USD; dtype=object because the bands have different lengths
output_ = np.array([list(g) for k,g in groupby(price_set_array,lambda i: i//100000)], dtype=object)
print('grouped price set array by one hundred thousand usd', output_)

In [None]:
# Building a normalized scoring scale for the 8 groups of 20 years
score = np.array([ 0.125,0.25,0.375,0.5,0.625, 0.75, 0.875, 1.0])

In [None]:
grouped_year_decade_df = pd.DataFrame(ex, columns = ['groups_year_decade'])
grouped_year_decade_df

In [None]:
#Adding as the new column
grouped_year_decade_df['year_score'] = score.tolist()
grouped_year_decade_df

In [None]:
grouped_price_df = pd.DataFrame(output_, columns = ['groups_sale_price'])
grouped_price_df

<h1>Comment</h1>
<p>I fill the missing values in the features with the highest percentage of missing data:</p>
<ul>
<li>PoolQC: 99.520548 %</li>
<li>MiscFeature: 96.301370 %</li>
<li>Alley: 93.767123 %</li>
<li>Fence: 80.753425 %</li>
</ul>

In [None]:
# Fill the missing values with the string "None" (the feature is absent for that house)
iowa_data["PoolQC"] = iowa_data["PoolQC"].fillna("None")
iowa_data["MiscFeature"] = iowa_data["MiscFeature"].fillna("None")
iowa_data["Alley"] = iowa_data["Alley"].fillna("None")
iowa_data["Fence"] = iowa_data["Fence"].fillna("None")


<h1>Comment</h1>
<p>This step is the label encoding, applied to the categorical variables that probably contain useful information.</p>
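
To show what label encoding does, here is a small sketch on a hypothetical quality column. `LabelEncoder` numbers the classes in sorted (alphabetical) order, so the integers do not follow the quality ranking. The explicit `order` mapping is an assumed alternative for illustration, not something this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical quality ratings in the style of the dataset's codes
quality = pd.Series(["Gd", "TA", "Ex", "None", "Fa", "Gd"])

encoder = LabelEncoder()
codes = encoder.fit_transform(quality)

# classes_ holds the labels in sorted order; each label's position is its code
print(list(encoder.classes_))
print(codes.tolist())

# Assumed alternative: an explicit ordinal mapping that follows the quality ranking
order = {"None": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
ordinal = quality.map(order)
print(ordinal.tolist())
```

For a linear model the difference matters, because the model treats the encoded integers as magnitudes.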

In [None]:
from sklearn.preprocessing import LabelEncoder #Encode target labels with value between 0 and n_classes-1.
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    # cast to str so that remaining NaN values become their own 'nan' label instead of raising a TypeError
    values = iowa_data[c].astype(str)
    iowa_data[c] = lbl.fit_transform(values)

# shape        
print('Shape all_data: {}'.format(iowa_data.shape))




In [None]:
iowa_data.head()

In [None]:
#Obtain dummy categorical values
iowa_data = pd.get_dummies(iowa_data)
print(iowa_data.shape)

In [None]:
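# features only: drop the target if it is still present, so it does not leak into the model
train = iowa_data.drop(columns=['SalePrice'], errors='ignore')
test = pd.read_csv('../input/test.csv')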


In [None]:
ntrain = iowa_data.shape[0]
ntest = test.shape[0]

In [None]:
iowa_data = iowa_data.reset_index()
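
The scikit-learn linear models used in the next section do not accept NaN values, and only four columns have been filled so far. A minimal sketch of one possible fix, assuming median imputation is acceptable for the remaining numeric gaps, is shown below on toy data. The `features` frame is hypothetical and the choice of strategy is not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric features with gaps (values are illustrative)
features = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
})

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)

print(filled)
```

On the real data the imputer would be fitted on the training split only and then applied to the validation split, so that no information from the validation rows is used.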

<h1>MODEL</h1>

In [None]:
from sklearn.linear_model import Lasso , LogisticRegression
from sklearn.model_selection import train_test_split


print(ntrain)
print(ntest)

x_train, x_test, y_train, y_test = train_test_split(train, target,test_size=0.25, random_state=0)

In [None]:
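# Fit a linear regression. LogisticRegression is a classifier and SalePrice is continuous,
# so ordinary least squares (LinearRegression) is swapped in here.
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)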

In [None]:
#Lasso implementation
modelLasso = Lasso(alpha=0.02).fit(x_train, y_train)
y_predict_lasso = modelLasso.predict(x_test)

In [None]:
# MSE to measure how well each linear regressor predicts the held-out data
from sklearn.metrics import mean_squared_error

lr_output = lr_model.predict(x_test)
lr_loss = mean_squared_error(y_test, lr_output)
print("lr Loss:", lr_loss)

lasso_loss = mean_squared_error(y_test, y_predict_lasso)
print("Lasso Loss: ", lasso_loss)
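
Because MSE is expressed in squared dollars, it can be easier to read its square root (RMSE), which is in the same unit as the price. The sketch below uses three made-up price pairs to show the relationship; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices, each off by 10,000
y_true = np.array([200_000.0, 150_000.0, 310_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = np.sqrt(mse)                       # back in the unit of the target

print("MSE: ", mse)
print("RMSE:", rmse)
```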


---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*