The first component, due one week into the project, requires students to submit a simple model (MODEL1) that can be used for predicting the sale price of houses. This component is used to verify that students understand the assignment and are familiar with the methodology for submitting their models. The second component, due two weeks into the project, requires students to submit a more complex model (MODEL2) that represents their best effort at predicting housing prices. This model is applied to a validation set to determine a “fit” grade that makes up 30% of their project grade. The final component, due on the last day of class, is a written report that contains all the analysis, interpretation, and information for the two submitted models. The written report accounts for the remaining 70% of the total project grade.

MODEL2 is evaluated through a cross-validation or data-splitting technique in which the original data set is split into two data sets: the training set and the validation set. The students are given the training set for the purpose of developing their model, and I retain the validation set for use in evaluating their model.

 I chose to use randomization to create my Boston sets but those wishing to achieve a more consistent split may want to use a systematic sampling scheme. Simply order the original data set by a variable of interest (such as sale price) and select every kth observation to achieve the desired sample size (k=2 for a 50/50 split or k=4 for a 75/25 split).
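The systematic scheme described above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the helper name `systematic_split` and the toy prices are hypothetical, and the choice of which row in each group of k goes to the validation set (here, the last one) is arbitrary.

```python
import pandas as pd

def systematic_split(df, sort_col, k):
    """Sort by `sort_col` and send every k-th row to the validation set."""
    ordered = df.sort_values(sort_col).reset_index(drop=True)
    # Rows at positions k-1, 2k-1, ... form the validation set
    is_validation = (ordered.index % k) == (k - 1)
    return ordered[~is_validation], ordered[is_validation]

# Toy data standing in for the housing file (values are made up)
toy = pd.DataFrame({"SalePrice": [210, 95, 150, 320, 180, 125, 275, 160]})

# k=4 gives a 75/25 split; k=2 would give a 50/50 split
train, validation = systematic_split(toy, "SalePrice", k=4)
print(len(train), len(validation))  # 6 2
```

Because the rows are ordered by price before sampling, both sets cover the full price range, which is the consistency the systematic scheme is meant to provide.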
 The most common error I have found is students losing track of what they have done in creating complex variables such as transformations and interactions (for example, they think that their new variable v13 is an interaction between v1 and v3 when it is actually some other combination or transformation).

I remind the students of the concept of the validation set (mentioned earlier in the semester) and then talk about the four main criteria I use for evaluating their model. In each measure, the actual home price ($Y$) of each observation in the validation set is compared to the predicted value ($\hat{Y}$) obtained from their model, and $N$ is the number of observations in the validation set.

 Bias – $\sum (\hat{Y}-Y)/N$ – This concept is the easiest for the students to understand, as positive values indicate the model tends to overestimate price (on average) while negative values indicate the model tends to underestimate price.

 Maximum Deviation – $\max |\hat{Y}-Y|$ – Students also find this measure easy to understand, as it identifies the worst prediction they made in the validation data set.

 Mean Absolute Deviation – $\sum |\hat{Y}-Y|/N$ – Although not as intuitive to the students, once contrasted with bias, students grasp that it is the average error (regardless of sign).

 Mean Square Error – $\sum (\hat{Y}-Y)^2/N$ – The least intuitive and least meaningful measure for the students. I only include it so that I can compare its calculation to the methodology used to obtain the coefficient estimates from the original data set (linking back to the idea of Least Squares Regression).
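The four measures are straightforward to compute once the validation prices and the model predictions are available. The sketch below uses NumPy; the function name `fit_measures` and the four toy prices are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def fit_measures(y, y_hat):
    """Compare actual prices y with predicted prices y_hat on a validation set."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y_hat - y  # positive error means the model overestimates
    return {
        "bias": err.mean(),
        "max_deviation": np.abs(err).max(),
        "mean_absolute_deviation": np.abs(err).mean(),
        "mean_square_error": (err ** 2).mean(),
    }

# Made-up actual and predicted prices (in thousands of dollars)
actual = [200, 150, 300, 250]
predicted = [210, 140, 290, 270]
measures = fit_measures(actual, predicted)
print(measures)
```

In this toy case the errors are 10, -10, -10 and 20, so the bias is 2.5 while the mean absolute deviation is 12.5: the signed errors partly cancel in the bias but not in the absolute deviation.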

## The Data
There are two data sets included in the data folder: `Ames_Housing_Price_Data.csv` and `Ames_Real_Estate_Data.csv`.

The `Ames_Housing_Price_Data.csv` set contains $81$ data columns, including the key feature **SalePrice**, which will be used as the target of the predictive/descriptive modeling. **PID** refers to the land parcel ID, which can be merged on the *MapRefNo* column of the **Ames Assessor Data** (`Ames_Real_Estate_Data.csv`) to find the property address. Using a free service, such as **geopy**, we can then find the longitude-latitude coordinates of the houses.

The columns of the data are mostly attributes associated with the land and the houses. There are size related attributes, quality and condition attributes, house attachment attributes, etc.

To establish a foundation for your team's data analytics, we offer some insights on the house sizes vs. prices.

In [None]:
# Import the libraries and load the datasets
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # scipy.stats provides boxcox, used later in the notebook
from sklearn.decomposition import PCA
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

# Assessor data: keep only the parcel key and the address fields
realEstate = pd.read_csv("Ames_Real_Estate_Data.csv")
realEstate = realEstate[['MapRefNo','Prop_Addr','MA_Zip1']]
geocode_data = pd.read_csv("geocode_data.csv")

housing = pd.read_csv('Ames_HousePrice.csv', index_col=0)
# Keep only houses with GrLivArea below 3700 sq ft
housing = housing[housing.GrLivArea < 3700]
housing.head()

In [None]:
housing.PID.unique().shape

In [None]:
geocode_data.head()

In [None]:
housing = pd.merge(housing, geocode_data.iloc[:,1:6], how='left', left_on='PID', right_on ="PID")
housing

In [None]:

qual_related = housing.filter(regex='Qual$|Cond$').fillna("TA")
qual_related

In [None]:
qual_related.GarageCond.value_counts()

In [None]:
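# Treat any missing rating as "TA" before mapping the labels to numbers
qual_related = qual_related.fillna("TA")
def Rating(t):
    """Map a quality/condition label to a numeric score; unknown labels map to 0."""
    if t == "Ex": return 7
    elif t == "Gd": return 5
    elif t == "TA": return 3
    elif t == "Fa": return 2.5
    elif t == "Po": return 1
    else: return 0
# The first two matched columns are skipped; the rest are replaced by their scores
for ele in qual_related.iloc[:,2:]:
    housing[ele] = qual_related[ele].map(Rating)
housing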

### UpSampling the Street labels

In [None]:
housing.Street.value_counts()
def Ratio(t):
    # Number of copies of each row: 1 for 'Pave', 180 for every other Street label
    if t == 'Pave': return 1
    else: return 180
# The repeat counts must be integers
ratios = housing['Street'].map(Ratio)
index_repeat = housing.index.repeat(ratios)
index_repeat = pd.Series(index_repeat, name='repeat')
index_repeat.shape

# Check the label balance after repeating; housing itself is not overwritten here
housing.loc[index_repeat].Street.value_counts()

In [None]:
housing = pd.merge(index_repeat, housing, how='left', left_on='repeat', right_index=True)

In [None]:
housing.shape
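The repeat-based upsampling used in this section can be seen more easily on a toy frame. The sketch below is an illustration only: the four rows and the repeat count of 3 are made up, and `reset_index` is used instead of the merge from the cell above to keep the example short.

```python
import pandas as pd

# Toy frame with an unbalanced Street label (values are made up)
toy = pd.DataFrame({"Street": ["Pave", "Pave", "Pave", "Grvl"],
                    "SalePrice": [200, 180, 220, 90]})

# Integer number of copies for each row, chosen so the labels end up balanced
repeats = toy["Street"].map({"Pave": 1, "Grvl": 3})

# Index.repeat duplicates each index label the requested number of times
upsampled = toy.loc[toy.index.repeat(repeats.to_numpy())].reset_index(drop=True)
counts = upsampled["Street"].value_counts()
print(counts)
```

The repeated rows are exact copies, so if the data is split into training and test sets after this step, copies of the same house can land on both sides of the split.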

In [None]:
def Ratio2(t):
    # Number of copies of each row per SaleCondition label (counts must be integers)
    if t == "Normal": return 1
    elif t == "Partial": return 4
    elif t == "Alloca": return 5
    elif t == "Abnorml": return 6  # the data set spells this label "Abnorml"
    elif t == "Family": return 12
    else: return 100

ratios = housing['SaleCondition'].map(Ratio2)
index_repeat = housing.index.repeat(ratios)
index_repeat = pd.Series(index_repeat, name='repeat')
housing = pd.merge(index_repeat, housing, how='left', left_on='repeat', right_index=True)

In [None]:
# Checking unique PID #s
uni = housing.PID.unique()
uni.shape

In [None]:
housing.shape

In [None]:
# How do prices vary by neighborhood?
plt.style.use('ggplot')

housing.boxplot(column ='SalePrice', by = 'Neighborhood')

### Encoding and Dummyfication

In [None]:
# Get the average price by neighborhood
dummy = housing.groupby("Neighborhood")[["SalePrice"]].mean()
dummy.rename(columns = {"SalePrice":"Price_by_hood"}, inplace =True)
dummy

In [None]:

# dummy is indexed by Neighborhood, so join its index to the Neighborhood column
housing = pd.merge(housing, dummy, how='left', left_on='Neighborhood', right_index=True)
housing

In [None]:
housing.isna().sum()

In [None]:
housing

In [None]:
housing.columns

In [None]:

# We trim the outliers from the list
#housing = housing
#leng = len(housing)
#print(leng)
#housing["Gradient"] = (housing.SalePrice-15000)/(housing.GrLivArea)

#housing=housing.sort_values(by="Gradient")[(housing.sort_values(by="Gradient")["Gradient"]>30) & (housing.sort_values(by="Gradient")["Gradient"]<220)]
#housing["Gradient2"] = (housing.SalePrice)/(housing.GrLivArea-1600.01) 
#housing=housing.sort_values(by="Gradient2")[ (housing.sort_values(by="Gradient2")["Gradient2"]>250)|(housing.sort_values(by="Gradient2")["Gradient2"]<0)]
#housing["Gradient3"] = (housing.SalePrice -100000)/(housing.TotalBsmtSF +1) 
#housing=housing.sort_values(by="Gradient3")[ (housing.sort_values(by="Gradient3")["Gradient3"]<200)]
#housing["Gradient4"] = (housing.SalePrice)/(housing.TotalBsmtSF-1200.01) 
#housing=housing.sort_values(by="Gradient4")[ (housing.sort_values(by="Gradient4")["Gradient4"]>300000/1300)|(housing.sort_values(by="Gradient4")["Gradient4"]<0)]

#housing
#leng2 = len(housing)
#outlier_pct = 100*(leng-leng2)/leng
#outlier_pct


In [None]:
housing.iloc[:,65:].head()

### Relationship between the Price and some features

In [None]:

size_related = housing.filter(regex='SF$|Area$')
size_related.head()

In [None]:
size_related.isnull().sum(axis=0)

In [None]:
size_related = size_related.fillna(1)  # Fill the few missing values with 1
F_values, p_values = f_regression(size_related, housing['SalePrice'])
pd.Series(p_values, index=size_related.columns).sort_values()

In [None]:
price        = housing['SalePrice']

In [None]:

corr = pd.concat([size_related, housing['SalePrice']], axis=1).corr()
sns.heatmap(corr)

In [None]:
# From the heatmap, SalePrice has a strong positive relation with GrLivArea, TotalBsmtSF and GarageArea, and a negative relation with LowQualFinSF

In [None]:
p_values

### Importance of the Features

Most of the size-related columns have significant p-values for their correlations with **SalePrice**. The **Gross Living Area** (GrLivArea) has a p-value that is reported as zero (too small to represent in floating point), which indicates a very strong statistical relationship. We will focus our research on **GrLivArea**.
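To see how `f_regression` separates a related feature from an unrelated one, here is a small sketch on synthetic data. The feature names, the slope of 100 and the noise level are assumptions made for the illustration, not values taken from the Ames data.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 500
area = rng.uniform(500, 3000, size=n)   # hypothetical living area in sq ft
unrelated = rng.normal(size=n)          # feature with no link to the price
price = 100 * area + rng.normal(0, 20000, size=n)

# f_regression tests each column on its own against the target
X = np.column_stack([area, unrelated])
F_values, p_values = f_regression(X, price)
print(F_values)
print(p_values)
```

The p-value of the related feature is so small that it may print as `0.0`; this reflects the limits of floating-point precision rather than an exact zero.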

In [None]:
#housing['SalePrice'] = housing['SalePrice'].apply(lambda x: np.log(x))

housing[['GrLivArea', 'SalePrice']].plot(kind='scatter', x='GrLivArea', y='SalePrice')

In [None]:
#housing['SalePrice'] = housing['SalePrice'].apply(lambda x: np.log(x))

housing[['TotalBsmtSF', 'SalePrice']].plot(kind='scatter', x='TotalBsmtSF', y='SalePrice')


In [None]:
# Fit the price against TotalBsmtSF and GrLivArea
lm = LinearRegression()
grLivArea = size_related[['TotalBsmtSF', "GrLivArea"]]
lm.fit(grLivArea, price)
lm.score(grLivArea, price)

In [None]:
lm.intercept_, lm.coef_

In [None]:
# fit the price against qual_related
qual_related = housing.filter(regex='Qual$|Cond$')
lm = LinearRegression()

lm.fit(qual_related, price)
lm.score(qual_related, price)

Schematically, the above linear regression can be expressed as

$$price = \beta_0 + \beta_1\cdot TotalBsmtSF + \beta_2\cdot grLivArea + \epsilon = -31601.646 + 79\cdot TotalBsmtSF + 86\cdot grLivArea + \epsilon$$

This formula explains 58% of the variation in the price for all the housing transactions.
Overall, the size of the property explains 76.7% of the variation in the price, while the quality explains 72% of the variation in the price.
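As a worked example of the fitted equation, the sketch below plugs a house into the rounded coefficients quoted above. The function name `predict_price` and the example house (1000 sq ft of basement, 1500 sq ft of living area) are hypothetical; the error term is left out because it is unknown for any single house.

```python
def predict_price(total_bsmt_sf, gr_liv_area, b0=-31601.646, b1=79.0, b2=86.0):
    """Point prediction from the two-feature size model, using the rounded coefficients."""
    return b0 + b1 * total_bsmt_sf + b2 * gr_liv_area

# -31601.646 + 79 * 1000 + 86 * 1500
example = predict_price(1000, 1500)
print(round(example, 3))  # 176398.354
```

Each additional square foot of living area adds about 86 dollars to the predicted price in this model, and each additional square foot of basement adds about 79 dollars.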

In [None]:
#housing['SalePrice'] = housing['SalePrice'].apply(lambda x: np.log(x))

plt.scatter( housing['GrLivArea'].apply(lambda x: np.log(x+2)), housing['TotalBsmtSF'].apply(lambda x: np.log(x+2)))


### Interaction of numerical features with SalePrice

### Feature to Feature Interaction

### Fixing Missing Values

In [None]:
def missing_values_table(df):
    """Return counts and percentages of missing values for columns with more than 10 missing."""
    mis_val = df.isnull().sum()
    mis_val_pct = 100 * mis_val / len(df)
    table = pd.concat([mis_val, mis_val_pct], axis=1)
    table.columns = ['Missing Values', '% of Total Values']
    # Filter first, then sort, so the boolean mask and the table share the same row order
    return table[table['Missing Values'] > 10].sort_values(by='Missing Values')
    

In [None]:
missing_values_table(housing)["Missing Values"].plot.bar()

In [None]:
housing=housing.drop(["PoolQC","MiscFeature"], axis=1)

In [None]:
housing[housing.columns[housing.isnull().any()]]

In [None]:
# Replace NA values that mean "feature absent" (not truly missing) with explicit labels; fill numeric gaps with the median
medl = housing.LotFrontage.median()
medm = housing.MasVnrArea.median()
housing.Alley = housing.Alley.fillna("No Alley Access")
housing.LotFrontage = housing.LotFrontage.fillna(medl)
housing.MasVnrArea = housing.MasVnrArea.fillna(medm)
housing.MasVnrType = housing.MasVnrType.fillna('None')
housing.BsmtQual = housing.BsmtQual.fillna("No Basement")

In [None]:
# Fill the NA with the right values
housing.BsmtCond = housing.BsmtCond.fillna("No Basement")
housing.BsmtExposure = housing.BsmtExposure.fillna("No Basement")
housing.BsmtFinType1 = housing.BsmtFinType1.fillna("No Basement")
housing.BsmtFinType2 = housing.BsmtFinType2.fillna("No Basement")
housing.FireplaceQu = housing.FireplaceQu.fillna("No Fireplace")
housing.GarageType = housing.GarageType.fillna("No Garage")
housing.GarageFinish = housing.GarageFinish.fillna("No Garage")
housing.GarageQual = housing.GarageQual.fillna("No Garage")
housing.GarageCond = housing.GarageCond.fillna("No Garage")
housing.Fence = housing.Fence.fillna("No Fence")
housing.Electrical = housing.Electrical.fillna("None")

med1 = housing.BsmtFinSF1.median()
med2 = housing.BsmtFinSF2.median()
medf = housing.BsmtUnfSF.median()
medt = housing.TotalBsmtSF.median()
meda = housing.GarageArea.median()
medlon =housing.long.median()
medlat = housing.lat.median()
medist = housing.dist.median()
medinc = housing.income.median()

housing.BsmtFinSF1 = housing.BsmtFinSF1.fillna(med1)
housing.BsmtFinSF2 = housing.BsmtFinSF2.fillna(med2)
housing.BsmtUnfSF = housing.BsmtUnfSF.fillna(medf)
housing.TotalBsmtSF = housing.TotalBsmtSF.fillna(medt)
housing.GarageArea = housing.GarageArea.fillna(meda)
housing.long = housing.long.fillna(medlon)
housing.lat = housing.lat.fillna(medlat)
housing.dist = housing.dist.fillna(medist)
housing.income = housing.income.fillna(medinc)

housing.BsmtFullBath = housing.BsmtFullBath.fillna(0.0)
housing.BsmtHalfBath = housing.BsmtHalfBath.fillna(0.0)
housing.GarageCars = housing.GarageCars.fillna(0.0)

housing.GarageYrBlt = np.where(housing.GarageYrBlt.notnull(),housing.GarageYrBlt, housing.YearBuilt)



In [None]:
missing_values_table(housing)

# Feature Engineering

In [None]:
housing.YearBuilt = 2010 - housing.YearBuilt
housing.GarageYrBlt = 2010 - housing.GarageYrBlt
housing.YrSold = 2010 -housing.YrSold 
housing.YearRemodAdd = 2010 -housing.YearRemodAdd

In [None]:
# Creation of new column combining full and half bathrooms into one
bathrm = (housing['FullBath'] + housing['BsmtFullBath'] +
(housing['HalfBath']*0.5) + (housing['BsmtHalfBath']*0.5))
housing['bathrm_cnt'] = bathrm

# Creation of new column combining deck/porch-related sq footage into one
patioSF = (housing['WoodDeckSF'] + housing['OpenPorchSF']+ housing['EnclosedPorch'] + 
           housing['3SsnPorch'] + housing['ScreenPorch'])
housing['patioSF'] = patioSF

# Consider removing this section if the R**2 does not improve

# Zoning Dummy
# Build the dummies from the column itself and attach them by row position.
# Merging on PID would multiply rows whenever PID is not unique (for example after upsampling).
dummies = pd.get_dummies(housing['MSZoning'], prefix='MSZoning', drop_first=True)
dummies = dummies[['MSZoning_RH','MSZoning_RL','MSZoning_RM']]
housing = pd.concat([housing, dummies], axis=1)
print(housing.shape)
def near_rr(df):
    rr = ['RRAe', 'RRAn', 'RRNn','RRNe']
    if df['Condition1'] in rr:
        return 1
    if df['Condition2'] in rr:
        return 1
    else:
        return 0
#housing = near_rr(housing)
# Creating near RR column
housing['NearRR'] = housing.apply(near_rr, axis =1)
print(housing.shape)
def near_pos(df):
    pos = ['PosA', 'PosN']
    if df['Condition1'] in pos:
        return 1
    if df['Condition2'] in pos:
        return 1
    else:
        return 0

# Creating near Positive Feature column
housing['NearPos'] = housing.apply(near_pos, axis = 1)
print(housing.shape)
# Creating function to see if Condition1 or Condition2 shows house is adjacent to arterial road
def near_art(df):
    art = ['Artery']
    if df['Condition1'] in art:
        return 1
    if df['Condition2'] in art:
        return 1
    else:
        return 0

# Creating adjacent to arterial road column
housing['Artery'] = housing.apply(near_art, axis = 1)
print(housing.shape)
# Function to convert ordinal KitchenQual labels to numbers
kitchen_scale = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
def qual_to_num_kit(df):
    # Values that are not string labels (for example, already numeric) are returned unchanged,
    # so running this step more than once does not wipe out the column
    return kitchen_scale.get(df['KitchenQual'], df['KitchenQual'])

# Replacing KitchenQual string values with numerical
housing['KitchenQual'] = housing.apply(qual_to_num_kit, axis = 1)
print(housing.shape)





# separate dummy df
dum_bldgtype = pd.get_dummies(housing.BldgType, prefix='BldgType')
dum_bldgtype.drop('BldgType_'+str(housing['BldgType'].mode()[0]), axis=1, inplace=True)
housing = pd.concat([housing, dum_bldgtype], axis=1)
print(housing.shape)

# House Style

# Merge some house styles into others; assign the result back instead of replacing in place on a column
housing['HouseStyle'] = housing['HouseStyle'].replace(
    {'2.5Fin': '2Story', '2.5Unf': '1Story', '1.5Unf': '1.5Fin'})

dum_housestyle = pd.get_dummies(housing.HouseStyle, prefix='HouseStyle')
dum_housestyle.drop('HouseStyle_'+str(housing['HouseStyle'].mode()[0]), axis=1, inplace=True)
# concatenating dum_housestyle with housing
housing = pd.concat([housing, dum_housestyle], axis=1)
print(housing.shape)

housing.shape

housing.columns

coldrop = ['MSSubClass']
housing = housing.drop(coldrop, axis = 1)

housing.shape







In [None]:
plt.hist(housing.SalePrice, bins = 50)


In [None]:
# distplot is deprecated in recent seaborn releases; kdeplot draws the same shaded density curve
sns.kdeplot(x=housing.SalePrice, fill=True, linewidth=2)
plt.show()

In [None]:
housing["SalePrice"].mean()

In [None]:
# Use the log function to make the price distribution closer to normal
plt.hist(np.log(housing.SalePrice+1), bins = 50)

In [None]:
# distplot is deprecated in recent seaborn releases; kdeplot draws the same shaded density curve
sns.kdeplot(x=np.log(housing.SalePrice+1), fill=True, linewidth=2)
plt.show()

In [None]:
np.log(housing["SalePrice"]).mean()

In [None]:
# Check the different types of foundations
#print(housing.Foundation.value_counts())
#sns.countplot(housing.Foundation)

# KitchenQual was already converted to numbers in the feature engineering step above,
# so the conversion is not repeated here.

In [None]:
# Keep the numerical data to the left and categorical data to the right.
# Visualise the proportion of each categorical label
categorical_data = []
housing_new = pd.DataFrame()
housing_new["SalePrice"] = housing["SalePrice"]
for ele in housing.columns:
    if housing[ele].dtype in ("int64", "float64"):
        housing_new[ele] = housing[ele]
    else:
        categorical_data.append(ele)
        sns.countplot(x=housing[ele])
        plt.show()

In [None]:
for name in categorical_data:
    print(name, ': number of values', len(housing[name].value_counts()))

In [None]:
for ele in categorical_data:
    housing_new[ele] = housing[ele]

In [None]:
# Get the dummies of each categorical Data.
for ele in categorical_data:
    # Converting type of columns to category
    housing_new=pd.get_dummies(housing_new, prefix="{}_".format(ele), 
                            columns=[ele], 
                            drop_first=True)
    

housing_new#=housing_new.drop(["repeat","repeat_x", "repeat_y"], axis=1)

In [None]:
#housing_new

**Fitting and Evaluating Multiple Linear Regression**


In [None]:
# Fit on the first 1500 rows and evaluate on the remaining rows
ols = LinearRegression()
x_m = np.array(housing_new.iloc[:1500, 1:])
y_m = np.array(housing_new.iloc[:1500, 0])
x_t = np.array(housing_new.iloc[1500:, 1:])
y_t = np.array(housing_new.iloc[1500:, 0])
ols.fit(x_m, y_m)
print("coefficients: " + str(np.round(ols.coef_, 3)))
print("intercept: " + str(np.round(ols.intercept_, 3)))
print("RSS (train): %.2f" % np.sum((ols.predict(x_m) - y_m) ** 2))
print("R^2 (test): %.5f" % ols.score(x_t, y_t))

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression 
X_train, X_test, y_train, y_test = train_test_split(housing_new.iloc[:,1:], housing_new.iloc[:,0], test_size=0.5, random_state=0)
ols = LinearRegression()
ols.fit(X_train, y_train)
print("R^2 for train set: %f" %ols.score(X_train, y_train))

print('-'*50)

print("R^2 for test  set: %f" %ols.score(X_test, y_test))

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression 
lst =[]
test_r =[]
train_r=[]

for ele in housing_new.iloc[:,1:].columns:
    ols = LinearRegression()
    X_train, X_test, y_train, y_test = train_test_split(housing_new[[ele]], housing_new.iloc[:,0], test_size=0.5, random_state=0)
    lst.append(ele)
    ols.fit(X_train, y_train)
    #print('-'*50)
    #print(ele.upper())
    train_r.append(ols.score(X_train, y_train))

   

    test_r.append(ols.score(X_test, y_test))
    #print('-'*50)
feature_importance =pd.DataFrame( {"element":lst, "train_r":train_r,"test_r":test_r}).sort_values(by="train_r")[::-1]
feature_importance

In [None]:
# Check the contribution of each feature by importance
housingimp=pd.DataFrame()
housingimp["SalePrice"]=housing.SalePrice
housingimp[list(feature_importance.element)]=housing_new[list(feature_importance.element)]
    
lst =[]
test_r =[]
train_r=[]

for ele in range(2,len(housingimp.columns)):
    ols = LinearRegression()
    X_train, X_test, y_train, y_test = train_test_split(housingimp.iloc[:,1:ele], housing_new.iloc[:,0], test_size=0.5, random_state=0)
    lst.append(housingimp.iloc[:,1:ele].columns)
    ols.fit(X_train, y_train)
    #print('-'*50)
    
    train_r.append(ols.score(X_train, y_train))

    test_r.append(ols.score(X_test, y_test))
    #print('-'*50)
nfeature_importance =pd.DataFrame( {"element":lst, "train_r":train_r,"test_r":test_r}).sort_values(by="train_r")
nfeature_importance 

In [None]:
plt.hist(nfeature_importance["train_r"], bins=20)


# BoxCox Transformation

Some of the feature distributions are strongly skewed, which can hurt the performance of a linear model.
We apply a Box-Cox transformation to bring them closer to normal.
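The effect of the transformation can be checked on synthetic data before applying it to the housing columns. The sketch below draws a right-skewed (lognormal) sample, which is an assumption made for the illustration, and compares the skewness before and after `scipy.stats.boxcox`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Strictly positive, right-skewed sample; Box-Cox is only defined for positive data
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With no lambda given, boxcox estimates it by maximum likelihood
transformed, fitted_lambda = stats.boxcox(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
print(fitted_lambda, skew_before, skew_after)
```

A fitted lambda close to 0 corresponds to a plain log transform. Columns that contain zeros need a positive shift first, and columns with negative values (longitude, for instance) cannot be passed to `boxcox` without a larger shift.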

In [None]:
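# Take the Box-Cox transform of the numerical columns (columns 1 to 37).
# boxcox requires strictly positive input, hence the +1 shift.
lst = list(housing_new.columns)[1:38]
for ele in lst:
    print(ele)
    fitted_data, fitted_lambda = stats.boxcox(housing_new[ele]+1)
    # The "Log_" prefix is only a label; the values are Box-Cox transformed
    housing_new["Log_{}".format(ele)]=fitted_data
    print(fitted_lambda)
housing_new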

In [None]:
housing_new.iloc[:,38:]

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression 
X_train, X_test, y_train, y_test = train_test_split(housing_new.iloc[:,1:300], housing_new.iloc[:,0], test_size=0.3, random_state=0)
ols = LinearRegression()
ols.fit(X_train, y_train)
print("R^2 for train set: %f" %ols.score(X_train, y_train))

print('-'*50)

print("R^2 for test  set: %f" %ols.score(X_test, y_test))

### Scale the data using the standard scaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# transform data
scaled = pd.DataFrame(scaler.fit_transform(housing_new.iloc[:,1:]))
scaled


X_train, X_test, y_train, y_test = train_test_split(scaled, housing_new.iloc[:,0], test_size=0.3, random_state=0)
ols = LinearRegression()
ols.fit(X_train, y_train)
print("R^2 for train set: %f" %ols.score(X_train, y_train))

print('-'*50)

print("R^2 for test  set: %f" %ols.score(X_test, y_test))
# This model does not perform well on the test set.

In [None]:
housing_new.columns

In [None]:
housing_new.shape

In [None]:
N = len(housingimp.columns)
breaks = range(1, N, 2)  # Determine the column blocks (two columns each) to run PCA on.
pca = PCA(n_components=2)
exp_ratio = []
# SalePrice (column 0) is excluded from the PCA inputs so that the components are blind to price.
# We assume that longitude and latitude are independent predictors. Note that the blocks below
# start at column 1, so any longitude/latitude columns present in housing_new are still included.
data = housing_new[["SalePrice"]].copy()

for i in range(len(breaks)-1):
    principal_components_ = pca.fit_transform(housing_new.iloc[:,breaks[i]:breaks[i+1]])

    # Per-block frame, only needed by the commented-out plotting code below
    data1 = housing_new[["SalePrice"]].copy()
    total_var = sum(pca.explained_variance_ratio_)*100
    data1["PCA_{}".format(i)] = list(principal_components_[:,0])  # Add the first principal component to data1

    # Always keep the first principal component of the block
    exp_ratio.append(pca.explained_variance_ratio_[0])
    data["PCA_{}".format(i)] = list(principal_components_[:,0])

    # Keep the second component only if it explains more than 25% of the block's variance.
    # The "_2" suffix avoids a name clash with the first component of a later block.
    if pca.explained_variance_ratio_[1] > 0.25:
        data["PCA_{}_2".format(i)] = list(principal_components_[:,1])
        exp_ratio.append(pca.explained_variance_ratio_[1])

    #fig = plt.figure(figsize=(15, 10))
    #fig = px.scatter_3d(
        #np.array(data1), x=0, y=1, z=2, color=gadf3['price'],
        #title=f'Total Explained Variance: {total_var:.2f}%,  PCA_1:  {100*pca.explained_variance_ratio_[0]:.2f}%, PCA_2: {100*pca.explained_variance_ratio_[1]:.2f}% ',
        #labels={"Longitude", "Latitude", "PCA_"}

    #)
    #fig.show()
    

In [None]:
data

In [None]:
exp_ratio

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,1:], housing_new.iloc[:,0], test_size=0.3, random_state=0)
ols = LinearRegression()
ols.fit(X_train, y_train)
print("R^2 for train set: %f" %ols.score(X_train, y_train))

print('-'*50)

print("R^2 for test  set: %f" %ols.score(X_test, y_test))

- Do **multiple linear regression** with a new data set.
- Report the coefficient of determination from the training and testing sets.

In [None]:
X_train.shape

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression 
X_train, X_test, y_train, y_test = train_test_split(housing_new.iloc[:,1:300], housing_new.iloc[:,0], test_size=0.3, random_state=0)
ols = LinearRegression()
ols.fit(X_train, y_train)
print("R^2 for train set: %f" %ols.score(X_train, y_train))

print('-'*50)

print("R^2 for test  set: %f" %ols.score(X_test, y_test))

In [None]:
housing_new

# Merge the second DATASET

In [None]:
# Merging original DF with additional dataset
geodata = housing_new.merge(realEstate, left_on = 'PID', right_on = 'MapRefNo')
geodata = geodata[['PID','Prop_Addr','MA_Zip1']]
print(geodata.shape)
geodata.head(10)

hs_na = geodata.isna().sum()
hs_na.plot.bar()
hs_na

In [None]:
geodata

In [None]:

# Importing additional libraries for further geographical analysis
import geopy
from geopy import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from geopy.distance import geodesic

In [None]:
census =pd.read_csv("Cenus_data.csv")
geocode = pd.read_csv("GeocodeResults2.csv")
geocode