# Property Condition data analysis and modeling

## Correlation Analysis

### To look at the data describing the condition of the property, we subset the following columns:
>'condition_code', 'grade_code', 'age', 'age_of_renovations', and 'renovated' 
### We then look at correlations between these variables

In [None]:
# First I'll create some basic variables that will be used throughout the analysis and modeling
df_condition = df_cities[['price','condition_code','grade_code','age','age_of_renovations','renovated']]
X_condition = df_condition[['condition_code','grade_code','age','age_of_renovations','renovated']]
y_condition = df_condition.price

In [None]:
# Now let's look at the feature correlations
X_condition.corr()

In [None]:
# We can view this as a heat map to give a better view
plt.figure(figsize=(8,6))
sns.heatmap(X_condition.corr(), annot=True)
plt.show()

### We can see that 'age_of_renovations' is closely correlated with 'renovated' and may cause multicollinearity issues. Indeed, they describe the same characteristic and tell the same tale. We can eliminate one of these from our analysis.
>#### If a property is showing a value > 1 in 'age_of_renovations' then we know it has been renovated, which tells us the same as the 'renovated' column... so we will drop 'renovated'
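One way to put a number on this overlap is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R²). The sketch below is only an illustration under stated assumptions. It runs on synthetic stand-in columns rather than `df_condition`, and the helper name `variance_inflation_factors` is hypothetical, not something defined elsewhere in this notebook.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of the 2-D array X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on all remaining columns plus an intercept
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Synthetic stand-in data (NOT the housing data): a renovation flag and a
# renovation age that is zero unless the flag is set
rng = np.random.default_rng(0)
age = rng.uniform(0, 100, size=500)
renovated = (rng.random(500) < 0.3).astype(float)
age_of_renovations = renovated * rng.uniform(1, 30, size=500)

X_demo = np.column_stack([age, age_of_renovations, renovated])
vifs = variance_inflation_factors(X_demo)
for name, vif in zip(["age", "age_of_renovations", "renovated"], vifs):
    print(f"{name}: {vif:.2f}")
```

A feature whose VIF sits well above the others is largely explained by the remaining columns, which is the situation we describe for 'renovated' and 'age_of_renovations'.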

In [None]:
formula_cond = 'price ~ condition_code + grade_code + age + age_of_renovations'

condition_model = ols(formula=formula_cond, data=df_condition).fit()
condition_model_summ = condition_model.summary()

print(condition_model_summ)

### Let's take a look at some regressions and see what will give us the strongest model based on the condition variables

### *Using all of our features, we get a strong score on both a training data set and also the test set*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_condition, 
                                                    y_condition,
                                                    test_size=None,
                                                    random_state=42
                                                   )

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

print(lr.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))

### *We don't need all of these features though, so let's iterate through some linear regressions with different feature combinations and see what will give us the simplest, but highest scoring model*
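Since every comparison repeats the same split, fit, and score steps, the pattern can be wrapped in a small helper. This is a minimal sketch, not the code used for the results in this notebook: the function name `evaluate_features`, the `demo` DataFrame, and its made-up price formula are all assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_features(df, features, target="price", random_state=42):
    """Fit a linear regression on `features` and return train/test R^2 and RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    results = {}
    for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        predictions = model.predict(X_part)
        results[f"{name}_r2"] = model.score(X_part, y_part)
        results[f"{name}_rmse"] = np.sqrt(mean_squared_error(y_part, predictions))
    return results


# Synthetic stand-in data (NOT the housing data), with an invented price formula
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "grade_code": rng.integers(1, 12, size=400),
    "age": rng.integers(0, 100, size=400),
})
demo["price"] = (
    50_000 * demo["grade_code"] + 1_000 * demo["age"] + rng.normal(0, 20_000, size=400)
)

# Score several feature combinations and collect them in one table
feature_sets = {
    "grade only": ["grade_code"],
    "age only": ["age"],
    "grade + age": ["grade_code", "age"],
}
summary = pd.DataFrame(
    {label: evaluate_features(demo, cols) for label, cols in feature_sets.items()}
).T
print(summary)
```

Collecting the scores in one table makes the feature combinations easier to compare side by side than reading separate printouts.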

In [None]:
# dropping the 'renovated' feature that was giving us multicollinearity issues

X_condition_1 = df_condition[['condition_code','grade_code','age','age_of_renovations']]
X_train, X_test, y_train, y_test = train_test_split(X_condition_1, 
                                                    y_condition,
                                                    test_size=None,
                                                    random_state=42
                                                   )

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

print(lr.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))

### *Condition and Grade tell a similar story about a property. Looking at the data, we see Grade is a lot more in depth and provides more bins, so it is potentially more powerful. Let's drop Condition.*

In [None]:
# dropping the 'condition_code' feature from the model

X_condition_2 = df_condition[['grade_code','age','age_of_renovations']]
X_train, X_test, y_train, y_test = train_test_split(X_condition_2, 
                                                    y_condition,
                                                    test_size=None,
                                                    random_state=42
                                                   )

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

print(lr.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))

### *That barely changed our model score, and indeed when we look at just Condition it doesn't score well.*

In [None]:
X_condition_3 = df_condition[['condition_code']]
X_train, X_test, y_train, y_test = train_test_split(X_condition_3, 
                                                    y_condition,
                                                    test_size=None,
                                                    random_state=42
                                                   )

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

print(lr.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))

### *However, looking at just Grade we can see that it scores well and is indeed our strongest feature.*

In [None]:
X_condition_4 = df_condition[['grade_code']]
X_train, X_test, y_train, y_test = train_test_split(X_condition_4, 
                                                    y_condition,
                                                    test_size=None,
                                                    random_state=42
                                                   )

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

print(lr.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))

### *Including the 'age_of_renovations' feature does not add much to the model, so we will drop that as well*

In [None]:
X_condition_5 = df_condition[['grade_code','age_of_renovations']]
X_train, X_test, y_train, y_test = train_test_split(X_condition_5, 
                                                    y_condition,
                                                    test_size=None,
                                                    random_state=42
                                                   )

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

print(lr.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))

### *Interestingly, Age does not score well by itself. However, when coupled with the Grade feature it gives a large improvement to the overall score.*

### *This combination of features gives us the highest score with the fewest features.*

In [None]:
X_condition_6 = df_condition[['age']]
X_train, X_test, y_train, y_test = train_test_split(X_condition_6, 
                                                    y_condition,
                                                    test_size=None,
                                                    random_state=42
                                                   )

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

print(lr.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))

In [None]:
X_condition_7 = df_condition[['grade_code','age']]
X_train, X_test, y_train, y_test = train_test_split(X_condition_7, 
                                                    y_condition,
                                                    test_size=None,
                                                    random_state=42
                                                   )

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

print(lr.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))

### Indeed, these are the two columns we will include in the overall "Linear Regression" model

## Let's now look at some Polynomial regressions on the "condition" data and see if we can improve on our model

In [None]:
# Create a list of all feature combinations above to iterate through

X_conditions = [X_condition, X_condition_1, X_condition_2, X_condition_3, X_condition_4, X_condition_5, X_condition_6, X_condition_7]

In [None]:
# A function that iterates through our feature combinations to determine the best polynomial degree
# It returns a list of (train_scores, test_scores) tuples, one per combination, covering degrees 1 through 6

def poly_scores_array(X_lst, y):
    poly_scores = []
    for X in X_lst:
        train_scores = []
        test_scores = []
        for i in range(1,7):
            poly = PolynomialFeatures(i)
            X_poly = pd.DataFrame(poly.fit_transform(X))
            X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=None, random_state=42)
            model = LinearRegression()
            model.fit(X_train, y_train)
            train_scores.append(model.score(X_train, y_train))
            test_scores.append(model.score(X_test, y_test))
        # Record the scores once per feature combination, after all degrees have been fitted
        scores = (train_scores, test_scores)
        poly_scores.append(scores)
    return poly_scores

poly_scores_array(X_conditions, y_condition)

## Just eyeballing the array, we consistently get our highest scores up to a polynomial degree of 4, after which our test scores drop significantly.
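Rather than eyeballing, the degree with the highest test score can also be picked programmatically. The sketch below assumes scores are stored as lists ordered by degree, starting at degree 1, as in the loop over `range(1,7)`. The helper name `best_degree` is hypothetical, and the example scores are invented purely to show the mechanics; they are not results from this notebook.

```python
def best_degree(test_scores, first_degree=1):
    """Return (degree, score) for the highest test score in an ordered list."""
    best_index = max(range(len(test_scores)), key=lambda i: test_scores[i])
    return first_degree + best_index, test_scores[best_index]


# Invented scores for degrees 1 through 6 (illustration only): the train score
# keeps creeping up while the test score collapses once the model overfits
example_train = [0.45, 0.52, 0.55, 0.57, 0.58, 0.59]
example_test = [0.44, 0.50, 0.53, 0.54, 0.31, -1.20]

degree, score = best_degree(example_test)
print(degree, score)  # 4 0.54
```

Selecting on the test score rather than the train score matters here, because the train score alone would always favor the highest degree.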

In [None]:
# Having determined that the best polynomial degree is 4 for the property condition data,
# create a function to iterate through our feature combinations to determine the best model

def best_poly_model (X_lst, y):
    poly_model_scores = []
    for X in X_lst:
        poly_2 = PolynomialFeatures(4)
        X_poly = pd.DataFrame(poly_2.fit_transform(X))
        X_train, X_test, y_train, y_test = train_test_split(X_poly, y,
                                                    test_size=None,
                                                    random_state=42)
        lr_poly = LinearRegression()
        lr_poly.fit(X_train, y_train)
        train_score = lr_poly.score(X_train, y_train)
        test_score = lr_poly.score(X_test, y_test)
        score = (train_score, test_score)
        poly_model_scores.append(score)
    return poly_model_scores

best_poly_model (X_conditions, y_condition)

### The array above corresponds to our feature combinations. In our linear regression analysis we determined that X_condition_7 gave us our best model and we can see above that the same combination gives us excellent results and even scores the best on the test data overall.

### Let's see this isolated below

In [None]:
poly_7 = PolynomialFeatures(4)
X_poly_7 = pd.DataFrame(poly_7.fit_transform(X_condition_7))
X_train, X_test, y_train, y_test = train_test_split(X_poly_7, y_condition,
                                                    test_size=None,
                                                    random_state=42)
lr_poly_7 = LinearRegression()
lr_poly_7.fit(X_train, y_train)
lr_poly_7.score(X_train, y_train)
lr_poly_7.score(X_test, y_test)
y_hat_train = lr_poly_7.predict(X_train)
y_hat_test = lr_poly_7.predict(X_test)

print(lr_poly_7.score(X_train, y_train), np.sqrt(mean_squared_error(y_train, y_hat_train)))
print(lr_poly_7.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_hat_test)))
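The transform-then-fit steps above can also be bundled into a single scikit-learn pipeline, so the polynomial expansion and the regression travel together and can score new rows with one call. This is a sketch on synthetic stand-in data, not a replacement for the cell above: the variable names, the invented price formula, and the use of degree 2 (chosen only to keep the toy example small, whereas our model uses degree 4) are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (NOT the housing data): price grows with grade squared
rng = np.random.default_rng(7)
grade = rng.integers(1, 12, size=600).astype(float)
age = rng.uniform(0, 100, size=600)
X_demo = np.column_stack([grade, age])
y_demo = 5_000 * grade**2 + 500 * age + rng.normal(0, 10_000, size=600)

X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=42)

# The pipeline expands the raw features and fits the regression in one step
poly_pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_pipeline.fit(X_train, y_train)

# A plain linear fit on the same split, for comparison
linear_only = LinearRegression().fit(X_train, y_train)

poly_test_r2 = poly_pipeline.score(X_test, y_test)
linear_test_r2 = linear_only.score(X_test, y_test)
print(poly_test_r2, linear_test_r2)
```

Because the pipeline takes the raw feature columns as input, there is no separate transformed DataFrame to keep in sync with the model.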

## **After performing both linear and polynomial regression analysis on the Property Condition data, we have determined that _Grade (as coded in 'grade_code') and Age are the best variables to use in our model_**

## Let's take a look at how our model performs on our data set

In [None]:
x_line = np.linspace(0,2000000)
fig,axs = plt.subplots(figsize = (12,8))
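# Pass x and y as keywords (recent seaborn versions reject positional data vectors) and draw on our axes
sns.scatterplot(x=np.concatenate((y_test,y_train)), y=np.concatenate((y_hat_test, y_hat_train)), marker = "." , s = 8, alpha = 1, label = "Individual house sell price", ax = axs)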
axs.plot(x_line, x_line, color ="red", label = "Ideal prediction line")
axs.set_xlim(0,2000000) ; axs.set_ylim(0,2000000)
axs.yaxis.set_major_formatter(currency)
axs.xaxis.set_major_formatter(currency)
axs.set_aspect("equal")
axs.set_title("House sell price vs. predicted sell price")
axs.set_xlabel("House sell price")
axs.set_ylabel("House predicted sell price")
axs.legend();
plt.savefig("./images/property_condition_scatter.png")