# Trust Property Exercise


### Background
Tina Davis has been appointed senior data analyst at Trust Property, a company that has been building, operating, and overseeing residential properties in the US. Her first task is to create a model that predicts whether a house will sell or not. She will report the results to her supervisor, who is responsible for overseeing all data analysts in the company.

After some data analysis, Tina decides to use logistic regression to predict whether a house will sell. Calculate logistic regression coefficients using all predictor variables listed, and determine which of the following is most likely accurate.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Older statsmodels versions call stats.chisqprob, which newer SciPy releases removed; restore it
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

In [2]:
raw_data = pd.read_csv('TrustProperty.csv')
raw_data

Unnamed: 0,Price,Resid_area,Air_qual,Num_rooms,Age,Teachers,Poor_prop,N_hos_beds,N_hot_rooms,Rainfall,Parks,Airport,Sold
0,24.0,32.31,0.538,6.575,65.2,24.7,4.98,5.480,11.1920,23,0.049347,Yes,0
1,21.6,37.07,0.469,6.421,78.9,22.2,9.14,7.332,12.1728,42,0.046146,No,1
2,34.7,37.07,0.469,7.185,61.1,22.2,4.03,7.394,101.1200,38,0.045764,No,0
3,33.4,32.18,0.458,6.998,45.8,21.3,2.94,9.268,11.2672,45,0.047151,Yes,0
4,36.2,32.18,0.458,7.147,54.2,21.3,5.33,8.824,11.2896,55,0.039474,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,28.6,43.92,0.437,6.678,31.1,24.0,6.27,6.972,10.2288,44,0.042031,Yes,0
296,27.1,43.92,0.437,6.549,51.0,24.0,7.39,10.442,10.2168,26,0.045356,No,1
297,20.3,43.92,0.437,5.790,58.0,24.0,15.84,7.606,11.1624,42,0.052079,Yes,0
298,22.5,32.24,0.400,6.345,20.1,25.2,4.97,9.450,13.1800,48,0.036878,No,1


**After loading the data, we come across the following columns:**
1. Price - Prices of apartments and houses (USD, thousands) in New York State over the last couple of years
2. Resid_area - Proportion of residential areas in the city/town
3. Air_qual - Air quality of the residing neighborhood
4. Num_rooms - Number of rooms in the houses of that locality
5. Age - Number of years since the property was built
6. Teachers - Number of teachers per 1000 population in the city/town
7. Poor_prop - Proportion of poor population in the city/town
8. N_hos_beds - Number of hospital beds per 1000 population in the city/town
9. N_hot_rooms - Number of hotel rooms per 1000 population in the city/town
10. Rainfall - Yearly average rainfall
11. Parks - Proportion of land designated as parks and green areas in the city/town
12. Airport - Whether there is an airport in the city (Yes/No)
13. Sold - Whether or not the house will sell (our dependent variable)

In [3]:
data = raw_data.copy()
data['Airport'] = data['Airport'].map({'Yes':1,'No':0})
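
One thing worth checking after a `map` like this is whether every label was covered, since pandas returns `NaN` for any value missing from the dictionary. The sketch below uses a small made-up Series; the `'YES'` entry is a hypothetical mislabel, not something observed in `TrustProperty.csv`.

```python
import pandas as pd

# Hypothetical labels; 'YES' stands in for an unexpected spelling
airport = pd.Series(['Yes', 'No', 'No', 'Yes', 'YES'])
mapped = airport.map({'Yes': 1, 'No': 0})

# Values absent from the mapping dictionary become NaN
unmapped = airport[mapped.isna()].unique()
print(unmapped)
```

If `unmapped` is non-empty, the labels would need cleaning before the column is passed to the model.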

In [4]:
data.columns.values

array(['Price', 'Resid_area', 'Air_qual', 'Num_rooms', 'Age', 'Teachers',
       'Poor_prop', 'N_hos_beds', 'N_hot_rooms', 'Rainfall', 'Parks',
       'Airport', 'Sold'], dtype=object)

#### Set independent and dependent variables

In [None]:
targets = data['Sold']
inputs = data.drop(['Sold'], axis=1)

#### Fit data to model

In [6]:
x = sm.add_constant(inputs)
reg_log = sm.Logit(targets,x)
results_log = reg_log.fit()
results_log.summary()

Optimization terminated successfully.
         Current function value: 0.581359
         Iterations 7


0,1,2,3
Dep. Variable:,Sold,No. Observations:,300.0
Model:,Logit,Df Residuals:,287.0
Method:,MLE,Df Model:,12.0
Date:,"Tue, 25 Jul 2023",Pseudo R-squ.:,0.156
Time:,16:23:31,Log-Likelihood:,-174.41
converged:,True,LL-Null:,-206.64
Covariance Type:,nonrobust,LLR p-value:,3.437e-09

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-9.2246,3.501,-2.634,0.008,-16.087,-2.362
Price,-0.2644,0.048,-5.550,0.000,-0.358,-0.171
Resid_area,0.0332,0.030,1.101,0.271,-0.026,0.092
Air_qual,-3.8683,3.370,-1.148,0.251,-10.473,2.736
Num_rooms,1.3215,0.502,2.632,0.008,0.338,2.306
Age,0.0069,0.007,0.930,0.353,-0.008,0.021
Teachers,0.2873,0.076,3.775,0.000,0.138,0.436
Poor_prop,-0.1976,0.045,-4.373,0.000,-0.286,-0.109
N_hos_beds,0.2028,0.085,2.400,0.016,0.037,0.369


Based on the p-values, several predictor variables are statistically insignificant: number of hotel rooms, rainfall, airport, parks, age of the property, air quality, and residential area proportion all appear not to contribute significantly to the model.
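
As a rough illustration of that screening step, the sketch below filters the p-values copied from the rows visible in the summary table. The 0.05 threshold is an assumption, a common convention rather than something the exercise specifies. In the live notebook the same values should be available as `results_log.pvalues`.

```python
# p-values copied from the rows of the summary table shown above
p_values = {
    'Price': 0.000, 'Resid_area': 0.271, 'Air_qual': 0.251,
    'Num_rooms': 0.008, 'Age': 0.353, 'Teachers': 0.000,
    'Poor_prop': 0.000, 'N_hos_beds': 0.016,
}
alpha = 0.05  # assumed significance threshold

insignificant = [name for name, p in p_values.items() if p > alpha]
significant = [name for name, p in p_values.items() if p <= alpha]
print(insignificant)
print(significant)
```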

Let's take a look at the model's Confusion Matrix:

In [7]:
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns= ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index = {0:'Actual 0' ,  1:'Actual 1'})
cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,118.0,46.0
Actual 1,51.0,85.0


The confusion matrix shows that out of 300 observations, 118 + 85 = 203 were correctly predicted while 97 were incorrect. This shows that the accuracy of the model is about 67.7%, while the misclassification rate is about 32.3%.
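
The arithmetic can be reproduced directly from the matrix. This is a minimal sketch that hard-codes the four counts from the output above rather than calling `pred_table()`:

```python
import numpy as np

# Counts from the confusion matrix above (rows: actual, columns: predicted)
cm = np.array([[118, 46],
               [51, 85]])

accuracy = np.trace(cm) / cm.sum()   # (118 + 85) / 300
misclassification = 1 - accuracy     # (46 + 51) / 300
print(f'{accuracy:.4f} {misclassification:.4f}')
```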

We will drop the insignificant predictors (those with high p-values) and fit a more streamlined model that includes only Price, Num_rooms, and Poor_prop as inputs:

In [8]:
targets = data['Sold']
inputs = data[['Price', 'Num_rooms', 'Poor_prop']]

In [9]:
x = sm.add_constant(inputs)
reg_log = sm.Logit(targets,x)
results_log = reg_log.fit()
results_log.summary()

Optimization terminated successfully.
         Current function value: 0.626278
         Iterations 5


0,1,2,3
Dep. Variable:,Sold,No. Observations:,300.0
Model:,Logit,Df Residuals:,296.0
Method:,MLE,Df Model:,3.0
Date:,"Tue, 25 Jul 2023",Pseudo R-squ.:,0.09075
Time:,16:23:31,Log-Likelihood:,-187.88
converged:,True,LL-Null:,-206.64
Covariance Type:,nonrobust,LLR p-value:,3.599e-08

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.1079,2.197,-0.049,0.961,-4.414,4.198
Price,-0.1811,0.039,-4.649,0.000,-0.257,-0.105
Num_rooms,0.9246,0.431,2.144,0.032,0.079,1.770
Poor_prop,-0.1371,0.033,-4.159,0.000,-0.202,-0.073
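
To see how these coefficients turn into a prediction, the sketch below applies the logistic formula p = 1 / (1 + exp(-z)), where z is the constant plus each coefficient times its predictor value. It uses the rounded coefficients from the table above and row 0 of the data preview, so the result is approximate.

```python
import math

# Coefficients from the reduced-model summary above
coef = {'const': -0.1079, 'Price': -0.1811, 'Num_rooms': 0.9246, 'Poor_prop': -0.1371}
# Row 0 of the data preview (its actual Sold value is 0)
row = {'const': 1.0, 'Price': 24.0, 'Num_rooms': 6.575, 'Poor_prop': 4.98}

log_odds = sum(coef[name] * row[name] for name in coef)
probability = 1 / (1 + math.exp(-log_odds))
predicted_class = int(probability >= 0.5)  # same 0.5 cut-off used by pred_table()
print(round(log_odds, 3), round(probability, 3), predicted_class)
```

The predicted probability comes out at roughly 0.72, so this row would be classified as sold even though its actual value is 0, which is one of the false positives counted in the confusion matrix.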


We create a confusion matrix for this model as well:

In [12]:
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns= ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index = {0:'Actual 0' ,  1:'Actual 1'})
cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,107.0,57.0
Actual 1,57.0,79.0


After dropping the insignificant predictors, the confusion matrix shows that out of 300 observations, 107 + 79 = 186 were correctly predicted while 114 were incorrect. This shows that the accuracy of the model is 62%, while the misclassification rate is 38%. This means that dropping these predictors led to **worse** accuracy than the full model achieved.
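
One hedged way to check whether the dropped predictors mattered jointly is a likelihood-ratio test on the two log-likelihoods reported in the summaries above. This is an addition to the exercise, not part of Tina's workflow, and it assumes the reduced model is nested in the full one, which holds here because its three predictors are a subset of the original twelve.

```python
from scipy import stats

# Log-likelihoods and model degrees of freedom from the two summaries above
ll_full, df_full = -174.41, 12
ll_reduced, df_reduced = -187.88, 3

lr_stat = 2 * (ll_full - ll_reduced)       # likelihood-ratio statistic
df_diff = df_full - df_reduced             # number of dropped predictors
p_value = stats.chi2.sf(lr_stat, df_diff)  # upper-tail chi-square probability
print(round(lr_stat, 2), df_diff, p_value)
```

A small p-value here would suggest that the nine dropped predictors carry information as a group, which is consistent with the drop in accuracy.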

**Hypothetical:** After some discussion with Tina about the quality of her model, her supervisor decides to add an extra feature in an effort to improve the model's accuracy. He believes that the presence of an airport in the city would be a significant factor in whether a property is successfully sold:

In [15]:
targets = data['Sold']
inputs = data[['Price', 'Num_rooms', 'Poor_prop', 'Airport']]
x = sm.add_constant(inputs)
reg_log = sm.Logit(targets,x)
results_log = reg_log.fit()
results_log.summary()

Optimization terminated successfully.
         Current function value: 0.625240
         Iterations 5


0,1,2,3
Dep. Variable:,Sold,No. Observations:,300.0
Model:,Logit,Df Residuals:,295.0
Method:,MLE,Df Model:,4.0
Date:,"Tue, 25 Jul 2023",Pseudo R-squ.:,0.09226
Time:,16:34:24,Log-Likelihood:,-187.57
converged:,True,LL-Null:,-206.64
Covariance Type:,nonrobust,LLR p-value:,1.055e-07

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.1113,2.200,-0.051,0.960,-4.422,4.200
Price,-0.1782,0.039,-4.563,0.000,-0.255,-0.102
Num_rooms,0.9273,0.431,2.149,0.032,0.082,1.773
Poor_prop,-0.1339,0.033,-4.047,0.000,-0.199,-0.069
Airport,-0.2010,0.255,-0.790,0.430,-0.700,0.298


The Airport coefficient is -0.201 and is statistically insignificant (p-value = 0.430).
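
Exponentiating a logit coefficient gives an odds ratio, which can be easier to read. The sketch below does this for the Airport coefficient and its 95% confidence bounds, all copied from the summary above:

```python
import math

# Airport coefficient and 95% confidence bounds from the summary above
coef, ci_low, ci_high = -0.2010, -0.700, 0.298

odds_ratio = math.exp(coef)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)
print(round(odds_ratio, 3), round(or_low, 3), round(or_high, 3))
```

The odds ratio is about 0.82, but its interval spans 1 (no effect), which matches the insignificant p-value.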

**Hypothetical:** Tina's supervisor thinks every model should be tested on an unfamiliar dataset. Using the 'TrustProperty_Test.csv' file, calculate the accuracy and misclassification rate of his model.

In [17]:
test = pd.read_csv('TrustProperty_Test.csv')

In [18]:
test_actual = test['Sold']
test_data = test.drop(['Sold'],axis=1)
test_data = sm.add_constant(test_data)


In [19]:
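# Build a confusion matrix and accuracy from a fitted model and a labelled dataset
def confusion_matrix(data, actual_values, model):
    # Predicted probabilities are obtained by feeding the data into the model
    pred_values = model.predict(data)
    # Bin edges split both the actual labels and the predicted probabilities at 0.5
    bins = np.array([0, 0.5, 1])
    # Rows are actual classes, columns are predicted classes
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Accuracy is the share of observations on the diagonal
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    return cm, accuracy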
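
The binning trick inside this function can be checked on a handful of made-up values. The labels and probabilities below are purely illustrative and do not come from either CSV file.

```python
import numpy as np

# Toy labels and predicted probabilities (made up for illustration)
actual = np.array([0, 0, 1, 1, 1])
predicted_prob = np.array([0.2, 0.7, 0.4, 0.9, 0.5])

bins = np.array([0, 0.5, 1])
# Rows index the actual class, columns the predicted class
cm = np.histogram2d(actual, predicted_prob, bins=bins)[0]
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(cm)
print(accuracy)
```

A probability of exactly 0.5 falls into the second bin, so it counts as a predicted 1.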

In [22]:
cm = confusion_matrix(test_data, test_actual, results_log)
cm_df = pd.DataFrame(cm[0])
cm_df.columns= ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index = {0:'Actual 0' ,  1:'Actual 1'})
cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,3.0,5.0
Actual 1,0.0,2.0


In [23]:
print('Misclassification Rate: ' + str((5 + 0) / 10))

Misclassification Rate: 0.5


From the confusion matrix on the test data, we see that the supervisor's model performs worse than Tina's model: its test accuracy is only 50%, which also means a misclassification rate of 50%. Note that the test set contains only 10 observations, so this estimate is noisy. We also noticed from the earlier summary tables that Teachers and N_hos_beds had significant p-values. Let's run another logistic regression that includes them and check that model's accuracy:
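
To get a feel for how uncertain an accuracy figure is on so few rows, the sketch below computes a rough standard error for a proportion. It relies on the normal approximation, which is crude at this sample size, so the interval is only indicative.

```python
import math

n_test = 10        # observations in the test confusion matrix above
accuracy = 0.5     # (3 + 2) / 10

# Normal-approximation standard error of a proportion (rough at n = 10)
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err
print(round(std_err, 3), round(low, 2), round(high, 2))
```

The resulting range is wide enough to include the 62% training accuracy of the three-predictor model, so a larger test set would be needed to draw a firm conclusion.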

In [24]:
data = raw_data.copy()
targets = data['Sold']
inputs = data[['Price', 'Num_rooms', 'Poor_prop', 'Teachers', 'N_hos_beds']]

In [25]:
x = sm.add_constant(inputs)
reg_log = sm.Logit(targets,x)
results_log = reg_log.fit()
results_log.summary()

Optimization terminated successfully.
         Current function value: 0.589436
         Iterations 6


0,1,2,3
Dep. Variable:,Sold,No. Observations:,300.0
Model:,Logit,Df Residuals:,294.0
Method:,MLE,Df Model:,5.0
Date:,"Tue, 25 Jul 2023",Pseudo R-squ.:,0.1442
Time:,16:59:14,Log-Likelihood:,-176.83
converged:,True,LL-Null:,-206.64
Covariance Type:,nonrobust,LLR p-value:,1.464e-11

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-8.8004,3.024,-2.910,0.004,-14.728,-2.873
Price,-0.2689,0.047,-5.707,0.000,-0.361,-0.177
Num_rooms,1.4528,0.472,3.076,0.002,0.527,2.378
Poor_prop,-0.1569,0.034,-4.549,0.000,-0.224,-0.089
Teachers,0.2728,0.067,4.044,0.000,0.141,0.405
N_hos_beds,0.2069,0.082,2.520,0.012,0.046,0.368


In [26]:
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns= ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index = {0:'Actual 0' ,  1:'Actual 1'})
cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,117.0,47.0
Actual 1,51.0,85.0


The accuracy of this model from the confusion matrix comes out to 67.33% ((117 + 85) / 300), which is still slightly lower than the 67.67% of Tina's original model, which did not exclude any predictors. The following methods could help improve the model:

1. **Regularization:** Regularization is a technique that penalizes the model for having large coefficients. This can help to reduce overfitting and improve the model's predictive performance. There are two main types of regularization: L1 regularization and L2 regularization. L1 regularization penalizes the absolute values of the coefficients, while L2 regularization penalizes the squared values of the coefficients.
2. **Cross-validation:** Cross-validation is a technique for estimating how a model performs on unseen data. The data is split into k folds; the model is trained on k-1 folds and tested on the remaining one, and the process repeats until every fold has served as the test set. Averaging the scores gives a more reliable estimate than a single train/test split and helps reveal overfitting.
3. **Hyperparameter tuning:** The hyperparameters of a logistic regression model are settings chosen before fitting that control how the model is fit. Tuning them can improve the model's performance. Depending on the solver, they include the regularization strength, the type of penalty, the maximum number of iterations and, for gradient-based solvers, the learning rate.