# Evaluating the Test Data

This notebook exists solely to compare the test data set against the training set. I don't want to spend too much time on it, since the test data does not go into fitting my model. When fit issues arise, however, it helps to see where categories or columns differ between the two sets so that I can account for those disparities in my cleaning and modeling.

In [1]:
#imports
from scipy.stats import ttest_ind
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import re
#week 3
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
#week 4
from sklearn.linear_model import LogisticRegression


In [2]:
test = pd.read_csv('./datasets/test.csv') #read in test data

In [3]:
train = pd.read_csv('./datasets/train.csv')

The training set has one more column than the test set. This is the sale price, which is the target I am predicting and is therefore not provided in the test set.

In [11]:
test.shape

(878, 80)

In [12]:
train.shape

(2051, 81)
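Rather than inferring the extra column from the shapes alone, the column names can be compared directly. The snippet below is a minimal sketch on small made-up frames (`train_demo` and `test_demo` are hypothetical stand-ins, and the target column name `SalePrice` is an assumption about how it is spelled in the CSV). The same two `difference` calls should work on the real `train` and `test` frames.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames; all values are made up.
# 'SalePrice' is an assumed spelling of the target column.
train_demo = pd.DataFrame({
    "Id": [1, 2],
    "Neighborhood": ["A", "B"],
    "SalePrice": [130500, 220000],
})
test_demo = pd.DataFrame({
    "Id": [3],
    "Neighborhood": ["A"],
})

# Columns present in one frame but not the other
only_in_train = train_demo.columns.difference(test_demo.columns)
only_in_test = test_demo.columns.difference(train_demo.columns)

print("Only in train:", list(only_in_train))
print("Only in test:", list(only_in_test))
```

If the only name reported for the training frame is the target, the two sets otherwise share the same columns.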

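I wanted to know how many neighborhoods appear in the test set compared with the training set: there are 28 in the training set but only 26 in the test set. Any neighborhood dummies built from the training data may therefore include columns that are missing from the test data, so I need to be sure my model can account for this difference.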

In [9]:
test['Neighborhood'].nunique() #number of distinct neighborhoods in the test data

26

In [8]:
train['Neighborhood'].nunique() #number of distinct neighborhoods in the training data

28
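The counts show that the two sets differ, but not which neighborhoods are involved. The sketch below, again on made-up data with placeholder neighborhood names, lists the categories missing from each side and then shows one possible way to keep dummy columns aligned: reindexing the test dummies onto the training columns. This is an illustrative approach under those assumptions, not necessarily the one used in my cleaning notebook.

```python
import pandas as pd

# Toy stand-ins; "A", "B", "C" are placeholder neighborhood names.
train_demo = pd.DataFrame({"Neighborhood": ["A", "B", "C", "A"]})
test_demo = pd.DataFrame({"Neighborhood": ["A", "B"]})

train_hoods = set(train_demo["Neighborhood"].dropna())
test_hoods = set(test_demo["Neighborhood"].dropna())

# Categories that appear on only one side
missing_from_test = sorted(train_hoods - test_hoods)
missing_from_train = sorted(test_hoods - train_hoods)
print("In train but not test:", missing_from_test)
print("In test but not train:", missing_from_train)

# One way to align dummies: build them on train, then force the test
# dummies onto the same columns, filling absent categories with 0.
train_dummies = pd.get_dummies(train_demo["Neighborhood"])
test_dummies = pd.get_dummies(test_demo["Neighborhood"]).reindex(
    columns=train_dummies.columns, fill_value=0
)
```

Note that reindexing onto the training columns also drops any category that appears only in the test set, so it is worth checking `missing_from_train` before relying on it.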

I decided to try making Overall Quality a categorical variable instead of a numerical one. After Charlie's lesson on dummies and categoricals, it struck me that while higher quality is "better", we can't say definitively that each 1-point step in quality corresponds to the same change in value.

I discovered, however, that turning 'Overall Qual' into dummies left my train set with one more column than the test set. It turns out this is because there are no quality ratings of "1" in the test set (displayed below). I therefore revised my cleaning process to combine categories 1 and 2 into a single category, "1.5", so that the columns match up.

In [10]:
test['Overall Qual'].value_counts() #the test set value counts

5     262
6     226
7     171
8     100
4      67
9      30
3      11
10      7
2       4
Name: Overall Qual, dtype: int64

In [12]:
train['Overall Qual'].value_counts() #the train set value counts

5     563
6     506
7     431
8     250
4     159
9      77
3      29
10     23
2       9
1       4
Name: Overall Qual, dtype: int64
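The fix described above, merging ratings 1 and 2 into a single "1.5" category before creating dummies, can be sketched as follows. The frames are small made-up examples and `merge_low_quality` is a hypothetical helper name; the actual cleaning code lives elsewhere and may differ.

```python
import pandas as pd

# Toy stand-ins: the train frame has a rating of 1, the test frame does not.
train_demo = pd.DataFrame({"Overall Qual": [1, 2, 5, 6, 10]})
test_demo = pd.DataFrame({"Overall Qual": [2, 5, 6, 10]})

def merge_low_quality(df):
    """Return a copy with quality ratings 1 and 2 combined as 1.5."""
    out = df.copy()
    # Cast to float first so both frames produce identically named dummies
    out["Overall Qual"] = (
        out["Overall Qual"].astype(float).replace({1.0: 1.5, 2.0: 1.5})
    )
    return out

# Treat the merged rating as categorical by dummy-encoding it
train_q = pd.get_dummies(merge_low_quality(train_demo), columns=["Overall Qual"])
test_q = pd.get_dummies(merge_low_quality(test_demo), columns=["Overall Qual"])

# After merging, both frames should have the same dummy columns
print(list(train_q.columns) == list(test_q.columns))
```

Applying the same helper to both frames is the key design choice here: the merge must happen identically on train and test, or the dummy columns will drift apart again.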