In [1]:
# Library imports and .py file import

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


import sys
sys.path.append('../assets')  # make ames.py importable without changing the working directory
import ames

In [2]:
# Uncomment to display all columns and rows when inspecting the dataframes below

# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [3]:
# Read the CSVs into two separate dataframes

df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/test.csv')

I read the training and testing data into two separate dataframes because both datasets have to be cleaned, mapped, and preprocessed in the same way. I then dropped two outlier rows from the training data using the .drop() calls below. The second .drop() call has no effect on this dataset, since the two rows removed by the first call are the same rows the second would have removed. I kept both calls in case the input dataset is ever expanded.

In [4]:
# Drop outlier houses based on Total Bsmt SF and Gr Liv Area (renamed later to bsmt_sf and gr_liv_area)
# https://www.geeksforgeeks.org/drop-rows-from-the-dataframe-based-on-certain-condition-applied-on-a-column/

df_train.drop(df_train[df_train['Total Bsmt SF'] > 5000].index, inplace=True)
df_train.drop(df_train[df_train['Gr Liv Area'] > 5000].index, inplace=True)
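
The two drops above could also be written as a single boolean mask, which makes it easier to add more thresholds if the input dataset grows. This is only a sketch: the helper name drop_outliers and the toy data are made up for illustration, and only the 5000 thresholds come from the cell above.

```python
import pandas as pd

def drop_outliers(df, limits):
    """Return a copy of df without rows that exceed any per-column limit.

    `limits` maps a column name to its maximum allowed value.
    (Hypothetical helper, not part of ames.py.)
    """
    mask = pd.Series(False, index=df.index)
    for col, limit in limits.items():
        # A row is flagged as soon as one column is over its limit
        mask |= df[col] > limit
    return df.loc[~mask].copy()

# Toy data, invented for this example
toy = pd.DataFrame({
    'Total Bsmt SF': [800, 6110, 1200, 900],
    'Gr Liv Area': [1500, 5642, 5100, 1100],
})

trimmed = drop_outliers(toy, {'Total Bsmt SF': 5000, 'Gr Liv Area': 5000})
```

Unlike the inplace version, this returns a new dataframe and leaves the input untouched, so the cell can be rerun safely.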

Next, I ran both dataframes through the two functions in my Python file. The first function, ames.rename(), renamed all the columns and trimmed the dataset to just the columns I was interested in. It uses an if/else so that it works on both dataframes, since df_test has no SalePrice column. The second function, ames.map(), filled NAs, remapped the column categories that needed adjusting, and recategorized certain quality or condition metrics that I wanted to combine. It also rewrites certain values of cond_1 when the cond_2 value is a worse condition. Most of the conditions with the greatest negative impact were already listed in cond_1, so this mapping let me drop the cond_2 column, which simplifies the model and keeps the focus on the worst conditions. I also processed several feature combinations and interaction terms in my Feature Engineering.

In [5]:
df_train_renamed = ames.rename(df_train)
df_test_renamed = ames.rename(df_test)
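
For readers without access to ames.py, a rename step that handles the missing SalePrice column might look roughly like this. It is a sketch under assumptions: COLUMN_MAP, the snake_case target names, and rename_sketch are hypothetical stand-ins, not the actual contents of ames.rename().

```python
import pandas as pd

# Hypothetical subset of the column mapping; the real one lives in ames.py
COLUMN_MAP = {
    'Id': 'id',
    'Gr Liv Area': 'gr_liv_area',
    'Total Bsmt SF': 'total_bsmt_sf',
    'Condition 1': 'cond_1',
    'Condition 2': 'cond_2',
}
TARGET = 'SalePrice'

def rename_sketch(df):
    """Keep only the mapped columns and rename them (illustrative only)."""
    mapping = dict(COLUMN_MAP)
    if TARGET in df.columns:
        # Training data: carry the target column along
        mapping[TARGET] = 'saleprice'
    keep = [col for col in mapping if col in df.columns]
    return df[keep].rename(columns=mapping)

# Toy frames, invented for this example
train_toy = pd.DataFrame({
    'Id': [1], 'Gr Liv Area': [1500], 'Total Bsmt SF': [800],
    'Condition 1': ['Norm'], 'Condition 2': ['Norm'],
    'SalePrice': [200000], 'Misc Val': [0],
})
test_toy = train_toy.drop(columns='SalePrice')

train_renamed = rename_sketch(train_toy)
test_renamed = rename_sketch(test_toy)
```

Checking for the target column inside the function means the same call works for both dataframes.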

In [6]:
df_train_cleaned = ames.map(df_train_renamed)
df_test_cleaned = ames.map(df_test_renamed)
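
The cond_1/cond_2 adjustment described above can be illustrated with a small sketch. The severity ranking below is an assumption made up for this example (higher means worse), as is the function name merge_conditions; the actual rules are in ames.map().

```python
import pandas as pd

# Hypothetical severity ranking, higher = worse; not taken from ames.py
SEVERITY = {'PosA': 0, 'PosN': 0, 'Norm': 1, 'Feedr': 2, 'Artery': 3}

def merge_conditions(df):
    """Overwrite cond_1 with cond_2 where cond_2 is worse, then drop cond_2."""
    out = df.copy()
    sev_1 = out['cond_1'].map(SEVERITY)
    sev_2 = out['cond_2'].map(SEVERITY)
    # Unmapped values become NaN, and NaN comparisons are False,
    # so those rows keep their original cond_1
    worse = sev_2 > sev_1
    out.loc[worse, 'cond_1'] = out.loc[worse, 'cond_2']
    return out.drop(columns='cond_2')

# Toy data, invented for this example
toy = pd.DataFrame({
    'cond_1': ['Norm', 'Feedr', 'Artery'],
    'cond_2': ['Norm', 'Artery', 'Norm'],
})

merged = merge_conditions(toy)
```

Only the second row changes here, because it is the only one where cond_2 ranks worse than cond_1.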

Lastly, I saved my renamed and mapped data to new CSVs.

In [7]:
df_test_cleaned.to_csv('../data/test_cleaned.csv', index=False)
df_train_cleaned.to_csv('../data/train_cleaned.csv', index=False)