GA Data Science - Final Project
James Clark

Real estate value can be extremely subjective. Almost all single-family and rental properties can be summed up by a few standard objective features: square footage, number of bedrooms and bathrooms, lot size, and so on. However, other features that may attract potential purchasers are not as easily quantifiable.

The goal of this analysis is to use the language of a property listing's description to identify other features of the property. Additionally, a value could be associated with these less quantifiable features.

Analysis will be done using a set of Trulia property listing data found on kaggle.com (https://www.kaggle.com/promptcloud/real-estate-data-from-trulia).

In [None]:
! pip install stringcase

In [57]:
import pandas as pd
import stringcase
import re
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib

Using matplotlib backend: Qt5Agg


In [125]:
def drop_unused_columns(df: pd.DataFrame):
    #Drop unused columns
    columns_to_keep = {'description', 'price', 'sqr_ft', 'style', 'longitude', 'latitude', 'lot_size', 'beds', 'bath'}
    df.drop(columns = df.columns.difference(columns_to_keep), inplace = True)
    
    return df

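def numerize_columns(df: pd.DataFrame):
    # Strip everything except digits and decimal points (currency symbols,
    # thousands separators, unit labels). Values remain strings; nulls are left as-is.
    for column in ['sqr_ft', 'price', 'lot_size']:
        df[column] = df[column].apply(lambda value: re.sub(r"[^\d.]+", "", str(value)) if pd.notnull(value) else value)
    
    return df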

def remove_rows_with_empty_descriptions(df: pd.DataFrame):
    df.dropna(subset = ['description'], inplace = True)
    return df
    
def import_and_clean(filename: str):
    df = pd.read_csv(filename)    
    #Snake case columns    
    df.columns = [stringcase.snakecase(column.lower()) for column in df.columns]    
    df = drop_unused_columns(df)    
    
    # Strip non-numeric characters from sqr_ft, price, and lot_size
    df = numerize_columns(df)
    
    #Clean na values
    df = remove_rows_with_empty_descriptions(df)
    
    # TODO: clean description of stopwords, stem words (not implemented yet)
    
    return df
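
The description-cleaning step is only a comment in `import_and_clean`. A minimal sketch of what it could look like follows; this is not the notebook's implementation. `clean_description` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word list and identity stemmer are stand-ins so the sketch runs without NLTK data.

```python
import re
from typing import Callable, Iterable

# Hypothetical, deliberately tiny stop-word list so the sketch runs without
# downloading any corpus; swap in a full English list in practice.
DEMO_STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "in", "is", "of", "for", "to", "this", "with"}
)


def clean_description(
    text: str,
    stop_words: Iterable[str] = DEMO_STOP_WORDS,
    stem: Callable[[str], str] = lambda word: word,
) -> str:
    """Lowercase, tokenize, drop stop words, and stem a listing description."""
    stop_words = set(stop_words)
    # Keep runs of letters only; digits and punctuation are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [stem(token) for token in tokens if token not in stop_words]
    return " ".join(kept)


print(clean_description("UPDATED EAST DALLAS HOME READY FOR MOVE-IN."))
# updated east dallas home ready move
```

With the imports already in this notebook, one option would be to pass `stop_words=stopwords.words('english')` (which requires `nltk.download('stopwords')` first) and `stem=LancasterStemmer().stem`, then apply the function with `df['description'].apply(clean_description)`.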

In [126]:
df = import_and_clean('data/home/sdf/marketing_sample_for_trulia_com-real_estate__20190901_20191031__30k_data.csv')
df.head()

Unnamed: 0,description,price,style,sqr_ft,longitude,latitude,lot_size,beds,bath
0,NEW CONSTRUCTION in the North Central Corrido...,895900,4 Beds / 4.5 Baths,3447,-112.081985,33.560055,7895.0,4.0,4.5
1,UPDATED EAST DALLAS HOME READY FOR MOVE-IN. H...,247000,3 Beds / 2 Baths,1767,-96.67625,32.829227,7877.0,3.0,2.0
2,This single-family home is located at 30 Hurl...,44900,3 Beds / 1 Bath,1232,-78.82519,42.913,3510.0,3.0,1.0
3,"Beautiful semi detached, ranch type corner ho...",959000,3 Beds / 2 Baths,1417,-73.86017,40.72296,2598.0,3.0,2.0
4,"great investor opportunity!!! , beautiful stu...",83500,Studio / 1 Bath,440,-80.206314,25.937965,,,1.0
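
`numerize_columns` strips non-numeric characters but leaves the results as strings. If actual numbers are needed later (for example, to model price), a conversion along these lines might help. `to_number` is a hypothetical helper, and the raw input strings below are made up for illustration since the raw CSV formatting is not shown here.

```python
import re
from typing import Optional


def to_number(raw) -> Optional[float]:
    """Strip non-numeric characters, then parse as float; None if nothing parses."""
    if raw is None:
        return None
    digits = re.sub(r"[^\d.]+", "", str(raw))
    try:
        return float(digits)
    except ValueError:
        # Covers empty results and leftovers such as "1.5." that are not valid floats.
        return None


print(to_number("$895,900"))    # 895900.0
print(to_number("3,447 sqft"))  # 3447.0
print(to_number("n/a"))         # None
```

In pandas, `pd.to_numeric(df[column], errors='coerce')` applied after the regex cleanup should give a similar result, with unparseable values becoming `NaN`.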


In [131]:
df[df.beds.isna()].head(50)

Unnamed: 0,description,price,style,sqr_ft,longitude,latitude,lot_size,beds,bath
4,"great investor opportunity!!! , beautiful stu...",83500.0,Studio / 1 Bath,440,-80.206314,25.937965,,,1.0
51,"Charming, front-facing studio overlooking pic...",534000.0,Studio / 1 Bath,410,-71.07848,42.35354,410.0,,1.0
66,"South Side Slopes - Nice cleared lot, able to...",32900.0,,1476,-79.96876,40.419426,5040.0,,
73,Investor opportunity! This property is being ...,165149.0,,1672,-76.68773,39.252,7500.0,,2.0
83,Vacant lot near So. Pasadena in Monterey Hill...,325000.0,,42288,-118.1917,34.1142,0.97,,
109,Bright and sunny duplex on an oversized lot a...,19900.0,,2268,-87.9263,43.081432,6969.0,,
223,Completely remodeled elegant unit on the 11th...,299000.0,Studio / 1 Bath,490,-118.184616,33.768402,0.52,,1.0
230,"Motivated Seller, Empty lot in close Distance...",49000.0,,1548,-75.15996,39.98902,1146.0,,
265,This Duplex Rowhome is the perfect opportunit...,475000.0,,1746,-75.165855,39.975178,1583.0,,
332,This property is no longer available to rent o...,137771.0,,1920,-96.022554,36.091499,10367.0,,3.0
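
The rows above suggest two different reasons for a missing `beds` value: studios, where `style` reads `Studio / 1 Bath`, and listings with no `style` at all (lots, duplexes, and so on). A possible way to recover bed and bath counts from `style` is sketched below. `parse_style` is a hypothetical helper, and treating a studio as zero bedrooms is an assumption, not something the dataset states.

```python
import re
from typing import Optional, Tuple


def parse_style(style) -> Tuple[Optional[float], Optional[float]]:
    """Return (beds, baths) parsed from a string like '3 Beds / 2 Baths'."""
    if not isinstance(style, str):
        return None, None  # missing style (e.g. NaN): nothing to recover

    beds_match = re.search(r"(\d+(?:\.\d+)?)\s*Beds?", style, flags=re.IGNORECASE)
    if beds_match:
        beds = float(beds_match.group(1))
    elif re.search(r"studio", style, flags=re.IGNORECASE):
        beds = 0.0  # assumption: a studio counts as zero bedrooms
    else:
        beds = None

    baths_match = re.search(r"(\d+(?:\.\d+)?)\s*Baths?", style, flags=re.IGNORECASE)
    baths = float(baths_match.group(1)) if baths_match else None
    return beds, baths


print(parse_style("Studio / 1 Bath"))     # (0.0, 1.0)
print(parse_style("4 Beds / 4.5 Baths"))  # (4.0, 4.5)
```

Rows with no `style` still come back as `(None, None)`, so they would need separate handling, for instance dropping them or flagging them as non-residential listings.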
