 # **US Rental House Price Prediction using Machine Learning**

---

#### **Author:** Kartik

#### **Contact:** kkakar664@gmail.com

#### **Date:** Oct 16, 2023

## Table of Contents

1. [Introduction](#introduction)
2. [Key Questions](#keyquestions)
3. [Methods](#methods)
4. [Loading and Setup](#loading)
5. [Feature Overview](#overview)
6. [Understanding our Data](#understanding)
7. [Checking and Dealing with missing values](#dealing)
8. [Handling Features](#handling)
9. [Summary](#summary)

## Introduction <a class="anchor" id="introduction"></a>
---

This project focuses on using data science techniques, like Linear Regression and Decision Trees, to shed light on the dynamic world of rental house prices in the United States. Our goal is to construct a predictive model that can accurately estimate these prices, providing valuable insights for renters, landlords, and the real estate industry. In this preliminary phase, we will do some Data Cleaning and Exploratory Data Analysis (EDA) to lay the foundation for our predictive journey, revealing any trends, patterns and relationships within the data. This project aims to be a valuable resource for anyone seeking to navigate the complexities of the U.S. rental housing market.

## Key Questions <a class="anchor" id="keyquestions"></a>
---

#### The analysis aims to answer the following question:
Using machine learning, can we accurately predict monthly rental house prices in the US from the given attributes, and what are the key factors influencing these prices?

This question serves as the foundation for the analysis and provides a framework to explore the relationships and gather insights from our dataset.

## Methods <a class="anchor" id="methods"></a>
---

We used the Python programming language and its libraries: pandas for data manipulation, matplotlib and seaborn for data visualization, and scikit-learn (sklearn) for machine learning and data preprocessing. We worked in Jupyter Notebook, which allows for interactive programming and visualization.

## Loading and Setup <a class="anchor" id="loading"></a>
---

In [1]:
# If the libraries are not preinstalled, please uncomment the following and install as needed:
# pip install numpy
# pip install pandas
# pip install matplotlib
# pip install seaborn
# pip install scipy
# pip install statsmodels


# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
# Filter warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# So we don't have to write 'print' every time we want to display the output.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
# Load the dataset from the "raw_dataset.csv" file.
df = pd.read_csv('raw_dataset.csv')

In [4]:
# Set display options to show all columns.
pd.set_option('display.max_columns', None)


In [5]:
# Display df / Sanity check
df

Unnamed: 0,id,url,region,region_url,price,type,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished,laundry_options,parking_options,image_url,description,lat,long,state
0,7049044568,https://reno.craigslist.org/apa/d/reno-beautif...,reno / tahoe,https://reno.craigslist.org,1148,apartment,1078,3,2.0,1,1,0,0,0,0,w/d in unit,carport,https://images.craigslist.org/01616_daghmBUvTC...,Ridgeview by Vintage is where you will find al...,39.5483,-119.796,ca
1,7049047186,https://reno.craigslist.org/apa/d/reno-reduced...,reno / tahoe,https://reno.craigslist.org,1200,condo,1001,2,2.0,0,0,0,0,0,0,w/d hookups,carport,https://images.craigslist.org/00V0V_5va0MkgO9q...,Conveniently located in the middle town of Ren...,39.5026,-119.789,ca
2,7043634882,https://reno.craigslist.org/apa/d/sparks-state...,reno / tahoe,https://reno.craigslist.org,1813,apartment,1683,2,2.0,1,1,1,0,0,0,w/d in unit,attached garage,https://images.craigslist.org/00t0t_erYqC6LgB8...,2BD | 2BA | 1683SQFTDiscover exceptional servi...,39.6269,-119.708,ca
3,7049045324,https://reno.craigslist.org/apa/d/reno-1x1-fir...,reno / tahoe,https://reno.craigslist.org,1095,apartment,708,1,1.0,1,1,1,0,0,0,w/d in unit,carport,https://images.craigslist.org/00303_3HSJz75zlI...,MOVE IN SPECIAL FREE WASHER/DRYER WITH 6 OR 12...,39.4477,-119.771,ca
4,7049043759,https://reno.craigslist.org/apa/d/reno-no-long...,reno / tahoe,https://reno.craigslist.org,289,apartment,250,0,1.0,1,1,1,1,0,1,laundry on site,,https://images.craigslist.org/01616_fALAWFV8zQ...,"Move In Today: Reno Low-Cost, Clean & Furnishe...",39.5357,-119.805,ca
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384972,7049053337,https://reno.craigslist.org/apa/d/reno-2x2-thi...,reno / tahoe,https://reno.craigslist.org,1295,apartment,957,2,2.0,1,1,1,0,0,0,w/d in unit,carport,https://images.craigslist.org/00O0O_atyH2pgYeH...,MOVE IN SPECIAL FREE WASHER/DRYER WITH 6 OR 12...,39.4477,-119.771,ca
384973,7049052968,https://reno.craigslist.org/apa/d/sparks-over-...,reno / tahoe,https://reno.craigslist.org,1549,apartment,1034,2,2.0,1,1,0,0,0,0,w/d in unit,,https://images.craigslist.org/00808_3EobCZHFEx...,AN OASIS OF YOUR OWN Introducing Lumina at Spa...,39.6269,-119.708,ca
384974,7049050454,https://reno.craigslist.org/apa/d/sparks-1mont...,reno / tahoe,https://reno.craigslist.org,1249,apartment,840,2,1.0,1,1,1,0,0,0,laundry on site,off-street parking,https://images.craigslist.org/01111_kr3uKMhzrf...,***Newly MODERNIZED Apartment Home*** âï¸ ...,39.5358,-119.746,ca
384975,7049050149,https://reno.craigslist.org/apa/d/sparks-ready...,reno / tahoe,https://reno.craigslist.org,1429,apartment,976,2,2.0,1,1,1,0,0,0,w/d in unit,carport,https://images.craigslist.org/00c0c_1GslcQnpLP...,Welcome Home We welcome you to The Villas at D...,39.5585,-119.703,ca


In [6]:
 # Get the shape of our dataset.
df.shape

(384977, 22)

**`Finding`** 

Presently, there are **384,977** rows and **22** columns in our raw dataset.

#### Moving forward

We will do a **deep dive** into our dataset and familiarize ourselves with all the columns. With that knowledge, we will decide which rows and columns are necessary for our analysis and drop the rest. This will transform our raw data into a cleaner, usable dataset on which to train our machine learning models.

## Feature Overview <a class="anchor" id="overview"></a>
---

We will start with looking at all the **features** (also called attributes) in our dataset and understanding what kind of information they hold. 

#### Attributes:
- **id**: Listing id
- **url**: Listing URL
- **region**: Craigslist region
- **region_url**: Craigslist region URL
- **price**: Rent per month (Target Column)
- **type**: Housing type
- **sqfeet**: Total square footage
- **beds**: Number of Beds
- **baths**: Number of Bathrooms
- **cats_allowed**: Cats allowed boolean (1 = yes, 0 = no)
- **dogs_allowed**: Dogs allowed boolean (1 = yes, 0 = no)
- **smoking_allowed**: Smoking allowed boolean (1 = yes, 0 = no)
- **wheelchair_access**: Has wheelchair access boolean (1 = yes, 0 = no)
- **electric_vehicle_charge**: Has electric vehicle charger boolean (1 = yes, 0 = no)
- **comes_furnished**: Comes with furniture boolean (1 = yes, 0 = no)
- **laundry_options**: Laundry options available
- **parking_options**: Parking options available
- **image_url**: URL of the image
- **description**: Description by poster
- **lat**: Latitude
- **long**: Longitude
- **state**: State of listing

## Understanding our Data <a class="anchor" id="understanding"></a>

In [7]:
# Display the first 5  rows of the dataframe (df) to get an overview of the provided data. 
df.head()

Unnamed: 0,id,url,region,region_url,price,type,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished,laundry_options,parking_options,image_url,description,lat,long,state
0,7049044568,https://reno.craigslist.org/apa/d/reno-beautif...,reno / tahoe,https://reno.craigslist.org,1148,apartment,1078,3,2.0,1,1,0,0,0,0,w/d in unit,carport,https://images.craigslist.org/01616_daghmBUvTC...,Ridgeview by Vintage is where you will find al...,39.5483,-119.796,ca
1,7049047186,https://reno.craigslist.org/apa/d/reno-reduced...,reno / tahoe,https://reno.craigslist.org,1200,condo,1001,2,2.0,0,0,0,0,0,0,w/d hookups,carport,https://images.craigslist.org/00V0V_5va0MkgO9q...,Conveniently located in the middle town of Ren...,39.5026,-119.789,ca
2,7043634882,https://reno.craigslist.org/apa/d/sparks-state...,reno / tahoe,https://reno.craigslist.org,1813,apartment,1683,2,2.0,1,1,1,0,0,0,w/d in unit,attached garage,https://images.craigslist.org/00t0t_erYqC6LgB8...,2BD | 2BA | 1683SQFTDiscover exceptional servi...,39.6269,-119.708,ca
3,7049045324,https://reno.craigslist.org/apa/d/reno-1x1-fir...,reno / tahoe,https://reno.craigslist.org,1095,apartment,708,1,1.0,1,1,1,0,0,0,w/d in unit,carport,https://images.craigslist.org/00303_3HSJz75zlI...,MOVE IN SPECIAL FREE WASHER/DRYER WITH 6 OR 12...,39.4477,-119.771,ca
4,7049043759,https://reno.craigslist.org/apa/d/reno-no-long...,reno / tahoe,https://reno.craigslist.org,289,apartment,250,0,1.0,1,1,1,1,0,1,laundry on site,,https://images.craigslist.org/01616_fALAWFV8zQ...,"Move In Today: Reno Low-Cost, Clean & Furnishe...",39.5357,-119.805,ca


In [8]:
# Display the last 5 rows of the dataframe to get an overview of the data
df.tail()

Unnamed: 0,id,url,region,region_url,price,type,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished,laundry_options,parking_options,image_url,description,lat,long,state
384972,7049053337,https://reno.craigslist.org/apa/d/reno-2x2-thi...,reno / tahoe,https://reno.craigslist.org,1295,apartment,957,2,2.0,1,1,1,0,0,0,w/d in unit,carport,https://images.craigslist.org/00O0O_atyH2pgYeH...,MOVE IN SPECIAL FREE WASHER/DRYER WITH 6 OR 12...,39.4477,-119.771,ca
384973,7049052968,https://reno.craigslist.org/apa/d/sparks-over-...,reno / tahoe,https://reno.craigslist.org,1549,apartment,1034,2,2.0,1,1,0,0,0,0,w/d in unit,,https://images.craigslist.org/00808_3EobCZHFEx...,AN OASIS OF YOUR OWN Introducing Lumina at Spa...,39.6269,-119.708,ca
384974,7049050454,https://reno.craigslist.org/apa/d/sparks-1mont...,reno / tahoe,https://reno.craigslist.org,1249,apartment,840,2,1.0,1,1,1,0,0,0,laundry on site,off-street parking,https://images.craigslist.org/01111_kr3uKMhzrf...,***Newly MODERNIZED Apartment Home*** âï¸ ...,39.5358,-119.746,ca
384975,7049050149,https://reno.craigslist.org/apa/d/sparks-ready...,reno / tahoe,https://reno.craigslist.org,1429,apartment,976,2,2.0,1,1,1,0,0,0,w/d in unit,carport,https://images.craigslist.org/00c0c_1GslcQnpLP...,Welcome Home We welcome you to The Villas at D...,39.5585,-119.703,ca
384976,7049050010,https://reno.craigslist.org/apa/d/reno-2x2-thi...,reno / tahoe,https://reno.craigslist.org,1295,apartment,957,2,2.0,1,1,1,0,0,0,w/d in unit,carport,https://images.craigslist.org/00O0O_atyH2pgYeH...,MOVE IN SPECIAL FREE WASHER/DRYER WITH 6 OR 12...,39.4477,-119.771,ca


In [9]:
# Randomly sample 15 rows from the dataframe
df.sample(15)

Unnamed: 0,id,url,region,region_url,price,type,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished,laundry_options,parking_options,image_url,description,lat,long,state
129205,7048353659,https://neworleans.craigslist.org/apa/d/new-or...,new orleans,https://neworleans.craigslist.org,2400,apartment,1127,1,1.0,1,1,0,0,0,0,w/d in unit,attached garage,https://images.craigslist.org/00707_fTNoXIYDCm...,Welcome to The Annex Luxury Apartment Homes! W...,29.9555,-90.0703,la
90255,7048756726,https://chicago.craigslist.org/chc/apa/d/chica...,chicago,https://chicago.craigslist.org,2000,apartment,1500,2,1.0,1,0,1,0,0,0,w/d in unit,off-street parking,https://images.craigslist.org/00M0M_cWhcyPKjsw...,Huge 2 bedroom - 1 Full Bath with In-Unit Sta...,41.9542,-87.7044,il
370568,7041330287,https://fresno.craigslist.org/apa/d/fresno-5-b...,fresno / madera,https://fresno.craigslist.org,1795,house,2500,5,3.0,0,0,1,0,0,0,w/d in unit,attached garage,https://images.craigslist.org/00n0n_gGTauRP8Nr...,Rent is Cheap for that PRIME location and larg...,36.8579,-119.766,ca
140329,7042230912,https://boston.craigslist.org/nos/apa/d/stoneh...,boston,https://boston.craigslist.org,1395,apartment,460,0,1.0,0,0,0,0,0,0,laundry on site,off-street parking,https://images.craigslist.org/00505_aikuMDOhWZ...,"Group Showing Saturday 12/21 @ 11am. Updated,...",42.4828,-71.0978,ma
96290,7048133598,https://stlouis.craigslist.org/apa/d/saint-lou...,"st louis, MO",https://stlouis.craigslist.org,375,apartment,600,1,1.0,1,0,1,0,0,0,w/d hookups,street parking,https://images.craigslist.org/00H0H_5cwf2vXd52...,This is a nice apartment. Three rooms: one bed...,38.7139,-90.2375,il
146529,7048239501,https://battlecreek.craigslist.org/apa/d/battl...,battle creek,https://battlecreek.craigslist.org,818,apartment,1050,2,2.0,1,1,1,0,0,0,laundry on site,off-street parking,https://images.craigslist.org/00X0X_5RgnRHMYeU...,Forest Hills Apartments is located in the hear...,42.2951,-85.1993,mi
295613,7051145090,https://dallas.craigslist.org/sdf/apa/d/lewisv...,dallas / fort worth,https://dallas.craigslist.org,1473,apartment,1060,1,1.5,1,1,1,0,0,0,w/d in unit,,https://images.craigslist.org/00H0H_iMYi2K2rMM...,"We offer so much, you will never wonder what t...",33.0056,-97.017,tx
355239,7047836166,https://anchorage.craigslist.org/apa/d/quiet-w...,anchorage / mat-su,https://anchorage.craigslist.org,1100,apartment,1000,2,1.5,0,0,1,0,0,0,w/d in unit,attached garage,https://images.craigslist.org/00N0N_4w9sZ0HoTs...,This is a clean duplex conveniently located cl...,61.5864,-149.471,ak
355573,7039565064,https://anchorage.craigslist.org/apa/d/anchora...,anchorage / mat-su,https://anchorage.craigslist.org,1450,condo,1190,3,2.0,1,1,0,0,0,0,w/d in unit,detached garage,https://images.craigslist.org/00R0R_ickbbHCczO...,Nicely Updated 3 Bedroom Condo Downtown! ...,61.2084,-149.879,ak
339520,7044497798,https://martinsburg.craigslist.org/apa/d/marti...,eastern panhandle,https://martinsburg.craigslist.org,1314,apartment,1089,2,2.0,1,1,1,0,0,0,w/d in unit,,https://images.craigslist.org/00B0B_lnz8l259Lr...,You will enjoy our community's beautiful lands...,39.4607,-77.9945,wv


We looked at samples of the dataset to get an overview of the data and to see if we have any missing values.

In [10]:
# Display all the column names as a list.
df.columns

Index(['id', 'url', 'region', 'region_url', 'price', 'type', 'sqfeet', 'beds',
       'baths', 'cats_allowed', 'dogs_allowed', 'smoking_allowed',
       'wheelchair_access', 'electric_vehicle_charge', 'comes_furnished',
       'laundry_options', 'parking_options', 'image_url', 'description', 'lat',
       'long', 'state'],
      dtype='object')

**Interpretation**

As we can see, **'.columns'** gives us an Index (an array-like list) of all the column names. We have a lot of useful columns which we will be keeping for our analysis. However, we have to look for any missing information in these columns.

In [11]:
# Display the data types of the dataset
df.dtypes

id                           int64
url                         object
region                      object
region_url                  object
price                        int64
type                        object
sqfeet                       int64
beds                         int64
baths                      float64
cats_allowed                 int64
dogs_allowed                 int64
smoking_allowed              int64
wheelchair_access            int64
electric_vehicle_charge      int64
comes_furnished              int64
laundry_options             object
parking_options             object
image_url                   object
description                 object
lat                        float64
long                       float64
state                       object
dtype: object

In [12]:
# Information about our dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384977 entries, 0 to 384976
Data columns (total 22 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       384977 non-null  int64  
 1   url                      384977 non-null  object 
 2   region                   384977 non-null  object 
 3   region_url               384977 non-null  object 
 4   price                    384977 non-null  int64  
 5   type                     384977 non-null  object 
 6   sqfeet                   384977 non-null  int64  
 7   beds                     384977 non-null  int64  
 8   baths                    384977 non-null  float64
 9   cats_allowed             384977 non-null  int64  
 10  dogs_allowed             384977 non-null  int64  
 11  smoking_allowed          384977 non-null  int64  
 12  wheelchair_access        384977 non-null  int64  
 13  electric_vehicle_charge  384977 non-null  int64  
 14  come

As we can see, this function offers a concise **summary of the dataframe**, providing valuable insights from the output.

In [13]:
# Check to see if there are any 'NaN' values.
df.isnull().sum()

id                              0
url                             0
region                          0
region_url                      0
price                           0
type                            0
sqfeet                          0
beds                            0
baths                           0
cats_allowed                    0
dogs_allowed                    0
smoking_allowed                 0
wheelchair_access               0
electric_vehicle_charge         0
comes_furnished                 0
laundry_options             79026
parking_options            140687
image_url                       0
description                     2
lat                          1918
long                         1918
state                           0
dtype: int64

In [14]:
# Check for duplicate rows
df.duplicated().sum()

0

There are **No duplicated rows** in our dataset.
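One caveat is worth illustrating: `df.duplicated()` compares entire rows, and every listing carries its own 'id' and 'url', so a full-row check cannot flag listings that were posted more than once. In the `df.tail()` output above, rows 384972 and 384976 appear to differ only in those two columns. The following is a minimal sketch on made-up toy data, assuming that the non-identifier columns are what define a listing's content:

```python
import pandas as pd

# Toy frame standing in for the listings data (all values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "url": ["a", "b", "c"],
    "price": [1295, 1295, 1549],
    "sqfeet": [957, 957, 1034],
})

# Full-row check: no duplicates, because 'id' and 'url' differ on every row.
full_row_dupes = toy.duplicated().sum()

# Content check: ignore the identifier columns before comparing rows.
content_cols = [c for c in toy.columns if c not in ("id", "url")]
content_dupes = toy.duplicated(subset=content_cols).sum()

print(full_row_dupes, content_dupes)
```

Whether such repeated listings should be removed is a modeling decision; the sketch only shows how they could be counted.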

#### Based on this summary:

1. The dataframe comprises **384,977** entries, indicating the number of rows.

2. There are **22** columns present in the dataframe.

3. The column names include ***'id', 'url', 'region', 'region_url', 'price', 'type', 'sqfeet', 'beds', 'baths', 'cats_allowed', 'dogs_allowed', 'smoking_allowed', 'wheelchair_access', 'electric_vehicle_charge', 'comes_furnished', 'laundry_options', 'parking_options', 'image_url', 'description', 'lat', 'long', 'state'.***

4. **NaN values:** **79,026** in 'laundry_options', **140,687** in 'parking_options', **1,918** each in 'lat' and 'long', and **2** in 'description'. We have to deal with these values, because neglecting them could bias or skew our model's predictions.

5. The data types of the columns are as follows:
    - **int64**: 'id', 'price', 'sqfeet', 'beds', 'cats_allowed', 'dogs_allowed', 'smoking_allowed', 'wheelchair_access', 'electric_vehicle_charge' and 'comes_furnished'.
    - **float64**: 'baths', 'lat' and 'long'.
    - **Object (string)**: 'url', 'region', 'region_url', 'type', 'laundry_options', 'parking_options', 'image_url', 'description' and 'state'.
    
6. There are **no duplicated rows** in our dataset.

7. The dataframe consumes approximately **64.6+ MB** of memory.

This information is highly valuable for comprehending the data's structure, facilitating subsequent data analysis decisions. For instance, understanding the data types of columns aids in selecting appropriate statistical methods.
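As a small illustration of using data types to guide later steps, the sketch below splits the columns of a toy frame into numeric and non-numeric groups with `select_dtypes`. The toy values and the variable names `numeric_cols` and `text_cols` are assumptions made for this example only.

```python
import pandas as pd

# Toy frame with one integer, one float and one text column (made-up values).
toy = pd.DataFrame({
    "price": [1148, 1200],
    "baths": [2.0, 1.5],
    "type": ["apartment", "condo"],
})

# Numeric columns are candidates for summary statistics and correlations.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (text here) needs encoding before it can feed a model.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```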

In [15]:
# Check the number of unique values in each column
df.nunique()

id                         384977
url                        384977
region                        404
region_url                    413
price                        3961
type                           12
sqfeet                       3277
beds                           11
baths                          20
cats_allowed                    2
dogs_allowed                    2
smoking_allowed                 2
wheelchair_access               2
electric_vehicle_charge         2
comes_furnished                 2
laundry_options                 5
parking_options                 7
image_url                  181068
description                280446
lat                         56772
long                        54035
state                          51
dtype: int64

#### **Interpretation**

Each point in the output generated by **df.nunique()** suggests different insights about the dataset, which can be valuable for data analysis and preprocessing:

- **High Cardinality:** The columns ***'id'*** and ***'url'*** have a unique value for each row, indicating they are likely unique identifiers. These can be useful for data indexing but may not provide meaningful information for analysis.
- **Categorical Variables:** The ***'type'*** column has **12** unique values, suggesting there are 12 distinct property types in the dataset. This could be valuable for categorization and analysis.
- **Binary Columns:** Several columns like ***'cats_allowed', 'dogs_allowed', 'smoking_allowed', 'wheelchair_access', 'electric_vehicle_charge'*** and ***'comes_furnished'*** have only **2** unique values (0 or 1), indicating binary categorical variables (yes or no) related to property features.
- **Numeric Variables:** The ***'sqfeet', 'beds', 'baths'*** and ***'price'*** columns have varying numbers of unique values, indicating numeric or continuous variables. This information is important for understanding the distribution and range of these variables.
- **Text Data:** The ***'description'*** column has a high number of unique values ***(280,446)***. This suggests diverse property descriptions. 
- **Geospatial Information:** The ***'lat'*** and ***'long'*** columns have multiple unique values, suggesting geospatial data related to property locations. These coordinates can be used for mapping and location-based analysis (however we won't be doing that).
- **Categorical Features with Multiple Options:** The ***'laundry_options'*** and ***'parking_options'*** columns have several unique values, indicating multiple options for laundry **(5)** and parking facilities **(7)**, respectively.
- **Regional Data:** The ***'region'*** and ***'state'*** columns hold location-based information, with ***'region'*** having **404 unique values** and ***'state'*** having **51 unique values**. These will be very important for our predictive model, as we will use them as input variables in our analysis.
- **Images:** The ***'image_url'*** column has ***181,068*** unique values across 384,977 rows, which means that many image URLs are shared by more than one listing. Repeated images could affect our model, so we will be dropping this column.
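The grouping described above can also be derived programmatically from the `nunique()` counts. The following is a minimal sketch on toy data; the rules used here (unique count equal to the row count for identifier-like columns, exactly two unique values for binary columns) are illustrative assumptions, not fixed definitions.

```python
import pandas as pd

# Toy frame (made-up values) with an identifier, a binary flag and a category.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "cats_allowed": [1, 0, 1, 1],
    "type": ["apartment", "condo", "apartment", "house"],
})

n_unique = toy.nunique()
n_rows = len(toy)

# Columns with one distinct value per row behave like identifiers.
identifier_like = n_unique[n_unique == n_rows].index.tolist()

# Columns with exactly two distinct values are binary flags.
binary = n_unique[n_unique == 2].index.tolist()

print(identifier_like)
print(binary)
```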
 

## Checking and Dealing with missing values <a class="anchor" id="dealing"></a>
---

In [16]:
# Check the percentage of missing values in each column.
df.isnull().sum()*100/len(df)

id                          0.000000
url                         0.000000
region                      0.000000
region_url                  0.000000
price                       0.000000
type                        0.000000
sqfeet                      0.000000
beds                        0.000000
baths                       0.000000
cats_allowed                0.000000
dogs_allowed                0.000000
smoking_allowed             0.000000
wheelchair_access           0.000000
electric_vehicle_charge     0.000000
comes_furnished             0.000000
laundry_options            20.527460
parking_options            36.544261
image_url                   0.000000
description                 0.000520
lat                         0.498212
long                        0.498212
state                       0.000000
dtype: float64

#### Findings

The code above calculates the percentage of missing (NaN) values in each column of our DataFrame 'df'. The result is expressed as a percentage of the total number of rows in the dataset (384,977 rows). The output provides the following insights:

**1.** Many columns, such as ***'id', 'url', 'region', 'region_url', 'price', 'type', 'sqfeet', 'beds', 'baths', 'cats_allowed', 'dogs_allowed', 'smoking_allowed', 'wheelchair_access', 'electric_vehicle_charge', 'comes_furnished', 'image_url'*** and ***'state'*** have no missing values, indicated by a **0% missing value rate**. This suggests that these columns are complete and contain data for every row.

**2.** ***'laundry_options'*** has a moderate amount of missing data, with approximately **20.53%** of its values being NaN. We **cannot drop** this column, as it may contain essential information that has a large impact on our analysis, so its missing values need further attention.

**3.** Similarly, ***'parking_options'*** has a higher rate of missing data, **36.54%**. This column also has a substantial amount of missing information, and decisions on how to deal with these missing values may impact the analysis.

**4.** ***'description', 'lat'*** and ***'long'*** have a very low percentage of missing values **(about 0.0005% for 'description' and 0.50% for 'lat' and 'long')**, indicating that these columns are almost complete and do not require extensive data imputation. We will not use these columns in our model, so we will be dropping them later.
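The counts and percentages can also be combined into a single table that lists only the affected columns. The sketch below uses a toy frame with made-up values; the column names `n_missing` and `pct_missing` are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in two of its three columns.
toy = pd.DataFrame({
    "price": [1148, 1200, 1813, 1095],
    "parking_options": ["carport", np.nan, np.nan, "carport"],
    "lat": [39.5, 39.5, np.nan, 39.4],
})

# isna().mean() is the fraction of missing rows, so * 100 gives a percentage.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values(
    "pct_missing", ascending=False
)

print(missing)
```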

### Dealing with NaN values


After doing all our preliminary checks and getting an understanding of our dataset, we will now proceed to clean it, starting with the missing values.

But first, we will make a **copy of our dataset called "house"**, so that when we clean our data, we can still have the original dataset (raw data) if we need it.

In [17]:
# Generate a copy of our original dataset.
house = df.copy()

In [18]:
# Sanity check
house.head()

Unnamed: 0,id,url,region,region_url,price,type,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished,laundry_options,parking_options,image_url,description,lat,long,state
0,7049044568,https://reno.craigslist.org/apa/d/reno-beautif...,reno / tahoe,https://reno.craigslist.org,1148,apartment,1078,3,2.0,1,1,0,0,0,0,w/d in unit,carport,https://images.craigslist.org/01616_daghmBUvTC...,Ridgeview by Vintage is where you will find al...,39.5483,-119.796,ca
1,7049047186,https://reno.craigslist.org/apa/d/reno-reduced...,reno / tahoe,https://reno.craigslist.org,1200,condo,1001,2,2.0,0,0,0,0,0,0,w/d hookups,carport,https://images.craigslist.org/00V0V_5va0MkgO9q...,Conveniently located in the middle town of Ren...,39.5026,-119.789,ca
2,7043634882,https://reno.craigslist.org/apa/d/sparks-state...,reno / tahoe,https://reno.craigslist.org,1813,apartment,1683,2,2.0,1,1,1,0,0,0,w/d in unit,attached garage,https://images.craigslist.org/00t0t_erYqC6LgB8...,2BD | 2BA | 1683SQFTDiscover exceptional servi...,39.6269,-119.708,ca
3,7049045324,https://reno.craigslist.org/apa/d/reno-1x1-fir...,reno / tahoe,https://reno.craigslist.org,1095,apartment,708,1,1.0,1,1,1,0,0,0,w/d in unit,carport,https://images.craigslist.org/00303_3HSJz75zlI...,MOVE IN SPECIAL FREE WASHER/DRYER WITH 6 OR 12...,39.4477,-119.771,ca
4,7049043759,https://reno.craigslist.org/apa/d/reno-no-long...,reno / tahoe,https://reno.craigslist.org,289,apartment,250,0,1.0,1,1,1,1,0,1,laundry on site,,https://images.craigslist.org/01616_fALAWFV8zQ...,"Move In Today: Reno Low-Cost, Clean & Furnishe...",39.5357,-119.805,ca


#### We have successfully made a copy of our dataset called 'house'. We will be using this from now on.
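To see why `.copy()` is needed rather than a plain assignment, here is a minimal sketch on toy data. The names `raw`, `alias` and `snapshot` are hypothetical and used only for this illustration.

```python
import pandas as pd

# Toy frame standing in for the raw data (made-up values).
raw = pd.DataFrame({"price": [1148, 1200]})

alias = raw            # plain assignment: a second name for the same object
snapshot = raw.copy()  # independent copy (deep by default)

# Editing through the alias also changes 'raw', because they are one object.
alias.loc[0, "price"] = 0

print(raw.loc[0, "price"])       # changed through the alias
print(snapshot.loc[0, "price"])  # still the original value
```

In the same way, cleaning 'house' leaves 'df' untouched because 'house' was created with `df.copy()`.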

In [19]:
# List of columns
house.columns

Index(['id', 'url', 'region', 'region_url', 'price', 'type', 'sqfeet', 'beds',
       'baths', 'cats_allowed', 'dogs_allowed', 'smoking_allowed',
       'wheelchair_access', 'electric_vehicle_charge', 'comes_furnished',
       'laundry_options', 'parking_options', 'image_url', 'description', 'lat',
       'long', 'state'],
      dtype='object')

##### Out of all these columns, we know that ***'laundry_options'*** and ***'parking_options'*** have a high number of missing values. So, we will look at all the **unique values** held by these columns and deal with those **'NaN'** values accordingly.

In [25]:
# Looking at the unique values in 'laundry_options' column.
house["laundry_options"].unique()

array(['w/d in unit', 'w/d hookups', 'laundry on site', 'laundry in bldg',
       nan, 'no laundry on site'], dtype=object)

In [26]:
# Looking at the unique values in 'parking_options' column.
house["parking_options"].unique()

array(['carport', 'attached garage', nan, 'off-street parking',
       'detached garage', 'street parking', 'no parking', 'valet parking'],
      dtype=object)

Now we will be replacing those 'NaN' values with the mode of each column.

#### Handling Missing values

In [27]:
# We will fill all the 'NaN' values in the columns: 'laundry_options' and 'parking_options' with mode. 
house["laundry_options"]=house["laundry_options"].fillna(house["laundry_options"].mode()[0])
house["parking_options"]=house["parking_options"].fillna(house["parking_options"].mode()[0])

##### Methodology:

- **house["laundry_options"].fillna(house["laundry_options"].mode()[0]):** In this line, the code focuses on the 'laundry_options' column of the 'house' DataFrame. It calculates the mode (most frequent value) of this column using "house["laundry_options"].mode()[0]" and then fills any missing values in this column with that mode value using **fillna()**. Essentially, it replaces NaN values in 'laundry_options' with the most commonly occurring value in that column.

- **house["parking_options"].fillna(house["parking_options"].mode()[0]):** Similar to the previous line, this code deals with the **'parking_options'** column. It calculates the mode of 'parking_options' and fills missing values in this column with that mode value.

**Reason:** Both ***'laundry_options'*** and ***'parking_options'*** have a high share of missing values (20.53% and 36.54% respectively). By filling missing values with the mode, we replace the NaN entries with the most frequent option for each feature. This approach is often used for categorical data because it keeps every row in the dataset and only uses categories that already occur in the column. Note that it also increases the share of the most frequent category.
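Mode imputation shifts the category distribution toward the most frequent value. From the counts reported in this notebook, 'off-street parking' goes from roughly 53% of the non-missing 'parking_options' entries to roughly 70% of all rows once the 140,687 gaps are filled. The sketch below shows the same effect on toy data, next to one possible alternative: an explicit placeholder category. The label "unknown" is a hypothetical choice for illustration, not something the original analysis uses.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with two missing entries out of five.
toy = pd.Series(
    ["off-street parking", "carport", np.nan, np.nan, "off-street parking"],
    name="parking_options",
)

mode_value = toy.mode()[0]

# Strategy used in this notebook: fill gaps with the most frequent value.
mode_filled = toy.fillna(mode_value)

# Alternative sketch: keep missingness visible as its own category.
label_filled = toy.fillna("unknown")  # "unknown" is a hypothetical label

# Share of the mode among known values before, and among all rows after.
share_before = (toy == mode_value).sum() / toy.notna().sum()
share_after = (mode_filled == mode_value).mean()

print(round(share_before, 2), round(share_after, 2))
```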


In [28]:
# Replaces all the missing values in the DataFrame (house) with zeros.
house.fillna(0, inplace=True)

#### Why did we do this?

After the mode imputation, the only missing values left are in 'description', 'lat' and 'long'. Filling them with a default value of 0 ensures that the DataFrame has no NaN entries and is ready for analysis or modeling. We pass **inplace=True** so that the changes are made directly in the DataFrame itself and we don't have to assign the result back to 'house'.
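A blanket `fillna(0)` puts the number 0 into every column that still has gaps, including text and coordinate columns. As a hedged alternative sketch, `fillna` also accepts a dictionary that maps column names to fill values, so each column can get a default that suits its type. The toy data and the choice of an empty string for 'description' are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in a text column and in coordinates.
toy = pd.DataFrame({
    "description": pd.Series(["Nice place", np.nan], dtype=object),
    "lat": [39.5, np.nan],
    "long": [-119.8, np.nan],
})

# Blanket fill on the coordinates: a missing location becomes (0.0, 0.0).
blanket = toy[["lat", "long"]].fillna(0)

# Targeted fill: only 'description' gets a default; coordinates stay NaN.
targeted = toy.fillna({"description": ""})

print(blanket)
print(targeted)
```

Since 'description', 'lat' and 'long' are dropped later in this notebook, the choice of fill value for them does not affect the final model.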

In [29]:
# Recheck the unique values in 'laundry_options' column.
house["laundry_options"].unique()

# Check Value counts of each unique item in this column.
house["laundry_options"].value_counts()

array(['w/d in unit', 'w/d hookups', 'laundry on site', 'laundry in bldg',
       'no laundry on site'], dtype=object)

laundry_options
w/d in unit           210809
w/d hookups            75568
laundry on site        58873
laundry in bldg        36103
no laundry on site      3624
Name: count, dtype: int64

##### Now, there are no 'NaN' values in this column. They are all replaced with the mode. 

In [30]:
# Recheck the unique values in 'parking_options' column.
house["parking_options"].unique()

# Check Value counts of each unique item in this column.
house["parking_options"].value_counts()

array(['carport', 'attached garage', 'off-street parking',
       'detached garage', 'street parking', 'no parking', 'valet parking'],
      dtype=object)

parking_options
off-street parking    269189
attached garage        40591
carport                38955
detached garage        16940
street parking         15951
no parking              3188
valet parking            163
Name: count, dtype: int64

##### Similarly, there are no 'NaN' values left in 'parking_options', as they have also been replaced by the mode of this column.

In [34]:
# Sanity check for the missing values in the above columns.
house.isnull().sum()

id                         0
url                        0
region                     0
region_url                 0
price                      0
type                       0
sqfeet                     0
beds                       0
baths                      0
cats_allowed               0
dogs_allowed               0
smoking_allowed            0
wheelchair_access          0
electric_vehicle_charge    0
comes_furnished            0
laundry_options            0
parking_options            0
image_url                  0
description                0
lat                        0
long                       0
state                      0
dtype: int64

#### Interpretation

We have **successfully** dealt with the missing values. The output above shows that all the **"NaN" values are now filled**, leaving no null values in any column.
Next, we decide which columns/features are essential for our analysis and drop the rest.
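
A sanity check like this can also be written as an assertion, so that the notebook stops if any missing value slips through. The sketch below uses a small made-up frame (`toy`) in place of 'house':

```python
import pandas as pd

# Toy frame standing in for 'house'; the values are invented for illustration.
toy = pd.DataFrame({"price": [1148, 1200], "state": ["ca", "ca"]})

# Total number of missing cells across all columns.
total_missing = int(toy.isnull().sum().sum())

# Fail loudly instead of relying on reading the printed table.
assert total_missing == 0, f"{total_missing} missing values remain"
```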

## Handling Features <a class="anchor" id="handling"></a>
---

We will drop the columns/features that are not necessary for our analysis: ***'id', 'url', 'region_url', 'image_url', 'description', 'lat' and 'long'.***

Here are the reasons for dropping each of the specified columns for our project:

- **id:** The "id" column is a unique identifier for each listing and provides no information for predicting rental prices. If needed, the DataFrame index can serve as an identifier.
- **url and region_url:** The "url" and "region_url" columns contain web links related to the listings. They offer no insight into rental prices and can be safely removed.
- **image_url:** Similarly, the "image_url" column contains URLs of property images. The images themselves may be informative, but a URL is not a feature our model can use for price prediction.
- **description:** The "description" column contains free text about the properties. Using it would require text processing, which is outside the scope of our model, so we drop this column.
- **lat and long:** Latitude ("lat") and longitude ("long") coordinates might be useful for location-based analysis, but that is not the focus of this project. Location is already captured by 'region' and 'state', so keeping the coordinates would add largely redundant information that may contribute to multicollinearity and overfitting.

In [35]:
# Dropping unrequired columns.
house.drop(columns=["id", "url", "region_url", "image_url", "description", "lat", "long"], inplace=True)

In [36]:
# Sanity check
house.head()

| | region | price | type | sqfeet | beds | baths | cats_allowed | dogs_allowed | smoking_allowed | wheelchair_access | electric_vehicle_charge | comes_furnished | laundry_options | parking_options | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | reno / tahoe | 1148 | apartment | 1078 | 3 | 2.0 | 1 | 1 | 0 | 0 | 0 | 0 | w/d in unit | carport | ca |
| 1 | reno / tahoe | 1200 | condo | 1001 | 2 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | w/d hookups | carport | ca |
| 2 | reno / tahoe | 1813 | apartment | 1683 | 2 | 2.0 | 1 | 1 | 1 | 0 | 0 | 0 | w/d in unit | attached garage | ca |
| 3 | reno / tahoe | 1095 | apartment | 708 | 1 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | w/d in unit | carport | ca |
| 4 | reno / tahoe | 289 | apartment | 250 | 0 | 1.0 | 1 | 1 | 1 | 1 | 0 | 1 | laundry on site | off-street parking | ca |


In [37]:
# Getting the shape of our dataframe
house.shape

(384977, 15)

We have successfully dropped the columns and are now left with **384,977 rows** and **15 columns** in our dataframe called "house".

In [38]:
# Column names.
house.columns

Index(['region', 'price', 'type', 'sqfeet', 'beds', 'baths', 'cats_allowed',
       'dogs_allowed', 'smoking_allowed', 'wheelchair_access',
       'electric_vehicle_charge', 'comes_furnished', 'laundry_options',
       'parking_options', 'state'],
      dtype='object')

**We have successfully dealt with all the missing 'NaN' values and removed all the unnecessary columns/features from our dataset.**

---

### Remaining features / attributes:

- region
- type
- sqfeet
- beds
- baths
- cats_allowed
- dogs_allowed
- smoking_allowed
- wheelchair_access
- electric_vehicle_charge
- comes_furnished
- laundry_options
- parking_options
- state


### Target variable:

- price

#### We will use all the remaining features/attributes to build a machine learning model that predicts the rental price, with **'price' as our dependent/target variable.**
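
A minimal sketch of how this split into features and target might look in code, using a made-up frame (`toy`) and the hypothetical names `X`, `y` and `TARGET`:

```python
import pandas as pd

# Toy frame standing in for 'house'; the values are invented for illustration.
toy = pd.DataFrame({
    "sqfeet": [1078, 1001],
    "beds": [3, 2],
    "state": ["ca", "ca"],
    "price": [1148, 1200],
})

TARGET = "price"  # dependent/target variable

# Features: every column except the target.
X = toy.drop(columns=[TARGET])
# Target: the column we want to predict.
y = toy[TARGET]

print(X.columns.tolist())
print(y.tolist())
```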

In [39]:
# We will calculate summary statistics for each numerical column in the DataFrame 'house'.
house.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| price | 384977.0 | 8825.722318 | 4462200.0 | 0.0 | 805.0 | 1036.0 | 1395.0 | 2768307000.0 |
| sqfeet | 384977.0 | 1059.899565 | 19150.76 | 0.0 | 750.0 | 949.0 | 1150.0 | 8388607.0 |
| beds | 384977.0 | 1.905345 | 3.494572 | 0.0 | 1.0 | 2.0 | 2.0 | 1100.0 |
| baths | 384977.0 | 1.480718 | 0.6180605 | 0.0 | 1.0 | 1.0 | 2.0 | 75.0 |
| cats_allowed | 384977.0 | 0.72689 | 0.4455574 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| dogs_allowed | 384977.0 | 0.707918 | 0.4547206 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| smoking_allowed | 384977.0 | 0.731771 | 0.4430381 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| wheelchair_access | 384977.0 | 0.082111 | 0.2745347 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| electric_vehicle_charge | 384977.0 | 0.012871 | 0.1127177 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| comes_furnished | 384977.0 | 0.048128 | 0.214036 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


**Here are some of the key statistics:**

- The average rent price (***'price'***) is approximately **$8,825.72**. This is far above the median of $1,036, which indicates that the mean is inflated by a few extreme values (the maximum is about $2.77 billion).
- The average area (***'sqfeet'***) is approximately **1,059.90 sq ft**.
- The average number of bedrooms (***'beds'***) is approximately **2**.
- The average number of baths (***'baths'***) is approximately **1.5**.
- ***'price'*** and ***'sqfeet'*** contain **0** values (both have a minimum of 0). We have to fix that.<br>
---
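
The gap between the mean and the median can be illustrated with a small sketch. The numbers below are invented (three ordinary rents plus one extreme entry) and only show the effect, not the actual dataset:

```python
import pandas as pd

# Invented prices: three typical rents and one extreme entry.
prices = pd.Series([805, 1036, 1395, 2_000_000_000])

mean_price = prices.mean()      # pulled far upward by the single extreme value
median_price = prices.median()  # stays close to the typical rents

print(f"mean:   {mean_price:,.2f}")
print(f"median: {median_price:,.2f}")
```

On the real data, `house["price"].median()` would give the corresponding robust summary.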

#### We will remove the rows whose 'price' is $100 or less or whose 'sqfeet' is under 120 sq. ft., because such values are unrealistic and can distort our analysis.

In [40]:
# Removing rows.
house = house[house["price"] > 100]
house = house[house["sqfeet"] >= 120] 

In [41]:
# Sanity check
house.shape

(381738, 15)

Now, we are left with **381,738** rows and **15** columns (3,239 rows were removed), with no duplicates, no nulls and no implausibly low 'price' or 'sqfeet' values in our dataframe **"house"**.
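
The same filter can be sketched as a single boolean mask that also reports how many rows it removes. The frame `toy` and the constants `MIN_PRICE` and `MIN_SQFEET` are hypothetical names used only for this illustration:

```python
import pandas as pd

# Toy frame standing in for 'house'; the values are invented for illustration.
toy = pd.DataFrame({
    "price":  [0, 100, 106, 1148, 1200],
    "sqfeet": [500, 700, 119, 120, 1001],
})

MIN_PRICE = 100   # keep rows strictly above this price
MIN_SQFEET = 120  # keep rows at or above this area

# Both conditions must hold for a row to be kept.
mask = (toy["price"] > MIN_PRICE) & (toy["sqfeet"] >= MIN_SQFEET)
cleaned = toy[mask].copy()  # .copy() avoids later chained-assignment warnings

removed = len(toy) - len(cleaned)
print(f"Removed {removed} of {len(toy)} rows")
```

Naming the thresholds makes it easier to see that a price of exactly 100 is dropped while an area of exactly 120 is kept.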

---

#### Unique values

Now, we will see what the **unique values** in our columns are. This will help us find out if there are any **outliers** in our dataset. 

In [42]:
# Printing unique values in our columns.
print('Unique values of beds \n ',house.beds.unique(),'\n',
      'Unique values of baths \n ',house.baths.unique(),'\n',
      'Available laundry options \n ',house.laundry_options.unique(),'\n',
      'Available parking options \n ',house.parking_options.unique())

Unique values of beds 
  [   3    2    1    0    4    5    7    6    8 1100 1000] 
 Unique values of baths 
  [ 2.   1.   3.   1.5  2.5  3.5  0.   4.5  5.   4.   6.   5.5  7.   6.5
  8.5 75.   7.5 25. ] 
 Available laundry options 
  ['w/d in unit' 'w/d hookups' 'laundry on site' 'laundry in bldg'
 'no laundry on site'] 
 Available parking options 
  ['carport' 'attached garage' 'off-street parking' 'detached garage'
 'street parking' 'no parking' 'valet parking']


**Based on our findings, we can see:**

- ***'beds'*** contains extremely large values (1000 and 1100), which are likely outliers.
- ***'baths'*** contains half values (for example 1.5 and 2.5), so we will round them up to the next whole number.

In [43]:
# Replacing the decimal number in 'baths' to the next whole number for cleaner data.
house.baths.replace(to_replace = (1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 8.5, 7.5), value = (2, 3, 4, 5, 6, 7, 9, 8), inplace = True )

#### Explanation

The code **house.baths.replace(to_replace=(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 8.5, 7.5), value=(2, 3, 4, 5, 6, 7, 9, 8), inplace=True)** replaces specific values in the 'baths' column of the 'house' DataFrame.

It searches for the values (1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 8.5, 7.5) in the ***'baths'*** column and replaces each one with the value at the same position in (2, 3, 4, 5, 6, 7, 9, 8), so every half value is rounded up to the next whole number.
The **inplace = True** parameter applies the change directly to the column in the **'house' DataFrame** instead of returning a new object. This code is useful for standardizing values in the 'baths' column, here by converting fractional values to whole numbers.
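
As an alternative sketch (not the approach used in this notebook), rounding up can be done without listing every value by applying NumPy's `ceil` to the column. The `baths` Series below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Invented sample of bathroom counts, including half values.
baths = pd.Series([2.0, 1.0, 1.5, 2.5, 0.0, 8.5, 7.5])

# np.ceil rounds each value up to the nearest whole number;
# whole numbers are left unchanged.
rounded = np.ceil(baths)

print(rounded.tolist())
```

On the real data this would be written as `house["baths"] = np.ceil(house["baths"])`, which would also handle any half value that the explicit list does not mention.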

In [44]:
# Rechecking the unique values in the 'baths' column.
house.baths.unique()

array([ 2.,  1.,  3.,  4.,  0.,  5.,  6.,  7.,  9., 75.,  8., 25.])

#### Finding

As we can see, the 'baths' column now contains only whole numbers, representing the number of bathrooms in the properties. However, there are some unusual values in the output:

**1. 0:** Indicates properties with no bathrooms, which might be an error or represent a unique case.<br>
**2. 75:** This is an extreme outlier and could be a data entry error, as it's highly unlikely for a property to have 75 bathrooms.<br>
**3. 25:** Similar to 75, having 25 bathrooms in a property is extremely uncommon and suggests a potential data issue or outlier.<br>


The presence of extreme values like 75 and 25 suggests potential data quality issues or outliers that should be investigated and addressed if necessary. These values may require further validation or cleaning to ensure the data's accuracy and integrity for analysis.
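
One common way to flag such values for investigation is the interquartile range (IQR) rule. The sketch below is an assumption about how this could be done, not a step performed in this notebook, and uses invented values:

```python
import pandas as pd

# Invented bathroom counts with two extreme entries.
baths = pd.Series([1, 1, 1, 2, 2, 2, 3, 4, 25, 75], dtype=float)

# First and third quartiles, and the spread between them.
q1, q3 = baths.quantile([0.25, 0.75])
iqr = q3 - q1

# Values more than 1.5 * IQR above the third quartile are flagged.
upper = q3 + 1.5 * iqr
flagged = baths[baths > upper]

print(f"upper fence: {upper}")
print(flagged.tolist())
```

Flagged rows are candidates for review; whether to drop or correct them remains a judgment call.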

### Interpretation:

1. The ***standard deviation*** is far larger than the ***mean***, for **price** and **sqfeet** in particular, which shows that there is very high **variance** in the data.

2. The ***Min*** and ***Max*** values of **Sqfeet, Beds, Baths and Price** are too extreme because of the **outliers**.

In [45]:
# Sanity check
house.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| price | 381738.0 | 8897.797487 | 4481090.0 | 106.0 | 814.0 | 1043.0 | 1399.0 | 2768307000.0 |
| sqfeet | 381738.0 | 1062.350814 | 19229.65 | 120.0 | 750.0 | 950.0 | 1150.0 | 8388607.0 |
| beds | 381738.0 | 1.90225 | 3.114062 | 0.0 | 1.0 | 2.0 | 2.0 | 1100.0 |
| baths | 381738.0 | 1.535346 | 0.6673239 | 0.0 | 1.0 | 1.0 | 2.0 | 75.0 |
| cats_allowed | 381738.0 | 0.728238 | 0.4448686 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| dogs_allowed | 381738.0 | 0.709125 | 0.4541665 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| smoking_allowed | 381738.0 | 0.732107 | 0.4428622 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| wheelchair_access | 381738.0 | 0.082148 | 0.2745904 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| electric_vehicle_charge | 381738.0 | 0.012933 | 0.1129855 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| comes_furnished | 381738.0 | 0.047682 | 0.2130926 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


### Revised Findings

**Updated key statistics:**

- After removing the zeros and keeping only prices above ***100 USD***, the average rental price (***'price'***) has increased to approximately **$8,897.80**.
- Also, by keeping only areas of 120 sq ft and above, the new average area (***'sqfeet'***) is approximately **1,062.35 sq ft**.
- The average number of bedrooms (***'beds'***) is still approximately **2**.
- The average number of baths (***'baths'***) is now a little more than **1.5** (about 1.54) after rounding the half values up.
- ***'price'*** and ***'sqfeet'*** contained **0** values. These rows have been removed.<br>
---

### Cleaned Data

In [46]:
# List the remaining columns
house.columns.tolist()

['region',
 'price',
 'type',
 'sqfeet',
 'beds',
 'baths',
 'cats_allowed',
 'dogs_allowed',
 'smoking_allowed',
 'wheelchair_access',
 'electric_vehicle_charge',
 'comes_furnished',
 'laundry_options',
 'parking_options',
 'state']

In [47]:
# Show a sample of the cleaned dataset.
house.head()

| | region | price | type | sqfeet | beds | baths | cats_allowed | dogs_allowed | smoking_allowed | wheelchair_access | electric_vehicle_charge | comes_furnished | laundry_options | parking_options | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | reno / tahoe | 1148 | apartment | 1078 | 3 | 2.0 | 1 | 1 | 0 | 0 | 0 | 0 | w/d in unit | carport | ca |
| 1 | reno / tahoe | 1200 | condo | 1001 | 2 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | w/d hookups | carport | ca |
| 2 | reno / tahoe | 1813 | apartment | 1683 | 2 | 2.0 | 1 | 1 | 1 | 0 | 0 | 0 | w/d in unit | attached garage | ca |
| 3 | reno / tahoe | 1095 | apartment | 708 | 1 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | w/d in unit | carport | ca |
| 4 | reno / tahoe | 289 | apartment | 250 | 0 | 1.0 | 1 | 1 | 1 | 1 | 0 | 1 | laundry on site | off-street parking | ca |


In [48]:
# Last sanity check
house.shape

(381738, 15)

#### After cleaning our dataset, we have **381,738** rows and **15** columns to work with.<br>
---

We are done with data cleaning. Now let's save the cleaned dataset for later use:

In [49]:
# Save the cleaned data to a new .csv file
house.to_csv('cleaned_dataset.csv', index=False)
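
A quick round-trip check can confirm that the saved file reads back as expected. This sketch writes a made-up frame (`toy`) to a temporary directory rather than touching the real 'cleaned_dataset.csv':

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for 'house'; the values are invented for illustration.
toy = pd.DataFrame({
    "region": ["reno / tahoe", "reno / tahoe"],
    "price": [1148, 1200],
    "baths": [2.0, 2.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_dataset.csv")
    # index=False keeps the row index out of the file,
    # so no extra 'Unnamed: 0' column appears on reload.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```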

---

## Summary <a class="anchor" id="summary"></a>

### To conclude, these are the steps we followed for data cleaning:

**1.** First, we imported all the necessary libraries to work with our data.<br>
**2.** Then, we loaded our dataset.<br>
**3.** Got an overview of how our data looks (shape, columns, datatypes, info).<br>
**4.** Checked for missing ('NaN') values, which were found in 4 columns: 'laundry_options', 'parking_options', 'lat' and 'long'.<br>
**5.** Dealt with missing and unique values.<br>
**6.** Dropped unnecessary columns.<br>
**7.** Defined our attributes and target variable.<br>
**8.** Removed rows with zero or unrealistically low values in the 'price' and 'sqfeet' columns.<br>
**9.** Looked at our cleaned dataset.<br>
**10.** Saved our dataset into another .csv file for further use.<br>



---