# Fremont Residential Market Analysis

### Process:
##### Section 1: Preview and Data Cleaning

##### Section 2: Exploratory Data Analysis

##### Section 3: Price Prediction

##### Section 4: Data-Driven Suggestions


<div class="alert alert-block alert-info">

## Feature Information:
- ID: Unique ID for each house
- DOM: Days on market
- Area: Area codes in Fremont
- LP: Listing price
- SP: Sold price
- SqFt: Square footage of the house
- BR: Number of bedrooms
- Bth: Number of full bathrooms
- PB: Number of partial bathrooms
- Gar: Whether the house has a garage ('Y'/'N')
- GarSp: Number of garage spaces
- YrBlt: Year built
- Lot SqFt: Lot size in square feet, based on the boundary survey determined by the city
- HOA Fee: Homeowners' association fee
- Closing Date: The date the transaction officially closed

</div>

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
%matplotlib inline
plt.style.use('seaborn-whitegrid')

In [10]:
df = pd.read_csv('Fremont_Sold_DE.csv')

In [11]:
df.head()

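| | ID | DOM | Area | LP | SP | SqFt | BR | Bth | PB | Gar | GarSp | YrBlt | Lot SqFt | HOA Fee | Closing Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40843932 | 45 | 3700 | \$848,888 | \$805,000 | 1165 | 2.0 | 2.0 | 0.0 | N | 2.0 | 1976.0 | 5916 | NaN | 1/2/19 |
| 1 | 40844383 | 28 | 3700 | \$1,189,000 | \$1,220,000 | 1480 | 4.0 | 3.0 | 0.0 | Y | 2.0 | 1958.0 | 7875 | 150.0 | 1/2/19 |
| 2 | SF478231 | 52 | 3700 | \$1,280,000 | \$1,100,000 | 1192 | 3.0 | 1.0 | 0.0 | NaN | NaN | NaN | 7735 | NaN | 1/2/19 |
| 3 | 40844295 | 52 | 3700 | \$1,280,000 | \$1,100,000 | 1192 | 3.0 | 1.0 | 0.0 | Y | 2.0 | 1954.0 | 7735 | NaN | 1/2/19 |
| 4 | ML81728224 | 42 | 3700 | \$1,089,950 | \$1,041,515 | 1915 | 3.0 | 2.0 | 0.0 | Y | 2.0 | 1958.0 | 7700 | 150.0 | 1/3/19 |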


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4064 entries, 0 to 4063
Data columns (total 15 columns):
ID              4064 non-null object
DOM             4064 non-null int64
Area            4064 non-null int64
LP              4064 non-null object
SP              4064 non-null object
SqFt            4064 non-null int64
BR              4063 non-null float64
Bth             4063 non-null float64
PB              2953 non-null float64
Gar             3901 non-null object
GarSp           4058 non-null float64
YrBlt           4062 non-null float64
Lot SqFt        4064 non-null object
HOA Fee         1009 non-null float64
Closing Date    4064 non-null object
dtypes: float64(6), int64(3), object(6)
memory usage: 476.4+ KB


In [13]:
df.isna().sum()

ID                 0
DOM                0
Area               0
LP                 0
SP                 0
SqFt               0
BR                 1
Bth                1
PB              1111
Gar              163
GarSp              6
YrBlt              2
Lot SqFt           0
HOA Fee         3055
Closing Date       0
dtype: int64
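
The raw counts are easier to compare as shares of the total row count. A minimal sketch on a small made-up frame (`toy` and its values are illustrative, not taken from the data set):

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "PB": [0.0, None, 1.0, None],
    "HOA Fee": [None, 150.0, None, None],
    "SqFt": [1165, 1480, 1192, 1915],
})

# isna().mean() is the fraction of missing values per column.
missing_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)
```

Applied to the counts above, HOA Fee is missing in roughly 75% of the 4064 rows (3055 / 4064) and PB in roughly 27% (1111 / 4064), which is why those columns are filled rather than dropped.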

In [14]:
df[df.duplicated()]

| | ID | DOM | Area | LP | SP | SqFt | BR | Bth | PB | Gar | GarSp | YrBlt | Lot SqFt | HOA Fee | Closing Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
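
`df.duplicated()` compares entire rows, so it finds nothing here. Rows 2 and 3 of the preview, however, agree on price, size, lot, and closing date and differ only in ID and the garage and year fields, which may indicate the same sale listed under two IDs. A minimal sketch of a looser check, where the choice of `key_cols` is an assumption and not part of the original analysis:

```python
import pandas as pd

# Three rows copied from the preview (only the columns needed here).
toy = pd.DataFrame({
    "ID": ["SF478231", "40844295", "ML81728224"],
    "LP": ["$1,280,000", "$1,280,000", "$1,089,950"],
    "SP": ["$1,100,000", "$1,100,000", "$1,041,515"],
    "SqFt": [1192, 1192, 1915],
    "Lot SqFt": ["7735", "7735", "7700"],
    "Closing Date": ["1/2/19", "1/2/19", "1/3/19"],
})

# Hypothetical key: columns that should jointly identify one sale.
key_cols = ["LP", "SP", "SqFt", "Lot SqFt", "Closing Date"]

# keep=False marks every member of a duplicate group, not just the repeats.
possible_dupes = toy[toy.duplicated(subset=key_cols, keep=False)]
print(possible_dupes)
```

Whether such rows should be merged or dropped is a judgment call that depends on how the listings were collected.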


### Data cleaning
- Drop the rows with missing values in the BR, Bth, or YrBlt columns
- Drop the Gar column, since GarSp already shows whether the house has a garage; the Gar column also contains input mistakes
- Replace 'NaN' with 0 in the PB, GarSp, and HOA Fee columns
- Remove '\$' and ',' from the LP and SP columns, and ',' from the Lot SqFt column
- Change the LP, SP, BR, Bth, PB, GarSp, YrBlt, Lot SqFt, and HOA Fee data types to 'int'
- Convert Closing Date to datetime
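
The input mistakes in Gar can be made visible by comparing it with GarSp. A minimal sketch using the Gar and GarSp values from the five preview rows (the variable names are illustrative):

```python
import pandas as pd

# Gar / GarSp values copied from the five preview rows.
toy = pd.DataFrame({
    "Gar": ["N", "Y", None, "Y", "Y"],
    "GarSp": [2.0, 2.0, None, 2.0, 2.0],
})

# Cross-tabulate the flag against "has at least one garage space".
has_space = toy["GarSp"].fillna(0) > 0
print(pd.crosstab(toy["Gar"].fillna("missing"), has_space))

# Rows flagged 'N' that still report garage spaces contradict themselves.
inconsistent = toy[(toy["Gar"] == "N") & (toy["GarSp"] > 0)]
print(inconsistent)
```

Row 0 of the preview is such a case: Gar is 'N' while GarSp is 2.0.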

In [35]:
df_clean = df.copy()

In [36]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4064 entries, 0 to 4063
Data columns (total 15 columns):
ID              4064 non-null object
DOM             4064 non-null int64
Area            4064 non-null int64
LP              4064 non-null object
SP              4064 non-null object
SqFt            4064 non-null int64
BR              4063 non-null float64
Bth             4063 non-null float64
PB              2953 non-null float64
Gar             3901 non-null object
GarSp           4058 non-null float64
YrBlt           4062 non-null float64
Lot SqFt        4064 non-null object
HOA Fee         1009 non-null float64
Closing Date    4064 non-null object
dtypes: float64(6), int64(3), object(6)
memory usage: 476.4+ KB


In [37]:
df_clean.isna().sum()

ID                 0
DOM                0
Area               0
LP                 0
SP                 0
SqFt               0
BR                 1
Bth                1
PB              1111
Gar              163
GarSp              6
YrBlt              2
Lot SqFt           0
HOA Fee         3055
Closing Date       0
dtype: int64

In [38]:
# Drop the rows with missing values in the BR, Bth, or YrBlt columns

df_clean.dropna(subset=['BR','Bth','YrBlt'], inplace=True)

In [39]:
# Drop the 'Gar' column

df_clean.drop('Gar', axis=1, inplace=True)

In [40]:
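# Replace 'NaN' with 0 in the PB, GarSp, and HOA Fee columns.
# Fill df_clean itself rather than the original df, so the rows dropped above stay dropped.

fill_cols = ['PB', 'GarSp', 'HOA Fee']
df_clean[fill_cols] = df_clean[fill_cols].fillna(0)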

In [41]:
# Remove '$' and ',' from the LP and SP columns, and ',' from the Lot SqFt column

df_clean['LP'] = df_clean.LP.apply(lambda x: x.replace('$','').replace(',',''))
df_clean['SP'] = df_clean.SP.apply(lambda x: x.replace('$','').replace(',',''))
df_clean['Lot SqFt'] = df_clean['Lot SqFt'].apply(lambda x: x.replace(',',''))
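
The same stripping can be done without a Python-level lambda by using the vectorized string methods. A minimal sketch on three prices from the preview, offered as an alternative and not as the approach this notebook uses:

```python
import pandas as pd

# Three listing prices copied from the preview.
prices = pd.Series(["$848,888", "$1,189,000", "$1,280,000"])

# One regex pass removes both '$' and ','; to_numeric then infers an integer dtype.
cleaned = pd.to_numeric(prices.str.replace(r"[$,]", "", regex=True))
print(cleaned)
```

`pd.to_numeric` also raises on any value that is still not numeric after stripping, which surfaces unexpected formats early.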

In [43]:
# Change data type
col = ['LP', 'SP', 'BR', 'Bth', 'PB', 'GarSp', 'YrBlt', 'Lot SqFt', 'HOA Fee']

for i in col:
    df_clean[i] = df_clean[i].astype('int')
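
Casting a float column with `astype('int')` truncates any fractional part without warning, so it can be worth confirming that the values are whole numbers first. A minimal sketch, where `lossy_columns` is a hypothetical helper and the 262.5 fee is invented to trigger the check:

```python
import pandas as pd

# Toy data; 262.5 is a made-up value, not taken from the data set.
toy = pd.DataFrame({
    "Bth": [2.0, 3.0, 1.0],
    "HOA Fee": [0.0, 150.0, 262.5],
})

def lossy_columns(frame, columns):
    """Return the columns whose values would change under an int cast."""
    return [c for c in columns if not (frame[c] == frame[c].round()).all()]

lossy = lossy_columns(toy, ["Bth", "HOA Fee"])
print(lossy)
```

An empty result means the cast is safe for the columns checked.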

In [45]:
df_clean['Closing Date'] = pd.to_datetime(df_clean['Closing Date'])
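
Without a `format` argument, `pd.to_datetime` has to infer the layout of each string. A minimal sketch with an explicit format, assuming the dates are month/day/two-digit year as the preview suggests ("12/31/19" is an invented value added to show a two-digit month):

```python
import pandas as pd

# The first two values appear in the preview; the third is made up.
dates = pd.Series(["1/2/19", "1/3/19", "12/31/19"])

# An explicit format removes any month/day ambiguity and fails loudly on other layouts.
parsed = pd.to_datetime(dates, format="%m/%d/%y")
print(parsed)
```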

In [46]:
# Check the data set
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4061 entries, 0 to 4063
Data columns (total 14 columns):
ID              4061 non-null object
DOM             4061 non-null int64
Area            4061 non-null int64
LP              4061 non-null int64
SP              4061 non-null int64
SqFt            4061 non-null int64
BR              4061 non-null int64
Bth             4061 non-null int64
PB              4061 non-null int64
GarSp           4061 non-null int64
YrBlt           4061 non-null int64
Lot SqFt        4061 non-null int64
HOA Fee         4061 non-null int64
Closing Date    4061 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(12), object(1)
memory usage: 475.9+ KB


In [47]:
# Store the cleaned data set into a csv file

# reset_index returns a new DataFrame, so assign the result back
df_clean = df_clean.reset_index(drop=True)
df_clean.to_csv('Fremont_Sold_data_cleaned.csv', index=False)
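
A CSV file stores no dtype information, so the datetime column comes back as text unless it is parsed again on load. A minimal round-trip sketch using an in-memory buffer in place of the real file (the two rows are illustrative):

```python
import io
import pandas as pd

# Toy stand-in for df_clean.
toy = pd.DataFrame({
    "SP": [805000, 1220000],
    "Closing Date": pd.to_datetime(["2019-01-02", "2019-01-02"]),
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read returns the dates as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Closing Date"])
print(restored.dtypes)
```

The same `parse_dates=['Closing Date']` argument would apply when loading 'Fremont_Sold_data_cleaned.csv' in a later section.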