# Real Estate Market Analysis with Python Project

### Data Preprocessing
- The initial stage involves cleaning and preparing the data for real estate analysis, including handling missing values, correcting inconsistencies, and transforming data types if necessary. You must clean and preprocess the customers and property tables, ensure column names are in order and that missing values appear correctly, and apply any column data type changes you see fit.

- Finally, you should combine the properties and customers tables into one unified real estate dataset using the shared customer_id column (it appears as customerid in the raw files). You must resolve any inconsistencies or missing values first so that the analysis that follows yields correct results.

### Properties and Customers

Review the checklist below, then preprocess and clean both datasets accordingly.

1 - Descriptive statistics: You can start the analysis with descriptive statistics of the data and check for missing values.

2 - Datatypes: Evaluate the datatypes of the columns and decide whether some of the datatypes need to be changed.

3 - Column names: Check if there is an issue with any column names and rename them if necessary.

4 - Categorical to numerical: Change categorical values to numeric when possible and needed. Use the 0/1 convention when mapping.

5 - Case: If uppercase and lowercase letters are used inconsistently, unify them using the lowercase convention.

6 - Missing values: Ensure missing values are correctly indicated.

7 - Date variables: Make sure you convert the date variables to a proper datetime type.
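
The checklist above can be sketched on a small toy frame. This is a minimal illustration, not the project solution: the frame `toy`, its values, and the choice of which category maps to 0 or 1 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customers table; the values are illustrative only.
toy = pd.DataFrame({
    "Entity ": ["Individual", "Company", "individual"],
    "sex": ["F", "", "M"],
    "mortgage": ["Yes", "No", "yes"],
    "birth_date": ["5/11/1968", "", "11/26/1962"],
})

# 3 - Column names: trim stray whitespace and lowercase the names.
toy.columns = toy.columns.str.strip().str.lower()

# 6 - Missing values: turn empty strings into real missing values.
toy = toy.replace("", np.nan)

# 5 - Case: unify text values to lowercase.
for col in ["entity", "sex", "mortgage"]:
    toy[col] = toy[col].str.lower()

# 4 - Categorical to numerical: 0/1 mapping (which label gets 1 is an assumption).
toy["mortgage"] = toy["mortgage"].map({"no": 0, "yes": 1})
toy["entity"] = toy["entity"].map({"individual": 0, "company": 1})

# 7 - Date variables: parse month/day/year strings; missing values become NaT.
toy["birth_date"] = pd.to_datetime(toy["birth_date"], format="%m/%d/%Y")

# 1 and 2 - Descriptive checks: data types and missing-value counts.
print(toy.dtypes)
print(toy.isnull().sum())
```

Replacing empty strings before lowercasing and mapping keeps missing entries as missing, rather than letting them become a category of their own.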

### Combining the Two Datasets

- This phase aims to merge our two cleaned datasets (properties and customers) into a comprehensive dataset.

1 - Preliminary Checks. Visually inspect the two datasets and decide which variable to merge on. As the two datasets share only one column, you must opt for the customer_id column.

2 - Initial Merge Attempt. You can use pandas to combine the two tables without any further preprocessing. Think about what kind of join would make sense in the context of the given problem.

3 - Resolving Merge Issues. If the initial attempt to merge the data does not yield the correct data frame, you must examine and preprocess the data further. Think about which variable is likely causing the merge issue and review it closely; it makes sense to check the customer_id column. Examine the id values in both the properties and customers datasets. You will notice that one of the datasets contains unnecessary spaces, which you must remove. After that, you should be able to merge successfully.
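
The merge steps above can be sketched on toy data. This is a hedged illustration: the frames `properties_toy` and `customers_toy`, the customer names, the choice of a left join, and the assumption that the padded ids live in the properties table are all made up for the example. The `indicator=True` option of `DataFrame.merge` adds a `_merge` column that shows which rows found a match.

```python
import pandas as pd

# Toy stand-ins; the padded ids and the missing id are assumptions for illustration.
properties_toy = pd.DataFrame({
    "id": [1030, 1029, 2002],
    "customerid": [" C0028 ", " C0027 ", None],  # last property has no customer
})
customers_toy = pd.DataFrame({
    "customerid": ["C0028", "C0027"],
    "name": ["Ann", "Bob"],  # hypothetical names
})

# Initial attempt: a left join keeps every property, but no key matches
# because of the surrounding spaces.
naive = properties_toy.merge(customers_toy, on="customerid", how="left", indicator=True)
print(naive["_merge"].value_counts())

# Fix: strip whitespace from the key in both tables, then merge again.
properties_toy["customerid"] = properties_toy["customerid"].str.strip()
customers_toy["customerid"] = customers_toy["customerid"].str.strip()

merged = properties_toy.merge(customers_toy, on="customerid", how="left", indicator=True)
print(merged["_merge"].value_counts())
print(merged.shape)
```

A left join from the properties side keeps properties that have no customer attached, which is one plausible reading of the expected row count; other join types would drop those rows.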

##### Sanity check: You should obtain a pandas data frame of 267 rows and 19 columns.

##### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [19]:
customers = pd.read_csv('customers.csv')
properties = pd.read_csv('properties.csv')

In [20]:
customers.head()

Unnamed: 0.1,Unnamed: 0,﻿customerid,entity,name,surname,birth_date,sex,country,state,purpose,deal_satisfaction,mortgage,source
0,0,C0110,Individual,Kareem,Liu,5/11/1968,F,USA,California,Home,4,Yes,Website
1,1,C0010,Individual,Trystan,Oconnor,11/26/1962,M,USA,California,Home,1,No,Website
2,2,C0132,Individual,Kale,Gay,4/7/1959,M,USA,California,Home,4,Yes,Agency
3,3,C0137,Individual,Russell,Gross,11/25/1959,M,USA,California,Home,5,No,Website
4,4,C0174,Company,Marleez,Co,,,USA,California,Investment,5,No,Website


In [21]:
properties.head()

Unnamed: 0.1,Unnamed: 0,﻿id,building,date_sale,type,property#,area,price,status,customerid
0,0,1030,1,11/1/2005,Apartment,30,743.09,"$246,172.68",Sold,C0028
1,1,1029,1,10/1/2005,Apartment,29,756.21,"$246,331.90",Sold,C0027
2,2,2002,2,7/1/2007,Apartment,2,587.28,"$209,280.91",Sold,C0112
3,3,2031,2,12/1/2007,Apartment,31,1604.75,"$452,667.01",Sold,C0160
4,4,1049,1,11/1/2004,Apartment,49,1375.45,"$467,083.31",Sold,C0014
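
The preview above shows price as text with a currency symbol and thousands separators, and date_sale as month/day/year strings. A minimal sketch of converting both, using the first three values from the output; the variable names are assumptions for the example:

```python
import pandas as pd

# First three price and date_sale values from the preview above.
prices = pd.Series(["$246,172.68", "$246,331.90", "$209,280.91"], name="price")
sale_dates = pd.Series(["11/1/2005", "10/1/2005", "7/1/2007"], name="date_sale")

# Remove the dollar sign and commas, then convert to float.
clean_prices = prices.str.replace(r"[$,]", "", regex=True).astype(float)

# Parse month/day/year strings into datetimes.
clean_dates = pd.to_datetime(sale_dates, format="%m/%d/%Y")

print(clean_prices)
print(clean_dates)
```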


In [30]:
customers['customerid']


KeyError: 'customerid'
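
A likely cause of this KeyError is an invisible byte-order mark (U+FEFF) at the start of the column name, which the column listings in this notebook appear to show in front of customerid. The sketch below reproduces the situation on a toy frame and removes the character; the frame `toy` is an assumption for the example.

```python
import pandas as pd

# Toy frame whose first column name starts with an invisible BOM character.
toy = pd.DataFrame({
    "\ufeffcustomerid": ["C0110", "C0010"],
    "entity": ["Individual", "Individual"],
})

# repr() makes the hidden character visible in the column names.
print([repr(c) for c in toy.columns])

# Remove the BOM and any stray whitespace from every column name.
toy.columns = toy.columns.str.replace("\ufeff", "", regex=False).str.strip()

# The column can now be selected by its plain name.
print(toy["customerid"].nunique())
```

The same line can be applied to both `customers.columns` and `properties.columns`, since the id column in the properties preview seems to carry the same prefix.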

In [22]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         162 non-null    int64 
 1   ﻿customerid        162 non-null    object
 2   entity             162 non-null    object
 3   name               162 non-null    object
 4   surname            162 non-null    object
 5   birth_date         155 non-null    object
 6   sex                155 non-null    object
 7   country            162 non-null    object
 8   state              154 non-null    object
 9   purpose            162 non-null    object
 10  deal_satisfaction  162 non-null    int64 
 11  mortgage           162 non-null    object
 12  source             162 non-null    object
dtypes: int64(2), object(11)
memory usage: 16.6+ KB


In [23]:
customers.describe()

Unnamed: 0.1,Unnamed: 0,deal_satisfaction
count,162.0,162.0
mean,80.5,3.45679
std,46.909487,1.333276
min,0.0,1.0
25%,40.25,3.0
50%,80.5,4.0
75%,120.75,5.0
max,161.0,5.0


In [24]:
customers.shape

(162, 13)

### Data Preparation

In [25]:
# Drop the first column 'Unnamed: 0' from both the dataframes
customers.drop('Unnamed: 0', axis=1, inplace=True)
properties.drop('Unnamed: 0', axis=1, inplace=True)
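
An alternative to dropping the column afterwards, sketched here on an in-memory CSV, is to tell `pd.read_csv` that the first column is a saved index. The CSV text below is a made-up stand-in for the real files.

```python
import io
import pandas as pd

# Made-up CSV with an unnamed leading index column, like the files in this project.
csv_text = ",customerid,entity\n0,C0110,Individual\n1,C0010,Individual\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))

# index_col=0 uses that column as the index instead of keeping it as data.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(as_index.columns))
```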

### Checking Datatypes

In [26]:
print(customers.columns)

Index(['﻿customerid', 'entity', 'name', 'surname', 'birth_date', 'sex',
       'country', 'state', 'purpose', 'deal_satisfaction', 'mortgage',
       'source'],
      dtype='object')


In [29]:
print(customers['customerid'].nunique())

KeyError: 'customerid'

In [10]:
customers.dtypes

customerid          object
entity               object
name                 object
surname              object
birth_date           object
sex                  object
country              object
state                object
purpose              object
deal_satisfaction     int64
mortgage             object
source               object
dtype: object

In [11]:
# Convert the text columns to the pandas 'string' dtype
# customerid stays commented out: selecting it raises KeyError at this point (see the error above)
#customers['customerid'] = customers['customerid'].astype('string')
customers['entity'] = customers['entity'].astype('string')
customers['name'] = customers['name'].astype('string')
customers['surname'] = customers['surname'].astype('string')
customers['sex'] = customers['sex'].astype('string')
customers['country'] = customers['country'].astype('string')
customers['state'] = customers['state'].astype('string')
customers['purpose'] = customers['purpose'].astype('string')
# mortgage stays commented out as well; note the column is spelled 'mortgage', not 'mortage'
#customers['mortgage'] = customers['mortgage'].astype('string')
customers['source'] = customers['source'].astype('string')

In [12]:
# Parse birth_date (month/day/year strings) as datetime; missing values become NaT
customers['birth_date'] = pd.to_datetime(customers['birth_date'], format='%m/%d/%Y')

In [13]:
customers.dtypes

customerid                  object
entity               string[python]
name                 string[python]
surname              string[python]
birth_date           datetime64[ns]
sex                  string[python]
country              string[python]
state                string[python]
purpose              string[python]
deal_satisfaction             int64
mortgage                     object
source               string[python]
dtype: object

KeyError: 'customerid'

In [12]:
# For customers dataframe
customers.isnull().sum()

customerid          0
entity               0
name                 0
surname              0
birth_date           7
sex                  7
country              0
state                8
purpose              0
deal_satisfaction    0
mortgage             0
source               0
dtype: int64

In [10]:
# For properties dataframe
properties.isnull().sum()

Unnamed: 0     0
﻿id            0
building       0
date_sale      0
type           0
property#      0
area           0
price          0
status         0
customerid    72
dtype: int64