# Exploratory Data Analysis (EDA)

This notebook performs an EDA of real estate market trends in Connecticut.


The raw data file was obtained from https://catalog.data.gov/dataset/real-estate-sales-2001-2018. On the website, the file is described as including:

>town, property address, date of sale, property type (residential, apartment, commercial, industrial or vacant land), sales price, and property assessment. 

>Annual real estate sales are reported by grand list year (October 1 through September 30 each year). For instance, sales from 2018 GL are from 10/01/2018 through 9/30/2019 (Data.gov).
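The grand list year rule quoted above can be sketched as a small helper. The function name `grand_list_year` is hypothetical, and the sketch assumes only the rule as stated: grand list year Y covers sales from October 1 of Y through September 30 of Y + 1.

```python
from datetime import date


def grand_list_year(sale_date: date) -> int:
    """Return the grand list year for a sale date.

    Assumes grand list year Y runs from October 1 of Y
    through September 30 of Y + 1 (hypothetical helper).
    """
    # Sales in October, November, and December belong to the current calendar year's list
    if sale_date.month >= 10:
        return sale_date.year
    # Sales from January through September belong to the previous year's list
    return sale_date.year - 1


print(grand_list_year(date(2018, 10, 1)))  # 2018
print(grand_list_year(date(2019, 9, 30)))  # 2018
print(grand_list_year(date(2019, 10, 1)))  # 2019
```

A helper like this could be used to cross-check `List Year` against `Date Recorded`, if the two columns are expected to agree.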



Frequently used libraries are imported:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from bokeh.plotting import figure, show, output_notebook, output_file, reset_output
from bokeh.layouts import gridplot
from bokeh.models import HoverTool

# Render Bokeh plots inline in the notebook
output_notebook()

The dataset is imported from the CSV file:

In [None]:
real_estate=pd.read_csv('')


To get a basic idea of the dataset, 10 random rows are sampled:

In [None]:
real_estate.sample(10)

Findings:

- `Serial Number` does not match the DataFrame index.
- Specific `Address` and `Town` values are available.
- `Sales Ratio` is `Assessed Value` / `Sale Amount`.
- Several columns contain NaN values.
- `Date Recorded` is the date of sale according to the website.
- Some rows in `Location` include longitude and latitude.
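The ratio relationship noted above can be checked numerically. The following is a minimal sketch on made-up toy rows, not the real data; the column name `Sales Ratio` and the tolerance `rtol=1e-3` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Assessed Value": [100_000, 150_000, 80_000],
    "Sale Amount": [200_000, 300_000, 100_000],
    "Sales Ratio": [0.5, 0.5, 0.9],  # last row is deliberately inconsistent
})

# Recompute the ratio and compare it with the reported column within a tolerance
expected = toy["Assessed Value"] / toy["Sale Amount"]
matches = np.isclose(toy["Sales Ratio"], expected, rtol=1e-3)

print(matches.tolist())  # [True, True, False]
```

On the real DataFrame, the same comparison would flag rows where the reported ratio disagrees with the recomputed one.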

Next, the column data types are checked:

In [None]:
real_estate.info()

In [None]:
# Check for missing values
real_estate.isna().sum()

Findings:
   - The `Date Recorded` column can be converted to datetime format.
   - `Date Recorded`, `Town`, `Address`, `Assessed Value`, `Sale Amount`, and `Sales Ratio` have very few missing values.
   - About one third of the rows are missing `Property Type` and `Residential Type`.
   - About 20% of the rows have a `Location` value (longitude and latitude).
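The proportions quoted above can be read directly from the share of missing values per column. This is a minimal sketch on a made-up toy frame; on the real data the same call would be applied to `real_estate`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Town": ["Ansonia", "Avon", "Berlin", "Bethel"],
    "Property Type": ["Residential", np.nan, "Commercial", np.nan],
})

# isna() gives a boolean frame; the column mean is the fraction of missing rows
missing_share = toy.isna().mean()
print(missing_share)
```

Here `Town` has a missing share of 0.0 and `Property Type` a share of 0.5; multiplying by 100 gives percentages.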
   

In [None]:
# Convert `Date Recorded` to datetime
real_estate['Date Recorded'] = pd.to_datetime(real_estate['Date Recorded'])

In [None]:
real_estate.describe()

In [None]:
for column in (real_estate.columns):
    nunique_values = real_estate[column].nunique()
    unique_values = real_estate[column].unique()
    value_counts = real_estate[column].value_counts()
    
    print(f"The column name is {column}\n")
    print(f"The total number of unique values is: {nunique_values}\n")
    print(f"The unique values are: {unique_values}\n\n\n\n")
    print(f"The value counts are: {value_counts}\n\n\n\n")

In [None]:
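# Restrict to numeric columns and drop NaN values before binning
for column in real_estate.select_dtypes(include='number').columns:
    print(column)
    plt.figure()
    plt.hist(real_estate[column].dropna())
    plt.xlabel(column)
    plt.ylabel('count')
    plt.tight_layout()
    plt.show()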

In [None]:
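# Correlate numeric columns only; recent pandas versions raise on non-numeric columns otherwise
corr_df = real_estate.corr(numeric_only=True)
# Mask the upper triangle so each pair appears once
mask = np.triu(corr_df)
plt.figure(figsize=(20, 20))
sns.heatmap(corr_df.round(2), annot=True, mask=mask, vmin=-1, vmax=1, cmap='coolwarm')
plt.show()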

Findings:
   - `Assessed Value` and `Sale Amount` are weakly correlated (r=0.11).

In [None]:
sns.lmplot(x='List Year', y='Sale Amount', data=real_estate)

In [None]:
# Show the row(s) with the maximum sale amount
print(real_estate[real_estate['Sale Amount'] == real_estate['Sale Amount'].max()])



In [None]:
# Drop the row at positional index 59835 (hard-coded; it depends on the current row order)
real_estate = real_estate.drop([real_estate.index[59835]])

In [None]:
real_estate[real_estate['Sale Amount'] == real_estate['Sale Amount'].max()]

In [None]:
sns.boxplot(x="Property Type", y='Sale Amount', data=real_estate)

In [None]:
plt.figure()
plt.scatter(real_estate["Assessed Value"], real_estate["Sale Amount"])
plt.xlabel("Assessed Value")
plt.ylabel("Sale Amount")
plt.show()

In [None]:
sns.boxplot(x='List Year', y='Sale Amount', data=real_estate)

In [None]:
# Select the column before aggregating so only `Sale Amount` is averaged
mean_by_type = real_estate.groupby('Property Type')['Sale Amount'].mean()
plt.barh(mean_by_type.index, mean_by_type)
plt.xticks(rotation=45)
plt.title('Mean Sale Amount by Property Type')


In [None]:
# Select the column before aggregating so only `Sale Amount` is averaged
mean_by_year = real_estate.groupby('List Year')['Sale Amount'].mean()
plt.barh(mean_by_year.index, mean_by_year)
plt.xticks(rotation=45)
plt.title('Mean Sale Amount by List Year')
