# EDA on Boston Housing
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec21_live.ipynb) instead. 
If you'd rather follow along by just executing the cells, stay in this notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
sns.set_style("darkgrid")
import warnings 
warnings.filterwarnings('ignore') 

In [None]:
iris = sns.load_dataset('iris')
iris

## Useful Seaborn plots for EDA

In [None]:
# pairplot
sns.pairplot(iris)
plt.show()

In [None]:
# pairplot customizations
sns.pairplot(data = iris, hue='species')
plt.show()

In [None]:
# boxplot
sns.boxplot(data = iris)
plt.show()

In [None]:
# boxplot for numeric variable vs categorical
sns.boxplot(data = iris, x = 'species',y='sepal_width')
plt.show()

In [None]:
# heatmaps
corrmat = iris.corr(numeric_only=True) # make correlation matrix

sns.heatmap(corrmat, annot = True) 
plt.title('Correlation Coefficients')
plt.show()

# Group Activity: EDA on Housing Data 

### Importing the data

In [None]:
data = pd.read_csv('BostonHousingData.csv')
data

### Understanding the Features
Info about the columns is [here](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html#:~:text=The%20Boston%20Housing%20Dataset,the%20area%20of%20Boston%20Mass).

*Fill in descriptions of each variable here*

## Discussion question
Looking at the variables and their descriptions above, do any of them stand out? Are there any you have more questions about?

## EDA - Descriptive statistics

In [None]:
# General data info
data.info()

In [None]:
# statistical description of the data
data.describe()

In [None]:
# look at mean of each attribute 
data.describe().loc['mean']

## EDA - Data Preparation

In [None]:
# Search for null values
data.isna().sum()

In [None]:
# check for duplicate entries
data.duplicated().sum()
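
If either check above turns up something, here is one possible way to clean it up. This is only a sketch on a made-up toy frame (the values are invented for illustration, not taken from the housing data), and filling with the median is just one option among several.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (NOT the housing data)
toy = pd.DataFrame({
    'RM':   [6.5, 6.5, np.nan, 5.9],
    'MEDV': [24.0, 24.0, 21.6, 18.9],
})

print(toy.isna().sum())        # missing values per column
print(toy.duplicated().sum())  # number of rows that repeat an earlier row

# drop exact duplicate rows, then fill remaining NaNs with each column's median
clean = toy.drop_duplicates()
clean = clean.fillna(clean.median(numeric_only=True))
print(clean)
```

Whether to fill, drop, or keep missing values depends on the question you're asking, so treat the choice above as an assumption to revisit.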

## EDA - Visualization

In [None]:
# Could start with a pairplot of every column, but with this many columns
# it is slow to draw and the panels are too small to be informative

# sns.pairplot(data)

### Reduce the number of plots: scatterplots against price only

In [None]:
rows = 2
cols = 7
fig, ax = plt.subplots(rows, cols, figsize = (16,4) ) 

# plot price as dependent variable (y-axis) against each column, one panel per column
# zip pairs each panel with one column and stops when either runs out,
# so a mismatch between panels and columns won't raise an IndexError
for axis, column in zip(ax.flat, data.columns):
    sns.scatterplot(data = data, x = column, y = 'MEDV', ax = axis) 
        
plt.tight_layout()
plt.show()


### Correlation Coefficients and Heatmaps

In [None]:
# correlation matrix
corrmat = data.corr(numeric_only=True) 
corrmat  

In [None]:
# plot as a heat map to visualize the information in the matrix
plt.figure(figsize = (9, 9)) 
sns.heatmap(corrmat, annot = True) 
plt.show()

### Heatmap and Pair Plot of Correlated Data

In [None]:
# Which variables are strongly correlated with price (|r| > 0.5)?
# MEDV itself shows up as True, since its correlation with itself is 1
abs(corrmat.MEDV) > 0.5

In [None]:
corrmat.index[abs(corrmat.MEDV) > 0.5]

In [None]:
correlated_data = data[corrmat.index[abs(corrmat.MEDV) > 0.5]]
correlated_data.head()
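
The boolean mask only says which columns pass the cutoff. To see how strongly each one is related to price, and in which direction, it can help to sort the coefficients. Here is a minimal sketch on synthetic data: the column names `RM`, `LSTAT`, and `MEDV` mirror the housing data, `NOISE` is a hypothetical extra column, and all the numbers are randomly generated, so the correlations are not the real ones.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (randomly generated values)
rng = np.random.default_rng(0)
n = 500
rm = rng.normal(6, 0.7, n)
lstat = rng.normal(12, 7, n)
toy = pd.DataFrame({
    'RM': rm,
    'LSTAT': lstat,
    'NOISE': rng.normal(0, 1, n),  # hypothetical column unrelated to price
    'MEDV': 9 * rm - 0.9 * lstat + rng.normal(0, 2, n),
})

toy_corr = toy.corr(numeric_only=True)

# correlation of each feature with price, strongest first (by absolute value)
ranked = toy_corr['MEDV'].drop('MEDV').sort_values(key=abs, ascending=False)
print(ranked)

# names of the features that pass the |r| > 0.5 cutoff
strong = ranked[ranked.abs() > 0.5].index.tolist()
print(strong)
```

Dropping `MEDV` before sorting keeps the trivial correlation of price with itself out of the ranking. The same two lines should work on `corrmat` from the cell above.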

In [None]:
# pair plot of only the columns that are highly correlated with the price
sns.pairplot(correlated_data)
plt.tight_layout()
plt.show()

### Distributions of variables and boxplots

In [None]:
# distribution of each column, all drawn on one shared axis
sns.boxplot(data = data)
plt.show()
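
The points drawn beyond the whiskers of a boxplot are, with seaborn's default setting (`whis=1.5`), the values more than 1.5 × IQR outside the quartiles. If you want numbers to go with the picture, you can count those points per column. A rough sketch on made-up toy data (not the housing data); the quartiles here come from pandas' default interpolation, so counts may differ slightly from what a plot shows in edge cases.

```python
import pandas as pd

# Toy data, made up for illustration (NOT the housing data)
toy = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    'B': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})

q1 = toy.quantile(0.25)   # first quartile of each column
q3 = toy.quantile(0.75)   # third quartile of each column
iqr = q3 - q1             # interquartile range of each column

# True where a value falls beyond 1.5 * IQR from the quartiles
outside = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# number of flagged points in each column
n_outliers = outside.sum()
print(n_outliers)
```

Replacing `toy` with `data` should give the per-column counts for the housing data.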

In [None]:
# plot price vs CHAS
sns.boxplot( x = 'CHAS',y = 'MEDV', data = data)
plt.show()