# EDA on Boston Housing
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo21_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
sns.set_style("darkgrid")
import warnings 
warnings.filterwarnings('ignore') 

In [None]:
iris = sns.load_dataset('iris')
iris

## Useful Seaborn plots for EDA

In [None]:
# pairplot
sns.pairplot(iris)
plt.show()

In [None]:
# pairplot customizations
sns.pairplot(data = iris, hue='species')
plt.show()

In [None]:
# boxplot
sns.boxplot(data = iris)
plt.show()

In [None]:
# boxplot for numeric variable vs categorical
sns.boxplot(data = iris, x = 'species',y='sepal_width')
plt.show()

In [None]:
# heatmaps
corrmat = iris.corr(numeric_only=True) # make correlation matrix

sns.heatmap(corrmat, annot = True) 
plt.title('Correlation Coefficients')
plt.show()

# Group Activity: EDA on Housing Data 

### Importing the data

In [None]:
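data_url = "http://lib.stat.cmu.edu/datasets/boston"
# skip the 22-line text header; each record is split across two lines of the file
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# stitch each pair of lines back into one row (11 values + 3 values)
data = pd.DataFrame(np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :3]]))
# the last column is MEDV in the feature list below; this notebook calls it PRICE
data.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "PRICE"]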
data
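
The `np.hstack` step is easier to follow on a toy array. The sketch below does not read the real file; it uses a made-up array in which each record spans two rows (4 values, then 2 values padded with NaN), mirroring how the real file wraps each record across two lines. The names `raw`, `first_part`, `second_part`, and `records` are illustrative only.

```python
import numpy as np

# Toy stand-in for raw_df.values: each record spans two rows.
# Even rows hold the first 4 fields; odd rows hold the last 2, padded with NaN.
raw = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, np.nan, np.nan],
    [10.0, 20.0, 30.0, 40.0],
    [50.0, 60.0, np.nan, np.nan],
])

first_part = raw[::2, :]     # rows 0 and 2: all columns
second_part = raw[1::2, :2]  # rows 1 and 3: only the filled columns
records = np.hstack([first_part, second_part])  # one row per record

print(records.shape)
print(records)
```

In the housing file the same pattern applies with 11 values on the first line of each record and 3 on the second, which is why the notebook slices `[1::2, :3]`.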

### Understanding the Features


- CRIM   -  per capita crime rate by town
- ZN    -   proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS  -  proportion of non-retail business acres per town
- CHAS  -   Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX  -    nitric oxides concentration (parts per 10 million)
- RM   -    average number of rooms per dwelling
- AGE  -    proportion of owner-occupied units built prior to 1940
- DIS   -   weighted distances to five Boston employment centres
- RAD  -    index of accessibility to radial highways
- TAX  -    full-value property-tax rate per \$10,000
- PTRATIO - pupil-teacher ratio by town
- B    -    1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT -   \% lower status of the population
- MEDV   -  median value of owner-occupied homes in \$1000's (stored as the `PRICE` column in this notebook)
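
To get a feel for how the `B` formula behaves, it can be evaluated at a few proportions. This is a small sketch only: the function name `b_feature` and the sample values of Bk are hypothetical and are not taken from the dataset.

```python
def b_feature(bk):
    """Evaluate B = 1000 * (Bk - 0.63)**2 for a proportion bk between 0 and 1."""
    return 1000 * (bk - 0.63) ** 2

# Illustrative proportions, chosen by hand
for bk in [0.0, 0.26, 0.63, 1.0]:
    print(f"Bk = {bk:.2f} -> B = {b_feature(bk):.1f}")
```

Because the formula is a parabola centered at 0.63, the mapping is not one-to-one: two different proportions on either side of 0.63 can produce the same value of `B`.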

## Discussion question
Looking at the variables and their descriptions above, do any of them stand out, seem strange, or seem potentially problematic?

## EDA - Descriptive statistics

In [None]:
# General data info
data.info()

In [None]:
# statistical description of the data
data.describe()

In [None]:
# look at mean of each attribute 
data.describe().loc['mean']

## EDA - Data Preparation

In [None]:
# Search for null values
data.isna().sum()

In [None]:
# check for duplicate entries
data.duplicated().sum()
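
To see what these two checks report when there is something to find, here is a sketch on a small made-up DataFrame (the values and the name `toy` are illustrative, not from the housing data). It also shows one possible cleanup, `drop_duplicates()` followed by `dropna()`; whether dropping rows is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 repeats row 0, and two cells are missing
toy = pd.DataFrame({
    "RM": [6.5, 6.5, np.nan, 5.9],
    "PRICE": [24.0, 24.0, 21.6, np.nan],
})

missing_per_column = toy.isna().sum()   # count of NaN values in each column
n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row

# One possible cleanup: remove repeated rows, then rows with any missing value
cleaned = toy.drop_duplicates().dropna()

print(missing_per_column)
print(n_duplicates)
print(cleaned)
```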

## EDA - Visualization

In [None]:
# start with a pairplot
sns.pairplot(data)  

### Reduce number of plots by making scatterplots

In [None]:
rows = 2
cols = 7
fig, ax = plt.subplots(rows, cols, figsize = (16,4) ) 
index = 0

# plot price as dependent variable (y-axis) against all other variables
for i in range(rows):
    for j in range(cols):
        sns.scatterplot(x = data.columns[index], y = 'PRICE', data = data, ax = ax[i][j]) 
        index = index + 1
        
plt.tight_layout()
plt.show()
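
The nested loop above walks through all 14 columns, so the last panel plots `PRICE` against itself. One alternative, sketched here on synthetic data rather than the housing DataFrame, is to build the list of predictors first and pair each one with an axis using `zip`. The column names `A`, `B`, `C` and the variable `toy` are placeholders, and plain matplotlib `scatter` is used in place of `sns.scatterplot` to keep the sketch small.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the housing DataFrame
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["A", "B", "C", "PRICE"])

# Every column except the target
predictors = [col for col in toy.columns if col != "PRICE"]

fig, axes = plt.subplots(1, len(predictors), figsize=(9, 3), sharey=True)

# Pair each predictor with one axis; no manual index counter needed
for ax, col in zip(axes.flat, predictors):
    ax.scatter(toy[col], toy["PRICE"], s=10)
    ax.set_xlabel(col)
axes.flat[0].set_ylabel("PRICE")

plt.tight_layout()
plt.show()
```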


### Correlation Coefficients and Heatmaps

In [None]:
# correlation matrix
corrmat = data.corr(numeric_only=True) 
corrmat  

In [None]:
# plot as a heat map to visualize the information in the matrix
plt.figure(figsize = (9, 9)) 
sns.heatmap(corrmat, annot = True) 
plt.show()

### Heatmap and Pair Plot of Correlated Data

In [None]:
# Which variables are strongly correlated with price (|r| > 0.5)? Note that PRICE itself is included.
abs(corrmat.PRICE) > 0.5

In [None]:
corrmat.index[abs(corrmat.PRICE) > 0.5]

In [None]:
correlated_data = data[corrmat.index[abs(corrmat.PRICE) > 0.5]]
correlated_data.head()
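
The mask above always keeps `PRICE` itself, since its correlation with itself is 1. A variation that drops the target and ranks the remaining columns by strength is sketched below on a small made-up DataFrame; the values and the names `toy`, `corr_with_price`, `strong`, and `ranked` are illustrative.

```python
import pandas as pd

# Made-up data with the same column names as a few housing features
toy = pd.DataFrame({
    "RM":    [5.0, 6.0, 6.5, 7.0, 8.0],
    "LSTAT": [20.0, 12.0, 10.0, 6.0, 3.0],
    "CHAS":  [0.0, 1.0, 0.0, 1.0, 0.0],
    "PRICE": [15.0, 21.0, 24.0, 30.0, 45.0],
})

# Correlation of every column with PRICE, excluding PRICE itself
corr_with_price = toy.corr()["PRICE"].drop("PRICE")

# Keep columns whose absolute correlation exceeds 0.5
strong = corr_with_price[corr_with_price.abs() > 0.5]

# Rank by strength regardless of sign
ranked = strong.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Using the absolute value matters here: a strongly negative relationship is just as informative as a strongly positive one.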

In [None]:
# plot pair plots to display columns which are highly correlated with the price
sns.pairplot(correlated_data)
plt.tight_layout()

### Distributions of variables and boxplots

In [None]:
# distribution of each column, one box per column
sns.boxplot(data=data)
plt.show()
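
Because the columns sit on very different scales (`TAX` is in the hundreds while `NOX` is below 1), a single boxplot of all columns can be dominated by the largest ones. One option, which is not part of the original notebook, is to z-score each column before plotting. The sketch below uses made-up values and the illustrative names `toy` and `standardized`.

```python
import pandas as pd

# Made-up values on two very different scales
toy = pd.DataFrame({
    "NOX": [0.40, 0.50, 0.60, 0.70],
    "TAX": [200.0, 300.0, 400.0, 700.0],
})

# z-score: subtract each column's mean and divide by its standard deviation
standardized = (toy - toy.mean()) / toy.std()

print(standardized.mean())
print(standardized.std())
```

The resulting DataFrame can be passed to `sns.boxplot(data=standardized)` so that every box shares a comparable scale. The original units are lost in the process, so this suits comparing shapes and outliers rather than reading off values.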

In [None]:
# plot price vs CHAS
sns.boxplot( x = 'CHAS',y = 'PRICE', data = data)
plt.show()

**IMPORTANT NOTE:** The Boston housing dataset has an ethical problem (actually, several)! It used to be available to import from Python libraries and repositories, and it was often used for benchmarking machine learning models. However, it has been phased out of many libraries because of these ethical issues. It is an example of how systemic racism can appear in data, and it reminds us to watch for societal biases that can surface in datasets and in the analyses built on them.

During class, we will talk about the history of this dataset and the problematic ways in which its variables were constructed. If you return to the dataset another time, be careful about how you use it and be aware of its issues.