# Boston Housing | Exploratory Data Analysis

### Questions To Solve:

What is the distribution of values for each feature in the dataset?

Are there any missing values in the dataset, and if so, how are they distributed across the features?

Are there any outliers or extreme values in the dataset, and how do they affect our understanding of the data?

What is the correlation between each feature and the target variable (MEDV), and which features are most strongly correlated?

Are there any significant differences in the distribution of features between neighborhoods in Boston?

How do different combinations of features affect the distribution of MEDV values in the dataset?

Are there any patterns or trends in the relationship between features and MEDV that can be observed through visualizations?

Are there any interactions or nonlinear relationships between features that affect the target variable?

How do different statistical transformations of the features (such as logarithmic or polynomial transformations) affect the relationship with MEDV?

How well can different regression models predict the target variable based on the available features, and what are the most important predictors for these models?

-----------------------------------------------------------------------
What is the distribution of MEDV values in the dataset?

Which features are most strongly correlated with MEDV?

Is there a relationship between the proportion of non-retail business acres per town (INDUS) and housing prices (MEDV)?

How does the crime rate (CRIM) affect housing prices (MEDV)?

Is there a significant difference in housing prices (MEDV) between neighborhoods with different levels of air pollution (NOX)?

How do the distances to employment centers (DIS) and radial highways (RAD) affect housing prices (MEDV)?

Is there a relationship between the percentage of lower status of the population (LSTAT) and housing prices (MEDV)?

How does the pupil-teacher ratio (PTRATIO) affect housing prices (MEDV)?

Are there any significant differences in housing prices (MEDV) between neighborhoods that border the Charles River (CHAS) and those that do not?

Can we build an accurate regression model to predict housing prices (MEDV) based on the other input features in the dataset?

----------------------------------------------------

https://www.kaggle.com/datasets/vikrishnan/boston-house-prices

CRIM: per capita crime rate by town

ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS: proportion of non-retail business acres per town

CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX: nitric oxides concentration (parts per 10 million)

RM: average number of rooms per dwelling

AGE: proportion of owner-occupied units built prior to 1940

DIS: weighted distances to five Boston employment centers

RAD: index of accessibility to radial highways

TAX: full-value property-tax rate per $10,000

PTRATIO: pupil-teacher ratio by town

B: $1000(B_k - 0.63)^2$, where $B_k$ is the proportion of blacks by town

LSTAT: % lower status of the population

MEDV: Median value of owner-occupied homes in $1000s

We can see that the input attributes have a mixture of units.
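
Because the attributes mix units (proportions, concentrations, tax rates, room counts), columns with large numeric ranges such as TAX can dominate scale-sensitive methods. One common remedy is z-score standardization. The following is a minimal sketch on made-up values, not a step this notebook performs; the `toy` frame and its numbers are illustrative assumptions.

```python
import pandas as pd

# Toy values in the spirit of the columns above (illustrative, not taken from the dataset).
toy = pd.DataFrame({
    "NOX": [0.40, 0.50, 0.60],      # parts per 10 million
    "TAX": [200.0, 300.0, 400.0],   # tax rate per $10,000
    "RM": [5.0, 6.0, 7.0],          # rooms per dwelling
})

# Z-score each column: subtract its mean, divide by its sample standard deviation.
scaled = (toy - toy.mean()) / toy.std()

print(scaled.round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so the columns become directly comparable.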

-------------------------------------------------------------

### EDA 

In [6]:
link = 'https://raw.githubusercontent.com/vishalv91/capstoneproject-realestate/master/HousingData.csv'


In [7]:
import pandas as pd 
df = pd.read_csv(link)  # reuse the URL stored in `link` by the previous cell

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     486 non-null    float64
 1   ZN       486 non-null    float64
 2   INDUS    486 non-null    float64
 3   CHAS     486 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      486 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    486 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


In [8]:
df.head()

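|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | NaN | 36.2 |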


In [10]:
df.describe()

|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 486.0 | 486.0 | 486.0 | 486.0 | 506.0 | 506.0 | 486.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 486.0 | 506.0 |
| mean | 3.611874 | 11.211934 | 11.083992 | 0.069959 | 0.554695 | 6.284634 | 68.518519 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.715432 | 22.532806 |
| std | 8.720192 | 23.388876 | 6.835896 | 0.25534 | 0.115878 | 0.702617 | 27.999513 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.155871 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.0819 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.175 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 7.125 | 17.025 |
| 50% | 0.253715 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 76.8 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.43 | 21.2 |
| 75% | 3.560263 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 93.975 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |
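
The summary hints at extreme values: the 75th percentile of CRIM is about 3.56, yet its maximum is 88.98. One rough way to quantify this is Tukey's interquartile-range (IQR) rule, sketched below. The helper name `iqr_outlier_mask`, the multiplier `k = 1.5`, and the toy series are assumptions for illustration; the notebook does not apply this rule itself.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# CRIM quartiles and maximum copied from the df.describe() summary.
crim_q1, crim_q3, crim_max = 0.0819, 3.560263, 88.9762
crim_upper_fence = crim_q3 + 1.5 * (crim_q3 - crim_q1)
print(f"CRIM upper fence: {crim_upper_fence:.4f}, max: {crim_max}")

# Toy series (made up) to show the helper on raw values.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
print(toy[iqr_outlier_mask(toy)])
```

On the real data the same helper could be applied as `df[iqr_outlier_mask(df['CRIM'])]`; note that `quantile` skips missing values, so it also works before the null rows are dropped.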


In [9]:
df.isnull().sum()

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
B           0
LSTAT      20
MEDV        0
dtype: int64

In [13]:
df.dropna(inplace=True)  # drop every row that has at least one missing value
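
`dropna()` removes a row if any single column is missing, so six columns with 20 gaps each cost far more than 20 rows: the frame shrinks from 506 to 394 rows. The sketch below, on a made-up frame, counts the rows that `dropna()` would remove and contrasts it with median imputation. Imputation is shown only as an alternative under the assumption that keeping rows matters more than using purely observed values; it is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in different rows of different columns.
toy = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.4],
    "LSTAT": [5.0, 9.0, np.nan, 3.0],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

rows_with_gap = int(toy.isnull().any(axis=1).sum())   # rows dropna() would remove
dropped = toy.dropna()                                # keeps only complete rows
imputed = toy.fillna(toy.median(numeric_only=True))   # fill gaps with column medians

print(f"rows with a gap: {rows_with_gap}")
print(f"rows after dropna: {len(dropped)}, rows after imputation: {len(imputed)}")
```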

In [25]:
df.describe()

|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 | 394.0 |
| mean | 3.690136 | 11.46066 | 11.000863 | 0.068528 | 0.553215 | 6.280015 | 68.932741 | 3.805268 | 9.403553 | 406.431472 | 18.537563 | 358.490939 | 12.769112 | 22.359645 |
| std | 9.202423 | 23.954082 | 6.908364 | 0.252971 | 0.113112 | 0.697985 | 27.888705 | 2.098571 | 8.633451 | 168.312419 | 2.16646 | 89.283295 | 7.30843 | 9.142979 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.389 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 2.6 | 1.73 | 5.0 |
| 25% | 0.081955 | 0.0 | 5.13 | 0.0 | 0.453 | 5.87925 | 45.475 | 2.1101 | 4.0 | 280.25 | 17.4 | 376.7075 | 7.125 | 16.8 |
| 50% | 0.26888 | 0.0 | 8.56 | 0.0 | 0.538 | 6.2015 | 77.7 | 3.1992 | 5.0 | 330.0 | 19.1 | 392.19 | 11.3 | 21.05 |
| 75% | 3.435973 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6055 | 94.25 | 5.1167 | 24.0 | 666.0 | 20.2 | 396.9 | 17.1175 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [14]:
df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 394 entries, 0 to 504
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     394 non-null    float64
 1   ZN       394 non-null    float64
 2   INDUS    394 non-null    float64
 3   CHAS     394 non-null    float64
 4   NOX      394 non-null    float64
 5   RM       394 non-null    float64
 6   AGE      394 non-null    float64
 7   DIS      394 non-null    float64
 8   RAD      394 non-null    int64  
 9   TAX      394 non-null    int64  
 10  PTRATIO  394 non-null    float64
 11  B        394 non-null    float64
 12  LSTAT    394 non-null    float64
 13  MEDV     394 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 46.2 KB


In [18]:
df[df.duplicated()]  # rows that exactly repeat an earlier row; an empty result means no duplicates

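|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |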


In [24]:
df['RAD'].unique()

array([ 1,  2,  3,  5,  4,  8,  6,  7, 24])
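
RAD takes only nine distinct values, so it behaves more like an ordinal code than a continuous measurement. A compact way to spot such columns without printing every unique value is `DataFrame.nunique()`. The sketch below uses a made-up frame, and the cutoff `threshold = 4` is a hypothetical choice sized for this toy data, not a recommendation for the real dataset.

```python
import pandas as pd

# Made-up frame: RAD and CHAS repeat values, RM does not.
toy = pd.DataFrame({
    "RAD": [1, 2, 3, 24, 24, 1],
    "CHAS": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147, 6.430],
})

n_unique = toy.nunique()   # number of distinct non-null values per column
threshold = 4              # hypothetical cutoff for "low cardinality"
low_cardinality = n_unique[n_unique <= threshold].index.tolist()

print(n_unique)
print(low_cardinality)
```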

In [23]:
# Not needed for this dataset, but might be helpful for other data sets

def find_unique_values(df):
    """Return a dict mapping each column name to an array of its unique values."""
    return {col: df[col].unique() for col in df.columns}

find_unique_values(df)

{'CRIM': array([6.32000e-03, 2.73100e-02, 2.72900e-02, 3.23700e-02, 2.98500e-02,
        1.44550e-01, 2.11240e-01, 2.24890e-01, 1.17470e-01, 9.37800e-02,
        6.29760e-01, 6.27390e-01, 1.05393e+00, 7.84200e-01, 8.02710e-01,
        7.25800e-01, 1.25179e+00, 8.52040e-01, 1.23247e+00, 9.88430e-01,
        7.50260e-01, 8.40540e-01, 6.71910e-01, 9.55770e-01, 7.72990e-01,
        1.00245e+00, 1.13081e+00, 1.35472e+00, 1.38799e+00, 1.15172e+00,
        1.61282e+00, 8.01400e-02, 1.75050e-01, 2.76300e-02, 3.35900e-02,
        1.27440e-01, 1.41500e-01, 1.22690e-01, 1.71420e-01, 1.88360e-01,
        2.53870e-01, 2.19770e-01, 8.87300e-02, 5.36000e-02, 1.36000e-02,
        1.31100e-02, 2.05500e-02, 1.43200e-02, 1.54450e-01, 1.03280e-01,
        1.49320e-01, 1.71710e-01, 1.10270e-01, 1.26500e-01, 1.95100e-02,
        3.58400e-02, 4.37900e-02, 5.78900e-02, 1.35540e-01, 1.28160e-01,
        8.82600e-02, 1.58760e-01, 9.16400e-02, 9.51200e-02, 1.01530e-01,
        8.70700e-02, 5.64600e-02, 4.11300e-