Project Tasks:

-Task 1: Familiarize yourself with the dataset 
-Task 2: (Optional) If you do not already have an instance of Watson Studio, create an IBM Cloud Lite account and provision an instance of Watson Studio.
-Task 3: Load the dataset in a Jupyter Notebook using Watson Studio.
-Task 4: Generate basic statistics and visualizations for upper management. 
-Task 5: Use the appropriate tests to answer the questions provided.
-Task 6: Share your Jupyter Notebook.

Import of needed libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats
from statsmodels.formula.api import ols
import statsmodels.api as sm
import datetime

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

np.random.seed(0)
np.set_printoptions(suppress=True)

In [None]:
# download doesn't work --> Please use local copy of the file
#boston_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/boston_housing.csv'
#boston_df=pd.read_csv(boston_url)

In [None]:
# only for local copy usage
# reading data from csv
df = pd.read_csv('./boston_housing.csv')

# checking the data frame
df

The data frame has 506 rows × 14 columns, with the columns as defined in the task:

-CRIM - per capita crime rate by town
-ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
-INDUS - proportion of non-retail business acres per town.
-CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
-NOX - nitric oxides concentration (parts per 10 million)
-RM - average number of rooms per dwelling
-AGE - proportion of owner-occupied units built prior to 1940
-DIS - weighted distances to five Boston employment centres
-RAD - index of accessibility to radial highways
-TAX - full-value property-tax rate per $10,000
-PTRATIO - pupil-teacher ratio by town
-LSTAT - % lower status of the population
-MEDV - Median value of owner-occupied homes in $1000's

In [None]:
# checking metadata
df.info()

In [None]:
# generate descriptive statistics
df.describe()

Data is all float type with no NaNs
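The statement about dtypes and missing values can also be checked programmatically instead of by reading the `df.info()` output. The following is a minimal sketch; `quality_report` is a hypothetical helper name and `demo` is a small stand-in frame, not the Boston housing file.

```python
import pandas as pd

def quality_report(frame):
    """Return the dtype and the number of missing values for every column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_missing': frame.isna().sum(),
    })

# stand-in data (assumption): two float columns, one of them with a missing value
demo = pd.DataFrame({'CRIM': [0.1, 0.2, None], 'RM': [6.5, 6.1, 7.0]})
report = quality_report(demo)
print(report)
```

Applied to the notebook's data frame, the call would be `quality_report(df)`; a column of zeros in `n_missing` would confirm that there are no NaNs.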

In [None]:
# checking the column labels
df.columns

In [None]:
# visualization of the data
# make a histogram of the dataframe’s columns, using matplotlibs hist function
df.hist(bins = 50, figsize = (20,10))

# add a centered suptitle to the figure
plt.suptitle('Feature Distribution', x = 0.5, y = 1.02, ha = 'center', fontsize = 'large')

# adjust the padding between and around subplots
plt.tight_layout()

# plot...
plt.show()

The following tasks were given:

-For the "Median value of owner-occupied homes" provide a boxplot
-Provide a bar plot for the Charles river variable
-Provide a boxplot for the MEDV variable vs the AGE variable. (Discretize the age variable into three groups of 35 years and younger, between 35 and 70 years and 70 years and older)
-Provide a scatter plot to show the relationship between Nitric oxide concentrations and the proportion of non-retail business acres per town. What can you say about the relationship?
-Create a histogram for the pupil to teacher ratio variable

In [None]:
# First task: provide a boxplot for the "Median value of owner-occupied homes" (MEDV)
# defining figure size
plt.figure(figsize=(15,10))

# drawing a box plot to show distributions with respect to category MEDV, using seaborns boxplot function
sns.boxplot(x = df.MEDV)

# add the title
plt.title('Boxplot for MEDV')

# plot...
plt.show()

In [None]:
# Next task: provide a bar plot for the Charles River variable (CHAS)
# defining figure size
plt.figure(figsize=(15,10))

# draw one bar per CHAS category (0 or 1) showing the number of rows in each, using seaborn's countplot function
# (distplot is deprecated in recent seaborn versions, and the task asks for a bar plot)
sns.countplot(x = 'CHAS', data = df)

# add the title
plt.title('Bar Plot for the Charles River Variable (CHAS)')

# plot...
plt.show()

In [None]:
# Next task:
# Provide a boxplot for the MEDV variable vs the AGE variable. 
# (Discretize the age variable into three groups of 35 years and younger, between 35 and 70 years and 70 years and older)

# discretize the age variable in three groups
# group one:
df.loc[(df['AGE'] <= 35),'age_group'] = '35 years and younger'

# group two:
df.loc[(df['AGE'] > 35) & (df["AGE"]<70),'age_group'] = 'between 35 and 70 years'

# group three:
df.loc[(df['AGE'] >= 70),'age_group'] = '70 years and older'
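The three `df.loc` assignments can be collapsed into a single reusable step, which avoids repeating them later in the notebook. This is a sketch using `np.select`, not the approach taken in this notebook; `add_age_group` is a hypothetical helper and `demo` is stand-in data. The boundaries match the cell above: 35 belongs to the youngest group and 70 to the oldest.

```python
import numpy as np
import pandas as pd

def add_age_group(frame):
    """Return a copy of frame with an 'age_group' column derived from 'AGE'."""
    conditions = [frame['AGE'] <= 35, frame['AGE'] >= 70]
    labels = ['35 years and younger', '70 years and older']
    out = frame.copy()
    # rows matching neither condition fall into the middle group
    # (note: a NaN in AGE would also land there, so this assumes AGE has no NaNs)
    out['age_group'] = np.select(conditions, labels, default='between 35 and 70 years')
    return out

# stand-in data (assumption) that exercises both boundaries
demo = pd.DataFrame({'AGE': [10.0, 35.0, 35.1, 69.9, 70.0, 100.0]})
demo = add_age_group(demo)
print(demo)
```

With the notebook's data, `df = add_age_group(df)` would produce the same `age_group` column as the three assignments.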

In [None]:
# checking the result of the operation
df

In [None]:
# creating the boxplot
plt.figure(figsize=(15,10))

# drawing a box plot to show distributions with respect to category MEDV and the discretized age variable in three groups, using seaborns boxplot function
sns.boxplot(x = df.MEDV, y = df.age_group, data = df)

# add the title
plt.title('Boxplot for MEDV vs AGE')

# ... plot
plt.show()

In [None]:
# Next task:
# Provide a scatter plot to show the relationship between Nitric oxide concentrations and the proportion
# of non-retail business acres per town. What can you say about the relationship?

plt.figure(figsize = (15,10))

# drawing a scatter plot to show scattering with respect to category NOX and the INDUS using seaborns scatterplot function
sns.scatterplot(x = df.NOX, y = df.INDUS, data = df)

# add the title
plt.title('Relationship between NOX Concentrations (NOX) and Proportion of Non-Retail Business Acres per Town (INDUS)')

# plot...
plt.show()

In [None]:
# Next task:
# Create a histogram for the pupil to teacher ratio variable
plt.figure(figsize = (15,10))

# draw the histogram with respect to PTRATIO using seaborn's histplot function
# (histplot replaces distplot, which is deprecated in recent seaborn versions)
sns.histplot(x = df.PTRATIO, bins = 10)

# add the title
plt.title('Histogram for the pupil to teacher ratio variable (PTRATIO)')

# ... plot
plt.show()

Overview of task for statistical evaluation

For each of the following questions:
-Is there a significant difference in median value of houses bounded by the Charles river or not? (T-test for independent samples)
-Is there a difference in Median values of houses (MEDV) for each proportion of owner occupied units built prior to 1940 (AGE)? (ANOVA)
-Can we conclude that there is no relationship between Nitric oxide concentrations and proportion of non-retail business acres per town? (Pearson Correlation)
-What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner occupied homes? (Regression analysis)

Be sure to:
-State your hypothesis.
-Use α = 0.05
-Perform the test Statistics.
-State the conclusion from the test.

First Task:
Is there a significant difference in median value of houses bounded by the Charles river or not? (T-test for independent samples)

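Null Hypothesis: The average MEDV is the same for houses bounded by the Charles River and houses that are not

Alternative Hypothesis: The average MEDV differs between the two groups

In [None]: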
# creating a series containing counts of unique values
df['CHAS'].value_counts()

# creating two Series 'a' and 'b' for evaluation via t-test for CHAS == 0 and CHAS == 1
a = df[df['CHAS'] == 0]['MEDV']
b = df[df['CHAS'] == 1]['MEDV']

In [None]:
# checking a
a

In [None]:
# checking b
b

In [None]:
# performing t-test, using scipy and the generated Series 'a' and 'b'
scipy.stats.ttest_ind(a, b, axis = 0, equal_var = True)

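Since the p-value p = 7.3906...e-05 (printed in scientific notation) is less than the alpha value of 0.05, we reject the null hypothesis --> there IS a statistically significant difference in MEDV between houses bounded by the Charles River and houses that are not.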
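The decision rule for any of these tests is the same: reject the null hypothesis when the p-value is below α, otherwise fail to reject it. The sketch below wraps that rule for a two-sample t-test and, as an addition that is not part of this notebook, first runs Levene's test to decide whether the `equal_var = True` assumption is reasonable. `two_sample_decision` is a hypothetical helper and the two groups are simulated, not taken from the housing data.

```python
import numpy as np
import scipy.stats

def two_sample_decision(a, b, alpha=0.05):
    """Pick the t-test variant via Levene's test, then compare p with alpha."""
    _, levene_p = scipy.stats.levene(a, b)
    # no evidence against equal variances -> pooled t-test, otherwise Welch's t-test
    equal_var = bool(levene_p >= alpha)
    t_stat, p_value = scipy.stats.ttest_ind(a, b, equal_var=equal_var)
    return {
        'equal_var': equal_var,
        't': float(t_stat),
        'p': float(p_value),
        'reject_null': bool(p_value < alpha),
    }

# simulated groups (assumption) with clearly different means
rng = np.random.default_rng(0)
group_0 = rng.normal(loc=0.0, scale=1.0, size=200)
group_1 = rng.normal(loc=5.0, scale=1.0, size=200)
result = two_sample_decision(group_0, group_1)
print(result)
```

For the notebook's data the call would be `two_sample_decision(a, b)`. Note that a p-value printed as `7.39e-05` means 0.0000739, so reading the exponent matters for the conclusion.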

In [None]:
# Next task:
# Is there a difference in Median values of houses (MEDV) for each proportion of owner occupied units built prior to 1940 (AGE)? (ANOVA)
# creating a series containing counts of unique values of column 'AGE'
df['AGE'].value_counts()

In [None]:
# again grouping ages
# group one:
df.loc[(df['AGE'] <= 35),'age_group'] = '35 years and younger'

# group two:
df.loc[(df['AGE'] > 35) & (df["AGE"]<70),'age_group'] = 'between 35 and 70 years'

# group three:
df.loc[(df['AGE'] >= 70),'age_group'] = '70 years and older'

In [None]:
# checking data frame
df

Null Hypothesis: All age groups have the same population mean of MEDV

Alternative Hypothesis: At least one age group has a different population mean of MEDV

In [None]:
# separating the pre-defined groups in three classes, for the evaluation 
lower = df[df['age_group'] == '35 years and younger']['MEDV']
middle = df[df['age_group'] == 'between 35 and 70 years']['MEDV']
upper = df[df['age_group'] == '70 years and older']['MEDV']

# perform one-way ANOVA --> The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean.
# The test is applied to samples from two or more groups, possibly with differing sizes.
f_stats, p_value = scipy.stats.f_oneway(lower, middle, upper, axis = 0)

# print the result of the ANOVA
print('F-Statistic={0}, P-value={1}'.format(f_stats, p_value))

Since the P-value = 1.7105011022702984e-15 is less than the alpha value of 0.05, we reject the Null Hypothesis --> there IS a statistically significant difference in mean MEDV between at least two of the age groups.
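To make clear what `f_oneway` computes, the F-statistic can be rebuilt by hand as the ratio of the between-group mean square to the within-group mean square. The sketch below does this for three small made-up groups (not the housing data) and compares the result with scipy; `one_way_f` is a hypothetical helper name.

```python
import numpy as np
import scipy.stats

def one_way_f(*groups):
    """Compute the one-way ANOVA F-statistic and p-value from sums of squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)            # number of groups
    n_total = all_values.size  # total number of observations
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    # upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom
    p_value = scipy.stats.f.sf(f_stat, k - 1, n_total - k)
    return f_stat, p_value

# made-up values (assumption), one list per group
low = [30.0, 28.0, 33.0, 27.0]
mid = [24.0, 26.0, 22.0, 25.0]
high = [18.0, 20.0, 17.0, 21.0]

f_manual, p_manual = one_way_f(low, mid, high)
f_scipy, p_scipy = scipy.stats.f_oneway(low, mid, high)
print('manual: F={0}, p={1}'.format(f_manual, p_manual))
print('scipy:  F={0}, p={1}'.format(f_scipy, p_scipy))
```

A significant F only says that at least one group mean differs; it does not say which groups differ from each other.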

Next task:
Can we conclude that there is no relationship between Nitric oxide concentrations and proportion of non-retail business acres per town? (Pearson Correlation)

Null Hypothesis: NOX is not correlated with INDUS

Alternative Hypothesis: NOX is correlated with INDUS

In [None]:
# applying pearsons to 'NOX' and 'INDUS'
pearson,p_value = scipy.stats.pearsonr(df['NOX'],df['INDUS'])

# print the result of pearsons
print('Pearson Coefficient value={0}, P-value={1}'.format(pearson, p_value))

Since the P-value = 7.913361061240043e-98 is less than the alpha value of 0.05, we reject the Null Hypothesis.
We therefore conclude that there is a relationship between Nitric oxide concentrations and the proportion of non-retail business acres per town.
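For reference, the Pearson coefficient is the sum of the products of the centred values divided by the square root of the product of their sums of squares. The sketch below computes it by hand on made-up numbers (not NOX and INDUS) and checks it against `scipy.stats.pearsonr`; `pearson_r` is a hypothetical helper name.

```python
import numpy as np
import scipy.stats

def pearson_r(x, y):
    """Pearson correlation coefficient computed from centred values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# made-up, nearly linear values (assumption)
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]

r_manual = pearson_r(x_demo, y_demo)
r_scipy, p_scipy = scipy.stats.pearsonr(x_demo, y_demo)
print('manual r={0}, scipy r={1}, p={2}'.format(r_manual, r_scipy, p_scipy))
```

The sign of the coefficient gives the direction of the relationship and its magnitude the strength; the p-value only addresses whether the coefficient differs from zero.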

Next Task:
What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner occupied homes? (Regression analysis)

Null Hypothesis: Weighted distances to five Boston employment centres are not related to median value

Alternative Hypothesis: Weighted distances to five Boston employment centres are related to median value

In [None]:
# defining two series objects, based on columns 'MEDV' and 'DIS'
y = df['MEDV']
x = df['DIS']

# adding a column of ones to x for evaluation
x = sm.add_constant(x)

# applying Ordinary Least Squares to y and x
results = sm.OLS(y,x).fit()

# summarizing the results of the OLS
results.summary()

In [None]:
# from the results above, we see that R-squared is 0.062 --> need to have R --> calculate the square root of R-squared
np.sqrt(0.062)

The square root of R-squared is 0.24899799195977465 --> there is only a weak correlation between both features
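The question asks for the impact of one additional unit of distance, which is what the slope coefficient of the regression expresses; the square root of R-squared only describes how strong the linear association is. The sketch below shows both quantities side by side. It uses `scipy.stats.linregress` as a stand-in for `sm.OLS`, and the six data points are made up, so the numbers say nothing about DIS and MEDV.

```python
import numpy as np
import scipy.stats

# made-up predictor and response values (assumption)
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_demo = np.array([12.0, 13.5, 14.0, 16.5, 17.0, 19.0])

# simple linear regression y = intercept + slope * x
fit = scipy.stats.linregress(x_demo, y_demo)
r_squared = fit.rvalue ** 2

print('slope     = {0}'.format(fit.slope))      # change in y per additional unit of x
print('intercept = {0}'.format(fit.intercept))
print('R-squared = {0}'.format(r_squared))
print('|r|       = {0}'.format(np.sqrt(r_squared)))
print('p-value   = {0}'.format(fit.pvalue))     # test of slope == 0
```

In the fitted statsmodels model above, the corresponding values are the `DIS` entries of `results.params` and `results.pvalues`.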

In [None]:
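# using correlation function to find the pairwise correlation of all numeric columns in the data frame
# (the text column 'age_group' added earlier is excluded, because recent pandas versions raise an error for non-numeric columns)
df.select_dtypes(include = 'number').corr()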

In [None]:
# Last Task =)
# Creating heatmap
# defining figure size
plt.figure(figsize=(20,15))

# creating heat map based on the correlation matrix of the numeric columns, using blue-green colour mapping, setting the output to two digits
sns.heatmap(df.select_dtypes(include = 'number').corr(), cmap = 'crest', annot = True, fmt ='.2f', linewidths = 1, cbar = True)

# ... plotting
plt.show()