## Exercise: Analyzing and Visualizing Data with Matplotlib, Pandas, and NumPy

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The dataset has the following columns:

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's

### Part 1: Data Preparation

In this part of the exercise, you will use Pandas and NumPy to load and preprocess data. For this exercise, we will use a dataset of housing prices in Boston.

1.  Download the Boston Housing dataset from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data) and save it to your local computer.
2.  Load the dataset into a Pandas DataFrame and examine its contents. The dataset should have 506 rows and 14 columns.
3.  Use NumPy to calculate some basic statistics about the dataset, such as the mean, median, standard deviation, minimum, and maximum values of each column.
4.  Clean the dataset by removing any missing or invalid values. You can use Pandas' `dropna()` method for this.
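
The steps above could be sketched as follows. This is a minimal sketch, assuming the downloaded file is whitespace-delimited with no header row, so the column names are supplied from the description at the top of this exercise. The helper names `load_housing`, `summarize`, and `clean`, and the local file name in the usage comment, are hypothetical.

```python
import numpy as np
import pandas as pd

# Column names from the dataset description; assumed absent from the raw file.
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def load_housing(path_or_buffer):
    """Read the whitespace-delimited housing file into a DataFrame."""
    return pd.read_csv(path_or_buffer, sep=r"\s+", header=None, names=COLUMNS)


def clean(df):
    """Drop rows that contain missing values and renumber the index."""
    return df.dropna().reset_index(drop=True)


def summarize(df):
    """Compute per-column statistics with NumPy on the underlying array."""
    values = df.to_numpy(dtype=float)
    return pd.DataFrame(
        {
            "mean": np.mean(values, axis=0),
            "median": np.median(values, axis=0),
            "std": np.std(values, axis=0, ddof=1),  # sample standard deviation
            "min": np.min(values, axis=0),
            "max": np.max(values, axis=0),
        },
        index=df.columns,
    )


# Usage (hypothetical local file name):
# df = clean(load_housing("housing.data"))
# print(df.shape)       # the exercise expects 506 rows and 14 columns
# print(summarize(df))
```

Calling `clean` before `summarize` matters because the plain NumPy reductions return `nan` for any column that still contains a missing value.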



In [None]:
!pip install pandas numpy matplotlib

### Part 2: Data Visualization

In this part of the exercise, you will use Matplotlib to create some visualizations of the housing data.

1.  Create a scatter plot of the housing prices (`MEDV`) versus the number of rooms (`RM`). Label the axes and add a title to the plot.
2.  Create a histogram of the housing prices (`MEDV`). Label the axes and add a title to the plot.
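3.  Create a box plot of the housing prices (`MEDV`) grouped by the proximity to employment centers (`DIS`). Because `DIS` is a continuous variable, bin it into a few ranges first (for example with Pandas' `cut()` or `qcut()`). Label the axes and add a title to the plot.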
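
One possible way to draw the three plots is sketched below. It assumes `df` is a DataFrame with the column names listed at the top of this exercise. The function names are hypothetical, and splitting `DIS` into quartiles with `pd.qcut` is an illustrative choice, not something the exercise prescribes.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_price_vs_rooms(df):
    """Scatter plot of MEDV against RM."""
    fig, ax = plt.subplots()
    ax.scatter(df["RM"], df["MEDV"], alpha=0.6)
    ax.set_xlabel("Average rooms per dwelling (RM)")
    ax.set_ylabel("Median home value (MEDV, thousands of USD)")
    ax.set_title("MEDV vs. RM")
    return fig


def plot_price_histogram(df, bins=30):
    """Histogram of MEDV."""
    fig, ax = plt.subplots()
    ax.hist(df["MEDV"], bins=bins)
    ax.set_xlabel("Median home value (MEDV, thousands of USD)")
    ax.set_ylabel("Number of rows")
    ax.set_title("Distribution of MEDV")
    return fig


def plot_price_by_distance(df, n_bins=4):
    """Box plot of MEDV for each quantile bin of DIS (binning is an assumption)."""
    dis_bin = pd.qcut(df["DIS"], q=n_bins)
    labels, groups = [], []
    for interval, prices in df["MEDV"].groupby(dis_bin, observed=True):
        labels.append(str(interval))
        groups.append(prices.to_numpy())
    fig, ax = plt.subplots()
    ax.boxplot(groups)
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    ax.set_xlabel("Weighted distance to employment centres (DIS, binned)")
    ax.set_ylabel("Median home value (MEDV, thousands of USD)")
    ax.set_title("MEDV by DIS bin")
    return fig
```

Each function returns the figure instead of displaying it, so in a script you would call `plt.show()` afterwards; in a notebook the figure usually renders on its own.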

### Part 3: Data Analysis

In this part of the exercise, you will use Pandas and NumPy to perform some basic data analysis on the housing data.

1.  Calculate the correlation coefficient between the housing prices (`MEDV`) and the number of rooms (`RM`) using NumPy's `corrcoef()` function.
2.  Calculate the mean housing price (`MEDV`) separately for tracts that bound the Charles River and tracts that do not (`CHAS`). You can use Pandas' `groupby()` method for this.
3.  Calculate the percentage of rows in which the average number of rooms per dwelling (`RM`) is greater than 7. You can use Pandas' `value_counts()` method for this.
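
The three calculations could look roughly like this. It is a sketch that assumes `df` carries the column names from the dataset description; the function names and the two string labels used for counting are hypothetical.

```python
import numpy as np
import pandas as pd


def price_room_correlation(df):
    """Pearson correlation between RM and MEDV via np.corrcoef."""
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
    return float(np.corrcoef(df["RM"], df["MEDV"])[0, 1])


def mean_price_by_river(df):
    """Mean MEDV for each value of the CHAS dummy variable (0 or 1)."""
    return df.groupby("CHAS")["MEDV"].mean()


def pct_more_than_7_rooms(df):
    """Percentage of rows whose RM value is greater than 7."""
    labels = pd.Series(np.where(df["RM"] > 7, "more than 7", "7 or fewer"))
    shares = labels.value_counts(normalize=True)  # fractions that sum to 1
    return 100.0 * float(shares.get("more than 7", 0.0))
```

Passing `normalize=True` makes `value_counts()` return fractions rather than raw counts, so the percentage is just the matching fraction times 100.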

### Part 4: Putting It All Together

In this final part of the exercise, you will use all three libraries to create a complete data analysis and visualization pipeline.

1.  Load the dataset and preprocess it as before.
2.  Create the scatter plot and histogram as before.
3.  Calculate the correlation coefficient between the housing prices and the number of rooms as before.
4.  Group the data by proximity to employment centers (`DIS`, binned into ranges) and calculate the mean housing price (`MEDV`) for each group.
5.  Create a bar chart of the mean housing prices by proximity to employment centers. Label the axes and add a title to the plot.
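
A compact version of the whole pipeline might look like the sketch below. It assumes a whitespace-delimited input file without a header row. The function name `run_pipeline`, the quartile binning of `DIS`, and the three-panel layout are illustrative choices rather than requirements of the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Column names from the dataset description; assumed absent from the raw file.
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def run_pipeline(source, n_bins=4):
    """Load, clean, analyze, and plot; returns (correlation, group means, figure)."""
    # 1. Load and preprocess.
    df = pd.read_csv(source, sep=r"\s+", header=None, names=COLUMNS).dropna()

    fig, (ax_scatter, ax_hist, ax_bar) = plt.subplots(1, 3, figsize=(15, 4))

    # 2. Scatter plot and histogram.
    ax_scatter.scatter(df["RM"], df["MEDV"], alpha=0.6)
    ax_scatter.set_xlabel("RM")
    ax_scatter.set_ylabel("MEDV (thousands of USD)")
    ax_scatter.set_title("MEDV vs. RM")
    ax_hist.hist(df["MEDV"], bins=30)
    ax_hist.set_xlabel("MEDV (thousands of USD)")
    ax_hist.set_ylabel("Number of rows")
    ax_hist.set_title("Distribution of MEDV")

    # 3. Correlation between rooms and price.
    corr = float(np.corrcoef(df["RM"], df["MEDV"])[0, 1])

    # 4. Mean price per DIS bin (quantile bins are an assumption).
    dis_bin = pd.qcut(df["DIS"], q=n_bins)
    mean_by_dis = df.groupby(dis_bin, observed=True)["MEDV"].mean()

    # 5. Bar chart of the group means.
    ax_bar.bar([str(interval) for interval in mean_by_dis.index],
               mean_by_dis.to_numpy())
    ax_bar.set_xlabel("DIS (binned)")
    ax_bar.set_ylabel("Mean MEDV (thousands of USD)")
    ax_bar.set_title("Mean MEDV by DIS bin")
    ax_bar.tick_params(axis="x", rotation=30)

    fig.tight_layout()
    return corr, mean_by_dis, fig


# Usage (hypothetical local file name):
# corr, mean_by_dis, fig = run_pipeline("housing.data")
# plt.show()
```

Returning the intermediate results alongside the figure keeps the numbers available for inspection without rerunning the plots.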