#### CIS 9 - Lab 1

Topics: Review Python, Jupyter Notebook, Numpy

In [None]:
# Don't forget to add your name so you can get credit for your work
# Name: Nitya Kashyap

For this lab you're analyzing CA housing data. The data came from the US Census Bureau, was uploaded to [Kaggle](https://www.kaggle.com/datasets/shibumohapatra/house-price), and was then further prepared as 3 CSV files.

1. `CAhousing.csv` has the housing data, which is in table format with multiple rows and columns.<br>
Each row is for a district or block, which is the smallest geographical unit for the census bureau.<br>
2. `header.csv` has the text strings which are the column headers. The strings are the description of data in each column.<br>
3. `location.csv` contains the type of location of each district (each row).

NOTE: This lab is an exercise in using numpy, so <u>do not use pandas for this lab</u>. There will be opportunities to use pandas soon; submissions of this lab that use pandas will receive no credit.

In [42]:
# import modules
import numpy as np

1. __Read data from `CAhousing.csv`__ into an appropriate container.<br>
__Print the number of rows and columns of data__ along with an explanation for the numbers.

In [43]:
data = np.genfromtxt("CAhousing.csv", delimiter=",", dtype=float)
# print(data.shape)
print("number of rows:", data.shape[0], "\nnumber of columns:", data.shape[1])

number of rows: 20433 
number of columns: 6


Note: By default, `genfromtxt` assumes `delimiter=None`, meaning that each line is split on whitespace (including tabs) and that consecutive whitespace characters are treated as a single separator. Alternatively, the file may be fixed-width, where each column is defined as a given number of characters. [Source](https://numpy.org/doc/stable/user/basics.io.genfromtxt.html#)
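To make the note above concrete, here is a minimal sketch of how the `delimiter` argument changes the way `genfromtxt` splits a line. The two in-memory samples (`csv_text`, `ws_text`) are made up for illustration and are not taken from `CAhousing.csv`.

```python
import io
import numpy as np

# Hypothetical comma-separated sample (not the real lab data)
csv_text = "41,880,322\n21,7099,2401\n52,1467,496\n"
comma_data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=float)
print(comma_data.shape)  # (3, 3)

# Hypothetical whitespace-separated sample: with the default delimiter=None,
# runs of spaces and tabs count as a single separator
ws_text = "41 880   322\n21\t7099 2401\n"
ws_data = np.genfromtxt(io.StringIO(ws_text), dtype=float)
print(ws_data.shape)  # (2, 3)
```

Because the lab files are comma-separated, the `delimiter=","` argument is what lets each row split into its separate columns.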

2. __Read data from `header.csv`__ into an appropriate container.<br>
__Print the number of rows and columns of data__ to confirm that the dimension matches the housing data above.<br>
Then __print the data__ so you can see what the column headers are.

In [147]:
headers = np.genfromtxt("header.csv", delimiter=",", dtype=str)
# print(headers)
# print(headers.shape)
print("number of rows:", headers.shape[0])
print("\nColumn Headers:\n", "\n".join(headers), sep="")

number of rows: 6

Column Headers:
house_median_age
square_feet
population
households
median_income
median_house_value


3a. First we take a look at the age of the houses.

__Print the highest, lowest, and median of the ages__ of all the houses.

_Print an explanation along with the values, don't just print 3 numbers_.
_And print the median with 1 digit after the decimal point_.

In [45]:
# create a view of only the age column
ages = data[:,0] # 1-D view of column 0; data[:,:1] would give a 2-D (n, 1) array instead
# print(ages)
print(f"highest age: {ages.max():0.1f} years \nlowest age: {ages.min():0.1f} years \nmedian age: {np.median(ages):0.1f} years")

highest age: 52.0 years 
lowest age: 1.0 years 
median age: 29.0 years
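As a side note on the column slice used above, the sketch below shows the difference between indexing a column with an integer and with a one-element slice. The array `sample` is a made-up stand-in for the housing data.

```python
import numpy as np

# Made-up rows standing in for the housing table
sample = np.array([[41.0, 880.0],
                   [21.0, 7099.0],
                   [52.0, 1467.0]])

col_1d = sample[:, 0]    # integer index: 1-D array
col_2d = sample[:, :1]   # slice index: 2-D array with a single column

print(col_1d.shape)  # (3,)
print(col_2d.shape)  # (3, 1)

# Basic slicing returns a view, so no data is copied
print(np.shares_memory(sample, col_1d))  # True

print(col_1d.max(), col_1d.min(), np.median(col_1d))  # 52.0 21.0 41.0
```

Both forms give the same statistics here; the 1-D form is simply more convenient for later element-wise arithmetic between columns.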


3b. Create a Raw NBConvert cell to __<u>explain</u> whether the houses tend to be older or newer__. Your explanation should discuss the median value.

4a. Now we investigate the population.<br>
The _population_ column shows how many people are in a district, and the _households_ column shows how many households are in the district.

__Find the number of persons per household__.<br>
Then __print the mean and the standard deviation of the number of persons per household__.

_Print an explanation along with the numbers, and the numbers should have 1 digit after the decimal point._

In [46]:
population = data[:,2]
households = data[:,3]
people_per_household = population/households
print(f"mean persons per household: {np.mean(people_per_household):0.1f} persons/household \nstandard deviation of the number of persons per household: {np.std(people_per_household):0.1f} persons/household")

mean persons per household: 3.1 persons/household 
standard deviation of the number of persons per household: 10.4 persons/household


4b. Do the numbers in step 4a show that the persons-per-household data has a large or small spread?<br>
Create a Raw NBConvert cell to __show your answer and explain your reasoning__.

4c. __Find the 25th, 50th, and 75th percentiles__ of the persons per household.

In [47]:
print(f"25th percentile: {np.percentile(people_per_household, 25):0.1f} persons per household")
print(f"50th percentile (median): {np.percentile(people_per_household, 50):0.1f} persons per household")
print(f"75th percentile: {np.percentile(people_per_household, 75):0.1f} persons per household")

25th percentile: 2.4 persons per household
50th percentile (median): 2.8 persons per household
75th percentile: 3.3 persons per household


4d. Based on the output of step 4c above, would you expect the standard deviation to be large?<br>
What might be a reason why the standard deviation is large?

Create a Raw NBConvert cell to __answer the 2 questions above__.
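One general property that may be relevant when thinking about these questions: the standard deviation is sensitive to extreme values, while percentiles are not. The sketch below uses made-up ratios (`typical`, plus one hypothetical extreme value of 600.0) to illustrate this; it does not show what is actually in the housing data.

```python
import numpy as np

# Made-up persons-per-household ratios for nine ordinary districts
typical = np.array([2.2, 2.4, 2.5, 2.7, 2.8, 2.9, 3.0, 3.2, 3.4])

# The same ratios plus one hypothetical extreme district
with_outlier = np.append(typical, 600.0)

for label, values in (("typical only", typical), ("with outlier", with_outlier)):
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    print(f"{label:13s} std={np.std(values):6.1f}  "
          f"quartiles={q25:.1f}, {q50:.1f}, {q75:.1f}")
```

In this toy sample, the single extreme value raises the standard deviation from under 1 to well over 100, while the quartiles barely move. Checking `people_per_household.max()` on the real data would be one way to see whether something similar applies there.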

5. We want to see if the districts with the lowest incomes have lower housing prices than the districts with the highest incomes.

5a. __Find the lowest income, the count of districts with lowest incomes, and the house prices for these districts__.<br>
Print the 3 results with explanation.

In [161]:
incomes = data[:,4]

lowest = incomes.min()
print(f"The lowest median income in the data set is {lowest:.2f}.")

indices_of_lowest = np.where(incomes==lowest)
# print(indices_of_lowest)

print(f"The number of districts with the lowest median income in the data set is {np.size(indices_of_lowest)} districts.")


house_values = data[:,5]
lowest_income_house_values = house_values[indices_of_lowest]
# ^^ gives the same values as:
# lowest_income_house_values = data[indices_of_lowest[0], 5]
# (np.where returns a tuple, so [0] extracts the row-index array)
# using the two step method because house values will be needed later
print("\nmedian house values for the districts with lowest median income:")
for value in lowest_income_house_values:
    print(f"{value:0.2f}")

The lowest median income in the data set is 0.50.
The number of districts with the lowest median income in the data set is 12 districts.

median house values for the districts with lowest median income:
67500.00
100000.00
73500.00
500001.00
90600.00
112500.00
500001.00
162500.00
55000.00
82500.00
56700.00
162500.00
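The selection in step 5a can also be written with a boolean mask instead of `np.where`. The following is a minimal sketch on made-up arrays (`incomes_demo`, `values_demo`), not the lab data.

```python
import numpy as np

# Made-up incomes and house values for five districts
incomes_demo = np.array([0.5, 3.2, 0.5, 15.0, 7.1])
values_demo = np.array([67500.0, 210000.0, 500001.0, 500001.0, 340000.0])

# Boolean mask: True where the income equals the minimum
is_lowest = incomes_demo == incomes_demo.min()

count = np.count_nonzero(is_lowest)   # number of True entries
selected = values_demo[is_lowest]     # house values for those districts

print(count)     # 2
print(selected)  # [ 67500. 500001.]

# np.where on the same condition returns a tuple holding the index array
row_indices = np.where(is_lowest)[0]
print(row_indices)  # [0 2]
```

Both approaches select the same elements; the mask avoids the extra step of unpacking the tuple that `np.where` returns.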


5b. __Find the highest income, the count of districts with highest incomes, and the house prices for these districts__.<br>
Print the 3 results with explanation.

In [155]:
incomes = data[:,4]
highest = incomes.max()
print(f"The highest median income in the data set is {highest:.2f}.")
indices_of_highest = np.where(incomes==highest)
# print(indices_of_highest)
print(f"The number of districts with the highest median income in the data set is {np.size(indices_of_highest)} districts.")

highest_income_house_values = house_values[indices_of_highest]
print(f"\nmedian house values for the districts with highest median income:")
for value in highest_income_house_values:
    print(f"{value:0.2f}")

The highest median income in the data set is 15.00.
The number of districts with the highest median income in the data set is 48 districts.

median house values for the districts with highest median income:
350000.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
500001.00
131300.00
400000.00


5c. __Find the highest and lowest house prices__.<br>
Print the results with explanation.

In [156]:
# house values view already created
print(f"highest house value: {house_values.max():.2f}")
print(f"lowest house value: {house_values.min():.2f}")

highest house value: 500001.00
lowest house value: 14999.00


5d. Using your results in steps 5a-c, create a Raw NBConvert cell to __explain the difference in house prices__ between the highest and lowest income districts.

6. Last, we look at the mean house prices of each location type, to see if the real estate mantra "location, location, location" is true in that the location affects the house price.

6a. __Read data from `location.csv`__ into an appropriate container.<br>
Then __print the size of the container__ to confirm that it matches the housing data dimensions<br>
and __print the location data__.

In [157]:
locations = np.genfromtxt("location.csv", delimiter=",", dtype=str)
print(f"size of container: {locations.shape}")
print(f"location data:\n{locations}")

size of container: (20433,)
location data:
['NEAR BAY' 'NEAR BAY' 'NEAR BAY' ... 'INLAND' 'INLAND' 'INLAND']


6b. From the output above, we see that the number of locations is the same as the number of rows of the housing data. Each element of the location container corresponds to 1 row of the housing data.

__Find and print the unique locations__ in the location data.

In [158]:
print(f"Unique locations: {', '.join(set(locations))}")

Unique locations: <1H OCEAN, NEAR BAY, ISLAND, NEAR OCEAN, INLAND
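Since this lab focuses on numpy, `np.unique` is an alternative to Python's `set` that stays within numpy and returns the values in sorted order, which makes the output reproducible between runs. A minimal sketch on a made-up array (`locations_demo`):

```python
import numpy as np

# Made-up location labels
locations_demo = np.array(["NEAR BAY", "INLAND", "INLAND",
                           "<1H OCEAN", "NEAR BAY", "INLAND"])

# return_counts=True also reports how many times each label occurs
unique_locations, counts = np.unique(locations_demo, return_counts=True)

print(unique_locations)  # ['<1H OCEAN' 'INLAND' 'NEAR BAY']
print(counts)            # [1 3 2]
```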


6c. From the output above, we see that there are 5 types of locations.
- NEAR BAY: next to a bay
- <1H OCEAN: less than 1 hour away from the ocean
- NEAR OCEAN: next to the ocean
- INLAND: self explanatory
- ISLAND: self explanatory

__Create 5 different views of the housing data__.<br>
Each view contains the rows of the housing data that are for one location.

In [159]:
Near_Bay_indices = np.where(locations=="NEAR BAY")
OneH_Ocean_indices = np.where(locations=="<1H OCEAN")
Near_Ocean_indices = np.where(locations=="NEAR OCEAN")
Inland_indices = np.where(locations=="INLAND")
Island_indices = np.where(locations=="ISLAND")

Near_Bay = data[Near_Bay_indices]
OneH_Ocean = data[OneH_Ocean_indices]
Near_Ocean = data[Near_Ocean_indices]
Inland = data[Inland_indices]
Island = data[Island_indices]
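A note on terminology: in numpy, indexing with an index array or a boolean mask (as in the cell above) returns a copy of the selected rows, whereas basic slicing returns a view that shares memory with the original. The sketch below checks this with `np.shares_memory` on made-up arrays (`data_demo`, `locations_demo`).

```python
import numpy as np

# Made-up 4x3 table and matching location labels
data_demo = np.arange(12, dtype=float).reshape(4, 3)
locations_demo = np.array(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])

# Boolean (or index-array) selection: the result is a copy
inland_rows = data_demo[locations_demo == "INLAND"]

# Basic slice: the result is a view
first_two = data_demo[:2]

print(inland_rows.shape)                          # (2, 3)
print(np.shares_memory(data_demo, inland_rows))   # False
print(np.shares_memory(data_demo, first_two))     # True
```

For this lab the distinction does not change the results, because the per-location arrays are only read, never modified.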

6d. __Find the mean of the house prices in each view__<br>
and __print the results__ with explanation and with numbers rounded to the nearest whole number.

_As a review of your Python skills, print the 5 results in 2 clear columns:_ `location  price`

In [163]:
print("location", " "*3, "mean price")
print("-"*23)
print(f"{'NEAR BAY':12s} {Near_Bay[:,5].mean():1.0f}")
print(f"{'<1H OCEAN':12s} {OneH_Ocean[:,5].mean():1.0f}")
print(f"{'NEAR OCEAN':12s} {Near_Ocean[:,5].mean():1.0f}")
print(f"{'INLAND':12s} {Inland[:,5].mean():1.0f}")
print(f"{'ISLAND':12s} {Island[:,5].mean():1.0f}")

location     mean price
-----------------------
NEAR BAY     259279
<1H OCEAN    240268
NEAR OCEAN   249042
INLAND       124897
ISLAND       380440
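The five near-identical print statements above could also be produced with a loop over the unique locations. This is a sketch of one possible approach on made-up arrays (`prices_demo`, `locations_demo`), not a replacement for the results shown.

```python
import numpy as np

# Made-up house prices and matching location labels
prices_demo = np.array([452600.0, 78100.0, 358500.0, 92300.0, 450000.0])
locations_demo = np.array(["NEAR BAY", "INLAND", "NEAR BAY", "INLAND", "ISLAND"])

# Collect (location, mean price) pairs, one per unique location
rows = []
for name in np.unique(locations_demo):
    mean_price = prices_demo[locations_demo == name].mean()
    rows.append((str(name), float(mean_price)))

# Print two aligned columns: location on the left, rounded mean on the right
print(f"{'location':12s} {'mean price':>10s}")
print("-" * 23)
for name, mean_price in rows:
    print(f"{name:12s} {mean_price:10.0f}")
```

The loop keeps the formatting in one place, so adding or renaming a location type would not require editing each print statement.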


6e. Does the location affect the house price?<br>
Create a Raw NBConvert cell to __explain your answer__.