<a href="https://colab.research.google.com/github/mattwassif/NeighborhoodProject/blob/main/Copy_of_F23_HC7%268_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HC7 Data Exploration and Cleaning

**Our Bronx Notebook**

Mitchell Lipyansky

Anthony Tommaso

Matthew Wassif

Isabel Arce


## These are the three libraries that we import in order to load and visualize our data:

---


"pandas" is used to load and work with tabular data sets.

"matplotlib.pyplot" is used for plotting.

Finally, "gdown" is a tool for downloading files from Google Drive into the Python environment. In this notebook the dataset itself is fetched with `wget`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import gdown

In [None]:
# Download the dataset CSV from the web
!wget https://huntercsci127.github.io/files/clean_heat_dataset.csv

In [None]:
# List the files in the current directory to confirm the file is there
!ls

In [None]:
# Read the CSV into a DataFrame.
clean_heat = pd.read_csv("clean_heat_dataset.csv")

print("The dimension of the table is: ", clean_heat.shape)

In [None]:
print("Number of data points with a null entry for each column:\n", clean_heat.isnull().sum())
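Raw null counts are hard to compare when the table has many rows, so it can help to express them as a percentage of all rows. The sketch below does this on a small made-up table; the values are hypothetical and are not taken from `clean_heat_dataset.csv`.

```python
import pandas as pd

# Toy table with hypothetical values (not from the real dataset)
toy = pd.DataFrame({
    "Owner": ["A", None, "C", None],
    "Year Built": [1925, 1931, None, 1960],
})

# isnull() gives True/False per cell; the mean of booleans is the
# fraction of True values, so multiplying by 100 gives a percentage.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)
```

The same expression should work on `clean_heat` in place of `toy`.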


In [None]:
clean_heat['BIN'].value_counts()

In [None]:
columns= ['Borough, Block, Lot #', 'Street Address', 'Postcode', 'Borough', 'Utility', 'Building Manager', 'Owner', 'Owner Address', 'Owner Telephone', 'DEP Boiler Application #', '#6 Deadline', 'Boiler Model', '# of Identical Boilers', 'Boiler Capacity (Gross  BTU)', 'Boiler Installation Date', 'Boiler Age Range', 'Est. Retirement Year', 'Burner Model', 'Primary Fuel', 'Total Gallons (High)', 'Total Gallons (Low)', 'Total MMBTU (High)', 'Total MMBTU (low)', 'Greener Greater Buildings', 'GGB Deadline', 'Building Type', 'Council District', 'Community Board', 'Bldg Sqft', '# of Bldgs', '# of Floors', '# of Res. Units', 'Total Units', 'Year Built', 'Condo?', 'Coop?', 'Latitude', 'Longitude', 'Census Tract', 'BIN', 'BBL', 'NTA']
cat_columns= ['Street Address', 'Borough', 'Utility', 'Building Manager', 'Owner','Owner Address','Owner Telephone', 'DEP Boiler Application #','Boiler Model','Boiler Age Range', 'Burner Model', 'Primary Fuel', 'Building Type', 'Community Board', 'Bldg Sqft', 'Condo?', 'Coop?', 'NTA']
num_columns=['Borough, Block, Lot #','Postcode','#6 Deadline','# of Identical Boilers','Boiler Capacity (Gross  BTU)', 'Boiler Installation Date',  'Est. Retirement Year', 'Total Gallons (Low)', 'Total MMBTU (High)', 'Total MMBTU (low)', 'Greener Greater Buildings', 'GGB Deadline', 'Council District', '# of Bldgs', '# of Floors', '# of Res. Units', 'Total Units', 'Year Built','Latitude', 'Longitude', 'Census Tract', 'BIN', 'BBL']

In [None]:
clean_heat[num_columns]=clean_heat[num_columns].fillna(value=0)

In [None]:
clean_heat[cat_columns]=clean_heat[cat_columns].fillna(value="")
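One caveat with filling numeric columns with 0: any later average over a column such as 'Year Built' will count those placeholder zeros as real values and be pulled downward. The sketch below illustrates the effect on made-up numbers (hypothetical, not from the real dataset) and shows one possible way to ignore the placeholders when averaging.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical numbers)
toy = pd.DataFrame({"Year Built": [1925.0, 1931.0, np.nan, 1960.0]})

# After zero-filling, the placeholder 0 is averaged like a real year
filled = toy["Year Built"].fillna(0)
mean_with_zeros = filled.mean()

# Turning 0 back into NaN lets mean() skip the placeholder
mean_without_zeros = filled.replace(0, np.nan).mean()

print("Mean including zeros:", mean_with_zeros)
print("Mean ignoring zeros:", round(mean_without_zeros))
```

This assumes that 0 is never a legitimate value in the column, which seems reasonable for a construction year but may not hold for every numeric column.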

# Now, we will dive deeper with our data!

We are going to further analyze the data and manipulate it in order to convey results pertaining to the borough of the Bronx.

In [None]:
# General data exploration
print(clean_heat.head())  # Showing the first few rows to understand the data structure
print(clean_heat.describe())  # Summary statistics for the dataset
print(clean_heat.isnull().sum())  # Checking for missing values


In [None]:
# Analyzing the Bronx
bronx_data = clean_heat[clean_heat['Borough'] == 'Bronx']
print(bronx_data.describe())  # Summary statistics for the Bronx

In [None]:
# Average construction year of buildings in the Bronx
average_age_bronx = bronx_data['Year Built'].mean()
print("Average year built in the Bronx:", average_age_bronx.round())

We notice that the **average construction year** of buildings in the Bronx is about **1929**, which means the typical building in this dataset is over 90 years old. This may be a major contributing factor to the health of children and elderly residents in areas like Melrose or Mott Haven, who face an increased risk of asthma. The materials used in buildings this old may also release harmful or cancer-causing pollutants into the air.

In [None]:
# Building Type distribution in the Bronx
bronx_building_type = bronx_data['Building Type'].value_counts(normalize=True)
print("Building Type distribution in the Bronx:")
print(bronx_building_type)

Concerningly, the reader may notice that *"Hospitals and Health"* buildings are towards the ***bottom*** of the list. This may help explain why so many people in the Bronx have health issues that go untreated, if there are few hospitals nearby. There is also an **alarming number of Factory & Industrial buildings**, which suggests an increased risk of CO2 emissions.
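Because `value_counts(normalize=True)` returns only proportions, it can be useful to look at the raw counts next to the shares before drawing conclusions from the ranking. A minimal sketch on made-up building types (the labels here are hypothetical examples, not confirmed values from the dataset):

```python
import pandas as pd

# Toy column of building types (hypothetical labels)
types = pd.Series([
    "Elevator Apartments",
    "Elevator Apartments",
    "Walk-Up Apartments",
    "Factory & Industrial",
])

# Put raw counts and normalized shares side by side; pandas aligns
# the two Series on their shared index (the building type).
summary = pd.DataFrame({
    "count": types.value_counts(),
    "share": types.value_counts(normalize=True),
})
print(summary)
```

A category with a small share but a large count, or the reverse, reads very differently, so seeing both helps.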

In [None]:
# Average boiler installation date in the Bronx
average_boiler_age_bronx = bronx_data['Boiler Installation Date'].mean()
print("Average boiler installation date in the Bronx:", average_boiler_age_bronx.round())

In [None]:
# Average MMBTU totals in the Bronx
average_mmbtu_bronx = bronx_data['Total MMBTU (low)'].mean()
print("Average MMBTU totals in the Bronx:", average_mmbtu_bronx)

In [None]:
# Average boiler capacity in the Bronx
average_boiler_capacity_bronx = bronx_data['Boiler Capacity (Gross  BTU)'].mean()
print("Average boiler capacity in the Bronx:", average_boiler_capacity_bronx)

In [None]:
# Distribution of primary fuel used in the Bronx
primary_fuel = bronx_data['Primary Fuel'].value_counts(normalize=True)
print("Primary fuel type distribution in the Bronx:")
print(primary_fuel)

Number 4 and number 6 fuel oils are derived from petroleum and are used in heating systems throughout the borough. With number 4 being the more widely used oil, it greatly contributes to the ongoing chemical and air pollution within the Bronx. As shown in our group's HC3, "In NYC, these oils were identified as significant contributors to pollution, being responsible for 86% of soot pollution despite being used in only 1% of buildings." This shows that oils 4 and 6 are heavy contributors to pollution, and both are in use in the Bronx.
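To put a single number on how common the two heavy oils are together, one option is to test each row's fuel against a list and average the result. The sketch below assumes the fuels are labeled "#4" and "#6"; those labels are an assumption and should be checked against the actual values in the 'Primary Fuel' column.

```python
import pandas as pd

# Toy fuel column; the labels "#2", "#4", "#6" are assumed, not confirmed
fuel = pd.Series(["#4", "#6", "#4", "#2", "#4"])

# isin() marks rows whose fuel is in the list; the mean of the
# resulting booleans is the combined share of those fuels.
heavy_oil_share = fuel.isin(["#4", "#6"]).mean()
print(f"Share of boilers on #4 or #6 oil: {heavy_oil_share:.0%}")
```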

In [None]:
# Distribution of burner models in the Bronx
burner_model = bronx_data['Burner Model'].value_counts(normalize=True)
print("Burner model distribution in the Bronx:")
print(burner_model)

# Now that we have given general information, we are going to manipulate the data in order to show graphs for both the borough & individual neighborhoods!

---



In [None]:
# Select only the rows pertaining to the Bronx and count them.
st = clean_heat[clean_heat['Borough'].isin(['Bronx'])]
print("Number of entries in the Bronx: ", len(st))

In [None]:
boro_group = clean_heat.groupby(['Borough'])
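The `groupby` call above splits the table into one group per borough, and any later aggregation (such as `.mean()` or `.max()`) is computed separately within each group. A small sketch with made-up numbers (hypothetical, not from the real dataset):

```python
import pandas as pd

# Toy table with hypothetical gallon figures
toy = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Manhattan", "Queens"],
    "Total Gallons (Low)": [100.0, 300.0, 500.0, 50.0],
})

# One average per borough; the result is a Series indexed by borough
avg_by_boro = toy.groupby("Borough")["Total Gallons (Low)"].mean()
print(avg_by_boro)
```

The resulting Series is what `.plot.bar()` draws, with one bar per index label.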

In [None]:
boro_group['Total Gallons (Low)'].mean().plot.bar()
plt.title('Average Total Gallons (Low)')
plt.xlabel('Borough')
plt.ylabel('Gallons')


In [None]:
boro_group['Total MMBTU (low)'].mean().plot.bar()
plt.title('Average Total MMBTU (low)')
plt.xlabel('Borough')
plt.ylabel('MMBTU')

In [None]:
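# Each slice is a borough's average Total Units as a share of the sum of averages
avg_units = boro_group['Total Units'].mean()
avg_units.plot.pie(autopct='%1.1f%%')
plt.title('Average Total Units by Borough')
plt.ylabel('')  # Set after plotting so the default axis label is cleared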

In [None]:
boro_group['# of Floors'].max().plot.bar()
plt.title('Max Number of Floors')
plt.xlabel('Borough')
plt.ylabel('Number of Floors')


From this bar graph, we can see that Manhattan has the tallest building of any borough by number of floors. Interestingly enough, the Bronx is second, with a maximum of around 30 floors. From this, we may tentatively assume that both Manhattan and the Bronx suffer from heavy air pollution, since large buildings can contribute to pollutants being spread through the environment. This, combined with other emissions, according to our group's HC4 Emissions Report, contributes to "approximately 11% of the local fine particulate matter and 28% of the nitrogen oxide emissions."

## Data for Bronx Neighborhoods

In [None]:
neighborhoods = ['Fordham South', 'West Farms-Bronx River', 'Bronxdale', 'Pelham Bay-Country Club-City Island', 'Westchester-Unionport']

# Filter the DataFrame to include only the specific neighborhoods
NTA_data = clean_heat[clean_heat['NTA'].isin(neighborhoods)]
NTA_group = NTA_data.groupby(['NTA'])
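Since `isin` silently ignores names that do not appear in the data, a misspelled neighborhood would simply vanish from the charts. The sketch below shows one possible sanity check on made-up rows; "Not A Real NTA" is a deliberately fake name used for illustration.

```python
import pandas as pd

# Names we want to keep; the last one is intentionally fake
requested = ["Fordham South", "Bronxdale", "Not A Real NTA"]

# Toy table with hypothetical rows
toy = pd.DataFrame({
    "NTA": ["Fordham South", "Bronxdale", "Bronxdale",
            "Melrose South-Mott Haven North"],
})

# Keep only the requested neighborhoods
subset = toy[toy["NTA"].isin(requested)]

# Any requested name that matched no rows is reported here
missing = sorted(set(requested) - set(subset["NTA"]))
print("Rows kept:", len(subset))
print("Requested but not found:", missing)
```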



In [None]:
NTA_group['# of Res. Units'].mean().plot.bar()
plt.title("Average Number of Residential Units")
plt.xlabel('Neighborhood')
plt.ylabel('Residential Units')

From this bar graph, we can see that Bronxdale has the most housing for residents, with an average of well over 80 residential units per building. Fordham South, on the other hand, exhibits the lowest amount of available housing. Fordham South is located in the South Bronx, which, according to our findings in HC4, often has increasingly concerning traffic congestion. This may suggest that there are too many highways and roads, but not enough space for residents. These roads also contribute heavily to CO2 emissions, affecting around 17% of young children in the South Bronx neighborhoods.

In [None]:
NTA_group['Total MMBTU (low)'].max().plot.bar()
plt.title('Maximum Total MMBTU (low)')
plt.xlabel('Neighborhoods')
plt.ylabel('MMBTU Output')

Bronxdale's alarmingly high MMBTU suggests that there is a high demand for heat and a high volume of residents living there. However, high MMBTU contributes to various types of pollution, such as greenhouse gas emissions, land pollution (waste disposal), indoor air pollution, and chemical pollution, since chemicals are constantly being released into the air. This may also support the idea we discussed in HC4, where the Bronx is lagging behind the other boroughs in terms of its emissions goal. Last time we researched, it was determined that the Bronx reached a mere 7% dec

In [None]:
NTA_group['Boiler Capacity (Gross  BTU)'].mean().plot.bar()
plt.title('Average Boiler Capacity')
plt.xlabel('Neighborhoods')
plt.ylabel('Gross BTU Boiler Capacity')
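The three neighborhood charts above each use a different statistic (a mean, a maximum, and a mean again). One way to view them side by side is a single summary table built with named aggregation. This is only a sketch on made-up numbers; the output column names `avg_res_units` and `max_mmbtu_low` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy table with hypothetical values for two neighborhoods
toy = pd.DataFrame({
    "NTA": ["Bronxdale", "Bronxdale", "Fordham South"],
    "# of Res. Units": [80, 100, 40],
    "Total MMBTU (low)": [900.0, 1500.0, 600.0],
})

# Named aggregation: new_column=(source_column, statistic)
summary = toy.groupby("NTA").agg(
    avg_res_units=("# of Res. Units", "mean"),
    max_mmbtu_low=("Total MMBTU (low)", "max"),
)
print(summary)
```

Applied to `NTA_data`, the same pattern would give one row per neighborhood with each statistic in its own column.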