In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list names of anyone you worked with on this homework.

# [ER-131] Homework 3: EDA Fire Incident Data
<br>

### Table of Contents
[Introduction](#intro)<br>
1 - [The IOU data](#data)<br>
2 - [Merging IOU and Weather Station Data](#merge)<br>
3 - [EDA](#eda)<br>
4 - [Exploring data through tables and visuals](#tables_plots)<br>
5 - [Summarizing data](#summarize)<br>

### Introduction <a id='intro'></a>

In this homework, you will investigate fire incident data from the three California Investor Owned Utilities (IOUs). The main goal for this assignment is to establish different ways to explore your data and its limitations, as well as ways to summarize and re-organize data.

We will accomplish this by utilizing exploratory data analysis (EDA) to analyze utility-reported data alongside weather data.

### Topics Covered 

* Work with different file types
* Merge dataframes and perform operations to add new columns
* View data through lens of structure, granularity, scope, temporality and faithfulness
* Perform basic data cleaning operations

### Dependencies

**Question 0:** Import NumPy, pandas, `matplotlib.pyplot`, and GeoPandas using their conventional aliases (`np`, `pd`, `plt`, and `gpd`).

In [None]:
# YOUR CODE HERE

In [None]:
# Run this cell to import a few more packages.
import csv
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

from IPython.display import display, Latex, Markdown
pd.set_option('display.max_columns', 36)

----
## Section 1: The IOU Data<a id='data'></a>

In this notebook, you'll be working with data from the [California Public Utilities Commission](https://www.cpuc.ca.gov/fireincidentsdata/). The three California IOUs (PG&E, SCE, and SDGE) are required to report fire incidents to the CPUC, along with certain characteristics of the fire and the electrical system in the area.

<br>**Question 1.1:** Look through the `data` folder and then read the Shapefiles into the notebook so we can easily work with the data. These files were retrieved from the CPUC website, and small adjustments were made in Excel to make them easy to load in the notebook. The first example (PG&E) has been done for you.

Take a look at the arguments that are passed to the `read_file` function. First, we specify the file location. We also set `index_col` to `False`, which forces a default numbered index. Alternatively, we could have passed a number to `index_col`: if we pass $n$, then pandas uses the $(n+1)^{\text{st}}$ column of the file as the index.
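
To make the `index_col` behavior concrete, here is a minimal sketch using `pd.read_csv` on a small in-memory table. The column names and values are made up for illustration and are not taken from the homework files.

```python
import io
import pandas as pd

# Hypothetical three-column CSV, used only for illustration.
csv_text = "id,utility,size\n10,PG&E,small\n20,SCE,large\n"

# index_col=False: no column is used as the index, so rows are numbered 0, 1, ...
default_index = pd.read_csv(io.StringIO(csv_text), index_col=False)

# index_col=0: the first column ("id") becomes the index.
id_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())
print(id_index.index.tolist())
```

With `index_col=0`, the `id` column no longer appears among the regular columns because it has been moved into the index.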

In [None]:
pge = gpd.read_file('data/PGEfireincidents.shp',index_col=False)

In [None]:
pge.head()

Now load the Southern California Edison ('SCEfireincidents') and San Diego Gas and Electric ('SDGEfireincidents') data.

In [None]:
#YOUR ANSWER HERE
sce = ...
sdge = ...

The Shapefile format truncates column names that exceed 10 characters, so the GeoDataFrames loaded above have shortened names. Run the cell below to replace the column names with their original names.

In [None]:
# Run this cell
raw = pd.read_csv('data/PGEfireincidents.csv')
names = list(raw.columns) + ['geometry']
names.remove('Latitude')
names.remove('Longitude')

for df in [pge, sce, sdge]:
    df.columns = names

In [None]:
sce.head()

In [None]:
sdge.head()

**Question 1.2:** What type of geometry are the objects in the `pge`, `sce`, and `sdge` GeoDataFrames?

In [None]:
# scratch work here


*YOUR ANSWER HERE*

Let's plot the three IOU datasets on the same map to get a sense of where the data lie.

In [None]:
# Run this cell
fig, ax = plt.subplots(figsize = (12,8))
pge.plot(ax = ax, label = 'PG&E', color = 'gold', marker = '*', markersize = 50)
sce.plot(ax = ax, label = 'SCE', color = 'maroon', marker = 'o', markersize = 50)
sdge.plot(ax = ax, label = 'SDG&E', color = 'teal', marker = '^', markersize = 50)
plt.legend(loc = 'lower center', bbox_to_anchor=(0.5, -4))
plt.title('Fire incidents in IOU territories')
plt.show();

Uh oh! Looks like something is wrong with our data for SDG&E.

**Question 1.3:** Examine the graph carefully and describe a possible data error that is leading to the unexpected output of our map. 

*YOUR ANSWER HERE*

**Question 1.4:** The most extreme latitude and longitude points in California are approximately as follows: 
* North: 42.0095 
* East: -114.1312
* South: 32.5341
* West: -124.4096 
 

Fill in the ellipses below to identify any points in the `sdge` data that fall outside of California. Note: the pipe symbol `|` is the element-wise equivalent of an "or" statement. In other words, the `.loc` indexer in the code block below selects the rows of the `sdge` GeoDataFrame for which *any* of the four specified criteria is true.
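
As a small illustration of combining conditions with `|`, the sketch below filters a toy DataFrame. The coordinates and thresholds are hypothetical and are not the California bounds listed above.

```python
import pandas as pd

# Hypothetical coordinates, not taken from the homework data.
points = pd.DataFrame({"x": [-120.0, -130.0, -118.0],
                       "y": [37.0, 36.0, 45.0]})

# Each comparison yields a boolean Series; | combines them element-wise.
# The parentheses are required because | binds more tightly than < and >.
outside = (points["x"] < -125.0) | (points["y"] > 42.0)

# .loc keeps only the rows where the mask is True.
flagged = points.loc[outside]
print(flagged)
```

A row is kept as soon as one condition holds, so the second row (too far west) and the third row (too far north) are both flagged.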

In [None]:
sdge.loc[(sdge.geometry.y > ...) | # Replace the ellipses to check that no points fall above the northernmost boundary, or...
         (sdge.geometry.x > ...) | # ...to the right of the easternmost boundary, or...
         (sdge.geometry.y < ...) |  # ... below the southernmost boundary, or...
         (sdge.geometry.x < ...)]   # ... to the left of the westernmost boundary.

**Question 1.5:** For the sake of expediency, `drop` the extraneous points identified in Question 1.4 from the `sdge` GeoDataFrame (an alternative could be to research each of these fires and manually correct the location). Make sure to specify the `labels`, `axis`, and `inplace` parameters correctly. You should not rename the dataframe.
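
The three `drop` parameters can be seen in action on a toy DataFrame. This is only a sketch with made-up labels and values; the rows to drop in the homework come from your Question 1.4 filter.

```python
import pandas as pd

toy = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# Index labels of the rows we want to remove (here: rows with value > 2).
bad_labels = toy.loc[toy["value"] > 2].index

# labels: which index labels to drop; axis=0: drop rows (axis=1 would drop columns);
# inplace=True: modify toy itself instead of returning a new DataFrame.
toy.drop(labels=bad_labels, axis=0, inplace=True)
print(toy)
```

Because `inplace=True` returns `None`, the result should not be assigned back to a variable.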

In [None]:
#YOUR CODE HERE

In [None]:
assert len(sdge) == 89

Let's try our plot again.

In [None]:
# Run this cell
fig, ax = plt.subplots(figsize = (12,8))
pge.plot(ax = ax, label = 'PG&E', color = 'gold', marker = '*', markersize = 50)
sce.plot(ax = ax, label = 'SCE', color = 'maroon', marker = 'o', markersize = 50)
sdge.plot(ax = ax, label = 'SDG&E', color = 'teal', marker = '^', markersize = 50)
plt.title('Fire incidents in IOU territories')
plt.legend()
plt.show()

Much better!

## Section 2: Merging IOU and Weather Station Data<a id='merge'></a>

We'll also be working with weather data from the National Oceanic and Atmospheric Administration (NOAA). [Daily Summary Data](https://www.ncdc.noaa.gov/cdo-web/datasets#GHCND) were obtained for one land-based weather station per IOU service area from January 2014 to December 2016. 

**Question 2.1**: Load the file noaa_dailysummary.csv into a Pandas dataframe.

In [None]:
# YOUR CODE HERE
weather = ...

In [None]:
weather.head()

We're going to be merging fire incident data between each IOU and a land-based weather station in that IOU's service area. There are three weather stations in the dataframe `weather`, as shown in the output below. 'SAN DIEGO INTERNATIONAL AIRPORT, CA US' is in SDG&E's service area, 'SAN FRANCISCO DOWNTOWN, CA US' is in PG&E's service area, and 'RIVERSIDE MUNICIPAL AIRPORT, CA US' is in SCE's service area. <br>

In [None]:
weather["NAME"].unique() # look at weather station values

**Question 2.2** Since we're going to use the `merge()` function, we want the fields (i.e., columns) that we merge on to have the same name. Both the IOU and weather station data have a field for date, but these columns have different names between the two datasets. Rename the "DATE" column in `weather` to match the "Fire Start Date" column in the IOU dataframes using the function `.rename()`. Set the `inplace` parameter so that you do not have to reassign the dataset to a new name.
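
For reference, `.rename()` takes a dictionary that maps old column names to new ones. The sketch below uses a toy table and a hypothetical new column name.

```python
import pandas as pd

toy_weather = pd.DataFrame({"DATE": ["2014-01-01"], "TMAX": [65]})

# columns={old: new}; inplace=True changes toy_weather directly.
# "Observation Date" is a hypothetical name chosen for this example.
toy_weather.rename(columns={"DATE": "Observation Date"}, inplace=True)
print(list(toy_weather.columns))
```

Columns that are not listed in the dictionary keep their names.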

In [None]:
# YOUR CODE HERE

Finally, run the code below to convert the data type of all the date columns to `datetime`.<br>
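
The `errors = "coerce"` argument used in the cell below tells pandas to turn unparseable entries into `NaT` (the datetime version of a missing value) rather than raising an error. A minimal sketch with made-up strings:

```python
import pandas as pd

raw_dates = pd.Series(["2015-06-01", "not a date", "2016-12-31"])

# Entries that cannot be parsed become NaT instead of stopping the conversion.
parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed.isna().tolist())
```

This is worth keeping in mind when you think about faithfulness later: coerced values silently become missing values.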

In [None]:
#Run this cell
for df in [pge, sce, sdge, weather]: # change date data type to datetime
    df["Fire Start Date"] = pd.to_datetime(df["Fire Start Date"], errors = "coerce")

**Question 2.3**: Create three new dataframes - `weather_sdge`, `weather_pge`, and `weather_sce` - that correspond to just the weather data in that IOU's service area. (Hint: the `.str.contains()` method might come in handy.)
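
Here is a short sketch of filtering rows with `.str.contains()`. The station names are hypothetical and do not appear in the homework data.

```python
import pandas as pd

# Hypothetical station names for illustration only.
stations = pd.DataFrame({
    "NAME": ["OAKLAND AIRPORT, CA US", "FRESNO AIRPORT, CA US", "OAKLAND MUSEUM, CA US"],
    "TMAX": [70, 95, 68],
})

# .str.contains returns True for every row whose NAME includes the substring.
oakland = stations[stations["NAME"].str.contains("OAKLAND")]
print(oakland)
```

The boolean Series produced by `.str.contains()` is used as a mask, just like the comparisons in Question 1.4.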

In [None]:
# YOUR CODE HERE
weather_sdge = ...
weather_pge = ...
weather_sce = ...

In [None]:
weather_sdge.head()

In [None]:
weather_pge.head()

In [None]:
weather_sce.head()

**Question 2.4**: Merge each utility's fire incident and weather data and save the merged dataframes as `sdge_fireweather`, `pge_fireweather`, and `sce_fireweather`. The data should be merged on the date fields. 
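
The choice of merge type determines which rows survive. The sketch below compares two values of the `how` parameter on toy tables; the column names and dates are made up.

```python
import pandas as pd

fires = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-05"]),
    "fire_id": [1, 2, 3],
})
weather_toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
    "tmax": [60, 62, 58],
})

# inner: keep only dates present in both tables.
inner = fires.merge(weather_toy, on="date", how="inner")

# left: keep every row of fires; missing weather values become NaN.
left = fires.merge(weather_toy, on="date", how="left")

print(len(inner), len(left))
```

The fire on 2015-01-05 has no matching weather row, so it disappears from the inner merge but stays (with a missing `tmax`) in the left merge.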

In [None]:
# YOUR CODE HERE
pge_fireweather = ...
sce_fireweather = ...
sdge_fireweather = ...

**Question 2.5:** What type of merge did you use in Question 2.4 (inner, outer, left, right)? Why did you choose this type?

*YOUR ANSWER HERE*

In [None]:
pge_fireweather.head()

In [None]:
sce_fireweather.head()

In [None]:
sdge_fireweather.head()

**Question 2.6:** Compare the number of records in the merged dataframes to the original pge, sce, and sdge dataframes. Did you lose any records when you performed the merge? Why or why not?

In [None]:
#scratch work here

*YOUR ANSWER HERE*

Before combining data from all three IOUs, we'll run the following `assert` statements to make sure that the column names are the same across the three dataframes.

*Note*: because the reporting is standardized for these IOUs, and because of some Excel cleaning that was done beforehand, the column names should match up. But if you're working with a dataset where column names need to be changed, you can use the [`rename` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) or a value assignment (e.g., `sdge_fireweather.columns = sce_fireweather.columns` would set the columns of `sdge_fireweather` to be the same as those in `sce_fireweather`, as long as you were certain that the columns represented the same values).

In [None]:
assert all(pge_fireweather.columns == sce_fireweather.columns)
assert all(sce_fireweather.columns == sdge_fireweather.columns)

Now, we can use [`concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat) to combine all three dataframes into one, called `alliou`. Run the cell below to combine the three IOU dataframes. We want our new dataframe `alliou` to renumber the indices (otherwise we'd have three rows with row index = 0, three rows with row index = 1, etc.). Check the [`concat()` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat) and make sure that we've set the appropriate argument to achieve this.
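
To see what the renumbering does, the sketch below concatenates two toy DataFrames with and without `ignore_index`.

```python
import pandas as pd

a = pd.DataFrame({"v": [1, 2]})
b = pd.DataFrame({"v": [3, 4]})

# Default: each piece keeps its own index, so labels repeat.
kept = pd.concat([a, b])

# ignore_index=True: the result gets a fresh 0..n-1 index.
renumbered = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Repeated index labels make `.loc` lookups ambiguous, which is why a fresh index is usually preferable after stacking tables.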

In [None]:
alliou = pd.concat([pge_fireweather,sce_fireweather,sdge_fireweather], ignore_index = True)

In [None]:
alliou.head()

In [None]:
assert alliou.shape[0] > 1000
assert all(iou in alliou["Utility Name"].unique() for iou in ["PG&E", "SCE", "SDG&E"])

## Section 3: EDA<a id='eda'></a>

**Question 3.1:** 
Analyze the `alliou` table and see what data types are within the table. 
<br>What is the:
1. structure of the data?<br>
2. granularity of the data?<br>
3. scope of the data?<br>
4. temporality of the data?<br>
5. faithfulness of the data?<br>

Some questions to ask yourself:
* Structure - What was the format or file type of the imported data? Are there any differences in data types between the individual IOU dataframes, the weather dataframe, and the combined dataframe?
* Granularity - What does each row of data represent? Do any of the fields represent aggregated data (data that is reduced or summarized in some way)? What's the resolution in time (e.g., hourly, monthly) of the data?
* Scope - You can think of scope in different dimensions, but geographic scope and temporal scope are good places to start.
* Temporality - What do the dates and times represent?
* Faithfulness - Where do the data come from? Is there any reason to question it? Where do you find null values? How have the manipulations we have conducted in this notebook revealed (and impacted) faithfulness?

Please make a couple of observations for each category (structure, granularity, etc.). The [NOAA's Daily Summary Documentation](https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf) might be a helpful resource.

*YOUR ANSWER HERE*

**Question 3.2**: To get a basic estimate of weather conditions on the day of the fire incident, we took daily summaries from one weather station in the service area of each IOU. This approach isn't particularly granular - the IOU datasets provide more detail both in terms of geography and time than the weather data that we are using. Let's say you wanted to refine this approach to more effectively uncover the weather conditions in the location and at the time of the fire incident. In a few sentences, qualitatively describe an alternative approach. <br>

You don't have to specify any code or functions, but you should reference which columns you would use (either in the IOU or weather datasets) and which datasets you would use - you can take a look at [available NOAA data here](https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets).

*YOUR ANSWER HERE*

**Question 3.3:** What are the unique `Size` categories in the `alliou` table? Are there any redundancies in how the fire sizes are bucketed?

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

----
## Section 4: Exploring data through tables and visuals<a id='tables_plots'></a>

In this section, we'll do some data cleaning with the objective of exploring the fire incident data.


**Question 4.1**: Use `pd.value_counts()` to get the number of reported fire incidents by utility. What do you notice about the relative number of reports by utility? What are some factors that could explain the differences in number of reports, particularly between PG&E and SCE?

In [None]:
pd.value_counts(...) # YOUR CODE HERE

*YOUR ANSWER HERE*

**Question 4.2:** Create a column called `Size_clean` that contains cleaned values from the `Size` column of the `alliou` dataframe, renamed to address any redundancies. The resulting column should have 9 unique values.
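
One possible way to collapse redundant labels is `Series.replace` with a dictionary. The sketch below uses made-up labels, not the actual `Size` categories, and is only one of several reasonable approaches.

```python
import pandas as pd

# Hypothetical labels in which three spellings mean the same thing.
labels = pd.Series(["small", "Small", "sm.", "large"])

# Map each redundant spelling to a single canonical label.
cleaned = labels.replace({"Small": "small", "sm.": "small"})
print(cleaned.unique())
```

Values that are not keys in the dictionary are left unchanged.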

In [None]:
# Copy the column under a new name so we retain the original. The uncleaned column can be deleted later if you'd like,
# but this way we avoid any irreversible edits.
alliou["Size_clean"] = alliou["Size"]

# YOUR CODE HERE to clean size column
...

# Check your results
alliou["Size_clean"].unique()

In [None]:
assert len(alliou["Size_clean"].unique()) == 9

**Question 4.3:** Create a bar plot of how often each fire size category appears in the `alliou` dataframe. Use the function `pd.value_counts()` and the method `.plot` on the data frame. Give your plot a title and a y-axis label. Which fire sizes come up the most frequently in the dataset?

In [None]:
# YOUR CODE HERE
pd.value_counts(...).plot.bar() 
plt.title(...)
plt.ylabel(...);

*YOUR OBSERVATIONS HERE*

Let's use the pandas datetime functionality to add a column called `day_of_year` to `alliou`.

In [None]:
# Run this cell
alliou['day_of_year'] = alliou['Fire Start Date'].dt.dayofyear
alliou.head()

**Question 4.4:** Plot a histogram (using `plt.hist()`) of the `day_of_year` column. Set the number of `bins` to a meaningful value. Title and label your graph. What might you infer about the seasonality of fire ignitions across all IOUs?

In [None]:
plt.hist(...)
plt.title(...)
plt.xlabel(...)
plt.ylabel(...)
;

*YOUR OBSERVATIONS HERE*

**Question 4.5**: Examine the "Was There an Outage" column. Perform any necessary data cleaning operations, then use `pd.pivot_table` to create a table showing the number of incidents in each service territory that led to an outage. Your table should have two rows ("no" and "yes") indicating whether there was an outage, and three columns ("PG&E", "SCE", and "SDG&E") representing the three utilities. (Hint: to return a table with only three columns, you'll need to specify a column for the `values` parameter in the `pivot_table`.)
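
The sketch below shows how `index`, `columns`, `values`, and `aggfunc` fit together in `pd.pivot_table`. The toy column names are hypothetical; any column without missing values can serve as `values` when counting.

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "flag": ["yes", "no", "yes", "yes", "no"],
    "record_id": [1, 2, 3, 4, 5],
})

# Rows come from "flag", columns from "group", and each cell counts the
# non-missing "record_id" entries for that combination.
table = pd.pivot_table(toy, index="flag", columns="group",
                       values="record_id", aggfunc="count")
print(table)
```

Without `values`, pandas would repeat the count for every remaining column, producing more columns than intended.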

In [None]:
# YOUR CODE HERE

<br>

----

## Section 5. Summarizing data<a id='summarize'></a>

One of the CPUC's goals when collecting this data is to identify operational and environmental trends related to fire incidents, with the objective of improving regulations and internal standards for utilities. In this section, you'll create two new dataframes: one that summarizes fire incident data by material at origin, and another that summarizes weather data by year. In the process, you'll gain more experience with using `.groupby()`, along with summarizing data that is non-numerical or doesn't lend itself as well to `.groupby()`.
<br>

**Question 5.1:** Define a new dataframe, `alliou_matl`, that contains a single column with every unique value for "Material at Origin".

In [None]:
alliou_matl = pd.DataFrame()
alliou_matl["Material at Origin"] = ... # YOUR CODE HERE

In [None]:
alliou_matl

**Question 5.2:** The first set of values that we want to add to the dataframe is a count of the total number of fire incidents associated with each material type. Start by using `groupby().size()` to get a count of records for each material and save it to variable `counts`.
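
As a quick illustration, `groupby().size()` counts rows per group, including rows with missing values, whereas `.count()` skips missing entries. The toy data below are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "material": ["wood", "grass", "wood", "wood"],
    "acres": [1.0, 2.0, None, 0.5],
})

# size(): number of rows in each group, regardless of missing values.
sizes = toy.groupby("material").size()

# count(): number of non-missing "acres" values in each group.
non_missing = toy.groupby("material")["acres"].count()

print(sizes)
print(non_missing)
```

The result of `size()` is a Series indexed by the group labels, sorted alphabetically by default.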

In [None]:
counts = ... # YOUR CODE HERE

In [None]:
counts

**Question 5.3**: Now we want to put the values from `counts` into a new column in dataframe `alliou_matl`. Do this below, making sure the right values from `counts` map to the correct material types. The resulting `alliou_matl` dataframe should have two columns, one for material and one for the count of fire incidents.<br>
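
One way to guarantee that values line up by label rather than by position is `Series.map`. This is a sketch on toy data, assuming the summary table lists labels in a different order than the grouped counts.

```python
import pandas as pd

# Labels in order of first appearance (as .unique() would return them).
summary = pd.DataFrame({"material": ["wood", "grass"]})

# Counts indexed by label, in alphabetical order (as groupby returns them).
sizes = pd.Series({"grass": 1, "wood": 3})

# map looks up each material in the index of sizes, so order does not matter.
summary["count"] = summary["material"].map(sizes)
print(summary)
```

Assigning `sizes.values` directly would have paired "wood" with 1 and "grass" with 3, which is the kind of silent mismatch to watch for.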

In [None]:
# YOUR CODE HERE

In [None]:
alliou_matl

**Question 5.4** Next, we want to find out what percentage of fire incidents involving each material are associated with outages. Add a column called "% Outage" to `alliou_matl` that provides this value. There are lots of ways to approach finding the percentage of fire incidents associated with outages per material type, but some helpful functions might be `groupby()` and `np.divide()`.

In [None]:
#YOUR CODE HERE

...

alliou_matl["% Outage"] = ...

alliou_matl

**Question 5.5**: You're working for the CPUC, and as you're exploring the fire incident data a colleague notices that almost 85% of fire incidents involving buildings are associated with outages. Your colleague concludes that the focus of the commission should be to work with utilities to inspect and retrofit facilities in the utility territory. Do you agree with your colleague? Why or why not? Is there any additional data that you would want to collect before deciding where to focus maintenance review efforts? <br>

*YOUR ANSWER HERE*

**Question 5.6** We'd also like to explore annual weather trends. To start off, create a new column in `alliou` called "Fire Start Year" that includes the year of the fire incident (the `.dt.year` accessor is helpful here).

In [None]:
# YOUR CODE HERE

alliou["Fire Start Year"] = ...

In [None]:
alliou.head()

In [None]:
assert alliou.shape[1] == 33

**Question 5.7** Use `.groupby()` to create a dataframe called `alliou_year` that shows *average* weather data values for each year and utility. To do so, you'll need to give `.groupby()` two arguments in the form of a list.<br>

*Note*: You'll notice that the dataframe `alliou_year` will only provide grouped data for the weather-related variables (and the day_of_year column we added earlier), since none of the variables in the IOU dataset are stored as numbers (and so we can't calculate their mean).

In [None]:
alliou_year = ... #YOUR CODE HERE

In [None]:
alliou_year

**Question 5.8** Define a function `temp_range()` that uses `alliou_year` and takes as input the year (as an integer) and the utility name (as a string) and returns the difference between the average maximum and average minimum temperature for that service area and year, rounded to two decimal places. Check out the [MultiIndex documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index) for more information on how to use `.loc[]` to access the values you want from `alliou_year`.
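
Grouping by two columns produces a MultiIndex, and `.loc[]` accepts a tuple with one entry per index level. The sketch below uses hypothetical column names and values.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "region": ["north", "south", "north", "south"],
    "high": [30.0, 35.0, 31.0, 36.0],
})

# Grouping by a list of two columns gives a two-level (year, region) index.
by_group = toy.groupby(["year", "region"]).mean()

# Select one row with a (year, region) tuple, then one column by name.
value = by_group.loc[(2021, "south"), "high"]
print(value)
```

The order of the tuple must match the order of the columns passed to `.groupby()`.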

In [None]:
def temp_range(year, utility):
    """
    Calculate the difference between the average maximum and average minimum value for a given utility's land-based temperature in a certain year.
    
    Args:
        year, an integer (acceptable values are 2014, 2015, or 2016)
        utility, a string representing the utility (acceptable values are "PG&E", "SCE", and "SDG&E")
        
    Returns:
        The difference between average maximum and minimum value of the temperature, rounded to two decimal places (float)
    """
    
    # YOUR CODE HERE


In [None]:
print(temp_range(2014, "SCE"))
print(temp_range(2016, "PG&E"))

----

## Submission

Congrats, you're done with homework 3!

Before you submit, click **Kernel** --> **Restart & Clear Output**. Then, click **Cell** --> **Run All**. Finally, go to the toolbar, click **File** --> **Download as** --> **.html**, and submit **both the .html and .ipynb files through bCourses**.

----

## Bibliography

- CPUC Fire Incident Data Collection: https://www.cpuc.ca.gov/fireincidentsdata/
- NOAA Daily Summary Documentation: https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf